Artificial Intelligence: An Experiment in Animal Self-Classification
The purpose of this experiment is to see whether an exploratory data
mining program can recognize degrees of similarity among about 60 species of
animals, from honeybees and lions to starfish and parrots.
The software used is from www.viscovery.net. Some uses of this program
are: (a) clustering customer data according to behaviour; (b) gene data
analytics, to identify biomarkers for the diagnosis of diseases; (c) fraud
profiling and forensic applications.
The data is from the University of California Irvine data archive (archive.ics.uci.edu/ml).
The ‘properties’ (attributes) considered were: backbone, legs, fins,
lungs, hair, feathers, eggs, milk, tail, domestic, aquatic, airborne, venomous,
predator. Presence of an attribute is denoted by 1 and absence by 0, except
for legs, which takes a numerical value (2, 4, 6 or 8 legs).
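To make the encoding concrete, here is a minimal sketch of how one animal could be turned into a row of numbers. The `encode` helper and the two example animals are my own illustration (the values are common-knowledge guesses, not rows copied from the dataset file), and the attribute order simply follows the list above.

```python
# Attribute order follows the list in the text; the helper name is hypothetical.
ATTRIBUTES = [
    "backbone", "legs", "fins", "lungs", "hair", "feathers", "eggs",
    "milk", "tail", "domestic", "aquatic", "airborne", "venomous", "predator",
]

def encode(present, legs):
    """Return one number per attribute: 1/0 flags, plus the raw leg count."""
    row = []
    for name in ATTRIBUTES:
        if name == "legs":
            row.append(legs)  # the only non-binary attribute
        else:
            row.append(1 if name in present else 0)
    return row

# Illustrative animals only; the real rows come from the dataset file.
lion = encode({"backbone", "lungs", "hair", "milk", "tail", "predator"}, legs=4)
honeybee = encode({"eggs", "airborne", "venomous"}, legs=6)

print(lion)
print(honeybee)
```

Note that the leg count has a wider range than the 0/1 flags, so it can dominate a distance calculation unless it is rescaled. Whether the tool rescales it internally is not something I checked.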
The tool was made to learn the attributes of all the animals and group them
into clusters, each cluster containing animals with some degree of overall
similarity across all attributes. The output of the tool is a self-organizing
map where each cluster is denoted by a different colour. Within each cluster,
two animals can be considered similar if they lie close to each other. The
distance of an animal from the centre of its cluster is a measure of how
strongly it belongs to the cluster. Since we know the real types of the animals
classified, the clustering found by the tool is good if animals of the same
type are grouped together and different types end up in different clusters.
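For readers curious about what happens inside such a tool, below is a minimal sketch of a generic self-organizing map in NumPy. It is not the Viscovery algorithm, which I have not seen; the grid size, learning rate, neighbourhood radius and the toy data are all assumptions chosen for illustration.

```python
import numpy as np

def best_matching_unit(weights, x):
    """Grid position (row, col) of the node whose weight vector is closest to x."""
    dist = np.linalg.norm(weights - x, axis=-1)
    return np.unravel_index(np.argmin(dist), dist.shape)

def train_som(data, rows=4, cols=4, epochs=200, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small rectangular SOM; returns weights of shape (rows, cols, n_features)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    weights = rng.random((rows, cols, data.shape[1]))
    # Grid coordinates of every node, used for neighbourhood distances.
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)              # learning rate decays linearly
        sigma = sigma0 * (1.0 - frac) + 0.5  # neighbourhood radius shrinks toward 0.5
        for x in data[rng.permutation(len(data))]:
            bmu = best_matching_unit(weights, x)
            # Gaussian neighbourhood centred on the best-matching unit.
            d2 = np.sum((grid - np.array(bmu)) ** 2, axis=-1)
            h = np.exp(-d2 / (2.0 * sigma ** 2))
            # Pull every node toward x, nodes near the BMU most strongly.
            weights += lr * h[..., None] * (x - weights)
    return weights

def quantisation_error(weights, data):
    """Mean distance from each sample to its closest node."""
    return float(np.mean([np.min(np.linalg.norm(weights - x, axis=-1)) for x in data]))

# Toy binary attribute vectors (illustrative only, not the animal data).
data = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
], dtype=float)

weights = train_som(data)
for sample in data:
    print(sample, "->", best_matching_unit(weights, sample))
```

Each sample is mapped to the grid node that resembles it most, so samples with similar attributes tend to land on the same or neighbouring nodes. That is the sense in which closeness on the map can be read as similarity.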
With these comments in mind, let’s take a look at the results:
a. There are 7 clusters (C1, C2, ... C7).
b. The classification is generally accurate: e.g. every fish is in C3. All birds are in C1 (except Tortoise, which is in the wrong cluster).
c. Note that C5 has all the venomous creatures, except Frog, where the clustering is wrong.
e. C7 has all the domestic animals. Reindeer is classified as a domestic animal. I am not sure why, though I know people in Finland keep them like cattle.
f. Strangely, ‘Girl’ is in the same C7 cluster as the domestic animals, right next to ‘Pussycat’.
g. In C2, you can see that lion, leopard and cheetah are very close together, which is a fair reflection of what we perceive in real life.
h. Also in C2, you can see that the flying mammals (fruit bat and vampire bat) are in the same cluster as the other mammals but far away from the animals which typify our perception of ‘mammal’.
i. In C1, the bird cluster, you can see that gull, skimmer and skua (all seabirds) are very close together, though I never told the program that they are seabirds, nor is their diet specified.
j. And hey, Ladybird (which is an insect) is on the border between the insect cluster (C4) and the bird cluster (C1)!
k. Quite interesting that Starfish is in the same cluster as Crayfish, Crab, Lobster and Clam.
l. Fruit bat and vampire bat are in C2 on the border between birds and mammals, which is quite good since they are flying mammals.
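Remarks such as "every fish is in C3" can be put into numbers by computing, for each cluster, which true type is in the majority and what fraction of the cluster it makes up (often called purity). The sketch below assumes the results are available as (cluster, true type) pairs; the function name and the sample pairs are hypothetical and are not the actual output of the experiment.

```python
from collections import Counter, defaultdict

def cluster_purity(assignments):
    """Map each cluster to (majority_type, fraction of members with that type)."""
    by_cluster = defaultdict(list)
    for cluster, true_type in assignments:
        by_cluster[cluster].append(true_type)
    result = {}
    for cluster, types in by_cluster.items():
        majority, count = Counter(types).most_common(1)[0]
        result[cluster] = (majority, count / len(types))
    return result

# Hypothetical (cluster, true type) pairs, for illustration only.
sample = [
    ("C3", "fish"), ("C3", "fish"),
    ("C1", "bird"), ("C1", "bird"), ("C1", "bird"), ("C1", "reptile"),
]

for cluster, (majority, purity) in cluster_purity(sample).items():
    print(cluster, majority, purity)
```

A purity of 1.0 means the cluster holds a single type. Anything lower points at misplaced animals of the kind noted in the list above.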
Of what use is such a self-organizing map?
With the right data, the right ‘tuning’ and some additional post-processing,
it could be used for, e.g.:
1. Stock market arbitrage. The program could be presented with a list of attributes such as P/E, EPS, volatility, market capitalization, analyst rating, Sharpe ratio, beta, % price change, average volume, etc. (all pre-processed and normalized). Stocks that are temporarily out of line with their peers, despite having many similar characteristics, can then be bought or sold in anticipation of a reversion to the normal situation.
2. Medical conditions, diseases, and their symptoms can also be fed into such a network to estimate the likelihood of having or not having a disease. Of course this is not a simple task and requires information from medical specialists (whose input can also be fed into the program). Medical diagnosis is also not a black-or-white process but one with many shades of gray.
3. This tool is also suitable for classifying data with a mix of quantitative (measurable) and qualitative (non-measurable except as a ranking or grading) attributes, such as for different types of wine. On the one hand we may have numeric values such as acidity, price per x litres, and age. On the other hand we also have the wine-taster’s grading, the colour, the country of origin, etc. All these could be used as input to see whether the distances in our program’s classification match the differences in $ price.
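As a rough illustration of the first idea in the list above, the sketch below flags stocks whose value on one metric sits far from the average of the other stocks in the same cluster. It assumes the cluster labels have already been produced by the map; the function name, the tickers, the `pct_change` values and the 1.5 threshold are all made up for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev

def peer_outliers(stocks, metric, threshold=1.5):
    """Return tickers whose `metric` lies more than `threshold` standard
    deviations from the mean of their own cluster."""
    by_cluster = defaultdict(list)
    for stock in stocks:
        by_cluster[stock["cluster"]].append(stock)
    flagged = []
    for members in by_cluster.values():
        values = [s[metric] for s in members]
        mu, sd = mean(values), pstdev(values)
        if sd == 0:
            continue  # all peers identical, nothing stands out
        for s in members:
            if abs(s[metric] - mu) / sd > threshold:
                flagged.append(s["ticker"])
    return flagged

# Hypothetical tickers and values, for illustration only.
stocks = [
    {"ticker": "AAA", "cluster": "C1", "pct_change": 1.0},
    {"ticker": "BBB", "cluster": "C1", "pct_change": 1.2},
    {"ticker": "CCC", "cluster": "C1", "pct_change": 0.8},
    {"ticker": "DDD", "cluster": "C1", "pct_change": 1.1},
    {"ticker": "EEE", "cluster": "C1", "pct_change": -6.0},
    {"ticker": "FFF", "cluster": "C2", "pct_change": 2.0},
    {"ticker": "GGG", "cluster": "C2", "pct_change": 2.1},
    {"ticker": "HHH", "cluster": "C2", "pct_change": 1.9},
]

print(peer_outliers(stocks, "pct_change"))
```

Here only the stock that fell sharply while its cluster peers rose slightly is flagged. Whether such a gap actually closes again is the part the map cannot tell you.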