ICO4168
Wassner Hubert
wassner@esiea.fr
http://professeurs.esiea.fr/wassner/
Clever Hans: an old story that shows how difficult and surprising it can be to teach a trick to something.
Table of contents
Course presentation
    Prerequisites
    Summary
    Presented techniques
    Objectives
Introduction
    Foreword on the fractal approach
    Bionics (From Wikipedia, the free encyclopedia)
    Neuron (From Wikipedia, the free encyclopedia)
        Contents
        History
        Anatomy and histology
        Classes
        Connectivity
        Adaptations to carrying action potentials
        Challenges to the neuron doctrine
        Neurons in the brain
        See also
        Sources
        External links
    Artificial neuron (From Wikipedia, the free encyclopedia)
        Contents
        Basic structure
        History
        Types of transfer functions
        Bibliography
    Mathematical model and properties of Artificial Neural Networks
        Mathematical model
        Properties
    Training and using (simplified)
        Training a NN
        Using a NN
        Applications
Deeper insight
    Foreword on classification and prediction systems
        Threshold finding
        Receiver operating characteristic (From Wikipedia, the free encyclopedia)
        Testing precautions
        The art of splitting datasets
        Dataset cost
        Are we doing that good?
    Training, testing and using (the real stuff)
        Preparing the datasets
        Supervised learning
            Gradient descent (From Wikipedia, the free encyclopedia)
            Description of the method
            Comments
            See also
            Backpropagation (From Wikipedia, the free encyclopedia)
            When the problem is not that simple... Neural cooking
            Local optimum
            Over-fitting/Under-fitting
            Contents
            Clever Hans and Pfungst's study
            Clever Hans effect
            Reference
        Unsupervised learning
            K-means algorithm (From Wikipedia, the free encyclopedia)
            References
            Kohonen network
The wide field of neural networks
Course presentation
Prerequisites
Summary
Neural computing techniques are said to be bio-inspired/bio-mimetic: they are inspired by known biological phenomena. The goal is to exploit the learning and self-organization capabilities of living organisms on classic algorithmic problems.
Applications include: artificial intelligence, classification, identification, biometrics, prediction, data mining...
The course is twofold:
Presented techniques
Classification basics
Supervised learning
Unsupervised learning
Objectives
Students following this course will not become experts in the neural computing field. The goal of this course is to understand the basics of neural computing theory, to be able to identify problems that can be solved by these techniques, and to be able to implement such solutions with software libraries.
Introduction
Foreword on the fractal approach
This field of computer science can be quite hard to understand because many prerequisites are needed (such as mathematics). This makes it a field of choice for engineers to set themselves apart from technicians. To make this course easier I will use what I call the fractal approach. The idea is to always know the goal of every part of the course. This leads to a certain amount of redundancy and repetition, but it avoids the loss of motivation caused by fuzzy goals. A fractal is a type of object that has more or less the same appearance whatever the scale you are looking at. A shoreline is fractal: if you look at it at a large scale and then zoom in, you get the same kind of visual impression. This is why it is hard to know where you are when walking along a shore. The same problem can occur when walking linearly through such a course. So to avoid it, I will frequently recall the local and global objectives of each part of this course. This is the equivalent of frequently zooming in and out on the shore map to know precisely where you are on the shore/in the course.
That is why the introduction part and the deeper insight part seem to talk about exactly the same things and to have the same goals, but they are not on the same scale.
Contents
1 History
2 Anatomy and histology
3 Classes
4 Connectivity
5 Adaptations to carrying action
potentials
6 Histology and internal structure
7 Challenges to the neuron doctrine
8 Neurons in the brain
9 See also
10 Sources
11 External links
History
The concept of a neuron as the primary computational unit of the nervous
system was devised by the Spanish anatomist Santiago Ramón y Cajal in the
early 20th century. Cajal proposed that neurons were discrete cells which
communicated with each other via specialized junctions. This became known as
the Neuron Doctrine, one of the central tenets of modern neuroscience.
However, Cajal would not have been able to observe the structure of individual
neurons if his rival, Camillo Golgi, (for whom the Golgi Apparatus is named) had
not developed his silver staining method. When the Golgi Stain is applied to
neurons, it binds the cell's microtubules and gives stained cells a black outline
when light is shone through them.
The soma, or 'cell body', is the central part of the cell, where the nucleus
is located and where most protein synthesis occurs.
The axon is a finer, cable-like projection which can extend tens, hundreds,
or even tens of thousands of times the diameter of the soma in length. The
axon carries nerve signals away from the soma (and carries some types of information in the other direction also). Many neurons have only one axon,
but this axon may - and usually will - undergo extensive branching,
enabling communication with many target cells. The part of the axon
where it emerges from the soma is called the 'axon hillock'. Besides being
an anatomical structure, the axon hillock is also the part of the neuron
that has the greatest density of voltage-dependent sodium channels. Thus
it has the most hyperpolarized action potential threshold of any part of the
neuron. In other words, it is the most easily-excited part of the neuron,
and thus serves as the spike initiation zone for the axon. While the axon
and axon hillock are generally considered places of information outflow,
this region can receive input from other neurons as well.
The axon terminal is a specialized structure at the end of the axon that is used to release neurotransmitters and communicate with target neurons.
Although the canonical view of the neuron attributes dedicated functions to its
various anatomical components, dendrites and axons very often act contrary to
their so-called main function.
Axons and dendrites in the central nervous system are typically only about a
micrometer thick, while some in the peripheral nervous system are much
thicker. The soma is usually about 10–25 micrometers in diameter and often is
not much larger than the cell nucleus it contains. The longest axon of a human
motoneuron can be over a meter long, reaching from the base of the spine to
the toes, while giraffes have single axons running along the whole length of
their necks, several meters in length. Much of what we know about axonal
function comes from studying the squid giant axon, an ideal experimental
preparation because of its relatively immense size (0.5–1 millimeters thick,
several centimeters long).
Classes
Functional classification
Afferent neurons convey information from tissues and organs into the
central nervous system.
Efferent neurons transmit signals from the central nervous system to the
effector cells and are sometimes called motor neurons.
Interneurons connect neurons within specific regions of the central
nervous system.
Afferent and efferent can also refer to neurons which convey information from
one region of the brain to another.
Structural classification
Most neurons can be anatomically characterized as:
Connectivity
Neurons communicate with one another via synapses, where the axon terminal
of one cell impinges upon a dendrite or soma of another (or less commonly to an
axon). Neurons such as Purkinje cells in the cerebellum can have over 1000
dendritic branches, making connections with tens of thousands of other cells;
other neurons, such as the magnocellular neurons of the supraoptic nucleus,
have only one or two dendrites, each of which receives thousands of synapses.
Synapses can be excitatory or inhibitory and will either increase or decrease
activity in the target neuron. Some neurons also communicate via electrical
synapses, which are direct, electrically-conductive junctions between cells.
In a chemical synapse, the process of synaptic transmission is as follows: when
See also
Artificial neuron
Neural oscillations
Mirror neuron
Neuroscience
Neural network
Spindle neuron
Sources
External links
Contents
1 Basic structure
2 History
3 Types of transfer functions
3.1 Step function
3.2 Sigmoid
4 See also
5 Bibliography
Basic structure
For a given artificial neuron, let there be m inputs with signals x1 through xm
and weights w1 through wm.
The output of neuron k is:
yk = φ( Σj=1..m wkj xj )
where φ is the transfer function (also called the activation function).
The output propagates to the next layer (through a weighted synapse) or finally
exits the system as part or all of the output.
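To make this concrete, here is a minimal Python sketch of the computation described above (the function and variable names are mine, not from the text): the neuron takes the weighted sum of its inputs and passes it through a transfer function.

    import math

    def neuron_output(inputs, weights, transfer):
        # Weighted sum of the inputs, passed through the transfer function.
        activation = sum(w * x for w, x in zip(weights, inputs))
        return transfer(activation)

    # Example: a 3-input neuron with a sigmoid transfer function.
    sigmoid = lambda a: 1.0 / (1.0 + math.exp(-a))
    print(neuron_output([0.5, -1.0, 2.0], [0.1, 0.4, 0.3], sigmoid))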
History
The original artificial neuron is the Threshold Logic Unit first proposed by
Warren McCulloch and Walter Pitts in 1943. As a transfer function, it employs a
threshold or step function taking on the values 1 or 0 only.
The output y of this transfer function is binary, depending on whether the input meets a specified threshold θ. The "signal" is sent, i.e. the output is set to one, if the activation meets the threshold.
Sigmoid
A fairly simple non-linear function, the sigmoid also has an easily calculated
derivative, which is used when calculating the weight updates in the network. It
thus makes the network more easily manipulable mathematically, and was
attractive to early computer scientists who needed to minimise the
computational load of their simulations.
See: Sigmoid function
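As a small illustration of the two transfer functions mentioned above, here is a Python sketch (the threshold value is an arbitrary example, and the logistic form of the sigmoid is one common choice, not prescribed by the text). Note how the sigmoid derivative can be computed directly from the output value, which is what makes weight updates cheap.

    import math

    def step(activation, threshold=0.0):
        # Threshold Logic Unit transfer function: takes the values 1 or 0 only.
        return 1 if activation >= threshold else 0

    def sigmoid(activation):
        # Logistic sigmoid: a smooth, non-linear transfer function.
        return 1.0 / (1.0 + math.exp(-activation))

    def sigmoid_derivative(output):
        # Derivative expressed from the output y = sigmoid(a): y * (1 - y).
        return output * (1.0 - output)

    print(step(0.3), sigmoid(0.3), sigmoid_derivative(sigmoid(0.3)))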
Bibliography
Mathematical model
The output of a single neuron is s( Σi wi xi ): the weighted sum of its inputs, passed through the transfer function s.
If output is around 0.5 then you know that the decision will be uncertain.
One neuron can't do much, but several neurons (with an appropriate learning phase) can do a lot more...
Using two neurons (on the same layer) can produce more complex decision landscapes. Example:
[Figure: decision landscape obtained by combining the outputs s( Σi wi xi ) of two neurons on the same layer]
Properties
Universal function approximation
A XOR B
The input patterns (1,0) and (1,1) are close, but the corresponding outputs (1 and 0) are not. This makes XOR a standard hard learning dataset.
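A single threshold neuron cannot separate these patterns, but a small two-layer network can. Below is a minimal Python sketch with hand-chosen weights (one classical choice among many; the values are illustrative, not taken from the course).

    def step(activation, threshold):
        return 1 if activation >= threshold else 0

    def xor_net(x1, x2):
        # Hidden layer: one neuron computes OR, the other computes AND.
        h_or = step(1.0 * x1 + 1.0 * x2, 0.5)
        h_and = step(1.0 * x1 + 1.0 * x2, 1.5)
        # Output neuron: "OR and not AND", which is exactly XOR.
        return step(1.0 * h_or - 1.0 * h_and, 0.5)

    for pattern in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(pattern, "->", xor_net(*pattern))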
Note: the fact that neural networks are universal function approximators does not imply that they can solve any classification problem.
For example, this dataset shows a problem that cannot be totally solved:
The red (+) and green (x) classes overlap each other (when considering only the two parameters shown here), so the neural network cannot achieve 100% correct classification. The overlapping regions will lead to uncertain answers/outputs. If the network is properly trained, the output can reflect a measure of the uncertainty (a probability approximation), which is still important information even if you cannot make a clear-cut decision.
Generalisation
This property comes from the fact that we are approximating continuous functions with a composition of continuous functions.
Robustness
Using a NN
Using a NN simply means propagating the input values towards the output; it is generally a low-CPU-consuming stage (just a few function evaluations). This is one of the advantages of NNs: using them is quite a small task. Training might need some time and CPU power, but using them is quick and easy.
Some post-processing is sometimes needed in real-life applications: a typical classification problem will need a final decision, whereas the NN only outputs a numerical vector (in [-1,1]). Classical algorithms are then used to find the best-fitting class, and the distance to the second-best class can be a good confidence index.
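A minimal Python sketch of such post-processing (the names and the example values are invented for illustration): take the class with the highest output, and use the margin to the second-best output as a confidence index.

    def decide(outputs, class_names):
        # outputs: NN output vector in [-1, 1], one value per class.
        ranked = sorted(zip(outputs, class_names), reverse=True)
        (best_score, best_class), (second_score, _) = ranked[0], ranked[1]
        confidence = best_score - second_score  # margin to the second-best class
        return best_class, confidence

    print(decide([0.82, -0.45, 0.67], ["cat", "dog", "rabbit"]))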
NN is rarely the complete solution to a problem, but it can be a decisive
part of a lot of solutions.
Example in computer gaming: making a good AI gamer/bot for action games is hard and complex because no one can really express mathematically what a good strategy is, since everything depends on the human opponent's gaming style. Directly training a NN to decide each action is a far too complex task, because you would need to be able to label each action/decision as good or not good (since we are doing supervised training). The quality of each decision (leading to winning or losing the game) is simply inaccessible.
Another way to look at making a good AI bot is to build a NN that simply predicts the position and/or action of the human opponent. This is far easier, because you have this information (a simple game recording is enough). You may then feed the information predicted by the NN to an expert system. Programming a decent AI bot with an expert system that has information on the next moves of the human opponent is rather simple...
Here is an interesting demonstration of function approximation:
http://neuron.eng.wayne.edu/bpFunctionApprox/bpFunctionApprox.html
Applications
There is a very wide variety of applications of neural techniques, and there are certainly even more to discover. Here is a non-exhaustive list:
gaming
Deeper insight
Foreword on classification and prediction systems
The field is quite large, so it is very important to use the correct words to identify each kind of application.
Classification is the process of attributing a class to an unknown entry. The system measures features of the entry and then decides to which class it belongs.
A classical example is biometrics: a person is measured through a biometric system (voice sample, fingerprint, iris, ...); these measures are then compared to stored models and the program decides whether to grant access or not. This particular case of classification is called verification: the system verifies the claim of the user. The user claims an identity, the system verifies it. There are two classes, positive recognition and negative recognition, which makes it a two-class problem. For real-life applications it is strongly advised to extend it to, at least, a three-class problem:
access granted
access refused
When the user knows that he is being measured and wants to be: a collaborative context.
Imagine you are building a biometric system to access banking data: the cost of a false positive error is not the same as that of a false negative error. In the first case you give access to a person who should not have it. In the second case you deny access to a genuine user. Remember that training algorithms are more or less optimisation problems; it is therefore very important to adapt the cost in the training function to handle these different kinds of errors...
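As a sketch of this idea (the cost values are invented for illustration), one can weight each example's error by the cost of the corresponding kind of mistake before summing, so that the optimisation pays more attention to the expensive errors:

    def weighted_error(targets, outputs, false_positive_cost=10.0, false_negative_cost=1.0):
        # Squared error where each mistake is weighted by its business cost.
        # targets/outputs in [0, 1]; target 0 = impostor, target 1 = genuine user.
        total = 0.0
        for target, output in zip(targets, outputs):
            cost = false_positive_cost if target == 0 else false_negative_cost
            total += cost * (target - output) ** 2
        return total

    print(weighted_error(targets=[0, 1, 1], outputs=[0.7, 0.9, 0.2]))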
Extending this concept to N > 2 class classification problems leads to the concept of the confusion matrix.
A confusion matrix is a visualization tool typically used in supervised learning
(in unsupervised learning it is typically called a matching matrix). Each
column of the matrix represents the instances in a predicted class, while each
row represents the instances in an actual class. One benefit of a confusion
matrix is that it is easy to see if the system is confusing two classes (i.e. commonly mislabelling one as another).
In the example confusion matrix below, of the 8 actual cats, the system
predicted that three were dogs, and of the six dogs it predicted that one was a
rabbit and two were cats. We can see from the matrix that the system in
question has trouble distinguishing between cats and dogs, but can make the
distinction between rabbits and other types of animals pretty well.
Example of confusion matrix (rows: actual class; columns: predicted class)

           Cat   Dog   Rabbit
Cat         5     3      0
Dog         2     3      1
Rabbit      0     2     11
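A minimal Python sketch of how such a matrix can be built from (actual, predicted) pairs; the toy data below reproduces only the cat and dog rows described in the text above:

    from collections import Counter

    def confusion_matrix(actual, predicted, classes):
        counts = Counter(zip(actual, predicted))
        # Rows correspond to the actual class, columns to the predicted class.
        return [[counts[(a, p)] for p in classes] for a in classes]

    actual = ["cat"] * 8 + ["dog"] * 6
    predicted = ["cat"] * 5 + ["dog"] * 3 + ["cat", "cat", "dog", "dog", "dog", "rabbit"]
    for row in confusion_matrix(actual, predicted, ["cat", "dog", "rabbit"]):
        print(row)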
Threshold finding
A NN outputs fuzzy values, but a lot of real-life problems need discrete answers: yes/no and nothing else. The basic way to get there is to threshold the NN output. All the difficulty is in the choice of that threshold. Let us see the process for a two-class classification problem. One common way to choose the threshold is to draw the score histogram of each class on the same graph.
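A minimal Python sketch of the threshold choice (here I simply scan candidate thresholds between the smallest and largest observed score and keep the one making the fewest errors; the scores are toy values):

    def choose_threshold(negative_scores, positive_scores, steps=100):
        # Scan candidate thresholds and keep the one with the fewest errors.
        lo = min(negative_scores + positive_scores)
        hi = max(negative_scores + positive_scores)
        best_threshold, best_errors = lo, float("inf")
        for i in range(steps + 1):
            t = lo + (hi - lo) * i / steps
            errors = sum(s >= t for s in negative_scores)   # false positives
            errors += sum(s < t for s in positive_scores)   # false negatives
            if errors < best_errors:
                best_threshold, best_errors = t, errors
        return best_threshold

    print(choose_threshold([0.1, 0.2, 0.35, 0.4], [0.3, 0.6, 0.7, 0.9]))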
Receiver operating characteristic (From Wikipedia, the free encyclopedia)
A ROC curve is often summarised by a single scalar value; common summary statistics include:
the intercept of the ROC curve with the line at 90 degrees to the no-discrimination line
the area between the ROC curve and the no-discrimination line
the area under the ROC curve, often called AUC (see the sketch after this list)
d' (pronounced "d-prime"), the distance between the mean of the distribution of activity in the system under noise-alone conditions and its distribution under signal-plus-noise conditions, divided by their standard deviation, under the assumption that both these distributions are normal with the same standard deviation. Under these assumptions, it can be proved that the shape of the ROC depends only on d'.
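As announced above, here is a small Python sketch of how the ROC points and the AUC can be computed by sweeping the decision threshold over the observed scores (the trapezoid integration and the toy scores are my choices):

    def roc_points(negative_scores, positive_scores):
        # One (false positive rate, true positive rate) point per candidate threshold.
        thresholds = sorted(set(negative_scores + positive_scores), reverse=True)
        points = [(0.0, 0.0)]
        for t in thresholds:
            fpr = sum(s >= t for s in negative_scores) / len(negative_scores)
            tpr = sum(s >= t for s in positive_scores) / len(positive_scores)
            points.append((fpr, tpr))
        points.append((1.0, 1.0))
        return points

    def auc(points):
        # Area under the ROC curve, by the trapezoid rule.
        return sum((x2 - x1) * (y1 + y2) / 2.0
                   for (x1, y1), (x2, y2) in zip(points, points[1:]))

    pts = roc_points([0.1, 0.2, 0.35, 0.4], [0.3, 0.6, 0.7, 0.9])
    print(auc(pts))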
http://www.rad.jhmi.edu/jeng/javarad/roc/main.html
Testing precautions
The above testing definitions and statistics must be computed on correct datasets. Since we are using the training data to estimate the model parameters, the network will behave too well on that same dataset. It is very important to test the network on data that it has never seen before. It is the only way to ensure that we measure the real generalisation capability of the neural network.
Dataset cost
It may sound strange, but training datasets are often quite small and proprietary. This is simply because they are expensive to build. They are expensive because most of the time the output (called the label or annotation) is determined by an expert, and you need a big dataset to properly train and test your ANN. (Note: this is true for any other statistical model, not only ANNs.)
It is very important to estimate the cost of dataset collection and annotation in this kind of project.
When little data is available, one option is the jack-knife technique. Jackknifed statistics are created by systematically dropping out subsets of the data one at a time and assessing the resulting variation in the studied parameter.
Jack-knife technique:
[Diagram: the whole dataset is split into n parts; in each round a different part is used for testing while the model is trained on the remaining parts. This gives n trainings, each on (n-1) examples, and n tests, each on a different (but similar) trained model.]
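A minimal Python sketch of that loop, in its leave-one-out form (the train and test functions are placeholders for whatever model and error measure you actually use):

    def jackknife(dataset, train, test):
        # n trainings, each on (n-1) examples; n tests, each on the left-out example.
        scores = []
        for i in range(len(dataset)):
            held_out = dataset[i]
            training_set = dataset[:i] + dataset[i + 1:]
            model = train(training_set)
            scores.append(test(model, held_out))
        return sum(scores) / len(scores)

    # Toy example: the "model" is just the mean of the training targets.
    data = [(1, 2.0), (2, 2.5), (3, 3.5), (4, 4.0)]
    train = lambda examples: sum(y for _, y in examples) / len(examples)
    test = lambda model, example: abs(model - example[1])  # error on the held-out point
    print(jackknife(data, train, test))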
In some cases the annotation process is the most expensive part. Bootstrap learning may take advantage of this situation: one can train a NN with the labelled data and use the trained network to ease the annotation of more data. Doing so incrementally grows the annotated database and raises the NN accuracy.
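A sketch of that bootstrap loop in Python (the train, predict and expert functions are placeholders, and the confidence threshold is an arbitrary illustrative value): the network labels new data itself, the expensive expert is only asked about uncertain cases, and the enlarged dataset is used to retrain.

    def bootstrap_annotation(labelled, unlabelled, train, predict, expert, confidence=0.9):
        # Incrementally grow the annotated dataset using the network's own predictions.
        model = train(labelled)
        for x in unlabelled:
            label, score = predict(model, x)
            if score < confidence:          # uncertain case: ask the (expensive) expert
                label = expert(x)
            labelled.append((x, label))
        return train(labelled)              # retrain on the enlarged dataset

    # Toy demo: the "model" is simply the majority label seen so far.
    def train(examples):
        labels = [label for _, label in examples]
        return max(set(labels), key=labels.count)

    predict = lambda model, x: (model, 0.95)   # always "confident" here, so the expert is never called
    expert = lambda x: "positive"
    print(bootstrap_annotation([(1, "positive"), (2, "negative"), (3, "positive")],
                               [4, 5, 6], train, predict, expert))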
Preparing the datasets
The solution is to study the statistics of every input variable in order to find a good scaling function. Two main solutions here:
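Two classical candidates (my assumption, since the list itself is missing from the text) are min-max scaling and standardization (zero mean, unit variance). A minimal Python sketch of both:

    def min_max_scale(values):
        # Rescale to [0, 1] using the observed minimum and maximum.
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]

    def standardize(values):
        # Center on the mean and divide by the standard deviation.
        mean = sum(values) / len(values)
        std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
        return [(v - mean) / std for v in values]

    raw = [120.0, 150.0, 90.0, 300.0]
    print(min_max_scale(raw))
    print(standardize(raw))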
Supervised learning
Training a forward multi-layer neural network means finding the parameter values Wi for which the error is minimal over a given dataset. This is an optimisation problem. A common way to solve this kind of problem is gradient-based
methods. The main idea is to use the information given by the derivative of the error function.
Gradient descent (From Wikipedia, the free encyclopedia)
Description of the method
If the real-valued function F(x) is defined and differentiable in a neighbourhood of a point a, then F increases fastest if one goes from a in the direction of the gradient of F at a, ∇F(a). It follows that, if b = a + γ∇F(a) for γ > 0 a small enough number, then F(a) ≤ F(b). With this observation in mind, one starts with a guess x0 for a local maximum of F and considers the sequence x0, x1, x2, ... such that x(n+1) = xn + γn ∇F(xn), n ≥ 0. We have F(x0) ≤ F(x1) ≤ F(x2) ≤ ..., so hopefully the sequence (xn) converges to the desired local maximum. Note that the value of the step size γ is allowed to change at every iteration.
Let us illustrate this process with the picture below. Here F is assumed to be defined on the plane, and its graph is assumed to look like a hill. The blue curves are the
contour lines, that is, the regions on which the value of F is constant. A red
arrow originating at a point shows the direction of the gradient at that point.
Note that the gradient at a point is perpendicular to the contour line going
through that point. We see that gradient ascent leads us to the top of the hill,
that is, to the point where the value of the function F is largest.
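A minimal numeric Python sketch of the iteration described above, on a simple one-dimensional function (the function, step size and number of iterations are illustrative choices):

    def gradient_ascent(grad, x0, step=0.1, iterations=100):
        # x(n+1) = xn + step * grad(xn): climb in the direction of the gradient.
        x = x0
        for _ in range(iterations):
            x = x + step * grad(x)
        return x

    # F(x) = -(x - 3)^2 has its maximum at x = 3; its gradient is -2 * (x - 3).
    print(gradient_ascent(lambda x: -2.0 * (x - 3.0), x0=0.0))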
Comments
Note that gradient descent works in spaces of any number of dimensions, even
in infinite-dimensional ones.
Two weaknesses of gradient descent are:
1. The algorithm can take many iterations to converge towards a local
maximum/minimum, if the curvature in different directions is very
different.
2. Finding the optimal step size γ at each step can be time-consuming. Conversely, using a fixed γ can yield poor results. Methods based on Newton's method and inversion of the Hessian using conjugate gradient techniques are often a better alternative.
A more powerful algorithm is given by the BFGS method, which consists in calculating at every step a matrix by which the gradient vector is multiplied to go in a "better" direction, combined with a more sophisticated line search algorithm to find the "best" value of γ.
See also
http://en.wikipedia.org/wiki/Non-parametric_methods
http://en.wikipedia.org/wiki/Expectation-Maximization
Local optimum
Gradient methods may in some cases lead to an insufficient local optimum.
Schematic definition of the different kinds of optima:
Some classical techniques that help escape poor local optima:
http://en.wikipedia.org/wiki/Stochastic_gradient_descent
http://en.wikipedia.org/wiki/Simulated_annealing
http://en.wikipedia.org/wiki/Genetic_algorithms
...
Over-fitting/Under-fitting
Over-fitting
Clever Hans (From Wikipedia, the free encyclopedia)
Contents
1 Clever Hans and Pfungst's study
2 Clever Hans effect
3 Reference
The horse, Hans, had been trained by a Mr. von Osten to tap out the answers to
arithmetic questions with its hoof. The answers to questions involving reading,
spelling and musical tones were converted to numbers, and the horse also
tapped out these numbers.
Seeking to ascertain a scientific basis or disproof for the claim, philosopher and
psychologist Carl Stumpf formed a panel of 13 prominent scientists, known as
the Hans Commission, to study the claims that Clever Hans could count. The
commission passed off the evaluation to Pfungst, who tested the basis for these
claimed abilities by:
1. Isolating horse and questioner from spectators, so no cues could come
from them
2. Using questioners other than the horse's master
3. By means of blinders, varying whether the horse could see the questioner
4. Varying whether the questioner knew the answer to the question in
advance.
Using a substantial number of trials, Pfungst found that the horse could get the
correct answer even if von Osten himself did not ask the questions, ruling out
the possibility of fraud. However, the horse got the right answer only when the
questioner knew what the answer was, and the horse could see the questioner.
He then proceeded to examine the behaviour of the questioner in detail, and
showed that as the horse's taps approached the right answer, the questioner's
posture and facial expression changed in ways that were consistent with an
increase in tension, which was released when the horse made the final,
"correct" tap. This provided a cue that the horse could use to tell it to stop
tapping.
The social communication systems of horses probably depend on the detection
of small postural changes, and this may be why Hans so easily picked up on the
cues given by von Osten (who seems to have been entirely unaware that he was
providing such cues). However, the capacity to detect such cues is not confined
to horses. Pfungst proceeded to test the hypothesis that such cues would be
discernible, by carrying out laboratory tests in which he played the part of the
horse, and human participants sent him questions to which he gave numerical
answers by tapping. He found that 90% of participants gave sufficient cues for
him to get a correct answer.
The risk of Clever Hans effects is one strong reason why comparative
psychologists normally test animals in isolated apparatus, without interaction
with them. However this creates problems of its own, because many of the most
interesting phenomena in animal cognition are only likely to be demonstrated in
a social context, and in order to train and demonstrate them, it is necessary to
build up a social relationship between trainer and animal. This point of view has
been strongly argued by Irene Pepperberg in relation to her studies of parrots,
and by Alan and Beatrice Gardner in their study of the chimpanzee Washoe. If
the results of such studies are to gain universal acceptance, it is necessary to
find some way of testing the animals' achievements which eliminates the risk of
Clever Hans effects. However, simply removing the trainer from the scene may
not be an appropriate strategy, because where the social relationship between
trainer and subject is strong, the removal of the trainer may produce emotional
responses preventing the subject from performing. It is therefore necessary to
devise procedures where none of those present knows what the animal's likely
response may be.
For an example of an experimental protocol designed to overcome the Clever
Hans effect, see Rico (Border Collie).
As Pfungst's final experiment makes clear, Clever Hans effects are quite as
likely to occur in experiments with humans as in experiments with other
animals. For this reason, care is often taken in fields such as perception,
cognitive psychology, and social psychology to make experiments double-blind,
meaning that neither the experimenter nor the subject knows what condition
the subject is in, and thus what his or her responses are predicted to be.
Another way in which Clever Hans effects are avoided is by replacing the
experimenter with a computer, which can deliver standardized instructions and
record responses without giving clues.
Reference
Pfungst, O. (1911). Clever Hans (The horse of Mr. Von Osten): A contribution to
experimental animal and human psychology (Trans. C. L. Rahn). New York:
Henry Holt. (Originally published in German, 1907).
The horse learnt something which is, in a way, far more complex than basic arithmetic.
A NN can do that too, when some unwanted statistical bias lies in the training data. Building a "Hans NN" that does something different from what you expect, without knowing it, can be a very deceptive experience...
Over-fitting can occur when the model (NN) has too many parameters with respect to the size of the training dataset, but it is not the only case...
Solutions:
Try to use the smallest NN possible for a given problem (limiting the number of parameters of the model).
Studying the error curves on the training and test sets can be helpful to find the best moment to stop training.
The error curves on the training and test sets are more or less the same at the beginning. At some point the training error keeps falling, but the error on the test set does not, and even gets bigger at some points.
The divergence point of these two curves is the best moment to stop training, because it is when the NN generalises best. After that point the NN specialises too much on the training set and will be unable to produce correct outputs on new data.
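A Python sketch of that stopping rule (the training step and test-error functions are placeholders; the "patience" counter is one simple way of detecting the divergence point):

    def train_with_early_stopping(model, train_step, test_error, max_epochs=1000, patience=5):
        # Stop when the test-set error has not improved for `patience` epochs in a row.
        best_error, best_model, bad_epochs = float("inf"), model, 0
        for epoch in range(max_epochs):
            model = train_step(model)       # one pass over the training set
            error = test_error(model)       # error measured on data never used for training
            if error < best_error:
                best_error, best_model, bad_epochs = error, model, 0
            else:
                bad_epochs += 1
                if bad_epochs >= patience:  # divergence point reached: keep the best model
                    break
        return best_model

    # Toy demo: the "model" is a number whose test error starts rising once it passes 7.
    print(train_with_early_stopping(0.0, lambda m: m + 1.0, lambda m: abs(7.0 - m)))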
Under-fitting
The other side of the fitting problem is under-fitting, when the model constraints are too strong with respect to the data statistics.
The example below shows the under-fitted output of a sine function approximation.
Solutions are:
Raising the number of neurons (and adapting the topology; a single hidden layer is the most common topology, but not always the best...)
Making sure that your problem can be solved the way you presented it to the NN. Maybe the input data are not (that) relevant to the problem...
Unsupervised learning
Unsupervised algorithms are in a way simpler than supervised ones: the mathematical background needed to understand and use them can be quite low. The problem is that they are quite counter-intuitive and may require more abstraction capabilities.
First of all, the question is: how can a machine learn something if no one tells it what is expected?
Let's see the k-means algorithm as an introduction...
K-means algorithm (From Wikipedia, the free encyclopedia)
The k-means algorithm assigns the points xj to k clusters so as to minimise the total intra-cluster variance
V = Σi=1..k Σ(xj in Si) (xj − μi)²
where there are k clusters Si, i = 1, 2, ..., k, and μi is the centroid or mean point of all the points xj in Si.
The algorithm starts by partitioning the input points into k initial sets, either at random or using some heuristic. It then calculates the mean point, or centroid, of each set. It constructs a new partition by associating each point with the closest centroid. Then the centroids are recalculated for the new clusters, and the algorithm is repeated by alternating these two steps until convergence, which is obtained when the points no longer switch clusters (or, alternatively, when the centroids no longer change).
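A minimal Python sketch of these two alternating steps, on one-dimensional toy data (the data and the value of k are invented for illustration):

    import random

    def k_means(points, k, iterations=100):
        centroids = random.sample(points, k)                 # initial centroids, chosen at random
        for _ in range(iterations):
            clusters = [[] for _ in range(k)]
            for p in points:                                  # assign each point to the closest centroid
                closest = min(range(k), key=lambda i: abs(p - centroids[i]))
                clusters[closest].append(p)
            new_centroids = [sum(c) / len(c) if c else centroids[i]   # recompute the mean of each cluster
                             for i, c in enumerate(clusters)]
            if new_centroids == centroids:                    # convergence: no centroid moved
                break
            centroids = new_centroids
        return centroids

    random.seed(0)
    print(k_means([1.0, 1.2, 0.8, 5.0, 5.3, 4.9], k=2))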
The algorithm has remained extremely popular because it converges extremely quickly in practice. In fact, many have observed that the number of iterations is typically much less than the number of points. Recently, however, Arthur and Vassilvitskii showed that there exist certain point sets on which k-means takes superpolynomial time, 2^Ω(√n), to converge.
References
J. B. MacQueen (1967): "Some Methods for classification and Analysis of
Multivariate Observations", Proceedings of 5-th Berkeley Symposium on
Mathematical Statistics and Probability, Berkeley, University of California Press,
1:281-297
D. Arthur, S. Vassilvitskii (2006): "How Slow is the k-means Method?", Proceedings of the 2006 Symposium on Computational Geometry (SoCG).
Kohonen network
Kohonen networks, also known as Self-Organizing Maps (SOM), are based on the k-means algorithm. The additional feature is that each centroid (mean) is located on a map in such a way that topology is preserved. This means that centroids that are close (in the problem space) should also be close on the map.
One can choose any type of map (in dimension and topology); this is a way to represent data of any dimension (the problem space) on a chosen map (ideally 2D, which is convenient for humans)...
SOM helps to represent high-dimensional data in a low-dimensional space while preserving topological information.
One important feature of SOM is that neurons are placed on a map where location matters.
In a SOM, the neighborhood is defined as a mathematical function whose parameters are one neuron of the map and a neighborhood level; the output is the list of neurons belonging to this neighborhood. (The higher the neighborhood level, the higher the number of neurons in the neighborhood.)
Neighborhood example on a square 2D map:
All the input data are presented to the neurons; the closest neuron and its neighborhood are selected to be updated toward the input data sample. The neighborhood is slowly decreased along the iterations...
Note: a lot of different topologies are possible. This choice is often the core of the problem, since it defines the properties of the map...
SOM training algorithm (a runnable sketch follows below):
1. initialise the neuron map
2. loop over a decreasing neighborhood schedule
3.   loop over all input data
4.     search for the closest neuron (n) to the current input sample (S)
5.     move the closest neuron and its neighborhood even closer to the input data sample:
       W(t+1) = W(t) + a * (S - W(t)), where a is a learning factor and S is the current sample
6.   end of loop over input data
7. end of neighborhood-decreasing loop
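Here is the announced minimal Python sketch of this algorithm, for a one-dimensional map fitted to one-dimensional data (the map size, the schedules and the data are toy choices of mine): neurons live on a line, and the neighborhood of the winning neuron is defined by the distance between map indices.

    import random

    def train_som(data, map_size=10, epochs=50, learning_rate=0.5):
        neurons = [random.random() for _ in range(map_size)]   # 1. initialise the neuron map
        for epoch in range(epochs):                            # 2. decreasing neighborhood schedule
            radius = max(1, int(map_size / 2 * (1 - epoch / epochs)))
            a = learning_rate * (1 - epoch / epochs)           # learning factor, also decreased here
            for sample in data:                                # 3. loop over all input data
                winner = min(range(map_size),                  # 4. closest neuron to the sample
                             key=lambda i: abs(sample - neurons[i]))
                for i in range(map_size):                      # 5. move winner and its neighborhood
                    if abs(i - winner) <= radius:
                        neurons[i] += a * (sample - neurons[i])
        return neurons                                         # neighbouring indices end up with close values

    random.seed(0)
    print(train_som([random.random() for _ in range(200)]))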
The training phase can be viewed as a deformation of the neural map toward the shape of the input data. A large neighborhood is like a quite rigid map, a small neighborhood is like a soft map. So the map goes from a rigid to a soft state in order to gradually fit the input data.
Here is an example of fitting 2D uniform random data (input space) with a 2D SOM:
How to read the map: the 2D positions in the figure come from the input data space; the topology information is in the grid connections.
The resulting straight grid indicates that we are modelling a 2D square with a 2D square (preserving topology)...
The following example is more interesting... The input data is still 2D here, but the topology is different. The input data is displayed in a 2D cross shape (which is a different topology from the square); the Kohonen map is still a 2D square.
Kohonen maps are inspired by the biological neurons involved in localisation. You will see in this example how a SOM can help to find one's way in the data.
Imagine that you need to operate a robot that has to move inside the input data as in a maze (inside the cross shape). If you have to go from A to B, the simple algorithm is to compute the A->B vector direction and use it to move the robot. This will lead the robot out of the cross (or into the wall of the maze).
Using the Kohonen map can help us find the best way; this is how to use it:
Compute the a->b vector in the map (using the neuron map coordinates);
each neuron on the road holds a point/vector (in the input space) showing, step by step, the way from A to B in the input space.
When applying this algorithm to the cross-shape figure, you will see that your robot carefully avoids the walls.
Note that the convergence of the map must be handled carefully... The cactus example below would need some tuning of the training parameters...
2D SOM http://www-ti.informatik.uni-tuebingen.de/~goeppert/KohonenApp/KohonenApp.html
an other 2D SOM
http://www.cs.utexas.edu/users/yschoe/java/javasom/Base.html
OCR http://www.ice.nuie.nagoya-u.ac.jp/~l94334/bio/tsp/tsp.html
WEBSOM
http://websom.hut.fi/websom/milliondemo/html/root.html
For each type of NN there are several training procedures, each of which impacts its characteristics.
...
This is just a small list of known techniques; derivatives, new ones and combinations are frequently invented...
A few web demos :
reinforcement learning :
Robot arm
http://www.fe.dis.titech.ac.jp/~gen/robot/robodemo.html
http://iridia.ulb.ac.be/~fvandenb/qlearning/qlearning.html
cat & mouse
http://www.cse.unsw.edu.au/~cs9417ml/RL1/applet.html
http://neuron.eng.wayne.edu/Hamming/voting.html
Hopfield network
http://suhep.phy.syr.edu/courses/modules/MM/sim/hopfield.html
So the processing power can be an issue if any of the n, N or I parameters is really high. It becomes a serious issue if real-time answers are required from such a process.
Solutions
Divide and conquer strategy:
Since building and training NNs requires a lot of trials, you can easily split them across different computers on a standard network.
NNs are not well suited to classical CPUs, nor even to math coprocessors: they perform only very simple operations, and they are parallel by nature (remember the biological inspiration).
Illustration 2: Zero Instruction Set Chip
NNs are not magic: if the information you are trying to discover is not in the data, the NN won't find it. You should have evidence or a strong intuition that the information is in your data. Useful tools can be:
2D marginal projections
expert information
We have seen that neural networks can solve several kind of problems :
classification
forecasting
robotics/automation
...
but it is not always the best choice. Well-known alternatives are:
Markov chains
K-nearest neighbor
Adaptive filtering
...
Neural network:
a training procedure
Training can be viewed as an optimisation problem; as with any problem of this kind, local optima can be a problem.
Libraries
Example: http://ai.bpa.arizona.edu/papers/dog93/dog93.html
This ability leads to some human issues:
fear of losing one's job: NNs can revive the old fear of being replaced by a machine.
Datasets
web :
http://kdd.ics.uci.edu/
promoters: ftp://ftp.ics.uci.edu/pub/machine-learning-databases/molecular-biology/
...
Bibliography
Web sites
http://leenissen.dk/fann/
http://www.google.com/Top/Computers/Artificial_Intelligence/Neural_Networks/Companies/
http://www.google.com/Top/Computers/Artificial_Intelligence/Conferences_and_Events/
...
Software
Open source
FANN http://leenissen.dk/fann/
SNNS http://www-ra.informatik.uni-tuebingen.de/SNNS/
scilab http://www.scilab.org
http://www.scilab.org/contrib/displayContribution.php?fileID=166
...
Proprietary
Mathematica
http://www.wolfram.com/products/applications/neuralnetworks/
matlab http://www.mathworks.com/products/neuralnet/
...