
Neural computing

ICO4168
Wassner Hubert
wassner@esiea.fr
http://professeurs.esiea.fr/wassner/

Clever Hans: an old story showing how difficult and surprising it can be to
teach a trick to something.

Table of contents
Course presentation............................................................................................4
Prerequisites .....................................................................................................4
Resume ...........................................................................................................4
Presented techniques.......................................................................................4
Objectives........................................................................................................4
Introduction.........................................................................................................5
Foreword on fractal approach ....................................................................5
Bionics (From Wikipedia, the free encyclopedia).............................................6
Neuron (From Wikipedia, the free encyclopedia).............................................7
Contents.......................................................................................................8
History.........................................................................................................8
Anatomy and histology.................................................................................8
Classes.......................................................................................................10

Connectivity...............................................................................................10
Adaptations to carrying action potentials...................................................11
Challenges to the neuron doctrine.............................................................12
Neurons in the brain..................................................................................12
See also......................................................................................................13
Sources......................................................................................................13
External links.............................................................................................13
Artificial neuron (From Wikipedia, the free encyclopedia).............................14
Contents.....................................................................................................14
Basic structure...........................................................................................14
History.......................................................................................................15
Types of transfer functions.........................................................................15
Bibliography...............................................................................................16
Mathematical model and properties of Artificial Neural Networks................17
Mathematical model...................................................................................17
Properties...................................................................................................20
Training and using (simplified)......................................................................24
Training a NN............................................................................................24
Using a NN.................................................................................................25
Applications................................................................................................25
Deeper insight...................................................................................................26
Foreword on classification and predictions systems......................................26
Threshold finding.......................................................................................27
Receiver operating characteristic (From Wikipedia, the free encyclopedia).......29
Testing precautions....................................................................................31
The art of splitting datasets........................................................................31
Datasets cost..............................................................................................31
Are we doing that good ? ...........................................................................32
Training testing and using (the real stuff)......................................................33
Preparing the datasets...............................................................................33
Supervised learning...................................................................................34
Gradient descent (From Wikipedia, the free encyclopedia)........................35
Description of the method..........................................................................35
Comments..................................................................................................37
See also......................................................................................................37
Backpropagation (From Wikipedia, the free encyclopedia)........................38
When the problem is not that simple... Neural cooking ............................39
Local optimum............................................................................................39
Over-fitting/Under-fitting...........................................................................40
Contents.............................................................................................41
Clever Hans and Pfungst's study........................................................41
Clever Hans effect..............................................................................42
Reference...........................................................................................42
Unsupervised learning...................................................................................45
K-means algorithm (From Wikipedia, the free encyclopedia).....................46
References.................................................................................................46
Kohonen network .....................................................................................47
The wide field of neural-network.......................................................................54

A word on algorithmic complexity.....................................................................55


Training is cpu consuming, but .....................................................................55
Solutions........................................................................................................55
NN are not magic !............................................................................................56
How to actualy create and use neural networks................................................57
When NN are doing better than experts............................................................58
Deeper Deeper inside : Exercises .....................................................................58
Datasets.........................................................................................................58
Bibliography......................................................................................................59
Bibliography...................................................................................................59
Web sites........................................................................................................60
Softwares.......................................................................................................60
Open source...............................................................................................60
Proprietary.................................................................................................60


Course presentation
Prerequisites

Good algorithmic knowledge

Good math basics (continuity, differentiation and optimisation)

Good (C) programming skills

Resume
Neural computing techniques are said to be bio-inspired/bio-mimetic; this
means they are inspired by known biological phenomena. The goal is to exploit
the learning and self-organization capabilities of living organisms on classic
algorithmic problems.
Applications include: artificial intelligence, classification, identification,
biometry, prediction, data mining...
The course is twofold:

a light theoretical part

a practice part using open-source libraries to solve real-life problems

Presented techniques

Classification basics

Supervised learning

feed-forward multi-layer neural networks

Unsupervised learning

Kohonen network (Self-Organizing Maps)

Objectives
Students following this course won't be experts in the neural computing field.
The goal of this course is to understand the basics of neural computing theory,
to be able to identify problems that can be solved by these techniques, and to
be able to implement such solutions with software libraries.


Introduction
Foreword on fractal approach
This field of computer science can be quite hard to understand because many
prerequisites are needed (such as mathematics). This makes it a field of choice
for engineers to set themselves apart from technicians. To make this course
easier I will use what I call the fractal approach. The idea is to always know
the goal of every part of the course. This leads to a certain amount of
redundancy and repetition, but avoids loss of motivation due to fuzzy goals. A
fractal is a type of object that has more or less the same appearance at
whatever scale you look at it. A shore is fractal: if you look at it at a large
scale, and then zoom in, you get the same kind of visual impression. This is why
it is hard to know where you are when walking along a shore. The same problem
can occur when walking linearly through a course such as this one. So, to avoid
this problem, I will frequently recall the local and global objectives of each
part of this course. This is the equivalent of frequently zooming in and out on
the shore map to know precisely where you are on the shore/in the course.
That's why the introduction part and the deeper insight part seem to talk about
exactly the same things and to have the same goals, but they are not at the
same scale.


Neural computing, a part of biomimetics...

Bionics (From Wikipedia, the free encyclopedia)


Bionics (also known as biomimetics, biognosis, biomimicry, or bionical
creativity engineering) is the application of methods and systems found in
nature to the study and design of engineering systems and modern technology.
Also a short form of biomechanics, the word 'bionic' is actually a portmanteau
formed from biology (from the Greek word "βίος", pronounced "vios", meaning
"life") and electronic.
The transfer of technology between lifeforms and synthetic constructs is
desirable because evolutionary pressure typically forces natural systems to
become highly optimized and efficient. A classical example is the development
of dirt- and water-repellent paint (coating) from the observation that the
surface of the lotus flower plant is practically unsticky for anything (the lotus
effect). Examples of bionics in engineering include the hulls of boats imitating
the thick skin of dolphins, sonar, radar, and medical ultrasound imaging
imitating the echolocation of bats.
In the field of computer science, the study of bionics has produced cybernetics,
artificial neurons, artificial neural networks, and swarm intelligence.
Evolutionary computation was also motivated by bionics ideas but it took the
idea further by simulating evolution in silico and producing well-optimized
solutions that had never appeared in nature.
Biomimetics is the field of algorithmics inspired by biology. The goal is to
simulate known biological phenomena. This enables algorithms to have
interesting capabilities such as self-organization, adaptation and even learning
... to some extent.
Neural techniques mimic the biological neuron.
We must then investigate a few basics of neuron biology...


Neuron (From Wikipedia, the free encyclopedia)

Drawing by Santiago Ramón y Cajal of cells in the pigeon cerebellum. (A)
Denotes Purkinje cells, an example of a bipolar neuron. (B) Denotes granule
cells which are multipolar.
Neurons are a major class of cells in the nervous system. Neurons are
sometimes called nerve cells, though this term is technically imprecise, as many
neurons do not form nerves. In vertebrates, neurons are found in the brain, the
spinal cord and in the nerves and ganglia of the peripheral nervous system.
Their main role is to process and transmit information. Neurons have excitable
membranes, which allow them to generate and propagate electrical impulses.


Contents

1 History
2 Anatomy and histology
3 Classes
4 Connectivity
5 Adaptations to carrying action
potentials
6 Histology and internal structure
7 Challenges to the neuron doctrine
8 Neurons in the brain
9 See also
10 Sources
11 External links

History
The concept of a neuron as the primary computational unit of the nervous
system was devised by the Spanish anatomist Santiago Ramón y Cajal in the
early 20th century. Cajal proposed that neurons were discrete cells which
communicated with each other via specialized junctions. This became known as
the Neuron Doctrine, one of the central tenets of modern neuroscience.
However, Cajal would not have been able to observe the structure of individual
neurons if his rival, Camillo Golgi, (for whom the Golgi Apparatus is named) had
not developed his silver staining method. When the Golgi Stain is applied to
neurons, it binds the cell's microtubules and gives stained cells a black outline
when light is shone through them.

Anatomy and histology


Many neurons are highly specialized, and they differ widely in appearance.
Neurons have cellular extensions known as processes which they use to send
and receive information. Neurons are typically 4 to 100 micrometres in
diameter, the size varies depending on the type of neuron and the species it is
from. [1]


The soma, or 'cell body', is the central part of the cell, where the nucleus
is located and where most protein synthesis occurs.

The dendrite is a branching arbor of cellular extensions. Most neurons


have several dendrites with profuse dendritic branches. The overall shape
and structure of a neuron's dendrites is called its dendritic tree, and is
traditionally thought to be the main information receiving network for the
neuron. However, information outflow (i.e. from dendrites to other
neurons) can also occur.

The axon is a finer, cable-like projection which can extend tens, hundreds,
or even tens of thousands of times the diameter of the soma in length. The
axon carries nerve signals away from the soma (and carry some types of
information in the other direction also). Many neurons have only one axon,
but this axon may - and usually will - undergo extensive branching,
enabling communication with many target cells. The part of the axon
where it emerges from the soma is called the 'axon hillock'. Besides being
an anatomical structure, the axon hillock is also the part of the neuron
that has the greatest density of voltage-dependent sodium channels. Thus
it has the most hyperpolarized action potential threshold of any part of the
neuron. In other words, it is the most easily-excited part of the neuron,
and thus serves as the spike initiation zone for the axon. While the axon
and axon hillock are generally considered places of information outflow,
this region can receive input from other neurons as well.

The axon terminal is a specialized structure at the end of the axon that is
used to release neurotransmitter and communicate with target neurons.

Although the canonical view of the neuron attributes dedicated functions to its
various anatomical components, dendrites and axons very often act contrary to
their so-called main function.
Axons and dendrites in the central nervous system are typically only about a
micrometer thick, while some in the peripheral nervous system are much
thicker. The soma is usually about 10-25 micrometers in diameter and often is
not much larger than the cell nucleus it contains. The longest axon of a human
motoneuron can be over a meter long, reaching from the base of the spine to
the toes, while giraffes have single axons running along the whole length of
their necks, several meters in length. Much of what we know about axonal
function comes from studying the squid giant axon, an ideal experimental
preparation because of its relatively immense size (0.5-1 millimeters thick,
several centimeters long).

Classes
Functional classification

Afferent neurons convey information from tissues and organs into the
central nervous system.
Efferent neurons transmit signals from the central nervous system to the
effector cells and are sometimes called motor neurons.
Interneurons connect neurons within specific regions of the central
nervous system.

Afferent and efferent can also refer to neurons which convey information from
one region of the brain to another.
Structural classification
Most neurons can be anatomically characterized as:

Unipolar or Pseudounipolar - dendrite and axon emerging from the same process.
Bipolar - single axon and single dendrite on opposite ends of the soma.
Multipolar - more than two dendrites
Golgi I- neurons with long-projecting axonal processes.
Golgi II- neurons whose axonal process projects locally.

Connectivity
Neurons communicate with one another via synapses, where the axon terminal
of one cell impinges upon a dendrite or soma of another (or less commonly to an
axon). Neurons such as Purkinje cells in the cerebellum can have over 1000
dendritic branches, making connections with tens of thousands of other cells;
other neurons, such as the magnocellular neurons of the supraoptic nucleus,
have only one or two dendrites, each of which receives thousands of synapses.
Synapses can be excitatory or inhibitory and will either increase or decrease
activity in the target neuron. Some neurons also communicate via electrical
synapses, which are direct, electrically-conductive junctions between cells.
In a chemical synapse, the process of synaptic transmission is as follows: when
an action potential reaches the axon terminal, it opens voltage-gated calcium


channels, allowing calcium ions to enter the terminal. Calcium causes synaptic
vesicles filled with neurotransmitter molecules to fuse with the membrane,
releasing their contents into the synaptic cleft. The neurotransmitters diffuse
across the synaptic cleft and activate receptors on the postsynaptic neuron.
The human brain has a huge number of synapses. Each of 100 billion neurons
has on average 7,000 synaptic connections to other neurons. Most authorities
estimate that the brain of a three-year-old child has about 1,000 trillion
synapses. This number declines with age, stabilizing by adulthood. Estimates
vary for an adult, ranging from 100 to 500 trillion synapses. [2]

Adaptations to carrying action potentials


The cell membrane in the axon and soma contain voltage-gated ion channels
which allow the neuron to generate and propagate an electrical impulse (an
action potential). Substantial early knowledge of neuron electrical activity came
from experiments with squid giant axons. In 1937, John Zachary Young
suggested that the giant squid axon might be used to better understand neurons
[3]. As they are much larger than human neurons, but similar in nature, it was
easier to study them with the technology of that time. By inserting electrodes
into the giant squid axons, accurate measurements could be made of the
membrane potential.
Electrical activity can be produced in neurons by a number of stimuli. Pressure,
stretch, chemical transmitters, and electrical current passing across the nerve
membrane as a result of a difference in voltage can all initiate nerve activity [4].

The narrow cross-section of axons lessens the metabolic expense of carrying


action potentials, but thicker axons convey impulses more rapidly. To minimize
metabolic expense while maintaining rapid conduction, many neurons have
insulating sheaths of myelin around their axons. The sheaths are formed by glial
cells: oligodendrocytes in the central nervous system and Schwann cells in the
peripheral nervous system. The sheath enables action potentials to travel faster
than in unmyelinated axons of the same diameter, whilst using less energy. The
myelin sheath in peripheral nerves normally runs along the axon in sections
about 1 mm long, punctuated by unsheathed nodes of Ranvier which contain a
high density of voltage-gated ion channels. Multiple sclerosis is a neurological
disorder that results from abnormal demyelination of axons in the central
nervous system.
Neurons with demyelinated axons do not conduct electrical signals properly.

Challenges to the neuron doctrine


The neuron doctrine is a central tenet of modern neuroscience, but recent
studies suggest that this doctrine needs to be revised.
First, electrical synapses are more common in the central nervous system than
previously thought. Thus, rather than functioning as individual units, in some
parts of the brain large ensembles of neurons may be active simultaneously to
process neural information.
Second, dendrites, like axons, also have voltage-gated ion channels and can
generate electrical potentials that carry information to and from the soma. This
challenges the view that dendrites are simply passive recipients of information
and axons the sole transmitters. It also suggests that the neuron is not simply
active as a single element, but that complex computations can occur within a
single neuron.
Third, the role of glia in processing neural information has begun to be
appreciated. Neurons and glia make up the two chief cell types of the central
nervous system. There are far more glial cells than neurons: glia outnumber
neurons by as many as 10:1. Recent experimental results have suggested that
glia play a vital role in information processing. [citations?]
Finally, recent research has challenged the historical view that neurogenesis, or
the generation of new neurons, does not occur in adult mammalian brains. It is
now known that the adult brain continuously creates new neurons in the
hippocampus and in an area contributing to the olfactory bulb. This research
has shown that neurogenesis is environment-dependent (eg. exercise, diet,
interactive surroundings), age-related, upregulated by a number of growth
factors, and halted by survival-type stress factors. [5] [6]

Neurons in the brain


The number of neurons in the brain varies dramatically from species to species.
The human brain has about 100 billion (10^11) neurons and 100 trillion (10^14)
synapses. By contrast, the nematode worm (Caenorhabditis elegans) has 302
neurons. Scientists have mapped all of the nematode's neurons. As a result,
such worms are ideal candidates for neurobiological experiments and tests.
Many properties of neurons, from the type of neurotransmitters used to ion
channel composition, are maintained across species, allowing scientists to study


processes occurring in more complex organisms in much simpler experimental
systems.

See also

Artificial neuron
Neural oscillations
Mirror neuron
Neuroscience
Neural network
Spindle neuron

Sources

Kandel E.R., Schwartz, J.H., Jessell, T.M. 2000. Principles of Neural


Science, 4th ed., McGraw-Hill, New York.
Bullock, T.H., Bennett, M.V.L., Johnston, D., Josephson, R., Marder, E.,
Fields R.D. 2005. The Neuron Doctrine, Redux, Science, V.310, p. 791-793.
Ramón y Cajal, S. 1933. Histology, 10th ed., Wood, Baltimore.
Peters, A., Palay, S.L., Webster, H, D., 1991 The Fine Structure of the
Nervous System, 3rd ed., Oxford, New York.

External links

Cell Centered Database UC San Diego images of neurons.


High Resolution Neuroanatomical Images of Primate and Non-Primate
Brains.

Retrieved from "http://en.wikipedia.org/wiki/Neuron"




Artificial neuron (From Wikipedia, the free encyclopedia)

An artificial neuron (also called a "node" or "neuron") is a basic unit in an


artificial neural network. Artificial neurons are simulations of biological
neurons, and they are typically functions from many dimensions to one
dimension. They receive one or more inputs and sum them to produce an
output. Usually each input is separately weighted, and the sum is passed
through a non-linear function known as an activation or transfer function. The
canonical form of transfer functions is the sigmoid, but they may also take the
form of other non-linear functions, piecewise linear functions, or step functions.
Generally, transfer functions are monotonically increasing.

Contents

1 Basic structure
2 History
3 Types of transfer functions
3.1 Step function
3.2 Sigmoid
4 See also
5 Bibliography

Basic structure
For a given artificial neuron, let there be m inputs with signals x1 through xm
and weights w1 through wm.
The output of neuron k is:

    yk = φ( Σ_{j=1..m} wj · xj )

where φ (phi) is the transfer function.
The output propagates to the next layer (through a weighted synapse) or finally
exits the system as part or all of the output.
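As a hedged sketch in Python (function names such as `neuron_output` are illustrative, not from the course), the computation above is just a weighted sum followed by a transfer function:

```python
import math

def neuron_output(weights, inputs, phi):
    """Weighted sum of the inputs, passed through the transfer function phi."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return phi(weighted_sum)

def sigmoid(v):
    # The canonical transfer function mentioned above.
    return 1.0 / (1.0 + math.exp(-v))

# Two inputs whose weighted sum is 0: the sigmoid maps 0 to 0.5.
y = neuron_output([1.0, 1.0], [0.5, -0.5], sigmoid)
```

Any other monotonically increasing non-linear function could be passed as `phi` in the same way.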

History
The original artificial neuron is the Threshold Logic Unit first proposed by
Warren McCulloch and Walter Pitts in 1943. As a transfer function, it employs a
threshold or step function taking on the values 1 or 0 only.

See article on Perceptron for more details

Types of transfer functions


The transfer function of a neuron is chosen to have a number of properties
which either enhance or simplify the network containing the neuron. Crucially,
for instance, any multi-layer perceptron using a linear transfer function has an
equivalent single-layer network; a non-linear function is therefore necessary to
gain the advantages of a multi-layer network.
Step function

The output y of this transfer function is binary, depending on whether the input
meets a specified threshold, θ. The "signal" is sent, i.e. the output is set to
one, if the activation meets the threshold.
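A minimal sketch of such a threshold unit in the McCulloch-Pitts style (the AND-gate weights and threshold below are hand-picked for illustration):

```python
def step_neuron(weights, inputs, theta):
    """Threshold unit: output 1 if the weighted sum meets the threshold theta, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum >= theta else 0

# With weights (1, 1) and threshold 1.5 the unit behaves as an AND gate:
# it fires only when both binary inputs are 1.
and_outputs = [step_neuron([1, 1], [a, b], 1.5) for a in (0, 1) for b in (0, 1)]
```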

Sigmoid

A fairly simple non-linear function, the sigmoid also has an easily calculated
derivative, which is used when calculating the weight updates in the network. It
thus makes the network more easily manipulable mathematically, and was
attractive to early computer scientists who needed to minimise the
computational load of their simulations.
See: Sigmoid function
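The cheap derivative mentioned above has a closed form in terms of the sigmoid itself; a small sketch (names are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    # s'(x) = s(x) * (1 - s(x)): computing the derivative reuses the
    # sigmoid value, which keeps weight-update computations cheap.
    s = sigmoid(x)
    return s * (1.0 - s)
```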


Bibliography

McCulloch, W. and Pitts, W. (1943). A logical calculus of the ideas
immanent in nervous activity. Bulletin of Mathematical Biophysics, 7:115-133.

Retrieved from "http://en.wikipedia.org/wiki/Artificial_neuron"


Important note: many functions other than the sigmoid can be used. The sigmoid
is just the most popular; it may sometimes not be the best choice.


Mathematical model and properties of Artificial Neural Networks
Mathematical model
Imagine a simple 2D classification problem with 2 classes.

A multidimensional (2D) sigmoid (i.e. a neuron with 2 inputs) looks like this.


So one single neuron (W0 = 1 and W1 = 1) can do a linear separation with a
fading transition, enabling it to handle uncertain decisions:

    z = s( Σ_i wi · xi )

where s(x) = 1/(1+exp(-x))

If the output is close to 0 then the presented input data is believed to belong
to class 0 (green x).

If the output is close to 1 then the presented input data is believed to belong
to class 1 (red +).

If the output is around 0.5 then you know that the decision will be uncertain.
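The three decision rules above can be sketched as follows (the weights match the W0 = 1, W1 = 1 example; the width of the "uncertain" band is an arbitrary illustrative choice):

```python
import math

def classify(x0, x1, w0=1.0, w1=1.0, band=0.1):
    """Single sigmoid neuron used as a 2D classifier.
    Outputs near 0 mean class 0, near 1 mean class 1, and values
    close to 0.5 (within +/- band) are reported as uncertain."""
    z = 1.0 / (1.0 + math.exp(-(w0 * x0 + w1 * x1)))
    if z > 0.5 + band:
        return "class 1"
    if z < 0.5 - band:
        return "class 0"
    return "uncertain"
```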


One neuron can't do much, but several neurons (with an appropriate learning
phase) can do a lot more...
Using two neurons (on the same layer) can produce more complex landscapes.
Example:

    z = s( w0 · s(Σ_i wi · xi) + w1 · s(Σ_i w'i · xi) )

The corresponding parameters are W0 = 1, W1 = 1 for the left neuron, W0 = 2,
W1 = 0 for the right neuron, and W0 = 1, W1 = 1 for the output neuron.
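A sketch of the forward pass with these weights (biases are omitted, as in the example):

```python
import math

def s(v):
    return 1.0 / (1.0 + math.exp(-v))

def two_neuron_net(x0, x1):
    """Forward pass with the weights from the example: hidden neurons
    (W0=1, W1=1) and (W0=2, W1=0) feed an output neuron (W0=1, W1=1)."""
    h_left = s(1.0 * x0 + 1.0 * x1)
    h_right = s(2.0 * x0 + 0.0 * x1)
    return s(1.0 * h_left + 1.0 * h_right)

z = two_neuron_net(0.0, 0.0)  # both hidden outputs are 0.5, so z = s(1.0)
```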

Properties
Universal function approximation

It has been proved that every continuous function f: R^n -> R^p can be
approximated by a neural network.

A neural network can be seen as a function decomposition over a basis of
sigmoids (or other functions), just as Fourier analysis is a function
decomposition over sine and cosine functions.
Since we are viewing our problems as function approximation and our model is
a continuous function, it is important to ensure that the way we present data
to the neural network is seen as continuous.
For a NN the continuity constraint can be expressed this way: if two input
patterns are close, the corresponding NN output patterns should be close too.
If that is not the case, the learning process will be more difficult and NN
performance will probably be low.
Example: approximating the XOR function.

    A  B | A XOR B
    0  0 |    0
    0  1 |    1
    1  0 |    1
    1  1 |    0

Input patterns (1,0) and (1,1) are close, but the corresponding outputs (1 and
0) are not. This makes it a classic hard learning dataset.
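For illustration, XOR can nevertheless be computed by a small 2-2-1 network once suitable weights are found; the hand-picked weights below are one known solution, not the result of training:

```python
def step(v, theta):
    return 1 if v >= theta else 0

def xor_net(a, b):
    """2-2-1 network with hand-picked weights: hidden unit 1 computes OR,
    hidden unit 2 computes AND, and the output unit fires when OR is
    true but AND is not, which is exactly XOR."""
    h_or = step(a + b, 0.5)
    h_and = step(a + b, 1.5)
    return step(h_or - h_and, 0.5)

truth_table = [(a, b, xor_net(a, b)) for a in (0, 1) for b in (0, 1)]
```

Finding such weights automatically is precisely what the training algorithms discussed later must do, and the discontinuity of the target makes that harder.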


Note: the fact that neural networks are universal function approximators does
not imply that they can solve any classification problem.
For example, this dataset shows a problem that cannot be totally solved:

The red (+) and green (x) classes overlap each other (when considering only
the two parameters shown here), so the neural network can't do a 100 % correct
classification. The overlapping regions will lead to uncertain answers/outputs.
If the network is properly trained, the output can reflect a measure of the
uncertainty (a probability approximation), which is still important information
even if you can't really make a straight decision.
Generalisation

When using neural networks for function approximation we implicitly use
another neural property, which is generalisation. The network
learns/approximates a (mathematical) function knowing only a limited number
of its values.
Example: a NN is used to approximate the sine function. The training dataset
is a series of pairs (x, sin(x)) where x is sampled every PI/20.


The network is said to have good generalisation behaviour when it performs
well on values in between those used in training.
Example: the network trained above is tested on the same range but with a
PI/80 sampling rate.
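A minimal pure-Python sketch of this experiment; the network size (8 hidden units), learning rate, iteration count, and random seed are arbitrary assumptions, and the training loop is plain stochastic gradient descent, not the course's reference implementation:

```python
import math, random

random.seed(0)

def s(v):
    return 1.0 / (1.0 + math.exp(-v))

# Training pairs (x, sin x), x sampled every pi/20 as in the example above.
train = [(i * math.pi / 20, math.sin(i * math.pi / 20)) for i in range(41)]
# Test pairs sampled every pi/80, i.e. in between the training points.
test = [(i * math.pi / 80, math.sin(i * math.pi / 80)) for i in range(161)]

H = 8  # number of hidden sigmoid units (an arbitrary choice)
w1 = [random.uniform(-1, 1) for _ in range(H)]  # input -> hidden weights
b1 = [random.uniform(-1, 1) for _ in range(H)]  # hidden biases
w2 = [random.uniform(-1, 1) for _ in range(H)]  # hidden -> output weights
b2 = 0.0                                        # output bias

def forward(x):
    h = [s(w1[j] * x + b1[j]) for j in range(H)]
    return h, b2 + sum(w2[j] * h[j] for j in range(H))

def mse(data):
    return sum((forward(x)[1] - t) ** 2 for x, t in data) / len(data)

loss_before = mse(train)
lr = 0.05
for _ in range(3000):          # stochastic gradient descent on the error
    x, t = random.choice(train)
    h, y = forward(x)
    e = y - t                  # derivative of 0.5 * e^2 w.r.t. the output
    for j in range(H):
        g = e * w2[j] * h[j] * (1 - h[j])  # gradient through the sigmoid
        w2[j] -= lr * e * h[j]
        w1[j] -= lr * g * x
        b1[j] -= lr * g
    b2 -= lr * e
loss_after = mse(train)
```

Evaluating `mse(test)` after training measures generalisation on the in-between points.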


This property comes from the fact that we are approximating continuous
functions with a composition of continuous functions.
Robustness

Since the learning is statistical, ANNs have a tolerant behaviour when facing
noisy data.
Adding noise to the same sine function used above does not disturb the
training much.


Training and using (simplified)


Training a NN
Training a NN is the process of estimating good parameters (Wi) with regard to a training dataset. Good parameters is another way to say parameters that lead to a minimum of errors. The error is simply the difference between the expected output and the actual NN output. Training is mainly an optimisation problem: finding the best parameters that lower the error as far as possible. There exist a lot of optimisation techniques, and more or less all of them are applicable here. The most common is gradient descent. The technique is to iteratively follow the direction given by the error gradient. For each input pattern we can compute the error derivative, indicating how to modify each NN parameter in order to minimise the error. This process may take some CPU power if the dataset, the network or both is/are large.
Imagine that the network has only two parameters (Wi): training the network would be equivalent to finding the minimum of a surface (the Z axis corresponding to the mean error of the network, W0 and W1 corresponding to x and y).
Generally speaking we talk about an error hypersurface, because the number of parameters is often far greater than 2.
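Such a descent over a two-parameter error surface can be sketched as follows. The quadratic error surface is an invented toy stand-in for a real NN's mean error, and the gradient is estimated by finite differences, as a generic optimiser would do without an analytic derivative:

```python
# Gradient descent on a toy 2-parameter error surface E(w0, w1).

def error(w0, w1):
    # Invented toy error surface with a single minimum at (2, -1).
    return (w0 - 2.0) ** 2 + (w1 + 1.0) ** 2

def gradient(f, w0, w1, eps=1e-6):
    # Central finite differences along each parameter.
    dw0 = (f(w0 + eps, w1) - f(w0 - eps, w1)) / (2 * eps)
    dw1 = (f(w0, w1 + eps) - f(w0, w1 - eps)) / (2 * eps)
    return dw0, dw1

w0, w1 = 0.0, 0.0          # initial weights
lr = 0.1                   # step size (learning rate)
for _ in range(200):       # iteratively follow the negative gradient
    g0, g1 = gradient(error, w0, w1)
    w0 -= lr * g0
    w1 -= lr * g1

print(round(w0, 3), round(w1, 3))   # ends up close to the minimum (2, -1)
```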

Using a NN
Using a NN is simply propagating the input values towards the output; it is generally a low-CPU-consuming stage (just a few function evaluations). This is one of the advantages of NNs: using them is a quite small task. Training might need some time and CPU power, but using them is quick and easy.
Some postprocessing is sometimes needed in real-life applications: a typical classification problem will need a final decision, where the NN only outputs a numerical vector (in [-1,1]). Classical algorithms are then used to find the best fitting class, and the distance to the second-best class can be a good confidence index.
A NN is rarely the complete solution to a problem, but it can be a decisive part of a lot of solutions.
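A sketch of this postprocessing step (the class names and score vector are invented for illustration):

```python
# The NN outputs one score per class in [-1, 1]; we pick the best-scoring
# class and use the gap to the second-best score as a confidence index.

def decide(scores, classes):
    ranked = sorted(zip(scores, classes), reverse=True)
    best, second = ranked[0], ranked[1]
    confidence = best[0] - second[0]   # distance to the second-best class
    return best[1], confidence

label, conf = decide([0.8, -0.3, 0.75], ["cat", "dog", "rabbit"])
print(label, round(conf, 2))   # -> cat 0.05 : a correct but low-confidence answer
```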
Example in computer gaming: making a good AI gamer/bot for action games is hard and complex, because no one can really mathematically express what a good strategy is: everything depends on the human opponent's gaming style. Directly training a NN to decide each action is a far too complex task, because you would have to be able to label each action/decision as good or not good (since we are doing supervised training). The quality of each decision (leading to winning or losing the game) is simply inaccessible.
Another way to approach making a good AI bot is to build a NN that simply predicts the position and/or action of the human opponent. This is far easier because you have this information (a simple game recording is enough). You may then feed the information predicted by the NN to an expert system. Programming a decent AI bot with an expert system that has information on the next moves of the human opponent is rather simple...
Here are some interesting demonstrations:

OCR demo : http://www.sund.de/netze/applets/BPN/bpn2/ochre.html

function approximation :
http://neuron.eng.wayne.edu/bpFunctionApprox/bpFunctionApprox.html

Applications
There is a very wide variety of applications of neural techniques, and there are certainly even more to discover. Here is a non-exhaustive list:

classification (includes biometry)

generalisation, interpolation, prediction (credit allocation, energy consumption, ...)

prediction/forecasting (over time or otherwise)

gaming

Deeper insight
Foreword on classification and prediction systems
The field is quite large, so it is very important to use the correct words to identify each kind of application.
Classification is the process of attributing a class to an unknown entry. The system measures features of the entry and then decides to which class it belongs.
A classical example is biometrics: a person is measured through a biometric system (voice sample, fingerprint, iris, ...); these measures are then compared to stored models and the program decides whether to grant access or not. This particular case of classification is called verification: the system verifies the claim of the user. The user claims an identity, the system verifies it. There are two classes: positive recognition and negative recognition. This makes it a two-class problem. For real-life applications it is strongly advised to extend it to, at least, a 3-class problem:

access granted

access refused

can't tell (restart the measures, a limited number of times).
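A sketch of this 3-outcome verification decision, assuming two invented thresholds on a single NN score:

```python
# Two thresholds split the NN score into "refused", "can't tell" and
# "granted". The threshold values are invented for illustration.

def verify(score, low=-0.5, high=0.5):
    if score >= high:
        return "access granted"
    if score <= low:
        return "access refused"
    return "can't tell"   # the caller should re-measure, a limited number of times

print(verify(0.9))    # -> access granted
print(verify(-0.8))   # -> access refused
print(verify(0.1))    # -> can't tell
```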

Identification is a different kind of classification problem: you have no information about the input data, and you have to attribute a class to it. One good example is speaker identification. The problem is to identify the voice of a person within a stored collection of speaker models.
It is sometimes important to distinguish two sub-kinds:

when the user knows that he is being measured and wants to be: collaborative context.

when the user doesn't know, or even doesn't want, to be classified.

A prediction/forecasting task is like weather forecasting: predicting the future of a value. Beware, this term is often misused. We are trying to make programs that predict things, but in this field prediction may denote quite different operations. In biometric applications for instance, some people talk about prediction for the decision process, which is not really a prediction, simply because there is no time dependence. So the term prediction is used nearly everywhere a program takes a decision.
Since one of the goals of an engineer is to turn software functionality into business, it is very important to be able to measure the effectiveness of a prediction system when tailoring a commercial system that uses prediction.
The mean quality of a program as a two-class (positive|negative) predictor is estimated through these features and definitions:

true positives: positive decisions that are confirmed

true negatives: negative decisions that are confirmed

false positives: positive decisions that should be negative

false negatives: negative decisions that should be positive

Imagine you are building a biometric system to access banking data: the cost of a false positive error is not the same as that of a false negative error. In the first case you give access to a person who shouldn't have it. In the second case you deny access to a genuine user. Remember that training algorithms are more or less optimisation problems; it is then very important to adapt the cost in the training function to handle these different kinds of errors...
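A sketch of counting the two error types and weighting them differently in an overall cost. The decisions, ground truth and the 10:1 cost ratio are invented for illustration:

```python
# 1 = access granted / genuine user; 0 = refused / impostor.
decisions = [1, 1, 0, 0, 1, 0, 0, 1]
truth     = [1, 0, 0, 0, 1, 1, 0, 1]

fp = sum(d == 1 and t == 0 for d, t in zip(decisions, truth))  # impostor accepted
fn = sum(d == 0 and t == 1 for d, t in zip(decisions, truth))  # genuine user rejected

# In this banking scenario a false positive is assumed to cost 10 times
# more than a false negative; the training criterion should reflect that.
cost = 10.0 * fp + 1.0 * fn
print(fp, fn, cost)   # -> 1 1 11.0
```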
Extending this concept to N > 2 class classification problems leads to the concept of the confusion matrix.
A confusion matrix is a visualization tool typically used in supervised learning
(in unsupervised learning it is typically called a matching matrix). Each
column of the matrix represents the instances in a predicted class, while each
row represents the instances in an actual class. One benefit of a confusion
matrix is that it is easy to see if the system is confusing two classes (i.e.
commonly mislabelling one as another).
In the example confusion matrix below, of the 8 actual cats, the system
predicted that three were dogs, and of the six dogs it predicted that one was a
rabbit and two were cats. We can see from the matrix that the system in
question has trouble distinguishing between cats and dogs, but can make the
distinction between rabbits and other types of animals pretty well.
Example of confusion matrix (rows: actual class; columns: predicted class):

         Cat  Dog  Rabbit
Cat       5    3     0
Dog       2    3     1
Rabbit    0    2    11
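The matrix can be rebuilt programmatically from (actual, predicted) pairs; the pair counts below follow the example's figures:

```python
classes = ["Cat", "Dog", "Rabbit"]
pairs = ([("Cat", "Cat")] * 5 + [("Cat", "Dog")] * 3 +
         [("Dog", "Cat")] * 2 + [("Dog", "Dog")] * 3 + [("Dog", "Rabbit")] * 1 +
         [("Rabbit", "Dog")] * 2 + [("Rabbit", "Rabbit")] * 11)

# Rows: actual class; columns: predicted class.
matrix = {a: {p: 0 for p in classes} for a in classes}
for actual, predicted in pairs:
    matrix[actual][predicted] += 1

for a in classes:
    print(a, [matrix[a][p] for p in classes])
```

Off-diagonal cells are the confusions; the large Cat/Dog counts show exactly the weakness described above.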

Threshold finding
A NN outputs fuzzy values, but lots of real-life problems need discrete answers: yes/no and nothing else. The basic way to obtain them is to threshold the NN output. All the difficulty is in the choice of that threshold. Let's see the process for a 2-class classification problem. One common way to choose the threshold is to draw the score histogram of each class on the same graph.

This figure shows that if the threshold is set at 3 (s < 3 => class1; s >= 3 => class2) we expect:

virtually no false positives on class1

some false negatives on class1 (the ones scored > 3)

class2 should have no false negatives

class2 will have some false positives

Moving the threshold allows different compromises on the classifier accuracy. Error statistics should be computed for each possible threshold: this leads to the ROC curve (Receiver Operating Characteristic).

Receiver operating characteristic (From Wikipedia, the free encyclopedia)

In signal detection theory, a receiver operating characteristic (ROC), also receiver operating curve, is a graphical plot of the sensitivity vs. (1 - specificity) for a binary classifier system as its discrimination threshold is varied. The ROC can also be represented equivalently by plotting the fraction of true positives (TP) vs. the fraction of false positives (FP). The usage receiver operator characteristic is also common.
ROC curves are used to evaluate the results of a prediction and were first
employed in the study of discriminator systems for the detection of radio signals
in the presence of noise in the 1940s, following the attack on Pearl Harbor. The
initial research was motivated by the desire to determine how the US RADAR
"receiver operators" had missed the Japanese aircraft.
In the 1950s they began to be used in psychophysics, to assess human (and
occasionally animal) detection of weak signals. They also proved to be useful for
the evaluation of machine learning results, such as the evaluation of Internet
search engines. They are also used extensively in epidemiology and medical
research and are frequently mentioned in conjunction with evidence-based
medicine.
The best possible prediction method would yield a graph that was a point in the
upper left corner of the ROC space, i.e. 100% sensitivity (all true positives are
found) and 100% specificity (no false positives are found). A completely random
predictor would give a straight line at an angle of 45 degrees from the
horizontal, from bottom left to top right: this is because, as the threshold is
raised, equal numbers of true and false positives would be let in. Results below
this no-discrimination line would suggest a detector that gave wrong results
consistently, and could therefore be simply used to make a detector that gave
useful results by inverting its decisions.

How a ROC curve can be interpreted

Sometimes, the ROC is used to generate a summary statistic. Common versions are:
the intercept of the ROC curve with the line at 90 degrees to the no-discrimination line
the area between the ROC curve and the no-discrimination line
the area under the ROC curve, often called AUC.
d' (pronounced "d-prime"), the distance between the mean of the distribution of activity in the system under noise-alone conditions and its distribution under signal-plus-noise conditions, divided by their standard deviation, under the assumption that both these distributions are normal with the same standard deviation. Under these assumptions, it can be proved that the shape of the ROC depends only on d'.

Figure: ROC curves of three epitope predictors.


However, any attempt to summarize the ROC curve into a single number
loses information about the pattern of tradeoffs of the particular
discriminator algorithm.
The machine learning community most often uses the ROC AUC statistic. This
measure can be interpreted as the probability that when we randomly pick one
positive and one negative example, the classifier will assign a higher score to
the positive example than to the negative. In engineering, the area between the
ROC curve and the no-discrimination line is often preferred, because of its
useful mathematical properties as a non-parametric statistic. This area is often
simply known as the discrimination. In psychophysics, d' is the most
commonly used measure.
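A sketch of both computations on invented scores: one ROC point at a given threshold, and the AUC computed directly as the pairwise-ordering probability described above:

```python
pos = [0.9, 0.8, 0.7, 0.55, 0.3]   # scores of positive examples (invented)
neg = [0.6, 0.4, 0.35, 0.2, 0.1]   # scores of negative examples (invented)

# One ROC point: sweep this threshold over all scores to get the full curve.
thr = 0.5
tpr = sum(p >= thr for p in pos) / len(pos)   # sensitivity
fpr = sum(n >= thr for n in neg) / len(neg)   # 1 - specificity
print(f"at threshold {thr}: TPR={tpr}, FPR={fpr}")   # -> TPR=0.8, FPR=0.2

# AUC as the probability that a random positive outscores a random
# negative (ties count one half).
wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))
print(f"AUC = {auc}")   # -> 0.84: 21 of the 25 (pos, neg) pairs are ordered correctly
```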
The illustration to the right shows the use of ROC graphs for the discrimination
between the quality of different epitope predicting algorithms. If you wish to
discover at least 60% of the epitopes in a virus protein, you can read out of the
graph that about 1/3 of the output would be falsely marked as an epitope. The
information that is not visible in this graph is that the person that uses the
algorithms knows what threshold settings give a certain point in the ROC graph.
Retrieved from "http://en.wikipedia.org/wiki/Receiver_operating_characteristic"
Category: Detection theory
Demos :

Very nice applet showing dynamics of ROC curves


http://www.anaesthetist.com/mnm/stats/roc/

ROC analysis through the web


http://www.rad.jhmi.edu/jeng/javarad/roc/main.html

Testing precautions
The above testing definitions and statistics must be computed on correct datasets. Since we are using training data to estimate the model parameters, the network will behave too well on that same dataset. It is very important to test the network on data that it has not seen before. It is the only way to ensure measuring the real generalisation capability of neural networks.

The art of splitting datasets

We are basically computing statistics both when training and when testing networks. We therefore want the biggest datasets possible. For most real-life problems we have limited access to data; moreover, training and testing datasets must be different. This is the first constraint to take into account. The second is to have a good coverage of real data in both the training and testing sets, and for all classes.
A random shuffle of the data before splitting into train and test datasets is a good basic solution. Sometimes it is not a good choice, if there are unknown statistical links between samples. Some expert information might be needed at this stage.
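A minimal shuffle-then-split sketch (the dataset and the 80/20 ratio are invented for illustration):

```python
import random

# A labeled dataset: (input, label) pairs, invented for illustration.
data = [(x, "even" if x % 2 == 0 else "odd") for x in range(100)]

random.seed(42)        # reproducibility
random.shuffle(data)   # break any ordering present in the collected data

cut = int(0.8 * len(data))        # e.g. 80 % train / 20 % test
train, test = data[:cut], data[cut:]
print(len(train), len(test))      # -> 80 20

# The train and test sets must not overlap.
assert not set(train) & set(test)
```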

Datasets cost
It may sound strange, but training datasets are often quite small and proprietary. It's simply because they are expensive to build. They are expensive because most of the time the output (called label or annotation) is determined by an expert, and you need a big dataset to properly train and test your ANN. (Note: this is true for any other statistical model, not only ANNs.)
It is very important to estimate the cost of dataset collection and annotation in this kind of project.
When little data is available, one option is the jack-knife technique.
Jackknifed statistics are created by systematically dropping out subsets of data one at a time and assessing the resulting variation in the studied parameter.

Jack-knife technique:
The whole dataset is split into n parts. Each part is used once as the test set while the remaining parts are used for training: n trainings, each on (n-1) parts of the examples, and n tests, each on a different (but similar) trained model.
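The jack-knife loop can be sketched as follows. The "model" here is just the mean of the training values, an invented stand-in for an actual NN training, to keep the sketch short:

```python
data = [2.0, 4.0, 6.0, 8.0, 10.0]   # invented dataset, n = 5

errors = []
for i in range(len(data)):            # each sample is the test set once
    test = data[i]
    train = data[:i] + data[i + 1:]   # train on the (n-1) others
    model = sum(train) / len(train)   # "training": fit the mean
    errors.append(abs(model - test))  # "testing" on the held-out sample

print(sum(errors) / len(errors))      # average error over the n folds
```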

In some cases the annotation process is the most expensive part. Bootstrap learning may take advantage of this situation: one can train a NN with the labeled data and use the trained network to ease the annotation of more data. Doing so incrementally grows the annotated database and raises the NN accuracy.

Are we doing that good ?


To know whether the NN is working well, one must compare it to the minimal classifier (or random predictor), which only uses the a priori knowledge of the class probabilities.
Example: a classification problem with 2 classes having a distribution of 10 % for the first class and 90 % for the second can be approached pretty well by choosing the class randomly according to the a priori distribution.
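A quick sketch of this baseline computation, using the example's 10 %/90 % prior (the formulas are standard probability):

```python
p = [0.1, 0.9]   # a priori class probabilities

# Guessing randomly according to the prior is right when the guess and the
# true class coincide: sum of p_i^2.
random_guess_accuracy = sum(pi * pi for pi in p)

# Always answering the majority class is right max(p) of the time.
majority_accuracy = max(p)

print(random_guess_accuracy)   # ~ 0.82
print(majority_accuracy)       # ~ 0.9
```

A trained NN that does not beat these numbers brings nothing over the minimal classifier.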
If your NN can't do better than the minimal classifier:

Something is going wrong somewhere in your data processing (bug?).

Your problem is maybe harder than you thought. Looking at two-dimensional projections of the training data can give hints about the classification difficulty.
Example: the following 2D projection shows a very hard (impossible?) classification task. More separated data points must be found, at least on some 2D projections, to have any hope of training a classifier on this problem...

Note: you should be suspicious if your NN does a 100 % correct classification.
Be sure that the problem really is that easy. Some bugs may lead to apparent 100 % correctness, and the optimistic programmer may fall into pitfalls if he doesn't discover the truth. This happens more often than you may think... ;-)

Training, testing and using (the real stuff)


Preparing the datasets
Depending on the activation function used by the input neurons, input data must be in a given range ([0,1] or [-1,+1], ...). Real-life problems rarely fit directly into this format. A normalisation function must be used. This function must be carefully chosen since it implicitly defines the sensitivity of the neurons.
These are the characteristics needed for normalisation:

The input normalisation must preserve the dynamics of the data.

The output normalisation must enhance the error measurement, in order to ease the gradient descent. And it should be easily transformed back into real-world data.

Warning: normalising by a simple homothety (linear normalisation) may not be a good solution.
Example: the following histogram shows a dataset where almost all values are in [0,11], but a very few are bigger than 25. A simple linear normalisation into [0,1] would squeeze all the data dynamics into [0, ~0.25]. This may not suit the input neuron range, thus blinding the NN.

The solution is to study the statistics of every input variable to find a good scaling function. Two main solutions here:

building a non-linear normalisation function which better respects the data dynamics.

using a linear normalisation with hand-/expert-selected extremum values (this works well if extremum cases are easy classification cases).
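A sketch of the first option, assuming a rank-based (non-linear) normalisation; the dataset is invented to mimic the histogram described above:

```python
# Each value is replaced by its rank, scaled to [0, 1], so a few large
# outliers no longer squeeze the bulk of the data into a tiny interval.

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 250]   # one big outlier

# Linear normalisation: the outlier dominates the scale.
lo, hi = min(data), max(data)
linear = [(v - lo) / (hi - lo) for v in data]

# Rank-based normalisation: uniform use of [0, 1].
order = sorted(data)
ranked = [order.index(v) / (len(data) - 1) for v in data]

print(max(linear[:-1]))   # the 11 regular values are squeezed below ~0.05
print(max(ranked[:-1]))   # while ranks spread them up to ~0.91
```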

Supervised learning
Training a feed-forward multi-layer neural network means finding the parameter values Wi for which the error is minimal over a given dataset. This is an optimisation problem. A common way to solve this kind of problem is gradient-based methods. The main idea is to use the information given by the derivative of the error function.

Gradient descent (From Wikipedia, the free encyclopedia)


Gradient descent is an optimization algorithm that approaches a local
minimum of a function by taking steps proportional to the negative of the
gradient (or the approximate gradient) of the function at the current point. If
instead one takes steps proportional to the gradient, one approaches a local
maximum of that function; the procedure is then known as gradient ascent.
Gradient descent is also known as steepest descent, or the method of
steepest descent. When known as the latter, gradient descent should not be
confused with the method of steepest descent for approximating integrals.

Description of the method


We describe gradient descent here in terms of its equivalent (but opposite) cousin, gradient ascent. Gradient ascent is based on the observation that if the real-valued function F(x) is defined and differentiable in a neighborhood of a point a, then F increases fastest if one goes from a in the direction of the gradient of F at a, denoted grad F(a). It follows that, if

    b = a + gamma * grad F(a)

for gamma > 0 a small enough number, then F(a) <= F(b). With this observation in mind, one starts with a guess x_0 for a local maximum of F, and considers the sequence x_0, x_1, x_2, ... such that

    x_{n+1} = x_n + gamma_n * grad F(x_n),   n >= 0.

We have F(x_0) <= F(x_1) <= F(x_2) <= ..., so hopefully the sequence (x_n) converges to the desired local maximum. Note that the value of the step size gamma is allowed to change at every iteration.
Let us illustrate this process in the picture below. Here F is assumed to be
defined on the plane, and that its graph looks like a hill. The blue curves are the
contour lines, that is, the regions on which the value of F is constant. A red
arrow originating at a point shows the direction of the gradient at that point.
Note that the gradient at a point is perpendicular to the contour line going
through that point. We see that gradient ascent leads us to the top of the hill,
that is, to the point where the value of the function F is largest.

To have gradient descent go towards a local minimum, one needs to replace grad F(x) with -grad F(x), i.e. to step in the direction opposite to the gradient.
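A minimal sketch of this update rule on an invented one-dimensional function:

```python
# Illustrates x_{n+1} = x_n - gamma * F'(x_n) on F(x) = x^2 - 4x,
# whose minimum is at x = 2.

def dF(x):
    return 2 * x - 4          # derivative of x^2 - 4x

x, gamma = 6.0, 0.1           # initial guess and fixed step size
for _ in range(100):
    x = x - gamma * dF(x)     # step against the gradient

print(round(x, 4))            # -> 2.0
```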

Figure: the gradient descent method applied to an arbitrary function (contour-line and 3D views).
Comments
Note that gradient descent works in spaces of any number of dimensions, even
in infinite-dimensional ones.
Two weaknesses of gradient descent are:
1. The algorithm can take many iterations to converge towards a local
maximum/minimum, if the curvature in different directions is very
different.
2. Finding the optimal step size gamma at each step can be time-consuming. Conversely, using a fixed gamma can yield poor results. Methods based on Newton's method and inversion of the Hessian using conjugate gradient techniques are often a better alternative.
A more powerful algorithm is given by the BFGS method, which consists in calculating at every step a matrix by which the gradient vector is multiplied to go in a "better" direction, combined with a more sophisticated line search algorithm, to find the "best" value of gamma.

See also

Stochastic gradient descent


Newton's method
Optimization
Line search
Delta rule

Retrieved from "http://en.wikipedia.org/wiki/Gradient_descent"


Category: Optimization algorithms

Backpropagation (From Wikipedia, the free encyclopedia)


Backpropagation is a supervised learning technique used for training artificial
neural networks. It was first described by Paul Werbos in 1974, and further
developed by David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams in
1986.
It is most useful for feed-forward networks (networks that have no feedback, or
simply, that have no connections that loop). The term is an abbreviation for
"backwards propagation of errors". Backpropagation requires that the transfer
function used by the artificial neurons (or "nodes") be differentiable.
The summary of the technique is as follows:
1. Present a training sample to the neural network.
2. Compare the network's output to the desired output from that sample.
Calculate the error in each output neuron.
3. For each neuron, calculate what the output should have been, and a
scaling factor, how much lower or higher the output must be adjusted to
match the desired output. This is the local error.
4. Adjust the weights of each neuron to lower the local error.
5. Assign "blame" for the local error to neurons at the previous level, giving
greater responsibility to neurons connected by stronger weights.
6. Repeat the steps above on the neurons at the previous level, using each
one's "blame" as its error.
As the algorithm's name implies, the errors (and therefore the learning)
propagate backwards from the output nodes to the inner nodes. So technically
speaking, backpropagation is used to calculate the gradient of the error of the
network with respect to the network's modifiable weights. This gradient is
almost always then used in a simple stochastic gradient descent algorithm to
find weights that minimize the error. Often the term "backpropagation" is used
in a more general sense, to refer to the entire procedure encompassing both the
calculation of the gradient and its use in stochastic gradient descent.
Backpropagation usually allows quick convergence on satisfactory local minima
for error in the kind of networks to which it is suited.
It is important to note that backprop networks are necessarily multilayer (usually with one input, one hidden, and one output layer). In order for the hidden layer to serve any useful function, multilayer networks must have non-linear activation functions for the multiple layers. Non-linear activation functions that are commonly used include the logistic function, the softmax function, and the gaussian function.
The backpropagation algorithm for calculating a gradient has been rediscovered
a number of times, and is a special case of a more general technique called
automatic differentiation in the reverse accumulation mode.
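A minimal from-scratch sketch of the steps listed above, assuming a tiny one-hidden-layer tanh network trained on the AND function (the sizes, seed and learning rate are illustrative choices, not prescribed by the text):

```python
import math, random

random.seed(1)
H = 4                                   # hidden units
w1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(H)]  # incl. bias
w2 = [random.uniform(-1, 1) for _ in range(H + 1)]                  # incl. bias

samples = [((0, 0), 0.0), ((0, 1), 0.0), ((1, 0), 0.0), ((1, 1), 1.0)]

def forward(x):
    # Step 1: propagate the sample through hidden and output layers.
    h = [math.tanh(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w1]
    y = math.tanh(sum(wi * hi for wi, hi in zip(w2[:H], h)) + w2[H])
    return h, y

lr = 0.3
for _ in range(3000):
    for x, target in samples:
        h, y = forward(x)
        # Steps 2-3: output error and its local (pre-activation) form.
        delta_out = (y - target) * (1 - y * y)
        # Step 5: blame propagated backwards through the weights w2.
        delta_h = [delta_out * w2[j] * (1 - h[j] * h[j]) for j in range(H)]
        # Steps 4 & 6: gradient-descent weight updates on both layers.
        for j in range(H):
            w2[j] -= lr * delta_out * h[j]
        w2[H] -= lr * delta_out
        for j in range(H):
            w1[j][0] -= lr * delta_h[j] * x[0]
            w1[j][1] -= lr * delta_h[j] * x[1]
            w1[j][2] -= lr * delta_h[j]

for x, target in samples:
    print(x, round(forward(x)[1], 2))   # outputs approach the AND targets
```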

Other useful readings:

http://en.wikipedia.org/wiki/Non-parametric_methods

http://en.wikipedia.org/wiki/Expectation-Maximization

When the problem is not that simple... Neural cooking


Rules of thumb to use while building neural networks (a little theory is needed)

Local optima
Gradient methods may in some cases lead to an insufficient local optimum.
Schematic definition of the different kinds of optima:

A gradient search technique can get stuck in a local minimum.

That is why gradient techniques usually begin with a random initialisation of the weights Wi. So each training can virtually lead to a different NN solution. Some are better than others...
Several solutions exist:
http://en.wikipedia.org/wiki/Stochastic_gradient_descent

http://en.wikipedia.org/wiki/Simulated_annealing

http://en.wikipedia.org/wiki/Genetic_algorithms

...
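Another simple remedy, implicit in the random-initialisation remark above, is to restart the descent several times and keep the best result. A sketch on an invented double-well function F (its global minimum is near x = -1.47, with a poorer local minimum near x = 1.35):

```python
import random

def F(x):
    return x ** 4 - 4 * x ** 2 + x

def dF(x):
    return 4 * x ** 3 - 8 * x + 1

def descend(x, lr=0.01, steps=2000):
    # Plain gradient descent from a given starting point.
    for _ in range(steps):
        x -= lr * dF(x)
    return x

random.seed(0)
# Several random initialisations; each descent ends in one of the two wells.
results = [descend(random.uniform(-2, 2)) for _ in range(10)]
best = min(results, key=F)       # keep the deepest minimum found
print(round(best, 2), round(F(best), 2))
```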

Over-fitting/Under-fitting
Over-fitting

In some cases the NN learns an unwanted feature.

Here is an interesting story to introduce the over-fitting phenomenon.

Clever Hans (From Wikipedia, the free encyclopedia)

Clever Hans performs


Clever Hans (in German, der Kluge Hans) was a horse that was claimed to
have been able to perform arithmetic and other intellectual tasks.
In 1907, psychologist Oskar Pfungst demonstrated that the horse's claimed
abilities were due to an artifact in the research methodology, wherein the horse
was responding directly to involuntary clues in the body language of the human
trainer, who had the faculties to solve each problem. In honour of Pfungst's
study, the anomalous artifact has since been referred to as the Clever Hans
effect and has continued to be a recurrent problem with any research into
animal cognition.

Contents

1 Clever Hans and Pfungst's study


2 Clever Hans effect
3 Reference
4 See also
5 External links

Clever Hans and Pfungst's study

The horse, Hans, had been trained by a Mr. von Osten to tap out the answers to
arithmetic questions with its hoof. The answers to questions involving reading,
spelling and musical tones were converted to numbers, and the horse also
tapped out these numbers.
Seeking to ascertain a scientific basis or disproof for the claim, philosopher and
psychologist Carl Stumpf formed a panel of 13 prominent scientists, known as
the Hans Commission, to study the claims that Clever Hans could count. The
commission passed off the evaluation to Pfungst, who tested the basis for these
claimed abilities by:
1. Isolating horse and questioner from spectators, so no cues could come
from them
2. Using questioners other than the horse's master
3. By means of blinders, varying whether the horse could see the questioner
4. Varying whether the questioner knew the answer to the question in
advance.
Using a substantial number of trials, Pfungst found that the horse could get the
correct answer even if von Osten himself did not ask the questions, ruling out
the possibility of fraud. However, the horse got the right answer only when the
questioner knew what the answer was, and the horse could see the questioner.
He then proceeded to examine the behaviour of the questioner in detail, and
showed that as the horse's taps approached the right answer, the questioner's
posture and facial expression changed in ways that were consistent with an
increase in tension, which was released when the horse made the final,
"correct" tap. This provided a cue that the horse could use to tell it to stop
tapping.
The social communication systems of horses probably depend on the detection
of small postural changes, and this may be why Hans so easily picked up on the
cues given by von Osten (who seems to have been entirely unaware that he was
providing such cues). However, the capacity to detect such cues is not confined
to horses. Pfungst proceeded to test the hypothesis that such cues would be
discernible, by carrying out laboratory tests in which he played the part of the
horse, and human participants sent him questions to which he gave numerical
answers by tapping. He found that 90% of participants gave sufficient cues for
him to get a correct answer.

Clever Hans effect

The risk of Clever Hans effects is one strong reason why comparative
psychologists normally test animals in isolated apparatus, without interaction
with them. However this creates problems of its own, because many of the most
interesting phenomena in animal cognition are only likely to be demonstrated in
a social context, and in order to train and demonstrate them, it is necessary to
build up a social relationship between trainer and animal. This point of view has
been strongly argued by Irene Pepperberg in relation to her studies of parrots,
and by Alan and Beatrice Gardner in their study of the chimpanzee Washoe. If
the results of such studies are to gain universal acceptance, it is necessary to
find some way of testing the animals' achievements which eliminates the risk of
Clever Hans effects. However, simply removing the trainer from the scene may
not be an appropriate strategy, because where the social relationship between
trainer and subject is strong, the removal of the trainer may produce emotional
responses preventing the subject from performing. It is therefore necessary to
devise procedures where none of those present knows what the animal's likely
response may be.
For an example of an experimental protocol designed to overcome the Clever
Hans effect, see Rico (Border Collie).
As Pfungst's final experiment makes clear, Clever Hans effects are quite as
likely to occur in experiments with humans as in experiments with other
animals. For this reason, care is often taken in fields such as perception,
cognitive psychology, and social psychology to make experiments double-blind,
meaning that neither the experimenter nor the subject knows what condition
the subject is in, and thus what his or her responses are predicted to be.
Another way in which Clever Hans effects are avoided is by replacing the
experimenter with a computer, which can deliver standardized instructions and
record responses without giving clues.
Reference

Pfungst, O. (1911). Clever Hans (The horse of Mr. Von Osten): A contribution to
experimental animal and human psychology (Trans. C. L. Rahn). New York:
Henry Holt. (Originally published in German, 1907).
The horse learnt something which is, in a way, far more complex than basic arithmetic.
A NN can do that too, when some unwanted statistical bias lies in the training data.
Building a Hans NN, doing something different than expected without knowing it, can be a very disappointing experience...

Over-fitting can occur when the model (NN) has too many parameters relative to the size of the training dataset, but it's not the only case...

Overfitting example on sine function approximation:
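The effect can be reproduced with a simple stand-in model; here a high-degree polynomial plays the role of an oversized NN on noisy sine data (the degrees, noise level and grids are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 2 * np.pi, 20)
y_train = np.sin(x_train) + rng.normal(scale=0.25, size=x_train.size)

x_test = np.linspace(0.05, 2 * np.pi - 0.05, 200)   # unseen points
y_test = np.sin(x_test)                             # noise-free truth

def scale(x):
    # Work in [-1, 1] to keep the polynomial fit well conditioned.
    return x / np.pi - 1.0

results = {}
for degree in (5, 15):   # modest model vs over-parameterised model
    coeffs = np.polyfit(scale(x_train), y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, scale(x_train)) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, scale(x_test)) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
    print(degree, round(train_mse, 4), round(test_mse, 4))
```

The degree-15 model reaches a lower training error because it chases the noise, which is exactly why its error on unseen points deteriorates.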

Solutions:

Ensure you avoid any unwanted statistical bias.

Try to use the smallest NN possible for a given problem (limiting the number of parameters of the model).

Use a bigger training dataset.

While training, use a different dataset (the test set) to detect when overfitting occurs: basically, when the mean error on the test set (not used to train the NN) starts increasing. This adds a difficulty to the art of splitting datasets. A good training procedure uses 3 different datasets:


a training set: the one used to estimate the NN weights/parameters.

a testing set: to detect and avoid overfitting.

a validation set: to measure the actual performance of the NN.

Studying the error curves on the train and test sets can be helpful to find the best
moment to stop training.

The error curves on the train and test sets are more or less the same at the beginning.
At some point the training error continues to fall but the error on the test
set doesn't, and even gets bigger.
The divergence point of these two curves is the best moment to stop training,
because it is when the NN generalises best. After that point the NN specialises
too much on the training set and will be unable to produce correct output on
new data.
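This stopping rule can be sketched as a small function that scans the test-error curve and reports the epoch where it stopped improving (the `patience` parameter, i.e. how many worsening epochs are tolerated before giving up, is an assumption of this sketch):

```python
def best_stopping_epoch(test_errors, patience=2):
    """Return the epoch with the lowest test error, stopping the scan once
    the test error has not improved for `patience` consecutive epochs."""
    best_error = float("inf")
    best_epoch = 0
    worse_count = 0
    for epoch, error in enumerate(test_errors):
        if error < best_error:              # test error still falling
            best_error, best_epoch, worse_count = error, epoch, 0
        else:                               # divergence: test error rising
            worse_count += 1
            if worse_count >= patience:
                break
    return best_epoch

# The training error would keep falling; this test error turns up at epoch 3.
print(best_stopping_epoch([1.0, 0.8, 0.6, 0.5, 0.55, 0.6, 0.7]))  # 3
```

The patience mechanism avoids stopping on a single noisy bump of the test error.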
Under-fitting

The other side of the fitting problem is under-fitting, when the model constraints are
too strong relative to the statistics of the data.
The example below shows the underfitted output of a sine function
approximation.

Solutions are :

Raise the number of neurons (and adapt the topology; a single hidden
layer is the most common topology but not always the best...)

Make sure that your problem can be solved the way you presented it to
the NN. Maybe the input data is not (that) relevant to the problem...

Unsupervised learning
Unsupervised algorithms are in a way simpler than supervised ones :

There is no need to handle labels (the class information of each data
entry).

The mathematical background needed to understand and use them can be quite low.

The problem is that they are quite counterintuitive and may need more
abstraction capabilities.
First of all, the question is: how can a machine learn something if no one
tells it what is expected ?
Let's see the K-means algorithm as an introduction ...

K-means algorithm (From Wikipedia, the free encyclopedia)


The K-means algorithm is an algorithm to cluster objects based on attributes
into k partitions. It is a variant of the expectation-maximization algorithm in
which the goal is to determine the k means of data generated from Gaussian
distributions. It assumes that the object attributes form a vector space. The
objective it tries to achieve is to minimize total intra-cluster variance, or, the
function

V = sum_{i=1..k} sum_{x_j in S_i} |x_j - mu_i|^2

where there are k clusters S_i, i = 1,2,...,k and mu_i is the centroid or mean point of
all the points x_j in S_i.
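This objective can be computed directly; here is a minimal sketch for 1D points (any dimension works the same way, with vector distances instead of scalar ones):

```python
def intra_cluster_variance(clusters):
    """V = sum over clusters S_i of squared distances to the cluster mean mu_i."""
    v = 0.0
    for cluster in clusters:
        mu = sum(cluster) / len(cluster)        # centroid of this cluster
        v += sum((x - mu) ** 2 for x in cluster)
    return v

print(intra_cluster_variance([[0.0, 2.0], [10.0, 12.0]]))  # 4.0
```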

The algorithm starts by partitioning the input points into k initial sets, either at
random or using some heuristic data. It then calculates the mean point, or
centroid, of each set. It constructs a new partition by associating each point
with the closest centroid. Then the centroids are recalculated for the new
clusters, and the algorithm is repeated by alternating these two steps
until convergence, which is obtained when the points no longer switch clusters
(or alternatively the centroids no longer change).
The algorithm has remained extremely popular because it converges extremely
quickly in practice. In fact, many have observed that the number of iterations is
typically much less than the number of points. Recently, however, Arthur and
Vassilvitskii showed that there exist certain point sets on which k-means takes
superpolynomial time to converge.

In terms of performance the algorithm is not guaranteed to return a global
optimum. The quality of the final solution depends largely on the initial set of
clusters, and may, in practice, be much poorer than the global optimum. Since
the algorithm is extremely fast, a common method is to run the algorithm
several times and return the best clustering found.
Another main drawback of the algorithm is that it has to be told the number of
clusters (i.e. k) to find. If the data is not naturally clustered, you get some
strange results. Also, the algorithm works well only when spherical clusters are
naturally available in the data.

References
J. B. MacQueen (1967): "Some Methods for Classification and Analysis of
Multivariate Observations", Proceedings of 5th Berkeley Symposium on
Mathematical Statistics and Probability, Berkeley, University of California Press,
1:281-297.
D. Arthur, S. Vassilvitskii (2006): "How Slow is the k-means Method?", Proceedings of the
2006 Symposium on Computational Geometry (SoCG).
Very nice applet showing k-means algorithm running :


http://www.leet.it/home/lale/joomla/component/option,com_wrapper/Itemid,50/
K-means algorithm :
1. choose the supposed number of clusters k
2. set randomly the position of each centroid/mean
3. construct a new partition by associating each input point with the closest
centroid/mean
4. update the centroid/mean coordinates (mean of all data points belonging to the
cluster)
5. did the partition change ?
Yes -> go to 3
No -> end of the algorithm.
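The five steps above can be sketched in plain Python; 1D data is used here for brevity, and initialising the centroids by sampling k input points is one common heuristic among others:

```python
import random

def kmeans(points, k, seed=0):
    """Plain k-means on 1D points: assign each point to the closest
    centroid, recompute centroids, repeat until the partition is stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)           # steps 1-2: k initial means
    assignment = None
    while True:
        # step 3: associate each point with its closest centroid
        new_assignment = [min(range(k), key=lambda i: abs(p - centroids[i]))
                          for p in points]
        if new_assignment == assignment:        # step 5: partition stable?
            return centroids
        assignment = new_assignment
        # step 4: recompute each centroid as the mean of its cluster
        for i in range(k):
            cluster = [p for p, a in zip(points, assignment) if a == i]
            if cluster:                         # keep the old mean if a cluster is empty
                centroids[i] = sum(cluster) / len(cluster)

data = [0.0, 0.1, 0.2, 9.9, 10.0, 10.1]
print(sorted(kmeans(data, 2)))                  # roughly [0.1, 10.0]
```

On this toy data the two centroids converge to the means of the two obvious groups whatever the initial sample.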
Since this algorithm minimizes intra-cluster variance, we can expect, for an easy
k-class problem, that our k centroids are the centers of the classes. Then
we just need to label a few data points in each cluster to test this hypothesis and give a
class label to each cluster.
This is an idealistic case, but it helps to understand how you can do
unsupervised training. Remember we didn't use any label information while
training. Only very few labels (not enough for supervised training) are used,
and only to make the class/cluster matching (ideally we only need k labels, one
per class).

Kohonen network
Kohonen networks, also known as Self-Organizing Maps (SOM), are based on the k-means
algorithm. The additional feature is that each centroid (mean) is located on a
map such that topology is conserved. This means that centroids that are close (in
the problem space) should also be close on the map.
One can choose any type of map (in dimension and topology); this is a way to
represent data of any dimension (the problem space) on a selected map (ideally
2D, which is convenient for humans)...
SOM help to represent high-dimensional data in a low-dimensional
space while preserving topological information.

Example : a 2D Kohonen network map holding 3D input neurons. When trained,
each neuron will hold a 3D representative vector (a centroid, like in the k-means
algorithm); the position of each neuron on the map should respect the input
topology (which is 3D in that case). So this map should be able to represent 3D
information on a 2D map, if correctly trained.

One important feature of SOM is that neurons are set on a map where location
is important.
In a SOM, the neighborhood is defined as a mathematical function whose parameters
are one neuron of the map and a neighborhood level; the output is the list of
neurons belonging to this neighborhood. (The higher the neighborhood level,
the higher the number of neurons in the neighborhood.)
Neighborhood example on a square 2D map :
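As an illustration, one possible neighborhood function for a square 2D map is the square "box" neighborhood below; other shapes (diamond, Gaussian-weighted, ...) are equally valid choices:

```python
def neighborhood(center, level, width, height):
    """Return all map coordinates within `level` grid steps of `center`
    on a width x height square map (clipped at the map borders)."""
    cx, cy = center
    return [(x, y)
            for x in range(max(0, cx - level), min(width, cx + level + 1))
            for y in range(max(0, cy - level), min(height, cy + level + 1))]

print(len(neighborhood((2, 2), 1, 5, 5)))   # 9: the neuron plus its 8 neighbours
print(len(neighborhood((0, 0), 1, 5, 5)))   # 4: clipped at the corner
```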


All the input data are presented to the neurons; the closest neuron and its
neighborhood are selected to be updated toward the input data sample. The
neighborhood is slowly decreased along the iterations...
Note : a lot of different topologies are possible. This choice is often the core of the
problem, as it defines the properties of the map...
SOM training algorithm :
1. initialise the neuron map
2. loop over a decreasing neighborhood schedule
3. loop over all input data
4. search the closest neuron (n) from the current input sample (S)
5. move the closest neuron and its neighborhood even closer to the
input data sample :
W(t+1) = W(t) + a * (S - W(t)), where a is a
learning factor and S is the current sample.
6. end of loop over input data
7. end of neighborhood decreasing loop
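The training loop above can be sketched for the simplest possible case, a 1D chain of neurons with 1D inputs. The schedules chosen below for the learning factor `a` and the neighborhood radius are illustrative assumptions; real applications tune them carefully:

```python
import random

def train_som_1d(data, n_neurons=10, epochs=20):
    """Train a 1D chain SOM on 1D input samples, following the steps above:
    for each sample, find the closest neuron and pull it and its neighborhood
    toward the sample with W(t+1) = W(t) + a * (S - W(t)),
    while the neighborhood radius and the learning factor a decrease."""
    rng = random.Random(0)
    weights = [rng.random() for _ in range(n_neurons)]   # 1. initialise the map
    for epoch in range(epochs):                          # 2. shrinking schedule
        progress = epoch / epochs
        radius = int((n_neurons // 2) * (1 - progress))  # neighborhood level
        a = 0.5 * (1 - progress) + 0.01                  # learning factor
        for s in data:                                   # 3. loop over inputs
            # 4. closest neuron (best matching unit)
            bmu = min(range(n_neurons), key=lambda i: abs(weights[i] - s))
            # 5. update the BMU and its neighborhood toward the sample
            for i in range(max(0, bmu - radius), min(n_neurons, bmu + radius + 1)):
                weights[i] += a * (s - weights[i])
    return weights

rng = random.Random(1)
samples = [rng.random() for _ in range(200)]
weights = train_som_1d(samples)
```

After training, neighbouring neurons of the chain tend to hold close values, which is the topology-preservation property described above.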


The training phase can be viewed as a deformation of the neural map toward
the shape of the input data. A large neighborhood is like a quite rigid map; a small
neighborhood is like a soft map. So the map goes from a rigid to a soft state
to gradually fit the input data.
Here is an example of fitting 2D uniform random data (input space) into a 2D
SOM :

How to read the map : the 2D position on the figure comes from the input data
space; the topology information is in the grid connections.
The resulting straight grid indicates that we are modeling a 2D square into a 2D
square (preserving topology)...

The following example is more interesting... The input data is still 2D here but
the topology is different. The input data is displayed in a 2D cross shape (which is a
different topology from the square); the Kohonen map is still a 2D square.


Kohonen maps are inspired from the localisation neurons found in biology. You
will see in this example how a SOM can help to find one's way in the data.
Imagine that you need to operate a robot that has to move inside the input data
like in a maze (inside the cross shape). If you have to go from A to B, the simple
algorithm is to compute the A->B vector direction and use it to move the
robot. This would lead the robot out of the cross (or into the wall of the maze).

The Kohonen map can help us to find the best way; this is how to use it :

find the neurons a & b corresponding to the points A & B.

compute the a->b vector on the map (using neuron map coordinates).

follow the map vector from neuron a to neuron b.

each neuron on the road holds a point/vector (in the input space)
showing, step by step, the way from A to B in the input space.

When applying this algorithm to the cross-shape figure you will see that your
robot carefully avoids the walls.
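This path-following idea can be sketched in a strongly simplified setting: a 1D chain map whose neurons hold 1D input-space positions (the map, the weights and the points A and B below are made-up illustrative values):

```python
def som_path(weights, A, B):
    """Walk on the map from the neuron closest to A to the neuron closest
    to B, returning the input-space point held by each neuron on the road."""
    def bmu(p):
        # best matching unit: index of the neuron whose weight is closest to p
        return min(range(len(weights)), key=lambda i: abs(weights[i] - p))
    a, b = bmu(A), bmu(B)
    step = 1 if b >= a else -1
    return [weights[i] for i in range(a, b + step, step)]

# Made-up trained map: neurons 0-2 cover one corridor, neurons 3-4 another.
weights = [0.0, 1.0, 2.0, 10.0, 11.0]
print(som_path(weights, 0.2, 10.4))   # [0.0, 1.0, 2.0, 10.0]
```

The returned waypoints follow the map's chain, so they stay inside the region the map was trained on instead of cutting straight from A to B.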
Note : the convergence of the map must be handled carefully... The cactus
example below would need some tuning of the training parameters...


A SOM can be viewed as a minimum-deformation mapping from one
topology to another.
Remark : the following figure shows that mapping an N-dimensional set of data
onto a 1-dimensional map can lead to a solution to the traveling salesman
problem. Note that a circular topology would suit better than a linear one.

A few web demos :

2D SOM http://www-ti.informatik.uni-tuebingen.de/~goeppert/KohonenApp/KohonenApp.html

an other 2D SOM
http://www.cs.utexas.edu/users/yschoe/java/javasom/Base.html

color SOM http://davis.wpi.edu/~matt/courses/soms/applet.html

traveling salesman problem http://www.ice.nuie.nagoya-u.ac.jp/~l94334/bio/tsp/tsp.html

OCR http://www.ice.nuie.nagoya-u.ac.jp/~l94334/bio/tsp/tsp.html

WEBSOM
http://websom.hut.fi/websom/milliondemo/html/root.html

topology preservation demo
http://www.cis.hut.fi/research/javasomdemo/demo2.html
The wide field of neural networks

This course and document only talk about the most basic and popular neural
networks. There is a very wide variety of neural networks solving very different
problems. Here is a non-exhaustive list of other kinds of neural computing
techniques.

Reinforcement learning : an intermediate way between supervised and
unsupervised learning. The network is not explicitly given the expected
output but rather a reward or punishment according to its output.

Recurrent networks (better suited to some forecasting problems), where the
output at time t is fed back as part of the NN input at time (t+1), in order to model
the influence of time on the model.

Time-delayed NN are used to handle time/space independence (detecting
an event no matter where or when it appears).

Radial Basis Functions : the sigmoid function is not always the best choice to
model some kinds of problems; RBF can be an alternative.

Hybrid methods (supervised and unsupervised) : a first-stage Kohonen
layer can lower the problem dimension.
Advantages :

Lower input dimension (2D coordinates of the activated Kohonen
neurons) (=> lower number of model parameters).

Handles time series in an N-dimensional input space : accumulate
activated SOM neurons with a time decay and feed the map into a
feed-forward NN.

For each type of NN there are several training procedures, which impact
its characteristics.
...

This is just a small list of known techniques; derivatives, new ones and
combinations are frequently invented...
A few web demos :

reinforcement learning :

robot arm http://www.fe.dis.titech.ac.jp/~gen/robot/robodemo.html

cat & mouse http://iridia.ulb.ac.be/~fvandenb/qlearning/qlearning.html

http://www.cse.unsw.edu.au/~cs9417ml/RL1/applet.html

Hamming associative memory http://neuron.eng.wayne.edu/Hamming/voting.html

Hopfield network http://suhep.phy.syr.edu/courses/modules/MM/sim/hopfield.html

A word on algorithmic complexity


Training is CPU-consuming, but ...
The training phase is expensive whatever neural technique is used. For most
applications training is done only once, so it's a transient problem.
Using a trained neural network is quite straightforward and generally does not require
high CPU usage.
So with a usual computer you can address a lot of common problems.
But ...
But ...
The algorithmic complexity of training algorithms can easily be above O(n*N*i)
where

n is the number of data samples.
The training data size is usually as big as possible.

N is the number of neurons.
When modeling a big set of complex data, it is quite usual to use a big
network.

i is the number of iteration steps (from hundreds to thousands).
If the n and N parameters are high, it is quite frequent that i needs to
be high too.

So the processing power can be an issue if some of the n, N or i parameters are
really high. It becomes a serious issue if real-time answers are requested from
such a process.

Solutions

Divide and conquer strategy :

Since building and training a NN needs a lot of trials, you can easily split them
across different computers on a standard network.

A grid computer or a cluster, with appropriately parallelised NN
software, can efficiently split the training across the processing units.

NN are not suited to classical CPUs, nor even to math processors. They do
only very simple processing. They are parallel by nature; remember
the biological inspiration.

Today's processors are not suited to this kind of processing. Specialized
hardware exists to fit that processing need.

Specialized hardware can outperform clusters or grid computers
because of the parallel nature of NN.

Specialized hardware is less expensive and smaller than clusters.
That often counts on real-life problems.

Illustration 1: mix of real (wet) neurons and an electronic chip
Illustration 2: Zero Instruction Set Chip

NN are not magic !

NN are not magic : if the information you are trying to discover is not in
the data, a NN won't find it. You should have evidence or a strong
intuition that the information is in your data. Tools for this can be :

2D marginal projections.

expert information.

We have seen that neural networks can solve several kinds of problems :

classification

forecasting

shortest way or traveling salesman problem

robotics/automation

...

but a NN is not the best tool in all cases. Well-known alternatives are :

Markov chains

Regression models (better when there is a priori knowledge about
the function, like physical or statistical laws).

K-nearest neighbors

Adaptive filtering

...

Some of these techniques can be seen as variations of NN, but
they were discovered before the NN analogy.

Solving a problem with NN needs :


training/testing data

a neural network :

a topology (number and disposition of neurons in layers)

Topology might be an issue; there are very few basic rules and
little knowledge leading to the right solution every time. The
usual way is to try a lot of different topologies and keep the
best one. The only basic rules are :

the more complex the problem, the more neurons
are required.

the more neurons, the more data you need to train
them.

a training procedure
Training can be viewed as an optimisation problem; as in any
problem of this kind, local optima can be an issue.

NN are suited to solving continuous problems : close input patterns
should correspond to close output patterns. Not all real-life problems are
of that kind.

Multimedia data (sound, images, moving pictures) generally need a first
stage of data processing (a Fourier/wavelet transform is usual).

Important warning : NN are F@#& magic in the Murphy's-law
context.
If some undetected bug adds noise to the training set, the NN can get through it
with more or less success. If no one detects the bug, you will get poor results
and believe that the problem is harder than expected (which might not be the
case...).

How to actually create and use neural networks

There exist today a lot of ways to build a NN :

Libraries

proprietary libraries : easy but expensive.

free software (open source) libraries : easy and free.
Under the LGPL licence or alike, which you can even use in proprietary
software!
It might need a little coding, but that's an engineer's job, no ? ;-)

Modelers using a graphical interface

proprietary : nice and expensive.

free software (open source) : nice and free
(installation might sometimes need some patience).

Programming it yourself from the bottom up (needs good engineering skills).


When NN do better than experts

On some problems a NN can do better than a human expert.
Why ?

The number of parameters is too high to be processed by a human being.

A human expert might be misled by false a priori knowledge.

The problem can be too easy and repetitive to be done properly by a human.

The problem and parameters might be counterintuitive.

example : http://ai.bpa.arizona.edu/papers/dog93/dog93.html
This ability leads to some human problems :

fear of losing one's job : NN can raise the old fear of being replaced by a
machine.

expert resistance : example : a computer science engineer shouldn't be able to
solve a biological problem better/quicker than a PhD expert in biology,
physics, finance, ...

Because of the above phenomena, finding NN problems to solve is a hard task;
you can't really expect experts to bring you NN projects.

Deeper inside : Exercises

You now have all the information you need to build a NN yourself. All you need
is to walk along the shore (remember the fractal approach); the navigation points
are the paragraphs of the deeper inside sections of this document. You will follow
these guidelines to help you build your datasets, train the network and
test it.
The remaining resources you need are a NN programming library and datasets.
NN libraries can be found in the software section of the bibliography. Datasets
are harder to find, as explained in the Datasets cost section. Some public
datasets are listed below.

Datasets
web :

http://kdd.ics.uci.edu/

protein localisation ftp://ftp.ics.uci.edu/pub/machine-learning-databases/ecoli/

spam detection ftp://ftp.ics.uci.edu/pub/machine-learning-databases/spambase/

satellite images : http://kdd.ics.uci.edu/databases/covertype/

ad detection : ftp://ftp.ics.uci.edu/pub/machine-learning-databases/internet_ads/


promoters : ftp://ftp.ics.uci.edu/pub/machine-learning-databases/molecular-biology/

optical digits : ftp://ftp.ics.uci.edu/pub/machine-learning-databases/optdigits/

KDD98 cup (mailing response) :
http://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html

KDD99 cup (internet intrusion detection) :
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

EEG (alcoholic | non-alcoholic) : http://kdd.ics.uci.edu/databases/eeg/eeg.html

collection of machine learning datasets (proben1.tar.gz) :
ftp://ftp.ira.uka.de/pub/neuron

...

Bibliography

Format is : title, authors, publisher.
comment.

Réseaux de neurones, méthodologies et applications, edited by Gérard Dreyfus, Eyrolles.
Great book.

Pattern Classification, Duda, Hart, Stork, Wiley-Interscience.
Good book on pattern classification (NN is only one chapter).

Neural Networks: Algorithms, Applications, and Programming Techniques, James A. Freeman, David M. Skapura, Addison-Wesley.
Quite an old book with a lot of illustrations and even code samples (in Pascal?).
Can be a good complement to this course.

Cybernétique des Réseaux Neuronaux, Alain Faure, Hermès.
Real neuron biology described by engineers; very little about artificial
neurons.

Les Réseaux neuromimétiques, Jean-François Jodouin, Hermès.
Clear explanations, nice illustrations, even code samples; it's a good book.


Web sites

http://www.wikipedia.org/ Wikipedia (a lot of paragraphs come directly
from Wikipedia)

http://leenissen.dk/fann/

http://www.google.com/Top/Computers/Artificial_Intelligence/Neural_Net
works/Companies/

http://www.google.com/Top/Computers/Artificial_Intelligence/Conferences
_and_Events/

...

Software
Open source

FANN http://leenissen.dk/fann/

SNNS http://www-ra.informatik.uni-tuebingen.de/SNNS/

scilab http://www.scilab.org
http://www.scilab.org/contrib/displayContribution.php?fileID=166

...

Proprietary

Mathematica
http://www.wolfram.com/products/applications/neuralnetworks/

matlab http://www.mathworks.com/products/neuralnet/

...
