6 Neural Networks

Regression,
Artificial Neural Networks

16/03/2016
Regression
– Supervised learning: based on
training examples, learn a model
that works well on previously
unseen examples.
– Regression: forecasting real values

Regression
Training dataset: $\{x_i, r_i\}$, $r_i \in \mathbb{R}$
Evaluation metric:
"least squared error"
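Linear regression
$g(x) = w_1 x + w_0$
The gradient of the squared error is 0 if
$$\sum_i r_i = N w_0 + w_1 \sum_i x_i, \qquad \sum_i r_i x_i = w_0 \sum_i x_i + w_1 \sum_i x_i^2$$
where N is the number of training examples.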
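The zero-gradient condition of the squared error can be solved in closed form for a single input variable. The following is a minimal sketch under that assumption; the function name `fit_line` and the toy data are illustrative and do not come from the slides.

```python
def fit_line(xs, rs):
    """Return (w1, w0) minimising sum((r - (w1 * x + w0)) ** 2)."""
    n = len(xs)
    sx = sum(xs)
    sr = sum(rs)
    sxx = sum(x * x for x in xs)
    sxr = sum(x * r for x, r in zip(xs, rs))
    # Setting both partial derivatives of the squared error to zero gives
    # two linear equations in w1 and w0; these are their solutions.
    w1 = (n * sxr - sx * sr) / (n * sxx - sx * sx)
    w0 = (sr - w1 * sx) / n
    return w1, w0

# Toy data generated from r = 2x + 1 (hypothetical, for illustration only).
xs = [0.0, 1.0, 2.0, 3.0]
rs = [1.0, 3.0, 5.0, 7.0]
w1, w0 = fit_line(xs, rs)
```

Because the toy data lie exactly on a line, the fit recovers the slope and intercept used to generate them.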
Regression variants
+MLE →
– Bayes
– k nearest neighbours
• mean or
• distance weighted average
– Decision tree
• or various linear models on the leaves

Regression SVM
Artificial Neural Networks
Artificial neural networks
• Motivation: simulating the information
processing mechanisms of the nervous
system (human brain)
• Structure: a huge number of densely
connected, mutually operating
processing units (neurons)
• It learns from experience (training
instances)
Some neurobiology…
• Neurons have many
inputs and a single
output
• The output is either
excited or not
• The inputs from other
neurons determine
whether the neuron
fires
• Each input synapse has
a weight
A neuron in maths
Weighted sum of the inputs. If the sum is above
a threshold T, the neuron fires (outputs 1); otherwise its output is 0
or -1.
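A threshold neuron of this kind can be sketched in a few lines. This is only an illustration: it computes a weighted sum of the inputs, as in the network function used later in these slides, and the name `threshold_neuron`, the `off_value` parameter and the weights are assumptions.

```python
def threshold_neuron(inputs, weights, threshold, off_value=0):
    """Fire (return 1) when the weighted input exceeds the threshold."""
    weighted = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted > threshold else off_value

# Hypothetical weights: two excitatory inputs and one inhibitory input.
weights = [0.6, 0.6, -1.0]
fires = threshold_neuron([1, 1, 0], weights, threshold=1.0)
silent = threshold_neuron([1, 1, 1], weights, threshold=1.0)
```

Passing `off_value=-1` gives the variant whose resting output is -1 instead of 0.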
Statistics about the human brain
• #neurons: ~ $10^{11}$
• Avg. #connections per neuron: $10^{4}$
• Signal sending time: $10^{-3}$ sec
• Face recognition: $10^{-1}$ sec

Motivation
(machine learning point of view)
• Goal: non-linear classification
– Linear machines are not satisfactory in
several real-world situations
– Which non-linear function family to choose?
– Neural networks: the latent non-linear patterns
are learnt by the machine
Perceptron
Multilayer perceptron =
Neural Network
Different
representation at
various layers
Multilayer perceptron
Feedforward neural networks
• Connection only to the next layer
• The weights of the connections
(between two layers) can be changed
• Activation functions are used to
calculate whether the neuron fires
• Three-layer network:
• Input layer
• Hidden layer
• Output layer
Network function
• The network function of neuron j:

$$\mathrm{net}_j = \sum_{i=1}^{d} x_i w_{ji} + w_{j0} = \sum_{i=0}^{d} x_i w_{ji} = \mathbf{w}_j^t \mathbf{x},$$
where i is the index of the input neurons, and $w_{ji}$ is the
weight between neurons i and j.
• wj0 is the bias
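The two ways of writing the network function (explicit bias, or a sum starting from i = 0) give the same value once a constant input x_0 = 1 is prepended. A minimal sketch, assuming the bias is stored as the first element of the weight list; the function names and numbers are illustrative.

```python
def net_input(x, w_j):
    """net_j = sum_{i=1..d} x_i * w_ji + w_j0, with w_j[0] holding the bias w_j0."""
    return sum(xi * wi for xi, wi in zip(x, w_j[1:])) + w_j[0]

def net_input_augmented(x, w_j):
    """Same value, written as the dot product w_j^t x with x_0 = 1 prepended."""
    x_aug = [1.0] + list(x)
    return sum(xi * wi for xi, wi in zip(x_aug, w_j))

x = [2.0, -1.0]          # hypothetical input, d = 2
w_j = [0.5, 1.0, 3.0]    # [w_j0, w_j1, w_j2], hypothetical weights
```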

Activation function
The activation function is a non-linear function
of the network value:
$y_j = f(\mathrm{net}_j)$
(if it were linear, the whole network would be linear)
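The sign activation function:
$$f(\mathrm{net}) = \mathrm{sgn}(\mathrm{net}) = \begin{cases} +1 & \text{if } \mathrm{net} \ge 0 \\ -1 & \text{if } \mathrm{net} < 0 \end{cases}$$
[Figure: step function, output $o_i$ against $\mathrm{net}_j$, switching at threshold $T_j$]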
Differentiable activation
functions
• Enables gradient descent-based learning
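• The sigmoid function:
$$f(\mathrm{net}_j) = \frac{1}{1 + e^{-(\mathrm{net}_j - T_j)}}$$
[Figure: sigmoid curve between 0 and 1 against $\mathrm{net}_j$, centred at threshold $T_j$]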
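The sigmoid and its slope can be sketched as follows. The derivative identity f'(net) = f(net)(1 - f(net)) is a standard property of the sigmoid rather than something stated on the slide, and the function names are illustrative.

```python
import math

def sigmoid(net, threshold=0.0):
    """f(net) = 1 / (1 + exp(-(net - T)))."""
    return 1.0 / (1.0 + math.exp(-(net - threshold)))

def sigmoid_derivative(net, threshold=0.0):
    """f'(net) = f(net) * (1 - f(net)), the slope used by gradient descent."""
    f = sigmoid(net, threshold)
    return f * (1.0 - f)
```

At net = T the function value is 0.5 and the slope reaches its maximum of 0.25, which is why learning is fastest for neurons close to their threshold.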
Output layer
$$\mathrm{net}_k = \sum_{j=1}^{n_H} y_j w_{kj} + w_{k0} = \sum_{j=0}^{n_H} y_j w_{kj} = \mathbf{w}_k^t \mathbf{y}$$
where k is the index on the output layer and $n_H$ is the
number of hidden neurons
• Binary classification: sign function

• Multi-class classification: a neuron for
each of the classes, the argmax is
predicted (discriminant function)
• Regression: linear transformation
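– The hidden unit $y_1$ calculates $x_1$ OR $x_2$:
$x_1 + x_2 + 0.5 \ge 0 \Rightarrow y_1 = +1$
$x_1 + x_2 + 0.5 < 0 \Rightarrow y_1 = -1$
– $y_2$ represents $x_1$ AND $x_2$:
$x_1 + x_2 - 1.5 \ge 0 \Rightarrow y_2 = +1$
$x_1 + x_2 - 1.5 < 0 \Rightarrow y_2 = -1$
– The output neuron: $z_1 = 0.7 y_1 - 0.4 y_2 - 1$,
$\mathrm{sgn}(z_1)$ is 1 iff $y_1 = 1$, $y_2 = -1$:
($x_1$ OR $x_2$) AND NOT ($x_1$ AND $x_2$)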
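The weights of this small network can be checked by enumerating all four inputs. The sketch below assumes that the inputs are encoded as -1 (false) and +1 (true), which the slide does not state explicitly; the function names are illustrative.

```python
def sgn(net):
    """Sign activation: +1 if net >= 0, otherwise -1."""
    return 1 if net >= 0 else -1

def xor_network(x1, x2):
    y1 = sgn(x1 + x2 + 0.5)         # hidden unit 1: x1 OR x2
    y2 = sgn(x1 + x2 - 1.5)         # hidden unit 2: x1 AND x2
    z1 = 0.7 * y1 - 0.4 * y2 - 1    # positive only when y1 = +1 and y2 = -1
    return sgn(z1)

# Truth table over inputs encoded as -1 / +1 (assumed encoding).
table = {(x1, x2): xor_network(x1, x2) for x1 in (-1, 1) for x2 in (-1, 1)}
```

The output is +1 exactly when the two inputs differ, a function that no single linear threshold unit can represent.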

General (three-layer) feedforward
network (c output units)
$$g_k(\mathbf{x}) = z_k = f_k\left(\sum_{j=1}^{n_H} w_{kj}\, f_j\left(\sum_{i=1}^{d} w_{ji} x_i + w_{j0}\right) + w_{k0}\right), \qquad k = 1, \ldots, c$$
– The hidden units with their activation functions can
express non-linear functions
– The activation functions can differ between neurons
(but in practice the same one is used everywhere)
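The nested sums of the three-layer network function translate directly into two list comprehensions. This is a sketch under assumptions: each weight row stores the bias first, the hidden units use a sigmoid, and the output is linear (the regression case); all names and numbers are illustrative.

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def identity(net):
    return net

def forward(x, w_hidden, w_output, f_hidden=sigmoid, f_output=identity):
    """Compute z_k = f_k(sum_j w_kj * f_j(sum_i w_ji * x_i + w_j0) + w_k0).

    Each weight row stores the bias first: [w_0, w_1, ..., w_n].
    """
    y = [f_hidden(row[0] + sum(w * xi for w, xi in zip(row[1:], x)))
         for row in w_hidden]
    z = [f_output(row[0] + sum(w * yj for w, yj in zip(row[1:], y)))
         for row in w_output]
    return y, z

# Hypothetical network with d = 2 inputs, n_H = 2 hidden units, c = 1 output.
x = [1.0, 1.0]
w_hidden = [[0.0, 1.0, -1.0], [0.5, 0.0, 0.0]]
w_output = [[1.0, 2.0, 0.0]]
y, z = forward(x, w_hidden, w_output)
```

Swapping `f_output` for a sign function or taking the argmax over `z` gives the classification variants of the output layer.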
Universal approximation
theorem
The universal approximation theorem states that
a feed-forward network with a single hidden layer
containing a finite number of neurons can
approximate any continuous function (on a bounded, closed domain) arbitrarily well
But the theorem does not give any hint on
how to design the activation functions for
particular problems/datasets
Training of neural networks
(backpropagation)
• The network topology is given
• The same activation function is
used at each hidden neuron and it
is given
• Training = calibration of weights
• On-line learning (epochs)
1. Forward propagation
An input vector propagates through the
network
2. Weight update (backpropagation)
The weights of the network are
changed in order to decrease the
difference between the predicted and gold
standard values
We can calculate (propagate back) the
error signal for each hidden neuron
• tk is the target (gold standard) value of
output neuron k, zk is the prediction at
output neuron k (k = 1, …, c) and w are
the weights
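• Error:
$$J(\mathbf{w}) = \frac{1}{2}\sum_{k=1}^{c}(t_k - z_k)^2 = \frac{1}{2}\lVert \mathbf{t} - \mathbf{z} \rVert^2$$
– Backpropagation is a gradient descent
algorithm
• The initial weights are random, then
$$\Delta \mathbf{w} = -\eta \frac{\partial J}{\partial \mathbf{w}}$$
where $\eta$ is the learning rate.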
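The error and a single gradient descent update can be illustrated on the simplest possible case. The sketch assumes one linear output z = w·x, for which the gradient of J is -(t - z)x; the function names and numbers are illustrative.

```python
def squared_error(t, z):
    """J = 1/2 * sum_k (t_k - z_k)^2."""
    return 0.5 * sum((tk - zk) ** 2 for tk, zk in zip(t, z))

def gradient_step(w, grad, eta):
    """Apply Delta w = -eta * dJ/dw element-wise."""
    return [wi - eta * gi for wi, gi in zip(w, grad)]

# Hypothetical single linear output unit: z = w . x
x, t = [1.0, 2.0], 1.0
w = [0.0, 0.0]
z = sum(wi * xi for wi, xi in zip(w, x))
error_before = squared_error([t], [z])

grad = [-(t - z) * xi for xi in x]           # dJ/dw_i = -(t - z) * x_i
w = gradient_step(w, grad, eta=0.1)
z = sum(wi * xi for wi, xi in zip(w, x))
error_after = squared_error([t], [z])
```

One small step against the gradient lowers the error; too large a learning rate would overshoot the minimum instead.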
Backpropagation
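The gradient of the error with respect to the weights between the hidden and
output layers:
$$\frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial \mathrm{net}_k} \cdot \frac{\partial \mathrm{net}_k}{\partial w_{kj}} = -\delta_k \frac{\partial \mathrm{net}_k}{\partial w_{kj}}$$
The error signal for output neuron k:
$$\delta_k = -\frac{\partial J}{\partial \mathrm{net}_k}$$
Because $\mathrm{net}_k = \mathbf{w}_k^t \mathbf{y}$:
$$\frac{\partial \mathrm{net}_k}{\partial w_{kj}} = y_j$$
and, since $z_k = f(\mathrm{net}_k)$:
$$\delta_k = -\frac{\partial J}{\partial \mathrm{net}_k} = -\frac{\partial J}{\partial z_k} \cdot \frac{\partial z_k}{\partial \mathrm{net}_k} = (t_k - z_k) f'(\mathrm{net}_k)$$
The change of weights between the hidden
and output layers:
$$\Delta w_{kj} = \eta \delta_k y_j = \eta (t_k - z_k) f'(\mathrm{net}_k) y_j$$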
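The gradient of the hidden units:
$$y_j = f(\mathrm{net}_j), \qquad \mathrm{net}_j = \sum_{i=0}^{d} w_{ji} x_i$$
$$\frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial y_j} \cdot \frac{\partial y_j}{\partial \mathrm{net}_j} \cdot \frac{\partial \mathrm{net}_j}{\partial w_{ji}}$$
$$\frac{\partial J}{\partial y_j} = \frac{\partial}{\partial y_j}\left[\frac{1}{2}\sum_{k=1}^{c}(t_k - z_k)^2\right] = -\sum_{k=1}^{c}(t_k - z_k)\frac{\partial z_k}{\partial y_j}$$
$$= -\sum_{k=1}^{c}(t_k - z_k)\frac{\partial z_k}{\partial \mathrm{net}_k} \cdot \frac{\partial \mathrm{net}_k}{\partial y_j} = -\sum_{k=1}^{c}\underbrace{(t_k - z_k) f'(\mathrm{net}_k)}_{\delta_k}\, w_{kj}$$
The error signal of the hidden units:
$$\delta_j = f'(\mathrm{net}_j)\sum_{k=1}^{c} w_{kj}\delta_k$$
The weight change between the input and
hidden layers:
$$\Delta w_{ji} = \eta x_i \delta_j = \eta \underbrace{\left[\sum_{k=1}^{c} w_{kj}\delta_k\right] f'(\mathrm{net}_j)}_{\delta_j}\, x_i$$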
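A derivation like this one can be sanity-checked numerically: the error signals should equal the negative partial derivatives of J obtained by finite differences. The sketch below assumes sigmoid activations at both layers and stores the bias first in each weight row; every name and number in it is illustrative.

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def forward(x, w_h, w_o):
    # Bias is stored first in every weight row; sigmoid at both layers (assumption).
    y = [sigmoid(r[0] + sum(w * xi for w, xi in zip(r[1:], x))) for r in w_h]
    z = [sigmoid(r[0] + sum(w * yj for w, yj in zip(r[1:], y))) for r in w_o]
    return y, z

def loss(x, t, w_h, w_o):
    _, z = forward(x, w_h, w_o)
    return 0.5 * sum((tk - zk) ** 2 for tk, zk in zip(t, z))

def error_signals(x, t, w_h, w_o):
    y, z = forward(x, w_h, w_o)
    # For the sigmoid, f'(net) = f(net) * (1 - f(net)).
    delta_k = [(tk - zk) * zk * (1 - zk) for tk, zk in zip(t, z)]
    delta_j = [yj * (1 - yj) * sum(w_o[k][j + 1] * delta_k[k] for k in range(len(w_o)))
               for j, yj in enumerate(y)]
    return y, delta_k, delta_j

def numerical_gradient(x, t, w_h, w_o, rows, r, col, eps=1e-6):
    """Central finite difference of J with respect to rows[r][col]."""
    old = rows[r][col]
    rows[r][col] = old + eps
    up = loss(x, t, w_h, w_o)
    rows[r][col] = old - eps
    down = loss(x, t, w_h, w_o)
    rows[r][col] = old
    return (up - down) / (2 * eps)

# Hypothetical network: d = 2, n_H = 2, c = 1.
x, t = [0.5, -1.0], [1.0]
w_h = [[0.1, 0.2, -0.3], [-0.2, 0.4, 0.1]]
w_o = [[0.05, -0.6, 0.7]]
y, delta_k, delta_j = error_signals(x, t, w_h, w_o)
```

With these definitions, dJ/dw_kj should match -delta_k * y_j and dJ/dw_ji should match -delta_j * x_i up to the finite-difference error.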
Backpropagation
Calculate the error signal for the output
neurons and update the weights between the
output and hidden layers
$$\delta_k = (t_k - z_k) f'(\mathrm{net}_k)$$
Update the weights leading to k:
$$\Delta w_{kj} = \eta \delta_k y_j$$
[Figure: network layers labelled output, hidden, input]
Backpropagation
Calculate the error signal for the hidden
neurons
$$\delta_j = f'(\mathrm{net}_j)\sum_{k=1}^{c} w_{kj}\delta_k$$
[Figure: network layers labelled output, hidden, input]
Backpropagation
Update the weights between the input
and hidden neurons
Update the weights leading to j:
$$\Delta w_{ji} = \eta \delta_j x_i$$
[Figure: network layers labelled output, hidden, input]
w initialised randomly
Begin init: n_H; w, stopping criterion θ, η, m ← 0
  do m ← m + 1
    x^m ← a sampled training instance
    w_ji ← w_ji + η δ_j x_i; w_kj ← w_kj + η δ_k y_j
  until ||∇J(w)|| < θ
  return w
End
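The stochastic backpropagation loop can be written out as runnable code. This is a sketch under stated assumptions rather than a reference implementation: sigmoid activations at both layers, the bias stored first in each weight row, a fixed random seed, and a `max_steps` cap added as a safeguard alongside the gradient-norm criterion. All function and parameter names are illustrative.

```python
import math
import random

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def forward(x, w_h, w_o):
    y = [sigmoid(r[0] + sum(w * xi for w, xi in zip(r[1:], x))) for r in w_h]
    z = [sigmoid(r[0] + sum(w * yj for w, yj in zip(r[1:], y))) for r in w_o]
    return y, z

def total_error(data, w_h, w_o):
    """Sum of J over all training instances."""
    return sum(0.5 * sum((tk - zk) ** 2 for tk, zk in zip(t, forward(x, w_h, w_o)[1]))
               for x, t in data)

def train(data, n_hidden, eta=0.5, theta=1e-4, max_steps=5000, seed=0):
    rng = random.Random(seed)
    d, c = len(data[0][0]), len(data[0][1])
    # w initialised randomly
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(d + 1)] for _ in range(n_hidden)]
    w_o = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(c)]
    for _ in range(max_steps):
        x, t = rng.choice(data)                  # x^m: a sampled training instance
        y, z = forward(x, w_h, w_o)
        delta_k = [(tk - zk) * zk * (1 - zk) for tk, zk in zip(t, z)]
        delta_j = [yj * (1 - yj) * sum(w_o[k][j + 1] * delta_k[k] for k in range(c))
                   for j, yj in enumerate(y)]
        x_aug, y_aug = [1.0] + list(x), [1.0] + y
        grad_sq = 0.0
        for j in range(n_hidden):
            for i in range(d + 1):
                g = delta_j[j] * x_aug[i]
                w_h[j][i] += eta * g             # w_ji <- w_ji + eta * delta_j * x_i
                grad_sq += g * g
        for k in range(c):
            for j in range(n_hidden + 1):
                g = delta_k[k] * y_aug[j]
                w_o[k][j] += eta * g             # w_kj <- w_kj + eta * delta_k * y_j
                grad_sq += g * g
        if math.sqrt(grad_sq) < theta:           # until ||grad J(w)|| < theta
            break
    return w_h, w_o

# Toy task (hypothetical): learn the OR function with two hidden neurons.
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]
w_h, w_o = train(data, n_hidden=2)
```

Calling `train(data, n_hidden=2, max_steps=0)` returns the untrained initial weights for the same seed, which makes it easy to compare `total_error` before and after training.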
Stopping criteria
• Stop if the change in J(w) is smaller than the threshold
• Problem: the change is estimated from a single
training instance. Use bigger batches for estimating the change:
$$J = \sum_{p=1}^{n} J_p$$
Stopping based on the performance
on a validation dataset
– Instances not used for training are used to
estimate the performance of supervised
learning (to avoid overfitting)
– Stop at the minimum error on the
validation set
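Stopping at the minimum validation error can be sketched with two small helpers that operate on the per-epoch validation errors. The `patience` parameter (how many epochs without improvement to tolerate) is an assumption added for illustration, as are the function names and the error values.

```python
def best_epoch(validation_errors):
    """Return the index of the epoch with the lowest validation error."""
    return min(range(len(validation_errors)), key=validation_errors.__getitem__)

def should_stop(validation_errors, patience=3):
    """Stop once the error has not improved for `patience` consecutive epochs."""
    best = best_epoch(validation_errors)
    return len(validation_errors) - 1 - best >= patience

# Hypothetical validation errors: they fall, then rise as overfitting sets in.
errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
stop_now = should_stop(errors)
keep_weights_from = best_epoch(errors)
```

In practice the weights saved at `keep_weights_from` would be the ones returned, not the weights of the last epoch.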
Notes on backpropagation
• It can get stuck at local minima
• In practice, the local minimum found is often close to the
global one
• Multiple training runs starting from various
randomly initialized weights might help
– we can take the trained network with the minimal
error (on a validation set)
– there are voting schemes for combining the networks
Questions of network design
• How many hidden neurons?
– few neurons cannot learn complex
patterns
– too many neurons can easily
overfit
– validation set?
• Learning rate!?
Outlook
History of neural networks
• Perceptron: one of the first
machine learners ~1950
• Backpropagation: multilayer
perceptrons, 1975-
• Deep learning:
popular again 2006-
Deep learning
(auto-encoder pretraining)
Recurrent neural networks
short term memory
http://www.youtube.com/watch?v=vmDByFN6eig
