
Regression,

Artificial Neural Networks


16/03/2016
Regression
– Supervised learning: based on training examples, learn a model that works well on previously unseen examples.

– Regression: forecasting real values


Regression
Training dataset: $\{x_i, r_i\}$, $r_i \in \mathbb{R}$

Evaluation metric:
"Least squared error"
Linear regression
$g(x) = w_1 x + w_0$

Its gradient is 0 if
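The condition itself did not survive on this slide. As a minimal sketch, assuming the error is the sum of squared residuals $\sum_i (r_i - g(x_i))^2$, setting both partial derivatives to zero gives the usual closed-form solution $w_1 = \sum_i (x_i - \bar{x})(r_i - \bar{r}) / \sum_i (x_i - \bar{x})^2$ and $w_0 = \bar{r} - w_1 \bar{x}$. The function name `fit_line` and the toy data are illustrative, not taken from the slides.

```python
def fit_line(xs, rs):
    """Least-squares fit of g(x) = w1*x + w0 using the closed-form solution."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_r = sum(rs) / n
    # Covariance and variance sums (the common 1/n factor cancels out).
    s_xr = sum((x - mean_x) * (r - mean_r) for x, r in zip(xs, rs))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    w1 = s_xr / s_xx
    w0 = mean_r - w1 * mean_x
    return w1, w0


# Toy data lying exactly on r = 2x + 1 (made up for illustration).
w1, w0 = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(w1, w0)
```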
Regression variants
+MLE →
– Bayes
– k nearest neighbours
• mean or
• distance weighted average

– Decision tree

• or various linear models on the leaves
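The k nearest neighbours variant listed above can be sketched in a few lines. This is only an illustration for scalar inputs; the name `knn_regress`, the inverse-distance weighting and the small `eps` constant are assumptions, since the slides do not fix a particular weighting scheme.

```python
def knn_regress(train, query, k=3, weighted=False):
    """Predict a real value at `query` from `train`, a list of (x, r) pairs."""
    # The k training points closest to the query.
    nearest = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    if not weighted:
        # Plain mean of the neighbours' target values.
        return sum(r for _, r in nearest) / len(nearest)
    # Distance-weighted average; eps avoids division by zero (an assumption).
    eps = 1e-9
    weights = [1.0 / (abs(x - query) + eps) for x, _ in nearest]
    return sum(w * r for w, (_, r) in zip(weights, nearest)) / sum(weights)


train = [(0, 0), (1, 1), (2, 2), (10, 10)]
print(knn_regress(train, 1.0, k=3))
print(knn_regress(train, 1.0, k=3, weighted=True))
```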


Regression SVM
Artificial Neural Networks
Artificial neural networks
• Motivation: simulating the information processing mechanisms of the nervous system (human brain)
• Structure: a huge number of densely connected, mutually operating processing units (neurons)
• It learns from experience (training instances)
Some neurobiology…
• Neurons have many
inputs and a single
output
• The output is either
excited or not
• The inputs from other
neurons determine
whether the neuron
fires
• Each input synapse has
a weight
A neuron in maths
Weighted sum (average) of the inputs. If it is above a threshold T, the neuron fires (outputs 1); otherwise its output is 0 or -1.
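A minimal sketch of such a unit, assuming a plain weighted sum compared against T. The name `threshold_neuron` and the `off_value` parameter (0 or -1) are illustrative choices.

```python
def threshold_neuron(inputs, weights, T, off_value=0):
    """Fire (return 1) if the weighted sum of the inputs exceeds the threshold T."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s > T else off_value


print(threshold_neuron([1, 1], [0.5, 0.5], T=0.7))
print(threshold_neuron([1, 0], [0.5, 0.5], T=0.7, off_value=-1))
```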
Statistics about the human brain
• #neurons: ~$10^{11}$

• Avg. #connections per neuron: $10^{4}$

• Signal sending time: $10^{-3}$ sec

• Face recognition: $10^{-1}$ sec


Motivation
(machine learning point of view)
• Goal: non-linear classification

– Linear machines are not satisfactory in several real-world situations

– Which non-linear function family to choose?

– Neural networks: latent non-linear patterns will be learnt by the machine
Perceptron
Multilayer perceptron =
Neural Network
Different
representation at
various layers
Multilayer perceptron
Feedforward neural networks
• Connection only to the next layer
• The weights of the connections
(between two layers) can be changed
• Activation functions are used to
calculate whether the neuron fires
• Three-layer network:
• Input layer
• Hidden layer
• Output layer
Network function

• The network function of neuron j:

$$net_j = \sum_{i=1}^{d} x_i w_{ji} + w_{j0} = \sum_{i=0}^{d} x_i w_{ji} = \mathbf{w}_j^t \mathbf{x},$$

where i is the index of input neurons, $w_{ji}$ is the weight between neurons i and j, and $x_0 = 1$.

• $w_{j0}$ is the bias
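To illustrate that the two forms of the network function agree, here is a small sketch. The function names and the numbers are made up; the only assumption is the convention that the input is extended with a constant $x_0 = 1$ so that the bias becomes an ordinary weight.

```python
def net_input(x, w, bias):
    """net_j as a weighted sum of the inputs plus a separate bias w_j0."""
    return sum(xi * wi for xi, wi in zip(x, w)) + bias


def net_input_augmented(x, w_aug):
    """Same value, with the bias stored as w_aug[0] and the input extended by x_0 = 1."""
    x_aug = [1.0] + list(x)
    return sum(xi * wi for xi, wi in zip(x_aug, w_aug))


print(net_input([2.0, 3.0], [0.5, -1.0], 0.25))
print(net_input_augmented([2.0, 3.0], [0.25, 0.5, -1.0]))
```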


Activation function
The activation function is a non-linear function
of the network value:
$y_j = f(net_j)$
(if it were linear, the whole network would be linear)

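The sign activation function:

$$f(net) = \mathrm{sgn}(net) = \begin{cases} 1 & \text{if } net \ge 0 \\ -1 & \text{if } net < 0 \end{cases}$$

[Figure: step-shaped output $o_i$ plotted against $net_j$, switching at the threshold $T_j$]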
Differentiable activation functions
• Enable gradient descent-based learning
• The sigmoid function:

$$f(net_j) = \frac{1}{1 + e^{-(net_j - T_j)}}$$

[Figure: sigmoid curve rising from 0 to 1 along $net_j$, centred at $T_j$]
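A short sketch of the sigmoid and of its derivative, which gradient descent needs. The identity $f'(net) = f(net)(1 - f(net))$ is a standard property of the logistic function; the function names are illustrative.

```python
import math


def sigmoid(net, T=0.0):
    """Logistic sigmoid shifted by the threshold T."""
    # Note: very large negative (net - T) would overflow math.exp; fine for a sketch.
    return 1.0 / (1.0 + math.exp(-(net - T)))


def sigmoid_prime(net, T=0.0):
    """Derivative of the sigmoid with respect to net: f * (1 - f)."""
    s = sigmoid(net, T)
    return s * (1.0 - s)


print(sigmoid(0.0), sigmoid_prime(0.0))
```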
Output layer
$$net_k = \sum_{j=1}^{n_H} y_j w_{kj} + w_{k0} = \sum_{j=0}^{n_H} y_j w_{kj} = \mathbf{w}_k^t \mathbf{y}$$

where k is the index on the output layer and $n_H$ is the number of hidden neurons

• Binary classification: sign function


• Multi-class classification: a neuron for
each of the classes, the argmax is
predicted (discriminant function)
• Regression: linear transformation
– The hidden unit y1 computes x1 OR x2:
x1 + x2 + 0.5 ≥ 0 ⇒ y1 = +1
x1 + x2 + 0.5 < 0 ⇒ y1 = -1
– y2 represents x1 AND x2:
x1 + x2 - 1.5 ≥ 0 ⇒ y2 = +1
x1 + x2 - 1.5 < 0 ⇒ y2 = -1
– The output neuron: z1 = 0.7y1 - 0.4y2 - 1,
sgn(z1) is 1 iff y1 = 1, y2 = -1

(x1 OR x2) AND NOT (x1 AND x2)
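The small network above can be checked by enumerating all inputs. This sketch assumes the inputs are encoded as -1 (false) and +1 (true); the slide does not state the encoding, but the thresholds 0.5 and -1.5 only behave as OR and AND under that assumption. The function names are illustrative.

```python
def sgn(v):
    """Sign activation: +1 for v >= 0, otherwise -1."""
    return 1 if v >= 0 else -1


def xor_net(x1, x2):
    y1 = sgn(x1 + x2 + 0.5)        # hidden unit: x1 OR x2
    y2 = sgn(x1 + x2 - 1.5)        # hidden unit: x1 AND x2
    z1 = 0.7 * y1 - 0.4 * y2 - 1   # output neuron
    return sgn(z1)


# Enumerate the four input combinations (-1 = false, +1 = true).
for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, xor_net(x1, x2))
```

The output is +1 exactly when one of the two inputs is +1, which is the XOR function that a single linear unit cannot represent.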


General (three-layer) feedforward network (c output units)

$$g_k(\mathbf{x}) = z_k = f_k\left( \sum_{j=1}^{n_H} w_{kj} \, f_j\left( \sum_{i=1}^{d} w_{ji} x_i + w_{j0} \right) + w_{k0} \right), \quad k = 1, \ldots, c$$

– The hidden units with their activation functions can express non-linear functions

– The activation functions can be different at each neuron (but the same one is used in practice)
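The network function can be written as a short forward pass. This is a sketch under assumptions: weights are stored as nested lists with the bias at index 0 of each row, the hidden activation defaults to tanh and the output activation to the identity (as for regression). None of these choices come from the slides.

```python
import math


def forward(x, W_hidden, W_out, f_hidden=math.tanh, f_out=lambda v: v):
    """Forward pass of a three-layer network.

    W_hidden[j] = [w_j0, w_j1, ..., w_jd], W_out[k] = [w_k0, w_k1, ..., w_knH].
    Returns the hidden outputs y and the network outputs z.
    """
    x_aug = [1.0] + list(x)  # x_0 = 1 carries the bias
    y = [f_hidden(sum(w * xi for w, xi in zip(w_j, x_aug))) for w_j in W_hidden]
    y_aug = [1.0] + y        # y_0 = 1 carries the output bias
    z = [f_out(sum(w * yj for w, yj in zip(w_k, y_aug))) for w_k in W_out]
    return y, z


y, z = forward([0.5], [[0.0, 1.0]], [[1.0, 2.0]])
print(y, z)
```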
Universal approximation theorem
The universal approximation theorem states that
a feed-forward network with a single hidden layer
containing a finite number of neurons can
approximate any continuous function (on a compact domain, to arbitrary accuracy)

But the theorem does not give any hint on how to design the network (number of neurons, activation functions) for particular problems/datasets
Training of neural networks
(backpropagation)
Training of neural networks
• The network topology is given
• The same activation function is
used at each hidden neuron and it
is given
• Training = calibration of weights
• on-line learning (epochs)
Training of neural networks
1. Forward propagation
An input vector propagates through the network

2. Weight update (backpropagation)
The weights of the network are changed in order to decrease the difference between the predicted and gold standard values
Training of neural networks

We can calculate (propagate back) the error signal for each hidden neuron
• $t_k$ is the target (gold standard) value of output neuron k, $z_k$ is the prediction at output neuron k (k = 1, …, c) and w are the weights

Error:
$$J(\mathbf{w}) = \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 = \frac{1}{2} \lVert \mathbf{t} - \mathbf{z} \rVert^2$$

– Backpropagation is a gradient descent algorithm
• Initial weights are random, then

$$\Delta \mathbf{w} = -\eta \frac{\partial J}{\partial \mathbf{w}}$$

where $\eta$ is the learning rate
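The error and the update rule can be sketched as two small helpers. The names `squared_error` and `gradient_step` are illustrative, and the gradient is assumed to be supplied by the caller.

```python
def squared_error(t, z):
    """J = 1/2 * sum_k (t_k - z_k)^2 for one training instance."""
    return 0.5 * sum((tk - zk) ** 2 for tk, zk in zip(t, z))


def gradient_step(w, grad, eta=0.1):
    """One gradient descent step: w <- w - eta * dJ/dw, element-wise."""
    return [wi - eta * gi for wi, gi in zip(w, grad)]


print(squared_error([1.0, 0.0], [0.0, 0.0]))
print(gradient_step([1.0, 2.0], [0.5, -1.0], eta=0.5))
```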
Backpropagation
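The error of the weights between the hidden and output layers:

$$\frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial net_k} \cdot \frac{\partial net_k}{\partial w_{kj}} = -\delta_k \frac{\partial net_k}{\partial w_{kj}}$$

the error signal for output neuron k:

$$\delta_k = -\frac{\partial J}{\partial net_k}$$

because $net_k = \mathbf{w}_k^t \mathbf{y}$:

$$\frac{\partial net_k}{\partial w_{kj}} = y_j$$

and, since $z_k = f(net_k)$:

$$\delta_k = -\frac{\partial J}{\partial net_k} = -\frac{\partial J}{\partial z_k} \cdot \frac{\partial z_k}{\partial net_k} = (t_k - z_k) f'(net_k)$$

The change of weights between the hidden and output layers:

$$\Delta w_{kj} = \eta \delta_k y_j = \eta (t_k - z_k) f'(net_k) y_j$$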
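The gradient of the hidden units:

$$y_j = f(net_j), \quad net_j = \sum_{i=0}^{d} w_{ji} x_i$$

$$\frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial y_j} \cdot \frac{\partial y_j}{\partial net_j} \cdot \frac{\partial net_j}{\partial w_{ji}}$$

$$\frac{\partial J}{\partial y_j} = \frac{\partial}{\partial y_j} \left[ \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 \right] = -\sum_{k=1}^{c} (t_k - z_k) \frac{\partial z_k}{\partial y_j} = -\sum_{k=1}^{c} (t_k - z_k) \frac{\partial z_k}{\partial net_k} \cdot \frac{\partial net_k}{\partial y_j} = -\sum_{k=1}^{c} \underbrace{(t_k - z_k) f'(net_k)}_{\delta_k} w_{kj}$$

The error signal of the hidden units:

$$\delta_j = f'(net_j) \sum_{k=1}^{c} w_{kj} \delta_k$$

The weight change between the input and hidden layers:

$$\Delta w_{ji} = \eta x_i \delta_j = \eta \underbrace{\left[ \sum_{k=1}^{c} w_{kj} \delta_k \right] f'(net_j)}_{\delta_j} x_i$$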
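The two error signals and the resulting weight changes can be sketched for a single training instance. This sketch assumes tanh activations in both layers (so $f'(net) = 1 - f(net)^2$) and stores the bias at index 0 of each weight row; the function names are illustrative, not the lecture's.

```python
import math


def forward(x, W_h, W_o):
    """Forward pass with tanh units; index 0 of each weight row is the bias."""
    x_aug = [1.0] + list(x)
    y = [math.tanh(sum(w * xi for w, xi in zip(wj, x_aug))) for wj in W_h]
    y_aug = [1.0] + y
    z = [math.tanh(sum(w * yj for w, yj in zip(wk, y_aug))) for wk in W_o]
    return x_aug, y, y_aug, z


def error(x, t, W_h, W_o):
    """J = 1/2 * sum_k (t_k - z_k)^2 for one instance."""
    z = forward(x, W_h, W_o)[3]
    return 0.5 * sum((tk - zk) ** 2 for tk, zk in zip(t, z))


def weight_changes(x, t, W_h, W_o, eta=0.1):
    """Return (dW_h, dW_o), the backpropagation weight changes for one instance."""
    x_aug, y, y_aug, z = forward(x, W_h, W_o)
    # Output error signals: delta_k = (t_k - z_k) * f'(net_k)
    delta_k = [(tk - zk) * (1.0 - zk ** 2) for tk, zk in zip(t, z)]
    # Hidden error signals: delta_j = f'(net_j) * sum_k w_kj * delta_k
    # (W_o[k][j + 1] skips the bias weight w_k0 stored at index 0)
    delta_j = [
        (1.0 - yj ** 2) * sum(W_o[k][j + 1] * delta_k[k] for k in range(len(W_o)))
        for j, yj in enumerate(y)
    ]
    dW_o = [[eta * dk * yj for yj in y_aug] for dk in delta_k]
    dW_h = [[eta * dj * xi for xi in x_aug] for dj in delta_j]
    return dW_h, dW_o
```

With `eta=1.0` the returned changes equal the negative gradient of J, which can be compared against a finite-difference estimate as a sanity check.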
Backpropagation
Calculate the error signal for the output
neurons and update the weights between the
output and hidden layers

$$\delta_k = (t_k - z_k) f'(net_k)$$

Update the weights leading to output neuron k:

$$\Delta w_{kj} = \eta \delta_k y_j$$

[Figure: network diagram with input, hidden and output layers]
Backpropagation
Calculate the error signal for hidden
neurons

$$\delta_j = f'(net_j) \sum_{k=1}^{c} w_{kj} \delta_k$$

[Figure: network diagram with input, hidden and output layers]
Backpropagation
Update the weights between the input
and hidden neurons

Update the weights leading to hidden neuron j:

$$\Delta w_{ji} = \eta \delta_j x_i$$

[Figure: network diagram with input, hidden and output layers]
Training of neural networks
w initialised randomly

Begin init: nH; w, stopping criterion θ, η, m ← 0
  do m ← m + 1
    x^m ← a sampled training instance
    w_ji ← w_ji + η δ_j x_i; w_kj ← w_kj + η δ_k y_j
  until ||∇J(w)|| < θ
  return w
End
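The procedure above might look as follows in runnable form. This is a sketch under several assumptions not fixed by the slides: tanh activations in both layers, initial weights drawn uniformly from [-0.5, 0.5], a `max_iters` safety cap, and a toy AND dataset with -1/+1 encoding. The names `train` and `total_error` are illustrative.

```python
import math
import random


def forward(x, W_h, W_o):
    """Forward pass with tanh units; index 0 of each weight row is the bias."""
    x_aug = [1.0] + list(x)
    y_aug = [1.0] + [math.tanh(sum(w * v for w, v in zip(wj, x_aug))) for wj in W_h]
    z = [math.tanh(sum(w * v for w, v in zip(wk, y_aug))) for wk in W_o]
    return x_aug, y_aug, z


def total_error(data, W_h, W_o):
    """Sum of the per-instance errors J_p over the dataset."""
    return sum(
        0.5 * sum((tk - zk) ** 2 for tk, zk in zip(t, forward(x, W_h, W_o)[2]))
        for x, t in data
    )


def train(data, n_hidden, eta=0.1, theta=1e-4, max_iters=5000, seed=0):
    rng = random.Random(seed)
    d, c = len(data[0][0]), len(data[0][1])
    # w initialised randomly (small uniform values; an assumption)
    W_h = [[rng.uniform(-0.5, 0.5) for _ in range(d + 1)] for _ in range(n_hidden)]
    W_o = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(c)]
    for m in range(max_iters):
        x, t = rng.choice(data)  # x^m <- a sampled training instance
        x_aug, y_aug, z = forward(x, W_h, W_o)
        # Both error signals are computed before any weight is changed.
        delta_k = [(tk - zk) * (1 - zk ** 2) for tk, zk in zip(t, z)]
        delta_j = [
            (1 - y_aug[j + 1] ** 2) * sum(W_o[k][j + 1] * delta_k[k] for k in range(c))
            for j in range(n_hidden)
        ]
        grad_sq = 0.0
        for k in range(c):  # w_kj <- w_kj + eta * delta_k * y_j
            for j in range(n_hidden + 1):
                g = delta_k[k] * y_aug[j]
                W_o[k][j] += eta * g
                grad_sq += g * g
        for j in range(n_hidden):  # w_ji <- w_ji + eta * delta_j * x_i
            for i in range(d + 1):
                g = delta_j[j] * x_aug[i]
                W_h[j][i] += eta * g
                grad_sq += g * g
        if math.sqrt(grad_sq) < theta:  # until ||grad J(w)|| < theta
            break
    return W_h, W_o


# Toy AND problem with -1/+1 encoding (made up for illustration).
data = [([-1.0, -1.0], [-1.0]), ([-1.0, 1.0], [-1.0]),
        ([1.0, -1.0], [-1.0]), ([1.0, 1.0], [1.0])]
W_h, W_o = train(data, n_hidden=3)
print(total_error(data, W_h, W_o))
```

Note that the gradient norm here is estimated from a single sampled instance, so the stopping test is noisy.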
Stopping criteria
• Stop if the change in J(w) is smaller than the threshold

• Problem: the change is estimated from a single training instance. Use bigger batches to estimate the change:

$$J = \sum_{p=1}^{n} J_p$$
Stopping based on the performance
on a validation dataset
– Use held-out instances (not used for training) to estimate the performance of supervised learning (to avoid overfitting)

– Stop at the minimum error on the validation set
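Validation-based stopping can be sketched independently of the network itself. In this sketch `train_epoch` and `validation_error` are hypothetical callables supplied by the caller, and the `patience` parameter (how many non-improving epochs to tolerate) is an assumption; the slides only say to stop at the minimum validation error.

```python
def early_stopping(train_epoch, validation_error, max_epochs=100, patience=5):
    """Keep the weights with the lowest validation error seen so far.

    train_epoch() runs one pass of weight updates and returns a snapshot of the weights;
    validation_error(weights) returns the error on the held-out validation set.
    """
    best_err, best_weights, since_best = float("inf"), None, 0
    for epoch in range(max_epochs):
        weights = train_epoch()
        err = validation_error(weights)
        if err < best_err:
            best_err, best_weights, since_best = err, weights, 0
        else:
            since_best += 1
            if since_best >= patience:
                break  # validation error has stopped improving
    return best_weights, best_err


# Simulated validation curve (made-up numbers): the "weights" are just the epoch index.
errs = [5.0, 4.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
epochs = iter(range(len(errs)))
print(early_stopping(lambda: next(epochs), lambda w: errs[w], max_epochs=10, patience=3))
```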
Notes on backpropagation
• It can get stuck in local minima
• In practice, the local minima are usually close to the global one

• Training several times from various randomly initialized weights might help
– we can take the trained network with the minimal error (on a validation set)
– there are voting schemes for combining the networks
Questions of network design
• How many hidden neurons?
– few neurons cannot learn complex
patterns
– too many neurons can easily
overfit
– validation set?

• Learning rate!?
Outlook
History of neural networks
• Perceptron: one of the first
machine learners ~1950
• Backpropagation: multilayer
perceptrons, 1975-
• Deep learning:
popular again 2006-
Deep learning
(auto-encoder pretraining)
Recurrent neural networks
short term memory

http://www.youtube.com/watch?v=vmDByFN6eig