
# Regression

## Artificial Neural Networks

16/03/2016
Regression
– Supervised learning: based on training examples, learn a model which works well on previously unseen examples.

– Regression: forecasting real values

Regression
Training dataset: {x_i, r_i}, r_i ∈ R

Evaluation metric: least squared error
Linear regression
g(x) = w1x + w0
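A minimal sketch of fitting g(x) = w1x + w0 under the least squared error metric, using the 1-D closed-form solution; the toy dataset is invented for illustration:

```python
# Least-squares fit of g(x) = w1*x + w0 (closed form, one feature).
# The data points below are made up for illustration.

def fit_linear(xs, rs):
    """Return (w1, w0) minimizing sum((w1*x + w0 - r)^2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_r = sum(rs) / n
    cov = sum((x - mean_x) * (r - mean_r) for x, r in zip(xs, rs))
    var = sum((x - mean_x) ** 2 for x in xs)
    w1 = cov / var
    w0 = mean_r - w1 * mean_x
    return w1, w0

xs = [0.0, 1.0, 2.0, 3.0]
rs = [1.0, 3.0, 5.0, 7.0]   # exactly r = 2x + 1
w1, w0 = fit_linear(xs, rs)
print(w1, w0)  # 2.0 1.0
```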

Regression variants
– MLE → Bayes
– k nearest neighbours
  • mean or
  • distance-weighted average
– Decision tree
  • mean or various linear models at the leaves

Regression SVM
Artificial Neural Networks

Artificial neural networks
• Motivation: the simulation of the nervous system's (human brain's) information processing mechanisms
• Structure: a huge number of densely connected, mutually interacting processing units (neurons)
• It learns from experience (training instances)
Some neurobiology…
• Neurons have many inputs and a single output
• The output is either excited or not
• The inputs from other neurons determine whether the neuron fires
• Each input synapse has a weight
A neuron in maths
Weighted average of inputs. If the average is above a threshold T, the neuron fires (outputs 1); otherwise its output is 0 or -1.
• #neurons: ~10^11
• Face recognition: ~10^-1 sec
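The threshold neuron above can be sketched in a few lines; the weights, inputs and thresholds here are illustrative values, not taken from the slides:

```python
# A threshold neuron: fires (outputs 1) if the weighted sum of its
# inputs exceeds the threshold T, otherwise outputs -1.

def neuron(inputs, weights, threshold):
    net = sum(x * w for x, w in zip(inputs, weights))
    return 1 if net > threshold else -1

# 0.8*1.0 + 0.4*0.5 = 1.0
print(neuron([1.0, 0.5], [0.8, 0.4], threshold=0.9))  # 1.0 > 0.9 -> fires: 1
print(neuron([1.0, 0.5], [0.8, 0.4], threshold=1.1))  # 1.0 < 1.1 -> -1
```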

Motivation
(machine learning point of view)
• Goal: non-linear classification
– Linear machines are not satisfactory in several real-world situations
– Neural networks: latent non-linear patterns will be machine-learnt
Perceptron

Multilayer perceptron = Neural Network
Different representation at various layers
Multilayer perceptron
Feedforward neural networks
• Connection only to the next layer
• The weights of the connections
(between two layers) can be changed
• Activation functions are used to
calculate whether the neuron fires
• Three-layer network:
• Input layer
• Hidden layer
• Output layer
Network function
• The network function of neuron j:

$$\mathrm{net}_j = \sum_{i=1}^{d} x_i w_{ji} + w_{j0} = \sum_{i=0}^{d} x_i w_{ji} = \mathbf{w}_j^t \mathbf{x}$$

where i is the index of input neurons, and w_ji is the weight between neurons i and j.
• w_j0 is the bias
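The network function of a neuron, with the equivalence between an explicit bias term and a bias folded in as w_j0 with a constant input x_0 = 1; the numbers are illustrative:

```python
# net_j = sum_i x_i * w_ji + w_j0, i.e. a weighted sum plus a bias.

def net_input(x, w, bias):
    return sum(xi * wi for xi, wi in zip(x, w)) + bias

x = [0.5, -1.0]
w = [0.3, 0.2]
bias = 0.1
# 0.5*0.3 + (-1.0)*0.2 + 0.1 = 0.05
print(round(net_input(x, w, bias), 6))  # 0.05
# equivalent form with the bias folded into the weight vector (x_0 = 1):
print(round(net_input([1.0] + x, [bias] + w, 0.0), 6))  # 0.05
```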

Activation function
The activation function is a non-linear function of the network value:
y_j = f(net_j)
(if it were linear, the whole network would be linear)

The sign activation function:

$$f(\mathrm{net}) = \mathrm{sgn}(\mathrm{net}) = \begin{cases} 1 & \text{if } \mathrm{net} \ge 0 \\ -1 & \text{if } \mathrm{net} < 0 \end{cases}$$

[Figure: step function, output o_i switching at net_j = T_j]
Differentiable activation functions
• The sigmoid function:

$$f(\mathrm{net}_j) = \frac{1}{1 + e^{-(\mathrm{net}_j - T_j)}}$$

[Figure: sigmoid curve rising from 0 to 1 around net_j = T_j]
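The sigmoid above is directly computable; a small sketch with the threshold T_j as a parameter:

```python
import math

# Sigmoid activation with threshold T_j:
# f(net_j) = 1 / (1 + exp(-(net_j - T_j))).

def sigmoid(net, threshold=0.0):
    return 1.0 / (1.0 + math.exp(-(net - threshold)))

print(sigmoid(0.0))    # 0.5: at the threshold the output is halfway
print(sigmoid(10.0))   # close to 1
print(sigmoid(-10.0))  # close to 0
```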
Output layer

$$\mathrm{net}_k = \sum_{j=1}^{n_H} y_j w_{kj} + w_{k0} = \sum_{j=0}^{n_H} y_j w_{kj} = \mathbf{w}_k^t \mathbf{y}$$

where k is the index on the output layer and n_H is the number of hidden neurons

• Binary classification: sign function
• Multi-class classification: a neuron for each of the classes, the argmax is predicted (discriminant function)
• Regression: linear transformation
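A small illustration of the multi-class rule: one output neuron per class, and the predicted class is the argmax of the output-layer values (the scores below are invented):

```python
# Multi-class prediction: predict the index of the output neuron
# with the largest net value.

def predict_class(net_outputs):
    return max(range(len(net_outputs)), key=lambda k: net_outputs[k])

net_outputs = [0.2, 1.7, -0.3]     # net_k for classes 0, 1, 2
print(predict_class(net_outputs))  # 1
```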
– The hidden unit y1 calculates x1 OR x2:
  x1 + x2 + 0.5 ≥ 0 → y1 = +1
  x1 + x2 + 0.5 < 0 → y1 = -1
– The hidden unit y2 represents x1 AND x2:
  x1 + x2 - 1.5 ≥ 0 → y2 = +1
  x1 + x2 - 1.5 < 0 → y2 = -1
– The output neuron: z1 = 0.7y1 - 0.4y2 - 1; sgn(z1) is 1 iff y1 = +1 and y2 = -1

(x1 OR x2) AND NOT(x1 AND x2), i.e. XOR
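The two-layer XOR network above can be evaluated directly, assuming inputs encoded as -1/+1 as the thresholds suggest:

```python
# The XOR network from the slide: y1 computes OR, y2 computes AND,
# and the output neuron combines them so the network fires only
# when exactly one input is +1.

def sgn(v):
    return 1 if v >= 0 else -1

def xor_net(x1, x2):
    y1 = sgn(x1 + x2 + 0.5)               # x1 OR x2
    y2 = sgn(x1 + x2 - 1.5)               # x1 AND x2
    return sgn(0.7 * y1 - 0.4 * y2 - 1)   # OR and not AND

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, xor_net(x1, x2))
# fires (+1) exactly when the inputs differ
```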

General (three-layer) feedforward network (c output units)

$$g_k(\mathbf{x}) = z_k = f_k\!\left( \sum_{j=1}^{n_H} w_{kj} \, f_j\!\left( \sum_{i=1}^{d} w_{ji} x_i + w_{j0} \right) + w_{k0} \right), \quad k = 1, \ldots, c$$

– The hidden units with their activation functions can express non-linear functions
– The activation functions can be different at different neurons (but in practice the same one is used)
Universal approximation theorem

The universal approximation theorem states that a feed-forward network with a single hidden layer containing a finite number of neurons can approximate any continuous function.

But the theorem does not give any hint on how to design activation functions for particular problems/datasets.
Training of neural networks
(backpropagation)
Training of neural networks
• The network topology is given
• The same activation function is
used at each hidden neuron and it
is given
• Training = calibration of weights
• on-line learning (epochs)
Training of neural networks
1. Forward propagation
An input vector propagates through the network
2. Weight update (backpropagation)
The weights of the network are changed in order to decrease the difference between the predicted and gold standard values
Training of neural networks

We can calculate (propagate back) the error signal for each hidden neuron.
• t_k is the target (gold standard) value of output neuron k, z_k is the prediction at output neuron k (k = 1, …, c), and w are the weights

Error:

$$J(\mathbf{w}) = \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 = \frac{1}{2} \|\mathbf{t} - \mathbf{z}\|^2$$

– Backpropagation is a gradient descent algorithm
• initial weights are random, then

$$\Delta \mathbf{w} = -\eta \frac{\partial J}{\partial \mathbf{w}}$$
Backpropagation
The error of the weights between the hidden and output layers:

$$\frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial \mathrm{net}_k} \cdot \frac{\partial \mathrm{net}_k}{\partial w_{kj}} = -\delta_k \frac{\partial \mathrm{net}_k}{\partial w_{kj}}$$

The error signal for output neuron k:

$$\delta_k = -\frac{\partial J}{\partial \mathrm{net}_k}$$

Because net_k = w_k^t y:

$$\frac{\partial \mathrm{net}_k}{\partial w_{kj}} = y_j$$

and, since z_k = f(net_k):

$$\delta_k = -\frac{\partial J}{\partial \mathrm{net}_k} = -\frac{\partial J}{\partial z_k} \cdot \frac{\partial z_k}{\partial \mathrm{net}_k} = (t_k - z_k) f'(\mathrm{net}_k)$$

The change of weights between the hidden and output layers:

$$\Delta w_{kj} = \eta \delta_k y_j = \eta (t_k - z_k) f'(\mathrm{net}_k) y_j$$
The gradient of the hidden units:

$$y_j = f(\mathrm{net}_j), \qquad \mathrm{net}_j = \sum_{i=0}^{d} w_{ji} x_i$$

$$\frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial y_j} \cdot \frac{\partial y_j}{\partial \mathrm{net}_j} \cdot \frac{\partial \mathrm{net}_j}{\partial w_{ji}}$$

$$\frac{\partial J}{\partial y_j} = \frac{\partial}{\partial y_j} \left[ \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 \right] = -\sum_{k=1}^{c} (t_k - z_k) \frac{\partial z_k}{\partial y_j} = -\sum_{k=1}^{c} (t_k - z_k) \frac{\partial z_k}{\partial \mathrm{net}_k} \cdot \frac{\partial \mathrm{net}_k}{\partial y_j} = -\sum_{k=1}^{c} (t_k - z_k) f'(\mathrm{net}_k) w_{kj}$$

The error signal of the hidden units:

$$\delta_j = f'(\mathrm{net}_j) \sum_{k=1}^{c} w_{kj} \delta_k$$

The change of weights between the input and hidden layers:

$$\Delta w_{ji} = \eta x_i \delta_j = \eta \left[ \sum_{k=1}^{c} w_{kj} \delta_k \right] f'(\mathrm{net}_j) x_i$$
Backpropagation
Calculate the error signal for the output neurons and update the weights between the output and hidden layers:

$$\Delta w_{kj} = \eta \delta_k y_j$$

[Figure: output–hidden–input layer diagram]
Backpropagation
Calculate the error signal for the hidden neurons:

$$\delta_j = f'(\mathrm{net}_j) \sum_{k=1}^{c} w_{kj} \delta_k$$
Backpropagation
Update the weights between the input and hidden neurons (updating the ones to j):

$$\Delta w_{ji} = \eta \delta_j x_i$$
Training of neural networks
w is initialised randomly

Begin init: n_H, w, stopping criterion θ, learning rate η, m ← 0
  do m ← m + 1
    x^m ← a sampled training instance
    w_ji ← w_ji + η δ_j x_i;  w_kj ← w_kj + η δ_k y_j
  until ||∇J(w)|| < θ
  return w
End
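The loop above can be sketched end to end as a minimal online-backpropagation trainer. This is an illustration, not the lecture's reference implementation: the 2-2-1 topology, sigmoid activations, 0/1 XOR targets, learning rate and step count are all assumed choices.

```python
import math
import random

random.seed(0)

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

# w_hidden[j][i]: weight from input i to hidden neuron j (index 0 = bias)
w_hidden = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
# w_out[j]: weight from hidden neuron j to the single output (index 0 = bias)
w_out = [random.uniform(-1, 1) for _ in range(3)]
eta = 0.5                                       # learning rate

# XOR with targets in {0, 1}
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

def forward(x):
    xs = [1.0] + [float(v) for v in x]          # prepend bias input x_0 = 1
    y = [sigmoid(sum(w * xi for w, xi in zip(wj, xs))) for wj in w_hidden]
    ys = [1.0] + y                              # prepend bias "hidden output"
    z = sigmoid(sum(w * yi for w, yi in zip(w_out, ys)))
    return xs, ys, z

def total_error():
    return sum(0.5 * (t - forward(x)[2]) ** 2 for x, t in data)

err_before = total_error()
for step in range(20000):
    x, t = random.choice(data)                  # online learning: one instance
    xs, ys, z = forward(x)
    delta_k = (t - z) * z * (1 - z)             # output error signal, sigmoid'
    # hidden error signals, computed before the output weights change
    delta_h = [ys[j + 1] * (1 - ys[j + 1]) * w_out[j + 1] * delta_k
               for j in range(2)]
    for j in range(3):                          # hidden -> output update
        w_out[j] += eta * delta_k * ys[j]
    for j in range(2):                          # input -> hidden update
        for i in range(3):
            w_hidden[j][i] += eta * delta_h[j] * xs[i]
err_after = total_error()

print(err_before, "->", err_after)              # the error decreases
```

Note that with so few hidden units the run can end in a local minimum rather than a perfect XOR fit, which is exactly the behaviour discussed in the notes below.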
Stopping criteria
• if the change in J(w) is smaller than the threshold

• Problem: estimating the change from a single training instance. Use bigger batches for change estimation:

$$J = \sum_{p=1}^{n} J_p$$
Stopping based on the performance on a validation dataset
– The usage of held-out instances for estimating the performance of supervised learning (to avoid overfitting)
– Stopping at the minimum error on the validation set
Notes on backpropagation
• It can get stuck at local minima
• In practice, the local minima found are close to the global one
• Multiple trainings starting from various randomly initialised weights might help
– we can take the trained network with the minimal error (on a validation set)
– there are voting schemes for combining the networks
Questions of network design
• How many hidden neurons?
– too few neurons cannot learn complex patterns
– too many neurons can easily overfit
– validation set?
• Learning rate!?
Outlook

History of neural networks
• Perceptron: one of the first machine learners, ~1950
• Backpropagation: multilayer perceptrons, 1975-
• Deep learning: popular again, 2006-

Deep learning (auto-encoder pretraining)
Recurrent neural networks: short-term memory