Albert-Ludwigs-University Freiburg
AG Maschinelles Lernen
we want a generic model that can adapt to some training data
basic idea: the multi layer perceptron (Werbos 1974; Rumelhart, McClelland, Hinton 1986), also called feed forward network
~x -> f_log(w0 + <~w, ~x>)

with

f_log(z) = 1 / (1 + e^(-z))

f_log is called logistic function
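As a quick sanity check, the logistic function and a single logistic neuron can be sketched in a few lines of Python (the function names are illustrative, not from the lecture):

```python
import math

def f_log(z):
    # logistic function: maps any real z into the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, w0):
    # a single neuron with logistic activation: x -> f_log(w0 + <w, x>)
    return f_log(w0 + sum(wi * xi for wi, xi in zip(w, x)))
```

Note that f_log(0) = 0.5, and the output saturates towards 0 and 1 for large negative and positive net inputs.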
[Figure: multi layer network with input neurons x_1, x_2, ..., x_n feeding successive layers of neurons]
neurons of the i-th layer serve as input features for neurons of the (i+1)-th layer
very complex functions can be calculated by combining many neurons
Machine Learning: Multi Layer Perceptrons p.5/61
Multi layer perceptrons (cont.)
multi layer perceptrons, more formally:
An MLP is a finite directed acyclic graph.
nodes that are not the target of any connection are called input neurons. An MLP that should be applied to input patterns of dimension n must have n input neurons, one for each dimension. Input neurons are typically enumerated as neuron 1, neuron 2, neuron 3, ...
nodes that are not the source of any connection are called output neurons. An MLP can have more than one output neuron. The number of output neurons depends on the way the target values (desired values) of the training patterns are described.
all nodes that are neither input neurons nor output neurons are called hidden neurons.
since the graph is acyclic, all neurons can be organized in layers, with the set of input neurons forming the first layer.
a_i <- f_log(net_i)
the network output is given by the ai of the output neurons
Multi layer perceptrons (cont.)
illustration:
apply pattern ~x = (x_1, x_2)^T
calculate activation of the input neurons: a_i <- x_i
propagate forward the activations: step by step
read the network output from both output neurons
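The forward propagation steps above can be sketched for a tiny two-layer network; the layer sizes and weight values below are arbitrary illustrative choices, not values from the lecture:

```python
import math

def f_log(z):
    # logistic activation function
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, layers):
    """Propagate activations step by step through the layers.
    layers: list of (W, b) pairs; W is a list of weight rows, b a list of biases."""
    a = list(x)  # activations of the input neurons: a_i <- x_i
    for W, b in layers:
        # each neuron j computes a_j = f_log(net_j), net_j = b_j + sum_i w_ji * a_i
        a = [f_log(b_j + sum(w_ji * a_i for w_ji, a_i in zip(row, a)))
             for row, b_j in zip(W, b)]
    return a  # activations of the output neurons

# example: 2 inputs -> 2 hidden neurons -> 2 output neurons
layers = [([[0.5, -0.3], [0.8, 0.2]], [0.1, -0.1]),
          ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0])]
out = forward([1.0, 2.0], layers)
```

The network output is read from the activations of both output neurons, exactly as in the illustration.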
Neurons with logistic activation can only output values between 0 and 1. To allow outputs over the whole range of real numbers, variants are used:

neurons with tanh activation function:

a_i = tanh(net_i) = (e^(net_i) - e^(-net_i)) / (e^(net_i) + e^(-net_i))

neurons with linear activation:

a_i = net_i

[Figure: plots of the tanh and linear activation functions]
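A small check (illustrative, not part of the lecture) that the tanh formula above matches the library implementation, and that tanh is just a rescaled logistic function:

```python
import math

def f_log(z):
    # logistic function
    return 1.0 / (1.0 + math.exp(-z))

def tanh_from_exp(net):
    # a_i = tanh(net_i) = (e^net - e^-net) / (e^net + e^-net)
    return (math.exp(net) - math.exp(-net)) / (math.exp(net) + math.exp(-net))

# identity: tanh(z) = 2 * f_log(2z) - 1, so tanh neurons output values in (-1, 1)
```

This identity explains why tanh networks behave like logistic networks up to a shift and scaling of the activations.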
given training data: D = {(~x^(1), ~d^(1)), ..., (~x^(p), ~d^(p))} where ~d^(i) is the desired output (real number for regression, class label 0 or 1 for classification)
given the topology of an MLP
task: adapt the weights of the MLP
error measure:

E(~w; D) = 1/2 sum_{i=1}^{p} ||y(~x^(i); ~w) - ~d^(i)||^2

with y(~x; ~w): network output for input pattern ~x and weight vector ~w,
||~u||^2: squared length of vector ~u: ||~u||^2 = sum_{j=1}^{dim(~u)} (u_j)^2
learning means: calculating weights for which the error becomes minimal

minimize_{~w} E(~w; D)
minimize_{~u} f(~u)

~u can be any vector of suitable size. But which one solves this task and how can we calculate it?
some simplifications: here we consider only functions f which are continuous and differentiable

[Figure: a non-continuous (disrupted) function, a continuous but non-differentiable (folded) function, and a differentiable (smooth) function]
but: there are also other points for which ∂f/∂u_i = 0, and resolving these equations is often not possible
gradient descent update, for all i:

u_i <- u_i - ε · ∂f/∂u_i

ε > 0 is called learning rate
zig-zagging: in higher dimensions, a single learning rate is not appropriate for all dimensions
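The gradient descent update u_i <- u_i - ε ∂f/∂u_i can be sketched on a simple quadratic; the target function and learning rate below are illustrative choices, picked to show how very different curvature per dimension stresses a single learning rate:

```python
def grad_descent(grad, u, eps, steps):
    # repeatedly step against the gradient with learning rate eps > 0
    for _ in range(steps):
        g = grad(u)
        u = [ui - eps * gi for ui, gi in zip(u, g)]
    return u

# f(u) = u1^2 + 10*u2^2: curvature differs strongly between dimensions,
# which causes zig-zagging when one learning rate serves all dimensions
grad_f = lambda u: [2.0 * u[0], 20.0 * u[1]]
u_min = grad_descent(grad_f, [3.0, 1.0], eps=0.05, steps=100)
```

With ε = 0.05 the steep dimension converges immediately while the flat one shrinks only by a factor 0.9 per step; a larger ε would make the steep dimension diverge.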
Rprop
Newton's method: approximate f by its second order Taylor expansion

f(~u + ~Δ) ≈ f(~u) + grad f(~u) · ~Δ + 1/2 ~Δ^T H(~u) ~Δ

with H(~u) the Hessian of f at ~u, the matrix of second order partial derivatives.

Zeroing the gradient of the approximation with respect to ~Δ:

~0 ≈ (grad f(~u))^T + H(~u) ~Δ

Hence:

~Δ ≈ -(H(~u))^(-1) (grad f(~u))^T
advantages: no learning rate, no parameters, quick convergence
disadvantages: calculation of H and H^(-1) is very time consuming in high dimensional spaces
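For a quadratic f the Newton step ~Δ = -(H)^(-1) (grad f)^T reaches the minimum in a single update. A sketch with a diagonal Hessian (so the inverse is elementwise, the trick Quickprop also exploits); function and values are illustrative:

```python
def newton_step(u, grad, hess_diag):
    # delta = -H^{-1} (grad f)^T; with a diagonal Hessian the
    # inverse is just an elementwise division
    g = grad(u)
    h = hess_diag(u)
    return [ui - gi / hi for ui, gi, hi in zip(u, g, h)]

# f(u) = u1^2 + 10*u2^2: grad f = (2 u1, 20 u2), diagonal of H = (2, 20)
u = newton_step([3.0, 1.0],
                lambda u: [2.0 * u[0], 20.0 * u[1]],
                lambda u: [2.0, 20.0])
```

Unlike gradient descent, no learning rate appears, and for this quadratic the single step lands exactly on the minimum (0, 0).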
Beyond gradient descent (cont.)
Quickprop (Fahlman, 1988)
like Newton's method, but replaces H by a diagonal matrix containing only the diagonal entries of H
hence, calculating the inverse is simplified
replaces second order derivatives by approximations (difference ratios)
back to MLP training: the generic task minimize_{~u} f(~u) becomes

minimize_{~w} E(~w; D)

with E(~w; D) = 1/2 sum_{i=1}^{p} ||y(~x^(i); ~w) - ~d^(i)||^2
introducing e(~w; ~x, ~d) = 1/2 ||y(~x; ~w) - ~d||^2 we can write:

E(~w; D) = sum_{i=1}^{p} e(~w; ~x^(i), ~d^(i))
by direct calculation:

∂e/∂y_j = (y_j - d_j)

y_j is the activation of a certain output neuron, say a_i. Hence:

∂e/∂a_i = ∂e/∂y_j = (a_i - d_j)
∂e/∂a_i = sum_{j ∈ Succ(i)} ∂e/∂net_j · ∂net_j/∂a_i

and:

net_j = w_ji a_i + ...

hence:

∂net_j/∂a_i = w_ji
apply pattern ~x = (x_1, x_2)^T
propagate forward the activations: step by step
calculate the error, ∂e/∂a_i, and ∂e/∂net_i for the output neurons
propagate backward the error: step by step
calculate ∂e/∂w_ji
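The whole procedure (forward pass, error, backward pass, weight gradients) can be sketched for a single logistic output neuron and checked against finite differences; the network size and the data values here are illustrative:

```python
import math

def f_log(z):
    # logistic activation function
    return 1.0 / (1.0 + math.exp(-z))

def backprop_single_neuron(x, d, w, w0):
    """Gradients of e = 0.5 * (a - d)^2 for one logistic output neuron."""
    net = w0 + sum(wi * xi for wi, xi in zip(w, x))  # forward pass
    a = f_log(net)
    de_da = a - d                        # de/da = (a - d)
    de_dnet = de_da * a * (1.0 - a)      # f_log'(net) = a * (1 - a)
    de_dw = [de_dnet * xi for xi in x]   # de/dw_ji = de/dnet_j * a_i
    return a, de_dw, de_dnet             # de/dw0 equals de/dnet

x, d, w, w0 = [1.0, -2.0], 1.0, [0.3, 0.7], 0.1
a, de_dw, de_dw0 = backprop_single_neuron(x, d, w, w0)
```

Comparing de_dw against a central finite difference of e with respect to each weight is a standard way to validate a backpropagation implementation.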
the calculated gradients can be used by the minimization techniques discussed:
Quickprop
Rprop
...
learning by epoch: all training patterns are considered for one update step of function minimization
learning by pattern: only one training pattern is considered for one update step of function minimization (only works with vanilla gradient descent!)
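The difference between learning by epoch and learning by pattern can be sketched with vanilla gradient descent on a toy one-parameter model; the data and learning rate are illustrative:

```python
def grad_e(w, x, d):
    # gradient of e(w; x, d) = 0.5 * (w*x - d)^2 for a 1-d linear model
    return (w * x - d) * x

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # consistent with w = 2
eps = 0.05

# learning by epoch: one update step uses the gradient summed over all patterns
w_epoch = 0.0
for _ in range(200):
    w_epoch -= eps * sum(grad_e(w_epoch, x, d) for x, d in data)

# learning by pattern: one update step per individual training pattern
w_pattern = 0.0
for _ in range(200):
    for x, d in data:
        w_pattern -= eps * grad_e(w_pattern, x, d)
```

Both variants approach w = 2 here; per-pattern updates follow noisier individual gradients, which is why only vanilla gradient descent tolerates them.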
[Figure: average number of epochs vs. learning parameter (0.001 to 1) for Rprop and SSAB]
Learning behavior and parameter choice: 6 bit parity

[Figure: average number of epochs vs. learning parameter (0.0001 to 0.1) for BP, SSAB, QP, and Rprop on the 6 bit parity task]
Learning behavior and parameter choice: 10 encoder

[Figure: average number of epochs vs. learning parameter (0.001 to 10) for BP, SSAB, Rprop, and QP on the 10 encoder task]
Learning behavior and parameter choice: 12 encoder

[Figure: average number of epochs vs. learning parameter (0.001 to 1) for QP, SSAB, and Rprop on the 12 encoder task]
Learning behavior and parameter choice: two spirals

[Figure: average number of epochs vs. learning parameter (1e-05 to 0.1) for BP, SSAB, QP, and Rprop on the two spirals task]
Real-world examples: sales rate prediction