
CONTENTS

1. Introduction
2. Probability Distributions
3. Linear Models for Regression
4. Linear Models for Classification
5. Neural Networks
6. Kernel Methods

Feed-forward Network Functions


Multilayer Perceptron (MLP)
Mathematical representations of information processing in
biological systems
Linear models for regression and classification

Neural network model

(Slides based on C. M. Bishop, Pattern Recognition and Machine Learning, Chapter 5: Neural Networks.)

The two-layer network function is

$y_k(\mathbf{x}, \mathbf{w}) = \sigma\left( \sum_{j=1}^{M} w_{kj}^{(2)}\, h\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right) + w_{k0}^{(2)} \right),$

where $h(\cdot)$ is the hidden-unit activation function and $\sigma(\cdot)$ is the output-unit activation function: the identity for regression, the logistic sigmoid for binary classification, or the softmax activation function

$y_k(\mathbf{x}, \mathbf{w}) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$

for multiclass classification.

The neural network model is simply a nonlinear function from a set of input variables $\{x_i\}$ to a set of output variables $\{y_k\}$, controlled by a vector $\mathbf{w}$ of adjustable weight parameters. In contrast to probabilistic graphical models, the internal (hidden) units of a neural network are deterministic variables. We can absorb the biases into the weights by giving the additional input variable $x_0$, clamped at $x_0 = 1$, so that

$y_k(\mathbf{x}, \mathbf{w}) = \sigma\left( \sum_{j=0}^{M} w_{kj}^{(2)}\, h\left( \sum_{i=0}^{D} w_{ji}^{(1)} x_i \right) \right).$
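As a concrete illustration, here is a minimal sketch (not from the slides; the array shapes, names, and the tanh hidden activation are assumptions) of evaluating this two-layer network function:

```python
# Sketch: forward pass of a two-layer network with the bias absorbed via x_0 = 1.
import numpy as np

def forward(x, W1, W2, output="identity"):
    """x: (D,) input; W1: (M, D+1); W2: (K, M+1). Returns the output vector y."""
    x_tilde = np.concatenate(([1.0], x))        # prepend x_0 = 1 (bias input)
    a_hidden = W1 @ x_tilde                     # first-layer activations a_j
    z = np.tanh(a_hidden)                       # hidden-unit outputs z_j = h(a_j)
    z_tilde = np.concatenate(([1.0], z))        # prepend z_0 = 1 (hidden bias)
    a_out = W2 @ z_tilde                        # output activations a_k
    if output == "sigmoid":                     # binary classification
        return 1.0 / (1.0 + np.exp(-a_out))
    if output == "softmax":                     # multiclass classification
        e = np.exp(a_out - a_out.max())
        return e / e.sum()
    return a_out                                # identity output for regression
```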


Neural networks are said to be universal approximators: a two-layer network with linear outputs can approximate any continuous function on a compact input domain to arbitrary accuracy, provided it has a sufficiently large number of hidden units.



Objective for Network Training

Given input vectors $\{\mathbf{x}_n\}$, $n = 1, \dots, N$, and target values $\{t_n\}$, network training is based on the minimization of a sum-of-squares error function

$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}) - t_n \}^2.$

Relation to the maximum likelihood criterion: assuming Gaussian noise on the targets, $p(t \mid \mathbf{x}, \mathbf{w}) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})$, the likelihood function is

$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(\mathbf{x}_n, \mathbf{w}), \beta^{-1}).$

Taking the negative logarithm, we have

$\frac{\beta}{2} \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}) - t_n \}^2 - \frac{N}{2} \ln \beta + \frac{N}{2} \ln (2\pi).$



ML Minimization of Sum-of-Squares Error Function

Maximizing the likelihood is therefore equivalent to minimizing the sum-of-squares error function

$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}) - t_n \}^2.$

For the case of multiple target variables,

$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \| \mathbf{y}(\mathbf{x}_n, \mathbf{w}) - \mathbf{t}_n \|^2.$
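A minimal sketch of this error, reusing the hypothetical `forward` function sketched earlier (shapes are assumptions):

```python
# Sketch: sum-of-squares error E(w) = 1/2 * sum_n ||y(x_n, w) - t_n||^2.
import numpy as np

def sum_of_squares_error(X, T, W1, W2):
    """X: (N, D) inputs; T: (N, K) targets; W1, W2: weight matrices."""
    Y = np.array([forward(x, W1, W2) for x in X])   # network outputs, shape (N, K)
    return 0.5 * np.sum((Y - T) ** 2)
```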


Cross Entropy Error Function

Consider a binary classification problem and let $t \in \{0, 1\}$, with a single logistic-sigmoid output so that $p(t \mid \mathbf{x}, \mathbf{w}) = y(\mathbf{x}, \mathbf{w})^{t} \{ 1 - y(\mathbf{x}, \mathbf{w}) \}^{1-t}$.

The cross-entropy error function is

$E(\mathbf{w}) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln (1 - y_n) \}, \qquad y_n = y(\mathbf{x}_n, \mathbf{w}).$

In the case of $K$ separate binary classifications,

$E(\mathbf{w}) = -\sum_{n=1}^{N} \sum_{k=1}^{K} \{ t_{nk} \ln y_{nk} + (1 - t_{nk}) \ln (1 - y_{nk}) \}.$


For standard multiclass classification with a 1-of-K coding scheme, the likelihood is $\prod_n \prod_k y_k(\mathbf{x}_n, \mathbf{w})^{t_{nk}}$. Taking the negative logarithm of the likelihood function then gives the cross-entropy error function

$E(\mathbf{w}) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_k(\mathbf{x}_n, \mathbf{w}),$

where the output-unit activation function is the softmax.
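A small sketch of the three cross-entropy error functions above, given network outputs and targets as NumPy arrays (shapes and names are assumptions; outputs are taken to lie strictly inside (0, 1)):

```python
import numpy as np

def binary_cross_entropy(y, t):
    # E(w) = -sum_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) }
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def multilabel_cross_entropy(Y, T):
    # K independent binary classifications: sum over n and k
    return -np.sum(T * np.log(Y) + (1 - T) * np.log(1 - Y))

def multiclass_cross_entropy(Y, T):
    # 1-of-K coding with softmax outputs: E(w) = -sum_n sum_k t_nk ln y_nk
    return -np.sum(T * np.log(Y))
```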


Parameter Optimization

The smallest value of $E(\mathbf{w})$ will occur at a point in weight space at which the gradient vanishes, $\nabla E(\mathbf{w}) = 0$.

Local quadratic approximation: if $\hat{\mathbf{w}}$ is some point around which we expand, the Taylor expansion is

$E(\mathbf{w}) \simeq E(\hat{\mathbf{w}}) + (\mathbf{w} - \hat{\mathbf{w}})^{\mathrm{T}} \mathbf{b} + \frac{1}{2} (\mathbf{w} - \hat{\mathbf{w}})^{\mathrm{T}} \mathbf{H} (\mathbf{w} - \hat{\mathbf{w}}),$

where $\mathbf{b} \equiv \left. \nabla E \right|_{\hat{\mathbf{w}}}$ and $\mathbf{H} = \left. \nabla \nabla E \right|_{\hat{\mathbf{w}}}$ is the Hessian matrix.


From the quadratic approximation we obtain the corresponding local approximation to the gradient, $\nabla E \simeq \mathbf{b} + \mathbf{H} (\mathbf{w} - \hat{\mathbf{w}})$.

Let $\mathbf{w}^{\star}$ be a local minimum, so that $\nabla E(\mathbf{w}^{\star}) = 0$ and

$E(\mathbf{w}) \simeq E(\mathbf{w}^{\star}) + \frac{1}{2} (\mathbf{w} - \mathbf{w}^{\star})^{\mathrm{T}} \mathbf{H} (\mathbf{w} - \mathbf{w}^{\star}).$

For a geometric interpretation, we can perform an eigen-analysis of the Hessian,

$\mathbf{H} \mathbf{u}_i = \lambda_i \mathbf{u}_i,$

where the $\mathbf{u}_i$ are orthonormal vectors, i.e., $\mathbf{u}_i^{\mathrm{T}} \mathbf{u}_j = \delta_{ij}$.

We expand $\mathbf{w} - \mathbf{w}^{\star} = \sum_i \alpha_i \mathbf{u}_i$, then

$E(\mathbf{w}) = E(\mathbf{w}^{\star}) + \frac{1}{2} \sum_i \lambda_i \alpha_i^2.$


$\mathbf{H}$ is positive definite if, and only if,

$\mathbf{v}^{\mathrm{T}} \mathbf{H} \mathbf{v} > 0 \quad \text{for all } \mathbf{v} \neq \mathbf{0},$

or, equivalently, if its eigenvalues $\lambda_i$ are all positive.
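A small numerical sketch of this eigen-analysis, assuming a Hessian is already available as a symmetric NumPy array (names are illustrative):

```python
import numpy as np

def quadratic_error(H, E_min, w, w_star):
    """E(w) ~= E(w*) + 1/2 (w - w*)^T H (w - w*) near a local minimum w*."""
    d = w - w_star
    return E_min + 0.5 * d @ H @ d

def is_positive_definite(H):
    eigvals = np.linalg.eigvalsh(H)   # eigenvalues lambda_i of the symmetric H
    return np.all(eigvals > 0), eigvals
```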


Gradient Descent Optimization

Batch gradient descent:

$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \, \nabla E(\mathbf{w}^{(\tau)}),$

where the parameter $\eta > 0$ is the learning rate.

Sequential (on-line) gradient descent: if the error function comprises a sum over data points, $E(\mathbf{w}) = \sum_{n=1}^{N} E_n(\mathbf{w})$, then the weights are updated one data point at a time,

$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \, \nabla E_n(\mathbf{w}^{(\tau)}).$
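A minimal sketch contrasting the two update rules (the gradient callables `grad_E` and `grad_E_n` are assumptions, standing in for whatever model is being trained):

```python
import numpy as np

def batch_gradient_descent(w, grad_E, eta=0.01, steps=1000):
    for _ in range(steps):
        w = w - eta * grad_E(w)              # w <- w - eta * grad E(w)
    return w

def sequential_gradient_descent(w, grad_E_n, N, eta=0.01, epochs=10, rng=None):
    rng = rng or np.random.default_rng(0)
    for _ in range(epochs):
        for n in rng.permutation(N):         # one data point at a time
            w = w - eta * grad_E_n(w, n)     # w <- w - eta * grad E_n(w)
    return w
```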


Error Backpropagation Algorithm

We try to find an efficient way of evaluating the gradient $\nabla E_n(\mathbf{w})$ for a feed-forward neural network.

In a general feed-forward network, each unit computes a weighted sum of its inputs,

$a_j = \sum_i w_{ji} z_i, \qquad z_j = h(a_j),$

where $z_i$ is the activation of a unit, or input, that sends a connection to unit $j$.


Evaluation of Error-Function Derivatives

Define the errors $\delta_j \equiv \partial E_n / \partial a_j$. For the output units, $\delta_k = y_k - t_k$; for a hidden unit, summing over all units $k$ to which unit $j$ sends a connection gives the backpropagation formula

$\delta_j = h'(a_j) \sum_k w_{kj} \delta_k.$

Finally,

$\frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i.$


Error Backpropagation Procedure

1. Apply an input vector $\mathbf{x}_n$ to the network and forward propagate to find the activations of all hidden and output units.
2. Evaluate $\delta_k = y_k - t_k$ for all output nodes.
3. Backpropagate the $\delta$'s using $\delta_j = h'(a_j) \sum_k w_{kj} \delta_k$ for each hidden unit.
4. Find the required derivatives $\partial E_n / \partial w_{ji}^{(1)} = \delta_j x_i$ and $\partial E_n / \partial w_{kj}^{(2)} = \delta_k z_j$ for all weights in the first layer and second layer.
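The following sketch (assumed shapes and names, not the slides' code) implements these four steps for the earlier two-layer network with tanh hidden units, linear outputs, and a sum-of-squares error:

```python
import numpy as np

def backprop(x, t, W1, W2):
    """x: (D,) input; t: (K,) target; W1: (M, D+1); W2: (K, M+1)."""
    # 1. Forward propagate to get all activations
    x_tilde = np.concatenate(([1.0], x))
    a1 = W1 @ x_tilde
    z = np.tanh(a1)
    z_tilde = np.concatenate(([1.0], z))
    y = W2 @ z_tilde                          # linear outputs
    # 2. Output-unit errors: delta_k = y_k - t_k
    delta_out = y - t
    # 3. Backpropagate: delta_j = h'(a_j) * sum_k w_kj delta_k, with h'(a) = 1 - tanh(a)^2
    delta_hidden = (1.0 - z ** 2) * (W2[:, 1:].T @ delta_out)
    # 4. Required derivatives: dEn/dw_ji = delta_j * (input to that weight)
    grad_W1 = np.outer(delta_hidden, x_tilde)
    grad_W2 = np.outer(delta_out, z_tilde)
    return grad_W1, grad_W2
```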


Logistic Sigmoid Activation Function

For hidden units with $h(a) = \sigma(a) = \frac{1}{1 + e^{-a}}$, the derivative takes the simple form $h'(a) = \sigma(a)\{1 - \sigma(a)\} = z(1 - z)$, so the backpropagation formula becomes $\delta_j = z_j (1 - z_j) \sum_k w_{kj} \delta_k$.


The Hessian Matrix

The backpropagation algorithm for the first derivatives can be extended to the second derivatives of the error, given by

$\frac{\partial^2 E}{\partial w_{ji} \, \partial w_{lk}}.$

The Hessian matrix plays an important role in many aspects of neural computing. Various approximation schemes have been used to evaluate the Hessian matrix for a neural network. Exact computation of the Hessian is also available, with cost $O(W^2)$, where $W$ is the number of weights.

Diagonal Approximation

For this case, the diagonal elements are computed by

$\frac{\partial^2 E_n}{\partial w_{ji}^2} = \frac{\partial^2 E_n}{\partial a_j^2} \, z_i^2.$

If we neglect off-diagonal elements in the second-derivative terms, we obtain

$\frac{\partial^2 E_n}{\partial a_j^2} \simeq h'(a_j)^2 \sum_k w_{kj}^2 \, \frac{\partial^2 E_n}{\partial a_k^2} + h''(a_j) \sum_k w_{kj} \, \frac{\partial E_n}{\partial a_k}.$


Outer Product Approximation

In the case of a single output, the Hessian of the sum-of-squares error is

$\mathbf{H} = \nabla \nabla E = \sum_{n=1}^{N} \nabla y_n \, \nabla y_n^{\mathrm{T}} + \sum_{n=1}^{N} (y_n - t_n) \, \nabla \nabla y_n.$

If the network is well trained, $y_n \simeq t_n$ and the second term can be neglected, so we have the outer-product approximation

$\mathbf{H} \simeq \sum_{n=1}^{N} \mathbf{b}_n \mathbf{b}_n^{\mathrm{T}}, \qquad \mathbf{b}_n \equiv \nabla y_n = \nabla a_n.$


Inverse Hessian

Using the outer-product approximation,

$\mathbf{H}_N = \sum_{n=1}^{N} \mathbf{b}_n \mathbf{b}_n^{\mathrm{T}},$

where $\mathbf{b}_n = \nabla_{\mathbf{w}} a_n$. A sequential procedure builds the Hessian from the first $L$ data points,

$\mathbf{H}_{L+1} = \mathbf{H}_L + \mathbf{b}_{L+1} \mathbf{b}_{L+1}^{\mathrm{T}}.$

Using the Woodbury identity

$(\mathbf{M} + \mathbf{v} \mathbf{v}^{\mathrm{T}})^{-1} = \mathbf{M}^{-1} - \frac{(\mathbf{M}^{-1} \mathbf{v})(\mathbf{v}^{\mathrm{T}} \mathbf{M}^{-1})}{1 + \mathbf{v}^{\mathrm{T}} \mathbf{M}^{-1} \mathbf{v}},$

we obtain

$\mathbf{H}_{L+1}^{-1} = \mathbf{H}_L^{-1} - \frac{\mathbf{H}_L^{-1} \mathbf{b}_{L+1} \mathbf{b}_{L+1}^{\mathrm{T}} \mathbf{H}_L^{-1}}{1 + \mathbf{b}_{L+1}^{\mathrm{T}} \mathbf{H}_L^{-1} \mathbf{b}_{L+1}}.$
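A sketch of this sequential inverse-Hessian update under the outer-product approximation (the rows of `B` are assumed to hold the vectors $\mathbf{b}_n$; the small initializer $\alpha$ is a common convention, not from the slides):

```python
import numpy as np

def inverse_hessian(B, alpha=1e-3):
    """B: (N, W) array whose rows are b_n; starts from H_0 = alpha * I."""
    W = B.shape[1]
    H_inv = np.eye(W) / alpha                 # inverse of H_0 = alpha * I
    for b in B:
        Hb = H_inv @ b                        # H_L^{-1} b_{L+1}
        H_inv = H_inv - np.outer(Hb, Hb) / (1.0 + b @ Hb)   # Woodbury update
    return H_inv
```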


5.5 Regularization in Neural Networks

The optimum value of M, the number of hidden units, gives the best generalization performance, i.e., the optimum balance between under-fitting and over-fitting (Figure 5.9).

Generalization can instead be controlled by adding a quadratic penalty to the error function,

$\tilde{E}(\mathbf{w}) = E(\mathbf{w}) + \frac{\lambda}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}.$

This regularizer is also known as weight decay, and $\lambda$ is the regularization coefficient.
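In gradient-based training, weight decay simply adds a term to the gradient; a one-line sketch (with an assumed `grad_E` callable for the unregularized gradient):

```python
def regularized_gradient(w, grad_E, lam=1e-3):
    # gradient of E~(w) = E(w) + (lambda/2) w^T w  is  grad E(w) + lambda * w
    return grad_E(w) + lam * w
```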

5.5.1 Consistent Gaussian priors

Simple weight decay is inconsistent with certain scaling properties of the network mapping, because it treats all weights and biases on an equal footing and so does not satisfy the requirement that a linear transformation of the inputs or targets should map to an equivalent network. The regularizer is therefore modified to

$\frac{\lambda_1}{2} \sum_{w \in \mathcal{W}_1} w^2 + \frac{\lambda_2}{2} \sum_{w \in \mathcal{W}_2} w^2,$

where $\mathcal{W}_1$ and $\mathcal{W}_2$ denote the sets of first-layer and second-layer weights (biases excluded). This corresponds to a prior of the form

$p(\mathbf{w} \mid \alpha_1, \alpha_2) \propto \exp\left( -\frac{\alpha_1}{2} \sum_{w \in \mathcal{W}_1} w^2 - \frac{\alpha_2}{2} \sum_{w \in \mathcal{W}_2} w^2 \right).$


5.5.2 Early stopping

Early stopping is a way of controlling the effective complexity of a network. Our goal is to achieve the best generalization.

Figure 5.12 plots the training-set error and the validation-set error against the number of training iterations: training is stopped at the point of smallest validation-set error.
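A sketch of such a stopping rule (the callables `train_step` and `validation_error`, and the `patience` heuristic, are assumptions):

```python
import copy

def train_with_early_stopping(w, train_step, validation_error,
                              max_iters=10000, patience=20):
    best_w, best_err, since_best = copy.deepcopy(w), float("inf"), 0
    for _ in range(max_iters):
        w = train_step(w)                     # one gradient-descent iteration
        err = validation_error(w)
        if err < best_err:
            best_w, best_err, since_best = copy.deepcopy(w), err, 0
        else:
            since_best += 1
            if since_best >= patience:        # validation error no longer improving
                break
    return best_w                             # weights at smallest validation error
```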

Figure 5.13: a schematic illustration of why early stopping can give similar results to weight decay in the case of a quadratic error function.

Early stopping therefore exhibits behavior similar to regularization using a weight-decay term.

5.5.3 Invariances

One approach is to augment the training set using replicas of the training patterns, transformed according to the desired invariances (e.g., translation or scale invariance); see Figure 5.14.
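A small sketch of this augmentation (the random shift of 1-D signals is only an illustrative stand-in for whichever transformation matches the desired invariance):

```python
import numpy as np

def augment_with_shifts(X, T, n_replicas=2, max_shift=2, rng=None):
    """X: (N, D) patterns; T: (N, ...) targets. Returns the enlarged training set."""
    rng = rng or np.random.default_rng(0)
    X_aug, T_aug = [X], [T]
    for _ in range(n_replicas):
        shifts = rng.integers(-max_shift, max_shift + 1, size=len(X))
        X_aug.append(np.array([np.roll(x, s) for x, s in zip(X, shifts)]))
        T_aug.append(T)                       # targets unchanged by the transformation
    return np.concatenate(X_aug), np.concatenate(T_aug)
```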


5.5.4 Tangent propagation

We can use regularization to encourage models to be invariant to transformations of the input through the technique of tangent propagation (Figure 5.15).


5.5.4 Tangent propagation

Let $\mathbf{s}(\mathbf{x}_n, \xi)$ be the vector that results from acting on $\mathbf{x}_n$ by the transformation, with $\mathbf{s}(\mathbf{x}_n, 0) = \mathbf{x}_n$ and $\xi$ a continuous parameter. The tangent vector at the point $\mathbf{x}_n$ is given by

$\boldsymbol{\tau}_n = \left. \frac{\partial \mathbf{s}(\mathbf{x}_n, \xi)}{\partial \xi} \right|_{\xi = 0}.$

Under a transformation of the input vector, the network output vector will, in general, change; the derivative of output $k$ with respect to $\xi$ is

$\left. \frac{\partial y_k}{\partial \xi} \right|_{\xi = 0} = \sum_{i=1}^{D} J_{ki} \tau_i,$

where $J_{ki}$ is an element of the Jacobian matrix.


5.5.4 Tangent propagation (continued)

Local invariance in the neighborhood of the data points is encouraged by the addition of a regularization function $\Omega$, giving the total error

$\tilde{E} = E + \lambda \Omega,$

where $\lambda$ is a regularization coefficient and (see the sketch after the notes below)

$\Omega = \frac{1}{2} \sum_n \sum_k \left( \sum_{i=1}^{D} J_{nki} \, \tau_{ni} \right)^2.$

Notes:
1. $\Omega = 0$ in case the network mapping function is invariant under the transformation in the neighborhood of each pattern vector.
2. The value of $\lambda$ determines the balance between fitting the training data and learning the invariance property.
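A minimal sketch of evaluating $\Omega$, approximating the directional derivative $\sum_i J_{ki}\tau_i$ by a central finite difference along the tangent vector (the `net` callable and step size are assumptions; automatic differentiation could be used instead):

```python
import numpy as np

def tangent_regularizer(net, X, tangents, eps=1e-4):
    """net: callable x -> y; X: (N, D) inputs; tangents: (N, D) tangent vectors."""
    omega = 0.0
    for x, tau in zip(X, tangents):
        # (y(x + eps*tau) - y(x - eps*tau)) / (2*eps) ~= J tau
        dy = (net(x + eps * tau) - net(x - eps * tau)) / (2.0 * eps)
        omega += 0.5 * np.sum(dy ** 2)
    return omega
```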

Figure 5.16


5.5.5 Training with transformed data

Consider perturbation by a transformation governed by a single parameter $\xi$ drawn from a density $p(\xi)$ with zero mean and small variance, so that each input $\mathbf{x}$ becomes $\mathbf{s}(\mathbf{x}, \xi)$ with $\mathbf{s}(\mathbf{x}, 0) = \mathbf{x}$.

Expand the transformation as a Taylor series in $\xi$:

$\mathbf{s}(\mathbf{x}, \xi) = \mathbf{x} + \xi \, \boldsymbol{\tau} + \frac{\xi^2}{2} \boldsymbol{\tau}' + O(\xi^3),$

where $\boldsymbol{\tau} = \left. \partial \mathbf{s} / \partial \xi \right|_{\xi=0}$ and $\boldsymbol{\tau}' = \left. \partial^2 \mathbf{s} / \partial \xi^2 \right|_{\xi=0}$.


5.5.5 Training with transformed data (continued)

Expand the model function as a Taylor series in $\xi$ to give

$y(\mathbf{s}(\mathbf{x}, \xi)) = y(\mathbf{x}) + \xi \, \boldsymbol{\tau}^{\mathrm{T}} \nabla y(\mathbf{x}) + \frac{\xi^2}{2} \left[ (\boldsymbol{\tau}')^{\mathrm{T}} \nabla y(\mathbf{x}) + \boldsymbol{\tau}^{\mathrm{T}} \nabla \nabla y(\mathbf{x}) \, \boldsymbol{\tau} \right] + O(\xi^3).$


5.5.5 Training with transformed data (continued)

Omitting terms of $O(\xi^3)$, the average error function becomes

$\tilde{E} = E + \lambda \Omega,$

where $\lambda = \mathbb{E}[\xi^2]$ and the regularization term takes the form

$\Omega = \frac{1}{2} \int \left[ \{ y(\mathbf{x}) - \mathbb{E}[t \mid \mathbf{x}] \} \left\{ (\boldsymbol{\tau}')^{\mathrm{T}} \nabla y(\mathbf{x}) + \boldsymbol{\tau}^{\mathrm{T}} \nabla \nabla y(\mathbf{x}) \, \boldsymbol{\tau} \right\} + \left( \boldsymbol{\tau}^{\mathrm{T}} \nabla y(\mathbf{x}) \right)^2 \right] p(\mathbf{x}) \, \mathrm{d}\mathbf{x},$

in which we have performed the integration over $\xi$.


5.5.5 Training with transformed data (continued)

To simplify the regularization term, note that the function which minimizes the total error is $y(\mathbf{x}) = \mathbb{E}[t \mid \mathbf{x}] + O(\xi)$. Thus, to order $\xi$, the first term in the regularizer vanishes and we are left with

$\Omega = \frac{1}{2} \int \left( \boldsymbol{\tau}^{\mathrm{T}} \nabla y(\mathbf{x}) \right)^2 p(\mathbf{x}) \, \mathrm{d}\mathbf{x}.$

In the case where the transformation consists simply of the addition of random noise to the inputs, $\mathbf{x} \to \mathbf{x} + \boldsymbol{\xi}$, the regularizer takes the form

$\Omega = \frac{1}{2} \int \| \nabla y(\mathbf{x}) \|^2 \, p(\mathbf{x}) \, \mathrm{d}\mathbf{x},$

which is known as Tikhonov regularization. We can train an MLP by adding this regularization function to the error.
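Equivalently, for small noise one can simply train on noise-perturbed inputs; a sketch (the per-pattern gradient callable `grad_E_n` and the noise scale are assumptions):

```python
import numpy as np

def noisy_training_epoch(w, X, T, grad_E_n, eta=0.01, noise_std=0.05, rng=None):
    rng = rng or np.random.default_rng(0)
    for x, t in zip(X, T):
        x_noisy = x + rng.normal(0.0, noise_std, size=x.shape)  # x -> x + xi
        w = w - eta * grad_E_n(w, x_noisy, t)
    return w
```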

5.5.7 Soft weight sharing

One way to reduce the effective complexity is to constrain weights within certain groups to be equal. Soft weight sharing instead encourages the weight values to form several groups, rather than just one group, by considering a prior that is a mixture of Gaussians,

$p(\mathbf{w}) = \prod_i p(w_i), \qquad p(w_i) = \sum_{j=1}^{M} \pi_j \, \mathcal{N}(w_i \mid \mu_j, \sigma_j^2),$

where the $\pi_j$ are the mixing coefficients.



5.5.7 Soft weight sharing (continued)

Taking the negative logarithm then leads to a regularization term

$\Omega(\mathbf{w}) = -\sum_i \ln \left( \sum_{j=1}^{M} \pi_j \, \mathcal{N}(w_i \mid \mu_j, \sigma_j^2) \right),$

and the total error is given by

$\tilde{E}(\mathbf{w}) = E(\mathbf{w}) + \lambda \, \Omega(\mathbf{w}).$

The parameters $\{\pi_j, \mu_j, \sigma_j\}$ could be determined by the EM algorithm if the weights were constant. Because the distribution of weights is itself evolving during the learning process, a joint optimization is instead performed simultaneously over the weights and the mixture-model parameters.
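A minimal sketch of evaluating this penalty (mixture parameters as NumPy arrays; names are illustrative, and in practice $\Omega$ would be minimized jointly with the weights):

```python
import numpy as np

def soft_weight_sharing_penalty(w, pis, mus, sigmas):
    """w: (W,) weights; pis, mus, sigmas: (M,) mixture parameters."""
    w = w[:, None]                                        # (W, 1) for broadcasting
    gauss = np.exp(-0.5 * ((w - mus) / sigmas) ** 2) / (sigmas * np.sqrt(2 * np.pi))
    # Omega(w) = -sum_i ln sum_j pi_j N(w_i | mu_j, sigma_j^2)
    return -np.sum(np.log(np.sum(pis * gauss, axis=1)))
```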
