
CONTENTS

1. Introduction
2. Probability Distributions
3. Linear Models for Regression
4. Linear Models for Classification
5. Neural Networks
6. Kernel Methods

Feed-forward Network Functions


Multilayer Perceptron (MLP)
Mathematical representations of information processing in
biological systems
Linear models for regression and classification

Neural network model

(Slides based on C. M. Bishop, Pattern Recognition and Machine Learning, Chapter 5: Neural Networks.)

The two-layer network function is

$y_k(\mathbf{x}, \mathbf{w}) = \sigma\left( \sum_{j=1}^{M} w_{kj}^{(2)}\, h\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right) + w_{k0}^{(2)} \right),$

where $h(\cdot)$ is the hidden-unit activation function and $\sigma(\cdot)$ is the output-unit activation function: the identity for regression, the logistic sigmoid for binary classification, or the softmax activation function

$y_k(\mathbf{x}, \mathbf{w}) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$

for multiclass classification.

The neural network model is simply a nonlinear function from a set of input variables $\{x_i\}$ to a set of output variables $\{y_k\}$, controlled by a vector $\mathbf{w}$ of adjustable weight parameters. In contrast to probabilistic graphical models, the internal (hidden) units of a neural network are deterministic variables. We can absorb the biases into the weights by giving the additional input variable $x_0$, clamped at $x_0 = 1$, so that

$y_k(\mathbf{x}, \mathbf{w}) = \sigma\left( \sum_{j=0}^{M} w_{kj}^{(2)}\, h\left( \sum_{i=0}^{D} w_{ji}^{(1)} x_i \right) \right).$
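As a concrete illustration, here is a minimal sketch (not from the slides; the array shapes, names, and the tanh hidden activation are assumptions) of evaluating this two-layer network function:

```python
# Sketch: forward pass of a two-layer network with the bias absorbed via x_0 = 1.
import numpy as np

def forward(x, W1, W2, output="identity"):
    """x: (D,) input; W1: (M, D+1); W2: (K, M+1). Returns the output vector y."""
    x_tilde = np.concatenate(([1.0], x))        # prepend x_0 = 1 (bias input)
    a_hidden = W1 @ x_tilde                     # first-layer activations a_j
    z = np.tanh(a_hidden)                       # hidden-unit outputs z_j = h(a_j)
    z_tilde = np.concatenate(([1.0], z))        # prepend z_0 = 1 (hidden bias)
    a_out = W2 @ z_tilde                        # output activations a_k
    if output == "sigmoid":                     # binary classification
        return 1.0 / (1.0 + np.exp(-a_out))
    if output == "softmax":                     # multiclass classification
        e = np.exp(a_out - a_out.max())
        return e / e.sum()
    return a_out                                # identity output for regression
```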


Neural networks are said to be universal approximators: a two-layer network with linear outputs can approximate any continuous function on a compact input domain to arbitrary accuracy, provided it has a sufficiently large number of hidden units.



Objective for Network Training

Given input vectors $\{\mathbf{x}_n\}$, $n = 1, \dots, N$, and target values $\{t_n\}$, network training is based on the minimization of a sum-of-squares error function

$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}) - t_n \}^2.$

Relation to the maximum likelihood criterion: assuming Gaussian noise on the targets, $p(t \mid \mathbf{x}, \mathbf{w}) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})$, the likelihood function is

$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(\mathbf{x}_n, \mathbf{w}), \beta^{-1}).$

Taking the negative logarithm, we have

$\frac{\beta}{2} \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}) - t_n \}^2 - \frac{N}{2} \ln \beta + \frac{N}{2} \ln (2\pi).$



ML Minimization of Sum-of-Squares Error Function

Maximizing the likelihood is therefore equivalent to minimizing the sum-of-squares error function

$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}) - t_n \}^2.$

For the case of multiple target variables,

$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \| \mathbf{y}(\mathbf{x}_n, \mathbf{w}) - \mathbf{t}_n \|^2.$
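A minimal sketch of this error, reusing the hypothetical `forward` function sketched earlier (shapes are assumptions):

```python
# Sketch: sum-of-squares error E(w) = 1/2 * sum_n ||y(x_n, w) - t_n||^2.
import numpy as np

def sum_of_squares_error(X, T, W1, W2):
    """X: (N, D) inputs; T: (N, K) targets; W1, W2: weight matrices."""
    Y = np.array([forward(x, W1, W2) for x in X])   # network outputs, shape (N, K)
    return 0.5 * np.sum((Y - T) ** 2)
```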


Cross Entropy Error Function

Consider a binary classification problem and let $t \in \{0, 1\}$, with a single logistic-sigmoid output so that $p(t \mid \mathbf{x}, \mathbf{w}) = y(\mathbf{x}, \mathbf{w})^{t} \{ 1 - y(\mathbf{x}, \mathbf{w}) \}^{1-t}$.

The cross-entropy error function is

$E(\mathbf{w}) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln (1 - y_n) \}, \qquad y_n = y(\mathbf{x}_n, \mathbf{w}).$

In the case of $K$ separate binary classifications,

$E(\mathbf{w}) = -\sum_{n=1}^{N} \sum_{k=1}^{K} \{ t_{nk} \ln y_{nk} + (1 - t_{nk}) \ln (1 - y_{nk}) \}.$


For standard multiclass classification with a 1-of-K coding scheme, the likelihood is $\prod_n \prod_k y_k(\mathbf{x}_n, \mathbf{w})^{t_{nk}}$. Taking the negative logarithm of the likelihood function then gives the cross-entropy error function

$E(\mathbf{w}) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_k(\mathbf{x}_n, \mathbf{w}),$

where the output-unit activation function is the softmax.
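A small sketch of the three cross-entropy error functions above, given network outputs and targets as NumPy arrays (shapes and names are assumptions; outputs are taken to lie strictly inside (0, 1)):

```python
import numpy as np

def binary_cross_entropy(y, t):
    # E(w) = -sum_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) }
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def multilabel_cross_entropy(Y, T):
    # K independent binary classifications: sum over n and k
    return -np.sum(T * np.log(Y) + (1 - T) * np.log(1 - Y))

def multiclass_cross_entropy(Y, T):
    # 1-of-K coding with softmax outputs: E(w) = -sum_n sum_k t_nk ln y_nk
    return -np.sum(T * np.log(Y))
```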


Parameter Optimization

The smallest value of $E(\mathbf{w})$ will occur at a point in weight space at which the gradient vanishes, $\nabla E(\mathbf{w}) = 0$.

Local quadratic approximation: if $\hat{\mathbf{w}}$ is some point around which we expand, the Taylor expansion is

$E(\mathbf{w}) \simeq E(\hat{\mathbf{w}}) + (\mathbf{w} - \hat{\mathbf{w}})^{\mathrm{T}} \mathbf{b} + \frac{1}{2} (\mathbf{w} - \hat{\mathbf{w}})^{\mathrm{T}} \mathbf{H} (\mathbf{w} - \hat{\mathbf{w}}),$

where $\mathbf{b} \equiv \left. \nabla E \right|_{\hat{\mathbf{w}}}$ and $\mathbf{H} = \left. \nabla \nabla E \right|_{\hat{\mathbf{w}}}$ is the Hessian matrix.


From the quadratic approximation we obtain the corresponding local approximation to the gradient, $\nabla E \simeq \mathbf{b} + \mathbf{H} (\mathbf{w} - \hat{\mathbf{w}})$.

Let $\mathbf{w}^{\star}$ be a local minimum, so that $\nabla E(\mathbf{w}^{\star}) = 0$ and

$E(\mathbf{w}) \simeq E(\mathbf{w}^{\star}) + \frac{1}{2} (\mathbf{w} - \mathbf{w}^{\star})^{\mathrm{T}} \mathbf{H} (\mathbf{w} - \mathbf{w}^{\star}).$

For a geometric interpretation, we can perform an eigen-analysis of the Hessian,

$\mathbf{H} \mathbf{u}_i = \lambda_i \mathbf{u}_i,$

where the $\mathbf{u}_i$ are orthonormal vectors, i.e., $\mathbf{u}_i^{\mathrm{T}} \mathbf{u}_j = \delta_{ij}$.

We expand $\mathbf{w} - \mathbf{w}^{\star} = \sum_i \alpha_i \mathbf{u}_i$, then

$E(\mathbf{w}) = E(\mathbf{w}^{\star}) + \frac{1}{2} \sum_i \lambda_i \alpha_i^2.$


$\mathbf{H}$ is positive definite if, and only if,

$\mathbf{v}^{\mathrm{T}} \mathbf{H} \mathbf{v} > 0 \quad \text{for all } \mathbf{v} \neq \mathbf{0},$

or, equivalently, if its eigenvalues $\lambda_i$ are all positive.
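A small numerical sketch of this eigen-analysis, assuming a Hessian is already available as a symmetric NumPy array (names are illustrative):

```python
import numpy as np

def quadratic_error(H, E_min, w, w_star):
    """E(w) ~= E(w*) + 1/2 (w - w*)^T H (w - w*) near a local minimum w*."""
    d = w - w_star
    return E_min + 0.5 * d @ H @ d

def is_positive_definite(H):
    eigvals = np.linalg.eigvalsh(H)   # eigenvalues lambda_i of the symmetric H
    return np.all(eigvals > 0), eigvals
```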


Gradient Descent Optimization

Batch gradient descent:

$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \, \nabla E(\mathbf{w}^{(\tau)}),$

where the parameter $\eta > 0$ is the learning rate.

Sequential (on-line) gradient descent: if the error function comprises a sum over data points, $E(\mathbf{w}) = \sum_{n=1}^{N} E_n(\mathbf{w})$, then the weights are updated one data point at a time,

$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \, \nabla E_n(\mathbf{w}^{(\tau)}).$
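A minimal sketch contrasting the two update rules (the gradient callables `grad_E` and `grad_E_n` are assumptions, standing in for whatever model is being trained):

```python
import numpy as np

def batch_gradient_descent(w, grad_E, eta=0.01, steps=1000):
    for _ in range(steps):
        w = w - eta * grad_E(w)              # w <- w - eta * grad E(w)
    return w

def sequential_gradient_descent(w, grad_E_n, N, eta=0.01, epochs=10, rng=None):
    rng = rng or np.random.default_rng(0)
    for _ in range(epochs):
        for n in rng.permutation(N):         # one data point at a time
            w = w - eta * grad_E_n(w, n)     # w <- w - eta * grad E_n(w)
    return w
```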


Error Backpropagation Algorithm

We try to find an efficient way of evaluating the gradient $\nabla E_n(\mathbf{w})$ for a feed-forward neural network.

In a general feed-forward network, each unit computes a weighted sum of its inputs,

$a_j = \sum_i w_{ji} z_i, \qquad z_j = h(a_j),$

where $z_i$ is the activation of a unit, or input, that sends a connection to unit $j$.


Evaluation of Error-Function Derivatives

Define the errors $\delta_j \equiv \partial E_n / \partial a_j$. For the output units, $\delta_k = y_k - t_k$; for a hidden unit, summing over all units $k$ to which unit $j$ sends a connection gives the backpropagation formula

$\delta_j = h'(a_j) \sum_k w_{kj} \delta_k.$

Finally,

$\frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i.$


Error Backpropagation Procedure

1. Apply an input vector $\mathbf{x}_n$ to the network and forward propagate to find the activations of all hidden and output units.
2. Evaluate $\delta_k = y_k - t_k$ for all output nodes.
3. Backpropagate the $\delta$'s using $\delta_j = h'(a_j) \sum_k w_{kj} \delta_k$ for each hidden unit.
4. Find the required derivatives $\partial E_n / \partial w_{ji}^{(1)} = \delta_j x_i$ and $\partial E_n / \partial w_{kj}^{(2)} = \delta_k z_j$ for all weights in the first layer and second layer.
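The following sketch (assumed shapes and names, not the slides' code) implements these four steps for the earlier two-layer network with tanh hidden units, linear outputs, and a sum-of-squares error:

```python
import numpy as np

def backprop(x, t, W1, W2):
    """x: (D,) input; t: (K,) target; W1: (M, D+1); W2: (K, M+1)."""
    # 1. Forward propagate to get all activations
    x_tilde = np.concatenate(([1.0], x))
    a1 = W1 @ x_tilde
    z = np.tanh(a1)
    z_tilde = np.concatenate(([1.0], z))
    y = W2 @ z_tilde                          # linear outputs
    # 2. Output-unit errors: delta_k = y_k - t_k
    delta_out = y - t
    # 3. Backpropagate: delta_j = h'(a_j) * sum_k w_kj delta_k, with h'(a) = 1 - tanh(a)^2
    delta_hidden = (1.0 - z ** 2) * (W2[:, 1:].T @ delta_out)
    # 4. Required derivatives: dEn/dw_ji = delta_j * (input to that weight)
    grad_W1 = np.outer(delta_hidden, x_tilde)
    grad_W2 = np.outer(delta_out, z_tilde)
    return grad_W1, grad_W2
```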


Logistic Sigmoid Activation Function

For hidden units with $h(a) = \sigma(a) = \frac{1}{1 + e^{-a}}$, the derivative takes the simple form $h'(a) = \sigma(a)\{1 - \sigma(a)\} = z(1 - z)$, so the backpropagation formula becomes $\delta_j = z_j (1 - z_j) \sum_k w_{kj} \delta_k$.


The Hessian Matrix

The backpropagation algorithm for the first derivatives can be extended to the second derivatives of the error, given by

$\frac{\partial^2 E}{\partial w_{ji} \, \partial w_{lk}}.$

The Hessian matrix plays an important role in many aspects of neural computing. Various approximation schemes have been used to evaluate the Hessian matrix for a neural network. Exact computation of the Hessian is also available, with cost $O(W^2)$, where $W$ is the number of weights.

Diagonal Approximation

For this case, the diagonal elements are computed by

$\frac{\partial^2 E_n}{\partial w_{ji}^2} = \frac{\partial^2 E_n}{\partial a_j^2} \, z_i^2.$

If we neglect off-diagonal elements in the second-derivative terms, we obtain

$\frac{\partial^2 E_n}{\partial a_j^2} \simeq h'(a_j)^2 \sum_k w_{kj}^2 \, \frac{\partial^2 E_n}{\partial a_k^2} + h''(a_j) \sum_k w_{kj} \, \frac{\partial E_n}{\partial a_k}.$


Outer Product Approximation

In the case of a single output, the Hessian of the sum-of-squares error is

$\mathbf{H} = \nabla \nabla E = \sum_{n=1}^{N} \nabla y_n \, \nabla y_n^{\mathrm{T}} + \sum_{n=1}^{N} (y_n - t_n) \, \nabla \nabla y_n.$

If the network is well trained, $y_n \simeq t_n$ and the second term can be neglected, so we have the outer-product approximation

$\mathbf{H} \simeq \sum_{n=1}^{N} \mathbf{b}_n \mathbf{b}_n^{\mathrm{T}}, \qquad \mathbf{b}_n \equiv \nabla y_n = \nabla a_n.$


Inverse Hessian

Using the outer-product approximation,

$\mathbf{H}_N = \sum_{n=1}^{N} \mathbf{b}_n \mathbf{b}_n^{\mathrm{T}},$

where $\mathbf{b}_n = \nabla_{\mathbf{w}} a_n$. A sequential procedure builds the Hessian from the first $L$ data points,

$\mathbf{H}_{L+1} = \mathbf{H}_L + \mathbf{b}_{L+1} \mathbf{b}_{L+1}^{\mathrm{T}}.$

Using the Woodbury identity

$(\mathbf{M} + \mathbf{v} \mathbf{v}^{\mathrm{T}})^{-1} = \mathbf{M}^{-1} - \frac{(\mathbf{M}^{-1} \mathbf{v})(\mathbf{v}^{\mathrm{T}} \mathbf{M}^{-1})}{1 + \mathbf{v}^{\mathrm{T}} \mathbf{M}^{-1} \mathbf{v}},$

we obtain

$\mathbf{H}_{L+1}^{-1} = \mathbf{H}_L^{-1} - \frac{\mathbf{H}_L^{-1} \mathbf{b}_{L+1} \mathbf{b}_{L+1}^{\mathrm{T}} \mathbf{H}_L^{-1}}{1 + \mathbf{b}_{L+1}^{\mathrm{T}} \mathbf{H}_L^{-1} \mathbf{b}_{L+1}}.$
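A sketch of this sequential inverse-Hessian update under the outer-product approximation (the rows of `B` are assumed to hold the vectors $\mathbf{b}_n$; the small initializer $\alpha$ is a common convention, not from the slides):

```python
import numpy as np

def inverse_hessian(B, alpha=1e-3):
    """B: (N, W) array whose rows are b_n; starts from H_0 = alpha * I."""
    W = B.shape[1]
    H_inv = np.eye(W) / alpha                 # inverse of H_0 = alpha * I
    for b in B:
        Hb = H_inv @ b                        # H_L^{-1} b_{L+1}
        H_inv = H_inv - np.outer(Hb, Hb) / (1.0 + b @ Hb)   # Woodbury update
    return H_inv
```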


5.5 Regularization in Neural Networks

The optimum value of M, the number of hidden units, gives the best generalization performance, i.e., the optimum balance between under-fitting and over-fitting (Figure 5.9).

Generalization can instead be controlled by adding a quadratic penalty to the error function,

$\tilde{E}(\mathbf{w}) = E(\mathbf{w}) + \frac{\lambda}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}.$

This regularizer is also known as weight decay, and $\lambda$ is the regularization coefficient.
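In gradient-based training, weight decay simply adds a term to the gradient; a one-line sketch (with an assumed `grad_E` callable for the unregularized gradient):

```python
def regularized_gradient(w, grad_E, lam=1e-3):
    # gradient of E~(w) = E(w) + (lambda/2) w^T w  is  grad E(w) + lambda * w
    return grad_E(w) + lam * w
```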

5.5.1 Consistent Gaussian priors

Simple weight decay is inconsistent with certain scaling properties of the network mapping, because it treats all weights and biases on an equal footing and so does not satisfy the requirement that a linear transformation of the inputs or targets should map to an equivalent network. The regularizer is therefore modified to

$\frac{\lambda_1}{2} \sum_{w \in \mathcal{W}_1} w^2 + \frac{\lambda_2}{2} \sum_{w \in \mathcal{W}_2} w^2,$

where $\mathcal{W}_1$ and $\mathcal{W}_2$ denote the sets of first-layer and second-layer weights (biases excluded). This corresponds to a prior of the form

$p(\mathbf{w} \mid \alpha_1, \alpha_2) \propto \exp\left( -\frac{\alpha_1}{2} \sum_{w \in \mathcal{W}_1} w^2 - \frac{\alpha_2}{2} \sum_{w \in \mathcal{W}_2} w^2 \right).$


5.5.2 Early stopping

Early stopping is a way of controlling the effective complexity of a network. Our goal is to achieve the best generalization.

Figure 5.12 plots the training-set error and the validation-set error against the number of training iterations: training is stopped at the point of smallest validation-set error.
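A sketch of such a stopping rule (the callables `train_step` and `validation_error`, and the `patience` heuristic, are assumptions):

```python
import copy

def train_with_early_stopping(w, train_step, validation_error,
                              max_iters=10000, patience=20):
    best_w, best_err, since_best = copy.deepcopy(w), float("inf"), 0
    for _ in range(max_iters):
        w = train_step(w)                     # one gradient-descent iteration
        err = validation_error(w)
        if err < best_err:
            best_w, best_err, since_best = copy.deepcopy(w), err, 0
        else:
            since_best += 1
            if since_best >= patience:        # validation error no longer improving
                break
    return best_w                             # weights at smallest validation error
```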

Figure 5.13: a schematic illustration of why early stopping can give similar results to weight decay in the case of a quadratic error function.

Early stopping therefore exhibits behavior similar to regularization using a weight-decay term.

5.5.3 Invariances

One approach is to augment the training set using replicas of the training patterns, transformed according to the desired invariances (e.g., translation or scale invariance); see Figure 5.14.
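A small sketch of this augmentation (the random shift of 1-D signals is only an illustrative stand-in for whichever transformation matches the desired invariance):

```python
import numpy as np

def augment_with_shifts(X, T, n_replicas=2, max_shift=2, rng=None):
    """X: (N, D) patterns; T: (N, ...) targets. Returns the enlarged training set."""
    rng = rng or np.random.default_rng(0)
    X_aug, T_aug = [X], [T]
    for _ in range(n_replicas):
        shifts = rng.integers(-max_shift, max_shift + 1, size=len(X))
        X_aug.append(np.array([np.roll(x, s) for x, s in zip(X, shifts)]))
        T_aug.append(T)                       # targets unchanged by the transformation
    return np.concatenate(X_aug), np.concatenate(T_aug)
```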


5.5.4 Tangent propagation

We can use regularization to encourage models to be invariant to transformations of the input through the technique of tangent propagation (Figure 5.15).


5.5.4 Tangent propagation

Let $\mathbf{s}(\mathbf{x}_n, \xi)$ be the vector that results from acting on $\mathbf{x}_n$ by the transformation, with $\mathbf{s}(\mathbf{x}_n, 0) = \mathbf{x}_n$ and $\xi$ a continuous parameter. The tangent vector at the point $\mathbf{x}_n$ is given by

$\boldsymbol{\tau}_n = \left. \frac{\partial \mathbf{s}(\mathbf{x}_n, \xi)}{\partial \xi} \right|_{\xi = 0}.$

Under a transformation of the input vector, the network output vector will, in general, change; the derivative of output $k$ with respect to $\xi$ is

$\left. \frac{\partial y_k}{\partial \xi} \right|_{\xi = 0} = \sum_{i=1}^{D} J_{ki} \tau_i,$

where $J_{ki}$ is an element of the Jacobian matrix.


5.5.4 Tangent propagation (continued)

Local invariance in the neighborhood of the data points is encouraged by the addition of a regularization function $\Omega$, giving the total error

$\tilde{E} = E + \lambda \Omega,$

where $\lambda$ is a regularization coefficient and (see the sketch after the notes below)

$\Omega = \frac{1}{2} \sum_n \sum_k \left( \sum_{i=1}^{D} J_{nki} \, \tau_{ni} \right)^2.$

Notes:
1. $\Omega = 0$ in case the network mapping function is invariant under the transformation in the neighborhood of each pattern vector.
2. The value of $\lambda$ determines the balance between fitting the training data and learning the invariance property.
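A minimal sketch of evaluating $\Omega$, approximating the directional derivative $\sum_i J_{ki}\tau_i$ by a central finite difference along the tangent vector (the `net` callable and step size are assumptions; automatic differentiation could be used instead):

```python
import numpy as np

def tangent_regularizer(net, X, tangents, eps=1e-4):
    """net: callable x -> y; X: (N, D) inputs; tangents: (N, D) tangent vectors."""
    omega = 0.0
    for x, tau in zip(X, tangents):
        # (y(x + eps*tau) - y(x - eps*tau)) / (2*eps) ~= J tau
        dy = (net(x + eps * tau) - net(x - eps * tau)) / (2.0 * eps)
        omega += 0.5 * np.sum(dy ** 2)
    return omega
```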

Figure 5.16


5.5.5 Training with transformed data

Consider perturbation by a transformation governed by a single parameter $\xi$ drawn from a density $p(\xi)$ with zero mean and small variance, so that each input $\mathbf{x}$ becomes $\mathbf{s}(\mathbf{x}, \xi)$ with $\mathbf{s}(\mathbf{x}, 0) = \mathbf{x}$.

Expand the transformation as a Taylor series in $\xi$:

$\mathbf{s}(\mathbf{x}, \xi) = \mathbf{x} + \xi \, \boldsymbol{\tau} + \frac{\xi^2}{2} \boldsymbol{\tau}' + O(\xi^3),$

where $\boldsymbol{\tau} = \left. \partial \mathbf{s} / \partial \xi \right|_{\xi=0}$ and $\boldsymbol{\tau}' = \left. \partial^2 \mathbf{s} / \partial \xi^2 \right|_{\xi=0}$.


5.5.5 Training with transformed data (continued)

Expand the model function as a Taylor series in $\xi$ to give

$y(\mathbf{s}(\mathbf{x}, \xi)) = y(\mathbf{x}) + \xi \, \boldsymbol{\tau}^{\mathrm{T}} \nabla y(\mathbf{x}) + \frac{\xi^2}{2} \left[ (\boldsymbol{\tau}')^{\mathrm{T}} \nabla y(\mathbf{x}) + \boldsymbol{\tau}^{\mathrm{T}} \nabla \nabla y(\mathbf{x}) \, \boldsymbol{\tau} \right] + O(\xi^3).$


5.5.5 Training with transformed data (continued)

Omitting terms of $O(\xi^3)$, the average error function becomes

$\tilde{E} = E + \lambda \Omega,$

where $\lambda = \mathbb{E}[\xi^2]$ and the regularization term takes the form

$\Omega = \frac{1}{2} \int \left[ \{ y(\mathbf{x}) - \mathbb{E}[t \mid \mathbf{x}] \} \left\{ (\boldsymbol{\tau}')^{\mathrm{T}} \nabla y(\mathbf{x}) + \boldsymbol{\tau}^{\mathrm{T}} \nabla \nabla y(\mathbf{x}) \, \boldsymbol{\tau} \right\} + \left( \boldsymbol{\tau}^{\mathrm{T}} \nabla y(\mathbf{x}) \right)^2 \right] p(\mathbf{x}) \, \mathrm{d}\mathbf{x},$

in which we have performed the integration over $\xi$.


5.5.5 Training with transformed data (continued)

To simplify the regularization term, note that the function which minimizes the total error is $y(\mathbf{x}) = \mathbb{E}[t \mid \mathbf{x}] + O(\xi)$. Thus, to order $\xi$, the first term in the regularizer vanishes and we are left with

$\Omega = \frac{1}{2} \int \left( \boldsymbol{\tau}^{\mathrm{T}} \nabla y(\mathbf{x}) \right)^2 p(\mathbf{x}) \, \mathrm{d}\mathbf{x}.$

In the case where the transformation consists simply of the addition of random noise to the inputs, $\mathbf{x} \to \mathbf{x} + \boldsymbol{\xi}$, the regularizer takes the form

$\Omega = \frac{1}{2} \int \| \nabla y(\mathbf{x}) \|^2 \, p(\mathbf{x}) \, \mathrm{d}\mathbf{x},$

which is known as Tikhonov regularization. We can train an MLP by adding this regularization function to the error.
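Equivalently, for small noise one can simply train on noise-perturbed inputs; a sketch (the per-pattern gradient callable `grad_E_n` and the noise scale are assumptions):

```python
import numpy as np

def noisy_training_epoch(w, X, T, grad_E_n, eta=0.01, noise_std=0.05, rng=None):
    rng = rng or np.random.default_rng(0)
    for x, t in zip(X, T):
        x_noisy = x + rng.normal(0.0, noise_std, size=x.shape)  # x -> x + xi
        w = w - eta * grad_E_n(w, x_noisy, t)
    return w
```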

5.5.7 Soft weight sharing

One way to reduce the effective complexity is to constrain weights within certain groups to be equal. Soft weight sharing instead encourages the weight values to form several groups, rather than just one group, by considering a prior that is a mixture of Gaussians,

$p(\mathbf{w}) = \prod_i p(w_i), \qquad p(w_i) = \sum_{j=1}^{M} \pi_j \, \mathcal{N}(w_i \mid \mu_j, \sigma_j^2),$

where the $\pi_j$ are the mixing coefficients.



5.5.7 Soft weight sharing (continued)

Taking the negative logarithm then leads to a regularization term

$\Omega(\mathbf{w}) = -\sum_i \ln \left( \sum_{j=1}^{M} \pi_j \, \mathcal{N}(w_i \mid \mu_j, \sigma_j^2) \right),$

and the total error is given by

$\tilde{E}(\mathbf{w}) = E(\mathbf{w}) + \lambda \, \Omega(\mathbf{w}).$

The parameters $\{\pi_j, \mu_j, \sigma_j\}$ could be determined by the EM algorithm if the weights were constant. Because the distribution of weights is itself evolving during the learning process, a joint optimization is instead performed simultaneously over the weights and the mixture-model parameters.
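A minimal sketch of evaluating this penalty (mixture parameters as NumPy arrays; names are illustrative, and in practice $\Omega$ would be minimized jointly with the weights):

```python
import numpy as np

def soft_weight_sharing_penalty(w, pis, mus, sigmas):
    """w: (W,) weights; pis, mus, sigmas: (M,) mixture parameters."""
    w = w[:, None]                                        # (W, 1) for broadcasting
    gauss = np.exp(-0.5 * ((w - mus) / sigmas) ** 2) / (sigmas * np.sqrt(2 * np.pi))
    # Omega(w) = -sum_i ln sum_j pi_j N(w_i | mu_j, sigma_j^2)
    return -np.sum(np.log(np.sum(pis * gauss, axis=1)))
```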
