
Generalization of ANN

There are many different methods for multivariate statistical analysis, function
fitting, or prediction tasks, and ANNs represent only a small subset of these.
From a statistical modeling point of view, ANN models belong to the general class of
non-parametric methods, which do not make any assumption about the parametric
form of the function they model. In this sense they are more powerful than
parametric methods, which try to fit reality into a specific parametric form. However,
non-parametric methods like ANNs contain more free parameters and hence require
more training data than parametric ones in order to achieve good generalization
performance (Geman et al., 1992)¹. Due to their generality, ANN methods also have
some drawbacks, the most prominent one being long training times.
Working with Matlab NN toolbox

o Import large and complex data sets
o Partition the data set into a training set and a testing set
o Create, initialize, train, simulate, and manage the networks

Supervised networks are trained to produce desired outputs in response to
sample inputs, making them particularly well suited to:
o Modeling and controlling dynamic systems
o Classifying noisy data
o Predicting future events

Unsupervised networks are trained by letting the network continually adjust
itself to new inputs. They find relationships within data and can automatically
define classification schemes.

Visit (http://www.mathworks.com/products/neuralnet/index.html)
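The partition/create/train/simulate workflow above can be sketched outside the toolbox as well. The following is a minimal NumPy illustration, not toolbox code: the `partition` helper and the toy data are invented for this sketch, and an ordinary least-squares fit stands in for the network training step.

```python
import numpy as np

def partition(X, y, test_frac=0.2, seed=0):
    """Randomly partition a data set into training and testing sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Toy data: y = 2x + 1 plus a little noise
X = np.linspace(0.0, 1.0, 100).reshape(-1, 1)
y = 2 * X.ravel() + 1 + np.random.default_rng(1).normal(0.0, 0.05, 100)

X_tr, y_tr, X_te, y_te = partition(X, y)

# "Create/train": a linear model fitted by least squares stands in
# for creating and training a network
A = np.hstack([X_tr, np.ones((len(X_tr), 1))])
w, *_ = np.linalg.lstsq(A, y_tr, rcond=None)

# "Simulate": run the trained model on the held-out test set
y_pred = np.hstack([X_te, np.ones((len(X_te), 1))]) @ w
test_mse = np.mean((y_pred - y_te) ** 2)
```

The essential point carried over to any toolbox is that the test set is held out before training and only used to measure performance afterward.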

The overall objective of training an MLP network in a prediction task is that the network so
designed will generalize. That means the input/output mapping computed by the network
is correct (or nearly so) for test data.
The critical issue in developing a neural network is generalization: how well will the
network make predictions for cases that are not in the training set? NNs can suffer from
either underfitting or overfitting. A network that is not sufficiently complex can fail to
detect fully the signal in a complicated data set, leading to underfitting. A network that is
too complex may fit the noise, not just the signal, leading to overfitting.
A model designed to generalize well will produce a corresponding input output mapping
even when the input is slightly different from the examples used to train the network.
When, however, a neural network learns too many input/output examples, the network
may end up memorizing the training data. It may do so by finding a feature (e.g. noise)
that is present in the training data but not true of the underlying function that is to be
modeled. Such a phenomenon is referred to as overfitting.

¹ Geman, S., Bienenstock, E., and Doursat, R., 1992. "Neural Networks and the Bias/Variance Dilemma", Neural Computation 4, 1.
Overfitting is especially dangerous because, with many of the common types of NNs, it
can easily lead to predictions that are far beyond the range of the training data. Overfitting
can also produce wild predictions in multilayer perceptrons even with noise-free data.
The best way to avoid overfitting is to use lots of training data. Given a fixed amount of
training data, there are various approaches to avoiding underfitting and overfitting, and
hence getting good generalization: model selection, jittering, early stopping, network
pruning, etc.
Model Selection
The complexity of a network is related to both the number of weights and the size of the
weights. Model selection is concerned with the number of weights, and hence the number
of hidden units and layers. The more weights there are, relative to the number of training
cases, the more overfitting amplifies noise in the targets (Moody, 1992)². The other
approaches are concerned, directly or indirectly, with the size of the weights.
A standard tool of network (model) selection within a set of candidate model structures
(parameterizations) is cross-validation. In cross-validation approach of model selection,
firstly the available data set is randomly partitioned into a training set and a test set. The
training set is further partitioned into two disjoint subsets:
Estimation subset, used to select the model.
Validation subset, used to test or validate the model.
The motivation here is to validate the model on a data set different from the one used for
parameter estimation. In this way we may use the training set to assess the performance
of various candidate models and thereby choose the best one. There is however a
distinct possibility that the model with best-performing parameter values so selected may
end up overfitting the validation subset. To guard against this possibility, the
generalization performance of the selected model is measured on the test set which is
different from the validation subset. The objective here will be to select the one that
minimizes the generalization error. The validation data set is used to determine the
termination point for training. This validation set is not used directly in training,
i.e. it is not presented to the network, but it is used indirectly to monitor
performance on unseen data. A deteriorating performance on the validation set
signals that the ANN is overlearning the training data and that training should be stopped.
When training is stopped, a test set can be used to estimate the generalization
performance.
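The three-way partition described above can be sketched as a small helper. This is an illustrative NumPy function (the name `split_three_way` and the fractions are assumptions, not a standard API):

```python
import numpy as np

def split_three_way(X, y, f_val=0.2, f_test=0.2, seed=0):
    """Partition data into estimation, validation, and test subsets.

    The estimation subset fits the candidate models, the validation
    subset chooses among them, and the untouched test subset gives the
    final estimate of generalization performance.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * f_test)
    n_val = int(len(X) * f_val)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    est_idx = idx[n_test + n_val:]
    return ((X[est_idx], y[est_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))

X = np.arange(100, dtype=float).reshape(-1, 1)
y = 0.5 * X.ravel()
est, val, test = split_three_way(X, y)
```

The shuffling before splitting matters: it keeps any ordering in the raw data (e.g. by time or by class) from biasing one of the subsets.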
Early stopping method of training
² Moody, J.E., 1992. "The Effective Number of Parameters: An Analysis of Generalization and Regularization in Nonlinear Learning Systems", in Moody, J.E., Hanson, S.J., and Lippmann, R.P. (eds.), Advances in Neural Information Processing Systems 4, 847-854.

Here, training is performed by periodically stopping the training session. After a
session of training, the performance is measured on the validation data set, and
training is then resumed as needed. In cases where data are scarce and the use of a
validation set is too costly, one can instead use a threshold value on the training error.
The figure (Fig.1) shows the conceptual forms of two learning curves. The estimation
(training) learning curve decreases monotonically with an increasing number of epochs in
the usual manner. In contrast, the validation learning curve decreases monotonically to a
minimum and then starts to increase as the training continues. In reality, what the network
is learning beyond the minimum point is essentially the noise contained in the training data.
This heuristic suggests that the minimum point on the validation learning curve be used
as a sensible criterion for stopping the training session (the early-stopping method of
training).
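The early-stopping loop can be sketched as follows. This is a simplified illustration: a deliberately overparameterized polynomial model trained by gradient descent stands in for the network, and the `patience` counter (an assumption of this sketch) implements "resume training as needed until the validation curve keeps rising".

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy toy regression task and a held-out validation set
x = rng.uniform(-1, 1, 30)
y = np.sin(np.pi * x) + rng.normal(0.0, 0.3, x.size)
x_val = rng.uniform(-1, 1, 30)
y_val = np.sin(np.pi * x_val) + rng.normal(0.0, 0.3, x_val.size)

degree = 12                              # many free parameters: prone to overfit
Phi = np.vander(x, degree + 1)
Phi_val = np.vander(x_val, degree + 1)

w = np.zeros(degree + 1)
lr = 0.05
best_w, best_val = w.copy(), np.inf
patience, bad_epochs = 50, 0

for epoch in range(5000):
    # One gradient-descent step on the training (estimation) error
    grad = 2 * Phi.T @ (Phi @ w - y) / len(x)
    w -= lr * grad
    # Measure performance on the validation set
    val_err = np.mean((Phi_val @ w - y_val) ** 2)
    if val_err < best_val:
        best_val, best_w, bad_epochs = val_err, w.copy(), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:       # validation curve keeps rising: stop
            break

w = best_w   # keep the weights from the minimum of the validation curve
```

Restoring `best_w` at the end corresponds to reading off the minimum point of the validation learning curve in Fig.1.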

Fig.1 Illustration of the early-stopping rule based on cross-validation (Haykin, 1999)³
Jittering
Jittering is the technique of adding a noise term to the input data. Adding noise or jitter to
the inputs during training has been found empirically to improve network generalization.
This is because the noise smears out each data point and makes it difficult for the
network to fit individual data points precisely, and consequently reduces over-fitting.
[Note: Details are not covered here.]
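A jittering step can be sketched in one line of NumPy. The helper name and the noise level below are illustrative assumptions; in practice the noise standard deviation is a tuning parameter matched to the spread of the inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter(X, sigma=0.05, rng=rng):
    """Return a copy of the inputs with small Gaussian noise added.

    Called once per training epoch, this presents the network with a
    slightly different version of each data point every time, smearing
    the points out and discouraging an exact fit to any one of them.
    """
    return X + rng.normal(0.0, sigma, size=X.shape)

X = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
Xj = jitter(X)   # one jittered presentation of the training inputs
```

Note that the targets are left unchanged; only the inputs are perturbed.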
Network Pruning
As the complexity of the system increases, governing the system by ANN requires the
use of highly structured networks of a rather large size. A practical issue that may arise in
such a situation is that of minimizing the size of the network while maintaining good
performance. A neural network of minimum size is less likely to learn the idiosyncrasies
or noise in the training data, and may thus generalize better to new data. One way of
achieving this design objective is network pruning.

³ Haykin, S., 1999. Neural Networks: A Comprehensive Foundation, Second Edition. Pearson Education.
In network pruning, a multilayer perceptron with an adequate performance for
the problem at hand is considered initially, and the network is then pruned by weakening or
eliminating certain synaptic weights in a selective and orderly fashion.
Complexity-Regularization Network Pruning Approach
In the complexity-regularization approach, the total risk of unfit of the network model of a
complex system is expressed as:

    R(w) = s(w) + λ c(w)                                        (1)

The term s(w) is the standard performance measure, which depends on the network model
and the input data, e.g. the least-mean-square (LMS) error in the case of back-propagation
learning. The second term c(w) is the complexity penalty, which depends on the network
(model) alone. λ is a regularization parameter, which represents the relative importance of
the complexity-penalty term with respect to the performance-measure term.
When λ = 0, the network is completely determined by the training examples (input).
When λ → ∞ (or λ is very large), the complexity penalty is by itself sufficient to specify
the network, or equivalently the training examples are unreliable.
In a general setting, one choice of complexity-penalty term c(w) is the kth-order
smoothing integral

    c(w, k) = (1/2) ∫ ‖ ∂^k F(x, w) / ∂x^k ‖² μ(x) dx           (2)

where F(x, w) is the input-output mapping performed by the model, and μ(x) is some
weighting function that determines the region of the input space over which the function
F(x, w) is required to be smooth. The higher the value of k, the smoother (less complex)
the function F(x, w) will be.
There are three different complexity regularization procedures:
a) Weight Decay
In the weight-decay procedure, the complexity penalty is defined as the squared
norm of the weight vector:

    c(w) = ‖w‖² = Σ_i w_i²                                      (3)

This procedure operates by forcing some of the synaptic weights in the network to
take values close to zero while permitting other weights to retain their relatively
large values. Consequently, the weights of the network are grouped roughly into
two categories: those that have a large influence on the network (model), and
those that have little or no influence on it. The weights in this latter category are
referred to as excess weights. In the absence of complexity regularization, excess
weights result in poor generalization by virtue of their high likelihood of taking
on completely arbitrary values, causing the network to overfit the data in order to
produce a slight reduction in the training error. The use of complexity
regularization encourages the excess weights to assume values close to zero and
thereby improves generalization.
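The effect of the penalty in equation (3) shows up directly in the gradient step: its gradient is 2λw, so every update shrinks each weight toward zero. The function names and the learning-rate/λ values below are illustrative assumptions.

```python
import numpy as np

def weight_decay_penalty(w):
    """c(w) = ||w||^2, the squared norm of the weight vector (eq. 3)."""
    return np.sum(w ** 2)

def sgd_step(w, grad_s, lr=0.1, lam=0.01):
    """One gradient step on R(w) = s(w) + lam * c(w).

    The penalty contributes 2*lam*w to the gradient, so each step
    shrinks every weight toward zero -- hence "weight decay".
    """
    return w - lr * (grad_s + 2 * lam * w)

w = np.array([1.0, -2.0, 0.5])
# With a zero task gradient, the update shows the pure decay effect
w_new = sgd_step(w, grad_s=np.zeros(3))
```

Weights whose task gradient is large resist the decay and keep their magnitude; the excess weights, with little task gradient, are driven toward zero.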
b) Weight Elimination
In the weight-elimination procedure of complexity regularization, the
complexity penalty is defined as

    c(w) = Σ_i (w_i/w_0)² / [1 + (w_i/w_0)²]                    (4)

where w_0 is a preassigned parameter, and w_i refers to the weight of some synapse
i in the network.
When |w_i| >> w_0, the individual penalty term approaches 1.
When |w_i| << w_0, the individual penalty term approaches 0.
An individual penalty term varies with w_i/w_0 in a symmetric fashion, as shown in
Figure 2.

Figure 2. The complexity penalty term versus w_i/w_0
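The penalty of equation (4) and its two limiting regimes can be checked numerically. The function name is an illustrative assumption; w_0 = 1 is an arbitrary choice of the preassigned parameter.

```python
import numpy as np

def weight_elimination_penalty(w, w0=1.0):
    """c(w) = sum_i (w_i/w0)^2 / (1 + (w_i/w0)^2)  (eq. 4)."""
    r = (w / w0) ** 2
    return np.sum(r / (1.0 + r))

# Limiting behaviour of a single penalty term:
big = weight_elimination_penalty(np.array([100.0]))    # |w| >> w0 -> near 1
small = weight_elimination_penalty(np.array([0.01]))   # |w| << w0 -> near 0
```

Because each term saturates at 1 for large weights, the penalty effectively counts the weights that matter, while weights much smaller than w_0 contribute almost nothing and are candidates for elimination.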


c) Approximate Smoother
In this procedure of complexity regularization, the complexity term proposed
for a multilayer perceptron with a single hidden layer and a single neuron in the
output layer is

    c(w) = Σ_{j=1..M} w_oj² ‖w_j‖^p                             (5)

where the w_oj are the weights in the output layer, and w_j is the weight vector for the jth
neuron in the hidden layer. The power p is defined by

    p = 2k − 1   for a global smoother
    p = 2k       for a local smoother                           (6)

where k is the order of differentiation of F(x, w) with respect to x (equation 2).

The approximate smoother appears to be more accurate than weight decay or
weight elimination for the complexity regularization of a multilayer perceptron.
Unlike those earlier methods, it does two things:
I. It distinguishes between the roles of the synaptic weights in the hidden layer
and those in the output layer.
II. It captures the interactions between these two sets of weights.
However, its computational complexity is much higher than that of the weight-decay
or weight-elimination procedures.
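For a concrete reading of equations (5) and (6), the penalty for a tiny single-hidden-layer network can be computed directly. The function name and the toy weight values are illustrative assumptions.

```python
import numpy as np

def approximate_smoother_penalty(w_out, W_hidden, k=1, local=False):
    """c(w) = sum_j w_oj^2 * ||w_j||^p  (eq. 5), with p from eq. 6."""
    p = 2 * k if local else 2 * k - 1
    norms = np.linalg.norm(W_hidden, axis=1)   # ||w_j|| per hidden neuron j
    return np.sum(w_out ** 2 * norms ** p)

# Tiny network: 2 hidden neurons with 3 inputs each, 1 output neuron
w_out = np.array([0.5, -1.0])                  # output-layer weights w_oj
W_hidden = np.array([[3.0, 0.0, 4.0],          # ||w_1|| = 5
                     [0.0, 0.0, 2.0]])         # ||w_2|| = 2
penalty = approximate_smoother_penalty(w_out, W_hidden, k=1)   # global, p = 1
```

Note how the output weight w_oj multiplies the norm of the corresponding hidden-layer vector w_j: this is the interaction between the two sets of weights that weight decay and weight elimination ignore.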

Limitations of Back Propagation


1. Many iterations may be required for convergence (slow convergence).
2. Large-scale neural network training problems can be so inherently difficult
that no supervised learning strategy is feasible by itself, and other
approaches such as pre-processing may be necessary.
3. Back propagation is basically a hill-climbing technique; it runs the risk of
being trapped in a local minimum, where every small change in the synaptic
weights increases the cost function, even though somewhere else in the weight
space there exists another set of synaptic weights for which the cost function is
smaller than at the local minimum in which the network is stuck. It is clearly
undesirable to have the learning terminate at a local minimum, especially if it
is located far above the global minimum.
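The local-minimum problem in point 3 can be demonstrated on a one-dimensional cost surface. The cost function below is invented for illustration, and random restarts are shown as one simple (not the only) remedy.

```python
import numpy as np

def f(w):
    """1-D cost with a local minimum (w ~ 1.1) above the global one (w ~ -1.27)."""
    return w**4 - 3 * w**2 + w

def grad(w):
    return 4 * w**3 - 6 * w + 1

def gradient_descent(w, lr=0.01, steps=2000):
    """Plain gradient descent: follows the slope into the nearest basin."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# Started on the wrong side, gradient descent stops at the local minimum:
# every small change of w from there increases the cost
w_stuck = gradient_descent(2.0)

# A simple remedy: restart from several points and keep the best result
starts = np.linspace(-3.0, 3.0, 7)
w_best = min((gradient_descent(w0) for w0 in starts), key=f)
```

The restarted search finds the basin of the global minimum, while the single run remains stuck at a cost well above it.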
