
SCE4101 Computational Intelligence

Multi-layer Perceptrons

Warren Gauci (529190 (M))

Multi-layer perceptrons: Selection of a multi-layer perceptron for a specific data classification task
Warren Gauci
Abstract: This paper considers the different parameters of multi-layer perceptron architectures and suggests a suitable architecture for a specific data set classification task. The results reported here were obtained with the MathWorks Neural Network Toolbox software. Rigorous testing with variable hidden layer size, learning rate and test sets led to the neural architecture with the best performance measures. All testing was performed on the Iris data set. Performance is evaluated using the confusion matrix, the mean square error plot and the receiver operating characteristic chart. This paper aims to contribute to further advances in the fields of neuron training and of distinguishing and classifying linearly and non-linearly separable data.
1. INTRODUCTION

An artificial neural network (ANN) is an information-processing system that has certain performance characteristics in common with biological neural networks. Neural networks possess a remarkable ability to derive meaning from complicated or imprecise data, which can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques. A simple neuron is a device with many inputs and one output, and it has two modes of operation: the training mode and the using mode. A more sophisticated neuron is presented in the McCulloch and Pitts (MCP) model. The difference from the simple neuron model is that the inputs are 'weighted': each input contributes to the neuron's decision in proportion to its weight.

All weighted inputs are then added together and, if the sum exceeds a pre-set threshold value, the neuron fires. An adjustment to the MCP model led to the formulation of the perceptron, a term coined by Frank Rosenblatt. A perceptron is an MCP model with additional, fixed pre-processing. This paper deals with the structure of perceptron architectures, as this kind of neuron is best suited for pattern recognition (see [1]). Perceptrons may be grouped into single-layer or multi-layer architectures. Single-layer architectures are restricted to classifying only linearly separable data, so only multi-layer perceptron (MLP) networks are used in this paper, as the connections from one layer to the next allow non-linearly separable data to be recognised and classified. For a comprehensive overview of other kinds of networks refer to [2].
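To make the weighted-sum-and-threshold behaviour of the MCP neuron concrete, the following minimal MATLAB sketch (not taken from the paper; the inputs, weights and threshold are illustrative values only) shows a single unit firing when its weighted input exceeds a pre-set threshold:

    x = [0.7; 0.2; 0.9];            % input vector (illustrative values)
    w = [0.5; -0.3; 0.8];           % one weight per input
    theta = 0.5;                    % pre-set firing threshold
    netInput = w' * x;              % weighted sum of the inputs
    y = double(netInput > theta);   % the neuron fires (outputs 1) only if the sum exceeds the threshold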
The objective of this paper is to find
MLP parameters that lead to the best MLP
architecture used for the classification of a
given data set. In general, ANNs are scalable,
i.e. they have parameters that can be changed.
In most cases the structure and size of an ANN
are chosen a priori or after the ANN learns a
given task. Two questions that are relevant in
this case are:
What size and structure is necessary to
solve a given task?
What size and structure is sufficient to
solve a given task?
This paper gives an answer to the first question
quoted above. The second question is also
relevant but is not within the scope of this paper. In principle, an MLP model can approximate functions to any given accuracy; for a good overview refer to [3].
2. BACKGROUND THEORY

The ANN must be trained using a learning process. This process involves the memorisation of patterns and the subsequent response of the neural network, and it can be categorised into two paradigms: associative mapping and regularity detection. Learning is performed by updating the value of the weights associated with each input. The methodology used in this paper makes use of an adaptive network, in which the neurons found in the input layer are capable of changing their weights. The adaptive network is introduced to a supervised learning procedure, where each neuron knows the target output and adjusts the weights of its input signals to minimise the error. The error is quantified using a least mean square convergence technique.
The behaviour of an ANN depends also
on the input-output transfer function, which is
specified for the units. This paper makes use of
sigmoid units, where the output varies
continuously but not linearly as the input
changes. Sigmoid units bear a greater
resemblance to real neurons than linear units
do. To train the ANN to perform a classification task, some weight-adjustment technique must be defined. This paper considers a back-propagation algorithm in which the error derivative of the weights (EW) is computed. In this way the network is able to calculate how the error will change as each weight is increased or decreased slightly.
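As a concrete illustration of these two ideas, the sketch below implements a single sigmoid unit and the error derivative of its weights (EW) under a squared-error cost, followed by a gradient-descent update. The values and variable names are illustrative assumptions, not the paper's code:

    sigma = @(a) 1 ./ (1 + exp(-a));   % sigmoid transfer function
    x  = [0.7; 0.2; 0.9];              % example input pattern (illustrative)
    t  = 1;                            % known target output (supervised learning)
    w  = [0.1; -0.2; 0.05];            % current weights (illustrative)
    lr = 0.5;                          % learning rate (illustrative)
    y  = sigma(w' * x);                % unit output for this pattern
    e  = t - y;                        % output error
    EW = -e * y * (1 - y) * x;         % error derivative with respect to each weight
    w  = w - lr * EW;                  % gradient-descent (back-propagation style) weight update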
3. MATERIALS
3.1 NEURAL NETWORK TOOLBOX

The Neural Network Toolbox provides functions that allow the modelling of complex nonlinear systems. This toolbox was used in the development of this paper to design, train, simulate and assess different ANN architectures. The pattern recognition tool was applied to the Iris dataset (see 3.2). The toolbox allowed for the division of the data set into training, validation and test sets. The function that changes the number of neurons in the hidden layer was used to change the MLP architecture. The performance of the different ANNs was assessed using the performance plots provided by this software.
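The workflow described above can be set up with a few Toolbox calls; the following is a minimal sketch under assumed, typical usage rather than the authors' exact script:

    hiddenLayerSize = 35;                    % the parameter varied in this study
    net = patternnet(hiddenLayerSize);       % pattern recognition network
    net.divideParam.trainRatio = 70/100;     % training set
    net.divideParam.valRatio   = 15/100;     % validation set
    net.divideParam.testRatio  = 15/100;     % test set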
3.2 DATASET

The dataset used for classification is the Iris data set. Created by R. A. Fisher, this data set is a classic in the field of pattern recognition. It contains 3 classes of 50 instances each, where each class refers to a type of Iris plant: Setosa, Versicolour and Virginica. One class is linearly separable from the other two, while the latter are not linearly separable from each other. Each instance has four attributes: sepal length, sepal width, petal length and petal width.
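The Toolbox ships Fisher's Iris data as a sample dataset. Assuming the standard iris_dataset example data, it can be loaded as follows, with the inputs as a 4-by-150 attribute matrix and the targets as a 3-by-150 matrix of class indicators:

    [x, t] = iris_dataset;   % x: 4x150 attributes, t: 3x150 class indicators
    size(x)                  % four attributes for each of the 150 instances
    size(t)                  % three classes, 50 instances each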
4. METHOD

The best structure of MLP to perform the given classification task was determined using an empirical procedure. The method used allowed for the variation of all the variable parameters, namely: input weights, epoch limit, learning rate and hidden layer size. The specific steps performed are presented in a flow chart in Appendix A. The method may be divided into three sections: running the traingd algorithm, collection of data, and choice of the best data. This method enumerated a total of 80 samples in the selection process of the best ANN architecture.
4.1 COLLECTION OF DATA

- The Iris dataset was divided into training, validation and test data [70% : 15% : 15%];
- The hidden layer size was chosen;

Table 1: Mean square error on the validation data set for ten samples at each hidden layer size (5, 10, 35, 60 and 100 neurons), with the minimum, average and standard deviation of the MSE for each size, and the MSE of the selected architecture (hidden layer size 35, learning rate 5, epoch limit 2000) applied to the Iris and Thyroid datasets.

- The traingd algorithm was run 10 times, saving and assessing the performance plots of each sample;
- The sample with the minimum MSE was chosen, considering the validation and test data plots;
- The procedure was repeated for 5 different hidden layer sizes [5, 10, 35, 60, 100].
This section of the method had the following outcomes: the production of 10 samples for each of the 5 hidden layer sizes (50 samples in total) and the determination of the best data for each hidden layer size. A sketch of this sampling loop is given below.
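The sampling loop can be sketched as follows; this is an assumed reconstruction of the procedure with typical Toolbox calls, not the paper's actual script:

    [x, t] = iris_dataset;                           % attributes and class targets
    hlSizes = [5 10 35 60 100];                      % hidden layer sizes under test
    valMse  = zeros(10, numel(hlSizes));             % minimum validation MSE per sample
    for j = 1:numel(hlSizes)
        for i = 1:10
            net = patternnet(hlSizes(j), 'traingd'); % gradient-descent training function
            net.divideParam.trainRatio = 0.70;
            net.divideParam.valRatio   = 0.15;
            net.divideParam.testRatio  = 0.15;
            [net, tr] = train(net, x, t);
            valMse(i, j) = tr.best_vperf;            % minimum validation MSE of this run
        end
    end
    minMse = min(valMse);                            % per hidden layer size
    avgMse = mean(valMse);
    stdMse = std(valMse);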
4.2 RUNNING THE TRAINGD ALGORITHM

- Assign and initialise the weights applied to the input data;
- Define the learning rate and epoch limit;
- Train the network using the pre-defined training set;
- Evaluate the network using the validation data;
- Update the weights and terminate when the validation error reaches a minimum.
All these steps were performed by the traingd function in the software. The best learning rate and epoch limit values were determined in a pre-test and remained fixed during this method (a learning rate of 5 and an epoch limit of 2000). The outcome of this section of the method was the determination of the minimum mean square error on the validation data set for each sample taken in the previously defined method.
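A sketch of this configuration with the Toolbox follows; the calls are assumed typical usage, while the learning rate and epoch limit are the fixed values reported in this paper:

    net = patternnet(35, 'traingd');      % traingd: gradient-descent back-propagation
    net.trainParam.lr     = 5;            % learning rate fixed after pre-testing
    net.trainParam.epochs = 2000;         % epoch limit
    [x, t] = iris_dataset;
    [net, tr] = train(net, x, t);         % weights are initialised and updated by train
    bestValMse = tr.best_vperf;           % minimum validation MSE (early-stopping point)
    bestEpoch  = tr.best_epoch;           % epoch at which validation error was minimal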
4.3 CHOICE OF BEST DATA

- Upload the best sample for each hidden layer size from the collection-of-data method;
- Determine the best overall sample and the corresponding hidden layer size (using the average and standard deviation functions);
- Work out another 10 samples using the best determined HL size;
- Select the best sample overall using the validation and test performance plots;
- Save the parameters of the best sample and try this ANN architecture on a new set of data.
This section of the method allowed for the
determination of the overall best ANN
architecture using another set of samples.
5. RESULTS

The most relevant results are tabulated in Table 1. All results were obtained using the following percentage ratios for the training, validation and test sets: 70% : 15% : 15%. Results are also based on a learning rate of 5 and an epoch limit of 2000; these values were justified by a pre-testing procedure using the same dataset. The best ANN architecture, taking into account the mean square error of both the validation and test data, is the one containing 35 neurons in the hidden layer. This choice is based on the average and standard deviation values of the mean square error. Results also show this architecture applied to a different data set. Figure 1 and Figure 2 show the MSE vs. epoch plot and the confusion matrix for the best data sample. Figure 1 shows that, for the best data sample, validation converged with an MSE of 2.0241e-08 at 18 epochs, with a test error of 9e-09.
Figure 1: MSE vs. epoch plot for the best data sample.
Figure 2: Confusion matrix for the best data sample.
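Assuming a trained network net and its training record tr (as in the sketches above), plots of this kind can be produced with the Toolbox plotting functions:

    y = net(x);          % network outputs for all instances
    plotperform(tr);     % MSE versus epoch for training, validation and test sets (cf. Figure 1)
    plotconfusion(t, y); % confusion matrices (cf. Figure 2)
    plotroc(t, y);       % receiver operating characteristic curves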

6. DISCUSSION

The samples with a hidden layer size of 35 were initially not those with the minimum average and standard deviation values. Further samples were taken and more conclusive results were obtained, which was evidence of good and consistent data. This architecture also gave a consistent MSE value when tested on the Thyroid dataset, which has the same number of classes but more attributes.
The confusion matrix in Figure 2 shows perfect classification of the training and validation data but a misclassification of 4.3% in the test data. This shows that a classification error still occurred even though training was executed perfectly. This is reinforced by the ROC plots, which show only true positives and no false positives for the training and validation data, and a few false positives in the test data.
7. CONCLUSION

It may be concluded that, although results are not always satisfactory, consistency is present only in networks with considerably small hidden layers. Furthermore, the results show that classes 2 and 3 are the classes containing non-linearly separable data. It may also be concluded that a specific MLP architecture can be chosen for a particular classification task, but the resulting classification is somewhat random and not always consistent.
8. REFERENCES

[1] M. Nørgaard, O. Ravn, N. Poulsen, and L. Hansen, Neural Networks for Modelling and Control of Dynamic Systems, M. Grimble and M. Johnson, Eds. London: Springer-Verlag, 2000.
[2] S. Haykin, Neural Networks, J. Griffin, Ed. New York: Macmillan College Publishing Company, 1994.
[3] F. Lewis, J. Huang, T. Parisini, D. Prokhorov, and D. Wunsch, IEEE Trans. Neural Networks, vol. 18, no. 4, pp. 969-972, July 2007.

9. APPENDIX A
