
Introduction to Neural Networks

Perceptron and Feed-Forward Networks

Chandrabose Aravindan
<ArvindanC@ssn.edu.in>

Machine Learning Research Group


SSN College of Engineering, Chennai

Presented at:
Workshop on Machine Learning for Image Analysis
SSN, Chennai

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 1 / 62


Outline

1 Introduction

2 Understanding Linear Discrimination

3 Perceptron Weight Update Rule

4 LMS Weight Update Rule

5 Back-Propagation Algorithm

6 Building and Validating NN Models

7 Summary

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 2 / 62



Introduction

An Artificial Neural Network (ANN) is a computational model that is
capable of representing any continuous or discrete function.
An interesting point about ANNs is that a model can be “learnt” from
input-output example pairs.
They are extremely useful when the underlying functional mapping is not
clear and all we have is a set of examples.
Can we describe a function that maps an image to a digit?
Do we know the function that maps CT scan images to stenosis?
In this talk, we will focus on the basics of a simple but useful kind of
ANN referred to as feed-forward networks (single layer and multi-layer).

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 4 / 62


Binary Classification Problem

Consider a task where we need to decide whether an instance belongs
to a class or not — for example, decide whether a given image is the
digit “1” or not.
This is a binary classification task.
Assume that we have identified the features (which can be represented
by real numbers) and have collected positive and negative instances
— for digit recognition, we may simply take the pixel values.
Each instance is a vector in the feature space.
The machine learning problem here is to find a geometric model with
which we can predict the class of new instances.
In this talk, we focus on building discriminating neural network
models from instances (inductive learning) using supervised learning
algorithms.

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 5 / 62


Binary Classification

Figure: Positive (+) and negative (−) instances scattered in a
two-dimensional feature space with axes f1 and f2

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 6 / 62



Z-Score

Problem: Predict if a company goes bankrupt in the near future


Features:
A: Working capital / Total assets
B: Retained earnings / Total assets
C: Earnings before interest and tax / Total assets
D: Market value of equity / Total liabilities
E: Sales / Total assets
Altman’s Z-Score: Z = 1.2A + 1.4B + 3.3C + 0.6D + 1.0E
Predict bankruptcy if Z < 1.8
How did Prof. Edward Altman find these magic numbers?
Do we have any algorithm / tool today that can auto-magically find
these numbers from available labelled data?

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 7 / 62
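To make the idea concrete, here is a small sketch (not from the original slides) that computes Altman's Z-Score from the five ratios above and applies the Z < 1.8 bankruptcy threshold; the ratio values in the example call are made-up numbers for illustration only.

def altman_z_score(a, b, c, d, e):
    """Altman's Z-Score: a fixed weighted sum of five financial ratios."""
    return 1.2 * a + 1.4 * b + 3.3 * c + 0.6 * d + 1.0 * e

def predict_bankruptcy(a, b, c, d, e, threshold=1.8):
    """Predict bankruptcy if the Z-Score falls below the threshold."""
    return altman_z_score(a, b, c, d, e) < threshold

# Made-up ratio values for A, B, C, D, E
print(altman_z_score(0.2, 0.1, 0.05, 0.8, 1.1))      # the Z-Score for these ratios
print(predict_bankruptcy(0.2, 0.1, 0.05, 0.8, 1.1))  # True if predicted to go bankrupt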



Linear Discriminant Function

The general idea for discrimination: consider a weighted sum of the
feature values and take a decision based on how it compares with a
threshold (like Altman's Z-Score).
g(x) = w_0 + w_1 x_1 + \ldots + w_n x_n (note that w_0 is the bias added
to the weighted sum)
In vector notation, this can be written as g(x) = w^T x + w_0
Typically, a non-linear function is applied to this weighted sum. Let's
assume a simple step function. The binary class decision can now be
taken as +1 if g(x) > 0, and −1 if g(x) < 0.
What we have described here is a simple model of a neuron!

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 8 / 62
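A minimal sketch of this neuron model, assuming a bipolar step activation: it computes the weighted sum g(x) = w^T x + w_0 and outputs +1 or −1. The weights, bias and input below are arbitrary illustrative values.

import numpy as np

def neuron_output(w, w0, x):
    """Simple neuron: weighted sum plus bias, followed by a bipolar step."""
    g = np.dot(w, x) + w0        # g(x) = w^T x + w_0
    return 1 if g > 0 else -1    # step activation

# Arbitrary illustrative weights, bias and input vector
w = np.array([1.2, 1.4, 3.3, 0.6, 1.0])
w0 = -1.8
x = np.array([0.2, 0.1, 0.05, 0.8, 1.1])
print(neuron_output(w, w0, x))   # +1 or -1, depending on the side of the hyperplane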


Model of a Neuron

Figure: Model of a Neuron (Adapted from [Russel and Norvig, 2009])

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 9 / 62


Model of a Neuron

Figure: Biological Neuron (Adapted from [Russel and Norvig, 2009])

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 10 / 62


NN Examples

Figure: Simple Examples (Adapted from [Russel and Norvig, 2009])

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 11 / 62


Activation Functions

Binary and bipolar step functions:

  f(x) = \begin{cases} 1 & x > 0 \\ 0 & x \le 0 \end{cases}
  \qquad
  f(x) = \begin{cases} 1 & x > 0 \\ -1 & x \le 0 \end{cases}

Binary and bipolar sigmoid functions:

  f(x) = \frac{1}{1 + e^{-\sigma x}}
  \qquad
  g(x) = 2f(x) - 1 = \frac{1 - e^{-\sigma x}}{1 + e^{-\sigma x}}

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 12 / 62


Activation Functions

Desirable properties of an activation function include that it is
differentiable and that its derivative is expressible in terms of the
function itself.
Binary sigmoid:

  f'(x) = \sigma f(x) \left[ 1 - f(x) \right]

Bipolar sigmoid:

  g'(x) = \frac{\sigma}{2} \left[ 1 + g(x) \right] \left[ 1 - g(x) \right]

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 13 / 62
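A small numerical sketch, assuming σ = 1, of the binary sigmoid and its derivative written in terms of the function itself; the last lines check the identity f'(x) = σ f(x)[1 − f(x)] against a finite-difference estimate at an arbitrary point.

import numpy as np

SIGMA = 1.0  # assumed slope parameter

def sigmoid(x):
    """Binary sigmoid f(x) = 1 / (1 + e^(-sigma * x))."""
    return 1.0 / (1.0 + np.exp(-SIGMA * x))

def sigmoid_derivative(x):
    """Derivative expressed through the function itself."""
    fx = sigmoid(x)
    return SIGMA * fx * (1.0 - fx)

x, h = 0.7, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)   # finite-difference estimate
print(np.isclose(numeric, sigmoid_derivative(x)))        # True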


Outline

1 Introduction

2 Understanding Linear Discrimination

3 Perceptron Weight Update Rule

4 LMS Weight Update Rule

5 Back-Propagation Algorithm

6 Building and Validating NN Models

7 Summary

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 14 / 62



Boundary separating the classes

Let us consider a neuron with a step function for activation.

What is the boundary separating the two classes?

  g(x) = w^T x + w_0 = 0

In the case of two dimensions this is a line; in three dimensions it is
a plane; and in general it is a hyperplane.
Thus, we are looking for a geometric model (a hyperplane) defined by the
weights as model parameters.

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 15 / 62


Binary Classification — Boundary

Figure: Positive (+) and negative (−) instances in the (f1, f2) feature
space, together with new query points (marked ?) whose class must be
predicted from the separating boundary

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 16 / 62



Geometry of Linear Discrimination

If x_1 and x_2 are any two arbitrary points on the hyperplane, then

  w^T x_1 + w_0 = w^T x_2 + w_0 = 0

This means that

  w^T (x_1 - x_2) = 0

Note that the vector x_1 − x_2 lies on the hyperplane, and its dot
product with the weight vector is 0.
Hence, the weight vector w is orthogonal to the hyperplane and points in
the positive direction.

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 17 / 62


Geometry of Linear Discrimination

Figure: Hyperplane geometry in 2-dimensional space (Adapted from
[Theodoridis and Koutroumbas, 2009])

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 18 / 62


Geometry of Linear Discrimination

Now, the distance between an arbitrary vector x and the hyperplane is
given as

  z = \frac{|g(x)|}{\|w\|}

In particular, the distance between the origin and the hyperplane is

  d = \frac{|w_0|}{\|w\|}

Thus, the sign of w_0 says whether the origin is in the negative or the
positive response region.
g(x) is a measure of the Euclidean distance of the point x from the
decision plane.
w_0 determines the position of the hyperplane, while its orientation is
determined by the weight vector w.

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 19 / 62
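A minimal sketch of these distance formulas with arbitrary illustrative parameters: it evaluates g(x), the distance z of a point x from the hyperplane, and the distance d of the origin from the hyperplane.

import numpy as np

# Arbitrary illustrative hyperplane: g(x) = w^T x + w_0
w = np.array([3.0, 4.0])
w0 = -5.0

def g(x):
    return np.dot(w, x) + w0

x = np.array([2.0, 1.0])
z = abs(g(x)) / np.linalg.norm(w)   # distance of x from the hyperplane
d = abs(w0) / np.linalg.norm(w)     # distance of the origin from the hyperplane
print(g(x), z, d)                   # 5.0, 1.0 and 1.0 for these values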



Learning the weights

Now that we have discussed how the binary classification decision is
made and the corresponding geometry, we turn our attention to how to
find a model from given examples.
In this case, the model parameters are the weights, and thus we need
to learn the weight vector and the bias w_0 from the examples.
The Perceptron algorithm is one of the earliest and simplest algorithms
for this purpose.
It is not difficult to pose this as minimization of the mean squared
error (remember that this is supervised learning), and so the gradient
descent based LMS algorithm is also popular.

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 20 / 62



Supervised Learning Feedback Loop

Figure: Supervised learning feedback loop — an input x is passed through
the model h(x) (defined by the model parameters), the model output is
compared with the target f(x), and the resulting error is fed back to
adjust the parameters

Major Issue
Will this feedback loop converge?

Major Issue
Will the model generalize beyond the training samples?
C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 21 / 62
Outline

1 Introduction

2 Understanding Linear Discrimination

3 Perceptron Weight Update Rule

4 LMS Weight Update Rule

5 Back-Propagation Algorithm

6 Building and Validating NN Models

7 Summary

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 22 / 62



Perceptron Algorithm

Start with some initial weight vector (including the bias component w_0)
If a positive example x_i is misclassified, i.e. g(x_i) < 0 leading to a
negative response −1 while the target y_i = +1, we need to increase
the weight: w' = w + η x_i
If a negative example x_j is misclassified, i.e. g(x_j) > 0 leading to a
positive response +1 while the target y_j = −1, we need to decrease
the weight: w' = w − η x_j
These can be combined into a single update rule: when an example x_i
is misclassified, update the weights as w' = w + η y_i x_i
This process is repeated until there are no more misclassified
examples.

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 23 / 62
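A minimal sketch of this training loop, assuming bipolar targets y_i ∈ {+1, −1} and the bias folded in as an extra constant input; the tiny data set at the bottom (logical OR) is illustrative only.

import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=100):
    """Perceptron rule: w' = w + eta * y_i * x_i for each misclassified example."""
    X = np.hstack([np.ones((len(X), 1)), X])   # prepend a constant 1 so w[0] acts as the bias w_0
    w = np.zeros(X.shape[1])                   # any initial weight vector would do
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:        # misclassified (or exactly on the boundary)
                w += eta * yi * xi
                errors += 1
        if errors == 0:                        # no more misclassified examples
            break
    return w

# Illustrative linearly separable data: logical OR with bipolar targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, 1, 1, 1])
print(perceptron_train(X, y))   # a separating weight vector [w_0, w_1, w_2]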



Perceptron Convergence

Will this iteration converge and terminate?

Perceptron Convergence Theorem: the Perceptron learning algorithm
converges for any linearly separable data, irrespective of the initial
weights chosen.

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 24 / 62


Perceptron: Dual view

There is no restriction on what the initial weight vector is; we can
very well start with the zero vector. Also assume that the learning rate
is set to 1.
As per the Perceptron algorithm, every time an example x_i is
misclassified, we add y_i x_i to the weight vector.
Let α_i be the number of times the example x_i is misclassified. Then
the resulting weight vector can be expressed as

  w = \sum_{i=1}^{n} \alpha_i y_i x_i

where n is the number of examples in the training set.

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 25 / 62


Perceptron: Dual view

Note that an example x_j is misclassified when y_j (w \cdot x_j) < 0, i.e. when

  y_j \sum_{i=1}^{n} \alpha_i y_i (x_i \cdot x_j) < 0

In the dual problem, we learn all the α_i. In the learning iteration,
α_i is incremented whenever x_i is misclassified.
When the training is over, an instance x is classified as

  y = \mathrm{sign}\left( \sum_{i=1}^{n} \alpha_i y_i (x_i \cdot x) \right)

An interesting point to note is that α_i could be 0 for many examples
that are far away from the hyperplane (thereby reducing the number of
model parameters)!
Another interesting point is that we need only the pairwise dot products
of the examples. This is very important in the context of Support
Vector Machines (SVMs).

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 26 / 62
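A minimal sketch of the dual view under the same assumptions (zero initial weights, learning rate 1, bipolar targets): it learns the counts α_i and classifies using only pairwise dot products. The data reuse the illustrative OR example with a constant bias feature.

import numpy as np

def dual_perceptron_train(X, y, max_epochs=100):
    """Learn alpha_i, the number of times each training example is misclassified."""
    G = X @ X.T                      # Gram matrix of pairwise dot products x_i . x_j
    alpha = np.zeros(len(X))
    for _ in range(max_epochs):
        mistakes = 0
        for j in range(len(X)):
            if y[j] * np.sum(alpha * y * G[:, j]) <= 0:   # x_j is misclassified
                alpha[j] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

def dual_predict(alpha, X, y, x):
    """Classify x using only its dot products with the training examples."""
    return np.sign(np.sum(alpha * y * (X @ x)))

# Illustrative data: logical OR with bipolar targets and a constant bias feature
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([-1, 1, 1, 1], dtype=float)
alpha = dual_perceptron_train(X, y)
print(alpha, dual_predict(alpha, X, y, np.array([1.0, 1.0, 0.0])))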


Outline

1 Introduction

2 Understanding Linear Discrimination

3 Perceptron Weight Update Rule

4 LMS Weight Update Rule

5 Back-Propagation Algorithm

6 Building and Validating NN Models

7 Summary

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 27 / 62



LMS Weight Update Rule

Instead of a simple step function, a differentiable non-linear function,
such as a sigmoid function, can be applied to the weighted sum of the
inputs (logistic regression).
This can be used for both classification and regression.
Error to be minimized:

  E = \frac{1}{2} Err^2 = \frac{1}{2} \left( y - h_w(x) \right)^2

A simple gradient descent algorithm may be used to find the weights
that minimize the error.
Note that the gradient is obtained as

  \left\langle \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \ldots, \frac{\partial E}{\partial w_j}, \ldots, \frac{\partial E}{\partial w_n} \right\rangle

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 28 / 62



LMS Weight Update Rule

  \frac{\partial E}{\partial w_j}
    = Err \times \frac{\partial Err}{\partial w_j}
    = Err \times \frac{\partial}{\partial w_j} \left( y - f\left( \sum_{j=0}^{n} w_j x_j \right) \right)
    = -Err \times f'(inp) \times x_j

Since the gradient shows the direction in which the error function is
growing, we “descend” in the opposite direction.
But what should be the quantum of change in that direction?
We use a parameter called the learning rate (η) to control this and
arrive at the following rule:

  w_j' = w_j + \eta \times Err \times f'(inp) \times x_j

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 29 / 62
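A minimal sketch of this rule for a single sigmoid unit (σ = 1) with incremental, per-example updates; the data set (logical OR with 0/1 targets) and the learning rate are illustrative choices.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lms_train(X, y, eta=0.5, epochs=1000):
    """w_j' = w_j + eta * Err * f'(inp) * x_j, applied per example."""
    X = np.hstack([np.ones((len(X), 1)), X])   # bias handled as the weight w_0
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, target in zip(X, y):
            inp = np.dot(w, xi)                # weighted sum of the inputs
            out = sigmoid(inp)                 # h_w(x)
            err = target - out                 # Err = y - h_w(x)
            w += eta * err * out * (1 - out) * xi   # f'(inp) = f(inp) (1 - f(inp))
    return w

# Illustrative data: logical OR with 0/1 targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])
w = lms_train(X, y)
print(np.round(sigmoid(np.hstack([np.ones((4, 1)), X]) @ w)))   # roughly [0, 1, 1, 1]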


LMS Algorithm

Start with an initial weight vector (including the bias component w_0)
For each example in the training set, do:
  Compute the output with the current weights
  Find the Err at the output
  Adjust all the weights as per the weight update rule (incremental
  update), or add up all the weight updates and apply them at the end of
  the iteration (batch update)
Continue with this iteration until the mean squared error falls below a
pre-determined threshold

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 30 / 62



Convergence of LMS

Does this algorithm converge?

The learning rate η plays an important role. Convergence analyses have
been carried out and constraints on choosing η have been reported in
the literature.
Does this converge to an optimal weight vector?
Like any gradient descent algorithm, this learning algorithm may get
trapped in a local minimum.
But should we worry too much about local minima?

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 31 / 62


Outline

1 Introduction

2 Understanding Linear Discrimination

3 Perceptron Weight Update Rule

4 LMS Weight Update Rule

5 Back-Propagation Algorithm

6 Building and Validating NN Models

7 Summary

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 32 / 62


Limitations of linear models

The geometric models discussed so far, and their corresponding learning
algorithms, are obviously suitable only for linearly separable
classification problems.
But real-world problems, in general, are not linearly separable, and
hence a linear discriminator model is not sufficient.
A very simple and classic example is given below:

Figure: Linear Separability (Adapted from [Russel and Norvig, 2009])

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 33 / 62


Multi-layer Perceptron

Figure: Multi-Layer Perceptron (Adapted from [Russel and Norvig, 2009])

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 34 / 62


Multi-layer Perceptron

Any continuous function can be represented with two layers, and any
function with three layers [Hornik et al., 1989]:
  Combine two opposite-facing threshold functions to make a ridge
  Combine two perpendicular ridges to make a bump
  Add bumps of various sizes and locations to fit any surface

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 35 / 62



Back-propagation algorithm

Gradient based error minimization techniques do not work directly
because we do not know the target for the hidden units!
The solution is to “back propagate” the error from all the output units
to set a target for a hidden unit [Rumelhart et al., 1986]

Figure: A hidden unit j is connected to output units i through weights
W_{j,i}; the output errors ∆_i are propagated back to give ∆_j

  \Delta_j = g'(inp_j) \sum_i W_{j,i} \Delta_i, \qquad \text{where } \Delta_i = Err_i \times g'(inp_i)
C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 36 / 62

Derivation of back-propagation weight update rules

  E = \frac{1}{2} \sum_i (y_i - a_i)^2

  \frac{\partial E}{\partial W_{J,I}}
    = -(y_I - a_I) \frac{\partial a_I}{\partial W_{J,I}}
    = -(y_I - a_I) \frac{\partial g(inp_I)}{\partial W_{J,I}}
    = -(y_I - a_I) \, g'(inp_I) \frac{\partial inp_I}{\partial W_{J,I}}
    = -(y_I - a_I) \, g'(inp_I) \frac{\partial}{\partial W_{J,I}} \left( \sum_j W_{j,I} \, a_j \right)
    = -(y_I - a_I) \, g'(inp_I) \, a_J
    = -\Delta_I \, a_J

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 37 / 62



Derivation of back-propagation weight update rules

  \frac{\partial E}{\partial W_{K,J}}
    = -\sum_i (y_i - a_i) \frac{\partial a_i}{\partial W_{K,J}}
    = -\sum_i (y_i - a_i) \frac{\partial g(inp_i)}{\partial W_{K,J}}
    = -\sum_i (y_i - a_i) \, g'(inp_i) \frac{\partial inp_i}{\partial W_{K,J}}
    = -\sum_i \Delta_i \frac{\partial}{\partial W_{K,J}} \left( \sum_j W_{j,i} \, a_j \right)
    = -\sum_i \Delta_i W_{J,i} \frac{\partial a_J}{\partial W_{K,J}}
    = -\sum_i \Delta_i W_{J,i} \frac{\partial g(inp_J)}{\partial W_{K,J}}
    = -\sum_i \Delta_i W_{J,i} \, g'(inp_J) \frac{\partial inp_J}{\partial W_{K,J}}
    = -\sum_i \Delta_i W_{J,i} \, g'(inp_J) \frac{\partial}{\partial W_{K,J}} \left( \sum_k W_{k,J} \, a_k \right)
    = -\sum_i \Delta_i W_{J,i} \, g'(inp_J) \, a_K
    = -\Delta_J \, a_K

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 38 / 62



Weight Update Rules

Output Layer: Weight update is similar to that of a Perceptron

  W_{j,i} \leftarrow W_{j,i} + \eta \times a_j \times \Delta_i,
  \qquad \text{where } \Delta_i = Err_i \times g'(inp_i)

Hidden Layer: Back-propagate the error from the output layer and use
that for updating the weights

  \Delta_j = g'(inp_j) \sum_i W_{j,i} \Delta_i

  W_{k,j} \leftarrow W_{k,j} + \eta \times a_k \times \Delta_j

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 39 / 62
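A minimal sketch of these update rules for a network with one hidden layer of sigmoid units, trained incrementally on the classic XOR problem; the layer sizes, learning rate, epoch count and random seed are illustrative choices, and a different seed may need more epochs to converge.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Illustrative XOR data; the leading column of ones supplies the bias input a_k = 1
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W_kj = rng.uniform(-1, 1, (3, 4))   # input-to-hidden weights W_{k,j}
W_ji = rng.uniform(-1, 1, (4, 1))   # hidden-to-output weights W_{j,i}
eta = 0.5

for _ in range(20000):
    for x, t in zip(X, y):
        a_j = sigmoid(x @ W_kj)                        # hidden activations
        a_i = sigmoid(a_j @ W_ji)                      # output activations
        delta_i = (t - a_i) * a_i * (1 - a_i)          # Delta_i = Err_i * g'(inp_i)
        delta_j = a_j * (1 - a_j) * (W_ji @ delta_i)   # Delta_j = g'(inp_j) * sum_i W_{j,i} Delta_i
        W_ji += eta * np.outer(a_j, delta_i)           # W_{j,i} <- W_{j,i} + eta * a_j * Delta_i
        W_kj += eta * np.outer(x, delta_j)             # W_{k,j} <- W_{k,j} + eta * a_k * Delta_j

print(np.round(sigmoid(sigmoid(X @ W_kj) @ W_ji)))     # typically close to [[0], [1], [1], [0]]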


Back-propagation algorithm

Initialize the weights — many methods have been proposed
Perform the following iteration for each example:
  Feed forward: compute the output of the network using the current
  weights
  Back-propagation: compute and propagate the error back (starting
  from the output layer)
  Weight update: update all the weights using the weight update rules
  (incremental learning), or add up the weight updates for all the
  examples and apply them at the end of an iteration (batch learning)
Continue this iteration until the error falls below a pre-determined
value

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 40 / 62



Issues in Back-propagation algorithm

The back-propagation algorithm has generated a lot of interest, and
several variations have been proposed.
However, there are several issues in employing this algorithm:
  It has been shown that finding weights to minimize the error is
  NP-complete [Blum and Rivest, 1989]
  The algorithm is not guaranteed to converge
  Over-fitting to the training samples
  The error minimization process can get trapped in a local minimum
  Extremely slow convergence
  Selection of initial weights may be critical
  Selection of optimal parameters
  Extremely difficult to explain the model and its predictions

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 41 / 62


Outline

1 Introduction

2 Understanding Linear Discrimination

3 Perceptron Weight Update Rule

4 LMS Weight Update Rule

5 Back-Propagation Algorithm

6 Building and Validating NN Models

7 Summary

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 42 / 62



NN for Image Analysis

Identify the machine learning task involved — binary classification,
multi-class classification, clustering, etc.
  Given an image segment, identify the object therein — multi-class
  classification
  Given an image segment, identify whether it is a human face or not —
  binary classification
  Handwritten digit recognition from images — 10-class classification
Decide if any pre-processing is required on the images — noise
removal, re-sizing, re-scaling, gray-scale conversion, etc.
Decide on the features to be considered
  Face detection: PCA, LDA, Local Binary Pattern histogram, Legendre
  moments, Zernike moments, Generic Fourier Descriptor, etc.
  Or supply all the pixel values directly!

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 43 / 62



NN for Image Analysis

Decide on the feed-forward network structure
  The number of input neurons depends on the size of the feature vector,
  and the number of output neurons depends on the number of classes /
  clusters
  Decide on the number of hidden layers and the number of neurons in
  each hidden layer
Example: hand-written digit recognition from the MNIST database —
[LeCun et al., 1998] reported the following structures:
  10 single layer networks, each with the same 28 × 28 inputs — winner
  takes all
  45 one-against-one single layer networks — the final score for class i
  is the sum of the i/x units minus the sum of the y/i units, for all x
  and y
  28 × 28 – 300 – 10, 28 × 28 – 1000 – 10, 28 × 28 – 300 – 100 – 10, and
  28 × 28 – 1000 – 150 – 10 multi-layer networks

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 44 / 62



NN for Image Analysis

Are there any techniques to find a suitable network structure?

Grid search — select a range and step size for each parameter, perform
an exhaustive search using each combination, and select the “best
performing” combination on the training set
Optimal brain damage [LeCun et al., 1990]
Dropout, proposed by [Srivastava et al., 2014]
  For each presentation of each instance, certain units together with
  all their connections are randomly dropped
  For a network with n units, a maximum of 2^n different “thin” networks
  are trained on different instances
  All the weights are finally “combined” based on the retention
  probabilities of the units, to obtain a single network for production
  use

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 45 / 62
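A minimal sketch of the grid search idea, assuming a hypothetical train_and_score(params, X, y) helper that trains a network with the given hidden layer size and learning rate and returns a validation score; the ranges and step sizes are illustrative.

from itertools import product

def grid_search(X, y, train_and_score):
    """Try every parameter combination and keep the best-scoring one."""
    hidden_sizes = [50, 100, 300]        # illustrative range for the hidden layer size
    learning_rates = [0.01, 0.1, 0.5]    # illustrative range for the learning rate
    best_params, best_score = None, float("-inf")
    for h, lr in product(hidden_sizes, learning_rates):
        params = {"hidden_units": h, "learning_rate": lr}
        score = train_and_score(params, X, y)   # hypothetical helper: train, then evaluate
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score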



NN for Image Analysis
Decide on the activation function and its parameters — sigmoid,
tanh, Gaussian, Elliot, bounded linear, cos, sin, etc.
Decide on the error function to be minimized — regularization?
Decide on the learning rate — it may be adapted depending on the
error — each weight can have a different, adaptive learning rate
[Tollenaere, 1990]
Decide on the learning momentum — add a fraction of the last weight
change to the current direction of movement
Decide on the initial weight vectors — avoid weights for which f or f'
evaluates to 0; random weights between −1 and 1; Nguyen-Widrow
initialization [Nguyen and Widrow, 1990]; Yam-Chow initialization
[Yam and Chow, 2001]
Decide on the learning algorithm — standard back-propagation
(incremental or batch), QuickProp [Fahlman, 1988], RPROP
[Riedmiller and Braun, 1993], second-order methods
[Moller, 1993, Hagan and Menhaj, 1994], etc.
C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 46 / 62
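A minimal sketch of the momentum idea mentioned above: the new weight change adds a fraction μ of the previous weight change to the current gradient step. The toy error surface, learning rate and momentum value are illustrative assumptions.

import numpy as np

def momentum_step(w, grad, prev_delta, eta=0.1, mu=0.9):
    """delta_w(t) = -eta * grad + mu * delta_w(t-1)."""
    delta = -eta * grad + mu * prev_delta
    return w + delta, delta

# Illustrative use on a toy quadratic error with minimum at [1.0, -2.0, 0.5]
w = np.zeros(3)
prev_delta = np.zeros(3)
for _ in range(50):
    grad = 2 * (w - np.array([1.0, -2.0, 0.5]))   # gradient of the toy error
    w, prev_delta = momentum_step(w, grad, prev_delta)
print(w)   # approaches the toy minimum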
Performance measures for binary classification
Let T be the size of the test set, with POS positive instances and NEG
negative instances: T = POS + NEG.
Let TP be the number of positive instances that are correctly classified
and TN the number of correctly classified negative instances. False
positives and false negatives are then given by FP = NEG − TN and
FN = POS − TP.
Accuracy can be obtained as

  Acc = \frac{TP + TN}{T}

Precision (P) and Recall (R) (also known as True Positive Rate or
Sensitivity) are two important measures for binary classification:

  P = \frac{TP}{TP + FP} \qquad R = \frac{TP}{TP + FN}

The F-measure is the harmonic mean of P and R:

  F_1 = \frac{2 \times P \times R}{P + R}
C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 47 / 62

Confusion Matrix
(Also known as Contingency Matrix)

              Predicted (+)   Predicted (–)   Total
Actual (+)         75              15           90
Actual (–)         10              60           70
Total              85              75          160

Table: Confusion Matrix

T = 160; POS = 90; NEG = 70
TP = 75; FP = 10; FN = 15; TN = 60
Accuracy = (75 + 60)/160 = 0.84375
Recall = 75/90 = 0.83333; Precision = 75/85 = 0.88235
F1 = (2 × 0.88235 × 0.83333)/(0.88235 + 0.83333) = 0.85714
True Negative Rate (or Specificity, or negative recall) = 60/70 = 0.85714
Similarly, the False Positive Rate (1 − TNR) and the False Negative Rate
(1 − TPR) can also be computed
C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 48 / 62
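A small sketch that recomputes the measures above directly from the confusion-matrix counts in the table (TP = 75, FP = 10, FN = 15, TN = 60).

def binary_metrics(tp, fp, fn, tn):
    """Standard binary classification measures from the confusion matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # true positive rate / sensitivity
    specificity = tn / (tn + fp)     # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f1

# Counts from the confusion matrix above
print(binary_metrics(tp=75, fp=10, fn=15, tn=60))
# approximately (0.84375, 0.88235, 0.83333, 0.85714, 0.85714)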
ROC Plots
(Also known as normalized coverage plots)

Figure: ROC Plot

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 49 / 62



Cross-Validation

How do we assess the training error in building a model from data?

This error analysis is important to have an estimate of the prediction
error of the model.
When a model is built using all of the training data, error analysis
may be misleading.
Hence, k-fold cross-validation is generally advocated:
  The training data is divided into k sets in such a way that each set
  maintains the key ratios of the training set (stratification)
  k different experiments are conducted, where in each experiment one
  of the folds is used for testing and the remaining k − 1 folds are
  used for training
  Thus, each instance is used for testing once, based on which the error
  can be calculated

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 50 / 62
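A minimal sketch of k-fold cross-validation (without stratification, for brevity), assuming hypothetical train(X, y) and evaluate(model, X, y) helpers; it returns the mean score over the k test folds.

import numpy as np

def k_fold_cross_validate(X, y, train, evaluate, k=5, seed=0):
    """Each of the k folds is used exactly once for testing."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train(X[train_idx], y[train_idx])                 # hypothetical training helper
        scores.append(evaluate(model, X[test_idx], y[test_idx]))  # hypothetical scoring helper
    return float(np.mean(scores))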


Suggested process to apply NN

Precisely define the machine learning task required (classification,
multi-class, regression, etc.)
Ensure that an NN is suitable for that task
Decide on the features to be used, and decide on how to extract
those features from instances
Collect a large set of random instances
Extract features and scale them accordingly
Divide the data into two disjoint sets: a training set and a test set
Use the training set to obtain a model and perform k-fold cross
validation — decide on the optimal algorithm parameters
Analyze the performance of the model on the test set (accuracy,
precision, recall, F-measure, etc.)
Repeat for different random partitions of the instances into training
and test sets
C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 51 / 62
Outline

1 Introduction

2 Understanding Linear Discrimination

3 Perceptron Weight Update Rule

4 LMS Weight Update Rule

5 Back-Propagation Algorithm

6 Building and Validating NN Models

7 Summary

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 52 / 62


Summary

Machine learning is about using the right features to build the right
models that achieve the right tasks [Flach, 2012]
In this talk, we have focused on finding a linear discriminant model (a
hyperplane in the feature space) for the binary classification problem
The model has to be constructed from examples (inductive learning)
that are properly labeled (supervised learning). Further, the model
has to be used for predicting the class of a new instance (predictive
analytics)
A hyperplane in the feature space that properly separates positive and
negative examples can be constructed by the Perceptron or LMS learning
algorithms
It has been shown that these algorithms converge for linearly
separable problems

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 53 / 62


Summary

These algorithms can only find linear models and are not suitable for
problems that are not linearly separable
From a neural networks perspective, we need hidden layers to handle such
problems
Back-propagation of errors is the basic mechanism used to arrive at
algorithms for learning the weights of such neural networks
However, there are several issues with the back-propagation algorithm —
it is not guaranteed to converge, it may get trapped in local minima, etc.
We have highlighted a few points to overcome these limitations and
suggested a process for applying ANNs to solve a problem

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 54 / 62


References I

Abu-Mostafa, Y. S., Magdon-Ismail, M., and Lin, H.-T. (2012).


Learning from Data — A Short Course.
AMLbook.
Alpaydin, E. (2010).
Introduction to Machine Learning.
The MIT Press, Second edition.
Anderson, J. A. (1995).
An introduction to neural networks.
The MIT Press.
Blum, A. and Rivest, R. L. (1989).
Training a 3-node neural net is NP-complete.
In Advances in neural information processing systems, volume 1, pages
494–501. Morgan-Kaufmann.

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 55 / 62


References II

Fahlman, S. E. (1988).
An empirical study of learning speed in back-propagation networks.
Technical Report CMU-CS-88-162, Carnegie Mellon University.
Flach, P. (2012).
Machine Learning: The art and science of algorithms that make sense
of data.
Cambridge University Press.
Hagan, M. T. and Menhaj, M. B. (1994).
Training feedforward networks with the Marquardt algorithm.
IEEE Transactions on Neural Networks, 5:989–993.
Hassoun, M. H. (1995).
Fundamentals of Artificial Neural Networks.
The MIT Press.

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 56 / 62


References III

Hertz, J., Krogh, A., and Palmer, R. G. (1991).


Introduction to the theory of neural computing.
Addison-Wesley Publishing Company.
Hornik, K., Stinchcombe, M., and White, H. (1989).
Multilayer feedforward networks are universal approximators.
Neural Networks, 2(5):359–366.
Jacobs, R. A. (1988).
Increased rates of convergence through learning rate adaptation.
Neural Networks, 1(4):295–307.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998).
Gradient-based learning applied to document recognition.
In Proceedings of the IEEE, volume 86, pages 2278–2324.

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 57 / 62


References IV

LeCun, Y., Denker, J., Solla, S., Howard, R. E., and Jackel, L. D.
(1990).
Optimal brain damage.
In Advances in Neural Information Processing Systems, volume II.
Mitchell, T. M. (1997).
Machine Learning.
McGraw-Hill.
Moller, M. F. (1993).
A scaled conjugate gradient algorithm for fast supervised learning.
Neural Networks, 6:525–533.

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 58 / 62


References V

Nguyen, D. and Widrow, B. (1990).


Improving the learning speed of 2-layer neural networks by choosing
initial values of the adaptive weights.
In Proceedings of the International Joint Conference on Neural
Networks, volume 3, pages 21–26.
Riedmiller, M. and Braun, H. (1993).
A direct adaptive method for faster backpropagation learning: The
RPROP algorithm.
In Proceedings of the IEEE International Conference on Neural
Networks, pages 586–591. IEEE Press.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986).
Learning representations by back-propagating errors.
Nature, 323:533–536.
doi:10.1038/323533a0.

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 59 / 62


References VI

Russel, S. and Norvig, P. (2009).


Artificial Intelligence — A Modern Approach.
Prentice Hall, Third edition.
Sarkar, D. (1995).
Methods to speed up error back-propagation learning algorithm.
ACM Computing Surveys, 27(4):519–542.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and
Salakhutdinov, R. (2014).
Dropout: A simple way to prevent neural networks from overfitting.
Journal of Machine Learning Research, 15:1929–1958.
Theodoridis, S. and Koutroumbas, K. (2009).
Pattern Recognition.
Academic Press, Fourth edition.

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 60 / 62


References VII

Tollenaere, T. (1990).
SuperSAB: Fast adaptive backpropagation with good scaling
properties.
Neural Networks, 3:561–573.
Vogl, T. P., Mangis, J. K., Rigler, A. K., Zink, W. T., and Alkon,
D. L. (1988).
Accelerating the convergence of the backpropagation method.
Biological Cybernetics, 58:257–263.
Yam, J. Y. F. and Chow, T. W. S. (2001).
Feed forward networks training speed enhancement by optimal
initialization of the synaptic coefficients.
IEEE Transactions on Neural Networks, 12(2):430–434.

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 61 / 62


QUESTIONS?

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 62 / 62
