
Introduction to Neural Networks

Perceptron and Feed-Forward Networks

Chandrabose Aravindan
<ArvindanC@ssn.edu.in>

Machine Learning Research Group


SSN College of Engineering, Chennai

Presented at:
Workshop on Machine Learning for Image Analysis
SSN, Chennai

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 1 / 62


Outline

1 Introduction

2 Understanding Linear Discrimination

3 Perceptron Weight Update Rule

4 LMS Weight Update Rule

5 Back-Propagation Algorithm

6 Building and Validating NN Models

7 Summary

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 2 / 62



Introduction

An Artificial Neural Network (ANN) is a computational model that is
capable of representing any continuous or discrete function.
An interesting point about ANNs is that a model can be “learnt” from
input-output example pairs.
They are extremely useful when the underlying functional mapping is not
clear and all we have is a set of examples.
Can we describe a function that maps an image to a digit?
Do we know the function that maps CT scan images to stenosis?
In this talk, we will focus on the basics of a simple but useful kind of
ANN referred to as feed-forward networks (single layer and multi-layer).

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 4 / 62


Binary Classification Problem

Consider a task where we need to decide whether an instance belongs
to a class or not — for example, decide whether a given image is the
digit “1” or not.
This is a binary classification task.
Assume that we have identified the features (which can be represented
by real numbers) and have collected positive and negative instances
— for digit recognition, we may simply take the pixel values.
Each instance is a vector in the feature space.
The machine learning problem here is to find a geometric model with
which we can predict the class of new instances.
In this talk, we focus on building discriminating neural network
models from instances (inductive learning) using supervised learning
algorithms.

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 5 / 62


Binary Classification

Figure: Positive (+) and negative (−) instances scattered in a
two-dimensional feature space with axes f1 and f2

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 6 / 62



Z-Score

Problem: Predict if a company goes bankrupt in the near future


Features:
A: Working capital / Total assets
B: Retained earnings / Total assets
C: Earnings before interest and tax / Total assets
D: Market value of equity / Total liabilities
E: Sales / Total assets
Altman’s Z-Score: Z = 1.2A + 1.4B + 3.3C + 0.6D + 1.0E
Predict bankruptcy if Z < 1.8
How did Prof. Edward Altman find these magic numbers?
Do we have any algorithm / tool today that can auto-magically find
these numbers from available labelled data?

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 7 / 62
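To make the idea concrete, here is a small sketch (not from the original slides) that computes Altman's Z-Score from the five ratios above and applies the Z < 1.8 bankruptcy threshold; the ratio values in the example call are made-up numbers for illustration only.

def altman_z_score(a, b, c, d, e):
    """Altman's Z-Score: a fixed weighted sum of five financial ratios."""
    return 1.2 * a + 1.4 * b + 3.3 * c + 0.6 * d + 1.0 * e

def predict_bankruptcy(a, b, c, d, e, threshold=1.8):
    """Predict bankruptcy if the Z-Score falls below the threshold."""
    return altman_z_score(a, b, c, d, e) < threshold

# Made-up ratio values for A, B, C, D, E
print(altman_z_score(0.2, 0.1, 0.05, 0.8, 1.1))      # the Z-Score for these ratios
print(predict_bankruptcy(0.2, 0.1, 0.05, 0.8, 1.1))  # True if predicted to go bankrupt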



Linear Discriminant Function

The general idea for discrimination: consider a weighted sum of the
feature values and take a decision based on how it compares with a
threshold (like Altman's Z-Score).
g(x) = w_0 + w_1 x_1 + \ldots + w_n x_n (note that w_0 is the bias added
to the weighted sum)
In vector notation, this can be written as g(x) = w^T x + w_0
Typically, a non-linear function is applied to this weighted sum. Let's
assume a simple step function. The binary class decision can now be
taken as +1 if g(x) > 0, and −1 if g(x) < 0.
What we have described here is a simple model of a neuron!

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 8 / 62
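A minimal sketch of this neuron model, assuming a bipolar step activation: it computes the weighted sum g(x) = w^T x + w_0 and outputs +1 or −1. The weights, bias and input below are arbitrary illustrative values.

import numpy as np

def neuron_output(w, w0, x):
    """Simple neuron: weighted sum plus bias, followed by a bipolar step."""
    g = np.dot(w, x) + w0        # g(x) = w^T x + w_0
    return 1 if g > 0 else -1    # step activation

# Arbitrary illustrative weights, bias and input vector
w = np.array([1.2, 1.4, 3.3, 0.6, 1.0])
w0 = -1.8
x = np.array([0.2, 0.1, 0.05, 0.8, 1.1])
print(neuron_output(w, w0, x))   # +1 or -1, depending on the side of the hyperplane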


Model of a Neuron

Figure: Model of a Neuron (Adapted from [Russel and Norvig, 2009])

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 9 / 62


Model of a Neuron

Figure: Biological Neuron (Adapted from [Russel and Norvig, 2009])

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 10 / 62


NN Examples

Figure: Simple Examples (Adapted from [Russel and Norvig, 2009])

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 11 / 62


Activation Functions

Binary and bipolar step functions:

  f(x) = \begin{cases} 1 & x > 0 \\ 0 & x \le 0 \end{cases}
  \qquad
  f(x) = \begin{cases} 1 & x > 0 \\ -1 & x \le 0 \end{cases}

Binary and bipolar sigmoid functions:

  f(x) = \frac{1}{1 + e^{-\sigma x}}
  \qquad
  g(x) = 2f(x) - 1 = \frac{1 - e^{-\sigma x}}{1 + e^{-\sigma x}}

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 12 / 62


Activation Functions

Desirable properties of an activation function include that it is
differentiable and that its derivative is expressible in terms of the
function itself.
Binary sigmoid:

  f'(x) = \sigma f(x) \left[ 1 - f(x) \right]

Bipolar sigmoid:

  g'(x) = \frac{\sigma}{2} \left[ 1 + g(x) \right] \left[ 1 - g(x) \right]

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 13 / 62
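A small numerical sketch, assuming σ = 1, of the binary sigmoid and its derivative written in terms of the function itself; the last lines check the identity f'(x) = σ f(x)[1 − f(x)] against a finite-difference estimate at an arbitrary point.

import numpy as np

SIGMA = 1.0  # assumed slope parameter

def sigmoid(x):
    """Binary sigmoid f(x) = 1 / (1 + e^(-sigma * x))."""
    return 1.0 / (1.0 + np.exp(-SIGMA * x))

def sigmoid_derivative(x):
    """Derivative expressed through the function itself."""
    fx = sigmoid(x)
    return SIGMA * fx * (1.0 - fx)

x, h = 0.7, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)   # finite-difference estimate
print(np.isclose(numeric, sigmoid_derivative(x)))        # True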


Outline

1 Introduction

2 Understanding Linear Discrimination

3 Perceptron Weight Update Rule

4 LMS Weight Update Rule

5 Back-Propagation Algorithm

6 Building and Validating NN Models

7 Summary

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 14 / 62



Boundary separating the classes

Let us consider a neuron with a step function for activation.

What is the boundary separating the two classes?

  g(x) = w^T x + w_0 = 0

In the case of two dimensions this is a line; in three dimensions it is
a plane; and in general it is a hyperplane.
Thus, we are looking for a geometric model (a hyperplane) defined by the
weights as model parameters.

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 15 / 62


Binary Classification — Boundary

Figure: Positive (+) and negative (−) instances in the (f1, f2) feature
space, together with new query points (marked ?) whose class must be
predicted from the separating boundary

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 16 / 62



Geometry of Linear Discrimination

If x_1 and x_2 are any two arbitrary points on the hyperplane, then

  w^T x_1 + w_0 = w^T x_2 + w_0 = 0

This means that

  w^T (x_1 - x_2) = 0

Note that the vector x_1 − x_2 lies on the hyperplane, and its dot
product with the weight vector is 0.
Hence, the weight vector w is orthogonal to the hyperplane and points in
the positive direction.

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 17 / 62


Geometry of Linear Discrimination

Figure: Hyperplane geometry in 2-dimensional space (Adapted from
[Theodoridis and Koutroumbas, 2009])

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 18 / 62


Geometry of Linear Discrimination

Now, the distance between an arbitrary vector x and the hyperplane is
given as

  z = \frac{|g(x)|}{\|w\|}

In particular, the distance between the origin and the hyperplane is

  d = \frac{|w_0|}{\|w\|}

Thus, the sign of w_0 says whether the origin is in the negative or the
positive response region.
g(x) is a measure of the Euclidean distance of the point x from the
decision plane.
w_0 determines the position of the hyperplane, while its orientation is
determined by the weight vector w.

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 19 / 62
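A minimal sketch of these distance formulas with arbitrary illustrative parameters: it evaluates g(x), the distance z of a point x from the hyperplane, and the distance d of the origin from the hyperplane.

import numpy as np

# Arbitrary illustrative hyperplane: g(x) = w^T x + w_0
w = np.array([3.0, 4.0])
w0 = -5.0

def g(x):
    return np.dot(w, x) + w0

x = np.array([2.0, 1.0])
z = abs(g(x)) / np.linalg.norm(w)   # distance of x from the hyperplane
d = abs(w0) / np.linalg.norm(w)     # distance of the origin from the hyperplane
print(g(x), z, d)                   # 5.0, 1.0 and 1.0 for these values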



Learning the weights

Now that we have discussed how the binary classification decision is
made and the corresponding geometry, we turn our attention to how to
find a model from given examples.
In this case, the model parameters are the weights, and thus we need
to learn the weight vector and the bias w_0 from the examples.
The Perceptron algorithm is one of the earliest and simplest algorithms
for this purpose.
It is not difficult to pose this as minimization of the mean squared
error (remember that this is supervised learning), and so the gradient
descent based LMS algorithm is also popular.

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 20 / 62



Supervised Learning Feedback Loop

Figure: Supervised learning feedback loop — an input x is passed through
the model h(x) (defined by the model parameters), the model output is
compared with the target f(x), and the resulting error is fed back to
adjust the parameters

Major Issue
Will this feedback loop converge?

Major Issue
Will the model generalize beyond the training samples?
C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 21 / 62
Outline

1 Introduction

2 Understanding Linear Discrimination

3 Perceptron Weight Update Rule

4 LMS Weight Update Rule

5 Back-Propagation Algorithm

6 Building and Validating NN Models

7 Summary

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 22 / 62



Perceptron Algorithm

Start with some initial weight vector (including the bias component w_0)
If a positive example x_i is misclassified, i.e. g(x_i) < 0 leading to a
negative response −1 while the target y_i = +1, we need to increase
the weight: w' = w + η x_i
If a negative example x_j is misclassified, i.e. g(x_j) > 0 leading to a
positive response +1 while the target y_j = −1, we need to decrease
the weight: w' = w − η x_j
These can be combined into a single update rule: when an example x_i
is misclassified, update the weights as w' = w + η y_i x_i
This process is repeated until there are no more misclassified
examples.

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 23 / 62
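A minimal sketch of this training loop, assuming bipolar targets y_i ∈ {+1, −1} and the bias folded in as an extra constant input; the tiny data set at the bottom (logical OR) is illustrative only.

import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=100):
    """Perceptron rule: w' = w + eta * y_i * x_i for each misclassified example."""
    X = np.hstack([np.ones((len(X), 1)), X])   # prepend a constant 1 so w[0] acts as the bias w_0
    w = np.zeros(X.shape[1])                   # any initial weight vector would do
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:        # misclassified (or exactly on the boundary)
                w += eta * yi * xi
                errors += 1
        if errors == 0:                        # no more misclassified examples
            break
    return w

# Illustrative linearly separable data: logical OR with bipolar targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, 1, 1, 1])
print(perceptron_train(X, y))   # a separating weight vector [w_0, w_1, w_2]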



Perceptron Convergence

Will this iteration converge and terminate?

Perceptron Convergence Theorem: the Perceptron learning algorithm
converges for any linearly separable data, irrespective of the initial
weights chosen.

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 24 / 62


Perceptron: Dual view

There is no restriction on what the initial weight vector is; we can
very well start with the zero vector. Also assume that the learning rate
is set to 1.
As per the Perceptron algorithm, every time an example x_i is
misclassified, we add y_i x_i to the weight vector.
Let α_i be the number of times the example x_i is misclassified. Then
the resulting weight vector can be expressed as

  w = \sum_{i=1}^{n} \alpha_i y_i x_i

where n is the number of examples in the training set.

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 25 / 62


Perceptron: Dual view

Note that an example x_j is misclassified when y_j (w \cdot x_j) < 0, i.e. when

  y_j \sum_{i=1}^{n} \alpha_i y_i (x_i \cdot x_j) < 0

In the dual problem, we learn all the α_i. In the learning iteration,
α_i is incremented whenever x_i is misclassified.
When the training is over, an instance x is classified as

  y = \mathrm{sign}\left( \sum_{i=1}^{n} \alpha_i y_i (x_i \cdot x) \right)

An interesting point to note is that α_i could be 0 for many examples
that are far away from the hyperplane (thereby reducing the number of
model parameters)!
Another interesting point is that we need only the pairwise dot products
of the examples. This is very important in the context of Support
Vector Machines (SVMs).

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 26 / 62
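A minimal sketch of the dual view under the same assumptions (zero initial weights, learning rate 1, bipolar targets): it learns the counts α_i and classifies using only pairwise dot products. The data reuse the illustrative OR example with a constant bias feature.

import numpy as np

def dual_perceptron_train(X, y, max_epochs=100):
    """Learn alpha_i, the number of times each training example is misclassified."""
    G = X @ X.T                      # Gram matrix of pairwise dot products x_i . x_j
    alpha = np.zeros(len(X))
    for _ in range(max_epochs):
        mistakes = 0
        for j in range(len(X)):
            if y[j] * np.sum(alpha * y * G[:, j]) <= 0:   # x_j is misclassified
                alpha[j] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

def dual_predict(alpha, X, y, x):
    """Classify x using only its dot products with the training examples."""
    return np.sign(np.sum(alpha * y * (X @ x)))

# Illustrative data: logical OR with bipolar targets and a constant bias feature
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([-1, 1, 1, 1], dtype=float)
alpha = dual_perceptron_train(X, y)
print(alpha, dual_predict(alpha, X, y, np.array([1.0, 1.0, 0.0])))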


Outline

1 Introduction

2 Understanding Linear Discrimination

3 Perceptron Weight Update Rule

4 LMS Weight Update Rule

5 Back-Propagation Algorithm

6 Building and Validating NN Models

7 Summary

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 27 / 62



LMS Weight Update Rule

Instead of a simple step function, a differentiable non-linear function,
such as a sigmoid function, can be applied to the weighted sum of the
inputs (logistic regression).
This can be used for both classification and regression.
Error to be minimized:

  E = \frac{1}{2} Err^2 = \frac{1}{2} \left( y - h_w(x) \right)^2

A simple gradient descent algorithm may be used to find the weights
that minimize the error.
Note that the gradient is obtained as

  \left\langle \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \ldots, \frac{\partial E}{\partial w_j}, \ldots, \frac{\partial E}{\partial w_n} \right\rangle

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 28 / 62



LMS Weight Update Rule

  \frac{\partial E}{\partial w_j}
    = Err \times \frac{\partial Err}{\partial w_j}
    = Err \times \frac{\partial}{\partial w_j} \left( y - f\left( \sum_{j=0}^{n} w_j x_j \right) \right)
    = -Err \times f'(inp) \times x_j

Since the gradient shows the direction in which the error function is
growing, we “descend” in the opposite direction.
But what should be the quantum of change in that direction?
We use a parameter called the learning rate (η) to control this and
arrive at the following rule:

  w_j' = w_j + \eta \times Err \times f'(inp) \times x_j

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 29 / 62
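A minimal sketch of this rule for a single sigmoid unit (σ = 1) with incremental, per-example updates; the data set (logical OR with 0/1 targets) and the learning rate are illustrative choices.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lms_train(X, y, eta=0.5, epochs=1000):
    """w_j' = w_j + eta * Err * f'(inp) * x_j, applied per example."""
    X = np.hstack([np.ones((len(X), 1)), X])   # bias handled as the weight w_0
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, target in zip(X, y):
            inp = np.dot(w, xi)                # weighted sum of the inputs
            out = sigmoid(inp)                 # h_w(x)
            err = target - out                 # Err = y - h_w(x)
            w += eta * err * out * (1 - out) * xi   # f'(inp) = f(inp) (1 - f(inp))
    return w

# Illustrative data: logical OR with 0/1 targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])
w = lms_train(X, y)
print(np.round(sigmoid(np.hstack([np.ones((4, 1)), X]) @ w)))   # roughly [0, 1, 1, 1]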


LMS Algorithm

Start with an initial weight vector (including the bias component w_0)
For each example in the training set, do:
  Compute the output with the current weights
  Find the Err at the output
  Adjust all the weights as per the weight update rule (incremental
  update), or add up all the weight updates and apply them at the end of
  the iteration (batch update)
Continue with this iteration until the mean squared error falls below a
pre-determined threshold

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 30 / 62



Convergence of LMS

Does this algorithm converge?

The learning rate η plays an important role. Convergence analyses have
been carried out and constraints on choosing η have been reported in
the literature.
Does this converge to an optimal weight vector?
Like any gradient descent algorithm, this learning algorithm may get
trapped in a local minimum.
But should we worry too much about local minima?

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 31 / 62


Outline

1 Introduction

2 Understanding Linear Discrimination

3 Perceptron Weight Update Rule

4 LMS Weight Update Rule

5 Back-Propagation Algorithm

6 Building and Validating NN Models

7 Summary

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 32 / 62


Limitations of linear models

The geometric models discussed so far, and their corresponding learning
algorithms, are obviously suitable only for linearly separable
classification problems.
But real-world problems, in general, are not linearly separable, and
hence a linear discriminator model is not sufficient.
A very simple and classic example is given below:

Figure: Linear Separability (Adapted from [Russel and Norvig, 2009])

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 33 / 62


Multi-layer Perceptron

Figure: Multi-Layer Perceptron (Adapted from [Russel and Norvig, 2009])

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 34 / 62


Multi-layer Perceptron

Any continuous function can be represented with two layers, and any
function with three layers [Hornik et al., 1989]:
  Combine two opposite-facing threshold functions to make a ridge
  Combine two perpendicular ridges to make a bump
  Add bumps of various sizes and locations to fit any surface

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 35 / 62



Back-propagation algorithm

Gradient based error minimization techniques do not work directly
because we do not know the target for the hidden units!
The solution is to “back propagate” the error from all the output units
to set a target for a hidden unit [Rumelhart et al., 1986]

Figure: A hidden unit j is connected to output units i through weights
W_{j,i}; the output errors ∆_i are propagated back to give ∆_j

  \Delta_j = g'(inp_j) \sum_i W_{j,i} \Delta_i, \qquad \text{where } \Delta_i = Err_i \times g'(inp_i)
C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 36 / 62

Derivation of back-propagation weight update rules

  E = \frac{1}{2} \sum_i (y_i - a_i)^2

  \frac{\partial E}{\partial W_{J,I}}
    = -(y_I - a_I) \frac{\partial a_I}{\partial W_{J,I}}
    = -(y_I - a_I) \frac{\partial g(inp_I)}{\partial W_{J,I}}
    = -(y_I - a_I) \, g'(inp_I) \frac{\partial inp_I}{\partial W_{J,I}}
    = -(y_I - a_I) \, g'(inp_I) \frac{\partial}{\partial W_{J,I}} \left( \sum_j W_{j,I} \, a_j \right)
    = -(y_I - a_I) \, g'(inp_I) \, a_J
    = -\Delta_I \, a_J

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 37 / 62



Derivation of back-propagation weight update rules

  \frac{\partial E}{\partial W_{K,J}}
    = -\sum_i (y_i - a_i) \frac{\partial a_i}{\partial W_{K,J}}
    = -\sum_i (y_i - a_i) \frac{\partial g(inp_i)}{\partial W_{K,J}}
    = -\sum_i (y_i - a_i) \, g'(inp_i) \frac{\partial inp_i}{\partial W_{K,J}}
    = -\sum_i \Delta_i \frac{\partial}{\partial W_{K,J}} \left( \sum_j W_{j,i} \, a_j \right)
    = -\sum_i \Delta_i W_{J,i} \frac{\partial a_J}{\partial W_{K,J}}
    = -\sum_i \Delta_i W_{J,i} \frac{\partial g(inp_J)}{\partial W_{K,J}}
    = -\sum_i \Delta_i W_{J,i} \, g'(inp_J) \frac{\partial inp_J}{\partial W_{K,J}}
    = -\sum_i \Delta_i W_{J,i} \, g'(inp_J) \frac{\partial}{\partial W_{K,J}} \left( \sum_k W_{k,J} \, a_k \right)
    = -\sum_i \Delta_i W_{J,i} \, g'(inp_J) \, a_K
    = -\Delta_J \, a_K

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 38 / 62



Weight Update Rules

Output Layer: Weight update is similar to that of a Perceptron

  W_{j,i} \leftarrow W_{j,i} + \eta \times a_j \times \Delta_i,
  \qquad \text{where } \Delta_i = Err_i \times g'(inp_i)

Hidden Layer: Back-propagate the error from the output layer and use
that for updating the weights

  \Delta_j = g'(inp_j) \sum_i W_{j,i} \Delta_i

  W_{k,j} \leftarrow W_{k,j} + \eta \times a_k \times \Delta_j

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 39 / 62
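A minimal sketch of these update rules for a network with one hidden layer of sigmoid units, trained incrementally on the classic XOR problem; the layer sizes, learning rate, epoch count and random seed are illustrative choices, and a different seed may need more epochs to converge.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Illustrative XOR data; the leading column of ones supplies the bias input a_k = 1
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W_kj = rng.uniform(-1, 1, (3, 4))   # input-to-hidden weights W_{k,j}
W_ji = rng.uniform(-1, 1, (4, 1))   # hidden-to-output weights W_{j,i}
eta = 0.5

for _ in range(20000):
    for x, t in zip(X, y):
        a_j = sigmoid(x @ W_kj)                        # hidden activations
        a_i = sigmoid(a_j @ W_ji)                      # output activations
        delta_i = (t - a_i) * a_i * (1 - a_i)          # Delta_i = Err_i * g'(inp_i)
        delta_j = a_j * (1 - a_j) * (W_ji @ delta_i)   # Delta_j = g'(inp_j) * sum_i W_{j,i} Delta_i
        W_ji += eta * np.outer(a_j, delta_i)           # W_{j,i} <- W_{j,i} + eta * a_j * Delta_i
        W_kj += eta * np.outer(x, delta_j)             # W_{k,j} <- W_{k,j} + eta * a_k * Delta_j

print(np.round(sigmoid(sigmoid(X @ W_kj) @ W_ji)))     # typically close to [[0], [1], [1], [0]]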


Back-propagation algorithm

Initialize the weights — many methods have been proposed
Perform the following iteration for each example:
  Feed forward: compute the output of the network using the current
  weights
  Back-propagation: compute and propagate the error back (starting
  from the output layer)
  Weight update: update all the weights using the weight update rules
  (incremental learning), or add up the weight updates for all the
  examples and apply them at the end of an iteration (batch learning)
Continue this iteration until the error falls below a pre-determined
value

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 40 / 62



Issues in Back-propagation algorithm

The back-propagation algorithm has generated a lot of interest, and
several variations have been proposed.
However, there are several issues in employing this algorithm:
  It has been shown that finding weights to minimize the error is
  NP-complete [Blum and Rivest, 1989]
  The algorithm is not guaranteed to converge
  Over-fitting to the training samples
  The error minimization process can get trapped in a local minimum
  Extremely slow convergence
  Selection of initial weights may be critical
  Selection of optimal parameters
  Extremely difficult to explain the model and its predictions

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 41 / 62


Outline

1 Introduction

2 Understanding Linear Discrimination

3 Perceptron Weight Update Rule

4 LMS Weight Update Rule

5 Back-Propagation Algorithm

6 Building and Validating NN Models

7 Summary

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 42 / 62



NN for Image Analysis

Identify the machine learning task involved — binary classification,
multi-class classification, clustering, etc.
  Given an image segment, identify the object therein — multi-class
  classification
  Given an image segment, identify whether it is a human face or not —
  binary classification
  Handwritten digit recognition from images — 10-class classification
Decide if any pre-processing is required on the images — noise
removal, re-sizing, re-scaling, gray-scale conversion, etc.
Decide on the features to be considered
  Face detection: PCA, LDA, Local Binary Pattern histogram, Legendre
  moments, Zernike moments, Generic Fourier Descriptor, etc.
  Or supply all the pixel values directly!

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 43 / 62



NN for Image Analysis

Decide on the feed-forward network structure
  The number of input neurons depends on the size of the feature vector,
  and the number of output neurons depends on the number of classes /
  clusters
  Decide on the number of hidden layers and the number of neurons in
  each hidden layer
Example: hand-written digit recognition from the MNIST database —
[LeCun et al., 1998] reported the following structures:
  10 single layer networks, each with the same 28 × 28 inputs — winner
  takes all
  45 one-against-one single layer networks — the final score for class i
  is the sum of the i/x units minus the sum of the y/i units, for all x
  and y
  28 × 28 – 300 – 10, 28 × 28 – 1000 – 10, 28 × 28 – 300 – 100 – 10, and
  28 × 28 – 1000 – 150 – 10 multi-layer networks

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 44 / 62



NN for Image Analysis

Are there any techniques to find a suitable network structure?

Grid search — select a range and step size for each parameter, perform
an exhaustive search using each combination, and select the “best
performing” combination on the training set
Optimal brain damage [LeCun et al., 1990]
Dropout, proposed by [Srivastava et al., 2014]
  For each presentation of each instance, certain units together with
  all their connections are randomly dropped
  For a network with n units, a maximum of 2^n different “thin” networks
  are trained on different instances
  All the weights are finally “combined” based on the retention
  probabilities of the units, to obtain a single network for production
  use

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 45 / 62
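A minimal sketch of the grid search idea, assuming a hypothetical train_and_score(params, X, y) helper that trains a network with the given hidden layer size and learning rate and returns a validation score; the ranges and step sizes are illustrative.

from itertools import product

def grid_search(X, y, train_and_score):
    """Try every parameter combination and keep the best-scoring one."""
    hidden_sizes = [50, 100, 300]        # illustrative range for the hidden layer size
    learning_rates = [0.01, 0.1, 0.5]    # illustrative range for the learning rate
    best_params, best_score = None, float("-inf")
    for h, lr in product(hidden_sizes, learning_rates):
        params = {"hidden_units": h, "learning_rate": lr}
        score = train_and_score(params, X, y)   # hypothetical helper: train, then evaluate
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score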



NN for Image Analysis
Decide on the activation function and its parameters — sigmoid,
tanh, Gaussian, Elliot, bounded linear, cos, sin, etc.
Decide on the error function to be minimized — regularization?
Decide on the learning rate — it may be adapted depending on the
error — each weight can have a different, adaptive learning rate
[Tollenaere, 1990]
Decide on the learning momentum — add a fraction of the last weight
change to the current direction of movement
Decide on the initial weight vectors — avoid weights for which f or f'
evaluates to 0; random weights between −1 and 1; Nguyen-Widrow
initialization [Nguyen and Widrow, 1990]; Yam-Chow initialization
[Yam and Chow, 2001]
Decide on the learning algorithm — standard back-propagation
(incremental or batch), QuickProp [Fahlman, 1988], RPROP
[Riedmiller and Braun, 1993], second-order methods
[Moller, 1993, Hagan and Menhaj, 1994], etc.
C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 46 / 62
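A minimal sketch of the momentum idea mentioned above: the new weight change adds a fraction μ of the previous weight change to the current gradient step. The toy error surface, learning rate and momentum value are illustrative assumptions.

import numpy as np

def momentum_step(w, grad, prev_delta, eta=0.1, mu=0.9):
    """delta_w(t) = -eta * grad + mu * delta_w(t-1)."""
    delta = -eta * grad + mu * prev_delta
    return w + delta, delta

# Illustrative use on a toy quadratic error with minimum at [1.0, -2.0, 0.5]
w = np.zeros(3)
prev_delta = np.zeros(3)
for _ in range(50):
    grad = 2 * (w - np.array([1.0, -2.0, 0.5]))   # gradient of the toy error
    w, prev_delta = momentum_step(w, grad, prev_delta)
print(w)   # approaches the toy minimum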
Performance measures for binary classification
Let T be the size of the test set, with POS positive instances and NEG
negative instances: T = POS + NEG.
Let TP be the number of positive instances that are correctly classified
and TN the number of correctly classified negative instances. False
positives and false negatives are then given by FP = NEG − TN and
FN = POS − TP.
Accuracy can be obtained as

  Acc = \frac{TP + TN}{T}

Precision (P) and Recall (R) (also known as True Positive Rate or
Sensitivity) are two important measures for binary classification:

  P = \frac{TP}{TP + FP} \qquad R = \frac{TP}{TP + FN}

The F-measure is the harmonic mean of P and R:

  F_1 = \frac{2 \times P \times R}{P + R}
C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 47 / 62

Confusion Matrix
(Also known as Contingency Matrix)

              Predicted (+)   Predicted (–)   Total
Actual (+)         75              15           90
Actual (–)         10              60           70
Total              85              75          160

Table: Confusion Matrix

T = 160; POS = 90; NEG = 70
TP = 75; FP = 10; FN = 15; TN = 60
Accuracy = (75 + 60)/160 = 0.84375
Recall = 75/90 = 0.83333; Precision = 75/85 = 0.88235
F1 = (2 × 0.88235 × 0.83333)/(0.88235 + 0.83333) = 0.85714
True Negative Rate (or Specificity, or negative recall) = 60/70 = 0.85714
Similarly, the False Positive Rate (1 − TNR) and the False Negative Rate
(1 − TPR) can also be computed
C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 48 / 62
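A small sketch that recomputes the measures above directly from the confusion-matrix counts in the table (TP = 75, FP = 10, FN = 15, TN = 60).

def binary_metrics(tp, fp, fn, tn):
    """Standard binary classification measures from the confusion matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # true positive rate / sensitivity
    specificity = tn / (tn + fp)     # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f1

# Counts from the confusion matrix above
print(binary_metrics(tp=75, fp=10, fn=15, tn=60))
# approximately (0.84375, 0.88235, 0.83333, 0.85714, 0.85714)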
ROC Plots
(Also known as normalized coverage plots)

Figure: ROC Plot

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 49 / 62



Cross-Validation

How do we assess the training error in building a model from data?

This error analysis is important to have an estimate of the prediction
error of the model.
When a model is built using all of the training data, error analysis
may be misleading.
Hence, k-fold cross-validation is generally advocated:
  The training data is divided into k sets in such a way that each set
  maintains the key ratios of the training set (stratification)
  k different experiments are conducted, where in each experiment one
  of the folds is used for testing and the remaining k − 1 folds are
  used for training
  Thus, each instance is used for testing once, based on which the error
  can be calculated

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 50 / 62
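A minimal sketch of k-fold cross-validation (without stratification, for brevity), assuming hypothetical train(X, y) and evaluate(model, X, y) helpers; it returns the mean score over the k test folds.

import numpy as np

def k_fold_cross_validate(X, y, train, evaluate, k=5, seed=0):
    """Each of the k folds is used exactly once for testing."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train(X[train_idx], y[train_idx])                 # hypothetical training helper
        scores.append(evaluate(model, X[test_idx], y[test_idx]))  # hypothetical scoring helper
    return float(np.mean(scores))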


Suggested process to apply NN

Precisely define the machine learning task required (classification,
multi-class, regression, etc.)
Ensure that an NN is suitable for that task
Decide on the features to be used, and decide on how to extract
those features from instances
Collect a large set of random instances
Extract features and scale them accordingly
Divide the data into two disjoint sets: a training set and a test set
Use the training set to obtain a model and perform k-fold cross
validation — decide on the optimal algorithm parameters
Analyze the performance of the model on the test set (accuracy,
precision, recall, F-measure, etc.)
Repeat for different random partitions of the instances into training
and test sets
C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 51 / 62
Outline

1 Introduction

2 Understanding Linear Discrimination

3 Perceptron Weight Update Rule

4 LMS Weight Update Rule

5 Back-Propagation Algorithm

6 Building and Validating NN Models

7 Summary

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 52 / 62


Summary

Machine learning is about using the right features to build the right
models that achieve the right tasks [Flach, 2012]
In this talk, we have focused on finding a linear discriminant model (a
hyperplane in the feature space) for the binary classification problem
The model has to be constructed from examples (inductive learning)
that are properly labeled (supervised learning). Further, the model
has to be used for predicting the class of a new instance (predictive
analytics)
A hyperplane in the feature space that properly separates positive and
negative examples can be constructed by the Perceptron or LMS learning
algorithms
It has been shown that these algorithms converge for linearly
separable problems

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 53 / 62


Summary

These algorithms can only find linear models and are not suitable for
problems that are not linearly separable
From a neural networks perspective, we need hidden layers to handle such
problems
Back-propagation of errors is the basic mechanism used to arrive at
algorithms for learning the weights of such neural networks
However, there are several issues with the back-propagation algorithm —
it is not guaranteed to converge, it may get trapped in local minima, etc.
We have highlighted a few points to overcome these limitations and
suggested a process for applying ANNs to solve a problem

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 54 / 62


References I

Abu-Mostafa, Y. S., Magdon-Ismail, M., and Lin, H.-T. (2012).


Learning from Data — A Short Course.
AMLbook.
Alpaydin, E. (2010).
Introduction to Machine Learning.
The MIT Press, Second edition.
Anderson, J. A. (1995).
An introduction to neural networks.
The MIT Press.
Blum, A. and Rivest, R. L. (1989).
Training a 3-node neural net is NP-complete.
In Advances in neural information processing systems, volume 1, pages
494–501. Morgan-Kaufmann.

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 55 / 62


References II

Fahlman, S. E. (1988).
An empirical study of learning speed in back-propagation networks.
Technical Report CMU-CS-88-162, Carnegie Mellon University.
Flach, P. (2012).
Machine Learning: The art and science of algorithms that make sense
of data.
Cambridge University Press.
Hagan, M. T. and Menhaj, M. B. (1994).
Training feedforward networks with the Marquardt algorithm.
IEEE Transactions on Neural Networks, 5:989–993.
Hassoun, M. H. (1995).
Fundamentals of Artificial Neural Networks.
The MIT Press.

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 56 / 62


References III

Hertz, J., Krogh, A., and Palmer, R. G. (1991).


Introduction to the theory of neural computing.
Addison-Wesley Publishing Company.
Hornik, K., Stinchcombe, M., and White, H. (1989).
Multilayer feedforward networks are universal approximators.
Neural Networks, 2(5):359–366.
Jacobs, R. A. (1988).
Increased rates of convergence through learning rate adaptation.
Neural Networks, 1(4):295–307.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998).
Gradient-based learning applied to document recognition.
In Proceedings of the IEEE, volume 86, pages 2278–2324.

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 57 / 62


References IV

LeCun, Y., Denker, J., Solla, S., Howard, R. E., and Jackel, L. D.
(1990).
Optimal brain damage.
In Advances in Neural Information Processing Systems, volume II.
Mitchell, T. M. (1997).
Machine Learning.
McGraw-Hill.
Moller, M. F. (1993).
A scaled conjugate gradient algorithm for fast supervised learning.
Neural Networks, 6:525–533.

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 58 / 62


References V

Nguyen, D. and Widrow, B. (1990).


Improving the learning speed of 2-layer neural networks by choosing
initial values of the adaptive weights.
In Proceedings of the International Joint Conference on Neural
Networks, volume 3, pages 21–26.
Riedmiller, M. and Braun, H. (1993).
A direct adaptive method for faster backpropagation learning: The
RPROP algorithm.
In Proceedings of the IEEE International Conference on Neural
Networks, pages 586–591. IEEE Press.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986).
Learning representations by back-propagating errors.
Nature, 323:533–536.
doi:10.1038/323533a0.

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 59 / 62


References VI

Russel, S. and Norvig, P. (2009).


Artificial Intelligence — A Modern Approach.
Prentice Hall, Third edition.
Sarkar, D. (1995).
Methods to speed up error back-propagation learning algorithm.
ACM Computing Surveys, 27(4):519–542.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and
Salakhutdinov, R. (2014).
Dropout: A simple way to prevent neural networks from overfitting.
Journal of Machine Learning Research, 15:1929–1958.
Theodoridis, S. and Koutroumbas, K. (2009).
Pattern Recognition.
Academic Press, Fourth edition.

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 60 / 62


References VII

Tollenaere, T. (1990).
SuperSAB: Fast adaptive backpropagation with good scaling
properties.
Neural Networks, 3:561–573.
Vogl, T. P., Mangis, J. K., Rigler, A. K., Zink, W. T., and Alkon,
D. L. (1988).
Accelerating the convergence of the backpropagation method.
Biological Cybernetics, 58:257–263.
Yam, J. Y. F. and Chow, T. W. S. (2001).
Feed forward networks training speed enhancement by optimal
initialization of the synaptic coefficients.
IEEE Transactions on Neural Networks, 12(2):430–434.

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 61 / 62


QUESTIONS?

C. Aravindan (SSN Institutions) ML — Classification Algorithms September 30, 2016 62 / 62
