Neural Networks
1943: McCulloch and Pitts proposed a model of a neuron that led to the
Perceptron.
1960s: Widrow and Hoff extended the perceptron networks to
Adalines and introduced the LMS algorithm.
1962: Rosenblatt proved the convergence of the perceptron
training rule.
1969: Minsky and Papert showed that the perceptron cannot
deal with nonlinearly separable data sets, even ones that
represent simple functions such as XOR.
1986: Rumelhart and McClelland popularized the Backpropagation
algorithm; Parker and Werbos had developed it earlier.

Feed-forward nets

Data is presented to the input layer
Passed on to the hidden layer
Passed on to the output layer
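As an illustration of this flow (a minimal sketch, not from the slides; the layer sizes, random weights, and log-sigmoid transfer function are assumed):

```python
import numpy as np

def logsig(n):
    """Log-sigmoid transfer function."""
    return 1.0 / (1.0 + np.exp(-n))

x = np.array([0.2, 0.7, 0.1])             # activations presented to the input layer
W_hidden = np.random.randn(4, 3) * 0.5    # weights: input -> hidden (4 hidden units)
W_output = np.random.randn(2, 4) * 0.5    # weights: hidden -> output (2 output units)

hidden = logsig(W_hidden @ x)             # passed on to the hidden layer
output = logsig(W_output @ hidden)        # passed on to the output layer
print(output)
```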




Application: NETtalk
One of the first applications of multilayer neural networks, developed by
Sejnowski and Rosenberg in the 1980s
Train a neural network to pronounce written English text
NETtalk learned to read at the level of a 4-year-old human in 16 hours
120 hidden units: 98% correct pronunciation
Feeding data through the net:






    n = (1)(0.25) + (0.5)(-1.5) = 0.25 + (-0.75) = -0.5

Output:

    a = f(n) = \frac{1}{1 + e^{-n}} = \frac{1}{1 + e^{0.5}} = 0.3775
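The same computation in Python (a small sketch; the pairing of weights and inputs is assumed from the worked example):

```python
import numpy as np

w = np.array([1.0, 0.5])       # weights from the worked example
p = np.array([0.25, -1.5])     # inputs from the worked example

n = w @ p                      # n = (1)(0.25) + (0.5)(-1.5) = -0.5
a = 1.0 / (1.0 + np.exp(-n))   # log-sigmoid: a = 1/(1 + e^{0.5}) ≈ 0.3775
print(n, a)
```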
Data is presented to the network in the form of activations in the
input layer, for example, pixel intensities for images
How much data is enough?
How much data is needed depends on the complexity of the problem and
the amount of noise in the data.
One simple way to test whether you have enough data is to perform your training
and testing using a subset of the available data. If performance does not
improve when you use the full data set, that is an indication that you
have enough data.
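One way to operationalize this check (a sketch using scikit-learn, which the slides do not mention; X and y are placeholders for your data set) is to train on growing subsets and watch whether held-out performance keeps improving:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# X, y: the full data set (random placeholders here)
X, y = np.random.randn(1000, 10), np.random.randint(0, 2, 1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

for frac in (0.25, 0.5, 1.0):
    n = int(frac * len(X_train))
    model = MLPClassifier(hidden_layer_sizes=(20,), max_iter=500, random_state=0)
    model.fit(X_train[:n], y_train[:n])
    print(frac, model.score(X_test, y_test))
# If the test score stops improving as frac approaches 1.0, you likely have enough data.
```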
Data usually requires pre-processing
Missing data values (illustrated in the sketch after this list):
- throw out incomplete examples
- manually examine the example data and enter a reasonable or
expected value for the missing entry
- generate the missing value automatically - average, interpolation,
prediction
- fill in the missing value with an abnormal (sentinel) value indicating that the
value is missing
- encode the missing value explicitly in the problem
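In practice these options might look as follows (a sketch using pandas, which the slides do not mention; the column names and the sentinel value are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40], "income": [50000, 62000, np.nan]})

dropped = df.dropna()                                 # throw out incomplete examples
mean_filled = df.fillna(df.mean())                    # generate the value automatically (average)
interpolated = df.interpolate()                       # generate the value by interpolation
sentinel = df.fillna(-999)                            # fill in with an abnormal sentinel value
explicit = df.assign(age_missing=df["age"].isna())    # encode "missing" explicitly as an extra input
```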


Data Preparation - Preparing Data
Transforming Data into Numerical Values
If there is a logical sequential ordering of the values of a categorical variable
(e.g. low, medium, high), they can usually be converted to a numerical
representation in a straightforward manner.
In many cases, categorical variables such as ZIP codes have no obvious and
meaningful ordering of the values:
If there is a small number n of distinct categories, you can encode the
input variable as n different binary inputs.
If the number of distinct values is too large, then different values must be
grouped together somehow to reduce the number of inputs. Possible
approaches include clustering, principal component analysis, etc.
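For a small number of distinct categories, the n-binary-inputs encoding can be produced directly (a sketch using pandas get_dummies; the variable name and values are made up):

```python
import pandas as pd

df = pd.DataFrame({"region": ["north", "south", "east", "north"]})

# A small number n of distinct categories -> n binary inputs (one-hot encoding)
one_hot = pd.get_dummies(df["region"], prefix="region")
print(one_hot)

# If there were thousands of distinct values (e.g. ZIP codes), the values would first
# have to be grouped (clustering, PCA on associated statistics, etc.) before encoding.
```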
Data Preparation - Preparing Data
Transforming Data into Numerical Values - Example
A restaurant tries to predict how customers will rate their omelettes
Inputs:
The omelette size (small, medium, and large)
Optional ingredients (onion, peppers, ham, and anchovies)
The person's gender (male, female)
The time of day (based on a 24-hour clock)
Omelette size
Small = 2
Medium = 3
Large = 4

Data Preparation - Preparing Data
Transforming Data into Numerical Values - Example
Optional ingredients: each ingredient can be a binary input
Gender
Male = 0 and Female = 1 ==> sequential
Male = 1 0 and Female = 0 1 ==> nonsequential
Time
translate all times into minutes: time = hours*60 + minutes
problem: 23:59 => 1439 and 00:01 => 1, but the real difference should
only be 2 minutes
reference to a fixed point in time, say Jan 1, 2000
mapping values onto a two-dimensional circle (see the sketch below)
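Putting the omelette inputs together, including the two-dimensional circle encoding of the time of day (a sketch; the function name and feature ordering are only illustrative):

```python
import numpy as np

def omelette_features(size, ingredients, gender, hour, minute):
    size_code = {"small": 2, "medium": 3, "large": 4}[size]              # ordered category
    all_ingredients = ["onion", "peppers", "ham", "anchovies"]
    ingredient_bits = [1 if ing in ingredients else 0 for ing in all_ingredients]
    gender_bits = {"male": [1, 0], "female": [0, 1]}[gender]             # nonsequential encoding
    # Map the time of day onto a circle so that 23:59 and 00:01 end up close together
    minutes = hour * 60 + minute
    angle = 2 * np.pi * minutes / (24 * 60)
    time_xy = [np.sin(angle), np.cos(angle)]
    return [size_code] + ingredient_bits + gender_bits + time_xy

print(omelette_features("large", {"ham", "onion"}, "female", 23, 59))
print(omelette_features("large", {"ham", "onion"}, "female", 0, 1))   # nearly identical time features
```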


The Back Propagation Algorithm
Introduction
The LMS algorithm (based on steepest descent) is given by

    \Delta w_i = -\alpha \frac{\partial \hat{F}}{\partial w_i}

or

    w_i(k+1) = w_i(k) - \alpha \frac{\partial \hat{F}}{\partial w_i}

Q1. What if the transfer function is not the linear function?

[Figure: a single neuron with inputs p_1, p_2, ..., p_M, weights w_1, w_2, ..., w_M, bias b, net input n, transfer function f, and output a]

The performance index is the mean squared error

    F = MSE = E[(t - a)^2] = E[(t - f(n))^2] = E[(t - f(w^T p + b))^2]

and, as in LMS, it is approximated by the squared error on the current example,

    \hat{F}(k) = [t(k) - f(w^T p(k) + b)]^2

Applying the chain rule,

    \frac{\partial \hat{F}}{\partial w_i} = -2\,[t(k) - f(w^T p(k) + b)]\,\frac{\partial f}{\partial n}\,\frac{\partial n}{\partial w_i}

e.g. if f(n) = \sin(n) and n = w^2, then

    \frac{\partial f}{\partial w} = \frac{\partial f}{\partial n}\,\frac{\partial n}{\partial w} = \cos(n) \cdot 2w = 2w\cos(w^2)
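As a quick numerical sanity check of the Q1 gradient (a sketch that is not part of the original slides; the log-sigmoid transfer function and the input, weight, bias, and target values are assumed):

```python
import numpy as np

# Single neuron with a nonlinear (log-sigmoid) transfer function:
#   n = w.p + b,   a = f(n) = 1/(1 + exp(-n)),   F_hat = (t - a)^2
def f(n):
    return 1.0 / (1.0 + np.exp(-n))

def f_prime(n):
    a = f(n)
    return a * (1.0 - a)

p = np.array([0.25, -1.5])   # example input
w = np.array([1.0, 0.5])     # example weights
b, t = 0.0, 1.0              # bias and target (illustrative)

n = w @ p + b
# Analytic gradient from the chain rule: dF_hat/dw_i = -2 (t - f(n)) f'(n) p_i
grad_analytic = -2.0 * (t - f(n)) * f_prime(n) * p

# Finite-difference check of the same gradient
eps = 1e-6
grad_numeric = np.zeros_like(w)
for i in range(len(w)):
    w_hi, w_lo = w.copy(), w.copy()
    w_hi[i] += eps
    w_lo[i] -= eps
    grad_numeric[i] = ((t - f(w_hi @ p + b))**2 - (t - f(w_lo @ p + b))**2) / (2 * eps)

print(grad_analytic, grad_numeric)   # the two should agree closely
```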

Q2: What if the network has multiple layers?

[Figure: two neurons in cascade: input p_1 with weight w_11 gives net input n_1 and, through transfer function f_1, output a_1; a_1 with weight w_21 gives net input n_2 and, through transfer function f_2, output a_2]

The performance index and its single-sample approximation are

    F = MSE = E[(t - a_2)^2] = E[(t - f_2(n_2))^2],    \hat{F}(k) = [t(k) - f_2(n_2(k))]^2

For the second-layer weight,

    \frac{\partial \hat{F}(k)}{\partial w_{21}} = \frac{\partial \hat{F}(k)}{\partial n_2}\,\frac{\partial n_2}{\partial w_{21}}

Since n_2 = w_{21} a_1, we have \partial n_2 / \partial w_{21} = a_1, so

    \frac{\partial \hat{F}(k)}{\partial w_{21}} = -2\,[t(k) - f_2(n_2)]\,\frac{\partial f_2}{\partial n_2}\,a_1

For the first-layer weight,

    \frac{\partial \hat{F}(k)}{\partial w_{11}} = \frac{\partial \hat{F}(k)}{\partial n_1}\,\frac{\partial n_1}{\partial w_{11}}

Since n_1 = w_{11} p_1, we have \partial n_1 / \partial w_{11} = p_1, and the chain rule through n_2 gives

    \frac{\partial \hat{F}(k)}{\partial w_{11}} = \frac{\partial \hat{F}(k)}{\partial n_2}\,\frac{\partial n_2}{\partial n_1}\,\frac{\partial n_1}{\partial w_{11}} = \frac{\partial \hat{F}(k)}{\partial n_2}\,\frac{\partial n_2}{\partial n_1}\,p_1

Since n_2 = w_{21} a_1 = w_{21} f_1(n_1), we have \partial n_2 / \partial n_1 = w_{21} f_1'(n_1), so

    \frac{\partial \hat{F}(k)}{\partial w_{11}} = \frac{\partial \hat{F}(k)}{\partial n_2}\,w_{21}\,f_1'(n_1)\,p_1

where \partial \hat{F}(k) / \partial n_2 is back propagated from the upper layer.
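A small sketch of this two-layer chain rule in code (not from the slides; the tanh transfer functions, input, target, and weights are assumed):

```python
import numpy as np

# Two neurons in cascade (no biases, as in the derivation above):
#   n1 = w11*p1, a1 = f1(n1), n2 = w21*a1, a2 = f2(n2), F_hat = (t - a2)^2
f1, f1_prime = np.tanh, lambda n: 1.0 - np.tanh(n) ** 2   # assumed transfer functions
f2, f2_prime = np.tanh, lambda n: 1.0 - np.tanh(n) ** 2

p1, t = 0.8, 0.5          # example input and target (illustrative values)
w11, w21 = 0.3, -1.2      # example weights

# Forward pass
n1 = w11 * p1; a1 = f1(n1)
n2 = w21 * a1; a2 = f2(n2)

# Backward pass: dF/dn2 is computed at the output and then
# back propagated to the first layer through dn2/dn1 = w21 * f1'(n1)
dF_dn2 = -2.0 * (t - a2) * f2_prime(n2)
dF_dw21 = dF_dn2 * a1                        # dF/dw21 = dF/dn2 * dn2/dw21
dF_dw11 = dF_dn2 * w21 * f1_prime(n1) * p1   # dF/dw11 = dF/dn2 * w21 * f1'(n1) * p1

print(dF_dw21, dF_dw11)
```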
For a single neuron (or a single layer of neurons), steepest descent gives the familiar LMS update:

    w(k+1) = w(k) + 2\alpha\,e(k)\,p(k) = w(k) + 2\alpha\,[t(k) - w^T(k)\,p(k)]\,p(k)

[Figure: a single neuron with inputs p_1, p_2, ..., p_M, weights w_1, w_2, ..., w_M, bias b, net input n, transfer function f, and output a]

For a general multilayer neural network (here three layers, with s^1, s^2, and s^3 neurons):

[Figure: a three-layer network; layer m has weight matrix W^m, bias vector b^m, net input n^m, transfer function f^m, and output a^m]

    a^1 = f^1(W^1 p + b^1),    a^2 = f^2(W^2 a^1 + b^2),    a^3 = f^3(W^3 a^2 + b^3)
Performance Index
Given a set of Q training examples/patterns

    \{(p_1, t_1), (p_2, t_2), \ldots, (p_Q, t_Q)\}

The MSE is

    F = E[(t - a)^T (t - a)] = \frac{1}{Q}\sum_{k=1}^{Q} (t(k) - a(k))^T (t(k) - a(k))

As with the LMS algorithm, the approximate MSE is the error on the current example,

    \hat{F} = (t(k) - a(k))^T (t(k) - a(k)) = e^T(k)\,e(k)

Learning Algorithm
The weights are adjusted by approximate steepest descent,

    \Delta w^m_{ij} = -\alpha\,\frac{\partial \hat{F}}{\partial w^m_{ij}}

To find the learning rule for all the weights, we need to find all the derivatives \partial \hat{F} / \partial w^m_{ij}.

Consider

    \hat{F} = e^T(k)\,e(k) = (t(k) - a(k))^T (t(k) - a(k)) = [t(k) - f(n(k))]^T [t(k) - f(n(k))]

By the chain rule,

    \frac{\partial \hat{F}}{\partial w^m_{ij}} = \frac{\partial \hat{F}}{\partial n^m_i}\,\frac{\partial n^m_i}{\partial w^m_{ij}}

Since

    n^m_i = \sum_{j=1}^{s^{m-1}} w^m_{ij}\,a^{m-1}_j + b^m_i

we have

    \frac{\partial n^m_i}{\partial w^m_{ij}} = a^{m-1}_j,\qquad \frac{\partial n^m_i}{\partial b^m_i} = 1

To compute \partial \hat{F} / \partial n^m_i, or in general the vector

    \frac{\partial \hat{F}}{\partial n^m} = \left[\frac{\partial \hat{F}}{\partial n^m_1}\ \ \frac{\partial \hat{F}}{\partial n^m_2}\ \ \cdots\ \ \frac{\partial \hat{F}}{\partial n^m_{s^m}}\right]^T

we need to obtain the derivative at layer m from layer m+1. It is this process that
gives us the term backpropagation.






Using the chain rule in matrix form,

    \frac{\partial \hat{F}}{\partial n^m} = \left(\frac{\partial n^{m+1}}{\partial n^m}\right)^T \frac{\partial \hat{F}}{\partial n^{m+1}} = \dot{F}^m(n^m)\,(W^{m+1})^T\,\frac{\partial \hat{F}}{\partial n^{m+1}}

where the Jacobian matrix is

    \frac{\partial n^{m+1}}{\partial n^m} =
    \begin{bmatrix}
    \frac{\partial n^{m+1}_1}{\partial n^m_1} & \cdots & \frac{\partial n^{m+1}_1}{\partial n^m_{s^m}} \\
    \vdots & & \vdots \\
    \frac{\partial n^{m+1}_{s^{m+1}}}{\partial n^m_1} & \cdots & \frac{\partial n^{m+1}_{s^{m+1}}}{\partial n^m_{s^m}}
    \end{bmatrix}

with elements

    \frac{\partial n^{m+1}_i}{\partial n^m_j}
    = \frac{\partial}{\partial n^m_j}\left(\sum_{l=1}^{s^m} w^{m+1}_{il}\,a^m_l + b^{m+1}_i\right)
    = w^{m+1}_{ij}\,\frac{\partial a^m_j}{\partial n^m_j}
    = w^{m+1}_{ij}\,f^{m\prime}(n^m_j)

so that

    \frac{\partial n^{m+1}}{\partial n^m} = W^{m+1}\,\dot{F}^m(n^m)

where

    \dot{F}^m(n^m) =
    \begin{bmatrix}
    f^{m\prime}(n^m_1) & 0 & \cdots & 0 \\
    0 & f^{m\prime}(n^m_2) & \cdots & 0 \\
    \vdots & \vdots & & \vdots \\
    0 & 0 & \cdots & f^{m\prime}(n^m_{s^m})
    \end{bmatrix}

Recall

    \frac{\partial \hat{F}}{\partial w^m_{ij}} = \frac{\partial \hat{F}}{\partial n^m_i}\,\frac{\partial n^m_i}{\partial w^m_{ij}}

We still need one more step to complete the backpropagation algorithm. That is, the
starting point - the output layer:

    \frac{\partial \hat{F}}{\partial n^M_i}
    = \frac{\partial}{\partial n^M_i}\sum_{j=1}^{s^M}(t_j - a_j)^2
    = -2\,(t_i - a_i)\,\frac{\partial a_i}{\partial n^M_i}
    = -2\,(t_i - a_i)\,f^{M\prime}(n^M_i)

or, in matrix form,

    \frac{\partial \hat{F}}{\partial n^M} = -2\,\dot{F}^M(n^M)\,(t - a)
Summary of the backpropagation algorithm
1. Propagate the input forward through the network:

    a^0 = p
    a^{m+1} = f^{m+1}(W^{m+1} a^m + b^{m+1}),\qquad m = 0, 1, \ldots, M-1
    a = output = a^M

2. Propagate the derivative backward through the network:

    \frac{\partial \hat{F}}{\partial n^M} = -2\,\dot{F}^M(n^M)\,(t - a)
    \frac{\partial \hat{F}}{\partial n^m} = \dot{F}^m(n^m)\,(W^{m+1})^T\,\frac{\partial \hat{F}}{\partial n^{m+1}},\qquad m = M-1, \ldots, 2, 1

3. The weights and biases are updated using the approximated steepest descent
algorithm:

    W^m(k+1) = W^m(k) - \alpha\,\frac{\partial \hat{F}}{\partial n^m}\,(a^{m-1})^T
    b^m(k+1) = b^m(k) - \alpha\,\frac{\partial \hat{F}}{\partial n^m}
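As an illustration of these three steps (a minimal sketch, not taken from the slides; the 1-3-1 architecture, transfer functions, learning rate, and training data are all assumed):

```python
import numpy as np

def logsig(n):  return 1.0 / (1.0 + np.exp(-n))
def dlogsig(n): a = logsig(n); return a * (1.0 - a)
def linear(n):  return n
def dlinear(n): return np.ones_like(n)

# Small 1-3-1 network (sizes chosen only for illustration)
rng = np.random.default_rng(0)
W = [rng.normal(scale=0.5, size=(3, 1)), rng.normal(scale=0.5, size=(1, 3))]
b = [np.zeros((3, 1)), np.zeros((1, 1))]
f, df = [logsig, linear], [dlogsig, dlinear]
alpha, M = 0.1, 2   # learning rate and number of layers

def train_step(p, t):
    # 1. Propagate the input forward: a^{m+1} = f^{m+1}(W^{m+1} a^m + b^{m+1})
    a, n = [p], []
    for m in range(M):
        n.append(W[m] @ a[m] + b[m])
        a.append(f[m](n[m]))
    # 2. Propagate the derivative backward:
    #    dF/dn^M = -2 F'^M(n^M)(t - a),  dF/dn^m = F'^m(n^m) (W^{m+1})^T dF/dn^{m+1}
    s = [None] * M
    s[M - 1] = -2.0 * df[M - 1](n[M - 1]) * (t - a[M])
    for m in range(M - 2, -1, -1):
        s[m] = df[m](n[m]) * (W[m + 1].T @ s[m + 1])
    # 3. Approximate steepest-descent update of the weights and biases
    for m in range(M):
        W[m] -= alpha * s[m] @ a[m].T
        b[m] -= alpha * s[m]

# Train to approximate a simple target function (sin) on a few sample points
for epoch in range(2000):
    for x in np.linspace(-1.0, 1.0, 11):
        train_step(np.array([[x]]), np.array([[np.sin(np.pi * x)]]))
```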
