
A Simple Problem (Linear Regression)

[Scatter plot: output y versus attribute x1, with a fitted line]

We have training data X = {x_1^k}, k = 1,…,N, with corresponding output Y = {y^k}, k = 1,…,N.
We want to find the parameters that predict the output Y from the data X in a linear fashion:

    y ≈ w_0 + w_1 x_1

A Simple Problem (Linear Regression)

Notations:
- Superscript: index of the data point in the training data set; k = kth training data point
- Subscript: coordinate of the data point; x_1^k = coordinate 1 of data point k

[Scatter plot: y versus x_1]

We have training data X = {x_1^k}, k = 1,…,N, with corresponding output Y = {y^k}, k = 1,…,N.
We want to find the parameters that predict the output Y from the data X in a linear fashion:

    y^k ≈ w_0 + w_1 x_1^k

A Simple Problem (Linear Regression)

It is convenient to define an additional "fake" attribute for the input data: x_0 = 1.
We want to find the parameters that predict the output Y from the data X in a linear fashion:

    y^k ≈ w_0 x_0^k + w_1 x_1^k

More Convenient Notations

Vector of attributes for each training data point: x^k = [x_0^k, …, x_M^k]
We seek a vector of parameters: w = [w_0, …, w_M]
Such that we have a linear relation between prediction Y and attributes X:

    y^k ≈ w_0 x_0^k + w_1 x_1^k + … + w_M x_M^k = Σ_{i=0}^{M} w_i x_i^k = w · x^k

More Convenient Notations

By definition, the dot product between vectors w and x^k is:

    w · x^k = Σ_{i=0}^{M} w_i x_i^k

Vector of attributes for each training data point: x^k = [x_0^k, …, x_M^k]
We seek a vector of parameters: w = [w_0, …, w_M]
Such that we have a linear relation between prediction Y and attributes X:

    y^k ≈ w_0 x_0^k + w_1 x_1^k + … + w_M x_M^k = Σ_{i=0}^{M} w_i x_i^k = w · x^k

Neural Network: Linear Perceptron

[Network diagram: input units x_0, …, x_i, …, x_M, each connected by a connection with weight w_0, …, w_i, …, w_M to a single output unit]

Output prediction:

    Σ_{i=0}^{M} w_i x_i = w · x

Note: the input unit x_0 corresponds to the fake attribute x_0 = 1; it is called the bias.

Neural network learning problem: adjust the connection weights so that the network generates the correct prediction on the training data.

Linear Regression: Gradient Descent

We seek a vector of parameters w = [w_0, …, w_M] that minimizes the error between the prediction Y and the data X:

    E = Σ_{k=1}^{N} (y^k − (w_0 x_0^k + w_1 x_1^k + … + w_M x_M^k))²
      = Σ_{k=1}^{N} (y^k − w · x^k)²
      = Σ_{k=1}^{N} (ε^k)²,   where ε^k = y^k − w · x^k

Linear Regression: Gradient Descent

ε^k is the error between the input x^k and the prediction y^k at data point k. Graphically, it is the vertical distance between data point k and the prediction calculated by using the vector of linear parameters w.

[Plot: y versus x_1, showing the vertical distance ε^k between a data point and the fitted line]

    E = Σ_{k=1}^{N} (y^k − w · x^k)² = Σ_{k=1}^{N} (ε^k)²

Gradient Descent

The minimum of E is reached when the derivative with respect to each of the parameters w_i is zero:

    ∂E/∂w_i = −2 Σ_{k=1}^{N} (y^k − (w_0 x_0^k + w_1 x_1^k + … + w_M x_M^k)) x_i^k
            = −2 Σ_{k=1}^{N} (y^k − w · x^k) x_i^k
            = −2 Σ_{k=1}^{N} ε^k x_i^k

Gradient Descent

The minimum of E is reached when the derivative with respect to each of the parameters w_i is zero:

    ∂E/∂w_i = −2 Σ_{k=1}^{N} ε^k x_i^k

Note that the contribution of training data element number k to the overall gradient is −ε^k x_i^k (up to the constant factor 2).
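A quick way to convince yourself of this derivative is to compare it with a finite-difference approximation. The sketch below does that in Python on synthetic linear data; the data, seed, and parameter values are choices of this example, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, 50)])  # [x0 = 1, x1]
y = 0.3 + 0.7 * X[:, 1]            # synthetic linear data (a choice of this sketch)
w = np.array([0.1, -0.2])          # arbitrary parameter vector

def E(w):
    """Sum-of-squares error E = sum_k (y^k - w.x^k)^2."""
    return np.sum((y - X @ w) ** 2)

# Analytical gradient from the slide: dE/dw_i = -2 * sum_k eps^k * x_i^k
eps = y - X @ w
grad_analytic = -2 * X.T @ eps

# Central finite-difference approximation of the same derivatives
h = 1e-6
grad_numeric = np.array([(E(w + h * np.eye(2)[i]) - E(w - h * np.eye(2)[i])) / (2 * h)
                         for i in range(2)])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))  # prints True
```

Because E is quadratic in w, the central difference is exact up to floating-point rounding, so the two gradients agree to high precision.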

Gradient Descent Update Rule

[Plot of E versus w_i: where ∂E/∂w_i is negative we need to increase w_i; where ∂E/∂w_i is positive we need to decrease w_i]

Update rule: move in the direction opposite to the gradient direction:

    w_i ← w_i − α ∂E/∂w_i

Perceptron Training

Given input training data x^k with corresponding value y^k:

1. Compute error:

    ε^k ← y^k − w · x^k

2. Update NN weights:

    w_i ← w_i + α ε^k x_i^k

Linear Perceptron Training

α is the learning rate:
- α too small: may converge slowly and may need a lot of training examples
- α too large: may change w too quickly and spend a long time oscillating around the minimum

Given input training data x^k with corresponding value y^k:

1. Compute error:

    ε^k ← y^k − w · x^k

2. Update NN weights:

    w_i ← w_i + α ε^k x_i^k
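The two-step loop above can be sketched in a few lines of Python. This is a minimal sketch, not code from the slides: the learning rate, seed, step count, and the synthetic data drawn from the true function y = 0.3 + 0.7·x1 (the same one used in the convergence example that follows) are all choices of this example:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.1                     # learning rate (a choice of this sketch)
w = np.zeros(2)                 # parameters [w0, w1]

for k in range(500):            # one update per training point, as on the slide
    x = np.array([1.0, rng.uniform(-1, 1)])  # [x0 = 1 (fake attribute), x1]
    y = 0.3 + 0.7 * x[1]        # training value from the true function
    eps = y - w @ x             # 1. compute error: eps^k = y^k - w.x^k
    w = w + alpha * eps * x     # 2. update weights: w_i += alpha * eps^k * x_i^k

print(w)                        # approaches the true parameters [0.3, 0.7]
```

Because the data is noiseless and linear, each update shrinks the parameter error, and w converges to the true [0.3, 0.7].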

[Plots: the estimates of w_1 and w_0 versus iterations, converging toward the true value w_1 = 0.7]

True function: y = 0.3 + 0.7 x_1, i.e., w = [0.3 0.7]

[Plots of the fitted line after 2 iterations (2 training points), after 6 iterations (6 training points), and after 20 iterations (20 training points): the fit improves as more training points are used]

Perceptrons: Remarks

- The update has many names: delta rule, gradient rule, LMS rule…
- The update is guaranteed to converge to the best linear fit (global minimum of E).
- Of course, there are more direct ways of solving the linear regression problem by using linear algebra techniques. It boils down to a simple matrix inversion (not shown here).
- In fact, the perceptron training algorithm can be much, much slower than the direct solution.
- So why do we bother with this? The answer is in the next few slides… be patient!
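The "simple matrix inversion" mentioned above is the normal-equations solution w = (XᵀX)⁻¹Xᵀy. A sketch on made-up data (the data and seed are choices of this example; in practice one solves the linear system rather than inverting explicitly):

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.uniform(-1, 1, 100)
X = np.column_stack([np.ones_like(x1), x1])  # columns [x0 = 1, x1]
y = 0.3 + 0.7 * x1                           # same true function as the example

# Normal equations: w = (X^T X)^{-1} X^T y minimizes E = ||y - Xw||^2 directly
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)                                     # [0.3 0.7] up to rounding, in one step
```

One linear solve replaces the whole iterative training loop, which is why the direct solution is so much faster for plain linear regression.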

A Simple Classification Problem

Training data:

[One-dimensional plot of the training data along x_1: red dots and green dots]

- Suppose that we have one attribute x_1.
- Suppose that the data is in two classes (red dots and green dots).
- Given an input value x_1, we wish to predict the most likely class (note: same problem as the one we solved with decision trees and nearest-neighbors).

We could convert it to a problem similar to the previous one by defining an output value y:

    y = 0 if in red class
        1 if in green class

The problem now is to learn a mapping between the attribute x_1 of the training examples and their corresponding class output y.

A Simple Classification Problem

What we would like: a piece-wise constant prediction function, y = 0 on one side of a threshold on x_1 and y = 1 on the other. This is not continuous and does not have derivatives.

What we get from the current linear perceptron model, y = w·x with w = [w_0 w_1] and x = [1 x_1]: a continuous linear prediction.

Possible solution: transform the linear predictions by some function σ which would transform w·x into a continuous approximation of a threshold:

    y = σ(w · x)

This curve is a continuous approximation (a soft threshold) of the hard threshold. Note that we can take derivatives of that prediction function.

The Sigmoid Function

    σ(t) = 1 / (1 + e^(−t))

Note: it is not important to remember the exact expression of σ (in fact, alternate definitions are used for σ). What is important to remember is that:
- It is smooth and has a derivative (the exact expression is unimportant).
- It approximates a hard threshold function at t = 0.
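The definition above, together with its convenient derivative σ′(t) = σ(t)(1 − σ(t)) (a standard identity, used later in the training update), can be sketched as:

```python
import numpy as np

def sigmoid(t):
    """sigma(t) = 1 / (1 + exp(-t)): a soft threshold around t = 0."""
    return 1.0 / (1.0 + np.exp(-t))

def sigmoid_prime(t):
    """Derivative of the sigmoid, via the identity sigma' = sigma * (1 - sigma)."""
    s = sigmoid(t)
    return s * (1.0 - s)

print(sigmoid(0.0))     # 0.5 exactly at the threshold
print(sigmoid(10.0))    # close to 1 far on the positive side
print(sigmoid(-10.0))   # close to 0 far on the negative side
```

The identity for σ′ is what makes the gradient computations later so cheap: the derivative can be read off the already-computed output.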

Generalization to M Attributes

The data can be separated by a linear combination of the attributes:
- Threshold in 1-d
- Line in 2-d
- Plane in 3-d
- Hyperplane in M-d

A linear separation is parameterized like a line:

    Σ_{i=0}^{M} w_i x_i = w · x = 0

Generalization to M Attributes

[Plot: the separating line w · x = 0 in the attribute plane]
- y = 0 in the region on one side of the line; there we can approximate y by σ(w·x) ≈ 0.
- y = 1 in the region on the other side; there we can approximate y by σ(w·x) ≈ 1.

Classification

[Network diagram: input units x_0, …, x_i, …, x_M with connection weights w_0, …, w_i, …, w_M into a single sigmoid output unit]

Output prediction:

    σ( Σ_{i=0}^{M} w_i x_i ) = σ(w · x)

Interpreting the Squashing Function

- Data very close to the threshold (small margin): the σ value is very close to 1/2; we are not sure, 50-50 chance that the class is 0 or 1.
- Data very far from the threshold (large margin): the σ value is very close to 0 or 1; we are very confident that the class is 0 or 1, respectively.

σ(w·x) measures how confident we are in the classification: Prob(y = 1 | x).

Training

Given input training data x^k with corresponding value y^k:

1. Compute error:

    ε^k ← y^k − σ(w · x^k)

2. Update NN weights:

    w_i ← w_i + α ε^k x_i^k σ′(w · x^k)

Training

Note: it is exactly the same as before, except for the additional complication of passing the linear output through σ.

Given input training data x^k with corresponding value y^k:

1. Compute error:

    ε^k ← y^k − σ(w · x^k)

2. Update NN weights:

    w_i ← w_i + α ε^k x_i^k σ′(w · x^k)

This formula is derived by direct application of the chain rule from calculus.
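The two steps, with the extra σ′ factor from the chain rule, can be sketched on a one-attribute problem. This is a sketch only: the class threshold at x1 = 0, the learning rate, seed, and step count are choices of this example, not from the slides:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(3)
alpha = 0.5                              # learning rate (a choice of this sketch)
w = np.zeros(2)                          # parameters [w0, w1]

for k in range(2000):
    x = np.array([1.0, rng.uniform(-1, 1)])  # [x0 = 1 (bias), x1]
    y = 1.0 if x[1] > 0 else 0.0             # class label: 1 above the threshold
    s = sigmoid(w @ x)
    eps = y - s                              # 1. error: eps^k = y^k - sigma(w.x^k)
    w = w + alpha * eps * s * (1 - s) * x    # 2. update; sigma'(w.x) = s * (1 - s)

print(sigmoid(w @ np.array([1.0, 0.8])))   # close to 1: confident class 1
print(sigmoid(w @ np.array([1.0, -0.8])))  # close to 0: confident class 0
```

After training, inputs far from the threshold get confident predictions near 0 or 1, exactly the "large margin" behavior described above.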

Example

[Plot: x_2 versus x_1 with separating line x_2 = x_1; y = 0 on one side of the line, y = 1 on the other]

w = [0 1 −1]

We get the same separating line if we multiply all of w by some constant, so we are really interested in the relative values, such as the slope of the line, −w_2/w_1.

[Plots: the estimated slope converging to the true value w_1/w_2 = 1 over the iterations; the separating line after 5 iterations (5 training data points), 40 iterations (40 training data points), and 60 iterations (60 training data points), approaching the true class boundary]

Single Layer: Remarks

- Good news: can represent any problem in which the decision boundary is linear.
- Bad news: NO guarantee if the problem is not linearly separable.
- Canonical example: learning the XOR function from examples. There is no line separating the data into 2 classes.

[Plot of the four XOR training points: (x_1, x_2) = (0, 0) and (1, 1) in one class, (0, 1) and (1, 0) in the other; class output y = x_1 XOR x_2]

So why do we bother (and why we need to suffer through a lot more slides)? A real example:

- Input: all the pixels from the images from 2 cameras are input units
- Output: steering direction
- Network: many layers (3.15 million connections and 71,900 parameters!)
- Training: trained on 95,000 frames from human driving (30,000 for testing)
- Execution: real-time execution on input data (approx. 10 frames/sec)

From 2004: http://www.cs.nyu.edu/~yann/research/dave/index.html

More Complex Decision Classifiers: Multi-Layer Network

[Network diagram: input units x_0, …, x_i, …, x_M feeding a layer of L hidden units through weights w_01, …, w_i1, …, w_ML; the hidden units feed the output prediction unit through weights w_1, …, w_L]

The network can have an arbitrary number of hidden units and an arbitrary number of layers. More complex networks can be used to describe more complex classification boundaries.

Training: Backpropagation

The analytical expression of the network's output becomes totally unwieldy (something like

    output = σ( Σ_{m=1}^{L} w_m σ( Σ_{j=0}^{M} w_{jm} x_j ) )

), and things get worse as we can use an arbitrary number of layers with an arbitrary number of hidden units.
The point is that we can still take the derivatives of this monster expression with respect to all of the weights, and adjust them to do gradient descent based on the training data.
Fortunately, there is a mechanical way of propagating the errors (the ε's) through the network so that the weights are correctly updated, so we never have to deal directly with these ugly expressions.
This is called backpropagation. Backpropagation implements the gradient descent shown earlier in the general case of multiple layers.
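The mechanical propagation of the errors can be sketched for a one-hidden-layer network like the one written above, trained here on the XOR problem that defeated the single layer. A sketch only: the 4 hidden units, learning rate, seed, step count, and the bias weights added to both layers are choices of this example:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# XOR training data, with the fake attribute x0 = 1 as the first column
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])            # y = x1 XOR x2

rng = np.random.default_rng(4)
W1 = rng.normal(0.0, 1.0, (3, 4))             # input -> 4 hidden units
W2 = rng.normal(0.0, 1.0, 5)                  # bias + hidden -> output unit
alpha = 0.5

for step in range(20000):
    h = sigmoid(X @ W1)                       # hidden activations
    H = np.column_stack([np.ones(4), h])      # fake attribute for the output unit
    out = sigmoid(H @ W2)                     # network output
    delta_out = (y - out) * out * (1 - out)   # error times sigma' at the output
    delta_h = np.outer(delta_out, W2[1:]) * h * (1 - h)  # errors propagated back
    W2 += alpha * H.T @ delta_out             # gradient-descent weight updates
    W1 += alpha * X.T @ delta_h

h = sigmoid(X @ W1)
out = sigmoid(np.column_stack([np.ones(4), h]) @ W2)
print(np.round(out, 2))                       # typically close to [0, 1, 1, 0]
```

Note how each weight update only uses locally propagated δ's, never the full analytical expression of the output; that is the whole point of backpropagation. Convergence on XOR is typical but not guaranteed for every random initialization, as the slides warn below.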

(Contrived) Example

[Plot: training data in two classes, y = 0 and y = 1, with a non-linear boundary between them]

Let's try: a 2-layer network with 4 hidden units.

[Figures: the colormap used for coding the network output; the network output on the training data; the boundary between the classes obtained by thresholding the network output at 1/2; the error on the training data versus iterations. Best viewed in color.]

- Good: multi-layer networks can represent any arbitrary decision boundary. We are no longer limited to linear boundaries.
- Bad: unfortunately, there is no guarantee at all regarding convergence of the training procedure.
- More complex networks (more hidden units and/or more layers) can represent more complex boundaries, but beware of overfitting: a complex enough network can always fit the training data!
- In practice: training works well for reasonable designs of the networks for specific problems.

Overfitting Issues

[Plot: error versus training iterations, on the training data and on independent test data; the test error eventually starts increasing while the training error keeps decreasing]

This is the same overfitting problem as with the other techniques, and it is addressed essentially in the same way:
- Train on the training data set.
- At each step, evaluate performance on an independent validation test set.
- Stop when the error on the test data is minimum.
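The recipe above (train, monitor, keep the best) can be sketched on the earlier linear-regression example with added noise. Everything here is a choice of this sketch: the noise level, set sizes, seed, learning rate, and step count:

```python
import numpy as np

rng = np.random.default_rng(5)

def make_set(n):
    """Noisy samples of the earlier true function y = 0.3 + 0.7*x1."""
    x1 = rng.uniform(-1, 1, n)
    X = np.column_stack([np.ones(n), x1])
    return X, 0.3 + 0.7 * x1 + rng.normal(0, 0.1, n)

Xtr, ytr = make_set(20)        # training data set
Xva, yva = make_set(20)        # independent validation set

w = np.zeros(2)
alpha = 0.02
best_w, best_err = w.copy(), np.inf

for step in range(500):
    eps = ytr - Xtr @ w                        # train on the training set only
    w = w + alpha * Xtr.T @ eps / len(ytr)     # one gradient-descent step
    val_err = np.mean((yva - Xva @ w) ** 2)    # evaluate on the validation set
    if val_err < best_err:                     # remember the weights with the
        best_w, best_err = w.copy(), val_err   # minimum validation error

print(best_w, best_err)                        # best_w lands near [0.3, 0.7]
```

The validation set never influences the gradient steps; it only decides which set of weights to keep, which is what makes the stopping criterion an honest estimate.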

Real Example

[Face detection: face patterns sampled at 10° intervals; the network output is interpreted as a posterior distribution]

Neural Network-Based Face Detection. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, June, 1998.

Real Example

- Each pixel is an input unit
- Complex network with many layers
- Output is the digit class
- Real-time
- Used commercially

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, November 1998. http://yann.lecun.com/exdb/lenet/

Real Example

- Network with 1 hidden layer (4 hidden units)
- Demonstrated at highway speeds over 100s of miles
- Training data: images + corresponding steering angle
- Important: conditioning of the training data to generate new examples avoids overfitting

D. Pomerleau. Neural network perception for mobile robot guidance. Kluwer Academic Publishing, 1993.


Summary

Neural networks are used for:
- Approximating y as a function of input x (regression)
- Predicting a (discrete) class y as a function of input x (classification)

Key concepts:
- Difference between linear and sigmoid outputs
- Gradient descent for training
- Backpropagation for general networks
- Use of validation data for avoiding overfitting

Good:
- Simple framework
- Direct procedure for training (gradient descent)
- Convergence guarantees in the linear case

Not so good:
- Many parameters (learning rate, etc.)
- Need to design the architecture of the network (how many units? how many layers? what transfer function at each unit? etc.)
- Requires a substantial amount of engineering in designing the network
- Training can be very slow and can get stuck in local minima
