Assessing Performance

Emily Fox & Carlos Guestrin
Machine Learning Specialization
University of Washington

Make predictions, get $, right??

[Diagram: Model + algorithm → fitted function f̂ → Predictions → decisions → outcomes]


Or, how much am I losing?


Example: Lost $ due to inaccurate listing price
- Too low → low offers
- Too high → few lookers + no/low offers

How much am I losing compared to perfection?


Perfect predictions: Loss = 0
My predictions: Loss = ???

Measuring loss
Loss function:
L(y, f̂(x)) = cost of using f̂(x) at x when y is the true (actual) value

f̂(x) = predicted value

Examples:
(assuming loss for underpredicting = overpredicting)

Absolute error: L(y, f̂(x)) = |y - f̂(x)|

Squared error: L(y, f̂(x)) = (y - f̂(x))²
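To make the two example losses concrete, here is a minimal NumPy sketch (not from the slides); `y` holds observed prices and `f_x` the predicted values f̂(x):

```python
import numpy as np

def absolute_error(y, f_x):
    # L(y, f̂(x)) = |y - f̂(x)|, elementwise
    return np.abs(y - f_x)

def squared_error(y, f_x):
    # L(y, f̂(x)) = (y - f̂(x))², elementwise
    return (y - f_x) ** 2

# Toy usage with made-up prices (in $):
y_true = np.array([310_000., 450_000., 285_000.])
y_pred = np.array([300_000., 470_000., 300_000.])
print(absolute_error(y_true, y_pred))
print(squared_error(y_true, y_pred))
```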

"Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful." — George Box, 1987


Assessing the loss


Assessing the loss


Part 1: Training error


Define training data

[Figure: training data points — price ($) vs. square feet (sq.ft.)]


Example: Fit a quadratic function to minimize RSS

[Figure: quadratic fit f̂; ŵ minimizes RSS of the training data — price ($) vs. square feet (sq.ft.)]
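As a hedged illustration (not the course's own code), fitting a quadratic by least squares — which is exactly "minimize RSS" — can be sketched with NumPy; `sqft` and `price` are hypothetical training arrays:

```python
import numpy as np

# Hypothetical training data: square feet and sale prices.
sqft = np.array([1000., 1500., 1800., 2100., 2500., 3000.])
price = np.array([250_000., 320_000., 360_000., 400_000., 470_000., 540_000.])

# np.polyfit returns the coefficients ŵ that minimize the residual
# sum of squares (RSS) for a degree-2 polynomial.
w_hat = np.polyfit(sqft, price, deg=2)
f_hat = np.poly1d(w_hat)   # the fitted function f̂

print(f_hat(2640))         # predicted price for a 2640 sq.ft. house
```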

Compute training error


1. Define a loss function L(y, f̂(x))
   - E.g., squared error, absolute error, ...

2. Training error
   = avg. loss on houses in training set
   = (1/N) Σ_{i=1}^{N} L(y_i, f̂(x_i)),
   where f̂ was fit using the training data



Example: Use squared error loss (y - f̂(x))²

[Figure: quadratic fit evaluated at the training houses — price ($) vs. square feet (sq.ft.)]

Training error (ŵ) = 1/N * [ ($_train 1 - f̂(sq.ft._train 1))²
                           + ($_train 2 - f̂(sq.ft._train 2))²
                           + ($_train 3 - f̂(sq.ft._train 3))²
                           + ... including all training houses ]

Example: Use squared error loss (y - f̂(x))²

Training error (ŵ) = (1/N) Σ_{i=1}^{N} (y_i - f̂(x_i))²

RMSE = sqrt( (1/N) Σ_{i=1}^{N} (y_i - f̂(x_i))² )
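Continuing the earlier hypothetical polyfit sketch, the training error under squared loss and the corresponding RMSE could be computed as:

```python
import numpy as np

def training_error(y, y_hat):
    # (1/N) Σ (y_i - f̂(x_i))² — average squared error on the training set
    return np.mean((y - y_hat) ** 2)

def rmse(y, y_hat):
    # Root mean squared error, in the same units as y ($)
    return np.sqrt(training_error(y, y_hat))

print(training_error(price, f_hat(sqft)), rmse(price, f_hat(sqft)))
```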

Training error vs. model complexity

[Figure: fits of increasing complexity (price ($) vs. square feet (sq.ft.)), and the corresponding training error curve, which decreases as model complexity increases]

Is training error a good measure of predictive performance?

- How do we expect to perform on a new house?
- Is there something particularly bad about having x_t square feet???
- Issue: Training error is overly optimistic because ŵ was fit to the training data

Small training error ⇏ good predictions,
unless the training data includes everything you might ever see.

Assessing the loss


Part 2: Generalization (true) error


Generalization error
Really want an estimate of loss over all possible (house, $) pairs
(lots of houses in the neighborhood, but not in the dataset).


Distribution over houses


In our neighborhood, houses of what # sq.ft. are we likely to see?

[Figure: distribution over square feet (sq.ft.)]

Distribution over sales prices


For houses with a given # sq.ft., what house prices $ are we likely to see?

[Figure: distribution over price ($) for a fixed # sq.ft.]

Generalization error definition


Really want an estimate of loss over all possible (house, $) pairs.

Formally:
generalization error = E_{x,y}[ L(y, f̂(x)) ]
(an average over all possible (x,y) pairs, weighted by how likely each is;
f̂ was fit using the training data)
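Generalization error can't be computed on real data, but on a synthetic problem where we control the data-generating distribution we can approximate the expectation E_{x,y}[L(y, f̂(x))] by Monte Carlo sampling. A minimal sketch, assuming a made-up true function and noise level:

```python
import numpy as np

rng = np.random.default_rng(0)

def f_true(x):                      # assumed "true" relationship (made up)
    return 100_000 + 150 * x

def sample_houses(n):               # draw (sq.ft., price) pairs from the assumed distribution
    x = rng.uniform(500, 3500, size=n)
    y = f_true(x) + rng.normal(0, 30_000, size=n)   # noise ε with σ = $30k
    return x, y

# Fit f̂ on one training set of N houses.
x_train, y_train = sample_houses(30)
f_hat = np.poly1d(np.polyfit(x_train, y_train, deg=2))

# Approximate generalization error = E_{x,y}[(y - f̂(x))²] with a huge fresh sample.
x_big, y_big = sample_houses(1_000_000)
print(np.mean((y_big - f_hat(x_big)) ** 2))
```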

Generalization error vs. model complexity

[Figure: fits f̂ of increasing complexity (price ($) vs. square feet (sq.ft.)), and the corresponding generalization error curve, which decreases and then increases with model complexity]

Can't compute! (it requires the true distribution over all (x,y) pairs)

Assessing the loss


Part 3: Test error


Approximating generalization error

Wanted: an estimate of loss over all possible (house, $) pairs.
Approximate it by looking at houses not in the training set.


Forming a test set

Hold out some (house, $) pairs that are not used for fitting the model.

Training set | Test set
(the test set is a proxy for everything you might see)

Compute test error

Test error
= avg. loss on houses in test set
= (1/N_test) Σ_{i in test set} L(y_i, f̂(x_i))

where N_test = # test points, and f̂ was fit using the training data
(f̂ has never seen the test data!)

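A minimal sketch of the test-error computation, assuming hypothetical held-out arrays `x_train`, `y_train`, `x_test`, `y_test` already exist (a way to form such a split is sketched a few slides below):

```python
import numpy as np

def test_error(f_hat, x_test, y_test):
    # avg. squared loss on the held-out test houses; f̂ never saw these points
    return np.mean((y_test - f_hat(x_test)) ** 2)

# f̂ is fit on the training set only ...
f_hat = np.poly1d(np.polyfit(x_train, y_train, deg=2))
# ... and evaluated on the test set.
print(test_error(f_hat, x_test, y_test))
```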

Example: As before, fit a quadratic to the training data

[Figure: quadratic fit; ŵ minimizes RSS of the training data — price ($) vs. square feet (sq.ft.)]

Example: As before, use squared error loss (y - f̂(x))²

Test error (ŵ) = 1/N_test * [ ($_test 1 - f̂(sq.ft._test 1))²
                            + ($_test 2 - f̂(sq.ft._test 2))²
                            + ($_test 3 - f̂(sq.ft._test 3))²
                            + ... including all test houses ]

Training, true, & test error vs. model complexity

[Figure: error vs. model complexity — training error keeps decreasing with complexity, while true (generalization) error and its test-error approximation decrease and then increase]

Overfitting if: training error is small but the true (generalization) error is large.
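To see the overfitting pattern numerically, one could compare training and test RMSE across polynomial degrees, again assuming hypothetical arrays `x_train`, `y_train`, `x_test`, `y_test`:

```python
import numpy as np

for degree in [1, 2, 4, 8, 12]:
    f_hat = np.poly1d(np.polyfit(x_train, y_train, deg=degree))
    train_rmse = np.sqrt(np.mean((y_train - f_hat(x_train)) ** 2))
    test_rmse = np.sqrt(np.mean((y_test - f_hat(x_test)) ** 2))
    # Training RMSE keeps falling with degree; test RMSE eventually rises → overfitting.
    print(f"degree {degree:2d}: train RMSE {train_rmse:10.0f}, test RMSE {test_rmse:10.0f}")
```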

Training/test split


Training/test splits
Training set

Test set

how many? vs. how many?


Training/test splits

Small training set | Large test set
Too few training points → ŵ is poorly estimated


Training/test splits

Large training set | Small test set
Too few test points → test error is a bad approximation of generalization error


Training/test splits

Training set | Test set

Typically, keep just enough test points to form a reasonable estimate of
generalization error. If this leaves too few points for training, use other
methods like cross validation (will see later).
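A common way to form such a split is a random permutation of the row indices; a sketch with a hypothetical `test_fraction` parameter, applied to the toy `sqft`/`price` arrays from the earlier sketch:

```python
import numpy as np

def train_test_split(x, y, test_fraction=0.2, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))                  # shuffle indices
    n_test = int(round(test_fraction * len(x)))    # size of the held-out set
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return x[train_idx], y[train_idx], x[test_idx], y[test_idx]

x_train, y_train, x_test, y_test = train_test_split(sqft, price, test_fraction=0.2)
```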

3 sources of error +
the bias-variance tradeoff


3 sources of error
In forming predictions, there
are 3 sources of error:
1. Noise
2. Bias
3. Variance


Data inherently noisy

y_i = f_{w(true)}(x_i) + ε_i

[Figure: true function f_{w(true)} with noisy observations — price ($) vs. square feet (sq.ft.)]

The noise ε_i has variance σ² — the "irreducible error".

Bias contribution

Assume we fit a constant function.

[Figure: two constant fits, f̂(train1) from N house sales and f̂(train2) from N other house sales — price ($) vs. square feet (sq.ft.)]

Bias contribution

Over all possible size-N training sets, what do I expect my fit to be?

[Figure: true function f_{w(true)}, specific fits f̂(train1), f̂(train2), f̂(train3), and the average fit f̄_w — price ($) vs. square feet (sq.ft.)]

Bias contribution

Bias(x) = f_{w(true)}(x) - f̄_w(x)

Is our approach flexible enough to capture f_{w(true)}? If not, there is error in the predictions.

Low complexity → high bias

[Figure: true function f_{w(true)} vs. the average constant fit f̄_w — price ($) vs. square feet (sq.ft.)]

Variance contribution

How much do specific fits vary from the expected fit?
Can specific fits vary widely? If so, predictions are erratic.

Low complexity → low variance

[Figure: specific constant fits f̂(train1), f̂(train2), f̂(train3) clustered around the average fit f̄_w — price ($) vs. square feet (sq.ft.)]

Variance of high-complexity models

Assume we fit a high-order polynomial.

[Figure: wildly different fits f̂(train1), f̂(train2), f̂(train3) from different training sets, and their average f̄_w — price ($) vs. square feet (sq.ft.)]

High complexity → high variance

Bias of high-complexity models

High complexity → low bias

[Figure: the average high-order fit f̄_w closely tracks the true function f_{w(true)} — price ($) vs. square feet (sq.ft.)]

Bias-variance tradeoff

[Figure: error vs. model complexity — bias² decreases and variance increases as complexity grows; their sum is minimized at an intermediate complexity]

Error vs. amount of data

[Figure: error vs. # data points in the training set]

More in depth on the 3 sources of error

(OPTIONAL)

Accounting for training set randomness

The training set was just a random sample of N houses sold.
What if N other houses had been sold and recorded?

[Figure: two different fits f̂(1) and f̂(2) from two different samples, each with its own generalization error — price ($) vs. square feet (sq.ft.)]

Ideally, we want performance averaged over all possible training sets of size N.

Expected prediction error

= E_{training set}[ generalization error of ŵ(training set) ]

(an average over all training sets, weighted by how likely each is;
ŵ(training set) denotes parameters fit on a specific training set, with fitted function f̂_{ŵ(training set)})

Prediction error at target input

Start by considering:
1. Loss at a target x_t (e.g., 2640 sq.ft.)
2. Squared error loss: L(y, f̂(x)) = (y - f̂(x))²

Sum of 3 sources of error

Average prediction error at x_t
= σ² + [bias(f̂(x_t))]² + var(f̂(x_t))

Error variance of the model

Average prediction error at x_t
= σ² + [bias(f̂(x_t))]² + var(f̂(x_t))

The first term: y = f_{w(true)}(x) + ε, and σ² = variance of the noise ε — the irreducible error.

Bias of function estimator

Average prediction error at x_t
= σ² + [bias(f̂(x_t))]² + var(f̂(x_t))

[Figure: fits f̂(train1) and f̂(train2) from two different training sets — price ($) vs. square feet (sq.ft.)]

Bias of function estimator

Average estimated function: f̄_w(x) = E_train[ f̂_{ŵ(train)}(x) ]   (over all training sets of size N)
True function: f_{w(true)}(x)

[Figure: true function, specific fits f̂(train1), f̂(train2), and the average fit f̄_w near x_t — price ($) vs. square feet (sq.ft.)]

Bias of function estimator

Average estimated function: f̄_w(x)
True function: f_{w(true)}(x)

bias(f̂(x_t)) = f_{w(true)}(x_t) - f̄_w(x_t)

[Figure: the gap between f_{w(true)} and f̄_w at x_t — price ($) vs. square feet (sq.ft.)]


Variance of function estimator

Average prediction error at x_t
= σ² + [bias(f̂(x_t))]² + var(f̂(x_t))

[Figure: specific fits f̂(train1), f̂(train2), f̂(train3) spread around the average fit f̄_w — price ($) vs. square feet (sq.ft.)]


Variance of function estimator

var(f̂(x_t)) = E_train[ ( f̂_{ŵ(train)}(x_t) - f̄_w(x_t) )² ]

(f̂_{ŵ(train)} is the fit on a specific training dataset; f̄_w is what I expect to learn over
all training sets; the expectation over all training sets of size N measures the deviation
of a specific fit from the expected fit at x_t)

Why 3 sources of error?
A formal derivation

(OPTIONAL)

Deriving expected prediction error

Expected prediction error
= E_train[ generalization error of ŵ(train) ]
= E_train[ E_{x,y}[ L(y, f̂_{ŵ(train)}(x)) ] ]

1. Look at a specific x_t
2. Consider L(y, f̂(x)) = (y - f̂(x))²

Expected prediction error at x_t
= E_{train, y_t}[ (y_t - f̂_{ŵ(train)}(x_t))² ]

Deriving expected prediction error

Expected prediction error at x_t
= E_{train, y_t}[ (y_t - f̂_{ŵ(train)}(x_t))² ]
= E_{train, y_t}[ ( (y_t - f_{w(true)}(x_t)) + (f_{w(true)}(x_t) - f̂_{ŵ(train)}(x_t)) )² ]
= σ² + MSE[ f̂_{ŵ(train)}(x_t) ]

(expanding the square, the cross term has expectation 0 because the noise
y_t - f_{w(true)}(x_t) has mean zero and is independent of the training set)


Equating MSE with bias and variance

MSE[ f̂_{ŵ(train)}(x_t) ]
= E_train[ ( f_{w(true)}(x_t) - f̂_{ŵ(train)}(x_t) )² ]
= E_train[ ( (f_{w(true)}(x_t) - f̄_w(x_t)) + (f̄_w(x_t) - f̂_{ŵ(train)}(x_t)) )² ]
= [bias(f̂(x_t))]² + var(f̂(x_t))

(again the cross term vanishes, since E_train[ f̂_{ŵ(train)}(x_t) ] = f̄_w(x_t))


Putting it all together

Expected prediction error at x_t
= σ² + MSE[ f̂(x_t) ]
= σ² + [bias(f̂(x_t))]² + var(f̂(x_t))

→ the 3 sources of error

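The decomposition can be checked numerically on a synthetic problem where f_{w(true)} and σ are known. A hedged sketch (all data-generating choices are assumptions) that estimates σ², bias², and variance at a target x_t by refitting a polynomial on many simulated training sets:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 30_000.0                                  # assumed noise std dev
f_true = lambda x: 100_000 + 150 * x              # assumed true function
x_t, N, degree, n_sims = 2640.0, 30, 2, 2000

preds = np.empty(n_sims)
for s in range(n_sims):
    x = rng.uniform(500, 3500, size=N)            # a fresh size-N training set
    y = f_true(x) + rng.normal(0, sigma, size=N)
    f_hat = np.poly1d(np.polyfit(x, y, deg=degree))
    preds[s] = f_hat(x_t)                         # f̂_{ŵ(train)}(x_t) for this training set

f_bar = preds.mean()                              # estimate of f̄_w(x_t)
bias_sq = (f_true(x_t) - f_bar) ** 2              # [bias(f̂(x_t))]²
variance = preds.var()                            # var(f̂(x_t))

# Expected prediction error at x_t ≈ σ² + bias² + variance
print(sigma**2, bias_sq, variance, sigma**2 + bias_sq + variance)
```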

Summary of tasks


The regression/ML workflow


1. Model selection
   Often, we need to choose tuning parameters λ controlling model complexity
   (e.g., degree of polynomial)
2. Model assessment
   Having selected a model, assess its generalization error

Hypothetical implementation

Training set | Test set

1. Model selection
   For each considered model complexity λ:
   i. Estimate parameters ŵ_λ on the training data
   ii. Assess performance of ŵ_λ on the test data
   iii. Choose λ* to be the λ with lowest test error
2. Model assessment
   Compute the test error of ŵ_{λ*} (the fitted model for the selected
   complexity λ*) to approximate the generalization error

Problem with the hypothetical implementation: it is overly optimistic!

Hypothetical implementation

Issue: just like fitting ŵ and assessing its performance both on the training data,
λ* was selected to minimize test error (i.e., λ* was effectively fit on the test data).
If the test data is not representative of the whole world, then ŵ_{λ*} will typically
perform worse than the test error indicates.

Practical implementation

Training set | Validation set | Test set

Solution: create two "test" sets!
1. Select λ* such that ŵ_{λ*} minimizes error on the validation set
2. Approximate the generalization error of ŵ_{λ*} using the test set

Practical implementation

Training set   → fit ŵ_λ
Validation set → test performance of ŵ_λ to select λ*
Test set       → assess generalization error of ŵ_{λ*}

Typical splits

Training set | Validation set | Test set
    80%      |      10%       |   10%
    50%      |      25%       |   25%
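Putting the workflow together, a hedged sketch of model selection over polynomial degree using a validation set, followed by a final test-error estimate (split sizes follow the 80%/10%/10% example; `sqft_all` and `price_all` are hypothetical arrays for the full dataset):

```python
import numpy as np

rng = np.random.default_rng(2)
n = len(sqft_all)                                  # hypothetical full dataset
idx = rng.permutation(n)
n_train, n_val = int(0.8 * n), int(0.1 * n)
tr, va, te = idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def rmse(f_hat, x, y):
    return np.sqrt(np.mean((y - f_hat(x)) ** 2))

# 1. Model selection: pick the degree (complexity λ) with lowest validation error.
best = None
for degree in range(1, 9):
    f_hat = np.poly1d(np.polyfit(sqft_all[tr], price_all[tr], deg=degree))
    val_err = rmse(f_hat, sqft_all[va], price_all[va])
    if best is None or val_err < best[0]:
        best = (val_err, degree, f_hat)

# 2. Model assessment: report the test error of the selected model once, at the end.
val_err, best_degree, f_best = best
print(best_degree, rmse(f_best, sqft_all[te], price_all[te]))
```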

Summary of
assessing performance


What you can do now

- Describe what a loss function is and give examples
- Contrast training, generalization, and test error
- Compute training and test error given a loss function
- Discuss the issue of assessing performance on the training set
- Describe tradeoffs in forming training/test splits
- List and interpret the 3 sources of avg. prediction error
  - irreducible error, bias, and variance
- Discuss the issue of selecting model complexity on test data
  and then using test error to assess generalization error
- Motivate the use of a validation set for selecting tuning parameters
  (e.g., model complexity)
- Describe the overall regression workflow
