Martin Haugh
Department of Industrial Engineering and Operations Research
Columbia University
Email: martin.b.haugh@gmail.com
Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with
applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie
and R. Tibshirani (JWHT).
Additional References: Chapter 3 of JWHT and Learning from Data by Abu-Mostafa, Magdon-Ismail
and Lin.
Outline
Linear Regression
Applications from ISLR
Potential Problems with Linear Regression
Bias-Variance Decomposition
A Case Study: Overfitting with Polynomials
Linear Regression
Linear regression assumes the regression function E[Y | X] is linear in the
inputs, X_1, \ldots, X_p.
Developed many years ago but still very useful today
- simple and easy to understand
- can sometimes outperform more sophisticated models when there is little data
available.
Linear Regression
In a linear regression model the dependent variable Y is a random variable that
satisfies

Y = \beta_0 + \sum_{i=1}^{p} \beta_i X_i + \epsilon

where X = (X_1, \ldots, X_p) and \epsilon is an error term.

Given training data (y_1, \mathbf{x}_1), \ldots, (y_N, \mathbf{x}_N), the least squares estimate \hat\beta solves

\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 = \min_{\beta} \, \| y - X\beta \|^2
where

X := \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{Np} \end{bmatrix},
\qquad
y := \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}.
Setting the gradient to zero yields the normal equations:

\nabla_\beta \, \| y - X\beta \|^2 = -2 X^\top (y - X\beta) = 0 \;\Longrightarrow\; \hat\beta = (X^\top X)^{-1} X^\top y.

Also have

\hat y = X\hat\beta = X (X^\top X)^{-1} X^\top y =: Hy

and

\hat\epsilon = y - \hat y = (I - H)y.
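As a quick sanity check, here is a minimal numpy sketch of these formulas (the simulated data and all variable names are illustrative, not from the lecture):

    import numpy as np

    rng = np.random.default_rng(0)
    N, p = 100, 3
    X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # column of ones for beta_0
    beta_true = np.array([1.0, 2.0, -1.0, 0.5])
    y = X @ beta_true + rng.normal(scale=0.5, size=N)

    # Closed-form least squares: beta_hat = (X'X)^{-1} X'y
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    # Hat matrix H = X (X'X)^{-1} X'; y_hat = Hy and resid = (I - H)y
    H = X @ np.linalg.solve(X.T @ X, X.T)
    y_hat = H @ y
    resid = y - y_hat
    assert np.allclose(y_hat, X @ beta_hat)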
The residual sum of squares (RSS) and total sum of squares (TSS) are

\mathrm{RSS} := \sum_{i=1}^{N} (y_i - \hat y_i)^2 = \hat\epsilon^\top \hat\epsilon,
\qquad
\mathrm{TSS} := \sum_{i=1}^{N} (y_i - \bar y)^2,

and the R^2 statistic is

R^2 := 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}.
R^2 always lies in the interval [0, 1] with values closer to 1 being better
- but whether a given R^2 value is good or not depends on the application
- in physical science applications we look for values close to 1 (if the model is
truly linear); in the social sciences an R^2 of 0.1 might be deemed good!
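Continuing the numpy sketch above, RSS, TSS and R^2 are then one-liners:

    RSS = np.sum((y - y_hat) ** 2)     # residual sum of squares
    TSS = np.sum((y - y.mean()) ** 2)  # total sum of squares
    R2 = 1.0 - RSS / TSS               # fraction of variance explained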
Figure 3.1 from HTF: Linear least squares fitting with X \in \mathbb{R}^2. We seek the linear function of X that
minimizes the sum of squared residuals from Y.
Figure 3.2 from HTF: The N-dimensional geometry of least squares regression with two predictors. The
outcome vector y is orthogonally projected onto the hyperplane spanned by the input vectors x_1 and x_2. The
projection \hat y represents the vector of the least squares predictions.
The noise variance \sigma^2 is estimated via

\hat\sigma^2 := \frac{1}{N-p-1} \sum_{i=1}^{N} (y_i - \hat y_i)^2 = \frac{\mathrm{RSS}}{N-p-1}.

To test the hypothesis H_0: \beta_j = 0 we compute the t-statistic

t_j := \frac{\hat\beta_j}{\hat\sigma \sqrt{(X^\top X)^{-1}_{jj}}} \;\sim\; t_{N-p-1} \approx N(0, 1) \text{ for large } N.

An approximate 1 - 2\alpha confidence interval for \beta_j is

\hat\beta_j \pm z_{1-\alpha} \, \hat\sigma \sqrt{(X^\top X)^{-1}_{jj}}

where z_{1-\alpha} := \Phi^{-1}(1 - \alpha).
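A hedged sketch of these calculations, continuing the earlier snippet (scipy.stats supplies the t and normal distributions):

    from scipy import stats

    dof = N - p - 1                                  # X has p predictors plus an intercept
    sigma2_hat = RSS / dof                           # estimate of the noise variance
    v = np.diag(np.linalg.inv(X.T @ X))              # the (X'X)^{-1}_{jj} terms
    se = np.sqrt(sigma2_hat * v)                     # standard errors of beta_hat
    t_stats = beta_hat / se
    p_values = 2 * stats.t.sf(np.abs(t_stats), dof)  # two-sided p-values
    z = stats.norm.ppf(0.975)                        # z_{1-alpha} with alpha = 0.025
    ci = np.column_stack([beta_hat - z * se, beta_hat + z * se])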
To test the null hypothesis H_0: \beta_1 = \cdots = \beta_p = 0 we can compute the F-statistic

F := \frac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(N-p-1)}

which has an F_{p, N-p-1} distribution under H_0.
To test a nested model we compute

F := \frac{(\mathrm{RSS}_0 - \mathrm{RSS})/q}{\mathrm{RSS}/(N-p-1)}

where \mathrm{RSS}_0 is the RSS for the model that uses all variables except for the last q.
Under the null hypothesis that the nested model (with \beta_{p-q+1} = \cdots = \beta_p = 0)
fits the data sufficiently well we have F \sim F_{q, N-p-1}
- which we can use to compute p-values (a numpy sketch follows below).
Such F-tests are commonly used for model selection in classical statistics
- but they only work when p \ll N
- they are also problematic due to issues associated with multiple testing.
Will prefer to use more general validation-set and cross-validation approaches for
model selection, to be covered soon.
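As a rough illustration, here is a minimal sketch of the nested-model F-test above, continuing the earlier snippets (dropping the last q = 2 predictors is an arbitrary illustrative choice):

    q = 2
    X0 = X[:, :-q]                                   # nested model: drop the last q predictors
    beta0_hat = np.linalg.solve(X0.T @ X0, X0.T @ y)
    RSS0 = np.sum((y - X0 @ beta0_hat) ** 2)
    F = ((RSS0 - RSS) / q) / (RSS / dof)
    p_value = stats.f.sf(F, q, dof)                  # F ~ F_{q, N-p-1} under the null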
Figure 2.1 from ISLR: The Advertising data set. The plot displays sales, in thousands of units, as a function
of TV, radio, and newspaper budgets, in thousands of dollars, for 200 different markets. In each plot we show
the simple least squares fit of sales to that variable, as described in Chapter 3. In other words, each blue line
represents a simple model that can be used to predict sales using TV, radio, and newspaper, respectively.
Figure 3.6 from ISLR: The Credit data set contains information about balance, age, cards, education,
income, limit, and rating for a number of potential customers.
4. Outliers, i.e. points for which y_i is far from the predicted value \hat y_i = \hat\beta_0 + \hat\beta^\top x_i
- could be genuine or a data error
- may or may not impact the fitted model but regardless will impact \hat\sigma^2,
confidence intervals and p-values, possibly dramatically
- can identify them by plotting studentized residuals against fitted values:
values > 3 in absolute value are suspicious (a sketch follows below).
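Continuing the earlier numpy sketch, (internally) studentized residuals can be computed from the hat matrix diagonal:

    h = np.diag(H)                                    # leverage of each observation
    sigma_hat = np.sqrt(RSS / dof)
    r_student = resid / (sigma_hat * np.sqrt(1 - h))  # internally studentized residuals
    suspicious = np.where(np.abs(r_student) > 3)[0]   # candidate outliers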
Collinearity, i.e. when two or more of the predictor variables are highly correlated
- solution is to either drop one of the variables or combine them into a single
predictor, e.g. in the Credit data set we could combine limit and rating into a single
variable (see the sketch below).
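A small hypothetical illustration (synthetic stand-ins for the Credit columns limit and rating, not the real data):

    rng_c = np.random.default_rng(4)
    limit = rng_c.normal(size=200)
    rating = 0.98 * limit + 0.05 * rng_c.normal(size=200)  # nearly collinear by construction

    corr = np.corrcoef(limit, rating)[0, 1]                # close to 1 signals trouble
    # Combine the pair into one predictor, e.g. the average of the standardized versions:
    combined = ((limit - limit.mean()) / limit.std() +
                (rating - rating.mean()) / rating.std()) / 2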
The basic linear model is easily extended by replacing the inputs with basis functions:

Y = \beta_0 + \sum_{i=1}^{M} \beta_i \phi_i(x) + \epsilon

where, for example, the \phi_i could be Gaussian basis functions of the form

\phi_i(x) := \frac{1}{(2\pi\sigma^2)^{p/2}} \, e^{-\frac{1}{2\sigma^2} \| x - \mu_i \|^2}.

The least squares solution takes the same form as before, but with the design matrix

\Phi := \begin{bmatrix} 1 & \phi_1(x_1) & \phi_2(x_1) & \cdots & \phi_M(x_1) \\ 1 & \phi_1(x_2) & \phi_2(x_2) & \cdots & \phi_M(x_2) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \phi_1(x_N) & \phi_2(x_N) & \cdots & \phi_M(x_N) \end{bmatrix}.
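A sketch of how the \Phi matrix might be built in numpy (using unnormalized 1-D Gaussian bumps as in the case study later; all names are illustrative):

    def gaussian_basis(x, mus, sigma):
        """Design matrix: a column of ones, then one Gaussian bump per center."""
        bumps = np.exp(-(x[:, None] - mus[None, :]) ** 2 / (2 * sigma ** 2))
        return np.column_stack([np.ones(len(x)), bumps])

    M = 10
    mus = np.linspace(0.0, 1.0, M)
    rng_b = np.random.default_rng(5)
    x_train = rng_b.uniform(0.0, 1.0, size=50)
    Phi = gaussian_basis(x_train, mus, sigma=1.0 / (M - 1))  # shape (50, M + 1)
    # The least squares solution is exactly as before, with Phi in place of X.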
Mean-Squared Error
To answer this question let \hat\theta be some estimator for \theta.
The mean-squared error (MSE) then satisfies

\mathrm{MSE}(\hat\theta) := \mathbb{E}\big[ (\hat\theta - \theta)^2 \big] = \mathrm{Var}(\hat\theta) + \underbrace{\big( \mathbb{E}[\hat\theta] - \theta \big)^2}_{\text{bias}^2}.
If the goal is to minimize MSE then unbiasedness is not necessarily a good thing
- can often trade a small increase in bias2 for a larger decrease in variance
- will do this later with subset selection methods as well as shrinkage methods
- an added benefit of some of these methods is improved interpretability.
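A minimal simulation of this tradeoff (estimating a scalar mean; the shrinkage factor 0.8 and all other numbers are illustrative):

    # Shrinking the sample mean toward 0 adds bias but reduces variance.
    rng2 = np.random.default_rng(1)
    theta, n, reps = 0.5, 10, 100_000
    samples = rng2.normal(loc=theta, size=(reps, n))
    mean_hat = samples.mean(axis=1)  # unbiased: MSE = Var = 1/n = 0.1
    shrunk = 0.8 * mean_hat          # biased: MSE = 0.64/n + (0.2*theta)^2

    mse = lambda est: np.mean((est - theta) ** 2)
    print(mse(mean_hat))             # approx 0.100
    print(mse(shrunk))               # approx 0.074 -- smaller despite the bias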
For k-nearest-neighbors regression, for example, the expected prediction error at x_0 satisfies

\mathrm{Err}(x_0) = \sigma^2 + \Big( f(x_0) - \frac{1}{k} \sum_{l=1}^{k} f(x_{(l)}) \Big)^2 + \frac{\sigma^2}{k} \qquad (3)

where x_{(1)}, \ldots, x_{(k)} are the k nearest neighbors to x_0 and (for simplicity) we've
assumed the training inputs are all fixed.
Here k is inversely related to model complexity (why?) and we then see:
bias typically decreases with model complexity
variance increases with model complexity.
Can repeat this analysis for other models, e.g. linear or ridge regression, etc.
- goal is to choose the model complexity which corresponds to the optimal
bias-variance tradeoff (see the sketch below).
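A short sketch evaluating (3) directly for a toy target (the sinusoid, grid, and noise level are illustrative choices):

    # Evaluate the k-NN error decomposition (3) at a single point x0.
    f = lambda x: np.sin(2 * np.pi * x)        # a toy regression function
    sigma2 = 0.1                               # noise variance
    x_train2 = np.linspace(0.0, 1.0, 200)      # fixed training inputs
    x0 = 0.3
    order = np.argsort(np.abs(x_train2 - x0))  # indices of neighbors, nearest first

    for k in (1, 5, 25, 100):
        nbrs = x_train2[order[:k]]
        bias2 = (f(x0) - f(nbrs).mean()) ** 2  # grows as far-away points enter
        variance = sigma2 / k                  # shrinks as we average more labels
        print(k, sigma2 + bias2 + variance)    # expected test error at x0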
Averaging over training sets D, the error at x decomposes as

\mathbb{E}_D\Big[ \big( \hat f(x; D) - f(x) \big)^2 \Big] = \underbrace{\big( \mathbb{E}_D[\hat f(x; D)] - f(x) \big)^2}_{\text{Bias}^2(x)} + \underbrace{\mathbb{E}_D\Big[ \big( \hat f(x; D) - \mathbb{E}_D[\hat f(x; D)] \big)^2 \Big]}_{\text{Variance}(x)}

so that, adding the noise term and averaging over x,

Expected test error = Irreducible Error + \mathbb{E}_x\big[ \text{Variance}(x) + \text{Bias}^2(x) \big].
As an example (Bishop's sinusoidal data set), suppose

Y = \sin(2\pi x) + \epsilon, \qquad x \in [0, 1], \qquad \epsilon \sim N(0, c). \qquad (4)
We use Gaussian basis functions

\phi_j(x) := e^{-\frac{1}{2\sigma^2}(x - \mu_j)^2} \quad \text{with } \mu_j = \frac{j}{M-1} \text{ for } j = 0, \ldots, M-1 \text{ and } \sigma = \frac{1}{M-1},
and fit the model by minimizing the regularized sum of squares

\sum_{j=1}^{N} \Big( Y_j - \beta_0 - \sum_{i=1}^{M} \beta_i \phi_i(x_j) \Big)^2 + \lambda \, \beta^\top \beta. \qquad (5)
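A hedged sketch of the experiment behind the next two figures (L = 100 data sets, ridge fits of (5); for simplicity the intercept is penalized too, which the original setup may treat differently):

    # Bias/variance of the regularized fit (5) on the sinusoidal data (4).
    rng3 = np.random.default_rng(2)
    L, N2, M2, c = 100, 25, 24, 0.1
    mus2 = np.linspace(0.0, 1.0, M2)  # mu_j = j/(M-1), j = 0, ..., M-1
    s = 1.0 / (M2 - 1)

    def design(x):
        bumps = np.exp(-(x[:, None] - mus2[None, :]) ** 2 / (2 * s ** 2))
        return np.column_stack([np.ones(len(x)), bumps])

    x_test = np.linspace(0.0, 1.0, 200)
    Phi_test = design(x_test)

    for log_lam in (2.6, -0.31, -2.4):
        lam = np.exp(log_lam)
        fits = []
        for _ in range(L):
            x = rng3.uniform(0.0, 1.0, N2)
            y2 = np.sin(2 * np.pi * x) + rng3.normal(scale=np.sqrt(c), size=N2)
            Phi = design(x)
            A = Phi.T @ Phi + lam * np.eye(M2 + 1)  # ridge-regularized normal equations
            fits.append(Phi_test @ np.linalg.solve(A, Phi.T @ y2))
        fits = np.array(fits)                       # one fitted curve per data set
        avg_fit = fits.mean(axis=0)
        bias2 = np.mean((avg_fit - np.sin(2 * np.pi * x_test)) ** 2)
        variance = np.mean(fits.var(axis=0))
        print(log_lam, bias2, variance, bias2 + variance)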
[Panel labels, top to bottom: \ln\lambda = 2.6, \; \ln\lambda = -0.31, \; \ln\lambda = -2.4]
Figure 3.5 from Bishop: Illustration of the dependence of bias and variance on model complexity, governed by a regularization parameter \lambda, using the
sinusoidal data set from Chapter 1. There are L = 100 data sets, each having N = 25 data points, and there are 24 Gaussian basis functions in the
model so that the total number of parameters is M = 25 including the bias parameter. The left column shows the result of fitting the model to the
data sets for various values of \ln\lambda (for clarity, only 20 of the 100 fits are shown). The right column shows the corresponding average of the 100 fits
(red) along with the sinusoidal function from which the data sets were generated (green).
Figure 3.6 from Bishop: Plot of squared bias and variance, together with their sum, corresponding to the
results shown in Figure 3.5. Also shown is the average test set error for a test data set size of 1000 points.
The minimum value of (bias)^2 + variance occurs around \ln\lambda = -0.31, which is close to the value that gives
the minimum error on the test data.
We fit 2nd and 10th order polynomials to this data via simple linear
regression, that is, we regress Y on 1, X, \ldots, X^J where J = 2 or J = 10.
The results are displayed in the figure on the next slide.
[Figure: data points, target curve, 2nd order fit and 10th order fit, for x \in [-1, 1]]
Fitting a Low-Order Polynomial With Noisy Data: The target curve is the 10th order
polynomial y = f (x).
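A sketch of the experiment in the figure above (the random 10th order target and the sample size are illustrative; the lecture's exact data generation may differ):

    # Fit 2nd and 10th order polynomials to noisy samples of a 10th order target.
    rng4 = np.random.default_rng(3)
    target = np.polynomial.Polynomial(rng4.normal(size=11))  # random 10th order polynomial
    x = rng4.uniform(-1.0, 1.0, size=15)
    y3 = target(x) + rng4.normal(scale=0.5, size=15)

    for J in (2, 10):                                        # regress y on 1, x, ..., x^J
        fit = np.polynomial.Polynomial.fit(x, y3, deg=J)
        in_sample = np.mean((fit(x) - y3) ** 2)              # tiny for J = 10 ...
        x_grid = np.linspace(-1.0, 1.0, 400)
        out_sample = np.mean((fit(x_grid) - target(x_grid)) ** 2)  # ... but much worse here
        print(J, in_sample, out_sample)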
We again fit 2nd and 10th order polynomials to this data via simple linear
regression, that is, we regress Y on 1, X, \ldots, X^J where J = 2 or J = 10.
The results are displayed in the figure on the next slide.
Commonly thought that overfitting occurs when the fitted model is too complex
relative to the true model
- but this is not the case here: clearly the 10th order regression overfits the
data but a 10th order polynomial is considerably less complex than a 50th
order polynomial.
What matters is how the model complexity matches the quantity and quality of
the data, not the (unknown) target function.
[Figure: data points, target curve, 2nd order fit and 10th order fit, for x \in [-1, 1]]
Fitting a High-Order Polynomial With Noiseless Data: The target curve is the 50th order
polynomial y = f(x).
Note: This case study is based on the case study in Section 4.1 of Learning from Data by
Abu-Mostafa, Magdon-Ismail and Lin.
Cross-validation is often used to select the specific model. Other methods include:
Bayesian models
- many shrinkage / regularization methods can be interpreted as Bayesian
models where the penalty on large-magnitude parameters becomes a prior
distribution on those parameters, as sketched below.
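For instance, writing the regularized objective (5) in matrix form (a standard identity, sketched here rather than taken from the slides), minimizing the ridge criterion is the same as maximizing a Gaussian posterior:

\max_{\beta} \; \underbrace{e^{-\| y - \Phi\beta \|^2 / 2\sigma^2}}_{\text{likelihood}} \; \underbrace{e^{-\beta^\top \beta / 2\tau^2}}_{\text{prior } \beta \sim N(0, \tau^2 I)} \;\Longleftrightarrow\; \min_{\beta} \; \| y - \Phi\beta \|^2 + \lambda \, \beta^\top \beta, \qquad \lambda := \frac{\sigma^2}{\tau^2},

so the penalty \lambda \beta^\top \beta is exactly the negative log of a N(0, \tau^2 I) prior (up to constants), and larger \lambda corresponds to a tighter prior around zero.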