Regularization: Ridge Regression and the LASSO
Agenda
Cross Validation
K-Fold Cross Validation
Generalized CV
The LASSO
References
Statistics 305: Autumn Quarter 2006/2007
Part I
The Bias-Variance Tradeoff
Estimating $f$
Assume the data arise from the model $y_i = f(z_i) + \epsilon_i$, with errors $\epsilon_i \sim N(0, \sigma^2)$, and let $\hat f$ denote a model fit to the observed data.
Just because $\hat f(z)$ fits our data well, this doesn't mean that it will be a good fit to new data.
In fact, suppose that we take new measurements $y_i'$ at the same $z_i$'s:
$$(z_1, y_1'), (z_2, y_2'), \ldots, (z_n, y_n')$$
So if $\hat f(\cdot)$ is a good model, then $\hat f(z_i)$ should also be close to the new target $y_i'$.
This is the notion of prediction error (PE).
Figure: Squared error versus model complexity. Prediction error reflects the tradeoff between Bias$^2$ (decreasing in complexity) and Variance (increasing in complexity), and is minimized at an intermediate complexity.
Part II
Ridge Regression
Ridge regression solves the constrained problem
$$\text{minimize } \sum_{i=1}^{n}(y_i - z_i^\top\beta)^2 \quad \text{s.t.} \quad \sum_{j=1}^{p}\beta_j^2 \le t,$$
or, in matrix form,
$$\text{minimize } (y - Z\beta)^\top(y - Z\beta) \quad \text{s.t.} \quad \sum_{j=1}^{p}\beta_j^2 \le t.$$
Equivalently, ridge regression minimizes the penalized residual sum of squares
$$\mathrm{PRSS}(\beta)_{\ell_2} = \sum_{i=1}^{n}(y_i - z_i^\top\beta)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = (y - Z\beta)^\top(y - Z\beta) + \lambda\|\beta\|_2^2.$$
Its solution may have smaller average PE than $\hat\beta_{\text{ls}}$.
$\mathrm{PRSS}(\beta)_{\ell_2}$ is convex, and hence has a unique solution.
Taking derivatives, we obtain:
$$\frac{\partial\,\mathrm{PRSS}(\beta)_{\ell_2}}{\partial\beta} = -2Z^\top(y - Z\beta) + 2\lambda\beta,$$
and setting this to zero yields the closed form $\hat\beta^{\lambda}_{\text{ridge}} = (Z^\top Z + \lambda I_p)^{-1} Z^\top y$.
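As a concrete illustration, here is a minimal R sketch (the function name and simulated data are illustrative, not from the slides) computing the ridge solution directly from this closed form:

# Ridge regression via the closed form (Z'Z + lambda * I_p)^{-1} Z'y
ridge_coef <- function(Z, y, lambda) {
  solve(crossprod(Z) + lambda * diag(ncol(Z)), crossprod(Z, y))
}

# Simulated example: n = 100 observations, p = 5 predictors
set.seed(1)
Z <- matrix(rnorm(100 * 5), 100, 5)
y <- drop(Z %*% c(2, -1, 0, 0, 1) + rnorm(100))
beta_ridge <- ridge_coef(Z, y, lambda = 10)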
The tuning parameter $\lambda$ controls the amount of shrinkage: $\lambda = 0$ recovers the least squares fit, and the coefficients shrink toward zero as $\lambda$ grows.
Figure: Ridge coefficient path for the diabetes data set found in the lars library in R; coefficient values are plotted against the effective degrees of freedom (DF), with one curve per predictor (age, sex, bmi, map, tc, ldl, hdl, tch, glu).
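A plot of this kind can be reproduced with standard tools; a minimal sketch, assuming the lars package (for the diabetes data) and MASS (for lm.ridge) are installed:

library(lars)  # provides the diabetes data set
library(MASS)  # provides lm.ridge
data(diabetes)
# Ridge fits over a grid of lambda values; plot() draws the coefficient paths
fit <- lm.ridge(y ~ x, data = diabetes, lambda = seq(0, 1000, length = 100))
plot(fit)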
Choosing $\lambda$
One heuristic: plot the components of $\hat\beta^{\lambda}_{\text{ridge}}$ against $\lambda$ (the ridge trace above) and look for the value at which the coefficients stabilize.

Proving that $\hat\beta^{\lambda}_{\text{ridge}}$ is biased
Let $R = Z^\top Z$. Then:
\begin{align*}
\hat\beta^{\lambda}_{\text{ridge}} &= (Z^\top Z + \lambda I_p)^{-1} Z^\top y \\
&= (R + \lambda I_p)^{-1} R (R^{-1} Z^\top y) \\
&= [R(I_p + \lambda R^{-1})]^{-1} R [(Z^\top Z)^{-1} Z^\top y] \\
&= (I_p + \lambda R^{-1})^{-1} R^{-1} R\, \hat\beta_{\text{ls}} \\
&= (I_p + \lambda R^{-1})^{-1} \hat\beta_{\text{ls}}
\end{align*}
So:
\begin{align*}
E(\hat\beta^{\lambda}_{\text{ridge}}) &= E\{(I_p + \lambda R^{-1})^{-1} \hat\beta_{\text{ls}}\} \\
&= (I_p + \lambda R^{-1})^{-1} \beta \qquad \text{(since } E(\hat\beta_{\text{ls}}) = \beta\text{)} \\
&\ne \beta \qquad \text{(if } \lambda \ne 0\text{)}.
\end{align*}
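The size of this bias is easy to inspect numerically; a small sketch reusing Z and the true coefficient vector from the simulation above:

# Bias of ridge: E(beta_ridge) = (I_p + lambda * R^{-1})^{-1} beta != beta
beta_true <- c(2, -1, 0, 0, 1)
R <- crossprod(Z)
drop(solve(diag(5) + 10 * solve(R)) %*% beta_true)  # shrunk away from beta_true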
The penalized criterion can be rewritten as a pure sum of squares:
$$\mathrm{PRSS}(\beta)_{\ell_2} = \sum_{i=1}^{n}(y_i - z_i^\top\beta)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = \sum_{i=1}^{n}(y_i - z_i^\top\beta)^2 + \sum_{j=1}^{p}\left(0 - \sqrt{\lambda}\,\beta_j\right)^2$$
Hence ridge regression is ordinary least squares on an augmented data set.
Augment $Z$ with $p$ extra rows $\sqrt{\lambda}\,I_p$ and $y$ with $p$ zeros:
$$Z_\lambda = \begin{pmatrix} z_{1,1} & z_{1,2} & z_{1,3} & \cdots & z_{1,p} \\ \vdots & & & & \vdots \\ z_{n,1} & z_{n,2} & z_{n,3} & \cdots & z_{n,p} \\ \sqrt{\lambda} & 0 & 0 & \cdots & 0 \\ 0 & \sqrt{\lambda} & 0 & \cdots & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & 0 & \cdots & \sqrt{\lambda} \end{pmatrix}; \qquad y_\lambda = \begin{pmatrix} y_1 \\ \vdots \\ y_n \\ 0 \\ \vdots \\ 0 \end{pmatrix}$$
So:
$$Z_\lambda = \begin{pmatrix} Z \\ \sqrt{\lambda}\, I_p \end{pmatrix}; \qquad y_\lambda = \begin{pmatrix} y \\ 0 \end{pmatrix}$$
So the least squares solution for the augmented data set is:
$$(Z_\lambda^\top Z_\lambda)^{-1} Z_\lambda^\top y_\lambda = \left[(Z^\top,\ \sqrt{\lambda}\,I_p)\begin{pmatrix} Z \\ \sqrt{\lambda}\,I_p \end{pmatrix}\right]^{-1}(Z^\top,\ \sqrt{\lambda}\,I_p)\begin{pmatrix} y \\ 0 \end{pmatrix} = (Z^\top Z + \lambda I_p)^{-1} Z^\top y,$$
which is exactly $\hat\beta^{\lambda}_{\text{ridge}}$.
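This equivalence is easy to verify numerically; a sketch reusing Z, y, and beta_ridge from the simulation above:

# Augmented data: append sqrt(lambda) * I_p rows to Z and p zeros to y
lambda <- 10
Z_aug <- rbind(Z, sqrt(lambda) * diag(ncol(Z)))
y_aug <- c(y, rep(0, ncol(Z)))
# Plain least squares on the augmented data reproduces the ridge solution
beta_aug <- solve(crossprod(Z_aug), crossprod(Z_aug, y_aug))
all.equal(as.vector(beta_aug), as.vector(beta_ridge))  # TRUE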
Bayesian framework
Ridge regression also arises in a Bayesian framework: if $\beta \sim N(0, \tau^2 I_p)$ a priori and $y \mid \beta \sim N(Z\beta, \sigma^2 I_n)$, then the posterior mean (and mode) of $\beta$ is the ridge estimate with $\lambda = \sigma^2/\tau^2$.
A direct inversion can be avoided: inverting $Z^\top Z$ is computationally expensive, on the order of $O(p^3)$.
Numerical computation of $\hat\beta^{\lambda}_{\text{ridge}}$:
$$\hat\beta^{\lambda}_{\text{ridge}} = (Z^\top Z + \lambda I_p)^{-1} Z^\top y = V \operatorname{diag}\!\left(\frac{d_j}{d_j^2 + \lambda}\right) U^\top y.$$
The result uses the eigen (or spectral) decomposition of $Z^\top Z$, obtained from the singular value decomposition $Z = UDV^\top$:
$$Z^\top Z = (UDV^\top)^\top(UDV^\top) = V D^\top U^\top U D V^\top = V D^\top D V^\top = V D^2 V^\top.$$
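In code, one SVD makes the solution for any lambda a matter of elementwise arithmetic; a sketch reusing Z, y, and beta_ridge from above:

# Ridge via the SVD Z = U D V': beta = V diag(d_j / (d_j^2 + lambda)) U'y
sv <- svd(Z)
ridge_svd <- function(lambda) {
  with(sv, v %*% ((d / (d^2 + lambda)) * crossprod(u, y)))
}
all.equal(as.vector(ridge_svd(10)), as.vector(beta_ridge))  # TRUE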
The corresponding fitted values are
$$\hat y_{\text{ridge}} = Z\hat\beta^{\lambda}_{\text{ridge}} = \sum_{j=1}^{p} u_j\, \frac{d_j^2}{d_j^2 + \lambda}\, u_j^\top y,$$
so ridge shrinks the components of $y$ along the left singular vectors $u_j$, shrinking most in the directions with small $d_j^2$.
In the special case of an orthonormal design ($Z^\top Z = I_p$, so all $d_j = 1$), ridge regression is a uniform scaling of least squares:
$$\hat\beta^{\lambda}_{\text{ridge}} = \frac{1}{1+\lambda}\,\hat\beta_{\text{ls}}.$$
The effective degrees of freedom of the ridge fit is the trace of the smoother matrix:
$$\mathrm{df}(\lambda) = \operatorname{tr}\!\left[Z(Z^\top Z + \lambda I_p)^{-1}Z^\top\right] = \sum_{j=1}^{p}\frac{d_j^2}{d_j^2 + \lambda}.$$
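In code this is immediate from the singular values (reusing sv from the SVD sketch above):

# Effective degrees of freedom of the ridge fit
df_ridge <- function(lambda) sum(sv$d^2 / (sv$d^2 + lambda))
df_ridge(0)   # equals p: no shrinkage
df_ridge(10)  # smaller: shrinkage reduces the effective df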
Part III
Cross Validation
How do we choose $\lambda$?
A common approach is $K$-fold cross validation: for each candidate $\lambda$, compute
$$\mathrm{CV}(\lambda) = \sum_{k=1}^{K}(\text{CV Error})_k(\lambda),$$
the held-out error summed over the $K$ folds, and select the $\lambda$ (equivalently, the df) minimizing it.
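A minimal K-fold CV sketch for ridge in R (the fold assignment and lambda grid are illustrative), reusing Z and y from the simulation above:

# 5-fold cross validation over a grid of lambda values
K <- 5
set.seed(2)
folds <- sample(rep(1:K, length.out = nrow(Z)))
lambdas <- 10^seq(-2, 3, length = 30)
cv_err <- sapply(lambdas, function(lam) {
  mean(sapply(1:K, function(k) {
    train <- folds != k
    b <- solve(crossprod(Z[train, ]) + lam * diag(ncol(Z)),
               crossprod(Z[train, ], y[train]))
    mean((y[!train] - drop(Z[!train, ] %*% b))^2)  # held-out squared error
  }))
})
lambdas[which.min(cv_err)]  # choose the lambda minimizing CV error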
Figure: Cross-validated squared error (vertical axis, roughly 0.16 to 0.24) plotted against the effective degrees of freedom df (horizontal axis, 30 to 55).
Leave-one-out CV
For linear smoothers $\hat y = Sy$, the leave-one-out CV error has a closed form:
$$\mathrm{CV}_{(1)} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat f^{-i}(z_i)\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat f(z_i)}{1 - S_{ii}}\right)^2,$$
where $\hat f^{-i}$ is the fit computed with the $i$th observation held out and $S_{ii}$ is the $i$th diagonal element of $S$.
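The identity means LOOCV needs only one fit per lambda rather than n; a sketch reusing Z and y:

# Closed-form leave-one-out CV for ridge via the smoother matrix S (y_hat = S y)
loocv_ridge <- function(lambda) {
  S <- Z %*% solve(crossprod(Z) + lambda * diag(ncol(Z)), t(Z))
  resid <- y - drop(S %*% y)       # ordinary residuals from a single full fit
  mean((resid / (1 - diag(S)))^2)  # rescale by 1 - S_ii
}
loocv_ridge(10)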
Part IV
The LASSO
The LASSO solves
$$\text{minimize } (y - Z\beta)^\top(y - Z\beta) \quad \text{s.t.} \quad \sum_{j=1}^{p}|\beta_j| \le t.$$
Equivalently, it minimizes the $\ell_1$-penalized residual sum of squares
$$\mathrm{PRSS}(\beta)_{\ell_1} = \sum_{i=1}^{n}(y_i - z_i^\top\beta)^2 + \lambda\sum_{j=1}^{p}|\beta_j| = (y - Z\beta)^\top(y - Z\beta) + \lambda\|\beta\|_1.$$
Hence, we have a path of solutions indexed by $t$.
If $t \ge t_0 = \sum_{j=1}^{p}|\hat\beta_j^{\text{ls}}|$ (equivalently, $\lambda = 0$), we obtain no shrinkage, and hence recover the least squares solution.
Often, the path of solutions is indexed by the shrinkage fraction $s = t/t_0$.
Figure: Coefficient paths for the diabetes data computed by the LASSO, LAR, and Forward Stagewise algorithms; standardized coefficients are plotted against the shrinkage fraction $|\beta|/\max|\beta|$, with one curve per predictor.
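Paths like these can be generated directly with the lars package (a sketch, assuming lars is installed):

library(lars)
data(diabetes)
fit_lasso <- lars(diabetes$x, diabetes$y, type = "lasso")
fit_lar   <- lars(diabetes$x, diabetes$y, type = "lar")
fit_stage <- lars(diabetes$x, diabetes$y, type = "forward.stagewise")
par(mfrow = c(1, 3))
plot(fit_lasso); plot(fit_lar); plot(fit_stage)

The fitted object can also be queried at any shrinkage fraction, e.g. coef(fit_lasso, s = 0.5, mode = "fraction") for $s = t/t_0 = 0.5$.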
Part V
Model Selection, Oracles, and the Dantzig Selector
Variable selection
The ridge and LASSO solutions are indexed by the continuous parameter $\lambda$.
Variable selection in least squares, by contrast, is discrete:
one could consider best subsets, which is of order $O(2^p)$ (a combinatorial explosion compared to ridge and the LASSO).
Stepwise selection
In stepwise procedures, a new variable may be added to the model even when it yields only a minuscule improvement in $R^2$.
When stepwise selection is applied to a perturbation of the data, a different set of variables will likely enter the model at each stage.
Variants
One variant is the Dantzig selector of Candès and Tao (see References), which minimizes $\|\beta\|_1$ subject to the constraint $\|Z^\top(y - Z\beta)\|_\infty \le \lambda$ on the correlation between the predictors and the residuals.
Part VI
References
Candès, E. and Tao, T. The Dantzig selector: statistical estimation when p is much larger than n. Available at http://www.acm.caltech.edu/~emmanuel/papers/DantzigSelector.pdf.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32(2): 409–499.
Frank, I. and Friedman, J. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35: 109–148.
Hastie, T. and Efron, B. The lars package. Available from http://cran.r-project.org/src/contrib/Descriptions/lars.html.
References continued
Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics.
Hoerl, A.E. and Kennard, R. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12: 55–67.
Seber, G. and Lee, A. (2003). Linear Regression Analysis, 2nd Edition. Wiley Series in Probability and Statistics.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67: 301–320.