The Elements of Statistical Learning
Trevor Hastie, Robert Tibshirani, Jerome Friedman
Presented by Junyan Li
Linear regression model
Input: X^T = (X_1, ..., X_p)
f(X) = β_0 + Σ_{j=1}^p X_j β_j; the intercept β_0 is added so that f(X) does not have to pass through the origin.
Basis expansions: the model is linear in the parameters, so the inputs X_j may themselves be transformations or basis expansions of the raw variables.
Linear regression model
Residual sum of squares, given a set of training data (x_i, y_i), i = 1, ..., N:
RSS(β) = Σ_{i=1}^N (y_i − f(x_i))²
Denote by X the N×(p+1) matrix with each row an input vector (with a 1 in the first position); then
RSS(β) = (y − Xβ)^T (y − Xβ)
Let ∂RSS/∂β = −2X^T(y − Xβ) = 0. As X^T X is positive definite if X is of full rank (if not, remove the redundancies),
β̂ = (X^T X)^{-1} X^T y
ŷ = Xβ̂ = X(X^T X)^{-1} X^T y = Hy; here H = X(X^T X)^{-1} X^T is called the hat matrix.
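The normal-equation solution above can be checked with a minimal numpy sketch (the data and variable names are illustrative, not from the slides):

```python
import numpy as np

# Toy data: N = 50 samples, p = 2 predictors.
rng = np.random.default_rng(0)
N, p = 50, 2
X0 = rng.normal(size=(N, p))
y = 1.0 + X0 @ np.array([2.0, -3.0]) + 0.1 * rng.normal(size=N)

# Prepend the column of 1s so f(x) need not pass through the origin.
X = np.hstack([np.ones((N, 1)), X0])

# beta_hat = (X^T X)^{-1} X^T y  (solve, rather than invert, for stability)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The hat matrix H projects y onto the column space of X.
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y
```

Note that the residual y − ŷ is orthogonal to every column of X, which is exactly the condition X^T(y − ŷ) = 0 used on the next slide.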
Linear regression model
ŷ is the orthogonal projection of y onto the subspace spanned by the column vectors of X, x_0, ..., x_p, with x_0 ≡ 1.
X^T(y − ŷ) = 0 => the residual y − ŷ is orthogonal to that subspace.
Linearregressionmodel
Assume
hasconstantvariance
Assumethedeviationsof
andGaussian,ande~N 0,
arounditsexpectationareadditive
.Then ~
,
.
Linear regression model
The residuals are constrained by p+1 equalities: X^T(y − ŷ) = 0.
So σ̂² is an unbiased estimator of σ², and (N − p − 1)σ̂² ~ σ² χ²_{N−p−1}.
To test the hypothesis that β_j = 0, use
z_j = β̂_j / (σ √v_j) ~ N(0, 1) (if σ is given), or
t_j = β̂_j / (σ̂ √v_j) ~ t_{N−p−1} (if σ is not given),
where v_j is the jth diagonal element of (X^T X)^{-1}. When N − p − 1 is large enough, the difference between the tail quantiles of a t-distribution and a standard normal distribution is negligible, so we use the standard normal distribution regardless of whether σ is given or not.
Simultaneously, we can use the F statistic to test whether some parameters can be removed:
F = [(RSS_0 − RSS_1)/(p_1 − p_0)] / [RSS_1/(N − p_1 − 1)],
where RSS_1 is the residual sum of squares for the least squares fit of the bigger model with p_1 + 1 parameters, and RSS_0 the same for the nested smaller model with p_0 + 1 parameters.
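The F statistic can be computed directly from the two fitted models; a small sketch with illustrative data (only the first predictor carries signal):

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of the least squares fit of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    return r @ r

rng = np.random.default_rng(1)
N = 100
X0 = rng.normal(size=(N, 3))
y = 1.0 + 2.0 * X0[:, 0] + rng.normal(size=N)

ones = np.ones((N, 1))
X_big = np.hstack([ones, X0])           # bigger model, p1 = 3 predictors
X_small = np.hstack([ones, X0[:, :1]])  # nested smaller model, p0 = 1

p1, p0 = 3, 1
F = ((rss(X_small, y) - rss(X_big, y)) / (p1 - p0)) / (rss(X_big, y) / (N - p1 - 1))
print(F)  # compare against quantiles of F_{p1-p0, N-p1-1}
```

Because the smaller model is nested in the bigger one, RSS_0 ≥ RSS_1, so F is always nonnegative.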
Gauss-Markov Theorem
Let θ̃ = c^T y be another linear unbiased estimator of a^T β. As E(ε) = 0, the least squares estimate a^T β̂ has the smallest variance among all linear unbiased estimators: Var(a^T β̂) ≤ Var(c^T y).
However, there may exist a biased estimator with smaller mean squared error, which is intimately related to prediction accuracy.
Multiple Regression from Simple Univariate Regression
If p = 1 and there is no intercept, the least squares estimate and residual are
β̂ = <x, y> / <x, x>, r_i = y_i − x_i β̂, where <x, y> = Σ_{i=1}^N x_i y_i.
Let x_1, ..., x_p be the columns of X. If <x_j, x_k> = 0 (orthogonal) for each j ≠ k, then the multiple least squares estimates are just the univariate ones,
β̂_1 = <x_1, y> / <x_1, x_1>, ..., β̂_p = <x_p, y> / <x_p, x_p>,
and <x_j, y − x_j β̂_j> = <x_j, y> − β̂_j <x_j, x_j> = 0, i.e. each input and its residual are orthogonal: orthogonal inputs have no effect on each other's parameter estimates.
Multiple Regression from Simple Univariate Regression
Orthogonal inputs occur most often with balanced, designed experiments (where
orthogonality is enforced), but almost never with observational data.
Regression by Successive Orthogonalization:
1. Initialize z_0 = x_0 = 1.
2. For j = 1, 2, ..., p: regress x_j on z_0, z_1, ..., z_{j−1} to produce coefficients γ̂_{lj} = <z_l, x_j> / <z_l, z_l>, l = 0, ..., j − 1, and residual vector z_j = x_j − Σ_{k=0}^{j−1} γ̂_{kj} z_k.
3. Regress y on the residual z_p to give the estimate β̂_p = <z_p, y> / <z_p, z_p>.
We can see that each of the z_j is a linear combination of the x_k, k ≤ j, and the z_j are orthogonal to each other.
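The three steps above can be sketched directly in numpy (data and names illustrative); the last orthogonalized coefficient matches the last coefficient of the full multiple regression:

```python
import numpy as np

def successive_orthogonalization(X, y):
    """Gram-Schmidt on the columns of X (first column all 1s),
    then regress y on the last residual z_p to get beta_hat_p."""
    N, cols = X.shape
    Z = np.zeros((N, cols))
    Z[:, 0] = X[:, 0]  # z_0 = x_0 = 1
    for j in range(1, cols):
        zj = X[:, j].astype(float)
        for l in range(j):  # regress x_j on z_0, ..., z_{j-1}
            gamma = (Z[:, l] @ X[:, j]) / (Z[:, l] @ Z[:, l])
            zj = zj - gamma * Z[:, l]
        Z[:, j] = zj
    beta_p = (Z[:, -1] @ y) / (Z[:, -1] @ Z[:, -1])
    return Z, beta_p

rng = np.random.default_rng(2)
N = 60
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])
y = X @ np.array([1.0, 2.0, -1.5]) + 0.1 * rng.normal(size=N)

Z, beta_p = successive_orthogonalization(X, y)
full = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(beta_p, full[-1]))  # beta_p equals the full multiple regression coefficient
```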
Multiple Regression from Simple Univariate Regression
In matrix form: X = ZΓ, where Z has columns z_j and Γ is the upper triangular matrix with entries γ̂_{kj}. Introducing the diagonal matrix D with jth diagonal entry D_jj = ||z_j||, let
Q = ZD^{-1} and R = DΓ, so that X = QR.
The QR decomposition represents a convenient orthogonal basis for the column space of X.
Q is an N×(p+1) orthogonal matrix (Q^T Q = I), and R is a (p+1)×(p+1) upper triangular matrix. Then
β̂ = R^{-1} Q^T y, ŷ = QQ^T y.
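In practice the QR route is how least squares is usually computed; a minimal sketch with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 80
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 3))])
y = rng.normal(size=N)

# Least squares via the QR decomposition: beta_hat = R^{-1} Q^T y
Q, R = np.linalg.qr(X)                 # reduced QR: Q is N x (p+1), R is (p+1) x (p+1)
beta_qr = np.linalg.solve(R, Q.T @ y)  # triangular system, cheap to solve
y_hat = Q @ (Q.T @ y)                  # projection of y onto the column space of X
```

Solving the triangular system R β = Q^T y avoids forming X^T X, which is better conditioned numerically.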
Subset Selection
Two reasons why we are often not satisfied with the least squares estimates:
Prediction accuracy: the least squares estimates often have low bias but large variance.
Interpretation: with a large number of predictors, we often would like to determine a smaller subset that exhibits the strongest effects.
1. Best Subset Selection
Best subset regression finds, for each k ∈ {0, 1, 2, ..., p}, the subset of size k that gives the smallest residual sum of squares.
Infeasible for p much larger than 40.
Subset Selection
2. Forward- and Backward-Stepwise Selection
Forward-stepwise selection starts with the intercept and sequentially adds the predictor that most improves the fit; backward-stepwise selection starts with the full model and sequentially deletes the predictor with the least impact on the fit.
3. Forward-Stagewise Regression
Forward-stagewise regression (FS) is even more constrained than forward-stepwise regression. It starts like forward-stepwise regression, with an intercept equal to ȳ, and centered predictors with coefficients initially all 0. At each step the algorithm identifies the variable most correlated with the current residual. It then computes the simple linear regression coefficient of the residual on this chosen variable, and adds it to the current coefficient for that variable. This is continued until none of the variables have correlation with the residuals.
In forward-stepwise selection, a variable's coefficient is fitted all at once, but in FS the coefficients are built up partially over many steps, which works better in very high-dimensional problems.
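The stagewise loop described above can be sketched as follows (illustrative data; the stopping tolerance `tol` is an assumption, not from the slides):

```python
import numpy as np

def forward_stagewise(X, y, n_steps=5000, tol=1e-8):
    """Forward-stagewise regression: repeatedly add the simple regression
    coefficient of the residual on the most-correlated centered predictor."""
    Xc = X - X.mean(axis=0)      # centered predictors
    r = y - y.mean()             # intercept = mean of y
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        corr = Xc.T @ r
        j = int(np.argmax(np.abs(corr)))
        if abs(corr[j]) < tol:   # no remaining correlation: stop
            break
        delta = corr[j] / (Xc[:, j] @ Xc[:, j])  # simple regression coefficient
        beta[j] += delta
        r = r - delta * Xc[:, j]
    return beta

rng = np.random.default_rng(4)
N = 100
X = rng.normal(size=(N, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + 0.05 * rng.normal(size=N)
print(forward_stagewise(X, y))  # approaches the least squares coefficients
```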
Shrinkage Methods
Because subset selection is a discrete process (variables are either retained or discarded), it often exhibits high variance, and so doesn't reduce the prediction error of the full model. Shrinkage methods are more continuous, and don't suffer as much from high variability.
1. Ridge Regression
Penalize by the sum of squares of the coefficients:
β̂^ridge = argmin_β { Σ_{i=1}^N (y_i − β_0 − Σ_j x_ij β_j)² + λ Σ_{j=1}^p β_j² }, λ ≥ 0,
or equivalently: minimize the residual sum of squares subject to Σ_{j=1}^p β_j² ≤ t.
There is a one-to-one correspondence between λ in the penalized form and t in the constrained form.
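On centered inputs the penalized form has the closed-form solution β̂^ridge = (X^T X + λI)^{-1} X^T y; a minimal sketch:

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge coefficients (X^T X + lam*I)^{-1} X^T y on centered data
    (the intercept is estimated separately as the mean of y)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(5)
N = 100
X = rng.normal(size=(N, 4))
X = X - X.mean(axis=0)
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=N)
y = y - y.mean()

b0 = ridge(X, y, 0.0)    # lam = 0 recovers least squares
b10 = ridge(X, y, 10.0)  # larger lam shrinks the coefficient vector
print(np.linalg.norm(b10) < np.linalg.norm(b0))  # True: shrinkage
```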
Shrinkage Methods
Writing the penalized criterion RSS(λ) = (y − Xβ)^T(y − Xβ) + λβ^Tβ gives β̂^ridge = (X^T X + λI)^{-1} X^T y. Using the SVD X = UDV^T,
Xβ̂^ridge = X(X^T X + λI)^{-1} X^T y = Σ_{j=1}^p u_j [d_j² / (d_j² + λ)] u_j^T y.
Like linear regression, ridge regression computes the coordinates of y with respect to the orthonormal basis U. It then shrinks these coordinates by the factors d_j² / (d_j² + λ).
Shrinkage Methods
Eigen decomposition of X^T X: X^T X = VD²V^T, and the eigenvectors v_j (columns of V) are also called the principal component (or Karhunen-Loeve) directions of X.
Degrees of freedom:
df(λ) = tr[X(X^T X + λI)^{-1} X^T] = Σ_{j=1}^p d_j² / (d_j² + λ).
Shrinkage Methods
2. Lasso Regression
β̂^lasso = argmin_β Σ_{i=1}^N (y_i − β_0 − Σ_j x_ij β_j)², subject to Σ_{j=1}^p |β_j| ≤ t.
The latter constraint makes the solutions nonlinear in the y_i; there is no closed-form expression as in ridge regression. Computing the lasso solution is a quadratic programming problem.
If the solution occurs at a corner of the constraint region, then it has one parameter β_j equal to zero.
Shrinkage Methods
Assume that the columns of X are orthonormal => X^T X = I, so the least squares estimate is β̂ = X^T y.
Minimize (y − Xβ)^T(y − Xβ) + λ Σ_j |β_j| <==> minimize Σ_j (β̂_j − β_j)² + λ Σ_j |β_j| (dropping terms that do not involve β).
To minimize, each coordinate can be treated separately: for β_j > 0, setting the derivative 2(β_j − β̂_j) + λ = 0 gives β_j = β̂_j − λ/2, and symmetrically for β_j < 0, so
β̂_j^lasso = sign(β̂_j)(|β̂_j| − λ/2)_+ (soft thresholding).
1. Unbiasedness: the resulting estimator is nearly unbiased when the true unknown parameter is large, to avoid unnecessary modeling bias.
2. Sparsity: the resulting estimator is a thresholding rule, which automatically sets small estimated coefficients to zero to reduce model complexity.
3. Continuity: the resulting estimator is continuous, to avoid instability in model prediction.
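The soft-thresholding rule for the orthonormal case is a one-liner (the example coefficients are illustrative):

```python
import numpy as np

def soft_threshold(b, lam):
    """Lasso solution for orthonormal X: shrink the least squares
    coefficients toward zero by lam/2 and clip at zero."""
    return np.sign(b) * np.maximum(np.abs(b) - lam / 2.0, 0.0)

b_ls = np.array([3.0, -0.4, 1.2, 0.1])
print(soft_threshold(b_ls, 1.0))  # coefficients smaller than lam/2 become exactly zero
```

This exhibits the sparsity property above: small coefficients are set exactly to zero, while large ones are shifted by a constant.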
Shrinkage Methods
3. LeastAngleRegression
Forward stepwise regression builds a model sequentially, adding one
variable at a time. At each step, it identifies the best variable to include
in the active set, and then updates the least squares fit to include all
the active variables.
Least angle regression uses a similar strategy, but only enters as much
of a predictor as it deserves.
1. Standardize the predictors to have mean zero and unit norm. Start with the residual r = y − ȳ and β_1, β_2, ..., β_p = 0.
2. Find the predictor x_j most correlated with r (for standardized predictors, the correlation is the cosine of the angle between the vectors).
3. Move β_j from 0 towards its least-squares coefficient <x_j, r>, until some other competitor x_k has as much correlation with the current residual as does x_j.
4. Move β_j and β_k in the direction defined by their joint least squares coefficient of the current residual on (x_j, x_k), until some other competitor x_l has as much correlation with the current residual.
4a. If a nonzero coefficient hits zero, drop its variable from the active set of variables and recompute the current joint least squares direction. (LAR becomes the lasso if this step is added.)
5. Continue in this way until all p predictors have been entered. After min(N − 1, p) steps, we arrive at the full least-squares solution.
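Steps 1–2 of the algorithm can be sketched as follows (illustrative data; a full LAR implementation would continue with the equiangular moves of steps 3–5):

```python
import numpy as np

rng = np.random.default_rng(6)
N, p = 50, 4
X = rng.normal(size=(N, p))
y = X @ np.array([0.0, 3.0, 0.0, -1.0]) + 0.1 * rng.normal(size=N)

# Step 1: standardize predictors to mean zero and unit norm; r = y - ybar
X = X - X.mean(axis=0)
X = X / np.linalg.norm(X, axis=0)
r = y - y.mean()
beta = np.zeros(p)

# Step 2: with unit-norm predictors, |x_j^T r| is proportional to the cosine
# of the angle between x_j and r, so the most correlated predictor is:
j = int(np.argmax(np.abs(X.T @ r)))
print(j)

# Step 3 would now move beta[j] from 0 toward <x_j, r> until some competitor
# matches its correlation with the evolving residual.
```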
Linear Methods for Classification
Given K classes and the fitted linear models f_k(x) = β_{k0} + β_k^T x, the decision boundary between classes k and l is the set of points for which f_k(x) = f_l(x), i.e. {x : (β_{k0} − β_{l0}) + (β_k − β_l)^T x = 0}.
Alternatively, we can model the posterior probabilities P(G = k | X = x). For two classes, a popular model is
log[ P(G = 1 | X = x) / P(G = 2 | X = x) ] = β_0 + β^T x.
Hyperplanes => model the decision boundaries as linear.
Linear Methods for Classification
Thus if G has K classes, there will be K such indicators Y_k, k = 1, ..., K, with Y_k = 1 if G = k, else 0. These are collected together in a vector Y = (Y_1, ..., Y_K).
Classify according to Ĝ(x) = argmax_k f̂_k(x).
Because of the rigid nature of the regression model, classes may be masked by others, even though they are perfectly separated.
A loose but general rule is that if K ≥ 3 classes are lined up, polynomial terms up to degree K − 1 might be needed to resolve them.
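Indicator-matrix regression followed by an argmax is easy to sketch; with three classes deliberately lined up (illustrative data), the middle class may be masked exactly as described above:

```python
import numpy as np

rng = np.random.default_rng(7)
N = 90
# Three Gaussian classes lined up along a diagonal
means = np.array([[0, 0], [3, 3], [6, 6]])
X = np.vstack([rng.normal(m, 0.5, size=(N // 3, 2)) for m in means])
g = np.repeat([0, 1, 2], N // 3)

# Indicator response matrix Y: Y[i, k] = 1 if g[i] == k
Y = np.eye(3)[g]

# Linear regression of each indicator on (1, X)
X1 = np.hstack([np.ones((N, 1)), X])
B = np.linalg.lstsq(X1, Y, rcond=None)[0]  # (p+1) x K coefficient matrix
G_hat = np.argmax(X1 @ B, axis=1)          # classify by the largest fitted value

# Overall training accuracy; the middle class tends to be misclassified (masking)
print((G_hat == g).mean())
```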
Linear Methods for Classification
Suppose f_k(x) is the class-conditional density of X in class G = k, and π_k is the prior probability of class k, with Σ_k π_k = 1. Bayes' theorem gives
P(G = k | X = x) = f_k(x) π_k / Σ_{l=1}^K f_l(x) π_l.
Suppose each class density is multivariate Gaussian with a common covariance matrix Σ:
f_k(x) = (2π)^{−p/2} |Σ|^{−1/2} exp(−½ (x − μ_k)^T Σ^{-1} (x − μ_k)).
Comparing classes k and l,
log[ P(G = k | X = x) / P(G = l | X = x) ] = log(π_k/π_l) − ½(μ_k + μ_l)^T Σ^{-1}(μ_k − μ_l) + x^T Σ^{-1}(μ_k − μ_l),
which is linear in x for each pair of classes.
Linear Methods for Classification
The linear discriminant function:
δ_k(x) = x^T Σ^{-1} μ_k − ½ μ_k^T Σ^{-1} μ_k + log π_k.
Here the cross terms combine, as Σ^{-1} is symmetric and x^T Σ^{-1} μ_k is a number whose transpose is still the same number; the quadratic term −½ x^T Σ^{-1} x is ignored, as it is a constant shared by all classes.
In practice, we estimate the parameters using training data:
π̂_k = N_k / N, where N_k is the number of class-k observations;
μ̂_k = Σ_{g_i = k} x_i / N_k;
Σ̂ = Σ_{k=1}^K Σ_{g_i = k} (x_i − μ̂_k)(x_i − μ̂_k)^T / (N − K).
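The estimates above translate directly into code; a minimal LDA sketch with illustrative two-class data:

```python
import numpy as np

def lda_fit(X, g, K):
    """Estimate LDA parameters: priors, class means, pooled covariance."""
    N, p = X.shape
    pi = np.array([(g == k).mean() for k in range(K)])
    mu = np.array([X[g == k].mean(axis=0) for k in range(K)])
    Sigma = np.zeros((p, p))
    for k in range(K):
        d = X[g == k] - mu[k]
        Sigma += d.T @ d
    Sigma /= (N - K)
    return pi, mu, Sigma

def lda_predict(X, pi, mu, Sigma):
    """delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log pi_k"""
    Sinv_mu = np.linalg.solve(Sigma, mu.T)  # p x K
    delta = X @ Sinv_mu - 0.5 * np.sum(mu.T * Sinv_mu, axis=0) + np.log(pi)
    return np.argmax(delta, axis=1)

rng = np.random.default_rng(8)
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)), rng.normal([4, 4], 1.0, (50, 2))])
g = np.repeat([0, 1], 50)
pi, mu, Sigma = lda_fit(X, g, K=2)
acc = (lda_predict(X, pi, mu, Sigma) == g).mean()
print(acc)  # high training accuracy on well-separated classes
```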
Linear Methods for Classification
In quadratic discriminant analysis (QDA) we assume that the Σ_k are not equal. The quadratic discriminant function:
δ_k(x) = −½ log|Σ_k| − ½ (x − μ_k)^T Σ_k^{-1} (x − μ_k) + log π_k.
LDA in the enlarged quadratic polynomial space is quite similar to QDA.
Regularized discriminant analysis (RDA): Σ̂_k(α) = α Σ̂_k + (1 − α) Σ̂, α ∈ [0, 1]. In practice α can be chosen based on the performance of the model on validation data, or by cross-validation.
Computations can be simplified by the eigen decomposition Σ̂_k = U_k D_k U_k^T, where U_k is a p×p orthonormal matrix and D_k is a diagonal matrix of positive eigenvalues d_kl. Then
(x − μ̂_k)^T Σ̂_k^{-1} (x − μ̂_k) = [U_k^T (x − μ̂_k)]^T D_k^{-1} [U_k^T (x − μ̂_k)],
log|Σ̂_k| = Σ_l log d_kl.
Linear Methods for Classification
Find the linear combination Z = a^T X such that the between-class variance is maximized relative to the within-class variance:
max_a (a^T B a) / (a^T W a),
where B is the between-classes scatter matrix and W is the within-classes scatter matrix.
Equivalently: max a^T B a subject to a^T W a = 1. The Lagrangian L = a^T B a − λ(a^T W a − 1) gives Ba = λWa, so a is the eigenvector of W^{-1}B corresponding to the largest eigenvalue.
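The generalized eigenproblem Ba = λWa can be solved numerically; a two-class sketch with illustrative data (for two classes the solution is proportional to W^{-1}(m_1 − m_0)):

```python
import numpy as np

rng = np.random.default_rng(9)
# Two classes separated along x, with larger within-class spread along y
X0 = rng.normal([0, 0], [1.0, 3.0], (100, 2))
X1 = rng.normal([3, 0], [1.0, 3.0], (100, 2))

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
# Within-class scatter W and between-class scatter B
W = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
d = (m1 - m0).reshape(-1, 1)
B = d @ d.T

# a is the top eigenvector of W^{-1} B
vals, vecs = np.linalg.eig(np.linalg.inv(W) @ B)
a = np.real(vecs[:, np.argmax(np.real(vals))])
a = a / np.linalg.norm(a)
print(a)  # mostly aligned with the direction separating the class means
```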
Logistic Regression
The model specifies the K − 1 log-odds
log[ P(G = k | X = x) / P(G = K | X = x) ] = β_{k0} + β_k^T x, k = 1, ..., K − 1.
In the two-class case, fit by maximum likelihood via a 0/1 response y_i, where y_i = 1 when g_i = 1, and y_i = 0 when g_i = 2. Writing p(x; β) = P(G = 1 | X = x), the log-likelihood is
l(β) = Σ_{i=1}^N { y_i log p(x_i; β) + (1 − y_i) log(1 − p(x_i; β)) } = Σ_{i=1}^N { y_i β^T x_i − log(1 + e^{β^T x_i}) }.
Setting the derivatives to zero gives the score equations
∂l/∂β = Σ_{i=1}^N x_i (y_i − p(x_i; β)) = 0,
which we solve with the Newton-Raphson algorithm.
Logistic Regression
In matrix notation: let y denote the vector of y_i values, X the N×(p+1) matrix of x_i values, p the vector of fitted probabilities with ith element p(x_i; β^old), and W the N×N diagonal matrix of weights with ith diagonal element p(x_i; β^old)(1 − p(x_i; β^old)). Then the Newton step is
β^new = β^old + (X^T W X)^{-1} X^T (y − p) = (X^T W X)^{-1} X^T W z,
with adjusted response z = Xβ^old + W^{-1}(y − p).
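The Newton step above is iteratively reweighted least squares; a minimal sketch on illustrative simulated data:

```python
import numpy as np

def logistic_irls(X, y, n_iter=25):
    """Newton-Raphson (IRLS) for two-class logistic regression.
    X should include a leading column of 1s; y is a 0/1 vector."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)  # diagonal of the weight matrix W
        # beta_new = beta + (X^T W X)^{-1} X^T (y - p)
        beta = beta + np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - p))
    return beta

rng = np.random.default_rng(10)
N = 200
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])
true_beta = np.array([-0.5, 2.0, -1.0])
y = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

beta_hat = logistic_irls(X, y)
print(beta_hat)  # roughly recovers true_beta
```

At convergence the score equations X^T(y − p) = 0 hold, mirroring the derivation on the previous slide.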
LDA or Logistic Regression
LDA: log[ P(G = k | X = x) / P(G = K | X = x) ] = log(π_k/π_K) − ½(μ_k + μ_K)^T Σ^{-1}(μ_k − μ_K) + x^T Σ^{-1}(μ_k − μ_K) = α_{k0} + α_k^T x.
Logistic regression: log[ P(G = k | X = x) / P(G = K | X = x) ] = β_{k0} + β_k^T x.
Although they have exactly the same form, the difference lies in the way the linear coefficients are estimated. The logistic regression model is more general, in that it makes fewer assumptions.
Logistic regression fits the parameters by maximizing the conditional likelihood, the multinomial likelihood with probabilities P(G = k | X), where P(X) is ignored.
LDA fits the parameters by maximizing the full log-likelihood, based on the joint density P(X, G = k) = φ(X; μ_k, Σ) π_k, where P(X) does play a role, as P(X) = Σ_k P(X, G = k).
If we assume the class densities are Gaussian, we can find a more efficient way to estimate the parameters.
It is generally felt that logistic regression is a safer, more robust bet than the LDA model, relying on fewer assumptions. It is our experience that the models give very similar results, even when LDA is used inappropriately.
Any Questions?