
Using LASSO from lars (or glmnet) package in R for variable selection
Sorry if this question comes across as a little basic.

I am looking to use LASSO variable selection for a multiple linear regression model in R. I have 15 predictors, one of which is categorical (will that cause a problem?). After setting my x and y I use the following commands:

library(lars)
model <- lars(x, y)   # fit the whole LASSO path
coef(model)           # coefficients at every step of the path

My problem is when I use coef(model). This returns a matrix with 15 rows, with one extra predictor added each time, but there is no indication of which model to choose. Have I missed something? Is there a way I can get the lars package to return just one "best" model?
There are other posts suggesting using glmnet instead, but this seems more complicated. An attempt is as follows, using the same x and y. Have I missed something here?

library(glmnet)
cv <- cv.glmnet(x, y)                                                        # cross-validate over a grid of lambda values
model <- glmnet(x, y, type.gaussian = "covariance", lambda = cv$lambda.min)  # refit at the CV-chosen lambda
predict(model, type = "coefficients")                                        # coefficients at that lambda

The final command returns a list of my variables, the majority with a coefficient, although some are = 0. Is this the correct choice of the "best" model selected by LASSO? If I then fit a linear model with all the variables whose coefficients are not 0, I get very similar, but slightly different, coefficient estimates. Is there a reason for this difference? Would it be acceptable to refit the linear model with these variables chosen by LASSO and take that as my final model? Otherwise I cannot see any p-values for significance. Have I missed anything?
Does type.gaussian="covariance" ensure that glmnet uses multiple linear regression?

Does the automatic normalisation of the variables affect the coefficients at all? Is there any way to include interaction terms in a LASSO
procedure?
I am looking to use this procedure more as a demonstration of how LASSO can be used than to build a model for any important inference/prediction, if that changes anything.
Thank you for taking the time to read this. Any general comments on LASSO/lars/glmnet would also be greatly appreciated.
feature-selection · lasso · glmnet · lars

edited May 9 '13 at 10:53 · asked May 8 '13 at 23:57 · James
As a side comment, if you want to interpret the result, be sure to demonstrate that the set of variables selected by lasso is stable. This can be done using Monte Carlo simulation or by bootstrapping your own dataset. Frank Harrell Sep 15 '13 at 8:43
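A minimal sketch of the stability check suggested in that comment, assuming the x matrix and y vector from the question (the number of resamples and the lambda.1se rule are illustrative choices, not part of the original comment):

library(glmnet)

set.seed(1)
B <- 200                                     # number of bootstrap resamples
sel_count <- numeric(ncol(x))                # how often each predictor is selected
for (b in seq_len(B)) {
  idx <- sample(nrow(x), replace = TRUE)     # resample rows with replacement
  fit <- cv.glmnet(x[idx, ], y[idx])
  cf  <- as.matrix(coef(fit, s = "lambda.1se"))
  sel_count <- sel_count + (cf[-1, 1] != 0)  # drop the intercept, count nonzero coefficients
}
round(sel_count / B, 2)                      # selection frequency per predictor

Predictors that appear in nearly every resample are the ones whose selection you can reasonably interpret.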

5 Answers

Using glmnet is really easy once you get the hang of it, thanks to its excellent vignette at http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html (you can also check the CRAN package page). As for the best lambda for glmnet, the rule of thumb is to use
cvfit <- glmnet::cv.glmnet(x, y)
coef(cvfit, s = "lambda.1se")   # largest lambda whose CV error is within one SE of the minimum

instead of lambda.min.

To do the same for lars you have to do it by hand. Here is my solution:

cv <- lars::cv.lars(x, y, plot.it = FALSE, mode = "step")
idx <- which.max(cv$cv - cv$cv.error <= min(cv$cv))   # first step whose CV error is within one SE of the minimum
coef(lars::lars(x, y))[idx, ]                          # coefficients at that step

Bear in mind that this is not exactly the same, because it stops at a lasso knot (where a variable enters) rather than at an arbitrary point.

Please note that glmnet is the preferred package now; it is more actively maintained than lars, and questions about glmnet vs lars have been answered here before (the algorithms used differ).

As for your question of using lasso to choose variables and then fitting OLS, it is an ongoing debate. Google for "OLS post Lasso" and there are some papers discussing the topic; even the authors of Elements of Statistical Learning admit it is possible. A sketch of such a refit follows below.
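For illustration, a minimal sketch of an OLS-post-lasso refit (a sketch only, assuming the x matrix and y vector from the question and the lambda.1se rule recommended above):

library(glmnet)

cvfit <- cv.glmnet(x, y)
cf    <- as.matrix(coef(cvfit, s = "lambda.1se"))
keep  <- setdiff(rownames(cf)[cf[, 1] != 0], "(Intercept)")    # variables selected by the lasso

refit <- lm(y ~ ., data = data.frame(y = y, x[, keep, drop = FALSE]))
summary(refit)                                                 # OLS estimates for the selected variables

Bear in mind that the p-values from such a refit ignore the selection step, which is a large part of why the practice is debated.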
Edit: here is the code to reproduce more accurately what glmnet does, using lars:
cv <- lars::cv.lars(x, y, plot.it = FALSE)   # default mode: CV over the L1-norm fraction
ideal_l1_ratio <- cv$index[which.max(cv$cv - cv$cv.error <= min(cv$cv))]   # smallest fraction within one SE of the minimum
obj <- lars::lars(x, y)
scaled_coefs <- scale(obj$beta, FALSE, 1 / obj$normx)   # coefficients on lars's internal standardized scale
l1 <- apply(X = scaled_coefs, MARGIN = 1, FUN = function(x) sum(abs(x)))   # L1 norm at each step
coef(obj)[which.max(l1 / tail(l1, 1) > ideal_l1_ratio), ]   # first step whose L1 fraction exceeds the target
edited Oct 28 '14 at 1:31 · answered Oct 27 '14 at 5:34 · Juancentro

Perhaps the comparison with forward stepwise selection will help (see the following link to a site by one of the authors: http://www-stat.stanford.edu/~tibs/lasso/simple.html). This is the approach used in Chapter 3.4.4 of The Elements of Statistical Learning (available online for free). I thought that Chapter 3.6 in that book helped to understand the relationship between least squares, best subset, and lasso (plus a couple of other procedures). I also find it helpful to take the transpose of the coefficient matrix, t(coef(model)), and write it to CSV, so that I can open it in Excel along with a copy of plot(model) on the side (see the sketch below). You might want to sort by the last column, which contains the least squares estimates. Then you can see clearly how each variable gets added at each piecewise step and how the coefficients change as a result. Of course this is not the whole story, but hopefully it will be a start.
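A minimal sketch of that export step (assuming the lars fit called model from the question; the file name is illustrative):

library(lars)

# model <- lars(x, y)                               # the fit from the question
write.csv(t(coef(model)), "lars_coefficients.csv")  # rows = predictors, columns = lasso steps
plot(model)                                         # coefficient paths to view alongside the spreadsheet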
answered May 17 '13 at 17:31 · Joel Cadwell

I'm returning to this question from a while ago since I think I've found the correct solution. Here's a reproducible example using the mtcars dataset:
library(glmnet)
`%ni%` <- Negate(`%in%`)                          # "not in" helper
data(mtcars)
x <- model.matrix(mpg ~ ., data = mtcars)[, -1]   # predictor matrix, intercept column dropped
glmnet1 <- cv.glmnet(x = x, y = mtcars$mpg, type.measure = "mse", nfolds = 5, alpha = .5)
c <- coef(glmnet1, s = "lambda.min", exact = TRUE)
inds <- which(c != 0)
variables <- row.names(c)[inds]
variables <- variables[variables %ni% "(Intercept)"]

variables now holds the list of variables selected for the best solution.
edited Jun 30 '15 at 18:47 · answered Apr 1 '14 at 19:10 · Jason

I was looking at the code and I found that "testing" was not defined yet, so the line "final.list<-testing[-removed] #removing variables" gives the error "object not found". Looking at the code, I suppose that instead of "testing" it should use "cp.list", so that the code becomes: final.list<-cp.list[-removed] #removing variables; final.list<-c(final.list,duplicates) #adding in those vars which were both removed then added later. Let me know if this is correct. Kind regards user55101 Sep 2 '14 at 14:27


`%ni%`<-Negate(`%ni%`) looks wrong, while `%ni%`<-Negate(`%in%`) looks right. I think the stackexchange formatter messed it up... Chris Jun 30 '15 at 4:17

LARS solves the ENTIRE solution path. The solution path is piecewise linear -- there are a
finite number of "notch" points (i.e., values of the regularization parameter) at which the
solution changes.
So the matrix of solutions you're getting is all the possible solutions. In the list that it returns, it
should also give you the values of the regularization parameter.
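For illustration, a minimal sketch of where those values live in a lars fit (assuming the x and y from the question; choosing a step by Mallows' Cp is just one common option, not part of the original answer):

library(lars)

model <- lars(x, y)                  # same fit as in the question
model$lambda                         # regularization parameter at each knot of the path
model$Cp                             # Mallows' Cp at each step
coef(model)[which.min(model$Cp), ]   # coefficients at the Cp-minimising step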
answered May 9 '13 at 10:19 · Adam

Thank you for your answer. Is there a way to display the values of the regularisation parameter? Additionally, is there a way to then choose between the solutions based on this parameter? (Also, is the parameter lambda?) James May 9 '13 at 10:31

lars and glmnet operate on raw matrices. To include interaction terms, you will have to construct the matrices yourself. That means one column per interaction (which is per level, per factor, if you have factors). Look into lm() to see how it does it (warning: there be dragons).

To make an interaction term manually, you could (but maybe shouldn't, because it's slow) do:

int <- D["x1"] * D["x2"]   # elementwise product of the two predictor columns
names(int) <- c("x1*x2")
D <- cbind(D, int)

Then to use this in lars (assuming you have a y kicking around):

lars(as.matrix(D), as.matrix(y))

I wish I could help you more with the other questions. I found this one because lars is giving me
grief and the documentation in it and on the web is very thin.
edited Mar 24 '14 at 17:00 · answered Mar 24 '14 at 12:21 · kousu

"Warning: there be dragons" This is pretty easy with model.matrix() . Gregor Nov 25 '14 at 23:52
