http://stats.stackexchange.com/questions/72251/an-example-lasso-regress...
I am trying to use glmnet with lasso where my outcome of interest is dichotomous. I have created a small bogus data set.
The goal of this example is to use lasso to create a model predicting child asthma status from the list of 6 potential predictor variables
(age, gender, bmi_p, m_edu, p_edu, and f_color). Obviously the sample size is an issue here, but I am hoping to gain more insight into how to
handle the different types of variables (i.e. continuous, ordinal, nominal, and binary) within the glmnet framework when the outcome is binary (1 =
asthma; 0 = no asthma).
As such, would anyone be willing to provide a sample R script, along with explanations, for this silly example using lasso with the above data to
predict asthma status? Although very basic, I know I, and likely many others on CV, would greatly appreciate this! Thanks!
r, self-study, lasso
Matt Reichenbach
You might get more luck if you posted the data as a dput of an actual R object; don't make readers put
frosting on top as well as bake you a cake! If you generate the appropriate data frame in R, say foo,
then edit into the question the output of dput(foo). Gavin Simpson Oct 8 '13 at 17:17
Thanks @GavinSimpson! I updated the post with a data frame so hopefully I get to eat some cake without
frosting! :) Matt Reichenbach Oct 8 '13 at 17:47
By using BMI percentile you are in a sense defying the laws of physics. Obesity affects individuals
according to physical measurements (lengths, volumes, weight) not according to how many individuals are
similar to the current subject, which is what percentiling is doing. Frank Harrell Oct 8 '13 at 19:43
I agree, BMI percentile is not a metric that I prefer to use; however, CDC guidelines recommend using
BMI percentile over BMI (also a highly questionable metric!) for children and adolescents less than 20
years old, as it takes into account age and gender in addition to height and weight. All of these variables
and data values were thought up entirely for this example; it does not reflect any of my current work, as I
work with big data. I just wanted to see an example of glmnet in action with a binary
outcome. Matt Reichenbach Oct 8 '13 at 19:58
Plug here for a package by Patrick Breheny called ncvreg, which fits linear and logistic regression models
penalized by MCP, SCAD, or LASSO. (cran.r-project.org/web/packages/ncvreg/index.html) bdeonovic
Oct 8 '13 at 21:12
2 Answers
3/6/2016 8:54 AM
library(glmnet)

age     <- c(4,8,7,12,6,9,10,14,7)
gender  <- as.factor(c(1,0,1,1,1,0,1,0,0))
bmi_p   <- c(0.86,0.45,0.99,0.84,0.85,0.67,0.91,0.29,0.88)
m_edu   <- as.factor(c(0,1,1,2,2,3,2,0,1))
p_edu   <- as.factor(c(0,2,2,2,2,3,2,0,0))
f_color <- as.factor(c("blue", "blue", "yellow", "red", "red", "yellow", "yellow", "red", "yellow"))
asthma  <- c(1,1,0,1,0,0,0,1,1)

# dummy-code the factors, dropping the intercept column
xfactors <- model.matrix(asthma ~ gender + m_edu + p_edu + f_color)[, -1]
x <- as.matrix(data.frame(age, bmi_p, xfactors))

# note: alpha = 1 gives the lasso penalty only; it can be blended with the
# ridge penalty down to alpha = 0 (ridge only)
glmmod <- glmnet(x, y = as.factor(asthma), alpha = 1, family = "binomial")

# plot variable coefficients vs. shrinkage parameter lambda
plot(glmmod, xvar = "lambda")
grid()
Categorical variables are usually first transformed into factors, then a dummy-variable matrix
of predictors is created and, along with the continuous predictors, passed to the model. Keep
in mind that glmnet uses both ridge and lasso penalties, but can be set to either alone.
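As a minimal illustration of that dummy-variable expansion (a toy factor, not the asthma data):

```r
# a toy three-level factor; model.matrix() expands it into 0/1 dummy columns,
# dropping the first (alphabetical) level, which becomes the reference class
f <- factor(c("blue", "red", "yellow", "red"))
mm <- model.matrix(~ f)
colnames(mm)  # "(Intercept)" "fred" "fyellow"
```

Dropping the reference column (as with [,-1] above) avoids perfect collinearity with the intercept.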
# some results
# model shown for lambda up to the first 3 selected variables; lambda can be
# given a manual tuning grid for a wider range
> glmmod

Call:

      Df    %Dev   Lambda
 [1,]  0 0.00000 0.273300
 [2,]  1 0.01955 0.260900
 [3,]  1 0.03737 0.249000
 [4,]  1 0.05362 0.237700
 [5,]  1 0.06847 0.226900
 [6,]  1 0.08204 0.216600
 [7,]  1 0.09445 0.206700
 [8,]  1 0.10580 0.197300
 [9,]  1 0.11620 0.188400
[10,]  3 0.13120 0.179800
[11,]  3 0.15390 0.171600
# coefficients can be extracted from glmmod; here shown with 3 variables selected
> coef(glmmod)[,10]
  (Intercept)           age         bmi_p       gender1         m_edu1
   0.59445647    0.00000000    0.00000000   -0.01893607    0.00000000
       m_edu2        m_edu3        p_edu2        p_edu3     f_colorred
   0.00000000    0.00000000   -0.01882883    0.00000000    0.00000000
f_coloryellow
  -0.77207831
# cross-validation selects lambda (assuming the x and asthma objects above);
# this call defines the cv.glmmod object that is plotted and queried below
cv.glmmod <- cv.glmnet(x, y = asthma, alpha = 1)
plot(cv.glmmod)
best_lambda <- cv.glmmod$lambda.min

> best_lambda
[1] 0.2732972
pat
this is exactly what I was looking for, +1. The only questions I have are: 1) what can you do with the cross-
validation lambda of 0.2732972? and 2) from glmmod, are the selected variables favorite color
(yellow), gender, and father's education (bachelor's degree)? Thanks so much! Matt Reichenbach Oct
9 '13 at 14:21
1) Cross-validation is used to choose lambda and the coefficients (at minimum error). In this mockup there is no
local minimum (there was also a warning related to too few observations); I would interpret that as all coefficients
being shrunk to zero by the shrinkage penalty (the best model has only an intercept), and re-run with more (real)
observations and perhaps a wider lambda range. 2) Yes, in the example I chose coef(glmmod)
[,10]; you choose lambda for the model via CV or interpretation of the results. Could you mark this as solved if
you feel I answered your question? Thanks. pat Oct 9 '13 at 22:09
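To make question 1 concrete, the coefficients at any chosen penalty can be read off the fitted path; a minimal sketch, rebuilding the x and asthma objects from the answer above (the s value here is illustrative, not the CV result):

```r
library(glmnet)

# rebuild the toy design from the answer above
age     <- c(4,8,7,12,6,9,10,14,7)
bmi_p   <- c(0.86,0.45,0.99,0.84,0.85,0.67,0.91,0.29,0.88)
gender  <- as.factor(c(1,0,1,1,1,0,1,0,0))
m_edu   <- as.factor(c(0,1,1,2,2,3,2,0,1))
p_edu   <- as.factor(c(0,2,2,2,2,3,2,0,0))
f_color <- as.factor(c("blue","blue","yellow","red","red","yellow","yellow","red","yellow"))
asthma  <- c(1,1,0,1,0,0,0,1,1)
xfactors <- model.matrix(asthma ~ gender + m_edu + p_edu + f_color)[, -1]
x <- as.matrix(data.frame(age, bmi_p, xfactors))

glmmod <- glmnet(x, y = as.factor(asthma), alpha = 1, family = "binomial")

# coef() interpolates the fitted path at any penalty s; in practice s would be
# cv.glmmod$lambda.min from cv.glmnet()
cf <- coef(glmmod, s = 0.18)
rownames(cf)[as.vector(cf) != 0]  # variables still in the model at this penalty
```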
Can I ask how this handles the f_color variable? Is a factor-level step from 1 to 4 considered larger than 1 to
2, or are these all equally weighted, non-directional, and categorical? (I want to apply it to an analysis with
all unordered predictors.) beroe May 12 '14 at 20:41
The line xfactors <- model.matrix(asthma ~ gender + m_edu + p_edu + f_color)[,-1] codes the
categorical variable f_color (as declared by as.factor in the previous lines). It uses the default R
dummy-variable coding unless the contrasts.arg argument is supplied. This means all the levels of
f_color are equally weighted and non-directional, except for the first, which is used as the reference
class and absorbed into the intercept. Alex Oct 27 '15 at 5:16
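The treatment coding described above can be inspected directly; a small sketch using the f_color levels from the question:

```r
# contrasts() shows the dummy coding R will use: one column per non-reference
# level, with the first (alphabetical) level, blue, as the all-zero reference row
f_color <- factor(c("blue","blue","yellow","red","red","yellow","yellow","red","yellow"))
contrasts(f_color)
#        red yellow
# blue     0      0
# red      1      0
# yellow   0      1
```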
I will use the elasticnet package since that is my preferred method. It is a little more flexible.
install.packages("elasticnet")
library(elasticnet)

age     <- c(4,8,7,12,6,9,10,14,7)
gender  <- c(1,0,1,1,1,0,1,0,0)
bmi_p   <- c(0.86,0.45,0.99,0.84,0.85,0.67,0.91,0.29,0.88)
m_edu   <- c(0,1,1,2,2,3,2,0,1)
p_edu   <- c(0,2,2,2,2,3,2,0,0)
# f_color coded numerically here rather than as a factor:
# f_color <- c("blue", "blue", "yellow", "red", "red", "yellow", "yellow", "red", "yellow")
f_color <- c(0, 0, 1, 2, 2, 1, 1, 2, 1)
asthma  <- c(1,1,0,1,0,0,0,1,1)

pred <- cbind(age, gender, bmi_p, m_edu, p_edu, f_color)

# fit the elastic-net path; lambda is the ridge penalty (lambda = 0 gives the lasso)
fit <- enet(x = pred, y = asthma, lambda = 0)
plot(fit)
bdeonovic
thanks for sharing elasticnet; however, I do not know what to make of the output from the above R
script. Can you please clarify? Thanks in advance! Matt Reichenbach Oct 9 '13 at 14:29