An example: Lasso regression using glmnet for binary outcome


I am having some confusion and difficulty using glmnet with the lasso when my outcome of interest is dichotomous. I have created the small bogus data frame below:

age <- c(4,8,7,12,6,9,10,14,7)
gender <- c(1,0,1,1,1,0,1,0,0)
bmi_p <- c(0.86,0.45,0.99,0.84,0.85,0.67,0.91,0.29,0.88)
m_edu <- c(0,1,1,2,2,3,2,0,1)
p_edu <- c(0,2,2,2,2,3,2,0,0)
f_color <- c("blue", "blue", "yellow", "red", "red", "yellow", "yellow", "red", "yellow")
asthma <- c(1,1,0,1,0,0,0,1,1)
df <- data.frame(age, gender, bmi_p, m_edu, p_edu, f_color, asthma)  # df is a data frame for further use!

The columns (variables) in the above dataset are as follows:


age (age of child in years) - continuous
gender - binary (1 = male; 0 = female)
bmi_p (BMI percentile) - continuous
m_edu (mother's highest education level) - ordinal (0 = less than high school; 1 = high school diploma; 2 = bachelor's degree; 3 = post-baccalaureate degree)
p_edu (father's highest education level) - ordinal (same as m_edu)
f_color (favorite primary color) - nominal ("blue", "red", or "yellow")
asthma (child asthma status) - binary (1 = asthma; 0 = no asthma)

The goal of this example is to use the lasso to build a model predicting child asthma status from the list of 6 potential predictor variables (age, gender, bmi_p, m_edu, p_edu, and f_color). Obviously the sample size is an issue here, but I am hoping to gain more insight into how to handle the different types of variables (i.e. continuous, ordinal, nominal, and binary) within the glmnet framework when the outcome is binary (1 = asthma; 0 = no asthma).
As such, would anyone be willing to provide a sample R script, along with explanations, for this silly example using the lasso with the above data to predict asthma status? Although very basic, I know that I, and likely many others on CV, would greatly appreciate this! Thanks!
Tags: r, self-study, lasso

asked Oct 8 '13 at 15:56 by Matt Reichenbach; edited Oct 8 '13 at 17:46

You might get more luck if you posted the data as a dput of an actual R object; don't make readers put frosting on top as well as bake you a cake! If you generate the appropriate data frame in R, say foo, then edit into the question the output of dput(foo). Gavin Simpson Oct 8 '13 at 17:17

Thanks @GavinSimpson! I updated the post with a data frame, so hopefully I get to eat some cake without frosting! :) Matt Reichenbach Oct 8 '13 at 17:47

By using BMI percentile you are in a sense defying the laws of physics. Obesity affects individuals
according to physical measurements (lengths, volumes, weight) not according to how many individuals are
similar to the current subject, which is what percentiling is doing. Frank Harrell Oct 8 '13 at 19:43

I agree, BMI percentile is not a metric that I prefer to use; however, CDC guidelines recommend using BMI percentile over BMI (also a highly questionable metric!) for children and adolescents less than 20 years old, as it takes into account age and gender in addition to height and weight. All of these variables and data values were thought up entirely for this example. This example does not reflect any of my current work, as I work with big data. I just wanted to see an example of glmnet in action with a binary outcome. Matt Reichenbach Oct 8 '13 at 19:58
Plug here for a package by Patrick Breheny called ncvreg which fits linear and logistic regression models
penalized by MCP, SCAD, or LASSO. (cran.r-project.org/web/packages/ncvreg/index.html) bdeonovic
Oct 8 '13 at 21:12
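As a hedged aside (mine, not from the comment), a basic ncvreg call on this toy data might look like the sketch below; for simplicity it treats the ordinal education variables as numeric and omits f_color, which would need dummy coding as in the glmnet answer below:

# sketch only: MCP-penalized logistic regression with ncvreg on the toy data
library(ncvreg)
X <- cbind(age    = c(4,8,7,12,6,9,10,14,7),
           gender = c(1,0,1,1,1,0,1,0,0),
           bmi_p  = c(0.86,0.45,0.99,0.84,0.85,0.67,0.91,0.29,0.88),
           m_edu  = c(0,1,1,2,2,3,2,0,1),
           p_edu  = c(0,2,2,2,2,3,2,0,0))
asthma <- c(1,1,0,1,0,0,0,1,1)
fit <- ncvreg(X, asthma, family = "binomial", penalty = "MCP")  # penalty = "SCAD" or "lasso" also available
plot(fit)  # coefficient paths over lambda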

2 Answers

library(glmnet)

age <- c(4,8,7,12,6,9,10,14,7)
gender <- as.factor(c(1,0,1,1,1,0,1,0,0))
bmi_p <- c(0.86,0.45,0.99,0.84,0.85,0.67,0.91,0.29,0.88)
m_edu <- as.factor(c(0,1,1,2,2,3,2,0,1))
p_edu <- as.factor(c(0,2,2,2,2,3,2,0,0))
f_color <- as.factor(c("blue", "blue", "yellow", "red", "red", "yellow", "yellow", "red", "yellow"))
asthma <- c(1,1,0,1,0,0,0,1,1)

xfactors <- model.matrix(asthma ~ gender + m_edu + p_edu + f_color)[, -1]
x <- as.matrix(data.frame(age, bmi_p, xfactors))

# note: alpha = 1 is the lasso penalty only; it can be blended with the ridge penalty down to alpha = 0 (ridge only)
glmmod <- glmnet(x, y = as.factor(asthma), alpha = 1, family = "binomial")

# plot variable coefficients vs. shrinkage parameter lambda
plot(glmmod, xvar = "lambda")
grid()

Categorical variables are usually first transformed into factors, then a dummy-variable matrix of predictors is created from them with model.matrix and passed to the model along with the continuous predictors. Keep in mind that glmnet uses the elastic net penalty, a blend of the ridge and lasso penalties controlled by alpha, and can be set to either one alone.
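As a brief aside (a sketch of mine, not part of the original answer; the answer's own results continue below), the same design matrix can be refit at other values of the alpha mixing parameter: alpha = 0 gives pure ridge and intermediate values give an elastic-net blend.

# sketch only: reuses x and asthma built in the code above
ridgemod <- glmnet(x, y = as.factor(asthma), alpha = 0, family = "binomial")    # ridge only
enetmod  <- glmnet(x, y = as.factor(asthma), alpha = 0.5, family = "binomial")  # 50/50 ridge/lasso blend
plot(enetmod, xvar = "lambda")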
# some results
# model shown for lambda down to the point where the first 3 variables are selected;
# lambda can be given a manual tuning grid for a wider range
> glmmod

Call:  glmnet(x = x, y = as.factor(asthma), family = "binomial", alpha = 1)

      Df    %Dev   Lambda
 [1,]  0 0.00000 0.273300
 [2,]  1 0.01955 0.260900
 [3,]  1 0.03737 0.249000
 [4,]  1 0.05362 0.237700
 [5,]  1 0.06847 0.226900
 [6,]  1 0.08204 0.216600
 [7,]  1 0.09445 0.206700
 [8,]  1 0.10580 0.197300
 [9,]  1 0.11620 0.188400
[10,]  3 0.13120 0.179800
[11,]  3 0.15390 0.171600
# coefficients can be extracted from glmmod; shown here at the lambda where 3 variables are selected
> coef(glmmod)[,10]
  (Intercept)           age         bmi_p       gender1        m_edu1        m_edu2
   0.59445647    0.00000000    0.00000000   -0.01893607    0.00000000    0.00000000
       m_edu3        p_edu2        p_edu3    f_colorred f_coloryellow
   0.00000000   -0.01882883    0.00000000    0.00000000   -0.77207831

Lastly, cross validation can also be used to select lambda.


cv.glmmod <- cv.glmnet(x, y = asthma, alpha = 1)


plot(cv.glmmod)
best_lambda <- cv.glmmod$lambda.min
> best_lambda
[1] 0.2732972
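As a brief follow-up sketch (mine, not part of the original answer), the CV-chosen lambda can be fed back into coef() and predict(); note that for a binary outcome the cross-validation itself would usually also be run with family = "binomial".

# sketch only: reuses x and asthma from the answer above; nfolds lowered because the toy data has only 9 rows
cv.binom <- cv.glmnet(x, y = asthma, family = "binomial", alpha = 1, nfolds = 3)
coef(cv.binom, s = "lambda.min")                                  # coefficients at the CV-chosen lambda
predict(cv.binom, newx = x, s = "lambda.min", type = "response")  # fitted asthma probabilities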

answered Oct 8 '13 at 20:21 by pat; edited Oct 8 '13 at 23:02

this is exactly what I was looking for +1, the only questions I have are 1) what can you do with the cross
validation lambda of 0.2732972? and 2) From the glmmod, are the selected variables favorite color
(yellow), gender, and father's education (bachelor's degree)? Thanks so much! Matt Reichenbach Oct
9 '13 at 14:21
1) Cross validation is used to choose lambda and the corresponding coefficients (at minimum error). In this mockup there is no local minimum (there was also a warning related to too few observations); I would interpret that as all coefficients being shrunk to zero by the shrinkage penalty (the best model has only an intercept), and would re-run with more (real) observations and maybe an increased lambda range. 2) Yes, in the example I chose coef(glmmod)[,10]; you choose lambda for the model via CV or by interpretation of the results. Could you mark this as solved if you feel I answered your question? thanks. pat Oct 9 '13 at 22:09

Can I ask how this handles the f_color variable? Is a step from factor level 1 to 4 considered larger than a step from 1 to 2, or are the levels all equally weighted, non-directional, and categorical? (I want to apply it to an analysis with all unordered predictors.) beroe May 12 '14 at 20:41

The line xfactors <- model.matrix(asthma ~ gender + m_edu + p_edu + f_color)[,-1] codes the categorical variable f_color (as declared by as.factor in the previous lines). It uses the default R dummy-variable coding unless the contrasts.arg argument is supplied. This means all the levels of f_color are equally weighted and non-directional, except for the first one, which is used as the reference class and absorbed into the intercept. Alex Oct 27 '15 at 5:16
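For concreteness, a small sketch (mine, not from the comment) of what that default dummy coding produces for f_color alone; "blue" is the alphabetically first level and becomes the reference class absorbed into the intercept:

# sketch only: f_color as defined in the question above
f_color <- factor(c("blue", "blue", "yellow", "red", "red", "yellow", "yellow", "red", "yellow"))
model.matrix(~ f_color)[, -1]  # two 0/1 indicator columns, f_colorred and f_coloryellow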

I will use the elasticnet package (via its enet function) since that is my preferred method. It is a little more flexible.
install.packages('elasticnet')
library(elasticnet)
age <- c(4,8,7,12,6,9,10,14,7)
gender <- c(1,0,1,1,1,0,1,0,0)
bmi_p <- c(0.86,0.45,0.99,0.84,0.85,0.67,0.91,0.29,0.88)
m_edu <- c(0,1,1,2,2,3,2,0,1)
p_edu <- c(0,2,2,2,2,3,2,0,0)
#f_color <- c("blue", "blue", "yellow", "red", "red", "yellow", "yellow", "red", "yellow")
f_color <- c(0, 0, 1, 2, 2, 1, 1, 2, 1)
asthma <- c(1,1,0,1,0,0,0,1,1)
pred <- cbind(age, gender, bmi_p, m_edu, p_edu, f_color)

enet(x=pred, y=asthma, lambda=0)
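A hedged sketch (mine, not part of the original answer) of how one might inspect such a fit: enet returns the whole LARS-EN coefficient path, which can be plotted, and coefficients can be read off at a chosen point on the path with predict. Note that, unlike glmnet with family = "binomial", this treats the 0/1 outcome as a continuous response.

# sketch only: reuses pred and asthma from the answer above
fit <- enet(x = pred, y = asthma, lambda = 0)  # lambda = 0 (the ridge part) reduces the elastic net to the lasso
plot(fit)                                      # coefficient paths along the lasso solution path
# coefficients halfway along the path (mode = "fraction" of the maximal L1 norm)
predict(fit, s = 0.5, type = "coefficients", mode = "fraction")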


answered Oct 8 '13 at 21:18 by bdeonovic

thanks for sharing elasticnet; however, I do not know what to make of the output from the above R script. Can you please clarify? Thanks in advance! Matt Reichenbach Oct 9 '13 at 14:29
