
By Hui Bian, Office for Faculty Excellence

Contact information
Email: bianh@ecu.edu
Phone: 328-5428
Location: 2307 Old Cafeteria Complex (east campus)
Website: http://core.ecu.edu/ofe/StatisticsResearch/

What is logistic regression

According to IBM SPSS Manual

It is used to predict the presence or absence of a characteristic or outcome based on values of a set of predictor variables. It is similar to a linear regression model, but suited to models where the dependent variable is dichotomous.

The ultimate goal of logistic regression

To determine the probability that a case belongs to the 1 category of the dependent variable, that is, the probability of the event occurring (the event occurring is always coded as 1), for a given set of predictors.

Variables in logistic regression

Dependent variable: one dichotomous/binary variable


- Yes/No: drug users vs. non-drug users
- Membership: intervention vs. control
- Characteristics: males vs. females

Independent variables: interval, dummy, or categorical variables (indicator coded).

Indicator coded means SPSS will automatically recode categorical variables for us.

Assumptions

Homogeneity of variance and normality of errors are NOT assumed, but logistic regression requires:
- Absence of multicollinearity
- No specification errors: all relevant predictors are included and irrelevant predictors are excluded
- A larger sample size than linear regression



Logistic regression equation


Equation (logit function):
ln(p/(1 - p)) = a + b1x1 + b2x2 + ... + bnxn
Logit(p) = a + b1x1 + b2x2 + ... + bnxn

- p: probability of a case belonging to category 1
- p/(1 - p): odds
- a: constant
- n: number of predictors
- b1-bn: regression coefficients
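The logit and its inverse can be illustrated in a few lines of Python (a sketch rather than anything SPSS-specific; the function names are ours):

```python
import math

def logit(p):
    # log-odds: ln(p / (1 - p))
    return math.log(p / (1 - p))

def inv_logit(x):
    # converts log-odds back to a probability between 0 and 1
    return 1 / (1 + math.exp(-x))

# a probability of .5 corresponds to odds of 1 and log-odds of 0
print(logit(0.5))      # 0.0
print(inv_logit(0.0))  # 0.5
```

The inverse-logit step is what turns the linear predictor a + b1x1 + ... + bnxn back into a predicted probability.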

Logistic regression equation


There is a non-linear relationship between the predictors and the binary outcome. The regression coefficients are estimated using maximum likelihood.

Logistic regression curve

The Y-axis is P (probability), which indicates the proportion of 1 at any given value of X.

Coding of variables

Recommendation for dependent variable


- Use 1 for the event occurring (the focus of the study)
- Use 0 for the absence of the event (the reference category)
- SPSS automatically recodes the lower number of the category to 0 and the higher number to 1.

Coding of variables

Recommendation for independent variables


- Code as 1 the category that is the focus of the study
- Code as 0 the category of reference


Coding of variables
Cases coded as 1 are referred to as the response group, comparison group, or target group. Cases coded as 0 are referred to as the reference group, base group, or control group.


Terms

Probability: likelihood of an event occurring


Drug use status is the DV and gender is the IV. The probability of a student using drugs is 205/413 = .496. We want to know whether this proportion is the same for both males and females.


            Male   Female   Total
Drug users   120       85     205
Non-users    102      106     208
Total        222      191     413


Terms

Odds: the probability of belonging to one group (or of the event occurring) divided by the probability of not belonging to that group (or of the event not occurring). The odds of a male using drugs are 120/102 = 1.18; the odds of a female using drugs are 85/106 = .80. For males, this means a male is 1.18 times as likely to use drugs as not to use them.

            Male   Female   Total
Drug users   120       85     205
Non-users    102      106     208
Total        222      191     413
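The two odds can be checked directly from the table's counts (a Python sketch; the variable names are ours):

```python
# cell counts from the 2x2 table above
male_users, male_nonusers = 120, 102
female_users, female_nonusers = 85, 106

# odds = (probability of using) / (probability of not using),
# which for raw counts reduces to users / non-users
odds_male = male_users / male_nonusers        # ~1.18
odds_female = female_users / female_nonusers  # ~0.80

print(round(odds_male, 2), round(odds_female, 2))
```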


Terms

Odds ratio: an important estimate in logistic regression and used to answer our research question. For the table below, the research question is whether there is a gender difference in using drugs or whether the probability of drug use is the same for males and females.
              Male (1)  Female (0)  Total
Drug users (1)     120          85    205
Non-users (0)      102         106    208
Total              222         191    413


Terms

Odds ratio
A ratio of the odds for each group: always the odds for the response group (males) divided by the odds for the referent group (females). The odds ratio is 1.18/.80 = 1.48.

              Male (1)  Female (0)  Total
Drug users (1)     120          85    205
Non-users (0)      102         106    208
Total              222         191    413
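Computing the odds ratio from the raw counts (a Python sketch; note that with unrounded odds the ratio comes out to about 1.47, while the slide's 1.48 comes from dividing the rounded values 1.18/.80):

```python
# odds from the 2x2 table, kept at full precision
odds_male = 120 / 102    # ~1.176
odds_female = 85 / 106   # ~0.802

odds_ratio = odds_male / odds_female
print(round(odds_ratio, 2))  # 1.47 (vs 1.18 / .80 = 1.48 with rounded odds)
```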


Terms

Odds ratio
In this example, the odds of drug use for males were 1.48 times those for females. An odds ratio > 1 indicates that the event is more likely for the response category than for the referent category of an independent variable. An odds ratio < 1 indicates that the event is less likely for the response category than for the referent category.

Terms

Adjusted odds ratio

When there are multiple independent variables in the model, it indicates the contribution of a particular predictor when the other predictors are controlled.


Terms

Ordinary least squares (OLS)
A method used for the linear regression model. It minimizes the sum of squared vertical distances between the observed responses in the dataset and the responses predicted by the linear model.

Maximum likelihood estimation (MLE)
MLE is more appropriate for the logistic regression model. It maximizes the log likelihood, which indicates how likely the observed grouping can be predicted from the observed values of the predictors.
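A minimal illustration of MLE for logistic regression, fitted by gradient ascent on the log likelihood using the gender-by-drug-use counts from the earlier slides (a pure-Python sketch, not the Newton-type algorithm SPSS actually uses):

```python
import math

# grouped data: x = 1 for males, 0 for females; y = 1 for drug users
data = ([(1, 1)] * 120 + [(1, 0)] * 102 +
        [(0, 1)] * 85 + [(0, 0)] * 106)

def fit_logistic(data, lr=0.01, iters=2000):
    """Maximize the log likelihood of ln(p/(1-p)) = b0 + b1*x
    by gradient ascent."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = 0.0
        for x, y in data:
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))
            g0 += y - p        # gradient w.r.t. the constant
            g1 += (y - p) * x  # gradient w.r.t. the coefficient
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

b0, b1 = fit_logistic(data)
# With one binary predictor the MLE has a closed form:
# b0 = ln(odds for females), b1 = ln(odds ratio) = ln(~1.467)
```

This is why exp(B) in the SPSS output can be read as an odds ratio: the fitted coefficient is the log of the odds ratio from the table.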


Logistic regression using SPSS

Example of logistic regression analysis. The research question is whether gender, self-control, and self-efficacy predict drug use status.

Three predictors:
- Gender (a01: 1 = males, 0 = females)
- Self-control (continuous variable)
- Self-efficacy (a80r: 1 = somewhat sure-not sure, 0 = very sure)
One dependent variable:
- Drug use status (1 = drug users, 0 = non-users)


Logistic regression

Null hypothesis
There is an equal chance of drug use or non-use for a given set of predictors; equivalently, the model coefficients are 0 (0 means there is no change due to the predictor variable).


Logistic regression using SPSS


Analyze > Regression > Binary Logistic
- Enter Drug_use into Dependent
- Enter a01, self-control, and a80r into Covariates


Logistic regression using SPSS

Click Categorical
Enter the two categorical variables (a01 and a80r) into the box on the right.


Logistic regression using SPSS

Categorical
The default contrast is Indicator, and the default reference category is Last.


Logistic regression using SPSS

Categorical: contrast methods

- Indicator. Contrasts indicate the presence or absence of category membership. The reference category is represented in the contrast matrix as a row of zeros.
- Simple. Each category of the predictor variable (except the reference category) is compared to the reference category.
- Difference. Each category of the predictor variable except the first category is compared to the average effect of previous categories. Also known as reverse Helmert contrasts.
- Helmert. Each category of the predictor variable except the last category is compared to the average effect of subsequent categories.
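The Indicator (dummy) coding scheme is easy to reproduce by hand (a Python sketch; the function name is ours):

```python
def indicator_contrasts(n_categories, reference=0):
    """Build indicator (dummy) contrast rows for a categorical
    predictor: one column per non-reference category; the
    reference category's row is all zeros."""
    cols = [c for c in range(n_categories) if c != reference]
    return [[1 if c == cat else 0 for c in cols]
            for cat in range(n_categories)]

# a 4-category predictor with the first level as the reference
matrix = indicator_contrasts(4, reference=0)
# matrix[0] == [0, 0, 0]: the reference row of zeros
```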


Logistic regression using SPSS

Categorical: contrast

- Repeated. Each category of the predictor variable except the first category is compared to the category that precedes it.
- Polynomial. Orthogonal polynomial contrasts. Categories are assumed to be equally spaced. Polynomial contrasts are available for numeric variables only.
- Deviation. Each category of the predictor variable except the reference category is compared to the overall effect.

Logistic regression using SPSS

Categorical
For a01 and a80r, we want category 0 as the reference category: check First, then click Change.


Logistic regression using SPSS

Save
For each case, if the predicted probability is greater than 0.5, the respondent is predicted to use drugs (coded as 1). If the predicted probability is less than 0.5, the respondent is predicted not to use drugs (coded as 0).
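This cutoff rule can be sketched in a few lines of Python (0.5 is SPSS's default cutoff; the function name is ours):

```python
def predict_class(p, cutoff=0.5):
    """Classify a case from its predicted probability.

    Mirrors the default rule: probabilities above the cutoff are
    predicted as 1 (event occurring), otherwise 0.
    """
    return 1 if p > cutoff else 0

# a respondent with predicted probability .62 is classified as a user
predicted = predict_class(0.62)
```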


Logistic regression using SPSS

Options


SPSS output

Coding

Our coding is the same as the Parameter coding


SPSS output

Constant-only model: Block 0 (beginning block / Step 0) means only the constant is in the model; our predictors are not in the equation yet.


SPSS output

Block 1
Model fit statistics

- Full model: Block 1 (Step 1) indicates that our predictors are entered into the model simultaneously; the method used is Enter.
- Validity of the model: the null hypothesis is rejected (p < .05).
- -2 Log likelihood ratio test: tests whether the set of IVs improves prediction of the DV better than chance.
- Pseudo R square: Nagelkerke R2 is preferred. The model accounts for almost 10% of the variance of the DV.
- Hosmer and Lemeshow test: assesses whether the predicted probabilities match the observed probabilities. p > .05 means the set of IVs accurately predicts the actual probabilities.

SPSS output

Model fit: goodness-of-fit statistics help you determine whether the model adequately describes the data.
- -2 log likelihood test (-2LL)
- Omnibus test of model coefficients: a Chi-square test based on the null hypothesis that all the coefficients are zero
- Model summary: pseudo R2
- Hosmer and Lemeshow test


SPSS output

Block 1

In Block 0, the probability of a correct prediction is 50.4%. In Block 1, the overall predictive accuracy is 62.7%.


SPSS output

Classification table
64.9% is also known as the sensitivity of prediction; 60.6% is also known as the specificity of prediction.
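These percentages can be back-calculated from the row totals (205 users, 208 non-users). A Python sketch with the implied cell counts (the counts below are reconstructed from the percentages, not shown on the slide):

```python
# implied classification counts: 64.9% of 205 users and
# 60.6% of 208 non-users were predicted correctly
tp, fn = 133, 72   # users predicted correctly / missed
tn, fp = 126, 82   # non-users predicted correctly / misclassified

sensitivity = tp / (tp + fn)                 # 133/205 = .649
specificity = tn / (tn + fp)                 # 126/208 = .606
accuracy = (tp + tn) / (tp + fn + tn + fp)   # 259/413 = .627
```

Note that the overall accuracy recovered here (62.7%) matches the Block 1 figure on the previous slide.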


SPSS output

Variables in the equation

1. Wald test: it tests the effect of an individual predictor while controlling for the other predictors.
2. Exp(B): an odds ratio.
- For gender, the odds of drug use for males are 1.60 times those for females.
- For self-control, the probability of drug use is contingent on the self-control level: the higher the self-control score, the less likely drug use is.
- For a80r, the odds of drug use for the low self-efficacy group are 1.53 times those for the high self-efficacy group.

SPSS output

Equation
The equation should be:

ln(odds) = .256 + .471*Gender - .093*Self-control + .427*Self-efficacy
Predicted probability = e^ln(odds) / (1 + e^ln(odds))
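The fitted equation can be turned into a predicted probability directly (a Python sketch; the example predictor values are hypothetical):

```python
import math

def predicted_prob(gender, self_control, self_efficacy):
    # coefficients from the fitted equation on this slide
    log_odds = (0.256 + 0.471 * gender
                - 0.093 * self_control
                + 0.427 * self_efficacy)
    return math.exp(log_odds) / (1 + math.exp(log_odds))

# hypothetical case: a male (1) with low self-efficacy (1) and a
# self-control score of 10
p = predicted_prob(1, 10, 1)   # ~.556, so predicted as a drug user
```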


SPSS output

Saved new variables

1. For each case, if the predicted probability is greater than 0.5, the respondent is predicted to use drugs (coded as 1). 2. If the predicted probability is less than 0.5, the respondent is predicted not to use drugs (coded as 0).


Results
A logistic regression was performed to test the effects of self-control, self-efficacy, and gender on drug use. Results indicated that the three-predictor model provided a statistically significant improvement over the constant-only model, χ2(3, N = 413) = 31.36, p < .001. The Nagelkerke R2 indicated that the model accounted for 9.8% of the total variance. The correct prediction rate was about 62.7%. The Wald tests showed that all three predictors significantly predicted drug use status.


Logistic regression using SPSS

Independent variables that are categorical with more than 2 categories.

Example: add a93a ("My friends think that it's okay for me to drink too much alcohol") into the model as an independent variable. Rerun the previous logistic regression, using the Indicator method with the first level as the reference.


Logistic regression using SPSS

SPSS output: coding

SPSS recoded a93a into three dummy variables, with the first level as the reference (represented in the contrast matrix as a row of zeros).


Logistic regression using SPSS

SPSS output: Variables in the equation


Comparing logistic models

Purpose of comparing logistic models
To determine whether adding more variables to the model provides an improvement in predictive power.

Example: we want to know whether there is a significant interaction of self-efficacy and self-control on the probability of drug use.
- Add the interaction term (self-efficacy * self-control) to the model.
- We will have three models: the constant-only model; the model with three predictors and the constant; and the model with the interaction term, three predictors, and the constant.

Comparing logistic models

Enter a01, a80r, and self-control into Block 1, then click Next.


Comparing logistic models

Block 2

Highlight a80r, then hold Ctrl and select self-control. Click a*b> to enter a80r*self-control into Block 2.


SPSS output

The results for Block 0 and Block 1 are the same as those from the previous analysis.
1. Block 2 is not significant, which means the interaction term is not significant. "Model" means that with everything in the equation, the whole model is significant.
2. The difference in -2 Log Likelihood between Block 2 and Block 1 is 541.153 - 537.486 = 3.67. This is a Chi-square statistic with df = 1, p > .05 (from the Chi-square table, Chi-square = 3.84 at p = .05, df = 1).


SPSS output

The Chi-square change from Block 1 to Block 2 is 35.032 - 31.364 = 3.668, which is the Chi-square for the interaction term. The R2 change indicates that 1% of the variance is explained by the interaction term. The improvement in prediction is not significant (p = .055).
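The p value for this Chi-square change can be checked without a table: for df = 1, the Chi-square survival function reduces to a complementary error function (a Python sketch; the function name is ours):

```python
import math

def chi2_sf_df1(x):
    # P(X > x) for a Chi-square variable with df = 1,
    # which equals erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(x / 2))

chi2_change = 35.032 - 31.364   # 3.668, the Chi-square for the interaction
p = chi2_sf_df1(chi2_change)    # ~.055, matching the slide
```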

SPSS output

The Wald test also shows that there is no significant interaction effect of self-efficacy and self-control on the DV. The equation of the model:

ln(odds) = .021 + .476*Gender - .063*Self-control + 1.03*Self-efficacy - .079*Self-efficacy*Self-control


Graphs for interaction effect


Graphs for interaction effect


Self-efficacy: 1 = somewhat sure-not sure
Self-efficacy: 0 = very sure


Multinomial logistic regression

Used to classify subjects based on values of a set of predictor variables. This type of regression is similar to logistic regression, but it is more general because the dependent variable is not restricted to two categories.
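In a multinomial logit model, each non-reference category gets its own linear predictor, and the predicted probabilities come from a softmax over those predictors (a Python sketch; the example logits are hypothetical):

```python
import math

def multinomial_probs(logits):
    # softmax over one linear predictor per category;
    # the reference category's logit is fixed at 0
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical linear predictors for three outcome categories,
# with the last one as the reference (logit 0)
probs = multinomial_probs([0.4, -0.2, 0.0])
# the probabilities always sum to 1
```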


Multinomial logistic regression

Variables
The dependent variable should be categorical. Independent variables can be factors or covariates. In general, factors should be categorical variables and covariates should be continuous variables.


Multinomial logistic regression using SPSS

Example (from the SPSS samples): a regression to determine marketing profiles for each breakfast option.
- Dependent variable: choice of breakfast (1 = Breakfast bar; 2 = Oatmeal; 3 = Cereal)
- Independent variables: age, gender, lifestyle (all categorical variables)


Multinomial logistic regression using SPSS

Go to Analyze > Regression > Multinomial Logistic


Multinomial logistic regression using SPSS

Click Reference Category

We can use any category of the dependent variable as the reference.


Multinomial logistic regression using SPSS

Click Model

The default model is Main effects. We can customize our model and request main effects and interaction effects.


Multinomial logistic regression using SPSS

Click Statistics


Multinomial logistic regression using SPSS

SPSS outputs


Multinomial logistic regression using SPSS

SPSS Output

Cells with zero frequencies can be a useful indicator of potential problems. Since only a few (4.2%) of the cells are empty, you can probably safely use the results of the goodness-of-fit tests.


Multinomial logistic regression using SPSS

SPSS Outputs
1. The likelihood ratio tests check the difference between the null model and the final model.
2. The Chi-square in the first table is the change in -2 Log Likelihood from the intercept-only model to the final model.
3. The results show that the final model outperforms the null model.
4. The results of the goodness-of-fit tests show that the model adequately fits the data.


Multinomial logistic regression using SPSS

SPSS Output

The likelihood ratio tests check the contribution of each effect to the model. Age and Active make significant contributions to the model.


Multinomial logistic regression using SPSS

SPSS Output


Multinomial logistic regression using SPSS

SPSS Output: Parameter estimates table summarizes the effect of each predictor.
Parameters with significant negative coefficients decrease the likelihood of that response category with respect to the reference category. Parameters with positive coefficients increase the likelihood of that response category.

Multinomial logistic regression using SPSS

SPSS Output

Cells on the diagonal are correct predictions and cells off the diagonal are incorrect predictions. Overall, 56.4% of the cases are classified correctly.




