
By Hui Bian, Office for Faculty Excellence

Contact information
Email: bianh@ecu.edu
Phone: 328-5428
Location: 2307 Old Cafeteria Complex (east campus)
Website: http://core.ecu.edu/ofe/StatisticsResearch/

What is logistic regression

According to IBM SPSS Manual

It is used to predict the presence or absence of a characteristic or outcome based on values of a set of predictor variables. It is similar to a linear regression model, but suited to models where the dependent variable is dichotomous.

The ultimate goal of logistic regression

To determine the probability that a case belongs to the 1 category of the dependent variable, that is, the probability of the event occurring (the event occurring is always coded as 1), for a given set of predictors.

Variables in logistic regression

Dependent variable: one dichotomous/binary variable


- Yes/No: drug users vs. non-drug users
- Membership: intervention vs. control
- Characteristics: males vs. females

Independent variables: interval, dummy, or categorical variables (indicator coded).

Indicator coded means SPSS will automatically recode categorical variables for us.

Assumptions

Homogeneity of variance and normality of errors are NOT assumed, but logistic regression requires:
- Absence of multicollinearity
- No specification errors: all relevant predictors are included and irrelevant predictors are excluded
- A larger sample size than linear regression



Logistic regression equation


Equation (logit function):
ln(p/(1 - p)) = a + b1x1 + b2x2 + ... + bnxn
Logit(p) = a + b1x1 + b2x2 + ... + bnxn

- p: probability of a case belonging to category 1
- p/(1 - p): odds
- a: constant
- n: number of predictors
- b1-bn: regression coefficients
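The logit and its inverse can be illustrated in a few lines of Python (a sketch rather than anything SPSS-specific; the function names are ours):

```python
import math

def logit(p):
    # log-odds: ln(p / (1 - p))
    return math.log(p / (1 - p))

def inv_logit(x):
    # converts log-odds back to a probability between 0 and 1
    return 1 / (1 + math.exp(-x))

# a probability of .5 corresponds to odds of 1 and log-odds of 0
print(logit(0.5))      # 0.0
print(inv_logit(0.0))  # 0.5
```

The inverse-logit step is what turns the linear predictor a + b1x1 + ... + bnxn back into a predicted probability.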

Logistic regression equation


There is a non-linear relationship between the predictors and the binary outcome. The regression coefficients are estimated using maximum likelihood.

Logistic regression curve

The Y-axis is P (probability), which indicates the proportion of 1 at any given value of X.

Coding of variables

Recommendation for dependent variable


- Use 1 for the event occurring (the focus of the study)
- Use 0 for the absence of the event (the reference category)
- SPSS automatically recodes the lower number of the category to 0 and the higher number to 1.

Coding of variables

Recommendation for independent variables


- Code as 1 the category that is the focus of the study
- Code as 0 the category of reference


Coding of variables
Cases coded as 1 are referred to as the response group, comparison group, or target group. Cases coded as 0 are referred to as the reference group, base group, or control group.


Terms

Probability: likelihood of an event occurring


Drug use status is the DV and gender is the IV. The probability of a student using drugs is 205/413 = .496. We want to know whether this proportion is the same for both males and females.


            Male   Female   Total
Drug users   120       85     205
Non-users    102      106     208
Total        222      191     413


Terms

Odds: the probability of belonging to one group (or of the event occurring) divided by the probability of not belonging to that group (or of the event not occurring). The odds of a male using drugs are 120/102 = 1.18; the odds of a female using drugs are 85/106 = .80. For males, this means a male is 1.18 times as likely to use drugs as not to use them.

            Male   Female   Total
Drug users   120       85     205
Non-users    102      106     208
Total        222      191     413
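The two odds can be checked directly from the table's counts (a Python sketch; the variable names are ours):

```python
# cell counts from the 2x2 table above
male_users, male_nonusers = 120, 102
female_users, female_nonusers = 85, 106

# odds = (probability of using) / (probability of not using),
# which for raw counts reduces to users / non-users
odds_male = male_users / male_nonusers        # ~1.18
odds_female = female_users / female_nonusers  # ~0.80

print(round(odds_male, 2), round(odds_female, 2))
```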


Terms

Odds ratio: an important estimate in logistic regression and used to answer our research question. For the table below, the research question is whether there is a gender difference in using drugs or whether the probability of drug use is the same for males and females.
              Male (1)  Female (0)  Total
Drug users (1)     120          85    205
Non-users (0)      102         106    208
Total              222         191    413


Terms

Odds ratio
A ratio of the odds for each group: always the odds for the response group (males) divided by the odds for the referent group (females). The odds ratio is 1.18/.80 = 1.48.

              Male (1)  Female (0)  Total
Drug users (1)     120          85    205
Non-users (0)      102         106    208
Total              222         191    413
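Computing the odds ratio from the raw counts (a Python sketch; note that with unrounded odds the ratio comes out to about 1.47, while the slide's 1.48 comes from dividing the rounded values 1.18/.80):

```python
# odds from the 2x2 table, kept at full precision
odds_male = 120 / 102    # ~1.176
odds_female = 85 / 106   # ~0.802

odds_ratio = odds_male / odds_female
print(round(odds_ratio, 2))  # 1.47 (vs 1.18 / .80 = 1.48 with rounded odds)
```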


Terms

Odds ratio
In this example, the odds of drug use for males were 1.48 times those for females. An odds ratio > 1 indicates that the event is more likely for the response category than for the referent category of an independent variable. An odds ratio < 1 indicates that the event is less likely for the response category than for the referent category.

Terms

Adjusted odds ratio

When there are multiple independent variables in the model, it indicates the contribution of a particular predictor when the other predictors are controlled.


Terms

Ordinary least squares (OLS)
A method used for the linear regression model. It minimizes the sum of squared vertical distances between the observed responses in the dataset and the responses predicted by the linear model.

Maximum likelihood estimation (MLE)
MLE is more appropriate for the logistic regression model. It maximizes the log likelihood, which indicates how likely the observed grouping can be predicted from the observed values of the predictors.
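A minimal illustration of MLE for logistic regression, fitted by gradient ascent on the log likelihood using the gender-by-drug-use counts from the earlier slides (a pure-Python sketch, not the Newton-type algorithm SPSS actually uses):

```python
import math

# grouped data: x = 1 for males, 0 for females; y = 1 for drug users
data = ([(1, 1)] * 120 + [(1, 0)] * 102 +
        [(0, 1)] * 85 + [(0, 0)] * 106)

def fit_logistic(data, lr=0.01, iters=2000):
    """Maximize the log likelihood of ln(p/(1-p)) = b0 + b1*x
    by gradient ascent."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = 0.0
        for x, y in data:
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))
            g0 += y - p        # gradient w.r.t. the constant
            g1 += (y - p) * x  # gradient w.r.t. the coefficient
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

b0, b1 = fit_logistic(data)
# With one binary predictor the MLE has a closed form:
# b0 = ln(odds for females), b1 = ln(odds ratio) = ln(~1.467)
```

This is why exp(B) in the SPSS output can be read as an odds ratio: the fitted coefficient is the log of the odds ratio from the table.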


Logistic regression using SPSS

Example of logistic regression analysis. The research question is whether gender, self-control, and self-efficacy predict drug use status.

Three predictors:
- Gender (a01: 1 = males, 0 = females)
- Self-control (continuous variable)
- Self-efficacy (a80r: 1 = somewhat sure-not sure, 0 = very sure)
One dependent variable:
- Drug use status (1 = drug users, 0 = non-users)


Logistic regression

Null hypothesis
There is an equal chance of drug use or non-use for a given set of predictors; equivalently, the model coefficients are 0 (0 means there is no change due to the predictor variable).


Logistic regression using SPSS


Analyze > Regression > Binary Logistic
- Enter Drug_use into Dependent
- Enter a01, self-control, and a80r into Covariates


Logistic regression using SPSS

Click Categorical
Enter the two categorical variables (a01 and a80r) into the box on the right.


Logistic regression using SPSS

Categorical
The default contrast is Indicator, and the default reference category is Last.


Logistic regression using SPSS

Categorical: contrast methods

- Indicator. Contrasts indicate the presence or absence of category membership. The reference category is represented in the contrast matrix as a row of zeros.
- Simple. Each category of the predictor variable (except the reference category) is compared to the reference category.
- Difference. Each category of the predictor variable except the first category is compared to the average effect of previous categories. Also known as reverse Helmert contrasts.
- Helmert. Each category of the predictor variable except the last category is compared to the average effect of subsequent categories.
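The Indicator (dummy) coding scheme is easy to reproduce by hand (a Python sketch; the function name is ours):

```python
def indicator_contrasts(n_categories, reference=0):
    """Build indicator (dummy) contrast rows for a categorical
    predictor: one column per non-reference category; the
    reference category's row is all zeros."""
    cols = [c for c in range(n_categories) if c != reference]
    return [[1 if c == cat else 0 for c in cols]
            for cat in range(n_categories)]

# a 4-category predictor with the first level as the reference
matrix = indicator_contrasts(4, reference=0)
# matrix[0] == [0, 0, 0]: the reference row of zeros
```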


Logistic regression using SPSS

Categorical: contrast

- Repeated. Each category of the predictor variable except the first category is compared to the category that precedes it.
- Polynomial. Orthogonal polynomial contrasts. Categories are assumed to be equally spaced. Polynomial contrasts are available for numeric variables only.
- Deviation. Each category of the predictor variable except the reference category is compared to the overall effect.

Logistic regression using SPSS

Categorical
For a01 and a80r, we want category 0 as the reference category: check First, then click Change.


Logistic regression using SPSS

Save
For each case, if the predicted probability is greater than 0.5, the respondent is predicted to use drugs (coded as 1). If the predicted probability is less than 0.5, the respondent is predicted not to use drugs (coded as 0).
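This cutoff rule can be sketched in a few lines of Python (0.5 is SPSS's default cutoff; the function name is ours):

```python
def predict_class(p, cutoff=0.5):
    """Classify a case from its predicted probability.

    Mirrors the default rule: probabilities above the cutoff are
    predicted as 1 (event occurring), otherwise 0.
    """
    return 1 if p > cutoff else 0

# a respondent with predicted probability .62 is classified as a user
predicted = predict_class(0.62)
```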


Logistic regression using SPSS

Options


SPSS output

Coding

Our coding is the same as the Parameter coding


SPSS output

Constant-only model: Block 0 (beginning block / Step 0) means only the constant is in the model; our predictors are not in the equation yet.


SPSS output

Block 1
Model fit statistics

- Full model: Block 1 (Step 1) indicates that our predictors are entered into the model simultaneously; the method used is Enter.
- Validity of the model: the null hypothesis is rejected (p < .05).
- -2 Log likelihood ratio test: tests whether the set of IVs improves prediction of the DV better than chance.
- Pseudo R square: Nagelkerke R2 is preferred. The model accounts for almost 10% of the variance of the DV.
- Hosmer and Lemeshow test: assesses whether the predicted probabilities match the observed probabilities. p > .05 means the set of IVs accurately predicts the actual probabilities.

SPSS output

Model fit: goodness-of-fit statistics help you determine whether the model adequately describes the data.
- -2 log likelihood test (-2LL)
- Omnibus test of model coefficients: a Chi-square test based on the null hypothesis that all the coefficients are zero
- Model summary: pseudo R2
- Hosmer and Lemeshow test


SPSS output

Block 1

In Block 0, the probability of a correct prediction is 50.4%. In Block 1, the overall predictive accuracy is 62.7%.


SPSS output

Classification table
64.9% is also known as the sensitivity of prediction; 60.6% is also known as the specificity of prediction.
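These percentages can be back-calculated from the row totals (205 users, 208 non-users). A Python sketch with the implied cell counts (the counts below are reconstructed from the percentages, not shown on the slide):

```python
# implied classification counts: 64.9% of 205 users and
# 60.6% of 208 non-users were predicted correctly
tp, fn = 133, 72   # users predicted correctly / missed
tn, fp = 126, 82   # non-users predicted correctly / misclassified

sensitivity = tp / (tp + fn)                 # 133/205 = .649
specificity = tn / (tn + fp)                 # 126/208 = .606
accuracy = (tp + tn) / (tp + fn + tn + fp)   # 259/413 = .627
```

Note that the overall accuracy recovered here (62.7%) matches the Block 1 figure on the previous slide.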


SPSS output

Variables in the equation

1. Wald test: it tests the effect of an individual predictor while controlling for the other predictors.
2. Exp(B): an odds ratio.
- For gender, the odds of drug use for males are 1.60 times those for females.
- For self-control, the probability of drug use is contingent on the self-control level: the higher the self-control score, the less likely drug use is.
- For a80r, the odds of drug use for the low self-efficacy group are 1.53 times those for the high self-efficacy group.

SPSS output

Equation
The equation should be:

ln(odds) = .256 + .471*Gender - .093*Self-control + .427*Self-efficacy
Predicted probability = e^ln(odds) / (1 + e^ln(odds))
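The fitted equation can be turned into a predicted probability directly (a Python sketch; the example predictor values are hypothetical):

```python
import math

def predicted_prob(gender, self_control, self_efficacy):
    # coefficients from the fitted equation on this slide
    log_odds = (0.256 + 0.471 * gender
                - 0.093 * self_control
                + 0.427 * self_efficacy)
    return math.exp(log_odds) / (1 + math.exp(log_odds))

# hypothetical case: a male (1) with low self-efficacy (1) and a
# self-control score of 10
p = predicted_prob(1, 10, 1)   # ~.556, so predicted as a drug user
```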


SPSS output

Saved new variables

1. For each case, if the predicted probability is greater than 0.5, the respondent is predicted to use drugs (coded as 1). 2. If the predicted probability is less than 0.5, the respondent is predicted not to use drugs (coded as 0).


Results
A logistic regression was performed to test the effects of self-control, self-efficacy, and gender on drug use. Results indicated that the three-predictor model provided a statistically significant improvement over the constant-only model, χ2(3, N = 413) = 31.36, p < .001. The Nagelkerke R2 indicated that the model accounted for 9.8% of the total variance. The correct prediction rate was about 62.7%. The Wald tests showed that all three predictors significantly predicted drug use status.


Logistic regression using SPSS

Independent variables that are categorical with more than 2 categories.

Example: add a93a ("My friends think that it's okay for me to drink too much alcohol") into the model as an independent variable. Rerun the previous logistic regression, using the Indicator method with the first level as the reference.


Logistic regression using SPSS

SPSS output: coding

SPSS recoded a93a into three dummy variables, with the first level as the reference (represented in the contrast matrix as a row of zeros).


Logistic regression using SPSS

SPSS output: Variables in the equation


Comparing logistic models

Purpose of comparing logistic models
To determine whether adding more variables to the model provides an improvement in predictive power.

Example: we want to know whether there is a significant interaction of self-efficacy and self-control on the probability of drug use.
- Add the interaction term (self-efficacy * self-control) to the model.
- We will have three models: the constant-only model; the model with three predictors and the constant; and the model with the interaction term, three predictors, and the constant.

Comparing logistic models

Enter a01, a80r, and self-control into Block 1, then click Next.


Comparing logistic models

Block 2

Highlight a80r, then hold Ctrl and select self-control. Click a*b> to enter a80r*self-control into Block 2.


SPSS output

The results for Block 0 and Block 1 are the same as those from the previous analysis.
1. Block 2 is not significant, which means the interaction term is not significant. "Model" means that with everything in the equation, the whole model is significant.
2. The difference in -2 Log Likelihood between Block 2 and Block 1 is 541.153 - 537.486 = 3.67. This is a Chi-square statistic with df = 1, p > .05 (from the Chi-square table, Chi-square = 3.84 at p = .05, df = 1).


SPSS output

The Chi-square change from Block 1 to Block 2 is 35.032 - 31.364 = 3.668, which is the Chi-square for the interaction term. The R2 change indicates that 1% of the variance is explained by the interaction term. The improvement in prediction is not significant (p = .055).
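The p value for this Chi-square change can be checked without a table: for df = 1, the Chi-square survival function reduces to a complementary error function (a Python sketch; the function name is ours):

```python
import math

def chi2_sf_df1(x):
    # P(X > x) for a Chi-square variable with df = 1,
    # which equals erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(x / 2))

chi2_change = 35.032 - 31.364   # 3.668, the Chi-square for the interaction
p = chi2_sf_df1(chi2_change)    # ~.055, matching the slide
```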

SPSS output

The Wald test also shows that there is no significant interaction effect of self-efficacy and self-control on the DV. The equation of the model:

ln(odds) = .021 + .476*Gender - .063*Self-control + 1.03*Self-efficacy - .079*Self-efficacy*Self-control


Graphs for interaction effect


Graphs for interaction effect


Self-efficacy: 1 = somewhat sure-not sure
Self-efficacy: 0 = very sure


Multinomial logistic regression

Used to classify subjects based on values of a set of predictor variables. This type of regression is similar to logistic regression, but it is more general because the dependent variable is not restricted to two categories.
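In a multinomial logit model, each non-reference category gets its own linear predictor, and the predicted probabilities come from a softmax over those predictors (a Python sketch; the example logits are hypothetical):

```python
import math

def multinomial_probs(logits):
    # softmax over one linear predictor per category;
    # the reference category's logit is fixed at 0
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical linear predictors for three outcome categories,
# with the last one as the reference (logit 0)
probs = multinomial_probs([0.4, -0.2, 0.0])
# the probabilities always sum to 1
```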


Multinomial logistic regression

Variables
The dependent variable should be categorical. Independent variables can be factors or covariates. In general, factors should be categorical variables and covariates should be continuous variables.


Multinomial logistic regression using SPSS

Example (from the SPSS samples): a regression to determine marketing profiles for each breakfast option.
- Dependent variable: choice of breakfast (1 = Breakfast bar; 2 = Oatmeal; 3 = Cereal)
- Independent variables: age, gender, lifestyle (all categorical variables)


Multinomial logistic regression using SPSS

Go to Analyze > Regression > Multinomial Logistic


Multinomial logistic regression using SPSS

Click Reference Category

We can use any category of the dependent variable as the reference.


Multinomial logistic regression using SPSS

Click Model

The default model is Main effects. We can customize our model and request main effects and interaction effects.


Multinomial logistic regression using SPSS

Click Statistics


Multinomial logistic regression using SPSS

SPSS outputs


Multinomial logistic regression using SPSS

SPSS Output

Cells with zero frequencies can be a useful indicator of potential problems. Since only a few (4.2%) of the cells are empty, you can probably safely use the results of the goodness-of-fit tests.


Multinomial logistic regression using SPSS

SPSS Outputs
1. The likelihood ratio tests check the difference between the null model and the final model.
2. The Chi-square in the first table is the change in -2 Log Likelihood from the intercept-only model to the final model.
3. The results show that the final model outperforms the null model.
4. The results of the goodness-of-fit tests show that the model adequately fits the data.


Multinomial logistic regression using SPSS

SPSS Output

The likelihood ratio tests check the contribution of each effect to the model. Age and Active make significant contributions to the model.


Multinomial logistic regression using SPSS

SPSS Output


Multinomial logistic regression using SPSS

SPSS Output: Parameter estimates table summarizes the effect of each predictor.
Parameters with significant negative coefficients decrease the likelihood of that response category with respect to the reference category. Parameters with positive coefficients increase the likelihood of that response category.

Multinomial logistic regression using SPSS

SPSS Output

Cells on the diagonal are correct predictions and cells off the diagonal are incorrect predictions. Overall, 56.4% of the cases are classified correctly.




