
Multiple Linear Regression:

Stepwise Regression and Variable Selection Methods

Jitamitra DESAI (“JD”)


PGP 2018-20: Decision Sciences - II
Indian Institute of Management @ Bangalore
Gender Discrimination
► Consider the ‘Gender Discrimination’ case
► Take a look at the data
  • 200 observations (134 Female, 66 Male)
  • Gender, Job Level, Education Level, Work Ex, Prior Work Ex, Analytical Skills, Age
  • Salary

10/2/2018
PGP 2018-20 A Global Optimization Framework
Decision Sciences - II for SDA Slide2 2
► A simple analysis: Collect the male and female employees into two columns and find the average salaries:
  • Male: 545158; Female: 443981
► Conclude that there is discrimination?

► Are the mean salaries statistically different?
  • Do a 2-sample t-test to check if the population means are equal (assuming equal variances): Salaries are statistically different!
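The equal-variance two-sample t statistic mentioned above can be sketched in plain Python. This is only an illustration: the salary lists are made-up numbers rather than the case data, and `pooled_t` is a hypothetical helper name.

```python
import math
from statistics import mean, variance

def pooled_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for the equal-variance two-sample t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / (n_a + n_b - 2)
    se = math.sqrt(sp2 * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / se, n_a + n_b - 2

# Illustrative salaries only (assumed values, not the case data).
male = [520000, 560000, 610000, 480000, 555000]
female = [430000, 450000, 470000, 410000, 460000, 445000]

t_stat, dof = pooled_t(male, female)
print(f"t = {t_stat:.3f} on {dof} degrees of freedom")
```

The means are judged statistically different when |t| exceeds the critical value for the given degrees of freedom (about 1.97 at the 5% level for the 198 degrees of freedom of the 200-employee case).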
► Identify the numerical and categorical variables

► How would you model Gender?
  • Create a dummy (categorical/indicator) variable
  • Recall that two categories require one dummy variable
  • Generalizing, we create (m – 1) dummy variables for ‘m’ categories; the remaining category is the “base”
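The (m – 1) rule can be sketched as a small encoding routine. The column values and the helper name `make_dummies` are hypothetical and serve only to illustrate the idea.

```python
def make_dummies(values, base):
    """Return {category: 0/1 list} for every category except the base."""
    categories = sorted(set(values))
    if base not in categories:
        raise ValueError(f"base category {base!r} not found in data")
    return {
        cat: [1 if v == cat else 0 for v in values]
        for cat in categories
        if cat != base
    }

# Hypothetical education column with m = 4 categories.
education = ["Arts", "Science", "Commerce", "Technology", "Science", "Arts"]
dummies = make_dummies(education, base="Arts")

for name, column in dummies.items():
    print(name, column)
```

A row whose dummies are all zero belongs to the base category, which is why no separate column is needed for it.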

Model 1 (Salary vs Gender)

► Consider the model (Gender = 1 for female, 0 for male):
  $\text{Salary} = \beta_0 + \beta_1 \cdot \text{Gender} + \varepsilon$

Model 1 (Salary vs Gender)
SPSS output (dependent variable: SalaryRs; predictors: (Constant), Gender; method: Enter)

Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .351   .123       .119                127558.101

ANOVA
Source       Sum of Squares      df    Mean Square       F        Sig.
Regression   452670537977.227    1     452670537977.2    27.821   .000
Residual     3221671689990.777   198   16271069141.36
Total        3674342227968.004   199

Coefficients
Term         B (unstandardized)   Std. Error   Beta (standardized)   t        Sig.
(Constant)   545158.182           15701.317                          34.721   .000
Gender       -101176.988          19182.212    -.351                 -5.275   .000

Model 1 (Salary vs Gender)

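► Consider the model (Gender = 1 for female, 0 for male):
  $\text{Salary} = \beta_0 + \beta_1 \cdot \text{Gender} + \varepsilon$
► The developed regression equation:
  $\widehat{\text{Salary}} = 545158.182 - 101176.988 \cdot \text{Gender}$
► This equation merely states that:
  • Salary (Male): 545158; Salary (Female): 443981
► If the base was chosen as ‘Female’, i.e., Gender = 0 for female, then the equation would have been:
  $\widehat{\text{Salary}} = 443981.194 + 101176.988 \cdot \text{Gender}$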
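A minimal sketch of why a regression on a single dummy reproduces the two group averages, and why switching the base category swaps the intercept and flips the sign of the slope. The data are toy values (assumed), and `simple_ols` is a hypothetical helper.

```python
from statistics import mean

def simple_ols(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Toy data (assumed): gender = 1 for female, 0 for male.
gender = [0, 0, 0, 1, 1, 1, 1]
salary = [500, 540, 580, 400, 440, 460, 420]

b0, b1 = simple_ols(gender, salary)
male_mean = mean(s for g, s in zip(gender, salary) if g == 0)
female_mean = mean(s for g, s in zip(gender, salary) if g == 1)
print(b0, male_mean)          # intercept equals the base-group (male) mean
print(b0 + b1, female_mean)   # intercept + slope equals the female mean

# Switching the base category (female = 0) swaps the intercept and flips the slope sign.
flipped = [1 - g for g in gender]
c0, c1 = simple_ols(flipped, salary)
print(c0, c1)
```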

Model 1 (Salary vs Gender)
► Draw the regression plot
[Plot: fitted salary by gender; only two points, Male: 545158 and Female: 443981; vertical axis from 400000 to 600000]
► In the presence of only categorical variables, the regression fit gives us only individual points (there is no regression line!)
► It merely computes the averages of the two populations!
► Isn’t this the same as (single factor) ANOVA?
► Is there gender discrimination in the company?
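With a single two-level factor, the regression t-test on the dummy, the pooled two-sample t-test, and the single-factor ANOVA F-test coincide. As a quick check against the Model 1 output (here $n_M = 66$, $n_F = 134$, and MSE is the residual mean square):

```latex
t^2 = (-5.275)^2 \approx 27.83 \approx F = 27.821
\qquad
\mathrm{SE}(\hat\beta_1)
  = \sqrt{MSE\left(\tfrac{1}{n_M} + \tfrac{1}{n_F}\right)}
  = \sqrt{16271069141.36\left(\tfrac{1}{66} + \tfrac{1}{134}\right)}
  \approx 19182
```

The small gap between 27.83 and 27.821 comes from rounding the t statistic to three decimals.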

Model 2 (Salary vs Gender and YrsExp)

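► Consider the model (Gender = 1 for female, 0 for male):
  $\text{Salary} = \beta_0 + \beta_1 \cdot \text{Gender} + \beta_2 \cdot \text{ExpYears} + \varepsilon$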

Model 2 (Salary vs Gender and YrsExp)
SPSS output (dependent variable: SalaryRs; predictors: (Constant), ExpYears, Gender; method: Enter)

Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .702   .493       .488                97212.725

ANOVA
Source       Sum of Squares      df    Mean Square        F        Sig.
Regression   1812630387807.858   2     906315193903.929   95.903   .000
Residual     1861711840160.146   197   9450313909.442
Total        3674342227968.004   199

Coefficients
Term         B (unstandardized)   Std. Error   Beta (standardized)   t        Sig.
(Constant)   440769.577           14795.584                          29.791   .000
Gender       -96986.767           14623.041    -.336                 -6.632   .000
ExpYears     11757.078            980.075      .609                  11.996   .000

Model 2 (Salary vs Gender and YrsExp)

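► Consider the model (Gender = 1 for female, 0 for male):
  $\text{Salary} = \beta_0 + \beta_1 \cdot \text{Gender} + \beta_2 \cdot \text{ExpYears} + \varepsilon$
► The developed regression equation:
  $\widehat{\text{Salary}} = 440769.577 - 96986.767 \cdot \text{Gender} + 11757.078 \cdot \text{ExpYears}$
[Plot: fitted salary vs. years of experience (0 to 10); two parallel lines with intercepts Male: 440769 and Female: 343783]
Conclusion:
• Women are discriminated against at the entry level, but there is no discrimination on the job!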

Model 2 (Salary vs Gender and YrsExp)
► Examine the residual plot(s)
  • Residuals vs. Years Experience
  • Check for any patterns/correlation
[Plot: residuals (–400000 to 400000) vs. years of experience (0 to 40)]

Model 2 (Salary vs Gender and YrsExp)
► Examine the residual plot(s)
  • Residuals vs. Years Experience for Males/Females separately
  • What about patterns/correlations?
  • What is your conclusion? (Need interactive effects)
[Plots: residuals vs. years of experience, one panel per gender, each with a fitted trend line: e = –57559 + 6482.8·yrs (males) and e = 74127 – 8697.9·yrs (females)]

Model 3 (Salary vs Gender, YrsExp, and interaction)

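► What we really need is not only the effect of gender and experience, but their interactive effect:
  $\text{Salary} = \beta_0 + \beta_1 \cdot \text{Gender} + \beta_2 \cdot \text{ExpYears} + \beta_3 \cdot (\text{Gender} \times \text{ExpYears}) + \varepsilon$
  • Interaction variable: GenderYearsExp = Gender × ExpYears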

Model 3 (Salary vs Gender, YrsExp, and interaction)
SPSS output (dependent variable: SalaryRs; predictors: (Constant), GenderYearsExp, ExpYears, Gender; method: Enter)

Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .803   .644       .639                81658.556

ANOVA
Source       Sum of Squares      df    Mean Square        F         Sig.
Regression   2367390744669.802   3     789130248223.267   118.344   .000
Residual     1306951483298.202   196   6668119812.746
Total        3674342227968.004   199

Coefficients
Term             B (unstandardized)   Std. Error   Beta (standardized)   t        Sig.
(Constant)       383210.173           13938.601                          27.493   .000
Gender           34699.710            18955.737    .120                  1.831    .069
ExpYears         18239.878            1087.618     .944                  16.770   .000
GenderYearsExp   -15180.723           1664.338     -.682                 -9.121   .000

Model 3 (Salary vs Gender, YrsExp, and interaction)

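► What we really need is not only the effect of gender and experience, but their interactive effect
► The developed regression model is:
  $\widehat{\text{Salary}} = 383210.173 + 34699.710 \cdot \text{Gender} + 18239.878 \cdot \text{ExpYears} - 15180.723 \cdot (\text{Gender} \times \text{ExpYears})$
► For female employees (Gender = 1):
  $\widehat{\text{Salary}} = 417909.883 + 3059.155 \cdot \text{ExpYears}$
► For male employees (Gender = 0):
  $\widehat{\text{Salary}} = 383210.173 + 18239.878 \cdot \text{ExpYears}$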

Model 3 (Salary vs Gender, YrsExp, and interaction)
► What we really need is not only the effect of gender and experience, but their interactive effect
► The developed regression models:
  • Female: $\widehat{\text{Salary}} = 417910 + 3059 \cdot \text{ExpYears}$
  • Male: $\widehat{\text{Salary}} = 383210 + 18240 \cdot \text{ExpYears}$
[Plot: fitted salary vs. years of experience (0 to 10); two lines with intercepts Female: 417910 and Male: 383210]
Conclusion:
• Men are discriminated against at the entry level…
• Women are discriminated against on the job!
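The two gender-specific lines and the point where they cross follow directly from the Model 3 coefficients. A small sketch, with the coefficients copied from the SPSS output and all variable names chosen here for illustration:

```python
# Coefficients copied from the Model 3 SPSS output (Gender = 1 for female).
B0, B_GENDER, B_EXP, B_INTERACTION = 383210.173, 34699.710, 18239.878, -15180.723

def predicted_salary(gender, years):
    """Fitted salary from Model 3 for a given gender code and years of experience."""
    return B0 + B_GENDER * gender + B_EXP * years + B_INTERACTION * gender * years

female_intercept = B0 + B_GENDER          # starting salary, women
female_slope = B_EXP + B_INTERACTION      # yearly increment, women
male_intercept, male_slope = B0, B_EXP    # men are the base category

# Experience at which the two fitted lines cross.
crossover_years = -B_GENDER / B_INTERACTION

print(f"Female: {female_intercept:.0f} + {female_slope:.0f} * years")
print(f"Male:   {male_intercept:.0f} + {male_slope:.0f} * years")
print(f"Lines cross after about {crossover_years:.1f} years")
```

Before the crossover the fitted female salary is higher; after it the fitted male salary is higher and the gap widens every year, which is the pattern behind the conclusion above.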

Model 4 (Education Levels)
SPSS output (dependent variable: SalaryRs; predictors: (Constant), PostGraduate, ExpYears, Science, Gender, Commerce, Technology, GenderYearsExp; method: Enter; all requested variables entered)

Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .857   .735       .725                71223.906

ANOVA
Source       Sum of Squares      df    Mean Square        F        Sig.
Regression   2700356036352.063   7     385765148050.295   76.045   .000
Residual     973986191615.941    192   5072844748.000
Total        3674342227968.004   199

Coefficients (dependent variable: SalaryRs)
Term             B (unstandardized)   Std. Error   Beta (standardized)   t        Sig.
(Constant)       314754.124           18501.051                          17.013   .000
Gender           47144.931            16606.065    .164                  2.839    .005
ExpYears         17430.615            959.596      .902                  18.165   .000
GenderYearsExp   -12669.649           1486.425     -.569                 -8.524   .000
Commerce         9029.747             17412.168    .025                  .519     .605
Science          71980.851            28323.126    .104                  2.541    .012
Technology       41040.344            15900.612    .139                  2.581    .011
PostGraduate     112641.146           16425.224    .388                  6.858    .000

► What is the base category for ‘Education’?


► Interpret the β-value for employees with ‘Postgraduate’ degrees?
► What is the starting salary for a woman with a ‘Science’ degree?
► What is the β-value for ‘Technology’ if ‘Science’ is the base?
► What is the incremental salary for women every year?
Categorical Variables Interpretation

► What is the interpretation of a coefficient for a specific category of a categorical variable (e.g. ‘Science’, which belongs to ‘Education’)?
  • The coefficient of a categorical variable measures the average difference in the response variable between the category of interest and the reference (base) category
► Interaction between categorical and numerical variables can be modeled by calculating a new variable that is the product of the two variables
► Interaction captures the difference in association between the response variable and the numerical variable for different categories
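As an illustration of these interpretations, the sketch below applies them to the Model 4 coefficients. It rests on two assumptions that the output does not state explicitly: that ‘Arts’ is the omitted base category for Education (it appears in the dataset but not among the predictors), and that PostGraduate is a separate 0/1 indicator. The variable names are chosen here for illustration.

```python
# Coefficients copied from the Model 4 SPSS output (Gender = 1 for female).
coef = {
    "Constant": 314754.124, "Gender": 47144.931, "ExpYears": 17430.615,
    "GenderYearsExp": -12669.649, "Commerce": 9029.747, "Science": 71980.851,
    "Technology": 41040.344, "PostGraduate": 112641.146,
}

# Starting salary (0 years of experience) for a woman with a Science degree
# and no postgraduate degree (assumption: PostGraduate = 0).
start_female_science = coef["Constant"] + coef["Gender"] + coef["Science"]

# Yearly salary increment for women: common slope plus the interaction term.
female_increment = coef["ExpYears"] + coef["GenderYearsExp"]

# Re-basing Education on Science shifts every education coefficient by -Science.
tech_vs_science = coef["Technology"] - coef["Science"]

print(f"{start_female_science:.3f} {female_increment:.3f} {tech_vs_science:.3f}")
```

Re-basing changes only the education coefficients and the constant; the fitted salaries themselves stay the same.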

Stepwise Regression

► Suppose we wish to build a regression model from scratch
► We have identified the dependent variable and all possible independent variables
► In stepwise regression, we add one independent variable to the model at a time
► The criterion used for selection of the independent variable to be added to the model is based on part (semi-partial) correlation and significance level (p-in value)

Recap:
• Partial Correlation: correlation between Y and X2 after the effect of X1 has been removed from both Y and X2
• Semi-Partial (Part) Correlation: correlation between Y and X2 after the effect of X1 has been removed from X2 only
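Both definitions can be computed by "removing the effect of X1" through a simple regression and correlating the residuals. A minimal sketch on toy data (assumed values); `residuals` and `corr` are hypothetical helpers.

```python
import math
from statistics import mean

def residuals(y, x):
    """Residuals of the simple regression of y on x (with intercept)."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((a - x_bar) ** 2 for a in x))
    return [b - (y_bar + slope * (a - x_bar)) for a, b in zip(x, y)]

def corr(u, v):
    """Pearson correlation coefficient."""
    u_bar, v_bar = mean(u), mean(v)
    num = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    den = math.sqrt(sum((a - u_bar) ** 2 for a in u) * sum((b - v_bar) ** 2 for b in v))
    return num / den

# Toy data (assumed values).
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 9]
y = [3, 4, 8, 7, 12, 10, 15, 18]

x2_given_x1 = residuals(x2, x1)   # X2 with the effect of X1 removed
y_given_x1 = residuals(y, x1)     # Y with the effect of X1 removed

partial = corr(y_given_x1, x2_given_x1)   # X1 removed from both Y and X2
part = corr(y, x2_given_x1)               # X1 removed from X2 only
print(f"partial = {partial:.3f}, part = {part:.3f}")
```

The part correlation is never larger in magnitude than the partial correlation, because its denominator still contains the variability of Y that X1 already explains.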

► When a new variable is added to the model, an existing variable might become insignificant and can be removed from the model (p-out value)
► Process stops when:
  • No more variables are left to be considered
  • No variables can be added based on significance level (p-in)
  • No variables can be removed based on significance level (p-out)
Stepwise Regression

► Consider the ‘Gender Divide’ dataset
Pairwise correlation of each variable with Salary (Rs):

Variable            Correlation with Salary (Rs)
Gender              -0.350995309
Analytical Skills   -0.001946653
ORG Level A         -0.428023208
ORG Level B         -0.233537445
ORG Level C         -0.05460298
ORG Level D         0.139744912
ORG Level E         0.319852893
ORG Level F         0.673086708
Commerce            -0.209797013
Arts                -0.174808689
Science             0.02993951
Technology          -0.147985749
Post Graduate       0.442507476
Exp (Years)         0.616588227
Age                 0.376630814
Prior Exp (Years)   -0.067532526
Salary (Rs)         1

Stepwise Regression

► Begin with the variable that has the largest partial (and also part) correlation with the response variable
► Initially, this is the same as pairwise correlation (Identify)
► In our example, it is Org Level F (What is R²?)
► Run a simple linear regression with salary as the response variable and Org Level F as the predictor variable
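Picking the first variable can be sketched from the pairwise correlations of the ‘Gender Divide’ dataset. Ranking by absolute value is an assumption made here (a strongly negative correlation is just as informative); for a single predictor, R² is the squared correlation.

```python
# Pairwise correlations with Salary (Rs), copied from the 'Gender Divide' table.
corr_with_salary = {
    "Gender": -0.350995309, "Analytical Skills": -0.001946653,
    "ORG Level A": -0.428023208, "ORG Level B": -0.233537445,
    "ORG Level C": -0.05460298, "ORG Level D": 0.139744912,
    "ORG Level E": 0.319852893, "ORG Level F": 0.673086708,
    "Commerce": -0.209797013, "Arts": -0.174808689,
    "Science": 0.02993951, "Technology": -0.147985749,
    "Post Graduate": 0.442507476, "Exp (Years)": 0.616588227,
    "Age": 0.376630814, "Prior Exp (Years)": -0.067532526,
}

# Assumption: rank candidates by the magnitude of the correlation.
first_var = max(corr_with_salary, key=lambda name: abs(corr_with_salary[name]))
r_squared = corr_with_salary[first_var] ** 2

print(first_var, round(r_squared, 3))
```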

Stepwise Regression

SPSS output (dependent variable: SalaryRs; predictor: ORGLevelF)

Model Summary (R Square is the increase in R² from adding Org Level F)
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .673   .453       .450                100747.139

ANOVA (F is the partial F-value for Org Level F)
Source       Sum of Squares      df    Mean Square         F         Sig.
Regression   1664645006771.126   1     1664645006771.126   164.005   .000
Residual     2009697221196.878   198   10149985965.641
Total        3674342227968.004   199

Coefficients
Term         B (unstandardized)   Std. Error   Beta (standardized)   t        Sig.
(Constant)   453315.080           7367.360                           61.530   .000
ORGLevelF    370069.535           28897.165    .673                  12.806   .000

► Having entered a variable, do a partial F-test to test its
significance
► Initially, partial F-test is the same as F-test (obtained from
output) (only for the first variable)
► The F-test shows that the overall model is significant; so,
Org Level F stays

Stepwise Regression
Excluded variables (Model 1)
Variable           Beta In   t        Sig.   Partial Correlation   Tolerance (collinearity statistic)
Gender             -.143     -2.602   .010   -.182                 .889
ExpYears           .328      5.283    .000   .352                  .629
GenderYearsExp     -.060     -1.124   .262   -.080                 .980
Commerce           -.131     -2.514   .013   -.176                 .986
Science            -.004     -.066    .947   -.005                 .998
Technology         -.092     -1.754   .081   -.124                 .993
PostGraduate       .290      5.761    .000   .380                  .936
AnalyticalSkills   -.056     -1.070   .286   -.076                 .994
ORGLevelA          -.323     -6.682   .000   -.430                 .971
ORGLevelB          -.145     -2.774   .006   -.194                 .982
ORGLevelC          .032      .600     .549   .043                  .984
ORGLevelD          .212      4.178    .000   .285                  .989
ORGLevelE          .384      8.473    .000   .517                  .992
Arts               -.096     -1.821   .070   -.129                 .986
BirthYear          .123      2.164    .032   .152                  .835
PriorExpYears      .042      .797     .426   .057                  .974

Stepwise Regression

► Amongst the excluded variables, select the variable that has the largest partial correlation with the response variable
► Value of Org Level E partial correlation = 0.517
► What does this mean?
► Square of this partial correlation is 0.2673
► Interpret this value:
  • This implies that 26.73% of the residual variability of Model 1 is explained by the residual of Org Level E regressed on Org Level F

Stepwise Regression
► Value of Org Level E partial correlation = 0.517; its square is 0.2673, i.e., Org Level E explains 26.73% of the variability left unexplained by Model 1
► Variability under Model 1 (explanatory variable: Org Level F):
  • SSR = 1664645006771.126 (variability explained by Model 1)
  • SSE = 2009697221196.878 (variability unexplained by Model 1)
  • Total variability in data = SST = SSR + SSE = 3674342227968.004


Stepwise Regression (Method 1: Using SS)
► Model 1 (explanatory variable: Org Level F):
  • SSR = 1664645006771.126 (variability explained by Model 1)
  • SSE = 2009697221196.878 (variability unexplained by Model 1)
► Org Level E splits into two parts:
  • Correlation between Org Level E and Org Level F: part already explained by Org Level F
  • Residual between Org Level E and Org Level F: new variability explained by Org Level E
► Introducing Org Level E explains 26.73% of the SSE (orange); the remaining 73.27% stays unexplained (red)
► Squared partial correlation = Orange / (Orange + Red) = 0.2673, i.e., 26.73% of the variability unexplained by Model 1
Stepwise Regression
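► What we need is the semi-partial correlation of Org Level E
► Variability explained by Org Level E (not explained by Org Level F):
  $0.2673 \times SSE = 0.2673 \times 2009697221196.878 \approx 5.372 \times 10^{11}$
► Proportion of total variability explained by Org Level E:
  $5.372 \times 10^{11} / SST = 5.372 \times 10^{11} / 3674342227968.004 \approx 0.1462$
► Therefore, part correlation is $\sqrt{0.1462} \approx 0.382$
► The squared semi-partial correlation of Org Level E is analogous to Orange / (Orange + Red + Blue) ≈ 0.1462, where Blue is the variability explained by Model 1 (SSR = 1664645006771.126) and Orange + Red is the variability unexplained by Model 1 (SSE = 2009697221196.878)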
Stepwise Regression (Method 2: Using R²)
► Model 1 (explanatory variable: Org Level F), as shares of SST (total variability in data):
  • R² = 0.453 (variability explained by Model 1)
  • 1 – R² = 0.547 (variability unexplained by Model 1)
► Value of Org Level E partial correlation = 0.517; its square is 0.2673
► Introducing Org Level E explains 26.73% of the unexplained 0.547 (orange); the remaining 73.27% stays unexplained (red)
► Squared partial correlation = Orange / (Orange + Red) = 0.2673, i.e., 26.73% of the variability unexplained by Model 1

Stepwise Regression

► Therefore, total variability explained by Model 2 (after introducing Org Level E) is:
  $R^2_2 = R^2_1 + 0.2673 \times (1 - R^2_1) = 0.453 + 0.2673 \times 0.547 \approx 0.599$
► Part (semi-partial) correlation of Org Level E can also be calculated as:
  $\sqrt{R^2_2 - R^2_1} = \sqrt{0.599 - 0.453} \approx 0.382$
► This is the same as:
  $0.517 \times \sqrt{1 - R^2_1} = 0.517 \times \sqrt{0.547} \approx 0.382$
► Model 2 (explanatory variables: Org Level F, Org Level E):
  • R² = 0.599 (variability explained by Model 2)
  • 1 – R² = 0.401 (variability unexplained by Model 2)

Stepwise Regression

SPSS output (dependent variable: SalaryRs; predictors: ORGLevelF, ORGLevelE)

Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
2       .774   .599       .595                86469.339

ANOVA
Source       Sum of Squares      df    Mean Square         F         Sig.
Regression   2201383751625.989   2     1100691875812.995   147.211   .000
Residual     1472958476342.015   197   7476946580.416
Total        3674342227968.004   199

Coefficients
Term         B (unstandardized)   Std. Error   Beta (standardized)   t        Sig.
(Constant)   434259.759           6711.322                           64.706   .000
ORGLevelF    389124.856           24903.646    .708                  15.625   .000
ORGLevelE    169683.098           20027.155    .384                  8.473    .000
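The partial and part correlations of Org Level E can be cross-checked directly from the sums of squares of the two ANOVA tables. A small sketch, with variable names chosen here for illustration; because it uses unrounded sums of squares, its results can differ in the third or fourth decimal from figures derived from the rounded value 0.517.

```python
import math

# Sums of squares copied from the ANOVA tables of Model 1 (Org Level F)
# and Model 2 (Org Level F + Org Level E).
SST = 3674342227968.004
SSR_1, SSE_1 = 1664645006771.126, 2009697221196.878
SSR_2 = 2201383751625.989

extra_ss = SSR_2 - SSR_1        # variability newly explained by Org Level E
partial_sq = extra_ss / SSE_1   # share of Model 1's unexplained variability
part_sq = extra_ss / SST        # share of the total variability

print(f"partial correlation = {math.sqrt(partial_sq):.3f}")
print(f"part correlation    = {math.sqrt(part_sq):.3f}")
print(f"R^2 of Model 2      = {SSR_2 / SST:.3f}")
```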

Stepwise Regression

► Now, Org Level E has also been added to the model as the second explanatory variable
► Does the presence of Org Level F impact the significance of Org Level E as a predictor?
► Do a “partial F-test” for Org Level E
  • As p-value is less than 0.05 (p-in value), Org Level E stays
► Repeat analysis for Org Level F:
  • If p-value is greater than 0.10 (p-out value), remove Org Level F
► Note that: p-in < p-out (always)
► Default values used by SPSS: p-in = 0.05 and p-out = 0.10
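The partial F statistic for Org Level E can be sketched from the two ANOVA tables: the extra regression sum of squares gained by adding the variable, divided by the residual mean square of the larger model. For a single added variable it equals the square of that variable's t statistic. The variable names below are chosen for illustration.

```python
# Values copied from the Model 1 and Model 2 ANOVA and coefficient tables.
SSR_1 = 1664645006771.126   # regression SS with Org Level F only
SSR_2 = 2201383751625.989   # regression SS with Org Level F and Org Level E
MSE_2 = 7476946580.416      # residual mean square of Model 2 (197 df)
T_ORG_LEVEL_E = 8.473       # t statistic of Org Level E in Model 2

# Partial F for adding one variable: extra regression SS (1 df) over the full model's MSE.
partial_f = (SSR_2 - SSR_1) / 1 / MSE_2

print(f"partial F = {partial_f:.2f}, t^2 = {T_ORG_LEVEL_E ** 2:.2f}")
```

A partial F this far above the 5% critical value for (1, 197) degrees of freedom (about 3.9) corresponds to a p-value well below p-in, so Org Level E stays.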

Stepwise Regression

► SST = Total variability in data = SSR + SSE
  • SSR = Variability in data explained = part explained by X1 + part explained by X2 + part explained by X3 + part explained by X4 + …
  • SSE = Part not explained by the variables

Comments on Stepwise Regression
► Step 0: Find pairwise correlations between the dependent variable and all independent variables; select the one with the largest correlation, and add it to the “included variables” list. Go to Step 1
► Step 1: Develop a linear regression model between the dependent variable and all variables in the “included variables” list. Do a partial F-test to determine if the last included variable is insignificant (p-in value). If yes, go to Step 4; else, go to Step 2
► Step 2: Do a partial F-test and check if any of the included variables are insignificant (p-out value). If yes, remove the most insignificant variable from the list and add it to the “excluded variables” list. Go to Step 3
► Step 3: Look at the “excluded variables” list; if none exist, stop and return the current model as the final model. Else, select the one with the largest partial correlation. Go to Step 5
► Step 4: Remove the last included variable (it failed the p-in test); stop and return the current model as the final model
► Step 5: Add the selected variable to the “included variables” list and go to Step 1
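The steps above can be sketched in plain Python. This is a simplified illustration, not the SPSS implementation: it uses F-to-enter and F-to-remove thresholds (`f_in`, `f_out`, assumed values) in place of p-in and p-out, because the standard library has no F distribution; `f_out < f_in` mirrors p-in < p-out. The data and the helper names `sse`, `partial_f`, and `stepwise` are hypothetical.

```python
def sse(columns, y):
    """Residual sum of squares of an OLS fit of y on the given columns plus an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((y[r] - sum(b * x for b, x in zip(beta, X[r]))) ** 2 for r in range(n))

def partial_f(data, y, included, var):
    """Partial F statistic of `var` within the model that uses `included` (var must be in it)."""
    full = sse([data[v] for v in included], y)
    reduced = sse([data[v] for v in included if v != var], y)
    dof = len(y) - len(included) - 1
    return (reduced - full) / (full / dof)

def stepwise(data, y, f_in=4.0, f_out=3.9):
    """Return the list of selected variable names, in order of entry."""
    included = []
    while True:
        candidates = [v for v in data if v not in included]
        if not candidates:
            break
        # Entry step: largest partial F given the current model
        # (equivalent to the largest squared partial correlation).
        best = max(candidates, key=lambda v: partial_f(data, y, included + [v], v))
        if partial_f(data, y, included + [best], best) < f_in:
            break
        included.append(best)
        # Removal step: drop earlier variables that fell below f_out.
        while len(included) > 1:
            weakest = min(included[:-1], key=lambda v: partial_f(data, y, included, v))
            if partial_f(data, y, included, weakest) >= f_out:
                break
            included.remove(weakest)
    return included

# Toy data (assumed): y depends on x1 and x2; x3 is unrelated apart from chance.
data = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "x2": [2, 7, 1, 8, 2, 8, 1, 8, 2, 8, 4, 5],
    "x3": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 9],
}
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.3, -0.3, 0.1, 0.2, -0.2]
y = [3 * a + 2 * b + e for a, b, e in zip(data["x1"], data["x2"], noise)]

selected = stepwise(data, y)
print(selected)
```

Dropping the removal step turns this sketch into forward selection; starting from all variables and applying only the removal test gives backward elimination.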

Variable Selection Methods

Forward Regression

► Suppose we wish to build a regression model from scratch
► We have identified the dependent variable and all possible independent variables
► Calculate pairwise correlations between the dependent variable and the independent variables
► The criterion used for selection of the independent variable to be added to the model is based on part (semi-partial) correlation and significance level (p-in value)
► Once a new variable is added to the model, it stays in the model
► When a variable is added, recalculate p-values for all variables
► Process stops when:
  • No more independent variables are left to be considered
  • No variables can be added based on significance level (p-in)

Backward Regression

► Suppose we wish to build a regression model from scratch
► We have identified the dependent variable and all possible independent variables
► Start with all independent variables in the model
► Several of them may be insignificant
► Remove one independent variable at a time, the least significant first, based on its significance level (p-out value)
► Once a variable is removed from the model, it never reenters
► Recalculate p-values for all other variables after a variable is removed
► Process stops when:
  • No more independent variables are left to be considered
  • No variables can be removed based on significance level (p-out)
