Multiple Linear Regression
Chapter 5:
Multiple predictor variables
ANOVA and ANCOVA
Polynomial Regression
Assumptions for a valid model
Chapter 6:
Leverage points
Transformations
Relationships between explanatory variables:
Multicollinearity
Interactions
Chapter 7:
Variable Selection
Types of Multiple Linear
Regression
Polynomial Regression (curves)
ANOVA, Regression, and Analysis of Covariance
Model: $Y = X\beta + e$
Regression: all explanatory variables are continuous.
ANOVA: all explanatory variables are categorical.
ANCOVA: some explanatory variables are continuous and some are categorical.
ANOVA (ANalysis Of VAriance)
One-way ANOVA, without an intercept:
$y_{ij} = \mu_i + e_{ij}, \quad i = 1, 2, 3$
Design Matrix:
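A minimal sketch, assuming two observations in each of the three groups (the group sizes here are an assumption):
$$X = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}$$
Each column is the indicator for one of the three group means $\mu_1, \mu_2, \mu_3$.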
ANOVA
One-way ANOVA, with an intercept:
$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + e_i$
Design matrix:
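Under the same two-observations-per-group assumption, with group 3 as the baseline:
$$X = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \end{pmatrix}$$
The columns are the intercept, the indicator $x_1$ for group 1, and the indicator $x_2$ for group 2.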
Two-Way ANOVA
Model:
Design matrix:
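One common form, as a sketch, assuming no interaction term:
$y_{ijk} = \mu + \alpha_i + \beta_j + e_{ijk}$
where $\alpha_i$ are the effects of factor A and $\beta_j$ the effects of factor B. The design matrix then has an intercept column plus indicator columns for the non-baseline levels of each factor.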
Analysis of Covariance
Suppose we have three groups, and want to compare the three means, holding the value of a quantitative variable x constant. With a common slope $\beta$ (the same-slopes case shown on the next slide):
Group 1: $y = \mu_1 + \beta x + e$
Group 2: $y = \mu_2 + \beta x + e$
Group 3: $y = \mu_3 + \beta x + e$
ANCOVA: Same Slopes; Simpson's Paradox
[Figure: Crime vs. Education for Rural and Urban groups, with parallel fitted lines.]
Interactions
An interaction between two input variables exists when the effect of one input (X1) on the target variable (Y) is different for different values of the other input (X2). (A short R sketch follows the examples below.)
Ex: Adding sugar (X1) to coffee makes it much sweeter (Y) when the coffee is stirred (X2).
Ex: Injecting one more unit of sand (X1) into a fracking well has a larger effect on oil production (Y) when the well is shorter (X2).
Ex: Third-grade math students with ADHD (X1) have lower math scores (Y) if they studied while listening to music (X2), while those without ADHD had higher math scores when listening to music.
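A minimal R sketch of fitting such an interaction (the data here are simulated, not from the examples above):

set.seed(1)
dat <- data.frame(x1 = runif(100), x2 = runif(100))
dat$y <- 1 + 2 * dat$x1 + 3 * dat$x2 + 4 * dat$x1 * dat$x2 + rnorm(100, sd = 0.5)
fit <- lm(y ~ x1 * x2, data = dat)  # x1 * x2 expands to x1 + x2 + x1:x2
summary(fit)                        # the x1:x2 row tests the interaction term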
Interactions: ANOVA
[Figure: interaction plot of mean iron content by pot type, with a separate line for each dish (meat, legumes, vegetable); the lines are not parallel.]
ANOVA: Interactions
What would the model and design matrix look like in the
case of the iron pot example?
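One plausible answer, sketched under the assumption (from the plot above) that pot and dish each have three levels:
$y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + e_{ijk}$
with $\alpha_i$ the pot effects, $\beta_j$ the dish effects, and $(\alpha\beta)_{ij}$ the interaction; the design matrix adds columns that are products of the pot and dish indicator columns. In R (data frame name hypothetical):

fit <- lm(iron ~ pot * dish, data = pots)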
ANCOVA Model
Where x1 is a quantitative variable and x2 is an
indicator (dummy) variable, write down the model for
separate slopes, separate intercepts:
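One standard answer:
$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{1i} x_{2i} + e_i$
When $x_2 = 0$ the line is $\beta_0 + \beta_1 x_1$; when $x_2 = 1$ it is $(\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_1$, so $\beta_2$ shifts the intercept and $\beta_3$ shifts the slope.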
ANCOVA Example
Rats are randomly assigned to be fed 0, 2, 4, or 6 mg of one of two carcinogenic substances. The response variable y is the number of tumors recorded. What should the model look like? What should the design matrix look like?
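A sketch in R, assuming dose is treated as quantitative, substance as a two-level factor, and a hypothetical data frame rats:

rats$substance <- factor(rats$substance)
fit <- lm(tumors ~ dose * substance, data = rats)  # separate intercept and slope per substance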
Multiple Regression: Interactions
Interactions vs. Simpson's Paradox
Simpson's paradox
Going from one explanatory variable to two, slopes can change direction, but the lines can still be parallel.
More education is associated with more crime until we add the urban/rural variable. But education has the SAME effect in both urban and rural areas: decreasing crime.
Interactions
Going from two explanatory variables to two plus a product (multiplier) term. The lines cross.
Cooking in an iron pot has an even LARGER effect in the presence of meat (and tomatoes).
Multiple Regression:
Interactions
Interactions between
quantitative variables: fit
different kinds of surfaces.
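A sketch of what fitting such a surface looks like: with an interaction between two quantitative predictors, the fitted surface is a twisted plane rather than a flat one (simulated data, hypothetical names):

set.seed(2)
dat <- data.frame(x1 = runif(200), x2 = runif(200))
dat$y <- 1 + dat$x1 + dat$x2 + 5 * dat$x1 * dat$x2 + rnorm(200, sd = 0.3)
fit <- lm(y ~ x1 * x2, data = dat)
s <- seq(0, 1, length.out = 30)
z <- matrix(predict(fit, newdata = expand.grid(x1 = s, x2 = s)), nrow = 30)
persp(s, s, z, xlab = "x1", ylab = "x2", zlab = "fitted y")  # twisted plane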
The Fallacy of Bivariate
Thinking
Plots of y against x1 alone, or against x2 alone, may show no relationship, even when y depends strongly on x1 and x2 jointly.
Rank of X
The rank of our design matrix X should equal the number of columns of X. If it doesn't, we say the design matrix is not full rank.
Rank of X
R error message:
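A minimal sketch of what R reports when X is not full rank (here a duplicated column, so one coefficient cannot be estimated):

x1 <- 1:10
x2 <- 2 * x1                # x2 is an exact linear combination of x1
y  <- rnorm(10)
summary(lm(y ~ x1 + x2))    # the x2 coefficient is NA: "(1 not defined because of singularities)"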
Step 1: Multiple Regression
If this p-value (from the overall F-test for the full model) is large, STOP.
Step 2: Multiple Regression
Example: oil company and well type have large p-values for
predicting well performance (oil output).
Italian Restaurants: Houston
plot(Italian)
[Figure: scatterplot matrix of the Italian data (Restaurant, Food, Decor, Service, Cost, Pct_Liked).]
Italian Restaurants: Houston
my.lm <- lm(Italian$Food ~ Italian$Service + Italian$Pct_Liked + Italian$Cost)
anova(my.lm)

Response: Italian$Food
                   Df Sum Sq Mean Sq F value    Pr(>F)
Italian$Service     1 64.619  64.619  36.369 2.605e-07
Italian$Pct_Liked   1 36.510  36.510  20.548 4.133e-05
Italian$Cost        1  1.860   1.860   1.047    0.3116
Residuals          46 81.731   1.777

Residual standard error: 1.333 on 46 degrees of freedom
(15 observations deleted due to missingness)
Multiple R-squared: 0.5575, Adjusted R-squared: 0.5287
F-statistic: 19.32 on 3 and 46 DF, p-value: 2.989e-08
Italian Restaurants: Houston
my.lm <- lm(Italian$Food ~ Italian$Service + Italian$Pct_Liked + Italian$Cost)
summary(my.lm)

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)         3.68923    2.66400   1.385  0.17278
Italian$Service     0.47849    0.11744   4.074  0.00018
Italian$Pct_Liked   0.11304    0.02461   4.594 3.38e-05
Italian$Cost        0.02236    0.02185   1.023  0.31156
Italian Restaurants: Houston
New model:
Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)        10.77121    2.32996   4.623 2.97e-05
Italian$Pct_Liked   0.13047    0.02797   4.665 2.58e-05
Italian$Cost        0.03510    0.01927   1.821   0.0749
The coefficient for %Liked is smaller than the coefficient for Service;
does that mean Service is more important when it comes to predicting
Food rating?
The p-value for %Liked is smaller than the p-value for Cost; does that
mean the association between %Liked and Food rating is stronger than
the association between Cost and Food rating?
Polynomial Regression
If a model is linear in the parameters, it means we can write it as:
$Y = X\beta + e$
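For example, a quadratic in x is still linear in the parameters:
$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + e_i$
so the design matrix simply gains a column of $x_i^2$ values.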
Salary Example
Salary Example: SLR
[Figure: Salary vs. Years Experience with the fitted line (left); Standardized Residuals vs. Years Experience (right).]
Salary Example: Parabola
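A sketch of the quadratic fit in R, assuming a data frame salary.df with columns Salary and YearsExp (the names are hypothetical):

fit2 <- lm(Salary ~ YearsExp + I(YearsExp^2), data = salary.df)
summary(fit2)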
Salary Example: Parabola
[Figure: Standardized Residuals vs. Years Experience (left) and Leverage vs. Years Experience (right) for the quadratic fit.]
Salary Example: Parabola
[Figure: R diagnostic plots for the quadratic fit: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance; observations 29, 77, and 118 are flagged.]
Approach: Center the explanatory variable
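A sketch of the centering step, using the same hypothetical salary.df; centering reduces the correlation between the linear and quadratic terms:

yrs.c <- salary.df$YearsExp - mean(salary.df$YearsExp)  # centered predictor
fit2c <- lm(salary.df$Salary ~ yrs.c + I(yrs.c^2))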
Model Reduction Method 1
Partial F-Test
Let RSS(Full) be the residual sum of squares from the model with all the predictors $\beta_1, \ldots, \beta_p$. Let RSS(Reduced) be the residual sum of squares from the reduced model (with only the remaining predictors, the ones we don't think are 0), and let the df's be the error degrees of freedom of each model. Then the F-statistic for testing whether the dropped coefficients are 0 is given by:
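The standard form of this statistic:
$$F = \frac{\bigl(\mathrm{RSS(Reduced)} - \mathrm{RSS(Full)}\bigr) / (df_{Reduced} - df_{Full})}{\mathrm{RSS(Full)} / df_{Full}}$$
compared to an F distribution with $df_{Reduced} - df_{Full}$ and $df_{Full}$ degrees of freedom.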
Model Reduction Method 2
rank(A) = r
R²: Adding Variables
Adding predictor variables to the regression equation never decreases R², even when the added variables are irrelevant.
Solution: adjusted R²
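One common form of the adjustment, with $n$ observations and $p$ predictors:
$$R^2_{adj} = 1 - \frac{RSS/(n - p - 1)}{TSS/(n - 1)}$$
which penalizes predictors that do not reduce RSS enough to justify the lost degree of freedom.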
The mean function might not be modeled correctly because:
We didn't add variables we should have.
Simpson's paradox: slope signs change.
Fallacy of bivariate thinking: p-values could be large when they should be small.