
+

Stat 608 Chapter 5


+
Note to self

Add HW question on alternative hypothesis for model reduction

Add discussion on familywise error rate under model reduction section

2
+
Multiple Linear Regression
Chapter 5:
Multiple predictor variables
ANOVA and ANCOVA
Polynomial Regression
Assumption that model is valid

Chapter 6:
Leverage points
Transformations
Relationships between explanatory variables:
Multicollinearity
Interactions

Chapter 7:
Variable Selection

3
+
Types of Multiple Linear Regression
Polynomial Regression (curves)

Quantitative and Categorical explanatory variables: ANCOVA (separate lines)

Many explanatory variables (multiple dimensions)

4
+
ANOVA & REGRESSION & ANALYSIS OF COVARIANCE
Response Y, explanatory variables X, error e:

All explanatory variables are continuous: REGRESSION
Some are categorical and some are continuous: ANALYSIS OF COVARIANCE
All are categorical: ANOVA

5
+
ANOVA (ANalysis Of VAriance)
One-way ANOVA, without an intercept:

yi = μj + ei, for an observation in group j = 1, 2, 3

Equivalently, using one indicator variable per group:

yi = β1 x1i + β2 x2i + β3 x3i + ei

Design Matrix:

Pro: (XᵀX) is diagonal: easy to invert!

Con: When we have two-way ANOVA, we have to cut out the last indicator variable.
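A minimal R sketch of this design matrix (group labels and responses are made up for illustration):

group <- factor(rep(c("A", "B", "C"), each = 2))   # hypothetical groups
y <- c(5.1, 4.8, 6.2, 6.0, 7.3, 7.1)               # made-up responses

X <- model.matrix(~ group - 1)   # one indicator column per group, no intercept
t(X) %*% X                       # diagonal: each entry is a group sample size

fit <- lm(y ~ group - 1)
coef(fit)                        # the three group means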

6
+
ANOVA
One-way ANOVA, with an intercept:

yi = β0 + β1 x1i + β2 x2i + ei

Design matrix:
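A minimal R sketch (same hypothetical groups as above); with an intercept in the model, R keeps the intercept column and drops one indicator (the reference level) so the matrix stays full rank:

group <- factor(rep(c("A", "B", "C"), each = 2))   # hypothetical groups

X <- model.matrix(~ group)   # intercept + indicators for all but the reference level
X
# The intercept estimates the reference group's mean; the remaining
# coefficients estimate differences from that reference mean.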

7
+
Two-Way ANOVA

Model:

yi = β0 + β1 x1i + β2 x2i + β3 x3i + ei

Design matrix:
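A minimal R sketch (factor names and levels are made up) of a two-way ANOVA design matrix; as noted earlier, with an intercept in the model one indicator from each factor is cut out to keep X full rank:

A <- factor(rep(c("a1", "a2"), times = 4))   # hypothetical 2-level factor
B <- factor(rep(c("b1", "b2"), each = 4))    # hypothetical 2-level factor

X <- model.matrix(~ A + B)   # intercept + one indicator for A + one for B
X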

8
+
Analysis of Covariance
Suppose we have three groups, and want to compare the three means, holding the value of a quantitative variable x constant.

Example: Compare diets, holding starting BMI constant.

It's possible to create three separate regression lines, as shown below. We might also create three lines with separate intercepts but the same slope, or three lines with separate slopes but the same intercept.

Group 1:

Group 2:

Group 3:

9
+
ANCOVA: Same slopes; Simpson's Paradox

[Scatterplot: Crime vs. Education for Rural and Urban areas, with parallel fitted lines for the two groups]

10
+
Interactions
An interaction between two input variables exists when the effect of one input (X1) on the target variable (Y) is different for different values of the other input (X2).

Ex: Adding sugar (X1) to coffee makes it much sweeter (Y) when the coffee is stirred (X2).

Ex: Injecting one more unit of sand (X1) into a fracking well has a larger effect on oil production (Y) when the well is shorter (X2).

Ex: The amount of iron in food (Y) is higher when cooking in a cast iron pot (X1). While tomatoes have a tiny amount of iron in them, the acidity in tomatoes means their presence in food (X2) has a multiplicative effect on cast iron.

Ex: Third-grade math students with ADHD (X1) have lower math scores (Y) if they studied while listening to music (X2), while those without ADHD had higher math scores when listening to music.

+
Interactions: ANOVA

[Interaction plot: mean iron content of the dish vs. pot type (Aluminum, Clay, Iron), with separate lines for dish type (meat, legumes, vegetable)]

+
ANOVA: Interactions
What would the model and design matrix look like in the
case of the iron pot example?
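A minimal R sketch (factor levels taken from the plot above; the data themselves are not needed to build the design matrix) of an ANOVA model with an interaction:

pot  <- factor(rep(c("Aluminum", "Clay", "Iron"), each = 3))
dish <- factor(rep(c("meat", "legumes", "vegetable"), times = 3))

X <- model.matrix(~ pot * dish)   # main effects plus interaction columns
colnames(X)                       # 1 + 2 + 2 + 4 = 9 columns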

+
ANCOVA Model
Where x1 is a quantitative variable and x2 is an indicator (dummy) variable, write down the model for separate slopes, separate intercepts:

Write down the model with separate slopes, but the same intercept:

Write down the model with separate intercepts, but the same slope:
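One common way to write these (a sketch using an x1·x2 interaction term; the β labels are my own):

Separate slopes and separate intercepts:  y = β0 + β1 x1 + β2 x2 + β3 x1 x2 + e
Separate slopes, same intercept:          y = β0 + β1 x1 + β3 x1 x2 + e
Separate intercepts, same slope:          y = β0 + β1 x1 + β2 x2 + e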

+
ANCOVA Example
Rats are randomly assigned to be fed 0, 2, 4, or 6 mg of one of two cancer-causing substances. The response variable y is the number of tumors recorded. What should the model look like? What should the design matrix look like?
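A minimal R sketch (variable names and counts are made up) treating dose as quantitative and substance as a factor, with an interaction so each substance gets its own dose slope:

dose      <- rep(c(0, 2, 4, 6), times = 2)           # hypothetical design
substance <- factor(rep(c("S1", "S2"), each = 4))    # hypothetical labels
tumors    <- c(1, 3, 4, 7, 0, 2, 3, 4)               # made-up counts

X   <- model.matrix(~ dose * substance)   # design matrix with interaction
fit <- lm(tumors ~ dose * substance)      # separate dose slope for each substance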

+
Multiple Regression: Interactions

If an interaction exists, there are many possible things that could be true:

The relationship could change direction in the presence of the third variable: the relationship between X1 and Y is positive before taking into account X2, and negative afterward.

The relationship might not change direction in the presence of a third variable, but merely have a dramatic multiplicative effect: fast driving (X1) is much more dangerous (Y) when drunk (X2).

Main effects might not be significant, while the interaction is significant.

Cause and effect could run in many possible directions, but we can only scientifically establish cause and effect through direct experimentation.

+
Interactions vs. Simpson's Paradox

Simpson's paradox
Going from one explanatory variable to two, slopes change direction, but the lines can still be parallel.
More education is associated with more crime until we add the urban/rural variable. But education has the SAME effect in both urban and rural areas: decreasing crime.

Interactions
Going from two explanatory variables to two with a multiplier effect: the lines cross.
Cooking in an iron pot has an even LARGER effect in the presence of meat (and tomatoes).

17
+
Multiple Regression: Interactions
Interactions between quantitative variables: fit different kinds of surfaces.

18
+
The Fallacy of Bivariate Thinking

19
+
The Fallacy of Bivariate Thinking
Plots of x1 vs. y or x2 vs. y may seem to show no relationship.

But the two variables working together may explain much more of the variation in y.

Ex: Weight, Height, and Body fat percentage.

20
+
Rank of X
The rank of our design matrix X should equal the number of columns of X. We say our design matrix is not full rank if it isn't.

If rank(X) < # columns = p + 1, that means some linear combination of the other variables adds up to one of the variables. Why do we need that extra variable?

If rank(X) < p + 1, then rank(XᵀX) < p + 1, so XᵀX is not invertible.

If n < p + 1, then rank(X) < p + 1, because the rank of X has to be less than or equal to both the number of columns and the number of rows of X. Get a bigger sample size or get rid of some variables.
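A minimal R sketch (variables made up) of a rank-deficient design: x3 is an exact linear combination of x1 and x2, so its coefficient cannot be estimated:

set.seed(1)
x1 <- rnorm(20)
x2 <- rnorm(20)
x3 <- x1 + x2                      # exact linear combination: X is not full rank
y  <- 2 + x1 - x2 + rnorm(20)

fit <- lm(y ~ x1 + x2 + x3)
summary(fit)                       # x3 is NA: "1 not defined because of singularities"
qr(model.matrix(fit))$rank         # 3, even though the matrix has 4 columns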

21
+
Rank of X
R error message:

Coefficients: (1 not defined because of singularities)

SAS error message:

Note: Model is not full rank. Least-squares solutions for the parameters are not unique. Some statistics will be misleading. A reported DF of 0 or B means that the estimate is biased.

Note: The following parameters have been set to 0, since the variables are a linear combination of other variables as shown.

22
+
Step 1: Multiple Regression

Always conduct this test first (the overall F-test for the model), before looking at tests for individual variables (next slide).
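For a model with p predictors, the overall F-test (reported at the bottom of R's summary() output) tests:

H0: β1 = β2 = ... = βp = 0 (none of the predictors is linearly related to y)
Ha: at least one βj ≠ 0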

If this p-value is large, STOP.

23
+
Step 2: Multiple Regression

Note: If we start conducting these tests for many of the variables, performing p separate t-tests, our overall Type I error rate increases.

We also run into problems when the predictor variables are highly correlated with each other (see Chapter 7).

24
+
Step 2: Multiple Regression

We often write the null hypothesis H0: βj = 0 without thinking.

The implicit assumption is that all the other variables are still in the model.

If the p-value for one variable is large, the conclusion is that the variable has no significant additional effect on the response variable (after all the other variables are entered into the model).

When predictors are correlated, important variables may become insignificant.

Example: oil company and well type have large p-values for predicting well performance (oil output).

25
+
Italian Restaurants: Houston

plot(Italian)

[Scatterplot matrix of the variables Restaurant, Food, Decor, Service, Cost, and Pct_Liked]

26
+
Italian Restaurants: Houston

my.lm <- lm(Italian$Food ~ Italian$Service + Italian$Pct_Liked + Italian$Cost)

anova(my.lm)

Response: Italian$Food
                   Df Sum Sq Mean Sq F value    Pr(>F)
Italian$Service     1 64.619  64.619  36.369 2.605e-07
Italian$Pct_Liked   1 36.510  36.510  20.548 4.133e-05
Italian$Cost        1  1.860   1.860   1.047    0.3116
Residuals          46 81.731   1.777

Residual standard error: 1.333 on 46 degrees of freedom
  (15 observations deleted due to missingness)
Multiple R-squared: 0.5575, Adjusted R-squared: 0.5287
F-statistic: 19.32 on 3 and 46 DF, p-value: 2.989e-08

27
+
Italian Restaurants: Houston

my.lm <- lm(Italian$Food ~ Italian$Service + Italian$Pct_Liked + Italian$Cost)

summary(my.lm)

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)          3.68923    2.66400   1.385  0.17278
Italian$Service      0.47849    0.11744   4.074  0.00018
Italian$Pct_Liked    0.11304    0.02461   4.594 3.38e-05
Italian$Cost        -0.02236    0.02185  -1.023  0.31156

28
+
Italian Restaurants: Houston

The coefficient for cost is negative; does that make sense?


Interpret the slope for Cost in context.

The p-value for cost is large; does that make sense?

29
+
Italian Restaurants: Houston
New model:
Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)        10.77121    2.32996   4.623 2.97e-05
Italian$Pct_Liked   0.13047    0.02797   4.665 2.58e-05
Italian$Cost        0.03510    0.01927   1.821   0.0749

The coefficient for %Liked is smaller than the coefficient for Service;
does that mean Service is more important when it comes to predicting
Food rating?

The p-value for %Liked is smaller than the p-value for Cost; does that
mean the association between %Liked and Food rating is stronger than
the association between Cost and Food rating?

30
+
Italian Restaurants: Houston

Which variable is most important?


Sorting by F-value in ANOVA table, t-value (if all variables are
quantitative), or p-value will give the same results.
If all predictors are independent of each other, sorting by
their correlation with the response will also give the same
results.

Ex: Oil Production

31
+
Italian Restaurants: Houston

Calculate an approximate 95% confidence interval for the slope for Service.
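As a rough check, using the full-model output above and the approximation estimate ± 2 SE: 0.47849 ± 2(0.11744), i.e., about (0.24, 0.71).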

32
+
Polynomial Regression

Is the following model linear in the parameters?

If a model is linear in the parameters, it means we can write it as:

Y = Xβ + e

33
+
Polynomial Regression

What does the design matrix look like for the following model?
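A minimal R sketch (made-up x values) for a quadratic polynomial model; each power of x becomes a column of the design matrix:

x <- 1:6                          # hypothetical predictor values

X <- model.matrix(~ x + I(x^2))   # columns: intercept, x, x^2
X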

34
+
Salary Example

A salary curve relates salary to years of experience. Employees might use it to find out where they stand among their peers. Personnel might use it to consider salary adjustments when hiring new professionals.

When we fit the simple linear regression model below, we get the residual plot on the next slide.

35
+
Salary Example: SLR

[Left: scatterplot of Salary vs. Years Experience. Right: standardized residuals vs. Years Experience, showing clear curvature]

36
+
Salary Example: Parabola

Clearly there is a non-linear relationship between salary and years of experience.

Next we fit a model that adds a quadratic term in years of experience.

The cutoff for high leverage is hii > 2(p+1)/n when there are p predictors plus 1 intercept.
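A minimal R sketch of the quadratic fit and the leverage check; the data here are simulated stand-ins (the real salary data are not reproduced in these notes):

set.seed(608)
Experience <- runif(100, 0, 35)                                        # made-up years of experience
Salary     <- 35 + 2.5 * Experience - 0.05 * Experience^2 + rnorm(100, sd = 3)  # made-up salaries

fit2   <- lm(Salary ~ Experience + I(Experience^2))
h      <- hatvalues(fit2)                 # leverages h_ii
cutoff <- 2 * (2 + 1) / length(Salary)    # 2(p+1)/n with p = 2 predictors (x and x^2)
which(h > cutoff)                         # observations flagged as high leverage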

37
+
Salary Example: Parabola

[Left: standardized residuals vs. Years Experience for the quadratic fit. Right: leverage vs. Years Experience]

38
+
Salary Example: Parabola

[Standard R diagnostic plots for the quadratic fit: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance; observations 29, 77, and 118 are flagged]

39
+
Approach: Center the explanatory variable
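The details of this slide were images and did not survive extraction; a minimal sketch of the usual idea (continuing the simulated data from the previous sketch), where centering Experience before squaring reduces the correlation between the linear and quadratic columns:

Experience.c <- Experience - mean(Experience)          # centered predictor
fit2c <- lm(Salary ~ Experience.c + I(Experience.c^2))

cor(Experience, Experience^2)       # typically large
cor(Experience.c, Experience.c^2)   # much smaller when x is not too skewed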

40
+
Model Reduction Method 1: Partial F-Test

Suppose we have the model above, and are interested in testing whether a set of k of the slope coefficients are all 0, against the alternative hypothesis

Ha: At least one of those k parameters is not 0.

That is, the question is, "Can we drop these k variables from our model?" This can be tested using an F-test.
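For instance, if the k variables being dropped are the last k predictors, the null hypothesis is

H0: βp−k+1 = βp−k+2 = ... = βp = 0.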

41
+
Model Reduction Method 1: Partial F-Test
Let RSS(Full) be the residual sum of squares from the model with all the predictors x1, ..., xp. Let RSS(Reduced) be the residual sum of squares from the reduced model (with only the remaining predictors, i.e., those we don't think are 0), and let the df's be the corresponding error degrees of freedom. Then the F-statistic for testing the above hypotheses is given by:
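The formula itself did not come through in the extracted text; the standard partial F-statistic matching this description is:

F = [ (RSS(Reduced) − RSS(Full)) / (dfReduced − dfFull) ] / [ RSS(Full) / dfFull ],

which is compared to an F distribution with (dfReduced − dfFull) and dfFull degrees of freedom.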

42
+
Model Reduction Method 2

Ha: At least one of these is not equal.

rank(A) = r
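The hypotheses on this slide were images and did not survive extraction; a sketch of the setup this notation usually refers to (an assumption on my part): the null hypothesis imposes r linearly independent constraints on the coefficients, written H0: Aβ = 0 with rank(A) = r, and it is tested with an F-statistic on r and n − p − 1 degrees of freedom (equivalently, a partial F-test comparing the constrained and unconstrained fits).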

43
+
Model Reduction

Caution: Remember that multiple comparisons need adjustments!
44
+
R²

R² is often defined as the proportion of the variability in the random variable Y explained by the regression model.
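Writing RSS for the residual sum of squares and SST for the total sum of squares about the mean of Y, this is:

R² = 1 − RSS/SST = (SST − RSS)/SST.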

45
+
R²: Adding Variables
Adding irrelevant predictor variables to the regression equation increases R².

Solution: adjusted R²

The denominator is an unbiased estimate of the variance of Y with all slopes = 0, while the numerator is an unbiased estimate of the variance of the residuals.
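The formula itself was an image; the standard adjusted R² matching this description is:

Adjusted R² = 1 − [ RSS / (n − p − 1) ] / [ SST / (n − 1) ].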

Beware: when used to compare models, adjusted R² is biased towards adding too many (irrelevant) predictor variables.

46
+
The mean function might not be modeled correctly because:

We didn't add variables we should have.
Simpson's paradox: slope sign changes.
Fallacy of bivariate thinking: p-values could be large when they should be small.

We added variables we shouldn't have.
Correlated predictors: slope sign changes.
Remember the interpretation that we hold other variables constant: p-values could be large when they should be small.
Chapter 7 has more info.

We didn't consider interactions.
Main effects (before the interaction is added) could have large p-values while the interaction has a small p-value.

We didn't consider polynomial terms.
A symmetric parabola may have a fitted slope of 0 when fit with a straight-line model.
47
