
+

Stat 608 Chapter 5


+
Note to self

Add HW question on alternative hypothesis for model reduction

Add discussion on familywise error rate under model reduction section

2
+
Multiple Linear Regression
Chapter 5:
Multiple predictor variables
ANOVA and ANCOVA
Polynomial Regression
Assumption that model is valid

Chapter 6:
Leverage points
Transformations
Relationships between explanatory variables:
Multicollinearity
Interactions

Chapter 7:
Variable Selection

3
+
Types of Multiple Linear Regression
Polynomial Regression (curves)

Quantitative and Categorical explanatory variables: ANCOVA (separate lines)

Many explanatory variables (multiple dimensions)

4
+
ANOVA & REGRESSION & ANALYSIS OF COVARIANCE
Response Y, explanatory variables X, error e:

All explanatory variables are continuous: REGRESSION
Some are categorical and some are continuous: ANALYSIS OF COVARIANCE
All are categorical: ANOVA

5
+
ANOVA (ANalysis Of VAriance)
One-way ANOVA, without an intercept:

yi = μj + ei, for an observation in group j = 1, 2, 3

Equivalently, using one indicator variable per group:

yi = β1 x1i + β2 x2i + β3 x3i + ei

Design Matrix:

Pro: (XᵀX) is diagonal: easy to invert!

Con: When we have two-way ANOVA, we have to cut out the last indicator variable.
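A minimal R sketch of this design matrix (group labels and responses are made up for illustration):

group <- factor(rep(c("A", "B", "C"), each = 2))   # hypothetical groups
y <- c(5.1, 4.8, 6.2, 6.0, 7.3, 7.1)               # made-up responses

X <- model.matrix(~ group - 1)   # one indicator column per group, no intercept
t(X) %*% X                       # diagonal: each entry is a group sample size

fit <- lm(y ~ group - 1)
coef(fit)                        # the three group means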

6
+
ANOVA
One-way ANOVA, with an intercept:

yi = β0 + β1 x1i + β2 x2i + ei

Design matrix:
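A minimal R sketch (same hypothetical groups as above); with an intercept in the model, R keeps the intercept column and drops one indicator (the reference level) so the matrix stays full rank:

group <- factor(rep(c("A", "B", "C"), each = 2))   # hypothetical groups

X <- model.matrix(~ group)   # intercept + indicators for all but the reference level
X
# The intercept estimates the reference group's mean; the remaining
# coefficients estimate differences from that reference mean.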

7
+
Two-Way ANOVA

Model:

yi = β0 + β1 x1i + β2 x2i + β3 x3i + ei

Design matrix:
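A minimal R sketch (factor names and levels are made up) of a two-way ANOVA design matrix; as noted earlier, with an intercept in the model one indicator from each factor is cut out to keep X full rank:

A <- factor(rep(c("a1", "a2"), times = 4))   # hypothetical 2-level factor
B <- factor(rep(c("b1", "b2"), each = 4))    # hypothetical 2-level factor

X <- model.matrix(~ A + B)   # intercept + one indicator for A + one for B
X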

8
+
Analysis of Covariance
Suppose we have three groups, and want to compare the three means, holding the value of a quantitative variable x constant.

Example: Compare diets, holding starting BMI constant.

It's possible to create three separate regression lines, as shown below. We might also create three lines with separate intercepts but the same slope, or three lines with separate slopes but the same intercept.

Group 1:

Group 2:

Group 3:

9
+
ANCOVA: Same slopes; Simpson's Paradox

[Scatterplot: Crime vs. Education for Rural and Urban areas, with parallel fitted lines for the two groups]

10
+
Interactions
An interaction between two input variables exists when the effect of one input (X1) on the target variable (Y) is different for different values of the other input (X2).

Ex: Adding sugar (X1) to coffee makes it much sweeter (Y) when the coffee is stirred (X2).

Ex: Injecting one more unit of sand (X1) into a fracking well has a larger effect on oil production (Y) when the well is shorter (X2).

Ex: The amount of iron in food (Y) is higher when cooking in a cast iron pot (X1). While tomatoes have a tiny amount of iron in them, the acidity in tomatoes means their presence in food (X2) has a multiplicative effect on cast iron.

Ex: Third-grade math students with ADHD (X1) have lower math scores (Y) if they studied while listening to music (X2), while those without ADHD had higher math scores when listening to music.

+
Interactions: ANOVA

[Interaction plot: mean iron content of the dish vs. pot type (Aluminum, Clay, Iron), with separate lines for dish type (meat, legumes, vegetable)]

+
ANOVA: Interactions
What would the model and design matrix look like in the
case of the iron pot example?
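A minimal R sketch (factor levels taken from the plot above; the data themselves are not needed to build the design matrix) of an ANOVA model with an interaction:

pot  <- factor(rep(c("Aluminum", "Clay", "Iron"), each = 3))
dish <- factor(rep(c("meat", "legumes", "vegetable"), times = 3))

X <- model.matrix(~ pot * dish)   # main effects plus interaction columns
colnames(X)                       # 1 + 2 + 2 + 4 = 9 columns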

+
ANCOVA Model
Where x1 is a quantitative variable and x2 is an indicator (dummy) variable, write down the model for separate slopes, separate intercepts:

Write down the model with separate slopes, but the same intercept:

Write down the model with separate intercepts, but the same slope:
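One common way to write these (a sketch using an x1·x2 interaction term; the β labels are my own):

Separate slopes and separate intercepts:  y = β0 + β1 x1 + β2 x2 + β3 x1 x2 + e
Separate slopes, same intercept:          y = β0 + β1 x1 + β3 x1 x2 + e
Separate intercepts, same slope:          y = β0 + β1 x1 + β2 x2 + e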

+
ANCOVA Example
Rats are randomly assigned to be fed 0, 2, 4, or 6 mg of one of two cancer-causing substances. The response variable y is the number of tumors recorded. What should the model look like? What should the design matrix look like?
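A minimal R sketch (variable names and counts are made up) treating dose as quantitative and substance as a factor, with an interaction so each substance gets its own dose slope:

dose      <- rep(c(0, 2, 4, 6), times = 2)           # hypothetical design
substance <- factor(rep(c("S1", "S2"), each = 4))    # hypothetical labels
tumors    <- c(1, 3, 4, 7, 0, 2, 3, 4)               # made-up counts

X   <- model.matrix(~ dose * substance)   # design matrix with interaction
fit <- lm(tumors ~ dose * substance)      # separate dose slope for each substance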

+
Multiple Regression: Interactions

If an interaction exists, there are many possible things that could be true:

The relationship could change direction in the presence of the third variable: the relationship between X1 and Y is positive before taking into account X2, and negative afterward.

The relationship might not change direction in the presence of a third variable, but merely have a dramatic multiplicative effect: fast driving (X1) is much more dangerous (Y) when drunk (X2).

Main effects might not be significant, while the interaction is significant.

Cause and effect could run in many possible directions, but we can only scientifically establish cause and effect through direct experimentation.

+
Interactions vs. Simpson's Paradox

Simpson's paradox
Going from one explanatory variable to two, slopes change direction, but the lines can still be parallel.
More education is associated with more crime until we add the urban/rural variable. But education has the SAME effect in both urban and rural areas: decreasing crime.

Interactions
Going from two explanatory variables to two with a multiplier effect: the lines cross.
Cooking in an iron pot has an even LARGER effect in the presence of meat (and tomatoes).

17
+
Multiple Regression: Interactions
Interactions between quantitative variables: fit different kinds of surfaces.

18
+
The Fallacy of Bivariate Thinking

19
+
The Fallacy of Bivariate Thinking
Plots of x1 vs. y or x2 vs. y may seem to show no relationship.

But the two variables working together may explain much more of the variation in y.

Ex: Weight, Height, and Body fat percentage.

20
+
Rank of X
The rank of our design matrix X should equal the number of columns of X. We say our design matrix is not full rank if it isn't.

If rank(X) < # columns = p + 1, that means some linear combination of the other variables adds up to one of the variables. Why do we need that extra variable?

If rank(X) < p + 1, then rank(XᵀX) < p + 1, so XᵀX is not invertible.

If n < p + 1, then rank(X) < p + 1, because the rank of X has to be less than or equal to both the number of columns and the number of rows of X. Get a bigger sample size or get rid of some variables.
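A minimal R sketch (variables made up) of a rank-deficient design: x3 is an exact linear combination of x1 and x2, so its coefficient cannot be estimated:

set.seed(1)
x1 <- rnorm(20)
x2 <- rnorm(20)
x3 <- x1 + x2                      # exact linear combination: X is not full rank
y  <- 2 + x1 - x2 + rnorm(20)

fit <- lm(y ~ x1 + x2 + x3)
summary(fit)                       # x3 is NA: "1 not defined because of singularities"
qr(model.matrix(fit))$rank         # 3, even though the matrix has 4 columns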

21
+
Rank of X
R error message:

Coefficients: (1 not defined because of singularities)

SAS error message:

Note: Model is not full rank. Least-squares solutions for the parameters are not unique. Some statistics will be misleading. A reported DF of 0 or B means that the estimate is biased.

Note: The following parameters have been set to 0, since the variables are a linear combination of other variables as shown.

22
+
Step 1: Multiple Regression

Always conduct this test first (the overall F-test for the model), before looking at tests for individual variables (next slide).
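For a model with p predictors, the overall F-test (reported at the bottom of R's summary() output) tests:

H0: β1 = β2 = ... = βp = 0 (none of the predictors is linearly related to y)
Ha: at least one βj ≠ 0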

If this p-value is large, STOP.

23
+
Step 2: Multiple Regression

Note: If we start conducting these tests for many of the variables, performing p separate t-tests, our overall Type I error rate increases.

We also run into problems when the predictor variables are highly correlated with each other (see Chapter 7).

24
+
Step 2: Multiple Regression

We often write the null hypothesis H0: βj = 0 without thinking.

The implicit assumption is that all the other variables are still in the model.

If the p-value for one variable is large, the conclusion is that the variable has no significant additional effect on the response variable (after all the other variables are entered into the model).

When predictors are correlated, important variables may become insignificant.

Example: oil company and well type have large p-values for predicting well performance (oil output).

25
+
Italian Restaurants: Houston

plot(Italian)

[Scatterplot matrix of the variables Restaurant, Food, Decor, Service, Cost, and Pct_Liked]

26
+
Italian Restaurants: Houston

my.lm <- lm(Italian$Food ~ Italian$Service + Italian$Pct_Liked + Italian$Cost)

anova(my.lm)

Response: Italian$Food
                   Df Sum Sq Mean Sq F value    Pr(>F)
Italian$Service     1 64.619  64.619  36.369 2.605e-07
Italian$Pct_Liked   1 36.510  36.510  20.548 4.133e-05
Italian$Cost        1  1.860   1.860   1.047    0.3116
Residuals          46 81.731   1.777

Residual standard error: 1.333 on 46 degrees of freedom
  (15 observations deleted due to missingness)
Multiple R-squared: 0.5575, Adjusted R-squared: 0.5287
F-statistic: 19.32 on 3 and 46 DF, p-value: 2.989e-08

27
+
Italian Restaurants: Houston

my.lm <- lm(Italian$Food ~ Italian$Service + Italian$Pct_Liked + Italian$Cost)

summary(my.lm)

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)          3.68923    2.66400   1.385  0.17278
Italian$Service      0.47849    0.11744   4.074  0.00018
Italian$Pct_Liked    0.11304    0.02461   4.594 3.38e-05
Italian$Cost        -0.02236    0.02185  -1.023  0.31156

28
+
Italian Restaurants: Houston

The coefficient for cost is negative; does that make sense?


Interpret the slope for Cost in context.

The p-value for cost is large; does that make sense?

29
+
Italian Restaurants: Houston
New model:
Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)        10.77121    2.32996   4.623 2.97e-05
Italian$Pct_Liked   0.13047    0.02797   4.665 2.58e-05
Italian$Cost        0.03510    0.01927   1.821   0.0749

The coefficient for %Liked is smaller than the coefficient for Service;
does that mean Service is more important when it comes to predicting
Food rating?

The p-value for %Liked is smaller than the p-value for Cost; does that
mean the association between %Liked and Food rating is stronger than
the association between Cost and Food rating?

30
+
Italian Restaurants: Houston

Which variable is most important?


Sorting by F-value in ANOVA table, t-value (if all variables are
quantitative), or p-value will give the same results.
If all predictors are independent of each other, sorting by
their correlation with the response will also give the same
results.

Ex: Oil Production

31
+
Italian Restaurants: Houston

Calculate an approximate 95% confidence interval for the slope for Service.
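As a rough check, using the full-model output above and the approximation estimate ± 2 SE: 0.47849 ± 2(0.11744), i.e., about (0.24, 0.71).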

32
+
Polynomial Regression

Is the following model linear in the parameters?

If a model is linear in the parameters, it means we can write it as:

Y = Xβ + e

33
+
Polynomial Regression

What does the design matrix look like for the following model?
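A minimal R sketch (made-up x values) for a quadratic polynomial model; each power of x becomes a column of the design matrix:

x <- 1:6                          # hypothetical predictor values

X <- model.matrix(~ x + I(x^2))   # columns: intercept, x, x^2
X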

34
+
Salary Example

A salary curve relates salary to years of experience. Employees might use it to find out where they stand among their peers. Personnel might use it to consider salary adjustments when hiring new professionals.

When we fit the simple linear regression model below, we get the residual plot on the next slide.

35
+
Salary Example: SLR

[Left: scatterplot of Salary vs. Years Experience. Right: standardized residuals vs. Years Experience, showing clear curvature]

36
+
Salary Example: Parabola

Clearly there is a non-linear relationship between salary and years of experience.

Next we fit a model that adds a quadratic term in years of experience.

The cutoff for high leverage is hii > 2(p+1)/n when there are p predictors plus 1 intercept.
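A minimal R sketch of the quadratic fit and the leverage check; the data here are simulated stand-ins (the real salary data are not reproduced in these notes):

set.seed(608)
Experience <- runif(100, 0, 35)                                        # made-up years of experience
Salary     <- 35 + 2.5 * Experience - 0.05 * Experience^2 + rnorm(100, sd = 3)  # made-up salaries

fit2   <- lm(Salary ~ Experience + I(Experience^2))
h      <- hatvalues(fit2)                 # leverages h_ii
cutoff <- 2 * (2 + 1) / length(Salary)    # 2(p+1)/n with p = 2 predictors (x and x^2)
which(h > cutoff)                         # observations flagged as high leverage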

37
+
Salary Example: Parabola

[Left: standardized residuals vs. Years Experience for the quadratic fit. Right: leverage vs. Years Experience]

38
+
Salary Example: Parabola

[Standard R diagnostic plots for the quadratic fit: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance; observations 29, 77, and 118 are flagged]

39
+
Approach: Center the explanatory variable
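The details of this slide were images and did not survive extraction; a minimal sketch of the usual idea (continuing the simulated data from the previous sketch), where centering Experience before squaring reduces the correlation between the linear and quadratic columns:

Experience.c <- Experience - mean(Experience)          # centered predictor
fit2c <- lm(Salary ~ Experience.c + I(Experience.c^2))

cor(Experience, Experience^2)       # typically large
cor(Experience.c, Experience.c^2)   # much smaller when x is not too skewed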

40
+
Model Reduction Method 1: Partial F-Test

Suppose we have the model above, and are interested in testing whether a set of k of the slope coefficients are all 0, against the alternative hypothesis

Ha: At least one of those k parameters is not 0.

That is, the question is, "Can we drop these k variables from our model?" This can be tested using an F-test.
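For instance, if the k variables being dropped are the last k predictors, the null hypothesis is

H0: βp−k+1 = βp−k+2 = ... = βp = 0.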

41
+
Model Reduction Method 1: Partial F-Test
Let RSS(Full) be the residual sum of squares from the model with all the predictors x1, ..., xp. Let RSS(Reduced) be the residual sum of squares from the reduced model (with only the remaining predictors, i.e., those we don't think are 0), and let the df's be the corresponding error degrees of freedom. Then the F-statistic for testing the above hypotheses is given by:
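The formula itself did not come through in the extracted text; the standard partial F-statistic matching this description is:

F = [ (RSS(Reduced) − RSS(Full)) / (dfReduced − dfFull) ] / [ RSS(Full) / dfFull ],

which is compared to an F distribution with (dfReduced − dfFull) and dfFull degrees of freedom.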

42
+
Model Reduction Method 2

Ha: At least one of these is not equal.

rank(A) = r
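The hypotheses on this slide were images and did not survive extraction; a sketch of the setup this notation usually refers to (an assumption on my part): the null hypothesis imposes r linearly independent constraints on the coefficients, written H0: Aβ = 0 with rank(A) = r, and it is tested with an F-statistic on r and n − p − 1 degrees of freedom (equivalently, a partial F-test comparing the constrained and unconstrained fits).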

43
+
Model Reduction

Caution: Remember that multiple comparisons need adjustments!
44
+
R²

R² is often defined as the proportion of the variability in the random variable Y explained by the regression model.
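Writing RSS for the residual sum of squares and SST for the total sum of squares about the mean of Y, this is:

R² = 1 − RSS/SST = (SST − RSS)/SST.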

45
+
R²: Adding Variables
Adding irrelevant predictor variables to the regression equation increases R².

Solution: adjusted R²

The denominator is an unbiased estimate of the variance of Y with all slopes = 0, while the numerator is an unbiased estimate of the variance of the residuals.
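The formula itself was an image; the standard adjusted R² matching this description is:

Adjusted R² = 1 − [ RSS / (n − p − 1) ] / [ SST / (n − 1) ].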

Beware: when used to compare models, adjusted R² is biased towards adding too many (irrelevant) predictor variables.

46
+
The mean function might not be modeled correctly because:

We didn't add variables we should have.
Simpson's paradox: slope sign changes.
Fallacy of bivariate thinking: p-values could be large when they should be small.

We added variables we shouldn't have.
Correlated predictors: slope sign changes.
Remember the interpretation that we hold other variables constant: p-values could be large when they should be small.
Chapter 7 has more info.

We didn't consider interactions.
Main effects (before the interaction is added) could have large p-values while the interaction has a small p-value.

We didn't consider polynomial terms.
A symmetric parabola may have a fitted slope of 0 when fit with a straight-line model.
47
