
One-Variable Investigation

1. Point estimation: estimating the unknown parameter by a single number calculated from the sample.

2. Interval estimation: estimating the unknown parameter with a set of plausible values, and attaching a level measuring your confidence that the set (i.e., the interval) covers the parameter.

3. Significance testing about hypotheses: using the data to assess evidence in favor of or against some claim about the population parameter.

A 95% confidence interval is determined by: point estimate +/- 2 * standard error.

If the p-value >= significance level, the data don't sufficiently contradict the null.

If the p-value < significance level, reject the null hypothesis.

For a 95% confidence level, the significance level = 0.05.
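A minimal Python sketch of the interval rule above (point estimate +/- 2 * standard error) for a sample mean; the data values are made up for illustration:

```python
import numpy as np

# Hypothetical sample (made-up values)
sample = np.array([4.1, 5.3, 4.8, 5.0, 4.6, 5.2, 4.9, 5.1])

point_estimate = sample.mean()                   # point estimation
se = sample.std(ddof=1) / np.sqrt(len(sample))   # standard error of the mean

# Interval estimation: point estimate +/- 2 * standard error (~95% confidence)
lower, upper = point_estimate - 2 * se, point_estimate + 2 * se
print(f"Approximate 95% CI: ({lower:.3f}, {upper:.3f})")
```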

Simple Linear Regression

Errors are assumed to be: 1. independent, 2. mean = 0, 3. constant standard deviation for all X, and 4. normally distributed.

Checks:

1. Linear: inspect the scatter-plot of the data.

2. Independent: look for the residuals to be patternless scatter on a residual plot.

3. Mean = 0: look for the residuals to be centered around zero.

4. Constant standard deviation: look for the residuals to show approximately constant spread.

5. Normally distributed: inspect a probability plot of the residuals; look for the points to be close to the line; supplement with 95% confidence limits for normality and identify the p-value (H0: the transformed data are normal; HA: they are not). If the points lie within the 95% confidence limits on the probability plot, then normality holds.

The slope is b1 = r * (Sy/Sx), where r = sample correlation and Sy, Sx = sample standard deviations of Y and X. The intercept is b0 = Ybar - b1 * Xbar, where Ybar and Xbar are the means of Y and X.

Standard error: t = (estimate - null value) / standard error = b1 / SE(slope).
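A minimal numpy sketch of these formulas (b1 = r * Sy/Sx, b0 = Ybar - b1 * Xbar, t = b1/SE(slope)); the x and y values are invented:

```python
import numpy as np

# Invented data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

r = np.corrcoef(x, y)[0, 1]              # sample correlation
b1 = r * y.std(ddof=1) / x.std(ddof=1)   # slope: b1 = r * (Sy/Sx)
b0 = y.mean() - b1 * x.mean()            # intercept: b0 = Ybar - b1 * Xbar

# t statistic for H0: B1 = 0
n = len(x)
resid = y - (b0 + b1 * x)
s = np.sqrt((resid ** 2).sum() / (n - 2))             # residual st. dev.
se_slope = s / np.sqrt(((x - x.mean()) ** 2).sum())   # SE of the slope
print(b0, b1, b1 / se_slope)
```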


Multiple Linear Regression

Conduct the F-test:

H0: B1 = B2 = B3 = ... = Bp = 0
HA: at least one of these is not zero.

If the F-test is non-significant, then the model is useless. If the F-test yields a significant result, then at least one variable is important, so proceed to conduct tests of individual effects.

Conduct t-tests for individual effects: for each explanatory variable X, decide if there is sufficient evidence that it is significant in the presence of all the other explanatory variables.

H0: B1 = 0, H0: B2 = 0, etc.
HA: B1 != 0, HA: B2 != 0, etc.
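A sketch of this F-then-t workflow with statsmodels (sm.OLS and the fvalue/tvalues attributes are standard statsmodels API; the data are simulated placeholders):

```python
import numpy as np
import statsmodels.api as sm

# Simulated placeholder data with two explanatory variables
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=50)

model = sm.OLS(y, sm.add_constant(X)).fit()

# Overall F-test: H0: B1 = B2 = 0
print(model.fvalue, model.f_pvalue)

# Individual t-tests: H0: Bj = 0, in the presence of the other predictors
print(model.tvalues, model.pvalues)
```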
Non-Linear Relationships

Option 1: Transformation.

Option 2: Higher-order models:

Y = B0 + B1*X + B2*X² + err (quadratic)
Y = B0 + B1*X + B2*X² + B3*X³ + err (cubic)

Transformation should be used wherever it is possible, instead of higher-order models.

Partial R² = ((R² between Y and both predictors) - (R² between Y and just X1)) / (100% - (R² between Y and just X1)).
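A sketch of the partial R² formula via two statsmodels fits (the variables and data are invented; rsquared is the standard attribute name):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x1 = rng.normal(size=80)
x2 = rng.normal(size=80)
y = 1.0 + 0.5 * x1 + 1.5 * x2 + rng.normal(size=80)

r2_full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit().rsquared
r2_x1 = sm.OLS(y, sm.add_constant(x1)).fit().rsquared

# Partial R² of X2 given X1: fraction of the remaining variability explained
partial_r2 = (r2_full - r2_x1) / (1 - r2_x1)
print(partial_r2)
```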
Adjusted R² is used for comparing best subsets and for comparing models to indicate which is better; it adjusts for the number of variables. Non-adjusted R² measures the predictive strength of models. Check for multicollinearity problems in addition to adjusted R².

The Danger of Multicollinearity

In a multiple linear regression, multicollinearity is when the explanatory variables are perfectly or strongly linearly associated with each other. Ideally we want weakly correlated predictors.

Predictors are weakly correlated if the coefficients are similar between the single-variable analysis and the multi-variable analysis, and the standard errors are the same or smaller.

Total multicollinearity means that one explanatory variable depends entirely on another. This will give us an error, because we'd divide by zero. Strong multicollinearity, but not total, will give misleading results; it shows up as wildly different coefficients and standard errors. Also look at high correlations between predictors. A more precise method is the VIF.

VIF

The VIF is the reciprocal of a measure called tolerance, associated with each explanatory variable. Tolerance is obtained as follows (see the sketch after this list):

1. Produce a multiple linear regression with Xj as the response, versus all the other explanatory variables: Xj = B0 + B1*X1 + B2*X2 + ...
2. Obtain the R² from this regression (call it R²j).
3. Tolerance = 1 - R²j.
4. VIF = 1/tolerance.

If VIF > 2.5, then worry about a dangerous effect of multicollinearity. If we suspect multicollinearity, consider dropping some of the highly correlated variables from the model.
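A minimal sketch of steps 1-4 for a single predictor, with deliberately collinear invented data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=100)   # deliberately collinear with x1

# Steps 1-2: regress Xj on the other explanatory variable(s), take the R²
r2_j = sm.OLS(x2, sm.add_constant(x1)).fit().rsquared

tolerance = 1 - r2_j    # step 3
vif = 1 / tolerance     # step 4
print(vif)              # large here, flagging multicollinearity (> 2.5)
```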
Including Categorical Variables in the Regression Model

So far we have used quantitative variables. Often we want to include categorical variables in the prediction model.

We can add a dummy variable coded 0 or 1 according to the category. The beta coefficient shows how much higher or lower on average the response is for the category coded as 1 compared to the response for the category coded as 0.

We can handle more levels: for a three-level categorical variable, we must add two dummy variables, each of which is binary. The group represented by (0,0) is the baseline group.
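A sketch of dummy coding with pandas (pd.get_dummies with drop_first=True leaves the baseline group coded as all zeros; the data frame is invented):

```python
import pandas as pd

df = pd.DataFrame({
    "y":     [10.0, 12.0, 9.0, 15.0, 14.0, 11.0],
    "group": ["A", "B", "A", "C", "C", "B"],   # three-level categorical variable
})

# Two binary dummies; group "A" becomes the (0, 0) baseline
dummies = pd.get_dummies(df["group"], drop_first=True)
print(dummies)
```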
Modeling Interaction

B3 (the coefficient of the interaction term) has two equivalent interpretations (see the sketch below):

1. It is the difference in slopes; it measures how the quantitative effect depends on the group value, in the following sense: for every one-unit increase in the quantitative variable X1, the change in Y is B3 greater/less in the category coded as 1 relative to the category coded as 0.

2. B3 captures how the vertical difference between the lines changes, which measures how the group effect depends on the quantitative value, in the following sense: for a particular value x1 of the quantitative variable, the difference in Y for the group coded as 1 compared to the group coded as 0 is more/less than it would be without the interaction term, by an amount equal to x1 times B3 (on average).
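A sketch of fitting Y = B0 + B1*X1 + B2*G + B3*(X1*G) with statsmodels, where G is a 0/1 dummy (simulated data; the interaction column is just the elementwise product):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
g = rng.integers(0, 2, size=100)   # 0/1 group dummy
y = 1.0 + 2.0 * x1 + 0.5 * g + 1.2 * x1 * g + rng.normal(size=100)

# Design matrix: X1, G, and the interaction term X1*G
X = sm.add_constant(np.column_stack([x1, g, x1 * g]))
fit = sm.OLS(y, X).fit()
print(fit.params)   # the last coefficient estimates B3, the difference in slopes
```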
One-Way ANOVA

Regression models model Q -> Q; ANOVA models C -> Q.

H0: μ1 = μ2 = ... = μk
HA: the μ's are not all the same.

F statistic: F = (SSG/DFG) / (SSE/DFE) = MSG/MSE.

R² = sum of squares for groups / sum of squares total = SSG/SSTotal = SSG/(SSG + SSE). R² is the proportion of variability in Y attributable to (or predicted from) the explanatory variable(s), based on the model in question.
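A sketch with scipy (stats.f_oneway returns the F statistic F = MSG/MSE and its p-value; the three group samples are invented):

```python
from scipy import stats

# Invented measurements for three independent groups
g1 = [23.1, 24.5, 22.8, 25.0]
g2 = [26.2, 27.1, 25.8, 26.9]
g3 = [22.0, 21.5, 23.3, 22.7]

f_stat, p_value = stats.f_oneway(g1, g2, g3)   # F = MSG/MSE
print(f_stat, p_value)
```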
Tukey Pairwise Comparisons

If an interval for the difference doesn't contain zero, then we conclude that that pair is different from each other.

Two-Way ANOVA

If the interaction term is significant, then both explanatory variables must be kept in the model (regardless of the p-values of the individual tests for the separate variables).
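A sketch of Tukey's procedure with statsmodels (pairwise_tukeyhsd is the standard function; the values reuse the invented groups from the ANOVA sketch above):

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

values = np.array([23.1, 24.5, 22.8, 25.0,
                   26.2, 27.1, 25.8, 26.9,
                   22.0, 21.5, 23.3, 22.7])
groups = np.repeat(["g1", "g2", "g3"], 4)

# Each row gives a confidence interval for one pairwise difference;
# intervals that exclude zero indicate that pair differs
print(pairwise_tukeyhsd(values, groups, alpha=0.05))
```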
Repeated Measures ANOVA

For one-way ANOVA we measure k independent groups: separate groups of subjects, each subject measured once. For repeated measures ANOVA, we measure one group of subjects k times, so the measurements are non-independent.

Two-Factor Repeated-Measures (within subjects) and Two-Factor Mixed (within-between subjects)

Recall that repeated measures is where the same subjects get measured k times (i.e., where the same subjects are exposed to all treatments). Check on which variable there is a repeated test.

ANCOVA

When a repeated design is not feasible, or would not be desirable, there is an alternative analysis: ANCOVA, so called since the added variables are covariates. We include other quantitative variables that are correlated with, but not part of, the main experimental manipulation.

ANCOVA's usefulness depends on good correlation between the covariate and the response; otherwise nothing is gained by it. For studies with human subjects, typical covariates are age, socioeconomic status, aptitude, or pre-study attitude.

Assumptions: equality of the regression slope across the conditions, and the different treatments have no effect on the covariate.
Logistic Regression

A model appropriate for a categorical response Y. Linear regression is Q -> Q, ANOVA is C -> Q, and logistic regression is Q -> C.

Odds = P/(1 - P), and P = Odds/(1 + Odds).

ln(p/(1 - p)) = B0 + B1*X, equivalently P = 1/(1 + exp[-(B0 + B1*x)]). B1 governs the steepness of the curve.

exp(B1) is the odds ratio: (odds favoring Y = 1 when X = x + 1) / (odds favoring Y = 1 when X = x).

H0: B1 = 0
HA: B1 != 0

Goodness-of-fit tests: H0: the model fits the data; HA: the model is not a good fit to the data.

For binary logistic regression there is no equivalent measure of association with the nice mathematical properties of R². Instead, use the percent of concordant pairs: a pair is concordant if, whenever X1 > X2, we also have Y1 > Y2.

Binary Multiple Logistic Regression

log(p/(1 - p)) = B0 + B1*X1 + B2*X2 + ... + Bk*Xk

P = 1/(1 + exp[-(B0 + B1*X1 + B2*X2 + ... + Bk*Xk)])
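A sketch of a binary logistic fit with statsmodels (sm.Logit is the standard class; the data are simulated from the model P = 1/(1 + exp[-(B0 + B1*x)])):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.normal(size=200)
p_true = 1 / (1 + np.exp(-(-0.5 + 1.5 * x)))   # P = 1/(1 + exp[-(B0 + B1*x)])
y = rng.binomial(1, p_true)

fit = sm.Logit(y, sm.add_constant(x)).fit()
b0, b1 = fit.params
print(np.exp(b1))   # exp(B1): odds ratio for a one-unit increase in X
```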

Nominal Logistic Regression (Y has more than 2 categories, unordered)

We have Y = category 1 with probability P1, Y = category 2 with probability P2, and so on, with the sum of all the probabilities equal to 1. Pick a reference category c with probability Pc; we define P1/Pc as the odds favoring category 1 over category c.

Ordinal Logistic Regression

Used when the response variable Y is a categorical variable with more than two levels which have a natural ordering.
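A sketch of the nominal model with statsmodels (sm.MNLogit fits the log-odds of each category against a reference category; the data are invented noise, just to show the shape of the output):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.normal(size=300)
y = rng.integers(0, 3, size=300)   # invented 3-category response; 0 is the reference

fit = sm.MNLogit(y, sm.add_constant(x)).fit()
# One column of coefficients per non-reference category:
# each column models log(P_j / P_reference) = B0 + B1*x
print(fit.params)
```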
