
EST-66117 Estadísticas para la Política Pública
Multiple Regression and Model
Diagnostics
Dr. Heidi Jane Smith
Today: 3/1
• Multiple Regression and Model Diagnostics
(multicollinearity, heteroscedasticity, and
autocorrelation)

– Readings:
• Stock and Watson, Chapters 6 and 7
• Berman and Wang, Chapter 15
• Acock, Chapter 10
Multiple regression analysis
• Multiple regression analysis has an interval-level
dependent variable and two or more
independent variables, which can be either
dummy or interval level. If an effect has multiple
causes, multiple regression allows us to predict
values of Y more accurately than bivariate
regression does. Multiple regression also helps
isolate the direct effect of a single independent
variable on the dependent variable, once the
effects of the other independent variables are
controlled.
The equation
• The equation for a multiple regression is similar
to that of a linear regression, although it no
longer describes a two-dimensional line:
Y-hat = b0 + b1X1 + b2X2 + b3X3 + ... + bnXn (the observed Y adds an error term e)
• Y-hat is the expected value of Y,
• X1 through Xn are the independent variables,
• b0 is the y-intercept, and
• b1 through bn are the regression coefficients
(also called partial slope coefficients).
Y hat
• To determine the expected value of Y, insert the actual
values of X1 through Xn into the equation, multiply, and
add.
• The y-intercept is still the expected value of the dependent
variable when all of the independent variables equal zero
(though the y-intercept will seldom have a practical
meaning; it frequently does not make sense for all
independent variables to equal zero).
• The partial slope coefficients show the expected change in
Y from a one-unit increase in Xi, holding all the other X’s
constant. That is, Xi changes but none of the other
independent variables do. It usually does not matter at
what values you hold the other variables constant.
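A small hypothetical illustration of the plug-and-add logic (the coefficients and values below are invented for this example, not taken from any dataset): suppose Y-hat = 10 + 2X1 + 0.5X2. For an observation with X1 = 3 and X2 = 4, Y-hat = 10 + 2(3) + 0.5(4) = 18. Raising X1 by one unit while holding X2 at 4 increases Y-hat by b1 = 2, to 20.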
Testing Assumptions
1. Model Specification
(that is, identification of the DV and IVs)
2. Testing of regression assumptions
3. Correction of assumption violation
4. Reporting of the results of the final
regression
Model Specification
• Multiple regression is an extension of simple regression, but an
important difference exists between the two methods: multiple
regression aims for full model specification, that is, it includes all
the variables that affect the DV; by contrast, a simple regression
examines the effect of only one IV
• Must identify the variables that are of most (theoretical and
practical) relevance in explaining the dependent variable
• Must also address the variables that were not considered to be
most relevant
• The assumption of full model specification is that these
other variables are justifiably omitted only when their
cumulative effect on the dependent variable is zero
• The validity of multiple regression models centers on
examining the behavior of the error term in this regard. If
the cumulative effect of all these variables is not zero, then
additional variables may have to be considered
Interpreting Coefficients
• Each regression coefficient is interpreted as its effect
on the dependent variable, controlled for the effects of
all the other independent variables included in the
regression.
• See exercise 1 with Auto data

• Standardized coefficient (β) is the change produced in
the dependent variable by a unit of change in the
independent variable when both variables are
measured in terms of standard deviations
• Standardized coefficients help compare the relative
importance of independent variables measured on different scales
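As a rough sketch of what Exercise 1 with the Auto data mentioned above might look like in Stata (the variable choices price, mpg, and weight are illustrative assumptions, not necessarily the ones used in class):

  * Load Stata's built-in auto dataset
  sysuse auto, clear

  * Multiple regression: price on mpg and weight
  regress price mpg weight

  * Each slope is the expected change in price for a one-unit increase
  * in that variable, holding the other variable constant

  * Standardized (beta) coefficients, measured in standard deviations
  regress price mpg weight, beta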
Adjusted R2
• Controls for the number of independent variables
• Adjusted R2 is always equal to or less than R2
• R2 is the variation in the dependent variable explained
by all of the independent variables
• Adjusted R2 is used to evaluate model specification or
(goodness of fit of the variables).
• Adjusted R2 values below 0.20 are considered to suggest weak
model fit, those between 0.20 and 0.40 indicate
moderate fit, those above 0.40 indicate a strong fit, and
those above 0.65 a very strong fit
• Judge model specification by theory, not by statistical
relationships alone: the variables should be meaningful in some real-life sense
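For reference, the standard adjustment (not spelled out on the slide) penalizes extra independent variables: Adjusted R2 = 1 − (1 − R2)(n − 1)/(n − k − 1), where n is the number of observations and k the number of independent variables.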
One Dummy and One Interval-level
Independent Variable
• The basic formula for a multiple regression
equation with one dichotomous independent
variable and one interval-level independent
variable is:
• Y-hat = b0 + b1 Dummy + b2 X2
• The dummy lets the two groups (Dummy = 0 and Dummy = 1)
have different intercepts, so you can contrast one category
against the other.
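A minimal Stata sketch with the Auto data, assuming foreign (coded 0/1) as the dummy and weight as the interval-level variable:

  sysuse auto, clear

  * foreign = 1 for foreign cars, 0 for domestic
  regress price foreign weight

  * The coefficient on foreign is the expected price difference between
  * foreign and domestic cars of the same weight (an intercept shift);
  * the coefficient on weight applies to both groups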
F test
• The global F-test examines the overall effect of all the IVs jointly on
the DV. The null hypothesis is that the overall effect of all the IVs
jointly on the DV is statistically insignificant; the alternative
hypothesis is that this overall effect is statistically significant.
• The null hypothesis implies that none of the regression coefficients is
statistically significant; H1 suggests that at least one of the
regression coefficients is statistically significant.
• The regression sum of squares is a measure of the explained
variation, Sum(y-hat i − y-bar)2, and the residual sum of squares is
a measure of the unexplained variation, Sum(yi − y-hat i)2. The total
sum of squares is the sum of these two measures
and is the basis of R-squared: R2 = 1 − (residual sum of
squares / total sum of squares).
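For reference (standard result, not spelled out on the slide), the global F statistic combines these pieces: F = (regression sum of squares / k) / (residual sum of squares / (n − k − 1)) = (R2/k) / ((1 − R2)/(n − k − 1)), with k and n − k − 1 degrees of freedom, where k is the number of independent variables and n the number of observations.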
Classical assumptions of OLS
A. Linear in parameters: The model is linear in the parameters β 0 through βk.
yi = β0 + β1 x1i + β2 x2i + ... + βk xki + εi
B. Random sample: We have a random sample of n observations from the population.
C. Zero Conditional Mean: The expected value or conditional mean of the error term ε is 0
at every value (and every combination of values) of the independent variables.
E (ε) = E (ε | x) = 0.
Alternatively, we can state this assumption as the independence of the x’s and the error
term. That is, the x is uncorrelated with the error term.
Cov (x1, ε) = Cov (x2, ε) = ... = Cov (xk, ε) = 0
D. No perfect collinearity: Each x has unique variation; its value cannot be perfectly
predicted by the other x’s. Independent variables can be correlated, but not perfectly
correlated. That is, an independent variable cannot be an exact linear combination of other
independent variables. Note, x2 is not an exact linear function of x.

E. Homoskedasticity: The variance in the error term ε is the same at every value (and every
combination of values) of the independent variables.
Var (ε) = Var (ε | x) = σ2.
Model misspecification
Definition: What happens when we violate the first assumption, a properly specified
model? We can potentially either (1) leave out an important independent variable (or
include it in the wrong form) or (2) include an irrelevant independent variable. The effects
will be different, with much more serious consequences in case (1).

Problem:
If we incorrectly leave out an important independent variable, the OLS estimator of βj
(call it bj*) will be biased. That is, E(bj*) ≠ βj. The OLS estimator bj* will remain biased
even in infinitely large samples. The estimated standard error of bj* could be larger or
smaller than the estimated standard error of bj, depending on both the additional
variation in y that could be uniquely explained by adding the omitted variable to the model
and the correlation between that omitted variable and xj.
Suppose, on the other hand, that we add an irrelevant variable (x3) to the correctly
specified population regression function, so that we mistakenly test the model:
y = β0 + β1 x1 + β2 x2 + β3 x3 + ε

OLS is still BLUE – that is, OLS still gives the best linear unbiased estimator of the
population regression function. However, adding the irrelevant variable will raise the
standard errors (both true and estimated) of the other coefficients, which will increase the
confidence interval for the coefficient and reduce the expected value of t*, making it more
difficult to reject the null hypothesis of no impact.
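A small simulation sketch in Stata may make the omitted-variable case concrete; the data-generating process below is invented purely for illustration:

  clear
  set seed 12345
  set obs 1000

  * True model: y = 1 + 2*x1 + 3*x2 + e, with x1 and x2 correlated
  gen x1 = rnormal()
  gen x2 = 0.5*x1 + rnormal()
  gen y  = 1 + 2*x1 + 3*x2 + rnormal()

  * Correctly specified model: coefficient on x1 should be close to 2
  regress y x1 x2

  * Omitting x2: the coefficient on x1 is biased because it absorbs part
  * of x2's effect (roughly 2 + 3*0.5 = 3.5 in large samples)
  regress y x1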
Multicollinearity
Multicollinearity happens when the explanatory variables are correlated with one
another. There is always going to be some collinearity in your model, because you
can rarely keep the explanatory variables from having any linear relationship with
one another.

Cause: The main causes of multicollinearity are a small sample size; a data collection
method that covers only a limited range of values; constraints on the model or
on the population being sampled; and model specification error or a complete
misspecification of the model, for example one that contains too many explanatory variables.

Consequences: The consequences of high multicollinearity are not too bad. When
present, the precision of the estimators may be lower, but the model may still capture
the true values. The model will show inflated standard errors, which makes the
t-tests misleading (lower t-ratios and higher p-values), so the
independent variables will look insignificant. Furthermore, the OLS estimates are
sensitive to small changes in the data, and the model may show a high R-squared despite
few significant variables. Still, "OLS estimators are still BLUE": they remain
unbiased, just less precise.
Multicollinearity Detection
• Detection: There are a number of indicators and measures of possible
multicollinearity in a data set. The R2 and F-statistic are high but the t-statistics are
insignificant, which suggests that multicollinearity among the independent variables
may be leading to their insignificant coefficients.

• The independent variables have high correlation coefficients. As a rule of thumb,
inter-correlation among the independents above .80 signals a possible problem.

• When we regress one independent variable on the other independent variables
(called an auxiliary regression), we get a high R2.

• Low tolerance and high VIF (variance inflation factor). Tolerance is 1 – R2, where R2 is
for the auxiliary regression. (Stata displays tolerance as 1/VIF.) If the tolerance value is
less than some cutoff value, usually .20, the independent should be dropped from the
analysis due to multicollinearity.
• VIF can be used instead of tolerance. VIF is 1/(1-R2). If VIF>4 then multicollinearity
might be a problem. If VIF >10, you have high multicollinearity (that is, if R2 in the
auxiliary regression is greater than .90). [Run the vif command immediately after the
regression of interest (not after the auxiliary regression), which you can run with
either the fit or the regress command.]
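A hedged Stata sketch of these checks, again assuming the Auto data and an illustrative set of regressors (price on mpg, weight, and length):

  sysuse auto, clear

  * Pairwise correlations among candidate independent variables
  correlate mpg weight length

  * Regression of interest
  regress price mpg weight length

  * Variance inflation factors and tolerance (reported as 1/VIF);
  * in current Stata the slide's vif command is run as estat vif
  estat vif

  * Auxiliary regression: one independent variable on the others;
  * a high R-squared here signals multicollinearity
  regress weight mpg length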
Multicollinearity Solutions
• Do nothing since OLS estimates are BLUE even in the presence of high
multicollinearity. Simply accept that your data are not strong enough to answer all
the questions you would like to put to them.
• Most of the time, however, you will feel a strong desire to do something about it.
A common situation is that either X1 or X2 (or both) will have a statistically
significant, even strong, coefficient if the other variable is left out of the model,
but that neither will be statistically significant if both are in the model. Possible
solutions include:
– Drop one of the variables. The other variable then becomes significant and the model looks
better. In general, this should be done only on theoretical grounds (but, in practice, theory will
frequently be weak). In practice, researchers tend to let the data choose the model, typically
dropping the variable with the smaller t-statistic. This is dangerous. If the originally specified
model was correct, the new coefficients will be biased.
– Create a new variable which is a combination of X1 and X2. With a large number of
independent variables, this would typically be done with principal components analysis or
factor analysis. These are methods for finding commonalities in sets of variables and can be
quite useful, but the meaning of the new variable and of its coefficient will usually be pretty
unclear.
– Get a bigger or better data set (one with more unexplained variation in the independent
variables). This leads to smaller standard errors for the coefficients.
Heteroskedasticity
• Definition: A key assumption of the classical linear regression model is that
the error term is homoskedastic. That is, the variance of the error term is the
same at all values of X. When the variance of the error terms is different for
different observations, the error is said to be heteroskedastic.
• Problem: Heteroskedasticity introduces two problems for Ordinary Least
Squares (OLS). OLS produces unbiased but not efficient estimators
of the population parameters, and it produces biased estimators of the variances
of the regression coefficients, which invalidates the logic of the t- and F-tests
and of the confidence intervals.
• Detection:
– Graph the residuals against the independent variable or variables that you suspect
are responsible for the heteroskedasticity. Nonconstant spread means
heteroskedasticity.
– Use a modified version of the Breusch-Pagan test. In Stata, follow the regression
output with the hettest command (null hypothesis: the residual has a constant
variance). The hettest test comes in two forms:
– Name the variables that you suspect are responsible for the heteroskedasticity.
– Run the hettest command without naming any variables, in which case hettest
uses the expected value of the dependent variable as the independent variable.
– Solution: Use robust standard errors. In Stata, at the end of the regression command,
simply add ", robust".
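A minimal Stata sketch of detection and the robust-standard-error fix, assuming the Auto data (in current Stata the slide's hettest command is run as estat hettest):

  sysuse auto, clear
  regress price mpg weight

  * Residuals against fitted values; a fan or funnel shape suggests
  * heteroskedasticity
  rvfplot

  * Breusch-Pagan test; H0: the residuals have constant variance
  estat hettest

  * Same test, naming a variable suspected of driving the problem
  estat hettest weight

  * Re-estimate with heteroskedasticity-robust standard errors
  regress price mpg weight, robust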
Autocorrelation
• Definition: Errors are correlated from observation to
observation. Cov (εi , εj ) ≠ 0 for i ≠ j. It is more
common in time series and panel data.
• Problem: OLS regression yields unbiased but not
efficient estimators of the population parameters and
yields biased estimators of the variance and standard
errors of the regression coefficients. So neither the t-
test nor the confidence intervals can be trusted.
• Detection: In Stata, use the Durbin-Watson test: dwstat
(null hypothesis: there is no autocorrelation). Values of d
close to 2 suggest no autocorrelation; the further d falls
below 2, the stronger the positive autocorrelation.
• Solution: Use robust standard errors; for time series, use
heteroskedasticity- and autocorrelation-consistent (HAC)
standard errors such as Newey-West.
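A sketch of the time-series workflow in Stata, assuming a hypothetical dataset with a year variable and series y and x (in current Stata the slide's dwstat command is run as estat dwatson; the lag length below is illustrative):

  * Declare the data as a time series
  tsset year

  regress y x

  * Durbin-Watson statistic; values near 2 suggest no first-order
  * autocorrelation
  estat dwatson

  * One common remedy: Newey-West (HAC) standard errors
  newey y x, lag(1)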
Assumptions Needed to Believe
Parameter Estimates for X
• E(u) = 0. In general, this assumption means that any independent variables you couldn't think of to include in the
analysis are just noise: they have no impact on the dependent variable.
– No random or non-random measurement error in X
• Random measurement error: errors in data collected from human responses, in recording or transcribing data, etc.
• Non-random measurement error concerns "the extent to which a measure reflects the concept it is intended to measure." An
example of non-random error would be: if I am trying to measure the impact of September 11th on flying and I include
an unrelated variable such as "pilot hair color," that would be non-random error. At the same time, if I purposely leave
out an important variable such as "daily number of airline passengers in the past 12 months," that is also non-random
error.
– No random or non-random measurement error in Y. Same as above.
– No correlation between X and unmeasured/unobserved variables.
• This means no omitted variable (selection bias)
And
• No simultaneity: it has to be clear that the independent variable impacts the dependent variable and not the other
way around.
• Must have linear functional form
• If you have significant independent variables and a low R-squared, you may have a poor linear functional form, which means
that using a linear regression equation is not appropriate.
Assumptions Needed to Believe
Significance Tests for X
• No autocorrelation – observations must be independent of one another. If observations are not independent, you
cannot trust the accuracy of that variable. This can be a problem in:
– A time series design – observation from time 1 not independent from time 2 (what is an example of this?)
– Cross-sectional data – for example, if you are observing a whole group of people at the same time and are trying to keep data of each person in the
group individually. Each person in the group is impacting the other people in the group so the observations are not independent.
• No Heteroscedasticity – if the variance of the error term is not constant for all observations, there is
heteroscedasticity. For example, if you are trying to measure the impact of gun legislation on gun-related deaths in
the US by collecting cross-sectional data from all 50 states, there would probably be heteroscedasticity. The
reason is that with such varying populations, the variance of the error term would differ across states with
different populations.
– Look for it in:
– cross-sectional data or time series (mostly with cross-sectional).
– aggregate data – like states (each unit has a different N)
– test scores or policy opinions
– when dependent is spending and independent or control is income

• No severe collinearity (same as multicollinearity?) – there can be no strong relationship among the independent
variables. If there is, R-squared will be high but the independent variables will be insignificant.

• Example – If I am using hair color and ethnicity as independent variables in a regression, neither will be significant
because they are highly related.
Review again R2
• R-squared (R2), also called the coefficient of
determination, is interpreted as the
percentage of variation in the dependent variable that
is explained by the independent variables.
• Overall, R2 varies from zero to one and is a goodness-of-
fit measure (values closer to one indicate a better fit; low
values suggest that additional factors affect the
dependent variable).
• The strength of the relationship lies between 0 and 1:
– Typically, R2 values below 0.20 are considered a
weak relationship, those between 0.20 and 0.40 are
moderate, and those above 0.40 are strong relationships.
Reporting Regressions
• When you are reporting on bivariate
relationships with 2 continuous variables (i.e.
simple regressions) you must report:
– 1) level of significance at which the two variables
are associated if at all (t stat).
– 2) whether the relationship between the two
variables is positive or negative (b)
– 3) the strength of the relationship (R2)
• Use hypothesis (H0) testing, not predictions
Standard Error of the Estimate (SEE)
• The SEE is the spread of the y values around the
regression line, calculated at the mean value
of the independent variable only and assuming a
large sample.
• The SEE has an interpretation in terms of the
normal curve: approximately 68% of the y values lie within
one standard error of the value of y calculated, for
the mean value of x, from the preceding regression model.
error term e
• The predicted value of the dependent variable, y-hat,
is typically different from the observed value of y.
• Only when R2 = 1 are the observed and predicted
values identical to each other
• The difference between y and y-hat is called the
regression error, or error term e:
– y-hat = a + b*x
– y = a + b*x + e
• Assumptions about e are important, such as
that the errors are normally distributed
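A short Stata illustration of y-hat and the residual e, assuming the Auto data:

  sysuse auto, clear
  regress price mpg weight

  * Predicted (fitted) values y-hat
  predict yhat, xb

  * Residuals e = y - y-hat
  predict ehat, residuals

  * Check whether the residuals look roughly normal
  histogram ehat, normal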
