
Business Econometrics using SAS Tools (BEST)

Classes XI and XII: OLS, BLUE, and Assumption Errors

Multicollinearity
Multicollinearity occurs when two or more predictors in the model are correlated and provide redundant information about the response. Examples of multicollinear predictors are the height and weight of a person, years of education and income, and the assessed value and square footage of a home. Consequences of high multicollinearity:
1. Increased standard errors of the estimates of the β's (decreased reliability).
2. Often confusing and misleading results.

Example
Suppose that we fit a model with x1 = income and x2 = years of education as predictors and y = intake of fruits and vegetables as the response. Years of education and income are correlated, and we expect a positive association of both with intake of fruits and vegetables. t-tests for the individual β1 and β2 might suggest that neither of the two predictors is significantly associated with y, while the F-test indicates that the model is useful for predicting y. Contradiction? No! This has to do with the interpretation of regression coefficients.

F-test or t-tests?
β1 is the expected change in y due to x1 given that x2 is already in the model, and β2 is the expected change in y due to x2 given that x1 is already in the model. Since x1 and x2 contribute redundant information about y, once one of the predictors is in the model the other does not have much more to contribute. This is why the F-test indicates that at least one of the predictors is important, yet the individual t-tests indicate that the contribution of each predictor, given that the other has already been included, is not really important.
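A minimal SAS sketch of this contrast (the data set FRUIT and the variables intake, income, and educ are hypothetical names): a single PROC REG run prints the overall F-test in the Analysis of Variance table and the individual t-tests in the Parameter Estimates table, so the two can be compared directly.

proc reg data=fruit;
   /* overall F-test: Analysis of Variance table    */
   /* individual t-tests: Parameter Estimates table */
   model intake = income educ;
run;
quit;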

Detecting Multicollinearity
Easy way: compute the correlations between all pairs of predictors. If some correlations are close to 1 or -1, remove one of the two correlated predictors from the model. Another way: calculate the variance inflation factor for each predictor xj: VIFj = 1/(1 − Rj²), where Rj² is the coefficient of determination of the regression of the jth predictor on all the other predictors. If VIFj > 10, then there is a problem with multicollinearity.
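A minimal SAS sketch of both checks, assuming a data set MYDATA with response y and predictors x1-x3 (placeholder names): PROC CORR prints the pairwise correlations, and the VIF and TOL options on the MODEL statement of PROC REG print the variance inflation factors and tolerances.

proc corr data=mydata;
   var x1 x2 x3;                   /* pairwise correlations among predictors */
run;

proc reg data=mydata;
   model y = x1 x2 x3 / vif tol;   /* VIFj = 1/(1 - Rj^2) for each predictor */
run;
quit;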

Solution for Multicollinearity


If interest is only in estimation and prediction, multicollinearity can be ignored, since it does not affect ŷ or its standard error (whether estimating the mean response or predicting a new observation). This is true only if the predictor values at which we want estimation or prediction are within the range of the data. If the wish is to establish association patterns between y and the predictors, then the analyst can: eliminate some predictors from the model, or design an experiment in which the pattern of correlation is broken.

Multicollinearity in Polynomial Models


Multicollinearity is a problem in polynomial regression (with terms of second and higher order): x and x² tend to be highly correlated. A special solution in polynomial models is to use zi = xi − x̄ instead of just xi; that is, first subtract from each predictor its mean and then use the deviations in the model.
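A minimal SAS sketch of the centering remedy, assuming a data set MYDATA with response y and predictor x (placeholder names): PROC STANDARD with MEAN=0 replaces x by its deviation from the mean, and the squared term is then built from the centered variable.

proc standard data=mydata mean=0 out=centered;
   var x;                      /* x is replaced by x - x̄           */
run;

data centered;
   set centered;
   x_sq = x*x;                 /* squared term from the centered x  */
run;

proc reg data=centered;
   model y = x x_sq;           /* quadratic model in the centered x */
run;
quit;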

Heteroscedasticity
When the variance of the error ei depends on the value of yi, we say that the variances are heteroscedastic (unequal). Heteroscedasticity may occur for many reasons, but it typically occurs when the responses are not normally distributed or when the errors are not additive in the model. Examples of non-normal responses are: Poisson counts, such as the number of sick days taken per person per month or the number of traffic crashes per intersection per year; and binomial proportions, such as the proportion of felons who are repeat offenders or the proportion of consumers who prefer low-carb foods.

Checking normality assumption


Even though we assume that e ~ N(0, σ²), moderate departures from the assumption of normality do not have a big effect on the reliability of hypothesis tests or the accuracy of confidence intervals. There are several statistical tests to check normality, but they are reliable only in very large samples. The easiest type of check is graphical: fit the model, estimate the residuals ei = yi − ŷi, and get a histogram of the residuals to see if they look normal.
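A minimal SAS sketch of the graphical check, with placeholder names MYDATA, y, x1, and x2: the OUTPUT statement of PROC REG saves the residuals, and PROC UNIVARIATE draws their histogram with a fitted normal curve.

proc reg data=mydata;
   model y = x1 x2;
   output out=resids r=ehat;   /* ehat = yi - ŷi                       */
run;
quit;

proc univariate data=resids;
   var ehat;
   histogram ehat / normal;    /* histogram with fitted normal density */
run;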

Solutions for Heteroscedasticity


Unequal variances can often be ameliorated by transforming the data. Appropriate transformations:
- Poisson counts: √y
- Binomial proportions: sin⁻¹(√y)
- Multiplicative errors: log(y)
Fit the model to the transformed response. Typically, the transformed data will satisfy the assumption of homoscedasticity.
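A minimal SAS sketch of these transformations in a DATA step, assuming a data set MYDATA with a count variable y_count, a proportion p between 0 and 1, and a positive response y_pos (all placeholder names):

data transformed;
   set mydata;
   y_sqrt   = sqrt(y_count);   /* Poisson counts: square root               */
   y_arcsin = arsin(sqrt(p));  /* binomial proportions: arcsine square root */
   y_log    = log(y_pos);      /* multiplicative errors: log                */
run;

The model is then fit to y_sqrt, y_arcsin, or y_log in place of the original response.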

Serial Correlation (or Autocorrelation)


Serial correlation occurs when one observation's error term (εi) is correlated with another observation's error term (εj): Corr(εi, εj) ≠ 0. This usually happens because there is an important relationship (economic or otherwise) between the observations. It is especially problematic with time series data and cluster sampling.

Problem with Autocorrelation


OLS estimates remain unbiased, but the OLS estimator is no longer the best (minimum-variance) linear unbiased estimator. Serial correlation implies that the errors are partly predictable: for example, with positive serial correlation, a positive error today implies that tomorrow's error is likely to be positive as well. The OLS estimator ignores this information; more efficient estimators are available that do not.

Testing for Serial Correlation


There are a number of formal tests available; in addition to formal tests, always look at residual plots! The most common test is the Durbin-Watson (DW) d test. The model needs to have an intercept term and cannot include a lagged dependent variable (i.e., Yt-1 as one of the independent variables).

Durbin-Watson Test

The DW statistic is d = Σ(et − et-1)² / Σet², computed from the OLS residuals et, and its value reflects the first-order autocorrelation ρ of the errors. If ρ = 1, then we expect et ≈ et-1; when this is true, d ≈ 0. If ρ = -1, then we expect et ≈ -et-1, so et − et-1 ≈ -2et-1; when this is true, d ≈ 4. If ρ = 0, then we expect d ≈ 2.
Hence values of the test statistic far from 2 indicate that serial correlation is likely present.
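A minimal SAS sketch of the test, with placeholder names TS, y, and x: the DW option of PROC REG prints the Durbin-Watson d statistic, and PROC AUTOREG with the DWPROB option also reports a p-value for it.

proc reg data=ts;
   model y = x / dw;           /* Durbin-Watson d statistic        */
run;
quit;

proc autoreg data=ts;
   model y = x / dwprob;       /* d statistic with p-value         */
run;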

Solution
Use OLS to estimate the regression and fix the standard errors.
A. We know OLS is unbiased; it is just that the usual formula for the standard errors is wrong (and hence tests can be misleading).
B. We can get consistent estimates of the standard errors (as the sample size goes to infinity, a consistent estimator gets arbitrarily close to the true value in a probabilistic sense), called Newey-West standard errors.
C. When specifying the regression, use the coefficient covariance matrix option and the HAC (Newey-West) method.
D. Most of the time, this approach is sufficient.
E. If not, we cannot use OLS and need to use Generalized Least Squares, as sketched below.
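A minimal SAS sketch of the GLS fallback in step E, assuming a time-series data set TS with response y and regressor x (placeholder names): PROC AUTOREG re-estimates the model with an AR(1) error structure, a feasible GLS correction for serial correlation. (The HAC/Newey-West covariances of steps B and C are also available in SAS/ETS; only the GLS correction is shown here.)

proc autoreg data=ts;
   model y = x / nlag=1 method=ml;   /* AR(1) error model (feasible GLS) */
run;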
