COMP-STAT GROUP
Introduction
Regression Models
The Model
Y is the dependent variable
X1, ..., Xp are the independent variables
ε is the error term
Observe that the model is linear in the coefficients β0, β1, ..., βp.
What does linearity mean?
Simple linear regression: a model with only one predictor
Estimation: least squares and/or maximum likelihood estimator
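The model and its least-squares estimator can be written out as below. This is a sketch in standard textbook notation: the symbols β, ε, p and the design matrix X are assumptions, since the slide's own equation did not survive in the text.

```latex
% Multiple linear regression with p predictors
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon

% Still a linear model: linear in the coefficients, not in X
Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon

% Not a linear model: \beta_1 enters nonlinearly
Y = \beta_0 e^{\beta_1 X} + \varepsilon

% Least-squares estimator in matrix form (X includes a column of ones)
\hat{\boldsymbol\beta} = (X^\top X)^{-1} X^\top \mathbf{y}
```

"Linearity" therefore refers to how the coefficients enter the model, so transformed predictors such as X squared are allowed. When the errors are independent and normal with constant variance, the least-squares and maximum likelihood estimates of the coefficients coincide.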
Assumptions
Linearity
Normality
Homoscedasticity
Independence
(of explanatory variables, of error terms)
Number of cases
Data accuracy
Missing Data
Outliers
The first four (linearity, normality, homoscedasticity, independence) are the main assumptions.
Assumptions (contd.)
Number of cases
The ratio of cases to independent variables should ideally be 20:1 (minimum 5:1)
Accuracy of data
Check that the data points you entered are valid
Missing data
Missing values need to be treated
Outliers
Objectives of analysis
Estimation
Hypothesis testing
Confidence intervals
Prediction of new observations
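These objectives map onto options of the MODEL statement in PROC REG. A minimal sketch, assuming the data set name test and the variables y and x1 to x4 used in the SAS code elsewhere in this deck:

```sas
proc reg data=test;
  /* CLB: confidence limits for the coefficient estimates             */
  /* CLM: confidence limits for the mean response at each observation */
  /* CLI: prediction limits for an individual observation             */
  model y = x1 x2 x3 x4 / clb clm cli;
run;
quit;
```

The default PROC REG output already reports the coefficient estimates with their t tests and the overall F test. Rows that have a missing y but complete x values are left out of the fit yet still receive predicted values and limits, which is one common way to score new observations.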
An example
We have data on jet engine thrust as the response variable, with primary
speed of rotation, secondary speed of rotation, fuel flow rate,
pressure, exhaust temperature and ambient temperature at the time of
the test as regressor variables.
The objective is to fit a linear regression model, check whether the
model satisfies all underlying assumptions, and check whether it can
predict future observations correctly.
Variable selection
Important algorithms:
Forward selection
Backward elimination
Stepwise regression (preferred)
Always start with your domain knowledge. It will guide you through
the selection of variables from a set of candidate variables.
Don't rely too much on variable selection algorithms, since they are
driven entirely by the computer rather than by subject-matter understanding.
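The three algorithms can be requested through the SELECTION= option of the MODEL statement in PROC REG. A minimal sketch, assuming the data set test and variables y and x1 to x4 from the other examples in this deck; the 0.15 significance levels are illustrative choices, not recommendations from the slides:

```sas
proc reg data=test;
  /* SLENTRY: significance level a variable needs to enter the model */
  /* SLSTAY:  significance level a variable needs to stay in it      */
  model y = x1 x2 x3 x4 / selection=stepwise slentry=0.15 slstay=0.15;
run;
quit;
```

Replacing stepwise with forward or backward requests the other two algorithms. The candidate list on the right-hand side of the MODEL statement is where domain knowledge enters: the procedure can only choose among the variables it is given.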
Regression Diagnostics
Residuals
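e_i = y_i - ŷ_i
The smaller the residuals, the better the model fits.
SAS code
proc reg data=test;
  model y = x1 x2 x3 x4;
  /* keyword=name stores each diagnostic under the given variable name */
  output out=dataset student=student rstudent=rstudent press=press;
run;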
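The three OUTPUT keywords above correspond to scaled versions of the raw residual e_i. The definitions below are the standard ones; the symbols h_ii, s and s_(i) are notation introduced here, not taken from the slides.

```latex
% h_{ii}: leverage of observation i
% s: root mean square error; s_{(i)}: the same with observation i left out
\text{STUDENT:}\quad  r_i = \frac{e_i}{s\,\sqrt{1 - h_{ii}}}

\text{RSTUDENT:}\quad t_i = \frac{e_i}{s_{(i)}\sqrt{1 - h_{ii}}}

\text{PRESS:}\quad    e_{(i)} = y_i - \hat{y}_{(i)} = \frac{e_i}{1 - h_{ii}}
```

Here ŷ_(i) is the prediction for observation i from the model fitted without that observation, so a PRESS residual much larger than e_i points to an observation that strongly pulls the fit toward itself.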
Residual plots
Statistical Tests
Shapiro-Wilk test
SAS code
proc univariate data=residuals normal; /* NORMAL option requests normality tests */
  var r;
  qqplot r / normal(mu=est sigma=est);
  /* EST estimates the mean and standard deviation from the data itself */
run;
White Test
Remedy
SAS Code
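The code for this slide did not survive in the text. As a hedged sketch of how White's test is commonly requested in PROC REG, assuming the same data set test and variables y and x1 to x4 as in the other examples, and a SAS release in which the HCC option is available:

```sas
proc reg data=test;
  /* SPEC: White's specification test; a small p-value suggests        */
  /*       heteroscedasticity or another form of misspecification      */
  /* HCC:  heteroscedasticity-consistent standard errors               */
  model y = x1 x2 x3 x4 / spec hcc;
run;
quit;
```

Robust standard errors are only one common remedy and not necessarily the one the original slide listed. Transforming the response or using weighted least squares are frequently used alternatives.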
Outlier Treatment
An outlier is an extreme observation
Residuals considerably larger in absolute value than the others, say 3 or 4 standard
deviations from the mean, indicate potential y-space outliers
Outliers are data points that are not typical of the rest of the data
Residual plots and normal probability plots are helpful in identifying outliers
An outlier should be removed from the data before estimating the model only if it is a bad value
There should be strong non-statistical evidence that the outlier is a bad value before
it is discarded
Outliers are sometimes desired in the analysis (you may want points of high yield or, say, low cost)
Leverage
o An observation with an extreme value on a
predictor variable is called a point with high
leverage
o Leverage is a measure of how far an observation's
predictor values deviate from their means
o These leverage points can have an effect on the
estimates of the regression coefficients
o Rule of thumb: leverage > (2p+2)/n indicates high
leverage (p predictors, n observations)
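The (2p+2)/n cutoff can be motivated from the hat matrix. This is a sketch in standard notation, assuming p predictors plus an intercept and a full-rank design matrix X; none of these symbols appear on the slide itself.

```latex
% Hat matrix and the leverage of observation i
H = X (X^\top X)^{-1} X^\top, \qquad
h_{ii} = \mathbf{x}_i^\top (X^\top X)^{-1} \mathbf{x}_i

% The leverages sum to the number of parameters, which fixes their average
\sum_{i=1}^{n} h_{ii} = p + 1
\quad\Rightarrow\quad
\bar{h} = \frac{p + 1}{n}

% Rule of thumb: flag points with more than twice the average leverage
h_{ii} > 2\bar{h} = \frac{2p + 2}{n}
```

In other words, the rule flags any observation whose leverage is more than twice the average leverage.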
Influential Observations
o An observation is said to be influential if removing
the observation substantially changes the estimates
of the coefficients
o Influence can be thought of as the product of
leverage and outlyingness
o Not all leverage points are going to be influential on
the regression coefficients
o Measures (with rule-of-thumb cutoffs; p and n as in the leverage rule):
Cook's D (> 1), DFFITS (|DFFITS| > 2√(p/n)), DFBETAS (|DFBETAS| > 2/√n)
SAS Code
In the OUTPUT statement of PROC REG, use
COOKD=name1
DFFITS=name2
H=name3 /* H is for leverage */
(you can also use the INFLUENCE option in the
MODEL statement for a detailed analysis)
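Putting these keywords together, the saved diagnostics can be compared against the rule-of-thumb cutoffs in a DATA step. A minimal sketch: the data set names diag and flagged and the variable names cd, dff and lev are hypothetical, p = 4 matches the four predictors in the MODEL statement, and the DFFITS cutoff uses the usual square-root form 2√(p/n). It also assumes no observations were dropped for missing values, so that the row count of diag equals n.

```sas
proc reg data=test;
  model y = x1 x2 x3 x4 / influence;
  output out=diag cookd=cd dffits=dff h=lev;
run;
quit;

data flagged;
  set diag nobs=n;    /* n = number of observations in diag */
  p = 4;              /* number of predictors in the model above */
  high_leverage = (lev > (2*p + 2) / n);
  high_cookd    = (cd > 1);
  high_dffits   = (abs(dff) > 2 * sqrt(p / n));
  /* keep only the rows that exceed at least one cutoff */
  if high_leverage or high_cookd or high_dffits;
run;

proc print data=flagged;
run;
```

Flagged rows are candidates for closer inspection, not automatic deletion.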
Multicollinearity
Effect:
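Unstable coefficient estimates
Inflated standard errors of the coefficient estimates
Tools to detect
Examine the correlation matrix of the independent variables
Variance inflation factor (VIF > 10); tolerance is 1/VIF
Condition indices (> 1000 when defined as the eigenvalue ratio; SAS reports the square root of that ratio, for which values above about 30 are the usual warning sign)
Variance decomposition proportions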
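The VIF and tolerance can be written in terms of an auxiliary regression. In the sketch below, R_j^2 denotes the coefficient of determination from regressing the j-th predictor on all the other predictors; this notation is introduced here and is not on the slide.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
\mathrm{TOL}_j = \frac{1}{\mathrm{VIF}_j} = 1 - R_j^2
```

A VIF above 10 therefore means that more than 90% of the variance of that predictor is explained by the other predictors.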
Remedies
SAS code
proc reg data=test;
  model y = x1 x2 / vif tol collinoint;
  /* COLLINOINT gives a detailed collinearity analysis with the intercept
     adjusted out. The COLLIN option gives the same analysis with the
     intercept included. */
run;
Linearity
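SAS Code
proc sgscatter data=test;
  /* scatter plot matrix of the response and the predictors;
     GROUP= marks points by the values of the variable name */
  matrix y x1 x2 x3 x4 / group=name;
run;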