
Detecting Influential Outliers in Linear Regression
STAT 6120 Project

Definitions
Johnson (1992) defines an outlier as an observation in a data set which appears to be inconsistent with the remainder of that set of data. Outliers can arise from incorrect measurements, including data entry errors, or from observations drawn from a different population than the rest of the data.
Outliers can harm data analysis. Osborne and Overbay (2004) categorize their effects:
1. Outliers increase error variance and reduce the power of statistical tests.
2. They can bias or unduly influence the estimates that researchers are interested in.

Review of Methods for Detecting Outliers
Several techniques are in use for detecting outliers. These include leverage values, Cook's distance, and the covariance ratio.

Linear Regression Model


Consider the model

$Y = X\beta + e$

where $Y$ is an $n \times 1$ vector of observations, $X$ is an $n \times p$ full-rank matrix of known constants, $\beta$ is a $p \times 1$ vector of unknown parameters, and $e$ is an $n \times 1$ vector of random errors such that $E(e) = 0$ and $V(e) = \sigma^2 I_n$.
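As a minimal sketch of this model, the snippet below simulates data from a simple linear regression and computes the least-squares estimate of the parameter vector. The sample size, the true coefficients, the noise level, and the helper name `fit_ols` are illustrative assumptions, not values from the project.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares estimate of beta in y = X beta + e, plus the residuals."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta_hat
    return beta_hat, residuals

# Hypothetical data: n = 30 observations, p = 2 columns (intercept and slope).
rng = np.random.default_rng(0)
n = 30
x = np.linspace(0.0, 10.0, n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([1.0, 2.0])              # assumed for illustration only
y = X @ beta_true + rng.normal(scale=0.5, size=n)

beta_hat, residuals = fit_ols(X, y)
print("estimated beta:", beta_hat)
```

The residuals of a least-squares fit are orthogonal to the columns of X, which gives a quick sanity check on any implementation.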

Leverage Values
The hat matrix is $H = X(X'X)^{-1}X'$. The diagonal elements of $H$ indicate whether an observation is outlying with respect to its $x$ values. The leverage value for the $i$th observation in the data matrix $X$ is given by

$h_{ii} = x_i'(X'X)^{-1}x_i$,

where $x_i'$ is the $i$th row of $X$. A leverage value is considered large if it is more than twice the mean leverage value, $p/n$. That is, observations with leverage values greater than $2p/n$ are considered outliers.
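The leverage rule can be sketched in a few lines. This example assumes the usual hat matrix $H = X(X'X)^{-1}X'$ and the cut-off of twice the mean leverage, $2p/n$; it obtains the diagonal of $H$ from a QR decomposition rather than forming $H$ explicitly. The function names and the toy data are hypothetical.

```python
import numpy as np

def leverage_values(X):
    """Diagonal of the hat matrix H = X (X'X)^{-1} X'.

    With the reduced QR decomposition X = QR, H = QQ', so h_ii is the
    squared norm of the i-th row of Q.
    """
    Q, _ = np.linalg.qr(X)
    return np.sum(Q**2, axis=1)

def high_leverage(X):
    """Indices (0-based) whose leverage exceeds twice the mean leverage, 2p/n."""
    n, p = X.shape
    return np.flatnonzero(leverage_values(X) > 2.0 * p / n)

# Toy data: nine evenly spaced x values and one far from the rest.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)
X = np.column_stack([np.ones_like(x), x])
h = leverage_values(X)
flagged = high_leverage(X)
print(flagged)
```

Only the last observation, whose x value lies far from the others, exceeds the cut-off in this toy example.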

Covariance Ratio
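Another measure of the influence of the $i$th observation compares the estimated variance of $\hat{\beta}$ with the estimated variance of $\hat{\beta}_{(i)}$, the estimate obtained when the $i$th observation is deleted. A ratio that deviates from unity indicates that the observation is potentially influential.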
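The slides do not spell out the formula, so the sketch below assumes the commonly used determinant form of the covariance ratio: the determinant of the estimated covariance matrix of the coefficients with observation $i$ deleted, divided by the same determinant for the full data. It refits the model for each deletion to keep the logic transparent. The function names and the toy data are hypothetical.

```python
import numpy as np

def estimated_cov_beta(X, y):
    """Estimated covariance matrix of the OLS coefficients, s^2 (X'X)^{-1}."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)
    return s2 * np.linalg.inv(X.T @ X)

def covariance_ratio(X, y):
    """Assumed determinant form: det(Cov without obs i) / det(Cov with all obs)."""
    n = len(y)
    det_full = np.linalg.det(estimated_cov_beta(X, y))
    idx = np.arange(n)
    ratios = np.empty(n)
    for i in range(n):
        keep = idx != i
        ratios[i] = np.linalg.det(estimated_cov_beta(X[keep], y[keep])) / det_full
    return ratios

# Toy data: an almost exact line with one shifted response (index 5).
x = np.arange(10, dtype=float)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.1 * (-1.0) ** np.arange(10)
y[5] += 10.0
ratios = covariance_ratio(X, y)
print(np.round(ratios, 4))
```

In this toy example the ratio for the shifted observation is far below one, because deleting it shrinks the estimated error variance sharply.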

Coefficient of Determination Ratio
To assess the influence of an observation in a linear regression analysis, we look at the changes that occur when that observation is omitted. The proposed method uses the coefficient of determination, $R^2$, of the linear regression model. $R^2_{(i)}$, the coefficient of determination when the $i$th observation is deleted, is also computed. The values of $R^2_{(i)}$ and $R^2$ are compared by calculating their ratio. This measure is referred to as the Coefficient of Determination Ratio (CDR).

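The CDR for the $i$th observation is defined as

$CDR_i = \dfrac{R^2_{(i)}}{R^2} = \dfrac{SSR_{(i)}}{SST_{(i)}} \times \dfrac{SST}{SSR}, \quad i = 1, 2, \ldots, n,$

where $SSR$ and $SST$ are the regression and total sums of squares, and the subscript $(i)$ marks quantities computed with the $i$th observation deleted. For a model with an intercept, suitable expressions for the deleted sums of squares are

$SSR_{(i)} = SSR - \dfrac{n\,(y_i - \bar{y})^2}{n - 1} + \dfrac{e_i^2}{1 - h_{ii}}$

$SST_{(i)} = SST - \dfrac{n\,(y_i - \bar{y})^2}{n - 1}$

where $e_i$ is the $i$th residual and $h_{ii}$ is the $i$th leverage value.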
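The definition can be illustrated directly by refitting the model with each observation left out and dividing the resulting $R^2_{(i)}$ by the full-data $R^2$. This is a brute-force sketch for checking the idea, not the single-fit computation the project describes; the function names and the toy data are hypothetical, and the model is assumed to include an intercept.

```python
import numpy as np

def r_squared(X, y):
    """Coefficient of determination of an OLS fit (model assumed to include an intercept)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - sse / sst

def cdr_by_deletion(X, y):
    """CDR_i = R^2 with observation i deleted, divided by R^2 on all observations."""
    n = len(y)
    idx = np.arange(n)
    r2_full = r_squared(X, y)
    return np.array([r_squared(X[idx != i], y[idx != i]) / r2_full
                     for i in range(n)])

# Toy data: an almost exact line with one shifted response (index 5).
x = np.arange(10, dtype=float)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.1 * (-1.0) ** np.arange(10)
y[5] += 10.0
cdr = cdr_by_deletion(X, y)
print(np.round(cdr, 3))
```

In this toy example the shifted observation has the CDR furthest from one, since removing it restores an almost perfect fit.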

Implementation of CDR
In computing CDR, a linear regression analysis is carried out only once, and the regression results are used to evaluate CDR for each observation. If $CDR_i$ for the $i$th observation deviates from unity, then that observation is influential.
The data set used is an artificial one.
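One way the single-fit computation could work is sketched below. It assumes a model with an intercept and relies on two standard deletion identities: the residual sum of squares without observation $i$ is $SSE - e_i^2/(1 - h_{ii})$, and the total sum of squares without observation $i$ is $SST - n(y_i - \bar{y})^2/(n - 1)$. The function name `cdr_single_fit` and the toy data are hypothetical, and this is not claimed to be the project's exact implementation.

```python
import numpy as np

def cdr_single_fit(X, y):
    """CDR for every observation from one OLS fit (X must contain an intercept column)."""
    n = len(y)
    Q, _ = np.linalg.qr(X)               # X = QR, so the hat matrix is QQ'
    h = np.sum(Q**2, axis=1)             # leverage values h_ii
    e = y - Q @ (Q.T @ y)                # residuals of the full fit
    d = y - y.mean()
    sse = e @ e
    sst = d @ d
    sse_del = sse - e**2 / (1.0 - h)     # SSE with observation i deleted
    sst_del = sst - n * d**2 / (n - 1)   # SST with observation i deleted
    r2_full = 1.0 - sse / sst
    r2_del = 1.0 - sse_del / sst_del
    return r2_del / r2_full

# Toy data: an almost exact line with one shifted response (index 5).
x = np.arange(10, dtype=float)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.1 * (-1.0) ** np.arange(10)
y[5] += 10.0
cdr = cdr_single_fit(X, y)
print(np.round(cdr, 3))
```

Because nothing is refitted, the cost is one decomposition of X plus a few vector operations, which matters when n is large. The sketch does not guard against leverage values equal to one.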

Implementation of CDR to Detect Outliers in Simple Linear Regression
[Figure: scatter plot of the artificial data]
Five observations, {28, 29, 30, 31, 32}, are suspected to be outliers.

We now check the performance of the CDR method. From the table, the CDR method detects {28, 29, 30, 31, 32} as outliers. One of the other methods detects {28, 30, 31, 32} as outliers but not 29.

Conclusion
The CDR identifies the outliers that the other methods identify. In addition, it identifies outliers that the other methods miss. Future work is needed to establish formal cut-off values for the CDR method.
