Blackwell Publishing and Royal Statistical Society are collaborating with JSTOR to digitize, preserve and
extend access to Journal of the Royal Statistical Society. Series B (Methodological).
J. R. Statist. Soc. B (1978), 40, No. 1, pp. 85-93
SUMMARY
A method is proposed for examining a linear model for the presence of one or more important outliers: deviant observations which have a potentially large influence on the resulting parameter estimates. For this purpose, a new statistic is proposed and its distributional properties are discussed. The method itself is exploratory in nature, but exact significance tests are available. Several examples illustrate the method. Computational aspects are discussed.
Keywords: MULTIPLE OUTLIERS; LEVERAGE POINTS; MASKING EFFECT; DETERMINANTAL RATIOS; MATRIX DECOMPOSITIONS
1. INTRODUCTION
In recent years there has been considerable interest in the effect of outlying observations in linear models. Robust techniques have been developed to reduce the effect of such observations. However, many practising statisticians are reluctant to abandon the "normal" linear model for these new methods. Hence, a large number of tests and procedures have been developed for outlier detection. Most of these procedures operate in the following ways:
(i) Sequentially, examining the most deviant (in some sense) observation first and considering other observations only when the first is beyond some threshold.
(ii) Without regard to the differing influence observations may have on the parameter estimates or predicted values which are often the prime focus of the analysis.
Now if it is possible that one observation is deviant, it is usually the case that two, three and more may be deviant. However, most sequential procedures are susceptible to the well-known "masking" effect, by which several deviant observations may conspire to obscure the most deviant observation. The net result is that the sequential procedure identifies no outliers.
When an outlier is detected, the analyst is faced with a number of questions:
Is the measurement process out of control?
Is the model wrong?
Is some transformation required?
Is there an identifiable subset of the observations that is important in its different behaviour?
These issues affect the interpretation and confidence in the resulting estimates and/or predictions. Since observations differ in the effect they can have on the resulting equation, it would be useful if the analyst's attention were directed to the observations with large effects (usually called "leverage" points). If a point has almost no influence on the results, there is little point in agonizing over how deviant it appears.
This paper presents a method which is capable of finding several outliers, with attention directed at those which can substantially influence the resultant estimates. Section 2 illustrates the nature of the difficulties with common outlier detection procedures. Section 3 presents the motivation for the proposed statistic. Its properties are developed in Section 4. Computational considerations are discussed in Section 5. In Section 6 the method is applied to several examples.
86 ANDREWS AND PREGIBON - Finding the Outliers that Matter [No. 1,
2. THE NATURE OF THE PROBLEM
Most current outlier detection techniques are based on the analysis of least-squares residuals, or some simple function of them. Many of these are graphical in nature, including:
(i) Plots of residuals vs other variables, and
(ii) Normal probability plots.
These graphical techniques are indeed useful, and can direct the skilled analyst to problem observations. Typically, only the raw residuals (observed - fitted values) are considered for these plots, but the need for standardization (by residual variance) is imperative (see, for example, Behnken and Draper, 1972). For plotting purposes, these standardized (or Studentized) residuals are much more indicative of deviant observations than their raw counterparts. They have also formed the basis of many non-graphical techniques (e.g. Stefansky, 1972; Tietjen et al., 1973; Lund, 1975; Prescott, 1975). These methods require the maximum Studentized residual (in absolute value) and a table of critical values for testing its significance. Daniel and Wood (1971, Chapter 7) discuss diagnostic measures for detecting influential points in the factor space, while Cook (1977) proposes quantities proportional to Studentized residuals, useful for identifying an influential point in the combined response-factor space. Most of these methods are constrained to examining one observation at a time, and the generalization of these to non-sequential procedures is not used.
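The Studentized residuals referred to above can be illustrated numerically. The following is a minimal sketch, assuming internal Studentization (each raw residual divided by its estimated standard error) and a small synthetic data set with one planted gross error; the function name `studentized_residuals` and the data are hypothetical and are not taken from the paper.

```python
import numpy as np

def studentized_residuals(X, y):
    """Internally Studentized least-squares residuals (illustrative sketch)."""
    n, p = X.shape
    Q, _ = np.linalg.qr(X)            # thin QR decomposition of X
    h = np.sum(Q * Q, axis=1)         # leverages: diagonal of Q Q'
    e = y - Q @ (Q.T @ y)             # raw residuals (observed - fitted)
    s2 = e @ e / (n - p)              # residual variance estimate
    return e / np.sqrt(s2 * (1.0 - h))

# Hypothetical data: a straight line with small noise and one gross error.
rng = np.random.default_rng(0)
x = np.arange(10.0)
X = np.column_stack([np.ones(10), x])
y = 2.0 + 0.5 * x + rng.normal(scale=0.1, size=10)
y[7] += 3.0                           # planted deviant observation

r = studentized_residuals(X, y)
worst = int(np.argmax(np.abs(r)))     # index of the maximum |Studentized residual|
```

A procedure of the kind cited above would compare `abs(r[worst])` with a tabulated critical value, which is why it looks at only one observation at a time.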
The following examples illustrate the difficulties with the common methods. These same examples will be re-examined in Section 6, using the methodology developed in Sections 3, 4 and 5.
Example 2.1
Daniel and Wood (1971, Chapter 5) discuss an example with 21 observations, one of which is quite out of joint. Methods based on the maximum Studentized residual identify this observation, but fail to declare its significance (at the nominal 5 per cent level). When this observation is removed, the pattern of residuals is unusual, but most outlier procedures will not indicate the presence of more spurious points. However, Daniel and Wood, after much careful analysis, focus their attention on three of the remaining points and give a rather compelling argument for setting these aside. The issue here is not whether or not these points are statistical outliers, whatever that may mean, but that a data-screening procedure should draw these to our attention for further examination.
Example 2.2
Mickey, Dunn and Clark (1967) suggest a sequential approach to detecting outliers via stepwise regression. They suggest using the relative decrease in the residual sum of squares from deleting an observation as the suitable criterion for locating bad points. The relationship between this criterion and stepwise regression lies in the fact that the deletion of observations from a regression problem can be accomplished by introducing a dummy variable for each point to be deleted. Hence, when the most "significant" new variable enters the model, the observation associated with it is deemed to be suspect. Their approach was illustrated with the data shown in Fig. 1, where observation 19 was identified as being deviant (at the 5 per cent level of significance). Although the main criticism of this approach is its possible failure to detect outliers when more than one is present, the point of this example here is that observation 19 has little effect on resulting estimates and, hence, need not be queried extensively!
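The equivalence between deleting an observation and introducing a dummy variable for it can be checked numerically. The sketch below is an illustration only, on hypothetical synthetic data (not the data of Fig. 1); the helper `rss` and all variable names are assumptions introduced here.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

# Hypothetical straight-line data.
rng = np.random.default_rng(0)
n = 12
x = np.linspace(0.0, 11.0, n)
X = np.column_stack([np.ones(n), x])
y = 3.0 - 0.7 * x + rng.normal(scale=0.3, size=n)

i = 4                                  # observation to be "deleted"
dummy = np.zeros(n)
dummy[i] = 1.0                         # indicator variable for observation i

rss_dummy = rss(np.column_stack([X, dummy]), y)   # fit with the dummy added
keep = np.arange(n) != i
rss_deleted = rss(X[keep], y[keep])               # fit with observation i dropped

# Relative decrease in RSS, the kind of criterion described in the example.
relative_decrease = (rss(X, y) - rss_deleted) / rss(X, y)
```

The two residual sums of squares agree because the dummy variable absorbs the residual of observation `i` exactly, leaving the remaining observations to determine the fit.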
Example 2.3
Wood (1973) also suggests using dummy variables to indicate whether certain observations should receive special attention, but only after they have been previously identified. His method was illustrated by a process variable study of a refinery unit involving 82 observations on 4 independent variables. After a preliminary fit of the data, a thorough graphical analysis and knowledge of the system indicated that three points arose from a different operating level than the others. On further analysis, however, he determined that these observations "merely extended the effective range" of the variables involved, i.e. these are outliers which do not matter. Now, not all analyses are carried out with such competence, and the need for a unified approach seems in order.
[Figure 1: scatter plot of Y (vertical axis, 60 to 130) against X (horizontal axis, 0 to 40), with observations 18 and 19 labelled.]
FIG. 1. A regression problem with one outlier that does not matter (19) and one that might (18). In this example, X represents the age of a child at first word (in months) and Y represents the child's score in an aptitude test.
3. MOTIVATING STRATEGY
In the following sections, a linear model
$$y = X\beta + \text{noise} \tag{3.1}$$
is discussed where $y$ is a vector of dependent variables, $X$ is a matrix of independent variables of rank $p$ describing a regression problem or designed experiment, and "noise" denotes the disturbances from the model. Further assumptions about distributions and dependence will be made only where needed.
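Let $D^{(k)}_{ij\ldots}$ be the operator which deletes the elements associated with the $k$ observations $(ij\ldots)$ from the terms of the following expression. Thus, $D^{(1)}_{i}y$ is the $(n-1) \times 1$ vector of dependent variables with the $i$th observation deleted; $[D^{(2)}_{ij}(X'X)]^{-1}$ is the inverse of the inner product matrix formed after deleting observations $i$ and $j$. For convenience we may write this as $D^{(2)}_{ij}(X'X)^{-1}$.
The complement operator $\bar{D}^{(k)}_{ij\ldots}$ selects rather than deletes elements; for example, $\bar{D}^{(3)}_{ijk}(\cdot) = D^{(n-3)}_{lm\ldots}(\cdot)$, where $(lm\ldots)$ denotes the $n-3$ observations other than $i$, $j$ and $k$.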
Now much of the theory of optimal design is based on the determinant $|X'X|$, large values being associated with informative designs (the volume of the joint confidence ellipsoid is proportional to $|X'X|^{-1/2}$). If the deletion of an observation has a large (small) effect on $|X'X|$, the observation has a large (small) influence on the resulting estimates.
Similarly, the more a particular observation deviates from the fitted values, the more the deletion of this observation will reduce the residual sum of squares $\mathrm{RSS} = (y - X\hat{\beta})'(y - X\hat{\beta})$.
These two ideas may be combined by calculating the change in $\mathrm{RSS} \times |X'X|$ resulting from the deletion of one or more observations. This particular form is convenient for study as
$$|X^{*\prime}X^{*}| = \mathrm{RSS} \times |X'X|, \tag{3.2}$$
where $X^{*} = (X : y)$, the matrix of independent variables with $y$ appended. To assess the relative changes due to the deletion of $k$ observations $(ij\ldots)$ consider the ratio
$$R^{(k)}_{ij\ldots}(X^{*}) = \frac{D^{(k)}_{ij\ldots}|X^{*\prime}X^{*}|}{|X^{*\prime}X^{*}|}. \tag{3.3}$$
This quantity is dimensionless. Geometrically $1 - \{R^{(k)}_{ij\ldots}(X^{*})\}^{1/2}$ corresponds to the proportion of the volume generated by $X^{*}$ attributable to the $k$ observations $(ij\ldots)$. If this subset of the observations lies "far out" in the factor space, it will account for a large proportion of the volume of the space, lending some realistic interpretation to the term "outliers". Hence, small values of $R^{(k)}_{ij\ldots}(X^{*})$ are associated with deviant and/or influential observations. Regardless of which is actually the case, it is desirable to isolate subsets of the observations producing small $R^{(k)}_{ij\ldots}(X^{*})$ for further scrutiny.
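Identity (3.2) and the ratio (3.3) can be checked directly by brute-force deletion. The following is a minimal sketch on hypothetical synthetic data with one planted deviant observation; the function name `deletion_ratio` and the data are assumptions made for illustration, and no attempt is made at computational efficiency.

```python
import numpy as np

def deletion_ratio(X, y, subset):
    """Ratio (3.3): |X*'X*| after deleting `subset`, divided by the full |X*'X*|."""
    Xs = np.column_stack([X, y])                      # X* = (X : y)
    keep = np.setdiff1d(np.arange(len(y)), subset)    # observations retained
    full = np.linalg.det(Xs.T @ Xs)
    reduced = np.linalg.det(Xs[keep].T @ Xs[keep])
    return reduced / full

# Hypothetical data: straight line plus noise, one planted deviant point.
rng = np.random.default_rng(0)
n = 15
x = np.linspace(0.0, 10.0, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
y[3] += 8.0

# Check of identity (3.2): |X*'X*| = RSS x |X'X|.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
Xs = np.column_stack([X, y])
lhs = np.linalg.det(Xs.T @ Xs)
rhs = float(resid @ resid) * np.linalg.det(X.T @ X)

# Single-deletion ratios; small values flag deviant and/or influential points.
r_single = np.array([deletion_ratio(X, y, [i]) for i in range(n)])
most_suspect = int(np.argmin(r_single))
```

In this example the planted observation gives by far the smallest ratio, since deleting it removes most of the residual sum of squares.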
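4. PROPERTIES OF $R^{(k)}_{ij\ldots}(X^{*})$
A surprising number of algebraic properties combine to make the ratios $R^{(k)}_{ij\ldots}(X^{*})$ easy to calculate and study. Since
$$X^{*\prime}X^{*} = D^{(1)}_{i}(X^{*\prime}X^{*}) + x_i^{*\prime}x_i^{*},$$
where $x_i^{*}$ is the $i$th row of $X^{*}$, it follows that
$$D^{(1)}_{i}|X^{*\prime}X^{*}| = |X^{*\prime}X^{*}|\, w^{*}_{ii}, \tag{4.1}$$
where $W^{*} = I - X^{*}(X^{*\prime}X^{*})^{-1}X^{*\prime}$. To avoid singularities, it is convenient to use an orthogonal ($Q^{*}$), triangular ($T^{*}$) decomposition of $X^{*}$, namely $X^{*} = Q^{*}T^{*}$, and write
$$W^{*} = I - Q^{*}Q^{*\prime}.$$
If more than one observation is dropped, it is easy but unpleasant to show that
$$D^{(k)}_{ij\ldots}|X^{*\prime}X^{*}| = |X^{*\prime}X^{*}| \times \bar{D}^{(k)}_{ij\ldots}|W^{*}|.$$
Thus, from (3.3), the measure of the effect of dropping $k$ observations $(ij\ldots)$ is
$$R^{(k)}_{ij\ldots}(X^{*}) = \bar{D}^{(k)}_{ij\ldots}|W^{*}|, \tag{4.2}$$
a $k \times k$ determinant.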
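Result (4.2) can be verified numerically: the ratio obtained by actually deleting rows should equal the determinant of the corresponding $k \times k$ block of $W^{*}$. The sketch below assumes hypothetical random data; the function names `ratio_from_w` and `ratio_by_deletion` are introduced here for illustration only.

```python
import numpy as np

def ratio_from_w(X, y, subset):
    """R via (4.2): determinant of the k x k block of W* selected by `subset`."""
    Xs = np.column_stack([X, y])            # X* = (X : y)
    Q, _ = np.linalg.qr(Xs)                 # X* = Q* T*
    W = np.eye(len(y)) - Q @ Q.T            # W* = I - Q* Q*'
    idx = np.asarray(subset)
    return np.linalg.det(W[np.ix_(idx, idx)])

def ratio_by_deletion(X, y, subset):
    """R via (3.3): recompute |X*'X*| after deleting the rows in `subset`."""
    Xs = np.column_stack([X, y])
    keep = np.setdiff1d(np.arange(len(y)), subset)
    return np.linalg.det(Xs[keep].T @ Xs[keep]) / np.linalg.det(Xs.T @ Xs)

# Hypothetical data with two explanatory variables plus a constant.
rng = np.random.default_rng(1)
n = 12
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

subset = [2, 5, 9]
r_w = ratio_from_w(X, y, subset)
r_del = ratio_by_deletion(X, y, subset)
```

The practical point is that `W` is computed once, after which every subset costs only a small determinant rather than a new fit.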
The distributional properties of this measure may be studied using known results about matrix compounds (Aitken, 1958). In particular, $\bar{D}^{(k)}_{ij\ldots}|W^{*}|$ is a diagonal element of $W^{*(k)}$, the $k$th compound of $W^{*}$. The sum of all $R^{(k)}_{ij\ldots}(X^{*})$ formed by dropping all combinations of the $n$ observations taken $k$ at a time is found by using a diagonalization
$$W^{*} = U^{*}V^{*}U^{*\prime}$$
with $U^{*}$ orthogonal and $V^{*}$ diagonal (with elements clearly 1 and 0). For $X^{*}$ of rank $p + 1$, we have
?t = Pr{R
{kP(X*): R;KRg} = Pr( eZ1(}k) (4.4)
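The diagonalization argument above can be illustrated numerically. Since $W^{*}$ is a projection matrix with eigenvalues 1 and 0, the sum of its $k \times k$ principal minors equals the $k$th elementary symmetric function of its eigenvalues, which for rank $n - p - 1$ is the binomial coefficient $\binom{n-p-1}{k}$. This is a standard linear-algebra fact, stated here as an assumption about what the argument implies rather than as a formula recovered from the text; the data below are hypothetical.

```python
import numpy as np
from itertools import combinations
from math import comb

# Hypothetical data: n observations, p = 2 columns in X, so X* has rank p + 1.
rng = np.random.default_rng(0)
n, p, k = 8, 2, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.normal(size=n)

Xs = np.column_stack([X, y])            # X* = (X : y)
Q, _ = np.linalg.qr(Xs)                 # X* = Q* T*
W = np.eye(n) - Q @ Q.T                 # W* = I - Q* Q*'

# Sum of the ratios R over every subset of k observations.
total = sum(
    np.linalg.det(W[np.ix_(s, s)]) for s in combinations(range(n), k)
)
expected = comb(n - p - 1, k)           # binomial coefficient C(n-p-1, k)
```

Dividing `total` by the number of subsets, `comb(n, k)`, gives the average ratio, a natural yardstick against which a small observed minimum can be judged.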
5. COMPUTATIONAL CONSIDERATIONS
Regardless of whether the statistics $R^{(k)}_{ij\ldots}(X^{*})$ are to be used to test the significance of a subset of the observations, or merely as a diagnostic tool for locating them, it is necessary to find $R_k = \min_{(ij\ldots)} R^{(k)}_{ij\ldots}(X^{*})$ for $k = 1, \ldots, k_{\max}$. From (4.1) we have
TABLE 1
Summary statistics for the data of Example 2.1
k    $R_k$        $\alpha_k$   Observations
1    0.4225375    0.0596515    21
2    0.2072535    0.1093887    21, 4
3    0.1147183    0.3962458    21, 4, 2
4    0.0337735    0.0257674    21, 4, 3, 1
5    0.0080269    0.0008339    21, 4, 3, 1, 2
6    0.0039551    0.0029340    21, 4, 3, 1, 2, 13
TABLE 2
Summary statistics for the data of Example 2.2
k    $R_k$        $\alpha_k$   Observations
1    0.3350940    0.3838198    18
2    0.1645985    0.2310607    18, 2
3    0.0897915    0.2262575    18, 2, 19
4    0.0593024    0.7475751    18, 2, 19, 11
TABLE 3
k    $R_k$        Observations
1    0.6297084    75
2    0.3743162    75, 76
3    0.2238964    75, 76, 77
4    0.1551370    75, 76, 77, 73
[Figure: plot with horizontal axis LOG RATIO (0 to 2.5); caption and data not recoverable.]
7. CONCLUSION
The procedure described here identifies one or more observations which are both (i) potential outliers in terms of the response y and (ii) potentially greatly influencing the estimates in the linear model.
In problems of moderate size, conservative levels of significance may be assessed. The examples illustrate that the proposed procedure identifies points missed by other stepwise procedures and even some able analysts. Attention is directed to the points having large influence; outliers of no moment are ignored.
The graphical displays are particularly useful in large problems where the direct assessment of significance is not important.
The examination of all subsets of observations does not seem feasible yet. The proposed procedure first selects the set of $m = 2k_{\max}$ most suspect observations and then examines all subsets of these of size k or less. This procedure eliminates much (in our experience all) of the masking effect which can upset single step procedures which consider only one observation. These correspond to the case m = k = 1 in the above.
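The two-stage search described above might be sketched as follows. This is an illustration under stated assumptions, not the authors' program: "most suspect" is taken here to mean the smallest single-deletion ratio $w^{*}_{ii}$, the function name `find_outlier_subsets` is hypothetical, and the data are synthetic with two planted outliers.

```python
import numpy as np
from itertools import combinations

def find_outlier_subsets(X, y, k_max):
    """For k = 1..k_max, return (R_k, subset) minimizing R over the suspects."""
    n = len(y)
    Xs = np.column_stack([X, y])                 # X* = (X : y)
    Q, _ = np.linalg.qr(Xs)                      # X* = Q* T*
    W = np.eye(n) - Q @ Q.T                      # W* = I - Q* Q*'
    # Assumption: suspects are the m = 2 * k_max points with smallest w*_ii.
    m = min(n, 2 * k_max)
    suspects = np.argsort(np.diag(W))[:m]
    best = {}
    for k in range(1, k_max + 1):
        scored = (
            (float(np.linalg.det(W[np.ix_(s, s)])),
             tuple(sorted(int(i) for i in s)))
            for s in combinations(suspects, k)
        )
        best[k] = min(scored)                    # smallest ratio for this k
    return best

# Hypothetical data: a clean line with two adjacent planted outliers.
rng = np.random.default_rng(1)
n = 20
x = np.arange(float(n))
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=n)
y[18] += 30.0
y[19] += 30.0

best = find_outlier_subsets(X, y, k_max=2)
```

With `k_max=2` only 4 suspects and their 6 pairs are examined rather than all 190 pairs, and the pair of planted points is recovered together even though each partly masks the other when considered singly.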
ACKNOWLEDGEMENT
This research was supported in part by the National Research Council of Canada.
REFERENCES
AITKEN, A. C. (1958). Determinants and Matrices, 9th ed. New York: Interscience.
ANDREWS, D. F. (1971). Significance tests based on residuals. Biometrika, 58, 139-148.
ANDREWS, D. F. (1973). A general method for the approximation of tail areas. Ann. Statist., 1, 367-372.
BEHNKEN, D. W. and DRAPER, N. R. (1972). Residuals and their variance patterns. Technometrics, 14, 102-111.
COLLETT, D. and LEWIS, T. (1976). The subjective nature of outlier rejection procedures. Appl. Statist., 25, 228-237.
COOK, R. D. (1977). Detection of influential observation in linear regression. Technometrics, 19, 1, 15-18.
DANIEL, C. and WOOD, F. S. (1971). Fitting Equations to Data. New York: Wiley.
LUND, R. E. (1975). Tables for an approximate test for outliers in linear models. Technometrics, 17, 473-476.
MICKEY, M. R., DUNN, O. J. and CLARK, V. (1967). Note on the use of stepwise regression in detecting outliers. Computers and Biomed. Res., 1, 105-111.
PRESCOTT, P. (1975). An approximate test for outliers in linear models. Technometrics, 17, 129-132.
RAO, C. R. (1965). Linear Statistical Inference and its Applications, 2nd ed. New York: Wiley.
STEFANSKY, W. (1972). Rejecting outliers in factorial designs. Technometrics, 14, 469-479.
TIETJEN, G. L., MOORE, R. H. and BECKMAN, R. J. (1973). Testing for a single outlier in simple linear regression. Technometrics, 15, 717-721.
WILKS, S. S. (1963). Multivariate statistical outliers. Sankhyā, Series A, 25, 407-426.
WOOD, F. S. (1973). The use of individual effects and residuals in fitting equations to data. Technometrics, 15, 677-694.