Sunteți pe pagina 1din 3

Data

Prep Black Swans


Infinancialmarkets,thetermblackswan1isusedtodescribeanyrare,unanticipated,yetvery influentialeventthathasamajorimpact.BlackSwaneventsaregenerallydifficultorimpossibleto predict.

Inthisissue,thefifthtutorialinourdatapreparationseries,weunwrapanothertroublingphenomenon commonlyencounteredinsampledata:whathappenswhenafewobservationslookoffanddontquite fitinwiththerestofthesample? Inthistutorial,wewilldiscusstheproblemofoutliers,howtodetectthem,andwhatwecandowith them.

Background
Instatistics,anoutlierisanobservationthatisnumericallydistantfromtherestofthedata.Inother words,anoutlierisonethatappearstodeviatemarkedlyfromothermembersofthesampleinwhichit occurs. Outliersariseduetochangesinsystembehavior,fraudulentbehavior,humanerror,instrumenterroror simplythroughnaturaldeviationsinpopulations(occurringbychanceinanydistribution,orwhenthe populationhasaheavytaileddistribution). Intimeseriesanalysis,weshouldexaminethepresenceofoutlier(s)onlyforastationaryprocess. Pleasenotethatoutliers,beingthemostextremeobservations,mayincludethesamplemaximumor sampleminimum,orboth,dependingonwhethertheyareextremelyhighorlow.However,thesample maximumandminimumarenotalwaysoutliersbecausetheymaynotbeunusuallyfarfromother observations.

Why should I care?


Theobservedseriesmaybecontaminatedbysocalledoutliers.Theseoutliersmaychangethemean leveloftheuncontaminatedseries.Furthermore,thepresenceofoutliersinthesamplesuggeststhat theunderlyingdistributionhasfattailsorkurtosis. Instatistics,estimatorsthatarecapableofcopingwithoutliersaresaidtoberobust.Forinstance,the medianisarobuststatistic,whilethemeanisnot.Naiveinterpretationsofstatisticsderivedfromdata setsthatincludeoutliersmaybemisleading.
BlackswaneventswereintroducedbyNassimNicholasTalebinhis2004bookFooledByRandomness,which concernedfinancialevents.
1

DataPrepOutliers

SpiderFinancialCorp,2012

Ontheotherhand,weshouldalwayssearchforthecausesoftheidentifiedoutliers,payingspecial attentiontolevelchanges,variance,andthoseoutliersthatcantbeexplained.

Are all outliers the same?


Notalloutliersarecreatedequally;moreimportantly,outliersdontalwaysexhibitthesamedegreeof influenceontheparametervaluesoftheproposedmodel. Inatimeseriesanalysis,weneedtoaskthefollowingquestion:whatistheimpactonthemodel parametersifweleaveanoutlierinthesampledatavs.droppingit? Inregression,wedusecooksdistancetomeasurethisinfluence,excludingonlythosevaluesthat exhibitalargeinfluence. Insum,weneedtoevaluateoutliersnotbythemagnitudeoftheirvalues,butbytheinfluencethey exhibitonthemodelsparametervalues.

How do I detect those outliers?


Ingeneral,wewishforanoutlierdetectionmethodthatcananswerthefollowingquestions: (1) Arethereoutliers? (2) Whataretheirlocations? (3) Whataretheirtypesandmagnitudes? Theoutlierdetectionmethods2fallintothreemaincategories: Modelbasedmethods,whicharecommonlyusedforidentificationwhenweassumethedata arefromanormaldistributionGrubbstest,Peircescriterion,andChauvenetscriterion,etc. DistancebasedmethodsCooksdistance OthermeasurebasedInterquartilerange,etc. Adaptivefiltering.3

Pleasebearinmindthatthemethodsabovedetectapotentialoutlier,butitisyourresponsibilityto verify,andtosomeextentexplain,theirvalues.

Thereareseveralgraphicaltechniquescan,andshould,beusedtodetectoutliers.Asimplerunsequenceplot,a boxplot,orahistogramshouldshowanyobviouslyoutlyingpoints.Anormalprobabilityplotmayalsobeuseful. 3 Youhaveafilterwhichconstantlyadaptstotheinputsignal,effectivelymatchingitsfiltercoefficientstoa hypotheticalshorttermmodelofthesignalsource,therebyreducingmeansquareerroroutput.Thisthengives youalowleveloutputsignal(theresidualerror)exceptforwhenyougetanoutlier,whichwillresultinaspike, whichwillbeeasytodetect(threshold).


2

DataPrepOutliers

SpiderFinancialCorp,2012

I have a few (possible) outliers in my data; whats next?


Goldenrule:weshouldalwayssearchforthecausesoftheidentifiedoutliers:levelchangesand variance,andthoseoutliersthatcantbeexplaineddemandspecialattention. Case 1: We can explain the outliers by exogenous disturbances Inthiscase,amoreappropriatestrategywouldbetospecifyageneralmodelinsomeformbasedon causesoftheexogenousdisturbancesandthetimeseriesparameters.Thisstrategyallowsfortheuse ofpriorinformationofthedisturbances.Itcanalsoreducethepossibilityofoverparameterization thatarisesfromtheabuseofthedetectionprocedure. Case 2: We cant explain the outliers Oncewedetectedafew(candidate)outliersinoursample,weareleftwithtwooptions: Retain:Assumetheyareagenuineoutcomeoftheunderlyingprocessandproceedwithour analysis. Exclude:Assumetheyarebadvalues4(e.g.dataentryerror),scrapthem,andassumetheyare missing.

NOTE:Intimeseries,werequireoursampledatatobeequallyspaced,sodroppinganoutlierwillcreate agap(missingvalue)inyourtimeseries.Toretaintheequalspacing,wereferyoutotheinterpolation methodsdiscussedinanearlierissue.Keepinmindthatwearefundamentallyalteringthetimeseries regardlessofwhatvaluesweplugin,soagreatdealofdiscretionisinorder. Furthermore,youshoulddifferentiateamongoutliersthroughtheirinfluenceontheunderlyingmodels parameters,andstartwiththosethatexhibitthegreatestdegreesofinfluence.

Conclusion
Theoutliersprocessingisabigandcomplexsubject,andtheanswerwilldependonhowmucheffort youwanttoinvestinit,andhoweffectiveyourmeansofoutlierdetectionprovetobe.

Deletionofoutlierdataisacontroversialpractice.

DataPrepOutliers

SpiderFinancialCorp,2012

S-ar putea să vă placă și