0 Voturi pozitive0 Voturi negative

132 (de) vizualizări5 paginiDec 11, 2011

© Attribution Non-Commercial (BY-NC)

PDF, TXT sau citiți online pe Scribd

Attribution Non-Commercial (BY-NC)

132 (de) vizualizări

Attribution Non-Commercial (BY-NC)

- Applied probability and statistics
- Outliers Detection kdd03
- Stat Homework 2
- Biostat Prelim Reviewer
- Semivariograma Robusto.
- Edu 2015 03 p Syllabus
- Alamgir
- DP_MATH2349 Semester 2, 2019
- Ece
- Statistical Models help in Simulation
- data and observations less
- 4th Qtr- Feb 27, 2019- Day 2 - Measures of Spread- Outlier and Box Plot
- Linear Regression and Tire Correlation
- 11 1487228700_16-02-2017.pdf
- IJCTT-V3I2P118
- Overview of the Canadian Proenvironnemental Index Construction Method
- Malini 2017
- Untitled
- Daily Lesson Plan Finale
- Chap 12

Sunteți pe pagina 1din 5

A MULTIVARIATE OUTLIER DETECTION METHOD

P. Filzmoser Department of Statistics and Probability Theory Vienna, AUSTRIA e-mail: P.Filzmoser@tuwien.ac.at

Abstract

A method for the detection of multivariate outliers is proposed which accounts for the data structure and sample size. The cut-oﬀ value for identifying outliers is deﬁned by a measure of deviation of the empirical distribution function of the robust Mahalanobis distance from the theoretical distribution function. The method is easy to implement and fast to compute.

1

Introduction

Outlier detection belongs to the most important tasks in data analysis. The outliers describe the abnormal data behavior, i.e. data which are deviating from the natural data variability. Often outliers are of primary interest, for example in geochemical exploration they are indications for mineral deposits. The cut-oﬀ value or threshold which divides anomalous and non-anomalous data numerically is often the basis for important decisions. Many methods have been proposed for univariate outlier detection. They are based on (robust) estimation of location and scatter, or on quantiles of the data. A major disadvantage is that these rules are independent from the sample size. Moreover, by deﬁnition of most rules (e.g. mean ±2· scatter) outliers are identiﬁed even for “clean” data, or at least no distinction is made between outliers and extremes of a distribution. The basis for multivariate outlier detection is the Mahalanobis distance. The stan- dard method for multivariate outlier detection is robust estimation of the parameters in the Mahalanobis distance and the comparison with a critical value of the χ ^{2} distribu- tion (Rousseeuw and Van Zomeren, 1990). However, also values larger than this critical value are not necessarily outliers, they could still belong to the data distribution. In order to distinguish between extremes of a distribution and outliers, Garrett (1989) introduced the χ ^{2} plot, which draws the empirical distribution function of the robust Mahalanobis distances against the χ ^{2} distribution. A break in the tail of the distributions is an indication for outliers, and values beyond this break are iteratively deleted. The approach of Garrett (1989) needs a lot of interaction of the analyst with the data since this method is not an automatic procedure. We propose a method which computes the outlier threshold adaptively from the data. By investigation of the tails of the diﬀerence between the empirical and a hypothetical distribution function we ﬁnd an adaptive threshold value which increases with sample size if there are no outliers and which is bounded in presence of outliers.

1

2

Methods for Multivariate Outlier Detection

The shape and size of multivariate data are quantiﬁed by the covariance matrix. A

well-known distance measure which takes into account the covariance matrix is the

Mahalanobis distance. For a p-dimensional multivariate sample x |
, n) the |
|||

MD |
for |
i = 1, |
, n |
(1) |

where t is the estimated multivariate location and C the estimated covariance matrix. Usually, t is the multivariate arithmetic mean, and C is the sample covariance matrix. For multivariate normally distributed data the values are approximately chi-square distributed with p degrees of freedom (χ _{p} ). Multivariate outliers can now simply be deﬁned as observations having a large (squared) Mahalanobis distance. For this pur- pose, a quantile of the chi-squared distribution (e.g., the 97.5% quantile) could be considered. However, this approach has several shortcomings. The Mahalanobis dis- tances need to be estimated by a robust procedure in order to provide reliable measures for the recognition of outliers. Single extreme observations, or groups of observations, departing from the main data structure can have a severe inﬂuence to this distance measure, because both location and covariance are usually estimated in a non-robust manner. Many robust estimators for location and covariance have been introduced in the literature. The minimum covariance determinant (MCD) estimator is probably most frequently used in practice, partly because a computationally fast algorithm is available (Rousseeuw and Van Driessen, 1999). Using robust estimators of location and scatter in formula (1) leads to so-called robust distances (RD). Rousseeuw and Van Zomeren (1990) used these RDs for multivariate outlier detection. If the squared RD for an observation is larger than, say, χ _{p}_{;}_{0}_{.}_{9}_{7}_{5} , it can be declared a candidate outlier. This approach, however has shortcomings: It does not account for the sample size n of the data, and, independently from the data structure, observations could be ﬂagged as outliers even it they belong to the data distribution. A better procedure than using a ﬁxed threshold is to adjust the threshold to the data set at hand. Garrett (1989) used the chi-square plot for this purpose, by plotting the squared Mahalanobis distances (which have to be computed at the basis of robust estimations of location and scatter) against the quantiles of χ _{p} , the most extreme points are deleted until the remaining points follow a straight line. The deleted points are the identiﬁed outliers. This method, however, is not automatic, it needs user interaction and experience on the part of the analyst. Moreover, especially for large data sets, it can be time consuming, and also to some extent it is subjective. In the next section a procedure that does not require analyst intervention, is reproducible and therefore objective, into consideration is introduced.

2

2

2

3 An Adaptive Method

The chi-square plot is useful for visualizing the deviation of the data distribution from multivariate normality in the tails. This principle is used in the following. Let G _{n} (u)

2

i 2 , and let

G(u) be the distribution function of χ _{p} . For multivariate normally distributed samples, G _{n} converges to G. Therefore the tails of G _{n} and G can be compared to detect outliers. The tails will be deﬁned by δ = χ _{p}_{;}_{1}_{−}_{α} for a certain small α (e.g., α = 0.025), and

denote the empirical distribution function of the squared robust distances RD

2

2

p _{n} (δ) = sup ^{} G(u) − G _{n} (u) ^{} ^{+}

u≥δ

is considered, where “+” indicates the positive diﬀerences. In this way, p _{n} (δ) measures the departure of the empirical from the theoretical distribution only in the tails, deﬁned by the value of δ. p _{n} (δ) can be considered as a measure of outliers in the sample. Gervini (2003) used this idea as a reweighting step for the robust estimation of multivariate location and scatter. The eﬃciency of the estimator could be improved considerably. p _{n} (δ) will not be directly used as a measure of outliers. The threshold should be inﬁnity in case of multivariate normally distributed data, i.e. extreme values or values from the same distribution should not be declared as outliers. Therefore a critical value p _{c}_{r}_{i}_{t} is introduced, which helps to distinguish between outliers and extremes. The measure of outliers in the sample is then deﬁned as

α _{n} (δ) =

0 |
if |
p |

p |
if |
p |

The threshold value which will be called adjusted quantile is then determined as c _{n} (δ) =

G

can be derived by simulation, and the result is approximately

(1 − α _{n} (δ)). The critical value for distinguishing between outliers and extremes

−1

n

_{p} crit _{(}_{δ}_{,} _{n}_{,} _{p}_{)} _{=} 0.24 − 0.003 · p

^{√} _{n}

for

δ = χ

p,0.975 2 and p ≤ 10

^{a}^{n}^{d}

_{p} crit _{(}_{δ}_{,} _{n}_{,} _{p}_{)} _{=} 0.252 − 0.0018 · p

^{√} _{n}

for

δ = χ

p,0.975 2 and p > 10

(see Filzmoser, Reimann, and Garrett, 2003).

4

Example

We simulate a data set in two dimensions in order to simplify the graphical visual- ization. 85 data points follow a bivariate standard normal distribution. Multivariate outliers are introduced by 15 points coming from a bivariate normal distribution with mean (2, 2) ^{T} and covariance matrix diag(1/10, 1/10). The data are presented in Figure 1. The MCD estimator is applied and the robust distances are computed. Figure 2 shows in more detail how the adaptive outlier detection method works. The plot shows the empirical distribution function of the ordered squared distances (points) and the theoretical distribution function of χ ^{2} _{2} (solid line). The dotted line χ _{2}_{;}_{0}_{.}_{9}_{7}_{5} ^{2} identiﬁes only four outliers which are presented in the left part of Figure 3 as dark points. The adjusted quantile gives a more realistic rule, and the identiﬁed outliers are shown in the right part of Figure 3.

3

3

2

1

0

−1

−2

−3

1.0

0.8

Cumulative probability

0.6

0.4

0.2

0.0

−3

−2

−1

0

1

2

3

Figure 1: Simulated bivariate data with outliers shown as dark points.

0

2

4

6

8

10

Adjusted quantile

97.5% quantile

Ordered squared robust distance

Figure 2: Empirical distribution function of the ordered squared distances (points) and theoretical distribution function of χ _{2} ^{2} (solid line).

4

3

3

2

2

1

1

0

0

−1

−1

−2

−2

−3

−3

Outliers based on 97.5% quantile

−3

−2

−1

0

1

2

3

Outliers based on adaptive method

−3

−2

−1

0

1

2

3

^{2}

Figure 3: Resulting outliers (dark points) according to χ _{2}_{;}_{0}_{.}_{9}_{7}_{5} quantile (right).

(left) and the adaptive

5

Summary

An automated method to identify outliers in multivariate space was developed. The method compares the diﬀerence between the empirical distribution of the squared ro- bust distances and the distribution function of the chi-square distribution. The method accounts not only for diﬀerent dimension of the data but also for diﬀerent sample size.

References

[1] Filzmoser P., Reimann C., Garrett R.G. (2003). Multivariate outlier detection in exploration geochemistry. Technical report TS 03-5, Department of Statistics, Vienna University of Technology, Austria. Dec. 2003.

[2] Garrett R.G. (1989). The chi-square plot: A tool for multivariate outlier recogni- tion. Journal of Geochemical Exploration. Vol. 32, pp. 319-341.

[3] Gervini D. (2003). A robust and eﬃcient adaptive reweighted estimator of multi- variate location and scatter. Journal of Multivariate Analysis. Vol. 84, pp. 116-144.

[4] Rousseeuw P.J., Van Driessen K. (1999). A fast algorithm for the minimum co- variance determinant estimator. Technometrics. Vol. 41, pp. 212-223.

[5] Rousseeuw P.J., Van Zomeren B.C. (1990). Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association. Vol. 85(411), pp. 633-651.

5

- Applied probability and statisticsÎncărcat deRevathi Ram
- Outliers Detection kdd03Încărcat deWalid_Sassi_Tun
- Stat Homework 2Încărcat defuturedivision
- Biostat Prelim ReviewerÎncărcat deSamantha Cruz Blanco
- Semivariograma Robusto.Încărcat desebas101125
- Edu 2015 03 p SyllabusÎncărcat deRitzuchan001
- AlamgirÎncărcat deEcaterina Draghici
- DP_MATH2349 Semester 2, 2019Încărcat deSujay Kamal
- EceÎncărcat desrijha9
- Statistical Models help in SimulationÎncărcat deMohith Reddy
- data and observations lessÎncărcat deapi-237919685
- 4th Qtr- Feb 27, 2019- Day 2 - Measures of Spread- Outlier and Box PlotÎncărcat deLen Anastacio
- Linear Regression and Tire CorrelationÎncărcat deflavio82pn
- 11 1487228700_16-02-2017.pdfÎncărcat deEditor IJRITCC
- IJCTT-V3I2P118Încărcat desurendiran123
- Overview of the Canadian Proenvironnemental Index Construction MethodÎncărcat deRiekoDita
- Malini 2017Încărcat depoojithaaalam
- UntitledÎncărcat deapi-222199954
- Daily Lesson Plan FinaleÎncărcat deAsYraf ForMer
- Chap 12Încărcat deVirender Singh
- Bi VariateÎncărcat deIera Ahmad
- BaSSy FrequenciesÎncărcat deSejahtera Bangsaku
- Normal DistributionÎncărcat delalravikant
- 1605.08373v1Încărcat deAditya Budi Fauzi
- Chapter 10Încărcat deMayank Saini
- Computing Binomial Probabilities With MinitabÎncărcat dejehana_beth
- Introduction to ProbabilityÎncărcat desbgm
- Formula Sheet Final Exam Fall2011Încărcat deMiki Obmosc
- Frequency distribution in statistics.docxÎncărcat demichelle argenio
- Linard_2012-ATRF_Project Appraisal Under Uncertainty-Revised2012Încărcat deKeith Linard

- The Phi Accrual Failure DetectorÎncărcat dexotid
- P syllabusÎncărcat deDuden
- SyllabusÎncărcat deMohan Das
- SHS Core_Statistics and Probability CG.docÎncărcat deCarol Cargullo-Zamora
- estadisticaÎncărcat deISRAELQV
- DiscreteProbabilityDistribution (1)Încărcat deBen Soco
- Department of Elect4rical:Electronic Engineering - Details of SyllabusÎncărcat deDamilola Oseni
- Formulae List Mf9Încărcat deYi Wen
- gmphd_2colÎncărcat deFany Monterrubio Lozano
- Galton Board -- From Wolfram MathWorldÎncărcat deDražen Horvat
- PTÎncărcat deFarah Saad
- Computer Science and Engg PG CurriculumÎncărcat delakshmi
- Research methods in economics part II STAT (1).pptÎncărcat desimbiro
- Ruido - Señales Correlacionadas y DecorrelacionadasÎncărcat dekreatos
- # 6 Distribusi Probabilitas DiskritÎncărcat deAhmat Safa'at
- math mcq probabilityÎncărcat deAyushMitra
- Discrete and Continuous Probability DistributionsÎncărcat dejawad
- sim.pdfÎncărcat deHamit Aydın
- ch5-4710Încărcat deLeia Seungho
- Mixed-AS-S1-Exam-Questions.pdfÎncărcat deDan
- Testing 1,2,3Încărcat deViajes Bernal
- 1 Random Variables.9401953Încărcat deGowtham Raj
- A Framework for Positive DependenceÎncărcat deMahfudhotin
- 9780521190176Încărcat deHabib Mrad
- Analyzing Data PrismÎncărcat deShannon Smith
- Problem Sheet 4Încărcat depratyush mishra
- MTI.pdfÎncărcat desayanto
- 10.1.1.2.42Încărcat deAshok Reddy
- Risk AstaÎncărcat deMauricio Bernal
- 4_Prof_Li Wind Directionality Effects on Design Wind Pres.pdfÎncărcat deSatosh Kapdi

## Mult mai mult decât documente.

Descoperiți tot ce are Scribd de oferit, inclusiv cărți și cărți audio de la editori majori.

Anulați oricând.