Kay I. Penny
Stable URL:
http://links.jstor.org/sici?sici=0035-9254%281996%2945%3A1%3C73%3AACVWTF%3E2.0.CO%3B2-D
Appl. Statist. (1996)
45, No. 1, pp. 73-81
SUMMARY
The Mahalanobis distance is a well-known criterion which may be used for detecting outliers in multivariate
data. However, there are some discrepancies about which critical values are suitable for this purpose.
Following a comparison with Wilks's method, this paper shows that the previously recommended
$\{p(n-1)/(n-p)\}F_{p,n-p}$ are unsuitable, and $p(n-1)^2F_{p,n-p-1}/\{n(n-p-1+pF_{p,n-p-1})\}$ are the correct
critical values when searching for a single outlier. The importance of which critical values should be used is
illustrated when searching for a single outlier in a clinical laboratory data set containing 10 patients and five
variables. The jackknifed Mahalanobis distance is also discussed and the relevant critical values are given.
Finally, upper bounds for the usual Mahalanobis distance and the jackknifed version are discussed.
Keywords: Critical values; Jackknifed Mahalanobis distance; Mahalanobis distance; Multivariate outliers
1. Introduction
The methods discussed in this paper are illustrated on a small clinical laboratory data
set. This data set, which is reproduced in Table 1, consists of five liver function
variables for each of 10 patients. The variables contained in the second to sixth
columns of the table are alkaline phosphatase, bilirubin, gamma glutamyl transpeptidase,
aspartate aminotransferase and alanine aminotransferase respectively.
The Mahalanobis distance $D_i^2$ is suggested in many texts as a method for detecting
outliers in multivariate data. For each of the $n$ observations in a $p$-variable data set, a
distance value $D_i^2$ is calculated. Let $\bar{x}$ be the sample mean vector and let $S$ be the
sample covariance matrix,
$$S = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(x_i-\bar{x})^T.$$
Then
$$D_i^2 = (x_i-\bar{x})^T S^{-1}(x_i-\bar{x}).$$
†Address for correspondence: Department of Public Health, Medical School, Polwarth Building, University of
Aberdeen, Foresterhill, Aberdeen, AB9 2ZD, UK.
Critical values for $D_i^2$ are commonly taken from the $\chi^2_p$-distribution, since
$$(x-\mu)^T\Sigma^{-1}(x-\mu) \sim \chi^2_p,$$
where $x$ has a multivariate normal distribution with mean $\mu$ and dispersion matrix $\Sigma$.
In practice, we estimate the parameters $\mu$ and $\Sigma$ with the sample mean vector $\bar{x}$ and
the sample covariance matrix $S$. Replacing $\Sigma$ with $S$ gives
$$(x-\mu)^T S^{-1}(x-\mu) \sim \frac{p(n-1)}{n-p}F_{p,n-p},$$
which follows from Mardia et al. (1979), section 3.5. However, this result assumes
that $x$ is independent of $S$, which is not true as $x$ is one of the observations used to
calculate $S$. Further complications arise when $\mu$ is replaced by $\bar{x}$ to give $D_i^2$, as $x$ is not
independent of $\bar{x}$.
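As a minimal sketch of the calculation (not the author's code), the squared distances $D_i^2 = (x_i-\bar{x})^T S^{-1}(x_i-\bar{x})$ can be computed in plain Python as below. The function names and the small two-variable data set are hypothetical, and $S$ is assumed to use the divisor $n-1$.

```python
def mean_vector(rows):
    """Column means of a list of equal-length observation rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance_matrix(rows, mean):
    """Sample covariance matrix S with divisor n - 1 (an assumption here)."""
    n, p = len(rows), len(mean)
    s = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [x - m for x, m in zip(row, mean)]
        for j in range(p):
            for k in range(p):
                s[j][k] += d[j] * d[k] / (n - 1)
    return s


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    p = len(b)
    aug = [list(a[j]) + [b[j]] for j in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, p):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, p + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * p
    for j in reversed(range(p)):
        tail = sum(aug[j][k] * z[k] for k in range(j + 1, p))
        z[j] = (aug[j][p] - tail) / aug[j][j]
    return z


def squared_mahalanobis(rows):
    """Return D_i^2 for every observation, using the full-sample mean and S."""
    mean = mean_vector(rows)
    s = covariance_matrix(rows, mean)
    distances = []
    for row in rows:
        d = [x - m for x, m in zip(row, mean)]
        z = solve(s, d)  # z = S^{-1} (x_i - mean)
        distances.append(sum(dj * zj for dj, zj in zip(d, z)))
    return distances


# Hypothetical data: 6 observations on 2 variables, the last one unusual.
data = [[1.0, 2.0], [2.0, 1.5], [3.0, 3.5], [4.0, 3.0], [5.0, 5.5], [9.0, 1.0]]
d2 = squared_mahalanobis(data)
```

A useful sanity check on any such implementation is that the $n$ squared distances sum to $p(n-1)$ when $S$ uses the divisor $n-1$.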
It is commonly suggested that an F-distribution is more appropriate than a $\chi^2$-distribution
especially when dealing with small sample sizes. However, in practice
$\{p(n-1)/(n-p)\}F_{p,n-p}$, as we show below, is inappropriate when testing for
multivariate outliers in small samples.
Squared Mahalanobis distances have been calculated for each of the 10 patients
and are displayed in the penultimate column in Table 1. A comparison of the
$D_i^2$-values with the two critical values mentioned above incorrectly suggests that
none of the patients are significantly outlying (i.e. at the 5% level, $\chi^2_{p;\alpha/n} = 16.75$ and
$\{p(n-1)/(n-p)\}F_{p,n-p;\alpha/n} = 134.46$). These critical values are calculated by using
Bonferroni bounds, which implies that the tests are rather cautious. However, this
may be regarded as a suitable approach in the context of identifying an outlier.
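The F-based value of 134.46 can be reproduced with a short calculation. This is a sketch only: the function name is hypothetical, and the quantile 14.94 for the upper 0.5% point of $F_{5,5}$ is an assumed input taken from standard F tables rather than from the paper.

```python
def f_based_critical_value(p, n, f_quantile):
    """Previously recommended critical value {p(n - 1)/(n - p)} * F_{p, n-p}."""
    return p * (n - 1) / (n - p) * f_quantile


n, p, alpha = 10, 5, 0.05
tail_area = alpha / n   # Bonferroni adjustment: 0.05 / 10 = 0.005
f_quantile = 14.94      # assumed upper 0.5% point of F(5, 5), from tables
critical = f_based_critical_value(p, n, f_quantile)
print(round(critical, 2))  # 134.46
```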
In Section 2, a comparison between $D_i^2$ and Wilks's (1963) method for multivariate
outlier detection is made. This leads to the derivation of appropriate critical values,
CRITICAL VALUES WHEN TESTING FOR OUTLIER 75
which are confirmed by simulations. In Section 3, the jackknifed Mahalanobis
distance is discussed, and appropriate critical values are given. In Section 4 upper
bounds for both the jackknifed and the usual Mahalanobis distance are discussed,
and the paper concludes with a discussion in Section 5.
where
Critical values for the use of R1 as an outlier test are approximated by using
Bonferroni bounds obtained from the lower $100\alpha/n\%$ points of the above beta
distribution, or equivalently the upper $100\alpha/n\%$ points of the corresponding $F_{p,n-p-1}$-distribution,
as shown in Barnett and Lewis (1984), section 9.3, or Krzanowski (1988), section 8.3.
Wilks's critical values derived from the beta distribution can be converted by using
equation (2) to upper critical values for use with the squared Mahalanobis distance:
$$\frac{p(n-1)^2F_{p,n-p-1;\alpha/n}}{n(n-p-1+pF_{p,n-p-1;\alpha/n})}$$
are the appropriate critical values when searching for a single outlier, whereas
$\{p(n-1)/(n-p)\}F_{p,n-p;\alpha/n}$ are incorrect.
TABLE 2
Critical values for $D_i^2$
TABLE 3
Critical values for $D_i^2$ found by simulation
Returning to the data example described in Section 1, patient 1 is found to be
outlying at the 1% significance level when using the correct critical values for $D_i^2$, i.e.
where $\bar{x}$ and $S^{-1}$ are calculated from a sample of size $n$, and $x_i$ is a further observation
independent of $\bar{x}$ and $S$.
When calculating $D_{(i)}^2$, $x_i$ is a further observation, separate from the $n-1$ observations used
in deriving $\bar{x}_{(i)}$ and $S_{(i)}$; therefore $x_i$ is independent of $\bar{x}_{(i)}$ and $S_{(i)}$. Hence, it follows
directly from expression (4) that
$$D_{(i)}^2 \sim \frac{pn(n-2)}{(n-1)(n-p-1)}F_{p,n-p-1}.$$
Some critical values based on the above F-distribution are given in Table 4.
Jackknifed Mahalanobis distances were calculated for 100 000 SMVN data sets,
and critical values were obtained by using method 1 described in Section 2. The
results are given alongside the theoretical critical values in Table 4. The critical values
obtained by theory and by simulation are very similar.
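A direct, if inefficient, way to obtain jackknifed distances is to recompute the mean vector and covariance matrix with each observation left out in turn. The sketch below does this in plain Python. The function names and the small data set are hypothetical, and the covariance divisor (one less than the number of retained observations) is an assumption.

```python
def mean_and_cov(rows):
    """Mean vector and sample covariance matrix (divisor m - 1) of m rows."""
    m, p = len(rows), len(rows[0])
    mean = [sum(col) / m for col in zip(*rows)]
    cov = [[sum((r[j] - mean[j]) * (r[k] - mean[k]) for r in rows) / (m - 1)
            for k in range(p)] for j in range(p)]
    return mean, cov


def quadratic_form(cov, d):
    """Return d' cov^{-1} d via Gaussian elimination with partial pivoting."""
    p = len(d)
    aug = [list(cov[j]) + [d[j]] for j in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, p):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, p + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * p
    for j in reversed(range(p)):
        tail = sum(aug[j][k] * z[k] for k in range(j + 1, p))
        z[j] = (aug[j][p] - tail) / aug[j][j]
    return sum(dj * zj for dj, zj in zip(d, z))


def usual_distances(rows):
    """D_i^2 using the mean and covariance of all n observations."""
    mean, cov = mean_and_cov(rows)
    return [quadratic_form(cov, [x - m for x, m in zip(r, mean)]) for r in rows]


def jackknifed_distances(rows):
    """D_(i)^2 using the mean and covariance of the other n - 1 observations."""
    out = []
    for i, r in enumerate(rows):
        rest = rows[:i] + rows[i + 1:]  # drop observation i
        mean, cov = mean_and_cov(rest)
        out.append(quadratic_form(cov, [x - m for x, m in zip(r, mean)]))
    return out


# Hypothetical data: 6 observations on 2 variables, the last one unusual.
data = [[1.0, 2.0], [2.0, 1.5], [3.0, 3.5], [4.0, 3.0], [5.0, 5.5], [9.0, 1.0]]
usual = usual_distances(data)
jack = jackknifed_distances(data)
```

Each jackknifed distance is an increasing function of the corresponding usual distance, so the two versions rank the observations identically; they differ in scale and in the critical values that apply.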
TABLE 4
Critical values for $D_{(i)}^2$
Jackknifed Mahalanobis distances for each of the 10 patients in the example
described in Section 1 are given in the last column of Table 1. The critical values for
$D_{(i)}^2$ also find patient 1 to be an outlier at the 1% significance level, i.e.
The smaller the scatter ratio, the more outlying the observation is. Hence, to find a
lower bound for $R_i$ let $R_i \to 0$; then
$$D_i^2 \to \frac{(n-1)^2}{n}$$
for $n-p-1>0$.
Simulations of 10000 SMVN data sets confirm these results (Table 5). Note that
this upper bound is achieved when p = n - 2 but is not a tight bound for p < n - 2.
A comparison with Table 2 shows that in many cases the incorrect critical values
calculated from $\{p(n-1)/(n-p)\}F_{p,n-p;\alpha/n}$ substantially exceed these upper bounds,
whereas the critical values calculated from
$$\frac{p(n-1)^2F_{p,n-p-1;\alpha/n}}{n(n-p-1+pF_{p,n-p-1;\alpha/n})}$$
are necessarily within these bounds.
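To make the comparison concrete, the sketch below evaluates both formulae for the example dimensions $n = 10$ and $p = 5$, together with the limit of the Wilks-based expression as the F quantile grows, which is $(n-1)^2/n$. Function names are hypothetical, and the F quantiles are assumed inputs that would normally come from tables or software; no entry of Table 2 is reproduced here.

```python
def corrected_critical_value(p, n, f_quantile):
    """p(n-1)^2 F / [n(n-p-1+pF)], F being an upper point of F_{p, n-p-1}."""
    return p * (n - 1) ** 2 * f_quantile / (n * (n - p - 1 + p * f_quantile))


def f_based_critical_value(p, n, f_quantile):
    """{p(n-1)/(n-p)} F, F being an upper point of F_{p, n-p}."""
    return p * (n - 1) / (n - p) * f_quantile


def limiting_bound(n):
    """Limit of corrected_critical_value as the F quantile grows without bound."""
    return (n - 1) ** 2 / n


n, p = 10, 5
bound = limiting_bound(n)  # 81 / 10 = 8.1

# The corrected value stays below the bound for any positive F quantile.
trial_quantiles = [1.0, 10.0, 100.0, 1e6]
all_below = all(corrected_critical_value(p, n, f) < bound for f in trial_quantiles)

# The F-based value is linear in F and so is not bounded; 14.94 is an assumed
# table value for the upper 0.5% point of F(5, 5).
f_based_exceeds = f_based_critical_value(p, n, 14.94) > bound
print(all_below, f_based_exceeds)  # True True
```

Because the corrected expression is increasing in the F quantile and converges to the bound, it can never produce a critical value that no observation could attain, whereas the F-based expression can.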
$D_{(i)}^2$ is a monotone increasing function of $F_{p,n-p-1}$. Hence, as $F_{p,n-p-1} \to \infty$, then
5. Discussion
The example described in Section 1 illustrates the importance of using the
appropriate critical values when using the Mahalanobis distance, as patient 1 goes
TABLE 5
Upper bounds for $D_i^2$ found by simulation
are found to be the appropriate critical values for testing for a single outlier by using
$D_i^2$. Likewise,
Acknowledgements
I wish to thank Ian Jolliffe for the stimulating discussions while this work was in
progress, and also the Editor and referees for their constructive comments and
suggestions. The author is supported by a Co-operative Award in Science and
Engineering studentship funded by the Science and Engineering Research Council,
and Boots Pharmaceuticals, Nottingham.
References
Atkinson, A. C. and Mulira, H.-M. (1993) The stalactite plot for the detection of multivariate outliers.
Statist. Comput., 3, 27-35.
Barnett, V. and Lewis, T. (1984) Outliers in Statistical Data, 2nd edn. New York: Wiley.
Caroni, C. and Prescott, P. (1992) Sequential application of Wilks's multivariate outlier test. Appl.