Where do outliers come from, why do we care about them, and what
can we do if we have them?
Outliers can stem from a variety of sources. Some of the most common
sources are:
1. Purely by chance. In normal samples, about one out of every 20 observations can be expected to lie more than 2 standard deviations
from the mean, and about one in a hundred more than 2.5 standard deviations
from the mean. The presence of an outlier does not necessarily mean that
there is a problem.
2. Failure of the data generating process. In this instance, outlier identification may be of extreme importance in isolating a breakdown in an
experimental process.
3. A subject being measured may not be homogeneous with the other
subjects. In this case, we would say the sample comes from a contaminated distribution, i.e., the sample we are observing has been
contaminated with observations which do not belong.
4. Failure of the measuring instrument, whether mechanical or human.
5. Error in recording the measurement, for example, on the data sheet
or log, or in computer entry.
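The "one in 20" and "one in a hundred" figures in item 1 follow directly from the normal tail probabilities; a quick check using only the Python standard library (the function name is mine):

```python
from math import erfc, sqrt

def prob_beyond(z: float) -> float:
    """Two-sided probability that a standard normal deviates more than z SDs."""
    return erfc(z / sqrt(2.0))

# roughly 1 observation in 22 lies beyond 2 SDs, 1 in 81 beyond 2.5 SDs
print(round(1 / prob_beyond(2.0)))   # 22
print(round(1 / prob_beyond(2.5)))   # 81
```

So in a perfectly normal sample of n = 100, four or five observations beyond 2 standard deviations would be unremarkable.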
David and Paulson (1965) summarized why outliers should be identified:
(1) to screen the data for further analysis;
(2) to flag problems in the data generation process (see (2) above);
(3) the extreme points may be the only points of real interest.
What do we do if we have outliers? If possible, we should try to identify the reason an observation is so extreme. When an observation is identified as an outlier, the first thing to be done is to check the accuracy of the data recording and entry. This is
usually, but not always, the easiest data verification that can be made. For
example, Gibbons and McDonald (1980a, 1980b) performed some diagnostics on the regression of mortality rates on socioeconomic and air pollution
data for some Standard Metropolitan Statistical Areas (SMSA's), as previously reported by Lave and Seskin (1970, 1977). SMSA's which had high
influence on one or more coefficients in the resulting regression were identified. A check of the socioeconomic data from an easily obtainable source
(U.S. Bureau of the Census, 1973) showed that some of the data for one of
the outlier SMSA's (Providence, Rhode Island) had had the decimal point
transposed one place to the left.
If an identified outlier has been caused by a data recording error, the
value should be corrected and the analysis can proceed. If it is determined
that the data have been recorded correctly, investigation into other possible
causes of the extreme value should be undertaken.
For certain outlier tests, it is known (or assumed) that the number of
possible outliers in a sample is known a priori; in addition, some of the
tests also assume knowledge of the direction of the outlying observation(s).
In this section we describe outlier tests of these two types, which consist
mainly of Grubbs' and Dixon's tests and extensions thereof.
When two or more outliers occur in a data sample, a test for a single
outlier may fail to detect any of them, since the other outlier(s) inflate the
variance of the sample (in the case of tests like Grubbs' test or the range
test) or shrink the numerator (in the case of Dixon's test), thereby "masking"
the detection of the largest outlier. For this reason, it is also necessary to
consider testing for more than one outlier. On the other hand, if k > 1
outliers are tested for and there are less than k outliers, one of two things
may occur:
(1) The test may reject the null hypothesis of no outliers because
of the large influence the true outliers have on the test statistic, thereby
identifying more outliers than there really are. This effect is known as
swamping.
(2) The true extreme values may not have enough influence to attain a
significant test statistic, so no outliers are identified. This can be thought
of as reverse masking.
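The masking effect is easy to reproduce numerically: a second large value inflates s and shrinks the single-outlier statistic. A sketch with made-up data (the helper `grubbs_T` is my own):

```python
from statistics import mean, stdev

def grubbs_T(x):
    """Two-sided Grubbs statistic: largest absolute deviation over the sample SD."""
    xbar, s = mean(x), stdev(x)
    return max(abs(v - xbar) for v in x) / s

clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
one_out = clean + [15.0]          # one clear outlier
two_out = clean + [15.0, 15.2]    # a second outlier inflates s and masks both

print(round(grubbs_T(one_out), 2))  # 2.65
print(round(grubbs_T(two_out), 2))  # 1.94
```

Adding the second extreme value *lowers* the statistic, even though the sample is now more contaminated.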
Grubbs' (1950) test is perhaps the most widely known and intuitively obvious test for identifying a single outlier in a sample of n observations. It also
seems to have high potential as a test for normality against other types of
alternatives. The test statistics are

$$T_1 = \frac{\bar{x} - x_{(1)}}{s} \qquad \text{and} \qquad T_n = \frac{x_{(n)} - \bar{x}}{s},$$

with the two-sided statistic $T = \max(T_1, T_n) = \max_i |x_i - \bar{x}|/s$, where $\bar{x}$ and $s$ are the sample mean and standard deviation, respectively.
In the event that the direction of the possible outlier is known, $T_1$ would
be used for identifying an extreme low observation, and $T_n$ would be used
to identify an extreme high observation (Grubbs, 1950; 1969). In each case
the test statistic is compared to the upper a percentile of the distribution,
with large values of the test statistic indicating the presence of an outlier.
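Computing $T_1$, $T_n$, and the two-sided statistic is straightforward; a minimal sketch (the function name is mine; s uses the usual n − 1 divisor):

```python
from statistics import mean, stdev

def grubbs_stats(x):
    """Return (T1, Tn, T): lower-tail, upper-tail, and two-sided Grubbs statistics."""
    xbar, s = mean(x), stdev(x)   # s uses the n-1 divisor
    T1 = (xbar - min(x)) / s      # tests an extreme low observation
    Tn = (max(x) - xbar) / s      # tests an extreme high observation
    return T1, Tn, max(T1, Tn)

T1, Tn, T = grubbs_stats([1, 2, 3, 4, 10])
print(round(Tn, 3))   # 1.697
```

Each statistic would then be compared to the upper α percentile of its null distribution (Table B8).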
Grubbs (1950) showed that $T_n$ was identical to the ratio of the sums of
squared differences with and without the suspected outlier,

$$\frac{S_n^2}{S^2} = 1 - \frac{n\,T_n^2}{(n-1)^2}, \qquad S_n^2 = \sum_{i=1}^{n-1} (x_{(i)} - \bar{x}_n)^2, \quad S^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2,$$

where $\bar{x}_n$ is the mean of the sample excluding $x_{(n)}$. The analogous test
statistic for testing for a lower-tail outlier is $S_1^2/S^2$, with $S_1^2 = \sum_{i=2}^{n} (x_{(i)} - \bar{x}_1)^2$, where $\bar{x}_1$ is the mean excluding $x_{(1)}$.
Figure 6.1 Q-Q plot of average annual erosion rates (m/yr) for 13 East
Coast states.
so that T = 2.74, which is just significant at the 0.01 level.
Uthoff (1970) showed that $T_1$ and $T_n$ were the MPLSI goodness of fit
tests for left and right exponential alternatives to the normal distribution,
respectively (Section 4.2.4). Surprisingly, Box (1953) suggested that T is
"like" the LRT for a uniform alternative to a normal distribution, implying
that it is also a powerful test against short-tailed distributions. Whereas
the test statistic ($T_1$, $T_n$ or $T$, as appropriate) is compared to the upper critical value of the null distribution when testing for an outlier, testing
for a short-tailed or skewed distribution requires the use of the lower tail
of the null distribution of the test statistic.
Critical values for $T$, $T_1$ and $T_n$ (Grubbs and Beck, 1972) are given in
Table B8.
6.2.2 Dixon Tests for a Single Outlier
Dixon's test statistic for a single upper-tail outlier (Dixon, 1950; 1951) is
the ratio of the distance between the two largest observations to the range,

$$r_{10} = \frac{x_{(n)} - x_{(n-1)}}{x_{(n)} - x_{(1)}}.$$
Example 6.2. For the 15 height differences between cross- and self-fertilized Zea mays plants (Data Set 2), two lower values may be
outliers. As a test for a single lower-tail outlier,

$$r'_{10} = \frac{-48 - (-67)}{75 - (-67)} = 0.134,$$

which is not significant, so no single lower-tail outlier is identified.

Example 6.3. For the same data, the statistic based on the third
order statistic is

$$r'_{20} = \frac{14 - (-67)}{75 - (-67)} = 0.570,$$

which well exceeds the 1% critical value of 0.523; therefore, we
have identified x = -67 as a lower-tail outlier.
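Since the ratios use only a few order statistics, the arithmetic in Examples 6.2 and 6.3 can be checked directly. A sketch (function names are mine; only the order statistics of Data Set 2 that enter the ratios are listed, so intermediate values are omitted):

```python
def dixon_r10_low(x):
    """r'10: gap between the two smallest observations over the range."""
    x = sorted(x)
    return (x[1] - x[0]) / (x[-1] - x[0])

def dixon_r20_low(x):
    """r'20: gap between the third smallest and the smallest over the range."""
    x = sorted(x)
    return (x[2] - x[0]) / (x[-1] - x[0])

# the order statistics of Data Set 2 that enter the two ratios
x = [-67, -48, 14, 75]
print(round(dixon_r10_low(x), 3))   # 0.134
print(round(dixon_r20_low(x), 3))   # 0.570
```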
The sequential use of $r'_{10}$ and $r'_{20}$ in Examples 6.2 and 6.3 is indicative
of a masking effect, and hence there is the possibility of two outliers in
this sample. However, more formal sequential procedures for identifying
multiple outliers are given below (Section 6.4).
6.2.3 Range Test
The range test has been discussed earlier (Section 4.1.2, 4.2.2) as the LR
and MPLSI test for a uniform alternative. The range test statistic is

$$u = \frac{x_{(n)} - x_{(1)}}{s}.$$

One of the earlier suggested uses for u was as a test for outliers and as
an alternative to $b_2$ (David, Hartley and Pearson, 1954); for some
examples they showed that u was identifying samples with a single outlier
(in either direction) where $b_2$ was not rejecting normality. Alternatively,
Barnett and Lewis (1994) suggested the use of the range test as a test "of
a lower and upper outlier-pair $x_{(1)}, x_{(n)}$ in a normal sample".
While the use of u for detecting short-tailed alternatives requires comparison to the lower percentiles of the test statistic distribution, testing
for an outlier or upper/lower outlier pair requires comparison of u to the
upper α percentile of the distribution (Table B10).
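The statistic is simply the sample range standardized by s; a minimal sketch (function name mine):

```python
from statistics import stdev

def range_stat(x):
    """Range test statistic u = (x(n) - x(1)) / s."""
    return (max(x) - min(x)) / stdev(x)

print(round(range_stat([1, 2, 3, 4, 10]), 3))   # 2.546
```

Large values of u point to an outlier or outlier pair; small values point to a short-tailed alternative.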
6.2.4 Grubbs Test for k Outliers in a Specified Tail
Grubbs (1950) presented extensions of his test for a single outlier to a test
for exactly two outliers in a specified tail ($S^2_{n-1,n}/S^2$ and $S^2_{1,2}/S^2$ for two
upper and two lower outliers, respectively); this was subsequently extended
to a test for a specified number k, 1 < k < n, of outliers in a sample (Tietjen
and Moore, 1972). Here it is assumed that we know (or suspect we know)
how many observations are outliers as well as which of the tails contains the
outlying observations. The test, denoted $L_k$ (and its lower-tail analog
$L'_k$), is the ratio of the sum of squared deviations from the sample mean of
the non-outliers to the sum of squared deviations for the entire sample,

$$L_k = \frac{\sum_{i=1}^{n-k} (x_{(i)} - \bar{x}_{n-k})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

where $\bar{x}_{n-k}$ is the mean of the lowest n - k order statistics. This is the test
for k upper-tail outliers. Similarly,
[Figure 6.2 Q-Q plot of leukemia latency period in months for 20 patients
following chemotherapy.]

$$L'_k = \frac{\sum_{i=k+1}^{n} (x_{(i)} - \bar{x}'_{n-k})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

is the test statistic for k lower-tail outliers, where $\bar{x}'_{n-k}$ is the mean of the
highest n - k order statistics. Obviously, $L_k$ and $L'_k$ are always less than
1; if the variance of the subsample is sufficiently smaller than that of the
full sample, then there is an indication that the suspected observations are
indeed outliers. Therefore, the test statistic is compared to the lower tail of
the null test distribution, and the null hypothesis of no outliers is rejected
if $L_k$ ($L'_k$) is smaller than the appropriate critical value (Table B18).
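The upper-tail statistic can be sketched directly from its definition (function name mine; small values of $L_k$ are significant):

```python
from statistics import mean

def grubbs_Lk(x, k):
    """L_k for k upper-tail outliers: reduced SS over full SS (small values reject)."""
    xs = sorted(x)
    kept = xs[: len(xs) - k]              # drop the k largest order statistics
    kbar, xbar = mean(kept), mean(xs)
    num = sum((v - kbar) ** 2 for v in kept)
    den = sum((v - xbar) ** 2 for v in xs)
    return num / den

print(grubbs_Lk([1, 2, 3, 4, 10], 1))   # 0.1
```

The lower-tail analog would instead keep the highest n − k order statistics.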
Example 6.4. Data on the latency period in months of leukemia
for 20 patients following chemotherapy (Data Set 9) may contain
3 large values (over 100 months), whereas the remaining cases
have no latency period greater than 72 (Figure 6.2). For this data
set,
$L_3 = S^2_{18,19,20}/S^2 = 6724/30335 = 0.222$
which is well below the 0.01 critical value of 0.300 for k = 3
and n = 20.
6.2.5 Grubbs Test for One Outlier in Each Tail
Grubbs (1950) also presented a test for identifying one outlier in each tail,
using the ratio

$$\frac{S_{1,n}^2}{S^2} = \frac{\sum_{i=2}^{n-1} (x_{(i)} - \bar{x}_{1,n})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

where $\bar{x}_{1,n}$ is the mean of the sample excluding both $x_{(1)}$ and $x_{(n)}$.
These residuals $r_i = x_i - \bar{x}$ are sorted by absolute value to obtain the order statistics of the $r_i$, and the
ordered observations $z_{(i)}$ are defined as the signed values of the $r_{(i)}$. The test
statistic for k outliers is then computed similarly to $L_k$, using the $z_{(i)}$,

$$E_k = \frac{\sum_{i=1}^{n-k} (z_{(i)} - \bar{z}_{n-k})^2}{\sum_{i=1}^{n} (z_{(i)} - \bar{z})^2}$$

where the $z_{(i)}$ are the deviations from the mean, ordered irrespective of sign, $\bar{z}$
is the mean of all of the $z_{(i)}$, and $\bar{z}_{n-k}$ is the mean of $z_{(1)}, z_{(2)}, \ldots, z_{(n-k)}$.

Since this test statistic is the ratio of a reduced sum of squares to a
full sum of squares, the comparison to determine significance is whether
$E_k$ is less than the appropriate critical value (Table B20).
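The ordering-by-absolute-value step, with signs retained, can be sketched as follows (function name mine):

```python
from statistics import mean

def tietjen_Ek(x, k):
    """E_k: reduced SS over full SS, deviations ordered by absolute value, signs kept."""
    xbar = mean(x)
    z = sorted((v - xbar for v in x), key=abs)   # z(1)..z(n), signed
    kept = z[: len(z) - k]                       # drop the k largest |deviations|
    kbar = mean(kept)
    num = sum((v - kbar) ** 2 for v in kept)
    den = sum((v - mean(z)) ** 2 for v in z)     # mean(z) is 0 up to rounding
    return num / den

print(tietjen_Ek([1, 2, 3, 4, 10], 1))   # 0.1
```

Because the z are deviations from the mean, the denominator is just the full sample sum of squares.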
Example 6.5. For the 31 observations in the July wind data (Data
Set 8) there is one observation below 5 mph (3.8) and one above
14 mph (17.1). These two observations happen to be the largest
deviations (in absolute value) from the sample mean of 9.01, with
$z_{(31)} = 8.09$ and $z_{(30)} = -5.21$. The full sample sum of squares of
z is 217.05, while the sum of squares based on the reduced sample
of 29 observations is 124.19. This results in the test statistic

$$E_2 = 124.19/217.05 = 0.572.$$

Therefore, the reduction in the sum of squares is not quite sufficient, at the 0.05 level, to identify the two largest observations as
simultaneous outliers, in comparison to the stated critical value of
0.568.
Figure 6.3 Box plot of average water well alkalinity for 58 wells.
Similar to sequential tests, however, tests for normal mixtures are not
as sensitive to outliers as tests with the correct number of outliers specified.
Normal mixtures and tests are described more fully in Chapter 11.
If all $t_i < \lambda_i(\beta)$, then no outliers are declared; if any of the $t_i > \lambda_i(\beta)$,
then m outliers are declared, where m is the largest value of i such that
$t_i > \lambda_i(\beta)$.
Since sequential methods are defined to give an overall a significance
level over every step in the procedure, sequential tests will not be as sensitive at picking out m outliers as a test designed for detecting exactly m
outliers. Prescott (1979) suggested that a sequential procedure for no more
than three outliers is sufficient, since if "there are more than a few outliers
present the problem ceases to be one of outlier detection and perhaps the
underlying structure of the data should be examined in some other way" .
calculated as

$$ESD_i = \max_{j=1,\ldots,n-i+1} |x_j - \bar{x}_i| / s_i, \qquad i = 1, \ldots, k,$$

where $\bar{x}_i$ and $s_i^2$ are the mean and variance, respectively, of the subsample
remaining after the first i - 1 observation deletions. Jain (1981)
gave critical values for the ESD procedure for k up to 5 (Table B21).
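The delete-and-recompute steps of the ESD procedure can be sketched as (function name mine):

```python
from statistics import mean, stdev

def esd_statistics(x, k):
    """ESD_1..ESD_k: delete the most extreme point and recompute at each step."""
    sample = list(x)
    out = []
    for _ in range(k):
        xbar, s = mean(sample), stdev(sample)
        extreme = max(sample, key=lambda v: abs(v - xbar))
        out.append(abs(extreme - xbar) / s)
        sample.remove(extreme)        # continue with the reduced subsample
    return out

print([round(r, 3) for r in esd_statistics([1, 2, 3, 4, 10], 2)])   # [1.697, 1.162]
```

Each $ESD_i$ is then compared to its own critical value.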
for $i = 1, \ldots, k$. Here the $x_{(i)}$ are the order statistics of the subsample at step i, and $s_i^2$ is
the variance of the subsample. This method has been found to have poor
power properties (Jain, 1981).
where the $m'_j$ are the sample moments for the subsample at step i.
where $S_{(j)}^2$ is the sum of squares of the sample obtained by deleting the j
observations furthest from the original sample mean (in either direction).
The number of outliers identified by this test is m, where m is the maximum
j such that $D_j < \lambda_j$.
Prescott also gave a formula for calculating $S_{(j)}^2$ from the full sample
sum of squares, using only functions of the residuals. By letting

$$r'_i = x_i - \bar{x},$$

i.e., the signed residual from the full sample mean, then

$$S_{(j)}^2 = S^2 - \sum r_i'^2 - \Bigl(\sum r'_i\Bigr)^2 / (n - j),$$

where the summations are over the j most extreme deviations from the
sample mean (note that the sign of $r'_i$ is retained in this calculation).
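Prescott's idea of computing the reduced sum of squares from full-sample residuals alone can be checked numerically against direct recomputation; the residual-based update coded below is my own derivation, and the data are illustrative:

```python
from statistics import mean

def reduced_ss_direct(x, j):
    """SS about its own mean after deleting the j points furthest from the full mean."""
    xbar = mean(x)
    kept = sorted(x, key=lambda v: abs(v - xbar))[: len(x) - j]
    kbar = mean(kept)
    return sum((v - kbar) ** 2 for v in kept)

def reduced_ss_from_residuals(x, j):
    """Same quantity using only the full-sample residuals r' = x - xbar."""
    n, xbar = len(x), mean(x)
    r = sorted((v - xbar for v in x), key=abs, reverse=True)[:j]   # j most extreme, signed
    full_ss = sum((v - xbar) ** 2 for v in x)
    return full_ss - sum(v * v for v in r) - sum(r) ** 2 / (n - j)

data = [9.8, 10.1, 10.0, 9.9, 15.0, 3.2, 10.2]
print(abs(reduced_ss_direct(data, 2) - reduced_ss_from_residuals(data, 2)) < 1e-9)   # True
```

Retaining the signs of the extreme residuals matters: the correction term uses their (signed) sum, not the sum of their magnitudes.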
In keeping with his suggestion, Prescott gave critical values for his
procedure only for k = 2 and 3 (Table B24).
$$a = \sum_{i=k+1}^{n-k} x_{(i)} / (n - 2k), \qquad b^2 = \sum_{i=k+1}^{n-k} (x_{(i)} - a)^2 / (n - 2k - 1).$$

Then, if $I_0$ is the full sample,

$$R_1 = \max_{I_0} |x_i - a| / b = |x^{(1)} - a| / b$$
$$R_2 = \max_{I_1} |x_i - a| / b = |x^{(2)} - a| / b, \ldots$$

where $x^{(i)}$ is the most extreme observation remaining at step i and $I_i$ is the sample with $x^{(1)}, \ldots, x^{(i)}$ removed, so that the $R_i$ are sequentially the largest standardized residuals from the
trimmed mean. These test statistics are then compared to the critical
values corresponding to the specified (α, n, k) to determine whether any
outliers exist in the original sample.
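The fixed trimmed constants and the successive extremes can be sketched as follows (function name mine; the n − 2k − 1 divisor for b is an assumption of this sketch, not taken from the source):

```python
def rst_statistics(x, k):
    """R_1..R_k: successive largest |x - a| / b from the fixed trimmed constants."""
    xs = sorted(x)
    n = len(xs)
    middle = xs[k : n - k]                          # drop k observations from each tail
    a = sum(middle) / (n - 2 * k)                   # trimmed mean
    # trimmed scale; the n - 2k - 1 divisor is an assumption of this sketch
    b = (sum((v - a) ** 2 for v in middle) / (n - 2 * k - 1)) ** 0.5
    sample = list(x)
    out = []
    for _ in range(k):
        extreme = max(sample, key=lambda v: abs(v - a))
        out.append(abs(extreme - a) / b)
        sample.remove(extreme)                      # most extreme point leaves the sample
    return out

print(rst_statistics([1, 2, 3, 4, 10], 1))   # [7.0]
```

Note that a and b are computed once from the trimmed sample and held fixed, unlike the ESD procedure, which recomputes them at every step.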
Example 6.8. For the July wind data (Data Set 8), two outliers
will again be specified, as in Example 6.5. The trimmed mean and
standard deviation based on the 27 middle observations are

$$a = 8.89, \qquad b = 1.872,$$

which result in test statistic values of

$$R_1 = (17.1 - 8.89)/1.872 = 4.386$$
$$R_2 = (8.89 - 3.8)/1.872 = 2.719.$$

From the table of critical values for a sample size of 31, using
α = 0.05 and k = 2 (Table B25), the critical values for comparison
to $R_1$ and $R_2$ are about 4.60 and 8.55, respectively, indicating
that neither of the observations should be considered outliers.
However, note that Grubbs' T = 3.01 would have just identified a
single outlier at the 0.05 level, had that been the model specified.
Rosner (1977) gave critical values for selected sample sizes up to 100
for α = 0.10, 0.05 and 0.01 and k = 2, 3 and 4; Jain (1981) gave similar values for k up to 5 (Table B25).
References
Aitken, M., and Wilson, G.T. (1980). Mixture models, outliers and the
EM algorithm. Technometrics 22, 325-331.
Barnett, V., and Lewis, T. (1994). Outliers in Statistical Data, 3rd ed.
John Wiley and Sons, New York.
Box, G.E.P. (1953). A note on regions for tests of kurtosis. Biometrika 40,
465-468.
David, H.A. (1981). Order Statistics 2nd ed. John Wiley and Sons, New
York.
David, H.A., Hartley, H.O. and Pearson, E.S. (1954). The distribution
of the ratio, in a single normal sample, of the range to the standard
deviation. Biometrika 41, 482-493.
David, H.A. and Paulson, A.S. (1965). The performance of several tests
for outliers. Biometrika 52, 429-436.
Dixon, W. (1950). Analysis of extreme values. Annals of Mathematical
Statistics 21, 488-505.
Dixon, W. (1951). Ratios involving extreme values. Annals of Mathematical Statistics 22, 68-78.
Ferguson, T.S. (1961). On the rejection of outliers. Proceedings, Fourth
Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley, 253-287.
Gibbons, D.I., and McDonald, G.C. (1980a). Examining regression relationships between air pollution and mortality. GMR-3278, General
Motors Research Laboratories, Warren, Michigan.
Gibbons, D.I., and McDonald, G.C. (1980b). Identification of influential
geographical regions in an air pollution and mortality analysis. GMR3455, General Motors Research Laboratories, Warren, Michigan.
Grubbs, F. (1950). Sample criteria for testing outlying observations. Annals of Mathematical Statistics 21, 27-58.
Grubbs, F. (1969). Procedures for detecting outlying observations in samples. Technometrics 11, 1-19.
Grubbs, F., and Beck, G. (1972). Extension of sample sizes and percentage
points for significance tests of outlying observations. Technometrics 14,
847-859.
Hawkins, D.M. (1980). Identification of Outliers. Chapman and Hall,
New York.
Hoaglin, D.C., Iglewicz, B., and Tukey, J.W. (1986). Performance of some
resistant rules for outlier labeling. Journal of the American Statistical
Association 81, 991-999.
Lave, L.B. and Seskin, E.P. (1970). Air pollution and human health. Science 169, 723-733.
Lave, L.B. and Seskin, E.P. (1977). Air Pollution and Human Health.
Johns Hopkins University Press, Baltimore.
Prescott, P. (1979). Critical values for a sequential test for many outliers.
Applied Statistics 28, 36-39.
Rosner, B. (1975). On the detection of many outliers. Technometrics 17,
221-227.
Rosner, B. (1977). Percentage points for the RST many outlier procedure.
Technometrics 19, 307-312.
Thode, Jr., H.C. (1985). Power of absolute moment tests against symmetric
non-normal alternatives. Ph.D. dissertation, University Microfilms, Ann
Arbor, Michigan.
Thode, Jr., H.C., Smith, L.A. and Finch, S.J. (1983). Power of tests of
normality for detecting scale contaminated normal samples. Communications in Statistics - Simulation and Computation 12, 675-695.
Tietjen G.L. (1986). The analysis and detection of outliers. In D'Agostino,
R.B., and Stephens, M.A., eds., Goodness-of-Fit Techniques, Marcel
Dekker, New York.
Tietjen, G.L., and Moore, R.H. (1972). Some Grubbs-type statistics for
the detection of outliers. Technometrics 14, 583-597.
Tukey, J.W. (1977). Exploratory Data Analysis. Addison-Wesley,
Reading, MA.
U.S. Bureau of the Census (1972). City and County Data Book, 1972.
U.S. Government Printing Office, Washington, D.C.
Uthoff, V.A. (1970). An optimum test property of two well-known statistics. Journal of the American Statistical Association 65, 1597-1600.