BUREAU OF MINES
This publication has been cataloged as follows:
Remmenga, Elmer E
A graphical method of removing outlier values from analyti-
cal data, by E. E. Remmenga and R. G. Burdick. [Washington]
U.S. Dept. of the Interior, Bureau of Mines [1971]
ILLUSTRATIONS
1. Typical curve shapes for good data and data containing erroneous points
2. Assumptions of standard deviation to be obtained by various methods
3. Curves obtained with four different sets of data
4. Frequency distributions of a population, using 5-percent errors at varying standard deviations from the mean
5. Effect of the number and size of outliers on the σ versus percentage of N curves
A GRAPHICAL METHOD OF REMOVING OUTLIER VALUES
FROM ANALYTICAL DATA
by
E. E. Remmenga and R. G. Burdick
ABSTRACT
INTRODUCTION
At the start of the sampling and analysis phase of the Heavy Metals
Program, as conducted by the U.S. Bureau of Mines, the overall precision and
accuracy of the assay laboratory were determined to be quite good. On this
basis, and to keep the workload within reason, only duplicate samples were
sent for assaying. The assay data, when received, were compiled and given to
the reporting engineer in tabular form with the minimum, maximum, and mean
values appended to the bottom of the report. It was then the responsibility
of the engineer to review the duplicate data compilations. Possible outlier
values were detected on the basis of large differences between duplicates.
An additional complicating factor was that these samples had come from
all over the United States from a variety of geologic provinces and the vari-
ous values encountered ranged from minute to quite large. This prevented the
engineer from getting a "feel" for the data, and rendered a nonsubjective
approach to the screening of data virtually impossible. It was also found
that the detection of outlier values was no simple "black and white" choice
for the data analyst; rather, there was a recurring problem of defining what
constituted an outlier for each suite of data.
For this reason, it was decided that a method would have to be devised to
identify outliers and to remove them from suites of data with a minimal amount
of subjectivity on the part of the analyst. The method must incorporate some
mathematical or graphical solution in order to minimize bias. Such a method
would also enable other analysts to reach the same conclusions from the same
data; that is, various analysts should not exhibit a bias differential. A
somewhat similar approach to the removal of outliers has been described by
Grubbs. 4
The method was originally devised to be used with data that had been
analyzed as duplicate samples. It was found that the difference between
duplicates was a good measure of analytical precision because irrespective of
the numerical values of the data, it can be assumed that duplicate samples
should give, or should approach giving, identical results. 5 Larger discrepan-
cies between duplicates, therefore, tend to point up lower reliability. Fur-
thermore, the "difference" population would be expected to behave as a normal
statistical population. That is, only a very small portion of the duplicates
will show perfect agreement, while a majority of the duplicates will have some
reasonably small but measurable difference, and again a small segment will
exhibit still larger differences. These facts have been noted both in the
analysis values and in the difference between duplicate values with the data
on which we have experimented. Still further, it can be shown
4
Grubbs, F. E. Procedures for Detecting Outlying Observations in Samples.
Technometrics, v. 11, No. 1, 1969, pp. 1-21.
5
Unpublished correspondence between R. G. Burdick and the U.S. Geological
Survey; available for consultation at the Bureau of Mines, Denver, Colo.
statistically that in almost all cases the data representing the suite of sam-
ples truly belong to one population. However, in the case of suspected out-
liers, it can also be shown that the far-flung data probably do not belong in
the main population. In the case of extreme outliers, values which do not
belong in the population will change the basic statistics commensurate with
their number and magnitudes.
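The point about extreme outliers shifting the basic statistics can be illustrated with a short sketch (our own, not part of the original report): contaminating 5 percent of a normal sample with values displaced far from the mean inflates both the mean and the standard deviation.

```python
import random
import statistics

random.seed(1)
clean = [random.gauss(0.0, 1.0) for _ in range(500)]
# Displace 5 percent of the values (25 of 500) far above the mean.
contaminated = clean[:475] + [x + 8.0 for x in clean[475:]]

print(f"clean:        mean={statistics.mean(clean):+.3f}  sd={statistics.stdev(clean):.3f}")
print(f"contaminated: mean={statistics.mean(contaminated):+.3f}  sd={statistics.stdev(contaminated):.3f}")
```

The size of the shift grows with both the number of displaced values and the displacement chosen, consistent with the observation above.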
PROCEDURE
The technique finally settled upon was found easier to computerize than
to use manually, although on smaller sets of data it would be practical to use
it manually. A description of the method follows. If necessary, the raw data
are first brought to a common base. As an example, in the case of gold and
silver assays, the assay weights were converted to dollars per ton. Next, the
algebraic differences between the first and second values were determined for
each assay pair. (The absolute differences have been tried but yield a
frequency distribution which, while nearly lognormal, is not as amenable to
statistical manipulation.) The definitive statistics of the differences, that
is, the range, mean, variance, and standard deviation, are then calculated and
recorded. The differences are next ranked from largest negative to largest
positive, and a frequency distribution is made to whatever scale is practical
for the particular data being inspected. The next step is to set aside a small
number of the data representing the largest absolute differences. (In some
cases it is practical to remove only one or two items at a time.) The
statistics are then calculated for the remainder of the data. This
removal-recalculation phase may be cycled as many times as desired; however,
sufficient information to use the method may be obtained after only a small
portion of the data have been set aside (in most cases about 10 to 15 percent).
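The removal-recalculation cycle just described might be sketched as follows (the function and variable names are ours, not those of the Bureau's program, and the step size and 15-percent cutoff are illustrative choices):

```python
import statistics

def difference_stats(pairs):
    """Algebraic differences between duplicate assays and their statistics."""
    diffs = [a - b for a, b in pairs]
    return diffs, {
        "range": (min(diffs), max(diffs)),
        "mean": statistics.mean(diffs),
        "variance": statistics.variance(diffs),
        "stdev": statistics.stdev(diffs),
    }

def removal_recalculation(pairs, step=2, max_fraction=0.15):
    """Repeatedly set aside the pairs with the largest absolute differences
    and record the standard deviation of the remainder at each cycle."""
    # Rank pairs by absolute difference, smallest first.
    ranked = sorted(pairs, key=lambda p: abs(p[0] - p[1]))
    history = []
    n = len(ranked)
    while n >= len(pairs) * (1.0 - max_fraction) and n > 2:
        _, stats = difference_stats(ranked[:n])
        history.append((n, stats["stdev"]))
        n -= step  # set aside the `step` largest remaining differences
    return history

# Example: 40 well-behaved duplicate pairs plus two deliberately "wild" ones.
pairs = [(10.0 + 0.1 * i, 10.05 + 0.1 * i) for i in range(40)]
pairs += [(12.0, 19.0), (8.0, 1.5)]
for n, sd in removal_recalculation(pairs):
    print(n, round(sd, 3))
```

A sharp drop in the recorded standard deviation after the first removals, followed by a plateau, corresponds to the diagnostic curve shape the report relies on.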
[Figure residue; only a fragment of the accompanying discussion survives: "... (mean = 0.11, σ = 0.96), after which it will become asymptotic with the normal data pattern ..."]
This method is not limited to duplicate data but may be used for most
normally or lognormally distributed data (fig. 3). The computer version has
the advantage that it may be used effectively by technical people who are not
statistics oriented. Since almost all scientific endeavors entail the use of
graphic representation in one form or another, nonstatisticians would probably
prefer the graphic solution to the more complex mathematical solution.
It should always be borne in mind that this method removes (or sets aside)
all of the extreme values, whether they are erroneous or not. Therefore, in
practical applications, if the extreme values cannot be positively identified
as "outliers," we must weigh the benefits of removing erroneous data against
the possible harm done by discarding good data (keeping in mind that the prob-
lem of outlier removal was generated by a disproportionately large number of
extreme values as compared to the hypothesized normal distribution, and that
[Figure 3: σ plotted against the number of samples used to calculate σ, for four sets of data: 500 random normal numbers; 500 random lognormal numbers; 454 differences between duplicate assays; and 400 differences between duplicate assays (known to contain "wild" data).]
some of these must be erroneous). In this case, the final decision of whether
or not to reject data remains a subjective one that depends largely on the
characteristics of the data. However, the proposed technique does provide an
objective means of calling attention to those extreme values that are most
likely to be erroneous.
First, if the "outliers" are such that they fall under or nearly under
the normal distribution curve, it is impossible to distinguish them from good
data. On the other extreme, if the outlier values fall very far from the mean
(say 10σ), they are readily detected. For data lying between these extremes,
the following generalities hold:
The data from which these generalities have been drawn are shown in fig-
ures 4 and 5. Figure 4 shows the frequency distribution of 500 normally dis-
tributed numbers of which 5 percent were selected at random and artificially
made into outliers by moving them farther and farther from the mean (2σ to 9σ).
It is apparent that as the outliers are moved farther out, they are more
easily detected in the "tails" region of the distribution curve. The curves
of σ versus N shown in figure 5 indicate the effect of varying the percentage
of outliers and their displacement from the mean. It is apparent that the
smaller the percentage and displacement of the outliers, the more difficult it
is to detect their presence by this graphical technique.
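The experiment behind figures 4 and 5 can be approximated with a small simulation (our own construction, with arbitrary parameter choices): 500 normal numbers, 5 percent of them turned into outliers at a chosen displacement, with σ recomputed as the most extreme values are trimmed.

```python
import random
import statistics

def sigma_versus_n(displacement_sigmas, seed=0):
    """sigma of the n least-extreme values, n from 500 down to 400."""
    random.seed(seed)
    data = [random.gauss(0.0, 1.0) for _ in range(500)]
    # Turn 5 percent of the values into outliers at the given displacement.
    for i in range(25):
        data[i] = displacement_sigmas if data[i] >= 0 else -displacement_sigmas
    ranked = sorted(data, key=abs)
    return [(n, statistics.stdev(ranked[:n])) for n in range(500, 399, -20)]

for k in (2, 9):
    curve = sigma_versus_n(k)
    print(f"{k} sigma displacement:", [(n, round(s, 2)) for n, s in curve])
```

Outliers at 9σ produce a steep initial drop in the curve as they are trimmed, while outliers at 2σ blend into the tails and the curve stays nearly flat, matching the generality stated above.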
[Figure 4: Frequency distributions of a population with 5-percent errors placed at varying displacements (1σ to 9σ) from the mean.]
[Figure 5: Effect of the number (1 to 10 percent) and size (2σ to 9σ) of outliers on the σ versus percentage-of-N curves; N is the percentage of samples used to calculate σ. The bottom curve approximates outlier-free normal data.]
CONCLUSIONS
A method has been developed for identifying and setting aside extreme
values from suites of data with minimum subjectivity or bias on the part of
the analyst. In general, the method consists of the following steps:
1. If necessary, bring the raw data to a common base.
2. Calculate the algebraic differences between duplicate values and their definitive statistics (range, mean, variance, and standard deviation).
3. Set aside a small number of the data representing the largest absolute differences.
4. Recalculate the statistics for the remaining data.
Steps 3 and 4 may be cycled as many times as the analyst desires. Upon com-
pletion, the analyst has a series of standard deviations associated with data
sets consisting of N observations (with N decreasing as more and more of the
outlying values are removed).
A graph of standard deviation plotted against N for the series has been
found to be diagnostic of the presence of erroneous extreme values in the
original data set, and has been used to detect and remove the offensive obser-
vations in an objective manner. A downward bias in the estimated variance is
shown to exist in data truncated in this manner, and a compensation is
indicated.
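The downward bias noted above can be demonstrated with a short sketch (our own illustration): even for clean normal data with true σ = 1, discarding the 10 percent most extreme values shrinks the estimated standard deviation well below 1, so any σ taken from a truncated set must be compensated upward.

```python
import random
import statistics

random.seed(3)
# Clean normal data, sorted by distance from the mean (here, zero).
data = sorted((random.gauss(0.0, 1.0) for _ in range(1000)), key=abs)

full_sd = statistics.stdev(data)
trimmed_sd = statistics.stdev(data[:900])  # drop the 10 percent most extreme

print(f"full sample:   sd = {full_sd:.3f}")
print(f"90% truncated: sd = {trimmed_sd:.3f}  (biased low)")
```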