
RI 7472

Bureau of Mines Report of Investigations / January 1971

A Graphical Method of Removing Outlier Values From Analytical Data

UNITED STATES DEPARTMENT OF THE INTERIOR

Report of Investigations 7472

A Graphical Method of Removing Outlier Values From Analytical Data

By E. E. Remmenga and R. G. Burdick

UNITED STATES DEPARTMENT OF THE INTERIOR
BUREAU OF MINES
This publication has been cataloged as follows:

Remmenga, Elmer E.
A graphical method of removing outlier values from analytical data, by E. E. Remmenga and R. G. Burdick. [Washington] U.S. Dept. of the Interior, Bureau of Mines [1971]

10 p. illus. (U.S. Bureau of Mines. Report of investigations 7472)

1. Mathematical statistics. 2. Mathematical analysis. I. Burdick, Richard G., jt. auth. II. Title. (Series)

TN23.U7 no. 7472 622.06173

U.S. Dept. of the Int. Library

CONTENTS

Abstract
Introduction
Procedure
Conclusions

ILLUSTRATIONS

1. Typical curve shapes for good data and data containing erroneous points
2. Assumptions of standard deviation to be obtained by various methods
3. Curves obtained with four different sets of data
4. Frequency distributions of a population, using 5-percent errors at varying standard deviations from the mean
5. Effect of the number and size of outliers on the σ versus percentage of N curves
A GRAPHICAL METHOD OF REMOVING OUTLIER VALUES
FROM ANALYTICAL DATA

by

E. E. Remmenga¹ and R. G. Burdick²

ABSTRACT

Often a collection of analytical data, which should reasonably be expected to follow the normal distribution, contains too many extreme values, which the experimenter is inclined to eliminate according to some arbitrary procedure. The stepwise computer procedure described for this purpose is similar to many other truncation procedures; however, because truncation of a basically normal distribution causes a downward bias in the estimated variance, this procedure provides a graphical method to compensate for the bias.

INTRODUCTION

Whenever a collection of analytical data is compiled, it is almost inevitable that a few erroneous extreme values will be encountered. These outliers may result from improper initial sampling or subsampling, faulty sample preparation, equipment malfunctions, analytical technique blunders, or even compilation itself. Errors in this last category may be detected and corrected. However, errors of the first four types are generally difficult or impossible to detect and confirm. "Suspicious" results may (usually) be culled out by the experimenter, but not without the risk of introducing bias. It must, of course, be borne in mind that some of the extreme values may represent valid data.

As mentioned by Youden,³ the experimenter, when faced by the need to establish the reliability of a particular method, must rely heavily on experience unless he has a background in statistics. The use of experience must necessarily be prone to subjectivity on the part of the experimenter. In addition, when data are being collected from an area in which little experience exists, it is virtually impossible to establish the reliability to be expected without either the use of statistics, or the compilation of an inordinate amount of data (which amounts to the gaining of experience). Therefore, some statistical method is to be preferred for data analysis, whenever possible, in order to give the analyst the necessary information about his data.

¹Mathematical statistician.
²Engineering technician.
Both authors are with the Mine Systems Engineering Group, Bureau of Mines, Denver, Colo.
³Youden, W. J. Statistical Methods for Chemists. John Wiley & Sons, Inc., New York, 1951, 114 pp.

At the start of the sampling and analysis phase of the Heavy Metals Program, as conducted by the U.S. Bureau of Mines, the overall precision and accuracy of the assay laboratory were determined to be quite good. On this basis, and to keep the workload within reason, only duplicate samples were sent for assaying. The assay data, when received, were compiled and given to the reporting engineer in tabular form with the minimum, maximum, and mean values appended to the bottom of the report. It was then the responsibility of the engineer to review the duplicate data compilations. Possible outlier values were detected on the basis of large differences between duplicates.

An additional complicating factor was that these samples had come from all over the United States and from a variety of geologic provinces, and the various values encountered ranged from minute to quite large. This prevented the engineer from getting a "feel" for the data, and rendered a nonsubjective approach to the screening of data virtually impossible. It was also found that the detection of outlier values was no simple "black and white" choice for the data analyst; rather, there was a recurring problem of defining what constituted an outlier for each suite of data.

For this reason, it was decided that a method would have to be devised to identify outliers and to remove them from suites of data with a minimal amount of subjectivity on the part of the analyst. The method must incorporate some mathematical or graphical solution in order to minimize bias. Such a method would also enable other analysts to reach the same conclusions from the same data; that is, various analysts should not exhibit a bias differential. A somewhat similar approach to the removal of outliers has been described by Grubbs.⁴

The method was originally devised to be used with data that had been analyzed as duplicate samples. It was found that the difference between duplicates was a good measure of analytical precision because, irrespective of the numerical values of the data, it can be assumed that duplicate samples should give, or should approach giving, identical results.⁵ Larger discrepancies between duplicates, therefore, tend to point up lower reliability. Furthermore, the "difference" population would be expected to behave as a normal statistical population. That is, only a very small portion of the duplicates will show perfect agreement, while a majority of the duplicates will have some reasonably small but measurable difference, and again a small segment will exhibit still larger differences. These facts have been noted both in the analysis values and in the difference between duplicate values with the data on which we have experimented. And still further, it can be shown statistically that in almost all cases the data representing the suite of samples truly belong to one population. However, in the case of suspected outliers, it can also be shown that the farflung data probably do not belong in the main population. In the case of extreme outliers, values which do not belong in the population will change the basic statistics commensurate with their number and magnitudes.

⁴Grubbs, F. E. Procedures for Detecting Outlying Observations in Samples. Technometrics, v. 11, No. 1, 1969, pp. 1-21.
⁵Unpublished correspondence between R. G. Burdick and the U.S. Geological Survey; available for consultation at the Bureau of Mines, Denver, Colo.

An occasional outlier in a population for which there is a valid basis for a normal distribution, such as exists for the distribution of difference between pairs, is of no concern. However, for the data at hand, it was found that there was a disproportionate amount of data in the tails of the distribution which otherwise appeared to be acceptably normal. Further, these extreme values were erratic, were possibly from a different population, and were of the type of data expected to occur due to faulty operation, malfunction, or blundering.

PROCEDURE

Discarding of these data, resulting in a truncation of the curve, is the immediate first thought. This report describes an orderly objective procedure for this purpose. There must also be a realization that truncation cannot distinguish between good data which exist as a member of the normal distribution and erroneous data which should be eliminated. Truncation may be desirable to "tidy up" the data, but the estimates of variance resulting from removing data which are not erroneous will be biased downward. A compensation for this bias should be considered.

The technique finally settled upon was found easier to computerize than to use manually, although on smaller sets of data it would be practical to use it manually. A description of the method follows. If necessary, the raw data are first brought to a common base. As an example, in the case of gold and silver assays, the assay weights were converted to dollars per ton. Next, the algebraic differences between the first and second values were determined for each assay pair. (The absolute differences have been tried but yield a frequency distribution which, while nearly lognormal, is not as amenable to manipulation statistically.) The definitive statistics of the differences, that is, range, mean, variance, and standard deviation, are then calculated and recorded. The differences are next ranked from largest negative to largest positive, and a frequency distribution is made to whatever scale is practical for the particular data being inspected. The next step is to set aside a small number of the data representing the largest absolute differences. (In some cases it is practical to remove only one or two items at a time.) The statistics are then calculated for the remainder of the data. This data removal-recalculation phase may be cycled as many times as desired; however, sufficient information to use the method may be obtained after only a small portion of the data have been set aside (in most cases about 10 to 15 percent).
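The removal-recalculation cycle lends itself to a short program. The following is a minimal sketch in Python of how such a cycle might be written; the function name sigma_versus_n, the batch size, and the 15-percent default stopping rule are illustrative choices, not the Bureau's original program.

    # Sketch of the removal-recalculation cycle: form the algebraic differences
    # between duplicate determinations, rank them, and repeatedly set aside the
    # largest absolute differences while recomputing the standard deviation.
    from statistics import pstdev

    def sigma_versus_n(first, second, batch=2, max_removed_fraction=0.15):
        """Return (n, sigma) pairs as the most extreme differences are set aside.

        first, second        -- paired duplicate analyses, already on a common base
        batch                -- number of extreme values removed per cycle
        max_removed_fraction -- stop after this portion of the data is set aside
        """
        diffs = [a - b for a, b in zip(first, second)]  # algebraic (signed) differences
        diffs.sort(key=abs)                             # largest absolute values last

        pairs = [(len(diffs), pstdev(diffs))]
        limit = int(len(diffs) * max_removed_fraction)
        removed = 0
        while removed + batch <= limit:
            diffs = diffs[:-batch]                      # set aside the most extreme values
            removed += batch
            pairs.append((len(diffs), pstdev(diffs)))
        return pairs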

The next step is to graph the standard deviation (σ) as a function of the data sample size (N). It will be found that one of two conditions will exist in the resulting curve defined by these points:
1. The curve will descend gradually as N decreases, with a slight concave-upward curvature, and it will approximate the normal data pattern drawn to the same scale.

2. The curve will start from a high standard deviation and fall to a markedly lower one when the outlier samples are removed (see fig. 1). After falling to the lower point, the curve will show a "dog leg," after which it will become asymptotic with the normal data pattern, provided that the data were basically normally distributed and had been contaminated with an abnormal number of extreme values.

The graphs in figure 1 were obtained by applying the removal procedure first to a set of normal random numbers and then to the same set of random numbers after adding a factor to 10 of the tail elements to create extreme outliers.

[FIGURE 1. - Typical Curve Shapes for Good Data and Data Containing Erroneous Points. Two panels of σ versus N (percentage of samples used to calculate σ): 491 random normal points (mean = 0.04, σ = 0.82), and the same 491 points with 10 outliers included (mean = 0.11, σ = 0.96).]
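For the graphing step itself, a short plotting sketch (assuming matplotlib is available and reusing the illustrative sigma_versus_n helper above; first_assays and second_assays stand for hypothetical duplicate determinations) might look like this:

    import matplotlib.pyplot as plt

    pairs = sigma_versus_n(first_assays, second_assays)   # hypothetical duplicate data
    n_full = pairs[0][0]
    pct_n = [100.0 * n / n_full for n, _ in pairs]
    sigmas = [s for _, s in pairs]

    plt.plot(pct_n, sigmas, marker="o")
    plt.gca().invert_xaxis()   # N runs from 100 percent downward, as in figure 1
    plt.xlabel("N (percentage of samples used to calculate sigma)")
    plt.ylabel("sigma")
    plt.show()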

The sample size at which the curve of σ versus N becomes asymptotic to the normal curve pattern is accepted as the point separating the "good" data from the erroneous data. It can be shown that the data points creating the "dog leg" above this normal pattern have a high probability of being "outliers" which should be corrected by inspection or reruns, or possibly removed from the suite, while the asymptotic portion represents the standard deviations to be expected under the assumption of normality (fig. 2). In addition, if either a straight line or the normal data pattern is projected backwards from the asymptotic part of a "dog leg" curve, a very close approximation to the expected standard deviation of the assumed normal distribution can be obtained from the point where this line crosses the ordinate.
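The straight-line back-projection can be sketched as a simple least-squares fit through the asymptotic (post-"dog leg") points, read off at 100 percent of N; how many trailing points to treat as asymptotic is an assumption left to the analyst, and the helper name project_sigma is ours.

    def project_sigma(pairs, asymptotic_points=5):
        """Project a straight line through the asymptotic part of the sigma-versus-N
        curve back to the ordinate (100 percent of N) to estimate the standard
        deviation of the assumed normal distribution."""
        n_full = pairs[0][0]
        tail = pairs[-asymptotic_points:]                 # points beyond the "dog leg"
        xs = [100.0 * n / n_full for n, _ in tail]
        ys = [s for _, s in tail]

        # Least-squares slope and intercept for y = slope * x + intercept.
        x_bar = sum(xs) / len(xs)
        y_bar = sum(ys) / len(ys)
        slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
                 / sum((x - x_bar) ** 2 for x in xs))
        intercept = y_bar - slope * x_bar
        return slope * 100.0 + intercept                  # projected sigma at N = 100 pct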
It has been found that there is a good agreement between this "graphic" method and Grubbs'⁶ more rigorous mathematical solution of the third and fourth moments (skewness and kurtosis) of the frequency distribution curve, for removal of outliers. The one possible advantage to this method (other than its simplicity) is the possibility of estimating what the standard deviation should have been in cases where outliers exist. Other methods of calculating the standard deviation from a truncated frequency curve give an estimate that is too low. While an extremely accurate estimate is not claimed for this method, it can be seen by inspection of figure 2 that it does offer an improvement over results obtained with other techniques. The mathematical form of the normal pattern of the σ versus N curve has not been studied as yet, but should prove to be a useful tool for estimating the standard deviation of a truncated distribution.

[FIGURE 2. - Assumptions of Standard Deviations To Be Obtained by Various Methods. σ versus N (percentage of samples used to calculate σ), showing: A, assumption of σ by the usual methods of data truncation; B, projection of the asymptotic portion of the curve to give a higher estimate of σ; C, superimposition of the normal curve to show where σ should have been with 100 percent valid data.]

This method is not limited to duplicate data but may be used for most normally or lognormally distributed data (fig. 3). The computer version has the advantage that it may be used effectively by technical people who are not statistics oriented. Since almost all scientific endeavors entail the use of graphic representation in one form or another, nonstatisticians would probably prefer the graphic solution to the more complex mathematical solution.

It should always be borne in mind that this method removes (or sets aside) all of the extreme values, whether they are erroneous or not. Therefore, in practical applications, if the extreme values cannot be positively identified as "outliers," we must weigh the benefits of removing erroneous data against the possible harm done by discarding good data (keeping in mind that the problem of outlier removal was generated by a disproportionately large number of extreme values as compared to the hypothesized normal distribution, and that some of these must be erroneous). In this case, the final decision of whether or not to reject data remains a subjective one that depends largely on the characteristics of the data. However, the proposed technique does provide an objective means of calling attention to those extreme values that are most likely to be erroneous.

⁶Work cited in footnote 4.

[FIGURE 3. - Curves Obtained With Four Different Sets of Data. Four panels of σ versus number of samples used to calculate σ: 500 random normal numbers; 500 random lognormal numbers; 454 differences between duplicate assays; and 400 differences between duplicate assays (known to contain "wild" data).]

To illustrate the "sensitivity" of this method, several tests were made using 500 normally distributed numbers with a mean of 0.0, a variance of 0.67, and a standard deviation of 0.82. From these tests, the following conclusions have been drawn regarding sensitivity.

First, if the "outliers" are such that they fall under or nearly under the normal distribution curve, it is impossible to distinguish them from good data. On the other extreme, if the outlier values fall very far from the mean (say 10σ), they are readily detected. For data lying between these extremes, the following generalities hold:

1. If 1 percent of the population are outliers, they must be displaced from the mean by at least 4σ or 5σ before they are definitely detectable.

2. If 5 percent of the population are outliers, they must be displaced by at least 2σ or 3σ to be detectable.

3. If 10 percent of the population are outliers, they are detectable when they are displaced by 2σ.

The data from which these generalities have been drawn are shown in figures 4 and 5. Figure 4 shows the frequency distribution of 500 normally distributed numbers of which 5 percent were selected at random and artificially made into outliers by moving them farther and farther from the mean (2σ to 9σ). It is apparent that as the outliers are moved farther out, they are more easily detected in the "tails" region of the distribution curve. The curves of σ versus N shown in figure 5 indicate the effect of varying the percentage and displacement from the mean of the outlier values. It is apparent that the smaller the percentage and displacement of the outliers, the more difficult it is to detect their presence by this graphical technique.
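The sensitivity experiment can be reproduced in outline with the sketch below. The population parameters (500 points, σ = 0.82, 5-percent contamination) are taken from the report; the seed, the choice of a 4σ displacement, and the application of the cycle directly to the values rather than to duplicate differences are our own assumptions.

    import random
    from statistics import pstdev

    random.seed(7)                                   # arbitrary seed, for repeatability
    sigma_true = 0.82
    data = [random.gauss(0.0, sigma_true) for _ in range(500)]

    # Displace 5 percent of the points by 4 sigma, away from the mean, to make outliers.
    k = 4
    for i in random.sample(range(len(data)), int(0.05 * len(data))):
        data[i] += k * sigma_true if data[i] >= 0 else -k * sigma_true

    # Removal-recalculation cycle applied directly to the values.
    values = sorted(data, key=abs)
    pairs = [(len(values), pstdev(values))]
    while len(values) > int(0.85 * len(data)):
        values = values[:-2]                         # set aside the two most extreme values
        pairs.append((len(values), pstdev(values)))
    # A pronounced drop followed by a "dog leg" in these pairs marks detectable outliers.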

As stated earlier, however, the "seriousness" of the outliers (for analytical purposes) depends upon their number and magnitude. Therefore, an error or errors, undetectable by this method, will not change the basic statistics of the suite of data as much as will errors in the detectable range.
[FIGURE 4. - Frequency Distributions of a Population, Using 5-Percent Errors at Varying Standard Deviations From the Mean. Frequency histograms in which the outliers departing from the normal distribution are marked; the panels shown include displacements of 1σ and 9σ.]
[FIGURE 5. - Effect of the Number and Size of Outliers on the σ Versus Percentage of N Curves. Three panels of σ versus N (percentage of samples used to calculate σ) for 1-percent, 5-percent, and 10-percent outliers, with displacements ranging from 2σ to 9σ; the bottom curve in each panel approximates outlier-free normal data.]

CONCLUSIONS

A method has been developed for identifying and setting aside extreme values from suites of data with minimum subjectivity or bias on the part of the analyst. In general, the method consists of the following steps:

1. Compute the standard deviation for the entire suite of data.

2. Rank the data.

3. Remove one or more of the most extreme values.

4. Recompute the statistics for this smaller set of data.

Steps 3 and 4 may be cycled as many times as the analyst desires. Upon completion, the analyst has a series of standard deviations associated with data sets consisting of N observations (with N decreasing as more and more of the outlying values are removed).
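Restated in terms of the illustrative helpers sketched in the Procedure section (sigma_versus_n and project_sigma are our names; first_assays and second_assays stand for hypothetical duplicate determinations), the cycle reads:

    # Steps 1-4: compute statistics, rank, set aside extremes, and recompute, cycling.
    pairs = sigma_versus_n(first_assays, second_assays, batch=2)

    # Inspect the sigma-versus-N series for a "dog leg", then compensate for the
    # truncation bias by projecting the asymptotic portion back to 100 percent of N.
    sigma_hat = project_sigma(pairs)
    print(sigma_hat)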

A graph of standard deviation plotted against N for the series has been found to be diagnostic of the presence of erroneous extreme values in the original data set, and has been used to detect and remove the offensive observations in an objective manner. A downward bias in the estimated variance is shown to exist in data truncated in this manner, and a compensation is indicated.

