[Table: tests for normality and for outliers, with the reference sections of the book in which each is described. Entries include Geary's a, the Anderson-Darling A2, skewness (√b1) and kurtosis (b2), the chi-squared test, D'Agostino's D, the Kolmogorov-Smirnov test D*, the EDF tests, Grubbs' tests (including the test for > 1 outlier), the Tietjen-Moore test for > 1 outlier (Ek), LaBreque's tests, the MPLSI tests, the P-P correlation test, the sample entropy test, the joint kurtosis/skewness test, and the Wilk-Shapiro W, among others.]
usually not of interest, so only a test which has high power at detecting skewed and long-tailed symmetric alternatives need be considered. It is therefore important to be able to identify which tests are competitively powerful under certain specific situations, in case some information is known concerning the alternative.
It is also important to know which tests have decent power under all types of alternatives, for those instances where no a priori information is available.
Comparisons of the power of these tests with W showed that at least some of the EDF tests
were useful as tests for normality. This nearly doubled the number of tests
that could be used in power comparisons.
The complexity of power comparisons, which were to become almost
mandatory when presenting new tests, was also increased by the number of
alternatives used to compare tests in the earlier studies. Shapiro, Wilk and
Chen (1968) used 45 parameterizations of 12 different alternative distributions. Pearson, D'Agostino and Bowman (1977) presented power estimates
for 58 parameterizations of 12 different distributions. While both of these
studies only included a small number of useful tests (Pearson, D'Agostino
and Bowman included only four omnibus tests and four directional tests)
and did not include composite hypothesis EDF tests, they set a standard
which would be difficult to measure up to, given space limitations in journals. Not only would a power comparison use up a lot of space when it included all tests (or at least all that had shown some useful characteristics), but very little additional information would be gained in ensuing publications, which would also have to include all tests (plus one new one) and the large range of alternatives.
These difficulties gave rise to the practice of comparing a new test
with a small subset of tests for normality and/or alternatives, during the
time when the development of tests for normality flourished, from 1975 to
the middle of the 1980's. For example, Locke and Spurrier (1976) only compared their two tests (T1n and T2n) with √b1. Although this is an
extreme example of the limitations on power comparisons, there were very
few large scale comparisons which could be used to directly compare a large
number of tests for a broad range of alternatives.
In addition, there was no common standard for the design of the power comparisons. Different studies used different sample sizes and α levels. Reliability of the estimated power differed between studies, because different
numbers of replications were used. Tests were sometimes used as two-tailed
and sometimes as one-tailed tests; sometimes it was not stated how many
tails were used. In some comparisons the estimated power of a new test was
based on a new simulation, while the power estimates for the comparison
tests were obtained from a previously published study.
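The effect of these design choices is easiest to see in the simulation itself. The sketch below is a minimal, hypothetical illustration (it reproduces none of the cited studies' actual designs; the sample size, level, replication count and alternative are arbitrary choices, and numpy/scipy are assumed available): estimated power is simply the rejection rate over repeated samples from the alternative.

```python
import numpy as np
from scipy import stats

def estimated_power(sampler, test, n=20, alpha=0.05, reps=2000, seed=1):
    """Monte Carlo power estimate: the fraction of samples drawn from
    `sampler` for which `test` rejects normality at level alpha."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(reps):
        _, p = test(sampler(rng, n))
        rejections += p < alpha
    return rejections / reps

# A skewed alternative: chi-squared with 4 degrees of freedom.
power_w = estimated_power(lambda rng, n: rng.chisquare(4, n), stats.shapiro)
# Under the null hypothesis the rejection rate should be close to alpha.
size_w = estimated_power(lambda rng, n: rng.standard_normal(n), stats.shapiro)
```

Every quantity fixed here (n, α, the number of replications, one- versus two-tailed use, the alternatives chosen) varied across the studies described above, which is exactly why their results are hard to compare directly.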
20. For the three long-tailed symmetric alternatives they used, W was seen to be generally competitive with b2, although neither was decisively better. In their more extensive simulation, Shapiro, Wilk and Chen (1968) used the same tests but included ten long-tailed symmetric alternatives and five sample sizes between 10 and 50. W and b2 were again competitive, with b2 tending to be more powerful for the larger sample sizes.
D'Agostino (1971) compared D with the simulation results of Shapiro,
Wilk and Chen (1968). Using only samples of size 50, he determined that
D was competitive with W and b2 for long-tailed symmetric distributions. D'Agostino and Rosman (1974) only compared Geary's test, W and D, but used samples of size 20, 50 and 100; they found that both D and a worked well for long-tailed symmetric alternatives. Hogg (1972) showed virtually no difference between Uthoff's (1968) U, asymptotically equivalent to a, and b2 for logistic and double exponential alternatives. Smith (1975), using the same tests as Hogg, only considered symmetric long-tailed stable alternatives, and suggested that U be used, especially as tail heaviness increases.
Csorgo, Seshadri and Yalovsky (1973) showed that a Kolmogorov-Smirnov test based on a sample characterization of normality yields results comparable to those of W for small samples. Stephens (1974) compared composite hypothesis EDF tests to W and D and showed that the Anderson-Darling and Cramer-von Mises tests were comparable to both tests, with A2 being slightly better than W and W2, and not quite as good as D. The Kolmogorov-Smirnov test D* always performed worse than the other tests. Green and Hegazy (1976) compared D, W and some modified EDF tests, with D nearly always being most powerful for the Cauchy and double exponential alternatives for samples between 5 and 80.
Filliben (1975) showed little difference between a, D, W, W', b2 and r for samples of size 20; for samples of size 50, b2 and W did not seem competitive, and a was marginally the best test. Gastwirth and Owens (1977) indicated that a seemed better than b2 for long-tailed symmetric alternatives. For samples of size 20, Spiegelhalter (1977) showed that the MPLSI test for a double exponential alternative dominated W and b2 for several long-tailed symmetric alternatives.
Of the tests introduced by LaBreque (1977), F1 was always most powerful when compared to W, A2, a and b2, sometimes by an appreciable amount. His results were based only on samples of size 12 and 30.
Pearson, D'Agostino and Bowman (1977) showed that for omnibus tests, the combined skewness and kurtosis test K2 was more powerful than W or D; however, when used in a directional manner, D was better than all of the omnibus tests, but there was no real difference between D and a directional (upper-tailed) b2 test. White and MacDonald (1980) also indicated that the power of D slightly exceeded that of both W and b2. Spiegelhalter (1980) showed a possible slight advantage of his omnibus test Ss over W and b2 in some circumstances. Oja (1981) showed his T2 and Locke and Spurrier's (1977) T* tests to be essentially equivalent in power, and better than W and b2 for samples of size 20. A modification of T2 (Oja, 1983) which is easier to calculate showed a slight loss of power relative to T2, but still exceeded that of the other tests. The MPLSI test for a Cauchy alternative (Franck, 1981) had better performance than other tests for stable alternatives, at least for samples of size 20; for samples of size 50 not much difference was demonstrated.
For scale contaminated normal alternatives, which have population kurtosis greater than 3, kurtosis and other absolute moment tests with exponent greater than 3 had higher power, on average, than other tests, including D, a, U and W (Thode, Smith and Finch, 1983). However, D, a and U had nearly equivalent power to the absolute moment tests for the heavier tailed mixtures.
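The scale contaminated normal model referred to here is a two-component mixture. As a hypothetical sketch (the contamination fraction and scale below are arbitrary choices, and numpy is assumed available):

```python
import numpy as np

def scale_contaminated(rng, n, p=0.10, scale=3.0):
    """Mixture (1 - p) N(0, 1) + p N(0, scale^2): each observation
    independently comes from the wide component with probability p."""
    x = rng.standard_normal(n)
    heavy = rng.random(n) < p
    x[heavy] *= scale
    return x

def kurtosis_b2(x):
    """Sample kurtosis b2 = m4 / m2^2 (approximately 3 for normal samples)."""
    d = x - x.mean()
    return (d**4).mean() / (d**2).mean()**2

rng = np.random.default_rng(0)
b2 = kurtosis_b2(scale_contaminated(rng, 200000))
# For p = 0.10 and scale = 3, the population kurtosis is
# 3(0.9 + 0.1 * 3**4) / (0.9 + 0.1 * 3**2)**2 = 27 / 3.24, about 8.33,
# well above the normal value of 3, as the text notes.
```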
Thode (1985) showed that a, D and U were the best tests for detecting
double exponential alternatives.
Looney and Gulledge (1984; 1985) showed essentially no difference in
power among correlation tests based on different plotting positions except,
notably, W', which had slightly lower power than the others. Gan and
Koehler (1990) also showed that the correlation test based on the plotting
position i/(n + 1) had slightly higher power than A2 and W. Tests based on normalized spacings (Lockhart, O'Reilly and Stephens, 1986) were not as good as either A2 or W.
correlation tests were about equivalent in power. For samples of size 20 and 40, the A2 test based on normalized spacings showed a slight but consistent advantage over W (Lockhart, O'Reilly and Stephens, 1986). W and A2 outperformed the P-P plot correlation test k2 (Gan and Koehler, 1990).
appropriate for long-tailed symmetric alternatives. Many of the power comparisons described above have shown that these tests should be used, in a one-tailed manner, under these circumstances. LaBreque's F1 needs to be investigated further for this class of alternatives. For short-tailed symmetric distributions, theoretical and simulation results indicate that the best tests are U, one-tailed b2 or Grubbs' statistic.
If there is no prior knowledge about the possible alternatives, then an omnibus test would be most appropriate. A joint skewness and kurtosis test such as K2 provides high power against a wide range of alternatives, as does the Anderson-Darling A2. The Wilk-Shapiro W showed relatively
high power among skewed and short-tailed symmetric alternatives when
compared to other tests, and respectable power for long-tailed symmetric
alternatives.
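All three omnibus statistics singled out here have standard modern implementations, for instance in Python's scipy (assumed available; the exponential alternative below is just an illustration, not drawn from the studies cited):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.exponential(size=50)            # a strongly skewed alternative

k2, p_k2 = stats.normaltest(x)          # D'Agostino-Pearson K2 (joint skewness/kurtosis)
w, p_w = stats.shapiro(x)               # Wilk-Shapiro W
ad = stats.anderson(x, dist='norm')     # Anderson-Darling A2
# ad.statistic is compared with ad.critical_values, which correspond to
# the significance levels listed in ad.significance_level (15, 10, 5, 2.5, 1%).
```

For a sample this far from normal, all three tests reject decisively.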
Tests to be avoided for evaluation of normality include the χ2 test and the Kolmogorov-Smirnov test D*. The χ2 test, however, has often been shown to have among the highest power of all tests for a lognormal alternative. Half-sample methods (Stephens, 1978) and spacing tests also have poor power for testing for normality.
Relative to tests for normality, there have been few power comparisons of tests for outliers. One reason may be that there are relatively few outlier tests, and each outlier test has a specific function. For example, Grubbs' (1950, 1969) and Dixon's (1950, 1951) outlier tests are used to detect a single outlier, whereas Lk is a test for k > 1 outliers; therefore, a comparison between Tn and L3, say, would be meaningless. Similarly, Lk and sequential procedures would not be comparable, since for the former the number of outliers is prespecified while for the latter the number of outliers is tested for sequentially. Some of the tests that are usually not thought of as outlier tests (√b1, b2, w, tests for normal mixtures) are not often compared to tests labeled as outlier tests.
Ferguson (1961) compared the power of b2 and √b1 with Grubbs' and Dixon's outlier tests. He used normal random samples and added a fixed constant to one observation in each sample. He found virtually no difference in power between Grubbs' outlier test and √b1, with Dixon's test being only slightly less powerful. Similarly, he computed b2, T and Dixon's r for the same samples in order to determine the power when two-tailed tests were required. Here he found that b2 and T were virtually identical and again Dixon's test was only slightly lower in power.
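Ferguson's design is easy to reproduce in outline. The sketch below is hypothetical (the sample size, shift and replication count are arbitrary, and critical values are simulated rather than taken from his tables): each sample has a fixed constant added to exactly one observation, and power is the rate at which the statistic exceeds its simulated null critical value.

```python
import numpy as np

def grubbs_T(x):
    """Grubbs' statistic for a single upper outlier: max deviation / s."""
    return (x.max() - x.mean()) / x.std(ddof=1)

def skew_b1(x):
    """Sample skewness sqrt(b1) = m3 / m2^(3/2)."""
    d = x - x.mean()
    return (d**3).mean() / (d**2).mean()**1.5

def ferguson_power(stat, n=20, shift=4.0, alpha=0.05, reps=2000, seed=7):
    """Critical value from normal samples; power from samples in which
    a fixed constant `shift` is added to one observation."""
    rng = np.random.default_rng(seed)
    null = np.sort([stat(rng.standard_normal(n)) for _ in range(reps)])
    crit = null[int((1 - alpha) * reps)]
    hits = 0
    for _ in range(reps):
        x = rng.standard_normal(n)
        x[0] += shift          # contaminate exactly one observation
        hits += stat(x) > crit
    return hits / reps

p_grubbs = ferguson_power(grubbs_T)
p_skew = ferguson_power(skew_b1)
```

Consistent with Ferguson's finding, the two one-tailed statistics behave similarly against a single shifted observation.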
In a second experiment Ferguson added a positive constant to two observations in the samples to determine the power of T and b2, in particular because of the possibility of the masking effect on T. Kurtosis did significantly better than Grubbs' outlier test when there was more than one outlier in the sample; Dixon's test was not included in this experiment.
Thode, Smith and Finch (1983) showed T to be one of the most powerful tests at detecting scale contaminated normal distributions, a commonly used model for generating samples with outliers. Samples were generated differently than those of Ferguson: contaminating observations were generated randomly, so that none, one or more than one contaminating observation may have existed in each sample. T was shown to have 92% relative power to kurtosis over all parameterizations studied, where relative power was defined as the ratio of sample sizes needed for each test (b2 to T) in order to obtain the same power. T significantly outperformed √b1, the Dixon test and w, which had only 65% relative power to b2.
Whereas in the above three studies the measure of performance of
a test was simply how often the null hypothesis was rejected, Tietjen and
Moore (1972) and Johnson and Hunt (1979) examined other characteristics
of outlier tests. Specifically, they were interested in the performance of the
tests in detecting which and how many observations were identified as
outliers using Ek or (sequentially) other tests for outliers.
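The relative power measure used by Thode, Smith and Finch (1983), the ratio of sample sizes two tests need to reach the same power, can also be sketched by simulation. This is a hypothetical illustration (the contaminated-normal alternative, the sample-size grid and the target power are all arbitrary choices, and scipy is assumed available):

```python
import numpy as np
from scipy import stats

def power_at_n(test, n, reps=400, alpha=0.05, seed=3):
    """Estimated power of `test` against a 10% scale-contaminated
    normal alternative (contaminant scale 5)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        x = rng.standard_normal(n)
        x[rng.random(n) < 0.10] *= 5.0
        hits += test(x)[1] < alpha
    return hits / reps

def smallest_n(test, target=0.5, grid=range(20, 201, 20)):
    """Smallest sample size on the grid at which power reaches `target`."""
    for n in grid:
        if power_at_n(test, n) >= target:
            return n
    return None

n_k2 = smallest_n(stats.normaltest)   # joint skewness/kurtosis K2
n_w = smallest_n(stats.shapiro)       # Wilk-Shapiro W
relative_power = n_k2 / n_w           # relative power of W with respect to K2
```

The coarse grid makes this only a rough version of the published definition, but it shows why relative power is reported as a sample-size ratio rather than a difference in rejection rates.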
Johnson and Hunt (1979) claimed that T was superior to the Tietjen-Moore, W and Dixon tests when there was one extreme value in the sample tested. T did show a loss of performance, especially compared to more general normality tests, when there was more than one outlier (Ferguson, 1961; Johnson and Hunt, 1979). In a comparison of a number of tests for normality and goodness of fit in the context of normal mixtures, Mendell, Finch and Thode (1993) showed that √b1 was the most powerful test when more than one outlier was present (and the outliers were all in the same direction).
References
Csorgo, M., Seshadri, V., and Yalovsky, M. (1973). Some exact tests for
normality in the presence of unknown parameters. Journal of the Royal
Statistical Society B 35, 507-522.
D'Agostino, R.B. (1971). An omnibus test of normality for moderate and
large size samples. Biometrika 58, 341-348.
D'Agostino, R.B., and Rosman, B. (1974). The power of Geary's test of
normality. Biometrika 61, 181-184.
Dixon, W. (1950). Analysis of extreme values. Annals of Mathematical
Statistics 21, 488-505.
Dixon, W. (1951). Ratios involving extreme values. Annals of Mathematical Statistics 22, 68-78.
Ferguson, T.S. (1961). On the rejection of outliers. Proceedings, Fourth
Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley, 253-287.
Filliben, J.J. (1975). The probability plot coefficient test for normality.
Technometrics 17, 111-117.
Franck, W.E. (1981). The most powerful invariant test of normal versus
Cauchy with applications to stable alternatives. Journal of the American
Statistical Association 76, 1002-1005.
Gan, F.F., and Koehler, K.J. (1990). Goodness-of-fit tests based on P-P
probability plots. Technometrics 32, 289-303.
Gastwirth, J.L., and Owens, M.E.B. (1977). On classical tests of normality.
Biometrika 64, 135-139.
Geary, R.C. (1947). Testing for normality. Biometrika 34, 209-242.
Green, J.R., and Hegazy, Y.A.S. (1976). Powerful modified-EDF goodness
of fit tests. Journal of the American Statistical Association 71, 204-209.
Grubbs, F. (1950). Sample criteria for testing outlying observations. Annals of Mathematical Statistics 21, 27-58.
Grubbs, F. (1969). Procedures for detecting outlying observations in samples. Technometrics 11, 1-19.
Hogg, R.V. (1972). More light on the kurtosis and related statistics. Journal of the American Statistical Association 67, 422-424.