
CHAPTER 7

POWER COMPARISONS FOR UNIVARIATE TESTS FOR NORMALITY

"Depending on the nature of the alternative distribution and on the
sample size, the various procedures show to better or worse advantage."
Shapiro, Wilk and Chen, 1968
The most frequent measure of the value of a test for normality is its power,
the ability to detect when a sample comes from a non-normal distribution.
All else being equal (which decidedly never happens), the test of choice is
the most powerful one. However, in addition to power, which depends on both
the alternative distribution and the sample size, the choice of test when assessing
normality can be based on a variety of other considerations, including ease of
computation and availability of critical values. Ideally, one would prefer
the most powerful test for all situations; in reality, no such test exists.

7.1 Power of Tests for Univariate Normality


Often, while the specific alternative is not known, some general characteristics of the data may be known in advance (e.g., skewness). If not, there
may still be only limited concern about certain types of departures from normality. For
example, regression residuals which are symmetric but have short tails are


Table 7.1 Univariate tests for normality discussed in this chapter.


Test Symbol     Test Name                               Reference Section

a               Geary's test                            Section 3.3.1
A2              Anderson-Darling test                   Section 5.1.4
√b1             skewness                                Section 3.2.1
b2              kurtosis                                Section 3.2.2
χ2              chi-squared test                        Section 5.2
D               D'Agostino's D                          Section 4.3.2
D*              Kolmogorov-Smirnov                      Section 5.1.1
Ek              Tietjen-Moore test for > 1 outlier      Section 6.2.6
EDF tests       EDF tests                               Section 5.1
F1, F2          LaBreque's tests                        Section 2.3.3
k2              P-P correlation test                    Section 2.3.2
Kmn             sample entropy test                     Section 4.4.1
K2              joint kurtosis/skewness test            Section 3.2.3
Lk              Grubbs' test for > 1 outlier            Section 6.2.4
MPLSI tests     MPLSI tests                             Section 4.2
r               probability plot correlation            Section 2.3.2
R               rectangular skewness/kurtosis test      Section 3.2.3
Ss              omnibus MPLSI test                      Section 4.2.6
T               Grubbs' outlier test                    Section 3.4.1, 6.2.1
T1n, T2n        Locke and Spurrier tests                Section 4.3.1
T2              Oja's test                              Section 4.3.3
T*              Locke and Spurrier test                 Section 4.3.1
u               range test                              Section 4.1.2, 6.2.3
U               Uthoff's test                           Section 3.4.2, 4.1.3
U2              Watson's test                           Section 5.1.3
V               Kuiper's V                              Section 5.1.2
W               Wilk-Shapiro test                       Section 2.3.1
W'              Shapiro-Francia test                    Section 2.3.2
W2              Cramer-von Mises test                   Section 5.1.3
z               Lin and Mudholkar's test                Section 4.4.2

usually not of interest, so only a test which has high power at detecting skewed
and long-tailed symmetric alternatives need be considered. Therefore,
it is important to be able to identify which tests are competitively powerful under certain specific situations, in case some information is known
concerning the alternative.
It is also important to know which tests have decent power under all
types of alternatives, for those instances where no a priori information is


available. It is also useful to have tests which can be used as substitutes
for each other. Therefore, we have compiled the results of many studies
into a summary of the power of tests for normality. We then make recommendations for testing based on different scenarios.

7.1.1 Background of Power Comparison Simulations


The advent of computers, along with the seminal papers on tests for normality by Shapiro and Wilk (1965) and Shapiro, Wilk and Chen (1968),
essentially set the standards for the development and presentation of new
tests for normality (as well as tests for other distributions). In general,
theoretical power calculations for specific tests are either difficult or intractable; in cases where power could be estimated, it was usually based
on asymptotic approaches (e.g., Geary, 1947). Thus, simulation became the
vehicle of convenience for estimating power and the comparison of tests.
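
A minimal sketch of this kind of Monte Carlo power estimate is given below. It is not taken from any of the studies cited in this chapter; the alternative distribution, sample size, nominal level and replication count are illustrative assumptions, and the Shapiro-Wilk test is used only because a routine for it is widely available.

```python
# Sketch: estimate the power of a test for normality by repeatedly drawing
# samples from a chosen alternative and counting rejections at level alpha.
import numpy as np
from scipy import stats

def simulated_power(sample_alternative, n, n_reps=10_000, alpha=0.05, seed=None):
    """Fraction of replications in which the Shapiro-Wilk test rejects normality."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_reps):
        x = sample_alternative(n, rng)
        _, p_value = stats.shapiro(x)
        if p_value < alpha:
            rejections += 1
    return rejections / n_reps

# Example: power of W against a skewed chi-squared(4) alternative at n = 20.
power = simulated_power(lambda n, rng: rng.chisquare(df=4, size=n), n=20, seed=1)
print(f"Estimated power of W: {power:.3f}")
```

Replacing the call to stats.shapiro with any other statistic and its critical value reproduces the basic design of the comparisons described in this chapter.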
At the time, there were relatively few tests for normality: besides W,
there were only four moment-type tests (√b1, b2, u, and a) and the more
general goodness of fit tests (e.g., χ2 and EDF tests). The choice of test
was actually more limited than that, since the χ2 and EDF tests were only
valid for simple hypotheses, which is not a common practical situation.
Also, W had only been developed for sample sizes up to 50.
Shortly after the introduction of W, Lilliefors (1967) presented some
distributional results for the Kolmogorov-Smirnov test, D*, for a composite
normal null hypothesis. In hindsight, this turned out to be of limited value,
since D* is now almost universally not recommended as a test for
normality because of its poor power properties.
In 1971, D'Agostino introduced his D statistic, for use as an omnibus
test in samples of over size 50. Shapiro and Francia (1972), Weisberg and
Bingham (1975) and Filliben (1975) suggested correlation tests similar in
construction to W which also overcame the sample size limitation of W.
Between the introduction of W in 1965 and the probability plot correlation test in 1975, there were essentially no other new tests for normality
introduced, making Filliben's (1975) simulation the last word in power
comparisons at the time, since he included all of the well-known tests (excluding EDF tests) for normality. The only exceptions seem to be those
tests developed by Uthoff (1968; 1973).
The use of EDF tests as tests for normality did not become popular until about that time, when Stephens (1974) not only developed null
distributions for composite EDF tests for the normal distribution, but also
identified relationships of critical values with functions of sample size, making these tests more widely available and applicable. A comparison of the


power of these tests with W showed that at least some of the EDF tests
were useful as tests for normality. This nearly doubled the number of tests
that could be used in power comparisons.
The complexity of power comparisons, which were to become almost
mandatory when presenting new tests, was also increased by the number of
alternatives used to compare tests in the earlier studies. Shapiro, Wilk and
Chen (1968) used 45 parameterizations of 12 different alternative distributions. Pearson, D'Agostino and Bowman (1977) presented power estimates
for 58 parameterizations of 12 different distributions. While both of these
studies only included a small number of useful tests (Pearson, D'Agostino
and Bowman included only four omnibus tests and four directional tests)
and did not include composite hypothesis EDF tests, they set a standard
which would be difficult to measure up to, given space limitations in journals. Not only would a power comparison use up a lot of space when it
included all tests (or at least all that had shown some useful characteristics), but very little additional information would be gained in ensuing
publications, which would also have to include all tests (plus one new one)
and the large range of alternatives.
These difficulties gave rise to the practice of comparing a new test
with a small subset of tests for normality and/or alternatives, during the
time when the development of tests for normality flourished, from 1975 to
the middle of the 1980's. For example, Locke and Spurrier (1976) only
compared their two tests (T1n and T2n) with √b1. Although this is an
extreme example of the limitations on power comparisons, there were very
few large scale comparisons which could be used to directly compare a large
number of tests for a broad range of alternatives.
In addition, there was no common standard for the design of the power
comparisons. Different studies used different sample sizes and α levels. Reliability of the estimated power differed between studies, because different
numbers of replications were used. Tests were sometimes used as two-tailed
and sometimes as one-tailed tests; sometimes it was not stated how many
tails were used. In some comparisons the estimated power of a new test was
based on a new simulation, while the power estimates for the comparison
tests were obtained from a previously published study.
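
The effect of the replication count on reliability can be quantified with the usual binomial standard error of an estimated rejection rate. The short sketch below is an illustration, not a calculation taken from any of the cited studies: 200 replications pin the power down only to within a few percentage points, while 10,000 replications give a standard error of roughly half a percentage point.

```python
# Illustrative only: binomial standard error of an estimated power p_hat
# based on n_reps independent Monte Carlo replications.
import math

def power_standard_error(p_hat: float, n_reps: int) -> float:
    return math.sqrt(p_hat * (1.0 - p_hat) / n_reps)

print(power_standard_error(0.5, 200))     # ~0.035 with 200 replications
print(power_standard_error(0.5, 10_000))  # ~0.005 with 10,000 replications
```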

7.1.2 Power Comparisons: Long-Tailed Symmetric Alternatives


Shapiro and Wilk (1965) compared W, √b1, b2, and u. They included the
χ2 and EDF tests in their comparison but assumed known parameters so
that the tests could be used with a simple hypothesis; therefore, they will
not be discussed here. Their simulation only included 200 samples of size


20. For the three long-tailed symmetric alternatives they used, W was seen
to be generally competitive with b2, although neither was decisively better.
In their more extensive simulation, Shapiro, Wilk and Chen (1968) used
the same tests but included ten long-tailed symmetric alternatives and five
sample sizes between 10 and 50. W and b2 were again competitive, with
b2 tending to be more powerful for the larger sample sizes.
D'Agostino (1971) compared D with the simulation results of Shapiro,
Wilk and Chen (1968). Using only samples of size 50, he determined that
D was competitive with W and b2 for long-tailed symmetric distributions.
D'Agostino and Rosman (1974) only compared Geary's test, W and D, but
used samples of size 20, 50 and 100; they found that both D and a worked
well for long-tailed symmetric alternatives. Hogg (1972) showed virtually
no difference between Uthoff's (1968) U, asymptotically equivalent to a,
and b2 for logistic and double exponential alternatives. Smith (1975), using the same tests as Hogg, only considered symmetric long-tailed stable
alternatives, and suggested that U be used, especially as tail heaviness
increases.
Csorgo, Seshadri and Yalovsky (1973) showed that a Kolmogorov-Smirnov test based on a sample characterization of normality yields results comparable to those of W for small samples. Stephens (1974) compared composite hypothesis EDF tests to W and D and showed that the
Anderson-Darling and Cramer-von Mises tests were comparable to both
tests, with A2 being slightly better than W and W2, and not quite as good
as D. The Kolmogorov-Smirnov test D* always performed worse than the
other tests. Green and Hegazy (1976) compared D, W and some modified
EDF tests, with D nearly always being most powerful for the Cauchy and
double exponential alternatives for samples between 5 and 80.
Filliben (1975) showed little difference between a, D, W, W', b2 and
r for samples of size 20; for samples of size 50, b2 and W did not seem
competitive, and a was marginally the best test. Gastwirth and Owens
(1977) indicated that a seemed better than b2 for long-tailed symmetric
alternatives. For samples of size 20, Spiegelhalter (1977) showed that the
MPLSI test for a double exponential alternative dominated W and b2 for
several long-tailed symmetric alternatives.
Of the tests introduced by LaBreque (1977), F1 was always most powerful when compared to W, A2, a and b2, sometimes by an appreciable
amount. His results were based only on samples of size 12 and 30.
Pearson, D'Agostino and Bowman (1977) showed that for omnibus
tests, the combined skewness and kurtosis test K2 was more powerful than
W or D; however, when used in a directional manner, D was better than
all of the omnibus tests, but there was no real difference between D and
a directional (upper-tailed) 62 test. White and MacDonald (1980) also


indicated that the power of D slightly exceeded that of both W and b2. Spiegelhalter (1980) showed a possible slight advantage of his omnibus test Ss
over W and b2 in some circumstances. Oja (1981) showed T2 and Locke
and Spurrier's (1977) T* tests to be essentially equivalent in power, and
better than W and b2 for samples of size 20. A modification of T2 (Oja,
1983) which is easier to calculate showed a slight loss of power relative to T2, but
still exceeded that of the other tests. The MPLSI test for a Cauchy alternative (Franck, 1981) had better performance than other tests for stable
alternatives, at least for samples of size 20; for samples of size 50 not much
difference was demonstrated.
For scale contaminated normal alternatives, which have population
kurtosis greater than 3, kurtosis and other absolute moment tests with
exponent greater than 3 had higher power, on average, than other tests,
including D, a, U and W (Thode, Smith and Finch, 1983). However, D,
a and u had nearly equivalent power to the absolute moment tests for the
heavier tailed mixtures.
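
For reference, a scale contaminated normal sample of the kind studied by Thode, Smith and Finch (1983) can be generated as in the sketch below; the mixing proportion and scale factor shown are illustrative choices, not the parameterizations of that study.

```python
# Sketch: each observation comes from N(0, k^2) with probability p and from
# N(0, 1) otherwise, giving a symmetric mixture with kurtosis greater than 3.
import numpy as np

def scale_contaminated_normal(n, p=0.10, k=3.0, seed=None):
    rng = np.random.default_rng(seed)
    contaminated = rng.random(n) < p          # which observations are inflated
    scale = np.where(contaminated, k, 1.0)    # per-observation standard deviation
    return rng.normal(loc=0.0, scale=scale, size=n)

x = scale_contaminated_normal(50, seed=1)
```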
Thode (1985) showed that a, D and U were the best tests for detecting
double exponential alternatives.
Looney and Gulledge (1984; 1985) showed essentially no difference in
power among correlation tests based on different plotting positions except,
notably, W', which had slightly lower power than the others. Gan and
Koehler (1990) also showed that the correlation test based on the plotting
position i/(n + 1) had slightly higher power than A2 and W. Tests based
on normalized spacings (Lockhart, O'Reilly and Stephens, 1986) were not
as good as either A2 or W.

7.1.3 Power Comparisons: Short-Tailed Symmetric Alternatives


For samples from size 10 to 50, Shapiro and Wilk (1965) and Shapiro, Wilk
and Chen (1968) showed that u usually dominated both b2 and W when
the alternative was short-tailed and symmetric. This may not be surprising
since u is the likelihood ratio test for a uniform alternative to normality.
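
For concreteness, the sketch below computes the range statistic u = (x(n) - x(1))/s. The use of the (n - 1)-divisor standard deviation in the denominator is an assumption of this sketch, and critical values must still come from published tables.

```python
# Sketch of the range test u = (max - min) / s; short-tailed samples tend
# toward small values of u, long-tailed samples toward large values.
import numpy as np

def range_test_u(x):
    x = np.asarray(x, dtype=float)
    s = x.std(ddof=1)                      # sample standard deviation (n - 1 divisor)
    return (x.max() - x.min()) / s

rng = np.random.default_rng(0)
print(range_test_u(rng.uniform(size=20)))  # a uniform sample: u tends to be small
```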
Using samples of size 50, D'Agostino (1971) indicated that D usually
had lower power than u, W and b2. D'Agostino and Rosman (1974) only
compared Geary's test, W and D, using samples of size 20, 50 and 100; they
found that a single-sided a worked best for short-tailed symmetric alternatives, compared to the other two tests. Hogg (1972) showed dominance of
u for a uniform alternative.
EDF tests based on characterizations of normality (Csorgo, Seshadri
and Yalovsky, 1973) were never as powerful as W for the short-tailed alternatives they used; however, they did not use any other tests for normality


in their comparison. Using only the uniform distribution as a short-tailed
alternative, Stephens (1974) showed that W dominated D and the EDF
tests. Green and Hegazy (1976) compared D, W and some modified EDF
tests which were more powerful than both D and W; however, again only
the uniform alternative was used, and u was not included.
Filliben (1975) showed that u was most powerful for all of the short-tailed alternatives he used in his simulation study, with b2 being a somewhat distant second. For samples of size 20, Spiegelhalter (1977) showed
that the MPLSI test for a uniform alternative dominated W and b2 for the
two short-tailed alternatives he used; however, this test is equivalent to u.
Of the tests compared by LaBreque (1977), u was always most powerful; however, his only short-tailed symmetric alternative was the uniform,
and his results were based only on samples of size 12 and 30.
Pearson, D'Agostino and Bowman (1977) showed that for omnibus
tests, W was somewhat better than K2 and R (their notation for the
rectangular bivariate joint skewness and kurtosis test, Section 3.2.3), while
a two-tailed D had poor power. As a directional test, b2 had appreciably
higher power than D; however, u was not considered in this comparison.
Spiegelhalter's (1980) omnibus test Ss was comparable to W and b2 for
the two alternatives (uniform and Tukey(0.7)) he included in his study for
samples of size 20 and 50. Oja (1981) showed T2 and Locke and Spurrier's
(1977) T* tests to be essentially equivalent in power, and better than W
for samples of size 20. A modification of T2 (Oja, 1983) showed a slight
increase in power over T2.
For selected sample sizes, Thode (1985) showed that for a uniform
alternative the best tests were u, the lower-tailed Grubbs' T, and absolute
moment tests with moment greater than 2 (including b2, the absolute fourth
moment test). No other short-tailed symmetric alternatives were examined.
Of all correlation tests based on different plotting positions, W had
the highest power (Looney and Gulledge, 1984; 1985). Tests based on
normalized spacings (Lockhart, O'Reilly and Stephens, 1986) were not as
good as either A2 or W. W had noticeably higher power than EDF tests,
including A2, and correlation tests based on P-P plots, such as k2 (Gan
and Koehler, 1990).
7.1.4 Power Comparisons: Asymmetric Alternatives
For samples of size 10 to 50, Shapiro and Wilk (1965) and Shapiro, Wilk
and Chen (1968) showed a possible advantage of W over other tests, including the commonly used √b1, against asymmetric alternatives. D'Agostino
(1971) showed this also for samples of size 50, with D having poor power
relative to W and √b1.
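
The sample skewness statistic referred to throughout this section is simple to compute; the sketch below uses the moment definition √b1 = m3 / m2^(3/2) with the 1/n moment estimators, which is the convention assumed here.

```python
# Sketch of the sample skewness sqrt(b1) = m3 / m2**1.5, where m_k is the
# k-th sample moment about the mean.
import numpy as np

def sqrt_b1(x):
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    m2 = np.mean(d**2)
    m3 = np.mean(d**3)
    return m3 / m2**1.5

rng = np.random.default_rng(0)
print(sqrt_b1(rng.lognormal(size=50)))   # strongly positive for a right-skewed sample
```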


Tests based on characterization of normality (Csorgo, Seshadri and
Yalovsky, 1973) had poor power for asymmetric alternatives, and here also
W was slightly better than √b1 for samples up to size 35. Stephens (1974)
showed that W had higher power than composite EDF tests, although
A2 was somewhat competitive; D fared poorly in his comparison. For
samples of size 90, Stephens used W' rather than W, and showed that
it was slightly more powerful than A2. For all sample sizes, D and the
Kolmogorov-Smirnov tests did poorly, with W2, Kuiper's V and U2 having
intermediate power.
Filliben (1975) showed that for samples of size 20 and 50, the best tests
were W, r and √b1, in that order. Locke and Spurrier (1976) compared
T1n, T2n and √b1 for asymmetric alternatives, breaking them down into
distributions with both tails light (e.g., beta distributions), both tails heavy
(e.g., Johnson U), and one light tail and one heavy tail (e.g., gamma). For
both tails heavy, √b1 was best, while for other alternatives T1n was best,
particularly for those alternatives with both tails light.
For samples of size 12 and 30, LaBreque's (1977) F2 was better than
W when √β1 was 2 or less, and W was slightly better otherwise. Slightly
less powerful than these two tests, and essentially equivalent to each other,
were A2, √b1 and LaBreque's F1.
Of the omnibus tests compared by Pearson, D'Agostino and Bowman
(1977), W was by far the most powerful; it was also somewhat competitive
with the single-tailed √b1 and right angle tests, which were about equal
in power. Against stable alternatives, Saniga and Miles (1979) showed
that √b1 was more powerful than W for samples of sizes between 10 and
100; they also included b2, D, u and a joint skewness/kurtosis test for
comparison. Using a χ2 and lognormal distribution as alternatives, White
and MacDonald (1980) showed that W and W' were equivalent in power for
samples from 20 to 50, and were more powerful than √b1. For samples of
size 100, W' was the most powerful test. For samples of size 20 and 30, Lin
and Mudholkar (1980) showed the highest power for Vasicek's (1976) Kmn
for beta distributions, while for other alternatives either W or z had the
highest power. EDF tests (Kolmogorov-Smirnov and Cramer-von Mises)
and √b1 were also compared in this simulation. Spiegelhalter's (1980)
omnibus test was better than W for samples of size 20, while W had higher
power than Ss for samples of size 50; √b1 was less powerful than both tests
for both sample sizes. Oja (1981) showed T1 and Locke and Spurrier's
(1977) T1n tests to be essentially equivalent in power, and better than W
and √b1 for samples of size 20. A modification of T1 (Oja, 1983) which is
easier to calculate showed power equivalent to T1.
In their comparison of correlation tests and W, Looney and Gulledge
(1984; 1985) showed a slight advantage in power for W, although all correlation tests were about equivalent in power. For samples of size 20 and 40,
the A2 test based on normalized spacings showed a slight but consistent
advantage over W (Lockhart, O'Reilly and Stephens, 1986). W and A2
outperformed the P-P plot correlation test k2 (Gan and Koehler, 1990).

7.1.5 Recommendations for Tests of Univariate Normality


On the basis of power, the choice of test is directly related to the information available or assumptions made concerning the alternative. The more
specific the alternative, the more specific and more powerful the test will
usually be; this will also result in the most reliable recommendation. The choice of a test
should also be based on ease and practicality of computation, and the necessary
tables (coefficients, if applicable, and critical values) should be available.
All recommendations were based on the assumption that the parameters
of the distribution are unknown.
Regardless of the degree of knowledge concerning the distribution, it
should be common practice to graphically inspect the data. Therefore,
our first recommendation is to always inspect at least a histogram and a
probability plot of the data.
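
A minimal sketch of this graphical inspection, using simulated data purely for illustration, might look as follows:

```python
# Sketch: histogram and normal probability (Q-Q) plot of a data set.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = rng.chisquare(df=4, size=100)            # hypothetical data set

fig, (ax_hist, ax_qq) = plt.subplots(1, 2, figsize=(10, 4))
ax_hist.hist(x, bins=15, edgecolor="black")
ax_hist.set_title("Histogram")
stats.probplot(x, dist="norm", plot=ax_qq)   # normal probability plot
ax_qq.set_title("Normal probability plot")
plt.tight_layout()
plt.show()
```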
If the alternative is completely specified up to the parameters, then an
optimal test for that distribution should be used, i.e., either a likelihood
ratio test or a MPLSI test (Chapter 4), assuming such a test exists. For
example, for an exponential distribution Grubbs' statistic is the MPLSI
test. If no alternative-specific test can be used, then a related test may
be available, e.g., Uthoff's U is the MPLSI test for a double exponential
alternative, but since critical values are not readily available a single-sided
a could be used in its place. The next choice would be a directional test for
the class of the alternative, e.g., a one-tailed √b1 for a gamma alternative.
If the shape and the direction of the shape (e.g., skewed to the left;
symmetric with long tails) are assumed known in the event the alternative
hypothesis is true, but a specific alternative is not, then usually a one-tailed test of the appropriate type will be more powerful than omnibus or
bidirectional tests. Grubbs' statistic, W or a one-tailed √b1 will usually be
among the most powerful of all tests for detecting a skewed distribution
in a known direction. These are also the tests of choice (using the appropriate choice of critical values for Grubbs' statistic and √b1) for skewed
alternatives in which the direction of skewness is not prespecified.
Uthoff's U is the MPLSI test for a double exponential alternative,
and Geary's test is asymptotically equivalent to U; therefore, these tests
would be likely candidates for detecting long-tailed symmetric alternatives.
D'Agostino's D is based on a kernel which indicates that it might also be


appropriate for long-tailed symmetric alternatives. Many of the power comparisons described above have shown that these tests should be used, in a
one-tailed manner, under these circumstances. LaBreque's F1 needs to be
investigated further for this class of alternatives. For short-tailed symmetric distributions, theoretical and simulation results indicate that the best tests
are u, a one-tailed b2 or Grubbs' statistic.
If there is no prior knowledge about the possible alternatives, then an
omnibus test would be most appropriate. A joint skewness and kurtosis
test such as K2 provides high power against a wide range of alternatives,
as does the Anderson-Darling A2. The Wilk-Shapiro W showed relatively
high power among skewed and short-tailed symmetric alternatives when
compared to other tests, and respectable power for long-tailed symmetric
alternatives.
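
As an illustration of an omnibus strategy (the data and the choice of software routines are assumptions of this sketch, not part of the recommendations above), the three omnibus statistics just mentioned can be computed side by side:

```python
# Sketch: D'Agostino-Pearson joint skewness/kurtosis test (scipy's normaltest),
# the Wilk-Shapiro W test, and the Anderson-Darling A2 test on the same sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.standard_t(df=5, size=50)        # hypothetical long-tailed sample

k2_stat, k2_p = stats.normaltest(x)      # joint skewness/kurtosis (K2)
w_stat, w_p = stats.shapiro(x)           # Wilk-Shapiro W
ad = stats.anderson(x, dist="norm")      # Anderson-Darling A2 with critical values

print(f"K2 = {k2_stat:.3f} (p = {k2_p:.3f})")
print(f"W  = {w_stat:.3f} (p = {w_p:.3f})")
print(f"A2 = {ad.statistic:.3f}, 5% critical value = {ad.critical_values[2]:.3f}")
```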
Tests to be avoided for evaluation of normality include the χ2 test
and the Kolmogorov-Smirnov test D*. The χ2 test, however, has often
been shown to have among the highest power of all tests for a lognormal
alternative. Half-sample methods (Stephens, 1978) and spacing tests also
have poor power for testing for normality.

7.2 Power of Outlier Tests

Relative to tests for normality, there have been few power comparisons of
tests for outliers. One reason may be that there are relatively few outlier
tests, and each outlier test has a specific function. For example,
Grubbs' (1950, 1969) and Dixon's (1950, 1951) outlier tests are used to
detect a single outlier, whereas Lk is a test for k > 1 outliers; therefore, a
comparison between Tn and L3, say, would be meaningless. Similarly, Lk
and sequential procedures would not be comparable since for the former
the number of outliers is prespecified while for the latter the number of
outliers is tested for sequentially. Some of the tests that are usually not
thought of as outlier tests (√b1, b2, W, tests for normal mixtures) are not
often compared to tests labeled as outlier tests.

7.2.1 Power Comparisons of Tests

Ferguson (1961) compared the power of b2 and √b1 with Grubbs' and
Dixon's outlier tests. He used normal random samples and added a fixed
constant to one observation in each sample. He found virtually no difference
in power between Grubbs' outlier test and √b1, with Dixon's test being only
slightly less powerful. Similarly, he computed b2, T and Dixon's r for the


same samples in order to determine the power when two-tailed tests were
required. Here he found that b2 and T were virtually identical and again
Dixon's test was only slightly lower in power.
In a second experiment Ferguson added a positive constant to two observations in the samples to determine the power of T and b2, in particular
because of the possibility of the masking effect on T. Kurtosis did significantly better than Grubbs' outlier test when there was more than one
outlier in the sample; Dixon's test was not included in this experiment.
Thode, Smith and Finch (1983) showed T to be one of the most powerful tests at detecting scale contaminated normal distributions, a commonly
used model for generating samples with outliers. Samples were generated
differently than those of Ferguson: contaminating observations were generated randomly so that none, one or more than one contaminating observation may have existed in each sample. T was shown to have 92% relative
power to kurtosis over all parameterizations studied, where relative power
was defined as the ratio of sample sizes needed for each test (b2 to T) in
order to obtain the same power. T significantly outperformed √b1, the
Dixon test and W, which had only 65% relative power to b2.
Whereas in the above three studies the measure of performance of
a test was simply how often the null hypothesis was rejected, Tietjen and
Moore (1972) and Johnson and Hunt (1979) examined other characteristics
of outlier tests. Specifically, they were interested in the performance of the
tests in detecting which and how many observations were identified as
outliers using Ek or (sequentially) other tests for outliers.
Johnson and Hunt (1979) claimed that T was superior to the Tietjen-Moore, W and Dixon tests when there was one extreme value in the sample
tested. T did show loss of performance, especially compared to more general normality tests, when there was more than one outlier (Ferguson, 1961;
Johnson and Hunt, 1979). In a comparison of a number of tests for normality and goodness of fit in the context of normal mixtures, Mendell, Finch
and Thode (1993) showed that √b1 was the most powerful test when more
than one outlier was present (and they were all in the same direction).
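
The two ways of generating outlier samples contrasted above translate directly into simulation code. The sketch below, with illustrative shift, contamination probability and scale values, shows Ferguson's slippage scheme next to the random contamination scheme of Thode, Smith and Finch.

```python
# Sketch of two outlier-generation models for power experiments.
import numpy as np

def slippage_sample(n, shift=4.0, seed=None):
    """Ferguson-style: a normal sample with exactly one observation shifted."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    x[0] += shift                     # exactly one shifted observation
    return x

def random_contamination_sample(n, p=0.05, scale=4.0, seed=None):
    """Mixture-style: each observation is contaminated independently with probability p."""
    rng = np.random.default_rng(seed)
    wide = rng.random(n) < p          # may contaminate zero, one, or several points
    return rng.normal(scale=np.where(wide, scale, 1.0), size=n)
```

Under the second scheme the realized number of contaminating observations varies from sample to sample, which is the distinction emphasized in the text.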

7.2.2 Recommendations for Outlier Tests


When there is the possibility of a single outlier in a sample, T has consistently been shown to be better than other procedures. For more than
one outlier, √b1 and b2 should be used when there are outliers in one or
in both directions, respectively. However, if identification of the number of
outliers is of concern, then a sequential procedure should be used.
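
For completeness, the single-outlier statistic recommended above is easy to compute; the sketch below uses the two-sided form T = max|x_i - xbar|/s, with the (n - 1)-divisor standard deviation as an assumption of the sketch, and leaves the critical values to published tables.

```python
# Sketch of the two-sided single-outlier statistic T = max|x_i - xbar| / s.
import numpy as np

def grubbs_T(x):
    x = np.asarray(x, dtype=float)
    return np.abs(x - x.mean()).max() / x.std(ddof=1)

rng = np.random.default_rng(0)
x = np.append(rng.standard_normal(19), 5.0)   # one planted extreme value
print(grubbs_T(x))
```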


References
Csorgo, M., Seshadri, V., and Yalovsky, M. (1973). Some exact tests for
normality in the presence of unknown parameters. Journal of the Royal
Statistical Society B 35, 507-522.
D'Agostino, R.B. (1971). An omnibus test of normality for moderate and
large size samples. Biometrika 58, 341-348.
D'Agostino, R.B., and Rosman, B. (1974). The power of Geary's test of
normality. Biometrika 61, 181-184.
Dixon, W. (1950). Analysis of extreme values. Annals of Mathematical
Statistics 21, 488-505.
Dixon, W. (1951). Ratios involving extreme values. Annals of Mathematical Statistics 22, 68-78.
Ferguson, T.S. (1961). On the rejection of outliers. Proceedings, Fourth
Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley, 253-287.
Filliben, J.J. (1975). The probability plot correlation coefficient test for normality.
Technometrics 17, 111-117.
Franck, W.E. (1981). The most powerful invariant test of normal versus
Cauchy with applications to stable alternatives. Journal of the American
Statistical Association 76, 1002-1005.
Gan, F.F., and Koehler, K.J. (1990). Goodness-of-fit tests based on P-P
probability plots. Technometrics 32, 289-303.
Gastwirth, J.L., and Owens, M.E.B. (1977). On classical tests of normality.
Biometrika 64, 135-139.
Geary, R.C. (1947). Testing for normality. Biometrika 34, 209-242.
Green, J.R., and Hegazy, Y.A.S. (1976). Powerful modified-EDF goodness
of fit tests. Journal of the American Statistical Association 71, 204-209.
Grubbs, F. (1950). Sample criteria for testing outlying observations. Annals of Mathematical Statistics 21, 27-58.
Grubbs, F. (1969). Procedures for detecting outlying observations in samples. Technometrics 11, 1-19.
Hogg, R.V. (1972). More light on the kurtosis and related statistics. Journal of the American Statistical Association 67, 422-424.


Johnson, B.A., and Hunt, H.H. (1979). Performance characteristics for certain tests to detect outliers. Proceedings of the Statistical Computing Section, Annual Meeting of the American Statistical Association, Washington, D.C.
LaBreque, J. (1977). Goodness-of-fit tests based on nonlinearity in probability plots. Technometrics 19, 293-306.
Lilliefors, H.W. (1967). On the Kolmogorov-Smirnov test for normality
with mean and variance unknown. Journal of the American Statistical
Association 62, 399-402.
Lin, C.-C., and Mudholkar, G.S. (1980). A simple test for normality against
asymmetric alternatives. Biometrika 67, 455-461.
Locke, C., and Spurrier, J.D. (1976). The use of U-statistics for testing
normality against non-symmetric alternatives. Biometrika 63, 143-147.
Locke, C., and Spurrier, J.D. (1977). The use of U-statistics for testing normality against alternatives with both tails heavy or both tails light.
Biometrika 64, 638-640.
Lockhart, R.A., O'Reilly, F.J., and Stephens, M.A. (1986). Tests of fit
based on normalized spacings. Journal of the Royal Statistical Society
B 48, 344-352.
Looney, S.W., and Gulledge, Jr., T.R. (1984). Regression tests of fit and
probability plotting positions. Journal of Statistical Computation and
Simulation 20, 115-127.
Looney, S.W., and Gulledge, Jr., T.R. (1985). Probability plotting positions and goodness of fit for the normal distribution. The Statistician
34, 297-303.
Mendell, N.R., Finch, S.J., and Thode, Jr., H.C. (1993). Where is the likelihood ratio test powerful for detecting two component normal mixtures?
Biometrics 49, 907-915.
Oja, H. (1981). Two location and scale free goodness of fit tests. Biometrika
68, 637-640.
Oja, H. (1983). New tests for normality. Biometrika 70, 297-299.
Pearson, E.S., D'Agostino, R.B., and Bowman, K.O. (1977). Tests for
departure from normality: comparison of powers. Biometrika 64, 231-246.
Saniga, E.M., and Miles, J.A. (1979). Power of some standard goodness of
fit tests of normality against asymmetric stable alternatives. Journal of
the American Statistical Association 74, 861-865.


Shapiro, S.S., and Francia, R.S. (1972). Approximate analysis of variance test for normality. Journal of the American Statistical Association 67, 215-216.
Shapiro, S.S., and Wilk, M.B. (1965). An analysis of variance test for
normality (complete samples). Biometrika 52, 591-611.
Shapiro, S.S., Wilk, M.B., and Chen, H.J. (1968). A comparative study of
various tests for normality. Journal of the American Statistical Association 63, 1343-1372.
Smith, V.K. (1975). A simulation analysis of the power of several tests for
detecting heavy-tailed distributions. Journal of the American Statistical
Association 70, 662-665.
Spiegelhalter, D.J. (1977). A test for normality against symmetric alternatives. Biometrika 64, 415-418.
Spiegelhalter, D.J. (1980). An omnibus test for normality for small samples. Biometrika 67, 493-496.
Stephens, M.A. (1974). EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association 69, 730-737.
Stephens, M.A. (1978). On the half-sample method for goodness of fit.
Journal of the Royal Statistical Society B 40, 64-70.
Thode, Jr., H.C. (1985). Power of absolute moment tests against symmetric
non-normal alternatives. Ph.D. dissertation, University Microfilms, Ann
Arbor, MI.
Thode, Jr., H.C., Smith, L.A., and Finch, S.J. (1983). Power of tests of
normality for detecting scale contaminated normal samples. Communications in Statistics - Simulation and Computation 12, 675-695.
Tietjen, G.L., and Moore, R.H. (1972). Some Grubbs-type statistics for
the detection of outliers. Technometrics 14, 583-597.
Uthoff, V.A. (1968). Some scale and origin invariant tests for distributional
assumptions. Ph.D. thesis, University Microfilms, Ann Arbor, MI.
Uthoff, V.A. (1973). The most powerful scale and location invariant test
of the normal versus the double exponential. Annals of Statistics 1,
170-174.
Vasicek, O. (1976). A test for normality based on sample entropy. Journal
of the Royal Statistical Society B 38, 54-59.


Weisberg, S., and Bingham, C. (1975). An approximate analysis of variance test for non-normality suitable for machine calculation. Technometrics 17, 133-134.
White, H., and MacDonald, G.M. (1980). Some large-sample tests for
nonnormality in the linear regression model. Journal of the American
Statistical Association 75, 16-31.

