Copyright
The Society for Psychophysiological Research, Inc.
The Continuing Problem of False Positives in
Repeated Measures ANOVA in Psychophysiology:
A Multivariate Solution
MICHAEL W. VASEY AND JULIAN F. THAYER
Department of Psychology, The Pennsylvania State University, University Park, Pennsylvania
ABSTRACT
Violation of the validity assumptions of repeated measures analysis of variance continues to be a
problem in psychophysiology. Such violation results in positive bias for those tests involving the
repeated measures factor(s). Recently it has been shown that the tests of simple interactions and
multiple comparisons are even more vulnerable to bias (Boik, 1981; Mitzel & Games, 1981). The
present paper offers a discussion of the validity assumptions for both overall and sub-effect tests
and describes a multivariate approach which allows exact analysis of such designs. A modification
of the univariate approach is also described. Validity concerns for both approaches are much less
problematic than those of the traditional approach.
DESCRIPTORS: Repeated measures designs, Sphericity assumption, Multivariate analysis of
variance, False positives.
Repeated measures designs are among the most
common experimental designs in psychophysiolog-
ical research. These designs have traditionally been
analyzed through univariate analysis of variance
(ANOVA). In general, ANOVA assumes that the
data are normally distributed with homogeneous
variance among groups. However, due to the cor-
related nature of repeated measures data, ANOVA
applied to such designs carries an additional re-
quirement known as sphericity or circularity. The
condition of sphericity exists if and only if the var-
iance of all pairwise differences between repeated
measurements is constant. This is a complex as-
sumption and will be described fully below. In the
past, several papers have appeared in Psychophysiology
which have cautioned that this assumption
is quite unrealistic when applied to psychophysiological
data and is frequently violated (Wilson, 1967,
1974; Jennings & Wood, 1976; Keselman & Rogan,
1980). The consequence of such violation is positively
biased or liberal tests, meaning that the likelihood
of a Type I (false positive) error exceeds the
probability, or alpha (α) level, set by the user. Keselman
and Rogan (1980) pointed out that for the
.05 level of significance the bias can reach 2α, while
for the .01 level it can reach 6α. It is clear that such
a bias could result in a high incidence of nonreplicable
results published in our journals (Games,
1976).
The authors wish to thank Robert Stern, Ralph O'Brien,
and Robert Kennedy for their helpful comments.
A shorter version of this paper was presented at the
thirteenth meeting of the Psychophysiology Society, London,
December, 1985.
Address requests for reprints to: Julian F. Thayer, Department
of Psychology, The Pennsylvania State University,
University Park, Pennsylvania 16802.
A variety of procedures, generally involving reduction
of the degrees of freedom through multiplication
by some value ε (epsilon), have been advocated
to guard against such bias. These procedures,
such as the three-step approach suggested by
Greenhouse and Geisser (1959), will be described
below. Unfortunately these safeguards have not
typically been used by psychophysiologists. Jen-
nings and Wood (1976) reported that 84% of the
studies having repeated measures designs which ap-
peared in volume 12 of Psychophysiology (1975)
apparently ignored the possibility of bias. There-
after Games (1976) offered to provide FORTRAN
IV programs implementing such a procedure. Fi-
nally, Keselman and Rogan (1980), noting a con-
tinued neglect of this issue, reminded researchers
of the need for such safeguards.
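The ε adjustment mentioned above can be illustrated numerically. The following is a minimal sketch, not the three-step procedure itself: it assumes Box's (1954) formula for ε, computed from the covariance matrix of a set of orthonormal contrasts, and it multiplies both degrees of freedom of a one-way repeated measures F test by ε. The function names and the example covariance matrix are hypothetical and chosen only for illustration.

```python
import numpy as np

def orthonormal_contrasts(k):
    """Return a (k-1) x k matrix of orthonormal (Helmert-type) contrasts."""
    C = np.zeros((k - 1, k))
    for i in range(k - 1):
        C[i, : i + 1] = 1.0            # equal weights on the first i+1 levels
        C[i, i + 1] = -(i + 1.0)       # balancing weight so the row sums to zero
        C[i] /= np.sqrt(C[i] @ C[i])   # scale the row to unit length
    return C

def box_epsilon(sigma):
    """Box's epsilon: 1.0 under sphericity, never smaller than 1/(k-1)."""
    sigma = np.asarray(sigma, dtype=float)
    k = sigma.shape[0]
    C = orthonormal_contrasts(k)
    M = C @ sigma @ C.T                # covariance matrix of the contrasts
    return np.trace(M) ** 2 / ((k - 1) * np.trace(M @ M))

def adjusted_df(k, n, epsilon):
    """Epsilon-adjusted df for a one-way design with k levels and n subjects."""
    return epsilon * (k - 1), epsilon * (k - 1) * (n - 1)

# Hypothetical covariance matrix in which adjacent measurements
# correlate more highly than distant ones (illustrative values only).
sigma_example = np.array([
    [1.0, 0.8, 0.6, 0.4],
    [0.8, 1.0, 0.8, 0.6],
    [0.6, 0.8, 1.0, 0.8],
    [0.4, 0.6, 0.8, 1.0],
])
eps = box_epsilon(sigma_example)
df1, df2 = adjusted_df(k=4, n=12, epsilon=eps)
print(f"epsilon = {eps:.3f}; adjusted df = ({df1:.2f}, {df2:.2f})")
```

When ε = 1 the degrees of freedom are unchanged; as ε falls toward its lower bound of 1/(k−1), the F test is referred to a distribution with fewer degrees of freedom and therefore a larger critical value. In practice ε is estimated by substituting the sample covariance matrix for `sigma`.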
Given such a thorough treatment of this problem,
one might expect that psychophysiologists
would now guard against such test level bias as a
matter of course. Regrettably, a review of volumes
21 and 22 (1984 and 1985) revealed that more than
50% of such studies remain unprotected against this
problem. It should be noted that only designs hav-
ing three or more levels of a repeated measures
factor were considered since the assumption of
sphericity is always fulfilled for two levels. The above
is not to suggest that all of these unprotected studies
have suffered increased incidence of false positive
results. Certainly the bias for overall effects is typ-
ically small and many of these studies would re-
main unchanged if appropriate adjustments were
applied. However, the potential for bias does exist
and since appropriate safeguards are now com-
monly available, we argue that they should be rou-
tinely applied. Currently that is not the case. As we
will describe below, the problem is especially great
for sub-effect tests.
The above finding demonstrates a need for fur-
ther discussion of this topic. However, a more se-
rious problem exists which has not previously been
discussed in the psychophysiological literature. The
traditional safeguards, like the Greenhouse and
Geisser (1959) three-step approach and ε adjustments
in general, protect only the F tests for main
effects of, or interactions with, repeated measures
factor(s). It has recently become clear that the specific
comparisons which typically follow and clarify
significant overall tests are even more vulnerable
to inflated Type I error rates, which may reach ten
to fifteen times the nominal alpha under nonsphericity
(Boik, 1981; Mitzel & Games, 1981; Harris,
1985). Given such bias it seems clear that many of
the sub-effect tests reported in the literature cannot
help but be erroneous. Unfortunately, discussions
of this issue have previously been confined to the
statistical literature, and few psychophysiologists
have taken this problem into account. In our review
of volumes 21 and 22 of Psychophysiology, less than
5% of the studies appeared to consider this problem.
For these reasons the present paper offers further
discussion of the validity assumptions for both
overall and sub-effect tests as well as a description
of two approaches to analysis of such designs for
which validity concerns are much less problematic.
These designs are perhaps best conceptualized as
multivariate in nature. Unless sphericity is assured,
they can frequently be analyzed most simply and
validly via multivariate analysis of variance (MANOVA)
(O'Brien & Kaiser, 1985). Such an approach
does not make the assumption of sphericity and
therefore yields bias free tests even when that condition
is violated (Harris, 1985, section 3.8). However,
a modification of the ε adjustment approach
can also be used effectively and this too shall be
discussed. Both approaches have a place in the analysis
of repeated measures designs and sample size
is the primary consideration when choosing between
them.
Vasey and Thayer
Vol. 24, No. 4
Review of Validity Assumptions
All applications of ANOVA require that the data
be normally distributed with homogeneous vari-
ance among groups. ANOVA is typically robust to
violations of these assumptions. However, repeated
measures designs introduce intercorrelations among
the means on which comparisons are based. These
intercorrelations allow use of a pooled or average
estimate of error variance and therefore greater
power than between group designs. Unfortunately,
under this condition, the p values yielded by uni-
variate F tests are accurate only if highly restrictive
assumptions concerning the nature of these intercorrelations
are met. Since a detailed theoretical
discussion of these assumptions is beyond the scope
of this article, the reader is referred to several thor-
ough reviews in the statistical literature (see Huynh
& Mandeville (1979) or Rogan, Keselman, & Men-
doza (1979) for a discussion of these assumptions
for both simple and complex designs). In general,
the p values of the F tests are accurate only when
the variance-covariance matrix $\Sigma$ can be said to be
spherical or circular. This is true if and only if the
variance of all the contrasts between repeated measurements
which compose the overall comparison
of interest (e.g. the within subject main effect) is
constant. For the mathematically minded, this is
the case if the covariance matrix $\Sigma$ satisfies the
equation $E = C \Sigma C' = \sigma^2 I$, where $E$ is the error
matrix, $C$ is a $(k-1) \times k$ orthonormal contrast
matrix, $I$ is the identity matrix of rank $(k-1)$, and
$k$ is the number of repeated measurements. The
scalar, $\sigma^2$, represents the common experimental error
of the contrasts. The matrix $C$ can be any set
of $(k-1)$ orthogonal contrasts which define the
comparison of interest (e.g. the within subject main
effect). These contrasts are normalized by dividing
each weight by $\sqrt{c'c}$, the square root of the sum
of the squared weights of a given contrast. In reality,
this contrast matrix may be smaller than $(k-1)$ by
$k$ since one may not be interested in all $(k-1)$ contrasts.
Quite often the best tests are multiple degree
of freedom sub-effects or simple effects. A good discussion
for psychophysiologists of such contrast
matrices and sphericity can be found in Keselman
and Rogan (1980). More simply, sphericity exists
if and only if the contrasts represented by $C$ have
equal variance and zero covariance.
July, 1987
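The sphericity condition described above, that the orthonormalized contrasts have equal variance and zero covariance, can be checked directly for any given covariance matrix. The sketch below is illustrative only: the linear and quadratic trend contrasts, the two 3 × 3 covariance matrices, and all function names are assumptions introduced here rather than material from the sources cited. Each contrast is normalized by the square root of the sum of its squared weights, and the product of the contrast matrix, the covariance matrix, and the transposed contrast matrix is compared with a scalar multiple of the identity matrix.

```python
import numpy as np

def normalize_contrasts(raw):
    """Divide each contrast (row) by sqrt(c'c), the root of its summed squared weights."""
    raw = np.asarray(raw, dtype=float)
    norms = np.sqrt((raw ** 2).sum(axis=1, keepdims=True))
    return raw / norms

def contrast_covariance(sigma, C):
    """Covariance matrix of the contrasts: C Sigma C'."""
    return C @ np.asarray(sigma, dtype=float) @ C.T

def is_spherical(sigma, C, tol=1e-8):
    """True if C Sigma C' equals (common variance) x identity within tolerance."""
    M = contrast_covariance(sigma, C)
    common = np.trace(M) / M.shape[0]      # candidate common variance
    return bool(np.allclose(M, common * np.eye(M.shape[0]), atol=tol))

# Orthogonal trend contrasts for k = 3 (linear, quadratic), then normalized.
C = normalize_contrasts([[-1, 0, 1],
                         [1, -2, 1]])

# Hypothetical covariance matrices (illustrative values only).
equal_cov = np.array([[2.0, 1.0, 1.0],
                      [1.0, 2.0, 1.0],
                      [1.0, 1.0, 2.0]])    # equal variances and covariances
serial_cov = np.array([[1.0, 0.8, 0.4],
                       [0.8, 1.0, 0.8],
                       [0.4, 0.8, 1.0]])   # adjacent occasions covary more

print(is_spherical(equal_cov, C))
print(is_spherical(serial_cov, C))
print(np.round(contrast_covariance(serial_cov, C), 3))
```

For the second matrix the two contrasts have clearly different variances, so no single pooled error term represents both of them.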
Another way to think of sphericity is based on
generalization of the dependent t statistic. In order
to get an estimate of the error term in the simple
one-factor repeated measures case, one can generalize
from the variance estimate of the difference
between two dependent means that is used for the
t test: $(S_1^2 + S_2^2 - 2S_{12})$. The F for a one-way repeated
measures design involves at least three such
comparisons, and a natural way to derive the error
estimate for the F is to pool the error terms for each
of the pairwise comparisons. In this way the overall
or pooled error term, $MS_e$ = (mean of $S_j^2$) - (mean
of $S_{jj'}$). However, in order to do this pooling, it
must be assumed that all possible values of $(S_j^2 +
S_{j'}^2 - 2S_{jj'})$ are estimating the same quantity and
are therefore approximately equal. This is the assumption
of sphericity.
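The pooling argument above can be verified numerically. The following sketch, which uses a small hypothetical data set (five subjects by three occasions) and illustrative function names, computes the error mean square of a one-way repeated measures ANOVA in the usual way from the subject by occasion residuals, and then again as the mean variance minus the mean covariance. It also lists the individual pairwise difference variances that are being pooled.

```python
import numpy as np

def ms_error_anova(X):
    """Error mean square from the subject x occasion residuals."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    resid = (X
             - X.mean(axis=1, keepdims=True)   # remove subject means
             - X.mean(axis=0, keepdims=True)   # remove occasion means
             + X.mean())                       # add back the grand mean
    return (resid ** 2).sum() / ((n - 1) * (k - 1))

def ms_error_pooled(X):
    """Error mean square as (mean variance) - (mean covariance)."""
    S = np.cov(np.asarray(X, dtype=float), rowvar=False)  # columns are occasions
    k = S.shape[0]
    mean_var = np.trace(S) / k
    mean_cov = (S.sum() - np.trace(S)) / (k * (k - 1))
    return mean_var - mean_cov

def pairwise_difference_variances(X):
    """Variance of each pairwise difference: S_j^2 + S_j'^2 - 2 S_jj'."""
    S = np.cov(np.asarray(X, dtype=float), rowvar=False)
    k = S.shape[0]
    return {(i, j): S[i, i] + S[j, j] - 2 * S[i, j]
            for i in range(k) for j in range(i + 1, k)}

# Hypothetical scores: rows are subjects, columns are repeated measurements.
X = np.array([[10, 12, 15],
              [ 8, 11, 11],
              [13, 13, 18],
              [ 9, 10, 14],
              [11, 15, 16]])

print(ms_error_anova(X), ms_error_pooled(X))
print(pairwise_difference_variances(X))
```

The two error mean squares agree, and each equals half the average of the pairwise difference variances. When those variances differ substantially across pairs, the pooled term is too large for some comparisons and too small for others.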
It should be noted that designs with several re-
peated measures factors can have more than one
assumption of sphericity. The number of sphericity
assumptions is $2^T - 1$, where T is the number of
repeated measures factors included in the test. Thus
two repeated measures factors result in three spher-
icity assumptions, one for each main effect and one
for their interaction. In cases where one or more
between group factors are included, we must ad-
ditionally assume that the variance-covariance
matrices for the set of contrasts are identical for all
levels of these factors. In other words homogeneity
of variance among groups is assumed just as it is
in all between groups designs.
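The count of sphericity assumptions follows from listing every main effect and interaction that can be formed from the repeated measures factors, that is, every non-empty subset of those factors. A minimal sketch, with hypothetical factor names and a hypothetical function name:

```python
from itertools import combinations

def sphericity_assumptions(factors):
    """List each within-subject effect (main effects and interactions).

    One sphericity assumption attaches to each effect, giving 2**T - 1
    assumptions for T repeated measures factors.
    """
    effects = []
    for size in range(1, len(factors) + 1):
        for combo in combinations(factors, size):
            effects.append(" x ".join(combo))
    return effects

# Hypothetical factor names, for illustration only.
two_factor = sphericity_assumptions(["Trials", "Condition"])
three_factor = sphericity_assumptions(["Trials", "Condition", "Session"])
print(len(two_factor), two_factor)
print(len(three_factor))
```

With two factors the list contains the two main effects and their interaction, matching the three assumptions described above.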
One form of sphericity that is often discussed is
called compound symmetry. This condition occurs
when all variances of the repeated measurements
are equal and all pairwise correlations between the
repeated measurements are equal. This condition
is sufficient but not necessary for validity and there-
fore may not always be fulfilled under sphericity.
Though it is not necessary, the absence of com-
pound symmetry does indicate that sphericity is
unlikely (O'Brien & Kaiser, 1985). In general,
O’Brien and Kaiser (1985) argue that sphericity is
“unnatural for most repeated measures data” and
that “it is commonly violated in most designs with
more than two repeated measurements.”
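That compound symmetry is sufficient but not necessary can be seen with two small covariance matrices. The sketch below is an illustration under stated assumptions: the numerical values and function names are hypothetical, and the second matrix uses the pattern in which each covariance is the sum of two occasion-specific constants, with an extra constant added on the diagonal, a pattern known from the statistical literature to give equal variances for all pairwise differences without equal variances or equal correlations.

```python
import numpy as np

def compound_symmetric(k, variance, correlation):
    """Equal variances and equal pairwise correlations."""
    return variance * ((1 - correlation) * np.eye(k)
                       + correlation * np.ones((k, k)))

def spherical_without_symmetry(a, lam):
    """sigma_ij = a_i + a_j, plus lam added on the diagonal."""
    a = np.asarray(a, dtype=float)
    return a[:, None] + a[None, :] + lam * np.eye(len(a))

def difference_variances(sigma):
    """Variance of every pairwise difference between measurements."""
    k = sigma.shape[0]
    return np.array([sigma[i, i] + sigma[j, j] - 2 * sigma[i, j]
                     for i in range(k) for j in range(i + 1, k)])

def has_compound_symmetry(sigma, tol=1e-10):
    """True if all variances are equal and all covariances are equal."""
    k = sigma.shape[0]
    diag = np.diag(sigma)
    off = sigma[~np.eye(k, dtype=bool)]
    return bool(np.allclose(diag, diag[0], atol=tol)
                and np.allclose(off, off[0], atol=tol))

# Hypothetical parameter values, for illustration only.
cs_sigma = compound_symmetric(k=4, variance=2.0, correlation=0.5)
h_sigma = spherical_without_symmetry(a=[1.0, 2.0, 3.0], lam=2.0)

print(has_compound_symmetry(cs_sigma), difference_variances(cs_sigma))
print(has_compound_symmetry(h_sigma), difference_variances(h_sigma))
```

Both matrices yield constant pairwise difference variances, but only the first has equal variances and equal correlations.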
Since the condition of sphericity is difficult to
concretely describe, we will use the condition of
compound symmetry to better illustrate why psychophysiological
data are unlikely to possess the
requisite covariance structure for valid repeated
measures ANOVA. Recall, however, that this con-
dition is merely sufficient and an examination of
the correlation matrix of the repeated measurements
is not adequate to rule out sphericity. However,
in any study with one or more effective manipulations
over time, one would not expect equal
correlations between all pairs of repeated measure-
ments. Clearly one would expect measurements
taken prior to a manipulation to correlate more
highly with one another than with those taken after
manipulation. Even in cases with no active manip-
ulation one would expect successive or adjacent
measurements to covary more highly than non-adjacent
measurements (Winer, 1971; Rogan et al.,
1979). Clearly, the assumption of equal pairwise
correlations is unrealistic in many cases and it is
unlikely that sphericity exists under such condi-
tions.
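The effect of higher correlations between adjacent measurements can be made concrete with a simple serial pattern. The sketch below assumes, purely for illustration, unit variances and a correlation that decays as a power of the lag between occasions; this pattern, the value of 0.8, and the function names are hypothetical and are not taken from the studies cited. Under sphericity the variance of every pairwise difference would be the same; here it grows with the distance between occasions.

```python
def serial_correlation(k, rho):
    """Correlation matrix in which r_ij = rho ** |i - j| (assumed pattern)."""
    return [[rho ** abs(i - j) for j in range(k)] for i in range(k)]

def difference_variance_by_lag(corr):
    """Average variance of pairwise differences, grouped by lag.

    With unit variances, var(x_i - x_j) = 1 + 1 - 2 * r_ij.
    """
    k = len(corr)
    by_lag = {}
    for i in range(k):
        for j in range(i + 1, k):
            variance = corr[i][i] + corr[j][j] - 2 * corr[i][j]
            by_lag.setdefault(j - i, []).append(variance)
    return {lag: sum(values) / len(values) for lag, values in by_lag.items()}

# Five occasions with a hypothetical adjacent correlation of 0.8.
lag_variances = difference_variance_by_lag(serial_correlation(5, 0.8))
for lag, variance in sorted(lag_variances.items()):
    print(f"lag {lag}: difference variance = {variance:.3f}")
```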
Several procedures to test for the presence of
sphericity have been developed. Thus it is theoretically
possible to conduct such tests to protect against
making unnecessary power reducing adjustments.
Such a practice has indeed been recommended
(Huynh & Feldt, 1970; Huynh & Mandeville, 1979).
However, Rogan et al. (1979) have shown that use
of such preliminary tests differs little from uniform
use of multivariate or adjusted univariate tests. In
addition, little is known about the characteristics
of such tests under violations of the assumption of
multivariate normality. O'Brien and Kaiser (1985)
noted that one such test, Mauchly’s Criterion W,
is quite sensitive to such violations of normality as
well as small sample size. Davidson (1972) has also
shown that Box’s (1954) test, which is admittedly
for the more rigorous condition of compound symmetry,
is only useful if the sample size n exceeds k
by at least 20. However, as we shall see, that is
exactly the point at which the MANOVA approach
achieves power comparable to that of the univariate
approach, therefore rendering the test of little use
(Davidson, 1972).
When the condition of sphericity is not fulfilled,
the F ratios computed are not distributed like the
tabulated F distribution. As previously mentioned,
the true Type I error rates associated with these
ratios are typically greater than the nominal alpha.
For example, Huynh and Feldt (1980) examined a
design with 5 repeated measurements and 3 levels
of a between groups factor. The variances of the 5
measurements were identical but the correlation
matrix was:
$$
R = \begin{bmatrix}
1.00 & .80 & .60 & .40 & .30 \\
     & 1.00 & .80 & .60 & .40 \\
     &      & 1.00 & .80 & .60 \\
     &      &      & 1.00 & .80 \\
     &      &      &      & 1.00
\end{bmatrix}
$$
For this example, even when all other assumptions
are satisfied and the group’s sample sizes are infinitely
large, the test of the Group × Repeated Measures
interaction has a Type I error rate of .09 when
the nominal α = .05. Notice that this covariance