Copyright The Society for Psychophysiological Research

The Continuing Problem of False Positives in Repeated Measures ANOVA in Psychophysiology: A Multivariate Solution

MICHAEL W. VASEY AND JULIAN F. THAYER
Department of Psychology, The Pennsylvania State University, University Park, Pennsylvania

ABSTRACT

Violation of the validity assumptions of repeated measures analysis of variance continues to be a problem in psychophysiology. Such violation results in positive bias for those tests involving the repeated measures factor(s). Recently it has been shown that the tests of simple interactions and multiple comparisons are even more vulnerable to bias (Boik, 1981; Mitzel & Games, 1981). The present paper offers a discussion of the validity assumptions for both overall and sub-effect tests and describes a multivariate approach which allows exact analysis of such designs. A modification of the univariate approach is also described. Validity concerns for both approaches are much less problematic than those of the traditional approach.

DESCRIPTORS: Repeated measures designs, Sphericity assumption, Multivariate analysis of variance, False positives.

Repeated measures designs are among the most common experimental designs in psychophysiological research. These designs have traditionally been analyzed through univariate analysis of variance (ANOVA). In general, ANOVA assumes that the data are normally distributed with homogeneous variance among groups. However, due to the correlated nature of repeated measures data, ANOVA applied to such designs carries an additional requirement known as sphericity or circularity. The condition of sphericity exists if and only if the variance of all pairwise differences between repeated measurements is constant. This is a complex assumption and will be described fully below.
In the past, several papers have appeared in Psychophysiology which have cautioned that this assumption is quite unrealistic when applied to psychophysiological data and is frequently violated (Wilson, 1967, 1974; Jennings & Wood, 1976; Keselman & Rogan, 1980). The consequence of such violation is positively biased or liberal tests, meaning that the likelihood of a Type I (false positive) error exceeds the probability, or alpha (α) level, set by the user. Keselman and Rogan (1980) pointed out that for the .05 level of significance the bias can reach 2α, while for the .01 level it can reach 6α. It is clear that such a bias could result in a high incidence of nonreplicable results published in our journals (Games, 1976).

A variety of procedures, generally involving reduction of the degrees of freedom through multiplication by some value ε (epsilon), have been advocated to guard against such bias. These procedures, such as the three-step approach suggested by Greenhouse and Geisser (1959), will be described below. Unfortunately these safeguards have not typically been used by psychophysiologists. Jennings and Wood (1976) reported that 84% of the studies having repeated measures designs which appeared in volume 12 of Psychophysiology (1975) apparently ignored the possibility of bias. Thereafter Games (1976) offered to provide FORTRAN IV programs implementing such a procedure. Finally, Keselman and Rogan (1980), noting a continued neglect of this issue, reminded researchers of the need for such safeguards.

The authors wish to thank Robert Stern, Ralph O'Brien, and Robert Kennedy for their helpful comments. A shorter version of this paper was presented at the thirteenth meeting of the Psychophysiology Society, London, December, 1985.
Address requests for reprints to: Julian F. Thayer, Department of Psychology, The Pennsylvania State University, University Park, Pennsylvania 16802.

Given such a thorough treatment of this problem, one might expect that psychophysiologists would now guard against such test level bias as a matter of course. Regrettably, a review of volumes 21 and 22 (1984 and 1985) revealed that more than 50% of such studies remain unprotected against this problem. It should be noted that only designs having three or more levels of a repeated measures factor were considered, since the assumption of sphericity is always fulfilled for two levels. The above is not to suggest that all of these unprotected studies have suffered increased incidence of false positive results. Certainly the bias for overall effects is typically small and many of these studies would remain unchanged if appropriate adjustments were applied. However, the potential for bias does exist and since appropriate safeguards are now commonly available, we argue that they should be routinely applied. Currently that is not the case. As we will describe below, the problem is especially great for sub-effect tests.

The above finding demonstrates a need for further discussion of this topic. However, a more serious problem exists which has not previously been discussed in the psychophysiological literature. The traditional safeguards, like the Greenhouse and Geisser (1959) three-step approach and ε adjustments in general, protect only the F tests for the main effect of, or interactions with, repeated measures factor(s). It has recently become clear that the specific comparisons which typically follow and clarify significant overall tests are even more vulnerable to inflated Type I error rates, which may reach ten to fifteen times the nominal alpha under nonsphericity (Boik, 1981; Mitzel & Games, 1981; Harris, 1985). Given such bias it seems clear that many of the sub-effect tests reported in the literature cannot help but be erroneous.
Unfortunately, discussions of this issue have previously been confined to the statistical literature, and few psychophysiologists have taken this problem into account. In our review of volumes 21 and 22 of Psychophysiology, less than 5% of the studies appeared to consider this problem.

For these reasons the present paper offers further discussion of the validity assumptions for both overall and sub-effect tests as well as a description of two approaches to analysis of such designs for which validity concerns are much less problematic. These designs are perhaps best conceptualized as multivariate in nature. Unless sphericity is assured, they can frequently be analyzed most simply and validly via multivariate analysis of variance (MANOVA) (O'Brien & Kaiser, 1985). Such an approach does not make the assumption of sphericity and therefore yields bias free tests even when that condition is violated (Harris, 1985, section 3.8). However, a modification of the ε adjustment approach can also be used effectively and this too shall be discussed. Both approaches have a place in the analysis of repeated measures designs, and sample size is the primary consideration when choosing between them.

Vasey and Thayer, Vol. 24, No. 4

Review of Validity Assumptions

All applications of ANOVA require that the data be normally distributed with homogeneous variance among groups. ANOVA is typically robust to violations of these assumptions. However, repeated measures designs introduce intercorrelations among the means on which comparisons are based. These intercorrelations allow use of a pooled or average estimate of error variance and therefore greater power than between group designs.
Unfortunately, under this condition, the p values yielded by univariate F tests are accurate only if highly restrictive assumptions concerning the nature of these intercorrelations are met. Since a detailed theoretical discussion of these assumptions is beyond the scope of this article, the reader is referred to several thorough reviews in the statistical literature (see Huynh & Mandeville (1979) or Rogan, Keselman, & Mendoza (1979) for a discussion of these assumptions for both simple and complex designs). In general, the p values of the F tests are accurate only when the variance-covariance matrix $\Sigma$ can be said to be spherical or circular. This is true if and only if the variance of all the contrasts between repeated measurements which compose the overall comparison of interest (e.g. the within subject main effect) is constant. For the mathematically minded, this is the case if the covariance matrix $\Sigma$ satisfies the equation

$$E = C \Sigma C' = \sigma^2 I$$

where $E$ is the error matrix, $C$ is a $(k-1) \times k$ orthonormal contrast matrix, $I$ is the identity matrix of rank $(k-1)$, and $k$ is the number of repeated measurements. The scalar, $\sigma^2$, represents the common experimental error of the contrasts. The matrix $C$ can be any set of $(k-1)$ orthogonal contrasts which define the comparison of interest (e.g. the within subject main effect). These contrasts are normalized by dividing each weight by $\sqrt{c'c}$, the square root of the sum of the squared weights of a given contrast. In reality, this contrast matrix may be smaller than $(k-1)$ by $k$ since one may not be interested in all $(k-1)$ contrasts. Quite often the best tests are multiple degree of freedom sub-effects or simple effects. A good discussion for psychophysiologists of such contrast matrices and sphericity can be found in Keselman and Rogan (1980).
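
The sphericity condition on the contrast-transformed covariance matrix can be checked numerically. The following is a minimal sketch in Python with NumPy, not a procedure taken from the paper: the function names, the choice of Helmert-type contrasts, the tolerance, and the two example covariance matrices are all assumptions made for illustration. It builds an orthonormal (k-1) by k contrast matrix, forms C Σ C', and asks whether the result is a constant multiple of the identity.

```python
import numpy as np

def orthonormal_contrasts(k):
    """Return a (k-1) x k matrix of orthonormal Helmert-type contrasts.

    Hypothetical helper: any set of k-1 orthonormal contrasts would do.
    """
    C = np.zeros((k - 1, k))
    for i in range(k - 1):
        C[i, : i + 1] = 1.0           # first i+1 weights are 1
        C[i, i + 1] = -(i + 1.0)      # next weight makes the row sum to zero
        C[i] /= np.sqrt(C[i] @ C[i])  # normalize by sqrt(c'c)
    return C

def sphericity_check(sigma, tol=1e-8):
    """Return (is_spherical, common_variance) for covariance matrix sigma."""
    k = sigma.shape[0]
    C = orthonormal_contrasts(k)
    E = C @ sigma @ C.T               # covariance matrix of the contrasts
    sigma2 = np.trace(E) / (k - 1)    # average contrast variance
    spherical = bool(np.allclose(E, sigma2 * np.eye(k - 1), atol=tol))
    return spherical, sigma2

k = 4
# Illustrative matrix 1: unit variances, every pairwise correlation .5
equal_corr = 0.5 * np.eye(k) + 0.5 * np.ones((k, k))
# Illustrative matrix 2: unit variances, correlations decaying with lag (.8 ** lag)
lags = np.abs(np.subtract.outer(np.arange(k), np.arange(k)))
decaying = 0.8 ** lags

spherical_a, common_var = sphericity_check(equal_corr)
spherical_b, _ = sphericity_check(decaying)
print(spherical_a, common_var)
print(spherical_b)
```

In this sketch the equal-correlation matrix passes the check and the lag-decaying matrix does not, because its contrasts have unequal variances and nonzero covariances.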
More simply, sphericity exists if and only if the contrasts represented by $C$ have equal variance and zero covariance.

July, 1987

Another way to think of sphericity is based on generalization of the dependent t statistic. In order to get an estimate of the error term in the simple one-factor repeated measures case, one can generalize from the variance estimate of the difference between two dependent means that is used for the t test: $(S_j^2 + S_{j'}^2 - 2S_{jj'})$. The F for a one-way repeated measures design involves at least three such comparisons, and a natural way to derive the error estimate for the F is to pool the error terms for each of the pairwise comparisons. In this way the overall or pooled error term is $MS_e = (\text{mean of } S_j^2) - (\text{mean of } S_{jj'})$. However, in order to do this pooling, it must be assumed that all possible values of $(S_j^2 + S_{j'}^2 - 2S_{jj'})$ are estimating the same quantity and are therefore approximately equal. This is the assumption of sphericity.

It should be noted that designs with several repeated measures factors can have more than one assumption of sphericity. The number of sphericity assumptions is $(2^T - 1)$, where $T$ is the number of repeated measures factors included in the test. Thus two repeated measures factors result in three sphericity assumptions, one for each main effect and one for their interaction. In cases where one or more between group factors are included, we must additionally assume that the variance-covariance matrices for the set of contrasts are identical for all levels of these factors. In other words, homogeneity of variance among groups is assumed just as it is in all between groups designs.

One form of sphericity that is often discussed is called compound symmetry. This condition occurs when all variances of the repeated measurements are equal and all pairwise correlations between the repeated measurements are equal.
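
The compound symmetry condition, and its link to the pooled error term described earlier, can be sketched numerically. This is an illustrative Python/NumPy sketch under assumed values (four measurements, variance 2.0, correlation .5), and the function names are hypothetical rather than drawn from the paper. It builds a compound symmetric covariance matrix, computes the variance of every pairwise difference, and computes the pooled error term as the mean variance minus the mean covariance.

```python
import numpy as np

def compound_symmetric(k, variance, rho):
    """Covariance matrix with equal variances and equal pairwise correlations."""
    return variance * ((1.0 - rho) * np.eye(k) + rho * np.ones((k, k)))

def pairwise_difference_variances(S):
    """Variance of each pairwise difference: S_j^2 + S_j'^2 - 2 * S_jj'."""
    k = S.shape[0]
    return np.array([S[j, j] + S[m, m] - 2.0 * S[j, m]
                     for j in range(k) for m in range(j + 1, k)])

def pooled_error(S):
    """Pooled error term: (mean of the variances) - (mean of the covariances)."""
    k = S.shape[0]
    mean_var = np.trace(S) / k
    mean_cov = (S.sum() - np.trace(S)) / (k * (k - 1))
    return mean_var - mean_cov

# Assumed illustrative values, not data from any study
S = compound_symmetric(k=4, variance=2.0, rho=0.5)
diff_vars = pairwise_difference_variances(S)
ms_e = pooled_error(S)
print(diff_vars)
print(ms_e)
```

Under compound symmetry every pairwise difference has the same variance, so pooling is justified. As a matter of arithmetic, the average difference variance is always twice the pooled term, whatever the covariance matrix; what sphericity adds is that each individual difference variance equals that average.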
This condition is sufficient but not necessary for validity and therefore may not always be fulfilled under sphericity. Though it is not necessary, the absence of compound symmetry does indicate that sphericity is unlikely (O'Brien & Kaiser, 1985). In general, O'Brien and Kaiser (1985) argue that sphericity is "unnatural for most repeated measures data" and that "it is commonly violated in most designs with more than two repeated measurements."

Since the condition of sphericity is difficult to concretely describe, we will use the condition of compound symmetry to better illustrate why psychophysiological data are unlikely to possess the requisite covariance structure for valid repeated measures ANOVA. Recall, however, that this condition is merely sufficient and an examination of the correlation matrix of the repeated measurements is not adequate to rule out sphericity. However, in any study with one or more effective manipulations over time, one would not expect equal correlations between all pairs of repeated measurements. Clearly one would expect measurements taken prior to a manipulation to correlate more highly with one another than with those taken after manipulation. Even in cases with no active manipulation one would expect successive or adjacent measurements to covary more highly than non-adjacent measurements (Winer, 1971; Rogan et al., 1979). Clearly, the assumption of equal pairwise correlations is unrealistic in many cases and it is unlikely that sphericity exists under such conditions.

Several procedures to test for the presence of sphericity have been developed. Thus it is theoretically possible to conduct such tests to protect against making unnecessary power reducing adjustments. Such a practice has indeed been recommended (Huynh & Feldt, 1970; Huynh & Mandeville, 1979). However, Rogan et al.
(1979) have shown that use of such preliminary tests differs little from uniform use of multivariate or adjusted univariate tests. In addition, little is known about the characteristics of such tests under violations of the assumption of multivariate normality. O'Brien and Kaiser (1985) noted that one such test, Mauchly's Criterion W, is quite sensitive to such violations of normality as well as small sample size. Davidson (1972) has also shown that Box's (1954) test, which is admittedly for the more rigorous condition of compound symmetry, is only useful if the sample size n exceeds k by at least 20. However, as we shall see, that is exactly the point at which the MANOVA approach achieves power comparable to that of the univariate approach, therefore rendering the test of little use (Davidson, 1972).

When the condition of sphericity is not fulfilled, the F ratios computed are not distributed like the tabulated F distribution. As previously mentioned, the true Type I error rates associated with these ratios are typically greater than the nominal alpha. For example, Huynh and Feldt (1980) examined a design with 5 repeated measurements and 3 levels of a between groups factor. The variances of the 5 measurements were identical but the correlation matrix was:

$$R = \begin{bmatrix} 1.00 & .80 & .60 & .40 & .30 \\ & 1.00 & .80 & .60 & .40 \\ & & 1.00 & .80 & .60 \\ & & & 1.00 & .80 \\ & & & & 1.00 \end{bmatrix}$$

For this example, even when all other assumptions are satisfied and the groups' sample sizes are infinitely large, the test of the Group × Repeated Measures interaction has a Type I error rate of .09 when the nominal α = .05. Notice that this covariance