
A Cluster Analysis Method for Grouping Means in the Analysis of Variance
Author(s): A. J. Scott and M. Knott
Source: Biometrics, Vol. 30, No. 3 (Sep., 1974), pp. 507-512
Published by: International Biometric Society
Stable URL: http://www.jstor.org/stable/2529204
Accessed: 22/04/2011 10:40



BIOMETRICS 30, 507-512, September 1974

A CLUSTER ANALYSIS METHOD FOR GROUPING MEANS IN THE ANALYSIS OF VARIANCE


A. J. SCOTT and M. KNOTT

University of Auckland, Auckland, New Zealand, and
The London School of Economics and Political Science

SUMMARY
It is sometimes useful in an analysis of variance to split the treatments into reasonably homogeneous groups. Multiple comparison procedures are often used for this purpose, but a more direct method is to use the techniques of cluster analysis. This approach is illustrated for several sets of data, and a likelihood ratio test is developed for judging the significance of differences among the resulting groups.

1. INTRODUCTION

When the result of an F-test in an analysis of variance shows that treatment means differ, it is often important to obtain some idea of where these differences are. There are many papers dealing with the large number of multiple comparison procedures that have been proposed to help the search for differences among means. For some purposes it is enough to split the means into approximately homogeneous groups. In discussing a hypothetical example, Tukey [1949] said "At a low and practical level, what do we want to do? We wish to separate the varieties into distinguishable groups as often as we can without too frequently separating varieties which should stay together". Tukey proposed a sequence of multiple comparison procedures to accomplish this grouping, each based on an intuitive criterion. Another method of grouping the treatment means is cluster analysis. The possibility of using cluster analysis in place of a multiple comparison technique was suggested by Plackett in his discussion of the review paper by O'Neill and Wetherill [1971].

In this paper we study the consequences of using a well-known method of cluster analysis to partition the sample treatment means in a balanced design, and show how a corresponding likelihood ratio test gives a method of judging the significance of the differences among the groups obtained.
2. A LIKELIHOOD RATIO TEST FOR TWO GROUPS

Suppose we have a set of independent sample treatment means $y_1, y_2, \ldots, y_k$ with $y_i \sim N(\mu_i, \sigma^2)$, and an independent estimate, $s^2$, of the common variance, where $\nu s^2/\sigma^2$ is distributed as $\chi^2_\nu$. We could check the homogeneity of the means with an F-test in the usual way, but if we suspect that the means fall into two distinct, internally homogeneous groups then it is natural to consider the likelihood ratio test for this specific alternative. Thus, let $B_0$ be the maximum value, taken over all possible partitions of the $k$ treatments into two groups, of the between-groups sum of squares, and let $\hat\sigma_0^2$ be the maximum likelihood (ML) estimate of $\sigma^2$ when all the $\mu_i$'s are assumed equal, i.e.

$$\hat\sigma_0^2 = \Big[\sum_{i=1}^{k} (y_i - \bar y)^2 + \nu s^2\Big]\Big/ (k + \nu).$$

Then it is easy to verify that the likelihood ratio test for the null hypothesis $H_0: \mu_i = \mu$ $(i = 1, \ldots, k)$ against the alternative that each $\mu_i$ is equal either to $m_1$ or to $m_2$ (with at least one mean in each group), where $m_1$ and $m_2$ represent the unknown means of the two groups, is equivalent to a test that rejects $H_0$ if $B_0/\hat\sigma_0^2$ is too large. This requires the null distribution of $B_0/\hat\sigma_0^2$. Using the methods of Hartigan ([1972] section 4), it can be shown that, as $k \to \infty$, $k^{-1/2}(B_0/\hat\sigma_0^2 - 2k/\pi)$ converges in distribution to a normal random variable with mean zero and variance $8(\pi - 2)/\pi^2$ when $H_0$ is true. Hence, take for a modified test statistic

$$\lambda = \frac{\pi}{2(\pi - 2)} \, \frac{B_0}{\hat\sigma_0^2}.$$

Then it follows that $k^{-1/2}(\lambda - \nu_0)$ is asymptotically equivalent to $k^{-1/2}(Z - \nu_0)$, where $Z$ is a $\chi^2$ random variable with $\nu_0 = k/(\pi - 2)$ degrees of freedom. This suggests approximating the percentage points of the null distribution of $\lambda$ by those of a $\chi^2$ distribution with $\nu_0$ D.F. An extensive simulation was carried out to determine exact percentage points of the null distribution of $\lambda$, and the results are reported in section 5. It turns out that the $\chi^2$ approximation is very good indeed even for $k$ as small as two (see Table 2) and should be adequate for most practical situations.

There is a one to one relationship between this likelihood ratio test and a standard method of cluster analysis. Calculation of $\lambda$ involves finding that partition of the treatments for which the between-groups sum of squares is a maximum (or equivalently, the within-groups sum of squares is a minimum). This partition is the one given by the method of cluster analysis suggested by Edwards and Cavalli-Sforza [1965], when applied to the univariate means $y_1, \ldots, y_k$. Whether one considers a likelihood ratio test or uses the intuitive justifications of cluster analysis, it is clear that the ML partition under $H_1$ gives an estimate of which means are in each of the groups postulated by $H_1$. We need to be able to find this ML partition easily in order to calculate $\lambda$. There are $2^{k-1} - 1$ possible partitions of the $k$ means into two nonempty groups, but Fisher [1958] has shown that it is enough to look at the $(k - 1)$ partitions formed by ordering the means and dividing between two successive ones. This makes the calculation feasible even by hand if $k$ is no more than 11 or 12. In particular, when $k = 3$, it is only necessary to split the ordered means at the largest gap.
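To make the calculation concrete, here is a minimal Python sketch, my own illustration rather than code from the paper; the function name two_group_lambda and the use of numpy/scipy are assumptions of this sketch. It scans the $k - 1$ ordered splits as in Fisher [1958], forms $B_0$ and $\hat\sigma_0^2$ as defined above, and compares $\lambda$ with the 95% point of the approximating $\chi^2$ distribution.

```python
import numpy as np
from scipy.stats import chi2

def two_group_lambda(means, s2, nu):
    """Best two-group split of k treatment means (hypothetical helper).

    means : sample treatment means y_1, ..., y_k
    s2    : independent estimate of the variance of a treatment mean
    nu    : degrees of freedom of s2
    Returns (lambda statistic, chi-square 95% point, best partition).
    """
    y = np.sort(np.asarray(means, dtype=float))
    k = len(y)
    grand = y.mean()

    # Fisher [1958]: only the k-1 splits of the ordered means need be checked.
    best_B0, best_cut = -np.inf, None
    for c in range(1, k):
        g1, g2 = y[:c], y[c:]
        B0 = c * (g1.mean() - grand) ** 2 + (k - c) * (g2.mean() - grand) ** 2
        if B0 > best_B0:
            best_B0, best_cut = B0, c

    # ML estimate of sigma^2 under H0 (all means equal).
    sigma0_sq = (np.sum((y - grand) ** 2) + nu * s2) / (k + nu)

    lam = np.pi / (2 * (np.pi - 2)) * best_B0 / sigma0_sq
    nu0 = k / (np.pi - 2)              # d.f. of the approximating chi-square
    crit = chi2.ppf(0.95, nu0)         # approximate 5% critical value
    return lam, crit, (y[:best_cut], y[best_cut:])
```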

TABLE 1
SIMULATION RESULTS FOR TWO GROUPS, EACH CONTAINING 5 MEANS, WITH 40 D.F.

               Number of groups obtained
   δ        1        2   More than 2   Percent error
   1     4497      589        4             88.2
   2     2677     2270       53             54.6
   3      562     4276      162             14.5
   4       20     4660      320              6.8

TABLE 2
95% POINTS FOR THE DISTRIBUTION OF $\lambda = \pi B_0/\{2(\pi - 2)\hat\sigma_0^2\}$

                                    ν/k
   k       0       1       2       3       4       5       ∞   From χ²
   2    2.75    4.97    5.44    5.50    5.50    5.47    5.28     5.46
   5    6.60    9.31    9.77   10.30   10.44   10.47   10.89    10.09
  10   12.11   15.13   15.92   16.22   16.61   16.73   17.22    16.58
  20   21.74   26.14   27.14   27.77   27.93   28.01   29.26    28.25

3. THREE OR MORE GROUPS

In practice there are likely to be several groups of treatments, so that it is not always enough to split the means into just two groups. We adopt the hierarchical splitting method suggested by Edwards and Cavalli-Sforza [1965] in their work on cluster analysis. This starts with the best split into two groups, based on the between-groups sum of squares, and then applies the same procedure separately to each subgroup in turn. The subdivision process is continued until the resulting groups are judged to be homogeneous by application of the $\lambda$ test of section 2. This method is simple to apply, and it is often easier to interpret the results in an unambiguous way with a hierarchical method in which the groups at any stage are related to those of the previous stage.

Choosing an appropriate value for $\alpha$ is difficult. If $\alpha$ is too small, the splitting process will terminate too soon, while if $\alpha$ is too large, the process will go too far and split homogeneous sets of means. It would be particularly desirable to know the probability that the method will split the means into more than $p$ groups when in fact the treatments fall into exactly $p$ groups. Suppose first that there are really two groups of treatments. If they are so widely separated that the true groups will almost always be recovered, then the probability that we stop at two groups is $(1 - \alpha)^2$, where $\alpha$ is the significance level for $\lambda$ used at the second stage. If the separation is less extreme, the estimated groups will appear more homogeneous than the true groups, and the probability of stopping at two groups is greater than $(1 - \alpha)^2$. Thus the probability of splitting the treatments into more than two groups is bounded above by $\alpha^* = 1 - (1 - \alpha)^2$. More generally, if we attempt to split $j$ homogeneous groups at some stage, the probability of getting at least one significant split is no more than $\alpha^* = 1 - (1 - \alpha)^j$.

We have done some simulation with $k = 10$ means, 5 in each group, and $\nu = 40$ D.F. Five thousand samples were generated for $\delta = 1, 2, 3, 4$, where $\delta = (m_1 - m_2)/\sigma$ is the normalized distance between the group means $m_1$ and $m_2$. The $\chi^2$ approximation was used to judge the significance of splits with $\alpha = 0.05$, so that the upper bound is $\alpha^* = 0.0975$, and the results are set out in Table 1. Effective complete separation starts at about $\delta = 5$, and the upper bound is a good approximation for values of $\delta$ larger than this. If $\delta < 3$, most of the errors result from failure to recognize genuine splits, and the overall error rate would be reduced by choosing a larger value of $\alpha$. For values of $\delta > 4$, on the other hand, there are too many improper splits, and the overall error rate would be reduced with a smaller value of $\alpha$.
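The hierarchical procedure can be sketched in a few lines of Python, reusing the hypothetical two_group_lambda helper from the earlier sketch; this is an illustrative reading of the method described above, not the authors' implementation.

```python
import numpy as np
from scipy.stats import chi2

def scott_knott_groups(means, s2, nu, alpha=0.05):
    """Hierarchical splitting by repeated two-group likelihood ratio tests
    (illustrative sketch; relies on two_group_lambda defined earlier)."""
    y = np.sort(np.asarray(means, dtype=float))
    if len(y) < 2:
        return [y]

    lam, _, (left, right) = two_group_lambda(y, s2, nu)
    nu0 = len(y) / (np.pi - 2)
    if lam <= chi2.ppf(1 - alpha, nu0):
        # Split not significant at level alpha: this set is judged homogeneous.
        return [y]

    # Keep the split and apply the same test to each subgroup in turn; the
    # chance of wrongly splitting j homogeneous groups at one stage is at
    # most 1 - (1 - alpha)^j.
    return (scott_knott_groups(left, s2, nu, alpha)
            + scott_knott_groups(right, s2, nu, alpha))
```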

4. EXAMPLES

We illustrate our method with three examples that have been used for other multiple comparison procedures. In each case the $\chi^2$ approximation is used to judge significance at a nominal level of 5%.

Example 1

Shulkcum (see Duncan [1955]) conducted a randomized block experiment involving six blocks of seven varieties of barley. The variety means were 49.6, 58.1, 61.0, 61.5, 67.6, 71.2, 71.3. The analysis of variance gave an F value of 4.61 with 30 D.F. for error, which suggests strongly that there are differences among the means. When the ordered means are numbered 1 to 7, our analysis gives the breakdown below.

  1234567 → 1234 | 567    λ = 17.73
  1234    → 1 | 234       λ = 8.04
  567     → 5 | 67        λ = 0.99

This suggests that the groups should be 1234, 567. The split of 1234 between 1 and 234 is on the borderline of significance, and we may want to split the groups further into 1, 234, 567, especially in the light of the simulation results reported in section 3. This is the result suggested by Plackett in the discussion of O'Neill and Wetherill [1971] on the basis of a normal probability plot.
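As a rough check, the earlier sketch can be applied to these data. The variance of a variety mean, $s^2$, is not quoted directly, so in this illustration it is recovered from the reported F value (an assumption of the sketch, not a step taken in the paper); this reproduces the first-stage statistic up to rounding.

```python
import numpy as np

# Variety means from Example 1 (randomized blocks, 6 blocks, error d.f. = 30).
barley = np.array([49.6, 58.1, 61.0, 61.5, 67.6, 71.2, 71.3])
F, nu, n_blocks = 4.61, 30, 6

# s^2 (variance of a variety mean) recovered from the reported F value:
#   MS_treatment = n_blocks * sum((ybar_i - ybar)^2) / (k - 1),
#   MS_error     = MS_treatment / F,   s^2 = MS_error / n_blocks.
k = len(barley)
ms_treat = n_blocks * np.sum((barley - barley.mean()) ** 2) / (k - 1)
s2 = ms_treat / F / n_blocks

lam, crit, (g1, g2) = two_group_lambda(barley, s2, nu)
print(g1, g2, round(lam, 2), round(crit, 2))
# -> groups (49.6 ... 61.5) and (67.6 ... 71.3); lambda is about 17.8 (the paper
#    quotes 17.73), well above the chi-square 5% point of roughly 12.8.
```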
Example 2

Tukey [1949] gave examples of his methods in use on the results of a 6 × 6 Latin square experiment on potatoes. The experiment was not a simple Latin square design but had a factorial structure, which makes multiple comparison procedures inappropriate. This will be ignored here, as it was in Tukey's analysis. The six means were 345.0, 405.2, 426.4, 477.8, 502.2, 601.8, and the independent estimate of their variance was 254.4 with 20 D.F. The breakdown of the means (numbered 1 to 6) obtained by our analysis is given below.

  123456 → 123 | 456    λ = 22.97
  123    → 1 | 23       λ = 12.23
  456    → 45 | 6       λ = 16.97
  23     → 2 | 3        λ = 1.30
  45     → 4 | 5        λ = 4.53

The estimated grouping is 1, 23, 45, 6. Tukey obtained the same partition using his approach.

Example 3

Tukey [1949] had a second example, from Snedecor ([1946] Example 11.28), which concerned a 7 × 7 Latin square experiment on potatoes. The means were 341.9, 360.6, 363.1, 379.9, 386.3, 387.1, and the estimate of the variance of the sample means was 90.63 with 30 D.F. The breakdown of the means (numbered 1 to 7) obtained by our analysis is given below.
  1234567 → 1234 | 567    λ = 15.83
  1234    → 1 | 234       λ = 4.42
  567     → 5 | 67        λ = 2.24

Using the $\chi^2$ approximation, we obtain the estimated grouping 1234, 567. Tukey did not obtain the split between 4 and 5 using his battery of tests, but separated the first mean from the others. He remarked that if one uses an ordinary t-test for the particular grouping obtained here (1234, 567) the result is significant.
5. THE DISTRIBUTION OF λ

Some information about the distribution of $\lambda$ is already available. When $k = 2$, $\lambda$ reduces to a multiple of $t_\nu^2(\nu + t_\nu^2)^{-1}$, where the random variable $t_\nu$ has a Student t distribution with $\nu$ D.F., and percentage points can be obtained directly from those of the t distribution. When $\nu = 0$, there is no direct estimate of $\sigma^2$, and $\lambda$ can be expressed as a monotone function of $C = B_0/W_0$, where $W_0$ is the minimum within-groups sum of squares. Tables of the percentage points of $C$ have been given by Engelman and Hartigan [1969]. As indicated in section 2, the approximate distribution of $\lambda$ as $k \to \infty$ is $\chi^2_{\nu_0}$ with $\nu_0 = k/(\pi - 2)$.

Information about the distribution of $\lambda$ for other values of $\nu$ and $k$ was obtained by simulation using 5000 independently sampled values of $\lambda$ in each case. Normal random variables were generated by the Marsaglia and Bray [1964] transform of pseudo-random uniform variates $\{u_i\}$, where $u_i$ is the fractional part of $2^{36}[u_{i-1}(2^{26} + 1) + 54197344997]$. The histograms of the simulated values of $\lambda$ were checked to make sure there were no obvious peculiarities, and none due to regularities in the pseudo-random number generator was found. From the simulated values of $\lambda$ the first 100 Fourier sine coefficients $\hat\alpha_1, \hat\alpha_2, \ldots$ of the empirical distribution function $F_n(\lambda)$ from the $n = 5000$ simulated values of $\lambda$ were calculated, where $a = 2\pi(k + 2)$ is an upper bound for $\lambda$. A smoothed version $F_n^*(\lambda)$ of $F_n(\lambda)$ was obtained by using the first $m$ coefficients $\hat\alpha_1, \hat\alpha_2, \ldots, \hat\alpha_m$ in the formula

$$F_n^*(\lambda) = \sum_{j=1}^{m} \hat\alpha_j \sin\frac{j\pi\lambda}{a}.$$

Plots of $F_n^*(\lambda)$ for various values of $m$ suggested that $m = 70$ would be the best choice, because small and irregular oscillations began appearing in the plots for values of $m$ larger than 70.
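The reduction for $k = 2$ mentioned at the start of this section can be made explicit. Writing $t_\nu^2 = B_0/s^2$, which under $H_0$ is an $F(1, \nu)$ variable, gives $\lambda = \pi(2 + \nu)t_\nu^2/\{2(\pi - 2)(\nu + t_\nu^2)\}$. The short check below is my own derivation from the definitions of section 2, not code from the paper; it reproduces entries in the $k = 2$ row of Table 2.

```python
import numpy as np
from scipy.stats import f

def lambda_95_for_k2(nu):
    """Exact 95% point of lambda when k = 2, via the F(1, nu) distribution.

    Uses the fact that lambda is a monotone increasing function of
    t_nu^2 = B0 / s^2 ~ F(1, nu).  (Derived here; nu must be >= 1.)
    """
    c = np.pi / (2 * (np.pi - 2))
    F95 = f.ppf(0.95, 1, nu)
    return c * (2 + nu) * F95 / (F95 + nu)

# Reproduces the k = 2 row of Table 2; e.g. nu/k = 2, i.e. nu = 4, gives about 5.44.
print(round(lambda_95_for_k2(4), 2))
```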


Since the distribution of $\lambda$ as $k \to \infty$ is approximately $\chi^2$, we tried a $\chi^2$ approximation in other cases. A simple way to investigate this is through a probability plot. Wilson and Hilferty [1931] showed that if $Y$ is $\chi^2_\nu$ then the distribution of $Y^{1/3}$ is closely approximated by a normal distribution. We plotted $\lambda^{1/3}$ against $\Phi^{-1}(F_n^*(\lambda))$, where $\Phi$ is the standard normal distribution function, for a grid of $\lambda$ values equally spaced in $\lambda^{1/3}$. The plots were almost exact straight lines, particularly if attention was limited to values of $\lambda$ with $F_n^*(\lambda)$ between 0.50 and 0.99. A final smoothing of the simulated distributions was carried out by fitting straight lines by least squares to the plots between the 50% and 99% values. The correlation coefficients for these straight line fits were always around 0.999. Table 2 shows 5% level significance points for $\lambda$ estimated from the straight lines fitted to the probability plots. However, since the limiting $\chi^2$ approximation with $\nu_0$ D.F. worked so well for moderate values of $\nu/k$ and gave a conservative approximation for small values of $\nu$, it appears that more precise percentage points will only be needed for large values of $\nu/k$ in practice.
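The probability-plot smoothing can be illustrated with a short sketch, again my own rather than the authors' code, and with the plain empirical CDF standing in for the Fourier-smoothed $F_n^*$: simulated values of $\lambda$ are put on the cube-root scale, plotted against standard normal quantiles, a least-squares line is fitted between the 50% and 99% points, and the 95% point is read back from the line.

```python
import numpy as np
from scipy.stats import norm

def smoothed_95_point(lam_sim):
    """Estimate the 95% point of lambda from simulated values via a
    Wilson-Hilferty cube-root probability plot (illustrative sketch).

    lam_sim : array of simulated lambda values under H0.
    """
    lam = np.sort(np.asarray(lam_sim, dtype=float))
    n = len(lam)
    p = (np.arange(1, n + 1) - 0.5) / n       # empirical CDF at the order statistics

    # Restrict the fit to the 50%-99% range, as in the paper.
    keep = (p >= 0.50) & (p <= 0.99)
    x = norm.ppf(p[keep])                     # standard normal quantiles
    y = lam[keep] ** (1.0 / 3.0)              # cube-root scale (Wilson-Hilferty)

    slope, intercept = np.polyfit(x, y, 1)    # least-squares straight line
    return (intercept + slope * norm.ppf(0.95)) ** 3
```

Applied to 5000 simulated values of $\lambda$ for a given $(k, \nu)$, a routine of this kind should give points broadly comparable to the corresponding entry of Table 2.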
ACKNOWLEDGMENT

We appreciate the helpful comments made by an Associate Editor and referees on an earlier draft, particularly the reference to the asymptotic results of Hartigan [1972].
RESUME

(French summary, translated.) It is sometimes useful in an analysis of variance to divide the treatments into reasonably homogeneous groups. Multiple comparison procedures are often used for this purpose, but a more direct method is to employ the techniques of cluster analysis. This point of view is illustrated with several sets of data, and a likelihood ratio test is applied to determine whether the differences between the groups are significant.

REFERENCES

Duncan, D. B. [1955]. Multiple range and multiple F tests. Biometrics 11, 1-42.
Edwards, A. W. F. and Cavalli-Sforza, L. L. [1965]. A method for cluster analysis. Biometrics 21, 362-75.
Engelman, L. and Hartigan, J. A. [1969]. Percentage points of a test for clusters. J. Amer. Statist. Ass. 64, 1647-8.
Fisher, W. D. [1958]. On grouping for maximum homogeneity. J. Amer. Statist. Ass. 53, 789-98.
Hartigan, J. A. [1972]. Direct clustering of a data matrix. J. Amer. Statist. Ass. 67, 123-9.
Marsaglia, G. and Bray, T. A. [1964]. A convenient method for generating normal variables. SIAM Rev. 6, 260-4.
O'Neill, R. and Wetherill, G. B. [1971]. The present state of multiple comparison methods. J. R. Statist. Soc. B 33, 218-50.
Snedecor, G. W. [1946]. Statistical Methods. Collegiate Press, Ames, Iowa.
Tukey, J. W. [1949]. Comparing individual means in the analysis of variance. Biometrics 5, 99-114.
Wilson, E. B. and Hilferty, M. M. [1931]. The distribution of chi-square. Proc. Nat. Acad. Sci. U.S.A. 17, 684-8.

Received July 1972, Revised November 1973


Key Words: Multiple comparisons; Cluster analysis; Grouping means.
