JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide
range of content in a trusted digital archive. We use information technology and tools to increase productivity and
facilitate new forms of scholarship. For more information about JSTOR, please contact support@jstor.org.
Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at
http://about.jstor.org/terms
Biometrika Trust and Oxford University Press are collaborating with JSTOR to digitize,
preserve and extend access to Biometrika
This content downloaded from 186.239.80.39 on Sat, 28 Oct 2017 17:53:01 UTC
All use subject to http://about.jstor.org/terms
THE COMPARISON OF PERCENTAGES IN MATCHED SAMPLES
BY W. G. COCHRAN
Department of Biostatistics, School of Hygiene and Public Health, Johns Hopkins University
1. INTRODUCTION
The χ² test has long been used to test the significance of differences between ratios or percentages in two or more independent samples. It sometimes happens that each member of a
sample is matched with a corresponding member in every other sample, in the hope of securing
a more accurate comparison among the percentages. The matching may be based either on
the characteristics of the members, or on the fact that the partners in a group are subjected
to some test that is the same for all members of the group but varies from one group to
another.
Since the matching may introduce correlation between the results in different samples,
it invalidates the ordinary χ² test, which gives too few significant results if the matching is
effective. For the case where there are two samples, an appropriate test is easily constructed.
An example has been given by McNemar (1949), who presents this test. In his data, 205
soldiers were asked whether they thought that the war against Japan would last more or
less than a year. They were subsequently asked the same question after a lecture on the
difficulties of the war against Japan. Matching occurs because each sample contains exactly
the same soldiers.
The replies may be classified in a 2 x 2 frequency table as shown in Table 1.
Table 1

                           After lecture
                        Less    More    Total
   Before     Less       36      34       70
   lecture    More        0     135      135
              Total      36     169      205
Before the lecture, 70 men out of the 205 thought that the war would last less than a year,
whereas after the lecture this number has dropped to 36. The comparison which we wish to
make is that between the two frequencies 70/205 and 36/205. There are several ways in which
the test may be derived. Perhaps the easiest is to note that both numerators, 70 and 36,
contain the 36 (a) men who persisted in thinking that the war would last less than a year.
Hence, equality of the numerators would imply that the same number of men changed from
'Less' to 'More' as changed from 'More' to 'Less'. In other words, if the lecture is without
effect we would expect half the persons who changed their minds to change in one direction
and half in the other. Thus the test can be made by testing whether the numbers (b) and
(c) are binomial successes and failures out of n = (b + c) trials, with probability ½. For this purpose we may compute

    χ² = (b − c)²/(b + c),

with 1 degree of freedom. A correction for continuity can be applied by subtracting 1 from the absolute value of the numerator before squaring.
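The two-sample test just described can be sketched in present-day notation as follows; `mcnemar_chi2` is an illustrative helper name, and only the numbers b and c of changed answers enter the statistic:

```python
import math

def mcnemar_chi2(b, c, correct=True):
    """Chi-square (1 D.F.) for two matched samples, computed from the
    b changes of opinion in one direction and the c in the other."""
    num = max(abs(b - c) - 1, 0) if correct else abs(b - c)  # continuity correction
    return num * num / (b + c)

def chi2_sf_1df(x):
    """P(chi-square with 1 D.F. >= x), via the normal tail."""
    return math.erfc(math.sqrt(x / 2.0))

# McNemar's soldiers: 34 changed from 'Less' to 'More', none the other way.
chi2 = mcnemar_chi2(34, 0)     # (|34 - 0| - 1)^2 / 34
p = chi2_sf_1df(chi2)          # far beyond the 1 % point
```

With b = c the statistic is zero, as the argument about equal numbers of changes in the two directions requires.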
The two-sample case has also been discussed in a study by Denton & Beecher (1949),
where the object was to find out whether subjects reacted more frequently to an injection of
a new drug than they did to one of isotonic sodium chloride, which was used as a control.
They give a χ² test, attributed to Mosteller, which differs slightly from that given above.
The object of this paper is to extend the test to the situation where there are more than
two samples. An example is provided in some studies of variability among interviewers in
sample surveys. Each interviewer called at a different group of houses, but any house assigned
to an interviewer was matched with one of the houses assigned to each other interviewer
according to the characteristics of the housewife. A test of whether the percentage of 'yes'
answers to some question differed from interviewer to interviewer is a test of the type that
we are considering.
In a second example, the effectiveness of a number of different media for the growth of
diphtheria bacilli was investigated by the Communicable Disease Centre, U.S. Public Health
Service. In one series, specimens were taken from the throats of sixty-nine suspected cases. Each specimen was grown on each of four media A, B, C, D. The probability that growth
takes place will depend on the number of diphtheria bacilli present, and in a number of
cases there might well be no bacilli present.
Table 2

     Diphtheria data                        McNemar's data
   A   B   C   D   No. of            Before   After   No. of
                   cases                              cases
   1   1   1   1      4                 1       1       135
   1   1   0   1      2                 0       1        34
   0   1   1   1      3                 0       0        36
   0   1   0   1      1
   0   0   0   0     59
   Totals (T_j)                       Totals
   6  10   7  10                       135     169
Results are shown in Table 2.* Where there are four media, the 2 x 2 table does not seem
well adapted to a succinct presentation. Instead, each medium is allotted to one column of
the table. A 1 denotes that growth occurred with that medium, a 0 that no growth occurred.
Thus in Table 2 there were four specimens in which all four media exhibited growth, two
specimens in which media A, B and D, but not C, showed growth, and so on. To illustrate
* I wish to thank Dr Martin Frobisher, Chief, Bacteriology Laboratories, Communicable Disease
Centre, U.S. Public Health Service, Atlanta, for permission to use these data for illustration.
the relation to the method of presentation in a 2 x 2 table, McNemar's results are also shown
in this form, where a 1 denotes the answer 'more than a year'.
The column totals are the total numbers of 1's. The problem is to test whether these totals
differ significantly among media.
2. MATHEMATICAL FRAMEWORK
For a discussion of the theory of the test we shall adopt a less concise method of presentation
than that given in Table 2. Each matched group will be placed in a different row of the table.
Thus the table for the diphtheria data would contain 69 rows and 4 columns. The probability
of a 1 is presumed to vary from row to row, usually in a manner that is known only vaguely.
Nevertheless, an exact test can be developed by the familiar method in which the population
is generated by randomization. The observed total number u_i of successes (1's) in the ith row is regarded as fixed. If the null hypothesis is true, every one of the c columns is considered equally likely to obtain one of these successes. The population of possible results in the ith row consists of the c!/{u_i!(c − u_i)!} ways in which the u_i successes can be distributed among
c columns.
This specification has one consequence that might be questioned. If a row contains no
successes, or c successes, the population generated in that row consists only of the single
case that actually occurred. As will be seen, this implies that such rows play no part in the
test of significance. This is evident in the two-sample test, which makes no use of the number
of cases a and d in the cells of Table 1 where there was no change of opinion. On the other
hand, for given values of b and c, one might feel intuitively that significance ought to be more
definitely established if there are no cases in which the samples give the same result (i.e. a
and d are zero) than if there are a large number of such cases. Whether this feeling is sound is
perhaps debatable, and I do not see how weight can be given to it without losing the advantage
of an exact test.
The test criterion that will be used is Σ(T_j − T̄)², where T_j is the total number of successes in the jth column. This is the same criterion as in the ordinary χ² test for the situation where
the columns are independent. It may not be the best criterion. For the usefulness of the
data from a row for the purpose of detecting differences among columns may depend on the
probability of success in the row. That is, the situation may be similar to that which occurs
in dosage-mortality experiments, in which, for maximum sensitivity per observation,
comparisons of two drugs must be made close to the median lethal doses. This suggests that
in extensive data it might be advisable to group the rows according to the value of u_i and to perform some kind of weighting of the T_j values for different values of u_i. I have occasionally used this approach, but it may be difficult to decide what form the weighting should take,
particularly in a new type of experimentation. A test based on the unweighted totals will
often serve our purpose.
3. THE LIMITING DISTRIBUTION
We consider first the limiting distribution of the test criterion when the number of rows r is
very large. Let the variate x_ij take the value 1 if there is a success in the cell in the ith row and jth column, and 0 if there is a failure. By the properties of the randomization in that row, these two events occur with probabilities u_i/c and 1 − u_i/c, respectively. Hence

    E(x_ij) = u_i/c,    σ²(x_ij) = (u_i/c)(1 − u_i/c).
By symmetry, the covariance is the same for any two cells in the same row. Since the row
total of the x_ij is fixed at u_i and thus has zero variance, the covariance of x_ij and x_ik is found to be

    cov(x_ij, x_ik) = −(u_i/c)(1 − u_i/c)/(c − 1)    (j ≠ k).
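These moments can be verified by enumerating the randomization for a single row; the sketch below takes c = 4 columns and u_i = 2 successes, values chosen only for illustration:

```python
from itertools import combinations

c, u = 4, 2
# All equally likely ways of placing the u successes among the c columns.
rows = [[1 if j in p else 0 for j in range(c)]
        for p in combinations(range(c), u)]
n = len(rows)                                    # C(4, 2) = 6 placements

mean = sum(r[0] for r in rows) / n               # E(x_ij) = u/c = 1/2
var = sum((r[0] - mean) ** 2 for r in rows) / n  # (u/c)(1 - u/c) = 1/4
cov = sum((r[0] - mean) * (r[1] - mean) for r in rows) / n
# cov agrees with -(u/c)(1 - u/c)/(c - 1) = -1/12
```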
These results enable us to arrive, by non-rigorous methods, at the form of the limiting
distribution. Since the randomization is independent in different rows, the means, variances
and covariances of the column totals T_j will be the corresponding expressions above, summed
over the rows. If the number of rows is large, the joint distribution of these totals may be
expected to tend to the multivariate normal. Finally, if a set of c variates T_j follow a multivariate normal distribution with common variance σ² and common covariance ρσ², it is well known that Σ(T_j − T̄)² is distributed as σ²(1 − ρ)χ², with (c − 1) degrees of freedom (Walsh, 1947). In this case ρ = −1/(c − 1) and σ² = (cΣu_i − Σu_i²)/c², so that the criterion

    Q = c(c − 1) Σ(T_j − T̄)² / (cΣu_i − Σu_i²)    (1)

is distributed in the limit as χ² with (c − 1) degrees of freedom. From (1), with two samples, Q becomes (b − c)²/(b + c), in agreement with the test of §1.
In the ordinary x2 test, valid when the samples are independent, we have
    χ² = Σ(T_j − T̄)² / {r(ū/c)(1 − ū/c)},    (2)

where ū = Σu_i/r.
Under what conditions does this test coincide with the new (Q) test? It might be anti-
cipated that this should happen when the probability of success does not change from row
to row. The results are in line with this expectation.
Consider the application of both tests to a series of tables, all of which have the same set
of row totals. From (1) and (2) the new test gives a greater, an equal or a smaller number of
significant results than the ordinary test, according as Σ(u_i − ū)² is greater than, equal to, or less than

    cΣu_i(1 − u_i/c)/(c − 1).

If we wish to test the null hypothesis that the probability of success is the same in all rows, this could be done by an ordinary χ² test on the row totals u_i. Since rows are independent, the value of χ², with (r − 1) degrees of freedom, would be

    χ² = Σ(u_i − ū)² / {ū(1 − ū/c)}.
To find the denominator of Q, a separate frequency distribution of the values of the row
totals ui may be made.
Value of u_i    Frequency
      4              4
      3              5
      2              1
      0             59

Thus Σu_i = 33 and Σu_i² = 113, so that the denominator cΣu_i − Σu_i² = 4(33) − 113 = 19.
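The arithmetic of (1) for the diphtheria data can be sketched as follows; `cochran_q` is an illustrative name, and the tail probability of χ² with 3 degrees of freedom is computed from the closed form available for odd degrees of freedom:

```python
import math

def cochran_q(T, u):
    """Q = c(c-1) * sum_j (T_j - Tbar)^2 / (c * sum u_i - sum u_i^2)."""
    c = len(T)
    tbar = sum(T) / c
    num = c * (c - 1) * sum((t - tbar) ** 2 for t in T)
    return num / (c * sum(u) - sum(x * x for x in u))

def chi2_sf_3df(x):
    """P(chi-square with 3 D.F. >= x)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

T = [6, 10, 7, 10]            # column totals for media A, B, C, D
u = [4] * 4 + [3] * 5 + [2]   # row totals; rows with u_i = 0 contribute nothing
Q = cochran_q(T, u)           # 12 * 12.75 / 19, about 8.05
p = chi2_sf_3df(Q)            # about 0.045
```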
It is easy to show algebraically, as may be verified in this example, that the value of Q is unaffected by rows in which u_i = c or 0.
If there are only two columns, an exact small-sample test presents no difficulty. The Q test
is essentially equivalent to the sign test (Cochran, 1937; Dixon & Mood, 1946), for which
tables are available in the references cited.* This can be seen by the argument used in §1.
Apart from its divisor, Q is (T1 − T2)², i.e. the square of the difference between the number of
successes in the two columns. We may ignore all rows that contain either 2 or 0 successes,
since these do not affect the value of Q. Consequently (T1 - T2) is the difference between the
number r1 of rows in which the results are (1, 0) and the number r2 in which they are (0, 1). If n = r1 + r2, this difference equals (2r1 − n).
For any row that has one success, the probabilities of a (1, 0) and of a (0, 1) on the null hypothesis are both ½. This shows that r1 is distributed in the binomial (½ + ½)ⁿ, which is the quantity that is tabulated in the sign test.
For an exact test when c = 2, the procedure is therefore as follows: (i) ignore all rows with
2 or 0 successes; (ii) count the number of rows with a single success in the first column, and
refer to the tables of the sign test, where n is the total number of rows that have one success.
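The two-column procedure reduces to a binomial tail sum; a sketch (doubling the smaller tail for a two-sided test, a convention that the sign-test tables cited above handle in their own way):

```python
from math import comb

def sign_test_p(r1, n):
    """Exact two-sided sign test: r1 rows give (1, 0) out of the n rows
    with a single success, each with probability 1/2 on the null hypothesis."""
    k = min(r1, n - r1)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# McNemar's data: no (1, 0) rows and 34 (0, 1) rows.
p = sign_test_p(0, 34)     # 2 / 2^34, overwhelmingly significant
```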
In the small tables that have more than two columns, the exact distribution of Q can be
tabulated by enumerating all configurations generated by the randomization. Since the
number of possible cases is large, a comprehensive listing of exact significance levels would
be laborious to construct. As a check on the accuracy of the limiting distribution in small
samples, the exact distribution of Q was worked out for the following eight cases:
The figures following the semicolon are the u_i values: e.g. 2⁵1⁵ means that u_i = 2 in five of the rows and 1 in the remaining five. No case in which u_i = c or 0 was included, since any number of such rows may be added to the basic table without affecting the value of Q.
Some of the cases are rather closely related in their structure. Nevertheless, it seemed
best to include all of them in presenting summary comparisons. The cases were chosen as
indicative of the smallest samples in which the χ² approximation to the distribution of Q is likely to be needed. Smaller samples can of course occur in practice, but in this event it is
relatively easy to make an exact test of significance from the exact distribution of Q.
The exact distribution was compared not only with the χ² approximation, but also with an F-test applied to the data by means of an analysis of variance into the components

                            D.F.
    Rows                    (r − 1)
    Columns                 (c − 1)
    Rows × columns          (r − 1)(c − 1)
where F is the ratio of the mean squares for columns and rows x columns. If the data had
been measured variables that appeared normally distributed, instead of a collection of 1's and 0's, the F-test would be almost automatically applied as the appropriate method.
Without having looked into the matter, I had once or twice suggested to research workers that the F-test might serve as an approximation even when the table consists of 1's and 0's. As a testimony to the modern teaching of statistics, this suggestion was received with incredulity, the objection being made that the F-test requires normality, and that a mixture of 1's and 0's could not by any stretch of the imagination be regarded as normally distributed. The same workers raised no objection to a χ² test, not having realized that both tests require to some extent an assumption of normality, and that it is not obvious whether F or χ² is more sensitive to the assumption. Inclusion of the F-test is also worth while in view of the
widespread interest in the application of the analysis of variance to non-normal data.
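For a table of 1's and 0's the sums of squares can be written down directly, since Σx² = Σx. The sketch below computes F for the diphtheria table, using the row patterns of Table 2 (two of the patterns are not stated explicitly in the text and are inferred here from the column totals), and checks the algebraic relation F = (r − 1)Q/{r(c − 1) − Q}, which follows from the definitions of Q and of the sums of squares:

```python
def anova_f(rows):
    """F = MS(columns) / MS(rows x columns) for an r x c table of 0's and 1's."""
    r, c = len(rows), len(rows[0])
    g = sum(map(sum, rows))
    correction = g * g / (r * c)
    ss_total = g - correction            # sum of x^2 equals sum of x for 0/1 data
    ss_rows = sum(sum(row) ** 2 for row in rows) / c - correction
    T = [sum(row[j] for row in rows) for j in range(c)]
    ss_cols = sum(t * t for t in T) / r - correction
    ss_int = ss_total - ss_rows - ss_cols
    return (ss_cols / (c - 1)) / (ss_int / ((r - 1) * (c - 1)))

rows = ([[1, 1, 1, 1]] * 4 + [[1, 1, 0, 1]] * 2 +
        [[0, 1, 1, 1]] * 3 + [[0, 1, 0, 1]] + [[0, 0, 0, 0]] * 59)
F = anova_f(rows)            # about 2.75 on (3, 204) degrees of freedom
r, c, Q = 69, 4, 153 / 19    # Q for the same data
assert abs(F - (r - 1) * Q / (r * (c - 1) - Q)) < 1e-9
```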
The total number of values in the population is sufficiently small that the correction for continuity makes an appreciable difference. Application of the correction requires a little
inspection of the data. Usually the values of Σ(T_j − T̄)² increase by 2's, but with c = 2 or 3
the increase may be much greater, and it is necessary to discover what is the value of Q
immediately below the one actually obtained, and enter the table with a value midway
between the two. For χ² the results are given both with and without correction, since
experience in other problems has suggested that the correction may not be helpful when
there are more than two samples. For F the correction was a decided improvement and only
corrected values are shown.*
It is easy to build up the exact distribution row by row. Members of the first row need not
be permuted, but all other rows must be. Consider the diphtheria example in Table 2. If
the sixty-three rows which show either all positives or no positives are omitted, this becomes
the case c = 4, r = 6; 3⁵2. We start with the row (1110) and add successively four rows with
ui = 3 and one row with ui = 2. Addition of the second row gives the four cases
1110 1110 1110 1110
0111 1011 1101 1110
At this stage the possible sets of column totals are (2220) with probability 1/4 and (2211)
with probability 3/4. All permutations of the third row are now added, and so on. The total
number of cases is (4⁴)(6), or 1536, but these combine to give only nine different values of Q.
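This row-by-row enumeration is easily mechanized. The sketch below enumerates all placements in every row rather than fixing the first row; by symmetry over the columns the significance probability is unchanged, and the exact value 81/1536 quoted in the next section should be recovered:

```python
from itertools import combinations, product

def q_from_totals(T, den):
    """Q computed from column totals, with the denominator supplied."""
    c = len(T)
    tbar = sum(T) / c
    return c * (c - 1) * sum((t - tbar) ** 2 for t in T) / den

c = 4
u_values = [3, 3, 3, 3, 3, 2]                             # the case r = 6; 3^5 2
den = c * sum(u_values) - sum(x * x for x in u_values)    # 4(17) - 49 = 19

q_obs = q_from_totals([2, 6, 3, 6], den)   # observed totals, u_i = 4 and 0 rows removed
count = total = 0
for placements in product(*[list(combinations(range(c), u)) for u in u_values]):
    T = [0] * c
    for row in placements:
        for j in row:
            T[j] += 1
    total += 1
    if q_from_totals(T, den) >= q_obs - 1e-9:
        count += 1
p_exact = count / total                    # 81/1536, about 0.053
```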
where the averages are taken without regard to sign. The numbers of overestimates and
underestimates made by each approximation are also shown. In Table 3, χ² denotes the uncorrected value, and χ'² and F' denote the values after correction for continuity.
* Actually, the incomplete beta function rather than F was corrected for continuity, since the former
was more convenient for reading significance probabilities. Results differ slightly, but not materially,
from those given by correcting F itself.
From the percentage errors it appears that χ² (uncorrected) and F' have performed about equally well, both being slightly better than χ'². None of the methods is free from bias. χ'² tends to overestimate and F' to underestimate. Over the range as a whole χ² comes off fairly well with 23 overestimates and 32 underestimates, but it appears that a negative bias in the region of 0.2 to 0.1 is being counteracted by a positive bias in the region of 0.02 to 0.005. For practical use χ² is preferable to F', since it is slightly easier to calculate, though the possible application of F' to more complex tables should be borne in mind.
Table 3

                     Average % error       Over- (+) and underestimates (−)
Range of            χ²    χ'²    F'        χ²         χ'²         F'
exact P                                    +    −     +    −     +    −
0.2 –0.1            15     7     5         1   10     7    4     6    5
0.1 –0.02           14    18    15         7   19    21    5     7   19
0.02–0.005          21    46    26        15    3    17    1     1   17
Average or total    16    25    17        23   32    45   10    14   41
1 % level, the corresponding figures are about 0.012 and 0.008. These results appear close enough for routine decisions. For true probabilities below 0.005 all methods tend to go to pieces. F' may give values only one-quarter of the true probability, while the two χ² values may be six or eight times too high. An exact assessment of a very small probability is rarely essential.
It may be of interest to note the probabilities given by the various approximations for
the diphtheria example in Table 2. We have already calculated χ², with P = 0.045. The exact P is 81/1536, or 0.053, while χ'² gives 0.080 and F' gives 0.062. All methods agree to the extent of indicating a probability somewhere close to the region of significance.
It has been mentioned that the value of Q, and hence of χ², is unaffected by any rows which contain c or 0 successes. This is not so for F, where the degrees of freedom (r − 1)(c − 1) in the denominator are obviously increased by the addition of rows of any kind. The value of
F itself is also affected. Without resort to details, what happens is that if we take a basic
table containing no rows with c or 0 successes, and add to it an increasing number of such rows,
the probabilities given by F' (corrected) or F (uncorrected) increase slowly until ultimately they agree with those given by χ'² and χ² respectively. This implies, incidentally, that at intermediate stages F' may give a better approximation than any of the methods previously presented, because for the basic table the probability given by χ'² is in general too high and
that by F' too low. In the eight worked examples, this was so when half of the rows were
c's or 0's. In fact, it might be possible, as a purely empirical device, to set a quota of such
rows which would be included in calculating F (whether they were actually present or not),
so as to make F or F' a good approximation to the exact probability. This approach was not
pursued since x2 seems good enough for most purposes. The approach may appear slightly
repugnant logically, but is no more so than the use of an empirical approximation to an exact
frequency distribution.
Some investigation was undertaken in an attempt to discover why, at low values of the
exact significance probability, χ² gives an overestimate of the probability. As might be expected, the principal reason seems to be that in small samples the true variance of Q is less than that ascribed to it by the χ² approximation. The true variance of Q can be obtained by the usual, rather laborious, methods.
The mean value of Q agrees with that of χ², but the variance is always slightly too low.
These results provide another approximation to the exact distribution of Q, in which instead
of the χ² distribution we use a Type III distribution with exactly the same first two moments as Q, and with, in general, non-integral degrees of freedom. This approximation was tested on the eight examples. It gave a substantial improvement for probabilities less than 0.005, but in the region between 0.2 and 0.005 was only slightly better than χ². A similar elaboration
of F produced about the same results.
As mentioned previously, the eight examples which were worked lead to a recommendation
not to use the correction for continuity with χ². This conclusion applies only when there are more than two samples. With two samples, the argument for a continuity correction is already provided by Yates's examination of the correction when used with the binomial distribution. As a check, two exact distributions with only two columns were worked out, and both showed χ'² superior to χ², though χ'² still tended to overestimate the probabilities.
A subdivision of the eight worked examples into the four examples with c = 3 and the four
examples with c = 4 or 5 indicated that the superiority of χ² over χ'² was slightly greater in the latter group. With c = 3, the average percentage errors were 23 for χ'² and 18 for χ², whereas with c = 4 or 5 the corresponding figures were 27 and 15 respectively.
In the limiting distribution all totals T_j have the same expectations, variances, and covariances when the null hypothesis holds. This implies that if we divide Σ(T_j − T̄)² into components by the usual rules for subdividing a sum of squares, each component, when multiplied by the factor which converts it to Q, will follow a χ² distribution in large samples.
This procedure requires some care in its application. The diphtheria example is not very
suitable for illustration, since the total χ² is barely significant and would probably not be
subdivided into components. The artificial example in Table 4, with data similar to those in
the diphtheria example but showing more significance, will be used.
In the frequency distribution of u_i, the rows with u_i = 4 or 0 have been omitted. Since

    Σ(T_j − T̄)² = (6)² + (15)² + (12)² + (17)² − (50)²/4 = 69,
Suppose that there is some reason to expect that A may perform differently from B, C
and D. We might then wish to divide Q into the components A v. B, C, D and B v. C v. D. For
the first component we calculate Q1 = 15.36, with 1 degree of freedom. By subtraction from the total Q, 18.81, we find Q2 to be 3.45 (2 D.F.). It represents a comparison of the totals of B, C and D.
Table 4

    Totals (T_j)    6    15    12    17
    Σu_i = 34,  Σu_i² = 92,  cΣu_i − Σu_i² = 4(34) − 92 = 44.
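With the Table 4 totals, formula (1) gives the total Q of 18.81 quoted above; a sketch of the arithmetic:

```python
T = [6, 15, 12, 17]     # Table 4 column totals
c = 4
tbar = sum(T) / c       # 50/4
ssq = sum((t - tbar) ** 2 for t in T)   # 69, as computed above
Q = c * (c - 1) * ssq / (4 * 34 - 92)   # 12 * 69 / 44, about 18.8
```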
The difficulty is that since Q1 is definitely significant, the null hypothesis that the probability of success within a row is the same in all four columns can no longer be maintained
for the development of a comparative test of B, C and D amongst themselves. It seems
better, when Q1 is significant, to recalculate Q2, using only the data from the relevant columns.
If we reject the first column, the B, C and D totals do not change, but the frequency distribution of u_i (ignoring 3's and 0's) becomes

    u_i    f
     2     7
10. SUMMARY
In this paper the familiar χ² test for comparing the percentages of successes in a number of
independent samples is extended to the situation in which each member of any sample is
matched in some way with a member of every other sample. This problem has been en-
countered in the fields of psychology, pharmacology, bacteriology and sample survey design.
A solution has been given by McNemar (1949) and others when there are only two
samples.
In the more general case, the data are arranged in a two-way table with r rows and
c columns, in which each column represents a sample and each row a matched group. The
test criterion proposed is

    Q = c(c − 1) Σ(T_j − T̄)² / (cΣu_i − Σu_i²),

where T_j is the total number of successes in the jth sample (column) and u_i the total number of
successes in the ith row. If the true probability of success is the same in all samples, the
limiting distribution of Q, when the number of rows is large, is the χ² distribution with (c − 1) degrees of freedom. The relation between this test and the ordinary χ² test, valid when
samples are independent, is discussed.
In small samples the exact distribution of Q can be constructed by regarding the row totals
as fixed, and by assuming that on the null hypothesis every column is equally likely to obtain
one of the successes in a row. This exact distribution is worked out for eight examples in
order to test the accuracy of the χ² approximation to the distribution of Q in small samples.
The number of samples ranged from c = 3 to c = 5. The average error in the estimation of
a significance probability was about 14 % in the neighbourhood of the 5 % level and about
21 % in the neighbourhood of the 1 % level. Correction for continuity did not improve the
accuracy of the approximation although it is recommended when there are only two samples.
Another approximation, obtained by scoring each success as '1' and each failure as '0', and performing an analysis of variance of the data, was also investigated. The F-test, corrected for continuity, performed about as well as the χ² approximation (uncorrected), but is slightly
more laborious.
The problem of subdividing χ² into components for more detailed tests is briefly discussed.
In conclusion, my thanks are due to Miss Elizabeth O. Grant and Mrs Elizabeth S. Jamison
for considerable assistance in the computations. This work was done as part of a contract
with the Office of Naval Research, U.S. Navy Department.
REFERENCES
COCHRAN, W. G. (1937). The efficiencies of the binomial series tests of significance of a mean and of
a correlation coefficient. J. R. Statist. Soc. 100, 69.
DENTON, J. E. & BEECHER, H. K. (1949). New analgesics. J. Amer. Med. Ass. 141, 1051.
DIXON, W. J. & MOOD, A. M. (1946). The statistical sign test. J. Amer. Statist. Ass. 41, 557.
HSU, P. L. (1949). The limiting distribution of functions of sample means and application to testing
hypotheses. Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability,
p. 359. University of California Press.
McNEMAR, Q. (1949). Psychological statistics. New York: John Wiley and Sons.
MOSTELLER, F. J. (1947). Equality of margins. Amer. Statist. 1, 12.
WALSH, J. E. (1947). Concerning the effect of intraclass correlation on certain significance tests. Ann.
Math. Statist. 18, 88.