Biometrika Trust

The Comparison of Percentages in Matched Samples


Author(s): W. G. Cochran
Source: Biometrika, Vol. 37, No. 3/4 (Dec., 1950), pp. 256-266
Published by: Oxford University Press on behalf of Biometrika Trust
Stable URL: http://www.jstor.org/stable/2332378
Accessed: 28-10-2017 17:53 UTC


This content downloaded from 186.239.80.39 on Sat, 28 Oct 2017 17:53:01 UTC
All use subject to http://about.jstor.org/terms

THE COMPARISON OF PERCENTAGES IN MATCHED SAMPLES

BY W. G. COCHRAN

Department of Biostatistics, School of Hygiene and Public Health, Johns Hopkins University

1. INTRODUCTION

The χ² test has long been used to test the significance of differences between ratios or per-
centages in two or more independent samples. It sometimes happens that each member of a
sample is matched with a corresponding member in every other sample, in the hope of securing
a more accurate comparison among the percentages. The matching may be based either on
the characteristics of the members, or on the fact that the partners in a group are subjected
to some test that is the same for all members of the group but varies from one group to
another.
Since the matching may introduce correlation between the results in different samples,
it invalidates the ordinary χ² test, which gives too few significant results if the matching is
effective. For the case where there are two samples, an appropriate test is easily constructed;
such a test has been given by McNemar (1949). In his data, 205
soldiers were asked whether they thought that the war against Japan would last more or
less than a year. They were subsequently asked the same question after a lecture on the
difficulties of the war against Japan. Matching occurs because each sample contains exactly
the same soldiers.
The replies may be classified in a 2 × 2 frequency table as shown in Table 1.

Table 1. Comparison of ratios in two samples

                           After lecture
                        Less        More      Total
    Before lecture
      Less             36 (a)      34 (b)       70
      More              0 (c)     135 (d)      135
      Total            36         169          205

Before the lecture, 70 men out of the 205 thought that the war would last less than a year,
whereas after the lecture this number has dropped to 36. The comparison which we wish to
make is that between the two frequencies 70/205 and 36/205. There are several ways in which
the test may be derived. Perhaps the easiest is to note that both numerators, 70 and 36,
contain the 36 (a) men who persisted in thinking that the war would last less than a year.
Hence, equality of the numerators would imply that the same number of men changed from
'Less' to 'More' as changed from 'More' to 'Less'. In other words, if the lecture is without

This content downloaded from 186.239.80.39 on Sat, 28 Oct 2017 17:53:01 UTC
All use subject to http://about.jstor.org/terms
W. G. COCHRAN 257

effect we would expect half the persons who changed their minds to change in one direction
and half in the other. Thus the test can be made by testing whether the numbers (b) and
(c) are binomial successes and failures out of n = (b + c) trials, with probability ½. For this

    χ² = (b − ½n)²/(½n) + (c − ½n)²/(½n) = (b − c)²/(b + c) = (34 − 0)²/(34 + 0) = 34,

with 1 degree of freedom. A correction for continuity can be applied by subtracting 1 from
the absolute value of the numerator before squaring.
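The computation above is easily sketched in Python; the function name and interface below are mine, not the paper's:

```python
# McNemar's chi-squared test for two matched samples: only the
# discordant cells b and c enter the statistic.
def mcnemar_chi2(b, c, continuity=False):
    """(b - c)^2 / (b + c), or (|b - c| - 1)^2 / (b + c) with the
    correction for continuity; 1 degree of freedom."""
    diff = abs(b - c)
    if continuity:
        diff = max(diff - 1, 0)
    return diff * diff / (b + c)

# Table 1: 34 men changed from 'Less' to 'More', none the other way.
print(mcnemar_chi2(34, 0))                   # (34 - 0)^2 / 34 = 34.0
print(mcnemar_chi2(34, 0, continuity=True))  # (34 - 1)^2 / 34 ≈ 32.03
```

A value of 34 on 1 degree of freedom lies far beyond any conventional significance level.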
The two-sample case has also been discussed in a study by Denton & Beecher (1949),
where the object was to find out whether subjects reacted more frequently to an injection of
a new drug than they did to one of isotonic sodium chloride, which was used as a control.
They give a χ² test, attributed to Mosteller, which differs slightly from that given above.
The object of this paper is to extend the test to the situation where there are more than
two samples. An example is provided in some studies of variability among interviewers in
sample surveys. Each interviewer called at a different group of houses, but any house assigned
to an interviewer was matched with one of the houses assigned to each other interviewer
according to the characteristics of the housewife. A test of whether the percentage of 'yes'
answers to some question differed from interviewer to interviewer is a test of the type that
we are considering.
In a second example, the effectiveness of a number of different media for the growth of
diphtheria bacilli was investigated by the Communicable Disease Centre, U.S. Public Health
Service. In one series, specimens were taken from the throats of sixty-nine suspected cases.
Each specimen was grown on each of four media A, B, C, D. The probability that growth
takes place will depend on the number of diphtheria bacilli present, and in a number of
cases there might well be no bacilli present.

Table 2. Method of presentation suited to more than two columns

    Diphtheria media
    A   B   C   D    No. of cases
    1   1   1   1         4
    1   1   0   1         2
    0   1   1   1         3
    0   1   0   1         1
    0   0   0   0        59
    Totals (T_j):    6  10   7  10

    Soldiers' replies
    Before   After    No. of cases
      1        1        135 (d)
      0        1         34 (b)
      0        0         36 (a)
    Totals:  135  169

Results are shown in Table 2.* Where there are four media, the 2 x 2 table does not seem
well adapted to a succinct presentation. Instead, each medium is allotted to one column of
the table. A 1 denotes that growth occurred with that medium, a 0 that no growth occurred.
Thus in Table 2 there were four specimens in which all four media exhibited growth, two
specimens in which media A, B and D, but not C, showed growth, and so on. To illustrate
* I wish to thank Dr Martin Frobisher, Chief, Bacteriology Laboratories, Communicable Disease
Centre, U.S. Public Health Service, Atlanta, for permission to use these data for illustration.

the relation to the method of presentation in a 2 x 2 table, McNemar's results are also shown
in this form, where a 1 denotes the answer 'more than a year'.
The column totals are the total numbers of 1's. The problem is to test whether these totals
differ significantly among media.

2. MATHEMATICAL FRAMEWORK

For a discussion of the theory of the test we shall adopt a less concise method of presentation
than that given in Table 2. Each matched group will be placed in a different row of the table.
Thus the table for the diphtheria data would contain 69 rows and 4 columns. The probability
of a 1 is presumed to vary from row to row, usually in a manner that is known only vaguely.
Nevertheless, an exact test can be developed by the familiar method in which the population
is generated by randomization. The observed total number u_i of successes (1's) in the ith
row is regarded as fixed. If the null hypothesis is true, every one of the c columns is con-
sidered equally likely to obtain one of these successes. The population of possible results in
the ith row consists of the (c choose u_i) ways in which the u_i successes can be distributed
among the c columns.
This specification has one consequence that might be questioned. If a row contains no
successes, or c successes, the population generated in that row consists only of the single
case that actually occurred. As will be seen, this implies that such rows play no part in the
test of significance. This is evident in the two-sample test, which makes no use of the number
of cases a and d in the cells of Table 1 where there was no change of opinion. On the other
hand, for given values of b and c, one might feel intuitively that significance ought to be more
definitely established if there are no cases in which the samples give the same result (i.e. a
and d are zero) than if there are a large number of such cases. Whether this feeling is sound is
perhaps debatable, and I do not see how weight can be given to it without losing the advantage
of an exact test.
The test criterion that will be used is Σ(T_j − T̄)², where T_j is the total number of successes
in the jth column. This is the same criterion as in the ordinary χ² test for the situation where
the columns are independent. It may not be the best criterion. For the usefulness of the
data from a row for the purpose of detecting differences among columns may depend on the
probability of success in the row. That is, the situation may be similar to that which occurs
in dosage-mortality experiments, in which, for maximum sensitivity per observation,
comparisons of two drugs must be made close to the median lethal doses. This suggests that
in extensive data it might be advisable to group the rows according to the value of u_i and
to perform some kind of weighting on the T_j values for different values of u_i. I have occasion-
ally used this approach, but it may be difficult to decide what form the weighting should take,
particularly in a new type of experimentation. A test based on the unweighted totals will
often serve our purpose.
3. THE LIMITING DISTRIBUTION

We consider first the limiting distribution of the test criterion when the number of rows r is
very large. Let the variate x_ij take the value 1 if there is a success in the cell in the ith row
and jth column, and 0 if there is a failure. By the properties of the randomization in that
row, these two events occur with probabilities u_i/c and 1 − u_i/c, respectively. Hence

    E(x_ij) = u_i/c,    σ²(x_ij) = (u_i/c)(1 − u_i/c).


By symmetry, the covariance is the same for any two cells in the same row. Since the row
total of the x_ij is fixed at u_i and thus has zero variance, the covariance of x_ij and x_ik is found
to be

    cov(x_ij, x_ik) = −(u_i/c)(1 − u_i/c)/(c − 1)    (j ≠ k).

These results enable us to arrive, by non-rigorous methods, at the form of the limiting
distribution. Since the randomization is independent in different rows, the means, variances

and covariances of the column totals T_j will be the corresponding expressions above, summed
over the rows. If the number of rows is large, the joint distribution of these totals may be
expected to tend to the multivariate normal. Finally, if a set of c variates T_j follow a multi-
variate normal distribution with common variance σ² and common covariance ρσ², it is
well known that Σ(T_j − T̄)² is distributed as σ²(1 − ρ)χ², with (c − 1) degrees of freedom
(Walsh, 1947). In this case

    σ² = Σ_i (u_i/c)(1 − u_i/c),    ρ = −1/(c − 1),

so that

    σ²(1 − ρ) = [c/(c − 1)] Σ_i (u_i/c)(1 − u_i/c).

Hence, when the number of rows is large,

    Q = (c − 1) Σ_j (T_j − T̄)² / Σ_i u_i(1 − u_i/c) = c(c − 1) Σ_j (T_j − T̄)² / (c Σ_i u_i − Σ_i u_i²)    (1)

is distributed as χ² with (c − 1) D.F.
A rigorous proof of this result may be obtained by the method developed by Hsu (1949)
and will not be given here. The only restriction needed is a rather obvious one, to guard
against the possibility that, as the number of rows tends to infinity, the value of u_i might be
c or 0 in all but a finite number of rows. If this happens, the size of the population is still finite
in the limit, because permutations within rows having u_i = c or 0 do not generate any new
cases. This situation is avoided by stipulating that for at least one intermediate value of u_i
the number of rows having that value must tend to infinity.
When there are only two samples (c = 2) the test reduces to that given in §1. If a 1
denotes a reply of 'more than a year' it will be seen from Table 1 that

    T₁ = c + d,    T₂ = b + d,
    Σu_i = b + c + 2d,    Σu_i² = b + c + 4d.

From (1), Q becomes

    Q = (2)(1)·½(T₁ − T₂)²/(b + c) = (b − c)²/(b + c).
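Formula (1) and its two-sample reduction can be checked with a short Python sketch (the function name is mine):

```python
# Cochran's Q of formula (1), computed from a table of 0/1 rows,
# one matched group per row and one sample per column.
def cochran_q(rows):
    c = len(rows[0])
    T = [sum(col) for col in zip(*rows)]        # column totals T_j
    u = [sum(row) for row in rows]              # row totals u_i
    Tbar = sum(T) / c
    num = c * (c - 1) * sum((t - Tbar) ** 2 for t in T)
    den = c * sum(u) - sum(ui * ui for ui in u)
    return num / den

# Table 1 coded with 1 = 'more than a year': a = 36 rows (0, 0),
# b = 34 rows (0, 1), c = 0 rows (1, 0), d = 135 rows (1, 1).
table = [(0, 0)] * 36 + [(0, 1)] * 34 + [(1, 1)] * 135
print(cochran_q(table))   # equals (b - c)^2 / (b + c) = 34.0
```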

4. COMPARISON WITH THE ORDINARY χ² TEST

In the ordinary χ² test, valid when the samples are independent, we have

    χ² = Σ_j (T_j − T̄)² / [r(ū/c)(1 − ū/c)],    (2)

where ū = Σu_i/r.

Under what conditions does this test coincide with the new (Q) test? It might be anti-
cipated that this should happen when the probability of success does not change from row
to row. The results are in line with this expectation.
Consider the application of both tests to a series of tables, all of which have the same set
of row totals. From (1) and (2) the new test gives a greater, an equal or a smaller number of
significant results than the ordinary test, according as

    Σ_i u_i(1 − u_i/c) is less than, equal to, or greater than (c − 1) r (ū/c)(1 − ū/c).    (3)

Since Σu_i = rū, the left-hand side may be expressed as

    rū(1 − ū/c) − (1/c) Σ_i (u_i − ū)².

It follows that relations (3) are equivalent to

    rū(1 − ū/c) less than, equal to, or greater than Σ_i (u_i − ū)².    (4)

If we wish to test the null hypothesis that the probability of success is the same in all
rows, this could be done by an ordinary χ² test on the row totals u_i. Since rows are independent,
the value of χ² would be

    χ_r² = Σ_i (u_i − ū)² / [ū(1 − ū/c)],

with (r − 1) degrees of freedom. Thus relations (4) can be written

    r less than, equal to, or greater than χ_r².    (5)

The expected value of χ_r² is (r − 1), which is practically r when the number of rows
is large. Thus the two tests give about the same number of significant results when the χ² for row
totals is just about equal to its expectation. The analysis also shows that the Q test gives
more significant results when χ_r² exceeds its expectation, and fewer significant results when
χ_r² is below expectation.
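The step from (3) to (4) rests on the identity Σu_i(1 − u_i/c) = rū(1 − ū/c) − (1/c)Σ(u_i − ū)², which is easy to verify numerically; the row totals below are made up for illustration:

```python
# Check the identity used in passing from relations (3) to (4).
c = 4
u = [4, 3, 3, 2, 1, 0, 2, 3]     # arbitrary row totals, 0 <= u_i <= c
r = len(u)
ubar = sum(u) / r

lhs = sum(ui * (1 - ui / c) for ui in u)
rhs = r * ubar * (1 - ubar / c) - sum((ui - ubar) ** 2 for ui in u) / c
print(lhs, rhs)   # the two sides agree (both 5.0 here)
```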

5. APPLICATION TO THE EXAMPLE

In the example (Table 2) we have c = 4,

    Σ(T_j − T̄)² = 6² + 10² + 7² + 10² − (33)²/4 = 12.75.

To find the denominator of Q, a separate frequency distribution of the values of the row
totals u_i may be made.

    Value of u_i    Frequency
         4              4
         3              5
         2              1
         0             59

    Σ_i u_i = Σ_j T_j = 33,    Σ_i u_i² = 113.

Hence, from (1),

    Q = (4)(3)(12.75)/[4(33) − 113] = 8.05,

with 3 degrees of freedom, corresponding to a probability of 0.045.
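The arithmetic can be reproduced from the frequency distribution alone. In the Python sketch below, the closed-form upper tail of χ² with 3 degrees of freedom is a standard identity, not something taken from the paper:

```python
import math

# Diphtheria data: frequency distribution of the row totals u_i,
# and the column totals T_j from Table 2.
freq = {4: 4, 3: 5, 2: 1, 0: 59}
T = [6, 10, 7, 10]
c = 4

su = sum(u * f for u, f in freq.items())        # sum of u_i   = 33
su2 = sum(u * u * f for u, f in freq.items())   # sum of u_i^2 = 113
Tbar = sum(T) / c
ss = sum((t - Tbar) ** 2 for t in T)            # 12.75
Q = c * (c - 1) * ss / (c * su - su2)           # 153/19 ≈ 8.05

# Upper-tail probability of chi-squared with 3 d.f.:
# P(X >= x) = erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)
p = math.erfc(math.sqrt(Q / 2)) + math.sqrt(2 * Q / math.pi) * math.exp(-Q / 2)
print(round(Q, 2), round(p, 3))   # 8.05 0.045
```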


It is easy to show algebraically, as may be verified in this example, that the value of Q
is not altered if we omit all rows in which u_i = c or 0. In this respect, Q behaves as we should
expect a test based on the randomization process to behave.
In the example, 63 of the 69 rows have 4 or 0 successes, so that only 6 rows really con-
tribute to the frequency distribution of Q. It may be doubted whether a limiting distribution
which assumes a large number of such rows can be applied here. This question is discussed
in §6.
6. THE DISTRIBUTION OF Q IN SMALL SAMPLES

If there are only two columns, an exact small-sample test presents no difficulty. The Q test
is essentially equivalent to the sign test (Cochran, 1937; Dixon & Mood, 1946), for which
tables are available in the references cited.* This can be seen by the argument used in §1.
Apart from its divisor, Q is (T₁ − T₂)², i.e. the square of the difference between the number of
successes in the two columns. We may ignore all rows that contain either 2 or 0 successes,
since these do not affect the value of Q. Consequently (T₁ − T₂) is the difference between the
number r₁ of rows in which the results are (1, 0) and the number r₂ in which they are (0, 1).
If n = r₁ + r₂, this difference equals (2r₁ − n).
For any row that has one success, the probabilities of a (1, 0) and of a (0, 1) on the null
hypothesis are both ½. This shows that r₁ is distributed in the binomial (½ + ½)ⁿ, which is the
quantity that is tabulated in the sign test.
For an exact test when c = 2, the procedure is therefore as follows: (i) ignore all rows with
2 or 0 successes; (ii) count the number of rows with a single success in the first column, and
refer to the tables of the sign test, where n is the total number of rows that have one success.
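The two-step procedure amounts to a binomial tail computation, which can be sketched as follows (function name mine, standard library only):

```python
from math import comb

# Two-sided exact sign test: probability, under p = 1/2, of a split
# at least as unbalanced as r1 successes out of n discordant rows.
def sign_test_p(r1, n):
    k = min(r1, n - r1)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# McNemar's data: of the n = 34 men who changed their minds,
# r1 = 0 changed from 'More' to 'Less'.
print(sign_test_p(0, 34))   # 2 * (1/2)^34, about 1.2e-10
```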
In small tables that have more than two columns, the exact distribution of Q can be
tabulated by enumerating all configurations generated by the randomization. Since the
number of possible cases is large, a comprehensive listing of exact significance levels would
be laborious to construct. As a check on the accuracy of the limiting distribution in small
samples, the exact distribution of Q was worked out for the following eight cases:

    c = 3, r = 10; 2⁵1⁵        c = 4, r = 6;  3⁵2¹
    c = 3, r = 10; 2¹1⁹        c = 4, r = 9;  3³2³1³
    c = 3, r = 11; 2¹1¹⁰       c = 4, r = 10; 3³2³1⁴
    c = 3, r = 16; 2¹1¹⁵       c = 5, r = 8;  4²3²2²1²

The figures following the semicolon are the u_i values: e.g. 2⁵1⁵ means that u_i = 2 in five
of the rows and 1 in the remaining five. No case in which u_i = c or 0 was included, since any
number of such rows may be added to the basic table without affecting the value of Q.
Some of the cases are rather closely related in their structure. Nevertheless, it seemed
best to include all of them in presenting summary comparisons. The cases were chosen as
indicative of the smallest samples in which the χ² approximation to the distribution of Q is
likely to be needed. Smaller samples can of course occur in practice, but in this event it is
relatively easy to make an exact test of significance from the exact distribution of Q.
The exact distribution was compared not only with the χ² approximation, but also with
an F-test applied to the data by means of an analysis of variance into the components

                            D.F.
    Rows                   (r − 1)
    Columns                (c − 1)
    Rows × columns         (r − 1)(c − 1)
* This has been pointed out by Mosteller (1947).


where F is the ratio of the mean squares for columns and rows × columns. If the data had
been measured variables that appeared normally distributed, instead of a collection of 1's
and 0's, the F-test would be almost automatically applied as the appropriate method.
Without having looked into the matter, I had once or twice suggested to research workers
that the F-test might serve as an approximation even when the table consists of 1's and 0's.
As a testimony to the modern teaching of statistics, this suggestion was received with
incredulity, the objection being made that the F-test requires normality, and that a mixture
of 1's and 0's could not by any stretch of the imagination be regarded as normally distributed.
The same workers raised no objection to a χ² test, not having realized that both tests require
to some extent an assumption of normality, and that it is not obvious whether F or χ² is
more sensitive to the assumption. Inclusion of the F-test is also worth while in view of the
widespread interest in the application of the analysis of variance to non-normal data.
The total number of values in a population is sufficiently small so that correction for
continuity makes an appreciable difference. Application of the correction requires a little
inspection of the data. Usually the values of Σ(T_j − T̄)² increase by 2's, but with c = 2 or 3
the increase may be much greater, and it is necessary to discover what is the value of Q
immediately below the one actually obtained, and enter the table with a value midway
between the two. For χ² the results are given both with and without correction, since
experience in other problems has suggested that the correction may not be helpful when
there are more than two samples. For F the correction was a decided improvement and only
corrected values are shown.*
It is easy to build up the exact distribution row by row. Members of the first row need not
be permuted, but all other rows must be. Consider the diphtheria example in Table 2. If
the sixty-three rows which show either all positives or no positives are omitted, this becomes
the case c = 4, r = 6; 3⁵2¹. We start with the row (1110) and add successively four rows with
u_i = 3 and one row with u_i = 2. Addition of the second row gives the four cases

    1110    1110    1110    1110
    0111    1011    1101    1110
    ----    ----    ----    ----
    1221    2121    2211    2220

At this stage the possible sets of column totals are (2220) with probability 1/4 and (2211)
with probability 3/4. All permutations of the third row are now added, and so on. The total
number of cases is (4⁴)(6), or 1536, but these combine to give only nine different values of Q.
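The build-up can also be checked by brute force. The sketch below enumerates the full product space of 4⁵ × 6 = 6144 equally likely arrangements instead of fixing the first row; by the symmetry of the columns the two enumerations give the same probabilities:

```python
from itertools import product

C = 4
U = [3, 3, 3, 3, 3, 2]                      # row totals: the case 3^5 2^1
SU = sum(U)                                 # 17
SU2 = sum(u * u for u in U)                 # 49

def arrangements(u):
    """All distinct 0/1 rows with u successes among C cells."""
    return [r for r in product((0, 1), repeat=C) if sum(r) == u]

def q_stat(rows):
    T = [sum(col) for col in zip(*rows)]
    Tbar = sum(T) / C
    ss = sum((t - Tbar) ** 2 for t in T)
    return C * (C - 1) * ss / (C * SU - SU2)

# Observed table: Table 2 with the 63 constant rows dropped.
observed = [(1, 1, 0, 1)] * 2 + [(0, 1, 1, 1)] * 3 + [(0, 1, 0, 1)]
q_obs = q_stat(observed)                    # same 8.05 as before

# Every equally likely randomization outcome, and the exact P.
space = list(product(*[arrangements(u) for u in U]))
p_exact = sum(q_stat(rows) >= q_obs - 1e-9 for rows in space) / len(space)
print(len(space), round(p_exact, 3))        # 6144 outcomes; P ≈ 0.053
```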

7. RESULTS OF THE SMALL-SAMPLE TESTS

In appraising the tabular χ² and F approximations, attention was concentrated on the region
in which the exact probability lies between 0.2 and 0.005. Table 3 shows the average per-
centage errors in the estimates of significance probabilities for each of three subdivisions of
the region. The percentage error is

    100 (tabular P − true P)/(true P),

where the averages are taken without regard to sign. The numbers of overestimates and
underestimates made by each approximation are also shown. In Table 3, χ² denotes the
uncorrected value, and χ′² and F′ denote the values after correction for continuity.

* Actually, the incomplete beta function rather than F was corrected for continuity, since the former
was more convenient for reading significance probabilities. Results differ slightly, but not materially,
from those given by correcting F itself.

This content downloaded from 186.239.80.39 on Sat, 28 Oct 2017 17:53:01 UTC
All use subject to http://about.jstor.org/terms
W. G. COCHRAN 263

From the percentage errors it appears that χ² (uncorrected) and F′ have performed about
equally well, both being slightly better than χ′². None of the methods is free from bias. χ′²
tends to overestimate and F′ to underestimate. Over the range as a whole χ² comes off fairly
well, with 23 overestimates and 32 underestimates, but it appears that a negative bias in
the region of 0.2 to 0.1 is being counteracted by a positive bias in the region of 0.02 to 0.005.
For practical use χ² is preferable to F′, since it is slightly easier to calculate, though the
possible application of F′ to more complex tables should be borne in mind.

Table 3. Average percentage errors in estimating significance probabilities

                      Percentage error    No. of overestimates (+) and underestimates (−)
    Range of          ----------------    ------------------------------------------------
    exact P           χ²    χ′²    F′       χ²          χ′²          F′
                                           +    −      +    −      +    −
    0.2 – 0.1         15     7      5      1   10      7    4      6    5
    0.1 – 0.02        14    18     15      7   19     21    5      7   19
    0.02 – 0.005      21    46     26     15    3     17    1      1   17
    Average or total  16    25     17     23   32     45   10     14   41

At the true 5% level, average errors of about 14% were found. This means, roughly,
that the tabular approximations might give a value of 0.057 or 0.043 instead of 0.05. At the
1% level, the corresponding figures are about 0.012 and 0.008. These results appear close
enough for routine decisions. For true probabilities below 0.005 all methods tend to go to
pieces. F′ may give values only one-quarter of the true probability, while the two χ² values
may be six or eight times too high. An exact assessment of a very small probability is rarely
essential.
It may be of interest to note the probabilities given by the various approximations for
the diphtheria example in Table 2. We have already calculated χ², with P = 0.045. The
exact P is 81/1536, or 0.053, while χ′² gives 0.080 and F′ gives 0.062. All methods agree to
the extent of indicating a probability somewhere close to the region of significance.

8. FURTHER NOTES ON THE SMALL-SAMPLE CASE

It has been mentioned that the value of Q, and hence of χ², is unaffected by any rows which
contain c or 0 successes. This is not so for F, where the degrees of freedom (r − 1)(c − 1) in
the denominator are obviously increased by the addition of rows of any kind. The value of
F itself is also affected. Without resort to details, what happens is that if we take a basic
table containing no rows with c or 0 successes, and add to it an increasing number of such rows,
the probabilities given by F′ (corrected) or F (uncorrected) increase slowly until ultimately
they agree with those given by χ′² and χ² respectively. This implies, incidentally, that at
intermediate stages F′ may give a better approximation than any of the methods previously
presented, because for the basic table the probability given by χ′² is in general too high and
that by F′ too low. In the eight worked examples, this was so when half of the rows were

c's or 0's. In fact, it might be possible, as a purely empirical device, to set a quota of such
rows which would be included in calculating F (whether they were actually present or not),
so as to make F or F′ a good approximation to the exact probability. This approach was not
pursued, since χ² seems good enough for most purposes. The approach may appear slightly
repugnant logically, but is no more so than the use of an empirical approximation to an exact
frequency distribution.
Some investigation was undertaken in an attempt to discover why, at low values of the
exact significance probability, χ² gives an overestimate of the probability. As might be
expected, the principal reason seems to be that in small samples the true variance of Q is
less than that ascribed to it by the χ² approximation. The true variance of Q can be obtained
by the usual rather laborious methods. We find E(Q) = (c − 1), while V(Q) is a considerably
more complicated expression in the power sums of the u_i; its exact form is not reproduced here.
The mean value of Q agrees with that of χ², but the variance is always slightly too low.
These results provide another approximation to the exact distribution of Q, in which instead
of the χ² distribution we use a Type III distribution with exactly the same first two moments
as Q, and with, in general, non-integral degrees of freedom. This approximation was tested
on the eight examples. It gave a substantial improvement for probabilities less than 0.005,
but in the region between 0.2 and 0.005 was only slightly better than χ². A similar elaboration
of F produced about the same results.
As mentioned previously, the eight examples which were worked lead to a recommendation
not to use the correction for continuity with χ². This conclusion applies only when there are
more than two samples. With two samples, the argument for a continuity correction is
already provided by Yates's examination of the correction when used with the binomial
distribution. As a check, two exact distributions with only two columns were worked out,
and both showed χ′² superior to χ², though χ′² still tended to overestimate the probabilities.
A subdivision of the eight worked examples into the four examples with c = 3 and the four
examples with c = 4 or 5 indicated that the superiority of χ² over χ′² was slightly greater
in the latter group. With c = 3, the average percentage errors were 23 for χ′² and 18 for χ²,
whereas with c = 4 or 5 the corresponding figures were 27 and 15 respectively.

9. SUBDIVISION OF χ² INTO COMPONENTS

In the limiting distribution all totals T_j have the same expectations, variances, and co-
variances when the null hypothesis holds. This implies that if we divide Σ(T_j − T̄)² into
components by the usual rules for subdividing a sum of squares, each component, when
multiplied by the factor which converts it to Q, will follow a χ² distribution in large samples.
This procedure requires some care in its application. The diphtheria example is not very
suitable for illustration, since the total χ² is barely significant and would probably not be
subdivided into components. The artificial example in Table 4, with data similar to those in
the diphtheria example but showing more significance, will be used.
In the frequency distribution of u_i, the rows with u_i = 4 or 0 have been omitted. Since

    Σ(T_j − T̄)² = (6)² + (15)² + (12)² + (17)² − (50)²/4 = 69,

we find

    Q = (4)(3)(69)/44 = (3)(69)/11 = 18.81, with 3 D.F.


Suppose that there is some reason to expect that A may perform differently from B, C
and D. We might then wish to divide Q into the components A v. B, C, D and B v. C v. D. For
the first component we calculate

    [3(6) − 44]²/12 = 56.33,    Q₁ = (3)(56.33)/11 = 15.36 (1 D.F.).

By subtraction from the total Q, 18.81, we find Q₂ to be 3.45 (2 D.F.). It represents a com-
parison of the totals of B, C and D.

Table 4. Artificial example to illustrate subdivision of χ²

    Totals (T_j):   A = 6,  B = 15,  C = 12,  D = 17

    Frequency distribution of u_i (rows with u_i = 4 or 0 omitted):

        u_i    f
         3     8
         2     5

    Σu_i = 34,    Σu_i² = 92,    cΣu_i − Σu_i² = 4(34) − 92 = 44.

The difficulty is that since Q₁ is definitely significant, the null hypothesis that the pro-
bability of success within a row is the same in all four columns can no longer be maintained
for the development of a comparative test of B, C and D amongst themselves. It seems
better, when Q₁ is significant, to recalculate Q₂, using only the data from the relevant columns.
If we reject the first column, the B, C and D totals do not change, but the frequency dis-
tribution of u_i (ignoring 3's and 0's) becomes

        u_i    f
         2     7

    Σu_i = 14,    Σu_i² = 28.

Hence

    Q₂′ = (3)(2)(12.67)/[3(14) − 28] = (3)(12.67)/7 = 5.43,

where 12.67 = (15)² + (12)² + (17)² − (44)²/3 is the sum of squares of the B, C, D totals.

The difference between Q₂ and Q₂′ arises solely in the conversion factor,
which has been altered from 3/11 to 3/7. This changes the significance probability from
0.178 for Q₂ to 0.066 for Q₂′. The exact probability, computed from the data for B, C and D
alone, is found to be 0.078.
From such cases as I have examined, the ordinary rule for the subdivision of the sum of
squares, and hence of Q, appears good enough for a preliminary inspection. When the
situation is similar to that in this example, the advisability of recomputing tests that are of
special interest should be noted.
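The arithmetic of this section, including the recomputed Q₂′, can be sketched as follows (Python; variable names are mine):

```python
# Q of formula (1) from column totals and the sums of u_i and u_i^2.
def q_from_summary(T, su, su2):
    c = len(T)
    Tbar = sum(T) / c
    ss = sum((t - Tbar) ** 2 for t in T)
    return c * (c - 1) * ss / (c * su - su2)

# Table 4: totals 6, 15, 12, 17; sum u_i = 34, sum u_i^2 = 92.
T = [6, 15, 12, 17]
Q = q_from_summary(T, 34, 92)               # 207/11 ≈ 18.81, 3 d.f.

# Component A v. B, C, D: contrast L = 3*T_A - (T_B + T_C + T_D),
# sum of squares L^2/12, converted by the same factor 12/44 = 3/11.
L = 3 * T[0] - sum(T[1:])
Q1 = (L * L / 12) * 12 / 44                 # 169/11 ≈ 15.36, 1 d.f.
Q2 = Q - Q1                                 # 38/11 ≈ 3.45, 2 d.f.

# Recomputed comparison of B, C, D alone: 7 usable rows, all with
# u_i = 2, so sum u_i = 14 and sum u_i^2 = 28.
Q2_prime = q_from_summary(T[1:], 14, 28)    # 38/7 ≈ 5.43, 2 d.f.
print(round(Q, 2), round(Q1, 2), round(Q2, 2), round(Q2_prime, 2))
```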

10. SUMMARY

In this paper the familiar χ² test for comparing the percentages of successes in a number of
independent samples is extended to the situation in which each member of any sample is
matched in some way with a member of every other sample. This problem has been en-
countered in the fields of psychology, pharmacology, bacteriology and sample survey design.

A solution has been given by McNemar (1949) and others when there are only two
samples.
In the more general case, the data are arranged in a two-way table with r rows and
c columns, in which each column represents a sample and each row a matched group. The
test criterion proposed is

    Q = c(c − 1) Σ_j (T_j − T̄)² / (c Σ_i u_i − Σ_i u_i²),

where T_j is the total number of successes in the jth sample (column) and u_i the total number of
successes in the ith row. If the true probability of success is the same in all samples, the
limiting distribution of Q, when the number of rows is large, is the χ² distribution with (c − 1)
degrees of freedom. The relation between this test and the ordinary χ² test, valid when
samples are independent, is discussed.
In small samples the exact distribution of Q can be constructed by regarding the row totals
as fixed, and by assuming that on the null hypothesis every column is equally likely to obtain
one of the successes in a row. This exact distribution is worked out for eight examples in
order to test the accuracy of the χ² approximation to the distribution of Q in small samples.
The number of samples ranged from c = 3 to c = 5. The average error in the estimation of
a significance probability was about 14% in the neighbourhood of the 5% level and about
21% in the neighbourhood of the 1% level. Correction for continuity did not improve the
accuracy of the approximation, although it is recommended when there are only two samples.
Another approximation, obtained by scoring each success as 1 and each failure as 0, and
performing an analysis of variance of the data, was also investigated. The F-test, corrected
for continuity, performed about as well as the χ² approximation (uncorrected), but is slightly
more laborious.
The problem of subdividing χ² into components for more detailed tests is briefly discussed.

In conclusion, my thanks are due to Miss Elizabeth O. Grant and Mrs Elizabeth S. Jamison
for considerable assistance in the computations. This work was done as part of a contract
with the Office of Naval Research, U.S. Navy Department.

REFERENCES

COCHRAN, W. G. (1937). The efficiencies of the binomial series tests of significance of a mean and of
a correlation coefficient. J. R. Statist. Soc. 100, 69.
DENTON, J. E. & BEECHER, H. K. (1949). New analgesics. J. Amer. Med. Ass. 141, 1051.
DIXON, W. J. & MOOD, A. M. (1946). The statistical sign test. J. Amer. Statist. Ass. 41, 557.
Hsu, P. L. (1949). The limiting distribution of functions of sample means and application to testing
hypotheses. Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability,
p. 359. University of California Press.
McNEMAR, Q. (1949). Psychological statistics. New York: John Wiley and Sons.
MOSTELLER, F. J. (1947). Equality of margins. Amer. Statist. 1, 12.
WALSH, J. E. (1947). Concerning the effect of intraclass correlation on certain significance tests. Ann.
Math. Statist. 18, 88.
