Analysis of 2 × K Tables
Shiva Gautam
Harvard Medical School, Boston, Massachusetts, U.S.A.
Online Publication Date: 25 August 2004
INTRODUCTION
Data in 2 × K contingency tables are encountered quite frequently in biomedical, epidemiological, social, and behavioral studies. The variable representing the two rows is often called the row variable, whereas the variable representing the K columns is called the column variable. (Representing a data set in a 2 × K or a K × 2 table is just a matter of convenience.) Depending on the research design, both the row and column variables may be outcome (response) variables, or only one of them may be an outcome variable. More specifically, an observation may have been simultaneously categorized into one of the two rows and into one of the K column categories, or an observation may have been first drawn from a given category of one of the variables (row or column) and then classified into one of the categories of the other variable (column or row). For example, without taking into account the pros and cons of study designs, consider a possible study to evaluate the association between smoking and lung cancer. The investigator may choose a design in which he/she first selects two groups of people according to whether or not they have lung cancer. Then each subject is classified into one of the smoking history categories (e.g., nonsmoker, light smoker, heavy smoker, etc.). Similarly, the investigator may first select people according to smoking status, and then classify each subject from each smoking group according to whether or not he/she has lung cancer. Finally, the investigator may select a fixed number of subjects and then simultaneously classify them into one of the two lung cancer categories and into one of the several smoking categories. In many situations, the same computational procedures can be used to analyze the data regardless of the study design.
In the analysis of 2 × K tables it is important to distinguish between a nominal table and an ordinal table. The data from the lung cancer and smoking study alluded to above give rise to an ordinal 2 × K table, as the columns of the table (e.g., nonsmoker, light smoker, heavy smoker, etc.) follow an (increasing) ordering or a hierarchy, in the sense that any one category is either at a higher level or at a lower level than any of the other categories. Such an ordering among categories is sometimes also called simple ordering. A 2 × K
Encyclopedia of Biopharmaceutical Statistics
DOI: 10.1081/E-EBS 120023105
Copyright © 2004 by Marcel Dekker, Inc. All rights reserved.
ANALYSIS OF 2 × K NOMINAL TABLES
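Let $n_{ij}$ denote the number of observations in the $i$th row ($i = 1, 2$) and $j$th column ($j = 1, 2, \ldots, K$) as displayed in Table 1. Also, let $n_{i+} = \sum_{j=1}^{K} n_{ij}$, $n_{+j} = \sum_{i=1}^{2} n_{ij}$, and $n = \sum_{i=1}^{2}\sum_{j=1}^{K} n_{ij} = \sum_{i=1}^{2} n_{i+} = \sum_{j=1}^{K} n_{+j}$. The columns of ...

$$X^2 = \sum_{i=1}^{2}\sum_{j=1}^{K} \frac{(n_{ij} - \hat{y}_{ij})^2}{\hat{y}_{ij}},$$

where $\hat{y}_{ij} = n_{i+}\, n_{+j} / n$.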
The statistic $X^2$ is asymptotically distributed as a chi-square variate with $(K - 1)$ degrees of freedom. A large value of $X^2$ provides evidence against the null hypothesis. The null hypothesis is often stated as "there is no association between the row and column variables." Depending on the research question, the null hypothesis could be that the distribution of proportions in each row (two populations) is the same, or that the column proportions (K populations) are the same. As usual, the decision to reject (or not to reject) the null hypothesis is based on the p-value. Agresti[1] is an excellent source on chi-square analysis of two-way nominal categorical tables.

Table 1

         1      2      ...    K      Total
1        n11    n12    ...    n1K    n1+
2        n21    n22    ...    n2K    n2+
Total    n+1    n+2    ...    n+K    n
$$G^2 = 2\sum_{i=1}^{2}\sum_{j=1}^{K} n_{ij} \log\left(n_{ij} / \hat{y}_{ij}\right),$$

where $\hat{y}_{ij} = n_{i+}\, n_{+j} / n$. For large $n$, $G^2$ also has a chi-square distribution with $(K - 1)$ degrees of freedom. Hence both $X^2$ and $G^2$ analyses of a given data set in a 2 × K nominal table will generally yield similar results for large $n$.
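As a minimal sketch of the two statistics defined above, the following Python snippet computes the expected counts, X2 and G2 for a 2 × K table. The function names and the 2 × 3 counts are hypothetical and chosen only for illustration; cells with a zero count are taken to contribute nothing to G2, and every column total is assumed to be positive.

```python
import math

def expected_counts(table):
    """Expected counts y_ij = n_i+ * n_+j / n under no association."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_x2(table):
    """Pearson statistic: sum over cells of (n_ij - y_ij)^2 / y_ij."""
    exp = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, exp)
               for o, e in zip(obs_row, exp_row))

def likelihood_ratio_g2(table):
    """G^2 = 2 * sum of n_ij * log(n_ij / y_ij); empty cells contribute 0."""
    exp = expected_counts(table)
    return 2.0 * sum(o * math.log(o / e)
                     for obs_row, exp_row in zip(table, exp)
                     for o, e in zip(obs_row, exp_row) if o > 0)

# Hypothetical 2 x 3 counts, chosen only for illustration.
toy = [[30, 20, 10],
       [10, 20, 30]]
x2 = pearson_x2(toy)
g2 = likelihood_ratio_g2(toy)
df = len(toy[0]) - 1   # K - 1 degrees of freedom
print(f"X2 = {x2:.3f}, G2 = {g2:.3f}, df = {df}")
```

For these illustrative counts the two statistics are of similar size (about 20.0 and 20.9); either would be referred to a chi-square distribution with K - 1 = 2 degrees of freedom.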
Logistic Regression
When the two rows of a 2 × K table represent the response, logistic regression may be used to analyze the data by modeling the probability of response (e.g., present vs. absent). Let $p$ = probability of response in the first row. Define dummy variables $X_2, X_3, \ldots, X_K$ such that $X_j = 1$ if the observation is from the $j$th category ($j = 2, 3, \ldots, K$), and $X_j = 0$ otherwise. The logistic regression model can be represented as

$$\mathrm{logit}(p) = \hat{\beta}_1 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3 + \cdots + \hat{\beta}_K X_K,$$

where $\mathrm{logit}(p) = \log\left(\frac{p}{1 - p}\right)$.

Note that $\hat{\beta}_1$ is the log odds of responding in row 1 from column 1 (the reference column), or equivalently when $X_2, X_3, \ldots, X_K$ all equal 0 and the column 1 indicator $X_1$ equals 1. In other words, $\hat{\beta}_1 = \log(n_{11}/n_{21})$. From the above equation, $\hat{\beta}_j$ ($j \geq 2$) is the excess of the log odds of responding in row 1 for the $j$th column over the log odds for the first column. In other words, $\exp(\hat{\beta}_j)$ represents the odds ratio (odds of response in the $j$th column compared to odds of response in the first column).
[Note: Suppose p1 denotes the probability of lung
cancer for smoker, and p2 denotes the chance of lung
Exact Tests
An Example
Consider Table 2 from Helmes and Fekken.[8] The table is also reproduced in Agresti.[1] The table classifies psychiatric patients by their diagnosis and by whether their treatment prescribed drugs.

The Pearson chi-square statistic from Table 2 is $X^2 = 84.180$ (p < 0.0001, df = 4). This suggests an association between the diagnosis and whether or not a patient's treatment prescribed drugs. The table shows that a schizophrenic patient is most likely to be treated with drugs, followed by a patient diagnosed with active disorder. A patient with neurosis or personality disorder has an almost 50% chance of being prescribed drugs, whereas a patient classified as having special symptoms is not likely to be treated with a drug. The Pearson chi-square test rejects the hypothesis that these proportions in the population are the same. In other words, there seems to be an association between the diagnosis and whether or not the treatment prescribed a drug. The last row shows the odds ratio of being prescribed a drug for a given diagnosis compared to the odds of being prescribed a drug if a patient is diagnosed as schizophrenic.

Table 2

             Schizophrenia   Active disorder   Neurosis   Personality disorder   Special symptoms   Total
Drugs        105             12                18         47                     0                  182
No drugs     8               2                 19         52                     13                 94
Total        113             14                37         99                     13                 276
% Drugs      92.92           85.71             48.65      47.47                  0
Odds ratio   1.0             0.46              0.07       0.07                   0
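The figures quoted in this example can be checked with a short Python sketch. It takes the Drugs and No drugs counts of Table 2 as input and recomputes the Pearson statistic, the percentage treated with drugs, and the odds ratios relative to schizophrenia. The variable names are illustrative, and the p-value line relies on a closed form of the chi-square upper tail that holds only for 4 degrees of freedom; it is not a general routine.

```python
import math

# Counts from Table 2 (columns: schizophrenia, active disorder, neurosis,
# personality disorder, special symptoms).
drugs = [105, 12, 18, 47, 0]
no_drugs = [8, 2, 19, 52, 13]

n = sum(drugs) + sum(no_drugs)
col_totals = [d + nd for d, nd in zip(drugs, no_drugs)]

# Pearson X^2 with expected counts (row total * column total / n).
x2 = 0.0
for row in (drugs, no_drugs):
    row_total = sum(row)
    for observed, col_total in zip(row, col_totals):
        expected = row_total * col_total / n
        x2 += (observed - expected) ** 2 / expected

df = len(drugs) - 1
# Chi-square upper tail for df = 4 only: exp(-x/2) * (1 + x/2).
p_value = math.exp(-x2 / 2) * (1 + x2 / 2)

pct_drugs = [100 * d / t for d, t in zip(drugs, col_totals)]
ref_odds = drugs[0] / no_drugs[0]          # schizophrenia is the reference
odds_ratios = [(d / nd) / ref_odds for d, nd in zip(drugs, no_drugs)]

print(f"X2 = {x2:.2f}, df = {df}, p = {p_value:.1e}")
print("% drugs:    ", [round(p, 2) for p in pct_drugs])
print("odds ratios:", [round(o, 2) for o in odds_ratios])
```

Under these assumptions the statistic comes out at about 84.19, in line with the value reported above, and the printed percentages and odds ratios can be compared with the last two rows of Table 2.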
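Logistic Regression
Let

$$X_2 = \begin{cases} 1, & \text{if Active Disorder} \\ 0, & \text{otherwise} \end{cases} \qquad X_3 = \begin{cases} 1, & \text{if Neurosis} \\ 0, & \text{otherwise} \end{cases}$$

$$X_4 = \begin{cases} 1, & \text{if Personality Disorder} \\ 0, & \text{otherwise} \end{cases} \qquad X_5 = \begin{cases} 1, & \text{if Special Symptoms} \\ 0, & \text{otherwise} \end{cases}$$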
$$r_{\max} = \sqrt{X^2 / n} = \sqrt{84.180 / 276} = 0.5523.$$

This shows that the observed significant association is not solely due to a large sample size. A natural question that may arise is whether some of the categories can be combined without losing much information.[7] An investigator may be further interested in determining whether this association is mostly due to only a few selected categories of the table. Gautam
Variable    β̂          SE(β̂)        p-value   Exp(β̂)
X2          -0.783     0.847        0.356     0.457
X3          -2.629     0.493        0.000     0.072
X4          -2.676     0.418        0.000     0.069
X5          -23.777    11,147.524   0.998     0.000
Constant    2.575      0.367        0.000     13.125
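Because the model has one parameter per column, its estimates have a closed form, and the fitted values above can be checked by hand. The sketch below assumes the Table 2 counts with schizophrenia as the reference column, and uses the usual large-sample standard error of a log odds ratio (square root of the sum of reciprocal cell counts). The function name is hypothetical. A column with an empty cell has no finite estimate, which is what the very large coefficient and standard error reported for X5 reflect.

```python
import math

# Counts from Table 2; column 1 (schizophrenia) is the reference category.
drugs = [105, 12, 18, 47, 0]
no_drugs = [8, 2, 19, 52, 13]

def saturated_logit_fit(resp, nonresp):
    """Closed-form estimates for logit(p) = b1 + b2*X2 + ... + bK*XK.

    Returns (estimate, standard error) pairs, constant first. A column with
    an empty cell has no finite estimate and is reported as None.
    """
    b1 = math.log(resp[0] / nonresp[0])
    var1 = 1 / resp[0] + 1 / nonresp[0]
    fit = [(b1, math.sqrt(var1))]
    for r, s in zip(resp[1:], nonresp[1:]):
        if r == 0 or s == 0:
            fit.append(None)           # the log odds is infinite
            continue
        bj = math.log(r / s) - b1      # log odds ratio versus column 1
        se = math.sqrt(var1 + 1 / r + 1 / s)
        fit.append((bj, se))
    return fit

fit = saturated_logit_fit(drugs, no_drugs)
labels = ["Constant", "X2", "X3", "X4", "X5"]
for label, est in zip(labels, fit):
    if est is None:
        print(f"{label}: no finite estimate (empty cell)")
    else:
        b, se = est
        print(f"{label}: b = {b:.3f}, SE = {se:.3f}, exp(b) = {math.exp(b):.3f}")
```

The exponentiated coefficients are the same odds ratios that appear in the last row of Table 2, since each one compares a diagnosis with the reference column.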
ANALYSIS OF 2 × K ORDERED TABLES
Pearson's chi-square procedure and other tests developed for analyzing data in 2 × K nominal tables do not incorporate the information on ordering among the columns of the table. These tests are not directed toward any specific alternative hypothesis. In analyzing data in a 2 × K ordered table, investigators will want to use as much of the information provided by the data as possible, and often also want to determine whether the null hypothesis can be rejected against a specific alternative hypothesis (e.g., increasing response with the columns). A test that utilizes ordering information will have increased power compared to a test for nominal tables.[1] Methods for analyzing data in 2 × K ordered tables may be broadly classified into two groups, namely, methods that assign numerical scores to the ordered categories and methods that do not. Methods that do assign numerical scores to the ordered categories may further
An Example
Consider Table 4, which classifies maternal drinking and congenital sex organ malformation of babies.[15] If the two-sample Wilcoxon rank sum test is used, the p-value is 0.56, which is also the p-value from the trend test with midranks as category scores. If equally spaced scores {1, 2, 3, 4, 5} are used, the p-value is 0.20 (from the trend test, t-test, logistic regression, linear regression, and correlation analysis). In an example such as this, perhaps the mid-values of the intervals represent the underlying continuous measure. Graubard and Korn used scores of 0, 0.5, 1.5, 4.0, and 7 (somewhat arbitrary), which yield a p-value equal to 0.01.[15] Iso-chi-square analysis of this data set yields a p-value of 0.02. These are exact p-values.
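How much the choice of scores matters can be illustrated with a short Python sketch of a score-based trend test. It assumes the Table 4 counts, five columns ordered by increasing alcohol consumption, and the statistic M2 = (n - 1) r2, where r is the correlation between the column scores and a 0/1 row indicator. The function names are hypothetical. The p-values it prints are large-sample approximations, so they will not match the exact p-values quoted above digit for digit.

```python
import math

# Counts from Table 4, ordered by increasing maternal alcohol consumption.
absent = [17066, 14464, 788, 126, 37]
present = [48, 38, 5, 1, 1]

def trend_test(row0, row1, scores):
    """Linear trend statistic M^2 = (n - 1) * r^2 with 1 degree of freedom.

    r is the correlation between the column scores and a 0/1 row indicator
    (1 for row1). Returns (M^2, large-sample two-sided p-value).
    """
    col_totals = [a + b for a, b in zip(row0, row1)]
    n = sum(col_totals)
    n1 = sum(row1)
    sum_v = sum(v * c for v, c in zip(scores, col_totals))
    sum_vv = sum(v * v * c for v, c in zip(scores, col_totals))
    sum_uv = sum(v * b for v, b in zip(scores, row1))
    cov = sum_uv - n1 * sum_v / n
    var_v = sum_vv - sum_v ** 2 / n
    var_u = n1 - n1 ** 2 / n
    m2 = (n - 1) * cov ** 2 / (var_v * var_u)
    p = math.erfc(math.sqrt(m2 / 2))   # chi-square (1 df) upper tail
    return m2, p

def midrank_scores(row0, row1):
    """Average rank of the observations falling in each column."""
    scores, below = [], 0
    for a, b in zip(row0, row1):
        scores.append(below + (a + b + 1) / 2)
        below += a + b
    return scores

m2_equal, p_equal = trend_test(absent, present, [1, 2, 3, 4, 5])
m2_gk, p_gk = trend_test(absent, present, [0, 0.5, 1.5, 4.0, 7.0])
m2_rank, p_rank = trend_test(absent, present, midrank_scores(absent, present))

for name, m2, p in [("equally spaced", m2_equal, p_equal),
                    ("midpoint-based", m2_gk, p_gk),
                    ("midranks", m2_rank, p_rank)]:
    print(f"{name}: M2 = {m2:.2f}, approximate p = {p:.3f}")
```

The same data give a small p-value with the midpoint-based scores and large p-values with equally spaced scores or midranks, which is the sensitivity to scoring that the example describes.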
Stochastic Ordering
Stochastic ordering, in the context of a 2 × K ordered table, is defined as having the cumulative distribution function (CDF) of one of the rows not crossing the distribution function of the other. In terms of the entries of Table 1, $F_{1j} = \sum_{t=1}^{j} n_{1t} / n_{1+}$ and $F_{2j} = \sum_{t=1}^{j} n_{2t} / n_{2+}$. If $F_{2j}$ ... $F_{1j}$ ...
Table 4

               Alcohol consumption (drinks per day)
Malformation   0        <1       1–2     3–5     ≥6
Absent         17,066   14,464   788     126     37
Present        48       38       5       1       1
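The cumulative proportions that enter the definition of stochastic ordering are easy to tabulate. The following Python sketch, with hypothetical function and variable names, computes them for the two rows of Table 4 and checks whether the sample step functions cross. This is only a descriptive check on the observed proportions, not a test of stochastic ordering.

```python
from itertools import accumulate

# Counts from Table 4 (columns in increasing order of alcohol consumption).
absent = [17066, 14464, 788, 126, 37]
present = [48, 38, 5, 1, 1]

def cumulative_proportions(row):
    """F_j = (n_1 + ... + n_j) / row total, for j = 1, ..., K."""
    total = sum(row)
    return [c / total for c in accumulate(row)]

f_absent = cumulative_proportions(absent)
f_present = cumulative_proportions(present)

# The two step functions do not cross if one of them lies on or below the
# other at every column.
present_below = all(fp <= fa for fp, fa in zip(f_present, f_absent))
absent_below = all(fa <= fp for fp, fa in zip(f_present, f_absent))
no_crossing = present_below or absent_below

for j, (fa, fp) in enumerate(zip(f_absent, f_present), start=1):
    print(f"column {j}: F_absent = {fa:.4f}, F_present = {fp:.4f}")
print("sample CDFs cross:", not no_crossing)
```

In these data the cumulative proportions for the row with malformations present never exceed those for the other row, so the observed distributions do not cross.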
CONCLUSION
In this article some existing methods of analyzing data in 2 × K (K > 2) contingency tables are discussed. Pearson's chi-square test statistic, which is widely used to analyze nominal data, is shown to be related to maximal correlation. Using maximal correlation, an investigator may determine if only a few categories contribute to the observed association. This relationship between the chi-square and the maximal correlation may also shed light on whether a large value of the chi-square test statistic is only due to a large sample size.

The paper also discusses methods of analysis of 2 × K ordered tables. Some of these methods use order-preserving scores and others do not. Several of the methods that utilize scores are equivalent to each other. As these methods are directed toward a particular alternative hypothesis, they have more power in general than the methods that do not utilize such scores. Also, these methods are computationally simple. However, the scores chosen are often arbitrary. In many situations the columns may provide some indication (e.g., interval, actual dose of a drug, etc.) of which scores it makes sense to use. But in a situation where the columns are defined as "low," "medium," and "high," it may be difficult to come up with a set of scores. In such situations, the Iso-chi-square method may be useful. Iso-chi-square may be considered a natural extension of the Pearson chi-square to the 2 × K ordered table in the sense that if this procedure is applied to 2 × K nominal tables, the test statistic is Pearson's chi-square test statistic. Also, Iso-chi-square may be considered a link between methods that do and do not utilize order-preserving scores.

All the 2 × K tables discussed here are assumed to have simple ordering. There may be other types of tables where the ordering between two categories is not simple. For example, parental drinking or smoking may be classified as "neither parent," "mother only," "father only," and "both parents." The level of the first or the last category has a distinct hierarchy compared with any other category. However, such a hierarchy between the second and the third categories is not defined. Similarly, some 2 × K tables may have mixed categories (both nominal and ordinal categories) or may have open-ended categories.[20,21] The method of Iso-chi-square may be extended
ACKNOWLEDGMENTS
REFERENCES
1. Agresti, A. Categorical Data Analysis, 2nd Ed.; Wiley: New York, 2002.
2. Hosmer, D.W.; Lemeshow, S. Applied Logistic Regression, 2nd Ed.; Wiley: New York, 2000.
3. Collett, D. Modeling Binary Data; Chapman & Hall: London, 1991.
4. Haberman, S.J. Test for independence in two-way contingency tables based on canonical correlation and linear-by-linear interaction. Ann. Stat. 1981, 9, 1178–1186.
5. Gautam, S.; Kimeldorf, G. Some results on the maximal correlation in 2 × K contingency tables. Am. Stat. 1999, 53 (4), 336–341.
6. Goodman, L.A.; Kruskal, W.H. Measures of association for cross-classifications. J. Am. Stat. Assoc. 1954, 49, 732–764.
7. Bishop, Y.M.M.; Fienberg, S.E.; Holland, P.W. Discrete Multivariate Analysis: Theory and Practice; The MIT Press: Cambridge, 1995.