1. Introduction
When you are working with discrete data (in particular, counts of observations), you may want to
determine:
a) if the proportions of the counts that you found are significantly
different from expected proportions
b) if there is an association between variables with count data (e.g. in
contingency tables)
In those cases, chi-squared (χ²) tests can help you to draw conclusions. If the descriptions above are a
little too abstract, below are some simplified examples to give you a feel for the types of data and
questions for which chi-squared may be a valid statistical method:
Examples: Test for expected proportions (proportions can be equal or unequal)
Example 1. You are counting three types of butterflies, Type A, Type B, and Type C under the
null hypothesis that the proportions of the three types in a certain ecosystem are the same. Over
the course of your study, you count 371 Type A butterflies, 262 Type B butterflies, and 302 Type
C butterflies. Do you have sufficient evidence to retain the null hypothesis? Or, alternatively,
should you reject the null hypothesis and determine that the proportions are NOT the same for
the three types?
Example 2. An engineering school states that its student body is equally split between students
studying chemical engineering, mechanical engineering and computer engineering. You collect
information for 500 students in the engineering school, finding that 132 are studying chemical
engineering, 208 are studying mechanical engineering, and 160 are studying computer
engineering. Do you have sufficient evidence to conclude that the student body is not equally
split between the three engineering emphases?
Example 3. A farmer tells you the following: in his apple orchard at any given time, 60% of the
apples are Braeburns, 30% are Granny Smiths, and 10% are Fujis. You collect a random sample
to test his claimed proportions. You collect: 46 Braeburns, 21 Granny Smiths, and 18 Fujis. Do
you have sufficient evidence to dispute the farmer's claim?
[Table: 3 × 3 contingency table of observed counts; labels and values not recoverable]
Do any of these examples sound like the data you are trying to analyze? If yes, chi-squared might be a
valid statistical test for you.
2. Background
Generally, chi-squared tests for proportions or association work by calculating the χ² test statistic
(Equation 1) as follows:
$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$
(Equation 1)
where the sum is taken over all possible combinations of variables
(e.g. in the grocery store example above, you would be summing 9 terms).
For example, let's say you're testing for equal proportions, as in the engineering student example above. If
the null hypothesis were true, then the expected number of students in each of chemical, mechanical and
computer engineering (with 500 sampled) would be ~167 students. However, the observed values for
the respective emphases are 132, 208 and 160. Using Equation 1 above, the χ² test statistic is calculated and
compared to the critical χ² value (which can be looked up in charts, but most likely you'll use R as
shown below).
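To make the arithmetic concrete, here is a minimal sketch of Equation 1 applied by hand to the engineering student counts. The variable names and the 0.05 significance level are assumptions chosen for illustration; they are not taken from the original text.

```r
# Observed counts from Example 2 (chemical, mechanical, computer engineering)
observed <- c(chemical = 132, mechanical = 208, computer = 160)

# Expected counts under the null hypothesis of equal proportions (500 / 3 each)
expected <- rep(sum(observed) / length(observed), length(observed))

# Equation 1: sum of (observed - expected)^2 / expected over all categories
chi_sq <- sum((observed - expected)^2 / expected)

# Critical value for an assumed significance level of 0.05,
# with (number of categories - 1) degrees of freedom
critical <- qchisq(0.95, df = length(observed) - 1)

print(chi_sq)             # test statistic
print(critical)           # critical value
print(chi_sq > critical)  # TRUE means the null hypothesis would be rejected
```

With these counts the statistic comes out near 17.7, well above the critical value of about 5.99, so under these assumptions the claim of an equal split would be rejected.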
Chi-squared doesn't tell you much about the strength of an association between variables
Chi-squared is highly sensitive to sample size (very important if the counts are very low; see above)
5. Chi-Squared in RStudio
In R and RStudio, the function for chi-squared is chisq.test().
Chi-squared tests in RStudio are straightforward, so long as you have your data arranged appropriately.
Several examples for different types of chi-squared (test for expected proportions, test for association)
are shown.
View the results of the chi-squared test by calling the test name you created (here, carsX2test),
which for this example yields:
You can also view other parameters of the test (including expected values) using the following
commands:
> TestName$observed  #View observed counts
> TestName$expected  #View expected counts based on the null proportions (the p argument)
> TestName$stdres    #View standardized residuals (an absolute value > 2 suggests the count differs notably from its expected value)
For example, for the cars data above you can view the expected counts for each car color by:
> carsX2test$expected #View expected counts for each car color
[1] 13 13 13 13
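The same accessors can be tried on data that appears in full in this document. The sketch below uses the butterfly counts from Example 1; the object names butterflies and butterflyX2test are made up for illustration.

```r
# Butterfly counts from Example 1 (Type A, Type B, Type C)
butterflies <- c(A = 371, B = 262, C = 302)

# Goodness-of-fit test; with no p argument, equal proportions are assumed
butterflyX2test <- chisq.test(butterflies)

print(butterflyX2test)    # test statistic, degrees of freedom and p-value
butterflyX2test$observed  # observed counts
butterflyX2test$expected  # expected counts under equal proportions (935 / 3 each)
butterflyX2test$stdres    # standardized residuals, one per butterfly type
```

Comparing the standardized residuals shows which categories contribute most to the test statistic; here Type A is the furthest from its expected count.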
By default, chisq.test() tests for equal proportions; the p argument defaults to:
p = rep(1/length(x), length(x))
If you want to change the code to account for different expected proportions, you need to include an
amended argument containing an alternate vector for p.
For example: You think that there are twice as many blueberries in a berry basket as there are
raspberries and blackberries (i.e., you think that the baskets are comprised of 1/4 raspberries,
1/4 blackberries, and 1/2 blueberries). You take a random sample from a basket and find 22 blueberries, 9
raspberries and 16 blackberries.
H0: There are twice as many blueberries as there are raspberries or blackberries
H1: There are not twice as many blueberries as there are raspberries or blackberries
In R, you could perform chi-squared as follows (note that the p is changed to the proportions that YOU
enter, instead of the default which assumes you are testing for equal proportions):
> berry <- c(22,9,16)  #Sample counts for blueberries, raspberries and blackberries
> null.prob <- c(0.5,0.25,0.25)  #Corresponding expected proportions of blueberries, raspberries and blackberries
> berryX2test <- chisq.test(berry, p = null.prob)  #Chi-squared goodness-of-fit test
> #Null: the proportions of blueberries : raspberries : blackberries are 0.5 : 0.25 : 0.25
> #Result: p ≈ 0.32 >> retain the null hypothesis!
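The same pattern should carry over to the apple orchard question from Example 3. The following is a sketch only; the object names (apples, farmer.prob, appleX2test) are hypothetical, and the 0.05 threshold is an assumed significance level.

```r
# Sample counts from Example 3: Braeburn, Granny Smith, Fuji
apples <- c(46, 21, 18)

# Proportions claimed by the farmer (these must sum to 1)
farmer.prob <- c(0.6, 0.3, 0.1)

# Goodness-of-fit test against the claimed proportions
appleX2test <- chisq.test(apples, p = farmer.prob)

print(appleX2test)           # test statistic, degrees of freedom and p-value
appleX2test$expected         # expected counts: 51, 25.5 and 8.5
appleX2test$p.value < 0.05   # TRUE means the claimed proportions are rejected
```

Here the p-value falls well below 0.05, so under these assumptions the sample gives evidence against the farmer's claimed proportions.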
And you can update the row and column names to match the data you are using:
The outcome is an organized, labeled data table that can be used in chi-squared analysis:
> DrugTable
        Stroke No Stroke
Drug        12        28
Placebo     18        22
**Note: There are many ways to create a data table in R. Explore some other options!
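One possible way to build the table shown above is sketched here, using matrix() together with rownames() and colnames(). This is an illustrative approach and may differ from the commands used to produce the original table.

```r
# matrix() fills column by column: first the Stroke column, then No Stroke
DrugTable <- matrix(c(12, 18, 28, 22), nrow = 2)

# Label the rows (treatment groups) and columns (outcomes)
rownames(DrugTable) <- c("Drug", "Placebo")
colnames(DrugTable) <- c("Stroke", "No Stroke")

print(DrugTable)
```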
Once you have your data in a contingency table in RStudio, performing chi-squared is straightforward.
For this example, the null and alternative hypotheses are as follows:
H0 :
H1 :
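A minimal sketch of the test on this table follows, assuming H0 is that stroke outcome is independent of treatment group and H1 is that the two are associated. The object names drugX2test and drugX2raw are hypothetical, and the table is rebuilt so that the snippet runs on its own.

```r
# Rebuild the contingency table so this snippet is self-contained
DrugTable <- matrix(c(12, 18, 28, 22), nrow = 2,
                    dimnames = list(c("Drug", "Placebo"),
                                    c("Stroke", "No Stroke")))

# Test for association; for 2 x 2 tables chisq.test() applies
# Yates' continuity correction by default (correct = TRUE)
drugX2test <- chisq.test(DrugTable)
print(drugX2test)
drugX2test$expected   # expected counts if outcome and treatment are independent

# The same test without the continuity correction, for comparison
drugX2raw <- chisq.test(DrugTable, correct = FALSE)
print(drugX2raw)
```

Both versions use 1 degree of freedom. The corrected statistic is smaller than the uncorrected one, which makes the corrected test the more conservative of the two.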