
Chi-Squared (χ²) Tests for
Expected Proportions (Goodness of Fit) and Associations


Prepared by Allison Horst
Bren School of Environmental Science & Management
UC Santa Barbara

1. Introduction
When you are working with discrete data, in particular counts of observations, you may want to determine:
a) if the proportions of the counts that you found are significantly different from expected proportions
b) if there is an association between variables with count data (e.g., in contingency tables)
In those cases, chi-squared (χ²) tests can help you to draw conclusions. If the descriptions above are a little too abstract, below are some simplified examples to give you a feel for the types of data and questions for which chi-squared may be a valid statistical method:
Examples: Test for expected proportions (proportions can be equal or unequal)
Example 1. You are counting three types of butterflies, Type A, Type B, and Type C under the
null hypothesis that the proportions of the three types in a certain ecosystem are the same. Over
the course of your study, you count 371 Type A butterflies, 262 Type B butterflies, and 302 Type
C butterflies. Do you have sufficient evidence to retain the null hypothesis? Or, alternatively,
should you reject the null hypothesis and determine that the proportions are NOT the same for
the three types?
Example 2. An engineering school states that its student body is equally split between students
studying chemical engineering, mechanical engineering and computer engineering. You collect
information for 500 students in the engineering school, finding that 132 are studying chemical
engineering, 208 are studying mechanical engineering, and 160 are studying computer
engineering. Do you have sufficient evidence to conclude that the student body is not equally
split between the three engineering emphases?
Example 3. A farmer tells you the following: in his apple orchard at any given time, 60% of the apples are Braeburns, 30% are Granny Smiths, and 10% are Fujis. You collect a random sample to test his claimed proportions. You collect: 46 Braeburns, 21 Granny Smiths, and 18 Fujis. Do you have sufficient evidence to dispute the farmer's claim?

Examples: Test for associations (contingency tables)


Example 1. You are trying to determine whether a person's income level (split into Low, Medium, and High income) is associated with their grocery store preference (i.e., whether they are most likely to choose Vons, Trader Joe's, or Gelson's). You ask a number of individuals from the Low, Medium and High income levels which of the three stores they prefer, finding the data shown below (as a contingency table). Chi-squared could help you to decide whether there is no association between income level and market choice (the null hypothesis), or alternatively if there is an association between income level and market choice (the alternative hypothesis).

[Contingency table of counts by income level (Low, Medium, High) and preferred store (Vons, Trader Joe's, Gelson's); the counts are not legible in this copy.]

Do any of these examples sound like the data you are trying to analyze? If yes, chi-squared might be a
valid statistical test for you.

2. Background
Generally, chi-squared tests for proportions or associations work by calculating the χ² test statistic (Equation 1) as follows:

χ² = Σ [ (observed − expected)² / expected ]          (Equation 1)

where χ² is the sum of the expression above over all possible combinations of variables (e.g., in the grocery store example above, you would be summing 9 terms).
For example, let's say you're testing for equal proportions as in the engineering student example above. If the null hypothesis were true, then the expected number of students in chemical, mechanical and computer engineering (with 500 sampled) would be ~167 students each. However, the observed values for the respective emphases are 132, 208, and 160. Using Equation 1, the χ² test statistic is calculated and compared to the critical χ² value (which can be looked up in tables, but most likely you'll use R as shown below).
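
As an illustration, here is a minimal R sketch of the arithmetic for the engineering student example above (the object names are just illustrative), computing the test statistic from Equation 1 and comparing it to the critical value:

obs <- c(132, 208, 160)              #Observed counts: chemical, mechanical, computer
expected <- rep(sum(obs) / 3, 3)     #Expected counts under H0 (equal split): ~166.7 each
sum((obs - expected)^2 / expected)   #Equation 1: chi-squared test statistic (~17.7)
qchisq(0.95, df = 2)                 #Critical chi-squared value at alpha = 0.05 with 2 df (~5.99)
chisq.test(obs)                      #The built-in test returns the same statistic and a p-value

Because the statistic (~17.7) exceeds the critical value (~5.99), you would reject the null hypothesis of an equal split in this example.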

3. Null and Alternative Hypotheses


When using chi-squared to test for expected proportions, the null and alternative hypotheses are as follows:
H0: The proportions for the population from which the sample was drawn are not different from the expected (or claimed) proportions for the population
H1: The proportions for the population from which the sample was drawn are different from the expected (or claimed) proportions for the population
When using chi-squared to test for associations between variables, the null and alternative hypotheses
are as follows:
H0: There is no association between variables
H1: There is an association between variables

4. Assumptions and Pitfalls of Chi-Squared


One of the best things about chi-squared is how few assumptions you need to satisfy. Depending on your data, however, chi-squared may still be inappropriate. The following guidelines can help you to decide whether or not chi-squared is a good choice.
To use chi-squared:

All observations should be independent
No more than 20% of cells in the contingency table should have an expected count below 5
All expected counts in the contingency table should be at least 1
Generally: beware of very small counts (< 5) in any part of your contingency table! (See the quick check sketched below.)
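
As a quick check of these guidelines (a sketch using made-up counts, not data from this handout), you can inspect the expected counts that chisq.test() computes before trusting the result:

toy <- rbind(c(3, 9), c(4, 20))   #Made-up 2x2 counts with some small cells
chisq.test(toy)$expected          #Several expected counts fall below 5 here, and R warns that
                                  #the chi-squared approximation may be incorrect

When many expected counts are small, an exact alternative such as fisher.test() is often suggested.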

Some limitations of chi-squared:

Chi-squared doesn't tell you much about the strength of an association between variables
It is highly sensitive to sample size (especially important if the counts are very low; see above)

5. Chi-Squared in RStudio
In R and RStudio, the function for chi-squared is chisq.test()
Chi-squared tests in RStudio are straightforward, so long as you have your data arranged appropriately.
Several examples for different types of chi-squared tests (test for expected proportions, test for association) are shown below.

If you are testing for equal proportions:


Follow along with the example below to learn how to perform chi-squared to test for equal
proportions:
You hypothesize that the proportions of red, blue, white and black cars driving through an
intersection are the same. You record observations for 2 hours and count 12 red, 14 blue, 11
white, and 15 black cars. Can you reject the null hypothesis that the proportions are the same?
> cars <- c(12,14,11,15)          #A vector containing sample counts for each car color, called cars
> carsX2test <- chisq.test(cars)  #Performs chi-squared goodness of fit to test for equal likelihoods of all outcomes
                                  #p = 0.8568 >> retain the null hypothesis (likelihoods of all car colors are the same)

View the results of the chi-squared test by calling the test name you created (here, carsX2test), which prints the χ² statistic, the degrees of freedom, and the p-value for this example.

You can also view other parameters of the test (including expected values) using the following commands:
> TestName$observed   #View observed counts
> TestName$expected   #View expected counts based on the null proportions (the p argument)
> TestName$stdres     #View standardized residuals (values with |residual| > 2 are much different from expected)

For example, for the cars data above you can view the expected counts for each car color by:
> carsX2test$expected #View expected counts for each car color
[1] 13 13 13 13

If you are testing for unequal proportions:


In the examples above, we have used the chi-squared distribution to determine whether we will reject or retain the null hypothesis that the likelihoods of all possible outcomes are equal. Sometimes, we will not expect the likelihoods of all possible outcomes to be equal; how do we test whether our observations are sufficient to reject expected unequal likelihoods?
By default, chisq.test() assumes that the expected proportions (likelihoods) for all possible outcomes are the same. The expected proportions are calculated by the default argument:

p = rep(1/length(x), length(x))
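
For the cars example above, this default means the following two calls are equivalent:

chisq.test(cars)                    #Uses the default equal expected proportions
chisq.test(cars, p = rep(1/4, 4))   #Spells out the same default explicitly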

If you want to account for different expected proportions, you need to supply an alternate vector of expected proportions as the p argument.
For example: You think that there are twice as many blueberries in a berry basket as there are raspberries and blackberries (i.e., you think that the baskets are composed of ½ blueberries, ¼ raspberries, and ¼ blackberries). You take a random sample from a basket and find 22 blueberries, 9 raspberries and 16 blackberries.
H0: There are twice as many blueberries as there are raspberries or blackberries
H1: There are not twice as many blueberries as there are raspberries or blackberries
In R, you could perform the chi-squared test as follows (note that p is changed to the proportions that YOU enter, instead of the default, which assumes you are testing for equal proportions):
> berry <- c(22,9,16)             #A vector containing sample counts for blueberries, raspberries and blackberries
> null.prob <- c(0.5,0.25,0.25)   #A vector containing the corresponding expected proportions
> berryX2test <- chisq.test(berry, p = null.prob)  #Chi-squared goodness of fit test
                                  #(Null: proportions of blueberries to raspberries to blackberries are 0.5 : 0.25 : 0.25)
                                  #p = 0.3203 >> retain the null hypothesis!

View the results of the test as above.
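
For instance, you can confirm the expected counts implied by the null proportions (47 total berries × 0.5, 0.25, and 0.25):

> berryX2test$expected   #Expected counts under the null: 23.50 11.75 11.75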

If you are testing for an association between variables in a contingency table:


The most difficult part of performing chi-squared with data in a contingency table is actually getting
your data into a usable data table. There are several ways to create a data table.
For example, imagine that you are testing the impact of a drug on stroke occurrence in subjects. For 40
patients treated with the drug (Drug), 12 had a stroke over the course of observation and 28 did not have
a stroke. For 40 patients treated with the placebo (Placebo), 18 had a stroke over the course of
observation and 22 did not have a stroke. We want to create a contingency table (in the form of a data table) so that we can perform a chi-squared test for association.
First, create vectors containing the counts for each row:
row1 <- c(12,28)   #Counts for 'Stroke' and 'No Stroke' for patients in the Drug treatment
row2 <- c(18,22)   #Counts for 'Stroke' and 'No Stroke' for patients in the Placebo treatment

Then, create a data table using the rbind() function:


DrugTable <- rbind(row1,row2)   #Creates a table containing row1 and row2 counts

And you can update the row and column names to match the data you are using:

colnames(DrugTable) <- c("Stroke", "No Stroke")


rownames(DrugTable) <- c("Drug","Placebo")

The outcome is an organized, labeled data table that can be used in chi-squared analysis:
> DrugTable
        Stroke No Stroke
Drug        12        28
Placebo     18        22

**Note: There are many ways to create a data table in R. Explore some other options!
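
One alternative (a sketch of one such option; the name DrugTable2 is just illustrative) is to build the same 2x2 table in a single matrix() call, supplying the row and column names through dimnames:

DrugTable2 <- matrix(c(12, 28, 18, 22), nrow = 2, byrow = TRUE,
                     dimnames = list(c("Drug", "Placebo"), c("Stroke", "No Stroke")))   #Same labeled table, built in one step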
Once you have your data in a contingency table in RStudio, performing chi-squared is straightforward.
For this example, the null and alternative hypotheses are as follows:
H0: There is no difference in stroke likelihood between Drug and Placebo treated patients (or: there is no association between treatment and stroke occurrence in Drug versus Placebo treated patients)
H1: There is a significant difference in stroke likelihood for Drug versus Placebo treated patients (or: there is an association between treatment and stroke occurrence in Drug versus Placebo treated patients)

You then perform chi-squared using chisq.test(DataTableName) as below:


TableX2 <- chisq.test(DrugTable)   #Performs a chi-squared test for association on the 2x2 contingency table "DrugTable"
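
As before, view the results by calling the test name you created. One detail to be aware of: for 2x2 tables, chisq.test() applies Yates' continuity correction by default; it can be turned off with correct = FALSE if you want the uncorrected statistic.

TableX2                                  #Print the test statistic, degrees of freedom, and p-value
TableX2$expected                         #Expected counts under the null of no association
chisq.test(DrugTable, correct = FALSE)   #Same test without the continuity correction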
