
CHI SQUARE (χ²)

Yetty Dwi Lestari


Department of Management,
FEB
Airlangga University
Why Chi Square? (χ²)
• We want to compare two variables, but…
• Not all variables are interval-level, so we cannot use regression.
• Hypothesis Tests for Difference of Means and Difference of
Proportions only allow us to compare two groups with one
value.
• We need something else. . .
What is Chi Square? (χ²)

• The test statistic for testing hypotheses
comparing 2 or more nominal categories
• The Chi Square statistic compares
nominal values in a cross-tabulation table,
making what are called row by column
comparisons, or “r x c” tables.
Data Types

[Diagram: Data
   Quantitative: Discrete, Continuous
   Qualitative]
A Nominal variable is…

… a categorical variable with mutually exclusive categories. For
example, gender, where male = 1 and female = 2.

Mention other examples!


Hypothesis Tests
Qualitative Data

[Diagram: Qualitative Data
   Proportion
      1 pop.: Z Test
      2 pop.: Z Test
      2 or more pop.: χ² Test
   Independence: χ² Test]

EPI809/Spring 2008
Chi-Square Tables

χ² tables are included in most statistics texts
and consist of columns and rows, with columns
representing areas under the curve and rows
associated with the degrees of freedom (df)
which, for χ² tests of homogeneity and
independence, are: df = (r-1)(c-1).
Chi-Square Tables
• Typical columns are: χ².100 χ².050 χ².025 χ².010 χ².005
• The decision rule for both a chi-square test of homogeneity and
one of independence is:
DR: Reject H0 in favor of HA if and only if χ²calc > χ²crit.
Otherwise, fail to reject (FTR) H0.
• In the case of homogeneity, this is essentially:
DR: Reject similarity of processes in favor of distinctions between
them if and only if their profiles differ markedly from one another
so that the preponderance of evidence supports distinctions.
Otherwise, FTR H0.
χ²crit Determination
With 8 df and α = .05, χ²crit = 15.5073

df     χ².100     χ².050     χ².025     χ².010     χ².005
 1     2.7055     3.8415     5.0239     6.6349     7.8794
 2     4.6052     5.9915     7.3778     9.2103    10.5966
 3     6.2514     7.8147     9.3484    11.3449    12.8381
 4     7.7794     9.4877    11.1433    13.2767    14.8602
 .       .          .          .          .          .
 8    13.3616    15.5073    17.5346    20.0902    21.9550
 .       .          .          .          .          .
30    40.2560    43.7729    46.9792    50.8922    53.6720

Using the Table…
Degrees of freedom: 5 - 1 = 4
Right-tail area: α = 0.05
Reject H0 if χ² > 9.488
Characteristics of the Chi-Square Distribution

… it is positively skewed
… it is non-negative
… it is based on degrees of freedom
… when the degrees of freedom change,
a new distribution is created

…e.g.
Copyright © 2004 McGraw-Hill Ryerson Limited. All rights reserved.
Characteristics of the Chi-Square Distribution

[Figure: chi-square density curves for df = 3, df = 5, and df = 10; horizontal axis χ²]
Summing up the properties of the χ² Distribution:
• The χ² distribution ranges from zero to some positive value,
i.e., from ‘no difference’ to some ‘big difference’.
• The χ² distribution is not symmetrical but skewed to the
right, from zero to a large positive χ². Chi square looks at
differences from zero. Its value depends on the number
of comparisons made, that is, the number of df. Note that
the critical value of chi square gets bigger as the df get
bigger: the more comparisons made, the more likely you are
to find differences, so df corrects for this.
• There are many different χ² distributions. Like the t
distribution, χ² varies with degrees of freedom.
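For reference, these properties can be read off the density of the distribution. The formula below is the standard textbook result, added here as an aid and not taken from the slides; k denotes the degrees of freedom.

```latex
f(x; k) = \frac{1}{2^{k/2}\,\Gamma(k/2)}\; x^{k/2 - 1}\, e^{-x/2}, \qquad x > 0,
\qquad \mathrm{E}[X] = k, \qquad \mathrm{Var}[X] = 2k
```

The density is zero for negative x (non-negative), has a long right tail (positively skewed), and changes shape with k, which is why each df has its own row in the table.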
CHI SQUARE (χ²) APPLICATIONS

• Test between proportions
• Test of independence
• Test of normality
Basic Assumption of the Null
Hypothesis
• There is no difference in the population; the
difference you observe is just the chance
variation of your sample.
• We are comparing observed values
(the frequencies actually observed in our sample,
written “fo”) to some set of frequencies expected
by chance (written “fe”).
(χ²)
test for difference between two proportions

• Comparing the tallies or counts of categorical responses
between two independent groups: a two-way cross-
classification table (contingency table)
• Ho : there is no difference between the two population
proportions
• Ho : p1 = p2
• H1 : the two population proportions are not the same
• H1 : p1 ≠ p2
(χ²)
test for independence
Goodness-of-Fit Test:
Equal Expected Frequencies

Let fo and fe be the observed and expected
frequencies, respectively
H0: There is no difference between the
observed and expected frequencies
H1: There is a difference between the
observed and the expected frequencies


Goodness-of-Fit Test:
Equal Expected Frequencies

… the test statistic is:

$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$

… the critical value is a chi-square value with
(k-1) degrees of freedom,
where k is the number of categories
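A minimal sketch of the equal-expected-frequencies test statistic, assuming made-up counts: the list `hypothetical_counts` and the function name `gof_equal_expected` are illustrative and do not come from the slides.

```python
def gof_equal_expected(observed):
    """Chi-square goodness-of-fit statistic when every category is expected equally often."""
    k = len(observed)                 # number of categories
    fe = sum(observed) / k            # same expected frequency in every category
    chi2 = sum((fo - fe) ** 2 / fe for fo in observed)
    return chi2, k - 1                # statistic and its degrees of freedom

# Hypothetical data: 100 observations spread over 4 categories.
hypothetical_counts = [30, 20, 25, 25]
chi2, df = gof_equal_expected(hypothetical_counts)
```

The returned statistic is then compared with the tabled critical value at k-1 degrees of freedom.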


Steps in Test of Hypothesis
1. Determine the appropriate test
2. Establish the level of significance: α
3. Formulate the statistical hypothesis
4. Calculate the test statistic
5. Determine the degree of freedom
6. Compare computed test statistic against a
tabled/critical value

1. Determine Appropriate Test

• Chi Square is used when both variables are
measured on a nominal scale.
• It can be applied to interval or ratio data that have
been categorized into a small number of groups.
• It assumes that the observations are randomly
sampled from the population.
• All observations are independent (an individual
can appear only once in a table and there are no
overlapping categories).
• It does not make any assumptions about the shape
of the distribution nor about the homogeneity of
variances.
2. Establish Level of Significance

α is a predetermined value
The convention
• α = .05
• α = .01
• α = .001

3. Determine The Hypothesis:
Whether There is an Association
or Not
Ho : The two variables are independent
Ha : The two variables are associated

4. Calculating Test Statistics
• Contrasts observed frequencies in each cell of a
contingency table with expected frequencies.
• The expected frequencies represent the number of
cases that would be found in each cell if the null
hypothesis were true (i.e., the nominal variables
are unrelated).
• The expected frequency of two unrelated events is the
product of the row and column frequencies divided
by the number of cases:

$F_e = \frac{F_r F_c}{N}$
4. Calculating Test Statistics

$\chi^2 = \sum \frac{(F_o - F_e)^2}{F_e}$

5. Determine Degrees of Freedom
df = (R-1)(C-1)
6. Compare computed test statistic
against a tabled/critical value
• The computed value of the Pearson chi-square
statistic is compared with the critical
value to determine if the computed value is
improbable
• The critical tabled values are based on
sampling distributions of the Pearson chi-square
statistic
• If the calculated χ² is greater than the χ² table
value, reject Ho
Example

• Suppose a researcher is interested in voting
preferences on gun control issues.
• A questionnaire was developed and sent to a
random sample of 90 voters.
• The researcher also collects information
about the political party membership of the
sample of 90 respondents.
Bivariate Frequency Table or
Contingency Table

              Favor   Neutral   Oppose   f row
Democrat        10       10       30       50
Republican      15       15       10       40
f column        25       25       40     n = 90

f row is the row frequency (row total); f column is the column frequency (column total).
1. Determine Appropriate Test

1. Party Membership (2 levels), nominal
2. Voting Preference (3 levels), nominal

Both variables are nominal, so the chi-square test is appropriate.

2. Establish Level of Significance

Alpha of .05
3. Determine The Hypothesis

• Ho : There is no difference between Democrats and Republicans
in their opinion on the gun control issue (party membership and
opinion are independent).

• Ha : There is an association between
responses to the gun control survey and
party membership in the population.
4. Calculating Test Statistics

              Favor        Neutral      Oppose       f row
Democrat      fo = 10      fo = 10      fo = 30        50
              fe = 13.9    fe = 13.9    fe = 22.2
Republican    fo = 15      fo = 15      fo = 10        40
              fe = 11.1    fe = 11.1    fe = 17.8
f column        25           25           40         n = 90

Each fe is (row total)(column total)/n. For example:
Democrat, Favor: fe = 50*25/90 = 13.9
Republican, Favor: fe = 40*25/90 = 11.1
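The expected frequencies in the table can be checked with a few lines of Python. This is a sketch using the observed counts from the gun control example; the function name `expected_frequencies` is an illustrative choice.

```python
# Observed counts: rows are Democrat, Republican; columns are Favor, Neutral, Oppose.
observed = [[10, 10, 30],
            [15, 15, 10]]

def expected_frequencies(table):
    """Expected count per cell under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

expected = expected_frequencies(observed)
for label, row in zip(["Democrat", "Republican"], expected):
    print(label, [round(fe, 2) for fe in row])
```

The unrounded values (13.89, 22.22, 11.11, 17.78) are slightly more precise than the one-decimal figures shown in the table.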
4. Calculating Test Statistics

$\chi^2 = \frac{(10 - 13.89)^2}{13.89} + \frac{(10 - 13.89)^2}{13.89} + \frac{(30 - 22.2)^2}{22.2} + \frac{(15 - 11.11)^2}{11.11} + \frac{(15 - 11.11)^2}{11.11} + \frac{(10 - 17.8)^2}{17.8} = 11.03$
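The sum above can be reproduced cell by cell. The sketch below uses unrounded expected frequencies, so it lands on 11.025 (the value SPSS reports) where the hand calculation with rounded inputs gives 11.03; the variable names are illustrative.

```python
observed = [[10, 10, 30],
            [15, 15, 10]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# One contribution (fo - fe)^2 / fe per cell, with fe = row total * column total / n.
contributions = []
for i, row in enumerate(observed):
    for j, fo in enumerate(row):
        fe = row_totals[i] * col_totals[j] / n
        contributions.append((fo - fe) ** 2 / fe)

chi2 = sum(contributions)
print(round(chi2, 3))
```

Keeping the per-cell contributions in a list makes it easy to see which cells drive the result; here the two Oppose cells contribute the most.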

5. Determine Degrees of Freedom

df = (R-1)(C-1) = (2-1)(3-1) = 2
6. Compare computed test statistic
against a tabled/critical value
• α = 0.05
• df = 2
• Critical tabled value = 5.991
• Test statistic, 11.03, exceeds critical value
• Null hypothesis is rejected
• Democrats & Republicans differ
significantly in their opinions on gun
control issues
SPSS Output for Gun Control Example

Chi-Square Tests

                                 Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square               11.025a     2   .004
Likelihood Ratio                 11.365      2   .003
Linear-by-Linear Association      8.722      1   .003
N of Valid Cases                 90

a. 0 cells (.0%) have expected count less than 5. The
minimum expected count is 11.11.
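The first two rows of this output can be approximated without SPSS. The sketch below assumes the likelihood ratio row is the usual G statistic, 2 Σ fo ln(fo/fe), and uses the fact that with exactly 2 degrees of freedom the chi-square right-tail area is exp(-x/2). It does not attempt the linear-by-linear row, and the variable names are illustrative.

```python
import math

observed = [[10, 10, 30],
            [15, 15, 10]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

pearson = 0.0
likelihood_ratio = 0.0
for i, row in enumerate(observed):
    for j, fo in enumerate(row):
        fe = row_totals[i] * col_totals[j] / n
        pearson += (fo - fe) ** 2 / fe
        likelihood_ratio += 2.0 * fo * math.log(fo / fe)   # G statistic term

# Valid only for df = 2: P(chi-square > x) = exp(-x / 2).
p_pearson = math.exp(-pearson / 2.0)
p_likelihood_ratio = math.exp(-likelihood_ratio / 2.0)
print(round(pearson, 3), round(p_pearson, 3))
print(round(likelihood_ratio, 3), round(p_likelihood_ratio, 3))
```

The closed-form p-value is a convenience of df = 2; other degrees of freedom need a table or a numerical routine.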
Another case...
Approval for President Obama by Race

               BLACKS   WHITES
APPROVE          69       156
DISAPPROVE       21       144
The formula for χ² is:

$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$

OR, sometimes written:

$\chi^2 = \sum \frac{(O - E)^2}{E}$
Where fo is the observed frequency in each cell of a table.
O or fo is what we observe from our sample, the
observed frequency. NOTE that χ² works with
frequencies in each cell.

E or fe is the expected frequency, the number of
people who would show up in each cell IF the null
hypothesis were true: if there was no racial
difference in approval, if the frequencies were due
solely to chance.
For each cell in the table we compare
what we observe to what we should expect by
chance:
• Subtract the expected frequency (fe) from the observed
frequency (fo) for each cell.
• Square each of these deviations.
• Divide each squared deviation by the expected frequency of its cell.
• Finally, sum these quotients over all cells to get χ².
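The four steps can be applied to the approval table. The slides do not report a result for this table, so the sketch below is a worked illustration: expected frequencies are computed as (row total)(column total)/n, and the variable names are illustrative.

```python
# Rows: Approve, Disapprove; columns: Blacks, Whites.
observed = [[69, 156],
            [21, 144]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, fo in enumerate(row):
        fe = row_totals[i] * col_totals[j] / n   # expected if approval is unrelated to race
        deviation = fo - fe                      # step 1: subtract
        squared = deviation ** 2                 # step 2: square
        chi2 += squared / fe                     # steps 3 and 4: divide, then sum

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(chi2, 2), df)
```

With df = 1, the statistic would be compared with the tabled critical value 3.8415 at α = .05.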
The Chi Square statistic tests:
• Whether the difference between what you observe and what
chance would predict is due to sampling error.
• The greater the deviation of what we observe from what we
would expect by chance, the stronger the evidence that the
difference is NOT due to chance.
DIFFERENCE BETWEEN EXPENSIVE
AND CHEAP BEER
• Consumer Reports routinely finds that many
people who claim they can taste the difference
can’t — they are influenced by the label.
• How would you test the idea that people cannot
really tell the difference, and that they are really
responding to the price label information? How
do we disentangle the label effect from taste?
What is the null? ==> No difference
We expect: beer 1 = beer 2 = beer 3

Study Design: Sample 150 beer drinkers. Place
before them 3 bottles, one labeled with the name of a
well-known high-priced beer, another a medium-priced
beer, and the third a low-priced beer.
Bottles are counterbalanced to control for order
effects.
All 150 subjects taste each beer and state a
preference.
The Full Table

                High Priced Beer   Medium Priced Beer   Low Priced Beer
Observed fo            77                  41                  32
Expected fe            50                  50                  50
Step 1. Hypothesis:
Null = the proportions preferring each beer
should be equal IF indeed the beers are equal and
if preferences are not influenced by the label. Here,
chance would predict 50 people in each group if
label did not matter. The ratios of O to E values
should be the same across all 3 comparisons if
label does not matter. The O : E ratios in each
column should be the same. Our alternative
hypothesis is that preferences will follow the status
of beer 1 > beer 2 > beer 3.
Step 2. The Distribution:

Since we are interested in the effect of one
nominal variable on another nominal variable,
the χ² distribution is appropriate; we are doing a
row by column [r x c] analysis.

Step 3. Level of Significance:

Set alpha at .05 for 95% confidence.
Step 4. Determine Critical Value of χ²:
The chi square distribution changes shape by
degrees of freedom, just as does the t distribution.
Degrees of freedom change as a function of the
number of comparisons made.
Formula for degrees of freedom of χ²:
df = (r - 1) x (c - 1)
where r = number of rows; c = number of columns
We have a 3 by 2 table, so df = (3 - 1) x (2 - 1) = 2.

(For a one-way chi-square such as this one, df = k - 1, where k is
the number of categories: 3 - 1 = 2.)

Step 5. Decision: Let's fill in the table:

$\chi^2 = \sum \frac{(O - E)^2}{E}$

Beer          Hi Priced   Med Priced   Lo Priced
Observed          77          41           32
Expected          50          50           50
O-E               27          -9          -18
(O-E)²           729          81          324
(O-E)²/E       14.58        1.62         6.48

χ² = Σ[(O-E)² / E] = 14.58 + 1.62 + 6.48 = 22.68


Look up χ² = 22.68 in the Chi Square
table at 2 df. We find that 22.68 is even beyond
the .01 significance level.

The probability is p < .0005; that is, fewer than 5
chances in 10,000 would produce a difference this
big just by chance. Or better, if the null hypothesis were
true, fewer than 5 samples in 10,000 of the same size
would produce a difference this big.
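The table and the p-value bound can be checked directly. The sketch below uses the observed counts from the beer study and the fact that with exactly 2 degrees of freedom the right-tail area of the chi-square distribution is exp(-x/2); the variable names are illustrative.

```python
import math

observed = [77, 41, 32]                              # high, medium, low priced label
expected = [sum(observed) / len(observed)] * 3       # 50 each if the label does not matter

contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi2 = sum(contributions)
df = len(observed) - 1

# Valid only for df = 2: P(chi-square > x) = exp(-x / 2).
p_value = math.exp(-chi2 / 2.0)
print([round(c, 2) for c in contributions], round(chi2, 2), df, p_value)
```

The exact tail area comes out far below the .0005 bound read from the table, which is consistent with the decision above.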
Step 6. Interpret:

The Chi Square value of 22.68 is beyond the
critical value of 5.991.

Therefore reject the null hypothesis of equality.
People do respond to price label information.
Goodness-of-Fit Test:
Normality
… the test investigates
if the observed frequencies in a frequency distribution
match the theoretical normal distribution

… the procedure is to:
- Determine the mean and standard deviation
of the frequency distribution
- Compute the z-value for the lower class limit
and the upper class limit for each class
- Determine fe for each category
- Use the chi-square goodness-of-fit test to
determine if fo coincides with fe
Attention
• Suppose we know the mean and standard deviation of the
population but wish to find whether some sample data
conform to the normal distribution:
d.f. = k - 1
• On the other hand, if we don’t know the mean and standard
deviation of the population but wish to test whether some
sample data follow the normal distribution:
d.f. = k - p - 1
where k is the number of categories and p is the number of
population parameters estimated from the sample data
Goodness-of-Fit Test:
Normality

• A sample of 500 donations to the Arthritis
Foundation is reported in the
following frequency distribution

• Is it reasonable to conclude that the distribution is
normally distributed with a mean of $10 and a
standard deviation of $2?

• Use the .05 significance level
… continued

Amount Spent      fo     Area     fe     (fo - fe)²/fe
<$6               20
$6 up to $8       60
$8 up to $10     140
$10 up to $12    120
$12 up to $14     90
>$14              70
Total            500
… continued

To compute fe for the first class,
first determine the z-value:

$z = \frac{X - \mu}{\sigma} = \frac{6 - 10}{2} = -2.00$

Now find the probability of a z-value less than -2.00:

P(z < -2.00) = 0.5000 - .4772 = .0228


… continued

Amount Spent      fo     Area     fe     (fo - fe)²/fe
<$6               20     .02
$6 up to $8       60     .14
$8 up to $10     140     .34
$10 up to $12    120     .34
$12 up to $14     90     .14
>$14              70     .02
Total            500
… continued

The expected frequency is the probability of a
z-value less than -2.00 times the sample size:

fe = (.0228)(500) = 11.40

The other expected frequencies
are computed similarly
… continued

Amount Spent      fo     Area      fe      (fo - fe)²/fe
<$6               20     .02      11.40         6.49
$6 up to $8       60     .14      67.95          .93
$8 up to $10     140     .34     170.65         5.50
$10 up to $12    120     .34     170.65        15.03
$12 up to $14     90     .14      67.95         7.16
>$14              70     .02      11.40       301.22
Total            500             500          336.33
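The completed table can be approximated in code. The sketch below is an illustration with one stated difference from the slides: it takes class probabilities from the exact normal CDF (via `math.erf`) instead of a four-decimal z table, so the expected frequencies and the total differ slightly (the statistic comes out near 337 instead of 336.33). The helper name `normal_cdf` is illustrative.

```python
import math

def normal_cdf(x, mean, sd):
    """P(X <= x) for a normal distribution with the given mean and standard deviation."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

mean, sd, n = 10.0, 2.0, 500
upper_limits = [6, 8, 10, 12, 14, math.inf]     # upper class limits; last class is open-ended
observed = [20, 60, 140, 120, 90, 70]

expected = []
lower_cdf = 0.0                                 # first class is open-ended on the left
for upper in upper_limits:
    upper_cdf = 1.0 if math.isinf(upper) else normal_cdf(upper, mean, sd)
    expected.append(n * (upper_cdf - lower_cdf))   # class probability times sample size
    lower_cdf = upper_cdf

chi2 = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
print([round(fe, 2) for fe in expected], round(chi2, 2))
```

Almost all of the statistic comes from the last class (>$14), where 70 donations were observed against roughly 11 expected.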
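… continued
Step 1   H0: The observations follow the normal distribution
         H1: The observations do NOT follow the normal
         distribution
Step 2   α = 0.05
Step 3   H0 is rejected if χ² > 7.815, the critical value for df = 3
         (k - p - 1 = 6 - 2 - 1, treating the mean and standard
         deviation as estimated from the sample). If they are taken
         as known, df = k - 1 = 5 and the critical value is 11.070;
         the decision is the same either way.
Step 4   χ² = 336.33

H0 is rejected.
The observations do NOT follow the normal distribution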
A contingency table is used to investigate
whether two traits or characteristics
are related

… each observation is classified according to two criteria
… the usual hypothesis testing procedure is used

… the degrees of freedom are equal to:
(number of rows - 1)(number of columns - 1)

… the expected frequency is computed as:
Expected Frequency = (row total)(column total)/grand total
Is there a relationship between the
location of an accident and the gender
of the person involved in the accident?

A sample of 150 accidents reported to the
police were classified by location and gender.
At the .05 level of significance, can we
conclude that gender and the location of
the accident are related?
… continued

                   Location
Sex         Work   Home   Other   Total
Male         60     20     10       90
Female       20     30     10       60
Total        80     50     20      150

The expected frequency for the work-male
intersection is computed as (90)(80)/150 = 48.
Similarly, you can compute the
expected frequencies for the other cells.
… continued
Step 1   H0: Gender and Location are NOT related
         H1: Gender and Location are related
Step 2   α = 0.05
Step 3   H0 is rejected if χ² > 5.991, df = 2
         (…there are (3-1)(2-1) = 2 degrees of freedom)
Step 4   Find the value of χ²:

$\chi^2 = \frac{(60 - 48)^2}{48} + \dots + \frac{(10 - 8)^2}{8} = 16.667$

H0 is rejected.
Gender and Location are related!
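The whole procedure for a contingency table fits in one small function. The sketch below applies it to the accident data; the function name `chi_square_independence` and the returned minimum expected count (useful for the "expected count less than 5" check that SPSS reports) are illustrative choices, not part of the original slides.

```python
def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom, minimum expected count)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    min_expected = float("inf")
    for i, row in enumerate(table):
        for j, fo in enumerate(row):
            fe = row_totals[i] * col_totals[j] / n   # (row total)(column total)/grand total
            chi2 += (fo - fe) ** 2 / fe
            min_expected = min(min_expected, fe)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, min_expected

# Rows: Male, Female; columns: Work, Home, Other.
accidents = [[60, 20, 10],
             [20, 30, 10]]
chi2, df, min_expected = chi_square_independence(accidents)

critical_value = 5.991                 # tabled value for df = 2, alpha = .05
reject_h0 = chi2 > critical_value
print(round(chi2, 3), df, min_expected, reject_h0)
```

The same function reproduces the gun control example when given that table, since both follow the identical (rows - 1)(columns - 1) procedure.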