
Ch. 12 Examples & Illustrations


An illustration of the shapes of the χ² distribution is given below for df equal to 2, 5 and 10.
You can see that it progressively approaches a symmetrical-looking shape as df increases.

[Figure: three panels showing the Chi-square distribution for df = 2, df = 5 and df = 10.
Panel annotations: Chi-square = 25.74, P(lower) = .9999, P(upper) = .0001;
Chi-square = 41.30, P(lower) = 1.0000, P(upper) = 1.00E-05.]
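The move toward symmetry can also be seen numerically. The sketch below relies on the standard
moments of the Chi-square distribution (mean df, variance 2*df, skewness sqrt(8/df)). These
formulas are not stated in the text, and the function name is only illustrative.

```python
import math

def chi_square_moments(df):
    """Return (mean, variance, skewness) of a chi-square distribution with df degrees of freedom."""
    mean = df
    variance = 2 * df
    skewness = math.sqrt(8 / df)  # a skewness of 0 would mean a perfectly symmetric curve
    return mean, variance, skewness

for df in (2, 5, 10):
    mean, variance, skewness = chi_square_moments(df)
    print(f"df = {df:2d}: mean = {mean}, variance = {variance}, skewness = {skewness:.3f}")
```

The skewness shrinks as df grows, which is the numerical counterpart of the curve looking more
symmetrical.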

The tables for the χ² distribution for degrees of freedom from 1 to 100 are given at the back of
the book. The degrees of freedom are indicated in the first column. For any degrees of freedom,
the χ² value given in the table is the value that has the area to its right indicated by the
subscript of χ² in the top row. For example, for 10 degrees of freedom the χ² value that has
10 percent of the area to its right is 15.9871. The χ² distribution is generally used for the
Goodness of Fit Test or the Test of Independence.
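A table entry can be checked without the printed table. For an even number of degrees of freedom
the area to the right of x has a closed form, exp(-x/2) times the sum of (x/2)^k / k! for
k = 0 to df/2 - 1. The sketch below uses this identity, which is a standard result rather than
something stated in the text; the function name is illustrative and odd df is deliberately not
handled.

```python
import math

def upper_tail_even_df(x, df):
    """Area to the right of x under the chi-square curve; valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2
    terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * terms

# Table value quoted above: 15.9871 for 10 df and 10 percent area to the right.
area = upper_tail_even_df(15.9871, 10)
print(f"area to the right of 15.9871 with 10 df: {area:.4f}")
```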

1. The χ² Test of Goodness of Fit in the Case of Equal Expected Frequencies


Example 1: A marketing manager for a manufacturer of sports cards plans to begin a new series
with the pictures and playing statistics of six former major league baseball players. She sets up a
booth and sells 120 cards. She wants to find out whether the player pictured (with playing
statistics) makes a significant difference to the sale of the sports cards.
H0: Sales are independent of the player's profile, i.e. all players are treated equally
H1: At least one player is not treated the same as the others
The sample results are given below:
Player   No. of cards (fo)   Expected freq. (fe)
A        13                  20
B        30                  20
C        14                  20
D        15                  20
E        28                  20
F        20                  20
_____________________________________________
Total    120                 120

Note that a general convention is to consider n sufficiently large for the Chi-square test if all
the expected frequencies (fe) are at least 5. If the expected frequency for any cell falls below 5,
it is better to combine it with another category. The formula to calculate Chi-square is given
below:
χ² = Σ (fo - fe)² / fe
This statistic is distributed as χ² with k-1 degrees of freedom, where k is the number of
categories. We have to subtract one because only five of the six frequencies can be arbitrarily
determined once the total is fixed. The sixth is determined when the five others and the total are
given. Let us do the calculations indicated by the formula as follows:
Player   fo - fe   (fo - fe)²   (fo - fe)² / fe
A        -7        49           2.45
B        10        100          5.00
C        -6        36           1.80
D        -5        25           1.25
E         8        64           3.20
F         0        0            0
Total                           13.7 = χ² calculated

For 6-1 = 5 degrees of freedom, the table value is χ²_.05 = 11.0705.


Since the calculated test statistic exceeds the table value, we reject the null hypothesis that
sales are independent of the player's profile. In other words, at the 5% significance level the
sample does not fit the hypothesis of a uniform distribution (equal expected frequencies).
However, χ²_.01 is 15.0863, which the calculated statistic does not exceed, so we cannot reject
the null hypothesis at the 1% level. The impact of a player's photo and profile is significant at
the 5% level but not at the 1% level.
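The hand calculation for Example 1 can be reproduced in a few lines. This is a minimal sketch:
the data and the two table values are copied from the text, while the function and variable
names are illustrative.

```python
def chi_square_statistic(observed, expected):
    """Sum of (fo - fe)^2 / fe over all categories."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

observed = [13, 30, 14, 15, 28, 20]  # cards sold for players A to F
expected = [sum(observed) / len(observed)] * len(observed)  # 20 each under H0
df = len(observed) - 1

statistic = chi_square_statistic(observed, expected)

# Critical values copied from the text (table values for 5 df).
critical = {0.05: 11.0705, 0.01: 15.0863}
for alpha, table_value in critical.items():
    decision = "reject H0" if statistic > table_value else "do not reject H0"
    print(f"alpha = {alpha}: chi-square = {statistic:.2f}, table value = {table_value}, {decision}")
```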

The MegaStat result is given below. In Excel, create two columns: one for the observed and one for
the expected frequencies. Then go to Chi-square/cross tab, select Goodness of Fit, fill in the
input section of the dialogue box and put 0 for the number of parameters estimated to get the
results. The following table shows all the calculations we did above using a calculator. It gives
a value of the calculated test statistic exactly equal to what we obtained.
Goodness of Fit Test
observed   expected   O - E     (O - E)² / E
13         20.000     -7.000    2.450
30         20.000     10.000    5.000
14         20.000     -6.000    1.800
15         20.000     -5.000    1.250
28         20.000      8.000    3.200
20         20.000      0.000    0.000
120        120.000     0.000    13.700

13.70 chi-square
5 df
.0176 p-value

The p-value indicates that we can reject the null hypothesis at the 5% level but not at the 1% level.
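The p-value reported by MegaStat can be checked by hand. For an odd number of degrees of freedom
the upper-tail area can be written with the complementary error function plus a short finite
sum. The sketch below uses that standard identity, which is an assumption in the sense that the
text does not say how MegaStat computes the value; the function name is illustrative.

```python
import math

def upper_tail_odd_df(x, df):
    """P(chi-square > x) for odd df, using the complementary error function."""
    if df <= 0 or df % 2 != 1:
        raise ValueError("this closed form needs a positive odd df")
    tail = math.erfc(math.sqrt(x / 2))
    term = math.sqrt(2 * x / math.pi) * math.exp(-x / 2)  # first term of the finite sum
    for k in range(1, (df - 1) // 2 + 1):
        tail += term
        term *= x / (2 * k + 1)  # each later term divides by the next odd number
    return tail

p_value = upper_tail_odd_df(13.7, 5)
print(f"p-value = {p_value:.4f}")
print("reject at 5%:", p_value < 0.05, "| reject at 1%:", p_value < 0.01)
```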

2. The χ² Test of Goodness of Fit in the Case of Unequal Expected Frequencies

Example 2: A recent (hypothetical) national survey of people between 25 and 50 years of age who
had hospital admissions during a two-year period showed that 40% had 1 admission only, 20% had
2 admissions, 14% had 3 admissions, 10% had 4 admissions, 8% had 5 admissions, 6% had
6 admissions and only 2% had 7 or more admissions. The mayor of a small city claims that his
city is much healthier than the national average. He even cites the percentages for the two
extreme categories. He says that 44% of the local population in the given age group had only one
hospital admission (compared to 40% nationally) and that the percentage with 6 or more
admissions is only 5%, compared to 8% nationally. His claim was in fact based on a sample of
400 randomly selected people in the specified age group who were interviewed by a local
newspaper. It was revealed that 176 people had only 1 admission, 75 had 2 admissions, 50 had
3 admissions, 44 had 4 admissions, 35 had 5 admissions, 15 had 6 admissions and only 5 had
7 or more admissions. Is the claim of the mayor valid? Test at 5% and 10%.
Looking at the two extreme categories, the mayor's claim seems to have strong evidence. But
statisticians at the local university wanted to test the claim using more scientific methods.
Does the overall data support the mayor's claim?
The null hypothesis in this case is that the percentages in all the categories (number of hospital
admissions) in the local population are the same as in the national population. The alternative
hypothesis is that the local and national patterns (or percentages) are different. We obtain the
expected frequencies by multiplying the percentages in the national survey by the total number of
observations in the local survey. For example, the expected frequency for only one admission is
0.40*400 = 160 (assuming equality between local and national percentages). The following table
will make it clear.
Admissions   National %   fe    fo    fo - fe   (fo - fe)²   (fo - fe)² / fe
1            40           160   176   16        256          1.600
2            20           80    75    -5        25           0.313
3            14           56    50    -6        36           0.643
4            10           40    44    4         16           0.400
5            8            32    35    3         9            0.281
6            6            24    15    -9        81           3.375
7+           2            8     5     -3        9            1.125
Total        100          400   400   0         ---          7.737
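The expected-frequency column can be generated directly from the national percentages. A minimal
sketch, with the percentages and sample size taken from Example 2 and variable names chosen for
illustration:

```python
national_percent = {"1": 40, "2": 20, "3": 14, "4": 10, "5": 8, "6": 6, "7+": 2}
sample_size = 400  # people interviewed in the local survey

# Expected count per category if the local pattern matched the national one.
expected = {k: pct * sample_size / 100 for k, pct in national_percent.items()}

for admissions, fe in expected.items():
    print(f"{admissions:>2} admissions: fe = {fe:.0f}")

# The convention quoted earlier asks for every expected frequency to be at least 5.
print("all fe >= 5:", all(fe >= 5 for fe in expected.values()))
```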

The calculated test statistic is 7.737 and the degrees of freedom are 7-1 = 6.
For this df the table gives χ²_.10 = 10.6446 and χ²_.05 = 12.5916. Thus the null hypothesis of no
difference between the national and local populations with respect to the number of hospital
admissions cannot be rejected even at the 10% level. The mayor's claim was found to lack strong
evidence in the data when the scientific hypothesis testing method was applied, although
initially it seemed to have some evidence.
To the computer it does not matter whether the case is one of equal expected frequencies or
unequal expected frequencies. The process is the same.
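It can help to see how much each category contributes to the statistic, since the mayor argued
from the extreme categories only. The sketch below recomputes the last column of the table and
the decision at both levels; the data and table values are from the text, the variable names are
illustrative.

```python
categories = ["1", "2", "3", "4", "5", "6", "7+"]
observed = [176, 75, 50, 44, 35, 15, 5]
expected = [160, 80, 56, 40, 32, 24, 8]

# Contribution of each category to the chi-square statistic.
contributions = [(fo - fe) ** 2 / fe for fo, fe in zip(observed, expected)]
statistic = sum(contributions)
df = len(categories) - 1

for label, part in zip(categories, contributions):
    print(f"{label:>2} admissions: {part:.3f}")
print(f"chi-square = {statistic:.3f} with {df} df")

# Table values quoted in the text for 6 df.
for alpha, table_value in {0.10: 10.6446, 0.05: 12.5916}.items():
    print(f"alpha = {alpha}: reject H0? {statistic > table_value}")
```

The category with 6 admissions contributes the most, yet the total still falls short of both
table values.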
Goodness of Fit Test
observed   expected   O - E     (O - E)² / E
176        160.000    16.000    1.600
75         80.000     -5.000    0.313
50         56.000     -6.000    0.643
44         40.000      4.000    0.400
35         32.000      3.000    0.281
15         24.000     -9.000    3.375
5          8.000      -3.000    1.125
400        400.000     0.000    7.737

7.74 chi-square
6 df
.2580 p-value

The p-value of .2580 is well above .10, which supports the conclusion above.

3. Chi-Square Test of Independence using Contingency Table


Example 3: A sample of 500 individuals was collected to study whether letter grade has a
significant impact on income 10 years after graduation. Suppose income level is divided
into three (arbitrary) groups: High Income, Middle Income and Low Income. The observed
frequencies are shown in the contingency table below:

Table of observed frequencies of Income level by Letter Grade

            Grade
Income      A      B      C      D      Total
High        18     14     12     6      50
Middle      52     70     100    78     300
Low         20     26     58     46     150
Total       90     110    170    130    500

We have the observed frequencies and need to find the expected frequencies. After that, the
formula for the test statistic is the same as in the case of the Goodness of Fit test. The formula
for the expected frequencies is based on the null hypothesis that the rows and columns are
independent of each other.
If fe_ij denotes the expected frequency in cell (i,j), then
fe_ij = (Row i total * Column j total) / Grand total
For example, the expected frequency in cell (1,1), the upper left corner cell, would be
50*90/500 = 9, whereas the observed frequency is 18. It is also customary to show both types of
frequencies in the same table so that pairwise differences can be easily calculated. The row and
column totals for the observed and expected frequencies must be identical. Therefore, if you
have to do rounding, keep this in mind.
Table of observed and expected frequencies of Income level by Letter Grade
(expected frequencies in parentheses)

            Income
Grade       High       Middle       Low        Total
A           18 (9)     52 (54)      20 (27)    90
B           14 (11)    70 (66)      26 (33)    110
C           12 (17)    100 (102)    58 (51)    170
D           6 (13)     78 (78)      46 (39)    130
Total       50         300          150        500
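The expected frequencies in parentheses follow mechanically from the row and column totals. A
minimal sketch of that step, using the observed counts from Example 3 (the dictionary layout and
names are illustrative):

```python
observed = {
    "High":   [18, 14, 12, 6],
    "Middle": [52, 70, 100, 78],
    "Low":    [20, 26, 58, 46],
}  # columns are grades A, B, C, D

row_totals = {income: sum(row) for income, row in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(row_totals.values())

# fe_ij = (row i total * column j total) / grand total
expected = {
    income: [row_totals[income] * c / grand_total for c in col_totals]
    for income in observed
}

for income, row in expected.items():
    print(income, row)
```

Because every expected value here is a whole number, no rounding is needed and the row and column
totals of the expected table match the observed ones exactly.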

The degree of freedom formula is: df = (number of rows-1)*(number of columns-1).


In the present example this would give (3-1)*(4-1) = 6 degrees of freedom.

χ² = Σ (fo - fe)² / fe = {(18-9)²/9} + {(14-11)²/11} + ... + {(46-39)²/39} = 20.93


Note that there are 12 terms in the above sum, one for each cell.
The table values for 6 df are χ²_.05 = 12.5916 and χ²_.01 = 16.8119. The calculated test statistic
exceeds both. Therefore, the null hypothesis of independence is rejected even at the 1% level.
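The 12-term sum is tedious by hand but short in code. The sketch below takes the
(observed, expected) pairs from the combined table and the two table values from the text; the
names are illustrative.

```python
# (observed, expected) pairs, one row per grade; columns are High, Middle, Low.
cells = {
    "A": [(18, 9), (52, 54), (20, 27)],
    "B": [(14, 11), (70, 66), (26, 33)],
    "C": [(12, 17), (100, 102), (58, 51)],
    "D": [(6, 13), (78, 78), (46, 39)],
}

n_income_levels, n_grades = 3, 4
df = (n_income_levels - 1) * (n_grades - 1)

# One term per cell of the contingency table.
terms = [(fo - fe) ** 2 / fe for row in cells.values() for fo, fe in row]
statistic = sum(terms)

print(f"{len(terms)} terms, chi-square = {statistic:.2f}, df = {df}")

# Table values quoted in the text for 6 df.
for alpha, table_value in {0.05: 12.5916, 0.01: 16.8119}.items():
    print(f"alpha = {alpha}: reject independence? {statistic > table_value}")
```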
The MegaStat results are given below. Given the low p-value (.0019), we can reject independence
at the 1% level: letter grade does matter for future income, which is not a big surprise. (Note
that in MegaStat for contingency tables you do not need to enter the expected frequencies; you
only provide the observed frequencies.)
Chi-square Contingency Table Test for Independence

                          A         B         C         D         Total
HIGH    Observed          18        14        12        6         50
        Expected          9.00      11.00     17.00     13.00     50.00
        O - E             9.00      3.00      -5.00     -7.00     0.00
        (O - E)² / E      9.00      0.82      1.47      3.77      15.06

MED     Observed          52        70        100       78        300
        Expected          54.00     66.00     102.00    78.00     300.00
        O - E             -2.00     4.00      -2.00     0.00      0.00
        (O - E)² / E      0.07      0.24      0.04      0.00      0.36

LOW     Observed          20        26        58        46        150
        Expected          27.00     33.00     51.00     39.00     150.00
        O - E             -7.00     -7.00     7.00      7.00      0.00
        (O - E)² / E      1.81      1.48      0.96      1.26      5.52

Total   Observed          90        110       170       130       500
        Expected          90.00     110.00    170.00    130.00    500.00
        O - E             0.00      0.00      0.00      0.00      0.00
        (O - E)² / E      10.89     2.55      2.47      5.03      20.93

20.93 chi-square
6 df
.0019 p-value
