252anova 1/26/07 (Open this document in 'Outline' view!) Roger Even Bove
F. ANALYSIS OF VARIANCE
1. 1-Way Analysis of Variance
a. The ANOVA model - relation to regression
The one-way ANOVA model is used to compare the means of more than two samples, taken from
populations that are all assumed to have the same variance. Each sample (called a treatment) is usually
represented as a column, but there is no requirement that each column have the same number of items in
it.
We will assume the model $x_{ij} = a_j + e_{ij}$, where $i = 1$ through $n_j$, and $n = \sum_{j=1}^{m} n_j$. We thus have $m$
treatments, $n_j$ items in column $j$ and a total of $n$ observations. (Thus $x_{ij}$ should be the number in
row $i$ of column $j$.)
b. An ANOVA problem
The following data describes monthly expenses for energy in three random samples of essentially
identical homes. Each column represents expenses on one fuel. $\alpha = .05$.
Fuel
1 2 3
89 104 86
101 120 98
87 98 100
87 110 96
Sum 364 432 380
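Our hypotheses are $H_0: \mu_1 = \mu_2 = \mu_3$ and $H_1$: not all of the means are equal.
In the notation used here, $i$ is replaced by a dot to indicate that the observations have been averaged over that index, so that $\bar{x}_{.j}$ is the mean of column $j$; $\bar{x}$ is the overall mean.
(i) The total sum of squares highlights the difference between the individual numbers and the overall mean.
$$
\begin{aligned}
SST = \sum_j \sum_i (x_{ij} - \bar{x})^2 ={} & (89-98)^2 + (104-98)^2 + (86-98)^2 \\
& + (101-98)^2 + (120-98)^2 + (98-98)^2 \\
& + (87-98)^2 + (98-98)^2 + (100-98)^2 \\
& + (87-98)^2 + (110-98)^2 + (96-98)^2 = 1148
\end{aligned}
$$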
(ii) The sum of squares within treatments has the same number of terms, but highlights the contribution
to the total sum of squares generated by the difference between the individual numbers and the column
(treatment) means.
$$
\begin{aligned}
SSW = \sum_j \sum_i (x_{ij} - \bar{x}_{.j})^2 ={} & (89-91)^2 + (104-108)^2 + (86-95)^2 \\
& + (101-91)^2 + (120-108)^2 + (98-95)^2 \\
& + (87-91)^2 + (98-108)^2 + (100-95)^2 \\
& + (87-91)^2 + (110-108)^2 + (96-95)^2 = 516
\end{aligned}
$$
(iii) The sum of squares between treatments also has the same number of terms, but it highlights the contribution to the
total sum of squares generated by the difference between the column (treatment) means and the overall
mean.
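$$
\begin{aligned}
SSB = \sum_j \sum_i (\bar{x}_{.j} - \bar{x})^2 ={} & (91-98)^2 + (108-98)^2 + (95-98)^2 \\
& + (91-98)^2 + (108-98)^2 + (95-98)^2 \\
& + (91-98)^2 + (108-98)^2 + (95-98)^2 \\
& + (91-98)^2 + (108-98)^2 + (95-98)^2 = 632
\end{aligned}
$$
But, because of the repetition of the column mean, this can be simplified to
$$SSB = \sum_j n_j (\bar{x}_{.j} - \bar{x})^2 = 4(91-98)^2 + 4(108-98)^2 + 4(95-98)^2 = 632.$$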
But note that $SSB + SSW = SST$, so that the computation of one of the three sums of squares is
unnecessary. The material is summarized in a table like the one below.
Source     SS     DF        MS                            F
Between    SSB    $m - 1$   $MSB = \frac{SSB}{m - 1}$     $F = \frac{MSB}{MSW}$
Within     SSW    $n - m$   $MSW = \frac{SSW}{n - m}$
Total      SST    $n - 1$
We fill in the table with the numbers we have computed and compare the F that we have computed with
an F with the appropriate significance level and degrees of freedom shown in the DF column. If the
F that we have computed is larger than the table F , reject the null hypothesis.
Source     SS      DF    MS        F         $F_{.05}$                   $H_0$
Between    632     2     316       5.51 s    $F_{.05}^{(2,9)} = 4.26$    Column means equal
Within     516     9     57.333
Total      1148    11
The s for significant difference indicates that the null hypothesis of equality of means has been rejected.
ns for no significant difference would indicate that the null hypothesis has not been rejected.
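The table above can be reproduced with a few lines of Python. This is only a minimal sketch of the definitions of SSB, SSW and SST given earlier; the function name `one_way_anova` and the dictionary keys are illustrative choices, not part of the original notes, and the critical value 4.26 is simply copied from the table.

```python
# Fuel-expense data from the example; one list per treatment (column).
columns = [
    [89, 101, 87, 87],
    [104, 120, 98, 110],
    [86, 98, 100, 96],
]

def one_way_anova(columns):
    """Return the entries of a one-way ANOVA table (hypothetical helper)."""
    n = sum(len(col) for col in columns)          # total number of observations
    m = len(columns)                              # number of treatments
    grand_mean = sum(sum(col) for col in columns) / n
    col_means = [sum(col) / len(col) for col in columns]

    # Between: column means against the overall mean, weighted by column size.
    ssb = sum(len(col) * (mean - grand_mean) ** 2
              for col, mean in zip(columns, col_means))
    # Within: individual numbers against their own column mean.
    ssw = sum((x - mean) ** 2
              for col, mean in zip(columns, col_means) for x in col)
    # Total: individual numbers against the overall mean.
    sst = sum((x - grand_mean) ** 2 for col in columns for x in col)

    msb = ssb / (m - 1)
    msw = ssw / (n - m)
    return {"SSB": ssb, "SSW": ssw, "SST": sst,
            "DF_between": m - 1, "DF_within": n - m, "DF_total": n - 1,
            "MSB": msb, "MSW": msw, "F": msb / msw}

table = one_way_anova(columns)
F_CRITICAL = 4.26                       # F.05 with (2, 9) df, taken from the table above
reject_h0 = table["F"] > F_CRITICAL     # True means "s" (significant difference)
```

Computing all three sums of squares directly, rather than finding one by subtraction, gives a built-in check that SSB + SSW = SST.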
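Fuel           1          2          3         Sum
              89        104         86
             101        120         98
              87         98        100
              87        110         96
Sum          364   +    432   +    380   =    1176   $= \sum x_{ij}$
$n_j$          4   +      4   +      4   =      12   $= n$
$\bar{x}_{.j}$   91.00     108.00      95.00              $\bar{x} = \frac{1176}{12} = 98$
SS         33260   +  46920   +  36216   =  116396   $= \sum x_{ij}^2$
$\bar{x}_{.j}^2$  8281   +  11664   +   9025   =   28970   $= \sum \bar{x}_{.j}^2$

$SST = \sum x_{ij}^2 - n\bar{x}^2 = 116396 - 12(98)^2 = 1148$
$SSB = \sum n_j \bar{x}_{.j}^2 - n\bar{x}^2 = 4(28970) - 12(98)^2 = 632$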
Source     SS      DF    MS        F         $F_{.05}$                   $H_0$
Between    632     2     316       5.51 s    $F_{.05}^{(2,9)} = 4.26$    Column means equal
Within     516     9     57.333
Total      1148    11
Explanation: Since the Sum of Squares (SS) column must add up, 516 is found by subtracting
632 from 1148. Since $n = 12$, the total degrees of freedom are $n - 1 = 11$. Since there are 3 random
samples or columns, the degrees of freedom for Between is 3 − 1 = 2. Since the Degrees of Freedom (DF)
column must add up, 9 = 11 − 2. The Mean Square (MS) column is found by dividing the SS column by
the DF column. 316 is $MSB$ and 57.333 is $MSW$. $F = \frac{MSB}{MSW}$, and is compared with $F_{.05}$ from
the F table ($df_1 = 2$, $df_2 = 9$). To see this as Minitab output go to 252anovaex1.
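The computational shortcut used in this version of the problem (sums and sums of squares instead of individual deviations) can be sketched as follows. The function name `shortcut_sums_of_squares` is an illustrative assumption; the arithmetic follows the explanation above, with SSW found by subtraction.

```python
columns = [
    [89, 101, 87, 87],
    [104, 120, 98, 110],
    [86, 98, 100, 96],
]

def shortcut_sums_of_squares(columns):
    """Return (SST, SSB, SSW) from column sums and sums of squares (hypothetical helper)."""
    n = sum(len(col) for col in columns)
    grand_mean = sum(sum(col) for col in columns) / n
    sum_of_squares = sum(x * x for col in columns for x in col)   # sum of all x squared
    correction = n * grand_mean ** 2                              # n times the squared overall mean

    sst = sum_of_squares - correction
    # Each column contributes n_j times its squared mean.
    ssb = sum(len(col) * (sum(col) / len(col)) ** 2 for col in columns) - correction
    ssw = sst - ssb        # the SS column must add up
    return sst, ssb, ssw

sst, ssb, ssw = shortcut_sums_of_squares(columns)
```

The shortcut avoids computing twelve separate deviations, at the cost of working with large intermediate numbers such as 116396.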
d. Confidence Intervals
i. A single Confidence Interval
If we desire a single interval, we use the formula for the difference between two means when the variance
is known. For example, for the difference between the means of column 1 and column 2, we use
$$\mu_1 - \mu_2 = (\bar{x}_{.1} - \bar{x}_{.2}) \pm t_{\alpha/2}^{(n-m)}\, s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}, \text{ where } s = \sqrt{MSW}.$$
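As an illustration, the interval for fuels 1 and 2 can be worked out numerically. This is a sketch under stated assumptions: the function `single_interval` is hypothetical, and the critical value 2.262 (the upper 2.5% point of t with 9 degrees of freedom) is assumed to have been read from a t table rather than computed.

```python
import math

def single_interval(mean_1, mean_2, n_1, n_2, msw, t_crit):
    """Interval for the difference of two column means (hypothetical helper).

    t_crit is the tabled t value for alpha/2 with n - m degrees of freedom.
    """
    s = math.sqrt(msw)
    half_width = t_crit * s * math.sqrt(1 / n_1 + 1 / n_2)
    difference = mean_1 - mean_2
    return difference - half_width, difference + half_width

MSW = 516 / 9        # within mean square from the ANOVA table
T_CRIT = 2.262       # assumed table value: t(.025) with 9 degrees of freedom

lower, upper = single_interval(91, 108, 4, 4, MSW, T_CRIT)
contains_zero = lower <= 0 <= upper
```

If the interval excludes zero, the two column means differ significantly at the chosen level.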
b. An example
                        Insulation 1        Insulation 2
                        (Factor $B_1$)      (Factor $B_2$)
Fuel 1 (Factor $A_1$)    89                  87
                        101                  87
Fuel 2 (Factor $A_2$)   120                  98
                        110                 104
Fuel 3 (Factor $A_3$)   100                  86
                         98                  96
This problem has $R = 3$ rows, $C = 2$ columns and, within each cell, $P = 2$ measurements. We can
compute a table of means which shows means for each cell, row and column, as well as an overall mean.
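                        Insulation 1              Insulation 2              Row means
                        (Factor $B_1$)            (Factor $B_2$)
Fuel 1 (Factor $A_1$)   $\bar{x}_{11.} = 95$      $\bar{x}_{12.} = 87$      $\bar{x}_{1..} = 91$
Fuel 2 (Factor $A_2$)   $\bar{x}_{21.} = 115$     $\bar{x}_{22.} = 101$     $\bar{x}_{2..} = 108$
Fuel 3 (Factor $A_3$)   $\bar{x}_{31.} = 99$      $\bar{x}_{32.} = 91$      $\bar{x}_{3..} = 95$
Column means            $\bar{x}_{.1.} = 103$     $\bar{x}_{.2.} = 93$      $\bar{x} = 98$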
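The table of means can be checked with a short script. This is a minimal sketch; the nested-list layout `data[i][j]` and the helper names are assumptions made for the illustration.

```python
# data[i][j] holds the P = 2 measurements for fuel i + 1 (row) and insulation j + 1 (column).
data = [
    [[89, 101], [87, 87]],
    [[120, 110], [98, 104]],
    [[100, 98], [86, 96]],
]

def mean(values):
    return sum(values) / len(values)

def table_of_means(data):
    """Return cell means, row means, column means and the overall mean."""
    n_cols = len(data[0])
    cell_means = [[mean(cell) for cell in row] for row in data]
    # A row mean pools every measurement in that row; likewise for columns.
    row_means = [mean([x for cell in row for x in cell]) for row in data]
    col_means = [mean([x for row in data for x in row[j]]) for j in range(n_cols)]
    grand_mean = mean([x for row in data for cell in row for x in cell])
    return cell_means, row_means, col_means, grand_mean

cell_means, row_means, col_means, grand_mean = table_of_means(data)
```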
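Now we do the computation of sums of squares, using the same simplification that we use in computing a
sample variance.
$$
\begin{aligned}
SST &= \sum_i \sum_j \sum_k (x_{ijk} - \bar{x})^2 \\
&= (89-98)^2 + (101-98)^2 + (87-98)^2 + (87-98)^2 + (120-98)^2 + \cdots + (96-98)^2 \\
&= 89^2 + 101^2 + 87^2 + 87^2 + 120^2 + \cdots + 96^2 - 12(98)^2 = 1148
\end{aligned}
$$
$$
\begin{aligned}
SSW &= \sum_i \sum_j \sum_k (x_{ijk} - \bar{x}_{ij.})^2 \\
&= (89-95)^2 + (101-95)^2 + (87-87)^2 + (87-87)^2 + (120-115)^2 + \cdots + (96-91)^2 \\
&= \left[89^2 + 101^2 - 2(95)^2\right] + \left[87^2 + 87^2 - 2(87)^2\right] + \left[120^2 + 110^2 - 2(115)^2\right] \\
&\quad + \cdots + \left[86^2 + 96^2 - 2(91)^2\right] = 192
\end{aligned}
$$
$$
\begin{aligned}
SSR &= CP \sum_i (\bar{x}_{i..} - \bar{x})^2 = 2 \cdot 2\left[(91-98)^2 + (108-98)^2 + (95-98)^2\right] \\
&= 2 \cdot 2\left[91^2 + 108^2 + 95^2 - 3(98)^2\right] = 632
\end{aligned}
$$
$$
\begin{aligned}
SSC &= RP \sum_j (\bar{x}_{.j.} - \bar{x})^2 = 3 \cdot 2\left[(103-98)^2 + (93-98)^2\right] \\
&= 3 \cdot 2\left[103^2 + 93^2 - 2(98)^2\right] = 300
\end{aligned}
$$
$SSI = P \sum_i \sum_j (\bar{x}_{ij.} - \bar{x}_{i..} - \bar{x}_{.j.} + \bar{x})^2$, but we do not compute this because $SST = SSR + SSC + SSI + SSW$, so that
$SSI = SST - SSR - SSC - SSW = 1148 - 632 - 300 - 192 = 24$.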
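The five sums of squares can be verified with a short script. This is a sketch that assumes a balanced design (the same number of measurements in every cell); the function name and the nested-list layout are illustrative. Unlike the hand calculation, it computes the interaction sum of squares directly from its definition, which gives a check on the identity SST = SSR + SSC + SSI + SSW.

```python
# data[i][j] holds the P measurements for row (fuel) i + 1 and column (insulation) j + 1.
data = [
    [[89, 101], [87, 87]],
    [[120, 110], [98, 104]],
    [[100, 98], [86, 96]],
]

def mean(values):
    return sum(values) / len(values)

def two_way_sums_of_squares(data):
    """Sums of squares for a balanced two-way layout (hypothetical helper)."""
    R, C, P = len(data), len(data[0]), len(data[0][0])
    cell = [[mean(data[i][j]) for j in range(C)] for i in range(R)]
    # Averaging cell means is valid only because every cell has P measurements.
    row = [mean(cell[i]) for i in range(R)]
    col = [mean([cell[i][j] for i in range(R)]) for j in range(C)]
    grand = mean(row)

    cells = [(i, j) for i in range(R) for j in range(C)]
    sst = sum((x - grand) ** 2 for i, j in cells for x in data[i][j])
    ssw = sum((x - cell[i][j]) ** 2 for i, j in cells for x in data[i][j])
    ssr = C * P * sum((r - grand) ** 2 for r in row)
    ssc = R * P * sum((c - grand) ** 2 for c in col)
    ssi = P * sum((cell[i][j] - row[i] - col[j] + grand) ** 2 for i, j in cells)
    return {"SST": sst, "SSR": ssr, "SSC": ssc, "SSI": ssi, "SSW": ssw}

ss = two_way_sums_of_squares(data)
```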
c. Confidence Intervals
i. A Single Confidence Interval
If we desire a single interval we use the formula for a Bonferroni Confidence Interval below with $m = 1$.
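For Scheffé intervals, use the following. For cell means, use
$$\mu_{11} - \mu_{21} = (\bar{x}_{11.} - \bar{x}_{21.}) \pm \sqrt{(RC-1)\,F_{\alpha}^{(RC-1,\,RC(P-1))}}\,\sqrt{\frac{2\,MSW}{P}}.$$
For row means, use
$$\mu_{1.} - \mu_{2.} = (\bar{x}_{1..} - \bar{x}_{2..}) \pm \sqrt{(R-1)\,F_{\alpha}^{(R-1,\,RC(P-1))}}\,\sqrt{\frac{2\,MSW}{PC}}.$$
For column means, use
$$\mu_{.1} - \mu_{.2} = (\bar{x}_{.1.} - \bar{x}_{.2.}) \pm \sqrt{(C-1)\,F_{\alpha}^{(C-1,\,RC(P-1))}}\,\sqrt{\frac{2\,MSW}{PR}}.$$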
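If we only need $m$ different intervals, use for cell means
$$\mu_{11} - \mu_{21} = (\bar{x}_{11.} - \bar{x}_{21.}) \pm t_{\alpha/(2m)}^{(RC(P-1))}\,\sqrt{\frac{2\,MSW}{P}}.$$
Use for row means
$$\mu_{1.} - \mu_{2.} = (\bar{x}_{1..} - \bar{x}_{2..}) \pm t_{\alpha/(2m)}^{(RC(P-1))}\,\sqrt{\frac{2\,MSW}{PC}}.$$
Use for column means
$$\mu_{.1} - \mu_{.2} = (\bar{x}_{.1.} - \bar{x}_{.2.}) \pm t_{\alpha/(2m)}^{(RC(P-1))}\,\sqrt{\frac{2\,MSW}{PR}}.$$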
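For Tukey intervals, use the following. For cell means, use
$$\mu_{11} - \mu_{21} = (\bar{x}_{11.} - \bar{x}_{21.}) \pm q_{\alpha}^{(RC,\,RC(P-1))}\,\sqrt{\frac{MSW}{P}}.$$
For row means, use
$$\mu_{1.} - \mu_{2.} = (\bar{x}_{1..} - \bar{x}_{2..}) \pm q_{\alpha}^{(R,\,RC(P-1))}\,\sqrt{\frac{MSW}{PC}}.$$
For column means, use
$$\mu_{.1} - \mu_{.2} = (\bar{x}_{.1.} - \bar{x}_{.2.}) \pm q_{\alpha}^{(C,\,RC(P-1))}\,\sqrt{\frac{MSW}{PR}}.$$
Note that if $P = 1$, replace $RC(P-1)$ with $(R-1)(C-1)$.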
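The interval formulas in this section differ only in the multiplier and in how many observations lie behind each mean, which a small set of helpers can make explicit. This is a hedged sketch: the function names are hypothetical, the critical values must be supplied by the caller from F, t or studentized-range tables, and the within mean square of 32 is derived here as 192 / (RC(P − 1)) from the sums of squares of the example rather than quoted from the notes.

```python
import math

def scheffe_half_width(k, f_crit, msw, n_per_mean):
    """sqrt((k - 1) F) * sqrt(2 MSW / n_per_mean); k is RC, R or C."""
    return math.sqrt((k - 1) * f_crit) * math.sqrt(2 * msw / n_per_mean)

def bonferroni_half_width(t_crit, msw, n_per_mean):
    """t * sqrt(2 MSW / n_per_mean); t_crit is the tabled t for alpha / (2m)."""
    return t_crit * math.sqrt(2 * msw / n_per_mean)

def tukey_half_width(q_crit, msw, n_per_mean):
    """q * sqrt(MSW / n_per_mean); q_crit is the tabled studentized range value."""
    return q_crit * math.sqrt(msw / n_per_mean)

# Fuel and insulation example: R rows, C columns, P measurements per cell.
R, C, P = 3, 2, 2
df_within = R * C * (P - 1)
msw = 192 / df_within

# Number of observations behind each kind of mean (the denominators P, PC and PR).
observations_per_mean = {"cell": P, "row": P * C, "column": P * R}
```

For row means, for example, the caller would pass `observations_per_mean["row"]` together with a critical value looked up for the degrees of freedom shown in the corresponding formula.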
4. Kruskal-Wallis Test
Equivalent to one-way ANOVA when the underlying distribution is non-normal.
H 0 : Columns come from same distribution or medians equal.
Example: Use the same example as for one-way ANOVA, but assume that the data come from a non-normal
source. Assume that $\alpha = .05$. There are $n = 12$ data items, so rank them from 1 to 12. Let $n_i$ be the
number of items in column $i$ and $SR_i$ be the rank sum of column $i$. $n = \sum n_i$.
         Original Data                          Ranked Data
         Treatment  Treatment  Treatment        Treatment  Treatment  Treatment
             1          2          3                1          2          3
            89        104         86                4         10          1
           101        120         98                9         12          6.5
            87         98        100                2.5        6.5        8
            87        110         96                2.5       11          5
$SR_i$                                             18.0       39.5       20.5
$n_i$                                               4          4          4
To check the ranking, note that the sum of the three rank sums is 18.0 + 39.5 + 20.5 = 78.0, and that the
sum of the first $n$ numbers is $\frac{n(n+1)}{2} = \frac{12(13)}{2} = 78$.
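The ranking step, including the averaging of tied ranks (the two 87s share ranks 2 and 3, the two 98s share ranks 6 and 7), can be sketched in Python. The function `average_ranks` is an illustrative helper, not something defined in the notes.

```python
columns = [
    [89, 101, 87, 87],
    [104, 120, 98, 110],
    [86, 98, 100, 96],
]

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend `end` over the run of values tied with the one at `pos`.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = ((pos + 1) + (end + 1)) / 2     # ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks

# Rank all 12 items together, then split the ranks back into their columns.
pooled = [x for col in columns for x in col]
ranks = average_ranks(pooled)
rank_sums = []
start = 0
for col in columns:
    rank_sums.append(sum(ranks[start:start + len(col)]))
    start += len(col)

n = len(pooled)
ranking_checks_out = sum(rank_sums) == n * (n + 1) / 2
```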
Now, compute the Kruskal-Wallis statistic
$$
\begin{aligned}
H &= \frac{12}{n(n+1)} \left[\sum_i \frac{SR_i^2}{n_i}\right] - 3(n+1) \\
&= \frac{12}{12(13)} \left[\frac{18.0^2}{4} + \frac{39.5^2}{4} + \frac{20.5^2}{4}\right] - 3(13) = \frac{1}{13}(576.125) - 39 = 5.3173.
\end{aligned}
$$
If we look up
this result in the (4, 4, 4) section of the Kruskal-Wallis table (Table 9), we find that the p-value for
$H = 5.6538$ is .054 and that the p-value for $H = 4.6539$ is .097, so the p-value for $H = 5.3173$
must lie between these two. Since both are above $\alpha = .05$, do not reject $H_0$.
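The statistic itself is easy to compute from the rank sums. The sketch below follows the formula as used in these notes, with no correction for ties; the function name is an assumption. Library routines such as `scipy.stats.kruskal` apply a tie correction, so they would be expected to report a slightly different value for this data.

```python
def kruskal_wallis_h(rank_sums, sizes):
    """H = 12 / (n(n + 1)) * sum(SR_i^2 / n_i) - 3(n + 1), without tie correction."""
    n = sum(sizes)
    weighted = sum(sr ** 2 / n_i for sr, n_i in zip(rank_sums, sizes))
    return 12 / (n * (n + 1)) * weighted - 3 * (n + 1)

h = kruskal_wallis_h([18.0, 39.5, 20.5], [4, 4, 4])
```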
If the size of the problem is larger than those shown in Table 9, use the $\chi^2$ distribution, with
$df = m - 1$, where $m$ is the number of columns. For example, if each of $m = 3$ columns contains 6
items, $\alpha = .05$ and $H = 5.3173$, compare $H$ with $\chi^{2(2)}_{.05} = 5.9915$. Since $H$ is smaller than
$\chi^2_{.05}$, do not reject the null hypothesis.
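For the special case of 2 degrees of freedom, the chi-squared upper tail has the closed form P(χ² > x) = exp(−x/2), so the critical value and an approximate p-value can be obtained without a table. The sketch below relies on that known property of the distribution; the variable names are illustrative, and it applies only when df = 2.

```python
import math

ALPHA = 0.05
H = 5.3173

# With 2 degrees of freedom, P(chi-squared > x) = exp(-x / 2).
critical_value = -2 * math.log(ALPHA)     # value of x whose upper tail equals alpha
p_value = math.exp(-H / 2)                # upper tail area beyond H
reject_h0 = H > critical_value
```

Comparing H with the critical value and comparing the p-value with α are equivalent ways of reaching the same decision.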
5. Friedman Test
Equivalent to two-way ANOVA with one observation per cell when the underlying distribution is non-
normal.
H 0 : Columns come from same distribution or medians equal. Note that the only difference between this
and the Kruskal-Wallis test is that the data is cross-classified in the Friedman test.
Example: Three groups of 4 matched workers are trained to do a task by four different methods. When
each worker is observed later, he or she is given a grade of 1 through 10 on performance of the task. Note
that because this data is ordinal, ANOVA is not appropriate. Assume that $\alpha = .05$. In the data below, the
methods are represented by $c = 4$ columns, and the groups by $r = 3$ rows. In each row the numbers are
ranked from 1 to $c = 4$. For each column, compute $SR_i$, the rank sum of column $i$.
Now compute the Friedman statistic
$$
\begin{aligned}
\chi^2_F &= \frac{12}{rc(c+1)} \left[\sum_i SR_i^2\right] - 3r(c+1) \\
&= \frac{12}{3(4)(5)} \left[11^2 + 5^2 + 4^2 + 10^2\right] - 3(3)(5) = \frac{1}{5}(262) - 45 = 7.4.
\end{aligned}
$$
If we find the place on
the Friedman Table (Table 8) for 4 columns and 3 rows, we find that the p-value for $\chi^2_F = 7.4$ is .033.
Since the p-value is below $\alpha = .05$, reject the null hypothesis.
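The Friedman statistic can be computed from the column rank sums in a few lines. This is a sketch with an assumed function name; it also checks that the rank sums add up to r·c(c + 1)/2, which must hold when each of the r rows is ranked from 1 to c.

```python
def friedman_statistic(rank_sums, r):
    """12 / (r c (c + 1)) * sum(SR_i^2) - 3 r (c + 1) for c columns and r rows."""
    c = len(rank_sums)
    # Each row contributes ranks 1..c, so the rank sums must total r * c(c + 1) / 2.
    if sum(rank_sums) != r * c * (c + 1) / 2:
        raise ValueError("rank sums do not add up; check the ranking")
    return 12 / (r * c * (c + 1)) * sum(sr ** 2 for sr in rank_sums) - 3 * r * (c + 1)

chi2_f = friedman_statistic([11, 5, 4, 10], r=3)
```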
If the size of the problem is larger than those shown in Table 10, use the $\chi^2$ distribution, with
$df = c - 1$, where $c$ is the number of columns. For example, if each of $c = 5$ columns contains 6
items, $\alpha = .05$ and $\chi^2_F = 7.4$, compare $\chi^2_F$ with $\chi^{2(4)}_{.05} = 9.4877$. Since $\chi^2_F$ is not larger than
$\chi^2_{.05}$, do not reject the null hypothesis.
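When the degrees of freedom are even, the chi-squared upper tail can be written as a finite sum, P(χ² > x) = exp(−x/2) Σ (x/2)^k / k! for k = 0 to df/2 − 1, which allows the table value to be checked and an approximate p-value to be attached to the comparison. The function below is a sketch of that identity and is limited to even df; its name is an assumption.

```python
import math

def chi2_upper_tail_even_df(x, df):
    """P(chi-squared > x) for even df, using the finite-sum closed form."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form applies only to positive even df")
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

ALPHA = 0.05
tail_at_table_value = chi2_upper_tail_even_df(9.4877, 4)   # should be close to alpha
p_value = chi2_upper_tail_even_df(7.4, 4)                  # tail area beyond the statistic
reject_h0 = p_value < ALPHA
```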