
Question 1

a) Looking at the plot, there clearly seems to be a difference between the treatments. Treatments 2, 3 and 4 seem to have less impact than treatments 5, 6 and 7, so using a fertilizer clearly increases the weight of the lentils. The summary output confirms this, as the p-value is smaller than 0.001. From the last two plots it can be concluded that the residuals are somewhat left-skewed, but this is due to a single observation, so the model still seems acceptable.
R-code:
d.lentil <- read.table("http://stat.ethz.ch/Teaching/Datasets/WBL/lentil.dat",
header=TRUE)
d.lentil$TR <- factor(d.lentil$TR)
stripchart(Y ~ TR, data = d.lentil, xlab = "treatment", vertical = TRUE, pch = 1)
fit.av <- aov(Y ~ TR, data = d.lentil)
summary(fit.av)
par(mfrow=c(1,2))
plot(fit.av, which = 1:2)
            Df Sum Sq Mean Sq F value   Pr(>F)
TR           6 115792   19299   45.97 2.03e-08 ***
Residuals   14   5878     420
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

b) The contrasts are orthogonal: for every pair of contrasts c and c', the element-wise products over the seven treatments sum to zero (the group sizes being equal):

sum_{i=1}^{7} c_i * c'_i = 0
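As a sketch, this orthogonality condition can be checked numerically for the contrast matrix used in part c) (equal group sizes, so plain dot products suffice):

```r
# Pairwise orthogonality check for the contrast matrix of part c).
# With equal group sizes, contrasts are orthogonal iff their dot products vanish.
mat.contr <- cbind(c(-6,1,1,1,1,1,1), c(0,-1,-1,-1,1,1,1),
                   c(0,2,-1,-1,2,-1,-1), c(0,0,-1,1,0,-1,1),
                   c(0,-2,1,1,2,-1,-1), c(0,0,1,-1,0,-1,1))
xp <- crossprod(mat.contr)      # t(C) %*% C; off-diagonal entries are the dot products
all(xp[upper.tri(xp)] == 0)     # TRUE: every pair of contrasts is orthogonal
```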

c) From the summary below it can be concluded that the contrasts L1, L2 and L3 are significant, as their p-values are smaller than 0.001, so their null hypotheses are rejected. For L4, L5 and L6 the p-values are larger than 0.05, so their null hypotheses cannot be rejected.
R-code:
mat.contr <- cbind(c(-6,1,1,1,1,1,1), c(0,-1,-1,-1,1,1,1),
                   c(0,2,-1,-1,2,-1,-1), c(0,0,-1,1,0,-1,1),
                   c(0,-2,1,1,2,-1,-1), c(0,0,1,-1,0,-1,1))
contrasts(d.lentil$TR) <- mat.contr
r.len <- aov(Y ~ TR, data = d.lentil)
summary(r.len, split=list(TR=list(L1=1,L2=2,L3=3,L4=4,L5=5,L6=6)))
             Df Sum Sq Mean Sq F value   Pr(>F)
TR            6 115792   19299  45.965 2.03e-08 ***
  TR: L1      1  73201   73201 174.348 2.72e-09 ***
  TR: L2      1  34060   34060  81.124 3.36e-07 ***
  TR: L3      1   8251    8251  19.651 0.000568 ***
  TR: L4      1    271     271   0.645 0.435378
  TR: L5      1      2       2   0.005 0.942679
  TR: L6      1      7       7   0.016 0.900906
Residuals    14   5878     420
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> summary.lm(r.len)

Call:
aov(formula = Y ~ TR, data = d.lentil)

Residuals:
    Min      1Q  Median      3Q     Max
-40.667 -12.000   1.333  14.000  22.333

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  341.286      4.471  76.327  < 2e-16 ***
TR1           24.103      1.825  13.204 2.72e-09 ***
TR2           43.500      4.830   9.007 3.36e-07 ***
TR3           15.139      3.415   4.433 0.000568 ***
TR4           -4.750      5.915  -0.803 0.435378
TR5            0.250      3.415   0.073 0.942679
TR6           -0.750      5.915  -0.127 0.900906
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.49 on 14 degrees of freedom
Multiple R-squared: 0.9517,  Adjusted R-squared: 0.931
F-statistic: 45.96 on 6 and 14 DF, p-value: 2.028e-08

Question 2
a) As can be seen from the summary, the p-value is smaller than 0.001, which means there are significant differences in the response time. Thus it can be concluded that not all types have the same expected response time, i.e. we reject the null hypothesis.
R-code:
y <- c(9, 12, 10, 8, 15, 20, 21, 23, 17, 30, 6, 5, 8, 16, 7)
type <- rep(1:3, each = 5)
circ <- data.frame(Type = type, Y = y)
circ$Type <- as.factor(circ$Type)
par(mfrow=c(1,1))
stripchart(Y ~ Type, data = circ, xlab = "treatment", vertical = TRUE, pch = 1)
circ.fit <- aov(Y ~ Type, data = circ)
summary(circ.fit)
par(mfrow=c(1,2))
plot(circ.fit, which = 1:2)
            Df Sum Sq Mean Sq F value   Pr(>F)
Type         2  543.6   271.8   16.08 0.000402 ***
Residuals   12  202.8    16.9
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

b) From the Tukey comparisons it can be concluded that only type 2 differs clearly from the others: the differences 2-1 and 3-2 are significant, while the difference 3-1 is close to zero and not significant (p = 0.64).
R-code:
TukeyHSD(circ.fit, "Type", conf.level = 0.95)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Y ~ Type, data = circ)

$Type
     diff        lwr       upr     p adj
2-1  11.4   4.463555 18.336445 0.0023656
3-1  -2.4  -9.336445  4.536445 0.6367043
3-2 -13.8 -20.736445 -6.863555 0.0005042

c) To construct a set of orthogonal contrasts, we need to know the null hypothesis. As the question asks us to compare type 2 with the other types, the null hypothesis is:
H0: mu2 = 0.5*(mu1 + mu3)
which is the same as:
H0: mu1 - 2*mu2 + mu3 = 0
This means that one of the contrasts will be:
c = (1, -2, 1)
Now we only need to find a second contrast that is orthogonal to the one we just found. An example is the comparison of type 1 with type 3:
c* = (1, 0, -1)
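As a cross-check (a sketch, using the treatment means computed from the data entered in Question 2a), the contrast estimates can be obtained by hand:

```r
# Treatment means of types 1-3, computed from the response data of part a)
means <- c(mean(c(9, 12, 10, 8, 15)),    # type 1
           mean(c(20, 21, 23, 17, 30)),  # type 2
           mean(c(6, 5, 8, 16, 7)))      # type 3
sum(c(1, -2, 1) * means)   # -25.2, the estimate of the first contrast in d)
sum(c(1, 0, -1) * means)   #   2.4, the estimate of the second contrast in d)
```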
d) From the summary output it can be concluded that the first contrast is significant, as its p-value is smaller than 0.001, so we reject the null hypothesis that type 2 has the same expected response time as the average of types 1 and 3. The second contrast (type 1 vs. type 3) is not significant.
R-code:
library(multcomp)
mat.contr <- rbind(c(1,-2,1), c(1,0,-1))
circ.mc <- glht(circ.fit, linfct = mcp(Type = mat.contr))
summary(circ.mc, test = adjusted("bonferroni"))
Simultaneous Tests for General Linear Hypotheses
Multiple Comparisons of Means: User-defined Contrasts
Fit: aov(formula = Y ~ Type, data = circ)
Linear Hypotheses:
       Estimate Std. Error t value Pr(>|t|)
1 == 0  -25.200      4.503  -5.596 0.000234 ***
2 == 0    2.400      2.600   0.923 0.748310
--Signif. codes:
0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Question 3
a) There seems to be quite a large difference in the fuel consumption depending on the city and on the car type. Types 3 and 5 have a lower fuel consumption overall, with their lowest values in Los Angeles, while types 1, 2 and 4 have a higher overall fuel consumption, with their lowest values in San Francisco. So there seems to be some interaction between these two factors.
R-code:
fuel <- read.table("http://stat.ethz.ch/Teaching/Datasets/WBL/automob.dat",
header=TRUE)
fuel$STADT <- as.factor(fuel$STADT)
plot(KMP4L ~ AUTO, data = fuel, pch=as.numeric(fuel$STADT),
     xlab="type of car", ylab="KMP4L")
legend('topleft', levels(fuel$STADT), pch=1:3)

b) Both the type of car and the city have an influence on the fuel consumption, as their p-values are smaller than 0.05 and the null hypotheses can be rejected. There is also an interaction between the type of car and the city where it was measured, since its p-value is smaller than 0.001, so this null hypothesis is rejected as well.
R-code:
fuel$AUTO <- as.factor(fuel$AUTO)
fit.av <- aov(KMP4L ~ AUTO * STADT, data=fuel)
summary(fit.av)
            Df Sum Sq Mean Sq F value   Pr(>F)
AUTO         4 179.74   44.94  34.108 9.16e-11 ***
STADT        2  19.60    9.80   7.438  0.00238 **
AUTO:STADT   8 244.62   30.58  23.209 7.58e-11 ***
Residuals   30  39.52    1.32
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

c) Based on the Tukey-Anscombe plot the scatter seems quite random, so it can be assumed that the variance is indeed constant. There also seem to be no large systematic errors in the plot, and the smoother follows the trend of the residuals vs. fitted values quite well, so it can be assumed that the expectation of the errors is indeed zero. Based on the QQ plot the residuals follow the normal quantiles quite well, so it can be assumed that the errors are approximately normally distributed.
R-code:
par(mfrow=c(1,2))
plot(fit.av, which = 1:2)

d) The interaction between the factors STADT and AUTO can be seen in the plot
below, especially in the line of San Francisco, as it deviates most from the
other two lines.
R-code:
par(mfrow=c(1,1))
interaction.plot(x.factor = fuel$AUTO, trace.factor = fuel$STADT,
                 response = fuel$KMP4L, legend = FALSE,
                 xlab = 'type of car', ylab = 'fuel consumption in km/4L')
legend('topleft', levels(fuel$STADT), lty = 1:3)

e) As can be seen from the three summaries, the type of car has a significant influence on the fuel consumption in every city, as each p-value is below 0.001 and thus the null hypothesis is rejected in all three cases.
R-code:
fit.1 <- aov(KMP4L ~ AUTO, subset(fuel, STADT=='Los Angeles'))
summary(fit.1)
            Df Sum Sq Mean Sq F value   Pr(>F)
AUTO         4 232.60   58.15   55.72 8.44e-07 ***
Residuals   10  10.44    1.04
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

fit.2 <- aov(KMP4L ~ AUTO, subset(fuel, STADT=='Portland'))
summary(fit.2)
            Df Sum Sq Mean Sq F value   Pr(>F)
AUTO         4  128.0   31.99   14.68 0.000344 ***
Residuals   10   21.8    2.18
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

fit.3 <- aov(KMP4L ~ AUTO, subset(fuel, STADT=='San Francisco'))
summary(fit.3)
            Df Sum Sq Mean Sq F value   Pr(>F)
AUTO         4  63.78  15.946   21.87 6.24e-05 ***
Residuals   10   7.29   0.729
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

f) It can be seen from the summary output that only the type of car has an influence on the fuel consumption, as only this factor has a p-value smaller than 0.001. The influence of the city and of the interaction between city and type of car can be neglected, as their p-values are larger than 0.05 and these null hypotheses cannot be rejected.
R-code:
fit.4 <- aov(KMP4L ~ AUTO * STADT, subset(fuel, STADT != 'San Francisco'))
summary(fit.4)
            Df Sum Sq Mean Sq F value   Pr(>F)
AUTO         4  349.5   87.38  54.218 1.87e-10 ***
STADT        1    0.9    0.93   0.575    0.457
AUTO:STADT   4   11.1    2.77   1.717    0.186
Residuals   20   32.2    1.61
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Question 4
a) From the plot it seems that most of the people have a higher percentage of aggregated platelets after having smoked a cigarette, except for person 10, who seems to have a lower percentage of aggregated platelets after smoking.
R-code:
smoking <- read.table("http://stat.ethz.ch/Teaching/Datasets/WBL/smoking.dat",
header=TRUE)
smoking$PERSON <- as.factor(smoking$PERSON)
smoking$PERIODE <- as.factor(smoking$PERIODE)
par(mfrow=c(1,1))
interaction.plot(x.factor = smoking$PERIODE, trace.factor =
smoking$PERSON, response = smoking$AGGREG, legend = TRUE, xlab =
'Treatment Period', ylab = 'Number of Aggregated Platelets')

To do a two-way analysis of variance, first the formula needs to be determined. As there is only one observation per person and period (before and after), it is not possible to estimate the interaction of the factors PERSON and PERIODE, so a correct formula is:
AGGREG ~ PERIODE + PERSON
fit.av <- aov(AGGREG ~ PERIODE+PERSON, data=smoking)
summary(fit.av)
            Df Sum Sq Mean Sq F value   Pr(>F)
PERIODE      1    580   580.4   18.25  0.00163 **
PERSON      10   5468   546.8   17.19 5.25e-05 ***
Residuals   10    318    31.8
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA table it can be seen that there is indeed a significant difference in the clotting of platelets before and after smoking, as the p-value is smaller than 0.05.
b) The t-test gives the same result as the two-way ANOVA: the p-value (0.0016) is smaller than 0.05, so there is a significant difference.
R-code:
t.test(AGGREG~PERIODE, data=smoking, paired=T)
Paired t-test
data: AGGREG by PERIODE
t = -4.2716, df = 10, p-value = 0.001633
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-15.63114 -4.91431
sample estimates:
mean of the differences
-10.27273

c) The t-test and the two-way ANOVA give the same result; both agree that there is a difference in the percentage of aggregated platelets after smoking. This is no coincidence: for this paired design, the square of the t statistic equals the F statistic of PERIODE, (-4.2716)^2 ≈ 18.25, so the two p-values agree exactly.

Question 5
a) It seems that increasing the degrees of freedom of the denominator makes the maximum of the density curve higher and the tail of the distribution thinner.
R-code:
curve(df(x, df1=3, df2=1), xlim=c(0,5), ylim=c(0,1), lty=1, add=F,
ylab="density")
curve(df(x, df1=3, df2=5), lty=2, add=T)
curve(df(x, df1=3, df2=10), lty=3, add=T)
curve(df(x, df1=3, df2=20), lty=4, add=T)
legend('topright', paste("F", paste(rep(3,4), c(1,5,10,20), sep=",")), lty=1:4)

b) The quantiles for all the distributions are:


> qf(0.95, df1=3, df2=1)
[1] 215.7073
> qf(0.95, df1=3, df2=5)
[1] 5.409451
> qf(0.95, df1=3, df2=10)
[1] 3.708265
> qf(0.95, df1=3, df2=20)
[1] 3.098391

c) The plots for different degrees of freedom in the numerator can be seen below:
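The code for this part is not shown in the original; a sketch mirroring part a), with the numerator degrees of freedom varied and df2 = 20 assumed (chosen to match the quantiles computed in part d)), could look like:

```r
# Densities of F(df1, 20) for several numerator degrees of freedom
# (df2 = 20 is an assumption, matching part d))
curve(df(x, df1=1,  df2=20), xlim=c(0,5), ylim=c(0,1), lty=1, ylab="density")
curve(df(x, df1=5,  df2=20), lty=2, add=TRUE)
curve(df(x, df1=10, df2=20), lty=3, add=TRUE)
curve(df(x, df1=20, df2=20), lty=4, add=TRUE)
legend('topright', paste("F", paste(c(1,5,10,20), 20, sep=",")), lty=1:4)
```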

d) The quantiles for all the distributions are:


> qf(0.95, df1=1, df2=20)
[1] 4.351244
> qf(0.95, df1=5, df2=20)
[1] 2.71089
> qf(0.95, df1=10, df2=20)
[1] 2.347878
> qf(0.95, df1=20, df2=20)
[1] 2.124155

e) For every distribution, pf(2.37, ...) returns the lower-tail probability P(F <= 2.37); the p-value for an observed F value of 2.37 is 1 minus this probability:
> pf(2.37, df1=1, df2=20)
[1] 0.8606402
> pf(2.37, df1=5, df2=20)
[1] 0.9235699
> pf(2.37, df1=10, df2=20)
[1] 0.951802
> pf(2.37, df1=20, df2=20)
[1] 0.9697754
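Since pf gives the lower-tail probability, the actual p-values P(F > 2.37) are obtained with lower.tail = FALSE; a sketch (expected values are the complements of the outputs above):

```r
# Upper-tail p-values P(F > 2.37) for the same distributions
pf(2.37, df1=1,  df2=20, lower.tail=FALSE)   # 0.1393598
pf(2.37, df1=5,  df2=20, lower.tail=FALSE)   # 0.0764301
pf(2.37, df1=10, df2=20, lower.tail=FALSE)   # 0.0481980
pf(2.37, df1=20, df2=20, lower.tail=FALSE)   # 0.0302246
```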

f) The degrees of freedom of the numerator are determined by the number of levels of a factor (number of levels minus one). The degrees of freedom of the denominator are determined by the number of observations minus the number of estimated parameters (the residual degrees of freedom).
