
Practical 4

Oskar Hollinsworth
13 February 2018

Example sheet 2

Question 5

BrainSizeLM2 <- lm(PIQ ~ MRI_Count + Height)

library(ellipse)  # needed for the confidence-ellipse plot below
plot(ellipse(BrainSizeLM2, c(2, 3)), type = "l")
cints <- confint(BrainSizeLM2, level=0.95)
hint <- cints["Height", ]
mricint <- cints["MRI_Count", ]
points(mricint, hint, col="red")
points(mricint, rev(hint), col="red")
cints <- confint(BrainSizeLM2, level=0.975)  # Bonferroni: two 97.5% intervals give joint coverage of at least 95%
hint <- cints["Height", ]
mricint <- cints["MRI_Count", ]
abline(h=hint, col="blue")
abline(v=mricint, col="blue")
[Figure: 95% joint confidence ellipse for the MRI_Count and Height coefficients, with the marginal confidence intervals (blue lines) and their corner points (red) overlaid. Height on the vertical axis, MRI_Count on the horizontal.]
The correlation between the coefficient estimates is equal in magnitude, but opposite in sign, to the correlation between the two predictors.
summary(BrainSizeLM2, correlation = TRUE)$correlation

##             (Intercept)  MRI_Count     Height
## (Intercept)   1.0000000 -0.1715599 -0.6943382
## MRI_Count    -0.1715599  1.0000000 -0.5883772
## Height       -0.6943382 -0.5883772  1.0000000

cor(Height, MRI_Count)

## [1] 0.5883772
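As a quick check of this sign relation (a simulated sketch, not part of the example sheet): with two centred predictors, the correlation between the slope estimates, read off (X'X)^-1, is exactly minus the correlation between the predictors.

```r
# Simulated check: centre two correlated predictors and compare the
# correlation of the slope estimates with that of the predictors.
set.seed(42)
x1 <- c(scale(rnorm(100)))
x2 <- c(scale(0.6 * x1 + rnorm(100)))
X <- cbind(Intercept = 1, x1 = x1, x2 = x2)
V <- solve(t(X) %*% X)     # proportional to Cov(beta-hat)
cov2cor(V)["x1", "x2"]     # equals -cor(x1, x2)
cor(x1, x2)
```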

Question 6

We see that Knock Hill has a very long time despite a very small distance and climb. This is almost certainly a data-entry error, with the recorded time an hour too long, so we subtract an hour from this point.
pairs(hills)
[Figure: pairs(hills) scatterplot matrix of dist, climb and time.]

hills[(hills$time > 50) & (hills$dist < 10), ]

##            dist climb   time
## Ben Lomond  8.0  3070 62.267
## Goatfell    8.0  2866 73.217
## Lomonds     9.5  2200 65.000
## Knock Hill  3.0   350 78.650
## Criffel     6.5  1750 50.500
hls <- hills
hls["Knock Hill", "time"] <- hls["Knock Hill", "time"] - 60

The data are bunched near the origin, so taking logarithms could help to linearise the relationship. When doing so, we should still include an intercept: if y = y(0)x^p, then log y = log(y(0)) + p log(x), so the intercept estimates log(y(0)), the value of log y when x = 1.
hlslm1 <- lm(time ~ dist + climb, data=hls)
summary(hlslm1)

##
## Call:

## lm(formula = time ~ dist + climb, data = hls)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.632 -4.934 1.007 4.541 27.903
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.94198 2.58005 -5.016 1.90e-05 ***
## dist 6.34556 0.36047 17.604 < 2e-16 ***
## climb 0.01175 0.00123 9.555 6.83e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.8 on 32 degrees of freedom
## Multiple R-squared: 0.9712, Adjusted R-squared: 0.9694
## F-statistic: 540.2 on 2 and 32 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(hlslm1)
[Figure: diagnostic plots for hlslm1 (Residuals vs Fitted, Normal Q−Q, Scale−Location, Residuals vs Leverage). Bens of Jura is flagged in every panel; Ben Nevis, Cairngorm, Moffat Chase and Lairig Ghru are also labelled.]

par(mfrow=c(1,1))

hlslm2 <- lm(log(time) ~ log(dist) + log(climb), data=hls)


summary(hlslm2)

##
## Call:
## lm(formula = log(time) ~ log(dist) + log(climb), data = hls)
##

## Residuals:
## Min 1Q Median 3Q Max
## -0.52624 -0.06273 0.00452 0.06846 0.31384
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.29359 0.27312 1.075 0.29
## log(dist) 0.91141 0.06534 13.949 3.76e-15 ***
## log(climb) 0.24889 0.04761 5.228 1.02e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1607 on 32 degrees of freedom
## Multiple R-squared: 0.9521, Adjusted R-squared: 0.9491
## F-statistic: 317.8 on 2 and 32 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(hlslm2)

[Figure: diagnostic plots for hlslm2 (Residuals vs Fitted, Normal Q−Q, Scale−Location, Residuals vs Leverage). Black Hill, Cairn Table, Bens of Jura and Cow Hill are labelled; no point exceeds the Cook's distance contours as dramatically as before.]

par(mfrow=c(1,1))

Overall, I prefer the logarithmic model, as its residuals-vs-fitted plot looks much better. The following is a 95% prediction interval for the record time of a hypothetical race, on the log(time) scale.
hyporace <- data.frame("dist"=5.3, "climb"=1100)
predict(hlslm2, hyporace, level=0.95, interval="prediction")

##        fit      lwr      upr
## 1 3.556545 3.224142 3.888948

Question 9

Performing a one-sided or two-sided t-test on the externally studentised residuals gives a highly significant p-value. This suggests that the human brain-to-body size ratio is indeed an outlier amongst mammals. In this case I think a one-sided test is more appropriate, as we are clearly seeking to show that the human brain is unusually large.
lmMamm <- lm(log(brain) ~ log(body), data=mammals)
n <- nrow(mammals)
p <- 1
eta <- rstudent(lmMamm)["Human"]
pval1 <- 1 - pt(eta, n-p-1)
pval2<- 1 - pf(eta^2, 1, n-p-1)
cat("One-sided: ", pval1, " Two sided: ", pval2)

## One-sided: 0.001766988 Two sided: 0.003533976


Trinity is known both for its wine budget and for topping the Tompkins table, so I think the two-sided test is more appropriate here. Either way, Trinity is an outlier. We really do get a lot of firsts.
file_path <- "http://www.statslab.cam.ac.uk/~sb2116/statistical_modelling/data/"
# row.names=1 (assuming the college names are in the first column) so that
# residuals can be indexed by college name; without row names,
# rstudent(lmClg)["Trinity"] returns NA.
Colleges <- read.csv(paste0(file_path, "Colleges.csv"), row.names=1)
attach(Colleges)
lmClg <- lm(PercFirsts ~ log(WineBudget), data=Colleges)
n <- nrow(Colleges)  # recompute n for this data set rather than reusing it from mammals
eta <- rstudent(lmClg)["Trinity"]
pval1 <- 1 - pt(eta, n-p-1)
pval2 <- 1 - pf(eta^2, 1, n-p-1)
cat("One-sided: ", pval1, " Two sided: ", pval2)


The graph shows that Trinity is in fact the most extreme outlier.
plot(log(WineBudget), PercFirsts, ylim=c(10, 45))
text(log(WineBudget), PercFirsts, rownames(Colleges), cex=0.6, pos=3)
abline(lmClg)

[Figure: PercFirsts against log(WineBudget) with the fitted line and a label for each college; Trinity lies well above the line.]
If I choose to test a college because it already looks like an outlier, then that college is no longer a random data point: it was selected precisely for its extreme position on the graph, so it is far more likely than a random point to have an extreme studentised residual. The naive p-value is therefore an underestimate.
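A small simulation sketch (my addition, under an artificial null model with no true outliers) illustrates the point: always testing the most extreme studentised residual rejects far more often than the nominal 5%.

```r
# Null simulation: no outliers, yet the naive p-value of the most
# extreme studentised residual rejects far too often at the 5% level.
set.seed(1)
n <- 25
minp <- replicate(2000, {
  x <- rnorm(n)
  y <- rnorm(n)                        # null: no relationship, no outliers
  tres <- rstudent(lm(y ~ x))
  min(2 * pt(-abs(tres), df = n - 3))  # two-sided p-value of the worst point
})
mean(minp < 0.05)                      # much larger than 0.05
```

A Bonferroni correction, multiplying the smallest p-value by n, is one simple remedy.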
