
Practical 4

Oskar Hollinsworth
13 February 2018

Example sheet 2

Question 5

BrainSizeLM2 <- lm(PIQ ~ MRI_Count + Height)

library(ellipse)  # needed for the confidence-ellipse plot below
plot(ellipse(BrainSizeLM2, c(2, 3)), type = "l")
cints <- confint(BrainSizeLM2, level=0.95)
hint <- cints["Height", ]
mricint <- cints["MRI_Count", ]
points(mricint, hint, col="red")
points(mricint, rev(hint), col="red")
cints <- confint(BrainSizeLM2, level=0.975)  # Bonferroni: two 97.5% intervals give joint coverage of at least 95%
hint <- cints["Height", ]
mricint <- cints["MRI_Count", ]
abline(h=hint, col="blue")
abline(v=mricint, col="blue")
[Figure: 95% joint confidence ellipse for the MRI_Count and Height coefficients, with the marginal confidence intervals (blue lines) and their corner points (red) overlaid. Height on the vertical axis, MRI_Count on the horizontal.]
The correlation between the coefficient estimates is equal in magnitude, but opposite in sign, to the correlation between the two predictors.
summary(BrainSizeLM2, correlation = TRUE)$correlation

##             (Intercept)  MRI_Count     Height
## (Intercept)   1.0000000 -0.1715599 -0.6943382
## MRI_Count    -0.1715599  1.0000000 -0.5883772
## Height       -0.6943382 -0.5883772  1.0000000

cor(Height, MRI_Count)

## [1] 0.5883772
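As a quick check of this sign relation (a simulated sketch, not part of the example sheet): with two centred predictors, the correlation between the slope estimates, read off (X'X)^-1, is exactly minus the correlation between the predictors.

```r
# Simulated check: centre two correlated predictors and compare the
# correlation of the slope estimates with that of the predictors.
set.seed(42)
x1 <- c(scale(rnorm(100)))
x2 <- c(scale(0.6 * x1 + rnorm(100)))
X <- cbind(Intercept = 1, x1 = x1, x2 = x2)
V <- solve(t(X) %*% X)     # proportional to Cov(beta-hat)
cov2cor(V)["x1", "x2"]     # equals -cor(x1, x2)
cor(x1, x2)
```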

Question 6

We see that Knock Hill has a very long time despite a very small distance and climb. This is almost certainly a data-entry error, with the recorded time an hour too long, so we subtract an hour from this point.
pairs(hills)
[Figure: pairs(hills) scatterplot matrix of dist, climb and time.]

hills[(hills$time > 50) & (hills$dist < 10), ]

##            dist climb   time
## Ben Lomond  8.0  3070 62.267
## Goatfell    8.0  2866 73.217
## Lomonds     9.5  2200 65.000
## Knock Hill  3.0   350 78.650
## Criffel     6.5  1750 50.500
hls <- hills
hls["Knock Hill", "time"] <- hls["Knock Hill", "time"] - 60

The data are bunched near the origin, so taking logarithms could help to linearise the relationship. When doing so, we should still include an intercept: if y = y(0)x^p, then log y = log(y(0)) + p log(x), so the intercept estimates log(y(0)), the value of log y when x = 1.
hlslm1 <- lm(time ~ dist + climb, data=hls)
summary(hlslm1)

##
## Call:

## lm(formula = time ~ dist + climb, data = hls)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.632 -4.934 1.007 4.541 27.903
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.94198 2.58005 -5.016 1.90e-05 ***
## dist 6.34556 0.36047 17.604 < 2e-16 ***
## climb 0.01175 0.00123 9.555 6.83e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.8 on 32 degrees of freedom
## Multiple R-squared: 0.9712, Adjusted R-squared: 0.9694
## F-statistic: 540.2 on 2 and 32 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(hlslm1)
[Figure: diagnostic plots for hlslm1 (Residuals vs Fitted, Normal Q−Q, Scale−Location, Residuals vs Leverage). Bens of Jura is flagged in every panel; Ben Nevis, Cairngorm, Moffat Chase and Lairig Ghru are also labelled.]

par(mfrow=c(1,1))

hlslm2 <- lm(log(time) ~ log(dist) + log(climb), data=hls)


summary(hlslm2)

##
## Call:
## lm(formula = log(time) ~ log(dist) + log(climb), data = hls)
##

## Residuals:
## Min 1Q Median 3Q Max
## -0.52624 -0.06273 0.00452 0.06846 0.31384
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.29359 0.27312 1.075 0.29
## log(dist) 0.91141 0.06534 13.949 3.76e-15 ***
## log(climb) 0.24889 0.04761 5.228 1.02e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1607 on 32 degrees of freedom
## Multiple R-squared: 0.9521, Adjusted R-squared: 0.9491
## F-statistic: 317.8 on 2 and 32 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(hlslm2)

[Figure: diagnostic plots for hlslm2 (Residuals vs Fitted, Normal Q−Q, Scale−Location, Residuals vs Leverage). Black Hill, Cairn Table, Bens of Jura and Cow Hill are labelled; no point exceeds the Cook's distance contours as dramatically as before.]

par(mfrow=c(1,1))

Overall, I prefer the logarithmic model, as its residuals-vs-fitted plot looks much better. The following is a 95% prediction interval for the record time of a hypothetical race, on the log(time) scale.
hyporace <- data.frame("dist"=5.3, "climb"=1100)
predict(hlslm2, hyporace, level=0.95, interval="prediction")

##        fit      lwr      upr
## 1 3.556545 3.224142 3.888948

Question 9

Performing a one-sided or two-sided t-test on the externally studentised residuals gives a highly significant p-value. This suggests that the human brain-to-body size ratio is indeed an outlier amongst mammals. In this case I think a one-sided test is more appropriate, as we are clearly seeking to show that the human brain is unusually large.
lmMamm <- lm(log(brain) ~ log(body), data=mammals)
n <- nrow(mammals)
p <- 1
eta <- rstudent(lmMamm)["Human"]
pval1 <- 1 - pt(eta, n-p-1)
pval2<- 1 - pf(eta^2, 1, n-p-1)
cat("One-sided: ", pval1, " Two sided: ", pval2)

## One-sided: 0.001766988 Two sided: 0.003533976


Trinity is known both for its wine budget and for topping the Tompkins table, so I think the two-sided test is more appropriate here. Either way, Trinity is an outlier. We really do get a lot of firsts.
file_path <- "http://www.statslab.cam.ac.uk/~sb2116/statistical_modelling/data/"
# row.names=1 (assuming the college names are in the first column) so that
# residuals can be indexed by college name; without row names,
# rstudent(lmClg)["Trinity"] returns NA.
Colleges <- read.csv(paste0(file_path, "Colleges.csv"), row.names=1)
attach(Colleges)
lmClg <- lm(PercFirsts ~ log(WineBudget), data=Colleges)
n <- nrow(Colleges)  # recompute n for this data set rather than reusing it from mammals
eta <- rstudent(lmClg)["Trinity"]
pval1 <- 1 - pt(eta, n-p-1)
pval2 <- 1 - pf(eta^2, 1, n-p-1)
cat("One-sided: ", pval1, " Two sided: ", pval2)


The graph shows that Trinity is in fact the most extreme outlier.
plot(log(WineBudget), PercFirsts, ylim=c(10, 45))
text(log(WineBudget), PercFirsts, rownames(Colleges), cex=0.6, pos=3)
abline(lmClg)

[Figure: PercFirsts against log(WineBudget) with the fitted line and a label for each college; Trinity lies well above the line.]
If I choose to test a college because it already looks like an outlier, then that college is no longer a random data point: it was selected precisely for its extreme position on the graph, so it is far more likely than a random point to have an extreme studentised residual. The naive p-value is therefore an underestimate.
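A small simulation sketch (my addition, under an artificial null model with no true outliers) illustrates the point: always testing the most extreme studentised residual rejects far more often than the nominal 5%.

```r
# Null simulation: no outliers, yet the naive p-value of the most
# extreme studentised residual rejects far too often at the 5% level.
set.seed(1)
n <- 25
minp <- replicate(2000, {
  x <- rnorm(n)
  y <- rnorm(n)                        # null: no relationship, no outliers
  tres <- rstudent(lm(y ~ x))
  min(2 * pt(-abs(tres), df = n - 3))  # two-sided p-value of the worst point
})
mean(minp < 0.05)                      # much larger than 0.05
```

A Bonferroni correction, multiplying the smallest p-value by n, is one simple remedy.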
