
R Workshop: Hult International Business School, Prof. Rolleigh

Linear Regression with R


Refer to the Roller Coaster dataset available on Canvas. Assuming you have attached the dataset,
the commands below perform statistical analysis on the Roller Coaster dataset. The basic function
that computes the estimated slope and intercept of a regression line (and much more) is lm(). To get
the estimated intercept and slope of the least squares regression line that relates the top speed of a
coaster to its height, use
> lm(Speed~Height)

Call:
lm(formula = Speed ~ Height)

Coefficients:
(Intercept) Height
39.5093 0.1715

You get much more information when you save the result of the call in an “object” (which I named fit
below) and pass that object to summary():


> fit = lm(Speed~Height, data=coasters)
> summary(fit)
Residuals:
Min 1Q Median 3Q Max
-14.629 -3.722 -1.004 3.522 17.344

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.509313 1.537364 25.70 <2e-16 ***
Height 0.171485 0.008743 19.61 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.884 on 78 degrees of freedom


Multiple R-squared: 0.8314, Adjusted R-squared: 0.8293
F-statistic: 384.7 on 1 and 78 DF, p-value: < 2.2e-16

Interpretation of slope: For every 100-foot increase in the height of a coaster, the top speed increases by
about 17 mph, on average.

You don’t really need data=coasters in the lm() call because the
coasters dataset is attached, but it’s good practice. In advanced work, try to
avoid attaching datasets, since doing so only clutters your workspace. Instead, use
the data= argument to specify which dataset contains the variables.
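
As a small illustration of working without attach(), here is a sketch that uses cars, a dataset that ships with R, as a stand-in for coasters (the stand-in and the name fit_cars are assumptions made only for this example):

```r
# 'cars' ships with R; it stands in for the coasters data here (an assumption)
fit_cars <- lm(dist ~ speed, data = cars)   # dist and speed are looked up in cars
coef(fit_cars)

# with() evaluates an expression inside a data frame, again without attach()
r_cars <- with(cars, cor(speed, dist))
r_cars
```

Nothing is added to the search path, so the variable names speed and dist cannot collide with objects in your workspace.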

To superimpose the fitted regression line on the scatterplot (see plot), use the
abline() command:

> plot(Speed~Height, data=coasters)
> abline(fit, col="red", lwd=2)

Also, note that the object fit contains a lot of information. Type
> names(fit)
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
to get a list of all the components of the object. You can access those components by the name of the
object followed by a dollar sign followed by the name of the component: object$component. So, for
instance, to retrieve the estimated coefficients for intercept and slope, you can type
> fit$coefficients
(Intercept) Height
39.5093129 0.1714846
but

> coefficients(fit)
also works.

To get the fitted values for each of the height values (the x-variable) in your data set, type
> fit$fitted
1 2 3 4 5 6 7
49.11245 72.94881 61.28785 68.66169 66.43239 57.51519 52.88511
8 9 10 11 12 13 14
60.43043 51.51323 55.62886 65.91794 55.28589 48.94096 60.94489
...
78 79 80
60.08746 69.51911 74.66365
but fitted(fit) also works. These are the 80 values for the estimated speed of a coaster based
on the linear regression model.

Confidence interval and hypothesis test (testing if the slope is significantly different from zero, i.e., if
there is a linear relationship at all) for the slope (for the last week of classes)

To test the null hypothesis H0: beta=0 against the alternative hypothesis HA: beta≠0, we simply refer to
the t-statistic displayed for Height. How is it computed? It’s simply the estimated slope (0.1715) divided
by the standard error (0.0087), which results in t=19.61. Is this an extreme value if the null hypothesis
were true? We have to judge the value relative to a t-distribution with df=78 (the sample size – 2).
Certainly, the t value is extreme, with a p-value of less than 2 x 10^(-16) (see output above). Hence, we have
sufficient evidence to reject the null hypothesis and conclude the true slope is significantly different
from zero.
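
The t-statistic and its p-value can be reproduced by hand. The sketch below copies the estimate, standard error, and degrees of freedom from the summary(fit) output; the variable names are chosen only for this illustration:

```r
# Values copied from the summary(fit) output above
estimate  <- 0.171485   # estimated slope for Height
std_error <- 0.008743   # its standard error
df_resid  <- 78         # residual degrees of freedom (sample size - 2)

t_value <- estimate / std_error

# Two-sided p-value: chance of a t at least this extreme if H0 were true
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

round(t_value, 2)   # 19.61
p_value < 2e-16     # TRUE
```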
As always, a confidence interval is more informative. To find a confidence interval for the slope (i.e., the
effect of Height), simply type

> confint(fit)
2.5 % 97.5 %
(Intercept) 36.4486558 42.5699699
Height 0.1540783 0.1888909

and read off the CI for Height: [0.154, 0.189]. This means we are 95% confident that a one-unit
(one-foot) increase in height is associated with an increase of at least 0.154 mph and at most 0.189 mph
in the top speed of a roller coaster, on average. Since a one-foot increase is pretty meaningless given the
height values, we look at a 100-foot increase: We are 95% confident that for a 100-foot increase in the
height of a coaster, the top speed increases by at least 15.4 mph and at most 18.9 mph, on average.
(Clearly, zero is not contained in the interval, indicating a statistically significant effect of height on top
speed.)
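
For reference, the interval reported by confint() is the estimate plus or minus a t critical value times the standard error. A minimal sketch, with the numbers copied from the summary(fit) output and variable names chosen only for illustration:

```r
# Values copied from the summary(fit) output above
estimate  <- 0.171485   # estimated slope for Height
std_error <- 0.008743   # its standard error
df_resid  <- 78         # residual degrees of freedom

t_crit <- qt(0.975, df = df_resid)               # critical value for 95% confidence
ci <- estimate + c(-1, 1) * t_crit * std_error   # lower and upper limit

round(ci, 3)         # 0.154 0.189 (per foot)
round(100 * ci, 1)   # 15.4 18.9  (per 100 feet)
```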

Checking Assumptions, Residual Plots:


You can get the (raw) residuals through
> resid = residuals(fit)
> head(resid)
1 2 3 4 5 6
-4.112449 -5.948806 4.712145 4.338308 -1.432392 7.484806
> tail(resid)
75 76 77 78 79 80
2.428951 -3.314802 -1.006621 5.912537 6.480885 7.336348

but it is much better to use studentized (or standardized) residuals in any residual plot. You get
studentized residuals through

> resid.stud = rstudent(fit)
> head(resid.stud)
1 2 3 4 5 6
-0.7095541 -1.0191460 0.8050043 0.7399164 -0.2434978 1.2897406
> tail(resid.stud)
75 76 77 78 79 80
0.4159708 -0.5668076 -0.1715293 1.0130954 1.1103849 1.2624853

The obligatory residual plot, plotting studentized residuals
versus the x variable is obtained through

> plot(resid.stud~Height, data=coasters,
       main="Residual Plot", ylab="Studentized Residuals")
> abline(h=0,col="red")

You can also plot the studentized residuals versus the fitted
values fitted(fit), which will result in the exact same
picture, but a different scaling on the x-axis.
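
The two residual plots look the same because, with a single predictor, the fitted values are just a linear function of x. The sketch below illustrates this with cars, a dataset that ships with R, standing in for coasters (an assumption made only for this example; the slope there is positive, as it is for the coaster data):

```r
# 'cars' ships with R; it stands in for the coasters data here (an assumption)
fit_cars <- lm(dist ~ speed, data = cars)
res_stud <- rstudent(fit_cars)

op <- par(mfrow = c(1, 2))          # two plots side by side
plot(cars$speed, res_stud, xlab = "speed",
     ylab = "Studentized Residuals", main = "vs. predictor")
abline(h = 0, col = "red")
plot(fitted(fit_cars), res_stud, xlab = "fitted values",
     ylab = "Studentized Residuals", main = "vs. fitted values")
abline(h = 0, col = "red")
par(op)                             # restore the previous plot settings

# Fitted values are an exact increasing linear function of the predictor
same_picture <- isTRUE(all.equal(cor(cars$speed, fitted(fit_cars)), 1))
same_picture                        # TRUE
```

With more than one predictor the two plots are no longer equivalent, so plotting against fitted values is the more general habit.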
If you want to identify which residuals are troublesome, then
use the identify() function. I.e.,
> identify(resid.stud~Height)
[1] 41

> coasters[41,]
Name Park State Country Duration Speed Height Drop
41 Oblivion Alton Towers Alton England NA 68 65 180

This observation has a large positive residual, meaning we underestimated its top speed based on its
height. If you check on the internet (http://en.wikipedia.org/wiki/Oblivion_%28roller_coaster%29), you
will find that this coaster (Oblivion) is basically just a free fall, explaining why it reaches a higher top
speed than coasters of similar height.

You also want to look at a histogram and QQ-plot of the studentized residuals:

> hist(resid.stud)
> qqnorm(resid.stud)
> qqline(resid.stud, col="red")

Estimation and Prediction:


To estimate the expected top speed of a roller coaster that has a height of 191 feet, use
> predict(fit, newdata=list(Height=191))
1
72.26287

which is essentially just


> 39.51 + 0.1715*191
[1] 72.2665
Interpretation: We expect the top speed of a roller coaster with a height of 191 feet to be 72mph.
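
The small difference between 72.26287 and 72.2665 comes only from rounding the coefficients. A short sketch using the unrounded values printed by fit$coefficients earlier (the vector name b is chosen only for this illustration):

```r
# Unrounded coefficients copied from the fit$coefficients output above
b <- c(intercept = 39.5093129, slope = 0.1714846)

pred_191 <- b[["intercept"]] + b[["slope"]] * 191
pred_191   # 72.26287, matching predict()
```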

To get a confidence interval for this estimate, use


> predict(fit,newdata=list(Height=191), interval="confidence")
fit lwr upr
1 72.26287 70.83917 73.68657
Interpretation: We are 95% confident that the average top speed for roller coasters with a height of 191
feet is at least 70.8 mph and at most 73.7 mph. Or: We expect the average top speed of coasters with a
height of 191 feet to be between 70.8 and 73.7 mph, with 95% confidence. Note that this interval is for
the mean top speed, not for the top speed of an individual coaster.

Plot Confidence Limits:


> height.new=seq(min(Height),max(Height),length=100)
> pred=predict(fit,newdata=list(Height=height.new),
interval="confidence")
> head(pred)
fit lwr upr
1 48.94096 46.70769 51.17424
2 49.57321 47.39159 51.75482
3 50.20545 48.07481 52.33609
4 50.83769 48.75729 52.91808
5 51.46993 49.43899 53.50086
6 52.10217 50.11985 54.08449
> plot(Speed~Height)
> lines(pred[,1]~height.new,col="red") #plot fitted line
> lines(pred[,2]~height.new,col="red",lty=2) #plot lower bound on E[Y]
> lines(pred[,3]~height.new,col="red",lty=2) #upper bound

One can also add prediction intervals for predicting the top speed (Y):
> pred1=predict(fit,newdata=list(Height=height.new), interval="prediction")
> head(pred1)
fit lwr upr
1 48.94096 37.01603 60.86590
2 49.57321 37.65784 61.48857
3 50.20545 38.29931 62.11159
4 50.83769 38.94044 62.73494
5 51.46993 39.58123 63.35863
6 52.10217 40.22168 63.98266
> lines(pred1[,2]~height.new, col="blue", lty=3) #lower prediction limit
> lines(pred1[,3]~height.new, col="blue", lty=3) #upper prediction limit
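
Prediction limits are always wider than confidence limits, because they account for the scatter of individual observations around the line in addition to the uncertainty in the line itself. A minimal sketch of the comparison, using cars (a dataset that ships with R) as a stand-in for coasters; the stand-in, the chosen speeds, and the object names are assumptions made only for this example:

```r
# 'cars' ships with R; it stands in for the coasters data here (an assumption)
fit_cars <- lm(dist ~ speed, data = cars)
new_x <- data.frame(speed = c(10, 15, 20))   # illustrative x values

conf_band <- predict(fit_cars, newdata = new_x, interval = "confidence")
pred_band <- predict(fit_cars, newdata = new_x, interval = "prediction")

conf_width <- conf_band[, "upr"] - conf_band[, "lwr"]
pred_width <- pred_band[, "upr"] - pred_band[, "lwr"]

cbind(new_x, conf_width, pred_width)
all(pred_width > conf_width)   # TRUE
```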