STAT443 Assignment # 1 Solution Winter 2017 Instructor: S. Chenouri

Due: January, 26, 2017

You may work in pairs if you choose; both names and ID numbers should appear on it, and both will receive
the same mark. (No extra credit will be given for working alone.)
For any parts involving R, you should hand in the R code and output, as well as your interpretations of the
output. You will NOT receive marks for uncommented R code or output. You must submit your assignment
through CrowdMark, one per pair.

Problem 1. In an R session you can load a dataset, called name, using data(name). Load the time series
objects Nile, UKgas, co2, nhtemp, and JohnsonJohnson. For each time series, comment briefly on the
following aspects, justifying your comments where possible. [25 marks]

a) What is the period of the time series? [5 marks]

b) Is there a seasonal effect and, if so, is it additive or multiplicative? [5 marks]

c) What can you say about the level and trend? [5 marks]

d) Do you think that there are any change points? [5 marks]

e) Are the time series stationary? [5 marks]

You may use the R function decompose() for exploring additive and multiplicative forms.
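As a quick first pass at part a), the period of every series can be read off with frequency(). A minimal sketch (the get() lookup by name is just a convenience, not part of the required solution):

for (nm in c("Nile", "UKgas", "co2", "nhtemp", "JohnsonJohnson")) {
  # frequency() gives observations per unit of time: 1 = yearly, 4 = quarterly, 12 = monthly
  cat(nm, ": frequency =", frequency(get(nm)), "\n")
}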

data(Nile)
frequency(Nile)
[1] 1
plot(Nile,ylab="Annual flow", xlab="year",main="Annual flow of the river Nile at Aswan")

[Figure: Annual flow of the river Nile at Aswan — annual flow (600-1400) against year, 1880-1960]
For the Nile time series, the frequency is 1, and therefore the period is yearly. There is no obvious seasonal
component. There seems to be a downward trend prior to 1910, which flattens out after 1920. In addition,
the variability seems to be greater prior to 1920, signalling a possible change in variability around 1920. The
mean of the process also seems to have changed after 1920. From all of this we conclude that the process
does not look like a stationary process.
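A rough numerical check of the suspected change point (splitting at 1920 is our visual guess, not a formal changepoint test):

pre1920  <- window(Nile, end = 1919)    # flows up to and including 1919
post1920 <- window(Nile, start = 1920)  # flows from 1920 onwards
c(mean(pre1920), mean(post1920))  # the level appears to drop after 1920
c(sd(pre1920), sd(post1920))      # the spread appears to shrink as well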

data(UKgas)
frequency(UKgas)
[1] 4

plot(UKgas,ylab="Quarterly UK gas Consumption", xlab="year",main="Quarterly UK gas Consumption from 1960Q1 to 1986Q4")


plot(decompose(UKgas,type="additive"))
plot(decompose(UKgas,type="multiplicative"))

[Figure: Quarterly UK gas consumption from 1960Q1 to 1986Q4 — consumption (200-1200) against year]

[Figure: Decomposition of additive time series — observed, trend, seasonal and random panels, 1960-1985]

[Figure: Decomposition of multiplicative time series — observed, trend, seasonal and random panels, 1960-1985]

For the UKgas dataset, the frequency is 4, and therefore the period is quarterly, so a seasonal component is
present. There is an obvious upward trend in the data, but it is not easy to decide whether the components
combine additively or multiplicatively, so we have depicted both decompositions in the figures. They seem
quite similar, except that there is a big spike right after 1970 in the random part of the multiplicative case.
Because of the presence of trend and seasonality, both the mean and variance functions change with time,
and therefore we conclude that the underlying process is not stationary. After removing the seasonality
and trend, the random component of the series still shows two change points in terms of variability, the
first around 1970 and the second around 1978, so the random part is still not stationary.
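A standard way to choose between the two forms is a log transform: if the seasonal amplitude grows with the level (a multiplicative pattern), the logged series should decompose well additively. A hedged sketch:

# If log(UKgas) shows a roughly constant seasonal amplitude, the original
# series is better described as multiplicative.
plot(decompose(log(UKgas), type = "additive"))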
data(co2)
frequency(co2)
[1] 12

plot(co2,ylab="Atmospheric concentrations of CO2", xlab="year",main="Atmospheric concentrations of CO2 in parts per million")


plot(decompose(co2,type="additive"))

[Figure: Atmospheric concentrations of CO2 in parts per million — concentration (320-360) against year, 1960-1990]

[Figure: Decomposition of additive time series — observed, trend, seasonal and random panels, 1960-1990]

For the co2 dataset, the frequency is 12, and therefore the period is monthly. Obviously, a seasonal component
is present. There is an obvious upward, roughly linear trend in the data. The trend and seasonality seem to be
additive. Because of the presence of trend and seasonality, both the mean and variance functions change with
time, and therefore we conclude that the underlying process is not stationary. After removing the seasonality
and trend, the random component of the series shows at least two change points in terms of variability, the
first around 1975 and the second around 1980, so the random part is still not stationary.
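The additive seasonal effect can also be inspected numerically; decompose() stores one estimated effect per month in its $figure component:

# Estimated monthly seasonal effects (in ppm); they sum to roughly zero.
round(decompose(co2, type = "additive")$figure, 2)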
data(nhtemp)
frequency(nhtemp)
[1] 1

plot(nhtemp,ylab="Average annual temperature", xlab="year",main="Average annual temperature in degrees Fahrenheit in New Haven, from 1912 to 1971")
#plot(decompose(nhtemp,type="additive"))

[Figure: Average annual temperature in degrees Fahrenheit in New Haven, 1912 to 1971 — temperature (48-54) against year]
For the nhtemp time series, the frequency is 1, and therefore the period is yearly. There is no obvious seasonal
component, though there seems to be an upward trend. There is no obvious change point. Because of the
upward trend, the mean function is not constant, signalling that the process is not stationary.
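The apparent upward trend can be given a rough magnitude with a simple linear fit (an informal description only; we are not claiming the trend is truly linear):

trendfit <- lm(nhtemp ~ time(nhtemp))  # regress temperature on calendar year
coef(trendfit)                         # slope = approximate warming in degrees F per year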
data(JohnsonJohnson)
frequency(JohnsonJohnson)
[1] 4
plot(JohnsonJohnson,ylab="Quarterly earnings (dollars) per Johnson & Johnson", xlab="year",main="Quarterly earnings (dollars) per Johnson & Johnson")
plot(decompose(JohnsonJohnson,type="additive"))
plot(decompose(JohnsonJohnson,type="multiplicative"))

[Figure: Quarterly earnings (dollars) per Johnson & Johnson share — earnings (0-15) against year, 1960-1980]

[Figure: Decomposition of additive time series — observed, trend, seasonal and random panels, 1960-1980]

[Figure: Decomposition of multiplicative time series — observed, trend, seasonal and random panels, 1960-1980]
For the JohnsonJohnson dataset, the frequency is 4, and therefore the period is quarterly, so a seasonal
component is present. There is an obvious upward trend in the data, but it is not easy to decide whether
the components combine additively or multiplicatively, so we have depicted both decompositions in the
figures. They seem quite similar, except that there is a big spike right before 1980 in the random part of
the additive case. Because of the presence of trend and seasonality, both the mean and variance functions
change with time, and therefore we conclude that the underlying process is not stationary. After removing
the seasonality and trend, the random component of the series still shows two change points in terms of
variability, the first around 1972 and the second around 1978, so the random part is still not stationary.
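The JohnsonJohnson series is a classic example where the seasonal swing grows with the level. A hedged sketch of a common remedy, differencing the logged series to get approximate quarterly growth rates:

# diff(log(x)) stabilizes the variance and removes the trend at the same time.
plot(diff(log(JohnsonJohnson)), ylab = "Quarterly change in log earnings")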

Problem 2. A model for a non-stationary time series may be Xt = α + βt + Yt, where the Yt are i.i.d.
N(0, σ²). [10 marks]

a) How many parameters need to be estimated in this model? [2 marks]

Solution: Three parameters: α, β, and σ².

b) What might be a problem with using such a model to forecast far into the future? [8 marks]

Solution: It only says that the time series is generated as a linear function of time plus noise. It
does not account for possible seasonality, change points, heteroscedasticity, dependence on past values, etc.
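A small simulation illustrates the point (α = 1, β = 0.05, σ = 1 are arbitrary illustrative values): the fitted line tracks the sample well, but extrapolating it far beyond the observed range simply assumes the linear trend persists forever.

set.seed(443)                    # arbitrary seed for reproducibility
t <- 1:100
x <- 1 + 0.05 * t + rnorm(100)   # simulate Xt = alpha + beta*t + Yt
fit <- lm(x ~ t)
predict(fit, newdata = data.frame(t = c(101, 200, 500)))  # forecasts grow without bound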

Problem 3. Download the Global-mean monthly Land-Ocean Temperature dataset from D2L under the
name GLOTemp1880-2016.csv. Use R commands similar to those below to read the data into R. [25 marks]

glotemp<-read.csv("GLOTemp1880-2016.csv",header=TRUE)
glotemp<-glotemp[,-1]

a) Produce a time series plot of the data. Plot the aggregated annual mean series and a boxplot that
summarizes the observed values for each season, and comment on the plots. [5 marks]

glotemp<-read.csv("GLOTemp1880-2016.csv",header=TRUE)
glotemp<-glotemp[,-1]
glotemp

class(glotemp)

# converting to a time series object


glotemp.ts<-ts(c(t(glotemp)), start=c(1880,1),end=c(2016,11),freq=12)

class(glotemp.ts)

plot(glotemp.ts,ylab="Mean monthly temperature index",
     main="Global-mean monthly Land-Ocean Temperature Index")

[Figure: Global-mean monthly Land-Ocean Temperature Index — index (-0.5 to 1.0) against time, 1880-2020]

There is a clear upward trend starting around 1950, but no obvious seasonality present in the data.
The aggregated annual mean series, depicted below, indicates this upward trend more clearly.

Yrglotemp.ts<-aggregate(glotemp.ts,FUN=mean)
plot(Yrglotemp.ts,ylab="Mean yearly temperature index",main=
"Global-mean yearly temperature index: 1880 to 2016")

[Figure: Global-mean yearly temperature index, 1880 to 2016 — aggregated annual means against time]

In the following figure, we provide boxplots of the series by calendar month. These show that the global
mean temperature index is more or less the same for all months, but the variability changes: it appears
smaller in the months June through November than in the other months. In addition, there are large
outliers in the data, indicating skewness to the right for the global mean temperature. This observation
is consistent with the upward trend in the global temperature.

boxplot(glotemp.ts~cycle(glotemp.ts),xlab="Month",ylab="Temperature index",
        main="Monthly summary of the global-mean temperature index: 1880-2016")

[Figure: Monthly summary of the global-mean temperature index, 1880-2016 — boxplots of the index by month (1-12)]

b) Decompose the series into the components trend, seasonal effect, and residuals, and plot the
decomposed series. Produce a plot of the trend with a superimposed seasonal effect. [5 marks]

# Decomposition of the global mean temperature time series


gmtdecomp<-decompose(glotemp.ts)
plot(gmtdecomp)

[Figure: Decomposition of additive time series — observed, trend, seasonal and random panels, 1880-2020; the seasonal panel ranges only from about -0.02 to 0.02]

The decomposition clearly indicates that the global mean temperature has been rising since 1950.
There is no obvious seasonality in the dataset, as one would expect for a globally averaged temperature.
The random part of the decomposition seems to have a constant mean function, but its variability changes with time.

plot(gmtdecomp$trend,lwd=2,ylab="Mean yearly temperature index",
     main="Trend and trend + seasonality components superimposed")
points(gmtdecomp$trend+gmtdecomp$seasonal,col="red",type="l")

[Figure: Trend and trend + seasonality components superimposed — the red (trend + seasonal) curve is nearly indistinguishable from the trend, 1880-2020]
Superimposing the seasonality on the trend indicates that the seasonal effect is minimal and behaves
like noise.
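This can be backed up numerically by comparing magnitudes (a rough check, using the decomposition object created above):

range(gmtdecomp$seasonal)           # seasonal effect is only about +/- 0.02
sd(gmtdecomp$random, na.rm = TRUE)  # the random variation is considerably larger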

c) Fit an appropriate Holt-Winters model to the monthly data. Explain why you chose that particular
Holt-Winters model, and give the parameter estimates. [5 marks]

# Holt-Winters
gmtHW<-HoltWinters(glotemp.ts, seasonal="additive",start.periods=12)
plot(gmtHW)

The function HoltWinters finds the optimal values of α, β, and γ by minimizing the squared one-step
prediction error when these parameters are set to NULL, which is the default in HoltWinters.

> gmtHW
Holt-Winters exponential smoothing with trend and additive seasonal component.

Call:
HoltWinters(x = glotemp.ts, seasonal = "additive", start.periods = 12)

Smoothing parameters:
alpha: 0.4206628
beta : 8.667418e-05
gamma: 0.0926661

Coefficients:
[,1]
a 0.891751423
b -0.000804561
s1 -0.006871003
s2 0.039762988
s3 0.059108293
s4 0.101251553
s5 0.038239190
s6 0.013137417
s7 -0.018456018
s8 -0.011760986
s9 0.018610438
s10 0.029048824
s11 0.037910274
s12 0.031457408

[Figure: Holt-Winters filtering — observed and fitted series superimposed, 1880-2020]

d) Using the fitted model, forecast values for the years 2017 to 2020. Add these forecasts to a time series
plot of the original series. Under what circumstances would these forecasts be valid? What comments
of caution would you make to an economist or politician who wanted to use these forecasts to make
statements about the potential impact of global warming on the world economy? [10 marks]
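The listing above never shows how gmtHW.Forecast is created; presumably a predict() call along the following lines (49 steps ahead covers December 2016 through December 2020, since the observed series ends in November 2016):

gmtHW.Forecast <- predict(gmtHW, n.ahead = 49)  # 49 monthly forecasts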

> gmtHW.Forecast
           Jan       Feb       Mar       Apr       May       Jun       Jul       Aug       Sep       Oct       Nov       Dec
2016                                                                                                               0.8840759
2017 0.9299053 0.9484460 0.9897847 0.9259678 0.9000615 0.8676635 0.8735539 0.9031208 0.9127546 0.9208115 0.9135541 0.8744211
2018 0.9202506 0.9387913 0.9801300 0.9163131 0.8904067 0.8580087 0.8638992 0.8934661 0.9030999 0.9111568 0.9038994 0.8647664
2019 0.9105958 0.9291366 0.9704753 0.9066583 0.8807520 0.8483540 0.8542445 0.8838113 0.8934452 0.9015021 0.8942446 0.8551117
2020 0.9009411 0.9194818 0.9608205 0.8970036 0.8710973 0.8386993 0.8445898 0.8741566 0.8837904 0.8918473 0.8845899 0.8454569

plot(glotemp.ts,lwd=2,ylab="Mean monthly temperature index",main="Holt-Winters Forecasts")
lines(gmtHW.Forecast,col="red")

[Figure: Holt-Winters forecasts — observed series with the 2017-2020 forecasts (red) appended, 1880-2020]
In the above forecasts, we assume the additivity of the trend and seasonality components. As long as
the seasonality and trend remain the same, the short-term forecasts are expected to be reliable. The
caution for an economist or politician is that these are purely statistical extrapolations: the model knows
nothing about the physical drivers of warming, the uncertainty grows with the forecast horizon, and a
structural change (like the change points seen earlier in the series) would invalidate them, so they should
not be quoted for long-range policy claims without accompanying uncertainty intervals.
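That uncertainty can be made explicit with prediction intervals; a sketch using the built-in interval option of predict() for Holt-Winters fits:

gmtHW.PI <- predict(gmtHW, n.ahead = 49, prediction.interval = TRUE, level = 0.95)
plot(gmtHW, gmtHW.PI)  # overlays the fit, the forecasts and the 95% bounds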

Problem 4. (From the book James, Witten, Hastie and Tibshirani (2013)) In this problem you will create
some simulated data and fit simple linear regression models to it. Make sure to use set.seed(1) prior to
starting part (a) to ensure consistent results. [40 marks]

set.seed(1)

a) Using the rnorm() function, create a vector, x, containing 100 observations drawn from a N(0, 1)
distribution. This represents a feature, X. [2 marks]

# a)
x<-rnorm(n=100,mean=0,sd=1)

b) Using the rnorm() function, create a vector, eps, containing 100 observations drawn from a N(0, 0.25)
distribution, that is, a normal distribution with mean zero and variance 0.25. [3 marks]

# b)
eps<-rnorm(n=100,mean=0,sd=sqrt(0.25))

c) Using x and eps, generate a vector y according to the model

Y = -1 + 0.5X + ε    (1)

What is the length of the vector y? What are the values of β0 and β1 in this linear model? [3 marks]

# c)
y<--1+0.5*x+eps

length(y)
[1] 100

Note that β0 = -1 and β1 = 0.5.

d) Create a scatterplot displaying the relationship between x and y. Comment on what you observe.
[3 marks]

# d)
plot(x,y,main="Scatterplot of x vs y")

[Figure: Scatterplot of x vs y — a roughly linear cloud of points, x from -2 to 2, y from about -2.5 to 0.5]

Just looking at the scatterplot of x and y values, we may conclude that there is either a linear or
quadratic relationship between x and y.

e) Fit a least squares linear model to predict y using x. Comment on the model obtained. How do b0
and b1 compare to β0 and β1? [4 marks]

# e)
fitlin1<-lm(y~x)
summary(fitlin1)

Call:
lm(formula = y ~ x)

Residuals:
Min 1Q Median 3Q Max
-0.93842 -0.30688 -0.06975 0.26970 1.17309

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.01885 0.04849 -21.010 < 2e-16 ***
x 0.49947 0.05386 9.273 4.58e-15 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 0.4814 on 98 degrees of freedom


Multiple R-squared: 0.4674,Adjusted R-squared: 0.4619
F-statistic: 85.99 on 1 and 98 DF, p-value: 4.583e-15

The estimates are b0 = -1.01885 and b1 = 0.49947, which are very close to the true values β0 = -1
and β1 = 0.5. The respective standard errors of the estimates are 0.04849 and 0.05386. The adjusted
R-squared of the fit is 0.4619.

f) Display the least squares line on the scatterplot obtained in (d). Draw the population regression line
on the plot, in a different color. Use the legend() command to create an appropriate legend. [5 marks]

# f)
plot(x,y,main="Scatterplot of x vs y along with the fitted and true regression lines")
abline(a=fitlin1$coefficient[1],b=fitlin1$coefficient[2],col="blue")
abline(a=-1,b=0.5,col="red")
legend(x=c(-2,-1),y=c(0.0,0.5),c("Fitted line","True line"), col=c("blue","red"),lty=c(1,1))

[Figure: Scatterplot of x vs y with the fitted (blue) and true (red) regression lines; the two lines nearly coincide]

As can be seen, the fitted line is very close to the true line, with almost the same slope and a slightly
smaller intercept estimate.

g) Now fit a polynomial regression model that predicts y using x and x². Is there evidence that the
quadratic term improves the model fit? Explain your answer. [5 marks]

# g)
fitpoly1<-lm(y~x+I(x^2))
summary(fitpoly1)

Call:
lm(formula = y ~ x + I(x^2))
Residuals:
Min 1Q Median 3Q Max
-0.98252 -0.31270 -0.06441 0.29014 1.13500
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.97164 0.05883 -16.517 < 2e-16 ***
x 0.50858 0.05399 9.420 2.4e-15 ***
I(x^2) -0.05946 0.04238 -1.403 0.164
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 0.479 on 97 degrees of freedom
Multiple R-squared: 0.4779,Adjusted R-squared: 0.4672
F-statistic: 44.4 on 2 and 97 DF, p-value: 2.038e-14

P<-fitpoly1$coef
plot(x,y,main="Scatterplot of x vs y along with the fitted and true regression")
t<-seq(-3,3,0.1)
Py<-P[1]+P[2]*t+P[3]*t^2
points(t,Py,"l",col="blue")
abline(a=-1,b=0.5,col="red")
legend(x=c(-2,-1),y=c(0.0,0.5),c("Fitted curve","True line"), col=c("blue","red"),lty=c(1,1))

[Figure: Scatterplot of x vs y with the fitted quadratic curve (blue) and true line (red)]

The fitted quadratic function has an adjusted R-squared of 0.4672. The estimated coefficients are
-0.97164, 0.50858 and -0.05946. Among the three coefficients, only the coefficient of the quadratic term is
not significantly different from 0, which is consistent with the true model. This tells us that the linear
model fits better.
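The same conclusion follows from a formal F-test comparing the two nested fits:

anova(fitlin1, fitpoly1)  # a large p-value means the quadratic term adds no real explanatory power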

h) Repeat (a) to (f) after modifying the data generation process in such a way that there is less noise in
the data. The model (1) should remain the same. You can do this by decreasing the variance of the
normal distribution used to generate the error term ε in (b). Describe your results. [5 marks]

# h)
x<-rnorm(n=100,mean=0,sd=1)
eps<-rnorm(n=100,mean=0,sd=sqrt(0.05))
y<--1+0.5*x+eps
length(y)
plot(x,y,main="Scatterplot of x vs y")
fitlin2<-lm(y~x)
summary(fitlin2)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-0.61308 -0.12553 -0.00391 0.15199 0.41332

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.98917 0.02216 -44.64 <2e-16 ***
x 0.52375 0.02152 24.33 <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 0.2215 on 98 degrees of freedom
Multiple R-squared: 0.858,Adjusted R-squared: 0.8565
F-statistic: 592.1 on 1 and 98 DF, p-value: < 2.2e-16


plot(x,y,main="Scatterplot of x vs y along with the fitted and true regression lines")
abline(a=fitlin2$coefficient[1],b=fitlin2$coefficient[2],col="blue")
abline(a=-1,b=0.5,col="red")
legend(x=c(-2,-1),y=c(0.0,0.5),c("Fitted line","True line"), col=c("blue","red"),lty=c(1,1))

[Figure: Scatterplot of x vs y (reduced noise) with the fitted (blue) and true (red) regression lines]

fitpoly2<-lm(y~x+I(x^2))
summary(fitpoly2)
Call:
lm(formula = y ~ x + I(x^2))

Residuals:
Min 1Q Median 3Q Max
-0.61297 -0.12369 -0.00475 0.14707 0.43183

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.98386 0.02677 -36.754 <2e-16 ***
x 0.52279 0.02179 23.995 <2e-16 ***
I(x^2) -0.00498 0.01396 -0.357 0.722
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 0.2225 on 97 degrees of freedom


Multiple R-squared: 0.8582,Adjusted R-squared: 0.8553
F-statistic: 293.5 on 2 and 97 DF, p-value: < 2.2e-16

P<-fitpoly2$coef
plot(x,y,main="Scatterplot of x vs y along with the fitted and true regression")
t<-seq(-3,3,0.1)
Py<-P[1]+P[2]*t+P[3]*t^2
points(t,Py,"l",col="blue")
abline(a=-1,b=0.5,col="red")
legend(x=c(-2,-1),y=c(0.0,0.5),c("Fitted curve","True line"), col=c("blue","red"),lty=c(1,1))

[Figure: Scatterplot of x vs y (reduced noise) with the fitted quadratic curve (blue) and true line (red)]

By reducing the variance of the error distribution, all the results stayed essentially the same except that
the quality of the fits improved. For example, the adjusted R-squared of the fits has almost doubled.
This is also quite clear from the scatterplots and the fitted models.

i) Repeat (a) to (f) after modifying the data generation process in such a way that there is more noise
in the data. The model (1) should remain the same. You can do this by increasing the variance of the
normal distribution used to generate the error term ε in (b). Describe your results. [5 marks]

# i)
x<-rnorm(n=100,mean=0,sd=1)
eps<-rnorm(n=100,mean=0,sd=sqrt(1))
y<--1+0.5*x+eps
length(y)
plot(x,y,main="Scatterplot of x vs y")

fitlin3<-lm(y~x)
summary(fitlin3)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-2.51014 -0.60549 0.02065 0.70483 2.08980

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.04745 0.09676 -10.825 < 2e-16 ***
x 0.42505 0.08310 5.115 1.56e-06 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 0.9671 on 98 degrees of freedom
Multiple R-squared: 0.2107,Adjusted R-squared: 0.2027
F-statistic: 26.16 on 1 and 98 DF, p-value: 1.56e-06

[Figure: Scatterplot of x vs y (increased noise) — a much more diffuse point cloud]

plot(x,y,main="Scatterplot of x vs y along with the fitted and true regression lines")


abline(a=fitlin3$coefficient[1],b=fitlin3$coefficient[2],col="blue")
abline(a=-1,b=0.5,col="red")
legend(x=c(-2,-0.7),y=c(-0.5,0.5),c("Fitted line","True line"), col=c("blue","red"),lty=c(1,1))

[Figure: Scatterplot of x vs y (increased noise) with the fitted (blue) and true (red) regression lines]

fitpoly3<-lm(y~x+I(x^2))
summary(fitpoly3)
Call:
lm(formula = y ~ x + I(x^2))
Residuals:
Min 1Q Median 3Q Max
-2.53612 -0.62004 0.00828 0.75138 2.05661
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.01481 0.11669 -8.697 8.68e-14 ***
x 0.43295 0.08487 5.101 1.68e-06 ***
I(x^2) -0.02385 0.04724 -0.505 0.615
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 0.9708 on 97 degrees of freedom
Multiple R-squared: 0.2128,Adjusted R-squared: 0.1965
F-statistic: 13.11 on 2 and 97 DF, p-value: 9.137e-06

P<-fitpoly3$coef
plot(x,y,main="Scatterplot of x vs y along with the fitted and true regression")
t<-seq(-3,4,0.1)
Py<-P[1]+P[2]*t+P[3]*t^2
points(t,Py,"l",col="blue")
abline(a=-1,b=0.5,col="red")
legend(x=c(-2,-0.7),y=c(-0.5,0.5),c("Fitted curve","True line"), col=c("blue","red"),lty=c(1,1))

[Figure: Scatterplot of x vs y (increased noise) with the fitted quadratic curve (blue) and true line (red)]

By increasing the variance of the error distribution, all the results stayed essentially the same except that
the quality of the fits got worse. For example, the adjusted R-squared of the fits dropped to almost half
of that of the original fits. This is also quite clear from the scatterplots and the fitted models.

j) What are the confidence intervals for β0 and β1 based on the original data set, the noisier data set,
and the less noisy data set? Comment on your results. [5 marks]

# First dataset, the 95% confidence intervals are


confint(fitlin1)
2.5 % 97.5 %
(Intercept) -1.1150804 -0.9226122 # |-1.1150804 -0.9226122|=0.1924682
x 0.3925794 0.6063602 # |0.3925794 - 0.6063602|=0.2137808

# Second dataset
confint(fitlin2)
2.5 % 97.5 %
(Intercept) -1.033141 -0.9451916 # |-1.033141 + 0.9451916|=0.0879494
x 0.481037 0.5664653 # | 0.481037 - 0.5664653|=0.0854283

# Third dataset
confint(fitlin3)
2.5 % 97.5 %
(Intercept) -1.2394772 -0.8554276 # |-1.2394772 +0.8554276|=0.3840496
x 0.2601391 0.5899632 # |0.2601391 - 0.5899632|=0.3298241

None of the intervals includes 0, indicating that neither β0 nor β1 is zero at the 95% confidence level.
In addition, as the variance of the error term increases the confidence intervals become wider, and
conversely, as the variance decreases, the confidence intervals become narrower.
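The widths can be put side by side for a direct comparison (ciwidth is a small helper defined here, not part of base R):

ciwidth <- function(fit) apply(confint(fit), 1, diff)  # width of each 95% interval
rbind(original   = ciwidth(fitlin1),
      less.noise = ciwidth(fitlin2),
      more.noise = ciwidth(fitlin3))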
