
Statistics ADE/ECO/IBE 2011. G. García, J. Daoudi, F. Udina, L. Splendore

Chapter 6:
Inference for Simple Linear Regression.

Reference
Newbold: Chapter 12.
Moore: Chapters 2 and 10.

Contents
1 Correlation
  1.1 Testing for a zero correlation
2 Descriptive regression
  2.1 Fitting a line to data: method of least squares
  2.2 Descriptive regression with R
3 Inference for regression
  3.1 Standard assumptions
    3.1.1 The Gauss-Markov Theorem
  3.2 Distribution of b: confidence intervals
    3.2.1 Hypothesis Tests for the slope b
4 ANalysis Of VAriance for Regression
  4.1 The coefficient of determination R²
5 Prediction
6 Exercises on correlation
7 Exercises on regression

All models are wrong, but some are useful.


George Box

To this point, we have dealt almost exclusively with problems of inference about a single variable. In
business and economic applications we are often interested in the relationship between two or
more variables. We have already learned the descriptive tools of simple linear regression: scatterplots,
least-squares regression, and correlation. They are essential preliminaries for inference with simple
linear regression. Regression analysis is widely used for prediction and forecasting.
The earliest form of regression [1] was the method of least squares, published by Legendre
in 1805 and by Gauss in 1809. The term "regression" was coined by Francis Galton in the nineteenth
century to describe a biological phenomenon. It was the pioneering work of Sir Francis Galton in the
1880s that gave rise to the technique, the original idea being the direct result of an experiment on sweet
peas. He noticed that the seeds of the progeny of parents with seeds heavier than average were also
heavier than average, but the difference was not as pronounced; the same effect was true for the seeds
of the progeny of parents with light seeds, where again the differences from the average were not as
great. He called this phenomenon reversion and wrote that the mean weight "reverted, or regressed,
toward mediocrity".
Regression analysis was later extended by Udny Yule and Karl Pearson to a more general statistical
context. The assumptions were weakened by R.A. Fisher in his works of 1922 and 1925. Regression
methods continue to be an area of active research: Clive W.J. Granger (Nobel Prize in Economics,
2003) [2] is famous for his work on extensions of regression, in particular Granger causality and
cointegration.
Here are the examples we will use to show the main ideas.
Example 1. Marriage and Divorce Statistics
Table 1 lists the number of divorces for each year from 1975 to 1980 (Marriage and Divorce Statistics,
Office of Population Censuses and Surveys, HMSO). Figure 1 is a time plot of the same data.

Years             1975   1976   1977   1978   1979   1980
Divorces (1000)  120.5  126.7  129.1  143.7  138.7  148.3

Table 1: Office of Population Censuses and Surveys, HMSO.

There is a positive linear association; the plot shows an increasing trend.


Figure 1: Time plot of divorces (thousands) by year, 1975-1980. Office of Population Censuses and Surveys, HMSO.

Example 2. Figure 2 plots the data from the file http://pascal.upf.edu/estad/dades/thr.txt.

These are daily temperature and relative humidity measurements from 1/5/2000 to 30/11/2000. We are
interested in the relationship between temperature and humidity. The scatterplot shows a negative linear
association between the two variables.
[1] See, for example, Stigler, Stephen M. (1999) Statistics on the Table: The History of Statistical Concepts and Methods. Harvard University Press.
[2] More details at http://nobelprize.org/.

Figure 2: Scatterplot of relative humidity (h.rel) against temperature (temp). Data from http://pascal.upf.edu/estad/dades/thr.txt.

thr <- read.table("http://pascal.upf.edu/estad/dades/thr.txt")  # read the data from the URL
thr         # print the data frame
plot(thr)   # scatterplot of the two variables

1 Correlation
Example 3. Let us introduce the correlation following Example 2.

> round(cor(thr),4)
temp h.rel
temp 1.0000 -0.8248
h.rel -0.8248 1.0000

In this matrix, −0.82 is the correlation between the two variables. The correlation measures the direction
and strength of the linear association between two quantitative variables. The correlation coefficient ρ
between X and Y is defined as

ρ = cov(X, Y) / √( V(X) V(Y) ) = cov(X*, Y*)        (1)

where, as we know, cov(X, Y) = E((X − EX)(Y − EY)) = E(XY) − E(X)E(Y) is the covariance (in R use
cov(thr)) and X*, Y* are the standardized versions of the variables. Remember that the correlation
satisfies −1 ≤ ρ ≤ 1. Because the correlation uses the standardized values of the observations (see the
right-hand side of formula (1)), it does not change when we change the units of measurement. The
correlation is a pure number; it has no unit of measurement. See the scatterplots in Figure 3.
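Both sides of formula (1) are easy to compare in R; a minimal check, assuming the thr data of Example 2 are loaded:

cor(thr$temp, thr$h.rel)                 # -0.8248
cov(scale(thr$temp), scale(thr$h.rel))   # covariance of the standardized variables: the same number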
In inference, to go from the sample (x1, y1), (x2, y2), ..., (xn, yn) to the population, we consider the
sample correlation coefficient:

Figure 3: Samples of observations from joint distributions with different correlations
(r = 0.95, −0.65, 0.23, −1, 0, −0.97).

ρ̂ = r = cov(X, Y) / (Sx Sy) = Σ (xi − x̄)(yi − ȳ) / √( Σ (xi − x̄)² · Σ (yi − ȳ)² )        (2)

where the sums run over i = 1, ..., n, cov(X, Y) here denotes the sample covariance, and

Sx² = Σ (xi − x̄)² / (n − 1)

1.1 Testing for a zero correlation

In this case, our null hypothesis is

H0 : ρ = 0

which states that there is no (linear) relationship between the pair of variables. We test H0 against the
alternative:

H1 : ρ ≠ 0

Assuming that X and Y are normal, and that H0 is true, the sample correlation coefficient r satisfies

r / √( (1 − r²) / (n − 2) ) ∼ tn−2        (3)

And so, the p-value for an observed tobs = r / √( (1 − r²) / (n − 2) ) is 2 P(tn−2 > |tobs|),
and the decision rule is

Reject H0 if | r / √( (1 − r²) / (n − 2) ) | > tn−2, α/2

Example 4. Using R on the data of Example 2, we have:

> cor.test(thr$temp,thr$h.rel)

Pearson’s product-moment correlation

data: temp and h.rel


t = -20.2626, df = 193, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.8650292 -0.7739445
sample estimates:
cor
-0.8247647

We have found strong evidence against H0; the confidence interval shows that the true correlation is negative, not zero.
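The same numbers can be reproduced by hand from formula (3); a minimal sketch, with thr loaded as above:

r <- cor(thr$temp, thr$h.rel)
n <- nrow(thr)
t.obs <- r / sqrt((1 - r^2)/(n - 2))               # the test statistic of formula (3)
t.obs                                              # -20.2626, as reported by cor.test
2 * pt(abs(t.obs), df = n - 2, lower.tail = FALSE) # two-sided p-value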

2 Descriptive regression
Now we recall descriptive regression: first we focus on how the regression line is found (least squares
estimation), and then on some useful R commands.

2.1 Fitting a line to data: method of least squares.

Fitting a line to data means drawing a line that comes as close as possible to the points. There are
many ways to make "as close as possible" precise. The most common is the method of least squares [3].
The least-squares regression line is the line that makes the sum of the squares of the vertical distances of
the data points from the line as small as possible. Define the error ei = yi − (a + bxi); then we solve:

min(a,b) Σ ei² = min(a,b) Σ (yi − (a + bxi))²

From the first order conditions:

0 = ∂(Σ ei²)/∂a = −2 Σ yi + 2na + 2b Σ xi
0 = ∂(Σ ei²)/∂b = −2 Σ xi yi + 2a Σ xi + 2b Σ xi²

we derive the solution

b = ( n Σ xi yi − Σ xi Σ yi ) / ( n Σ xi² − (Σ xi)² ) = r · sy / sx        (4)

[3] Developed by Laplace (1812), Théorie analytique des probabilités. Carl Friedrich Gauss is credited with
developing the fundamentals of least-squares analysis in his Theory of Celestial Movement.

where r is the correlation and sx, sy are the standard deviations of x and y respectively. Then we have
the intercept:

a = ȳ − b x̄.

Please note that byx = r sy/sx ≠ bxy = r sx/sy. To avoid abusing notation we write just b.
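The asymmetry is easy to see in R; a minimal check using the thr data of Example 2:

coef(lm(h.rel ~ temp, data = thr))[2]   # byx = r * sy/sx
coef(lm(temp ~ h.rel, data = thr))[2]   # bxy = r * sx/sy, a different number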

The estimate of σ², the residual variance or Mean Square Error (MSE), is

se² = σ̂² = MSE = Σ (yi − (a + bxi))² / (n − 2) = Σ ei² / (n − 2)        (5)

In inference the MSE is also written se² or σ̂². We will return to it later (see Section 3.2).

2.2 Descriptive regression with R

Example 5. We review descriptive regression in R with Example 1.

> divorces = scan()   ### data entry from the keyboard
1: 120.5
2: 126.7
3: 129.1
4: 143.7
5: 138.7
6: 148.3
7:
Read 6 items
# alternatively, use
> divorces <- c(120.5, 126.7, 129.1, 143.7, 138.7, 148.3)
> year = 1975:1980
> div.lm = lm(divorces ~ year)   ### lm is used to fit a linear model
> names(div.lm)                  ### our output
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
> div.lm$coefficients            ### the estimated coefficients
  (Intercept)          year
-10577.900000      5.417143
> div.lm

Call:
lm(formula = divorces ~ year)

Coefficients:
(Intercept)         year
 -10577.900        5.417
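The other components listed by names(div.lm) are extracted the same way; for instance, the fitted values and residuals:

div.lm$fitted.values   # the values predicted by the line at each year
div.lm$residuals       # the errors e_i = observed - fitted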

Now we are ready for the inference for regression.

3 Inference for regression


In this section we describe methods of inference for regression. We start with the assumptions and
the Gauss-Markov theorem. Then we apply the inference tools we already know (confidence intervals
and hypothesis tests) to regression. We focus on the estimation of the slope, b, widely used by
economists. If you want to study the estimation of the intercept, a, in more depth, please check the
textbooks, for example Moore or Newbold.

3.1 Standard assumptions.

Denote the population regression line by:

Yi = α + βxi + εi

Note that, as usual, we use Greek letters for the population parameters. The following standard
assumptions are often made:

1. The xi are fixed numbers or, if they are random, they are independent of the error terms εi.

2. E(εi) = 0.

3. E(εi²) = σ² (the same variance for all i, "homoscedasticity").

4. E(εi εj) = 0 for all i ≠ j.

If the sample size is small, we also require that the errors are normally distributed.

3.1.1 The Gauss-Markov Theorem.

The Gauss-Markov theorem provides a powerful motivation for estimating the parameters of a regres-
sion model by least squares.
The Gauss-Markov Theorem: Denote the population regression line by:

Yi = α + βxi + εi

Suppose that assumptions 1-4 of Section 3.1 hold. Then, of all linear unbiased estimators of α and β,
the least squares estimators have the smallest variances.
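The theorem can be illustrated by simulation. The sketch below (our own illustration, not taken from the textbooks) compares the least squares slope with another linear unbiased estimator of β, the "endpoint" slope (yn − y1)/(xn − x1): both average to the true β = 2, but the least squares estimator varies much less.

set.seed(1)
x <- 1:20
b.ls <- b.end <- numeric(1000)
for (i in 1:1000) {
  y <- 3 + 2*x + rnorm(20, sd = 2)             # data generated with alpha = 3, beta = 2
  b.ls[i]  <- coef(lm(y ~ x))[2]               # least squares slope
  b.end[i] <- (y[20] - y[1]) / (x[20] - x[1])  # endpoint slope, also linear and unbiased
}
mean(b.ls); mean(b.end)   # both close to 2
var(b.ls); var(b.end)     # the least squares variance is much smaller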

3.2 Distribution of b: confidence intervals

Because we are interested in β, and use b as an estimator of β, it is of main interest to know the
distribution of b.
It is not difficult to prove that:

1. E(b) = β, so b is an unbiased estimator of β.

2. The standard error of b is sb, with sb² = se² / Σ (xi − x̄)², where se² = Σ ei² / (n − 2).

3. Under the assumptions previously stated (including normality when the sample size is small) we
have

(b − β) / sb ∼ tn−2

Using this, confidence intervals can be built using the usual trick:

CI(β) = b ± tn−2, α/2 · sb

In R we can use the function confint.


Example 6. With the data of Example 2:

thr = read.table("http://pascal.upf.edu/estad/dades/thr.txt")
attach(thr)
x = temp
y = h.rel
n = length(x)
######### sums of squares
Sxy = sum(x*y) - sum(x)*sum(y)/n
Sxx = sum(x^2) - (sum(x))^2/n
Syy = sum(y^2) - sum(y)^2/n

####### regression coefficients (a and b)

b = Sxy/Sxx                                       # b
a = sum(y)/n - b*sum(x)/n                         # a
plot(x, y, xlab="Temperature", ylab="Humidity")   # plot
abline(a, b)                                      # add the fitted line

########## fitted values

yhat = a + b*x
errores = y - yhat
sigma2 = sum(errores^2)/(length(errores)-2)
##################

############# then lm (linear model)

d = lm(y ~ x)

plot(x, y, xlab="Temperature", ylab="Relative humidity")
abline(d)   # plot
d.summary = summary(d)
names(d)

############ confidence intervals for the population regression line

coef = coef(d)
es.coef = d.summary$coefficients[,"Std. Error"]
cuant = qt(c(0.975, 0.025), n-2)
int.a = coef[1] - cuant*es.coef[1]   # interval for a
int.a
[1] 91.38484 96.45357
int.b <- coef[2] - cuant*es.coef[2]  # interval for b
int.b
[1] -1.761657 -1.449123

############ the same intervals with the confint function

> confint(d)
                 2.5 %    97.5 %
(Intercept) 91.384845 96.453566
x           -1.761657 -1.449123
##### same intervals, of course!

3.2.1 Hypothesis Tests for the slope b.

Example 7. Let us continue with the data of the previous section:

d.summary

Call:
lm(formula = y ~ x)

Residuals:
Min 1Q Median 3Q Max
-12.6224 -4.9318 -0.7571 5.0788 13.5925

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 93.91921 1.28496 73.09 <2e-16 ***
x -1.60539 0.07923 -20.26 <2e-16 ***
---

Residual standard error: 5.916 on 193 degrees of freedom


Multiple R-squared: 0.6802, Adjusted R-squared: 0.6786
F-statistic: 410.6 on 1 and 193 DF, p-value: < 2.2e-16

In this case, the p-value for the null hypothesis H0 : β = 0 vs. H1 : β ≠ 0 is almost zero, so we clearly
reject H0: the line is not horizontal, and we have a linear relationship. The slope is negative.

Now we introduce the theory, following the usual steps of a hypothesis test: write down the hypotheses,
find and compute the sample test statistic, and compare the result with the critical value. The null and
alternative hypotheses could be:

H0 : β = b0    and    H1 : β ≠ b0

where b0 is a constant, often zero. Note that if our null hypothesis is H0 : β = 0 and we do not reject it,
this means we cannot rule out a horizontal line, that is, no linear relationship.
The sample statistic is:

T = (b − b0) / sb ∼ tn−2

Reject H0 if T > tn−2, α/2 or T < −tn−2, α/2.
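For a null value b0 other than zero, summary does not report the test, but the computation is immediate. A minimal sketch using the objects d and n of Example 6 (the value b0 = −1.5 is just an illustrative choice):

b0 <- -1.5
b.hat <- coef(d)[2]                                 # estimated slope
sb <- summary(d)$coefficients[2, "Std. Error"]      # its standard error
t.obs <- (b.hat - b0) / sb                          # observed test statistic
2 * pt(abs(t.obs), df = n - 2, lower.tail = FALSE)  # two-sided p-value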

Example 8. If we consider the data from Example 1:

> summary(div.lm)

Call:
lm(formula = divorces ~ year)

Residuals:
      1       2       3       4       5       6
-0.4571  0.3257 -2.6914  6.4914 -3.9257  0.2571

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.058e+04  1.908e+03  -5.544  0.00518 **
year         5.417e+00  9.649e-01   5.614  0.00495 **
---

Residual standard error: 4.037 on 4 degrees of freedom
Multiple R-squared: 0.8874, Adjusted R-squared: 0.8592
F-statistic: 31.52 on 1 and 4 DF, p-value: 0.004947

Figure 4: Scatterplot with the fitted line: Office of Population Censuses and Surveys, HMSO.

4 ANalysis Of VAriance for Regression


The basic regression concept, DATA = FIT + RESIDUAL, is rewritten as follows:

yi − ȳ = (ŷi − ȳ) + (yi − ŷi)

The left-hand side is the total variation in the response (SST); the first term on the right is the variation
in the mean response (regression, SSR), and the second is the residual (error, SSE). Squaring each of
these terms and adding over all of the n observations gives the equation SST = SSR + SSE:

Σ (yi − ȳ)² = Σ (ŷi − ȳ)² + Σ (yi − ŷi)²

Again we have:

• The total variation: SST = Σ (yi − ȳ)², with n − 1 degrees of freedom.

• The variation due to the linear part of the model, the regression: SSR = Σ (ŷi − ȳ)², with 1 degree
of freedom.

• Finally, the variation due to deviations from the regression: SSE = Σ (yi − ŷi)², with n − 2 degrees
of freedom.

Then consider the mean squares MST, MSR and MSE:

MST = SST / (n − 1);    MSR = SSR / 1;    MSE = SSE / (n − 2)

As we did in the one-way ANOVA for mean comparison, we consider the F-ratio MSR/MSE. Small
values of this ratio mean that the linear part of the model does not account for the variation of the
response, while large values mean that most of the total variation is explained by the linear part of the
model. This is the basis for the F-test for the regression model. Let H0 be that there is no linear
dependence between y and x, or equivalently, that β = 0. Then Ha states that there is some linear
dependence, β ≠ 0.
Under H0, the statistic F = MSR/MSE has an F1,n−2 distribution (Fisher-Snedecor distribution, with
1 df in the numerator and n − 2 df in the denominator). Since large values of this ratio are in favour
of Ha, we have

p-value = P(F1,n−2 > Fobs)

and so the critical point is F1,n−2,α.
Let us summarize the ANOVA in a table:

Variability source   Sum of Squares   df      Mean Squares        F-ratio
REGRESSION           SSR              1       MSR = SSR/1         MSR/MSE
ERROR                SSE              n − 2   MSE = SSE/(n − 2)
TOTAL                SST              n − 1
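The whole table can be reproduced by hand; a minimal sketch, reusing the objects y, n and the fitted model d of Example 6:

yhat <- fitted(d)                      # fitted values
SST <- sum((y - mean(y))^2)            # total sum of squares
SSR <- sum((yhat - mean(y))^2)         # regression sum of squares
SSE <- sum((y - yhat)^2)               # error sum of squares
c(SST, SSR + SSE)                      # check the decomposition SST = SSR + SSE
F.obs <- (SSR/1) / (SSE/(n - 2))       # F-ratio MSR/MSE
pf(F.obs, 1, n - 2, lower.tail = FALSE) # p-value of the F-test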

4.1 The coefficient of determination R²

Of particular importance is the quantity R² = SSR/SST, which expresses the fraction of the total variation
of y that is explained or captured by the linear part of the model. It can be shown that (in this case of
simple regression, with a single predictor) R² = r², where r is the Pearson coefficient of linear correlation
(see Section 1).
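This identity is easy to check in R; a minimal check, again with the objects x, y and d of Example 6:

summary(d)$r.squared   # R^2 = 0.6802
cor(x, y)^2            # r^2: the same number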
Example 9. Using the data from Example 2:

thr = read.table("http://pascal.upf.edu/estad/dades/thr.txt")
lmt = lm(h.rel ~ temp, data=thr)
summary(lmt)

Call:
lm(formula = h.rel ~ temp, data = thr)

Residuals:
Min 1Q Median 3Q Max
-12.6224 -4.9318 -0.7571 5.0788 13.5925

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 93.91921 1.28496 73.09 <2e-16 ***
temp -1.60539 0.07923 -20.26 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.916 on 193 degrees of freedom


Multiple R-squared: 0.6802, Adjusted R-squared: 0.6786
F-statistic: 410.6 on 1 and 193 DF, p-value: < 2.2e-16

For the ANOVA:

anova(lmt)
Analysis of Variance Table

Response: h.rel
Df Sum Sq Mean Sq F value Pr(>F)

temp 1 14369.1 14369 410.57 < 2.2e-16 ***
Residuals 193 6754.6 35
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

5 Prediction
Given the model y = α + βx + ε and the fitted version y = a + bx + e, there are two interesting questions
given a new value x0 of x:

1. Given x0, what would be the mean response of y?

2. Given x0, what could be the value of y for an individual having this x0 value?

For the point estimate, the answer to both questions is a + bx0. But the confidence interval is different
depending on the question: it is more difficult to estimate (2) than (1), so the confidence interval for (2)
is wider than for (1).
In R, we ask for "confidence" intervals to answer the first question, and "prediction" intervals to answer
the second.
Figure 5 shows both bands.

Figure 5: Confidence band for the mean response and (wider) prediction band for individual values.
Both bands widen as x0 moves away from the range of the observed x values (extrapolation error).

For the thr data of Example 2, we ask for the confidence interval for the mean humidity given
temperatures of 5, 10, 15 and 20 degrees Celsius.

lmt = lm(h.rel ~ temp, data=thr)

new.temp = data.frame(temp=c(5,10,15,20)) # list here the desired x values

# confidence interval for the mean!


predict(lmt, newdata=new.temp, interval="confidence", level=0.95)

fit lwr upr
1 85.89226 84.07717 87.70734
2 77.86531 76.68757 79.04304
3 69.83836 69.00137 70.67534
4 61.81141 60.70011 62.92271

We obtain, for example, that for temp=5 the interval is (84.08, 87.71). The "fit" is the central point, the
predicted value, the point estimate.
Now we want confidence intervals for an individual measurement of humidity in a place where the
temperature is 5, 10, 15 or 20 degrees Celsius.

# confidence intervals for PREDICTION of individual values

predict(lmt, newdata=new.temp, interval="prediction", level=0.95)

       fit      lwr      upr
1 85.89226 74.08382 97.70069
2 77.86531 66.13792 89.59269
3 69.83836 58.14028 81.53643
4 61.81141 50.09050 73.53231

As before, the first column lists the point estimate ŷ; the second and third columns list the lower and
upper limits of the intervals.
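The two kinds of bands sketched in Figure 5 can be drawn for these data; a minimal sketch (the grid of 100 points and the colours are arbitrary choices):

grid.temp <- data.frame(temp = seq(min(thr$temp), max(thr$temp), length.out = 100))
cb <- predict(lmt, newdata = grid.temp, interval = "confidence")   # band for the mean
pb <- predict(lmt, newdata = grid.temp, interval = "prediction")   # band for individuals
plot(thr$temp, thr$h.rel, xlab = "Temperature", ylab = "Humidity")
lines(grid.temp$temp, cb[, "fit"])                                      # fitted line
matlines(grid.temp$temp, cb[, c("lwr", "upr")], lty = 2, col = "blue")  # confidence band
matlines(grid.temp$temp, pb[, c("lwr", "upr")], lty = 3, col = "red")   # wider prediction band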

6 Exercises on correlation
1. The following data relate the boiling temperature of water (in degrees Celsius) to the barometric
pressure (in mm of mercury), and were taken by the Scottish physicist Forbes in 1857 in the Alps
and in Scotland.

Pressure (mm):    768  769  770  773  774  775
Temperature (C): 93.8 94.1 95.3 98.1 99.3 99.9

(a) Plot the data, with "Pressure" on the horizontal axis and "Temperature" on the vertical axis
(the experimenter chooses several geographic locations with different atmospheric pressures
and measures, as the response, the boiling temperature of water).
Answer: See the R instructions below.
(b) Compute the variance of each of these variables and their covariance. From these quantities,
compute the correlation between them. Check whether R's function cor gives you the same
value.
Answer: cor(pres,temp) gives the same as cov(pres,temp)/sqrt(var(pres)*var(temp)).
(c) If you have called the series of pressures pres and that of temperatures temp, now compute
the standardized series pres.est and temp.est. You can do it with the function scale or
directly, subtracting the mean and dividing by the standard deviation. Do it both ways and
check whether they give the same result.
Answer: scale(temp) gives the same as (temp-mean(temp))/sd(temp).
(d) Compute the covariance between the standardized series and their correlation. Comments?
Answer: The covariance between the standardized variables equals the correlation of the variables,
whether original or standardized.
(e) The pressure is expressed in millimetres of mercury. Nowadays we use the hectopascal more
(1 mmHg = 1.3332 hectopascals (hPa)). If we set pres.hPa <- pres * 1.3332, what will the
covariance between the pressure and the temperature be now? And the correlation?
Answer: If we multiply the pressure by 1.3332, its variance gets multiplied by 1.3332² and the
covariance between pres and temp gets multiplied by 1.3332. Therefore, the correlation does not change.
(f) The temperature is expressed in degrees Celsius, but Forbes probably took it in degrees
Fahrenheit. Remember that the conversion from Celsius to Fahrenheit can be done with
F = (9/5)C + 32. If we express the temperature in degrees Fahrenheit, what is the covariance
between the temperature and the pressure? And the correlation?
Answer: On multiplying by 9/5, the covariance also gets multiplied by 9/5, but the correlation does
not change. On adding 32, neither the covariance nor the correlation changes.
(g) Is the correlation coefficient r obtained between the two variables significant? (significance
level 0.05). Say which significance test you use, what the null and alternative hypotheses are,
what assumptions we make about the variables, and what result you obtain. Do the computations
yourself first, and then check whether cor.test gives the same.
Answer: We test H0 : ρ = 0 vs. Ha : ρ ≠ 0. Assuming the initial variables are normal, the statistic
r √((n − 2)/(1 − r²)) follows a tn−2. The computations are done below. We obtain a very small
p-value, so we say we have found statistically significant evidence against the null.

R instructions:

pres = c(768, 769, 770, 773, 774, 775)
temp = c(93.8, 94.1, 95.3, 98.1, 99.3, 99.9)
plot(pres, temp, main="Forbes Data", xlab="Pressure", ylab="Temperature")
cor(pres, temp)
[1] 0.9963161
> cor.test(pres, temp)

        Pearson's product-moment correlation

data: pres and temp
t = 23.236, df = 4, p-value = 2.033e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9651382 0.9996162
sample estimates:
      cor
0.9963161
## the computation step by step
> r = cor(pres, temp)
> n = length(pres)
> ec <- r*sqrt((n-2)/(1-r*r))   # the test statistic
> ec
[1] 23.23597
> 2 * (1 - pt(ec, df=n-2))      # the p-value, two tails
[1] 2.033128e-05

2. (optional exercise) From a sample of two normal variables, sample size n, we obtained a corre-
lation coefficient r = 0.5. How big should n be for this coefficient to be significant? (Use α = 0.05,
state carefully the significance test you are using.) (Note: since you don't know n, you can't
use the tn−2 distribution directly. You should use the normal as a first approximation, then adjust
your answer using the t.)
Answer: We are testing H0 : ρ = 0 vs. Ha : ρ ≠ 0. The observed test statistic is
tobs = r √((n − 2)/(1 − r²)) = 0.5774 √(n − 2). Using the normal approximation, this should be
bigger than 1.96, so n − 2 should be bigger than (1.96/0.5774)² = 11.5, that is, n ≥ 14. Now we
check whether values of n around 14 give the desired p-value for the t distribution:

> r = 0.5
> n = 14
> 2 * pt(r*sqrt((n-2)/(1-r^2)), df=n-2, lower.tail=FALSE)
[1] 0.0686
> n = 15
> 2 * pt(r*sqrt((n-2)/(1-r^2)), df=n-2, lower.tail=FALSE)
[1] 0.0578
> n = 16
> 2 * pt(r*sqrt((n-2)/(1-r^2)), df=n-2, lower.tail=FALSE)
[1] 0.0486

And we see that n = 14 and n = 15 are not enough, but n = 16 is.

3. Now we use the women data that are already included in R. Type women to see the dataset. Type
?women to see information about the dataset and its variables.
(a) Using that 1 inch equals 2.54 cm, and that 1 lbs equals 0.4536 kg, create new variables with
the height and the weight in international units. Answer: weight.kg <- women$weight * 0.4536
and height.cm <- women$height * 2.54 do the job.
(b) Compare the scatterplot of the original variables to the scatterplot of the transformed vari-
ables. Answer: They are very similar; we just rescaled the axes, but then R rescales them
again to draw the graph.
(c) Compare variances, covariances and correlations of the original variables to those of
the transformed variables. Answer: Variances get multiplied by the square of the coefficient we
used, covariances get multiplied by both coefficients. The correlation does not change.
(d) Type the command lm(women$weight~women$height)$coefficients and comment on
the results you get. Note that the symbol ~ is a tilde that can be typed with "alt-gr 4" on a
typical PC keyboard. Answer: We get the coefficients for a regression line W = a + bH.
(e) After making the scatterplot of the original variables, type
abline(lm(women$weight~women$height)$coefficients) and comment on what you
get in the graphics window. Answer: We get the line drawn together with the scatterplot.
4. In this exercise we want to explore simple random samples from the model Y = 0.5X + e, where
X ∼ N(0, 1) and e ∼ N(0, 1), X and e being independent.
(a) Using the properties of variance and covariance, show that cor(X, Y) = 0.447. Answer:
V(X) = V(e) = 1, V(Y) = 0.25 V(X) + V(e) = 1.25, cov(X, Y) = cov(X, 0.5X) + cov(X, e) = 0.5,
so cor(X, Y) = 0.5/√1.25 ≈ 0.447.
(b) Open a new script window in R and type in the following lines:
x <- rnorm(20)
y <- 0.5*x + rnorm(20)
plot(x, y)
abline(lm(y~x)$coefficients)
cor(x, y)
Then choose "Run all" from the Edit menu and watch the graphics windows. Comment.
(c) Repeat the "Run all" command many times and describe what you see in the graphics win-
dow and also in the console window. Repeat "Run all" until you get a negative correlation
coefficient. Why do you think it happened?
(d) Now we let R count how many times we get a negative correlation from these X, Y variables
that have "true correlation" equal to 0.45. Type the lines
cors <- c()
for (i in 1:10000)
{x <- rnorm(20); y <- 0.5*x + rnorm(20); cors[i] <- cor(x, y)}
summary(cors)
and explain what you get. To count how many cors were negative, type length(cors[cors<0]).

7 Exercises on regression

Published: Thursday, 25 February 2010
Submission deadline: Monday, 1 March 2010, 8:00 am

Exercise 1 Below are the scores obtained by a group of students in the midterm exam and in the final
exam of Statistics.

Midterm 81 75 71 61  96 56 85 70 77 71 91 88 79 77
Final   80 82 83 57 100 30 68 40 87 65 86 82 57 75

1. Draw the scatterplot of the data and comment on it.

2. Compute the linear correlation coefficient between the two sets of scores and interpret its value.

3. If it makes sense, place "by eye" the line that best fits the points of the plot.

4. Find the fitted line by the method of least squares.

5. A student of the same group obtained a score of 80 in the first exam but could not sit the final
exam. In view of the behaviour of the group, what score could he expect in the final exam?
How is this value interpreted?

6. Study the residuals and comment on the adequacy of the model.

Exercise 2 (Moore, exercise 10.9) (Data: http://pascal.upf.edu/estad/dades/manatis.dat)

Manatees are large, gentle marine creatures that live along the coast of Florida. Motorboats kill or
injure many manatees. We have data on the registered motorboats (in thousands) and the number
of manatees killed by boats in Florida in the years 1977 to 1990.

1. Draw a scatterplot showing the relationship between the number of registered motorboats and
the manatees killed (which is the explanatory variable?).

2. Is the general pattern of the relationship between the variables approximately linear? Are there
clear outliers or strongly influential observations?

3. Fit the regression model with lm. What does R² = 0.886 say about the relationship between
boats and manatee deaths?

4. Explain what the slope β of the true regression line means in this situation. Then give a 90%
confidence interval for β.

5. If Florida decided to freeze the number of registered boats at 700,000, how many manatees
would you predict the motorboats would kill each year?

6. Ask R for the prediction at x = 700. Does it agree with the one you had obtained?

7. Give a 95% prediction interval for the mean number of manatees that would die each year if
Florida froze the number of licences at 700,000.

8. Finally, study the adequacy of the fitted model: do the initial assumptions hold?

(Parts (f) and (g) are not required for seminar 6.)

Exercise 3 The following table presents some data on the number of telephone lines per 1,000
inhabitants (Y) and the per-capita gross domestic product (X) for Singapore in the period 1966 to
1981 (16 years).

Year    Y     X
1966   48  1589
1967   54  1757
1968   59  1974
 ...  ...   ...
1979  262  4628
1980  291  5038
1981  317  5472

With these data we have the following estimates:

mean of X = 3334.6            mean of Y = 145.7
variance of X = 1.380 × 10⁶   variance of Y = 7697.4
covariance between X and Y = 1.003 × 10⁵

1. If we assume a linear relationship between X and Y (Y = β0 + β1 X + ε), estimate β0 and β1 by
least squares and assess the goodness of fit.

2. If the errors ε are normal with mean 0 and variance σ², give a 90% confidence interval for β1.

3. Do we have enough evidence to reject the hypothesis of no linear dependence between Y
and X?

Problem 4 (From Moore, ex. 11.21) The file PNG.txt, which you can download from the course server,
contains observations of the lean body mass (PesMagre), the metabolic rate (NivellMetabolic) and the
gender (Genere) of 19 people chosen at random to take part in a study. The lean body mass (in kg)
is the total weight of an individual minus its fat content, and it is suspected to have a strong influence
on the metabolic rate, measured as calorie expenditure.

Answer the following questions and state clearly which R instructions you use.

1. With the instruction
dades <- read.table("http://pascal.upf.edu/estad/dades/PNG.txt", header
= TRUE)
load the data into R.

2. Make a scatterplot of the observations of lean body mass and metabolic rate, and establish
whether there may be a relationship between the two variables and of what kind.

3. Find the ordinary least squares regression line to explain the metabolic rate in terms of the
lean body mass.

4. Find a confidence interval for the slope of the regression line and explain clearly what your
interval says about the relationship between body mass and metabolic rate.

5. What percentage of the variability of the metabolic rate is explained by its linear relationship
with the lean body mass variable? How do you rate the fit of the model?

6. What is the sign of the correlation coefficient between lean body mass and metabolic rate? Can
you say how you would compute it from some of the results you have already obtained? Now do
the computations with R: is it significant?

7. Find the residuals and examine them. Do the assumptions on which regression inference is
based hold?

Exercise 5 The following observations correspond to the beer consumption (1 drink = 33 cl of beer)
and the blood alcohol level of several students.

Student                1    2    3    4    5     6    7    8    9   10
Number of drinks       5    2    9    8    3     7    3    5    3    5
Blood alcohol       0.10 0.03 0.19 0.12 0.04 0.095 0.07 0.06 0.02 0.05

You can read the data from the file cerveses-alcohol.txt on the usual server, or enter them from
the keyboard.

Answer the following questions and state clearly which R instructions you use.

1. Make a scatterplot of the observations; do you detect any kind of association between the two
variables?

2. Find the linear correlation coefficient between the number of beers drunk and the blood alcohol
level. How do you interpret it? Is it significant? What does that mean?

3. Find the ordinary least squares regression line to explain the blood alcohol level in terms of
the number of beers drunk. Interpret the coefficients you obtain.

4. What percentage of the variability of the blood alcohol level is explained by its linear relationship
with the number of beers drunk? How do you rate the fit of the model?

5. Carry out a test of the significance of the slope of the regression line and explain how you
interpret it. Could you have solved this with the results of one of the previous parts? Justify
clearly why.

6. Carry out a test to decide whether having one more beer raises the blood alcohol level by 0.02,
against the alternative that the increase is smaller.

7. How can we estimate the blood alcohol level of a student who drank 6 beers? Explain which
confidence intervals are involved in this estimation and how they should be interpreted.

8. Suppose we want to present the information of the table above with more precision and express
the amount of beer drunk not in number of drinks but in cl. Which of the previous results would
change, and why? Try to answer without redoing all the computations.

Exercise 6 Simulation of a regression model

Load the following code into an R script window and execute it repeatedly. Comment on what
you observe.

n = 20
# random values for x
xx = rnorm(n)
# parameters of the true regression line
a = 3
b = 2
# standard deviation of the errors
sr = 2
# y values according to the model
yy = b * xx + a + rnorm(n, sd=sr)
# scatterplot
plot(xx, yy, xlim=c(-3,3), ylim=a+b*c(-3,3))
# fit the model to the data
fit = lm(yy ~ xx)
# draw the fitted line
abline(fit$coefficients)
# draw the true line in red
abline(a, b, col="red")
# print the summary of the fit
print(summary(fit))

