
Statistics ADE/ECO/IBE 2011. G. García, J. Daoudi, F. Udina, L. Splendore

Chapter 6:
Inference for Simple Linear Regression.

Reference
Newbold: Chapter 12.
Moore: Chapters 2 and 10.

Contents
1 Correlation
  1.1 Testing for a zero correlation
2 Descriptive regression
  2.1 Fitting a line to data: method of least squares
  2.2 Descriptive regression with R
3 Inference for regression
  3.1 Standard assumptions
    3.1.1 The Gauss-Markov Theorem
  3.2 Distribution of b: confidence intervals
    3.2.1 Hypothesis Tests for the slope b
4 ANalysis Of VAriance for Regression
  4.1 The coefficient of determination R²
5 Prediction
6 Exercises on correlation
7 Exercises on regression

All models are wrong, but some are useful.


George Box

To this point, we have dealt almost exclusively with problems of inference about a single variable. In
business and economic applications we are often interested in the relationship between two or
more variables. We have already learned the descriptive tools of simple linear regression: scatterplots,
least-squares regression, and correlation. They are essential preliminaries for inference with simple
linear regression. Regression analysis is widely used for prediction and forecasting.
The earliest form of regression [1] was the method of least squares, published by Legendre
in 1805 and by Gauss in 1809. The term "regression" was coined by Francis Galton in the nineteenth
century to describe a biological phenomenon. It was the pioneering work of Sir Francis Galton in the
1880s that gave rise to the technique, the original idea being the direct result of an experiment on sweet
peas. He noticed that the seeds of the progeny of parents with seeds heavier than average were also
heavier than average, but the difference was not as pronounced; the same effect was true for the seeds
of the progeny of parents with light seeds, where again the differences from the average were not as
great. He called this phenomenon reversion and wrote that the mean weight "reverted, or regressed,
toward mediocrity".
Regression analysis was later extended by Udny Yule and Karl Pearson to a more general statistical
context. The assumptions were weakened by R.A. Fisher in his works of 1922 and 1925. Regression
methods continue to be an area of active research: Clive W.J. Granger (Nobel Prize in Economics,
2003) [2] is famous for his work on extensions of regression, in particular Granger causality and
cointegration.
Here are the examples we will use to show the main ideas.
Example 1. Marriage and Divorce Statistics
Table 1 lists the number of divorces for each year from 1975 to 1980 (Marriage and Divorce Statistics,
Office of Population Censuses and Surveys, HMSO). Figure 1 is a time plot of the same data.

Years             1975   1976   1977   1978   1979   1980
Divorces (1000)  120.5  126.7  129.1  143.7  138.7  148.3

Table 1: Office of Population Censuses and Surveys, HMSO.

There is a positive linear association; the plot shows an increasing trend.


Figure 1: Time plot of divorces (thousands) by year, 1975-1980. Office of Population Censuses and Surveys, HMSO.

Example 2. Figure 2 plots the data from the file http://pascal.upf.edu/estad/dades/thr.txt.

These are daily temperature and relative humidity measurements from 1/5/2000 to 30/11/2000. We are
interested in the relationship between temperature and humidity. The scatterplot shows a negative linear
association between the two variables.
[1] See, for example, Stigler, Stephen M. (1999) Statistics on the Table: The History of Statistical Concepts and Methods. Harvard University Press.
[2] More details at http://nobelprize.org/.

Figure 2: Scatterplot of relative humidity (h.rel) against temperature (temp). Data from http://pascal.upf.edu/estad/dades/thr.txt.

thr <- read.table("http://pascal.upf.edu/estad/dades/thr.txt")  # read the data from the URL
thr         # print the data frame
plot(thr)   # scatterplot of the two variables

1 Correlation
Example 3. Let us introduce the correlation following Example 2.

> round(cor(thr),4)
temp h.rel
temp 1.0000 -0.8248
h.rel -0.8248 1.0000

In this matrix, −0.82 is the correlation between the two variables. The correlation measures the direction
and strength of the linear association between two quantitative variables. The correlation coefficient ρ
between X and Y is defined as

ρ = cov(X, Y) / √( V(X) V(Y) ) = cov(X*, Y*)        (1)

where, as we know, cov(X, Y) = E((X − EX)(Y − EY)) = E(XY) − E(X)E(Y) is the covariance (in R use
cov(thr)) and X*, Y* are the standardized versions of the variables. Remember that the correlation
satisfies −1 ≤ ρ ≤ 1. Because the correlation uses the standardized values of the observations (see the
right-hand side of formula (1)), it does not change when we change the units of measurement. The
correlation is a pure number; it has no unit of measurement. See the scatterplots in Figure 3.
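Both sides of formula (1) are easy to compare in R; a minimal check, assuming the thr data of Example 2 are loaded:

cor(thr$temp, thr$h.rel)                 # -0.8248
cov(scale(thr$temp), scale(thr$h.rel))   # covariance of the standardized variables: the same number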
In inference, to go from the sample (x1, y1), (x2, y2), ..., (xn, yn) to the population, we consider the
sample correlation coefficient:

Figure 3: Samples of observations from joint distributions with different correlations
(r = 0.95, −0.65, 0.23, −1, 0, −0.97).

ρ̂ = r = cov(X, Y) / (Sx Sy) = Σ (xi − x̄)(yi − ȳ) / √( Σ (xi − x̄)² · Σ (yi − ȳ)² )        (2)

where the sums run over i = 1, ..., n, cov(X, Y) here denotes the sample covariance, and

Sx² = Σ (xi − x̄)² / (n − 1)

1.1 Testing for a zero correlation

In this case, our null hypothesis is

H0 : ρ = 0

which states that there is no (linear) relationship between the pair of variables. We test H0 against the
alternative:

H1 : ρ ≠ 0

Assuming that X and Y are normal, and that H0 is true, the sample correlation coefficient r satisfies

r / √( (1 − r²) / (n − 2) ) ∼ tn−2        (3)

And so, the p-value for an observed tobs = r / √( (1 − r²) / (n − 2) ) is 2 P(tn−2 > |tobs|),
and the decision rule is

Reject H0 if | r / √( (1 − r²) / (n − 2) ) | > tn−2, α/2

Example 4. Using R on the data of Example 2, we have:

> cor.test(thr$temp,thr$h.rel)

Pearson’s product-moment correlation

data: temp and h.rel


t = -20.2626, df = 193, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.8650292 -0.7739445
sample estimates:
cor
-0.8247647

We have found strong evidence against H0; the confidence interval shows that the true correlation is negative, not zero.
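The same numbers can be reproduced by hand from formula (3); a minimal sketch, with thr loaded as above:

r <- cor(thr$temp, thr$h.rel)
n <- nrow(thr)
t.obs <- r / sqrt((1 - r^2)/(n - 2))               # the test statistic of formula (3)
t.obs                                              # -20.2626, as reported by cor.test
2 * pt(abs(t.obs), df = n - 2, lower.tail = FALSE) # two-sided p-value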

2 Descriptive regression
Now we recall descriptive regression: first we focus on how the regression line is found (least squares
estimation), and then on some useful R commands.

2.1 Fitting a line to data: method of least squares.

Fitting a line to data means drawing a line that comes as close as possible to the points. There are
many ways to make "as close as possible" precise. The most common is the method of least squares [3].
The least-squares regression line is the line that makes the sum of the squares of the vertical distances of
the data points from the line as small as possible. Define the error ei = yi − (a + bxi); then we solve:

min(a,b) Σ ei² = min(a,b) Σ (yi − (a + bxi))²

From the first order conditions:

0 = ∂(Σ ei²)/∂a = −2 Σ yi + 2na + 2b Σ xi
0 = ∂(Σ ei²)/∂b = −2 Σ xi yi + 2a Σ xi + 2b Σ xi²

we derive the solution

b = ( n Σ xi yi − Σ xi Σ yi ) / ( n Σ xi² − (Σ xi)² ) = r · sy / sx        (4)

[3] Developed by Laplace (1812), Théorie analytique des probabilités. Carl Friedrich Gauss is credited with
developing the fundamentals of least-squares analysis in his Theory of Celestial Movement.

where r is the correlation and sx, sy are the standard deviations of x and y respectively. Then we have
the intercept:

a = ȳ − b x̄.

Please note that byx = r sy/sx ≠ bxy = r sx/sy. To avoid abusing notation we write just b.
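The asymmetry is easy to see in R; a minimal check using the thr data of Example 2:

coef(lm(h.rel ~ temp, data = thr))[2]   # byx = r * sy/sx
coef(lm(temp ~ h.rel, data = thr))[2]   # bxy = r * sx/sy, a different number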

The estimate of σ², the residual variance or Mean Square Error (MSE), is

se² = σ̂² = MSE = Σ (yi − (a + bxi))² / (n − 2) = Σ ei² / (n − 2)        (5)

In inference the MSE is also written se² or σ̂². We will return to it later (see Section 3.2).

2.2 Descriptive regression with R

Example 5. We review descriptive regression in R with Example 1.

> divorces = scan()   ### data entry from the keyboard
1: 120.5
2: 126.7
3: 129.1
4: 143.7
5: 138.7
6: 148.3
7:
Read 6 items
# alternatively, use
> divorces <- c(120.5, 126.7, 129.1, 143.7, 138.7, 148.3)
> year = 1975:1980
> div.lm = lm(divorces ~ year)   ### lm is used to fit a linear model
> names(div.lm)                  ### our output
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
> div.lm$coefficients            ### the estimated coefficients
  (Intercept)          year
-10577.900000      5.417143
> div.lm

Call:
lm(formula = divorces ~ year)

Coefficients:
(Intercept)         year
 -10577.900        5.417
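The other components listed by names(div.lm) are extracted the same way; for instance, the fitted values and residuals:

div.lm$fitted.values   # the values predicted by the line at each year
div.lm$residuals       # the errors e_i = observed - fitted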

Now we are ready for the inference for regression.

3 Inference for regression


In this section we describe methods of inference for regression. We start with the assumptions and
the Gauss-Markov theorem. Then we apply the inference tools we already know (confidence intervals
and hypothesis tests) to regression. We focus on the estimation of the slope, b, widely used by
economists. If you want to study the estimation of the intercept, a, in more depth, please check the
textbooks, for example Moore or Newbold.

3.1 Standard assumptions.

Denote the population regression line by:

Yi = α + βxi + εi

Note that, as usual, we use Greek letters for the population parameters. The following standard
assumptions are often made:

1. The xi are fixed numbers or, if they are random, they are independent of the error terms εi.

2. E(εi) = 0.

3. E(εi²) = σ² (the same variance for all i, "homoscedasticity").

4. E(εi εj) = 0 for all i ≠ j.

If the sample size is small, we also require that the errors are normally distributed.

3.1.1 The Gauss-Markov Theorem.

The Gauss-Markov theorem provides a powerful motivation for estimating the parameters of a regres-
sion model by least squares.
The Gauss-Markov Theorem: Denote the population regression line by:

Yi = α + βxi + εi

Suppose that assumptions 1-4 of Section 3.1 hold. Then, of all linear unbiased estimators of α and β,
the least squares estimators have the smallest variances.
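The theorem can be illustrated by simulation. The sketch below (our own illustration, not taken from the textbooks) compares the least squares slope with another linear unbiased estimator of β, the "endpoint" slope (yn − y1)/(xn − x1): both average to the true β = 2, but the least squares estimator varies much less.

set.seed(1)
x <- 1:20
b.ls <- b.end <- numeric(1000)
for (i in 1:1000) {
  y <- 3 + 2*x + rnorm(20, sd = 2)             # data generated with alpha = 3, beta = 2
  b.ls[i]  <- coef(lm(y ~ x))[2]               # least squares slope
  b.end[i] <- (y[20] - y[1]) / (x[20] - x[1])  # endpoint slope, also linear and unbiased
}
mean(b.ls); mean(b.end)   # both close to 2
var(b.ls); var(b.end)     # the least squares variance is much smaller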

3.2 Distribution of b: confidence intervals

Because we are interested in β, and use b as an estimator of β, it is of main interest to know the
distribution of b.
It is not difficult to prove that:

1. E(b) = β, so b is an unbiased estimator of β.

2. The standard error of b is sb, with sb² = se² / Σ (xi − x̄)², where se² = Σ ei² / (n − 2).

3. Under the assumptions previously stated (including normality when the sample size is small) we
have

(b − β) / sb ∼ tn−2

Using this, confidence intervals can be built using the usual trick:

CI(β) = b ± tn−2, α/2 · sb

In R we can use the function confint.


Example 6. With the data of Example 2:

thr = read.table("http://pascal.upf.edu/estad/dades/thr.txt")
attach(thr)
x = temp
y = h.rel
n = length(x)
######### sums of squares
Sxy = sum(x*y) - sum(x)*sum(y)/n
Sxx = sum(x^2) - (sum(x))^2/n
Syy = sum(y^2) - sum(y)^2/n

####### regression coefficients (a and b)

b = Sxy/Sxx                                       # b
a = sum(y)/n - b*sum(x)/n                         # a
plot(x, y, xlab="Temperature", ylab="Humidity")   # plot
abline(a, b)                                      # add the fitted line

########## fitted values

yhat = a + b*x
errores = y - yhat
sigma2 = sum(errores^2)/(length(errores)-2)
##################

############# then lm (linear model)

d = lm(y ~ x)

plot(x, y, xlab="Temperature", ylab="Relative humidity")
abline(d)   # plot
d.summary = summary(d)
names(d)

############ confidence intervals for the population regression line

coef = coef(d)
es.coef = d.summary$coefficients[,"Std. Error"]
cuant = qt(c(0.975, 0.025), n-2)
int.a = coef[1] - cuant*es.coef[1]   # interval for a
int.a
[1] 91.38484 96.45357
int.b <- coef[2] - cuant*es.coef[2]  # interval for b
int.b
[1] -1.761657 -1.449123

############ the same intervals with the confint function

> confint(d)
                 2.5 %    97.5 %
(Intercept) 91.384845 96.453566
x           -1.761657 -1.449123
##### same intervals, of course!

3.2.1 Hypothesis Tests for the slope b.

Example 7. Let us continue with the data of the previous section:

d.summary

Call:
lm(formula = y ~ x)

Residuals:
Min 1Q Median 3Q Max
-12.6224 -4.9318 -0.7571 5.0788 13.5925

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 93.91921 1.28496 73.09 <2e-16 ***
x -1.60539 0.07923 -20.26 <2e-16 ***
---

Residual standard error: 5.916 on 193 degrees of freedom


Multiple R-squared: 0.6802, Adjusted R-squared: 0.6786
F-statistic: 410.6 on 1 and 193 DF, p-value: < 2.2e-16

In this case, the p-value for the null hypothesis H0 : β = 0 vs. H1 : β ≠ 0 is almost zero, so we clearly
reject H0: the line is not horizontal, and we have a linear relationship. The slope is negative.

Now we introduce the theory, following the usual steps of a hypothesis test: write down the hypotheses,
find and compute the sample test statistic, and compare the result with the critical value. The null and
alternative hypotheses could be:

H0 : β = b0    and    H1 : β ≠ b0

where b0 is a constant, often zero. Note that if our null hypothesis is H0 : β = 0 and we do not reject it,
this means we cannot rule out a horizontal line, that is, no linear relationship.
The sample statistic is:

T = (b − b0) / sb ∼ tn−2

Reject H0 if T > tn−2, α/2 or T < −tn−2, α/2.
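For a null value b0 other than zero, summary does not report the test, but the computation is immediate. A minimal sketch using the objects d and n of Example 6 (the value b0 = −1.5 is just an illustrative choice):

b0 <- -1.5
b.hat <- coef(d)[2]                                 # estimated slope
sb <- summary(d)$coefficients[2, "Std. Error"]      # its standard error
t.obs <- (b.hat - b0) / sb                          # observed test statistic
2 * pt(abs(t.obs), df = n - 2, lower.tail = FALSE)  # two-sided p-value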

Example 8. If we consider the data from Example 1:

> summary(div.lm)

Call:
lm(formula = divorces ~ year)

Residuals:
      1       2       3       4       5       6
-0.4571  0.3257 -2.6914  6.4914 -3.9257  0.2571

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.058e+04  1.908e+03  -5.544  0.00518 **
year         5.417e+00  9.649e-01   5.614  0.00495 **
---

Residual standard error: 4.037 on 4 degrees of freedom
Multiple R-squared: 0.8874, Adjusted R-squared: 0.8592
F-statistic: 31.52 on 1 and 4 DF, p-value: 0.004947

Figure 4: Scatterplot with the fitted line: Office of Population Censuses and Surveys, HMSO.

4 ANalysis Of VAriance for Regression


The basic regression concept, DATA = FIT + RESIDUAL, is rewritten as follows:

yi − ȳ = (ŷi − ȳ) + (yi − ŷi)

The left-hand side is the total variation in the response (SST); the first term on the right is the variation
in the mean response (regression, SSR), and the second is the residual (error, SSE). Squaring each of
these terms and adding over all of the n observations gives the equation SST = SSR + SSE:

Σ (yi − ȳ)² = Σ (ŷi − ȳ)² + Σ (yi − ŷi)²

Again we have:

• The total variation: SST = Σ (yi − ȳ)², with n − 1 degrees of freedom.

• The variation due to the linear part of the model, the regression: SSR = Σ (ŷi − ȳ)², with 1 degree
of freedom.

• Finally, the variation due to deviations from the regression: SSE = Σ (yi − ŷi)², with n − 2 degrees
of freedom.

Then consider the mean squares MST, MSR and MSE:

MST = SST / (n − 1);    MSR = SSR / 1;    MSE = SSE / (n − 2)

As we did in the one-way ANOVA for mean comparison, we consider the F-ratio MSR/MSE. Small
values of this ratio mean that the linear part of the model does not account for the variation of the
response, while large values mean that most of the total variation is explained by the linear part of the
model. This is the basis for the F-test for the regression model. Let H0 be that there is no linear
dependence between y and x, or equivalently, that β = 0. Then Ha states that there is some linear
dependence, β ≠ 0.
Under H0, the statistic F = MSR/MSE has an F1,n−2 distribution (Fisher-Snedecor distribution, with
1 df in the numerator and n − 2 df in the denominator). Since large values of this ratio are in favour
of Ha, we have

p-value = P(F1,n−2 > Fobs)

and so the critical point is F1,n−2,α.
Let us summarize the ANOVA in a table:

Variability source   Sum of Squares   df      Mean Squares        F-ratio
REGRESSION           SSR              1       MSR = SSR/1         MSR/MSE
ERROR                SSE              n − 2   MSE = SSE/(n − 2)
TOTAL                SST              n − 1
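The whole table can be reproduced by hand; a minimal sketch, reusing the objects y, n and the fitted model d of Example 6:

yhat <- fitted(d)                      # fitted values
SST <- sum((y - mean(y))^2)            # total sum of squares
SSR <- sum((yhat - mean(y))^2)         # regression sum of squares
SSE <- sum((y - yhat)^2)               # error sum of squares
c(SST, SSR + SSE)                      # check the decomposition SST = SSR + SSE
F.obs <- (SSR/1) / (SSE/(n - 2))       # F-ratio MSR/MSE
pf(F.obs, 1, n - 2, lower.tail = FALSE) # p-value of the F-test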

4.1 The coefficient of determination R²

Of particular importance is the quantity R² = SSR/SST, which expresses the fraction of the total variation
of y that is explained or captured by the linear part of the model. It can be shown that (in this case of
simple regression, with a single predictor) R² = r², where r is the Pearson coefficient of linear correlation
(see Section 1).
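This identity is easy to check in R; a minimal check, again with the objects x, y and d of Example 6:

summary(d)$r.squared   # R^2 = 0.6802
cor(x, y)^2            # r^2: the same number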
Example 9. Using the data from Example 2:

thr = read.table("http://pascal.upf.edu/estad/dades/thr.txt")
lmt = lm(h.rel ~ temp, data=thr)
summary(lmt)

Call:
lm(formula = h.rel ~ temp, data = thr)

Residuals:
Min 1Q Median 3Q Max
-12.6224 -4.9318 -0.7571 5.0788 13.5925

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 93.91921 1.28496 73.09 <2e-16 ***
temp -1.60539 0.07923 -20.26 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.916 on 193 degrees of freedom


Multiple R-squared: 0.6802, Adjusted R-squared: 0.6786
F-statistic: 410.6 on 1 and 193 DF, p-value: < 2.2e-16

For the ANOVA:

anova(lmt)
Analysis of Variance Table

Response: h.rel
Df Sum Sq Mean Sq F value Pr(>F)

temp 1 14369.1 14369 410.57 < 2.2e-16 ***
Residuals 193 6754.6 35
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

5 Prediction
Given the model y = α + βx + ε and the fitted version y = a + bx + e, there are two interesting questions
given a new value x0 of x:

1. Given x0, what would be the mean response of y?

2. Given x0, what could be the value of y for an individual having this x0 value?

For the point estimate, the answer to both questions is a + bx0. But the confidence interval is different
depending on the question: it is more difficult to estimate (2) than (1), so the confidence interval for (2)
is wider than for (1).
In R, we ask for "confidence" intervals to answer the first question, and "prediction" intervals to answer
the second.
Figure 5 shows both bands.

Figure 5: Confidence band for the mean response and (wider) prediction band for individual values.
Both bands widen as x0 moves away from the range of the observed x values (extrapolation error).

For the thr data of Example 2, we ask for the confidence interval for the mean humidity given
temperatures of 5, 10, 15 and 20 degrees Celsius.

lmt = lm(h.rel ~ temp, data=thr)

new.temp = data.frame(temp=c(5,10,15,20)) # list here the desired x values

# confidence interval for the mean!


predict(lmt, newdata=new.temp, interval="confidence", level=0.95)

fit lwr upr
1 85.89226 84.07717 87.70734
2 77.86531 76.68757 79.04304
3 69.83836 69.00137 70.67534
4 61.81141 60.70011 62.92271

We obtain, for example, that for temp=5 the interval is (84.08, 87.71). The "fit" is the central point, the
predicted value, the point estimate.
Now we want confidence intervals for an individual measurement of humidity in a place where the
temperature is 5, 10, 15 or 20 degrees Celsius.

# confidence intervals for PREDICTION of individual values

predict(lmt, newdata=new.temp, interval="prediction", level=0.95)

       fit      lwr      upr
1 85.89226 74.08382 97.70069
2 77.86531 66.13792 89.59269
3 69.83836 58.14028 81.53643
4 61.81141 50.09050 73.53231

As before, the first column lists the point estimate ŷ; the second and third columns list the lower and
upper limits of the intervals.
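The two kinds of bands sketched in Figure 5 can be drawn for these data; a minimal sketch (the grid of 100 points and the colours are arbitrary choices):

grid.temp <- data.frame(temp = seq(min(thr$temp), max(thr$temp), length.out = 100))
cb <- predict(lmt, newdata = grid.temp, interval = "confidence")   # band for the mean
pb <- predict(lmt, newdata = grid.temp, interval = "prediction")   # band for individuals
plot(thr$temp, thr$h.rel, xlab = "Temperature", ylab = "Humidity")
lines(grid.temp$temp, cb[, "fit"])                                      # fitted line
matlines(grid.temp$temp, cb[, c("lwr", "upr")], lty = 2, col = "blue")  # confidence band
matlines(grid.temp$temp, pb[, c("lwr", "upr")], lty = 3, col = "red")   # wider prediction band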

6 Exercises on correlation
1. The following data relate the boiling temperature of water (in degrees Celsius) to the barometric
pressure (in mm of mercury), and were taken by the Scottish physicist Forbes in 1857 in the Alps
and in Scotland.

Pressure (mm):    768  769  770  773  774  775
Temperature (C): 93.8 94.1 95.3 98.1 99.3 99.9

(a) Plot the data, with "Pressure" on the horizontal axis and "Temperature" on the vertical axis
(the experimenter chooses several geographic locations with different atmospheric pressures
and measures, as the response, the boiling temperature of water).
Answer: See the R instructions below.
(b) Compute the variance of each of these variables and their covariance. From these quantities,
compute the correlation between them. Check whether R's function cor gives you the same
value.
Answer: cor(pres,temp) gives the same as cov(pres,temp)/sqrt(var(pres)*var(temp)).
(c) If you have called the series of pressures pres and that of temperatures temp, now compute
the standardized series pres.est and temp.est. You can do it with the function scale or
directly, subtracting the mean and dividing by the standard deviation. Do it both ways and
check whether they give the same result.
Answer: scale(temp) gives the same as (temp-mean(temp))/sd(temp).
(d) Compute the covariance between the standardized series and their correlation. Comments?
Answer: The covariance between the standardized variables equals the correlation of the variables,
whether original or standardized.
(e) The pressure is expressed in millimetres of mercury. Nowadays we use the hectopascal more
(1 mmHg = 1.3332 hectopascals (hPa)). If we set pres.hPa <- pres * 1.3332, what will the
covariance between the pressure and the temperature be now? And the correlation?
Answer: If we multiply the pressure by 1.3332, its variance gets multiplied by 1.3332² and the
covariance between pres and temp gets multiplied by 1.3332. Therefore, the correlation does not change.
(f) The temperature is expressed in degrees Celsius, but Forbes probably took it in degrees
Fahrenheit. Remember that the conversion from Celsius to Fahrenheit can be done with
F = (9/5)C + 32. If we express the temperature in degrees Fahrenheit, what is the covariance
between the temperature and the pressure? And the correlation?
Answer: On multiplying by 9/5, the covariance also gets multiplied by 9/5, but the correlation does
not change. On adding 32, neither the covariance nor the correlation changes.
(g) Is the correlation coefficient r obtained between the two variables significant? (significance
level 0.05). Say which significance test you use, what the null and alternative hypotheses are,
what assumptions we make about the variables, and what result you obtain. Do the computations
yourself first, and then check whether cor.test gives the same.
Answer: We test H0 : ρ = 0 vs. Ha : ρ ≠ 0. Assuming the initial variables are normal, the statistic
r √((n − 2)/(1 − r²)) follows a tn−2. The computations are done below. We obtain a very small
p-value, so we say we have found statistically significant evidence against the null.

R instructions:

pres = c(768, 769, 770, 773, 774, 775)
temp = c(93.8, 94.1, 95.3, 98.1, 99.3, 99.9)
plot(pres, temp, main="Forbes Data", xlab="Pressure", ylab="Temperature")
cor(pres, temp)
[1] 0.9963161
> cor.test(pres, temp)

        Pearson's product-moment correlation

data: pres and temp
t = 23.236, df = 4, p-value = 2.033e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9651382 0.9996162
sample estimates:
      cor
0.9963161
## the computation step by step
> r = cor(pres, temp)
> n = length(pres)
> ec <- r*sqrt((n-2)/(1-r*r))   # the test statistic
> ec
[1] 23.23597
> 2 * (1 - pt(ec, df=n-2))      # the p-value, two tails
[1] 2.033128e-05

2. (optional exercise) From a sample of two normal variables, sample size n, we obtained a corre-
lation coefficient r = 0.5. How big should n be for this coefficient to be significant? (Use α = 0.05,
state carefully the significance test you are using.) (Note: since you don't know n, you can't
use the tn−2 distribution directly. You should use the normal as a first approximation, then adjust
your answer using the t.)
Answer: We are testing H0 : ρ = 0 vs. Ha : ρ ≠ 0. The observed test statistic is
tobs = r √((n − 2)/(1 − r²)) = 0.5774 √(n − 2). Using the normal approximation, this should be
bigger than 1.96, so n − 2 should be bigger than (1.96/0.5774)² = 11.5, that is, n ≥ 14. Now we
check whether values of n around 14 give the desired p-value for the t distribution:

> r = 0.5
> n = 14
> 2 * pt(r*sqrt((n-2)/(1-r^2)), df=n-2, lower.tail=FALSE)
[1] 0.0686
> n = 15
> 2 * pt(r*sqrt((n-2)/(1-r^2)), df=n-2, lower.tail=FALSE)
[1] 0.0578
> n = 16
> 2 * pt(r*sqrt((n-2)/(1-r^2)), df=n-2, lower.tail=FALSE)
[1] 0.0486

And we see that n = 14 and n = 15 are not enough, but n = 16 is.

3. Now we use the women data that are already included in R. Type women to see the dataset. Type
?women to see information about the dataset and its variables.
(a) Using that 1 inch equals 2.54 cm, and that 1 lbs equals 0.4536 kg, create new variables with
the height and the weight in international units. Answer: weight.kg <- women$weight * 0.4536
and height.cm <- women$height * 2.54 do the job.
(b) Compare the scatterplot of the original variables to the scatterplot of the transformed vari-
ables. Answer: They are very similar; we just rescaled the axes, but then R rescales them
again to draw the graph.
(c) Compare variances, covariances and correlations of the original variables to those of
the transformed variables. Answer: Variances get multiplied by the square of the coefficient we
used, covariances get multiplied by both coefficients. The correlation does not change.
(d) Type the command lm(women$weight~women$height)$coefficients and comment on
the results you get. Note that the symbol ~ is a tilde that can be typed with "alt-gr 4" on a
typical PC keyboard. Answer: We get the coefficients for a regression line W = a + bH.
(e) After making the scatterplot of the original variables, type
abline(lm(women$weight~women$height)$coefficients) and comment on what you
get in the graphics window. Answer: We get the line drawn together with the scatterplot.
4. In this exercise we want to explore simple random samples from the model Y = 0.5X + e, where
X ∼ N(0, 1) and e ∼ N(0, 1), X and e being independent.
(a) Using the properties of variance and covariance, show that cor(X, Y) = 0.447. Answer:
V(X) = V(e) = 1, V(Y) = 0.25 V(X) + V(e) = 1.25, cov(X, Y) = cov(X, 0.5X) + cov(X, e) = 0.5,
so cor(X, Y) = 0.5/√1.25 ≈ 0.447.
(b) Open a new script window in R and type in the following lines:
x <- rnorm(20)
y <- 0.5*x + rnorm(20)
plot(x, y)
abline(lm(y~x)$coefficients)
cor(x, y)
Then choose "Run all" from the Edit menu and watch the graphics windows. Comment.
(c) Repeat the "Run all" command many times and describe what you see in the graphics win-
dow and also in the console window. Repeat "Run all" until you get a negative correlation
coefficient. Why do you think it happened?
(d) Now we let R count how many times we get a negative correlation from these X, Y variables
that have "true correlation" equal to 0.45. Type the lines
cors <- c()
for (i in 1:10000)
{x <- rnorm(20); y <- 0.5*x + rnorm(20); cors[i] <- cor(x, y)}
summary(cors)
and explain what you get. To count how many cors were negative, type length(cors[cors<0]).

7 Exercises on regression

Published: Thursday, 25 February 2010
Submission deadline: Monday, 1 March 2010, 8:00 am

Exercise 1 Below are the scores obtained by a group of students in the midterm exam and in the final
exam of Statistics.

Midterm 81 75 71 61  96 56 85 70 77 71 91 88 79 77
Final   80 82 83 57 100 30 68 40 87 65 86 82 57 75

1. Draw the scatterplot of the data and comment on it.

2. Compute the linear correlation coefficient between the two sets of scores and interpret its value.

3. If it makes sense, place "by eye" the line that best fits the points of the plot.

4. Find the fitted line by the method of least squares.

5. A student of the same group obtained a score of 80 in the first exam but could not sit the final
exam. In view of the behaviour of the group, what score could he expect in the final exam?
How is this value interpreted?

6. Study the residuals and comment on the adequacy of the model.

Exercise 2 (Moore, exercise 10.9) (Data: http://pascal.upf.edu/estad/dades/manatis.dat)

Manatees are large, gentle marine creatures that live along the coast of Florida. Motorboats kill or
injure many manatees. We have data on the registered motorboats (in thousands) and the number
of manatees killed by boats in Florida in the years 1977 to 1990.

1. Draw a scatterplot showing the relationship between the number of registered motorboats and
the manatees killed (which is the explanatory variable?).

2. Is the general pattern of the relationship between the variables approximately linear? Are there
clear outliers or strongly influential observations?

3. Fit the regression model with lm. What does R² = 0.886 say about the relationship between
boats and manatee deaths?

4. Explain what the slope β of the true regression line means in this situation. Then give a 90%
confidence interval for β.

5. If Florida decided to freeze the number of registered boats at 700,000, how many manatees
would you predict the motorboats would kill each year?

6. Ask R for the prediction at x = 700. Does it agree with the one you had obtained?

7. Give a 95% prediction interval for the mean number of manatees that would die each year if
Florida froze the number of licences at 700,000.

8. Finally, study the adequacy of the fitted model: do the initial assumptions hold?

(Parts (f) and (g) are not required for seminar 6.)

Exercise 3 The following table presents some data on the number of telephone lines per 1,000
inhabitants (Y) and the per-capita gross domestic product (X) for Singapore in the period 1966 to
1981 (16 years).

Year    Y     X
1966   48  1589
1967   54  1757
1968   59  1974
 ...  ...   ...
1979  262  4628
1980  291  5038
1981  317  5472

With these data we have the following estimates:

mean of X = 3334.6            mean of Y = 145.7
variance of X = 1.380 × 10⁶   variance of Y = 7697.4
covariance between X and Y = 1.003 × 10⁵

1. If we assume a linear relationship between X and Y (Y = β0 + β1 X + ε), estimate β0 and β1 by
least squares and assess the goodness of fit.

2. If the errors ε are normal with mean 0 and variance σ², give a 90% confidence interval for β1.

3. Do we have enough evidence to reject the hypothesis of no linear dependence between Y
and X?

Problem 4 (From Moore, ex. 11.21) The file PNG.txt, which you can download from the course server,
contains observations of the lean body mass (PesMagre), the metabolic rate (NivellMetabolic) and the
gender (Genere) of 19 people chosen at random to take part in a study. The lean body mass (in kg)
is the total weight of an individual minus its fat content, and it is suspected to have a strong influence
on the metabolic rate, measured as calorie expenditure.

Answer the following questions and state clearly which R instructions you use.

1. With the instruction
dades <- read.table("http://pascal.upf.edu/estad/dades/PNG.txt", header
= TRUE)
load the data into R.

2. Make a scatterplot of the observations of lean body mass and metabolic rate, and establish
whether there may be a relationship between the two variables and of what kind.

3. Find the ordinary least squares regression line to explain the metabolic rate in terms of the
lean body mass.

4. Find a confidence interval for the slope of the regression line and explain clearly what your
interval says about the relationship between body mass and metabolic rate.

5. What percentage of the variability of the metabolic rate is explained by its linear relationship
with the lean body mass variable? How do you rate the fit of the model?

6. What is the sign of the correlation coefficient between lean body mass and metabolic rate? Can
you say how you would compute it from some of the results you have already obtained? Now do
the computations with R: is it significant?

7. Find the residuals and examine them. Do the assumptions on which regression inference is
based hold?

Exercise 5 The following observations correspond to the beer consumption (1 drink = 33 cl of beer)
and the blood alcohol level of several students.

Student                1    2    3    4    5     6    7    8    9   10
Number of drinks       5    2    9    8    3     7    3    5    3    5
Blood alcohol       0.10 0.03 0.19 0.12 0.04 0.095 0.07 0.06 0.02 0.05

You can read the data from the file cerveses-alcohol.txt on the usual server, or enter them from
the keyboard.

Answer the following questions and state clearly which R instructions you use.

1. Make a scatterplot of the observations; do you detect any kind of association between the two
variables?

2. Find the linear correlation coefficient between the number of beers drunk and the blood alcohol
level. How do you interpret it? Is it significant? What does that mean?

3. Find the ordinary least squares regression line to explain the blood alcohol level in terms of
the number of beers drunk. Interpret the coefficients you obtain.

4. What percentage of the variability of the blood alcohol level is explained by its linear relationship
with the number of beers drunk? How do you rate the fit of the model?

5. Carry out a test of the significance of the slope of the regression line and explain how you
interpret it. Could you have solved this with the results of one of the previous parts? Justify
clearly why.

6. Carry out a test to decide whether having one more beer raises the blood alcohol level by 0.02,
against the alternative that the increase is smaller.

7. How can we estimate the blood alcohol level of a student who drank 6 beers? Explain which
confidence intervals are involved in this estimation and how they should be interpreted.

8. Suppose we want to present the information of the table above with more precision and express
the amount of beer drunk not in number of drinks but in cl. Which of the previous results would
change, and why? Try to answer without redoing all the computations.

Exercise 6 Simulation of a regression model

Load the following code into an R script window and execute it repeatedly. Comment on what
you observe.

n = 20
# random values for x
xx = rnorm(n)
# parameters of the true regression line
a = 3
b = 2
# standard deviation of the errors
sr = 2
# y values according to the model
yy = b * xx + a + rnorm(n, sd=sr)
# scatterplot
plot(xx, yy, xlim=c(-3,3), ylim=a+b*c(-3,3))
# fit the model to the data
fit = lm(yy ~ xx)
# draw the fitted line
abline(fit$coefficients)
# draw the true line in red
abline(a, b, col="red")
# print the summary of the fit
print(summary(fit))

