• Linear regression is a simple approach to supervised learning. It assumes that the dependence of Y on X1, X2, . . . , Xp is linear:

η(x) = β0 + β1 x1 + β2 x2 + · · · + βp xp.

• It is almost always thought of as an approximation to the truth: functions in nature are rarely linear.
• True regression functions are never linear!

[Figure: a nonlinear true regression function f(X) together with its linear approximation.]
2 / 48
Advertising data
[Figure: scatterplots of Sales against the TV, Radio, and Newspaper advertising budgets for the Advertising data.]
Simple linear regression using a single predictor X.
• We assume a model

Y = β0 + β1 X + ε,

where β0 and β1 are two unknown constants (the intercept and slope) and ε is the error term.
• Given estimates β̂0 and β̂1, we predict future values using

ŷ = β̂0 + β̂1 x.
Estimation of the parameters by least squares
• Let ŷi = β̂0 + β̂1 xi be the prediction for Y based on the ith value of X. Then ei = yi − ŷi represents the ith residual.
• We define the residual sum of squares (RSS) as

RSS = e1² + e2² + · · · + en²,

or equivalently as

RSS = (y1 − β̂0 − β̂1 x1)² + (y2 − β̂0 − β̂1 x2)² + · · · + (yn − β̂0 − β̂1 xn)².

• The least squares approach chooses β̂0 and β̂1 to minimize the RSS. The minimizing values can be shown to be

β̂1 = Σⁿᵢ₌₁ (xi − x̄)(yi − ȳ) / Σⁿᵢ₌₁ (xi − x̄)²,   β̂0 = ȳ − β̂1 x̄,

where x̄ and ȳ are the sample means.
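As a concrete sketch, the least squares estimates and the RSS can be computed directly from the closed-form formulas (toy data invented for illustration, not the Advertising set):

```python
# Least squares estimates for simple linear regression, from the closed-form
# solutions that minimize RSS.  Toy data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# beta1_hat = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
beta1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)
beta0 = y_bar - beta1 * x_bar

# RSS = e1^2 + ... + en^2 with ei = yi - (beta0 + beta1 * xi)
rss = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))
```

Any other choice of β̂0, β̂1 on this data yields a strictly larger RSS.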
[Figure 3.1. For the Advertising data, the least squares fit for the regression of sales onto TV is shown. The fit is found by minimizing the sum of squared errors. Each grey line segment represents an error, and the fit makes a compromise by averaging their squares. In this case a linear fit captures the essence of the relationship, although it is somewhat deficient in the left of the plot.]
Figure 3.1 displays the simple linear regression fit to the Advertising
data, where β̂0 = 7.03 and β̂1 = 0.0475. In other words, according to this model an additional $1,000 spent on TV advertising is associated with selling approximately 47.5 additional units of the product.
Assessing the Accuracy of the Coefficient Estimates
• The standard error of an estimator reflects how it varies under repeated sampling. We have

SE(β̂1)² = σ² / Σⁿᵢ₌₁ (xi − x̄)²,   SE(β̂0)² = σ² [1/n + x̄² / Σⁿᵢ₌₁ (xi − x̄)²],

where σ² = Var(ε).
• These standard errors can be used to compute confidence
intervals. A 95% confidence interval is defined as a range of
values such that with 95% probability, the range will
contain the true unknown value of the parameter. It has
the form
β̂1 ± 2 · SE(β̂1 ).
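A numeric sketch of the standard error and the resulting interval, on toy data, using the rule-of-thumb factor of 2 from the slide in place of the exact t quantile:

```python
import math

# 95% CI for the slope: beta1_hat +/- 2 * SE(beta1_hat), with
# SE(beta1)^2 = sigma^2 / sum((xi - x_bar)^2) and sigma^2 estimated
# by RSS / (n - 2).  Toy data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
beta1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
beta0 = y_bar - beta1 * x_bar
rss = sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(xs, ys))

sigma2_hat = rss / (n - 2)             # estimate of sigma^2 = Var(eps)
se_beta1 = math.sqrt(sigma2_hat / sxx)
ci = (beta1 - 2 * se_beta1, beta1 + 2 * se_beta1)
```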
Confidence intervals — continued
Hypothesis testing
• Standard errors can also be used to perform hypothesis
tests on the coefficients. The most common hypothesis test
involves testing the null hypothesis of
H0 : There is no relationship between X and Y
versus the alternative hypothesis
HA : There is some relationship between X and Y .
• Mathematically, this corresponds to testing

H0 : β1 = 0

versus

HA : β1 ≠ 0,

since if β1 = 0 then the model reduces to Y = β0 + ε, and X is not associated with Y.
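For the sales-on-TV fit quoted earlier (β̂1 = 0.0475), the test statistic is t = (β̂1 − 0)/SE(β̂1). In the sketch below the standard error is an assumed illustrative value, and the two-sided p-value uses a normal approximation via erfc rather than the exact t distribution with n − 2 degrees of freedom:

```python
import math

# t-statistic for H0: beta1 = 0
beta1_hat = 0.0475   # slope of the sales-on-TV fit quoted earlier
se_beta1 = 0.0027    # assumed standard error, for illustration
t = (beta1_hat - 0) / se_beta1

# two-sided p-value under a normal approximation to the t distribution
p_value = math.erfc(abs(t) / math.sqrt(2))
```

A t-statistic this large leads to rejecting H0 at any conventional level.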
Hypothesis testing — continued

Results for the advertising data
Assessing the Overall Accuracy of the Model
• We compute the Residual Standard Error

RSE = √( RSS / (n − 2) ) = √( (1/(n − 2)) Σⁿᵢ₌₁ (yi − ŷi)² ),

where the residual sum-of-squares is RSS = Σⁿᵢ₌₁ (yi − ŷi)².
• R-squared, or fraction of variance explained, is

R² = (TSS − RSS) / TSS = 1 − RSS / TSS,

where TSS = Σⁿᵢ₌₁ (yi − ȳ)² is the total sum of squares.
• It can be shown that in this simple linear regression setting R² = r², where r is the correlation between X and Y:

r = Σⁿᵢ₌₁ (xi − x̄)(yi − ȳ) / ( √(Σⁿᵢ₌₁ (xi − x̄)²) · √(Σⁿᵢ₌₁ (yi − ȳ)²) ).
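These quantities, and the identity R² = r², can be checked numerically (a sketch on toy data):

```python
import math

# RSE, R^2, and the identity R^2 = r^2 for simple linear regression.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)          # total sum of squares (TSS)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

beta1 = sxy / sxx
beta0 = y_bar - beta1 * x_bar
rss = sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(xs, ys))

rse = math.sqrt(rss / (n - 2))
r_squared = 1 - rss / syy                        # 1 - RSS/TSS
r = sxy / (math.sqrt(sxx) * math.sqrt(syy))      # sample correlation
```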
Advertising data results
Quantity Value
Residual Standard Error 3.26
R2 0.612
F-statistic 312.1
Multiple Linear Regression
Y = β0 + β1 X1 + β2 X2 + · · · + βp Xp + ε,
Interpreting regression coefficients

The woes of (interpreting) regression coefficients
Two quotes by famous Statisticians
“Essentially, all models are wrong, but some are useful”

George Box

“The only way to find out what will happen when a complex
system is disturbed is to disturb the system, not merely to
observe it passively”

Fred Mosteller and John Tukey
Estimation and Prediction for Multiple Regression
• Given estimates β̂0, β̂1, . . . , β̂p, we can make predictions using the formula

ŷ = β̂0 + β̂1 x1 + β̂2 x2 + · · · + β̂p xp.

[Figure: the least squares regression plane fit to data on two predictors, X1 and X2.]
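A sketch of how such estimates can be obtained: form the normal equations (XᵀX)β = Xᵀy and solve them. Toy data, with a small hand-rolled Gaussian elimination rather than a linear-algebra library:

```python
# Multiple regression y = b0 + b1*x1 + b2*x2 via the normal equations.

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]  # (x1, x2)
ys = [5.0, 6.0, 11.0, 12.0, 16.0]        # toy responses, exactly y = 1 + 2*x1 + x2

X = [[1.0, x1, x2] for x1, x2 in rows]   # design matrix with intercept column
xtx = [[sum(X[i][a] * X[i][b] for i in range(len(X))) for b in range(3)]
       for a in range(3)]                # X^T X
xty = [sum(X[i][a] * ys[i] for i in range(len(X))) for a in range(3)]  # X^T y
beta = solve(xtx, xty)

# prediction at (x1, x2) = (2, 3): y_hat = b0 + b1*2 + b2*3
y_hat = beta[0] + beta[1] * 2.0 + beta[2] * 3.0
```

Because the toy responses are exactly linear in the predictors, least squares recovers the generating coefficients.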
Results for advertising data
Correlations:

                 TV   radio  newspaper   sales
TV           1.0000  0.0548     0.0567  0.7822
radio                1.0000     0.3541  0.5762
newspaper                       1.0000  0.2283
sales                                   1.0000
Some important questions
Is at least one predictor useful?
F = [ (TSS − RSS)/p ] / [ RSS/(n − p − 1) ] ∼ Fp,n−p−1
Quantity Value
Residual Standard Error 1.69
R2 0.897
F-statistic 570
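Dividing the numerator and denominator of F by TSS rewrites it in terms of R² alone: F = (R²/p) / ((1 − R²)/(n − p − 1)). Plugging in the values from the table (R² = 0.897, p = 3, and assuming the Advertising data's n = 200) recovers the reported F up to rounding:

```python
# F statistic expressed via R^2, using R^2 = 1 - RSS/TSS.
r_squared = 0.897   # from the table above
p = 3               # TV, radio, newspaper
n = 200             # assumed sample size of the Advertising data

f_stat = (r_squared / p) / ((1 - r_squared) / (n - p - 1))   # ~ 569
```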
Deciding on the important variables

Forward selection

Backward selection

Model selection — continued
Other Considerations in the Regression Model
Qualitative Predictors
• Some predictors are not quantitative but are qualitative,
taking a discrete set of values.
• These are also called categorical predictors or factor
variables.
• See for example the scatterplot matrix of the credit card
data in the next slide.
In addition to the 7 quantitative variables shown, there are
four qualitative variables: gender, student (student
status), status (marital status), and ethnicity
(Caucasian, African American (AA) or Asian).
Credit Card Data

[Figure 3.6. Scatterplot matrix of the Credit data: the seven quantitative variables balance, age, cards, education, income, limit, and rating.]
Qualitative Predictors — continued
Resulting model, where the dummy variable xi = 1 if the ith person is female and xi = 0 if male:

yi = β0 + β1 xi + εi =
    β0 + β1 + εi   if ith person is female,
    β0 + εi        if ith person is male.

Interpretation?
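A minimal sketch of this dummy coding; the coefficient values below are hypothetical placeholders, not fitted Credit results:

```python
# Dummy variable: x_i = 1 if the ith person is female, 0 if male.
def x(gender):
    return 1.0 if gender == "female" else 0.0

b0, b1 = 500.0, 20.0   # hypothetical intercept and dummy coefficient

def predicted_balance(gender):
    # y_i = b0 + b1 * x_i: b0 + b1 for females, b0 for males
    return b0 + b1 * x(gender)
```

So β1 is the average difference in balance between the two groups, and β0 is the average balance for the baseline (male) group.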
Credit card data — continued

Qualitative predictors with more than two levels

Qualitative predictors with more than two levels — continued.

Results for ethnicity
Extensions of the Linear Model
Removing the additive assumption: interactions and
nonlinearity
Interactions:
• In our previous analysis of the Advertising data, we
assumed that the effect on sales of increasing one
advertising medium is independent of the amount spent on
the other media.
• For example, the linear model

ŝales = β0 + β1 × TV + β2 × radio + β3 × newspaper

states that the average effect on sales of a one-unit increase in TV is always β1, regardless of the amount spent on the other media.
Interactions — continued
Interaction in the Advertising data?
[Figure 3.5. For the Advertising data, a linear regression fit to sales using TV and radio as predictors. From the pattern of the residuals, we can see that there is a pronounced non-linear relationship in the data.]

When levels of either TV or radio are low, then the true sales are lower than predicted by the linear model. But when advertising is split between the two media, then the model tends to underestimate sales.
Modelling interactions — Advertising data
• In this model we add an interaction term between TV and radio:

sales = β0 + β1 × TV + β2 × radio + β3 × (TV × radio) + ε.

Results:
Coefficient Std. Error t-statistic p-value
Intercept 6.7502 0.248 27.23 < 0.0001
TV 0.0191 0.002 12.70 < 0.0001
radio 0.0289 0.009 3.24 0.0014
TV×radio 0.0011 0.000 20.73 < 0.0001
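With the fitted coefficients from the table, the marginal effect of TV now depends on radio: a one-unit increase in TV changes predicted sales by β̂1 + β̂3 × radio. A sketch using the tabulated values:

```python
# Fitted interaction model from the table above:
#   sales_hat = b0 + b_tv*TV + b_radio*radio + b_inter*TV*radio
b0, b_tv, b_radio, b_inter = 6.7502, 0.0191, 0.0289, 0.0011

def sales_hat(tv, radio):
    return b0 + b_tv * tv + b_radio * radio + b_inter * tv * radio

def tv_effect(radio):
    # marginal effect of one more unit of TV spending, at a given radio level
    return b_tv + b_inter * radio
```

At radio = 0 the TV effect is 0.0191; at radio = 50 it grows to 0.0741, which is the non-additive behavior the interaction term captures.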
Interpretation

Interpretation — continued

Hierarchy

Hierarchy — continued

Interactions between qualitative and quantitative variables
With interactions, it takes the form
balancei ≈ β0 + β1 × incomei +
    β2 + β3 × incomei   if student
    0                   if not student

  = (β0 + β2) + (β1 + β3) × incomei   if student
    β0 + β1 × incomei                 if not student
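The two-branch reduction above can be checked numerically; the coefficients below are hypothetical placeholders:

```python
# balance ~ b0 + b1*income + (b2 + b3*income)*student_dummy
b0, b1, b2, b3 = 200.0, 6.0, 380.0, -2.0   # hypothetical values

def balance(income, student):
    dummy = 1.0 if student else 0.0
    return b0 + b1 * income + (b2 + b3 * income) * dummy

# students:     intercept b0 + b2, slope b1 + b3
# non-students: intercept b0,      slope b1
```

With an interaction term, the two groups get both different intercepts and different slopes.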
[Figure 3.7. For the Credit data, the least squares lines are shown for prediction of balance from income for students and non-students. Left: no interaction between income and student. Right: with an interaction term between income and student.]
[Figure 3.8. The Auto data set: mpg and horsepower are shown for a number of cars.]
The figure suggests that a non-linear fit, e.g. mpg = β0 + β1 × horsepower + β2 × horsepower² + ε, may provide a better description of the relationship.
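Note that such a quadratic model is still a linear regression: it is linear in the coefficients, with predictors horsepower and horsepower². A sketch with hypothetical coefficients chosen only to illustrate the curved fit:

```python
# Quadratic model: mpg_hat = b0 + b1*hp + b2*hp^2 (hypothetical coefficients)
b0, b1, b2 = 56.9, -0.466, 0.00123

def mpg_hat(horsepower):
    return b0 + b1 * horsepower + b2 * horsepower ** 2
```

With coefficients of this sign pattern, predicted mpg falls steeply at low horsepower and flattens at high horsepower, matching the curved cloud of points in the figure.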
What we did not cover
Outliers
Non-constant variance of error terms
High leverage points
Collinearity
See text Section 3.3.3
Generalizations of the Linear Model