
Multiple Regression
Fitting Models for Multiple Independent Variables

By Ellen Ludlow

If you wanted to predict someone's
weight based on his height, you
would collect data by recording the
heights and weights and fit a model.
Let's say our population is males
ages 16-25, and this is a table of
collected data...

height (in)  60   63   65   66   67   68   68   69   70   70   71   72   72   73   75
weight (lb)  120  135  130  143  137  149  144  150  156  152  154  162  169  163  168

Next, we graph the data...

[Scatterplot: Height vs Weight -- heights (55-80 in) plotted against weights (115-175 lb), showing a roughly linear pattern]

And because the data looks linear, we fit an LSR (least-squares regression) line.
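The presentation uses Minitab, but as an illustration, here is a minimal sketch of the same least-squares fit in Python with NumPy (the snippet and its variable names are ours, not from the slides):

    import numpy as np

    # Data from the table above (heights in inches, weights in pounds).
    height = np.array([60, 63, 65, 66, 67, 68, 68, 69, 70, 70, 71, 72, 72, 73, 75])
    weight = np.array([120, 135, 130, 143, 137, 149, 144, 150, 156, 152, 154, 162, 169, 163, 168])

    # Degree-1 polynomial fit: returns the slope and intercept of the
    # LSR line predicting weight from height.
    slope, intercept = np.polyfit(height, weight, 1)
    print(f"weight = {intercept:.2f} + {slope:.3f} * height")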

But weight isn't the only factor that
has an impact on someone's height.
The height of someone's parents may
be another predictor.
With multiple regression you may have
more than one independent variable,
so you could use someone's weight and
his parents' height to predict his own
height.

Our new table, with the average height
of each subject's parents added, looks
like this:

height (in)           60   63   65   66   67   68   68   69   70   70   71   72   72   73   75
weight (lb)           120  135  130  143  137  149  144  150  156  152  154  162  169  163  168
parents' height (in)  59   67   62   59   71   66   71   67   69   73   69   75   72   69   73

This data can't be graphed like simple
linear regression, because there are two
independent variables.
There is software, however, such as
Minitab, that can analyze data with
multiple independent variables.
Let's take a look at a Minitab output for
our data:

Predictor     Coef      Stdev     t-ratio   p
Constant      25.028    4.326     5.79      0.000
weight        0.24020   0.03140   7.65      0.000
parenth       0.11493   0.09035   1.27      0.227

s = 1.165    R-sq = 92.6%    R-sq(adj) = 91.4%

Analysis of Variance

SOURCE       DF    SS       MS       F       p
Regression    2    205.31   102.65   75.62   0.000
Error        12     16.29     1.36
Total        14    221.60

What does all this mean?

First, let's look at the multiple
regression model.
The general model for multiple
regression is similar to the model for
simple linear regression.
Simple linear regression model:

y = β₀ + β₁x

Multiple regression model:

y = β₀ + β₁x₁ + β₂x₂ + ... + βₖxₖ

Just like linear regression, when you fit a
multiple regression to data, the terms in
the model equation are statistics, not
parameters.
A multiple regression model using
statistical notation looks like...

ŷ = b₀ + b₁x₁ + b₂x₂ + ... + bₖxₖ

where k is the number of independent
variables.

The multiple regression model for our
data is:

height = 25.028 + 0.24020(weight) + 0.11493(parenth)

We get the coefficient values from the
Minitab output:

Predictor     Coef      Stdev     t-ratio   p
Constant      25.028    4.326     5.79      0.000
weight        0.24020   0.03140   7.65      0.000
parenth       0.11493   0.09035   1.27      0.227
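For comparison with the Minitab table, here is a sketch (ours, not the presenter's) of the same fit using NumPy's least-squares solver; the name parenth copies Minitab's label for the parents' average height:

    import numpy as np

    height  = np.array([60, 63, 65, 66, 67, 68, 68, 69, 70, 70, 71, 72, 72, 73, 75], float)
    weight  = np.array([120, 135, 130, 143, 137, 149, 144, 150, 156, 152, 154, 162, 169, 163, 168], float)
    parenth = np.array([59, 67, 62, 59, 71, 66, 71, 67, 69, 73, 69, 75, 72, 69, 73], float)

    # Design matrix with an intercept column:
    # height = b0 + b1*weight + b2*parenth
    X = np.column_stack([np.ones(len(height)), weight, parenth])
    b, *_ = np.linalg.lstsq(X, height, rcond=None)
    print(b)  # should land near Minitab's [25.028, 0.24020, 0.11493]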

Once the regression is fitted, we need
to know how well the model fits the
data.
First, we check and see if there is a
good overall fit.
Then, we test the significance of
each independent variable. You will
notice that this is the same way we
test for significance in a simple linear
regression.

The Overall Test
Hypotheses:

H₀: β₁ = β₂ = β₃ = ... = βₖ = 0

All independent variables are unimportant for
predicting y.

Hₐ: at least one βⱼ ≠ 0

At least one independent variable is useful
for predicting y.

What type of test should be used?
The distribution used is the F
distribution (named after R. A. Fisher).
The F-statistic is used with this
distribution.

[Picture: the F distribution]

How do you calculate the F-statistic?
It can easily be found in the Minitab output,
along with the p-value:

SOURCE       DF    SS       MS       F       p
Regression    2    205.31   102.65   75.62   0.000
Error        12     16.29     1.36
Total        14    221.60

Or you can calculate it by hand.

But, before you can calculate the
F-statistic, you need to be introduced to
some other terms.
Regression sum of squares
(regression SS) - the variation in Y
accounted for by the regression model
with respect to the mean model
Error sum of squares (error SS) - the
variation in Y not accounted for by the
regression model
Total sum of squares (total SS) - the
total variation in Y

Now that we understand these terms, we
need to know how to calculate them:

Regression SS = Σ (Ŷᵢ − Ȳ)²
Error SS      = Σ (Yᵢ − Ŷᵢ)²
Total SS      = Σ (Yᵢ − Ȳ)²

(each sum runs over i = 1, ..., n, where Ŷᵢ is the
fitted value for observation i and Ȳ is the mean of Y)

Total SS = Regression SS + Error SS
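As an illustration (again ours, in Python), the three sums of squares can be computed directly from the fitted values; with the data transcribed above, they should come out near Minitab's 205.31, 16.29, and 221.60:

    import numpy as np

    height  = np.array([60, 63, 65, 66, 67, 68, 68, 69, 70, 70, 71, 72, 72, 73, 75], float)
    weight  = np.array([120, 135, 130, 143, 137, 149, 144, 150, 156, 152, 154, 162, 169, 163, 168], float)
    parenth = np.array([59, 67, 62, 59, 71, 66, 71, 67, 69, 73, 69, 75, 72, 69, 73], float)

    X = np.column_stack([np.ones(len(height)), weight, parenth])
    b, *_ = np.linalg.lstsq(X, height, rcond=None)

    y_hat = X @ b              # fitted values Ŷᵢ
    y_bar = height.mean()      # mean Ȳ

    reg_ss   = ((y_hat - y_bar) ** 2).sum()    # variation explained by the model
    error_ss = ((height - y_hat) ** 2).sum()   # residual variation
    total_ss = ((height - y_bar) ** 2).sum()   # total variation in Y
    print(reg_ss, error_ss, total_ss)          # total_ss = reg_ss + error_ss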

There are also the regression mean
square, error mean square, and total
mean square (abbreviated MS).
To calculate these terms, you divide each
sum of squares by its respective degrees
of freedom:
Regression d.f. = k
Error d.f. = n−k−1
Total d.f. = n−1
where k is the number of independent variables
and n is the total number of observations used
to calculate the regression.

So:

Regression MS = Regression SS / k       = Σ (Ŷᵢ − Ȳ)² / k
Error MS      = Error SS / (n−k−1)      = Σ (Yᵢ − Ŷᵢ)² / (n−k−1)
Total MS      = Total SS / (n−1)        = Σ (Yᵢ − Ȳ)² / (n−1)

For our data, Regression MS = 205.31 / 2 = 102.65 and
Error MS = 16.29 / 12 ≈ 1.36, matching the Minitab output.
Note that, unlike the sums of squares, the mean squares are
not additive: in general, Regression MS + Error MS ≠ Total MS.

Both sum of squares and mean square
values can be found in Minitab:

SOURCE       DF    SS       MS       F       p
Regression    2    205.31   102.65   75.62   0.000
Error        12     16.29     1.36
Total        14    221.60
Now we can calculate the F-statistic.

Test Statistic and Distribution

Test statistic:

F = model mean square / error mean square
  = 102.65 / 1.36
  = 75.48

which is very close to the F-statistic from
Minitab (75.62); the small difference is just
rounding in the mean squares.
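The same arithmetic, plus the upper-tail p-value, can be sketched in Python (SciPy is assumed to be available for the F distribution; the variable names are ours):

    from scipy.stats import f

    k, n = 2, 15                       # independent variables, observations
    reg_ms   = 205.31 / k              # regression SS / regression d.f. = 102.655
    error_ms = 16.29 / (n - k - 1)     # error SS / error d.f. ≈ 1.3575

    F = reg_ms / error_ms              # ≈ 75.62, matching Minitab
    p = f.sf(F, k, n - k - 1)          # upper-tail area of F(2, 12), ≈ 0
    print(F, p)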

The p-value for the F-statistic is then
found in an F-distribution table. As you
saw before, it can also be easily
calculated by software.
A small p-value rejects the null
hypothesis that none of the independent
variables is significant. That is to say,
at least one of the independent
variables is significant.

The conclusion in the context of our
data is:
We have strong evidence (p ≈ 0) to
reject the null hypothesis. That is to
say, either someone's weight or his
parents' average height (or both) is
significant in predicting his height.
Once you know that at least one
independent variable is significant, you
can go on to test each independent
variable separately.

Testing Individual Terms
If an independent variable does not contribute
significantly to predicting the value of Y, the
coefficient of that variable will be 0.
The test of these hypotheses determines
whether the estimated coefficient is significantly
different from 0.
From this, we can tell whether an independent
variable is important for predicting the dependent
variable.

Test for Individual Terms:

H₀: βⱼ = 0

The independent variable, xⱼ, is not important
for predicting y.

Hₐ: βⱼ ≠ 0 (or βⱼ > 0, or βⱼ < 0)

The independent variable, xⱼ, is important for
predicting y.

where j identifies a specified independent variable

Test Statistic:

t = bⱼ / SE(bⱼ)    (the coefficient estimate divided by its standard error)

d.f. = n−k−1

Remember, this test is only to be
performed if the overall model test
is significant.
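For illustration, the individual t tests can be sketched in Python straight from the Coef and Stdev columns of the Minitab output (SciPy assumed; the snippet is ours):

    from scipy.stats import t as t_dist

    n, k = 15, 2
    df = n - k - 1                            # 12 degrees of freedom

    for name, b, se in [("weight", 0.24020, 0.03140),
                        ("parenth", 0.11493, 0.09035)]:
        t_stat = b / se                       # t = coefficient / standard error
        p = 2 * t_dist.sf(abs(t_stat), df)    # two-sided p-value
        print(f"{name}: t = {t_stat:.2f}, p = {p:.3f}")
        # reproduces Minitab's t-ratios and p-values:
        # weight:  t = 7.65, p = 0.000
        # parenth: t = 1.27, p = 0.227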

T-distribution

[Picture: the t distribution]

Tests of individual terms for
significance are the same as tests of
significance in simple linear regression.

A small p-value means that the independent
variable is significant.

Predictor     Coef      Stdev     t-ratio   p
Constant      25.028    4.326     5.79      0.000
weight        0.24020   0.03140   7.65      0.000
parenth       0.11493   0.09035   1.27      0.227

This test of significance shows that
weight is a significant independent
variable for predicting height, but
average parent height is not (p = 0.227).

Now that you know how to do tests of
significance for multiple regression,
there are many other things that you
can learn, such as:
- How to create confidence intervals
- How to use categorical variables in
multiple regression
- How to test for significance in groups
of independent variables
