
Multiple Regression and Model Building

Copyright © 2015 McGraw-Hill Education. All rights reserved. No reproduction or distribution without the prior written consent of McGraw-Hill Education.

14.1 The Multiple Regression Model and the Least Squares Point Estimate

14.2 Model Assumptions and the Standard Error

14.3 R² and Adjusted R²

14.4 The Overall F Test

14.5 Testing the Significance of an Independent Variable

14.6 Confidence and Prediction Intervals

14-2

Multiple Regression and Model Building Continued

14.7 The Sales Territory Performance Case: Evaluating Employee Performance

14.8 Using Dummy Variables to Model Qualitative Independent Variables

14.9 Using Squared and Interaction Variables

14.10 Model Building and the Effects of Multicollinearity

14.11 Residual Analysis in Multiple Regression

14.12 Logistic Regression

14-3

LO 14-1: Explain the multiple regression model and the related least squares point estimates.

14.1 The Multiple Regression Model and the Least Squares Point Estimate

Simple linear regression used one independent variable to explain the dependent variable

Some relationships are too complex to be described using a single independent variable

Multiple regression uses two or more independent variables to describe the dependent variable

This allows multiple regression models to handle more complex situations

There is no limit to the number of independent variables a model can use

Multiple regression has only one dependent variable

14-4

LO14-1

The linear regression model relating y to x1, x2, …, xk is

y = β0 + β1x1 + β2x2 + … + βkxk + ε

μy = β0 + β1x1 + β2x2 + … + βkxk is the mean value of the dependent variable y when the values of the independent variables are x1, x2, …, xk

β0, β1, β2, …, βk are the unknown regression parameters relating the mean value of y to x1, x2, …, xk

ε is an error term that describes the effects on y of all factors other than the independent variables x1, x2, …, xk

14-5

LO14-1

The Least Squares Estimates and Point Estimation and Prediction

1. The estimation/prediction equation

ŷ = b0 + b1x1 + b2x2 + … + bkxk

is the point estimate of the mean value of the dependent variable when the values of the independent variables are x1, x2, …, xk

2. It is also the point prediction of an individual value of the dependent variable when the values of the independent variables are x1, x2, …, xk

3. b0, b1, b2, …, bk are the least squares point estimates of the parameters β0, β1, β2, …, βk

4. x1, x2, …, xk are specified values of the independent predictor variables

14-6
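As a concrete illustration (not from the textbook; the data values below are hypothetical), a minimal NumPy sketch of computing the least squares point estimates and using them for a point prediction:

import numpy as np

# Hypothetical data: n = 5 observations, k = 2 independent variables
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = np.array([4.1, 5.0, 9.2, 9.9, 13.1])

# Design matrix with a leading column of 1s for the intercept b0
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares point estimates b0, b1, b2
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# y-hat = b0 + b1*x1 + b2*x2 at the specified values x1 = 2.5, x2 = 3.0
y_hat = b @ np.array([1.0, 2.5, 3.0])
print(b, y_hat)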

LO14-1

EXAMPLE 14.1 The Tasty Sub Shop Case

LO 14-2: Explain the assumptions behind multiple regression and calculate the standard error.

14.2 Model Assumptions and the Standard Error

The model is

y = β0 + β1x1 + β2x2 + … + βkxk + ε

Four assumptions are stated about the model error terms, the ε's

14-8

LO14-2

14.2 Model Assumptions and the Standard Error Continued

1. Mean Zero Assumption

The mean of the error terms is equal to 0

2. Constant Variance Assumption

The variance of the error terms σ² is the same for every combination of values of x1, x2, …, xk

3. Normality Assumption

The error terms follow a normal distribution for every combination of values of x1, x2, …, xk

4. Independence Assumption

The values of the error terms are statistically independent of each other

14-9

LO14-2

Sum of Squares

Sum of squared errors:

SSE = Σei² = Σ(yi − ŷi)²

Mean square error: point estimate of the residual variance σ²:

s² = MSE = SSE / (n − (k + 1))

Standard error: point estimate of the residual standard deviation σ:

s = √MSE = √(SSE / (n − (k + 1)))

14-10
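A short sketch of these quantities in NumPy, reusing the hypothetical data from the earlier sketch:

import numpy as np

# Hypothetical data: n = 5 observations, k = 2 independent variables
X = np.array([[1, 1.0, 2.0], [1, 2.0, 1.0], [1, 3.0, 4.0],
              [1, 4.0, 3.0], [1, 5.0, 5.0]])  # first column is the intercept
y = np.array([4.1, 5.0, 9.2, 9.9, 13.1])
n, k = X.shape[0], X.shape[1] - 1

b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b                      # residuals y_i - y-hat_i

SSE = np.sum(e ** 2)               # sum of squared errors
MSE = SSE / (n - (k + 1))          # s^2, the mean square error
s = np.sqrt(MSE)                   # the standard error
print(SSE, MSE, s)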

LO 14-3: Calculate and interpret the multiple and adjusted multiple coefficients of determination.

14.3 R² and Adjusted R²

1. Total variation is given by the formula Σ(yi − ȳ)²

2. Explained variation is given by the formula Σ(ŷi − ȳ)²

3. Unexplained variation is given by the formula Σ(yi − ŷi)²

4. Total variation is the sum of explained and unexplained variation

This section can be read anytime after reading Section 14.1

14-11

LO14-3

5. The multiple coefficient of determination is the ratio of explained variation to total variation

6. R² is the proportion of the total variation that is explained by the overall regression model

7. The multiple correlation coefficient R is the square root of R²

14-12

LO14-3

The multiple correlation coefficient R is just the square root of R²

With simple linear regression, r would take on the sign of b1

There are multiple bi's with multiple regression

For this reason, R is always positive

To interpret the direction of the relationship between the x's and y, you must look to the sign of the appropriate bi coefficient

14-13

LO14-3

The Adjusted R²

Adding an independent variable to multiple regression will raise R²

R² will rise slightly even if the new variable has no relationship to y

The adjusted R² corrects this tendency in R²

As a result, it gives a better estimate of the importance of the independent variables

14-14
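A minimal sketch of R² and adjusted R², assuming the same hypothetical data and fit as the earlier sketches (the adjusted R² formula here is the standard n and k correction):

import numpy as np

X = np.array([[1, 1.0, 2.0], [1, 2.0, 1.0], [1, 3.0, 4.0],
              [1, 4.0, 3.0], [1, 5.0, 5.0]])
y = np.array([4.1, 5.0, 9.2, 9.9, 13.1])
n, k = X.shape[0], X.shape[1] - 1

b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

total = np.sum((y - y.mean()) ** 2)          # total variation
explained = np.sum((y_hat - y.mean()) ** 2)  # explained variation
unexplained = np.sum((y - y_hat) ** 2)       # unexplained variation

R2 = explained / total
# Adjusted R^2 penalizes each added independent variable
R2_adj = 1 - (1 - R2) * (n - 1) / (n - (k + 1))
print(R2, R2_adj)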

LO 14-4: Test the significance of a multiple regression model by using an F test.

14.4 The Overall F Test

To test

H0: β1 = β2 = … = βk = 0 versus

Ha: At least one of β1, β2, …, βk ≠ 0

Test statistic:

F(model) = [(Explained variation) / k] / [(Unexplained variation) / (n − (k + 1))]

Reject H0 in favor of Ha if F(model) > Fα or p-value < α

F is based on k numerator and n − (k + 1) denominator degrees of freedom

14-15
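A sketch of the overall F test with SciPy, on the same hypothetical data; scipy.stats.f.sf supplies the p-value:

import numpy as np
from scipy import stats

X = np.array([[1, 1.0, 2.0], [1, 2.0, 1.0], [1, 3.0, 4.0],
              [1, 4.0, 3.0], [1, 5.0, 5.0]])
y = np.array([4.1, 5.0, 9.2, 9.9, 13.1])
n, k = X.shape[0], X.shape[1] - 1

b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b
explained = np.sum((y_hat - y.mean()) ** 2)
unexplained = np.sum((y - y_hat) ** 2)

# F(model) with k numerator and n - (k + 1) denominator df
F = (explained / k) / (unexplained / (n - (k + 1)))
p_value = stats.f.sf(F, k, n - (k + 1))
print(F, p_value)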

LO 14-5: Test the significance of a single independent variable.

14.5 Testing the Significance of an Independent Variable

A variable in a multiple regression model is not likely to be useful unless there is a significant relationship between it and y

To test significance, we use the null hypothesis H0: βj = 0

versus the alternative hypothesis Ha: βj ≠ 0

14-16

LO14-5

Testing Significance of an Independent Variable #2

14-17

LO14-5

Testing Significance of an Independent Variable #3

It is customary to test the significance of every independent variable

If we can reject H0: βj = 0 at α = 0.05, we have strong evidence that the independent variable xj is significantly related to y

If we can reject H0: βj = 0 at α = 0.01, we have very strong evidence that the independent variable xj is significantly related to y

The smaller the significance level α at which H0 can be rejected, the stronger the evidence that xj is significantly related to y

14-18

LO14-5

A Confidence Interval for the Regression Parameter βj

If the regression assumptions hold, a 100(1 − α)% confidence interval for βj is

[bj ± tα/2 sbj]

tα/2 is based on n − (k + 1) degrees of freedom

14-19
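A sketch of the t test and confidence interval for each βj on the same hypothetical data. The standard errors sbj are taken from the diagonal of s²(XᵀX)⁻¹, the usual least squares result:

import numpy as np
from scipy import stats

X = np.array([[1, 1.0, 2.0], [1, 2.0, 1.0], [1, 3.0, 4.0],
              [1, 4.0, 3.0], [1, 5.0, 5.0]])
y = np.array([4.1, 5.0, 9.2, 9.9, 13.1])
n, k = X.shape[0], X.shape[1] - 1

b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b
s = np.sqrt(np.sum(e ** 2) / (n - (k + 1)))          # standard error

s_b = s * np.sqrt(np.diag(np.linalg.inv(X.T @ X)))   # s_b0, s_b1, ..., s_bk
t = b / s_b                                # t statistics for H0: beta_j = 0
p = 2 * stats.t.sf(np.abs(t), n - (k + 1))           # two-sided p-values

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, n - (k + 1))
ci = np.column_stack([b - t_crit * s_b, b + t_crit * s_b])  # 95% CIs
print(t, p, ci)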

LO 14-6: Find and interpret a confidence interval for a mean value and a prediction interval for an individual value.

14.6 Confidence and Prediction Intervals

The point on the regression line corresponding to a particular set of values x1, x2, …, xk of the independent variables is

ŷ = b0 + b1x1 + b2x2 + … + bkxk

It is unlikely that this value will equal the mean value of y for these x values

Therefore, we need to place bounds on how far away the predicted value might be

We can do this by calculating a confidence interval for the mean value of y and a prediction interval for an individual value of y

14-20

LO14-6

Distance Value

Both the confidence interval for the mean value of y and the prediction interval for an individual value of y employ a quantity called the distance value

With simple regression, we were able to calculate the distance value fairly easily

However, for multiple regression, calculating the distance value requires matrix algebra

14-21

LO14-6

A Confidence Interval for a Mean Value of y

Assume the regression assumptions hold

Confidence interval: [ŷ ± tα/2 s√(distance value)]

Prediction interval: [ŷ ± tα/2 s√(1 + distance value)]

14-22
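A sketch of both intervals, assuming the matrix-algebra form of the distance value, x0ᵀ(XᵀX)⁻¹x0 for a point x0 of interest; the data are the same hypothetical values as before:

import numpy as np
from scipy import stats

X = np.array([[1, 1.0, 2.0], [1, 2.0, 1.0], [1, 3.0, 4.0],
              [1, 4.0, 3.0], [1, 5.0, 5.0]])
y = np.array([4.1, 5.0, 9.2, 9.9, 13.1])
n, k = X.shape[0], X.shape[1] - 1

b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b
s = np.sqrt(np.sum(e ** 2) / (n - (k + 1)))

x0 = np.array([1.0, 2.5, 3.0])             # intercept term, x1 = 2.5, x2 = 3.0
dist = x0 @ np.linalg.inv(X.T @ X) @ x0    # the distance value
y_hat0 = x0 @ b

t_crit = stats.t.ppf(0.975, n - (k + 1))
ci = (y_hat0 - t_crit * s * np.sqrt(dist),
      y_hat0 + t_crit * s * np.sqrt(dist))        # CI for the mean value of y
pi = (y_hat0 - t_crit * s * np.sqrt(1 + dist),
      y_hat0 + t_crit * s * np.sqrt(1 + dist))    # PI for an individual y
print(ci, pi)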

14.7 The Sales Territory Performance Case: Evaluating Employee Performance

yi = Yearly sales of the company's product

x1 = Number of months the representative has been employed

x2 = Sales of products in the sales territory

x3 = Dollar advertising expenditure in the territory

x4 = Weighted average of the company's market share in the territory for the previous four years

x5 = Change in the company's market share in the territory over the previous four years

14-23

Partial Excel Output of a Regression Analysis of the Sales Territory Performance Data

LO 14-7: Use dummy variables to model qualitative independent variables.

14.8 Using Dummy Variables to Model Qualitative Independent Variables

So far, we have only looked at including quantitative data in a regression model

However, we may wish to include descriptive qualitative data as well

For example, we might want to include the gender of respondents

We can model the effects of different levels of a qualitative variable by using what are called dummy variables

Also known as indicator variables

14-25

LO14-7

How to Construct Dummy Variables

A dummy variable always has a value of either 0 or 1

For example, to model sales at two locations, we would code the first location as a 0 and the second as a 1

Operationally, it does not matter which is coded 0 and which is coded 1

14-26

LO14-7

What If We Have More Than Two Categories?

Consider having three categories, say A, B, and C

We cannot code this using one dummy variable

A = 0, B = 1, and C = 2 would be invalid

It assumes the difference between A and B is the same as the difference between B and C

We must use multiple dummy variables

Specifically, k categories require k − 1 dummy variables

14-27

LO14-7

For A, B, and C, we would need two dummy variables

x1 is 1 for A, zero otherwise

x2 is 1 for B, zero otherwise

If x1 and x2 are both zero, the category must be C

This is why a third dummy variable is not needed

14-28
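A small pandas sketch of this coding, with hypothetical category labels and C as the baseline category:

import pandas as pd

# Hypothetical qualitative variable with three categories
category = pd.Series(["A", "B", "C", "A", "C", "B"])

x1 = (category == "A").astype(int)  # 1 for A, 0 otherwise
x2 = (category == "B").astype(int)  # 1 for B, 0 otherwise
# x1 = x2 = 0 identifies C, so no third dummy variable is needed
print(pd.DataFrame({"category": category, "x1": x1, "x2": x2}))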

LO14-7

Interaction Models

So far, we have only considered dummy variables as stand-alone variables

The model so far is y = β0 + β1x + β2D + ε

where D is the dummy variable

However, we can also look at the interaction between a dummy variable and other variables

That model would take the form y = β0 + β1x + β2D + β3xD + ε

With an interaction term, both the intercept and slope are shifted

14-29
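A minimal sketch of fitting the interaction model y = β0 + β1x + β2D + β3xD + ε on hypothetical data; the xD column is simply the elementwise product:

import numpy as np

# Hypothetical data: quantitative x and a 0/1 dummy variable D
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
D = np.array([0, 0, 0, 1, 1, 1])
y = np.array([2.1, 3.9, 6.2, 9.8, 12.1, 14.0])

# Columns: intercept, x, D, and the interaction x*D
# (the x*D column lets the slope, not just the intercept, differ by group)
X = np.column_stack([np.ones_like(x), x, D, x * D])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)   # b0, b1, b2, b3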

LO 14-8: Use squared and interaction variables.

14.9 Using Squared and Interaction Variables

The quadratic regression model is

y = β0 + β1x + β2x² + ε

where

1. β0 + β1x + β2x² is μy

2. β0, β1, and β2 are the regression parameters

3. ε is an error term

14-30

LO14-8

Regression models often contain interaction variables

Formed by multiplying two independent variables together

Consider a model where x3 and x4 interact and x3 is used as a quadratic

14-31
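As one possible reading of that model (the slide does not spell out the full equation), a sketch of the design matrix with x3, x3², x4, and the x3x4 interaction, on hypothetical data:

import numpy as np

# Hypothetical values of x3 and x4
x3 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
x4 = np.array([2.0, 1.0, 4.0, 3.0, 5.0, 2.0, 6.0])
y = np.array([3.2, 4.1, 11.8, 13.0, 24.5, 21.7, 39.4])

# Columns: intercept, x3, x3 squared (quadratic), x4, x3*x4 (interaction)
X = np.column_stack([np.ones_like(x3), x3, x3 ** 2, x4, x3 * x4])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)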

LO 14-9: Describe multicollinearity and build a multiple regression model.

14.10 Model Building and the Effects of Multicollinearity

Multicollinearity: when "independent" variables are related to one another

Considered severe when the simple correlation exceeds 0.9

Even moderate multicollinearity can be a problem

Another measurement is the variance inflation factor:

VIFj = 1 / (1 − Rj²)

Multicollinearity is a problem when VIF > 10

Moderate problem for VIF > 5

14-32
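A sketch of computing VIFs directly from the definition, where Rj² comes from regressing xj on the remaining independent variables; the predictors here are hypothetical and deliberately correlated:

import numpy as np

def vifs(X):
    # X holds only the independent variables, no intercept column.
    # VIF_j = 1 / (1 - R_j^2) with R_j^2 from regressing column j on the rest.
    n, k = X.shape
    out = []
    for j in range(k):
        xj = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b, *_ = np.linalg.lstsq(Z, xj, rcond=None)
        resid = xj - Z @ b
        r2 = 1 - resid @ resid / np.sum((xj - xj.mean()) ** 2)
        out.append(1 / (1 - r2))
    return np.array(out)

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = 2 * x1 + np.array([0.1, -0.2, 0.3, -0.1, 0.2, -0.3])  # nearly collinear with x1
x3 = np.array([5.0, 3.0, 6.0, 2.0, 7.0, 4.0])
print(vifs(np.column_stack([x1, x2, x3])))  # expect large VIFs for x1 and x2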

LO14-9

Effect of Adding an Independent Variable

Adding any independent variable will increase R²

Even adding an unimportant independent variable

Thus, R² cannot tell us that adding an independent variable is undesirable

14-33

LO14-9

A Better Criterion Is the Standard Error

A better criterion is the size of the standard error s

If s increases when an independent variable is added, we should not add that variable

However, decreasing s alone is not enough

An independent variable should only be included if it reduces s enough to offset the higher t value and reduce the length of the desired prediction interval for y

s = √(SSE / (n − (k + 1)))

14-34

LO14-9

C Statistic

Another quantity for comparing regression models is called the C (a.k.a. Cp) statistic

First, calculate the mean square error for the model containing all p potential independent variables (s²p)

Next, calculate the SSE for a reduced model with k independent variables

C = SSE / s²p − [n − 2(k + 1)]

14-35
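A small sketch of the C statistic exactly as defined above; the inputs in the example call are hypothetical numbers, not from the textbook case:

def c_statistic(sse_k, s2_p, n, k):
    # C = SSE / s_p^2 - [n - 2(k + 1)] for a reduced model with k
    # independent variables; s2_p is the mean square error of the model
    # containing all p potential independent variables.
    return sse_k / s2_p - (n - 2 * (k + 1))

# Hypothetical: n = 25 observations, full-model s_p^2 = 3.0,
# reduced model with k = 2 variables and SSE = 80.0
print(c_statistic(80.0, 3.0, 25, 2))  # compare the result with k + 1 = 3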

LO14-9

C Statistic Continued

We want the value of C to be small

Adding unimportant independent variables will raise the value of C

While we want C to be small, we also wish to find a model for which C roughly equals k + 1

A model with C substantially greater than k + 1 has substantial bias and is undesirable

If a model has a small value of C and C for this model is less than k + 1, then it is not biased and the model should be considered desirable

14-36

LO14-9

The Partial F Test: An F Test for a Portion of a Regression Model

To test

H0: All of the βj coefficients corresponding to the independent variables in the subset are zero

Ha: At least one of the βj coefficients is not equal to zero

F(partial) = [(SSE_R − SSE_C) / (k − g)] / [SSE_C / (n − (k + 1))]

where SSE_R and SSE_C are the SSEs of the reduced and complete models, the complete model has k independent variables, and the reduced model has g

Reject H0 if F(partial) > Fα or p-value < α

F is based on k − g numerator and n − (k + 1) denominator degrees of freedom

14-37
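A sketch of the partial F test as a function of the two SSEs; the numbers in the example call are hypothetical:

from scipy import stats

def partial_f(sse_r, sse_c, n, k, g):
    # Complete model has k independent variables, reduced model has g;
    # the k - g dropped coefficients are tested jointly.
    f = ((sse_r - sse_c) / (k - g)) / (sse_c / (n - (k + 1)))
    p = stats.f.sf(f, k - g, n - (k + 1))
    return f, p

# Hypothetical: n = 30, complete model with k = 5, reduced model with g = 3
print(partial_f(sse_r=120.0, sse_c=90.0, n=30, k=5, g=3))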

LO 14-10: Use residual analysis to check the assumptions of multiple regression.

14.11 Residual Analysis in Multiple Regression

For an observed value of yi, the residual is

ei = yi − ŷi = yi − (b0 + b1xi1 + … + bkxik)

If the regression assumptions hold, the residuals should look like a random sample from a normal distribution with mean 0 and variance σ²

Residual plots:

Residuals versus each independent variable

Residuals versus the predicted y's

Residuals in time order (if the response is a time series)

14-38
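A matplotlib sketch of these basic residual plots, reusing the hypothetical fit from the earlier sketches:

import numpy as np
import matplotlib.pyplot as plt

X = np.array([[1, 1.0, 2.0], [1, 2.0, 1.0], [1, 3.0, 4.0],
              [1, 4.0, 3.0], [1, 5.0, 5.0]])
y = np.array([4.1, 5.0, 9.2, 9.9, 13.1])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b
e = y - y_hat                      # residuals e_i = y_i - y-hat_i

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, values, label in [(axes[0], X[:, 1], "x1"),
                          (axes[1], X[:, 2], "x2"),
                          (axes[2], y_hat, "predicted y")]:
    ax.scatter(values, e)
    ax.axhline(0.0, linestyle="--")  # residuals should straddle zero
    ax.set_xlabel(label)
    ax.set_ylabel("residual")
plt.tight_layout()
plt.show()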

LO14-10

Residual Plots for the Sales Territory Performance Model
