Entering Multidimensional Space: Multiple Regression

Peter T. Donnan

Professor of Epidemiology and Biostatistics

Objectives of session

Understand methods of selecting variables
Understand strengths and weaknesses of selection methods
Carry out Multiple Regression in SPSS and interpret the output

regression?

Research is rarely as simple as the effect of one variable on one outcome, especially with observational data
Need to assess many factors simultaneously; this gives more realistic models

[Figure: fitted regression plane $y = a + b_1 x_1 + b_2 x_2$; axes: Dependent (y), Explanatory (x1), Explanatory (x2)]
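As a rough illustration of the plane y = a + b1x1 + b2x2, the sketch below fits the three coefficients by least squares. The data are simulated and the function name `fit_two_predictor_model` is hypothetical; it is not part of the SPSS workflow used in this session.

```python
import numpy as np

def fit_two_predictor_model(x1, x2, y):
    """Least-squares estimates (a, b1, b2) for y = a + b1*x1 + b2*x2."""
    # Design matrix: a column of ones for the intercept, then the two predictors.
    design = np.column_stack([np.ones(len(y)), x1, x2])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# Simulated data (assumption): true plane is y = 1 + 2*x1 - 0.5*x2 plus small noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=200)

a, b1, b2 = fit_two_predictor_model(x1, x2, y)
print(f"a={a:.3f}, b1={b1:.3f}, b2={b2:.3f}")
```

Each slope is the change in y per unit change in that explanatory variable with the other held fixed.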

SPSS of Min LDL in relation to baseline LDL and age

regression modelling (1)

Assess relationship between two variables while adjusting or allowing for another variable
Sometimes the second variable is considered a nuisance factor
Example: Physical Activity allowing for age and medications

regression modelling (2)

In an RCT, whenever there is imbalance between the arms of the trial at baseline in characteristics of subjects
e.g. survival in colorectal cancer on two different randomised therapies, adjusted for age, gender, stage, and co-morbidity

regression modelling (2)

A special case of this is when adjusting for baseline level of the primary outcome in an RCT
Baseline level added as a factor in regression model
This will be covered in Trials part of the course

regression modelling (3)

With observational data, in order to produce a prognostic equation for future prediction of risk of mortality
e.g. predicting future risk of CHD used 10-year data from the Framingham cohort

regression modelling (4)

With observational data in order to adjust for possible confounders
those with hypertension adjusted for age, gender, social deprivation and co-morbidity

Definition of Confounding

A confounder is a factor which is related to both the variable of interest (explanatory) and the outcome, but is not an intermediary in a causal pathway
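The definition can be illustrated with a small simulation. In this sketch (simulated data, hypothetical variable names) the exposure has no direct effect on the outcome, yet the crude regression slope is clearly non-zero because a third factor drives both. Adjusting for that factor brings the slope back towards zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
confounder = rng.normal(size=n)                      # related to exposure AND outcome
exposure = 0.8 * confounder + rng.normal(size=n)
outcome = 1.0 * confounder + rng.normal(size=n)      # exposure has no direct effect

def slopes(y, *columns):
    """Least-squares slopes (intercept dropped) of y on the given columns."""
    design = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1:]

crude = slopes(outcome, exposure)[0]                  # exposure alone
adjusted = slopes(outcome, exposure, confounder)[0]   # exposure adjusted for confounder
print(f"crude slope={crude:.3f}, adjusted slope={adjusted:.3f}")
```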

Example of Confounding

[Diagram: Deprivation, Smoking, Lung Cancer]

factors only related to outcome

[Diagram: Deprivation, Exercise, Lung Cancer]

factor in a causal pathway

[Diagram: Exercise → Blood viscosity → Stroke]

merely a marker of the other factors, i.e. correlated (collinearity)
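Collinearity is usually quantified with the Tolerance and VIF statistics that appear in the SPSS coefficient tables: for each predictor, Tolerance = 1 - R² from regressing that predictor on the others, and VIF = 1 / Tolerance. A minimal sketch with simulated data follows; the function name `vif` and the variables are illustrative assumptions.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (rows = subjects)."""
    X = np.asarray(X, dtype=float)
    result = []
    for j in range(X.shape[1]):
        target = X[:, j]
        # Regress predictor j on an intercept plus all the other predictors.
        others = np.column_stack([np.ones(len(target)), np.delete(X, j, axis=1)])
        fitted = others @ np.linalg.lstsq(others, target, rcond=None)[0]
        r2 = 1.0 - np.sum((target - fitted) ** 2) / np.sum((target - target.mean()) ** 2)
        result.append(1.0 / (1.0 - r2))
    return np.array(result)

rng = np.random.default_rng(2)
x1 = rng.normal(size=300)
x2 = x1 + 0.1 * rng.normal(size=300)   # almost a copy of x1: collinear
x3 = rng.normal(size=300)              # unrelated to the others
factors = vif(np.column_stack([x1, x2, x3]))
print(factors.round(1))
```

With a single predictor there is nothing to be collinear with, so VIF is 1, which matches the Tolerance and VIF of 1.000 in a one-variable model.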

age in the independent box in linear regression

Output from SPSS linear regression on Age at baseline

Coefficients

| Model 1 | Unstandardized B | Std. Error | Standardized Beta | t | Sig. | Lower Bound | Upper Bound | Tolerance | VIF |
|---|---|---|---|---|---|---|---|---|---|
| (Constant) | 2.024 | .105 | | 19.340 | .000 | 1.819 | 2.229 | | |
| Age at baseline | -.008 | .002 | -.121 | -4.546 | .000 | -.011 | -.004 | 1.000 | 1.000 |

Output from SPSS linear regression on Baseline LDL

Coefficients

| Model 1 | Unstandardized B | Std. Error | Standardized Beta | t | Sig. | Lower Bound | Upper Bound |
|---|---|---|---|---|---|---|---|
| (Constant) | .668 | .066 | | 10.091 | .000 | .538 | .798 |
| Baseline LDL | .257 | .018 | .351 | 13.950 | .000 | .221 | .293 |

Model Summary
R2 now improved to 13%

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .360 | .130 | .129 | .6753538 |

Coefficients

| Model 1 | Unstandardized B | Std. Error | Standardized Beta | t | Sig. | Lower Bound | Upper Bound |
|---|---|---|---|---|---|---|---|
| (Constant) | 1.003 | .124 | | 8.086 | .000 | .760 | 1.246 |
| Baseline LDL | .250 | .019 | .342 | 13.516 | .000 | .214 | .286 |
| Age at baseline | -.005 | .002 | -.081 | -3.187 | .001 | -.008 | -.002 |

INDEPENDENTLY of each other
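The unstandardized B coefficients of the two-variable model define a prediction equation. The sketch below assumes the dependent variable is Min LDL achieved, as in the earlier slides. The function name and the example patient (baseline LDL 4.0, age 60) are hypothetical.

```python
def predicted_min_ldl(baseline_ldl, age):
    """Prediction from the B column: constant + B_ldl * baseline LDL + B_age * age."""
    return 1.003 + 0.250 * baseline_ldl - 0.005 * age

# Hypothetical patient: baseline LDL of 4.0 and age 60 years.
example = predicted_min_ldl(baseline_ldl=4.0, age=60)
print(round(example, 3))  # 1.003 + 1.000 - 0.300 = 1.703
```

Holding age fixed, each unit increase in baseline LDL raises the prediction by 0.250. Holding baseline LDL fixed, each extra year of age lowers it by 0.005.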

variables to enter the model?

Usually consider what hypotheses you are testing
If there is a main exposure variable, enter it first and assess confounders one at a time
For derivation of a clinical prediction rule (CPR) you want powerful predictors
Also clinically important factors, e.g. cholesterol in CHD prediction
Significance is important, but it is acceptable to have an important variable without statistical significance

enter in model?

Correlations? With great difficulty!

SPSS of Time from Surgery in relation to Dukes staging and age

1. Let Scientific or Clinical factors guide selection
2. Use automatic selection algorithms
3. A mixture of above

factors guide selection

Baseline LDL cholesterol is an important factor determining LDL outcome so enter first

Next allow for age and gender

Add adherence as important?

Add BMI and smoking?

factors guide selection

Results in model of:

1. Baseline LDL
2. Age and gender
3. Adherence
4. BMI and smoking

Is this a good model?

guide selection: Final Model

Note three variables entered but not statistically significant

guide selection

Is this the best model?

Should I leave out the non-significant factors (Model 2)?

| Model | Adj R2 | F from ANOVA | No. of Parameters p |
|---|---|---|---|
| 1 | 0.137 | 37.48 | |
| 2 | 0.134 | 72.021 | |

parameters is less in 2nd model. Is this better?

Kullback-Leibler Information
Kullback and Leibler (1951) quantified the meaning of information related to Fisher's sufficient statistics
Basically we have reality f
And a model g to approximate f
So K-L information is I(f,g)

Kullback-Leibler Information
I(f,g) is the information lost, or distance, between reality and a model, so to obtain the best model over other models we need to minimise:

$$I(f, g) = \int f(x) \log\left(\frac{f(x)}{g(x)}\right) dx$$
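For intuition, I(f,g) can be computed directly when f and g are discrete distributions, where the integral becomes a sum. The sketch below uses made-up probabilities: "reality" and the two candidate models are illustrative assumptions, and in practice f is unknown.

```python
import numpy as np

def kl_information(f, g):
    """Discrete analogue of I(f, g): sum over x of f(x) * log(f(x) / g(x))."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    positive = f > 0                      # terms with f(x) = 0 contribute nothing
    return float(np.sum(f[positive] * np.log(f[positive] / g[positive])))

reality = [0.5, 0.3, 0.2]        # f: the true distribution (assumed known here)
good_model = [0.45, 0.35, 0.20]  # g close to f
poor_model = [0.10, 0.10, 0.80]  # g far from f

loss_good = kl_information(reality, good_model)
loss_poor = kl_information(reality, poor_model)
print(f"good model: {loss_good:.4f}, poor model: {loss_poor:.4f}")
```

The model that loses less information has the smaller I(f,g), and a model identical to reality loses none.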

Akaike's Information Criterion
It turns out that the function I(f,g) is related to a very simple measure of goodness-of-fit: Akaike's Information Criterion or AIC

Selection Criteria
With a large number of factors the type 1 error is large, so likely to end up with a model with many variables
Two standard criteria:
1) Akaike's Information Criterion (AIC)
2) Schwarz's Bayesian Information Criterion (BIC)
Both penalise models with a large number of variables; the BIC penalty also increases with sample size

Akaike's Information Criterion

$$\mathrm{AIC} = -2 \times \text{log likelihood} + 2p$$

where p = number of parameters and -2*log likelihood is in the output
Hence AIC penalises models with a large number of variables
Select model that minimises (-2LL+2p)
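To show where -2LL + 2p comes from in a linear model, the sketch below computes the Gaussian log likelihood from the residual sum of squares and then the AIC. It is an illustration on simulated data, not the SPSS procedure. Counting the error variance as one of the p parameters is an assumption, since software conventions differ. The function name `gaussian_aic` is hypothetical.

```python
import numpy as np

def gaussian_aic(predictors, y):
    """Return (log likelihood, AIC) for a least-squares linear model.

    predictors: list of 1-D arrays; an intercept is always included.
    """
    n = len(y)
    design = np.column_stack([np.ones(n), *predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    sigma2 = rss / n                                   # ML estimate of error variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
    p = design.shape[1] + 1                            # coefficients + error variance
    return loglik, -2.0 * loglik + 2 * p

rng = np.random.default_rng(3)
x = rng.normal(size=300)
unrelated = rng.normal(size=300)
y = 1.0 + 0.8 * x + rng.normal(size=300)

ll_x, aic_x = gaussian_aic([x], y)                # model with the real predictor
ll_u, aic_u = gaussian_aic([unrelated], y)        # model with an unrelated variable
print(f"AIC with x: {aic_x:.1f}, AIC with unrelated variable: {aic_u:.1f}")
```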

Unfortunately the standard REGRESSION in SPSS does not give these statistics
Need to use Analyze > Generalized Linear Models..
Default is linear
Add Min LDL achieved as dependent as in REGRESSION in SPSS
Next go to Predictors..

Predictors
WARNING! Add the predictors in the correct box
Categorical in FACTORS box
Continuous in COVARIATES box

Model
Add all factors and covariates in the model as main effects

Parameter Estimates

Note identical to REGRESSION output

Goodness-of-fit
Note output gives log likelihood and AIC = 2835
(AIC = -2 x -1409.6 + 2 x 7 = 2835)
Footnote explains smaller AIC is better

guide selection: Optimal model

The log likelihood is a measure of GOODNESS-OF-FIT
Seek optimal model that maximises the log likelihood or minimises the AIC

| Model | Log likelihood | AIC |
|---|---|---|
| 1 Full Model | -1409.6 | 2835.6 |
| 2 Non-significant variables removed | -1413.6 | 2837.2 |

Change is 1.6

factors guide selection

Key points:
1. Results demonstrate a significant association with baseline LDL, Age and Adherence
2. Difficult choices with Gender, smoking and BMI
3. AIC only changes by 1.6 when removed
4. Generally changes of 4 or more in AIC are considered important
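The comparison in these key points amounts to a one-line calculation. The sketch below uses the two AIC values reported for the full and reduced models and the rule of thumb of 4; the variable names are illustrative.

```python
aic_full = 2835.6      # Model 1: full model
aic_reduced = 2837.2   # Model 2: non-significant variables removed

# Difference in AIC between the two models, rounded to the reported precision.
delta = round(aic_reduced - aic_full, 1)

# Rule of thumb from the slide: a change of 4 or more is considered important.
important = abs(delta) >= 4
print(delta, important)  # 1.6 False
```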

factors guide selection

Key points:
1. Conclude little to choose between models
2. AIC actually lower with larger model and consider Gender and BMI important factors, so keep larger model but have to justify
3. Model building manual, logical, transparent and under your control

procedures

These are based on automatic mechanical algorithms usually related to statistical significance
Common ones are stepwise, forward or backward elimination
Can be selected in SPSS using Method in dialogue box

procedures (e.g Stepwise)

Select Method = Stepwise

procedures (e.g Stepwise)

1st step

2nd step

Final

Model

selection

Note: Only available from Generalized Linear Models

| Step | Model | Log Likelihood | AIC | Change in AIC | No. of Parameters p |
|---|---|---|---|---|---|
| 1 | Baseline LDL | -1423.1 | 2852.2 | | |
| 2 | +Adherence | -1418.0 | 2844.1 | 8.1 | |
| 3 | +Age | -1413.6 | 2837.2 | 6.9 | |
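The step-by-step table can be mimicked by a forward selection loop that adds, at each step, the variable giving the largest drop in AIC and stops when no addition lowers it. The sketch below is a generic illustration on simulated data, not the SPSS algorithm. The function names, the variables `a`, `b`, `c`, and the choice to count the error variance as a parameter are all assumptions.

```python
import numpy as np

def aic_linear(columns, y):
    """AIC of a Gaussian linear model with an intercept and the given columns."""
    n = len(y)
    design = np.column_stack([np.ones(n), *columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1.0)
    return -2.0 * loglik + 2 * (design.shape[1] + 1)

def forward_select(candidates, y):
    """candidates: dict of name -> 1-D array. Returns names in order of entry."""
    chosen = []
    best = aic_linear([], y)                      # start from intercept-only model
    while True:
        trials = {name: aic_linear([candidates[c] for c in chosen + [name]], y)
                  for name in candidates if name not in chosen}
        if not trials:
            break
        name = min(trials, key=trials.get)        # candidate with the lowest AIC
        if trials[name] >= best:                  # no improvement: stop
            break
        chosen.append(name)
        best = trials[name]
    return chosen

rng = np.random.default_rng(4)
data = {k: rng.normal(size=500) for k in ("a", "b", "c")}
y = 2.0 * data["a"] + 0.5 * data["b"] + rng.normal(size=500)   # c is pure noise
order = forward_select(data, y)
print(order)
```

Strong predictors enter first, but an unrelated variable can still enter by chance, which is one reason automatic procedures need scrutiny.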

2) Advantages and disadvantages of stepwise
Advantages
Simple to implement
Gives a parsimonious model
Selection is certainly objective
Disadvantages
Non-stable selection: stepwise considers many models that are very similar
P-value on entry may be smaller once procedure is finished, so exaggeration of p-value
Predictions in external dataset usually worse for stepwise procedures

2) Automatic procedures: Backward elimination
Backward starts by eliminating the least significant factor from the full model and has a few advantages over forward:
Modeller has to consider the full model and sees results for all factors simultaneously
Correlated factors can remain in the model (in forward methods they may not even enter)
Criteria for removal tend to be more lax in backward so end up with more parameters
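Backward elimination can be sketched as a loop that fits the full model, drops the least significant factor, and refits until every remaining factor passes the removal criterion. This is an illustration on simulated data, not the SPSS implementation. The threshold |t| < 1.645 (roughly a two-sided p of 0.10 in large samples) and all names are assumptions.

```python
import numpy as np

def t_statistics(design, y):
    """t statistic (coefficient / standard error) for each column of design."""
    n, k = design.shape
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    sigma2 = resid @ resid / (n - k)                  # residual variance
    cov = sigma2 * np.linalg.inv(design.T @ design)   # covariance of estimates
    return coef / np.sqrt(np.diag(cov))

def backward_eliminate(candidates, y, t_threshold=1.645):
    """Start from the full model; repeatedly drop the least significant factor."""
    kept = list(candidates)
    while kept:
        design = np.column_stack([np.ones(len(y))] + [candidates[c] for c in kept])
        t = np.abs(t_statistics(design, y))[1:]       # skip the intercept
        weakest = int(np.argmin(t))
        if t[weakest] >= t_threshold:                 # everything left is significant
            break
        kept.pop(weakest)
    return kept

rng = np.random.default_rng(5)
data = {k: rng.normal(size=500) for k in ("a", "b", "c")}
y = 2.0 * data["a"] + 0.5 * data["b"] + rng.normal(size=500)   # c is pure noise
kept = backward_eliminate(data, y)
print(kept)
```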

procedures (e.g Backward)

Select Method = Backward

2) Backward elimination in SPSS
1st step: Gender removed
2nd step: BMI removed
Final Model

Summary of automatic selection

model (may leave out important factors)

(forward vs. backward elimination)

stringent

3) A mixture of automatic procedures and self selection
guide
Think about what factors are important
Add important factors
Do not blindly follow statistical significance

Consider AIC

Summary of Model selection

Selection of factors for Multiple Linear

judgement

Check AIC or log likelihood for fit

Summary

Multiple regression models are the

quantitative research

Model assessment requires some thought

Entia non sunt multiplicanda praeter necessitatem
Entities must not be multiplied beyond necessity
William of Ockham
14th century Friar and logician
1288-1347

Summary

After fitting any model check assumptions
Functional form: linearity or not
Check residuals for normality
Check residuals for outliers

All accomplished within SPSS
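Outside SPSS, the same residual checks can be sketched in a few lines. The example below standardises the residuals, reports their skewness as a crude normality indicator, and flags points beyond 3 standard deviations. The data are simulated with one planted outlier. The cut-off of 3 and the function name are assumptions, not values from this session.

```python
import numpy as np

def residual_checks(design, y, outlier_z=3.0):
    """Return (skewness of residuals, indices of residuals with |z| > outlier_z)."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    z = (resid - resid.mean()) / resid.std(ddof=1)    # standardised residuals
    skewness = float(np.mean(z ** 3))                 # near 0 for symmetric residuals
    outliers = np.flatnonzero(np.abs(z) > outlier_z)
    return skewness, outliers

rng = np.random.default_rng(6)
x = rng.normal(size=200)
y = 1.0 + 0.5 * x + rng.normal(size=200)
y[10] += 10.0                                         # plant one gross outlier

design = np.column_stack([np.ones(len(x)), x])
skewness, outliers = residual_checks(design, y)
print(f"skewness={skewness:.2f}, outlier rows={outliers.tolist()}")
```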

See publications for further info

Donnelly LA, Palmer CNA, Whitley AL, Lang C, Doney ASF, Morris AD, Donnan PT. Apolipoprotein E genotypes are associated with lipid lowering response to statin treatment in diabetes: A GoDARTS study. Pharmacogenetics and Genomics, 2008; 18: 279-87.

Practical on Multiple Regression
Read in LDL Data.sav
LDL obtained using forward and backward elimination. Are the results the same? Add other factors than those considered in the presentation such as BMI, smoking. Remember the goal is to assess the association of APOE with LDL response.
2) Try fitting multiple regression models for Min Chol achieved. Is the model similar to that found for Min LDL?

