
# Statistics for Health Research

Entering Multidimensional Space: Multiple Regression

Peter T. Donnan
Professor of Epidemiology and Biostatistics

## Objectives of session

- Recognise the need for multiple regression
- Understand methods of selecting variables
- Understand the strengths and weaknesses of selection methods
- Carry out multiple regression in SPSS and interpret the output

## Why do we need multiple regression?

- Research is rarely as simple as the effect of one variable on one outcome, especially with observational data
- We need to assess many factors simultaneously, giving more realistic models

## Consider the fitted plane

y = a + b1x1 + b2x2

[3-D plot: fitted plane with axes Dependent (y), Explanatory (x1) and Explanatory (x2)]
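As an aside for readers working outside SPSS, the fitted plane above can be estimated by ordinary least squares in a few lines. This is an illustrative sketch on simulated data (the variable names echo the LDL example, but the numbers are invented):

```python
import numpy as np

# Simulate data for a plane y = a + b1*x1 + b2*x2 and recover the
# coefficients by ordinary least squares (np.linalg.lstsq).
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(4.0, 1.0, n)    # e.g. baseline LDL (invented values)
x2 = rng.normal(60.0, 10.0, n)  # e.g. age (invented values)
y = 1.0 + 0.25 * x1 - 0.005 * x2 + rng.normal(0, 0.1, n)

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coef
print(f"a={a:.3f}, b1={b1:.3f}, b2={b2:.4f}")
```

The estimates should land close to the true values (1.0, 0.25, -0.005) used in the simulation.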

## 3-dimensional scatterplot from SPSS of Min LDL in relation to baseline LDL and age

## When to use multiple regression modelling (1)

- Assess the relationship between two variables while adjusting, or allowing, for another variable
- Sometimes the second variable is considered a nuisance factor
- Example: physical activity, allowing for age and medications

## When to use multiple regression modelling (2)

- In an RCT, whenever there is imbalance between the arms of the trial in the baseline characteristics of subjects
- e.g. survival in colorectal cancer on two different randomised therapies, adjusted for age, gender, stage and co-morbidity

## When to use multiple regression modelling (2)

- A special case of this is adjusting for the baseline level of the primary outcome in an RCT
- The baseline level is added as a factor in the regression model
- This will be covered in the Trials part of the course

## When to use multiple regression modelling (3)

- With observational data, in order to produce a prognostic equation for future prediction of risk of mortality
- e.g. predicting future risk of CHD used 10-year data from the Framingham cohort

## When to use multiple regression modelling (4)

- With observational data, in order to adjust for possible confounders
- e.g. survival in colorectal cancer in those with hypertension, adjusted for age, gender, social deprivation and co-morbidity

## Definition of Confounding

A confounder is a factor which is related to both the variable of interest (explanatory) and the outcome, but is not an intermediary in a causal pathway.
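The definition above can be made concrete with a small simulation, sketched here on invented data: a confounder drives both the exposure and the outcome, so the crude estimate for the exposure is biased, while the adjusted estimate is close to the true (null) effect:

```python
import numpy as np

# Hypothetical simulation (not from the slides): the confounder c drives
# both the exposure x and the outcome y; x has NO direct effect on y.
rng = np.random.default_rng(1)
n = 5000
c = rng.normal(0, 1, n)             # confounder
x = 0.8 * c + rng.normal(0, 1, n)   # exposure, related to c
y = 1.2 * c + rng.normal(0, 1, n)   # outcome, related to c only

# Crude model y ~ x: the slope for x is confounded (biased away from 0)
crude, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), x]), y, rcond=None)

# Adjusted model y ~ x + c: the slope for x is now close to 0
adj, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), x, c]), y, rcond=None)

print(f"crude slope for x:    {crude[1]:.3f}")
print(f"adjusted slope for x: {adj[1]:.3f}")
```

The crude slope is substantially positive despite x having no direct effect, which is exactly the bias that adjustment removes.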

## Example of Confounding

[Diagram: Deprivation is related to both Smoking (exposure) and Lung Cancer (outcome)]

## But, also worth adjusting for factors only related to outcome

[Diagram: Deprivation is related to both exposure and outcome; Exercise is related only to the outcome, Lung Cancer]

## Not worth adjusting for intermediate factor in a causal pathway

[Diagram: Exercise → Blood viscosity → Stroke]

## In a causal pathway each factor is merely a marker of the other factors, i.e. they are correlated: collinearity

## SPSS: Add both baseline LDL and age in the independent box in linear regression

## Output from SPSS: linear regression on Age at baseline

Coefficients

| Model 1 | B | Std. Error | Beta | t | Sig. | 95% CI Lower | 95% CI Upper | Tolerance | VIF |
|---|---|---|---|---|---|---|---|---|---|
| (Constant) | 2.024 | .105 | | 19.340 | .000 | 1.819 | 2.229 | | |
| Age at baseline | -.008 | .002 | -.121 | -4.546 | .000 | -.011 | -.004 | 1.000 | 1.000 |

## Output from SPSS: linear regression on Baseline LDL

Coefficients

| Model 1 | B | Std. Error | Beta | t | Sig. | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| (Constant) | .668 | .066 | | 10.091 | .000 | .538 | .798 |
| Baseline LDL | .257 | .018 | .351 | 13.950 | .000 | .221 | .293 |

## Model Summary

R² now improved to 13%

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .360 | .130 | .129 | .6753538 |

Coefficients

| Model 1 | B | Std. Error | Beta | t | Sig. | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| (Constant) | 1.003 | .124 | | 8.086 | .000 | .760 | 1.246 |
| Baseline LDL | .250 | .019 | .342 | 13.516 | .000 | .214 | .286 |
| Age at baseline | -.005 | .002 | -.081 | -3.187 | .001 | -.008 | -.002 |

Both variables remain significant INDEPENDENTLY of each other.
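Reading the coefficients off the output, the fitted equation is Min LDL = 1.003 + 0.250 × baseline LDL − 0.005 × age. A quick sketch of a prediction from it (the patient values are illustrative, not from the slides):

```python
# Fitted model taken from the SPSS coefficients above:
#   Min LDL = 1.003 + 0.250 * baseline_LDL - 0.005 * age
def predict_min_ldl(baseline_ldl: float, age: float) -> float:
    """Predict Min LDL achieved from baseline LDL and age."""
    return 1.003 + 0.250 * baseline_ldl - 0.005 * age

# Illustrative values: baseline LDL = 4.0 mmol/L, age = 60 years
print(round(predict_min_ldl(4.0, 60.0), 3))  # → 1.703
```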

## How do you select which variables to enter the model?

- Usually consider what hypotheses you are testing
- If there is a main exposure variable, enter it first and assess confounders one at a time
- For derivation of a clinical prediction rule you want powerful predictors
- Also include clinically important factors, e.g. cholesterol in CHD prediction
- Significance is important, but it is acceptable to retain an important variable without statistical significance

## How do you decide what variables to enter in the model?

Correlations? With great difficulty!

## 3-dimensional scatterplot from SPSS of Time from Surgery in relation to Dukes staging and age

## Approaches to model building

1. Let scientific or clinical factors guide selection
2. Use automatic selection algorithms
3. A mixture of the above

## 1) Let Science or Clinical factors guide selection

- Baseline LDL cholesterol is an important factor determining LDL outcome, so enter it first
- Next allow for age and gender
- Add adherence as important?
- Add BMI and smoking?

## 1) Let Science or Clinical factors guide selection

This results in a model of:
1. Baseline LDL
2. Age and gender
3. Adherence
4. BMI and smoking

Is this a good model?

## 1) Let Science or Clinical factors guide selection: Final Model

Note: three variables were entered but are not statistically significant.

## 1) Let Science or Clinical factors guide selection

Is this the best model? Should I leave out the non-significant factors (Model 2)?

| Model | Adj R² | F from ANOVA | No. of parameters p |
|---|---|---|---|
| 1 | 0.137 | 37.48 | |
| 2 | 0.134 | 72.021 | |

Adj R² is lower, F has increased and the number of parameters is smaller in the 2nd model. Is this better?

## Kullback-Leibler Information

Kullback and Leibler (1951) quantified the meaning of information related to Fisher's sufficient statistics. Basically we have reality f, and a model g to approximate f. The K-L information is I(f, g).

## Kullback-Leibler Information

We want to minimise I(f, g) to obtain the best model over other models. I(f, g) is the information lost, or distance, between reality and a model, so we need to minimise:

I(f, g) = ∫ f(x) log( f(x) / g(x) ) dx
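The link to the log-likelihood can be seen by expanding the integral (a standard decomposition, not shown on the slide):

```latex
I(f, g) = \int f(x)\,\log f(x)\,dx \;-\; \int f(x)\,\log g(x)\,dx
```

The first term depends only on reality f and is constant across candidate models, so minimising I(f, g) over models g is equivalent to maximising the expected log-likelihood E_f[log g(X)], which the sample log-likelihood estimates. This is why the log-likelihood appears in the AIC formula on the following slides.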

## Akaike's Information Criterion

It turns out that the function I(f, g) is related to a very simple measure of goodness-of-fit: Akaike's Information Criterion, or AIC.

## Selection Criteria

- With a large number of factors the type 1 error is large, so we are likely to end up with a model with many variables
- Two standard criteria:
  1) Akaike's Information Criterion (AIC)
  2) Schwarz's Bayesian Information Criterion (BIC)
- Both penalise models with a large number of variables if the sample size is large

## Akaike's Information Criterion

AIC = −2 × log-likelihood + 2p

where p = number of parameters and −2 × log-likelihood is in the output. Hence AIC penalises models with a large number of variables. Select the model that minimises (−2LL + 2p).
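The formula can be applied directly to ordinary linear regression. The sketch below, on simulated data (not the course's LDL data), compares the AIC of two nested models, one of which carries a pure-noise predictor:

```python
import math
import numpy as np

def gaussian_aic(y, X):
    """AIC = -2*log-likelihood + 2p for an OLS fit with Gaussian errors;
    p counts the regression coefficients plus the error variance."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    sigma2 = rss / n  # maximum-likelihood estimate of the error variance
    loglik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    return -2 * loglik + 2 * (X.shape[1] + 1)

rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                 # pure noise, unrelated to y
y = 2.0 + 0.5 * x1 + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([np.ones(n), x1, x2])
aic_small = gaussian_aic(y, X_small)
aic_big = gaussian_aic(y, X_big)
print(aic_small, aic_big)
```

Adding the noise predictor changes −2LL only slightly while costing 2 in penalty, so the AICs of the two models end up very close, with the penalty usually favouring the smaller one.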

## Generalized linear models

Unfortunately the standard REGRESSION procedure in SPSS does not give these statistics. Need to use Analyze > Generalized Linear Models.

## Generalized linear models

- Default is linear
- Add Min LDL achieved as dependent, as in REGRESSION in SPSS
- Next go to predictors

## Predictors

WARNING! Add the predictors in the correct box:
- Categorical variables in the FACTORS box
- Continuous variables in the COVARIATES box

## Model

Add all factors and covariates in the model as main effects.

## Generalized Linear Models: Parameter Estimates

Note: identical to the REGRESSION output.

## Generalized Linear Models: Goodness-of-fit

Note the output gives the log likelihood and AIC = 2835 (AIC = −2 × −1409.6 + 2 × 7 = 2835).

The footnote explains that a smaller AIC is better.

## Let Science or Clinical factors guide selection: Optimal model

The log likelihood is a measure of GOODNESS-OF-FIT. Seek the optimal model that maximises the log likelihood or minimises the AIC.

| Model | Log likelihood | AIC |
|---|---|---|
| 1 Full model | -1409.6 | 2835.6 |
| 2 Non-significant variables removed | -1413.6 | 2837.2 |

Change in AIC is 1.6.

## 1) Let Science or Clinical factors guide selection

Key points:
1. Results demonstrate a significant association with baseline LDL, age and adherence
2. Difficult choices with gender, smoking and BMI
3. AIC only changes by 1.6 when they are removed
4. Generally changes of 4 or more in AIC are considered important

## 1) Let Science or Clinical factors guide selection

Key points:
1. Conclude there is little to choose between the models
2. AIC is actually lower with the larger model, and we consider gender and BMI important factors, so keep the larger model, but this has to be justified
3. Model building is manual, logical, transparent and under your control

## 2) Use automatic selection procedures

- These are based on automatic mechanical algorithms, usually related to statistical significance
- Common ones are stepwise, forward and backward elimination
- Can be selected in SPSS using Method in the dialogue box
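To make the mechanics concrete, here is a hypothetical sketch of forward selection on simulated data. Note one deliberate substitution: SPSS's stepwise method enters and removes variables by F-test p-values, whereas this sketch greedily adds whichever candidate most lowers the AIC, to tie in with the criterion discussed on the surrounding slides:

```python
import math
import numpy as np

def aic_ols(y, X):
    """AIC of an OLS fit (coefficients + error variance as parameters)."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    loglik = -0.5 * n * (math.log(2 * math.pi * rss / n) + 1)
    return -2 * loglik + 2 * (X.shape[1] + 1)

def forward_select(y, candidates):
    """Greedily add the candidate column that most lowers AIC;
    stop when no addition improves it."""
    n = len(y)
    chosen, remaining = [], dict(candidates)
    best_aic = aic_ols(y, np.ones((n, 1)))  # intercept-only model
    improved = True
    while improved and remaining:
        improved = False
        trials = {name: aic_ols(y, np.column_stack(
                      [np.ones(n)] + [candidates[c] for c in chosen] + [col]))
                  for name, col in remaining.items()}
        name, aic = min(trials.items(), key=lambda kv: kv[1])
        if aic < best_aic:
            best_aic, improved = aic, True
            chosen.append(name)
            del remaining[name]
    return chosen

# Simulated data (invented): two real predictors and one noise variable
rng = np.random.default_rng(3)
n = 400
cands = {"baseline_ldl": rng.normal(4, 1, n),
         "age": rng.normal(60, 10, n),
         "noise": rng.normal(size=n)}
y = (1.0 + 0.25 * cands["baseline_ldl"] - 0.01 * cands["age"]
     + rng.normal(0, 0.5, n))
selected = forward_select(y, cands)
print(selected)
```

The strongest predictor enters first, mirroring the step-by-step entry shown in the SPSS output on the following slides.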

## 2) Use automatic selection procedures (e.g. Stepwise)

Select Method = Stepwise

## 2) Use automatic selection procedures (e.g. Stepwise)

[Screenshots: 1st step, 2nd step, final model]

## 2) Change in AIC with Stepwise selection

Note: only available from Generalized Linear Models

| Step | Model | Log likelihood | AIC | Change in AIC | No. of parameters p |
|---|---|---|---|---|---|
| 1 | Baseline LDL | -1423.1 | 2852.2 | | |
| 2 | + Adherence | -1418.0 | 2844.1 | 8.1 | |
| 3 | + Age | -1413.6 | 2837.2 | 6.9 | |
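The AIC column in this table can be checked against the formula AIC = −2LL + 2p. The parameter counts used below (3, 4, 5: intercept plus slopes plus the error variance, growing by one per step) are inferred, not stated on the slide:

```python
# Arithmetic check of the stepwise table: AIC = -2*log-likelihood + 2p.
def aic(loglik: float, p: int) -> float:
    return -2 * loglik + 2 * p

print(aic(-1423.1, 3))  # slide: 2852.2
print(aic(-1418.0, 4))  # slide: 2844.1 (formula gives 2844.0; LL rounding)
print(aic(-1413.6, 5))  # slide: 2837.2
```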

## 2) Advantages and disadvantages of stepwise

Advantages:
- Simple to implement
- Gives a parsimonious model
- Selection is certainly objective

Disadvantages:
- Unstable selection: stepwise considers many models that are very similar
- The p-value on entry may be smaller once the procedure is finished, so p-values are exaggerated
- Predictions in an external dataset are usually worse for stepwise procedures

## 2) Automatic procedures: Backward elimination

Backward elimination starts by removing the least significant factor from the full model, and has a few advantages over forward selection:
- The modeller has to consider the full model, and sees results for all factors simultaneously
- Correlated factors can remain in the model (in forward methods they may not even enter)
- Criteria for removal tend to be more lax in backward, so you end up with more parameters
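For symmetry with the forward sketch earlier, here is a hypothetical backward-elimination sketch on simulated data, again substituting AIC for the F-test p-values SPSS actually uses: start from the full model and repeatedly drop the predictor whose removal most lowers the AIC:

```python
import math
import numpy as np

def aic_ols(y, X):
    """AIC of an OLS fit (coefficients + error variance as parameters)."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    loglik = -0.5 * n * (math.log(2 * math.pi * rss / n) + 1)
    return -2 * loglik + 2 * (X.shape[1] + 1)

def backward_eliminate(y, cols):
    """cols: dict of name -> column. Returns the names kept."""
    n = len(y)
    kept = dict(cols)
    design = lambda d: np.column_stack([np.ones(n)] + list(d.values()))
    best = aic_ols(y, design(kept))        # full model to start
    while kept:
        trials = {name: aic_ols(y, design({k: v for k, v in kept.items()
                                           if k != name}))
                  for name in kept}
        name, aic = min(trials.items(), key=lambda kv: kv[1])
        if aic >= best:
            break                          # no removal improves the fit
        best = aic
        del kept[name]
    return list(kept)

# Simulated data (invented): two real predictors and one noise variable
rng = np.random.default_rng(4)
n = 400
cols = {"baseline_ldl": rng.normal(4, 1, n),
        "age": rng.normal(60, 10, n),
        "noise": rng.normal(size=n)}
y = (1.0 + 0.25 * cols["baseline_ldl"] - 0.01 * cols["age"]
     + rng.normal(0, 0.5, n))
kept = backward_eliminate(y, cols)
print(kept)
```

Unlike forward selection, every factor is present at the first step, which is why correlated factors have a better chance of surviving.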

## 2) Use automatic selection procedures (e.g. Backward)

Select Method = Backward

## 2) Backward elimination in SPSS

- 1st step: Gender removed
- 2nd step: BMI removed
- Final model

## Summary of automatic selection

- Automatic selection may not give the optimal model (it may leave out important factors)
- Different methods may give different results (forward vs. backward elimination; entry criteria in forward selection tend to be more stringent)
- Model assessment still requires some thought

## 3) A mixture of automatic procedures and self selection

- Use automatic procedures as a guide
- Think about what factors are important
- Add important factors
- Do not blindly follow statistical significance
- Consider AIC

## Summary of Model selection

- Selection of factors for multiple linear regression requires judgement
- Models are easily fitted in SPSS
- Check AIC or log likelihood for fit

## Summary

- Multiple regression models are the most used analytical tool in quantitative research
- They are easily fitted in SPSS
- Model assessment requires some thought

## Remember Occam's Razor

"Entia non sunt multiplicanda praeter necessitatem" - "Entities must not be multiplied beyond necessity"

William of Ockham, 14th-century friar and logician (1288-1347)

## Summary

- After fitting any model, check the assumptions
- Functional form: linearity or not
- Check residuals for normality
- Check residuals for outliers
- All accomplished within SPSS
- See publications for further info:

Donnelly LA, Palmer CNA, Whitley AL, Lang C, Doney ASF, Morris AD, Donnan PT. Apolipoprotein E genotypes are associated with lipid lowering response to statin treatment in diabetes: A GoDARTS study. Pharmacogenetics and Genomics, 2008; 18: 279-87.

## Practical on Multiple Regression

Read in LDL Data.sav

1) Try fitting a multiple regression model on Min LDL obtained, using forward and backward elimination. Are the results the same? Add other factors than those considered in the presentation, such as BMI and smoking. Remember the goal is to assess the association of APOE with LDL response.

2) Try fitting multiple regression models for Min Chol achieved. Is the model similar to that found for Min LDL?