
Statistics 244 – Binary response regression, and related issues

Models for binary outcomes, and related issues

• GLMs for binary outcomes, link functions, logistic regression


• Latent variable interpretation
• Statistical inference
• Bernoulli versus binomial responses
• Example analyses, basic diagnostics, separation

Binary response models

For observation i, have a set of p predictors xi1, . . . , xip, and a binary response variable

    Yi = 0   if observation i is a "failure"
    Yi = 1   if observation i is a "success"

Assume

    Yi ∼ Bern(pi).
As in least-squares regression, want to model Yi as a function of the predictor variables.

Bernoulli model as an EDF

The probability mass function is

    pi(yi) = pi^yi (1 − pi)^(1−yi),

which can be re-expressed as

    pi(yi) = exp{ [yi θi − b(θi)] / φ + c(yi, φ) }

with

    φ = 1
    b(θi) = log(1 + exp(θi))
    c(yi, φ) = 0.

This implies pi = µi = b'(θi) = exp(θi) / (1 + exp(θi)).
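
As a quick numerical check of the last identity, the short R sketch below differentiates b(θ) = log(1 + exp(θ)) numerically and compares it with the logistic mean function (the value of θ is an arbitrary illustrative choice).

# Numerical check (sketch): b'(theta) equals the logistic mean exp(theta)/(1 + exp(theta))
b <- function(t) log(1 + exp(t))
theta <- 0.7                                    # arbitrary illustrative value
eps <- 1e-6
(b(theta + eps) - b(theta - eps)) / (2 * eps)   # numerical derivative of b at theta
exp(theta) / (1 + exp(theta))                   # closed form; same as plogis(theta)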

Example: Melanoma and mortality

Data were collected on patients with malignant melanoma who had their tumor removed by
surgery. The surgeries, which took place between 1962 and 1977, removed the tumors as well
as the surrounding skin. Patients were followed until the end of 1977.

We want to model the probability of death from melanoma by 1977, coded as status, as a function of

sex = Male or Female

age = Age in years at the time of the operation

year = Year of the tumor removal operation

thickness = Thickness of the tumor (mm)

ulcer = Absent or Present (ulceration of tumor)

Data summaries

> summary(melanoma)
status sex age year thickness
Alive:134 Female:119 Min. : 4.00 Min. :1962 Min. : 0.100
Dead : 57 Male : 72 1st Qu.:41.00 1st Qu.:1968 1st Qu.: 0.970
Median :54.00 Median :1971 Median : 1.940
Mean :51.52 Mean :1970 Mean : 2.861
3rd Qu.:63.50 3rd Qu.:1972 3rd Qu.: 3.540
Max. :95.00 Max. :1977 Max. :17.420
ulcer
Absent :108
Present: 83

Choice of link function: Identity link?

Is it reasonable to let ηi = g(µi ) = µi ?

In other words, should we let


pi = µi = xi β?

Problem with identity link

• If we assume pi = xi β, then it’s possible that large values of xij can result in pi > 1.
• Similarly, this parameterization does not prevent pi < 0.
• This argues to choose link function g() to ensure 0 < pi < 1.

Want to choose link function so that g maps from (0, 1) to (−∞, ∞).

Some common choices


 
    g(µ) = logit(µ) = log( µ / (1 − µ) )     (logit link)
    g(µ) = Φ^{−1}(µ)                          (probit link)
    g(µ) = log(− log(1 − µ))                  (complementary log-log)

where Φ^{−1}(µ) is the inverse standard normal cdf.

Inverse link functions (the function that gives the mean as a function of the linear predictor)

• Logit link: pi = µi = exp(xi β) / (1 + exp(xi β)) = 1 / (1 + exp(−xi β))

• Probit link: pi = µi = Φ(xi β)

• C-log-log link: pi = µi = 1 − exp(− exp(xi β))

A few comments on these three link functions

• When −1 ≤ η ≤ 1, all three (inverse) link functions are nearly linearly related. This means
that they are equally good/bad to use when probabilities can be expected to be away from
0 or 1.

• Main difference between logit and probit (inverse) link functions is that logit tends to assign
conservative probabilities (closer to 0.5) for extreme values of the linear predictor. Logit is
more robust than probit.

• Notice that the complementary log-log inverse link function is asymmetric: as the linear predictor grows it approaches 1 much faster than it approaches 0, so it yields extreme high probabilities but comparatively conservative low probabilities (see the numerical comparison after this list).

• Preference for one of these link functions over another can only be established with very large samples of data. However, with data sets that large, one should probably reconsider whether a linear function of the predictor variables is adequate in the first place.
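
The short R sketch below simply tabulates the three inverse link functions on a grid of linear predictor values; it is meant only to illustrate the comparisons above (for example, the asymmetry of the complementary log-log).

# Sketch: evaluate the three inverse link functions on a grid of eta values
eta <- seq(-4, 4, by = 1)
round(cbind(eta,
            logit   = plogis(eta),            # exp(eta)/(1 + exp(eta))
            probit  = pnorm(eta),
            cloglog = 1 - exp(-exp(eta))), 3)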

Latent variable formulation

Suppose zi is an unobserved variable, with


zi = xi β + εi
where εi is an “error” random variable, centered at (or very near) 0. More details momentarily on
this distribution.

Given this definition of zi, let yi be defined as

    yi = 0   if zi ≤ 0
    yi = 1   if zi > 0.
In this representation, zi can be thought of as an unobserved continuous score, and yi is partial
information about the latent score.

If εi has a logistic distribution centered at 0, with cumulative distribution function (cdf)


Fεi (u) = Pr(εi < u) = 1/(1 + exp(−u)),
then
    Pr(yi = 1) = Pr(zi > 0) = Pr(xi β + εi > 0)
               = Pr(εi > −xi β) = 1 − Pr(εi < −xi β)
               = 1 − 1/(1 + exp(xi β)) = exp(xi β) / (1 + exp(xi β)),

which is logistic regression.

This is one reason this model is called logistic regression.
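
A small simulation sketch of this argument, using arbitrary illustrative coefficients: thresholding a latent score with logistic errors reproduces success probabilities of the logistic regression form.

# Simulation sketch: Pr(y = 1 | x) from the latent-variable model matches plogis(x * beta)
set.seed(1)
n <- 100000
x <- rnorm(n)
beta0 <- -0.5; beta1 <- 1.2              # illustrative values (assumed)
z <- beta0 + beta1 * x + rlogis(n)       # latent score with standard logistic error
y <- as.integer(z > 0)
mean(y[abs(x - 1) < 0.05])               # empirical Pr(y = 1) for x near 1
plogis(beta0 + beta1 * 1)                # logistic regression probability at x = 1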

Latent variable formulation of other models

In the equation
zi = xi β + εi ,
if εi ∼ N(0, 1), then this leads to probit regression (i.e., Bernoulli model with a probit link function).

If εi has the standard extreme-value (Gumbel) distribution, with cdf Fεi (x) = exp(−exp(−x)), then the complementary log-log link results.

In general, the cdf Fεi of any continuous distribution on R is a candidate inverse link function for a binary response model; the corresponding quantile function (the inverse cdf) can then be used as the link function.

Reason for considering latent variable formulation

• In some situations, a binary outcome could be viewed as choice between two alternatives.

• It might be sensible to assume that, underlying the potential choice, there is a continuum of
values that expresses the merit or utility.

• The latent variable view of the model assumes that if a person’s (unobserved) utility is above
a certain threshold, one decision is made; if it is below the threshold, the other decision is
made.

• Note that the latent variable framework does not change the probability model; it just lends
an interpretation of the underpinnings of the model.

Logistic regression: Binary response regression with logit link

We will focus on logistic regression for most of the discussion.

• Hard to distinguish the fit of different link functions without a lot of data.

• Logit link has some nice mathematical advantages over other link functions (as we shall see).

• η = logit(µ) is the canonical link for the Bernoulli distribution.

Curious property of logistic regression

In many situations, the predictor variables are sampled (or determined by design) first, and then
the response is treated as a random variable. These are called prospective designs.

Example: Effect of smoking on lung cancer

Randomly select a sample of smokers and non-smokers, and wait to see whether study participants develop lung cancer by age 75.

Here, let xi be 0 or 1 if participant i is a non-smoker or a smoker, respectively, and let yi be 0 or 1
if the participant avoids or develops lung cancer. Possible logistic regression model:

logit pi = β1 + β2 xi .

A prospective design for studying the effects of smoking on lung cancer can be inefficient and
time-consuming.

Alternative: Retrospective sampling

Identify a random sample of elderly people with lung cancer, and a random sample without lung cancer. Then identify whether they were smokers or non-smokers.

In this situation, the yi were sampled first, and the xi are observed subsequently.

Retrospective sampling could produce a very different type of sample than prospective sampling:
For example, despite lung cancer being relatively rare, we could ensure through retrospective
sampling just as many lung cancer cases as non-cases.

The curious result

If sampling were carried out retrospectively (instead of prospectively), and we fit a logistic regression model of the form

    logit pi = β1 + β2 xi ,

the coefficient β2 to xi will be the same as if we collected the data prospectively!

The intercept, β1 , will not necessarily be the same.

You will get to show this on your homework.

Bottom line If we want to estimate the effect of a predictor variable in a logistic regression from data collected in a prospective study, we can carry out a retrospective study instead and get approximately the same estimate.

Comments:

• No need to restrict xi to being a binary predictor, or even a single predictor. The result holds
for a vector of arbitrary predictor variables, xi .

• If we chose another link function besides logit, this would not work.
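
A simulation sketch of the result (all numbers here are invented for illustration): in a large artificial population with a rare outcome, the slope from a case-control (retrospective) sample is close to the prospective slope, while the intercept is not.

# Sketch: prospective vs. retrospective (case-control) sampling in a simulated population
set.seed(2)
N <- 200000
x <- rbinom(N, 1, 0.3)                       # e.g., exposure indicator
y <- rbinom(N, 1, plogis(-4 + 1.5 * x))      # rare binary outcome
dat <- data.frame(x = x, y = y)
prosp <- glm(y ~ x, family = binomial, data = dat)
# retrospective sample: all cases plus an equal number of randomly chosen controls
retro.dat <- rbind(dat[dat$y == 1, ],
                   dat[sample(which(dat$y == 0), sum(dat$y == 1)), ])
retro <- glm(y ~ x, family = binomial, data = retro.dat)
rbind(prospective = coef(prosp), retrospective = coef(retro))  # slopes agree; intercepts differ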

Statistical inference for logistic regression

Model can be fit using maximum likelihood estimation, implemented numerically using Fisher
scoring.

Result of model fit: β̂, the maximum likelihood estimate (MLE) of β.

Can also show the estimated variance (lowercase "var") of β̂ is

    var(β̂) = (X^T W X)^{−1}

where W is the diagonal matrix with entries

    wii = 1 / [ (∂ηi/∂µi)^2 φ V(µi) ],   evaluated at µi = µ̂i,   which here equals µ̂i (1 − µ̂i),

with µ̂i = g^{−1}(xi β̂) = exp(xi β̂) / (1 + exp(xi β̂)), and φ = 1 for the Bernoulli model.

Simple logistic regression fit in R

> melanoma.year.glm =
    glm(status ~ year, family=binomial, data=melanoma)
> summary(melanoma.year.glm)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 351.33724 123.35457 2.848 0.00440 **
year -0.17880 0.06263 -2.855 0.00431 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
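
As a sketch of the variance formula above, the covariance matrix of β̂ can be reconstructed by hand from the fitted model just shown and compared with what R reports.

# Sketch: var(beta-hat) = (X'WX)^(-1), with W = diag(mu-hat * (1 - mu-hat))
X  <- model.matrix(melanoma.year.glm)
mu <- fitted(melanoma.year.glm)
W  <- diag(mu * (1 - mu))
solve(t(X) %*% W %*% X)       # (X^T W X)^{-1} computed by hand
vcov(melanoma.year.glm)       # should agree up to convergence tolerance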

Interpreting logistic regression coefficients

Consider the coefficient βj (or an estimate, β̂j ) to the predictor variable xj .

• Positive βj means that larger xj corresponds to greater probability of Y = 1, holding the other predictors fixed.

• Similarly, a negative βj means that larger xj corresponds to lower probability of Y = 1, holding the other predictors fixed.

• An increase of 1 in xj corresponds to an increase of βj on the scale of logit p.

Interpretation for melanoma example

The estimated coefficient to year, β̂1 = −0.17880, means that for every year later an operation was performed, there is a 0.17880 drop in the logit of the probability that the patient dies by 1977.

Another way to interpret βj

Notice that for logistic regression

    ∂µ/∂xj = βj exp(xβ) / [1 + exp(xβ)]^2 = βj µ(1 − µ).

This means that for probabilities near (say) µ = 0.5, a unit increase in xj corresponds to an increase
of βj (0.5)(1 − 0.5) = βj /4 in probability. For probabilities closer to 0 or 1, the increase is less.

Of course, this interpretation depends on the approximate probability so should be used with
caution.

For example, in the range of probabilities of death of around 0.35, every additional year later an operation is performed results in a drop of

0.17880(0.35)(1 − 0.35) = 0.0407

in probability of death.

Wald tests and 100(1 − α)% Confidence intervals for βj

• Test statistic is z = (β̂j − βj) / s.e.(β̂j), where s.e.(β̂j) = sqrt( var(β̂)jj ). Under an assumed βj, z ∼ N(0, 1). (The t-distribution is a poor approximation for the distribution of β̂j in a logistic regression; the numerator and denominator are not independent.)

• Conventional to use

β̂j ± z1−α/2 s.e.(β̂j )

where z1−α/2 is the (1 − α/2) quantile of N(0, 1).

• This is actually a crude calculation, and relies on the sample size being large enough to
assume β̂j is approximately normally distributed. Generally better to use profile likelihood
confidence intervals.
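
A sketch comparing the two intervals for the year coefficient in the model fit earlier (confint() applied to a glm fit profiles the likelihood):

# Sketch: Wald interval by hand vs. profile likelihood interval
est <- coef(summary(melanoma.year.glm))["year", "Estimate"]
se  <- coef(summary(melanoma.year.glm))["year", "Std. Error"]
est + c(-1, 1) * qnorm(0.975) * se      # Wald 95% interval
confint(melanoma.year.glm, "year")      # profile likelihood 95% interval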

Logistic regression coefficients for binary predictors and odds ratios

For event A, define

    odds(A) = Pr(A) / (1 − Pr(A)).

Basically, the odds, a monotonically increasing function of probability, maps probabilities to a scale that ranges from 0 to ∞.

Some odds

Pr(A) = 0 ←→ odds(A) = 0
Pr(A) = 1/2 ←→ odds(A) = 1
Pr(A) = 1 ←→ odds(A) = ∞

Notice that

    logit Pr(Y = 1) = log[ Pr(Y = 1) / (1 − Pr(Y = 1)) ] = log[ Pr(Y = 1) / Pr(Y = 0) ] = log(odds(Y = 1)).
The “logit” function is sometimes called the “log-odds” function.

Odds ratios

For a chosen value a, consider the quantity

    ORj(a) = odds(Y = 1 | xj = a + 1) / odds(Y = 1 | xj = a)

This is an indication of the effect of xj (changing from a to a + 1).

For logistic regression,


ORj (a) = exp(βj )
which does not depend on a. (show this on homework)

For binary predictors (where xj = 0, 1), it is very common to summarize the effects as odds ratios.
Not as compelling to use odds ratios when xj is a quantitative predictor.

Confidence intervals for odds ratios

Simplest solution is to

1. Obtain a confidence interval for βj in the usual way, and then

2. Recognizing that ORj = exp(βj ), exponentiate the endpoints of the confidence interval from
the previous step.

Example: Melanoma

> melanoma1.glm = glm(status ~ sex+age+year+thickness+ulcer,
                      family=binomial, data=melanoma)
> summary(melanoma1.glm)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 460.62370 150.46918 3.061 0.002204 **
sexMale 0.50311 0.36880 1.364 0.172505
age 0.02214 0.01191 1.859 0.063020 .
year -0.23556 0.07652 -3.078 0.002083 **
thickness 0.11393 0.06669 1.708 0.087554 .
ulcerPresent 1.44607 0.39367 3.673 0.000239 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Application to melanoma example

Confidence interval for the odds ratio of ulceration of the tumor on probability of death.

β̂ulcer = 1.446
s.e.(β̂ulcer ) = 0.394

The endpoints of an approximate 95% confidence interval for βulcer are

1.446 ± 1.96(0.394) = (0.674, 2.218)

Exponentiating results in an approximate 95% confidence interval of

(exp(0.674), exp(2.218)) = (1.962, 9.191)

Could apply the same procedure with the endpoints based on the profile likelihood confidence
interval.
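
The same calculation in R, as a sketch (the Wald version by hand, and the profile-likelihood version via confint(), exponentiated in the same way):

# Sketch: 95% interval for the ulceration odds ratio
est <- coef(melanoma1.glm)["ulcerPresent"]
se  <- sqrt(vcov(melanoma1.glm)["ulcerPresent", "ulcerPresent"])
exp(est + c(-1, 1) * 1.96 * se)              # Wald interval, approx (1.96, 9.19)
exp(confint(melanoma1.glm, "ulcerPresent"))  # profile likelihood version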

Inference for µ

Based on the fit of the model, we have


    µ̂ = 1 / (1 + exp(−x β̂))

Can use the above expression for probability estimates.

Can obtain confidence intervals for µ by either

• obtaining the standard error of probability estimates using the Delta method estimates of
the variance, or

• obtaining the confidence interval of xβ, and finding the inverse logit of the endpoints.

The following produces the predicted probabilities and Delta method standard errors based on
the fitted model:

> melanoma1.predict = predict(melanoma1.glm,type="response",se.fit=T)


> melanoma$predict = melanoma1.predict$fit
> melanoma$pred.se = melanoma1.predict$se.fit
> head(melanoma) # first 6 observations

status sex age year thickness ulcer status.val predict pred.se


Alive Male 41 1977 1.34 Absent 0 0.02906452 0.02127766
Dead Male 52 1965 12.08 Present 1 0.90302062 0.06031324
Dead Male 28 1971 4.84 Present 1 0.36855538 0.10785144
Dead Male 77 1972 5.16 Present 1 0.58601090 0.09635202
Dead Male 49 1968 12.88 Present 1 0.82481181 0.09528135
Dead Female 68 1971 7.41 Present 1 0.53419860 0.10206943

Similarly, we can estimate the probability based on specified predictor values.

Example:

> melanoma.newdat = data.frame(sex="Female", age=50, year=1971,
                               thickness=2.5, ulcer="Absent")
> melanoma1.newpred =
    predict(melanoma1.glm, newdata=melanoma.newdat,
            type="response", se.fit=T)
> melanoma1.newpred
$fit
1
0.0938881

$se.fit
1
0.03119798

Confidence intervals for µ based on CI for xβ

1. Compute the logit of the estimated probability and its associated standard error.

2. Construct the confidence interval for η = logit µ.

3. Invert the endpoints of the confidence interval through the inverse logit function.

Example

> melanoma1a.newpred =
predict(melanoma1.glm,newdata=melanoma.newdat,type="link",
se.fit=T)
> print(melanoma1a.newpred$fit)
1
-2.267059

> print(melanoma1a.newpred$se.fit)
[1] 0.3667195

Example (cont’d)

Confidence interval:

> print(plogis(c(
melanoma1a.newpred$fit - 1.96*melanoma1a.newpred$se.fit,
melanoma1a.newpred$fit + 1.96*melanoma1a.newpred$se.fit))
)
1 1
0.04807017 0.17533355

Note that the symmetric interval based on the Delta method can extend beyond (0, 1).

Likelihood ratio tests for nested models

Want to test model M0 with p0 coefficients nested within model M1 with p1 coefficients.

Let D(µ̂0 | y) be the deviance for model M0 and let D(µ̂1 | y) be the deviance for the model M1 .

Then

    D(µ̂0 | y) − D(µ̂1 | y)

has a χ² distribution with p1 − p0 degrees of freedom under model M0. Determine the p-value based on this statistic.

Likelihood ratio testing: Melanoma example

Are the predictors thickness and ulcer jointly significant?

> melanoma1.glm = glm(status ~ sex+age+year+thickness+ulcer,
                      family=binomial, data=melanoma)
> melanoma2.glm = glm(status ~ sex+age+year,
                      family=binomial, data=melanoma)
> anova(melanoma2.glm, melanoma1.glm, test="Chi")

Analysis of Deviance

Analysis of Deviance Table

Model 1: status ~ sex + age + year
Model 2: status ~ sex + age + year + thickness + ulcer
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 187 211.88
2 185 186.18 2 25.703 2.622e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The smaller model is rejected at the 0.05 significance level.

Working with deviances

Deviance for the smaller model: 211.88

Deviance for the full model: 186.18

Can also use the deviance function on the fitted models to obtain the deviances:

> deviance(melanoma1.glm)
[1] 186.1809
> deviance(melanoma2.glm)
[1] 211.8841

The extra parameters come from thickness (1) and ulcer (1), for a total of 2.

Thus we compare (211.88 − 186.18) = 25.7 to a χ² distribution with 2 degrees of freedom to obtain a p-value. This can be obtained from the anova output as before.
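
The p-value can also be computed directly from the two deviances, as a one-line sketch:

# Sketch: p-value for the drop in deviance on 2 degrees of freedom
pchisq(deviance(melanoma2.glm) - deviance(melanoma1.glm), df = 2, lower.tail = FALSE)
# about 2.6e-06, matching the anova() output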

Profile likelihood for melanoma data

> melanoma.year.glm = glm(status ~ year,
                          family=binomial, data=melanoma)
> confint(melanoma.year.glm)
2.5 % 97.5 %
(Intercept) 112.7589466 598.43310045
year -0.3042709 -0.05767612

Categorical predictors, grouped responses

When only categorical predictors are observed, Bernoulli responses can be grouped within categories to form Binomial responses.

Example: AIDS symptoms and AZT

In this study, 338 HIV-infected veterans whose immune systems were beginning to deteriorate were assigned either to take AZT immediately or to wait until their cells showed severe immune weakness. The response was whether a veteran developed AIDS symptoms during the 3-year study.

Data

                       Has AIDS symptoms
Race     AZT Use        Yes       No
White    Yes             14       93
White    No              32       81
Black    Yes             11       52
Black    No              12       43

Letting Yi be 1 if veteran i experienced AIDS symptoms, and 0 if not, we would like to model the
Bernoulli probability pi = Pr(Yi = 1) as a function of race and AZT use.

To implement the analysis, could list out all 338 cases. But this would be inefficient.

Instead, let c index a unique category formed from the factors, with c = 1, 2, . . . , C. In the HIV
example, C = 4.

Then within category c, assume


Yc ∼ Bin(nc , pc )
where Yc is the number of “successes” out of nc observations in category c, and pc is the probability
of “success” for a single observation within category c.

Assume logit pc = xc β.
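
For reference, a sketch of the grouped data frame assumed by the glm() call on the next slide, built from the counts in the table above (the column names Symptoms, SymptomFree, Race, and AZT are taken from that call):

# Sketch: grouped (binomial) data frame for the AZT example
azt <- data.frame(Race        = c("White", "White", "Black", "Black"),
                  AZT         = c("Yes", "No", "Yes", "No"),
                  Symptoms    = c(14, 32, 11, 12),
                  SymptomFree = c(93, 81, 52, 43))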

AZT data: Results of grouped analysis

> azt1.glm = glm(cbind(Symptoms, SymptomFree) ~ Race + AZT,
                 family=binomial, data=azt)
> summary(azt1.glm)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.07357 0.26294 -4.083 4.45e-05 ***
RaceWhite 0.05548 0.28861 0.192 0.84755
AZTYes -0.71946 0.27898 -2.579 0.00991 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 8.3499 on 3 degrees of freedom


Residual deviance: 1.3835 on 1 degrees of freedom

Comparing ungrouped and grouped analysis

Maximum likelihood produces the same overall inferences because the likelihoods are the same
up to a multiplicative constant.

• Contribution to likelihood for category c in the ungrouped case:

      ∏_{i=1}^{nc} pc^{yi} (1 − pc)^{1−yi} = pc^{yc} (1 − pc)^{nc−yc}

• Contribution to likelihood for category c in the grouped case:

      (nc choose yc) pc^{yc} (1 − pc)^{nc−yc}

Deviance for grouped data

When C categories of predictors are observed, the deviance for the logistic regression model is

    D(µ | y) = 2 Σ_{c=1}^{C} [ yc log(yc / µc) + (nc − yc) log((nc − yc) / (nc − µc)) ]

Note that this is not equal to the deviance viewing the data as ungrouped.

AZT data: Results of ungrouped analysis

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.07357 0.26294 -4.083 4.45e-05 ***
RaceWhite 0.05548 0.28861 0.192 0.84755
AZTYes -0.71946 0.27898 -2.579 0.00991 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 342.12 on 337 degrees of freedom


Residual deviance: 335.15 on 335 degrees of freedom

Parameter estimates are identical to grouped analysis, but deviances are different.

The deviance of the saturated model is also different under the grouped and ungrouped data representations (for the grouped data the saturated model uses µc = yc, while for the ungrouped data it uses µi = yi).

• Consequence – cannot evaluate the fit of a model by comparing its scaled deviance to a χ² distribution with n − p degrees of freedom.

• Resolution – much more justified to compare two models by subtracting scaled deviances than to evaluate a single model against a scaled deviance.

Example: Effect of Race, with AZT already in model

Is Race a significant predictor of having AIDS symptoms, accounting for AZT in the model?

Grouped analysis:

• Deviance with only AZT: 1.420

• Deviance with AZT and Race: 1.383

• Difference in deviance: 1.420 − 1.383 = 0.037

Ungrouped analysis

• Deviance with only AZT: 335.188

• Deviance with AZT and Race: 335.151

• Difference in deviance: 335.188 − 335.151 = 0.037

Same χ² statistic (and the same degrees of freedom, namely 1).

Goodness of fit

No commonly accepted analog to R2 in least-squares regression.

One measure that is sometimes used is due to McFadden (1974):


    pseudo-R² = 1 − D*(µ̂ | y) / D*(µ̂(0) | y)

where µ̂(0) is the vector of mean estimates for the model with only an intercept term (no predic-
tors). This model is often generically called the “null” model.

Problem The null deviance is not well-defined. Could depend on whether the data were grouped.
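
For the ungrouped melanoma fit, McFadden's measure can be computed from quantities stored in the glm object, as a sketch:

# Sketch: McFadden's pseudo-R^2 for the full melanoma model
1 - melanoma1.glm$deviance / melanoma1.glm$null.deviance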

Regression diagnostics: Application to binary responses

Pearson residuals for individual cases:

    ri^(p) = (yi − µ̂i) / sqrt(V(µ̂i)) = (yi − µ̂i) / sqrt(µ̂i (1 − µ̂i))

Pearson residuals for grouped data:

    rc^(p) = (yc − µ̂c) / sqrt( µ̂c (nc − µ̂c) / nc )

Deviance residuals

For individual data i = 1, . . . , n,

    ri^(d) = sign(yi − µ̂i) sqrt(di),   where   di = −2 ( yi log µ̂i + (1 − yi) log(1 − µ̂i) ).

For grouped data, with category c = 1, . . . , C,

    rc^(d) = sign(yc − µ̂c) sqrt(dc),   with   dc = 2 [ yc log(yc / µ̂c) + (nc − yc) log((nc − yc) / (nc − µ̂c)) ].

Let

    ri* = (yi − µ̂i) / sqrt( (1 − hii) µ̂i (1 − µ̂i) )    or    rc* = (yc − µ̂c) / sqrt( (1 − hcc) µ̂c (nc − µ̂c) / nc )

be the standardized residuals, where hii or hcc is the corresponding diagonal element of the weighted hat matrix HW. To a first approximation,

    var(ri*) ≈ 1,   var(rc*) ≈ 1.

Jackknifed residuals defined as before.
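
In R, the residual types above can be extracted from a fitted glm; a sketch for the full melanoma model (rstudent() gives an approximation to the jackknifed residuals):

# Sketch: residual types for a fitted glm
r.pearson  <- residuals(melanoma1.glm, type = "pearson")
r.deviance <- residuals(melanoma1.glm, type = "deviance")
r.standard <- rstandard(melanoma1.glm)   # standardized deviance residuals
r.jack     <- rstudent(melanoma1.glm)    # approximate jackknifed residuals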

Basic diagnostics

• Plot deviance or jackknifed residuals against fitted values to detect non-linearities, outliers.

• Plot deviance or jackknifed residuals against values of a single predictor, also to detect non-linearities.

Problem With binary responses, residual plots almost always look striped.

An approach recommended by Gelman

• Sort the fitted values in ascending order

• Group observations according to the sorted fitted values into bins of size G (e.g., G at least
6-7 or more)

• Plot the average fitted value within each bin against the average residual within each bin.

Instead of fitted values, can use the same strategy for predictor variables when one wants to investigate non-linearities of the residuals by individual predictors.
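
A sketch of a binned residual plot along these lines for the full melanoma model (the bin size G = 20 is an arbitrary choice):

# Sketch: binned residual plot (average residual vs. average fitted value within bins)
fit  <- fitted(melanoma1.glm)
res  <- residuals(melanoma1.glm, type = "response")   # y - mu-hat
ord  <- order(fit)
G    <- 20                                            # observations per bin (assumed)
bins <- ceiling(seq_along(ord) / G)
plot(tapply(fit[ord], bins, mean), tapply(res[ord], bins, mean),
     xlab = "Average fitted value", ylab = "Average residual")
abline(h = 0, lty = 2)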

Predictive performance of binary response regression

Typical scenario: Disease screening/testing

• Already analyzed a data set consisting of people with and without a disease, along with
predictors of having the disease. The analysis results in an estimated logistic regression.
• For each person, can plug in the predictor values into the logistic regression equation to
obtain an estimated probability of having the disease.

Do higher probabilities of having the disease correspond to actually having the disease?

Ingredients

• The collection of binary responses, yi .


• The collection of predicted probabilities, p̂i = µ̂i .

Simple prediction rule Let c be a chosen cutoff value. Declare that the person has the disease if
p̂ ≥ c, and not if p̂ < c.

Let ỹi(c) be a binary prediction as a function of the cutoff value c:

    ỹi(c) = 0   if p̂i < c
    ỹi(c) = 1   if p̂i ≥ c

Let

F (c) = Pr(ỹ(c) = 0 | y = 1)
G(c) = Pr(ỹ(c) = 0 | y = 0)

be the cumulative distribution functions, for 0 ≤ c ≤ 1, of the fitted probabilities p̂ among observations with y = 1 and with y = 0, respectively.

Now define

Sensitivity(c) = 1 − F (c) = Pr(ỹ(c) = 1 | y = 1)


Specificity(c) = G(c) = Pr(ỹ(c) = 0 | y = 0)

Would very much like (for an appropriately chosen c):

Sensitivity(c) = Pr(ỹ(c) = 1 | y = 1) ≈ 1
Specificity(c) = Pr(ỹ(c) = 0 | y = 0) ≈ 1

In real life, it is usually too optimistic to have these values close to 1.

Predictive summaries

Common practice to choose c = 0.5, and summarize

Sensitivity(0.5) = Pr(ỹ(0.5) = 1 | y = 1)
Specificity(0.5) = Pr(ỹ(0.5) = 0 | y = 0)

In other words, compute the fraction of observations with p̂i ≥ 0.5 among those with yi = 1 (the
sensitivity), and compute the fraction of observations with p̂i < 0.5 among those with yi = 0 (the
specificity).

Use these values as estimates of the true sensitivity and specificity.
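
A sketch of this computation for the full melanoma model, coding status as 1 for Dead and 0 for Alive:

# Sketch: estimated sensitivity and specificity at cutoff c = 0.5
y    <- as.numeric(melanoma$status == "Dead")
phat <- fitted(melanoma1.glm)
mean(phat[y == 1] >= 0.5)   # sensitivity
mean(phat[y == 0] <  0.5)   # specificity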

Deeper look into predictability

Question What happens when c is chosen to be small (near 0)?

Answer p̂i will always be greater than c, so that ỹi (c) is always 1. This means

Sensitivity(c) = Pr(ỹ(c) = 1 | y = 1) ≈ 1
Specificity(c) = Pr(ỹ(c) = 0 | y = 0) ≈ 0

Great sensitivity, terrible specificity!

What happens when c is chosen to be large (near 1)?

Answer p̂i will always be less than c, so that ỹi (c) is always 0. This means

Sensitivity(c) = Pr(ỹ(c) = 1 | y = 1) ≈ 0
Specificity(c) = Pr(ỹ(c) = 0 | y = 0) ≈ 1

Terrible sensitivity, great specificity!

Moral of the story When constructing a prediction rule, there is a tradeoff between sensitivity and
specificity. Just selecting a cutoff for c may not be sufficient.

Summary for describing overall discrimination: ROC (“receiver operating characteristic”) curve

Steps to construct ROC curve After fitting a binary response regression,

1. Let c (cutoff value) vary from 0 to 1.

2. For each c, compute the empirical sensitivity Pr(ỹ(c) = 1 | y = 1), and “1 − specificity,”
Pr(ỹ(c) = 1 | y = 0).

3. Plot the “sensitivity” (y-axis) versus “1 − specificity” (x-axis) for this collection of points.

This approach only makes sense if the model produces many different predicted probabilities (a
model with only one or two factors would not be appropriate for an ROC analysis).
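
A sketch that builds the empirical ROC curve directly from these steps for the full melanoma model (dedicated packages exist, but base R is enough for the construction):

# Sketch: empirical ROC curve from the fitted probabilities
y    <- as.numeric(melanoma$status == "Dead")
phat <- fitted(melanoma1.glm)
cuts <- sort(unique(c(0, phat, 1)))
sens <- sapply(cuts, function(c) mean(phat[y == 1] >= c))   # sensitivity at each cutoff
spec <- sapply(cuts, function(c) mean(phat[y == 0] <  c))   # specificity at each cutoff
plot(1 - spec, sens, type = "s", xlab = "1 - specificity", ylab = "Sensitivity")
abline(0, 1, lty = 2)   # reference line for random guessing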

Interpretation

• Best possible situation: Plot jumps to a sensitivity of 1 and stays flat. This corresponds to
perfect accuracy.

• Worst possible situation: Diagonal line (y = x) from the origin to (1, 1). This corresponds to
random guessing.

Actual worst possible situation: curve is under the line y = x; predictions have poor sensitivity and specificity.

• Most real-data situations are in between.

The more area under the ROC curve, the greater the accuracy.

Technical specification of ROC curve

For 0 ≤ t ≤ 1, define

    ROC(t) = 1 − F(G^{−1}(1 − t)),

where

    G^{−1}(1 − t) = inf{ x ∈ R : G(x) ≥ 1 − t }.

(show this on homework!)

Summarizing an ROC curve: c-statistic (area under the ROC curve, “AUC”)

    c-statistic = AUC = ∫₀¹ ROC(u) du

The c-statistic, the “concordance index,” is an estimate of the following probability:

Pick at random an observation with yi = 1 and an observation with yi = 0. Then the concordance
index is the probability that the fitted probability for yi = 1 is larger than the fitted probability for
yi = 0.

This statistic is an accepted discrimination measure for binary response regressions. In fact, it is
equivalent to the Mann-Whitney U -statistic (also known as the Wilcoxon rank-sum statistic) for
testing whether one sample is stochastically larger than another. Will discuss this in section.
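
A sketch computing the c-statistic for the full melanoma model both as a concordance probability over all (case, non-case) pairs and via the Wilcoxon rank-sum statistic (a ties warning from wilcox.test() is harmless here):

# Sketch: c-statistic as a concordance probability, and via the rank-sum statistic
y    <- as.numeric(melanoma$status == "Dead")
phat <- fitted(melanoma1.glm)
p1 <- phat[y == 1]; p0 <- phat[y == 0]
mean(outer(p1, p0, ">") + 0.5 * outer(p1, p0, "=="))        # concordance index
wilcox.test(p1, p0)$statistic / (length(p1) * length(p0))    # same value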

Rough guidelines for interpreting c-statistics

• 0.50 to 0.75 = fair


• 0.75 to 0.92 = good
• 0.92 to 0.97 = very good
• 0.97 to 1.00 = excellent

ROC analysis for melanoma models

Area under ROC curves for melanoma models:

• For full model, c-statistic is 0.793 (good discrimination).

• For reduced model, c-statistic is 0.696 (fair discrimination).

One criticism of ROC approach: Overfitting

• In situations where the regression model is overfitting (e.g., if too many predictor variables
are incorporated), then the ROC analysis will be misleading.

• Possible solution: For each observation i, determine the predicted probability p̂i based on
fitting a logistic regression without observation i. Use these jackknifed probability estimates
in place of the ordinary logistic regression probability estimates.

Further work on ROC

Smoothing the ROC curve – parametric smoothing, kernel smoothing

A problem that arises in binary response models: Separation

If binary responses can be separated (or nearly so) into all 0s and all 1s as a linear function of X,
then the MLE of β may not exist (or at least may not be finite).

Two types of separation:

• Complete separation

• Quasi-complete separation

Logistic regression for the first data set

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 165.32 407521.43 0 1
x -47.23 115264.41 0 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 8.3178e+00 on 5 degrees of freedom


Residual deviance: 2.2152e-10 on 4 degrees of freedom
AIC: 4

Number of Fisher Scoring iterations: 25

Logistic regression for the second data set

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 137.32 54599.64 0.003 0.998
x -39.23 15599.90 -0.003 0.998

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 11.0904 on 7 degrees of freedom


Residual deviance: 2.7726 on 6 degrees of freedom
AIC: 6.7726

Number of Fisher Scoring iterations: 21

Comments

• Instances of predictions p̂i = 0 or p̂i = 1 (to many decimal places) are an indicator of separation.

• Quasi-complete separation occurs more typically with categorical predictors.

• For complete separation, the maximized log-likelihood is 0 (to many decimal places); for
quasi-complete separation, it is strictly negative.

• The more covariates one has in a binary response model, the greater the potential for separation.
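
A minimal sketch of complete separation (a toy data set, not the data used in the output above): the fitted coefficients and standard errors blow up, and glm() warns that fitted probabilities numerically 0 or 1 occurred.

# Sketch: a completely separated toy data set
x <- 1:6
y <- c(1, 1, 1, 0, 0, 0)                 # y = 1 exactly when x <= 3: complete separation
sep.glm <- glm(y ~ x, family = binomial)
summary(sep.glm)$coefficients            # huge estimates and standard errors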

Addressing separation

• Collect more data

• With quasi-complete separation, can still perform some inferential procedures:

– LRTs are still okay


– Inverting the LRTs to create confidence intervals/regions still works

• Can use other approaches besides likelihood-based ones. For example, Bayesian approaches
with proper priors, regularization (penalized likelihood), and Firth’s “bias-reduced” ap-
proach all produce finite estimates.

