American Statistical Association is collaborating with JSTOR to digitize, preserve and extend access to Journal
of the American Statistical Association.
http://www.jstor.org
Logistic Regression, Survival Analysis, and the Kaplan-Meier Curve
BRADLEY EFRON*
* Bradley Efron is Professor, Department of Statistics, Stanford University, Stanford, CA 94305. This article was stimulated by a talk of Wei-Yang Tsai concerning isotonic hazard-rate estimation.
© 1988 American Statistical Association, Journal of the American Statistical Association, June 1988, Vol. 83, No. 402, Theory and Methods
Efron:Survival Analysis Via Kaplan-Meier 415
[Figure 1: survival curves for arms A and B; vertical axis 0.0 to 1.0, horizontal axis Months, 0 to 80.]
Figure 1. Kaplan-Meier Estimated Survival Curves, Arms A and B. These estimates are taken from a study comparing radiation therapy alone (A) versus radiation plus chemotherapy (B) for the treatment of head and neck cancer. Treatment B is significantly better according to the Mantel-Haenszel test, significance level .01 (see Tables 1 and 2). "Death" actually means "recurrence of disease." The error bars indicate ± one standard error.

Table 1. Data for Arm A of the Head-and-Neck-Cancer Study Conducted by the Northern California Oncology Group, Discretized by Months

Month   n_i   s_i   s_i'      Month   n_i   s_i   s_i'
    1    51     1     0          25     7     0     0
    2    50     2     0          26     7     0     0
    3    48     5     1          27     7     0     0
    4    42     2     0          28     7     0     0
    5    40     8     0          29     7     0     0
    6    32     7     0          30     7     0     0
    7    25     0     1          31     7     0     0
    8    24     3     0          32     7     0     0
    9    21     2     0          33     7     0     0
   10    19     2     1          34     7     0     0
   11    16     0     1          35     7     0     0
   12    15     0     0          36     7     0     0
   13    15     0     0          37     7     1     1
   14    15     3     0          38     5     1     0
   15    12     1     0          39     4     0     0
   16    11     0     0          40     4     0     0
   17    11     0     0          41     4     0     1
   18    11     1     1          42     3     0     0
   19     9     0     0          43     3     0     0
   20     9     2     0          44     3     0     0
   21     7     0     0          45     3     0     1
   22     7     0     0          46     2     0     0
   23     7     0     0          47     2     1     1
   24     7     0     0
                               Total   628    42     9

NOTE: $n_i$ is the number of patients at risk at the beginning of month $i$, $s_i$ the number of observed deaths, $s_i'$ the number lost to follow-up. The survival times in days for the 51 patients were 7, 34, 42, 63, 64, 74+, 83, 84, 91, 108, 112, 129, 133, 133, 139, 140, 140, 146, 149, 154, 157, 160, 160, 165, 173, 176, 185+, 218, 225, 241, 248, 273, 277, 279+, 297, 319+, 405, 417, 420, 440, 523, 523+, 583, 594, 1,101, 1,116+, 1,146, 1,226+, 1,349+, 1,412+, 1,417 ("+" indicates lost to follow-up). The table was constructed from these data, taking one month to be 30.438 days.

For example, $n_3 = 48$ patients were alive at the beginning of the third month of observation, during which $s_3 = 5$ patients died and $s_3' = 1$ patient was lost to follow-up. This left $n_4 = 42$ patients still under study at the beginning of month 4. "Lost to follow-up," or "censored," data occurred mainly because patients entered the study at different calendar times, and some of them were still alive when the data were collected at the end of the study.
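The construction of Table 1 from the raw survival times can be sketched as follows. This is only an illustration: the Table 1 note gives the month length (30.438 days) but not the rule for assigning a day to a month, so the sketch assumes that day $d$ falls in month $\lceil d / 30.438 \rceil$. The names `TIMES` and `discretize` are hypothetical.

```python
import math

DAYS_PER_MONTH = 30.438  # month length stated in the Table 1 note

# Arm A survival times in days; True marks a "+" time (lost to follow-up).
TIMES = [
    (7, False), (34, False), (42, False), (63, False), (64, False), (74, True),
    (83, False), (84, False), (91, False), (108, False), (112, False), (129, False),
    (133, False), (133, False), (139, False), (140, False), (140, False), (146, False),
    (149, False), (154, False), (157, False), (160, False), (160, False), (165, False),
    (173, False), (176, False), (185, True), (218, False), (225, False), (241, False),
    (248, False), (273, False), (277, False), (279, True), (297, False), (319, True),
    (405, False), (417, False), (420, False), (440, False), (523, False), (523, True),
    (583, False), (594, False), (1101, False), (1116, True), (1146, False),
    (1226, True), (1349, True), (1412, True), (1417, False),
]


def discretize(times, days_per_month=DAYS_PER_MONTH):
    """Return lists (n, s, lost), index 0 = month 1, from (day, censored) pairs."""
    # Assumption: a time of d days belongs to month ceil(d / days_per_month).
    n_months = max(math.ceil(day / days_per_month) for day, _ in times)
    s = [0] * n_months      # observed deaths in each month
    lost = [0] * n_months   # losses to follow-up in each month
    for day, censored in times:
        month = math.ceil(day / days_per_month)
        if censored:
            lost[month - 1] += 1
        else:
            s[month - 1] += 1
    n = []                  # number at risk at the start of each month
    at_risk = len(times)
    for deaths, losses in zip(s, lost):
        n.append(at_risk)
        at_risk -= deaths + losses
    return n, s, lost


n, s, lost = discretize(TIMES)
print(n[2], s[2], lost[2])        # month 3 row of Table 1
print(sum(n), sum(s), sum(lost))  # "Total" row of Table 1
```

Under that assumption the month 3 row and the column totals agree with Table 1, which suggests the table was built in essentially this way.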
[Figure 2: hazard-rate curves for arms A and B; horizontal axis Months.]
Figure 2. Hazard-Rate Estimates for the Head-and-Neck Cancer Study. There is an early high-risk period for both treatments. The hazard rates stabilize after one year, with treatment A having a hazard rate roughly 2.5 times that of treatment B. (The bullets are identifying symbols for curve A, not data points.) This figure is based on a parametric analysis described in Section 2.

Table 2 shows the discretized data for arm B of the study. Here we have used $N = 61$ discrete intervals, not all of the same length. (The choice of discretization made little difference in the estimated hazard rates and survival curves; see Remark E, Sec. 3, and Remark I, Sec. 5.)

Our basic assumption is that for data of type (2.1), the number of deaths $s_i$ is binomially distributed, given $n_i$, say

$$s_i \mid n_i \sim \mathrm{Bi}(n_i, h_i) \quad \text{independently}, \quad i = 1, 2, \ldots, N. \tag{2.2}$$

In other words, $s_i$ has discrete density

$$\binom{n_i}{s_i} h_i^{s_i} (1 - h_i)^{n_i - s_i}, \qquad s_i = 0, 1, 2, \ldots, n_i.$$

Here $h_i$ is the discrete hazard rate:

$$h_i = \Pr\{\text{patient dies during } i\text{th interval} \mid \text{patient survives until beginning of } i\text{th interval}\}. \tag{2.3}$$

The binomial assumption in (2.2) is basic to most work in survival analysis. Nice discussions appear in chapter 4 of Cox and Oakes (1984) and section (5.2) of Kalbfleisch and Prentice (1980). In what follows, we consider the $n_i$ to be fixed at their observed values, and take literally the independence assumption in (2.2). Although this assumption cannot be exactly true (see Sec. 3, Remark A), it leads to reasonable conclusions under the usual assumptions for censored data.

The survival function for our discretized situation is

$$G_i = \prod_{j < i} (1 - h_j), \tag{2.4}$$

the probability that a patient does not die during the first $i - 1$ time intervals and thus survives at least until the
416 Journal of the American Statistical Association, June 1988
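[Figure 4: survival curves for arm B; vertical axis 0.0 to 1.0, horizontal axis Months, 0 to 80.]
Figure 4. Parametric Versus Nonparametric Survival Estimates, Arm B. Life-table estimate for arm B (jagged line) is compared with partial logistic regression based on cubic-linear spline joined at 11 months [see (2.9)].

3. MAXIMUM LIKELIHOOD ESTIMATES AND STANDARD ERRORS

This section discusses calculation of maximum likelihood estimates and their standard errors, for partial logistic regression models. Using the arm A data of Table 1 as an example, we show how the parametric survival estimates approach the life-table curves and how their estimated standard errors approach those given by Greenwood's formula, as the parametric models become more complicated. The additional theory required for models involving join-point estimation, such as the cubic-linear spline (2.9), is discussed in Section 4.

Suppose, then, that we have $s_i \mid n_i \sim \mathrm{Bi}(n_i, h_i)$ as in (2.2), where the $n_i$ are considered fixed at their observed values, and the $s_i$ are taken to be independently distributed, given the $n_i$. (The independence assumption is further discussed in Remark A.) Also, assume that the logistic parameter $\lambda_i = \log[h_i/(1 - h_i)]$ follows the linear logistic model $\lambda_i = x_i\alpha$ as in (2.8), so $h_i = [1 + \exp(-\lambda_i)]^{-1}$. We occasionally write $\lambda_{i,\alpha}$ and $h_{i,\alpha}$ to emphasize the dependence on $\alpha$.

These assumptions describe a standard logistic regression model (e.g., see Cox 1970), so we will quote without proof the usual results for maximum likelihood estimation in such models. Let $s = (s_1, s_2, \ldots, s_N)'$, $nh_\alpha = (n_1 h_{1,\alpha}, n_2 h_{2,\alpha}, \ldots, n_N h_{N,\alpha})'$, and $X$ equal the $N \times p$ matrix having vector $x_i$ of (2.8) as its $i$th row. Then the $p$-dimensional score vector $\dot{\ell}_\alpha = (\ldots, (\partial/\partial\alpha_j) \log f_\alpha(s), \ldots)'$ is

$$\dot{\ell}_\alpha = X'(s - nh_\alpha). \tag{3.1}$$

The MLE of $\alpha$ is that $\hat{\alpha}$ that makes (3.1) equal 0.

The $p \times p$ second derivative matrix $-\ddot{\ell}_\alpha$, with $jl$th element $-(\partial^2/\partial\alpha_j \partial\alpha_l) \log f_\alpha(s)$, is given by

$$-\ddot{\ell}_\alpha = X' \operatorname{diag}(n_i V_{i,\alpha}) X. \tag{3.2}$$

Here $V_{i,\alpha} \equiv h_{i,\alpha}(1 - h_{i,\alpha})$, and $\operatorname{diag}(n_i V_{i,\alpha})$ is the $N \times N$ diagonal matrix with diagonal elements $n_i V_{i,\alpha}$. The expected Fisher information matrix for $\alpha$, $\mathcal{I}_\alpha = E\{\dot{\ell}_\alpha(s)\dot{\ell}_\alpha(s)'\} = E\{-\ddot{\ell}_\alpha\}$, also equals $X' \operatorname{diag}(n_i V_{i,\alpha}) X$. The observed Fisher information matrix is defined to be $\hat{\mathcal{I}} = -\ddot{\ell}_{\hat{\alpha}}$, or equivalently from (3.2) $\hat{\mathcal{I}} = \mathcal{I}_{\hat{\alpha}} = X' \operatorname{diag}(n_i V_{i,\hat{\alpha}}) X$.

Estimated standard errors (SE's) for quantities of interest such as $\hat{\alpha}$, $h_{i,\hat{\alpha}}$, and $G_{i,\hat{\alpha}}$ are obtained from familiar maximum likelihood calculations:

$$
\begin{aligned}
\widehat{\operatorname{Cov}}(\hat{\alpha}) &= \hat{\mathcal{I}}^{-1}, \\
\operatorname{SE}(h_{i,\hat{\alpha}}) &= V_{i,\hat{\alpha}} \left[ x_i \hat{\mathcal{I}}^{-1} x_i' \right]^{1/2}, \\
\operatorname{SE}(G_{i,\hat{\alpha}}) &= G_{i,\hat{\alpha}} \left[ \Big( \sum_{j<i} h_{j,\hat{\alpha}} x_j \Big) \hat{\mathcal{I}}^{-1} \Big( \sum_{j<i} h_{j,\hat{\alpha}} x_j \Big)' \right]^{1/2}.
\end{aligned}
\tag{3.3}
$$

Here $G_{i,\hat{\alpha}} = \prod_{j<i} (1 - h_{j,\hat{\alpha}})$. We usually use the shorter notation $\hat{h}_i = h_{i,\hat{\alpha}}$, $\hat{G}_i = G_{i,\hat{\alpha}}$.

Table 3 gives estimated hazard rates and their standard errors for three conditional logistic regressions (2.8), fit to the Table 1 arm A data: a linear model $x_i = (1, t_i)$, the cubic model (2.7), and the cubic-linear spline (2.9). (A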
Table 3. Estimated Hazard Rates and Their Standard Errors at Selected Time Points, for Table 1 Arm A Data
Month   Linear   Cubic   Cubic-linear   Life-table   LTSM   Linear   Cubic   Cubic-linear   LTSM
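The maximum likelihood calculations in (3.1)-(3.3) can be sketched for the simplest of the three models, the linear model $x_i = (1, t_i)$, fit to the arm A counts of Table 1. This is a rough illustration rather than the computation used for Table 3: it assumes $t_i = i$ (the month index), uses a plain Newton-Raphson iteration with step halving, and the function names (`hazards`, `score`, `inverse_information`, `fit`) are hypothetical.

```python
import math

# Table 1, arm A: patients at risk n_i and observed deaths s_i for months 1..47.
N_AT_RISK = ([51, 50, 48, 42, 40, 32, 25, 24, 21, 19, 16, 15, 15, 15, 12, 11, 11, 11, 9, 9]
             + [7] * 17 + [5, 4, 4, 4, 3, 3, 3, 3, 2, 2])
DEATHS = ([1, 2, 5, 2, 8, 7, 0, 3, 2, 2, 0, 0, 0, 3, 1, 0, 0, 1, 0, 2]
          + [0] * 16 + [1, 1] + [0] * 8 + [1])
# Assumption: t_i is the month index i, so x_i = (1, i).
X = [(1.0, float(i)) for i in range(1, len(N_AT_RISK) + 1)]


def hazards(alpha, X):
    """h_i = 1 / (1 + exp(-x_i alpha)), the linear logistic model."""
    return [1.0 / (1.0 + math.exp(-(alpha[0] * x[0] + alpha[1] * x[1]))) for x in X]


def log_lik(alpha, X, n, s):
    """Binomial log likelihood without the constant binomial coefficients."""
    return sum(si * math.log(hi) + (ni - si) * math.log(1.0 - hi)
               for hi, ni, si in zip(hazards(alpha, X), n, s))


def score(alpha, X, n, s):
    """Score vector X'(s - n h), as in (3.1)."""
    h = hazards(alpha, X)
    return [sum(x[k] * (si - ni * hi) for x, hi, ni, si in zip(X, h, n, s))
            for k in range(2)]


def inverse_information(alpha, X, n):
    """Inverse of X' diag(n_i V_i) X with V_i = h_i (1 - h_i), as in (3.2); 2x2 case."""
    w = [ni * hi * (1.0 - hi) for hi, ni in zip(hazards(alpha, X), n)]
    a = sum(wi * x[0] * x[0] for wi, x in zip(w, X))
    b = sum(wi * x[0] * x[1] for wi, x in zip(w, X))
    d = sum(wi * x[1] * x[1] for wi, x in zip(w, X))
    det = a * d - b * b
    return [[d / det, -b / det], [-b / det, a / det]]


def fit(X, n, s, tol=1e-10, max_iter=100):
    """Newton-Raphson for the MLE, starting from the constant-hazard fit."""
    p0 = sum(s) / sum(n)
    alpha = [math.log(p0 / (1.0 - p0)), 0.0]
    for _ in range(max_iter):
        u = score(alpha, X, n, s)
        cov = inverse_information(alpha, X, n)
        step = [cov[k][0] * u[0] + cov[k][1] * u[1] for k in range(2)]
        old, scale = log_lik(alpha, X, n, s), 1.0
        trial = [alpha[k] + step[k] for k in range(2)]
        # Halve the step if the log likelihood drops by more than rounding noise.
        while log_lik(trial, X, n, s) < old - 1e-9 and scale > 1e-8:
            scale /= 2.0
            trial = [alpha[k] + scale * step[k] for k in range(2)]
        alpha = trial
        if max(abs(d) for d in step) < tol:
            break
    return alpha


alpha_hat = fit(X, N_AT_RISK, DEATHS)
h_hat = hazards(alpha_hat, X)
cov = inverse_information(alpha_hat, X, N_AT_RISK)
# SE(h_i) = V_i [x_i I^{-1} x_i']^{1/2}, the second line of (3.3).
se_h = [hi * (1.0 - hi) * math.sqrt(x[0] * (cov[0][0] * x[0] + cov[0][1] * x[1])
                                     + x[1] * (cov[1][0] * x[0] + cov[1][1] * x[1]))
        for hi, x in zip(h_hat, X)]
print(alpha_hat)
print(h_hat[0], se_h[0])
```

Because the model contains an intercept, the first component of (3.1) forces $\sum_i n_i \hat{h}_i = \sum_i s_i$ at the MLE, which gives a convenient check on any implementation. Any affine recoding of $t_i$ (for example, interval midpoints) leaves the fitted hazards unchanged, so the choice $t_i = i$ affects only the reported coefficients.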
Table 5. Estimated Survival Functions and Standard Errors

         Estimated survival                            Estimated standard errors for log survival
Month    Linear   Cubic   Cubic-linear   Life-table    Linear   Cubic   Cubic-linear   Life-table
    1     .910    .947       .985           .980        .019    .019       .031           .033
    3     .759    .819       .860           .843        .054    .055       .059           .065
    5     .640    .677       .642           .642        .084    .086       .090           .095
    7     .545    .543       .483           .501        .109    .112       .129           .132
    9     .469    .433       .402           .397        .131    .143       .155           .168
   11     .407    .350       .359           .355        .150    .177       .186           .202
   15     .313    .250       .295           .261        .184    .236       .256           .256
   20     .236    .197       .235           .184        .222    .284       .286           .299
   25     .185    .176       .191           .184        .262    .312       .309           .325
   30     .150    .166       .159           .184        .307    .325       .319           .338
   35     .125    .158       .134           .184        .358    .341       .324           .348
   40     .107    .147       .115           .126        .413    .360       .338           .370
   45     .094    .108       .100           .126        .470    .445       .436           .493
   47     .089    .065       .095           .063        .493    .718       .686           .726

NOTE: Left panel: estimated survival $\hat{G}$ at the end of the indicated months, for four different estimators applied to the arm A data, Table 1. Right panel: estimated standard errors for $\log\{\hat{G}\}$. The standard error for the cubic-linear spline includes a term for the choice of join (see Sec. 4). Note that the life-table estimate is only slightly more variable than the cubic-linear spline.
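The life-table column of Table 5 can be reproduced from the Table 1 counts with a few lines of code. The sketch below assumes the life-table hazard is $s_j/n_j$ and, following the table note, reports survival at the end of month $i$, that is, the product over $j \le i$. It also evaluates the Greenwood-type sum of (3.9) with the observed $s_j/n_j$ plugged in; the function names are hypothetical.

```python
import math

# Table 1, arm A: patients at risk n_i and observed deaths s_i for months 1..47.
N_AT_RISK = ([51, 50, 48, 42, 40, 32, 25, 24, 21, 19, 16, 15, 15, 15, 12, 11, 11, 11, 9, 9]
             + [7] * 17 + [5, 4, 4, 4, 3, 3, 3, 3, 2, 2])
DEATHS = ([1, 2, 5, 2, 8, 7, 0, 3, 2, 2, 0, 0, 0, 3, 1, 0, 0, 1, 0, 2]
          + [0] * 16 + [1, 1] + [0] * 8 + [1])


def life_table(n, s):
    """Survival at the end of each interval: product over j <= i of (1 - s_j / n_j)."""
    surv, g = [], 1.0
    for ni, si in zip(n, s):
        g *= 1.0 - si / ni
        surv.append(g)
    return surv


def greenwood_se_log(n, s):
    """SE of log survival at the end of each interval: sqrt(sum s_j / (n_j (n_j - s_j)))."""
    out, total = [], 0.0
    for ni, si in zip(n, s):
        total += si / (ni * (ni - si))
        out.append(math.sqrt(total))
    return out


G = life_table(N_AT_RISK, DEATHS)
SE_LOG = greenwood_se_log(N_AT_RISK, DEATHS)
for month in (1, 3, 5, 9, 11, 15, 20, 40, 47):
    print(month, round(G[month - 1], 3), round(SE_LOG[month - 1], 3))
```

The survival values printed for these months agree with the Life-table column of Table 5 to three decimals. The printed standard errors are not expected to match the right panel of the table, since those entries were calculated assuming that the cubic model was true rather than by plugging in $s_j/n_j$.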
Suppose we are interested in estimating the survival function $G_i$ rather than the hazard rate $h_i$. In this case, parametric methods offer less impressive improvements over the nonparametric life-table approach. Table 5 compares the estimated survival curves $\hat{G}_i$ (from the linear, cubic, and cubic-linear spline models) with the nonparametric life-table estimate (2.5). The right panel shows estimated standard errors for $\log\{\hat{G}_i\}$. The cubic-linear spline, which was our only parametric model giving a satisfactory fit to the data in Table 1, is only slightly less variable than the life-table estimate. In the notation of (3.3),

$$\operatorname{SE}(\log G_{i,\hat{\alpha}}) = \left[ \Big( \sum_{j<i} h_{j,\hat{\alpha}} x_j \Big) \hat{\mathcal{I}}^{-1} \Big( \sum_{j<i} h_{j,\hat{\alpha}} x_j \Big)' \right]^{1/2}. \tag{3.8}$$

It is easier to compare standard errors for $\log \hat{G}$ than for $\hat{G}$ itself, because the factor $G_{i,\hat{\alpha}}$ in the third equation of formula (3.3) is removed. To further sharpen the comparison, all of the standard errors in Table 5 were calculated assuming that the cubic model was true.

Formula (3.8) is closely related to Greenwood's formula for the variance of the life-table estimate. Suppose in (2.8) we take $p = N$ and $x_i = e_i$, the $N$-dimensional vector $(0, 0, \ldots, 1, 0, \ldots, 0)$ with 1 in the $i$th place ($i = 1, 2, \ldots, N$). Then $X$ equals the $N \times N$ identity matrix, and (3.1) shows that the MLE $\hat{h}_i$ equals $s_i/n_i = \tilde{h}_i$, the nonparametric MLE. In this case, the MLE of the survival curve, $\hat{G}_i$, equals (2.5), the life-table estimate $\tilde{G}_i = \prod_{j<i} (1 - \tilde{h}_j)$. The observed information matrix $\hat{\mathcal{I}} = X' \operatorname{diag}(n_i V_{i,\hat{\alpha}}) X$ equals $\operatorname{diag}(n_i \tilde{h}_i (1 - \tilde{h}_i))$, so (3.8) gives

$$\operatorname{SE}(\log \tilde{G}_i) = \left[ \sum_{j<i} \frac{\tilde{h}_j}{n_j (1 - \tilde{h}_j)} \right]^{1/2} = \left[ \sum_{j<i} \frac{s_j}{n_j (n_j - s_j)} \right]^{1/2}, \tag{3.9}$$

which is Greenwood's formula (see Miller 1981, p. 45).

This calculation (as well as common sense) leads us to expect that the variability of a survival curve $\hat{G}_i$, obtained from a $p$-parameter conditional logistic regression, approaches the variability of the life-table estimate $\tilde{G}_i$ as $p \to N$. What is surprising in Table 5 is how quickly the approach takes place. Even the cubic model, with only $p = 4$ parameters, has barely 10%-15% smaller standard errors than $\tilde{G}_i$. On the other hand, parametric models provide much greater improvements when estimating the hazard rate, as the theorem (3.7) shows.

Remark A. The independence assumption (2.2) cannot be literally true. For example, if there is no censoring, $s_i = n_i - n_{i+1}$. In this case, the sequence $s_1, s_2, \ldots$ is completely determined by the sequence $n_1, n_2, \ldots$, in contradiction to (2.2).

Nevertheless, calculations based on (2.2) give reasonable answers under reasonable assumptions. Using the notation of (2.1), let $v_i = (s_1, s_1', s_2, s_2', \ldots, s_{i-1}, s_{i-1}')$ and $v_i' = (s_1, s_1', s_2, s_2', \ldots, s_{i-1}, s_{i-1}', s_i)$. Starting with $n_1$ patients at risk at the beginning of observation (which we take to be a constant, fixed at its observed value), $v_i$ is the history of deaths and losses for the first $i - 1$ time intervals; $v_i'$ is the same history extended to include $s_i$. Here we follow the usual convention that the $s_i'$ losses in any one time interval occur after the $s_i$ deaths. Note that $n_2 = n_1 - s_1 - s_1'$, $n_3 = n_2 - s_2 - s_2'$, and so forth, so there is no need to indicate $n_2, n_3, \ldots$ in $v_i$ or $v_i'$.

We assume that $s_i$, given $v_i$, has a $\mathrm{Bi}(n_i, h_{i,\alpha})$ distribution, where $h_{i,\alpha} = [1 + \exp(-x_i\alpha)]^{-1}$, as in (2.8), and that $s_i'$, given $v_i'$, has a distribution depending on a nuisance parameter vector $\xi$, but not on $\alpha$:

$$
\begin{aligned}
f_{\alpha,\xi}(s_1, s_1', s_2, s_2', \ldots, s_N, s_N')
&= \left[ \binom{n_1}{s_1} h_{1,\alpha}^{s_1} (1 - h_{1,\alpha})^{n_1 - s_1} \right] f_\xi(s_1' \mid v_1') \\
&\quad \times \left[ \binom{n_2}{s_2} h_{2,\alpha}^{s_2} (1 - h_{2,\alpha})^{n_2 - s_2} \right] f_\xi(s_2' \mid v_2') \times \cdots \\
&\quad \times \left[ \binom{n_N}{s_N} h_{N,\alpha}^{s_N} (1 - h_{N,\alpha})^{n_N - s_N} \right] f_\xi(s_N' \mid v_N').
\end{aligned}
\tag{3.10}
$$
$$= \log(2), \qquad i = 37, \ldots, 61. \tag{3.17}$$

These $a_i$ compensate for the differing lengths of the time intervals in Table 2. [See Eq. (5.3). The fitted hazards $\hat{h}_i$ for arm B were adjusted to one-month intervals, for example, by multiplying $\hat{h}_i$ by 2 for $i = 1, \ldots, 18$, in order to make the plotted hazard rates in Fig. 2 comparable.]

    1    1.31   1.41   1.03    .011   .012   1.00
    3     .43    .45   1.00    .022   .030   1.00
    5     .21    .37   1.00    .017   .044   1.00
    7     .39    .36   1.02    .032   .033   1.00
    9     .68    .45   1.04    .035   .024   1.00
   11     .80    .54   1.00    .028   .020   1.01
   15     .83    .48   1.00    .026   .016   1.02
   20     .85    .42   1.01    .024   .013   1.01
   25     .87    .44   1.01    .022   .013   1.00
   35     .90    .63   1.01    .019   .016   1.01
   45     .95    .92   1.01    .016   .020   1.02

NOTE: Left panel: the logit scale. Right panel: the hazard scale. The penalty ratio is now very small, so estimating the join point from the data adds little to the standard error.

4. THE CUBIC-LINEAR SPLINE MODEL

This section discusses the maximum likelihood estimation of the join point in the cubic-linear spline model (2.9). We are particularly interested in assessing the increased standard error of quantities of interest, such as the hazard-rate differences between arms A and B of the cancer study, due to the estimation of the join. The discussion here is very brief. Efron (1986, sec. 4) gave more details.

Tables 6 and 7 numerically summarize the rather technical results of Efron (1986). In Table 6 we see the MLE's of the hazard rates for arms A and B, based on the cubic-

in the pictured differences of the two hazards than their individual standard errors would suggest. (It is important to note that this statement depends on choosing the same join for both estimated hazard rates.)

The results in Table 6 are based on a generalization of model (2.8):

$$\lambda_{i,\alpha,\phi} = x_i(\phi)\alpha, \qquad i = 1, 2, \ldots, N, \tag{4.1}$$
Table8. Deviancesas a Function It is easytosee theresults
ofdiscretizing
thiscontinuous
forArmsA and B situation.Supposethattheithdiscretetimeinterval has
centerpointtiand lengthAi. Then,the discretedensity
giJa -= i 2ga(t) dtis obtainedbya standardTaylorseries
Deviance 9.5 10 11 12 13 argument:
devA 50.928 48.016 47.519 47.370* 47.669
devB 34.011* 34.083 34.557 35.205 35.889 gi a = ga(ti)Ai + O(A ). (5.1)
Total 84.939 82.099 82.076* 82.575 83.558 thediscrete
Similarly, survival Gi,a=
function Ej2i gi and
NOTE: The totaldeviance is minimizedfor+ = 11.
discretehazard rate hia = gi, Gi,aare givenby
* Minima.
Gi,a = Ga(ti) + [ga(ti)I2] Ai + O(Aw)
where $Z$ is a standard one-sided exponential. This last result assumes that $\alpha_2 > 0$, so $G_\alpha(t)$ approaches 0 as $t \to \infty$. If $\alpha_2 < 0$, there is positive probability that the lifetime $T$ is infinite:

$$P_\alpha\{T = \infty\} = e^{\theta}, \qquad \theta \equiv [\exp(\alpha_1)]/\alpha_2 < 0. \tag{5.9}$$

With this understanding, (5.7) remains true as stated.

All of the discrete models considered previously have the potential for estimating $P_\alpha\{T = \infty\}$ to be positive. (This happened in both arms of the cancer study; see Remark G.) This is an advantage of our approach. In practice, it is often desirable to include the possibility of a positive survival fraction, but this can be clumsy to do with the usual parametric models for lifetime distributions. Miller (1981, sec. 2.4) gave a brief discussion.

More complicated examples of model (5.4), such as $\log h_\alpha(t) = \alpha_1 + \alpha_2 t + \alpha_3 t^2$, do not yield simple expressions for the cdf or density. This is unimportant, since the estimation of parameters depends only on the log hazard rate, which is particularly easy to use for model (5.4).

The parameter vector $\alpha$ in model (5.4) is estimated as follows: Let $n(t)$ be the number of patients at risk just before time $t$. We assume that the occurrence of observed deaths is a Poisson point process, with intensity $n(t)h_\alpha(t) = n(t)e^{x(t)\alpha}$ at time $t$. This is the limiting process obtained from (2.2) by letting the discrete time intervals decrease to zero length (see Efron 1977). Suppose that out of all $n$ patients we observed $m$ deaths, at times, say, $T_1, T_2, \ldots, T_m$, with the other $n - m$ patients being lost to follow-up at various times during the study. Define $S = \sum_{i=1}^{m} x(T_i)'$. The score vector $\dot{\ell}_\alpha$ for the Poisson process is

$$\dot{\ell}_\alpha = S - \int_0^\infty n(t)\,x(t)'\,e^{x(t)\alpha}\,dt, \tag{5.10}$$

so the MLE $\hat{\alpha}$ is given by

$$S = \int_0^\infty n(t)\,x(t)'\,e^{x(t)\hat{\alpha}}\,dt. \tag{5.11}$$

The observed Fisher information matrix for $\alpha$ is

$$-\ddot{\ell}_\alpha = \int_0^\infty n(t)\,x(t)'x(t)\,e^{x(t)\alpha}\,dt. \tag{5.12}$$

It is easy to see that (5.10) and (5.12) are simply the continuous analogs of (3.1) and (3.2). Conversely, one can look at (3.1) and (3.2) as convenient summation approximations to the integrals in (5.10)-(5.12). The connection between the discrete and continuous cases was drawn more carefully in Efron (1977), including a derivation of (5.10)-(5.12).

Remark G. A continuous cubic-linear spline model $\log h_\alpha(t) = \alpha_1 + \alpha_2 t + \alpha_3 (t - \phi)_-^2 + \alpha_4 (t - \phi)_-^3$ has

$$\log h_\alpha(t_1) = \log h_\alpha(t_0) + \alpha_2 (t_1 - t_0) \tag{5.13}$$

for values of $t_1$ and $t_0$ greater than the join point $\phi$. Calculations such as those for the Gompertz distribution (5.7) give

$$G_\alpha(t_1) = G_\alpha(t_0) \exp\left\{ -\frac{h_\alpha(t_0)}{\alpha_2} \left[ e^{\alpha_2 (t_1 - t_0)} - 1 \right] \right\}, \qquad t_1 \ge t_0 > \phi. \tag{5.14}$$

If $\alpha_2 < 0$, then $P_\alpha\{T = \infty\}$ is positive and can be found by letting $t_1 \to \infty$ in (5.14):

$$P_\alpha\{T = \infty\} = G_\alpha(t_0) \exp[-h_\alpha(t_0)/(-\alpha_2)]. \tag{5.15}$$

For both arms of the cancer study, the MLE $\hat{\alpha}_2$ was negative. Formula (5.15), with $t_0 = 47$ months for arm A and $t_0 = 77$ months for arm B, gives the following estimates for the survival fractions:

$$P_{\hat{\alpha}_A}\{T = \infty\} = .025, \qquad P_{\hat{\alpha}_B}\{T = \infty\} = .189. \tag{5.16}$$

Of course, estimates such as (5.16) should be interpreted with caution, since they represent heroic extrapolations beyond the observed data.

Remark H. This article concentrates on the one-sample situation, where all patients have the same survival curve $G_\alpha(t)$. Model (5.4) and its discrete analog extend easily to the regression situation, where patient $j$'s survival depends on a time-varying vector $z_j(t)$ of observed covariates, say

$$h_j(t) = \exp[x(t)\alpha + z_j(t)\beta]. \tag{5.17}$$

Model (5.17), and in particular its connection with Cox's partial likelihood, or proportional hazards, model was examined in Efron (1977). It is shown there that the fully parametric model (5.17) will usually not improve much on the partial likelihood model $h_j(t) = h_0(t) \exp[z_j(t)\beta]$, with $h_0(t)$ completely unspecified, at least not for the estimation of $\beta$. On the other hand, (5.17) can be effective in actually estimating the hazards $h_j(t)$, rather than just comparing them as the partial likelihood model does.

Remark I. Suppose that in the continuous Poisson-process situation (5.4), we discretize to situation (2.2). How much information is lost? For convenience assume that the continuous lifetime variate $T$ takes its values in the unit interval $[0, 1]$, and that the discretization of the data is into $N$ equal subintervals, as in Table 1. Let $\mathcal{I}_\alpha(N)$ be the Fisher information matrix for $\alpha$ based on the discrete data (2.2) (taking the independence assumption literally), and let $\mathcal{I}_\alpha(\infty)$ be the Fisher information matrix based on the original continuous data. Then, as $N \to \infty$,

$$\mathcal{I}_\alpha(N) \doteq \mathcal{I}_\alpha(\infty) - c/N^2, \tag{5.18}$$

where $c = (1/12) \int_0^1 x(t)'x(t)\,n(t)\,h(t)\,dt$. Here the function $n(t)$ is considered fixed at its observed value, even though it is random [like the $n_i$ in (2.2)].

Result (5.18) says that the information loss due to discretization goes to 0 very quickly as $N$ grows large. Various alternative discretizations were tried on the cancer-study data, such as discretizing arm A into the same intervals used for arm B in Table 2, with almost-imperceptible changes in the results.
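The positive survival fraction of (5.9) is easy to check numerically for a Gompertz-type log hazard $\log h(t) = a_1 + a_2 t$ with $a_2 < 0$. The sketch below is an illustration only: the parameter values are hypothetical (they are not the estimates from the cancer study), the survival function is obtained by integrating the hazard in closed form, and the function names are invented for the example.

```python
import math


def hazard(t, a1, a2):
    """Gompertz-type hazard with log h(t) = a1 + a2 * t."""
    return math.exp(a1 + a2 * t)


def survival(t, a1, a2):
    """G(t) = exp(-integral_0^t h(u) du), in closed form (requires a2 != 0)."""
    return math.exp(-math.exp(a1) * (math.exp(a2 * t) - 1.0) / a2)


def survival_numeric(t, a1, a2, steps=100000):
    """Same quantity by trapezoidal integration of the hazard, as a cross-check."""
    dt = t / steps
    total = 0.5 * (hazard(0.0, a1, a2) + hazard(t, a1, a2))
    for k in range(1, steps):
        total += hazard(k * dt, a1, a2)
    return math.exp(-total * dt)


def survival_fraction(a1, a2):
    """P{T = infinity} = exp(theta) with theta = exp(a1) / a2, valid only for a2 < 0."""
    if a2 >= 0:
        raise ValueError("a2 must be negative for a positive survival fraction")
    return math.exp(math.exp(a1) / a2)


A1, A2 = -2.5, -0.05  # hypothetical values chosen for illustration
print(survival(50.0, A1, A2), survival_numeric(50.0, A1, A2))
print(survival(500.0, A1, A2), survival_fraction(A1, A2))
```

For large $t$ the survival function levels off at the value returned by `survival_fraction`, and the same limit is obtained from any starting point $t_0$ as $G(t_0)\exp[h(t_0)/a_2]$, in line with (5.15).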