
Bayesian Analysis of Mixture Models with an Unknown Number of Components - An Alternative to Reversible Jump Methods
Author: Matthew Stephens
Source: The Annals of Statistics, Vol. 28, No. 1 (Feb., 2000), pp. 40-74
Published by: Institute of Mathematical Statistics
Stable URL: http://www.jstor.org/stable/2673981


BAYESIAN ANALYSIS OF MIXTURE MODELS WITH AN UNKNOWN NUMBER OF COMPONENTS - AN ALTERNATIVE TO REVERSIBLE JUMP METHODS

BY MATTHEW STEPHENS

University of Oxford
Richardson and Green present a method of performing a Bayesian analysis of data from a finite mixture distribution with an unknown number of components. Their method is a Markov chain Monte Carlo (MCMC) approach, which makes use of the "reversible jump" methodology described by Green. We describe an alternative MCMC method which views the parameters of the model as a (marked) point process, extending methods suggested by Ripley to create a Markov birth-death process with an appropriate stationary distribution. Our method is easy to implement, even in the case of data in more than one dimension, and we illustrate it on both univariate and bivariate data. There appears to be considerable potential for applying these ideas to other contexts, as an alternative to more general reversible jump methods, and we conclude with a brief discussion of how this might be achieved.

Received December 1998. Supported by an EPSRC studentship and a grant from the University of Oxford.
AMS 1991 subject classifications. Primary 62F15.
Key words and phrases. Bayesian analysis, birth-death process, Markov process, MCMC, mixture model, model choice, reversible jump, spatial point process.

1. Introduction. Finite mixture models are typically used to model data where each observation is assumed to have arisen from one of k groups, each group being suitably modeled by a density from some parametric family. The density of each group is referred to as a component of the mixture, and is weighted by the relative frequency of the group in the population. This model provides a framework by which observations may be clustered together into groups for discrimination or classification [see, e.g., McLachlan and Basford (1988)]. For a comprehensive list of such applications, see Titterington, Smith and Makov (1985). Mixture models also provide a convenient and flexible family of distributions for estimating or approximating distributions which are not well modeled by any standard parametric family, and provide a parametric alternative to non-parametric methods of density estimation, such as kernel density estimation. See, for example, Roeder (1990), West (1993) and Priebe (1994).

This paper is principally concerned with the analysis of mixture models in which the number of components k is unknown. In applications where the components have a physical interpretation, inference for k may be of interest in itself. Where the mixture model is being used purely as a parametric alternative to non-parametric density estimation, the value of k chosen affects the flexibility of the model and thus the smoothness of the resulting density estimate. Inference for k may then be seen as analogous to bandwidth selection in kernel density estimation. Procedures which allow k to vary may therefore be of interest whether or not k has a physical interpretation.

Inference for k may be seen as a specific example of the very common problem of choosing a model from a given set of competing models. Taking a Bayesian approach to this problem, as we do here, has the advantage that it provides not only a way of selecting a single "best" model, but also a coherent way of combining results over different models. In the mixture context this might include performing density estimation by taking an appropriate average of density estimates obtained using different values of k. While model choice (and model averaging) within the Bayesian framework are both theoretically straightforward, they often provide a computational challenge, particularly when (as here) the competing models are of differing dimension.

The use of Markov chain Monte Carlo (MCMC) methods [see Gilks, Richardson and Spiegelhalter (1996) for an introduction] to perform Bayesian analysis is now very common, but MCMC methods which are able to jump between models of differing dimension have become popular only recently, in particular through the use of the "reversible jump" methodology developed by Green (1995). Reversible jump methods allow the construction of an ergodic Markov chain with the joint posterior distribution of the parameters and the model as its stationary distribution. Moves between models are achieved by periodically proposing a move to a different model, and rejecting it with appropriate probability to ensure that the chain possesses the required stationary distribution. Ideally these proposed moves are designed to have a high probability of acceptance, so that the algorithm explores the different models adequately, though this is not always easy to achieve in practice. As usual in MCMC methods, quantities of interest may be estimated by forming sample path averages over simulated realizations of this Markov chain. The reversible jump methodology has now been applied to a wide range of model choice problems, including change-point analysis [Green (1995)], Quantitative Trait Locus analysis [Stephens and Fisch (1998)] and mixture models [Richardson and Green (1997)].

In this paper we present an alternative method of constructing an ergodic Markov chain with appropriate stationary distribution, when the number of components k is considered unknown. The method is based on the construction of a continuous time Markov birth-death process, as described by Preston (1976), with the appropriate stationary distribution. MCMC methods based on these (and related) processes have been used extensively in the point process literature to simulate realizations of point processes which are difficult to simulate from directly; an idea which originated with Kelly and Ripley (1976) and Ripley (1977) [see also Glotzl (1981), Stoyan, Kendall and Mecke (1987)]. These realizations can then be used for significance testing [as in Ripley (1977)], or likelihood inference for the parameters of the model [see, e.g., Geyer and Moller (1994) and references therein]. More recently such MCMC methods have been used to perform Bayesian inference for the parameters of a point process model, where the parameters themselves are (modeled by) a point process [see, e.g., Baddeley and van Lieshout (1993), Lawson (1996)].


In order to apply these MCMC methods to the mixture model context, we view the parameters of the model as a (marked) point process, with each point representing a component of the mixture. The MCMC scheme allows the number of components to vary by allowing new components to be "born" and existing components to "die." These births and deaths occur in continuous time, and the relative rates at which they occur determine the stationary distribution of the process. The relationship between these rates and the stationary distribution is formalized in Section 3 (Theorem 3.1). We then use this to construct an easily simulated process, in which births occur at a constant rate from the prior, and deaths occur at a rate which is very low for components which are critical in explaining the data, and very high for components which do not help explain the data. The accept-reject mechanism of reversible jump is thus replaced by a mechanism which allows both "good" and "bad" births to occur, but reverses bad births very quickly through a very quick death.

Our method is illustrated in Section 4, by fitting mixtures of normal (and t) distributions to univariate and bivariate data. We found that the posterior distribution of the number of components for a given data set typically depends heavily on modeling assumptions, such as the form of the distribution for the components (normals or ts) and the priors used for the parameters of these distributions. In contrast, predictive density estimates tend to be relatively insensitive to these modeling assumptions. Our method appears to have similar computational expense to that of Richardson and Green (1997) in the context of mixtures of univariate normal distributions, though direct comparisons are difficult. Both methods certainly give computationally tractable solutions to the problem, with rough results available in a matter of minutes. However, our approach appears the more natural and elegant in this context, exploiting the natural nested structure of the models and the exchangeability of the mixture components. As a result we remove the need for calculation of a complicated Jacobian, reducing the potential for making algebraic errors. In addition, the changes necessary to explore alternative models for the mixture components (replacing normals with t distributions, e.g.) are trivial.

We conclude with a discussion of the potential for extending the birth-death methodology (BDMCMC) to other contexts, as an alternative to more general reversible jump (RJMCMC) methods. One interpretation of BDMCMC is as a continuous-time version of RJMCMC, with a limit on the types of moves which are permitted in order to simplify implementation. BDMCMC is easily applied to any context where the parameters of interest may be viewed as a point process, and where the likelihood of these parameters may be calculated explicitly (this latter rules out Hidden Markov Models for example). We consider briefly some examples (a multiple change-point problem, and variable selection in regression models) where these conditions are fulfilled, and discuss the difficulties of designing suitable birth-death moves. Where such moves are sufficient to achieve adequate mixing, BDMCMC provides an attractive, easily-implemented alternative to more general RJMCMC schemes.


2. Bayesian methods for mixtures.

2.1. Notation and missing data formulation. We consider a finite mixture model in which data x^n = x_1, ..., x_n are assumed to be independent observations from a mixture density with k (k possibly unknown but finite) components,

(1)  p(x | π, φ, η) = π_1 f(x; φ_1, η) + ⋯ + π_k f(x; φ_k, η),

where π = (π_1, ..., π_k) are the mixture proportions, which are constrained to be non-negative and sum to unity; φ = (φ_1, ..., φ_k) are the (possibly vector) component specific parameters, with φ_i being specific to component i; and η is a (possibly vector) parameter which is common to all components. Throughout this paper p(· | ·) will be used to denote both conditional densities and distributions.

It is convenient to introduce the missing data formulation of the model, in which each observation x_j is assumed to arise from a specific but unknown component z_j of the mixture. The model (1) can be written identically in terms of the missing data, with z_1, ..., z_n assumed to be realizations of independent and identically distributed discrete random variables Z_1, ..., Z_n with probability mass function

(2)  Pr(Z_j = i | π, φ, η) = π_i   (j = 1, ..., n; i = 1, ..., k).

Conditional on the Zs, x_1, ..., x_n are assumed to be independent observations from the densities

(3)  p(x_j | Z_j = i, π, φ, η) = f(x_j; φ_i, η)   (j = 1, ..., n).

Integrating out the missing data Z_1, ..., Z_n then yields the model (1).

2.2. Hierarchical model. We assume a hierarchical model for the prior distribution on the parameters (k, π, φ, η). [For an alternative approach see Escobar and West (1995), who use a prior structure based on the Dirichlet process.] Specifically, we assume that the prior distribution for (k, π, φ), given hyperparameters ω and common component parameters η, has Radon-Nikodym derivative ("density") r(k, π, φ | ω, η) with respect to an underlying symmetric measure ℳ (defined below). For notational convenience we drop, for the rest of the paper, the explicit dependence of r(· | ω, η) on ω and η. To ensure exchangeability we require that, for any given k, r(·) is invariant under relabeling of the components, in that

(4)  r(k, (π_1, ..., π_k), (φ_1, ..., φ_k)) = r(k, (π_{ε(1)}, ..., π_{ε(k)}), (φ_{ε(1)}, ..., φ_{ε(k)}))

for all permutations ε of 1, ..., k, with (π_1, φ_1), ..., (π_k, φ_k) being exchangeable.

In order to define the symmetric measure ℳ we introduce some notation. Let 𝒰_{k−1} denote the Uniform distribution on the simplex 𝒮_{k−1} = {(π_1, ..., π_k) : π_i ≥ 0, π_1 + ⋯ + π_k = 1}.


Let Φ denote the parameter space for the φ_i (so φ_i ∈ Φ for all i), let ν be some measure on Φ, and let ν^k be the induced product measure on Φ^k. (For most of this paper Φ will be ℝ^m for some m, and ν can be assumed to be Lebesgue measure.) Now let μ_k be the product measure ν^k × 𝒰_{k−1} on Φ^k × 𝒮_{k−1}, and finally define ℳ to be the induced measure on the disjoint union ∪_{k≥1}(Φ^k × 𝒮_{k−1}).

A special case. Given ω and η, let k have prior probability mass distribution p(k | ω, η). Suppose φ and π are a priori independent given k, ω and η, with φ_1, ..., φ_k being independent and identically distributed from a distribution with density p̃(φ | ω, η) with respect to ν, and π having a uniform distribution on the simplex 𝒮_{k−1}. Then

(5)  r(k, π, φ) = p(k | ω, η) p̃(φ_1 | ω, η) ⋯ p̃(φ_k | ω, η).

Note that this special case includes the specific models used by Diebolt and Robert (1994) and Richardson and Green (1997) in the context of mixtures of univariate normal distributions.

2.3. Bayesian inference via MCMC. Given data x^n, Bayesian inference may be performed using MCMC methods, which involve the construction of a Markov chain {θ^(t)} with the posterior distribution p(θ | x^n) of the parameters θ = (k, π, φ, η) as its stationary distribution. Given suitable regularity conditions [see, e.g., Tierney (1996), page 65], quantities of interest may be consistently estimated by sample path averages. For example, if θ^(0), θ^(1), ... is a sampled realization of such a Markov chain, then inference for k may be based on an estimate of the marginal posterior distribution

(6)  Pr(k = i | x^n) = lim_{N→∞} (1/N) #{t ≤ N : k^(t) = i} ≈ (1/N) #{t ≤ N : k^(t) = i}   (N large),

and similarly the predictive density for a future observation may be estimated by

(7)  p̂(x_{n+1} | x^n) = (1/N) Σ_{t=1}^{N} p(x_{n+1} | θ^(t)).
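For concreteness, here is a minimal Python sketch (ours, not part of the paper) of how the estimates (6) and (7) are formed from stored sampler output; the `theta_samples` list and the `mixture_pdf` callable are assumed interfaces rather than anything specified in the text.

```python
import numpy as np
from collections import Counter

def estimate_k_posterior(k_samples):
    """Estimate Pr(k = i | x^n) as in (6): the fraction of sampled
    states with k^(t) = i."""
    counts = Counter(k_samples)
    N = len(k_samples)
    return {i: c / N for i, c in sorted(counts.items())}

def estimate_predictive_density(grid, theta_samples, mixture_pdf):
    """Estimate p(x_{n+1} | x^n) on a grid of points as in (7), by
    averaging the mixture density (1) over the sampled parameters;
    mixture_pdf(x, theta) evaluates (1) at x."""
    return np.mean([[mixture_pdf(x, theta) for x in grid]
                    for theta in theta_samples], axis=0)
```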


More details, including details of the construction of a suitable Markov chain when k is fixed, can be found in the paper by Diebolt and Robert (1994), chapters of the books by Robert (1994) and Gelman et al. (1995), and the article by Robert (1996). Richardson and Green (1997) describe the construction of a suitable Markov chain when k is allowed to vary, using the reversible jump methodology developed by Green (1995). We now describe an alternative approach.

3. Constructing a Markov chain via simulation of point processes.

3.1. The parameters as a point process. Our strategy is to view each component of the mixture as a point in parameter space, and adapt theory from the simulation of point processes to help construct a Markov chain with the posterior distribution of the parameters as its stationary distribution. Since, for given k, the prior distribution for (π, φ) defined at (4) does not depend on the labeling of the components, and the likelihood

(8)  L(k, π, φ, η) = p(x^n | k, π, φ, η) = ∏_{j=1}^{n} [π_1 f(x_j; φ_1, η) + ⋯ + π_k f(x_j; φ_k, η)]

is also invariant under permutations of the component labels, the posterior distribution

(9)  p(k, π, φ | x^n, ω, η) ∝ L(k, π, φ, η) r(k, π, φ)

will be similarly invariant. Fixing ω and η, we can thus ignore the labeling of the components and can consider any set of k parameter values {(π_1, φ_1), ..., (π_k, φ_k)} as a set of k points in [0, 1] × Φ, with the constraint that π_1 + ⋯ + π_k = 1 (see, e.g., Figure 1a). The posterior distribution p(k, π, φ | x^n, ω, η) can then be seen as a (suitably constrained) distribution of points in [0, 1] × Φ, or in other words a point process on [0, 1] × Φ. Equivalently, the posterior distribution can be seen as a marked point process in Φ, with each point φ_i having an associated mark π_i ∈ [0, 1], with the marks being constrained to sum to unity. This view of the parameters as a marked point process [which is also outlined by Dawid (1997)] allows us to use methods similar to those in Ripley (1977) to construct a continuous time Markov birth-death process with stationary distribution p(k, π, φ | x^n, ω, η), with ω and η kept fixed. Details of this construction are given in the next section. In Section 3.4 we combine this process with standard (fixed-dimension) MCMC update steps which allow ω and η to vary, to create a Markov chain with stationary distribution p(k, π, φ, ω, η | x^n).

3.2. Birth-death processes for the components of a mixture model. Let Ω_k denote the parameter space of the mixture model with k components, ignoring the labeling of the components, and let Ω = ∪_{k≥1} Ω_k. We will use set notation to refer to members of Ω, writing y = {(π_1, φ_1), ..., (π_k, φ_k)} ∈ Ω_k to represent the parameters of the model (1), keeping ω and η fixed, and so we may write (π_i, φ_i) ∈ y for i = 1, ..., k. Note that (for given ω and η) the invariance of L(·) and


[Figure 1 about here: three panels plotting component weight against component mean.]
FIG. 1. Illustration of births and deaths as defined by (10) and (11). (a) Representation of 0.2𝒩(−1, 1) + 0.6𝒩(1, 2) + 0.2𝒩(1, 3) as a set of points in parameter space, where 𝒩(μ, σ²) denotes the univariate normal distribution with mean μ and variance σ². (b) Resulting model after death of component 0.6𝒩(1, 2) in (a). (c) Resulting model after birth of component 0.2𝒩(0.5, 2) in (b).

r(·) under permutation of the component labels allows us to define L(y) and r(y) in an obvious way. We define births and deaths on Ω as follows:

Births: If at time t our process is at y = {(π_1, φ_1), ..., (π_k, φ_k)} ∈ Ω_k and a birth is said to occur at (π, φ) ∈ [0, 1] × Φ, then the process jumps to

(10)  y ∪ (π, φ) := {(π_1(1 − π), φ_1), ..., (π_k(1 − π), φ_k), (π, φ)} ∈ Ω_{k+1}.

Deaths: If at time t our process is at y = {(π_1, φ_1), ..., (π_k, φ_k)} ∈ Ω_k and a death is said to occur at (π_i, φ_i) ∈ y, then the process jumps to

(11)  y \ (π_i, φ_i) := {(π_1/(1 − π_i), φ_1), ..., (π_{i−1}/(1 − π_i), φ_{i−1}), (π_{i+1}/(1 − π_i), φ_{i+1}), ..., (π_k/(1 − π_i), φ_k)} ∈ Ω_{k−1}.

Thus a birth increases the number of components by one, while a death decreases the number of components by one. These definitions have been chosen so that births and deaths are inverse operations to each other, and the constraint π_1 + ⋯ + π_k = 1 remains satisfied after a birth or death; they are


illustrated in Figure 1. With births and deaths thus defined, we consider the following continuous time Markov birth-death process. When the process is at y ∈ Ω_k, let births and deaths occur as independent Poisson processes as follows:

Births: Births occur at overall rate β(y), and when a birth occurs it occurs at a point (π, φ) ∈ [0, 1] × Φ, chosen according to density b(y; (π, φ)) with respect to the product measure 𝒰 × ν, where 𝒰 is the uniform (Lebesgue) measure on [0, 1].

Deaths: When the process is at y = {(π_1, φ_1), ..., (π_k, φ_k)}, each point (π_j, φ_j) dies independently of the others as a Poisson process with rate

(12)  δ_j(y) = d(y \ (π_j, φ_j); (π_j, φ_j))

for some d : Ω × ([0, 1] × Φ) → ℝ⁺. The overall death rate is then given by δ(y) = Σ_j δ_j(y).

The time to the next birth/death event is then exponentially distributed with mean 1/(β(y) + δ(y)), and it will be a birth with probability β(y)/(β(y) + δ(y)), and a death of component j with probability δ_j(y)/(β(y) + δ(y)). In order to ensure that the birth-death process does not jump to an area with zero "density," we impose the following conditions on b and d:

(13)  b(y; (π, φ)) = 0 whenever r(y ∪ (π, φ)) L(y ∪ (π, φ)) = 0,

(14)  d(y; (π, φ)) = 0 whenever r(y) L(y) = 0.

The following theorem then gives sufficient conditions on b and d for this process to have stationary distribution p(k, π, φ | x^n, ω, η).

THEOREM 3.1. Assuming the general hierarchical prior on (k, π, φ) given in Section 2.2, and keeping ω and η fixed, the birth-death process defined above has stationary distribution p(k, π, φ | x^n, ω, η), provided b and d satisfy

(15)  (k + 1) d(y; (π, φ)) r(y ∪ (π, φ)) L(y ∪ (π, φ)) k(1 − π)^{k−1} = β(y) b(y; (π, φ)) r(y) L(y)

for all y ∈ Ω_k and (π, φ) ∈ [0, 1] × Φ.

PROOF. The proof is deferred to the Appendix.

3.3. Naive algorithm for a special case. We now consider the special case described at (5), where

(16)  r(y) = p(k | ω, η) p̃(φ_1 | ω, η) ⋯ p̃(φ_k | ω, η).

Suppose that we can simulate from p̃(· | ω, η), and consider the process obtained by setting β(y) = λ_b (a constant), with

b(y; (π, φ)) = k(1 − π)^{k−1} p̃(φ | ω, η).

Applying Theorem 3.1 we find that the process has the correct stationary distribution, provided that when the process is at y = {(π_1, φ_1), ..., (π_k, φ_k)},


each point (π_j, φ_j) dies independently of the others as a Poisson process with rate

(17)  d(y \ (π_j, φ_j); (π_j, φ_j)) = λ_b (L(y \ (π_j, φ_j)) / L(y)) (p(k − 1 | ω, η) / (k p(k | ω, η)))   (j = 1, ..., k).

Algorithm 3.1 below simulates this process. We note that the algorithm is very straightforward to implement, requiring only the ability to simulate from p̃(· | ω, η), and to calculate the model likelihood for any given model. The main computational burden is in calculating the likelihood, and it is important that calculations of densities are stored and reused where possible.

ALGORITHM 3.1. To simulate a process with appropriate stationary distribution. Starting with initial model y = {(π_1, φ_1), ..., (π_k, φ_k)} ∈ Ω_k, iterate the following steps:

1. Let the birth rate β(y) = λ_b.
2. Calculate the death rate for each component, the death rate for component j being given by (17):

(18)  δ_j(y) = λ_b (L(y \ (π_j, φ_j)) / L(y)) (p(k − 1 | ω, η) / (k p(k | ω, η)))   (j = 1, ..., k).

3. Calculate the total death rate δ(y) = Σ_j δ_j(y).
4. Simulate the time to the next jump from an exponential distribution with mean 1/(β(y) + δ(y)).
5. Simulate the type of jump: birth or death, with respective probabilities

Pr(birth) = β(y)/(β(y) + δ(y)),   Pr(death) = δ(y)/(β(y) + δ(y)).

6. Adjust y to reflect the birth or death [as defined by (10) and (11)]:

Birth: Simulate the point (π, φ) at which a birth takes place from the density b(y; (π, φ)) = k(1 − π)^{k−1} p̃(φ | ω, η), by simulating π and φ independently from the densities k(1 − π)^{k−1} and p̃(φ | ω, η) respectively. We note that the former is the Beta distribution with parameters (1, k), which is easily simulated from by simulating Y_1 ∼ Γ(1, 1) and Y_2 ∼ Γ(k, 1) and setting π = Y_1/(Y_1 + Y_2), where Γ(n, λ) denotes the Gamma distribution with mean n/λ.

Death: Select a component to die, (π_j, φ_j) ∈ y being selected with probability δ_j(y)/δ(y) for j = 1, ..., k.

7. Return to step 2.

REMARK 3.2. Algorithm 3.1 seems rather naive in that births occur (in some sense) from the prior, which may lead to many births of components which do not help to explain the data. Such components will have a high death rate (17) and so will die very quickly, which is inefficient in the same way as an accept-reject simulation algorithm is inefficient if many samples are rejected. However, in the examples we consider in the next section this naive algorithm performs reasonably well, and so we have not considered any cleverer choices of b(y; (π, φ)) which may allow births to occur in a less naive way (see Section 5.2 for further discussion).
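To make the steps above concrete, here is a minimal Python sketch (ours, not from the paper) of Algorithm 3.1 for a univariate normal mixture with a common, known variance; the birth density `prior_phi` and its default parameters are placeholders, and we anticipate the choice λ_b = λ made in Section 4, under which the prior ratio in (18) equals 1/λ and the death rates reduce to likelihood ratios.

```python
import numpy as np

def log_lik(x, y, sigma2=1.0):
    """Log likelihood (8) for y = [(pi_i, mu_i), ...]: normal components
    with common variance sigma2 (a simplification for this sketch)."""
    pis = np.array([p for p, _ in y])
    mus = np.array([m for _, m in y])
    dens = np.exp(-0.5 * (x[:, None] - mus) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
    return np.log(dens @ pis).sum()

def birth_death_sweep(x, y, t0=1.0, lam=3.0,
                      prior_phi=lambda: np.random.normal(20.0, 10.0)):
    """Run the birth-death process of Algorithm 3.1 for virtual time t0,
    with beta(y) = lam_b = lam, so that the death rates (18) become the
    likelihood ratios L(y \\ (pi_j, phi_j)) / L(y)."""
    t = 0.0
    while True:
        k = len(y)
        ll = log_lik(x, y)
        delta = np.zeros(k)                        # step 2: death rates (18)
        for j in range(k):
            if k == 1:
                break                              # p(k = 0) = 0, so the last component cannot die
            rest = [(p / (1.0 - y[j][0]), m) for i, (p, m) in enumerate(y) if i != j]
            delta[j] = np.exp(min(log_lik(x, rest) - ll, 500.0))
        total = lam + delta.sum()                  # step 3: beta(y) + delta(y)
        t += np.random.exponential(1.0 / total)    # step 4: time to next event
        if t > t0:
            return y
        if np.random.rand() < lam / total:         # steps 5-6: birth, as in (10)
            pi = np.random.beta(1, k)              # density k(1 - pi)^(k-1)
            y = [(p * (1.0 - pi), m) for p, m in y] + [(pi, prior_phi())]
        else:                                      # steps 5-6: death, as in (11)
            j = np.random.choice(k, p=delta / delta.sum())
            y = [(p / (1.0 - y[j][0]), m) for i, (p, m) in enumerate(y) if i != j]
```

A call such as `birth_death_sweep(np.asarray(data, dtype=float), [(1.0, float(np.mean(data)))])` performs one Step 1 sweep of Algorithm 3.2 below; as the text notes, a serious implementation would cache and reuse the per-component density calculations rather than recomputing the likelihood from scratch.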
3.4. Constructing a Markov chain. If we fix ω and η, then Algorithm 3.1 simulates a birth-death process with stationary distribution p(k, π, φ | x^n, ω, η). This can be combined with MCMC update steps which allow ω and η to vary, to create a Markov chain with stationary distribution p(k, π, φ, ω, η | x^n). By augmenting the data x^n by the missing data z^n = (z_1, ..., z_n) described in Section 2.1, and assuming the existence and use of the necessary conjugate priors, we can use Gibbs sampling steps to achieve this as in Algorithm 3.2 below; Metropolis-Hastings updates could also be used, removing the need to introduce the missing data or use conjugate priors.

ALGORITHM 3.2. To simulate a Markov chain with appropriate stationary distribution. Given the state θ^(t) at time t, simulate a value for θ^(t+1) as follows:

Step 1. Sample (k', π', φ') by running the birth-death process for a fixed time t_0, starting from (k^(t), π^(t), φ^(t)) and fixing (ω, η) to be (ω^(t), η^(t)). Set k^(t+1) = k'.
Step 2. Sample (z^n)^(t+1) from p(z^n | k^(t+1), π', φ', ω^(t), η^(t), x^n).
Step 3. Sample ω^(t+1), η^(t+1) from p(ω, η | k^(t+1), π', φ', x^n, (z^n)^(t+1)).
Step 4. Sample π^(t+1), φ^(t+1) from p(π, φ | k^(t+1), ω^(t+1), η^(t+1), x^n, (z^n)^(t+1)).

Provided the full conditional posterior distributions for each parameter give support to all parts of the parameter space, this will define an irreducible Markov chain with stationary distribution p(k, π, φ, ω, η, z^n | x^n), suitable for estimating quantities of interest by forming sample path averages as in (6) and (7). The proof is straightforward and is omitted here [see Stephens (1997), page 84]. Step 1 of the algorithm allows movement between different values of k by allowing new components to be "born," and existing components to "die." Steps 2, 3 and 4 allow the parameters to vary with k kept fixed. Step 4 is not strictly necessary to ensure convergence of the Markov chain to the correct stationary distribution, but is included to improve mixing. Note that (as usual in Gibbs sampling) the algorithm remains valid if any or all of ω, η, π and φ are partitioned into separate components which are updated one at a time by a Gibbs sampling step, as will be the case in our examples.
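The overall chain can then be organized as in the following sketch (again ours, not the paper's code): the three Gibbs-update callables are assumed to sample from the relevant full conditionals, such as (26)-(32) of Section 4, and `birth_death_sweep` is the sketch given after Algorithm 3.1.

```python
def run_chain(x, n_iter, update_z, update_hyper, update_pi_phi,
              y0, hyper0, t0=1.0, lam=3.0):
    """Algorithm 3.2: alternate a birth-death sweep (Step 1) with Gibbs
    updates for z^n, (omega, eta) and (pi, phi) (Steps 2-4), recording
    the state so that (6) and (7) can be estimated afterwards."""
    y, hyper, samples = y0, hyper0, []
    for _ in range(n_iter):
        y = birth_death_sweep(x, y, t0=t0, lam=lam)   # Step 1: k may change
        z = update_z(x, y, hyper)                     # Step 2
        hyper = update_hyper(x, y, z, hyper)          # Step 3
        y = update_pi_phi(x, z, hyper, k=len(y))      # Step 4 (improves mixing)
        samples.append((len(y), y, hyper))
    return samples
```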

4. Examples. Our examples demonstrate the use of Algorithm 3.2 to perform inference in the context of both univariate and bivariate data x^n, which are assumed to be independent observations from a mixture of an unknown (finite) number of normal distributions:

(19)  p(x | π, μ, Σ) = π_1 𝒩(x; μ_1, Σ_1) + ⋯ + π_k 𝒩(x; μ_k, Σ_k).

Here 𝒩(x; μ_i, Σ_i) denotes the density function of the r-dimensional multivariate normal distribution with mean μ_i and variance-covariance matrix Σ_i. In the univariate case (r = 1) we may write σ_i² for Σ_i.

Prior distributions. We assume a truncated Poisson prior on the number of components k:

(20)  p(k) ∝ λ^k / k!   (k = 1, ..., k_max = 100),

where λ is a constant; we will perform analyses with several different values of λ. Conditional on k we base our prior for the model parameters on the hierarchical prior suggested by Richardson and Green (1997) in the context of mixtures of univariate normal distributions. A natural generalization of their prior to r dimensions is obtained by replacing univariate normal distributions with multivariate normal distributions, and replacing gamma distributions with Wishart distributions, to give

(21)  μ_i ∼ 𝒩(ξ, κ^{−1})   (i = 1, ..., k),

(22)  Σ_i^{−1} | β ∼ 𝒲_r(2α, (2β)^{−1})   (i = 1, ..., k),

(23)  β ∼ 𝒲_r(2g, (2h)^{−1}),

(24)  π ∼ 𝒟(γ),

where β is a hyperparameter; ξ is an r × 1 vector; κ, β and h are r × r matrices; α, γ and g are scalars; 𝒟(γ) denotes the symmetric Dirichlet distribution with parameter γ and density

p(π_1, ..., π_k) = (Γ(kγ) / Γ(γ)^k) π_1^{γ−1} ⋯ π_{k−1}^{γ−1} (1 − π_1 − ⋯ − π_{k−1})^{γ−1};

and 𝒲_r(m, A) denotes the Wishart distribution in r dimensions with parameters m and A. This last is usually introduced as the distribution of the sample covariance matrix for a sample of size m from a multivariate normal distribution in r dimensions with covariance matrix A. Because of this interpretation m is usually taken as an integer, and for m > r, 𝒲_r(m, A) has density

(25)  𝒲_r(V; m, A) = c |A|^{−m/2} |V|^{(m−r−1)/2} exp{−½ tr(A^{−1}V)} I(V positive definite)

on the space of all symmetric matrices (which may be identified with ℝ^{r(r+1)/2}), where I(·) denotes an indicator function and

c^{−1} = 2^{mr/2} π^{r(r−1)/4} ∏_{s=1}^{r} Γ((m + 1 − s)/2).

However, (25) also defines a density for non-integer m, provided m > r − 1. Methods of simulating from the Wishart distribution (which work for non-integer m > r − 1) may be found in Ripley (1987). For m < r − 1 we will use 𝒲_r(m, A) to represent the improper distribution with density proportional to (25). (This is not the usual definition of 𝒲_r(m, A) for m < r − 1, which is a singular distribution confined to a subspace of symmetric matrices.) Where an improper prior distribution is used, it is important to check the integrability of the posterior.

For univariate data we follow Richardson and Green (1997), who take (ξ, κ, α, g, h, γ) to be (data-dependent) constants with the following values:

ξ = ξ_1,   κ = 1/R_1²,   α = 2,   g = 0.2,   h = 100g/(αR_1²),   γ = 1,

where ξ_1 is the midpoint of the observed interval of variation of the data, and R_1 is the length of this interval. The value α = 2 was chosen to express the belief that the variances of the components are similar, without restricting them to be equal. For bivariate data (r = 2) we felt that a slightly stronger constraint would be appropriate, and so increased α to 3, making a corresponding change in g and obvious generalizations for the other constants to give

ξ = (ξ_1, ξ_2),   κ = diag(1/R_1², 1/R_2²),   α = 3,   g = 0.3,   h = diag(100g/(αR_1²), 100g/(αR_2²)),   γ = 1,

where ξ_1 and ξ_2 are the midpoints of the observed intervals of variation of the data in the first and second dimension respectively, and R_1 and R_2 are the respective lengths of these intervals. We note that the prior on β in the bivariate case, β ∼ 𝒲_2(0.6, (2h)^{−1}), is an improper distribution, but careful checking of the necessary integrals shows that the posterior distributions are proper.
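In the univariate case the Wisharts above reduce to gamma distributions, and a draw from the prior (20)-(24) is then straightforward. The sketch below (ours, not from the paper) uses the data-dependent constants just given; drawing (μ_i, σ_i²) given β is exactly the p̃(· | ω, η) needed by the birth mechanism of Algorithm 3.1.

```python
import numpy as np
from scipy.special import gammaln

def draw_univariate_prior(x, lam=3.0, kmax=100, rng=None):
    """One draw of (k, pi, mu, sigma2, beta) from the prior (20)-(24)
    for univariate data, with xi = midpoint, kappa = 1/R^2, alpha = 2,
    g = 0.2, h = 100g/(alpha R^2), gamma = 1 as in the text above.
    Note W_1(2a, (2b)^{-1}) is the Gamma(a, b) distribution."""
    rng = rng or np.random.default_rng()
    xi, R = (x.max() + x.min()) / 2.0, x.max() - x.min()
    kappa, alpha, g, gam = 1.0 / R**2, 2.0, 0.2, 1.0
    h = 100.0 * g / (alpha * R**2)
    ks = np.arange(1, kmax + 1)                          # (20): truncated Poisson
    logp = ks * np.log(lam) - gammaln(ks + 1)
    p = np.exp(logp - logp.max())
    k = int(rng.choice(ks, p=p / p.sum()))
    beta = rng.gamma(g, 1.0 / h)                         # (23): Gamma(g, h), mean g/h
    mu = rng.normal(xi, kappa ** -0.5, size=k)           # (21)
    sigma2 = 1.0 / rng.gamma(alpha, 1.0 / beta, size=k)  # (22): sigma^{-2} ~ Gamma(alpha, beta)
    pi = rng.dirichlet(np.full(k, gam))                  # (24): symmetric Dirichlet
    return k, pi, mu, sigma2, beta
```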


In our examples we consider the following priors:

1. The Fixed-κ prior, which is the name we give to the prior given above. The full conditional posterior distributions required for the Gibbs sampling updates (Steps 2-4 in Algorithm 3.2) are then (using · | ⋯ to denote conditioning on all other variables)

(26)  Pr(z_j = i | ⋯) ∝ π_i 𝒩(x_j; μ_i, Σ_i),

(27)  β | ⋯ ∼ 𝒲_r(2g + 2kα, [2h + 2 Σ_i Σ_i^{−1}]^{−1}),

(28)  π | ⋯ ∼ 𝒟(γ + n_1, ..., γ + n_k),

(29)  μ_i | ⋯ ∼ 𝒩((n_i Σ_i^{−1} + κ)^{−1}(n_i Σ_i^{−1} x̄_i + κξ), (n_i Σ_i^{−1} + κ)^{−1}),

(30)  Σ_i^{−1} | ⋯ ∼ 𝒲_r(2α + n_i, [2β + Σ_{j: z_j = i}(x_j − μ_i)(x_j − μ_i)ᵀ]^{−1}),

for i = 1, ..., k and j = 1, ..., n, where n_i is the number of observations allocated to class i (n_i = #{j : z_j = i}) and x̄_i is the mean of the observations allocated to class i (x̄_i = Σ_{j: z_j = i} x_j / n_i). The Gibbs sampling updates were performed in the order β, π, μ, Σ.

2. The Variable-κ prior, in which ξ and κ are also treated as hyperparameters on which we place "vague" priors. This is an attempt to represent the belief that the means will be close together when viewed on some scale, without being informative about their actual location. It is also an attempt to address some of the objections to the Fixed-κ prior discussed in Section 5.1. We chose to place an improper uniform distribution on ξ, and a "vague" 𝒲_r(l, (lI_r)^{−1}) distribution on κ, where I_r is the r × r identity matrix. In order to ensure the posterior is proper, this distribution for κ is required to be proper, and so we require l > r − 1. We used l = r − 1 + 0.001 as our default value for l. (In general, fixing a distribution to be proper in this way is not a good idea. However, in this case it can be shown that if l = r − 1 + ε then inference for μ, Σ and k is not sensitive to ε for small ε, although numerical problems may occur for very small ε.) The full conditional posteriors are then as for the Fixed-κ prior, with the addition of

(31)  ξ | ⋯ ∼ 𝒩(μ̄, (kκ)^{−1}),

(32)  κ | ⋯ ∼ 𝒲_r(l + k, (lI_r + SS_ξ)^{−1}),

where μ̄ = Σ_i μ_i / k and SS_ξ = Σ_i (μ_i − ξ)(μ_i − ξ)ᵀ. The Gibbs sampling updates in Algorithm 3.2 were performed in the order β, κ, ξ, π, μ, Σ.

These priors are both examples of the special case considered in Section 3.3, and so Algorithm 3.1 can be used. They may be viewed as convenient for the purposes of illustration, and we warn against considering them as "non-informative" or "weakly informative." In particular, we will see that inference for k can be highly sensitive to the priors used. Further discussion is deferred to Section 5.1.
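As an illustration of the fixed-k updates (26)-(30), here is a short Python sketch (ours; the `hyp` dictionary and the function signature are our own conventions) for the bivariate Fixed-κ case. scipy's wishart is parameterized by degrees of freedom and a scale matrix, matching 𝒲_r(m, A) above; the Variable-κ prior would add the updates (31)-(32).

```python
import numpy as np
from scipy.stats import wishart, multivariate_normal as mvn

def gibbs_sweep(x, pi, mu, Sig_inv, beta, hyp, rng):
    """One pass of (26)-(30) for r-variate normal components under the
    Fixed-kappa prior.  x is (n, r); Sig_inv is (k, r, r); hyp holds
    the constants xi, kappa, alpha, g, h, gamma of Section 4."""
    n, r = x.shape
    k = len(pi)
    xi, kappa, alpha = hyp["xi"], hyp["kappa"], hyp["alpha"]
    # (26): reallocate the observations
    logw = np.stack([np.log(pi[i]) + mvn.logpdf(x, mu[i], np.linalg.inv(Sig_inv[i]))
                     for i in range(k)], axis=1)
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(k, p=w[j]) for j in range(n)])
    counts = np.bincount(z, minlength=k)
    # (27): update the hyperparameter beta
    beta = wishart.rvs(df=2 * hyp["g"] + 2 * k * alpha,
                       scale=np.linalg.inv(2 * hyp["h"] + 2 * Sig_inv.sum(axis=0)),
                       random_state=rng)
    # (28): update the weights
    pi = rng.dirichlet(hyp["gamma"] + counts)
    for i in range(k):
        xs = x[z == i]
        # (29): update mu_i
        V = np.linalg.inv(counts[i] * Sig_inv[i] + kappa)
        m = V @ (Sig_inv[i] @ xs.sum(axis=0) + kappa @ xi)
        mu[i] = rng.multivariate_normal(m, V)
        # (30): update Sigma_i^{-1}
        S = (xs - mu[i]).T @ (xs - mu[i])
        Sig_inv[i] = wishart.rvs(df=2 * alpha + counts[i],
                                 scale=np.linalg.inv(2 * beta + S),
                                 random_state=rng)
    return z, pi, mu, Sig_inv, beta
```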


Values for (t_0, λ_b). Algorithm 3.1 requires the specification of a birth rate λ_b, and Algorithm 3.2 requires the specification of a (virtual) time t_0 for which the birth-death process is run. Doubling λ_b is mathematically equivalent to doubling t_0, and so we are free to fix t_0 = 1 and specify a value for λ_b. In all our examples we used λ_b = λ [the parameter of the Poisson prior in (20)], which gives a convenient form of the death rates (18) as a likelihood ratio which does not depend on λ: for the prior (20) we have p(k − 1)/(k p(k)) = 1/λ, so that (18) reduces to δ_j(y) = L(y \ (π_j, φ_j))/L(y). Larger values of λ_b will result in better mixing over k, at the cost of more computation time per iteration of Algorithm 3.2, and it is not clear how an optimal balance between these factors should be achieved.

4.1. Example 1: Galaxy data. As our first example we consider the galaxy data first presented by Postman, Huchra and Geller (1986), consisting of the velocities (in 10³ km/s) of distant galaxies diverging from our own, from six well-separated conic sections of the Corona Borealis. The original data consist of 83 observations, but one of these observations (a velocity of 5.607 × 10³ km/s) does not appear in the version of the data given by Roeder (1990), which has since been analyzed under a variety of mixture models by a number of authors, including Crawford (1994), Chib (1995), Carlin and Chib (1995), Escobar and West (1995), Phillips and Smith (1996) and Richardson and Green (1997). In order to make our analysis comparable with these we have chosen to ignore the missing observation. A histogram of the data overlaid with a Gaussian kernel density estimate is shown in Figure 2. The multimodality of the velocities may indicate the presence of super clusters of galaxies surrounded by large voids, each mode representing a cluster as it moves away at its own speed [Roeder (1990) gives more background].

[Figure 2 about here: histogram with overlaid density estimate; x axis velocity.]
FIG. 2. Histogram of the galaxy data, with bin-widths chosen by eye. Since histograms are rather unreliable density estimation devices [see, e.g., Roeder (1990)], we have overlaid the histogram with a non-parametric density estimate using Gaussian kernel density estimation, with bandwidth chosen automatically according to a rule given by Sheather and Jones (1991), calculated using the S function width.SJ from Venables and Ripley (1997).


We use Algorithm 3.2 to fit the following mixture models to the galaxy data:

(a) A mixture of normal distributions using the Fixed-κ prior described in Section 4.
(b) A mixture of normal distributions using the Variable-κ prior described in Section 4.
(c) A mixture of t distributions on p = 4 degrees of freedom:

(33)  p(x | π, μ, σ²) = π_1 t_p(x; μ_1, σ_1²) + ⋯ + π_k t_p(x; μ_k, σ_k²),

where t_p(x; μ_i, σ_i²) is the density of the t-distribution with p degrees of freedom, with mean μ_i and variance pσ_i²/(p − 2) [see, e.g., Gelman et al. (1995), page 476]. The value p = 4 was chosen to give a distribution similar to the normal distribution with slightly "fatter tails," since there was some evidence when fitting the normal distributions that extra components were being used to create longer tails. We used the Fixed-κ prior for (π, μ, σ²). Adjusting the birth-death algorithm to fit t distributions is simply a matter of replacing the normal density with the t density when calculating the likelihood. The Gibbs sampling steps are performed as explained in Stephens (1997).

We will refer to these three models as "Normal, Fixed-κ," "Normal, Variable-κ" and "t₄, Fixed-κ" respectively. For each of the three models we performed the analysis with four different values of the parameter λ (the parameter of the truncated Poisson prior on k): 1, 3, 6 and 25. The choice of λ = 25 was considered in order to give some idea of how the method would behave as λ was allowed to get very large.

Starting points, computational expense and mixing behavior. For each prior we performed 20,000 iterations of Algorithm 3.2, with the starting point being chosen by setting k = 1, setting (ξ, κ) to the values chosen for the Fixed-κ prior, and sampling the other parameters from their joint prior distribution. In each case the sampler moved quickly from the low likelihood of the starting point to an area of parameter space with higher likelihood. The computational expense was not great. For example, the runs for λ = 3 took 150-250 seconds (CPU times on a Sun UltraSparc 200 workstation, 1997), which corresponds to about 80-130 iterations per second. Roughly the same amount of time was spent performing the Gibbs sampling steps as performing the birth-death calculations. The main expense of the birth-death process calculations is in calculating the model likelihood, and a significant saving could be made by using a look-up table for the normal density (this was not done).

In assessing the convergence and mixing properties of our algorithm we follow Richardson and Green (1997) in examining firstly the mixing over k, and then the mixing over the other parameters within k. Figure 3a shows the sampled values of k for the runs with λ = 3. A rough idea of how well the algorithm is exploring the space may be obtained from the percentages of iterations which changed k, which in this case were 36%, 52% and 38% for models (a)-(c) respectively. More information can be obtained from the autocorrelations of the sampled values of k (Figure 3b), which show that successive samples

[Figure 3 about here: (a) sampled values of k; (b) autocorrelations for sampled values of k.]
FIG. 3. Results from using Algorithm 3.2 to fit the three different models to the galaxy data using λ = 3. The columns show results for Left: Normal, Fixed-κ; Middle: Normal, Variable-κ; Right: t₄, Fixed-κ.
have a high autocorrelation. This is due to the fact that k tends to change by at most one in each iteration, and so many iterations are required to move between small and large values of k.

In order to obtain a comparison with the performance of the reversible jump sampler of Richardson and Green (1997), we also performed runs with the prior they used for this data; namely a uniform prior on k = 1, ..., 30 and the Fixed-κ prior on the parameters. For this prior our sampler took 170 seconds and changed k in 34% of iterations, which compares favorably with the 11-18% of iterations obtained by Richardson and Green (1997) using the reversible jump sampler (their Table 1). We also tried applying the convergence diagnostic suggested by Gelman and Rubin (1992), which requires more than one chain to be run from over-dispersed starting points [see the reviews by Cowles and Carlin (1996) or Brooks and Roberts (1998) for alternative diagnostics]. Based on four chains of length 20,000, with two started from k = 1 and two started from k = 30, convergence was diagnosed for the output of Algorithm 3.2 within 2500 iterations.

Richardson and Green (1997) note that allowing k to vary can result in much improved mixing behavior of the sampler over the mixture model parameters within k. For example, if we fix k and use Gibbs sampling to fit k = 3 t₄ distributions to the galaxy data with the Fixed-κ prior, there are two well-separated modes (a major mode with means near 10, 20 and 23, and a minor mode with means near 10, 21 and 34). Our Gibbs sampler with fixed


k struggled to move between these modes, moving from major mode to minor mode and back only once in 10,000 iterations (results not shown). We applied Algorithm 3.2 to this problem, using λ = 1. Of the 10,000 points sampled, there were 1913 visits to k = 3, during which the minor mode was visited on at least 6 different occasions (Figure 4). In this case the improved mixing behavior results from the ability to move between the modes for k = 3 via states with k = 4: that is (roughly speaking), from the major mode to the minor mode via a four component model with means near 10, 20, 23 and 34. If we are genuinely only interested in the case k = 3, then the improved mixing behavior of the variable k sampler must be balanced against its increased computational cost, particularly as we generated only 1913 samples from k = 3 in 10,000 iterations of the sampler. By truncating the prior on k to allow only k = 3 and k = 4, and using λ = 0.1 to favor the 3 component model strongly, we were able to increase this to 7371 samples with k = 3 in 10,000 iterations, with about 6 separate visits to the minor mode. Alternative strategies for obtaining a sample from the birth-death process conditional on a fixed value of k are given by Ripley (1977).

Inference. The results in this section are based on runs of length 20,000, with the first 10,000 iterations being discarded as burn-in; numbers we believe to be large enough to give meaningful results, based on our investigations of the mixing properties of our chain. Estimates of the posterior distribution of k (Figure 5) show that it is highly sensitive to the prior used, both in terms of choice of λ and the prior (Variable-κ or Fixed-κ) used on the parameters (μ, σ²). Corresponding estimates of the predictive density (Figure 6) show

[Figure 4 about here: traces of the three component means against sample point.]
FIG. 4. Sampled values of means for three components, sampled using Algorithm 3.2 when fitting a variable number of t₄ components to the galaxy data, with Fixed-κ prior, λ = 1, and conditioning the resulting output on k = 3. The output is essentially "unlabeled," and so labeling of the points was achieved by applying Algorithm 3.3 of Stephens (1997). The variable k sampler visits the minor mode at least 6 separate times in 1913 iterations, compared with once in 10,000 iterations for a fixed k sampler.


[Figure 5 about here: bar charts of the estimated posterior of k, one row per value of λ.]
FIG. 5. Graphs showing estimates (6) of Pr(k = i) for i = 1, 2, ..., for the galaxy data. These estimates are based on the values of k sampled using Algorithm 3.2 when fitting the three different models to the galaxy data with λ = 1, 3, 6, with in each case the first 10,000 samples having been discarded as burn-in. The three columns show results for Left: Normal, Fixed-κ; Middle: Normal, Variable-κ; Right: t₄, Fixed-κ. The posterior distribution of k can be seen to depend on the type of mixture used (normal or t₄), the prior distribution for k (value of λ), and the prior distribution for (μ, σ²) (Variable-κ or Fixed-κ).

[Figure 6 about here: predictive density estimates for λ = 1, 3, 6, 25, one row per value of λ.]
FIG. 6. Predictive density estimates (7) for the galaxy data. These are based on the output of Algorithm 3.2 when fitting the three different models to the galaxy data with λ = 1, 3, 6, 25. The three columns show results for Left: Normal, Fixed-κ; Middle: Normal, Variable-κ; Right: t₄, Fixed-κ. The density estimates become less smooth as λ increases, corresponding to a prior distribution which favors a larger number of components. However, the method appears to perform acceptably for even unreasonably large values of λ.


follows. Suppose we assume that the data has arisen from either a mixture of t₄s or a mixture of normals, with p(t₄) = p(normal) = 0.5. For the Fixed-κ prior with λ = 1 we can estimate p(k | t₄, x^n) and p(k | normal, x^n) using Algorithm 3.2 (Table 1). By Bayes' theorem we have

(34)  p(k, t₄ | x^n) = p(x^n | k, t₄) p(k, t₄) / p(x^n) = p(k | t₄, x^n) p(t₄ | x^n)   for all k,

and so

(35)  p(t₄ | x^n) = p(x^n | k, t₄) p(k, t₄) / [p(k | t₄, x^n) p(x^n)]   for all k,

and similarly

p(normal | x^n) = p(x^n | k, normal) p(k, normal) / [p(k | normal, x^n) p(x^n)]   for all k.

Thus if we can estimate p(x^n | k, t₄) and p(x^n | k, normal) for some k, then we can estimate p(t₄ | x^n) and p(normal | x^n). Mathieson (1997) describes a method [a type of importance sampling which he refers to as Truncated Harmonic Mean (THM), and which is similar to the method described by DiCiccio, Kass, Raftery and Wasserman (1997)] of obtaining estimates for p(x^n | k, t₄) and p(x^n | k, normal), and uses this method to obtain the estimates

(36)  −log p(x^n | k = 3, t₄) ≈ 227.64   and   −log p(x^n | k = 3, normal) ≈ 229.08,

giving [using equations (35) and (36)]

p̂(t₄ | x^n) ≈ 0.916   and   p̂(normal | x^n) ≈ 0.084,

from which we can estimate p̂(t₄, k | x^n) = p̂(t₄ | x^n) p̂(k | t₄, x^n), and similarly for normals; the results are shown in Table 2. We conclude that, for the prior distributions used, mixtures of t₄ distributions are heavily favored over mixtures of normal distributions, with mixtures with four t₄ components having the highest posterior probability.

TABLE 1
Estimates of the posterior probabilities p(k | t₄, x^n) and p(k | normal, x^n) for the galaxy data (Fixed-κ prior, λ = 1). These are the means of the estimates from five separate runs of Algorithm 3.2, each run consisting of 20,000 iterations with the first 10,000 iterations being discarded as burn-in; the standard errors of these estimates are shown in brackets.

k =                   2              3              4              5              6              >6
p̂(k | t₄, x^n)       0.056 (0.014)  0.214 (0.009)  0.601 (0.011)  0.115 (0.005)  0.012 (0.001)  0.001 (0.000)
p̂(k | normal, x^n)   0.000          0.554 (0.014)  0.338 (0.011)  0.093 (0.004)  0.013 (0.001)  0.001 (0.000)

TABLE 2
Estimates of the posterior probabilities p̂(t₄, k | x^n) and p̂(normal, k | x^n) for the galaxy data (Fixed-κ prior, λ = 1). See text for details of how these were obtained.

k =                   2      3      4      5      6      >6
p̂(t₄, k | x^n)       0.051  0.196  0.551  0.105  0.011  0.000
p̂(normal, k | x^n)   0.000  0.047  0.028  0.008  0.001  0.000
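The arithmetic in (34)-(36) is easy to reproduce; in the small sketch below (ours), which uses the k = 3 column of Table 1, the common factors p(k = 3) and p(x^n), and the equal model prior weights of 0.5, cancel in the normalization.

```python
import numpy as np

log_marg = {"t4": -227.64, "normal": -229.08}   # (36): log p(x^n | k = 3, model)
p_k3 = {"t4": 0.214, "normal": 0.554}           # Table 1: p(k = 3 | model, x^n)

# (35): p(model | x^n) is proportional to p(x^n | k=3, model) / p(k=3 | model, x^n)
logw = {m: log_marg[m] - np.log(p_k3[m]) for m in log_marg}
c = max(logw.values())
w = {m: np.exp(v - c) for m, v in logw.items()}
post = {m: wi / sum(w.values()) for m, wi in w.items()}
print(post)   # roughly {'t4': 0.916, 'normal': 0.084}, as in the text
```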

It would be relatively straightforward to modify our algorithm to fit t distributions with an unknown number of degrees of freedom, thus automating the above model choice procedure. It would also be straightforward to allow each component of the mixture to have a different number of degrees of freedom.

4.2. Example 2: Old Faithful data. For our second example we consider the Old Faithful data [the version from Hardle (1991), also considered by Venables and Ripley (1994)], which consists of data on 272 eruptions of the Old Faithful geyser in the Yellowstone National Park. Each observation consists of two measurements: the duration (in minutes) of the eruption, and the waiting time (in minutes) before the next eruption. A scatter plot of the data in two dimensions shows two moderately separated groups (Figure 7). We used Algorithm 3.2 to fit a mixture of an unknown number of bivariate normal distributions to the data, using λ = 1, 3 and both the Fixed-κ and Variable-κ priors detailed in Section 4. Each run consisted of 20,000 iterations of Algorithm 3.2, with the starting point being chosen by setting k = 1, setting (ξ, κ) to the values chosen for the Fixed-κ prior, and sampling the other parameters from their joint prior

[Figure 7 about here: scatter plot, duration on the x axis.]
FIG. 7. Scatter plot of the Old Faithful data [from Hardle (1991)]. The x axis shows the duration (in minutes) of the eruption, and the y axis shows the waiting time (in minutes) before the next eruption.


distribution. In each case the sampler moved quickly from the low likelihood of the starting point to an area of parameter space with higher likelihood. The runs for λ = 3 took about 7-8 minutes. Figure 8a shows the resulting sampled values of the number of components k, which can be seen to vary more rapidly for the Variable-κ model, due in part to its greater permissiveness of extra components. For the runs with λ = 3 the proportions of iterations which resulted in a change in k were 9% (Fixed-κ) and 39% (Variable-κ). For λ = 1 the corresponding figures were 3% and 10% respectively. Graphs of the autocorrelations (Figure 8b) suggest that the mixing is slightly poorer than for the galaxy data, presumably due to births of reasonable components being less likely in the two-dimensional case. This poorer mixing means that longer runs may be necessary to obtain accurate estimates of p(k | x^n). The method of Gelman and Rubin (1992) applied to two runs of length 20,000, starting from k = 1 and k = 30, diagnosed convergence within 10,000 iterations for the Fixed-κ prior with λ = 1, 3.

Estimates of the posterior distribution for k (Figure 8c) show that it depends heavily on the prior used, while estimates of the predictive density (Figure 8d) are less sensitive to changes in the prior. Where more than two components are fitted to the data, the extra components appear to be modeling deviations from normality in the two obvious groups, rather than interpretable extra groups.

4.3. Example 3: Iris Virginica data. We now briefly consider the famous Iris data, collected by Anderson (1935), which consists of four measurements (petal and sepal length and width) for 50 specimens of each of three species (setosa, versicolor, and virginica) of iris. Wilson (1982) suggests that the virginica and versicolor species may each be split into subspecies, though analysis by McLachlan (1992) using maximum likelihood methods suggests that this is not justified by the data. We investigated this question for the virginica species by fitting a mixture of an unknown number of bivariate normal distributions to the 50 observations of sepal length and petal length for this species, which are shown in Figure 9. Our analysis was performed with λ = 1, 3 and with both Fixed-κ and Variable-κ priors. We applied Algorithm 3.2 to obtain a sample of size 20,000 from a random starting point, and discarded the first 10,000 observations as burn-in.

The mixing behavior of the chain over k was reasonable, with the percentages of sample points for which k changed being 6% (λ = 1) and 21% (λ = 3) for the Fixed-κ prior, and 5% (λ = 1) and 36% (λ = 3) for the Variable-κ prior. The mode of the resulting estimates for the posterior distribution of k is at k = 1 for at least three of the four priors used (Figure 10a), and the results seem to support the conclusion of McLachlan (1992) that the data does not support a division into subspecies (though we note that in our analysis we used only two of the four measurements available for each specimen). The full predictive density estimates in Figure 10b indicate that, where more than one component is fitted to the data, the extra components are again being used to model lack of normality in the data, rather than interpretable groups in the data.


[Figure 8 about here: (a) sampled values of k; (b) autocorrelations of sampled values of k; (c) estimates (6) of Pr(k = i); (d) predictive density estimates (7), dark shading corresponding to regions of high density, all shaded on the same scale.]
FIG. 8. Results for using Algorithm 3.2 to fit a mixture of normal distributions to the Old Faithful data. The columns show results for Left: Fixed-κ prior, λ = 1; Left-middle: Variable-κ prior, λ = 1; Right-middle: Fixed-κ prior, λ = 3; Right: Variable-κ prior, λ = 3. The posterior distribution of k can be seen to depend on both the prior distribution for k (value of λ) and the prior distribution for (μ, Σ) (Variable-κ or Fixed-κ). The density estimates appear to be less sensitive to choice of prior.


[Figure 9 about here: scatter plot, sepal length on the x axis, petal length on the y axis.]
FIG. 9. Scatter plot of petal length against sepal length for the Iris Virginica data.

[Figure 10 about here: (a) estimates (6) of Pr(k = i); (b) predictive density estimates (7), dark shading corresponding to regions of high density, all shaded on the same scale.]
FIG. 10. Results for using Algorithm 3.2 to fit a mixture of normal distributions to the Iris Virginica data. The columns show results for Left: Fixed-κ prior, λ = 1; Left-middle: Variable-κ prior, λ = 1; Right-middle: Fixed-κ prior, λ = 3; Right: Variable-κ prior, λ = 3. The mode of the estimates of Pr(k = i) is at k = 1 for at least three of the four priors used, and seems to indicate that the data does not support splitting the species into subspecies.


5. Discussion.

5.1. Density estimation, inference for k and priors. Our examples demonstrate that a Bayesian approach to density estimation using mixtures of (univariate or bivariate) normal distributions with an unknown number of components is computationally feasible, and that the resulting density estimates are reasonably robust to the modeling assumptions and priors used. Extension to higher dimensions is likely to provide computational challenges, but might be made possible with suitable constraints on the covariance matrices (requiring them all to be equal or all to be diagonal, for example).

Our examples also highlight the fact that, while inference for the number of components k in the mixture is also computationally feasible, the posterior distribution for k can be highly dependent on not just the prior chosen for k, but also the prior chosen for the other parameters of the mixture model. Richardson and Green (1997), in their investigation of one-dimensional data, note that when using the Fixed-κ prior, the value chosen for κ in the prior distribution 𝒩(ξ, κ^{−1}) for the means μ_1, ..., μ_k has a subtle effect on the posterior distribution of k. A very large value of κ, representing a strong belief that the means lie at ξ (chosen to be the midpoint of the range of the data), will favor models with a small number of components and larger variances. Decreasing κ to represent vaguer prior knowledge about the means will initially encourage the fitting of more components with means spread across the range of the data. However, continuing to decrease κ, to represent vaguer and vaguer knowledge on the location of the means, eventually favors fitting fewer components. In the limit, as κ → 0, the posterior distribution of k becomes independent of the data and depends only on the number of observations, heavily favoring a one component model for any reasonable number of observations [Stephens (1997), Jennison (1997)]. Priors which appear to be only "weakly" informative for the parameters of the mixture components may thus be highly informative for the number of components in the mixture. Since very large and very small values of κ in the Fixed-κ prior both lead to priors which are highly informative for k, it might be interesting to search for a value of κ (probably depending on the observed data) which leads to a Fixed-κ prior which is "minimally informative" for k in some well-defined way.

Where the main aim of the analysis is to define groups for discrimination (as in taxonomic applications such as the iris data, e.g.), it seems natural that the priors should reflect our belief that this is a reasonable aim, and thus avoid fitting several similar components where one will suffice. This idea is certainly not captured by the priors we used here, which Richardson and Green (1997) suggest are more appropriate for "exploring heterogeneity." Inhibition priors from spatial point processes [as used by, e.g., Baddeley and van Lieshout (1993)] provide one way of expressing a prior belief that the components present will be somewhat distinct. Alternatively, we might try to distinguish between the number of components in the model and the number of "groups" in the data, by allowing each group to be modeled by several "similar" components. For example, group means might be a priori distributed on


the scale of the data, and each group might consist of an unknown number of normal components, with means distributed around the group mean on a smaller scale than the data. The discussion following Richardson and Green (1997) provides a number of other avenues for further investigation of suitable priors, and we hope that the computational tools described in this paper will help make such further investigation possible.

5.2. Choice of birth distribution. The choice of birth distribution we made in Algorithm 3.1 is rather naive, and indeed we were rather surprised that we were able to make much progress with this approach. Its success in the Fixed-κ model appears to stem from the fact that the (data-dependent) independent priors on the parameters (π, φ) are not so vague as to never produce a reasonable birth event, and yet not so tight as to always propose components which are very similar to those already present. In the Variable-κ model the success of the naive algorithm seems to be due to the way in which the hyperparameters κ and β "adapt" the birth distribution to make the birth of better components more likely. Here we may have been lucky, since the priors were not chosen with these properties in mind. In general, then, it may be necessary to spend more effort designing sensible birth-death schemes to achieve adequate mixing. Our results suggest that a strategy of allowing the birth distribution b(y; (π, φ)) to be independent of y, but depend on the data, may result in a simple algorithm with reasonable mixing properties. An ad hoc approach to improving mixing might be investigating mixing behavior for more or less "vague" choices of b. A more principled approach would be to choose a birth distribution which can be both easily calculated and simulated from directly, and which roughly approximates the (marginal) posterior distribution of a randomly chosen element of Φ. Such an approximation might be obtained from a preliminary analysis with a naive birth mechanism, or perhaps standard fixed-dimension MCMC with large k. A more sophisticated approach might allow the birth distribution b(y; (π, φ)) to depend on y. Indeed, the opposite extreme to our naive approach would be to allow all points to die at a constant rate, and find the corresponding birth distribution using (15) [as in, e.g., Ripley (1977)]. However, much effort may then be required to calculate the birth rate β(·) (perhaps by Monte-Carlo integration), which limits the appeal of this approach. [This problem did not arise in Ripley (1977), where simulations were performed conditional on a fixed value of k by alternating births and deaths.] For this reason we believe that it is easier to concentrate on designing efficient birth distributions which can be simulated from directly and whose densities can be calculated explicitly, so that the death rates (15) are easily computed.
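To illustrate this last point, the sketch below (in Python; the paper itself contains no code, and all names here are hypothetical) shows one data-dependent, y-independent birth density of the kind recommended above, for the (μ, σ²) of a new univariate normal component. It can be simulated from directly, and its density can be evaluated pointwise, which is all that the death-rate formula (15) requires.

    import numpy as np
    from scipy import stats

    class DataDependentBirth:
        """Hypothetical birth density for the (mu, sigma^2) of a new normal
        component; it ignores the current configuration y but adapts to the
        observed data, as suggested above."""

        def __init__(self, data):
            self.m, self.s2 = np.mean(data), np.var(data)

        def sample(self, rng):
            # Means proposed across the data range, variances on the data scale.
            mu = rng.normal(self.m, np.sqrt(self.s2))
            var = stats.invgamma(a=2.0, scale=self.s2).rvs(random_state=rng)
            return mu, var

        def logpdf(self, mu, var):
            # Pointwise evaluation of b, needed when computing death rates (15).
            return (stats.norm(self.m, np.sqrt(self.s2)).logpdf(mu)
                    + stats.invgamma(a=2.0, scale=self.s2).logpdf(var))

Because both sample and logpdf are available, the death rates follow mechanically from (15); no Jacobian or dimension-matching proposal needs to be derived.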


5.3. Extension to other contexts. It appears from our results that, for finite mixture problems, our birth-death algorithm provides an attractive alternative to the algorithm used by Richardson and Green (1997). There seems to be considerable potential for applying similar birth-death schemes in other contexts as an alternative to more general reversible jump methods. We now attempt to give some insight into the problems for which such an approach is likely to be feasible. We begin our discussion by highlighting the main differences between our Algorithm 3.1 and the algorithm used by Richardson and Green (1997).

A. Our algorithm operates in continuous time, replacing the accept-reject scheme by allowing events to occur at differing rates.

B. Our dimension-changing birth and death moves do not make use of the missing data z_n, effectively integrating out over them when calculating the likelihood.

C. Our birth and death moves take advantage of the natural nested structure of the models, removing the need for the calculation of a complicated Jacobian, and making implementation more straightforward.

D. Our birth and death moves treat the parameters as a point process, and do not make use of any constraint such as μ_1 < ⋯ < μ_k [used by Richardson and Green (1997) in defining their split and combine moves].

We consider A to be the least important distinction. Indeed, a discrete time version of our birth-death process could be designed using an accept-reject step along the lines of Geyer and Møller (1994), or using the general reversible jump formulation of Green (1995). (Similarly one can envision a continuous time version of the general reversible jump formulation.) We have no good intuition for whether discrete time or continuous time versions are likely to be more efficient in general, although Geyer and Møller (1994) suggest that it is easier to obtain analytical results relating to mixing for the discrete time version.
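To make distinction A concrete, the following schematic (a sketch under our own conventions, with hypothetical helper names; it is not a transcription of Algorithm 3.1) shows how a continuous-time birth-death process is simulated for a fixed virtual time t0 between fixed-dimension MCMC updates: events occur after exponential waiting times, and there is no accept-reject step, so a poorly fitting point expresses itself by dying quickly rather than by being rejected.

    def run_birth_death(y, t0, beta, birth, death_rates, rng):
        """Simulate the continuous-time birth-death process up to time t0.

        y is the current configuration (a list of points), beta(y) the overall
        birth rate, birth.sample(rng) a draw from the birth distribution, and
        death_rates(y) the list of per-point death rates."""
        t = 0.0
        while True:
            d = death_rates(y)
            total = beta(y) + sum(d)               # overall event rate
            t += rng.exponential(1.0 / total)      # exponential holding time
            if t >= t0:                            # next event falls after t0:
                return y                           # freeze the state at t0
            if rng.random() < beta(y) / total:     # birth, w.p. beta / total
                y = y + [birth.sample(rng)]
            else:                                  # otherwise point j dies,
                s = sum(d)                         # chosen w.p. d_j / sum(d)
                j = rng.choice(len(y), p=[dj / s for dj in d])
                y = y[:j] + y[j + 1:]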

Point B raises an important requirement for application of our algorithm: we must be able to calculate the likelihood for any given parameters. This requirement makes the method difficult to apply to Hidden Markov Models, or other missing data problems where calculation of the likelihood requires knowledge of the missing data. One solution to this problem would be to introduce the missing data into the MCMC scheme, and perform births and deaths while keeping the missing data fixed [along the lines of the births and deaths of "empty" components in Richardson and Green (1997)]. However, where the missing data is highly informative for k this seems likely to lead to poor mixing, and reversible jump methods which propose joint updates to the missing data and the dimension of the model appear more sensible here.

In order to take advantage of the simplicity of the birth-death methodology, we must be able to view the parameters of our model as a point process, and in particular we must be able to express our prior in terms of a Radon-Nikodym derivative, r(·), with respect to a symmetric measure, as in Section 2.2. This is not a particularly restrictive requirement, and we give two concrete examples below. These examples are in many ways simpler than the mixture problem, since there are no mixture proportions, and the marked point process becomes a point process on a space Φ. The analogue of Theorem 3.1 for this simpler case [which follows essentially directly from Preston (1976) and Ripley (1977)]


may be obtained by replacing condition (15) with

(37)  (k + 1) d(y; φ) r(y ∪ φ) L(y ∪ φ) = β(y) b(y; φ) r(y) L(y).

Provided we can calculate the likelihood L(y), the viability of the birth-death methodology will depend on being able to find a birth distribution which gives adequate mixing. The comments in Section 5.2 provide some guidance here. It is clear that in some applications the use of birth and death moves alone will make it difficult to achieve adequate mixing. However, the ease with which different birth distributions may be tried, and the success of our algorithm in the mixture context with minimal effort in designing efficient birth distributions, suggests that this type of algorithm is worth trying before more complex reversible jump proposal distributions are implemented.

Example 1: Change point analysis. Consider the change-point problem from Green (1995). The parameters of this model are the number of change points k, the positions 0 < s_1 < ⋯ < s_k < L of the change points, and the heights h_i (i = 0, ..., k) associated with the intervals [s_i, s_{i+1}], where s_0 and s_{k+1} are defined to be 0 and L respectively. In order to treat the parameters of the model as a point process, we drop the requirement that s_1 < ⋯ < s_k, and define the likelihood of the model in terms of the order statistics s_(1) < ⋯ < s_(k) and the corresponding heights h_(i) (i = 0, ..., k) associated with the intervals [s_(i), s_(i+1)], where s_(0) and s_(k+1) are defined to be 0 and L respectively.

Consider initially a prior in which k has prior probability mass distribution p(k), and conditional on k the s_i and h_i are assumed to be independent, with s_i uniformly distributed on [0, L] and h_i ~ Γ(α, β). In the notation of previous sections we take Φ = [0, L] × [0, ∞), x_i = (s_(i), h_(i)), η = h_(0), ω = (α, β), and ν to be Lebesgue measure on Φ, giving

(38)  r(k, s, h) = p(k) ∏_{i=1}^{k} (1/L) I(s_i ∈ [0, L]) Γ(h_i; α, β),

and η is ignored. With births and deaths on Φ defined in an obvious way, it is then straightforward to use condition (37) to create a birth-death process on Φ = [0, L] × [0, ∞) with the posterior distribution of φ given η as its stationary distribution. This can then be alternated with standard fixed-dimension MCMC steps (which allow h_(0), and perhaps α and β, to vary) to create an ergodic Markov chain with the posterior distribution of the parameters as its stationary distribution. The analogue of our naive algorithm for this prior would have birth distribution

(39)  b(y; (s, h)) = (1/L) I(s ∈ [0, L]) Γ(h; α, β).
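As an illustration, here is a minimal sketch (Python; hypothetical names, and taking the overall birth rate β(y) to be a constant, as in our naive algorithm) of the naive birth distribution (39), together with the death rate that condition (37) then assigns to each change point:

    import numpy as np
    from scipy import stats

    def sample_birth(L, alpha, beta, rng):
        # (39): s uniform on [0, L]; h ~ Gamma(alpha, beta), rate convention.
        h = stats.gamma(alpha, scale=1.0 / beta).rvs(random_state=rng)
        return rng.uniform(0.0, L), h

    def log_birth_density(s, h, L, alpha, beta):
        return -np.log(L) + stats.gamma(alpha, scale=1.0 / beta).logpdf(h)

    def death_rate(y, j, L, alpha, beta, birth_rate, log_lik, log_prior):
        """Rearranging (37): in a configuration y of k points, the point
        y[j] = (s, h) dies at rate
            birth_rate * b(y\\j; (s, h)) * r(y\\j) L(y\\j) / (k * r(y) L(y))."""
        rest = y[:j] + y[j + 1:]
        s, h = y[j]
        log_d = (np.log(birth_rate) + log_birth_density(s, h, L, alpha, beta)
                 + log_prior(rest) + log_lik(rest)
                 - np.log(len(y)) - log_prior(y) - log_lik(y))
        return np.exp(log_d)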

A more sophisticated approach would be to allow the birth of new change points to be concentrated on areas which, based on the data, seem good candidates for change points (e.g., by looking at the marginal posterior distribution of the positions of change points in preliminary analyses using the naive birth mechanism, or fixed-dimension MCMC), and to allow the birth distribution


for the new h to depend on the new s, again being centered on regions which appear to be good candidates based on the data.

Now suppose that [as in Green (1995)] s_(1), ..., s_(k) are, given k, a priori distributed as the even-numbered order statistics of 2k + 1 points independently and uniformly distributed on [0, L]:

(40)  p(s_(1), ..., s_(k)) = ((2k + 1)!/L^{2k+1}) (s_(1) − 0)(s_(2) − s_(1)) ⋯ (s_(k) − s_(k−1))(L − s_(k)) I(0 < s_(1) < ⋯ < s_(k) < L).

This corresponds to s_1, ..., s_k (which must be exchangeable) being a priori distributed as a random permutation of these order statistics:

(41)  p(s_1, ..., s_k) = (1/k!) ((2k + 1)!/L^{2k+1}) (s_(1) − 0)(s_(2) − s_(1)) ⋯ (s_(k) − s_(k−1))(L − s_(k)) ∏_{i=1}^{k} I(s_i ∈ [0, L]),

giving a prior which corresponds to

(42)  r'(k, s, h) = p(k) (1/k!) ((2k + 1)!/L^{2k+1}) (s_(1) − 0)(s_(2) − s_(1)) ⋯ (s_(k) − s_(k−1))(L − s_(k)) ∏_{i=1}^{k} I(s_i ∈ [0, L]) Γ(h_i; α, β).

Given a birth-death scheme using the prior (38), it would be straightforward to modify this scheme to use this second prior (42), for example by keeping the birth distribution fixed, and modifying the calculation of the death rates by replacing r with r'. The way in which priors are so easily experimented with is one major attraction of the birth-death methodology.
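Continuing the hypothetical sketch above: since the prior enters the sampler only through the ratio r(y∖j)/r(y) in the death rates, moving from (38) to (42) amounts to passing a different log-prior function (here with p_k, L and the Gamma term fixed in advance, e.g. via functools.partial), while the birth distribution (39) is left untouched.

    import math

    def log_prior_even_order_stats(y, p_k, L, log_gamma_h):
        """log r'(k, s, h) of (42) for a configuration y = [(s_1, h_1), ...],
        assuming k = len(y) >= 1; p_k(k) is the prior mass function for k and
        log_gamma_h(h) the log Gamma(alpha, beta) density for a height."""
        k = len(y)
        s = sorted(pt[0] for pt in y)
        gaps = [s[0]] + [s[i] - s[i - 1] for i in range(1, k)] + [L - s[-1]]
        return (math.log(p_k(k))
                + math.lgamma(2 * k + 2) - math.lgamma(k + 1)  # (2k+1)!/k!
                - (2 * k + 1) * math.log(L)
                + sum(math.log(g) for g in gaps)
                + sum(log_gamma_h(pt[1]) for pt in y))

    # death_rate(..., log_prior=<this function with its extra arguments
    # fixed>) now targets the posterior under (42) instead of (38).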
Example 2: Variable selection for regression models. Consider now the problem of selecting a subset of a given collection of variables to be included in a regression model [see, e.g., George and McCulloch (1996)]. (Similar problems include deciding which terms to include in an autoregression, or which links to include in a Bayesian Belief Network.) Let there be K possible variables to include, and let variable i be associated with a parameter β_i ∈ R (i = 1, ..., K). A model which contains k of the variables can then be represented by a set of k points {(i_1, β_{i_1}), ..., (i_k, β_{i_k})} in Φ = {1, ..., K} × R, where i_1, ..., i_k are distinct integers in {1, ..., K}. The birth of a point (i, β_i) then corresponds to adding variable i to the regression. Note that the points are exchangeable in that the order in which they are listed is irrelevant. A suitable choice for ν in the definition of the symmetric measure (Section 2.2) would be the product measure of counting measure on {1, ..., K} and Lebesgue measure on R.

Suppose our prior is that variable i is present with probability p_i, independently for all i, and conditional on variable i being present, β_i has prior


p(β_i), again independent for all i. Then we have

(43)  r(k, (i_1, β_{i_1}), ..., (i_k, β_{i_k})) =
          0,                                          if i_a = i_b for some a ≠ b,
          p_{i_1} p(β_{i_1}) ⋯ p_{i_k} p(β_{i_k}),    otherwise.

The choice of birth distribution b(y; (i, β_i)) must in this case depend on y, in order to avoid adding variables which are already present. A naive suggestion would be to set

(44)  b(y; (i, β_i)) = b_i p(β_i)

with b_i ∝ p_i for the variables i not already present in y. Again, more efficient schemes could be devised by letting the births be data-dependent, possibly through examining the marginal posterior distributions of the β_i in preliminary analyses.
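A minimal sketch (Python; hypothetical names) of the naive birth distribution (44): the variable index is drawn with probability proportional to p_i among the variables not already in the model, so that b necessarily depends on the current configuration y.

    import numpy as np

    def sample_birth(y, p, sample_beta_prior, rng):
        """(44): choose an absent variable i with probability b_i
        proportional to p_i, and draw beta_i from its prior."""
        present = {i for i, _ in y}
        absent = [i for i in range(len(p)) if i not in present]
        w = np.array([p[i] for i in absent])
        i = int(rng.choice(absent, p=w / w.sum()))
        return i, sample_beta_prior(rng)

    def log_birth_density(y, i, beta_i, p, log_beta_prior):
        # Needed for the death rates: b(y; (i, beta_i)) = b_i p(beta_i).
        present = {j for j, _ in y}
        norm = sum(p[j] for j in range(len(p)) if j not in present)
        return np.log(p[i] / norm) + log_beta_prior(beta_i)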
APPENDIX: PROOF OF THEOREM 3.1

PROOF. Our proof draws heavily on the theory derived by Preston (1976), Section 5, for general Markov birth-death processes on state space Ω = ∪_k Ω_k, where the Ω_k are disjoint. The process evolves by jumps, of which only a finite number can occur in a finite time. The jumps are of two types: "births," which are jumps from a point in Ω_k to Ω_{k+1}, and "deaths," which are jumps from a point in Ω_k to a point in Ω_{k−1}. When the process is at y ∈ Ω_k, the behavior of the process is defined by the birth rate β(y), the death rate δ(y), and the birth and death transition kernels K_B^{(k)}(y; ·) and K_D^{(k)}(y; ·), which are probability measures on Ω_{k+1} and Ω_{k−1} respectively. Births and deaths occur as independent Poisson processes, with rates β(y) and δ(y) respectively. If a birth occurs then the process jumps to a point in Ω_{k+1}, with the probability that this point is in any particular set F ⊆ Ω_{k+1} being given by K_B^{(k)}(y; F). If a death occurs then the process jumps to a point in Ω_{k−1}, with the probability that this point is in any particular set G ⊆ Ω_{k−1} being given by K_D^{(k)}(y; G). Preston (1976) showed that for such a process to possess stationary distribution μ̃ it is sufficient that the following detailed balance conditions hold:

DEFINITION 1 (Detailed balance conditions). μ̃ is said to satisfy the detailed balance conditions if

(45)  ∫_F β(y) dμ̃_k(y) = ∫_{Ω_{k+1}} δ(z) K_D^{(k+1)}(z; F) dμ̃_{k+1}(z)    for k > 0, F ⊆ Ω_k,

and

(46)  ∫_G δ(z) dμ̃_{k+1}(z) = ∫_{Ω_k} β(y) K_B^{(k)}(y; G) dμ̃_k(y)    for k > 0, G ⊆ Ω_{k+1}.

These have the intuitive meaning that the rate at which the process leaves any set through the occurrence of a birth is exactly matched by the rate at


which the process enters that set through the occurrence of a death, and vice-versa.

We therefore check that p(k, π, φ | x_n, ω, η) satisfies the detailed balance conditions for our process, which corresponds to the general Markov birth-death process with birth rate β(y), death rate δ(y), and birth and death transition kernels K_B^{(k)}(y; ·) and K_D^{(k)}(y; ·) which satisfy

(47)  β(y) K_B^{(k)}(y; F) = ∫_{{(π, φ): y ∪ (π, φ) ∈ F}} b(y; (π, φ)) dπ ν(dφ)

and

(48)  δ(y) K_D^{(k)}(y; F) = Σ_{(π, φ) ∈ y: y∖(π, φ) ∈ F} d(y∖(π, φ); (π, φ)).

We begin by introducing some notation. Let Λ_k represent the parameter space for the k-component model, with the labeling of the parameters taken into account, and let Ω_k be the corresponding space obtained by ignoring the labeling of the components. If (π, φ) ∈ Λ_k, then we will write [π, φ] for the corresponding member of Ω_k. With Λ = ∪_{k≥1} Λ_k, let P(·) and P̃(·) be the prior and posterior probability measures on Λ, and let P_k(·) and P̃_k(·) denote their respective restrictions to Λ_k. The prior distribution has Radon-Nikodym derivative r(k, π, φ) with respect to the product of Lebesgue measure on the simplex and ν^k. Thus for (π, φ) ∈ Λ_k we have

(49)  dP_k{(π, φ)} = r(k, π, φ)(k − 1)! dπ_1 ⋯ dπ_{k−1} ν(dφ_1) ⋯ ν(dφ_k).

Also, by Bayes theorem we have

dP̃{(π, φ)} ∝ L([π, φ]) dP{(π, φ)},

and so we will write

dP̃{(π, φ)} = f([π, φ]) dP{(π, φ)}

for some f([π, φ]) ∝ L([π, φ]). Now let μ(·) and μ̃(·) be the probability measures induced on Ω by P(·) and P̃(·) respectively, and let μ_k(·) and μ̃_k(·) denote their respective restrictions to Ω_k. Then for any function g: Ω → R we have

(50)  ∫_{Ω_k} g(y) dμ_k(y) = ∫_{Λ_k} g([π, φ]) dP_k{(π, φ)}

and

(51)  ∫_{Ω_k} g(y) dμ̃_k(y) = ∫_{Λ_k} g([π, φ]) f([π, φ]) dP_k{(π, φ)} = ∫_{Ω_k} g(y) f(y) dμ_k(y).


We define births on Λ by

(52)  (π, φ) ∪ (π̃, φ̃) := ((π_1(1 − π̃), φ_1), ..., (π_k(1 − π̃), φ_k), (π̃, φ̃))

and will require the following lemma (which is essentially a simple change of variable formula).

LEMMA 5.1. If (π, φ) ∈ Λ_k and (π̃, φ̃) ∈ [0, 1] × Φ then

r(k, π, φ) dP_{k+1}{(π, φ) ∪ (π̃, φ̃)} = r(k + 1, (π, φ) ∪ (π̃, φ̃)) k(1 − π̃)^{k−1} dπ̃ ν(dφ̃) dP_k{(π, φ)}.

PROOF.

LHS = r(k, π, φ) dP_{k+1}{(π, φ) ∪ (π̃, φ̃)}

    = r(k, π, φ) dP_{k+1}{((π_1(1 − π̃), φ_1), ..., (π_k(1 − π̃), φ_k), (π̃, φ̃))}    [equation (52)]

    = r(k, π, φ) r(k + 1, (π, φ) ∪ (π̃, φ̃)) k! (1 − π̃)^{k−1} dπ_1 ⋯ dπ_{k−1} dπ̃ ν(dφ_1) ⋯ ν(dφ_k) ν(dφ̃)    [equation (49) and change of variable]

    = r(k + 1, (π, φ) ∪ (π̃, φ̃)) k(1 − π̃)^{k−1} dπ̃ ν(dφ̃) dP_k{(π, φ)}    [equation (49)]

    = RHS.
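The change-of-variable step above uses the Jacobian of the map (52) on the free weight coordinates; a short supplementary verification (our notation; this computation is compressed in the argument above) of the factor (1 − π̃)^{k−1}:

    % Free coordinates: (pi_1, ..., pi_{k-1}, \tilde\pi) |-> (w_1, ..., w_k),
    % with w_i = pi_i (1 - \tilde\pi) for i <= k and pi_k = 1 - pi_1 - ... - pi_{k-1}.
    \[
      J \;=\;
      \begin{pmatrix}
        (1-\tilde\pi) I_{k-1} & -\boldsymbol{\pi}_{1:k-1} \\
        -(1-\tilde\pi)\mathbf{1}^{\top} & -\pi_k
      \end{pmatrix}.
    \]
    % Adding rows 1, ..., k-1 to the last row clears its first k-1 entries and
    % leaves -(pi_1 + ... + pi_k) = -1 in the corner, so
    \[
      \lvert \det J \rvert \;=\; (1-\tilde\pi)^{k-1},
    \]
    % which combines with the labeling factor k!/(k-1)! = k from (49) to give
    % the k (1-\tilde\pi)^{k-1} appearing in Lemma 5.1.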

Assume for the moment that r(y)L(y) > 0 for all y. Let I(·) denote the generic indicator function, so I(x ∈ A) = 1 if x ∈ A and 0 otherwise. We check the first part of the detailed balance conditions (45) as follows:

LHS = ∫_F β(y) dμ̃_k(y)

    = ∫_{Ω_k} I(y ∈ F) β(y) f(y) dμ_k(y)    [equation (51)]

    = ∫_{Ω_k} I(y ∈ F) β(y) f(y) ∫∫ b(y; (π̃, φ̃)) dπ̃ ν(dφ̃) dμ_k(y);    [b must integrate to 1]

RHS = ∫_{Ω_{k+1}} δ(z) K_D^{(k+1)}(z; F) dμ̃_{k+1}(z)

    = ∫_{Ω_{k+1}} δ(z) K_D^{(k+1)}(z; F) f(z) dμ_{k+1}(z)    [equation (51)]

    = ∫_{Ω_{k+1}} Σ_{(π̃, φ̃) ∈ z: z∖(π̃, φ̃) ∈ F} d(z∖(π̃, φ̃); (π̃, φ̃)) f(z) dμ_{k+1}(z)    [equation (48)]

    = ∫_{Λ_{k+1}} Σ_{i=1}^{k+1} I([π, φ]∖(π_i, φ_i) ∈ F) d([π, φ]∖(π_i, φ_i); (π_i, φ_i)) f([π, φ]) dP_{k+1}{(π, φ)}    [equation (50)]

    = ∫_{Λ_{k+1}} (k + 1) I([π, φ]∖(π_{k+1}, φ_{k+1}) ∈ F) d([π, φ]∖(π_{k+1}, φ_{k+1}); (π_{k+1}, φ_{k+1})) f([π, φ]) dP_{k+1}{(π, φ)}    [by symmetry of P_{k+1}(·)]

    = ∫_{Λ_k} ∫_0^1 ∫_Φ (k + 1) I([π', φ'] ∈ F) d([π', φ']; (π̃, φ̃)) f([π', φ'] ∪ (π̃, φ̃)) dP_{k+1}{(π', φ') ∪ (π̃, φ̃)}    [(π, φ) = (π', φ') ∪ (π̃, φ̃)]

    = ∫_{Λ_k} ∫_0^1 ∫_Φ I([π', φ'] ∈ F)(k + 1) d([π', φ']; (π̃, φ̃)) f([π', φ'] ∪ (π̃, φ̃)) (r(k + 1, (π', φ') ∪ (π̃, φ̃))/r(k, π', φ')) k(1 − π̃)^{k−1} dπ̃ ν(dφ̃) dP_k{(π', φ')}    [Lemma 5.1]

    = ∫_{Ω_k} ∫_0^1 ∫_Φ I(y ∈ F)(k + 1) d(y; (π̃, φ̃)) f(y ∪ (π̃, φ̃)) (r(y ∪ (π̃, φ̃))/r(y)) k(1 − π̃)^{k−1} dπ̃ ν(dφ̃) dμ_k(y),    [equation (50)]

and so LHS = RHS provided

(k + 1) d(y; (π̃, φ̃)) f(y ∪ (π̃, φ̃)) r(y ∪ (π̃, φ̃)) k(1 − π̃)^{k−1} = β(y) b(y; (π̃, φ̃)) f(y) r(y),

which is equivalent to the conditions (15) stated in the Theorem, as f(y) ∝ L(y). The remaining detailed balance conditions (46) can be shown to hold in a similar way.

The condition that r(y)L(y) > 0 for all y can now be relaxed by applying the conditions (13) and (14), and restricting the spaces Λ_k and Ω_k to {y : r(y)L(y) > 0}. □

Acknowledgments. I would like to thank my D.Phil. supervisor, Professor Brian Ripley, for suggesting this approach to the problem, and for valuable comments on earlier versions. I would also like to thank Mark Mathieson for helpful discussions on this work, and Peter Donnelly, Peter Green, two anonymous reviewers, an Associate Editor and the Editor for helpful advice on improving the manuscript.


REFERENCES

ANDERSON, E. (1935). The irises of the Gaspé Peninsula. Bulletin of the American Iris Society 59 2-5.
BADDELEY, A. J. and VAN LIESHOUT, M. N. M. (1993). Stochastic geometry models in high-level vision. In Advances in Applied Statistics (K. V. Mardia and G. K. Kanji, eds.) 1 231-256. Carfax, Abingdon, UK.
BROOKS, S. P. and ROBERTS, G. O. (1998). Convergence assessment techniques for Markov chain Monte Carlo. Statist. Comput. 8 319-335.
CARLIN, B. P. and CHIB, S. (1995). Bayesian model choice via Markov chain Monte Carlo methods. J. Roy. Statist. Soc. Ser. B 57 473-484.
CHIB, S. (1995). Marginal likelihood from the Gibbs output. J. Amer. Statist. Assoc. 90 1313-1321.
COWLES, M. K. and CARLIN, B. P. (1996). Markov chain Monte Carlo convergence diagnostics: a comparative review. J. Amer. Statist. Assoc. 91 883-904.
CRAWFORD, S. L. (1994). An application of the Laplace method to finite mixture distributions. J. Amer. Statist. Assoc. 89 259-267.
DAWID, A. P. (1997). Contribution to the discussion of paper by Richardson and Green (1997). J. Roy. Statist. Soc. Ser. B 59 772-773.
DICICCIO, T., KASS, R., RAFTERY, A. and WASSERMAN, L. (1997). Computing Bayes factors by posterior simulation and asymptotic approximations. J. Amer. Statist. Assoc. 92 903-915.
DIEBOLT, J. and ROBERT, C. P. (1994). Estimation of finite mixture distributions through Bayesian sampling. J. Roy. Statist. Soc. Ser. B 56 363-375.
ESCOBAR, M. D. and WEST, M. (1995). Bayesian density estimation and inference using mixtures. J. Amer. Statist. Assoc. 90 577-588.
GELMAN, A. and RUBIN, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statist. Sci. 7 457-511.
GELMAN, A. G., CARLIN, J. B., STERN, H. S. and RUBIN, D. B. (1995). Bayesian Data Analysis. Chapman & Hall, London.
GEORGE, E. I. and MCCULLOCH, R. E. (1996). Stochastic search variable selection. In Markov Chain Monte Carlo in Practice (W. R. Gilks, S. Richardson and D. J. Spiegelhalter, eds.) Chapman & Hall, London.
GEYER, C. J. and MØLLER, J. (1994). Simulation procedures and likelihood inference for spatial point processes. Scand. J. Statist. 21 359-373.
GILKS, W. R., RICHARDSON, S. and SPIEGELHALTER, D. J., eds. (1996). Markov Chain Monte Carlo in Practice. Chapman & Hall, London.
GLÖTZL, E. (1981). Time-reversible and Gibbsian point processes I. Markovian spatial birth and death process on a general phase space. Math. Nachr. 102 217-222.
GREEN, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82 711-732.
HÄRDLE, W. (1991). Smoothing Techniques with Implementation in S. Springer, New York.
JENNISON, C. (1997). Comment on "On Bayesian analysis of mixtures with an unknown number of components," by S. Richardson and P. J. Green. J. Roy. Statist. Soc. Ser. B 59 778-779.
KELLY, F. P. and RIPLEY, B. D. (1976). A note on Strauss's model for clustering. Biometrika 63 357-360.
LAWSON, A. B. (1996). Markov chain Monte Carlo methods for spatial cluster processes. In Computer Science and Statistics: Proceedings of the 27th Symposium on the Interface 314-319.
MATHIESON, M. J. (1997). Ordinal models and predictive methods in pattern recognition. Ph.D. thesis, Univ. Oxford.
MCLACHLAN, G. J. (1992). Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York.
MCLACHLAN, G. J. and BASFORD, K. E. (1988). Mixture Models: Inference and Applications to Clustering. Dekker, New York.
PHILLIPS, D. B. and SMITH, A. F. M. (1996). Bayesian model comparison via jump diffusions. In Markov Chain Monte Carlo in Practice (W. R. Gilks, S. Richardson and D. J. Spiegelhalter, eds.) 215-239. Chapman & Hall, London.
POSTMAN, M., HUCHRA, J. P. and GELLER, M. J. (1986). Probes of large-scale structure in the Corona Borealis region. The Astronomical Journal 92 1238-1247.
PRESTON, C. J. (1976). Spatial birth-and-death processes. Bull. Inst. Internat. Statist. 46 371-391.
PRIEBE, C. E. (1994). Adaptive mixtures. J. Amer. Statist. Assoc. 89 796-806.
RICHARDSON, S. and GREEN, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). J. Roy. Statist. Soc. Ser. B 59 731-792.
RIPLEY, B. D. (1977). Modelling spatial patterns (with discussion). J. Roy. Statist. Soc. Ser. B 39 172-212.
RIPLEY, B. D. (1987). Stochastic Simulation. Wiley, New York.
ROBERT, C. P. (1994). The Bayesian Choice: A Decision-Theoretic Motivation. Springer, New York.
ROBERT, C. P. (1996). Mixtures of distributions: inference and estimation. In Markov Chain Monte Carlo in Practice (W. R. Gilks, S. Richardson and D. J. Spiegelhalter, eds.) Chapman & Hall, London.
ROEDER, K. (1990). Density estimation with confidence sets exemplified by superclusters and voids in the galaxies. J. Amer. Statist. Assoc. 85 617-624.
SHEATHER, S. J. and JONES, M. C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. J. Roy. Statist. Soc. Ser. B 53 683-690.
STEPHENS, D. A. and FISCH, R. D. (1998). Bayesian analysis of quantitative trait locus data using reversible jump Markov chain Monte Carlo. Biometrics 54 1334-1367.
STEPHENS, M. (1997). Bayesian Methods for Mixtures of Normal Distributions. Ph.D. thesis, Univ. Oxford. Available from www.stats.ox.ac.uk/~stephens.
STOYAN, D., KENDALL, W. S. and MECKE, J. (1987). Stochastic Geometry and Its Applications, 1st ed. Wiley, New York.
TIERNEY, L. (1996). Introduction to general state-space Markov chain theory. In Markov Chain Monte Carlo in Practice (W. R. Gilks, S. Richardson and D. J. Spiegelhalter, eds.) 59-74. Chapman & Hall, London.
TITTERINGTON, D. M., SMITH, A. F. M. and MAKOV, U. E. (1985). Statistical Analysis of Finite Mixture Distributions. Wiley, New York.
VENABLES, W. N. and RIPLEY, B. D. (1994). Modern Applied Statistics with S-Plus. Springer, New York.
VENABLES, W. N. and RIPLEY, B. D. (1997). Modern Applied Statistics with S-Plus, 2nd ed. Springer, New York.
WEST, M. (1993). Approximating posterior distributions by mixtures. J. Roy. Statist. Soc. Ser. B 55 409-422.
WILSON, S. R. (1982). Sound and exploratory data analysis. In COMPSTAT 1982: Proceedings in Computational Statistics (H. Caussinus, P. Ettinger and R. Tomassone, eds.) 447-450. Physica, Vienna.
DEPARTMENT OF STATISTICS
1 SOUTH PARKS ROAD
OXFORD OX1 3TG
UNITED KINGDOM
E-MAIL: stephens@stats.ox.ac.uk
