a Department of Mathematics and Computer Science, Saint Louis University, Saint Louis, MO, USA; b Department of
Biostatistics, College for Public Health and Social Justice, Saint Louis University, Saint Louis, MO, USA
ABSTRACT
We propose a semiparametric approach to estimate the existence and location of a statistical change-point to a nonlinear multivariate time series
contaminated with an additive noise component. In particular, we consider
a p-dimensional stochastic process of independent multivariate normal
observations where the mean function varies smoothly except at a single
change-point. Our approach involves conducting a Bayesian analysis on
the empirical detail coefficients of the original time series after a wavelet
transform. If the mean function of our time series can be expressed as a
multivariate step function, we find our Bayesian-wavelet method performs
comparably with classical parametric methods such as maximum likelihood
estimation. The advantage of our multivariate change-point method is seen
in how it applies to a much larger class of mean functions that require only
general smoothness conditions.
KEYWORDS: Semiparametric; scaling coefficient; detail coefficient; discrete wavelet transform; Haar wavelet
1. Introduction
The change-point problem has been studied in a variety of settings since at least the 1920s when, in an effort to improve quality control, Walter Shewhart developed his now ubiquitous statistical control charts to detect various statistical changes in industrial processes.[1] Although control chart methods proved useful in practice, more theoretically grounded approaches involving maximum likelihood estimation (MLE) [2] and Bayesian techniques [3] later allowed the practitioner to rigorously associate confidence intervals to their conclusions. While initially the univariate case of a single change-point in the mean function was the focus, efforts expanded to include various other related problems such as multiple statistical change-points,[4–6] change in variance,[7] and a simultaneous change in mean and variance.[8] The case where the error component is not from a normal distribution has also been studied by various authors.[9,10] While many of these methods have proven to be valuable diagnostic data analysis tools, they generally either apply only in a single dimension or after making strict assumptions on the time series model.
There appears to be a gap in the change-point literature that addresses the change-point problem
for nonlinear multivariate time series. Classical parametric approaches such as MLE and Bayesian
methods exist to detect and estimate the location of one or more statistical change-points in multivariate time series.[8,11,12] Many variations of such parametric approaches exist for detecting multivariate statistical change-points,[13–16] but invariably these methods require strict assumptions on the time series mean function. Müller [17] developed an approach to detect discontinuities in derivatives using left and right one-sided kernel smoothers for one-dimensional smooth functions.
CONTACT: Steven E. Rigdon, srigdon@slu.edu. © 2015 Taylor & Francis.
More recently, Ogden and Lynch,[18] Ciuperca,[19] and Battaglia and Protopapas [20] all have results
for estimating change-point locations in one-dimensional nonlinear time series. Matteson and James
[21] developed a fully nonparametric approach for estimating the locations of multiple change-points in multivariate data. While their work is perhaps the method most relevant to the change-point problem in this article, their method still only applies to data sets where the mean function is piecewise constant.
The multivariate change-point problem is an important problem that has direct applications in
a surprising number of otherwise seemingly unrelated fields. In statistical process control (SPC),
the multivariate change-point problem is important to quickly detect and estimate changes in
many industrial processes.[22] The US Department of Transportation has applied multivariate change-point methods to estimate statistical change-points around a speed limit increase from 55 to 65 mph.[13] Additional applications occur in such unrelated fields as biosurveillance, financial market analysis, and hydrology, to name a few.[23,24] In practice, however, imposing strict assumptions
on the time series may be impractical when encountering the change-point problem for real world
multivariate data. Unfortunately, in the multivariate time series setting there have not been many
other good options. In this article we propose a method that attempts to bridge this gap by developing a generalization of the approach from Ogden and Lynch.[18] The method we propose detects
and estimates the location of a statistical change-point for multivariate data through a Bayesian analysis on empirical wavelet detail coefficients and applies even when strict assumptions about the true
underlying mean function cannot be made.
Figure 1. Original function (left), scaling coefficients at level 5 smoothing the time series (middle), and detail coefficients capturing how the time series is changing at detail level 6 (right).
$$ f(t) = \sum_{k=-\infty}^{\infty} w_{j_0,k}\,\phi_{j_0,k}(t) + \sum_{j=j_0}^{\infty} \sum_{k=-\infty}^{\infty} d_{j,k}\,\psi_{j,k}(t), \qquad (1) $$
where $w_{j,k}$ and $d_{j,k}$ are called the scaling and detail coefficients, respectively. Through an integration which we define below, each $w_{j,k}$ and $d_{j,k}$ coefficient is associated with a particular scaling and wavelet function $\phi_{j,k}$ and $\psi_{j,k}$, respectively. Each $\phi_{j,k}$ and $\psi_{j,k}$ are in turn related to the so-called father and mother wavelets, expressed as

$$ \{\phi_{j,k}(t) = 2^{j/2}\,\phi(2^{j} t - k)\}_{j,k\in\mathbb{Z}}, \qquad \{\psi_{j,k}(t) = 2^{j/2}\,\psi(2^{j} t - k)\}_{j,k\in\mathbb{Z}}. \qquad (2) $$
From Equation (2), we see wavelets by definition are simply systems of dilations and translations. The
simplest wavelet that we may explicitly express in closed form is the Haar wavelet, given by

$$ \phi(t) = \begin{cases} 1, & 0 \le t < 1 \\ 0, & \text{otherwise} \end{cases} \qquad\qquad \psi(t) = \begin{cases} 1, & 0 \le t < \tfrac{1}{2} \\ -1, & \tfrac{1}{2} \le t < 1 \\ 0, & \text{otherwise.} \end{cases} $$
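As a small illustration (not the paper's code), the Haar father and mother wavelets above, together with the dilations and translations of Equation (2), can be written directly; the function names here are our own:

```python
# Haar father (phi) and mother (psi) wavelets from the piecewise
# definition above, plus the dilated/translated family of Equation (2).

def haar_phi(t: float) -> float:
    """Haar scaling (father) function: 1 on [0, 1), 0 elsewhere."""
    return 1.0 if 0.0 <= t < 1.0 else 0.0

def haar_psi(t: float) -> float:
    """Haar wavelet (mother) function: 1 on [0, 1/2), -1 on [1/2, 1)."""
    if 0.0 <= t < 0.5:
        return 1.0
    if 0.5 <= t < 1.0:
        return -1.0
    return 0.0

def haar_phi_jk(t: float, j: int, k: int) -> float:
    """Dilated/translated scaling function 2^{j/2} phi(2^j t - k)."""
    return 2.0 ** (j / 2) * haar_phi(2.0 ** j * t - k)

def haar_psi_jk(t: float, j: int, k: int) -> float:
    """Dilated/translated wavelet 2^{j/2} psi(2^j t - k)."""
    return 2.0 ** (j / 2) * haar_psi(2.0 ** j * t - k)
```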
Besides being orthonormal bases for L2 (R), all wavelet systems in this article also possess other
special properties. By fixing a particular j in Equation (2), we denote the closed linear spans of the
scaling and wavelet functions as Vj and Oj , respectively, as k ranges through the integers. Each Vj is
an approximation space for the next finer approximation space of spanning scaling functions, Vj+1 ,
with the difference in information being precisely Oj . Figure 1 (middle) demonstrates this concept
of an approximation space by illustrating how a particular scaling level approximates the original
function by retaining the overall function shape while losing localized function characteristics. In
particular Vj is orthogonal to Oj with the direct sum of these orthogonal subspaces equal to Vj+1 ,
that is Vj+1 = Vj Oj . Such a construction leads to the so-called multiresolution analysis and allows
us to approximate (smooth) a signal at various approximation levels while precisely keeping track of
detail levels. The detail levels capture function change at these different resolutions and will play a
key role in our analysis of the change-point problem.
While the above formulation represents fundamental concepts of wavelet theory for general L2 (R)
functions and provides intuition for the approach, in practice we always apply the discrete wavelet
transform (DWT) with actual data. We start with a discrete time series $x = \{x_i\}$ of length $2^J$ for some natural number $J$. Next, we let $j$ be an index across the DWT resolution levels ranging from $J-1$ down to 0. At each resolution level we produce $2^j$ scaling and detail coefficients. For the Haar DWT, the finest level of scaling ($w_{jk}$) and detail ($d_{jk}$) coefficients are computed by the formulas

$$ w_{J-1,k} = \frac{x_{2k-1} + x_{2k}}{\sqrt{2}}, \qquad d_{J-1,k} = \frac{x_{2k-1} - x_{2k}}{\sqrt{2}}. \qquad (3) $$
We then compute all subsequent levels of scaling and detail coefficients recursively by the formulas

$$ w_{j,k} = \frac{w_{j+1,2k-1} + w_{j+1,2k}}{\sqrt{2}}, \qquad d_{j,k} = \frac{w_{j+1,2k-1} - w_{j+1,2k}}{\sqrt{2}}. \qquad (4) $$

After the transform, the coefficients we observe follow the model

$$ \tilde d_{jk} = d_{jk} + \tilde\epsilon_{jk}, $$

where $\tilde d_{jk}$ is the empirical detail coefficient we actually observe. In the case of the Haar wavelet, $\tilde d_{jk}$ would be the computation results after recursively applying Equations (3) and (4). Next, $d_{jk}$ is the true (but unknown) detail coefficient of the underlying smooth mean function we wish to estimate. Finally, $\tilde\epsilon_{jk}$ is the transformed additive noise component from the original time series that transforms again to noise.[26]
If we assume $\epsilon_i$ is generated from a Gaussian process, then $\tilde\epsilon_{jk}$ will also be Gaussian.[30] Wang [31] connected these properties of the DWT to the change-point problem when he recognized that, under suitable conditions, the largest detail coefficients arise where the time series is changing most rapidly and are probably not attributable to noise. Wang then hypothesized that the places where the time series is most rapidly changing may be due to a statistical change-point. While Wang's method works well for change-point problems with relatively high signal-to-noise ratios, it becomes much less reliable as the additive noise increases. Additionally, there is the issue of determining how best to combine the information from different detail levels in the analysis. In the following section we develop a method that capitalizes on these statistical properties while addressing these shortcomings in a complete Bayesian model framework.
Figure 2. Two example mean functions with a change-point at time point 81 (top) along with their respective detail coefficients (bottom). Each detail level is normalized by its l-norm. Notice at the finest four resolution levels the detail coefficients are essentially identical to each other.
Ogden and Lynch [18] addressed this problem in one dimension by proposing a method for estimating the change-point location of a one-dimensional time series by applying Bayesian techniques in the wavelet domain. In this section, we generalize a similar methodology to an arbitrary dimensional time series and extend the approach to answer the inference question.
The DWT allows us to analyse a time series at varying resolution levels and stores the resulting details of smooth functions in a similar way. Observe Figure 2, which displays two example time series mean functions that are smooth except at a change-point at time point 81 (top), along with the respective detail coefficient values (bottom). Observe that the detail coefficient values are essentially identical for the finest three resolution levels (levels 4, 5, and 6), despite the fact that the mean functions are quite different. While some coefficient values at the lowest four resolution levels do begin to diverge, at least 112 of the total 127 detail coefficients in this 128-element time series very closely agree. This suggests that, from the wavelet perspective, the two change-points in Figure 2 are equally difficult to detect when using just the highest detail coefficient levels.
The phenomenon in Figure 2 illustrates the sparsity property of the DWT and holds in general for any smoothly varying mean functions which share a common change-point. In particular, any otherwise smooth function with a change-point should have a detail coefficient representation at the finest levels similar to that of a step function with a change-point at the same location. This observation provides the intuition behind why an analysis of wavelet detail coefficients may be an effective approach to estimating the change-point location of time series with otherwise smooth mean functions such as those shown in Figure 2.
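This sparsity observation is easy to check numerically; the following small experiment (an illustration of our own, not the paper's code, with the change placed at index 81 of a 128-point grid) compares the finest-level Haar details of a step function and of the same step with a sine added:

```python
import numpy as np

# Finest-level Haar detail coefficients: pairwise differences / sqrt(2).
def finest_haar_details(x):
    even, odd = x[0::2], x[1::2]
    return (even - odd) / np.sqrt(2.0)

t = np.arange(128) / 128.0
step = np.where(np.arange(128) >= 81, 1.0, 0.0)      # step at the change-point
smooth = np.sin(2 * np.pi * t) + step                # smooth except at same point

d_step = finest_haar_details(step)
d_smooth = finest_haar_details(smooth)
# The two finest-level detail sequences differ only by the (small)
# local slope of the sine, while the jump produces one large coefficient.
max_diff = float(np.max(np.abs(d_step - d_smooth)))
```

Both series share a single dominant finest-level coefficient at the jump, and `max_diff` is bounded by the sine's slope times the grid spacing, which is why the two change-points look alike in the wavelet domain.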
Consider a multi-dimensional time series of independent observations $\{x_i\}_{i=1}^{N}$ for $N \in \mathbb{N}$, where $x_i$ is a $p$-dimensional vector, such that

$$ x_i \sim N_p(\mu_i, \Sigma). \qquad (5) $$
In the typical case where Bayesian or likelihood techniques are applied to the multivariate change-point problem, $\mu_i$ is assumed to be a $p$-dimensional step function. For our more general analysis, however, $\mu_i$ is assumed to be generated by a $p$-dimensional function, $g(\cdot)$, smoothly changing except at a single point in time where the shift occurs. Throughout this article, we denote the unknown time series change-point location with the symbol $\tau$. We also assume $\Sigma$ is an unknown but constant $p \times p$ covariance matrix throughout our time series. A particular observation of the time series takes the form

$$ x_i = g(i) + \epsilon_i, \qquad \text{where} \qquad \epsilon_i \sim N_p(0, \Sigma). $$
Next, we let the $N \times p$ matrix $X$ represent our time series where each row represents an observation at a particular time. Additionally, we introduce the idealized $N \times p$ matrix, $H$, which we compare against $X$:

$$ X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{\tau 1} & x_{\tau 2} & \cdots & x_{\tau p} \\ x_{\tau+1,1} & x_{\tau+1,2} & \cdots & x_{\tau+1,p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Np} \end{pmatrix}_{N \times p} \quad \text{and} \quad H = \begin{pmatrix} 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 1 \end{pmatrix}_{N \times p}. \qquad (6) $$
The zero rows in $H$ represent those observations before the change-point and the one rows indicate observations after the change-point. We assume for now that our time series is of dyadic length, that is, of length $N = 2^J$ for some $J \in \mathbb{N}$. While this appears to be a restrictive requirement, in practice there are several padding techniques that remedy this apparent difficulty.[26] For example, we might simply concatenate low-level statistical noise to the front end of the time series to achieve the required dyadic length if we have data available from the in-control time series state. Another method is to reflect leading elements of the time series to obtain the required dyadic length. For example, a data set with six elements $(x_1, x_2, x_3, x_4, x_5, x_6)$ could be modified as $(x_3, x_2, x_1, x_2, x_3, x_4, x_5, x_6)$ to achieve the required dyadic length. The latter approach is what we will apply for our practical example in Section 6.2.
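The reflection idea above can be sketched as a short helper; this is our own illustrative function (not the paper's code), written to reproduce the six-element example in the text:

```python
import numpy as np

# Reflection padding to the next dyadic length: prepend reflected
# leading elements (excluding the first element itself).

def reflect_to_dyadic(x):
    """Pad the front of x by reflection so the result has length 2^J
    for the smallest sufficient J."""
    x = np.asarray(x, dtype=float)
    n = x.size
    target = 1 << (n - 1).bit_length()   # smallest power of two >= n
    pad = target - n
    if pad == 0:
        return x
    front = x[1:pad + 1][::-1]           # reflected prefix, e.g. (x3, x2)
    return np.concatenate([front, x])
```

Applied to `(x1, ..., x6)` this yields `(x3, x2, x1, x2, x3, x4, x5, x6)`, matching the example above.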
We now take a one-dimensional discrete wavelet transform (DWT) of both $X$ and $H$ column by column, which produces two $(N-1) \times p$ matrices in the wavelet domain, $\widetilde{D}$ and $Q$. We can normalize each detail level by its l-norm, which has the effect of weighting coefficients from different resolution levels equally. With l-normalized detail levels our subsequent analysis becomes less sensitive to change information contained in the lowest resolution levels. In Section 6 we apply our algorithm both with and without normalized detail coefficients. When the rows of zeroes and ones of $H$ exactly correspond to the rows of $X$ before and after the change-point, the rows of $\widetilde{D}$ and $Q$ will closely relate to each other in a meaningful manner as we describe below. Since the statistical properties of the additive noise component of the time series are retained after a one-dimensional DWT, it can easily be shown using the linearity of the DWT that the expected covariance matrix after the transform remains $\Sigma$.
Notationally, we index our detail matrices to emphasize the detail levels of each row. More explicitly, supposing our time series is of length $2^J$, we denote a $p$-dimensional detail coefficient as $\tilde d_{jk} = (\tilde d_{jk,1}, \tilde d_{jk,2}, \ldots, \tilde d_{jk,p})$, where $j$ represents a particular detail level and $k$ the translation index at the given detail level. We then express the DWT of $X$ and $H$ as the matrices $\widetilde{D}$ and $Q$, where
$$ \widetilde{D} = \begin{pmatrix} \tilde d_{01,1} & \tilde d_{01,2} & \cdots & \tilde d_{01,p} \\ \tilde d_{11,1} & \tilde d_{11,2} & \cdots & \tilde d_{11,p} \\ \tilde d_{12,1} & \tilde d_{12,2} & \cdots & \tilde d_{12,p} \\ \tilde d_{21,1} & \tilde d_{21,2} & \cdots & \tilde d_{21,p} \\ \vdots & \vdots & \ddots & \vdots \\ \tilde d_{jk,1} & \tilde d_{jk,2} & \cdots & \tilde d_{jk,p} \\ \vdots & \vdots & \ddots & \vdots \\ \tilde d_{(J-1)2^{J-1},1} & \tilde d_{(J-1)2^{J-1},2} & \cdots & \tilde d_{(J-1)2^{J-1},p} \end{pmatrix}_{(N-1)\times p} $$

and

$$ Q = \begin{pmatrix} q_{01,1} & q_{01,2} & \cdots & q_{01,p} \\ q_{11,1} & q_{11,2} & \cdots & q_{11,p} \\ q_{12,1} & q_{12,2} & \cdots & q_{12,p} \\ q_{21,1} & q_{21,2} & \cdots & q_{21,p} \\ \vdots & \vdots & \ddots & \vdots \\ q_{jk,1} & q_{jk,2} & \cdots & q_{jk,p} \\ \vdots & \vdots & \ddots & \vdots \\ q_{(J-1)2^{J-1},1} & q_{(J-1)2^{J-1},2} & \cdots & q_{(J-1)2^{J-1},p} \end{pmatrix}_{(N-1)\times p}. \qquad (7) $$
Next, we define $\Delta = [\delta_1, \delta_2, \ldots, \delta_p]$ as the amount our mean function shifts at the unknown change-point. It is important to note that $\Delta$ here is not a vector, but rather a set of coefficients. We use the $[\,\cdot\,]$ notation to distinguish this from, say, $q_{11} = (q_{11,1}, q_{11,2}, \ldots, q_{11,p})$, which is a $p$-dimensional vector. So in particular, we define $\Delta q_{11} = (\delta_1 q_{11,1}, \delta_2 q_{11,2}, \ldots, \delta_p q_{11,p})$ using element-by-element scalar multiplication.
We know the additive noise component of the original time series is again transformed to an additive noise component after the dimension-by-dimension DWT is taken of the original time series.
Furthermore, as illustrated in Figure 2, we know, at least for the finest level detail vectors, that the
true detail vector values should very closely match the detail vectors of Q. In the case when the
mean function of our time series is a multivariate step function, all true detail vectors will match the
detail vectors of Q. In the more general case where the true underlying mean function is unknown,
we will ultimately retain only the finest level detail vectors in our final analysis. With these properties
in mind, those retained empirical detail coefficient vectors, $\tilde d_{jk}$, may therefore be modelled as

$$ \tilde d_{jk} \sim N_p(d_{jk}, \Sigma) = N_p(\Delta q_{jk}, \Sigma), $$

where $d_{jk}$ is the true detail vector while $\tilde d_{jk} = (\tilde d_{jk,1}, \tilde d_{jk,2}, \ldots, \tilde d_{jk,p})$ and $q_{jk} = (q_{jk,1}, q_{jk,2}, \ldots, q_{jk,p})$ are the $jk$ rows of the matrices $\widetilde{D}$ and $Q$, respectively. Using Bayes' theorem, our posterior distribution
of $\tau$, $\Delta$, and $\Sigma$ takes the form of the product of our likelihood and prior distribution; that is,

$$ p(\tau, \Delta, \Sigma \mid \widetilde{D}) \propto \prod_{j,k} f(\tilde d_{jk} \mid \tau, \Delta, \Sigma)\, p_0(\tau, \Delta, \Sigma). \qquad (8) $$
Any prior information we have about our original time series directly applies after our transform. For example, we could put a Wishart distribution as an informative prior on $\Sigma$ if we have sufficient prior knowledge of $\Sigma$. For the most general case, however, we will apply Jeffreys' noninformative prior given as $p_0(\Delta, \tau, \Sigma) \propto |\Sigma|^{-1/2}$. We also note that implicit in this prior is that we assign a uniform prior to the change-point location throughout the time series. Our posterior distribution takes the form

$$ p(\tau, \Delta, \Sigma \mid \widetilde{D}) \propto |\Sigma|^{-m/2} \exp\left\{ -\frac{1}{2} \sum_{j,k} (\tilde d_{jk} - \Delta q_{jk})^{T} \Sigma^{-1} (\tilde d_{jk} - \Delta q_{jk}) \right\} |\Sigma|^{-1/2}, $$
where $m$ represents the actual number of detail coefficients used in the analysis. In the appendix, we provide details of the calculations where we integrate out $\Delta$ and $\Sigma$ to arrive at the marginalized posterior distribution function that we apply in Sections 6 and 7,

$$ p(\tau \mid \widetilde{D}) \propto C^{-1/2} \left| \sum_{j,k} \tilde d_{jk} \tilde d_{jk}^{T} - \frac{1}{C} B B^{T} \right|^{-(m-p-1)/2}, \qquad (9) $$
where

$$ A = \sum_{j,k} \tilde d_{jk} \tilde d_{jk}^{T}, \qquad B = \sum_{j,k} q_{jk} \tilde d_{jk}, \qquad B^{T} = \sum_{j,k} q_{jk} \tilde d_{jk}^{T}, \qquad \text{and} \qquad C = \sum_{j,k} q_{jk}^{2}. $$
Formally, we estimate the change-point of the time series as $\arg\max_{\tau} p(\tau \mid \widetilde{D})$. In particular, there are $N-1$ possible values of $\tau$, and with probability one a maximum value always exists. Notice that Equation (9) is neither wavelet nor detail-level specific. Depending on what we know (or do not know) about the time series, different wavelet and detail-level combinations may be more appropriate. Depending on the true underlying mean function of the time series, we found through simulation studies that the choice of wavelet had a minor, but noticeable, effect on correctly estimating the change-point location. In the simplest case, when the mean function is represented by a multivariate step function, studies show it is also the simplest wavelet (i.e. the Haar wavelet) that performs marginally better. In the case of a smoothly varying mean function, the Daubechies 10-tap wavelet became the best choice for correctly estimating the change-point location.
We also need to decide which detail levels to apply. This decision is fairly straightforward depending on what is known about the true mean function. In general, the more applicable detail vectors we can use in Equation (9), the more confidence we will be able to attribute to our conclusions. So long as the mean function is smooth except at the change-point location, our model assumptions apply and at least the finest three or four detail levels should be used. If more information about the mean is available, it may be optimal to use more detail levels. For example, in the case of a multivariate step function, all detail levels should be applied.
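A minimal sketch of how the posterior maximization over $\tau$ might be carried out with the Haar DWT is given below. This is our own illustration under simplifying assumptions (Haar wavelet, all detail levels retained, step-function case), not the authors' software; the function names are hypothetical:

```python
import numpy as np

def haar_details(x):
    """All Haar detail coefficients of a length-2^J vector, finest level first."""
    w = np.asarray(x, dtype=float)
    out = []
    while w.size > 1:
        even, odd = w[0::2], w[1::2]
        out.append((even - odd) / np.sqrt(2.0))
        w = (even + odd) / np.sqrt(2.0)
    return np.concatenate(out)                    # length N - 1

def changepoint_log_posterior(X):
    """Unnormalized log p(tau | data) for tau = 1..N-1, following the
    C^{-1/2} |sum d d^T - B B^T / C|^{-(m-p-1)/2} form of Equation (9)."""
    N, p = X.shape
    # column-by-column DWT of the data: the (N-1) x p matrix of details
    D = np.column_stack([haar_details(X[:, i]) for i in range(p)])
    m = D.shape[0]
    S = D.T @ D                                   # sum over jk of d d^T (p x p)
    scores = np.full(N - 1, -np.inf)
    for tau in range(1, N):
        h = np.zeros(N)
        h[tau:] = 1.0                             # idealized step located at tau
        q = haar_details(h)                       # details of the H column
        C = float(q @ q)
        B = D.T @ q                               # p-vector B
        M = S - np.outer(B, B) / C
        sign, logdet = np.linalg.slogdet(M)
        if sign > 0:
            scores[tau - 1] = -0.5 * np.log(C) - 0.5 * (m - p - 1) * logdet
    return scores
```

The estimated change-point is then `argmax(scores) + 1`; the better the idealized step at $\tau$ explains the observed details, the smaller the residual determinant and the larger the score.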
likelihood for $M_1$:

$$ P(\widetilde{D} \mid M_1) = K (2\pi)^{(p-mp)/2}\, 2^{mp/2}\, \Gamma_p\!\left(\frac{m}{2}\right) C^{-1/2} \left| \sum_{j,k} \tilde d_{jk} \tilde d_{jk}^{T} - \frac{1}{C} B B^{T} \right|^{-(m-p-1)/2}, \qquad (10) $$

where

$$ \Gamma_p(x) = \pi^{p(p-1)/4} \prod_{i=1}^{p} \Gamma\!\left[x + \frac{1-i}{2}\right], $$

$K$ is a constant common to both models, and all other terms are as previously defined.
In $M_2$, calculations are simplified since $\Delta$ is assumed to be the $p$-dimensional zero vector. Once again adopting a similar approach as before, we obtain the likelihood of observing our data under $M_2$:

$$ P(\widetilde{D} \mid M_2) = K (2\pi)^{-mp/2}\, 2^{(m+1)p/2}\, \Gamma_p\!\left(\frac{m+1}{2}\right) \left| \sum_{j,k} \tilde d_{jk} \tilde d_{jk}^{T} \right|^{-(m-p)/2}. \qquad (11) $$
We note the difference in the number of free parameters in $M_1$ and $M_2$ is $k_1 - k_2 = p$, namely the dimension of $\Delta$. This suggests a form of the SIC:

$$ \Delta(\mathrm{SIC}) = 2\left(\log P(\widetilde{D} \mid M_1) - \log P(\widetilde{D} \mid M_2)\right) + (k_2 - k_1)\log N. $$

For our multi-dimensional change-point problem, we maximize Equation (10) over $\tau$ to obtain our final result:

$$ \Delta(\mathrm{SIC}) = 2\left(\log P(\widetilde{D} \mid M_1) - \log P(\widetilde{D} \mid M_2)\right) - p \log(N), \qquad (12) $$
where Equation (12) implicitly assumes equal prior probability of realizing either M1 or M2 . In certain
instances the modeller may have reason to favour one model over the other and so the prior odds ratio
of the two models would not be 1. Recall that the posterior odds ratio may be expressed as

$$ \frac{P(M_1 \mid \widetilde{D})}{P(M_2 \mid \widetilde{D})} = \frac{P(\widetilde{D} \mid M_1)}{P(\widetilde{D} \mid M_2)} \cdot \frac{P(M_1)}{P(M_2)} = \text{Bayes factor} \times \frac{P(M_1)}{P(M_2)}. \qquad (13) $$
We may modify Equation (12) to incorporate a prior belief to a priori favour one model over the other.
In our setting this may be accomplished by substituting the data-dependent terms in Equation (12)
with 2 times the log of Equation (13). For the later examples and simulations we provide, we note
that each model is given equal weight and Equation (12) is implemented in its present form.
Our selection process is now a straightforward calculation of $\Delta(\mathrm{SIC})$. We select the no-change model when $\Delta(\mathrm{SIC}) < 0$ and infer that a change-point exists in the time series when $\Delta(\mathrm{SIC}) > 0$. We note that slightly positive values (e.g. $\Delta(\mathrm{SIC}) < 3$) should be treated with caution. Although the change-point model is favoured in such cases, the evidence is not particularly strong. Values computed farther from zero (i.e. $\Delta(\mathrm{SIC}) > 3$) denote strong evidence of the existence of a change-point, with more assurance obtained with larger computed values.
We now combine these results with a binary segmentation algorithm to (1) estimate the number of change-points in a nonlinear multivariate time series and (2) estimate the locations of these change-points. In Section 6 we also provide an illustrative example of how this may be applied to a data set containing multiple change-points.
Assume we observe a $p$-dimensional time series, $X = \{x_i\}_{i=1}^{N}$, where $N \in \mathbb{N}$, such that

$$ x_i \sim N_p(\mu_i, \Sigma). $$

We assume $\Sigma$ is an unknown constant covariance matrix throughout the time series while $\mu_i$ is determined by an unknown multivariate mean function $g(\cdot)$ smoothly varying except at the set of points $\{\tau_i\}_{i=1}^{M}$. We focus our attention on determining $M$ and each $\tau_i$. The binary segmentation algorithm may now be applied as follows:
(1) Apply Equation (12) to the time series $X$. If $\Delta(\mathrm{SIC}) < 3$, terminate the algorithm and conclude the time series has no change-points.
(2) Apply Equation (9) and record the change-point location $\tau$.
(3) Segment the original time series into two time series from elements 1 through $\tau$ and $\tau + 1$ through $N$.
(4) Return to step 1 for each segment.
The algorithm runs until all segments terminate. This approach may be efficiently applied to time
series with an arbitrary number of change-points. Furthermore, no new theoretical machinery is
required thereby simplifying its implementation.
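The recursion above can be sketched as a short driver; this is our own skeleton (not the paper's code), where `has_changepoint` and `estimate_tau` stand in for the $\Delta(\mathrm{SIC})$ test of Equation (12) and the posterior maximization of Equation (9) and are supplied by the caller:

```python
# Binary segmentation skeleton: recursively split the series, collecting
# estimated change-point locations relative to the full series.

def binary_segmentation(x, has_changepoint, estimate_tau, min_gap=5):
    """Return sorted change-point locations found by recursive splitting."""
    found = []

    def recurse(lo, hi):
        segment = x[lo:hi]
        if hi - lo < 2 * min_gap or not has_changepoint(segment):
            return                       # step 1: no change-point detected
        tau = estimate_tau(segment)      # step 2: location within the segment
        found.append(lo + tau)
        recurse(lo, lo + tau)            # step 3: elements 1..tau ...
        recurse(lo + tau, hi)            # ... and tau+1..N
    recurse(0, len(x))
    return sorted(found)
```

The `min_gap` guard reflects the practical caveat discussed below: disallowing change-points within a fixed distance of each other avoids re-detecting a change-point that a near-miss estimate left inside a segment.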
One technical issue the practitioner should be aware of concerns instances when the signal-to-noise ratio is not sufficiently high for the algorithm to pick out the exact change-point location. For example, if in step 2 of the algorithm the change-point location estimate misses by even one time point, then the subsequent segmented time series will contain a change-point already accounted for in the previous step. If this possibility is not accounted for in advance, the algorithm could incorrectly indicate the presence of false change-points. One possibility for addressing this issue is to disallow change-points within a fixed distance of each other. In practice, when change-points are at least five time units away from each other, this problem is not encountered for time series whose dimensional components have a signal-to-noise ratio of at least one. Alternatively, if the algorithm returns multiple change-points adjacent to each other, then the modeller may often safely interpret this return as representing a single change-point.
6. Examples
6.1. Illustrative examples
We provide an illustrative example to demonstrate how our Bayesian-wavelet approach to the multivariate change-point problem easily adapts to various mean functions. For this example, we simulate
data from a three-dimensional normal distribution centred around 0 for the first 85 elements of the
time series and then introduce a shift of 1 unit in the first and third dimensions for the remaining
43 elements. The covariance matrix remains constant throughout the time series and has 0.25 on all
diagonal elements and 0 on all off-diagonal elements. Figure 3 depicts a plot of our time series where
the shift in the first and third dimensions is visually evident.
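The simulated series described above can be generated as follows; this is a sketch of our own (the seed and variable names are our choices, not the paper's):

```python
import numpy as np

# Simulated series from the illustrative example: 128 observations from a
# three-dimensional normal, with a unit shift in dimensions 1 and 3 after
# time point 85.

rng = np.random.default_rng(7)          # arbitrary seed for reproducibility
N, p, tau = 128, 3, 85
Sigma = 0.25 * np.eye(p)                # 0.25 on the diagonal, 0 elsewhere
mu = np.zeros((N, p))
mu[tau:, [0, 2]] = 1.0                  # shift of 1 in dimensions 1 and 3
X = mu + rng.multivariate_normal(np.zeros(p), Sigma, size=N)
```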
Applying a classical likelihood-based approach to this time series correctly returns time point 85 as the estimated change-point location. A purely Bayesian approach such as the one described by Perreault et al. [15] also returns time point 85 as the estimated change-point location, along with a 95% credible interval of [84, 86]. Applying our Bayesian-wavelet approach, we first calculate the SIC using Equation (12) to determine the existence of a change-point. Equation (12) returns a value of 53.25, providing us with near certainty that a change-point exists in the data. Estimating the change-point location with Equation (9) also correctly returns the change-point location at time point 85.
Figure 3. A three-dimensional time series where the mean function is a three-dimensional step function. In particular, a shift occurs at time point 85 in the first and third dimensions.
Figure 4. Marginal posterior distribution of Equation (9) applied to the time series in Figure 3 with all detail levels used (left) and only the four highest detail levels used (right). Notice in each case the concentrated probability is correctly centred at time point 85, but with a slightly wider credible interval for the case on the right when not all detail coefficients are used.
In this case, since the true underlying mean function is a multivariate step function, all detail levels should be applied. In practice, we may not know the structure of the mean function and so would only apply the four highest detail levels. In both cases the Bayesian-wavelet method correctly estimates the change-point location, but with different 95% credible intervals. With all detail levels used we obtain a 95% credible interval of [84, 86], and in the second case with only the four highest detail levels used we obtain a slightly less precise 95% credible interval of [82, 89] (see Figure 4).
To illustrate the power of the Bayesian-wavelet approach, suppose we now impose one period of
a sine wave on the same data set in each dimension. This new data set now represents the scenario
where the mean function of our time series is nonlinear. Figure 5 depicts this new time series where
we see the change-point at time point 85 is much more obscured. Applying the likelihood and pure Bayesian approaches to this time series with a nonlinear mean function returns meaningless results in both cases, as the assumptions upon which they are based are now violated. Directly inputting the new time series into the MLE algorithm, for example, incorrectly estimates the change-point location at time 63.
Our Bayesian-wavelet approach, however, easily adapts to this more complicated situation. Using the four highest detail coefficient levels we calculate an SIC of 12.5, indicating the presence of a change-point in the time series. Maximizing Equation (9) over $\tau$ correctly estimates the change-point location once again at time point 85. Figure 6 displays the relative probabilities for the change-point location with a slightly less concentrated 95% credible interval of [82, 88].
Figure 5. This is the same data set as in Figure 3, only now with one period of the trigonometric function $\sin(2\pi t/128)$ added to the elements in each dimension.

Figure 6. Marginal posterior distribution from the time series in Figure 5 with concentrated probability at the correct change-point at time point 85.

As a final illustrative example, we generate a five-dimensional time series now with multiple change-points at time points 50, 100, 150, and 200. Figure 7 illustrates the first dimension of this time series where segments 1, 2, 3, 4, and 5 are centred around mean vectors $\mu_1^T = (0, 0, 0, 0, 0)$, $\mu_2^T = (1, 1, 1, 1, 1)$, $\mu_3^T = (.5, .5, .5, .5, .5)$, $\mu_4^T = (2, 2, 2, 2, 2)$, and $\mu_5^T = (.5, .5, .5, .5, .5)$,
respectively. Applying Equation (12) to the original time series returns a value of 70.5, indicating with near certainty the presence of a statistical change-point; we therefore apply Equation (9) to estimate the location of the change-point. The first application of Equation (9) estimates the change-point location at time point 200, corresponding to the largest shift of the time series. Next we segment the time series into time points 1–200 and 201–256 and repeat this process. Continuing in such a way until all segments terminate, the algorithm correctly estimates the presence of four statistical change-points at time points 51, 100, 151, and 200, each with associated 95% credible intervals within 5 time units of the actual change-point location.
6.2. Practical example
We present a practical example implementing the methods developed in this article involving six hydrological sequences in the Northern Québec Labrador region as represented in Figure 8. In particular, we analyse the streamflow in units of 1/(km² s) measured in the springs from 1957 to 1995. It has been noted that a perceptible general decrease in streamflow seemed to occur in the 1980s in this region. The regional proximity of the rivers suggests a likely relationship between the rivers, but the specific covariance structure is unclear a priori. Hence, a multivariate analysis certainly appears more appropriate than six individual univariate river studies. The assertion is that, due to causes attributed to
Figure 7. The left figure represents the first dimension of a five-dimensional time series with change-points at time points 50, 100, 150, and 200. The right figure delineates the time series into segments as estimated by the binary segmentation algorithm in conjunction with Equations (9) and (12).
(Figure 8 panels: Churchill Falls, Romaine, Manicouagan, Outardes, Sainte-Marguerite, and À la Baleine.)
Figure 8. Plots of river flows of six rivers in the Northern Québec Labrador region. The dashed lines for À la Baleine are years river flows are estimated from a linear regression since the actual data are unavailable.
perhaps climate change or other regional factors, a change-point in streamflow has occurred. Applying our methods, we would like to determine whether they support this assertion and, if so, estimate the change-point year.
Perreault et al. [15] originally applied a retrospective Bayesian change-point analysis to this data set. The principal advantage of our Bayesian-wavelet method over Perreault's purely Bayesian approach to this data set is that our method applies even if the true underlying mean function is not a step function. Perreault spends considerable time justifying rather strict assumptions on the data and the choice of hyperparameters used in the model. While Perreault's analysis appears largely valid in this case, the strict assumptions required by such a purely Bayesian approach limit its applicability in more general contexts and often make conclusions less compelling. With the Bayesian-wavelet approach, however, we have no need to elicit informative priors for the mean vectors both before and after the
Figure 9. Posterior distribution of a change-point for six hydrological sequences in the Northern Québec Labrador region.
unknown change-point, nor for the covariance matrix, to construct our model. As discussed above, we require only that the true underlying mean function be smooth except at the single change-point and that the random component be normally distributed.
To begin our analysis we note measurements for one river, À la Baleine, are unavailable for the years 1957 to 1962 inclusive. To handle this discrepancy we took two different approaches. In the first case, we simply analysed the data for the common years from 1963 to 1995 inclusive. In the second approach, we treated the river flows for À la Baleine as a dependent variable and performed a linear regression, over the years with complete data, against the other five rivers. With the linear model in hand, we estimated the river flows for À la Baleine for the years 1957 to 1962 using the data from the other five rivers. The dashed line in Figure 8 for À la Baleine represents these estimated values. After a comparison of our analyses, we find very similar results are obtained in both cases. As such we present results from only the latter case.
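The regression imputation step can be sketched as follows. This is a minimal illustration on hypothetical data, not the fitted model from the paper: the incomplete river is regressed, with an intercept, on the five complete rivers over the commonly observed years, and the fitted coefficients then predict the missing years.

```python
import numpy as np

def impute_by_regression(X_complete, y_partial, missing):
    """Fit y ~ X on rows where y is observed, then predict the missing rows.

    X_complete : (n, k) array of predictor series (the five complete rivers).
    y_partial  : (n,) array with np.nan at unobserved positions.
    missing    : (n,) boolean mask of the rows to impute.
    """
    observed = ~missing
    # Design matrix with an intercept column, restricted to observed years.
    A = np.column_stack([np.ones(observed.sum()), X_complete[observed]])
    beta, *_ = np.linalg.lstsq(A, y_partial[observed], rcond=None)
    # Predict the missing years from the same linear model.
    y_filled = y_partial.copy()
    A_miss = np.column_stack([np.ones(missing.sum()), X_complete[missing]])
    y_filled[missing] = A_miss @ beta
    return y_filled
```

With 39 years of five predictor series and six missing response years, the call mirrors the second approach described above.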
We implement the Daubechies 10-tap wavelet since it has known properties particularly well suited to detecting abrupt time series changes.[32] Based on Perreault's analysis, the mean function is some unknown multivariate step function. If this property actually holds, we should be able to apply all detail levels with the Bayesian-wavelet method and arrive at the same answer. Standardizing the detail coefficients as described in Section 3, we thus apply all detail coefficients in our analysis. Finally, we note the length of this time series is not a power of two as required to apply any DWT. We remedy this situation by simply reflecting the beginning of the time series to achieve the required dyadic length, as described in Section 3.
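The padding step can be sketched as follows. Since the transform code itself is not reproduced here, the sketch substitutes the simple Haar filter for the Daubechies 10-tap filter; only the reflection-to-dyadic-length idea is the point.

```python
import math

def reflect_to_dyadic(x):
    """Pad a series to the next power of two by reflecting its beginning."""
    n = len(x)
    target = 1 << math.ceil(math.log2(n))
    pad = target - n
    if pad == 0:
        return list(x)
    # Prepend the first `pad` values in reverse order (reflection).
    return list(x[pad - 1::-1]) + list(x)

def haar_level(x):
    """One level of the Haar DWT: scaling and detail coefficients."""
    root2 = math.sqrt(2)
    s = [(x[2 * i] + x[2 * i + 1]) / root2 for i in range(len(x) // 2)]
    d = [(x[2 * i] - x[2 * i + 1]) / root2 for i in range(len(x) // 2)]
    return s, d
```

For the 39 spring measurements (1957 to 1995), reflection brings the length to 64 before the transform is applied; a locally constant stretch of the series produces detail coefficients near zero.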
With our wavelet parameters in hand, we next must determine whether or not a statistical change-point in the mean vector even exists in our data set. A computation of the SIC returns a value of 14.53, which represents strong evidence for the existence of a statistical change-point. Next, we estimate the location of the change-point by maximizing the Bayesian-wavelet change-point equation. This returns the year 1984 as the change-point location estimate, with a posterior probability of nearly 0.85. Furthermore, we note a 90% credible interval around this estimate of the change-point location ranges over [1983, 1986] (see Figure 9). These results are similar to those of Perreault, who also estimated the change-point year as 1984, but with a 90% credible interval of [1983, 1985].[15]
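Both summaries come directly from the discrete posterior over years pictured in Figure 9. A minimal sketch, using a hypothetical posterior rather than the one computed from the data:

```python
def map_and_credible_interval(years, probs, level=0.90):
    """Posterior mode, plus the shortest contiguous run of years whose
    posterior mass reaches `level`."""
    map_year = years[max(range(len(probs)), key=probs.__getitem__)]
    best = None
    for i in range(len(years)):
        total = 0.0
        for j in range(i, len(years)):
            total += probs[j]
            if total >= level:  # first j reaching the target mass from i
                if best is None or (j - i) < (best[1] - best[0]):
                    best = (i, j)
                break
    lo, hi = best
    return map_year, (years[lo], years[hi])
```

With a posterior sharply concentrated at one year, the function returns that mode together with a short surrounding interval.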
7. Simulations
In order to compare the performance of the Bayesian-wavelet method with a likelihood-based
method, we ran simulations and compared how often the estimate of the change-point was within
two time units of the true change-point.
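This comparison criterion is simply an empirical hit rate over the simulation runs; a one-function sketch (names hypothetical):

```python
def hit_rate(estimates, truths, tol=2):
    """Fraction of simulation runs whose estimated change-point lies
    within `tol` time units of the true location."""
    hits = sum(abs(e - t) <= tol for e, t in zip(estimates, truths))
    return hits / len(estimates)
```

Each cell of the tables below is such a rate over 1000 simulated series.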
Table 1. Percentage each method estimates the change-point location within 2 time units of the true change-point location.

Covariance I
  p      shift 0.5    shift 1.0    shift 1.5    shift 2.0
         BW    MLE    BW    MLE    BW    MLE    BW    MLE
  10     0.36  0.37   0.77  0.77   0.95  0.96   0.99  0.99
  25     0.60  0.59   0.96  0.96   0.99  0.99   1.00  1.00
  50     0.79  0.79   0.99  0.99   1.00  1.00   1.00  1.00
  75     0.98  0.94   1.00  1.00   1.00  1.00   1.00  1.00
  100    0.98  0.98   1.00  1.00   1.00  1.00   1.00  1.00
  –      0.98  0.98   1.00  1.00   1.00  1.00   1.00  1.00
  –      0.93  0.92   1.00  1.00   1.00  1.00   1.00  1.00

Covariance Σ1
  p      shift 0.5    shift 1.0    shift 1.5    shift 2.0
         BW    MLE    BW    MLE    BW    MLE    BW    MLE
  10     0.21  0.22   0.60  0.60   0.85  0.84   0.96  0.96
  25     0.14  0.16   0.56  0.56   0.81  0.81   0.97  0.97
  50     0.20  0.21   0.71  0.71   0.89  0.90   0.98  0.98
  75     0.13  0.13   0.62  0.63   0.87  0.88   0.97  0.97
  100    0.07  0.08   0.41  0.41   0.76  0.76   0.91  0.91
  –      0.06  0.07   0.24  0.26   0.57  0.57   0.88  0.88
  –      0.05  0.05   0.12  0.14   0.32  0.31   0.54  0.53

Notes: BW indicates the Bayesian-wavelet approach and MLE indicates the maximum likelihood estimation approach. Simulations are conducted with two covariance matrices, the identity covariance matrix (I) and a covariance matrix with 1s along the diagonal and .5s on all off-diagonal elements (Σ1).
Table 2. Percentage each method estimates the change-point location within 2 time units of the true change-point location, where each run represents 1000 simulations. In all cases the initial mean vector is μ = (sin(2πt/128), sin(2πt/128), . . . , sin(2πt/128)) and then shifts to μ = (sin(2πt/128) + 1, sin(2πt/128) + 1, . . . , sin(2πt/128) + 1).

  p      σ = 0.2      σ = 0.4      σ = 0.6      σ = 0.8      σ = 1.0      σ = 1.2      σ = 1.4      σ = 1.6
         BW    MLE    BW    MLE    BW    MLE    BW    MLE    BW    MLE    BW    MLE    BW    MLE    BW    MLE
  10     0.98  0.00   0.88  0.00   0.71  0.00   0.60  0.01   0.53  0.02   0.44  0.01   0.39  0.01   0.31  0.01
  25     0.99  0.00   0.99  0.00   0.92  0.00   0.87  0.01   0.78  0.01   0.70  0.00   0.62  0.01   0.54  0.00
  50     1.00  0.00   0.99  0.00   0.99  0.00   0.94  0.00   0.89  0.00   0.89  0.00   0.76  0.00   0.69  0.00
  –      1.00  0.00   1.00  0.00   0.99  0.00   0.97  0.00   0.96  0.00   0.90  0.00   0.83  0.00   0.79  0.00
  –      1.00  0.00   1.00  0.00   0.99  0.00   0.99  0.00   0.97  0.00   0.95  0.00   0.89  0.00   0.87  0.00
  –      1.00  0.00   1.00  0.00   1.00  0.00   1.00  0.00   1.00  0.00   0.99  0.00   0.99  0.00   0.98  0.00
  –      1.00  0.00   1.00  0.00   1.00  0.00   1.00  0.00   1.00  0.00   1.00  0.00   0.99  0.00   0.99  0.00

Notes: BW indicates the Bayesian-wavelet approach and MLE indicates the maximum likelihood estimation approach. Throughout the simulations the covariance matrix used is the identity multiplied by σ².
7.2. Multivariate piecewise smooth function with a single mean function shift
We next investigate how these methods perform when the underlying mean function does not conform to a multivariate step function. In particular, since the Bayesian-wavelet method requires only
the underlying mean function to be smooth except at the change-point, we consider a multivariate
time series with a nonconstant smoothly varying mean function.
We generate time series with a smoothly varying mean function except at a single change-point.
Specifically, we set the initial mean to t = sin(2 t/128)1, t = 1, 2, . . . , and then after the changepoint the mean vector becomes t = sin(2 t/128)1 + 1, t = + 1, + 2, . . . , 128. That is, the shift
vector is = (1, 1, . . . , 1) for all simulations. We then incrementally adjust the variance of the additive noise by changing the diagonal terms of the covariance matrix. We set our covariance matrix
equal to the identity multiplied by the constant 2 as given in Table 2. The change-point is randomly
selected from the middle 90% of the time series and the Daubechies 10-tap wavelet is applied using
the four highest details coefficients.
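One replicate of this design can be generated as follows; a sketch under the stated assumptions (length 128, shift vector of 1s, covariance σ²I, so the coordinates can be drawn independently):

```python
import math
import random

def simulate_replicate(p, sigma, n=128, rng=random):
    """Generate one series from the Section 7.2 design: every coordinate has
    mean sin(2*pi*t/128), shifted up by 1 after a change-point drawn
    uniformly from the middle 90% of the time points."""
    lo, hi = int(0.05 * n), int(0.95 * n)
    cp = rng.randint(lo, hi)  # last time point before the shift
    series = []
    for t in range(1, n + 1):
        mean = math.sin(2 * math.pi * t / 128) + (1.0 if t > cp else 0.0)
        series.append([mean + rng.gauss(0.0, sigma) for _ in range(p)])
    return cp, series
```

Setting `sigma` to zero recovers the noiseless mean function, which is useful for checking the simulation code itself.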
Simulation results provide evidence that the Bayesian-wavelet method does well at seeing through the additive noise component of the time series and estimating the true change-point location. Applying Equation (9) exactly as we did in Section 7.1, only now with just the four highest detail levels and the Daubechies 10-tap wavelet, we have a method that easily adapts to estimate change-points in a very different time series. Methods such as MLE or a purely Bayesian approach that make strict assumptions on the true underlying mean function do not share this flexibility. We see the underlying form of the oscillating mean function violates the likelihood assumptions in such a way that this method has no ability to correctly estimate the change-point location. Only in the lower dimensional cases with high variance, when the time series more closely resembles pure noise, does the MLE register a few correct estimates by chance alone. In the other cases the geometry of the time series forces the MLE method away from the true change-point location.
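To make the failure mode concrete, consider the kind of estimator the likelihood method reduces to in the simplest setting: a univariate least-squares split under a constant-mean-before/constant-mean-after model (a sketch, not the multivariate MLE used in the simulations). When the mean oscillates, both candidate segments carry large within-segment variance from the oscillation itself, and the split that minimizes the residual sum of squares need not be anywhere near the true shift.

```python
def mle_step_change_point(x):
    """Least-squares change-point estimate under a constant-mean-before /
    constant-mean-after model (the Gaussian MLE with equal, known variance).
    Returns the size of the first segment."""
    n = len(x)
    best_tau, best_rss = None, float("inf")
    for tau in range(1, n):
        left, right = x[:tau], x[tau:]
        ml, mr = sum(left) / tau, sum(right) / (n - tau)
        rss = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if rss < best_rss:
            best_tau, best_rss = tau, rss
    return best_tau
```

On a true step the split criterion is exact, but on an oscillating mean with a small shift (e.g. alternating 0/10 values shifted up by 0.5 midway) the minimizer is driven by the oscillation rather than the shift.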
8. Conclusion
In this article we presented a methodology for both inferring the existence of one or more statistical change-points in a multivariate time series and estimating their locations. This general approach is not limited to changes in mean, but can also be adapted to estimate covariance structure change-point locations as well. Finally, it can be shown that Equation (5) is invariant to dimension-preserving linear transformations. This property suggests applications to the change-point problem for high dimensional time series in conjunction with a dimension reduction through a random matrix multiplication. All these topics are currently under investigation.
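The dimension-reduction idea mentioned above can be sketched as a random projection of each observation; the Gaussian entries and 1/sqrt(k) scaling below are one common convention for such projections, not a prescription from this paper.

```python
import random

def random_projection(series, k, rng=random):
    """Map each p-dimensional observation to k dimensions via a single
    random Gaussian matrix shared across the whole series."""
    p = len(series[0])
    scale = 1.0 / k ** 0.5
    # One k-by-p matrix of scaled standard normal entries.
    R = [[rng.gauss(0.0, 1.0) * scale for _ in range(p)] for _ in range(k)]
    return [[sum(R[i][j] * obs[j] for j in range(p)) for i in range(k)]
            for obs in series]
```

A change-point method would then be run on the reduced k-dimensional series in place of the original high-dimensional one.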
Another interesting aspect of this approach is how it may be used as an indirect tool to validate certain data set assumptions. When parametric methods such as MLE or purely Bayesian models are applied to infer and estimate the location of a single change-point in a multivariate time series, the true underlying mean function is typically assumed to be a multivariate step function. In principle, using all detail levels, our Bayesian-wavelet method should return very nearly identical change-point location estimates in such cases. If a discrepancy exists between the above parametric methods and our Bayesian-wavelet method, then either the time series signal-to-noise ratio is not sufficiently high or the model assumptions are simply not valid.
We found our multivariate Bayesian-wavelet approach for detecting statistical change-points performs comparably with the classical likelihood method when the true mean function of the time series is a multivariate step function. The advantage of our approach is seen in how our method also easily extends to more general situations. The simulations demonstrate how the likelihood method fails when its model assumptions become invalid, but also show how the Bayesian-wavelet method still performs well. We chose a multivariate trigonometric function as an example in our simulations, but the approach applies equally well to any other piecewise smooth multivariate function. We thus conclude that the Bayesian-wavelet method affords the modeler greater flexibility in much more general situations and potentially serves as a valuable diagnostic tool in the setting of the multivariate change-point problem.
Acknowledgments
We would like to thank both Professor Darrin Speegle and the anonymous referees for their careful consideration of
this paper. Their suggestions and helpful advice certainly improved the final form of this paper.
Disclosure statement
No potential conflict of interest was reported by the authors.
References
[1] Montgomery D. Introduction to statistical quality control. 6th ed. Hoboken, NJ: Wiley; 2009.
[2] Worsley K. On the likelihood ratio test for a shift in location of normal populations. J Amer Statist Assoc. 1979;74:365–367.
[3] Smith AFM. A Bayesian approach to inference about a change-point in a sequence of random variables. Biometrika. 1975;62:407–416.
[4] Barry D, Hartigan J. A Bayesian analysis for change point problems. J Amer Statist Assoc. 1993;88(421):309–319.
[5] Chib S. Estimation and comparison of multiple change-point models. J Econom. 1998;86(2):221–241.
[6] Green PJ. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995;82:711–732.
[7] Chen J, Gupta AK. Testing and locating variance change-points with application to stock prices. J Amer Statist Assoc. 1997;92:739–747.
[8] Zamba K, Hawkins D. A multivariate change-point model for change in mean vector and/or covariance structure. J Qual Technol. 2009;41(3).
[9] Carlin B, Gelfand A, Smith A. Hierarchical Bayesian analysis of changepoint problems. Appl Stat. 1992;389–405.
[10] Pettitt A. A non-parametric approach to the change-point problem. Appl Stat. 1979;126–135.
[11] Bai J. Estimation of a change point in multiple regression models. Rev Econ Stat. 1997;79(4):551–563.
[12] Sullivan J, Woodall W. Change-point detection of mean vector or covariance matrix shifts using multivariate individual observations. IIE Trans. 2000;32(6):537–549.
[13] Chen J, Gupta AK. Parametric statistical change point analysis. New York: Birkhäuser; 2012.
[14] Horváth L, Kokoszka P. Testing for changes in multivariate dependent observations with an application to temperature changes. J Multivariate Anal. 1999;68:96–119.
[15] Perreault L, Parent E, Bernier J, Bobée B. Retrospective multivariate Bayesian change-point analysis: a simultaneous single change in the mean of several hydrological sequences. J Multivariate Anal. 2000;235:221–241.
[16] Son YS, Kim SW. Bayesian single change point detection in a sequence of multivariate normal observations. Statistics. 2005;39(5):373–387.
[17] Müller HG. Change-points in nonparametric regression analysis. Ann Stat. 1992;20:737–761.
[18] Ogden R, Lynch J. Bayesian analysis of change-point models. Lecture Notes Stat. 1999;141:67–82.
[19] Ciuperca G. Estimating nonlinear regression with and without change-points by the LAD method. Ann Inst Stat Math. 2011;63:717–743.
[20] Battaglia F, Protopapas MK. Multi-regime models for nonlinear nonstationary time series. Comput Stat. 2012;27:319–341.
[21] Matteson DS, James NA. A nonparametric approach for multiple change point analysis of multivariate data. J Amer Statist Assoc. 2014;109:334–345.
[22] Mason R, Young J. Multivariate statistical process control with industrial applications. Philadelphia, PA: Society for Industrial and Applied Mathematics; 2002.
[23] Perreault L, Bernier J, Bobée B, Parent E. Change-point analysis in hydrometeorological time series. Part 1. The normal model revisited. J Hydrol. 2000;235:221–241.
[24] Wagner M. Handbook of biosurveillance. Burlington, MA: Elsevier Academic Press; 2006.
[25] Daubechies I. Ten lectures on wavelets. Philadelphia, PA: Society for Industrial and Applied Mathematics; 1992.
[26] Nason G. Wavelet methods in statistics with R. New York: Springer Science+Business Media; 2008.
[27] Vidakovic B. Statistical modeling by wavelets. Danvers, MA: Wiley; 1999.
[28] Mallat SG. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Trans Pattern Anal Mach Intell. 1989;11(7):674–693.
[29] Donoho DL, Johnstone IM. Ideal spatial adaptation by wavelet shrinkage. Biometrika. 1994;81:425–455.
[30] Mardia K, Kent J, Bibby J. Multivariate analysis. New York: Academic Press; 1979.
[31] Wang Y. Jump and sharp cusp detection by wavelets. Biometrika. 1995;82:385–397.
[32] Jensen A, la Cour-Harbo A. Ripples in mathematics: the discrete wavelet transform. Berlin: Springer; 2001.
Appendix
We derive Equation (9) beginning with the posterior distribution
$$ p(\lambda, \delta, \Sigma \mid D) \propto |\Sigma|^{-m/2} \exp\left( -\frac{1}{2} \sum_{j}\sum_{k} (d_{jk} - q_{jk}\delta)^{T} \Sigma^{-1} (d_{jk} - q_{jk}\delta) \right) |\Sigma|^{-1/2}. $$
Here, m represents the actual number of detail coefficients used in the analysis. We integrate out δ and Σ to obtain the marginal posterior distribution function
$$ p(\lambda \mid D) \propto \int_{PD(p)} \int_{\mathbb{R}^{p}} |\Sigma|^{-(m+1)/2} \exp\left( -\frac{1}{2} \sum_{j}\sum_{k} (d_{jk} - q_{jk}\delta)^{T} \Sigma^{-1} (d_{jk} - q_{jk}\delta) \right) d\delta \, d\Sigma. \tag{A1} $$
Expanding the quadratic form in the exponent gives
$$ p(\lambda \mid D) \propto \int_{PD(p)} \int_{\mathbb{R}^{p}} |\Sigma|^{-(m+1)/2} \exp\left( -\frac{1}{2} \sum_{j}\sum_{k} \left( d_{jk}^{T}\Sigma^{-1}d_{jk} + q_{jk}^{2}\,\delta^{T}\Sigma^{-1}\delta - q_{jk}\,\delta^{T}\Sigma^{-1}d_{jk} - q_{jk}\,d_{jk}^{T}\Sigma^{-1}\delta \right) \right) d\delta \, d\Sigma \tag{A2} $$
$$ = \int_{PD(p)} \int_{\mathbb{R}^{p}} |\Sigma|^{-(m+1)/2} \exp\left( -\frac{1}{2} \left( A + C\,\delta^{T}\Sigma^{-1}\delta - \delta^{T}\Sigma^{-1}B - B^{T}\Sigma^{-1}\delta \right) \right) d\delta \, d\Sigma, \tag{A3} $$
where
$$ A = \sum_{j}\sum_{k} d_{jk}^{T}\Sigma^{-1}d_{jk}, \qquad B = \sum_{j}\sum_{k} q_{jk} d_{jk}, \qquad B^{T} = \sum_{j}\sum_{k} q_{jk} d_{jk}^{T}, \qquad C = \sum_{j}\sum_{k} q_{jk}^{2}. $$
Completing the square in δ and integrating the resulting multivariate normal kernel over $\mathbb{R}^{p}$ yields
$$ = \int_{PD(p)} |\Sigma|^{-m/2} C^{-p/2} \exp\left( -\frac{1}{2} \left( \sum_{j}\sum_{k} d_{jk}^{T}\Sigma^{-1}d_{jk} - \frac{1}{C}\, B^{T}\Sigma^{-1}B \right) \right) d\Sigma $$
$$ = \int_{PD(p)} |\Sigma|^{-m/2} C^{-p/2} \exp\left( -\frac{1}{2} \operatorname{tr}\!\left( \Sigma^{-1} \left( \sum_{j}\sum_{k} d_{jk}d_{jk}^{T} - \frac{1}{C}\, BB^{T} \right) \right) \right) d\Sigma $$
$$ \propto C^{-p/2} \left| \sum_{j}\sum_{k} d_{jk}d_{jk}^{T} - \frac{1}{C}\, BB^{T} \right|^{-(m-p-1)/2}, \tag{A4} $$
where in the last step Equation (9) follows by dropping multiplicative constants and applying the known form of the Wishart distribution.