
OPTIMAL SEQUENTIAL VECTOR QUANTIZATION OF MARKOV SOURCES∗

VIVEK S. BORKAR†, SANJOY K. MITTER‡, AND SEKHAR TATIKONDA§

SIAM J. CONTROL OPTIM. © 2001 Society for Industrial and Applied Mathematics
Vol. 40, No. 1, pp. 135–148
Abstract. The problem of sequential vector quantization of a stationary Markov source is cast
as an equivalent stochastic control problem with partial observations. This problem is analyzed using
the techniques of dynamic programming, leading to a characterization of optimal encoding schemes.
Key words. optimal vector quantization, sequential source coding, Markov sources, control
under partial observations, dynamic programming
AMS subject classifications. 94A29, 93E20, 90C39
PII. S0363012999365261
1. Introduction. In this paper, we consider the problem of optimal sequential
vector quantization of stationary Markov sources. In the traditional rate distortion
framework, the well-known result of Shannon shows that one can achieve entropy rates
arbitrarily close to the rate distortion function for suitably long lossy block codes [9].
Unfortunately, long block codes imply long delays in communication systems. In
particular, control applications require causal coding and decoding schemes.
These concerns are not new, and there is a sizable body of literature addressing
these issues. We shall briefly mention a few key contributions. Witsenhausen [24]
looked at the optimal finite horizon sequential quantization problem for finite state
encoders and decoders. His encoder had a fixed number of levels. He showed that
the optimal encoder for a kth order Markov source depends on at most the last k
symbols and the present state of the decoder's memory. Walrand and Varaiya [23]
looked at the infinite horizon sequential quantization problem for sources with finite
alphabets. Using Markov decision theory, they were able to show that the optimal
encoder for a Markov source depends only on the current input and the current state
of the decoder. Gaarder and Slepian [12] looked at sequential quantization over classes
of finite state encoders and decoders. Though they lay down several useful definitions,
their results, by their own admission, are incomplete. Other related works include a
neural network based scheme [17] and a study of optimality properties of codes in
specific cases [3], [10]. Some abstract theoretical results are given in [19].

∗Received by the editors December 20, 1999; accepted for publication (in revised form) December
6, 2000; published electronically May 31, 2001. A preliminary version of this paper appeared in the
Proceedings of the 1998 IEEE International Symposium on Information Theory, IEEE Information
Theory Society, Piscataway, NJ, 1998, p. 71.
http://www.siam.org/journals/sicon/40-1/36526.html
†School of Technology and Computer Science, Tata Institute of Fundamental Research, Homi
Bhabha Road, Mumbai 400005, India (borkar@tifr.res.in). This work was done while this author
was visiting the Laboratory for Information and Decision Systems, Massachusetts Institute of Technology.
The research of this author was supported by a Homi Bhabha fellowship, NSF KDI: Learning,
Adaptation, and Layered Intelligent Systems grant 6756500, and grant III 5(12)/96-ET of the Department
of Science and Technology, Government of India.
‡Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Room
35-403, Cambridge, MA 02139 (mitter@lids.mit.edu). The research of this author was supported by
NSF KDI: Learning, Adaptation, and Layered Intelligent Systems grant ECS-9873451.
§University of California at Berkeley, Soda Hall, Room 485, Berkeley, CA 94720
(tatikond@eecs.berkeley.edu). The research of this author was supported by U.S. Army grant
DAAL03-92-G-0115.
A formulation similar in spirit to ours (insofar as it aims to minimize a Lagrangian
distortion measure described below) is studied in [7], [8]. They show empirically
that one can make gains in performance by entropy coding the codewords.
In [7] the entropy constrained vector quantization problem for a block is formulated
and a Max–Lloyd-type algorithm is introduced. In [8] they introduce the conditional
entropy constrained vector quantization problem and show that one should use
conditional entropy coders when the codewords are not independent from block to block.
In these papers there is more emphasis on synthesizing algorithms and less emphasis
on proving rigorously the optimality of the schemes proposed. Along with this work
there is a large literature on differential predictive coding, where one encodes the
innovation. Other than the Gauss–Markov case, though, it is not apparent how one
may prove the optimality of such innovation coding schemes. Herein we emphasize,
through the dynamic programming formulation, the optimality properties of the
sequential quantization scheme. This leads the way for the application of many powerful
approximate dynamic programming tools.
In this paper we do not impose a fixed number of levels on the quantizer. The
aim is to jointly optimize the entropy rate of the quantized process (in order
to obtain a better compression rate) as well as a suitable distortion measure. The
traditional rate distortion framework [9] calls for the minimization of the former with
a hard constraint on the latter. We shall, however, consider the analytically more
tractable Lagrangian distortion measure of [7], [8], which is a weighted combination
of the two. We approach the problem from a stochastic control viewpoint, treating
the choice of the sequential quantizer as a control choice. The correct state space
then turns out to be the space of conditional laws of the underlying process given
the quantizer outputs, these conditional laws serving as the "state" or "sufficient
statistics." The state dynamics is then given by the appropriate nonlinear filter.
While this is very reminiscent of the finite state quantizers studied, e.g., in [16], the
state space here is not finite, and the state process has the familiar stochastic control
interpretation as the output of a nonlinear filter. We then consider the "separated"
or "certainty equivalent" control problem of controlling this nonlinear filter so as to
minimize an appropriately transformed Lagrangian distortion measure. This problem
can be analyzed in the traditional dynamic programming framework. This in turn
can be made a basis for computational schemes for near-optimal code design.
To summarize, the main contributions of this paper are as follows.
(i) We formulate a stochastic control problem equivalent to the optimal vector
quantization problem. In the process, we make precise the passage from the
source output to its encoded version in a manner that ensures the well-posedness
of the control problem.
(ii) We underscore the crucial role of the process of conditional laws of the source
given the quantized process as the correct sufficient statistics for the problem.
(iii) We analyze the equivalent control problem by using the methodology of
Markov decision theory. This opens up the possibility of using the computational
machinery of Markov decision theory for code design.
Specifically, we consider a pair of a state process {X_n} and an associated
observation process {Y_n}, given by the dynamics

    X_{n+1} = g(X_n, ξ_n),    Y_{n+1} = h(X_n, ξ'_n),

where {ξ_n}, {ξ'_n} are independently and identically distributed (i.i.d.) driving noise
processes. We quantize Y_{n+1} into its quantized version q_{n+1} that has a finite range and
is selected based on the history q^n := [q_0, q_1, . . . , q_n]. The aim then is to minimize
the long run average of the Lagrangian distortion measure R_n = E[H(q_{n+1}/q^n) +
λ ||Y_n − q̄_n||^2], where λ > 0 is a prescribed constant, H(·/·) is the conditional entropy,
and q̄_n is the best estimate of Y_n given q^n.
Let μ_n be the regular conditional law of X_n given q^n for n ≥ 0. From μ_n,
one can easily derive the regular conditional law of Y_{n+1} given q^n. Using Bayes's
rule, {μ_n} can be evaluated recursively by a nonlinear filter. Furthermore, one can
express R_n as the expected value of a function of μ_n and a control process Q_n
alone. ({Q_n} is, in fact, the finite set depicting the range of the vector quantization
of Y_{n+1} prior to its encoding into a fixed finite alphabet.) This allows us to consider
the equivalent problem of controlling {μ_n} with the aim of minimizing the long run
average of the R_n recast as above. This then fits the framework of traditional Markov
decision theory and can be approached by dynamic programming. As usual, one has
to derive the dynamic programming equations for the average cost control problem by
a vanishing discount argument applied to the associated infinite horizon discounted
control problem for which the dynamic programming equation is easier to justify.
The structure of the paper is as follows. In section 2, we describe the sequential
quantization problem and introduce the formalism. Section 3 derives the equivalent
control problem. This is analyzed in section 4 using the formalism of Markov decision
theory.
2. Sequential quantization. This section formulates the sequential vector
quantization problem. In particular, it describes the passage from the observation process
to its quantized version, which in turn gets mapped into its encoding with respect to
a fixed alphabet. We also lay down our key assumptions which, apart from making
the coding scheme robust, also make its subsequent control formulation well-posed.
The section concludes with a precise statement of this long run average cost control
problem with partial observations that is equivalent to our original vector quantization
problem.
Throughout, for a Polish (i.e., complete separable metric) space X, P(X) will
denote the Polish space of probability measures on X with the Prohorov topology [6,
Chapter 2]. For a random process {Z_m}, set Z^n = {Z_m, 0 ≤ m ≤ n}, its past up to
time n. Finally, K will denote a finite positive constant, depending on the context.
Let {X_n} be an ergodic Markov process taking values in R^s, s ≥ 1, with an
associated observation process {Y_n} taking values in R^d, d ≥ 1. ({Y_n} thus is the
actual process being observed.) Their joint evolution is governed by a transition
kernel x ∈ R^s → p(x, dz, dy) ∈ P(R^s × R^d), as described below. We assume this
map to be continuous and, further, that p(x, dz, dy) = φ(y, z|x) dz dy for a density
φ(·, ·|·) : R^d × R^s × R^s → R_+ that is continuous and strictly positive and, furthermore,
φ(y, z|·) is Lipschitz uniformly in y, z.
The evolution law is as follows. For A ⊂ R^s, B ⊂ R^d Borel,

    P(X_{n+1} ∈ A, Y_{n+1} ∈ B / X^n, Y^n) = ∫_{A×B} p(X_n, dz, dy)
                                           = ∫_A ∫_B φ(y, z|X_n) dy dz.

Following [13], we call the pair ({X_n}, {Y_n}) a Markov source, though the terminology
"hidden Markov model" is more common nowadays. We impose on ({X_n}, {Y_n}) the
condition of asymptotic flatness described next. We assume that these processes
are given recursively by the dynamics

    X_{n+1} = g(X_n, ξ_n),                                        (2.1)
    Y_{n+1} = h(X_n, ξ'_n),                                       (2.2)

where {ξ_n}, {ξ'_n} are i.i.d. R^m-valued (say) random variables independent of each other
and of X_0, and g : R^s × R^m → R^s, h : R^s × R^m → R^d are prescribed measurable
maps satisfying

    ||g(x, y)||, ||h(x, y)|| ≤ K(1 + ||x||)  ∀ y.

Equations (2.1) and (2.2) and the laws of {ξ_n}, {ξ'_n} completely specify p(x, dz, dy),
and therefore the conditions we impose on the latter will implicitly restrict the choice
of the former.
Let ({X_n(x)}, {Y_n(x)}), ({X_n(y)}, {Y_n(y)}) denote the solutions to (2.1), (2.2)
for X_0 = x, respectively, y, with the same driving noises {ξ_n}, {ξ'_n}. The assumption
of asymptotic flatness then is that there exist K > 0, 0 < β < 1, such that

    E[||X_n(x) − X_n(y)||] ≤ K β^n ||x − y||,  n ≥ 0.

A simple example would be the case when g(x, u) = ḡ(x) + u, h(x, u) = h̄(x) + u
for all x, u, where ḡ : R^s → R^s is a contraction with respect to some equivalent
norm on R^s. This covers, e.g., the usual linear quadratic Gaussian (LQG) case when
the state process is stable. Another example would be a discretization of continuous
time asymptotically flat processes considered in [1], where a Lyapunov-type sufficient
condition for asymptotic flatness is given. This assumption, one must add, is not
required for our formulation of the optimization problem per se but will play a key
role in our derivation of the dynamic programming equations in section 4.
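As a quick numerical illustration of the contraction property (our own sketch, not part of the paper; the map g and the Gaussian noise law are hypothetical stand-ins), one can simulate two copies of (2.1) from different initial points with common driving noise and watch E[||X_n(x) − X_n(y)||] decay geometrically:

```python
import numpy as np

rng = np.random.default_rng(0)
a = 0.8            # |a| < 1: x -> a*x is a contraction (hypothetical choice)

def g(x, xi):
    """State map of (2.1); a stable linear example as in the text."""
    return a * x + xi

def flatness_gap(x0, y0, n_steps=30, n_paths=2000):
    """Estimate E[||X_n(x0) - X_n(y0)||] with common driving noise."""
    x = np.full(n_paths, x0, dtype=float)
    y = np.full(n_paths, y0, dtype=float)
    gaps = []
    for _ in range(n_steps):
        xi = rng.normal(size=n_paths)          # the SAME noise for both copies
        x, y = g(x, xi), g(y, xi)
        gaps.append(np.abs(x - y).mean())
    return np.array(gaps)

gaps = flatness_gap(5.0, -5.0)
for n in (1, 10, 20, 30):                      # here the gap is exactly a^n * 10
    print(n, gaps[n - 1], a ** n * 10.0)
```

For this linear example the common noise cancels exactly, so the printed gap matches Kβ^n||x − y|| with K = 1, β = a.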
Let Γ = {γ_1, γ_2, . . . , γ_N} be an ordered set that will serve as the alphabet for our
vector quantizer. Let {q_n} denote the Γ-valued process that stands for the vector
quantized version of {Y_n}. The passage from {Y_n} to {q_n} is described below.
Let D denote the set of finite nonempty subsets of R^d with cardinality at most
N − 1, satisfying the following.
(†) There exist M > 0 (large) and ε > 0 (small) such that
(i) x ∈ A ∈ D implies ||x|| ≤ M;
(ii) x = [x^1, . . . , x^d], y = [y^1, . . . , y^d] for x, y ∈ A ∈ D, x ≠ y, implies |x^i − y^i| > ε
for all i.
We endow D with the Hausdorff metric, which renders it a compact Polish space.
For A ∈ D, let l_A : R^d → A denote the map that maps x ∈ R^d to the element
of A nearest to it with reference to the Euclidean norm || · ||, any tie being resolved
according to some fixed priority rule. Let i_A : A → Γ denote the map that first orders
the elements {a_1, . . . , a_m} of A lexicographically and then maps them to {γ_1, . . . , γ_m},
preserving the order.
Let Γ^∞ = Γ × Γ × · · · (i.e., a one-sided countably infinite product; analogous
notation will be used elsewhere). At each time n, a measurable map ν_n : Γ^{n+1} → D
is chosen. With Q_n := ν_n(q^n), one sets

    q_{n+1} = i_{Q_n}(l_{Q_n}(Y_{n+1})).

This defines {q_n} recursively as the quantized process that is to be encoded and
transmitted across a communication channel.
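Concretely, the two maps can be sketched as follows (our own illustration; the alphabet labels and the tie-breaking rule are hypothetical choices consistent with the definitions above):

```python
import numpy as np

GAMMA = list(range(8))        # stand-in alphabet of N = 8 symbol labels

def l_A(y, A):
    """Nearest element of the finite set A under the Euclidean norm; ties are
    broken by lexicographic order of the codevectors (one fixed priority rule)."""
    A = sorted(map(tuple, A))
    d2 = [np.sum((np.asarray(a) - y) ** 2) for a in A]
    return np.asarray(A[int(np.argmin(d2))])

def i_A(a, A):
    """Label the lexicographically ordered elements of A by GAMMA[0..|A|-1]."""
    A = sorted(map(tuple, A))
    return GAMMA[A.index(tuple(a))]

A = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([-1.0, 2.0])]
y = np.array([0.8, 0.7])
q = i_A(l_A(y, A), A)         # q_{n+1} = i_{Q_n}(l_{Q_n}(Y_{n+1}))
print(q)                      # -> 2: (1,1) is nearest and third in lexicographic order
```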
The explanation of this scheme is as follows. In the case of a fixed quantizer, the
finite subset of R^d to which the signal gets mapped can itself be identified with the
alphabet Γ. In our case, however, this set will vary from one instant to another and
therefore must be mapped to a fixed alphabet Γ in a uniquely invertible manner.
This is achieved through the map i_A. Assuming that the receiver knows ahead of
time the deterministic maps {ν_n(·)} (later on we argue that a single fixed ν(·) will
suffice), she can reconstruct Q_n as ν_n(q^n) on having received q^n by time n. In turn,
she can reconstruct i^{-1}_{Q_n}(q_{n+1}) = l_{Q_n}(Y_{n+1}) as the vector quantized version of Y_{n+1}.
The main contribution of the condition (†) is to render the map A = {a_1, . . . , a_m} ∈
D → {i_A(a_1), . . . , i_A(a_m)} ∈ Γ^m continuous. Not only does this make sense from the
point of view of robust decoding, but it also makes the control problem we formulate
later well-posed.
As mentioned in the introduction, our aim will be to jointly optimize over the
choice of {ν_n(·)} the average entropy rate of {q_n} (≈ the average code length if
the encoding is done optimally) and the average distortion. The conventional rate
distortion theoretic formulation would be to minimize the average entropy rate

    limsup_{n→∞} (1/n) Σ_{m=0}^{n−1} E[H(q_{m+1}/q^m)],

H(·) being the (conditional) Shannon entropy, subject to a hard constraint on the
distortion

    limsup_{n→∞} (1/n) Σ_{m=0}^{n−1} E[||Y_m − q̄_m||^2] ≤ K,

where q̄_m = i^{-1}_{Q_{m−1}}(q_m) = l_{Q_{m−1}}(Y_m). We shall, however, consider the simpler problem
of minimizing the Lagrangian distortion measure

    limsup_{n→∞} (1/n) Σ_{m=0}^{n−1} E[H(q_{m+1}/q^m) + λ ||Y_m − q̄_m||^2],    (2.3)

where λ > 0 is a prescribed constant. One may think of λ as a Lagrange multiplier,
though, strictly speaking, such an interpretation is lacking given our arbitrary choice
thereof.
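As a rough numerical illustration of (2.3) (our own sketch; the scalar AR(1) source and the fixed 8-level quantizer are hypothetical, and the conditional entropy is estimated from empirical first-order frequencies, which only approximates H(q_{m+1}/q^m)):

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 0.1                                   # the weight lambda in (2.3)
levels = np.linspace(-3, 3, 8)              # fixed 8-level quantizer (hypothetical)

n = 200_000                                 # simulate a stable AR(1) source
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.9 * y[t - 1] + rng.normal(scale=0.5)
q_idx = np.abs(y[:, None] - levels[None, :]).argmin(axis=1)

# Empirical conditional entropy H(q_{m+1} | q_m): a first-order proxy only.
counts = np.zeros((len(levels), len(levels)))
np.add.at(counts, (q_idx[:-1], q_idx[1:]), 1)
p_joint = counts / counts.sum()
p_prev = p_joint.sum(axis=1, keepdims=True)
with np.errstate(divide="ignore", invalid="ignore"):
    h_cond = -np.nansum(p_joint * np.log2(p_joint / p_prev))

distortion = np.mean((y - levels[q_idx]) ** 2)
print("entropy:", h_cond, "distortion:", distortion,
      "Lagrangian cost:", h_cond + lam * distortion)
```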
3. Reduction to the control problem. This section derives the completely
observed optimal stochastic control problem equivalent to the optimal vector
quantization problem described above. In this, we follow the usual separation idea of
stochastic control by identifying the regular conditional law of the state given past
observations (in our case, past encodings of the actual observations) as the new state
process for the completely observed control problem. The original cost function is
rewritten in an equivalent form that displays it as a function of the new state and
control processes alone. Under the assumptions of the previous section on the
permissible vector quantization schemes (as reflected in our definition of D), the above
controlled Markov process is shown to have a transition kernel continuous in the initial
state and control. Finally, a relaxation of this control problem is outlined, which
allows for a larger class of controls. This is purely a technical convenience required for
the proofs of the next section and does not affect our control problem in any essential
manner.
Let μ_n(dx) ∈ P(R^s) denote the conditional law of X_n given q^n, n ≥ 0. A standard
application of the Bayes rule shows that {μ_n} is given recursively by the nonlinear
filter

    μ_{n+1}(dx') = [ ∫∫ I{i_{Q_n}(l_{Q_n}(y)) = q_{n+1}} φ(y, x'|x) dy μ_n(dx) ] dx' /
                   [ ∫∫∫ I{i_{Q_n}(l_{Q_n}(y)) = q_{n+1}} φ(y, z|x) dy dz μ_n(dx) ].    (3.1)

By (†), l^{-1}_A(i^{-1}_A(a)) contains an open subset of R^d for any a, A. Given this fact and
the condition that φ(·, ·|·) > 0, it follows that the denominator above is strictly
positive, and hence the ratio is well defined. The initial condition for the recursion
(3.1) is μ_0 = the conditional law of X_0 given q_0. We assume q_0 to be the trivial
quantizer, i.e., q_0 ≡ 0, say, so that μ_0 = the law of X_0. Thus defined, {μ_n} can
be viewed as a P(R^s)-valued controlled Markov process with a D-valued control
process {Q_n}.
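The recursion (3.1) rarely admits a closed form. A standard numerical surrogate (our own sketch, not part of the paper) represents μ_n by weighted particles: reweight by the probability that Y_{n+1} falls in the Voronoi cell of Q_n that produced q_{n+1} (the numerator of (3.1)), normalize (the denominator), and propagate through g. The scalar linear-Gaussian model below is a hypothetical stand-in:

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(2)
a, sig_x, sig_y = 0.8, 0.3, 0.4      # hypothetical scalar linear-Gaussian source

def norm_cdf(z):
    return 0.5 * (1.0 + erf(z / np.sqrt(2.0)))

def filter_step(parts, w, q, A):
    """One Bayes step of (3.1), mu_n -> mu_{n+1}, for mu_n ~ weighted particles.
    A: scalar codepoints of the quantizer Q_n; q: index of the received symbol,
    i.e. which Voronoi cell Y_{n+1} = X_n + noise fell into."""
    A = np.sort(A)
    edges = np.concatenate(([-np.inf], (A[:-1] + A[1:]) / 2.0, [np.inf]))
    lo, hi = edges[q], edges[q + 1]
    like = norm_cdf((hi - parts) / sig_y) - norm_cdf((lo - parts) / sig_y)
    w = w * like                              # numerator of (3.1)
    w = w / w.sum()                           # denominator of (3.1)
    parts = a * parts + sig_x * rng.normal(size=parts.shape)  # push X_n -> X_{n+1}
    return parts, w

parts, w = rng.normal(size=5000), np.full(5000, 1.0 / 5000)
parts, w = filter_step(parts, w, q=2, A=np.array([-1.0, 0.0, 1.0]))
print("posterior mean of X_{n+1}:", np.sum(w * parts))
```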
To complete the description of the control problem, we need to define
our cost (2.3) in terms of {μ_n}, {Q_n}. For this purpose, let φ(y|x) := ∫ φ(y, z|x) dz for
all (x, y) ∈ R^s × R^d. Note that for a ∈ Γ,

    P(q_{n+1} = a / q^n) = E[ E[ I{q_{n+1} = a} / q^n, X_n ] / q^n ]
                         = E[ ∫ p(X_n, R^s, dy) I{q_{n+1} = a} / q^n ]
                         = ∫ μ_n(dx) ∫ φ(y|x) I{i_{ν_n(q^n)}(l_{ν_n(q^n)}(y)) = a} dy
                         := h_a(μ_n, Q_n),

where h_a : P(R^s) × D → R is defined by

    h_a(μ, A) = ∫ μ(dx) f_a(x, A)

with

    f_a(x, A) = ∫ φ(y|x) I{i_A(l_A(y)) = a} dy.

Also define

    f̄(x, A) = ∫ φ(y|x) ||y − l_A(y)||^2 dy,
    k(μ, A) = −Σ_a h_a(μ, A) log h_a(μ, A),
    r(μ, A) = ∫ μ(dx) f̄(x, A),

where the logarithm is to the base 2. We assume f_a(·, A), f̄(·, A) to be Lipschitz
uniformly in a, A. This would be implied in particular by the condition that φ(y|·)
be Lipschitz uniformly in y. Now (2.3) can be rewritten as

    limsup_{n→∞} (1/n) Σ_{m=0}^{n−1} E[ k(μ_m, Q_m) + λ r(μ_m, Q_m) ].    (3.2)
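With μ represented by particles, the quantities h_a, k, and r can be estimated by Monte Carlo; the sketch below (ours, for the same hypothetical Gaussian observation model as above) draws y ∼ φ(·|x) for each particle and tallies cell frequencies and distortion:

```python
import numpy as np

rng = np.random.default_rng(3)
sig_y = 0.4                                  # hypothetical observation noise level

def costs(parts, w, A, n_y=200):
    """Monte Carlo estimates of h_a(mu, A), k(mu, A), r(mu, A) when mu is a
    weighted particle cloud and phi(y|x) is N(x, sig_y^2)."""
    A = np.sort(A)
    y = parts[:, None] + sig_y * rng.normal(size=(len(parts), n_y))  # y ~ phi(.|x)
    cells = np.abs(y[..., None] - A[None, None, :]).argmin(axis=-1)  # l_A(y)
    h = np.array([np.sum(w[:, None] * (cells == i)) / n_y for i in range(len(A))])
    k = -np.sum(h[h > 0] * np.log2(h[h > 0]))               # entropy term
    r = np.sum(w[:, None] * (y - A[cells]) ** 2) / n_y      # distortion term
    return h, k, r

parts = rng.normal(size=4000)
w = np.full(4000, 1.0 / 4000)
print(costs(parts, w, np.array([-1.0, 0.0, 1.0])))
```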
Strictly speaking, we should consider the problem of controlling {μ_n} given by
(3.1) so as to minimize the cost (3.2). We shall, however, introduce some further
simplifications, thereby replacing (3.2) by an approximation of the same. Let 1/N >
δ > 0 be a small positive constant. For n ≥ 1, let P^δ_n denote the simplex of probability
vectors in R^n which have each component bounded from below by δ. That is,

    P^δ_n = { x = [x_1, . . . , x_n] ∈ R^n : x_i ∈ [δ, 1] ∀ i, Σ_i x_i = 1 }.

Similarly, let

    P_n = { x = [x_1, . . . , x_n] ∈ R^n : x_i ∈ [0, 1] ∀ i, Σ_i x_i = 1 }

denote the entire simplex of probability vectors in R^n. Let η_n : P_n → P^δ_n denote the
projection map. Let h(μ, A) = [h_{a_1}(μ, A), . . . , h_{a_m}(μ, A)] for A = {a_1, . . . , a_m} and

    h̃(μ, A) = η_{|A|}(h(μ, A)) := [h̃_{a_1}(μ, A), . . . , h̃_{a_m}(μ, A)].

Note that

    |log h̃_a(μ, A)| ≤ −log δ < ∞  ∀ a, μ, A.    (3.3)

Finally, let

    k̃(μ, A) = −Σ_a h̃_a(μ, A) log h̃_a(μ, A).

The control problem we consider is that of controlling {μ_n} so as to minimize the cost

    limsup_{n→∞} (1/n) Σ_{m=0}^{n−1} E[ k̃(μ_m, Q_m) + λ r(μ_m, Q_m) ].    (3.4)

Replacing k(·, ·) by k̃(·, ·) is a purely technical convenience to suit the needs of the
developments to come in section 4. We believe that it should be possible to obtain
the same results directly for (3.2), though possibly at the expense of a considerable
additional technical overhead.
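The paper does not prescribe a particular projection map η_n; a convenient concrete choice (our assumption) is the Euclidean projection onto the truncated simplex, computable by shifting by δ and applying the standard sort-based simplex projection:

```python
import numpy as np

def project_simplex(v, s=1.0):
    """Euclidean projection of v onto {x >= 0, sum(x) = s} (sort-based method)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - s
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1.0) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def eta(p, delta):
    """Project a probability vector p onto the truncated simplex P^delta_n
    (components >= delta, summing to 1): shift by delta, project the rest."""
    n = len(p)
    assert n * delta < 1.0, "need delta < 1/n"
    return delta + project_simplex(np.asarray(p, float) - delta, s=1.0 - n * delta)

p = np.array([0.96, 0.03, 0.01, 0.0])
print(eta(p, delta=0.05))    # -> [0.85 0.05 0.05 0.05]: entries >= delta, sum 1
```

The clipping is exactly what makes (3.3) hold and keeps the logarithm in k̃ bounded.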
We shall analyze this problem using techniques of Markov decision processes.
With this in mind, call {Q_n} a stationary control policy if Q_n = v(μ_n) for all n
for a measurable v : P(R^s) → D. The map v(·) itself may be referred to as the
stationary control policy by a standard abuse of notation. Let (μ, A) ∈ P(R^s) ×
D → Π(μ, A, dμ') ∈ P(P(R^s)) denote the transition kernel of the controlled Markov
process {μ_n}.
Lemma 3.1. The map Π(·, ·, dμ') is continuous.
Proof. It suffices to check that for f ∈ C_b(P(R^s)), the map ∫ f(y) Π(·, ·, dy) is
continuous. Let (μ_n, A_n) → (μ_∞, A_∞) in P(R^s) × D. Then {μ_n} are tight, and
therefore, for any ε > 0, we can find a compact S_ε ⊂ R^s such that μ_n(S_ε) > 1 − ε
for n = 1, 2, . . . , ∞. Fix ε > 0 and S_ε ⊂ R^s. By the Stone–Weierstrass theorem, any
f ∈ C_b(P(R^s)) can be approximated uniformly on P(S_ε) by f̃ ∈ C_b(P(R^s)) of the form

    f̃(μ) = F( ∫ f_1 dμ, . . . , ∫ f_l dμ )

for some l ≥ 1, f_1, . . . , f_l ∈ C_b(R^s) and F ∈ C_b(R^l). Then

    | ∫ f(y) Π(μ_n, A_n, dy) − ∫ f(y) Π(μ_∞, A_∞, dy) |
        ≤ 4Kε + 2 sup_{ν ∈ P(S_ε)} |f(ν) − f̃(ν)|
          + | ∫ f̃(y) Π(μ_n, A_n, dy) − ∫ f̃(y) Π(μ_∞, A_∞, dy) |.    (3.5)

Let

    Λ_{ai}(μ, A) = ∫∫ f_i(y) I{i_A(l_A(y)) = a} φ(y|x) dy μ(dx)

for a ∈ Γ, 1 ≤ i ≤ l. Direct verification leads to

    ∫ f̃(y) Π(μ, A, dy) = Σ_a h_a(μ, A) F( Λ_{a1}(μ, A)/h_a(μ, A), . . . , Λ_{al}(μ, A)/h_a(μ, A) ).    (3.6)

Note that for all a,

    I{i_{A_n}(l_{A_n}(y)) = a} → I{i_{A_∞}(l_{A_∞}(y)) = a}  almost everywhere (a.e.),

because this convergence fails only on the boundaries of the regions l^{-1}_{A_∞}(b), b ∈ A_∞,
which have zero Lebesgue measure. (These are the so-called Voronoi regions in the vector
quantization literature, viz., sets in the partition generated by the quantizer l_{A_∞}(·).)
Therefore, for all a, j,

    f_j(y) I{i_{A_n}(l_{A_n}(y)) = a} → f_j(y) I{i_{A_∞}(l_{A_∞}(y)) = a}  a.e.

If x_n → x_∞ in R^s, φ(y|x_n) → φ(y|x_∞) for all y. Then by Scheffé's theorem [6, p. 26],

    φ(y|x_n) dy → φ(y|x_∞) dy

in total variation. Hence for any a, j,

    ∫ f_j(y) I{i_{A_n}(l_{A_n}(y)) = a} φ(y|x_n) dy → ∫ f_j(y) I{i_{A_∞}(l_{A_∞}(y)) = a} φ(y|x_∞) dy.

That is, the map

    (x, A) → ∫ f_j(y) I{i_A(l_A(y)) = a} φ(y|x) dy

is continuous. It is clearly bounded. The continuity of Λ_{ai}(·, ·) follows. That of
h_a(·, ·) follows similarly. The continuity of the sum in (3.6) then follows by one more
application of Scheffé's theorem. Thus the last term on the right-hand side (RHS)
of (3.5) tends to zero as n → ∞. Since ε > 0 was arbitrary and the second term on
the RHS of (3.5) can be made arbitrarily small by a suitable choice of f̃, the claim
follows.
We conclude this section with a description of a certain relaxation of this control
problem wherein we permit a larger class of control policies, the so-called wide sense
admissible controls used in [11]. Let (Ω, F, P) denote the underlying probability
space, where, without loss of generality, we may suppose that F = ∨_n F_n for F_n =
σ(X_i, Y_i, ξ_i, ξ'_i, Q_i, i ≤ n), n ≥ 0. Define a new probability measure P_0 on (Ω, F) as
follows. Let ψ_n : Γ^{n+1} × R^d → P(Γ) denote the regular conditional law of q_{n+1}
given (q^n, Y_{n+1}) for n ≥ 0. (Thus we are now allowing for a randomized choice of
Q_n, i.e., Q_n is not necessarily a deterministic function of (q^n, Y_{n+1}).) Let ϕ ∈ P(Γ)
be any fixed probability measure with full support. If, for n ≥ 0, P_n, P_{0n} denote
the restrictions of P, P_0 to (Ω, F_n), respectively, then P_n << P_{0n} with

    dP_n/dP_{0n} = ∏_{m=0}^{n−1} ψ_m(q^m, Y_{m+1})({q_{m+1}}) / ϕ({q_{m+1}}),  n ≥ 1.

Then, under P_0, {q_n} are independent of {X_n, Y_n, ξ_n, ξ'_n} and are i.i.d. with law ϕ.
We say that {Q_n} is a wide sense admissible control if under P_0, (q_{n+1}, q_{n+2}, . . .)
is independent of (q^n, Q^n) for n ≥ 0. Note that this includes {Q_n} of the type
Q_n = ν_n(q^n) for suitable maps {ν_n(·)}.
It should be kept in mind that this allows explicit randomization in the choice
of {Q_n}, whence the entropy rate expression in (3.2) or (3.4) is no longer valid.
Nevertheless, we continue with wide sense admissible controls in the context of (3.1)–
(3.4) because, for us, this is strictly a temporary technical device to facilitate proofs.
The dynamic programming formulation that we shall finally arrive at in section 4 will
permit us to return without any loss of generality to the apparently more restrictive
class of {Q_n} we started out with.
4. The vanishing discount limit. This section derives the dynamic programming
equations for the equivalent separated control problem by extending the traditional
vanishing discount argument to the present setup. Deriving the dynamic
programming equations for the long run average cost control of the separated control
problem has been an outstanding open problem in the general case. We solve it here
by using in a crucial manner the asymptotic flatness assumption introduced earlier. It
should be noted that this assumption was not required at all in the development thus
far and is included purely to facilitate the vanishing discount limit argument that
follows. In particular, it could be dispensed with altogether were we to consider the
finite horizon or infinite horizon discounted cost. For an alternative set of conditions
(also strong) under which the dynamic programming equations for the average cost
control under partial observations have been derived, see [21].
Our first step will be to modify the construction at the end of section 3 so as
to construct on a common probability space two controlled nonlinear filters with a
common control process but differing in their initial condition. This allows us to
compare discounted cost value functions for two different initial laws. In turn, this
allows us to show that their difference, with one of the two initial laws fixed arbitrarily,
remains bounded and equicontinuous with respect to a certain complete metric on the
space of probability measures, as the discount factor approaches unity. (This is where
one uses the condition of asymptotic flatness.) The rest of the derivation mimics the
classical arguments in this field.
For α ∈ (0, 1), consider the discounted control problem of minimizing

    J_α(μ_0, {Q_n}) = E[ Σ_{n=0}^∞ α^n ( k̃(μ_n, Q_n) + λ r(μ_n, Q_n) ) ]    (4.1)

over U := the set of all wide sense admissible controls, with the prescribed μ_0. Define
the associated value function V_α : P(R^s) → R by

    V_α(μ_0) = inf_U J_α(μ_0, {Q_n}).
Standard dynamic programming arguments show that V_α(·) satisfies

    V_α(μ) = min_{A∈D} [ k̃(μ, A) + λ r(μ, A) + α ∫ Π(μ, A, dμ') V_α(μ') ]    (4.2)

for μ ∈ P(R^s). We shall arrive at the dynamic programming equation for our original
problem by taking a vanishing discount limit of a variant of (4.2). For this purpose,
we need to compare V_α(·) for two distinct values of its argument. In order to do so, we
first set up a framework for comparing (4.1) for two choices of μ_0 but with a common
wide sense admissible control {Q_n}. This will be done by modifying the construction
at the end of the preceding section. Let (Ω, F, P_0) be a probability space on which
we have (i) R^s-valued, possibly dependent random variables X̃_0, X̂_0, with laws μ_0, μ̂_0,
respectively; (ii) R^m-valued i.i.d. random processes {ξ_m}, {ξ'_m}, independent of each
other and of [X̃_0, X̂_0], with laws as in (2.1), (2.2); and (iii) Γ-valued i.i.d. random
sequences {q̃_m}, {q̂_m} with law ϕ. Also defined on (Ω, F, P_0) is a D-valued process
{Q_n} independent of ([X̃_0, X̂_0], {ξ_n}, {ξ'_n}, {q̃_n}, {q̂_n}) and satisfying the following. For
n ≥ 0, (q̃_{n+1}, q̂_{n+1}, q̃_{n+2}, q̂_{n+2}, . . .) is independent of (Q^n, q̃^n, q̂^n). Let (X̃_n, Ỹ_n), (X̂_n, Ŷ_n) be solutions
to (2.1), (2.2) with X̃_0, X̂_0 as above. Without loss of generality, we may suppose that
F = ∨_n F_n with F_n = σ(X̃_i, X̂_i, Ỹ_i, Ŷ_i, q̃_i, q̂_i, Q_i, i ≤ n), n ≥ 0. Define a new probability
measure P on (Ω, F) as follows. If P_n, P_{0n} denote the restrictions of P, P_0, respectively,
to (Ω, F_n), n ≥ 0, then P_n << P_{0n} with

    dP_n/dP_{0n} = ∏_{m=0}^{n−1} [ ψ_m(q̃^m, Ỹ_{m+1})({q̃_{m+1}}) ψ̂_m(q̂^m, Ŷ_{m+1})({q̂_{m+1}}) ] / [ ϕ({q̃_{m+1}}) ϕ({q̂_{m+1}}) ],

where ψ_m (respectively, ψ̂_m) is the regular conditional law of q̃_{m+1} = i_{Q_m}(l_{Q_m}(Ỹ_{m+1}))
given (q̃^m, Ỹ_{m+1}) (respectively, of q̂_{m+1} = i_{Q_m}(l_{Q_m}(Ŷ_{m+1})) given (q̂^m, Ŷ_{m+1})) for m ≥ 0.
What this construction achieves is the identification of each wide sense admissible
control {Q_n} for initial law μ_0 with one wide sense admissible control for μ̂_0. (This
identification can be many-one.) By a symmetric argument that interchanges the
roles of μ_0 and μ̂_0, we can identify each wide sense admissible control for μ̂_0 with
one for μ_0. Now suppose that V_α(μ_0) ≥ V_α(μ̂_0). Then for a wide sense admissible
control {Q_n} that is optimal for μ̂_0 (existence of this follows by standard dynamic
programming arguments), we have

    |V_α(μ_0) − V_α(μ̂_0)| = V_α(μ_0) − V_α(μ̂_0)
                          ≤ J_α(μ_0, {Q_n}) − J_α(μ̂_0, {Q_n})
                          ≤ sup_U |J_α(μ_0, {Q_n}) − J_α(μ̂_0, {Q_n})|,

where we use the above identification. If V_α(μ_0) ≤ V_α(μ̂_0), a symmetric argument
applies. Thus we have proved the following lemma.
Lemma 4.1.

    |V_α(μ_0) − V_α(μ̂_0)| ≤ sup_U |J_α(μ_0, {Q_n}) − J_α(μ̂_0, {Q_n})|.
Next, let P_1(R^s) = { μ ∈ P(R^s) : ∫ ||x|| μ(dx) < ∞ }, topologized by the (complete)
Wasserstein metric [20]

    ρ(μ_1, μ_2) = inf E[||X − Y||],

where the infimum is over all joint laws of (X, Y) such that the law of X (respectively,
Y) is μ_1 (respectively, μ_2). We shall assume from now on that μ_0 ∈ P_1(R^s). Given
the linear growth condition on g(·, y), h(·, y) of (2.1), (2.2), uniformly in y, it is then
easily deduced that E[||X_n||] < ∞ for all n and therefore μ_n ∈ P_1(R^s) almost surely
(a.s.) for all n. Thus we may and do view {μ_n} as a P_1(R^s)-valued process. We then
have the following lemma.
Lemma 4.2. For μ_0, μ̂_0 ∈ P_1(R^s) and α ∈ (0, 1), |V_α(μ_0) − V_α(μ̂_0)| ≤ K ρ(μ_0, μ̂_0).
Proof. Let {μ_n}, {μ̂_n} be solutions to (3.1) with initial conditions μ_0, μ̂_0, respectively,
and a common wide sense admissible control {Q_n} ∈ U. Then for
{X̃_n}, {X̂_n} as above (with K denoting a generic positive constant that may change
from step to step),

    |E[r(μ_n, Q_n)] − E[r(μ̂_n, Q_n)]|
        = |E[f̄(X̃_n, Q_n)] − E[f̄(X̂_n, Q_n)]|
        ≤ E[|f̄(X̃_n, Q_n) − f̄(X̂_n, Q_n)|]
        ≤ K E[||X̃_n − X̂_n||]            (by the Lipschitz condition on f̄)
        ≤ K β^n E[||X̃_0 − X̂_0||]        (by asymptotic flatness).

Now consider |E[k̃(μ_n, Q_n)] − E[k̃(μ̂_n, Q_n)]|.
Suppose that E[k̃(μ_n, Q_n)] ≥ E[k̃(μ̂_n, Q_n)]. Then

    |E[k̃(μ_n, Q_n)] − E[k̃(μ̂_n, Q_n)]|
        = E[k̃(μ_n, Q_n)] − E[k̃(μ̂_n, Q_n)]
        = E[ −Σ_a h̃_a(μ_n, Q_n) log h̃_a(μ_n, Q_n) ] − E[ −Σ_a h̃_a(μ̂_n, Q_n) log h̃_a(μ̂_n, Q_n) ]
        = E[ Σ_a ( (h̃_a(μ̂_n, Q_n) − h̃_a(μ_n, Q_n)) log h̃_a(μ̂_n, Q_n)
              + h̃_a(μ_n, Q_n)( log h̃_a(μ̂_n, Q_n) − log h̃_a(μ_n, Q_n) ) ) ]
        ≤ E[ Σ_a ( h̃_a(μ̂_n, Q_n) − h̃_a(μ_n, Q_n) ) log h̃_a(μ̂_n, Q_n) ]    (by Jensen's inequality)
        ≤ K E[ Σ_a | f_a(X̂_n, Q_n) − f_a(X̃_n, Q_n) | ]
        ≤ K E[||X̃_n − X̂_n||]
        ≤ K β^n E[||X̃_0 − X̂_0||],

where we use (3.3) to arrive at the second-to-last inequality. A symmetric argument
works if E[k̃(μ_n, Q_n)] ≤ E[k̃(μ̂_n, Q_n)], leading to the same conclusion. Combining
everything, we have

    |E[k̃(μ_n, Q_n) + λ r(μ_n, Q_n)] − E[k̃(μ̂_n, Q_n) + λ r(μ̂_n, Q_n)]| ≤ K β^n E[||X̃_0 − X̂_0||].

Therefore, by Lemma 4.1,

    |V_α(μ_0) − V_α(μ̂_0)| ≤ K Σ_{n≥0} α^n β^n E[||X̃_0 − X̂_0||] ≤ (K / (1 − β)) E[||X̃_0 − X̂_0||].

For any ε > 0, we can render

    E[||X̃_0 − X̂_0||] ≤ ρ(μ_0, μ̂_0) + ε

by suitably choosing the joint law of (X̃_0, X̂_0). Since ε > 0 is arbitrary, the claim
follows.
Fix μ* ∈ P_1(R^s) and define V̄_α(μ) = V_α(μ) − V_α(μ*) for μ ∈ P_1(R^s), α ∈ (0, 1). By
the above lemma, V̄_α(·) is bounded equicontinuous. Letting α → 1, we use the Arzelà–Ascoli
theorem to conclude that V̄_α(·) converges in C(P_1(R^s)) to some V(·) along a
subsequence {α(n)}, α(n) → 1. By dropping to a further subsequence if necessary,
we may also suppose that {(1 − α(n)) V_{α(n)}(μ*)}, which is clearly bounded, converges
to some ϱ ∈ R as n → ∞. These V(·), ϱ will turn out to be, respectively, the value
function and optimal cost for our original control problem.
Our main result is the following theorem.
Theorem 4.3.
(i) (V(·), ϱ) solve the dynamic programming equation

    V(μ) = min_{u∈D} [ k̃(μ, u) + λ r(μ, u) + ∫ Π(μ, u, dμ') V(μ') ] − ϱ.    (4.3)

(ii) ϱ is the optimal cost, independent of the initial condition. Furthermore, a
stationary policy v(·) is optimal for any initial condition if

    v(μ) ∈ Argmin_{u∈D} [ k̃(μ, u) + λ r(μ, u) + ∫ Π(μ, u, dμ') V(μ') ].

In particular, an optimal stationary policy exists.
(iii) If v(·) is an optimal stationary policy and π is a corresponding ergodic
probability measure for {μ_n}, then

    V(μ) = k̃(μ, v(μ)) + λ r(μ, v(μ)) + ∫ Π(μ, v(μ), dμ') V(μ') − ϱ,  π-a.s.
Proof. For (i) rewrite (4.2) as

    V̄_α(μ) + (1 − α) V_α(μ*) = min_{u∈D} [ k̃(μ, u) + λ r(μ, u) + α ∫ Π(μ, u, dμ') V̄_α(μ') ].

Let α → 1 along {α(n)} to obtain (4.3).
For (ii) note that the first two statements follow by a standard argument which
may be found, e.g., in [15, Theorem 5.2.4, pp. 80–81]. The last claim follows from a
standard measurable selection theorem; see, e.g., [22].
For (iii) note that the claim holds if "=" is replaced by "≤". If the claim were false,
we could integrate both sides with respect to π to obtain

    ϱ < ∫ ( k̃(μ, v(μ)) + λ r(μ, v(μ)) ) π(dμ).

The RHS is the cost under v(·), whereby this inequality contradicts the optimality of
v(·). The claim follows.
This result opens up the possibility of exploiting the computational machinery of
Markov decision theory (see, e.g., [2], [18], [21]) for code design.
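To indicate how this machinery could be deployed, here is a toy sketch (entirely our own construction): discounted value iteration for (4.2) on a finite grid of beliefs, with stand-in arrays replacing Π and k̃ + λr; printing (1 − α)V_α(μ*) for α ↑ 1 illustrates the vanishing discount limit that produced ϱ above. In practice one would use the approximate dynamic programming methods of [2], [18].

```python
import numpy as np

rng = np.random.default_rng(4)
n_grid, n_act = 50, 4       # hypothetical: beliefs discretized to 50 grid points
P = rng.dirichlet(np.ones(n_grid), size=(n_act, n_grid))  # stand-in for Pi(mu,A,.)
c = rng.uniform(size=(n_act, n_grid))                     # stand-in for k~ + lam*r

def discounted_vi(alpha, tol=1e-9):
    """Value iteration for (4.2): V = min_A [ c + alpha * P V ]."""
    V = np.zeros(n_grid)
    while True:
        Q = c + alpha * P @ V                 # shape (n_act, n_grid)
        V_new = Q.min(axis=0)
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmin(axis=0)    # value and greedy stationary policy
        V = V_new

ref = 0                                       # index of the reference belief mu*
for alpha in (0.9, 0.99, 0.999):
    V, policy = discounted_vi(alpha)
    print(alpha, (1 - alpha) * V[ref])        # tends to the average cost rho
```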
Finally, we briefly consider the decoder's problem. If transmission is error free,
the decoder can construct {μ_n} recursively given {q_n} and the stationary policy v(·).
Then {X_n}, {Y_n} may be estimated by the maximum a posteriori (MAP) estimates:

    X̂_n = argmax μ_n(·),
    Ŷ_n = argmax_y [ ∫∫ I{i_{Q_{n−1}}(l_{Q_{n−1}}(y)) = q_n} φ(y, z|x) dz μ_{n−1}(dx) ].
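With the particle representation of μ_n used earlier, the MAP state estimate can be approximated by the mode of a weighted histogram (a crude density surrogate; our own illustration, whereas the display above presumes the exact filter):

```python
import numpy as np

def map_estimate(parts, w, bins=60):
    """Approximate argmax of mu_n from a weighted particle cloud via the mode
    of a weighted histogram."""
    hist, edges = np.histogram(parts, bins=bins, weights=w)
    j = int(hist.argmax())
    return 0.5 * (edges[j] + edges[j + 1])

rng = np.random.default_rng(5)
parts = np.concatenate([rng.normal(-2, 0.3, 3000), rng.normal(1, 1.0, 2000)])
w = np.full(parts.size, 1.0 / parts.size)
print(map_estimate(parts, w))   # close to -2, the taller mode of the mixture
```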
Suppose the decoder receives {q_n} through a noisy but memoryless channel with input
alphabet Γ and output alphabet another finite set O, with transition probabilities
p(i, j), i ∈ Γ, j ∈ O. Thus p(i, j) ≥ 0 and Σ_l p(i, l) = 1 for all i. Let d_n be the channel
output at time n.
The decoder can estimate (X_n, Y_n) given d^n, n ≥ 0, but this is no longer easy
because she cannot reconstruct {Q_n} exactly in the absence of knowledge of {μ_n}, {q_n}.
Thus she should estimate {q_n} by {q̂_n}, say (e.g., by maximum likelihood), given {d_n},
and use these estimates in place of {q_n} in the nonlinear filter for {μ_n}, giving an
approximation {μ̂_n} to {μ_n}. The guess for Q_n then is v(μ̂_n), n ≥ 0.
5. Conclusions and extensions. In this paper we have considered the problem
of optimal sequential vector quantization of a stationary Markov source. We have
formulated the problem as a stochastic control problem and analyzed it using the
methodology of Markov decision theory. Further, we have shown that the conditional law
of the source given the quantized past is a sufficient statistic for the problem. Thus
the optimal encoding scheme has a separated structure. The conditional laws are
given recursively by the nonlinear filter described in (3.1). The optimal policy is
characterized by Theorem 4.3.
The next step is to apply traditional Markov decision problem approximation
techniques to compute approximate schemes. If we have access to training data,
then we can use the tools of reinforcement learning. Here the idea is to parametrize
the value function or the control law itself and apply stochastic approximation
techniques to optimize those parameters, as sketched below.
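A minimal instance of this idea (ours, not the paper's): an average-cost TD(0)-style stochastic approximation update for a linearly parametrized value function over hypothetical belief features, assuming simulated filter transitions and a stand-in per-step cost are available as training data:

```python
import numpy as np

def features(parts):
    """Hypothetical features of a belief: constant, mean, variance."""
    return np.array([1.0, parts.mean(), parts.var()])

def td0_update(theta, rho, parts, cost, parts_next, step=0.01, beta=0.01):
    """One average-cost TD(0) step for V(mu) ~ theta . features(mu):
    delta = cost - rho + V(mu') - V(mu); move theta along delta * grad V."""
    phi, phi2 = features(parts), features(parts_next)
    delta = cost - rho + theta @ phi2 - theta @ phi
    return theta + step * delta * phi, rho + beta * (cost - rho)

rng = np.random.default_rng(6)
theta, rho = np.zeros(3), 0.0
parts = rng.normal(size=500)
for _ in range(1000):
    parts_next = 0.8 * parts + 0.3 * rng.normal(size=500)  # simulated transition
    theta, rho = td0_update(theta, rho, parts, parts.var(), parts_next)  # stand-in cost
    parts = parts_next
print(theta, rho)               # rho tracks the running average of the cost
```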
In general, the nonlinear filter recursion is very complicated. In the literature it has
often been approximated by a linear prediction of the mean. Such linear predictive
methods can be considered an approximation to the general nonlinear filter.
REFERENCES

[1] G. Basak and R. N. Bhattacharya, Stability in distribution for a class of singular diffusions, Ann. Probab., 20 (1992), pp. 312–321.
[2] D. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, MA, 1996.
[3] A. Bist, Differential state quantization of high order Gauss-Markov processes, in Proceedings of the IEEE Data Compression Conference, Snowbird, UT, 1994, pp. 62–71.
[4] V. S. Borkar, Optimal Control of Diffusion Processes, Pitman Lecture Notes in Math. 203, Longman Scientific and Technical, Harlow, UK, 1989.
[5] V. S. Borkar, Topics in Controlled Markov Chains, Pitman Lecture Notes in Math. 240, Longman Scientific and Technical, Harlow, UK, 1991.
[6] V. S. Borkar, Probability Theory: An Advanced Course, Springer-Verlag, New York, 1995.
[7] P. Chou, T. Lookabaugh, and R. Gray, Entropy-constrained vector quantization, IEEE Trans. Acoust. Speech Signal Process., 37 (1989), pp. 31–42.
[8] P. Chou and T. Lookabaugh, Conditional entropy-constrained vector quantization of linear predictive coefficients, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 1, Albuquerque, NM, 1990, pp. 197–200.
[9] T. Cover and J. Thomas, Elements of Information Theory, John Wiley, New York, 1991.
[10] J. G. Dunham, An iterative theory for code design, in Proceedings of the IEEE International Symposium on Information Theory, St. Jovite, QC, Canada, 1983, pp. 88–90.
[11] W. Fleming and E. Pardoux, Optimal control for partially observed diffusions, SIAM J. Control Optim., 20 (1982), pp. 261–285.
[12] N. T. Gaarder and D. Slepian, On optimal finite-state digital transmission systems, IEEE Trans. Inform. Theory, 28 (1982), pp. 167–186.
[13] R. G. Gallager, Information Theory and Reliable Communication, John Wiley, New York, 1968.
[14] P. Hall and C. C. Heyde, Martingale Limit Theory and Its Applications, Academic Press, New York, London, 1980.
[15] O. Hernandez-Lerma and J. B. Lasserre, Discrete-Time Markov Control Processes, Springer-Verlag, New York, 1996.
[16] J. C. Kieffer, Stochastic stability of feedback quantization schemes, IEEE Trans. Inform. Theory, 28 (1982), pp. 248–254.
[17] E. Levine, Stochastic vector quantization and stochastic VQ with state feedback using neural networks, in Proceedings of the IEEE Data Compression Conference, Snowbird, UT, 1996, pp. 330–339.
[18] S. P. Meyn, Algorithms for optimization and stabilization of controlled Markov chains, in SADHANA: Indian Academy of Sciences Proceedings in Engineering Sciences 24, Bangalore, 1999, pp. 339–368.
[19] D. Neuhoff and R. K. Gilbert, Causal source codes, IEEE Trans. Inform. Theory, 28 (1982), pp. 701–713.
[20] S. T. Rachev, Probability Metrics and the Stability of Stochastic Models, John Wiley, Chichester, UK, 1991.
[21] W. J. Runggaldier and L. Stettner, Approximations of Discrete Time Partially Observed Control Problems, Applied Maths. Monographs 6, Giardini Editori e Stampatori, Pisa, Italy, 1994.
[22] D. H. Wagner, Survey of measurable selection theorems, SIAM J. Control Optim., 15 (1977), pp. 859–903.
[23] J. Walrand and P. P. Varaiya, Optimal causal coding-decoding problems, IEEE Trans. Inform. Theory, 29 (1983), pp. 814–820.
[24] H. Witsenhausen, On the structure of real-time source coders, The Bell System Technical Journal, 58 (1979), pp. 1437–1451.
