
Rate Distortion Theory for Causal Video Coding:


Characterization, Computation Algorithm,
and Comparison
En-Hui Yang, Fellow, IEEE, Lin Zheng, Da-Ke He, Member, IEEE, and Zhen Zhang, Fellow, IEEE

Abstract: Causal video coding is considered from an information theoretic point of view, where video source frames $X_1, X_2, \ldots, X_N$ are encoded in a frame-by-frame manner, the encoder for each frame $X_k$ can use all previous frames and all previous encoded frames while the corresponding decoder can use only all previous encoded frames, and each frame $X_k$ itself is modeled as a source $X_k = \{X_k(i)\}_{i=1}^{\infty}$. A novel computation approach is proposed to analytically characterize, numerically compute, and compare the minimum total rate of causal video coding $R_c^*(D_1, \ldots, D_N)$ required to achieve a given distortion (quality) level $D_1, \ldots, D_N > 0$. Among many other things, the computation approach includes an iterative algorithm with global convergence for computing $R_c^*(D_1, \ldots, D_N)$. The global convergence of the algorithm further enables us to demonstrate a somewhat surprising result (dubbed the more and less coding theorem): under some conditions on source frames and distortion, the more frames that need to be encoded and transmitted, the less data after encoding has to be actually sent. With the help of the algorithm, it is also shown by example that $R_c^*(D_1, \ldots, D_N)$ is in general much smaller than the total rate offered by the traditional greedy coding method. As a by-product, an extended Markov lemma is established for correlated ergodic sources.


Index Terms: Causal video coding, extended Markov lemma, iterative algorithm, multi-user information theory, predictive video coding, rate distortion characterization and computation, rate distortion theory, stationary ergodic sources.

I. INTRODUCTION

Consider a causal video coding model shown in Fig. 1, where $X_k$, $1 \le k \le N$, represents a video frame, and $S_k$ and $\hat{X}_k$ represent respectively its encoded frame and reconstructed frame. All frames $X_k$, $1 \le k \le N$, are encoded in a frame-by-frame manner, and the encoder for $X_k$ can use all

Manuscript received March 31, 2010; revised December 23, 2010; accepted
March 04, 2011. Date of current version July 29, 2011. This work was supported
in part by the Natural Sciences and Engineering Research Council of Canada
under Grant RGPIN203035-06 and Strategic Grant STPGP397345, and by the
Canada Research Chairs Program.
E. Yang and L. Zheng are with the Department of Electrical and Computer
Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada (e-mail:
ehyang@uwaterloo.ca; l9zheng@uwaterloo.ca).
D.-K. He is with Research in Motion/SlipStream, Waterloo, ON N2L 5Z5,
Canada (e-mail: dhe@rim.com).
Z. Zhang is with the Department of Electrical Engineering-Systems, University of Southern California, Los Angeles, CA 90095-1594 USA (e-mail:
zhzhang@usc.edu).
Communicated by E. Ordentlich, Associate Editor for Source Coding.
Digital Object Identifier 10.1109/TIT.2011.2159043

Fig. 1. Causal video coding model.

previous frames $X_1, \ldots, X_{k-1}$ and all previous encoded frames $S_1, \ldots, S_{k-1}$, while the corresponding decoder can use only all previous encoded frames. The model is causal because the encoder for $X_k$ is not allowed to access future frames in the encoding order. In the special case where the encoder for each $X_k$ is further restricted to enlist help only from all previous encoded frames $S_1, \ldots, S_{k-1}$, causal video coding reduces to predictive video coding.
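In the notation made precise in the formal definitions below (encoders $f_k$, encoded frames $S_k$, source blocks $X_k^n$), the two paradigms differ only in the inputs available to each encoder; the following schematic is our paraphrase of the description above:

$$\text{causal video coding:}\quad S_k = f_k(S_1, \ldots, S_{k-1}, X_1^n, \ldots, X_k^n); \qquad \text{predictive video coding:}\quad S_k = f_k(S_1, \ldots, S_{k-1}, X_k^n).$$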
All MPEG-series and H-series video coding standards [13], [19] proposed so far fall into the above causal video coding model (strictly speaking, into the predictive video coding model); the differences among these video coding standards lie in how the information available to the encoder of each frame $X_k$ is used to generate $S_k$. The causal coding model is the same as the sequential coding model of correlated sources proposed in [15] when $N = 2$, and is also called the C-C model in [10], [11], and [12]. However, when $N \ge 3$, which is a typical case in MPEG-series and H-series video coding, the causal coding model considered here is quite different from sequential coding.^1 In the special case where all frames are identical, which rarely happens in practical video coding, the causal video coding model reduces to the successive refinement setting considered in [8]. Notwithstanding, when frames are not identical, causal video coding is drastically different from successive refinement even though the decoding structure looks similar in both cases. Partial results of this paper were presented without proof in [23] and [22].
It is expected that a future video coding standard will continue to fall into the causal video coding model shown in Fig. 1. To

^1 The name of sequential coding was used in [15] to refer to a special video coding paradigm where the encoder for frame $X_k$, $k > 1$, can use only the previous frame $X_{k-1}$ and reconstructed frame $\hat{X}_{k-1}$ as a helper, and the corresponding decoder uses only the previous encoded frame as a helper.


provide some design guidance for a future video coding standard, in this paper we aim at investigating from an information theoretic point of view how each frame in the causal model should be encoded so that collectively the total rate is minimized subject to a given distortion (quality) level $D_1, \ldots, D_N$.

We model each frame $X_k$ itself as a source $X_k = \{X_k(i)\}_{i=1}^{\infty}$ taking values in a finite alphabet $\mathcal{X}_k$. Together, the $N$ frames then form a vector source $\{(X_1(i), X_2(i), \ldots, X_N(i))\}_{i=1}^{\infty}$ taking values in the product alphabet $\mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_N$. The sources $X_1, X_2, \ldots, X_N$ are said to be (first-order) Markov if for any $1 < k \le N$, $X_k$ is the output of a memoryless channel in response to the input $X_{k-1}$; in this case, we say $X_1 \to X_2 \to \cdots \to X_N$ forms a Markov chain. Let $\hat{X}_k$ denote the reconstruction of $X_k$, drawn from a finite reproduction alphabet $\hat{\mathcal{X}}_k$. The distortion between $X_k$ and $\hat{X}_k$ is measured by a single-letter distortion measure $d_k: \mathcal{X}_k \times \hat{\mathcal{X}}_k \to [0, \infty)$. Without loss of generality, we shall assume that

$\min_{\hat{x} \in \hat{\mathcal{X}}_k} d_k(x, \hat{x}) = 0$ for any $x \in \mathcal{X}_k$. For convenience, we write $d_k(x, \hat{x})$ simply as $d(x, \hat{x})$ for any $k$, $x$, and $\hat{x}$. For any $n$-dimensional vector $x_k^n = (x_k(1), x_k(2), \ldots, x_k(n))$, we denote $(x_k(1), \ldots, x_k(i))$ by $x_k^i$ for any $1 \le i \le n$; as such, by $x_k^n \in \mathcal{X}_k^n$ we shall mean that $x_k(i) \in \mathcal{X}_k$ for all $1 \le i \le n$. A similar convention will apply to reconstruction sequences and other vectors.

Formally, we define an order-$n$ causal video code by using $N$ encoder and decoder pairs as follows:^2

1) For $k = 1$, an encoder of order $n$ is defined by a function $f_1$ from $\mathcal{X}_1^n$ to $\{0,1\}^*$, the set of all binary sequences of finite length, satisfying the property that the range of $f_1$ is a prefix set, and a decoder of order $n$ is defined by a function $g_1$ mapping the range of $f_1$ into $\hat{\mathcal{X}}_1^n$. The encoded and reconstructed sequences of $X_1^n$ are given respectively by $S_1 = f_1(X_1^n)$ and $\hat{X}_1^n = g_1(S_1)$.

2) For $1 < k \le N$, an encoder of order $n$ is defined by a function $f_k$ mapping the previous encoded frames $S_1, \ldots, S_{k-1}$ together with $X_1^n, \ldots, X_k^n$ into $\{0,1\}^*$, satisfying the property that the range of $f_k$ given any $S_1, \ldots, S_{k-1}$ is a prefix set, and a decoder of order $n$ is defined by a function $g_k$ mapping $S_1, \ldots, S_k$ into $\hat{\mathcal{X}}_k^n$. The encoded and reconstructed sequences of $X_k^n$ are given respectively by $S_k = f_k(S_1, \ldots, S_{k-1}, X_1^n, \ldots, X_k^n)$ and $\hat{X}_k^n = g_k(S_1, \ldots, S_k)$.

For $1 \le k \le N$, the distortion between $X_k^n$ and $\hat{X}_k^n$ is given by

$$d(X_k^n, \hat{X}_k^n) = \sum_{i=1}^{n} d(X_k(i), \hat{X}_k(i));$$

the corresponding average distortion per symbol is then equal to $\frac{1}{n} E\, d(X_k^n, \hat{X}_k^n)$, and the average rate in bits per symbol of the $k$th encoder is $r_k = \frac{1}{n} E |S_k|$, where $|s|$ denotes the length of the binary sequence $s$. The performance of the order-$n$ causal video code is then measured by the rate distortion pairs $\big(r_k, \frac{1}{n} E\, d(X_k^n, \hat{X}_k^n)\big)$, $k = 1, 2, \ldots, N$.

Definition 1: Let $(R_1, \ldots, R_N)$ be a rate vector and $(D_1, \ldots, D_N)$ a distortion vector. The rate distortion pair vector $(R_1, D_1, \ldots, R_N, D_N)$ is said to be achievable by causal video coding if, for any $\epsilon > 0$, there exists an order-$n$ causal video code for all sufficiently large $n$ such that

$$r_k \le R_k + \epsilon \quad \text{and} \quad \frac{1}{n} E\, d(X_k^n, \hat{X}_k^n) \le D_k + \epsilon \quad \text{for } k = 1, 2, \ldots, N. \tag{1.1}$$

Let $\mathcal{R}_c$ denote the set of all rate distortion pair vectors achievable by causal video coding.

From the above definition, it follows that $\mathcal{R}_c$ is a closed set in the $2N$-dimensional Euclidean space. As in the usual video compression applications, we are interested in the minimum total rate $R_c^*(D_1, \ldots, D_N)$ required to achieve the distortion level $(D_1, \ldots, D_N)$, which is defined by

$$R_c^*(D_1, \ldots, D_N) = \min \Big\{ \sum_{k=1}^{N} R_k : (R_1, D_1, \ldots, R_N, D_N) \in \mathcal{R}_c \Big\}.$$

^2 It is worthwhile to point out that, as far as causal video coding alone is concerned, there is no need to explicitly list the previous encoded frames $S_1, \ldots, S_{k-1}$ as inputs to the encoder for the current frame $X_k$, in both the causal video coding diagram shown in Fig. 1 and the formal definition of a causal video code given here, and all results and their respective derivations presented in the paper remain the same. The reason for us to explicitly list $S_1, \ldots, S_{k-1}$ as inputs to the encoder for the current frame $X_k$ is two-fold: (1) it makes the subsequent information quantities more transparent and intuitive, since connecting those information quantities to the diagram with $S_1, \ldots, S_{k-1}$ linked to the respective encoder is easier than to the diagram without them; and (2) more importantly, it gives us a simple, unified way to describe predictive video coding in the context of causal video coding and to contrast the two coding paradigms in our forthcoming work on the information theoretic performance comparison of predictive video coding and causal video coding.

One of our purposes in this paper is to numerically compute, analytically characterize, and compare $R_c^*(D_1, \ldots, D_N)$ so that deep insights can be gained regarding how each frame should be encoded in order to achieve the minimum total rate.

Our approach is computation oriented. Starting with a jointly stationary and totally ergodic vector source^3 $(X_1, X_2, \ldots, X_N)$, we first show in Section II that

^3 A vector source $(X_1, X_2, \ldots, X_N) = \{(X_1(i), X_2(i), \ldots, X_N(i))\}_{i=1}^{\infty}$ is said to be jointly stationary and totally ergodic if, as a single process over the alphabet $\mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_N$, $\{(X_1(i), X_2(i), \ldots, X_N(i))\}_{i=1}^{\infty}$ is stationary and totally ergodic.

$R_c^*(D_1, \ldots, D_N)$ is equal to the infimum of the normalized $n$th order total rate distortion function $R_{c,n}(D_1, \ldots, D_N)/n$ over all $n$, where $R_{c,n}(D_1, \ldots, D_N)$ itself is given by the minimum of an information quantity over a set of auxiliary random variables. We then develop an iterative algorithm in Section III to calculate $R_{c,n}(D_1, \ldots, D_N)$, and further show that this algorithm converges to an optimal solution that achieves $R_{c,n}(D_1, \ldots, D_N)$. The global convergence of the algorithm enables us to establish a single-letter characterization of $R_c^*(D_1, \ldots, D_N)$ in Section IV in the case where the vector source $(X_1, X_2, \ldots, X_N)$ is independent and identically distributed (IID)^4, by comparing $R_{c,n}(D_1, \ldots, D_N)$ with $R_{c,1}(D_1, \ldots, D_N)$ through a novel application of the algorithm. With the help of the algorithm, we further demonstrate in Section V a somewhat surprising result dubbed the more and less coding theorem: under some conditions on source frames and distortion, the more frames that need to be encoded and transmitted, the less data after encoding has to be actually sent. The algorithm also gives an optimal solution for allocating bits to different frames. It is shown in Section VI that $R_c^*(D_1, \ldots, D_N)$ is in general much smaller than the total rate offered by the traditional greedy coding method, by which each frame is encoded in a locally optimal manner based on all information available to the encoder of that frame.
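Stated compactly in the notation just introduced (this is a restatement of the characterizations established in Theorems 2 and 4 below, placed here for orientation):

$$R_c^*(D_1, \ldots, D_N) = \inf_{n \ge 1} \frac{R_{c,n}(D_1, \ldots, D_N)}{n} \quad \text{(jointly stationary, totally ergodic sources)},$$

$$R_c^*(D_1, \ldots, D_N) = R_{c,1}(D_1, \ldots, D_N) \quad \text{(IID vector sources)}.$$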
II. ACHIEVABLE REGION AND MINIMUM TOTAL RATE: TOTALLY ERGODIC CASE

Suppose now that $(X_1, X_2, \ldots, X_N)$ is jointly stationary and totally ergodic across samples (pixels). Define $\mathcal{R}_c^{(n)}$ to be the region consisting of all rate distortion pair vectors $(R_1, D_1, \ldots, R_N, D_N)$ for which there exist auxiliary random variables $U_1, U_2, \ldots, U_{N-1}$ and $\hat{X}_N^n$ such that the rate constraints (2.1) and the distortion constraints (2.2) hold and the following requirements^5 are satisfied:
(R1) ... for some deterministic function ...;
(R2) ... for some deterministic function ...;
(R3) for any ..., ...;
(R4) the Markov chain conditions ... are met.

In (2.1) and throughout the rest of the paper, the notation $I$ stands for mutual information or conditional mutual information (as the case may be) measured in bits, and the notation $H$ stands for entropy or conditional entropy (as the case may be) measured in bits. Although there is no restriction on the size of the alphabet of each $U_k$ in (2.1), one can show, by using the standard cardinality bound argument based on the Caratheodory theorem (see, for example, Appendix A of [15]), that the alphabet size of each $U_k$ in (2.1) can be bounded. Let $\mathcal{R}_c^{(*)} = \bigcup_{n \ge 1} \mathcal{R}_c^{(n)}$. Denote its convex hull closure by $\mathrm{co}(\mathcal{R}_c^{(*)})$. Then we have the following result.

Theorem 1: For jointly stationary and totally ergodic sources $(X_1, X_2, \ldots, X_N)$, $\mathcal{R}_c = \mathrm{co}(\mathcal{R}_c^{(*)})$.
The positive part of Theorem 1 (i.e., $\mathcal{R}_c \supseteq \mathrm{co}(\mathcal{R}_c^{(*)})$) will be proved in Appendix B by adopting a random coding argument similar to that for IID vector sources. Here we present the proof of the converse part (i.e., $\mathcal{R}_c \subseteq \mathrm{co}(\mathcal{R}_c^{(*)})$).

Proof of the converse part of Theorem 1: Pick any achievable rate distortion pair vector $(R_1, D_1, \ldots, R_N, D_N)$. It follows from Definition 1 that for any $\epsilon > 0$, there exists an order-$n$ causal video code for all sufficiently large $n$ such that (1.1) holds. Let $S_k$ and $\hat{X}_k^n$ be the respective encoded frame of and reconstructed frame for $X_k^n$ given by the code, and take the encoded frames to define the auxiliary random variables $U_1, \ldots, U_{N-1}$. It is easy to see that the Markov conditions involving $U_1, \ldots, U_{N-1}$ are satisfied. However, since $\hat{X}_N^n$ depends in general on the source frames in addition to $U_1, \ldots, U_{N-1}$, the random variables so constructed do not necessarily form a Markov chain in the indicated order. To overcome this problem, let $p$ denote the conditional probability distribution of $\hat{X}_N^n$ given the relevant conditioning variables. Define a new random variable which is the output of the channel $p$ in response to those inputs. Then it is easy to see that the new random variable and $\hat{X}_N^n$ have the same distribution, and the required Markov chain condition is satisfied. This, together with (1.1), implies the following distortion upper bounds:

^4 A vector source $(X_1, X_2, \ldots, X_N) = \{(X_1(i), X_2(i), \ldots, X_N(i))\}_{i=1}^{\infty}$ is said to be IID if, as a single process over the alphabet $\mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_N$, $\{(X_1(i), X_2(i), \ldots, X_N(i))\}_{i=1}^{\infty}$ is IID. Note that the common joint distribution of each sample $(X_1(i), X_2(i), \ldots, X_N(i))$, $i \ge 1$, can be arbitrary even when the vector source $(X_1, X_2, \ldots, X_N)$ is IID.

^5 Throughout the paper, $\hat{X}_k^n$, $k = 1, 2, \ldots, N$, represents a random variable taking values over $\hat{\mathcal{X}}_k^n$, the $n$-fold product of the reproduction alphabet $\hat{\mathcal{X}}_k$; on the other hand, $U_k$, $k = 1, 2, \ldots, N-1$, represents a random variable taking values over an arbitrary finite alphabet.

for any

, and
(2.3)

Let us now verify rate lower bounds. In view of (1.1), we have

(2.4)
and for

(2.5)

where the equality is due to the fact that each encoded frame is a function of the previous encoded frames and source frames. For the last frame, we have

(2.6)

With the auxiliary random variables defined above, it now follows from (2.2) to (2.6) and the desired Markov conditions that the rate distortion pair vector shifted by $\epsilon$ lies in $\mathcal{R}_c^{(n)}$. Letting $\epsilon \to 0$ yields the desired containment, which in turn implies $(R_1, D_1, \ldots, R_N, D_N) \in \mathrm{co}(\mathcal{R}_c^{(*)})$. This completes the proof of the converse part.

To determine $R_c^*(D_1, \ldots, D_N)$ in terms of information quantities, we define for each $n$ the $n$th order total rate distortion function

(2.7)

where the minimum is taken over all auxiliary random vectors satisfying the following two requirements:
(R5) for any ..., ...;
(R6) the Markov chains ... hold.

We further define

(2.8)

Then we have the following result.

Theorem 2: For jointly stationary and totally ergodic sources $(X_1, X_2, \ldots, X_N)$, $R_c^*(D_1, \ldots, D_N)$ equals the quantity defined in (2.8) for any distortion level $(D_1, \ldots, D_N)$.

To prove Theorem 2, we need the following lemma, which is also interesting in its own right.

Lemma 1: The function defined in (2.8) is convex and hence continuous over the open region $D_1 > 0, \ldots, D_N > 0$.

Proof of Lemma 1: Fix $(D_1, \ldots, D_N)$. In view of the definition given in (2.7), it is not hard to show that the sequence $\{R_{c,n}(D_1, \ldots, D_N)\}_{n \ge 1}$ is subadditive, that is,

$$R_{c,n+m}(D_1, \ldots, D_N) \le R_{c,n}(D_1, \ldots, D_N) + R_{c,m}(D_1, \ldots, D_N)$$

for any $n$ and $m$. As such, the quantity defined in (2.8) can also be expressed as

(2.9)

Next we derive an equivalent expression for $R_{c,n}(D_1, \ldots, D_N)$. Define .... That is,

(2.10)

where the infimum is taken over all auxiliary random variables satisfying the requirements (R1) to (R4). By comparing (2.10) with (2.7), it is easy to see that

(2.11)

On the other hand, pick any auxiliary random variables satisfying the requirements (R1) to (R4). Let ... be defined as in the requirements (R1) and (R2). Then, in view of the Markov conditions in the requirement (R4), we have

(2.12)

where the last inequality is due to the fact that ... is a function of ... for any .... To continue, we now verify the Markov conditions involving the constructed variables. It is not hard to see that the first Markov conditions in the requirement (R4) are equivalent to the following condition:
(R7) for any ..., ... and ... are conditionally independent given ... and ....
From this, it follows that for any ..., ... and ... are conditionally independent given ... and .... Applying the equivalence again, we see that the first Markov conditions in the requirement (R6) are satisfied. Therefore, we have
(2.13)

where the equality 1) follows from the Markov conditions involving the chosen auxiliary random variables. Note that the last Markov condition in the requirement (R6) may not be valid for them. To overcome this problem, we use the same technique as in the proof of the converse part of Theorem 1 to construct a new random vector such that the following hold: the new vector and the original one have the same distribution, and the required Markov condition is met. Therefore, the resulting random variables satisfy the requirements (R5) and (R6). This, together with (2.13), (2.12), and (2.7), implies

(2.14)

Note that (2.14) is valid for any auxiliary random variables satisfying the requirements (R1) to (R4). It then follows from (2.14) and (2.10) that the reverse inequality to (2.11) holds, which, together with (2.11), implies that the two quantities coincide and that (2.10) is an equivalent expression for $R_{c,n}(D_1, \ldots, D_N)$.

In comparison with (2.7), the equivalent expression (2.10) makes it easier to apply the well-known time-sharing argument. By applying the time-sharing argument to (2.10), it is now not hard to see that $R_{c,n}(D_1, \ldots, D_N)$ is a convex function of $(D_1, \ldots, D_N)$ for each $n$. The convexity of the quantity in (2.8) as a function of $(D_1, \ldots, D_N)$ then follows from its equivalent expression (2.9) and the convexity of each $R_{c,n}$. Since a convex function is continuous over an open region [14], this completes the proof of Lemma 1.
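The step from the infimum in (2.8) to the limit in (2.9) is the standard consequence of subadditivity known as Fekete's lemma; in the notation used here, it reads:

$$R_{c,n+m} \le R_{c,n} + R_{c,m} \;\;\text{for all } n, m \ge 1 \;\;\Longrightarrow\;\; \lim_{n \to \infty} \frac{R_{c,n}(D_1, \ldots, D_N)}{n} = \inf_{n \ge 1} \frac{R_{c,n}(D_1, \ldots, D_N)}{n}.$$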

Proof of Theorem 2: In view of the positive part of Theorem 1, it is not hard to see that $R_c^*(D_1, \ldots, D_N)$ is upper bounded by the quantity in (2.8) for any $(D_1, \ldots, D_N)$. Therefore, in what follows, it suffices to show

(2.15)

Now fix $(D_1, \ldots, D_N)$. Pick any rate vector $(R_1, \ldots, R_N)$ such that $(R_1, D_1, \ldots, R_N, D_N) \in \mathcal{R}_c$. From the proof of the converse part of Theorem 1, it follows that for any $\epsilon > 0$ and sufficiently large $n$, there exist auxiliary random variables satisfying the requirements (R1) to (R4), with each $D_k$ replaced by $D_k + \epsilon$, such that

(2.16)

holds, which, coupled with the equivalent expression (2.10) for $R_{c,n}(D_1, \ldots, D_N)$, further implies a lower bound on the total rate. In view of Lemma 1, dividing both sides of (2.16) by $n$ and letting $\epsilon \to 0$ and then $n \to \infty$ yield the desired bound, from which (2.15) follows. This completes the proof of Theorem 2.
Remark 1: Theorems 1 and 2 remain valid for general stationary ergodic sources $(X_1, X_2, \ldots, X_N)$. However, the technique adopted in the proof of the classic source coding theorem for a single ergodic source [9], [2] cannot be applied here. As such, a new proof technique has to be developed; this will be addressed in our forthcoming paper [25] in order not to deviate from our computation approach.

For general stationary ergodic sources $(X_1, X_2, \ldots, X_N)$, Theorem 2 is probably the best result one could hope for in terms of analytically characterizing $R_c^*(D_1, \ldots, D_N)$. However, its impact on practical video coding will be limited if the optimization problem involved cannot be solved by an effective algorithm. To a large extent, this is also true even if $R_c^*(D_1, \ldots, D_N)$ admits a single-letter characterization, and true for many other multi-user information theoretic problems. In the following section, we will develop an iterative algorithm to compute $R_{c,n}(D_1, \ldots, D_N)$ defined in (2.7), and establish its convergence to the global minimum.

III. AN ITERATIVE ALGORITHM

In this section, an iterative algorithm is proposed to calculate $R_{c,n}(D_1, \ldots, D_N)$ defined in (2.7). The algorithm serves three purposes in this paper: first, it allows us to do numerical calculations; second, the global convergence of this algorithm provides a completely different approach to establishing a single-letter characterization of $R_c^*(D_1, \ldots, D_N)$ when the sources are IID; and third, it allows us to do comparisons and gain deep insights into $R_c^*(D_1, \ldots, D_N)$.

Without loss of generality, we consider the case of $N = 3$ and denote the three sources by $X$, $Y$, and $Z$, respectively, to simplify our notation for describing the iterative algorithm. Let ... denote the joint distributions of the relevant random vectors, and let ... denote the marginal distribution of .... If there is no ambiguity, subscripts in distributions will be omitted. For example, we may write $p(x)$ instead of $p_X(x)$. In order to find the random variables that achieve $R_{c,n}(D_1, D_2, D_3)$, we try to find transition probability functions and probability functions that minimize

(3.1)

where $s = (s_1, s_2, s_3)$ denotes the standard Lagrange multiplier, and the base of the logarithm is $e$. For brevity, we shall denote ... by ..., and ... by .... When there is no ambiguity, the superscript or subscript will be dropped. The iterative algorithm works as follows.

Step 1: Initialize ... as a joint distribution function over ..., and set ....

Step 2: Fix .... Find ... such that

(3.2)

where the minimum is taken over all transition probability functions .... In view of the nested structure in (3.1), we solve the problem in (3.2) in three stages. First let us find .... From (3.1),

(3.3)

where .... In the above, the last inequality follows from the log-sum inequality, and becomes an equality if and only if

(3.4)

for any ....

We next find .... In view of (3.1) and (3.3), we have

(3.5)

where .... In the above, the last inequality again follows from the log-sum inequality, and becomes an equality if and only if

(3.6)

for any ....

Finally, let us find .... Continuing from (3.1) and (3.5), we have
For any ..., let

(3.7)

where .... Similarly, for any ..., let .... An argument similar to that leading to (3.3) and (3.5) can be used to show that (3.7) becomes an equality if and only if

(3.8)

for any ....

Step 3: Fix .... Find ... such that

(3.9)

where the minimum is taken over all joint distribution functions over .... In view of (3.1), we see that

(3.10)

where ... is the output of the channel ... in response to the input ..., and ... is the distribution of ..., i.e.,

(3.11)

for any .... The inequality (3.10) becomes an equality if and only if ... for any ....

Step 4: Repeat Steps 2 and 3 until the decrease in the objective is smaller than a prescribed threshold.

The above iterative algorithm can also be described succinctly by alternating between the two families of distributions introduced in Steps 2 and 3.
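Because the update formulas (3.4), (3.6), (3.8), and (3.11) are only partially legible in this copy, the following sketch illustrates the same alternating-minimization pattern on the single-source special case, where the procedure reduces to the classical Blahut-Arimoto computation of a rate distortion point. The source distribution, the distortion matrix, the uniform initialization, and all variable names are our own illustrative assumptions, not the paper's setup:

```python
import numpy as np

def blahut_arimoto(p_x, d, s, tol=1e-10, max_iter=10000):
    """One rate distortion point for a memoryless source via alternating minimization.

    p_x : (m,) source distribution
    d   : (m, k) distortion matrix d(x, xhat)
    s   : Lagrange multiplier (slope parameter), s > 0, natural-log scale
    Returns (rate in bits per symbol, expected distortion).
    """
    m, k = d.shape
    q = np.full(k, 1.0 / k)                    # output distribution, uniform start
    for _ in range(max_iter):
        # Analogue of Step 2: optimal test channel for the current q
        w = q * np.exp(-s * d)                 # exponential tilting, entrywise
        Q = w / w.sum(axis=1, keepdims=True)   # Q(xhat | x), rows normalized
        # Analogue of Step 3: optimal output distribution for the current Q
        q_new = p_x @ Q
        if np.abs(q_new - q).sum() < tol:
            q = q_new
            break
        q = q_new
    D = float(np.sum(p_x[:, None] * Q * d))               # E d(X, Xhat)
    R = float(np.sum(p_x[:, None] * Q * np.log2(Q / q)))  # I(X; Xhat) in bits
    return R, D

# Example: Bernoulli(1/2) source under Hamming distortion
p_x = np.array([0.5, 0.5])
d = np.array([[0.0, 1.0],
              [1.0, 0.0]])
print(blahut_arimoto(p_x, d, s=3.0))
```

In the paper's full algorithm the same two-phase pattern alternates over the coupled distributions for all three sources, and Theorem 3 supplies the global convergence guarantee that the classical argument of [5] provides in this reduced setting.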

The following theorem shows that the sequence of distributions generated by the algorithm converges to a quadruple of distributions that achieves

(3.12)

where the infimum is taken over all possible ....

Theorem 3: For any initial distribution satisfying ... for any ..., there exists ... such that ... as $t \to \infty$.

Proof of Theorem 3: From the description of the iterative algorithm, it follows that

(3.13)

To show the desired convergence, let us first verify that the algorithm has the so-called five-point property (as defined in [7]), that is, for any ... and the corresponding ...,

(3.14)

To this end, let us calculate both sides of (3.14). In view of Steps 2 and 3, we have

(3.15)

where the equality follows from the following derivation:

(3.16)

and

(3.17)

Combining (3.16) and (3.17), we immediately have the equality in (3.15). On the other hand,

(3.18)

Combining (3.15) with (3.18) yields the desired five-point property in (3.14).

The rest of the proof is similar to that adopted in [5] to show the convergence of the Blahut-Arimoto algorithm [3]. Suppose that

(3.19)

for some ... and for any .... From (3.14), it then follows that

(3.20)

which, together with ..., implies

(3.21)

and hence

(3.22)

Note that (3.22) is valid for any ... and ... satisfying (3.19). From this, we have

(3.23)

To prove the convergence of the sequence, pick a convergent subsequence, say .... Then

(3.24)

In view of (3.23), we have ...; thus ..., and hence (3.20) applies. In particular, ... is a nonincreasing sequence. Since ... implies ..., this means .... Hence ... as $t \to \infty$. This completes the proof of Theorem 3.

Remark 2: The above iterative algorithm can be easily extended to the case of $N > 3$, and Theorem 3 remains valid. By setting ..., it also reduces to the case of $N = 2$.

Remark 3: The iterative algorithm can be further extended to work for coupled distortion measures (as defined in [15]), where the distortion ... depends not only on ... but also on .... The global convergence as expressed in Theorem 3 is still guaranteed.

Remark 4: Although $R_{c,n}(D_1, D_2, D_3)$ as a function of $(D_1, D_2, D_3)$ is convex, as shown in the proof of Lemma 1, both the optimization problems (2.7) and (3.12) are actually non-convex optimization problems. It is therefore somewhat surprising to see the global convergence of our proposed iterative algorithm. As shown in the proof of Theorem 3, the key to the global convergence is the five-point property (3.14).

Remark 5: There are many other ways (including, for example, the greedy alternative algorithm [24]) to derive iterative procedures. However, it is not clear whether their global convergence can be guaranteed. Having algorithms with global convergence is important not only to numerical computation itself, but also to single-letter characterization of performance. One of the purposes of this paper is indeed to demonstrate for the first time that single-letter characterization of performance can also be established in a computational way via algorithms with global convergence, as shown in the next section.

We conclude this section by presenting an alternative expression for $R_{c,n}(D_1, D_2, D_3)$. Once again, we illustrate this by considering the case of $N = 3$. In view of the definitions (2.7) and (3.12), it is not hard to show (for example, by using the technique demonstrated in the proof of Property 1 in [21]) that for any $s$,

(3.25)

In other words, the quantity in (3.12), as a function of $s$, is the conjugate of $R_{c,n}$. Since $R_{c,n}$ is convex and lower semi-continuous over the whole region, it follows from [14, Theorem 12.2, pp. 104] that for any $(D_1, D_2, D_3)$,

(3.26)

In the next section, (3.26) will be used in the process of establishing a single-letter characterization for $R_c^*(D_1, \ldots, D_N)$ when the vector source $(X_1, X_2, \ldots, X_N)$ is IID.
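In our own shorthand, write $T_n(s)$ for the minimum value in (3.12) (the symbol $T_n$ is not from the paper). Then (3.25) and (3.26) form the usual Legendre-Fenchel conjugate pair:

$$T_n(s) = \min_{(D_1, D_2, D_3)} \Big[ R_{c,n}(D_1, D_2, D_3) + \sum_{k=1}^{3} s_k D_k \Big], \qquad R_{c,n}(D_1, D_2, D_3) = \max_{s \ge 0} \Big[ T_n(s) - \sum_{k=1}^{3} s_k D_k \Big],$$

the second identity holding by [14, Theorem 12.2] thanks to the convexity and lower semi-continuity of $R_{c,n}$.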
IV. SINGLE-LETTER CHARACTERIZATION: IID CASE

Suppose now that the vector source $(X_1, X_2, \ldots, X_N)$ is IID. In this section, we will use our iterative algorithm proposed in Section III and its global convergence to establish a single-letter characterization for $R_c^*(D_1, \ldots, D_N)$.

Theorem 4: If $(X_1, X_2, \ldots, X_N)$ is IID, then $R_c^*(D_1, \ldots, D_N) = R_{c,1}(D_1, \ldots, D_N)$ for any $(D_1, \ldots, D_N)$.

Proof: We first show that for any $n$,

(4.1)

for any $(D_1, \ldots, D_N)$. Without loss of generality, we demonstrate (4.1) in the case of $N = 3$ by using our iterative algorithm in Section III. Denote the three sources by $X$, $Y$, and $Z$. Since the vector source is IID, in view of (3.26) we have

(4.2)

for any $s$, where the quantity on the right is defined in (3.12). Here and throughout the rest of this proof, the subscript or superscript dropped for notational convenience in Section III is brought back to distinguish between the single-letter and block cases. Therefore, it suffices to show that

(4.3)

for any $s$. To this end, we will run the iterative algorithm in both cases to calculate both sides of (4.3). Pick any initial positive distribution and run the iterative algorithm in the single-letter case. We then get a sequence which, according to Theorem 3, satisfies

(4.4)

Now let the initial distribution in the block case be the $n$-fold product distribution of the one just chosen; clearly, it is also positive. Use it as an initial distribution and run the iterative algorithm in the block case. Then we get a sequence which, according to Theorem 3 again, satisfies

(4.5)

Since the source distribution in the block case is the $n$-fold product of the single-letter one, careful examination of (3.4), (3.6), (3.8), and (3.11) reveals that for any $t$, each iterate in the block case is the $n$-fold product of the corresponding single-letter iterate. (To see that this is the case, let us look at (3.4) for example, temporarily dropping the subscripts indicating random variables in all notation; it can be verified that in (3.4) the exponential factor factorizes across coordinates, and it then follows from (3.4) that the update preserves the product form.

A similar argument can be applied to (3.6), (3.8), and (3.11).) Therefore, for any $t$, the block objective equals $n$ times the single-letter objective, which, coupled with (4.4) and (4.5), implies (4.3) and hence (4.1).

Combining (4.1) with (2.8) yields the single-letter value for any $(D_1, \ldots, D_N)$. This, together with Theorem 2, implies

(4.6)

for any $(D_1, \ldots, D_N)$ with positive coordinates. Since, by their definitions, both functions are right continuous in the sense that for any ..., it follows that (4.6) remains valid for boundary points where some $D_k$ may be $0$. This completes the proof of Theorem 4.

Theorem 4 can also be proved by using the classical auxiliary random variable converse and positive proof (hereafter referred to as the classic approach). Indeed, one can establish the following single-letter characterization for the achievable region $\mathcal{R}_c$, the proof of which is given in Appendix A.

Theorem 5: If $(X_1, X_2, \ldots, X_N)$ is an IID vector source, then^6 $\mathcal{R}_c = \mathrm{co}(\mathcal{R}_c^{(1)})$.

Remark 6: It is instructive to compare the computational approach to single-letter characterization (as illustrated in the proofs of Theorems 2, 3, and 4) with the classic approach. In the computational approach, the converse is first established for multiple letters (blocks); its proof is often straightforward and the required Markov chain conditions are satisfied automatically, as shown in the proof of Theorem 2. The key is then to have an algorithm with global convergence for computing all block terms and later show that all these block terms are the same. On the other hand, in the classic approach, the converse proof is quite involved; coming up with auxiliary random variables with the right Markov chain conditions is always challenging and sometimes seems impossible. Since a single-letter characterization has to be computed anyway, the computational approach is preferred whenever it is possible.

Remark 7: When $N = 2$, Theorems 5 and 4 reduce to Theorems 1 and 3 in [15], respectively. However, the proofs in [15] are incomplete due to the invalid claim of the Markov condition made in the proofs therein; as such, the formulas therein cannot be extended to the case of $N \ge 3$. Theorems 5 and 4 in a slightly different, but equivalent, form were also reported in [10], [11], and [12] by following the classic approach. The difference lies in the extra Markov chain condition for the reconstruction shown as Condition (R4). For example, in the specific formulas shown in [10, Theorem 1] in the case of $N = 2$, the Markov chain condition ... is not required.

^6 Since the alphabet size of each $U_k$ in (2.1) can be bounded, $\mathcal{R}_c^{(n)}$, $n \ge 1$, is actually convex and closed. As such, $\mathrm{co}(\mathcal{R}_c^{(1)}) = \mathcal{R}_c^{(1)}$. We leave $\mathrm{co}(\mathcal{R}_c^{(1)})$ in the statement of Theorem 5 just for the sake of consistency with the norm in the literature [4].

V. MORE AND LESS CODING THEOREM

To gain deep insights into causal video coding, in this section we use our iterative algorithm proposed in Section III to compare $R_c^*$ across different numbers of encoded frames. To be specific, whenever we need to bring out the dependence of $R_c^*$ on the sources, we will write it with the sources listed explicitly. In particular, we will compare $R_c^*(D_1, D_2, D_3)$ with $R_c^*(D_1, D_2)$. Without loss of generality again, we will consider the case of $N = 3$. All results and discussions in this section can be easily extended to the case of $N > 3$. We first have the following result.

Theorem 6: Suppose that $(X_1, X_2, X_3)$ is jointly stationary and totally ergodic, and $X_1 \to X_2 \to X_3$ forms a Markov chain in the indicated order. Then for any $(D_1, D_2, D_3)$,

(5.1)

Proof: We distinguish between two cases: (1) ..., and (2) .... In Case (1), it follows from Theorem 2 and (2.8) that it suffices to show

(5.2)

for any $n$. To this end, pick any auxiliary random variables satisfying the requirements (R5) and (R6). It is not hard to verify that

(5.3)

where the equality 1) follows from the fact that the requirement (R6) plus the Markov condition $X_1 \to X_2 \to X_3$ implies that the Markov condition ... is satisfied. In (5.3), the Markov condition ... may not be valid. However, to

overcome this problem, we can use the same technique as in the proof of the converse part of Theorem 1, and also in the proof of Lemma 1, to construct a new random vector such that the following hold: the new vector and the original one have the same distribution, and the required Markov condition is met. Therefore, the resulting random variables satisfy the requirements (R5) and (R6). This, together with (5.3) and (2.7), implies

(5.4)

Since (5.4) is valid for any auxiliary random variables satisfying the requirements (R5) and (R6), (5.2) then follows from the definition (2.7). This completes the proof of (5.1) in Case (1).

To prove (5.1) in Case (2), note that both sides of (5.1) are right continuous in the sense that for any ..., the two equations shown at the bottom of the page hold. The validity of (5.1) in Case (2) then follows from its validity in Case (1). This completes the proof of Theorem 6.

Theorem 6 is what one would expect and is consistent with our intuition. Let us now look at the case where $X_1$, $X_2$, and $X_3$ do not form a Markov chain, and $(X_1, X_2, X_3)$ is an IID vector source. Define for any $(D_1, D_2)$

(5.5)

where $R_X(D)$, for any source $X$, is the classical rate distortion function of $X$. Assume that .... In view of Theorem 4 and the proof of Lemma 1, both $R_c^*(D_1, D_2, D_3)$ and $R_c^*(D_1, D_2)$ are convex as functions of their distortion arguments. As such, they are subdifferentiable at any point with positive coordinates. (See [14, Chapter 23] for discussions on the subdifferential and subgradients of a convex function.) From Section III, they can also be computed via our iterative algorithm through their respective conjugates.

Fig. 2. One special case of two-layer causal coding.

Since $(X_1, X_2, X_3)$ is an IID vector source, in view of Theorem 4, we will drop the subscript or superscript for all notation in Section III throughout the rest of this section. Once again, to bring out the dependence on the sources, we will write ... for ...; in particular, the notation ... means that the relevant frames are regarded as a super source (see Fig. 2). This convention will apply to other notation in Section III as well. In particular,

(5.6)

for any ....

Condition A: A point $(D_1, D_2)$ with $D_1 > 0$ and $D_2 > 0$ is said to satisfy Condition A if $R_c^*(D_1, D_2)$, as a function of $(D_1, D_2)$, has a negative subgradient at $(D_1, D_2)$ such that there is a distribution satisfying the following requirements:
(R8) ....
(R9) Define (as in Step 2 of the iterative algorithm)

(5.7)

(5.8)

where

(5.9)

(5.10)

Denote the two conditional distributions ... by ... and .... Then either ... or ..., and ... depends on ..., i.e., there exist ... with ... such that ....

We are now ready to state a somewhat surprising result dubbed the more and less coding theorem.

Theorem 7 (More and Less Coding Theorem): Suppose that $(X_1, X_2, X_3)$ is an IID vector source with ..., and $X_1$, $X_2$, and $X_3$ do not form a Markov chain. Then for any point $(D_1, D_2)$ satisfying Condition A, there is a critical value such that for any $D_3$ below this critical value,

(5.11)

and for any ...,

(5.12)

Remark 8: In Theorem 7, if ..., then at ....

Proof of Theorem 7: Since $R_c^*(D_1, D_2, D_3)$ is continuous and non-increasing as a function of $D_3$, it suffices to show that

(5.13)

for any point $(D_1, D_2)$ satisfying Condition A. To this end, we consider a new two-layer causal coding model shown in Fig. 2, where $X_2$ and $X_3$ together are regarded as one super source. Let ... denote its minimum total rate function. Since at ..., a random variable independent of ... can be constructed in such a way that .... Therefore, it is easy to see that

(5.14)

On the other hand, in view of the definition of causal video codes, it is not hard to see that any causal code for encoding $X_1$, $X_2$, and $X_3$ with respective distortions $(D_1, D_2, D_3)$ can also be used for encoding $X_1$ and the super source in Fig. 2 with the corresponding distortions without changing the total rate. Thus ... for any .... This, coupled with (5.14), implies

(5.15)

To continue, we are now led to show

(5.16)

for any point $(D_1, D_2)$ satisfying Condition A. First note that from the definition of causal video codes,

(5.17)

for any ... and .... Fix now any point $(D_1, D_2)$ satisfying Condition A. We prove (5.16) by contradiction. Suppose that

(5.18)

Let ... be the negative subgradient of $R_c^*(D_1, D_2)$ at the point $(D_1, D_2)$ in Condition A. From (5.15), it is also a negative subgradient of the super-source rate function at the corresponding point. This implies that for any ... and ...,

(5.19)

which, coupled with (5.18) and (5.17), in turn implies that the equation shown at the bottom of the page holds for any ... and .... In other words, under the assumption (5.18), the same vector is also a negative subgradient of ... at the point .... In view of (3.25), (3.26), and (5.6), it then follows that

(5.20)

In view of the requirement (R8) in Condition A, we have

(5.21)

From Step 2 of the iterative algorithm, it follows that

(5.22)

where the inequality in (5.22) is strict when ... depends on .... Therefore, according to the requirement (R9) in Condition A, no matter which choice in the requirement (R9) is valid, we always have ..., which, together with (5.19) to (5.21), implies that .... This contradicts the assumption (5.18), hence completing the proof of (5.16) and (5.13).

Define .... Then from (5.13), it is easy to see that this is the desired critical value. This completes the proof of Theorem 7.

Remark 9: Theorem 7, in particular (5.11), is really counterintuitive. It says that whenever the conditions specified in Theorem 7 are met, the more source frames that need to be encoded and transmitted, the less data after encoding has to be actually sent! If the cost of data transmission is proportional to the transmitted data volume, this translates literally into a scenario where the more frames you download, the less you would pay. To help the reader better understand this phenomenon, let us examine where the gain of $R_c^*(D_1, D_2, D_3)$ over $R_c^*(D_1, D_2)$ comes from whenever the conditions specified in Theorem 7 are met. The availability of ... to the encoder of ... does not really help that encoder and its corresponding decoder achieve a better rate distortion tradeoff. Likewise, the availability of ... and ... to the encoder of ... does not really help that encoder and its corresponding decoder achieve a better rate distortion tradeoff either. What really matters is that the availability of ... to the encoder of ... will help that encoder choose better side information for the encoder and decoder of $X_3$. If the rate reduction of the encoder of $X_3$ arising from this better side information is more than the overhead associated with its selection, then the total rate is smaller. (Here the overhead is meant to be the difference between the corresponding rate sums in the two coding scenarios; depending on how helpful the side information is, an individual rate can be more or less in one scenario than in the other.) This is further confirmed in Examples 1 and 2 at the end of this section.

Condition A is generally met at points $(D_1, D_2)$ for which positive bit rates are needed at both the decoder for $X_1$ and the decoder for $X_2$ in order for them to produce the respective reproductions with the desired distortions $D_1$ and $D_2$. Such distortion points will be called points with positive rates. By using the technique demonstrated in the proof of [21, Property 1], it can be shown that $R_c^*(D_1, D_2)$ has a negative subgradient at any point $(D_1, D_2)$ with positive rates. In addition, the optimal distribution in Condition A generally (except for some corner cases) depends on ... when $X_1$, $X_2$, and $X_3$ do not form a Markov chain. We illustrate this in the following theorem in the binary case.

Theorem 8: Assume that all alphabets are binary and the Hamming distortion measure is used. Let $(X_1, X_2, X_3)$ be an IID vector source with .... Suppose that $X_1$, $X_2$, and $X_3$ do not form a Markov chain. Then for $(D_1, D_2)$ with $D_1 > 0$ and $D_2 > 0$, if (... and ...) achieves $R_c^*(D_1, D_2)$, i.e.,

(5.23)

then ... depends on ..., i.e., there exists ... such that the conditional distributions ... and ... are different.

Proof of Theorem 8: Fix $(D_1, D_2)$ with $D_1 > 0$ and $D_2 > 0$. We first derive some bounds on $R_c^*(D_1, D_2)$. It is not hard to verify that

(5.24)

where ... is the unique value of $D_1$ at which the derivative of ... is equal to ..., and ... is the unique value of $D_2$ at which the derivative of ... is equal to .... In the above, the inequality 1) is due to the fact that

(5.25)

for any .... Under the condition that ..., the inequality (5.25) is strict at .... Therefore,

(5.26)

In view of (5.23), it follows from the iterative algorithm that

(5.27)

(5.28)

where ... appears as subscripts to indicate that the operations defined in Section III are for the corresponding sources. Let ... be the output of
the channel in response to the input .... Then the joint distribution of ... is ..., and (5.23) implies

(5.29)

Putting (5.29) and (5.26) together, we can conclude that ... and hence ... for any .... Otherwise, from (5.29) we would have

(5.30)

which contradicts (5.26).

We now prove Theorem 8 by contradiction. Suppose that ... does not depend on .... Then for any ... and ...,

(5.31)

which, together with (5.27), (5.7) to (5.10), and the fact that ..., implies

(5.32)

Simplifying (5.32) yields

(5.33)

where ....

To continue, we now consider specific values of ... and .... Let us first look at the case of ... and .... It follows from (5.33) that

(5.34)

which implies

(5.35)

where ... and .... Further simplifying (5.35) yields

(5.36)

Since ..., it can be verified that ... is equal to ... if and only if ....

Next we show that .... To this end, first note that ... is equivalent to saying that ... is a product distribution, i.e.,

(5.37)

By plugging (5.37) into (5.27), it follows from Step 2 of the iterative algorithm that ... does not depend on ... and ... does not depend on ..., i.e.,

(5.38)

where ... and ... are the normalization factors so that the respective terms are indeed distributions. It is easy to see that (5.37) and (5.38) imply

(5.39)

(5.40)

(5.41)

Combining (5.39) to (5.41) with (5.29) yields a contradiction with (5.26). Therefore, ....

Going back to (5.36): since ..., (5.36) is equivalent to

(5.42)

Repeating the above argument for the case of ... and ..., we then have accordingly

(5.43)

Putting (5.42) and (5.43) together, we have shown that (5.31) implies that $X_1$, $X_2$, and $X_3$ form a Markov chain, which contradicts our assumption. This completes the proof of Theorem 8.

Remark 10: From Theorem 8, it follows that for any sources $X_1$, $X_2$, and $X_3$ satisfying the conditions of Theorem 8, Condition A is met at any point $(D_1, D_2)$ at which $R_c^*(D_1, D_2)$ has a negative subgradient.

We conclude this section with examples illustrating Theorem 7.

Example 1: Suppose that ... and that the Hamming distortion measure is used. Let ....

It is easy to see that $X_1$, $X_2$, and $X_3$ do not form a Markov chain. We consider the following three cases:
Case 1: $D_1 = 0.31$ and $D_2 = 0.15$;
Case 2: $D_1 = 0.20$ and $D_2 = 0.15$; and
Case 3: $D_1 = 0.22$ and $D_2 = 0.23$.

Fig. 3. Comparison of $R_c^*(D_1, D_2, D_3)$ and $R_c^*(D_1, D_2)$ versus $D_3$ for fixed $D_1 = 0.31$ and $D_2 = 0.15$ in Example 1.

Fig. 4. Comparison of $R_c^*(D_1, D_2, D_3)$ and $R_c^*(D_1, D_2)$ versus $D_3$ for fixed $D_1 = 0.20$ and $D_2 = 0.15$ in Example 1.

For Case 1, Fig. 3 shows the rate-distortion curves of $R_c^*(D_1, D_2, D_3)$ and $R_c^*(D_1, D_2)$ versus $D_3$. Over the interval of $D_3$ shown in Fig. 3, it is clear that $R_c^*(D_1, D_2, D_3)$ is always strictly less than $R_c^*(D_1, D_2)$.

For Case 2, Fig. 4 shows $R_c^*(D_1, D_2, D_3)$ and $R_c^*(D_1, D_2)$ versus $D_3$ with fixed $D_1 = 0.20$ and $D_2 = 0.15$. It is observed that the critical point at which $R_c^*(D_1, D_2, D_3)$ meets $R_c^*(D_1, D_2)$ is the intersection of the two curves. Then it is clear that when $D_3$ is below this critical point, $R_c^*(D_1, D_2, D_3)$ is indeed strictly less than $R_c^*(D_1, D_2)$. Table I shows the rate allocation across the different encoders in both cases for several sample values of $D_3$, where $R_k$, $k = 1, 2, 3$, represents the rate allocated to the encoder of $X_k$; the two total rates are abbreviated in the table to save space. It is clear from Table I that the allocated rates confirm the explanation given in Remark 9.
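The specific joint distributions of Examples 1 and 2 are not reproduced in this copy, but the single-frame ingredient $R_X(D)$ appearing in (5.5) is the classical rate distortion function, which for a binary source with bias $p$ under Hamming distortion has the closed form $R_X(D) = h(p) - h(D)$ for $0 \le D \le \min(p, 1-p)$, with $h$ the binary entropy function. The small script below is our own illustration (the bias $p = 0.5$ is an assumption, not the paper's value):

```python
import numpy as np

def h(p):
    """Binary entropy in bits."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

def binary_rd(p, D):
    """Classical R(D) = h(p) - h(D) for a Bernoulli(p) source, Hamming distortion."""
    return h(p) - h(D) if 0 <= D < min(p, 1 - p) else 0.0

p = 0.5                         # assumed source bias
for D in (0.15, 0.20, 0.31):    # distortion values appearing in Example 1
    print(D, binary_rd(p, D))
```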

Fig. 5. Comparison of $R_c^*(D_1, D_2, D_3)$ and $R_c^*(D_1, D_2)$ versus $D_3$ for fixed $D_1 = 0.22$ and $D_2 = 0.23$ in Example 1.

TABLE I
RATE ALLOCATION OF $R_c^*(D_1, D_2, D_3)$ AND $R_c^*(D_1, D_2)$ VERSUS $D_3$ FOR FIXED $D_1 = 0.20$ AND $D_2 = 0.15$ IN EXAMPLE 1

TABLE II
RATE ALLOCATION OF $R_c^*(D_1, D_2, D_3)$ AND $R_c^*(D_1, D_2)$ VERSUS $D_3$ FOR FIXED $D_1 = 0.22$ AND $D_2 = 0.23$ IN EXAMPLE 1

When we assign different values to $D_1$ and $D_2$, we observe the same phenomenon, as shown again in Fig. 5 and Table II for Case 3.

Let us now look at another example with a different joint distribution.

Example 2: Suppose that ... and that the Hamming distortion measure is used. Let .... Once again, $X_1$, $X_2$, and $X_3$ do not form a Markov chain. Fix $D_1 = 0.0988$ and $D_2 = 0.0911$. Fig. 6 shows the two rate distortion curves $R_c^*(D_1, D_2, D_3)$ and $R_c^*(D_1, D_2)$ versus $D_3$, and Table III lists their respective rate allocations for several sample values of $D_3$. The same phenomenon is revealed as in Example 1.

For all cases shown in Examples 1 and 2, in comparison with $R_c^*(D_1, D_2)$, when we include $X_3$ in the encoding and transmission, we not only get the reconstruction of $X_3$ (with distortion $D_3$) free at the receiver end, but are also able to reduce the total number of bits to be transmitted. In other words, we can achieve a double gain.

VI. COMPARISON WITH GREEDY CODING

All MPEG-series and H-series video coding standards [13], [19] proposed so far fall into predictive video coding, where at the encoder for each frame $X_k$, only previous encoded frames are used as a helper. By using a technique called soft decision quantization [19], [17], [18], it has been demonstrated in a series of papers [19], [20], [16] that the greedy coding method^7 offers significant gains (ranging from 10% to 30% rate reduction at the same quality) over the respective reference codecs^8 of these standards. As such, it is instructive to compare the performance of causal coding characterized by $R_c^*(D_1, \ldots, D_N)$ with the performance of greedy coding characterized by the total rate offered by the greedy coding method. In this section, we present specific examples to numerically compare the two. An analytic comparison between causal coding and predictive coding will be treated separately in our forthcoming paper due to its complexity.

Example 3: Suppose that ... and the Hamming distortion measure is used. In this example,

^7 The greedy coding method is a special form of predictive video coding; based on all previous encoded frames, it encodes each current frame in a locally optimal manner so as to achieve the best rate distortion tradeoff for the current frame only.

^8 Both the greedy coding method and reference codecs are special forms of predictive video coding. At this point, the best rate distortion performance of predictive video coding is still unknown in general.

we consider a Markov chain $X_1 \to X_2 \to X_3$. The transition probability ... is given by ..., and the other transition probability ... is given by ....

Fig. 6. Comparison of $R_c^*(D_1, D_2, D_3)$ and $R_c^*(D_1, D_2)$ versus $D_3$ for fixed $D_1 = 0.0988$ and $D_2 = 0.0911$ in Example 2.

Fig. 7. Comparison of $R_c^*(D_1, D_2, D_3)$ and the greedy coding total rate versus $D_3$ for fixed $D_1 = 0.5488$ and $D_2 = 0.3927$ in Example 3.

TABLE III
RATE ALLOCATION OF $R_c^*(D_1, D_2, D_3)$ AND $R_c^*(D_1, D_2)$ VERSUS $D_3$ FOR FIXED $D_1 = 0.0988$ AND $D_2 = 0.0911$ IN EXAMPLE 2
Fig. 7 shows the rate-distortion curves of $R_c^*(D_1, D_2, D_3)$ and the greedy coding total rate versus $D_3$ when $X_1$ is uniformly distributed, $D_1 = 0.5488$, and $D_2 = 0.3927$. As shown in Fig. 7, when $D_3$ is ..., $R_c^*(D_1, D_2, D_3)$ is more than 31 percent less than the total rate offered by greedy coding.
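The transition matrices of Example 3 are not reproduced in this copy, but the construction itself is straightforward: with $X_1$ uniform and the two transition probability matrices, the joint distribution factors as $p(x_1, x_2, x_3) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_2)$. The sketch below assembles such a source; the matrix entries are placeholders of our own, not the paper's values:

```python
import numpy as np

# Placeholder transition matrices (the paper's entries are not reproduced here)
P12 = np.array([[0.9, 0.1],
                [0.2, 0.8]])   # P12[a, b] = Pr{X2 = b | X1 = a}
P23 = np.array([[0.7, 0.3],
                [0.1, 0.9]])   # P23[b, c] = Pr{X3 = c | X2 = b}

m = P12.shape[0]
p1 = np.full(m, 1.0 / m)       # X1 uniformly distributed, as in Example 3

# Joint pmf of the Markov chain X1 -> X2 -> X3
joint = p1[:, None, None] * P12[:, :, None] * P23[None, :, :]
assert abs(joint.sum() - 1.0) < 1e-12

# Markov sanity check: conditioned on X2 = b, X3 is independent of X1
for b in range(m):
    cond = joint[:, b, :] / joint[:, b, :].sum(axis=1, keepdims=True)
    assert np.allclose(cond, P23[b])   # every row equals P23[b]
```

A joint pmf of this form would then serve as the source distribution fed to the iterative algorithm of Section III; Example 4 differs only in that the chain ordering is permuted.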
Let us now look at another example in which $X_1$, $X_2$, and $X_3$ do not form a Markov chain.
Example 4: Suppose that ... and the Hamming distortion measure is used. In this example, $X_1$, $X_2$, and $X_3$ do not form a Markov chain, but

... does form a Markov chain in the indicated order. The transition probability ... is given by ..., and the other transition probability ... is given by ....

Fig. 8. Comparison of $R_c^*(D_1, D_2, D_3)$ and the greedy coding total rate versus $D_3$ for fixed $D_1 = 0.5488$ and $D_2 = 0.3927$ in Example 4.

Fig. 8 shows the rate-distortion curves of $R_c^*(D_1, D_2, D_3)$ and the greedy coding total rate versus $D_3$ when $X_1$ is uniformly distributed, $D_1 = 0.5488$, and $D_2 = 0.3927$. As shown in Fig. 8, when $D_3$ is ..., $R_c^*(D_1, D_2, D_3)$ is 34.8 percent less than the total rate offered by greedy coding.

The above two examples are of course toy examples. However, if the performance improvement is indicative of the performance of causal video coding for real video data, it is definitely worthwhile to make the causal video coding idea materialize in video codecs.
VII. CONCLUSION

In this paper, we have investigated the causal coding of source frames $X_1, X_2, \ldots, X_N$ from an information theoretic point of view. An iterative algorithm has been proposed to numerically compute the minimum total rate $R_c^*(D_1, \ldots, D_N)$ achievable asymptotically by causal video coding for jointly stationary and totally ergodic sources at distortion levels $D_1, \ldots, D_N$, and to analytically characterize $R_c^*(D_1, \ldots, D_N)$ for IID sources. The algorithm has been shown to converge globally. With the help of the algorithm, we have further established a somewhat surprising more and less coding theorem: under some conditions on source frames and distortion, the more frames that need to be coded and transmitted, the less data after encoding has to be sent! If the cost of data transmission is proportional to the transmitted data volume, this translates literally into a scenario where the more frames you download, the less you would pay. Numerical comparisons between causal video coding and greedy coding have shown that causal video coding offers significant performance gains over greedy coding. Along the way, we have advocated that, whenever possible, the computational approach as illustrated in this paper is a preferred approach to multi-user problems in information theory. In addition, we have also established an extended Markov lemma for correlated ergodic sources, which will be useful to other multi-user problems in information theory as well.

If the information theoretic analysis as demonstrated in this paper is indicative of the real performance of causal video coding for real video data, then the more and less coding theorem, plus the significant performance gain of causal video coding over greedy coding, really points to a bright future for causal video coding. To make the idea of causal video coding materialize in real video codecs, future research efforts should be directed towards designing effective causal video coding algorithms, in addition to addressing many information theoretic problems such as universal causal video coding.
APPENDIX A

In this Appendix, we prove Theorem 5. As usual, we divide the proof of Theorem 5 into its converse part and its positive part.

Proof of the converse part: Pick any achievable rate distortion pair vector $(R_1, D_1, \ldots, R_N, D_N)$. For any $\epsilon > 0$, there exists an order-$n$ causal video code for all sufficiently large $n$ such that (1.1) holds. Let $S_k$ and $\hat{X}_k^n$ be the respective encoded frame of and reconstructed frame for $X_k^n$ given by the code. It follows from the definition of causal video codes that the Markov
conditions ... are satisfied, and

(A.1)

for $k = 1, 2, \ldots, N$.

Define auxiliary random variables ... for any ... and ..., where .... Since $(X_1, X_2, \ldots, X_N)$ is an IID vector source, it is not hard to verify that the Markov chain ... is valid for any ... and .... In view of (1.1), and the assumption that $(X_1, X_2, \ldots, X_N)$ is an IID vector source, we have

(A.2)

where the equality is due to the Markov chain ..., and for ...,

(A.3)

where the equality is due to the Markov chain .... For the last frame, we have

(A.4)

To continue, we introduce a timesharing random variable that is uniformly distributed over $\{1, 2, \ldots, n\}$ and independent of all random variables appearing in (A.1) to (A.4). Define ... for .... Then it is not hard to verify that the Markov chain ... is valid for ..., and (A.2), (A.3), (A.4), and (A.1) can be rewritten, respectively, as

(A.5)

(A.6)

(A.7)

(A.8)

Note that ... and ... have the same distribution, and ... is a function of .... Therefore, in comparison with the requirements (R1) to (R4) in the definition (2.1), the only thing missing is that the Markov chain ... may not be valid. To overcome this problem, we can use the same technique as in the proof of the converse part of Theorem 1, and also in the proof of Lemma 1, to construct a new random vector such that the following hold: the new vector and the original one have the same distribution, and the Markov condition ... is met. This, together with (A.5) to (A.8) and the definition (2.1), implies that

(A.9)

Letting $\epsilon \to 0$ yields the desired containment and hence $(R_1, D_1, \ldots, R_N, D_N) \in \mathrm{co}(\mathcal{R}_c^{(1)})$. This completes the proof of the converse part of Theorem 5.

The positive part of Theorem 5, $\mathcal{R}_c \supseteq \mathrm{co}(\mathcal{R}_c^{(1)})$, can be proved by using the standard random coding argument in multi-user information theory [4], [1]. For the sake of completeness, we present a sketch of the proof below.
Proof sketch of the positive part: For convenience, we shall use bold letters to denote vectors throughout the rest of this section. Since $\mathcal{R}_c^{(1)}$ is convex and $\mathcal{R}_c$ is closed, it suffices to show that $\mathcal{R}_c^{(1)} \subseteq \mathcal{R}_c$. Pick any rate distortion pair vector $(R_1, D_1, \ldots, R_N, D_N) \in \mathcal{R}_c^{(1)}$. We shall show that it is achievable. Let the auxiliary random variables in (2.1) (for the definition of $\mathcal{R}_c^{(1)}$) satisfy the requirements (R1) to (R4) with functions .... Denote the alphabets of the auxiliary random variables by ..., respectively. For any ..., define ....

Let ... be the set of strongly jointly typical sequences of length $n$ with respect to the joint distribution of .... Similarly, for any ..., let ... be the set of strongly jointly typical sequences of length $n$ with respect to the joint distribution of ..., and let ... be the set of strongly jointly typical sequences of length $n$ with respect to the joint distribution of .... Similar notation will be used for other sets of strongly typical sequences with respect to other joint distributions. (For the definition of strong typicality, please refer to, for example, [4, p. 326].)
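For reference, one common variant of the strong typicality definition cited above (the exact constants are immaterial to the argument) is: a sequence $x^n \in \mathcal{X}^n$ is $\epsilon$-strongly typical with respect to a distribution $p$ on $\mathcal{X}$ if

$$\sum_{a \in \mathcal{X}} \left| \frac{N(a \mid x^n)}{n} - p(a) \right| \le \epsilon \qquad \text{and} \qquad N(a \mid x^n) = 0 \text{ whenever } p(a) = 0,$$

where $N(a \mid x^n)$ denotes the number of occurrences of $a$ in $x^n$; joint versions replace $p$ by the relevant joint distribution.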
In what follows, the values of the typicality parameter in different strongly typical sets should be understood as being multiplied by different constants for different sets. We are now ready to describe the random codebooks and how the encoders/decoders work.
Generation of codebooks:
1) Generate independently ... codewords (the set of which is denoted by ...), where each codeword is drawn according to the $n$-fold product distribution of ....
2) For $1 < k < N$, for every combination of previously generated codewords, generate independently ... codewords (the set of which is denoted by ...), where each codeword is drawn according to the $n$-fold product conditional distribution of ... conditionally given the combination.
3) For every combination ..., generate independently ... codewords (the set of which is denoted by ...), where each codeword is drawn according to the $n$-fold product conditional distribution of ... conditionally given the combination.
Encoding:
1) Given a sequence ..., encode it into the index, say ..., of the first codeword in ... such that the two are strongly jointly typical, if such a codeword exists; otherwise, set the index to .... Denote the resulting codeword by ....
2) For $1 < k < N$, with the knowledge of all historical codewords, denoted by ..., the encoder for $X_k$ finds the index, say ..., of the first codeword in ... such that the relevant sequences are strongly jointly typical if such a codeword exists, and sets the index to ... otherwise. Denote the resulting codeword by ....
3) With the knowledge of all historical codewords, denoted by ..., the encoder for $X_N$ finds the index, say ..., of the first codeword in ... such that the relevant sequences are strongly jointly typical if such a codeword exists, and sets the index to ... otherwise. Denote the resulting codeword by ....
Decoding:
1) The decoder for $X_1$ first reproduces the codeword ... from the received index, and then calculates $\hat{X}_1^n$ by applying the function ... to each component of the codeword.
2) Upon receiving the indices, the decoder for $X_k$, $1 < k < N$, reproduces the codeword ... and then calculates $\hat{X}_k^n$ by applying the function ... to each component of the relevant codewords.
3) Upon receiving the indices, the decoder for $X_N$ reproduces the codeword ... and then outputs it as $\hat{X}_N^n$.
Analysis of bit rates, typicality, and distortions:
1) From the construction of the encoders, the bit rate in bits per symbol for each encoder is upper bounded by ....
2) In view of the law of large numbers, standard probability bounds associated with typicality (see, for example, [4, Lemma 10.6.2, Chapter 10]), and the Markov lemma [4, Lemma 15.8.1, Chapter 15], [1], it follows that with probability approaching 1 as $n \to \infty$, the source sequences and the transmitted codewords are strongly jointly typical.
3) In view of the requirements (R1) to (R3) in the definition (2.1) and of the above two paragraphs, it follows that the distortion per symbol between each $X_k^n$ and $\hat{X}_k^n$ is upper bounded by ... with probability approaching 1 as $n \to \infty$.
Existence of a deterministic causal video code with desired performance:
In the above analysis, all probabilities are with respect to both the random sources $(X_1, X_2, \ldots, X_N)$ and the random codebooks. By the well-known Markov inequality, it follows that there exists a deterministic causal video code (i.e., a deterministic codebook) for which the distortion per symbol between each $X_k^n$ and $\hat{X}_k^n$ is upper bounded by ... with probability approaching 1 as $n \to \infty$.^9 Therefore, for this deterministic causal video code, the average distortion per symbol between each $X_k^n$ and $\hat{X}_k^n$ is upper bounded by .... Note that all rates are fixed. Putting all the pieces together, we have shown that the chosen vector is achievable.

Letting $\epsilon \to 0$ yields the desired conclusion.

This completes the proof of the positive part of Theorem 5.


^9 This step is necessary since we have multiple distortion inequalities to satisfy, in which case declaring the existence of a deterministic code immediately from several inequalities with average performance over the codebook ensemble would fail.

APPENDIX B
In this Appendix, we prove the positive part (i.e.,
) of Theorem 1. Since
, each
is
is closed, it sufces to show that for each
convex, and
.
Proof of
: Unless otherwise specied, notation below is the same as in the proof of the positive part in
Appendix A. Indeed, our proof is similar to the random coding
argument made for the IID case in Appendix A. However, since
now is not IID, but stationary
the vector source
and totally ergodic, the Markov lemma in its simple form as
expressed in [4, Lemma 15.8.1, Chapter 15] is not valid any
more. To overcome this difculty, we will modify the concept of
typical sequences and make it even stronger. With
and
, dened as
and
in Appendix A, we dene for each sequence
, where for any alphabet
denotes the set of all
sequences of length from

, let
be the output process of the
For any
in response
memoryless channel given by
and
.
to the inputs
be the output process of the memoryless
Let
in response to the inputs
channel given by
and
. Then the following
properties hold.
, where
(P1) The probability
and
, goes to as
.
and sufciently large
(P2) For any

(B.5)
for any
(P3) For sufciently large

(B.1)
(B.6)

and similarly, for each

(B.2)
We then dene our modied joint typical sets as follows:

(B.3)
and for

(B.4)
To get our random causal video coding scheme in this case, we simply modify the encoding procedure of the random coding scheme constructed in Appendix A by replacing the typical sets used there with the modified typical sets defined above; the rest of the random coding scheme remains the same. Since the rate of the encoder for each frame is fixed, the bit rate in bits per symbol for each frame $k$ is again upper bounded by the fixed rate $R_k$. To get the desired upper bounds on distortions, we need to analyze the joint typicality of the source sequences and the respective transmitted codeword sequences. At this point, we invoke the following result, which will be proved at the end of this Appendix.
Lemma 2 (Extended Markov Lemma): Suppose that $X_1, X_2, \ldots, X_N$ are jointly stationary and ergodic. Let $U_1, \ldots, U_{N-1}$ and $\hat{X}_N$ be the auxiliary random variables satisfying the requirements (R1) to (R4) in the definition (2.1). Let $U_1^n, \ldots, U_{N-1}^n$ and $\hat{X}_N^n$ be the output processes of the memoryless channels given by the conditional distributions in (2.1) in response to the inputs $X_1^n, \ldots, X_N^n$. Then Properties (P1) to (P3) hold.
Lemma 2 can be regarded as an extended Markov lemma for the ergodic case. In view of Lemma 2, it is not hard to see that with high probability, which approaches 1 as $n \to \infty$, the source sequences and the transmitted codeword sequences are strongly typical, and each pair $(x_k^n, \hat{x}_k^n)$, $1 \le k \le N$, is strongly typical. The rest of the proof is identical to the case considered in Appendix A. This completes the proof in the single-symbol case.
Proof in the general case: We consider a block of $m$ symbols as a super symbol and regard $(X_1, X_2, \ldots, X_N)$ as a vector source over the corresponding product alphabets. Since $(X_1, X_2, \ldots, X_N)$ is totally ergodic, it is also ergodic when regarded as a vector source over super symbols. Repeating the above argument for super symbols, i.e., for the alphabets raised to the $m$th power, we obtain the claimed bound for any $m$ (see the sketch following this paragraph). This completes the proof of the positive part of Theorem 1.
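Schematically, and again in notation of our own choosing: if $R^{(m)}(D_1, \ldots, D_N)$ denotes the bound obtained by applying the single-symbol argument to super symbols of length $m$, the reduction just described yields
\[
R_c^*(D_1, \ldots, D_N) \le \frac{1}{m}\, R^{(m)}(D_1, \ldots, D_N) \quad \text{for every } m \ge 1,
\]
where $R^{(m)}$ is measured per super symbol (i.e., per block of $m$ original symbols), and letting $m \to \infty$ gives the positive part of Theorem 1.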
We now prove Lemma 2.
Proof of Lemma 2: By construction, it is easy to see that $U_1^n, \ldots, U_{N-1}^n$ and $\hat{X}_N^n$ are the outputs of a memoryless channel in response to the input $(X_1^n, \ldots, X_N^n)$. Since $X_1, X_2, \ldots, X_N$ are jointly stationary and ergodic, it follows from [2, Theorem 7.2.1, p. 272] that the processes consisting of the sources together with the channel outputs are jointly stationary and ergodic as well. By the ergodic theorem, we then have the almost-sure convergence stated in (B.7). Rewrite the probability of atypicality appearing in Property P1 as in (B.8).
Applying the Markov inequality to (B.8), we get (B.9). Since the right-hand side of (B.9) goes to 0 as $n \to \infty$, combining (B.9) with (B.7) yields Property P1 in Lemma 2.
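The shape of this step is worth recording in generic notation (ours): let $Z_n \in [0, 1]$ denote the conditional probability, given the source realization, that the channel outputs fall outside the modified typical set. If, as (B.7) and (B.8) together suggest, the ergodic theorem forces $E[Z_n] \to 0$, then for any fixed $\eta > 0$, Markov's inequality gives
\[
\Pr\{ Z_n > \eta \} \le \frac{E[Z_n]}{\eta} \xrightarrow[n \to \infty]{} 0 ,
\]
which is precisely the form of conclusion asserted in Property P1.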
To prove Property P2 in Lemma 2, note that given any realization of the source sequences, the corresponding codeword sequence is a conditionally independent sequence. It is not hard to see that the convergence stated in (B.10) holds; furthermore, the convergence in (B.10) is uniform. This, coupled with the definition of the modified typical sets, implies that (B.11) holds for sufficiently large $n$ and for any sequence in the corresponding typical set. Applying the Markov inequality to (B.11), we get (B.12), which in turn implies (B.13) for all sufficiently large $n$. Combining (B.13) with (B.11) yields (B.5). A similar argument can be used to prove Property (P3). This completes the proof of Lemma 2.

ACKNOWLEDGMENT
The authors would like to acknowledge the Associate Editor, Dr. Ordentlich, and the anonymous reviewers for their detailed comments. In particular, the authors are deeply grateful to the Associate Editor for bringing references [11] and [12] to their attention.

REFERENCES
[1] T. Berger, "Multiterminal source coding," in Information Theory Approach to Communications, G. Longo, Ed. New York: Springer-Verlag, 1977.
[2] T. Berger, Rate Distortion Theory. Englewood Cliffs, NJ: Prentice-Hall, 1971.
[3] R. E. Blahut, "Computation of channel capacity and rate-distortion functions," IEEE Trans. Inf. Theory, vol. IT-18, pp. 460–473, 1972.
[4] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Hoboken, NJ: Wiley, 2006.
[5] I. Csiszár, "On the computation of rate-distortion functions," IEEE Trans. Inf. Theory, vol. IT-20, pp. 122–124, 1974.
[6] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems. Budapest, Hungary: Akadémiai Kiadó, 1986.
[7] I. Csiszár and G. Tusnády, "Information geometry and alternating minimization procedures," Statistics and Decisions, Supplement Issue 1, pp. 205–237, 1984.
[8] W. H. R. Equitz and T. Cover, "Successive refinement of information," IEEE Trans. Inf. Theory, vol. 37, no. 2, pp. 269–275, Mar. 1991.
[9] R. G. Gallager, Information Theory and Reliable Communication. New York: Wiley, 1968.
[10] N. Ma and P. Ishwar, "On delayed sequential coding of correlated sources," Sep. 30, 2008, arXiv: cs/0701197v2 [cs.IT].
[11] N. Ma and P. Ishwar, "The value of frame-delays in the sequential coding of correlated sources," in Proc. 2007 IEEE Int. Symp. Inf. Theory, Nice, France, Jun. 2007, pp. 1496–1500.
[12] N. Ma, Y. Wang, and P. Ishwar, "Delayed sequential coding of correlated sources," in Proc. 2007 Information Theory and Applications Workshop, San Diego, CA, Jan. 2007, pp. 214–222.
[13] I. E. G. Richardson, H.264 and MPEG-4 Video Compression. New York: Wiley, 2003.
[14] R. T. Rockafellar, Convex Analysis. Princeton, NJ: Princeton Univ. Press, 1970.
[15] H. Viswanathan and T. Berger, "Sequential coding of correlated sources," IEEE Trans. Inf. Theory, vol. 46, no. 1, pp. 236–246, Jan. 2000.
[16] E.-H. Yang and L. Wang, "Full rate distortion optimization of MPEG-2 video coding," in Proc. 2009 IEEE Int. Conf. Image Process., Cairo, Egypt, Nov. 7–11, 2009, pp. 605–608.
[17] E.-H. Yang and L. Wang, "Joint optimization of run-length coding, Huffman coding and quantization table with complete baseline JPEG decoder compatibility," IEEE Trans. Image Process., vol. 18, no. 1, pp. 63–74, Jan. 2009.
[18] E.-H. Yang and L. Wang, "Method, system, and computer program product for optimization of data compression with cost function," U.S. Patent 7 570 827, Aug. 4, 2009.
[19] E.-H. Yang and X. Yu, "Rate distortion optimization for H.264 inter-frame video coding: A general framework and algorithms," IEEE Trans. Image Process., vol. 16, no. 7, pp. 1774–1784, Jul. 2007.
[20] E.-H. Yang and X. Yu, "Soft decision quantization for H.264 with main profile compatibility," IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 1, pp. 122–127, Jan. 2009.
[21] E.-H. Yang and Z. Zhang, "On the redundancy of lossy source coding with abstract alphabets," IEEE Trans. Inf. Theory, vol. 45, no. 4, pp. 1092–1110, May 1999.
[22] E.-H. Yang, L. Zheng, D.-K. He, and Z. Zhang, "On the rate distortion theory for causal video coding," in Proc. 2009 Information Theory and Applications Workshop, San Diego, CA, Feb. 8–13, 2009, pp. 385–391.
[23] E.-H. Yang, L. Zheng, Z. Zhang, and D.-K. He, "A computation approach to the minimum total rate problem of causal video coding," in Proc. 2009 IEEE Int. Symp. Inf. Theory, Seoul, Korea, Jun./Jul. 2009, pp. 2141–2145.
[24] R. W. Yeung and T. Berger, "Multi-way alternating minimization," in Proc. 1995 IEEE Int. Symp. Inf. Theory, Whistler, BC, Canada, Sep. 17–22, 1995.
[25] L. Zheng and E.-H. Yang, "Causal video coding theorem for ergodic sources," in preparation.

En-Hui Yang (M'97–SM'00–F'08) received the B.S. degree in applied mathematics from HuaQiao University, Quanzhou, China, and the Ph.D. degree in mathematics from Nankai University, Tianjin, China, in 1986 and 1991, respectively.
Since June 1997, he has been with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada, where he is currently a Professor and Canada Research Chair in information theory and multimedia compression. He held a Visiting Professor position at the Chinese University of Hong Kong, Hong Kong, from September 2003 to June 2004; positions of Research Associate and Visiting Scientist at the University of Minnesota, Minneapolis–St. Paul, the University of Bielefeld, Bielefeld, Germany, and the University of Southern California, Los Angeles, from January 1993 to May 1997; and a faculty position (first as an Assistant Professor and then an Associate Professor) at Nankai University, Tianjin, China, from 1991 to 1992. He is the founding Director of the Leitch–University of Waterloo multimedia communications lab, and a Co-Founder of SlipStream Data Inc. (now a subsidiary of Research In Motion). His current research interests are multimedia compression, multimedia watermarking, multimedia transmission, digital communications, information theory, source and channel coding including distributed source coding, and image and video coding.
Dr. Yang is a recipient of several research awards, including the 1992 Tianjin Science and Technology Promotion Award for Young Investigators; the 1992
third Science and Technology Promotion Award of the Chinese Ministry of Education; the 2000 Ontario Premier's Research Excellence Award, Canada; the 2000 Marsland Award for Research Excellence, University of Waterloo; the 2002 Ontario Distinguished Researcher Award; the prestigious Inaugural (2007) Premier's Catalyst Award for the Innovator of the Year; and the 2007 Ernest C. Manning Award of Distinction, one of Canada's most prestigious innovation prizes. Products based on his inventions and commercialized by SlipStream received the 2006 Ontario Global Traders Provincial Award. With over 170 papers and many patents/patent applications, products with his inventions inside are used daily by tens of millions of people worldwide. He is a Fellow of the Canadian Academy of Engineering and a Fellow of the Royal Society of Canada: the Academies of Arts, Humanities and Sciences of Canada. He served, among many other roles, as a General Co-Chair of the 2008 IEEE International Symposium on Information Theory, an Associate Editor for the IEEE TRANSACTIONS ON INFORMATION THEORY, a Technical Program Vice-Chair of the 2006 IEEE International Conference on Multimedia & Expo (ICME), the Chair of the award committee for the 2004 Canadian Award in Telecommunications, a Co-Editor of the 2004 Special Issue of the IEEE TRANSACTIONS ON INFORMATION THEORY, a Co-Chair of the 2003 U.S. National Science Foundation (NSF) workshop on the interface of Information Theory and Computer Science, and a Co-Chair of the 2003 Canadian Workshop on Information Theory.

Lin Zheng received the B.Eng. degree in electronics and information engineering from Huazhong University of Science and Technology, Wuhan, Hubei, China, in 2004, and the M.S. degree in electrical and computer engineering from the University of Waterloo, Waterloo, ON, Canada, in 2007. She is currently pursuing the Ph.D. degree in electrical and computer engineering at the University of Waterloo.
Her research interests include information theory, data compression, multi-terminal source coding theory and algorithm design, and multimedia communications.


Da-Ke He (S'01–M'06) received the B.S. and M.S. degrees, both in electrical engineering, from Huazhong University of Science and Technology, Wuhan, Hubei, China, in 1993 and 1996, respectively, and the Ph.D. degree in electrical engineering from the University of Waterloo, Waterloo, ON, Canada, in 2003.
From 1996 to 1998, he was with Apple Technology China (Zhuhai) as a software engineer. From 2003 to 2004, he worked in the Department of Electrical and Computer Engineering at the University of Waterloo as a postdoctoral research fellow in the Leitch–University of Waterloo Multimedia Communications Lab. From 2005 to 2008, he was a research staff member in the Department of Multimedia Technologies at the IBM T. J. Watson Research Center, Yorktown Heights, NY. Since 2008, he has been a technical manager at SlipStream Data, a subsidiary of Research In Motion, in Waterloo, ON, Canada.
His research interests are in source coding theory and algorithm design, multimedia data compression and transmission, multi-terminal source coding theory and algorithms, and digital communications.

Zhen Zhang (F'03) received the M.S. degree in mathematics from Nankai University, Tianjin, China, in 1980, the Ph.D. degree in applied mathematics from Cornell University, Ithaca, NY, in 1984, and the Habilitation in mathematics from Bielefeld University, Bielefeld, Germany, in 1988.
He served as a Lecturer in mathematics at Nankai University during 1981–1982. He was a postdoctoral research associate with the School of Electrical Engineering, Cornell University, from 1984 to 1985, and with the Information Systems Laboratory, Stanford University, in the Fall of 1985. From 1986 to 1988, he was with the Mathematics Department, Bielefeld University, Bielefeld, Germany. He joined the faculty of the University of Southern California in 1988, where he is currently a Professor of Electrical Engineering in the Ming Hsieh Department of Electrical Engineering-Systems. His research interests include information theory, coding theory, data compression, network coding theory, combinatorics, and various mathematical problems related to communication sciences.
