
Chapter 2

Receiver Design for Discrete-Time Observations
2.1 Introduction
The focus of this and the next chapter is the receiver design. The task of the receiver can be appreciated by considering a very noisy channel. Roughly speaking, this is a channel for which the signal applied at the channel input has little influence on the output. The GPS channel is a good example. Let the channel input be the electrical signal applied to the antenna of a GPS satellite orbiting at an altitude of 20,200 km, and the channel output be the signal at the antenna output of a GPS receiver at sea level. The signal of interest at the output of the receiver antenna is roughly a factor 10 weaker than the ambient noise produced by various sources, including other GPS satellites, interference from nearby electronic equipment, and the thermal noise present in all conductors. If we were to observe the receiver antenna output signal with a general-purpose instrument, such as an oscilloscope or a spectrum analyzer, we would not be able to distinguish the signal from the noise. Yet, most of the time the receiver manages to reproduce the bit sequence transmitted by a satellite of interest. This is the result of a clever operation that takes into account the source statistics, the signal's structure, and the channel statistics.
It is instructive to start with the family of channel models that produce n-tuple outputs. This chapter is devoted to decisions based on the output of such channels. Although we develop a general theory, the discrete-time additive white Gaussian noise (AWGN) channel will receive special attention, as it is the most prominent special case. In so doing, by the end of the chapter we will have derived the receiver for the first layer of Figure 1.3.
Figure 2.1 depicts the communication system considered in this chapter. Its components
are:
A Source: The source (not represented in the figure) produces the message to be transmitted. In a typical application, the message consists of a sequence of bits, but this detail is not fundamental for the theory developed in this chapter. What is fundamental is that the source chooses one message from a set of possible messages. We are free to choose the label we assign to the various messages and our choice is based on mathematical convenience. For now the mathematical model of a source is as follows. If there are m possible choices, we model the source as a random variable H that takes values in the message set H = {0, 1, . . . , m−1}. More often than not all messages are assumed to have the same probability, but for generality we allow message i to be chosen with probability P_H(i). We visualize the mechanism that produces the source output as a random experiment that takes place inside the source and yields the outcome H = i ∈ H with probability P_H(i). The message set H and the probability distribution P_H are assumed to be known to the communication system designer.
A Channel: The channel is specified by the input alphabet X, the output alphabet Y, and by the output distribution conditioned on the input. In other words, for each possible channel input x ∈ X, we assume that we know the distribution of the channel output. If the channel output alphabet Y is discrete, then the output distribution conditioned on the input is the probability distribution p_{Y|X}(·|x), x ∈ X; if Y is continuous, then it is the probability density function f_{Y|X}(·|x), x ∈ X. In most examples, X is either the binary alphabet {0, 1} or the set R of real numbers. It can also be the set C of complex numbers, but we have to wait until Chapter 7 to understand why.
A Transmitter: The transmitter is a mapping from the message set H = {0, 1, . . . , m−1} to the signal set C = {c_0, c_1, . . . , c_{m−1}}, where c_i ∈ X^n for some n. We need the transmitter to connect two alphabets that typically are incompatible, namely H and X^n. From this point of view, the transmitter is simply a sort of connector. There is another, more subtle task accomplished by the transmitter. A well-designed transmitter makes it possible for the receiver to meet the desired error probability. Towards this goal, the elements of the signal set C are such that a well-designed receiver observing the channel output's reaction can tell (with high probability) which signal from C has excited the channel input.
A Receiver: The receiver's task is to guess H from the channel output Y ∈ Y^n. Since the transmitter map is always one-to-one, it is more realistic to picture the receiver as guessing the channel input signal and declaring the signal's index as the message guess. We use î to represent the guess made by the receiver. Like the message, the guess of the message is the outcome of a random experiment. The corresponding random variable is denoted by Ĥ ∈ H. Unless specified otherwise, the receiver will always be designed to minimize the probability of error, denoted P_e and defined as the probability that Ĥ differs from H. Guessing the value of a discrete random variable H from the value of a related random variable Y is a so-called hypothesis testing problem that comes up in various contexts. We are interested in hypothesis testing to design communication systems, but it can also be used in other applications, for instance to develop a fire detector.
First we give a few examples.
Example 2. A common source model consists of H = {0, 1} and P_H(0) = P_H(1) = 1/2. This models individual bits of, say, a file. Alternatively, one could model an entire file of, say, 1 Mbit by saying that H = {0, 1, . . . , 2^(10^6) − 1} and P_H(i) = 2^(−10^6), i ∈ H.
Example 3. A transmitter for a binary source could be a map from H = {0, 1} to C = {a, −a} for some real-valued constant a. This is a valid choice if the channel input alphabet X is R. Alternatively, a transmitter for a 4-ary source could be a map from H = {0, 1, 2, 3} to C = {a, ja, −a, −ja}, where j = √−1. This is a valid choice if X is C.
Example 4. The channel model that we will use frequently in this chapter is the one that maps a signal c ∈ R^n into Y = c + Z, where Z is a Gaussian random vector of independent and identically distributed components. As we will see later, this is the discrete-time equivalent of the baseband continuous-time channel called the additive white Gaussian noise (AWGN) channel. For this reason, following common practice, we will refer to both as additive white Gaussian noise (AWGN) channels.
The chapter is organized as follows. We first learn the basic ideas behind hypothesis testing, the field that deals with the problem of guessing the outcome of a random variable based on the observation of another random variable. Then we study the Q function as it is a very valuable tool in dealing with communication problems that involve Gaussian noise. At that point, we will be ready to consider the problem of communicating across the additive white Gaussian noise channel. We will first consider the case that involves two messages and scalar signals, then the case of two messages and n-tuple signals, and finally the case of an arbitrary number m of messages and n-tuple signals. The last part of the chapter deals with techniques to bound the error probability when an exact expression is unknown or too complex.
A point about terminology needs to be clarified. It might seem awkward that we use the notation c_i ∈ C to denote the transmitter output signal for message i. We do so because the transmitter of this chapter will become the encoder in the next and subsequent chapters. When the input is i, the output of the encoder is the codeword c_i and the set of codewords is the codebook C. This is the reasoning behind our notation.
[Figure 2.1: General setup for Chapter 2. The transmitter maps the message i ∈ H to a signal c_i ∈ C ⊂ X^n, the channel produces the output Y ∈ Y^n, and the receiver outputs the guess Ĥ ∈ H.]
2.2 Hypothesis Testing
Hypothesis testing refers to the problem of deciding which hypothesis (read: event) has occurred based on an observable (read: side information). Expressed in mathematical terms, the problem is to decide the outcome of a random variable H that takes values in a finite alphabet H = {0, 1, . . . , m−1}, based on the outcome of the random variable Y, called the observable.

This problem comes up in various applications under different names. Hypothesis testing is the terminology used in statistics, where the problem is studied from a fundamental point of view. A receiver does hypothesis testing, but communication people call it decoding. An alarm system such as a fire detector also does hypothesis testing, but people would call it detection. A more appealing name for hypothesis testing is decision making. Hypothesis testing, decoding, detection, and decision making are all synonyms.

In communications, the hypothesis H is the message to be transmitted and the observable Y is the channel output. The receiver guesses H based on Y, assuming that both the distribution of H and the conditional distribution of Y given H are known. They represent the information we have about the source and the statistical dependence between the source and the observable, respectively.
The receiver's decision will be denoted by Ĥ. If we could, we would ensure that Ĥ = H, but this is generally not possible. The goal is to devise a decision strategy that maximizes the probability P_c = Pr{Ĥ = H} that the decision is correct.¹

We will always assume that we know the a priori probability P_H and that for each i ∈ H we know the conditional probability density function² (pdf) of Y given H = i, denoted by f_{Y|H}(·|i).
Hypothesis testing is at the heart of the communication problem. As described by Claude Shannon in the introduction to what is arguably the most influential paper ever written on the subject [4], "The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point."
Example 5. As a typical example of a hypothesis testing problem, consider the problem of communicating one bit of information across an optical fiber. The bit being transmitted is modeled by the random variable H ∈ {0, 1}, with P_H(0) = 1/2. If H = 1, we switch on an LED and its light is carried across the optical fiber to a photodetector at the n-tuple former. The photodetector outputs the number of photons Y ∈ N it detects. The problem is to decide whether H = 0 (the LED is off) or H = 1 (the LED is on). Our decision can only be based on whatever prior information we have about the model and on the actual observation y. What makes the problem interesting is that it is impossible to determine H from Y with certainty. Even if the LED is off, the detector is likely to detect some photons (e.g. due to ambient light). A good assumption is that Y is Poisson distributed with intensity λ, which depends on whether the LED is on or off. Mathematically, the situation is as follows:

    H = 0 :  Y ∼ P_{Y|H}(y|0) = (λ_0^y / y!) e^{−λ_0}
    H = 1 :  Y ∼ P_{Y|H}(y|1) = (λ_1^y / y!) e^{−λ_1},

where 0 ≤ λ_0 < λ_1. We read the above as follows: when H = 0, the observable Y is Poisson distributed with intensity λ_0; when H = 1, Y is Poisson distributed with intensity λ_1.

¹ Pr{·} is a short-hand for the probability of the enclosed event.
² In most cases of interest in communication, the random variable Y is continuous. That is why in the above discussion we have implicitly assumed that, given H = i, Y has a pdf f_{Y|H}(·|i). If Y is a discrete random variable, then we assume that we know the conditional probability mass function p_{Y|H}(·|i).
Once again, the problem of deciding the value of H from the observable Y is a standard
hypothesis testing problem. It will always be assumed that the distribution of H and
that of Y for each value of H are known to the decision maker.
From P_H and f_{Y|H}, via Bayes' rule, we obtain

    P_{H|Y}(i|y) = P_H(i) f_{Y|H}(y|i) / f_Y(y),

where f_Y(y) = Σ_i P_H(i) f_{Y|H}(y|i). In the above expression, P_{H|Y}(i|y) is the posterior (also called the a posteriori probability of H given Y). By observing Y = y, the probability that H = i goes from P_H(i) to P_{H|Y}(i|y).
If we choose Ĥ = i, then the probability that we made the correct decision is the probability that H = i, i.e., P_{H|Y}(i|y). As our goal is to maximize the probability of being correct, the optimum decision rule is

    Ĥ(y) = arg max_i P_{H|Y}(i|y)    (MAP decision rule),    (2.1)

where arg max_i g(i) stands for one of the arguments i for which the function g(i) achieves its maximum. The above is called the maximum a posteriori (MAP) decision rule. In case of ties, i.e. if P_{H|Y}(j|y) equals P_{H|Y}(k|y) equals max_i P_{H|Y}(i|y), then it does not matter if we decide for Ĥ = k or for Ĥ = j. In either case, the probability that we have decided correctly is the same.
Because the MAP rule maximizes the probability of being correct for each observation y, it also maximizes the unconditional probability P_c of being correct. The former is P_{H|Y}(Ĥ(y)|y). If we plug in the random variable Y instead of y, then we obtain a random variable. (A real-valued function of a random variable is a random variable.) The expected value of this random variable is the (unconditional) probability of being correct, i.e.,

    P_c = E[ P_{H|Y}(Ĥ(Y)|Y) ] = ∫_y P_{H|Y}(Ĥ(y)|y) f_Y(y) dy.    (2.2)
There is an important special case, namely when H is uniformly distributed. In this case P_{H|Y}(i|y), as a function of i, is proportional to f_{Y|H}(y|i)/m. Therefore, the argument that maximizes P_{H|Y}(i|y) also maximizes f_{Y|H}(y|i). Then the MAP decision rule is equivalent to the maximum likelihood (ML) decision rule:

    Ĥ(y) = arg max_i f_{Y|H}(y|i)    (ML decision rule).    (2.3)

Notice that the ML decision rule does not require the prior P_H. For this reason it is the solution of choice when the prior is not known.
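To make the MAP and ML rules concrete, here is a minimal Python sketch for the photodetector model of Example 5. The intensities (1 and 10) and the prior are illustrative choices of ours, not values from the text.

```python
import math

def map_decision(y, lam0, lam1, p0):
    """MAP guess of H for the Poisson photodetector model of Example 5."""
    # The posterior is proportional to P_H(i) * P_{Y|H}(y|i); the common
    # factor 1/y! cancels between the two hypotheses and is omitted.
    score0 = p0 * (lam0 ** y) * math.exp(-lam0)
    score1 = (1 - p0) * (lam1 ** y) * math.exp(-lam1)
    return 0 if score0 >= score1 else 1   # ties resolved in favor of 0

def ml_decision(y, lam0, lam1):
    """ML guess: the MAP rule with a uniform prior."""
    return map_decision(y, lam0, lam1, p0=0.5)

# Illustrative intensities: LED off -> 1 photon on average, LED on -> 10.
for y in range(8):
    print(y, "->", ml_decision(y, lam0=1.0, lam1=10.0))
```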
2.2.1 Binary Hypothesis Testing
The special case in which we have to make a binary decision, i.e., H = {0, 1}, is both
instructive and of practical relevance. We begin with it and generalize in the next section.
As there are only two alternatives to be tested, the MAP test may now be written as follows: decide Ĥ = 1 if

    f_{Y|H}(y|1) P_H(1) / f_Y(y)  ≥  f_{Y|H}(y|0) P_H(0) / f_Y(y),

and decide Ĥ = 0 otherwise.
Observe that the denominator is irrelevant: f_Y(y) is a positive constant and hence does not affect the decision. Thus an equivalent decision rule is to decide Ĥ = 1 if

    f_{Y|H}(y|1) P_H(1)  ≥  f_{Y|H}(y|0) P_H(0),

and Ĥ = 0 otherwise. The above test is depicted in Figure 2.2, assuming y ∈ R. This is a very important figure that helps us visualize what goes on and, as we will see, will be helpful to compute the probability of error.
[Figure 2.2: Binary MAP decision. The two curves are f_{Y|H}(y|0)P_H(0) and f_{Y|H}(y|1)P_H(1) plotted versus y; the decision regions R_0 and R_1 are the values of y (abscissa) on the left and right of the dashed threshold line, respectively.]
The above test is insightful as it shows that we are comparing posteriors after rescaling them by canceling the positive number f_Y(y) from the denominator. However, from an algorithmic point of view, the test is not the most efficient one. An equivalent test is obtained by dividing both sides by the non-negative quantity f_{Y|H}(y|0)P_H(1). This results in the following binary MAP test: decide Ĥ = 1 if

    Λ(y) = f_{Y|H}(y|1) / f_{Y|H}(y|0)  ≥  P_H(0) / P_H(1) = η,    (2.4)

and Ĥ = 0 otherwise. The left side of the above test is called the likelihood ratio, denoted by Λ(y), whereas the right side is the threshold η. Notice that if P_H(0) increases, so does the threshold. In turn, as we would expect, the region {y : Ĥ(y) = 0} becomes larger.
When P_H(0) = P_H(1) = 1/2 the threshold becomes unity and the MAP test becomes a binary ML test: decide Ĥ = 1 if

    f_{Y|H}(y|1)  ≥  f_{Y|H}(y|0),

and Ĥ = 0 otherwise.
A function Ĥ : Y → H is called a decision function (also called a decoding function). One way to describe a decision function is by means of the decision regions R_i = {y ∈ Y : Ĥ(y) = i}, i ∈ H. Hence R_i is the set of y ∈ Y for which Ĥ(y) = i.
To compute the probability of error, it is often convenient to compute the error probability for each hypothesis and then take the average. When H = 0, we make an incorrect decision if Y ∈ R_1 or, equivalently, if Λ(y) ≥ η. Hence, denoting by P_e(i) the probability of making an error when H = i,

    P_e(0) = Pr{Y ∈ R_1 | H = 0} = ∫_{R_1} f_{Y|H}(y|0) dy    (2.5)
           = Pr{Λ(Y) ≥ η | H = 0}.    (2.6)

Whether it is easier to work with the right side of (2.5) or that of (2.6) depends on whether it is easier to work with the conditional density of Y or of Λ(Y). We will see examples of both cases.
Similar expressions hold for the probability of error conditioned on H = 1, denoted by P_e(1). The unconditional error probability is then

    P_e = P_e(1) P_H(1) + P_e(0) P_H(0).
In deriving the probability of error we have tacitly used an important technique that we use all the time in probability: conditioning as an intermediate step. Conditioning as an intermediate step may be seen as a divide-and-conquer strategy. The idea is to solve a problem that seems hard by breaking it up into subproblems that (i) we know how to solve and (ii) once we have the solution to the subproblems we also have the solution to the original problem. Here is how it works in probability. We want to compute the expected value of a random variable Z. Assume that it is not immediately clear how to compute the expected value of Z, but we know that Z is related to another random variable W that tells us something useful about Z: useful in the sense that for every value w we are able to compute the expected value of Z given W = w. Then, via the theorem of total expectation, we compute E[Z] = Σ_w E[Z | W = w] P_W(w). The same principle applies for probabilities. (This is not a coincidence: the probability of an event is the expected value of the indicator function³ of that event.) For probabilities the expression is Pr{Z ∈ A} = Σ_w Pr{Z ∈ A | W = w} P_W(w).

Let us revisit what we have done in light of the above comments and what else we could have done. The computation of the probability of error involves two random variables, H and Y, as well as an event {H ≠ Ĥ}. To compute the probability of error (2.5) we have first conditioned on all possible values of H. Alternatively, we could have conditioned on all possible values of Y. This is indeed a viable alternative. In fact we have already done so (without saying it) in (2.2). Between the two, we use the one that seems more promising for the problem at hand. We will see examples of both.
2.2.2 m-ary Hypothesis Testing
Now we go back to the m-ary hypothesis testing problem. This means that H = {0, 1, . . . , m−1}.

Recall that the MAP decision rule, which minimizes the probability of making an error, is

    Ĥ_MAP(y) = arg max_i P_{H|Y}(i|y)
             = arg max_i f_{Y|H}(y|i) P_H(i) / f_Y(y)
             = arg max_i f_{Y|H}(y|i) P_H(i),

where f_{Y|H}(·|i) is the probability density function of the observable Y when the hypothesis is i and P_H(i) is the probability of the i-th hypothesis. This rule is well defined up to ties. If there is more than one i that achieves the maximum on the right side of one (and thus all) of the above expressions, then we may decide for any such i without affecting the probability of error. If we want the decision rule to be unambiguous, we can for instance agree that in case of ties we choose the largest i that achieves the maximum.

³ See Section 2.6.2 for a formal definition of the indicator function.
When all hypotheses have the same probability, then the MAP rule specializes to the ML rule, i.e.,

    Ĥ_ML(y) = arg max_i f_{Y|H}(y|i).
We will always assume that f_{Y|H} is either given as part of the problem formulation or that it can be figured out from the setup. In communications, we typically know the transmitter and the channel. In this chapter, the transmitter is the map from H to C ⊂ X^n and the channel is described by the pdf f_{Y|X}(y|x), known for all x ∈ X^n and all y ∈ Y^n. From these two, we immediately obtain f_{Y|H}(y|i) = f_{Y|X}(y|c_i), where c_i is the signal assigned to i.
Note that the decoding (or decision) function Ĥ assigns an i ∈ H to each y ∈ R^n. As already mentioned, it can be described by the decoding (or decision) regions R_i, i ∈ H, where R_i consists of those y for which Ĥ(y) = i. It is convenient to think of R^n as being partitioned by decoding regions as depicted in the following figure.

[Figure: R^n partitioned into the decoding regions R_0, R_1, . . . , R_i, . . . , R_{m−1}.]
We use the decoding regions to express the error probability P_e or, equivalently, the probability P_c = 1 − P_e of deciding correctly. Conditioned on H = i we have

    P_e(i) = 1 − P_c(i) = 1 − ∫_{R_i} f_{Y|H}(y|i) dy.
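Once P_H and f_{Y|H} are available, the m-ary MAP rule is a one-line argmax. The sketch below is a generic illustration under the assumption that the likelihoods are supplied as Python callables; it is not tied to any particular channel.

```python
import numpy as np

def map_decode(y, priors, likelihoods):
    """MAP decision: argmax_i f_{Y|H}(y|i) * P_H(i).

    priors      : array of P_H(i), i = 0, ..., m-1
    likelihoods : list of callables, likelihoods[i](y) = f_{Y|H}(y|i)
    """
    scores = np.array([p * f(y) for p, f in zip(priors, likelihoods)])
    return int(np.argmax(scores))   # ties resolved toward the smallest index

# Illustrative example: two Gaussian hypotheses with means -1 and +1, variance 1.
gauss = lambda m: (lambda y: np.exp(-(y - m) ** 2 / 2) / np.sqrt(2 * np.pi))
print(map_decode(0.3, priors=[0.5, 0.5], likelihoods=[gauss(-1.0), gauss(1.0)]))
```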
2.3 The Q Function
The Q function plays a very important role in communications. It will come up frequently throughout this text. It is defined as

    Q(x) := (1/√(2π)) ∫_x^∞ e^{−ξ²/2} dξ.

Hence if Z is a normally distributed zero-mean random variable of unit variance, denoted by Z ∼ N(0, 1), then Pr{Z ≥ x} = Q(x). (The Q function has been defined specifically to make this true.)
If Z is normally distributed with mean m and variance σ², denoted by Z ∼ N(m, σ²), the probability Pr{Z ≥ x} can also be written using the Q function. In fact the event {Z ≥ x} is equivalent to {(Z − m)/σ ≥ (x − m)/σ}. But (Z − m)/σ ∼ N(0, 1). Hence Pr{Z ≥ x} = Q((x − m)/σ). This result will be used frequently. It should be memorized.
We now describe some of the key properties of the Q function.

(a) If Z ∼ N(0, 1), then F_Z(z) := Pr{Z ≤ z} = 1 − Q(z). (The reader is advised to draw a picture that expresses this relationship in terms of areas under the probability density function of Z.)

(b) Q(0) = 1/2, Q(−∞) = 1, Q(∞) = 0.

(c) Q(−x) + Q(x) = 1. (Again, it is advisable to draw a picture.)

(d) (1/(√(2π) α)) e^{−α²/2} (1 − 1/α²) < Q(α) < (1/(√(2π) α)) e^{−α²/2},  α > 0.

(e) An alternative expression for the Q function with fixed integration limits is
    Q(x) = (1/π) ∫_0^{π/2} e^{−x²/(2 sin²θ)} dθ. It holds for x ≥ 0.

(f) Q(α) ≤ (1/2) e^{−α²/2},  α ≥ 0.
Proofs: The proofs of (a), (b), and (c) are immediate (a picture suffices). The proof of part (d) is left as an exercise (see Problem 26). To prove (e), let X ∼ N(0, 1) and Y ∼ N(0, 1) be independent. Hence Pr{X ≥ 0, Y ≥ α} = Q(0)Q(α) = Q(α)/2. Using polar coordinates,

    Q(α)/2 = ∫_0^{π/2} ∫_{α/sin θ}^∞ (e^{−r²/2} / (2π)) r dr dθ
           = (1/(2π)) ∫_0^{π/2} ∫_{α²/(2 sin²θ)}^∞ e^{−t} dt dθ
           = (1/(2π)) ∫_0^{π/2} e^{−α²/(2 sin²θ)} dθ.

To prove (f) we use (e) and the fact that e^{−α²/(2 sin²θ)} ≤ e^{−α²/2} for θ ∈ [0, π/2]. Hence

    Q(α) ≤ (1/π) ∫_0^{π/2} e^{−α²/2} dθ = (1/2) e^{−α²/2}.
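Numerically, the Q function is most conveniently evaluated through the complementary error function, Q(x) = erfc(x/√2)/2. The short sketch below also checks bounds (d) and (f) for a few values; it is only a numerical illustration.

```python
import math

def Q(x):
    """Q(x) = Pr{Z >= x} for Z ~ N(0,1), via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

for a in [0.5, 1.0, 2.0, 4.0]:
    upper_d = math.exp(-a * a / 2) / (math.sqrt(2 * math.pi) * a)   # upper bound in (d)
    lower_d = upper_d * (1 - 1 / a**2)                              # lower bound in (d)
    bound_f = 0.5 * math.exp(-a * a / 2)                            # bound in (f)
    print(f"a={a}: Q={Q(a):.3e}  (d): {lower_d:.3e} < Q < {upper_d:.3e}  (f): Q <= {bound_f:.3e}")
```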
2.4 Receiver Design for Discrete-Time AWGN Channels
The hypothesis testing problem discussed in this section is key in digital communication. The setup is depicted in Figure 2.3. The hypothesis H ∈ {0, . . . , m−1} represents a randomly selected message. The transmitter maps H = i to a signal n-tuple c_i ∈ R^n. The channel adds a random (noise) vector Z that has independent and identically distributed Gaussian components. The observable available to the receiver is Y = c_i + Z.
We begin with the simplest possible situation, specifically when there are only two equiprobable messages and the signals are scalar (n = 1). Then we generalize to arbitrary values of n and finally we consider arbitrary values also for the cardinality m of the message set.
2.4.1 Binary Decision for Scalar Observations
Let the message H ∈ {0, 1} be equiprobable and assume that the transmitter maps H = 0 into a ∈ R and H = 1 into b ∈ R. The output statistic for the various hypotheses is as follows:

    H = 0 :  Y ∼ N(a, σ²)
    H = 1 :  Y ∼ N(b, σ²).
An equivalent way to express the output statistic for each hypothesis is

    f_{Y|H}(y|0) = (1/√(2πσ²)) exp( −(y − a)² / (2σ²) )
    f_{Y|H}(y|1) = (1/√(2πσ²)) exp( −(y − b)² / (2σ²) ).
We compute the likelihood ratio

    Λ(y) = f_{Y|H}(y|1) / f_{Y|H}(y|0)
         = exp( −[(y − b)² − (y − a)²] / (2σ²) )
         = exp( ((b − a)/σ²) (y − (a + b)/2) ).    (2.7)
The threshold is η = P_H(0)/P_H(1). Now we have all the ingredients for the MAP rule. Instead of comparing Λ(y) to the threshold η, we can compare ln Λ(y) to ln η.

[Figure 2.3: Communication over the discrete-time additive white Gaussian noise channel. The transmitter maps i ∈ {0, . . . , m−1} to c_i ∈ R^n; the channel adds Z ∼ N(0, σ²I_n) to produce Y = c_i + Z; the receiver outputs Ĥ.]
The function ln Λ(y) is called the log-likelihood ratio. Hence the MAP decision rule can be expressed as: decide Ĥ = 1 if

    ((b − a)/σ²) (y − (a + b)/2)  ≥  ln η,

and Ĥ = 0 otherwise. The progress consists in the fact that the receiver no longer computes an exponential function of the observation. It has to compute ln η, but this is a negligible effort as it is done once and for all.
Without loss of essential generality (w.l.o.g.), assume b > a. Then we can divide both sides by (b − a)/σ² without changing the outcome of the above comparison. We can further simplify by moving (a + b)/2 to the right. The result is the simple test

    Ĥ_MAP(y) = 1 if y > θ, and 0 otherwise,

where θ = (σ²/(b − a)) ln η + (a + b)/2. Notice that if P_H(0) = P_H(1), then ln η = 0 and the threshold becomes the midpoint (a + b)/2.
[Figure 2.4: The densities f_{Y|H}(y|0) and f_{Y|H}(y|1), centered at a and b, with the threshold at the midpoint (a + b)/2. The shaded area represents the probability of error P_e = Q(d/(2σ)) when H = 0 and P_H(0) = P_H(1).]
We now determine the probability of error.

    P_e(0) = Pr{Y > θ | H = 0} = ∫_θ^∞ f_{Y|H}(y|0) dy.

This is the probability that a Gaussian random variable with mean a and variance σ² exceeds the threshold θ. From our review of the Q function we know immediately that P_e(0) = Q((θ − a)/σ). Similarly, P_e(1) = Q((b − θ)/σ). Finally,

    P_e = P_H(0) Q((θ − a)/σ) + P_H(1) Q((b − θ)/σ).
The most common case is when P_H(0) = P_H(1) = 1/2. Then

    (θ − a)/σ = (b − θ)/σ = (b − a)/(2σ) = d/(2σ),

where d is the distance between a and b. In this case

    P_e = Q( d/(2σ) ).
This result can be obtained straightforwardly without side calculations. As shown in Figure 2.4, the threshold is the middle point between a and b and P_e = P_e(0) = Q(d/(2σ)), where d is the distance between a and b.
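A quick Monte Carlo check of the scalar result, with illustrative values a = −1, b = +1, σ = 1 and equal priors: the empirical error rate should be close to Q(d/(2σ)) = Q(1) ≈ 0.159.

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(0)
a, b, sigma, N = -1.0, 1.0, 1.0, 200_000
theta = (a + b) / 2.0                          # MAP threshold for equal priors

H = rng.integers(0, 2, size=N)                 # equiprobable messages
Y = np.where(H == 0, a, b) + sigma * rng.normal(size=N)
H_hat = (Y > theta).astype(int)                # decide 1 if Y > theta

d = b - a
Pe_theory = 0.5 * erfc((d / (2 * sigma)) / sqrt(2))   # Q(d / (2 sigma))
print("simulated P_e:", np.mean(H_hat != H), "  theory:", Pe_theory)
```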
2.4.2 Binary Decision for n-Tuple Observations
As in the previous subsection, we assume that H is equiprobable in {0, 1}. What is new is that the signals are now n-tuples for n ≥ 1. So when H = 0, the transmitter sends some a ∈ R^n and when H = 1, it sends b ∈ R^n. The noise added by the channel is Z ∼ N(0, σ²I_n) and independent of H.

From now on we assume that the reader is familiar with the notions of inner product, norm, and affine plane. (See Appendix 2.E for a review.) The inner product between the vectors u and v will be denoted by ⟨u, v⟩, whereas ‖u‖ is the norm of u. We also assume familiarity with the definition and basic results related to Gaussian random vectors. (See Appendix 2.C for a review.)
As we did earlier, to derive a ML decision rule, we start by writing down the output statistic for each hypothesis:

    H = 0 :  Y = a + Z ∼ N(a, σ²I_n)
    H = 1 :  Y = b + Z ∼ N(b, σ²I_n),

or, equivalently,

    H = 0 :  Y ∼ f_{Y|H}(y|0) = (1/(2πσ²)^{n/2}) exp( −‖y − a‖² / (2σ²) )
    H = 1 :  Y ∼ f_{Y|H}(y|1) = (1/(2πσ²)^{n/2}) exp( −‖y − b‖² / (2σ²) ).
Like in the scalar case, we compute the likelihood ratio

    Λ(y) = f_{Y|H}(y|1) / f_{Y|H}(y|0) = exp( [‖y − a‖² − ‖y − b‖²] / (2σ²) ).

Taking the logarithm on both sides and using the relationship ⟨u + v, u − v⟩ = ‖u‖² − ‖v‖², which holds for real-valued vectors u and v, we obtain

    LLR(y) = (‖y − a‖² − ‖y − b‖²) / (2σ²)    (2.8)
           = ⟨2y − (a + b), b − a⟩ / (2σ²)
           = ⟨y − (a + b)/2, (b − a)/σ²⟩    (2.9)
           = ⟨y, (b − a)/σ²⟩ + (‖a‖² − ‖b‖²)/(2σ²).    (2.10)
From (2.10), the MAP rule is: decide Ĥ = 1 if

    ⟨y, b − a⟩ ≥ T,    (2.11)

and Ĥ = 0 otherwise, where T = σ² ln η + (‖b‖² − ‖a‖²)/2 is a threshold and η = P_H(0)/P_H(1). This says that the decision regions R_0 and R_1 are separated by the affine plane

    { y ∈ R^n : ⟨y, b − a⟩ = T }.
We obtain additional insight by analyzing (2.8) and (2.9). To find the shape of the boundary between R_0 and R_1, we inspect the values of y for which (2.8) is constant. As shown by the figure below on the left, the set of points y for which (2.8) is constant is an affine plane. Indeed, by Pythagoras' theorem, ‖y − a‖² − ‖y − b‖² equals p² − q² (which is constant) for all y on an affine plane perpendicular to the linear space spanned by b − a. Alternatively, we can check the values of y for which (2.9) is constant. As shown in the figure on the right, rule (2.9) performs the projection of y − (a + b)/2 onto the linear space spanned by b − a. Once again, the set of points for which this projection is constant is an affine plane perpendicular to the linear space spanned by b − a.
[Figure: on the left, a point y at distance p from a and q from b; all such y lie on an affine plane perpendicular to the line through a and b. On the right, the projection of y − (a + b)/2 onto the linear space spanned by b − a; the points with a given projection form an affine plane.]
To find the distance p of the affine plane from a as a function of the threshold T, consider the point y = a + p (b − a)/‖b − a‖. This is the point y on the segment from a to b at distance p from a. If we insert it into (2.11) and solve for p we obtain

    p = d/2 + (σ² ln η)/d,        q = d/2 − (σ² ln η)/d,

where we have defined d = ‖b − a‖ and have used the fact that q = d − p.
As for the scalar case, we are particularly interested in P_H(0) = P_H(1) = 1/2. In this case the affine plane is the set of points for which (2.8) is 0. These are the points y that are at the same distance from a and from b. Hence R_0 contains all the points y ∈ R^n that are closer to a than to b.
A few additional observations are in order.

The separating affine plane moves towards b when the threshold T increases, which is the case when P_H(0)/P_H(1) increases. This makes sense. It corresponds to our intuition that the decoding region R_0 should become larger if the prior probability becomes more in favor of H = 0.

If P_H(0)/P_H(1) exceeds 1, then ln η is positive and T increases with σ². This also makes sense. If the noise increases, we trust less what we observe and give more weight to the prior, which in this case favors H = 0.

Notice the similarity of (2.8) and (2.9) with the corresponding expressions for the scalar case, i.e., the expressions in the exponent of (2.7). This suggests a tight relationship between the scalar and the vector case. We can gain additional insight by placing the origin of a new coordinate system at (a + b)/2 and by choosing the first coordinate in the direction of b − a. In this new coordinate system, H = 0 is mapped into the vector ã = (−d/2, 0, . . . , 0)^T, where d = ‖b − a‖, H = 1 is mapped into b̃ = (d/2, 0, . . . , 0)^T, and the projection of the observation onto the subspace spanned by b̃ − ã = (d, 0, . . . , 0)^T is just the first component y_1 of y = (y_1, y_2, . . . , y_n)^T. This shows that for two hypotheses the vector case is really a scalar case embedded in an n-dimensional space.
As for the scalar case, we compute the probability of error by conditioning on H = 0 and H = 1 and then remove the conditioning by averaging: P_e = P_e(0)P_H(0) + P_e(1)P_H(1). When H = 0, Y = a + Z and the MAP decoder makes the wrong decision if

    ⟨Y, b − a⟩ ≥ T.

Inserting Y = a + Z, defining the unit-norm vector ψ = (b − a)/‖b − a‖ that points in the direction of b − a, and rearranging terms yields the equivalent condition

    ⟨Z, ψ⟩ ≥ d/2 + (σ² ln η)/d,

where again d = ‖b − a‖. The left-hand side is a zero-mean Gaussian random variable of variance σ². (See Appendix 2.C for a review.) Hence

    P_e(0) = Q( d/(2σ) + (σ ln η)/d ).

Proceeding similarly we find

    P_e(1) = Q( d/(2σ) − (σ ln η)/d ).

In particular, when P_H(0) = 0.5 we obtain

    P_e = P_e(0) = P_e(1) = Q( d/(2σ) ).
The figure below helps us visualize the situation. When H = 0, a MAP decoder makes the wrong decision if the projection of Z onto the subspace spanned by b − a lands on the other side of the separating affine plane. The projection has the form Z_ψ ψ, where Z_ψ = ⟨Z, ψ⟩ ∼ N(0, σ²). The projection lands on the other side of the separating affine plane if Z_ψ ≥ p. This happens with probability Q(p/σ), which corresponds to the result obtained earlier.
[Figure: the vectors a and b, the unit vector ψ in the direction b − a, an observation y = a + Z, and the projection Z_ψ ψ of the noise onto the span of b − a; the error event is Z_ψ ≥ p, where p is the distance from a to the separating affine plane.]
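For n-tuple observations the whole decision reduces to one inner product compared against the threshold T of (2.11). The sketch below uses arbitrary illustrative vectors a, b and an unequal prior, and prints the two conditional error probabilities derived above.

```python
import numpy as np
from math import erfc, sqrt, log

def Q(x):
    return 0.5 * erfc(x / sqrt(2.0))

a = np.array([0.0, 0.0, 0.0])
b = np.array([1.0, 2.0, 2.0])                  # d = ||b - a|| = 3
sigma, p0 = 1.0, 0.7                           # illustrative noise level and prior
eta = p0 / (1 - p0)
T = sigma**2 * log(eta) + (np.dot(b, b) - np.dot(a, a)) / 2.0

def decide(y):
    """MAP rule (2.11): compare <y, b - a> with the threshold T."""
    return 1 if np.dot(y, b - a) >= T else 0

d = np.linalg.norm(b - a)
print("P_e(0) =", Q(d / (2 * sigma) + sigma * log(eta) / d))
print("P_e(1) =", Q(d / (2 * sigma) - sigma * log(eta) / d))
print("decide([0.4, 0.9, 1.1]) ->", decide(np.array([0.4, 0.9, 1.1])))
```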
2.4.3 m-ary Decision for n-Tuple Observations
When H = i, i ∈ H = {0, 1, . . . , m−1}, the channel input is c_i ∈ R^n. For now we make the simplifying assumption that P_H(i) = 1/m, which is a common assumption in communications. Later on we will see that generalizing is straightforward.
When Y = y, the ML decision rule is

    Ĥ_ML(y) = arg max_i f_{Y|H}(y|i)
            = arg max_i (1/(2πσ²)^{n/2}) exp( −‖y − c_i‖² / (2σ²) )
            = arg min_i ‖y − c_i‖.
Hence a ML decision rule for the AWGN channel is a minimum-distance decision rule, as shown in Figure 2.5. Up to ties, R_i corresponds to the Voronoi region of c_i, defined as the set of points in R^n that are at least as close to c_i as to any other c_j.
Example 6. (m-PAM) Figure 2.6 shows the signal points and the decoding regions of a ML decoder for 6-ary Pulse Amplitude Modulation (the meaning of the name will become clear in the next chapter), assuming that the channel is AWGN. The signal points are elements of R, and the ML decoder chooses according to the minimum-distance rule. When the hypothesis is H = 0, the receiver makes the wrong decision if the observation y ∈ R falls outside the decoding region R_0. This is the case if the noise Z ∈ R is larger than d/2, where d = c_i − c_{i−1}, i = 1, . . . , 5. Thus

    P_e(0) = Pr{Z > d/2} = Q( d/(2σ) ).
By symmetry, P_e(5) = P_e(0). For i ∈ {1, 2, 3, 4}, the probability of error when H = i is the probability that the event {Z ≥ d/2} ∪ {Z < −d/2} occurs. This event is the union of disjoint events. Its probability is the sum of the probabilities of the individual events. Hence

    P_e(i) = Pr{ {Z ≥ d/2} ∪ {Z < −d/2} } = 2 Pr{Z ≥ d/2} = 2 Q( d/(2σ) ),   i ∈ {1, 2, 3, 4}.

Finally, P_e = (2/6) Q(d/(2σ)) + (4/6) 2Q(d/(2σ)) = (5/3) Q(d/(2σ)). We see immediately how to generalize. For a constellation of m points, the error probability is

    P_e = (2 − 2/m) Q( d/(2σ) ).
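The closed-form m-PAM expression is easy to tabulate. The sketch below evaluates (2 − 2/m) Q(d/(2σ)) for a few illustrative values of m at a fixed d/σ.

```python
from math import erfc, sqrt

def Q(x):
    return 0.5 * erfc(x / sqrt(2.0))

def pam_error_probability(m, d_over_sigma):
    """Symbol error probability of m-PAM with ML decoding on the AWGN channel."""
    return (2 - 2 / m) * Q(d_over_sigma / 2)

for m in (2, 4, 6, 8):
    print(m, pam_error_probability(m, d_over_sigma=4.0))
```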
Example 7. (4-ary QAM) Figure 2.7 shows the signal set {c_0, c_1, c_2, c_3} ⊂ R² for 4-ary Quadrature Amplitude Modulation (QAM). We consider signals as points in R² or in C. We choose the former because we do not know how to deal with complex-valued noise yet. The noise is Z ∼ N(0, σ²I_2) and the observable, when H = i, is Y = c_i + Z. We assume that the receiver implements a ML decision rule, which for the AWGN channel means minimum-distance decoding. The decoding region for c_0 is the first quadrant, for c_1 the second quadrant, etc. When H = 0, the decoder makes the correct decision if {Z_1 > −d/2} ∩ {Z_2 ≥ −d/2}, where d is the minimum distance among signal points. This is the intersection of independent events. Hence the probability of the intersection is the product of the probability of each event, i.e.

    P_c(0) = ( Pr{Z_i ≥ −d/2} )² = Q²( −d/(2σ) ) = ( 1 − Q(d/(2σ)) )².
[Figure 2.5: Example of Voronoi regions in R² for a three-point constellation c_0, c_1, c_2 with decoding regions R_0, R_1, R_2.]
34 Chapter 2.
-
y
s s s s s s
c
0
c
1
c
2
c
3
c
4
c
5
R
0
R
1
R
2
R
3
R
4
R
5
-
d
Figure 2.6: PAM signal constellation.
[Figure 2.7: QAM signal constellation in R². The four signal points c_0, . . . , c_3 are at (±d/2, ±d/2), one per quadrant; the right part of the figure shows an observation y and the noise components z_1, z_2 relative to the transmitted point.]
By symmetry, for all i, P_c(i) = P_c(0). Hence,

    P_e = P_e(0) = 1 − P_c(0) = 2 Q( d/(2σ) ) − Q²( d/(2σ) ).
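A quick numerical check of the 4-QAM result, with the illustrative choice d = 2 and σ = 1: the simulated error rate of a minimum-distance decoder should match 2Q(d/(2σ)) − Q²(d/(2σ)).

```python
import numpy as np
from math import erfc, sqrt

def Q(x):
    return 0.5 * erfc(x / sqrt(2.0))

rng = np.random.default_rng(1)
d, sigma, N = 2.0, 1.0, 200_000
# The four signal points, one per quadrant (c_0 in the first quadrant).
C = (d / 2) * np.array([[1, 1], [-1, 1], [-1, -1], [1, -1]], dtype=float)

H = rng.integers(0, 4, size=N)
Y = C[H] + sigma * rng.normal(size=(N, 2))
# Minimum-distance (ML) decoding.
dists = np.linalg.norm(Y[:, None, :] - C[None, :, :], axis=2)
H_hat = np.argmin(dists, axis=1)

print("simulated P_e:", np.mean(H_hat != H))
print("theory       :", 2 * Q(d / (2 * sigma)) - Q(d / (2 * sigma)) ** 2)
```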
Exercise 8. In Example 7 we have computed P_c(0), but we could have opted for computing P_e(0) instead. Compute P_e(0) directly from the problem statement and verify that it equals 1 − P_c(0).

Exercise 9. Rotate and translate the signal constellation of Example 7. How would you then determine the error probability?
2.5 Irrelevance and Sufficient Statistic

Have you ever tried to drink from a fire hydrant? There are situations in which the observable Y contains excessive data, but how do we reduce it and be sure that we are not throwing away anything useful for a MAP decision? In this section we derive a test to check if the proposed reduced data is all a MAP receiver needs. We begin by recalling the notion of a Markov chain.
Definition 10. Three random variables U, V, W are said to form a Markov chain in that order, symbolized by U → V → W, if the distribution of W given both U and V is independent of U, i.e., P_{W|V,U}(w|v, u) = P_{W|V}(w|v).
Exercise 11. Verify the correctness of the following statements, which are straightforward consequences of the above Markov chain definition.

(i) U → V → W if and only if P_{U,W|V}(u, w|v) = P_{U|V}(u|v) P_{W|V}(w|v). In words, U, V, W form a Markov chain (in that order) if and only if U and W are independent when conditioned on V.

(ii) U → V → W if and only if W → V → U, i.e., Markovity in one direction implies Markovity in the other direction.
Let Y be the observable and T(Y) be a function (either stochastic or deterministic) of Y. Observe that H → Y → T(Y) is always true, but in general it is not true that H → T(Y) → Y.

Definition 12. Let T(Y) be a random variable obtained from processing an observable Y. If H → T(Y) → Y, then we say that T(Y) is a sufficient statistic (for the hypothesis H).
If T(Y) is a sufficient statistic, then the error probability of a MAP decoder that observes T(Y) is identical to that of a MAP decoder that observes Y. Indeed, for all i ∈ H and all y ∈ Y, P_{H|Y}(i|y) = P_{H|Y,T}(i|y, t) = P_{H|T}(i|t), where t = T(y). Hence if i maximizes P_{H|Y}(·|y), then it also maximizes P_{H|T}(·|t). We state this important result as a theorem.

Theorem 13. If T(Y) is a sufficient statistic for H, then a MAP decoder that estimates H from T(Y) achieves the exact same error probability as one that estimates H from Y.
In some situations we make multiple measurements and want to prove that some of the measurements are relevant for the detection problem and some are not. Specifically, the observable Y may consist of two components Y = (Y_1, Y_2), where Y_1 and Y_2 may be m- and n-tuples, respectively. If T(Y) = Y_1 is a sufficient statistic, then we say that Y_2 is irrelevant. We use the two concepts interchangeably when we have two sets of observables: if one set is a sufficient statistic, the other is irrelevant and vice versa.
Exercise 14. Consider the situation of the previous paragraph, i.e., Y = (Y_1, Y_2). Show that Y_1 is a sufficient statistic (or equivalently Y_2 is irrelevant) if and only if H → Y_1 → Y_2. This result is sometimes called the Theorem of Irrelevance [1]. (Hint: Show that H → Y_1 → Y_2 is equivalent to H → Y_1 → Y.)
Example 15. Consider the communication system depicted in the figure below, where H, Z_1, Z_2 are independent random variables. Then H → Y_1 → Y_2. Hence Y_2 is irrelevant for the purpose of making a MAP decision about H based on the (vector-valued) observable (Y_1, Y_2).
Note that the independence assumption is essential here. For instance, if Z_2 = −Z_1, we can obtain Z_2 (and Z_1) from the difference Y_2 − Y_1. We can then remove Z_1 from Y_1 and obtain H. In this case from (Y_1, Y_2) we can make an error-free decision about H. Clearly Y_2 is not irrelevant in this case.
[Figure: the source output H is corrupted by additive noise Z_1 to give Y_1 = H + Z_1; a second noise Z_2 is added to Y_1 to give Y_2 = Y_1 + Z_2; the receiver observes both Y_1 and Y_2 and produces the guess Ĥ.]
We have seen that H → T(Y) → Y implies that Y is irrelevant to a MAP decoder that observes T(Y). Is the contrary also true? Specifically, assume that a MAP decoder that observes Y always makes the same decision as one that observes only T(Y). Does this imply H → T(Y) → Y?

The answer is "yes and no." We may expect the answer to be no because for H → U → V to hold, it has to be true that P_{H|U,V}(i|u, v) equals P_{H|U}(i|u) for all values of i, u, v, whereas for v to have no effect on a MAP decision it is sufficient that for all u, v the maximum of P_{H|U}(·|u) and that of P_{H|U,V}(·|u, v) be achieved for the same i. It seems clear that the former requirement is stronger. Indeed, in Problem 20 we give an example to show that the answer to the above question is no in general.

The answer becomes yes if the two MAP decoders make the same decision regardless of the distribution on H. We prove this in Problem 22.
Example 16. Regardless of the distribution on H, the binary test (2.4) depends on Y only through the likelihood ratio Λ(Y). Hence the likelihood ratio is a sufficient statistic. Notice that Λ(y) is a scalar even when y is an n-tuple.

The following result is a useful tool in proving that a function T(y) is a sufficient statistic. It is proved in Problem 21.
Theorem 17. (Fisher–Neyman Factorization Theorem) Suppose that there are functions g_i, i ∈ H, and a function h so that for each i ∈ H one can write

    f_{Y|H}(y|i) = g_i(T(y)) h(y).    (2.12)

Then T is a sufficient statistic.
In the following example, we use the notion of an indicator function. Given a set A, the indicator function 1_A is defined as

    1_A(x) = 1 if x ∈ A, and 0 otherwise.
Example 18. Let H ∈ H = {0, 1, . . . , m−1} be the hypothesis and, when H = i, let the components of Y = (Y_1, . . . , Y_n)^T be iid uniformly distributed in [0, i]. We use the Fisher–Neyman factorization theorem to show that T(Y) = max{Y_1, . . . , Y_n} is a sufficient statistic. In fact

    f_{Y|H}(y|i) = (1/i) 1_{[0,i]}(y_1) (1/i) 1_{[0,i]}(y_2) · · · (1/i) 1_{[0,i]}(y_n)
                 = 1/i^n if max{y_1, . . . , y_n} ≤ i and min{y_1, . . . , y_n} ≥ 0, and 0 otherwise
                 = (1/i^n) 1_{[0,i]}( max{y_1, . . . , y_n} ) 1_{[0,∞)}( min{y_1, . . . , y_n} ).

In this case, the Fisher–Neyman factorization theorem applies with g_i(T) = (1/i^n) 1_{[0,i]}(T), where T(y) = max{y_1, . . . , y_n}, and h(y) = 1_{[0,∞)}( min{y_1, . . . , y_n} ).
2.6 Error Probability
2.6.1 Union Bound
Here is a simple and extremely useful bound. Recall that for general events A, B,

    P(A ∪ B) = P(A) + P(B) − P(A ∩ B) ≤ P(A) + P(B).

More generally, using induction, we obtain the union bound

    P( ∪_{i=1}^M A_i ) ≤ Σ_{i=1}^M P(A_i),    (UB)

that applies to any collection of sets A_i, i = 1, . . . , M. We now apply the union bound to approximate the probability of error in multi-hypothesis testing. Recall that

    P_e(i) = Pr{Y ∈ R_i^c | H = i} = ∫_{R_i^c} f_{Y|H}(y|i) dy,
where R_i^c denotes the complement of R_i. If we are able to evaluate the above integral for every i, then we are able to determine the probability of error exactly. The bound that we derive is useful if we are unable to evaluate the above integral.

For i ≠ j define

    B_{i,j} = { y : P_H(j) f_{Y|H}(y|j) ≥ P_H(i) f_{Y|H}(y|i) }.

B_{i,j} is the set of y for which the a posteriori probability of H given Y = y is at least as high for j as it is for i.
[Figure 2.8: The shape of B_{i,j} for AWGN channels and ML decision: the half-plane of points at least as close to c_j as to c_i.]
The following fact is very useful:

    R_i^c ⊆ ∪_{j : j≠i} B_{i,j}.    (2.13)

To see that the above inclusion holds, consider an arbitrary y ∈ R_i^c. By definition, there is at least one k ∈ H such that P_H(k) f_{Y|H}(y|k) ≥ P_H(i) f_{Y|H}(y|i). Hence y ∈ B_{i,k}.

The reader may wonder why we do not have equality in (2.13). To see that equality may or may not apply, consider a y that is in B_{i,l} for some l. It could be so because P_H(l) f_{Y|H}(y|l) = P_H(i) f_{Y|H}(y|i) (notice the equality sign). To simplify the argument, let us assume that for the chosen y there is only one such l. The MAP decoding rule does not prescribe whether y should be in the decoding region of i or of l. If it is in that of i, then equality in (2.13) does not hold. If none of the y for which P_H(l) f_{Y|H}(y|l) = P_H(i) f_{Y|H}(y|i) for some l has been assigned to R_i, then we have equality in (2.13). In one sentence, we have equality if all the ties have been resolved against i.
We are now in the position to upper-bound P_e(i). Using (2.13) and the union bound we obtain

    P_e(i) = Pr{Y ∈ R_i^c | H = i} ≤ Pr{ Y ∈ ∪_{j : j≠i} B_{i,j} | H = i }
           ≤ Σ_{j : j≠i} Pr{Y ∈ B_{i,j} | H = i}    (2.14)
           = Σ_{j : j≠i} ∫_{B_{i,j}} f_{Y|H}(y|i) dy.

The gain is that it is typically easier to integrate over B_{i,j} than over R_i^c.
For instance, when the channel is the AWGN channel and the decision rule is ML, B_{i,j} is the set of points in R^n that are at least as close to c_j as they are to c_i. Figure 2.8 depicts this situation. In this case,

    ∫_{B_{i,j}} f_{Y|H}(y|i) dy = Q( ‖c_j − c_i‖ / (2σ) ),

and the union bound yields the simple expression

    P_e(i) ≤ Σ_{j : j≠i} Q( ‖c_j − c_i‖ / (2σ) ).
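For an arbitrary constellation, the union bound is just a double loop over signal pairs. The sketch below evaluates Σ_{j≠i} Q(‖c_j − c_i‖/(2σ)) for each i and averages over a uniform prior; the 4-point constellation is an illustrative example.

```python
import numpy as np
from math import erfc, sqrt

def Q(x):
    return 0.5 * erfc(x / sqrt(2.0))

def union_bound(C, sigma):
    """Upper bound on P_e for ML decoding on the AWGN channel, uniform prior."""
    m = len(C)
    bounds = []
    for i in range(m):
        b = sum(Q(np.linalg.norm(C[j] - C[i]) / (2 * sigma))
                for j in range(m) if j != i)
        bounds.append(b)
    return float(np.mean(bounds))

# Illustrative 4-point constellation with minimum distance 2, sigma = 1.
C = np.array([[1.0, 1.0], [-1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]])
print("union bound on P_e:", union_bound(C, sigma=1.0))
```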
[Figure 2.9: 8-ary PSK constellation in R² and decoding regions. The eight signal points c_0, . . . , c_7 lie on a circle and the decoding regions R_0, . . . , R_7 are the corresponding wedges.]
In the next section we derive an easy-to-compute tight upper bound on ∫_{B_{i,j}} f_{Y|H}(y|i) dy for a general f_{Y|H}. Notice that the above integral is the probability of error under H = i when there are only two hypotheses, the other hypothesis is H = j, and the priors are proportional to P_H(i) and P_H(j).
Example 19. (m-PSK) Figure 2.9 shows a signal set for m-ary PSK (phase-shift keying) when m = 8. Formally, the signal transmitted when H = i, i ∈ H = {0, 1, . . . , m−1}, is

    c_i = √E_s ( cos(2πi/m), sin(2πi/m) )^T,

where E_s = ‖c_i‖², i ∈ H. Assuming the AWGN channel, the hypothesis testing problem is specified by

    H = i :  Y ∼ N(c_i, σ²I_2),

and the prior P_H(i) is assumed to be uniform. Because we have a uniform prior, the MAP and the ML decision rules are identical. Furthermore, since the channel is the AWGN channel, the ML decoder is a minimum-distance decoder. The decoding regions are also shown in the figure.

By symmetry P_e = P_e(i) and one can show that

    P_e(i) = (1/π) ∫_0^{π − π/m} exp( − ( sin²(π/m) / sin²(θ + π/m) ) · E_s/(2σ²) ) dθ.
The above expression is rather complicated. Let us see what we obtain through the union
bound.
With reference to Figure 2.10 we have:
    P_e(i) = Pr{Y ∈ B_{i,i−1} ∪ B_{i,i+1} | H = i}
           ≤ Pr{Y ∈ B_{i,i−1} | H = i} + Pr{Y ∈ B_{i,i+1} | H = i}
           = 2 Pr{Y ∈ B_{i,i−1} | H = i}
           = 2 Q( ‖c_i − c_{i−1}‖ / (2σ) )
           = 2 Q( (√E_s / σ) sin(π/m) ).

Notice that we have been using a version of the union bound adapted to the problem: we get a tighter bound by using the fact that R_i^c = B_{i,i−1} ∪ B_{i,i+1} rather than R_i^c ⊆ ∪_{j≠i} B_{i,j}.
How good is the upper bound? From the figure below, drawn with i = 4 in mind, we see that

    P_e(i) = Pr{Y ∈ B_{i,i−1} | H = i} + Pr{Y ∈ B_{i,i+1} | H = i} − Pr{Y ∈ B_{i,i−1} ∩ B_{i,i+1} | H = i}.

The above expression can be used to upper and lower bound P_e(i). We upper bound it if we lower bound the last term. In fact, if we set it to zero, we obtain the upper bound that we have just derived. We lower bound P_e(i) if we upper bound the last term of the above expression. To do so, observe that R_i^c is the union of (m−1) disjoint cones, one of which is B_{i,i−1} ∩ B_{i,i+1} (see again the figure below). Furthermore, the integral of f_{Y|H}(·|i) over B_{i,i−1} ∩ B_{i,i+1} is smaller than that over the other cones. Hence the integral over B_{i,i−1} ∩ B_{i,i+1} must be less than P_e(i)/(m−1). Mathematically,

    Pr{Y ∈ (B_{i,i−1} ∩ B_{i,i+1}) | H = i} ≤ P_e(i)/(m−1).

Inserting in the previous expression, solving for P_e(i), and using the fact that P_e(i) = P_e yields the desired lower bound

    P_e ≥ 2 Q( (√E_s / σ) sin(π/m) ) · (m−1)/m.

The ratio between the upper and the lower bound is the constant m/(m−1). For m large, the bounds become very tight.
[Figure 2.10: Bounding the error probability of PSK by means of the union bound. The signal points c_3, c_4, c_5, the decoding region R_4, the half-planes B_{4,3} and B_{4,5}, and their intersection B_{4,3} ∩ B_{4,5} are shown.]
The way we upper-bounded Pr{Y ∈ B_{i,i−1} ∩ B_{i,i+1} | H = i} is not the only way to proceed. Alternatively, we could use the fact that B_{i,i−1} ∩ B_{i,i+1} is included in B_{i,k}, where k is the index of the codeword opposed to c_i. (In the above figure, B_{4,3} ∩ B_{4,5} ⊆ B_{4,0}.) Hence Pr{Y ∈ B_{i,i−1} ∩ B_{i,i+1} | H = i} ≤ Pr{Y ∈ B_{i,k} | H = i} = Q(√E_s / σ). This goes to zero as E_s/σ² → ∞. It implies that the lower bound obtained this way becomes tight as E_s/σ² becomes large.
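The exact m-PSK expression, the union-bound upper bound 2Q((√E_s/σ) sin(π/m)), and the lower bound that is (m−1)/m times it are easy to compare numerically. The sketch below integrates the exact expression with a simple midpoint rule; the values of m and E_s/σ² are illustrative.

```python
import numpy as np
from math import erfc, sqrt, sin, pi

def Q(x):
    return 0.5 * erfc(x / sqrt(2.0))

def psk_pe_exact(m, es_over_sigma2, steps=20000):
    """Midpoint-rule evaluation of the exact m-PSK error probability integral."""
    width = pi - pi / m
    theta = (np.arange(steps) + 0.5) * width / steps
    integrand = np.exp(-(np.sin(pi / m) ** 2 / np.sin(theta + pi / m) ** 2)
                       * es_over_sigma2 / 2.0)
    return float(integrand.mean() * width / pi)

m, es_over_sigma2 = 8, 25.0
ub = 2 * Q(sqrt(es_over_sigma2) * sin(pi / m))
lb = ub * (m - 1) / m
print("exact (numerical):", psk_pe_exact(m, es_over_sigma2))
print("lower/upper bound:", lb, ub)
```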
2.6.2 Union Bhattacharyya Bound
Let us summarize. From the union bound applied to R_i^c ⊆ ∪_{j : j≠i} B_{i,j} we have obtained the upper bound

    P_e(i) = Pr{Y ∈ R_i^c | H = i} ≤ Σ_{j : j≠i} Pr{Y ∈ B_{i,j} | H = i}

and we have used this bound for the AWGN channel. With the bound, instead of having to compute

    Pr{Y ∈ R_i^c | H = i} = ∫_{R_i^c} f_{Y|H}(y|i) dy,

which requires integrating over a possibly complicated region R_i^c, we have only to compute

    Pr{Y ∈ B_{i,j} | H = i} = ∫_{B_{i,j}} f_{Y|H}(y|i) dy.

The latter integral is simply Q(a/σ), where a is the distance between c_i and the affine plane bounding B_{i,j}. For a ML decision rule, a = ‖c_i − c_j‖/2.

What if the channel is not AWGN? Is there a relatively simple expression for Pr{Y ∈ B_{i,j} | H = i} that applies for general channels? Such an expression does exist. It is the Bhattacharyya bound that we now derive.⁴ We will need it only for those i for which P_H(i) > 0. Hence, for the derivation that follows, we assume that this is the case.
The definition of B_{i,j} may be rewritten in either of the following two forms:

    { y : P_H(j) f_{Y|H}(y|j) / (P_H(i) f_{Y|H}(y|i)) ≥ 1 }  =  { y : √( P_H(j) f_{Y|H}(y|j) / (P_H(i) f_{Y|H}(y|i)) ) ≥ 1 },

except that the above fraction is not defined when f_{Y|H}(y|i) vanishes. This exception apart, we see that

    1_{B_{i,j}}(y) ≤ √( P_H(j) f_{Y|H}(y|j) / (P_H(i) f_{Y|H}(y|i)) )

is true when y is inside B_{i,j}; it is also true when y is outside, because the left side vanishes and the right side is never negative. We do not have to worry about the exception because we will use

    f_{Y|H}(y|i) 1_{B_{i,j}}(y) ≤ f_{Y|H}(y|i) √( P_H(j) f_{Y|H}(y|j) / (P_H(i) f_{Y|H}(y|i)) ) = √( P_H(j)/P_H(i) ) √( f_{Y|H}(y|i) f_{Y|H}(y|j) ),

which is obviously true when f_{Y|H}(y|i) vanishes.

⁴ There are two versions of the Bhattacharyya bound. Here we derive the one that has the simpler derivation. The other version, which is tighter by a factor 2, is derived in Problems 32 and 33.
We are now ready to derive the Bhattacharyya bound:

    Pr{Y ∈ B_{i,j} | H = i} = ∫_{y ∈ B_{i,j}} f_{Y|H}(y|i) dy
                            = ∫_{y ∈ R^n} f_{Y|H}(y|i) 1_{B_{i,j}}(y) dy
                            ≤ √( P_H(j)/P_H(i) ) ∫_{y ∈ R^n} √( f_{Y|H}(y|i) f_{Y|H}(y|j) ) dy.    (2.15)

What makes the last integral appealing is that we integrate over the entire R^n. As shown in Problem 35, for discrete memoryless channels the bound further simplifies.
As the name indicates, the union Bhattacharyya bound combines (2.14) and (2.15), namely

    P_e(i) ≤ Σ_{j : j≠i} Pr{Y ∈ B_{i,j} | H = i}
           ≤ Σ_{j : j≠i} √( P_H(j)/P_H(i) ) ∫_{y ∈ R^n} √( f_{Y|H}(y|i) f_{Y|H}(y|j) ) dy.

We can now remove the conditioning on H = i and obtain

    P_e ≤ Σ_i Σ_{j : j≠i} √( P_H(i) P_H(j) ) ∫_{y ∈ R^n} √( f_{Y|H}(y|i) f_{Y|H}(y|j) ) dy.
Example 20. (Tightness of the Bhattacharyya Bound) Let the message H ∈ {0, 1} be equiprobable, let the channel be the binary erasure channel described in Figure 2.11, and let c_i = (i, i, . . . , i)^T be the signal used when H = i.

[Figure 2.11: Binary erasure channel. Each input symbol X ∈ {0, 1} is delivered unchanged (Y = X) with probability 1 − p and is erased (Y = Δ) with probability p.]
The Bhattacharyya bound for this case yields

    Pr{Y ∈ B_{0,1} | H = 0} ≤ Σ_{y ∈ {0,1,Δ}^n} √( P_{Y|H}(y|1) P_{Y|H}(y|0) )
                            = Σ_{y ∈ {0,1,Δ}^n} √( P_{Y|X}(y|c_1) P_{Y|X}(y|c_0) )
                            = p^n,    (a)

where in (a) we used the fact that the first factor under the square root vanishes if y contains 0s and the second vanishes if y contains 1s. Hence the only non-vanishing term in the sum is the one for which y_i = Δ for all i. The same bound applies for H = 1. Hence

    P_e ≤ (1/2) p^n + (1/2) p^n = p^n.
If we use the tighter version of the union Bhattacharyya bound, which as mentioned earlier is tighter by a factor of 2, then we obtain

    P_e ≤ (1/2) p^n.

For the binary erasure channel and the two codewords c_0 and c_1 we can actually compute the exact probability of error:

    P_e = (1/2) Pr{Y = (Δ, Δ, . . . , Δ)^T} = (1/2) p^n.

The Bhattacharyya bound is tight for the scenario considered in this example!
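The p^n result can be reproduced by brute force for small n by summing √(P_{Y|X}(y|c_1) P_{Y|X}(y|c_0)) over all y ∈ {0, 1, Δ}^n. In the sketch below the erasure symbol Δ is encoded as the integer 2; this encoding is ours.

```python
from itertools import product
from math import prod, sqrt

def bec_out_prob(y, x, p):
    """P_{Y|X}(y|x) for one BEC use: erasure (encoded as 2) w.p. p, else y = x."""
    if y == 2:
        return p
    return (1 - p) if y == x else 0.0

def bhattacharyya_bec(n, p):
    c0, c1 = [0] * n, [1] * n
    total = 0.0
    for y in product([0, 1, 2], repeat=n):
        p0 = prod(bec_out_prob(yi, xi, p) for yi, xi in zip(y, c0))
        p1 = prod(bec_out_prob(yi, xi, p) for yi, xi in zip(y, c1))
        total += sqrt(p0 * p1)
    return total

n, p = 4, 0.3
print(bhattacharyya_bec(n, p), "vs p^n =", p ** n)
```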
2.7 Summary
The maximum a posteriori probability (MAP) rule is a decision rule that does exactly what the name implies: it maximizes the a posteriori probability and in so doing it maximizes the probability that the decision is correct. With hindsight, the key idea is quite simple and it applies even when there is no observable. Let us review it.

Assume that a coin is flipped and we have to guess the outcome. We model the coin by the random variable H ∈ {0, 1}. All we know is P_H(0) and P_H(1). Suppose that P_H(0) ≤ P_H(1). Clearly we have the highest chance of being correct if we guess H = 1. We will be correct if indeed H = 1, and this has probability P_H(1). More generally, we should choose the i that maximizes P_H(·), and the probability of being correct is P_H(i).

It is more interesting when there is some side information. The side information is obtained when we observe the outcome of a related random variable Y. Once we have made the observation Y = y, our knowledge about the distribution of H gets updated from the prior distribution P_H(·) to the posterior distribution P_{H|Y}(·|y). What we have said in the previous paragraphs applies with the posterior instead of the prior.
In a typical example P_H(·) is constant, whereas for the observed y, P_{H|Y}(·|y) may be strongly biased in favor of one hypothesis. If it is strongly biased, the observable has been very informative, which is what we hope, of course.

Often P_{H|Y} is not given to us, but we can find it from P_H and f_{Y|H} via Bayes' rule. Although P_{H|Y} is the most fundamental quantity associated to a MAP test, and therefore it would make sense to write the test in terms of P_{H|Y}, the test is typically written in terms of P_H and f_{Y|H} because these are the quantities that are specified as part of the model.
Ideally a receiver performs a MAP decision. We have emphasised the case in which all hypotheses have the same probability, as this is a common assumption in digital communication. Then the MAP and the ML rule are identical. We have paid particular attention to communication across the discrete-time AWGN channel as it will play an important role in subsequent chapters. The ML receiver for the AWGN channel is a minimum-distance decision rule; in the simplest cases the error probability can be computed exactly by means of the Q-function, and otherwise it can be upper bounded by means of the union bound and the Q-function.

A quite general and useful technique to upper bound the probability of error is the union Bhattacharyya bound. Notice that it applies to MAP decisions associated to general hypothesis testing problems, not only to communication problems. All we need to evaluate the union Bhattacharyya bound are f_{Y|H} and P_H.
We end this summary with an example that shows how the posterior becomes more and
more selective as the number of observations increases. The example also shows that the
posterior becomes less selective if the observations are less reliable.
Example 21. Assume H ∈ {0, 1} and P_H(0) = P_H(1) = 1/2. The outcome of H is communicated across a binary symmetric channel (BSC) of crossover probability p < 1/2 via a transmitter that sends n 0s when H = 0 and n 1s when H = 1. The BSC has input alphabet X = {0, 1}, output alphabet Y = X, and transition probability p_{Y|X}(y|x) = ∏_{i=1}^n p_{Y|X}(y_i|x_i), where p_{Y|X}(y_i|x_i) equals 1 − p if y_i = x_i and p otherwise. Letting k be the number of 1s in the observed channel output y, we have

    P_{Y|H}(y|0) = p^k (1 − p)^{n−k}
    P_{Y|H}(y|1) = p^{n−k} (1 − p)^k.
Using Bayes' rule,

    P_{H|Y}(i|y) = P_{H,Y}(i, y) / P_Y(y) = P_H(i) P_{Y|H}(y|i) / P_Y(y),

where P_Y(y) = Σ_i P_{Y|H}(y|i) P_H(i) is the normalization that ensures Σ_i P_{H|Y}(i|y) = 1. Hence

    P_{H|Y}(0|y) = p^k (1 − p)^{n−k} / (2 P_Y(y)) = ( p/(1 − p) )^k (1 − p)^n / (2 P_Y(y))
    P_{H|Y}(1|y) = p^{n−k} (1 − p)^k / (2 P_Y(y)) = ( (1 − p)/p )^k p^n / (2 P_Y(y)).
Figure 2.12 depicts the behavior of P_{H|Y}(0|y) as a function of the number k of 1s in y. For the top two figures, p = 0.25. We see that when n = 50 (right figure), the posterior is very biased in favor of one or the other hypothesis, unless the number k of observed 1s is nearly n/2 = 25. Comparing to n = 1 (left figure), we see that many observations allow the receiver to make a more confident decision. This is true also for p = 0.47 (bottom row), but we see that with the crossover probability p close to 1/2, there is a smoother transition between the region in favor of one hypothesis and the region in favor of the other. If we make only one observation (bottom left figure), then there is only a slight difference between the posterior for H = 0 and that for H = 1.
[Figure 2.12: Posterior P_{H|Y}(0|y) as a function of the number k of 1s observed at the output of a BSC of crossover probability p. Panels: (a) p = 0.25, n = 1; (b) p = 0.25, n = 50; (c) p = 0.47, n = 1; (d) p = 0.47, n = 50. The channel input consists of n 0s when H = 0 and of n 1s when H = 1.]
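The posterior plotted in Figure 2.12 is a one-line computation. The sketch below prints P_{H|Y}(0|y) for a few values of k in the four settings used in the figure; since the prior is uniform it cancels in the ratio.

```python
def posterior_h0(k, n, p):
    """P_{H|Y}(0|y) for a BSC, repetition signaling, equiprobable H, k ones in y."""
    like0 = p ** k * (1 - p) ** (n - k)
    like1 = p ** (n - k) * (1 - p) ** k
    return like0 / (like0 + like1)

for p, n in [(0.25, 1), (0.25, 50), (0.47, 1), (0.47, 50)]:
    samples = [0, n // 2, n]
    print(p, n, [round(posterior_h0(k, n, p), 3) for k in samples])
```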
Appendix 2.A Facts About Matrices
In this appendix we provide a summary of useful definitions and facts about matrices. An excellent text about matrices is [5]. Hereafter H† denotes the conjugate transpose of the matrix H. It is also called the Hermitian adjoint of H.
Definition 22. A matrix U ∈ C^{n×n} is said to be unitary if U†U = I. If U is unitary and has real-valued entries, then it is orthogonal.
The following theorem lists a number of handy facts about unitary matrices. Most of them are straightforward. Proofs can be found in [5, page 67].

Theorem 23. If U ∈ C^{n×n}, the following are equivalent:

(a) U is unitary;
(b) U is nonsingular and U† = U^{−1};
(c) UU† = I;
(d) U† is unitary;
(e) the columns of U form an orthonormal set;
(f) the rows of U form an orthonormal set; and
(g) for all x ∈ C^n, the Euclidean length of y = Ux is the same as that of x; that is, y†y = x†x.
Theorem 24. (Schur) Any square matrix $A$ can be written as $A = URU^\dagger$ where $U$ is unitary and $R$ is an upper triangular matrix whose diagonal entries are the eigenvalues of $A$.
Proof. Let us use induction on the size $n$ of the matrix. The theorem is clearly true for $n = 1$. Let us now show that if it is true for $n - 1$, it follows that it is true for $n$. Given $A$ of size $n$, let $v$ be an eigenvector of unit norm, and $\lambda$ the corresponding eigenvalue. Let $V$ be a unitary matrix whose first column is $v$. Consider the matrix $V^\dagger A V$. The first column of this matrix is given by $V^\dagger A v = \lambda V^\dagger v = \lambda e_1$, where $e_1$ is the unit vector along the first coordinate. Thus
$$V^\dagger A V = \begin{pmatrix} \lambda & * \\ 0 & B \end{pmatrix},$$
where $B$ is square and of dimension $n - 1$. By the induction hypothesis, $B = WSW^\dagger$, where $W$ is unitary and $S$ is upper triangular. Thus,
$$V^\dagger A V = \begin{pmatrix} \lambda & * \\ 0 & WSW^\dagger \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & W \end{pmatrix}\begin{pmatrix} \lambda & * \\ 0 & S \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & W^\dagger \end{pmatrix} \qquad (2.16)$$
and putting
$$U = V\begin{pmatrix} 1 & 0 \\ 0 & W \end{pmatrix} \quad\text{and}\quad R = \begin{pmatrix} \lambda & * \\ 0 & S \end{pmatrix},$$
we see that $U$ is unitary, $R$ is upper triangular and $A = URU^\dagger$, completing the induction step. The eigenvalues of a matrix are the roots of the characteristic polynomial. To see that the diagonal entries of $R$ are indeed the eigenvalues of $A$ it suffices to bring the characteristic polynomial of $A$ into the following form: $\det(\lambda I - A) = \det\left(U^\dagger(\lambda I - R)U\right) = \det(\lambda I - R) = \prod_i(\lambda - r_{ii})$.
Definition 25. A matrix $H \in \mathbb{C}^{n\times n}$ is said to be Hermitian if $H = H^\dagger$. It is said to be skew-Hermitian if $H = -H^\dagger$. If $H$ is Hermitian and has real-valued entries, then it is symmetric.
Recall that a polynomial of degree $n$ has exactly $n$ roots over $\mathbb{C}$. Hence an $n \times n$ matrix has exactly $n$ eigenvalues in $\mathbb{C}$, specifically the $n$ roots of the characteristic polynomial $\det(\lambda I - A)$.
Lemma 26. A Hermitian matrix $H \in \mathbb{C}^{n\times n}$ can be written as
$$H = U\Lambda U^\dagger = \sum_i \lambda_i u_i u_i^\dagger$$
where $U$ is unitary and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ is a diagonal matrix that consists of the eigenvalues of $H$. Moreover, the eigenvalues are real and the $i$th column of $U$ is an eigenvector associated to $\lambda_i$.
Proof. By Theorem 24 (Schur) we can write $H = U R U^\dagger$ where $U$ is unitary and $R$ is upper triangular with the diagonal elements consisting of the eigenvalues of $H$. From $R = U^\dagger H U$ we immediately see that $R$ is Hermitian. Since it is also upper triangular, it must be diagonal, and the diagonal elements must be real. If $u_i$ is the $i$th column of $U$, then
$$H u_i = U\Lambda U^\dagger u_i = U\Lambda e_i = U\lambda_i e_i = \lambda_i u_i,$$
showing that it is indeed an eigenvector associated to the $i$th eigenvalue $\lambda_i$.
Exercise 27. Show that if $H \in \mathbb{C}^{n\times n}$ is Hermitian, then $u^\dagger H u$ is real for all $u \in \mathbb{C}^n$.
A class of Hermitian matrices with a special positivity property arises naturally in many
applications, including communication theory. They can be thought of as a matrix equiv-
alent of the notion of positive numbers.
Definition 28. A Hermitian matrix $H \in \mathbb{C}^{n\times n}$ is said to be positive definite if
$$u^\dagger H u > 0 \quad \text{for all non-zero } u \in \mathbb{C}^n.$$
If the above strict inequality is weakened to $u^\dagger H u \geq 0$, then $H$ is said to be positive semidefinite.
Exercise 29. Show that a non-singular covariance matrix is always positive definite.
Theorem 30. (SVD) Any matrix $A \in \mathbb{C}^{m\times n}$ can be written as a product
$$A = UDV^\dagger,$$
where $U$ and $V$ are unitary (of dimension $m\times m$ and $n\times n$, respectively) and $D \in \mathbb{R}^{m\times n}$ is non-negative and diagonal. This is called the singular value decomposition (SVD) of $A$. Moreover, by letting $k$ be the rank of $A$, the following statements are true:
(a) The columns of $V$ are the eigenvectors of $A^\dagger A$. The last $n - k$ columns span the null space of $A$.
(b) The columns of $U$ are eigenvectors of $AA^\dagger$. The first $k$ columns span the range of $A$.
(c) If $m \geq n$ then
$$D = \begin{pmatrix} \mathrm{diag}(\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_n}) \\ 0 \end{pmatrix},$$
where $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_k > \lambda_{k+1} = \ldots = \lambda_n = 0$ are the eigenvalues of $A^\dagger A \in \mathbb{C}^{n\times n}$, which are non-negative because $A^\dagger A$ is Hermitian.
(d) If $m \leq n$ then
$$D = \bigl(\mathrm{diag}(\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_m}) : 0\bigr),$$
where $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_k > \lambda_{k+1} = \ldots = \lambda_m = 0$ are the eigenvalues of $AA^\dagger$.
Note 1: Recall that the non-zero eigenvalues of $AB$ equal the non-zero eigenvalues of $BA$; see e.g. [5, Theorem 1.3.29]. Hence the non-zero eigenvalues in (c) and (d) are the same for both cases.
Note 2: To remember that $V$ is associated to $A^\dagger A$ (as opposed to being associated to $AA^\dagger$) it suffices to look at the dimensions: $V \in \mathbb{C}^{n\times n}$ and $A^\dagger A \in \mathbb{C}^{n\times n}$.
Proof. It is sufficient to consider the case with $m \geq n$ since if $m < n$, we can apply the result to $A^\dagger = UDV^\dagger$ and obtain $A = VD^\dagger U^\dagger$. Hence let $m \geq n$, and consider the matrix $A^\dagger A \in \mathbb{C}^{n\times n}$. This matrix is Hermitian. Hence its eigenvalues $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n \geq 0$ are real and non-negative and we can choose the eigenvectors $v_1, v_2, \ldots, v_n$ to form an orthonormal basis for $\mathbb{C}^n$. Let $V = (v_1, \ldots, v_n)$. Let $k$ be the number of positive eigenvalues and choose
$$u_i = \frac{1}{\sqrt{\lambda_i}} A v_i, \quad i = 1, 2, \ldots, k. \qquad (2.17)$$
Observe that
$$u_i^\dagger u_j = \frac{1}{\sqrt{\lambda_i\lambda_j}}\, v_i^\dagger A^\dagger A v_j = \sqrt{\frac{\lambda_j}{\lambda_i}}\, v_i^\dagger v_j = \delta_{ij}, \quad 1 \leq i, j \leq k.
$$
Hence $\{u_i : i = 1, \ldots, k\}$ form an orthonormal set in $\mathbb{C}^m$. Complete this set to an orthonormal basis for $\mathbb{C}^m$ by choosing $\{u_i : i = k+1, \ldots, m\}$ and let $U = (u_1, u_2, \ldots, u_m)$. Note that (2.17) implies
$$u_i\sqrt{\lambda_i} = A v_i, \quad i = 1, 2, \ldots, k, k+1, \ldots, n,$$
where for $i = k+1, \ldots, n$ the above relationship holds since $\lambda_i = 0$ and $v_i$ is a corresponding eigenvector. Using matrix notation we obtain
$$U\begin{pmatrix} \sqrt{\lambda_1} & & \\ & \ddots & \\ & & \sqrt{\lambda_n} \\ & 0 & \end{pmatrix} = AV, \qquad (2.18)$$
i.e., $A = UDV^\dagger$. For $i = 1, 2, \ldots, m$,
$$AA^\dagger u_i = UDV^\dagger V D^\dagger U^\dagger u_i = UDD^\dagger U^\dagger u_i = \lambda_i u_i,$$
where the last equality uses the fact that $U^\dagger u_i$ has a 1 at position $i$ and is zero otherwise, and $DD^\dagger = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_k, 0, \ldots, 0)$. This shows that $\lambda_i$ is also an eigenvalue of $AA^\dagger$. We have also shown that $\{v_i : i = k+1, \ldots, n\}$ spans the null space of $A$ and from (2.18) we see that $\{u_i : i = 1, \ldots, k\}$ spans the range of $A$.
The following key result is a simple application of the SVD.
Lemma 31. The linear transformation described by a matrix $A \in \mathbb{R}^{n\times n}$ maps the unit cube into a parallelepiped of volume $|\det A|$.
Proof. We want to know the volume of the region we obtain when we apply to all the points of the unit cube the linear transformation described by $A$. From the singular value decomposition, we can write $A = UDV^\dagger$, where $D$ is diagonal and $U$ and $V$ are orthogonal matrices. A transformation described by an orthogonal matrix is volume preserving. (In fact if we apply an orthogonal matrix to an object, we obtain the same object described in a new coordinate system.) Hence we can focus our attention on the effect of $D$. But $D$ maps the unit vectors $e_1, e_2, \ldots, e_n$ into $d_1 e_1, d_2 e_2, \ldots, d_n e_n$, respectively, where $d_1, \ldots, d_n$ are its diagonal entries. Hence it maps the unit cube into a rectangular parallelepiped of sides $d_1, d_2, \ldots, d_n$ and of volume $|\prod_i d_i| = |\det D| = |\det A|$, where the last equality holds because the determinant of a product (of matrices) is the product of the determinants and the determinant of an orthogonal matrix has absolute value 1.
Appendix 2.B Densities after One-To-One Differentiable Transformations
In this appendix we outline how to determine the density of a random vector $Y$ when we know the density of a random vector $X$ and $Y = g(X)$ for some differentiable and one-to-one function $g$.
We begin with the scalar case. Generalizing to the vector case is conceptually straightforward. Let $X \in \mathcal{X}$ be a random variable of density $f_X$ and define $Y = g(X)$ for a given one-to-one differentiable function $g : \mathcal{X} \to \mathcal{Y}$. The density becomes useful when we integrate it over some set $A$ to obtain the probability that $X \in A$. (A probability density function relates to probability like pressure relates to force.) In Figure 2.13 the shaded area under $f_X$ equals $\Pr\{X \in A\}$. Now assume that $g$ maps the interval $A$ into the interval $B$. Then $X \in A$ if and only if $Y \in B$. Hence $\Pr\{X \in A\} = \Pr\{Y \in B\}$, which means that the two shaded areas in the figure must be identical. This requirement completely specifies $f_Y$.
For the mathematical details we need to consider an infinitesimally small interval $A$. Then $\Pr\{X \in A\} = f_X(\bar{x})\,l(A)$, where $l(A)$ denotes the length of $A$ and $\bar{x}$ is any point in $A$. Similarly, $\Pr\{Y \in B\} = f_Y(\bar{y})\,l(B)$ where $\bar{y} = g(\bar{x})$. Hence $f_Y$ fulfills
$$f_X(\bar{x})\,l(A) = f_Y(\bar{y})\,l(B).$$
[Figure 2.13: Finding the density of $Y = g(X)$ from that of $X$. The top plot shows $f_X(x)$ with a shaded interval $A$; the bottom plot shows $f_Y(y)$ with the image interval $B$ under $y = g(x)$. Shaded surfaces have the same area.]
The last ingredient is the fact that the absolute value of the slope of $g$ at $\bar{x}$ is the ratio $\frac{l(B)}{l(A)}$. (We are still assuming infinitesimally small intervals.) Hence $f_Y(y)\,|g'(x)| = f_X(x)$
and after solving for $f_Y(y)$ and using $x = g^{-1}(y)$ we obtain the desired result
$$f_Y(y) = \frac{f_X\bigl(g^{-1}(y)\bigr)}{\bigl|g'\bigl(g^{-1}(y)\bigr)\bigr|}. \qquad (2.19)$$
Example 32. If $g(x) = ax + b$ then $f_Y(y) = \frac{f_X\left(\frac{y-b}{a}\right)}{|a|}$.
Example 33. The density $f_X$ in Figure 2.13 is Rayleigh, specifically
$$f_X(x) = \begin{cases} x\exp\left\{-\frac{x^2}{2}\right\}, & x \geq 0 \\ 0, & \text{otherwise} \end{cases}$$
and let $Y = g(X) = X^2$. Then
$$f_Y(y) = \begin{cases} 0.5\exp\left\{-\frac{y}{2}\right\}, & y \geq 0 \\ 0, & \text{otherwise.} \end{cases}$$
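A quick Monte Carlo experiment makes (2.19) concrete. The sketch below (added for illustration, assuming NumPy; it is not part of the original text) samples $X$ from the Rayleigh density of Example 33, squares it, and compares a histogram of $Y = X^2$ with the predicted density $0.5\,e^{-y/2}$.

```python
import numpy as np

rng = np.random.default_rng(1)
# Sample X from the Rayleigh density f_X(x) = x exp(-x^2/2), x >= 0
x = rng.rayleigh(scale=1.0, size=200_000)
y = x**2                                   # Y = g(X) = X^2

# Compare an empirical histogram of Y with the density predicted by (2.19)
edges = np.linspace(0, 8, 41)
hist, _ = np.histogram(y, bins=edges, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
predicted = 0.5 * np.exp(-centers / 2)     # f_Y(y) = 0.5 exp(-y/2)
print(np.max(np.abs(hist - predicted)))    # small: only sampling noise remains
```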
Before generalizing, let us summarize the scalar case. Expression (2.19) says that locally the shape of $f_Y$ is the same as that of $f_X$, but there is a denominator on the right that acts locally as a scaling factor. The absolute value of the derivative of $g$ at a point $x$ is the local slope of $g$ and it tells us how $g$ scales intervals around $x$. The larger the slope at a point, the larger the magnification of an interval around that point. As the integral of $f_X$ over an interval around $x$ must be the same as that of $f_Y$ over the corresponding interval around $y = g(x)$, if we scale up the interval size, we have to scale down the density by the same factor.
Next we consider the multidimensional case, starting with two dimensions. Let $X = (X_1, X_2)^T$ have pdf $f_X(x)$ and consider first the random vector $Y$ obtained from the affine transformation
$$Y = AX + b$$
for some non-singular matrix $A$ and vector $b$. The procedure to determine $f_Y$ parallels that for the scalar case. If $A$ is a small rectangle, small enough that $f_X(x)$ can be considered as constant for all $x \in A$, then $\Pr\{X \in A\}$ is approximated by $f_X(x)\,a(A)$, where $a(A)$ is the area of $A$. If $B$ is the image of $A$, then
$$f_Y(\bar{y})\,a(B) \to f_X(\bar{x})\,a(A) \quad \text{as } a(A) \to 0.$$
Hence
$$f_Y(\bar{y}) \to f_X(\bar{x})\,\frac{a(A)}{a(B)} \quad \text{as } a(A) \to 0.$$
For the next and final step, we need to know that $A$ maps a region $A$ of area $a(A)$ into a region $B$ of area $a(B) = a(A)\,|\det A|$. So the absolute value of the determinant of a matrix is the amount by which areas scale through the affine transformation associated to the matrix.
This is true in any dimension $n$, but for $n = 1$ we speak of length rather than area and for $n \geq 3$ we speak of volume. (For the one-dimensional case, observe that the determinant of a scalar $a$ is $a$ itself.) See Lemma 31 in Appendix 2.A for an outline of the proof of this important geometrical interpretation of the determinant of a matrix. Hence
$$f_Y(y) = \frac{f_X\bigl(A^{-1}(y - b)\bigr)}{|\det A|}.$$
We are ready to generalize to a function $g : \mathbb{R}^n \to \mathbb{R}^n$ which is one-to-one and differentiable. Write $g(x) = (g_1(x), \ldots, g_n(x))$ and define its Jacobian $J(x)$ to be the matrix that has $\frac{\partial g_i}{\partial x_j}$ at position $i, j$. In the neighborhood of $x$ the relationship $y = g(x)$ may be approximated by means of an affine expression of the form
$$y = Ax + b$$
where $A$ is precisely the Jacobian $J(x)$. Hence, leveraging on the affine case, we can immediately conclude that
$$f_Y(y) = \frac{f_X\bigl(g^{-1}(y)\bigr)}{\bigl|\det J\bigl(g^{-1}(y)\bigr)\bigr|} \qquad (2.20)$$
which holds for any $n$.
Sometimes the new random vector $Y$ is described by the inverse function, namely $X = g^{-1}(Y)$ (rather than the other way around, as assumed so far). In this case there is no need to find $g$. The determinant of the Jacobian of $g$ at $x$ is one over the determinant of the Jacobian of $g^{-1}$ at $y = g(x)$.
As a final note, we mention that if $g$ is a many-to-one map, then for a specific $y$ the pull-back $g^{-1}(y)$ will be a set $\{x_1, \ldots, x_k\}$ for some $k$. In this case the right side of (2.20) becomes
$$\sum_i \frac{f_X(x_i)}{|\det J(x_i)|}.$$
Example 34. (Rayleigh distribution) Let $X_1$ and $X_2$ be two independent, zero-mean, unit-variance, Gaussian random variables. Let $R$ and $\Theta$ be the corresponding polar coordinates, i.e., $X_1 = R\cos\Theta$ and $X_2 = R\sin\Theta$. We are interested in the probability density functions $f_{R,\Theta}$, $f_R$, and $f_\Theta$. Because we are given the map $g$ from $(r, \theta)$ to $(x_1, x_2)$, we pretend that we know $f_{R,\Theta}$ and that we want to find $f_{X_1,X_2}$. Thus
$$f_{X_1,X_2}(x_1, x_2) = \frac{1}{|\det J|}\, f_{R,\Theta}(r, \theta)$$
where $J$ is the Jacobian of $g$, namely
$$J = \begin{pmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{pmatrix}.$$
Hence $\det J = r$ and
$$f_{X_1,X_2}(x_1, x_2) = \frac{1}{r}\, f_{R,\Theta}(r, \theta).$$
Using $f_{X_1,X_2}(x_1, x_2) = \frac{1}{2\pi}\exp\left\{-\frac{x_1^2 + x_2^2}{2}\right\}$ and $x_1^2 + x_2^2 = r^2$ to make it a function of the desired variables $r, \theta$, and solving for $f_{R,\Theta}$, we immediately obtain
$$f_{R,\Theta}(r, \theta) = \frac{r}{2\pi}\exp\left\{-\frac{r^2}{2}\right\}.$$
Since $f_{R,\Theta}(r, \theta)$ depends only on $r$, we infer that $R$ and $\Theta$ are independent random variables and that the latter is uniformly distributed in $[0, 2\pi)$. Hence
$$f_\Theta(\theta) = \begin{cases} \frac{1}{2\pi}, & \theta \in [0, 2\pi) \\ 0, & \text{otherwise} \end{cases}$$
and
$$f_R(r) = \begin{cases} r\,e^{-\frac{r^2}{2}}, & r \geq 0 \\ 0, & \text{otherwise.} \end{cases}$$
We come to the same conclusion by integrating $f_{R,\Theta}$ over $\theta$ to obtain $f_R$ and by integrating over $r$ to obtain $f_\Theta$. Notice that $f_R$ is a Rayleigh probability density.
Appendix 2.C Gaussian Random Vectors
A Gaussian random vector is a collection of jointly Gaussian random variables. We learn to use vector notation as it simplifies matters significantly.
Recall that a random variable $W$ is a mapping from the sample space to $\mathbb{R}$. $W$ is a Gaussian random variable with mean $m$ and variance $\sigma^2$ if and only if its probability density function (pdf) is
$$f_W(w) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(w - m)^2}{2\sigma^2}\right\}.$$
Because a Gaussian random variable is completely specified by its mean $m$ and variance $\sigma^2$, we use the short-hand notation $\mathcal{N}(m, \sigma^2)$ to denote its pdf. Hence $W \sim \mathcal{N}(m, \sigma^2)$.
An $n$-dimensional random vector ($n$-rv) $X$ is a mapping $X : \Omega \to \mathbb{R}^n$ from the sample space. It can be seen as a collection $X = (X_1, X_2, \ldots, X_n)^T$ of $n$ random variables. The pdf of $X$ is the joint pdf of $X_1, X_2, \ldots, X_n$. The expected value of $X$, denoted by $EX$ or by $\bar{X}$, is the $n$-tuple $(EX_1, EX_2, \ldots, EX_n)^T$. The covariance matrix of $X$ is $K_X = E[(X - \bar{X})(X - \bar{X})^T]$. Notice that $(X - \bar{X})(X - \bar{X})^T$ is an $n \times n$ random matrix, i.e., a matrix of random variables, and the expected value of such a matrix is, by definition, the matrix whose components are the expected values of those random variables. Notice that a covariance matrix is always Hermitian.
The pdf of a vector $W = (W_1, W_2, \ldots, W_n)^T$ that consists of independent and identically distributed (iid) $\mathcal{N}(0, \sigma^2)$ components is
$$f_W(w) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{w_i^2}{2\sigma^2}\right\} \qquad (2.21)$$
$$= \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left\{-\frac{w^T w}{2\sigma^2}\right\}. \qquad (2.22)$$
The following is one of several possible ways to define a Gaussian random vector.
Definition 35. The random vector $Y \in \mathbb{R}^m$ is a zero-mean Gaussian random vector and $Y_1, Y_2, \ldots, Y_m$ are zero-mean jointly Gaussian random variables if and only if there exists a matrix $A \in \mathbb{R}^{m\times n}$ such that $Y$ can be expressed as
$$Y = AW \qquad (2.23)$$
where $W$ is a random vector of iid $\mathcal{N}(0, 1)$ components.
Note 36. It follows immediately from the above definition that linear combinations of zero-mean jointly Gaussian random variables are zero-mean jointly Gaussian random variables. Indeed, $Z = BY = BAW$.
Recall from Appendix 2.B that if $Y = AW$ for some non-singular matrix $A \in \mathbb{R}^{n\times n}$, then
$$f_Y(y) = \frac{f_W(A^{-1}y)}{|\det A|}.$$
When $W$ has iid $\mathcal{N}(0, 1)$ components,
$$f_Y(y) = \frac{\exp\left\{-\frac{(A^{-1}y)^T(A^{-1}y)}{2}\right\}}{(2\pi)^{n/2}\,|\det A|}.$$
The above expression can be simplified and brought to the standard expression
$$f_Y(y) = \frac{1}{\sqrt{(2\pi)^n\det K_Y}}\exp\left\{-\frac{1}{2}y^T K_Y^{-1} y\right\} \qquad (2.24)$$
using $K_Y = E[AW(AW)^T] = E[AWW^TA^T] = AI_nA^T = AA^T$ to obtain
$$(A^{-1}y)^T(A^{-1}y) = y^T(A^{-1})^TA^{-1}y = y^T(AA^T)^{-1}y = y^TK_Y^{-1}y$$
and
$$\sqrt{\det K_Y} = \sqrt{\det A\,\det A^T} = \sqrt{\det A\,\det A} = |\det A|.$$
Fact 37. Let $Y \in \mathbb{R}^n$ be a zero-mean random vector with arbitrary covariance matrix $K_Y$ and pdf as in (2.24). As a covariance matrix is Hermitian, we can write (see Appendix 2.A)
$$K_Y = U\Lambda U^\dagger \qquad (2.25)$$
where $U$ is unitary and $\Lambda$ is diagonal. It is immediate to verify that $U\Lambda^{1/2}W$ has covariance $K_Y$. This shows that an arbitrary zero-mean random vector $Y$ with pdf as in (2.24) can always be written in the form $Y = AW$ where $W$ has iid $\mathcal{N}(0, 1)$ components.
The contrary is not true in degenerate cases. We have already seen that (2.24) follows from (2.23) when $A$ is a non-singular square matrix. The derivation extends to any non-square matrix $A$, provided that it has linearly independent rows. This result is derived as a homework exercise. In that exercise we also see that it is indeed necessary that the rows of $A$ be linearly independent, as otherwise $K_Y$ is singular and $K_Y^{-1}$ is not defined. Then (2.24) is not defined either. An example will show how to handle such degenerate cases.
Note that many authors use (2.24) to define a Gaussian random vector. We favor (2.23) because it is more general, but also as it makes it straightforward to prove a number of key results associated to Gaussian random vectors. Some of these are dealt with in the examples below.
In any case, a zero-mean Gaussian random vector is completely characterized by its covariance matrix. Hence the short-hand notation $Y \sim \mathcal{N}(0, K_Y)$.
Note 38. (Degenerate case) Let $W \sim \mathcal{N}(0, 1)$, $A = (1, 1)^T$, and $Y = AW$. By our definition, $Y$ is a Gaussian random vector. However, $A$ is a matrix of linearly dependent rows, implying that $Y$ has linearly dependent components. Indeed $Y_1 = Y_2$. This also implies that $K_Y$ is singular: it is a $2 \times 2$ matrix with 1 in each component. As already pointed out, we cannot use (2.24) to describe the pdf of $Y$. This immediately raises the question: how do we compute the probability of events involving $Y$ if we do not know its pdf? The answer is easy. Any event involving $Y$ can be rewritten as an event involving $Y_1$ only (or equivalently involving $Y_2$ only). For instance, the event $\{Y_1 \in [3, 5]\} \cap \{Y_2 \in [4, 6]\}$ occurs if and only if $\{Y_1 \in [4, 5]\}$ occurs. Hence
$$\Pr\bigl(\{Y_1 \in [3, 5]\} \cap \{Y_2 \in [4, 6]\}\bigr) = \Pr\bigl(Y_1 \in [4, 5]\bigr) = Q(4) - Q(5).$$
Exercise 39. Show that the $i$th component $Y_i$ of a Gaussian random vector $Y$ is a Gaussian random variable.
Solution: $Y_i = AY$ when $A = e_i^T$ is the unit row vector with 1 in the $i$th component and 0 elsewhere. Hence $Y_i$ is a Gaussian random variable. To appreciate the convenience of working with (2.23) instead of (2.24), compare this answer with the tedious derivation consisting of integrating over $f_Y$ to obtain $f_{Y_i}$ (see Problem 16).
Exercise 40. Let $U$ be an orthogonal matrix. Determine the pdf of $Y = UW$.
Solution: $Y$ is zero-mean and Gaussian. Its covariance matrix is $K_Y = UK_WU^T = U\sigma^2I_nU^T = \sigma^2UU^T = \sigma^2I_n$, where $I_n$ denotes the $n \times n$ identity matrix. Hence, when an $n$-dimensional Gaussian random vector with iid $\mathcal{N}(0, \sigma^2)$ components is projected onto $n$ orthonormal vectors, we obtain $n$ iid $\mathcal{N}(0, \sigma^2)$ random variables. This result will be used often.
Exercise 41. (Gaussian random variables are not necessarily jointly Gaussian) Let $Y_1 \sim \mathcal{N}(0, 1)$, let $X \in \{\pm 1\}$ be uniformly distributed, and let $Y_2 = Y_1X$. Notice that $Y_2$ has the same pdf as $Y_1$. This follows from the fact that the pdf of $Y_1$ is an even function. Hence $Y_1$ and $Y_2$ are both Gaussian. However, they are not jointly Gaussian. We come to this conclusion by observing that $Z = Y_1 + Y_2 = Y_1(1 + X)$ is 0 with probability 1/2. Hence $Z$ cannot be Gaussian.
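A short simulation (added for illustration, assuming NumPy) makes the argument tangible: the marginal of $Y_2$ behaves like $\mathcal{N}(0, 1)$, yet $Z = Y_1 + Y_2$ has a point mass at 0 and therefore cannot be Gaussian.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
y1 = rng.standard_normal(n)
x = rng.choice([-1, 1], size=n)              # uniform on {+1, -1}, independent of Y1
y2 = y1 * x                                  # same N(0,1) marginal as Y1
z = y1 + y2                                  # = Y1 (1 + X)

print(np.mean(np.abs(y2) < 1), "vs", 0.683)  # Y2 marginal matches the N(0,1) value
print(np.mean(z == 0))                       # about 0.5: Z has a point mass at 0
```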
Exercise 42. Is it true that uncorrelated Gaussian random variables are always independent? If you think it is ... think twice. The construction above labeled "Gaussian random variables are not necessarily jointly Gaussian" provides a counterexample (you should be able to verify this without much effort). However, the statement is true if the random variables under consideration are jointly Gaussian (the emphasis is on "jointly"). You should be able to prove this fact using (2.24). The contrary is always true: random variables (not necessarily Gaussian) that are independent are always uncorrelated. Again, you should be able to provide the straightforward proof.
Definition 43. The random vector $Y$ is a Gaussian random vector (and $Y_1, \ldots, Y_n$ are jointly Gaussian random variables) if and only if $Y - m$ is a zero-mean Gaussian random vector as defined above, where $m = EY$. If the covariance $K_Y$ is non-singular (which implies that no component of $Y$ is determined by a linear combination of other components), then its pdf is
$$f_Y(y) = \frac{1}{\sqrt{(2\pi)^n\det K_Y}}\exp\left\{-\frac{1}{2}(y - m)^T K_Y^{-1}(y - m)\right\}.$$
Appendix 2.D A Fact About Triangles
In Example 19 we derived the error probability for PSK using the fact that for a triangle with edges $a$, $b$, $c$ and angles $\alpha$, $\beta$, $\gamma$ (each angle opposite the edge of the same letter), as shown in the left figure, the following relationship holds:
$$\frac{a}{\sin\alpha} = \frac{b}{\sin\beta} = \frac{c}{\sin\gamma}. \qquad (2.26)$$
[Figure: two copies of a triangle with edges $a$, $b$, $c$; the height from the vertex common to $a$ and $b$ is marked $a\sin\beta$ in one and $b\sin(180^\circ - \alpha)$ in the other.]
To prove the first equality relating $a$ and $b$, we consider the distance between the vertex common to $a$ and $b$ and its projection onto the extension of $c$. As shown in the left figure, this distance may be computed in two ways, obtaining $a\sin\beta$ and $b\sin(180^\circ - \alpha)$, respectively. The latter may be written as $b\sin(\alpha)$. Hence $a\sin\beta = b\sin(\alpha)$, which is the first equality. The second equality is proved similarly.
Appendix 2.E Spaces: Vector; Inner Product; Signal
2.E.1 Vector Space
Most readers are familiar with the notion of vector space from a linear algebra course. Unfortunately, some linear algebra courses for engineers associate vectors to $n$-tuples rather than taking the axiomatic point of view, which is what we need. Hence we review the vector space axioms.
To Be Added: The axioms of a vector space and a few examples.
While in Chapter 2 all vector spaces are of $n$-tuples over $\mathbb{R}$, in later chapters we deal with the vector space of $n$-tuples over $\mathbb{C}$ and the vector space of finite-energy complex-valued functions. In this appendix we consider general vector spaces over the field of complex numbers. They are commonly called complex vector spaces. Vector spaces for which the scalar field is $\mathbb{R}$ are called real vector spaces.
2.E.2 Inner Product Space
Given a vector space and nothing more, one can introduce the notion of a basis for the vector space, but one does not have the tool needed to define an orthonormal basis. Indeed the axioms of a vector space say nothing about geometric ideas such as "length" or "angle". To remedy, one endows the vector space with the notion of inner product.
Definition 44. Let $V$ be a vector space over $\mathbb{C}$. An inner product on $V$ is a function that assigns to each ordered pair of vectors $\alpha, \beta$ in $V$ a scalar $\langle\alpha, \beta\rangle$ in $\mathbb{C}$ in such a way that for all $\alpha, \beta, \gamma$ in $V$ and all scalars $c$ in $\mathbb{C}$:
(a) $\langle\alpha + \beta, \gamma\rangle = \langle\alpha, \gamma\rangle + \langle\beta, \gamma\rangle$ and $\langle c\alpha, \beta\rangle = c\langle\alpha, \beta\rangle$;
(b) $\langle\alpha, \beta\rangle = \langle\beta, \alpha\rangle^*$; (Hermitian symmetry)
(c) $\langle\alpha, \alpha\rangle \geq 0$ with equality if and only if $\alpha = 0$.
It is implicit in (c) that $\langle\alpha, \alpha\rangle$ is real for all $\alpha \in V$. From (a) and (b), we obtain an additional property
(d) $\langle\alpha, \beta + \gamma\rangle = \langle\alpha, \beta\rangle + \langle\alpha, \gamma\rangle$ and $\langle\alpha, c\beta\rangle = c^*\langle\alpha, \beta\rangle$.
Notice that the above definition is also valid for a vector space over the field of real numbers, but in this case the complex conjugates appearing in (b) and (d) are superfluous. However, over the field of complex numbers they are necessary, for otherwise for any $\alpha \neq 0$ we could write
$$0 < \langle i\alpha, i\alpha\rangle = -1\,\langle\alpha, \alpha\rangle < 0,$$
where the first inequality follows from condition (c) and the fact that $i\alpha$ is a valid vector, and the equality follows from (a) and (d) without the complex conjugate. We see that the complex conjugate is necessary or else we can create the contradictory statement $0 < 0$.
On $\mathbb{C}^n$ there is an inner product that is sometimes called the standard inner product. It is defined on $a = (a_1, \ldots, a_n)$ and $b = (b_1, \ldots, b_n)$ by
$$\langle a, b\rangle = \sum_j a_j b_j^*.$$
On $\mathbb{R}^n$, the standard inner product is often called the dot or scalar product and denoted by $a \cdot b$. Unless explicitly stated otherwise, over $\mathbb{R}^n$ and over $\mathbb{C}^n$ we will always assume the standard inner product.
An inner product space is a real or complex vector space, together with a specified inner product on that space. We will use the letter $V$ to denote a generic inner product space.
Example 45. The vector space $\mathbb{R}^n$ equipped with the dot product is an inner product space and so is the vector space $\mathbb{C}^n$ equipped with the standard inner product.
By means of the inner product, we introduce the notion of length, called norm, of a vector $\alpha$, via
$$\|\alpha\| = \sqrt{\langle\alpha, \alpha\rangle}.$$
Using linearity, we immediately obtain that the squared norm satisfies
$$\|\alpha \pm \beta\|^2 = \langle\alpha \pm \beta, \alpha \pm \beta\rangle = \|\alpha\|^2 + \|\beta\|^2 \pm 2\mathrm{Re}\{\langle\alpha, \beta\rangle\}. \qquad (2.27)$$
The above generalizes $(a \pm b)^2 = a^2 + b^2 \pm 2ab$, $a, b \in \mathbb{R}$, and $|a \pm b|^2 = |a|^2 + |b|^2 \pm 2\mathrm{Re}\{ab^*\}$, $a, b \in \mathbb{C}$.
Theorem 46. If $V$ is an inner product space, then for any vectors $\alpha, \beta$ in $V$ and any scalar $c$,
(a) $\|c\alpha\| = |c|\,\|\alpha\|$;
(b) $\|\alpha\| \geq 0$ with equality if and only if $\alpha = 0$;
(c) $|\langle\alpha, \beta\rangle| \leq \|\alpha\|\,\|\beta\|$ with equality if and only if $\alpha = c\beta$ for some $c$ (Cauchy-Schwarz inequality);
(d) $\|\alpha + \beta\| \leq \|\alpha\| + \|\beta\|$ with equality if and only if $\alpha = c\beta$ for some non-negative $c \in \mathbb{R}$ (triangle inequality);
(e) $\|\alpha + \beta\|^2 + \|\alpha - \beta\|^2 = 2(\|\alpha\|^2 + \|\beta\|^2)$ (parallelogram equality).
Proof. Statements (a) and (b) follow immediately from the definitions. We postpone the proof of the Cauchy-Schwarz inequality to Example 50, as at that time we will be able to make a more elegant proof based on the concept of a projection. To prove the triangle inequality we use (2.27) and the Cauchy-Schwarz inequality applied to $\mathrm{Re}\{\langle\alpha, \beta\rangle\} \leq |\langle\alpha, \beta\rangle|$ to prove that $\|\alpha + \beta\|^2 \leq (\|\alpha\| + \|\beta\|)^2$. Notice that $\mathrm{Re}\{\langle\alpha, \beta\rangle\} \leq |\langle\alpha, \beta\rangle|$ holds with equality if and only if $\alpha = c\beta$ for some non-negative $c \in \mathbb{R}$. Hence this condition is necessary for the triangle inequality to hold with equality. It is also sufficient since then also the Cauchy-Schwarz inequality holds with equality. The parallelogram equality follows immediately from (2.27) used twice, once with each sign.
[Figure: vector diagrams illustrating the triangle inequality and the parallelogram equality.]
At this point we could use the inner product and the norm to define the angle between two vectors, but we do not have any use for this. Instead, we will make frequent use of the notion of orthogonality. Two vectors $\alpha$ and $\beta$ are defined to be orthogonal if $\langle\alpha, \beta\rangle = 0$.
Example 47. This example is relevant for what we do from Chapter 3 on. Let $\mathcal{W} = \{w_0(t), \ldots, w_{m-1}(t)\}$ be a finite collection of functions from $\mathbb{R}$ to $\mathbb{C}$ such that
$$\int |w(t)|^2\,dt < \infty$$
for all elements of $\mathcal{W}$. Let $V$ be the complex vector space spanned by the elements of $\mathcal{W}$. The reader should verify that the axioms of a vector space are fulfilled. The standard inner product for functions from $\mathbb{R}$ to $\mathbb{C}$ is defined as
$$\langle\alpha, \beta\rangle = \int\alpha(t)\beta^*(t)\,dt,$$
which implies the norm
$$\|\alpha\| = \sqrt{\int|\alpha(t)|^2\,dt},$$
but it is not a given that this is an inner product on $V$. It is straightforward to verify that the inner product axioms (a), (b) and (d) (Definition 44) are fulfilled for all elements of $V$, but axiom (c) is not necessarily fulfilled (see Example 48). If we set the extra condition that for all $\alpha \in V$, $\langle\alpha, \alpha\rangle = 0$ implies that $\alpha$ is the zero vector, then $V$ endowed with $\langle\cdot, \cdot\rangle$ forms an inner product space. All we have said in this example applies also for the real vector spaces spanned by functions from $\mathbb{R}$ to $\mathbb{R}$.
Example 48. Let $V$ be the set of functions from $\mathbb{R}$ to $\mathbb{R}$ spanned by the function that is zero everywhere, except at 0 where it takes value 1. It can easily be checked that this is a vector space. It contains all the functions that are zero everywhere, except at 0 where they can take on any value in $\mathbb{R}$. Its zero vector is the function that is 0 everywhere, including at 0. For all $\alpha$ in $V$, $\langle\alpha, \alpha\rangle = 0$. Hence $\langle\cdot, \cdot\rangle$ is not an inner product on $V$.
Theorem 49. (Pythagoras' Theorem) If $\alpha$ and $\beta$ are orthogonal vectors in $V$, then
$$\|\alpha + \beta\|^2 = \|\alpha\|^2 + \|\beta\|^2.$$
Proof. Pythagoras' theorem follows immediately from the equality $\|\alpha + \beta\|^2 = \|\alpha\|^2 + \|\beta\|^2 + 2\mathrm{Re}\{\langle\alpha, \beta\rangle\}$ and the fact that $\langle\alpha, \beta\rangle = 0$ by definition of orthogonality.
Given two vectors $\alpha, \beta \in V$, $\beta \neq 0$, we define the projection of $\alpha$ on $\beta$ as the vector $\alpha_{|\beta}$ collinear to $\beta$ (i.e. of the form $c\beta$ for some scalar $c$) such that $\alpha_{\perp\beta} = \alpha - \alpha_{|\beta}$ is orthogonal to $\beta$. Using the definition of orthogonality, what we want is
$$0 = \langle\alpha_{\perp\beta}, \beta\rangle = \langle\alpha - c\beta, \beta\rangle = \langle\alpha, \beta\rangle - c\|\beta\|^2.$$
Solving for $c$ we obtain $c = \frac{\langle\alpha, \beta\rangle}{\|\beta\|^2}$. Hence
$$\alpha_{|\beta} = \frac{\langle\alpha, \beta\rangle}{\|\beta\|^2}\,\beta \quad\text{and}\quad \alpha_{\perp\beta} = \alpha - \alpha_{|\beta}.$$
The projection of $\alpha$ on $\beta$ does not depend on the norm of $\beta$. This is clear from the definition of projection. Alternatively, it can be verified by letting $\tilde{\beta} = b\beta$ for some $b \in \mathbb{C}$ and by verifying that
$$\alpha_{|\tilde{\beta}} = \frac{\langle\alpha, b\beta\rangle}{\|b\beta\|^2}\,b\beta = \frac{\langle\alpha, \beta\rangle}{\|\beta\|^2}\,\beta = \alpha_{|\beta}.$$
In particular, when the vector $\beta$ onto which we project has unit norm we obtain
$$\|\alpha_{|\beta}\| = |\langle\alpha, \beta\rangle|,$$
which tells us that $|\langle\alpha, \beta\rangle|$ has the geometric interpretation of being the length of the projection of $\alpha$ onto the subspace spanned by $\beta$.
[Figure: the projection $\alpha_{|\beta}$ of $\alpha$ on $\beta$.]
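In coordinates, the projection formula reads exactly as derived above. The following sketch (illustrative only, assuming NumPy; the helper name project is ours) computes $\alpha_{|\beta} = \frac{\langle\alpha, \beta\rangle}{\|\beta\|^2}\beta$ with the standard inner product and checks that the residual is orthogonal to $\beta$.

```python
import numpy as np

def project(alpha, beta):
    """Projection of alpha on beta: <alpha, beta> / ||beta||^2 * beta."""
    inner = np.vdot(beta, alpha)              # np.vdot conjugates its first argument,
    return (inner / np.vdot(beta, beta)) * beta  # so this is <alpha, beta> = sum alpha_j conj(beta_j)

alpha = np.array([3.0, 4.0])
beta = np.array([1.0, 0.0])
p = project(alpha, beta)
print(p, np.dot(alpha - p, beta))             # [3. 0.] and 0.0: the residual is orthogonal to beta
```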
Any non-zero vector $\beta \in V$ defines a hyperplane by the relationship
$$\{\alpha \in V : \langle\alpha, \beta\rangle = 0\}.$$
The hyperplane is the set of vectors in $V$ that are orthogonal to $\beta$. A hyperplane always contains the zero vector.
An affine plane, defined by a vector $\beta$ and a scalar $c$, is an object of the form
$$\{\alpha \in V : \langle\alpha, \beta\rangle = c\}.$$
The vector $\beta$ and scalar $c$ that define a hyperplane are not unique, unless we agree that we use only normalized vectors to define hyperplanes. By letting $\hat{\beta} = \frac{\beta}{\|\beta\|}$, the above definition of affine plane may equivalently be written as $\{\alpha \in V : \langle\alpha, \hat{\beta}\rangle = \frac{c}{\|\beta\|}\}$ or even as $\{\alpha \in V : \langle\alpha - \frac{c}{\|\beta\|}\hat{\beta}, \hat{\beta}\rangle = 0\}$. The first form shows that an affine plane is the set of vectors that have the same projection $\frac{c}{\|\beta\|}\hat{\beta}$ on $\hat{\beta}$ (see the next figure). The second form shows that the affine plane is a hyperplane translated by the vector $\frac{c}{\|\beta\|}\hat{\beta}$. Some authors make no distinction between affine planes and hyperplanes; in this case both are called hyperplane.
[Figure: the affine plane defined by $\beta$; all its points have the same projection on $\hat{\beta}$.]
In the example that follows, we use the notion of projection to prove the Cauchy-Schwarz inequality stated in Theorem 46.
Example 50. (Proof of the Cauchy-Schwarz Inequality). The Cauchy-Schwarz inequality states that for any $\alpha, \beta \in V$, $|\langle\alpha, \beta\rangle| \leq \|\alpha\|\,\|\beta\|$ with equality if and only if $\alpha = c\beta$ for some scalar $c \in \mathbb{C}$. The statement is obviously true if $\beta = 0$. Assume $\beta \neq 0$ and write $\alpha = \alpha_{|\beta} + \alpha_{\perp\beta}$. (See the next figure.) Pythagoras' theorem states that $\|\alpha\|^2 = \|\alpha_{|\beta}\|^2 + \|\alpha_{\perp\beta}\|^2$. If we drop the second term, which is always non-negative, we obtain $\|\alpha\|^2 \geq \|\alpha_{|\beta}\|^2$ with equality if and only if $\alpha$ and $\beta$ are collinear. From the definition of projection, $\|\alpha_{|\beta}\|^2 = \frac{|\langle\alpha, \beta\rangle|^2}{\|\beta\|^2}$. Hence $\|\alpha\|^2 \geq \frac{|\langle\alpha, \beta\rangle|^2}{\|\beta\|^2}$ with equality if and only if $\alpha$ and $\beta$ are collinear. This is the Cauchy-Schwarz inequality.
[Figure: the decomposition $\alpha = \alpha_{|\beta} + \alpha_{\perp\beta}$, illustrating the Cauchy-Schwarz inequality.]
Every finite-dimensional vector space has a basis. If $\beta_1, \beta_2, \ldots, \beta_n$ is a basis for the inner product space $V$ and $\alpha \in V$ is an arbitrary vector, then there are scalars $a_1, \ldots, a_n$ such that $\alpha = \sum_i a_i\beta_i$, but finding them may be difficult. However, finding the coefficients of a vector is particularly easy when the basis is orthonormal.
A basis $\psi_1, \psi_2, \ldots, \psi_n$ for an inner product space $V$ is orthonormal if
$$\langle\psi_i, \psi_j\rangle = \begin{cases} 0, & i \neq j \\ 1, & i = j. \end{cases}$$
Finding the $i$th coefficient $a_i$ of an orthonormal expansion $\alpha = \sum_i a_i\psi_i$ is immediate. It suffices to observe that all but the $i$th term of $\sum_i a_i\psi_i$ are orthogonal to $\psi_i$ and that the inner product of the $i$th term with $\psi_i$ yields $a_i$. Hence if $\alpha = \sum_i a_i\psi_i$ then
$$a_i = \langle\alpha, \psi_i\rangle.$$
Observe that $|a_i|$ is the norm of the projection of $\alpha$ on $\psi_i$. This should not be surprising given that the $i$th term of the orthonormal expansion of $\alpha$ is collinear to $\psi_i$ and the sum of all the other terms is orthogonal to $\psi_i$.
There is another major advantage to working with an orthonormal basis. If $a$ and $b$ are the $n$-tuples of coefficients of the expansion of $\alpha$ and $\beta$ with respect to the same orthonormal basis, then
$$\langle\alpha, \beta\rangle = \langle a, b\rangle,$$
where the right-hand side inner product is with respect to the standard inner product. Indeed
$$\langle\alpha, \beta\rangle = \Bigl\langle\sum_i a_i\psi_i, \sum_j b_j\psi_j\Bigr\rangle = \sum_i a_i\Bigl\langle\psi_i, \sum_j b_j\psi_j\Bigr\rangle = \sum_i a_i\langle\psi_i, b_i\psi_i\rangle = \sum_i a_i b_i^* = \langle a, b\rangle.$$
Letting $\beta = \alpha$, the above implies also
$$\|\alpha\| = \|a\|,$$
where $\|a\| = \sqrt{\sum_i |a_i|^2}$.
An orthonormal set of vectors $\psi_1, \ldots, \psi_n$ of an inner product space $V$ is a linearly independent set. Indeed $0 = \sum_i a_i\psi_i$ implies $a_i = \langle 0, \psi_i\rangle = 0$. By normalizing the vectors and recomputing the coefficients, we can easily extend this reasoning to a set of orthogonal (but not necessarily orthonormal) vectors $\varphi_1, \ldots, \varphi_n$. They too must be linearly independent.
The idea of a projection on a vector generalizes to a projection on a subspace. If $U$ is a subspace of an inner product space $V$, and $\alpha \in V$, the projection of $\alpha$ on $U$ is defined to be a vector $\alpha_{|U} \in U$ such that $\alpha - \alpha_{|U}$ is orthogonal to all vectors in $U$. If $\psi_1, \ldots, \psi_m$ is an orthonormal basis for $U$, then the condition that $\alpha - \alpha_{|U}$ is orthogonal to all vectors of $U$ implies $0 = \langle\alpha - \alpha_{|U}, \psi_i\rangle = \langle\alpha, \psi_i\rangle - \langle\alpha_{|U}, \psi_i\rangle$. This shows that $\langle\alpha, \psi_i\rangle = \langle\alpha_{|U}, \psi_i\rangle$. The right side of this equality is the $i$th coefficient of the orthonormal expansion of $\alpha_{|U}$ with respect to the orthonormal basis. This proves that
$$\alpha_{|U} = \sum_{i=1}^{m}\langle\alpha, \psi_i\rangle\psi_i$$
is the unique projection of $\alpha$ on $U$. We summarize this important result and prove that the projection of $\alpha$ on $U$ is the element of $U$ that is closest to $\alpha$.
Theorem 51. Let $U$ be a subspace of an inner product space $V$, and let $\alpha \in V$. The projection of $\alpha$ on $U$, denoted by $\alpha_{|U}$, is the unique element of $U$ that satisfies any (hence all) of the following conditions:
(i) $\alpha - \alpha_{|U}$ is orthogonal to every element of $U$;
(ii) $\alpha_{|U} = \sum_{i=1}^{m}\langle\alpha, \psi_i\rangle\psi_i$;
(iii) for any $\beta \in U$, $\|\alpha - \alpha_{|U}\| \leq \|\alpha - \beta\|$.
Proof. Statement (i) is the definition of projection and we have already proved that it is equivalent to statement (ii). Now consider any vector $\beta \in U$. From Pythagoras' theorem and the fact that $\alpha - \alpha_{|U}$ is orthogonal to $\alpha_{|U} - \beta \in U$, we obtain
$$\|\alpha - \beta\|^2 = \|\alpha - \alpha_{|U} + \alpha_{|U} - \beta\|^2 = \|\alpha - \alpha_{|U}\|^2 + \|\alpha_{|U} - \beta\|^2 \geq \|\alpha - \alpha_{|U}\|^2.$$
Moreover, equality holds if and only if $\|\alpha_{|U} - \beta\|^2 = 0$, i.e., if and only if $\alpha_{|U} = \beta$.
Theorem 52. Let $V$ be an inner product space and let $\beta_1, \ldots, \beta_n$ be any collection of linearly independent vectors in $V$. Then we may construct orthogonal vectors $\varphi_1, \ldots, \varphi_n$ in $V$ such that they form a basis for the subspace spanned by $\beta_1, \ldots, \beta_n$.
Proof. The proof is constructive via a procedure known as the Gram-Schmidt orthogonalization procedure. First let $\varphi_1 = \beta_1$. The other vectors are constructed inductively as follows. Suppose $\varphi_1, \ldots, \varphi_m$ have been chosen so that they form an orthogonal basis for the subspace $U_m$ spanned by $\beta_1, \ldots, \beta_m$. We choose the next vector as
$$\varphi_{m+1} = \beta_{m+1} - \beta_{m+1|U_m}, \qquad (2.28)$$
where $\beta_{m+1|U_m}$ is the projection of $\beta_{m+1}$ on $U_m$. By definition, $\varphi_{m+1}$ is orthogonal to every vector in $U_m$, including $\varphi_1, \ldots, \varphi_m$. Also, $\varphi_{m+1} \neq 0$, for otherwise $\beta_{m+1}$ contradicts the hypothesis that it is linearly independent of $\beta_1, \ldots, \beta_m$. Therefore $\varphi_1, \ldots, \varphi_{m+1}$ is an orthogonal collection of non-zero vectors in the subspace $U_{m+1}$ spanned by $\beta_1, \ldots, \beta_{m+1}$. Therefore it must be a basis for $U_{m+1}$. Thus the vectors $\varphi_1, \ldots, \varphi_n$ may be constructed one after the other according to (2.28).
Corollary 53. Every finite-dimensional inner product space has an orthonormal basis.
Proof. Let $\beta_1, \ldots, \beta_n$ be a basis for the finite-dimensional inner product space $V$. Apply the Gram-Schmidt procedure to find an orthogonal basis $\varphi_1, \ldots, \varphi_n$. Then $\psi_1, \ldots, \psi_n$, where $\psi_i = \frac{\varphi_i}{\|\varphi_i\|}$, is an orthonormal basis.
Gram-Schmidt Orthonormalization Procedure
We summarize the Gram-Schmidt procedure, modified so as to produce orthonormal vectors. If $\beta_1, \ldots, \beta_n$ is a linearly independent collection of vectors in the inner product space $V$, then we may construct a collection $\psi_1, \ldots, \psi_n$ that forms an orthonormal basis for the subspace spanned by $\beta_1, \ldots, \beta_n$ as follows: we let $\psi_1 = \frac{\beta_1}{\|\beta_1\|}$ and for $i = 2, \ldots, n$ we choose
$$\varphi_i = \beta_i - \sum_{j=1}^{i-1}\langle\beta_i, \psi_j\rangle\psi_j, \qquad \psi_i = \frac{\varphi_i}{\|\varphi_i\|}.$$
We have assumed that $\beta_1, \ldots, \beta_n$ is a linearly independent collection. Now assume that this is not the case. If $\beta_j$ is linearly dependent on $\beta_1, \ldots, \beta_{j-1}$, then at step $i = j$ the procedure will produce $\varphi_i = 0$. Such vectors are simply disregarded.
2.E.3 Signal Space
The following table gives an example of the Gram-Schmidt procedure applied to a set of
signals.
[Table 2.1: Application of the Gram-Schmidt orthonormalization procedure starting with the waveforms given in the first column. For each step $i$ the table lists $\beta_i$, the coefficients $\langle\beta_i, \psi_j\rangle$ for $j < i$, the projection $\beta_{i|V_{i-1}}$, the difference $\varphi_i = \beta_i - \beta_{i|V_{i-1}}$, its norm $\|\varphi_i\|$, and the resulting orthonormal waveform $\psi_i$; the waveforms themselves cannot be reproduced here.]
Appendix 2.F Exercises
Problem 1. (Probabilities of Basic Events) Assume that $X_1$ and $X_2$ are independent random variables that are uniformly distributed in the interval $[0, 1]$. Compute the probability of the following events. Hint: For each event, identify the corresponding region inside the unit square.
(a) $0 \leq X_1 - X_2 \leq \frac{1}{3}$.
(b) $X_1^3 \leq X_2 \leq X_1^2$.
(c) $X_2 - X_1 = \frac{1}{2}$.
(d) $\left(X_1 - \frac{1}{2}\right)^2 + \left(X_2 - \frac{1}{2}\right)^2 \leq \left(\frac{1}{2}\right)^2$.
(e) Given that $X_1 \geq \frac{1}{4}$, compute the probability that $\left(X_1 - \frac{1}{2}\right)^2 + \left(X_2 - \frac{1}{2}\right)^2 \leq \left(\frac{1}{2}\right)^2$.
Problem 2. (Basic Probabilities) Find the following probabilities.
(a) A box contains m white and n black balls. Suppose k balls are drawn. Find the
probability of drawing at least one white ball.
(b) We have two coins; the first is fair and the second is two-headed. We pick one of the
coins at random, we toss it twice and heads shows both times. Find the probability
that the coin is fair.
Problem 3. (Conditional Distribution) Assume that $X$ and $Y$ are random variables with joint probability density function
$$f_{X,Y}(x, y) = \begin{cases} A, & 0 \leq x < y \leq 1 \\ 0, & \text{otherwise.} \end{cases}$$
(a) Are $X$ and $Y$ independent?
(b) Find the value of $A$.
(c) Find the marginal distribution of $Y$. Do this first by arguing geometrically, then compute it formally.
(d) Find $\psi(y) = E[X|Y = y]$. Hint: Argue geometrically.
(e) Find $E[\psi(Y)]$ using the marginal distribution of $Y$.
(f) Find $E[X]$ and show that $E[X] = E[E[X|Y]]$.
Problem 4. (Playing Darts) Assume that you are throwing darts at a target. We assume that the target is one-dimensional, i.e., that the darts all end up on a line. The bull's eye is in the center of the line, and we give it the coordinate 0. The position of a dart on the target can then be measured with respect to 0. We assume that the position $X_1$ of a dart that lands on the target is a random variable that has a Gaussian distribution with variance $\sigma_1^2$ and mean 0. Assume now that there is a second target, which is further away. If you throw a dart to that target, the position $X_2$ has a Gaussian distribution with variance $\sigma_2^2$ (where $\sigma_2^2 > \sigma_1^2$) and mean 0. You play the following game: you toss a coin which gives you $Z = 1$ with probability $p$ and $Z = 0$ with probability $1 - p$ for some fixed $p \in [0, 1]$. If $Z = 1$, you throw a dart onto the first target. If $Z = 0$, you aim for the second target instead. Let $X$ be the relative position of the dart with respect to the center of the target that you have chosen.
(a) Write down $X$ in terms of $X_1$, $X_2$ and $Z$.
(b) Compute the variance of $X$. Is $X$ Gaussian?
(c) Let $S = |X|$ be the score, which is given by the distance of the dart to the center of the target (that you picked using the coin). Compute the average score $E[S]$.
Problem 5. (Uncorrelated vs. Independent Random Variables) Let X and Y be two
continuous real-valued random variables with joint probability density function $f_{XY}$.
(a) Show that if $X$ and $Y$ are independent, they are also uncorrelated.
(b) Consider two independent and uniformly distributed random variables $U \in \{0, 1\}$ and $V \in \{0, 1\}$. Assume that $X$ and $Y$ are defined as follows: $X = U + V$ and $Y = |U - V|$. Are $X$ and $Y$ independent? Compute the covariance of $X$ and $Y$. What do you conclude?
Problem 6. (One of Three) Assume you are participating in a quiz show. You are
shown three boxes that look identical from the outside, except they have labels 0, 1, and
2, respectively. Only one of them contains one million Swiss francs, the other two contain
nothing. You choose one box at random with a uniform probability. Let A be the random
variable that denotes your choice, $A \in \{0, 1, 2\}$.
(a) What is the probability that the box A contains the money?
The quizmaster knows in which box the money is and he now opens, from the re-
maining two boxes, the one that does not contain the prize. This means that if
neither of the two remaining boxes contain the prize then the quizmaster opens one
with uniform probability. Otherwise, he simply opens the one that does not contain
the prize. Let B denote the random variable corresponding to the box that remains
closed after the elimination by the quizmaster.
(b) What is the probability that B contains the money?
(c) If you are now allowed to change your mind, i.e., choose B instead of sticking with
A, would you do it?
Problem 7. (Hypothesis Testing: Uniform and Uniform) Consider a binary hypothesis testing problem in which the hypotheses $H = 0$ and $H = 1$ occur with probability $P_H(0)$ and $P_H(1) = 1 - P_H(0)$, respectively. The observable $Y$ takes values in $\{0, 1\}^{2k}$, where $k$ is a fixed positive integer. When $H = 0$, each component of $Y$ is 0 or 1 with probability $\frac{1}{2}$ and components are independent. When $H = 1$, $Y$ is chosen uniformly at random from the set of all sequences of length $2k$ that have an equal number of ones and zeros. There are $\binom{2k}{k}$ such sequences.
(a) What is $P_{Y|H}(y|0)$? What is $P_{Y|H}(y|1)$?
(b) Find a maximum likelihood decision rule for $H$ based on $y$. What is the single number you need to know about $y$ to implement this decision rule?
(c) Find a decision rule that minimizes the error probability.
(d) Are there values of $P_H(0)$ such that the decision rule that minimizes the error probability always decides for one of the two hypotheses regardless of $y$? If yes, what are these values, and what is the decision?
Problem 8. (The Wetterfrosch) Let us assume that a weather frog bases his forecast of tomorrow's weather entirely on today's air pressure. Determining a weather forecast is a hypothesis testing problem. For simplicity, let us assume that the weather frog only needs to tell us if the forecast for tomorrow's weather is "sunshine" or "rain". Hence we are dealing with binary hypothesis testing. Let $H = 0$ mean sunshine and $H = 1$ mean rain. We will assume that both values of $H$ are equally likely, i.e. $P_H(0) = P_H(1) = \frac{1}{2}$.
Measurements over several years have led the weather frog to conclude that on a day that precedes sunshine the pressure may be modeled as a random variable $Y$ with the following probability density function:
$$f_{Y|H}(y|0) = \begin{cases} A - \frac{A}{2}y, & 0 \leq y \leq 1 \\ 0, & \text{otherwise.} \end{cases}$$
Similarly, the pressure on a day that precedes a rainy day is distributed according to
$$f_{Y|H}(y|1) = \begin{cases} B + \frac{B}{3}y, & 0 \leq y \leq 1 \\ 0, & \text{otherwise.} \end{cases}$$
The weather frog's purpose in life is to guess the value of $H$ after measuring $Y$.
(a) Determine $A$ and $B$.
(b) Find the a posteriori probability $P_{H|Y}(0|y)$. Also find $P_{H|Y}(1|y)$.
(c) Plot $P_{H|Y}(0|y)$ and $P_{H|Y}(1|y)$ as a function of $y$. Show that the implementation of the decision rule $\hat{H}(y) = \arg\max_i P_{H|Y}(i|y)$ reduces to
$$\hat{H}(y) = \begin{cases} 0, & \text{if } y \leq \theta^\star \\ 1, & \text{otherwise,} \end{cases} \qquad (2.29)$$
for some threshold $\theta^\star$ and specify the threshold's value. Do so by direct calculation, rather than using the general result (2.4).
(d) Now assume that you implement the decision rule $\hat{H}_\theta(y)$ that uses an arbitrary threshold $\theta$ in place of $\theta^\star$, and determine, as a function of $\theta$, the probability that the decision rule decides $\hat{H} = 1$ given that $H = 0$. This probability is denoted $\Pr\{\hat{H}_\theta(Y) = 1 \mid H = 0\}$.
(e) For the same decision rule, determine the probability of error $P_e(\theta)$ as a function of $\theta$. Evaluate your expression at $\theta = \theta^\star$.
(f) Using calculus, find the $\theta$ that minimizes $P_e(\theta)$ and compare your result to $\theta^\star$. Could you have found the minimizing $\theta$ without any calculation?
Problem 9. (Hypothesis Testing in Laplacian Noise) Consider the following hypothesis testing problem between two equally likely hypotheses. Under hypothesis $H = 0$, the observable $Y$ is equal to $a + Z$ where $Z$ is a random variable with Laplacian distribution
$$f_Z(z) = \frac{1}{2}e^{-|z|}.$$
Under hypothesis $H = 1$, the observable is given by $-a + Z$. You may assume that $a$ is positive.
(a) Find and draw the density $f_{Y|H}(y|0)$ of the observable under hypothesis $H = 0$, and the density $f_{Y|H}(y|1)$ of the observable under hypothesis $H = 1$.
(b) Find the decision rule that minimizes the probability of error. Write out the expression for the likelihood ratio.
(c) Compute the probability of error of the optimal decision rule.
Problem 10. (Poisson Parameter Estimation) In this example there are two hypotheses, $H = 0$ and $H = 1$, which occur with probabilities $P_H(0) = p_0$ and $P_H(1) = 1 - p_0$, respectively. The observable $Y$ takes values in the set of nonnegative integers. Under hypothesis $H = 0$, $Y$ is distributed according to a Poisson law with parameter $\lambda_0$, i.e.
$$p_{Y|H}(y|0) = \frac{\lambda_0^y}{y!}e^{-\lambda_0}. \qquad (2.30)$$
Under hypothesis $H = 1$,
$$p_{Y|H}(y|1) = \frac{\lambda_1^y}{y!}e^{-\lambda_1}. \qquad (2.31)$$
This is a model for the reception of photons in optical communications.
(a) Derive the MAP decision rule by indicating likelihood and log-likelihood ratios. Hint: The direction of an inequality changes if both sides are multiplied by a negative number.
(b) Derive the formula for the probability of error of the MAP decision rule.
(c) For $p_0 = 1/3$, $\lambda_0 = 2$ and $\lambda_1 = 10$, compute the probability of error of the MAP decision rule. You may want to use a computer program to do this.
(d) Repeat (c) with $\lambda_1 = 20$ and comment.
Problem 11. (Lie Detector) You are asked to develop a lie detector and analyze its performance. Based on the observation of brain cell activity, your detector has to decide if a person is telling the truth or is lying. For the purpose of this problem, the brain cell produces a sequence of spikes. For your decision you may use only a sequence of $n$ consecutive inter-arrival times $Y_1, Y_2, \ldots, Y_n$. Hence $Y_1$ is the time elapsed between the first and second spike, $Y_2$ the time between the second and third, etc. We assume that, a priori, a person lies with some known probability $p$. When the person is telling the truth, $Y_1, \ldots, Y_n$ is an i.i.d. sequence of exponentially distributed random variables with intensity $\alpha$, ($\alpha > 0$), i.e.
$$f_{Y_i}(y) = \alpha e^{-\alpha y}, \quad y \geq 0.$$
When the person lies, $Y_1, \ldots, Y_n$ is i.i.d. exponentially distributed with intensity $\beta$, ($\beta < \alpha$).
(a) Describe the decision rule of your lie detector for the special case $n = 1$. Your detector should be designed so as to minimize the probability of error.
(b) What is the probability $P_{L|T}$ that your lie detector says that the person is lying when the person is telling the truth?
(c) What is the probability $P_{T|L}$ that your test says that the person is telling the truth when the person is lying?
(d) Repeat (a) and (b) for a general n. Hint: There is no need to repeat every step of
your previous derivations.
Problem 12. (Fault Detector) As an engineer, you are required to design the test performed by a fault-detector for a black-box that produces a sequence of i.i.d. binary random variables $X_1, X_2, X_3, \ldots$. Previous experience shows that this black box has an a priori failure probability of $\frac{1}{1025}$. When the black box works properly, $p_{X_i}(1) = p$. When it fails, the output symbols are equally likely to be 0 or 1. Your detector has to decide based on the observation of the past 16 symbols, i.e., at time $k$ the decision will be based on $X_{k-16}, \ldots, X_{k-1}$.
(a) Describe your test.
(b) What does your test decide if it observes the output sequence 0101010101010101?
Assume that p = 0.25.
Problem 13. (MAP Decoding Rule: Alternative Derivation) Consider the binary hypothesis testing problem where $H$ takes values in $\{0, 1\}$ with probabilities $P_H(0)$ and $P_H(1)$ and the conditional probability density function of the observation $Y \in \mathbb{R}$ given $H = i$, $i \in \{0, 1\}$, is given by $f_{Y|H}(\cdot|i)$. Let $R_i$ be the decoding region for hypothesis $i$, i.e., the set of $y$ for which the decision is $\hat{H} = i$, $i \in \{0, 1\}$.
(a) Show that the probability of error is given by
$$P_e = P_H(1) + \int_{R_1}\bigl[P_H(0)f_{Y|H}(y|0) - P_H(1)f_{Y|H}(y|1)\bigr]\,dy.$$
Hint: Note that $\mathbb{R} = R_0 \cup R_1$ and $\int_{\mathbb{R}} f_{Y|H}(y|i)\,dy = 1$ for $i \in \{0, 1\}$.
(b) Argue that $P_e$ is minimized when
$$R_1 = \{y \in \mathbb{R} : P_H(0)f_{Y|H}(y|0) < P_H(1)f_{Y|H}(y|1)\},$$
i.e., for the MAP rule.
Problem 14. (One Bit over a Binary Channel with Memory) Consider communicating one bit via $n$ uses of a binary channel with memory. The channel output $Y_i$ at time instant $i$ is given by
$$Y_i = X_i \oplus Z_i, \quad i = 1, \ldots, n,$$
where $X_i$ is the binary channel input, $Z_i$ is the binary noise and $\oplus$ represents modulo 2 addition. All random variables take value in $\{0, 1\}$. The noise sequence is generated as follows: $Z_1$ is generated from the distribution $P_{Z_1}(1) = p$ and for $i > 1$,
$$Z_i = Z_{i-1} \oplus N_i,$$
where $N_2, \ldots, N_n$ are i.i.d. with probability $P_N(1) = p$. Let the codewords (the sequence of symbols sent on the channel) corresponding to message 0 and 1 be $(X_1^{(0)}, \ldots, X_n^{(0)})$ and $(X_1^{(1)}, \ldots, X_n^{(1)})$, respectively.
(a) Consider the following operation by the receiver. The receiver creates the vector $(\tilde{Y}_1, \tilde{Y}_2, \ldots, \tilde{Y}_n)^T$ where $\tilde{Y}_1 = Y_1$ and for $i = 2, 3, \ldots, n$, $\tilde{Y}_i = Y_i \oplus Y_{i-1}$. Argue that the vector created by the receiver is a sufficient statistic. Hint: Show that $(Y_1, Y_2, \ldots, Y_n)$ can be reconstructed from $(\tilde{Y}_1, \tilde{Y}_2, \ldots, \tilde{Y}_n)$.
(b) Write down $(\tilde{Y}_1, \tilde{Y}_2, \ldots, \tilde{Y}_n)$ for each of the hypotheses. Notice the similarity with the problem of communicating one bit via $n$ uses of a binary symmetric channel.
(c) How should the receiver decide between $(X_1^{(0)}, \ldots, X_n^{(0)})$ and $(X_1^{(1)}, \ldots, X_n^{(1)})$ so as to minimize the probability of error?
Problem 15. (Independent and Identically Distributed versus First-Order Markov) Consider testing two equally likely hypotheses $H = 0$ and $H = 1$. The observable $Y = (Y_1, \ldots, Y_k)^T$ is a $k$-dimensional binary vector. Under $H = 0$ the components of the vector $Y$ are independent uniform random variables (also called Bernoulli(1/2) random variables). Under $H = 1$, the component $Y_1$ is also uniform, but the components $Y_i$, $2 \leq i \leq k$, are distributed as follows:
$$P_{Y_i|Y_1,\ldots,Y_{i-1}}(y_i|y_1, \ldots, y_{i-1}) = \begin{cases} 3/4, & \text{if } y_i = y_{i-1} \\ 1/4, & \text{otherwise.} \end{cases} \qquad (2.32)$$
(i) Find the decision rule that minimizes the probability of error. Hint: Write down a short sample sequence $(y_1, \ldots, y_k)$ and determine its probability under each hypothesis. Then generalize.
(ii) Give a simple sufficient statistic for this decision.
(iii) Suppose that the observed sequence alternates between 0 and 1 except for one string of ones of length $s$, i.e. the observed sequence $y$ looks something like
$$y = 0101010111111\ldots111111010101\ldots. \qquad (2.33)$$
What is the least $s$ such that we decide for hypothesis $H = 1$?
Problem 16. (Real-Valued Gaussian Random Variables) For the purpose of this problem, two zero-mean real-valued Gaussian random variables $X$ and $Y$ are called jointly Gaussian if and only if their joint density is
$$f_{XY}(x, y) = \frac{1}{2\pi\sqrt{\det\Sigma}}\exp\left\{-\frac{1}{2}\,(x, y)\,\Sigma^{-1}\begin{pmatrix}x\\ y\end{pmatrix}\right\}, \qquad (2.34)$$
where (for zero-mean random vectors) the so-called covariance matrix is
$$\Sigma = E\left[\begin{pmatrix}X\\ Y\end{pmatrix}(X, Y)\right] = \begin{pmatrix}\sigma_X^2 & \sigma_{XY}\\ \sigma_{XY} & \sigma_Y^2\end{pmatrix}. \qquad (2.35)$$
(a) Show that if $X$ and $Y$ are zero-mean jointly Gaussian random variables, then $X$ is a zero-mean Gaussian random variable, and so is $Y$.
(b) Show that if $X$ and $Y$ are independent zero-mean Gaussian random variables, then $X$ and $Y$ are zero-mean jointly Gaussian random variables.
(c) However, if $X$ and $Y$ are Gaussian random variables but not independent, then $X$ and $Y$ are not necessarily jointly Gaussian. Give an example where $X$ and $Y$ are Gaussian random variables, yet they are not jointly Gaussian.
(d) Let $X$ and $Y$ be independent Gaussian random variables with zero mean and variance $\sigma_X^2$ and $\sigma_Y^2$, respectively. Find the probability density function of $Z = X + Y$. Observe that no computation is required if we use the definition of jointly Gaussian random variables given in Appendix 2.C.
Problem 17. (Correlation versus Independence) Let $Z$ be a random variable with probability density function
$$f_Z(z) = \begin{cases} 1/2, & -1 \leq z \leq 1 \\ 0, & \text{otherwise.} \end{cases}$$
Also, let $X = Z$ and $Y = Z^2$.
(a) Show that $X$ and $Y$ are uncorrelated.
(b) Are $X$ and $Y$ independent?
(c) Now let $X$ and $Y$ be jointly Gaussian, zero mean, uncorrelated with variances $\sigma_X^2$ and $\sigma_Y^2$ respectively. Are $X$ and $Y$ independent? Justify your answer.
Problem 18. (Uniform Polar to Cartesian) Let $R$ and $\Theta$ be independent random variables. $R$ is distributed uniformly over the unit interval, $\Theta$ is distributed uniformly over the interval $[0, 2\pi)$.
(a) Interpret $R$ and $\Theta$ as the polar coordinates of a point in the plane. It is clear that the point lies inside (or on) the unit circle. Is the distribution of the point uniform over the unit disk? Take a guess!
(b) Define the random variables
$$X = R\cos\Theta, \qquad Y = R\sin\Theta.$$
Find the joint distribution of the random variables $X$ and $Y$ by using the Jacobian determinant.
(c) Does the result of part (b) support or contradict your guess from part (a)? Explain.
Problem 19. (Sufficient Statistic) Consider a binary hypothesis testing problem specified by:
$$H = 0 : \begin{cases} Y_1 = Z_1 \\ Y_2 = Z_1 Z_2 \end{cases} \qquad H = 1 : \begin{cases} Y_1 = -Z_1 \\ Y_2 = -Z_1 Z_2 \end{cases}$$
where $Z_1$, $Z_2$ and $H$ are independent random variables. Is $Y_1$ a sufficient statistic? Hint: If $Y = aZ$ for some scalar $a$, then $f_Y(y) = \frac{1}{|a|}f_Z\left(\frac{y}{a}\right)$.
Problem 20. (More on Sufficient Statistic) We have seen that if $H \to T(Y) \to Y$, then the probability of error $P_e$ of a MAP decoder that decides on the value of $H$ upon observing both $T(Y)$ and $Y$ is the same as that of a MAP decoder that observes only $T(Y)$. It is natural to wonder if the contrary is also true, specifically if the knowledge that $Y$ does not help reduce the error probability that we can achieve with $T(Y)$ implies $H \to T(Y) \to Y$. Here is a counterexample. Let the hypothesis $H$ be either 0 or 1 with equal probability (the choice of distribution on $H$ is critical in this example). Let the observable $Y$ take four values with the following conditional probabilities
$$P_{Y|H}(y|0) = \begin{cases} 0.4 & \text{if } y = 0\\ 0.3 & \text{if } y = 1\\ 0.2 & \text{if } y = 2\\ 0.1 & \text{if } y = 3 \end{cases} \qquad P_{Y|H}(y|1) = \begin{cases} 0.1 & \text{if } y = 0\\ 0.2 & \text{if } y = 1\\ 0.3 & \text{if } y = 2\\ 0.4 & \text{if } y = 3 \end{cases}$$
and $T(Y)$ is the following function
$$T(y) = \begin{cases} 0 & \text{if } y = 0 \text{ or } y = 1\\ 1 & \text{if } y = 2 \text{ or } y = 3. \end{cases}$$
(a) Show that the MAP decoder $\hat{H}(T(y))$ that decides based on $T(y)$ is equivalent to the MAP decoder $\hat{H}(y)$ that operates based on $y$.
(b) Compute the probabilities $\Pr\{Y = 0 \mid T(Y) = 0, H = 0\}$ and $\Pr\{Y = 0 \mid T(Y) = 0, H = 1\}$. Is it true that $H \to T(Y) \to Y$?
Problem 21. (Fisher-Neyman Factorization Theorem) Consider the hypothesis testing problem where the hypothesis is $H \in \{0, 1, \ldots, m-1\}$, the observable is $Y$, and $T(Y)$ is a function of the observable. Let $f_{Y|H}(y|i)$ be given for all $i \in \{0, 1, \ldots, m-1\}$. Suppose that there are functions $g_0, g_1, \ldots, g_{m-1}$ so that for each $i \in \{0, 1, \ldots, m-1\}$ one can write
$$f_{Y|H}(y|i) = g_i(T(y))\,h(y). \qquad (2.36)$$
(a) Show that when the above conditions are satisfied, a MAP decision depends on the observable $Y$ only through $T(Y)$. In other words, $Y$ itself is not necessary. Hint: Work directly with the definition of a MAP decision rule.
(b) Show that $T(Y)$ is a sufficient statistic, that is $H \to T(Y) \to Y$. Hint: Start by observing the following fact. Given a random variable $Y$ with probability density function $f_Y(y)$ and given an arbitrary event $B$, we have
$$f_{Y|Y\in B}(y) = \frac{f_Y(y)\,1_B(y)}{\int_B f_Y(y)\,dy}. \qquad (2.37)$$
Proceed by defining $B$ to be the event $B = \{y : T(y) = t\}$ and make use of (2.37) applied to $f_{Y|H}(y|i)$ to prove that $f_{Y|H,T(Y)}(y|i, t)$ is independent of $i$.
(c) (Example 1) Let $Y = (Y_1, Y_2, \ldots, Y_n)$, $Y_k \in \{0, 1\}$, be an independent and identically distributed (i.i.d.) sequence of coin tosses such that $P_{Y_k|H}(1|i) = p_i$. Show that the function $T(y_1, y_2, \ldots, y_n) = \sum_{k=1}^{n} y_k$ fulfills the condition expressed in equation (2.36). Notice that $T(y_1, y_2, \ldots, y_n)$ is the number of 1s in $(y_1, y_2, \ldots, y_n)$.
(d) (Example 2) Under hypothesis $H = i$, let the observable $Y_k$ be Gaussian distributed with mean $m_i$ and variance 1; that is
$$f_{Y_k|H}(y|i) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{(y - m_i)^2}{2}},$$
and let $Y_1, Y_2, \ldots, Y_n$ be independently drawn according to this distribution. Show that the sample mean $T(y_1, y_2, \ldots, y_n) = \frac{1}{n}\sum_{k=1}^{n} y_k$ fulfills the condition expressed in equation (2.36).
Problem 22. (Irrelevance and Operational Irrelevance) Let the hypothesis $H$ be related to the observables $(U, V)$ via the channel $P_{U,V|H}$ and for simplicity assume that $P_{U|H}(u|h) > 0$ and $P_{V|U,H}(v|u, h) > 0$ for every $h \in \mathcal{H}$, $v \in \mathcal{V}$ and $u \in \mathcal{U}$. We say that $V$ is operationally irrelevant if a MAP decoder that observes $(U, V)$ achieves the same probability of error as one that observes only $U$, and this is true regardless of $P_H$. We now prove that irrelevance and operational irrelevance imply one another. We have already proved that irrelevance implies operational irrelevance. Hence it suffices to show that operational irrelevance implies irrelevance or, equivalently, that if $V$ is not irrelevant, then it is not operationally irrelevant. We will prove the latter statement. We begin with a few observations that are instructive. By definition, $V$ irrelevant means $H \to U \to V$. Hence $V$ irrelevant is equivalent to the statement that, conditioned on $U$, the random variables $H$ and $V$ are independent. This gives us one intuitive explanation about why $V$ is operationally irrelevant. Once we observe that $U = u$, we can restate the hypothesis testing problem in terms of a hypothesis $H$ and an observable $V$ that are independent (conditioned on $U = u$) and, because of independence, from $V$ we learn nothing about $H$. But if $V$ is not irrelevant, then there is at least one $u$, call it $u^*$, for which $H$ and $V$ are not independent conditioned on $U = u^*$. It is when such a $u$ is observed that we should be able to prove that $V$ affects the decision. This suggests that the problem we are trying to solve is intimately related to the simpler problem that involves the hypothesis $H$ and the observable $V$ and the two are not independent. We begin with this problem and then we generalize.
(a) Let the hypothesis be $H \in \mathcal{H}$ (of yet unspecified distribution) and let the observable $V \in \mathcal{V}$ be related to $H$ via an arbitrary but fixed channel $P_{V|H}$. Show that if $V$ is not independent of $H$ then there are distinct elements $i, j \in \mathcal{H}$ and distinct elements $k, l \in \mathcal{V}$ such that
$$P_{V|H}(k|i) > P_{V|H}(k|j), \qquad P_{V|H}(l|i) < P_{V|H}(l|j). \qquad (2.38)$$
(b) Under the condition of the previous question, show that there is a distribution $P_H$ for which the observable $V$ affects the decision of a MAP decoder.
(c) Generalize to show that if the observables are $U$ and $V$ and $P_{U,V|H}$ is fixed so that $H \to U \to V$ does not hold, then there is a distribution on $H$ for which $V$ is not operationally irrelevant.
Problem 23. (16-PAM versus 16-QAM) The two signal constellations in Figure 2.14 are used to communicate across an additive white Gaussian noise channel. Let the noise variance be σ². Each point represents a codeword c_i for some i. Assume each codeword is used with the same probability.

(a) For each signal constellation, compute the average probability of error P_e as a function of the parameters a and b, respectively.
Figure 2.14: The 16-PAM constellation (16 points on a line with spacing a) and the 16-QAM constellation (a 4-by-4 grid with spacing b).
(b) For each signal constellation, compute the average energy per symbol E as a function of the parameters a and b, respectively:

E = Σ_{i=1}^{16} P_H(i) ‖c_i‖².   (2.39)

(c) Plot P_e versus E for both signal constellations and comment.
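A numerical companion (not a solution): the sketch below builds the two constellations under the geometry I read off Figure 2.14, namely 16-PAM with spacing a on a line and 16-QAM as a 4-by-4 grid with spacing b, both centered at the origin. It estimates P_e by Monte Carlo with a minimum-distance decoder and evaluates the average energy of (2.39); sweeping a and b and plotting the resulting (E, P_e) pairs gives the comparison asked for in part (c). The spacings and sigma below are illustrative choices.

import numpy as np

def pam16(a):
    return (a * (np.arange(16) - 7.5)).reshape(-1, 1)        # 16 codewords in R^1

def qam16(b):
    g = b * (np.arange(4) - 1.5)
    return np.array([(u, v) for u in g for v in g])           # 16 codewords in R^2

def avg_energy(C):
    return np.mean(np.sum(C**2, axis=1))                      # equation (2.39) with a uniform prior

def pe_mc(C, sigma, trials=100_000, seed=1):
    rng = np.random.default_rng(seed)
    idx = rng.integers(len(C), size=trials)
    y = C[idx] + sigma * rng.standard_normal((trials, C.shape[1]))
    d2 = ((y[:, None, :] - C[None, :, :])**2).sum(axis=2)     # squared distances to all codewords
    return np.mean(d2.argmin(axis=1) != idx)                  # minimum-distance (ML) decoding

sigma = 1.0
for name, C in [("16-PAM", pam16(a=2.0)), ("16-QAM", qam16(b=2.0))]:
    print(name, " E =", avg_energy(C), "  Pe ~", pe_mc(C, sigma))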
Problem 24. (Q-Function on Regions, problem from [1]) Let X ∼ N(0, σ² I_2). For each of the three figures below, express the probability that X lies in the shaded region. You may use the Q-function when appropriate.

Figure 2.15: Three shaded regions in the (x_1, x_2) plane.
Problem 25. (QPSK Decision Regions) Let H ∈ {0, 1, 2, 3} and assume that when H = i you transmit the codeword c_i shown in Figure 2.16. Under H = i, the receiver observes Y = c_i + Z.
Figure 2.16: The four QPSK codewords c_0, c_1, c_2, c_3 in the (x_1, x_2) plane.
(a) Draw the decoding regions assuming that Z ∼ N(0, σ² I_2) and that P_H(i) = 1/4, i ∈ {0, 1, 2, 3}.

(b) Draw the decoding regions (qualitatively) assuming Z ∼ N(0, σ² I) and P_H(0) = P_H(2) > P_H(1) = P_H(3). Justify your answer.

(c) Assume again that P_H(i) = 1/4, i ∈ {0, 1, 2, 3} and that Z ∼ N(0, K), where

K = ( σ²    0
       0  4σ² ).

How do you decode now?
Problem 26. (Properties of the Q Function) Prove properties (a) through (d) of the Q function defined in Section 2.3. Hint: for property (d), multiply and divide inside the integral by the integration variable and integrate by parts. By upper- and lowerbounding the resulting integral, you will obtain the lower and the upper bound.
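Properties (a) through (d) are stated in Section 2.3 and are not repeated here. Assuming that property (d) refers to the standard sandwich bounds obtained by the integration-by-parts argument of the hint, x φ(x)/(1 + x²) ≤ Q(x) ≤ φ(x)/x for x > 0 with φ the N(0, 1) density, the sketch below checks them numerically against Q(x) = (1/2) erfc(x/√2).

import numpy as np
from scipy.special import erfc

def Q(x):
    return 0.5 * erfc(x / np.sqrt(2.0))

def phi(x):
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)             # standard normal density

for x in [0.5, 1.0, 2.0, 3.0, 4.0]:
    lower = x / (1 + x**2) * phi(x)
    upper = phi(x) / x
    assert lower <= Q(x) <= upper
    print(f"x={x:3.1f}   lower={lower:.3e}   Q={Q(x):.3e}   upper={upper:.3e}")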
Problem 27. (Antenna Array) The following problem relates to the design of multi-antenna systems. Consider the binary equiprobable hypothesis testing problem:

H = 0 :  Y_1 = A + Z_1,   Y_2 = A + Z_2
H = 1 :  Y_1 = −A + Z_1,  Y_2 = −A + Z_2,

where Z_1, Z_2 are independent Gaussian random variables with different variances σ_1² ≠ σ_2², that is, Z_1 ∼ N(0, σ_1²) and Z_2 ∼ N(0, σ_2²). A > 0 is a constant.

(a) Show that the decision rule that minimizes the probability of error (based on the observables Y_1 and Y_2) can be stated as

σ_2² y_1 + σ_1² y_2 ≷ 0,

deciding Ĥ = 0 when the left-hand side is positive and Ĥ = 1 when it is negative.
(b) Draw the decision regions in the (Y_1, Y_2) plane for the special case where σ_1 = 2σ_2.

(c) Evaluate the probability of error for the optimal detector as a function of σ_1², σ_2² and A.
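A companion sketch (not a solution): it simulates the rule of part (a) and compares it with naive unweighted combining y_1 + y_2 ≷ 0, for one illustrative parameter choice (A = 1, σ_1 = 2, σ_2 = 1, my values, not the text's).

import numpy as np

rng = np.random.default_rng(2)
A, s1, s2, trials = 1.0, 2.0, 1.0, 500_000

h = rng.integers(2, size=trials)                  # hypothesis H in {0, 1}
x = A * (1 - 2 * h)                               # +A under H = 0, -A under H = 1
y1 = x + s1 * rng.standard_normal(trials)
y2 = x + s2 * rng.standard_normal(trials)

decide_weighted = (s2**2 * y1 + s1**2 * y2) < 0   # decide H = 1 when the statistic is negative
decide_naive = (y1 + y2) < 0
print("weighted rule Pe ~", np.mean(decide_weighted != h.astype(bool)))
print("naive rule    Pe ~", np.mean(decide_naive != h.astype(bool)))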
Problem 28. (Multiple Choice Exam) You are taking a multiple choice exam. Question number 5 allows for two possible answers. According to your first impression, answer 1 is correct with probability 1/4 and answer 2 is correct with probability 3/4. You would like to maximize your chance of giving the correct answer and you decide to have a look at what your neighbors on the left and right have to say. The neighbor on the left has answered Ĥ_L = 1. He is an excellent student who has a record of being correct 90% of the time when asked a binary question. The neighbor on the right has answered Ĥ_R = 2. He is a weaker student who is correct 70% of the time.

(a) You decide to use your first impression as a prior and to consider Ĥ_L and Ĥ_R as observations. Formulate the decision problem as a hypothesis testing problem.

(b) What is your answer Ĥ?
Problem 29. (Multi-Antenna Receiver) Consider a communication system with one transmitter and n receiver antennas. The n-tuple former output of antenna k, denoted by Y_k, is modeled by

Y_k = B g_k + Z_k,   k = 1, 2, . . . , n,

where B ∈ {±1} is a uniformly distributed source bit, g_k models the gain of antenna k and Z_k ∼ N(0, σ²). The random variables B, Z_1, . . . , Z_n are independent. Using n-tuple notation the model becomes

Y = B g + Z,

where Y, g, and Z are n-tuples.

(a) Suppose that the observation Y_k is weighted by an arbitrary real number w_k and combined with the other observations to form

V = Σ_{k=1}^n Y_k w_k = ⟨Y, w⟩,

where w is an n-tuple. Describe the ML receiver for B given the observation V. (The receiver knows g and of course knows w.)

(b) Give an expression for the probability of error P_e.

(c) Define ρ = |⟨g, w⟩| / (‖g‖ ‖w‖) and rewrite the expression for P_e in a form that depends on w only through ρ.
(d) As a function of w, what are the maximum and minimum values for ρ and how do you choose w to achieve them?

(e) Minimize the probability of error over all possible choices of w. Could you reduce the error probability further by doing ML decision directly on Y rather than on V? Justify your answer.

(f) How would you choose w to minimize the error probability if Z_k had variance σ_k², k = 1, . . . , n? Hint: With a simple operation at the receiver you can transform the new problem into the one you have already solved.
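A companion sketch (not a solution): it simulates V = ⟨Y, w⟩ and compares the error rate of the sign detector for w = g against a mismatched choice of w. The gain vector g and σ are illustrative values, not taken from the text.

import numpy as np

rng = np.random.default_rng(3)
g = np.array([1.0, 0.5, 2.0, 0.8])                # illustrative antenna gains
sigma, trials = 1.0, 500_000

def error_rate(w):
    b = rng.choice([-1.0, 1.0], size=trials)      # B uniform on {+1, -1}
    Y = b[:, None] * g + sigma * rng.standard_normal((trials, g.size))
    V = Y @ w
    # sign detector on V (valid here because <g, w> > 0 for both choices of w below)
    return np.mean(np.sign(V) != b)

print("Pe with w = g         ~", error_rate(g))
print("Pe with mismatched w  ~", error_rate(np.array([1.0, -1.0, 0.2, 0.0])))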
Problem 30. (QAM with Erasure) Consider a QAM receiver that outputs a special symbol δ, called an erasure, whenever the observation falls in the shaded area shown in Figure 2.17. Assume that c_0 ∈ R² is transmitted and that Y = c_0 + N is received, where N ∼ N(0, σ² I_2). Let P_{0i}, i = 0, 1, 2, 3, be the probability that the receiver outputs Ĥ = i and let P_{0δ} be the probability that it outputs δ. Determine P_{00}, P_{01}, P_{02}, P_{03} and P_{0δ}.

Figure 2.17: A 4-point QAM constellation with codewords c_0, c_1, c_2, c_3 in the (y_1, y_2) plane; the parameters a and b delimit the shaded erasure region between the decision regions.

Comment: If we choose b − a large enough, we can make sure that the probability of error is very small (we say that an error occurred if Ĥ = i, i ∈ {0, 1, 2, 3}, and H ≠ Ĥ). When Ĥ = δ, the receiver can ask for a retransmission of H. This requires a feedback channel from the receiver to the sender. In most practical applications, such a feedback channel is available.
Problem 31. (Repeat Codes and Bhattacharyya Bound) Consider two equally likely hypotheses. Under hypothesis H = 0, the transmitter sends c_0 = (1, . . . , 1)^T and under H = 1 it sends c_1 = (−1, . . . , −1)^T, both of length n. The channel model is the AWGN with variance σ² in each component. Recall that the probability of error for an ML receiver
that observes the channel output Y ∈ R^n is

P_e = Q( √n / σ ).

Suppose now that the decoder has access only to the sign of Y_i, 1 ≤ i ≤ n. In other words, the observation is

W = (W_1, . . . , W_n) = (sign(Y_1), . . . , sign(Y_n)).   (2.40)

(a) Determine the MAP decision rule based on the observable W. Give a simple sufficient statistic, and draw a diagram of the optimal receiver.

(b) Find the expression for the probability of error P̃_e of the MAP decoder that observes W. You may assume that n is odd.

(c) Your answer to (b) contains a sum that cannot be expressed in closed form. Express the Bhattacharyya bound on P̃_e.

(d) For n = 1, 3, 5, 7, find the numerical values of P_e, P̃_e, and the Bhattacharyya bound on P̃_e.
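A numerical companion for part (d), meant to be used after solving (a) through (c): it assumes that the per-component sign operation turns the channel into a BSC with crossover probability p = Q(1/σ), and it evaluates the Bhattacharyya bound directly from its definition applied to the sufficient statistic (the number of positive signs), so no closed form is asserted here.

import numpy as np
from scipy.stats import binom, norm

sigma = 1.0
p = norm.sf(1.0 / sigma)                           # Q(1/sigma): crossover probability of the induced BSC

for n in [1, 3, 5, 7]:
    Pe_soft = norm.sf(np.sqrt(n) / sigma)          # Q(sqrt(n)/sigma): ML on the unquantized Y
    k = np.arange(n + 1)
    pmf0 = binom.pmf(k, n, 1 - p)                  # number of +1 signs under H = 0
    pmf1 = binom.pmf(k, n, p)                      # number of +1 signs under H = 1
    Pe_hard = pmf0[k < n / 2].sum()                # majority rule wrong under H = 0 (n odd, symmetric setup)
    bhatt = np.sum(np.sqrt(pmf0 * pmf1))           # Bhattacharyya bound evaluated on the statistic
    print(f"n={n}:  Pe={Pe_soft:.4e}  Pe_hard={Pe_hard:.4e}  Bhattacharyya={bhatt:.4e}")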
Problem 32. (Tighter Union Bhattacharyya Bound: Binary Case) In this problem we derive a tighter version of the Union Bhattacharyya Bound for binary hypotheses. Let

H = 0 : Y ∼ f_{Y|H}(y|0)
H = 1 : Y ∼ f_{Y|H}(y|1).

The MAP decision rule is

Ĥ(y) = arg max_i P_H(i) f_{Y|H}(y|i),

and the resulting probability of error is

P_e = P_H(0) ∫_{R_1} f_{Y|H}(y|0) dy + P_H(1) ∫_{R_0} f_{Y|H}(y|1) dy.

(a) Argue that

P_e = ∫_y min{ P_H(0) f_{Y|H}(y|0), P_H(1) f_{Y|H}(y|1) } dy.

(b) Prove that for a, b ≥ 0, min(a, b) ≤ √(ab) ≤ (a + b)/2. Use this to prove the tighter version of the Bhattacharyya Bound, i.e.,

P_e ≤ (1/2) ∫_y √( f_{Y|H}(y|0) f_{Y|H}(y|1) ) dy.
(c) Compare the above bound to the one derived in class when P_H(0) = 1/2. How do you explain the improvement by a factor of 1/2?
Problem 33. (Tighter Union Bhattacharyya Bound: M-ary Case) In this problem we derive a tight version of the union bound for M-ary hypotheses. Let us analyze the following M-ary MAP detector:

Ĥ(y) = smallest i such that P_H(i) f_{Y|H}(y|i) = max_j { P_H(j) f_{Y|H}(y|j) }.

Let

B_{i,j} = { y : P_H(j) f_{Y|H}(y|j) ≥ P_H(i) f_{Y|H}(y|i) }  for j < i,
B_{i,j} = { y : P_H(j) f_{Y|H}(y|j) > P_H(i) f_{Y|H}(y|i) }  for j > i.
(a) Verify that B_{i,j} = B^c_{j,i}.

(b) Given H = i, the detector will make an error if and only if y ∈ ∪_{j : j ≠ i} B_{i,j}. The probability of error is P_e = Σ_{i=0}^{M−1} P_e(i) P_H(i). Show that:

P_e ≤ Σ_{i=0}^{M−1} Σ_{j>i} [ Pr{B_{i,j} | H = i} P_H(i) + Pr{B_{j,i} | H = j} P_H(j) ]

    = Σ_{i=0}^{M−1} Σ_{j>i} [ ∫_{B_{i,j}} f_{Y|H}(y|i) P_H(i) dy + ∫_{B^c_{i,j}} f_{Y|H}(y|j) P_H(j) dy ]

    = Σ_{i=0}^{M−1} Σ_{j>i} ∫_y min{ f_{Y|H}(y|i) P_H(i), f_{Y|H}(y|j) P_H(j) } dy.

Hint: Use the union bound and then group the terms corresponding to B_{i,j} and B_{j,i}. To prove the last part, go back to the definition of B_{i,j}.
(c) Hence show that:

P_e ≤ Σ_{i=0}^{M−1} Σ_{j>i} [ (P_H(i) + P_H(j)) / 2 ] ∫_y √( f_{Y|H}(y|i) f_{Y|H}(y|j) ) dy.

(Hint: For a, b ≥ 0, min(a, b) ≤ √(ab) ≤ (a + b)/2.)
Problem 34. (Applying the Tight Bhattacharyya Bound) As an application of the tight Bhattacharyya bound (Problem 32), consider the following binary hypothesis testing problem:

H = 0 : Y ∼ N(−a, σ²)
H = 1 : Y ∼ N(+a, σ²),

where the two hypotheses are equiprobable.
(a) Use the Tight Bhattacharyya Bound to derive a bound on P_e.

(b) We know that the probability of error for this binary hypothesis testing problem is Q(a/σ) ≤ (1/2) exp(−a²/(2σ²)), where we have used the result Q(x) ≤ (1/2) exp(−x²/2). How do the two bounds compare? Comment on the result.
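A numerical companion (a sketch, not the requested derivation): for one illustrative choice of a and σ it evaluates the exact error probability Q(a/σ), the plain Bhattacharyya integral ∫ √(f_{Y|H}(y|0) f_{Y|H}(y|1)) dy by quadrature, and the tight (factor 1/2) version, so the comparison of part (b) can be seen numerically.

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

a, sigma = 1.5, 1.0                                            # illustrative values
f0 = lambda y: norm.pdf(y, loc=-a, scale=sigma)
f1 = lambda y: norm.pdf(y, loc=+a, scale=sigma)

bhatt, _ = quad(lambda y: np.sqrt(f0(y) * f1(y)), -np.inf, np.inf)
print("exact Pe            =", norm.sf(a / sigma))             # Q(a/sigma)
print("Bhattacharyya bound =", bhatt)
print("tight (1/2) bound   =", 0.5 * bhatt)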
Problem 35. (Bhattacharyya Bound for DMCs) Consider a Discrete Memoryless Channel (DMC). This is a channel model described by an input alphabet X, an output alphabet Y and a transition probability⁵ P_{Y|X}(y|x). When we use this channel to transmit an n-tuple x ∈ X^n, the transition probability is

P_{Y|X}(y|x) = ∏_{i=1}^n P_{Y|X}(y_i|x_i).

So far, we have come across two DMCs, namely the BSC (Binary Symmetric Channel) and the BEC (Binary Erasure Channel). The purpose of this problem is to see that for DMCs, the Bhattacharyya Bound takes on a simple form, in particular when the channel input alphabet X contains only two letters.
(a) Consider a transmitter that sends c_0 ∈ X^n when H = 0 and c_1 ∈ X^n when H = 1. Justify the following chain of (in)equalities.

P_e ≤(a) Σ_y √( P_{Y|X}(y|c_0) P_{Y|X}(y|c_1) )

    =(b) Σ_y √( ∏_{i=1}^n P_{Y|X}(y_i|c_{0,i}) P_{Y|X}(y_i|c_{1,i}) )

    =(c) Σ_{y_1,...,y_n} ∏_{i=1}^n √( P_{Y|X}(y_i|c_{0,i}) P_{Y|X}(y_i|c_{1,i}) )

    =(d) Σ_{y_1} √( P_{Y|X}(y_1|c_{0,1}) P_{Y|X}(y_1|c_{1,1}) ) · · · Σ_{y_n} √( P_{Y|X}(y_n|c_{0,n}) P_{Y|X}(y_n|c_{1,n}) )

    =(e) ∏_{i=1}^n Σ_y √( P_{Y|X}(y|c_{0,i}) P_{Y|X}(y|c_{1,i}) )

    =(f) ∏_{a∈X, b∈X, a≠b} [ Σ_y √( P_{Y|X}(y|a) P_{Y|X}(y|b) ) ]^{n(a,b)},

where n(a, b) is the number of positions i in which c_{0,i} = a and c_{1,i} = b.
⁵ Here we are assuming that the output alphabet is discrete. Otherwise we use densities instead of probabilities.
(b) The Hamming distance d_H(c_0, c_1) is defined as the number of positions in which c_0 and c_1 differ. Show that for a binary input channel, i.e., when X = {a, b}, the Bhattacharyya Bound becomes

P_e ≤ z^{d_H(c_0, c_1)},

where

z = Σ_y √( P_{Y|X}(y|a) P_{Y|X}(y|b) ).

Notice that z depends only on the channel, whereas its exponent depends only on c_0 and c_1.
(c) What is z for:

(i) The binary input Gaussian channel described by the densities

f_{Y|X}(y|0) = N(−√E, σ²)
f_{Y|X}(y|1) = N(+√E, σ²).

(ii) The Binary Symmetric Channel (BSC) with the transition probabilities described by

P_{Y|X}(y|x) = 1 − δ if y = x, and δ otherwise.

(iii) The Binary Erasure Channel (BEC) with the transition probabilities given by

P_{Y|X}(y|x) = 1 − δ if y = x, δ if y = E, and 0 otherwise.

(iv) Consider a channel with input alphabet {±1}, and output Y = sign(x + Z), where x is the input and Z ∼ N(0, σ²). This is a BSC obtained from quantizing a Gaussian channel used with binary input alphabet. What is the crossover probability p of the BSC? Plot the z of the underlying Gaussian channel (with inputs in R) and that of the BSC. By how much do we need to increase the input power of the quantized channel to match the z of the unquantized channel?
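A companion sketch for part (c)(iv), not a solution: it compares the Bhattacharyya parameter z of the unquantized binary-input Gaussian channel with that of the BSC obtained by taking the sign of the output. Both are computed from the definition of z (by quadrature for the Gaussian channel, by the two-term sum for the BSC), with inputs taken to be ±√E so that the two can be tabulated against E. The framing in terms of E and the values below are my choices.

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

sigma = 1.0

def z_gaussian(E):
    # z = integral of sqrt(f(y|a) f(y|b)) dy for inputs -sqrt(E) and +sqrt(E)
    f0 = lambda y: norm.pdf(y, loc=-np.sqrt(E), scale=sigma)
    f1 = lambda y: norm.pdf(y, loc=+np.sqrt(E), scale=sigma)
    val, _ = quad(lambda y: np.sqrt(f0(y) * f1(y)), -np.inf, np.inf)
    return val

def z_bsc(E):
    p = norm.sf(np.sqrt(E) / sigma)               # crossover probability after quantization
    # z = sum over y of sqrt(P(y|a) P(y|b)) for the two-output channel
    return np.sqrt(p * (1 - p)) + np.sqrt((1 - p) * p)

for E in [0.5, 1.0, 2.0, 4.0]:
    print(f"E={E}:  z_gaussian={z_gaussian(E):.4f}   z_bsc={z_bsc(E):.4f}")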
Problem 36. (Bhattacharyya Bound and Laplacian Noise) Assuming two equally likely hypotheses, evaluate the Bhattacharyya bound for the following (Laplacian noise) setting:

H = 0 : Y = −a + Z
H = 1 : Y = a + Z,

where a ∈ R⁺ is a constant and Z is a random variable of probability density function

f_Z(z) = (1/2) exp(−|z|),  z ∈ R.
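The closed-form evaluation is the point of the exercise; as a numerical cross-check, the integral ∫ √(f_{Y|H}(y|0) f_{Y|H}(y|1)) dy appearing in the bound can be computed by quadrature for an illustrative value of a, as sketched below.

import numpy as np
from scipy.integrate import quad

a = 1.0                                                         # illustrative value
f = lambda y, m: 0.5 * np.exp(-np.abs(y - m))                   # density of m + Z, Z Laplacian
val, _ = quad(lambda y: np.sqrt(f(y, -a) * f(y, a)), -np.inf, np.inf)
print("integral of sqrt(f_{Y|H}(y|0) f_{Y|H}(y|1)) dy ~", val)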
Problem 37. (SIMO Channel with Laplacian Noise) One of the two signals c_0 = −1, c_1 = 1 is transmitted over the channel shown in Figure 2.18(a). The two noise random variables Z_1 and Z_2 are statistically independent of the transmitted signal and of each other. Their density functions are

f_{Z_1}(α) = f_{Z_2}(α) = (1/2) e^{−|α|}.
Figure 2.18: (a) The input X ∈ {c_0, c_1} is observed through two parallel branches, Y_1 = X + Z_1 and Y_2 = X + Z_2. (b) A point (y_1, y_2) in the plane, with a and b denoting its distances to the point (1, 1) along the two coordinate directions.
(a) Derive a maximum likelihood decision rule.

(b) Describe the maximum likelihood decision regions in the (y_1, y_2) plane. Describe also the "Either Choice" regions, i.e., the regions where it does not matter if you decide for c_0 or for c_1. Hint: Use geometric reasoning and the fact that for a point (y_1, y_2) as shown in Figure 2.18(b), |y_1 − 1| + |y_2 − 1| = a + b.

(c) A receiver decides that c_1 was transmitted if and only if (y_1 + y_2) > 0. Does this receiver minimize the error probability for equally likely messages?

(d) What is the error probability for the receiver in (c)? Hint: One way to do this is to use the fact that if W = Z_1 + Z_2 then f_W(w) = (e^{−w}/4)(1 + w) for w > 0.
Problem 38. (Signal Constellation) The following signal constellation with six signals is used in additive white Gaussian noise of variance σ²:
Figure 2.19: A six-point constellation in the (x_1, x_2) plane, arranged in two rows of three points with spacings a and b as marked.
Assume that the six signals are used with equal probability.
(a) Draw the boundaries of the decision regions.
(b) Compute the average probability of error, P_e, for this signal constellation.
(c) Compute the average energy per symbol for this signal constellation.
Problem 39. (Hypothesis Testing and Fading) Consider the following communication problem depicted in Figure 2.20: There are two equiprobable hypotheses. When H = 0, we transmit c_0 = −b, where b is an arbitrary but fixed positive number. When H = 1, we transmit c_1 = b. The channel is as shown in the figure below, where Z ∼ N(0, σ²) represents the noise, A ∈ {0, 1} represents a random attenuation (fading) with P_A(0) = 1/2, and Y is the channel output. The random variables H, A and Z are independent.

Figure 2.20: The channel scales the input X ∈ {c_0, c_1} by A and adds Z, producing the output Y.
(a) Find the decision rule that the receiver should implement to minimize the probability of error. Sketch the decision regions.

(b) Calculate the probability of error P_e, based on the above decision rule.
Problem 40. (Dice Tossing) You have two dice, one fair and one loaded. A friend told you that the loaded die produces a 6 with probability 1/4, and the other values with uniform probabilities. You do not know a priori which of the two is the fair die. You pick one of the two dice with uniform probability and perform n consecutive tosses. Let Y_i ∈ {1, . . . , 6} be the random variable modeling the i-th toss and let Y = (Y_1, . . . , Y_n).

(a) Based on the observable Y, find the decision rule to determine whether the die you have chosen is loaded. Your decision rule should maximize the probability of correct decision.

(b) Identify a sufficient statistic S ∈ N.

(c) Find the Bhattacharyya bound on the probability of error. You can either work with the observable (Y_1, . . . , Y_n) or with (Z_1, . . . , Z_n), where Z_i indicates whether the i-th observation is a six or not. Yet another alternative is to work with S. Depending on the approach, the following identity may be useful: Σ_{i=0}^n (n choose i) x^i = (1 + x)^n for n ∈ N.
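A companion sketch (not a solution): once one accepts that the number of sixes S summarizes the observation, the two hypotheses induce Binomial(n, 1/4) and Binomial(n, 1/6) distributions on S, and the error probability of the likelihood-ratio rule on S can be evaluated by direct enumeration. The values of n below are illustrative.

import numpy as np
from scipy.stats import binom

for n in [5, 10, 20, 50]:
    k = np.arange(n + 1)
    pmf_loaded = binom.pmf(k, n, 1/4)             # number of sixes if the loaded die was picked
    pmf_fair = binom.pmf(k, n, 1/6)               # number of sixes if the fair die was picked
    decide_loaded = pmf_loaded > pmf_fair         # uniform prior, so MAP = ML (ties go to "fair")
    pe = 0.5 * pmf_loaded[~decide_loaded].sum() + 0.5 * pmf_fair[decide_loaded].sum()
    print(f"n={n}:  Pe = {pe:.4f}")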
Problem 41. (Antipodal Signaling) Consider the signal constellation shown in Figure 2.21. Assume that the codewords c_1 and c_0 are used for communication over the Gaussian vector channel. More precisely:

H = 0 : Y = c_0 + Z,
H = 1 : Y = c_1 + Z,

where Z ∼ N(0, σ² I_2). Hence Y = (Y_1, Y_2).

Figure 2.21: Two antipodal codewords c_1 and c_0 in the (x_1, x_2) plane, each at distance a from both coordinate axes.
(a) Argue that Y_1 is not a sufficient statistic.

(b) Give a different signal constellation with two codewords c_0 and c_1 such that, when used in the above communication setting, Y_1 is a sufficient statistic.
Problem 42. (ML Receiver and UB for Orthogonal Signaling) Let H ∈ {1, . . . , m} be uniformly distributed and consider the communication problem described by:

H = i : Y = c_i + Z,   Z ∼ N(0, σ² I_m),

where c_1, . . . , c_m, c_i ∈ R^m, is a set of constant-energy orthogonal codewords. Without loss of generality we assume

c_i = √E e_i,

where e_i is the i-th unit vector in R^m, i.e., the vector that contains 1 at position i and 0 elsewhere, and E is some positive constant.
(a) Describe the maximum likelihood decision rule.

(b) Find the distance ‖c_i − c_j‖.

(c) Using the union bound and the Q function, upper-bound the probability P_e(i) that the decision is incorrect when H = i.
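A companion sketch (not a solution): it estimates the ML error probability for orthogonal signaling by Monte Carlo and compares it with the union bound assembled from the m − 1 pairwise terms Q(‖c_i − c_j‖/(2σ)), the pairwise error probability expression used in this chapter. The values of m, E and σ are illustrative.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
m, E, sigma, trials = 8, 4.0, 1.0, 200_000        # illustrative values
C = np.sqrt(E) * np.eye(m)                        # c_i = sqrt(E) e_i

idx = rng.integers(m, size=trials)
Y = C[idx] + sigma * rng.standard_normal((trials, m))
decision = (Y @ C.T).argmax(axis=1)               # ML: maximize <y, c_i> (equal-energy codewords)
pe_mc = np.mean(decision != idx)

d = np.linalg.norm(C[0] - C[1])                   # all pairwise distances are equal
union_bound = (m - 1) * norm.sf(d / (2 * sigma))  # sum of the pairwise Q(d/(2 sigma)) terms
print("Monte Carlo Pe ~", pe_mc, "   union bound =", union_bound)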
Problem 43. (Data Storage Channel) The process of storing and retrieving binary data on a thin-film disk can be modeled as transmitting binary symbols across an additive white Gaussian noise channel where the noise Z has a variance that depends on the transmitted (stored) binary symbol X. The noise has the following input-dependent density:

f_Z(z) = (1/√(2πσ_1²)) e^{−z²/(2σ_1²)}   if X = 1,
f_Z(z) = (1/√(2πσ_0²)) e^{−z²/(2σ_0²)}   if X = 0,

where σ_1 > σ_0. The channel inputs are equally likely.
(a) On the same graph, plot the two possible output probability density functions. Indicate, qualitatively, the decision regions.

(b) Determine the optimal receiver in terms of σ_1 and σ_0.

(c) Write an expression for the error probability P_e as a function of σ_0 and σ_1.
Problem 44. (A Simple Multiple-Access Scheme) Consider the following very simple model of a multiple-access scheme. There are two users. Each user has two hypotheses. Let H_1 = H_2 = {0, 1} denote the respective sets of hypotheses and assume that both users employ a uniform prior. Further, let X_1 and X_2 be the respective signals sent by user one and user two. Assume that the transmissions of both users are independent and that X_1 ∈ {±1} and X_2 ∈ {±2}, where X_1 and X_2 are positive if their respective hypothesis is zero and negative otherwise. Assume that the receiver observes the signal Y = X_1 + X_2 + Z, where Z is a zero-mean Gaussian random variable with variance σ² and is independent of the transmitted signal.
(a) Assume that the receiver observes Y and wants to estimate both H_1 and H_2. Let Ĥ_1 and Ĥ_2 be the estimates. What is the generic form of the optimal decision rule?

(b) For the specific set of signals given, what is the set of possible observations, assuming that σ² = 0? Label these signals by the corresponding (joint) hypotheses.

(c) Assuming now that σ² > 0, draw the optimal decision regions.

(d) What is the resulting probability of correct decision? That is, determine the probability Pr{Ĥ_1 = H_1, Ĥ_2 = H_2}.

(e) Finally, assume that we are interested in only the transmission of user two. What is Pr{Ĥ_2 = H_2}?
Problem 45. (Data Dependent Noise) Consider the following binary Gaussian hypothesis testing problem with data dependent noise. Under hypothesis H = 0 the transmitted signal is c_0 = −1 and the received signal is Y = c_0 + Z_0, where Z_0 is zero-mean Gaussian with variance one. Under hypothesis H = 1 the transmitted signal is c_1 = 1 and the received signal is Y = c_1 + Z_1, where Z_1 is zero-mean Gaussian with variance σ². Assume that the prior is uniform.

(a) Write the optimal decision rule as a function of the parameter σ² and the received signal Y.

(b) For the value σ² = exp{4} compute the decision regions.

(c) Give expressions as simple as possible for the error probabilities P_e(0) and P_e(1).
Problem 46. (Correlated Noise) Consider the following communication problem. The message is represented by a uniformly distributed random variable H that takes values in {0, 1, 2, 3}. When H = i we send c_i, where c_0 = (0, 1)^T, c_1 = (1, 0)^T, c_2 = (0, −1)^T, c_3 = (−1, 0)^T (see the figure below).

Figure 2.22: The four codewords c_0, c_1, c_2, c_3 on the coordinate axes of the (x_1, x_2) plane, at distance 1 from the origin.
When H = i, the receiver observes the vector Y = c_i + Z, where Z is a zero-mean Gaussian random vector whose covariance matrix is

Σ = ( 4  2
      2  5 ).

(a) In order to simplify the decision problem, we transform Y into Ỹ = BY = B c_i + B Z, where B is a 2-by-2 invertible matrix, and use Ỹ as a sufficient statistic. Find a B such that BZ is a zero-mean Gaussian random vector with independent and identically distributed components. Hint: If

A = (1/4) ( 2  0
           −1  2 ),

then A Σ A^T = I, with I = ( 1  0
                             0  1 ).

(b) Formulate the new hypothesis testing problem that has Ỹ as the observable and depict the decision regions.

(c) Give an upper bound to the error probability in this decision problem.
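As a numerical check of the hint, the sketch below verifies that the matrix A given above satisfies A Σ A^T = I, so B = A is a valid whitening choice in part (a), and it lists the transformed codewords B c_i.

import numpy as np

Sigma = np.array([[4.0, 2.0], [2.0, 5.0]])
A = 0.25 * np.array([[2.0, 0.0], [-1.0, 2.0]])
print(A @ Sigma @ A.T)                            # identity, up to rounding

codewords = np.array([[0, 1], [1, 0], [0, -1], [-1, 0]], dtype=float)   # c_0 .. c_3
print("transformed codewords B c_i:")
print(codewords @ A.T)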