i(A, B) = i(AB) = i(A) + i(B).
It turns out there is one and only one function that satisfies the
above requirements, namely the logarithmic function.
Definition 1 Let A be an event having probability of occurrence p(A).
Then the amount of information conveyed by the knowledge of the occurrence of A, referred to as the self-information of event A, is given
by
i(A) = −log2 p(A) bits.
Definition 2 The mutual information between two events A and B is

i(A; B) = log [p(A|B)/p(A)]
= −log p(A) − [−log p(A|B)]
= i(A) − i(A|B).
In other words, i(A; B) is also the amount of uncertainty about event
A removed by the occurrence of event B.
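These definitions are easy to check numerically. Below is a minimal Python sketch (not part of the original notes; the probabilities p(A), p(B), and p(A, B) are made-up values for illustration):

import math

# Hypothetical example probabilities (chosen for illustration only).
p_A = 0.25          # p(A)
p_B = 0.5           # p(B)
p_AB = 0.2          # p(A, B), the probability that A and B both occur

p_A_given_B = p_AB / p_B
p_B_given_A = p_AB / p_A

i_A = -math.log2(p_A)                      # self-information of A, in bits
i_AB = math.log2(p_A_given_B / p_A)        # mutual information i(A; B)
i_BA = math.log2(p_B_given_A / p_B)        # mutual information i(B; A)

print(i_A)          # 2.0 bits
print(i_AB, i_BA)   # equal, illustrating i(A; B) = i(B; A)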
2.1
1. i(A; B) = i(B; A).

Proof:

i(A; B) = log [p(A|B)/p(A)]
= log [p(A, B)/(p(A)p(B))]
= log [p(B|A)/p(B)]
= i(B; A).
Here, we abuse notation somewhat by using i(·) to denote self-information and i(·;·) to denote mutual information.
2. The mutual information between two events can be negative, positive, or zero:

i(A; B) < 0 if p(A|B) < p(A),
i(A; B) > 0 if p(A|B) > p(A),
i(A; B) = 0 if p(A|B) = p(A).
3. i(A; A) = −log p(A) = i(A).
Definition 3 The mutual information between events A and B given that event C has occurred is

i(A; B|C) = log [p(A|B, C)/p(A|C)]
= log [p(A, B|C)/(p(A|C)p(B|C))]
= i(B; A|C).

Similarly, the mutual information between event A and the joint occurrence of events B and C is

i(A; B, C) = log [p(A|B, C)/p(A)].
Now,

p(A, B, C) = p(A|B, C) p(C|B) p(B) = p(C|A, B) p(A|B) p(B),

so that

p(A|B, C) = p(C|A, B) p(A|B) / p(C|B).
Thus,

i(A; B, C) = log [p(A|B) p(C|A, B) / (p(A) p(C|B))]
= log [p(A|B)/p(A)] + log [p(C|A, B)/p(C|B)]
= i(A; B) + i(C; A|B)
= i(A; B) + i(A; C|B).
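The identity i(A; B, C) = i(A; B) + i(A; C|B) can also be verified numerically. The Python sketch below (not from the notes) uses a hypothetical joint pmf over three binary variables and takes A = {X1 = 1}, B = {X2 = 1}, C = {X3 = 1}; all numbers are illustrative:

import math
from itertools import product

# Hypothetical joint pmf over three binary variables (illustration only).
p = {(x1, x2, x3): 0.0 for x1, x2, x3 in product((0, 1), repeat=3)}
p[(1, 1, 1)] = 0.20
p[(1, 1, 0)] = 0.10
p[(1, 0, 1)] = 0.05
p[(1, 0, 0)] = 0.15
p[(0, 1, 1)] = 0.10
p[(0, 1, 0)] = 0.15
p[(0, 0, 1)] = 0.05
p[(0, 0, 0)] = 0.20

def prob(pred):
    """Probability of the event defined by the predicate pred(x1, x2, x3)."""
    return sum(v for k, v in p.items() if pred(*k))

pA   = prob(lambda a, b, c: a == 1)
pB   = prob(lambda a, b, c: b == 1)
pAB  = prob(lambda a, b, c: a == 1 and b == 1)
pBC  = prob(lambda a, b, c: b == 1 and c == 1)
pABC = prob(lambda a, b, c: a == 1 and b == 1 and c == 1)

i_A_BC  = math.log2((pABC / pBC) / pA)            # i(A; B, C) = log p(A|B,C)/p(A)
i_A_B   = math.log2((pAB / pB) / pA)              # i(A; B)    = log p(A|B)/p(A)
i_A_C_B = math.log2((pABC / pBC) / (pAB / pB))    # i(A; C|B)  = log p(A|B,C)/p(A|B)

print(i_A_BC, i_A_B + i_A_C_B)  # the two values agree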
Example 1 Consider the Z-channel shown in Figure 1, with equally likely inputs p(X = 0) = p(X = 1) = 0.5 and transition probabilities p(Y = 0|X = 0) = 1, p(Y = 0|X = 1) = ε, and p(Y = 1|X = 1) = 1 − ε.

Figure 1: The Z-channel.

1. What is the self-information of events X = 0, X = 1?
We have:

i(X = 0) = −log p(X = 0) = −log(1/2) = 1 bit.
i(X = 1) = −log p(X = 1) = −log(1/2) = 1 bit.
2. What is the self-information of events Y = 0, Y = 1?
We have:

i(Y = 0) = −log p(Y = 0).

p(Y = 0) = p(Y = 0|X = 0) p(X = 0) + p(Y = 0|X = 1) p(X = 1)
= 1 · p(X = 0) + ε · p(X = 1)
= (1 + ε)/2.

Thus,

i(Y = 0) = −log [(1 + ε)/2] bits.
Similarly,

i(Y = 1) = −log p(Y = 1).

p(Y = 1) = 1 − p(Y = 0) = (1 − ε)/2,

so

i(Y = 1) = −log [(1 − ε)/2] bits.
3. What is the mutual information between the events X = x and Y = y?

We have:

i(Y = 0; X = 0) = log [p(Y = 0|X = 0)/p(Y = 0)] = log [2/(1 + ε)] = 1 − log(1 + ε) bits.

i(Y = 1; X = 0) = log [p(Y = 1|X = 0)/p(Y = 1)] = log(0) = −∞.

i(Y = 0; X = 1) = log [p(Y = 0|X = 1)/p(Y = 0)] = log [2ε/(1 + ε)] bits.

i(Y = 1; X = 1) = log [p(Y = 1|X = 1)/p(Y = 1)] = log [2(1 − ε)/(1 − ε)] = 1 bit.

2.2
Definition 4 (Entropy) The entropy of a source represented by a random variable X with realizations taking values from the set 𝒳 is

H(X) = −∑_{x∈𝒳} p(x) log p(x).

For a random vector X = (X_1, X_2, ..., X_N), the chain rule for probabilities gives

p(X) = ∏_{n=1}^{N} p(X_n|X^{n−1}),

where X^{n−1} denotes (X_1, ..., X_{n−1}), so that

−log p(X) = −∑_{n=1}^{N} log p(X_n|X^{n−1}),

and taking expectations yields the chain rule for entropy,

H(X) = ∑_{n=1}^{N} H(X_n|X^{n−1}).
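A minimal Python sketch of these quantities (the joint pmf below is a made-up example, not from the notes) computes the entropy of a pmf and checks the two-variable chain rule H(X1, X2) = H(X1) + H(X2|X1):

import math

def entropy(pmf):
    """Entropy in bits of a pmf given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# Hypothetical joint pmf p(x1, x2) over {0,1} x {0,1} (illustration only).
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

H_joint = entropy(list(p.values()))                  # H(X1, X2)
p1 = {x1: p[(x1, 0)] + p[(x1, 1)] for x1 in (0, 1)}  # marginal of X1
H_X1 = entropy(list(p1.values()))

# Conditional entropy H(X2 | X1) = sum over x1 of p(x1) * H(X2 | X1 = x1).
H_X2_given_X1 = sum(
    p1[x1] * entropy([p[(x1, x2)] / p1[x1] for x2 in (0, 1)])
    for x1 in (0, 1)
)

print(H_joint, H_X1 + H_X2_given_X1)  # equal: chain rule for entropy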
The relative entropy between two probability distributions p(x) and q(x) on the same alphabet is

D(p‖q) = E_p [log (p(x)/q(x))] = ∑_x p(x) log [p(x)/q(x)].
The relative entropy is a measure of the distance between the two distributions p(x) and q(x), even though it is not a true distance metric.
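As a quick illustration, the following Python sketch (with made-up distributions) computes D(p‖q) and shows that it is not symmetric, one reason it is not a true distance metric:

import math

def kl_divergence(p, q):
    """Relative entropy D(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical distributions on a 3-symbol alphabet (illustration only).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

print(kl_divergence(p, q))  # >= 0
print(kl_divergence(q, p))  # generally different from D(p || q): not symmetric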
We defined earlier the mutual information between two events. Now consider two random variables X ∈ 𝒳 and Y ∈ 𝒴. If x ∈ 𝒳 is a realization of X and y ∈ 𝒴 is a realization of Y, the mutual information between x and y is

i(x; y) = log [p(x|y)/p(x)],
and the average mutual information between the random variables X and Y is

I(X; Y) = E [log (p(X|Y)/p(X))]
= ∑_x ∑_y p(x, y) log [p(x|y)/p(x)]
= ∑_x ∑_y p(x, y) log [p(x, y)/(p(x)p(y))]
= D(p(x, y)‖p(x)p(y))
= ∑_x ∑_y p(x, y) log [p(y|x)/p(y)]
= I(Y; X).
The mutual information between random variables X and Y is the average amount of information provided about X by observing Y, which is also the average amount of uncertainty resolved about X by observing Y. As can be seen, I(X; Y) = I(Y; X), i.e. Y resolves as much uncertainty about X as X resolves about Y.

The mutual information has the following properties:

1.

I(X; Y) = E [log (p(X|Y)/p(X))] = E[−log p(X)] − E[−log p(X|Y)] = H(X) − H(X|Y).
Since, as established earlier, I(X; Y) = I(Y; X), we have

I(X; Y) = H(Y) − H(Y|X).
2.

I(X; Y) = H(X) − H(X|Y)
= H(X) − [H(X, Y) − H(Y)]
= H(X) + H(Y) − H(X, Y).
3. I(X; X) = H(X) − H(X|X) = H(X).
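These identities can be checked numerically. The Python sketch below (a made-up joint pmf, not from the notes) verifies that H(X) + H(Y) − H(X, Y) agrees with D(p(x, y)‖p(x)p(y)):

import math

# Hypothetical joint pmf p(x, y) over {0,1} x {0,1} (illustration only).
p = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}

px = {x: sum(p[(x, y)] for y in (0, 1)) for x in (0, 1)}
py = {y: sum(p[(x, y)] for x in (0, 1)) for y in (0, 1)}

def H(values):
    return -sum(v * math.log2(v) for v in values if v > 0)

I_via_entropies = H(px.values()) + H(py.values()) - H(p.values())
I_via_kl = sum(
    pxy * math.log2(pxy / (px[x] * py[y]))
    for (x, y), pxy in p.items() if pxy > 0
)

print(I_via_entropies, I_via_kl)  # the two values agree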
The diagram in Figure 2 summarizes the relationship between the
various quantities:
Figure 2: The relationship between the entropies H(X), H(Y), H(X, Y), the conditional entropies H(X|Y), H(Y|X), and the mutual information I(X; Y).
Similarly, the mutual information between X and Y given a third random variable Z, the conditional mutual information, is

I(X; Y|Z) = E_{XYZ} [log (p(X|Y, Z)/p(X|Z))] = H(X|Z) − H(X|Y, Z).
Theorem 2 (Chain rule for mutual information) Let X = (X_1, X_2, ..., X_N) be a random vector. Then the mutual information between X and Y is

I(X; Y) = ∑_{n=1}^{N} I(X_n; Y|X^{n−1}).
Proof:

I(X; Y) = H(X) − H(X|Y)
= ∑_{n=1}^{N} H(X_n|X^{n−1}) − ∑_{n=1}^{N} H(X_n|X^{n−1}, Y)
= ∑_{n=1}^{N} [H(X_n|X^{n−1}) − H(X_n|X^{n−1}, Y)]
= ∑_{n=1}^{N} I(X_n; Y|X^{n−1}).
2.4 Jensen's Inequality
Jensen's inequality states that if f is a convex function and X is a random variable taking the values x_1, ..., x_k with probabilities p_1, ..., p_k, then E[f(X)] ≥ f(E[X]). The proof is by induction on k; for k = 2 the inequality is the definition of convexity. For the induction step, assume the result holds for k − 1 points and, for p_k < 1, let p'_i = p_i/(1 − p_k), i = 1, ..., k − 1. Then

∑_{i=1}^{k} p_i f(x_i) = p_k f(x_k) + (1 − p_k) ∑_{i=1}^{k−1} p'_i f(x_i)
≥ p_k f(x_k) + (1 − p_k) f(∑_{i=1}^{k−1} p'_i x_i)
≥ f(p_k x_k + (1 − p_k) ∑_{i=1}^{k−1} p'_i x_i)
= f(∑_{i=1}^{k} p_i x_i).
For the equality condition, note that equality holds in the first inequality above when all but one of the p'_i are zero, and in the last inequality when p_k or 1 − p_k is zero, i.e. when X is deterministic.
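A quick numerical sanity check of Jensen's inequality for the convex function f(x) = x², with a randomly generated pmf (a sketch, not part of the notes):

import random

random.seed(0)

f = lambda x: x * x            # a convex function
xs = [random.uniform(-2, 2) for _ in range(5)]
w = [random.random() for _ in range(5)]
p = [wi / sum(w) for wi in w]  # a random pmf

lhs = sum(pi * f(xi) for pi, xi in zip(p, xs))   # E[f(X)]
rhs = f(sum(pi * xi for pi, xi in zip(p, xs)))   # f(E[X])
print(lhs >= rhs)  # True: E[f(X)] >= f(E[X]) for convex f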
Theorem 5

D(p‖q) ≥ 0,

with equality if and only if p(x) = q(x).
Proof:

D(p‖q) = E_p [log (p(x)/q(x))]
= −E_p [log (q(x)/p(x))]
≥ −log E_p [q(x)/p(x)]
= −log ∑_x q(x)
≥ 0,

where the last step follows because ∑_x q(x) ≤ 1 (the sum being over the support of p). Since the log function is strictly concave, equality above holds if and only if p(x)/q(x) = c, c a constant, i.e. p(x) = c q(x). Summing over x on both sides, c = 1, i.e. p(x) = q(x).
Corollary 1

I(X; Y) ≥ 0.

Proof:

I(X; Y) = D(p(x, y)‖p(x)p(y)) ≥ 0,

with equality iff p(x, y) = p(x)p(y), i.e. X and Y are independent.
As an application of Theorem 5, let u(x) = 1/|𝒳| denote the uniform distribution over 𝒳. Then

D(p‖u) = ∑_x p(x) log [p(x)/u(x)] = log |𝒳| − H(X) ≥ 0,

so that H(X) ≤ log |𝒳|.
Combining the chain rule for entropy with the fact that conditioning reduces entropy gives the independence bound

H(X_1, X_2, ..., X_n) = ∑_{i=1}^{n} H(X_i|X^{i−1}) ≤ ∑_{i=1}^{n} H(X_i),

with equality if and only if the X_i are independent.

A closely related result is the log-sum inequality: for non-negative numbers a_1, ..., a_n and b_1, ..., b_n,

∑_{i=1}^{n} a_i log (a_i/b_i) ≥ A log (A/B),

where A = ∑_{i=1}^{n} a_i and B = ∑_{i=1}^{n} b_i. To prove it, let a'_i = a_i/A and b'_i = b_i/B. Then

∑_{i=1}^{n} a_i log (a_i/b_i) = A ∑_{i=1}^{n} a'_i log [(a'_i A)/(b'_i B)]
= A ∑_{i=1}^{n} a'_i log (a'_i/b'_i) + A log (A/B)
= A D(a'‖b') + A log (A/B)
≥ A log (A/B),

since D(a'‖b') ≥ 0.
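A numerical check of the log-sum inequality with randomly generated non-negative numbers (a minimal sketch, not part of the notes):

import math
import random

random.seed(1)

# Hypothetical non-negative numbers (illustration only).
a = [random.uniform(0.1, 2.0) for _ in range(6)]
b = [random.uniform(0.1, 2.0) for _ in range(6)]
A, B = sum(a), sum(b)

lhs = sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b))
rhs = A * math.log2(A / B)
print(lhs >= rhs)  # True: the log-sum inequality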
The log-sum inequality can be used to show that the relative entropy D(p‖q) is convex in the pair (p, q): for distribution pairs (p_1, q_1) and (p_2, q_2) and 0 ≤ λ ≤ 1, applying the log-sum inequality at each x gives

λ p_1(x) log [λ p_1(x)/(λ q_1(x))] + (1 − λ) p_2(x) log [(1 − λ) p_2(x)/((1 − λ) q_2(x))]
≥ (λ p_1(x) + (1 − λ) p_2(x)) log [(λ p_1(x) + (1 − λ) p_2(x))/(λ q_1(x) + (1 − λ) q_2(x))],

and summing over x yields

λ D(p_1‖q_1) + (1 − λ) D(p_2‖q_2) ≥ D(λ p_1 + (1 − λ) p_2 ‖ λ q_1 + (1 − λ) q_2).
Entropy, on the other hand, is a concave function of the distribution. To see this, let X_1 and X_2 be random variables over the same alphabet 𝒜 with distributions p_1 and p_2, respectively, and let θ be a random variable with

θ = 1, with probability λ,
θ = 2, with probability 1 − λ.

Now let Z = X_θ. Then Z takes values from 𝒜 with probability distribution

p(Z) = λ p_1 + (1 − λ) p_2.

Thus,

H(Z) = H(λ p_1 + (1 − λ) p_2).

On the other hand,

H(Z|θ) = λ H(p_1) + (1 − λ) H(p_2).

Since conditioning reduces entropy, we have

H(Z) ≥ H(Z|θ), i.e. H(λ p_1 + (1 − λ) p_2) ≥ λ H(p_1) + (1 − λ) H(p_2),

which proves that H(p) is concave in p.
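The concavity of the entropy is easy to check numerically; the Python sketch below uses two made-up distributions and an illustrative mixing weight λ = 0.4:

import math

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# Hypothetical distributions on a 3-symbol alphabet and a mixing weight
# (all values are made up for illustration).
p1 = [0.7, 0.2, 0.1]
p2 = [0.1, 0.3, 0.6]
lam = 0.4

mix = [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]
print(H(mix) >= lam * H(p1) + (1 - lam) * H(p2))  # True: entropy is concave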
Exercise 2 Consider two containers C1 and C2 (shown below) containing n1 and n2 molecules, respectively. The energies of the n1 molecules in C1 are i.i.d. random variables with common distribution p1 .
Similarly, the energies of the molecules in C2 are i.i.d. with common
distribution p2 .
1. Find the entropies of the ensembles in C1 and C2 .
2. Now assume that the separation between the two containers is removed. Find the entropy of the mixture and show it is at least as large as the sum of the individual entropies found in part 1.
Solution: Let X_i, i = 1, 2, ..., n_1, be the i.i.d. random variables associated with the energies in container C_1 and Y_i, i = 1, 2, ..., n_2, those for container C_2. Then,

1.

H(X_1, X_2, ..., X_{n_1}) = ∑_{i=1}^{n_1} H(X_i) = n_1 H(p_1).

Similarly,

H(Y_1, Y_2, ..., Y_{n_2}) = ∑_{i=1}^{n_2} H(Y_i) = n_2 H(p_2).
2. Once the separation is removed, each of the n_1 + n_2 molecules has energy drawn from the mixture distribution

[n_1/(n_1 + n_2)] p_1 + [n_2/(n_1 + n_2)] p_2.

Thus, letting Z_i, i = 1, 2, ..., n_1 + n_2, denote the energies after mixing,

H(Z_1, Z_2, ..., Z_{n_1+n_2}) = (n_1 + n_2) H([n_1/(n_1 + n_2)] p_1 + [n_2/(n_1 + n_2)] p_2)
≥ (n_1 + n_2) [(n_1/(n_1 + n_2)) H(p_1) + (n_2/(n_1 + n_2)) H(p_2)]
= n_1 H(p_1) + n_2 H(p_2),

where the inequality above is due to the concavity of the entropy.
Theorem 12 Let (X, Y ) have joint distribution p(x, y) = p(x)p(y|x).
The mutual information I(X; Y ) is a concave function of p(x) for fixed
p(y|x) and a convex function of p(y|x) for a fixed p(x).
2.5

Suppose X → Y → Z forms a Markov chain, i.e. X and Z are conditionally independent given Y. Then I(X; Y) ≥ I(X; Z); this is the data processing inequality.

Proof: We have
I(X; Y, Z) = I(X; Y) + I(X; Z|Y)
= I(X; Z) + I(X; Y|Z).
However, I(X; Z|Y) = 0 since X and Z are independent given Y.
Thus,

I(X; Y) = I(X; Z) + I(X; Y|Z) ≥ I(X; Z).

Equality above is when I(X; Y|Z) = 0, i.e. when X and Y are independent given Z, or, in other words, when we have a Markov chain X → Z → Y.
Corollary 2 If Z = g(Y), then I(X; Y) ≥ I(X; g(Y)).

Proof: X → Y → g(Y) is a Markov chain.
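A numerical check of the data processing inequality (a sketch, not from the notes): the code below uses a hypothetical joint pmf p(x, y) and a deterministic function g of Y, so that X → Y → g(Y) is a Markov chain:

import math

def mutual_information(joint):
    """I in bits from a joint pmf given as {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), v in joint.items():
        px[x] = px.get(x, 0.0) + v
        py[y] = py.get(y, 0.0) + v
    return sum(
        v * math.log2(v / (px[x] * py[y]))
        for (x, y), v in joint.items() if v > 0
    )

# Hypothetical joint pmf p(x, y), x in {0,1}, y in {0,1,2} (illustration only).
p_xy = {(0, 0): 0.25, (0, 1): 0.15, (0, 2): 0.10,
        (1, 0): 0.05, (1, 1): 0.15, (1, 2): 0.30}

g = lambda y: 0 if y < 2 else 1   # a deterministic function of Y, so X -> Y -> g(Y)

p_xz = {}
for (x, y), v in p_xy.items():
    key = (x, g(y))
    p_xz[key] = p_xz.get(key, 0.0) + v

print(mutual_information(p_xy) >= mutual_information(p_xz))  # True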
2.6 Fano's Inequality
Suppose we wish to estimate X from an observation Y, forming an estimate X̂ = g(Y), and define the probability of error

P_e = Pr[X̂ ≠ X].

Figure 3: X is observed through the channel P(y|x) and the estimate X̂ = g(Y) is formed from the channel output.

Fano's inequality relates the probability of error to the conditional entropy H(X|Y):

P_e ≥ (H(X|Y) − 1)/log |𝒳|.
Proof:

Let

E = 1 if X̂ ≠ X, and E = 0 if X̂ = X.    (1)

Now,

H(X|E, Y) = Pr(E = 0) H(X|E = 0, Y) + Pr(E = 1) H(X|E = 1, Y)
= Pr(E = 1) H(X|E = 1, Y)   [since H(X|E = 0, Y) = 0]
≤ P_e log(|𝒳| − 1).    (2)
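The bound can also be checked numerically. The sketch below (a hypothetical joint pmf, not from the notes) forms the MAP estimate X̂ = g(Y) and compares P_e with (H(X|Y) − 1)/log|𝒳|:

import math

# Hypothetical joint pmf p(x, y) with x, y in {0, 1, 2} (illustration only).
p = {(x, y): 0.0 for x in range(3) for y in range(3)}
p[(0, 0)] = 0.25; p[(0, 1)] = 0.05; p[(0, 2)] = 0.05
p[(1, 0)] = 0.05; p[(1, 1)] = 0.20; p[(1, 2)] = 0.10
p[(2, 0)] = 0.05; p[(2, 1)] = 0.05; p[(2, 2)] = 0.20

py = {y: sum(p[(x, y)] for x in range(3)) for y in range(3)}

# MAP estimator x_hat = g(y) and its probability of error.
g = {y: max(range(3), key=lambda x: p[(x, y)]) for y in range(3)}
Pe = sum(v for (x, y), v in p.items() if x != g[y])

# Conditional entropy H(X|Y).
H_X_given_Y = sum(
    -py[y] * sum(
        (p[(x, y)] / py[y]) * math.log2(p[(x, y)] / py[y])
        for x in range(3) if p[(x, y)] > 0
    )
    for y in range(3)
)

print(Pe, (H_X_given_Y - 1) / math.log2(3))  # Pe is no smaller than the bound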
2.7
Exercise 3 Consider the Z-channel discussed in Example 1. Compute and plot the mutual information between the input and output of
the channel.
Solution: We have

I(X; Y) = ∑_x ∑_y p(x, y) log [p(x|y)/p(x)]
= ∑_x ∑_y p(y|x) p(x) log [p(y|x)/p(y)]
= 1 − [(1 + ε)/2] log(1 + ε) + (ε/2) log ε.
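The closed-form expression can be checked against a direct computation of I(X; Y) from the channel transition probabilities; the Python sketch below (with an illustrative value ε = 0.3, not from the notes) does so:

import math

def z_channel_mi(eps):
    """I(X;Y) in bits for the Z-channel with equally likely inputs."""
    # Transition probabilities p(y|x), stored with key (y, x).
    p_y_given_x = {(0, 0): 1.0, (1, 0): 0.0, (0, 1): eps, (1, 1): 1.0 - eps}
    p_x = {0: 0.5, 1: 0.5}
    p_y = {y: sum(p_y_given_x[(y, x)] * p_x[x] for x in (0, 1)) for y in (0, 1)}
    return sum(
        p_x[x] * p_y_given_x[(y, x)] * math.log2(p_y_given_x[(y, x)] / p_y[y])
        for x in (0, 1) for y in (0, 1) if p_y_given_x[(y, x)] > 0
    )

eps = 0.3
closed_form = 1 - (1 + eps) / 2 * math.log2(1 + eps) + eps / 2 * math.log2(eps)
print(z_channel_mi(eps), closed_form)  # the two values agree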
Figure 4: Mutual information I(X; Y) between input and output for the Z-channel, as a function of ε.

Exercise 4 The channel in Figure 5 below is known as the binary symmetric channel (BSC) and it is another simple model for a channel.
Figure 5: The binary symmetric channel (BSC). The inputs are equally likely (probability 0.5 each); each input is received correctly with probability 1 − p and flipped with probability p.
Compute the mutual information between the input X and the output
Y as a function of the cross-over probability p.
We have

I(X; Y) = ∑_x ∑_y p(y|x) p(x) log [p(y|x)/p(y)].

Now,

p(Y = 0) = (1/2)(1 − p) + (1/2) p = 1/2,
p(Y = 1) = 1/2.
Thus,

I(X; Y) = (1 − p) log[2(1 − p)] + p log(2p)
= 1 + p log p + (1 − p) log(1 − p)
= 1 − h_b(p),

where

h_b(p) = −p log p − (1 − p) log(1 − p)

is the binary entropy function.
A plot of the mutual information as a function of p is given below:
Figure 6: The mutual information between input and output for a BSC, as a function of the cross-over probability p.
Another solution:

I(X; Y) = H(Y) − H(Y|X) = 1 − h_b(p),

since p(Y = 0) = p(Y = 1) = 1/2 gives H(Y) = 1 bit, and H(Y|X) = h_b(p).
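The result I(X; Y) = 1 − h_b(p) can be verified directly; the sketch below (not from the notes) evaluates both expressions for a few values of p:

import math

def h_b(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_mi(p):
    """I(X;Y) in bits for a BSC with cross-over probability p and equally likely inputs."""
    p_y_given_x = {(0, 0): 1 - p, (1, 0): p, (0, 1): p, (1, 1): 1 - p}  # key (y, x)
    p_y = {0: 0.5, 1: 0.5}   # symmetric channel, uniform input => uniform output
    return sum(
        0.5 * p_y_given_x[(y, x)] * math.log2(p_y_given_x[(y, x)] / p_y[y])
        for x in (0, 1) for y in (0, 1) if p_y_given_x[(y, x)] > 0
    )

for p in (0.0, 0.1, 0.25, 0.5):
    print(p, bsc_mi(p), 1 - h_b(p))  # direct computation matches 1 - h_b(p)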
Exercise 6 Consider the simple rate 2/3 parity-check code where the
third bit, known as the parity bit, is the exclusive or of the first two bits:
000
011
101
110
Let the first two bits (the information bits) be represented by the random vector X = (X1, X2) and the parity bit by the random variable Y.
1. How much uncertainty is resolved about what X is by observing
Y?
2. How much uncertainty about X is resolved by observing Y and
X2 ?
3. Now suppose the parity bit Y is observed through a BSC with cross-over probability ε that produces an output Z. How much uncertainty is resolved about X by observing Z?
Solution:
3. We have

I(X; Z, Y) = I(X; Z) + I(X; Y|Z)
= I(X; Y) + I(X; Z|Y),

where I(X; Z|Y) = 0 since X → Y → Z forms a Markov chain. Thus,

I(X; Z) = I(X; Y) − I(X; Y|Z) = 1 − I(X; Y|Z),

since, from part 1, I(X; Y) = H(Y) − H(Y|X) = 1 − 0 = 1 bit. Now, by symmetry,

I(X; Y|Z) = (1/2) I(X; Y|Z = 0) + (1/2) I(X; Y|Z = 1)
= I(X; Y|Z = 0),
with

I(X; Y|Z = 0) = ∑_x ∑_{y=0}^{1} p(X = x, Y = y|Z = 0) log [p(X = x|Y = y, Z = 0)/p(X = x|Z = 0)]
= ∑_x ∑_{y=0}^{1} p(X = x, Y = y|Z = 0) log [p(X = x|Y = y)/p(X = x|Z = 0)]
= h_b(ε),

where the second equality uses p(X = x|Y = y, Z = 0) = p(X = x|Y = y), since X and Z are independent given Y.
Thus,

I(X; Z) = 1 − h_b(ε).
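The answer can be verified by direct computation: the sketch below (not from the notes) builds the joint pmf of X = (X1, X2) and Z for an illustrative ε and compares I(X; Z) with 1 − h_b(ε):

import math

def mutual_information(joint):
    """I in bits from a joint pmf given as {(x, z): prob}."""
    px, pz = {}, {}
    for (x, z), v in joint.items():
        px[x] = px.get(x, 0.0) + v
        pz[z] = pz.get(z, 0.0) + v
    return sum(
        v * math.log2(v / (px[x] * pz[z]))
        for (x, z), v in joint.items() if v > 0
    )

eps = 0.2  # an illustrative cross-over probability
p_xz = {}
for x1 in (0, 1):
    for x2 in (0, 1):
        y = x1 ^ x2                             # parity bit
        for z in (0, 1):
            flip = eps if z != y else 1 - eps   # BSC acting on the parity bit
            key = ((x1, x2), z)
            p_xz[key] = p_xz.get(key, 0.0) + 0.25 * flip

h_b = lambda p: -p * math.log2(p) - (1 - p) * math.log2(1 - p)
print(mutual_information(p_xz), 1 - h_b(eps))  # the two values agree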