i(A, B) = i(AB) = i(A) + i(B).
It turns out there is one and only one function that satisfies the
above requirements, namely the logarithmic function.
Definition 1 Let A be an event having probability of occurrence p(A).
Then the amount of information conveyed by the knowledge of the occurrence of A, referred to as the self-information of event A, is given
by
i(A) = −log2 p(A) bits.
Definition 2 The mutual information between two events A and B is

i(A; B) = log [p(A|B)/p(A)]
= −log p(A) − [−log p(A|B)]
= i(A) − i(A|B).
In other words, i(A; B) is also the amount of uncertainty about event
A removed by the occurrence of event B.
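These definitions are easy to check numerically. Below is a minimal Python sketch (not part of the original notes; the probabilities p(A), p(B), and p(A, B) are made-up values for illustration):

import math

# Hypothetical example probabilities (chosen for illustration only).
p_A = 0.25          # p(A)
p_B = 0.5           # p(B)
p_AB = 0.2          # p(A, B), the probability that A and B both occur

p_A_given_B = p_AB / p_B
p_B_given_A = p_AB / p_A

i_A = -math.log2(p_A)                      # self-information of A, in bits
i_AB = math.log2(p_A_given_B / p_A)        # mutual information i(A; B)
i_BA = math.log2(p_B_given_A / p_B)        # mutual information i(B; A)

print(i_A)          # 2.0 bits
print(i_AB, i_BA)   # equal, illustrating i(A; B) = i(B; A)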
2.1
1. i(A; B) = i(B; A).

Proof:

i(A; B) = log [p(A|B)/p(A)]
= log [p(A, B)/(p(A)p(B))]
= log [p(B|A)/p(B)]
= i(B; A).
Here, we abuse notation somewhat by using i(·) to denote self-information and i(·;·) to denote mutual information.
2. The mutual information between two events can be negative, positive, or zero:

i(A; B) < 0 if p(A|B) < p(A),
i(A; B) > 0 if p(A|B) > p(A),
i(A; B) = 0 if p(A|B) = p(A).
3. i(A; A) = −log p(A) = i(A).
Definition 3 The mutual information between events A and B given that event C has occurred is

i(A; B|C) = log [p(A|B, C)/p(A|C)]
= log [p(A, B|C)/(p(A|C)p(B|C))]
= i(B; A|C).

Similarly, the mutual information between event A and the joint occurrence of events B and C is

i(A; B, C) = log [p(A|B, C)/p(A)].
Now,

p(A, B, C) = p(A|B, C) p(C|B) p(B) = p(C|A, B) p(A|B) p(B),

so that

p(A|B, C) = p(C|A, B) p(A|B) / p(C|B).
Thus,

i(A; B, C) = log [p(A|B) p(C|A, B) / (p(A) p(C|B))]
= log [p(A|B)/p(A)] + log [p(C|A, B)/p(C|B)]
= i(A; B) + i(C; A|B)
= i(A; B) + i(A; C|B).
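The identity i(A; B, C) = i(A; B) + i(A; C|B) can also be verified numerically. The Python sketch below (not from the notes) uses a hypothetical joint pmf over three binary variables and takes A = {X1 = 1}, B = {X2 = 1}, C = {X3 = 1}; all numbers are illustrative:

import math
from itertools import product

# Hypothetical joint pmf over three binary variables (illustration only).
p = {(x1, x2, x3): 0.0 for x1, x2, x3 in product((0, 1), repeat=3)}
p[(1, 1, 1)] = 0.20
p[(1, 1, 0)] = 0.10
p[(1, 0, 1)] = 0.05
p[(1, 0, 0)] = 0.15
p[(0, 1, 1)] = 0.10
p[(0, 1, 0)] = 0.15
p[(0, 0, 1)] = 0.05
p[(0, 0, 0)] = 0.20

def prob(pred):
    """Probability of the event defined by the predicate pred(x1, x2, x3)."""
    return sum(v for k, v in p.items() if pred(*k))

pA   = prob(lambda a, b, c: a == 1)
pB   = prob(lambda a, b, c: b == 1)
pAB  = prob(lambda a, b, c: a == 1 and b == 1)
pBC  = prob(lambda a, b, c: b == 1 and c == 1)
pABC = prob(lambda a, b, c: a == 1 and b == 1 and c == 1)

i_A_BC  = math.log2((pABC / pBC) / pA)            # i(A; B, C) = log p(A|B,C)/p(A)
i_A_B   = math.log2((pAB / pB) / pA)              # i(A; B)    = log p(A|B)/p(A)
i_A_C_B = math.log2((pABC / pBC) / (pAB / pB))    # i(A; C|B)  = log p(A|B,C)/p(A|B)

print(i_A_BC, i_A_B + i_A_C_B)  # the two values agree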
Example 1 Consider the Z-channel shown in Figure 1, with equally likely inputs p(X = 0) = p(X = 1) = 0.5 and transition probabilities p(Y = 0|X = 0) = 1, p(Y = 0|X = 1) = ε, and p(Y = 1|X = 1) = 1 − ε.

Figure 1: The Z-channel.

1. What is the self-information of events X = 0, X = 1?
We have:

i(X = 0) = −log p(X = 0) = −log(1/2) = 1 bit.
i(X = 1) = −log p(X = 1) = −log(1/2) = 1 bit.
2. What is the self-information of events Y = 0, Y = 1?
We have:

i(Y = 0) = −log p(Y = 0).

p(Y = 0) = p(Y = 0|X = 0) p(X = 0) + p(Y = 0|X = 1) p(X = 1)
= 1 · p(X = 0) + ε · p(X = 1)
= (1 + ε)/2.

Thus,

i(Y = 0) = −log [(1 + ε)/2] bits.
Similarly,

i(Y = 1) = −log p(Y = 1).

p(Y = 1) = 1 − p(Y = 0) = (1 − ε)/2,

so

i(Y = 1) = −log [(1 − ε)/2] bits.
3. What is the mutual information between the events X = x and Y = y?

We have:

i(Y = 0; X = 0) = log [p(Y = 0|X = 0)/p(Y = 0)] = log [2/(1 + ε)] = 1 − log(1 + ε) bits.

i(Y = 1; X = 0) = log [p(Y = 1|X = 0)/p(Y = 1)] = log(0) = −∞.

i(Y = 0; X = 1) = log [p(Y = 0|X = 1)/p(Y = 0)] = log [2ε/(1 + ε)] bits.

i(Y = 1; X = 1) = log [p(Y = 1|X = 1)/p(Y = 1)] = log [2(1 − ε)/(1 − ε)] = 1 bit.

2.2
Definition 4 (Entropy) The entropy of a source represented by a random variable X with realizations taking values from the set 𝒳 is

H(X) = −∑_{x∈𝒳} p(x) log p(x).

For a random vector X = (X_1, X_2, ..., X_N), the chain rule for probabilities gives

p(X) = ∏_{n=1}^{N} p(X_n|X^{n−1}),

where X^{n−1} denotes (X_1, ..., X_{n−1}), so that

−log p(X) = −∑_{n=1}^{N} log p(X_n|X^{n−1}),

and taking expectations yields the chain rule for entropy,

H(X) = ∑_{n=1}^{N} H(X_n|X^{n−1}).
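A minimal Python sketch of these quantities (the joint pmf below is a made-up example, not from the notes) computes the entropy of a pmf and checks the two-variable chain rule H(X1, X2) = H(X1) + H(X2|X1):

import math

def entropy(pmf):
    """Entropy in bits of a pmf given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# Hypothetical joint pmf p(x1, x2) over {0,1} x {0,1} (illustration only).
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

H_joint = entropy(list(p.values()))                  # H(X1, X2)
p1 = {x1: p[(x1, 0)] + p[(x1, 1)] for x1 in (0, 1)}  # marginal of X1
H_X1 = entropy(list(p1.values()))

# Conditional entropy H(X2 | X1) = sum over x1 of p(x1) * H(X2 | X1 = x1).
H_X2_given_X1 = sum(
    p1[x1] * entropy([p[(x1, x2)] / p1[x1] for x2 in (0, 1)])
    for x1 in (0, 1)
)

print(H_joint, H_X1 + H_X2_given_X1)  # equal: chain rule for entropy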
The relative entropy between two probability distributions p(x) and q(x) on the same alphabet is

D(p‖q) = E_p [log (p(x)/q(x))] = ∑_x p(x) log [p(x)/q(x)].
The relative entropy is a measure of the distance between the two distributions p(x) and q(x), even though it is not a true distance metric.
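As a quick illustration, the following Python sketch (with made-up distributions) computes D(p‖q) and shows that it is not symmetric, one reason it is not a true distance metric:

import math

def kl_divergence(p, q):
    """Relative entropy D(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical distributions on a 3-symbol alphabet (illustration only).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

print(kl_divergence(p, q))  # >= 0
print(kl_divergence(q, p))  # generally different from D(p || q): not symmetric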
We defined earlier the mutual information between two events. Now consider two random variables X ∈ 𝒳 and Y ∈ 𝒴. If x ∈ 𝒳 is a realization of X and y ∈ 𝒴 is a realization of Y, the mutual information between x and y is

i(x; y) = log [p(x|y)/p(x)],
and the average mutual information between the random variables X and Y is

I(X; Y) = E [log (p(X|Y)/p(X))]
= ∑_x ∑_y p(x, y) log [p(x|y)/p(x)]
= ∑_x ∑_y p(x, y) log [p(x, y)/(p(x)p(y))]
= D(p(x, y)‖p(x)p(y))
= ∑_x ∑_y p(x, y) log [p(y|x)/p(y)]
= I(Y; X).
The mutual information between random variables X and Y is the average amount of information provided about X by observing Y, which is also the average amount of uncertainty resolved about X by observing Y. As can be seen, I(X; Y) = I(Y; X), i.e. Y resolves as much uncertainty about X as X resolves about Y.

The mutual information has the following properties:

1.

I(X; Y) = E [log (p(X|Y)/p(X))] = E[−log p(X)] − E[−log p(X|Y)] = H(X) − H(X|Y).
Since, as established earlier, I(X; Y) = I(Y; X), we have

I(X; Y) = H(Y) − H(Y|X).
2.

I(X; Y) = H(X) − H(X|Y)
= H(X) − [H(X, Y) − H(Y)]
= H(X) + H(Y) − H(X, Y).
3. I(X; X) = H(X) − H(X|X) = H(X).
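These identities can be checked numerically. The Python sketch below (a made-up joint pmf, not from the notes) verifies that H(X) + H(Y) − H(X, Y) agrees with D(p(x, y)‖p(x)p(y)):

import math

# Hypothetical joint pmf p(x, y) over {0,1} x {0,1} (illustration only).
p = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}

px = {x: sum(p[(x, y)] for y in (0, 1)) for x in (0, 1)}
py = {y: sum(p[(x, y)] for x in (0, 1)) for y in (0, 1)}

def H(values):
    return -sum(v * math.log2(v) for v in values if v > 0)

I_via_entropies = H(px.values()) + H(py.values()) - H(p.values())
I_via_kl = sum(
    pxy * math.log2(pxy / (px[x] * py[y]))
    for (x, y), pxy in p.items() if pxy > 0
)

print(I_via_entropies, I_via_kl)  # the two values agree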
The diagram in Figure 2 summarizes the relationship between the
various quantities:
Figure 2: The relationship between the entropies H(X), H(Y), H(X, Y), the conditional entropies H(X|Y), H(Y|X), and the mutual information I(X; Y).
Similarly, the mutual information between X and Y given a third random variable Z, the conditional mutual information, is

I(X; Y|Z) = E_{XYZ} [log (p(X|Y, Z)/p(X|Z))] = H(X|Z) − H(X|Y, Z).
Theorem 2 (Chain rule for mutual information) Let X = (X_1, X_2, ..., X_N) be a random vector. Then the mutual information between X and Y is

I(X; Y) = ∑_{n=1}^{N} I(X_n; Y|X^{n−1}).
Proof:

I(X; Y) = H(X) − H(X|Y)
= ∑_{n=1}^{N} H(X_n|X^{n−1}) − ∑_{n=1}^{N} H(X_n|X^{n−1}, Y)
= ∑_{n=1}^{N} [H(X_n|X^{n−1}) − H(X_n|X^{n−1}, Y)]
= ∑_{n=1}^{N} I(X_n; Y|X^{n−1}).
2.4 Jensen's Inequality
Jensen's inequality states that if f is a convex function and X is a random variable taking the values x_1, ..., x_k with probabilities p_1, ..., p_k, then E[f(X)] ≥ f(E[X]). The proof is by induction on k; for k = 2 the inequality is the definition of convexity. For the induction step, assume the result holds for k − 1 points and, for p_k < 1, let p'_i = p_i/(1 − p_k), i = 1, ..., k − 1. Then

∑_{i=1}^{k} p_i f(x_i) = p_k f(x_k) + (1 − p_k) ∑_{i=1}^{k−1} p'_i f(x_i)
≥ p_k f(x_k) + (1 − p_k) f(∑_{i=1}^{k−1} p'_i x_i)
≥ f(p_k x_k + (1 − p_k) ∑_{i=1}^{k−1} p'_i x_i)
= f(∑_{i=1}^{k} p_i x_i).
For the equality condition, note that equality holds in the first inequality above when all but one of the p'_i are zero, and in the last inequality when p_k or 1 − p_k is zero, i.e. when X is deterministic.
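A quick numerical sanity check of Jensen's inequality for the convex function f(x) = x², with a randomly generated pmf (a sketch, not part of the notes):

import random

random.seed(0)

f = lambda x: x * x            # a convex function
xs = [random.uniform(-2, 2) for _ in range(5)]
w = [random.random() for _ in range(5)]
p = [wi / sum(w) for wi in w]  # a random pmf

lhs = sum(pi * f(xi) for pi, xi in zip(p, xs))   # E[f(X)]
rhs = f(sum(pi * xi for pi, xi in zip(p, xs)))   # f(E[X])
print(lhs >= rhs)  # True: E[f(X)] >= f(E[X]) for convex f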
Theorem 5

D(p‖q) ≥ 0,

with equality if and only if p(x) = q(x).
Proof:

D(p‖q) = E_p [log (p(x)/q(x))]
= −E_p [log (q(x)/p(x))]
≥ −log E_p [q(x)/p(x)]
= −log ∑_x q(x)
≥ 0,

where the last step follows because ∑_x q(x) ≤ 1 (the sum being over the support of p). Since the log function is strictly concave, equality above holds if and only if p(x)/q(x) = c, c a constant, i.e. p(x) = c q(x). Summing over x on both sides, c = 1, i.e. p(x) = q(x).
Corollary 1

I(X; Y) ≥ 0.

Proof:

I(X; Y) = D(p(x, y)‖p(x)p(y)) ≥ 0,

with equality iff p(x, y) = p(x)p(y), i.e. X and Y are independent.
As an application of Theorem 5, let u(x) = 1/|𝒳| denote the uniform distribution over 𝒳. Then

D(p‖u) = ∑_x p(x) log [p(x)/u(x)] = log |𝒳| − H(X) ≥ 0,

so that H(X) ≤ log |𝒳|.
Combining the chain rule for entropy with the fact that conditioning reduces entropy gives the independence bound

H(X_1, X_2, ..., X_n) = ∑_{i=1}^{n} H(X_i|X^{i−1}) ≤ ∑_{i=1}^{n} H(X_i),

with equality if and only if the X_i are independent.

A closely related result is the log-sum inequality: for non-negative numbers a_1, ..., a_n and b_1, ..., b_n,

∑_{i=1}^{n} a_i log (a_i/b_i) ≥ A log (A/B),

where A = ∑_{i=1}^{n} a_i and B = ∑_{i=1}^{n} b_i. To prove it, let a'_i = a_i/A and b'_i = b_i/B. Then

∑_{i=1}^{n} a_i log (a_i/b_i) = A ∑_{i=1}^{n} a'_i log [(a'_i A)/(b'_i B)]
= A ∑_{i=1}^{n} a'_i log (a'_i/b'_i) + A log (A/B)
= A D(a'‖b') + A log (A/B)
≥ A log (A/B),

since D(a'‖b') ≥ 0.
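A numerical check of the log-sum inequality with randomly generated non-negative numbers (a minimal sketch, not part of the notes):

import math
import random

random.seed(1)

# Hypothetical non-negative numbers (illustration only).
a = [random.uniform(0.1, 2.0) for _ in range(6)]
b = [random.uniform(0.1, 2.0) for _ in range(6)]
A, B = sum(a), sum(b)

lhs = sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b))
rhs = A * math.log2(A / B)
print(lhs >= rhs)  # True: the log-sum inequality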
The log-sum inequality can be used to show that the relative entropy D(p‖q) is convex in the pair (p, q): for distribution pairs (p_1, q_1) and (p_2, q_2) and 0 ≤ λ ≤ 1, applying the log-sum inequality at each x gives

λ p_1(x) log [λ p_1(x)/(λ q_1(x))] + (1 − λ) p_2(x) log [(1 − λ) p_2(x)/((1 − λ) q_2(x))]
≥ (λ p_1(x) + (1 − λ) p_2(x)) log [(λ p_1(x) + (1 − λ) p_2(x))/(λ q_1(x) + (1 − λ) q_2(x))],

and summing over x yields

λ D(p_1‖q_1) + (1 − λ) D(p_2‖q_2) ≥ D(λ p_1 + (1 − λ) p_2 ‖ λ q_1 + (1 − λ) q_2).
Entropy, on the other hand, is a concave function of the distribution. To see this, let X_1 and X_2 be random variables over the same alphabet 𝒜 with distributions p_1 and p_2, respectively, and let θ be a random variable with

θ = 1, with probability λ,
θ = 2, with probability 1 − λ.

Now let Z = X_θ. Then Z takes values from 𝒜 with probability distribution

p(Z) = λ p_1 + (1 − λ) p_2.

Thus,

H(Z) = H(λ p_1 + (1 − λ) p_2).

On the other hand,

H(Z|θ) = λ H(p_1) + (1 − λ) H(p_2).

Since conditioning reduces entropy, we have

H(Z) ≥ H(Z|θ), i.e. H(λ p_1 + (1 − λ) p_2) ≥ λ H(p_1) + (1 − λ) H(p_2),

which proves that H(p) is concave in p.
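The concavity of the entropy is easy to check numerically; the Python sketch below uses two made-up distributions and an illustrative mixing weight λ = 0.4:

import math

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# Hypothetical distributions on a 3-symbol alphabet and a mixing weight
# (all values are made up for illustration).
p1 = [0.7, 0.2, 0.1]
p2 = [0.1, 0.3, 0.6]
lam = 0.4

mix = [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]
print(H(mix) >= lam * H(p1) + (1 - lam) * H(p2))  # True: entropy is concave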
Exercise 2 Consider two containers C1 and C2 (shown below) containing n1 and n2 molecules, respectively. The energies of the n1 molecules in C1 are i.i.d. random variables with common distribution p1 .
Similarly, the energies of the molecules in C2 are i.i.d. with common
distribution p2 .
1. Find the entropies of the ensembles in C1 and C2 .
2. Now assume that the separation between the two containers is removed. Find the entropy of the mixture and show it is at least as large as the sum of the individual entropies found in part 1.
Solution: Let X_i, i = 1, 2, ..., n_1, be the i.i.d. random variables associated with the energies in container C_1 and Y_i, i = 1, 2, ..., n_2, those for container C_2. Then,

1.

H(X_1, X_2, ..., X_{n_1}) = ∑_{i=1}^{n_1} H(X_i) = n_1 H(p_1).

Similarly,

H(Y_1, Y_2, ..., Y_{n_2}) = ∑_{i=1}^{n_2} H(Y_i) = n_2 H(p_2).
2. Once the separation is removed, each of the n_1 + n_2 molecules has energy drawn from the mixture distribution

[n_1/(n_1 + n_2)] p_1 + [n_2/(n_1 + n_2)] p_2.

Thus, letting Z_i, i = 1, 2, ..., n_1 + n_2, denote the energies after mixing,

H(Z_1, Z_2, ..., Z_{n_1+n_2}) = (n_1 + n_2) H([n_1/(n_1 + n_2)] p_1 + [n_2/(n_1 + n_2)] p_2)
≥ (n_1 + n_2) [(n_1/(n_1 + n_2)) H(p_1) + (n_2/(n_1 + n_2)) H(p_2)]
= n_1 H(p_1) + n_2 H(p_2),

where the inequality above is due to the concavity of the entropy.
Theorem 12 Let (X, Y ) have joint distribution p(x, y) = p(x)p(y|x).
The mutual information I(X; Y ) is a concave function of p(x) for fixed
p(y|x) and a convex function of p(y|x) for a fixed p(x).
2.5

Suppose X → Y → Z forms a Markov chain, i.e. X and Z are conditionally independent given Y. Then I(X; Y) ≥ I(X; Z); this is the data processing inequality.

Proof: We have
I(X; Y, Z) = I(X; Y) + I(X; Z|Y)
= I(X; Z) + I(X; Y|Z).
However, I(X; Z|Y) = 0 since X and Z are independent given Y.
Thus,

I(X; Y) = I(X; Z) + I(X; Y|Z) ≥ I(X; Z).

Equality above is when I(X; Y|Z) = 0, i.e. when X and Y are independent given Z, or, in other words, when we have a Markov chain X → Z → Y.
Corollary 2 If Z = g(Y), then I(X; Y) ≥ I(X; g(Y)).

Proof: X → Y → g(Y) is a Markov chain.
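A numerical check of the data processing inequality (a sketch, not from the notes): the code below uses a hypothetical joint pmf p(x, y) and a deterministic function g of Y, so that X → Y → g(Y) is a Markov chain:

import math

def mutual_information(joint):
    """I in bits from a joint pmf given as {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), v in joint.items():
        px[x] = px.get(x, 0.0) + v
        py[y] = py.get(y, 0.0) + v
    return sum(
        v * math.log2(v / (px[x] * py[y]))
        for (x, y), v in joint.items() if v > 0
    )

# Hypothetical joint pmf p(x, y), x in {0,1}, y in {0,1,2} (illustration only).
p_xy = {(0, 0): 0.25, (0, 1): 0.15, (0, 2): 0.10,
        (1, 0): 0.05, (1, 1): 0.15, (1, 2): 0.30}

g = lambda y: 0 if y < 2 else 1   # a deterministic function of Y, so X -> Y -> g(Y)

p_xz = {}
for (x, y), v in p_xy.items():
    key = (x, g(y))
    p_xz[key] = p_xz.get(key, 0.0) + v

print(mutual_information(p_xy) >= mutual_information(p_xz))  # True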
2.6 Fano's Inequality
Suppose we wish to estimate X from an observation Y, forming an estimate X̂ = g(Y), and define the probability of error

P_e = Pr[X̂ ≠ X].

Figure 3: X is observed through the channel P(y|x) and the estimate X̂ = g(Y) is formed from the channel output.

Fano's inequality relates the probability of error to the conditional entropy H(X|Y):

P_e ≥ (H(X|Y) − 1)/log |𝒳|.
Proof:

Let

E = 1 if X̂ ≠ X, and E = 0 if X̂ = X.    (1)

Now,

H(X|E, Y) = Pr(E = 0) H(X|E = 0, Y) + Pr(E = 1) H(X|E = 1, Y)
= Pr(E = 1) H(X|E = 1, Y)   [since H(X|E = 0, Y) = 0]
≤ P_e log(|𝒳| − 1).    (2)
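The bound can also be checked numerically. The sketch below (a hypothetical joint pmf, not from the notes) forms the MAP estimate X̂ = g(Y) and compares P_e with (H(X|Y) − 1)/log|𝒳|:

import math

# Hypothetical joint pmf p(x, y) with x, y in {0, 1, 2} (illustration only).
p = {(x, y): 0.0 for x in range(3) for y in range(3)}
p[(0, 0)] = 0.25; p[(0, 1)] = 0.05; p[(0, 2)] = 0.05
p[(1, 0)] = 0.05; p[(1, 1)] = 0.20; p[(1, 2)] = 0.10
p[(2, 0)] = 0.05; p[(2, 1)] = 0.05; p[(2, 2)] = 0.20

py = {y: sum(p[(x, y)] for x in range(3)) for y in range(3)}

# MAP estimator x_hat = g(y) and its probability of error.
g = {y: max(range(3), key=lambda x: p[(x, y)]) for y in range(3)}
Pe = sum(v for (x, y), v in p.items() if x != g[y])

# Conditional entropy H(X|Y).
H_X_given_Y = sum(
    -py[y] * sum(
        (p[(x, y)] / py[y]) * math.log2(p[(x, y)] / py[y])
        for x in range(3) if p[(x, y)] > 0
    )
    for y in range(3)
)

print(Pe, (H_X_given_Y - 1) / math.log2(3))  # Pe is no smaller than the bound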
2.7
Exercise 3 Consider the Z-channel discussed in Example 1. Compute and plot the mutual information between the input and output of
the channel.
Solution: We have

I(X; Y) = ∑_x ∑_y p(x, y) log [p(x|y)/p(x)]
= ∑_x ∑_y p(y|x) p(x) log [p(y|x)/p(y)]
= 1 − [(1 + ε)/2] log(1 + ε) + (ε/2) log ε.
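The closed-form expression can be checked against a direct computation of I(X; Y) from the channel transition probabilities; the Python sketch below (with an illustrative value ε = 0.3, not from the notes) does so:

import math

def z_channel_mi(eps):
    """I(X;Y) in bits for the Z-channel with equally likely inputs."""
    # Transition probabilities p(y|x), stored with key (y, x).
    p_y_given_x = {(0, 0): 1.0, (1, 0): 0.0, (0, 1): eps, (1, 1): 1.0 - eps}
    p_x = {0: 0.5, 1: 0.5}
    p_y = {y: sum(p_y_given_x[(y, x)] * p_x[x] for x in (0, 1)) for y in (0, 1)}
    return sum(
        p_x[x] * p_y_given_x[(y, x)] * math.log2(p_y_given_x[(y, x)] / p_y[y])
        for x in (0, 1) for y in (0, 1) if p_y_given_x[(y, x)] > 0
    )

eps = 0.3
closed_form = 1 - (1 + eps) / 2 * math.log2(1 + eps) + eps / 2 * math.log2(eps)
print(z_channel_mi(eps), closed_form)  # the two values agree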
Figure 4: Mutual information I(X; Y) between input and output for the Z-channel, as a function of ε.

Exercise 4 The channel in Figure 5 below is known as the binary symmetric channel (BSC) and it is another simple model for a channel.
Figure 5: The binary symmetric channel (BSC). The inputs are equally likely (probability 0.5 each); each input is received correctly with probability 1 − p and flipped with probability p.
Compute the mutual information between the input X and the output
Y as a function of the cross-over probability p.
We have

I(X; Y) = ∑_x ∑_y p(y|x) p(x) log [p(y|x)/p(y)].

Now,

p(Y = 0) = (1/2)(1 − p) + (1/2) p = 1/2,
p(Y = 1) = 1/2.
Thus,

I(X; Y) = (1 − p) log[2(1 − p)] + p log(2p)
= 1 + p log p + (1 − p) log(1 − p)
= 1 − h_b(p),

where

h_b(p) = −p log p − (1 − p) log(1 − p)

is the binary entropy function.
A plot of the mutual information as a function of p is given below:
Figure 6: The mutual information between input and output for a BSC, as a function of the cross-over probability p.
Another solution:

I(X; Y) = H(Y) − H(Y|X) = 1 − h_b(p),

since p(Y = 0) = p(Y = 1) = 1/2 gives H(Y) = 1 bit, and H(Y|X) = h_b(p).
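The result I(X; Y) = 1 − h_b(p) can be verified directly; the sketch below (not from the notes) evaluates both expressions for a few values of p:

import math

def h_b(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_mi(p):
    """I(X;Y) in bits for a BSC with cross-over probability p and equally likely inputs."""
    p_y_given_x = {(0, 0): 1 - p, (1, 0): p, (0, 1): p, (1, 1): 1 - p}  # key (y, x)
    p_y = {0: 0.5, 1: 0.5}   # symmetric channel, uniform input => uniform output
    return sum(
        0.5 * p_y_given_x[(y, x)] * math.log2(p_y_given_x[(y, x)] / p_y[y])
        for x in (0, 1) for y in (0, 1) if p_y_given_x[(y, x)] > 0
    )

for p in (0.0, 0.1, 0.25, 0.5):
    print(p, bsc_mi(p), 1 - h_b(p))  # direct computation matches 1 - h_b(p)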
Exercise 6 Consider the simple rate 2/3 parity-check code where the
third bit, known as the parity bit, is the exclusive or of the first two bits:
000
011
101
110
Let the first two bits (the information bits) be represented by the random vector X = (X1, X2) and the parity bit by the random variable Y.
1. How much uncertainty is resolved about what X is by observing
Y?
2. How much uncertainty about X is resolved by observing Y and
X2 ?
3. Now suppose the parity bit Y is observed through a BSC with cross-over probability ε that produces an output Z. How much uncertainty is resolved about X by observing Z?
Solution:
3. We have

I(X; Z, Y) = I(X; Z) + I(X; Y|Z)
= I(X; Y) + I(X; Z|Y),

where I(X; Z|Y) = 0 since X → Y → Z forms a Markov chain. Thus,

I(X; Z) = I(X; Y) − I(X; Y|Z) = 1 − I(X; Y|Z),

since, from part 1, I(X; Y) = H(Y) − H(Y|X) = 1 − 0 = 1 bit. Now, by symmetry,

I(X; Y|Z) = (1/2) I(X; Y|Z = 0) + (1/2) I(X; Y|Z = 1)
= I(X; Y|Z = 0),
with

I(X; Y|Z = 0) = ∑_x ∑_{y=0}^{1} p(X = x, Y = y|Z = 0) log [p(X = x|Y = y, Z = 0)/p(X = x|Z = 0)]
= ∑_x ∑_{y=0}^{1} p(X = x, Y = y|Z = 0) log [p(X = x|Y = y)/p(X = x|Z = 0)]
= h_b(ε),

where the second equality uses p(X = x|Y = y, Z = 0) = p(X = x|Y = y), since X and Z are independent given Y.
Thus,

I(X; Z) = 1 − h_b(ε).
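The answer can be verified by direct computation: the sketch below (not from the notes) builds the joint pmf of X = (X1, X2) and Z for an illustrative ε and compares I(X; Z) with 1 − h_b(ε):

import math

def mutual_information(joint):
    """I in bits from a joint pmf given as {(x, z): prob}."""
    px, pz = {}, {}
    for (x, z), v in joint.items():
        px[x] = px.get(x, 0.0) + v
        pz[z] = pz.get(z, 0.0) + v
    return sum(
        v * math.log2(v / (px[x] * pz[z]))
        for (x, z), v in joint.items() if v > 0
    )

eps = 0.2  # an illustrative cross-over probability
p_xz = {}
for x1 in (0, 1):
    for x2 in (0, 1):
        y = x1 ^ x2                             # parity bit
        for z in (0, 1):
            flip = eps if z != y else 1 - eps   # BSC acting on the parity bit
            key = ((x1, x2), z)
            p_xz[key] = p_xz.get(key, 0.0) + 0.25 * flip

h_b = lambda p: -p * math.log2(p) - (1 - p) * math.log2(1 - p)
print(mutual_information(p_xz), 1 - h_b(eps))  # the two values agree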