- Chapter 1 and 2
School of ICE
Sungkyunkwan University
Chapter 1: Information Theory
Data Compression and Data Transmission: Channel and Source Coding
Information Theory
Information theory answers two fundamental questions in communication theory:
Ultimate data compression (entropy H)
Ultimate transmission rate of a communication channel (channel capacity C)
Data compression limit: $\min I(X; \hat{X})$
Data transmission limit: $\max I(X; Y)$
$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$
Example:
$H(X) = -\sum_{i=1}^{32} p(i) \log p(i) = -\sum_{i=1}^{32} \frac{1}{32} \log \frac{1}{32} = \log 32 = 5$ bits
Note (unit of information): with base-2 logarithms the entropy is measured in bits.
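As a quick sanity check, here is a minimal Python sketch (not part of the original slides) that computes H(X) for a finite pmf and reproduces the 5-bit result for the uniform distribution on 32 outcomes.

```python
# Sketch: Shannon entropy of a finite pmf (base-2 logs, so the unit is bits).
import math

def entropy(pmf):
    """H(X) = -sum_x p(x) log2 p(x), with the convention 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

uniform_32 = [1 / 32] * 32
print(entropy(uniform_32))  # 5.0 bits, as in the example above
```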
Entropy rate
Entropy for a stochastic process
Coding Techniques
Optimal codes
Huffman codes
Optimality
Rate distortion
D: distortion allowed.
R(D): Compression bound for the given distortion.
Data Transmission
Fundamental question
Given a channel with a probabilistic relationship between the input X and the output Y, i.e., p(y|x), how much data can we transmit through the channel without error?
Old myth
The error rate has a positive relationship with the channel noise.
Shannon's breakthrough
Error-free communication is possible if we transmit at a rate below the channel capacity.
[Figure: channel block diagram, input X → channel p(y|x) → output Y.]
Data Transmission
Noiseless binary channel
In each transmission, we can send 1 bit reliably to the receiver and
the capacity is 1 bit.
Noisy channel
We cannot transmit all the information (without error) that the
input can carry.
[Figure: a noisy binary channel, inputs {0, 1} mapped probabilistically to outputs {0, 1}.]
Data Transmission
Mutual information
A measure of the dependence between two random variables:
$I(X;Y) = H(X) - H(X \mid Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$
It is the relative entropy between p(x,y) and p(x)p(y).
Relative entropy
A distance between two probability mass functions:
$D(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$
Data Transmission
Channel capacity
The maximum rate of data transmission with vanishing error probability.
The maximum mutual information between the input random variable and the output one:
$C = \max_{p(x)} I(X;Y)$
Coding techniques
Hamming codes: the simplest channel codes, not capacity approaching.
Random codes: introduced by Shannon; capacity approaching but impractical.
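The following Python sketch illustrates the capacity definition $C = \max_{p(x)} I(X;Y)$ on a binary symmetric channel with crossover probability eps; the BSC and the names used here are assumptions for illustration, not an example worked in these slides. A brute-force search over the input distribution recovers the known closed form 1 - H(eps).

```python
# Sketch: channel capacity of a binary symmetric channel (assumed example).
import math

def h2(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mutual_information(px1, eps):
    """I(X;Y) = H(Y) - H(Y|X) for input Pr(X=1) = px1 and crossover probability eps."""
    py1 = px1 * (1 - eps) + (1 - px1) * eps   # output distribution
    return h2(py1) - h2(eps)                  # H(Y|X) = H(eps) for every input symbol

eps = 0.1
capacity = max(mutual_information(p / 1000, eps) for p in range(1001))
print(capacity, 1 - h2(eps))  # both ~0.531 bits; the maximum is at the uniform input
```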
Data Transmission
Gaussian channel
Most useful model for communication problems.
Capacity is derived and generalized to parallel Gaussian channel.
Chapter 2
Entropy, Relative Entropy, and Mutual Information
Entropy
Let X be a discrete random variable with alphabet $\mathcal{X}$ and probability mass function $p(x) = \Pr\{X = x\}$, $x \in \mathcal{X}$.
The expected value of a function g(X) is $E_p[g(X)] = \sum_{x \in \mathcal{X}} g(x)\, p(x)$.
The entropy of X is the expected value of $g(x) = \log \frac{1}{p(x)}$:
$H(X) = E_p\left[\log \frac{1}{p(X)}\right]$
Entropy
Example: X takes values in {1, 10, 20}.
$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$
The entropy depends only on the probabilities p(x), not on the values that X takes.
Properties of Entropy
Lemma 2.1.1: $H(X) \geq 0$.
Proof: $0 \leq p(x) \leq 1$ implies $\log \frac{1}{p(x)} \geq 0$.
Example (Entropy)
Example 2.1.1:
$X = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1 - p \end{cases}$
$H(X) = -p \log p - (1 - p) \log(1 - p) =: H(p)$
[Figure: the binary entropy function H(p) versus p; it is 0 at p = 0 and p = 1, and reaches its maximum of 1 bit at p = 0.5.]
Example (Entropy)
Example 2.1.2:
$X = \begin{cases} a & \text{with probability } 1/2 \\ b & \text{with probability } 1/4 \\ c & \text{with probability } 1/8 \\ d & \text{with probability } 1/8 \end{cases}$
$H(X) = -\frac{1}{2} \log \frac{1}{2} - \frac{1}{4} \log \frac{1}{4} - \frac{1}{8} \log \frac{1}{8} - \frac{1}{8} \log \frac{1}{8} = \frac{7}{4}$ bits
Joint Entropy
Definition: The joint entropy $H(X, Y)$ of a pair of discrete random variables $(X, Y)$ with a joint distribution $p(x, y)$ is defined as
$H(X, Y) = -E[\log p(X, Y)] = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y)$
Example: p(x, y):
               y = 1    y = 2
     x = 1      1/2      1/4
     x = 2      1/8      1/8
Joint Entropy
$H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y)$
$= -\sum_{x \in \{1, 2\}} \sum_{y \in \{1, 2\}} p(x, y) \log p(x, y)$
$= -\sum_{y \in \{1, 2\}} p(1, y) \log p(1, y) - \sum_{y \in \{1, 2\}} p(2, y) \log p(2, y)$
$= -\frac{1}{2} \log \frac{1}{2} - \frac{1}{4} \log \frac{1}{4} - \frac{1}{8} \log \frac{1}{8} - \frac{1}{8} \log \frac{1}{8} = \frac{7}{4}$ bits
Joint Entropy
p(x, y):
               y = 1    y = 2
     x = 1      1/2      1/4
     x = 2      1/8      1/8
$p(1, 1) = 1/2$
$p(x \mid y) = \frac{p(x, y)}{p(y)}$
$p(x = 1 \mid y = 1) = \frac{p(x = 1, y = 1)}{p(y = 1)}$
Conditional Entropy
Definition: If $(X, Y) \sim p(x, y)$, then the conditional entropy $H(Y \mid X)$ is defined as
$H(Y \mid X) = \sum_{x \in \mathcal{X}} p(x)\, H(Y \mid X = x)$
$= -\sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y \mid x) \log p(y \mid x)$
$= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y \mid x)$
$= -E_{p(x, y)}[\log p(Y \mid X)]$
Conditional Entropy
$H(Y \mid X) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y \mid x)$
Example: p(x, y):
               y = 1    y = 2
     x = 1      1/4      1/4
     x = 2      1/4      1/4
$H(Y \mid X) = \sum_{i=1}^{2} p(X = i)\, H(Y \mid X = i) = p(X = 1)\, H(Y \mid X = 1) + p(X = 2)\, H(Y \mid X = 2)$
$p(X = 1) = p(X = 2) = 1/2$
$H(Y \mid X = 1) = H(1/2, 1/2) = 1$
$H(Y \mid X = 2) = H(1/2, 1/2) = 1$
$H(Y \mid X) = \frac{1}{2} \cdot 1 + \frac{1}{2} \cdot 1 = 1$ bit
Example: p(x, y):
               x = 1    x = 2    x = 3    x = 4
     y = 1      1/8      1/16     1/32     1/32
     y = 2      1/16     1/8      1/32     1/32
     y = 3      1/16     1/16     1/16     1/16
     y = 4      1/4      0        0        0
$H(X \mid Y) = \sum_{i=1}^{4} p(Y = i)\, H(X \mid Y = i)$
$H(Y \mid X) = -\sum_{x, y} p(x, y) \log p(y \mid x)$
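A short Python sketch (not from the slides) that evaluates the joint distribution in the table above and reports H(X), H(Y), H(X,Y), H(X|Y), and H(Y|X); the variable names are illustrative.

```python
# Sketch: entropies of the 4x4 joint distribution in the table above.
import math

joint = {  # p(x, y); columns x = 1..4, rows y = 1..4
    (1, 1): 1/8,  (2, 1): 1/16, (3, 1): 1/32, (4, 1): 1/32,
    (1, 2): 1/16, (2, 2): 1/8,  (3, 2): 1/32, (4, 2): 1/32,
    (1, 3): 1/16, (2, 3): 1/16, (3, 3): 1/16, (4, 3): 1/16,
    (1, 4): 1/4,  (2, 4): 0,    (3, 4): 0,    (4, 4): 0,
}

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

px = [sum(p for (x, _), p in joint.items() if x == i) for i in range(1, 5)]
py = [sum(p for (_, y), p in joint.items() if y == j) for j in range(1, 5)]
Hxy = H(joint.values())
print(H(px), H(py), Hxy)          # H(X) = 1.75, H(Y) = 2.0, H(X,Y) = 3.375 bits
print(Hxy - H(py), Hxy - H(px))   # H(X|Y) = 1.375, H(Y|X) = 1.625 bits
```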
Chain Rule
Theorem 2.2.1 (Chain rule):
$H(X, Y) = H(X) + H(Y \mid X)$
Proof:
$H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y)$
$= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log [p(x)\, p(y \mid x)]$
$= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x) - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y \mid x)$
$= -\sum_{x \in \mathcal{X}} p(x) \log p(x) - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y \mid x)$
$= H(X) + H(Y \mid X)$
$H(Y) - H(Y \mid X) = H(X) - H(X \mid Y)$, since
$H(X, Y) = H(X) + H(Y \mid X)$
$H(X, Y) = H(Y) + H(X \mid Y)$
Note that $H(Y \mid X) \neq H(X \mid Y)$ in general.
Relative Entropy
The relative entropy is a measure of the distance between two distributions.
If we knew the true distribution p of the random variable, we could construct a code with average description length H(p). If instead we used the code for a distribution q, we would need $H(p) + D(p \| q)$ bits on average.
Definition: The relative entropy or Kullback-Leibler distance between two probability mass functions $p(x)$ and $q(x)$ is defined as
$D(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = E_p\left[\log \frac{p(X)}{q(X)}\right]$
Relative Entropy
Example 2.3.1: Let $\mathcal{X} = \{0, 1\}$ and consider two distributions p and q on $\mathcal{X}$. Let $p(0) = 1 - r$, $p(1) = r$, and $q(0) = 1 - s$, $q(1) = s$.
$D(p \| q) = (1 - r) \log \frac{1 - r}{1 - s} + r \log \frac{r}{s}$
$D(q \| p) = (1 - s) \log \frac{1 - s}{1 - r} + s \log \frac{s}{r}$
If $r = s$, then $D(p \| q) = D(q \| p) = 0$.
If $r = 1/2$ and $s = 1/4$:
$D(p \| q) = \frac{1}{2} \log \frac{1/2}{3/4} + \frac{1}{2} \log \frac{1/2}{1/4} = 1 - \frac{1}{2} \log 3 = 0.2075$ bits
$D(q \| p) = \frac{3}{4} \log \frac{3/4}{1/2} + \frac{1}{4} \log \frac{1/4}{1/2} = \frac{3}{4} \log 3 - 1 = 0.1887$ bits
In general $D(p \| q) \neq D(q \| p)$.
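The asymmetry in Example 2.3.1 is easy to check numerically; the following sketch (illustrative code, not from the slides) evaluates D(p||q) and D(q||p) for r = 1/2 and s = 1/4.

```python
# Sketch: D(p||q) vs. D(q||p) for Example 2.3.1 with r = 1/2, s = 1/4.
import math

def kl(p, q):
    """D(p||q) = sum_x p(x) log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

r, s = 1/2, 1/4
p = [1 - r, r]
q = [1 - s, s]
print(kl(p, q), kl(q, p))  # ~0.2075 and ~0.1887 bits: relative entropy is not symmetric
```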
Mutual Information
Definition: Consider two random variables X and Y with a joint probability mass function $p(x, y)$ and marginal probability mass functions $p(x)$ and $p(y)$. The mutual information $I(X;Y)$ is the relative entropy between the joint distribution and the product distribution:
$I(X;Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$
$= D(p(x, y) \,\|\, p(x)\, p(y))$
$= E_{p(x, y)}\left[\log \frac{p(X, Y)}{p(X)\, p(Y)}\right]$
$I(X;Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$
$= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x \mid y)}{p(x)}$
$= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x) + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x \mid y)$
$= -\sum_{x \in \mathcal{X}} p(x) \log p(x) - \left(-\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x \mid y)\right)$
$= H(X) - H(X \mid Y)$
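The identity above can be verified numerically. The sketch below (illustrative, with the 2x2 joint distribution arranged as in the earlier table reconstruction) computes I(X;Y) both as D(p(x,y)||p(x)p(y)) and as H(X) - H(X|Y).

```python
# Sketch: I(X;Y) computed two ways for the 2x2 joint distribution above.
import math

joint = {(1, 1): 1/2, (1, 2): 1/4, (2, 1): 1/8, (2, 2): 1/8}   # p(x, y), as reconstructed

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

px = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (1, 2)}
py = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (1, 2)}

kl_form = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0)
h_form = H(px.values()) - (H(joint.values()) - H(py.values()))   # H(X) - H(X|Y)
print(kl_form, h_form)   # identical values (~0.016 bits)
```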
Diagram
[Figure: Venn diagram relating H(X), H(Y), H(X,Y), H(X|Y), H(Y|X), and I(X;Y); H(X,Y) = H(X|Y) + I(X;Y) + H(Y|X).]
Chain rule for entropy:
$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1)$
$H(X_1, X_2) = H(X_1) + H(X_2 \mid X_1)$
$H(X_1, X_2, X_3) = H(X_1) + H(X_2, X_3 \mid X_1) = H(X_1) + H(X_2 \mid X_1) + H(X_3 \mid X_1, X_2)$
[Figure: the joint entropy H(X_1, X_2, X_3) decomposed into H(X_1), H(X_2 | X_1), and H(X_3 | X_1, X_2).]
Conditional mutual information:
$I(X;Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z)$
Chain rule for mutual information:
$I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_{i-1}, X_{i-2}, \ldots, X_1)$
$I(X_1, X_2; Y) = I(X_1; Y) + I(X_2; Y \mid X_1)$
$I(X_1, X_2, X_3; Y) = I(X_1; Y) + I(X_2; Y \mid X_1) + I(X_3; Y \mid X_1, X_2)$
Conditional relative entropy:
$D(p(y|x) \,\|\, q(y|x)) = \sum_x p(x) \sum_y p(y|x) \log \frac{p(y|x)}{q(y|x)} = \sum_x \sum_y p(x, y) \log \frac{p(y|x)}{q(y|x)}$
Chain rule for relative entropy:
$D(p(x, y) \| q(x, y)) = D(p(x) \| q(x)) + D(p(y|x) \| q(y|x))$
Proof:
$D(p(x, y) \| q(x, y)) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{q(x, y)}$
$= \sum_x \sum_y p(x, y) \log \frac{p(x)\, p(y|x)}{q(x)\, q(y|x)}$
$= \sum_x \sum_y p(x, y) \log \frac{p(x)}{q(x)} + \sum_x \sum_y p(x, y) \log \frac{p(y|x)}{q(y|x)}$
$= D(p(x) \| q(x)) + D(p(y|x) \| q(y|x))$
Convex Functions
Definition: A function $f(x)$ is said to be convex over an interval $(a, b)$ if for every $x_1, x_2 \in (a, b)$ and $0 \leq \lambda \leq 1$,
$f(\lambda x_1 + (1-\lambda) x_2) \leq \lambda f(x_1) + (1-\lambda) f(x_2)$
e.g., $x^2$, $e^x$
[Figure: convexity of a function; the chord from $(x_1, f(x_1))$ to $(x_2, f(x_2))$ lies above the graph, so $f(\lambda x_1 + (1-\lambda) x_2) \leq \lambda f(x_1) + (1-\lambda) f(x_2)$.]
Jensen's Inequality
Theorem 2.6.2: If $f(x)$ is a convex function and $X$ is a random variable, then $E[f(X)] \geq f(E[X])$.
Proof by induction on the number of mass points:
1. Two mass points: $p_1 f(x_1) + p_2 f(x_2) \geq f(p_1 x_1 + p_2 x_2)$ for $p_1 + p_2 = 1$ (this is the definition of convexity).
2. Induction hypothesis: $\sum_{i=1}^{k-1} p_i f(x_i) \geq f\left(\sum_{i=1}^{k-1} p_i x_i\right)$ whenever $\sum_{i=1}^{k-1} p_i = 1$.
3. To prove: $\sum_{i=1}^{k} p_i f(x_i) \geq f\left(\sum_{i=1}^{k} p_i x_i\right)$ whenever $\sum_{i=1}^{k} p_i = 1$.
Jensen's Inequality
$\sum_{i=1}^{k} p_i f(x_i) = p_k f(x_k) + (1-p_k) \sum_{i=1}^{k-1} \frac{p_i}{1-p_k} f(x_i)$
$\geq p_k f(x_k) + (1-p_k) f\left(\sum_{i=1}^{k-1} \frac{p_i}{1-p_k} x_i\right)$  (induction hypothesis)
$\geq f\left(p_k x_k + (1-p_k) \sum_{i=1}^{k-1} \frac{p_i}{1-p_k} x_i\right)$  (convexity)
$= f\left(\sum_{i=1}^{k} p_i x_i\right)$
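A small numerical illustration of Jensen's inequality (an assumed example, not from the slides) for the convex function f(x) = x^2, for which E[f(X)] >= f(E[X]) is just E[X^2] >= (E[X])^2.

```python
# Sketch: Jensen's inequality for the convex function f(x) = x**2.
import random

random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(10)]   # mass points of X
ps = [1 / len(xs)] * len(xs)                      # uniform probabilities

mean = sum(p * x for p, x in zip(ps, xs))                  # E[X]
mean_square = sum(p * x * x for p, x in zip(ps, xs))       # E[f(X)] = E[X^2]
print(mean_square >= mean ** 2, mean_square, mean ** 2)    # True: E[X^2] >= (E[X])^2
```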
Information Inequality
Theorem 2.6.3 (Information inequality): Let $p(x)$, $q(x)$, $x \in \mathcal{X}$, be two probability mass functions. Then $D(p \| q) \geq 0$, with equality if and only if $p(x) = q(x)$ for all x.
Proof:
$-D(p \| q) = -\sum_x p(x) \log \frac{p(x)}{q(x)} = \sum_x p(x) \log \frac{q(x)}{p(x)}$
$\leq \log \sum_x p(x) \frac{q(x)}{p(x)}$  (by Jensen's inequality)
$= \log \sum_{x: p(x) > 0} q(x) \leq \log 1 = 0$
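The information inequality is also easy to probe numerically. The sketch below (illustrative; the random test distributions are assumptions) checks that D(p||q) is never negative and vanishes when p = q.

```python
# Sketch: D(p||q) >= 0 for randomly drawn pmfs, with equality when p = q.
import math
import random

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def random_pmf(n):
    w = [random.random() for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]

random.seed(0)
for _ in range(1000):
    p, q = random_pmf(8), random_pmf(8)
    assert kl(p, q) >= -1e-12          # never (meaningfully) negative
print(kl([0.3, 0.7], [0.3, 0.7]))      # 0.0 exactly when p = q
```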
$I(X;Y) = D(p(x, y) \,\|\, p(x)\, p(y)) \geq 0$
Corollary: $D(p(y|x) \| q(y|x)) \geq 0$
Corollary: $I(X;Y \mid Z) \geq 0$
Uniform Distribution
Theorem 2.6.4: $H(X) \leq \log |\mathcal{X}|$, where $|\mathcal{X}|$ denotes the number of elements in the range of X, with equality if and only if X has a uniform distribution over $\mathcal{X}$.
Proof: Let $u(x) = 1/|\mathcal{X}|$ be the uniform probability mass function over $\mathcal{X}$. Then
$D(p \| u) = \sum_x p(x) \log \frac{p(x)}{u(x)} = \sum_x p(x) \log p(x) - \sum_x p(x) \log u(x) = -H(X) + \log |\mathcal{X}| \geq 0$
Hence $H(X) \leq \log |\mathcal{X}|$.
Conditioning
Theorem 2.6.5 (Conditioning reduces entropy): $H(X \mid Y) \leq H(X)$, with equality if and only if X and Y are independent.
Proof: $0 \leq I(X;Y) = H(X) - H(X \mid Y)$.
Example: p(x, y):
               x = 1    x = 2
     y = 1      0        3/4
     y = 2      1/8      1/8
$H(X) = H(1/8, 7/8) = 0.544$ bits
$H(X \mid Y = 1) = 0$ bits, $H(X \mid Y = 2) = 1$ bit
$H(X \mid Y) = \frac{3}{4} H(X \mid Y = 1) + \frac{1}{4} H(X \mid Y = 2) = 0.25$ bits
So $H(X \mid Y) < H(X)$ on average, even though $H(X \mid Y = 2) > H(X)$.
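The following sketch (illustrative code) reproduces the example above: conditioning reduces entropy on average, even though the particular observation Y = 2 increases the uncertainty about X.

```python
# Sketch: the conditioning example above, using the reconstructed joint table.
import math

joint = {(1, 1): 0, (2, 1): 3/4, (1, 2): 1/8, (2, 2): 1/8}   # p(x, y)

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

px = [sum(p for (x, _), p in joint.items() if x == i) for i in (1, 2)]   # (1/8, 7/8)
py = [sum(p for (_, y), p in joint.items() if y == j) for j in (1, 2)]   # (3/4, 1/4)

print(H(px))                          # H(X)   ~ 0.544 bits
print(H(joint.values()) - H(py))      # H(X|Y) = 0.25 bits (less than H(X))
print(H([p / py[1] for (x, y), p in joint.items() if y == 2]))  # H(X|Y=2) = 1 bit (> H(X))
```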
Independence bound on entropy:
$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1) \leq \sum_{i=1}^{n} H(X_i)$
Log sum inequality: For nonnegative numbers $a_1, \ldots, a_n$ and $b_1, \ldots, b_n$,
$\sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \geq \left(\sum_{i=1}^{n} a_i\right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i}$
with equality if and only if $a_i / b_i$ is constant.
Proof: The function $f(t) = t \log t$ is convex, so by Jensen's inequality
$\sum_i \alpha_i f(t_i) \geq f\left(\sum_i \alpha_i t_i\right)$ for $\alpha_i \geq 0$, $\sum_i \alpha_i = 1$.
Setting $\alpha_i = b_i / \sum_{j=1}^{n} b_j$ and $t_i = a_i / b_i$, we obtain
$\sum_i \frac{a_i}{\sum_j b_j} \log \frac{a_i}{b_i} \geq \left(\sum_i \frac{a_i}{\sum_j b_j}\right) \log \left(\sum_i \frac{a_i}{\sum_j b_j}\right)$
which is the log sum inequality:
$\sum_i a_i \log \frac{a_i}{b_i} \geq \left(\sum_i a_i\right) \log \frac{\sum_i a_i}{\sum_i b_i}$
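A numerical spot-check of the log sum inequality (illustrative code; the random vectors are assumptions), including the equality case where a_i/b_i is constant.

```python
# Sketch: numerical spot-check of the log sum inequality.
import math
import random

def lhs(a, b):
    return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def rhs(a, b):
    return sum(a) * math.log2(sum(a) / sum(b))

random.seed(1)
for _ in range(1000):
    a = [random.uniform(0.1, 2.0) for _ in range(6)]
    b = [random.uniform(0.1, 2.0) for _ in range(6)]
    assert lhs(a, b) >= rhs(a, b) - 1e-12      # the inequality always holds

print(lhs([1, 2, 3], [2, 4, 6]) - rhs([1, 2, 3], [2, 4, 6]))  # ~0: equality when a_i/b_i is constant
```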
Convexity of relative entropy: $D(p \| q)$ is convex in the pair $(p, q)$, i.e.,
$D(\lambda p_1 + (1-\lambda) p_2 \,\|\, \lambda q_1 + (1-\lambda) q_2) \leq \lambda D(p_1 \| q_1) + (1-\lambda) D(p_2 \| q_2)$
for all $0 \leq \lambda \leq 1$.
Proof: Apply the log sum inequality with $n = 2$: for nonnegative $a_1, a_2, b_1, b_2$,
$(a_1 + a_2) \log \frac{a_1 + a_2}{b_1 + b_2} \leq a_1 \log \frac{a_1}{b_1} + a_2 \log \frac{a_2}{b_2}$
For each x, set
$a_1 = \lambda p_1(x)$, $a_2 = (1-\lambda) p_2(x)$, $b_1 = \lambda q_1(x)$, $b_2 = (1-\lambda) q_2(x)$
Then
$[\lambda p_1(x) + (1-\lambda) p_2(x)] \log \frac{\lambda p_1(x) + (1-\lambda) p_2(x)}{\lambda q_1(x) + (1-\lambda) q_2(x)} \leq \lambda p_1(x) \log \frac{\lambda p_1(x)}{\lambda q_1(x)} + (1-\lambda) p_2(x) \log \frac{(1-\lambda) p_2(x)}{(1-\lambda) q_2(x)}$
$= \lambda p_1(x) \log \frac{p_1(x)}{q_1(x)} + (1-\lambda) p_2(x) \log \frac{p_2(x)}{q_2(x)}$
Summing over all x,
$D(\lambda p_1 + (1-\lambda) p_2 \,\|\, \lambda q_1 + (1-\lambda) q_2) \leq \lambda D(p_1 \| q_1) + (1-\lambda) D(p_2 \| q_2)$
Concavity of Entropy
Theorem: $H(p)$ is a concave function of p.
Proof: $H(p) = \log |\mathcal{X}| - D(p \| u)$, where u is the uniform distribution over $\mathcal{X}$. Since $D(p \| u)$ is convex in p, it follows that $H(p)$ is concave.
Concavity/convexity of mutual information: write $I(X;Y) = H(Y) - H(Y \mid X)$.
For fixed p(y|x), I(X;Y) is a concave function of p(x):
The first term: with p(y|x) fixed, p(y) is a linear function of p(x), so H(Y), which is concave in p(y), is a concave function of p(x).
The second term, $H(Y \mid X) = \sum_x p(x) H(Y \mid X = x)$, is a linear function of p(x).
The difference is therefore a concave function of p(x).
For fixed p(x), I(X;Y) is a convex function of p(y|x): let the channel be the mixture $\lambda p_1(y|x) + (1-\lambda) p_2(y|x)$, and let Z be a binary random variable, independent of X, with $\Pr(Z = 1) = \lambda$ that selects which of the two channels is used. Then
$I(X; Y, Z) = I(X; Y \mid Z) + I(X; Z) = I(X; Y) + I(X; Z \mid Y)$
where $I(X; Z) = 0$ because X and Z are independent. Since $I(X; Z \mid Y) \geq 0$,
$I(X; Y) \leq I(X; Y \mid Z) = \lambda I(X; Y_1) + (1-\lambda) I(X; Y_2)$
where $I(X; Y_1)$ and $I(X; Y_2)$ denote the mutual informations under the two individual channels.
Markov Chain
Definition: Random variables X, Y, Z are said to form a Markov chain in that order (denoted $X \to Y \to Z$) if the conditional distribution of Z depends only on Y and is conditionally independent of X. Specifically, X, Y, and Z form a Markov chain $X \to Y \to Z$ if the joint probability mass function can be written as
$p(x, y, z) = p(x)\, p(y|x)\, p(z|y) = p(x, y)\, p(z|y)$
This implies that X and Z are conditionally independent given Y:
$p(x, z \mid y) = \frac{p(x, y, z)}{p(y)} = \frac{p(x, y)\, p(z|y)}{p(y)} = p(x \mid y)\, p(z \mid y)$
$X \to Y \to Z$ implies that $Z \to Y \to X$; thus the condition is sometimes written $X \leftrightarrow Y \leftrightarrow Z$.
If $Z = f(Y)$, then $X \to Y \to Z$.
Data processing inequality: if $X \to Y \to Z$, then $I(X;Y) \geq I(X;Z)$.
Proof: By the chain rule,
$I(X; Y, Z) = I(X; Z) + I(X; Y \mid Z) = I(X; Y) + I(X; Z \mid Y)$
Since X and Z are conditionally independent given Y, $I(X; Z \mid Y) = 0$, and since $I(X; Y \mid Z) \geq 0$,
$I(X; Y) \geq I(X; Z)$, and also $I(X; Y) \geq I(X; Y \mid Z)$.
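The data processing inequality can be illustrated with a cascade of two binary symmetric channels (an assumed example, not from these slides): X is a uniform bit, Y is X through a BSC, and Z is Y through a second BSC, so X → Y → Z.

```python
# Sketch: data processing inequality on a cascade of two binary symmetric channels.
import math

def h2(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mi_uniform_input(eps):
    """I(input; output) of a BSC with crossover eps when the input bit is uniform."""
    return 1.0 - h2(eps)   # H(output) = 1 bit for a uniform input, H(output|input) = H(eps)

eps_xy, eps_yz = 0.1, 0.2                                  # X -> Y and Y -> Z crossover probabilities
eps_xz = eps_xy * (1 - eps_yz) + (1 - eps_xy) * eps_yz     # effective crossover of X -> Z
print(mi_uniform_input(eps_xy), mi_uniform_input(eps_xz))  # ~0.531 >= ~0.173, so I(X;Y) >= I(X;Z)
```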
Fano's Inequality
Suppose we have random variables X and Y. Fano's inequality bounds the probability of error when we estimate X from Y.
Procedure:
X is sent, Y is observed.
Form an estimate of X: $\hat{X} = g(Y)$.
$P_e$: the probability of error, i.e., $P_e = \Pr(\hat{X} \neq X)$.
E: the error indicator random variable,
$E = \begin{cases} 1 & \text{if } \hat{X} \neq X \\ 0 & \text{if } \hat{X} = X \end{cases}$
Note that $P_e = P(E = 1)$.
Fano's Inequality
Theorem (Fano): For any estimator $\hat{X}$ such that $X \to Y \to \hat{X}$, with $P_e = \Pr(\hat{X} \neq X)$, we have
$H(P_e) + P_e \log |\mathcal{X}| \geq H(X \mid \hat{X}) \geq H(X \mid Y)$
The inequality can be weakened to
$1 + P_e \log |\mathcal{X}| \geq H(X \mid Y)$
or
$P_e \geq \frac{H(X \mid Y) - 1}{\log |\mathcal{X}|}$
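A numerical check of the weakened Fano bound (illustrative code; the symmetric observation channel is an assumed example): X is uniform over four symbols and Y equals X with probability 1 - eps.

```python
# Sketch: checking Pe >= (H(X|Y) - 1) / log|X| on a symmetric observation channel.
import math

n, eps = 4, 0.3
# p(x, y): X uniform over n symbols; Y = X with prob 1 - eps, otherwise uniform over the rest.
joint = {(x, y): (1 / n) * ((1 - eps) if x == y else eps / (n - 1))
         for x in range(n) for y in range(n)}

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

py = [sum(p for (x, y), p in joint.items() if y == j) for j in range(n)]
h_x_given_y = H(joint.values()) - H(py)

pe = eps                                       # the estimator X^ = Y errs with probability eps
print(pe, (h_x_given_y - 1) / math.log2(n))    # 0.3 >= ~0.178, consistent with Fano's bound
```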
Fano's Inequality
Proof: With the error indicator E defined above, expand $H(E, X \mid \hat{X})$ in two ways:
$H(E, X \mid \hat{X}) = H(X \mid \hat{X}) + H(E \mid X, \hat{X}) = H(E \mid \hat{X}) + H(X \mid E, \hat{X})$
$H(E \mid X, \hat{X}) = 0$ since E is a function of X and $\hat{X}$, and $H(E \mid \hat{X}) \leq H(E) = H(P_e)$.
$H(X \mid E, \hat{X}) = (1 - P_e)\, H(X \mid \hat{X}, E = 0) + P_e\, H(X \mid \hat{X}, E = 1) \leq (1 - P_e) \cdot 0 + P_e \log |\mathcal{X}|$
Combining these, $H(X \mid \hat{X}) \leq H(P_e) + P_e \log |\mathcal{X}|$; and since $X \to Y \to \hat{X}$, the data processing inequality gives $H(X \mid \hat{X}) \geq H(X \mid Y)$.
Fano's Inequality
Corollary: For any two random variables X and Y, let $p = \Pr(X \neq Y)$. Then $H(p) + p \log |\mathcal{X}| \geq H(X \mid Y)$.
Corollary: