
Information Theory

Chapters 1 and 2
School of ICE
Sungkyunkwan University

Chapter 1: Information Theory
Data Compression and Data Transmission: Channel and Source Coding

Information Theory
Information theory answers two fundamental questions in communication theory:
The ultimate data compression (entropy H)
The ultimate transmission rate of a communication channel (channel capacity C)

Data compression limit: $\min I(X; \hat{X})$
Data transmission limit: $\max I(X; Y)$

It also has many applications in physics, statistics, mathematics, computer science, and economics.

Entropy of a Random Variable

Entropy
The uncertainty of a random variable X
The amount of information that the random variable possesses (or carries)
Mathematically:
$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$

Example
A uniformly distributed r.v. with 32 outcomes:
$H(X) = -\sum_{i=1}^{32} p(i) \log p(i) = -\sum_{i=1}^{32} \frac{1}{32} \log \frac{1}{32} = \log 32 = 5$ bits

Note (unit of information)
$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)$ bits when the logarithm is base 2, or $-\sum_{x \in \mathcal{X}} p(x) \log_e p(x)$ nats when it is base e
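As a quick numerical check (not part of the original slides), here is a minimal Python sketch of the definition; it confirms that a uniform distribution over 32 outcomes has entropy log2(32) = 5 bits, with the nats value shown for comparison.

```python
import math

def entropy(pmf, base=2.0):
    """H(X) = -sum p(x) log p(x); outcomes with zero probability contribute nothing."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

uniform32 = [1 / 32] * 32
print(entropy(uniform32))            # 5.0 bits
print(entropy(uniform32, math.e))    # log(32) = 3.4657... nats
```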

Data Compression (Example)


Send a message to another person indicating the outcome.
There are 8 outcomes, with probabilities
$\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{16}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64}$

Case 1: 000, 001, 010, 011, 100, 101, 110, 111
Average length: 3 bits

Case 2: 0, 10, 110, 1110, 111100, 111101, 111110, 111111
Average length: 2 bits, the same as the actual entropy
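The claim that the Case 2 code matches the entropy can be checked directly; this sketch (my own, not from the slides) computes both the entropy and the expected codeword length for the eight-outcome distribution.

```python
import math

probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
case2 = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

entropy = -sum(p * math.log2(p) for p in probs)          # H(X)
avg_len = sum(p * len(c) for p, c in zip(probs, case2))  # expected codeword length

print(entropy, avg_len)   # both are exactly 2.0 bits
```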

Data Compression (Example)


Case 2: 0, 10, 110, 1110, 111100, 111101, 111110, 111111
Because this code is prefix-free, any concatenation of codewords decodes uniquely; e.g., 1101011100 parses as 110 | 10 | 1110 | 0.

Entropy rate
The entropy of a stochastic process

Data Compression: Coding Techniques


Data compression
Source coding
The ultimate compression is bounded by H(X)

Coding techniques
Optimal codes
Huffman codes and their optimality (a small sketch follows below)
Shannon-Fano-Elias codes
Arithmetic codes
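Huffman coding is only listed here; as an illustrative sketch (mine, not the course's reference implementation), the snippet below builds a Huffman code with Python's heapq for the eight-outcome distribution from the earlier example and recovers an average length of 2 bits.

```python
import heapq

def huffman_code(probs):
    """Return a prefix code (list of bit strings), one codeword per probability."""
    # Each heap entry: (subtree probability, unique tie-breaker, symbol indices in the subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    codes = [""] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p0, _, s0 = heapq.heappop(heap)   # the two least-probable subtrees
        p1, _, s1 = heapq.heappop(heap)
        for i in s0:
            codes[i] = "0" + codes[i]     # prepend a bit as the subtrees are merged upward
        for i in s1:
            codes[i] = "1" + codes[i]
        heapq.heappush(heap, (p0 + p1, tie, s0 + s1))
        tie += 1
    return codes

probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
codes = huffman_code(probs)
avg_len = sum(p * len(c) for p, c in zip(probs, codes))
print(codes, avg_len)   # average length 2.0 bits = H(X)
```

Because these probabilities are dyadic (negative powers of two), the resulting codeword lengths equal -log2 p(x), so the average length meets the entropy exactly; in general Huffman coding guarantees an average length within 1 bit of H(X).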

Lossy Data Compression


Practically
Lossless compression alone is often not enough in practice.
We can compress much more efficiently by allowing a certain amount of distortion (relative to the original source).
The more distortion we allow, the further the data can be compressed.

Rate distortion
D: the distortion allowed
R(D): the compression bound for the given distortion

Data Transmission
Fundamental question
Given a channel with a probabilistic relationship between the input X and the output Y, i.e., p(y|x):
How much data can we transmit through the channel without error?

Old myth
The error rate is unavoidably tied to the noise of the channel.

Shannon's breakthrough
Error-free communication is possible if we transmit at a rate below the channel capacity.

[Figure: X -> Channel p(y|x) -> Y]

Data Transmission
Noiseless binary channel
In each transmission we can send 1 bit reliably to the receiver, so the capacity is 1 bit.

Noisy channel
We cannot transmit without error all the information that the input can carry.

[Figure: a noiseless binary channel and a noisy channel, each mapping inputs {0, 1} to outputs {0, 1}]

Data Transmission
Mutual information
A measure of the dependence between two random variables:
$I(X;Y) = H(X) - H(X \mid Y) = \sum_{x, y} p(x,y) \log \frac{p(x,y)}{p(x) p(y)}$
It is the relative entropy between p(x,y) and p(x)p(y).

Relative entropy
A distance between two probability mass functions:
$D(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$

Data Transmission
Channel capacity
The maximum data transmission rate with vanishing error probability.
The maximum mutual information between the input random variable and the output random variable:
$C = \max_{p(x)} I(X; Y)$

Coding techniques
Hamming codes
The simplest channel codes; not capacity approaching.
Random codes
Introduced by Shannon; capacity approaching but impractical.
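The slides do not evaluate C for a concrete channel here; as a hedged numerical illustration (my own, using the standard binary symmetric channel model with crossover probability eps), the sketch below maximizes I(X;Y) over a grid of input distributions and compares the result with the closed form 1 - H(eps).

```python
import numpy as np

def binary_entropy(p):
    """H(p) in bits, with the convention 0*log(0) = 0."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def bsc_mutual_information(p, eps):
    """I(X;Y) in bits for input Pr(X=1) = p over a BSC with crossover probability eps."""
    q = p * (1 - eps) + (1 - p) * eps          # output distribution Pr(Y=1)
    return binary_entropy(q) - binary_entropy(eps)   # H(Y) - H(Y|X)

eps = 0.1
grid = np.linspace(0, 1, 1001)
capacity = max(bsc_mutual_information(p, eps) for p in grid)
print(capacity, 1 - binary_entropy(eps))   # both ~0.531 bits, achieved at p = 1/2
```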

Data Transmission
Gaussian channel
The most useful model for communication problems.
Its capacity is derived and then generalized to parallel Gaussian channels.
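The Gaussian-channel capacity is only announced at this point; as a forward-looking sketch under the standard assumptions (a discrete-time channel Y = X + Z with input power P and noise variance N), the well-known closed form is C = (1/2) log2(1 + P/N) bits per channel use, which the snippet below simply evaluates.

```python
import math

def awgn_capacity(snr):
    """C = 0.5 * log2(1 + P/N) bits per channel use for a discrete-time AWGN channel."""
    return 0.5 * math.log2(1 + snr)

print(awgn_capacity(1.0))    # 0.5 bits/use at an SNR of 0 dB
print(awgn_capacity(100.0))  # ~3.33 bits/use at an SNR of 20 dB
```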

Chapter 2
Entropy, Relative Entropy, and Mutual Information

Entropy
Let X be a discrete random variable with alphabet $\mathcal{X}$ and probability mass function $p(x) = \Pr\{X = x\}$, $x \in \mathcal{X}$.

The entropy $H(X)$ of the discrete random variable X is defined by
$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$

The expectation of the random variable $g(X)$ is written
$E_p[g(X)] = \sum_{x \in \mathcal{X}} g(x) p(x)$

The entropy of X is the expected value of $g(X) = \log \frac{1}{p(X)}$:
$H(X) = E_p\left[\log \frac{1}{p(X)}\right]$

Entropy

Example: $\mathcal{X} = \{1, 10, 20\}$
$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) = -p(1)\log p(1) - p(10)\log p(10) - p(20)\log p(20)$

Properties of Entropy

The entropy $H(X)$ of a discrete random variable X is defined by
$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)$

Lemma 2.1.1: $H(X) \ge 0$
Proof: $0 \le p(x) \le 1$ implies $\log \frac{1}{p(x)} \ge 0$.

Lemma 2.1.2: $H_b(X) = (\log_b a) H_a(X)$
Proof: $\log_b p = (\log_b a) \log_a p$.

Example (Entropy)
Example 2.1.1:
$X = \begin{cases} 1, & \text{with probability } p \\ 0, & \text{with probability } 1-p \end{cases}$
$H(X) = -p \log p - (1-p) \log (1-p) \equiv H(p)$

[Figure: the binary entropy function H(p) versus p, equal to 1 bit at p = 0.5 and 0 at p = 0 and p = 1]
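A small numerical sketch (not from the slides) of the binary entropy function H(p) pictured above: it confirms the maximum of 1 bit at p = 0.5 and the symmetry H(p) = H(1 - p).

```python
import math

def binary_entropy(p):
    """H(p) = -p log2 p - (1-p) log2 (1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

grid = [i / 100 for i in range(101)]
values = [binary_entropy(p) for p in grid]
print(max(values), grid[values.index(max(values))])   # 1.0 at p = 0.5
print(binary_entropy(0.1), binary_entropy(0.9))       # equal by symmetry, ~0.469
```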

Example (Entropy)

Example 2.1.2:
$X = \begin{cases} a, & \text{with probability } 1/2 \\ b, & \text{with probability } 1/4 \\ c, & \text{with probability } 1/8 \\ d, & \text{with probability } 1/8 \end{cases}$

$H(X) = -\frac{1}{2}\log\frac{1}{2} - \frac{1}{4}\log\frac{1}{4} - \frac{1}{8}\log\frac{1}{8} - \frac{1}{8}\log\frac{1}{8} = \frac{7}{4}$ bits

Joint Entropy
Definition: The joint entropy $H(X,Y)$ of a pair of discrete random variables (X, Y) with a joint distribution p(x,y) is defined as
$H(X,Y) = -E[\log p(X,Y)] = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(x,y)$

Example: the joint pmf p(x,y)

p(x,y)   x=1   x=2
y=1      1/2   1/4
y=2      1/8   1/8

Joint Entropy
$H(X,Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(x,y)$
$= -\sum_{x \in \{1,2\}} \sum_{y \in \{1,2\}} p(x,y) \log p(x,y)$
$= -p(1,1)\log p(1,1) - p(2,1)\log p(2,1) - p(1,2)\log p(1,2) - p(2,2)\log p(2,2)$
$= -\frac{1}{2}\log\frac{1}{2} - \frac{1}{4}\log\frac{1}{4} - \frac{1}{8}\log\frac{1}{8} - \frac{1}{8}\log\frac{1}{8} = \frac{7}{4}$ bits

Example: the same joint pmf p(x,y)

p(x,y)   x=1   x=2
y=1      1/2   1/4
y=2      1/8   1/8
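To double-check the 7/4-bit result, this sketch (mine, not from the slides) computes H(X,Y) directly from the 2x2 table, together with the marginal entropies.

```python
import numpy as np

# Joint pmf from the slide: rows are y = 1, 2 and columns are x = 1, 2.
p_xy = np.array([[1/2, 1/4],
                 [1/8, 1/8]])

def entropy(p):
    """Entropy in bits of an array of probabilities (zeros ignored)."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

print(entropy(p_xy))              # H(X,Y) = 1.75 bits
print(entropy(p_xy.sum(axis=0)))  # H(X) from the column sums
print(entropy(p_xy.sum(axis=1)))  # H(Y) from the row sums
```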

Joint Entropy
From the joint pmf table above, $p(1,1) = p(x=1, y=1) = \frac{1}{2}$.

The conditional pmf is obtained from the joint and marginal pmfs:
$p(x \mid y) = \frac{p(x,y)}{p(y)}$, e.g., $p(x=1 \mid y=1) = \frac{p(x=1, y=1)}{p(y=1)} = \frac{1/2}{3/4} = \frac{2}{3}$

Conditional Entropy
Definition: If $(X,Y) \sim p(x,y)$, then the conditional entropy $H(Y \mid X)$ is defined as
$H(Y \mid X) = \sum_{x \in \mathcal{X}} p(x) H(Y \mid X = x)$
$= -\sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y \mid x) \log p(y \mid x)$
$= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(y \mid x)$
$= -E_{p(x,y)}[\log p(Y \mid X)]$

Conditional Entropy
$H(Y \mid X) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(y \mid x)$
$= -p(1,1)\log p(1 \mid 1) - p(1,2)\log p(2 \mid 1) - p(2,1)\log p(1 \mid 2) - p(2,2)\log p(2 \mid 2)$
$= -4 \cdot \frac{1}{4} \log \frac{1}{2} = 1$ bit

Example: p(x,y)   y=1   y=2
         x=1      1/4   1/4
         x=2      1/4   1/4

Example: Conditional Entropy


$H(Y \mid X) = \sum_{x \in \mathcal{X}} p(x) H(Y \mid X = x)$

Example (the same uniform table as on the previous slide):
$H(Y \mid X) = \sum_{i=1}^{2} p(X = i) H(Y \mid X = i)$
$= p(X=1) H(Y \mid X=1) + p(X=2) H(Y \mid X=2)$
where $p(X=1) = p(X=2) = \frac{1}{2}$,
$H(Y \mid X=1) = H(\tfrac{1}{2}, \tfrac{1}{2}) = 1$ and $H(Y \mid X=2) = H(\tfrac{1}{2}, \tfrac{1}{2}) = 1$, so
$H(Y \mid X) = \frac{1}{2} \cdot 1 + \frac{1}{2} \cdot 1 = 1$ bit

Example: Conditional Entropy

Example: the joint pmf p(x,y)

p(x,y)   x=1    x=2    x=3    x=4
y=1      1/8    1/16   1/32   1/32
y=2      1/16   1/8    1/32   1/32
y=3      1/16   1/16   1/16   1/16
y=4      1/4    0      0      0

$H(X \mid Y) = \sum_{i=1}^{4} p(Y = i) H(X \mid Y = i) = -\sum_{x,y} p(x,y) \log p(x \mid y) = \frac{11}{8}$ bits
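The conditional entropies for the 4x4 table can be verified numerically; the sketch below (mine) uses the chain-rule identities H(X|Y) = H(X,Y) - H(Y) and H(Y|X) = H(X,Y) - H(X), which are introduced on the next slide.

```python
import numpy as np

# Joint pmf p(x, y): rows are y = 1..4, columns are x = 1..4 (from the slide).
p_xy = np.array([[1/8,  1/16, 1/32, 1/32],
                 [1/16, 1/8,  1/32, 1/32],
                 [1/16, 1/16, 1/16, 1/16],
                 [1/4,  0,    0,    0   ]])

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

H_xy = entropy(p_xy)                 # joint entropy H(X,Y)
H_x = entropy(p_xy.sum(axis=0))      # marginal entropy H(X)
H_y = entropy(p_xy.sum(axis=1))      # marginal entropy H(Y)

print(H_xy - H_y)   # H(X|Y) = H(X,Y) - H(Y) = 11/8 = 1.375 bits
print(H_xy - H_x)   # H(Y|X) = H(X,Y) - H(X) = 13/8 = 1.625 bits
```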

Chain Rule
Theorem 2.2.1 (Chain rule):
$H(X,Y) = H(X) + H(Y \mid X)$
Proof:
$H(X,Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(x,y)$
$= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log [p(x) p(y \mid x)]$
$= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(x) - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(y \mid x)$
$= -\sum_{x \in \mathcal{X}} p(x) \log p(x) - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(y \mid x)$
$= H(X) + H(Y \mid X)$

Remark: Chain Rule

$H(X,Y) = H(X) + H(Y \mid X)$
$H(X,Y) = H(Y) + H(X \mid Y)$
Note that $H(Y \mid X) \neq H(X \mid Y)$ in general, but
$H(Y) - H(Y \mid X) = H(X) - H(X \mid Y)$

Relative Entropy
The relative entropy is a measure of the distance between two distributions.
If we knew the true distribution p of the random variable, we could construct a code with average description length H(p). If we instead use the code for a distribution q, we would need H(p) + D(p||q) bits on average.
Definition: The relative entropy or Kullback-Leibler distance between two probability mass functions p(x) and q(x) is defined as
$D(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} = E_p\left[\log \frac{p(X)}{q(X)}\right]$

Relative Entropy
Example 2.3.1: Let $\mathcal{X} = \{0, 1\}$ and consider two distributions p and q on $\mathcal{X}$. Let $p(0) = 1-r$, $p(1) = r$, and $q(0) = 1-s$, $q(1) = s$.

$D(p \| q) = (1-r) \log \frac{1-r}{1-s} + r \log \frac{r}{s}$
$D(q \| p) = (1-s) \log \frac{1-s}{1-r} + s \log \frac{s}{r}$

If $r = s$, then $D(p \| q) = D(q \| p) = 0$.
If $r = 1/2$ and $s = 1/4$:
$D(p \| q) = \frac{1}{2} \log \frac{1/2}{3/4} + \frac{1}{2} \log \frac{1/2}{1/4} = 1 - \frac{1}{2} \log 3 = 0.2075$ bits
$D(q \| p) = \frac{3}{4} \log \frac{3/4}{1/2} + \frac{1}{4} \log \frac{1/4}{1/2} = \frac{3}{4} \log 3 - 1 = 0.1887$ bits

Mutual Information
Definition: Consider two random variables X and Y with a joint probability mass function p(x,y) and marginal probability mass functions p(x) and p(y). The mutual information I(X;Y) is the relative entropy between the joint distribution and the product distribution:
$I(X;Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x) p(y)}$
$= D(p(x,y) \| p(x) p(y))$
$= E_{p(x,y)}\left[\log \frac{p(X,Y)}{p(X) p(Y)}\right]$

Relationship between Entropy and Mutual Info.

$I(X;Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x) p(y)}$
$= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(x \mid y)}{p(x)}$
$= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(x) + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(x \mid y)$
$= -\sum_{x \in \mathcal{X}} p(x) \log p(x) + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(x \mid y)$
$= H(X) - H(X \mid Y)$

Entropy and Mutual Information


$I(X;Y) = H(Y) - H(Y \mid X)$
$H(X,Y) = H(X) + H(Y \mid X)$
$I(X;Y) = H(X) + H(Y) - H(X,Y)$

[Diagram: Venn diagram of H(X,Y), with circles H(X) and H(Y); the non-overlapping parts are H(X|Y) and H(Y|X), and the overlap is I(X;Y)]
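To tie these identities together numerically, the sketch below (mine) reuses the 4x4 joint pmf from the conditional-entropy example and checks that the direct definition of I(X;Y) agrees with H(X) + H(Y) - H(X,Y) and with H(X) - H(X|Y).

```python
import numpy as np

p_xy = np.array([[1/8,  1/16, 1/32, 1/32],
                 [1/16, 1/8,  1/32, 1/32],
                 [1/16, 1/16, 1/16, 1/16],
                 [1/4,  0,    0,    0   ]])   # rows y, columns x

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

H_xy = entropy(p_xy)
H_x = entropy(p_xy.sum(axis=0))
H_y = entropy(p_xy.sum(axis=1))

# Direct definition: I(X;Y) = sum p(x,y) log [ p(x,y) / (p(x) p(y)) ]
px_py = np.outer(p_xy.sum(axis=1), p_xy.sum(axis=0))
mask = p_xy > 0
I_direct = float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / px_py[mask])))

print(I_direct)                # 3/8 = 0.375 bits
print(H_x + H_y - H_xy)        # same value via I = H(X) + H(Y) - H(X,Y)
print(H_x - (H_xy - H_y))      # same value via I = H(X) - H(X|Y)
```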

Chain Rule for Entropy


Theorem 2.5.1 (Chain rule for entropy): Let $X_1, X_2, \ldots, X_n$ be drawn according to $p(x_1, x_2, \ldots, x_n)$. Then
$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1)$

$H(X_1, X_2) = H(X_1) + H(X_2 \mid X_1)$
$H(X_1, X_2, X_3) = H(X_1) + H(X_2, X_3 \mid X_1)$
$= H(X_1) + H(X_2 \mid X_1) + H(X_3 \mid X_1, X_2)$

Chain Rule for Entropy


$H(X_1, X_2, X_3) = H(X_1) + H(X_2, X_3 \mid X_1)$
$= H(X_1) + H(X_2 \mid X_1) + H(X_3 \mid X_1, X_2)$

[Diagram: H(X_1, X_2, X_3) decomposed into the pieces H(X_1), H(X_2 | X_1), and H(X_3 | X_1, X_2)]

Conditional Mutual Information


Definition: The conditional mutual information of random variables X and Y given Z is defined by
$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z)$

Theorem 2.5.2 (Chain rule for mutual information):
$I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_{i-1}, X_{i-2}, \ldots, X_1)$

$I(X_1, X_2; Y) = I(X_1; Y) + I(X_2; Y \mid X_1)$
$I(X_1, X_2, X_3; Y) = I(X_1; Y) + I(X_2; Y \mid X_1) + I(X_3; Y \mid X_1, X_2)$

Conditional Mutual Information


$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z)$

Conditional Relative Entropy


Definition: The conditional relative entropy is the average of the relative entropies between the conditional probability mass functions p(y|x) and q(y|x), averaged over the pmf p(x):
$D(p(y \mid x) \| q(y \mid x)) = \sum_{x} p(x) \sum_{y} p(y \mid x) \log \frac{p(y \mid x)}{q(y \mid x)}$
$= \sum_{x} \sum_{y} p(x,y) \log \frac{p(y \mid x)}{q(y \mid x)}$

Chain Rule for Relative Entropy


Theorem 2.5.3:
$D(p(x,y) \| q(x,y)) = D(p(x) \| q(x)) + D(p(y \mid x) \| q(y \mid x))$

Proof:
$D(p(x,y) \| q(x,y)) = \sum_{x} \sum_{y} p(x,y) \log \frac{p(x,y)}{q(x,y)}$
$= \sum_{x} \sum_{y} p(x,y) \log \frac{p(x) p(y \mid x)}{q(x) q(y \mid x)}$
$= \sum_{x} \sum_{y} p(x,y) \log \frac{p(x)}{q(x)} + \sum_{x} \sum_{y} p(x,y) \log \frac{p(y \mid x)}{q(y \mid x)}$
$= D(p(x) \| q(x)) + D(p(y \mid x) \| q(y \mid x))$

Convex Functions
Definition: A function f(x) is said to be convex over an interval (a, b) if for every $x_1, x_2 \in (a, b)$ and $0 \le \lambda \le 1$,
$f(\lambda x_1 + (1-\lambda) x_2) \le \lambda f(x_1) + (1-\lambda) f(x_2)$

A function f is said to be strictly convex if equality holds only when $\lambda = 0$ or $\lambda = 1$.
Definition: A function f(x) is concave if $-f(x)$ is convex.
Theorem 2.6.1: If the function f(x) has a second derivative which is non-negative (positive) everywhere, then the function is convex (strictly convex).

Convexity of a function
e.g., $x^2$, $e^x$, etc. are convex.

[Figure: a convex function; the chord joining $(x_1, f(x_1))$ and $(x_2, f(x_2))$ lies above the curve at $\lambda x_1 + (1-\lambda) x_2$, illustrating $f(\lambda x_1 + (1-\lambda) x_2) \le \lambda f(x_1) + (1-\lambda) f(x_2)$]

Theorem 2.6.1 is proved directly using the Taylor series expansion
$f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x^*)}{2}(x - x_0)^2$

Jensen's Inequality
Theorem 2.6.2: If f(x) is a convex function and X is a random variable, then $E[f(X)] \ge f(E[X])$.
Proof by induction on the number of mass points:
1. Base case: $p_1 f(x_1) + p_2 f(x_2) \ge f(p_1 x_1 + p_2 x_2)$, with $p_1 + p_2 = 1$
2. Hypothesis: $\sum_{i=1}^{k-1} p_i f(x_i) \ge f\left(\sum_{i=1}^{k-1} p_i x_i\right)$, with $\sum_{i=1}^{k-1} p_i = 1$
3. Inductive step: show $\sum_{i=1}^{k} p_i f(x_i) \ge f\left(\sum_{i=1}^{k} p_i x_i\right)$, with $\sum_{i=1}^{k} p_i = 1$

Jensen's Inequality

Write $p_i' = p_i / (1 - p_k)$ for $i = 1, \ldots, k-1$. Then
$\sum_{i=1}^{k} p_i f(x_i) = p_k f(x_k) + (1 - p_k) \sum_{i=1}^{k-1} p_i' f(x_i)$
$\ge p_k f(x_k) + (1 - p_k) f\left(\sum_{i=1}^{k-1} p_i' x_i\right)$   (the induction hypothesis)
$\ge f\left(p_k x_k + (1 - p_k) \sum_{i=1}^{k-1} p_i' x_i\right)$   (the definition of convexity)
$= f\left(\sum_{i=1}^{k} p_i x_i\right)$

Information Inequality
Theorem 2.6.3 (Information inequality): Let $p(x)$, $q(x)$, $x \in \mathcal{X}$, be two probability mass functions. Then $D(p \| q) \ge 0$, with equality if and only if $p(x) = q(x)$ for all x.

Proof:
$-D(p \| q) = -\sum_{x} p(x) \log \frac{p(x)}{q(x)} = \sum_{x} p(x) \log \frac{q(x)}{p(x)}$
$\le \log \sum_{x} p(x) \frac{q(x)}{p(x)}$   (by Jensen's inequality)
$= \log \sum_{x} q(x) \le \log 1 = 0$

Non-negativity of Mutual Information


Corollary (Non-negativity of mutual information):
For any two random variables X, Y, $I(X;Y) \ge 0$, with equality if and only if X and Y are independent.
$I(X;Y) = D(p(x,y) \| p(x) p(y)) \ge 0$

Corollary: $D(p(y \mid x) \| q(y \mid x)) \ge 0$
Corollary: $I(X; Y \mid Z) \ge 0$

Uniform Distribution
Theorem 2.6.4: $H(X) \le \log |\mathcal{X}|$, where $|\mathcal{X}|$ denotes the number of elements in the range of X, with equality if and only if X has a uniform distribution over $\mathcal{X}$.
Proof: Let $u(x) = \frac{1}{|\mathcal{X}|}$ be the uniform probability mass function over $\mathcal{X}$, and let $p(x)$ be the probability mass function for X. Then
$D(p \| u) = \sum_{x} p(x) \log \frac{p(x)}{u(x)} = \sum_{x} p(x) \log p(x) - \sum_{x} p(x) \log u(x) = -H(X) + \log |\mathcal{X}|$
By the non-negativity of relative entropy,
$\log |\mathcal{X}| - H(X) = D(p \| u) \ge 0$, hence $H(X) \le \log |\mathcal{X}|$.

Conditioning
Theorem 2.6.5 (Conditioning reduces entropy):
$H(X \mid Y) \le H(X)$, with equality if and only if X and Y are independent.
Proof: $0 \le I(X;Y) = H(X) - H(X \mid Y)$

Example: the joint pmf p(x,y)

p(x,y)   x=1   x=2
y=1      0     3/4
y=2      1/8   1/8

$H(X) = H\left(\frac{1}{8}, \frac{7}{8}\right) = 0.544$ bits
$H(X \mid Y=1) = 0$ bits, $H(X \mid Y=2) = 1$ bit
$H(X \mid Y) = \frac{3}{4} H(X \mid Y=1) + \frac{1}{4} H(X \mid Y=2) = 0.25$ bits
Note that $H(X \mid Y=2) > H(X)$; conditioning reduces entropy only on average.
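The numbers in this example are easy to reproduce; the sketch below (mine, not from the slides) computes H(X), H(X|Y=y), and H(X|Y) from the table and shows that conditioning reduces entropy on average even though H(X|Y=2) > H(X).

```python
import numpy as np

# Joint pmf p(x, y): rows y = 1, 2 and columns x = 1, 2 (from the example above).
p_xy = np.array([[0.0, 3/4],
                 [1/8, 1/8]])

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

p_y = p_xy.sum(axis=1)                            # marginal of Y: (3/4, 1/4)
H_x = entropy(p_xy.sum(axis=0))                   # H(X) = H(1/8, 7/8) ~ 0.544 bits
H_x_given_y = [entropy(row / row.sum()) for row in p_xy]

print(H_x)                                        # ~0.544
print(H_x_given_y)                                # [0.0, 1.0]
print(sum(py * h for py, h in zip(p_y, H_x_given_y)))   # H(X|Y) = 0.25 bits
```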

Independence Bound and Entropy


Theorem 2.6.6 (Independence bound on entropy):
Let $X_1, X_2, \ldots, X_n$ be drawn according to $p(x_1, x_2, \ldots, x_n)$. Then
$H(X_1, X_2, \ldots, X_n) \le \sum_{i=1}^{n} H(X_i)$
with equality if and only if the $X_i$ are independent.

Proof: By the chain rule for entropies,
$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1) \le \sum_{i=1}^{n} H(X_i)$
since conditioning reduces entropy.

Log Sum Inequality


Theorem 2.7.1 (Log sum inequality): For non-negative numbers $a_1, a_2, \ldots, a_n$ and $b_1, b_2, \ldots, b_n$,
$\sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \ge \left(\sum_{i=1}^{n} a_i\right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i}$
with equality if and only if $a_i / b_i$ is constant.

Log Sum Inequality


Proof
$f(t) = t \log t$ is strictly convex, since $f''(t) = \frac{1}{t} \log e > 0$ for all positive t. Hence, by Jensen's inequality,
$\sum_i \alpha_i f(t_i) \ge f\left(\sum_i \alpha_i t_i\right)$ for $\alpha_i \ge 0$, $\sum_i \alpha_i = 1$.
Setting $\alpha_i = \frac{b_i}{\sum_{j=1}^{n} b_j}$ and $t_i = \frac{a_i}{b_i}$, we obtain
$\sum_i \frac{b_i}{\sum_j b_j} \frac{a_i}{b_i} \log \frac{a_i}{b_i} \ge \left(\sum_i \frac{b_i}{\sum_j b_j} \frac{a_i}{b_i}\right) \log \left(\sum_i \frac{b_i}{\sum_j b_j} \frac{a_i}{b_i}\right)$
i.e.,
$\sum_i \frac{a_i}{\sum_j b_j} \log \frac{a_i}{b_i} \ge \frac{\sum_i a_i}{\sum_j b_j} \log \frac{\sum_i a_i}{\sum_j b_j}$
and multiplying both sides by $\sum_j b_j$ gives the log sum inequality.

Convexity and Relative Entropy


Theorem 2.7.2: $D(p \| q)$ is convex in the pair (p, q); i.e., if $(p_1, q_1)$ and $(p_2, q_2)$ are two pairs of probability mass functions, then
$D(\lambda p_1 + (1-\lambda) p_2 \| \lambda q_1 + (1-\lambda) q_2) \le \lambda D(p_1 \| q_1) + (1-\lambda) D(p_2 \| q_2)$
for all $0 \le \lambda \le 1$.

Convexity and Relative Entropy


Proof
By the log sum inequality (with n = 2),
$a_1 \log \frac{a_1}{b_1} + a_2 \log \frac{a_2}{b_2} \ge (a_1 + a_2) \log \frac{a_1 + a_2}{b_1 + b_2}$
Set $a_1 = \lambda p_1(x)$, $a_2 = (1-\lambda) p_2(x)$, $b_1 = \lambda q_1(x)$, and $b_2 = (1-\lambda) q_2(x)$. Then
$(\lambda p_1(x) + (1-\lambda) p_2(x)) \log \frac{\lambda p_1(x) + (1-\lambda) p_2(x)}{\lambda q_1(x) + (1-\lambda) q_2(x)}$
$\le \lambda p_1(x) \log \frac{\lambda p_1(x)}{\lambda q_1(x)} + (1-\lambda) p_2(x) \log \frac{(1-\lambda) p_2(x)}{(1-\lambda) q_2(x)}$
$= \lambda p_1(x) \log \frac{p_1(x)}{q_1(x)} + (1-\lambda) p_2(x) \log \frac{p_2(x)}{q_2(x)}$

Convexity and Relative Entropy


Proof (cont'd)
Summing the inequality over all x,
$\sum_{x} (\lambda p_1(x) + (1-\lambda) p_2(x)) \log \frac{\lambda p_1(x) + (1-\lambda) p_2(x)}{\lambda q_1(x) + (1-\lambda) q_2(x)}$
$\le \sum_{x} \left[\lambda p_1(x) \log \frac{p_1(x)}{q_1(x)} + (1-\lambda) p_2(x) \log \frac{p_2(x)}{q_2(x)}\right]$
i.e.,
$D(\lambda p_1 + (1-\lambda) p_2 \| \lambda q_1 + (1-\lambda) q_2) \le \lambda D(p_1 \| q_1) + (1-\lambda) D(p_2 \| q_2)$

Concavity of Entropy

Theorem 2.7.3 (Concavity of entropy): H(p) is a concave function of p; that is,
$H(\lambda p_1 + (1-\lambda) p_2) \ge \lambda H(p_1) + (1-\lambda) H(p_2)$.

Proof:
$H(p) = \log |\mathcal{X}| - D(p \| u)$
where u is the uniform distribution on $|\mathcal{X}|$ outcomes. The concavity of H then follows directly from the convexity of D.

Mutual Information and Input Distr.


Theorem 2.7.4 (Part 1): Let $(X,Y) \sim p(x,y) = p(x) p(y \mid x)$. The mutual information I(X;Y) is a concave function of p(x) for fixed p(y|x).
Proof:
$I(X;Y) = H(Y) - H(Y \mid X) = H(Y) - \sum_{x} p(x) H(Y \mid X = x)$
First term: with p(y|x) fixed, p(y) is a linear function of p(x), so H(Y), which is concave in p(y), is a concave function of p(x).
Second term: a linear function of p(x).
Their difference is therefore a concave function of p(x).

Note: this result is closely related to the channel coding theorem.

Mutual Information and Input Distr.


Theorem 2.7.4 (Part 2): I(X;Y) is a convex function of p(y|x) for fixed p(x).
Proof:
Consider a RV X and two channels p1(y|x) and p2(y|x). Let one channel be chosen at random according to a binary RV Z, with $\Pr\{Z = 1\} = \lambda$, independent of X.
$I(X; Y, Z) = I(X; Y \mid Z) + I(X; Z) = I(X; Y) + I(X; Z \mid Y)$
Since Z is independent of X, $I(X; Z) = 0$, and $I(X; Z \mid Y) \ge 0$, so
$I(X; Y) \le I(X; Y \mid Z) = \lambda I(X; Y_1) + (1-\lambda) I(X; Y_2)$
where Y is the output of the mixture channel $\lambda p_1(y \mid x) + (1-\lambda) p_2(y \mid x)$.

Markov Chain
Definition: Random variables X, Y, Z are said to form a Markov chain in that order (denoted $X \to Y \to Z$) if the conditional distribution of Z depends only on Y and is conditionally independent of X. Specifically, X, Y, and Z form a Markov chain $X \to Y \to Z$ if the joint probability can be written as
$p(x, y, z) = p(x) p(y \mid x) p(z \mid y) = p(x, y) p(z \mid y)$.

$p(x, z \mid y) = \frac{p(x, y, z)}{p(y)} = \frac{p(x, y) p(z \mid y)}{p(y)} = p(x \mid y) p(z \mid y)$

$X \to Y \to Z$ implies that $Z \to Y \to X$. Thus the condition is sometimes written $X \leftrightarrow Y \leftrightarrow Z$.
If $Z = f(Y)$, then $X \to Y \to Z$.

Data Processing Inequality


Theorem 2.8.1 (Data processing inequality): If $X \to Y \to Z$, then $I(X;Y) \ge I(X;Z)$.

Proof: By the chain rule,
$I(X; Y, Z) = I(X; Z) + I(X; Y \mid Z) = I(X; Y) + I(X; Z \mid Y)$
Since $X \to Y \to Z$, we have $I(X; Z \mid Y) = 0$, and $I(X; Y \mid Z) \ge 0$. Hence
$I(X; Y) \ge I(X; Z)$, and also $I(X; Y) \ge I(X; Y \mid Z)$.
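As a concrete illustration (my own, with hypothetical channel parameters), the sketch below passes a uniform bit X through two cascaded binary symmetric channels so that X -> Y -> Z, and checks that I(X;Y) >= I(X;Z).

```python
import numpy as np

def mutual_information(p_joint):
    """I in bits from a joint pmf indexed [first variable, second variable]."""
    p1 = p_joint.sum(axis=1, keepdims=True)
    p2 = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float(np.sum(p_joint[mask] * np.log2(p_joint[mask] / (p1 * p2)[mask])))

def bsc(eps):
    """Transition matrix p(output | input) of a binary symmetric channel."""
    return np.array([[1 - eps, eps],
                     [eps, 1 - eps]])

p_x = np.array([0.5, 0.5])                 # uniform input bit
W1, W2 = bsc(0.1), bsc(0.2)                # X -> Y and Y -> Z

p_xy = p_x[:, None] * W1                   # p(x, y) = p(x) p(y|x)
p_xz = p_x[:, None] * (W1 @ W2)            # p(x, z): Z depends on X only through Y

print(mutual_information(p_xy))            # I(X;Y) ~ 0.531 bits
print(mutual_information(p_xz))            # I(X;Z) ~ 0.173 bits, smaller as guaranteed
```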

Fano's Inequality
Suppose we have RVs X and Y. Fano's inequality bounds the error probability when we estimate X from Y.
Procedure
X is sent, Y is observed.
Generate an estimator of X: $\hat{X} = g(Y)$.
$P_e$: the probability of error, i.e., $P_e = \Pr(\hat{X} \neq X)$
E: the error indicator RV
$E = \begin{cases} 1, & \text{if } \hat{X} \neq X \\ 0, & \text{if } \hat{X} = X \end{cases}$
Note that $P_e = P(E = 1)$.

Fano's Inequality
Theorem (Fano): For any estimator $\hat{X}$ such that $X \to Y \to \hat{X}$, with $P_e = \Pr(\hat{X} \neq X)$, we have
$H(P_e) + P_e \log |\mathcal{X}| \ge H(X \mid \hat{X}) \ge H(X \mid Y)$
The inequality can be weakened to
$1 + P_e \log |\mathcal{X}| \ge H(X \mid Y)$
or
$P_e \ge \frac{H(X \mid Y) - 1}{\log |\mathcal{X}|}$

[Photo: Robert Fano]
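As a small numerical sketch (mine, with illustrative numbers), the weakened form of Fano's inequality gives an immediate lower bound on the error probability from H(X|Y) and the alphabet size.

```python
import math

def fano_lower_bound(h_x_given_y, alphabet_size):
    """Weakened Fano bound: Pe >= (H(X|Y) - 1) / log2 |X|, clamped at 0."""
    return max(0.0, (h_x_given_y - 1.0) / math.log2(alphabet_size))

# Illustrative numbers: if H(X|Y) = 3 bits and |X| = 32, any estimator g(Y) errs
# with probability at least (3 - 1) / log2(32) = 0.4.
print(fano_lower_bound(3.0, 32))   # 0.4
print(fano_lower_bound(0.5, 32))   # 0.0 -- the bound is vacuous when H(X|Y) <= 1
```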

Fano's Inequality
Proof

Fano's Inequality
Corollary: For any two RVs X and Y, let $p = \Pr(X \neq Y)$. Then
$H(p) + p \log |\mathcal{X}| \ge H(X \mid Y)$

Corollary:
