
Information Theory

Chapters 1 and 2
School of ICE
Sungkyunkwan University

Chapter 1: Information Theory
Data Compression and Data Transmission: Channel and Source Coding

Information Theory
Information theory answers two fundamental questions in communication theory:
The ultimate data compression (entropy H)
The ultimate transmission rate of a communication channel (channel capacity C)

Data compression limit: $\min I(X; \hat{X})$
Data transmission limit: $\max I(X; Y)$

It also has many applications in physics, statistics, mathematics, computer science, and economics.

Entropy of a Random Variable

Entropy
The uncertainty of a random variable X
The amount of information that the random variable possesses (or carries)
Mathematically:
$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$

Example
A uniformly distributed r.v. with 32 outcomes:
$H(X) = -\sum_{i=1}^{32} p(i) \log p(i) = -\sum_{i=1}^{32} \frac{1}{32} \log \frac{1}{32} = \log 32 = 5$ bits

Note (unit of information)
$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)$ bits when the logarithm is base 2, or $-\sum_{x \in \mathcal{X}} p(x) \log_e p(x)$ nats when it is base e
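As a quick numerical check (not part of the original slides), here is a minimal Python sketch of the definition; it confirms that a uniform distribution over 32 outcomes has entropy log2(32) = 5 bits, with the nats value shown for comparison.

```python
import math

def entropy(pmf, base=2.0):
    """H(X) = -sum p(x) log p(x); outcomes with zero probability contribute nothing."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

uniform32 = [1 / 32] * 32
print(entropy(uniform32))            # 5.0 bits
print(entropy(uniform32, math.e))    # log(32) = 3.4657... nats
```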

Data Compression (Example)


Send a message to another person indicating the outcome.
There are 8 outcomes, with probabilities
$\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{16}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64}$

Case 1: 000, 001, 010, 011, 100, 101, 110, 111
Average length: 3 bits

Case 2: 0, 10, 110, 1110, 111100, 111101, 111110, 111111
Average length: 2 bits, the same as the actual entropy
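The claim that the Case 2 code matches the entropy can be checked directly; this sketch (my own, not from the slides) computes both the entropy and the expected codeword length for the eight-outcome distribution.

```python
import math

probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
case2 = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

entropy = -sum(p * math.log2(p) for p in probs)          # H(X)
avg_len = sum(p * len(c) for p, c in zip(probs, case2))  # expected codeword length

print(entropy, avg_len)   # both are exactly 2.0 bits
```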

Data Compression (Example)


Case 2: 0, 10, 110, 1110, 111100, 111101, 111110, 111111
Because this code is prefix-free, any concatenation of codewords decodes uniquely; e.g., 1101011100 parses as 110 | 10 | 1110 | 0.

Entropy rate
The entropy of a stochastic process

Data Compression: Coding Techniques


Data compression
Source coding
The ultimate compression is bounded by H(X)

Coding techniques
Optimal codes
Huffman codes and their optimality (a small sketch follows below)
Shannon-Fano-Elias codes
Arithmetic codes
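Huffman coding is only listed here; as an illustrative sketch (mine, not the course's reference implementation), the snippet below builds a Huffman code with Python's heapq for the eight-outcome distribution from the earlier example and recovers an average length of 2 bits.

```python
import heapq

def huffman_code(probs):
    """Return a prefix code (list of bit strings), one codeword per probability."""
    # Each heap entry: (subtree probability, unique tie-breaker, symbol indices in the subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    codes = [""] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p0, _, s0 = heapq.heappop(heap)   # the two least-probable subtrees
        p1, _, s1 = heapq.heappop(heap)
        for i in s0:
            codes[i] = "0" + codes[i]     # prepend a bit as the subtrees are merged upward
        for i in s1:
            codes[i] = "1" + codes[i]
        heapq.heappush(heap, (p0 + p1, tie, s0 + s1))
        tie += 1
    return codes

probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
codes = huffman_code(probs)
avg_len = sum(p * len(c) for p, c in zip(probs, codes))
print(codes, avg_len)   # average length 2.0 bits = H(X)
```

Because these probabilities are dyadic (negative powers of two), the resulting codeword lengths equal -log2 p(x), so the average length meets the entropy exactly; in general Huffman coding guarantees an average length within 1 bit of H(X).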

Lossy Data Compression


Practically
Lossless compression alone is often not enough in practice.
We can compress much more efficiently by allowing a certain amount of distortion (relative to the original source).
The more distortion we allow, the further the data can be compressed.

Rate distortion
D: the distortion allowed
R(D): the compression bound for the given distortion

Data Transmission
Fundamental question
Given a channel with a probabilistic relationship between the input X and the output Y, i.e., p(y|x):
How much data can we transmit through the channel without error?

Old myth
The error rate is unavoidably tied to the noise of the channel.

Shannon's breakthrough
Error-free communication is possible if we transmit at a rate below the channel capacity.

[Figure: X -> Channel p(y|x) -> Y]

Data Transmission
Noiseless binary channel
In each transmission we can send 1 bit reliably to the receiver, so the capacity is 1 bit.

Noisy channel
We cannot transmit without error all the information that the input can carry.

[Figure: a noiseless binary channel and a noisy channel, each mapping inputs {0, 1} to outputs {0, 1}]

Data Transmission
Mutual information
A measure of the dependence between two random variables:
$I(X;Y) = H(X) - H(X \mid Y) = \sum_{x, y} p(x,y) \log \frac{p(x,y)}{p(x) p(y)}$
It is the relative entropy between p(x,y) and p(x)p(y).

Relative entropy
A distance between two probability mass functions:
$D(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$

Data Transmission
Channel capacity
The maximum data transmission rate with vanishing error probability.
The maximum mutual information between the input random variable and the output random variable:
$C = \max_{p(x)} I(X; Y)$

Coding techniques
Hamming codes
The simplest channel codes; not capacity approaching.
Random codes
Introduced by Shannon; capacity approaching but impractical.
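The slides do not evaluate C for a concrete channel here; as a hedged numerical illustration (my own, using the standard binary symmetric channel model with crossover probability eps), the sketch below maximizes I(X;Y) over a grid of input distributions and compares the result with the closed form 1 - H(eps).

```python
import numpy as np

def binary_entropy(p):
    """H(p) in bits, with the convention 0*log(0) = 0."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def bsc_mutual_information(p, eps):
    """I(X;Y) in bits for input Pr(X=1) = p over a BSC with crossover probability eps."""
    q = p * (1 - eps) + (1 - p) * eps          # output distribution Pr(Y=1)
    return binary_entropy(q) - binary_entropy(eps)   # H(Y) - H(Y|X)

eps = 0.1
grid = np.linspace(0, 1, 1001)
capacity = max(bsc_mutual_information(p, eps) for p in grid)
print(capacity, 1 - binary_entropy(eps))   # both ~0.531 bits, achieved at p = 1/2
```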

Data Transmission
Gaussian channel
The most useful model for communication problems.
Its capacity is derived and then generalized to parallel Gaussian channels.
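The Gaussian-channel capacity is only announced at this point; as a forward-looking sketch under the standard assumptions (a discrete-time channel Y = X + Z with input power P and noise variance N), the well-known closed form is C = (1/2) log2(1 + P/N) bits per channel use, which the snippet below simply evaluates.

```python
import math

def awgn_capacity(snr):
    """C = 0.5 * log2(1 + P/N) bits per channel use for a discrete-time AWGN channel."""
    return 0.5 * math.log2(1 + snr)

print(awgn_capacity(1.0))    # 0.5 bits/use at an SNR of 0 dB
print(awgn_capacity(100.0))  # ~3.33 bits/use at an SNR of 20 dB
```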

Chapter 2
Entropy, Relative Entropy, and Mutual Information

Entropy
Let X be a discrete random variable with alphabet $\mathcal{X}$ and probability mass function $p(x) = \Pr\{X = x\}$, $x \in \mathcal{X}$.

The entropy $H(X)$ of the discrete random variable X is defined by
$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$

The expectation of the random variable $g(X)$ is written
$E_p[g(X)] = \sum_{x \in \mathcal{X}} g(x) p(x)$

The entropy of X is the expected value of $g(X) = \log \frac{1}{p(X)}$:
$H(X) = E_p\left[\log \frac{1}{p(X)}\right]$

Entropy

Example: $\mathcal{X} = \{1, 10, 20\}$
$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) = -p(1)\log p(1) - p(10)\log p(10) - p(20)\log p(20)$

Properties of Entropy

The entropy $H(X)$ of a discrete random variable X is defined by
$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)$

Lemma 2.1.1: $H(X) \ge 0$
Proof: $0 \le p(x) \le 1$ implies $\log \frac{1}{p(x)} \ge 0$.

Lemma 2.1.2: $H_b(X) = (\log_b a) H_a(X)$
Proof: $\log_b p = (\log_b a) \log_a p$.

Example (Entropy)
Example 2.1.1:
$X = \begin{cases} 1, & \text{with probability } p \\ 0, & \text{with probability } 1-p \end{cases}$
$H(X) = -p \log p - (1-p) \log (1-p) \equiv H(p)$

[Figure: the binary entropy function H(p) versus p, equal to 1 bit at p = 0.5 and 0 at p = 0 and p = 1]
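A small numerical sketch (not from the slides) of the binary entropy function H(p) pictured above: it confirms the maximum of 1 bit at p = 0.5 and the symmetry H(p) = H(1 - p).

```python
import math

def binary_entropy(p):
    """H(p) = -p log2 p - (1-p) log2 (1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

grid = [i / 100 for i in range(101)]
values = [binary_entropy(p) for p in grid]
print(max(values), grid[values.index(max(values))])   # 1.0 at p = 0.5
print(binary_entropy(0.1), binary_entropy(0.9))       # equal by symmetry, ~0.469
```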

Example (Entropy)

Example 2.1.2:
$X = \begin{cases} a, & \text{with probability } 1/2 \\ b, & \text{with probability } 1/4 \\ c, & \text{with probability } 1/8 \\ d, & \text{with probability } 1/8 \end{cases}$

$H(X) = -\frac{1}{2}\log\frac{1}{2} - \frac{1}{4}\log\frac{1}{4} - \frac{1}{8}\log\frac{1}{8} - \frac{1}{8}\log\frac{1}{8} = \frac{7}{4}$ bits

Joint Entropy
Definition: The joint entropy $H(X,Y)$ of a pair of discrete random variables (X, Y) with a joint distribution p(x,y) is defined as
$H(X,Y) = -E[\log p(X,Y)] = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(x,y)$

Example: the joint pmf p(x,y)

p(x,y)   x=1   x=2
y=1      1/2   1/4
y=2      1/8   1/8

Joint Entropy
$H(X,Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(x,y)$
$= -\sum_{x \in \{1,2\}} \sum_{y \in \{1,2\}} p(x,y) \log p(x,y)$
$= -p(1,1)\log p(1,1) - p(2,1)\log p(2,1) - p(1,2)\log p(1,2) - p(2,2)\log p(2,2)$
$= -\frac{1}{2}\log\frac{1}{2} - \frac{1}{4}\log\frac{1}{4} - \frac{1}{8}\log\frac{1}{8} - \frac{1}{8}\log\frac{1}{8} = \frac{7}{4}$ bits

Example: the same joint pmf p(x,y)

p(x,y)   x=1   x=2
y=1      1/2   1/4
y=2      1/8   1/8
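To double-check the 7/4-bit result, this sketch (mine, not from the slides) computes H(X,Y) directly from the 2x2 table, together with the marginal entropies.

```python
import numpy as np

# Joint pmf from the slide: rows are y = 1, 2 and columns are x = 1, 2.
p_xy = np.array([[1/2, 1/4],
                 [1/8, 1/8]])

def entropy(p):
    """Entropy in bits of an array of probabilities (zeros ignored)."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

print(entropy(p_xy))              # H(X,Y) = 1.75 bits
print(entropy(p_xy.sum(axis=0)))  # H(X) from the column sums
print(entropy(p_xy.sum(axis=1)))  # H(Y) from the row sums
```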

Joint Entropy
From the joint pmf table above, $p(1,1) = p(x=1, y=1) = \frac{1}{2}$.

The conditional pmf is obtained from the joint and marginal pmfs:
$p(x \mid y) = \frac{p(x,y)}{p(y)}$, e.g., $p(x=1 \mid y=1) = \frac{p(x=1, y=1)}{p(y=1)} = \frac{1/2}{3/4} = \frac{2}{3}$

Conditional Entropy
Definition: If $(X,Y) \sim p(x,y)$, then the conditional entropy $H(Y \mid X)$ is defined as
$H(Y \mid X) = \sum_{x \in \mathcal{X}} p(x) H(Y \mid X = x)$
$= -\sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y \mid x) \log p(y \mid x)$
$= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(y \mid x)$
$= -E_{p(x,y)}[\log p(Y \mid X)]$

Conditional Entropy
$H(Y \mid X) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(y \mid x)$
$= -p(1,1)\log p(1 \mid 1) - p(1,2)\log p(2 \mid 1) - p(2,1)\log p(1 \mid 2) - p(2,2)\log p(2 \mid 2)$
$= -4 \cdot \frac{1}{4} \log \frac{1}{2} = 1$ bit

Example: p(x,y)   y=1   y=2
         x=1      1/4   1/4
         x=2      1/4   1/4

Example: Conditional Entropy


$H(Y \mid X) = \sum_{x \in \mathcal{X}} p(x) H(Y \mid X = x)$

Example (the same uniform table as on the previous slide):
$H(Y \mid X) = \sum_{i=1}^{2} p(X = i) H(Y \mid X = i)$
$= p(X=1) H(Y \mid X=1) + p(X=2) H(Y \mid X=2)$
where $p(X=1) = p(X=2) = \frac{1}{2}$,
$H(Y \mid X=1) = H(\tfrac{1}{2}, \tfrac{1}{2}) = 1$ and $H(Y \mid X=2) = H(\tfrac{1}{2}, \tfrac{1}{2}) = 1$, so
$H(Y \mid X) = \frac{1}{2} \cdot 1 + \frac{1}{2} \cdot 1 = 1$ bit

Example: Conditional Entropy

Example: the joint pmf p(x,y)

p(x,y)   x=1    x=2    x=3    x=4
y=1      1/8    1/16   1/32   1/32
y=2      1/16   1/8    1/32   1/32
y=3      1/16   1/16   1/16   1/16
y=4      1/4    0      0      0

$H(X \mid Y) = \sum_{i=1}^{4} p(Y = i) H(X \mid Y = i) = -\sum_{x,y} p(x,y) \log p(x \mid y) = \frac{11}{8}$ bits
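The conditional entropies for the 4x4 table can be verified numerically; the sketch below (mine) uses the chain-rule identities H(X|Y) = H(X,Y) - H(Y) and H(Y|X) = H(X,Y) - H(X), which are introduced on the next slide.

```python
import numpy as np

# Joint pmf p(x, y): rows are y = 1..4, columns are x = 1..4 (from the slide).
p_xy = np.array([[1/8,  1/16, 1/32, 1/32],
                 [1/16, 1/8,  1/32, 1/32],
                 [1/16, 1/16, 1/16, 1/16],
                 [1/4,  0,    0,    0   ]])

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

H_xy = entropy(p_xy)                 # joint entropy H(X,Y)
H_x = entropy(p_xy.sum(axis=0))      # marginal entropy H(X)
H_y = entropy(p_xy.sum(axis=1))      # marginal entropy H(Y)

print(H_xy - H_y)   # H(X|Y) = H(X,Y) - H(Y) = 11/8 = 1.375 bits
print(H_xy - H_x)   # H(Y|X) = H(X,Y) - H(X) = 13/8 = 1.625 bits
```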

Chain Rule
Theorem 2.2.1 (Chain rule):
$H(X,Y) = H(X) + H(Y \mid X)$
Proof:
$H(X,Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(x,y)$
$= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log [p(x) p(y \mid x)]$
$= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(x) - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(y \mid x)$
$= -\sum_{x \in \mathcal{X}} p(x) \log p(x) - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(y \mid x)$
$= H(X) + H(Y \mid X)$

Remark: Chain Rule

$H(X,Y) = H(X) + H(Y \mid X)$
$H(X,Y) = H(Y) + H(X \mid Y)$
Note that $H(Y \mid X) \neq H(X \mid Y)$ in general, but
$H(Y) - H(Y \mid X) = H(X) - H(X \mid Y)$

Relative Entropy
The relative entropy is a measure of the distance between two distributions.
If we knew the true distribution p of the random variable, we could construct a code with average description length H(p). If we instead use the code for a distribution q, we would need H(p) + D(p||q) bits on average.
Definition: The relative entropy or Kullback-Leibler distance between two probability mass functions p(x) and q(x) is defined as
$D(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} = E_p\left[\log \frac{p(X)}{q(X)}\right]$

Relative Entropy
Example 2.3.1: Let $\mathcal{X} = \{0, 1\}$ and consider two distributions p and q on $\mathcal{X}$. Let $p(0) = 1-r$, $p(1) = r$, and $q(0) = 1-s$, $q(1) = s$.

$D(p \| q) = (1-r) \log \frac{1-r}{1-s} + r \log \frac{r}{s}$
$D(q \| p) = (1-s) \log \frac{1-s}{1-r} + s \log \frac{s}{r}$

If $r = s$, then $D(p \| q) = D(q \| p) = 0$.
If $r = 1/2$ and $s = 1/4$:
$D(p \| q) = \frac{1}{2} \log \frac{1/2}{3/4} + \frac{1}{2} \log \frac{1/2}{1/4} = 1 - \frac{1}{2} \log 3 = 0.2075$ bits
$D(q \| p) = \frac{3}{4} \log \frac{3/4}{1/2} + \frac{1}{4} \log \frac{1/4}{1/2} = \frac{3}{4} \log 3 - 1 = 0.1887$ bits

Mutual Information
Definition: Consider two random variables X and Y with a joint probability mass function p(x,y) and marginal probability mass functions p(x) and p(y). The mutual information I(X;Y) is the relative entropy between the joint distribution and the product distribution:
$I(X;Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x) p(y)}$
$= D(p(x,y) \| p(x) p(y))$
$= E_{p(x,y)}\left[\log \frac{p(X,Y)}{p(X) p(Y)}\right]$

Relationship between Entropy and Mutual Info.

$I(X;Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x) p(y)}$
$= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(x \mid y)}{p(x)}$
$= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(x) + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(x \mid y)$
$= -\sum_{x \in \mathcal{X}} p(x) \log p(x) + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(x \mid y)$
$= H(X) - H(X \mid Y)$

Entropy and Mutual Information


$I(X;Y) = H(Y) - H(Y \mid X)$
$H(X,Y) = H(X) + H(Y \mid X)$
$I(X;Y) = H(X) + H(Y) - H(X,Y)$

[Diagram: Venn diagram of H(X,Y), with circles H(X) and H(Y); the non-overlapping parts are H(X|Y) and H(Y|X), and the overlap is I(X;Y)]
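To tie these identities together numerically, the sketch below (mine) reuses the 4x4 joint pmf from the conditional-entropy example and checks that the direct definition of I(X;Y) agrees with H(X) + H(Y) - H(X,Y) and with H(X) - H(X|Y).

```python
import numpy as np

p_xy = np.array([[1/8,  1/16, 1/32, 1/32],
                 [1/16, 1/8,  1/32, 1/32],
                 [1/16, 1/16, 1/16, 1/16],
                 [1/4,  0,    0,    0   ]])   # rows y, columns x

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

H_xy = entropy(p_xy)
H_x = entropy(p_xy.sum(axis=0))
H_y = entropy(p_xy.sum(axis=1))

# Direct definition: I(X;Y) = sum p(x,y) log [ p(x,y) / (p(x) p(y)) ]
px_py = np.outer(p_xy.sum(axis=1), p_xy.sum(axis=0))
mask = p_xy > 0
I_direct = float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / px_py[mask])))

print(I_direct)                # 3/8 = 0.375 bits
print(H_x + H_y - H_xy)        # same value via I = H(X) + H(Y) - H(X,Y)
print(H_x - (H_xy - H_y))      # same value via I = H(X) - H(X|Y)
```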

Chain Rule for Entropy


Theorem 2.5.1 (Chain rule for entropy): Let $X_1, X_2, \ldots, X_n$ be drawn according to $p(x_1, x_2, \ldots, x_n)$. Then
$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1)$

$H(X_1, X_2) = H(X_1) + H(X_2 \mid X_1)$
$H(X_1, X_2, X_3) = H(X_1) + H(X_2, X_3 \mid X_1)$
$= H(X_1) + H(X_2 \mid X_1) + H(X_3 \mid X_1, X_2)$

Chain Rule for Entropy


$H(X_1, X_2, X_3) = H(X_1) + H(X_2, X_3 \mid X_1)$
$= H(X_1) + H(X_2 \mid X_1) + H(X_3 \mid X_1, X_2)$

[Diagram: H(X_1, X_2, X_3) decomposed into the pieces H(X_1), H(X_2 | X_1), and H(X_3 | X_1, X_2)]

Conditional Mutual Information


Definition: The conditional mutual information of random variables X and Y given Z is defined by
$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z)$

Theorem 2.5.2 (Chain rule for mutual information):
$I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_{i-1}, X_{i-2}, \ldots, X_1)$

$I(X_1, X_2; Y) = I(X_1; Y) + I(X_2; Y \mid X_1)$
$I(X_1, X_2, X_3; Y) = I(X_1; Y) + I(X_2; Y \mid X_1) + I(X_3; Y \mid X_1, X_2)$

Conditional Mutual Information


$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z)$

Conditional Relative Entropy


Definition: The conditional relative entropy is the average of the relative entropies between the conditional probability mass functions p(y|x) and q(y|x), averaged over the pmf p(x):
$D(p(y \mid x) \| q(y \mid x)) = \sum_{x} p(x) \sum_{y} p(y \mid x) \log \frac{p(y \mid x)}{q(y \mid x)}$
$= \sum_{x} \sum_{y} p(x,y) \log \frac{p(y \mid x)}{q(y \mid x)}$

Chain Rule for Relative Entropy


Theorem 2.5.3:
$D(p(x,y) \| q(x,y)) = D(p(x) \| q(x)) + D(p(y \mid x) \| q(y \mid x))$

Proof:
$D(p(x,y) \| q(x,y)) = \sum_{x} \sum_{y} p(x,y) \log \frac{p(x,y)}{q(x,y)}$
$= \sum_{x} \sum_{y} p(x,y) \log \frac{p(x) p(y \mid x)}{q(x) q(y \mid x)}$
$= \sum_{x} \sum_{y} p(x,y) \log \frac{p(x)}{q(x)} + \sum_{x} \sum_{y} p(x,y) \log \frac{p(y \mid x)}{q(y \mid x)}$
$= D(p(x) \| q(x)) + D(p(y \mid x) \| q(y \mid x))$

Convex Functions
Definition: A function f(x) is said to be convex over an interval (a, b) if for every $x_1, x_2 \in (a, b)$ and $0 \le \lambda \le 1$,
$f(\lambda x_1 + (1-\lambda) x_2) \le \lambda f(x_1) + (1-\lambda) f(x_2)$

A function f is said to be strictly convex if equality holds only when $\lambda = 0$ or $\lambda = 1$.
Definition: A function f(x) is concave if $-f(x)$ is convex.
Theorem 2.6.1: If the function f(x) has a second derivative which is non-negative (positive) everywhere, then the function is convex (strictly convex).

Convexity of a function
e.g., $x^2$, $e^x$, etc. are convex.

[Figure: a convex function; the chord joining $(x_1, f(x_1))$ and $(x_2, f(x_2))$ lies above the curve at $\lambda x_1 + (1-\lambda) x_2$, illustrating $f(\lambda x_1 + (1-\lambda) x_2) \le \lambda f(x_1) + (1-\lambda) f(x_2)$]

Theorem 2.6.1 is proved directly using the Taylor series expansion
$f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x^*)}{2}(x - x_0)^2$

Jensen's Inequality
Theorem 2.6.2: If f(x) is a convex function and X is a random variable, then $E[f(X)] \ge f(E[X])$.
Proof by induction on the number of mass points:
1. Base case: $p_1 f(x_1) + p_2 f(x_2) \ge f(p_1 x_1 + p_2 x_2)$, with $p_1 + p_2 = 1$
2. Hypothesis: $\sum_{i=1}^{k-1} p_i f(x_i) \ge f\left(\sum_{i=1}^{k-1} p_i x_i\right)$, with $\sum_{i=1}^{k-1} p_i = 1$
3. Inductive step: show $\sum_{i=1}^{k} p_i f(x_i) \ge f\left(\sum_{i=1}^{k} p_i x_i\right)$, with $\sum_{i=1}^{k} p_i = 1$

Jensen's Inequality

Write $p_i' = p_i / (1 - p_k)$ for $i = 1, \ldots, k-1$. Then
$\sum_{i=1}^{k} p_i f(x_i) = p_k f(x_k) + (1 - p_k) \sum_{i=1}^{k-1} p_i' f(x_i)$
$\ge p_k f(x_k) + (1 - p_k) f\left(\sum_{i=1}^{k-1} p_i' x_i\right)$   (the induction hypothesis)
$\ge f\left(p_k x_k + (1 - p_k) \sum_{i=1}^{k-1} p_i' x_i\right)$   (the definition of convexity)
$= f\left(\sum_{i=1}^{k} p_i x_i\right)$

Information Inequality
Theorem 2.6.3 (Information inequality): Let $p(x)$, $q(x)$, $x \in \mathcal{X}$, be two probability mass functions. Then $D(p \| q) \ge 0$, with equality if and only if $p(x) = q(x)$ for all x.

Proof:
$-D(p \| q) = -\sum_{x} p(x) \log \frac{p(x)}{q(x)} = \sum_{x} p(x) \log \frac{q(x)}{p(x)}$
$\le \log \sum_{x} p(x) \frac{q(x)}{p(x)}$   (by Jensen's inequality)
$= \log \sum_{x} q(x) \le \log 1 = 0$

Non-negativity of Mutual Information


Corollary (Non-negativity of mutual information):
For any two random variables X, Y, $I(X;Y) \ge 0$, with equality if and only if X and Y are independent.
$I(X;Y) = D(p(x,y) \| p(x) p(y)) \ge 0$

Corollary: $D(p(y \mid x) \| q(y \mid x)) \ge 0$
Corollary: $I(X; Y \mid Z) \ge 0$

Uniform Distribution
Theorem 2.6.4: $H(X) \le \log |\mathcal{X}|$, where $|\mathcal{X}|$ denotes the number of elements in the range of X, with equality if and only if X has a uniform distribution over $\mathcal{X}$.
Proof: Let $u(x) = \frac{1}{|\mathcal{X}|}$ be the uniform probability mass function over $\mathcal{X}$, and let $p(x)$ be the probability mass function for X. Then
$D(p \| u) = \sum_{x} p(x) \log \frac{p(x)}{u(x)} = \sum_{x} p(x) \log p(x) - \sum_{x} p(x) \log u(x) = -H(X) + \log |\mathcal{X}|$
By the non-negativity of relative entropy,
$\log |\mathcal{X}| - H(X) = D(p \| u) \ge 0$, hence $H(X) \le \log |\mathcal{X}|$.

Conditioning
Theorem 2.6.5 (Conditioning reduces entropy):
$H(X \mid Y) \le H(X)$, with equality if and only if X and Y are independent.
Proof: $0 \le I(X;Y) = H(X) - H(X \mid Y)$

Example: the joint pmf p(x,y)

p(x,y)   x=1   x=2
y=1      0     3/4
y=2      1/8   1/8

$H(X) = H\left(\frac{1}{8}, \frac{7}{8}\right) = 0.544$ bits
$H(X \mid Y=1) = 0$ bits, $H(X \mid Y=2) = 1$ bit
$H(X \mid Y) = \frac{3}{4} H(X \mid Y=1) + \frac{1}{4} H(X \mid Y=2) = 0.25$ bits
Note that $H(X \mid Y=2) > H(X)$; conditioning reduces entropy only on average.
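The numbers in this example are easy to reproduce; the sketch below (mine, not from the slides) computes H(X), H(X|Y=y), and H(X|Y) from the table and shows that conditioning reduces entropy on average even though H(X|Y=2) > H(X).

```python
import numpy as np

# Joint pmf p(x, y): rows y = 1, 2 and columns x = 1, 2 (from the example above).
p_xy = np.array([[0.0, 3/4],
                 [1/8, 1/8]])

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

p_y = p_xy.sum(axis=1)                            # marginal of Y: (3/4, 1/4)
H_x = entropy(p_xy.sum(axis=0))                   # H(X) = H(1/8, 7/8) ~ 0.544 bits
H_x_given_y = [entropy(row / row.sum()) for row in p_xy]

print(H_x)                                        # ~0.544
print(H_x_given_y)                                # [0.0, 1.0]
print(sum(py * h for py, h in zip(p_y, H_x_given_y)))   # H(X|Y) = 0.25 bits
```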

Independence Bound and Entropy


Theorem 2.6.6 (Independence bound on entropy):
Let $X_1, X_2, \ldots, X_n$ be drawn according to $p(x_1, x_2, \ldots, x_n)$. Then
$H(X_1, X_2, \ldots, X_n) \le \sum_{i=1}^{n} H(X_i)$
with equality if and only if the $X_i$ are independent.

Proof: By the chain rule for entropies,
$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1) \le \sum_{i=1}^{n} H(X_i)$
since conditioning reduces entropy.

Log Sum Inequality


Theorem 2.7.1 (Log sum inequality): For non-negative numbers $a_1, a_2, \ldots, a_n$ and $b_1, b_2, \ldots, b_n$,
$\sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \ge \left(\sum_{i=1}^{n} a_i\right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i}$
with equality if and only if $a_i / b_i$ is constant.

Log Sum Inequality


Proof
$f(t) = t \log t$ is strictly convex, since $f''(t) = \frac{1}{t} \log e > 0$ for all positive t. Hence, by Jensen's inequality,
$\sum_i \alpha_i f(t_i) \ge f\left(\sum_i \alpha_i t_i\right)$ for $\alpha_i \ge 0$, $\sum_i \alpha_i = 1$.
Setting $\alpha_i = \frac{b_i}{\sum_{j=1}^{n} b_j}$ and $t_i = \frac{a_i}{b_i}$, we obtain
$\sum_i \frac{b_i}{\sum_j b_j} \frac{a_i}{b_i} \log \frac{a_i}{b_i} \ge \left(\sum_i \frac{b_i}{\sum_j b_j} \frac{a_i}{b_i}\right) \log \left(\sum_i \frac{b_i}{\sum_j b_j} \frac{a_i}{b_i}\right)$
i.e.,
$\sum_i \frac{a_i}{\sum_j b_j} \log \frac{a_i}{b_i} \ge \frac{\sum_i a_i}{\sum_j b_j} \log \frac{\sum_i a_i}{\sum_j b_j}$
and multiplying both sides by $\sum_j b_j$ gives the log sum inequality.

Convexity and Relative Entropy


Theorem 2.7.2: $D(p \| q)$ is convex in the pair (p, q); i.e., if $(p_1, q_1)$ and $(p_2, q_2)$ are two pairs of probability mass functions, then
$D(\lambda p_1 + (1-\lambda) p_2 \| \lambda q_1 + (1-\lambda) q_2) \le \lambda D(p_1 \| q_1) + (1-\lambda) D(p_2 \| q_2)$
for all $0 \le \lambda \le 1$.

Convexity and Relative Entropy


Proof
By the log sum inequality (with n = 2),
$a_1 \log \frac{a_1}{b_1} + a_2 \log \frac{a_2}{b_2} \ge (a_1 + a_2) \log \frac{a_1 + a_2}{b_1 + b_2}$
Set $a_1 = \lambda p_1(x)$, $a_2 = (1-\lambda) p_2(x)$, $b_1 = \lambda q_1(x)$, and $b_2 = (1-\lambda) q_2(x)$. Then
$(\lambda p_1(x) + (1-\lambda) p_2(x)) \log \frac{\lambda p_1(x) + (1-\lambda) p_2(x)}{\lambda q_1(x) + (1-\lambda) q_2(x)}$
$\le \lambda p_1(x) \log \frac{\lambda p_1(x)}{\lambda q_1(x)} + (1-\lambda) p_2(x) \log \frac{(1-\lambda) p_2(x)}{(1-\lambda) q_2(x)}$
$= \lambda p_1(x) \log \frac{p_1(x)}{q_1(x)} + (1-\lambda) p_2(x) \log \frac{p_2(x)}{q_2(x)}$

Convexity and Relative Entropy


Proof (cont'd)
Summing the inequality over all x,
$\sum_{x} (\lambda p_1(x) + (1-\lambda) p_2(x)) \log \frac{\lambda p_1(x) + (1-\lambda) p_2(x)}{\lambda q_1(x) + (1-\lambda) q_2(x)}$
$\le \sum_{x} \left[\lambda p_1(x) \log \frac{p_1(x)}{q_1(x)} + (1-\lambda) p_2(x) \log \frac{p_2(x)}{q_2(x)}\right]$
i.e.,
$D(\lambda p_1 + (1-\lambda) p_2 \| \lambda q_1 + (1-\lambda) q_2) \le \lambda D(p_1 \| q_1) + (1-\lambda) D(p_2 \| q_2)$

Concavity of Entropy

Theorem 2.7.3 (Concavity of entropy): H(p) is a concave function of p; that is,
$H(\lambda p_1 + (1-\lambda) p_2) \ge \lambda H(p_1) + (1-\lambda) H(p_2)$.

Proof:
$H(p) = \log |\mathcal{X}| - D(p \| u)$
where u is the uniform distribution on $|\mathcal{X}|$ outcomes. The concavity of H then follows directly from the convexity of D.

Mutual Information and Input Distr.


Theorem 2.7.4 (Part 1): Let $(X,Y) \sim p(x,y) = p(x) p(y \mid x)$. The mutual information I(X;Y) is a concave function of p(x) for fixed p(y|x).
Proof:
$I(X;Y) = H(Y) - H(Y \mid X) = H(Y) - \sum_{x} p(x) H(Y \mid X = x)$
First term: with p(y|x) fixed, p(y) is a linear function of p(x), so H(Y), which is concave in p(y), is a concave function of p(x).
Second term: a linear function of p(x).
Their difference is therefore a concave function of p(x).

Note: this result is closely related to the channel coding theorem.

Mutual Information and Input Distr.


Theorem 2.7.4 (Part 2): I(X;Y) is a convex function of p(y|x) for fixed p(x).
Proof:
Consider a RV X and two channels p1(y|x) and p2(y|x). Let one channel be chosen at random according to a binary RV Z, with $\Pr\{Z = 1\} = \lambda$, independent of X.
$I(X; Y, Z) = I(X; Y \mid Z) + I(X; Z) = I(X; Y) + I(X; Z \mid Y)$
Since Z is independent of X, $I(X; Z) = 0$, and $I(X; Z \mid Y) \ge 0$, so
$I(X; Y) \le I(X; Y \mid Z) = \lambda I(X; Y_1) + (1-\lambda) I(X; Y_2)$
where Y is the output of the mixture channel $\lambda p_1(y \mid x) + (1-\lambda) p_2(y \mid x)$.

Markov Chain
Definition: Random variables X, Y, Z are said to form a Markov chain in that order (denoted $X \to Y \to Z$) if the conditional distribution of Z depends only on Y and is conditionally independent of X. Specifically, X, Y, and Z form a Markov chain $X \to Y \to Z$ if the joint probability can be written as
$p(x, y, z) = p(x) p(y \mid x) p(z \mid y) = p(x, y) p(z \mid y)$.

$p(x, z \mid y) = \frac{p(x, y, z)}{p(y)} = \frac{p(x, y) p(z \mid y)}{p(y)} = p(x \mid y) p(z \mid y)$

$X \to Y \to Z$ implies that $Z \to Y \to X$. Thus the condition is sometimes written $X \leftrightarrow Y \leftrightarrow Z$.
If $Z = f(Y)$, then $X \to Y \to Z$.

Data Processing Inequality


Theorem 2.8.1 (Data processing inequality): If $X \to Y \to Z$, then $I(X;Y) \ge I(X;Z)$.

Proof: By the chain rule,
$I(X; Y, Z) = I(X; Z) + I(X; Y \mid Z) = I(X; Y) + I(X; Z \mid Y)$
Since $X \to Y \to Z$, we have $I(X; Z \mid Y) = 0$, and $I(X; Y \mid Z) \ge 0$. Hence
$I(X; Y) \ge I(X; Z)$, and also $I(X; Y) \ge I(X; Y \mid Z)$.
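As a concrete illustration (my own, with hypothetical channel parameters), the sketch below passes a uniform bit X through two cascaded binary symmetric channels so that X -> Y -> Z, and checks that I(X;Y) >= I(X;Z).

```python
import numpy as np

def mutual_information(p_joint):
    """I in bits from a joint pmf indexed [first variable, second variable]."""
    p1 = p_joint.sum(axis=1, keepdims=True)
    p2 = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float(np.sum(p_joint[mask] * np.log2(p_joint[mask] / (p1 * p2)[mask])))

def bsc(eps):
    """Transition matrix p(output | input) of a binary symmetric channel."""
    return np.array([[1 - eps, eps],
                     [eps, 1 - eps]])

p_x = np.array([0.5, 0.5])                 # uniform input bit
W1, W2 = bsc(0.1), bsc(0.2)                # X -> Y and Y -> Z

p_xy = p_x[:, None] * W1                   # p(x, y) = p(x) p(y|x)
p_xz = p_x[:, None] * (W1 @ W2)            # p(x, z): Z depends on X only through Y

print(mutual_information(p_xy))            # I(X;Y) ~ 0.531 bits
print(mutual_information(p_xz))            # I(X;Z) ~ 0.173 bits, smaller as guaranteed
```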

Fano's Inequality
Suppose we have RVs X and Y. Fano's inequality bounds the error probability when we estimate X from Y.
Procedure
X is sent, Y is observed.
Generate an estimator of X: $\hat{X} = g(Y)$.
$P_e$: the probability of error, i.e., $P_e = \Pr(\hat{X} \neq X)$
E: the error indicator RV
$E = \begin{cases} 1, & \text{if } \hat{X} \neq X \\ 0, & \text{if } \hat{X} = X \end{cases}$
Note that $P_e = P(E = 1)$.

Fano's Inequality
Theorem (Fano): For any estimator $\hat{X}$ such that $X \to Y \to \hat{X}$, with $P_e = \Pr(\hat{X} \neq X)$, we have
$H(P_e) + P_e \log |\mathcal{X}| \ge H(X \mid \hat{X}) \ge H(X \mid Y)$
The inequality can be weakened to
$1 + P_e \log |\mathcal{X}| \ge H(X \mid Y)$
or
$P_e \ge \frac{H(X \mid Y) - 1}{\log |\mathcal{X}|}$

[Photo: Robert Fano]
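As a small numerical sketch (mine, with illustrative numbers), the weakened form of Fano's inequality gives an immediate lower bound on the error probability from H(X|Y) and the alphabet size.

```python
import math

def fano_lower_bound(h_x_given_y, alphabet_size):
    """Weakened Fano bound: Pe >= (H(X|Y) - 1) / log2 |X|, clamped at 0."""
    return max(0.0, (h_x_given_y - 1.0) / math.log2(alphabet_size))

# Illustrative numbers: if H(X|Y) = 3 bits and |X| = 32, any estimator g(Y) errs
# with probability at least (3 - 1) / log2(32) = 0.4.
print(fano_lower_bound(3.0, 32))   # 0.4
print(fano_lower_bound(0.5, 32))   # 0.0 -- the bound is vacuous when H(X|Y) <= 1
```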

Fano's Inequality
Proof

Fano's Inequality
Corollary: For any two RVs X and Y, let $p = \Pr(X \neq Y)$. Then
$H(p) + p \log |\mathcal{X}| \ge H(X \mid Y)$

Corollary:
