$\bar{\ell}_1 = 3$ binary symbols.
Answer 2: What if we allow the length of each representation to vary amongst the outcomes?
For example, a Huffman code for this source would give the following representation:
Outcome   Probability   Representation 2
   0          1/2          0
   1          1/4          10
   2          1/8          110
   3          1/16         1110
   4          1/64         111100
   5          1/64         111101
   6          1/64         111110
   7          1/64         111111
Note: Each bit of each codeword can be thought of as asking a yes-or-no question about
the outcome. Bit 1 asks the question, "Did horse 0 win?" (0 = yes and 1 = no). Similarly for the
remaining bits. We will see this again when we discuss Huffman source codes in greater detail.
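To make the question-asking view concrete, here is a minimal decoder sketch (ours, not from the notes) for the code in the table above: each 1 that is read answers "no" and moves the question to the next outcome, and after four 1s the remaining two bits select among outcomes 4 through 7.

    def decode(bits):
        """Decode a stream of '0'/'1' characters for the code in the table above."""
        outcomes, i = [], 0
        while i < len(bits):
            ones = 0
            # Each '1' answers "no" to "did horse `ones` win?"; stop after four 1s.
            while ones < 4 and bits[i + ones] == '1':
                ones += 1
            if ones < 4:                      # codewords 0, 10, 110, 1110
                outcomes.append(ones)
                i += ones + 1                 # skip the 1s and the terminating 0
            else:                             # prefix 1111: two more bits pick among 4-7
                tail = bits[i + 4:i + 6]
                outcomes.append(4 + 2 * int(tail[0]) + int(tail[1]))
                i += 6
        return outcomes

    assert decode("0" "10" "111101" "110") == [0, 1, 5, 2]

Because the code is prefix-free, the decoder never needs to look ahead past the current codeword.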
The average number of binary symbols required to represent the source, $\bar{\ell}_2$, can be computed as
$$\bar{\ell}_2 = \frac{1}{2}(1) + \frac{1}{4}(2) + \frac{1}{8}(3) + \frac{1}{16}(4) + \frac{4}{64}(6) = 2 \text{ binary symbols}$$
which is less than $\bar{\ell}_1 = 3$ binary symbols. In fact, the code above provides the minimum average
codeword length of any representation for the source.
Definition: The source entropy, H(X), is defined as
$$H(X) = \sum_{x \in S_X} p(x) \log_2 \frac{1}{p(x)} \text{ bits.}$$
As we will show later in the course, the most economical representation has an average codeword length $\bar{\ell}$ satisfying
$$H(X) \le \bar{\ell} < H(X) + 1.$$
For the source considered in the example,
$$H(X) = \frac{1}{2}\log 2 + \frac{1}{4}\log 4 + \frac{1}{8}\log 8 + \frac{1}{16}\log 16 + \frac{4}{64}\log 64 = 2 \text{ bits.}$$
Thus, the above Huffman code for the source is optimal.
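As a quick numerical check of these two computations, a short sketch in Python (illustrative, not part of the original notes):

    from math import log2

    pmf = {0: 1/2, 1: 1/4, 2: 1/8, 3: 1/16, 4: 1/64, 5: 1/64, 6: 1/64, 7: 1/64}
    lengths = {0: 1, 1: 2, 2: 3, 3: 4, 4: 6, 5: 6, 6: 6, 7: 6}  # |codeword| from the table

    H = sum(p * log2(1 / p) for p in pmf.values())               # source entropy
    avg_len = sum(pmf[x] * lengths[x] for x in pmf)              # average codeword length
    print(H, avg_len)   # both equal 2.0: the code meets the entropy lower bound exactly

The bound is met with equality here because every probability is a negative power of 2.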
One-time versus average performance
Note that although $\bar{\ell}_2 < \bar{\ell}_1$, the variance of the codeword lengths in Representation 2 is larger
than in Representation 1 (where every codeword has length 3, so the variance is zero).
If we had a one-off experiment where it was only ever necessary to encode a single outcome, it is
possible that Representation 1 may outperform Representation 2.
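A short continuation of the sketch above makes the variance comparison concrete:

    # Continuing from the pmf and lengths defined in the previous sketch.
    mean2 = sum(pmf[x] * lengths[x] for x in pmf)                # = 2
    var2 = sum(pmf[x] * (lengths[x] - mean2) ** 2 for x in pmf)  # = 1.875
    var1 = 0.0                                                   # every codeword is 3 bits
    print(var1, var2)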
Example:
In the English language the frequency of the letter E is approximately 13%, nearly indepen-
dently of the writer (and it is several times more likely than the next most frequent letter). We can
develop a coder for this case which minimizes the average number of bits required to represent
the source, but it must be noted that this only minimizes the average number of bits required.
In 1939, E. V. Wright published a book of more than 50,000 words, entitled Gadsby, in which he
did not use the letter E at all! He even avoided abbreviations such as "Mr." and "Mrs." which,
when expanded, contain the letter E. Here is an excerpt:
Upon this basis I am going to show you how a bunch of bright young folks did find
a champion; a man with boys and girls of his own; a man of so dominating and happy
individuality that Youth is drawn to him as is a fly to a sugar bowl. It is a story
about a small town. It is not a gossipy yarn; nor is it a dry, monotonous account, full
of such customary "fill-ins" as "romantic moonlight casting murky shadows down a
long, winding country road." Nor will it say anything about tinklings lulling distant
folds; robins carolling at twilight, nor any "warm glow of lamplight" from a cabin
window. No. It is an account of up-and-doing activity; a vivid portrayal of Youth
as it is today; and a practical discarding of that worn-out notion that "a child don't
know anything."
In this case the encoder designed to minimize the average representation length for conventional
language would not do very well, since the behavior of the text in Gadsby is not typical.
Information theory and coding deal with the typical or expected behavior of the source.
Entropy is a measure of the average uncertainty associated with the source.
We will demonstrate later in the course that, for sequences of outcomes from a source,
nearly all of the probability mass is contained in a relatively small set termed the
typical set.
The most likely outcomes need not be typical (example: a Bernoulli source), but there is
a collection of low-probability events all with nearly the same probability.
We will show that the entropy, H(X), is a measure of the size of the typical set, i.e., H(X)
is combinatorial in nature.
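A small illustrative sketch of the Bernoulli example (the parameters are our own choice): for blocks of length 100 from a Bernoulli(0.1) source, the all-zeros sequence is the single most likely outcome, yet nearly all of the probability mass sits on sequences with roughly 10 ones.

    from math import comb

    p, n = 0.1, 100                    # Bernoulli(p) source, blocks of length n
    p_all_zeros = (1 - p) ** n         # the single most likely sequence
    # Total probability of sequences whose number of ones is close to n*p = 10:
    typical_mass = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(5, 16))
    print(p_all_zeros)    # ~2.7e-5: individually most likely, but not typical
    print(typical_mass)   # ~0.93: sequences with ~n*p ones carry almost all the mass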
1.2.2 Channel Encoder
Goal: To achieve an economical (high rate) and reliable (low probability of error) transmission
of bits over a channel.
With a channel code we add redundancy to the transmitted data sequence, which allows
for the correction of errors that are introduced by the channel.
Alternatively, we can view a channel code as imposing a structure on the set of transmitted
codewords which is exploited at the receiver to improve detection.
Example: Nature's coding: DNA and RNA contain a coded recipe to synthesize proteins in
organisms. This provides error protection in the replication of DNA.
[Figure: input vectors to a noisy channel, and the set of all possible received vectors at its output.]
Each codeword which is transmitted is corrupted by the channel. Each transmitted
codeword corresponds to a set of possible received vectors (a set of typical outcomes).
Specify a code (i.e., a set of codewords) so that at the receiver it is possible to distinguish
which element was sent with high probability (i.e., the probability of overlap of the regions is
small).
The channel coding theorem tells us the maximum number of such codewords we can
define and still maintain completely distinguishable outputs.
Shannon's Channel Coding Theorem: There is a quantity called the capacity, C, of a
channel such that for every rate R < C there exists a sequence of
$$(\underbrace{2^{nR}}_{\#\text{ codewords}},\ \underbrace{n}_{\#\text{ channel uses}})$$
codes such that $\Pr[\text{error}] \to 0$ as $n \to \infty$. Conversely, for any sequence of codes, if $\Pr[\text{error}] \to 0$ as $n \to \infty$,
then $R \le C$.
Example: Binary Symmetric Channel
[Figure: BSC transition diagram, with each input bit flipped with probability p and passed through unchanged with probability 1 − p; and a plot of the capacity C versus p, equal to 1 at p = 0 and p = 1 and falling to 0 at p = 1/2.]
Input channel alphabet = output channel alphabet = {0, 1}.
Assume independent channel uses (i.e., no memory).
The channel randomly flips the bit with probability p.
For p = 0 or p = 1, C = 1 bit/channel use (a noiseless channel or an inversion channel, respectively).
The worst case is p = 1/2, in which case the input and the output are statistically independent
and C = 0.
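For reference, the BSC capacity has the well-known closed form $C = 1 - H_b(p)$, where $H_b$ is the binary entropy function (a standard result, consistent with the capacity curve sketched above); a minimal sketch:

    from math import log2

    def bsc_capacity(p):
        """Capacity of a binary symmetric channel: C = 1 - H_b(p) bits/channel use."""
        if p in (0.0, 1.0):
            return 1.0                # noiseless or deterministic-inversion channel
        hb = -p * log2(p) - (1 - p) * log2(1 - p)   # binary entropy H_b(p)
        return 1 - hb

    print(bsc_capacity(0.0), bsc_capacity(0.5), bsc_capacity(0.11))  # 1.0, 0.0, ~0.5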
Question: How do we devise codes which perform well on this channel?
Repetition Code
In this code, we take the bit to be transmitted and repeat it 2m + 1 times, for some integer
$m \ge 0$. The code consists of two possible codewords,
$$\mathcal{C} = \{\underbrace{000\cdots0}_{2m+1},\ \underbrace{111\cdots1}_{2m+1}\}$$
At the receiver, decoding is done by a majority voting scheme: if there are more 0s than 1s
in the received codeword then declare that 0 was transmitted, else 1.
Consider the case when 2m + 1 = 3 (i.e., m = 1).
Received codewords       Received codewords
decoding to message 0    decoding to message 1
        000                      111
        001                      110
        010                      101
        100                      011
It is clear that as long as the number of flipped bits is less than half the length of the repetition
codewords, it is possible to recover the message exactly.
As $m \to \infty$ we would expect the channel to flip a proportion $p < 1/2$ of the bits. In
fact, we will show that as $m \to \infty$ it is possible to make the probability that more than a
proportion $p + \epsilon$ of the bits is flipped negligibly small, for any $\epsilon > 0$ (by the weak law of large numbers).
[Figure: pmf of the proportion of flipped bits for increasing m; the distribution concentrates around p, to the left of 1/2. Axes: probability versus proportion of bits flipped.]
Therefore, $\Pr[\text{error}] \to 0$ as $m \to \infty$; however, the rate $1/(2m+1) \to 0$! Therefore, this is not an efficient code.
Shannon demonstrated that there exist codes which are capacity achieving at non-zero rates.
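A minimal Monte Carlo sketch (ours, not from the notes) of this trade-off: majority decoding on a simulated BSC(0.1) drives the estimated error probability toward zero as m grows, while the rate 1/(2m + 1) shrinks toward zero.

    import random

    def repetition_error_rate(p, m, trials=10_000):
        """Estimate Pr[error] for a (2m+1)-fold repetition code on a BSC(p)."""
        errors, n = 0, 2 * m + 1
        for _ in range(trials):
            flips = sum(random.random() < p for _ in range(n))  # bits flipped by channel
            if flips > n // 2:                                  # majority vote fails
                errors += 1
        return errors / trials

    for m in (1, 3, 10):
        print(m, 1 / (2 * m + 1), repetition_error_rate(0.1, m))
    # Pr[error] -> 0 as m grows, but so does the rate: reliability is bought with rate.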
Hamming Code
Consider forming a code with a little more algebraic structure. Define a code as the span of
a basis set over the binary field (i.e., take all additions modulo 2). A (7, 4) Hamming code is
defined as the set of all linear combinations of the four rows of the generator matrix,
$$G = \begin{bmatrix} 1 & 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 & 1 \end{bmatrix}$$
i.e., for $x \in \{0, 1\}^4$ (binary vectors of length 4), every codeword is $c = xG$. Thus, the rate of this
code is 4/7. The set of all possible codewords is
$$\mathcal{C} = \left\{\begin{matrix} 0000000 & 1110000 & 0101100 & 1011100 \\ 0011010 & 1101010 & 0110110 & 1000110 \\ 0001111 & 1111111 & 0100011 & 1010011 \\ 0010101 & 1100101 & 0111001 & 1001001 \end{matrix}\right\}$$
Let $c \in \mathcal{C}$ be represented as $c = (u_1, u_2, u_3, u_4, p_1, p_2, p_3)$.
Decoding is done by multiplying the received vector by the parity check matrix, H. The code
$\mathcal{C}$ is the kernel or null-space of $H^T$, that is, for $c \in \mathcal{C}$, $cH^T = 0$. For the (7, 4) Hamming code
presented above, the parity check matrix is
$$H = \begin{bmatrix} 1 & 0 & 1 & 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 1 & 0 & 0 & 1 \end{bmatrix}$$
This decoding essentially solves the parity check equations given by each row of H, namely
1. $u_1 + u_3 + u_4 + p_1 = 0$
2. $u_1 + u_2 + u_4 + p_2 = 0$
3. $u_2 + u_3 + u_4 + p_3 = 0$
These equations can be represented in a Venn diagram. Consider all the possible single-bit
errors. Say $p_1$ is in error. It is clear that only equation 1 will not be satisfied, while equations 2
and 3 will be. Additionally, if $u_2$ is corrupted, then equation 1 will be satisfied and equations 2
and 3 will not be. In this way the Hamming code is able to detect and correct every single-bit
error.
[Figure: Venn diagram of the three parity check equations. $u_4$ lies in all three circles; $u_1$ in Eqns. 1 and 2; $u_2$ in Eqns. 2 and 3; $u_3$ in Eqns. 1 and 3; and $p_1$, $p_2$, $p_3$ lie only in Eqns. 1, 2, and 3, respectively.]
Thus, the (7, 4) Hamming code is a single error correcting code operating at rate $4/7 \approx 0.57$.
By comparison, the single error correcting repetition code of length three operates at a rate of
1/3.
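A compact sketch (illustrative) of the encoder and syndrome decoder using the G and H above: the syndrome $s = rH^T$ of a single-bit error equals the column of H at the error position, which identifies the bit to flip.

    import itertools

    G = [[1,1,1,0,0,0,0],
         [0,1,0,1,1,0,0],
         [0,0,1,1,0,1,0],
         [0,0,0,1,1,1,1]]
    H = [[1,0,1,1,1,0,0],
         [1,1,0,1,0,1,0],
         [0,1,1,1,0,0,1]]

    def encode(x):                       # c = xG over GF(2)
        return [sum(x[i] * G[i][j] for i in range(4)) % 2 for j in range(7)]

    def decode(r):                       # syndrome decoding: s = r H^T
        s = [sum(r[j] * H[i][j] for j in range(7)) % 2 for i in range(3)]
        if any(s):
            # the syndrome equals the column of H at the error position
            j = next(j for j in range(7) if [H[i][j] for i in range(3)] == s)
            r = r[:]
            r[j] ^= 1
        return r

    # Every single-bit error in every codeword is corrected:
    for x in itertools.product([0, 1], repeat=4):
        c = encode(list(x))
        for j in range(7):
            r = c[:]
            r[j] ^= 1
            assert decode(r) == c
    print("all 16 x 7 single-bit errors corrected")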
1.3 Review of Probability Theory
In order to study information theory and coding it is necessary to have some background in
probability theory. Data transmission over noisy channels is viewed as a random experiment,
while the entropy and mutual information are computed with respect to the underlying random
variables at the source and receiver. Here we present only a brief description of some of the
essential concepts required to understand the initial sections of the course. Additional theory
will be introduced as required in the course. More complete references on probability theory
and stochastic processes can be found in undergraduate$^3$ as well as graduate$^4$ level texts.
$^3$ A. Leon-Garcia, Probability and Random Processes for Electrical Engineering, second edition, Addison-Wesley Publishing Company, Reading, MA, 1994.
$^4$ A. Papoulis and S. U. Pillai, Probability, Random Variables and Stochastic Processes, fourth edition, McGraw-Hill Companies, 2002.
1.3.1 Discrete Probability Models
Discrete probability models consist of a random experiment with a finite or countable number
of outcomes. For example: the toss of a die, the flip of a coin, the number of data packets arriving in a
time interval, etc.
The sample space, S, of the experiment is the set of all possible outcomes and contains a finite
or countable number of elements. Let $S = \{\omega_1, \omega_2, \ldots\}$.
An event is a subset of S. Events consisting of a single outcome are termed elementary events.
Example: S is the certain event and $\emptyset$ is the null or impossible event.
Note that S and $\emptyset$ are events of every sample space.
Every event $A \subseteq S$ has a real number, p(A), assigned to it which is the probability of event A.
Probabilities must satisfy the axioms:
1. $p(A) \ge 0$
2. $p(S) = 1$
3. for $A, B \subseteq S$, if $A \cap B = \emptyset$, then $p(A \cup B) = p(A) + p(B)$.
Formally, a probability space is defined by the triple $(S, \mathcal{B}, p)$ where $\mathcal{B}$ is the Borel field of all
events of S. For the countable discrete models considered here, we let $\mathcal{B}$ be the power set of S,
i.e., the set of all subsets of S. We can assign probabilities to all subsets of S so that the axioms
of probability are satisfied. (Note that this is not the case for continuous random variables, where
$\mathcal{B}$ is defined as the smallest Borel field that includes all half lines $x \le x_i$ for any $x_i$.)
Since S is countable, the probability of every event can be written in terms of the probabilities
of the elementary events $\omega_i$, namely $p(\{\omega_i\})$.
Random Variables: A random variable, $X(\omega)$, is a function which assigns a real number to
every outcome $\omega_i \in S$, i.e.,
$$X : S \to \mathbb{R}$$
Notice that X is neither random nor a variable!
Let $S_X$ be the set of all values taken by X and define
$$p_X(x) := p[\{\omega_i : X(\omega_i) = x\}]. \qquad (1.1)$$
Notation: $X(\omega)$ will be abbreviated to X to simplify notation, with the link between random
variables and the underlying random ensemble taken as given. We further abuse notation by
writing (1.1) as
$$p_X(x) := \Pr[X = x].$$
By the axioms of probability,
$$p_X(x) \ge 0, \qquad \sum_{x \in S_X} p_X(x) = 1.$$
The values $p_X(x)$ are called the probability mass function (pmf) of X.
Notation: For convenience we will abbreviate $p_X(x)$ to $p(x)$. Although this is an abuse of
notation, it will be clear from the context which random variable is referred to.
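A small sketch making the function view concrete (the two-dice example is our own, not from the notes): X maps each outcome of a toss of two fair dice to their sum, and $p_X$ collects probability over the preimage of each value.

    from fractions import Fraction
    from itertools import product

    # Sample space: ordered pairs of fair-die faces, each with probability 1/36.
    S = list(product(range(1, 7), repeat=2))

    def X(omega):                 # the random variable X: S -> R
        return omega[0] + omega[1]

    p_X = {}
    for omega in S:
        p_X[X(omega)] = p_X.get(X(omega), Fraction(0)) + Fraction(1, 36)

    print(p_X[7])                 # 1/6, the most likely sum
    print(sum(p_X.values()))      # 1, as the axioms require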
Vector Random Variables: The above can be extended to the case of vector random variables.
Vector random variables assign a vector of real numbers to each outcome of S. For example,
consider the vector random variable Z consisting of all two-tuples of the form (X, Y). This
random vector can be viewed as the combination of two random variables describing each of
the coordinates. The pmf of Z is often termed the joint pmf of X and Y and can be written as
$$p_Z(x, y) = p_{X,Y}(x, y) = \Pr(X = x, Y = y)$$
The pmfs of the coordinates, p(x) and p(y), are termed the marginals of $p_{X,Y}(x, y)$. Let $S_X$
and $S_Y$ denote the range of values of each coordinate of Z. The marginals can then be written
as
$$p(x) = \sum_{y \in S_Y} p_{X,Y}(x, y)$$
and
$$p(y) = \sum_{x \in S_X} p_{X,Y}(x, y).$$
Notation: We similarly abuse notation in the case of vector random variables and let $p_{X,Y}(x, y)$
be denoted by $p(x, y)$.
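A short sketch of marginalization over a small hypothetical joint pmf (the numbers are invented for illustration):

    from fractions import Fraction

    # A hypothetical joint pmf p(x, y) on S_X = {0, 1}, S_Y = {0, 1, 2}.
    p_XY = {(0, 0): Fraction(1, 4), (0, 1): Fraction(1, 8), (0, 2): Fraction(1, 8),
            (1, 0): Fraction(1, 8), (1, 1): Fraction(1, 4), (1, 2): Fraction(1, 8)}

    S_X = {x for x, _ in p_XY}
    S_Y = {y for _, y in p_XY}

    p_X = {x: sum(p_XY[x, y] for y in S_Y) for x in S_X}   # p(x) = sum over y of p(x, y)
    p_Y = {y: sum(p_XY[x, y] for x in S_X) for y in S_Y}   # p(y) = sum over x of p(x, y)
    print(p_X, p_Y)   # each marginal sums to 1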
1.3.2 Conditional Probability and Independence
Take two random variables X and Y defined on the same probability space. The conditional
probability mass function, $p_{X|Y}(x_k|y_j)$, describes the probability of the event $[X = x_k]$ given
that the event $[Y = y_j]$ has occurred. Formally, it can be defined as
$$p_{X|Y}(x_k|y_j) = \Pr[X = x_k \mid Y = y_j] = \frac{\Pr[X = x_k, Y = y_j]}{\Pr[Y = y_j]} = \frac{p(x_k, y_j)}{p(y_j)}$$
whenever $p(y_j) > 0$.
Notation: We will again simplify notation and represent $p_{X|Y}(x|y)$ as $p(x|y)$.
The random variables X and Y are said to be independent if their joint distribution can be
factored as
$$p(x, y) = p(x)p(y) \quad \forall (x, y) \in S_X \times S_Y.$$
For independent X and Y,
$$p(x|y) = \frac{p(x, y)}{p(y)} = \frac{p(x)p(y)}{p(y)} = p(x).$$
Notice that knowledge of Y does not impact the distribution of X. In a future lecture we will
demonstrate that when X and Y are independent, Y provides no information about X.
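Continuing the joint-pmf sketch above, the conditional pmf and the independence test follow directly from the definitions:

    # Continuing with p_XY, p_X, p_Y from the previous sketch.
    def cond(x, y):
        """p(x|y) = p(x, y) / p(y), defined whenever p(y) > 0."""
        return p_XY[x, y] / p_Y[y]

    print(cond(0, 0))   # 2/3, which differs from p_X[0] = 1/2 ...
    independent = all(p_XY[x, y] == p_X[x] * p_Y[y] for (x, y) in p_XY)
    print(independent)  # ... so this X and Y are not independent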
1.3.3 Expected Value
The expected value or mean of a random variable X is defined as
$$E[X] = \sum_{x_k \in S_X} x_k\, p(x_k).$$
The expected value of a function of X, f(X), is itself a random variable and has mean
$$E[f(X)] = \sum_{x_k \in S_X} f(x_k)\, p(x_k).$$
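A one-line sketch of both formulas, applied to the marginal $p_X$ from the joint-pmf example above, with $f(x) = x^2$ as an illustrative choice:

    # E[X] and E[f(X)] for the marginal p_X above, with f(x) = x**2 as an example.
    E_X = sum(x * p for x, p in p_X.items())
    E_f = sum(x**2 * p for x, p in p_X.items())
    print(E_X, E_f)   # both 1/2 here, since this X takes only the values 0 and 1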