$\bar{\ell}_1 = 3$ binary symbols.
Answer 2: What if we allow the length of each representation to vary amongst the outcomes?
For example, a Huffman code for this source would give the following representation:
Outcome   Probability   Representation 2
   0          1/2          0
   1          1/4          10
   2          1/8          110
   3          1/16         1110
   4          1/64         111100
   5          1/64         111101
   6          1/64         111110
   7          1/64         111111
Note: Each bit of each codeword can be thought of as asking a yes-or-no question about
the outcome. Bit 1 asks the question, "Did horse 0 win?" (0 = yes and 1 = no). Similarly for the
remaining bits. We will see this again when we discuss Huffman source codes in greater detail.
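To make the question-asking view concrete, here is a minimal decoder sketch (ours, not from the notes) for the code in the table above: each 1 that is read answers "no" and moves the question to the next outcome, and after four 1s the remaining two bits select among outcomes 4 through 7.

    def decode(bits):
        """Decode a stream of '0'/'1' characters for the code in the table above."""
        outcomes, i = [], 0
        while i < len(bits):
            ones = 0
            # Each '1' answers "no" to "did horse `ones` win?"; stop after four 1s.
            while ones < 4 and bits[i + ones] == '1':
                ones += 1
            if ones < 4:                      # codewords 0, 10, 110, 1110
                outcomes.append(ones)
                i += ones + 1                 # skip the 1s and the terminating 0
            else:                             # prefix 1111: two more bits pick among 4-7
                tail = bits[i + 4:i + 6]
                outcomes.append(4 + 2 * int(tail[0]) + int(tail[1]))
                i += 6
        return outcomes

    assert decode("0" "10" "111101" "110") == [0, 1, 5, 2]

Because the code is prefix-free, the decoder never needs to look ahead past the current codeword.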
The average number of binary symbols required to represent the source, $\bar{\ell}_2$, can be computed as
$$\bar{\ell}_2 = \frac{1}{2}(1) + \frac{1}{4}(2) + \frac{1}{8}(3) + \frac{1}{16}(4) + \frac{4}{64}(6) = 2 \text{ binary symbols}$$
which is less than $\bar{\ell}_1 = 3$ binary symbols. In fact, the code above provides the minimum average
codeword length of any representation for the source.
Definition: The source entropy, H(X), is defined as
$$H(X) = \sum_{x \in S_X} p(x) \log_2 \frac{1}{p(x)} \text{ bits.}$$
As we will show later in the course, the most economical representation has an average codeword length $\bar{\ell}$ satisfying
$$H(X) \le \bar{\ell} < H(X) + 1.$$
For the source considered in the example,
$$H(X) = \frac{1}{2}\log 2 + \frac{1}{4}\log 4 + \frac{1}{8}\log 8 + \frac{1}{16}\log 16 + \frac{4}{64}\log 64 = 2 \text{ bits.}$$
Thus, the above Huffman code for the source is optimal.
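As a quick numerical check of these two computations, a short sketch in Python (illustrative, not part of the original notes):

    from math import log2

    pmf = {0: 1/2, 1: 1/4, 2: 1/8, 3: 1/16, 4: 1/64, 5: 1/64, 6: 1/64, 7: 1/64}
    lengths = {0: 1, 1: 2, 2: 3, 3: 4, 4: 6, 5: 6, 6: 6, 7: 6}  # |codeword| from the table

    H = sum(p * log2(1 / p) for p in pmf.values())               # source entropy
    avg_len = sum(pmf[x] * lengths[x] for x in pmf)              # average codeword length
    print(H, avg_len)   # both equal 2.0: the code meets the entropy lower bound exactly

The bound is met with equality here because every probability is a negative power of 2.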
One-time versus average performance
Note that although $\bar{\ell}_2 < \bar{\ell}_1$, the variance of the codeword lengths in Representation 2 is larger
than in Representation 1 (where every codeword has length 3, so the variance is zero).
If we had a one-off experiment where it was only ever necessary to encode a single outcome, it is
possible that Representation 1 may outperform Representation 2.
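A short continuation of the sketch above makes the variance comparison concrete:

    # Continuing from the pmf and lengths defined in the previous sketch.
    mean2 = sum(pmf[x] * lengths[x] for x in pmf)                # = 2
    var2 = sum(pmf[x] * (lengths[x] - mean2) ** 2 for x in pmf)  # = 1.875
    var1 = 0.0                                                   # every codeword is 3 bits
    print(var1, var2)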
Example:
In the English language the frequency of the letter E is approximately 13%, nearly indepen-
dently of the writer (and it is several times more likely than the next most frequent letter). We can
develop a coder for this case which minimizes the average number of bits required to represent
the source, but it must be noted that this only minimizes the average number of bits required.
In 1939, E. V. Wright published a book of more than 50,000 words, entitled Gadsby, in which he
did not use the letter E at all! He even avoided abbreviations such as "Mr." and "Mrs." which,
when expanded, contain the letter E. Here is an excerpt:
Upon this basis I am going to show you how a bunch of bright young folks did find
a champion; a man with boys and girls of his own; a man of so dominating and happy
individuality that Youth is drawn to him as is a fly to a sugar bowl. It is a story
about a small town. It is not a gossipy yarn; nor is it a dry, monotonous account, full
of such customary "fill-ins" as "romantic moonlight casting murky shadows down a
long, winding country road." Nor will it say anything about tinklings lulling distant
folds; robins carolling at twilight, nor any "warm glow of lamplight" from a cabin
window. No. It is an account of up-and-doing activity; a vivid portrayal of Youth
as it is today; and a practical discarding of that worn-out notion that "a child don't
know anything."
In this case the encoder designed to minimize the average representation length for conventional
language would not do very well, since the behavior of the text in Gadsby is not typical.
Information theory and coding deal with the typical or expected behavior of the source.
Entropy is a measure of the average uncertainty associated with the source.
We will demonstrate later in the course that, for sequences of outcomes from a source,
nearly all of the probability mass is contained in a relatively small set termed the
typical set.
The most likely outcomes need not be typical (example: a Bernoulli source), but there is
a collection of low-probability events all with nearly the same probability.
We will show that the entropy, H(X), is a measure of the size of the typical set, i.e., H(X)
is combinatorial in nature.
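A small illustrative sketch of the Bernoulli example (the parameters are our own choice): for blocks of length 100 from a Bernoulli(0.1) source, the all-zeros sequence is the single most likely outcome, yet nearly all of the probability mass sits on sequences with roughly 10 ones.

    from math import comb

    p, n = 0.1, 100                    # Bernoulli(p) source, blocks of length n
    p_all_zeros = (1 - p) ** n         # the single most likely sequence
    # Total probability of sequences whose number of ones is close to n*p = 10:
    typical_mass = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(5, 16))
    print(p_all_zeros)    # ~2.7e-5: individually most likely, but not typical
    print(typical_mass)   # ~0.93: sequences with ~n*p ones carry almost all the mass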
1.2.2 Channel Encoder
Goal: To achieve an economical (high rate) and reliable (low probability of error) transmission
of bits over a channel.
With a channel code we add redundancy to the transmitted data sequence, which allows
for the correction of errors that are introduced by the channel.
Alternatively, we can view a channel code as imposing a structure on the set of transmitted
codewords which is exploited at the receiver to improve detection.
Example: Nature's coding: DNA and RNA contain a coded recipe to synthesize proteins in
organisms. This provides error protection in the replication of DNA.
[Figure: input vectors to a noisy channel, and the set of all possible received vectors at its output.]
Each codeword which is transmitted is corrupted by the channel. Each transmitted
codeword corresponds to a set of possible received vectors (a set of typical outcomes).
Specify a code (i.e., a set of codewords) so that at the receiver it is possible to distinguish
which element was sent with high probability (i.e., the probability of overlap of the regions is
small).
The channel coding theorem tells us the maximum number of such codewords we can
define and still maintain completely distinguishable outputs.
Shannon's Channel Coding Theorem: There is a quantity called the capacity, C, of a
channel such that for every rate R < C there exists a sequence of
$$(\underbrace{2^{nR}}_{\#\text{ codewords}},\ \underbrace{n}_{\#\text{ channel uses}})$$
codes such that $\Pr[\text{error}] \to 0$ as $n \to \infty$. Conversely, for any sequence of codes, if $\Pr[\text{error}] \to 0$ as $n \to \infty$,
then $R \le C$.
Example: Binary Symmetric Channel
[Figure: BSC transition diagram, with each input bit flipped with probability p and passed through unchanged with probability 1 − p; and a plot of the capacity C versus p, equal to 1 at p = 0 and p = 1 and falling to 0 at p = 1/2.]
Input channel alphabet = output channel alphabet = {0, 1}.
Assume independent channel uses (i.e., no memory).
The channel randomly flips the bit with probability p.
For p = 0 or p = 1, C = 1 bit/channel use (a noiseless channel or an inversion channel, respectively).
The worst case is p = 1/2, in which case the input and the output are statistically independent
and C = 0.
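For reference, the BSC capacity has the well-known closed form $C = 1 - H_b(p)$, where $H_b$ is the binary entropy function (a standard result, consistent with the capacity curve sketched above); a minimal sketch:

    from math import log2

    def bsc_capacity(p):
        """Capacity of a binary symmetric channel: C = 1 - H_b(p) bits/channel use."""
        if p in (0.0, 1.0):
            return 1.0                # noiseless or deterministic-inversion channel
        hb = -p * log2(p) - (1 - p) * log2(1 - p)   # binary entropy H_b(p)
        return 1 - hb

    print(bsc_capacity(0.0), bsc_capacity(0.5), bsc_capacity(0.11))  # 1.0, 0.0, ~0.5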
Question: How do we devise codes which perform well on this channel?
Repetition Code
In this code, we take the bit to be transmitted and repeat it 2m + 1 times, for some integer
$m \ge 0$. The code consists of two possible codewords,
$$\mathcal{C} = \{\underbrace{000\cdots0}_{2m+1},\ \underbrace{111\cdots1}_{2m+1}\}$$
At the receiver, decoding is done by a majority voting scheme: if there are more 0s than 1s
in the received codeword then declare that 0 was transmitted, else 1.
Consider the case when 2m + 1 = 3 (i.e., m = 1).
Received codewords       Received codewords
decoding to message 0    decoding to message 1
        000                      111
        001                      110
        010                      101
        100                      011
It is clear that as long as the number of flipped bits is less than half the length of the repetition
codewords, it is possible to recover the message exactly.
As $m \to \infty$ we would expect the channel to flip a proportion $p < 1/2$ of the bits. In
fact, we will show that as $m \to \infty$ it is possible to make the probability that more than a
proportion $p + \epsilon$ of the bits is flipped negligibly small, for any $\epsilon > 0$ (by the weak law of large numbers).
[Figure: pmf of the proportion of flipped bits for increasing m; the distribution concentrates around p, to the left of 1/2. Axes: probability versus proportion of bits flipped.]
Therefore, $\Pr[\text{error}] \to 0$ as $m \to \infty$; however, the rate $1/(2m+1) \to 0$! Therefore, this is not an efficient code.
Shannon demonstrated that there exist codes which are capacity achieving at non-zero rates.
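A minimal Monte Carlo sketch (ours, not from the notes) of this trade-off: majority decoding on a simulated BSC(0.1) drives the estimated error probability toward zero as m grows, while the rate 1/(2m + 1) shrinks toward zero.

    import random

    def repetition_error_rate(p, m, trials=10_000):
        """Estimate Pr[error] for a (2m+1)-fold repetition code on a BSC(p)."""
        errors, n = 0, 2 * m + 1
        for _ in range(trials):
            flips = sum(random.random() < p for _ in range(n))  # bits flipped by channel
            if flips > n // 2:                                  # majority vote fails
                errors += 1
        return errors / trials

    for m in (1, 3, 10):
        print(m, 1 / (2 * m + 1), repetition_error_rate(0.1, m))
    # Pr[error] -> 0 as m grows, but so does the rate: reliability is bought with rate.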
Hamming Code
Consider forming a code with a little more algebraic structure. Define a code as the span of
a basis set over the binary field (i.e., take all additions modulo 2). A (7, 4) Hamming code is
defined as the set of all linear combinations of the four rows of the generator matrix,
$$G = \begin{bmatrix} 1 & 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 & 1 \end{bmatrix}$$
i.e., for $x \in \{0, 1\}^4$ (binary vectors of length 4), every codeword is $c = xG$. Thus, the rate of this
code is 4/7. The set of all possible codewords is
$$\mathcal{C} = \left\{\begin{matrix} 0000000 & 1110000 & 0101100 & 1011100 \\ 0011010 & 1101010 & 0110110 & 1000110 \\ 0001111 & 1111111 & 0100011 & 1010011 \\ 0010101 & 1100101 & 0111001 & 1001001 \end{matrix}\right\}$$
Let $c \in \mathcal{C}$ be represented as $c = (u_1, u_2, u_3, u_4, p_1, p_2, p_3)$.
Decoding is done by multiplying the received vector by the parity check matrix, H. The code
$\mathcal{C}$ is the kernel or null-space of $H^T$, that is, for $c \in \mathcal{C}$, $cH^T = 0$. For the (7, 4) Hamming code
presented above, the parity check matrix is
$$H = \begin{bmatrix} 1 & 0 & 1 & 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 1 & 0 & 0 & 1 \end{bmatrix}$$
This decoding essentially solves the parity check equations given by each row of H, namely
1. $u_1 + u_3 + u_4 + p_1 = 0$
2. $u_1 + u_2 + u_4 + p_2 = 0$
3. $u_2 + u_3 + u_4 + p_3 = 0$
These equations can be represented in a Venn diagram. Consider all the possible single-bit
errors. Say $p_1$ is in error. It is clear that only equation 1 will not be satisfied, while equations 2
and 3 will be. Additionally, if $u_2$ is corrupted, then equation 1 will be satisfied and equations 2
and 3 will not be. In this way the Hamming code is able to detect and correct every single-bit
error.
[Figure: Venn diagram of the three parity check equations. $u_4$ lies in all three circles; $u_1$ in Eqns. 1 and 2; $u_2$ in Eqns. 2 and 3; $u_3$ in Eqns. 1 and 3; and $p_1$, $p_2$, $p_3$ lie only in Eqns. 1, 2, and 3, respectively.]
Thus, the (7, 4) Hamming code is a single error correcting code operating at rate $4/7 \approx 0.57$.
By comparison, the single error correcting repetition code of length three operates at a rate of
1/3.
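A compact sketch (illustrative) of the encoder and syndrome decoder using the G and H above: the syndrome $s = rH^T$ of a single-bit error equals the column of H at the error position, which identifies the bit to flip.

    import itertools

    G = [[1,1,1,0,0,0,0],
         [0,1,0,1,1,0,0],
         [0,0,1,1,0,1,0],
         [0,0,0,1,1,1,1]]
    H = [[1,0,1,1,1,0,0],
         [1,1,0,1,0,1,0],
         [0,1,1,1,0,0,1]]

    def encode(x):                       # c = xG over GF(2)
        return [sum(x[i] * G[i][j] for i in range(4)) % 2 for j in range(7)]

    def decode(r):                       # syndrome decoding: s = r H^T
        s = [sum(r[j] * H[i][j] for j in range(7)) % 2 for i in range(3)]
        if any(s):
            # the syndrome equals the column of H at the error position
            j = next(j for j in range(7) if [H[i][j] for i in range(3)] == s)
            r = r[:]
            r[j] ^= 1
        return r

    # Every single-bit error in every codeword is corrected:
    for x in itertools.product([0, 1], repeat=4):
        c = encode(list(x))
        for j in range(7):
            r = c[:]
            r[j] ^= 1
            assert decode(r) == c
    print("all 16 x 7 single-bit errors corrected")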
1.3 Review of Probability Theory
In order to study information theory and coding it is necessary to have some background in
probability theory. Data transmission over noisy channels is viewed as a random experiment,
while the entropy and mutual information are computed with respect to the underlying random
variables at the source and receiver. Here we present only a brief description of some of the
essential concepts required to understand the initial sections of the course. Additional theory
will be introduced as required in the course. More complete references on probability theory
and stochastic processes can be found in undergraduate$^3$ as well as graduate$^4$ level texts.
$^3$ A. Leon-Garcia, Probability and Random Processes for Electrical Engineering, second edition, Addison-Wesley Publishing Company, Reading, MA, 1994.
$^4$ A. Papoulis and S. U. Pillai, Probability, Random Variables and Stochastic Processes, fourth edition, McGraw-Hill Companies, 2002.
1.3.1 Discrete Probability Models
Discrete probability models consist of a random experiment with a finite or countable number
of outcomes. For example: the toss of a die, the flip of a coin, the number of data packets arriving in a
time interval, etc.
The sample space, S, of the experiment is the set of all possible outcomes and contains a finite
or countable number of elements. Let $S = \{\omega_1, \omega_2, \ldots\}$.
An event is a subset of S. Events consisting of a single outcome are termed elementary events.
Example: S is the certain event and $\emptyset$ is the null or impossible event.
Note that S and $\emptyset$ are events of every sample space.
Every event $A \subseteq S$ has a real number, p(A), assigned to it which is the probability of event A.
Probabilities must satisfy the axioms:
1. $p(A) \ge 0$
2. $p(S) = 1$
3. for $A, B \subseteq S$, if $A \cap B = \emptyset$, then $p(A \cup B) = p(A) + p(B)$.
Formally, a probability space is defined by the triple $(S, \mathcal{B}, p)$ where $\mathcal{B}$ is the Borel field of all
events of S. For the countable discrete models considered here, we let $\mathcal{B}$ be the power set of S,
i.e., the set of all subsets of S. We can assign probabilities to all subsets of S so that the axioms
of probability are satisfied. (Note that this is not the case for continuous random variables, where
$\mathcal{B}$ is defined as the smallest Borel field that includes all half lines $x \le x_i$ for any $x_i$.)
Since S is countable, the probability of every event can be written in terms of the probabilities
of the elementary events $\omega_i$, namely $p(\{\omega_i\})$.
Random Variables: A random variable, $X(\omega)$, is a function which assigns a real number to
every outcome $\omega_i \in S$, i.e.,
$$X : S \to \mathbb{R}$$
Notice that X is neither random nor a variable!
Let $S_X$ be the set of all values taken by X and define
$$p_X(x) := p[\{\omega_i : X(\omega_i) = x\}]. \qquad (1.1)$$
Notation: $X(\omega)$ will be abbreviated to X to simplify notation, with the link between random
variables and the underlying random ensemble taken as given. We further abuse notation by
writing (1.1) as
$$p_X(x) := \Pr[X = x].$$
By the axioms of probability,
$$p_X(x) \ge 0, \qquad \sum_{x \in S_X} p_X(x) = 1.$$
The values $p_X(x)$ are called the probability mass function (pmf) of X.
Notation: For convenience we will abbreviate $p_X(x)$ to $p(x)$. Although this is an abuse of
notation, it will be clear from the context which random variable is referred to.
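A small sketch making the function view concrete (the two-dice example is our own, not from the notes): X maps each outcome of a toss of two fair dice to their sum, and $p_X$ collects probability over the preimage of each value.

    from fractions import Fraction
    from itertools import product

    # Sample space: ordered pairs of fair-die faces, each with probability 1/36.
    S = list(product(range(1, 7), repeat=2))

    def X(omega):                 # the random variable X: S -> R
        return omega[0] + omega[1]

    p_X = {}
    for omega in S:
        p_X[X(omega)] = p_X.get(X(omega), Fraction(0)) + Fraction(1, 36)

    print(p_X[7])                 # 1/6, the most likely sum
    print(sum(p_X.values()))      # 1, as the axioms require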
Vector Random Variables: The above can be extended to the case of vector random variables.
Vector random variables assign a vector of real numbers to each outcome of S. For example,
consider the vector random variable Z consisting of all two-tuples of the form (X, Y). This
random vector can be viewed as the combination of two random variables describing each of
the coordinates. The pmf of Z is often termed the joint pmf of X and Y and can be written as
$$p_Z(x, y) = p_{X,Y}(x, y) = \Pr(X = x, Y = y)$$
The pmfs of the coordinates, p(x) and p(y), are termed the marginals of $p_{X,Y}(x, y)$. Let $S_X$
and $S_Y$ denote the range of values of each coordinate of Z. The marginals can then be written
as
$$p(x) = \sum_{y \in S_Y} p_{X,Y}(x, y)$$
and
$$p(y) = \sum_{x \in S_X} p_{X,Y}(x, y).$$
Notation: We similarly abuse notation in the case of vector random variables and let $p_{X,Y}(x, y)$
be denoted by $p(x, y)$.
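A short sketch of marginalization over a small hypothetical joint pmf (the numbers are invented for illustration):

    from fractions import Fraction

    # A hypothetical joint pmf p(x, y) on S_X = {0, 1}, S_Y = {0, 1, 2}.
    p_XY = {(0, 0): Fraction(1, 4), (0, 1): Fraction(1, 8), (0, 2): Fraction(1, 8),
            (1, 0): Fraction(1, 8), (1, 1): Fraction(1, 4), (1, 2): Fraction(1, 8)}

    S_X = {x for x, _ in p_XY}
    S_Y = {y for _, y in p_XY}

    p_X = {x: sum(p_XY[x, y] for y in S_Y) for x in S_X}   # p(x) = sum over y of p(x, y)
    p_Y = {y: sum(p_XY[x, y] for x in S_X) for y in S_Y}   # p(y) = sum over x of p(x, y)
    print(p_X, p_Y)   # each marginal sums to 1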
1.3.2 Conditional Probability and Independence
Take two random variables X and Y defined on the same probability space. The conditional
probability mass function, $p_{X|Y}(x_k|y_j)$, describes the probability of the event $[X = x_k]$ given
that the event $[Y = y_j]$ has occurred. Formally, it can be defined as
$$p_{X|Y}(x_k|y_j) = \Pr[X = x_k \mid Y = y_j] = \frac{\Pr[X = x_k, Y = y_j]}{\Pr[Y = y_j]} = \frac{p(x_k, y_j)}{p(y_j)}$$
whenever $p(y_j) > 0$.
Notation: We will again simplify notation and represent $p_{X|Y}(x|y)$ as $p(x|y)$.
The random variables X and Y are said to be independent if their joint distribution can be
factored as
$$p(x, y) = p(x)p(y) \quad \forall (x, y) \in S_X \times S_Y.$$
For independent X and Y,
$$p(x|y) = \frac{p(x, y)}{p(y)} = \frac{p(x)p(y)}{p(y)} = p(x).$$
Notice that knowledge of Y does not impact the distribution of X. In a future lecture we will
demonstrate that when X and Y are independent, Y provides no information about X.
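Continuing the joint-pmf sketch above, the conditional pmf and the independence test follow directly from the definitions:

    # Continuing with p_XY, p_X, p_Y from the previous sketch.
    def cond(x, y):
        """p(x|y) = p(x, y) / p(y), defined whenever p(y) > 0."""
        return p_XY[x, y] / p_Y[y]

    print(cond(0, 0))   # 2/3, which differs from p_X[0] = 1/2 ...
    independent = all(p_XY[x, y] == p_X[x] * p_Y[y] for (x, y) in p_XY)
    print(independent)  # ... so this X and Y are not independent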
1.3.3 Expected Value
The expected value or mean of a random variable X is defined as
$$E[X] = \sum_{x_k \in S_X} x_k\, p(x_k).$$
The expected value of a function of X, f(X), is itself a random variable and has mean
$$E[f(X)] = \sum_{x_k \in S_X} f(x_k)\, p(x_k).$$
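A one-line sketch of both formulas, applied to the marginal $p_X$ from the joint-pmf example above, with $f(x) = x^2$ as an illustrative choice:

    # E[X] and E[f(X)] for the marginal p_X above, with f(x) = x**2 as an example.
    E_X = sum(x * p for x, p in p_X.items())
    E_f = sum(x**2 * p for x, p in p_X.items())
    print(E_X, E_f)   # both 1/2 here, since this X takes only the values 0 and 1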