
Overview

• What is data compression?
• Lossless and lossy compression
• Codes
– Fixed-length codes
– Variable-length codes
• Entropy
• Huffman Coding

What is Data Compression?
• Transformation of data into a more
compact form
– takes less space than before

• Why compress data?
– Saves storage space
– Saves transmission time over a network

Simple Example
• Suppose the ASCII code of a character is 1 byte
• Suppose we have a text file containing
one hundred instances of ‘a’
– File size would be about 100 bytes
• Let us store this as “100a” in a new file to
convey the same information
– New file size would be 4 bytes
– New size is 4/100 of the original ⇒ a 96% saving
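This trick is run-length encoding. A minimal Python sketch of the idea (the function names are illustrative, and it assumes the input text contains no digits):

import itertools
import re

def rle_compress(text):
    # Replace each run of a repeated character with "<count><char>",
    # e.g. "aaaa" -> "4a"
    return "".join(f"{len(list(g))}{ch}"
                   for ch, g in itertools.groupby(text))

def rle_decompress(encoded):
    # Invert the transformation: expand each "<count><char>" pair
    return "".join(ch * int(n)
                   for n, ch in re.findall(r"(\d+)(\D)", encoded))

assert rle_compress("a" * 100) == "100a"      # 100 bytes -> 4 bytes
assert rle_decompress("100a") == "a" * 100    # original fully recovered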

Lossless Data Compression
• The last example shows “lossless”
compression
– We can retrieve the original data by
decompression
• Lossless compression is used when data
integrity is important
• Example software
– WinZip, gzip, compress, bzip2 (and lossless
image formats such as GIF)
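A quick round-trip sketch with Python’s standard zlib module (which uses DEFLATE, the same algorithm as gzip) shows decompression restoring the data exactly:

import zlib

data = b"the quick brown fox jumps over the lazy dog " * 40
packed = zlib.compress(data)
assert zlib.decompress(packed) == data        # bit-for-bit identical
print(f"{len(data)} bytes -> {len(packed)} bytes")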

Lossy Data Compression
• “Lossy” means the original is not retrievable
– Reduces size by permanently eliminating
certain information
– After decompression, only part of the
original information remains (but the user
may not notice it)
• When can we use lossy compression?
– For audio, images, video
– E.g., JPEG, MPEG
Codes
• Ways to represent information
• The code for a character is a “codeword”
• We consider binary codes
– Each character represented by a unique
binary codeword
• Fixed-length coding
– Every character’s codeword has the same length
– E.g., ASCII, Unicode

Fixed-Length Coding
• Suppose there are n characters
• What is the minimum number of bits
needed for fixed-length coding?
⌈log₂ n⌉
• Example: {a, b, c, d, e}; 5 characters
⌈log₂ 5⌉ = ⌈2.32…⌉ = 3 bits per character
– We can have codewords: a=000, b=001,
c=010, d=011, e=100
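A sketch of such an assignment in Python (an illustrative helper, not a library function):

import math

def fixed_length_code(alphabet):
    # ceil(log2 n) bits suffice to give each of n characters
    # a distinct binary codeword
    bits = math.ceil(math.log2(len(alphabet)))
    return {ch: format(i, f"0{bits}b") for i, ch in enumerate(alphabet)}

print(fixed_length_code("abcde"))
# {'a': '000', 'b': '001', 'c': '010', 'd': '011', 'e': '100'}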

Variable-Length Coding
• Length of codewords may differ from character
to character
• Frequent characters get short codewords
• Infrequent ones get long codewords
• Example

Character    a    b    c    d    e     f
Frequency    46   13   12   16   8     5
Codeword     0    101  100  111  1101  1100
Variable-Length Coding …contd

• Make sure that a codeword does not occur
as the prefix of another codeword
• What we need is a “prefix-free code”
– The last example is a prefix-free code
• Prefix-free codes give unique decoding
– E.g., “001011101” is decoded as “aabe”
based on the table in the last example
(see the sketch below)
• Huffman coding algorithm shows how to
obtain prefix-free codes (later)
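A sketch of the unique-decoding property in Python, using the codeword table from the earlier example: scan the bit string and emit a character as soon as the buffered bits match a codeword.

code = {'a': '0', 'b': '101', 'c': '100',
        'd': '111', 'e': '1101', 'f': '1100'}
decode_table = {cw: ch for ch, cw in code.items()}

def decode(bits):
    # Because no codeword is a prefix of another, the first match
    # along the stream is always the correct one
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in decode_table:
            out.append(decode_table[buf])
            buf = ""
    return "".join(out)

assert decode("001011101") == "aabe"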
Entropy
• Entropy is the central
concept in the field of
“information theory”,
pioneered by Claude
Shannon
• Entropy shows that there
is a fundamental limit to
lossless data
compression

Entropy …contd

• The limit is called the entropy rate,
denoted by H
• H is the smallest average number of bits
per character that any lossless encoding
can use
• Let n be the size of the alphabet and p_i be
the probability of the i-th character in the
alphabet. The entropy rate is:
H = \sum_{i=1}^{n} -p_i \log_2 p_i
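In Python the formula is a short function (a sketch; the probabilities are assumed to sum to 1):

import math

def entropy(probs):
    # H = sum over i of -p_i * log2(p_i); zero-probability symbols
    # contribute nothing, so they are skipped
    return sum(-p * math.log2(p) for p in probs if p > 0)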
Example

Character    a    b    c    d    e
Frequency    4    5    2    1    1

H = -\left( \frac{4}{13}\log_2\frac{4}{13} + \frac{5}{13}\log_2\frac{5}{13}
          + \frac{2}{13}\log_2\frac{2}{13} + \frac{1}{13}\log_2\frac{1}{13}
          + \frac{1}{13}\log_2\frac{1}{13} \right) \approx 2.04

i.e., about 2.04 bits per character

Huffman Coding Algorithm
• Huffman invented a greedy method to
construct an optimal prefix-free variable-
length code
– Code based on frequency of occurrence
• Optimal code given by a full binary tree
– Every internal node has 2 children
– If |C| is the size of the alphabet, there are
|C| leaves and |C|−1 internal nodes

Huffman Coding Algorithm …contd

• We build the tree bottom-up
– Begin with |C| leaves
– Perform |C|−1 “merging” operations

• Let f [c] denote the frequency of character c
• We use a priority queue Q in which high
priority means low frequency
– GET-MIN(Q) removes the element with the
lowest frequency and returns it

Huffman Algorithm
Input: Alphabet C and frequencies f [ ]
Result: Optimal coding tree for C

HUFFMAN(C, f )
  n ← |C|
  Q ← C
  for i ← 1 to n−1
    z ← New-Node( )
    x ← z.left ← GET-MIN(Q)
    y ← z.right ← GET-MIN(Q)
    f [z] ← f [x] + f [y]
    INSERT(Q, z)
  return GET-MIN(Q)

Running time is O(n lg n)
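A runnable Python version of the same greedy loop, using heapq as the priority queue (a sketch: the (frequency, counter, node) tuples are just one way to break ties):

import heapq
import itertools

def huffman(freq):
    # freq: dict mapping character -> frequency
    tiebreak = itertools.count()        # keeps tuple comparison well-defined
    heap = [(f, next(tiebreak), ch) for ch, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        fx, _, x = heapq.heappop(heap)  # two lowest-frequency nodes
        fy, _, y = heapq.heappop(heap)
        heapq.heappush(heap, (fx + fy, next(tiebreak), (x, y)))
    return heap[0][2]                   # root of the coding tree

def codewords(tree, prefix=""):
    # Leaves are single characters; internal nodes are (left, right)
    # pairs. A left edge appends "0", a right edge appends "1".
    if isinstance(tree, str):
        return {tree: prefix or "0"}    # degenerate one-symbol alphabet
    left, right = tree
    return {**codewords(left, prefix + "0"),
            **codewords(right, prefix + "1")}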
Example
• Obtain the optimal coding for the following
using the Huffman Algorithm

Character    a    b    c    d    e    f
Frequency    45   13   12   16   9    5

Solution on the board
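As a cross-check, the sketch above can be run on these frequencies; the exact bit patterns depend on tie-breaking, but the codeword lengths (1, 3, 3, 3, 4, 4 for a, b, c, d, e, f) and the total cost come out the same:

freq = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}
codes = codewords(huffman(freq))
print(codes)                                            # one optimal prefix-free code
print(sum(freq[c] * len(w) for c, w in codes.items()))  # 224 bits in total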


Announcements
• Assignment 7
– assigned today
– due next week

• Next lecture
– Dynamic Programming

