
Overview

• What is data compression?
• Lossless and lossy compression
• Codes
– Fixed-length codes
– Variable-length codes
• Entropy
• Huffman Coding

What is Data Compression?
• Transformation of data into a more
compact form
– takes less space than before

• Why compress data?
– Saves storage space
– Saves transmission time over a network

Simple Example
• Suppose the ASCII code of a character is 1 byte
• Suppose we have a text file containing
one hundred instances of ‘a’
– File size would be about 100 bytes
• Let us store this as “100a” in a new file to
convey the same information
– New file size would be 4 bytes
– New size is 4/100 of the original ⇒ a 96% saving
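This trick is run-length encoding. A minimal Python sketch of the idea (the function names are illustrative, and it assumes the input text contains no digits):

import itertools
import re

def rle_compress(text):
    # Replace each run of a repeated character with "<count><char>",
    # e.g. "aaaa" -> "4a"
    return "".join(f"{len(list(g))}{ch}"
                   for ch, g in itertools.groupby(text))

def rle_decompress(encoded):
    # Invert the transformation: expand each "<count><char>" pair
    return "".join(ch * int(n)
                   for n, ch in re.findall(r"(\d+)(\D)", encoded))

assert rle_compress("a" * 100) == "100a"      # 100 bytes -> 4 bytes
assert rle_decompress("100a") == "a" * 100    # original fully recovered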

Lossless Data Compression
• The last example shows “lossless”
compression
– We can retrieve the original data by
decompression
• Lossless compression is used when data
integrity is important
• Example software
– WinZip, gzip, compress, bzip2 (and lossless
image formats such as GIF)
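A quick round-trip sketch with Python’s standard zlib module (which uses DEFLATE, the same algorithm as gzip) shows decompression restoring the data exactly:

import zlib

data = b"the quick brown fox jumps over the lazy dog " * 40
packed = zlib.compress(data)
assert zlib.decompress(packed) == data        # bit-for-bit identical
print(f"{len(data)} bytes -> {len(packed)} bytes")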

Lossy Data Compression
• “Lossy” means the original is not retrievable
– Reduces size by permanently eliminating
certain information
– After decompression, only part of the
original information remains (but the user
may not notice it)
• When can we use lossy compression?
– For audio, images, video
– E.g., JPEG, MPEG
Codes
• Ways to represent information
• The code for a character is a “codeword”
• We consider binary codes
– Each character represented by a unique
binary codeword
• Fixed-length coding
– Every character’s codeword has the same length
– E.g., ASCII, Unicode

Fixed-Length Coding
• Suppose there are n characters
• What is the minimum number of bits
needed for fixed-length coding?
⌈log₂ n⌉
• Example: {a, b, c, d, e}; 5 characters
⌈log₂ 5⌉ = ⌈2.32…⌉ = 3 bits per character
– We can have codewords: a=000, b=001,
c=010, d=011, e=100
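A sketch of such an assignment in Python (an illustrative helper, not a library function):

import math

def fixed_length_code(alphabet):
    # ceil(log2 n) bits suffice to give each of n characters
    # a distinct binary codeword
    bits = math.ceil(math.log2(len(alphabet)))
    return {ch: format(i, f"0{bits}b") for i, ch in enumerate(alphabet)}

print(fixed_length_code("abcde"))
# {'a': '000', 'b': '001', 'c': '010', 'd': '011', 'e': '100'}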

Variable-Length Coding
• Length of codewords may differ from character
to character
• Frequent characters get short codewords
• Infrequent ones get long codewords
• Example

Character    a    b    c    d    e     f
Frequency    46   13   12   16   8     5
Codeword     0    101  100  111  1101  1100
Variable-Length Coding …contd

• Make sure that a codeword does not occur
as the prefix of another codeword
• What we need is a “prefix-free code”
– The last example is a prefix-free code
• Prefix-free codes give unique decoding
– E.g., “001011101” is decoded as “aabe”
based on the table in the last example
(see the sketch below)
• Huffman coding algorithm shows how to
obtain prefix-free codes (later)
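A sketch of the unique-decoding property in Python, using the codeword table from the earlier example: scan the bit string and emit a character as soon as the buffered bits match a codeword.

code = {'a': '0', 'b': '101', 'c': '100',
        'd': '111', 'e': '1101', 'f': '1100'}
decode_table = {cw: ch for ch, cw in code.items()}

def decode(bits):
    # Because no codeword is a prefix of another, the first match
    # along the stream is always the correct one
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in decode_table:
            out.append(decode_table[buf])
            buf = ""
    return "".join(out)

assert decode("001011101") == "aabe"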
Entropy
• Entropy is the central
concept in the field of
“information theory”,
pioneered by Claude
Shannon
• Entropy shows that there
is a fundamental limit to
lossless data
compression

Entropy …contd

• The limit is called the entropy rate,
denoted by H
• H is the smallest average number of bits
per character that any lossless encoding
can use
• Let n be the size of the alphabet and p_i be
the probability of the i-th character in the
alphabet. The entropy rate is:
H = \sum_{i=1}^{n} -p_i \log_2 p_i
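In Python the formula is a short function (a sketch; the probabilities are assumed to sum to 1):

import math

def entropy(probs):
    # H = sum over i of -p_i * log2(p_i); zero-probability symbols
    # contribute nothing, so they are skipped
    return sum(-p * math.log2(p) for p in probs if p > 0)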
Example

Character    a    b    c    d    e
Frequency    4    5    2    1    1

H = -\left( \frac{4}{13}\log_2\frac{4}{13} + \frac{5}{13}\log_2\frac{5}{13}
          + \frac{2}{13}\log_2\frac{2}{13} + \frac{1}{13}\log_2\frac{1}{13}
          + \frac{1}{13}\log_2\frac{1}{13} \right) \approx 2.04

i.e., about 2.04 bits per character

Huffman Coding Algorithm
• Huffman invented a greedy method to
construct an optimal prefix-free variable-
length code
– Code based on frequency of occurrence
• Optimal code given by a full binary tree
– Every internal node has 2 children
– If |C| is the size of the alphabet, there are
|C| leaves and |C|−1 internal nodes

Huffman Coding Algorithm …contd

• We build the tree bottom-up
– Begin with |C| leaves
– Perform |C|−1 “merging” operations

• Let f [c] denote the frequency of character c
• We use a priority queue Q in which high
priority means low frequency
– GET-MIN(Q) removes the element with the
lowest frequency and returns it

Huffman Algorithm
Input: Alphabet C and frequencies f [ ]
Result: Optimal coding tree for C

HUFFMAN(C, f )
  n ← |C|
  Q ← C
  for i ← 1 to n−1
    z ← New-Node( )
    x ← z.left ← GET-MIN(Q)
    y ← z.right ← GET-MIN(Q)
    f [z] ← f [x] + f [y]
    INSERT(Q, z)
  return GET-MIN(Q)

Running time is O(n lg n)
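A runnable Python version of the same greedy loop, using heapq as the priority queue (a sketch: the (frequency, counter, node) tuples are just one way to break ties):

import heapq
import itertools

def huffman(freq):
    # freq: dict mapping character -> frequency
    tiebreak = itertools.count()        # keeps tuple comparison well-defined
    heap = [(f, next(tiebreak), ch) for ch, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        fx, _, x = heapq.heappop(heap)  # two lowest-frequency nodes
        fy, _, y = heapq.heappop(heap)
        heapq.heappush(heap, (fx + fy, next(tiebreak), (x, y)))
    return heap[0][2]                   # root of the coding tree

def codewords(tree, prefix=""):
    # Leaves are single characters; internal nodes are (left, right)
    # pairs. A left edge appends "0", a right edge appends "1".
    if isinstance(tree, str):
        return {tree: prefix or "0"}    # degenerate one-symbol alphabet
    left, right = tree
    return {**codewords(left, prefix + "0"),
            **codewords(right, prefix + "1")}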
Example
• Obtain the optimal coding for the following
using the Huffman Algorithm

Character    a    b    c    d    e    f
Frequency    45   13   12   16   9    5

Solution on the board
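As a cross-check, the sketch above can be run on these frequencies; the exact bit patterns depend on tie-breaking, but the codeword lengths (1, 3, 3, 3, 4, 4 for a, b, c, d, e, f) and the total cost come out the same:

freq = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}
codes = codewords(huffman(freq))
print(codes)                                            # one optimal prefix-free code
print(sum(freq[c] * len(w) for c, w in codes.items()))  # 224 bits in total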


Announcements
• Assignment 7
– assigned today
– due next week

• Next lecture
– Dynamic Programming

