
Data Compression

By Ziya Arnavut Department of Computer and Information Sciences SUNY Fredonia

12-20-2011

A tremendous amount of data is communicated every day.


Example:
World-Wide-Web: many people surf the net, and people communicate over the Internet using software such as Skype.

The transmission time is related to

a) the amount (size) of the data, and
b) the channel's capacity.

Can we reduce the transmission time?


Of course:
a) Reduce the size of the data. How? Use a suitable compression technique.
b) Increase the channel capacity. For example: 100 MB → 1 GB.
c) Or, utilize both (a) and (b).

Can we reduce the size of data?


Yes, using compression techniques. There are two main classes of data compression techniques:

1. Lossless (noiseless) techniques.
   Examples: text, medical and remote-sensing imagery.

2. Lossy (noisy) techniques.
   Examples: images, sound and some other multimedia applications.

Lossy Techniques
Data compressed by lossy techniques are not exactly recoverable. In many applications this helps to increase the channel throughput. For example, a JPEG image may differ significantly from the original, but this does not cause a problem, since humans can often comprehend an image even when there is noise. Hence, depending on the application, lossy techniques may be used to increase the channel throughput.

Original Image Size on disk: 2.25 Mbytes

85% JPEG compressed image. Size on disk: 267460 Bytes

A Lossy + Lossless technique: Color-mapped (Palette) Images


Images are often acquired with a large number of distinct colors. To reduce transmission, storage or, most often, display requirements, it is sometimes necessary to restrict the number of colors in an image. This color-reduction step is known as color quantization, and several techniques have been proposed for it. A color-quantized image is a matrix of indices, where each index i corresponds to a triplet (Ri, Gi, Bi) in the color-map table of the image. Color-quantized images are also known as pseudo-color, color-mapped or palette images.
Index   0    1    2    3    4   ...  254  255
R      28   19   34   39   44   ...  193  206
G       0    2    1    2    0   ...  211  212
B       1    5    1    3    2   ...  223  222

Examples: Graphic Interchange Format (GIF) compressed images

Frymier image: size 1,238,678 bytes; GIF: 229,930 bytes
Yahoo image: size 28,110 bytes; GIF: 6,968 bytes
Ben and Jerry image: size 28,326 bytes; GIF: 4,387 bytes

Lossless Coding
A lossless compression scheme has two components:
1) Modeling
2) Coding

First, I will address Coding


Let A be an alphabet, a collection of distinct symbols. Let S = s1 s2 ... sn be a sequence over the alphabet A; that is, S is a data string. Example: over the English alphabet {a..z, A..Z, space}, "This is an example" is a data string.

Assigning binary sequences to individual alphabet elements is called encoding. The set of binary sequences resulting from an encoding is called a code, C = {c1, c2, ..., cn}. An element of a code is called a code-word (i.e., ci ∈ C). For example, the ASCII code consists of 128 code-words. Each code-word has 7 bits. An 8th bit is appended for parity checking or other control purposes. Example: A <=> 1000001, B <=> 1000010.

Fixed Length Code:


Example: ASCII codes.

Do we gain by using fixed length codes?


No! Why?

Use techniques similar to Morse telegraph codes. That is, assign short code-words (fewer bits) to the characters that appear more often, and longer code-words to the characters that appear less frequently.

This is called variable length coding.

Question: Why does this work?


Most often, the frequency distribution of the letters in a data string is far from uniform. Example: in English text the most frequently occurring letter is 'e'.

If a data source or string has a uniform distribution, variable length coding techniques do not help. For an independent data source S with probabilities of occurrence p(s1), ..., p(sm), the zero-order entropy is

H(S) = -Σi p(si) * log2 p(si)  bps

The entropy of a source yields a lower bound on the encoding cost. Two well-known variable length coding techniques are Huffman and Arithmetic Coding. They can code a data string close to or equal to its entropy. Example:
Consider a data string with 64 characters from the alphabet {a, ..., z}:
S = aaaaaaaaaabbbbbbbccegiidffgiiaaabaaaabbccccaaccaaaabbaaaaaaabeee
The zero-order entropy of this string is 2.26 bps. Hence, at best we can code the data with
64 * 2.26 = 144.64 bits
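As a sanity check, the zero-order entropy of this string can be reproduced with a short Python script (a minimal sketch; the string and the 2.26 bps figure come from the example above):

from collections import Counter
from math import log2

S = ("aaaaaaaaaabbbbbbbccegiidffgii"
     "aaabaaaabbccccaaccaaaabbaaaaaaabeee")

counts = Counter(S)        # symbol frequencies
n = len(S)                 # 64 characters
# zero-order entropy: H(S) = -sum p(s) * log2 p(s), in bits per symbol
H = -sum((c / n) * log2(c / n) for c in counts.values())

print(n, H, n * H)         # 64, roughly 2.26 bps, roughly 144.7 bits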

Huffman Coding
Merge the two least probable characters, and repeat this process until there is only one character remaining.

For our string, here is the probability distribution:

si      a      b      c      d      e      i      g      f      Total
Freq    30     13     8      1      4      4      2      2      64
P(si)   0.469  0.203  0.125  0.016  0.063  0.063  0.031  0.031  1.000

After sorting by probability:

si      a      b      c      e      i      g      f      d
P(si)   0.469  0.203  0.125  0.063  0.063  0.031  0.031  0.016

Building Huffman Tree

Repeatedly merge the two smallest probabilities:
f (0.031) + d (0.016) = 0.047
g (0.031) + 0.047 = 0.078
e (0.063) + i (0.063) = 0.126
c (0.125) + 0.078 = 0.203
0.126 + 0.203 (the c branch) = 0.329
b (0.203) + 0.329 = 0.532
a (0.469) + 0.532 = 1.000

Huffman Tree & Code

si     P(si)    Code      Bits
a      0.469    1         30
b      0.203    01        26
c      0.125    0000      32
e      0.063    0010      16
i      0.063    0011      16
g      0.031    00011     10
f      0.031    000100    12
d      0.016    000101     6
Total                    148

Hence, the Huffman code uses 148 bits (about 2.31 bps), against the entropy bound of 144.64 bits; entropy gives us a lower bound on the number of bits needed to encode a data string.
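The same cost can be reproduced programmatically. Below is a minimal heap-based Huffman sketch in Python (symbol frequencies taken from the table above; tie-breaking may produce a different but equally cheap set of code-words):

import heapq

freq = {'a': 30, 'b': 13, 'c': 8, 'e': 4, 'i': 4, 'g': 2, 'f': 2, 'd': 1}

# Each heap entry: (weight, tie-breaker, {symbol: code_so_far})
heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
heapq.heapify(heap)
counter = len(heap)

while len(heap) > 1:
    w1, _, c1 = heapq.heappop(heap)   # the two least probable subtrees
    w2, _, c2 = heapq.heappop(heap)
    merged = {s: "0" + code for s, code in c1.items()}
    merged.update({s: "1" + code for s, code in c2.items()})
    heapq.heappush(heap, (w1 + w2, counter, merged))
    counter += 1

codes = heap[0][2]
total_bits = sum(freq[s] * len(codes[s]) for s in freq)
print(codes)
print(total_bits, total_bits / 64)    # 148 bits, about 2.31 bps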

Can we do better? Consider attaching integer values to the symbols in S, and then apply the difference operator:

HS = s1, s2 - s1, s3 - s2, ..., s64 - s63.

Note: S can be recovered easily from HS. However, the frequency distribution of the new sequence HS is

letter      0   1   2  -1  -2   3  -5  -8
frequency  42   9   6   2   1   1   1   1

Using Huffman coding we may encode HS as follows:


0 <=> 0, 1 <=> 10, 2 <=> 110, -1 <=> 1110, -2 <=> 111100, 3 <=> 111101, -8 <=> 111110, and -5 <=> 111111.

Total bits = 42*1 + 9*2 + ... + 1*6 + 1*6 + 1*6 = 110, a rate of r = 110/64 = 1.71875 bps. The H operator is an example of a decorrelation step, which is used in the modeling of data.
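A minimal sketch of the H (difference) operator and its inverse, assuming the symbols of S have already been mapped to integers (e.g. a = 0, b = 1, ...):

def h_forward(s):
    # difference (decorrelation) operator: keep the first value,
    # then store successive differences
    return [s[0]] + [s[i] - s[i - 1] for i in range(1, len(s))]

def h_inverse(hs):
    # recover the original sequence by a running (prefix) sum
    s = [hs[0]]
    for d in hs[1:]:
        s.append(s[-1] + d)
    return s

values = [ord(c) - ord('a') for c in "aaabbaacc"]   # toy input
assert h_inverse(h_forward(values)) == values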

The aim of the decorrelation step is to remove redundancies from the data. Note the further gain of approximately 0.60 bps. Hence, by considering relationships among the data elements, we can obtain better compression. Research in lossless compression has focused on modeling the data source in order to exploit the correlation among data elements.

Arithmetic Coding
Suppose we have an alphabet {a, b, c}, with probabilities of occurrence (0.7, 0.1, 0.2). Each symbol may be assigned the following range based on its probability:

Sample symbol ranges:

Symbol   Probability   Range
a        70%           [0.00, 0.70)
b        10%           [0.70, 0.80)
c        20%           [0.80, 1.00)

Encoding with Arithmetic Coder


The pseudo code below illustrates how additional symbols may be added to an encoded string by restricting the string's range bounds.
lower bound = 0
upper bound = 1
while there are still symbols to encode
    current range = upper bound - lower bound
    upper bound = lower bound + (current range * upper bound of new symbol)
    lower bound = lower bound + (current range * lower bound of new symbol)
end while

Any value between the computed lower and upper probability bounds now encodes the input string.

Example: to encode "abc"

Encode 'a':
current range = 1 - 0 = 1
upper bound = 0 + (1 * 0.7) = 0.7
lower bound = 0 + (1 * 0.0) = 0.0

Encode 'b':
current range = 0.7 - 0.0 = 0.7
upper bound = 0.0 + (0.7 * 0.80) = 0.56
lower bound = 0.0 + (0.7 * 0.70) = 0.49

Encode 'c':
current range = 0.56 - 0.49 = 0.07
upper bound = 0.49 + (0.07 * 1.00) = 0.56
lower bound = 0.49 + (0.07 * 0.80) = 0.546

The string "abc" may be encoded by any value within the probability range [0.546, 0.56). For example with 0.55.

[Figure: encoding the string "abc" with AC. The interval narrows from [0.0, 1.0) to [0.0, 0.7) for 'a', to [0.49, 0.56) for 'b', and finally to [0.546, 0.56) for 'c'.]

Decoding Strings
encoded value = encoded input
while string is not fully decoded
    identify the symbol containing encoded value within its range
    // remove the effects of the symbol from the encoded value
    current range = upper bound of new symbol - lower bound of new symbol
    encoded value = (encoded value - lower bound of new symbol) / current range
end while
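A matching decoder sketch in Python (the message length is passed in explicitly, since this toy example has no end-of-message symbol):

RANGES = {'a': (0.0, 0.7), 'b': (0.7, 0.8), 'c': (0.8, 1.0)}

def arith_decode(value, length):
    out = []
    for _ in range(length):
        # find the symbol whose range contains the current value
        for sym, (sym_low, sym_high) in RANGES.items():
            if sym_low <= value < sym_high:
                out.append(sym)
                # remove the symbol's effect and rescale the value
                value = (value - sym_low) / (sym_high - sym_low)
                break
    return "".join(out)

print(arith_decode(0.55, 3))               # abc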

[Figure: decoding 0.55 back to "abc" on the same symbol ranges. 0.55 falls in a's range; the rescaled value then falls in b's range, and finally in c's range.]

Universal Lossless Compressors


Dictionary Based Algorithm (Ziv-Lempel)

Encoding Algorithm:
1. Initialize the dictionary to contain all blocks of length one (D = {a, b}).
2. Search for the longest block W which has appeared in the dictionary.
3. Encode W by its index in the dictionary.
4. Add W followed by the first symbol of the next block to the dictionary.
5. Go to Step 2.

The following example illustrates how the encoding is performed.


Data:   a b b a a b b a a b a b b a a a a b a a b b a
Output: 0 1 1 0 2 4 2 6 5 5 7 3 0

Dictionary:
Index  Entry     Index  Entry
0      a         7      baa
1      b         8      aba
2      ab        9      abba
3      bb        10     aaa
4      ba        11     aab
5      aa        12     baab
6      abb       13     bba
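A minimal sketch of this dictionary scheme in Python (a greedy-parsing variant close to LZ78/LZW; here the dictionary grows without bound, unlike the bounded dictionaries discussed next):

def lz_encode(data, alphabet=("a", "b")):
    dictionary = {s: i for i, s in enumerate(alphabet)}
    output, i = [], 0
    while i < len(data):
        # Step 2: find the longest block W starting at i that is in the dictionary
        w = data[i]
        while i + len(w) < len(data) and data[i:i + len(w) + 1] in dictionary:
            w = data[i:i + len(w) + 1]
        # Step 3: encode W by its index
        output.append(dictionary[w])
        # Step 4: add W followed by the first symbol of the next block
        if i + len(w) < len(data):
            dictionary[data[i:i + len(w) + 1]] = len(dictionary)
        i += len(w)
    return output, dictionary

codes, d = lz_encode("abbaabbaababbaaaabaabba")
print(codes)   # [0, 1, 1, 0, 2, 4, 2, 6, 5, 5, 7, 3, 0]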

The size of the dictionary can grow infinitely large.


In practice, the dictionary size is limited. Once the limit is reached, no more entries are added.
For example, a dictionary of size 4096.
This corresponds to 12 bits per index.

Various implementations of the Ziv-Lempel algorithm exist.

Gzip (GNU zip) is free software available on the Internet.

Burrows-Wheeler Transformation
Let w = [3, 1, 3, 1, 2] be a data string. Construct

      3 1 3 1 2
      1 3 1 2 3
M =   3 1 2 3 1
      1 2 3 1 3
      2 3 1 3 1

by forming the successive rows of M as consecutive cyclic left-shifts of w.

By sorting the rows of M lexically we transform it to

      1 2 3 1 3
      1 3 1 2 3
M =   2 3 1 3 1
      3 1 2 3 1
      3 1 3 1 2

Let the last column of M be denoted by L.

Note that the original data string w is the 5th row of M. Given the row index I = 5 of w in M and L = [3, 3, 1, 1, 2], we can recover w. How? Sorting L gives the first column of M; since each row is a cyclic shift of w, the symbol in L precedes the symbol in the first column of the same row, and repeating this argument fills in the remaining columns of M, from which row I gives back w.
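A minimal sketch of the transform and its inverse in Python (the inverse uses the rebuild-by-sorting argument sketched above):

def bwt_forward(w):
    n = len(w)
    rotations = [w[i:] + w[:i] for i in range(n)]   # cyclic left-shifts
    rotations.sort()                                 # lexical sort of the rows
    L = [row[-1] for row in rotations]               # last column
    I = rotations.index(w) + 1                       # 1-based row index of w
    return L, I

def bwt_inverse(L, I):
    n = len(L)
    table = [[] for _ in range(n)]
    for _ in range(n):
        # prepend L to the partially rebuilt rows and re-sort
        table = sorted([L[i]] + table[i] for i in range(n))
    return table[I - 1]

L, I = bwt_forward([3, 1, 3, 1, 2])
print(L, I)               # [3, 3, 1, 1, 2] 5
print(bwt_inverse(L, I))  # [3, 1, 3, 1, 2]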

Is the transformation enough?


Of course not!

The transformation groups similar elements near one another. To achieve better compression we need to apply some other technique, like the H operator. In this case we can use Move-to-Front (recency rank) coding, or inversion coding/transformation.

Move To Front Coding (Recency Ranking)


Introduced by Bentley et al. (1986) and independently discovered by Elias (1987). Move-to-Front coding is an adaptive technique, used when the data have locality of reference. When an MTF coder is implemented for an 8-bit data string, the initial list is the identity permutation of the set {0, ..., 255}.

Example: Let {a, b, c, d} be our alphabet and let S = bbbaaaddddccc be our data string. MTF encoding is performed as follows:

Input:         b     b     b     a     a     a     d     ...
Current list:  abcd  bacd  bacd  bacd  abcd  abcd  abcd  ...
Output:        1     0     0     1     0     0     3     ...

Output: 1001003000300

MTF decoding of 1001003000300 works the same way:

Input:         1     0     0     1     0     0     3     ...
Current list:  abcd  bacd  bacd  bacd  abcd  abcd  abcd  ...
Output:        b     b     b     a     a     a     d     ...

Output: bbbaaaddddccc

Why do we use MTF? If the data have locality of reference, the MTF-transformed data has a more skewed distribution, which is cheaper to encode.
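A minimal MTF encode/decode sketch over this four-symbol alphabet (the 8-bit case simply starts from the list 0, ..., 255):

def mtf_encode(data, alphabet="abcd"):
    lst = list(alphabet)
    out = []
    for ch in data:
        i = lst.index(ch)          # rank of the symbol in the current list
        out.append(i)
        lst.insert(0, lst.pop(i))  # move the symbol to the front
    return out

def mtf_decode(ranks, alphabet="abcd"):
    lst = list(alphabet)
    out = []
    for i in ranks:
        ch = lst[i]
        out.append(ch)
        lst.insert(0, lst.pop(i))
    return "".join(out)

ranks = mtf_encode("bbbaaaddddccc")
print(ranks)              # [1, 0, 0, 1, 0, 0, 3, 0, 0, 0, 3, 0, 0]
print(mtf_decode(ranks))  # bbbaaaddddccc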

Linear Index Permutation (LIT)


For example, let w = [1, 3, 1, 3, 2]:

           indices       data
Original   1 2 3 4 5     1 3 1 3 2
l          1 4 2 5 3     1 3 1 3 2
l-1        1 3 5 2 4     1 1 2 3 3

Note that the inverse permutation of l is l-1 = [1, 3, 5, 2, 4]. l-1 is called the canonical sorting permutation of w. The elements of w are sorted in non-decreasing order by l-1, and the sorted sequence consists of m blocks of different sizes (one block per distinct symbol). Sorted data can be encoded cheaply.

Hence, the problem is to encode the canonical sorting permutation. Interval ranking assigns to each element (except the first appearance of a symbol) a rank that is simply the count of elements between two successive occurrences of the same symbol. Let H be the difference operator on a sequence; then it is easy to prove that for the first-order entropy, H(H l-1) ≤ H(Interval Rank).

Inversion Coding
Let π = [π1, π2, ..., πn] be an arbitrary permutation of an n-set S of positive integers. The Left Bigger (LB) inversion vector associated with π is the sequence [I1, I2, ..., In] of non-negative integers defined by Ik = |{j : 1 ≤ j < k ≤ n and πj > πk}|. Example: let [1, 3, 5, 2, 4] be a permutation of the set {1, 2, 3, 4, 5}. The LB inversion technique yields [0, 0, 0, 2, 1].
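A small sketch computing the LB inversion vector (quadratic for clarity; it reproduces the example above):

def lb_inversion_vector(perm):
    # I_k = number of elements to the left of position k that are bigger than perm[k]
    return [sum(1 for j in range(k) if perm[j] > perm[k])
            for k in range(len(perm))]

print(lb_inversion_vector([1, 3, 5, 2, 4]))   # [0, 0, 0, 2, 1]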

When the H (difference) operator is applied to an inversion vector, all but m-1 of the values (recall that there are m blocks) are positive (or all negative). The value |It - It-1 - 1| counts how many bigger (or smaller) elements occur between the previous and the most recent occurrence of a symbol from the alphabet. Hence, decorrelating the inversion vector elements yields a value called the inversion rank (distance).

Elias proved that recency ranking (MTF) yields better compression than interval ranking. It is easy to prove that inversion ranking also yields better compression than interval ranking. While it is theoretically hard to relate MTF and inversion coding, simulation results have shown that inversion coding yields better compression than MTF coding.

The bzip2 compression scheme utilizes

1) the BWT transformation,
2) an MTF coder, and
3) variable length coding (a Huffman coder).

Currently, bzip2 is one of the best universal compression schemes. My contributions in this area:
Theoretical settings of the BWT (1997)
A new and faster transformation than the BWT, the Linear Order Transformation (1999)
Inversion coding for large data files (2004)
BWIC is available from www.cs.fredonia.edu/arnavut/research.html. It yields better compression than bzip2 on several different kinds of data files, for example large text files, pseudo-color images, audio files and images.

Results in bps using an arithmetic coder:

Data File   Size      MTF    IC     BSC    BSWIC
Bib         111261    5.94   5.68   2.11   2.17
Book1       768768    5.12   4.84   2.61   2.52
Book2       610856    5.24   4.95   2.22   2.19
Geo         102400    6.03   6.16   4.83   4.97
News        377109    5.55   5.32   2.65   2.70
Obj1        21504     6.06   5.70   4.02   4.30
Obj2        246814    6.15   6.09   2.58   2.77
Paper1      53161     5.46   5.42   2.65   2.74
Paper2      82199     5.23   5.13   2.61   2.65
Pic         513216    1.09   1.03   0.84   0.81
Progc       39611     5.59   5.67   2.67   2.82
Progl       71646     4.93   4.96   1.88   1.91
Progp       49379     5.12   5.30   1.86   1.96
Trans       93695     5.55   5.49   1.63   1.77
Bible       4047392   5.04   4.56   1.71   1.62
Calag.tar   3276813   4.75   4.45   2.44   2.28
E.coli      4638690   2.25   2.14   2.21   2.10
World192    2473400   5.34   5.03   1.49   1.47
Avg.                  5.02   4.88   2.39   2.43
W. Avg.               4.23   3.96   2.04   1.96

THANK YOU!

Questions?
