Lossy Techniques
Data compressed by lossy techniques are not exactly recoverable. In many applications this is acceptable, and the feature helps to increase the channel throughput. For example, a JPEG image may differ significantly from the original, yet this causes no problem, since humans can often comprehend an image even in the presence of noise. Hence, depending on the application, lossy techniques may be used to increase the channel throughput.
Yahoo image: 28,110 bytes original, 6,968 bytes as GIF
Ben and Jerry image: 28,326 bytes original, 4,387 bytes as GIF
Lossless Coding
A lossless compression scheme has two components:
1) Modeling
2) Coding
Assigning binary sequences to individual alphabet elements is called encoding. The set of binary sequences resulting from an encoding is called a code, C = {c1, c2, ..., cn}. An element of a code is called a code-word (i.e., ci ∈ C). For example, the ASCII code consists of 128 code-words, each 7 bits long; an 8th bit is appended for parity checking or other control purposes. Example: A <=> 1000001, B <=> 1000010.
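For instance, the two code-words above can be checked with a couple of lines of Python:

# Print the 7-bit ASCII code-word for each symbol.
for symbol in "AB":
    print(symbol, format(ord(symbol), "07b"))  # A 1000001, B 1000010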
If a data source or string has a uniform distribution, variable-length coding techniques do not help. For an independent data source S with probabilities of occurrence p(s1), ..., p(sm), the zero-order entropy is
H(S) = -Σi p(si) * log2 p(si)  bps
The entropy of a source yields a lower bound on the encoding cost. Two well-known variable-length coding techniques are Huffman and Arithmetic Coding. They can code a data string close to or equal to its entropy. Example:
Consider a data string with 64 characters from the alphabet {a, ..., z}: S = aaaaaaaaaabbbbbbbccegiidffgiiaaabaaaabbccccaaccaaaabbaaaaaaabeee. The zero-order entropy of this string is 2.26 bps. Hence, at best we can code the data in
64 * 2.26 = 144.64 bits
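A minimal Python sketch of the zero-order entropy formula (the function name is mine, for illustration):

from collections import Counter
from math import log2

def zero_order_entropy(data):
    # H(S) = -sum_i p(si) * log2 p(si), in bits per symbol.
    n = len(data)
    return -sum(c / n * log2(c / n) for c in Counter(data).values())

# For the 64-character string S above, 64 * zero_order_entropy(S)
# gives the best achievable total size in bits.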
Huffman Coding
Merge the two least probable characters together, and repeat this process until only one character remains.
[Table: Huffman code construction for the symbols a, b, c, e, i, g, f, d]
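A minimal Python sketch of this merging procedure (illustrative only; the function name and tie-breaking scheme are my own, not from the slides):

import heapq
from collections import Counter

def huffman_codes(data):
    # One heap entry per symbol: (frequency, tiebreaker, {symbol: code}).
    heap = [(freq, i, {sym: ""})
            for i, (sym, freq) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least probable subtrees...
        f1, _, c1 = heapq.heappop(heap)
        f2, i, c2 = heapq.heappop(heap)
        # ...prefixing '0' to one side's codes and '1' to the other's.
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, i, merged))
    return heap[0][2]

print(huffman_codes("aaabbc"))  # {'a': '0', 'c': '10', 'b': '11'}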
Hence, entropy gives us a lower bound on the number of bits needed to encode a data string.
Can we do better? Consider attaching integer values to the symbols in S. Then apply the difference operator Δ:

ΔS = [s1, s2 - s1, s3 - s2, ..., s64 - s63].
Note: S can be recovered easily. Moreover, the frequency distribution of the new sequence ΔS is:

letter      0   1   2  -1  -2   3  -5  -8
frequency  42   9   6   2   1   1   1   1
Total bits: 42*1 + 9*2 + ... + 1*6 + 1*6 + 1*6 = 110, a rate of r = 110/64 = 1.71875 bps. The Δ operator is an example of a decorrelation step, which is used in the modeling of data.
The aim of the decorrelation step is to remove redundancies from the data. Note that this is a further gain of approximately 0.54 bps (2.26 vs. 1.72 bps). Hence, by considering relationships among the data elements, we can obtain better compression. Research in lossless compression has focused on modeling the data source in order to exploit the correlation among data elements.
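The Δ operator itself is a few lines of Python (a sketch; mapping symbols to integer code points is one arbitrary way of "attaching integer values"):

def delta_encode(values):
    # ΔS = [s1, s2 - s1, s3 - s2, ...]: keep the first element,
    # then store successive differences.
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(diffs):
    # Invert Δ with a running sum.
    out, total = [], 0
    for d in diffs:
        total += d
        out.append(total)
    return out

s = [ord(c) for c in "aaabbbaa"]
assert delta_decode(delta_encode(s)) == s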
Arithmetic Coding
Suppose we have an alphabet {a, b, c}, with probabilities of occurrence (0.7, 0.1, 0.2). Each symbol may be assigned a range based on its probability:
Sample Symbol Ranges

Symbol   Probability   Range
a        70%           [0.00, 0.70)
b        10%           [0.70, 0.80)
c        20%           [0.80, 1.00)
Any value between the computed lower and upper probability bounds now encodes the input string.
Encode 'a':
current range = 1 - 0 = 1
upper bound = 0 + (1 * 0.70) = 0.7
lower bound = 0 + (1 * 0.00) = 0.0

Encode 'b':
current range = 0.7 - 0.0 = 0.7
upper bound = 0.0 + (0.7 * 0.80) = 0.56
lower bound = 0.0 + (0.7 * 0.70) = 0.49

Encode 'c':
current range = 0.56 - 0.49 = 0.07
upper bound = 0.49 + (0.07 * 1.00) = 0.56
lower bound = 0.49 + (0.07 * 0.80) = 0.546
The string "abc" may be encoded by any value within the probability range [0.546, 0.56). For example with 0.55.
[Figure: nested intervals while encoding "abc": a -> [0.0, 0.7), ab -> [0.49, 0.56), abc -> [0.546, 0.56)]
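The interval-narrowing computation above can be written out directly. This is a floating-point sketch for short strings only; practical arithmetic coders use integer arithmetic with renormalization:

# Model from the table above: cumulative ranges for (a, b, c).
RANGES = {"a": (0.00, 0.70), "b": (0.70, 0.80), "c": (0.80, 1.00)}

def arithmetic_encode(text):
    low, high = 0.0, 1.0
    for symbol in text:
        span = high - low                  # current range
        sym_low, sym_high = RANGES[symbol]
        high = low + span * sym_high       # new upper bound
        low = low + span * sym_low         # new lower bound
    return low, high                       # any value in [low, high) encodes text

print(arithmetic_encode("abc"))            # approximately (0.546, 0.56)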
Decoding Strings
encoded value = encoded input
while string is not fully decoded
    identify the symbol containing the encoded value within its range
    // remove the effect of the symbol from the encoded value
    current range = upper bound of symbol - lower bound of symbol
    encoded value = (encoded value - lower bound of symbol) / current range
end while
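The same loop in the floating-point Python sketch (here the caller supplies the message length; real coders use an end-of-stream symbol instead):

# Same model as in the encoding sketch above.
RANGES = {"a": (0.00, 0.70), "b": (0.70, 0.80), "c": (0.80, 1.00)}

def arithmetic_decode(value, length):
    out = []
    for _ in range(length):
        # Identify the symbol whose range contains the encoded value.
        for symbol, (sym_low, sym_high) in RANGES.items():
            if sym_low <= value < sym_high:
                break
        out.append(symbol)
        # Remove the effect of the symbol from the encoded value.
        value = (value - sym_low) / (sym_high - sym_low)
    return "".join(out)

print(arithmetic_decode(0.55, 3))  # abc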
Burrows-Wheeler Transformation
Let w = [3, 1, 3, 1, 2] be a data string. Form the successive cyclic left-shifts of w and sort the resulting rows lexicographically to obtain

M =
1 2 3 1 3
1 3 1 2 3
2 3 1 3 1
3 1 2 3 1
3 1 3 1 2

Note that the original data string w is the 5th row of M. Given the row index I = 5 of w in M and the last column L = [3, 3, 1, 1, 2], we can recover w. How? Sorting L yields the first column of M, and matching the occurrences of each symbol in L against those in the first column rebuilds the rows step by step:

M =
1 _ _ _ 3
1 _ _ _ 3
2 _ _ _ 1
3 _ _ _ 1
3 _ _ _ 2
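A direct (quadratic, illustration-only) Python sketch of the transform and the prepend-and-sort inversion; a practical implementation would use suffix sorting instead:

def bwt(w):
    # Sort all cyclic left-shifts of w; output the last column L
    # and the (1-based) row index I of w in the sorted matrix M.
    rows = sorted(tuple(w[i:] + w[:i]) for i in range(len(w)))
    return [row[-1] for row in rows], rows.index(tuple(w)) + 1

def inverse_bwt(L, I):
    # Rebuild M one column at a time: repeatedly prepend L and re-sort.
    rows = [() for _ in L]
    for _ in L:
        rows = sorted((c,) + row for c, row in zip(L, rows))
    return list(rows[I - 1])

L, I = bwt([3, 1, 3, 1, 2])
print(L, I)               # [3, 3, 1, 1, 2] 5
print(inverse_bwt(L, I))  # [3, 1, 3, 1, 2]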
The transformation collects similar elements nearby. To achieve better compression we need to use some other technique, like the Δ operator. In this case we can use Move-to-Front (recency rank) coding, or inversion coding/transformation.
Example: Let {a, b, c, d} be our alphabet and S = bbbaaaddddccc our data string. MTF encoding is performed as follows:

symbol:        b    b    b    a    a    a    d    ...
list (0123):   abcd bacd bacd bacd abcd abcd abcd ...
output:        1    0    0    1    0    0    3    ...

Output: 1001003000300
MTF decoding on 1001003000300 is done as follows:

rank:          1    0    0    1    0    0    3    ...
list (0123):   abcd bacd bacd bacd abcd abcd abcd ...
output:        b    b    b    a    a    a    d    ...

Output: bbbaaaddddccc

Why do we use MTF? If the data have locality of reference, the MTF-transformed data yields a distribution that is better for encoding.
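Both directions of MTF in Python (a minimal sketch matching the trace above):

def mtf_encode(data, alphabet):
    symbols = list(alphabet)
    ranks = []
    for s in data:
        r = symbols.index(s)               # recency rank of s
        ranks.append(r)
        symbols.insert(0, symbols.pop(r))  # move s to the front
    return ranks

def mtf_decode(ranks, alphabet):
    symbols = list(alphabet)
    out = []
    for r in ranks:
        s = symbols[r]
        out.append(s)
        symbols.insert(0, symbols.pop(r))  # keep the lists in sync
    return "".join(out)

ranks = mtf_encode("bbbaaaddddccc", "abcd")
print("".join(map(str, ranks)))   # 1001003000300
print(mtf_decode(ranks, "abcd"))  # bbbaaaddddccc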
Note that the inverse permutation of the LIT l is l⁻¹ = [1, 3, 5, 2, 4]. l⁻¹ is called the canonical sorting permutation of w. Also, l⁻¹ sorts the elements of w in non-decreasing order, and the sorted data consists of m blocks of different sizes. Sorted data can be encoded cheaply.
Hence, the problem is to encode the canonical sorting permutation. Interval ranking: except for the first appearance of an element in the data string, each occurrence is given a rank, which is simply the count of the number of elements between two consecutive occurrences of the same element. Let Δ be the difference operator on a sequence; then it is easy to prove that the first-order entropy satisfies H(Δ l⁻¹) ≤ H(Interval Rank).
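A minimal Python sketch of interval ranking as defined above (first appearances are simply skipped here, which is one possible simplification):

def interval_ranks(data):
    # Rank of each repeated occurrence: the number of elements
    # between it and the previous occurrence of the same symbol.
    last_seen, ranks = {}, []
    for i, s in enumerate(data):
        if s in last_seen:
            ranks.append(i - last_seen[s] - 1)
        last_seen[s] = i
    return ranks

print(interval_ranks("abcabc"))  # [2, 2, 2]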
Inversion Coding
Let π = [π1, π2, ..., πn] be an arbitrary permutation of an n-set S of positive integers. The Left Bigger (LB) inversion vector associated with π is the sequence [I1, I2, ..., In] of non-negative integers defined as follows: Ik = |{j : 1 ≤ j < k ≤ n and πj > πk}|. Example: Let [1, 3, 5, 2, 4] be a permutation of the set {1, 2, 3, 4, 5}. The LB inversion technique yields [0, 0, 0, 2, 1].
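The LB definition translates directly to Python (an O(n²) sketch):

def lb_inversion_vector(pi):
    # I_k = |{ j : j < k and pi[j] > pi[k] }|: for each position,
    # count the bigger elements to its left.
    return [sum(pi[j] > pi[k] for j in range(k)) for k in range(len(pi))]

print(lb_inversion_vector([1, 3, 5, 2, 4]))  # [0, 0, 0, 2, 1]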
When the Δ operator (difference operator) is applied to an inversion vector, all values except m-1 of them (recall that there are m blocks) are positive (or negative). The value |It - It-1 - 1| answers the question: how many bigger (or smaller) elements exist between the previous and the most recent occurrence of a symbol from the alphabet? Hence, the decorrelation of inversion vector elements yields a value called the inversion rank (distance).
Elias proved that recency ranking (MTF) yields better compression than interval ranking. It is easy to prove that inversion ranking yields better compression than interval ranking. While it is theoretically hard to relate MTF and inversion coding, simulation results have shown that inversion coding yields better compression than MTF coding.
Currently, bzip2 is one of the best universal compression schemes. My contributions in this area:
1) Theoretical settings of the BWT (1997)
2) A new and faster transformation than the BWT, the Linear Order Transformation (1999)
3) Inversion coding for large data files (2004)
BWIC is available from www.cs.fredonia.edu/arnavut/research.html. It yields better compression than bzip2 on several different kinds of data files, for example large text files, pseudo-color images, audio files, and images.
Compression rates (bps):

Data File   Size (bytes)   MTF    IC     BSC    BSWIC
Bib         111261         5.94   5.68   2.11   2.17
Book1       768768         5.12   4.84   2.61   2.52
Book2       610856         5.24   4.95   2.22   2.19
Geo         102400         6.03   6.16   4.83   4.97
News        377109         5.55   5.32   2.65   2.70
Obj1        21504          6.06   5.70   4.02   4.30
Obj2        246814         6.15   6.09   2.58   2.77
Paper1      53161          5.46   5.42   2.65   2.74
Paper2      82199          5.23   5.13   2.61   2.65
Pic         513216         1.09   1.03   0.84   0.81
Progc       39611          5.59   5.67   2.67   2.82
Progl       71646          4.93   4.96   1.88   1.91
Progp       49379          5.12   5.30   1.86   1.96
Trans       93695          5.55   5.49   1.63   1.77
Bible       4047392        5.04   4.56   1.71   1.62
Calag.tar   3276813        4.75   4.45   2.44   2.28
E.coli      4638690        2.25   2.14   2.21   2.10
World192    2473400        5.34   5.03   1.49   1.47
Avg.                       5.02   4.88   2.39   2.43
W. Avg.                    4.23   3.96   2.04   1.96
THANK YOU!
Questions?