Lempel-Ziv Codes
To set the stage for Lempel-Ziv codes, suppose we wish to find the best block code for compressing a datavector X. Then we have to take into account the complexity of the code. We could represent the total number of codebits at the decoder output as

[# of codebits to describe block code] + [# of codebits from using code on X]

The codebits used to describe the block code that is chosen to compress X form a prefix of the encoder output and constitute what is called the overhead of the encoding procedure. If we wish to choose the best block code for compressing X, from among block codes of all orders, we would choose the block code that minimizes the total of the overhead codebits and the encoded datavector codebits. One could also adopt this approach to code design in order to choose the best finite memory code for compressing X, or, more generally, the best finite-state code.
EXAMPLE 1. Suppose we wish to compress English text using finite memory codes. A finite memory code of order zero entails 51 bits of overhead. (Represent the Kraft vector used as a binary tree with 26 terminal nodes and 2 × 26 − 1 = 51 nodes all together. You have to let the decoder know how to grow this tree; it takes one bit of information at each of the 51 nodes to do that, since the decoder will either grow two branches at each node, or none.) A finite memory first order code for English text will entail 27 × 51 = 1377 bits of overhead. (You need a codebook of 27 different codes, with 51 bits to describe each code.) A finite memory second order code for English text can be described with 677 × 51 = 34527 bits of overhead. (There are 26 × 26 + 1 = 677 codes in the codebook, in this case.) You would keep increasing the order of your finite memory code until you find the order that minimizes the sum of the amount of overhead plus the length of the encoded English text under the best finite memory code of that order.
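The overhead counts in this example are simple arithmetic. The following sketch (Python rather than the chapter's MATLAB; variable names are ours) reproduces them:

```python
# Overhead (in bits) needed to describe finite memory codes for English
# text, as in Example 1. A code for a 26-letter alphabet is stored as a
# binary tree with 26 terminal nodes, hence 2*26 - 1 = 51 nodes in all,
# at one bit per node (grow two branches, or none).
bits_per_code = 2 * 26 - 1           # 51

overhead = {
    0: 1 * bits_per_code,             # order 0: a single code
    1: 27 * bits_per_code,            # order 1: 27 contexts (letters + space)
    2: (26 * 26 + 1) * bits_per_code  # order 2: 26*26 + 1 = 677 contexts
}
print(overhead)  # {0: 51, 1: 1377, 2: 34527}
```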
It would be nice to have a compression technique that entails no overhead, while performing at least as well as the block codes, or the finite memory codes, or the finite-state codes (provided the length of the datavector is long enough). Overhead is caused because statistics of the datavector (consisting of various frequency counts) are collected first and then used to choose the code. Since the code arrived at depends on these statistics, overhead is needed to describe the code. Suppose instead that information about the datavector is collected "on the fly" as you encode the samples in the datavector from left to right: in encoding the current sample (or group of samples), you could use information collected about the previously encoded samples. A code which operates in this way might not need any overhead to describe it. Codes like this which require no overhead at the decoder output are called adaptive codes. The Lempel-Ziv code, the subject of this chapter, will be our first example of an adaptive code. There are quite a number of variants of the Lempel-Ziv code. The variant we shall describe in this chapter is called LZ78, after the date of the paper [1].
The procedure by which this partitioning into variable-length blocks takes place is called Lempel-Ziv parsing. The first variable-length block arising from the Lempel-Ziv parsing of the datavector X = (X1, X2, ..., Xn) is the single sample X1. The second block in the parsing is the shortest prefix of (X2, ..., Xn) which is not equal to X1. Suppose this second block is (X2, ..., Xj). Then the third block in the Lempel-Ziv parsing will be the shortest prefix of (Xj+1, ..., Xn) which is not equal to either X1 or (X2, ..., Xj). In general, suppose the Lempel-Ziv parsing procedure has produced the first k variable-length blocks B1, B2, ..., Bk in the parsing, and X(k) is the part of X that is left after B1, B2, ..., Bk have been removed. Then the next block Bk+1 in the parsing is the shortest prefix of X(k) which is not equal to any of the preceding blocks B1, B2, ..., Bk. (If there is no such block, then Bk+1 = X(k) and the Lempel-Ziv parsing procedure terminates.) By construction, the variable-length blocks B1, B2, ..., Bt produced by the Lempel-Ziv parsing of X are distinct, except that the last block Bt could be equal to one of the preceding ones.
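The parsing rule just described is short to state in code. This sketch is in Python rather than the chapter's MATLAB (function and variable names are ours); it returns the list of variable-length blocks B1, ..., Bt:

```python
def lz_parse(x):
    """Lempel-Ziv (LZ78) parsing of the sequence x into variable-length blocks.

    Each block is the shortest prefix of the unparsed remainder that differs
    from every preceding block; the final block may duplicate an earlier one.
    """
    blocks, seen = [], set()
    pos = 0
    while pos < len(x):
        end = pos + 1
        # extend the candidate block until it is new or the data runs out
        while tuple(x[pos:end]) in seen and end < len(x):
            end += 1
        block = tuple(x[pos:end])
        blocks.append(block)
        seen.add(block)
        pos = end
    return blocks

# The datavector of Example 2:
print(lz_parse([1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1]))
# [(1,), (1, 0), (1, 1), (0,), (0, 0), (1, 1, 0), (1,)]
```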
This parsing can also be accomplished via MATLAB. Here are the results of a MATLAB session that the reader can try:
For example, (4, 0) corresponds to the block 00 in the parsing. Since the last symbol of 00 is 0, the pair (4, 0) ends in 0. The 4 in the first entry refers to the fact that B4 = 0 is the preceding block in the parsing which is equal to what we get by deleting the last symbol of 00. For our next step, we replace each pair (i, s) by the integer ki + s, where k is the size of the data alphabet (here k = 2). Thus, the sequence of pairs (2) becomes the sequence of integers
2·0+1 = 1,  2·1+0 = 2,  2·1+1 = 3,  2·0+0 = 0,  2·4+0 = 8,  2·3+0 = 6,  2·0+1 = 1
(3)
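This pair-and-integer bookkeeping can be checked mechanically. In the sketch below (Python, names ours), each block is replaced by the pair (i, s), where i is the index of the earlier block matching all but the last symbol (0 if the block has length one) and s is the last symbol; the pair then becomes the integer k·i + s:

```python
def blocks_to_integers(blocks, k=2):
    """Map each parsed block to the integer k*i + s, where (i, s) is its pair."""
    integers = []
    for j, block in enumerate(blocks):
        prefix = block[:-1]
        # 1-based index of the earlier block equal to this block minus its
        # last symbol; 0 when the block consists of a single symbol
        i = blocks[:j].index(prefix) + 1 if prefix else 0
        s = block[-1]
        integers.append(k * i + s)
    return integers

blocks = [(1,), (1, 0), (1, 1), (0,), (0, 0), (1, 1, 0), (1,)]
print(blocks_to_integers(blocks))  # [1, 2, 3, 0, 8, 6, 1]
```

This reproduces the sequence of integers in (3) for the parsing of Example 2.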
To finish our description of the encoding process in Lempel-Ziv coding, let I1, I2, ..., It denote the integers corresponding to the blocks B1, B2, ..., Bt in the Lempel-Ziv parsing of the datavector X. Each integer Ij is expanded to base two and these binary expansions are "padded" with zeroes on the left so that the overall length of the string of bits assigned to Ij is ⌈log2(kj)⌉. The reason why this many bits is necessary and sufficient is seen by examining the largest that Ij can possibly be. Let (i, s) be the pair associated with Ij. Then the biggest that i can be is j − 1 and the biggest that s can be is k − 1. Thus the biggest that Ij can be is k(j − 1) + k − 1 = kj − 1, and the number of bits in the binary expansion of kj − 1 is ⌈log2(kj)⌉. Let Wj be the string of bits of length ⌈log2(kj)⌉ assigned to Ij as described in the preceding. Then the Lempel-Ziv encoder output is obtained by concatenating together the strings W1, W2, ..., Wt. To illustrate, suppose a binary datavector has seven blocks B1, B2, ..., B7 in its Lempel-Ziv parsing (such as in Example 2). These blocks are assigned, respectively, strings of codebits W1, W2, W3, W4, W5, W6, W7 of lengths ⌈log2(2)⌉ = 1 bit, ⌈log2(4)⌉ = 2 bits, ⌈log2(6)⌉ = 3 bits, ⌈log2(8)⌉ = 3 bits, ⌈log2(10)⌉ = 4 bits, ⌈log2(12)⌉ = 4 bits, and ⌈log2(14)⌉ = 4 bits. Therefore, any binary datavector with seven blocks in its Lempel-Ziv parsing would result in an encoder output of length 1 + 2 + 3 + 3 + 4 + 4 + 4 = 21 codebits. In particular, for the datavector in Example 2, the seven strings W1, ..., W7 are (referring to (3)):
W1 = 1,  W2 = 10,  W3 = 011,  W4 = 000,  W5 = 1000,  W6 = 0110,  W7 = 0001
Concatenating, we see that the encoder output from the Lempel-Ziv coding of the datavector in Example 2 is 110011000100001100001.
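Putting the pieces together gives a complete encoder sketch (Python, names ours): integer Ij is written in ⌈log2(kj)⌉ bits, zero-padded on the left, and the strings are concatenated.

```python
import math

def lz_encode(integers, k=2):
    """Concatenate each integer I_j written in ceil(log2(k*j)) bits, j = 1..t."""
    out = []
    for j, I in enumerate(integers, start=1):
        width = math.ceil(math.log2(k * j))
        out.append(format(I, 'b').zfill(width))  # left-pad with zeroes
    return ''.join(out)

# Integers (3) from the parsing of Example 2:
print(lz_encode([1, 2, 3, 0, 8, 6, 1]))  # 110011000100001100001
```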
Let us decode to get X. For an alphabet of size three, ⌈log2(3j)⌉ codebits are allocated to the j-th block in the Lempel-Ziv parsing. This gives us the following table of codebit allocations:

parsing block number | # of codebits
1 | 2
2 | 3
3 | 4
4 | 4
5 | 4
6 | 5
7 | 5
8 | 5

Partitioning the encoder output (4) according to the allocations in this table, we obtain the partition
00, 100, 0010, 1010, 1011, 00001, 00000
Converting these to integer form we get
0, 4, 2, 10, 11, 1, 0
Dividing each of these integers by three and recording quotient and remainder in each case, we get the pairs
(0, 0), (1, 1), (0, 2), (3, 1), (3, 2), (0, 1), (0, 0)
Working backward from these pairs we obtain the Lempel-Ziv parsing
0, 01, 2, 21, 22, 1, 0
and the datavector
X = (0, 0, 1, 2, 2, 1, 2, 2, 1, 0)
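The decoding walk just carried out can be scripted. This Python sketch (names ours) assumes an alphabet {0, 1, 2}, so block j occupies ⌈log2(3j)⌉ codebits; each integer is split by divmod into its pair (i, s), and block i (already decoded) extended by the symbol s rebuilds the parsing.

```python
import math

def lz_decode(codeword, k=3):
    """Invert Lempel-Ziv encoding for an alphabet of size k."""
    blocks, pos, j = [], 0, 1
    while pos < len(codeword):
        width = math.ceil(math.log2(k * j))   # codebits for block j
        I = int(codeword[pos:pos + width], 2)
        i, s = divmod(I, k)                   # recover the pair (i, s)
        prefix = blocks[i - 1] if i > 0 else ()
        blocks.append(prefix + (s,))
        pos += width
        j += 1
    return [sym for block in blocks for sym in block]

print(lz_decode('001000010101010110000100000'))
# [0, 0, 1, 2, 2, 1, 2, 2, 1, 0]
```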
Figure 1: Lempel-Ziv Parsing Tree for Example 2
We explain to the reader the meaning of this tree. Label each left branch with a "1" and each right branch with a "0". For each node i (i = 1, ..., 6), write down the variable-length block consisting of the bits encountered along the path from the root node (labelled 0) to node i; this block Bi is then the i-th block in the Lempel-Ziv parsing of the datavector. For example, if we follow the path from node 0 to node 6, we see a left branch, a left branch, and a right branch, which converts to the block 110. Thus, the sixth block in the Lempel-Ziv parsing of our datavector is 110.

Let the datavector be X = (X1, X2, ..., Xn). The encoder grows the Lempel-Ziv parsing tree as follows. Suppose there are q distinct blocks in the Lempel-Ziv parsing, B1, B2, ..., Bq. Then the encoder grows trees T1, T2, ..., Tq. Tree T1 consists of node 0, node 1, and a single branch going from node 0 to node 1 that is labelled with the symbol B1 = X1. For each i > 1, tree Ti is determined from tree Ti−1 as follows:

(a) Remove B1, ..., Bi−1 from the front of X, leaving X(i−1).

(b) Starting at the root node of Ti−1, follow the path driven by X(i−1) until a terminal node of Ti−1 is reached. (The labels on the resulting path form a prefix of X(i−1) which is one of the blocks Bj ∈ {B1, B2, ..., Bi−1}, and the terminal node reached is labelled j.)

(c) Let X* be the next symbol in X(i−1) to appear after Bj. Grow a branch from node j of Ti−1, label this branch with the symbol X*, and label the new node at the end of this branch "node i". This new tree is Ti.
The decoder can also grow the Lempel-Ziv parsing tree as decoding of the compressed datavector proceeds from left to right. We will leave it to the reader to see how that is done. Growing a Lempel-Ziv parsing tree allows the encoding and decoding operations in Lempel-Ziv coding to be done in a fast manner. Also, there are modifications of Lempel-Ziv coding (not to be discussed here) in which enhancements in data compression are obtained by making use of the structure of the parsing tree.
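The tree description above suggests the fast implementation: keep the parsing tree as a trie of previously parsed blocks, so each data symbol is examined only once. A Python sketch (names ours):

```python
def lz_parse_trie(x):
    """LZ78 parsing using the parsing tree (a trie of earlier blocks).

    Walking from the root along the data visits each symbol once, so the
    parse takes time proportional to the length of x.
    """
    root = {}
    blocks = []
    pos = 0
    while pos < len(x):
        node, end = root, pos
        # follow the path driven by the data until we fall off the tree
        while end < len(x) and x[end] in node:
            node = node[x[end]]
            end += 1
        if end < len(x):
            node[x[end]] = {}        # grow one new branch (a new node)
            end += 1
        blocks.append(tuple(x[pos:end]))
        pos = end
    return blocks

print(lz_parse_trie([1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1]))
# [(1,), (1, 0), (1, 1), (0,), (0, 0), (1, 1, 0), (1,)]
```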
LZ(X) ≤ nH1(X) + nρn
(5)
The term ρn, which depends only on the datavector length n, is called the first order redundancy, and its units are bits per data sample. The better a data compression algorithm is, the smaller the redundancy will be. The following result gives the first order redundancy of the Lempel-Ziv code.
INTERPRETATION. We introduce some notation which makes it more convenient to talk about redundancy. If {zn} is a sequence of real numbers, and {an} is a sequence of real numbers, we say that zn is O(an) if there is a positive constant D such that

|zn| ≤ D|an|

for all sufficiently large positive integers n. Using our new notation, we see that the above RESULT says that the first order redundancy of the Lempel-Ziv code is O(log2 log2 n / log2 n) (where n denotes the length of the datavector). What does our redundancy result say? Recall that H1(X) is a lower bound on the compression rate that results when one compresses X using the best memoryless code that can be designed for X. Thus, the RESULT tells us that the Lempel-Ziv code yields a compression rate on any datavector of length n no worse than log2 log2 n / log2 n bits per sample more than the compression rate of the best memoryless code for the datavector. Since the quantity log2 log2 n / log2 n is very small when n is large, we can achieve through Lempel-Ziv coding a compression performance approximately no worse than that achievable by the best memoryless code for the given datavector.

To show that the RESULT is true, we need the notion of unnormalized entropy. Let (Y1, Y2, ..., Ym) be a datavector. (We allow the case in which each entry Yi is itself a datavector; for example, the Yi's may be blocks arising from a Lempel-Ziv parsing.) The unnormalized entropy H*(Y1, ..., Ym) of the datavector (Y1, ..., Ym) is defined to be m, the length of the datavector, times the first order entropy H1(Y1, ..., Ym) of the datavector. This gives us the formula
H*(Y1, ..., Ym) = − Σ_{i=1}^{m} log2 p(Yi)    (7)
where p is the probability distribution on the set of entries of the datavector which assigns to each entry Y the probability p(Y) defined by

p(Y) = (number of indices i for which Yi = Y) / m.
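Formula (7) translates directly to code. The sketch below (Python, names ours) computes H*(Y1, ..., Ym) from the empirical probabilities p(Y):

```python
import math
from collections import Counter

def unnormalized_entropy(ys):
    """H*(Y_1,...,Y_m) = -sum_i log2 p(Y_i), p the empirical distribution."""
    counts = Counter(ys)
    m = len(ys)
    return -sum(math.log2(counts[y] / m) for y in ys)

# Unnormalized entropy of the blocks from the Lempel-Ziv parsing of Example 2
# (the block (1,) appears twice, the other five blocks once each):
blocks = [(1,), (1, 0), (1, 1), (0,), (0, 0), (1, 1, 0), (1,)]
print(unnormalized_entropy(blocks))
```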
From Exercise 4 at the end of the chapter,

H*(B1, ..., Bt) ≤ H*(X1, ..., Xn) + H*(|B1|, ..., |Bt|) = nH1(X) + H*(|B1|, ..., |Bt|),

where |Bi| denotes the length of the block Bi. Since the blocks B1, B2, ..., B_{t−1} are distinct, we know that

(t − 1) log2(t − 1) = H*(B1, ..., B_{t−1}) ≤ H*(B1, ..., Bt)
LZ(X) = Σ_{i=1}^{t} ⌈log2(ki)⌉
where k is the size of the data alphabet. Expanding out the right side of the preceding equation, one can see that there is a constant c1 such that
LZ(X) ≤ (c1 + k)t + (t − 1) log2(t − 1)
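The omitted expansion can be sketched as follows (our own filling-in of the step, with the bounded terms absorbed into c1):

```latex
LZ(X) \;=\; \sum_{i=1}^{t} \lceil \log_2(ki) \rceil
\;\le\; \sum_{i=1}^{t} \bigl(\log_2 i + \log_2 k + 1\bigr)
\;=\; \log_2(t!) + (1+\log_2 k)\,t
\;\le\; t\log_2 t + (1+\log_2 k)\,t .
```

Since t log2 t exceeds (t − 1) log2(t − 1) by at most log2 t + log2 e ≤ 2t for t ≥ 2, and log2 k ≤ k, a constant such as c1 = 3 suffices.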
for all datavectors X. From Exercise 6 at the end of the chapter,

H*(|B1|, ..., |Bt|) ≤ Σ_{i=1}^{t} log2 |Bi| + t log2(1 + loge n) ≤ t log2(n/t) + t log2(1 + loge n),

where the second inequality holds because Σ_{i=1}^{t} log2 |Bi| is largest, subject to |B1| + ... + |Bt| = n, when every block has length n/t. Combining the preceding inequalities, and so
LZ(X)/n ≤ H1(X) + ρ(X)
where

ρ(X) = (c1 + k)(t/n) + (t/n) log2(1 + loge n) + (t/n) log2(n/t)    (8)

By Exercise 8 at the end of the chapter, there is a constant c2 such that
t log2 n ≤ c2 n
Applying this to the first and third terms on the right side of (8), it is seen that

ρ(X) = O(1/log2 n) + O(log2 log2 n / log2 n) + O(log2 log2 n / log2 n)

Of the three terms on the right above, the third term is dominant. We have achieved the bound (5) with ρn given by (6). The RESULT is proved.

We now want to compare the compression performance of the Lempel-Ziv code to the performance of block codes of an arbitrary order j. Consider an arbitrary datavector X = (X1, ..., Xn) whose length n is a multiple of j. By a complicated argument similar to the argument given above for j = 1 (which we omit), it can be shown that there is a constant Cj such that
LZ(X)/n ≤ Hj(X) + Cj (log2 log2 n / log2 n)    (9)

The second term on the right above is the j-th order redundancy of the Lempel-Ziv code. In other words, the relation (9) tells us that for any j, the j-th order redundancy of the Lempel-Ziv code is O(log2 log2 n / log2 n), which becomes very small as n gets large. Recall from Chapter 4 that Hj(X) is a lower bound on the compression rate of the best j-th order block code for X. We conclude that no matter how large the order of the block code that one attempts to use, the Lempel-Ziv algorithm will yield a compression rate on an arbitrary datavector approximately no worse than that of the block code, provided the datavector is long enough relative to the order of the block code. Hence, one loses nothing in compression rate by using the Lempel-Ziv code instead of a block code. Also, one is able to compress a datavector faster via the Lempel-Ziv code than via block coding. To see this, one need only look at memoryless codes. For a datavector of length n, the overall time for best compression of the datavector via a memoryless code is proportional to n log n. (The overall compression time in this case would be the time it takes to design the Huffman code for the datavector plus the time it takes to compress the datavector with the Huffman code; since the first time is proportional to n log n and the second time is proportional to n, the overall compression time is proportional to n log n.) On the other hand, if the Lempel-Ziv code is implemented properly, it will take time proportional to n to compress any datavector of length n. (No time is wasted on design; the Lempel-Ziv code structure is the same for every datavector.) We conclude:

Lempel-Ziv coding yields a compression performance as good as or better than the best block codes (provided the datavector is long enough).

Lempel-Ziv coding yields faster compression of the data than does coding via the best block codes, because no time is wasted on design.

The Lempel-Ziv code has been our first example of a code which does at least as well as the block codes in terms of the redundancy of all orders becoming small with large datavector length. Such codes are called universal codes. Although the Lempel-Ziv code is a universal code, there are universal codes whose redundancy goes to zero faster with increasing datavector length than does the redundancy of the Lempel-Ziv code. This point is discussed further in Chapter 15.
5.6.1 LZparse.m
Here is the m-file for the MATLAB function LZparse :
%This m-file is called LZparse.m
%It accomplishes Lempel-Ziv parsing of a binary
%datavector
%x is a binary datavector
%y = LZparse(x) is a vector consisting of the indices
%of the blocks in the Lempel-Ziv parsing of x
%
function y = LZparse(x)
N = length(x);
dict = [];
lengthdict = 0;
while lengthdict < N
    i = lengthdict + 1;
    k = 0;
    while k == 0
        v = x(lengthdict+1:i);
        j = bitstring_to_index(v);
        A = (dict ~= j);
        k = prod(A);
        if i == N
            k = 1;
        end
        i = i + 1;
    end
    dict = [dict j];
    lengthdict = lengthdict + length(v);
end
y = dict;

The function "LZparse" was illustrated in Example 2.
5.6.2 LZcodelength.m
Here is the m-file for the MATLAB function LZcodelength:

%This m-file is named LZcodelength.m
%x = a binary datavector
%LZcodelength(x) = length in codebits of the encoder
%output resulting from the Lempel-Ziv coding of x
%
function y = LZcodelength(x)
u = LZparse(x);
t = length(u);
S = 0;
for i = 1:t
    S = S + ceil(log2(2*i));
end
y = S;

To illustrate the MATLAB function LZcodelength, we performed the following MATLAB session:

x = [1 1 0 1 1 0 0 0 1 1 0 1];
LZcodelength(x)

21

As a result of this session, we computed the length of the codeword resulting from the Lempel-Ziv encoding of the datavector in Example 2, and "21" was printed out on the screen. This is the correct length of this codeword, as computed earlier in these notes.
5.7 Exercises
1. What is the minimum number of variable-length blocks that can appear in the Lempel-Ziv parsing of a binary datavector of length 28? What is the maximum number?

2. Find the binary codeword that results when the datavector 11101011100101001111 is encoded using the Lempel-Ziv code.

3. The alphabet of a datavector is {0, 1, 2}. The codeword 10100000001101010100010110010000 results when the datavector is Lempel-Ziv encoded. Find the datavector.

4. Let X = (X1, X2, ..., Xn) be a datavector and let B1, B2, ..., Bt be variable-length blocks into which X is partitioned (from left to right). Show that
H*(B1, B2, ..., Bt) ≤ H*(X1, ..., Xn) + H*(|B1|, ..., |Bt|)    (10)

where |Bi| is the length of Bi. (Inequality (10) can be proved by grouping appropriately the terms that appear in the summation giving the unnormalized entropy H*(B1, B2, ..., Bt); see formula (7).)
5. Let (X1, X2, ..., Xn) be a datavector and let A be the data alphabet. Show that

H*(X1, ..., Xn) ≤ − Σ_{i=1}^{n} log2 p(Xi)

for any probability distribution p on A. (Hint: Use the fact that

Σ_{a∈A} p1(a) log2(p1(a)/p2(a)) ≥ 0

for any two probability distributions p1, p2 on A; see Exercise 1 of Chapter 3.)

6. Consider a datavector (X1, X2, ..., Xn) in which each sample Xi is a positive integer less than or equal to N. Show that

H*(X1, ..., Xn) ≤ Σ_{i=1}^{n} log2 Xi + n log2(1 + loge N)

(Hint: First, use the result of Exercise 5 with the probability distribution

p(j) = (1/j) / (1 + (1/2) + (1/3) + ... + (1/N)),   j = 1, ..., N

Then use the inequality

(1/2) + (1/3) + ... + (1/N) ≤ ∫_1^N (1/x) dx = loge N
which can be seen by approximating the area under the curve y = 1/x by a sum of areas of rectangles.)

7. Let A be an arbitrary finite alphabet. Define Llz(n) to be the minimum Lempel-Ziv codeword length assigned to the datavectors of length n over the alphabet A. Show that

lim_{n→∞} log2 Llz(n) / log2 n = 1/2

This property points out a hidden defect of the Lempel-Ziv code. Because the limit on the left above is greater than zero, there exist certain datavectors which the Lempel-Ziv code does not compress very well.

8. Consider all datavectors of all lengths over a fixed finite alphabet A. If X is such a datavector, let t(X) denote the number of variable-length blocks that appear in the Lempel-Ziv parsing of X. Show that there is a constant M (depending on the size of the alphabet A) such that for any integer n ≥ 2 and any datavector X of length n,
t(X) ≤ Mn / log2 n
(Hint: Let t = t(X) and let B1, B2, ..., B_{t−1} be the first t − 1 variable-length blocks in the Lempel-Ziv parsing of X. Let |Bi| denote the length of block Bi. In the inequality

|B1| + |B2| + ... + |B_{t−1}| ≤ n

find a lower bound for the left-hand side using the fact that the Bi's are distinct.)

9. We discuss a variant of the Lempel-Ziv code which yields shorter codewords for some datavectors than does LZ78. Encoding is accomplished via three steps. In Step 1, we partition the datavector (X1, ..., Xn) into variable-length blocks in which the first block is of length one, and each succeeding block (except
for possibly the last block) is the shortest prefix of the rest of the datavector which cannot be seen in a window sliding to the left through the datavector. To illustrate, the datavector 000110 is partitioned into

0, 001, 10    (11)

in Step 1. (On the other hand, LZ78 partitions this datavector into four blocks instead of three: 0, 00, 1, 10.) In Step 2, each block B in the sequence of blocks from Step 1 is represented as a triple (i, j, k) in which k is the last symbol in B, i is the length of the block B, and j is the smallest integer such that if we look at the i − 1 samples in the datavector starting with sample Xj, we will see the block obtained by removing the last symbol from B. (Take j = 0 if B has length one.) For example, for the blocks in (11), Step 2 gives us the triples

(1, 0, 0), (3, 1, 1), (2, 4, 0)

In Step 3, the sequence of triples from Step 2 is converted into a binary codeword. There is a clever way to do this which we shall not discuss here. All we need to know for the purposes of this exercise is that if there are t triples and the datavector length is n, then the approximate length of the binary codeword is t log2 n.

(a) Show that there are infinitely many binary datavectors such that Step 1 yields a partition of the datavector into 5 blocks.

(b) Let X(n) be the datavector consisting of n zeroes. Let LZ*(X(n)) be the length of the binary codeword which results when X(n) is encoded using the variant of the Lempel-Ziv code just described, and let LZ(X(n)) be the length of the LZ78 codeword for X(n). Show that LZ*(X(n)) / LZ(X(n)) converges to zero as n → ∞.
References
[1] J. Ziv and A. Lempel, "Compression of individual sequences via variable-rate coding," IEEE Trans. Inform. Theory, vol. 24, pp. 530-536, 1978.