0018-9235/93/$3.00 © 1993 IEEE
coder, which tries to assign the most economical possible variable-length bit string to each symbol in an alphabet. Its goal is to ensure that symbols that occur often are assigned very short codes, while the rarer symbols get many more bits.
HUFFMAN TREE. To see how Huffman coding works, assume that a
message is to be encoded and that analysis of previous messages
shows the following letter frequencies: 66 Fs, 64 Bs, 32 Cs, 29 As,
23 Gs, 12 Ds, and 9 Es. The first step is to order the letters from
highest to lowest frequency of occurrence, which is proportional to
the probability of occurrence [Fig. 1, top]. Next, the two least-frequent letters are combined and their frequencies added to find
the frequency for the combination. In this case, the D and the E
have a combined frequency of 21.
Next, combine whichever two nodes now have the lowest frequencies, and repeat until only a single root node remains; each letter's code is then read off the path from the root to its leaf.
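This combine-and-reorder procedure can be sketched in a few lines of Python (an illustration, not code from the article). Using the letter frequencies quoted above, it builds the tree with a priority queue and reports the code length each letter ends up with:

```python
import heapq

def huffman_code_lengths(freqs):
    """Build a Huffman tree from {symbol: frequency} and return the
    number of bits assigned to each symbol."""
    # Each heap entry: (frequency, tie-breaker, {symbol: depth so far}).
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Combine the two least-frequent nodes into one, adding their
        # frequencies; every symbol underneath moves one level deeper.
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

# The frequencies quoted in the article: 66 Fs, 64 Bs, and so on.
lengths = huffman_code_lengths(
    {'F': 66, 'B': 64, 'C': 32, 'A': 29, 'G': 23, 'D': 12, 'E': 9})
# F and B get 2-bit codes; C, A, and G get 3 bits; D and E get 4.
```

Multiplying each length by its frequency gives 596 bits for the 235-letter sample, against 705 bits for a fixed 3-bit code.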
[Fig. 1: the completed Huffman tree. The codes run from two bits (01) for the most frequent letters down to four bits (1110 and 1111) for D and E.]
[Fig. 2: arithmetic coding. Top: each symbol is given a slice of the interval 0.0-1.0 proportional to its probability (slices of width 0.1 and 0.2 here). Bottom: each symbol encoded narrows the message interval, from 0.8-0.9 down to 0.8678272816-0.8678272832.]
be encoded and then adding the results to the lower bound of the
current interval. The subdivision of the message interval continues
until all the symbols of a message are encoded [Fig. 2, bottom].
Once the message is encoded, it may be transmitted as any value
that lies within the final message interval. At first sight, that looks
impractical since computers have limited precision, and will
overflow after just a few symbols. But this is not really a problem,
since the encoded message need not be stored and transmitted all
at once. As soon as the most significant (leftmost) digits in the
upper and lower boundaries become equal, they are guaranteed not
to change as additional symbols are encoded. Thus they may be
transmitted, and the remaining fraction shifted left to occupy the
vacated positions.
To indicate to the decoder that the end of the message has been
reached, an end-of-message symbol is usually added to the message
alphabet.
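How the interval narrows, and how settled leading digits can be shipped out early, can be sketched with exact fractions (a toy three-symbol alphabet with probabilities of my choosing, not the article's):

```python
from fractions import Fraction

def arithmetic_encode(message, probs):
    """Narrow [low, high) by each symbol's slice of the interval, and
    emit a decimal digit whenever both bounds agree on it."""
    cum, running = {}, Fraction(0)   # lower edge of each symbol's slice
    for sym, p in probs.items():
        cum[sym] = running
        running += p
    low, high = Fraction(0), Fraction(1)
    digits = []
    for sym in message:
        width = high - low
        high = low + width * (cum[sym] + probs[sym])
        low = low + width * cum[sym]
        # A matching leading digit is guaranteed never to change, so it
        # can be transmitted now and the fraction shifted left.
        while int(low * 10) == int(high * 10):
            d = int(low * 10)
            digits.append(d)
            low, high = low * 10 - d, high * 10 - d
    return digits, low, high

probs = {'a': Fraction(1, 2), 'b': Fraction(1, 4), 'c': Fraction(1, 4)}
digits, low, high = arithmetic_encode("bab", probs)   # digits == [5]
```

A real coder would also append the end-of-message symbol mentioned above, and must handle the awkward case where the two bounds straddle a digit boundary indefinitely.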
TRIE-BASED CODES. Good as they are, the Huffman and arithmetic
models are less than efficient at modeling text because storage constraints keep them from capturing the higher-order relationships
between words and phrases. Far more effective are two simple
string-matching techniques, known as LZ-77 and LZ-78, which find
and eliminate the redundancy in repetitive strings. These techniques
were invented by two computer scientists, A. Lempel and J. Ziv, in
1977 and 1978.
LZ-77 exploits the fact that the words and phrases within a text
stream are likely to be repeated. When they do occur, they may be
encoded as a pointer to any previous occurrence plus a field indicating the length of a match. When there is no match, an escape bit
may be used to indicate that a noncompressed character follows.
Key to the operation of LZ-77 is a sliding history buffer, which
stores the text most recently transmitted. When the buffer fills up,
its oldest contents are discarded.
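The pointer-plus-length scheme can be sketched as follows; the token layout, window size, and minimum match length are illustrative choices, not the article's:

```python
def lz77_compress(data, window=4096, min_match=3):
    """Emit (1, offset, length) tokens for repeats found in the sliding
    history buffer, and (0, char) literals when no match is long enough."""
    tokens, i = [], 0
    while i < len(data):
        best_len, best_off = 0, 0
        # Naive scan of the history buffer for the longest match.
        for j in range(max(0, i - window), i):
            length = 0
            while (i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        if best_len >= min_match:
            tokens.append((1, best_off, best_len))
            i += best_len
        else:
            tokens.append((0, data[i]))   # escape flag + raw character
            i += 1
    return tokens

def lz77_decompress(tokens):
    out = []
    for t in tokens:
        if t[0] == 0:
            out.append(t[1])
        else:
            _, off, length = t
            for _ in range(length):       # copies may overlap themselves
                out.append(out[-off])
    return ''.join(out)
```

The linear scan over the whole buffer is the slow part; production coders replace it with the faster search structures the article discusses below.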
[4] In a nonwindowed protocol environment [top], Packet 2 cannot be sent until Packet 1 has been acknowledged. Since that involves two latencies, transmission is slowed down. With a windowed protocol, packets are sent continually, overlapping the link latency [bottom].
[5] More is less. Even though it takes time to compress and decompress data, compression can nevertheless reduce transmission time because it reduces the amount of data to be sent.
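The figure's point, that compression can pay for itself, can be checked with rough arithmetic. The link speed, coder throughput, and 3:1 compression ratio below are illustrative assumptions, not figures from the article:

```python
# Illustrative figures: one megabyte sent over a 64-kbit/s link,
# compressed to a third of its size by a coder that processes
# 500,000 bytes per second in each direction.
size = 1_000_000            # bytes to send
link = 64_000 / 8           # link speed in bytes per second

plain_time = size / link                    # send uncompressed
packed_time = (size / 500_000               # time to compress
               + (size / 3) / link          # send a third as much data
               + size / 500_000)            # time to decompress

# plain_time is 125 seconds; packed_time is under 46 seconds.
```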
length of 343 bits, or slightly less than 43 bytes.
This technique works best with moderate-size dictionaries of 2
kilobytes or more, typically compressing text to a third or less of its
original length. The hardest part of it to implement is the rapid
search for the best match in the buffer. Most implementations use
binary trees, hash tables, or (for maximum speed) content-addressable memories.
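The hash-table approach might look like this sketch (names and parameters are illustrative): candidate match positions are found by looking up the next few characters, rather than by scanning the whole buffer:

```python
from collections import defaultdict

def find_match(data, i, index, min_match=3, window=4096):
    """Look up match candidates by hashing the next min_match characters
    instead of scanning the entire history buffer."""
    best_len, best_off = 0, 0
    for j in index.get(data[i:i + min_match], []):
        if i - j > window:
            continue              # candidate has slid out of the window
        length = 0
        while (i + length < len(data)
               and data[j + length] == data[i + length]):
            length += 1
        if length > best_len:
            best_len, best_off = length, i - j
    return best_off, best_len

# Index every 3-character prefix seen so far in the history.
s = "abcdefabcdef"
index = defaultdict(list)
for j in range(6):
    index[s[j:j + 3]].append(j)
off, length = find_match(s, 6, index)     # off == 6, length == 6
```

The compressor extends the index as it advances, so each lookup touches only positions that actually share the next few characters.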
LZ-78 ALGORITHMS. Unfortunately, when data passes out of the sliding
history buffer, it is no longer useful. The LZ-78 algorithm attempts
to overcome that LZ-77 limitation with a greedy match-and-grow algorithm.
To do so, it employs a data structure known as a trie (from retrieval), which is often used for implementing dictionaries [Fig. 3].
Pointing to a trie node at the end of a phrase is enough to encode
the entire phrase. The decompressor, which, of course, must have
an identical trie, parses backwards from the reference node to the
root node to decode.
When a string is too short to be efficiently specified as a node, its
first character is sent in uncoded form preceded by an escape
symbol to indicate that it is not encoded.
Figure 3 indicates how LZ-78 works. The string THE THREE
LITTLE TOADS, which is represented there by part of a trie,
can be encoded from the trie in four pointers plus one space. First,
a pointer to the space following THE; then a pointer to the space
following THREE; next, a pointer to the E in LITTLE. At this
juncture, a space is needed, which is most efficiently encoded as a
flag bit followed by a space. Last comes a pointer to the S in
TOADS.
Assuming that the trie of Fig. 3 is only part of a larger dictionary
with a total of 32K nodes, each node can be specified with 16 bits: 15 to identify the node plus a flag bit, which distinguishes between
trie node identification numbers and uncoded data.
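The textbook form of LZ-78, a simplification of the flag-bit variant described above that emits a (node, next-character) pair instead of an escape-prefixed literal, can be sketched as:

```python
def lz78_compress(text):
    """Greedy match-and-grow: extend the current phrase while it is in
    the trie; emit (node_id, next_char) and add the grown phrase."""
    trie = {0: {}}                    # node_id -> {char: child_id}
    next_id = 1
    tokens, node = [], 0
    for ch in text:
        if ch in trie[node]:
            node = trie[node][ch]     # keep matching down the trie
        else:
            tokens.append((node, ch)) # longest known phrase + new char
            trie[node][ch] = next_id
            trie[next_id] = {}
            next_id += 1
            node = 0
    if node:
        tokens.append((node, ''))     # flush a trailing partial match
    return tokens

def lz78_decompress(tokens):
    """Rebuild the same dictionary phrase by phrase while decoding."""
    phrases = {0: ''}
    out = []
    for i, (node, ch) in enumerate(tokens, start=1):
        phrase = phrases[node] + ch
        phrases[i] = phrase
        out.append(phrase)
    return ''.join(out)
```

Because each token grows the trie by exactly one node, the compressor's and decompressor's dictionaries stay identical without the trie itself ever being transmitted.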