
Efficiency of lossless data compression

Mladen Konecki, Robert Kudelić and Alen Lovrenčić
Faculty of Organization and Informatics, University of Zagreb, Pavlinska 2, 42000 Varaždin, Croatia
E-mail(s): mladen.konecki@foi.hr, robert.kudelic@foi.hr, alen.lovrencic@foi.hr
Abstract - Data compression is an important part of information and communication technologies. There are many benefits of using data compression, such as saving space on hard drives or lowering the use of transmission bandwidth in networks. There are also many algorithms and many tools used in this field today. In this paper we will focus on lossless data compression and give a short overview of the algorithms used in the most popular data archiving tools for lossless data compression. The compression ratio depends heavily on the type of data, so the tests will be carried out on a group of the most commonly used data types. Tools for data compression implement known algorithms in many different variations, and we will determine which tools have the best compression capabilities, which are the fastest, and which have the best relation between compression ratio and speed.
I. INTRODUCTION
In the last two decades there has been a major transformation, we could even say a revolution, in the way we communicate, and this progress is still under way. This transformation includes the ever-present, ever-growing Internet; the explosive development of mobile communications; and the ever-increasing importance of video communication. Data compression is the technology that enables this multimedia revolution. It would not be possible to put all the images, audio or video on websites if it were not for data compression. Mobile phones would not be able to provide communication with increasing clarity were it not for compression. Even digital TV would not be possible without compression. Data compression used to be the domain of a small group of engineers and scientists; today it is ubiquitous. Lossless compression plays the most important role because it is applied in the business world. Lossless compression is used more and more, even in the multimedia world, because better technology allows us to do so. Lossless compression is obviously better for any data transfer, if there are no time and capacity constraints.
The amount of data that needs to be transmitted and stored is growing rapidly, so why not develop better transmission and storage technologies? This is happening, but it is not enough. We have DVDs, optical fibers, ADSL and cable modems. We are able to store much more data than before and transmit it much faster. While both storage and transmission capacities are steadily increasing with new technology, as a corollary to Parkinson's First Law, it seems that the need for storage and transmission increases at least twice as fast as storage and transmission capacities improve [1]. There are also situations in which capacity has not increased significantly. For example, the amount of information we can transmit over the airwaves will always be limited by the characteristics of the atmosphere.
When we speak about compression techniques or compression algorithms, there are two main categories: lossless algorithms and lossy algorithms. Lossless compression techniques involve no loss of information. If data have been losslessly compressed, the original data can be recovered exactly from the compressed data. Lossless compression is used for applications that cannot tolerate any difference between the original and reconstructed data. There are many situations that require compression where we want the reconstruction to be identical to the original, such as executable programs, text documents, source code etc. [2].
Lossy compression techniques involve some loss of information, and data that have been compressed using lossy techniques generally cannot be recovered or reconstructed exactly. In return for accepting this distortion in the reconstruction, we can generally obtain much higher compression ratios than is possible with lossless compression. Lossy compression is usually used to compress multimedia data: audio, video and images. It is used in applications like streaming media and Internet telephony [2].
In this paper we will focus on lossless compression applications. We will give a short overview of algorithms
that are used in archiving tools for lossless data
compression and compare those tools to determine their
efficiency in compression ratio, compression speed and
relation between compression ratio and speed. Archiving
tools implement known algorithms in many different
variations. The exact compression scheme we use will
depend on a number of different factors. Some of the most
important factors are the characteristics of the data that
need to be compressed. So there are many possibilities to
combine known algorithms to compress the most common
data types that are used today.
II. LOSSLESS COMPRESSION ALGORITHMS
A. Modeling and coding
The development of data compression algorithms for a
variety of data can be divided into two phases. The first
phase is called modeling. The model component somehow
captures the probability distribution of the data by
knowing or discovering something about the structure of
the input. The second phase is called coding. The coder
component then takes advantage of the probability biases
generated in the model to generate codes. It does this by
effectively lengthening low-probability messages and shortening high-probability messages. Although there are many different ways to design the model component of compression algorithms, the coder components tend to be quite generic: current algorithms are almost exclusively based on either Huffman or arithmetic codes [3].
B. Shannon-Fano
It turns out that information theory ties the model and coder components together. Shannon borrowed the definition of entropy from statistical physics to capture the notion of how much information is contained in a set of possible messages and their probabilities. Shannon defined the entropy as

H(S) = \sum_{s \in S} p(s) \log_2 \frac{1}{p(s)}    (1)
where p(s) is the probability of message s. If we consider the individual messages s ∈ S, Shannon defined the notion of the self-information of a message as

i(s) = \log_2 \frac{1}{p(s)}    (2)

This self-information represents the number of bits of information contained in the message and the number of bits we should use to send it [4].
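To make (1) and (2) concrete, the following Python sketch computes the self-information of a message and the entropy of a source; the probability values at the end are hypothetical and serve only as an illustration.

import math

def self_information(p):
    """Self-information i(s) = log2(1/p(s)) in bits, as in (2)."""
    return math.log2(1.0 / p)

def entropy(probabilities):
    """Entropy H(S) = sum of p(s) * log2(1/p(s)) over all messages, as in (1)."""
    return sum(p * self_information(p) for p in probabilities if p > 0)

# Hypothetical distribution over five messages
probs = [0.4, 0.2, 0.2, 0.1, 0.1]
print(entropy(probs))            # about 2.12 bits per message
print(self_information(0.1))     # about 3.32 bits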
R. M. Fano and C. E. Shannon developed a coding procedure to generate a binary code tree [5]. To create a code tree according to Shannon and Fano, an ordered table providing the frequency of every symbol is required. Each part of the table is divided into two segments; the algorithm has to ensure that both the upper and the lower segment have nearly the same sum of frequencies. This procedure is repeated until only single symbols are left. Table I shows an example of Shannon-Fano coding.
Linear (fixed-length) coding of 5 symbols would require 3 bits per symbol, but the Shannon-Fano algorithm gives an average length of 2.26 bits.
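A minimal recursive sketch of the Shannon-Fano procedure in Python, assuming the symbol frequencies from Table I; the split point is chosen so that the two halves have nearly equal sums of frequencies.

def shannon_fano(symbols):
    """symbols: list of (symbol, frequency) sorted by descending frequency.
    Returns a dict mapping each symbol to its binary code string."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    total = sum(f for _, f in symbols)
    # find the split that makes the two partial sums as equal as possible
    best_i, best_diff, running = 1, float("inf"), 0
    for i in range(1, len(symbols)):
        running += symbols[i - 1][1]
        diff = abs(running - (total - running))
        if diff < best_diff:
            best_i, best_diff = i, diff
    codes = {}
    for sym, code in shannon_fano(symbols[:best_i]).items():
        codes[sym] = "0" + code          # upper segment gets prefix 0
    for sym, code in shannon_fano(symbols[best_i:]).items():
        codes[sym] = "1" + code          # lower segment gets prefix 1
    return codes

freqs = [("A", 24), ("B", 12), ("C", 10), ("D", 8), ("E", 8)]
print(shannon_fano(freqs))
# {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'}, average 2.26 bits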
C. Huffman coding
Huffman coding was invented by David Huffman [6] and is a procedure to generate a binary code tree. The algorithm ensures that the code length assigned to each symbol corresponds to the probability of its occurrence. The Huffman algorithm is very simple and is most easily described in terms of how it generates the prefix-code tree:
- Start with a forest of trees, one for each message. Each tree contains a single vertex with weight w_i = p_i.
- Repeat until only a single tree remains:
- Select the two trees with the lowest weight roots (w_1 and w_2).
- Combine them into a single tree by adding a new root with weight w_1 + w_2, and making the two trees its children. The convention is to put the lower weight root on the left if w_1 ≠ w_2.
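A minimal Python sketch of this greedy procedure, using the frequencies from Table I; for brevity it returns only the code lengths (tree depths) instead of the explicit prefix-code tree.

import heapq
from collections import defaultdict

def huffman_code_lengths(freqs):
    """freqs: dict symbol -> weight. Returns dict symbol -> code length in bits."""
    # a forest of single-vertex trees, one per symbol, weighted by frequency
    heap = [(w, [sym]) for sym, w in freqs.items()]
    heapq.heapify(heap)
    depth = defaultdict(int)
    while len(heap) > 1:
        # select the two trees with the lowest weight roots ...
        w1, syms1 = heapq.heappop(heap)
        w2, syms2 = heapq.heappop(heap)
        # ... and combine them under a new root of weight w1 + w2,
        # which pushes every leaf of both subtrees one level deeper
        for s in syms1 + syms2:
            depth[s] += 1
        heapq.heappush(heap, (w1 + w2, syms1 + syms2))
    return dict(depth)

freqs = {"A": 24, "B": 12, "C": 10, "D": 8, "E": 8}
lengths = huffman_code_lengths(freqs)
avg = sum(freqs[s] * lengths[s] for s in freqs) / sum(freqs.values())
print(lengths, round(avg, 2))   # average close to 2.23 bits per symbol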
The computation of the entropy results in an average code length of 2.176 bits per symbol. The Huffman code attains an average of 2.23 bits per symbol, thus Huffman coding reaches 97.74% of the optimum. For some applications it can be helpful to reduce the variance of the code lengths. The variance is defined as

\sum_{c \in C} p(c) (l(c) - l_a(C))^2    (3)

where l(c) is the length of the code for message c and l_a(C) is the average code length. It turns out that a simple modification of the Huffman algorithm can be used to generate a code that has minimum variance.
The construction of a code tree for Huffman coding is based on a certain probability distribution. There are three different variants regarding the determination of this distribution: static, dynamic and adaptive. We will not go into each variant because that is beyond the scope of this paper; more about it can be found in [7] and [8].
D. Arithmetic coding
The crucial algorithms for arithmetic coding were introduced by J. Rissanen [9][11]. The aim of arithmetic coding is to define a method that provides code words with an ideal length. As for every other entropy coder, the probability of the appearance of the individual symbols must be known. Arithmetic coding assigns an interval to each symbol, whose size reflects the probability of the appearance of this symbol. The code word of a symbol is an arbitrary rational number belonging to the corresponding interval. Arithmetic coding differs from other forms of entropy encoding in that, rather than separating the input into component symbols and replacing each with a code, it encodes the entire message into a single number, a fraction n where 0.0 ≤ n < 1.0. Picture 1 shows an example of arithmetic coding.
Arithmetic coding assigns an interval to a sequence of messages using the following recurrences
Table I. Example of Shannon-Fano coding

Symbol   Frequency   Code Length   Code   Total Length
A        24          2             00     48
B        12          2             01     24
C        10          2             10     20
D        8           3             110    24
E        8           3             111    24
Picture 1. Example of arithmetic coding
l_1 = f_1,  l_i = l_{i-1} + f_i \cdot s_{i-1}  for 1 < i \le n
s_1 = p_1,  s_i = s_{i-1} \cdot p_i  for 1 < i \le n    (4)
where l_n is the lower bound of the interval, s_n is the size of the interval, and f_i denotes the cumulative probability of the symbols preceding the i-th message. We assume the interval is inclusive of the lower bound, but exclusive of the upper bound. The recurrence narrows the interval on each step to some part of the previous interval. Since the interval starts in the range [0,1), it always stays within this range.
As with adaptive Huffman coding, it is also possible to take the symbols already encoded into account and to adapt the probability model at every step [10].
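The recurrence (4) can be sketched in a few lines of Python; the probability model used below is a fixed, hypothetical one, and f(s) denotes the cumulative probability of the symbols preceding s.

def arithmetic_interval(message, probs):
    """probs: dict symbol -> probability. Returns (lower_bound, size) of the
    final interval for the message, following recurrence (4)."""
    # cumulative probability f(s): sum of probabilities of the symbols before s
    cum, f = 0.0, {}
    for sym, p in probs.items():
        f[sym] = cum
        cum += p
    lower, size = 0.0, 1.0
    for sym in message:
        # narrow the current interval to the sub-interval of this symbol
        lower = lower + size * f[sym]
        size = size * probs[sym]
    return lower, size

probs = {"a": 0.5, "b": 0.3, "c": 0.2}        # hypothetical model
low, size = arithmetic_interval("bab", probs)
print(low, low + size)   # about 0.575 and 0.62; any number in [0.575, 0.62) encodes "bab"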
E. Dictionary techniques
In many applications, the output of the source consists
of recurring patterns. A very reasonable approach to
encoding such sources is to keep a list, or dictionary, of
frequently occurring patterns. When these patterns appear in the source output, they are encoded with a reference to
the dictionary. If the pattern does not appear in the
dictionary, then it can be encoded using some other, less
efficient method.
A variety of compression methods are based on the fundamental work of A. Lempel and J. Ziv [12][13]. Their original algorithms are denoted as LZ77 and LZ78, and a variety of derivatives have been introduced in the meantime.
LZ77 is a dictionary-based algorithm that addresses byte sequences in the previously coded data instead of the original data. In general only one coding scheme exists; all data are coded in the same form:
- Find the longest match of a string starting at the
cursor and completely contained in the lookahead
buffer to a string starting in the dictionary.
- Output a triple (p, n, c) containing the position p
of the occurrence in the window, the length n of
the match and the next character c past the match.
- Move the cursor n + 1 characters forward.
Table II shows an example of coding using LZ77; in each step, the data to the left of the cursor form the dictionary and the data to the right of it form the look-ahead buffer.
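A simplified LZ77 encoder sketch in Python that emits (position, length, next character) triples; the position is the distance back from the cursor, the nearest match of maximal length is preferred so that the output matches Table II, and window management and bit-level packing are omitted.

def lz77_encode(data, window=4096, lookahead=16):
    """Encode data as (position, length, next_char) triples; the position is
    the distance back from the cursor (0 means no match was found)."""
    out, cursor = [], 0
    while cursor < len(data):
        best_pos, best_len = 0, 0
        start = max(0, cursor - window)
        # reserve one character after the match for the 'next character' field
        max_len = min(lookahead, len(data) - cursor - 1)
        # scan from the nearest position backwards, keeping the nearest
        # match of maximal length (this reproduces the codes in Table II)
        for pos in range(cursor - 1, start - 1, -1):
            length = 0
            # overlapping matches are allowed, so compare against data itself
            while length < max_len and data[pos + length] == data[cursor + length]:
                length += 1
            if length > best_len:
                best_pos, best_len = cursor - pos, length
        next_char = data[cursor + best_len]
        out.append((best_pos, best_len, next_char))
        cursor += best_len + 1
    return out

print(lz77_encode("aacaacabcabaaac"))
# [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]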
LZ78 is based on a dictionary that will be created
dynamically at runtime. Both the encoding and the
decoding process use the same rules to ensure that an
identical dictionary is available. This dictionary contains
any sequence already used to build the former contents.
The compressed data have the general form: an index addressing an entry of the dictionary, followed by the first deviating symbol. In contrast to LZ77, no combination of address and sequence length is used; only an index into the dictionary is stored [13].
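A compact LZ78 sketch in Python; index 0 stands for the empty phrase, and the example string is hypothetical.

def lz78_encode(data):
    """Encode data as (index, symbol) pairs; the dictionary is built on the fly."""
    dictionary = {"": 0}          # phrase -> index; index 0 is the empty phrase
    out, phrase = [], ""
    for ch in data:
        if phrase + ch in dictionary:
            phrase += ch          # keep extending the current phrase
        else:
            out.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                    # flush a trailing phrase that is already known
        out.append((dictionary[phrase[:-1]], phrase[-1]))
    return out

print(lz78_encode("aacaacabcab"))
# [(0, 'a'), (1, 'c'), (1, 'a'), (0, 'c'), (1, 'b'), (4, 'a'), (0, 'b')]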
Many derivatives have come from these two main models; the most popular are LZSS [15] and LZW [14]. We will not examine any further algorithms because that is beyond the scope of this paper.
F. Context coding
One very effective way to determine that certain
symbols occur with much higher probability is to look at
the probability of occurrence of a letter in the context in
which it occurs. That is, we do not look at each symbol in a sequence in isolation; instead, we examine the history of the sequence before determining the likely probabilities of the different values that the symbol can take.
Context mixing is based on prediction by partial matching (PPM). It takes advantage of the previous k characters to generate the conditional probability of the current character. The simplest way to do this would be to keep a dictionary for every possible string s of k characters, and for each string keep counts for every character x that follows s. The conditional probability of x in the context s is then C(x|s)/C(s), where C(x|s) is the number of times x follows s and C(s) is the number of times s appears. Table III shows an example of this method on the string accbaccacba for k = 2.
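A minimal Python sketch of this counting scheme; it reproduces the order-2 counts of Table III for the string accbaccacba and contains no escape handling or actual coder.

from collections import defaultdict

def context_counts(text, k):
    """Count, for every k-character context, how often each symbol follows it."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(k, len(text)):
        counts[text[i - k:i]][text[i]] += 1
    return counts

def conditional_probability(counts, context, symbol):
    """C(x|s) / C(s): probability of symbol x in context s."""
    total = sum(counts[context].values())
    return counts[context][symbol] / total if total else 0.0

counts = context_counts("accbaccacba", 2)
print(dict(counts["ac"]))                          # {'c': 2, 'b': 1}
print(conditional_probability(counts, "ac", "c"))  # 0.666... = 2/3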
Another context-based algorithm is the one by M. Burrows and D. J. Wheeler [16], called the Burrows-Wheeler Transformation (BWT). The main idea of the algorithm is that a block of the original data is converted into a certain form and sorted afterwards. The result is a sequence of reordered data in which frequently appearing symbol combinations show up as repetitions. The transformation is made by a matrix in which a block of the original data is arranged; each row represents a rotation to the left of the preceding row. Afterwards, this matrix is sorted lexicographically. The transformed data can then be encoded substantially better by adaptive Huffman or arithmetic coding than the original data.
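A small sketch of the forward transformation in Python; practical implementations use suffix sorting and an end-of-block marker, both of which are omitted here, and the example string is hypothetical.

def bwt(block):
    """Burrows-Wheeler Transform of a block: sort all rotations lexicographically
    and keep the last column, plus the index of the original row (needed to invert)."""
    rotations = sorted(block[i:] + block[:i] for i in range(len(block)))
    last_column = "".join(rot[-1] for rot in rotations)
    return last_column, rotations.index(block)

print(bwt("abracadabra"))   # ('rdarcaaaabb', 2): runs of equal symbols appear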
There are also other variants of the mentioned algorithms, but these are the main categories. Archiving tools implement the mentioned algorithms in many different ways and in many different variations. We will compare popular archiving tools and determine their efficiency in compression ratio, speed and the relation between these two parameters.
Table II. Example of coding using LZ77 (the data left of the mark | has already been coded and forms the dictionary; the data right of it is the look-ahead buffer)

Step   Input string                       Output code
1      | a a c a a c a b c a b a a a c    (0, 0, a)
2      a | a c a a c a b c a b a a a c    (1, 1, c)
3      a a c | a a c a b c a b a a a c    (3, 4, b)
4      a a c a a c a b | c a b a a a c    (3, 3, a)
5      a a c a a c a b c a b a | a a c    (1, 2, c)
Table III. Context coding using PPM

Order 0:  empty: a=4, b=2, c=5
Order 1:  a: c=3;  b: a=2;  c: a=1, b=2, c=2
Order 2:  ac: b=1, c=2;  ba: c=1;  ca: c=1;  cb: a=2;  cc: a=1, b=1
Table IV. Tested archiving software

Software                   Algorithms
7-Zip (free software)      filters, LZ77, LZMA, PPM, BWT
FreeARC (free software)    filters, LZMA, PPM, LZP
PAQ8 (free software)       filters, CM
PeaZip (free software)     LZ77, Huff, LZMA, PPM
WinRK (free software)      filters, PPM
WinAce (proprietary)       filters, LZ77, Huff
WinRar (proprietary)       filters, LZ77, PPM, Huff
WinZip (proprietary)       LZH, LZW, SF, Huff, PPM

BWT = Burrows-Wheeler Transform, CM = context modeling, filters = program-specific filters, Huff = Huffman, LZ = Lempel-Ziv algorithms, PPM = prediction by partial match, SF = Shannon-Fano
Table V. Test files

File format   Percentage
txt           25,42%
pdf           17,10%
doc           9,79%
bmp           9,27%
jpg           8,00%
png           6,02%
exe           11,18%
wav           7,58%
avi           5,64%
Table VI. Testing results, ZIP format
Software   Algorithm (method)   Compression ratio (%)   Time (s)   Ratio*time
7-Zip Deflate 36,47% 2,29 70,95
Deflate64 37,14% 2,38 72,97
BZip2 39,47% 4,18 123,41
LZMA 40,30% 7,78 226,53
PPMd 41,46% 6,57 187,58
PeaZip Deflate 36,45% 3,15 97,64
Deflate64 37,14% 3,24 99,34
BZip2 39,47% 4,54 134,04
LZMA 35,77% 5,64 176,69
PPMd 41,46% 7,24 206,71
WinRK Deflate 36,53% 14,83 459,09
Deflate64 37,21% 20,43 625,67
BZip2 39,20% 10,44 309,61
WinAce maximum 35,43% 5,73 180,45
normal 32,83% 6,12 200,50
WinRAR best 35,59% 4,93 154,87
good 35,58% 4,12 129,44
normal 35,47% 6,75 212,44
fast 34,24% 2,23 71,51
fastest 32,81% 1,89 53,74
WinZip SuperFast 32,82% 2,65 86,83
Deflate64 36,19% 8,1 252,08
BZip2 39,47% 9,99 294,94
LZMA 40,34% 13,95 405,91
PPMd 41,53% 27,13 773,73
Best method 43,86% 13,41 367,20
III. TESTING METHOD
There are several things we need to determine for the testing: the specification of the computer used for testing, the archiving tools that we will test, and the data on which we will test these tools.
These are the specifications of the computer on which the test was performed:
- AMD Phenom II X4 945, 3 GHz, 8 MB cache, quad-core
- MB Gigabyte MA785GT-UD3H
- ATI Radeon HD 5750, 1 GB DDR5
- 4 GB KingMax DDR3 RAM
- Seagate 1000 GB SATA II, 7200 rpm, 32 MB cache
- Corsair TX650W power supply
- OS Windows 7 Professional x64
The software that we tested was selected based on the level of use, the license, the algorithms implemented and the supported formats. Table IV shows the software that we tested along with the algorithms they implement. Three solutions are proprietary software, licensed under exclusive legal rights, while the other five are free to be used for any purpose. In this research we will focus on a comparison of the ZIP format and examine the differences between solutions using this format. We will also compare ZIP to some other formats that are used today and look at the differences in efficiency, as well as the differences in compression ratio and speed when using different compression algorithms.
It is also very important to determine the data that we will compress. The set of files created to test lossless compression algorithms was presented in [17] for the first time and was called the Calgary corpus. This corpus defined the most common data files in use at the time, and tests were carried out on these files. These data files are now more than 20 years old. A newer set of files was defined by [18] and called the Canterbury corpus. It was noticed that algorithms were modified and tuned to score good results on these two corpora, so even they do not give entirely fair results. We created a group of test files, based on these two corpora, that uses the most popular data types of today. Table V shows the content of the test data used in this paper. This group of files is made of the most commonly used multimedia data files. Half of the data files are text files because compression algorithms compress text very well and text is never already compressed. Other multimedia files may already be compressed, so those files are less relevant. All files together take 51.142.656 bytes.
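The compression ratio reported in the results below is read here as the percentage of space saved relative to the original size; this interpretation, and the compressed size used in the example, are our own assumptions, since only the resulting percentages are reported.

def compression_ratio(original_size, compressed_size):
    """Compression ratio as the percentage of space saved:
    1 - compressed_size / original_size (our working definition)."""
    return (1.0 - compressed_size / original_size) * 100.0

# The test corpus is 51.142.656 bytes; a hypothetical archive of about
# 28.7 MB would correspond to a ratio of roughly 43.86%, comparable to
# the best ZIP result in Table VI.
print(round(compression_ratio(51_142_656, 28_710_000), 2))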
IV. TESTING RESULTS
Table VI shows the results of the chosen compression tools when using the ZIP format for archiving, based on the different algorithms they use: compression ratio, compression time and the relation between ratio and time. FreeArc can compress only to the arc format, so we will analyze that tool a bit later, when we compare ZIP to other compression formats. PAQ8 cannot compress to the ZIP format either, but both PAQ8 and FreeArc showed interesting results when compared to the ZIP format.
Deflate is a compression procedure based on LZ77 and Huffman coding. Enhanced deflate, or deflate64, keeps the same scheme, but the dictionary is 64 KB large and the length codes are extended. BZip2 is an algorithm that uses run-length encoding, BWT and Huffman coding. LZMA is based on the LZ77 algorithm; the compressed stream is a stream of bits, encoded using an adaptive binary range coder. PPMd is an optimized PPM model that increments the pseudocount of the never-seen symbol every time the never-seen symbol is used. Some tools only offer profiles that trade off speed against compression, usually called fastest or maximum compression.
Table VII. Test results, compression formats
Format + algorithm   Compression ratio (%)   Time (s)   Ratio*time
ZIP maximum 43,86% 13,41 367,20
ZIP normal 35,47% 6,75 212,44
ZIP fastest 32,81% 1,64 53,74
ARC speed 37,20% 1,08 33,08
ARC very fast 72,25% 2,38 32,21
ARC fast 72,80% 4,16 55,19
ARC normal 73,27% 4,68 61,02
ARC high 77,63% 6,25 68,19
ARC maximum 77,64% 6,34 69,15
7Z LZMA 84,34% 12,51 95,54
7Z LZMA2 84,36% 12,55 95,71
7Z PPMd 50,86% 20,29 486,25
7Z BZip2 44,17% 5,40 147,04
RAR best 40,54% 9,32 270,28
RAR normal 38,64% 7,13 213,37
RAR fastest 33,96% 2,89 93,09
LPAQ8 57,42% 34,68 720,15
ZPAQ 58,00% 154,60 3166,88
RK maximum 85,70% 156,74 1093,51
RK fast 84,99% 27,21 199,19
Picture 2. Compression time (seconds)
Picture 3. Compression ratio (percentage)
If we look at the compression ratio we can see that there is no significant difference, no matter which algorithm is used; results vary from 32,81% to 43,86%. All deflate results are at approximately 36%, deflate64 is always around 37%, LZMA is always around 40% (except in PeaZip) and PPMd is always around 41%. So we can conclude that, even though compression tools have their own filters, the compression ratio mostly depends on the compression algorithm that is used, with only small deviations in the results. The average compression ratio in the ZIP format, over all tools and all algorithms, is 37,47%.
If we look at the speed of compression, all times are approximately in the range from 2 to 15 seconds, with two extremes. But even with those slow results, the compression ratio did not change much. WinRAR scored the fastest compression time, but with the lowest compression ratio. The best times combined with the best compression ratio were achieved by 7-Zip and PeaZip using PPMd, scoring around a 41% compression ratio in about 7,5 seconds. We can see that when compressing with the ZIP format we cannot get significant differences in compression effectiveness.
Table VII shows a comparison with some other compression formats, with different presets and algorithms. Today most compression tools support the zip, rar and 7z formats; for other formats you need specific software. We can see that the zip and rar formats scored similar results. The maximum compression, around 85%, was scored by 7z LZMA and LZMA2 along with the RK format. 7Z, with a compression ratio of 84,36% and a time of 12,55 seconds, is the best solution for most computer users, scoring a very high compression ratio with a relatively low compression time, since the compression ratio is the more important factor for most users. The arc format showed some really great results if speed is the limiting factor, scoring 37,20% in just about 1 second and 72,25% in 2,38 seconds. Those results beat the ZIP format in compression ratio and speed to a great degree, and they give the best proportion between compression and speed. PAQ8 showed great results in compression ratio, but those algorithms are slow. PAQ-like algorithms are showing some great results in compression ratio, beating all other algorithms, and they won the Hutter Prize a few years in a row [19]. Unfortunately, most of those algorithms are not implemented in the compression tools we use today; PeaZip implements only 3 variants of those algorithms.
Picture 2 shows the relationship between the best, average and worst compression times for each compression tool, as well as the relationship among the different compression tools. If you need a fast compression method, then it is recommended to use WinRAR, 7-Zip or FreeArc. PeaZip also has good results for the best and average compression time, but its worst case is not sufficiently good.
Picture 3 shows the relationship between the best, average and worst compression ratios for each compression tool, as well as the relationship among the different compression tools. If you need the best compression ratio, then it is recommended to use WinAce, FreeArc and PeaZip. FreeArc has a slightly lower best compression ratio, but it has the best average compression ratio, so it is good to keep that in mind.
V. CONCLUSION
There are many different algorithms that deal with the compression problem. We can see that the ZIP format is not nearly the best solution for data archiving today; there are new formats and new algorithms. The 7z format scores really good compression ratios with decent speed. PAQ8 algorithms deliver the best compression ratios, but those algorithms are really slow and there is a need for tools that will implement them. The arc format has great compression speed and is the best solution for applications that need real-time compression. We can conclude that FreeArc showed the overall best results as a
compression tool, both in compression ratio and speed. Arc as a compression format is still not a standard, but we believe that, because of its performance, it will be much more widely used in the future.
REFERENCES
[1] C. N. Parkinson: Parkinson's Law: and Other Studies in
Administration, Ballantine Books, New York, 1987.
[2] K. Sayood: Introduction to Data Compression, 3rd ed., Morgan
Kaufmann, San Francisco, 1995.
[3] G. E. Blelloch: Introduction to Data Compression, Carnegie Mellon University, 2010.
[4] C. E. Shannon: A Mathematical Theory of Communication, Bell System Technical Journal, 27:379-423, 623-656, 1948.
[5] R. M. Fano: Transmission of Information. MIT Press, Cambridge,
MA, 1961.
[6] D. A. Huffman: A method for the construction of minimum-redundancy codes. Proc. IRE, 40:1098-1101, 1952.
[7] R. G. Gallager: Variations on a theme by Huffman. IEEE Transactions on Information Theory, IT-24(6):668-674, November 1978.
[8] N. Faller: An Adaptive System for Data Compression. 7th
Asilomar Conference on Circuits, Systems, and Computers, p.
593-597, IEEE, 1973.
[9] J. Rissanen: Modeling by the Shortest Data Description,
Automatica, 14:465-471, 1978.
[10] A. Moffat: Linear time adaptive arithmetic coding, IEEE
Transactions on Information Theory, 36:401-406, 1990.
[11] J. Rissanen, G. G. Langdon: Arithmetic coding, IBM Journal of
Research and Development, 23:149-162, 1979.
[12] J. Ziv, A. Lempel: A universal algorithm for sequential data compression, IEEE Transactions on Information Theory, 23:337-343, 1977.
[13] J. Ziv, A. Lempel: Compression of individual sequences via
variable-rate coding, IEEE Transactions on Information Theory,
24:530-536, 1978.
[14] T. A. Welch: A technique for high-performance data compression,
IEEE Computer, 17:8, June 1984.
[15] J. A. Storer, T. G. Szymanski: Data compression via textual substitution, Journal of the ACM, 29:928-951, 1982.
[16] M. Burrows, D. J. Wheeler: A Block Sorting Data Compression
Algorithm, Technical Report SRC 124, Digital Systems Research
Center, 1994.
[17] T. C. Bell, J. G. Cleary, I. H. Witten: Text Compression, Prentice Hall, Englewood Cliffs, NJ, 1990.
[18] R. Arnold, T. Bell: A corpus for the evaluation of lossless
compression algorithms, Proceedings on Data Compression
Conference, p. 201-210, 1997.
[19] http://prize.hutter1.net/ [7.2.2011.]