
Data compression is the process of converting an input data stream (the source stream, or the original raw data) into another data stream that has a smaller size. Data compression is popular for two reasons:

1) People like to accumulate data and hate to throw anything away. No matter how large a storage device may be, sooner or later it is going to overflow. Data compression seems useful because it delays this inevitability.
2) People hate to wait a long time for data transfers.

There are many known methods of data compression. They are based on different ideas and are suitable for different types of data. They produce different results, but they are all based on the same basic principle: they compress data by removing the redundancy from the original data in the source file. The idea of compression by reducing redundancy suggests the general law of data compression, which is to assign short codes to common events and long codes to rare events. Data compression is done by changing the representation of the data from an inefficient form to an efficient one.

The main aim of the field of data compression is, of course, to develop methods for better and better compression. Experience shows that fine-tuning an algorithm to squeeze out the last remaining bits of redundancy from the data gives diminishing returns. Data compression has become so important that some researchers have proposed the "simplicity and power" theory. Specifically, it says that data compression may be interpreted as a process of removing unnecessary complexity in information, thus maximizing simplicity while preserving as much as possible of its non-redundant descriptive power.

BASIC TYPES OF DATA COMPRESSION
There are two basic types of data compression:
1. Lossy compression
2. Lossless compression

LOSSY COMPRESSION
In lossy compression some information is lost during processing: the image data is sorted into important and unimportant data, and the system then discards the unimportant data.
It provides much higher compression ratios, but there is some loss of information compared with the original source file. The main advantage is that the loss may not be visible to the eye, i.e., the result is visually lossless. Visually lossless compression is based on knowledge about colour images and human perception.

LOSSLESS COMPRESSION
In this type of compression no information is lost during the compression and decompression process. Here the reconstructed image is mathematically and visually identical to the original one. It achieves only about a 2:1 compression ratio. This type of compression technique looks for patterns in strings of bits and then expresses them more concisely.

TECHNIQUES OF DATA COMPRESSION
There are three important techniques of data compression:
1) Basic techniques
2) Statistical techniques
3) Dictionary methods

BASIC TECHNIQUES
These are the techniques that were used mainly in the past. The important basic techniques are run length encoding and move-to-front encoding.

STATISTICAL TECHNIQUES
These are based on a statistical model of the data. Three important techniques come under this heading:
- Shannon-Fano coding
- Huffman coding
- Arithmetic coding

DICTIONARY METHODS
These methods select strings of symbols and encode each string as a token using a dictionary. The important dictionary methods are
- LZ77 (sliding window)
- LZRW1



BASIC TECHNIQUES

1. RUN LENGTH ENCODING

The basic idea behind this approach to data compression is this: if a data item d occurs n consecutive times in the input stream, replace the n occurrences with the single pair (n, d). The n consecutive occurrences of a data item are called a run length of n, and this approach is called run length encoding, or RLE.

RLE IMAGE COMPRESSION
RLE is a natural candidate for compressing graphical data. A digital image consists of small dots called pixels. Each pixel can be either one bit, indicating a black or white dot, or several bits, indicating one of several colours or shades of gray. We assume that the pixels are stored in an array called a bitmap in memory. Pixels are normally arranged in the bitmap in scan lines, so the first bitmap pixel is the dot at the top left corner of the image and the last pixel is the one at the bottom right corner.
Compressing an image using RLE is based on the observation that if we select a pixel in the image at random, there is a good chance that its neighbours will have the same colour. The compressor thus scans the bitmap row by row, looking for runs of pixels of the same colour.
Consider the grayscale bitmap
12, 12, 12, 12, 12, 12, 12, 12, 12, 35, 76, 112, 67, 87, 87, 87, 5, 5, 5, 5, 5, 5, 1
Compressed form (each run becomes a count followed by the value; isolated values are copied as they are):

9, 12, 35, 76, 112, 67, 3, 87, 6, 5, 1
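
Below is a minimal Python sketch of the scheme used in this example: a run of three or more identical values is replaced by a (count, value) pair, while shorter stretches are copied through unchanged. The threshold of three and the flat output list are illustrative assumptions; a practical format would also need an escape code so the decoder can tell counts from literal values.

def rle_encode(pixels):
    out = []
    i = 0
    while i < len(pixels):
        j = i
        while j < len(pixels) and pixels[j] == pixels[i]:
            j += 1
        run = j - i
        if run >= 3:                  # long run: emit (count, value)
            out.extend([run, pixels[i]])
        else:                         # short stretch: copy literally
            out.extend(pixels[i:j])
        i = j
    return out

bitmap = [12]*9 + [35, 76, 112, 67] + [87]*3 + [5]*6 + [1]
print(rle_encode(bitmap))
# [9, 12, 35, 76, 112, 67, 3, 87, 6, 5, 1]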


2. MOVE TO FRONT CODING
The basic idea of this method is to maintain the alphabet A of symbols as a list in which frequently occurring symbols are located near the front. A symbol 'a' is encoded as the number of symbols that precede it in this list. Thus if A = ('t','h','e','s') and the next symbol in the input stream to be encoded is 'e', it will be encoded as 2, since it is preceded by two symbols. The next step is that after encoding 'e' the alphabet is modified to A = ('e','t','h','s'). This move-to-front step reflects the hope that once 'e' has been read from the input stream it will be read many more times and will, at least for a while, be a common symbol.
Let A = ("t", "h", "e", "s").
After encoding the symbol "e", A is modified.
Modified form:
A = ("e", "t", "h", "s")

ADVANTAGE
This method is locally adaptive, since it adapts itself to the frequencies of symbols in local areas of the input stream. It produces good results if the input stream satisfies this hope, that is, if the local frequency of symbols changes significantly from area to area in the input stream.


STATISTICAL TECHNIQUES
1. SHANNON-FANO CODING
Shannon-Fano coding was the first method developed for finding good variable-size codes. We start with a set of n symbols with known probabilities of occurrence. The symbols are first arranged in descending order of their probabilities. The set is then divided into two subsets that have roughly the same total probability. All symbols of one subset are assigned codes that start with a zero, while the codes of the symbols in the other subset start with a one. Each subset is then recursively divided into two, and the second bit of all codes is determined in a similar way. When a subset contains just two symbols, their codes are distinguished by adding one more bit to each. The process continues until no subset remains.
Consider a set of seven symbols whose probabilities are given below, arranged in descending order. The two symbols in the first subset are assigned codes that start with 1, so their final codes are 11 and 10. The second subset is divided, in the second step, into two symbols and three symbols. Step 3 divides the last three symbols into 1 and 2.
Shannon-Fano example:

     Prob.     Steps          Final
____________________________________________
1.   0.25      1  1           :11
2.   0.20      1  0           :10
3.   0.15      0  1  1        :011
4.   0.15      0  1  0        :010
5.   0.10      0  0  1        :001
6.   0.10      0  0  0  1     :0001
7.   0.05      0  0  0  0     :0000

The average size of this code is
= 0.25 x 2 + 0.20 x 2 + 0.15 x 3 + 0.15 x 3 + 0.10 x 3 + 0.10 x 4 + 0.05 x 4
= 2.7 bits/symbol.
This is a good result, because the entropy is about 2.67 bits/symbol.
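
The following Python sketch follows the procedure described above. The rule used to pick the split point (minimizing the difference between the two subsets' totals) is one common reading of "the same total probability"; run on the seven probabilities of the example, it reproduces the codes in the table.

def shannon_fano(symbols):
    """symbols: list of (name, probability), sorted descending."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    # find the split point that best balances the two subset totals
    total = sum(p for _, p in symbols)
    acc, best_i, best_diff = 0.0, 1, float("inf")
    for i in range(1, len(symbols)):
        acc += symbols[i - 1][1]
        diff = abs(acc - (total - acc))
        if diff < best_diff:
            best_diff, best_i = diff, i
    codes = {}
    for sym, code in shannon_fano(symbols[:best_i]).items():
        codes[sym] = "1" + code        # first subset: codes start with 1
    for sym, code in shannon_fano(symbols[best_i:]).items():
        codes[sym] = "0" + code        # second subset: codes start with 0
    return codes

probs = [("a1", .25), ("a2", .20), ("a3", .15), ("a4", .15),
         ("a5", .10), ("a6", .10), ("a7", .05)]
print(shannon_fano(probs))
# {'a1': '11', 'a2': '10', 'a3': '011', 'a4': '010',
#  'a5': '001', 'a6': '0001', 'a7': '0000'}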

ADVANTAGE
The advantage of this method is that it is very easy to implement.

2. HUFFMAN CODING
A commonly used method for data compression is Huffman coding. The method starts by building a list of all the alphabet symbols in descending order of their probabilities. It then constructs a tree, with a symbol at every leaf, from the bottom up. This is done in steps, where at each step the two symbols with the smallest probabilities are selected, added to the top of the partial tree, deleted from the list, and replaced with an auxiliary symbol representing both of them. When the list is reduced to just one auxiliary symbol, the tree is complete. The tree is then traversed to determine the codes of the symbols.
The Huffman method is somewhat similar to the Shannon-Fano method. The main difference between the two is that Shannon-Fano constructs its codes from top to bottom, while Huffman constructs a code tree from the bottom up.
This is best illustrated by an example. Given five symbols a1 through a5 with probabilities 0.4, 0.2, 0.2, 0.1, and 0.1, they are paired in the following order:

1. a4 is combined with a5, and both are replaced by the combined symbol a45, whose probability is 0.2.
2. There are now four symbols left: a1, with probability 0.4, and a2, a3, and a45, with probabilities 0.2 each. We arbitrarily select a3 and a45, combine them, and replace them with the auxiliary symbol a345, whose probability is 0.4.
3. Three symbols are now left: a1, a2, and a345, with probabilities 0.4, 0.2, and 0.4 respectively. We arbitrarily select a2 and a345, combine them, and replace them with the auxiliary symbol a2345, whose probability is 0.6.
4. Finally, we combine the two remaining symbols, a1 and a2345, and replace them with a12345, with probability 1.

The tree is now complete, "lying on its side" with the root on the right and the five leaves on the left. To assign the codes, we arbitrarily assign a bit of 1 to the top edge and a bit of 0 to the bottom edge of every pair of edges. This results in the codes 0, 10, 111, 1101, and 1100. The assignment of bits to the edges is arbitrary.

The average size of this code is 0.4 x 1 + 0.2 x 2 + 0.2 x 3 + 0.1 x 4 + 0.1 x 4 = 2.2 bits/symbol but, even more importantly, the Huffman code is not unique: the arbitrary choices made along the way can produce different codes with the same average size.
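
Here is a compact bottom-up sketch in Python using a heap. Tie-breaking among equal probabilities is arbitrary, so the printed codes may differ from the ones above, but the average size comes out the same, 2.2 bits/symbol.

import heapq
from itertools import count

def huffman(probs):
    """probs: dict of symbol -> probability; returns symbol -> code."""
    tick = count()  # tie-breaker so the heap never compares the dicts
    heap = [(p, next(tick), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # two smallest probabilities
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}   # bottom edge: 0
        merged.update({s: "1" + c for s, c in c2.items()})  # top edge: 1
        heapq.heappush(heap, (p1 + p2, next(tick), merged))
    return heap[0][2]

probs = {"a1": .4, "a2": .2, "a3": .2, "a4": .1, "a5": .1}
codes = huffman(probs)
avg = sum(p * len(codes[s]) for s, p in probs.items())
print(codes)
print("average size:", avg, "bits/symbol")   # 2.2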

APPLICATION IN IMAGE COMPRESSION
The following approaches illustrate how the aforesaid techniques are applied to image compression. Photographic digital images generate a lot of data, taking up large amounts of storage space, and this is one of the main problems encountered in digital imaging. To rectify this problem, image compression is used, depending on the type of data: text, graphics, photographic, or video. Image compression reduces image data by identifying patterns in the bit strings describing pixel values and replacing them with a short code.

BASIC PRINCIPLE OF IMAGE COMPRESSION
The idea of losing image information becomes more palatable when we consider how digital images are created. Here are three examples: (1) A real-life image may be scanned from a photograph or a painting and digitized (converted to pixels). (2) An image may be recorded by a video camera that creates pixels and stores them directly in memory. (3) An image may be painted on the screen by means of a paint program. In all these cases, some information is lost when the image is digitized. The fact that the viewer is willing to accept this loss suggests that further loss of information might be tolerable if done properly.
Digitizing an image involves two steps: sampling and quantization. Sampling an image is the process of dividing the two-dimensional original image into small regions: pixels. Quantization is the process of assigning an integer value to each pixel. Notice that digitizing sound involves the same two steps, with the difference that sound is one-dimensional.

Here is a simple process to determine qualitatively the amount of data loss in a compressed image. Given an image A, (1) compress it to B, (2) decompress B to C, and (3) compute the difference D = C - A. If A was compressed without any loss and decompressed properly, then C should be identical to A and image D should be uniformly white. The more data was lost in the compression, the farther D will be from uniformly white.
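
A minimal sketch of this test (using NumPy, an assumption, with a placeholder identity codec standing in for a real compressor); here a zero difference plays the role of "uniformly white":

import numpy as np

def loss_image(A, compress, decompress):
    B = compress(A)
    C = decompress(B)
    return C.astype(int) - A.astype(int)   # D: per-pixel error

A = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
# identity "codec": any truly lossless method behaves the same way
D = loss_image(A, lambda x: x.copy(), lambda x: x.copy())
print("lossless:", not D.any())            # True: D is uniformly zero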
The main principles discussed so far were RLE, scalar quantization, statistical methods, and dictionary-based methods. By itself, none is very satisfactory for color or grayscale images.
RLE can be used for (lossless or lossy) compression of an image. This is simple, and it is used by certain parts of JPEG, especially by its lossless mode. In general, however, the other principles used by JPEG produce much better compression than RLE alone. Facsimile compression uses RLE combined with Huffman coding and gets good results, but only for bi-level images.
Scalar quantization can be used to compress images, but its performance is mediocre. Imagine an image with 8-bit pixels. It can be compressed with scalar quantization by cutting off the four least significant bits of each pixel. This yields a compression ratio of 0.5 (the output is half the size of the input), which is not very impressive, and at the same time reduces the number of colors (or grayscales) from 256 to just 16. Such a reduction not only degrades the overall quality of the reconstructed image, but may also create bands of different colors, which is a noticeable and annoying effect.
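
A small sketch of this bit-truncation, again assuming NumPy; shifting right by four bits keeps only the four most significant bits of each pixel:

import numpy as np

pixels = np.random.randint(0, 256, 16, dtype=np.uint8)
quantized = pixels >> 4          # keep the 4 most significant bits
reconstructed = quantized << 4   # crude reconstruction: 16 gray levels
print(pixels[:6])
print(reconstructed[:6])         # banded: all values are multiples of 16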
Statistical methods work best when the symbols being compressed have different probabilities. An input stream in which all symbols have the same probability will not compress, even though it may not necessarily be random. It turns out that in a continuous-tone color or grayscale image the different colors or shades often have roughly the same probabilities. This is why statistical methods are not a good choice for compressing such images, and why new approaches are needed. Images with color discontinuities, where adjacent pixels have widely different colors, compress better with statistical methods, but it is not easy to predict, just by looking at an image, whether it has enough color discontinuities.
Dictionary-based compression methods also tend to be unsuccessful in dealing with continuous-tone images. Such an image typically contains adjacent pixels with similar colors but does not contain repeating patterns. Even an image that contains repeated patterns, such as vertical lines, may lose them when digitized. A vertical line in the original image may become slightly slanted when the image is digitized, so the pixels in a scan row may end up having slightly different colors from those in adjacent rows, resulting in a dictionary with short strings.
Another problem with dictionary compression of images is that such methods scan the image row by row and may thus miss vertical correlations between pixels. Traditional methods are therefore unsatisfactory for image compression, so we turn to novel approaches. They are all different, but they remove redundancy from an image by using the following principle:
Image compression is based on the fact that neighbouring pixels are highly correlated.

APPROACH 1
This is used for bi-level images. A pixel in such an image is represented by one bit. Applying the principle of image compression here means that the immediate neighbours of a pixel 'p' tend to be similar to 'p', so it makes sense to use run length encoding to compress the image. A compression method for such an image may scan it in raster order, i.e., row by row, and compute the lengths of the runs of black and white pixels. The runs are encoded with variable-size codes and written on the compressed stream, as in the sketch below. An example of such a method is facsimile compression.
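
A small Python sketch of the raster scan just described; it only lists the run lengths per row, where a real method such as facsimile compression would go on to replace each length with a variable-size code:

def row_runs(row):
    runs, run = [], 1
    for prev, cur in zip(row, row[1:]):
        if cur == prev:
            run += 1        # extend the current run
        else:
            runs.append(run)
            run = 1         # colour changed: start a new run
    runs.append(run)
    return runs

bitmap = [[0, 0, 0, 1, 1, 0, 0, 0],     # 0 = white, 1 = black
          [0, 0, 1, 1, 1, 1, 0, 0]]
for row in bitmap:
    print(row_runs(row))    # [3, 2, 3] then [2, 4, 2]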
Data compression is especially important when images are transmitted over a communication line, because the user is typically waiting at the receiver, eager to see something quickly. Documents transferred between fax machines are sent as bitmaps, so a standard data compression method was needed when fax machines were developed; such standards were proposed by the International Telecommunications Union (ITU). Although the ITU has no power of enforcement, the standards it recommends are generally accepted and adopted by industry.
The first data compression standards developed by the ITU were T2 and T3. These are now obsolete and have been replaced by T4 and T6. They have typical speeds of 64K baud. Both methods can produce compression ratios of 10:1 or better, reducing the transmission time of a typical page to about a minute with the former and a few seconds with the latter.