Arithmetic Compression With C

Arithmetic Compression With C#
Agola Kisira Odero, 10 Dec 2009
This article describes arithmetic compression with an implemetnation in C#.

Download source code - 19.5 KB
Introduction
According to Wikipedia:
Arithmetic coding is a method for lossless data compression. Normally, a string of
characters such as the words hello there is represented using a fixed number of bits
per character, as in the ASCII code. Like Huffman coding, arithmetic coding is a form
of variable-length entropy encoding that converts a string into another form that
represents frequently used characters using more bits, with the goal of using fewer
bits in total. As opposed to other entropy encoding techniques that separate the
input message into its component symbols and replace each symbol with a code
word, arithmetic coding encodes the entire message into a single number, a fraction
ne, where (0.0 n < 1.0).
Some of the properties of arithmetic coding are:
1.
When applied to independent and identically distributed (IID) sources, the

compression of each symbol is provably optimal.
2.
It is effective in a wide range of situations and compression ratios. The same

arithmetic coding implementation can effectively code all the diverse data
created by different processes such as modeling parameters, transform
coefficients, signaling etc.
3.
It simplifies the modeling of complex sources, yielding near optimal or

significantly improved compression for sources that are not IID.
4.
Its main process is arithmetic, which is supported with ever-increasing

efficiency by all general purpose or digital processors.
5.
It is suited for use as a compression black-box by those who are not coding
experts or do not want to implement the coding algorithm themselves.
In this article, I present an implementation of an arithmetic coder in C# (coder.cs in

the source code). The coding algorithm is adapted from an algorithm originally
presented
in
C
by
Mark
Nelson
in
his
paper
available
here
(http://dogma.net/markn/articles/arith/part1.htm). The coding algorithm has been
separated from the statistical model and encapsulated as a C# object.
Description
The coder object in the source code attached to this article can work as a black box
implementation of an arithmetic encoder/decoder. However, to use it, we need to
understand the basics of statistical modeling for arithmetic coding and a few of the
basic concepts of arithmetic coding.
In general, using arithmetic coding depends on creating a statistical model of the
data. In this example, I will assume that we are trying to encode the words HELLO
WORLD.
Creating the statistical model of the data proceeds as follows:

1.
Taking the number of independent characters in the words to be encoded

(HELLO WORLD), we obtain, in alphabetical order:
i.
ii.
iii.
iv.
v.
vi.
vii.
The total number of characters to be encoded is 10. For convenience and clarity, I
have ignored the space between the words HELLO and WORLD.
2.
We can arrange these characters into a table with their corresponding

frequency, as below:
Charact Frequenc
er
y
3.
Each of the characters will be assigned a range based on its frequency/

probability of occurrence. This range will be between 0 and 1, as below (note that
I have not used any optimizations for the frequency and probability model; for the
most optimal compression, this is necessary; however, for the purposes of our
example, this should do).
Charact Frequenc Probabilit

Range
er
y
y
D
1/10
0.0 0.1
1/10
0.1 0.2
1/10
0.2 0.3
3/10
0.3 0.6
2/10
0.6 0.8

Range
er
y
y
1/10
0.8 0.9
1/10
0.9 1.0
4.
The algorithm for encoding is as below:
Hide Copy Code
Set low to 0.0

Set high to 1.0
While there are still input symbols do
get an input symbol
code_range = high - low.
high = low + code_range * high_range of the symbol being coded
low = low + code_range * low_range of the symbol being coded
End of While
output low
Applying this to our input (HELLO WORLD), we obtain:

Hide Shrink
Copy Code
encoding H (Hs range is 0.2 - 0.3) Range(or code_range above) = 1 - 0 = 1

low = 0 + (1 * 0.2) = 0.2
high = 0 + (1 * 0.3) = 0.3
no output
encoding E (Es range is 0.1 - 0.2) Range = 0.3 - 0.2 = 0.1
low = 0.2 + (0.1 * 0.1) = 0.21

high = 0.2 + (0.1 * 0.2) = 0.22
output 0.2
encoding L (Ls range is 0.3 - 0.6) Range = 0.22 - 0.21 = 0.01
low = 0.21 + (0.01 * 0.3) = 0.213

high = 0.21 + (0.01 * 0.6) = 0.216
output 0.21
encoding the next L (Ls range is 0.3 - 0.6) Range = 0.216 - 0.213 = 0.003
low = 0.213 + (0.003 * 0.3) = 0.2139

high = 0.213 + (0.003 * 0.6) = 0.2148
no output
encoding O (Os range is 0.6 - 0.8) Range = 0.2148 - 0.2139 = 0.0009
low = 0.2139 + (0.0009 * 0.6) = 0.21444

high = 0.2139 + (0.0009 * 0.8) = 0.21462
output 0.214
encoding W (Ws range is 0.9 - 1.0) Range = 0.21462 - 0.21444 = 0.00018
low = 0.21444 + (0.00018 * 0.9) = 0.214602
high = 0.21444 + (0.00018 * 1.0) = 0.21462
output 0.2146
encoding O (Os range is 0.6 - 0.8) Range = 0.21462 - 0.214602 = 0.000018
low = 0.214602 + (0.000018 * 0.6) = 0.2146128
high = 0.214602 + (0.000018 * 0.8) = 0.2146164
output 0.21461
encoding R (Rs range is 0.8 - 0.9) Range = 0.2146164 - 0.2146128 = 0.0000036
low = 0.2146128 + (0.0000036 * 0.8) = 0.21461568
high = 0.2146128 + (0.0000036 * 0.9) = 0.21461604
no output
encoding L (Ls range is 0.3 - 0.6) Range = 0.21461604 - 0.21461568 = 0.00000036
low = 0.21461568 + (0.00000036 * 0.3) = 0.214615788
high = 0.21461568 + (0.00000036 * 0.6) = 0.214615896
output 0.214615
encoding D (Ds range is 0.0 - 0.1) Range = 0.214615896 - 0.214615788 = 0.000000108
low = 0.214615788 + (0.000000108 * 0.0) = 0.214615788
high = 0.214615788 + (0.000000108 * 0.1) = 0.2146157988
output 0.2146157 and 88 from low
...
Giving the table below:

Range Low
er
y
y
High
1/10
0.2 0.30.2
0.3
1/10
0.1 0.20.21
0.22
3/10
0.3 0.60.213
0.216
3/10
0.3 0.60.2139
0.2148
3/10
0.6 0.80.21444
0.21462
1/10
0.9 1.00.214602
0.214620
2/10
0.6 0.80.2146128 0.2146164
1/10
0.8 0.90.21461568 0.21461604
3/10
0.3 0.60.21461578 0.21461589

Range Low
er
y
y
8
D
1/10
0.0 0.1
High
6
0.21461578 0.21461580
8
6
So the final low value 0.214615788 will encode the string HELLOWORLD (without
the space, which was omitted for clarity).
5.
The algorithm for decoding the number and retrieving the encoded string/data
is as below:
Find the symbol represented by the range that the number is in, output it. Remove
the effects of encoding and repeat. In pseudo-code:
Hide Copy Code
get the number encoding the data

loop
current symbol = the symbol/character in which range the number falls
current range = current symbols high value current symbols low value
subtract current symbols low value from number
divide the number by the current range
end loop
This algorithm would give us the working below, taking the output from the example
above:
Hide Shrink
Copy Code
The number is 0.214615788

current symbol = H (range: 0.2 0.3)
current range = 0.3 0.2 = 0.1
subtract current symbols low value from number = 0.214615788 0.2 = 0.014615788
divide the number by the current range = 0.014615788/0.1 = 0.14615788
current symbol = E (range: 0.1 0.2)

current symbol = L (range: 0.3 0.6)

current symbol = L (range: 0.3 0.6)

current symbol = O (range: 0.6 0.8)

current symbol = W (range: 0.9 1.0)

This working yields the table below:

Range Number
er
y
y
0.21461578
8
1/10
0.2 0.3
1/10
0.1 0.20.14615788
3/10
0.3 0.60.4615788
3/10
0.3 0.60.538596
3/10
0.6 0.80.79532
1/10
0.9 1.00.9766
2/10
0.6 0.80.766
1/10
0.8 0.90.83
3/10
0.3 0.60.3
1/10
0.0 0.10.0
Implementation
The above algorithms give us the basis of a working implementation of arithmetic
coding.
In a practical implementation of arithmetic coding/decoding, note that:
i.
The range assigned to a symbol includes the lower limit but not the upper
limit of that range. In other words, in our example, the symbol H owns the
range 0.2 to 0.29999...but not 0.3. Mathematically, this can be written as H
owns the range (0.2 n < 0.3).
ii.
We can use integer mathematics to implement arithmetic coding. This allows

us to simplify the representation of symbols and ranges, and frees us from
some of the limitations of floating point calculations. We achieve this
simplification by using integers and setting an imaginary decimal point at the
beginning of the range. Using this simplification and applying it to our example,
our probability tables now looks like this:

Range
er
y
y
D
1/10
0000
0999
1/10
1000
1999
1/10
2000
2999
3/10
3000
5999
2/10
6000
7999
1/10
8000
8999
1/10
9000
9999
Basically, this means that the symbol D owns the range 0.0000 to 0.0999...which is
the range (0.0 n < 0.1) written in mathematical notation.
Keep in mind that the number 99999... should be thought of as 0.9999..., with an
infinity of 9s after the decimal point. In the limit, there is an infinitesimally small
difference between this upper limit 1.
iii.
Consider the encoding table above:

Range Low
er
y
y
High
1/10
0.2 0.30.2
0.3
1/10
0.1 0.20.21
0.22
3/10
0.3 0.60.213
0.216
3/10
0.3 0.60.2139
0.2148
3/10
0.6 0.80.21444
0.21462
1/10
0.9 1.00.214602
0.214620

Range Low
er
y
y
High
2/10
0.6 0.80.2146128 0.2146164
1/10
0.8 0.90.21461568 0.21461604
3/10
0.3 0.6
0.21461578 0.21461589
8
6
1/10
0.0 0.1
0.21461578 0.21461580
8
6
Notice that as the encoding proceeds, the high and low numbers significant digits
tend to converge.
In the second iteration (while encoding E), the digit 2 converges for both the high
and low numbers, and will never change again regardless of how many more
characters we encode thereafter. This is a property of the encoding algorithm which
continually narrows the encoding range.
As encoding proceeds, we obtain the sequence of low numbers, 0.2, 0.21, 0.213, and
the high numbers, 0.3, 0.22, 0.216 (output when encoding L). At this point, the
significant digits 2 and 1 have converged for both the high and low numbers.
Again, at the fifth iteration (while encoding 0), we have the number 0.21444 in the
current low number, and 0.21462 in the current high number. At this point, the most
significant numbers 2, 1, and 4 have converged.
In a practical implementation, once the significant digits of the high and low
numbers have converged, they can be considered to have no further effect on the
calculation. They act to simply narrow the encoding range, retaining the decimal
place of the encoding, but have no further significant effect on subsequent
calculations.
If we imagine our upper and lower numbers as being in an infinitely large array (or a
very large array), in our practical calculation, we may safely ship out any significant
digits that have converged, and simply shift the entire array one more place to the
right.
iv.
The above calculations can be modeled using integer mathematics. As

described below:
We may set the range between high and low to 00000 and 99999, with an imaginary
decimal point in front of these numbers. Furthermore, we will assume that the
number of 0s in the low number after the decimal point stretches to infinity, and the
number of 9s in the high number also stretches to infinity.
For the purposes of our calculation, while the range between 00000 (0.00000...) and
99999 (0.99999...) is actually 99999 (0.99999...), we increment it by 1 (0.000...1).
This is because the number of 9s after the decimal point is taken to be infinite.
Therefore, the difference between the high and low numbers in the limit is actually 1.
Keep in mind that the imaginary decimal point is in front of each of the high or low
range numbers below. The numbers 0.000000... and 0.99999... should be assumed
to continue indefinitely. Since our bounds are now 0.00000 and 0.99999, and not 0.0
and 1.0, when calculating the range, we will add 1 to compensate. I.e., the high
value should be considered to be 1.
In the pseudo code below, MSD stands for the Most Significant Digit or the Left Most
Digit in the number.
Hide Copy Code
set low to 00000

set high to 99999
while there are still input symbols
get an input symbol

code_range = (high low) + 1
high = low + code_range * high_range of the symbol being coded
low = low + code_range * low_range of the symbol being coded
while the MSD of high and low match

if the MSD of the high and low match
output MSD
remove MSD from low and shift a new 0 into low
remove MSD of high and shift a new 9 into high
end if
end while the MSD of high and low match
end while there are still input symbols
Applying this to our input (HELLO WORLD) we obtain:

Hide Copy Code
First initialize high and low,
Set low to 00000 (or 0.000...)

Set high to 99999(or 0.999...)
encoding H (Hs range is 0.2 0.3) Range(or code_range above) =

99999 - 00000 = 99990 + 1 = 100000 (or 1.00000...)
low = 00000 + (100000 * 0.2) = 20000 (or 0.200000)

high = 00000 + (100000 * 0.3) = 30000 1 = 29999 (or .299999...)
Also, at this point, 2 has converged, so shifting out 2 and furthermore shifting in
another 9 into the high array gives us the new high and low, as below:
Hide Copy Code
output 2
Set low to 00000

Set high to 99999
encoding E (Es range is 0.1 0.2) Range = (99999 00000) + 1 = 100000
low = 00000 + (100000 * 0.1) = 10000 (or 0.100000...)

high = 19999 + (100000 * 0.2) = 20000 1 = 19999 (or 0.199999....)
At this point, 1 has converged, so shifting out 1 and furthermore shifting in another 9
into the high array gives us the new high and low, as below:
Hide Shrink
Copy Code
output 21
Set low to 00000

Set high to 99999
encoding L (Ls range is 0.3 0.6) Range = (99999 00000) + 1 = 100000
low = 00000 + (100000 * 0.3) = 30000 (or 0.30000)

high = 00000 + (100000 * 0.6) = 60000 1 = 59999 (since 0.5999 = 0.599999...)
no output
Set low to 30000

Set high to 59999
encoding the next L (Ls range is 0.3 0.6)

Range= (59999 - 30000) + 1 = 29999 + 1 = 30000
low = 30000 + (30000 * 0.3) = 30000 + 9000 = 39000

high = 30000 + (30000 * 0.6) = 30000 + 18000 = 48000 1 = 47999
no output
Set low to 39000

Set high to 47999
encoding O (Os range is 0.6 - 0.8) Range = (47999 39000) + 1 = 9000
low = 39000 + (9000 * 0.6) = 39000 + 5400 = 44400

high = 39000 + (9000 * 0.8) = 39000 + 7200 = 46200 1 = 46199
At this point, 4 has converged, so shifting out 4 and furthermore shifting in another
digit into both the high and low array gives us the new high and low, as below:
Hide Copy Code
output 214
Set low to 44000

Set high to 61999
encoding W (Ws range is 0.9 1.0) Range = (61999 44000) + 1 = 18000
low = 44000 + (18000 * 0.9) = 44000 + 16200 = 60200

high = 44000 + (18000 * 1.0) = 44000 + 18000 = 62000 1 = 61999
At this point, 6 has converged, so shifting out 6 and furthermore shifting in another
digit into both the high and low array gives us the new high and low, as below:
Hide Copy Code
output 2146
Set low to 02000

Set high to 19999
encoding O (Os range is 0.6 0.8) Range = (19999 - 02000) + 1 = 18000
low = 2000 + (18000 * 0.6) = 2000 + 10800 = 12800

high = 2000 + (18000 * 0.8) = 2000 + 14400 = 16400 1 = 16399
At this point, 1 has converged, so shifting it out and shifting in another digit into both
the high and low array gives us the new high and low, as below:
Hide Copy Code
output 21461
Set low to 28000

Set high to 63999
encoding R (Rs range is 0.8 0.9) Range = (63999 28000) + 1 = 36000
low = 28000 + (36000 * 0.8) = 28000 + 28800 = 56800

high = 28000 + (36000 * 0.9) = 28000 + 32400 = 60400 1 = 60399
no output
Set low to 56800

Set high to 60399
encoding L (Ls range is 0.3 0.6) Range = (60399 56800) + 1 = 3600
low = 56800 + (3600 * 0.3) = 56800 + 1080 = 57880

high = 56800 + (3600 * 0.6) = 56800 + 2160 = 58960 1 = 58959
At this point, 5 has converged, so shifting it out and shifting in another digit into both
the high and low array gives us the new high and low, as below:
Hide Copy Code
output 214615
Set low to 78800

Set high to 89599
encoding D (Ls range is 0.0 0.1) Range = (89599 78800) + 1 = 10800
low = 78800 + (1080 * 0.0) = 78800 + 0 = 78800

high = 78800 + (1080 * 0.1) = 78800 + 108 = 78908
At this point, 7 and 8 have converged, so shifting out 7 and 8 and further shifting
another digit into both the output array gives us the coded string:
Hide Copy Code
output 21461578
output 8 from low
the decimal procedure : output 0.2146157 and 88 from low
END ENCODING
...
Giving the table below:

Charact Probabilit
Range Low
er
y
Initialize
High Output
00000 99999
1/10
0.2 0.320000 29999 2
1/10
0.1 0.210000 19999 1
3/10
0.3 0.630000 59999
3/10
0.3 0.639000 47999
Charact Probabilit
Range Low
er
y
High Output
3/10
0.6 0.844400 46199 4
1/10
0.9 1.060200 61999 6
2/10
0.6 0.812800 16399 1
1/10
0.8 0.956800 60399
3/10
0.3 0.657880 58959 5
1/10
0.0 0.178800 78907 78
And finally, we also shift out the last digit in low: 8.

This logic is implemented in the method encode_symbol of the coder object/class
(coder.cs in the source code).
Underflow
It is possible that while encoding symbols, a situation could arise in which the high
and low cannot converge. In the event that the encoded word has a string of 0s or 9s
in it, the high and low values will slowly converge on a value, but may not see their
most significant digits match immediately. For example, high and low may look like
this:
Hide Copy Code
High:
700003
Low:
699994
The calculated range is only a single digit long, which means the encoder does not
have enough precision to be accurate.
In effect, the range between high and low has become so small that any calculation
will always return the same values. Moreover, since the most significant digits of
both high and low are not equal, the algorithm can't output the digit and shift.
The way to avoid underflow is to prevent it altogether. This is done by modifying the
algorithm slightly. If the two MSDs don't match, but are now on adjacent numbers, a
second test is done. If high and low are one apart, we test to see if the second most
significant digit in high is a 0, and the second digit in low is a 9. If so, it means that
the underflow is threatening.
When there is potential for underflow, the encoder does a slightly different shift
operation. It deletes the second digits from high and low, and shifts the rest of the
digits left. The most significant digit, however, stays in place. Also, an underflow
counter is set to mark the digit that was discarded. The operation looks like this:
Hide Copy Code
Before
------
After
-----
High:
40344
43449
Low:
39810
38100
Underflow:
This checks if underflow occurs after every iteration/calculation operation.

When the MSDs finally converge to a single value, it outputs the value, and then a
number of "underflow" digits that were previously discarded. The underflow digits
will be all 9s or 0s, depending on whether the high and low converged to the higher
or lower value. In the C# implementation of this algorithm, the underflow counter
keeps track of how many ones or zeros to output.
v.
Decoding
Previously, in the decoding process, we could use the entire input number. This,
however, may not be possible in practice since we can't perform an operation like
that on a number that could potentially be millions or billions of bytes long. Just as in
the encoding process, the decoder can operate using simple finite integer
calculations.
The decoder maintains three integers. The first two, high and low, correspond
exactly to the high and low values maintained by the encoder. The third number,
code, contains the current bits being read in from the input bits stream.
Important: The current probability is determined by where the present code value
falls along that range. If you divide the value-low by high-low+1, you get the actual
probability for the present symbol.
vi.
Finally, there must be some mechanism to stop the decoding calculation

since, in theory, it is possible for an encoding number to yield more decoded
characters than were encoded in it. Two methods for doing this are possible. A
special <stop> character may be encoded into the number, or alternatively, a
number representing the number of characters to encode can be passed to the
decoding function/method.
The C# implementation
The encoding procedure written in C# is included in the downloadable source code
for this article. The code for the encoder as well as the decoder were first published
(in C) in an article entitled "Arithmetic Coding for Data Compression" in the February
1987 issue of "Communications of the ACM", by Ian H. Witten, Radford Neal, and John
Cleary, published again by Mark Nelson as C code, and then ported to C# by myself
and is being published here with the author's permission.
I have modified the code slightly so as to further isolate statistical modeling and
arithmetic coding. The coder (coder.cs in the source code) class is an object that
implements arithmetic coding. This class can then be used by any statistical model
of the data. I have included a test project and a few examples of testing the coding
and decoding methods of the class.
There are two major differences between the algorithms shown earlier and the code
included in this article.
The first difference is in the way probabilities are transmitted. In the algorithms
shown above, the probabilities were kept as a pair of floating point numbers on the
0.0 to 1.0 range. Each symbol had its own section of that range. In the C# class
included here, a symbol has a slightly different definition. Instead of two floating
point numbers, the symbol's range is defined as two integers, which are counts along
a scale.
The scale is also included as part of the coder class definition: as the property ' scale'
of
the
class
(coder.scale).
This
scale
is
used
in
the
methods encode_symbol and decode_symbol to convert the probability integer into
its floating point equivalent, or to decode an encoded symbol into its char
equivalent. This also means that a user of the class can input the probability of a
symbol as integers instead of floating point numbers. See the included test solution
for an example.
So, for instance in the "HELLO WORLD" example, the letter H was defined previously
as the high/low pair of 0.2 and 0.3. In the code being used here, "H" would be
defined as the low and high counts of 2 and 3, with the symbol scale being 10.
The second difference in this algorithm is that all of the comparison and shifting
operations are being done in base 2, rather than base 10. The illustrations given
previously were done on base 10 numbers to make the algorithms a little more
comprehensible. The algorithms work properly in base 10, but masking off digits and
shifting in base 10 on most computers is expensive and slow. Instead of comparing
the two MSD digits, we now compare the two MSD bits.
There are two things missing that are needed in order to use the encoding and
decoding algorithms. The first is a set of bit oriented input and output routines.
These are shown in the code listing and are presented here as the bit IO routines.
The coder.symbol structure is responsible for storing the probabilities of each
character, and performing two different transformations.
During the encoding process, the coder.symbol structure has to take a character to
be encoded and convert it to a probability range. The probability range is defined as
a low count, a high count in the structure.
During the decoding process, the coder.symbol structure has to take a count derived
from the input bit stream and convert it into a character for output.
The coder object is created with
these coder.symbol structures.
'dictionary'
or
'alphabet'
made
up
of
Test code
An example program is shown in the included solution. It implements a
compression/expansion program that uses some arbitrary models based on the
discussed examples.
The decoding method needs to know the length of the encoded string so as to know
when to stop decoding the message.
The test project encodes an arbitrarily defined input string, and writes it out to
a MemoryStream object. This is the return value of the Encode method of the class.
The MemoryStream object returned will contain an array of bytes with the
compressed data in binary.
One can then decode the stream by calling the Expand method of the class.
The Expand method of the class takes a memory stream and the length of the
encoded message as parameters. To test it, we can encode a string and then pass
the binary stream returned back to the Expand method for decoding.
During the encoding process, a routine called convert_int_to_symbol is called. This
routine gets a given input character, and converts it to a low count, high count, and
scale using the current model. Since our model is a set of fixed probabilities, this just
means looking up the probabilities in the input coder.symbol struct. Once those are
defined, the encoder can be called.
During the decoding process, there are two functions associated with modeling. In
order to determine what character is waiting to be decoded on the input stream, the
model needs to be interrogated to determine what the present scale is. In our
example, the scale (or range of counts) is fixed by the coder.scale property. When
decoding, a modeling function called convert_symbol_to_int is called. It takes the
given count, and determines what character matches the count. Finally, the decoder
is called again to process that character out of the input stream.

Arithmetic Compression With C

Încărcat de

Informații document

Drepturi de autor

Formate disponibile

Partajați acest document

Partajați sau inserați document

Opțiuni de partajare

Vi se pare util acest document?

Este necorespunzător acest conținut?

Drepturi de autor:

Formate disponibile

Arithmetic Compression With C

Încărcat de

Drepturi de autor:

Formate disponibile

Arithmetic Compression With C#

Agola Kisira Odero, 10 Dec 2009

This article describes arithmetic compression with an implemetnation in C#.

When applied to independent and identically distributed (IID) sources, the

It is effective in a wide range of situations and compression ratios. The same

It simplifies the modeling of complex sources, yielding near optimal or

Its main process is arithmetic, which is supported with ever-increasing

In this article, I present an implementation of an arithmetic coder in C# (coder.cs in

Creating the statistical model of the data proceeds as follows:

Taking the number of independent characters in the words to be encoded

We can arrange these characters into a table with their corresponding

Each of the characters will be assigned a range based on its frequency/

Charact Frequenc Probabilit

Charact Frequenc Probabilit

The algorithm for encoding is as below:

Hide Copy Code

Set low to 0.0

Applying this to our input (HELLO WORLD), we obtain:

encoding H (Hs range is 0.2 - 0.3) Range(or code_range above) = 1 - 0 = 1

low = 0.2 + (0.1 * 0.1) = 0.21

low = 0.21 + (0.01 * 0.3) = 0.213

low = 0.213 + (0.003 * 0.3) = 0.2139

low = 0.2139 + (0.0009 * 0.6) = 0.21444

Giving the table below:

0.6 0.80.2146128 0.2146164

0.8 0.90.21461568 0.21461604

0.3 0.60.21461578 0.21461589

Charact Frequenc Probabilit

get the number encoding the data

The number is 0.214615788

current symbol = E (range: 0.1 0.2)

current symbol = L (range: 0.3 0.6)

current symbol = L (range: 0.3 0.6)

current symbol = O (range: 0.6 0.8)

current symbol = W (range: 0.9 1.0)

This working yields the table below:

We can use integer mathematics to implement arithmetic coding. This allows

Charact Frequenc Probabilit

Consider the encoding table above:

Charact Frequenc Probabilit

Charact Frequenc Probabilit

0.6 0.80.2146128 0.2146164

0.8 0.90.21461568 0.21461604

The above calculations can be modeled using integer mathematics. As

Hide Copy Code

set low to 00000

while there are still input symbols

get an input symbol

while the MSD of high and low match

Applying this to our input (HELLO WORLD) we obtain:

First initialize high and low,

Set low to 00000 (or 0.000...)

encoding H (Hs range is 0.2 0.3) Range(or code_range above) =

low = 00000 + (100000 * 0.2) = 20000 (or 0.200000)

Set low to 00000

encoding E (Es range is 0.1 0.2) Range = (99999 00000) + 1 = 100000

low = 00000 + (100000 * 0.1) = 10000 (or 0.100000...)

Set low to 00000

encoding L (Ls range is 0.3 0.6) Range = (99999 00000) + 1 = 100000

low = 00000 + (100000 * 0.3) = 30000 (or 0.30000)

Set low to 30000

encoding the next L (Ls range is 0.3 0.6)

low = 30000 + (30000 * 0.3) = 30000 + 9000 = 39000

Set low to 39000