
Mini Project 2 ------ Practice on C Programming. Please email your submission to my TA:

Due: 2010-12-15

In computer science and information theory, Huffman coding is an entropy encoding algorithm used for lossless data compression. The term refers to the use of a variable-length code table for encoding a source symbol (such as a character in a file) where the variable-length code table has been derived in a particular way based on the estimated probability of occurrence for each possible value of the source symbol. It was developed by David A. Huffman while he was a Ph.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes". For more detailed information see http://en.wikipedia.org/wiki/Huffman_coding.

In this project, we aim to put Huffman coding into practice and use it to implement a compress/uncompress utility. With this utility, we can compress a regular file, or uncompress a file that was previously compressed by the same utility.

Basic idea:

1 Scan the file and count the occurrences of each character (how many times it appears in the file)

2 Create the Huffman tree based on these counts

3 Compute the Huffman code of each character from the Huffman tree built in step 2

4 Encode the source file: write the Huffman code of each character into the compressed file. For example:

The binary value of character A is 01000001, which has 8 bits. Suppose the Huffman code of A is 0110. Then we only need to write 4 bits instead of 8 to represent A, saving 4 bits. Since each byte holds 8 bits, these saved 4 bits can be used to store part of another character's Huffman code. Suppose the Huffman code of B is 110 and the binary value of character B is 01000010. In the original file, storing A and B requires two bytes, but now we only need 7 bits. However, since each byte has 8 bits, one more bit must be filled in: it either comes from the Huffman code of the following character, or is a 0 if the end of the file has been reached.

Original file:   01000001 01000010 ……
                 A        B

Compressed file: 0110 110 x ……
                 A    B

When you create the compressed file, you also need to store the encoding information in it, so that it can be used when uncompressing the file.

When uncompressing the file, read the encoding information first and reconstruct the Huffman tree. Then decode the file based on that Huffman tree.

Huffman Code: Example

The following example is based on a data source using a set of five different symbols. The symbol frequencies are:

Symbol   Frequency
A        24
B        12
C        10
D        8
E        8

----> total 186 bit (with 3 bit per code word)

The two rarest symbols 'E' and 'D' are connected first, followed by 'C' and 'B'. The new parent nodes have frequencies 16 and 22 respectively and are brought together in the next step. The resulting node and the remaining symbol 'A' are attached to the root node, which is created in the final step.

Code Tree according to Huffman

Symbol   Frequency   Code   Code Length   Total Length
A        24          0      1             24
B        12          100    3             36
C        10          101    3             30
D        8           110    3             24
E        8           111    3             24
---------------------------------------
----> total 138 bit

Basic Requirement:

1. Compute the Huffman code of each character based on the character statistics of a file.

2. Output the encoded file based on the Huffman codes; you may output the plain Huffman code of each character.

3. Decode an encoded file.

Example:

Text:

Abbdc

Huffman code: A: 10  b: 01  c: 11  d: 00

Your compressed file should show:

1001010011

If the input to the decoder is 101010110000, the uncompressed output should be AAAcdd.