
A Project Report On

COMPRESSION & DECOMPRESSION


Submitted in partial fulfillment of the requirement for the Award of the degree of

Bachelor of Technology
In

Information Technology
By

RAHUL SINGH (0407713057)
SHAKUN GARG (0407713042)

Dr. K.N.MODI INSTITUTE OF ENGINEERING & TECHNOLOGY


Approved by A.I.C.T.E., Affiliated to U.P. Technical University, Lucknow, Modinagar 201204 (Batch: 2004-2008)


CONTENTS
ACKNOWLEDGEMENT
CERTIFICATE
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
SYNOPSIS OF THE PROJECT
1 OBJECTIVE
2 SCOPE
3 DESIGN PRINCIPLE & EXPLANATION
3.1 Module Description
3.1.1 Huffman Zip
3.1.2 Encoder
3.1.3 Decoder
3.1.4 Table
3.1.5 DLNode
3.1.6 Priority Queue
3.1.7 Huffman Node
4 HARDWARE & SOFTWARE REQUIREMENTS


MAIN REPORT
1 Objective & Scope of the Project
2 Theoretical Background
2.1 Introduction
2.2 Theory
2.3 Definition
2.3.1 Lossless vs Lossy Compression
2.3.2 Image Compression
2.3.3 Video Compression
2.3.4 Text Compression
2.3.5 LZW Algorithm
3 Problem Statement
4 System Analysis and Design
4.1 Analysis
4.2 Design
4.2.1 System Design
4.2.2 Design Objective
4.2.3 Design Principle
5 Stages in System Life Cycle
5.1 Requirement Determination
5.2 Requirement Specifications
5.3 Feasibility Analysis
5.4 Final Specification
5.5 Hardware Study
5.6 System Design
5.7 System Implementation
5.8 System Evaluation
5.9 System Modification
5.10 System Planning
6 Hardware & Software Requirement
7 Project Description
7.1 Huffman Algorithm
7.2 Code Construction
7.3 Huffing Program
7.4 Building Table
7.5 Decompressing
7.6 Transmission & Storage of Huffman Encoded Data
8 Working of Project
8.1 Modules & Their Description
9 Data Flow Diagram
10 Print Layouts
11 Implementation
12 Testing
12.1 Test Plan
12.2 Terms in Testing Fundamentals
13 Conclusion
14 Future Enhancement & New Direction
14.1 New Direction
14.2 Scope of Future Work
14.3 Scope of Future Application
15 Source Code
16 References


ACKNOWLEDGEMENT

"Keep away from people who try to belittle your ambitions. Small people always do that, but the really great make you feel that you, too, can become great." We take this opportunity to express our sincere thanks and deep gratitude to all those people who extended their wholehearted co-operation and helped us in completing this project successfully.

First of all, we would like to thank Mr. Gaurav Vajpai (Project Guide) for his strict supervision, constant encouragement, inspiration and guidance, which ensured the worthiness of our work. Working under him was an enriching experience. His inspiring suggestions and timely guidance enabled us to perceive the various aspects of the project in a new light.

We would also like to thank the Head of the Department of IT, Prof. Jaideep Kumar, who guided us a lot in completing this project. We would also like to thank our parents and project mates for guiding and encouraging us throughout the duration of the project.

We would be failing in our mission if we did not thank the other people who directly or indirectly helped us in the successful completion of this project. So, our heartfelt thanks to all the teaching and non-teaching staff of the computer science and engineering department of our institution for their valuable guidance throughout the working of this project.

RAHUL SINGH
SHAKUN GARG
MANISH SRIVASTAVA


Dr. K.N. Modi Institute of Engineering and Technology Modinagar


Affiliated to UP Technical University, Lucknow

DEPARTMENT OF INFORMATION TECHNOLOGY

CERTIFICATE
This is to certify that RAHUL SINGH (0407713057), SHAKUN GARG (0407713042) and MANISH SRIVASTAVA (0407713021) of the final year B. Tech. (IT) have carried out project work on COMPRESSION & DECOMPRESSION under the guidance of Mr. GAURAV VAJPAI in the Department of IT, in partial fulfillment of the requirement for the award of the degree of Bachelor of Technology in Information Technology at Dr. K.N. Modi Institute of Engineering & Technology, Modinagar (Affiliated to U.P. Technical University, Lucknow), and that this is a bonafide record of the work done by them during the year 2007-2008.

Head of the Department:

Internal Guide: Mr. GAURAV VAJPAI

(Mr. JAIDEEP KUMAR)


Head, Department of IT


LIST OF TABLES

Table No.    Table Name
1            FILE TABLE
2            DETAIL TABLE

LIST OF FIGURES

Figure No.   Figure Name
1            ARCHITECTURE OF NETPOD
2            PERT CHART
3            GANTT CHART


COMPRESSION & DECOMPRESSION

STATEMENT ABOUT THE PROBLEM

In today's world of computing, it is hardly possible to do without graphics, images and sound. Just by looking at the applications around us, the Internet, the development of Video CDs (Compact Disks), video conferencing, and much more, all these applications use graphics and sound intensively. I guess many of us have surfed the Internet; have you ever become so frustrated waiting for a graphics-intensive web page to open that you stopped the transfer? I bet you have. Guess what would happen if those graphics were not compressed? Uncompressed graphics, audio and video data consume very large amounts of physical storage, which, in the case of uncompressed video, even present CD technology is unable to handle.

WHY IS THE PARTICULAR TOPIC CHOSEN?


Files available for transfer from one host to another over a network (or via modem) are often stored in a compressed format or some other special format well-suited to the storage medium and/or transfer method. There are many reasons for compressing/archiving files. The more common are:


File compression can significantly reduce the size of a file (or group of files). Smaller files take up less storage space on the host and less time to transfer over the network, saving both time and money

OBJECTIVE AND SCOPE OF THE PROJECT


The objective of this system is to compress and decompress files. This system will be used to compress files so that they take less memory for storage and transmission from one computer to another. This system will work in the following ways:

To compress text and image files using Huffman coding.
To decompress the compressed file to its original format.
To show the compression ratio.

Our project will be able to compress a message into a form that can easily be transmitted over the network or from one system to another. At the receiver end, after decompression, the receiver will get the original message. In this way, effective transmission of data takes place between sender and receiver.
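As a simple illustration of the last point, the compression ratio can be computed from the sizes of the input and output files; the class and method names below are hypothetical, not part of the project's source.

    import java.io.File;

    public class RatioDemo {
        // Ratio of original size to compressed size; e.g. 4.0 means the
        // compressed file is one quarter of the original.
        static double compressionRatio(File original, File compressed) {
            return (double) original.length() / compressed.length();
        }

        public static void main(String[] args) {
            File in = new File(args[0]);   // original file
            File out = new File(args[1]);  // compressed file produced by the encoder
            System.out.printf("Compression ratio: %.2f%n", compressionRatio(in, out));
        }
    }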

Reusability:
Reusability is possible as and when required in this application, and it can be updated in the next version. Reusable software reduces design, coding and testing cost by amortizing effort over several designs. Reducing the amount of code also simplifies understanding, which increases the likelihood that the code is correct. We follow both types of reusability: sharing of newly written code within a project and reuse of previously written code on new projects.

Extensibility:
This software can be extended in ways that its original developers may not expect. The following principles enhance extensibility: hide data structures, avoid traversing multiple links or methods, avoid case statements on object type, and distinguish public from private operations.

Robustness:
A method is robust if it does not fail even when it receives improper parameters. Practices that support robustness include protecting against errors, optimizing only after the program runs correctly, validating arguments, and avoiding predefined limits.

Understandability:
A method is understandable if someone other than its creator can understand the code (as can the creator, after a time lapse). Keeping methods small and coherent helps to accomplish this.

Cost-effectiveness:
The cost is kept within budget and the system is delivered within the given time period. It is desirable to aim for a system with minimum cost, subject to the condition that it must satisfy all the requirements.

The scope of this document is to put down the requirements, clearly identifying the information needed by the user, the source of the information and the outputs expected from the system.


METHODOLOGY ADOPTED
The methodology used is the classic Life-cycle model the WATERFALL MODEL


HARDWARE & SOFTWARE REQUIREMENTS


HARDWARE SPECIFICATIONS:
Processor: Pentium I/II/III or higher
RAM: 128 MB or higher
Monitor: 15 inch (digital) with 800 x 600 support
Keyboard: 101-key keyboard
Mouse: 2-button, serial/PS-2

Tools / Platform Language Used:

Language: Java
OS: Any OS such as Windows XP/98/NT/Vista


TESTING TECHNOLOGIES
Some of the commonly used Strategies for Testing are as follows:-

Unit testing
Module testing
Integration testing
System testing
Acceptance testing

UNIT TESTING Unit testing is the testing of a single program module in an isolated environment. Testing of the processing procedures is the main focus

MODULE TESTING A module encapsulates related components, so it can be tested without the other system modules.

INTEGRATION TESTING

Integration testing is the testing of the interfaces among the system modules. In other words, it ensures that the modules interact as intended.


SYSTEM TESTING System testing is the testing of the system against its initial objectives. It is done either in a simulated environment or in a live environment.

ACCEPTANCE TESTING

Acceptance Testing is performed with realistic data of the client to demonstrate that the software is working satisfactorily. Testing here is focused on external behavior of the system; the internal logic of program is not emphasized.

WHAT CONTRIBUTION WOULD THE PROJECT MAKE?


The contributions of COMPRESSION & DECOMPRESSION are as follows:

Compression is useful because it helps reduce the consumption of expensive resources, such as hard disk space or transmission bandwidth .

It involves trade-offs between various factors, including the degree of compression, the amount of distortion introduced (if using a lossy compression scheme), and the computational resources required to compress and uncompress the data.


SYNOPSIS OF THE PROJECT

1. OBJECTIVE

The objective of this system is to compress and decompress files. This system will be used to compress files so that they take less memory for storage and transmission from one computer to another. This system will work in the following ways:

To compress text and image files using Huffman coding.
To decompress the compressed file to its original format.
To show the compression ratio.

2. SCOPE

Our project will be able to compress a message into a form that can easily be transmitted over the network or from one system to another. At the receiver end, after decompression, the receiver will get the original message. In this way, effective transmission of data takes place between sender and receiver.


Reusability:

Reusability is possible as and when required in this application, and it can be updated in the next version. Reusable software reduces design, coding and testing cost by amortizing effort over several designs. Reducing the amount of code also simplifies understanding, which increases the likelihood that the code is correct. We follow both types of reusability: sharing of newly written code within a project and reuse of previously written code on new projects.

Extensibility:

This software can be extended in ways that its original developers may not expect. The following principles enhance extensibility: hide data structures, avoid traversing multiple links or methods, avoid case statements on object type, and distinguish public from private operations.

Robustness:

A method is robust if it does not fail even when it receives improper parameters. Practices that support robustness include protecting against errors, optimizing only after the program runs correctly, validating arguments, and avoiding predefined limits.

Understandability:

A method is understandable if someone other than its creator can understand the code (as can the creator, after a time lapse). Keeping methods small and coherent helps to accomplish this.

Cost-effectiveness:

The cost is kept within budget and the system is delivered within the given time period. It is desirable to aim for a system with minimum cost, subject to the condition that it must satisfy all the requirements.


The scope of this document is to put down the requirements, clearly identifying the information needed by the user, the source of the information and the outputs expected from the system.

3. DESIGN PRINCIPLES & EXPLANATION

MODULE DESCRIPTION

The project consists of the following modules:

Huffman Zip
Encoder
Decoder
Table
DLNode
Priority Queue
Huffman Node

Huffman Zip is the main module, which uses an applet. It provides the user interface.

Encoder is the module for compressing the file. It implements the Huffman algorithm for compressing text and image files. It first calculates the frequencies of all the occurring symbols. Then, on the basis of these frequencies, it generates a priority queue, which is used for finding the symbols with the least frequencies. The two symbols with the lowest frequencies are deleted from the queue and a new symbol is added to the queue with a frequency equal to the sum of these two. Meanwhile, we generate a tree whose leaf nodes are the two deleted nodes and whose root is the new node added to the queue. At last we traverse the tree from the root to the leaves, assigning 0 to the left child and 1 to the right child. In this way we assign a code to every symbol in the file. These are binary codes; we then group them, calculate the equivalent integers and store them in the output file, which is the compressed file.
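The tree-building step described above can be sketched in Java as follows. This is a minimal illustration using the standard library's PriorityQueue rather than the project's own Priority Queue and Huffman Node classes; the Node class and method names are illustrative only.

    import java.util.*;

    class Node implements Comparable<Node> {
        int freq; char symbol; Node left, right;
        Node(char s, int f) { symbol = s; freq = f; }                         // leaf node
        Node(Node l, Node r) { left = l; right = r; freq = l.freq + r.freq; } // internal node
        public int compareTo(Node o) { return Integer.compare(freq, o.freq); }
        boolean isLeaf() { return left == null && right == null; }
    }

    public class HuffmanSketch {
        // Repeatedly remove the two lowest-frequency nodes and insert their parent,
        // whose frequency is the sum of the two, until one node (the root) remains.
        static Node buildTree(Map<Character, Integer> freqs) {
            PriorityQueue<Node> pq = new PriorityQueue<>();
            for (Map.Entry<Character, Integer> e : freqs.entrySet())
                pq.add(new Node(e.getKey(), e.getValue()));
            while (pq.size() > 1) pq.add(new Node(pq.poll(), pq.poll()));
            return pq.poll();
        }

        // Traverse the tree, appending 0 for a left branch and 1 for a right branch.
        static void buildCodes(Node n, String prefix, Map<Character, String> table) {
            if (n.isLeaf()) { table.put(n.symbol, prefix.isEmpty() ? "0" : prefix); return; }
            buildCodes(n.left, prefix + "0", table);
            buildCodes(n.right, prefix + "1", table);
        }
    }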

Decoder works in the reverse order of the encoder. It reads the input from the compressed file and converts it into the equivalent binary code. It has one other input, the binary tree generated during the encoding process, and on the basis of these data it regenerates the original file. This project is based on lossless compression.

Table is used for storing the code of each symbol. Priority Queue takes as input the symbols and their related frequencies and, on the basis of these frequencies, assigns a priority to each symbol. Huffman Node is used for creating the binary tree. It takes two symbols from the priority queue and creates two nodes by comparing their frequencies: the symbol with the lower frequency is placed to the left and the symbol with the higher frequency to the right. It then deletes these two symbols from the priority queue and inserts a new symbol with a frequency equal to the sum of the frequencies of the two deleted symbols. It also generates a parent node for the two nodes and assigns it a frequency equal to the sum of the frequencies of the two leaf nodes.
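To illustrate the decoding direction, the following method (which could sit alongside the sketch above, reusing its Node class) walks the tree bit by bit; it mirrors the behaviour described for the Decoder module but is not the project's actual code.

    // Decode a string of '0'/'1' characters: go left on 0, right on 1,
    // emit the symbol whenever a leaf is reached, then restart at the root.
    static String decode(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node n = root;
        for (int i = 0; i < bits.length(); i++) {
            n = (bits.charAt(i) == '0') ? n.left : n.right;
            if (n.isLeaf()) { out.append(n.symbol); n = root; }
        }
        return out.toString();
    }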


4. HARDWARE & SOFTWARE REQUIREMENTS

Existing hardware will be used:
Intel Pentium-IV
128 MB RAM
SVGA Color Monitor on PCI with 1 MB RAM
101 Keys Keyboard
1 Microsoft Mouse with pad

Tools / Platform Language Used:
Language: Java
OS: Any OS such as Windows XP/98/NT
Database: MS Access


MAIN REPORT

OBJECTIVE AND SCOPE

The objective of this system is to compress and decompress files. This system will be used to compress files so that they take less memory for storage and transmission from one computer to another. This system will work in the following ways:

To compress text and image files using Huffman coding.
To decompress the compressed file to its original format.
To show the compression ratio.

SCOPE
Our project will be able to compress a message into a form that can easily be transmitted over the network or from one system to another. At the receiver end, after decompression, the receiver will get the original message. In this way, effective transmission of data takes place between sender and receiver.

Reusability:

Reusability is possible as and when required in this application, and it can be updated in the next version. Reusable software reduces design, coding and testing cost by amortizing effort over several designs. Reducing the amount of code also simplifies understanding, which increases the likelihood that the code is correct. We follow both types of reusability: sharing of newly written code within a project and reuse of previously written code on new projects.

Extensibility:

This software can be extended in ways that its original developers may not expect. The following principles enhance extensibility: hide data structures, avoid traversing multiple links or methods, avoid case statements on object type, and distinguish public from private operations.

Robustness:

A method is robust if it does not fail even when it receives improper parameters. Practices that support robustness include protecting against errors, optimizing only after the program runs correctly, validating arguments, and avoiding predefined limits.

Understandability:

A method is understandable if someone other than its creator can understand the code (as can the creator, after a time lapse). Keeping methods small and coherent helps to accomplish this.

Cost-effectiveness:

The cost is kept within budget and the system is delivered within the given time period. It is desirable to aim for a system with minimum cost, subject to the condition that it must satisfy all the requirements.


The scope of this document is to put down the requirements, clearly identifying the information needed by the user, the source of the information and the outputs expected from the system.


THEORETICAL BACKGROUND

Introduction

A brief introduction to information theory is provided in this section. The definitions and assumptions necessary to a comprehensive discussion and evaluation of data compression methods are discussed. The following string of characters is used to illustrate the concepts defined: EXAMPLE = aa bbb cccc ddddd eeeeee fffffff gggggggg.

Theory:

The theoretical background of compression is provided by information theory (which is closely related to algorithmic information theory) and by rate-distortion theory. These fields of study were essentially created by Claude Shannon, who published fundamental papers on the topic in the late 1940s and early 1950s. Doyle and Carlson (2000) wrote that data compression "has one of the simplest and most elegant design theories in all of engineering". Cryptography and coding theory are also closely related. The idea of data compression is deeply connected with statistical inference.

Many lossless data compression systems can be viewed in terms of a four-stage model. Lossy data compression systems typically include even more stages, including, for example, prediction, frequency transformation, and quantization.

The Lempel-Ziv (LZ) compression methods are among the most popular algorithms for lossless storage. DEFLATE is a variation on LZ which is optimized for decompression speed and compression ratio, although compression can be slow. LZW (Lempel-Ziv-Welch) is used


in GIF images. LZ methods utilize a table based compression model where table entries are substituted for repeated strings of data. For most LZ methods, this table is generated dynamically from earlier data in the input. The table itself is often Huffman encoded (e.g. SHRI, LZX).

The very best compressors use probabilistic models whose predictions are coupled to an algorithm called arithmetic coding. Arithmetic coding, invented by Jorma Rissanen, and turned into a practical method by Witten, Neal, and Cleary, achieves superior compression to the better-known Huffman algorithm, and lends itself especially well to adaptive data compression tasks where the predictions are strongly context-dependent.

Definition :

In computer science and information theory, data compression or source coding is the process of encoding information using fewer bits (or other information-bearing units) than an unencoded representation would use through use of specific encoding schemes. For example, this article could be encoded with fewer bits if one were to accept the convention that the word "compression" be encoded as "comp". One popular instance of compression with which many computer users are familiar is the ZIP file format, which, as well as providing compression, acts as an archiver, storing many files in a single output file.

As is the case with any form of communication, compressed data communication only works when both the sender and receiver of the information understand the encoding scheme. For example, this text makes sense only if the receiver understands that it is intended to be interpreted as characters representing the English language. Similarly, compressed data can only be understood if the decoding method is known by the receiver.


Compression is useful because it helps reduce the consumption of expensive resources, such as hard disk space or transmission bandwidth. On the downside, compressed data must be decompressed to be viewed (or heard), and this extra processing may be detrimental to some applications. For instance, a compression scheme for video may require expensive hardware for the video to be decompressed fast enough to be viewed as it's being decompressed (you always have the option of decompressing the video in full before you watch it, but this is inconvenient and requires storage space to put the uncompressed video). The design of data compression schemes therefore involve trade-offs between various factors, including the degree of compression, the amount of distortion introduced (if using a lossy compression scheme), and the computational resources required to compress and uncompress the data.

A code is a mapping of source messages (words from the source alphabet alpha) into codewords (words of the code alphabet beta). The source messages are the basic units into which the string to be represented is partitioned. These basic units may be single symbols from the source alphabet, or they may be strings of symbols. For string EXAMPLE, alpha = { a, b, c, d, e, f, g, space}. For purposes of explanation, beta will be taken to be { 0, 1 }. Codes can be categorized as block-block, block-variable, variable-block or variable-variable, where block-block indicates that the source messages and codewords are of fixed length and variable-variable codes map variable-length source messages into variable-length codewords. A block-block code for EXAMPLE is shown in Figure 1.1 and a variable-variable code is given in Figure 1.2. If the string EXAMPLE were coded using the Figure 1.1 code, the length of the coded message would be 120; using Figure 1.2 the length would be 30.


Figure 1.1 (block-block code)         Figure 1.2 (variable-variable code)

source message   codeword             source message   codeword
a                000                  aa               0
b                001                  bbb              1
c                010                  cccc             10
d                011                  ddddd            11
e                100                  eeeeee           100
f                101                  fffffff          101
g                110                  gggggggg         110
space            111                  space            111

The oldest and most widely used codes, ASCII and EBCDIC, are examples of block-block codes, mapping an alphabet of 64 (or 256) single characters onto 6-bit (or 8-bit) codewords. These are not discussed, as they do not provide compression. The codes featured in this survey are of the block-variable, variable-variable, and variable-block types.

When source messages of variable length are allowed, the question of how a message ensemble (sequence of messages) is parsed into individual messages arises. Many of the algorithms described here are defined-word schemes. That is, the set of source messages is determined prior to the invocation of the coding scheme. For example, in text file processing


each character may constitute a message, or messages may be defined to consist of alphanumeric and non-alphanumeric strings.

In Pascal source code, each token may represent a message. All codes involving fixed-length source messages are, by default, defined-word codes. In free-parse methods, the coding algorithm itself parses the ensemble into variable-length sequences of symbols. Most of the known data compression methods are defined-word schemes; the free-parse model differs in a fundamental way from the classical coding paradigm.

A code is distinct if each codeword is distinguishable from every other (i.e., the mapping from source messages to codewords is one-to-one). A distinct code is uniquely decodable if every codeword is identifiable when immersed in a sequence of codewords. Clearly, each of these features is desirable. The codes of Figure 1.1 and Figure 1.2 are both distinct, but the code of Figure 1.2 is not uniquely decodable. For example, the coded message 11 could be decoded as either ddddd or bbbbbb. A uniquely decodable code is a prefix code (or prefixfree code) if it has the prefix property, which requires that no codeword is a proper prefix of any other codeword. All uniquely decodable block-block and variable-block codes are prefix codes. The code with codewords { 1, 100000, 00 } is an example of a code which is uniquely decodable but which does not have the prefix property. Prefix codes are instantaneously decodable; that is, they have the desirable property that the coded message can be parsed into codewords without the need for lookahead. In order to decode a message encoded using the codeword set { 1, 100000, 00 }, lookahead is required. For example, the first codeword of the message 1000000001 is 1, but this cannot be determined until the last (tenth) symbol of the message is read (if the string of zeros had been of odd length, then the first codeword would have been 100000).
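A small Java sketch (ours, not from the report) shows why the prefix property makes decoding instantaneous: a greedy match emits a symbol as soon as the buffered bits equal a codeword and never has to back up.

    import java.util.*;

    public class PrefixDecode {
        // Extend the buffer one bit at a time; because no codeword is a prefix of
        // another, the first match is always right and no lookahead is needed.
        static String decode(Map<String, Character> codeToSymbol, String bits) {
            StringBuilder out = new StringBuilder(), buf = new StringBuilder();
            for (char b : bits.toCharArray()) {
                buf.append(b);
                Character sym = codeToSymbol.get(buf.toString());
                if (sym != null) { out.append(sym); buf.setLength(0); }
            }
            return out.toString();
        }

        public static void main(String[] args) {
            Map<String, Character> code = Map.of("0", 'x', "10", 'y', "11", 'z'); // a prefix code
            System.out.println(decode(code, "10011010")); // prints "yxzxy"
        }
    }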


A minimal prefix code is a prefix code such that if x is a proper prefix of some codeword, then x sigma is either a codeword or a proper prefix of a codeword, for each letter sigma in beta. The set of codewords { 00, 01, 10 } is an example of a prefix code which is not minimal. The fact that 1 is a proper prefix of the codeword 10 requires that 11 be either a codeword or a proper prefix of a codeword, and it is neither. Intuitively, the minimality constraint prevents the use of codewords which are longer than necessary. In the above example the codeword 10 could be replaced by the codeword 1, yielding a minimal prefix code with shorter codewords. The codes discussed in this paper are all minimal prefix codes.

In this section, a code has been defined to be a mapping from a source alphabet to a code alphabet; we now define related terms. The process of transforming a source ensemble into a coded message is coding or encoding. The encoded message may be referred to as an encoding of the source ensemble. The algorithm which constructs the mapping and uses it to transform the source ensemble is called the encoder. The decoder performs the inverse operation, restoring the coded message to its original form.

Lossless vs. lossy compression:

Lossless compression algorithms usually exploit statistical redundancy in such a way as to represent the sender's data more concisely, but nevertheless perfectly. Lossless compression is possible because most real-world data has statistical redundancy. For example, in English text, the letter 'e' is much more common than the letter 'z', and the probability that the letter 'q' will be followed by the letter 'z' is very small.


Another kind of compression, called lossy data compression, is possible if some loss of fidelity is acceptable. For example, a person viewing a picture or television video scene might not notice if some of its finest details are removed or not represented perfectly (i.e. may not even notice compression artifacts). Similarly, two clips of audio may be perceived as the same to a listener even though one is missing details found in the other. Lossy data compression algorithms introduce relatively minor differences and represent the picture, video, or audio using fewer bits.

Lossless compression schemes are reversible so that the original data can be reconstructed, while lossy schemes accept some loss of data in order to achieve higher compression.

However, lossless data compression algorithms will always fail to compress some files; indeed, any compression algorithm will necessarily fail to compress any data containing no discernible patterns. Attempts to compress data that has been compressed already will therefore usually result in an expansion, as will attempts to compress encrypted data.

In practice, lossy data compression will also come to a point where compressing again does not work, although an extremely lossy algorithm, which for example always removes the last byte of a file, will always compress a file up to the point where it is empty.

A good example of lossless vs. lossy compression is the following string -- 888883333333. What you just saw was the string written in an uncompressed form. However, you could save space by writing it 8[5]3[7]. By saying "5 eights, 7 threes", you still have the original string, just written in a smaller form. In a lossy system, using 83 instead, you cannot get the original data back (at the benefit of a smaller filesize).
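A tiny run-length encoder makes the 8[5]3[7] example concrete; the bracket notation simply follows the text and is purely illustrative.

    public class RunLength {
        // Encode each run of repeated characters as <char>[<count>],
        // e.g. "888883333333" becomes "8[5]3[7]".
        static String encode(String s) {
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < s.length(); ) {
                int j = i;
                while (j < s.length() && s.charAt(j) == s.charAt(i)) j++;
                out.append(s.charAt(i)).append('[').append(j - i).append(']');
                i = j;
            }
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.println(encode("888883333333")); // 8[5]3[7]
        }
    }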


A small overview of different compression is presented below:

Image compression:

Image here refers to not only still images but also motion-pictures and compression is the process used to reduce the physical size of a block of information.

Compression is simply representing information more efficiently; "squeezing the air" out of the data, so to speak. It takes advantage of three common qualities of graphical data; they are often redundant, predictable or unnecessary.

Today , compression has made a great impact on the storing of large volume of image data. Even hardware and software for compression and decompression are increasingly being made part of a computer platform. Compression does have its trade-offs. The more efficient the compression technique, the more complicated the algorithm will be and thus, requires more computational resources or more time to decompress. This tends to affect the speed. Speed is not so much of an importance to still images but weighs a lot in motion-pictures. Surely you do not want to see your favourite movies appearing frame by frame in front of you.

Most methods for irreversible, or "lossy", digital image compression consist of three main steps: transform, quantization and coding, as illustrated in the figure below.

The three steps of digital image compression.


Image compression is the application of Data compression on digital images. In effect, the objective is to reduce redundancy of the image data in order to be able to store or transmit data in an efficient form.

Image compression can be lossy or lossless. Lossless compression is sometimes preferred for artificial images such as technical drawings, icons or comics. This is because lossy compression methods, especially when used at low bit rates, introduce compression artifacts. Lossless compression methods may also be preferred for high value content, such as medical imagery or image scans made for archival purposes. Lossy methods are especially suitable for natural images such as photos in applications where minor (sometimes imperceptible) loss of fidelity is acceptable to achieve a substantial reduction in bit rate.

The best image quality at a given bit-rate (or compression rate) is the main goal of image compression. However, there are other important properties of image compression schemes:

Scalability generally refers to a quality reduction achieved by manipulation of the bitstream or file (without decompression and re-compression). Other names for scalability are progressive coding or embedded bitstreams. Despite its contrary nature, scalability can also be found in lossless codecs, usually in form of coarse-to-fine pixel scans. Scalability is especially useful for previewing images while downloading them (e.g. in a web browser) or for providing variable quality access to e.g. databases. There are several types of scalability:

Region of interest coding Certain parts of the image are encoded with higher quality than others. This can be combined with scalability (encode these parts first, others later).


Meta information Compressed data can contain information about the image which can be used to categorize, search or browse images. Such information can include color and texture statistics, small preview images and author/copyright information.

The quality of a compression method is often measured by the Peak signal-to-noise ratio. It measures the amount of noise introduced through a lossy compression of the image. However, the subjective judgement of the viewer is also regarded as an important, perhaps the most important measure.
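For reference, the peak signal-to-noise ratio of an m x n image with 8-bit pixels is commonly defined as follows (a standard textbook formula, not taken from this report), where I is the original image and K its compressed reconstruction:

\[
\mathrm{MSE} = \frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\bigl(I(i,j)-K(i,j)\bigr)^2,
\qquad
\mathrm{PSNR} = 10\,\log_{10}\!\left(\frac{\mathrm{MAX}_I^{\,2}}{\mathrm{MSE}}\right),
\]

where MAX_I is the maximum possible pixel value (255 for 8-bit images). A higher PSNR generally indicates less distortion introduced by the lossy compression.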

Video Compression:

A raw video stream tends to be quite demanding when it comes to storage requirements, and demand for network capacity when being transferred between computers. Before being stored or transferred, the raw stream is usually transformed to a representation using compression. When compressing an image sequence, one may consider the sequence a series of independent images, and compress each frame using single image compression methods, or one may use specialized video sequence compression schemes, taking advantage of similarities in nearby frames. The latter will generally compress better, but may complicate handling of variations in network transfer speed.

Compression algorithms may be classified into two main groups, reversible and irreversible. If the result of compression followed by decompression gives a bitwise exact copy of the original for every compressed image, the method is reversible. This implies that no quantizing is done, and that the transform is accurately invertible, i.e. it does not introduce round-off errors.

When compressing general data, like an executable program file or an accounting database, it is extremely important that the data can be reconstructed exactly. For images and sound, it is


often convenient, or even necessary to allow a certain degradation, as long as it is not too noticeable by an observer.

Text compression:

The following methods yield two basic data compression algorithms, which produce good compression ratios and run in linear time.

The first strategy is a statistical encoding that takes into account the frequencies of symbols to build a uniquely decipherable code optimal with respect to the compression criterion. Huffman's method (1951) provides such an optimal statistical coding. It admits a dynamic version where symbol counting is done at coding time. The command "compact" of UNIX implements this version.

Ziv and Lempel (1977) designed a compression method based on encoding segments. These segments are stored in a dictionary that is built during the compression process. When a segment of the dictionary is encountered later while scanning the original text, it is substituted by its index in the dictionary. In the model where portions of the text are replaced by pointers to previous occurrences, Ziv and Lempel's compression scheme can be proved to be asymptotically optimal (on large enough texts satisfying good conditions on the probability distribution of symbols). The dictionary is the central point of the algorithm. Furthermore, a hashing technique makes its implementation efficient. This technique, improved by Welch (1984), is implemented by the "compress" command of the UNIX operating system.


The problems and algorithms discussed above give a sample of text processing methods. Several other algorithms improve on their performance when the memory space or the number of processors of a parallel machine are considered for example. Methods also extend to other discrete objects such as trees and images.


LZW ALGORITHM
Compressor algorithm:

    w = NIL;
    while (read a char c) do
        if (wc exists in dictionary) then
            w = wc;
        else
            add wc to the dictionary;
            output the code for w;
            w = c;
        endif
    done
    output the code for w;

Decompressor algorithm:

    read a char k;
    output k;
    w = k;
    while (read a char k) do
        if (index k exists in dictionary) then
            entry = dictionary entry for k;
        else if (index k does not exist in dictionary && k == currSizeDict)
            entry = w + w[0];
        else
            signal invalid code;
        endif
        output entry;
        add w + entry[0] to the dictionary;
        w = entry;
    done
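The compressor pseudocode can be turned into a short Java method. The sketch below is a minimal illustration that works on strings and returns integer codes (a real implementation would pack the codes into bits and handle dictionary overflow); it is not the project's code.

    import java.util.*;

    public class LzwSketch {
        // LZW compression: codes 0-255 form the initial dictionary of single
        // characters; each time a new string w+c is seen it gets the next code.
        static List<Integer> compress(String text) {
            Map<String, Integer> dict = new HashMap<>();
            for (int i = 0; i < 256; i++) dict.put(String.valueOf((char) i), i);
            int nextCode = 256;
            List<Integer> out = new ArrayList<>();
            String w = "";
            for (char c : text.toCharArray()) {
                String wc = w + c;
                if (dict.containsKey(wc)) {
                    w = wc;                      // keep extending the current match
                } else {
                    out.add(dict.get(w));        // emit the code for the longest match
                    dict.put(wc, nextCode++);    // add the new string to the dictionary
                    w = String.valueOf(c);
                }
            }
            if (!w.isEmpty()) out.add(dict.get(w)); // flush the final match
            return out;
        }

        public static void main(String[] args) {
            System.out.println(compress("TOBEORNOTTOBEORTOBEORNOT"));
        }
    }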


DEFINITION OF THE PROBLEM

Problem Statement:
In today's world of computing, it is hardly possible to do without graphics, images and sound. Just by looking at the applications around us, the Internet, development of Video CDs (Compact Disks), Video Conferencing, and much more, all these applications use graphics and sound intensively.

I guess many of us have surfed the Internet; have you ever become so frustrated waiting for a graphics-intensive web page to open that you stopped the transfer? I bet you have. Guess what would happen if those graphics were not compressed?

Uncompressed graphics, audio and video data consume very large amounts of physical storage, which, in the case of uncompressed video, even present CD technology is unable to handle. Why is this so?

CASE 1

Take for instance, if we want to display a TV-quality full motion Video, how much of physical storage will be required ? Szuprowics states that "TV-quality video requires 720 kilobytes per frame (kbpf) displayed at 30 frames per second (fps) to obtain a full-motion effect, which means that one second of digitised video consumes approximately 22 MB (megabytes) of storage. A standard CD-ROM disk with 648 MB capacity and data transfer


rate of 150 KBps could only provide a total of 30 seconds of video and would take 5 seconds to display a single frame." Based on Szuprowics's statement we can see that this is clearly unacceptable.
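A quick arithmetic check of Szuprowics's figures (the working is ours, not part of the quotation):

\[
720~\text{KB/frame} \times 30~\text{frames/s} = 21{,}600~\text{KB/s} \approx 22~\text{MB/s},
\]
\[
\frac{648~\text{MB}}{21.6~\text{MB/s}} \approx 30~\text{s of video},
\qquad
\frac{720~\text{KB}}{150~\text{KB/s}} \approx 4.8~\text{s} \approx 5~\text{s per frame}.
\]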

Transmission of uncompressed graphics, audio and video is a problem too. Expensive cables with high bandwidth are required to achieve satisfactory result, which is not feasible for the general market.

CASE 2

Take for example the transmission of uncompressed audio signal over the line for one second :

Table is based on Steinmetz and Nahrstedt (1995)

From the table we can see that for better quality of sound transmitted over the channel, both the bandwidth and storage requirement increases, and the size is not feasible at all.

Thus, to provide feasible and cost effective solutions, most multimedia systems use compression techniques to handle graphics, audio and video data streams.

Therefore, in this paper I will address one specific standard of compression, JPEG. At the same time, I will also go through the basic compression techniques that serve as building blocks for JPEG.


This paper focuses on three forms of JPEG image compression: 1) Baseline Lossy JPEG, 2) Progressive JPEG and 3) Motion JPEG. Each of their algorithms, characteristics and advantages will be gone through.

I hope that by the end of the paper, reader will gain more knowledge of JPEG, understand how it works and not just know that it's another form of image compression standard.


SYSTEM ANALYSIS AND DESIGN

Analysis and design refers to the process of examining a business situation with the intent of improving it through better procedures and methods. The two main steps of development are: Analysis Design

ANALYSIS:

System analysis is conducted with the following objectives in mind:
Identify the user's needs.
Evaluate the system concept for feasibility.
Perform economic and technical analysis.
Allocate functions to hardware, software, people, and other system elements.
Establish cost and schedule constraints.

Create a system definition that forms the foundation for all subsequent engineering work. Both hardware and software expertise are required to successfully attain the objectives listed above.


DESIGN

The most creative and challenging phase of the system life cycle is system design. The term design describes a final system and the process by which it is developed. It refers to the technical specifications (analogous to the engineers blueprints) that will be applied in implementing the candidate system. It also includes the construction of programs and program testing. The key question here is: How should the problem be solved? The major steps in designing are:

The first step is to determine how the output is to be produced and in what format. Samples of the output (and input) are also presented. Second, input data and master files (database) have to be designed to meet the requirements of the proposed output. The operational (processing) phases are handled through program construction and testing, including a list of the programs needed to meet the system's objectives and complete documentation. Finally, details related to justification of the system and an estimate of the impact of the candidate system on the user and the organization are documented and evaluated by management as a step towards implementation.

The final report prior to the implementation phase includes procedural flowcharts, record layouts, report layouts, and workable plans for implementing the candidate system. Information on personnel, money, h/w, facilities and their estimated cost must also be available. At this point, projected costs must be close to actual cost of implementation.


In some firms, separate groups of programmers do the programming where as other firms employ analyst-programmers that do the analysis and design as well as code programs. For this discussion, we assume that two separate persons carry out analysis and programming. There are certain functions, though, that the analyst must perform while programs are being written.

SYSTEM DESIGN: Software design sits at the technical kernel of software engineering and is applied regardless of the software process model that is used. Beginning once software requirements have been analyzed and specified, software design is the first of the three technical activities (design, code generation and test) that are required to build and verify the software. Each activity transforms information in a manner that ultimately results in validated computer software.

The importance of software design can be stated with a single word: quality. Design is the place where quality is fostered in software engineering. Design provides us with representations of software that can be assessed for quality. Design is the only way that we can accurately translate a customer's requirements into a finished software product or system. Software design serves as the foundation for all the software engineering and software support steps that follow. Without design we risk building an unstable system, one that will fail when small changes are made, one that may be difficult to test, and one whose quality cannot be assessed until late in the software process, when time is short and many dollars have already been spent.


DESIGN OBJECTIVES: The design phase of software development deals with transforming the customer requirements, as described in the SRS document, into a form implementable using a programming language. However, we can broadly classify the various design activities into two important parts:
Preliminary (or high-level) design
Detailed design

During high level design, different modules and the control relationships among them are identified and interfaces among these modules are defined. The outcome of high level design is called the Program Structure or Software Architecture. The structure chart is used to represent the control hierarchy in a high level design.

During detailed design, the data structure and the algorithms used by different modules are designed. The outcome of the detailed design is usually known as the Module Specification document.

A good design should capture all the functionality of the system correctly. It should be easily understandable, efficient and it should be easily amenable to change that is easily maintainable. Understandability of a design is a major factor, which is used to evaluate the goodness of a design, since a design that is easily understandable is also easy to maintain and change.

In order to enhance the understandability of a design, it should have the following features:

Use of consistent and meaningful names for various design components.


Use of a cleanly decomposed set of modules. Neat arrangement of modules in a hierarchy, that is, a tree-like diagram.

Modular design is one of the fundamental principles of a good design. Decomposition of a problem into modules facilitates taking advantage of the divide and conquer principle: if different modules are almost independent of each other, then each module can be understood separately, greatly reducing the overall complexity.

Clean decomposition of a design problem into modules means that the modules in a software design should display High Cohesion and Low Coupling.

The primary characteristics of clean decomposition are high cohesion and low coupling. Cohesion is a measure of the functional strength of a module. Coupling between two modules is a measure of the degree of interaction or interdependence between them.

A module having high cohesion and low coupling is said to be functionally independent of other modules. By the term functional independence we mean that a cohesive module performs a single task or function.

Functionally independent module has minimal interaction with other modules. Functional independence is a key to good design primarily due to the following reasons:


Functional independence reduces error propagation. An error existing in one module does not directly affect other modules, and an error existing in other modules does not directly affect this module.

Reuse of a module is possible because each module performs some well-defined and precise function, and its interface with other modules is simple and minimal. The complexity of the design is reduced because different modules can be understood in isolation, as modules are more or less independent of each other.

DESIGN PRINCIPLES:
Top-Down and Bottom-Up Strategies
Modularity
Abstraction
Problem Partitioning and Hierarchy

TOP-DOWN AND BOTTOM-UP STRATEGIES: A system consists of components, which have components of their own; indeed, a system is a hierarchy of components. The highest-level component corresponds to the total system. To design such hierarchies there are two possible approaches: top-down and bottom-up. The top-down approach starts from the highest-level component of the hierarchy and proceeds through to lower levels. By contrast, a bottom-up approach starts with the lowest-level components of the hierarchy and proceeds through progressively higher levels to the top-level component.

Top-down design methods often result in some form of stepwise refinement. Starting from an abstract design, in each step the design is refined to a more concrete level, until we reach a level where no more refinement is needed and the design can be implemented directly. Bottom-up methods work with layers of abstraction. Starting from the very bottom, operations that provide a layer of abstraction are implemented. The operations of this layer are then used to implement more powerful operations and a still higher layer of abstraction, until the stage is reached where the operations supported by the layer are those desired by the system.

MODULARITY: The real power of partitioning comes if a system is partitioned into modules so that the modules are solvable and modifiable separately. It will be even better if the modules are also separately compilable. A system is considered modular if it consists of discrete components such that each component can be implemented separately, and a change to one component has minimal impact on other components.

Modularity is clearly a desirable property in a system. Modularity helps in system debugging (isolating a problem to a component is easier if the system is modular), in system repair (changing a part of the system is easy, as it affects few other parts), and in system building (a modular system can easily be built by putting its modules together).


ABSTRACTION: Abstraction is a very powerful concept that is used in all engineering disciplines. It is a tool that permits a designer to consider a component at an abstract level without worrying about the details of the implementation of the component. Any component or system provides some services to its environment. An abstraction of a component describes the external behavior of that component without bothering with the internal details that produce the behavior. Presumably, the abstract definition of a component is much simpler than the component itself.

There are two common abstraction mechanisms for software systems: Functional abstraction and Data abstraction.

In functional abstraction, a module is specified by the function it performs. For example, a module to compute the log of a value can be abstractly represented by the function log. Similarly, a module to sort an input array can be represented by the specification of sorting. Functional abstraction is the basis of partitioning in function-oriented approaches. That is, when the problem is being partitioned, the overall transformation function for the system is partitioned into smaller functions that comprise the system function. The decomposition of the system is in terms of functional modules.


The second unit for abstraction is data abstraction. Data abstraction forms the basis for object-oriented design. In using this abstraction, a system is viewed as a set of objects providing some services. Hence, the decomposition of the system is done with respect to the objects the system contains.
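A small Java illustration of the two mechanisms (the names are hypothetical, not taken from this project): functional abstraction exposes only what a module does, while data abstraction hides how the data is represented behind a set of operations.

    // Functional abstraction: callers know only the function performed,
    // not how it is implemented.
    interface Sorter {
        int[] sort(int[] input);
    }

    // Data abstraction: the internal representation of the code table is
    // hidden behind its operations.
    class CodeTable {
        private final java.util.Map<Character, String> codes = new java.util.HashMap<>();
        void put(char symbol, String code) { codes.put(symbol, code); }
        String codeFor(char symbol)        { return codes.get(symbol); }
    }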

Problem Partitioning and Hierarchy: When solving a small problem, the entire problem can be tackled at once. For solving larger problems, the basic principle is the time-tested principle of divide and conquer. Clearly, dividing in such a manner that all the divisions have to be conquered together is not the intent of this wisdom. This principle, if elaborated, would mean: divide into smaller pieces, so that each piece can be conquered separately.

Problem partitioning, which is essential for solving a complex problem, leads to hierarchies in the design. That is, the design produced by using problem partitioning can be represented as a hierarchy of components. The relationship between the elements in this hierarchy can vary depending on the method used. For example, the most common is the whole-part relationship: the system consists of some parts, each part consists of subparts, and so on. This relationship can be naturally represented as a hierarchical structure between various system parts. In general, a hierarchical structure makes it much easier to comprehend a complex system. Due to this, all design methodologies aim to produce a design that has a nice hierarchical structure.


STAGES IN A SYSTEMS LIFE CYCLE

Requirement Determination

A system is intended to meet the needs of an organization, in this case to save storage capacity. Thus the first step in the design is to specify these needs, that is, to determine the requirements to be met by the system in the organization. Meetings of prospective user departments are held and, through discussions, priorities among various applications are determined, subject to the constraints of available computer memory, bandwidth, transfer time and budget.

Requirement Specification

The top management of an organization first decides that a compression & decompression system would be desirable to improve the operations of the organization. Once this basic decision is taken, a system analyst is consulted. The first job of the system analyst is to understand the existing system. During this stage he understands the various aspects of the algorithms and data structures involved. Based on this he identifies which aspects of the operations of the project need changes. The analyst discusses this with the users and determines the areas where changes can be made effectively. The applications where file transfer is allowed are checked. It is important to get the users involved from the initial stages of the development of an application.


Feasibility Analysis

Having drawn up the rough specification, the next step is to check whether it is feasible to implement the system. A feasibility study takes into account various constraints within which the system should be implemented and operated. The resources needed for implementation such as computing equipment, manpower and cost are estimated, based on the specifications of users requirements. These estimates are compared with the available resources. A comparison of the cost of the system and the benefits which will accrue is also made. This document, known as the feasibility report, is given to the management of the organization.

Final Specifications

The developer of this software studies the feasibility report and suggests modifications in the requirements, if any. Knowing the constraints on available resources, and the modified requirements specified by the organization, the final specifications of the system to be developed are drawn up by the system analyst. These specifications should be in a form which can be easily understood by the users. The specifications state what the system would achieve; they do not describe how the system would do it. These specifications are given back to the users, who study them, consult their colleagues and offer suggestions to the system analyst for appropriate changes. These changes are incorporated by the system analyst and a new set of specifications is given back to the users. After discussions between the system analyst and the users, the final specifications are drawn up and approved for implementation. Along with this, criteria for system approval are specified, which will normally include a system test plan.

Hardware Study

Based on the finalized specifications it is necessary to determine the configuration of hardware and support software essential to execute the specified application.

System Design

The next step is to develop the logical design of the system. The inputs to the system design phase are functional specifications of the system and details about the computer configuration. During this phase the logic of the programs is designed, and program test plans and implementation plan are drawn up. The system design should begin from the objectives of the system.

System Implementation
The next phase is implementation of the system. In this phase all the programs are written, user operational document is written, users are trained, and the system tested with operational data.


System Evaluation

After the system has been in operation for a reasonable period, it is evaluated and a plan for its improvement is drawn up. This is called the system life cycle. The shortcomings of a system, namely the difference between what a user expected from the system and what he actually got, are realized only after a system is used for a reasonable time. Similarly, the shortcomings in this system are realized only after it is implemented and used for some time.

System Modification

A computer-based system is a piece of software. It can be modified. Modifications will definitely cost time and money. But users expect modifications to be made as the name software itself implies it is soft and hence changeable.

Further, systems designed for use by clients cannot be static. These systems are intended for real-world problems, and the environment in which an activity is conducted never remains static. New changes occur, and new, more efficient algorithms appear as research goes on.

Thus a system which cannot be modified to fulfill the changing requirements of an organization is bad. A system should be designed for change. The strength of a good computer-based system is that it is amenable to change. A good system designer is one who can foresee what aspects of a system would change and would design the system in a flexible way to easily accommodate changes.


SYSTEM PLANNING

To understand system development, we need to recognize that a candidate system has a planning cycle, just like a living system or a new product. System analysis and design are keyed to system planning. The analyst must progress from one stage to another methodically, answering key questions and achieving results in each stage.

RECOGNITION OF NEED One must know what the problem is before it can be solved. The basis for a candidate system is recognition of a need for improving an information system or procedure. The need leads to a preliminary survey or an initial investigation to determine whether an alternative system can solve the problem. It entails looking into the duplication of effort, bottlenecks, inefficient existing procedures, or whether parts of the existing system would be candidates for computerization.

FEASIBILITY STUDY:
Many feasibility studies are disillusioning for both users and analysts. First, the study often presupposes that when the feasibility document is being prepared, the analyst is in a position to evaluate solutions. Second, most studies tend to overlook the confusion inherent in system development: the constraints and the assumed attitudes. If the feasibility study is to serve as a decision document, it must answer three key questions:

Is there a new and better way to do the job that will benefit the user?
What are the costs and savings of the alternative(s)?
What is recommended?

The most successful system projects are not necessarily the biggest or most visible in a business, but rather those that truly meet user expectations. More projects fail because of inflated expectations than for any other reason.

Feasibility study is broadly divided into three parts:

Economic feasibility
Technical feasibility
Operational feasibility

1. ECONOMIC FEASIBILITY:

Economic feasibility is the most frequently used method for evaluating the effectiveness of a candidate system. The procedure is to determine the benefits and savings that are expected from the system and compare them with the costs. If the benefits outweigh the costs, the decision is made to design and implement the system. Otherwise, further justification or alteration of the proposed system will have to be made if it is to have a chance of being approved. This is an ongoing effort that improves in accuracy at each phase of the system life cycle.

So in our system we have considered these categories for the purpose of cost/benefits analysis or economic feasibility.


1. Hardware Cost:
It relates to the actual purchase or lease of the computer and peripherals (for example, printer, disk drive, tape unit). Determining the actual cost of hardware is generally more difficult when the system is shared by various users than for a dedicated stand-alone system. In some cases, the best way to control this cost is to treat it as an operating cost.

In this system we are taking it as operating cost so as to minimize the cost of the initial installation of the computer hardware.

2. Personnel Cost:
It includes EDP staff salaries and benefits (health insurance, vacation time, sick pay, etc.) as well as pay for those involved in developing the system. Costs incurred during the development of the system are one-time costs and are labeled development costs. Once the system is installed, the costs of operating and maintaining the system become recurring costs.

Facility costs are expenses incurred in the preparation of the physical site where the application or the computer will be in operation. This includes wiring, flooring, acoustics, lighting and air conditioning. These costs are treated as one-time costs and are incorporated into the overall cost estimate of the candidate system.

Our proposed system incurs only wiring cost, since nowadays all sites are already well prepared in terms of flooring and lighting. Thus it will not incur extra expense.


Operating cost includes all costs associated with the day-to-day operation of the system; the amount depends on the number of shifts, the nature of the applications, and the caliber of the operating staff. There are various ways of covering operating costs. One approach is to treat the operating cost as overhead. Another approach is to charge each authorized user for the amount of processing they request from the system. The amount charged is based on computer time, staff time, and the volume of output produced. In any case, some accounting is necessary to determine how operating costs should be handled.

As our candidate system is not very big, we require only one server and a few terminals for maintaining and processing data. Their costs can easily be determined at the installation time of the proposed system. Since a computer is also a machine, it depreciates; by using any of the depreciation methods we can determine its annual cost after deducting the depreciation.

Supply costs are variable costs that increase with the use of paper, ribbons, disks, and the like. They should be estimated and included in the overall cost of the system.

A system is also expected to provide benefits. The first task is to identify each benefit and then assign a monetary value to it for cost/benefit analysis. Benefits may be tangible and intangible, direct and indirect.

The two major benefits are improving performance and minimizing the cost of processing. The performance category emphasizes improvements in the accuracy of, or access to, information and easier access to the system by authorized users. Minimizing costs through an efficient system, error control, or reduction of staff is a benefit that should be measured and included in the cost/benefit analysis.

This cost in our proposed system depends on the number of customers, so sometimes it is more and sometimes less. It is not easy to estimate; what we can do is make a rough estimate of this cost, and when the system is installed at a client's site we can compare this rough estimate with the actual expenses incurred as supply cost.

2. TECHNICAL FEASIBILITY:

Technical feasibility centers on the existing computer system (hardware, software, etc.) and the extent to which it can support the proposed addition. For example, if the current computer is operating at 80 percent capacity, an arbitrary ceiling, then running another application could overload the system or require additional hardware. This involves financial considerations to accommodate technical enhancements. If the budget is a serious constraint, then the project is judged not feasible.

Presently, at our client's side all the work is done manually, so the question of overloading the system or requiring additional hardware does not arise; thus our candidate system is technically feasible.


3. OPERATIONAL FEASIBILITY:
People are inherently resistant to change, and computers have been known to facilitate change. An estimate should be made of how strong a reaction the user staff is likely to have towards the development of a computerized system. It is common knowledge that computer installations have something to do with turnover, transfers, retraining, and changes in employee job status. Therefore, it is understandable that the introduction of a candidate system requires special effort to educate, sell, and train the staff on new ways of conducting business.

There is no doubt that people are inherently resistant to change and that computers facilitate change; but in today's world, where most work is computerized, people mainly benefit from it. As far as our system is concerned, it will only benefit the client's staff in their daily routine work. There is no danger of someone losing a job or not getting proper attention after the installation of the proposed system. Thus our system is operationally feasible as well.

REQUIREMENT ANALYSIS
Analysis is a detailed study of the various operations performed by a system and their relationships within and outside the system. One aspect of analysis is defining the boundaries of the system and determining whether or not a candidate system should consider other related systems. During analysis, data are collected on the available files, decision points, and transactions handled by the present system. Data flow diagrams, interviews, on-site observations, and questionnaires are examples of the tools used. The interview is a commonly used tool in analysis; it requires special skills and sensitivity to the subjects being interviewed. Bias in data collection and interpretation can be a problem, so training, experience and common sense are required to collect the information needed for the analysis.

Once analysis is completed, the next step is to decide how the problem might be solved. Thus in, system design, we move from the logical to the physical aspects of the System Planning.


HARDWARE & SOFTWARE REQUIREMENTS


HARDWARE SPECIFICATIONS:
Processor: Pentium I/II/III or higher
RAM: 128 MB or higher
Monitor: 15 inch (digital) with 800 x 600 support
Keyboard: 101-key keyboard
Mouse: 2-button serial / PS-2

Tools / Platform Language Used:

Language: Java
OS: Any OS such as Windows XP/98/NT/Vista


PROJECT DESCRIPTION

What is the Huffman Algorithm: Huffman coding is an algorithm presented by David Huffman in 1952. It works with integer-length codes; in fact, if we want an algorithm that assigns an integer number of bits to every code, Huffman is the best option because it is optimal.

Huffman coding is used, for example, to compress the bytes output by LZP. First we have to know their probabilities; a QSM model can be used for that purpose. Based on the probabilities the algorithm builds the codes, which can then be output. Decoding is more or less the reverse process: based on the probabilities and the coded data, it outputs the decoded byte.

To turn the probabilities into codes the algorithm uses a binary tree. It stores the symbols and their probabilities in the tree; the position of a symbol depends on its probability, and its code is assigned based on its position in the tree. The codes have the prefix property and are instantaneously decodable, so they are well suited for compression and decompression.

The Huffman compression algorithm assumes data files consist of some byte values that occur more frequently than other byte values in the same file. This is very true for text files and most raw gif images, as well as EXE and COM file code segments.

By analyzing the file, the algorithm builds a "Frequency Table" for each byte value within it. From the frequency table the algorithm can then build the "Huffman Tree". The purpose of the tree is to associate each byte value with a bit string of variable length: the more frequently used characters get shorter bit strings, while the less frequent characters get longer bit strings. Thus the data file may be compressed.
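As a minimal illustration of this first pass, the frequency table can be built by reading the file one byte at a time. The sketch below is a stand-alone example with illustrative names; it is not the project's Encoder class, which performs the same counting but uses mark() and reset() so that the stream can be re-read for the second pass.

    import java.io.*;

    public class FrequencyCount {
        // First pass: count how often each of the 256 possible byte values occurs.
        public static int[] countFrequencies(File file) throws IOException {
            int[] freq = new int[256];
            try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(file))) {
                int b;
                while ((b = in.read()) != -1) {   // read() returns 0..255, or -1 at end of file
                    freq[b]++;
                }
            }
            return freq;
        }

        public static void main(String[] args) throws IOException {
            int[] freq = countFrequencies(new File(args[0]));
            for (int i = 0; i < 256; i++) {
                if (freq[i] > 0) {
                    System.out.println("byte " + i + " occurs " + freq[i] + " times");
                }
            }
        }
    }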


To compress the file, the Huffman algorithm reads the file a second time, converting each byte value into the bit string assigned to it by the Huffman Tree and then writing the bit string to a new file. The decompression routine reverses the process by reading in the stored frequency table (presumably stored in the compressed file as a header) that was used in compressing the file. With the frequency table the decompressor can then re-build the Huffman Tree, and from that, extrapolate all the bit strings stored in the compressed file to their original byte value form.

Huffman Encoding : Huffman encoding works by substituting more efficient codes for data and the codes are then stored as a conversion table and passed to the decoder before the decoding process takes place. This approach was first introduced by David Huffman in 1952 for text files and has spawned many variations. Even CCITT (International Telegraph and Telephone Consultative Committee) 1 dimensional encoding used for bilevel, black and white image data telecommunications is based on Huffman encoding.

Algorithm: Basically, in Huffman encoding each unique value is assigned a binary code, with codes varying in length; shorter codes are used for more frequently occurring values. These codes are stored in a conversion table and passed to the decoder before any decoding is done. So how does the encoder assign codes to the values?

Let's imagine that there is this data stream that is going to be encoded by Huffman Encoding :


AAAABCDEEEFFGGGH

The frequency for each unique value that appears are as follows :

A : 4, B : 1, C : 1, D : 1, E : 3, F : 2, G : 3, H :1

Based on the frequency count the encoder can generate a statistical model reflecting the probability that each value will appear in the data stream :

A : 0.25, B : 0.0625, C : 0.0625, D : 0.0625, E : 0.1875, F : 0.125, G : 0.1875, H : 0.0625

From the statistical model the encoder can build a minimum-length code for each value and store it in the conversion table. The algorithm pairs up the two values with the least probability, in this case B and C, and combines their probabilities so that the pair is treated as one value. Along the way B, C and the combined value BC are each assigned a 0 or 1 on their branch; this means that 0 and 1 will be the least significant bits of the codes for B and C respectively. The algorithm then compares the remaining values, again takes the two with the smallest probability, and repeats the whole process until everything is combined into a single root, forming an upside-down tree. The whole process is illustrated on the following pages.

(Figures: step-by-step construction of the Huffman tree for the example data stream.)

The binary code for each unique value can then be read off by following the upside-down tree from the top (most significant bit) down to the value we want (least significant bit). For example, to find the code for B: follow the path shown by the blue arrow in the diagram, arriving at B. Beside each branch taken there is a bit value; combining the bits encountered along the way gives the code for B: 1000. The same approach is used for all of the unique values, and their codes are then stored in the conversion table.
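To make the example concrete, the following self-contained sketch (written for this explanation, not taken from the project's source, which uses its own PriorityQueue and HuffmanNode classes) derives Huffman codes for the same stream using the JDK's java.util.PriorityQueue. Because ties between equal frequencies may be broken in a different order, the exact bit patterns can differ from the diagram, but the code lengths, and therefore the compressed size, are the same.

    import java.util.*;

    public class HuffmanExample {
        static class Node {
            final Character sym;          // null for internal nodes
            final int freq;
            final Node left, right;
            Node(Character sym, int freq, Node left, Node right) {
                this.sym = sym; this.freq = freq; this.left = left; this.right = right;
            }
        }

        // Walk the finished tree, appending 0 for a left branch and 1 for a right branch.
        static void collect(Node n, String prefix, Map<Character, String> codes) {
            if (n.sym != null) { codes.put(n.sym, prefix.isEmpty() ? "0" : prefix); return; }
            collect(n.left, prefix + "0", codes);
            collect(n.right, prefix + "1", codes);
        }

        public static void main(String[] args) {
            String data = "AAAABCDEEEFFGGGH";
            Map<Character, Integer> freq = new TreeMap<>();
            for (char c : data.toCharArray()) freq.merge(c, 1, Integer::sum);

            PriorityQueue<Node> q = new PriorityQueue<>(Comparator.comparingInt((Node n) -> n.freq));
            freq.forEach((c, f) -> q.add(new Node(c, f, null, null)));

            while (q.size() > 1) {                        // merge the two least frequent nodes
                Node a = q.poll(), b = q.poll();
                q.add(new Node(null, a.freq + b.freq, a, b));
            }

            Map<Character, String> codes = new TreeMap<>();
            collect(q.poll(), "", codes);
            codes.forEach((c, code) -> System.out.println(c + " -> " + code));
        }
    }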

Code Construction:
To assign codes you need only a single pass over the symbols, but before doing that you need to calculate where the codes for each code length start. To do so, consider the following: the longest code is all zeros, and each code differs from the previous one by 1 (the codes are stored so that the last bit of the code is in the least significant bit of a byte or word).

In the example this means:

Codes with length 4 start at 0000.

Codes with length 3 start at (0000 + 4*1) >> 1 = 010. There are 4 codes with length 4 (that is where the 4 comes from), so the next length-4 code would start at 0100; but since it is to be a length-3 code we remove the last 0 (if we ever remove a 1, there is a bug in the code lengths).

Codes with length 2 start at (010 + 2*1) >> 1 = 10.

Codes with length 1 start at (10 + 2*1) >> 1 = 10.

Codes with length 0 start at (10 + 0*1) >> 1 = 1. If anything other than 1 is the start for code length 0, there is a bug in the code lengths!

Then visit each symbol in alphabetical sequence (to ensure the second condition) and assign the start value for that symbol's code length as its code; after that, increment the start value for that code length by 1.
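The start values above can be computed mechanically. The sketch below (illustrative names, assuming the code length of every symbol has already been determined; it is not part of the project's source) derives the first code of each length and then hands out codes in alphabetical order, including the "length 0 must start at 1" sanity check mentioned above.

    public class CanonicalCodes {
        // codeLen[s] = length in bits of symbol s's code (0 means the symbol is unused).
        public static int[] assign(int[] codeLen, int maxLen) {
            int[] lenCount = new int[maxLen + 1];
            for (int len : codeLen) {
                if (len > 0) lenCount[len]++;
            }

            // First code of each length, working from the longest (all zeros) upward:
            // start(len-1) = (start(len) + count(len)) >> 1.
            int[] firstCode = new int[maxLen + 1];
            for (int len = maxLen; len >= 1; len--) {
                firstCode[len - 1] = (firstCode[len] + lenCount[len]) >> 1;
            }
            if (firstCode[0] != 1) {               // consistency check from the text
                throw new IllegalStateException("invalid code lengths");
            }

            // Visit symbols in alphabetical (index) order and assign consecutive codes.
            int[] code = new int[codeLen.length];
            int[] next = firstCode.clone();
            for (int s = 0; s < codeLen.length; s++) {
                if (codeLen[s] > 0) {
                    code[s] = next[codeLen[s]]++;
                }
            }
            return code;
        }
    }

For the example lengths {4, 4, 4, 4, 3, 3, 2, 2} this reproduces the start values 0000, 010 and 10 and the final check value 1 given above.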

Maximum Length of a Huffman Code:


Apart from the ceil(log2(alphabetsize)) boundary for the nonzero bits in this particular canonical huffman code it is useful to know the maximum length a huffman code can reach. In fact there are two limits which must both be fulfilled.

No Huffman code can be longer than alphabetsize - 1. Proof: it is impossible to construct a Huffman tree with N leaves and more than N - 1 levels, since every internal node has two children.

The maximum length of the code also depends on the number of samples used to derive the statistics (including the fake samples added to give each symbol a nonzero probability): the minimum number of samples needed to produce a code of a given length grows very quickly with that length (the growth is related to the Fibonacci sequence), so with few samples only short codes can occur.

The Compression or Huffing Program:

To compress a file (sequence of characters) you need a table of bit encodings, e.g., an ASCII table, or a table giving a sequence of bits that's used to encode each character. This table is constructed from a coding tree using root-to-leaf paths to generate the bit sequence that encodes each character.


Assuming you can write a specific number of bits at a time to a file, a compressed file is made using the following top-level steps. These steps will be developed further into substeps, and you'll eventually implement a program based on these ideas and sub-steps.

Build a table of per-character encodings. The table may be given to you, e.g., an ASCII table, or you may build the table from a Huffman coding tree.

Read the file to be compressed (the plain file) and process one character at a time. To process each character find the bit sequence that encodes the character using the table built in the previous step and write this bit sequence to the compressed file.
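A sketch of this second pass is shown below (illustrative names; the code table is assumed to map each byte value to its code as a String of '0'/'1' characters, which is also how the project's Encoder represents codes). It translates each input byte through the table and packs the resulting bits into whole bytes, padding the final byte with zero bits.

    import java.io.*;

    public class BitPacker {
        // Second pass: replace every byte by its bit code and pack the bits into bytes.
        public static void compress(File plain, String[] codeForByte, OutputStream out)
                throws IOException {
            try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(plain))) {
                int buffer = 0, bitsInBuffer = 0, b;
                while ((b = in.read()) != -1) {
                    for (char bit : codeForByte[b].toCharArray()) {
                        buffer = (buffer << 1) | (bit - '0');   // append one bit
                        if (++bitsInBuffer == 8) {              // a full byte is ready
                            out.write(buffer);
                            buffer = 0;
                            bitsInBuffer = 0;
                        }
                    }
                }
                if (bitsInBuffer > 0) {                         // pad the last partial byte
                    out.write(buffer << (8 - bitsInBuffer));    // with zeros on the right
                }
            }
        }
    }

Note that the number of padding bits (or the original file length) must also be recorded; this is why the project's header (the Table class) stores the original file size.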

Building the Table for Compression:


To build a table of optimal per-character bit sequences you'll need to build a Huffman coding tree using the greedy Huffman algorithm. The table is generated by following every root-to-leaf path and recording the left/right (0/1) edges followed. These paths give the optimal encoding bit sequences for each character.

There are three steps in creating the table:

1 Count the number of times every character occurs. Use these counts to create an initial forest of one-node trees. Each node has a character and a weight equal to the number of times the character occurs.

2 Use the greedy Huffman algorithm to build a single tree. The final tree will be used in the next step.

3 Follow every root-to-leaf path creating a table of bit sequence encodings for every character/leaf.

Header Information:
You must store some initial information in the compressed file that will be used by the uncompression/unhuffing program. Basically you must store the tree used to compress the original file. This tree is used by the uncompression program.

There are several alternatives for storing the tree. Some are outlined here, you may explore others as part of the specifications of your assignment.

Store the character counts at the beginning of the file. You can store counts for every character, or counts for the non-zero characters. If you do the latter, you must include some method for indicating the character, e.g., store character/count pairs.

You could use a "standard" character frequency, e.g., for any English language text you could assume weights/frequencies for every character and use these in constructing the tree for both compression and uncompression.

You can store the tree at the beginning of the file. One method for doing this is to do a pre-order traversal, writing each node visited. You must differentiate leaf nodes from internal/non-leaf nodes. One way to do this is write a single bit for each node, say 1 for leaf and 0 for non-leaf. For leaf nodes, you will also need to write the character stored. For non-leaf nodes there's no information that needs to be written, just the bit that indicates there's an internal node.
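A sketch of this pre-order alternative is given below. The Node type and the one-byte-per-flag encoding are illustrative simplifications (a real header would pack the flags as single bits), and the project itself stores a frequency table (the Table class) rather than the tree.

    import java.io.*;

    class Node {
        Character sym;          // non-null only for leaves
        Node left, right;
        Node(Character sym) { this.sym = sym; }
        Node(Node left, Node right) { this.left = left; this.right = right; }
        boolean isLeaf() { return left == null && right == null; }
    }

    class TreeHeader {
        static void write(Node n, DataOutputStream out) throws IOException {
            if (n.isLeaf()) {
                out.writeBoolean(true);   // 1 = leaf, followed by the stored character
                out.writeChar(n.sym);
            } else {
                out.writeBoolean(false);  // 0 = internal node, no extra data needed
                write(n.left, out);
                write(n.right, out);
            }
        }

        static Node read(DataInputStream in) throws IOException {
            if (in.readBoolean()) {
                return new Node(in.readChar());   // leaf: read back the character
            }
            Node left = read(in);                 // internal: rebuild children in the
            Node right = read(in);                // same pre-order they were written
            return new Node(left, right);
        }
    }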


Decompressing:

Decompression involves re-building the Huffman tree from the stored frequency table (again, presumably kept in the header of the compressed file) and converting the bit stream back into characters. You read the file a bit at a time. Beginning at the root node of the Huffman tree, and depending on the value of the bit, you take the right or left branch of the tree and then read another bit. When the node you reach is a leaf (it has no right or left child), you write its character value to the decompressed file and go back to the root node for the next bit.
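The decoding loop can be sketched as follows. It reuses the Node type from the previous sketch and a simplified one-boolean-per-bit reader; the project's Decoder performs the same walk but unpacks bits from bytes and uses the stored original size to know when to stop.

    public class BitDecoder {
        // Walk the tree one bit at a time; reaching a leaf yields the next character.
        public static void decode(Node root, java.util.Iterator<Boolean> bits,
                                  StringBuilder out, int symbolCount) {
            Node n = root;
            while (symbolCount > 0 && bits.hasNext()) {
                n = bits.next() ? n.right : n.left;   // 1 = right branch, 0 = left branch
                if (n.isLeaf()) {
                    out.append(n.sym);                // reached a character: emit it,
                    n = root;                         // then restart from the root
                    symbolCount--;
                }
            }
        }
    }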

Transmission and storage of Huffman-encoded Data:

If your system continually deals with data in which the symbols have similar frequencies of occurrence, then both encoder and decoder can use a standard encoding table and decoding tree. However, even text data from various sources will have quite different characteristics. For example, ordinary English text will generally give 'e' the shortest encoding (closest to the root of the tree), with short encodings for 'a' and 't', whereas C programs would generally give ';' the shortest encoding, with short encodings for other punctuation marks such as '(' and ')' (depending on the number and length of comments!). If the data has variable frequencies, then, for optimal encoding, we have to generate an encoding tree for each data set and store or transmit the encoding with the data. The extra cost of transmitting the encoding tree means that we will not gain an overall benefit unless the data stream to be encoded is quite long, so that the savings through compression more than compensate for the cost of transmitting the encoding tree as well.


WORKING OF PROJECT:
MODULES & THEIR DESCRIPTION:

The project contains the following modules:

Huffman Zip
Encoder
Decoder
Table
DLNode
Priority Queue
Huffman Node

HuffmanZip is the main class; it provides the user interface (a Swing window). Encoder is the module for compressing a file; it implements the Huffman algorithm for compressing text and image files. It first calculates the frequencies of all occurring symbols. On the basis of these frequencies it builds a priority queue, which is used to find the symbols with the least frequencies. The two symbols with the lowest frequencies are removed from the queue and a new symbol is added with a frequency equal to the sum of the two. At the same time we build a tree in which the two removed nodes become children and the new node added to the queue becomes their parent. Finally we traverse the tree from the root to each leaf, assigning 0 to a left child and 1 to a right child; in this way every symbol in the file gets a binary code. We then group these binary codes into bytes, convert them to the equivalent integers, and store them in the output file, which is the compressed file.

Decoder works in the reverse order of the encoder. It reads the input from the compressed file and converts it into the equivalent binary code. Its other input is the binary tree generated during encoding (rebuilt from the stored frequency table), and on the basis of these it regenerates the original file. This project is based on lossless compression.

Table stores the frequency table, together with the original file name and size, that is written into the compressed file header. Priority Queue takes the symbols and their frequencies as input and, on the basis of these frequencies, assigns a priority to each symbol. Huffman Node is used for creating the binary tree: two symbols are taken from the priority queue and compared by frequency; the symbol with the lower frequency is placed on the left and the symbol with the higher frequency on the right. The two symbols are then deleted from the priority queue and a new symbol, with a frequency equal to the sum of their frequencies, is inserted in their place; this new symbol is the parent node of the two leaf nodes.
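The interaction between these modules can be seen in miniature in the fragment below, which mirrors the tree-building loop of Encoder.java using the project's own PriorityQueue and HuffmanNode classes (both listed in the source code section); it is slightly simplified in that the real Encoder also orders the two children by frequency before attaching them.

    static HuffmanNode buildTree(int[] freq) throws Exception {
        PriorityQueue q = new PriorityQueue();
        for (int j = 0; j < 256; j++) {
            if (freq[j] > 0) {
                q.insertM(new HuffmanNode("", freq[j], j, null, null, null));
            }
        }
        while (q.sizeQ() > 1) {
            HuffmanNode one = q.removeFirst();       // the two lowest-frequency nodes
            HuffmanNode two = q.removeFirst();
            HuffmanNode parent = new HuffmanNode(null,
                    one.getFreq() + two.getFreq(), 0, one, two, null);
            one.up = parent;                         // children keep a link to their parent
            two.up = parent;
            q.insertM(parent);                       // the merged node goes back in the queue
        }
        return q.removeFirst();                      // root of the complete Huffman tree
    }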


DATA FLOW DIAGRAM


When solving a small problem, the entire problem can be tackled at once. For solving larger problems, the basic principle is the time-tested principle of divide and conquer. Clearly, dividing in such a manner that all the divisions have to be conquered together is not the intent of this wisdom; the principle, properly applied, means dividing into smaller pieces so that each piece can be conquered separately.

Problem partitioning, which is essential for solving a complex problem, leads to hierarchies in the design. That is, the design produced by using problem partitioning can be represented as a hierarchy of components. The relationship between the elements in this hierarchy can vary depending on the method used. For example, the most common is the whole-part relationship, in which the system consists of some parts, each part consists of subparts, and so on. This relationship can naturally be represented as a hierarchical structure between the various system parts. In general a hierarchical structure makes it much easier to comprehend a complex system. Due to this, all design methodologies aim to produce a design that has a nice hierarchical structure.

The DFD was first designed by Larry Constantine as a way of expressing system requirements in a graphical form; this leads to a modular design. A DFD, also known as a bubble chart, has the purpose of clarifying system requirements and identifying major transformations that will become programs in system design. So it is the starting point of the design phase that functionally decomposes the requirement specifications down to the lowest level of detail. A DFD consists of a series of bubbles joined by lines that represent data flows in the system.

DFD SYMBOLS

In the DFD, there are four symbols.

1 A square defines a source (originator) or destination of system data.

2 An arrow identifies data flow: data in motion. It is a pipeline through which information flows.

3 A circle or a bubble (some people use an oval bubble) represents a process that transforms incoming data flow(s) into outgoing data flow(s).

4 An open rectangle is a data store: data at rest, or a temporary repository of data.


(Figure: the four DFD symbols and their meanings: source or destination of data, data flow, process that transforms data flows, and data store.)

CONSTRUCTING DFD

Several rules of thumb are used in drawing DFDs:

1 Processes should be named and numbered for easy reference. Each name should be representative of the process.

2 The direction of flow is from top to bottom and from left to right. Data traditionally flow from the source (upper left corner) to the destination (lower right corner), although they may flow back to a source. One way to indicate this is to draw a long flow line back to the source. An alternative way is to repeat the source symbol as a destination. Since it is used more than once in the DFD, it is marked with a short diagonal in the lower right corner.


3 When a process is exploded into lower-level details, they are numbered.

4 The names of data sources and destinations are written in capital letters. Process and data flow names have the first letter of each word capitalized.

HOW DETAILED SHOULD A DFD BE?


The DFD is designed to aid communication. If it contains dozens of processes and data stores it gets too unwieldy. The rule of thumb is to explode the DFD to a functional level, so that the next sublevel does not exceed 10 processes. Beyond that, it is best to take each function separately and expand it to show the explosion of that single process. If a user wants to know what happens within a given process, then the detailed explosion of that process may be shown.

A DFD typically shows the minimum contents of data elements that flow in and out.

A leveled set has a starting DFD, which is a very abstract representation of the system, identifying the major inputs and outputs and the major processes in the system. Then each process is refined and a DFD is drawn for it; in other words, a bubble in one DFD is expanded into a DFD of its own during refinement. For the hierarchy to be consistent, it is important that the net inputs and outputs of the DFD for a process are the same as the inputs and outputs of that process in the higher-level DFD. This refinement stops when each bubble can be easily identified or understood. It should be pointed out that during refinement, though the net inputs and outputs are preserved, a refinement of the data might also occur; that is, a unit of data may be broken into its components for processing when the detailed DFD for a process is being drawn. So, as the processes are decomposed, data decomposition also occurs.


The DFD methodology is quite effective, especially when the required design is unclear and the analyst needs a notational language for communication. The DFD is easy to understand after a brief orientation.

The main problem, however, is the large number of iterations that are often required to arrive at the most accurate and complete solution.

DATA FLOW DIAGRAM

The DFDs help in understanding the functioning of, and the modules used in, the code. They show how data flows and where it is stored: which variables are given as input, how data moves through the program, and what the final output is. The following DFDs help in understanding the program.


(DFDs: Priority Queue, Huffman Node, Table, Code Generator, Updation of Priority Queue, Traverse, Code Store.)

Print Layouts

(Screenshots of the application's user interface.)

IMPLEMENTATION:

The implementation phase is less creative than system design. It is primarily concerned with user training, site preparation, and file conversion. When the candidate system is linked to terminals at remote sites, the telecommunication network and testing of the network along with the system are also included under implementation.

During the implementation phase, the system actually takes physical shape

As in the other two stages, the analyst, his or her associates and the user perform many tasks, including:

Writing, testing, debugging and documenting the system.
Converting data from the old to the new system.
Training the system's users.
Completing system documentation.
Evaluating the final system to make sure that it fulfills the original need and that it began operation on time and within budget.

The analyst's involvement in each of these activities varies from organization to organization. In small organizations, specialists may work on different phases and tasks, such as training, ordering equipment, converting data from the old methods to the new, or certifying the correctness of the system.

The implementation phase ends with an evaluation of the system after it has been in operation for a period of time. By then, most program errors will have shown up and most costs will have become clear. The system audit is a last check or review of the system to ensure that it meets the design criteria. Evaluation forms the feedback part of the cycle that keeps implementation going as long as the system continues in operation.

Ordering and installing any new hardware required by the system.
Developing operating procedures for the computer center staff.
Establishing a maintenance procedure to repair and enhance the system.

During final testing, user acceptance is tested, followed by user training. Depending on the nature of the system, extensive user training may be required. Conversion usually takes place at about the same time the user is being trained, or later.

In the extreme, the programmer is falsely viewed as someone who ought to be isolated from other aspects of system development. Programming is itself design work, however. The initial parameters of the candidate system should be modified as a result of programming efforts. Programming provides a reality test for the assumptions made by the analyst; it is therefore a mistake to exclude programmers from the initial system design.

System testing checks the readiness and accuracy of the system to access, update and retrieve data from new files. Once the programs become available, test data are read into the computer and processed against the files provided for testing. In most conversions a parallel run is conducted, where the new system runs simultaneously with the old system; this method, though costly, provides added assurance against errors in the candidate system.


TEST PLAN
A test plan is a service delivery agreement. It is quality assurance's way of communicating to the developer, the client, and the rest of the team what can be expected.

The key points of a test plan are:

Introduction:
Summarizes the key features and expectations of the software along with the testing approach.

Scope:
It includes a description of the test types.

Risks and assumptions:


This part should define risks to the testing phase, such as criteria that could suspend testing.

Testing schedules and cycles:


States when testing will be completed and the number of expected cycles.

Test resources:
Specifies testers and bug fixers.


Some special terms in Testing Fundamentals


Error:
The term Error is used in two different ways. It refers to the difference between the actual output of the software and the correct output; in this interpretation, error is essentially a measure of the difference between the actual and ideal output. Error is also used to refer to a human action that results in software containing a defect or fault.

Fault:
Fault is a condition that causes a system to fail in performing its required function. A fault is a basic reason for software malfunction and is synonymous with the commonly used term 'Bug'.

Failure:
Failure is the inability of a system or component to perform a required function according to its specifications. A software failure occurs if the behavior of the software is different from the specified behavior. Failure may be caused by functional or performance reasons.


Some of the commonly used Strategies for Testing are as follows:-

Unit testing
Module testing
Integration testing
System testing
Acceptance testing

Unit Testing :
The term 'Unit Testing' comprises the set of tests performed by an individual programmer prior to the integration of the unit into a larger system. The situation is illustrated as follows:

Coding & Debugging -> Unit Testing -> Integration Testing

A program unit is usually small enough, so the programmer who developed it can test it in great detail, and certainly in greater detail than will be possible when the unit is integrated into an evolving software product. In unit testing, the programs are tested separately, independent of each other. Since the check is done at the program level, it is also called Program Testing.


Module Testing :

A module encapsulates related components, so it can be tested without the other system modules.

Subsystem testing :
Subsystems may be independently designed and implemented. Common problems, such as subsystem interface mistakes, can be checked and concentrated on in this phase.

There are four categories of tests that a programmer will typically perform on a program unit:

Functional tests
Performance tests
Stress tests
Structure tests

Functional Test :
Functional test cases involve exercising the code with nominal input values for which the expected results are known, as well as with boundary values (minimum values, maximum values, and values on and just outside the functional boundaries) and special values.


Performance Test :
Performance testing determines the amount of execution time spent in various parts of the unit, program throughput, response time, and device utilization by the program unit. A certain amount of performance tuning may be done during testing, however, caution must be exercised to avoid expending too much effort on fine tuning of a program unit that contributes little to the overall performance of the entire system. Performance testing is most productive at the subsystem and system levels.

Stress Test :
Stress tests are those tests designed to intentionally break the unit. A great deal can be learned about the strengths and limitations of a program by examining the manner in which a program unit breaks.

Structure Test :
Structure tests are concerned with exercising the internal logic of a program and traversing particular execution paths. Some authors refer collectively to functional, performance and stress testing as black box testing, while structure testing is referred to as white box or glass box testing. The major activities in structural testing are deciding which paths to exercise, deriving test data to exercise those paths, determining the test coverage criterion to be used, and executing the test cases on selected modules and subsystems. This mix alleviates many of the problems encountered in pure top-down testing and retains the advantages of top-down integration at the subsystem and system level.


Automated tools used in integration testing include module drivers, test data generators, environment simulators, and a management facility to allow easy configuration and reconfiguration of system elements. Automated module drivers permit specification of test cases (both input and expected results) in a descriptive language. The driver tool then calls the routine using the specified test cases, compares actual with expected results, and reports discrepancies.

Some module drivers also provide program stubs for top-down testing. Test cases are written for the stub, and when the stub is invoked by the routine being tested, the drivers examine the input parameters to the stub and return the corresponding outputs to the routine. Automated test drivers include AUT, MTS, TEST MASTER and TPL.

Test data generators are of two varieties: those that generate files of random data values according to some predefined format, and those that generate test data for particular execution paths. In the latter category, symbolic executors such as ATTEST can sometimes be used to derive a set of test data that will force program execution to follow a particular control path.

Environment simulators are sometimes used during integration and acceptance testing to simulate the operating environment in which the software will function. Simulators are used in situations in which operation of the actual environment is impractical. Examples of simulators are PRIM (GAL75) for emulating machines that do not exist, and the Saturn Flight Program Simulator for simulating live flight test cases and measuring the coverage achieved when the test cases are exercised.


System Testing
System testing involves two kinds of activities:

Integration testing
Acceptance testing

Strategies for integrating software components into a functioning product include the bottom-up strategy, the top-down strategy, and the sandwich strategy. Careful planning and scheduling are required to ensure that modules will be available for integration into the evolving software product when needed. The integration strategy dictates the order in which modules must be available, and thus exerts a strong influence on the order in which modules are written, debugged, and unit tested.

Acceptance testing involves planning and execution of functional tests, performance tests, and stress tests to verify that the implemented system satisfies its requirements. Acceptance tests are typically performed by quality assurance and/or customer organizations.


CONCLUSIONS

Data compression is a topic of much importance and many applications. Methods of data compression have been studied for almost four decades. This report has provided an overview of data compression methods of general utility. The algorithms have been evaluated in terms of the amount of compression they provide, algorithm efficiency, and susceptibility to error. While algorithm efficiency and susceptibility to error are relatively independent of the characteristics of the source ensemble, the amount of compression achieved depends to a great extent upon the characteristics of the source.

Semantic dependent data compression techniques are special-purpose methods designed to exploit local redundancy or context information. A semantic dependent scheme can usually be viewed as a special case of one or more general-purpose algorithms. It should also be noted that algorithm HUFFMAN CODING & DECODING is a general-purpose technique which exploits locality of reference, a type of local redundancy.

Susceptibility to error is the main drawback of each of the algorithms presented here. Although channel errors are more devastating to adaptive algorithms than to static ones, it is possible for an error to propagate without limit even in the static case. Methods of limiting the effect of an error on the effectiveness of a data compression algorithm should be investigated.


FUTURE ENHANCEMENT & NEW DIRECTIONS

NEW DIRECTIONS:

Data compression is still very much an active research area. This section suggests possibilities for further study.

The discussion above illustrates the susceptibility to error of the codes presented in this report. Strategies for increasing the reliability of these codes while incurring only a moderate loss of efficiency would be of great value, and this area appears to be largely unexplored. Possible approaches include embedding the entire ensemble in an error-correcting code, or reserving one or more codewords to act as error flags. For Huffman encoding and decoding it may be necessary for receiver and sender to verify the current code mapping.

Another important research topic is the development of theoretical models for data compression which address the problem of local redundancy. Models based on Huffman coding may be exploited to take advantage of interaction between groups of symbols. Entropy tends to be overestimated when symbol interaction is not considered. Models which exploit relationships between source messages may achieve better compression than predicted by an entropy calculation based only upon symbol probabilities.


SCOPE FOR FUTURE WORK:

Since this system has been developed using object-oriented programming, there is every chance of reusing the code in other environments, even on different platforms. Its present features can also be enhanced by simple modifications to the code so as to reuse it in a changing scenario.

SCOPE OF FURTHER APPLICATION:

This application can be implemented easily, and its code can be reused as and when required. It can be updated in a next version, and new features can be added as and when required. There is flexibility in all the modules.


SOURCE CODE
HuffmanZip.java

import javax.swing.*; import java.io.*; import java.awt.*; import java.awt.event.*; public class HuffmanZip extends JFrame { private JProgressBar bar; private JButton enc,dec,center; private JLabel title; private JFileChooser choose; private File input1,input2; private Encoder encoder; private Decoder decoder; private ImageIcon icon;

public HuffmanZip() { super("Zip utility V1.1"); // Container con=getContentPane(); Container c=getContentPane(); enc=new JButton("Encode"); dec=new JButton("Decode"); center=new JButton(); title=new JLabel(" Zip Utility V1.1 "); choose=new JFileChooser(); icon=new ImageIcon("huff.jpg"); center.setIcon(icon);

enc.addActionListener( new ActionListener()



{ public void actionPerformed(ActionEvent e) { int f=choose.showOpenDialog(HuffmanZip.this); if (f==JFileChooser.APPROVE_OPTION) { input1=choose.getSelectedFile(); encoder=new Encoder(input1); HuffmanZip.this.setTitle("Compressing....."); encoder.encode(); JOptionPane.showMessageDialog(null,encoder.getSummary(),"Summar y",JOptionPane.INFORMATION_MESSAGE); HuffmanZip.this.setTitle("Zip utility v1.1"); } } } );

dec.addActionListener( new ActionListener() { public void actionPerformed(ActionEvent e) { int f=choose.showOpenDialog(HuffmanZip.this); if (f==JFileChooser.APPROVE_OPTION) { input2=choose.getSelectedFile(); decoder=new Decoder(input2); decoder.decode(); HuffmanZip.this.setTitle("Decompressing....."); JOptionPane.showMessageDialog(null,decoder.getSummary(),"Summar y",JOptionPane.INFORMATION_MESSAGE); HuffmanZip.this.setTitle("Zip utility v1.1"); } } }


);

//c.add(bar,BorderLayout.SOUTH); c.add(dec,BorderLayout.EAST); c.add(enc,BorderLayout.WEST); c.add(center,BorderLayout.CENTER); c.add(title,BorderLayout.NORTH); setSize(250,80); setVisible(true); } public static void main(String args[]) { HuffmanZip g=new HuffmanZip(); g.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE); }


Encoder.java
import java.io.*; import javax.swing.*; public class Encoder { private static String code[],summary=""; private int totalBytes=0; private int count=0; private File inputFile; private File outputFile ; private FileOutputStream C; private ObjectOutputStream outF; private BufferedOutputStream outf; private FileInputStream in1; private BufferedInputStream in; private boolean done=false;

public Encoder(File inputFile) { this.inputFile=inputFile; } public void encode() {

int freq[]=new int[256]; for(int i=0;i<256;i++) { freq[i]=0; }

// File inputFile = new File(JOptionPane.showInputDialog("Enter the input file name")); try { in1 = new FileInputStream(inputFile); in=new BufferedInputStream(in1); }

catch(Exception eee) { }

try { System.out.println(" "+in.available()); totalBytes=in.available(); int mycount=0; in.mark(totalBytes); while (mycount<totalBytes) { int a=in.read(); mycount++; freq[a]++; } in.reset(); } catch(IOException eofexc) { System.out.println("error"); }

HuffmanNode tree=new HuffmanNode(),one,two; PriorityQueue q=new PriorityQueue();

try { for(int j=0;j<256;j++) { // System.out.println("\n"+byteval[j]+" "+freq[j]+" prob "+probablity[j]+"int value"+toInt(byteval[j])); if (freq[j]>0) { HuffmanNode t=new HuffmanNode("dipu",freq[j],j,null,null,null);


q.insertM(t); } } //create tree.................................... while (q.sizeQ()>1) { one=q.removeFirst(); two=q.removeFirst(); int f1=one.getFreq(); int f2=two.getFreq(); if (f1>f2) { HuffmanNode t=new HuffmanNode(null,(f1+f2),0,two,one,null); one.up=t; two.up=t; q.insertM(t); } else { HuffmanNode t=new HuffmanNode(null,(f1+f2),0,one,two,null); one.up=t; two.up=t; q.insertM(t); } } tree =q.removeFirst(); } catch(Exception e) { System.out.println("Priority Queue error"); } code=new String[256]; for(int i=0;i<256;i++) code[i]=""; traverse(tree); Table rec=new Table(totalBytes,inputFile.getName());


//

for(int i=0;i<256;i++) { rec.push(freq[i]); if(freq[i]==0) continue; System.out.println(""+i+" "+code[i]+" "); } System.out.println("size of table"+rec.recSize());

//

//create tree ends...........................

//

System.out.println("\n total= "+totalBytes+"\n probablity="+d); int wrote=0,csize=0; int recordLast=0; try { outputFile = new File(inputFile.getName()+".hff"); C=new FileOutputStream(outputFile); outF=new ObjectOutputStream(C); outf=new BufferedOutputStream(C); outF.writeObject(rec); String outbyte="";

while (count<totalBytes) { outbyte+=code[in.read()]; count++; if (outbyte.length()>=8) { int k=toInt(outbyte.substring(0,8)); csize++; outf.write(k); outbyte=outbyte.substring(8); } }

while(outbyte.length()>8)


{ csize++; int k=toInt(outbyte.substring(0,8)); outf.write(k); outbyte=outbyte.substring(8); } if((recordLast=outbyte.length())>0) { while(outbyte.length()<8) outbyte+=0; outf.write(toInt(outbyte)); csize++; } outf.write(recordLast); outf.close(); } catch(Exception re) { System.out.println("Error in writng...."); } float ff=(float)csize/((float)totalBytes); System.out.println("Compression "+recordLast+" ratio"+csize+" "+(ff*100)+" %");

summary+="File name : "+ inputFile.getName(); summary+="\n"; summary+="File size : "+totalBytes+" bytes."; summary+="\n"; summary+="Compressed size : "+ csize+" bytes."; summary+="\n"; summary+="Compression ratio: "+(ff*100)+" %"; summary+="\n"; done=true; }

private void traverse(HuffmanNode n) {


//

if (n.lchild==null&&n.rchild==null) { HuffmanNode m=n; int arr[]=new int[20],p=0; while (true) { if (m.up.lchild==m) { arr[p]=0; } else { arr[p]=1; } p++; m=m.up; if(m.up==null) break; } for(int j=p-1;j>=0;j--) code[n.getValue()]+=arr[j]; } System.out.println("Debug3"); if(n.lchild!=null) traverse(n.lchild); if(n.rchild!=null) traverse(n.rchild); } private String toBinary(int b) { int arr[]=new int[8]; String s=""; for(int i=0;i<8;i++) { arr[i]=b%2; b=b/2; } for(int i=7;i>=0;i--) { s+=arr[i]; } return s; }


private int toInt(String b) { int output=0,wg=128; for(int i=0;i<8;i++) { output+=wg*Integer.parseInt(""+b.charAt(i)); wg/=2; } return output; } public int lengthOftask() { return totalBytes; } public int getCurrent() { return count; } public String getSummary() { String temp=summary; summary=""; return temp; } public boolean isDone() { return done; } }


Decoder.java
import java.io.*; import javax.swing.*; public class Decoder { private int totalBytes=0,mycount=0; private int freq[],arr=0; private String summary=""; private File inputFile; private Table table; private FileInputStream in1; private ObjectInputStream inF; private BufferedInputStream in; private File outputFile ; private FileOutputStream outf; public Decoder(File file) { inputFile=file; } public void decode()//throws Exception {

freq=new int[256]; for(int i=0;i<256;i++) { freq[i]=0; }

// File inputFile = new File(JOptionPane.showInputDialog("Enter the input File name")); try { in1 = new FileInputStream(inputFile); inF=new ObjectInputStream(in1); in=new BufferedInputStream(in1);


//

int arr=0; table=(Table)(inF.readObject());

outputFile = new File(table.fileName()); outf=new FileOutputStream(outputFile); summary+="File name : "+ table.fileName(); summary+="\n"; } catch(Exception exc) { System.out.println("Error creating file"); JOptionPane.showMessageDialog(null,"Error"+"\nNot a valid < hff > format file.","Summary",JOptionPane.INFORMATION_MESSAGE); System.exit(0); }

HuffmanNode tree=new HuffmanNode(),one,two; PriorityQueue q=new PriorityQueue(); try { //creating priority queue................. for(int j=0;j<256;j++) { int r =table.pop(); // System.out.println("Size of table "+r+" "+j); if (r>0) { HuffmanNode t=new HuffmanNode("dipu",r,j,null,null,null); q.insertM(t); } } //create tree.................................... while (q.sizeQ()>1) {


one=q.removeFirst(); two=q.removeFirst(); int f1=one.getFreq(); int f2=two.getFreq(); if (f1>f2) { HuffmanNode t=new HuffmanNode(null,(f1+f2),0,two,one,null); one.up=t; two.up=t; q.insertM(t); } else { HuffmanNode t=new HuffmanNode(null,(f1+f2),0,one,two,null); one.up=t; two.up=t; q.insertM(t); } } tree =q.removeFirst(); } catch(Exception exc) { System.out.println("Priority queue exception"); }

String s=""; try { mycount=in.available(); while (totalBytes<mycount) { arr=in.read(); s+=toBinary(arr); while (s.length()>32) { for(int a=0;a<32;a++) {


int wr=getCode(tree,s.substring(0,a+1)); if(wr==-1)continue; else { outf.write(wr); s=s.substring(a+1); break; }

} } totalBytes++; } s=s.substring(0,(s.length()-8)); s=s.substring(0,(s.length()-8+arr));

int counter; while (s.length()>0) { if(s.length()>16)counter=16; else counter=s.length(); for(int a=0;a<counter;a++) { int wr=getCode(tree,s.substring(0,a+1)); if(wr==-1)continue; else { outf.write(wr); s=s.substring(a+1); break; } } } outf.close(); } catch(IOException eofexc) { System.out.println("IO error");


summary+="Compressed size : "+ mycount+" bytes."; summary+="\n"; summary+="Size after decompressed : "+table.originalSize()+" bytes."; summary+="\n"; } private int getCode(HuffmanNode node,String decode) { while (true) { if (decode.charAt(0)=='0') { node=node.lchild; } else { node=node.rchild; } if (node.lchild==null&&node.rchild==null) { return node.getValue(); } if(decode.length()==1)break; decode=decode.substring(1); } return -1; }

public String toBinary(int b) { int arr[]=new int[8]; String s=""; for(int i=0;i<8;i++) { arr[i]=b%2; b=b/2;


} for(int i=7;i>=0;i--) { s+=arr[i]; } return s; } public int toInt(String b) { int output=0,wg=128; for(int i=0;i<8;i++) { output+=wg*Integer.parseInt(""+b.charAt(i)); wg/=2; } return output; } public int getCurrent() { return totalBytes; } public int lengthOftask() { return mycount; } public String getSummary() { return summary; } }


DLnode.java
public class DLNode {
    private DLNode next, prev;
    private HuffmanNode elem;

    public DLNode() {
        next = null;
        prev = null;
        elem = null;
    }

    public DLNode(DLNode next, DLNode prev, HuffmanNode elem) {
        this.next = next;
        this.prev = prev;
        this.elem = elem;
    }

    public DLNode getNext() { return next; }
    public DLNode getPrev() { return prev; }
    public void setNext(DLNode n) { next = n; }
    public void setPrev(DLNode n) { prev = n; }
    public void setElement(HuffmanNode o) { elem = o; }
    public HuffmanNode getElement() { return elem; }
}

HuffmanNode.java
import java.io.*;

public class HuffmanNode implements Serializable {
    public HuffmanNode rchild, lchild, up;
    private String code;
    private int freq;
    private int value;

    public HuffmanNode(String bstring, int freq, int value,
                       HuffmanNode lchild, HuffmanNode rchild, HuffmanNode up) {
        code = bstring;
        this.freq = freq;
        this.value = value;
        this.lchild = lchild;
        this.rchild = rchild;
        this.up = up;
    }

    public HuffmanNode() {
        code = "";
        freq = 0;
        value = 0;
        lchild = null;
        rchild = null;
    }

    public int getFreq() { return freq; }
    public int getValue() { return value; }
    public String getCode() { return code; }
}


PriorityQueue.java
public class PriorityQueue { private DLNode head,tail; private int size=0; private int capacity; private HuffmanNode obj[]; public PriorityQueue(int cap) { head=new DLNode(); tail=new DLNode(); head.setNext(tail); tail.setPrev(head); capacity=cap; obj=new HuffmanNode[capacity]; } public PriorityQueue() { head=new DLNode(); tail=new DLNode(); head.setNext(tail); tail.setPrev(head); capacity=1000; obj=new HuffmanNode[capacity]; } public void insertM(HuffmanNode o)throws Exception { if (size==capacity) throw new Exception("Queue is full"); if (head.getNext()==tail) { DLNode d=new DLNode(tail,head,o); head.setNext(d); tail.setPrev(d); } else { DLNode n=head.getNext(); HuffmanNode CurrenMax=null; int key=o.getFreq(); while (true) {


if (n.getElement().getFreq()>key) { DLNode second=n.getPrev(); DLNode huf=new DLNode(n,second,o); second.setNext(huf); n.setPrev(huf); break; } if (n.getNext()==tail) { DLNode huf=new DLNode(tail,n,o); n.setNext(huf); tail.setPrev(huf); break; } n=n.getNext(); } } size++; }

public HuffmanNode removeFirst() throws Exception { if(isEmpty()) throw new Exception("Queue is empty"); HuffmanNode o=head.getNext().getElement(); DLNode sec=head.getNext().getNext(); head.setNext(sec); sec.setPrev(head); size--; return o; } public HuffmanNode removeLast() throws Exception { if(isEmpty()) throw new Exception("Queue is empty"); DLNode d=tail.getPrev(); HuffmanNode o=tail.getPrev().getElement(); tail.setPrev(d.getPrev()); d.getPrev().setNext(tail); size--; return o;


} public boolean isEmpty() { if(size==0)return true; return false; } public int sizeQ() { return size; } public HuffmanNode first()throws Exception { if(isEmpty()) throw new Exception("Stack is empty"); return head.getNext().getElement(); } public HuffmanNode Last()throws Exception { if(isEmpty()) throw new Exception("Stack is empty"); return tail.getPrev().getElement(); } }


Table.java
import java.io.*;

class Table implements Serializable {
    private String FileName;
    private int fileSize, arr[], size = 0, front = 0;

    public Table(int fileSize, String FileName) {
        arr = new int[256];
        this.FileName = FileName;
        this.fileSize = fileSize;
    }

    public void push(int c) {
        if (size > 256) System.out.println("Error in record");
        arr[size] = c;
        size++;
    }

    public int originalSize() { return fileSize; }

    public int pop() {
        if (size < 1) System.out.println("Error in record");
        int rt = arr[front++];
        size--;
        return rt;
    }

    public String fileName() { return FileName; }
    public int recSize() { return size; }
}


REFERENCES

1. Data Compression - Khalid Sayood
2. Data Compression - Mark Nelson
3. Foundations of I.T. - D.S. Yadav
4. Complete Reference Java - Herbert Schildt
5. OOPS in Java - E. Balagurusamy
6. Java Programming - Krishnamoorthy
7. Software Engineering - Pressman
8. Software Engineering - Pankaj Jalote

WEBSITES:-

1. http://www.google.com
2. http://www.wikipedia.org
3. http://www.nist.gov

ENCLOSED: Soft copy of the project on CD.

