Sunteți pe pagina 1din 12

Data Compression

The idea of data compression is maximizing storage efficiency by minimizing the distribution of data,
in this case of computer science, binary distribution. There are two methods to do data compression:
Lossless which avoid data losses in compressed result and Lossy which tolerate data losses within
predicted rate. An important fact in data compression is we can not built an universal data
compression.
We used bitstream (BinaryStdIn) rather than input stream (In) because in the case of compression
algorithm we need to read each bit rater than each byte. BinaryStdIn.java, BinaryStdOut.java,
BinaryDump.java, etc can be found at https://algs4.cs.princeton.edu/55compression/.

Run-length encodings
The idea is we can saving certain amounts of storage by maintenance data frequencies. For example,
consider the following 40-bit string:
0000000000000001111111000000011111111111
The string consists run of 15 0s, then 7s 1s. Then 7 0s, then 11 1s, so we can encode the bit-string with
the number 15, 7, 7, and 11. Since highest frequency is 15, then we can encode it into 4 bits encoding:
1111011101111011
the compressing ration is 16/40 = 40%.

Bitmaps
Bitmaps is an good example of run-length encoding which organize bit-stream to represent images and
documents. An image below is an example an bitmaps of letter q in size 32x48 pixels. We can easily
compress an bitmaps image by making frequency groups. For example, a letter q as shown in the image
has 32 0s, 32 0s, 15 0s, 7 1s, 10 0s, etc.
Image below which is low resolution image can be compressed wit ratio 74% using run-length
encoding. If we increase the resolution to 64 by 96, the ratio drops to 37%. That is a reason why run-
length encoding is very effective scheme to compress bitmaps stream.
Implementing Run-length length encoding can be very easy (we just need a loop!). The Run-length
encoding can also easily decompress using expand method that examine each compressed segments
into original bits.
Below implementation of Run-length encoding in Java:
public class Run-length {
private static final int R = 256;
private static final int LG_R = 8;

private RunLength() {}

// Decompress
public statuc void expand() {
boolean b = false;
while (!BinaryStdIn.isEmpty()) {
int run = BinaryStdIn.readInt(LG_R);
for (int i = 0; i = run; i++)
BinaryStdOut.write(b);
b = !b;
}
BinaryStdOut.close()
}

public static void compress() {


char run = 0;
boolean old = false;
while (!BinaryStdIn.isEmpty()) {
boolean b = BinaryStdIn.readBoolean();
if (b != old) {
BinaryStdOut.write(run, LG_R);
run = 1;
old = !old;
} else {
if (run == R-1) {
BinaryStdOut.write(run, LG_R);
run = 0;
BinaryStdOut.write(run, LG_R);
}
run++;
}
}
BinaryStdOut.write(run, LG_R);
BinaryStdOut.close();
}

public static void main(String[] args) {


if (args[0].equals(“-”)) compress();
else if (args[0].equals(“+”)) expand();
else
throw new IllegalArgumentException(“Illegal command line argument”);
}
}
Huffman compression
This is widely used compressing technique that used frequency information to save space. The idea is
we can used fewer bits (less than 8 bits) to represent high frequency ASCII character. Huffman
compression will produce prefix-free code that can be uniquely decodable, so we can avoid miss-
interpretation. Huffman compression encode all codewords by constructing trie, also known as
huffman encoding. But, rather than construct a trie from root to bottom, we construct it from bottom to
root. We keep the characters to be encoded in leaves and maintain the frequency instance variable in
each node that represent the frequency of all character in the subtree rooted at that node. Below step by
step how to construct huffman code using trie:
1. Create a forest of 1-nodes trees (leaves) ordered by its frequency.
2. Find the two nodes with the smallest frequency, then create a new node which is a parent of
thos two nodes which has frequency equal to sum of child nodes.
3. Repeat step 2 until all node construct a tree.
An image below illustrate how to construct a huffman code of word “it was the best of times it was the
worst of times”.
To represent binary, we can insert each 0 on the left side and 1 on the right side. A complete codeword
table of Huffman encoding illustrate in an image below:

Codeword table produced by Huffman encoding is optimal because it is equal to product of frequency
times word length.
Huffman encoding is two-pass algorithm because read input stream twice while encode and compress.
Now to do compression, we need following steps:
1. Read the input.
2. Tabulate the frequency of occurrence of each value in the input.
3. Build the Huffman encoding trie corresponding to those frequencies.
4. Build the corresponding codeword table, to associate a bitstring with each char value in the
input.
5. Write the trie, encoded as a bitstring.
6. Write a count of character in the input, encoded as a bitstring.
7. Use the codeword table to write the codeword for each input character.
Then to do decompression or expand, we need following steps:
1. Read the trie.
2. Read the count if character to be decoded.
3. Use the trie to decode the bitstream.
Below a full implementation of Huffman encoding, compression and decompression in java:
import edu.princeton.cs.algs4.MinPQ;

public class Huffman


{
private static final int R = 256;

private Huffman() {}

private static class Node implements Comparable<Node> {


private final char ch;
private final int freq;
private final Node left, right;

Node(char ch, int freq, Node left, Node right) {


this.ch = ch;
this.freq = freq;
this.left = left;
this.right = right;
}

private boolean isLeaf() {


assert ((left == null) && (right == null)) || ((left != null) && (right
!= null));
return (left == null) && (right == null);
}
public int compareTo(Node that) {
return this.freq – that.freq;
}

public static void compress() {


String s = BinaryStdIn.readString();
char[] input = s.toCharArray();

int[] freq = new int[R];


for (int i = 0; i < input.length; i++)
freq[input[i]]++;

Node root = buildTrie(freq);

String[] st = new String[R];


buildCode(st, root, “”);

writeTrie(root);

BinaryStdOut.write(input.length);

for (int i = 0; i < input.length; i++) {


String code = st[input[i]];
for (int j = 0; j < code.length(); j++) {
if (code.charAt(j) == ‘0’) {
BinaryStdOut.write(false);
}
else if (code.charAt(j) == ‘1’) {
BinaryStdOut.write(true);
}
else throw new IllegalStateException(“Illegal state”);
}
}

BinaryStdOut.close();
}

private static Node buildTrie(int[] freq) {


MinPQ<Node> pq = new MinPQ<Node>();
for (char i = 0; i < R; i++)
if (freq[i] > 0)
pq.insert(new Node(i, freq[i], null, null));

if (pq.size() == 1) {
if (freq[‘\0’] == 0) pq.insert(new Node(‘\0’, 0, null, null));
else pq.insert(new Node(‘\1’, 0, null, null));
}
while (pq.size() > 1) {
Node left = pq.delMin();
Node right = pq.delMin();
Node parent = new Node(‘\0’, left.freq + right.freq, left,
right);
pq.insert(parent);
}
return pq.delMin();
}

private static void writeTrie(Node x) {


if (x.isLeaf()) {
BinaryStdOut.write(true);
BinaryStdOut.write(x.ch, 8);
return;
}
BinaryStdOut.write(false);
writeTrie(x.left);
writeTrie(x.right);
}

private static void buildCode(String[] st, Node x, String s) {


if (!x.isLeaf()) {
buildCode(st, x.left, s + ‘0’);
buildCode(st, x.right, s + ‘1’);
}
else {
st[x.ch] = s;
}
}

public static void expand() {


Node root = readTrie();
int length = BinaryStdIn.readInt();

for (int i = 0; i < length; i++) {


Node x = root;
while(!x.isLeaf()) {
boolean bit = BinaryStdIn.readBoolean();
if (bit) x = x.right;
else x = x.left;
}
BinaryStdOut.write(x.ch, 8);
}
BinaryStdOut.close();
}

private static Node readTrie() {


boolean isLeaf = BinaryStdIn.readBoolean();
if (isLeaf) {
return new Node(BinaryStdIn.readChar(), -1, null, null);
} else {
return new Node(‘\0’, -1, readTrie(), readTrie());
}
}

public static void main(String[] args) {


if (args[0].equals(“-”)) compress();
else if (args[0].equals(“+”)) expand();
else throw new IllegalArgumentException(“Illegal command line”);
}
}
}

LZW Compression
LZW try to improve encoding performance by creating table lookup for next prefix of character. A
encoding technique is very simple but powerful. Step by step LZW compression explained below:
1. Use standard encoding table to encode each character, for example ASCII.
2. Decide a stop code use to indicate delimiter or space. For example 80.
3. Allocate a symbol table (key - value) to store prefix encoding.
4. Scan each input character until we found longest prefix.
5. If we found new longest prefix, then stored in into symbol table with 8-bit value. For example,
prefix AB with code 81.
6. Else we encode each discovered longest prefix with codeword value in the symbol table.
7. Repeat step 4 until all character have been scanned.
An image below illustrate how to compress text ABRACADABRA using LZW compression
For efficiency, we store table of codewords as a Trie Symbol Table (TST).

If we know the stop code used for compression, it is easy to decompress the LZW compressed bits. The
step-by-step to do decompressing or expand is similar to compressing scheme. But, there is a tricky
technique in the case to avoid stuck while constructing symbol table. For example, in the case to
decompress text ABABABA with compressed bitstream 41 42 81 83 80, we need to ensure that
codeword 83 is a prefix ABA. The decompressing process of text ABABABA illustrated in an image
below.
The implementation of LZW compression over 8-bit extended ASCII alphabet with 12-bit codewords
using java will be looked like this:
public class LZW {
private static final int R = 256;
private static final int L = 4096;
private static final int W = 12;

private LZW() {}

public static void compress() {


String input = BinaryStdIn.readString();
TST<Integer> st = new TST<Integer>();
for (int i = 0; i < R; i++)
st.put(“” + (char) i, I);
int code = R+1;

while (input.length() > 0) {


String s = st.longestPrefixOf(input);
BinaryStdOut.write(st.get(s), W);
int t = s.length();
if (t < input.length() && code < L)
st.put(input.substring(0, t + 1), code++);
input = input.substring(t);
}
BinaryStdOut.write(R, W);
BinaryStdOut.close();
}

public static void expand() {


String[] st = new String[L];
int i;

for (i = 0; i < R; i++)


st[i] = “” + (char) i;
st[i++] = “”;

int codeword = BinaryStdIn.readInt(W);


if (codeword == R) return;
String val = st[codeword];

while (true) {
BinaryStdOut.write(val);
codeword = BinaryStdIn.readint(W);
if (codeword == R) break;
String s = st[codeword];
if (i == codeword) s = val + val.charAt(0); // Special case
if (i < L) st[i++] = val + s.charAt(0);
val = s;
}
BinaryStdOut.close();
}

public static void main(String[] args) {


if (args[0].equals(“-”)) compress();
else if (args[0].equals(“+”)) expand();
else thr ow new IllegalArgumentException(“Illegal command line”);
}
}

S-ar putea să vă placă și