Programming Assignment 1 (15%) CSI2110/CSI2510
Due: November 6, 2021, 11:59PM (suggestion: complete by Oct 27)
Late assignment policy: assignments between 1 minute and 24 hours late are accepted with a 30% penalty; no assignments are accepted more than 24 hours late.
Huffman Code for File Compression
Problem Description
Lossless compression is a type of data compression that allows the original data to be recovered from the compressed data. Huffman encoding is a compression technique that reduces the number of bits needed to store a message based on the idea that more frequent letters should have a shorter bit representation and less frequent ones should have a longer bit representation. Text compression is useful when we wish to reduce bandwidth for digital communications, so as to minimize the time required to transmit a text. Text compression is also used to store large documents more efficiently, for instance allowing your hard drive to contain as many documents as possible.
In this assignment, you will implement the Huffman encoding (compression) and decoding (decompression) algorithms.
This assignment will use binary trees to store the Huffman tree and it will use priority queues as part of the greedy algorithm that builds the Huffman tree.
1 How the Huffman code works
Suppose we have 4 possible letters that can appear in a message, a string, or a file: A, C, T, G. We could encode these letters with a fixed-length code using 2 bits for each letter:
letter                 A    C    G    T
fixed-length encoding  00   01   10   11
Using a fixed-length encoding, the message AACGTAAATAATGAAC, which has 16 letters, can be encoded with 32 bits as 00000110110000001100001110000001
Huffman encoding would instead count how many times each letter appears in the text and choose a variable-length code based on the letter frequencies:
letter             A    C    G    T
frequency in text  9    2    2    3
Huffman code       1    010  011  00
With Huffman encoding, the same message can be encoded with only 27 bits (saving about 15% of the space):
110100110011100110001111010
Now imagine a more drastic scenario where a text message uses a standard 8-bit encoding for each character (as in the UTF-8 character encoding), able to support 256 different characters. Suppose that our specific message is the same as in the example above (AACGTAAATAATGAAC), so that the 4 characters have the same frequencies as shown in the table above and the other 252 characters have frequency 0. The message would then be encoded in UTF-8 using 16*8=128 bits, or 16 bytes. Comparing this with our Huffman encoding above, which takes 27 bits (less than 4 bytes), we now save about 75% of the space.
In another scenario, suppose you want to store the entire human genome, composed of a sequence of 3 billion nucleotide base pairs. This can be represented by a sequence of 3 billion characters, each being one of A (adenine), T (thymine), G (guanine), or C (cytosine), which could be stored using 3 billion bytes with UTF-8 encoding, about 3GB. The Huffman encoding for such a file would depend on the relative frequencies of these four letters, but assuming a frequency of about 25% for each of the 4 letters (and, of course, the absence of the other 252 characters), Huffman would encode each letter with 2 bits, yielding a compressed file of 750MB.
How to build the Huffman code from the letters in a text and their frequencies?
The Huffman code is always a prefix code, that is, no codeword is a prefix of any other codeword. This guarantees that decoding is unambiguous. When decoding our example
110100110011100110001111010
we have no doubt that the first letter is an A, since no other codeword starts with a 1; similarly, we decode the next A. Then the next bits are read one by one until we conclude unequivocally that the next codeword can only be 010, which is a C. Try continuing this example to see how the original text can be decoded using the Huffman code given in the table above.
Huffman’s algorithm produces an optimal variable-length prefix code for a given string X, using the character frequencies in X. It is based on the construction of a binary tree T that represents the code.
Here we quote the textbook by Goodrich et al., 6th edition, page 595:
“Each edge in T represents a bit in a codeword, with an edge to the left child representing a 0 and an edge to the right child representing a 1. Each leaf v is associated with a specific character and the codeword for that character is defined by the sequence of bits associated with the edges in the path from the root of T to v. Each leaf v has a frequency, f(v), which is simply the frequency in the string X of the character associated with v. In addition, we give each internal node v in T a frequency, f(v), that is the sum of the