CSE 13SAssignment 7Lempel-Ziv CompressionCSE 13S – Winter 2021Due: March 14th at 11:59 pm PSTEveryday, we create 2.5 quintillion bytes of data – so much that 90% of the data inthe world today has been created in the last two years alone.—IBM report on Big Data (2011)1 IntroductionCompressing data means reducing the number of bits needed to represent it. Compressed data, due to its reducedsize, is a lot faster to transfer and less expensive to store, freeing up storage and increasing network capacity. Algorithmsthat perform data compression are known as data compression algorithms. Data compression algorithmsare divided into two categories: lossy and lossless. Lossy compression algorithms compress more than losslesscompression algorithms, but at the cost of losing some information, which can’t be recovered. Lossy compressionalgorithms are typically used for audio and video files, where a loss in quality is tolerable and often not noticeable.Lossless compression algorithms on the other hand don’t compress as much, but do not lose any data, meaningcompressed information can be exactly reconstructed back into its uncompressed form. Lossless compressionalgorithms are used for data that must maintain integrity, such as binaries, text documents, and source code.2 Lempel-Ziv CompressionAbraham Lempel and Jacob Ziv published papers for two lossless compression algorithms, LZ77 and LZ78, publishedin 1977 and 1978, respectively. The core idea behind both of these compression algorithms is to representrepeated patterns in data with using pairs which are each comprised of a code and a symbol. A code is an unsigned16-bit integer and a symbol is a 8-bit ASCII character.We will explain this algorithm by working through an example. Assume we have some string “abab”. Assume wealso have a dictionary where the key is a prefix, also known as a word, and the value is a code that we can utilizefor fast look-ups. Since codes are 16-bit unsigned integers, the dictionary is limited to 216 −1 possible codes. Whydon’t we have 216 codes? Because we will be reserving a code to be used as a stop code which indicates the end ofour encoded data. We will define the maximal usable code as MAX_CODE, which has the value 216−1. The dictionaryis initialized with the empty word, or a string of zero length, at the index EMPTY_CODE, which is a macro for 1. Wewill specify the first index in which a new word can be added to as the macro START_CODE, which has the value 2.We now consider the first character: ‘a’. Since this is the first character, we know that the string of all previouslyseen characters up to this point must be the empty word; we haven’t considered any characters prior to this point.We then append the current character ‘a’ to the empty word, which yields the word “a”. We check the dictionaryin vain to see if we’ve encountered this word before. Since we haven’t yet seen this word, we will add this first word© 2021 Darrell Long 1Compression ExampleInitialized dictionaryMAX_CODEto the dictionary and assign it the first available code of 2, or START_CODE. We set our previously seen word to bethe empty word and output the pair (EMPTY_CODE, ‘a’).We continue on, now considering the second character: ‘b’. We append this character to our previously seen word,which yields “b” as our current word. We again check the dictionary in vain to see if we’ve encountered this wordbefore. Since we haven’t, we add this word to the dictionary and assign it the next available code of 3. We set thepreviously seen word to the empty word and output the pair (EMPTY_CODE, ‘b’).The next character we see is ‘a’. We append this character to our previously seen word, which yields “a” as ourcurrent word. We check the dictionary to see if we’ve encountered this word before. Indeed we have. We set thepreviously seen word to “a” and proceed to the next character without any further output.We read in the last character, ‘b’. We append this character to our previously seen word, which yields “ab”. Clearly,this word isn’t in the dictionary, so pair comprised of the current symbol, ‘b’, and the code of the previously we addit and assign it the next available code of 4. What should we output? The seen word. Since our previously seenword at this stage was “a”, this code will be 2. Thus, we output (2, ‘b’). To finish off compression, we output thefinal pair (STOP_CODE, 0). As you can imagine, this pair signals the end of compressed data. The symbol in thisfinal pair isn’t used and the value is of no significance. The macro STOP_CODE has the value of 0.Of course, a compression algorithm is useless without a corresponding decompression algorithm. Assume we’rereading in the output from the preceding compression example. The output was comprised of the following pairs,in order: (EMPTY_CODE, ‘a’), (EMPTY_CODE, ‘b’), (2, ‘b’), (STOP_CODE, 0). Similar to the compression algorithm, thedecompression algorithm initializes a dictionary containing only the empty word. The catch is that the key andvalue for the decompressing dictionary is swapped; each key is instead a code and each value a word. As youmight imagine, the decompression algorithm is the inverse of the compression algorithm, and thus from the outputpairs, the decompression algorithm will recreate the same dictionary used during compression to output thedecompressed data.We consider the first pair: (EMPTY_CODE, ‘a’). We append the symbol ‘a’ to the word denoted by EMPTY_CODE,which is the empty word. Thus, the current word is “a”. We add this word to the dictionary and assign it the nextavailable code of 2, then output the current word “a”.We consider the next pair: (EMPTY_CODE, ‘b’). We append the symbol ‘b’ to the word denoted by EMPTY_CODE,which is the empty word. Thus, the current word is “b”. We add this word to the dictionary and assign it the nextavailable code of 3, then output the current word “b”.We now consider the next pair: (2, ‘b’). We append the symbol ‘b’ to the word denoted by the code 2, which wepreviously added to our dictionary. The word denoted by this code is “a”, whence we obtain our current word of“ab”. We add this word to the dictionary and assign it the next available code of 4, then output the current word“ab”. Finally, we read in the last pair: (STOP_CODE, 0). Since the code is STOP_CODE, we know that we have finisheddecompression.If the basic idea behind the compression and decompression algorithms do not immediately make sense to you,MAX_CODEor if you desire a more visual representation of how the algorithms work, make sure to attend section and get helpearly! Things will not get easier as time goes on.3 Your TaskYour task is to implement two programs called encode and decode which perform LZ78 compression and decompression,respectively. The requirements for your programs are as follows:
...