CS 223 Out-lab 6
Week of
3/20

Due: to TA by 4/10 (extension granted)

Huffman Codes for File Compression

As you are probably well aware, data compression is a very common task.  For this out-lab your goal is to write your own file compression software based on Huffman codes.  You will need to combine two data structures for this assignment; priority queues and binary trees.  Your program should have two main functions (user selectable via an initial menu): compress file and decompress file:

Compress File

See the file ByteTest.java for some code to convert between string representations of bytes and char representations.

Update: there is an easier way:
char ch = (char) Integer.parseInt("11111111", 2); // converts a string to a char
String str = Integer.toBinaryString(ch); // converts char back a string

If a user selects 'compress', you should prompt them to enter a file name of a text file (e.g. "test.txt").  Your program should then read this file and create a Huffman code based on the character frequencies in the file (the algorithm was covered in class and is described in both textbooks).  You should then encode the file with the Huffman code that you just created.  This will create a long binary string.  You should then parse this binary string into a byte array (8 bits per byte).  Then write this byte array to a binary file (e.g. "test.huff").  You can write a binary file using the FileOutputStream class. 

Secondly, you should crate another file that stores the Huffman code that you just used to the encoding.  Starting with the code that you wrote for in-lab 7, modify this to write out a simple text file description of the tree.  Assign an index number (starting with 0 for the root) to each node in the tree.  Then do a traversal of the tree.  Create a file that looks like this:
<number of tree nodes>
<index of first internal node> I <index of left child> <index of right child>
<index of second internal node> I <index of left child> <index of right child>
...
<index of first leaf node> L <character encoded in this leaf>
...

Example code tree file:
5
0 I 1 2
1 L a
2 I 3 4
3 L b
4 L c


This file represents the following Huffman code for the string alphabet {a,b,c}
a    0
b    10
c    11


You should save the tree in a text file (e.g. "test.tree").

Decompress File

If the user selects 'decompress' then you should prompt them for a filename (say 'test').  You should then verify that files named 'test.huff' and 'test.tree' exist.  If they do then proceed to decompress the file.  You will first need to read 'test.tree' and reconstruct the Huffman code tree.  Once this is done, read the binary file 'test.huff' and, using the Huffman tree, decode the file.  As you decode each character, write that character to the decoded file 'test.decoded.txt'.  You should verify that the original file 'test.txt' and 'test.decoded.txt' are the same.

One tricky part: Support that your encoded binary string is not divisible by 8.  This means that the final byte will not been complete.  You should pad this some at the end to make up a complete byte.  You could pad with 0s but you need to make sure that the decoder knows that these 0s are just the pad and not part of the encoding string.  One way to handle this is to make the first byte in the file represent the length of the final pad rather than the start of the encoding string.  Then just delete that many 0s from the end before decoding the string.

Extra credit:

  1. Build a simple GUI for this application.
  2. Store both the tree and the encoding in the same file (so that you only need one file for the compressed version, not two).

What to Turn In

  • a printed copy of all source code.  Ensure the source code is documented.
  • a printout of several test runs compressing and decompressing some text files (try compressing this for fun)
  • How much space did you save in each case?