Data Compression

Huffman codes

Section 4.4

 

The problem of data compression

•      The problem is to find an efficient way to encode a data file

–    We will look at text files

•      To represent a file for the computer, we use a binary code

–    In this type of code, each character is represented by a unique binary string, called the codeword

–    Remember, a codeword is the bits used to encode a single character

•      There are two types of codes used

–    Fixed-length binary codes

–    Variable-length binary codes

 

Fixed-length binary codes

•      A fixed-length code represents each character using the same number of bits

•      How many bits does it take to encode the 26 letters of the alphabet?

•      ASCII -- Fixed-length code

–    it takes log2 128 = 7 bits to code the 128 symbols

–    if all characters were used equally often, this would be the most space-efficient way

–    but they are not used equally often

–    We would like to find a way that uses fewer bits to encode a text file
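The bit counts above can be checked directly: a fixed-length code for k symbols needs ceil(log2 k) bits per character. A minimal sketch in Python (the function name is ours, chosen for illustration):

```python
import math

def fixed_length_bits(num_symbols):
    """Bits needed to give each of num_symbols a distinct fixed-length codeword."""
    return math.ceil(math.log2(num_symbols))

print(fixed_length_bits(26))   # the 26 letters of the alphabet -> 5 bits
print(fixed_length_bits(128))  # the 128 ASCII symbols -> 7 bits
```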

 

Variable-length binary codes

•      We can achieve a more efficient encoding using a variable-length code

•      We represent different characters using different numbers of bits

•      Example: a=0, b=1, c=00, d=01, e=10, f=11, g=000

•      Do you see a problem here?

–    It is ambiguous because some of the codewords are prefixes of other codewords

•      For decoding to work, we must use only codewords that are not a prefix of any other codeword

•      Such codes are called prefix codes

–    Perhaps “prefix-free codes” would be a better name, but “prefix codes” is standard in the literature
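The ambiguity can be demonstrated by brute force: with the example code above, a bit string such as 001 has more than one valid decoding. A small sketch (the helper is ours, for illustration only):

```python
# The non-prefix code from the example slide: some codewords
# (0, 1) are prefixes of others (00, 01, ...), so decoding is ambiguous.
CODE = {"a": "0", "b": "1", "c": "00", "d": "01", "e": "10", "f": "11", "g": "000"}

def decodings(bits):
    """Return every way to split `bits` into codewords of CODE."""
    if bits == "":
        return [""]
    results = []
    for ch, word in CODE.items():
        if bits.startswith(word):
            results += [ch + rest for rest in decodings(bits[len(word):])]
    return results

print(sorted(decodings("001")))  # -> ['aab', 'ad', 'cb']: three different readings
```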

 

Prefix-free codes

•      One way to guarantee prefix-free codewords is to create a binary tree

–    The left branch represents a 0 and the right branch is a 1

–    Each character is a leaf in the tree

–    The encoding for that leaf (character) is the path to the leaf

–    Look at overhead

–    Name the codeword for each character

•      But there is a possibility that the file using this encoding system is not smaller than if we just used three bits for each letter

•      If the frequencies of d, e, f, g are greater than the frequencies of a, b, c, this would happen

•      We are looking for a way to encode that gives us the smallest possible file. (i.e. the optimal tree)

•      If frequently used characters are given a shorter codeword (fewer bits) than infrequently used characters it should be more efficient
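Decoding with such a tree is a simple walk from the root: a 0 goes left, a 1 goes right, and reaching a leaf emits that character. A sketch with a hypothetical prefix code (nested tuples are internal nodes, strings are leaves; this tree is ours, not the one from the overhead):

```python
# Hypothetical prefix-code tree giving a=00, b=01, c=10, d=110, e=111.
TREE = (("a", "b"), ("c", ("d", "e")))

def decode(bits, tree):
    """Walk the tree one bit at a time; emit a character at each leaf."""
    out, node = [], tree
    for bit in bits:
        node = node[0] if bit == "0" else node[1]
        if isinstance(node, str):          # reached a leaf
            out.append(node)
            node = tree                    # restart at the root
    return "".join(out)

print(decode("0010110", TREE))  # -> "acd"
```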

 

Finding an optimal code to compress text

•      Suppose we have a file M with 100,000 characters

•      If we use fixed-length coding, it will take 300,000 bits to encode it (3 bits per character, enough for up to 8 distinct characters)

•      To be more efficient, we will need to find the frequency of each character used in the file

•      Once we have calculated the frequency of each character we put the frequency-character pair in a minimum priority queue

•      Then we use an algorithm developed by Huffman to build an optimal binary tree

•      The leaves will be the characters; the path to a leaf will be its codeword

•      Using Huffman’s algorithm to determine the codewords for each letter, the size of the file can be reduced to 215,000 bits

–    The first example will show this

 

Understanding the pseudocode

•           Technically, the priority queue and the tree both consist of nodes with four components:

–        character,

–        frequency,

–        leftChild,

–        rightChild

•           But we do not need all this information in every part of our data structures

–        So, although in the code every node has the same structure, in our pseudocode drawings we omit the components a node does not use

•       In the leaves, we need only the characters

•       In the internal nodes we need only the frequency

•       In the priority queue, we need only the character:frequency pair
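One way to sketch this shared node type, assuming Python (the class and field names follow the four components listed above; unused fields simply stay None):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """A node used both in the priority queue and in the Huffman tree."""
    character: Optional[str] = None        # needed only in leaves
    frequency: int = 0                     # orders nodes in the priority queue
    leftChild: Optional["Node"] = None     # needed only in internal nodes
    rightChild: Optional["Node"] = None

# A leaf needs only its character (plus frequency while in the queue);
# an internal node needs only a frequency and its two children.
leaf = Node(character="a", frequency=5)
internal = Node(frequency=7, leftChild=leaf, rightChild=Node("b", 2))
print(internal.frequency)  # -> 7
```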

 

Huffman’s strategy

•      Set up the priority queue for each character:frequency pair

•      Repeat the following loop n-1 times, where n is the number of different characters in the file

–    Pop two nodes off the priority queue

•    These will be the two nodes with the smallest frequencies

•    These will end up being farthest from the root

–    Create a new node to hold the sum of their frequencies

–    The new node’s left child is the node with the smaller frequency

–    Its right child is the other node

–    Enqueue the new node with the sum of the frequencies

 

Pseudo code for Huffman’s algorithm

•      PQ -- a minimum priority queue arranged by frequency

•      n – the number of different characters in the file

•      for (i = 1; i <= n-1; i++)
{   p = dequeue(PQ);
    q = dequeue(PQ);

    r = new nodetype;
    r.left = p;
    r.right = q;

    r.frequency = p.frequency + q.frequency;

    enqueue(PQ, r);   // put r in its correct place in the priority queue
}
r = dequeue(PQ);  // the only node left is the root; the queue is now empty
return r;         // this is the root of the tree built
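A runnable translation of this pseudocode, assuming Python's heapq module as the minimum priority queue (a counter breaks frequency ties so the heap never has to compare tree nodes):

```python
import heapq
from itertools import count

def build_huffman_tree(frequencies):
    """frequencies: dict mapping character -> frequency.
    Returns the root of the Huffman tree: internal nodes are
    (frequency, left, right) tuples, leaves are (frequency, char)."""
    tick = count()  # tie-breaker so heapq never compares node tuples
    pq = [(freq, next(tick), (freq, ch)) for ch, freq in frequencies.items()]
    heapq.heapify(pq)
    for _ in range(len(frequencies) - 1):
        p_freq, _, p = heapq.heappop(pq)   # the two smallest frequencies
        q_freq, _, q = heapq.heappop(pq)
        r = (p_freq + q_freq, p, q)        # new internal node holding the sum
        heapq.heappush(pq, (r[0], next(tick), r))
    return heapq.heappop(pq)[2]            # the root of the tree built

root = build_huffman_tree({"a": 5, "b": 2, "c": 10, "d": 8,
                           "e": 22, "f": 49, "g": 4})
print(root[0])  # frequency stored at the root = total of all frequencies -> 100
```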


Using Huffman’s algorithm

•      Put these into a priority queue
(a:5) (b:2) (c:10) (d:8) (e:22) (f:49) (g:4)
then use Huffman’s algorithm to determine the codeword for each letter (i.e. build the tree)

•      How is this tree used to compress the file?

–    Encode the message fad
011001110

•      How is the message decoded?

•      The number of bits it takes to encode the file is equal to the sum, over all the characters, of the
(frequency of character) * (# of bits to encode character)
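Applying this formula to the first example: reading the codeword lengths off the tree (assuming the left child always holds the smaller frequency, so f=0, e=10, a=1100, and so on), the arithmetic works out as follows:

```python
# Frequencies per 100 characters from the first example, and the codeword
# lengths read off the Huffman tree (f:1 bit, e:2 bits, a/c/d:4, b/g:5).
freq   = {"a": 5, "b": 2, "c": 10, "d": 8, "e": 22, "f": 49, "g": 4}
length = {"a": 4, "b": 5, "c": 4, "d": 4, "e": 2, "f": 1, "g": 5}

bits = sum(freq[ch] * length[ch] for ch in freq)
print(bits)         # 215 bits per 100 characters
print(bits * 1000)  # 215,000 bits for the 100,000-character file
```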

 

 

Another example

•      Put these into a priority queue
(a:16) (b:5) (c:12) (d:17) (e:10) (f:25)
then use Huffman’s algorithm to determine the codeword for each letter

 

Analyzing the Huffman tree

•      It will take O(n) time to initialize the heap for the priority queue

•      Every time an item is dequeued or enqueued, it takes O(log n) time

•      Since each of the n-1 passes through the loop does two dequeues and one enqueue, the total is n + 3(log n)(n-1) time

•      The Big O time complexity is then O(n log n) to build the tree to get the code to compress a text file.