Data Compression

Huffman codes

Section 4.4

 

The problem of data compression

•      The problem is to find an efficient way to encode a data file

–    We will look at text files

•      To represent a file for the computer, we use a binary code

–    In this type of code, each character is represented by a unique binary string, called the codeword

–    Remember, a codeword is the bits used to encode a single character

•      There are two types of codes used

–    Fixed-length binary codes

–    Variable-length binary codes

 

Fixed-length binary codes

•      A fixed-length code represents each character using the same number of bits

•      How many bits does it take to encode the 26 letters of the alphabet?

•      ASCII -- Fixed-length code

–    it takes log2 128 = 7 bits to code the 128 symbols

–    if all characters were used equally often, this would be the most space-efficient way

–    but they are not used equally often

–    We would like to find a way that uses fewer bits to encode a text file
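The bit counts above can be checked directly: a fixed-length code for k symbols needs ceil(log2 k) bits per character. A minimal sketch in Python (the function name is ours, chosen for illustration):

```python
import math

def fixed_length_bits(num_symbols):
    """Bits needed to give each of num_symbols a distinct fixed-length codeword."""
    return math.ceil(math.log2(num_symbols))

print(fixed_length_bits(26))   # the 26 letters of the alphabet -> 5 bits
print(fixed_length_bits(128))  # the 128 ASCII symbols -> 7 bits
```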

 

Variable-length binary codes

•      We can achieve a more efficient encoding using a variable-length code

•      We represent different characters using different numbers of bits

•      Example: a=0, b=1, c=00, d=01, e=10, f=11, g=000

•      Do you see a problem here?

–    It is ambiguous because some of the codewords are prefixes of other codewords

•      For decoding to work, we must use only codewords that are not a prefix of any other codeword

•      Such codes are called prefix codes

–    Perhaps “prefix-free codes” would be a better name, but “prefix codes” is standard in the literature
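The ambiguity can be demonstrated by brute force: with the example code above, a bit string such as 001 has more than one valid decoding. A small sketch (the helper is ours, for illustration only):

```python
# The non-prefix code from the example slide: some codewords
# (0, 1) are prefixes of others (00, 01, ...), so decoding is ambiguous.
CODE = {"a": "0", "b": "1", "c": "00", "d": "01", "e": "10", "f": "11", "g": "000"}

def decodings(bits):
    """Return every way to split `bits` into codewords of CODE."""
    if bits == "":
        return [""]
    results = []
    for ch, word in CODE.items():
        if bits.startswith(word):
            results += [ch + rest for rest in decodings(bits[len(word):])]
    return results

print(sorted(decodings("001")))  # -> ['aab', 'ad', 'cb']: three different readings
```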

 

Prefix-free codes

•      One way to guarantee prefix-free codewords is to create a binary tree

–    The left branch represents a 0 and the right branch is a 1

–    Each character is a leaf in the tree

–    The encoding for that leaf (character) is the path to the leaf

–    Look at overhead

–    Name the codeword for each character

•      But there is a possibility that the file using this encoding system is not smaller than if we just used three bits for each letter

•      If the frequencies of d, e, f, g are greater than the frequencies of a, b, c, this would happen

•      We are looking for a way to encode that gives us the smallest possible file. (i.e. the optimal tree)

•      If frequently used characters are given a shorter codeword (fewer bits) than infrequently used characters it should be more efficient
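Decoding with such a tree is a simple walk from the root: a 0 goes left, a 1 goes right, and reaching a leaf emits that character. A sketch with a hypothetical prefix code (nested tuples are internal nodes, strings are leaves; this tree is ours, not the one from the overhead):

```python
# Hypothetical prefix-code tree giving a=00, b=01, c=10, d=110, e=111.
TREE = (("a", "b"), ("c", ("d", "e")))

def decode(bits, tree):
    """Walk the tree one bit at a time; emit a character at each leaf."""
    out, node = [], tree
    for bit in bits:
        node = node[0] if bit == "0" else node[1]
        if isinstance(node, str):          # reached a leaf
            out.append(node)
            node = tree                    # restart at the root
    return "".join(out)

print(decode("0010110", TREE))  # -> "acd"
```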

 

Finding an optimal code to compress text

•      Suppose we have a file M with 100,000 characters

•      If we use fixed-length coding, it will take 300,000 bits to encode it (3 bits per character, enough for up to 8 distinct characters)

•      To be more efficient, we will need to find the frequency of each character used in the file

•      Once we have calculated the frequency of each character we put the frequency-character pair in a minimum priority queue

•      Then we use an algorithm developed by Huffman to build an optimal binary tree

•      The leaves will be the characters; the path to a leaf will be its codeword

•      Using Huffman’s algorithm to determine the codewords for each letter, the size of the file can be reduced to 215,000 bits

–    The first example will show this

 

Understanding the pseudocode

•           Technically, the priority queue and the tree both consist of nodes with four components:

–        character,

–        frequency,

–        leftChild,

–        rightChild

•           But we do not need all this information in every part of our data structures

–        So, although in the code every node has the same structure, in our pseudocode drawings we omit the components a node does not use

•       In the leaves, we need only the characters

•       In the internal nodes we need only the frequency

•       In the priority queue, we need only the character:frequency pair
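One way to sketch this shared node type, assuming Python (the class and field names follow the four components listed above; unused fields simply stay None):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """A node used both in the priority queue and in the Huffman tree."""
    character: Optional[str] = None        # needed only in leaves
    frequency: int = 0                     # orders nodes in the priority queue
    leftChild: Optional["Node"] = None     # needed only in internal nodes
    rightChild: Optional["Node"] = None

# A leaf needs only its character (plus frequency while in the queue);
# an internal node needs only a frequency and its two children.
leaf = Node(character="a", frequency=5)
internal = Node(frequency=7, leftChild=leaf, rightChild=Node("b", 2))
print(internal.frequency)  # -> 7
```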

 

Huffman’s strategy

•      Set up the priority queue for each character:frequency pair

•      Repeat the following loop n-1 times, where n is the number of different characters in the file

–    Pop two nodes off the priority queue

•    These will be the two nodes with the smallest frequencies

•    These will end up being farthest from the root

–    Create a new node to hold the sum of their frequencies

–    The new node’s left child is the node with the smaller frequency

–    Its right child is the other node

–    Enqueue the new node with the sum of the frequencies

 

Pseudo code for Huffman’s algorithm

•      PQ -- a minimum priority queue arranged by frequency

•      n – the number of different characters in the file

•      for (i = 1; i <= n-1; i++)
{   p = dequeue(PQ);
    q = dequeue(PQ);

    r = new nodetype;
    r.left = p;
    r.right = q;

    r.frequency = p.frequency + q.frequency;

    enqueue(PQ, r);   // put r in its correct place in the priority queue
}
r = dequeue(PQ);  // the only node left is the root; the queue is now empty
return r;         // this is the root of the tree built
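A runnable translation of this pseudocode, assuming Python's heapq module as the minimum priority queue (a counter breaks frequency ties so the heap never has to compare tree nodes):

```python
import heapq
from itertools import count

def build_huffman_tree(frequencies):
    """frequencies: dict mapping character -> frequency.
    Returns the root of the Huffman tree: internal nodes are
    (frequency, left, right) tuples, leaves are (frequency, char)."""
    tick = count()  # tie-breaker so heapq never compares node tuples
    pq = [(freq, next(tick), (freq, ch)) for ch, freq in frequencies.items()]
    heapq.heapify(pq)
    for _ in range(len(frequencies) - 1):
        p_freq, _, p = heapq.heappop(pq)   # the two smallest frequencies
        q_freq, _, q = heapq.heappop(pq)
        r = (p_freq + q_freq, p, q)        # new internal node holding the sum
        heapq.heappush(pq, (r[0], next(tick), r))
    return heapq.heappop(pq)[2]            # the root of the tree built

root = build_huffman_tree({"a": 5, "b": 2, "c": 10, "d": 8,
                           "e": 22, "f": 49, "g": 4})
print(root[0])  # frequency stored at the root = total of all frequencies -> 100
```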


Using Huffman’s algorithm

•      Put these into a priority queue
(a:5) (b:2) (c:10) (d:8) (e:22) (f:49) (g:4)
then use Huffman’s algorithm to determine the codeword for each letter (i.e. build the tree)

•      How is this tree used to compress the file?

–    Encode the message fad
011001110

•      How is the message decoded?

•      The number of bits it takes to encode the file is equal to the sum, over all the characters, of the
(frequency of character) * (# of bits to encode character)
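Applying this formula to the first example: reading the codeword lengths off the tree (assuming the left child always holds the smaller frequency, so f=0, e=10, a=1100, and so on), the arithmetic works out as follows:

```python
# Frequencies per 100 characters from the first example, and the codeword
# lengths read off the Huffman tree (f:1 bit, e:2 bits, a/c/d:4, b/g:5).
freq   = {"a": 5, "b": 2, "c": 10, "d": 8, "e": 22, "f": 49, "g": 4}
length = {"a": 4, "b": 5, "c": 4, "d": 4, "e": 2, "f": 1, "g": 5}

bits = sum(freq[ch] * length[ch] for ch in freq)
print(bits)         # 215 bits per 100 characters
print(bits * 1000)  # 215,000 bits for the 100,000-character file
```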

 

 

Another example

•      Put these into a priority queue
(a:16) (b:5) (c:12) (d:17) (e:10) (f:25)
then use Huffman’s algorithm to determine the codeword for each letter

 

Analyzing the Huffman tree

•      It will take O(n) time to initialize the heap for the priority queue

•      Every time an item is dequeued or enqueued, it takes O(log n) time

•      Since each of the n-1 passes through the loop does two dequeues and one enqueue, the total is n + 3(log n)(n-1) time

•      The Big O time complexity is then O(n log n) to build the tree to get the code to compress a text file.