Data Compression: Huffman Codes
Section 4.4
The problem of data compression
The problem is to find an efficient way to encode a data file.
We will look at text files.
To represent a file for the computer, we use a binary code.
In this type of code, each character is represented by a unique binary string, called the codeword.
Remember, a codeword is the sequence of bits used to encode a single character.
There are two types of codes used:
Fixed-length binary codes
Variable-length binary codes
Fixed-length binary codes
A fixed-length code represents each character using the same number of bits.
How many bits does it take to encode the 26 letters of the alphabet? (⌈log₂ 26⌉ = 5 bits)
ASCII -- Fixed-length code
It takes ⌈log₂ 128⌉ = 7 bits to code the 128 symbols.
If all characters were used equally often, then this would be the most space-efficient way, but they are not used equally often.
We would like to find a way that uses fewer bits to encode a text file.
Variable-length binary codes
We can find a more efficient code using a variable-length code.
We represent different characters using different numbers of bits.
Example: a=0, b=1, c=00, d=01, e=10, f=11, g=000
Do you see a problem here?
It is ambiguous because some of the codewords are prefixes of other codewords: for example, the string 000 could be decoded as aaa, ac, ca, or g.
In order for this to work, we only use codewords that are not a prefix of some other codeword.
Such codes are called prefix codes.
Perhaps prefix-free codes would be a better name, but prefix codes is standard in the literature.
Prefix-free codes
One way to guarantee prefix-free codewords is to create a binary tree.
The left branch represents a 0 and the right branch represents a 1.
Each character is a leaf in the tree.
The encoding for that leaf (character) is the path from the root to the leaf.
Look at the overhead.
Name the codeword for each character (a small sketch of how codewords are read off a tree follows).
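For illustration only (the actual tree is the one on the overhead), here is a small Python sketch of reading codewords off a prefix-code tree; the tree is written as nested (left, right) pairs with characters at the leaves, and the tree shape in the example call is made up.

def codewords(tree, prefix=""):
    # A leaf is a single character: the path followed so far is its codeword.
    if isinstance(tree, str):
        return {tree: prefix}
    # An internal node is a (left, right) pair: 0 goes left, 1 goes right.
    left, right = tree
    table = {}
    table.update(codewords(left, prefix + "0"))
    table.update(codewords(right, prefix + "1"))
    return table

# Made-up example tree: prints {'a': '00', 'b': '01', 'c': '10', 'd': '110', 'e': '111'}
print(codewords((("a", "b"), ("c", ("d", "e")))))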
But there is a possibility that the file using this encoding system is not smaller than if we just used three bits for each letter.
If the frequency of d, e, f, g is greater than the frequency of a, b, c, this would happen.
We are looking for a way to encode that gives us the smallest possible file (i.e., the optimal tree).
If frequently used characters are given a shorter codeword (fewer bits) than infrequently used characters, it should be more efficient.
Finding an optimal code to compress text
Suppose we have a file M with 100,000 characters.
If we use fixed-length coding (three bits per character), it will take 300,000 bits to encode it.
To be more efficient, we will need to find the frequency of each character used in the file.
Once we have calculated the frequency of each character, we put the character:frequency pairs in a minimum priority queue.
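A minimal sketch of these two steps in Python (the file name message.txt is just a placeholder): Counter tallies the characters and heapq acts as the minimum priority queue.

from collections import Counter
import heapq

with open("message.txt") as f:               # placeholder file name
    frequencies = Counter(f.read())          # maps each character to its frequency

pq = [(freq, ch) for ch, freq in frequencies.items()]
heapq.heapify(pq)                            # a min-heap ordered by frequency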
Then we use an algorithm developed by Huffman to build an optimal binary tree.
The leaves will be the characters; the path to a leaf will be the codeword.
Using Huffman's algorithm to determine the codewords for each letter, the size of the file can be reduced to 215,000 bits.
The first example will show this
Understanding the pseudocode
Technically, the priority queue and the tree are both built from nodes with four components: character, frequency, leftChild, rightChild.
But we do not need all of this information in every part of our data structures.
So, although in actual code every node has the same structure, in our pseudocode drawings we omit some of the components:
In the leaves, we need only the characters.
In the internal nodes, we need only the frequency.
In the priority queue, we need only the character:frequency pair
Huffman's strategy
Set up the priority queue with a node for each character:frequency pair.
Go through the following loop n-1 times, where n is the number of different characters in the file:
Pop two nodes off the priority queue.
These will be the two nodes with the smallest frequencies.
These will end up being farthest from the root.
Create a new node to hold the sum of their frequencies.
The new node's left child is the node with the smaller frequency.
Its right child is the other node.
Enqueue the new node with the sum of the frequencies.
Pseudocode for Huffman's algorithm
PQ -- a minimum priority queue arranged by frequency
n  -- the number of different characters in the file

for (i = 1; i <= n-1; i++)
{
    p = dequeue(PQ);                  // the two nodes with the
    q = dequeue(PQ);                  // smallest frequencies
    r = new nodetype;
    r.left = p;
    r.right = q;
    r.frequency = p.frequency + q.frequency;
    enqueue(PQ, r);                   // put r in its correct place in the priority queue
}
p = dequeue(PQ);                      // the one node left; this leaves the priority queue empty
return(r);                            // r (the same node as p) is the root of the tree built
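A runnable version of the same procedure, sketched in Python with heapq serving as the minimum priority queue; the names Node and build_huffman_tree are illustrative, not part of the pseudocode.

import heapq

class Node:
    def __init__(self, frequency, character=None, left=None, right=None):
        self.frequency = frequency
        self.character = character            # used only in the leaves
        self.left = left
        self.right = right
    def __lt__(self, other):                  # lets heapq order nodes by frequency
        return self.frequency < other.frequency

def build_huffman_tree(frequencies):
    # frequencies is a dict mapping each character to its frequency
    pq = [Node(f, c) for c, f in frequencies.items()]
    heapq.heapify(pq)                         # O(n) to set up the heap
    for _ in range(len(frequencies) - 1):     # the loop runs n-1 times
        p = heapq.heappop(pq)                 # the two nodes with the
        q = heapq.heappop(pq)                 # smallest frequencies
        r = Node(p.frequency + q.frequency, left=p, right=q)
        heapq.heappush(pq, r)                 # put r back into the priority queue
    return heapq.heappop(pq)                  # the root of the tree built

For example, build_huffman_tree({'f': 49, 'e': 22, 'c': 10, 'd': 8, 'a': 5, 'g': 4, 'b': 2}) builds a tree for the first example below.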
Using Huffman's algorithm
Put these into a priority queue:
(a:5) (b:2) (c:10) (d:8) (e:22) (f:49) (g:4)
Then use Huffman's algorithm to determine the codeword for each letter (i.e., build the tree).
How is this tree used to compress the file?
Encode the message fad: 011001110
How is the message decoded?
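To decode, start at the root and follow one edge per bit (0 = left, 1 = right); every time a leaf is reached, output its character and jump back to the root. A small Python sketch, using the same nested (left, right) pair representation as the earlier sketch; the tree below is one tree consistent with the encoding of fad shown above (the exact shape depends on how ties are broken).

def decode(tree, bits):
    out = []
    node = tree
    for bit in bits:
        node = node[0] if bit == "0" else node[1]   # 0 = left, 1 = right
        if isinstance(node, str):                   # reached a leaf
            out.append(node)
            node = tree                             # start over at the root
    return "".join(out)

# In this tree: f=0, e=10, a=1100, b=11010, g=11011, d=1110, c=1111
tree = ("f", ("e", (("a", ("b", "g")), ("d", "c"))))
print(decode(tree, "011001110"))                    # prints fad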
The number of bits it takes to encode the file is equal to the sum, over all the characters, of (frequency of character) * (number of bits to encode the character).
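As a worked check, take the first example's frequencies as thousands of characters (so the file has 100,000 characters in all) and the codeword lengths from a tree that encodes fad as 011001110 (f:1, e:2, a:4, c:4, d:4, b:5, g:5 bits):
49,000*1 + 22,000*2 + 5,000*4 + 10,000*4 + 8,000*4 + 2,000*5 + 4,000*5 = 215,000 bits
This matches the 215,000 figure given earlier; the exact codewords depend on how ties are broken, but every optimal tree gives the same total.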
Another example
Put these into a priority queue:
(a:16) (b:5) (c:12) (d:17) (e:10) (f:25)
Then use Huffman's algorithm to determine the codeword for each letter.
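One way to check the answer is to run the Python sketch given after the pseudocode on these frequencies. One possible result (the codewords depend on how ties are broken, though the total does not) is a=00, d=01, f=10, c=110, b=1110, e=1111, for a total of 16*2 + 17*2 + 25*2 + 12*3 + 5*4 + 10*4 = 212 bits per 85 characters.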
Analyzing the Huffman tree
It will take O(n) time to initialize the heap for the priority queue.
Every time an item is dequeued or enqueued, it takes O(log n) time.
Since we go through the loop n-1 times, and each pass does three heap operations (two dequeues and one enqueue), it takes n + 3(log n)(n-1) time.
The Big O time complexity is then O(n log n) to build the tree to get the code to compress a text file.