Hashing
Hashing as a Dictionary Implementation
Chapter 19 in Carrano/Savitch
The Dictionary Problem
The problem can be defined as finding, inserting
and deleting records in a database
Usually we want the search to be as efficient as
possible
What data structures do you know that allow for fast
searches?
Ways to solve the dictionary problem
Keep the records sorted
in an array and do a binary search
Keep the records in a binary search tree
Keep the records
in a balanced binary search tree (AVL or red-black tree)
Keep the records in a B-tree
Each node of this tree has many children (20+)
Used when the number of records is so large that most of the database is on secondary storage
Primary data structure used in database programs
Keep the records
in an array indexed by the keys themselves.
The key is part
of the record, and is used for an index
Indexing
Suppose the index were the same as the key
Suppose everyone in a class is given a student number, which is one field of the student's record.
This student number is the index into the array.
How many records
would the array have to have if the student number were four digits?
How long would it
take to retrieve information about a student, given the student number?
Would the records
have to be stored in alphabetical order for a fast search?
What are the problems with this system?
From Indexing to hashing
Suppose we made an array with the Social Security number (SS#) as the index into the array
How long would it take to find a record?
Problem?
Now suppose we let the index into the array be some function of the SS# (i.e., the key)
Maybe the last 4 digits
Maybe add the first 3 digits, the next 2, and the last 4, then truncate the sum to three digits
How big would the array be?
Problems?
General idea of hashing
The information (the key, value pair) is kept in an
array of objects
Each object has a unique key field (name,
number, whatever)
A hash function maps the key to an index of the array
A collision occurs when two keys hash to the same index
We must figure out a way to handle this
The simplest way is just to put the object in the next available slot in the array.
This is called linear probing
Simple example of hashing
Put these keys
into a hash table of size 11
25, 18, 6, 17, 21, 32, 20, 28
Our hash function
will just be
index = key % tablesize
We will handle
collisions with linear probing
If we get a collision, add one to the index and try again
If there is another collision, add one to that index again (put the key in the next available slot), wrapping around to index 0 after the last slot
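The example above can be traced with a short program. This is only a sketch: the class name LinearProbingDemo, the use of null to mark an empty slot, and the wrap-around at the end of the array are assumptions, and it presumes non-negative keys and fewer keys than slots.

```java
import java.util.Arrays;

// Hypothetical demo class; the name and the null-means-empty convention are assumptions.
class LinearProbingDemo {

    // Inserts each key at index = key % tableSize, moving forward one slot per collision.
    static Integer[] build(int[] keys, int tableSize) {
        Integer[] table = new Integer[tableSize];   // null marks an empty slot
        for (int key : keys) {
            int index = key % tableSize;            // hash function from the slide
            while (table[index] != null) {          // collision: try the next slot
                index = (index + 1) % tableSize;    // wrap around after the last slot
            }
            table[index] = key;
        }
        return table;
    }

    public static void main(String[] args) {
        int[] keys = {25, 18, 6, 17, 21, 32, 20, 28};
        System.out.println(Arrays.toString(build(keys, 11)));
        // prints [32, 28, null, 25, null, null, 6, 18, 17, 20, 21]
    }
}
```

Note how 17 and 28 both hash to index 6 and end up in slots 8 and 1, well away from where they started.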
Considerations when using hashing
What hash function to use
How to handle collisions
How big to make the array
The efficiency of hashing is determined by the amount
of storage you are willing to waste
More
space, fewer collisions
Criteria for good hash functions
Quick and easy
to compute
Minimize the number of collisions
Achieve an even distribution of the records
across the range of indices (uniform hashing function)
Components of a Hashing function
Convert the
search key into a hash code
This involves taking the key (which is not necessarily
an integer) and changing it to an integer
This must be done
in a way that avoids collisions as much as possible
Your text talks
about two ways to do this:
Polynomial Hash Codes (for strings)
Folding and/or summing components
Compress the
hash code to the size of the array
Use the mod
function to do this
Summing components
This can be used when the number of bits needed to store a key is greater than the number of bits desired for the hash code
Example: We want to store the hash code in an integer (4 bytes) and the key is stored as a long, a double, or a character string
One possibility is just to truncate the high-order bits (or the low-order ones)
This may result in many collisions if the items often differ only in the bits deleted
An alternative is to sum the high-order bits with the low-order bits and use the sum as the hash code
This may not work well for character strings, since a sum ignores the order of the characters, and many common words are built from the same letters
Example: stop, tops, pots, spot, etc. all have the same sum
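For a 64-bit key, the difference between truncating and summing might look like the sketch below. The class and method names are made up for illustration, and the two sample keys are chosen so that they differ only in their high-order bits.

```java
// Hypothetical helper class; names are illustrative only.
class SummingHashDemo {

    // Sums the high-order and low-order 32-bit halves of a 64-bit key.
    static int foldBySumming(long key) {
        int high = (int) (key >>> 32);   // upper 32 bits
        int low = (int) key;             // lower 32 bits
        return high + low;               // int overflow simply wraps around
    }

    public static void main(String[] args) {
        long a = 0x0000000100000005L;    // a and b differ only in the high-order bits
        long b = 0x0000000200000005L;
        System.out.println((int) a == (int) b);                     // true: truncation collides
        System.out.println(foldBySumming(a) + " " + foldBySumming(b)); // prints 6 7
    }
}
```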
Examples of a Summing hash function
A simple hash function
Just add together the ASCII codes of all the characters in the string making up the key
int hashCode( )
{
   int hash = 0;
   for (int i = 0; i < key.length( ); i++)
      hash = hash + key.charAt(i);   // add the code of each character
   return hash;
} // end hashCode
This assumes the key is a String
It must go inside a class that implements Map
Polynomial Hash Codes
This takes into account the position of the characters or bytes in the key
Consider each byte $u_i$ in the key as a coefficient in a polynomial of degree $n-1$, where $n$ is the number of bytes in the key
Choose a nonzero constant $g$ not equal to one and find the hash code value this way:
$u_0 g^{n-1} + u_1 g^{n-2} + \cdots + u_{n-2} g + u_{n-1}$
This can be evaluated rather quickly using Horner's method.
Analysis has been done, and it seems that some values of $g$ are much better than others
33, 37, 39, and 41 are particularly good choices
We will look at an implementation with $g = 37$
Horner's method
Horner's method minimizes the number of arithmetic operations when evaluating a polynomial.
$u_0 g^{n-1} + u_1 g^{n-2} + \cdots + u_{n-2} g + u_{n-1}$
$= (\cdots((u_0 g + u_1) g + u_2) g + \cdots + u_{n-2}) g + u_{n-1}$
Let's make a concrete example: evaluate $4g^3 + 5g^2 + 3g^1 + 8g^0$ if $g$ is 2; here $n = 4$
Evaluating using Horner's method
$((4g + 5)g + 3)g + 8 = ((4 \cdot 2 + 5) \cdot 2 + 3) \cdot 2 + 8 = 66$
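The nested form translates directly into a loop. A minimal sketch, assuming the coefficients are supplied from the highest power down (the class and method names are illustrative):

```java
// Hypothetical helper; coefficients are listed from the highest power down.
class HornerDemo {

    static long evaluate(int[] coefficients, int g) {
        long result = 0;
        for (int u : coefficients) {
            result = result * g + u;   // one multiplication and one addition per coefficient
        }
        return result;
    }

    public static void main(String[] args) {
        // 4g^3 + 5g^2 + 3g + 8 at g = 2
        System.out.println(evaluate(new int[]{4, 5, 3, 8}, 2));   // prints 66
    }
}
```

A polynomial with n coefficients needs only n multiplications this way, instead of computing each power of g separately.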
Polynomial hash function
This is a good hash function to use if the key is a string
int hashCode( )
{
   int hash = 0;
   int n = key.length( );
   for (int i = 0; i < n; i++)
      hash = 37 * hash + key.charAt(i);   // Horner's method
   return hash;
}
This may overflow and produce a negative number; we will take care of that when we use the result of hashCode
If the key is very long, the programmer should decide how to shorten it
For example, use every other character
Knowledge of the form of the key should help here
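To see why position matters, the sketch below compares a plain sum of character codes with the polynomial hash (g = 37) on four anagrams. The class and method names are assumptions made for this illustration, and the key is passed as a parameter rather than read from a field.

```java
// Hypothetical comparison of the two hash codes on anagrams.
class PolynomialHashDemo {

    // Sum of character codes: ignores the order of the characters.
    static int sumHash(String key) {
        int hash = 0;
        for (int i = 0; i < key.length(); i++) {
            hash += key.charAt(i);
        }
        return hash;
    }

    // Polynomial hash code with g = 37, evaluated by Horner's method.
    static int polynomialHash(String key) {
        int hash = 0;
        for (int i = 0; i < key.length(); i++) {
            hash = 37 * hash + key.charAt(i);
        }
        return hash;
    }

    public static void main(String[] args) {
        for (String word : new String[]{"stop", "tops", "pots", "spot"}) {
            System.out.println(word + " " + sumHash(word) + " " + polynomialHash(word));
        }
        // Every word has the sum 454, but the four polynomial hash codes are all different.
    }
}
```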
Mapping the hash code to the range of the
array
N is the size of the array
h(k) = k mod N
If you divide by N, how many possible remainders are there?
N should be prime to minimize collisions caused by bunching of the data
private int getHashIndex(Object key)
{
   int hashIndex = key.hashCode( ) % hashTable.size( );
   if (hashIndex < 0)   // the remainder is negative when the hash code is negative
      hashIndex = hashIndex + hashTable.size( );
   return hashIndex;
}
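The adjustment for negative values can be checked on a small number. In this sketch the hash code is passed in directly as an int, and the class name is made up; Math.floorMod from the standard library is shown as an alternative that gives the same index in one step.

```java
// Hypothetical demo of compressing a possibly negative hash code.
class CompressDemo {

    // Maps any int hash code to an index in the range 0 .. tableSize - 1.
    static int toIndex(int hashCode, int tableSize) {
        int index = hashCode % tableSize;   // in Java, % can return a negative remainder
        if (index < 0) {
            index += tableSize;             // shift into the valid range
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(-7 % 11);                  // prints -7
        System.out.println(toIndex(-7, 11));          // prints 4
        System.out.println(Math.floorMod(-7, 11));    // prints 4
    }
}
```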
Collision Processing
What to do when two keys hash to the same index
Types of collision resolution
Open Addressing
(only space is in the array)
Linear Probing
Quadratic Probing
Double hashing
Separate Chaining (add extra nodes as needed)
Use an array of linked lists or an array of pointers
Collision Resolution with Open Addressing
All the records go in the array; there is no separate data structure
Collisions are resolved by using empty cells in the
array
A bigger array is needed to reduce collisions
usually about twice the size of the number of records
expected
Linear probing with Open
Addressing
Simplest way--put the record in the next available
space
The probe increment is always one
i.e., if there is a collision, one is always added to the index
Retrieval does a sequential search from the hashed address until the record is found
As long as the table is big enough, space can always be found
But this can lead to quite a long sequential search
Using Linear probing for collision
resolution
Example: H(key) =
key mod 23
Add these keys to
the hash table
3, 4, 5, 26, 6, 7, 23, 16, 39, 17, 22, 55
Primary clustering
Any key that hashes into a cluster will take several attempts to find an empty space
The key will then add to the cluster
Linear probing does not come very close to uniform hashing, since primary clustering is likely even in relatively sparse tables
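One way to watch a cluster form is to count how many slots each insertion examines. The sketch below does this for the key mod 23 example; the class name and the counting convention (the home slot counts as the first probe) are assumptions.

```java
import java.util.Arrays;

// Hypothetical probe counter for linear probing.
class ClusteringDemo {

    // Returns, for each key in order, the number of slots examined during its insertion.
    static int[] probeCounts(int[] keys, int tableSize) {
        Integer[] table = new Integer[tableSize];   // null marks an empty slot
        int[] probes = new int[keys.length];
        for (int k = 0; k < keys.length; k++) {
            int index = keys[k] % tableSize;
            probes[k] = 1;                          // the home slot is the first probe
            while (table[index] != null) {
                index = (index + 1) % tableSize;
                probes[k]++;
            }
            table[index] = keys[k];
        }
        return probes;
    }

    public static void main(String[] args) {
        int[] keys = {3, 4, 5, 26, 6, 7, 23, 16, 39, 17, 22, 55};
        System.out.println(Arrays.toString(probeCounts(keys, 23)));
        // prints [1, 1, 1, 4, 2, 2, 1, 1, 2, 2, 1, 1]
    }
}
```

Key 26 hashes to index 3, runs into the cluster 3, 4, 5, and lands in slot 6. That in turn pushes 6 and 7 out of their own home slots, even though the table is only about half full.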
Quadratic probing using open addressing
The probe increment is the square of an integer, not the integer itself
If there is a collision, add $j^2$ to the hashed index, for $j = 1, 2, 3, 4, \ldots$
With the first collision, add 1 ($1^2$) to the index
If that still collides, add 4 ($2^2$) to the index
If that collides, add 9 ($3^2$), and so on
This avoids primary clustering, but can cause secondary clustering, where the set of filled array cells bounces around the array.
It is even possible that it may fail to find an empty bucket even though one is available.
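A sketch of quadratic probing, assuming the increments are added to the original hashed index and that the search gives up after as many probes as there are slots (both are choices made for this illustration, as are the names):

```java
// Hypothetical quadratic probing demo.
class QuadraticProbingDemo {

    // Inserts keys at (home + j*j) % tableSize for j = 0, 1, 2, ...
    // Throws if no empty slot is reached within tableSize probes.
    static Integer[] build(int[] keys, int tableSize) {
        Integer[] table = new Integer[tableSize];   // null marks an empty slot
        for (int key : keys) {
            int home = key % tableSize;
            int index = home;
            int j = 0;
            while (table[index] != null) {
                j++;
                if (j >= tableSize) {
                    throw new IllegalStateException("no empty slot reached for key " + key);
                }
                index = (home + j * j) % tableSize;
            }
            table[index] = key;
        }
        return table;
    }

    public static void main(String[] args) {
        Integer[] t = build(new int[]{3, 4, 5, 26, 6, 7, 23, 16, 39, 17, 22, 55}, 23);
        System.out.println(t[7] + " " + t[8]);   // prints 26 7

        try {
            build(new int[]{0, 1, 4}, 4);        // slots 2 and 3 are empty but never probed
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());  // prints no empty slot reached for key 4
        }
    }
}
```

With a table of size 4, the squares only ever reach the home slot and the one after it, so key 4 cannot be placed even though half the table is free. A prime table size that is kept less than half full avoids this failure.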
Double hashing using open addressing
You have two hash functions, H1(key) and H2(key)
The second hash function tells you what the probe increment must be
One way to use the two hash functions
The second function is used only when there is a collision
New_index = [H1(key) + H2(key)] mod N
Example: H1(key) = key mod 23; H2(key) = 1 + key mod 21
Add these keys to the hash table
3, 4, 5, 26, 6, 7, 23, 16, 39, 17, 22, 55
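The example might be traced as below. The slide gives the formula for the first collision only; this sketch assumes that each further collision adds H2(key) again, which is the usual convention. The class name is made up.

```java
// Hypothetical double hashing demo with H1(key) = key mod 23 and H2(key) = 1 + key mod 21.
class DoubleHashingDemo {

    static Integer[] build(int[] keys) {
        final int n = 23;
        Integer[] table = new Integer[n];     // null marks an empty slot
        for (int key : keys) {
            int index = key % n;              // H1: the home index
            int step = 1 + key % 21;          // H2: the probe increment, never zero
            while (table[index] != null) {    // collision: advance by the key's own step
                index = (index + step) % n;
            }
            table[index] = key;
        }
        return table;
    }

    public static void main(String[] args) {
        Integer[] t = build(new int[]{3, 4, 5, 26, 6, 7, 23, 16, 39, 17, 22, 55});
        System.out.println(t[9] + " " + t[12] + " " + t[14]);   // prints 26 39 55
    }
}
```

Key 26 collides at index 3 and jumps by 6 to slot 9, so 6 and 7 keep their home slots. Key 55 then collides at 9, jumps by 14 to slot 0 (occupied by 23), and jumps again to slot 14.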
Removing and searching
With open addressing, removing an item can cause a
real problem
As long as no items are removed, searching takes the same
path as inserting.
The search starts at the index hashed to by the key.
If the key is not
found at that index, it searches sequentially until it either finds the key, or
finds an empty slot
If the item is not found by the time the search reaches an empty slot, it is assumed that the key is not in the array.
Removal with open addressing
If an item has been removed, a search may find an empty slot that was occupied when the item being sought was inserted
Thus it will stop the search too soon
One way to solve this is to have three states for each
slot in the array
Empty
Occupied with data
Available because data has been removed
There are various ways to implement this, depending on
what the key is
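One possible implementation of the three states keeps a separate array of state markers next to the keys. This is only a sketch with linear probing and int keys; the class, enum, and method names are assumptions, and add presumes the key is not already present and the table is not full.

```java
import java.util.Arrays;

// Hypothetical open-addressing table with three slot states.
class TombstoneTable {
    private enum State { EMPTY, OCCUPIED, REMOVED }

    private final int[] keys;
    private final State[] states;

    TombstoneTable(int size) {
        keys = new int[size];
        states = new State[size];
        Arrays.fill(states, State.EMPTY);
    }

    private int home(int key) {
        return Math.floorMod(key, keys.length);
    }

    // Places the key in the first slot on its probe path that is EMPTY or REMOVED.
    void add(int key) {
        int i = home(key);
        while (states[i] == State.OCCUPIED) {
            i = (i + 1) % keys.length;
        }
        keys[i] = key;
        states[i] = State.OCCUPIED;
    }

    // Returns the slot holding the key, or -1. Only an EMPTY slot ends the search.
    int find(int key) {
        int i = home(key);
        for (int probes = 0; probes < keys.length && states[i] != State.EMPTY; probes++) {
            if (states[i] == State.OCCUPIED && keys[i] == key) {
                return i;
            }
            i = (i + 1) % keys.length;
        }
        return -1;
    }

    // Marks the slot as REMOVED instead of EMPTY so later searches keep going past it.
    boolean remove(int key) {
        int i = find(key);
        if (i < 0) {
            return false;
        }
        states[i] = State.REMOVED;
        return true;
    }

    public static void main(String[] args) {
        TombstoneTable table = new TombstoneTable(11);
        table.add(6);
        table.add(17);   // collides with 6, goes to slot 7
        table.add(28);   // collides with 6 and 17, goes to slot 8
        table.remove(17);
        System.out.println(table.find(28));   // prints 8
        System.out.println(table.find(17));   // prints -1
    }
}
```

If remove had set slot 7 back to EMPTY, the search for 28 would have stopped there and reported it missing.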
Collision resolution by chaining
The hash table is an array of linked lists
Each key that hashes to a certain index is kept in the linked list pointed to by that index
Inserting with chaining
First, find the index with hashing
Next, search the list to see if already in list
If new, insert either at the beginning or the end of the list (programmer's choice)
Use linked list insertion
Inserting at the beginning is often preferred, since new objects are frequently accessed
Searching with separate chaining
First, apply the hash function to get the index into
an array
Then search the chain for the element
Deleting with separate chaining
First, hash to find the index
Next, sequential search to find the record
Then, use linked list deletion
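The three operations with separate chaining can be sketched with java.util.LinkedList standing in for a hand-written linked list. The class name, the int keys, and the choice to insert at the front of the chain are assumptions for illustration.

```java
import java.util.LinkedList;

// Hypothetical separate-chaining table for int keys.
class ChainedTable {
    private final LinkedList<Integer>[] chains;

    @SuppressWarnings("unchecked")
    ChainedTable(int size) {
        chains = new LinkedList[size];
        for (int i = 0; i < size; i++) {
            chains[i] = new LinkedList<>();
        }
    }

    // Hashes the key to find the chain it belongs to.
    private LinkedList<Integer> chainFor(int key) {
        return chains[Math.floorMod(key, chains.length)];
    }

    // Searching: hash to the index, then search that chain sequentially.
    boolean contains(int key) {
        return chainFor(key).contains(key);
    }

    // Inserting: hash, check the chain for a duplicate, then insert at the front.
    boolean add(int key) {
        LinkedList<Integer> chain = chainFor(key);
        if (chain.contains(key)) {
            return false;
        }
        chain.addFirst(key);
        return true;
    }

    // Deleting: hash, then use the list's own deletion.
    boolean remove(int key) {
        return chainFor(key).remove(Integer.valueOf(key));   // remove(Object), not remove(int index)
    }

    public static void main(String[] args) {
        ChainedTable table = new ChainedTable(11);
        table.add(6);
        table.add(17);   // same index as 6, so it joins the same chain
        table.add(28);
        table.remove(17);
        System.out.println(table.contains(17) + " " + table.contains(28));   // prints false true
    }
}
```

Unlike open addressing, removal here needs no special slot states, because deleting a node never breaks the search path of another key.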
Table size with separate chaining
Table size should be about the same as the number of
records
But it should be a prime number to reduce collisions
Remember, table size is determined by the divisor of the mod operation in the hash function
Disadvantages of separate chaining
Using pointers slows the algorithm
time required to allocate new nodes
Must use two data structures
Deciding on table size
The size of the table is the determining factor in how
efficient hashing is
Collision
resolution takes considerably more time than evaluating the hash function
So the number of
collisions determines how efficient the hashing is.
The number of
collisions depends on how full a hash table is
The load factor is the ratio of the number of entries in the dictionary to the size of the hash array.
The Load Factor λ
Load factor λ = #of keys / #of slots in array
Load factors for open addressing
λ ranges from
0 (empty table) to 1 (full table)
The number of collisions increases dramatically when λ > 0.5 (when the table is over half full)
Load factors
for separate chaining
λ can be
greater than one
In separate
chaining, λ is thought of as the average length of the chains
(which is the same as the definition above)
To maintain
reasonable efficiency, λ should not go much above 1
Rehashing (Reallocating the hash table)
The hash table's load factor needs to be kept low to maintain O(1) time complexity
If more and more records are added, we will need to allocate a new, larger table or array.
This is called rehashing.
The hash function must reflect the new table size.
All the elements in the old table must be inserted into the new table using the new hash function.
A good choice is to rehash into an array about double the size of the original array.
The size of the new array should be a prime number.
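A rehashing step might be sketched as follows, assuming an open-addressing table of non-negative Integer keys with linear probing, where null marks an empty slot. The helper names are made up, and the new size is taken as the first prime at or above twice the old size.

```java
// Hypothetical rehashing demo for a linear-probing table.
class RehashDemo {

    static boolean isPrime(int n) {
        if (n < 2) {
            return false;
        }
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) {
                return false;
            }
        }
        return true;
    }

    // Smallest prime greater than or equal to n.
    static int nextPrime(int n) {
        while (!isPrime(n)) {
            n++;
        }
        return n;
    }

    // Re-inserts every key of the old table into a table of about twice the size.
    static Integer[] rehash(Integer[] oldTable) {
        int newSize = nextPrime(2 * oldTable.length);
        Integer[] newTable = new Integer[newSize];
        for (Integer key : oldTable) {
            if (key == null) {
                continue;                     // skip empty slots
            }
            int index = key % newSize;        // the hash function uses the new size
            while (newTable[index] != null) {
                index = (index + 1) % newSize;
            }
            newTable[index] = key;
        }
        return newTable;
    }

    public static void main(String[] args) {
        // A size-11 table holding 8 keys: load factor 8/11, about 0.73.
        Integer[] old = {32, 28, null, 25, null, null, 6, 18, 17, 20, 21};
        Integer[] bigger = rehash(old);
        System.out.println(bigger.length);    // prints 23, so the load factor drops to 8/23
        System.out.println(bigger[9] + " " + bigger[5]);   // prints 32 28
    }
}
```

The keys cannot simply be copied slot for slot, because key mod 23 generally differs from key mod 11. For example, 32 moves from slot 0 to slot 9.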
Brief analysis of types of collision
resolution
Open addressing may save some space over separate chaining
But it is not necessarily faster.
In experimental and theoretical analyses, the chaining method is either competitive with or faster than open addressing
So, the collision-handling method of choice seems to
be separate chaining.