Hashing
as a Dictionary Implementation

Chapter 19 in Carrano/Savitch

 

The Dictionary Problem

•      The problem can be defined as finding, inserting and deleting records in a database

•      Usually we want the search to be as efficient as possible

•      What data structures do you know that allow for fast searches?

 

Ways to solve the dictionary problem

•      Keep the records sorted in an array and do a binary search

•      Keep the records in a binary search tree

•      Keep the records in a balanced binary search tree (AVL or red-black tree)

•      Keep the records in a B-tree

–    Each node in this tree can have many children (often 20 or more)

–    Used when the number of records is so large that most of the database is on secondary storage

–    The primary data structure used in database programs

•      Keep the records in an array indexed by the keys themselves.

–    The key is part of the record, and is used for an index

 

Indexing

•      Suppose the index were the same as the key

•      In a class, everyone is given a number to serve as their student number.  It is one field of the record.

•      This student number is the index into the array.

–    How many records would the array have to have if the student number were four digits?

–    How long would it take to retrieve information about a student, given the student number?

–    Would the records have to be stored in alphabetical order for a fast search?

•      What are the problems with this system?

 

From Indexing to hashing

•      Suppose we made an array with SS# as the index into the array

–    time to find a record?

–    Problem?

•      Now suppose we let the index into the array be some function of the SS# (i.e. key)

–    Maybe last 4 digits

–    Maybe add first 3, next two, and last four, truncate to three digits

•      How big would the array be?

–    Problems?

 

General idea of hashing

•      The information (the key, value pair) is kept in an array of objects

•      Each object has a unique key field (name, number, whatever)

•      A hash function relates the key to the index

–    the hash function maps the key to the index of the array

•      A collision occurs when two keys hash to the same index

–    must figure out a way to handle this

–    The simplest way is just to put the object in the next available slot in the array.

•    This is called Linear Probing

 

Simple example of hashing 

•      Put these keys into a hash table of size 11
  
25, 18, 6, 17, 21, 32, 20, 28

•      Our hash function will just be
  
index = key % tablesize

•      We will handle collisions with linear probing

–   If we get a collision, add one to the index, and try again

–   If another collision, just add one to that index again (put it in the next available slot)
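The insertion process above can be sketched in Java. This is a minimal illustration of the slide's example (the class and method names are illustrative, not from the text): each key goes to slot key % 11, and on a collision the probe moves one slot to the right, wrapping around at the end of the array.

```java
// Insert the slide's keys into a table of size 11 using
// index = key % 11 and linear probing for collisions.
public class LinearProbingDemo {
    static final int EMPTY = -1;

    // Build the hash table by inserting each key with linear probing.
    public static int[] buildTable(int[] keys, int size) {
        int[] table = new int[size];
        java.util.Arrays.fill(table, EMPTY);
        for (int key : keys) {
            int index = key % size;
            while (table[index] != EMPTY)      // collision: try next slot
                index = (index + 1) % size;    // wrap around at the end
            table[index] = key;
        }
        return table;
    }

    public static void main(String[] args) {
        int[] keys = {25, 18, 6, 17, 21, 32, 20, 28};
        System.out.println(java.util.Arrays.toString(buildTable(keys, 11)));
        // e.g. 17 hashes to slot 6, collides with 6 there and with 18
        // at slot 7, and finally lands in slot 8
    }
}
```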

 

Considerations when using hashing

•     What hash function to use

•     How to handle collisions

•     How big to make the array

–  The efficiency of hashing is determined by the amount of storage you are willing to waste

•   More space, fewer collisions

 

 Criteria for good hash functions

•      Quick and easy to compute

•      Minimize the number of collisions

•      Achieve an even distribution of the records across the range of indices (uniform hashing function)

 

Components of a Hashing function

•      Convert the search key into a hash code

–    This involves taking the key (which is not necessarily an integer) and changing it to an integer

–    This must be done in a way that avoids collisions as much as possible

–    Your text talks about two ways to do this:

•   Polynomial Hash Codes (for strings)

•   Folding and/or summing components

•      Compress the hash code to the size of the array

–    Use the mod function to do this

 

Summing components

•      This can be used when the number of bits needed to store a key is greater than the number of bits desired for the hash code

–    Example:  We want to store the hash code in an integer (4 bytes) and the key is stored as a long, or as a float or a character string

•      One possibility is just to truncate the high order bits (or the low order ones)

–    This may result in many collisions if the difference in the items is often in the bits deleted

•      An alternative is to sum the high order bits with the low order bits and use the sum as the hash code

•      This may not work well for character strings, since some combinations of letters are much more common than others

–    Example  stop, tops, pots, spot etc.
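The anagram problem can be demonstrated directly. In this sketch (names are illustrative), a position-blind sum of character codes gives the same value for "stop" and "tops", while a position-aware polynomial code with g = 37, as described on a later slide, keeps them apart:

```java
// Contrast a summing hash code with a polynomial one on anagrams.
public class AnagramHashDemo {
    // Sum of character codes: ignores character positions.
    public static int sumHash(String key) {
        int hash = 0;
        for (int i = 0; i < key.length(); i++)
            hash += key.charAt(i);
        return hash;
    }

    // Polynomial hash with g = 37 via Horner's method: position-aware.
    public static int polyHash(String key) {
        int hash = 0;
        for (int i = 0; i < key.length(); i++)
            hash = 37 * hash + key.charAt(i);
        return hash;
    }

    public static void main(String[] args) {
        System.out.println(sumHash("stop") == sumHash("tops"));   // true
        System.out.println(polyHash("stop") == polyHash("tops")); // false
    }
}
```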

 

Examples of a Summing hash function

•      A simple hash function

–   Just add together the ASCII code of all the characters in the string making up the key

•      int hashCode( )
{    int hash = 0;
     for ( int i = 0; i < key.length( ); i++ )
          hash = hash + key.charAt( i );
     return hash;
}   // end hashCode

•      This assumes the key is a String

•      It must go inside a class that implements Map

 

Polynomial Hash Codes

•      This takes into account the position of the character or bytes in the key

•      Consider each byte (ui) in the key as a coefficient of a polynomial of degree n-1, where n is the number of bytes in the key

•      Choose a nonzero constant g unequal to one and find the hash code value this way:

–     u0·g^(n-1) + u1·g^(n-2) + … + u(n-2)·g + u(n-1)

•      This can be evaluated using Horner’s method rather quickly.

•      Analysis has been done, and it seems that there are some values for g that are much better than others

–    33, 37, 39, and 41 are particularly good choices

–    We will look at an implementation with g = 37

 

Horner’s method

•       Horner’s method minimizes the number of arithmetic operations when evaluating a polynomial:
        u0·g^(n-1) + u1·g^(n-2) + … + u(n-2)·g + u(n-1)

becomes

( … ((u0·g + u1)·g + u2)·g + … + u(n-2) )·g + u(n-1)

•      Let’s make a concrete example; evaluate
     4g^3 + 5g^2 + 3g + 8  where g is 2; here n = 4

•      Evaluating using Horner’s method:
  ((4g + 5)g + 3)g + 8 = ((4·2 + 5)·2 + 3)·2 + 8 = 66
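The worked example above can be checked in code. This sketch (names are illustrative) evaluates a polynomial by Horner's method, using one multiply and one add per coefficient:

```java
// Evaluate a polynomial by Horner's method and check the example:
// 4g^3 + 5g^2 + 3g + 8 at g = 2.
public class HornerDemo {
    // Coefficients u[0..n-1] are given highest power first.
    public static int horner(int[] u, int g) {
        int result = 0;
        for (int coeff : u)
            result = result * g + coeff;  // one multiply, one add per term
        return result;
    }

    public static void main(String[] args) {
        int[] u = {4, 5, 3, 8};
        System.out.println(horner(u, 2));  // prints 66
    }
}
```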

 

Polynomial hash function

•       This is a good hash function to use if the key is a string

•      int hashCode( )
{   int hash = 0;
     int n = key.length( );
     for ( int i = 0; i < n; i++ )
          hash = 37 * hash + key.charAt( i );   // Horner’s method
     return hash;
}

•      This may overflow and produce a negative number; we will take care of that when we call hashCode

•      If the key is very long, the programmer should decide how to shorten it

–    Use every other character

–    Knowledge of the form of the key should help here

 

Mapping the hash code to the range of the array

•      N is the size of the array
h(k) = k mod N

•   If you divide by N, how many possible remainders are there?

•   N should be prime to minimize collisions caused by bunching of the data

•      private int getHashIndex(Object key)
{      int hashIndex = key.hashCode( ) % hashTable.size( );
       if ( hashIndex < 0 )   // if the hash code was negative
            hashIndex = hashIndex + hashTable.size( );
       return hashIndex;
}

 

Collision Processing

•      What to do when two keys hash to the same index

•      Types of collision resolution

–   Open Addressing (only space is in the array)

•   Linear Probing

•   Quadratic Probing

•   Double hashing

–   Separate Chaining (add extra nodes as needed)

•   Use an array of linked lists  or an array of pointers

 

Collision Resolution with Open Addressing

•      All the records go in the array itself; no separate data structure is used

•      Collisions are resolved by using empty cells in the array

•      A bigger array is needed to reduce collisions

–   usually about twice the size of the number of records expected

 

Linear probing with Open Addressing

•      Simplest way--put the record in the next available space

–    The probe increment is always one

•    i.e., if there is a collision, one is always added to the index

•      Retrieval does a sequential search from the hashed address until the record is found

•      As long as the table is big enough space can always be found

•      But it can lead to a long sequential search

 

Using Linear probing for collision resolution

•      Example: H(key) = key mod 23

•      Add these keys to the hash table
 3, 4, 5, 26, 6, 7, 23, 16, 39, 17, 22,  55

•      Primary clustering

–    any key that hashes into a cluster will take several attempts to find an empty space

–    once placed, the key makes the cluster even larger

•      Linear probing does not come very close to uniform hashing, since primary clustering is likely even in relatively sparse tables

 

Quadratic probing using open addressing

•      The probe increment is the square of an integer, not the integer itself

•      If there is a collision, add j^2 to the original index,
     for j = 1, 2, 3, 4, …

•      With the first collision, add 1 (1^2) to the index;
if that still collides, add 4 (2^2) to the original index;
if that collides, add 9 (3^2), and so on

•      This avoids primary clustering, but can cause secondary clustering, where the set of filled array cells “bounces” around the array.

•      It is even possible that it may fail to find an empty bucket even though one is available.
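The probe sequence just described can be sketched as follows (a minimal illustration; names are not from the text): the j-th probe examines slot (h + j^2) mod N, where h is the home index.

```java
// Generate the slots examined by quadratic probing from home index h.
public class QuadraticProbeDemo {
    // Return the first maxProbes slots examined for home index h
    // in a table of size n.
    public static int[] probeSequence(int h, int n, int maxProbes) {
        int[] probes = new int[maxProbes];
        for (int j = 0; j < maxProbes; j++)
            probes[j] = (h + j * j) % n;   // j = 0 is the home slot itself
        return probes;
    }

    public static void main(String[] args) {
        // Home slot 6 in a table of size 23: probes 6, 7, 10, 15, 22, ...
        System.out.println(java.util.Arrays.toString(probeSequence(6, 23, 5)));
    }
}
```

Because the probe offsets are 0, 1, 4, 9, …, two keys with the same home slot still follow the same probe path, which is exactly the secondary clustering mentioned above.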

 

Double hashing using open addressing

•      You have two hash functions H1(key) and H2(key)

•      The second hash function tells you what the probe increment must be

•      One way to use the double hash functions

–    Used only when there is a collision

New_index = [H1(key) + H2(key) ] mod N

Example:

H1(key) = key mod 23; H2(key) = 1 + key mod 21;

Add these keys to the hash table

3, 4, 5, 26, 6, 7, 23, 16, 39, 17, 22,  55
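The example above can be worked through in code. This sketch (names are illustrative) uses H1(key) = key % 23 for the home slot and H2(key) = 1 + key % 21 as the probe increment, applied repeatedly until an empty slot is found:

```java
// Insert the slide's keys into a table of size 23 using double hashing:
// H1(key) = key % 23, H2(key) = 1 + key % 21.
public class DoubleHashDemo {
    static final int EMPTY = -1;

    public static int[] buildTable(int[] keys, int n) {
        int[] table = new int[n];
        java.util.Arrays.fill(table, EMPTY);
        for (int key : keys) {
            int index = key % n;         // H1: home slot
            int step = 1 + key % 21;     // H2: never zero, so we always move
            while (table[index] != EMPTY)
                index = (index + step) % n;
            table[index] = key;
        }
        return table;
    }

    public static void main(String[] args) {
        int[] keys = {3, 4, 5, 26, 6, 7, 23, 16, 39, 17, 22, 55};
        System.out.println(java.util.Arrays.toString(buildTable(keys, 23)));
        // e.g. 26 collides with 3 at slot 3 and jumps H2(26) = 6 slots to 9
    }
}
```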

 

Removing and searching

•      With open addressing, removing an item can cause a real problem

•      As long as no items are removed, searching takes the same path as inserting.

•      The search starts at the index hashed to by the key.

–    If the key is not found at that index, it searches sequentially until it either finds the key, or finds an empty slot

–    If the item is not found by the time it finds the empty slot, it is assumed that key is not in the array.

 

Removal with open addressing

•      If an item has been removed, a search may find an empty slot that the insert did not.

–   Thus it will stop the search too soon

•      One way to solve this is to have three states for each slot in the array

–   Empty

–   Occupied with data

–   Available because data has been removed

•      There are various ways to implement this, depending on what the key is

 

Collision resolution by chaining

–   The hash table is an array of linked lists; the hash function gives the index into that array

–   Each key that hashes to a given index is kept in the linked list at that index

Inserting  with chaining

–   First, find the index with hashing

–   Next, search the list to see if already in list

–   If new, insert either at beginning or end of list (programmer choice)

•   use linked list insertion

•   newly inserted objects are often the most frequently accessed, which favors inserting at the front
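The insertion steps above can be sketched as a small Java class (a minimal illustration assuming integer keys; names are not from the text): hash to pick the bucket, search the chain for a duplicate, then insert at the front of the list.

```java
import java.util.LinkedList;

// Separate chaining: an array of linked lists, one chain per index.
public class ChainingDemo {
    private final LinkedList<Integer>[] buckets;

    @SuppressWarnings("unchecked")
    public ChainingDemo(int size) {
        buckets = new LinkedList[size];
        for (int i = 0; i < size; i++)
            buckets[i] = new LinkedList<>();
    }

    // Insert at the front of the chain if the key is not already present.
    public void insert(int key) {
        LinkedList<Integer> chain = buckets[key % buckets.length];
        if (!chain.contains(key))   // search the chain first
            chain.addFirst(key);    // front insertion: recent keys found fast
    }

    public boolean contains(int key) {
        return buckets[key % buckets.length].contains(key);
    }

    public static void main(String[] args) {
        ChainingDemo table = new ChainingDemo(11);
        table.insert(25);
        table.insert(36);  // 36 % 11 == 3, same bucket as 25: no collision problem
        System.out.println(table.contains(36));  // true
    }
}
```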

 

Searching with separate chaining

–     First, apply the hash function to get the index into an array

–     Then search the chain for the element

•      Deleting with separate chaining

–     First, hash to find the index

–     Next, sequential search to find the record

–     Then, use linked list deletion

•      Table size with separate chaining

–     Table size should be about the same as the number of records

–     But… should be a prime number to reduce collisions

•    Remember table size is determined by the mod function divisor in the hash function

•      Disadvantages of separate chaining

–     Using pointers slows the algorithm

•    time required to allocate new nodes

–     Must use two data structures

 

Deciding on table size

•      The size of the table is the determining factor in how efficient hashing is

–    Collision resolution takes considerably more time than evaluating the hash function

–    So the number of collisions determines how efficient the hashing is.

–    The number of collisions depends on how full a hash table is

•      The load factor is the ratio of the number of entries in the dictionary to the size of the hash array.

 

The Load Factor   λ

Load factor λ = #of keys / #of slots in array

•      Load factors for open addressing

–     λ ranges from 0 (empty table) to 1 (full table)

–    The number of collisions increases dramatically when λ > .5 (when the table is over half full)

•      Load factors for separate chaining

–     λ can be greater than one

–    In separate chaining, λ is thought of as the average length of the chains (which is the same as the definition above)

–    To maintain reasonable efficiency, λ should not go much above 1

 

Rehashing (Reallocating the hash table)

•      The hash table’s load factor must be kept low to maintain O(1) average time complexity

•      If more and more records are added will need to allocate a new table or array.

•      This is called rehashing.

–    The hash function must reflect the new table size.

–    All the elements in the old table must be inserted into the new table using the new hash function.

•      A good choice is to rehash into an array about double the size of the original array

–    The size of the new array should be a prime number.
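Picking the new size can be sketched as follows (an illustrative helper, not from the text): roughly double the old size, then move up to the next prime number.

```java
// Choose a new table size for rehashing: the smallest prime
// that is at least double the old size.
public class RehashSizeDemo {
    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++)
            if (n % d == 0) return false;
        return true;
    }

    // Smallest prime >= 2 * oldSize.
    public static int newTableSize(int oldSize) {
        int n = 2 * oldSize;
        while (!isPrime(n))
            n++;
        return n;
    }

    public static void main(String[] args) {
        System.out.println(newTableSize(23));  // prints 47
    }
}
```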

 

Brief analysis of types of collision resolution

•      Open addressing may save some space over separate chaining

–   But it is not necessarily faster.

•      In experimental and theoretical analyses, the chaining method is either competitive with or faster than open addressing

•      So, the collision-handling method of choice seems to be separate chaining.