Hashing
as a Dictionary Implementation

Chapter 19 in Carrano/Savitch

 

The Dictionary Problem

•      The problem can be defined as finding, inserting and deleting records in a database

•      Usually we want the search to be as efficient as possible

•      What data structures do you know that allow for fast searches?

 

Ways to solve the dictionary problem

•      Keep the records sorted in an array and do a binary search

•      Keep the records in a binary search tree

•      Keep the records in a balanced binary search tree (AVL or red-black tree)

•      Keep the records in a B-tree

–    Each node in this tree can have many children (often 20 or more)

–    Used when the number of records is so large that most of the database is on secondary storage

–    The primary data structure used in database programs

•      Keep the records in an array indexed by the keys themselves.

–    The key is part of the record, and is used for an index

 

Indexing

•      Suppose the index were the same as the key

•      In a class, everyone is given a number to serve as their student number.  It is one field of the record.

•      This student number is the index into the array.

–    How many records would the array have to have if the student number were four digits?

–    How long would it take to retrieve information about a student, given the student number?

–    Would the records have to be stored in alphabetical order for a fast search?

•      What are the problems with this system?

 

From Indexing to hashing

•      Suppose we made an array with SS# as the index into the array

–    time to find a record?

–    Problem?

•      Now suppose we let the index into the array be some function of the SS# (i.e. key)

–    Maybe last 4 digits

–    Maybe add first 3, next two, and last four, truncate to three digits

•      How big would the array be?

–    Problems?

 

General idea of hashing

•      The information (the key, value pair) is kept in an array of objects

•      Each object has a unique key field (name, number, whatever)

•      A hash function relates the key to the index

–    the hash function maps the key to the index of the array

•      A collision occurs when two keys hash to the same index

–    must figure out a way to handle this

–    The simplest way is just to put the object in the next available slot in the array.

•    This is called Linear Probing

 

Simple example of hashing 

•      Put these keys into a hash table of size 11
  
25, 18, 6, 17, 21, 32, 20, 28

•      Our hash function will just be
  
index = key % tablesize

•      We will handle collisions with linear probing

–   If we get a collision, add one to the index, and try again

–   If another collision, just add one to that index again (put it in the next available slot)
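The insertion process above can be sketched in Java. This is a minimal illustration of the slide's example (the class and method names are illustrative, not from the text): each key goes to slot key % 11, and on a collision the probe moves one slot to the right, wrapping around at the end of the array.

```java
// Insert the slide's keys into a table of size 11 using
// index = key % 11 and linear probing for collisions.
public class LinearProbingDemo {
    static final int EMPTY = -1;

    // Build the hash table by inserting each key with linear probing.
    public static int[] buildTable(int[] keys, int size) {
        int[] table = new int[size];
        java.util.Arrays.fill(table, EMPTY);
        for (int key : keys) {
            int index = key % size;
            while (table[index] != EMPTY)      // collision: try next slot
                index = (index + 1) % size;    // wrap around at the end
            table[index] = key;
        }
        return table;
    }

    public static void main(String[] args) {
        int[] keys = {25, 18, 6, 17, 21, 32, 20, 28};
        System.out.println(java.util.Arrays.toString(buildTable(keys, 11)));
        // e.g. 17 hashes to slot 6, collides with 6 there and with 18
        // at slot 7, and finally lands in slot 8
    }
}
```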

 

Considerations when using hashing

•     What hash function to use

•     How to handle collisions

•     How big to make the array

–  The efficiency of hashing is determined by the amount of storage you are willing to waste

•   More space, fewer collisions

 

 Criteria for good hash functions

•      Quick and easy to compute

•      Minimize the number of collisions

•      Achieve an even distribution of the records across the range of indices (uniform hashing function)

 

Components of a Hashing function

•      Convert the search key into a hash code

–    This involves taking the key (which is not necessarily an integer) and changing it to an integer

–    This must be done in a way that avoids collisions as much as possible

–    Your text talks about two ways to do this:

•   Polynomial Hash Codes (for strings)

•   Folding and/or summing components

•      Compress the hash code to the size of the array

–    Use the mod function to do this

 

Summing components

•      This can be used when the number of bits needed to store a key is greater than the number of bits desired for the hash code

–    Example:  We want to store the hash code in an integer (4 bytes) and the key is stored as a long, or as a float or a character string

•      One possibility is just to truncate the high order bits (or the low order ones)

–    This may result in many collisions if the difference in the items is often in the bits deleted

•      An alternative is to sum the high order bits with the low order bits and use the sum as the hash code

•      This may not work well for character strings, since some combinations of letters are much more common than others

–    Example  stop, tops, pots, spot etc.
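The anagram problem can be demonstrated directly. In this sketch (names are illustrative), a position-blind sum of character codes gives the same value for "stop" and "tops", while a position-aware polynomial code with g = 37, as described on a later slide, keeps them apart:

```java
// Contrast a summing hash code with a polynomial one on anagrams.
public class AnagramHashDemo {
    // Sum of character codes: ignores character positions.
    public static int sumHash(String key) {
        int hash = 0;
        for (int i = 0; i < key.length(); i++)
            hash += key.charAt(i);
        return hash;
    }

    // Polynomial hash with g = 37 via Horner's method: position-aware.
    public static int polyHash(String key) {
        int hash = 0;
        for (int i = 0; i < key.length(); i++)
            hash = 37 * hash + key.charAt(i);
        return hash;
    }

    public static void main(String[] args) {
        System.out.println(sumHash("stop") == sumHash("tops"));   // true
        System.out.println(polyHash("stop") == polyHash("tops")); // false
    }
}
```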

 

Examples of a Summing hash function

•      A simple hash function

–   Just add together the ASCII code of all the characters in the string making up the key

•      int hashCode( )
{    int hash = 0;
     for ( int i = 0; i < key.length( ); i++ )
          hash = hash + key.charAt( i );
     return hash;
}   // end hashCode

•      This assumes the key is a String

•      It must go inside a class that implements Map

 

Polynomial Hash Codes

•      This takes into account the position of the character or bytes in the key

•      Consider each byte (ui) in the key as a coefficient of a polynomial of degree n-1, where n is the number of bytes in the key

•      Choose a nonzero constant g unequal to one and find the hash code value this way:

–     u0·g^(n-1) + u1·g^(n-2) + … + u(n-2)·g + u(n-1)

•      This can be evaluated using Horner’s method rather quickly.

•      Analysis has been done, and it seems that there are some values for g that are much better than others

–    33, 37, 39, and 41 are particularly good choices

–    We will look at an implementation with g = 37

 

Horner’s method

•       Horner’s method minimizes the number of arithmetic operations when evaluating a polynomial:
        u0·g^(n-1) + u1·g^(n-2) + … + u(n-2)·g + u(n-1)

becomes

( … ((u0·g + u1)·g + u2)·g + … + u(n-2) )·g + u(n-1)

•      Let’s make a concrete example; evaluate
     4g^3 + 5g^2 + 3g + 8  where g is 2; here n = 4

•      Evaluating using Horner’s method:
  ((4g + 5)g + 3)g + 8 = ((4·2 + 5)·2 + 3)·2 + 8 = 66
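The worked example above can be checked in code. This sketch (names are illustrative) evaluates a polynomial by Horner's method, using one multiply and one add per coefficient:

```java
// Evaluate a polynomial by Horner's method and check the example:
// 4g^3 + 5g^2 + 3g + 8 at g = 2.
public class HornerDemo {
    // Coefficients u[0..n-1] are given highest power first.
    public static int horner(int[] u, int g) {
        int result = 0;
        for (int coeff : u)
            result = result * g + coeff;  // one multiply, one add per term
        return result;
    }

    public static void main(String[] args) {
        int[] u = {4, 5, 3, 8};
        System.out.println(horner(u, 2));  // prints 66
    }
}
```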

 

Polynomial hash function

•       This is a good hash function to use if the key is a string

•      int hashCode( )
{   int hash = 0;
     int n = key.length( );
     for ( int i = 0; i < n; i++ )
          hash = 37 * hash + key.charAt( i );   // Horner’s method
     return hash;
}

•      This may overflow and produce a negative number; we will take care of that when we call hashCode

•      If the key is very long, the programmer should decide how to shorten it

–    Use every other character

–    Knowledge of the form of the key should help here

 

Mapping the hash code to the range of the array

•      N is the size of the array
h(k) = k mod N

•   If you divide by N, how many possible remainders are there?

•   N should be prime to minimize collisions caused by bunching of the data

•      private int getHashIndex(Object key)
{      int hashIndex = key.hashCode( ) % hashTable.size( );
       if ( hashIndex < 0 )   // if the hash code was negative
            hashIndex = hashIndex + hashTable.size( );
       return hashIndex;
}

 

Collision Processing

•      What to do when two keys hash to the same index

•      Types of collision resolution

–   Open Addressing (only space is in the array)

•   Linear Probing

•   Quadratic Probing

•   Double hashing

–   Separate Chaining (add extra nodes as needed)

•   Use an array of linked lists  or an array of pointers

 

Collision Resolution with Open Addressing

•      All the records go in the array itself; no separate data structure is used

•      Collisions are resolved by using empty cells in the array

•      A bigger array is needed to reduce collisions

–   usually about twice the size of the number of records expected

 

Linear probing with Open Addressing

•      Simplest way--put the record in the next available space

–    The probe increment is always one

•    i.e., if there is a collision, one is always added to the index

•      Retrieval does a sequential search from the hashed address until the record is found

•      As long as the table is big enough space can always be found

•      But it can lead to a long sequential search

 

Using Linear probing for collision resolution

•      Example: H(key) = key mod 23

•      Add these keys to the hash table
 3, 4, 5, 26, 6, 7, 23, 16, 39, 17, 22,  55

•      Primary clustering

–    any key that hashes into a cluster will take several attempts to find an empty space

–    once placed, the key makes the cluster even larger

•      Linear probing does not come very close to uniform hashing, since primary clustering is likely even in relatively sparse tables

 

Quadratic probing using open addressing

•      The probe increment is the square of an integer, not the integer itself

•      If there is a collision, add j^2 to the original index,
     for j = 1, 2, 3, 4, …

•      With the first collision, add 1 (1^2) to the index;
if that still collides, add 4 (2^2) to the original index;
if that collides, add 9 (3^2), and so on

•      This avoids primary clustering, but can cause secondary clustering, where the set of filled array cells “bounces” around the array.

•      It is even possible that it may fail to find an empty bucket even though one is available.
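The probe sequence just described can be sketched as follows (a minimal illustration; names are not from the text): the j-th probe examines slot (h + j^2) mod N, where h is the home index.

```java
// Generate the slots examined by quadratic probing from home index h.
public class QuadraticProbeDemo {
    // Return the first maxProbes slots examined for home index h
    // in a table of size n.
    public static int[] probeSequence(int h, int n, int maxProbes) {
        int[] probes = new int[maxProbes];
        for (int j = 0; j < maxProbes; j++)
            probes[j] = (h + j * j) % n;   // j = 0 is the home slot itself
        return probes;
    }

    public static void main(String[] args) {
        // Home slot 6 in a table of size 23: probes 6, 7, 10, 15, 22, ...
        System.out.println(java.util.Arrays.toString(probeSequence(6, 23, 5)));
    }
}
```

Because the probe offsets are 0, 1, 4, 9, …, two keys with the same home slot still follow the same probe path, which is exactly the secondary clustering mentioned above.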

 

Double hashing using open addressing

•      You have two hash functions H1(key) and H2(key)

•      The second hash function tells you what the probe increment must be

•      One way to use the double hash functions

–    Used only when there is a collision

New_index = [H1(key) + H2(key) ] mod N

Example:

H1(key) = key mod 23; H2(key) = 1 + key mod 21;

Add these keys to the hash table

3, 4, 5, 26, 6, 7, 23, 16, 39, 17, 22,  55
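The example above can be worked through in code. This sketch (names are illustrative) uses H1(key) = key % 23 for the home slot and H2(key) = 1 + key % 21 as the probe increment, applied repeatedly until an empty slot is found:

```java
// Insert the slide's keys into a table of size 23 using double hashing:
// H1(key) = key % 23, H2(key) = 1 + key % 21.
public class DoubleHashDemo {
    static final int EMPTY = -1;

    public static int[] buildTable(int[] keys, int n) {
        int[] table = new int[n];
        java.util.Arrays.fill(table, EMPTY);
        for (int key : keys) {
            int index = key % n;         // H1: home slot
            int step = 1 + key % 21;     // H2: never zero, so we always move
            while (table[index] != EMPTY)
                index = (index + step) % n;
            table[index] = key;
        }
        return table;
    }

    public static void main(String[] args) {
        int[] keys = {3, 4, 5, 26, 6, 7, 23, 16, 39, 17, 22, 55};
        System.out.println(java.util.Arrays.toString(buildTable(keys, 23)));
        // e.g. 26 collides with 3 at slot 3 and jumps H2(26) = 6 slots to 9
    }
}
```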

 

Removing and searching

•      With open addressing, removing an item can cause a real problem

•      As long as no items are removed, searching takes the same path as inserting.

•      The search starts at the index hashed to by the key.

–    If the key is not found at that index, it searches sequentially until it either finds the key, or finds an empty slot

–    If the item is not found by the time it finds the empty slot, it is assumed that key is not in the array.

 

Removal with open addressing

•      If an item has been removed, a search may find an empty slot that the insert did not.

–   Thus it will stop the search too soon

•      One way to solve this is to have three states for each slot in the array

–   Empty

–   Occupied with data

–   Available because data has been removed

•      There are various ways to implement this, depending on what the key is

 

Collision resolution by chaining

–   The hash table is an array of linked lists; the hash function gives the index into that array

–   Each key that hashes to a given index is kept in the linked list at that index

Inserting  with chaining

–   First, find the index with hashing

–   Next, search the list to see if already in list

–   If new, insert either at beginning or end of list (programmer choice)

•   use linked list insertion

•   newly inserted objects are often the most frequently accessed, which favors inserting at the front
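The insertion steps above can be sketched as a small Java class (a minimal illustration assuming integer keys; names are not from the text): hash to pick the bucket, search the chain for a duplicate, then insert at the front of the list.

```java
import java.util.LinkedList;

// Separate chaining: an array of linked lists, one chain per index.
public class ChainingDemo {
    private final LinkedList<Integer>[] buckets;

    @SuppressWarnings("unchecked")
    public ChainingDemo(int size) {
        buckets = new LinkedList[size];
        for (int i = 0; i < size; i++)
            buckets[i] = new LinkedList<>();
    }

    // Insert at the front of the chain if the key is not already present.
    public void insert(int key) {
        LinkedList<Integer> chain = buckets[key % buckets.length];
        if (!chain.contains(key))   // search the chain first
            chain.addFirst(key);    // front insertion: recent keys found fast
    }

    public boolean contains(int key) {
        return buckets[key % buckets.length].contains(key);
    }

    public static void main(String[] args) {
        ChainingDemo table = new ChainingDemo(11);
        table.insert(25);
        table.insert(36);  // 36 % 11 == 3, same bucket as 25: no collision problem
        System.out.println(table.contains(36));  // true
    }
}
```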

 

Searching with separate chaining

–     First, apply the hash function to get the index into an array

–     Then search the chain for the element

•      Deleting with separate chaining

–     First, hash to find the index

–     Next, sequential search to find the record

–     Then, use linked list deletion

•      Table size with separate chaining

–     Table size should be about the same as the number of records

–     But… should be a prime number to reduce collisions

•    Remember table size is determined by the mod function divisor in the hash function

•      Disadvantages of separate chaining

–     Using pointers slows the algorithm

•    time required to allocate new nodes

–     Must use two data structures

 

Deciding on table size

•      The size of the table is the determining factor in how efficient hashing is

–    Collision resolution takes considerably more time than evaluating the hash function

–    So the number of collisions determines how efficient the hashing is.

–    The number of collisions depends on how full a hash table is

•      The load factor is the ratio of the number of entries in the dictionary to the size of the hash array.

 

The Load Factor   λ

Load factor λ = #of keys / #of slots in array

•      Load factors for open addressing

–     λ ranges from 0 (empty table) to 1 (full table)

–    The number of collisions increases dramatically when λ > .5 (when the table is over half full)

•      Load factors for separate chaining

–     λ can be greater than one

–    In separate chaining, λ is thought of as the average length of the chains (which is the same as the definition above)

–    To maintain reasonable efficiency, λ should not go much above 1

 

Rehashing (Reallocating the hash table)

•      The hash table’s load factor must be kept low to maintain O(1) average time complexity

•      If more and more records are added will need to allocate a new table or array.

•      This is called rehashing.

–    The hash function must reflect the new table size.

–    All the elements in the old table must be inserted into the new table using the new hash function.

•      A good choice is to rehash into an array about double the size of the original array

–    The size of the new array should be a prime number.
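Picking the new size can be sketched as follows (an illustrative helper, not from the text): roughly double the old size, then move up to the next prime number.

```java
// Choose a new table size for rehashing: the smallest prime
// that is at least double the old size.
public class RehashSizeDemo {
    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++)
            if (n % d == 0) return false;
        return true;
    }

    // Smallest prime >= 2 * oldSize.
    public static int newTableSize(int oldSize) {
        int n = 2 * oldSize;
        while (!isPrime(n))
            n++;
        return n;
    }

    public static void main(String[] args) {
        System.out.println(newTableSize(23));  // prints 47
    }
}
```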

 

Brief analysis of types of collision resolution

•      Open addressing may save some space over separate chaining

–   But it is not necessarily faster.

•      In experimental and theoretical analyses, the chaining method is either competitive with or faster than open addressing

•      So, the collision-handling method of choice seems to be separate chaining.