Basic Algorithms

================ Start Lecture #12 ================

2.5 Dictionaries and Hash Tables

Dictionaries, as the name implies are used to contain data that may later be retrieved. Associated with each element is the key used for retrieval.

For example consider an element to be one student's NYU transcript and the key would be the student id number. So given the key (id number) the dictionary would return the entire element (the transcript).

2.5.1 the Unordered Dictionary ADT

A dictionary stores items, which are key-element (k,e) pairs.

We will study ordered dictionaries in the next chapter when we consider searching. Here we consider unordered dictionaries. So, for example, we do not support findSmallestKey. the methods we do support are

findElement(k): Return an element having key k or signal an error if no such element exists.
insertItem(k,e): Insert an item with key k and element e.
removeElement(k): Remove an item with key k and return its element. Signal an error if no such item exists.

Trivial Implementation: log files

Just store the items in a sequence.

Trivial (and fast) to insert: Θ(1)
Minimal space: Θ(n)
Slow for finding or removing elements: Θ(n) per operation

2.5.2 Hash Tables

The idea of a hash table is simple: Store the items in an array (as done for log files) but ``somehow'' be able to figure out quickly, i.e., Θ(1), which array element contains the item (k,e).

We first describe the array, which is easy, and then the ``somehow'', which is not so easy. Indeed in some sense it is impossible. What we can do is produce an implementation that, on the average, performs operations in time Θ(1).

Bucket Arrays

Allocate an array A of size N of buckets, each able to hold an item. Assume that the keys are integers in the range [0,N-1] and that no two items have the same key. Note that N may be much bigger than n. Now simply store the item (k,e) in A[k].

Analysis of Bucket Arrays

If everything works as we assumed, we have a very fast implementation: searches, insertions, and removals are Θ(1). But there are problems, which is why section 2.5 is not finished.

The keys might not be unique (although in many applications they are): This is a simple example of a collision. We discuss collisions in 2.5.5.
The keys might not be integers: We can always treat any (computer stored) object as an integer. Just view the object as a bunch of bits and then consider that a base two non-negative integer.
But those integers might not be ``computer integers'', that is, they might have more bits than the largest integer type in our programming language: True, we discuss methods for converting long bit strings into computer integers in 2.5.3.
But on many machines the number of computer integers is huge. We can't possibly have a bucket array of size 2⁶⁴: True, we discuss compressing computer integers into a smaller range in 2.5.4

2.5.3 Hash Functions

We need a hash function h that maps keys to integers in the range [0,N-1]. Then we will store the item (k,e) in bucket A[h(k)] (we are for now ignoring collisions). This problem is divided into two parts. A hash code assigns to each key a computer integer and then a compression map converts any computer integer into one in the range [0,N-1]. Each of these steps can introduce collisions. So even if the keys were unique to begin with, collisions are an important topic.

Hash Codes

A hash code assigns to any key an integer value. The problem we have to solve is that the key may have more bits than are permitted in our integer values. We first view the key as bunch of integer values (to be explained) and then combine these integer values into one.

If our integer values are restricted to 32 bits and our keys are 64 bits, we simply view the high order 32 bits as one value and the low order as another. In general if
⌈numBitsInKey / numBitsInIntegerValue⌉ = k
we view the key as k integer values. How should we combine the k values into one?

Summing Components

Simply add the k values.
But, but, but what about overflows?
Ignore them (or use exclusive or instead of addition).

Polynomial Hash Codes

The summing components method gives very many collisions when used for character strings. If 4 characters fill an integer value, then `temphash' and `hashtemp' will give the same value. If one decided to use integer values just large enough to hold one (unicode) character, then there would be many, many common collisions: `t21' and `t12' for one, mite and time for another.

If we call the k integer values x₀,...,x_k-1, then a better scheme for combining is to choose a positive integer value a and compute ∑x_iaⁱ=x₀+x₁a+..x_n-1a^n-1.

Same comment about overflows applies.

The authors have found that using a = 33, 37, 39, or 41 worked well for character strings that are English words.

2.5.4 Compression Maps

The problem we wish to solve in this section is to map integers in some, possibly large range, into integers in the range [0,N-1].
This is trivial! Why not map all the integers into 0.
We want to minimize collisions.

The Division Method

This is often called the mod method, especially if you use the ``correct'' definition of mod. One simple way to turn any integer x into one in the range [0,N-1] is to compute |x| mod N. That is we define the hash function h by

            h(x) = |x| mod N

(If we used the true mod we would not need the absolute value.)

Choosing N to be prime tends to lower the collision rate, but choosing N to be a power of 2 permits a faster computation since mod with a power of two simply means taking the low order bits.

The MAD Method

MAD stands for multiply-add-divide (mod is essentially division). We still use mod N to get the numbers in the range, but we are a little fancier and try to spread the numbers out first. Specifically we define the hash function h via.

            h(x) = |ax+b| mod N

The values a and b are chosen (often at random) as positive integers not a multiple of N.

2.5.5 Collision-Handling Schemes

The question we wish to answer is what to do when two distinct keys map to the same value, i.e., when h(k)=h(k'). In this case we have two items to store in one bucket. This discussion also covers the case where we permit multiple items to have the same key.

Separate Chaining

The idea is simple, each bucket instead of holding an item holds a reference to a container of items. That is each bucket refers to the trivial log file implementation of a dictionary, but only for the keys that map to this container.

The code is simple, you just error check and pass the work off to the trivial implementation used for the individual bucket.

Algorithm findElement(k):
    B←A[h(k)]
    if B is empty then
        return NO_SUCH_KEY
    // now just do the trivial linear search
    return B.findElement(k)

Algorithm insertItem(k,e):
    if A[h(k)] is empty then
        Create B, an empty sequence-based dictionary
        A[h(k)]←B
    else
        B←A[h(k)]
    B.insertItem(k,e)

Algorithm removeElement(k)
    B←A[h(k)
    if B is empty then
        return NO_SUCH_KEY
    else
        return B.removeElement(k)

Homework: R-2.19

Load Factors and Rehashing

We want the number of keys hashing to a given bucket to be small since the time to find a key at the end of the list is proportional to the size of the list, i.e., to the number of keys that hash to this value.

We can't do much about items that have the same key, so lets consider the (common) case where no two items have the same key.

The average size of a list is n/N, called the load factor, where n is the number of items and N is the number of buckets. Typically, one keeps the load factor below 1.0. The text asserts that 0.75 is common.

What should we do as more items are added to the dictionary? We make an ``extendable dictionary''. That is, as with an extendable array we double N and ``fix everything up'' In the case of an extendable dictionary, the fix up consists of recalculating the hash of every element (since N has doubled). In fact no one calls this an extendable dictionary. Instead one calls this scheme rehashing since one must rehash (i.e., recompute the hash) of each element when N is changed. Also N is normally chosen to be a prime number so instead of doubling, one chooses for the new N the smallest prime number above twice the old N.