Dictionaries, as the name implies, are used to contain data that may later be retrieved. Associated with each element is the key used for retrieval.
For example, consider an element to be one student's NYU transcript and the key to be the student id number. Given the key (the id number), the dictionary would return the entire element (the transcript).
A dictionary stores items, which are key-element (k,e) pairs.
We will study ordered dictionaries in the next chapter when we consider searching. Here we consider unordered dictionaries. So, for example, we do not support findSmallestKey. The methods we do support include findElement(k), insertItem(k,e), and removeElement(k).
Just store the items in a sequence.
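As a rough illustration of the sequence-based (log file) idea, here is a minimal Python sketch. The class name LogFileDictionary and the snake_case method names are illustrative assumptions; NO_SUCH_KEY mirrors the sentinel used in the pseudocode later in these notes. Insertion just appends, while finding and removing scan the whole sequence.

```python
NO_SUCH_KEY = object()  # sentinel meaning "key not present"


class LogFileDictionary:
    """Unordered dictionary kept as a plain sequence of (key, element) items."""

    def __init__(self):
        self._items = []

    def insert_item(self, k, e):
        # Constant time: append the new item at the end of the sequence.
        self._items.append((k, e))

    def find_element(self, k):
        # Linear time in the worst case: scan until the key is found.
        for key, elem in self._items:
            if key == k:
                return elem
        return NO_SUCH_KEY

    def remove_element(self, k):
        # Linear time: locate the first item with this key, then delete it.
        for i, (key, elem) in enumerate(self._items):
            if key == k:
                del self._items[i]
                return elem
        return NO_SUCH_KEY
```

The linear scans are what the hash table described next tries to avoid.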
The idea of a hash table is simple: Store the items in an array (as done for log files) but ``somehow'' be able to figure out quickly, i.e., Θ(1), which array element contains the item (k,e).
We first describe the array, which is easy, and then the ``somehow'', which is not so easy. Indeed in some sense it is impossible. What we can do is produce an implementation that, on the average, performs operations in time Θ(1).
Allocate an array A of size N of buckets, each able to hold an item. Assume that the keys are integers in the range [0,N-1] and that no two items have the same key. Note that N may be much bigger than n. Now simply store the item (k,e) in A[k].
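The bucket array under these assumptions (integer keys in [0,N-1], no duplicate keys) can be sketched as follows. This is only an illustration; the class name BucketArray and the range check are assumptions added here, not part of the notes.

```python
NO_SUCH_KEY = object()  # sentinel meaning "key not present"


class BucketArray:
    """Array A of N buckets; the item (k, e) is stored directly in A[k]."""

    def __init__(self, N):
        self._A = [None] * N  # None marks an empty bucket

    def _check(self, k):
        # Assumption from the text: keys are integers in [0, N-1].
        if not 0 <= k < len(self._A):
            raise KeyError(k)

    def insert_item(self, k, e):
        self._check(k)
        self._A[k] = (k, e)

    def find_element(self, k):
        self._check(k)
        item = self._A[k]
        return NO_SUCH_KEY if item is None else item[1]

    def remove_element(self, k):
        self._check(k)
        item = self._A[k]
        if item is None:
            return NO_SUCH_KEY
        self._A[k] = None
        return item[1]
```

Each operation is a single array access, but the array needs N slots no matter how few items are stored.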
If everything works as we assumed, we have a very fast implementation: searches, insertions, and removals are Θ(1). But there are problems, which is why section 2.5 is not finished.
We need a hash function h that maps keys to integers in the range [0,N-1]. Then we will store the item (k,e) in bucket A[h(k)] (we are for now ignoring collisions). This problem is divided into two parts. A hash code assigns to each key a computer integer and then a compression map converts any computer integer into one in the range [0,N-1]. Each of these steps can introduce collisions. So even if the keys were unique to begin with, collisions are an important topic.
A hash code assigns to any key an integer value. The problem we have to solve is that the key may have more bits than are permitted in our integer values. We first view the key as a bunch of integer values (to be explained) and then combine these integer values into one.
If our integer values are restricted to 32 bits and our keys are 64
bits, we simply view the high order 32 bits as one value and the low
order as another. In general if
⌈numBitsInKey / numBitsInIntegerValue⌉ = k
we view the key as k integer values. How should we combine the k
values into one?
Simply add the k values.
But, but, but what about overflows?
Ignore them (or use exclusive or instead of addition).
The summing components method gives very many collisions when used for character strings. If 4 characters fill an integer value, then `temphash' and `hashtemp' will give the same value, since addition does not depend on the order of the components. If one decided to use integer values just large enough to hold one (unicode) character, then there would be many, many common collisions: `t21' and `t12' for one, `mite' and `time' for another.
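The summing scheme and its weakness on strings can be seen in a short sketch. The function names, the choice of byte strings as keys, and the 32-bit mask are assumptions made for illustration; the mask plays the role of "ignoring overflow".

```python
MASK32 = 0xFFFFFFFF  # keep only the low 32 bits, i.e. ignore overflow


def components(key: bytes, width: int = 4):
    """View the key as a list of integers, each built from `width` bytes."""
    return [int.from_bytes(key[i:i + width], "big")
            for i in range(0, len(key), width)]


def sum_hash_code(key: bytes, width: int = 4) -> int:
    """Combine the components by addition, discarding overflow."""
    total = 0
    for x in components(key, width):
        total = (total + x) & MASK32
    return total


def xor_hash_code(key: bytes, width: int = 4) -> int:
    """Alternative combination: exclusive or, which cannot overflow."""
    total = 0
    for x in components(key, width):
        total ^= x
    return total


# Reordering the components never changes the sum (or the xor), so these collide:
same_4 = sum_hash_code(b"temphash") == sum_hash_code(b"hashtemp")
same_1 = sum_hash_code(b"mite", width=1) == sum_hash_code(b"time", width=1)
```

Both same_4 and same_1 come out true, which is exactly the collision problem described above.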
If we call the k integer values x_0, ..., x_{k-1}, then a better scheme for combining is to choose a positive integer value a and compute ∑_{i=0}^{k-1} x_i a^i = x_0 + x_1 a + ... + x_{k-1} a^{k-1}.
The same comment about overflows applies.
The authors have found that using a = 33, 37, 39, or 41 worked well for character strings that are English words.
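A polynomial hash code for strings might look like the sketch below, which treats each character as one integer value and evaluates the polynomial with Horner's rule. The function name, the default a = 33, and the 32-bit mask are assumptions chosen for illustration.

```python
MASK32 = 0xFFFFFFFF  # keep only the low 32 bits, i.e. ignore overflow


def polynomial_hash_code(s: str, a: int = 33) -> int:
    """Compute x_0 + x_1*a + ... + x_{k-1}*a^(k-1) modulo 2^32.

    x_i is the code point of the i-th character of s.
    """
    h = 0
    # Horner's rule: start from x_{k-1} and work down to x_0.
    for ch in reversed(s):
        h = (h * a + ord(ch)) & MASK32
    return h


# Unlike plain summing, the position of each character now matters.
distinct = polynomial_hash_code("mite") != polynomial_hash_code("time")
```

Because each x_i is multiplied by a different power of a, anagrams such as `mite' and `time' no longer collide automatically.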
The problem we wish to solve in this section is to map integers in
some, possibly large range, into integers in the range [0,N-1].
This is trivial! Why not map all the integers into 0?
We want to minimize collisions.
This is often called the mod method, especially if you use the ``correct'' definition of mod. One simple way to turn any integer x into one in the range [0,N-1] is to compute |x| mod N. That is, we define the hash function h by
h(x) = |x| mod N
(If we used the true mod, we would not need the absolute value.)
Choosing N to be prime tends to lower the collision rate, but choosing N to be a power of 2 permits a faster computation since mod with a power of two simply means taking the low order bits.
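The sketch below illustrates the mod method and the power-of-2 shortcut. The function names are assumptions. Note that Python's % operator already behaves like the "true" mod for positive N (its result is never negative), so the absolute value is kept only to match the formula above.

```python
def division_compress(x: int, N: int) -> int:
    """h(x) = |x| mod N, for any integer x and positive N."""
    return abs(x) % N


def power_of_two_compress(x: int, N: int) -> int:
    """Same result when N is a power of 2: keep the low-order bits."""
    assert N > 0 and N & (N - 1) == 0, "N must be a power of 2"
    return abs(x) & (N - 1)


# Keys that share a common factor with N tend to pile up.
# Here every key is a multiple of 8:
keys = range(0, 64, 8)
buckets_pow2 = {division_compress(k, 8) for k in keys}   # all land in one bucket
buckets_prime = {division_compress(k, 7) for k in keys}  # spread over 7 buckets
```

This is only one hand-picked pattern of keys, but it shows the kind of regularity that a prime N tends to break up.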
MAD stands for multiply-add-divide (mod is essentially division). We still use mod N to get the numbers in the range, but we are a little fancier and try to spread the numbers out first. Specifically, we define the hash function h via
h(x) = |ax+b| mod N
The values a and b are chosen (often at random) as positive integers that are not multiples of N.
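One way to sketch the MAD method is a small factory that picks a and b once and returns the resulting h. The name make_mad_compress and the seed parameter are hypothetical; drawing a and b from [1,N-1] is just one simple way to guarantee they are positive and not multiples of N.

```python
import random


def make_mad_compress(N: int, seed=None):
    """Return h(x) = |a*x + b| mod N with a, b drawn at random from [1, N-1]."""
    if N < 2:
        raise ValueError("N must be at least 2")
    rng = random.Random(seed)
    a = rng.randrange(1, N)  # 1 <= a <= N-1, so a is not a multiple of N
    b = rng.randrange(1, N)  # 1 <= b <= N-1, so b is not a multiple of N

    def h(x: int) -> int:
        return abs(a * x + b) % N

    return h


h = make_mad_compress(11, seed=0)
values = [h(x) for x in range(-5, 6)]  # every value lies in [0, 10]
```

The same a and b must be reused for every key in a given table, which is why they are fixed when h is created rather than on each call.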
The question we wish to answer is what to do when two distinct keys map to the same value, i.e., when h(k)=h(k'). In this case we have two items to store in one bucket. This discussion also covers the case where we permit multiple items to have the same key.
The idea is simple: each bucket, instead of holding an item, holds a reference to a container of items. That is, each bucket refers to the trivial log file implementation of a dictionary, but only for the keys that map to this container.
The code is simple: you just error check and pass the work off to the trivial implementation used for the individual bucket.
Algorithm findElement(k):
    B←A[h(k)]
    if B is empty then
        return NO_SUCH_KEY
    // now just do the trivial linear search
    return B.findElement(k)

Algorithm insertItem(k,e):
    if A[h(k)] is empty then
        Create B, an empty sequence-based dictionary
        A[h(k)]←B
    else
        B←A[h(k)]
    B.insertItem(k,e)

Algorithm removeElement(k):
    B←A[h(k)]
    if B is empty then
        return NO_SUCH_KEY
    else
        return B.removeElement(k)
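The pseudocode above translates fairly directly into Python. In this sketch each bucket is either None (empty) or a list of (key, element) pairs standing in for the sequence-based dictionary. The class name, the default N = 11, and the use of Python's built-in hash() as the hash code are assumptions made for illustration.

```python
NO_SUCH_KEY = object()  # sentinel meaning "key not present"


class ChainedHashTable:
    """Hash table with separate chaining: each bucket holds a small list."""

    def __init__(self, N=11):
        self._N = N
        self._A = [None] * N  # None marks an empty bucket

    def _h(self, k):
        # Hash code from Python's hash(), then compression by the mod method.
        return abs(hash(k)) % self._N

    def find_element(self, k):
        B = self._A[self._h(k)]
        if B is None:
            return NO_SUCH_KEY
        for key, e in B:  # the trivial linear search within one bucket
            if key == k:
                return e
        return NO_SUCH_KEY

    def insert_item(self, k, e):
        i = self._h(k)
        if self._A[i] is None:
            self._A[i] = []  # create an empty sequence-based bucket
        self._A[i].append((k, e))

    def remove_element(self, k):
        B = self._A[self._h(k)]
        if B is None:
            return NO_SUCH_KEY
        for j, (key, e) in enumerate(B):
            if key == k:
                del B[j]
                return e
        return NO_SUCH_KEY
```

With N = 11, the integer keys 1 and 12 land in the same bucket, yet both remain retrievable because the bucket's list keeps them as separate items.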
Homework: R-2.19
We want the number of keys hashing to a given bucket to be small since the time to find a key at the end of the list is proportional to the size of the list, i.e., to the number of keys that hash to this value.
We can't do much about items that have the same key, so let's consider the (common) case where no two items have the same key.
The average size of a list is n/N, called the load factor, where n is the number of items and N is the number of buckets. Typically, one keeps the load factor below 1.0. The text asserts that 0.75 is common.
What should we do as more items are added to the dictionary? We make an ``extendable dictionary''. That is, as with an extendable array, we double N and ``fix everything up''. In the case of an extendable dictionary, the fix up consists of recalculating the hash of every element (since N has doubled). In fact, no one calls this an extendable dictionary. Instead one calls this scheme rehashing, since one must rehash (i.e., recompute the hash of) each element when N is changed. Also, N is normally chosen to be a prime number, so instead of doubling, one chooses for the new N the smallest prime number above twice the old N.
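Rehashing can be sketched as follows for a table stored as a list of buckets, each a list of (key, element) pairs. All function names, the simple trial-division primality test, and the 0.75 threshold as a default are assumptions for illustration; Python's hash() again stands in for the hash code.

```python
def is_prime(m: int) -> bool:
    """Trial division; adequate for table sizes in a small example."""
    if m < 2:
        return False
    i = 2
    while i * i <= m:
        if m % i == 0:
            return False
        i += 1
    return True


def next_prime_above(m: int) -> int:
    """Smallest prime strictly greater than m."""
    m += 1
    while not is_prime(m):
        m += 1
    return m


def rehash(buckets):
    """Move every item into a new array whose size is the smallest
    prime above twice the old size."""
    new_N = next_prime_above(2 * len(buckets))
    new_buckets = [[] for _ in range(new_N)]
    for bucket in buckets:
        for k, e in bucket:
            # N changed, so each item's bucket index must be recomputed.
            new_buckets[abs(hash(k)) % new_N].append((k, e))
    return new_buckets


def insert_item(buckets, n, k, e, max_load=0.75):
    """Insert (k, e); grow first if the load factor n/N would exceed max_load.
    Returns the (possibly new) bucket array and the new item count."""
    if (n + 1) / len(buckets) > max_load:
        buckets = rehash(buckets)
    buckets[abs(hash(k)) % len(buckets)].append((k, e))
    return buckets, n + 1
```

Starting from N = 5 and inserting ten items, this sketch grows the table to 11 and then to 23 buckets, keeping the load factor at or below 0.75 throughout.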