We don't want many keys to hash to the same bucket, since the time to find a key at the end of a bucket's list is proportional to the length of that list, i.e., to the number of keys that hash to that value.
We can't do much about items that have the same key, so we consider the (common) case where no two items have the same key.
The average size of a list is n/N, called the load factor, where n is the number of items and N is the number of buckets. Typically, one keeps the load factor significantly below 1.0. The text asserts that 0.75 is common.
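As a tiny illustration of the load-factor threshold (the names and the 0.75 default here are just the values from the text, not a required API):

```python
def needs_rehash(num_items, num_buckets, max_load=0.75):
    # load factor is n/N; 0.75 is the common threshold the text mentions
    return num_items / num_buckets > max_load
```

So a table with 8 items in 10 buckets (load factor 0.8) would trigger a resize, while 7 items in 10 buckets would not.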
What should we do as more items are added to the dictionary? We make an “extendable dictionary”. That is, as with an extendable array, we double N and “fix everything up”. In the case of an extendable dictionary, the fix-up consists of recomputing the hash of every element (since N has doubled).
In fact no one really calls this an extendable dictionary. Instead this scheme is referred to as rehashing since, when N is changed, one must rehash (i.e., recompute the hash of) each element. Also, since both the old and new N are normally chosen to be prime numbers, the new N can't be obtained by doubling the old one. Instead, the new N is the smallest prime number exceeding twice the old N.
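The growth rule and the rehash can be sketched as follows (a sketch assuming separate chaining with each bucket a Python list, and a caller-supplied `hash_fn`; trial division is fine here since N is small):

```python
def is_prime(n):
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True

def next_capacity(old_n):
    # smallest prime exceeding twice the old capacity
    n = 2 * old_n + 1
    while not is_prime(n):
        n += 1
    return n

def rehash(buckets, hash_fn):
    # every item must be re-placed, since its hash is taken mod the new N
    new_n = next_capacity(len(buckets))
    new_buckets = [[] for _ in range(new_n)]
    for bucket in buckets:
        for (k, e) in bucket:
            new_buckets[hash_fn(k) % new_n].append((k, e))
    return new_buckets
```

For example, growing from N=13 gives N=29 (the smallest prime above 26), not 26.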
Separate chaining involves two data structures: the buckets and the lists. An alternative is to dispense with the lists and always store items directly in the buckets, one item per bucket. Schemes of this kind are referred to as open addressing. The problem they need to solve is where to put an item when the bucket it should go into is already full. There are several different solutions. We study three: linear probing, quadratic probing, and double hashing.
This is the simplest of the schemes. To insert a key k (really I should say ``to insert an item (k,e)'') we compute h(k) and initially assign k to A[h(k)]. If we find that A[h(k)] contains another key, we assign k to A[h(k)+1]. If that bucket is also full, we try A[h(k)+2], etc. Naturally, we do the additions mod N so that after trying A[N-1] we try A[0]. So if we insert (16,e) into the dictionary at the right, we place it into bucket 2.
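A linear-probing insert can be sketched like this (a sketch, assuming the table A is a Python list holding (key, element) pairs with None marking an empty bucket, and a caller-supplied `hash_fn`):

```python
def insert(A, k, e, hash_fn):
    """Insert item (k, e) by linear probing."""
    N = len(A)
    start = hash_fn(k) % N
    for i in range(N):
        j = (start + i) % N               # additions are done mod N
        if A[j] is None or A[j][0] == k:  # empty bucket, or same key
            A[j] = (k, e)
            return True
    return False                          # every bucket full: insert fails
```

For example, with N=5 and h(k)=k mod 5, inserting key 1 fills bucket 1; a later insert of key 6 (which also hashes to 1) lands in bucket 2.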
How about finding a key k (again I should say an item (k,e))?
We first look at A[h(k)]. If this bucket contains the key, we have
found it. If not, we try A[h(k)+1], etc., and of course do it mod N (I will
stop mentioning the mod N). So if
we look for 4 we find it in bucket 1 (after encountering two keys
that hashed to 6).
Or perhaps I should say incomplete. What if the item is not in the table? How can we tell?
Ans: If we hit an empty bucket then the item is not present (if it were present we would have stored it in this empty bucket). So 20 is not present.
What if the dictionary is full, i.e., if there are no empty buckets?
Check to see if you have wrapped all the way around. If so, the key is not present.
What about removals?
Easy, remove the item creating an empty bucket.
I'm sorry you asked. This is a bit of a mess. Assume we want to remove the (item with) key 19. If we simply remove it, and search for 4 we will incorrectly conclude that it is not there since we will find an empty slot.
OK so we slide all the items down to fill the hole.
WRONG! If we slide 6 into the hole at 5, we will never be able to find 6.
So we only slide the ones that hash to 4??
WRONG! The rule is you slide all keys that are not at their hash location until you hit an empty space.
Normally, instead of this complicated procedure for removals, we simply mark the bucket as removed by storing a special value there. When looking for keys we skip over such buckets. When an insert hits such a bucket, the insert reuses the bucket. (The book calls this a ``deactivated item'' object.)
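The marker scheme can be sketched as follows (same assumed list-of-pairs representation; the `REMOVED` sentinel plays the role of the book's deactivated-item object):

```python
REMOVED = object()    # sentinel marking a ``deactivated item'' bucket

def remove(A, k, hash_fn):
    """Mark k's bucket as removed instead of emptying it."""
    N = len(A)
    for i in range(N):
        j = (hash_fn(k) + i) % N
        if A[j] is None:
            return False                          # hit empty: k not present
        if A[j] is not REMOVED and A[j][0] == k:
            A[j] = REMOVED                        # leave a marker, not a hole
            return True
    return False

def find(A, k, hash_fn):
    """A find skips REMOVED markers; only a truly empty bucket stops it."""
    N = len(A)
    for i in range(N):
        j = (hash_fn(k) + i) % N
        if A[j] is None:
            return None
        if A[j] is not REMOVED and A[j][0] == k:
            return A[j][1]
    return None
```

Because the marker is not an empty bucket, a key that was bumped past the removed item is still found; an insert, on the other hand, is free to overwrite a REMOVED bucket.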
All the open addressing schemes work roughly the same. The difference is which bucket to try if A[h(k)] is full. One extra disadvantage of linear probing is that it tends to cluster the items into contiguous runs, which slows down the algorithm.
Quadratic probing attempts to spread items out by trying buckets A[h(k)], A[h(k)+1], A[h(k)+4], A[h(k)+9], etc. One problem is that even if N is prime this scheme can fail to find an empty slot even if there are empty slots.
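The failure is easy to see concretely: the offsets tried are i² mod N, and squares mod a prime hit only about half the residues (and the probe sequence repeats with period N, since (i+N)² ≡ i² mod N). With the made-up table size N=7:

```python
N = 7                                        # a prime table size
offsets = sorted({(i * i) % N for i in range(N)})
# squares mod 7 are 0, 1, 4, 2, 2, 4, 1 -- only 4 distinct offsets
print(offsets)                               # [0, 1, 2, 4]
```

So starting from h(k), quadratic probing can only ever reach buckets h(k)+0, +1, +2, and +4 (mod 7); if those four are full, the insert fails even though three other buckets are empty.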
In double hashing we have two hash functions h and h'. We use h as above and, if A[h(k)] is full, we try A[h(k)+h'(k)], A[h(k)+2h'(k)], A[h(k)+3h'(k)], etc.
The book says h'(k) is often chosen to be (q-k) mod q for some prime q<N (I use the real mod, the book's definition is in terms of the java % and is thus more complicated). We will not consider which secondary hash function h' is good to use.
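A small sketch of the double-hashing probe sequence (the values q=5 and N=7 are made up for illustration; the guard against a zero step is my addition, since (q-k) mod q is 0 when q divides k, and a zero step would probe the same bucket forever):

```python
q, N = 5, 7                       # a prime q < N, and a prime table size N

def h(k):                         # primary hash
    return k % N

def h2(k):                        # secondary hash: (q - k) mod q, per the text
    step = (q - k) % q
    return step if step else 1    # guard against a zero step

def probe_sequence(k):
    """The buckets double hashing tries for key k, in order."""
    return [(h(k) + i * h2(k)) % N for i in range(N)]
```

For key 3 this gives the sequence 3, 5, 0, 2, 4, 6, 1: because N is prime and the step is nonzero, the step is coprime to N, so every bucket is eventually tried.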
A hard choice. Separate chaining seems to use more space, but that is deceiving since it all depends on the load factor. In general, for each scheme, the lower the load factor, the faster the scheme but the more memory it uses.
The key point about both schemes is that the expected time for a retrieval, insertion, or deletion is Θ(1). We cannot prove this as it depends on some assumptions about the input distribution and uses more statistics than we assume known.