Lecture 8: Hash Tables 2/25
DJW section 10.6. However, note that the hash tables in DJW are somewhat
different from ours. In their implementation, the hash table is an array
of key-value pairs rather than an array of headers of linked lists. This
is called closed hashing. The technique for dealing with collisions
becomes more complicated in closed hashing.
A hash table is a data structure that allows you to associate
a value with a key, and then look up the value associated with the key.
With very high probability, each lookup takes constant time.
The Java HashTable library class
Sample code (German-English dictionary):
TestLibraryHash.java.
The generic library class
HashMap<K,V> is a hash table with keys of class K
and values of class V. In this example, G2E
is constructed as an object of class HashMap<String,String>.
Important methods:
- G2E.put(KEY,VALUE) -- Associate VALUE with KEY.
- G2E.get(KEY) -- Get the VALUE associated
with KEY.
- G2E.remove(KEY) -- Remove the KEY from the hash table.
- G2E.keySet() -- Return all the keys in the table as a set.
We'll come back to more implementation details later.
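As a quick illustration of the four methods listed above, here is a
minimal sketch. It is not the course file TestLibraryHash.java; the class
name G2EDemo, the helper buildDictionary, and the dictionary entries are
made up for this example.

```java
import java.util.HashMap;
import java.util.Map;

class G2EDemo {
    // Builds a tiny German-English dictionary (entries chosen for illustration).
    static Map<String, String> buildDictionary() {
        Map<String, String> g2e = new HashMap<>();
        g2e.put("Hund", "dog");      // associate VALUE with KEY
        g2e.put("Katze", "cat");
        g2e.put("Haus", "house");
        return g2e;
    }

    public static void main(String[] args) {
        Map<String, String> g2e = buildDictionary();
        System.out.println(g2e.get("Hund"));   // value stored under "Hund"
        System.out.println(g2e.get("Baum"));   // null, since "Baum" was never added
        g2e.remove("Katze");                   // removes the key and its value
        for (String key : g2e.keySet()) {      // iteration order is unspecified
            System.out.println(key + " -> " + g2e.get(key));
        }
    }
}
```

Note that get returns null for a missing key, so a null result cannot by
itself distinguish "absent" from "present with value null".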
Implementation
MyHashTable.java.
This is a hash table with keys of class String and values of
type Person.
The hash table is an array of a specified size, which is supposed to
be larger than the maximum number of keys you plan to store.
Each element of the array
is a linked list of nodes (in this case, we've used a singly
linked list with no header). Each node has three data fields: key,
value, and next.
Method MyHash(S) maps string S to an index in the hash
table.
The method add(Key,PP) :
- Uses MyHash(Key) to find the index for Key in the table.
- Looks for a node with Key in the linked list.
- If there is none, adds a node with key Key and
value PP to the linked list.
The method get(Key) :
- Uses MyHash(Key) to find the index for Key in the table.
- Looks for a node with Key in the linked list.
- Returns the value in that node. If there is no such node,
returns null.
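The two methods just described can be sketched as follows. This is not
the course file MyHashTable.java, only a minimal version under some
assumptions: the class name ChainedTable is hypothetical, values are
Strings rather than Person objects to keep the example self-contained,
the index is computed from the library hashCode instead of MyHash, and
add leaves an existing entry unchanged when the key is already present.

```java
class ChainedTable {
    // One node of a singly linked list with no header: key, value, next.
    private static class Node {
        final String key;
        final String value;
        final Node next;

        Node(String key, String value, Node next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private final Node[] table;   // each element is the first node of a list, or null

    ChainedTable(int capacity) {
        table = new Node[capacity];
    }

    // Stand-in for MyHash: maps the key to an index in [0, capacity).
    private int index(String key) {
        return Math.floorMod(key.hashCode(), table.length);
    }

    // Adds (key, value) unless a node with this key already exists.
    void add(String key, String value) {
        int i = index(key);
        for (Node n = table[i]; n != null; n = n.next) {
            if (n.key.equals(key)) {
                return;   // assumption: an existing key is left untouched
            }
        }
        table[i] = new Node(key, value, table[i]);   // insert at the front
    }

    // Returns the value stored under key, or null if there is no such node.
    String get(String key) {
        for (Node n = table[index(key)]; n != null; n = n.next) {
            if (n.key.equals(key)) {
                return n.value;
            }
        }
        return null;
    }
}
```

Inserting at the front of the list avoids walking to the end a second
time; since the list has already been searched for the key, the order of
nodes within a list does not matter.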
Terminology:
Two keys collide in a hash table if the hash function maps them
to the same index.
The capacity of the hash table is the size of the array.
The load factor is the number of keys stored in the hash table divided
by the capacity. The capacity should be chosen so that the
load factor is less than 1. For instance, if we want to implement a
German-English dictionary with 50,000 German words, we need a hash table
whose capacity is larger than 50,000.
Since the number of keys in the hash table is less than the capacity of the
hash table, assuming that the keys are evenly distributed across indices,
there will be few collisions. Some of the linked lists will be empty, and
most of the nonempty ones will be of length 1. A few will be of length 2;
a very few will be of length 3, and so on. The probability that there
is any linked list that is very much longer than the load factor is very small.
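One way to get a feel for this claim is to simulate it. The sketch below
is not part of the course code: the class ChainLengths and the method
histogram are hypothetical names, and it assumes an idealized hash
function that sends each key to a uniformly random index.

```java
import java.util.Random;

class ChainLengths {
    // Returns counts, where counts[len] is the number of lists of length len
    // after placing `keys` keys uniformly at random into `capacity` lists.
    static int[] histogram(int keys, int capacity, long seed) {
        Random rng = new Random(seed);
        int[] listLength = new int[capacity];
        for (int i = 0; i < keys; i++) {
            listLength[rng.nextInt(capacity)]++;   // idealized hash: a random index
        }
        int longest = 0;
        for (int len : listLength) {
            longest = Math.max(longest, len);
        }
        int[] counts = new int[longest + 1];
        for (int len : listLength) {
            counts[len]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // The dictionary example: 50,000 keys in a table of capacity 75,000.
        int[] counts = histogram(50_000, 75_000, 42L);
        for (int len = 0; len < counts.length; len++) {
            System.out.println("length " + len + ": " + counts[len] + " lists");
        }
    }
}
```

Running it prints how many lists have each length; the counts should fall
off quickly as the length grows.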
In the library class HashMap, the system automatically doubles the
size of the hash table when the load factor exceeds a threshold (0.75 by
default), similar to what we saw with StringBuffers.
(There are other implementations of hash tables that don't use linked lists,
described in DJW ????. For our purposes, these are unimportant.)
Example
This picture shows the final state of the hash table constructed in
TestMyHashTable.java.
Choosing the hash function
The hash function maps the key to an index in the hash table. It is critical
to avoid collisions as far as possible. Suppose that we
are implementing a German-English dictionary with 50,000 words and we
are using a hash table of capacity 75,000. Then we have a load factor of 0.67,
which is very reasonable.
- There certainly is no function that maps every different String
to a different index, since there are only 75,000 different indices,
and there are infinitely many strings.
- It would be difficult to find a hash function that maps every different
German word to a different index, i.e., with no collisions. (However, when the
set of keys is fixed, as here, it is sometimes worth putting in substantial
effort to find a hash function with few collisions.)
- If the German words of length N were a random selection of
strings of length N then you could use essentially any hash function
that distributes the strings of length N evenly over the indices,
and with high probability that would be a good hash function.
- But German words are not random strings; they have a lot of patterns.
For example: Some letters are common, some are rare. Some sequences of letters
are impossible. German uses a lot of compound words, so many words are
parts of other words.
- What is important is that these patterns don't somehow cause a lot
of German words to hash to the same value. For instance, you don't want a
hash function that only looks at the first four letters of the word
because many words begin with the same first four letters. You want the
hash function to ``make hash'' of any pattern in the set; hence the name.
- However, the hash function has to be computable quickly; otherwise
you lose the advantage of the hash table.
- The hash function for Strings given in MyHashTable.java
is pretty good.
It views the string
as essentially a numeral in base 37 (or base 43, if the hash table size
is close to a multiple of 37) and then reduces that number mod the table
size. This hash function is a little slower than ideal, particularly for
German words, which are often long.
- The choice of hash function should be matched to the kinds of patterns
that naturally occur in actual collections of keys. A good hash function for
images, for example, may be quite different from a good hash function
for strings.
- Obviously, the hash function does have to be repeatable; you could
get good distribution if you incorporated a random number, but then
you could never find the key again.
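The base-37 idea from the list above can be sketched as follows. This is
an illustration, not the code in MyHashTable.java: the class name
Base37Hash and the parameter names are hypothetical, and the base is
passed in as an argument so that 37 or 43 can be chosen by the caller.

```java
class Base37Hash {
    // Treats s as a numeral in the given base, with each character's code as
    // a digit, and reduces it mod tableSize. Reducing after every step (rather
    // than once at the end) gives the same result and avoids overflow.
    static int hash(String s, int base, int tableSize) {
        long h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = (h * base + s.charAt(i)) % tableSize;
        }
        return (int) h;   // always in the range [0, tableSize)
    }

    public static void main(String[] args) {
        int tableSize = 75_000;
        for (String word : new String[] {"Haus", "Hausaufgabe", "Haustier"}) {
            System.out.println(word + " -> " + hash(word, 37, tableSize));
        }
    }
}
```

Every character contributes to the result, so words that share a prefix
are not forced into the same index. The cost is one multiplication and
one remainder per character, which is why the function is slower on long
words.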
The Java library hash function
The Java library provides a method hashCode for classes that are
expected to be used as keys in hash tables; e.g. String,
Integer and so on. This maps the value to a 32-bit integer.
Reducing this modulo the hash table size gives a good hash function
(taking care that the result is nonnegative, since hashCode can return
a negative number).
If the hash table size L = 2^{k} --- which it always is
in the HashMap class --- then reducing mod L is the same
as taking the k lowest-order bits, which is the same as doing
a bitwise AND with L-1. That is the explanation of the code
in the method goodMod.
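A small sketch of the masking trick follows. It is not the course's
goodMod; the names PowerOfTwoIndex and maskIndex are hypothetical, and
the code assumes the table size is a power of two. For such sizes the
mask agrees with Math.floorMod, which always returns a nonnegative
result, whereas Java's % operator can return a negative number when
hashCode is negative.

```java
class PowerOfTwoIndex {
    // Keeps the k lowest-order bits of hashCode, where tableSize == 2^k.
    // Assumes tableSize is a power of two; the result is in [0, tableSize).
    static int maskIndex(int hashCode, int tableSize) {
        return hashCode & (tableSize - 1);
    }

    public static void main(String[] args) {
        int tableSize = 1 << 4;   // L = 2^4 = 16, so L-1 is binary 1111
        for (String key : new String[] {"Hund", "Katze", "Haus"}) {
            int h = key.hashCode();
            System.out.println(key + ": mask gives " + maskIndex(h, tableSize)
                    + ", floorMod gives " + Math.floorMod(h, tableSize));
        }
    }
}
```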
equals() and hashCode() for complex data structures:
Java provides an equals(X) method and a hashCode()
method for an arbitrary
object. However, the default is that both these methods are based on
the address in memory of the object.
Example:
TestEqualLists1.java
Sometimes this is what you want,
but often it is not. For example, you would like two linked lists to be
considered equal if they have the identical sequence of elements, and you
would like this sense of ``equals'' to be used by all the functions, including
library functions, that call the equals method. Likewise, you might
like to use a list like [1, 5, 8] as a key, and then look it up with a
different list [1, 5, 8], without it having to be the same actual object.
The solution is to override the equals(X) and the
hashCode() methods:
Example using lists of ints:
TestEqualLists2.java
Your new equals(X) and hashCode()
methods must satisfy the following
constraints; otherwise the functions that call these (lots of things,
in the case of equals(); hash tables in the case of
hashCode()) will fail
in strange and unpredictable ways.
The method equals(X) must be an equivalence relation. That is:
- X.equals(X) must always return
true for all X in the
class.
- X.equals(Y) and
Y.equals(X) must return the same value (either both
true or both false).
- If X.equals(Y) and Y.equals(Z) return
true, then
X.equals(Z) must also
return true.
The method hashCode() must be compatible with
equals(X). That is,
if X.equals(Y) returns true, then X.hashCode()
and Y.hashCode() must be equal.
If you do want equality in the sense of ``the identical
object'', then you can always use X == Y.
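The constraints above can be met with very little code. The sketch below
is not TestEqualLists2.java: the class IntSeq is a hypothetical name, and
for brevity it stores the ints in an array rather than a linked list, so
that the library methods Arrays.equals and Arrays.hashCode can do the
element-by-element work.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class IntSeq {
    private final int[] items;

    IntSeq(int... items) {
        this.items = items.clone();   // private copy, so the key cannot change later
    }

    // Two sequences are equal if they hold the same ints in the same order.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof IntSeq)) return false;
        return Arrays.equals(items, ((IntSeq) other).items);
    }

    // Computed from the same elements that equals compares, so equal
    // sequences always get equal hash codes.
    @Override
    public int hashCode() {
        return Arrays.hashCode(items);
    }

    public static void main(String[] args) {
        IntSeq a = new IntSeq(1, 5, 8);
        IntSeq b = new IntSeq(1, 5, 8);
        Map<IntSeq, String> table = new HashMap<>();
        table.put(a, "found");
        System.out.println(a == b);         // false: two distinct objects
        System.out.println(a.equals(b));    // true: same sequence of elements
        System.out.println(table.get(b));   // found: lookup with a different object
    }
}
```

Copying the array in the constructor matters: if the contents of a key
changed after it was put in the table, its hash code would change and the
table would look for it in the wrong list.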
For complex data structures or mathematical entities, the question
of what it means for two things to be the same, and how you compute that,
can be a deep and difficult one.