Data Structures Lecture 7: More on Lists; Hash Tables. 2/20
DJW sections 3.1, 3.5 (stacks); 5.1, 5.3, 5.6 (queues); 6.7 (lists)
Returning a value
The following class definition does not compile. Why not?
public class FindSqrt {
public static int findSqrt(int n) {
if (n <= 0) return 0;
for (int i = 1; i <= n; i++) {
if (i*i >= n) return i;
}
}
}
Answer: Java cannot be sure that this returns a value; as far as the Java
compiler is concerned, the loop might execute to completion without the
condition ever being satisfied. You have to add an additional return
statement at the end. You know that it will never be executed,
but it reassures the compiler.
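A minimal sketch of the fix, with the extra return statement added after the loop. The class name FindSqrtFixed and the value -1 returned at the end are choices made for this illustration; any int would satisfy the compiler.

```java
class FindSqrtFixed {
    // Returns the smallest i >= 1 with i*i >= n, or 0 when n <= 0.
    public static int findSqrt(int n) {
        if (n <= 0) return 0;
        for (int i = 1; i <= n; i++) {
            if (i * i >= n) return i;
        }
        // Not expected to be reached for n >= 1 (i = n satisfies i*i >= n,
        // ignoring int overflow), but the compiler requires a return here.
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(findSqrt(10)); // smallest i with i*i >= 10
        System.out.println(findSqrt(16)); // smallest i with i*i >= 16
    }
}
```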
Generic Ordered Lists
GOrderedList.java
Stacks and Queues
Stacks and queues are lists with restrictions on the forms of access.
Stacks
Stacks obey "Last In, First Out" (LIFO) discipline. In most applications,
there is an additional restriction that you can only examine the top element
of the stack. There are therefore three main methods (besides the constructor):
- push(X) -- put X on top of the stack.
- pop() --- remove the top element, and return its value.
- top() --- return the value of the top element without removing it.
ArrayStack.java
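The three methods can be sketched over an array as follows. This is an illustration only, not the course's ArrayStack.java: the class name ArrayStackSketch, the initial array length of 4, the doubling policy, and the exception thrown on an empty stack are all assumptions.

```java
import java.util.Arrays;
import java.util.NoSuchElementException;

class ArrayStackSketch<T> {
    private Object[] items = new Object[4]; // backing array; grows on demand
    private int size = 0;                   // number of elements on the stack

    public boolean isEmpty() { return size == 0; }

    // push(X): put X on top of the stack, doubling the array when it is full.
    public void push(T x) {
        if (size == items.length) items = Arrays.copyOf(items, 2 * size);
        items[size++] = x;
    }

    // pop(): remove the top element and return its value.
    public T pop() {
        T x = top();
        items[--size] = null; // drop the reference to the removed element
        return x;
    }

    // top(): return the value of the top element without removing it.
    @SuppressWarnings("unchecked")
    public T top() {
        if (size == 0) throw new NoSuchElementException("stack is empty");
        return (T) items[size - 1];
    }

    public static void main(String[] args) {
        ArrayStackSketch<String> s = new ArrayStackSketch<>();
        s.push("a");
        s.push("b");
        s.push("c");
        System.out.println(s.pop()); // last in, first out
        System.out.println(s.top());
    }
}
```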
FIFO Queues: List implementation
A FIFO queue observes First In First Out (FIFO) ordering; items come off the
queue in the same order in which they went on. The main methods are:
- add(X) --- add X to the end of the queue
- pop() -- remove the element from the front of the queue and return it.
FIFO queues are implemented in two ways. The first is a linked list, with
pointers to front and back.
FIFOQueue.java
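A minimal sketch of the linked-list version, keeping a reference to each end of the list. It is not the course's FIFOQueue.java; the names LinkedQueueSketch, Node, front and back, and the exception thrown on an empty queue, are assumptions.

```java
import java.util.NoSuchElementException;

class LinkedQueueSketch<T> {
    private static class Node<T> {
        final T value;
        Node<T> next;
        Node(T value) { this.value = value; }
    }

    private Node<T> front; // oldest element, the next one to be removed
    private Node<T> back;  // newest element, where additions happen

    public boolean isEmpty() { return front == null; }

    // add(X): append X at the back of the queue.
    public void add(T x) {
        Node<T> n = new Node<>(x);
        if (back == null) front = n; // queue was empty
        else back.next = n;
        back = n;
    }

    // pop(): remove the element at the front of the queue and return it.
    public T pop() {
        if (front == null) throw new NoSuchElementException("queue is empty");
        T x = front.value;
        front = front.next;
        if (front == null) back = null; // queue became empty
        return x;
    }

    public static void main(String[] args) {
        LinkedQueueSketch<String> q = new LinkedQueueSketch<>();
        q.add("first");
        q.add("second");
        System.out.println(q.pop()); // items come off in insertion order
        System.out.println(q.pop());
    }
}
```

Both operations touch only the ends of the list, so neither has to walk the list.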
FIFO Queues: Circular array implementation
The second implementation is as a circular array. The queue consists of
an array elements, together with a
start index and an end index.
Conceptually the array ``wraps around'',
so that when you reach the end, you go back to index 0. In our implementation,
end is the index of the first empty slot. (You can never actually fill the
array, because there would be no way to distinguish that from the
empty queue.)
So:
- If end = start, then the queue is empty.
- If end > start, then the queue consists of
the items in elements[start] to elements[end-1].
- If end < start, then the queue consists of
the items in elements[start] to the end of elements, followed by
the items in elements[0] to elements[end-1].
CircularArray.java
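The index arithmetic can be sketched as below. This is an illustration, not the course's CircularArray.java: the class name CircularQueueSketch, the int element type, the helper methods isFull and size, and the choice to throw when the queue is full are assumptions. The fields elements, start and end follow the description above, with one slot always left unused.

```java
import java.util.NoSuchElementException;

class CircularQueueSketch {
    private final int[] elements;
    private int start = 0; // index of the front element
    private int end = 0;   // index of the first empty slot

    // arrayLength should be at least 2; the queue holds at most arrayLength - 1 items.
    CircularQueueSketch(int arrayLength) { elements = new int[arrayLength]; }

    boolean isEmpty() { return end == start; }

    // Full when advancing end would make it equal to start.
    boolean isFull() { return (end + 1) % elements.length == start; }

    // Covers both cases end >= start and end < start.
    int size() { return (end - start + elements.length) % elements.length; }

    void add(int x) {
        if (isFull()) throw new IllegalStateException("queue is full");
        elements[end] = x;
        end = (end + 1) % elements.length; // wrap around to index 0
    }

    int pop() {
        if (isEmpty()) throw new NoSuchElementException("queue is empty");
        int x = elements[start];
        start = (start + 1) % elements.length; // wrap around to index 0
        return x;
    }

    public static void main(String[] args) {
        CircularQueueSketch q = new CircularQueueSketch(4);
        q.add(1); q.add(2); q.add(3);
        System.out.println(q.pop());
        q.add(4); // end wraps around to index 0, so now end < start
        System.out.println(q.size());
    }
}
```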
Hash Tables
A hash table is a data structure that allows you to associate
a value with a key, and then look up the value associated with the key ---
with very high probability in constant time.
The Java HashMap library class
Sample code (German-English dictionary):
TestLibraryHash.java.
The generic library class
HashMap<K,V> is a hash table with keys of class K
and values of class V. In this example, G2E
is constructed as an object of class HashMap<String,String>.
Important methods:
- G2E.put(KEY,VALUE) -- Associate VALUE with KEY.
- G2E.get(KEY) -- Get the VALUE associated
with KEY.
- G2E.remove(KEY) -- Remove the KEY from the hash table.
- G2E.keySet() --- Return all the keys in the table as a set.
We'll come back to more implementation details later.
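A short sketch of these four methods in use. It is not the course's TestLibraryHash.java; the class name G2EDemo, the helper build, and the three dictionary entries are made up for illustration.

```java
import java.util.HashMap;
import java.util.Set;

class G2EDemo {
    // Builds a tiny German-English dictionary (sample entries only).
    static HashMap<String, String> build() {
        HashMap<String, String> G2E = new HashMap<>();
        G2E.put("Hund", "dog");
        G2E.put("Katze", "cat");
        G2E.put("Haus", "house");
        return G2E;
    }

    public static void main(String[] args) {
        HashMap<String, String> G2E = build();
        System.out.println(G2E.get("Hund"));  // value associated with "Hund"
        System.out.println(G2E.get("Vogel")); // null, since the key is absent
        G2E.remove("Katze");
        Set<String> keys = G2E.keySet();
        for (String k : keys) {
            System.out.println(k + " -> " + G2E.get(k));
        }
    }
}
```

Note that get returns null for a key that is not in the table, and that keySet does not promise any particular order.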
Implementation
MyHashTable.java.
This is a hash table with keys of class String and values of
type Person.
The hash table is an array of a specified size, which is supposed to
be larger than the maximum number of keys you plan to store.
Each element of the array
is a linked list of nodes (in this case, we've used a singly
linked list with no header). Each node has three data fields: key,
value, and next.
Method MyHash(S) maps string S to an index in the hash
table.
The method add(Key,PP) :
- Uses MyHash(Key) to find the index for Key in the table.
- Looks for a node with Key in the linked list.
- If there is none, adds a node with key Key and
value PP to the linked list.
(You will improve this in the problem set.)
The method get(Key) :
- Uses MyHash(Key) to find the index for Key in the table.
- Looks for a node with Key in the linked list.
- Returns the value in that node. If there is no such node,
returns null.
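The structure and the two methods described above can be sketched as follows. This is not the course's MyHashTable.java: the class name ChainedTableSketch and the helpers indexFor and find are hypothetical, values are Strings rather than Person objects to keep the example self-contained, and the library hashCode stands in for MyHash.

```java
class ChainedTableSketch {
    // One node of a singly linked list with no header: key, value, next.
    private static class Node {
        final String key;
        final String value;
        final Node next;
        Node(String key, String value, Node next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private final Node[] table; // each slot is the first node of a list, or null

    ChainedTableSketch(int capacity) { table = new Node[capacity]; }

    // Stand-in for MyHash: maps a key to an index in 0..capacity-1.
    private int indexFor(String key) {
        return Math.floorMod(key.hashCode(), table.length);
    }

    // Looks for a node with the given key in that key's linked list.
    private Node find(String key) {
        for (Node n = table[indexFor(key)]; n != null; n = n.next) {
            if (n.key.equals(key)) return n;
        }
        return null;
    }

    // Adds a node only if the key is not already present.
    void add(String key, String value) {
        if (find(key) != null) return;
        int i = indexFor(key);
        table[i] = new Node(key, value, table[i]); // insert at the front of the list
    }

    // Returns the value stored under key, or null if there is no such node.
    String get(String key) {
        Node n = find(key);
        return n == null ? null : n.value;
    }

    public static void main(String[] args) {
        ChainedTableSketch t = new ChainedTableSketch(8);
        t.add("Hund", "dog");
        System.out.println(t.get("Hund"));
        System.out.println(t.get("Katze")); // null: no such key
    }
}
```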
Terminology:
Two keys collide in a hash table if the hash function maps them
to the same index.
The capacity of the hash table is the size of the array.
The load factor is the number of keys stored in the hash table divided
by the capacity. The capacity should be chosen so that the
load factor is less than 1. For instance, if we want to implement a
German-English dictionary with 50,000 German words, we need a hash table
whose capacity is larger than 50,000.
Since the number of keys in the hash table is less than the capacity of the
hash table, assuming that the keys are evenly distributed across indices,
there will be few collisions, and
most of the linked lists will be of length 1. A few will be of length 2;
a very few will be of length 3, and so on. The probability that there
is any linked list that is very much longer than the load factor is very small.
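One way to get a feel for these list lengths is a small simulation. The sketch below models "evenly distributed" keys by picking an index uniformly at random for each key, which is an assumption rather than a property of any real hash function; the class name ChainLengthSim, the seed, and the table sizes in main are arbitrary.

```java
import java.util.Random;

class ChainLengthSim {
    // Returns hist, where hist[len] is the number of indices holding exactly len keys.
    static int[] histogram(int keys, int capacity, long seed) {
        Random rng = new Random(seed);
        int[] listLength = new int[capacity];
        for (int k = 0; k < keys; k++) {
            listLength[rng.nextInt(capacity)]++; // each key lands on a random index
        }
        int max = 0;
        for (int len : listLength) max = Math.max(max, len);
        int[] hist = new int[max + 1];
        for (int len : listLength) hist[len]++;
        return hist;
    }

    public static void main(String[] args) {
        int[] hist = histogram(50_000, 75_000, 1L);
        for (int len = 0; len < hist.length; len++) {
            System.out.println("lists of length " + len + ": " + hist[len]);
        }
    }
}
```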
In the library class HashMap, the system automatically doubles the
size of the hash table when the load factor exceeds a set threshold
(0.75 by default), similar to what we saw with StringBuffers.
(There are other implementations of hash tables that don't use linked lists,
described in Weiss section 5.4. For our purposes, these are unimportant.)
Example
This picture shows the final state of the hash table constructed in
TestMyHashTable.java.
Choosing the hash function
The hash function maps the key to an index in the hash table. It is critical
to avoid collisions as far as possible. Suppose that we
are implementing a German-English dictionary with 50,000 words and we
are using a hash table of capacity 75,000 -- a load factor of 0.67,
which is very reasonable.
- There certainly is no function that maps every different String
to a different index, since there are only 75,000 different indices,
and there are infinitely many strings.
- It would be difficult to find a hash function that maps every different
German word to a different index, i.e. with no collisions. (However, when the set
of keys is fixed, as here, it is sometimes worth putting substantial effort
into finding a hash function with few collisions.)
- If the German words of length N were a random selection of
strings of length N then you could use essentially any hash function
that distributes the strings of length N evenly over the indices,
and with high probability that would be a good hash function.
- But German words are not random strings; they have a lot of patterns.
For example: Some letters are common, some are rare. Some sequences of letters
are impossible. German uses a lot of compound words, so many words are
parts of other words.
- What is important is that these patterns don't somehow cause a lot
of German words to hash to the same value. For instance, you don't want a
hash function that only looks at the first four letters of the word
because many words begin with the same first four letters. You want the
hash function to ``make hash'' of any pattern in the set; hence the name.
- However, the hash function has to be computable quickly; otherwise
you lose the advantage of the hash table.
- The hash function for Strings given in MyHashTable.java, which
is pretty much the same as the one in Weiss, is pretty good. It views the string
as essentially a numeral in base 37 (or base 43, if the hash table size
is close to a multiple of 37) and then reduces that number mod the table
size. This hash function is a little slower than ideal, particularly for
German words, which are often long.
- The choice of hash function corresponds to the kinds of patterns
that naturally occur in actual collections. A good hash function for
images, for example, may be quite different from a good hash function
for strings.
- Obviously, the hash function does have to be repeatable; you could
get good distribution if you incorporated a random number, but then
you could never find it again.
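The "numeral in base 37" idea can be sketched as below. This is not the code from MyHashTable.java or from Weiss: the class name Base37HashSketch is hypothetical, and reducing mod the table size at every step (to keep the intermediate value small) is a choice made for this illustration.

```java
class Base37HashSketch {
    // Treats s as a numeral in base 37, with each character as one digit,
    // and returns its value mod tableSize, an index in 0..tableSize-1.
    static int hash(String s, int tableSize) {
        long h = 0;
        for (int i = 0; i < s.length(); i++) {
            // Horner's rule; reducing at each step gives the same final
            // remainder as reducing once at the end, without overflow.
            h = (37 * h + s.charAt(i)) % tableSize;
        }
        return (int) h;
    }

    public static void main(String[] args) {
        System.out.println(hash("Hund", 75_000));
        System.out.println(hash("Hundehaus", 75_000));
    }
}
```

Every character of the key contributes to the result, which is why the running time grows with the length of the word. The function uses nothing but the key and the table size, so the same key always gives the same index.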
The Java library hash function
The Java library provides a method hashCode for classes that are
expected to be used as keys in hash tables, e.g. String,
Integer and so on. This maps the value to a 32-bit integer.
Reducing this modulo the hash table size gives a good hash function.
If the hash table size L = 2^{k} --- which it always is
in the HashMap class --- then reducing mod L is the same
as taking the k lowest-order bits, which is the same as doing
a bitwise AND with L-1. That is the explanation of the code
in the method goodMod.
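The bitwise trick can be sketched as follows. This is an illustration and not the goodMod code itself; the class name PowerOfTwoModSketch and the method name maskIndex are hypothetical. Since hashCode can be negative, Java's % operator can produce a negative result, whereas the mask always gives an index in 0..L-1, so the comparison in main uses Math.floorMod.

```java
class PowerOfTwoModSketch {
    // Assumes tableSize is a power of two, L = 2^k.
    // Keeps the k lowest-order bits of hashCode.
    static int maskIndex(int hashCode, int tableSize) {
        return hashCode & (tableSize - 1);
    }

    public static void main(String[] args) {
        int L = 16; // 2^4
        int h = "Hund".hashCode();
        System.out.println(maskIndex(h, L));
        System.out.println(Math.floorMod(h, L)); // same index as the line above
    }
}
```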