Note: A practice midterm is on the web.
Remark: It is not as trivial to find the new insertion point for heaps when using a linked implementation.
Homework: Show the steps for inserting an element
with key 2 in the heap of Figure 2.41.
Trivial, right? Just remove the root since that must contain an
element with minimum key. Also decrease n by one.
Wrong!
What remains is TWO trees.
We do want the element stored at the root but we must put some other element in the root. The one we choose is our friend the last node.
But the last node is likely not to be a valid root, i.e., it will destroy the heap property since it will likely be bigger than one of its new children. So we have to bubble this one down. It is shown in pale red on the right, and the procedure is explained below. We also need to find a new last node, but that really is trivial: it is the node stored at the new value of n.
If the new root is the only internal node then we are done.
If only one child of the root is internal (it must be the left child) compare its key with the key of the root and swap if needed.
If both children of the root are internal, choose the child with the smaller key and swap with the root if needed.
The original last node became the root and has now been bubbled
down to level 1. But it might still be bigger than a child, so we keep
bubbling. At worst we need Θ(h) bubbling steps, which is again
logarithmic in n, as desired.
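The removal procedure above can be sketched in Python as follows. This is only an illustration, not the book's code: it assumes the heap is stored in a 0-based Python list, so the children of index i sit at 2i+1 and 2i+2, and the name `remove_min` is chosen here for the sketch.

```python
def remove_min(heap):
    """Remove and return the smallest key of an array-based min-heap.

    Assumption of this sketch: heap is a Python list, 0-based, with the
    children of index i stored at indices 2*i + 1 and 2*i + 2.
    """
    if not heap:
        raise IndexError("remove_min from an empty heap")
    smallest = heap[0]            # the root holds a minimum key
    last = heap.pop()             # detach the last node; n decreases by one
    if heap:                      # something remains: the last node becomes the root
        heap[0] = last
        i, n = 0, len(heap)
        while True:               # down-heap bubbling
            left, right = 2 * i + 1, 2 * i + 2
            if left >= n:         # no children, so we are done
                break
            child = left          # pick the child with the smaller key
            if right < n and heap[right] < heap[left]:
                child = right
            if heap[i] <= heap[child]:
                break             # heap property restored
            heap[i], heap[child] = heap[child], heap[i]
            i = child
    return smallest
```

Each pass of the loop moves one level down the tree, so the number of swaps is bounded by the height.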
Homework: R-2.16
Operation | Time |
---|---|
size, isEmpty | O(1) |
minElement, minKey | O(1) |
insertItem | Θ(log n) |
removeMin | Θ(log n) |
The table on the right gives the performance of the heap implementation of a priority queue. As desired, the main operations have logarithmic time complexity. It is for this reason that heap sort is fast.
The goal is to sort a sequence S. We return to the PQ-sort where we insert the elements of S into a priority queue and then use removeMin to obtain the sorted version. When we use a heap to implement the priority queue, each insertion and removal takes Θ(log(n)) so the entire algorithm takes Θ(nlog(n)). The heap implementation of PQ-sort is called heap-sort and we have shown
Theorem: The heap-sort algorithm sorts a sequence of n comparable elements in Θ(nlog(n)) time.
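As a rough illustration of PQ-sort with a heap, here is a short Python sketch. It substitutes Python's standard `heapq` module (an array-based binary min-heap) for the heap implementation discussed in these notes; the function name `heap_sort` is just for this example.

```python
import heapq

def heap_sort(seq):
    """PQ-sort using a heap: n insertions followed by n removeMin operations."""
    pq = []
    for x in seq:                 # phase 1: insert every element, O(log n) each
        heapq.heappush(pq, x)
    # phase 2: remove the minimum n times, O(log n) each
    return [heapq.heappop(pq) for _ in range(len(pq))]
```

Note that this version is not in place: the list `pq` is the Θ(n) extra space used to hold the heap.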
In place means that we use the space occupied by the input. More precisely, it means that the space required is just the input + O(1) additional memory. The algorithm above required Θ(n) additional space to store the heap.
The in place heap-sort of S assumes that S is implemented as an array and proceeds as follows. (This presentation, beyond the definition of "in place", is unofficial; i.e., it will not appear on problem sets or exams.)
If you are given at the beginning all n elements that are to be inserted, the total insertion time for all inserts can be reduced to O(n) from O(nlog(n)). The basic idea, assuming $n = 2^m - 1$ for some integer m, is
Sometimes we wish to extend the priority queue ADT to include a locator that always points to the same element even when the element moves around. So if x is in a priority queue and another item is inserted, x may move during the up-heap bubbling, but the locator of x continues to refer to x.
Method | Unsorted Sequence | Sorted Sequence | Heap |
---|---|---|---|
size, isEmpty | Θ(1) | Θ(1) | Θ(1) |
minElement, minKey | Θ(n) | Θ(1) | Θ(1) |
insertItem | Θ(1) | Θ(n) | Θ(log(n)) |
removeMin | Θ(n) | Θ(1) | Θ(log(n)) |
Dictionaries, as the name implies, are used to contain data that may later be retrieved. Associated with each element is the key used for retrieval.
For example, consider an element to be one student's NYU transcript, with the student id number as the key. Given the key (id number), the dictionary would return the entire element (the transcript).
A dictionary stores items, which are key-element (k,e) pairs.
We will study ordered dictionaries in the next chapter when we consider searching. Here we consider unordered dictionaries. So, for example, we do not support findSmallestKey. The methods we do support are
Just store the items in a sequence.
The idea of a hash table is simple: Store the items in an array (as done for log files) but "somehow" be able to figure out quickly, i.e., Θ(1), which array element contains the item (k,e).
We first describe the array, which is easy, and then the "somehow", which is not so easy. Indeed in some sense it is impossible. What we can do is produce an implementation that, on the average, performs operations in time Θ(1).
Allocate an array A of size N of buckets, each able to hold an item. Assume that the keys are integers in the range [0,N-1] and that no two items have the same key. Note that N may be much bigger than n. Now simply store the item (k,e) in A[k].
If everything works as we assumed, we have a very fast implementation: searches, insertions, and removals are Θ(1). But there are problems, which is why section 2.5 is not finished.
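A minimal Python sketch of the bucket array, under exactly the idealized assumptions above (integer keys in [0,N-1], no two items with the same key). The class name `BucketArray`, the method names, and the `NO_SUCH_KEY` sentinel object are illustrative choices for this sketch.

```python
NO_SUCH_KEY = object()   # sentinel for "no item with this key" (assumption of this sketch)

class BucketArray:
    """Direct-address table: the item (k, e) lives in A[k]."""

    def __init__(self, N):
        self.A = [None] * N          # N buckets, each able to hold one item

    def insert_item(self, k, e):
        self.A[k] = (k, e)           # Θ(1): the key itself is the index

    def find_element(self, k):
        item = self.A[k]
        return NO_SUCH_KEY if item is None else item[1]

    def remove_element(self, k):
        item = self.A[k]
        if item is None:
            return NO_SUCH_KEY
        self.A[k] = None
        return item[1]
```

The sketch makes the problems visible: it breaks as soon as a key is not an integer in [0,N-1], and it wastes space when N is much bigger than n.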
We need a hash function h that maps keys to integers in the range [0,N-1]. Then we will store the item (k,e) in bucket A[h(k)] (we are for now ignoring collisions). This problem is divided into two parts. A hash code assigns to each key a computer integer and then a compression map converts any computer integer into one in the range [0,N-1]. Each of these steps can introduce collisions. So even if the keys were unique to begin with, collisions are an important topic.
A hash code assigns to any key an integer value. The problem we have to solve is that the key may have more bits than are permitted in our integer values. We first view the key as a bunch of integer values (to be explained) and then combine these integer values into one.
If our integer values are restricted to 32 bits and our keys are 64
bits, we simply view the high order 32 bits as one value and the low
order as another. In general if
⌈numBitsInKey / numBitsInIntegerValue⌉ = k
we view the key as k integer values. How should we combine the k
values into one?
Simply add the k values.
But, but, but what about overflows?
Ignore them (or use exclusive or instead of addition).
The summing components method gives very many collisions when used for character strings. If 4 characters fill an integer value, then "temphash" and "hashtemp" will give the same value. If one decided to use integer values just large enough to hold one (Unicode) character, then there would be many, many common collisions: "t21" and "t12" for one, "mite" and "time" for another.
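To see these collisions concretely, here is a small Python sketch of the summing components method. It assumes ASCII strings (so 4 characters fill a 32-bit value) and ignores overflow by keeping only the low 32 bits; the function name `sum_hash` and its parameters are made up for this example.

```python
def sum_hash(s, chars_per_value=4):
    """Summing-components hash code for an ASCII string.

    The string is cut into pieces of chars_per_value characters, each piece
    is viewed as one integer value, and the values are added.
    """
    data = s.encode("ascii")                 # assumption: one byte per character
    total = 0
    for i in range(0, len(data), chars_per_value):
        total += int.from_bytes(data[i:i + chars_per_value], "big")
    return total & 0xFFFFFFFF                # ignore overflow: keep the low 32 bits

# Addition does not care about order, so reordered pieces collide.
assert sum_hash("temphash") == sum_hash("hashtemp")
assert sum_hash("t21", chars_per_value=1) == sum_hash("t12", chars_per_value=1)
assert sum_hash("mite", chars_per_value=1) == sum_hash("time", chars_per_value=1)
```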
If we call the k integer values $x_0, \ldots, x_{k-1}$, then a better scheme for combining is to choose a positive integer value $a$ and compute $\sum_{i=0}^{k-1} x_i a^i = x_0 + x_1 a + \cdots + x_{k-1} a^{k-1}$.
Same comment about overflows applies.
The authors have found that using a = 33, 37, 39, or 41 worked well for character strings that are English words.
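A sketch of the polynomial hash code in Python, taking each character as one integer value and ignoring overflow by masking to 32 bits. The evaluation uses Horner's rule, which is a choice made for this sketch rather than something stated in these notes; the name `poly_hash` and the default a = 33 are likewise only illustrative.

```python
def poly_hash(s, a=33):
    """Polynomial hash code x0 + x1*a + ... + x_{k-1}*a^(k-1), low 32 bits only.

    x_i is the code point of the i-th character of s.
    """
    h = 0
    # Horner's rule: start from the last value so that x0 ends up multiplied by a^0.
    for ch in reversed(s):
        h = (h * a + ord(ch)) & 0xFFFFFFFF   # masking = ignoring overflow
    return h

# Position now matters, so the earlier collisions disappear.
assert poly_hash("t21") != poly_hash("t12")
assert poly_hash("mite") != poly_hash("time")
```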
The problem we wish to solve in this section is to map integers in
some, possibly large range, into integers in the range [0,N-1].
This is trivial! Why not map all the integers into 0?
We want to minimize collisions.
This is often called the mod method. One simple way to turn any integer x into one in the range [0,N-1] is to compute x mod N. That is, we define the hash function h by
h(x) = x mod N
(The book uses the funny mod called % in Java, so it must use a slightly more complicated definition of h. We shall continue to use the "real" (i.e., mathematical) definition of mod, and hence our definition of h is a little simpler.)
Choosing N to be prime tends to lower the collision rate, but choosing N to be a power of 2 permits a faster computation since mod with a power of two simply means taking the low order bits.
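The mod method is a one-liner in Python, whose % operator already agrees with the mathematical mod whenever N is positive. The sketch also shows the low-order-bits shortcut for N a power of two. Both function names are invented for this illustration.

```python
def compress_mod(x, N):
    """h(x) = x mod N; for N > 0 Python's % always returns a value in [0, N-1]."""
    return x % N

def compress_pow2(x, N):
    """Same result when N is a power of two: keep only the low-order bits."""
    if N <= 0 or N & (N - 1) != 0:
        raise ValueError("N must be a power of two")
    return x & (N - 1)
```

For example, `compress_mod(-7, 5)` is 3, whereas Java's % would give -2 for the same operands.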
MAD stands for multiply-add-divide (mod is essentially division). We still use mod N to get the numbers in the range 0..N-1, but we are a little fancier and try to spread the numbers out first. Specifically, we define the hash function h via
h(x) = (ax+b) mod N
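The values a and b are chosen (often at random) as positive integers that are not multiples of N. The restriction matters most for a: if a were a multiple of N, every x would map to b mod N.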
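A small Python sketch of the MAD method. Drawing a and b from 1..N-1 is one simple way (assumed here) to get positive integers that are not multiples of N; the name `make_mad` and the optional `rng` parameter are inventions of this example.

```python
import random

def make_mad(N, rng=random):
    """Return a compression map h(x) = (a*x + b) mod N with random a and b.

    Assumption of this sketch: N >= 2, and a, b are drawn from 1..N-1,
    so neither is a multiple of N.
    """
    a = rng.randrange(1, N)
    b = rng.randrange(1, N)

    def h(x):
        return (a * x + b) % N    # Python's % gives a value in [0, N-1]

    return h
```

Usage: `h = make_mad(13)` fixes a and b once, and the same h must then be used for every operation on the table.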
The question we wish to answer is what to do when two distinct keys map to the same value, i.e., when h(k)=h(k'). In this case we have two items to store in one bucket. This discussion also covers the case where we permit multiple items to have the same key.
The idea is simple: each bucket, instead of holding an item, holds a reference to a container of items. That is, each bucket refers to the trivial log file implementation of a dictionary, but only for the keys that map to this container.
The code is simple: you just error check and pass the work off to the trivial implementation used for the individual bucket.
Algorithm findElement(k):
    B←A[h(k)]
    if B is empty then
        return NO_SUCH_KEY
    // now just do the trivial linear search
    return B.findElement(k)

Algorithm insertItem(k,e):
    if A[h(k)] is empty then
        Create B, an empty sequence-based dictionary
        A[h(k)]←B
    else
        B←A[h(k)]
    B.insertItem(k,e)

Algorithm removeElement(k):
    B←A[h(k)]
    if B is empty then
        return NO_SUCH_KEY
    else
        return B.removeElement(k)
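The three algorithms above might look like this in Python. This is a sketch, not the book's implementation: each bucket is a plain Python list standing in for the sequence-based dictionary, Python's built-in `hash` followed by mod N stands in for h, and the class name, the default N = 11, and the `NO_SUCH_KEY` sentinel are illustrative.

```python
NO_SUCH_KEY = object()   # sentinel returned when the key is absent (assumption)

class ChainedHashTable:
    """Hash table with separate chaining; each bucket is a list of (k, e) pairs."""

    def __init__(self, N=11):
        self.N = N
        self.A = [None] * N              # None marks an empty bucket

    def _h(self, k):
        return hash(k) % self.N          # hash code, then the mod compression map

    def find_element(self, k):
        B = self.A[self._h(k)]
        if B is None:
            return NO_SUCH_KEY
        for key, e in B:                 # the trivial linear search
            if key == k:
                return e
        return NO_SUCH_KEY

    def insert_item(self, k, e):
        i = self._h(k)
        if self.A[i] is None:
            self.A[i] = []               # create an empty bucket on first use
        self.A[i].append((k, e))

    def remove_element(self, k):
        B = self.A[self._h(k)]
        if B is None:
            return NO_SUCH_KEY
        for j, (key, e) in enumerate(B):
            if key == k:
                del B[j]
                return e
        return NO_SUCH_KEY
```

With N = 4, the keys 1 and 5 land in the same bucket, and both remain retrievable because the bucket holds a list rather than a single item.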
Homework: R-2.19
Allan Gottlieb