Basic Algorithms: Lecture 12

================ Start Lecture #12 ================

Note: A practice midterm is on the web.

Remark: It is not as trivial to find the new insertion point for heaps when using a linked implementation.

Homework: Show the steps for inserting an element with key 2 in the heap of Figure 2.41.

Removal

Trivial, right? Just remove the root since that must contain an element with minimum key. Also decrease n by one.
Wrong!
What remains is TWO trees.

We do want the element stored at the root but we must put some other element in the root. The one we choose is our friend the last node.

But the last node is likely not a valid root, i.e., it will destroy the heap property since it will likely be bigger than one of its new children. So we have to bubble this one down, using the procedure explained below. We also need to find a new last node, but that really is trivial: it is the node stored at the new value of n.

Down-Heap Bubbling

If the new root is the only internal node then we are done.

If only one child of the root is internal (it must be the left child) compare its key with the key of the root and swap if needed.

If both children of the root are internal, choose the child with the smaller key and swap with the root if needed.

The original last node became the root and has now been bubbled down to level 1. But it might still be bigger than a child, so we keep bubbling. At worst we need Θ(h) bubbling steps, which is again logarithmic in n, as desired.
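Here is a minimal sketch of removal with down-heap bubbling, in Java (the book's language). It is unofficial and assumes the standard array representation in which heap[1..n] holds the keys (index 0 unused), so the children of position i sit at positions 2i and 2i+1; the names downHeap and removeMin are illustrative.

    // Bubble the key at position i down until the heap property holds.
    static void downHeap(int[] heap, int n, int i) {
        while (2 * i <= n) {                          // i has a left child
            int child = 2 * i;                        // left child
            if (child + 1 <= n && heap[child + 1] < heap[child])
                child = child + 1;                    // right child is smaller
            if (heap[i] <= heap[child])
                break;                                // heap property restored
            int tmp = heap[i];                        // swap with smaller child
            heap[i] = heap[child];
            heap[child] = tmp;
            i = child;                                // continue one level down
        }
    }

    // Remove and return the minimum: the last node becomes the root,
    // n shrinks by one (the caller keeps the new n), then bubble down.
    static int removeMin(int[] heap, int n) {
        int min = heap[1];
        heap[1] = heap[n];
        downHeap(heap, n - 1, 1);
        return min;
    }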

Homework: R-2.16

    Operation            Time
    size, isEmpty        O(1)
    minElement, minKey   O(1)
    insertItem           Θ(log n)
    removeMin            Θ(log n)

Performance

The table above gives the performance of the heap implementation of a priority queue. As desired, the main operations have logarithmic time complexity. It is for this reason that heap-sort is fast.

Summary of heaps

2.4.4 Heap-Sort (and some extras)

The goal is to sort a sequence S. We return to PQ-sort, where we insert the elements of S into a priority queue and then use removeMin to obtain the sorted version. When we use a heap to implement the priority queue, each insertion and removal takes Θ(log n), so the entire algorithm takes Θ(n log n). The heap implementation of PQ-sort is called heap-sort and we have shown

Theorem: The heap-sort algorithm sorts a sequence of n comparable elements in Θ(n log n) time.
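As a quick unofficial illustration, PQ-sort can be run directly on top of java.util.PriorityQueue, which is itself implemented as a binary min-heap; pqSort is an illustrative name, not the book's.

    import java.util.PriorityQueue;

    // PQ-sort with a heap: n insertItems followed by n removeMins,
    // each Θ(log n), for Θ(n log n) in total.
    static void pqSort(int[] s) {
        PriorityQueue<Integer> pq = new PriorityQueue<>();
        for (int x : s)
            pq.add(x);                   // phase 1: insert everything
        for (int i = 0; i < s.length; i++)
            s[i] = pq.remove();          // phase 2: removeMin, in order
    }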

Implementing Heap-Sort In Place

In place means that we use the space occupied by the input. More precisely, it means that the space required is just the input + O(1) additional memory. The algorithm above required Θ(n) additional space to store the heap.

The in-place heap-sort of S assumes that S is implemented as an array and proceeds as follows. (This presentation, beyond the definition of ``in place'', is unofficial; i.e., it will not appear on problem sets or exams.)

  1. Logically divide the array into a portion in the front that contains the growing heap and the rest that contains the elements of the array that have not yet been dealt with.
  2. Do the insertions as with a normal heap-sort, but change the comparison so that a maximum element is in the root (i.e., a parent is no smaller than a child).
  3. Now do the removals from the heap, moving the dividing line back toward the front: each removed maximum goes into the slot just vacated at the end of the shrinking heap, so the array ends up sorted (see the sketch below).
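A minimal unofficial sketch in Java, assuming 0-based indexing (the children of position i are 2i+1 and 2i+2) and a max-heap as described in step 2; all names are illustrative.

    static void inPlaceHeapSort(int[] s) {
        // Phase 1: grow the heap over the front of the array.
        for (int i = 1; i < s.length; i++)
            upHeap(s, i);                    // bubble s[i] up into the heap
        // Phase 2: repeatedly swap the max (root) past the shrinking heap.
        for (int last = s.length - 1; last > 0; last--) {
            int tmp = s[0]; s[0] = s[last]; s[last] = tmp;
            downHeapMax(s, last, 0);         // restore the heap on s[0..last-1]
        }
    }

    // Up-heap bubbling with the reversed comparison (parent >= child).
    static void upHeap(int[] s, int i) {
        while (i > 0 && s[(i - 1) / 2] < s[i]) {
            int p = (i - 1) / 2;
            int tmp = s[p]; s[p] = s[i]; s[i] = tmp;
            i = p;
        }
    }

    // Down-heap bubbling for the max-heap, heap occupying s[0..n-1].
    static void downHeapMax(int[] s, int n, int i) {
        while (2 * i + 1 < n) {
            int child = 2 * i + 1;
            if (child + 1 < n && s[child + 1] > s[child])
                child++;                     // right child is bigger
            if (s[i] >= s[child])
                break;                       // heap property holds
            int tmp = s[i]; s[i] = s[child]; s[child] = tmp;
            i = child;
        }
    }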

Bottom-Up Heap Constructor (unofficial)

If you are given at the beginning all n elements that are to be inserted, the total insertion time for all inserts can be reduced to O(n) from O(n log n). The basic idea, assuming n = 2^k - 1 for some k, is as follows (see the sketch after the list).

  1. Take out the first element and call it r.
  2. Divide the remaining 2^k - 2 elements into two parts, each of size 2^(k-1) - 1.
  3. Recursively make each of these two parts into a heap.
  4. Make a tree with r as root and the two heaps as children.
  5. Down-heap bubble r.
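The same recursion is usually run bottom-up over the array with no recursion at all: down-heap every internal node, deepest level first; the per-node costs sum to O(n). An unofficial sketch, using a 0-based min-heap and illustrative names:

    // Build a min-heap in place over a[0..n-1]. Positions n/2 .. n-1 are
    // leaves, hence already one-element heaps; down-heap the internal
    // nodes from the deepest level up to the root.
    static void buildHeap(int[] a) {
        for (int i = a.length / 2 - 1; i >= 0; i--)
            downHeap0(a, a.length, i);
    }

    // Down-heap for 0-based indexing: children of i are 2i+1 and 2i+2.
    static void downHeap0(int[] a, int n, int i) {
        while (2 * i + 1 < n) {
            int child = 2 * i + 1;
            if (child + 1 < n && a[child + 1] < a[child])
                child++;                     // right child is smaller
            if (a[i] <= a[child])
                break;                       // heap property holds
            int tmp = a[i]; a[i] = a[child]; a[child] = tmp;
            i = child;
        }
    }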

Locaters (Unofficial)

Sometimes we wish to extend the priority queue ADT to include a locater that always points to the same element even when the element moves around. So if x is in a priority queue and another item is inserted, x may move during the up-heap bubbling, but the locater of x continues to refer to x.

Comparison of the Priority Queue Implementations


    Method               Unsorted Sequence   Sorted Sequence   Heap
    size, isEmpty        Θ(1)                Θ(1)              Θ(1)
    minElement, minKey   Θ(n)                Θ(1)              Θ(1)
    insertItem           Θ(1)                Θ(n)              Θ(log n)
    removeMin            Θ(n)                Θ(1)              Θ(log n)

2.5 Dictionaries and Hash Tables

Dictionaries, as the name implies, are used to contain data that may later be retrieved. Associated with each element is the key used for retrieval.

For example, consider an element to be one student's NYU transcript; the key would be the student's id number. So given the key (id number), the dictionary would return the entire element (the transcript).

2.5.1 The Unordered Dictionary ADT

A dictionary stores items, which are key-element (k,e) pairs.

We will study ordered dictionaries in the next chapter when we consider searching. Here we consider unordered dictionaries. So, for example, we do not support findSmallestKey. The methods we do support are findElement(k), insertItem(k,e), and removeElement(k).

Trivial Implementation: log files

Just store the items in a sequence.

2.5.2 Hash Tables

The idea of a hash table is simple: Store the items in an array (as done for log files) but ``somehow'' be able to figure out quickly, i.e., in Θ(1) time, which array element contains the item (k,e).

We first describe the array, which is easy, and then the ``somehow'', which is not so easy. Indeed in some sense it is impossible. What we can do is produce an implementation that, on the average, performs operations in time Θ(1).

Bucket Arrays

Allocate an array A of size N of buckets, each able to hold an item. Assume that the keys are integers in the range [0,N-1] and that no two items have the same key. Note that N may be much bigger than n. Now simply store the item (k,e) in A[k].
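An unofficial sketch of a bucket array in Java, under this section's assumptions (integer keys in [0,N-1], no two items with the same key); the element type is arbitrary, String here, and the class name is illustrative.

    // All three operations are Θ(1): the key k is itself the array index.
    class BucketArray {
        private final String[] a;                  // one bucket per possible key
        BucketArray(int N) { a = new String[N]; }
        String findElement(int k) { return a[k]; }      // null = NO_SUCH_KEY
        void insertItem(int k, String e) { a[k] = e; }
        String removeElement(int k) {
            String e = a[k];
            a[k] = null;
            return e;                                   // null = NO_SUCH_KEY
        }
    }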

Analysis of Bucket Arrays

If everything works as we assumed, we have a very fast implementation: searches, insertions, and removals are Θ(1). But there are problems, which is why section 2.5 is not finished.

  1. The keys might not be unique (although in many applications they are): This is a simple example of a collision. We discuss collisions in 2.5.5.
  2. The keys might not be integers: We can always treat any (computer stored) object as an integer. Just view the object as a bunch of bits and then consider that a base two non-negative integer.
  3. But those integers might not be ``computer integers'', that is, they might have more bits than the largest integer type in our programming language: True, we discuss methods for converting long bit strings into computer integers in 2.5.3.
  4. But on many machines the number of computer integers is huge. We can't possibly have a bucket array of size 2^64: True, we discuss compressing computer integers into a smaller range in 2.5.4.

2.5.3 Hash Functions

We need a hash function h that maps keys to integers in the range [0,N-1]. Then we will store the item (k,e) in bucket A[h(k)] (we are for now ignoring collisions). This problem is divided into two parts. A hash code assigns to each key a computer integer and then a compression map converts any computer integer into one in the range [0,N-1]. Each of these steps can introduce collisions. So even if the keys were unique to begin with, collisions are an important topic.

Hash Codes

A hash code assigns to any key an integer value. The problem we have to solve is that the key may have more bits than are permitted in our integer values. We first view the key as a bunch of integer values (to be explained) and then combine these integer values into one.

If our integer values are restricted to 32 bits and our keys are 64 bits, we simply view the high order 32 bits as one value and the low order as another. In general if
⌈numBitsInKey / numBitsInIntegerValue⌉ = k
we view the key as k integer values. How should we combine the k values into one?

Summing Components

Simply add the k values.
But, but, but what about overflows?
Ignore them (or use exclusive or instead of addition).
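An unofficial sketch, assuming 32-bit integer values and a 64-bit key (so k=2); the name sumHashCode is illustrative.

    // View the key as two 32-bit pieces and add them; Java int addition
    // silently wraps on overflow, which is exactly "ignore them".
    static int sumHashCode(long key) {
        int high = (int) (key >>> 32);   // high-order 32 bits
        int low  = (int) key;            // low-order 32 bits
        return high + low;
        // or: return high ^ low;        // exclusive or instead of addition
    }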

Polynomial Hash Codes

The summing-components method gives very many collisions when used for character strings. If 4 characters fill an integer value, then `temphash' and `hashtemp' will give the same value. If one decided to use integer values just large enough to hold one (unicode) character, then there would be many, many common collisions: `t21' and `t12' for one, `mite' and `time' for another.

If we call the k integer values x0,...,x(k-1), then a better scheme for combining is to choose a positive integer value a and compute ∑ xi·a^i = x0 + x1·a + x2·a^2 + ... + x(k-1)·a^(k-1).

Same comment about overflows applies.

The authors have found that using a = 33, 37, 39, or 41 worked well for character strings that are English words.
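An unofficial sketch, evaluating the polynomial with Horner's rule (this computes x0·a^(k-1) + ... + x(k-1), the same polynomial idea with the exponents reversed, so it still separates permutations such as `temphash' and `hashtemp' in general). Java's own String.hashCode uses exactly this scheme with a = 31.

    // Polynomial hash code over the characters of a string; overflow in
    // the int arithmetic is again simply ignored.
    static int polyHashCode(String key, int a) {
        int h = 0;
        for (int i = 0; i < key.length(); i++)
            h = a * h + key.charAt(i);   // Horner's rule
        return h;
    }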

2.5.4 Compression Maps

The problem we wish to solve in this section is to map integers in some possibly large range into integers in the range [0,N-1].
This is trivial! Why not map all the integers to 0?
Because we want to minimize collisions.

The Division Method

This is often called the mod method. One simple way to turn any integer x into one in the range [0,N-1] is to compute x mod N. That is, we define the hash function h by

            h(x) = x mod N
(The book uses the funny mod, called % in Java, and so must use a slightly more complicated definition of h. We shall continue to use the “real” (i.e., mathematical) definition of mod, and hence our definition of h is a little simpler.)

Choosing N to be prime tends to lower the collision rate, but choosing N to be a power of 2 permits a faster computation since mod with a power of two simply means taking the low order bits.
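An unofficial one-liner; Math.floorMod supplies the mathematical mod, sidestepping the complication with Java's % on negative hash codes mentioned above.

    // Compress any int into [0, N-1] by the division method.
    static int divisionCompress(int x, int N) {
        return Math.floorMod(x, N);    // unlike x % N, never negative
    }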

The MAD Method

MAD stands for multiply-add-divide (mod is essentially division). We still use mod N to get the numbers into the range 0..N-1, but we are a little fancier and try to spread the numbers out first. Specifically, we define the hash function h via

            h(x) = (ax+b) mod N

The values a and b are chosen (often at random) as positive integers, with a not a multiple of N (otherwise h would be constant).
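An unofficial sketch, doing the multiply in long arithmetic so the product a·x cannot overflow before the mod is taken; the name madCompress is illustrative.

    // MAD compression: h(x) = (ax + b) mod N, with a mod N != 0.
    static int madCompress(int x, int a, int b, int N) {
        return (int) Math.floorMod((long) a * x + b, (long) N);
    }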

2.5.5 Collision-Handling Schemes

The question we wish to answer is what to do when two distinct keys map to the same value, i.e., when h(k)=h(k'). In this case we have two items to store in one bucket. This discussion also covers the case where we permit multiple items to have the same key.

Separate Chaining

The idea is simple: each bucket, instead of holding an item, holds a reference to a container of items. That is, each bucket refers to the trivial log-file implementation of a dictionary, but only for the keys that map to this bucket.

The code is simple: you just error-check and pass the work off to the trivial implementation used for the individual bucket.

Algorithm findElement(k):
    B←A[h(k)]
    if B is empty then
        return NO_SUCH_KEY
    // now just do the trivial linear search
    return B.findElement(k)

Algorithm insertItem(k,e):
    if A[h(k)] is empty then
        Create B, an empty sequence-based dictionary
        A[h(k)]←B
    else
        B←A[h(k)]
    B.insertItem(k,e)

Algorithm removeElement(k)
    B←A[h(k)]
    if B is empty then
        return NO_SUCH_KEY
    else
        return B.removeElement(k)
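An unofficial Java rendering of the three routines, using a linked list per bucket and the division method for h; the method names mirror the pseudo-code, everything else is illustrative.

    import java.util.Iterator;
    import java.util.LinkedList;

    class ChainedHashTable {
        static class Item {                       // a (k,e) pair
            int k; String e;
            Item(int k, String e) { this.k = k; this.e = e; }
        }

        private final LinkedList<Item>[] a;       // the bucket array

        @SuppressWarnings("unchecked")
        ChainedHashTable(int N) { a = new LinkedList[N]; }

        private int h(int k) { return Math.floorMod(k, a.length); }

        String findElement(int k) {
            LinkedList<Item> b = a[h(k)];
            if (b == null) return null;           // NO_SUCH_KEY
            for (Item item : b)                   // trivial linear search
                if (item.k == k) return item.e;
            return null;                          // NO_SUCH_KEY
        }

        void insertItem(int k, String e) {
            if (a[h(k)] == null)                  // create the bucket's little
                a[h(k)] = new LinkedList<>();     // dictionary lazily
            a[h(k)].add(new Item(k, e));
        }

        String removeElement(int k) {
            LinkedList<Item> b = a[h(k)];
            if (b == null) return null;           // NO_SUCH_KEY
            Iterator<Item> it = b.iterator();
            while (it.hasNext()) {
                Item item = it.next();
                if (item.k == k) { it.remove(); return item.e; }
            }
            return null;                          // NO_SUCH_KEY
        }
    }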

Homework: R-2.19

Allan Gottlieb