**Note**: A practice midterm is on the web.

**Remark**: It is not as trivial to find the new
insertion point for heaps when using a linked implementation.

**Homework:** Show the steps for inserting an element
with key 2 in the heap of Figure 2.41.

Trivial, right? Just remove the root since that must contain an
element with minimum key. Also decrease n by one.

Wrong!

What remains is **TWO** trees.

We do want the element stored at the root but we must put some other element in the root. The one we choose is our friend the last node.

But the last node is likely not to be a valid root, i.e., it will destroy the heap property since it will likely be bigger than one of its new children. So we have to bubble this one down. It is shown in pale red on the right, and the procedure is explained below. We also need to find a new last node, but that really is trivial: It is the node stored at the new value of n.

If the new root is the only internal node then we are done.

If only one child of the root is internal (it must be the left child) compare its key with the key of the root and swap if needed.

If both children of the root are internal, choose the child with the smaller key and swap with the root if needed.

The original last node became the root and has now been bubbled
down to level 1. But it might still be bigger than a child, so we keep
bubbling. At worst we need Θ(h) bubbling steps, which is again
logarithmic in n, as desired.
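The removal procedure described above can be sketched in Python. This is a minimal sketch, not the course's official code: it assumes the heap is stored in a list with index 0 unused, so that the keys occupy positions 1..n and the children of position i are at 2i and 2i+1. The name `remove_min` and the choice to store bare keys rather than (key, element) items are assumptions made for illustration.

```python
def remove_min(heap):
    """Remove and return the minimum key of a 1-based array min-heap.

    heap[0] is an unused placeholder; heap[1..n] hold the keys.
    """
    n = len(heap) - 1
    if n == 0:
        raise IndexError("remove_min from an empty heap")
    smallest = heap[1]
    last = heap.pop()              # take the last node; n shrinks by one
    n -= 1
    if n == 0:
        return smallest            # the root was the only node
    heap[1] = last                 # the last node becomes the new root
    i = 1
    while 2 * i <= n:              # node i has at least a left child
        child = 2 * i
        if child + 1 <= n and heap[child + 1] < heap[child]:
            child += 1             # choose the child with the smaller key
        if heap[i] <= heap[child]:
            break                  # heap property holds again
        heap[i], heap[child] = heap[child], heap[i]
        i = child                  # keep bubbling down
    return smallest
```

Each pass of the loop moves one level down the tree, so the number of swaps is bounded by the height.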

**Homework:** R-2.16

Operation | Time |
---|---|
size, isEmpty | O(1) |
minElement, minKey | O(1) |
insertItem | Θ(log n) |
removeMin | Θ(log n) |

The table on the right gives the performance of the heap implementation of a priority queue. As desired, the main operations have logarithmic time complexity. It is for this reason that heap sort is fast.

- A heap containing n elements is a complete tree T with n internal nodes
each storing a reference to a key and a reference to an element.
The tree also contains n+1 leaves, which are not used.

- The heap is a very fast implementation of a priority queue.
The main operations are logarithmic and the others are constant
time.
- The height of the heap is Θ(log(n)) since T is complete.
- The worst-case complexity of both up-heap and down-heap bubbling is Θ(height)=Θ(log(n)).
- Finding the insertion position and updating the last node position take constant time.
- The insertItem and removeMin operations are Θ(log(n)); minElement and minKey are Θ(1), since the minimum is always at the root.

- Using these insertion and removeMin algorithms makes sorting using a priority queue fast, i.e., Θ(n*log(n)), as we shall state officially in the next section.

The goal is to sort a sequence S. We return to the PQ-sort where
we insert the elements of S into a priority queue and then use
removeMin to obtain the sorted version. When we use a heap to
implement the priority queue, each insertion and removal takes
Θ(log(n)) so the entire algorithm takes Θ(nlog(n)). The heap
implementation of PQ-sort is called **heap-sort** and we
have shown

**Theorem**:
The heap-sort algorithm sorts a sequence of n comparable elements in
Θ(nlog(n)) time.
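As an illustration of the PQ-sort just described, here is a short sketch that uses Python's standard `heapq` module as the heap-based priority queue. The function name `heap_sort` is an assumption; `heapq.heappush` and `heapq.heappop` play the roles of insertItem and removeMin.

```python
import heapq


def heap_sort(seq):
    """Return the elements of seq in nondecreasing order (PQ-sort with a heap)."""
    pq = []
    for x in seq:                      # n insertions, each O(log n)
        heapq.heappush(pq, x)
    # n removals of the minimum, each O(log n)
    return [heapq.heappop(pq) for _ in range(len(pq))]
```

Note that this version builds a separate list `pq`, so it uses Θ(n) space beyond the input.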

In place means that we use the space occupied by the input. More precisely, it means that the space required is just the input + O(1) additional memory. The algorithm above required Θ(n) additional space to store the heap.

The in place heap-sort of S assumes that S is implemented as an array and proceeds as follows. (This presentation, beyond the definition of ``in place'', is unofficial; i.e., it will not appear on problem sets or exams.)

- Logically divide the array into a portion in the front that
contains the growing heap and the rest that contains the elements
of the array that have not yet been dealt with.
- Initially the heap part is empty and the not-yet-dealt-with part of the array is the entire array.
- At each insertion we remove the leftmost entry from the array part and insert it in the heap, growing the heap to include the memory previously used by the newly inserted element. The blue line moves down.
- At the end the heap uses all the space. We are making the optimization discussed before that we only store the internal nodes of the heap and do not waste the first (index 0) component of the array used to store the heap.

- Do the insertions as with a normal heap-sort but change the comparison so that a maximum element is in the root (i.e., a parent is no smaller than a child).
- Now do the removals from the heap, moving the blue line back up.
- The elements removed are in order big to small.
- This is perfect since we are going to store them starting at the right of the array since that is the portion of the array that is made available by the shrinking heap.
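The two phases above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the official presentation: the array is a Python list, index 0 is used (so the children of position i are at 2i+1 and 2i+2), and the name `heap_sort_in_place` is made up for this example. The variables `size` and `end` play the role of the blue line.

```python
def heap_sort_in_place(a):
    """Sort the list a in nondecreasing order using O(1) extra space."""
    n = len(a)

    # Phase 1: grow a max-heap in a[0:size], one insertion at a time.
    for size in range(1, n):
        i = size                        # a[size] is the element being inserted
        while i > 0:                    # up-heap bubbling
            parent = (i - 1) // 2
            if a[parent] >= a[i]:
                break
            a[parent], a[i] = a[i], a[parent]
            i = parent

    # Phase 2: repeatedly remove the maximum and store it just past
    # the shrinking heap, i.e., filling the array from the right.
    for end in range(n - 1, 0, -1):
        a[0], a[end] = a[end], a[0]     # max goes to its final position
        i = 0
        while 2 * i + 1 < end:          # down-heap bubbling within a[0:end]
            child = 2 * i + 1
            if child + 1 < end and a[child + 1] > a[child]:
                child += 1              # choose the child with the larger key
            if a[i] >= a[child]:
                break
            a[i], a[child] = a[child], a[i]
            i = child
```

The function modifies its argument and returns nothing; apart from a few index variables it needs no storage beyond the input list.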

If you are given at the beginning all n elements that are to be
inserted, the total insertion time for all inserts can be reduced to
O(n) from O(nlog(n)). The basic idea, assuming n=2^{k}-1 for some k, is

- Take out the first element and call it r.
- Divide the remaining 2^{k}-2 elements into two parts, each of size 2^{k-1}-1.
- Recursively build a heap from each of these two parts.
- Make a tree with r as root and the two heaps as children.
- Down-heap bubble r.
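The recursive construction in the list above can be sketched in Python. The representation is an assumption chosen for brevity: a tree is either `None` or a three-element list `[key, left, right]`, and the names `build_heap` and `down_heap` are hypothetical. The sketch expects the number of keys to be one less than a power of two, as in the text; for other sizes the result is still heap-ordered but need not be a complete tree.

```python
def down_heap(node):
    """Bubble the key at node down until neither child has a smaller key."""
    while True:
        children = [c for c in (node[1], node[2]) if c is not None]
        if not children:
            return
        smallest = min(children, key=lambda c: c[0])
        if node[0] <= smallest[0]:
            return
        node[0], smallest[0] = smallest[0], node[0]
        node = smallest


def build_heap(keys, lo=0, hi=None):
    """Build a min-heap from keys[lo:hi] and return its root."""
    if hi is None:
        hi = len(keys)
    if lo >= hi:
        return None
    # keys[lo] is r; the rest is split into two halves.
    mid = lo + 1 + (hi - lo - 1) // 2
    root = [keys[lo], build_heap(keys, lo + 1, mid), build_heap(keys, mid, hi)]
    down_heap(root)
    return root
```

For example, `build_heap([9, 4, 7, 1, 8, 2, 6])` returns a tree whose root key is 1. Index ranges are passed instead of list slices so that no time is spent copying the keys.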

Sometimes we wish to extend the priority queue ADT to include a locator that always points to the same element even when the element moves around. So if x is in a priority queue and another item is inserted, x may move during the up-heap bubbling, but the locator of x continues to refer to x.

Method | Unsorted Sequence | Sorted Sequence | Heap |
---|---|---|---|
size, isEmpty | Θ(1) | Θ(1) | Θ(1) |
minElement, minKey | Θ(n) | Θ(1) | Θ(1) |
insertItem | Θ(1) | Θ(n) | Θ(log(n)) |
removeMin | Θ(n) | Θ(1) | Θ(log(n)) |

**Dictionaries**, as the name implies, are used to
contain data that may later be retrieved. Associated with each
element is the **key** used for retrieval.

For example, consider an element to be one student's NYU transcript; the key would be the student id number. So given the key (id number) the dictionary would return the entire element (the transcript).

A dictionary stores **items**, which are key-element
(k,e) pairs.

We will study ordered dictionaries in the next chapter when we consider searching. Here we consider unordered dictionaries. So, for example, we do not support findSmallestKey. The methods we do support are

- findElement(k): Return an element having key k or signal an error if no such element exists.
- insertItem(k,e): Insert an item with key k and element e.
- removeElement(k): Remove an item with key k and return its element. Signal an error if no such item exists.

Just store the items in a sequence.

- Trivial (and fast) to insert: Θ(1)
- Minimal space: Θ(n)
- Slow for finding or removing elements: Θ(n) per operation
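A log file of this kind might look as follows in Python. This is only a sketch: the class name `LogFileDictionary`, the snake_case method names, and the use of `KeyError` to "signal an error" are all assumptions made for illustration.

```python
class LogFileDictionary:
    """Unordered dictionary kept as an unsorted list of (key, element) items."""

    def __init__(self):
        self._items = []

    def insert_item(self, k, e):
        # Append at the end; no search is needed.
        self._items.append((k, e))

    def find_element(self, k):
        # Linear scan: up to n items are examined.
        for key, elem in self._items:
            if key == k:
                return elem
        raise KeyError(k)

    def remove_element(self, k):
        for i, (key, elem) in enumerate(self._items):
            if key == k:
                # Fill the hole with the last item instead of shifting.
                self._items[i] = self._items[-1]
                self._items.pop()
                return elem
        raise KeyError(k)
```

Insertion does no searching at all, which is why it is fast, while both lookups may have to walk the entire list.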

The idea of a **hash table** is simple: Store the
items in an array (as done for log files) but ``somehow'' be able to
figure out quickly, i.e., Θ(1), which array element contains the item
(k,e).

We first describe the array, which is easy, and then the ``somehow'', which is not so easy. Indeed in some sense it is impossible. What we can do is produce an implementation that, on the average, performs operations in time Θ(1).

Allocate an array A of size N of **buckets**, each able
to hold an item. Assume that the keys are integers in the range
[0,N-1] and that no two items have the same key. Note that N may be
*much* bigger than n. Now simply store the item (k,e) in A[k].

If everything works as we assumed, we have a very fast implementation: searches, insertions, and removals are Θ(1). But there are problems, which is why section 2.5 is not finished.

- The keys might not be unique (although in many applications they are): This is a simple example of a **collision**. We discuss collisions in 2.5.5.
- The keys might not be integers: We can always treat any (computer stored) object as an integer. Just view the object as a bunch of bits and then consider that a base two non-negative integer.
- But those integers might not be ``computer integers'', that is, they might have more bits than the largest integer type in our programming language: True, we discuss methods for converting long bit strings into computer integers in 2.5.3.
- But on many machines the number of computer integers is huge. We can't possibly have a bucket array of size 2^{64}: True, we discuss compressing computer integers into a smaller range in 2.5.4.

We need a **hash function** h that maps keys to
integers in the range [0,N-1]. Then we will store the item (k,e) in
bucket A[h(k)] (we are for now ignoring collisions). This problem is
divided into two parts. A hash code assigns to each key a computer
integer and then a compression map converts any computer integer into
one in the range [0,N-1]. Each of these steps can introduce
collisions. So even if the keys were unique to begin with, collisions
are an important topic.

A **hash code** assigns to any key an integer value.
The problem we have to solve is that the key may have more bits than
are permitted in our integer values. We first view the key as a bunch
of integer values (to be explained) and then combine these integer
values into one.

If our integer values are restricted to 32 bits and our keys are 64
bits, we simply view the high order 32 bits as one value and the low
order as another. In general if

⌈numBitsInKey / numBitsInIntegerValue⌉ = k

we view the key as k integer values. How should we combine the k
values into one?

Simply add the k values.

But, but, but what about overflows?

Ignore them (or use exclusive or instead of addition).

The summing components method gives very many collisions when used for character strings. If 4 characters fill an integer value, then `temphash' and `hashtemp' will give the same value. If one decided to use integer values just large enough to hold one (Unicode) character, then there would be many, many common collisions: `t21' and `t12' for one, `mite' and `time' for another.
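The collisions mentioned above are easy to reproduce. The sketch below is illustrative only: the function name `sum_hash`, its parameters, the restriction to ASCII strings, and the 32-bit integer size are assumptions. Overflow is ignored by keeping only the low-order bits.

```python
def sum_hash(s, chars_per_value=1, bits=32):
    """Hash code by summing components, ignoring overflow.

    The string is cut into pieces of chars_per_value characters; each
    piece is read as one integer value and the values are added.
    """
    mask = (1 << bits) - 1
    data = s.encode("ascii")           # assumption: ASCII-only keys
    total = 0
    for i in range(0, len(data), chars_per_value):
        value = int.from_bytes(data[i:i + chars_per_value], "big")
        total = (total + value) & mask  # drop any overflow bits
    return total
```

With one character per value, `sum_hash("t21")` equals `sum_hash("t12")` and `sum_hash("mite")` equals `sum_hash("time")`; with four characters per value, `sum_hash("temphash", 4)` equals `sum_hash("hashtemp", 4)`. Addition is commutative, so any reordering of the components collides.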

If we call the k integer values x_{0},...,x_{k-1},
then a better scheme for combining is to choose a positive integer
value a and compute
∑x_{i}a^{i} = x_{0} + x_{1}a + ... + x_{k-1}a^{k-1}.

Same comment about overflows applies.

The authors have found that using a = 33, 37, 39, or 41 worked well for character strings that are English words.
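A polynomial hash code for strings can be sketched as follows, taking each character as one integer value. The name `poly_hash`, the default a = 33, and the 32-bit integer size are assumptions for illustration. The polynomial is evaluated with Horner's rule, starting from the last component, and overflow is ignored by masking.

```python
def poly_hash(s, a=33, bits=32):
    """Compute x_0 + x_1*a + ... + x_{k-1}*a^(k-1), ignoring overflow,
    where x_i is the character code of s[i]."""
    mask = (1 << bits) - 1
    h = 0
    for ch in reversed(s):             # Horner's rule, highest power first
        h = (h * a + ord(ch)) & mask
    return h
```

Unlike the summing method, the position of each character now matters, so `poly_hash("t21")` and `poly_hash("t12")` differ, as do `poly_hash("mite")` and `poly_hash("time")`.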

The problem we wish to solve in this section is to map integers in
some, possibly large range, into integers in the range [0,N-1].

This is trivial! Why not map all the integers into 0?

We want to minimize collisions.

This is often called the mod method. One simple way to turn any integer x into one in the range [0,N-1] is to compute x mod N. That is, we define the hash function h by

h(x) = x mod N

(The book uses the funny mod called % in Java, so it must use a slightly more complicated definition of h. We shall continue to use the “real” (i.e., mathematical) definition of mod, and hence our definition of h is a little simpler.)

Choosing N to be prime tends to lower the collision rate, but choosing N to be a power of 2 permits a faster computation since mod with a power of two simply means taking the low order bits.
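Both variants of the mod method can be sketched in a few lines of Python. The function names are hypothetical. Conveniently, Python's `%` with a positive modulus already returns a value in [0,N-1], so it matches the mathematical mod used here.

```python
def compress_mod(x, N):
    """Division method: map any integer x into the range [0, N-1]."""
    return x % N                       # non-negative in Python when N > 0


def compress_pow2(x, bits):
    """Same as x mod 2**bits, computed by keeping the low-order bits."""
    return x & ((1 << bits) - 1)
```

For instance, `compress_mod(-7, 5)` is 3, and `compress_pow2(x, 4)` agrees with `x % 16` for every integer x, including negative ones.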

MAD stands for multiply-add-divide (mod is essentially division). We still use mod N to get the numbers in the range 0..N-1, but we are a little fancier and try to spread the numbers out first. Specifically, we define the hash function h via

h(x) = (ax+b) mod N

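The values a and b are chosen (often at random) as positive integers that are not multiples of N. The restriction matters most for a: if a were a multiple of N, then ax mod N would be 0 for every x and all keys would land in bucket b mod N.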
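One way to set up a MAD compression map is sketched below. The helper name `make_mad` and the choice to draw a and b uniformly from 1..N-1 are assumptions; that range is simply one easy way to obtain positive integers that are not multiples of N.

```python
import random


def make_mad(N, rng=random):
    """Return a function h(x) = (a*x + b) mod N with random a, b in [1, N-1].

    Assumes N >= 2. rng may be the random module or a random.Random
    instance (useful for reproducible choices).
    """
    a = rng.randrange(1, N)
    b = rng.randrange(1, N)

    def h(x):
        return (a * x + b) % N

    return h
```

A typical use would be `h = make_mad(13)` followed by `h(key)` for each integer key. When N is prime, as 13 is, the N residues 0..N-1 are sent to N distinct buckets.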

The question we wish to answer is what to do when two distinct keys map to the same value, i.e., when h(k)=h(k'). In this case we have two items to store in one bucket. This discussion also covers the case where we permit multiple items to have the same key.

The idea is simple: each bucket, instead of holding an item, holds a reference to a container of items. That is, each bucket refers to the trivial log file implementation of a dictionary, but only for the keys that map to this container.

The code is simple: you just error check and pass the work off to the trivial implementation used for the individual bucket.

    Algorithm findElement(k):
        B←A[h(k)]
        if B is empty then
            return NO_SUCH_KEY
        // now just do the trivial linear search
        return B.findElement(k)

    Algorithm insertItem(k,e):
        if A[h(k)] is empty then
            Create B, an empty sequence-based dictionary
            A[h(k)]←B
        else
            B←A[h(k)]
        B.insertItem(k,e)

    Algorithm removeElement(k):
        B←A[h(k)]
        if B is empty then
            return NO_SUCH_KEY
        else
            return B.removeElement(k)
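The same algorithms can be sketched as runnable Python. Several details here are assumptions rather than part of the notes: the class name `ChainedHashTable`, the default bucket count of 11, the use of Python's built-in `hash` followed by mod N as the hash function h, a plain list as each bucket's sequence, and raising `KeyError` where the pseudocode returns NO_SUCH_KEY.

```python
class ChainedHashTable:
    """Unordered dictionary using a bucket array with separate chaining."""

    def __init__(self, N=11):
        self._buckets = [None] * N     # None marks an empty bucket

    def _h(self, k):
        return hash(k) % len(self._buckets)

    def find_element(self, k):
        bucket = self._buckets[self._h(k)]
        if bucket is not None:
            for key, elem in bucket:   # trivial linear search in the bucket
                if key == k:
                    return elem
        raise KeyError(k)

    def insert_item(self, k, e):
        i = self._h(k)
        if self._buckets[i] is None:
            self._buckets[i] = []      # create the bucket's sequence lazily
        self._buckets[i].append((k, e))

    def remove_element(self, k):
        bucket = self._buckets[self._h(k)]
        if bucket is not None:
            for j, (key, elem) in enumerate(bucket):
                if key == k:
                    del bucket[j]
                    return elem
        raise KeyError(k)
```

With N = 4, the integer keys 1, 5, and 9 all map to bucket 1, yet each can still be found and removed individually because the bucket keeps all three items.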

**Homework:** R-2.19