Last chapter we considered searching in unordered dictionaries. In general we want to be able to insert, remove, and search fast.
For unordered dictionaries, where we don't assume any ordering between keys, the method we learned was hashing.
When the keys are ordered, we can also ask for the "next" or "previous" items. More precisely we wish to support, in addition to findElement(k), insertItem(k,e), and removeElement(k), the new methods closestKeyBefore(k), closestElemBefore(k), closestKeyAfter(k), and closestElemAfter(k).
We naturally signal an exception if no such item exists. For example, if the only keys present are 55, 22, 77, and 88, then closestKeyAfter(90) and closestElemBefore(2) each signal an exception.
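To make these semantics concrete, here is a quick sketch (mine, not part of the notes) using Python's bisect module on that key set, taking ``before'' to mean the largest key ≤ k and ``after'' to mean the smallest key ≥ k:

    import bisect

    keys = [22, 55, 77, 88]                 # the example keys, in sorted order

    def closest_key_before(k):
        i = bisect.bisect_right(keys, k)    # rank of the first key > k
        if i == 0:
            raise KeyError("no key at or before %r" % k)
        return keys[i - 1]

    def closest_key_after(k):
        i = bisect.bisect_left(keys, k)     # rank of the first key >= k
        if i == len(keys):
            raise KeyError("no key at or after %r" % k)
        return keys[i]

    # closest_key_before(60) == 55; closest_key_after(90) and
    # closest_key_before(2) raise KeyError, as in the example above.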
We begin with the most natural implementation, which, however, is slow: Θ(n) for inserts and deletes (worst case). Later we will learn about AVL trees, which are fast: Θ(log n) for all the operations (again worst case).
We will be able to prove the logarithmic bounds. The idea, which we have seen in heaps, is to keep the tree balanced so that its height remains logarithmic in the number of items.
We use the sorted vector implementation from chapter 2 (we used it as a simple implementation of a priority queue). Recall that this keeps the items sorted in key order. Hence it is Θ(n) for inserts and removals, which is slow; however, we shall see that it is fast for finding an element and for the four new methods closestKeyBefore(k) and friends. We call this a lookup table.
The space required is Θ(n) since we grow and shrink the array supporting the vector (see extendable arrays).
As indicated, the major favorable property of a lookup table is that it is fast for (surprise) lookups, using the binary search algorithm that we study next.
In this algorithm we search the vector for an item whose key equals k; we return a special value if no such key is present.
The algorithm maintains two variables lo and hi, which are respectively lower and upper bounds on the rank where k will be found (assuming it is present).
Initially, the key could be anywhere in the vector so we start with lo=0 and hi=n-1. We write key(r) for the key at rank r and elem(r) for the element at rank r.
We then find mid, the rank (approximately) halfway between lo and hi and see how the key there compares with our desired key.
Some care is needed in writing the algorithm precisely, as it is easy to have an ``off by one'' error. Also we must handle the case in which the desired key is not present in the vector; this occurs when the search range has been reduced to the empty set (i.e., when lo exceeds hi).
Algorithm BinarySearch(S,k,lo,hi):
    Input:  An ordered vector S containing (key(r),elem(r)) at rank r
            A search key k
            Integers lo and hi
    Output: An element of S with key k and rank between lo and hi;
            NO_SUCH_KEY if no such element exists

    if lo > hi then return NO_SUCH_KEY                         // Not present
    mid ← ⌊(lo+hi)/2⌋
    if k = key(mid) then return elem(mid)                      // Found it
    if k < key(mid) then return BinarySearch(S,k,lo,mid-1)     // Try bottom ``half''
    if k > key(mid) then return BinarySearch(S,k,mid+1,hi)     // Try top ``half''
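For concreteness, here is a direct Python translation (a sketch of mine, not from the notes; I represent the vector as a Python list of (key, elem) pairs and use None for NO_SUCH_KEY):

    def binary_search(S, k, lo, hi):
        # Return the element whose key is k, at a rank between lo and hi,
        # or None if no such key exists in the sorted list S.
        if lo > hi:
            return None                              # Not present
        mid = (lo + hi) // 2                         # floor of (lo+hi)/2
        if k == S[mid][0]:
            return S[mid][1]                         # Found it
        if k < S[mid][0]:
            return binary_search(S, k, lo, mid - 1)  # Try bottom ``half''
        return binary_search(S, k, mid + 1, hi)      # Try top ``half''

    S = [(22, 'a'), (55, 'b'), (77, 'c'), (88, 'd')]
    assert binary_search(S, 77, 0, len(S) - 1) == 'c'
    assert binary_search(S, 60, 0, len(S) - 1) is None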
Do some examples on the board.
It is easy to see that the algorithm does just a few operations per recursive call. So the complexity of Binary Search is Θ(NumberOfRecursions). So the question is "How many recursions are possible for a lookup table with n items?".
The number of eligible items (i.e., the size of the range we still must consider) is hi-lo+1.
The key insight is that when we recurse, we have reduced the range to at most half of what it was before. There are two possibilities: we either tried the bottom or the top ``half''. Let's evaluate hi-lo+1 for the bottom and top half. Note that the only two possibilities for ⌊(lo+hi)/2⌋ are (lo+hi)/2 or (lo+hi)/2-(1/2)=(lo+hi-1)/2, depending on whether lo+hi is even or odd.
Bottom:
(mid-1)-lo+1 = mid-lo = ⌊(lo+hi)/2⌋-lo
≤ (lo+hi)/2-lo = (hi-lo)/2 < (hi-lo+1)/2
Top:
hi-(mid+1)+1 = hi-mid = hi-⌊(lo+hi)/2⌋
≤ hi-(lo+hi-1)/2 = (hi-lo+1)/2
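A quick numeric check (my example, not in the original): take lo=0 and hi=9, so the range size is 10 and mid = ⌊9/2⌋ = 4. The bottom half is ranks 0..3, of size 4 < (hi-lo+1)/2 = 5; the top half is ranks 5..9, of size 5 = (hi-lo+1)/2. Both bounds hold.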
So the range starts at n and is halved each time and remains an integer (i.e., if a recursive call has a range of size x, the next recursion will be at most ⌊x/2⌋).
Write on the board 10 times
(X-1)/2 ≤ ⌊X/2⌋ ≤ X/2
If B ≤ A, then Z-A ≤ Z-B
How many recursions are possible? If the range ever becomes empty, we stop (and declare the key is not present), so the longest the recursion can run is the number of times you can divide n by 2 and stay at least 1. That number is Θ(log n), showing that binary search is a logarithmic algorithm.
Informal Proof: At each step we lowered the number of possible items from an integer k to another integer ≤ k/2. If we look at k in binary, the new number cannot exceed k with the rightmost bit dropped. Since there are only about log(n) bits in n, and we start with n items, we can only perform about log(n) ``bit removals''.
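As a sanity check (a sketch of mine, not part of the notes), we can count the halvings directly:

    def max_probes(n):
        # How many times can a range of size n be halved (with floor)
        # before it becomes empty?  Each probe of binary search then
        # recurses on a range of at most half the size.
        probes = 0
        while n >= 1:
            probes += 1
            n //= 2
        return probes

    assert max_probes(1000) == 10    # log2(1000) is about 9.97
    assert max_probes(1024) == 11    # floor(log2(1024)) + 1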
Problem Set 3, Problem 1: Write the algorithm closestKeyBefore. It uses the same idea as BinarySearch.
When you do question 1 you will see that the complexity is Θ(log(n)). Proving this is not hard but is not part of the problem set.
When you do question 1 you will see that closestElemBefore, closestKeyAfter, and closestElemAfter are all very similar to closestKeyBefore. Hence they are all logarithmic algorithms. Proving this is not hard but is not part of the problem set.
Recall that a Log File is an unordered list of items.
Method | Log File | Lookup Table |
---|---|---|
findElement | Θ(n) | Θ(log n) |
insertItem | Θ(1) | Θ(n) |
removeElement | Θ(n) | Θ(n) |
closestKeyBefore | Θ(n) | Θ(log n) |
closestElemBefore | Θ(n) | Θ(log n) |
closestKeyAfter | Θ(n) | Θ(log n) |
closestElemAfter | Θ(n) | Θ(log n) |
Our goal now is to find a better implementation so that all the complexities are logarithmic. This will require us to shift from vectors to trees.
This section gives a simple tree-based implementation, which alas fails to achieve the logarithmic bounds we seek. But it is a good start, and it motivates the AVL trees we study in 3.2, which do achieve the desired bounds.
Definition: A binary search tree is a tree in which each internal node v stores an item such that every key stored in the left subtree of v is less than or equal to the key at v, which in turn is less than or equal to every key stored in the right subtree.
From the definition we see easily that an inorder traversal of the tree visits the internal nodes in nondecreasing order of the keys they store.
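For instance (a sketch of mine using a bare tuple representation, not the book's tree ADT), an inorder traversal yields the keys in nondecreasing order:

    def inorder_keys(t):
        # A tree is (key, left, right); None stands for an (empty) leaf.
        if t is None:
            return
        key, left, right = t
        yield from inorder_keys(left)    # every key here is <= key
        yield key
        yield from inorder_keys(right)   # every key here is >= key

    t = (55, (22, None, None), (77, None, (88, None, None)))
    assert list(inorder_keys(t)) == [22, 55, 77, 88]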
You search by starting at the root and going left or right according as the desired key is smaller or larger than the key at the current node. If the key at the current node is the key you seek, you are done. If you reach a leaf, the desired key is not present.
Do some examples using the tree on the right. E.g. search for 17, 80, 55, and 65.
Homework: R-3.1 and R-3.2
Here is the formal algorithm described above.
Algorithm TreeSearch(k,v):
    Input:  A search key k and a node v of a binary search tree
    Output: A node w in the subtree rooted at v such that either w is
            internal and k is stored at w, or w is a leaf where k would
            be stored if it existed

    if v is a leaf then return v
    if k = key(v) then return v
    if k < key(v) then return TreeSearch(k,T.leftChild(v))
    if k > key(v) then return TreeSearch(k,T.rightChild(v))
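Here is a Python rendering of TreeSearch (a sketch of mine; the Node class and the convention that leaves are placeholder nodes storing no item are my assumptions, following the external-node convention used above):

    class Node:
        def __init__(self, key=None, elem=None, left=None, right=None):
            self.key, self.elem = key, elem
            self.left, self.right = left, right
        def is_leaf(self):
            return self.key is None          # leaves store no item

    def tree_search(k, v):
        # Return the internal node storing k, or the leaf where k belongs.
        if v.is_leaf() or k == v.key:
            return v
        if k < v.key:
            return tree_search(k, v.left)
        return tree_search(k, v.right)

    root = Node(55, 'b',
                Node(22, 'a', Node(), Node()),
                Node(77, 'c', Node(), Node(88, 'd', Node(), Node())))
    assert tree_search(77, root).elem == 'c'
    assert tree_search(60, root).is_leaf()   # 60 is absent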
Draw a tree on the board and illustrate both finding a key and the case in which no such key exists.
It is easy to see that only a couple of operations are done per recursive call and that each call goes down a level in the tree. Hence the complexity is O(height).
So the question becomes "How high is a tree with n nodes?". As we saw last chapter, the answer is "It depends.": inserting the keys in sorted order, for example, produces a degenerate tree of height Θ(n), whereas a well-balanced tree has height Θ(log n).
Next section we will learn a technique for keeping trees low.