Lecture 22: B-Trees

Suppose you have a database that is too large to be kept in memory, so it is resident on disk. You want to support the functionalities SEARCH, ADD, and DELETE.

Data on disks is organized in blocks of fixed size, typically 512 bytes, 1 KByte, or 2 KBytes. The disk reader reads and writes one block at a time. Reading a block from disk typically takes about 1 million times as long as an in-memory operation. Therefore, in designing algorithms for data structures kept on disk, the primary consideration is minimizing the number of disk reads, and you are willing to trade off an awful lot of in-memory computation (though there are limits) in order to save a single disk read.

A B-tree is a version of a 2-3 tree, in which each node of the tree is a block. The 2-3 tree structure is modified to pack the maximum amount of information into each block and thus reduce the total number of block reads.

Since we're talking about fixed size blocks, we need to be systematic about memory size, for the only time in this course. All measurements are in bytes.

Suppose that a block holds up to Q records if it is a leaf, and up to B keys and child pointers if it is an internal node. The values of Q and B are determined by the block size and by the sizes of records, keys, and pointers. In the running example below, Q = 64 and B = 128.

In a B-tree there are two kinds of nodes (each node is one block): leaf nodes, which contain the data records themselves, and internal nodes, which contain keys (tags) and pointers to their children.

As we observed with 2-3 trees, every value in a leaf is a tag in exactly one ancestor node, except the very smallest value in the tree, which is not a tag anywhere. This will be important below.

The rules for the tree are:
* All leaves are at the same depth.
* Each leaf contains between Q/2 and Q records.
* Each internal node other than the root has between B/2 and B children; a node with k children is tagged with k-1 keys, where the i-th key is the smallest value in the subtree of the (i+1)-st child.
* The root has between 2 and B children.
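As a concrete illustration, here is one way the two kinds of nodes might be laid out, sketched in Python. The class and field names are my own, purely illustrative; a real implementation packs these into raw disk blocks.

class LeafNode:
    def __init__(self):
        self.records = []    # up to Q records, kept sorted by key

class InternalNode:
    def __init__(self):
        self.keys = []       # keys[i] is the smallest key in children[i+1]'s subtree
        self.children = []   # up to B children, each a LeafNode or InternalNode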

Height of the tree

In the smallest tree of height H, the root has 2 children, every other internal node has B/2 children, and every leaf holds Q/2 records. Thus the number of records N >= 2*(B/2)^(H-1)*(Q/2) = (B/2)^(H-1)*Q. Solving for H, we have H <= 1 + log(N/Q)/log(B/2) = 1 + [log(N)-log(Q)]/[log(B)-1] (logs base 2).
With Q=64 and B=128, if N = 2^30 (about 1 billion) then H <= 5. If N = 2^40 (about 1 trillion) then H <= 6.

In the largest tree of height H, every internal node has B children and every leaf holds Q records.

Then the number of records N <= B^H*Q, so H >= [log(N) - log(Q)]/log(B). With Q=64 and B=128, if N = 1 billion then H >= 4. If N = 1 trillion, then H >= 5.
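As a quick sanity check on both bounds, here is a throwaway Python computation using the numbers above (base-2 logs):

import math

def height_bounds(N, Q=64, B=128):
    # Largest tree: N <= B**H * Q, so H >= (log N - log Q)/log B.
    lower = (math.log2(N) - math.log2(Q)) / math.log2(B)
    # Smallest tree: N >= (B/2)**(H-1) * Q, so H <= 1 + (log N - log Q)/(log B - 1).
    upper = 1 + (math.log2(N) - math.log2(Q)) / (math.log2(B) - 1)
    return math.ceil(lower), math.floor(upper)

print(height_bounds(2**30))   # (4, 5): about a billion records
print(height_bounds(2**40))   # (5, 6): about a trillion records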

The root is held in memory. If the database is heavily used, the first and second levels down are also held in memory. For our numbers, keeping two levels requires about 10 MBytes. Three levels would require about 2 GBytes -- feasible if the computer is dedicated to serving this database. Let L be the number of levels held in memory.

You also probably cache some number of recently used blocks below level L in memory, in case you need them again soon. But this can only be a tiny fraction of all the blocks below level L, because the branching factor B is so large.

Also worth noting: the ratio between the space required by the B-tree and the space required by the raw data is somewhere between 1+1/B and 2+4/B, depending almost entirely on how full the leaf nodes are. So the cost in extra space is small if the leaves are tightly packed, and a factor of about 2 in the worst case.

Special case: In a 2-3 tree with N leaves, there are between approximately N/2 internal nodes (in the bushiest tree, with 3 children per node) and N-1 internal nodes (in the skinniest tree, with 2 children per node).

Algorithms

The algorithms are essentially the same as for 2-3 trees.

Search

Use the keys at each node to find your way down the tree, and search for the key in the leaf. Note that if you read in the node as an array, you can do binary search among the keys in the node to find the proper subtree.
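In Python, continuing the illustrative node layout above (and treating records as bare keys for simplicity), the search might look like this; bisect does the binary search within a node:

from bisect import bisect_right

def search(node, key):
    # Walk down from the root, binary-searching the keys at each
    # internal node to pick the correct subtree.
    while isinstance(node, InternalNode):
        # keys[i] tags children[i+1], so the number of keys <= key
        # is exactly the index of the child to descend into.
        node = node.children[bisect_right(node.keys, key)]
    # At a leaf: binary search among the (sorted) records.
    i = bisect_right(node.records, key) - 1
    return node.records[i] if i >= 0 and node.records[i] == key else None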

Number of disk reads: the height of the tree minus the number of levels kept in memory, i.e. H-L.
Machine operations: the number of levels is about log(N)/log(B), and at each level you do about log(B) operations for the binary search, so about log(N) operations in total.

Adding a new record

Find where the new record should go. If the leaf is overfull, split it into two half-full leaves. Add these to the parent. If the parent is now overfull, split it into two. Continue upward. If you get to the root and it is overfull, split the root into two, and create a new root with these two as children (this is why the root is allowed to have only two children).
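Here is a sketch of this in Python, again continuing the illustrative layout above (records as bare keys; Q and B as in the running example). The split functions return the new sibling together with the tag that must be added to the parent:

from bisect import bisect_right, insort

Q, B = 64, 128

def split_leaf(leaf):
    # Move the upper half of an overfull leaf into a new sibling.
    sib = LeafNode()
    half = len(leaf.records) // 2
    sib.records, leaf.records = leaf.records[half:], leaf.records[:half]
    return sib, sib.records[0]          # tag = sibling's smallest key

def split_internal(node):
    # Move the upper half of an overfull internal node into a new sibling;
    # the key between the two halves moves up to the parent as the tag.
    sib = InternalNode()
    half = len(node.children) // 2
    sib.children, sib.keys = node.children[half:], node.keys[half:]
    tag = node.keys[half - 1]
    node.children, node.keys = node.children[:half], node.keys[:half - 1]
    return sib, tag

def insert(node, key):
    # Returns None, or (new_sibling, tag) if this node had to split.
    if isinstance(node, LeafNode):
        insort(node.records, key)
        return split_leaf(node) if len(node.records) > Q else None
    i = bisect_right(node.keys, key)
    result = insert(node.children[i], key)
    if result is None:
        return None
    sib, tag = result
    node.keys.insert(i, tag)            # O(B) insertion into an array
    node.children.insert(i + 1, sib)
    return split_internal(node) if len(node.children) > B else None

# If insert(root, key) returns (sib, tag), the root itself split:
# make a new root with children [root, sib] and keys [tag].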

Number of disk operations: Reads: H-L. Writes: 2*H in the worst case. (Caching doesn't help with writes; you have to write out a block as soon as it is changed.) Average case: O(1) writes.
Machine operations: at each level of the tree, O(B) for insertion into an array of length B. Hence O(B*height) = O(B*log(N)/log(B)) operations in total.

Note that B/log(B) is an increasing function of B, so the machine operations increase for larger B. That is why, for an in-memory data structure, you use 2-3 trees rather than B-trees.

Also, unlike 2-3 trees, it is no longer important to be clever about passing the keys up the tree; you can use the obvious dumb algorithm to adjust the key labels at each node. Why?

Deleting

Find the leaf node N with the record to be deleted.
Delete it, if it's there.
If (N is now less than half full) {
    if (either neighboring sibling is more than half full) {
        pick the larger of the two neighboring sibs, if there are two;
        move enough children from that one to this one that the two have
            equal numbers (both at least half full);
    }
    else {  // neither neighboring sibling has children to spare
        P = N.parent;
        move all of N's children to one of its neighboring siblings;
        delete N;
        N = P;
        iterate;
    }
}
If you get up to the root and the root is left with only one child, delete
the root and make that child the new root.
The details of the code are messy, as you can imagine. Depending on how the B-tree is to be used there are a lot of optimizations and tuning you can do.

Number of disk operations: Reads: either 3*(H-L) (the path to the item to be deleted, plus both neighboring siblings at each level) or 2*(H-L) (if you keep more information at the parent node about the children).
Writes: Best case: 1. Worst case: 2*H (each node on the path and one sibling at each level).

Doing N adds

Suppose you do N adds. How many disk operations?

Reads: N*(H-L). The tree stays at height H much longer than it takes to go through heights 0 ... H-1. So nearly all the N adds are done when the tree is already at height H. But see below.

Writes: Here it pays to count a little carefully.

If you add an item to a leaf and the leaf doesn't split, then you do either one or two writes. You certainly have to rewrite the leaf. If the item is now the smallest item in the leaf, you have to rewrite the tag in the internal node where the tag for the smallest item in this leaf appears. So over the whole course of doing the N adds, these kinds of rewrites involve somewhere between N and 2N writes; much closer to N if the data arrives randomly, but 2N if the data arrives in backward sorted order.

If a node N splits and a new node N1 is created, then there are three writes involved: N, N1, and the common parent of N and N1 (which itself may split, but we'll do that accounting separately). Therefore each such event, involving 3 writes, has the effect of creating a new node. Therefore the total number of writes of this kind is at most 3 * the number of nodes in the tree; since each leaf is at least half full, there are at most about 2N/Q nodes, so at most about 6*N/Q such writes. This holds best case, worst case, all cases. So the total number of writes is somewhere between (1+6/Q)*N and (2+6/Q)*N.
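Putting numbers on this bound (a throwaway Python check, with Q = 64 as in the running example):

N, Q = 2**30, 64
low  = (1 + 6/Q) * N     # roughly: random arrival order (about N leaf writes)
high = (2 + 6/Q) * N     # roughly: backward sorted order (about 2N writes)
print(f"{low:.3g} to {high:.3g} writes")    # about 1.17e9 to 2.25e9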

However if we consider the fact that some of the nodes below level L are cached then that might change the number of reads. It can't change the number of writes; when the algorithm says you have to write, you have to write, or risk losing your work. But when it says you have to read, you may not have to read; the block may be already cached. If the data arrives in random order, then the probability that a block below level L is cached is tiny, so that doesn't affect the calculation. But if there is a high degree of locality --- that is, items close together tend to arrive together --- then there may be a substantial probability that the blocks you need to read are already in memory, and then the total number of reads may be much less than the above calculation.

The extreme case is where the data arrives in sorted order. In that case, the blocks you need are always in memory, and zero reads are required; only writes.

Fun things to do with 2-3 trees

There are all kinds of things you can do with 2-3 trees, especially if you're willing to supplement them with some additional data structures. All of these (I think) can be adapted to any kind of balanced tree, not just 2-3 trees.

Splice and Split

Splice sets U and V. That is, U and V are each represented by a 2-3 tree, and you happen to know that the largest element in U is smaller than the smallest element in V.
You want to destructively create the union of U and V.
Assume that nodes are tagged with the smallest value in the smallest subtree, in addition to the smallest values in the other subtrees.

Let H = U.height (assume this is recorded at the root).
Let G = V.height;
if (H == G)  {
   create a new root W;
   make U and V the two children of W;
   tag W with U.smallest and V.smallest;
   }
if (H > G)
   { starting at the root of U, go down H-G-1 steps to the right;
         Call this node P;
     make the root of V the last child of P;
         // this puts the leaves of U and the leaves of V at the same
         // level in the right order.
     split upward, as necessary
   }
if (H < G)
   { starting at the root of V, go down G-H-1 steps to the left;
         Call this node P;
     make the root of U the first child of P;
     split upward, as necessary
   }
Time requirement: O(|H-G|).

Split set U at X. That is, destructively construct the set of all the elements less than or equal to X.

1. Find the path from the root to X, or to where X would be.
2. Prune all the siblings to the right of that path.
3. Working from bottom to top, delete all nodes with 1 child 
    (all of these are nodes on the path).
Comment: You now have a collection of 2-3 trees of increasing values
     and strictly decreasing heights.
4. Splice them together one at a time.

If the heights of the trees are H1 > H2 > H3 ... > Hk, then the time for all the splices is
(H1-H2) + (H2-H3) + ... + (H(k-1)-Hk) = H1 - Hk.

So the entire split algorithm runs in time proportional to the height of the original tree, which is O(log(N)).