12.6. B-Trees — CS3 Data Structures & Algorithms

12.6.3. B-Tree Analysis¶

The asymptotic cost of search, insertion, and deletion of records from B-trees, \(\mathrm{B}^+\) trees, and \(\mathrm{B}^*\) trees is \(\Theta(\log n)\) where \(n\) is the total number of records in the tree. However, the base of the log is the (average) branching factor of the tree. Typical database applications use extremely high branching factors, perhaps 100 or more. Thus, in practice the B-tree and its variants are extremely shallow.

As an illustration, consider a \(\mathrm{B}^+\) tree of order 100 and leaf nodes that contain up to 100 records. A B-\(\mathrm{B}^+\) tree with height one (that is, just a single leaf node) can have at most 100 records. A \(\mathrm{B}^+\) tree with height two (a root internal node whose children are leaves) must have at least 100 records (2 leaves with 50 records each). It has at most 10,000 records (100 leaves with 100 records each). A \(\mathrm{B}^+\) tree with height three must have at least 5000 records (two second-level nodes with 50 children containing 50 records each) and at most one million records (100 second-level nodes with 100 full children each). A \(\mathrm{B}^+\) tree with height four must have at least 250,000 records and at most 100 million records. Thus, it would require an extremely large database to generate a \(\mathrm{B}^+\) tree of more than height four.

The \(\mathrm{B}^+\) tree split and insert rules guarantee that every node (except perhaps the root) is at least half full. So they are on average about 3/4 full. But the internal nodes are purely overhead, since the keys stored there are used only by the tree to direct search, rather than store actual data. Does this overhead amount to a significant use of space? No, because once again the high fan-out rate of the tree structure means that the vast majority of nodes are leaf nodes. A K-ary tree has approximately \(1/K\) of its nodes as internal nodes. This means that while half of a full binary tree's nodes are internal nodes, in a \(\mathrm{B}^+\) tree of order 100 probably only about \(1/75\) of its nodes are internal nodes. This means that the overhead associated with internal nodes is very low.

We can reduce the number of disk fetches required for the B-tree even more by using the following methods. First, the upper levels of the tree can be stored in main memory at all times. Because the tree branches so quickly, the top two levels (levels 0 and 1) require relatively little space. If the B-tree is only height four, then at most two disk fetches (internal nodes at level two and leaves at level three) are required to reach the pointer to any given record.

A buffer pool could be used to manage nodes of the B-tree. Several nodes of the tree would typically be in main memory at one time. The most straightforward approach is to use a standard method such as LRU to do node replacement. However, sometimes it might be desirable to "lock" certain nodes such as the root into the buffer pool. In general, if the buffer pool is even of modest size (say at least twice the depth of the tree), no special techniques for node replacement will be required because the upper-level nodes will naturally be accessed frequently.