Sunteți pe pagina 1din 55

B-Trees & its

Variants
Advanced Data Structure
Spring 2007

Zareen Alamgir
Motivation

Yet another Tree!


Why do we need another
Tree-Structure ?
Data Retrieval from
External Storage
 In database programs, the data is too large to fit in
memory, therefore, it is stored on secondary storage
(disks or tapes).

 Disk access is very expensive, the disk I/O operation


takes milliseconds while CPU processes data on the
order of nanoseconds, one million times faster.

 When dealing with external storage the disk accesses


dominate the running time.
Balanced Binary Search
Trees
 Balanced binary search trees (AVL & Red-Black) have
good performance if the entire data can fit in the main
memory.

 These trees are not optimized for external storage and


require many disk accesses, thus give poor
performance for very large data.
Reduce Disk Accesses
 Data is transfer to and from the disk in block (typically block
are of 512, 2048, 4096 or 8192 bytes).

 To reduce disk accesses


 Store multiple records in a block on the disk.
 Reduce tree height by increasing the number of children
of a node.

 To achieve above goals we use Multiway (m-way) search


tree, which is a generalization of BST, binary search tree.
25 62

12 19 32 39 73 84

3 5 15 17 21 23 30 31 34 37 45 51 69 71 75 79 90 94
Multiway(m-way)
Search Trees
Multiway(m-way) Search
Trees
 In an m-way tree all the nodes have degree ≤ m.
 Each node has the following structure:
P1 K1 P2 Ki-1 Pi Ki Kq-1 Pq

keys<K1 Ki-1<keys<Ki Kq-1<keys

 Ki are the keys, where 1 ≤ i ≤ q-1, q<=m


 Pi are pointers to subtrees, where 1 ≤ i ≤ q, q<=m

 The keys in each node are in ascending order K1 ≤ K2 ≤ ... ≤ Ki


 The key Ki is larger than keys in subtree pointed by Pi and
smaller than keys in subtree pointed by Pi+1 .
 The subtrees are the m-way trees.
Multiway(m-way) Search
Trees
 M-way tree is a generalization of BST, its working, benefits
& issues are same.
 Benefits
 Problems
 Fast information
 The tree is not balanced.
retrieval.  Leaf nodes are on different
 Fast update. levels.
 Bad space usage, tree can
become skew.
25 62

12 23 32 73 84

3 5 15 21 30 31 69 71 90 94

17 19

M-way tree
B-Trees
B-Trees
 B-Tree is a balanced m-way tree that is tuned to
minimize disk accesses.
 The node size of B-Tree is usually equal to the disk block
size and the number of keys in a node depends on
 Key size
 Disk block size
 Data organization (either key or entire data record is store in a node)

 Access paths from root to leaf nodes are small.


25 62

12 19 32 39 73 84

3 5 15 17 21 23 30 31 34 37 45 51 69 71 75 79 90 94
B-Tree: Definition
 A B-tree of order m has following properties
 The root has at least two subtrees unless it is a leaf.
 Each non-root and each non-leaf node holds q-1 keys
and q pointers to subtrees, where  m / 2 ≤ q ≤ m .
 Each leaf node holds q-1 keys where  m / 2 ≤ q ≤ m .
 All lea are on the same level.
 It is clear that B-tree is always at least half full, has
fewer levels and is perfectly balanced.
P1
Fr K1 P2 Ki-1 Pi Ki Kq-1 Pq

keys<K1 Ki-1<keys<Ki Kq-1<keys


B-Tree
 B-Tree can have a field KeyTally, KT, to indicate the number
of keys currently stored in the node.
 B-Tree node usually contains key and data pointer pair. The
data pointer points to the data record which is not stored in
the node, with this scheme we can pack more keys &
pointers in a B-Tree node.

KT P1 K1 D1 Ki-1 DKi-1i-1 Pi Ki Di Kq-1 KDq-1q-1 Pq

Data pointer Data pointer


keys<K1 Ki-1<keys<Ki Kq-1<keys
Height of B-Tree
 Height of the B-Tree with n keys is important as it bound
the number of disk accesses.

 The height of the tree is maximum when each node has


minimum number of the subtree pointers, q =  m / 2 .
Height of B-Tree
 The height of B-tree is maximum if all nodes have minimum
number of keys.

1 key in the root + 2(q-1) keys on the second level +……+ 2qh-2(q-1) keys in
the leaves (level h).

1 + 2(q - 1) + 2q(q - 1) + …… + 2q h-2 (q - 1)


 h −2 i 
= 1 + ( q − 1) ∑ 2q 
 i =0 
Applying the formula of geometric progression
 q h −1 − 1 
= 1 + 2( q − 1) 
 q − 1 
= −1 + 2q h −1
Thus, the number of keys in B - Tree of height h is given as :
n ≥ −1 + 2q h −1
n +1
h ≤ log q +1
2
Height of B-Tree
 The height of B-tree is minimum if all nodes are full, thus we have

m-1 keys in the root + m(m-1) keys on the second level +……+ mh-1(m-1) keys in
the leaf nodes
(m - 1) + m(m - 1) + m 2 (m - 1) + …… + m h-1 (m - 1)
h −1 h −1
= ∑ (m − 1)m = ( m − 1)∑ m i
i

i =0 i =0

Applying the formula of geometric progression


 mh − 1 
= (m − 1) 
 m − 1 
= mh − 1
Thus, the number of keys in B - Tree of height h is given as :
n ≤ mh − 1
h ≥ logm (n + 1)
n +1
logm ( n + 1) ≤ h ≤ log q +1
2
Height of B-Tree
 If number of nodes in B-tree equal 2,000,000 (2 million)
and m=200 then maximum height of B-tree is 3, where
as the binary tree would be of height 20.

 Note: Order m is chosen so that B-tree node size is


nearly equal to the disk block size.
Search in a B-Tree
 Search in a B-tree is similar to the search in BST
except that in B-tree we make a multiway
branching decision instead of binary branching in
BST.
25 62

12 19 32 39 73 84

3 5 15 17 21 23 30 31 34 37 45 51 69 71 75 79 90 94

Search key 71
B-Tree Insert Operation
 Insertion in B-tree is more complicated than in BST.

 In BST, the keys are added in top down fashion


resulting in an unbalanced tree.

 B-tree is built bottom up, the keys are added in the


leaf node, if the leaf node is full another node is
created, keys are evenly distributed and middle key
is promoted to the parent. If parent is full, the
process is repeated.

 B-tree can also be built in top down fashion using


pre-splitting technique.
Basic Idea
Find position for the key
in the appropriate leaf node

Insert key in order Is node


and adjust pointer
No full ?

yes

Split node: If parent is full


• Create a new node
• Move half of the keys from the full node to
the new node and adjust pointers
• Promote the median key (before split)
to the parent
≥  m / 2 − 1
Split guarantees that each node has
keys.
Cases in B-Tree Insert
Operation
 In B-tree insertion we have the following cases:
 Case 1: The leaf node has room for the new key.
 Case 2: The leaf in which key is to be placed is full.
 This case can lead to the increase in tree height.

 Now we explain these cases in detail.


B-Tree Insert Operation
 Case 1: The leaf node has room for the new key.

Find appropriate leaf


Insert 3 node for key 3
3
10 25

5 8 14 19 20 23 32 38

Insert 3 in order
B-Tree Insert Operation
 Case 2: The leaf in which key is to be placed is full.

Find appropriate leaf


Insert 16 node for key 16

16
10 19
25

3 5 8 14 19 20 23 32 38

No room for key 16 in leaf node

Insert key 19 in parent node in order Move median key 19 up and


Split node: create a new node and
move keys to the new node.

19
14 16 20 23
B-Tree Insert Operation
 Case 2: The leaf in which key is to be placed is full
and this lead to the increase in tree height.
45 55 67 81

13 27 33 38 48 52 57 61 72 77 86 92

9 12 14 19 20 2 3 3 2 38 47 51 59 75 32 38 4 7 51 59 75 32 38 47 51 59 75 32 3 8 47 51 59 75 32 38 47 51 59 75
B-Tree Insert Operation
 Case 2: The height of the tree increases.
Insert 16
Insert 27 in parent in order 55
No room for 27 in parent, Split node
16
45
5555 67 81

No room for 19 in parent,


Split parent node 48 52 57 61 72 77 86 92

13 19
27 33 38
3 3 4 5 5 7 3 3 4 5 5 7 3 3 4 5 5 7 3 3 4 5 5 7

2 8 7 1 9 5 2 8 7 1 9 5 2 8 7 1 9 5 2 8 7 1 9 5

9 12 14 19 20 23 29 31 35 36 41 42

Insert 19 in parent node in order

No room for key 16,


Move median key 19 up & Split node

19
14 16 20 23
B-Tree Delete Operation
 Deletion is analogous to insertion, but a little
more complicated.
 Two major cases
 Case 1: Deletion from leaf node
 Case 2: Deletion from nonleaf node
 Apply delete by copy technique used in BST, this will
reduce this case to case 1.
 In delete by copy, the key to be deleted is replaced
by the largest key in the right subtree or smallest in
left subtree (which is always a leaf).
B-Tree Delete Operation
 Leaf node deletion cases:
 After deletion node is at least half full.
 After deletion underflow occurs
m
 Redistribute: if number of keys in siblings >  2  − 1 .
 Merge nodes if number of keys in siblings < m
 2  − 1 .
 Merging leads to decrease in tree height.
B-Tree Delete Operation
 After deletion node is at least half full. (inverse of insertion
case 1)

Search key 3

10 25

3 5 8 14 19 32 38 40 45

Key found, delete key 3.


Move others keys in the node to eliminate the gap.
B-Tree Delete Operation
 Underflow occurs, evenly redistribute the keys if left or
right sibling has keys >  m / 2 − 1.

Search key 14
Delete 14
10 25

5 8 14 19 32 38 40 45

Underflow occurs, evenly redistribute keys


in the underflow node, in its sibling and the
separator key.
B-Tree Delete Operation
 Underflow occurs and the keys in the left & right sibling are
=  m / 2 − 1
. Merge the underflow node and a sibling.

Delete 25 Move separator key down.

Move the keys to underflow


10 32
node and discard the sibling.

5 8 19 25 38 40

Underflow occurs, merge nodes.


B-Tree Delete Operation
 Underflow occurs, height decreases after merging.

Delete 21
70
Underflow occurs,
merge nodes

8 32 79 85

3 5 21 27 47 66 73 75 78 81 83 88 90 92

Underflow occurs, merge nodes


by moving separator key and
the keys in sibling node to the
underflow node.
Issues in B-tree
 In B-tree, accessing data in sequential order is not
efficient. In-order traversal of the B-tree requires many
disk accesses.

 In B-tree, data pointers are stored in each node, thus


resulting in less subtree pointers per node and more tree
levels.
KT P1 K1 D1 Ki-1 D
Ki-1 Pi Ki Di Kq-1 free
KDspace
q-1
q-1 Pq

Data pointer Data pointer


keys<K1 Ki-1<keys<Ki Kq-1<keys
B+-Trees
B+ -Tree- Variant of B-
Tree
 Resolves the issues in B-tree.
 In B+ -tree
 Pointers to data is stored only in leaf nodes.
 Internal nodes contain only keys and subtree pointers
 Can accommodate more keys in internal nodes.
 Less disk accesses due to fewer levels in the tree.

 B+ -tree provides faster sequential access of data.


B+-Tree Structure
 B+-tree consist of two parts
 Index Set
 Provides indexes for fast access of data.
 Consist of internal nodes that store only key & subtree pointers.
 Sequence Set
 Consist of leaf nodes that contain data pointers.
 Provide efficient sequential access of data (using doubly linked list).

Index Set

Sequential Sequence Set


Search
B+-Tree: Index Node
Structure
 The basic structure of the B+-tree index node of order
m is same as that of B-tree node of order m.
 The root has at least two subtrees unless it is a leaf.
 Each non-root index node holds q-1 keys and q subtree
pointers, where  m / 2 ≤ q ≤ m .
 Only difference is that index node do not contain
data pointers.
P1
Fr K1 P2 Ki-1 Pi Ki Kq-1 Pq

keys<K1 Ki-1<keys<Ki Kq-1<keys


B+-Tree: Sequence Node
Structure
 The structure of the B+-tree sequence node is as
follows:

K1 D1 Ki-1 Di-1 Ki Di Kq-1 Dq-1


Pointers to previous
and next leaf node in tree

Data pointer Data pointer Data pointer Data pointer


Search in a B+-Tree
 The search in B+-tree works similar to B-tree but it
always ends at the leaf node because
 Pointer to the data is stored in the leaf node.
 Existence of the key in index set does not guarantee that the
particular record is present in the tree.
 A key can occur multiple time in the index set (this does not
create problem because the key in index set node act only as a
separator).
Note 62 is not present
in the sequence set
30 62

15 21 34 45 75 90

3 5 15 17 21 23 30 31 34 37 45 51 69 71 75 79 90 94

B+-tree of order 5
B+-Tree Insert
 Case: The leaf node has room for the key to be inserted.

Find appropriate leaf node for


key 3, and insert in order.
Insert 3
3
14 32

5 8 14 19 20 23 32 38
B+-Tree Insert Operation
 Case 2: The leaf in which key is to be placed is full.

Find appropriate leaf


Insert 16 node for key 16

16
10 19
25

3 5 8 14 19 20 23 32 38

No room for key 16,


Split node: create a new
node and move  m / 2
keys to the new node. Insert a copy of the first key of the
new node in the parent node in order.

14 16
19 20 23 19

Modify Sequence Set next node links


B+ -Tree Insert
Operation
 Case: Only root node exists, and it is full.

Insert 18 Find appropriate


position for key 18

18
No room for key 18, 18
Split node: create a new
sequence set node and
10 17 20 23 18
move  m / 2 keys to the
new node.
Create a new index set node and
make it a root node.

Insert the first key of the new


sequence set node in the new root.
B+-Tree Delete
Operation
 B+-tree deletion follows same rules as that of B-
tree deletion but the separator in index set node is
not removed when a key is deleted.

 Deletion cases:
 After deletion node is at least half full.
 After deletion underflow occurs
m
 Redistribute: if number of keys in siblings >  2  − 1 .
 Merge nodes if number of keys in siblings < m
 2  − 1 .
Merging lead to decrease in tree height.
B+-Tree Delete
 Case: After deletion node is at least half full.

Search key 14
Delete 14
Note: key 14 in the parent node is
14 32 still a valid separator key. It is not
modified.

3 5 8 14 19 21 32 38 40

Key found, delete key 14.


Move others keys in the node to eliminate the gap.
B+-Tree Delete
Operation
 Underflow occurs, evenly redistribute the keys if left or
right sibling has keys >  m / 2 − 1.

Separator
Search key
key 14 in the parent
& delete it. is
Delete 14 no longer valid

14 38
32 Insert a copy of the first key of
the sibling node in parent in order.

5 8 14 19 32 38 40

Underflow occurs, evenly redistribute keys


in the underflow node and in the sibling node.

Note: unlike B-tree, the separator key in the


parent node is not included. Why?
B+-Tree Delete
Operation
 Underflow occurs and the keys in the left & right sibling are
=  m / 2 − 1
. Merge the underflow node and a sibling.
Search key 32 & delete it.
Delete 32
14 38

5 8 19 32 38 40

Underflow occurs, merge nodes.

Move keys in sibling to underflow


node and discard the sibling node.
Efficient Sequential Access in
B+-Tree
For efficient sequential access, start from the beginning of
the sequence set and traverse the sequence set using the
next pointers in sequence set nodes

45 55 67 81

13 27 33 38 48 52 57 61 72 77 86 92

Sequence Set Start


Comparison B-Tree & B+-
Tree
B-Tree B+-Tree
 Data pointers are stored in  Data pointers are stored
all nodes only in leaf nodes (sequence
set)
 No redundant keys
 Search can end at any
 Redundant keys may exist
node  Search always ends at leaf
 Slow sequential access node
 Higher trees
 Efficient sequential access
 Flatter trees (no data pointers in
index set nodes)
B* -Trees
B* -Tree -- Variant of B-
Tree
 Each node of a B-tree represents a block of secondary memory,
therefore, accessing a node is expensive operation. Thus, the
fewer nodes that are created, the better.

 In B*-tree is a variant of B-tree introduced by Donald Knuth and


named by Douglas Comer.

 In a B*-tree, all nodes except the root are required to be at least


two-thirds full, not just half full as in B-tree.

 The number of keys in all non-root nodes in a B*-tree of order m


2m − 1
is k, where  3  ≤ k ≤ m − 1.
 

 Average utilization of B*-tree is 81%.


B* -Tree Insert Operation
 In B*-tree, the frequency of node splitting is decreased by
delaying a split.

 A split in a B*-tree is delayed by attempting to redistribute the


keys between node and its sibling when node overflows.

 In B*-tree split operation two nodes are split into three instead of
one into two as in B-tree.

 All three nodes participating in the split are guaranteed to be two-


thirds full after split.
B*-Tree Insert Operation
 Overflow occurs, evenly redistribute the keys between node
and its sibling.

Insert 22
22 32

10 12 15 16 21 24 25 29 35 42 47 51 53

Overflow occurs, evenly redistribute keys


in the overflow node, in its sibling including
the separator key in parent and new key.
B*-Tree Insert Operation
 Overflow occurs, sibling is full, split node.

Insert 72
72 32

10 12 15 16 21 24 25 29 35 42 47 51 53 55 57 59

Overflow occurs, sibling is full, split node


B* -Tree Delete
Operation
 B*-tree deletion follows same rules as that of B-tree deletion.

 Deletion cases:
 After deletion node is at least two third full.
 After deletion underflow occurs
2m − 1
 Redistribute: if number of keys in siblings > .
3
2m − 1
 Merge nodes if number of keys in siblings < .
3

 Examples of deletion are omitted as they are similar to B-tree


deletion.
Questions ?
References
 Data Structure and Algorithms in C++, Adam Drozek.
 Introduction to Algorithms, T.H.Cormen, C.E.Leiserson,
R.L.Rivest, and C.Stein.
 Fundamentals of Database Systems, Elmasri Navathe.

 Fundamentals of Data Structures in C++, E.Horowitz, S.Sahni and


D.Mehta
 The Ubiquitous B-Tree, DOUGLAS COMER, ACM Computing
Surveys (CSUR), Volume 11 ,Issue 2(June 1979).
The End

S-ar putea să vă placă și