Indexing
Indexed binary search trees
An Indexed Binary Search Tree is a variant of the traditional binary search tree (BST) that maintains additional auxiliary information (indices) to speed up order-based operations. While a typical binary search tree uses only the tree structure to store and retrieve data, an indexed binary search tree augments each node with this index information, making operations such as rank and selection queries faster and more efficient.
In simple terms, it combines the properties of a binary search tree with an indexing mechanism, making it particularly useful for searching and sorting data where quick access to elements is required.
Key Concepts of Indexed Binary Search Trees
1. Binary Search Tree (BST):
A binary search tree is a binary tree where each node has at most two children (left and right), and the nodes are arranged such that for every node:
All nodes in the left subtree have values less than the node’s value.
All nodes in the right subtree have values greater than the node’s value.
This structure allows efficient searching, insertion, and deletion in O(log n) time on average.
2. Indexing in Binary Search Tree:
An index in this context refers to the concept of maintaining extra information at each node to speed up certain operations like searching for the k-th smallest element or performing range queries efficiently.
The additional data stored at each node allows traversal of the tree to be more efficient in specific use cases, improving the time complexity for certain operations.
How Indexed Binary Search Trees Work
An indexed binary search tree maintains the following properties:
Tree Structure: Like any BST, each node has left and right child pointers.
Indexing Data: In addition to the key value, each node may store additional information such as:
Size of Subtree: Each node maintains the size of the subtree rooted at that node. This size is the number of nodes in the subtree, including the node itself.
Rank of the Node: The rank or index of the node in the sorted order of the BST. This allows you to efficiently find the k-th smallest element.
Operations in an Indexed Binary Search Tree
1. Insertion:
Insertion in an indexed BST follows the regular BST insertion logic: if the value to be inserted is smaller than the current node’s value, it goes to the left child; otherwise, it goes to the right child.
After insertion, the size of each node's subtree must be updated. This ensures that the index information (e.g., size or rank) is accurate.
2. Searching:
Searching for a particular key in an indexed BST works similarly to a standard binary search tree, with the key comparisons guiding the traversal.
The added indexing can speed up certain types of searches:
Rank Queries: For example, to find the k-th smallest element, the size of each node’s subtree helps determine whether the k-th smallest element lies in the left subtree, the right subtree, or the current node.
Range Queries: If you want to find all elements between two given values, the indexed BST can help perform this efficiently.
3. Finding the k-th Smallest Element:
If each node in the tree stores the size of its subtree, finding the k-th smallest element becomes efficient.
Start at the root.
If k is less than or equal to the size of the left subtree (call it L), the k-th smallest element lies in the left subtree; continue the search there.
If k equals L + 1, the current node itself is the k-th smallest element.
If k is greater than L + 1, subtract L + 1 from k (accounting for the left subtree and the current node) and continue searching in the right subtree.
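The steps above can be sketched in Python. This is a minimal size-augmented BST (the class and function names are illustrative, not a standard library API), built from the same keys as the example tree later in this section:

```python
# Minimal sketch of an indexed BST: each node stores the size of its
# subtree, so the k-th smallest key (1-based) is found in O(height) time.

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.size = 1  # number of nodes in the subtree rooted here

def insert(root, key):
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    root.size += 1  # update the index information on the way back up
    return root

def kth_smallest(root, k):
    while root is not None:
        left_size = root.left.size if root.left else 0
        if k <= left_size:           # answer lies in the left subtree
            root = root.left
        elif k == left_size + 1:     # current node is the k-th smallest
            return root.key
        else:                        # skip the left subtree and this node
            k -= left_size + 1
            root = root.right
    raise IndexError("k out of range")

root = None
for key in [50, 30, 70, 20, 40, 60, 80]:
    root = insert(root, key)
print(kth_smallest(root, 3))  # -> 40
```

Note that `insert` updates `size` on the path back from the recursion, which is exactly the "update the index after insertion" step described above.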
4. Deletion:
Deletion in an indexed BST involves the standard BST deletion process (if the node to delete has one child or two children).
After deleting a node, the sizes of the nodes in the tree must be updated accordingly to maintain correct index information.
Advantages of Indexed Binary Search Trees
1. Efficient Search for k-th Smallest or k-th Largest Elements:
With the subtree size information, finding the k-th smallest element can be done in O(log n) time, making it much faster than a linear search.
2. Efficient Range Queries:
Range queries (i.e., finding all elements within a certain range) can be handled efficiently using the subtree size and rank information, enabling better performance than standard BSTs.
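As an illustrative sketch of why subtree sizes help here (names are not a library API): the number of keys in a half-open range [lo, hi) equals count_less(hi) - count_less(lo), and each count takes only O(height) time.

```python
# Range counting on a size-augmented BST via two rank queries.

class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right
        # subtree size; a real indexed BST maintains this during updates
        self.size = 1 + (left.size if left else 0) + (right.size if right else 0)

def count_less(node, x):
    """Number of keys strictly less than x, using subtree sizes."""
    count = 0
    while node is not None:
        if x <= node.key:
            node = node.left
        else:
            # the entire left subtree and this node are < x
            count += (node.left.size if node.left else 0) + 1
            node = node.right
    return count

# Seven keys: 20, 30, 40, 50, 60, 70, 80.
root = Node(50,
            Node(30, Node(20), Node(40)),
            Node(70, Node(60), Node(80)))
print(count_less(root, 65) - count_less(root, 25))  # keys in [25, 65) -> 4
```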
3. Maintaining Sorted Order:
Since the tree structure follows the binary search property, the elements are always stored in sorted order, allowing efficient in-order traversal and sorted data access.
4. Faster Updates with Indexed Data:
The indexing information at each node (like subtree sizes) can be maintained during insertion and deletion, ensuring that updates to the tree do not disrupt the ability to perform efficient searches.
Disadvantages of Indexed Binary Search Trees
1. Increased Space Complexity:
The added indexing data (such as the size of the subtree) increases the space complexity of the tree. Each node requires extra memory to store the index, making the tree more space-intensive than a standard BST.
2. Complexity of Operations:
While indexed binary search trees offer efficient operations, they are more complex to implement compared to standard binary search trees, requiring careful management of index updates during insertion, deletion, and balancing operations.
3. Balancing Issues:
Like any binary search tree, an indexed BST can become unbalanced if the tree is not carefully managed, leading to potential degradation of search performance (from O(log n) to O(n) in the worst case). Balancing methods (like AVL trees or Red-Black trees) can be incorporated, but this increases the complexity.
Example: Indexed Binary Search Tree with Subtree Size
Let's illustrate a simple indexed BST with the subtree size information.
Tree Structure:
            50 (size=7)
           /          \
    30 (size=3)    70 (size=3)
     /    \          /    \
   20     40       60     80
At the root, the subtree rooted at node 50 contains all 7 nodes, so its size is 7.
The left child, node 30, has a subtree size of 3 (it contains 30, 20, and 40).
Similarly, the right child, node 70, has a subtree size of 3 (it contains 70, 60, and 80).
Operation Example - Find the 3rd Smallest Element:
Start at node 50. The left subtree size is 3, and 3 ≤ 3, so the 3rd smallest element lies in the left subtree.
Move to node 30. Its left subtree size is 1, and 3 > 1 + 1, so subtract 2 (the left subtree plus node 30 itself) from k and look for the 1st smallest element in the right subtree of node 30.
Move to node 40. Its left subtree is empty and k = 1, so node 40 is the 3rd smallest element in the tree.
Indexed sequential access method (ISAM)
The Indexed Sequential Access Method (ISAM) is a traditional data structure and access method used in database management systems for efficiently storing and retrieving records. ISAM allows for both sequential and random (indexed) access to data, making it a useful structure for scenarios where fast searching, inserting, and sequential reading of records are necessary. While the concept of ISAM is based on indexing and sequential data access, it can also be implemented using a B-tree as an underlying structure.
ISAM Overview
ISAM is based on a two-level structure:
Data File: This contains the actual data records, stored in sequential order according to a primary key.
Index File: This contains the index, which stores pointers to the data file. The index provides fast access to the data, allowing you to search for records quickly without scanning the entire file.
In ISAM, the data file is usually maintained in a sorted order to enable efficient sequential access. The index file allows for random access to the data by providing direct pointers to the location of the records.
How ISAM Works
1. Data File (Sequential Part):
The data file stores records in sorted order based on a key (typically the primary key). These records are arranged such that new records are added in sorted order, and the file can be sequentially scanned to read all the records in the correct order.
Sequential Access is efficient here since the records are sorted and can be read in a linear manner.
2. Index File (Indexed Part):
The index file contains entries that point to specific locations in the data file. The index allows random access to data, which significantly reduces the time to locate a particular record.
The index is typically structured as a B-tree (or other balanced tree structures), where each internal node points to child nodes and each leaf node points to the data file.
The index structure enables fast binary search for keys in the index file, which makes finding the corresponding data record in the data file fast.
Operations in ISAM
1. Search:
To search for a record in ISAM, the system first searches the index file (typically organized as a B-tree or similar balanced structure) to quickly locate the pointer to the corresponding data file location.
Once the index directs to the correct spot, the system retrieves the record from the data file.
2. Insertion:
Insertion in ISAM can be complex since the data file needs to maintain its sorted order. When inserting a new record, the system may need to:
Find the appropriate location in the data file.
Shift other records to make room for the new record.
Update the index file accordingly.
If the data file has grown too large and the index file no longer adequately represents the data, the system may need to reorganize or rebuild the index.
3. Deletion:
Deleting a record in ISAM involves:
Removing the record from the data file (which may cause holes or gaps).
Updating the index file to reflect the removal of the record.
If necessary, performing reorganization of the data file to maintain order.
4. Sequential Access:
To access the records sequentially, the system can read the data file from start to finish, retrieving records in sorted order without needing to use the index. This is efficient when processing all records in order.
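The two-level organization above can be sketched with an in-memory stand-in: a sorted list of records grouped into blocks, plus a sparse index holding the first key of each block. The names, block size, and use of `bisect` are illustrative assumptions; a real ISAM index lives on disk, usually as a B-tree.

```python
# Toy sketch of the two-level ISAM idea: sorted "data file" + sparse index.
import bisect

BLOCK_SIZE = 2

# Data file: records sorted by key, grouped into fixed-size blocks.
records = [(100, "Record 1"), (200, "Record 2"), (300, "Record 3"),
           (400, "Record 4"), (500, "Record 5")]
blocks = [records[i:i + BLOCK_SIZE] for i in range(0, len(records), BLOCK_SIZE)]

# Index file: the first key of each block, in order.
index_keys = [block[0][0] for block in blocks]

def isam_search(key):
    # Binary-search the index to pick the right block, then scan that block.
    b = bisect.bisect_right(index_keys, key) - 1
    if b < 0:
        return None
    for k, data in blocks[b]:
        if k == key:
            return data
    return None

print(isam_search(300))  # -> Record 3
```

Sequential access corresponds to simply iterating over `records` in order, bypassing the index entirely.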
B-Trees and ISAM
In many modern systems, B-trees are used to implement the index file in ISAM. B-trees, with their balanced structure, are ideal for efficiently supporting the sequential and random access requirements of ISAM due to the following reasons:
Balanced Structure: B-trees maintain a balanced structure where each node is at the same level, ensuring logarithmic time complexity for search, insert, and delete operations, which is crucial for indexing large datasets.
Efficient Disk Access: B-trees are designed to reduce disk accesses by storing multiple keys in each node, making them well-suited for storage in disk-based systems like databases.
Range Queries: B-trees naturally support range queries and ordered traversal, which is useful when processing records sequentially.
Example of ISAM with B-Tree Index:
Let's take a simple example where we have a collection of records, each consisting of a key and data field:
Data File (Sorted by Key):
Key
Data
100
Record 1
200
Record 2
300
Record 3
400
Record 4
500
Record 5
Index File (B-Tree Example):
In a B-tree index for the above data file, the keys would be stored in a tree structure where each node points to a location in the data file.
B-Tree example for the index:
         [300]
        /     \
[100, 200]   [400, 500]
The root of the tree contains the key 300 and points to two child nodes.
The left child contains the keys less than 300 (here 100 and 200), and the right child contains the keys greater than 300 (here 400 and 500).
Searching for key 300 succeeds at the root itself. Searching for key 400 compares 400 with 300 at the root, follows the right child (where keys greater than 300 are stored), finds 400 there, and then follows its pointer directly to the corresponding record in the data file.
Advantages of ISAM with B-Trees:
Efficient Search: The B-tree index provides fast searching through the sorted data file.
Sequential Access: Records are stored in a sequential order in the data file, making it easy to process records in order.
Balanced and Scalable: The B-tree structure ensures that the index remains balanced, providing consistent and efficient access times even as the data grows.
Reduced Disk Access: The use of a B-tree reduces the number of disk accesses for both search and insert operations compared to linear search or flat file indexing.
Disadvantages:
Overhead in Insertion and Deletion: ISAM requires that the data file be kept sorted and the index file updated whenever records are inserted or deleted. This can incur overhead, particularly when the index file needs to be rebuilt or reorganized.
Limited Flexibility: Traditional ISAM systems are not as flexible as more modern methods like B+-trees or hash indexing, which handle dynamic data and large datasets more efficiently.
Conclusion:
Indexed Sequential Access Method (ISAM) is an effective method for handling large datasets that require both sequential and indexed access. By using a B-tree as the underlying index structure, ISAM provides a balanced, efficient way to search, insert, delete, and access data. While it is an older method compared to more modern database indexing techniques, ISAM still remains relevant for certain use cases, particularly in systems where access speed and simple structures are prioritized.
m - Way search trees
An m-way search tree is a type of search tree in which each node can have up to m children (i.e., it is a generalization of the binary search tree, which is an m-way search tree where m = 2). In this type of tree, nodes are allowed to store multiple keys, and the tree structure is designed to facilitate fast searching, insertion, and deletion operations.
Characteristics of an m-Way Search Tree
1. Nodes with Multiple Keys:
Each node in an m-way search tree can store up to m-1 keys (values), arranged in sorted order. These keys divide the node into multiple regions.
For example, if a node has keys [K1, K2, K3], then:
All keys less than K1 are found in the first (leftmost) child.
Keys between K1 and K2 are found in the second child.
Keys between K2 and K3 are found in the third child.
All keys greater than K3 are found in the fourth (rightmost) child.
2. Children:
Each node can have up to m children. The children are associated with intervals determined by the keys stored in the parent node.
A node with k keys will have k + 1 children.
3. Balanced Structure:
Similar to binary search trees, m-way search trees are often balanced, meaning that all leaf nodes are at the same depth. This ensures efficient operations like search, insert, and delete.
4. Ordering Property:
The keys in each node are stored in sorted order. Additionally, the keys in the left child are less than the smallest key in the node, and the keys in the right child are greater than the largest key.
Example of an m-Way Search Tree (m = 3)
Consider an m-way search tree with m = 3, which means each node can have up to 2 keys (and thus 3 children). Below is a simple example of such a tree:
Tree Structure:
Root Node: The root node contains the keys [20, 40] and has three children:
Left child: Contains the key 10.
Middle child: Contains the key 30.
Right child: Contains the keys [50, 60].
Searching for a key: To search for a key, start at the root and follow the appropriate child pointer based on the key range:
If searching for 30, it's found in the middle child.
If searching for 35, you go to the middle child (since 20 < 35 < 40; that child contains 30), and then descend to its right child node (which contains 35).
Operations on m-Way Search Trees
The main operations on m-way search trees are search, insertion, and deletion, similar to binary search trees, but with adaptations for multiple keys and children.
1. Search Operation:
The search operation works similarly to that in a binary search tree, but instead of traversing down two child pointers, you need to check multiple keys in each node.
Start from the root and at each node, compare the target key with the keys in the node. Based on the comparison, move to the appropriate child node.
This process continues recursively until the key is found or the search reaches a leaf node.
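A sketch of this search procedure in Python (the node layout and names are illustrative, assuming each node keeps a sorted key list and one child pointer per key interval):

```python
# Search in an m-way search tree: check the keys in each node, then
# descend into the child whose key interval contains the target.
import bisect

class MWayNode:
    def __init__(self, keys, children=None):
        self.keys = keys                # sorted list of at most m - 1 keys
        self.children = children or []  # len(keys) + 1 entries, or [] for a leaf

def search(node, target):
    while node is not None:
        i = bisect.bisect_left(node.keys, target)
        if i < len(node.keys) and node.keys[i] == target:
            return True                 # target found among this node's keys
        if not node.children:
            return False                # leaf reached without a match
        node = node.children[i]         # descend into the matching interval
    return False

# The 3-way tree from the text: root [20, 40] with children 10, 30, [50, 60];
# as in the example, the node [30] has a right child holding 35.
tree = MWayNode([20, 40], [
    MWayNode([10]),
    MWayNode([30], [None, MWayNode([35])]),
    MWayNode([50, 60]),
])
print(search(tree, 35))  # -> True
print(search(tree, 25))  # -> False
```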
2. Insertion Operation:
Step 1: First, search for the correct position to insert the key by following the appropriate child pointers.
Step 2: If the correct leaf node has space for the new key, simply insert it while maintaining the sorted order.
Step 3: If the node is full (i.e., it has m-1 keys), then:
Split the node into two nodes.
Push the middle key up to the parent node.
If the parent node is also full, this process continues recursively (propagating the split upwards).
In some cases, if the root node is split, a new root is created, and the height of the tree increases.
3. Deletion Operation:
Step 1: Search for the key to be deleted, and if it’s found in a leaf node, remove it and rearrange the remaining keys.
Step 2: If the key is found in an internal node, it can be replaced by either:
The largest key from the left child (predecessor), or
The smallest key from the right child (successor).
Step 3: After deletion, if a node has fewer than the minimum required number of keys (i.e., it is underfull), the node is merged with a sibling or keys are borrowed from a sibling to restore the balance.
Step 4: If this causes a parent node to have fewer than the required keys, the process propagates upwards, similar to insertion.
Advantages of m-Way Search Trees
1. Efficient Search Time:
The search time in an m-way search tree is logarithmic with respect to the number of keys in the tree. This is because the tree has multiple keys in each node, reducing the height of the tree and making search operations faster compared to binary search trees.
In an m-way search tree, the height of the tree is approximately logₘ(n), where n is the number of keys.
2. Reduced Disk Access (in Disk-based Storage):
Since m-way search trees store multiple keys in each node, they are particularly efficient for disk-based storage systems. By reducing the number of nodes and pointers, the tree minimizes disk access, which is a critical factor in systems that store large amounts of data on disks (e.g., databases).
3. Balanced Structure:
Like other balanced trees (e.g., AVL trees, B-trees), m-way search trees maintain a balanced structure, ensuring that all operations are efficient even as the tree grows.
4. Better Space Utilization:
By storing multiple keys in each node, m-way search trees make better use of memory compared to binary search trees, where each node only stores one key.
Disadvantages
1. Complexity:
The algorithm for insertion and deletion can become more complex compared to binary search trees due to the need to handle multiple keys and children at each node, as well as splitting and merging nodes.
2. Overhead in Managing Full Nodes:
When nodes become full, they need to be split, which can introduce additional overhead. Similarly, when nodes become underfilled, rebalancing and merging nodes adds complexity.
Applications of m-Way Search Trees
Databases: m-way search trees, especially in the form of B-trees (where m is a fixed degree), are widely used for indexing in databases because they offer efficient searching and balanced access to large datasets.
File Systems: Many file systems use m-way search trees to index files and directories, allowing for fast file lookup and access.
Memory Management: They can be used in memory management systems where fast searching, allocation, and deallocation of memory blocks are required.
Conclusion
An m-way search tree is a versatile data structure that allows for efficient searching, insertion, and deletion by allowing multiple keys per node and multiple child pointers. It is a generalization of binary search trees and is commonly used in systems that require efficient data access, such as databases and file systems. Although its operations can be more complex than binary search trees, the ability to store more keys per node leads to better performance in large-scale systems.
B - trees of order m
A B-tree of order m is a self-balancing tree data structure that maintains sorted data and allows for efficient insertion, deletion, and searching operations. It is a generalization of a binary search tree, where each node can have more than two children and store more than one key.
Key Characteristics of a B-Tree of Order m
1. Order m:
A B-tree of order m is a tree where each node can contain up to m-1 keys and have m children.
A B-tree is balanced, meaning all leaf nodes are at the same level. This ensures logarithmic time complexity for searching, insertion, and deletion operations.
2. Properties:
Root Node: The root is the only node allowed to have fewer than ⌈m/2⌉ children. If it is not a leaf, it has at least 2 children, and it may contain as few as 1 key.
Non-root Nodes: Every non-root node must have at least ⌈m/2⌉ children. This property ensures the tree remains balanced and minimizes the height of the tree.
Keys in Sorted Order: Within each node, the keys are stored in sorted order, and the keys within a node partition the node's children into key ranges. For example, in a node containing keys [K1, K2, K3], the four children are arranged as:
The first child contains keys smaller than K1.
The second child contains keys between K1 and K2.
The third child contains keys between K2 and K3.
The fourth child contains keys larger than K3.
3. Balanced Structure:
B-trees are balanced. All leaf nodes are at the same level, which ensures that the tree remains relatively shallow even as it grows, guaranteeing efficient O(log n) time for search, insert, and delete operations.
4. Efficient Disk Usage:
B-trees are particularly well-suited for disk-based storage systems like databases and file systems. Each node of a B-tree can hold multiple keys, which minimizes the number of disk accesses.
Formal Definition of B-Tree of Order m
For a B-tree of order m, each node has the following properties:
Number of Keys: Each node stores between ⌈m/2⌉ - 1 and m-1 keys, except the root, which may store as few as 1 key.
Number of Children: Each node can have between ⌈m/2⌉ and m children, except the root which can have fewer.
Search: The keys are arranged in sorted order within each node. A search for a key starts at the root and follows child pointers to find the key in the correct leaf node.
Example of a B-Tree of Order 3
Let’s take a B-tree of order 3 as an example. In this tree:
Each node can store at most 2 keys.
Each node can have at most 3 children.
Each non-root node must have at least 2 children, and the root can have fewer.
Tree Structure:
[30]
/ | \
[10, 20] [40, 50] [60, 70]
Root Node: The root node contains one key, 30.
Children of the Root: The root has three children:
The left child contains the keys [10, 20].
The middle child contains the keys [40, 50].
The right child contains the keys [60, 70].
This is a B-tree of order 3, where each internal node stores 1 or 2 keys, and each internal node can have up to 3 children.
Operations in B-Trees of Order m
Search:
The search operation in a B-tree follows the sorted keys in each node to traverse the tree. Starting at the root, you:
Compare the key to search for with the keys in the current node.
Based on the comparison, move to the appropriate child node and repeat the process until the key is found or a leaf node is reached.
Time Complexity: The time complexity for searching in a B-tree is O(log_m n), where n is the number of keys in the tree and m is the order of the tree.
Insertion:
Step 1: Search for the correct position to insert the key.
Step 2: If the appropriate node has space (i.e., has fewer than m-1 keys), insert the key into the node while maintaining sorted order.
Step 3: If the node is full (i.e., it already has m-1 keys), split the node into two:
The middle key is promoted to the parent node.
The two halves of the node become two separate nodes.
Step 4: If the parent node is also full, the splitting process propagates upwards, and the root may be split as well, increasing the height of the tree.
Time Complexity: The time complexity for insertion is O(log_m n), due to the need to traverse the tree to find the insertion point and potentially split nodes.
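The insert-then-split procedure described above can be sketched as follows. This is a minimal in-memory B-tree of order m = 3 (names are illustrative, and deletion is omitted); each recursive call returns a (promoted key, new right sibling) pair whenever the node below it had to split.

```python
# Compact sketch of B-tree insertion with node splitting.
import bisect

M = 3  # order: maximum number of children; max keys per node = M - 1

class BNode:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []  # empty list marks a leaf

def _insert(node, key):
    """Insert key below node; return (promoted_key, new_right_node) on split."""
    i = bisect.bisect_left(node.keys, key)
    if node.children:                    # internal node: recurse into a child
        split = _insert(node.children[i], key)
        if split is None:
            return None
        mid, right = split               # absorb the key promoted by the child
        node.keys.insert(i, mid)
        node.children.insert(i + 1, right)
    else:                                # leaf node: insert in sorted position
        node.keys.insert(i, key)
    if len(node.keys) <= M - 1:          # no overflow
        return None
    mid = len(node.keys) // 2            # split: the middle key moves up
    right = BNode(node.keys[mid + 1:], node.children[mid + 1:])
    promoted = node.keys[mid]
    node.keys = node.keys[:mid]
    node.children = node.children[:mid + 1]
    return promoted, right

def insert(root, key):
    split = _insert(root, key)
    if split is None:
        return root
    mid, right = split                   # the root itself split: height grows
    return BNode([mid], [root, right])

root = BNode()
for k in [10, 20, 30, 40, 50, 60, 70]:
    root = insert(root, k)
print(root.keys)  # -> [40]
```

After the seven insertions the root has split twice, so the tree has height 2 with the single key 40 at the root.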
Deletion:
Step 1: Search for the key to be deleted.
Step 2: If the key is found in a leaf node, simply remove it.
Step 3: If the key is found in an internal node, replace it with either the predecessor or successor key (from a neighboring leaf or child).
Step 4: If a node becomes underfull (i.e., it has fewer than ⌈m/2⌉ keys), perform node merging or key borrowing from a sibling to restore the node's balance. This process propagates up to the root if necessary.
Time Complexity: The time complexity for deletion is also O(log_m n), as it requires searching for the key and possibly propagating changes upwards to restore balance.
Advantages of B-Trees of Order m
Efficient Search, Insert, and Delete Operations:
The tree remains balanced, ensuring that all operations (search, insert, delete) have logarithmic time complexity, even as the number of keys grows.
Reduced Disk Access:
Because each node can store multiple keys and pointers, B-trees minimize the number of nodes that must be loaded into memory from disk. This is particularly important in database and file system applications, where data is stored on disk.
Flexibility:
The order of the B-tree can be adjusted according to the storage medium and system architecture. For example, increasing the order reduces the height of the tree, which can lead to fewer disk accesses.
Supports Range Queries:
The sorted nature of the B-tree allows for efficient range queries (e.g., finding all keys within a given range), since the tree’s structure supports ordered traversal.
Disadvantages
Complexity:
B-trees are more complex to implement and manage than simpler tree structures like binary search trees, particularly because of the need to handle node splitting and merging.
Space Utilization:
While B-trees are efficient with disk-based storage, they can be less space-efficient when stored entirely in memory, as nodes may contain fewer keys than the maximum allowed.
Applications of B-Trees of Order m
Databases: B-trees are widely used in databases to index large datasets, allowing for fast search, insert, and delete operations.
File Systems: Many file systems use B-trees or related structures (like B+-trees) to index files and directories.
Storage Systems: B-trees are ideal for systems that require efficient access to large, sorted datasets stored on disk.
Conclusion
A B-tree of order m is a versatile and efficient data structure that supports balanced search, insertion, and deletion operations. By allowing each node to store multiple keys and have multiple children, it ensures fast access even for large datasets, making it ideal for database and file system indexing. The tree remains balanced, providing O(log_m n) time complexity for all operations, which is crucial for performance in systems that handle large amounts of data.
Height of B - tree
The height of a B-tree is a critical aspect that determines the efficiency of search, insert, and delete operations. The height of a B-tree of order m depends on the number of keys n stored in the tree, and it is directly related to how well the tree is balanced.
Key Concepts of B-Tree Height
A B-tree of order m has the following properties:
Order (m): Each node can store up to m-1 keys and can have m children.
Balanced Structure: B-trees are balanced, meaning all leaf nodes are at the same level (i.e., they have the same depth from the root).
Root Node: The root can have fewer than the maximum number of children. Specifically, the root can have between 2 and m children (if it's not a leaf node). If it's a leaf, it can have 0 children.
Other Nodes: Every non-root node has between ⌈m/2⌉ and m children.
Height Formula of a B-Tree
The height of a B-tree h is the number of edges (or levels) from the root to the leaf nodes. The height can be estimated by the following formula:
h = ⌈log_m n⌉
Where:
n is the number of keys stored in the tree.
m is the order of the tree (i.e., the maximum number of children a node can have).
h is the height of the tree, which is the number of edges from the root to the leaf nodes.
Derivation of the Formula
In the best-case scenario (when the B-tree is full), the tree is filled as much as possible at every level.
The root node can have at most m children, and each non-root node can have m children as well.
At height h, the number of keys in the tree is maximized.
For a tree of height h:
At level 0 (the root), there is 1 node.
At level 1, there can be at most m nodes.
At level 2, there can be at most m² nodes.
At level h, there can be at most m^h nodes.
Since each node stores at most m-1 keys, the total number of keys in a maximally filled tree is:
n ≤ (m-1)(1 + m + m² + ⋯ + m^h) = m^(h+1) - 1
Ignoring constant factors, this approximates the total number of keys as:
n ≈ m^h
From this, the height h of the tree can be determined as:
h ≈ log_m n
Therefore, the height of the B-tree is approximately log_m n, which gives the formula:
h = ⌈log_m n⌉
Example
Consider a B-tree of order 3 (i.e., m = 3) with n = 1000 keys. Let's calculate its height.
Order (m) = 3, and the number of keys (n) = 1000.
Use the height formula: h = ⌈log_3 1000⌉
First, calculate the logarithm: log_3 1000 = log 1000 / log 3 ≈ 3 / 0.4771 ≈ 6.29
Take the ceiling value: h = ⌈6.29⌉ = 7
So, the height of the B-tree with 1000 keys and order 3 is 7.
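The worked example above can be checked directly, using the change-of-base identity log_3 n = log n / log 3:

```python
# Quick check of the height estimate h = ceil(log_m n) for m = 3, n = 1000.
import math

m, n = 3, 1000
h = math.ceil(math.log(n) / math.log(m))
print(h)  # -> 7
```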
Worst-Case Scenario (Maximum Height)
In the worst case, a B-tree may not be completely full. However, the tree is still balanced, and each non-root node has at least ⌈m/2⌉ children. This ensures that the height is still logarithmic relative to the number of keys.
Even in the worst-case scenario, the height of a B-tree will still grow logarithmically, and it will never grow excessively high. This is what makes B-trees efficient, especially for large datasets.
Practical Implications
Search Operations: The height of the tree impacts the time required to search for a key. Since B-trees are balanced, the time complexity of a search operation is O(log_m n), where n is the number of keys and m is the order of the tree.
Insertion and Deletion: Insertion and deletion operations also depend on the height of the tree. Since the tree remains balanced, these operations will require O(log_m n) time, ensuring efficient updates to the data structure.
Conclusion
The height of a B-tree of order m with n keys is approximately O(log_m n). This logarithmic height ensures that operations such as search, insertion, and deletion are efficient, even as the number of keys in the tree grows large. By maintaining a balanced structure with logarithmic growth in height, B-trees are ideal for applications that require fast data retrieval, such as databases and file systems.
Searching a B - tree
Searching in a B-tree is an efficient process that leverages the properties of a balanced tree structure, making it highly effective for large datasets, particularly in scenarios like databases and file systems. The process of searching in a B-tree is similar to searching in other tree data structures but takes advantage of multiple keys and child pointers per node, as well as the tree's balanced nature.
Key Properties of a B-tree (for reference)
Order (m): A B-tree of order m allows each node to store up to m-1 keys and have m children.
Balanced: All leaf nodes are at the same level (same depth from the root), ensuring efficient access.
Sorted Keys: Keys in each node are stored in sorted order, and the keys in the parent node divide the child nodes into ranges.
General Search Process in a B-tree
The search operation in a B-tree involves traversing down the tree, comparing the key being searched for at each node, and moving to the appropriate child node based on the comparisons. Since the keys in each node are sorted, binary-like search can be used within the node.
Steps to Search in a B-tree
Start at the Root Node: Begin the search at the root node. If the tree is empty (i.e., the root is null), return that the key is not found.
Search within the Node: Each node of the B-tree stores multiple keys, and these keys are arranged in sorted order. To search for a key within a node:
Compare the target key with the keys in the current node.
If the target key is found in the node, return that the key is found.
If the target key is smaller than the first key in the node, move to the leftmost child.
If the target key is between two keys, move to the child corresponding to that key range.
If the target key is larger than the largest key in the node, move to the rightmost child.
Traverse Down the Tree: If the key is not found within the current node, move to the appropriate child node and repeat the process. Each child node corresponds to a range of keys, so you always move to the child whose range contains the key you're looking for.
Continue Until You Reach a Leaf Node:
Continue the process of moving down the tree, visiting child nodes and comparing keys, until you either find the target key or reach a leaf node.
If you reach a leaf node and still haven't found the key, return that the key is not present in the tree.
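The steps above can be sketched in Python. The BTreeNode class and its field names are illustrative assumptions, not a standard API; a real implementation would also use binary search within each node.

```python
# A sketch of B-tree search, assuming a simple node with sorted keys and a
# list of child pointers (empty for leaves).

class BTreeNode:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []          # sorted keys stored in this node
        self.children = children or []  # empty list for a leaf node

def btree_search(node, target):
    if node is None:
        return False                    # empty tree: key not found
    i = 0
    # Find the first key >= target (a binary search also works here)
    while i < len(node.keys) and target > node.keys[i]:
        i += 1
    if i < len(node.keys) and node.keys[i] == target:
        return True                     # key found in this node
    if not node.children:
        return False                    # reached a leaf without finding it
    return btree_search(node.children[i], target)  # descend into key range

# Example usage: a small order-3 tree
root = BTreeNode([30, 50], [BTreeNode([10, 20]),
                            BTreeNode([35, 40]),
                            BTreeNode([60, 70])])
print(btree_search(root, 35))  # True
print(btree_search(root, 45))  # False
```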
Time Complexity of Search
The time complexity of searching in a B-tree is O(log_m n), where:
n is the total number of keys in the B-tree.
m is the order of the B-tree, which defines the number of children a node can have (and thus the branching factor of the tree).
Since each node can hold up to m-1 keys and m children, the height of the tree is O(log_m n). This ensures that the search operation is efficient even for large datasets.
Example
Consider a B-tree of order 3 (i.e., each node can hold up to 2 keys and 3 children). Let's search for the key 35 in the following tree:
            [30, 50]
           /    |    \
 [10, 20]  [35, 40]  [60, 70]
Start at the Root: The root node contains the keys 30 and 50.
Compare 35 with the keys in the root. Since 30 < 35 < 50, move to the middle child, which contains the keys [35, 40].
Move to the Middle Child: The current node contains the keys [35, 40].
Compare 35 with the keys in this node. The first key, 35, matches the target.
Found the Key: The key 35 is found in the middle child.
Return that the key is found.
Advantages of Searching in B-trees
Efficient Search Time: The time complexity of O(log_m n) ensures that even large datasets can be searched efficiently.
Minimized Disk Access: B-trees are optimized for disk-based storage, as they store multiple keys per node. This reduces the number of disk accesses required during a search, making B-trees ideal for systems like databases and file systems where accessing disk data is slow.
Balanced Structure: Since B-trees are balanced, the maximum height of the tree is kept low, which minimizes the number of nodes that need to be traversed during the search.
Summary of the Search Process
Start at the root and compare the key with the keys in the node.
Based on the comparison, move to the appropriate child node.
Continue down the tree until the key is found or you reach a leaf node.
If the key is found, return it; if not, indicate that the key is absent.
Conclusion
Searching in a B-tree is a fast and efficient process due to its balanced nature and the ability to store multiple keys in each node. This results in logarithmic time complexity, O(log_m n), for search operations, making B-trees ideal for large-scale applications like databases and file systems where fast retrieval of data is critical.
Inserting into a B-tree
Insertion in a B-tree involves adding a new key to the tree while maintaining the properties of the B-tree: it must remain balanced, sorted, and every node must comply with the B-tree’s order constraints. The insertion process can be broken down into several steps, and in the worst case, may involve splitting nodes and propagating changes upwards to the root.
Key Properties of B-trees (Recap)
Order (m): Each node in a B-tree can have up to m-1 keys and m children.
Balanced Tree: All leaf nodes are at the same level, which ensures that the height of the tree remains logarithmic.
Sorted Keys: The keys within each node are stored in sorted order.
Steps to Insert a Key into a B-Tree
The general approach to insert a new key into a B-tree is as follows:
1. Find the Correct Leaf Node:
Start at the root and search for the appropriate leaf node where the new key should be inserted. This is done by traversing the tree, comparing the key with the keys in each node, and moving to the appropriate child pointer (similar to searching in a B-tree).
2. Insert the Key into the Node:
If the node where the key should be inserted has space (i.e., it contains fewer than m-1 keys), simply insert the new key into the node while maintaining the sorted order of keys.
3. Split the Node if Full:
If the node where the key should be inserted is full (i.e., it already contains m-1 keys), you must split the node into two nodes.
Splitting a node:
Choose the median key (the middle key in the sorted list) to promote to the parent node.
The median key becomes the new key in the parent, and the original node is split into two: one contains the keys smaller than the median, and the other contains the keys larger than the median.
4. Propagate the Split Upwards (if necessary):
After splitting, the parent node may also become full. If this happens, the parent is also split, and the process repeats until the tree is balanced.
If the root is split, a new root is created, which increases the height of the tree by 1.
Detailed Insertion Algorithm
Let’s break down the insertion process with more detail:
Step 1: Find the Correct Leaf Node
Begin at the root node and compare the key to be inserted with the keys in the current node.
Move down the tree to the appropriate child node based on the comparison. Repeat this process until you reach a leaf node.
Step 2: Insert the Key into the Node (If Space is Available)
If the leaf node has fewer than m-1 keys, simply insert the new key in the correct position while maintaining the sorted order within the node.
Step 3: Split the Node if Full
If the node has m-1 keys (i.e., it's full), the following happens:
Find the Median Key: Choose the median key (the middle key) to be promoted to the parent node.
Split the Node:
Create two new nodes: one containing the keys smaller than the median and the other containing the keys larger than the median.
The median key is pushed up into the parent node.
For example, if a node has keys [10, 20, 30] and a new key 25 is to be inserted:
Insert 25 into the node to get [10, 20, 25, 30].
The median key (20, taking the lower median) is promoted to the parent, and the node is split into two: [10] and [25, 30].
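The split step in the example above can be sketched as a small helper. The list-based key representation is an illustrative assumption; the lower-median convention used here matches the example, while some implementations promote the upper median instead.

```python
# A sketch of splitting an overfull sorted key list around its lower median.

def split_keys(keys):
    """Split an overfull sorted key list. Returns (left, median, right)."""
    mid = (len(keys) - 1) // 2   # index of the lower median
    return keys[:mid], keys[mid], keys[mid + 1:]

left, median, right = split_keys([10, 20, 25, 30])
print(left, median, right)  # [10] 20 [25, 30]
```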
Step 4: Propagate the Split Upwards (If Necessary)
After the split, if the parent node becomes full (i.e., it has m-1 keys), it too will need to be split.
The process of splitting nodes propagates up the tree until there is a node that does not require splitting or a new root is created.
If the root is split, a new root is created, which increases the height of the tree by 1.
Example of Insertion
Let’s go through an example of inserting a key into a B-tree of order 3 (i.e., each node can hold up to 2 keys and 3 children).
Initial Tree
Consider the following B-tree of order 3:
      [20]
     /    \
 [10]    [30, 40]
This tree has a height of 1, and we want to insert the key 25.
Step 1: Find the Correct Leaf Node
Start at the root. The root contains the key 20, and since 25 > 20, move to the right child, which contains the keys
[30, 40].
Step 2: Insert the Key into the Node (If Space is Available)
The node
[30, 40]has space for another key, so we insert 25 in the correct position. The node now contains[25, 30, 40].
Step 3: Split the Node if Full
Now, the node
[25, 30, 40]is full, as it contains 3 keys (which is the maximum for a B-tree of order 3).We split this node by promoting the median key 30 to the parent node. The node is split into two nodes:
[25]and[40].
After the split, the tree looks like this:
      [20, 30]
     /    |    \
 [10]  [25]  [40]
Step 4: Propagate the Split Upwards (If Necessary)
The parent node, now [20, 30], contains 2 keys and has 3 children, which is within the allowed capacity for a B-tree of order 3, so no further splitting is needed.
Final Tree After Insertion
      [20, 30]
     /    |    \
 [10]  [25]  [40]
The key 25 has been successfully inserted into the tree.
Time Complexity of Insertion
The time complexity of inserting a key into a B-tree is O(log_m n), where:
n is the number of keys in the tree.
m is the order of the B-tree (i.e., the maximum number of children a node can have).
The time complexity is logarithmic because the height of the B-tree is O(log_m n), and each operation (search, insertion, splitting) requires traversing the tree and potentially splitting nodes, which takes time proportional to the height of the tree.
Summary of the Insertion Process
Search for the appropriate leaf node where the key should be inserted.
Insert the key into the node if there is space.
If the node is full, split the node and promote the median key to the parent.
If necessary, propagate the split upwards to maintain the balanced structure.
If the root is split, create a new root and increase the height of the tree.
The insertion process ensures that the B-tree remains balanced, with O(log_m n) time complexity for large datasets. This makes B-trees an efficient choice for applications such as database indexing and file systems where fast insertions are required.
Deletion from a B-tree
Deletion from a B-tree is an operation that must ensure the tree remains balanced and adheres to the B-tree properties. Like insertion, deletion in a B-tree involves maintaining the sorted order of keys and the balanced structure of the tree, but the process is more complex due to the need to handle cases where nodes become underfilled or need to be merged.
Key Properties of B-trees (Recap)
Order (m): A B-tree of order m can have at most m-1 keys and m children in each node.
Balanced Tree: All leaf nodes are at the same level, ensuring the tree remains balanced, which allows for efficient access.
Sorted Keys: The keys within each node are stored in sorted order.
Steps to Delete a Key from a B-Tree
When deleting a key from a B-tree, there are several cases to consider. The process can be divided into three main steps:
Find the Key to Delete: Search the tree to find the key that needs to be deleted, similar to how you would search in a B-tree.
Delete the Key: Once the key is found, the deletion process proceeds based on whether the key is in a leaf node or an internal node.
Fix Violations (If Necessary): After the key is deleted, the B-tree might become unbalanced. We need to fix any violations of the B-tree properties, such as underfilled nodes (nodes with fewer than the minimum number of keys).
Step 1: Find the Key to Delete
Start at the root and search for the key to delete, just like a normal search operation in a B-tree.
If the key is found in a leaf node, the deletion is straightforward (removing the key).
If the key is found in an internal node, the deletion is more complicated, as the key needs to be replaced with another key from the tree (either from its predecessor or successor).
Step 2: Delete the Key
Case 1: Deleting a Key from a Leaf Node
If the key to be deleted is in a leaf node, simply remove the key from the node.
If this results in a node having fewer than the minimum number of keys (⌈m/2⌉ - 1), we need to fix the node (this leads to cases 2 or 3 below).
Case 2: Deleting a Key from an Internal Node
If the key to be deleted is in an internal node, it needs to be replaced by either the predecessor (the largest key in the left subtree) or the successor (the smallest key in the right subtree). Then, the predecessor or successor key is deleted.
After replacing the key, the problem is reduced to deleting the predecessor or successor key from a leaf node (since it will be in a leaf or an internal node with at most one child).
Step 3: Fixing the Tree After Deletion
After a deletion, a B-tree node might end up with fewer than the minimum number of keys, which breaks the B-tree property. To restore balance, we must restructure the tree by either:
Borrowing a key from a sibling node.
Merging the node with a sibling node.
Case 1: Borrow from a Sibling
If a node has fewer than the minimum number of keys, we first check if its neighboring sibling has more than the minimum number of keys. If so, we borrow a key from the sibling node:
If the sibling is to the left, move the parent's key down into the current node, and then move the largest key from the left sibling up into the parent.
If the sibling is to the right, move the parent's key down into the current node, and then move the smallest key from the right sibling up into the parent.
Case 2: Merge with a Sibling
If borrowing is not possible (i.e., the sibling also has the minimum number of keys), the current node is merged with a sibling:
The parent key that separates the two nodes is moved down into the newly merged node.
After merging, the parent node loses a key, and the merge might cause the parent to underflow (i.e., have fewer than the minimum number of keys), requiring further restructuring up the tree.
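The merge step can be sketched as a small helper. The plain-list node representation and the function name are illustrative assumptions, not part of a standard API.

```python
# A sketch of merging two underfull siblings: the separating key is pulled
# down from the parent and placed between the siblings' keys.

def merge_siblings(parent_keys, sep_index, left_keys, right_keys):
    """Pull the separating key down from the parent and fuse the siblings."""
    separator = parent_keys.pop(sep_index)   # the parent loses one key
    return left_keys + [separator] + right_keys

parent = [20, 40]
merged = merge_siblings(parent, 0, [10], [30])
print(merged)  # [10, 20, 30]
print(parent)  # [40]
```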
Example of Deletion
Let's consider a B-tree of order 3 (i.e., each node can hold a maximum of 2 keys and 3 children). Here's an example tree:
       [20, 40]
      /    |    \
 [10]   [30]   [50, 60]
Example 1: Deleting a Key from a Leaf Node
Let's delete 50 from the tree:
1. Find the Key: The key 50 is in the rightmost leaf.
2. Delete the Key: Remove 50 from the node.
The tree now looks like this:
       [20, 40]
      /    |    \
 [10]   [30]   [60]
3. Fix Violations: The node [60] still has one key, which meets the minimum for a non-root node in a B-tree of order 3 (⌈3/2⌉ - 1 = 1), so no violation occurs, and the tree remains balanced.
Example 2: Deleting a Key from an Internal Node
Starting again from the original tree, let's delete 20 from the root:
1. Find the Key: The key 20 is in the root node.
2. Delete the Key: Replace 20 with its predecessor, 10 (the largest key in the left subtree), and then delete 10 from the leaf. The leaf is now empty, which violates the minimum of one key per node:
       [10, 40]
      /    |    \
  []    [30]   [50, 60]
3. Fix Violations: The empty leaf cannot borrow from its sibling [30], which holds only the minimum number of keys, so the two leaves are merged. The separating parent key 10 moves down into the merged node, and the root is left with a single key:
        [40]
       /    \
 [10, 30]  [50, 60]
Now, the tree is balanced again.
Time Complexity of Deletion
The time complexity of deleting a key from a B-tree is O(log_m n), where:
n is the number of keys in the tree.
m is the order of the B-tree (i.e., the maximum number of children a node can have).
The logarithmic time complexity arises because the tree remains balanced, and in the worst case, the deletion involves traversing the height of the tree, which is O(log_m n).
Summary of Deletion Process
Find the Key: Search for the key to delete, as in a normal B-tree search.
Delete the Key: If the key is in a leaf node, delete it directly. If the key is in an internal node, replace it with the predecessor or successor and delete that key.
Fix Violations: After deleting the key, if any node has fewer than the minimum number of keys, fix the violation by either:
Borrowing a key from a sibling.
Merging with a sibling.
Propagate Fixes: If necessary, propagate the changes upwards, potentially decreasing the height of the tree (this happens when the root is left empty after a merge and is removed).
Conclusion
Deletion in a B-tree is more complex than insertion due to the need to maintain the tree's balance after deleting a key. The tree remains balanced by borrowing keys or merging nodes when necessary, ensuring efficient operations with O(log_m n) time complexity for large datasets. This makes B-trees highly efficient for dynamic data storage where keys are frequently added and removed, such as in databases and file systems.
Node structure
In the context of indexing in data structures, a node structure refers to the way data (typically keys, pointers, or values) is organized within a node of a tree-based index, such as a B-tree, B+-tree, or other indexing structures. Nodes are essential for efficient data retrieval, as they store the keys and help in navigating the tree during search, insertion, and deletion operations.
Node Structure in Indexing
In tree-based indexing structures, each node typically contains a combination of keys and pointers (also called child references or references), which direct the search to the appropriate subtrees or data records. Depending on the type of index structure, the exact contents and organization of a node may vary, but the general idea remains the same: nodes store keys that help guide searches, and they contain pointers to other nodes or records.
Here is an overview of the node structure for common indexing structures:
1. B-tree Node Structure
A B-tree is a balanced tree structure that stores keys and pointers in an efficient way to allow fast searching, insertion, and deletion. The node structure of a B-tree typically contains the following components:
Keys: A sorted list of keys in the node. The number of keys is at most m-1 where m is the order of the B-tree.
Pointers (Children): A list of child pointers. Each pointer refers to a subtree that contains keys smaller or larger than the corresponding keys in the node. The number of pointers is m, corresponding to the number of possible subtrees.
Parent Pointer (optional): Some B-trees maintain a parent pointer for easier traversal during certain operations, but this is not a mandatory feature in many implementations.
For a B-tree of order m, each node has the following properties:
It can hold between ⌈m/2⌉ - 1 and m - 1 keys.
It has between ⌈m/2⌉ and m child pointers (for internal nodes).
The root node is exempt from the minimum-fill requirement and may have as few as one key.
Example of B-tree Node Structure:
Node:
[ Key1 | Key2 | Key3 | ... | KeyN ]
[ P1 | P2 | P3 | ... | PN+1 ]
Key1, Key2, ..., KeyN: Sorted keys in the node.
P1, P2, ..., PN+1: Pointers to child nodes or data records.
Properties:
If the node is an internal node, the keys guide the search, where:
Keys in the subtree referenced by a pointer are greater than the key to the pointer's left.
Keys in the subtree referenced by a pointer are less than the key to the pointer's right.
If the node is a leaf node, the pointers refer to data records rather than child nodes.
2. B+-tree Node Structure
A B+-tree is an extension of the B-tree that stores only keys in the internal nodes and stores both keys and values (or pointers to records) in the leaf nodes. The internal node structure is almost identical to a B-tree, but the leaf nodes have a different structure to hold actual data pointers.
Internal Node:
An internal node in a B+-tree typically contains:
Keys: Sorted keys (similar to a B-tree).
Child Pointers: Pointers to child nodes, which are subtrees that help with the search.
Leaf Node:
A leaf node in a B+-tree contains:
Keys: Sorted keys, like in the internal nodes.
Data Pointers: Pointers to the actual data records or data blocks.
A key difference between a B-tree and a B+-tree is that in a B+-tree, all actual data records (or pointers to data) are found only in the leaf nodes, not in the internal nodes. This structure allows for faster range queries and traversal.
Example of B+-tree Node Structure:
· Internal Node:
[ Key1 | Key2 | Key3 | ... | KeyN ]
[ P1 | P2 | P3 | ... | PN+1 ]
· Leaf Node:
[ Key1 | Key2 | Key3 | ... | KeyN ]
[ D1 | D2 | D3 | ... | DN ] <-- Pointers to data records
Where:
Key1, Key2, ..., KeyN: Sorted keys.
P1, P2, ..., PN+1: Pointers to child nodes (in internal nodes).
D1, D2, ..., DN: Pointers to actual data records (in leaf nodes).
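Because all data lives in the leaves, B+-tree leaves are commonly chained together with a next pointer so a range query can scan leaves sequentially. A hedged sketch follows; the BPlusLeaf class and its fields are assumptions for illustration, not a fixed API.

```python
# Sketch of chained B+-tree leaf nodes supporting a range scan.

class BPlusLeaf:
    def __init__(self, keys, data):
        self.keys = keys     # sorted keys in this leaf
        self.data = data     # data pointers/values, parallel to keys
        self.next = None     # link to the next leaf in key order

def range_query(first_leaf, lo, hi):
    """Collect data for all keys in [lo, hi] by walking the leaf chain."""
    results, leaf = [], first_leaf
    while leaf is not None:
        for k, d in zip(leaf.keys, leaf.data):
            if lo <= k <= hi:
                results.append(d)
            elif k > hi:
                return results   # keys are sorted, so we can stop early
        leaf = leaf.next
    return results

a = BPlusLeaf([1, 3], ["d1", "d3"])
b = BPlusLeaf([5, 7], ["d5", "d7"])
a.next = b
print(range_query(a, 3, 5))  # ['d3', 'd5']
```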
3. Trie Node Structure (for Prefix Indexing)
A Trie (or Prefix Tree) is another type of tree-based data structure commonly used for indexing strings and efficiently performing prefix searches. Each Trie node typically contains:
Children Pointers: A set of pointers (or references) to child nodes corresponding to possible characters (for string indexing). Each pointer refers to a node representing a character in a specific position of the string.
IsEndOfWord Flag: A flag that indicates whether the current node marks the end of a valid word.
Keys/Value: Optionally, a node can store a key (or value) associated with the string it represents. Typically, a node will store information only when it marks the end of a word.
Example of Trie Node Structure:
Node:
[ children: { a -> Node1, b -> Node2, c -> Node3 } ]
[ isEndOfWord: true ] <-- Marks the end of a valid word
[ value: 123 ] <-- Optional, holds associated value
children: A set of pointers to child nodes representing possible next characters.
isEndOfWord: A flag indicating whether the node represents the end of a valid word.
value: Optionally, stores a value associated with the string.
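The node layout above translates directly into a compact Python sketch. Field names mirror the description (children, isEndOfWord, value); the insert/search helpers are illustrative additions.

```python
# A compact Trie sketch matching the node fields described above.

class TrieNode:
    def __init__(self):
        self.children = {}           # char -> TrieNode
        self.is_end_of_word = False  # marks the end of a stored word
        self.value = None            # optional payload for complete words

def trie_insert(root, word, value=None):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_end_of_word = True
    node.value = value

def trie_search(root, word):
    node = root
    for ch in word:
        if ch not in node.children:
            return None              # no node for this prefix
        node = node.children[ch]
    return node.value if node.is_end_of_word else None

root = TrieNode()
trie_insert(root, "cat", 123)
print(trie_search(root, "cat"))  # 123
print(trie_search(root, "ca"))   # None (a prefix, not a stored word)
```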
4. Hash Index Node Structure
In hash indexing, a node typically contains:
Hash Bucket: A bucket that stores the list of keys (and associated data) whose hash values fall into the same bucket.
Key-Value Pairs: Each bucket stores key-value pairs, where the key is the original search key (whose hash value selected the bucket) and the value is the data or a pointer to the data.
Hash indexes are not trees but are still an essential indexing technique used for fast lookup based on hash values.
Example of Hash Node Structure:
Bucket:
[ (Hash1, Data1), (Hash2, Data2), ... ]
Where:
Hash1, Hash2: The hashed keys.
Data1, Data2: Data associated with each key.
Hash function
In data structures, a hash function plays a crucial role in indexing, particularly in hash tables or hash maps, by efficiently mapping keys to specific positions in an array or a table. Here's how a hash function works in the context of indexing:
1. Purpose of a Hash Function:
The primary goal of a hash function is to take a given key (which can be any data type such as strings, integers, etc.) and convert it into an integer index that corresponds to a specific location in an array (or hash table). This allows for fast retrieval of values associated with that key.
2. Working of a Hash Function:
A hash function processes the input key and generates a hash value, which is then mapped to a valid index within the array or hash table.
For example, if you have a hash table of size m, the hash function will produce an index between 0 and m-1. The hash function can be expressed as: index = hash(key) % m
This ensures that the key is mapped to a specific position in the array, which can then be used to store or retrieve the associated value efficiently.
3. Characteristics of a Good Hash Function:
Deterministic: A hash function should always produce the same output for the same input.
Uniform Distribution: The hash function should distribute keys evenly across the available indices to minimize the chance of collisions (when two keys hash to the same index).
Efficiency: The computation of the hash value should be fast, as it directly affects the performance of the hash table operations.
4. Collisions:
Collisions occur when two different keys hash to the same index. Hash functions cannot avoid this issue entirely, but they should minimize it.
Common methods to handle collisions include:
Chaining: Store multiple elements at the same index by using a linked list or another data structure.
Open Addressing: Find another open spot within the table when a collision occurs, using methods like linear probing, quadratic probing, or double hashing.
5. Examples of Hash Functions:
Modulo-based Hashing: One simple hash function is to take the key and apply the modulo operation with the table size: index = key % m
Multiplicative Hashing: Multiply the key by a constant and take the fractional part of the product.
Polynomial Rolling Hashing: A more sophisticated method often used in string hashing where each character’s position is weighted differently.
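A polynomial rolling hash can be sketched in a few lines. The base 31 and the large prime modulus below are common choices, not requirements.

```python
# Sketch of polynomial rolling hashing for strings: each character's
# contribution is weighted by its position via repeated multiplication.

def poly_hash(s, base=31, mod=10**9 + 7):
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % mod  # shift previous hash, add next char
    return h

print(poly_hash("abc") == poly_hash("abc"))  # True: deterministic
print(poly_hash("abc") == poly_hash("acb"))  # False: character order matters
```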
6. Applications:
Hash Tables: Used in databases, caches, and associative arrays to store key-value pairs with fast average-time access.
Data Deduplication: Hash functions can quickly identify duplicate entries by comparing hash values.
Cryptography: In cryptographic systems, hash functions are used to generate secure representations of data (though cryptographic hash functions differ from regular hash functions used in data structures).
Example:
class HashTable:
    def __init__(self, size):
        self.size = size
        self.table = [[] for _ in range(size)]  # one list (chain) per slot

    def hash_function(self, key):
        return hash(key) % self.size

    def insert(self, key, value):
        index = self.hash_function(key)
        for pair in self.table[index]:
            if pair[0] == key:       # key already present: update its value
                pair[1] = value
                return
        self.table[index].append([key, value])  # chaining handles collisions

    def get(self, key):
        index = self.hash_function(key)
        for k, v in self.table[index]:
            if k == key:
                return v
        return None                  # key not found

# Example usage
ht = HashTable(10)
ht.insert("apple", 100)
ht.insert("banana", 200)
print(ht.get("apple"))  # Output: 100
In this example, the hash function maps the key (like "apple") to a table index, where the associated value is stored.
Collision resolution
Collision Resolution in Indexing in Data Structures
In indexing, collisions occur when two different keys hash to the same index in a hash table. This is problematic because it causes ambiguity in where to store or retrieve the data. To manage collisions, various resolution strategies are used. These strategies ensure that data can still be accessed efficiently even when hash values collide.
Here are the common collision resolution techniques:
1. Chaining
Concept: Each slot in the hash table holds a linked list (or another dynamic data structure) that stores all elements that hash to the same index.
How it works: When a collision occurs, the new element is simply added to the list of elements stored at the corresponding index.
Advantages:
Simple to implement.
Can handle a large number of collisions without significantly degrading performance.
Disadvantages:
Extra memory is required for the linked lists.
Performance may degrade if the table becomes too full, leading to long chains.
Example:
Hash Table Index:
0 -> [key1, key2]
1 -> [key3]
2 -> []
2. Open Addressing
Concept: In open addressing, when a collision occurs, the algorithm looks for another open slot in the table using a probing technique. The key-value pair is placed in the first available slot.
Types of Probing:
Linear Probing: If a collision occurs at index i, the algorithm checks i+1, i+2, and so on, until an open slot is found.
Quadratic Probing: The interval between slots increases quadratically, i.e., it checks i+1^2, i+2^2, i+3^2 (that is, i+1, i+4, i+9), and so on.
Double Hashing: A second hash function is used to determine the next slot if a collision occurs.
Advantages:
Avoids extra memory for linked lists (compared to chaining).
Works well when the table is not too full.
Disadvantages:
Performance can degrade as the table fills up.
May require rehashing when the table reaches a certain load factor (i.e., when the number of elements becomes too large relative to the table size).
Example (Linear Probing):
Hash Table Index:
0 -> key1
1 -> key2 (collision, move to next available)
2 -> key3
3 -> [empty]
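Linear probing can be sketched as follows. This minimal version omits deletion and resizing; slot contents are (key, value) tuples or None.

```python
# Minimal linear-probing sketch: on a collision, scan forward (wrapping
# around) until an empty slot or the matching key is found.

def lp_insert(table, key, value):
    m = len(table)
    i = hash(key) % m
    for _ in range(m):
        if table[i] is None or table[i][0] == key:
            table[i] = (key, value)   # empty slot or same key: store here
            return True
        i = (i + 1) % m               # collision: probe the next slot
    return False                      # table is completely full

def lp_get(table, key):
    m = len(table)
    i = hash(key) % m
    for _ in range(m):
        if table[i] is None:
            return None               # an empty slot ends the probe chain
        if table[i][0] == key:
            return table[i][1]
        i = (i + 1) % m
    return None

table = [None] * 8
lp_insert(table, "key1", 1)
lp_insert(table, "key2", 2)
print(lp_get(table, "key1"))  # 1
```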
3. Rehashing
Concept: When the hash table becomes too full (i.e., when the load factor exceeds a certain threshold), the table is resized, and all elements are rehashed to new indices in the larger table.
Advantages:
Helps maintain performance as the table grows.
Disadvantages:
Rehashing is computationally expensive because every element in the table must be rehashed and moved to a new location.
Can cause temporary performance degradation during resizing.
When to use: Typically combined with open addressing or chaining to keep the hash table from becoming too full.
4. Cuckoo Hashing
Concept: This method uses two hash functions and two hash tables. When a collision occurs in one table, the existing element is "kicked out" and inserted into the other table using the second hash function. This process continues recursively until an element can be inserted into an empty slot.
Advantages:
Guarantees constant-time lookups.
Disadvantages:
More complex to implement.
May require frequent rehashing, especially if the table is too full.
Summary of Key Points:
Chaining: Uses linked lists to store colliding elements. Simple but requires extra memory.
Open Addressing: Finds another open slot in the table. Memory efficient but may degrade performance as the table fills up.
Rehashing: Dynamically resizes the hash table to reduce collision frequency.
Cuckoo Hashing: Uses multiple hash functions and tables to resolve collisions efficiently with guarantees for lookup time.
The choice of collision resolution method depends on the specific use case and performance requirements, such as memory availability, speed of lookups, and the expected load factor of the hash table.
Rehashing
Rehashing in Indexing in Data Structures
Rehashing is a process used in hash tables to handle the situation when the table becomes too full or when its performance degrades due to a high number of collisions. It involves creating a new, larger hash table and re-inserting all the existing elements into it, using the same or new hash functions. This process helps maintain efficient operations like insertion, deletion, and search by reducing the likelihood of collisions.
When Does Rehashing Occur?
Rehashing typically occurs when the load factor of the hash table exceeds a certain threshold. The load factor is defined as the ratio of the number of elements (N) to the size of the table (M):
Load Factor = N / M
When the load factor becomes too high, it indicates that the hash table is too full, and collisions will become more frequent, leading to degraded performance. Common thresholds for triggering rehashing are 0.7 or 0.8 (i.e., when 70% or 80% of the table is full).
Steps in the Rehashing Process
Create a New Larger Table:
A new hash table is created with a larger size. Often, the size of the new table is doubled or increased by a factor of 2.
This increase in size helps to maintain an optimal load factor.
Rehash All Existing Elements:
Each existing element in the old table must be rehashed to the new table.
The new table may have a different size, so the hash function may produce different indices for the elements.
Insert Elements into New Table:
After rehashing, all elements from the old table are re-inserted into the new table. This re-insertion typically involves using the same hash function (or a new one) to determine where to place each element.
Replace Old Table with New Table:
Once all elements have been rehashed and placed into the new table, the new table becomes the active table, and the old table is discarded.
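The four steps above can be sketched for a chained hash table. Doubling the size is a typical growth policy, not a requirement, and the function name is an illustrative assumption.

```python
# Sketch of rehashing: build a larger table and re-insert every pair.

def rehash(old_table):
    new_size = len(old_table) * 2                 # step 1: larger table
    new_table = [[] for _ in range(new_size)]
    for bucket in old_table:                      # steps 2-3: rehash and
        for key, value in bucket:                 # re-insert every pair
            new_table[hash(key) % new_size].append((key, value))
    return new_table                              # step 4: replaces old table

old = [[("a", 1)], [("b", 2), ("c", 3)], [], [], []]
new = rehash(old)
print(len(new))                   # 10
print(sum(len(b) for b in new))   # 3: every element survives the move
```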
Why is Rehashing Important?
Improves Efficiency:
As the hash table grows, rehashing helps maintain efficient lookup, insertion, and deletion operations by reducing collisions.
A larger table has more available slots, which decreases the probability of collisions.
Maintains Constant Time Operations:
Rehashing ensures that operations like search, insert, and delete remain efficient and approach constant time, O(1), on average. Without rehashing, as the table fills up, these operations may degrade to O(n) due to the increased number of collisions.
Prevents Clustering:
If the table becomes too full, the likelihood of clustering (where multiple keys hash to the same or adjacent locations) increases. Rehashing helps by distributing elements more evenly in a larger table, reducing clustering.
Example of Rehashing
Original Table (Size = 5, Load Factor = 3/5 = 0.6):
Table[0] = key1
Table[1] = key2
Table[2] = key3
Table[3] = empty
Table[4] = empty
2. After Inserting key4 (New Load Factor = 4/5 = 0.8, exceeds threshold):
3. Rehashing is triggered, and the table is resized to a new size (e.g., size = 10).
New Table (Size = 10):
Rehash all keys (key1, key2, key3, and key4) and insert them into the new table using the hash function.
New Table[0] = key1
New Table[3] = key2
New Table[5] = key3
New Table[7] = key4
New Table[1-2, 4, 6, 8-9] = empty
Pros and Cons of Rehashing
Advantages:
Improves Performance: Rehashing keeps the hash table from becoming too full, ensuring that insertion, deletion, and lookup operations remain fast and efficient.
Scalable: Allows the hash table to grow dynamically as needed, supporting a growing number of elements without a dramatic decrease in performance.
Disadvantages:
Computationally Expensive: Rehashing involves recalculating hash values and re-inserting all elements into a new table, which can be an expensive operation, especially when the table is large.
Temporary Performance Degradation: A full rehash temporarily stalls the table, because every element must be rehashed and reinserted into the new table before normal operations can resume.
Optimizations
Incremental Rehashing: Rather than rehashing the entire table at once, incremental rehashing can be used, where elements are rehashed in small batches, spreading the cost over time and reducing the impact on performance during inserts.
Resizing Factors: Instead of doubling the table size, other resizing strategies, such as increasing by 1.5 times, can be employed to balance memory consumption and rehashing cost.
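The incremental strategy described above can be sketched as follows. This is a simplified model, not a production design: the batch size of 2 and the policy of migrating one batch per insert are illustrative assumptions, and lookups must consult both tables while a migration is in flight.

```python
class IncrementalHashTable:
    """Sketch of incremental rehashing: when the load factor is exceeded,
    a larger table is allocated and old entries migrate a few slots at a
    time, one small batch per insert, instead of all at once."""

    BATCH = 2  # slots migrated per insert (illustrative choice)

    def __init__(self, size=4, threshold=0.7):
        self.size = size
        self.threshold = threshold
        self.count = 0
        self.slots = [[] for _ in range(size)]
        self.new_slots = None        # present only while migrating
        self.migrate_pos = 0

    def insert(self, key):
        if self.new_slots is not None:
            # Mid-migration: new keys go straight into the new table.
            self.new_slots[hash(key) % len(self.new_slots)].append(key)
            self._migrate_batch()
        else:
            self.slots[hash(key) % self.size].append(key)
            self.count += 1
            if self.count / self.size > self.threshold:
                self.new_slots = [[] for _ in range(self.size * 2)]
                self.migrate_pos = 0

    def _migrate_batch(self):
        moved = 0
        while self.migrate_pos < self.size and moved < self.BATCH:
            for key in self.slots[self.migrate_pos]:
                self.new_slots[hash(key) % len(self.new_slots)].append(key)
                moved += 1
            self.slots[self.migrate_pos] = []
            self.migrate_pos += 1
        if self.migrate_pos == self.size:    # migration finished
            self.slots = self.new_slots
            self.size = len(self.slots)
            self.new_slots = None
            self.count = sum(len(b) for b in self.slots)

    def contains(self, key):
        if self.new_slots is not None:
            if key in self.new_slots[hash(key) % len(self.new_slots)]:
                return True
        return key in self.slots[hash(key) % self.size]
```

Spreading the migration over many inserts bounds the worst-case cost of any single insert, at the price of briefly holding two tables in memory.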
Extensible hashing
Extensible Hashing is a dynamic hashing technique used in database indexing and hash tables to efficiently handle growing data sets while minimizing the cost of collisions and rehashing. It is particularly useful in situations where the data size is unknown or changes frequently. This technique is designed to handle a dynamic database or hash table where entries are continually inserted or removed, and the table needs to grow without excessive rehashing or performance degradation.
The main advantage of extensible hashing is that it allows for dynamic resizing without having to move all entries at once, unlike traditional static hashing, where growing the table forces a full rehash of every entry.
Key Concepts of Extensible Hashing
Buckets:
Instead of a single array to store all the hash table entries, extensible hashing uses a set of "buckets" to store records. Each bucket can hold multiple records, and each bucket is indexed by a hash value.
Global Directory:
Extensible hashing employs a global directory that points to the actual buckets. This directory can grow or shrink dynamically depending on the number of entries in the table.
The directory stores pointers to the buckets, and the number of bits used to index into the directory can change as the system grows.
Bucket Splitting:
When a bucket becomes full, it is split into two new buckets. This split may also require the global directory to expand (i.e., the number of bits used to index into the directory increases).
When a bucket overflows, the table does not rehash all existing entries; it only splits the full bucket and redistributes that bucket's entries between the two newly created buckets.
Hashing with Prefix:
The hash function used in extensible hashing is applied to a key, but only a certain number of bits from the hash value (prefix) are used to decide which bucket the key will go into.
Initially, the directory might use a small number of bits, and as the directory grows, more bits are used to create more buckets.
Dynamic Growth:
The directory is dynamic and grows as needed. If a bucket becomes full and needs splitting, the directory may double in size, thereby requiring additional bits to be used for hashing.
The number of bits used to index the global directory can increase (and decrease) over time.
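The "prefix" idea above amounts to masking off all but k bits of the hash value. A one-line sketch (using Python's built-in `hash()` as a stand-in for the table's hash function, an assumption for illustration):

```python
def bucket_index(key: str, k: int) -> int:
    """Use the low k bits of the hash value to index the global
    directory of an extensible hash table. Python's built-in hash()
    stands in for the table's real hash function."""
    return hash(key) & ((1 << k) - 1)
```

A useful property: when the directory grows from k to k+1 bits, the new index of a key agrees with the old one in its low k bits, which is exactly why only the split bucket's entries need to move.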
How Extensible Hashing Works
Initial Setup:
Suppose we start with a hash table that uses only a small number of bits from the hash value to index into the global directory (e.g., 1 or 2 bits).
Initially, each entry's hash value is processed using the hash function, and the lower k bits are used to select a bucket.
Insertion:
When a new key is inserted, the hash value of the key is computed, and the lower k bits of the hash are used to select the appropriate bucket.
If the selected bucket has space, the key is inserted into the bucket.
If the selected bucket is full, the bucket is split. If this results in the need to expand the global directory, more bits are used for hashing, and the directory is expanded by doubling its size.
Bucket Splitting:
When a bucket overflows, it is split into two new buckets. The directory will be modified accordingly to ensure the correct entries are redistributed into the two new buckets.
After a split, some of the hash table entries may need to be redistributed based on the new directory size and the new bits used for indexing.
Directory Expansion:
If splitting a bucket causes a situation where the directory cannot address all the buckets (e.g., the number of bits in the directory is insufficient), the directory size is doubled.
New bits are added to the directory, and entries in the buckets are re-evaluated and redistributed based on the new number of bits used for addressing.
Deletion:
When a record is deleted, the bucket may become underutilized. If the number of entries in a bucket decreases significantly, the bucket may be merged with a neighboring bucket, reducing the size of the directory and the number of buckets.
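The insertion, splitting, and directory-expansion steps above can be sketched as a small Python class. This is a simplified model under stated assumptions: bucket capacity 2, Python's built-in `hash()` as the hash function, and no deletion/merging; each bucket tracks a "local depth" (how many bits it actually distinguishes) so the directory only doubles when a bucket at full depth splits.

```python
class Bucket:
    def __init__(self, depth, capacity=2):
        self.depth = depth          # local depth: bits this bucket uses
        self.capacity = capacity
        self.keys = []

class ExtendibleHashTable:
    """Sketch of extensible hashing with a global directory indexed by
    the low global_depth bits of the hash value."""

    def __init__(self, bucket_capacity=2):
        self.global_depth = 1
        self.bucket_capacity = bucket_capacity
        self.directory = [Bucket(1, bucket_capacity),
                          Bucket(1, bucket_capacity)]

    def _dir_index(self, key):
        return hash(key) & ((1 << self.global_depth) - 1)

    def insert(self, key):
        bucket = self.directory[self._dir_index(key)]
        if len(bucket.keys) < bucket.capacity:
            bucket.keys.append(key)
            return
        # Bucket is full. If it already uses every directory bit,
        # double the directory first (entries alias the same buckets).
        if bucket.depth == self.global_depth:
            self.directory = self.directory * 2
            self.global_depth += 1
        self._split(bucket)
        self.insert(key)            # retry; may split again if unlucky

    def _split(self, bucket):
        bucket.depth += 1
        sibling = Bucket(bucket.depth, bucket.capacity)
        # Directory entries pointing at this bucket differ in bit
        # (depth-1); redirect the half with that bit set to the sibling.
        high_bit = 1 << (bucket.depth - 1)
        for i, b in enumerate(self.directory):
            if b is bucket and (i & high_bit):
                self.directory[i] = sibling
        # Redistribute only the split bucket's keys between the pair.
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            self.directory[self._dir_index(k)].keys.append(k)

    def contains(self, key):
        return key in self.directory[self._dir_index(key)].keys
```

Note that a split touches only the overflowing bucket's keys and the directory pointers; every other bucket, and every other key, is left in place.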
Advantages of Extensible Hashing
Efficient Space Usage:
Unlike static hashing, which requires resizing the entire table when it fills up, extensible hashing splits only the relevant buckets and expands the directory dynamically. This minimizes memory wastage.
Flexible and Dynamic:
The system can grow and shrink dynamically as needed. It adapts to the size of the data set without significant overhead or performance degradation during insertions or deletions.
Avoids Rehashing All Entries:
In traditional hashing, when a table becomes full, a rehash of the entire table is necessary. In extensible hashing, only the overflowing bucket is split, and existing entries are moved minimally.
Lower Risk of Collisions:
Since the directory size and the number of bits used for indexing grow dynamically, extensible hashing maintains a balanced distribution of entries and avoids a high number of collisions.
Good Performance in Growing Data Sets:
Extensible hashing is particularly well-suited for dynamic environments where the data set size can grow or shrink over time. The system ensures that the table remains efficient even as data grows.
Disadvantages of Extensible Hashing
Directory Overhead:
Although extensible hashing is more space-efficient than rehashing, it requires maintaining a dynamic directory. As the directory grows, it may require additional memory for storing the pointers to the buckets.
Complex Implementation:
The implementation of extensible hashing is more complex than simpler static hashing techniques due to the dynamic nature of bucket splits, directory resizing, and handling collisions.
Directory Growth:
As the data set grows, the global directory may need to expand significantly, which can cause a temporary increase in memory usage and lookup time during expansion.
Example of Extensible Hashing
Let’s assume we are using a hash function that generates a 4-bit hash value, and that each bucket holds at most 2 entries. The global directory initially uses only 1 bit for indexing, which results in 2 buckets.
Initial State:
Global directory: 1 bit → {0, 1}
Entry 0 → Bucket A: [key1, key2]
Entry 1 → Bucket B: [key3]
After Inserting key4 (whose hash has a low bit of 0):
Bucket A is full, so it splits; because it was already using every directory bit, the directory doubles.
Global directory: 2 bits → {00, 01, 10, 11}
Entry 00 → Bucket A: [key1]
Entry 10 → new Bucket C: [key2, key4] (assuming key2 and key4 have 1 as their second hash bit)
Entries 01 and 11 → Bucket B: [key3] (unchanged; only the overflowing bucket's entries are redistributed)
Further Insertion or Expansion:
As more keys are inserted and buckets split, the directory expands by adding more bits to handle larger numbers of buckets.