data item by computing a function on the search key value. A hash function (randomizing function) h is a function from the set of all search-key values K to the set of all bucket addresses B. A bucket denotes a unit of storage that can hold one or more records; a bucket is typically a disk block, but could be chosen to be smaller or larger than a disk block.
Copyright 2007 Ramez Elmasri and Shamkant B. Navathe Slide 13- 1
To insert a record with search key Ki, we compute h(Ki), which gives the address of the bucket for that record. Assume for now that there is space in the bucket to store the record; the record is then stored in that bucket. To perform a lookup on a search-key value Ki, we simply compute h(Ki) and then search the bucket with that address.
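The insert and lookup steps above can be sketched as follows. This is a minimal illustration, not the textbook's implementation: the bucket count, the hash function, and the in-memory list-of-lists standing in for disk blocks are all illustrative choices.

```python
# Sketch of a hashed file: h(K) maps a search-key value to a bucket
# address; insert and lookup both start by computing h(K).

NUM_BUCKETS = 8  # illustrative choice

def h(key):
    """Hash function: maps a search-key value to a bucket address."""
    return hash(key) % NUM_BUCKETS

# Each bucket holds one or more records (here, a Python list per bucket).
buckets = [[] for _ in range(NUM_BUCKETS)]

def insert(record, key):
    # Assume for now that the bucket has space for the record.
    buckets[h(key)].append((key, record))

def lookup(key):
    """Search only the one bucket that h(key) addresses."""
    return [rec for (k, rec) in buckets[h(key)] if k == key]
```

Note that a lookup touches a single bucket rather than scanning the whole file, which is the point of hashing.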
Hash index
1. A hash index organizes the search keys, with their associated pointers, into a hash file structure.
2. We apply a hash function to a search key to identify a bucket, and store the key and its associated pointers in the bucket (or in overflow buckets).
3. Hash indices are always secondary indices.
Hash functions
1. The worst hash function maps all keys to the same bucket.
2. The best hash function maps all keys to distinct addresses.
3. Ideally, the distribution of keys to addresses is uniform (the hash function assigns each bucket the same number of search-key values from the set of all possible search-key values) and random (in the average case, each bucket will have nearly the same number of values assigned to it, regardless of the actual distribution of search-key values).
Example for hash functions
Assume that we decide to have 26 buckets, and we define a hash function that maps names beginning with the ith letter of the alphabet to the ith bucket. This hash function does not provide a uniform distribution, since we expect more names to begin with letters such as B and R than Q and X, for example.
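The 26-bucket first-letter hash above can be sketched directly; the sample names below are made up purely to exhibit the skew toward letters like B and R.

```python
# First-letter hash: names beginning with the i-th letter of the
# alphabet go to the i-th bucket (0 for A, 1 for B, ..., 25 for Z).

def first_letter_bucket(name):
    return ord(name[0].upper()) - ord('A')

# Hypothetical sample names, chosen to show the non-uniform distribution.
names = ["Brown", "Baker", "Bell", "Reed", "Ross", "Quinn"]
counts = [0] * 26
for n in names:
    counts[first_letter_bucket(n)] += 1

# Bucket 1 (B) receives three names while bucket 16 (Q) receives one:
# the distribution of keys to buckets is not uniform.
```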
Collision and collision resolution
A collision occurs when the hash field value of a record being inserted hashes to an address that already contains a different record. Collision resolution is the process of finding a new position for the record, since its hash address is occupied.
Methods for collision resolution are:
1. Open addressing: Proceeding from the occupied position specified by the hash address, the program checks the subsequent positions in order until an unused (empty) position is found.
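Open addressing can be sketched with linear probing, as described above: starting from the occupied hash address, check subsequent positions in order until an empty one is found. The table size and hash function here are illustrative.

```python
# Open addressing by linear probing.

M = 7                          # illustrative table size
table = [None] * M             # None marks an unused (empty) position

def open_address_insert(key, record):
    pos = hash(key) % M
    for step in range(M):            # probe at most M positions
        slot = (pos + step) % M      # wrap around the end of the table
        if table[slot] is None:
            table[slot] = (key, record)
            return slot
    raise RuntimeError("table full")
```

For example, with integer keys 3 and 10 (both hash to position 3 when M = 7), the second insert probes forward and lands in position 4.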
2. Chaining: For this method, various overflow locations are kept, usually by extending the array with a number of overflow positions. In addition, a pointer field is added to each record location. A collision is resolved by placing the new record in an unused overflow location and setting the pointer of the occupied hash address location to the address of that overflow location.
3. Multiple hashing: The program applies a second hash function if the first results in a collision. If another collision results, the program uses open addressing, or applies a third hash function and then uses open addressing if necessary.
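Chaining with an overflow area can be sketched as follows, following the description above: the array is extended with overflow positions, each record location carries a pointer field, and a collision links the new record into the chain. Sizes and the hash function are illustrative.

```python
# Chaining: main area of M slots plus an overflow area; each slot is
# [key, record, next_index] where next_index is the pointer field.

M, OVERFLOW = 5, 5
slots = [None] * (M + OVERFLOW)                # slots M.. are overflow area
free_overflow = list(range(M, M + OVERFLOW))   # unused overflow positions

def chain_insert(key, record):
    home = hash(key) % M
    if slots[home] is None:                    # home position is free
        slots[home] = [key, record, None]
        return
    # Collision: place the record in an unused overflow location and
    # link it at the end of the chain starting at the home position.
    ov = free_overflow.pop(0)
    slots[ov] = [key, record, None]
    cur = home
    while slots[cur][2] is not None:           # walk to the end of the chain
        cur = slots[cur][2]
    slots[cur][2] = ov                         # set pointer to overflow slot

def chain_lookup(key):
    cur = hash(key) % M
    while cur is not None and slots[cur] is not None:
        if slots[cur][0] == key:
            return slots[cur][1]
        cur = slots[cur][2]                    # follow the pointer field
    return None
```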
Overflow chaining in hash structure
Hashed Files
Hashed Files
To reduce overflow records, a hash file is typically kept 70-80% full. The hash function h should distribute the records uniformly among the buckets; otherwise, search time will increase because many overflow records will exist.
Main disadvantages of static external hashing:
1. A fixed number of buckets M is a problem if the number of records in the file grows or shrinks.
2. Ordered access on the hash key is quite inefficient (it requires sorting the records).
Hashed Files - Overflow handling
Dynamic hashing
Dynamic hashing techniques are adapted to allow dynamic growth and shrinking of the number of file records. They include extendible hashing and linear hashing, and use the binary representation of the hash value h(K) to access a directory.
1) Extendible hashing
Extendible hashing splits and coalesces buckets as the database size changes. This imposes some performance overhead, but space efficiency is maintained. Because reorganization works on one bucket at a time, the overhead is acceptably low.
Dynamic And Extendible Hashing (contd.)
The directories can be stored on disk, and they expand or shrink dynamically. Directory entries point to the disk blocks that contain the stored records. An insertion into a disk block that is full causes the block to split into two blocks, and the records are redistributed among the two blocks; the directory is updated appropriately. Dynamic and extendible hashing do not require an overflow area. Linear hashing does require an overflow area but does not use a directory; blocks are split in linear order as the file expands.
Extendible Hashing
Extendible hashing
In extendible hashing, a directory (an array of 2^d bucket addresses) is maintained, where d is called the global depth of the directory. The integer value corresponding to the first (high-order) d bits of a hash value is used as an index into the array to determine a directory entry, and the address in that entry determines the bucket in which the corresponding records are stored. A local depth d', stored with each bucket, specifies the number of bits on which the bucket contents are based.
The value of d can be increased or decreased by one at a time, thus doubling or halving the number of entries in the directory array. Doubling is needed if a bucket whose local depth d' is equal to the global depth d overflows. Halving occurs if d > d' for all the buckets after some deletions occur.
Eg. for bucket splitting
To illustrate bucket splitting, suppose that a newly inserted record causes overflow in the bucket whose hash values start with 01 (the third bucket). The records will be distributed between two buckets: the first contains all records whose hash values start with 010, and the second all those whose hash values start with 011. The two directory locations for 010 and 011 now point to the two new distinct buckets; before the split, they pointed to the same bucket. The local depth d' of the two new buckets is 3, which is one more than the local depth of the old bucket.
Eg. for bucket overflow
If a bucket that overflows and is split used to have a local depth d' equal to the global depth d of the directory, then the size of the directory must be doubled so that an extra bit can be used to distinguish the two new buckets. For example, if the bucket for records whose hash values start with 111 overflows, the two new buckets need a directory with global depth d = 4, because the two buckets are now labeled 1110 and 1111, and hence their local depths are both 4. The directory size is therefore doubled, and each of the other original locations in the directory is also split into two locations, both of which have the same pointer value as the original location.
Advantages & Disadvantages of extendible hashing
Advantages
1. Extendible hashing provides performance that does not degrade as the file grows.
2. Minimal space overhead: no buckets need be reserved for future use, and the bucket address table contains only one pointer for each hash value of the current prefix length.
Disadvantages
1. An extra level of indirection in the bucket address table.
2. Added complexity.
Comparison of Ordered Indexing and Hashing
Issues to consider:
1. Cost of periodic reorganization.
2. Relative frequency of insertions and deletions.
3. Is it desirable to optimize average access time at the expense of worst-case access time?
4. Expected type of queries: hashing is generally better at retrieving records having a specified value of the key; if range queries are common, ordered indices are to be preferred.
In practice: PostgreSQL supports hash indices, but discourages their use due to poor performance; Oracle supports static hash organization, but not hash indices; SQL Server supports only B+-trees.
Creating an index in SQL
To create an index:
create index <index-name> on <relation-name> (<attribute-list>)
E.g.: create index b-index on branch(branch_name)
Use create unique index to indirectly specify and enforce the condition that the search key is a candidate key. This is not really required if the SQL unique integrity constraint is supported.
To drop an index:
drop index <index-name>
Most database systems allow specification of the type of index, and clustering.
Assertions
An assertion is any condition that the database must always satisfy. Domain constraints and referential-integrity constraints are special forms of assertions. Each assertion is given a constraint name and is specified via a condition similar to the WHERE clause of an SQL query:
create assertion <assertion-name> check <condition>;
For example, to specify in SQL the constraint that the salary of an employee must not be greater than the salary of the manager of the department that the employee works for, we can write the following assertion:
CREATE ASSERTION SALARY_CONSTRAINT CHECK
( NOT EXISTS ( SELECT *
               FROM EMPLOYEE E, EMPLOYEE M, DEPARTMENT D
               WHERE E.Salary > M.Salary
                 AND E.Dno = D.Dnumber
                 AND D.Mgr_ssn = M.Ssn ) );
The constraint name SALARY_CONSTRAINT is followed by the keyword CHECK, which is followed by a condition in parentheses that must hold true on every database state for the assertion to be satisfied. The constraint name can be used later to refer to the constraint.