
Hashing

Hashing involves computing the address of a data
item by computing a function on the search key
value.
A hash function (randomizing function) h is a
function from the set of all search key values K to
the set of all bucket addresses B.
A bucket is a unit of storage that can store one or
more records. A bucket is typically a disk block,
but could be chosen to be smaller or larger than a
disk block.

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe Slide 13- 1


To insert a record with search key Ki, we compute h(Ki), which
gives the address of the bucket for that record. Assume for now
that there is space in the bucket to store the record. Then, the
record is stored in that bucket.
To perform a lookup on a search-key value Ki, we simply
compute h(Ki), then search the bucket with that address.
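
The insert and lookup steps above can be sketched in a few lines of Python. This is only an illustration: the bucket count, the toy hash function h, and the sample keys are assumptions made for the example, and bucket capacity is ignored (as in the text, we assume there is space).

```python
NUM_BUCKETS = 8  # assumed number of buckets, chosen only for illustration


def h(key: str) -> int:
    """Toy hash function: sum of character codes modulo the bucket count."""
    return sum(ord(c) for c in key) % NUM_BUCKETS


# Each bucket holds (key, record) pairs; capacity limits are ignored here.
buckets = [[] for _ in range(NUM_BUCKETS)]


def insert(key: str, record: str) -> None:
    """Store the record in the bucket whose address is h(key)."""
    buckets[h(key)].append((key, record))


def lookup(key: str) -> list:
    """Search only the bucket at address h(key)."""
    return [rec for k, rec in buckets[h(key)] if k == key]


# Hypothetical sample keys and records.
insert("K1", "record one")
insert("K2", "record two")
```

Calling lookup("K2") scans a single bucket rather than the whole file, which is the point of computing the address from the key.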



Hash index
1. A hash index organizes the search keys with their
associated pointers into a hash file structure.
2. We apply a hash function on a search key to identify a
bucket, and store the key and its associated pointers
in the bucket (or in overflow buckets).
3. Strictly speaking, hash indices are always secondary
indices: if the file itself is organized by hashing, a
separate primary hash index on the same search key is
unnecessary.



Hash functions
1. The worst hash function maps all keys to the
same bucket.
2. The best hash function maps all keys to distinct
addresses.
3. Ideally, the distribution of keys to addresses is
uniform (the hash function assigns each bucket the
same number of search-key values from the set of
all possible search-key values) and random (in the
average case, each bucket will have nearly the
same number of values assigned to it, regardless of
the actual distribution of search-key values).



Example for hash functions
Assume that we decide to have 26 buckets, and
we define a hash function that maps names
beginning with the ith letter of the alphabet to the
ith bucket. This hash function does not provide a
uniform distribution, since we expect more names
to begin with such letters as B and R than Q and
X, for example.
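
A small Python sketch makes the skew visible. The list of names is hypothetical sample data invented for the illustration, and buckets are numbered from 0 here rather than from 1.

```python
from collections import Counter


def first_letter_hash(name: str) -> int:
    """Map a name beginning with the i-th letter to bucket i (0-based, 26 buckets)."""
    return ord(name[0].upper()) - ord("A")


# Hypothetical sample of names, not real data.
names = ["Brown", "Baker", "Bell", "Brooks", "Roberts", "Reed", "Rao", "Quinn", "Smith"]

# Number of names landing in each bucket.
counts = Counter(first_letter_hash(n) for n in names)
```

With this sample, the buckets for B and R receive several names each while the bucket for X stays empty, so lookups in the crowded buckets do more work.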



Collision and collision resolution
A collision occurs when the hash field value of a record
that is being inserted hashes to an address that already
contains a different record.
Collision resolution is the process of finding another
position for the new record, because its hash address is
occupied.
Methods for collision resolution are:
1. Open addressing: Proceeding from the occupied position
specified by the hash address, the program checks the
subsequent positions in order until an unused (empty)
position is found.



2. Chaining: For this method, various overflow locations are
kept, usually by extending the array with a number of
overflow positions. In addition, a pointer field is added to
each record location. A collision is resolved by placing the
new record in an unused overflow location and setting the
pointer of the occupied hash address location to the address
of that overflow location.
3. Multiple hashing: The program applies a second hash
function if the first results in a collision. If another collision
results, the program uses open addressing or applies a third
hash function and then uses open addressing if necessary.
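
The first two methods can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the table size M, the hash function key mod M, the Slot class, and the choice to link a new overflow record at the end of the existing chain are all assumptions made for the example. Multiple hashing is omitted.

```python
M = 7  # assumed number of primary hash positions


def h(key: int) -> int:
    """Toy hash function used by both sketches."""
    return key % M


def insert_open(table: list, key: int) -> int:
    """Open addressing: probe subsequent positions in order until one is empty."""
    for step in range(len(table)):
        pos = (h(key) + step) % len(table)  # wrap around at the end of the array
        if table[pos] is None:
            table[pos] = key
            return pos
    raise RuntimeError("no empty position left")


class Slot:
    """A record location with a pointer field, as used for chaining."""

    def __init__(self):
        self.key = None
        self.next = None  # index of the next location in the chain, if any


def insert_chained(slots: list, key: int) -> int:
    """Chaining: positions M and above form the overflow area."""
    pos = h(key)
    if slots[pos].key is None:
        slots[pos].key = key
        return pos
    # Collision: take an unused overflow location.
    free = next((i for i in range(M, len(slots)) if slots[i].key is None), None)
    if free is None:
        raise RuntimeError("overflow area is full")
    slots[free].key = key
    # Follow the pointers to the end of the chain and link the new location.
    while slots[pos].next is not None:
        pos = slots[pos].next
    slots[pos].next = free
    return free


table = [None] * M
open_positions = [insert_open(table, k) for k in (10, 17, 24)]

slots = [Slot() for _ in range(M + 3)]  # 3 overflow positions, an arbitrary choice
chain_positions = [insert_chained(slots, k) for k in (10, 17, 24)]
```

The keys 10, 17 and 24 all hash to position 3. Open addressing spreads them over neighbouring positions, while chaining keeps position 3 as the head of a linked list running through the overflow area.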



Overflow chaining in hash structure




Hashed Files
To reduce overflow records, a hash file is typically
kept 70-80% full.
The hash function h should distribute the records
uniformly among the buckets. Otherwise, search time
will increase because many overflow records will
exist.
Main disadvantages of static external hashing:
1. The fixed number of buckets M is a problem if the
number of records in the file grows or shrinks.
2. Ordered access on the hash key is quite inefficient
(requires sorting the records).
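
As a rough illustration of the 70-80% guideline, the number of buckets M can be estimated from the expected number of records. The function name, its parameters, and the figures in the usage note are assumptions made for this sketch.

```python
import math


def buckets_needed(num_records: int, records_per_bucket: int, fill: float = 0.75) -> int:
    """Estimate M so that the file is about `fill` full (0 < fill <= 1)."""
    return math.ceil(num_records / (records_per_bucket * fill))
```

For instance, 30,000 records at 10 records per bucket need 3,000 buckets if the file is packed completely, but about 4,000 buckets at a 75% fill target. Because M is fixed in static hashing, this estimate has to be made before the file is created.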



Hashed Files - Overflow handling



Dynamic hashing
Dynamic hashing techniques adapt hashing to allow the
dynamic growth and shrinking of the number of file records.
They include extendible hashing and linear hashing.
Extendible hashing uses the binary representation of the
hash value h(K) in order to access a directory.
1) Extendible hashing
a) Extendible hashing splits and coalesces buckets as database
size changes.
b) This imposes some performance overhead, but space efficiency
is maintained.
c) As reorganization is on one bucket at a time, overhead is
acceptably low.



Dynamic and Extendible Hashing (contd.)
The directories can be stored on disk, and they expand or
shrink dynamically.
Directory entries point to the disk blocks that contain the
stored records.
An insertion in a disk block that is full causes the block to
split into two blocks, and the records are redistributed
between the two blocks.
The directory is updated appropriately.
Dynamic and extendible hashing do not require an
overflow area.
Linear hashing does require an overflow area but does
not use a directory.
Blocks are split in linear order as the file expands.
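
The linear hashing behaviour described in the last two points can be sketched as below. This is a simplified in-memory illustration under stated assumptions: keys are non-negative integers, the hash functions are key mod M and key mod 2M, a split is triggered whenever any bucket exceeds its capacity, and records beyond the capacity of a bucket stand in for its overflow chain. The class and attribute names are invented for the example.

```python
class LinearHashFile:
    """Sketch of linear hashing: no directory, buckets split in order 0, 1, 2, ..."""

    def __init__(self, initial_buckets: int = 4, capacity: int = 2):
        self.m = initial_buckets   # number of buckets at the start of the current round
        self.capacity = capacity   # records a bucket holds before it overflows
        self.next_split = 0        # next bucket to split, in linear order
        self.buckets = [[] for _ in range(initial_buckets)]

    def _address(self, key: int) -> int:
        addr = key % self.m
        if addr < self.next_split:         # that bucket was already split this round
            addr = key % (2 * self.m)
        return addr

    def insert(self, key: int) -> None:
        bucket = self.buckets[self._address(key)]
        bucket.append(key)                 # entries past capacity model overflow records
        if len(bucket) > self.capacity:
            self._split()

    def _split(self) -> None:
        """Split bucket next_split, which is not necessarily the one that overflowed."""
        old = self.buckets[self.next_split]
        stay = [k for k in old if k % (2 * self.m) == self.next_split]
        move = [k for k in old if k % (2 * self.m) != self.next_split]
        self.buckets[self.next_split] = stay
        self.buckets.append(move)          # new bucket gets index next_split + m
        self.next_split += 1
        if self.next_split == self.m:      # every bucket of this round has been split
            self.m *= 2
            self.next_split = 0

    def lookup(self, key: int) -> bool:
        return key in self.buckets[self._address(key)]
```

Inserting 2, 6 and 10 into a fresh file with four buckets overflows bucket 2, yet the bucket that gets split is bucket 0. The overflowing bucket keeps its extra record until the split pointer reaches it, which is why an overflow area is still needed.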



Extendible Hashing



Extendible hashing
In extendible hashing, a type of directory (an array of 2^d bucket
addresses) is maintained, where d is called the global depth of the
directory.
The integer value corresponding to the first (high-order) d bits of a
hash value is used as an index to the array to determine a directory
entry, and the address in that entry determines the bucket in which
the corresponding records are stored.
A local depth d' (stored with each bucket) specifies the number
of bits on which the bucket contents are based.
The value of d can be increased or decreased by one at a time, thus
doubling or halving the number of entries in the directory array.
Doubling is needed if a bucket whose local depth d' is equal to the
global depth d overflows.
Halving occurs if d > d' for all the buckets after some deletions occur.
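
The directory index computation can be illustrated with a short Python sketch. The 8-bit hash width (HASH_BITS) and the function name are assumptions made for the example.

```python
HASH_BITS = 8  # assumed width of a hash value, for illustration only


def directory_index(hash_value: int, global_depth: int) -> int:
    """Integer value of the first (high-order) global_depth bits of the hash value."""
    return hash_value >> (HASH_BITS - global_depth)


def directory_size(global_depth: int) -> int:
    """The directory is an array of 2^d bucket addresses."""
    return 2 ** global_depth
```

For the hash value 01101100, a global depth of 2 selects directory entry 01 and a global depth of 3 selects entry 011. Increasing d by one doubles the directory from 4 to 8 entries.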



Example: bucket splitting
To illustrate bucket splitting, suppose that a newly inserted
record causes overflow in the bucket whose hash values
start with 01 (the third bucket). The records will be
distributed between two buckets: the first contains all
records whose hash values start with 010, and the second
all those whose hash values start with 011. Now the two
directory locations for 010 and 011 point to the two new
distinct buckets. Before the split, they pointed to the same
bucket. The local depth d' of the two new buckets is 3,
which is one more than the local depth of the old bucket.



Example: bucket overflow
If a bucket that overflows and is split used to have a local
depth d' equal to the global depth d of the directory, then
the size of the directory must now be doubled so that we
can use an extra bit to distinguish the two new buckets.
For example, if the bucket for records whose hash values
start with 111 overflows, the two new buckets need a
directory with global depth d = 4, because the two buckets
are now labeled 1110 and 1111, and hence their local depths
are both 4. The directory size is hence doubled, and each of
the other original locations in the directory is also split into
two locations, both of which have the same pointer value as
did the original location.
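
Bucket splitting and directory doubling can be put together in a compact Python sketch. It is an illustration under explicit assumptions: hash values are 8 bits wide, the toy hash function is the key modulo 256, each bucket holds two keys, and the directory starts with global depth 0. The names ExtendibleHash, Bucket, HASH_BITS and BUCKET_CAPACITY are invented for the example. Deletion and directory halving are not shown.

```python
HASH_BITS = 8        # assumed hash width
BUCKET_CAPACITY = 2  # assumed number of keys per bucket


def hash_value(key: int) -> int:
    """Toy hash function: the key reduced to HASH_BITS bits."""
    return key % (1 << HASH_BITS)


class Bucket:
    def __init__(self, local_depth: int):
        self.local_depth = local_depth
        self.keys = []


class ExtendibleHash:
    def __init__(self):
        self.global_depth = 0
        self.directory = [Bucket(0)]  # 2^0 = 1 entry

    def _index(self, key: int) -> int:
        # First (high-order) global_depth bits of the hash value.
        return hash_value(key) >> (HASH_BITS - self.global_depth)

    def lookup(self, key: int) -> bool:
        return key in self.directory[self._index(key)].keys

    def insert(self, key: int) -> None:
        bucket = self.directory[self._index(key)]
        while len(bucket.keys) >= BUCKET_CAPACITY:  # full: split, then retry
            self._split(bucket)
            bucket = self.directory[self._index(key)]
        bucket.keys.append(key)

    def _split(self, bucket: Bucket) -> None:
        if bucket.local_depth == HASH_BITS:
            raise RuntimeError("no hash bits left to split on")
        if bucket.local_depth == self.global_depth:
            # Double the directory: each location becomes two with the same pointer.
            self.directory = [b for b in self.directory for _ in range(2)]
            self.global_depth += 1
        depth = bucket.local_depth + 1
        low, high = Bucket(depth), Bucket(depth)
        for k in bucket.keys:  # redistribute on the newly used bit
            bit = (hash_value(k) >> (HASH_BITS - depth)) & 1
            (high if bit else low).keys.append(k)
        for i, b in enumerate(self.directory):  # repoint the affected locations
            if b is bucket:
                bit = (i >> (self.global_depth - depth)) & 1
                self.directory[i] = high if bit else low


eh = ExtendibleHash()
for key in (0b00000001, 0b10000000, 0b11000000, 0b11100000):
    eh.insert(key)
```

After these four inserts the directory has been doubled twice. The two locations for prefixes 00 and 01 still share one bucket of local depth 1, while 10 and 11 point to distinct buckets of local depth 2.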



Advantages & Disadvantages of extendible hashing
Advantages
1. Extendible hashing provides performance that does not
degrade as the file grows.
2. Minimal space overhead: no buckets need be reserved
for future use. The bucket address table contains only one
pointer for each hash value of the current prefix length.
Disadvantages
1. Extra level of indirection in the bucket address table
2. Added complexity



Comparison of Ordered Indexing and Hashing
1. Cost of periodic re-organization
2. Relative frequency of insertions and deletions
3. Is it desirable to optimize average access time at the expense of
worst-case access time?
4. Expected type of queries:
   Hashing is generally better at retrieving records having a specified
   value of the key.
   If range queries are common, ordered indices are to be preferred.
In practice:
   PostgreSQL supports hash indices, but discourages use due to
   poor performance.
   Oracle supports static hash organization, but not hash indices.
   SQLServer supports only B+-trees.
Creating an index in SQL
To create an index:
    create index <index-name> on <relation-name> (<attribute-list>)
E.g.: create index b_index on branch(branch_name)
Use create unique index to indirectly specify and enforce the
condition that the search key is a candidate key.
This is not really required if the SQL unique integrity constraint
is supported.
To drop an index:
    drop index <index-name>
Most database systems allow specification of the type of index, and
clustering.
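
These statements can be tried out with Python's built-in sqlite3 module. The sketch below is an illustration only: the columns of branch beyond branch_name, the index name b_unique, and the inserted rows are assumptions made for the example, and other database systems may differ in syntax details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("create table branch (branch_name text, branch_city text, assets integer)")

# Ordinary index on the search key.
conn.execute("create index b_index on branch(branch_name)")

# Unique index: enforces that branch_name behaves as a candidate key.
conn.execute("create unique index b_unique on branch(branch_name)")

conn.execute("insert into branch values ('Downtown', 'Brooklyn', 9000000)")
try:
    # A second row with the same branch_name violates the unique index.
    conn.execute("insert into branch values ('Downtown', 'Queens', 100)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

conn.execute("drop index b_index")

# Names of the indices that still exist after the drop.
remaining = [row[0] for row in conn.execute(
    "select name from sqlite_master where type = 'index'")]
```

The duplicate insert is rejected by the unique index, and after the drop only b_unique remains in the catalog.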



Assertions
An assertion is any condition that the
database must always satisfy.
Domain constraints and referential-integrity
constraints are special forms of assertions.
Each assertion is given a constraint name
and is specified via a condition similar to
the WHERE clause of an SQL query.
create assertion <assertion-name>
check <condition>;



For example, to specify in SQL the constraint that the salary
of an employee must not be greater than the salary of the
manager of the department that the employee works for, we
can write the following assertion:

CREATE ASSERTION SALARY_CONSTRAINT CHECK
( NOT EXISTS ( SELECT * FROM EMPLOYEE E,
EMPLOYEE M, DEPARTMENT D WHERE E.Salary > M.Salary
AND E.Dno = D.Dnumber AND D.Mgr_ssn = M.Ssn ) );

The constraint name SALARY_CONSTRAINT is followed
by the keyword CHECK, which is followed by a
condition in parentheses that must hold true on every
database state for the assertion to be satisfied.
The constraint name can be used later to refer to
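
SQLite, used in the sketch below through Python's sqlite3 module, does not implement CREATE ASSERTION, so the example only evaluates the condition of SALARY_CONSTRAINT as an ordinary query against a tiny database. The table definitions are reduced to the attributes the condition mentions, and the rows are hypothetical sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DEPARTMENT (Dnumber INTEGER PRIMARY KEY, Mgr_ssn TEXT);
    CREATE TABLE EMPLOYEE (Ssn TEXT PRIMARY KEY, Salary INTEGER, Dno INTEGER);
    INSERT INTO DEPARTMENT VALUES (5, '111');
    INSERT INTO EMPLOYEE VALUES ('111', 40000, 5);  -- manager of department 5
    INSERT INTO EMPLOYEE VALUES ('222', 30000, 5);  -- works in department 5
""")

# Same condition as inside NOT EXISTS: employees paid more than their manager.
VIOLATIONS = """
    SELECT E.Ssn FROM EMPLOYEE E, EMPLOYEE M, DEPARTMENT D
    WHERE E.Salary > M.Salary AND E.Dno = D.Dnumber AND D.Mgr_ssn = M.Ssn
"""


def salary_constraint_holds(connection: sqlite3.Connection) -> bool:
    """True when no row violates the condition, i.e. NOT EXISTS is satisfied."""
    return connection.execute(VIOLATIONS).fetchone() is None


ok_before = salary_constraint_holds(conn)
conn.execute("UPDATE EMPLOYEE SET Salary = 50000 WHERE Ssn = '222'")
ok_after = salary_constraint_holds(conn)
```

The condition holds for the initial rows and fails after the update raises the employee's salary above the manager's. A system that enforces assertions would reject that update.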
