
BM307 File Organization

Gazi University
Computer Engineering Department


Index

Sequential File Organization
  Binary Search
  Interpolation Search
  Self-Organizing Sequential Search

Direct File Organization
  Locating Information
  Hashing Functions
  Collision Resolution
  Coalesced Hashing

File Organization

Goal: organizing files efficiently in terms of both space and performance

File Organization        File Access
  sequential               sequential
  indexed sequential       sequential & direct
  direct                   direct (random)

File Access Types

Sequential: accessing multiple records (often an entire file), usually according to a predefined order

Direct (random): locating a single record

Question: How can we have an effective organization?
Answer: by matching the type of organization with the type of intended access

Sequential File Organization

Background
  Fields (e.g., employee name, employee number)
  Records: contain data about individual entities
  Files (e.g., an employee list)
  Primary key: the field(s) that uniquely distinguish a record from all others
  Secondary key: all the remaining fields

Sequential File Organization

A file consists of records of the same format
  Fixed-length records
  Variable-length records

Sequential file organization: the (i+1)st record of a file is stored immediately after the ith record.

Sequential File Organization

Sequential access: moving from one record in the file to the next by incrementing the address of the current record by the record size

Direct access: processing a single record directly, provided we know its subscript (position in the file)
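A minimal sketch (not from the slides) of the two access patterns over a file of fixed-length records; the 400-byte record size matches the worked example later on, and the record layout is assumed:

```python
RECORD_SIZE = 400                    # fixed-length records, as in the later example

def direct_read(f, i):
    """Direct access: compute the byte address from the subscript and jump there."""
    f.seek(i * RECORD_SIZE)
    return f.read(RECORD_SIZE)

def sequential_scan(f):
    """Sequential access: advance one record at a time from the start of the file."""
    f.seek(0)
    while (rec := f.read(RECORD_SIZE)):
        yield rec
```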

Sequential File Organization

Probe: an access to a distinct location

Sequential Search
  In an entire file of N records:
    N/2 probes are needed on average for a successful retrieval
    The entire file must be probed for an unsuccessful retrieval
  Computational complexity: O(N)
  Appropriate when N is small

Performance improvement?
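A minimal in-memory sketch of the sequential search just described, returning the number of probes used:

```python
def sequential_search(keys, sought):
    """Probe each record in storage order; return (position, probes used)."""
    for probes, key in enumerate(keys, start=1):
        if key == sought:
            return probes - 1, probes      # found: position and probe count
    return -1, len(keys)                   # unsuccessful: every record probed
```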

Sorting

Eg. - Sequential Search
  100,000 records, each record 400 bytes, block size 2,400 bytes.
  What is the sequential search time for retrieving 10,000 records?
    Each probe reads one block of data
    (100,000 * 400) / 2,400 = 16,667 blocks
    Reading time for one block: 0.84 ms (IBM 3380)
    Time required per record: (16,667 / 2) * 0.84 ms ≈ 7 sec
    For 10,000 records: 7 sec * 10,000 = 70,000 sec ≈ 19 hours
  A better organization is needed!
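A quick back-of-the-envelope check, in Python, of the numbers above (the 0.84 ms per block is the slide's IBM 3380 figure):

```python
records = 100_000
record_size = 400        # bytes
block_size = 2_400       # bytes
block_read_ms = 0.84     # IBM 3380 figure from the slide

blocks = records * record_size / block_size          # ~16,667 blocks
avg_probe_ms = (blocks / 2) * block_read_ms          # ~7,000 ms per retrieval
total_hours = avg_probe_ms * 10_000 / 1000 / 3600    # ~19.4 hours for 10,000 retrievals
print(round(blocks), round(avg_probe_ms / 1000), round(total_hours, 1))   # 16667 7 19.4
```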

Sequential File Organization

Binary Search
  Requires a sorted file
  Compares the key of the sought record with the key of the middle record of the file
  Half of the remaining file is eliminated in each step
  Computational complexity: O(log2 N)
  Eg.: the key of the sought record is 17

Sequential File Organization

Binary Search (Algorithm)
  [Figure: the binary search algorithm; a sketch is given below]
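The algorithm figure is not reproduced here; the following is a minimal in-memory sketch of the procedure described above, over a sorted list of keys:

```python
def binary_search(keys, sought):
    """Return the position of `sought` in the sorted list `keys`, or -1."""
    lower, upper = 0, len(keys) - 1
    while lower <= upper:
        middle = (lower + upper) // 2          # probe the middle record
        if keys[middle] == sought:
            return middle
        elif keys[middle] < sought:
            lower = middle + 1                 # discard the lower half
        else:
            upper = middle - 1                 # discard the upper half
    return -1                                  # unsuccessful retrieval
```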

Sequential File Organization

Interpolation Search
  Probes an approximate relative position
    Eg.: searching for a name in a telephone book
  Chooses the next position for a comparison based upon the estimated position of the sought key relative to the remainder of the file to be searched:

    NEXT := LOWER + (UPPER - LOWER) * (key[sought] - key[LOWER]) / (key[UPPER] - key[LOWER])

  Worst-case computational complexity: O(N)
  Average-case computational complexity: O(log2 log2 N)
  Its performance improves as the distribution of keys becomes more uniform
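A minimal in-memory sketch of interpolation search over a sorted list of numeric keys, following the NEXT formula above:

```python
def interpolation_search(keys, sought):
    """Search a sorted list of numeric keys by estimating the sought key's
    relative position within the remaining range."""
    lower, upper = 0, len(keys) - 1
    while lower <= upper and keys[lower] <= sought <= keys[upper]:
        if keys[upper] == keys[lower]:                 # avoid division by zero
            break
        nxt = lower + (upper - lower) * (sought - keys[lower]) // (keys[upper] - keys[lower])
        if keys[nxt] == sought:
            return nxt
        elif keys[nxt] < sought:
            lower = nxt + 1
        else:
            upper = nxt - 1
    return lower if lower <= upper and keys[lower] == sought else -1
```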

Binary search should be preferred when the data is stored in primary memory.
  Why? The additional calculations needed for the interpolation search cancel any savings gained from fewer probes.

Interpolation search should be preferred when the data is stored in auxiliary memory.
  Why? An access to auxiliary storage costs an order of magnitude more than the time required for the additional calculations.

Sequential File Organization

Self-Organizing Sequential Search
  Modifies the order of records
  Moves the most frequently retrieved records toward the beginning of the file
  Most popular algorithms:
    Move_to_front
    Transpose
    Count

Sequential File Organization

Move_to_front
  The sought record is moved to the front position of the file
  Potential for big mistakes: a record may be accessed once, moved to the front of the file, and then rarely if ever accessed again!
  A linked implementation is preferable even though it takes more storage
  Appropriate when space is not limited and locality of access is important
  Essentially the same as the LRU (least recently used) paging algorithm used by operating systems
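A minimal sketch of move-to-front on an in-memory Python list (the slide recommends a linked implementation for files; this is only to show the reordering):

```python
def mtf_search(records, key):
    """Linear search; on a hit, move the found record to the front."""
    for i, rec in enumerate(records):
        if rec == key:
            records.insert(0, records.pop(i))   # move to front
            return 0                            # the record now lives at the front
    return -1                                   # not found; order unchanged

records = list("abcdefghijklmnoprqstvwyz")
mtf_search(records, "f")
print("".join(records))   # fabcdeghijklmnoprqstvwyz, the second row of the example below
```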

Eg. - Move_to_front

The records are accessed in the order of the letters of "fileediting" (f, i, l, e, e, d, i, t, i, n, g)

abcdefghijklmnoprqstvwyz
fabcdeghijklmnoprqstvwyz
ifabcdeghjklmnoprqstvwyz
lifabcdeghjkmnoprqstvwyz
elifabcdghjkmnoprqstvwyz
elifabcdghjkmnoprqstvwyz
delifabcghjkmnoprqstvwyz
idelfabcghjkmnoprqstvwyz
tidelfabcghjkmnoprqsvwyz
itdelfabcghjkmnoprqsvwyz
nitdelfabcghjkmoprqsvwyz
gnitdelfabchjkmoprqsvwyz


Sequential File Organization

Transpose
  Interchanges the sought record with its immediate predecessor
  More stable than the Move_to_front algorithm: a record needs to be accessed many times before it reaches the front of the list
  Easily implemented
  Does not need additional space
  Should be used when space is at a premium
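A corresponding sketch of the transpose heuristic, under the same in-memory assumptions:

```python
def transpose_search(records, key):
    """Linear search; on a hit, swap the found record with its predecessor."""
    for i, rec in enumerate(records):
        if rec == key:
            if i > 0:
                records[i - 1], records[i] = records[i], records[i - 1]
            return max(i - 1, 0)
    return -1

records = list("abcdefghijklmnoprqstvwyz")
transpose_search(records, "f")
print("".join(records))   # abcdfeghijklmnoprqstvwyz, as in the example below
```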

Eg. - Transpose

The records are accessed in the order of the letters of "fileediting"

abcdefghijklmnoprqstvwyz
abcdfeghijklmnoprqstvwyz
abcdfegihjklmnoprqstvwyz
abcdefgihjklmnoprqstvwyz
abcedfgihjklmnoprqstvwyz
abcdefgihjklmnoprqstvwyz
abcdefighjklmnoprqstvwyz
abcdefighjklmnoprqtsvwyz
abcdeifghjklmnoprqtsvwyz
abcdeifghjklnmoprqtsvwyz
abcdeigfhjklnmoprqtsvwyz


Sequential File Organization

Count
  Keeps a count of the number of accesses of each record
  The file is always ordered in decreasing order of frequency of access
  Requires extra storage to keep the counts
  Use it only when the counts are needed for another purpose
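A sketch of the count heuristic; keeping the counts in a separate dictionary (rather than with each record) is an illustrative choice:

```python
def count_search(records, counts, key):
    """Linear search; on a hit, bump the record's count and bubble it forward
    so the list stays in decreasing order of access frequency."""
    for i, rec in enumerate(records):
        if rec == key:
            counts[rec] = counts.get(rec, 0) + 1
            j = i
            while j > 0 and counts.get(records[j - 1], 0) < counts[rec]:
                records[j - 1], records[j] = records[j], records[j - 1]
                j -= 1
            return j
    return -1
```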

Direct File Organization

Ideally, we want to go directly to the address where the record is stored
  A key can be a unique address: one probe

[Figure: the key space (000-00-0000 .. 999-99-9999) mapped by a 1:1 correspondence onto an equally large address space]

  More address space than needed
    Eg.: 1 billion addresses for 300 million people

Direct File Organization

Converting information into a unique address

Eg.: an airline reservation system
  Flight numbers run from 1 to 999
  Days of the year are numbered from 1 to 366
  The flight number and the day of the year can be concatenated to determine the location
    Location = flight number || day of the year, address range 001001-999366
      (addresses of the form ???367 - ???999 would not exist)
    Location = day of the year || flight number, address range 001001-366999
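A minimal sketch of the two concatenation schemes; flight 438 on day 25 is a made-up example:

```python
def by_flight_then_day(flight, day):
    """Location = flight number || day of the year (range 001001-999366)."""
    return flight * 1000 + day

def by_day_then_flight(flight, day):
    """Location = day of the year || flight number (range 001001-366999)."""
    return day * 1000 + flight

print(by_flight_then_day(438, 25))   # 438025
print(by_day_then_flight(438, 25))   # 25438 -> 025438 when zero-padded to six digits
```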

Direct File Organization

The key converts to a probable address
  If we remove most of the empty space in the address space, we lose the 1-1 correspondence between keys and addresses
  Hashing functions are used to map the wider range of key values into the narrower range of address values:

    Hash(key) -> probable address

  The initial probable address is called the home address

A hashing function should
  Distribute the keys evenly among the addresses
  Execute efficiently
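The later examples in these slides use division-remainder hashing (Hash(key) = key mod table size); a minimal sketch of the key-to-home-address mapping:

```python
TABLE_SIZE = 11                      # number of available addresses (from the later example)

def hash_home_address(key):
    """Map a numeric key to its home address in the range 0..TABLE_SIZE-1."""
    return key % TABLE_SIZE

print(hash_home_address(27), hash_home_address(16))   # 5 5 -> a collision (synonyms)
```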

Direct File Organization

A collision occurs when two distinct keys map to the same address

[Figure: the key space (0 .. 999-99-9999) mapped onto a much smaller address space (0 .. 1200); two distinct keys can land on the same address]

Hashing is then composed of two aspects:
  The hashing function
  The collision resolution method

Direct File Organization


Hashing Functions


Direct File Organization

Hashing Functions

Squaring
  Taking the square of the key and then substringing or truncating a portion of the result

Radix conversion
  The key is considered to be in a base other than 10 and is then converted into a number in base 10
  Eg.: base 11
    1234 = 1*11^3 + 2*11^2 + 3*11^1 + 4*11^0 = 1331 + 242 + 33 + 4 = 1610
  Substringing or truncation could then be applied to the result
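Minimal sketches of the squaring and radix-conversion functions; which digits to keep after squaring is an assumption for illustration:

```python
def square_hash(key, digits=3):
    """Squaring: square the key and keep a substring of the middle digits."""
    squared = str(key * key)
    middle = len(squared) // 2
    return int(squared[middle - digits // 2 : middle - digits // 2 + digits])

def radix_hash(key, base=11):
    """Radix conversion: reinterpret the decimal digits of the key in `base`."""
    value = 0
    for digit in str(key):
        value = value * base + int(digit)
    return value

print(radix_hash(1234))   # 1610, as in the example above
```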

Direct File Organization

Hashing Functions

Polynomial hashing
  The key is divided by a polynomial and the remainder is used
  (as in cyclic redundancy checking: f(information area) -> cyclic check bytes)

Alphabetic keys
  Alphabetic or alphanumeric key values can be input to a hashing function if the values are interpreted as integers
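A sketch of one way to interpret an alphabetic key as an integer before hashing; the byte-level interpretation and the mod-11 table size are assumptions for illustration (the key "SMITH" is made up):

```python
def alpha_to_int(key):
    """Interpret an alphabetic/alphanumeric key as an integer via its bytes."""
    return int.from_bytes(key.encode("ascii"), byteorder="big")

def hash_alpha(key, table_size=11):
    return alpha_to_int(key) % table_size

print(hash_alpha("SMITH"))
```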

Direct File Organization

Collisions
  For a given set of data, one hashing function may distribute the keys more evenly over the address space than another
  A hashing function that produces a large number of collisions is said to exhibit primary clustering
  It is worth using a slightly more expensive hashing function for data that need to be stored on auxiliary storage
  Another method for reducing collisions is reducing the packing factor:

    Packing factor = (number of records stored) / (total number of storage locations)

Direct File Organization

Collision Resolution
  Collision resolution with links
  Collision resolution without links
    Static positioning of records
    Dynamic positioning of records
  Collision resolution with pseudolinks

Direct File Organization

Collision resolution with links
  If multiple synonyms occur for a particular home address, we form a chain of synonym records
  Disadvantage: extra storage is needed for the links

Collision resolution without links
  We can use implied links by applying a convention, or set of rules, for deciding where to go next
  A simple convention is to look at the next location in memory
  Advantage: NO extra storage is needed
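A minimal sketch of the "look at the next location" convention (commonly called linear probing); the table size of 11 is carried over from the slides' examples:

```python
TABLE_SIZE = 11
EMPTY = None

def insert(table, key):
    """Insert `key` by probing successive locations from its home address."""
    home = key % TABLE_SIZE
    for step in range(TABLE_SIZE):
        slot = (home + step) % TABLE_SIZE
        if table[slot] is EMPTY:
            table[slot] = key
            return slot
    raise RuntimeError("table is full")

def search(table, key):
    """Follow the same implied chain of locations when looking the key up."""
    home = key % TABLE_SIZE
    for step in range(TABLE_SIZE):
        slot = (home + step) % TABLE_SIZE
        if table[slot] == key:
            return slot
        if table[slot] is EMPTY:
            return -1
    return -1

table = [EMPTY] * TABLE_SIZE
insert(table, 27)          # home address 5
insert(table, 16)          # also hashes to 5 -> stored in the next location, 6
print(search(table, 16))   # 6
```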

Direct File Organization

Coalesced Hashing
  Coalescing occurs when we attempt to insert a record whose home address is already occupied by a record from a chain with a different home address
  The two chains, whose records have different home addresses, coalesce (grow together)

[Figure: a small example table in which records X, D, and Y were inserted and their chains coalesced]

Direct File Organization

Coalesced Hashing (Eg.)
  Hash(key) = key mod 11
  Keys inserted in order: 27, 18, 29, 28, 39, 13, 16
  Then 42 & 17 are added
  Average # of probes: 1.8
  [Figure: the resulting table and probe chains]
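A sketch (not from the slides) of standard coalesced hashing with late insertion — the LISCH variant named later — choosing empty slots from the bottom of the table; under that assumption it reproduces the 1.8-probe average above:

```python
TABLE_SIZE = 11
table = [None] * TABLE_SIZE            # each slot holds a key
link = [None] * TABLE_SIZE             # link[i] -> next slot in the probe chain

def lisch_insert(key):
    """Late insertion: append the colliding key at the end of the probe chain,
    storing it in the highest-numbered empty slot."""
    home = key % TABLE_SIZE
    if table[home] is None:
        table[home] = key
        return
    end = home
    while link[end] is not None:       # walk to the end of the (possibly coalesced) chain
        end = link[end]
    slot = TABLE_SIZE - 1              # find an empty slot, scanning upward from the bottom
    while table[slot] is not None:
        slot -= 1
    table[slot] = key
    link[end] = slot

def probes(key):
    """Number of probes needed to retrieve `key`."""
    slot, count = key % TABLE_SIZE, 1
    while table[slot] != key:
        slot, count = link[slot], count + 1
    return count

keys = [27, 18, 29, 28, 39, 13, 16, 42, 17]
for k in keys:
    lisch_insert(k)
print(sum(probes(k) for k in keys) / len(keys))   # ~1.8, as on the slide
```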

Direct File Organization

Coalesced Hashing

Discussion
  Packing factor of the final table = 9/11 (82%)
  One method of reducing coalescing is to reduce the packing factor
  It is advisable to place the most frequently accessed records early in the insertion process
  Deleting a record is complicated
    If coalescing has occurred, a simple deletion procedure is to move a record from later in the probe chain into the position of the deleted record
    [Figure: the final table after deleting 39]

Direct File Organization

Coalesced Hashing

Variants differ in
  Table organization (whether or not a separate overflow area is used)
  The manner of linking a colliding item into a chain
  The manner of choosing unoccupied locations

Table organization
  Table = primary area + overflow area (the cellar)
  Address factor = (primary area) / (total table size)
  Best performance when the address factor is 0.86

Direct File Organization

Coalesced Hashing

Variants
  Late Insertion Standard Coalesced Hashing (LISCH)
    New records are inserted at the end of a probe chain
    No cellar is used
  Late Insertion Coalesced Hashing (LICH)
    Uses a cellar (a separate overflow area)
    Eg.: keys 27, 18, 29, 28, 39, 13, 16, 42, 17; hashing function: key mod 7
    Average # of probes: 1.3 (it was 1.8 for LISCH)
  In general, for a 90 percent packing factor, using a cellar reduces the number of probes by about 6 percent compared with LISCH
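A sketch of the LICH idea under the same assumptions as the earlier coalesced-hashing sketch: keys hash only into the primary area (addresses 0..6), while slots 7..10 act as the cellar. With the nine example keys and cellar slots taken from the bottom, this arrangement averages roughly 1.3 probes, in line with the slide's figure.

```python
PRIMARY = 7                              # primary area: addresses 0..6
TABLE_SIZE = 11                          # slots 7..10 form the cellar
table = [None] * TABLE_SIZE
link = [None] * TABLE_SIZE

def lich_insert(key):
    """Late insertion with a cellar: colliding keys overflow into the cellar first."""
    home = key % PRIMARY                 # only the primary area is addressed directly
    if table[home] is None:
        table[home] = key
        return
    end = home
    while link[end] is not None:
        end = link[end]
    slot = TABLE_SIZE - 1                # empty slots are taken from the bottom,
    while table[slot] is not None:       # so the cellar fills up before the primary area
        slot -= 1
    table[slot] = key
    link[end] = slot
```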

Direct File Organization

Coalesced Hashing

Variants
  Early Insertion Standard Coalesced Hashing (EISCH)
    Inserts a new record into a position on the probe chain immediately after the record stored at its home address
    Eg.: insertion of the record with key 17 according to the EISCH algorithm, with Hash(key) = key mod 11
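A sketch of the EISCH insertion step, using the same table/link layout as the earlier sketches; only the splicing of the new record differs from late insertion:

```python
TABLE_SIZE = 11
table = [None] * TABLE_SIZE
link = [None] * TABLE_SIZE

def eisch_insert(key):
    """Early insertion: the new record is linked in immediately after the
    record stored at its home address, rather than at the end of the chain."""
    home = key % TABLE_SIZE
    if table[home] is None:
        table[home] = key
        return
    slot = TABLE_SIZE - 1                       # pick an empty slot from the bottom
    while table[slot] is not None:
        slot -= 1
    table[slot] = key
    link[home], link[slot] = slot, link[home]   # splice in right after the home address
```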

Direct File Organization

Coalesced Hashing

Variants
  Random Early Insertion Standard Coalesced Hashing (REISCH)
    Chooses a random unoccupied location for the new insertion
    Gives only about a 1% improvement over EISCH
  Random Late Insertion Standard Coalesced Hashing (RLISCH)
  Bidirectional Late Insertion Standard Coalesced Hashing (BLISCH)
    Chooses the overflow location for a colliding insertion by alternating the selection between the top and the bottom of the table
  Bidirectional Early Insertion Standard Coalesced Hashing (BEISCH)

Direct File Organization

Coalesced Hashing

Comparison
  [Figure: comparison of the coalesced hashing variants]
