
Organizing Files for Performance
Chapter 6
Jim Skon

File Processing - Organizing file for Performance

MVNC

Organizing Files for Performance

Data Compression
Reclaiming space in files
Fast Searching
Keysorting


Data Compression

Making files smaller


Use less storage, save space
Faster Transmission
Processed faster

Data Compression
encoding information more efficiently
Many techniques exist


Data Compression

Consider fields with a fixed length or a fixed set of values
A binary representation can save space
States - 50 states - 6 bits (one byte)
Zip - 0 to 99999. 17 bits (three bytes)

Called Compact Notation


Redundancy reduction
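
A minimal sketch of compact notation, assuming a record holds a state and a ZIP code. The `STATES` table (shortened here) and the function names are hypothetical, chosen only for illustration. The state is stored as a one-byte table index and the ZIP code as a three-byte integer.

```python
# Hypothetical lookup table; only a few of the 50 states are listed.
STATES = ["AL", "AK", "AZ", "OH", "WY"]


def pack_address(state: str, zip_code: int) -> bytes:
    """Encode a state as a 1-byte index and a ZIP code as a 3-byte integer."""
    if not 0 <= zip_code <= 99999:
        raise ValueError("ZIP code out of range")
    state_index = STATES.index(state)
    return bytes([state_index]) + zip_code.to_bytes(3, "big")


def unpack_address(packed: bytes) -> tuple[str, int]:
    """Decode the 4-byte form back into (state, ZIP code)."""
    return STATES[packed[0]], int.from_bytes(packed[1:4], "big")


packed = pack_address("OH", 43050)
state, zip_code = unpack_address(packed)
```

The packed form takes 4 bytes, against 7 bytes for a two-letter state plus a five-digit ZIP code stored as text.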


Data Compression

Cost of binary representations


File not readable as text
Processing time for conversion
All software must include appropriate, compatible
encoding and decoding routines.
Potential loss of flexibility


Data Compression

Suppressing repeating sequences


Consider a picture

Series of pixels - each a color


Colors represented by 8 bit value
usually come in bunches, e.g.
24 23 22 22 22 22 22 25 25 25 25 25 25 65 65 66 66 66 66

Run length encoding


Represent long runs with a prefix (FF) followed by a count, followed by the color
24 23 FF 05 22 FF 06 25 65 65 FF 04 66

Simple images would be small; busy images would be no bigger.
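
The run-length scheme above can be sketched as follows. This is an illustration only: it assumes that FF never occurs as a pixel value, and the `min_run` threshold of 4 is an assumption (a run must be longer than the three bytes its encoding takes before it saves space).

```python
RUN_MARKER = 0xFF  # assumed never to appear as an ordinary pixel value


def rle_encode(pixels, min_run=4):
    """Replace runs of at least min_run equal pixels with FF, count, value."""
    out = []
    i = 0
    while i < len(pixels):
        value = pixels[i]
        run = 1
        # Measure the run, capping the count at what one byte can hold.
        while i + run < len(pixels) and pixels[i + run] == value and run < 255:
            run += 1
        if run >= min_run:
            out += [RUN_MARKER, run, value]
        else:
            out += [value] * run
        i += run
    return out


def rle_decode(codes):
    """Expand FF, count, value triples back into runs of pixels."""
    out = []
    i = 0
    while i < len(codes):
        if codes[i] == RUN_MARKER:
            count, value = codes[i + 1], codes[i + 2]
            out += [value] * count
            i += 3
        else:
            out.append(codes[i])
            i += 1
    return out


pixels = [0x24, 0x23] + [0x22] * 5 + [0x25] * 6 + [0x65] * 2 + [0x66] * 4
encoded = rle_encode(pixels)
```

On the pixel sequence from the slide, `encoded` is 13 bytes against 19 in the original, and `rle_decode(encoded)` restores the original list.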


Data Compression

Assigning variable length codes


Some values are more likely than others
Use shorter codes for often used values, longer
ones for less used values.
Each code must have the property of a unique
prefix
No code is the prefix of any other code
Thus we always know if we are at the end of a given code


Variable length codes

Example:

Letter:  a    b    c    d     e     f     g
Prob:    0.4  0.1  0.1  0.1   0.1   0.1   0.1
Code:    1    010  011  0000  0001  0010  0011

Can be decoded with a binary tree!


Called Huffman code

Algorithm exists to easily create optimal code


Requires that a table of codes be maintained with the file
Most often used for fixed codes
Example - Group 3 fax
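
A minimal sketch of encoding and decoding with the code table from the example slide. Because no code is a prefix of another, the decoder can emit a letter as soon as the bits read so far match a code. A dictionary lookup stands in here for the binary tree walk; the function names are illustrative assumptions.

```python
CODES = {"a": "1", "b": "010", "c": "011",
         "d": "0000", "e": "0001", "f": "0010", "g": "0011"}
PROBS = {"a": 0.4, "b": 0.1, "c": 0.1, "d": 0.1, "e": 0.1, "f": 0.1, "g": 0.1}


def encode(text: str) -> str:
    """Concatenate the variable-length code of each letter."""
    return "".join(CODES[ch] for ch in text)


def decode(bits: str) -> str:
    """Read bits one at a time; a complete code ends the current letter."""
    lookup = {code: letter for letter, code in CODES.items()}
    out = []
    current = ""
    for bit in bits:
        current += bit
        if current in lookup:
            out.append(lookup[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code")
    return "".join(out)


bits = encode("bad")
# Expected code length per letter, weighted by probability.
average_bits = sum(PROBS[ch] * len(CODES[ch]) for ch in CODES)
```

With these probabilities the average works out to 2.6 bits per letter, against 3 bits for a fixed-length code covering seven letters.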


Data Compression

Irreversible Compression
Compression which loses some information
Example - compress a 400x400 image into a
100x100 image by averaging groups of 16
adjacent pixels
Saves space, but resolution of picture reduced
Used most often for visual or audio information
(which has inherent redundancy)
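
The averaging example can be sketched as follows, assuming a grayscale image held as a list of rows and dimensions that divide evenly by the block size. The function name and the synthetic test image are assumptions made for illustration.

```python
def downsample(image, factor=4):
    """Average each factor x factor block of pixels into one pixel."""
    height, width = len(image), len(image[0])
    result = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            block = [image[y][x]
                     for y in range(top, top + factor)
                     for x in range(left, left + factor)]
            row.append(sum(block) // len(block))
        result.append(row)
    return result


# Synthetic 400x400 gradient, used only to exercise the function.
image = [[(x + y) % 256 for x in range(400)] for y in range(400)]
small = downsample(image)
```

The result holds one sixteenth as many pixels as the original, and the original values cannot be recovered from it.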


Data Compression

Compression in UNIX
pack and unpack programs

Uses Huffman coding


25% to 40% savings on text files
much less on binary files
Uses .z file suffix

compress and uncompress programs


Uses Lempel-Ziv compression
No coding table needed - self coding
Uses .Z file suffix
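
To illustrate why a Lempel-Ziv method needs no stored coding table, here is a simplified LZW sketch. It is not the actual `compress` implementation, which also handles variable code widths and table resets. Both sides start from the same 256 single-byte entries and grow identical tables as they go.

```python
def lzw_encode(data: bytes) -> list[int]:
    """Emit table codes, adding each newly seen byte string to the table."""
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    current = b""
    out = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate
        else:
            out.append(table[current])
            table[candidate] = next_code
            next_code += 1
            current = bytes([byte])
    if current:
        out.append(table[current])
    return out


def lzw_decode(codes: list[int]) -> bytes:
    """Rebuild the same table from the code stream alone."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:
            # The code refers to the entry that is about to be created.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid code in stream")
        out.append(entry)
        table[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return b"".join(out)


data = b"abababababababab"
codes = lzw_encode(data)
```

The decoder never sees the encoder's table; it reconstructs each entry one step behind the encoder, which is the sense in which the method is self-coding.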


Reclaiming space in files

Suppose a variable-length record in the middle of a file is modified so it is:
Longer?
Shorter?

Suppose a record is
Added to the middle?
Deleted from middle?


Reclaiming space in files

Record deletion and storage compaction


storage compaction
recovering unused space in a file
from deletion or from record size changing

Consider deleted records


Must be able to recognize deleted records
Have a special mark for a deleted record
e.g., an asterisk as the first character of the key field
May be undeleted if not overwritten!


Dealing with Deleted Records

Occasional compaction
Dynamic maintenance


Occasional compaction

A process, run periodically, reads the file and rewrites it with no empty space.
Could happen automatically every night, week, or month
File is unavailable while the operation is underway.
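
A minimal sketch of marking and compaction, assuming fixed-length 16-byte records whose first byte is an asterisk once deleted. In-memory `io.BytesIO` objects stand in for disk files, and all names are hypothetical.

```python
import io

RECORD_SIZE = 16        # hypothetical fixed record length
DELETED_MARK = b"*"     # first byte of a deleted record


def make_record(key: str) -> bytes:
    """Pad a key with spaces to the fixed record length."""
    return key.encode("ascii").ljust(RECORD_SIZE, b" ")


def mark_deleted(f, rrn: int) -> None:
    """Overwrite the first byte of record rrn with the deletion mark."""
    f.seek(rrn * RECORD_SIZE)
    f.write(DELETED_MARK)


def compact(source, target) -> int:
    """Copy every live record to target; return how many were kept."""
    source.seek(0)
    kept = 0
    while True:
        record = source.read(RECORD_SIZE)
        if len(record) < RECORD_SIZE:
            break
        if not record.startswith(DELETED_MARK):
            target.write(record)
            kept += 1
    return kept


data_file = io.BytesIO(b"".join(make_record(k) for k in ["ADAMS", "BAKER", "CLARK"]))
mark_deleted(data_file, 1)
compacted = io.BytesIO()
kept = compact(data_file, compacted)
```

Until compaction runs, the marked record still occupies its space in the file, which is why it can be undeleted if nothing has overwritten it.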


Dynamic maintenance

Delete records by marking
Reuse deleted records as new records are added or updated
Need:
Way of knowing if deleted records exist
Where deleted records are so we can jump right to
them


Dynamic maintenance

Solution: linked list of deleted records


Each deleted record contains a mark, and a pointer
to the next deleted record
The file header contains a pointer to the first
deleted record.


Linked list of deleted records

Fixed-length records
Variable-length records


Linked list of deleted records

Fixed-length records
Simply maintain a stack of deleted records rooted
in header record
Deletion - add to front of list
Addition - use record at front of list
Minimal list maintenance cost
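
The stack of deleted fixed-length records can be sketched as follows. The layout is an assumption made for illustration: a 4-byte header holds the RRN of the first deleted record (-1 if none), and each deleted record holds an asterisk followed by the RRN of the next deleted record.

```python
import io
import struct

LINK = struct.Struct("<i")   # a 4-byte signed RRN; -1 means "none"
RECORD_SIZE = 16             # hypothetical fixed record length


def offset(rrn: int) -> int:
    return LINK.size + rrn * RECORD_SIZE


def read_head(f) -> int:
    f.seek(0)
    return LINK.unpack(f.read(LINK.size))[0]


def write_head(f, rrn: int) -> None:
    f.seek(0)
    f.write(LINK.pack(rrn))


def delete(f, rrn: int) -> None:
    """Push record rrn onto the front of the deleted list."""
    head = read_head(f)
    f.seek(offset(rrn))
    f.write((b"*" + LINK.pack(head)).ljust(RECORD_SIZE, b" "))
    write_head(f, rrn)


def add(f, data: bytes) -> int:
    """Reuse the record at the front of the list, or append at the end."""
    record = data.ljust(RECORD_SIZE, b" ")[:RECORD_SIZE]
    head = read_head(f)
    if head == -1:
        f.seek(0, io.SEEK_END)
        rrn = (f.tell() - LINK.size) // RECORD_SIZE
    else:
        rrn = head
        f.seek(offset(rrn) + 1)          # skip the deletion mark
        write_head(f, LINK.unpack(f.read(LINK.size))[0])
    f.seek(offset(rrn))
    f.write(record)
    return rrn


f = io.BytesIO()
write_head(f, -1)                         # empty deleted list
first = [add(f, name) for name in (b"A", b"B", b"C", b"D")]
delete(f, 1)
delete(f, 3)
reused = [add(f, name) for name in (b"E", b"F", b"G")]
```

Both operations touch only the header and one record, so the cost does not grow with the number of deleted records.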


Linked list of deleted records

Variable-length records
Store for each deleted record
Deletion Marker
Link to next deleted record
record size indicator


Variable-length records

Insertion
Which deleted record?

Deletion
Add records to list (stack?)
Where?


Variable-length records Insertion

Select and use a deleted record


Break up records
pick a record
If size of deleted record bigger, break into two - a record
to use and a new, smaller, deleted record.
Put smaller deleted record back in list

Leave empty space at end


pick a record
If size of deleted record bigger, just leave empty space
at end.


Variable-length records Fragmentation

Recall fragmentation in fixed-length records
At the end of fields, if fields are fixed length
At the end of records, if fields are variable length
Called internal fragmentation

Leaving space at the end of a variable-length record also leads to internal fragmentation.
Breaking up variable-length records gets rid of
fragmentation, right? Wrong!


Variable-length records Fragmentation

As records get broken up, smaller and smaller pieces get left over.
These pieces are external fragmentation


Variable-length records Insertion strategy

How to pick record to use?


First Fit
Use first deleted record found in list

Best Fit
Use deleted record closest in size

Worst Fit
Use deleted record that is largest
No good when not breaking up records!
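
The three placement strategies can be compared with a small sketch. It assumes the deleted list is held in memory as (offset, size) pairs and that a chosen slot is broken up, with the leftover returned to the list. The names are hypothetical.

```python
def first_fit(avail, needed):
    """Index of the first slot that is large enough, or None."""
    for i, (_, size) in enumerate(avail):
        if size >= needed:
            return i
    return None


def best_fit(avail, needed):
    """Index of the smallest slot that is large enough, or None."""
    candidates = [i for i, (_, size) in enumerate(avail) if size >= needed]
    return min(candidates, key=lambda i: avail[i][1], default=None)


def worst_fit(avail, needed):
    """Index of the largest slot, provided it is large enough, or None."""
    candidates = [i for i, (_, size) in enumerate(avail) if size >= needed]
    return max(candidates, key=lambda i: avail[i][1], default=None)


def allocate(avail, needed, choose):
    """Take a slot using the given strategy and return its offset."""
    i = choose(avail, needed)
    if i is None:
        return None                      # caller appends at end of file
    offset, size = avail.pop(i)
    leftover = size - needed
    if leftover > 0:
        # Break up the slot: the unused tail goes back on the list.
        avail.append((offset + needed, leftover))
    return offset


avail = [(100, 40), (300, 25), (500, 90)]
choices = {fit.__name__: fit(avail, 20) for fit in (first_fit, best_fit, worst_fit)}
offset = allocate(avail, 20, worst_fit)
```

For a 20-byte record, first fit picks the 40-byte slot, best fit the 25-byte slot (leaving a 5-byte fragment), and worst fit the 90-byte slot (leaving 70 bytes that are still usable). A real file would also need each leftover to be large enough to hold its own mark, link, and size fields.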


Variable-length records Insertion

How do we find the record with the desired size?
Search them ALL!
Keep the records in sorted order by record size
Increasing size facilitates Best fit
Decreasing size facilitates worst fit (just pick first in list)
This increases deletion time!


Variable-length records Reducing fragmentation

Merge adjacent free records


How do we know if a newly deleted record is
adjacent to a free record?
Search the deleted list
Keep deleted records sorted by position in file
This makes finding adjacent free space trivial
Costs more at deletion time
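
A minimal sketch of merging adjacent free records, assuming the deleted list is available as (offset, size) pairs. Once the list is ordered by position, two slots are adjacent exactly when one ends where the next begins.

```python
def coalesce(avail):
    """Merge free slots that touch each other; return them sorted by offset."""
    merged = []
    for offset, size in sorted(avail):
        if merged and merged[-1][0] + merged[-1][1] == offset:
            # The previous slot ends where this one starts: join them.
            prev_offset, prev_size = merged[-1]
            merged[-1] = (prev_offset, prev_size + size)
        else:
            merged.append((offset, size))
    return merged


free_slots = coalesce([(300, 25), (100, 40), (140, 10), (325, 5)])
```

Here four fragments collapse into two larger slots, (100, 50) and (300, 30).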


Fast Searching

Binary Searching
O(log n), where n is number of records
requires file be sorted

Question - how do we sort file?
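
A sketch of binary searching a sorted file of fixed-length records, counting one access per record read. The record layout (an 8-byte key followed by 8 bytes of data) and the sample keys are assumptions; an `io.BytesIO` object stands in for the disk file.

```python
import io

RECORD_SIZE = 16   # hypothetical layout: 8-byte key + 8 bytes of data
KEY_SIZE = 8


def make_record(key: str, payload: str) -> bytes:
    return (key.ljust(KEY_SIZE) + payload.ljust(RECORD_SIZE - KEY_SIZE)).encode("ascii")


def binary_search(f, record_count: int, key: str):
    """Return (rrn, accesses); rrn is None when the key is absent."""
    low, high = 0, record_count - 1
    accesses = 0
    while low <= high:
        mid = (low + high) // 2
        f.seek(mid * RECORD_SIZE)                 # one seek and read per probe
        record_key = f.read(RECORD_SIZE)[:KEY_SIZE].decode("ascii").rstrip()
        accesses += 1
        if record_key == key:
            return mid, accesses
        if record_key < key:
            low = mid + 1
        else:
            high = mid - 1
    return None, accesses


keys = [f"K{i:05d}" for i in range(1000)]          # already in sorted order
data_file = io.BytesIO(b"".join(make_record(k, "data") for k in keys))
rrn, accesses = binary_search(data_file, len(keys), "K00742")
```

For 1,000 records no search needs more than 10 probes, but each probe is a separate seek to a different part of the file.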


File Sorting

Sort in RAM
read in entire file - sort
Called internal sorting
Limited by size of memory


Binary Search - Problems

Binary searching requires more than one or two accesses
Accesses are VERY expensive
Accesses are very random (much seek time)
A file of 100,000 records requires an average of 16.5 accesses
We would like to approach the speed of a direct
lookup!


Binary Search - Problems

Keeping a file sorted is expensive


Every record added must be entered in sorted
order
Reordering is costly

Internal sorting is limited to small files


We will see there are sort methods to sort a file
that will not fit in memory. But it is still expensive!


Keysorting

Rather than sorting the file, we could sort an array of primary keys, where each key is
accompanied by the address of the associated record.
The pointer could be a byte offset from the start of the file or, if
records are fixed length, a relative record number (RRN).
After sorting the keys, the file can be rewritten in
order.
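
A minimal keysort sketch, assuming fixed-length records with the key in the first 8 bytes, so that the pointer is an RRN. The names and record layout are hypothetical, and `io.BytesIO` objects stand in for the input and output files.

```python
import io

RECORD_SIZE = 16   # hypothetical layout: 8-byte key + 8 bytes of data
KEY_SIZE = 8


def keysort(source, target):
    """Sort (key, RRN) pairs in memory, then rewrite the records in key order."""
    # Pass 1: read the file sequentially, keeping only each key and its RRN.
    source.seek(0)
    keynodes = []
    rrn = 0
    while True:
        record = source.read(RECORD_SIZE)
        if len(record) < RECORD_SIZE:
            break
        keynodes.append((record[:KEY_SIZE], rrn))
        rrn += 1
    keynodes.sort()
    # Pass 2: one seek per record, in sorted key order.
    for _, rrn in keynodes:
        source.seek(rrn * RECORD_SIZE)
        target.write(source.read(RECORD_SIZE))
    return keynodes


names = [b"SMITH", b"ADAMS", b"JONES"]
source = io.BytesIO(b"".join(n.ljust(RECORD_SIZE) for n in names))
target = io.BytesIO()
keynodes = keysort(source, target)
```

Only the keys and RRNs are held in memory during the sort. The second pass reads the records in key order rather than file order, so it seeks once per record.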


Keysorting

Advantages
Keys can be sorted in a smaller space than the whole
file
Faster to sort (swap!) keys than entire records


Keysorting

Disadvantages
Still limited in size to key lists which fit in memory
Sequential processing cannot take advantage
of buffering!


Keysorting

Alternative - keep the sorted list of (key, pointer) pairs around.
Is a type of index file!
Can be read in and searched in memory!


Key Sorted Index

Advantages
Keys and pointers can be searched in memory.
Only one I/O per lookup!
File can be maintained in ANY order. Searching
and key order sequential processing still possible.
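
A sketch of a key-sorted index over an unsorted file, assuming newline-terminated records of the form `key|data` and byte offsets as pointers. The format and function names are illustrative assumptions. The search runs in memory with `bisect`, and a hit costs a single seek into the data file.

```python
import bisect
import io


def build_index(f):
    """Scan the file once and return a sorted list of (key, byte_offset)."""
    index = []
    f.seek(0)
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        key = line.split(b"|", 1)[0]
        index.append((key, offset))
    index.sort()
    return index


def lookup(f, index, key: bytes):
    """Binary search the in-memory index, then read one record from the file."""
    i = bisect.bisect_left(index, (key, 0))
    if i == len(index) or index[i][0] != key:
        return None
    f.seek(index[i][1])                    # the only file access per lookup
    return f.readline().rstrip(b"\n")


data_file = io.BytesIO(b"SMITH|Tulsa\nADAMS|Boston\nJONES|Akron\n")
index = build_index(data_file)
record = lookup(data_file, index, b"JONES")
```

The data file stays in arrival order; walking `index` from front to back still visits the records in key order.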


Key Sorted Index

Disadvantages
Sequential processing cannot take advantage
of buffering!
Pinned records
Records in main file cannot change location without
invalidating index file!
Must either maintain index in parallel, or rebuild!

