
UNIT I

INTRODUCTION TO DBMS

File Systems Organization - Sequential, Pointer, Indexed, Direct - Purpose of Database System - Database System Terminologies - Database Characteristics - Data Models - Types of Data Models - Components of DBMS - Relational Algebra. LOGICAL DATABASE DESIGN: Relational DBMS - Codd's Rules - Entity-Relationship Model - Extended ER - Normalization - Functional Dependencies, Anomaly - 1NF to 5NF - Domain Key Normal Form - Denormalization.

1. File Systems Organization -Sequential, Pointer, Indexed, Direct

A file is a collection or set (ordered or unordered) of data elements stored on storage


media. A system software module is responsible for managing (reading, processing, deleting, etc.)
a file.

Logical Structure of a file


Field: Smallest (fixed) indivisible logical unit of a file. A field holds a part of some data value.
Record: A set of logically related fields. Cardinality of this set may be fixed or variable, i.e., a
record size may be fixed or variable. A file, therefore, is a collection of logically related records.

A record can be of fixed or variable size; most commonly, fixed-size records are used. The file structure has been illustrated here in tabular form, but note that a file is not just a table: its records can be organized in many different ways, and each way defines a distinct file organization.
Operations on file
Create: Create a new file.

Write: Write to a file


Read: Read a part of a file
Rewind: Go to the beginning of the file
Delete: Remove a file from the system.
Close: Close a file (at the end of manipulation).
Open: Open a file for processing.
Modify: Modify a field value.

One of the main objectives of file organization is to speed up the file retrieval time, that is,
to reduce the I/O time. This can be done by improving disk access speed but that has a limit
because of the hardware constraints.
The other way is to store file records in such a way that records can be accessed
concurrently or in parallel. A hardware solution to this is using RAID structure. A RAID is an
array of several disks which can be accessed in parallel.
No single file organization is the most efficient for all types of data access. For this reason
a database management system uses a number of different file organizations for storing the same
set of data.
There are basically three categories of file organizations: (a) sequential organization, (b) organization using an index, and (c) random (direct) organization. We begin with the sequential category; files of this category are called sequential files.
For each organization we will (a) understand the file organization, (b) understand how a set of operations is performed, and (c) analyze the time required to perform each operation. This will help us identify which file organization is suitable for what kind of data processing.
Sequential file
In this organization records are written consecutively when the file is created. Records in a
sequential file can be stored in two ways. Each way identifies a file organization.
Pile file: Records are placed one after another as they arrive (no sorting of any kind).
Sorted file: Records are placed in ascending or descending values of the primary key.

Pile file: The total time to fetch (read) a record from a pile file depends on the seek time (s), the rotational latency (r), the block transfer time (btt), and the number of blocks in the file (B). We also need the key field value (K) of the desired record. In a pile file we do not know the address of a record, so we search

sequentially. The possibilities are (a) the desired record may be in the first block or (b) it may be in
some middle block or (c) it may be in the last block.
The fetching (searching) process is shown in the flow diagram (Figure 1) and the example
illustrates the time involved.

Figure 1. Search in a sequential file organization.
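A minimal Python sketch of this search, under the assumption that the pile file is modelled as a list of blocks and each record is a (key, data) pair; the function name fetch_from_pile is illustrative only. On average about B/2 blocks are read for a successful search and all B blocks for an unsuccessful one.

# Sequential search over a pile file: scan block by block until the key is found.
# blocks: list of blocks, each block a list of (key, data) records.
def fetch_from_pile(blocks, key):
    blocks_read = 0
    for block in blocks:                  # one block transfer per iteration
        blocks_read += 1
        for record_key, data in block:
            if record_key == key:
                return data, blocks_read
    return None, blocks_read              # miss: every block was read

pile = [[(17, "A"), (3, "B")], [(42, "C"), (8, "D")], [(25, "E"), (11, "F")]]
print(fetch_from_pile(pile, 8))           # ('D', 2)  - found in the second block
print(fetch_from_pile(pile, 99))          # (None, 3) - all B = 3 blocks read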

File Reorganization: In file reorganization, all records that are marked as deleted are removed, and all inserted records are moved to their correct places (sorting). File reorganization steps are:
read the entire file (all blocks) in RAM.
remove all the deleted records.
write all the modified blocks at a different place on the disk.

Inserting a record: To insert a record, it is placed at the end of the file. No need to sort
(ascending or descending order) the file. However, the file may have duplicate records.

Deleting or modifying a record: This will require to fetch the block containing the record, find
the record in the block and just mark it deleted, then write the modified block to the disk.

Sorted Sequential File: In a sorted file, first the record is inserted at the end of the file and then
moved to its correct location (ascending or descending). Records are stored in order of the values
of the key field.

A sequential file organization

A sequential file usually has an overflow area. This area is to avoid sorting the file after every
deletion, insertion and/or modification. All records, which were added after the file was first
populated, go to this overflow area. The overflow area is not itself sorted; it is a pile file with fixed-size records. At some convenient time the entire file is re-sorted, and the records from the overflow area move to their correct positions in the main file.
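A small Python sketch of this idea, assuming the main area is a list of (key, data) records kept in key order and the overflow area is an unsorted pile; the class and method names are invented for illustration.

import bisect

class SortedFile:
    """Toy sorted sequential file with an unsorted overflow (pile) area."""
    def __init__(self, records):
        self.main = sorted(records)         # main area, sorted on the key
        self.overflow = []                  # overflow area, unsorted

    def insert(self, key, data):
        self.overflow.append((key, data))   # no re-sorting on every insert

    def fetch(self, key):
        i = bisect.bisect_left(self.main, (key,))        # binary search, main area
        if i < len(self.main) and self.main[i][0] == key:
            return self.main[i][1]
        for k, data in self.overflow:                    # sequential scan, overflow
            if k == key:
                return data
        return None

    def reorganize(self):                   # done "at some convenient time"
        self.main = sorted(self.main + self.overflow)
        self.overflow = []

f = SortedFile([(10, "X"), (30, "Y"), (50, "Z")])
f.insert(20, "W")
print(f.fetch(20))    # 'W' (found in the overflow area)
f.reorganize()
print(f.fetch(20))    # 'W' (now in its proper position in the main area)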

Retrieval: Records are retrieved in a consecutive order. The order of record storage determines
order of retrieval. During retrieval several required operations (partial result output etc.) can be
performed simultaneously.

Insert, delete and modify (Update): Sequential files are usually not updated in place. Each
operation regenerates a new version of the file. The new version contains the up-to-date
information and the old version is usually saved for recovery. The new version becomes the old
version, and the next update to the file uses this old version to generate the next new version. The intensity or frequency of use of a sequential file is measured by a parameter called the hit ratio, which is defined as the number of records accessed by a run divided by the total number of records in the file.


A high hit ratio value is desirable: it means a large proportion of the records is accessed to respond to a query. Interactive transactions have a very low hit ratio.

Advantages of sequential file


Good for batch transactions.
Simple to implement
Good for report generation, statistical computation and inventory control.

Disadvantages
Not good for interactive transactions
High overheads in file processing for simple queries.

Index File Organization

Index organization tries to reduce access time. It may not reduce the storage requirement of
a file. Some important terms of this organization are.
Index: An index is similar to a pointer with an additional property. This additional property allows
an index to identify the record it is pointing to.
For example, an index to a record of employees points and identifies an employee record.
This means that an index has some semantics associated with it. Formally, an index maps the key
space to the record space. Index for a file may be created on the primary or secondary keys.
Index types: Primary, Non dense, and Dense.
Primary Index: An ordered file of index records. This file contains index records, which are of fixed length. Each index record has two fields:
one field holds the primary key of a data file record;


the other holds a pointer to the disk block where that record of the data file is stored.

Nondense index: No. of entries in the index file << no. of records in the data file.
Dense index: No. of entries in the index file = no. of records in the data file.
Example: how record search time is reduced by indexing is sketched below.
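A rough Python illustration of the idea, assuming the data file is a sorted sequence of blocks and the nondense index holds one entry per block (the first key of the block and the block number); a lookup then touches the small index plus a single data block instead of scanning the file. A dense index would instead hold one entry per record.

import bisect

# Data file: sorted, fixed-size blocks of (key, data) records.
data_blocks = [
    [(5, "a"), (10, "b")],
    [(15, "c"), (20, "d")],
    [(25, "e"), (30, "f")],
]

# Nondense (sparse) primary index: one (anchor key, block number) entry per block.
index = [(block[0][0], i) for i, block in enumerate(data_blocks)]

def indexed_fetch(key):
    anchors = [k for k, _ in index]
    pos = bisect.bisect_right(anchors, key) - 1   # last block whose anchor <= key
    if pos < 0:
        return None
    for k, data in data_blocks[index[pos][1]]:    # only one data block is read
        if k == key:
            return data
    return None

print(indexed_fetch(20))   # 'd'
print(indexed_fetch(12))   # None (no such key)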

Index Sequential File (ISAM - Index Sequential Access Method)


Introduced by IBM in the early 1960s, this file organization is closely related to the physical characteristics of the storage media. It has two parts: (a) an index part and (b) a data part.
Index Part: This part stores pointers to the actual record locations on the disk. It may have several levels, and each level corresponds to some physical unit of the storage media, such as a sector or a cylinder.
Data Part: It holds actual data records and is made up of two distinct areas: Prime area and
Overflow area. The prime area holds records of the file. The overflow area holds records when the
prime area overflows.

Direct File
Indexing provides the most powerful and flexible tools for organizing files. However, (a)
indexes take up space and time to keep them organized and (b) the amount of I/O increases with
the size of index.
This problem can be minimized by direct file organization, where the address of the desired record can be found directly (no need for indexing or sequential search). Such files are created using some hashing function, so this is called a hashing organization and the files are called hashed files.
Hashing
Two types of hashing: (a) Static and (b) Dynamic.
Static: The address space size is predefined and does not grow or shrink with file.
Dynamic: The address space size can grow and shrink with file.
Key Space: Set of Primary keys.
Address Space: Set of home addresses.
Hash function(Primary key) → home address of the record.
Home address → (file manager) → disk address.
Requirements: 1. A predefined set of home addresses (address space). 2. A hashing function. 3.
Primary key (Hashed key).

There is a unique relationship between the key and the record address. Any add, delete or change
in a record must not break this relationship. If this bond is broken then the record is no longer
accessible by hashing.
Address distribution

1. With hashing, the address generated is random.

2. No obvious connection between key and the home address. (For this reason hashing is
sometimes referred as randomizing.)

Record distribution in address space: a. Uniform. b. Random.


Collision: When a hashing function generates the same home address for more than one record.
Solutions for collision: Progressive overflow (also called open addressing)
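A compact Python sketch of static hashing with progressive overflow (open addressing); the table size M, the mod hash function, and the record values are illustrative assumptions, and record deletion is not modelled.

M = 7                              # predefined address space (home addresses)
table = [None] * M

def hash_home(key):
    return key % M                 # simple hash function for the sketch

def insert(key, data):
    addr = hash_home(key)
    for step in range(M):          # progressive overflow: try successive slots
        slot = (addr + step) % M
        if table[slot] is None:
            table[slot] = (key, data)
            return slot
    raise RuntimeError("address space full")

def fetch(key):
    addr = hash_home(key)
    for step in range(M):
        slot = (addr + step) % M
        if table[slot] is None:    # empty slot: key cannot be further along
            return None
        if table[slot][0] == key:
            return table[slot][1]
    return None

insert(10, "rec-10")               # home address 3
insert(17, "rec-17")               # also hashes to 3: collision, stored at 4
print(fetch(17))                   # 'rec-17'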
2. Purpose of Database System:
Database management systems were developed to handle the following difficulties of typical file-processing systems supported by conventional operating systems:
(i) Data redundancy and consistency
(ii) Self-describing nature (of a database system)
(iii) Data isolation or abstraction
(iv) Integrity
(v) Atomicity
(vi) Concurrent access or sharing of data
(vii) Security
(viii) Support for multiple views of data

(i) Data redundancy and consistency:
In a file system, each user maintains separate files and programs to manipulate them, because each user requires some data that is not available from other users' files. This redundancy in defining and storing data results in:
wasted storage space,
redundant efforts to maintain common updates,
higher storage and access costs, and

inconsistency of data, i.e., various copies of the same data may not agree.

In the database approach, a single repository of data is maintained; it is defined once and then accessed by various users. Thus redundancy is controlled and the data remain consistent.
(ii) Self Describing Nature of Database System
In File System, the structure of the data file is embedded in the access programs.
A database system contains not only the database itself but also a complete definition or
description of database structure and constraints. This definition is stored in System catalog which
contains information such as structure of each file, type and storage format of each data item and
various constraints on the data. Information stored in the catalog is called Meta-Data.
A DBMS is not written for one specific application; it must refer to the catalog to know the structure of each file, and hence it can work equally well with any number of database applications.
(iii) Data Isolation or Abstraction:
Conventional file-processing systems do not allow data to be retrieved in a convenient and efficient manner. More responsive data retrieval systems are required for general use.
The structure of the data file is embedded in the access programs. So any changes
to the structure of a file may require changing all programs that access this file. Because data are
scattered in various files and files may be in different formats, writing new application programs to
retrieve appropriate data is difficult.
But the DBMS access programs do not require such changes in most cases. The
structure of data files is stored in DBMS catalog separately from the access programs. This
property is known as program data independence.
Operations are separately specified and can be changed without affecting the
interface. User application programs can operate on data by invoking these operations regardless
of how they are implemented. This property is known as program operation independence.
Program data independence and program operation independence are together known as data
independence.
(iv) Enforcing Integrity Constraints:
The data values must satisfy certain types of consistency constraints.
In File System, Developers enforce constraints by adding appropriate code in application
program. When new constraints are added, it is difficult to change the programs to enforce them.

In a database system, the DBMS provides capabilities for defining and enforcing constraints. The constraints are maintained in the system catalog, so application programs need not change when constraints are added or modified. Hence integrity problems are avoided.
(v) Atomicity:
A computer system is subject to failure. If a failure occurs, the data must be restored to the consistent state that existed prior to the failure. Transactions must be atomic: each must happen in its entirety or not at all.
It is difficult to ensure atomicity in File processing System.
In DB approach, the DBMS ensures atomicity using the Transaction manager
inbuilt in it. DBMS supports online transaction processing and recovery techniques to maintain
atomicity.
(vi) Concurrent Access or sharing of Data:
When multiple users update the data simultaneously, it may result in inconsistent
data. The system must maintain supervision which is difficult because data may be accessed by
many different application programs that may have not been coordinated previously.
The DBMS includes concurrency control software to ensure that several programs/users trying to update the same data do so in a controlled manner, so that the result of the updates is correct.
(vii) Security:
Not every user of the database system should be able to access all of the data. But since application programs are added to the system in an ad hoc manner, enforcing such security constraints is difficult in a file system.
DBMS provide security and authorization subsystem, which the DBA uses to create
accounts and to specify account restrictions.
(viii) Support for multiple views of Data:
The database approach supports multiple views of the data. A database has many users, each of whom may require a different view of the database. A view may be a subset of the database, or it may be virtual data derived from the database that is not explicitly stored. A DBMS provides multiple views of the data, whereas in a file system different application programs have to be written for different views of the data.
3. Database System Terminologies:
a) File system:
A file system is a method for storing and organizing computer files and the data they contain to
make it easy to find and access them. It is controlled by Operating system.
b) Data:

Data are Known facts that can be recorded and that have implicit meaning
c) Database:
A database is a collection of related data with some inherent meaning. It is designed, built, and populated with data for a specific purpose. It represents some aspect of the real world.
d) Database Management System:
It is a collection of programs that enables to create and maintain database. It is a general
purpose Software system that facilitates the process of defining, constructing and manipulating
data bases for various applications.
Defining involves specifying data types, structures and constraints for the data to be stored in the
database
Constructing is the process of storing the database itself on some storage medium that is
controlled by DBMS.
Manipulating includes functions such as querying the database to retrieve specific data, updating
database to reflect changes and generating reports from the data.
Eg. Oracle, Ms-access, Sybase, Informix, Foxpro
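The three activities (defining, constructing, and manipulating) can be sketched with Python's built-in sqlite3 module, used here only as a convenient stand-in for a full DBMS; the table and data are invented for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")   # a throwaway in-memory database
cur = conn.cursor()

# Defining: specify data types, structures, and constraints.
cur.execute("CREATE TABLE student (rollno INTEGER PRIMARY KEY, name TEXT, dept TEXT)")

# Constructing: store the data itself on the storage medium (in memory here).
cur.executemany("INSERT INTO student VALUES (?, ?, ?)",
                [(1, "Suresh", "CSE"), (2, "Bala", "IT")])

# Manipulating: query the database and update it.
cur.execute("SELECT name FROM student WHERE dept = ?", ("CSE",))
print(cur.fetchall())                # [('Suresh',)]
cur.execute("UPDATE student SET dept = ? WHERE rollno = ?", ("ECE", 2))
conn.commit()
conn.close()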

Example for a database - A University Database


e) Database System: Database and DBMS together known as Database system.


Fig 1.1 Database System Environment. (The figure shows users/programmers submitting application programs and queries to the database system; DBMS software to process the queries/programs and software to access the stored data; and the stored database together with the stored database definition, i.e., meta-data.)


f) Applications of Database System:
Banking - customer information, accounts, loans, and transactions
Airlines - reservation and schedule information
Universities - student, course, registration, and grade information
Credit card transactions - purchases on cards and reports
Telecommunications - records of calls, reports, balances, etc.
Finance - holdings, sales, stock and bond purchases
Sales - customer, product, and purchase information
Manufacturing - inventory items, orders, and purchases
Human resources - employee details, salary, tax, etc.


4. Database Characteristics:
Self-Describing Nature of a Database System

Insulation between Programs and Data, and Data Abstraction


Support of Multiple Views of the Data
Sharing of Data and Multiuser Transaction Processing

In the database approach, a single repository of data is maintained that is defined once and then is accessed by various users. The main characteristics of the database approach versus the file-processing approach are the following.
Self-Describing Nature of a Database System

A fundamental characteristic of the database approach is that the database system contains not
only the database itself but also a complete definition or description of the database structure and
constraints.

This definition is stored in the system catalog, which contains information such as the structure
of each file, the type and storage format of each data item, and various constraints on the data. The
information stored in the catalog is called meta-data, and it describes the structure of the primary
database.

The catalog is used by the DBMS software and also by database users who need information
about the database structure. A general purpose DBMS software package is not written for a
specific database application, and hence it must refer to the catalog to know the structure of the
files in a specific database, such as the type and format of data it will access.

The DBMS software must work equally well with any number of database applications (for example, a university database, a banking database, or a company database) as long as the database definition is stored in the catalog.
DBMS software can access diverse databases by extracting the database definitions from the
catalog and then using these definitions.

Whenever a request is made to access, say, the Name of a STUDENT record, the DBMS
software refers to the catalog to determine the structure of the STUDENT file and the position and
size of the Name data item within a STUDENT record.

Insulation between Programs and Data, and Data Abstraction


The structure of data files is stored in the DBMS catalog separately from the access programs.
We call this property program-data independence.

In a DBMS environment, we just need to change the description of STUDENT records in the
catalog to reflect the inclusion of the new data item Birthdate; no programs are changed. The next
time a DBMS program refers to the catalog, the new structure of STUDENT records will be
accessed and used.

An operation (also called a function) is specified in two parts. The interface (or signature) of
an operation includes the operation name and the data types of its arguments (or parameters). The
implementation (or method) of the operation is specified separately and can be changed without
affecting the interface.

User application programs can operate on the data by invoking these operations through their
names and arguments, regardless of how the operations are implemented. This may be termed
program-operation independence.

The characteristic that allows program-data independence and program-operation independence is called data abstraction. A DBMS provides users with a conceptual representation of data that does not include many of the details of how the data is stored or how the operations are implemented.

The data model uses logical concepts, such as objects, their properties, and their
interrelationships, that may be easier for most users to understand than computer storage concepts.
Hence, the data model hides storage and implementation details that are not of interest to most
database users.
Support of Multiple Views of the Data

A database typically has many users, each of whom may require a different perspective or view of
the database. A view may be a subset of the database or it may contain virtual data that is derived
from the database files but is not explicitly stored. Some users may not need to be aware of

whether the data they refer to is stored or derived. A multiuser DBMS whose users have a variety
of applications must provide facilities for defining multiple views.

For example, one user of the database may be interested only in the transcript of each student; a view showing transcripts is what this user sees. A second user, who is interested only in checking that students have taken all the prerequisites of each course they register for, may require a different view.
Sharing of Data and Multiuser Transaction Processing

A multiuser DBMS must allow multiple users to access the database at the same time. This is
essential if data for multiple applications is to be integrated and maintained in a single database.

The DBMS must include concurrency control software to ensure that several users trying to
update the same data do so in a controlled manner so that the result of the updates is correct.

For example, when several reservation clerks try to assign a seat on an airline flight, the
DBMS should ensure that each seat can be accessed by only one clerk at a time for assignment to a
passenger. These types of applications are generally called on-line transaction processing (OLTP)
applications. A fundamental role of multiuser DBMS software is to ensure that concurrent
transactions operate correctly.

5. Data Models
5.1 Definitions:
a) Data Model:
Data model is a collection of concepts that can be used to describe the structure of a
database (ie., Data, Relationships, Datatypes and constraints)
b) Schema:
The complete definition and description of the database is known as the database schema. Each object in the schema is known as a schema construct. The schema is also known as the intension.
c) Database State:
The data in the database at a particular moment in time is called a database state or snapshot. It is also known as the extension of the database schema.
The DBMS stores the description of the schema constructs and constraints (meta-data) in the DBMS catalog.

5.2 Types of Data model:

High level or conceptual: close to the way users perceive the data.
Low level or physical: describes how data is stored in the computer.
Representational or implementational: uses concepts understood by users yet not too far from the way data is organized, e.g., the network and hierarchical models.

5.3 DBMS Architecture:

Three-schema architecture
Helps achieve the database characteristics described above.
Three-Schema Architecture:
It separates the user applications from the physical database. Schemas can be defined at three levels:
(i) Internal Level:
It has an internal schema, which describes the physical storage structure of the database.
How the data are actually stored.
Uses a physical data model.
Describes the complete details of data storage and access paths for the database.
(ii) Conceptual Level:
It has a conceptual schema, which describes the structure of the whole database.
What data are stored and what relationships exist among the data.
Uses a high-level or implementational data model.
Hides the details of physical storage structures and describes data types, relationships, operations, and constraints.
(iii) External or View Level:
Includes a number of external schemas or views.
Each external schema describes one part of the database and hides the rest.
Uses a high-level or implementational data model.

Fig 1.2 DBMS architecture


Each user refers only to its own schema. DBMS must transform a request specified on external
schema into a request against conceptual schema and then into a request on internal schema for
processing over stored database.
The process of transforming requests and results between levels are called mappings.
Data Independence:
The capacity to change the schema at one level of database system without having to change the
schema at the next higher level.
Logical Data independence:
The capacity to change the conceptual schema without having to change the External schema.
Physical Data Independence:
The capacity to change the physical schema without having to change the Conceptual schema.
5.4 Data base Application Architecture:
Client : Machine in which remote database users work
Server: machine in which database system runs.
Two-tier Architecture:
The application is partitioned into a component that resides at the client machine, which invokes database system functionality at the server machine through query language statements.
E.g., APIs such as ODBC and JDBC are used for the interaction.
Three tier Architecture:
The client merely acts as a front end and does not contain any direct database calls.

The client end communicates with an application server, usually through a forms interface. The application server (which contains the business logic) communicates with the database system.

Fig 1.3 Two-Tier and Three-Tier Architectures. (Two-tier: user -> client application -> network -> database system at the server. Three-tier: user -> client front end -> network -> application server -> database system.)


6. Components of DBMS
DBMSs are the technology tools that directly support managing organizational data. With a
DBMS you can create a database including its logical structure and constraints, you can
manipulate the data and information it contains, or you can directly create a simple database
application or reporting tool.
Business applications, which perform the higher level tasks of managing business
processes, interact with end users and other applications and, to store and manage data, rely on and
directly operate their own underlying database through a standard programming interface like
ODBC.
The following diagram illustrates the five components of a DBMS.


Database Engine:
The Database Engine is the core service for storing, processing, and securing data. The
Database Engine provides controlled access and rapid transaction processing to meet the
requirements of the most demanding data consuming applications within your enterprise.
Use the Database Engine to create relational databases for online transaction processing or
online analytical processing data. This includes creating tables for storing data, and database
objects such as indexes, views, and stored procedures for viewing, managing, and securing data.
You can use SQL Server Management Studio to manage the database objects, and SQL Server
Profiler for capturing server events.
Data dictionary:
A data dictionary is a reserved space within a database which is used to store information about the database itself. A data dictionary is a set of tables and views which can only be read and never altered.
Most data dictionaries contain different information about the data used in the enterprise. In terms of the database representation of the data, the data dictionary defines all schema objects, including views, tables, clusters, indexes, sequences, synonyms, procedures, packages, functions, triggers, and many more. This ensures that all these objects follow the one standard defined in the dictionary.
The data dictionary also records how much space has been allocated for, and/or is currently in use by, each schema object. A data dictionary is used when finding information about users,

objects, and schema and storage structures. Every time a data definition language (DDL) statement
is issued, the data dictionary becomes modified.

A data dictionary may contain information such as:

Database design information

Stored SQL procedures

User permissions

User statistics

Database process information

Database growth statistics

Database performance statistics

Query Processor :
A relational database consists of many parts, but at its heart are two major components: the
storage engine and the query processor. The storage engine writes data to and reads data from the
disk. It manages records, controls concurrency, and maintains log files. The query processor accepts SQL syntax, selects a plan for executing the syntax, and then executes the chosen plan.
The user or program interacts with the query processor, and the query processor in turn interacts
with the storage engine. The query processor isolates the user from the details of execution: The
user specifies the result, and the query processor determines how this result is obtained.
The query processor components include

DDL interpreter

DML compiler

Query evaluation engine

Report writer/ generator:


Report writer is a program, usually part of a database management system that extracts
information from one or more files and presents the information in a specified format. Most report

writers allow you to select records that meet certain conditions and to display selected fields in
rows and columns.
You can also format data into pie charts, bar charts, and other diagrams. Once you have
created a format for a report, you can save the format specifications in a file and continue reusing
it for new data.
7. Relational Algebra:
Relational algebra is a procedural query language, which consists of basic set of relational
model operations. These operations enable the user to specify basic retrieval requests. It takes one
or more relations as input and result of retrieval is a relation.
Relational algebra operations are divided into two groups:
(i) Set operations
(ii) Fundamental operations (relation-based operations)

Relational Algebra Notations:

Operation            Keyword            Symbol

Set Operators (Binary):
Union                UNION              ∪
Intersection         INTERSECTION       ∩
Set Difference       MINUS              −
Cartesian Product                       ×

Relation-Based Operators:
Selection            SELECT             σ
Projection           PROJECT            π
Renaming             RENAME             ρ
Assignment                              ←

Binary Relation-Based Operators:
Join                 JOIN               ⋈
Left Outer Join      LEFT OUTER JOIN    ⟕
Right Outer Join     RIGHT OUTER JOIN   ⟖
Full Outer Join      FULL OUTER JOIN    ⟗
Division             DIVIDE             ÷

Other Operators:
Group or Aggregate   AGGREGATE          ℱ

1. UNARY RELATIONAL ALGEBRA OPERATIONS:


(i) SELECT Operation (σ):

The SELECT operation is used to select a set of tuples from a relation that satisfy
the selection condition.


The sigma (σ) represents the select operator. It is a unary operator because it operates on one relation.

σ<selection condition>(R)


The selection condition is a Boolean expression on the attributes of relation R.
The selection condition is represented as
<attribute name> <comparison operator> <constant value> or
<attribute name> <comparison operator> <attributename>
where the comparison operator is one of {=, <, ≤, >, ≥, ≠}, and the Boolean connectives AND, OR, and NOT can also be used.

The selection condition is applied to each tuple in R. This is done by substituting each
occurrence of an Attribute Ai in the selection condition with its value in the tuple t[Ai]. If
the condition evaluates to true, then tuple t is selected. All the selected tuples appear in the
result of SELECT operation.

The selection conditions cannot involve more than one tuple.

The degree of the relation resulting from SELECT is same as that of R.

The number of tuples in the resultant relation is always less than or equal to the number of tuples in R, i.e., |σc(R)| ≤ |R|. The number of tuples selected by a selection condition is referred to as the selectivity of the condition.

A cascade of SELECT operations can be combined into a single SELECT using a conjunctive (AND) condition.

σ<cond1>(σ<cond2>(σ<cond3>(R))) = σ<cond1> AND <cond2> AND <cond3>(R)

SELECT operation is commutative.

σ<cond1>(σ<cond2>(R)) = σ<cond2>(σ<cond1>(R))


E.g.,
σDNO=4(EMPLOYEE)
σSALARY>30000(EMPLOYEE)
σ(DNO=4 AND SALARY>25000) OR (DNO=5 AND SALARY>30000)(EMPLOYEE)
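A small Python sketch of SELECT, treating a relation as a list of dictionaries (one per tuple); the EMPLOYEE data are invented for illustration.

EMPLOYEE = [
    {"FNAME": "John",  "LNAME": "Smith",  "SALARY": 30000, "DNO": 5},
    {"FNAME": "Alice", "LNAME": "Zelaya", "SALARY": 25000, "DNO": 4},
    {"FNAME": "Ahmad", "LNAME": "Jabbar", "SALARY": 40000, "DNO": 4},
]

def select(relation, condition):
    """sigma_condition(relation): keep the tuples for which the condition is true."""
    return [t for t in relation if condition(t)]

dno4 = select(EMPLOYEE, lambda t: t["DNO"] == 4)
rich = select(EMPLOYEE, lambda t: (t["DNO"] == 4 and t["SALARY"] > 25000)
                                  or (t["DNO"] == 5 and t["SALARY"] > 30000))
print(len(dno4), len(rich))   # 2 1  -- in both cases |result| <= |EMPLOYEE|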

(ii) PROJECT Operation (π):
The PROJECT operation is used to select certain columns from the table and discard the other columns. The pi (π) represents the project operator. It is a unary operator because it operates on one relation.

π<attribute list>(R)


<attribute list> is a list of attributes drawn from the attributes of relation R.

The result of the PROJECT operation has only the attributes specified in <attribute list>, in the same order as they appear in the list.

The degree of the relation resulting from PROJECT equals the number of attributes in <attribute list>.

If the attribute list contains only nonkey attributes of R, duplicate tuples are likely to occur; the PROJECT operation removes any duplicates, so the result is a set of tuples. This is known as duplicate elimination.

The number of tuples in the resultant relation is always less than or equal to the number of tuples in R.

π<list1>(π<list2>(R)) = π<list1>(R), as long as <list2> contains the attributes in <list1>.

PROJECT operation is not commutative.

E.g., πLNAME, FNAME, SALARY(EMPLOYEE)
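Continuing the same list-of-dictionaries representation, a sketch of PROJECT with duplicate elimination (projecting on a nonkey attribute such as SEX produces duplicates, which are removed):

def project(relation, attributes):
    """pi_attributes(relation): keep the listed columns and eliminate duplicates."""
    seen, result = set(), []
    for t in relation:
        row = tuple(t[a] for a in attributes)
        if row not in seen:               # duplicate elimination
            seen.add(row)
            result.append(dict(zip(attributes, row)))
    return result

EMPLOYEE = [
    {"LNAME": "Smith",  "FNAME": "John",  "SALARY": 30000, "SEX": "M"},
    {"LNAME": "Zelaya", "FNAME": "Alice", "SALARY": 25000, "SEX": "F"},
    {"LNAME": "Jabbar", "FNAME": "Ahmad", "SALARY": 25000, "SEX": "M"},
]

print(project(EMPLOYEE, ["LNAME", "FNAME", "SALARY"]))   # 3 tuples
print(project(EMPLOYEE, ["SEX"]))                        # 2 tuples, duplicates removed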


(iii) RENAME Operation (ρ):
The rename operator is used to rename either the relation name or the attribute names or both. The ρ (rho) operator denotes the rename operation. It is a unary operator.
(i) ρS(B1, B2, ..., Bn)(R) or (ii) ρS(R) or (iii) ρ(B1, B2, ..., Bn)(R)
(i) renames both the relation and its attributes,
(ii) renames the relation only,
(iii) renames the attributes only.
The attributes are renamed in the same order as in R.

E.g., ρFN,LN,SAL(πFNAME, LNAME, SALARY(σDNO=5(EMPLOYEE)))
2. RELATIONAL ALGEBRA SET OPERATORS:
Set theoretic operations are binary operations that are applied to two sets of tuples.
Union Compatibility:
The two relations on which these operations are applied must have the same type of tuples. This condition is called union compatibility.
Two relations R(A1, A2, ..., An) and S(B1, B2, ..., Bn) are said to be union compatible if they have the same degree n (the same number of attributes) and if dom(Ai) = dom(Bi) for each i, i.e., each corresponding pair of attributes has the same domain.
Set operators:
Union (∪)
Intersection (∩)
Set Difference (−)
Cartesian Product (×)
Union, intersection, and set difference are defined only on union-compatible relations.

(i) Union: The result of the operation, denoted by R ∪ S, is a relation that includes all tuples that are either in R, or in S, or in both R and S. Duplicate tuples are eliminated.
(ii) Intersection: The result of the operation, denoted by R ∩ S, is a relation that includes all tuples that are in both R and S.
Both union and intersection are commutative operations and can be treated as n-ary operations applicable to any number of relations; both are also associative:
R ∪ S = S ∪ R and R ∩ S = S ∩ R
R ∪ (S ∪ T) = (R ∪ S) ∪ T ; R ∩ (S ∩ T) = (R ∩ S) ∩ T
(iii) Set Difference: The result of this operation, denoted by R − S, is a relation that includes all tuples that are in R but not in S.
The minus operation is not commutative:
R − S ≠ S − R

STUDENT
FN        LN
Suresh    Rao
Bala      Ganesh
Ravi      Sankar
Mohan     Varma

INSTRUCTOR
FNAME     LNAME
Kishore   Das
Sushil    Sharma
Suresh    Rao
Rakesh    Agarwal
Bala      Ganesh
Ashwin    Kumar

STUDENT ∪ INSTRUCTOR
FN        LN
Suresh    Rao
Bala      Ganesh
Ravi      Sankar
Mohan     Varma
Kishore   Das
Sushil    Sharma
Rakesh    Agarwal
Ashwin    Kumar

STUDENT ∩ INSTRUCTOR
FN        LN
Suresh    Rao
Bala      Ganesh

STUDENT − INSTRUCTOR
FN        LN
Ravi      Sankar
Mohan     Varma

INSTRUCTOR − STUDENT
FN        LN
Kishore   Das
Sushil    Sharma
Rakesh    Agarwal
Ashwin    Kumar
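Because union-compatible relations are sets of tuples, the example above can be reproduced directly with Python sets (the tuples shown are the reconstructed STUDENT and INSTRUCTOR relations; the order of tuples in the output is not significant):

STUDENT = {("Suresh", "Rao"), ("Bala", "Ganesh"), ("Ravi", "Sankar"), ("Mohan", "Varma")}
INSTRUCTOR = {("Kishore", "Das"), ("Sushil", "Sharma"), ("Suresh", "Rao"),
              ("Rakesh", "Agarwal"), ("Bala", "Ganesh"), ("Ashwin", "Kumar")}

print(STUDENT | INSTRUCTOR)   # union: 8 tuples, duplicates eliminated
print(STUDENT & INSTRUCTOR)   # intersection: {('Suresh', 'Rao'), ('Bala', 'Ganesh')}
print(STUDENT - INSTRUCTOR)   # {('Ravi', 'Sankar'), ('Mohan', 'Varma')}
print(INSTRUCTOR - STUDENT)   # not equal to STUDENT - INSTRUCTOR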

(iv) Cartesian Product (or Cross Product) Operation:


It is denoted by ×. It is a binary set operation, but the relations on which it is applied need not be union compatible. This operation is used to combine tuples from two relations in a combinatorial fashion.
The Cartesian product creates tuples with the combined attributes of the two relations.
R(A1, A2, ..., An) × S(B1, B2, ..., Bm) is a relation Q with degree n + m, Q(A1, ..., An, B1, ..., Bm).
The resultant relation Q has one tuple for each combination of tuples, one from R and one from S.

If R has nR tuples and S has nS tuples, then R × S will have nR * nS tuples.


The operation applied by itself generally produces meaningless tuples. It is useful when followed by a selection that matches values of attributes coming from the two relations.
FEMALE_EMPS ← σSEX='F'(EMPLOYEE)
EMPNAMES ← πFNAME, LNAME, SSN(FEMALE_EMPS)
EMP_DEPENDENTS ← EMPNAMES × DEPENDENT
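A sketch of the Cartesian product followed by a selection, using itertools.product on small invented relations; each result tuple is simply the concatenation of one tuple from each operand, so the result has nR * nS tuples.

from itertools import product

EMPNAMES = [("Alice", "Zelaya", "999887777"), ("Jennifer", "Wallace", "987654321")]
DEPENDENT = [("999887777", "Alicia"), ("999887777", "Elizabeth"), ("987654321", "Abner")]

EMP_DEPENDENTS = [e + d for e, d in product(EMPNAMES, DEPENDENT)]
print(len(EMP_DEPENDENTS))   # 2 * 3 = 6 combinations, most of them meaningless

# A subsequent selection keeps only the meaningful combinations (SSN = ESSN):
ACTUAL_DEPENDENTS = [t for t in EMP_DEPENDENTS if t[2] == t[3]]
print(len(ACTUAL_DEPENDENTS))   # 3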

3. BINARY RELATIONAL ALGEBRA OPERATIONS


(i) JOIN Operation:
The join operation, denoted by ⋈, is used to combine related tuples from two relations into single tuples. It allows us to process relationships among relations.

R ⋈<join condition> S

The join condition is of the form
<condition> AND <condition> AND ... AND <condition>, where each condition is of the form Ai θ Bj; Ai is an attribute of R, Bj is an attribute of S, and Ai and Bj must have the same domain. θ is one of the comparison operators {=, <, ≤, >, ≥, ≠}. A join with such a general join condition is called a theta join.
The result of the join is a relation Q with n + m attributes, Q(A1, ..., An, B1, ..., Bm). Q has one tuple combining a tuple from R and a tuple from S whenever the combination satisfies the join condition.
The join condition specified on attributes from the two relations R and S is evaluated for each combination of tuples. Each tuple combination for which the condition evaluates to true is included in the result.
Equi Join: A join that involves join conditions with equality comparisons only, i.e., only the = operator appears in the join condition, is known as an equi join.
DEPT_MGR ← DEPARTMENT ⋈MGRSSN=SSN EMPLOYEE
Natural Join: denoted by *. An equi join in which the two join attributes have the same name in both relations; the attribute involved in the join condition is known as the join attribute.
DEPT_LOCS ← DEPARTMENT * DEPT_LOCATIONS
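A sketch of an equi join in the same list-of-dictionaries style; DEPARTMENT and EMPLOYEE are invented toy relations, and the naive nested-loop evaluation is only the simplest possible strategy, not how a real DBMS executes joins.

DEPARTMENT = [{"DNAME": "Research", "DNUMBER": 5, "MGRSSN": "333445555"},
              {"DNAME": "Admin",    "DNUMBER": 4, "MGRSSN": "987654321"}]
EMPLOYEE   = [{"FNAME": "Franklin", "SSN": "333445555", "DNO": 5},
              {"FNAME": "Jennifer", "SSN": "987654321", "DNO": 4},
              {"FNAME": "John",     "SSN": "123456789", "DNO": 5}]

def equi_join(r, s, r_attr, s_attr):
    """R JOIN_{r_attr = s_attr} S, evaluated by a naive nested loop."""
    return [{**tr, **ts} for tr in r for ts in s if tr[r_attr] == ts[s_attr]]

DEPT_MGR = equi_join(DEPARTMENT, EMPLOYEE, "MGRSSN", "SSN")
for t in DEPT_MGR:
    print(t["DNAME"], t["FNAME"])   # Research Franklin / Admin Jennifer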

Outer Join: In an ordinary join, tuples without a matching tuple, and tuples with null values for the join attribute, are eliminated from the join result. Outer joins can be used to keep all the tuples in R, or all those in S, or all those in both relations in the result of the join, even if they do not have matching tuples.
An outer join preserves all the tuples whether or not they match on the join condition.
Left Outer Join: keeps every tuple of the left relation R in R ⟕ S even if no matching tuple is found in S; the attributes of S in the join result are filled with null values. It is denoted by ⟕.

To find the names of all employees, along with the names of the departments they manage, we
could use:
T1 ← EMPLOYEE ⟕SSN=MGRSSN DEPARTMENT
RESULT ← πFNAME, MINIT, LNAME, DNAME(T1)

Right Outer Join: keeps every tuple of the right relation S in R ⟖ S even if no matching tuple is found in R; the attributes of R in the join result are filled with null values. It is denoted by ⟖.
Full Outer Join: keeps all tuples of both relations R and S in R ⟗ S, even if no matching tuples are found, filling with null values when needed. It is denoted by ⟗.

The main difference between Cartesian product and join:


In a join, only the combinations of tuples satisfying the join condition appear in the result, whereas in a Cartesian product all combinations of tuples are included in the result.
(ii) Divide Operator (÷)
The division operator is applied to two relations R(Z) ÷ S(X), where X ⊆ Z. Let Y = Z − X, the set of attributes of R that are not in S. The result of the division is a relation T(Y); for a tuple t to appear in the result T, the values in t must appear in R in combination with every tuple in S.

(An example figure with relations r and s and the result r ÷ s appeared here.)
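A sketch of division on small invented relations: if R(Student, Course) records enrolments and S(Course) lists some required courses, R ÷ S returns the students enrolled in every course of S.

R = {("Ravi", "DBMS"), ("Ravi", "OS"), ("Bala", "DBMS"),
     ("Mohan", "DBMS"), ("Mohan", "OS")}
S = {"DBMS", "OS"}

def divide(r, s):
    """R(Student, Course) divided by S(Course): students paired with every course in S."""
    candidates = {student for student, _ in r}
    return {student for student in candidates
            if s <= {course for st, course in r if st == student}}

print(divide(R, S))   # {'Ravi', 'Mohan'} -- Bala is missing OS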

Null Values
It is possible for tuples to have a null value, denoted by null, for some of their attributes
null signifies an unknown value or that a value does not exist.
The result of any arithmetic expression involving null is null.
Aggregate functions simply ignore null values (as in SQL)
For duplicate elimination and grouping, null is treated like any other value, and two nulls are
assumed to be the same (as in SQL)
4. Aggregate Functions and Grouping

The relational algebra allows us to specify mathematical aggregate functions on collections of values
from the database. Examples of such functions include retrieving the average or total salary of all
employees or the number of employee tuples. Common functions applied to collections of numeric
values include SUM, AVERAGE, MAXIMUM, and MINIMUM. The COUNT function is used
for counting tuples or values.
Another type may be grouping the tuples in a relation by the value of some of their attributes and
then applying an aggregate function independently to each group. An example would be to group
employee tuples by DNO, so that each group includes the tuples for employees working in the
same department. We can then list each DNO value along with, say, the average salary of
employees within the department.
<grouping attributes> ℱ <function list>(R)

where <grouping attributes> is a list of attributes of the relation specified in R, and <function list>
is a list of (<function> <attribute>) pairs. In each such pair, <function> is one of the allowed
functions, such as SUM, AVERAGE, MAXIMUM, MINIMUM, or COUNT, and <attribute> is
an attribute of the relation specified by R.
The resulting relation has the grouping attributes plus one attribute for each element in the function
list.
1. ℱMAXIMUM Salary(EMPLOYEE) retrieves the maximum salary value from the EMPLOYEE relation:
MAXIMUM_Salary
55000

2. Retrieve each department number, the number of employees in the department, and their
average salary.
DNO ℱ COUNT SSN, AVERAGE Salary(EMPLOYEE)


3. Without the grouping attribute, you would get the result for all tuples together
ℱ COUNT SSN, AVERAGE Salary(EMPLOYEE)

Note: The results of aggregate functions are still relations, not numbers.
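A sketch of grouping with aggregate functions over an invented EMPLOYEE relation, mirroring DNO ℱ COUNT SSN, AVERAGE Salary(EMPLOYEE); note that the result is again a relation (a list of tuples), not a bare number.

from collections import defaultdict

# Each tuple: (SSN, Salary, DNO)
EMPLOYEE = [("123", 30000, 5), ("333", 40000, 5), ("999", 25000, 4), ("987", 43000, 4)]

groups = defaultdict(list)
for ssn, salary, dno in EMPLOYEE:
    groups[dno].append(salary)

# One result tuple per group: (DNO, COUNT SSN, AVERAGE Salary)
result = [(dno, len(sals), sum(sals) / len(sals)) for dno, sals in groups.items()]
print(result)   # [(5, 2, 35000.0), (4, 2, 34000.0)]

# Without grouping attributes the functions apply to all tuples together:
salaries = [sal for _, sal, _ in EMPLOYEE]
print([(len(salaries), sum(salaries) / len(salaries))])   # [(4, 34500.0)]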
7. Logical Database Design: Relational DBMS - Codd's Rule
A relational database management system (RDBMS) is a database management system
(DBMS) that is based on the relational model as introduced by E. F. Codd. Most popular
commercial and open source databases currently in use are based on the relational model.
A short definition of an RDBMS may be a DBMS in which data is stored in the form of tables
and the relationship among the data is also stored in the form of tables.
E. F. Codd, the famous mathematician, introduced 12 rules for the relational model for databases, commonly known as Codd's rules. The rules mainly define what is required for a DBMS to be considered relational, i.e., an RDBMS. There is also one more rule, Rule 0, which specifies that a relational system must use its relational facilities to manage the database. The rules and their descriptions are as follows:
Rule 0: Foundation Rule
A relational database management system should be capable of using its relational facilities
(exclusively) to manage the database.
Rule 1: Information Rule

All information in the database is to be represented in one and only one way. This is achieved by
values in column positions within rows of tables.
Rule 2: Guaranteed Access Rule
All data must be accessible with no ambiguity, that is, Each and every datum (atomic value) is
guaranteed to be logically accessible by resorting to a combination of table name, primary key
value and column name.
Rule 3: Systematic treatment of null values
Null values (distinct from empty character string or a string of blank characters and distinct from
zero or any other number) are supported in the fully relational DBMS for representing missing
information in a systematic way, independent of data type.
Rule 4: Dynamic On-line Catalog Based on the Relational Model
The database description is represented at the logical level in the same way as ordinary data, so
authorized users can apply the same relational language to its interrogation as they apply to regular
data. The authorized users can access the database structure by using common language i.e. SQL.
Rule 5: Comprehensive Data Sublanguage Rule
A relational system may support several languages and various modes of terminal use. However,
there must be at least one language whose statements are expressible, per some well-defined
syntax, as character strings, and that is comprehensive in supporting all of the following:
a. data definition
b. view definition
c. data manipulation (interactive and by program)
d. integrity constraints
e. authorization
f. Transaction boundaries (begin, commit, and rollback).

Rule 6: View Updating Rule



All views that are theoretically updateable are also updateable by the system.
Rule 7: High-level Insert, Update, and Delete
The system must support insert, update, and delete operations fully. It must also be able to perform these operations on multiple rows simultaneously.
Rule 8: Physical Data Independence
Application programs and terminal activities remain logically unimpaired whenever any changes
are made in either storage representation or access methods.
Rule 9: Logical Data Independence
Application programs and terminal activities remain logically unimpaired when information
preserving changes of any kind that theoretically permit unimpairment are made to the base tables.
Rule 10: Integrity Independence
Integrity constraints specific to a particular relational database must be definable in the relational
data sublanguage and storable in the catalog, not in the application programs.
Rule 11: Distribution Independence
The data manipulation sublanguage of a relational DBMS must enable application programs and
terminal activities to remain logically unimpaired whether and whenever data are physically
centralized or distributed.
Rule 12: Nonsubversion Rule
If a relational system has or supports a low-level (single-record-at-a-time) language, that low-level
language cannot be used to subvert or bypass the integrity rules or constraints expressed in the
higher-level (multiple-records-at-a-time) relational language.
ENTITY-RELATIONSHIP MODEL

The basic object that the ER model represents is an entity, which is a "thing" in the real
world with an independent existence.

An entity may be an object with a physical existence (a particular person, car, house, or employee) or it may be an object with a conceptual existence (a company, a job, or a university course).

Each entity has attributes, the particular properties that describe it.

For example, an employee entity may be described by the employee's name, age, address, salary, and job.

A particular entity will have a value for each of its attributes.

The attribute values that describe each entity become a major part of the data stored in the
database.

Types of attributes:

Simple versus Composite

Single-Valued versus Multivalued

Stored versus Derived

Composite Versus Simple (Atomic) Attributes


Composite attributes can be divided into smaller subparts, which represent more basic
attributes with independent meanings. For example, the Address attribute of the employee entity
can be sub-divided into StreetAddress, City, State, and Zip (Note 2), with the values "2311 Kirby,"
"Houston," "Texas," and "77001." Attributes that are not divisible are called simple or atomic
attributes.
Composite attributes can form a hierarchy; for example, StreetAddress can be subdivided
into three simple attributes, Number, Street, and ApartmentNumber,
The value of a composite attribute is the concatenation of the values of its constituent simple
attributes.

Composite attributes are useful to model situations in which a user sometimes refers to the
composite attribute as a unit but at other times refers specifically to its components.
If the composite attribute is referenced only as a whole, there is no need to subdivide it into component attributes. For example, if there is no need to refer to the individual components of an address (Zip, Street, and so on), then the whole address is designated as a simple attribute.
Single-valued Versus Multivalued Attributes
Most attributes have a single value for a particular entity; such attributes are called single-valued. For example, Age is a single-valued attribute of a person. In some cases an attribute can have a set of values for the same entity, for example a Colors attribute for a car, or a CollegeDegrees attribute for a person. Cars with one color have a single value, whereas two-tone cars have two values for Colors. Similarly, one person may not have a college degree, another person may have one, and a third person may have two or more degrees; so different persons can have different numbers of values for the CollegeDegrees attribute. Such attributes are called multivalued.
A multivalued attribute may have lower and upper bounds on the number of values allowed
for each individual entity. For example, the Colors attribute of a car may have between one and
three values, if we assume that a car can have at most three colors.
Stored Versus Derived Attributes
In some cases two (or more) attribute values are related, for example, the Age and BirthDate attributes of a person. For a particular person entity, the value of Age can be determined from the current (today's) date and the value of that person's BirthDate. The Age attribute is hence
called a derived attribute and is said to be derivable from the BirthDate attribute, which is called a
stored attribute. Some attribute values can be derived from related entities; for example, an
attribute NumberOfEmployees of a department entity can be derived by counting the number of
employees related to (working for) that department.

Entity Types, Entity Sets, Keys, and Value Sets

Entity Types and Entity Sets


o A database usually contains groups of entities that are similar. For example, a
company employing hundreds of employees may want to store similar information
concerning each of the employees. These employee entities share the same
attributes, but each entity has its own value(s) for each attribute.

o An entity type defines a collection (or set) of entities that have the same attributes.
Each entity type in the database is described by its name and attributes.
o Figure shows two entity types, named EMPLOYEE and COMPANY, and a list of
attributes for each. A few individual entities of each type are also illustrated, along
with the values of their attributes.
o The collection of all entities of a particular entity type in the database at any point
in time is called an entity set; the entity set is usually referred to using the same
name as the entity type. For example, EMPLOYEE refers to both a type of entity as
well as the current set of all employee entities in the database.
o An entity type is represented in ER diagrams as a rectangular box enclosing the
entity type name.
o Attribute names are enclosed in ovals and are attached to their entity type by
straight lines.
o Composite attributes are attached to their component attributes by straight lines.
o Multivalued attributes are displayed in double ovals.
o An entity type describes the schema or intension for a set of entities that share the
same structure.
o The collection of entities of a particular entity type are grouped into an entity set,
which is also called the extension of the entity type.
Key Attributes of an Entity Type
o An important constraint on the entities of an entity type is the key or uniqueness
constraint on attributes.
o An entity type usually has an attribute whose values are distinct for each individual
entity in the collection. Such an attribute is called a key attribute, and its values can
be used to identify each entity uniquely.
o For example, the Name attribute is a key of the COMPANY entity type in Figure,
because no two companies are allowed to have the same name. For the PERSON
entity type, a typical key attribute is SocialSecurityNumber.
o Sometimes, several attributes together form a key, meaning that the combination of
the attribute values must be distinct for each entity. If a set of attributes possesses
this property, we can define a composite attribute that becomes a key attribute of
the entity type.
o In ER diagrammatic notation, each key attribute has its name underlined inside the
oval,

Value Sets (Domains) of Attributes


o Each simple attribute of an entity type is associated with a value set (or domain of
values), which specifies the set of values that may be assigned to that attribute for
each individual entity. If the range of ages allowed for employees is between 16 and
70, we can specify the value set of the Age attribute of EMPLOYEE to be the set of
integer numbers between 16 and 70.

Extended Entity-Relationship (EER) Model


The extended entity-relationship (EER) model is a language for definition of structuring
(and functionality) of database or information systems. It uses inductive development of
structuring.
Basic attributes are assigned to base data types. Complex attributes can be constructed by
applying constructors such as tuple, list or set constructors to attributes that have already been
constructed.
Entity types conceptualize structuring of things of reality through attributes.
Relationship types associate types that have already been constructed into an association
type. The types may be restricted by integrity constraints and by specification of identification of
objects defined on the corresponding type. Typical integrity constraints of the extended entity-relationship model are participation, lookup, and general cardinality constraints.
Entity, cluster and relationship classes contain a finite set of objects defined on these types.
The types of an EER schema are typically depicted by an EER diagram.

The extended entity-relationship model is mainly used as a language for conceptualization


of the structure of information systems applications. Conceptualization of database or information
systems aims at a representation of the logical and physical structure of an information system in a
given database management system (or for a database paradigm), so that it contains all the
information required by the user and required for the efficient behavior of the whole information
system for all users. Furthermore, conceptualization may target to specify the database application
processes and the user interaction. Description of structuring is currently the main use of the
extended ER model. The diagram representation of EER schemata uses rectangles and diamonds
for the entity and relationship types.


Functional Dependencies, Anomaly-1NF to 5NF:


12.1 Functional Dependencies
A functional dependency is a constraint between two sets of attributes from the database.
Let R be the relational schema R = {A1, A2, ..., An}. A functional dependency, denoted by X → Y, between two sets of attributes X and Y that are subsets of R, specifies a constraint on the possible tuples that can form a relation state r of R.
X → Y is read: X functionally determines Y, or there is a functional dependency from X to Y, or Y is functionally dependent on X.
X is the L.H.S. of the FD and Y is the R.H.S. of the FD.

Definition 1:
X functionally determines Y in a relational schema R iff whenever two tuples of r(R) agree on their X value, they must necessarily agree on their Y value.
An FD is a property of the semantics or meaning of the attributes.
Definition 2:
For any two tuples t1 and t2 in r, if t1[X] = t2[X], we must have t1[Y] = t2[Y]; i.e., the values of the Y component of a tuple in r depend on, and are determined by, the values of the X component.

E.g., SSN → ENAME, i.e., SSN uniquely determines ENAME.
Pnumber → {Pname, Location}
{SSN, Pnumber} → Hours
Relation states r(R) that satisfy the functional dependency constraints are called legal extensions or legal relation states.
Diagrammatic Notation:
An arrow is drawn from the L.H.S. to the R.H.S. of an FD, e.g.,
SSN → ENAME
DNO → {DNAME, MGRSSN}

Inference rules:
Let F be the set of f.d.s specified on R. Numerous other functional dependencies also hold
in every relation state of R that satisfies F.
The set of all such dependencies inferred from F is called the closure of F, denoted by F+.
SSN → {Ename, Bdate, Address, Dno}
DNO → {Dname, MGRSSN}
Inferred f.d.s in F+:
SSN → {Dname, MGRSSN}
SSN → Ename


An F.D. X → Y is inferred from a set of dependencies F specified on R if X → Y holds in every
legal relation state r(R); we write

F ⊨ X → Y

i.e., X → Y is inferred from F.

IR1 (Reflexive): If Y ⊆ X, then X → Y.
IR2 (Augmentation): {X → Y} ⊨ XZ → YZ.
IR3 (Transitive): {X → Y, Y → Z} ⊨ X → Z.
IR4 (Decomposition): {X → YZ} ⊨ X → Y.
IR5 (Union): {X → Y, X → Z} ⊨ X → YZ.
IR6 (Pseudotransitive): {X → Y, WY → Z} ⊨ WX → Z.

Closure of X under F:
For each set of attributes X appearing in the L.H.S. of an f.d. in F, determine the set of all attributes
dependent on X. Thus for each such X we determine the set X+ of attributes functionally determined by X based on F.
X+ is called the closure of X under F.
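As a small worked example, using the dependencies SSN → {Ename, Bdate, Address, Dno} and DNO → {Dname, MGRSSN} listed earlier: starting from {SSN}, the first f.d. adds Ename, Bdate, Address and Dno, and the second then adds Dname and MGRSSN, so {SSN}+ = {SSN, Ename, Bdate, Address, Dno, Dname, MGRSSN}; similarly {DNO}+ = {DNO, Dname, MGRSSN}.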
Equivalence sets of f.d.s
A set of F.D.s E is covered by a set of f.d.s F (or F covers E) if every F.D. in E can be inferred
from F (i.e., is in F+).
Two sets of F.D.s E and F are said to be equivalent if E+ = F+, i.e., every F.D. in E can be
inferred from F and vice versa; that is, E covers F and F covers E.
Minimal set of functional dependency:
A set of functional dependencies F is said to be minimal if
1) Every dependency in F has a single attribute in its R.H.S.
2) We cannot replace any f.d. X → A in F with an f.d. Y → A, where Y ⊂ X, and still have a set of functional
dependencies equivalent to F.
3) We cannot remove any dependency from F and still have a set of dependencies equivalent
to F.

It is a standard or canonical form with no redundancies.


Minimal Cover:
A minimal cover of set of f.ds F is a minimal set of dependencies Fmin that is equivalent to F.
Trivial and Nontrivial Functional Dependency
A functional dependency X → Y is trivial if Y ⊆ X; otherwise it is non-trivial.

12.2 Normal Forms:


Normalization was proposed by Codd (1972)

It takes a relational schema through a series of tests to check whether it satisfies a certain normal
form.

It proceeds in a top-down fashion (design by analysis).

Codd defined 1NF, 2NF, 3NF and a stronger definition of 3NF known as
BCNF (Boyce-Codd Normal Form).

All these normal forms are based on f.ds among attributes of a relation.

Normalization:
It is the process of analyzing the given relational schemas based on their FDs and primary keys to
achieve the desirable properties of minimizing redundancy and minimizing the insertion and update
anomalies.
Relational schemas that do not satisfy the normal form tests are decomposed into smaller relational
schemas that meet the tests and possess the desirable properties.
Normalization procedure provides:
1) Framework for analyzing relational schema based on keys and f.ds
2) A series of normal form tests carried out on each relational schema so that the database
is normalized to the desired degree.


The normal form of a relation is the highest normal form condition that it meets and hence indicates
the degree to which it is normalized.
Normalisation through decomposition must meet:
1) Lossless join or non-additive join property:
The spurious tuples problem does not occur with respect to the schemas after decomposition.
2) Dependency preservation property:
Each f.d. is represented in some individual relation after decomposition.
The process of storing the join of higher normal form relations as a base relation in lower normal
form is known as denormalization.
Superkey: A superkey S of R is a set of attributes with the property that no two tuples t1 and t2 in
r(R) will have t1[S] = t2[S]; that is, any set of attributes able to identify any row of the relation
uniquely is a superkey.
Candidate Key: A candidate key K is a minimal superkey: the removal of any attribute Ai from
K = {A1, ..., Am} will cause K to no longer be a superkey.
E.g., {SSN, Ename} is a superkey and {SSN} is a candidate key.
Primary Key: The primary key is one of the candidate keys, chosen to identify a specific row in the
database in a unique manner.
An attribute of a relation schema R which is a member of some candidate key of R is called a prime
attribute; otherwise it is a non-prime attribute.
First Normal Form:
Statement of First normal form:
The domain of an attribute must include only atomic (simple, indivisible) values, and the value of an
attribute in a tuple must be a single value from the domain of that attribute.

1NF disallows sets of values and tuples of values, i.e., it disallows a relation within a relation or
relations as attribute values (it disallows composite and multivalued attributes).
E.g., consider the DEPARTMENT relation schema whose primary key is DNUMBER and which has a
DLOCATIONS attribute; each department can have a number of locations.
DEPARTMENT(DNAME, DNUMBER, DMGRSSN, DLOCATIONS)   (not in 1NF)

1) The domain of DLOCATIONS contains atomic values, but some tuples can have a set of these
values (one tuple per location). Therefore DLOCATIONS is not functionally dependent on DNUMBER.

DNAME       DNUMBER   DMGRSSN   DLOCATIONS
Research    334       ...       Bellaire
Research    334       ...       Alaska
Research    334       ...       Newyork

2) The domain of DLOCATIONS contains sets of values and is hence non-atomic, but
DNUMBER → DLOCATIONS exists.

DNAME       DNUMBER   DMGRSSN   DLOCATIONS
Research    334       ...       {Bellaire, Alaska, Newyork}

Normalisation to achieve 1NF:


Three ways to achieve first normal form
1. Remove the attribute DLOCATIONS and place it in a separate relation along with the
primary key DNUMBER of the department. The primary key of this new relation is the combination
{DNUMBER, DLOCATION}, so a distinct tuple exists for each location of a department.
The resulting relation schemas are:

(DNAME, DNUMBER, DMGRSSN)

(DNUMBER, DLOCATION)

2. Expand the key so that there will be a separate tuple for each location of a department. But this
introduces redundancy in the relation.
3. If the maximum number of values for DLOCATIONS is known, e.g. 3, replace DLOCATIONS
by three atomic attributes LOC1, LOC2, LOC3. This introduces more null values in the relation.
The 1 NF also disallows multivalued attributes that are themselves composite. These are called
nested relations.
Test for First Normal Form:
Relation should not have non-atomic attributes or nested relation.
Remedy
Form new relations for each non-atomic attribute or nested relation.
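A minimal SQL sketch of the first option above, assuming the attribute names of the DEPARTMENT example (the table name DEPT_LOCATIONS and the column types and sizes are illustrative assumptions, not taken from the notes):

CREATE TABLE DEPARTMENT (
    DNAME    VARCHAR(25),
    DNUMBER  INT PRIMARY KEY,
    DMGRSSN  CHAR(9)
);

CREATE TABLE DEPT_LOCATIONS (
    DNUMBER    INT,
    DLOCATION  VARCHAR(25),
    PRIMARY KEY (DNUMBER, DLOCATION),   -- one row per department location, so every value is atomic
    FOREIGN KEY (DNUMBER) REFERENCES DEPARTMENT (DNUMBER)
);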
Second Normal Form:
2NF is based on Full Functional Dependency.
Full Functional Dependency:
An FD X → Y is a full functional dependency if the removal of any attribute A from X means the dependency no longer holds; that is, for every A ∈ X, (X − {A}) does not functionally determine Y. E.g., {SSN, PNumber} → Hours.
Partial Functional Dependency:
An FD X → Y is a partial dependency if some attribute A ∈ X can be removed from X and
the dependency still holds; i.e., for some A ∈ X, (X − {A}) → Y.

E.g., {SSN, PNumber} → Ename (Ename depends on SSN alone).

Statement of Second normal form:


A relational schema R is in 2NF if every nonprime attribute A is fully functionally dependent on
the primary key of R.


If a relational schema R is not in 2NF, it can be normalized into a number of 2NF
relations in which non-prime attributes are associated only with the part of the primary key on
which they are fully functionally dependent. E.g., consider the relation
(SSN, PNumber, Hours, Ename, Pname, PLocation), which is NOT in 2NF:

FD1: {SSN, PNumber} → Hours
FD2: SSN → Ename
FD3: PNumber → {Pname, PLocation}

Ename, Pname and PLocation violate 2NF: FD2 and FD3 are partial dependencies on the primary key {SSN, PNumber}.
Normalisation (Decomposition) to achieve 2NF:

(SSN, Ename)                  with FD2: SSN → Ename
(SSN, PNumber, Hours)         with FD1: {SSN, PNumber} → Hours
(PNumber, Pname, PLocation)   with FD3: PNumber → {Pname, PLocation}
Test for Second Normal Form
For relations where primary key contains multiple attributes, non-key or non-prime
attribute should not be functionally dependent on a part of primary key.
Remedy:
Decompose and set up a new relation for each partial key with its dependent attribute(s).
Make sure to keep a relation with the original primary key and any attributes that are fully
functionally dependent on it.
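A hedged SQL sketch of this 2NF decomposition (the table names EMP_PROJ1, EMP_PROJ2 and EMP_PROJ3 and the column types are assumptions for illustration only):

CREATE TABLE EMP_PROJ1 (          -- FD1: {SSN, PNumber} -> Hours
    SSN      CHAR(9),
    PNumber  INT,
    Hours    DECIMAL(4,1),
    PRIMARY KEY (SSN, PNumber)
);

CREATE TABLE EMP_PROJ2 (          -- FD2: SSN -> Ename
    SSN    CHAR(9) PRIMARY KEY,
    Ename  VARCHAR(30)
);

CREATE TABLE EMP_PROJ3 (          -- FD3: PNumber -> {Pname, PLocation}
    PNumber    INT PRIMARY KEY,
    Pname      VARCHAR(30),
    PLocation  VARCHAR(30)
);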
Third Normal Form:

It is based on Transitive dependency.


Transitive Dependency:
A functional dependency X → Y in R is a transitive dependency if there is a set of attributes Z (that
is neither a candidate key nor a subset of any key of R) such that both X → Z and Z → Y hold.
Statement of Third Normal Form:
According to Codd's definition, a relational schema R is in 3NF if and only if it satisfies 2NF
and every non-prime attribute is non-transitively dependent on the primary key.
General Definition of 3NF:
A relational schema R is in 3NF if whenever a nontrivial functional dependency X → A
holds in R, either a) X is a superkey of R, or b) A is a prime attribute of R.
This definition states that a table is in 3NF if and only if, for each of its functional dependencies
X → A, at least one of the following conditions holds:

X contains A (that is, X → A is a trivial functional dependency), or

X is a superkey, or

A is a prime attribute (i.e., A is contained within a candidate key).

E.g., EMPLOYEE(Ename, SSN, Bdate, Address, Dno, Dname, Dmgrssn)

SSN → Dmgrssn is transitive through Dno (SSN → Dno and Dno → Dmgrssn).

Decomposition into:

(Ename, SSN, Bdate, Address, Dno)

(Dno, Dname, Dmgrssn)
Test for Third Normal Form:



A relation should not have a non-key attribute functionally determined by another non-key
attribute; i.e., there should be no transitive dependency of a non-key attribute on the primary key.
Remedy:
Decompose and set up a relation that includes the non-key attribute(s) that functionally
determine other non-key attribute(s).
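A hedged SQL sketch of the EMPLOYEE decomposition above (the table name DEPT and the column types are assumptions for illustration; the transitive dependency SSN → Dno → Dmgrssn is now split across the two tables):

CREATE TABLE EMPLOYEE (
    Ename    VARCHAR(30),
    SSN      CHAR(9) PRIMARY KEY,
    Bdate    DATE,
    Address  VARCHAR(50),
    Dno      INT
);

CREATE TABLE DEPT (
    Dno      INT PRIMARY KEY,
    Dname    VARCHAR(25),
    Dmgrssn  CHAR(9)
);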
Boyce-Codd normal form (BCNF):
It is stricter than 3NF: every relation in BCNF is also in 3NF. A relation R is in BCNF if,
whenever a non-trivial dependency X → A holds in R, X is a superkey of R.
The condition of 3NF, which allows A to be prime, is absent from BCNF.
A relation is in BCNF if and only if every determinant is a candidate key.
ClientInterview

clientNo   interviewDate   interviewTime   staffNo   roomNo
CR76       13-May-02       10.30           SG5       G101
CR76       13-May-02       12.00           SG5       G101
CR74       13-May-02       12.00           SG37      G102
CR56       1-Jul-02        10.30           SG5       G102

FD1: clientNo, interviewDate → interviewTime, staffNo, roomNo   (primary key)
FD2: staffNo, interviewDate, interviewTime → clientNo           (candidate key)
FD3: roomNo, interviewDate, interviewTime → clientNo, staffNo   (candidate key)
FD4: staffNo, interviewDate → roomNo                            (not a candidate key)


As a consequence, the ClientInterview relation may suffer from update anomalies.
For example, two tuples have to be updated if the roomNo needs to be changed for staffNo SG5 on
13-May-02. To transform the relation into BCNF, we create two new relations called Interview and
StaffRoom as shown below:
Interview (clientNo, interviewDate, interviewTime, staffNo)

StaffRoom (staffNo, interviewDate, roomNo)

Interview

clientNo   interviewDate   interviewTime   staffNo
CR76       13-May-02       10.30           SG5
CR76       13-May-02       12.00           SG5
CR74       13-May-02       12.00           SG37
CR56       1-Jul-02        10.30           SG5

StaffRoom

staffNo   interviewDate   roomNo
SG5       13-May-02       G101
SG37      13-May-02       G102
SG5       1-Jul-02        G102
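A hedged SQL sketch of this BCNF decomposition (the column types are assumptions for illustration; note that FD4 staffNo, interviewDate → roomNo now has a candidate key as its determinant in StaffRoom):

CREATE TABLE Interview (
    clientNo       CHAR(4),
    interviewDate  DATE,
    interviewTime  CHAR(5),
    staffNo        CHAR(4),
    PRIMARY KEY (clientNo, interviewDate)
);

CREATE TABLE StaffRoom (
    staffNo        CHAR(4),
    interviewDate  DATE,
    roomNo         CHAR(4),
    PRIMARY KEY (staffNo, interviewDate)
);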

Domain-Key Normal Form (DKNF)


There is no hard and fast rule about defining normal forms only up to 5NF. Historically,
the process of normalization and the process of discovering undesirable dependencies were carried
through 5NF as a meaningful design activity, but it has been possible to define stricter normal
forms that take into account additional types of dependencies and constraints.
The idea behind domain-key normal form (DKNF) is to specify (theoretically, at least) the
"ultimate normal form" that takes into account all possible types of dependencies and constraints.
A relation is said to be in DKNF if all constraints and dependencies that should hold on the
relation can be enforced simply by enforcing the domain constraints and key constraints on the
relation. For a relation in DKNF, it becomes very straightforward to enforce all database
constraints by simply checking that each attribute value in a tuple is of the appropriate domain and
that every key constraint is enforced.
Denormalization

The ultimate goal during normalization is to separate the logically related attributes into
tables to minimize redundancy, and thereby avoid the update anomalies that lead to an extra
processing overhead to maintain consistency in the database.


The above ideals are sometimes sacrificed in favor of faster execution of frequently
occurring queries and transactions. This process of storing the logical database design (which may
be in BCNF or 4NF) in a weaker normal form, say 2NF or 1NF, is called denormalization.
Typically, the designer adds to a table attributes that are needed for answering queries or
producing reports so that a join with another table, which contains the newly added attribute, is
avoided. This reintroduces a partial functional dependency or a transitive dependency into the
table, thereby creating the associated redundancy problems.
Other forms of denormalization consist of storing extra tables to maintain original
functional dependencies that are lost during a BCNF decomposition.
Differentiate 3 NF and BCNF:
The difference between 3NF and BCNF is that for a functional dependency A → B, 3NF allows
this dependency in a relation if B is a prime (candidate-key) attribute and A is not a candidate key,
whereas BCNF insists that for this dependency to remain in a relation, A must be a candidate key.
Note:
A non-prime attribute of R is an attribute that does not belong to any candidate key of R.
Denormalization
Denormalization doesn't mean not normalizing. It's a step in the process.
First we normalize, then we realize that we now have hundreds or thousands of small tables and
that the performance cost of all those joins will be prohibitive, and then carefully apply some
denormalization techniques so that the application will return results before the werewolves have
overrun the city.
Since normalization is about reducing redundancy, denormalizing is about deliberately adding
redundancy to improve performance. Before beginning, consider whether or not it is
necessary (Hollingsworth, Data Normalization, Denormalization, and the Forces of Darkness, p. 18):
Is the system's performance unacceptable with fully normalized data? Mock up a
client and do some testing.
If the performance is unacceptable, will denormalizing make it acceptable? Figure out
where your bottlenecks are.
If you denormalize to clear those bottlenecks, will the system and its data still be
reliable?
Unless the answers to all three are yes, denormalization should be avoided.
Unfortunately for me, the Council of Light has some pretty old hardware and can't or won't
upgrade to newer and more efficient SQL software. I'd better get to work.

Denormalization Strategies
Materialized Views
If we're lucky, we won't need to denormalize our logical data design at all. We can let the
database management system store additional information on disk, and it's the responsibility of the
DBMS software to keep the redundant information consistent.
Oracle does this with materialized views, which are pretty much what they sound like: SQL
views, but made material on the disk. Like regular views, they change when the underlying data
does. Microsoft SQL Server has something called indexed views, which I gather are similar although
I've never used them. Other SQL databases have standard hacks to simulate materialized views;
a little Googling will tell you whether or not yours does.
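As an illustration, a materialized view that pre-joins two frequently joined tables might be declared roughly as follows (the view name emp_dept_mv is an assumption; the table and column names come from the EMPLOYEE/DEPARTMENT examples earlier in these notes; how the copy is refreshed, e.g. ON COMMIT in Oracle or an explicit REFRESH MATERIALIZED VIEW in PostgreSQL, depends on the product):

CREATE MATERIALIZED VIEW emp_dept_mv AS
SELECT e.SSN, e.Ename, d.Dnumber, d.Dname   -- redundant copy of the join, kept consistent by the DBMS
FROM   EMPLOYEE e
JOIN   DEPARTMENT d ON e.Dno = d.Dnumber;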
Database Constraints
The more common approach is to denormalize some carefully chosen parts of the database
design, adding redundancies to speed performance. Danger lies ahead. It is now the database
designer's responsibility to create database constraints specifying how redundant pieces of data
will be kept consistent.
These constraints introduce a tradeoff. They do speed up reads, just as we wish, but they slow
down writes. If the database's regular usage is write-heavy, such as the Council of Light's
message forum, then this may actually be worse than the non-denormalized version.
Mirror Tables
If a particular table is too active, with constant querying and updating leading to slowdowns
and timeouts and deadlocks, consider a miniature and internal version of "double your database
fun": you can have a background and a foreground copy of the same table. The background table
takes only writes, the foreground table is for reading, and synchronizing the two is a low-priority
process that happens when less is going on.
Split Tables
The original table could also be maintained as a master, in which case the split tables are
special cases of mirror tables (above) which happen to contain a subset instead of a complete copy.
If few people want to query the original table, I could maintain it as a view which joins the split
tables and treat the split tables as masters.
That's a horizontal split, pulling rows out into different tables. For other needs we might do
a vertical split, keeping all the same rows/keys but with each table having separate sets of columns
of information.
This is only worthwhile when there are distinct kinds of queries which are, in effect,
already treating the table as more than one table.
Combined Tables
Instead of splitting one table into several, we might combine several into one. If tables have
a one-to-one relationship, it might speed performance to combine them into one table even if that
isn't part of the logical database design. We might even combine tables with a one-to-many
relationship, although that would significantly complicate our update process and would likely
work out to be a net loss.
Combined tables differ from joined tables (above) in that they already share some data in
some relationship, while any two tables which are frequently joined are potential candidates for
joined tables. They also differ in that a joined table usually populates itself from the co-existing
normalized tables, while a combined table will usually be the master copy.
Index Tables
These are nice for searching hierarchies. Remember how Normal 5 taught us that weapon
composition is actually a hierarchy of materials rather than a simple list? Well, now we're stuck
with searching that hierarchy by searching all the leaves under particular parent nodes and
combining the results with a bunch of union statements.
UNIT II
SQL & QUERY OPTIMIZATION
SQL Standards -Data types -Database Objects-DDL-DML-DCL-TCL-Embedded SQL-Static Vs
Dynamic SQL -QUERY OPTIMIZATION: Query Processing and Optimization -Heuristics and
cost Estimates in Query Optimization.

SQL - Structured Query Language Standards


SQL has evolved from IBM's Sequel (Structured English QUEry Language) language.
Advantages of SQL:
SQL is a standard relational-database Language.
SQL is a comprehensive database language; it has statements for data definition, query
and update. Hence it is both a DDL and a DML.
It has facilities for defining views on the database, specifying security and authorization,
for defining integrity constraints and for specifying transaction controls.
It also has rules for embedding SQL statements into general purpose programming
language such as C or Pascal.
Parts of SQL:
The SQL Language has several parts:
Data Definition Language - defines relation schemas, deletes relations and modifies
relational schemas.
Interactive Data Manipulation Language - based on both relational algebra and relational
calculus. It includes commands to insert, update, select and delete tuples in the database.
View Definition - defines views.
Transaction Control - specifies the beginning and end of transactions.

Embedded SQL and Dynamic SQL - SQL statements can be embedded in general purpose
programming language.
Integrity - SQL DDL commands specify integrity constraints that the data stored in the
database must satisfy. Updates that violate these constraints are disallowed.
Authorization - specifies access rights to relations and views.
Basic Structure of SQL:
The basic structure of SQL expression consists of three clauses: select, from and where
The SELECT clause corresponds to the project operator of the relational algebra to list the
attributes desired in the result of a query.
The FROM clause corresponds to the Cartesian product operation of the relational algebra
to list the relations in the evaluation of the expression.
The WHERE clause corresponds to the selection predicate of the relational algebra. It consists
of a predicate involving attributes of the relations that appear in the from clause.
A SQL query is of the form:
SELECT A1, A2, A3, ..., An
FROM r1, r2, r3, ..., rm
WHERE P
or
SELECT <attribute list>
FROM <table list>
WHERE <condition>;
where:
<attribute list> is a list of attribute names whose values are to be retrieved by the query.
<table list> is a list of the relation names required to process the query.
<condition> is a conditional (Boolean) expression that identifies the tuples to be
retrieved by
the query.
The SQL query is equivalent to the relational algebra expression
Π A1, A2, A3, ..., An (σ P (r1 × r2 × r3 × ... × rm))
SQL forms the Cartesian product of the relations named in the from clause, performs a
relational algebra selection using the where clause predicate and then projects the result onto the
attributes of the select clause.
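For instance, a query of this form over the EMPLOYEE and DEPARTMENT relations used earlier might be written as follows (the join condition Dno = Dnumber is an assumption based on those examples):

SELECT Ename, Dname
FROM   EMPLOYEE, DEPARTMENT
WHERE  EMPLOYEE.Dno = DEPARTMENT.Dnumber AND Dname = 'Research';

Here the WHERE clause both restricts the Cartesian product to matching tuples and selects the 'Research' department, while the SELECT clause projects the result onto Ename and Dname.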
2.2 DDL:

SQL uses the terms table, row, and column for relation, tuple, and attribute, respectively.
The SQL2 commands for data definition are CREATE, ALTER, TRUNCATE and DROP;
The SQL DDL allows specification of not only a set of relations, but also information
about each relation, including:

The schema for each relation

The domain values associated with each attribute.

The integrity constraints

The set of indices to be maintained for each relation.

The security and authorization information for each relation.

The physical storage structure of each relation on disk.

Schema and Catalog Concepts in SQL2


An SQL schema is identified by a schema name, and includes an authorization identifier
to indicate the user or account who owns the schema, as well as descriptors for each element in
the schema. Schema elements include the tables, constraints, views, domains, and other constructs
(such as authorization grants) that describe the schema. A schema is created via the CREATE
SCHEMA statement, which can include all the schema elements definitions. Alternatively, the
schema can be assigned a name and authorization identifier, and the elements can be defined later.
For example, the following statement creates a schema, owned by the user with
authorization identifier JSMITH:
CREATE SCHEMA AUTHORIZATION JSMITH;
A catalog always contains a special schema called INFORMATION_SCHEMA, which
provides information on all the element descriptors of all the schemas in the catalog to authorized
users. Integrity constraints such as referential integrity can be defined between relations only if
they exist in schemas within the same catalog. Schemas within the same catalog can also share
certain elements, such as domain definitions.
The CREATE TABLE Command and SQL2 Data Types and Constraints
The CREATE TABLE command is used to specify a new relation by giving it a name and
specifying its attributes and constraints. The attributes are specified first, and each attribute is
given a name, a data type to specify its domain of values, and any attribute constraints such as
NOT NULL.
CREATE TABLE r (A1 D1, A2 D2, ..., An Dn, (integrity constraint 1), (integrity
constraint 2), ..., (integrity constraint k))


where r is the name of the relation, each Ai is the name of an attribute in the schema of
relation r, and Di is the domain type of the values of attribute Ai.
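Following this general form, a minimal sketch for the DEPARTMENT relation used earlier might look like this (the column types and sizes are assumptions for illustration):

CREATE TABLE DEPARTMENT (
    DNAME    VARCHAR(25) NOT NULL,
    DNUMBER  INT         NOT NULL,
    DMGRSSN  CHAR(9),
    PRIMARY KEY (DNUMBER),     -- key constraint
    UNIQUE (DNAME)             -- alternate key constraint
);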
DML(Data Manipulation Language)
It is a language that provides a set of operation to support the basic data manipulation, operation
on the data held in the database.
Commands used are:
Insert
Delete
Select
Update
DCL (Data Control Language)
Grant
Revoke
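Brief examples of these commands against the DEPARTMENT table sketched above (the DMGRSSN values are hypothetical placeholders; JSMITH is the authorization identifier used in the schema example):

INSERT INTO DEPARTMENT (DNAME, DNUMBER, DMGRSSN) VALUES ('Research', 334, '123456789');
SELECT * FROM DEPARTMENT WHERE DNUMBER = 334;
UPDATE DEPARTMENT SET DMGRSSN = '987654321' WHERE DNUMBER = 334;
DELETE FROM DEPARTMENT WHERE DNUMBER = 334;

GRANT SELECT, UPDATE ON DEPARTMENT TO JSMITH;
REVOKE UPDATE ON DEPARTMENT FROM JSMITH;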
Data/Domain types in SQL
The SQL standard supports a variety of built-in domain types including:
char(n) - fixed-length character string with length n.
varchar(n) - variable-length character string with maximum length n.
int - integer.
smallint - small integer, a subset of integer.
number(n) - a number with n digits.
number(p,d) - a fixed-point number with p digits, d of which are to the right of the decimal point.
real - floating-point number.
float(n) - a floating-point number with a precision of at least n digits.
date - a calendar date containing day, month and four-digit year.
time - the time of day in hours, minutes and seconds.
timestamp - a combination of date and time.

Database Objects
View
A view is a subset of a database; it is a personalized model of a database. A view can hide
data that a user does not need to see, which simplifies the usage of the system and enhances security.
A user who is not allowed to directly access a relation may be allowed to access a part of that relation
through a view. A view may also be called a virtual table.

Provide a mechanism to hide certain data from the view of certain users. To create a view we
use the command:
create view v as <query expression>
where:<query expression> is any legal expression
The view name is represented by v
create view <viewname> as select <fields> from <table name>;
Update of a View

Create a view of all loan data in loan relation, hiding the amount attribute

create view branch-loan as


select branch-name, loan-number
from loan

Add a new tuple to branch-loan

insert into branch-loan


values ('Perryridge', 'L-307')

This insertion must be represented by the insertion of the tuple

('L-307', 'Perryridge', null)

into the loan relation

Updates on more complex views are difficult or impossible to translate, and hence are
disallowed.

Most SQL implementations allow updates only on simple views (without aggregates)
defined on a single relation

Advantages of Views:
1. Provide automatic security for hidden data.
2. Different views of same data for different users.
3. Provide logical data independence.
4. Provides the principle of interchangeability and principle of database relativity.
Sequence
A sequence is a database object created by a user and can be shared by multiple users.
Use:

To create a primary key value, this must be unique for each row.

Automatically generates unique numbers

Speeds up the efficiency of accessing sequence values when cached in memory.


Syntax

To Create Sequence
CREATE SEQUENCE sequence_name
[INCREMENT BY n]
[START WITH n]
[{MAXVALUE n | NOMAXVALUE}]
[{MINVALUE n | NOMINVALUE}]
[{CYCLE | NOCYCLE}]
[{CACHE n | NOCACHE}];

To use a Sequence

insert into table_name values (sequence_name.nextval, value1, value2, value3, ...);
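For example, a sequence used to generate DNUMBER values for the DEPARTMENT table might be created and used as follows (Oracle-style syntax; the sequence name dept_seq and the starting value are assumptions for illustration):

CREATE SEQUENCE dept_seq
    INCREMENT BY 1
    START WITH 100
    NOCACHE
    NOCYCLE;

INSERT INTO DEPARTMENT (DNUMBER, DNAME, DMGRSSN)
VALUES (dept_seq.NEXTVAL, 'Research', NULL);   -- NEXTVAL returns the next unique number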


Modifying a Sequence
If a MAXVALUE limit is reached in the sequence, no additional values from the sequence
will be allocated and an error will be received indicating that the sequence exceeds the
MAXVALUE. To continue to use the sequence, it can be modified using the ALTER SEQUENCE
statement.
ALTER SEQUENCE sequence_name
[INCREMENT BY n]
[{MAXVALUE n | NOMAXVALUE}]
[{MINVALUE n | NOMINVALUE}]
[{CYCLE | NOCYCLE}]
[{CACHE n | NOCACHE}];
Removing a Sequence
To remove a sequence from the data dictionary, use the DROP SEQUENCE statement.
Once removed, the sequence can no longer be referenced.
DROP SEQUENCE

sequence_name;

Synonyms:
Direct references to objects. They are used to provide public access to an object, mask the real
name or owner of an object, etc. A user may create a private synonym that is available to only that
user.

Embedded SQL-Static Vs Dynamic SQL


Embedded SQL is a method of combining the power of a programming language and the
database. Embedded SQL statements are processed by a special SQL precompiler: the embedded
SQL statements are parsed by an embedded SQL preprocessor.
The output from the preprocessor is then compiled by the host compiler. This allows
programmers to embed SQL statements in programs written in any number of languages such as:
C Programming language family, COBOL, FORTRAN and Java.
The SQL standard defines embeddings of SQL in a variety of programming languages such
as C, Java, and Cobol.
A language to which SQL queries are embedded is referred to as a host language, and the
SQL structures permitted in the host language comprise embedded SQL.
The basic form of these languages follows that of the System R embedding of SQL into PL/I.
The EXEC SQL statement is used to identify an embedded SQL request to the preprocessor:
EXEC SQL <embedded SQL statement> END_EXEC
Note: this varies by language (for example, the Java embedding uses # SQL { ... };).
From within a host language, find the names and cities of customers with more than the variable
amount dollars in some account. Specify the query in SQL and declare a cursor for it:
EXEC SQL <query or PL/SQL block>
END_EXEC
The open statement causes the query to be evaluated: EXEC SQL .. END_EXEC. The fetch statement
causes the values of one tuple in the query result to be placed in host-language variables:
EXEC SQL fetch c into :cn, :cc END_EXEC
Repeated calls to fetch get successive tuples in the query result. A variable called SQLSTATE in
the SQL communication area (SQLCA) gets set to '02000' to indicate that no more data is available.
The close statement causes the database system to delete the temporary relation that holds the
result of the query: EXEC SQL close c END_EXEC
Dynamic SQL
In static SQL, the SQL statements in the program do not change each time the program is run.
Dynamic SQL, in contrast, allows programs to construct and submit SQL queries at run time.
The program constructs an SQL statement in a buffer, just as it does for the EXECUTE
IMMEDIATE statement. Instead of host variables, a question mark (?) can be substituted for a
constant.

The program passes the SQL statement to the DBMS with a PREPARE statement, which requests
that the DBMS parse, validate, and optimize the statement and generate an execution plan for it.
The program can use the EXECUTE statement repeatedly, supplying different parameter values
each time the dynamic statement is executed.
Example of the use of dynamic SQL from within a C program.
char *sqlprog = "update account set balance = balance * 1.05 where account_number = ?";
EXEC SQL prepare dynprog from :sqlprog;
char account[10] = "A-101";
EXEC SQL execute dynprog using :account;
The dynamic SQL program contains a ?, which is a place holder for a value that is provided when
the SQL program is executed.
Difference between Static and Dynamic SQL:
1. Static (embedded) SQL: how the database will be accessed is predetermined (hard coded) in the
   embedded SQL statement. Dynamic (interactive) SQL: how the database will be accessed is
   determined at run time.
2. Static SQL is faster and more efficient; dynamic SQL is less fast and efficient.
3. In static SQL the statements are compiled at compile time; in dynamic SQL they are compiled at
   run time.
4. In static SQL, parsing, validation, optimization and generation of the application plan are done at
   compile time; in dynamic SQL they are done at run time.
5. Static SQL is generally used where data is distributed uniformly; dynamic SQL is generally used
   where data is distributed non-uniformly.
6. In static SQL the EXECUTE IMMEDIATE, EXECUTE and PREPARE statements are not used;
   in dynamic SQL they are used.
7. Static SQL is less flexible; dynamic SQL is more flexible.

Query Processing and Optimization


A query expressed in a high-level query language such as SQL must first be scanned,
parsed, and validated.


The scanner identifies the language tokens, such as SQL keywords, attribute names and
relation names, in the text of the query, whereas the parser checks the query syntax to determine
whether it is formulated according to the syntax rules (rules of grammar) of the query language.
The query must also be validated, by checking that all attribute and relation names are
valid and semantically meaningful names in the schema of the particular database being queried.
An internal representation of the query is then created, usually as a tree data structure called a
query tree. It is also possible to represent the query using a graph data structure called a query
graph.
The DBMS must then devise an execution strategy for retrieving the result of the query
from the database files. A query typically has many possible execution strategies, and the process
of choosing a suitable one for processing a query is known as query optimization.
Finding the optimal strategy is usually too time-consuming except for the simplest of
queries and may require information on how the files are implemented and even on the contents of
the files, information that may not be fully available in the DBMS catalog. Hence, planning of an
execution strategy may be a more accurate description than query optimization.
Steps in Query Processing

The steps involved in processing a query appear in Figure. The basic steps are
1. Parsing and translation
2. Optimization
3. Evaluation
Given a query, there are generally a variety of methods for computing the answer.
For example, we have seen that, in SQL, a query could be expressed in several different ways.
Each SQL query can itself be translated into a relational-algebra expression in one of several ways.
Furthermore, the relational-algebra representation of a query specifies only partially how to


evaluate a query; there are usually several ways to evaluate relational-algebra expressions. As an
illustration, consider the query
select balance
from account
where balance < 2500
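This query can be translated into relational algebra in more than one equivalent way, for example
σ balance<2500 (Π balance (account)) or Π balance (σ balance<2500 (account)); the optimizer must choose among such equivalent expressions, and for each relational operation it must also choose among the available evaluation algorithms.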

Heuristics and cost Estimates in Query Optimization

A drawback of cost-based optimization is the cost of optimization itself. Although the cost
of query processing can be reduced by clever optimizations, cost-based optimization is still
expensive. Hence, many systems use heuristics to reduce the number of choices that must be
made in a cost-based fashion. Some systems even choose to use only heuristics, and do not use
cost-based optimization at all.
An example of a heuristic rule is the following rule for transforming relational algebra
queries:
Perform selection operations as early as possible.
A heuristic optimizer would use this rule without finding out whether the cost is reduced by
this transformation. In the first transformation the selection operation was pushed into a join.
We say that the preceding rule is a heuristic because it usually, but not always, helps to
reduce the cost. For an example of where it can result in an increase in cost, consider an expression
(r s), where the condition refers to only attributes in s. The selection can certainly be
performed before the join. However, if r is extremely small compared to s, and if there is an index
on the join attributes of s, but no index on the attributes used by , then it is probably a bad idea to
perform the selection early.
Performing the selection earlythat is, directly on swould require doing a scan of all
tuples in s. It is probably cheaper, in this case, to compute the join by using the index, and then to
reject tuples that fail the selection.
The projection operation, like the selection operation, reduces the size of relations. Thus,
whenever we need to generate a temporary relation, it is advantageous to apply immediately any
projections that are possible. This advantage suggests a companion to the "perform selections
early" heuristic:
Perform projections early.


It is usually better to perform selections earlier than projections, since selections have the
potential to reduce the sizes of relations greatly, and selections enable the use of indices to access
tuples.
An example similar to the one used for the selection heuristic should convince you that this
heuristic does not always reduce the cost.
A heuristic optimization algorithm will reorder the components of an initial query tree to
achieve improved query execution. Heuristics can be understood by visualizing a query expression
as a tree.

1. Deconstruct conjunctive selections into a sequence of single selection operations. This step,
based on equivalence rule 1, facilitates moving selection operations down the query tree.
2. Move selection operations down the query tree for the earliest possible execution. This step uses
the commutativity and distributivity properties of the selection operation noted in equivalence
rules 2, 7.a, 7.b, and 11.
For instance, this step transforms σθ(r ⋈ s) into either σθ(r) ⋈ s or r ⋈ σθ(s) whenever possible.
Performing value-based selections as early as possible reduces the cost of sorting and merging
intermediate results. The degree of reordering permitted for a particular selection is determined by
the attributes involved in that selection condition.
3. Determine which selection operations and join operations will produce the smallest relations
that is, will produce the relations with the least number of tuples. Using associativity of the
operation, rearrange the tree so that the leaf-node relations with these restrictive selections are
executed first.
This step considers the selectivity of a selection or join condition. Recall that the most restrictive
selection, that is, the condition with the smallest selectivity, retrieves the fewest records. This
step relies on the associativity of binary operations given in equivalence rule 6.

4. Replace with join operations those Cartesian product operations that are followed by a selection
condition (rule 4.a). The Cartesian product operation is often expensive to implement, since r1 × r2
includes a record for each combination of records from r1 and r2. The selection may significantly
reduce the number of records, making the join much less expensive than the Cartesian product.
5. Deconstruct and move as far down the tree as possible lists of projection attributes, creating new
projections where needed. This step draws on the properties of the projection operation given in
equivalence rules 3, 8.a, 8.b, and 12.
6. Identify those subtrees whose operations can be pipelined, and execute them using pipelining.
In summary, the heuristics listed here reorder an initial query-tree representation in such a
way that the operations that reduce the size of intermediate results are applied first; early selection
reduces the number of tuples, and early projection reduces the number of attributes. The heuristic
transformations also restructure the tree so that the system performs the most restrictive selection
and join operations before other similar operations.
Heuristic optimization further maps the heuristically transformed query expression into
alternative sequences of operations to produce a set of candidate evaluation plans. An evaluation
plan includes not only the relational operations to be performed, but also the indices to be used, the
order in which tuples are to be accessed, and the order in which the operations are to be performed.
The access-plan selection phase of a heuristic optimizer chooses the most efficient strategy for
each operation.
UNIT III
TRANSACTION PROCESSING AND CONCURRENCY CONTROL
Introduction - Properties of Transaction - Serializability - Concurrency Control - Locking Mechanisms - Two Phase Commit Protocol - Deadlock.

3.1 INTRODUCTION
A transaction is a unit of program execution that accesses and possibly updates various data
items. A transaction consists of collection of operations used to perform a particular task. Each
transaction begins with BEGIN TRANSACTION statement and ends with END TRANSACTION
statement.
Transaction -States
Active
This is the initial state, the transaction stays in this state while it is executing.
Partially committed
A transaction is in this state when it has executed the final statement.

Failed
A transaction is in this state once the normal execution of the transaction cannot
proceed.
Aborted
A transaction is said to be aborted when the transaction has rolled back and the database is
being restored to the consistent state prior to the start of the transaction.
Committed

A transaction is in the committed state once it has been successfully executed and the database is
transformed into a new consistent state.

Fig States of a transaction

A transaction starts in the active state; A transaction contains a group of statements that form
a logical unit of work. When the transaction has finished executing the last statement, it enters the
partially committed state. At this point the transaction has completed execution, but it is still
possible that it may have to be aborted. This is because the actual output may still be in the main
memory and a hardware failure can still prevent the successful completion. The database system
then writes enough information to the disk. When the last of this information is written, the
transaction enters the committed states.
A transaction enters the failed state once the system determines that the transaction can no
longer proceed with its normal execution. This could be due to hardware failures or logical errors.
Such a transaction should be rolled back. When the roll back is complete, the transaction enters the
aborted state when a transaction aborts, the system has two options as follows:

Restart the transaction


Kill the transaction.

3.2 PROPERTIES OF TRANSACTION

The four basic properties that all transactions should possess are called ACID properties.
Atomicity: The 'all or nothing' property. A transaction is an indivisible unit that is either
performed in its entirety or is not performed at all. It is the responsibility of the recovery subsystem
of the DBMS to ensure atomicity.
Consistency: A transaction must transform the database from one consistent state to another
consistent state. It is the responsibility of both the DBMS and the application developers to ensure
consistency. The DBMS can ensure consistency by enforcing all the constraints that have been
specified on the database schema, such as integrity and enterprise constraints. However in itself this
is insufficient to ensure consistency.

Example:

A transaction that is intended to transfer money from one bank account to another and the
programmer makes an error in the transaction logic and debits one account but credits the wrong
account, then the database is in an inconsistent state.
Isolation: Transactions execute independently of one another, i.e. the partial effects of incomplete
transactions should not be visible to other transactions. It is the responsibility of the concurrency
control subsystem to ensure isolation.
Durability: The effects of a successfully completed transaction are permanently recorded in the
database and must not be lost because of a subsequent failure. It is the responsibility of the recovery
subsystem to ensure durability.
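A minimal SQL sketch of the transfer example above, assuming an account table with account_number and balance columns (names taken from the examples used elsewhere in these notes; the second account number 'A-102' is a hypothetical value):

BEGIN TRANSACTION;
    UPDATE account SET balance = balance - 100 WHERE account_number = 'A-101';
    UPDATE account SET balance = balance + 100 WHERE account_number = 'A-102';
COMMIT;
-- COMMIT makes both updates permanent (durability); if anything fails before it,
-- ROLLBACK undoes both updates, so the transfer is all-or-nothing (atomicity).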

3.3 SERIALIZABILITY

Basic assumption: each transaction preserves database consistency.
Thus serial execution of a set of transactions preserves database consistency.
A (possibly concurrent) schedule is serializable if it is equivalent to a serial schedule.

Different forms of schedule equivalence give rise to the notions of:

1. Conflict Serializability
2. View Serializability

3.3.1 Conflict Serializability


Instructions Ii and Ij, of transactions Ti and Tj respectively, conflict if and only if there exists some item Q
accessed by both Ii and Ij, and at least one of these instructions wrote Q.

1. Ii = read(Q), Ij = read(Q). Ii and Ij don't conflict.
2. Ii = read(Q), Ij = write(Q). They conflict.
3. Ii = write(Q), Ij = read(Q). They conflict.
4. Ii = write(Q), Ij = write(Q). They conflict.

Intuitively, a conflict between Ii and Ij forces a temporal order between them. If Ii and Ij are
consecutive in a schedule and they do not conflict, their results would remain the same even if they
had been interchanged in the schedule.
If a schedule S can be transformed into a schedule S' by a series of swaps of non-conflicting
instructions, we say that S and S' are conflict equivalent.
A schedule S is conflict serializable if it is conflict equivalent to a serial schedule.

Example of a schedule that is not conflict serializable:

It is not possible to swap instructions in the above schedule to obtain either the serial
schedule < T3, T4 > or the serial schedule < T4, T3 >. Schedule 3 below can be transformed
into Schedule 1, a serial schedule where T2 follows T1, by a series of swaps of non-conflicting
instructions. Therefore Schedule 3 is conflict serializable.

3.3.2 View Serializability


Let S and S' be two schedules with the same set of transactions. S and S' are
view equivalent if the following three conditions are met:
1. For each data item Q, if transaction Ti reads the initial value of Q in schedule S, then transaction
Ti must, in schedule S', also read the initial value of Q.
2. For each data item Q, if transaction Ti executes read(Q) in schedule S, and that value was
produced by transaction Tj (if any), then transaction Ti must in schedule S' also read the value
of Q that was produced by transaction Tj.
3. For each data item Q, the transaction (if any) that performs the final write(Q) operation in
schedule S must perform the final write(Q) operation in schedule S'.

A schedule S is view serializable if it is view equivalent to a serial schedule. Every conflict
serializable schedule is also view serializable, but there are view-serializable schedules that are not
conflict serializable.

Every view serializable schedule that is not conflict serializable has blind writes.

Precedence graph: It is a directed graph where the vertices are the transactions (names). We
draw an arc from Ti to Tj if the two transactions conflict, and Ti accessed the data item on which the
conflict arose earlier. We may label the arc by the item that was accessed.

Example: Schedule A and its precedence graph (figure).

3.4 CONCURRENCY CONTROL


Multiple transactions are allowed to run concurrently in the system.

Advantages of concurrent executions



o Increased processor and disk utilization, leading to better transaction throughput: one transaction can be
using the CPU while another is reading from or writing to the disk.
o Reduced average response time for transactions: short transactions need not wait behind long ones.
Concurrency control schemes: these are mechanisms to control the interaction among the concurrent
transactions in order to prevent them from destroying the consistency of the database.

Schedules: schedules are sequences that indicate the chronological order in which instructions of concurrent
transactions are executed.
A schedule for a set of transactions must consist of all instructions of those transactions and must preserve the
order in which the instructions appear in each individual transaction.
Recoverable schedule: if a transaction Tj reads a data item previously written by a transaction
Ti, the commit operation of Ti appears before the commit operation of Tj.
3.5 LOCKING MECHANISMS
Concurrency control protocols, which ensure serializability of transactions, are most desirable.
Concurrency control protocols can be broadly divided into two categories:

Lock based protocols

Time stamp based protocols

Lock based protocols


Database systems that are equipped with lock-based protocols use a mechanism by which a
transaction cannot read or write data until it acquires an appropriate lock on it first. Locks are of two
kinds:

Binary Locks: a lock on data item can be in two states; it is either locked or unlocked.

Shared/exclusive: this type of locking mechanism differentiates locks based on their use.
If a lock is acquired on a data item to perform a write operation, it is an exclusive lock,
because allowing more than one transaction to write the same data item would lead the
database into an inconsistent state. Read locks are shared because no data value is being
changed.

Two Phase Locking - 2PL



This locking protocol divides the transaction execution phase into three parts. In the first
part, when the transaction starts executing, it seeks permission for the locks it needs as it
executes. In the second part, the transaction has acquired all its locks and no other lock is
required; the transaction keeps executing its operations. As soon as the transaction releases its
first lock, the third phase starts. In this phase a transaction cannot demand any new locks but
only releases the acquired locks.

Fig Two Phase Locking


Two-phase locking has two phases: one is growing, where all locks are being acquired by the
transaction, and the second one is shrinking, where the locks held by the transaction are being released.
To get an exclusive (write) lock, a transaction must first acquire a shared (read) lock and then
upgrade it to an exclusive lock.
Strict Two Phase Locking

The first phase of Strict-2PL is the same as in 2PL. After acquiring all locks in the first phase, the transaction
continues to execute normally. But in contrast to 2PL, Strict-2PL does not release a lock as soon as it
is no longer required; it holds all locks until the commit state arrives. Strict-2PL releases all locks at
once at the commit point.

Fig Strict Two Phase Locking



Strict-2PL does not have cascading aborts as 2PL does.
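Most SQL DBMSs apply (strict) two-phase locking implicitly; as a rough sketch, the lock behaviour of a transaction on the account table used earlier might look like this (SELECT ... FOR UPDATE acquires the lock explicitly; the account number and the 5% update are the example values used elsewhere in these notes):

BEGIN TRANSACTION;
    SELECT balance FROM account
    WHERE  account_number = 'A-101'
    FOR UPDATE;                        -- growing phase: exclusive lock acquired on the row
    UPDATE account SET balance = balance * 1.05
    WHERE  account_number = 'A-101';
COMMIT;                                -- under Strict-2PL, all locks are released only here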


Time stamp based protocols
The most commonly used concurrency protocol is time-stamp based protocol. This
protocol uses either system time or logical counter to be used as a time-stamp.
Lock-based protocols manage the order between conflicting pairs among transactions at the time of
execution, whereas time-stamp based protocols start working as soon as a transaction is created.
Every transaction has a time-stamp associated with it, and the ordering is determined by the age of
the transaction. A transaction created at clock time 0002 would be older than all other transactions
that come after it. For example, any transaction 'y' entering the system at 0004 is two seconds
younger, and priority may be given to the older one. In addition, every data item is given the latest
read and write timestamp.
Time-stamp ordering protocol
The timestamp-ordering protocol ensures serializability among transactions in their conflicting read
and write operations. It is the responsibility of the protocol system that the conflicting pair of
tasks should be executed according to the timestamp values of the transactions.

Time-stamp of Transaction Ti is denoted as TS(Ti).

Read time-stamp of data-item X is denoted by R-timestamp(X).

Write time-stamp of data-item X is denoted by W-timestamp(X).

The timestamp ordering protocol works as follows:

If a transaction Ti issues a read(X) operation:
  o If TS(Ti) < W-timestamp(X), the operation is rejected.
  o If TS(Ti) >= W-timestamp(X), the operation is executed and all data-item timestamps are updated.

If a transaction Ti issues a write(X) operation:
  o If TS(Ti) < R-timestamp(X), the operation is rejected.
  o If TS(Ti) < W-timestamp(X), the operation is rejected and Ti is rolled back.
  o Otherwise, the operation is executed.

Thomas' Write rule:

This rule modifies the case where TS(Ti) < W-timestamp(X): instead of rejecting the operation and
rolling Ti back, the 'write' operation itself is ignored. With this modification the timestamp ordering
rules can produce schedules that are view serializable.

TWO PHASE COMMIT PROTOCOL


The two phase commit protocol is a distributed algorithm which lets all sites in a
distributed system agree to commit a transaction. The protocol results in either all nodes
committing the transaction or aborting, even in the case of site failures and message losses. The
two phases of the algorithm are broken into the COMMIT-REQUEST phase, where the
COORDINATOR attempts to prepare all the COHORTS, and the COMMIT phase, where the
COORDINATOR completes the transactions at all COHORTS.
The protocol works in the following manner: One node is designated the coordinator,
which is the master site, and the rest of the nodes in the network are called cohorts. Other
assumptions of the protocol include stable storage at each site and use of a write ahead log by each
node. Also, the protocol assumes that no node crashes forever, and eventually any two nodes can
communicate with each other. The latter is not a big deal since network communication can
typically be rerouted.


Basic Algorithm
During phase 1, initially the coordinator sends a query to commit message to all cohorts.
Then it waits for all cohorts to report back with the agreement message. The cohorts, if the
transaction was successful, write an entry to the undo log and an entry to the redo log. Then the
cohorts reply with an agree message, or an abort if the transaction failed at a cohort node.
During phase 2, if the coordinator receives an agree message from all cohorts, then it writes
a commit record into its log and sends a commit message to all the cohorts. If all agreement
messages do not come back the coordinator sends an abort message. Next the coordinator waits for
the acknowledgement from the cohorts. When acks are received from all cohorts the coordinator
writes a complete record to its log. Note the coordinator will wait forever for all the
acknowledgements to come back. If the cohort receives a commit message, it releases all the locks
and resources held during the transaction and send an acknowledgement to the coordinator. If the
message is abort, then the cohort undoes the transaction with the undo log and releases the
resources and locks held during the transaction. Then it sends an acknowledgement.
Disadvantages
The greatest disadvantage of the two phase commit protocol is the fact that it is a blocking
protocol. A node will block while it is waiting for a message. This means that other processes
competing for resource locks held by the blocked processes will have to wait for the locks to be
released. A single node will continue to wait even if all other sites have failed. If the coordinator
fails permanently, some cohorts will never resolve their transactions. This has the effect that
resources are tied up forever.
3.6 DEADLOCK:
When dealing with locks two problems can arise, the first of which being deadlock.
Deadlock refers to a particular situation where two or more processes are each waiting for another
to release a resource, or more than two processes are waiting for resources in a circular chain.
Deadlock is a common problem in multiprocessing where many processes share a specific type of
mutually exclusive resource. Some computers, usually those intended for the time-sharing and/or
real-time markets, are often equipped with a hardware lock, or hard lock, which guarantees
exclusive access to processes, forcing serialization. Deadlocks are particularly disconcerting
because there is no general solution to avoid them.


Livelock: Livelock is a special case of resource starvation. A livelock is similar to a
deadlock, except that the states of the processes involved constantly change with regard to one
another while never progressing. The general definition only states that a specific process is not
progressing.
progressing. For example, the system keeps selecting the same transaction for rollback causing the
transaction to never finish executing. Another livelock situation can come about when the system
is deciding which transaction gets a lock and which waits in a conflict situation.
3.6.1 Deadlock Prevention
To prevent any deadlock situation in the system, the DBMS aggressively inspects all the
operations which transactions are about to execute. DBMS inspects operations and analyze if they
can create a deadlock situation. If it finds that a deadlock situation might occur then that
transaction is never allowed to be executed. There are deadlock prevention schemes that use the
time-stamp ordering mechanism of transactions in order to pre-decide a deadlock situation.
Wait-Die Scheme:
In this scheme, if a transaction requests to lock a resource (data item) that is already held with a
conflicting lock by some other transaction, one of two possibilities may occur:

If TS(Ti) < TS(Tj), that is Ti, which is requesting a conflicting lock, is older than Tj, Ti is
allowed to wait until the data-item is available.

If TS(Ti) > TS(Tj), that is, Ti is younger than Tj, then Ti dies. Ti is restarted later with a random
delay but with the same timestamp.

This scheme allows the older transaction to wait but kills the younger one.
Wound-Wait Scheme:
In this scheme, if a transaction requests a lock on a resource (data item) that is already held with a
conflicting lock by some other transaction, one of two possibilities may occur:

If TS(Ti) < TS(Tj), that is Ti, which is requesting a conflicting lock, is older than Tj, Ti
forces Tj to be rolled back, that is Ti wounds Tj. Tj is restarted later with random delay but
with same timestamp.

If TS(Ti) > TS(Tj), that is Ti is younger than Tj, Ti is forced to wait until the resource is
available.

This scheme allows the younger transaction to wait, but when an older transaction requests an item
held by a younger one, the older transaction forces the younger one to abort and release the item.
In both cases, the transaction that entered the system later is aborted.
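A compact way to contrast the two schemes is to write out the decision each one makes when Ti requests an item locked by Tj. The functions and timestamps below are only an illustration of the rules stated above, not part of any particular DBMS.

# Decision rules when transaction Ti requests an item locked by Tj.
# Smaller timestamp = older transaction.
def wait_die(ts_i, ts_j):
    # Older requester waits; younger requester dies (it is restarted later
    # with the same timestamp).
    return "wait" if ts_i < ts_j else "die (restart Ti)"

def wound_wait(ts_i, ts_j):
    # Older requester wounds (rolls back) the younger holder;
    # younger requester waits.
    return "wound Tj (roll back Tj)" if ts_i < ts_j else "wait"

for ts_i, ts_j in [(5, 9), (9, 5)]:
    print(f"TS(Ti)={ts_i}, TS(Tj)={ts_j}:",
          "wait-die ->", wait_die(ts_i, ts_j), "|",
          "wound-wait ->", wound_wait(ts_i, ts_j))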
3.6.2 Deadlock Avoidance
Aborting a transaction is not always a practical approach. Instead, deadlock avoidance
mechanisms can be used to detect a deadlock situation in advance. Methods like the "wait-for
graph" are suitable for systems where transactions are lightweight and hold fewer instances of
resources; for bulky systems, deadlock prevention techniques may work better.
Wait-for Graph
This is a simple method available to track whether any deadlock situation may arise. For each transaction
entering the system, a node is created. When transaction Ti requests a lock on an item, say X,
which is held by some other transaction Tj, a directed edge is created from Ti to Tj. If Tj releases
item X, the edge between them is dropped and Ti locks the data item.
The system maintains this wait-for graph for every transaction waiting for some data items held by
others, and keeps checking whether there is any cycle in the graph.

Fig Wait-for Graph


Two approaches can be used: first, do not allow any request for an item that is already
locked by some other transaction. This is not always feasible and may cause starvation, where a
transaction waits indefinitely for a data item and can never acquire it. The second option is to roll back
one of the transactions.


It is not feasible to always roll back the younger transaction, as it may be more important than
the older one. With the help of some selection algorithm, a transaction is chosen to be aborted;
this transaction is called the victim and the process is known as victim selection.
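As a rough sketch of how the wait-for graph check and victim selection could be coded, the function below finds a cycle with a depth-first search and then picks the youngest transaction on the cycle as the victim. The graph contents, the timestamps, and the "youngest transaction is the victim" policy are assumptions made for the example.

# wait_for maps each transaction to the set of transactions it is waiting for.
def find_cycle(wait_for):
    visiting, done = set(), set()

    def dfs(node, path):
        visiting.add(node)
        path.append(node)
        for nxt in wait_for.get(node, ()):
            if nxt in visiting:                 # back edge => cycle found
                return path[path.index(nxt):]
            if nxt not in done:
                cycle = dfs(nxt, path)
                if cycle:
                    return cycle
        visiting.discard(node)
        done.add(node)
        path.pop()
        return None

    for node in list(wait_for):
        if node not in done:
            cycle = dfs(node, [])
            if cycle:
                return cycle
    return None

timestamps = {"T1": 10, "T2": 20, "T3": 30}     # larger timestamp = younger
graph = {"T1": {"T2"}, "T2": {"T3"}, "T3": {"T1"}}
cycle = find_cycle(graph)
if cycle:
    victim = max(cycle, key=timestamps.get)     # abort the youngest transaction
    print("deadlock on", cycle, "-> victim:", victim)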

UNIT IV
TRENDS IN DATABASE TECHNOLOGY
Overview of Physical Storage Media - Magnetic Disks - RAID - Tertiary storage - File
Organization - Organization of Records in Files - Indexing and Hashing - Ordered Indices - B+
tree Index Files - B tree Index Files - Static Hashing - Dynamic Hashing - Introduction to
Distributed Databases - Client server technology - Multidimensional and Parallel databases - Spatial
and multimedia databases - Mobile and web databases - Data Warehouse - Mining - Data marts
4.1 OVERVIEW OF PHYSICAL STORAGE MEDIA
Classification of Physical Storage Media

Based on Speed with which data can be accessed

Based on Cost per unit of data

Based on Reliability

Based on life of storage

Volatile storage: loses contents when power is switched off


Non-volatile storage: Contents persist even when power is switched off.
Includes secondary and tertiary storage, as well as battery-backed-up main memory.
Storage Hierarchy
Primary storage: Fastest media but volatile (cache, main memory).
Secondary storage: next level in hierarchy, non-volatile, moderately fast access time. Also called
on-line storage. E.g. flash memory, magnetic disks

Tertiary storage: lowest level in hierarchy, non-volatile, slow access time. Also called off-line
storage. E.g. magnetic tape, optical storage.

Fig Storage-device hierarchy

Cache

The fastest and most costly form of storage

Volatile

Managed by the computer system hardware.

Main memory

Fast access (10s to 100s of nanoseconds; 1 nanosecond = 10^-9 seconds)

Generally too small (or too expensive) to store the entire database

Capacities of up to a few Gigabytes widely used currently

Capacities have gone up and per-byte costs have decreased steadily and rapidly

Volatile contents of main memory are usually lost if a power failure or system crash
occurs.

Flash memory

Data survives power failure.

Data can be written at a location only once, but location can be erased and written to again.

Can support only a limited number of write/erase cycles.

Erasing of memory has to be done to an entire bank of memory.

Reads are roughly as fast as main memory.

But writes are slow (few microseconds), erase is slower.

Cost per unit of storage roughly similar to main memory.



Widely used in embedded devices such as digital cameras. Also known as EEPROM.

Magnetic-disk

Data is stored on spinning disk, and read/written magnetically.

Primary medium for the long-term storage of data; typically stores entire database.

Data must be moved from disk to main memory for access, and written back for storage.

Much slower access than main memory

Direct access is possible: data on disk can be read in any order, unlike magnetic tape.
Hard disks vs. floppy disks

Capacities range up to roughly 100 GB currently. Much larger capacity and cost/byte than
main memory/ flash memory

Growing constantly and rapidly with technology improvements

Survives power failures and system crashes

Disk failure can destroy data, but is very rare.

Optical storage

Non-volatile; data is read optically from a spinning disk using a laser. CD-ROM (640 MB)
and DVD (4.7 to 17 GB) are the most popular forms

Write-once, read-many (WORM) optical disks are used for archival storage (CD-R and DVD-R)

Multiple write versions also available (CD-RW, DVD-RW, and DVD-RAM)

Reads and writes are slower than with magnetic disk

Juke-box systems, with large numbers of removable disks, a few drives, and a mechanism
for automatic loading/unloading of disks available for storing large volumes of data.

Optical Disks

Compact disk-read only memory (CD-ROM)

Disks can be loaded into or removed from a drive

High storage capacity (640 MB per disk)

High seek times of about 100 msec (optical read head is heavier and slower)

Higher latency (3000 RPM) and lower data-transfer rates (3-6 MB/s) compared to
magnetic disks

Digital Video Disk (DVD)



DVD-5 holds 4.7 GB, and DVD-9 holds 8.5 GB

DVD-10 and DVD-18 are double sided formats with capacities of 9.4 GB & 17 GB

Other characteristics similar to CD-ROM

Record once versions (CD-R and DVD-R)

Data can only be written once, and cannot be erased.

High capacity and long lifetime; used for archival storage

Multi-write versions (CD-RW, DVD-RW and DVD-RAM) also available

Tape storage

Non-volatile, used primarily for backup (to recover from disk failure), and for archival data

Sequential-access much slower than disk

Very high capacity (40 to 300 GB tapes available)

Tape can be removed from the drive; storage costs are much cheaper than disk, but drives are
expensive

Tape jukeboxes available for storing massive amounts of data

Hundreds of terabytes (1 terabyte = 10^12 bytes) to even a petabyte (1 petabyte = 10^15 bytes)

Magnetic Tapes

Hold large volumes of data and provide high transfer rates

Few GB for DAT (Digital Audio Tape) format, 10-40 GB with DLT (Digital Linear Tape)
format, 100 GB+ with Ultrium format, and 330 GB with Ampex helical scan format

Transfer rates from few to 10s of MB/s

Currently the cheapest storage medium

Tapes are cheap, but cost of drives is very high

Very slow access time in comparison to magnetic disks and optical disks

Limited to sequential access.

Used mainly for backup, for storage of infrequently used information, and as an offline
medium for transferring information from one system to another.

Tape jukeboxes are used for very large capacity storage (terabyte (10^12 bytes) to petabyte
(10^15 bytes))

4.2 MAGNETIC DISKS

Fig Magnetic disk

Physical Characteristics of Magnetic Hard Disks

Surface of platter divided into circular tracks. Over 17,000 tracks per platter on typical hard
disks.

Each track is divided into sectors. A sector is the smallest unit of data that can be read or
written.

Sector size typically 512 bytes. Typical sectors per track: 200 (on inner tracks) to 400 (on
outer tracks).

Read-write head is positioned very close to the platter surface (almost touching it). Reads
or writes magnetically encoded information.

To read/write a sector-Disk arm swings to position head on right track. Platter spins
continually; data is read/written as sector passes under head.

Head-disk assemblies- Multiple disk platters on a single spindle (typically 2 to 4).

One head per platter, mounted on a common arm.

Cylinder i consists of i th track of all the platters

Performance Measures of Disks

Access Time: the time it takes from when a read or write request is issued to when data
transfer begins.

Seek Time: the time it takes to reposition the arm over the correct track. Average seek time
is 1/2 the worst-case seek time; 4 to 10 milliseconds on typical disks.

Rotational latency: the time it takes for the sector to be accessed to appear under the head.
Average latency is 1/2 of the worst-case latency; 4 to 11 milliseconds on typical disks
(5400 to 15000 r.p.m.)

Data-Transfer Rate: the rate at which data can be retrieved from or stored to the disk. 4
to 8 MB per second is typical. Multiple disks may share a controller, so the rate that the
controller can handle is also important.

Mean Time To Failure (MTTF): the average time the disk is expected to run
continuously without any failure.
Typically 3 to 5 years.
Probability of failure of new disks is quite low, corresponding to a theoretical

MTTF of 30,000 to 1,200,000 hours for a new disk.
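Putting these measures together, the time to service one random block request is roughly seek time + rotational latency + transfer time. The figures below are simply the typical values quoted above (assumptions, not measurements):

# Rough service time for one random 4 KB block request.
seek_ms = 6.0                                   # typical average seek (4-10 ms)
rpm = 7200
rotational_latency_ms = 0.5 * (60_000 / rpm)    # on average, half a revolution
transfer_rate_mb_s = 6.0                        # 4-8 MB/s typical
block_kb = 4
transfer_ms = block_kb / 1024 / transfer_rate_mb_s * 1000

access_ms = seek_ms + rotational_latency_ms + transfer_ms
print(f"about {access_ms:.2f} ms per random block access")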


4.3 RAID
RAID: Redundant Arrays of Independent Disks
Disk organization techniques that manage a large number of disks, providing a view of a
single disk of

high capacity and high speed by using multiple disks in parallel,

high reliability by storing data redundantly, so that data can be recovered even if a
disk

fails.

The chance that some disk out of a set of N disks will fail is much higher than the chance that a
specific single disk will fail.
E.g., a system with 100 disks, each with MTTF of 100,000 hours (approx. 11 years), will
have a system MTTF of 1000 hours (approx. 41 days)
o Techniques for using redundancy to avoid data loss are critical with large
numbers of disks
Originally a cost-effective alternative to large, expensive disks.

o I in RAID originally stood for inexpensive


o Today RAIDs are used for their higher reliability and bandwidth.
The I is interpreted as independent
Improvement of Reliability via Redundancy
Redundancy store extra information that can be used to rebuild information lost in a disk
failure.
E.g., Mirroring (or shadowing)
o Duplicate every disk. Logical disk consists of two physical disks.
o Every write is carried out on both disks
Reads can take place from either disk
o If one disk in a pair fails, data still available in the other
Data loss would occur only if a disk fails, and its mirror disk also fails before the system is
repaired.
Probability of combined event is very small
o Except for dependent failure modes such as fire or building collapse or electrical power surges.
Mean time to data loss depends on mean time to failure, and mean time to repair.
E.g. MTTF of 100,000 hours, mean time to repair of 10 hours gives mean time to data loss of
500 * 10^6 hours (or about 57,000 years) for a mirrored pair of disks (ignoring dependent failure modes)
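The 57,000-year figure follows from the usual approximation for a mirrored pair, mean time to data loss = MTTF^2 / (2 * MTTR), which assumes independent failures. A quick check:

# Mean time to data loss for a mirrored pair (independent failures assumed).
mttf_hours = 100_000        # per disk
mttr_hours = 10             # mean time to repair

mttdl_hours = mttf_hours ** 2 / (2 * mttr_hours)
print(mttdl_hours)                         # 500,000,000 hours = 500 * 10^6
print(mttdl_hours / (24 * 365), "years")   # roughly 57,000 years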
Improvement in Performance via Parallelism
o Two main goals of parallelism in a disk system:
1. Load balance multiple small accesses to increase throughput
2. Parallelize large accesses to reduce response time.
o Improve transfer rate by striping data across multiple disks.
o Bit-level striping split the bits of each byte across multiple disks

In an array of eight disks, write bit i of each byte to disk i.


Each access can read data at eight times the rate of a single disk.
But seek/access time worse than for a single disk
Bit level striping is not used much any more
o Block-level striping with n disks, block i of a file goes to disk (i mod n) + 1
Requests for different blocks can run in parallel if the blocks reside on different disks.

A request for a long sequence of blocks can utilize all disks in parallel.
RAID Levels
o Schemes to provide redundancy at lower cost by using disk striping combined with parity bits.
Different RAID organizations, or RAID levels, have differing cost, performance and reliability
characteristics
RAID Level 0: Block striping; non-redundant.
o Used in high-performance applications where data lost is not critical.
RAID Level 1: Mirrored disks with block striping.
o Offers best write performance.
o Popular for applications such as storing log files in a database system.
RAID Level 2: Memory-Style Error-Correcting-Codes (ECC)
RAID Level 3: Bit-Interleaved Parity
o a single parity bit is enough for error correction, not just detection, since we know which disk
has failed
When writing data, corresponding parity bits must also be computed and written to a parity bit
disk.
To recover data in a damaged disk, compute XOR of bits from other disks (including parity bit
disk).
o Faster data transfer than with a single disk, but fewer I/Os per second since every disk has to
participate in every I/O.
o Subsumes Level 2 (provides all its benefits, at lower cost).
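The parity idea behind levels 3-5 can be illustrated with plain XOR over byte strings: the parity block is the XOR of the data blocks, and any single lost block is the XOR of the surviving blocks plus the parity block. The block contents below are made up for the example.

from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks."""
    return bytes(reduce(lambda acc, blk: [a ^ b for a, b in zip(acc, blk)],
                        blocks[1:], list(blocks[0])))

data = [b"\x11\x22\x33", b"\x44\x55\x66", b"\x0f\x0e\x0d"]   # three data disks
parity = xor_blocks(data)                                    # parity disk

# Suppose the second data disk fails: rebuild its block from the others plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
print(rebuilt.hex())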
RAID Level 4: Block-Interleaved Parity; uses block-level striping, and keeps a parity block on a
separate disk for corresponding blocks from N other disks.
o When writing data block, corresponding block of parity bits must also be computed and written
to parity disk.
o To find value of a damaged block, compute XOR of bits from corresponding blocks (including
parity block) from other disks.
o Provides higher I/O rates for independent block reads than Level 3

block read goes to a single disk, so blocks stored on different disks can be read in parallel.
o Provides high transfer rates for reads of multiple blocks than no-striping.
o Before writing a block, parity data must be computed

Can be done by using old parity block, old value of current block and new value of current
block (2 block reads + 2 block writes).
Or by recomputing the parity value using the new values of blocks corresponding to the parity
block. More efficient for writing large amounts of data sequentially.
o Parity block becomes a bottleneck for independent block writes since every block write also
writes to parity disk.
RAID Level 5: Block-Interleaved Distributed Parity; partitions data and parity among all N + 1
disks, rather than storing data in N disks and parity in 1 disk.
o E.g., with 5 disks, parity block for nth set of blocks is stored on disk (n mod 5) + 1, with the data
blocks stored on the other 4 disks.
o Higher I/O rates than Level 4.
Block writes occur in parallel if the blocks and their parity blocks are on different disks.
o Subsumes Level 4: provides same benefits, but avoids bottleneck of parity disk.
RAID Level 6: P+Q Redundancy scheme; similar to Level 5, but stores extra redundant
information to guard against multiple disk failures.
o Better reliability than Level 5 at a higher cost; not used


Choice of RAID Level


o Factors in choosing RAID level
Monetary cost
Performance: Number of I/O operations per second, and bandwidth during normal operation
Performance during failure
Performance during rebuild of failed disk
Including time taken to rebuild failed disk
o RAID 0 is used only when data safety is not important
E.g. data can be recovered quickly from other sources
o Level 2 and 4 never used since they are subsumed by 3 and 5
o Level 3 is not used anymore since bit-striping forces single block reads to access all disks,
wasting disk arm movement, which block striping (level 5) avoids
o Level 6 is rarely used since levels 1 and 5 offer adequate safety for almost all applications
o So competition is between 1 and 5 only
Level 1 provides much better write performance than level 5
o Level 5 requires at least 2 block reads and 2 block writes to write a single block, whereas Level 1
only requires 2 block writes
o Level 1 preferred for high update environments such as log disks

Level 1 has a higher storage cost than level 5


o disk drive capacities are increasing rapidly (50%/year) whereas disk access times have decreased
much less (about 3x in 10 years)
o I/O requirements have increased greatly, E.g. for Web servers
o When enough disks have been bought to satisfy required rate of I/O, they often have spare
storage capacity
so there is often no extra monetary cost for Level 1!
Level 5 is preferred for applications with low update rate, and large amounts of data
Level 1 is preferred for all other applications
Hardware Issues
Software RAID: RAID implementations done entirely in software, with no special hardware
support
Hardware RAID: RAID implementations with special hardware
o Use non-volatile RAM to record writes that are being executed
o Beware: power failure during write can result in corrupted disk
E.g. failure after writing one block but before writing the second in a mirrored system.
Such corrupted data must be detected when power is restored
Recovery from corruption is similar to recovery from failed disk
NV-RAM helps to efficiently detect potentially corrupted blocks
o Otherwise all blocks of disk must be read and compared with mirror/parity block
Hot swapping: replacement of disk while system is running, without power down
o Supported by some hardware RAID systems,
o reduces time to recovery, and improves availability greatly.
Many systems maintain spare disks which are kept online, and used as replacements for failed
disks immediately on detection of failure
o Reduces time to recovery greatly.
Many hardware RAID systems ensure that a single point of failure will not stop the functioning
of the system by using
o Redundant power supplies with battery backup.
o Multiple controllers and multiple interconnections to guard against controller/ interconnection
failures.

4.4 TERTIARY STORAGE


Tertiary storage: lowest level in hierarchy, non-volatile, slow access time. Also called off-line
storage. E.g. magnetic tape, optical storage.

Optical storage
Non-volatile; data is read optically from a spinning disk using a laser. CD-ROM (640 MB) and
DVD (4.7 to 17 GB) are the most popular forms. Write-once, read-many (WORM) optical disks are used
for archival storage (CD-R and DVD-R).
Multiple write versions are also available (CD-RW, DVD-RW, and DVD-RAM). Reads and writes
are slower than with magnetic disk.
Juke-box systems, with large numbers of removable disks, a few drives, and a mechanism for
automatic loading/unloading of disks available for storing large volumes of data.

Optical Disks

Compact disk-read only memory (CD-ROM)

Disks can be loaded into or removed from a drive

High storage capacity (640 MB per disk)

High seek times of about 100 msec (optical read head is heavier and slower)

Higher latency (3000 RPM) and lower data-transfer rates (3-6 MB/s) compared to
magnetic disks

Digital Video Disk (DVD)

DVD-5 holds 4.7 GB, and DVD-9 holds 8.5 GB

DVD-10 and DVD-18 are double sided formats with capacities of 9.4 GB & 17 GB.

Other characteristics similar to CD-ROM. Record-once versions (CD-R and DVD-R)

Data can only be written once, and cannot be erased.

High capacity and long lifetime; used for archival storage

Multi-write versions (CD-RW, DVD-RW and DVDRAM) also available

Tape storage

Non-volatile, used primarily for backup (to recover from disk failure), and for archival data

Sequential-access much slower than disk

Very high capacity (40 to 300 GB tapes available)

Tape can be removed from the drive; storage costs are much cheaper than disk, but drives are
expensive

Tape jukeboxes available for storing massive amounts of data

Hundreds of terabytes (1 terabyte = 10^12 bytes) to even a petabyte (1 petabyte = 10^15 bytes)

Optical Disks

CD-ROM has become a popular medium for distributing software, multimedia data, and
other electronic published information.

Capacity of CD-ROM: 500 MB. Disks are cheap to mass produce and also drives.

CD-ROM: much longer seek time (about 250 msec), lower rotation speed (400 rpm), leading to
higher latency and a lower data-transfer rate (about 150 KB/sec). Drives spinning at multiples of
the standard audio CD speed are available.

Recently, a new optical format, the digital video disk (DVD), has become standard. These disks
hold between 4.7 and 17 GB of data.

WORM (write-once, read many) disks are popular for archival storage of data since they
have a high capacity (about 500 MB), longer life time than HD, and can be removed from
drive -- good for audit trail (hard to tamper).

4.5 FILE ORGANIZATION


The database is stored as a collection of files. Each file is a sequence of records. A record
is a sequence of fields.
One approach:
o assume record size is fixed.
o each file has records of one particular type only.
o different files are used for different relations.
This case is easiest to implement; will consider variable length records later.

4.5.1 Fixed-Length Records


Simple approach:

o Store record i starting from byte n * (i - 1), where n is the size of each
record.
o Record access is simple but records may cross blocks
Modification: do not allow records to cross block boundaries.
Deletion of record i:
alternatives:
o move records i + 1, ..., n to i, ..., n - 1
o move record n to i
o do not move records, but link all free records on a free list

Free Lists
o Store the address of the first deleted record in the file header.
o Use this first record to store the address of the second deleted record, and so on.
o Can think of these stored addresses as pointers since they point to the location of a record.
o More space efficient representation: reuse space for normal attributes of free records to store
pointers. (No pointers stored in in-use records.)
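A minimal sketch of fixed-length record placement with a free list kept in the header: record i lives at byte offset n * (i - 1), and each freed slot stores the number of the next free slot. The record size, slot numbering and field layout here are assumptions made for the example.

# Fixed-length record file: record i lives at offset n * (i - 1).
RECORD_SIZE = 16                       # n bytes per record (assumed layout)

def record_offset(i):
    return RECORD_SIZE * (i - 1)       # 1-based record numbers

class FixedLengthFile:
    def __init__(self):
        self.records = {}              # slot number -> record bytes (or next-free pointer)
        self.free_head = None          # head of the free list, kept in the header
        self.next_slot = 1

    def insert(self, data: bytes):
        if self.free_head is not None:             # reuse a freed slot first
            slot, self.free_head = self.free_head, self.records[self.free_head]
        else:
            slot, self.next_slot = self.next_slot, self.next_slot + 1
        self.records[slot] = data.ljust(RECORD_SIZE)[:RECORD_SIZE]
        return slot, record_offset(slot)

    def delete(self, slot):
        self.records[slot] = self.free_head        # freed slot stores the next-free pointer
        self.free_head = slot

f = FixedLengthFile()
print(f.insert(b"A-101|500"))          # (1, 0)
print(f.insert(b"A-102|700"))          # (2, 16)
f.delete(1)
print(f.insert(b"A-215|900"))          # reuses slot 1 -> (1, 0)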

4.5.2 Variable-Length Records


o Variable-length records arise in database systems in several ways:

Storage of multiple record types in a file. Record types that allow variable lengths for one or
more fields.
Record types that allow repeating fields (used in some older data models).
o Byte string representation
Attach an end-of-record control character to the end of each record.
Difficulty with deletion.
Difficulty with growth.
Variable-Length Records: Slotted Page Structure

Slotted page header contains:


o number of record entries.
o end of free space in the block.
o location and size of each record.
Records can be moved around within a page to keep them contiguous with no empty space
between them; the entry in the header must be updated.
Pointers should not point directly to the record; instead, they should point to the entry for the
record in the header.
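A sketch of the slotted-page idea: the header keeps the number of entries, the end of free space, and a (location, size) entry per record, while record data is packed from the end of the page. The page size and record contents are arbitrary choices for the example.

# Slotted page: header entries at the front, record data packed from the back.
PAGE_SIZE = 4096

class SlottedPage:
    def __init__(self):
        self.data = bytearray(PAGE_SIZE)
        self.entries = []              # header: list of (offset, size) per record
        self.free_end = PAGE_SIZE      # end of free space in the block

    def insert(self, record: bytes):
        self.free_end -= len(record)                   # records grow downward
        self.data[self.free_end:self.free_end + len(record)] = record
        self.entries.append((self.free_end, len(record)))
        return len(self.entries) - 1                   # external pointers use this slot id

    def get(self, slot):
        off, size = self.entries[slot]
        return bytes(self.data[off:off + size])

page = SlottedPage()
sid = page.insert(b"Perryridge|A-102|400")
print(sid, page.get(sid))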
Fixed-length representation:
o reserved space
o pointers
Reserved space can use fixed-length records of a known maximum length; unused space in
shorter records filled with a null or end-of record symbol.


4.5.3 Pointer Method


Pointer method

A variable-length record is represented by a list of fixed-length records,


chained together via pointers.
Can be used even if the maximum record length is not known
Disadvantage of the pointer structure: space is wasted in all records except the
first in a chain.
Solution is to allow two kinds of block in file:
Anchor block contains the first records of chain
Overflow block contains records other than those that are the first records

4.6 ORGANIZATION OF RECORDS IN FILES


Heap a record can be placed anywhere in the file where there is space
Sequential store records in sequential order, based on the value of the search key of each
record


Hashing a hash function computed on some attribute of each record; the result specifies in
which block of the file the record should be placed.
Records of each relation may be stored in a separate file. In a clustering file organization records
of several different relations can be stored in the same file
Motivation: store related records on the same block to minimize I/O
4.6.1 Sequential File Organization
Suitable for applications that require sequential processing of the entire file
The records in the file are ordered by a search-key
Deletion: use pointer chains.
Insertion: locate the position where the record is to be inserted;
if there is free space, insert there;
if no free space, insert the record in an overflow block. In either case, the pointer chain must be
updated.
Need to reorganize the file from time to time to restore sequential order

4.6.2 Clustering File Organization


o Simple file structure stores each relation in a separate file.
o Can instead store several relations in one file using a clustering file organization.
o E.g., clustering organization of customer and depositor:


o bad for queries involving only customer


o results in variable size records
4.6.3 Mapping of Objects to Files
Mapping objects to files is similar to mapping tuples to files in a relational system; object data
can be stored using file structures.
Objects in O-O databases may lack uniformity and may be very large; such objects have to be
managed differently from records in a relational system.
o Set fields with a small number of elements may be implemented using data structures such as
linked lists.
o Set fields with a larger number of elements may be implemented as separate relations in the
database.
o Set fields can also be eliminated at the storage level by normalization.
Similar to conversion of multivalued attributes of E-R diagrams to relations
Objects are identified by an object identifier (OID); the storage system needs a mechanism to
locate an object given its OID (this action is called dereferencing).
o logical identifiers do not directly specify an object's physical location; must maintain an index
that maps an OID to the object's actual location.
o physical identifiers encode the location of the object so the object can be found directly. Physical
OIDs typically have the following parts:
1. a volume or file identifier
2. a page identifier within the volume or file
3. an offset within the page

4.7 INDEXING AND HASHING:



INDEXING
Indexing mechanisms used to speed up access to desired data.
o E.g., author catalog in library
Search Key - attribute or set of attributes used to look up records in a file.
An index file consists of records (called index entries) of the form (search-key, pointer).
Index files are typically much smaller than the original file.

Two basic kinds of indices:


o Ordered indices: search keys are stored in sorted order.
o Hash indices: search keys are distributed uniformly across buckets using a hash
function.
Index Evaluation Metrics
o Access types supported efficiently. E.g.,
o records with a specified value in the attribute
o or records with an attribute value falling in a specified range of values.
o Access time
o Insertion time
o Deletion time
o Space overhead
ORDERED INDICES
In an ordered index, index entries are stored, sorted on the search key value. E.g., author
catalog in library.
Primary index: in a sequentially ordered file, the index whose search key specifies the
sequential order of the file. Also called clustering index
The search key of a primary index is usually but not necessarily the primary key.
Secondary index: an index whose search key specifies an order different from the
sequential order of the file. Also called non-clustering index.
Index-sequential file: ordered sequential file with a primary index.
Dense Index Files
o Dense index: an index record appears for every search-key value in the file.

Sparse Index Files


Sparse Index: contains index records for only some search-key values.
Applicable when records are sequentially ordered on search-key
To locate a record with search-key value K we:
Find index record with largest search-key value < K
Search file sequentially starting at the record to which the index record points.
Less space and less maintenance overhead for insertions and deletions.
Generally slower than dense index for locating records.
Good tradeoff: sparse index with an index entry for every block in file, corresponding to
least search-key value in the block.
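The two-step sparse-index lookup described above can be sketched with a sorted list of (least search-key, block) entries: find the last index entry at or before K, then scan sequentially from that block. The branch names and block numbers below are invented for the example.

import bisect

# Sparse index: one (least search-key, block number) entry per block.
index_keys  = ["Brighton", "Downtown", "Mianus", "Perryridge", "Redwood"]
index_block = [1, 2, 3, 4, 5]

def lookup_block(k):
    """Block where a record with key k would be, per the
    'largest index entry not greater than k, then scan' rule."""
    pos = bisect.bisect_right(index_keys, k) - 1
    return index_block[max(pos, 0)]

print(lookup_block("Perryridge"))   # 4
print(lookup_block("Pittsfield"))   # 4 -> scan sequentially from block 4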
Example of Sparse Index Files

Multilevel Index
If primary index does not fit in memory, access becomes expensive.

To reduce number of disk accesses to index records, treat primary index kept on disk as
a sequential file and construct a sparse index on it.
o outer index a sparse index of primary index
o inner index the primary index file
If even the outer index is too large to fit in main memory, yet another level of index can
be created, and so on.
Indices at all levels must be updated on insertion or deletion from the file.

Index Update: Deletion


If deleted record was the only record in the file with its particular search-key value, the
search-key is deleted from the index also.
Single-level index deletion:
o Dense indices deletion of search-key is similar to file record deletion.
o Sparse indices if an entry for the search key exists in the index, it is deleted by
replacing the entry in the index with the next search-key value in the file (in search-key


order). If the next search-key value already has an index entry, the entry is deleted instead
of being replaced.
Index Update: Insertion
o Single-level index insertion:
o Perform a lookup using the search-key value appearing in the record to be inserted.
o Dense indices if the search-key value does not appear in the index, insert it.
o Sparse indices if index stores an entry for each block of the file, no change needs to be
made to the index unless a new block is created. In this case, the first search-key value appearing
in the new block is inserted into the index.
o Multilevel insertion (as well as deletion) algorithms are simple extensions of the singlelevel algorithms.
Secondary Indices
o Frequently, one wants to find all the records whose values in a certain field (which is not
the search-key of the primary index) satisfy some condition.
o Example 1: In the account database stored sequentially by account number, we may want
to find all accounts in a particular branch.
o Example 2: as above, but where we want to find all accounts with a specified balance or
range of balances
o We can have a secondary index with an index record for each search-key value; index
record points to a bucket that contains pointers to all the actual records with that particular searchkey value.
Secondary Index on balance field of account

Primary and Secondary Indices


Secondary indices have to be dense.

Indices offer substantial benefits when searching for records.


When a file is modified, every index on the file must be updated, Updating indices imposes
overhead on database modification.
Sequential scan using primary index is efficient, but a sequential scan using a secondary index is
expensive.
each record access may fetch a new block from disk
Bitmap Indices
Bitmap indices are a special type of index designed for efficient querying on multiple keys.
Records in a relation are assumed to be numbered sequentially from, say,0
Given a number n it must be easy to retrieve record n
Particularly easy if records are of fixed size
Applicable on attributes that take on a relatively small number of distinct values
E.g. gender, country, state,
E.g. income-level (income broken up into a small number of levels such as 0-9999, 10000-19999, 20000-50000, 50000-infinity)
A bitmap is simply an array of bits.
In its simplest form a bitmap index on an attribute has a bitmap for each value of the attribute.
Bitmap has as many bits as records.
In a bitmap for value v, the bit for a record is 1 if the record has the value v for the attribute, and
is 0 otherwise.
Bitmap indices are useful for queries on multiple attributes
Not particularly useful for single-attribute queries. Queries are answered using bitmap operations:
Intersection (and)
Union (or)
Complementation (not)
Each operation takes two bitmaps of the same size and applies the operation on corresponding bits
to get the result bitmap
Males with income level L1: 10010 AND 10100 = 10000
Can then retrieve required tuples.
Counting number of matching tuples is even faster
Bitmap indices generally very small compared with relation size.
If number of distinct attribute values is 8, bitmap is only 1% of relation
size
Deletion needs to be handled properly

Existence bitmap: notes whether there is a valid record at a record location.
Needed for complementation.
Should keep bitmaps for all values, even the null value.
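The bitmap operations above are easy to mirror with Python integers used as bitmaps, where the bit position corresponds to the record number; the gender and income-level bitmaps are the hypothetical ones from the example.

# Bitmaps as integers; bit i corresponds to record i (5 records here).
gender_m  = 0b10010      # which records are male
income_L1 = 0b10100      # which records are in income level L1

males_in_L1 = gender_m & income_L1          # intersection (AND)
print(f"{males_in_L1:05b}")                 # 10000, as in the example
print(bin(males_in_L1).count("1"), "matching record(s)")

existence = 0b11111                         # existence bitmap: all 5 slots valid
not_L1 = income_L1 ^ existence              # complementation restricted to valid records
print(f"{not_L1:05b}")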
B+-Tree Index Files
o B+-tree indices are an alternative to indexed-sequential files.
o Disadvantage of indexed-sequential files: performance degrades as file grows, since many
overflow blocks get created. Periodic reorganization of entire file is required.
o Advantage of B+-tree index files: automatically reorganizes itself with small, local, changes, in
the face of insertions and deletions. Reorganization of entire file is not required to maintain
performance.
o Disadvantage of B+-trees: extra insertion and deletion overhead, space overhead.
o Advantages of B+-trees outweigh disadvantages, and they are used extensively.
A B+-tree is a rooted tree satisfying the following properties:
All paths from root to leaf are of the same length
Each node that is not a root or a leaf has between [n/2] and n children.
A leaf node has between [(n-1)/2] and n-1 values
Special cases:
If the root is not a leaf, it has at least 2 children.
If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and (n-1)
values.
Typical node

o Ki are the search-key values


o Pi are pointers to children (for non-leaf nodes) or pointers to
records or buckets of records (for leaf nodes).
The search-keys in a node are ordered
K1 < K2 < K3 < . . . < Kn-1.
Leaf Nodes in B+-Trees
o Properties of a leaf node:
For i = 1, 2, . . ., n-1, pointer Pi either points to a file


record with search-key value Ki, or to a bucket of pointers to file records, each record having
search-key
value Ki. Only need bucket structure if search-key does
not form a primary key.
If Li, Lj are leaf nodes and i < j, Li's search-key values
are less than Lj's search-key values.
Pn points to next leaf node in search-key order.

Non-Leaf Nodes in B+-Trees


Non leaf nodes form a multi-level sparse index on the leaf nodes. For a non-leaf node with m
pointers:
o All the search-keys in the subtree to which P1 points are less than K1.
o For 2 <= i <= m - 1, all the search-keys in the subtree to which Pi points have values greater than
or equal to Ki-1 and less than Ki; all the search-keys in the subtree to which Pm points are greater than or equal to Km-1.


Example of a B+-tree

B+-tree for account file (n = 3)

B+-tree for account file (n - 5)


o Leaf nodes must have between 2 and 4 values ((n-1)/2 and n-1, with n = 5).
o Non-leaf nodes other than root must have between 3 and 5 children (n/2 and n, with n = 5).
o Root must have at least 2 children.
Observations about B+-trees
o Since the inter-node connections are done by pointers, logically close blocks need not be
physically close.
o The non-leaf levels of the B+-tree form a hierarchy of sparse indices.
o The B+-tree contains a relatively small number of levels (logarithmic in the size of the main
file), thus searches can be conducted efficiently.
o Insertions and deletions to the main file can be handled efficiently, as the index can be
restructured in logarithmic time.
Queries on B+-Trees
Find all records with a search-key value of k.
o Start with the root node
Examine the node for the smallest search-key value > k.
If such a value exists, assume it is Kj. Then follow Pj to the child node.
Otherwise k >= Km-1, where there are m pointers in the node. Then follow Pm to the child node.

o If the node reached by following the pointer above is not a leaf node, repeat the above procedure
on the node, and follow the corresponding pointer.
o Eventually reach a leaf node. If for some i, key Ki = k follow pointer Pi to the desired record or
bucket. Else no record with search-key value k exists.
In processing a query, a path is traversed in the tree from the root to some leaf node.
If there are K search-key values in the file, the path is no longer than log[n/2](K), rounded up.
A node is generally the same size as a disk block, typically 4 kilobytes, and n is typically around
100 (40 bytes per index entry).
With 1 million search key values and n = 100, at most log50(1,000,000) = 4 nodes are accessed
in a lookup.
Contrast this with a balanced binary tree with 1 million search-key values: around 20 nodes
are accessed in a lookup.
o The above difference is significant since every node access may need a disk I/O, costing around
20 milliseconds!
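The node-access estimate can be checked directly: with around 100 pointers per node and nodes at least half full, the path length for K search-key values is about the ceiling of log base 50 of K. The numbers below are the ones used in the text.

import math

K = 1_000_000        # search-key values in the file
n = 100              # pointers per node; nodes are at least half full (50)

bptree_nodes = math.ceil(math.log(K, n // 2))
binary_nodes = math.ceil(math.log2(K))

print("B+-tree lookup: about", bptree_nodes, "node accesses")        # 4
print("balanced binary tree: about", binary_nodes, "node accesses")  # 20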
Updates on B+-Trees: Insertion
o Find the leaf node in which the search-key value would appear
o If the search-key value is already there in the leaf node, record is added to file and if necessary a
pointer is inserted into the bucket.
o If the search-key value is not there, then add the record to the main file and create a bucket if
necessary. Then:
If there is room in the leaf node, insert (key-value, pointer) pair in the leaf node
Otherwise, split the node (along with the new (key-value, pointer) entry) as discussed in the
next slide.
o Splitting a node:
take the n (search-key value, pointer) pairs (including the one being inserted) in sorted order.
Place the first n/2 in the original node, and the rest in a new node.
Let the new node be p, and let k be the least key value in p. Insert (k, p) in the parent of the node
being split. If the parent is full, split it and propagate the split further up.
The splitting of nodes proceeds upwards till a node that is not full is found. In the worst case the
root node may be split increasing the height of the tree by 1.


Result of splitting node containing Brighton and Downtown on inserting Clearview

B+-Tree before and after insertion of Clearview


Updates on B+-Trees: Deletion
Find the record to be deleted, and remove it from the main file and from the bucket (if present).
Remove (search-key value, pointer) from the leaf node if there is no bucket or if the bucket has
become empty.
If the node has too few entries due to the removal, and the entries in the node and a sibling fit
into a single node, then
Insert all the search-key values in the two nodes into a single node (the one on the left), and
delete the other node.


Delete the pair (Ki-1, Pi), where Pi is the pointer to the deleted node, from its parent,
recursively using the above procedure.
Otherwise, if the node has too few entries due to the removal, and the entries in the node and a
sibling fit into a single node,
then
Redistribute the pointers between the node and a sibling such that both have more than the
minimum number of entries.
Update the corresponding search-key value in the parent of the node.
The node deletions may cascade upwards till a node which has n/2 or more pointers is found. If
the root node has only one pointer after deletion, it is deleted and the sole child becomes the root.
Examples of B+-Tree Deletion

Before and after deleting Downtown


o The removal of the leaf node containing Downtown did not result in its parent having too few
pointers. So the cascaded deletions stopped with the deleted leaf node's parent.


Deletion of Perryridge from result of previous example


o Node with Perryridge becomes underfull (actually empty, in this special case) and is merged
with its sibling.
o As a result, the Perryridge node's parent became underfull, and was merged with its sibling (and an
entry was deleted from their parent).
o Root node then had only one child, and was deleted and its child became the new root node.
B+-Tree File Organization
o Index file degradation problem is solved by using B+-Tree indices.
Data file degradation problem is solved by using B+-Tree File Organization.
o The leaf nodes in a B+-tree file organization store records, instead of pointers.
o Since records are larger than pointers, the maximum number of records that can be stored in a
leaf node is less than the number of pointers in a non leaf node.
o Leaf nodes are still required to be half full.
Insertion and deletion are handled in the same way as insertion and deletion of entries in a B+-tree
index


Example of B+-tree File Organization


o Good space utilization is important since records use more space than pointers.
o To improve space utilization, involve more sibling nodes in redistribution during splits and
merges.
Involving 2 siblings in redistribution (to avoid split / merge where possible) results in each node
having at least [2n/3] entries

B-Tree Index Files


Similar to B+-tree, but B-tree allows search-key values to appear
only once; eliminates redundant storage of search keys.
Search keys in nonleaf nodes appear nowhere else in the B-tree; an
additional pointer field for each search key in a nonleaf node must be included.
Generalized B-tree leaf node; in a nonleaf node, the pointers Bi are the bucket or file record pointers.

B-Tree Index File Example


B-tree (above) and B+-tree (below) on same data

o Advantages of B-Tree indices:


May use less tree nodes than a corresponding B+-Tree.
Sometimes possible to find search-key value before reaching leaf node.
o Disadvantages of B-Tree indices:
Only small fraction of all search-key values are found early
Non-leaf nodes are larger, so fan-out is reduced. Thus B-Trees
typically have greater depth than corresponding B+-Tree
Insertion and deletion more complicated than in B+Trees
Implementation is harder than B+-Trees.
o Typically, the advantages of B-Trees do not outweigh the disadvantages.

HASHING
Hashing is a hash function computed on some attribute of each record; the result specifies
in which block of the file the record should be placed.


Static Hashing
A bucket is a unit of storage containing one or more records (a bucket is typically a disk
block). In a hash file organization we obtain the bucket of a record directly from its search-key
value using a hash function. Hash function h is a function from the set of all search-key values K
to the set of all bucket addresses B. Hash function is used to locate records for access, insertion as
well as deletion. Records with different search-key values may be mapped to the same bucket; thus
entire bucket has to be searched sequentially to locate a record.
Example of Hash File Organization
Hash file organization of account file, using branch-name as key
o There are 10 buckets,
o The binary representation of the ith character is assumed to be the integer i.
o The hash function returns the sum of the binary representations of the characters
modulo 10.
o E.g. h(Perryridge) = 5, h(Round Hill) = 3, h(Brighton) = 3
o Hash file organization of account file, using branch-name as key
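Reading "the binary representation of the ith character is the integer i" as the letter's position in the alphabet reproduces the bucket numbers quoted above; the small function below is only a sketch of that rule, and it ignores spaces and letter case.

# Hash function from the example: the i-th letter of the alphabet counts as
# the integer i; sum the character values and take the result modulo 10.
def h(branch_name, buckets=10):
    total = sum(ord(c) - ord('a') + 1 for c in branch_name.lower() if c.isalpha())
    return total % buckets

for name in ("Perryridge", "Round Hill", "Brighton"):
    print(name, "->", h(name))        # 5, 3, 3 as in the example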


Hash Functions
o The worst hash function maps all search-key values to the same bucket; this makes access
time proportional to the number of search-key values in the file.
o An ideal hash function is uniform, i.e., each bucket is assigned the same number of
search-key values from the set of all possible values.
o Ideal hash function is random, so each bucket will have the same number of records
assigned to it irrespective of the actual distribution of search-key values in the file.
o Typical hash functions perform computation on the internal binary representation
of the search-key.
o For example, for a string search-key, the binary representations of all the characters in the
string could be added and the sum modulo the number of buckets could be returned.
Handling of Bucket Overflows
o Bucket overflow can occur because of
Insufficient buckets
Skew in distribution of records.
This can occur due to two reasons:
multiple records have same search-key value
chosen hash function produces non-uniform distribution of key values
o Although the probability of bucket overflow can be reduced, it cannot be eliminated; it is
handled by using overflow buckets.
o Overflow chaining the overflow buckets of a given bucket are chained together in a
linked list.
o The Above scheme is called closed hashing.
o An alternative, called open hashing, which does not use overflow buckets, is not suitable
for database applications.


Hash Indices
o Hashing can be used not only for file organization, but also for index-structure creation.
o A hash index organizes the search keys, with their associated record pointers,into a hash
file structure.
o Strictly speaking, hash indices are always secondary indices
o If the file itself is organized using hashing, a separate primary hash index on it using the
same search-key is unnecessary.
o However, we use the term hash index to refer to both secondary index structures and hash
organized files.
Example of Hash Index

Demerits of Static Hashing


o In static hashing, function h maps search-key values to a fixed set of B of bucket
addresses.
o Databases grow with time. If initial number of buckets is too small, performance will
degrade due to too much overflows.
o If file size at some point in the future is anticipated and number of buckets allocated
accordingly, significant amount of space will be wasted initially.

o If database shrinks, again space will be wasted.


o One option is periodic re-organization of the file with a new hash function, but it is very
expensive.
o These problems can be avoided by using techniques that allow the number of buckets to
be modified dynamically.
Dynamic Hashing
Good for database that grows and shrinks in size.
Allows the hash function to be modified dynamically.
Extendable hashing one form of dynamic hashing
Hash function generates values over a large range typically b-bit integers, with b =
32.
At any time use only a prefix of the hash function to index into a table of bucket
addresses.
Let the length of the prefix be i bits, 0 <= i <= 32.
Bucket address table size = 2^i. Initially i = 0.
Value of i grows and shrinks as the size of the database grows and shrinks.
Multiple entries in the bucket address table may point to a bucket.
Thus, actual number of buckets is < 2^i.
The number of buckets also changes dynamically due to coalescing and splitting of
buckets.


General Extendable Hash Structure

In this structure, i2 = i3 = i, whereas i1 = i - 1

Use of Extendable Hash Structure


o Each bucket j stores a value ij; all the entries that point to the same bucket have the same
values on the first ij bits.
o To locate the bucket containing search-key Kj:
o 1. Compute h(Kj) = X
o 2. Use the first i high order bits of X as a displacement into bucket address table, and
follow the pointer to appropriate bucket
o To insert a record with search-key value Kj
o follow same procedure as look-up and locate the bucket, say j.
o If there is room in the bucket j insert record in the bucket.
o Else the bucket must be split and insertion re-attempted.
o Overflow buckets used instead in some cases.
Updates in Extendable Hash Structure
o To split a bucket j when inserting record with search-key value Kj:
o If i > ij (more than one pointer to bucket j)
o allocate a new bucket z, and set ij and iz to the old ij + 1.

o make the second half of the bucket address table entries pointing to j to point to z
o remove and reinsert each record in bucket j.
o recompute new bucket for Kj and insert record in the bucket (further splitting is required
if the bucket is still full)
o If i = ij (only one pointer to bucket j)
o increment i and double the size of the bucket address table.
o replace each entry in the table by two entries that point to the same bucket.
o recompute new bucket address table entry for Kj
Now i > ij so use the first case above.
o When inserting a value, if the bucket is full after several splits (that is, i reaches some
limit), create an overflow bucket instead of splitting the bucket address table further.
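The "use the first i high-order bits of h(K) as an index into the bucket address table" step can be sketched as follows. The hash function, the value of i, and the bucket contents are illustrative assumptions, and bucket splitting is omitted.

import hashlib

B = 32   # bits produced by the hash function

def h(key):
    # 32-bit hash value of the search key (any stable hash would do here).
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")

def bucket_index(key, i):
    """First i high-order bits of h(key), used as an index into the bucket address table."""
    return h(key) >> (B - i) if i > 0 else 0

i = 2                                   # current prefix length
bucket_address_table = [0, 0, 1, 1]     # 2^i entries; several may share a bucket
buckets = {0: ["Brighton"], 1: ["Perryridge"]}

key = "Downtown"
b = bucket_address_table[bucket_index(key, i)]
buckets[b].append(key)                  # insert (splitting not shown)
print(bucket_index(key, i), "->", b, buckets[b])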

o To delete a key value,


o locate it in its bucket and remove it.
o The bucket itself can be removed if it becomes empty (with appropriate updates to the
bucket address table).
o Coalescing of buckets can be done (can coalesce only with a buddy bucket having
the same value of ij and the same ij - 1 prefix, if it is present).
o Decreasing bucket address table size is also possible.

o Note: decreasing bucket address table size is an expensive operation and should be done
only if number of buckets becomes much smaller than the size of the table.
Use of Extendable Hash Structure:
Example


Initial Hash structure, bucket size = 2


Hash structure after insertion of one Brighton and two Downtown
Records

Hash structure after insertion of Mianus record


Hash structure after insertion of three Perryridge records


Hash structure after insertion of Redwood and Round Hill records

Extendable Hashing vs. Other Schemes


Benefits of extendable hashing:
o Hash performance does not degrade with growth of file
o Minimal space overhead
Disadvantages of extendable hashing
o Extra level of indirection to find desired record
o Bucket address table may itself become very big (larger than memory)
Need a tree structure to locate desired record in the structure
o Changing size of bucket address table is an expensive operation

Linear hashing is an alternative mechanism which avoids these disadvantages at the possible
cost of more bucket overflows.

4.8 INTRODUCTION TO DISTRIBUTED DATABASES AND CLIENT/SERVER TECHNOLOGY
DISTRIBUTED DATABASE SYSTEM

A distributed database system consists of loosely coupled sites that share no physical
component

Database systems that run on each site are independent of each other

Transactions may access data at one or more sites

In a homogeneous distributed database

All sites have identical software

Are aware of each other and agree to cooperate in processing user requests.

Each site surrenders part of its autonomy in terms of right to change schemas or software

Appears to user as a single system

In a heterogeneous distributed database

Different sites may use different schemas and software

Difference in schema is a major problem for query processing

Difference in software is a major problem for transaction processing

Sites may not be aware of each other and may provide only limited facilities for
cooperation in transaction processing

DISTRIBUTED DATA STORAGE


Data Replication
Data Fragmentation
Data Replication
A relation or fragment of a relation is replicated if it is stored redundantly in two or more sites.
Full replication of a relation is the case where the relation is stored at all sites. Fully redundant
databases are those in which every site contains a copy of the entire database.
Advantages of Replication
Availability: failure of a site containing relation r does not result in unavailability of r if replicas
exist.
Parallelism: queries on r may be processed by several nodes in parallel.
Reduced data transfer: relation r is available locally at each site containing a replica of r.

Disadvantages of Replication
Increased cost of updates: each replica of relation r must be updated.
Increased complexity of concurrency control: concurrent updates to distinct replicas may
lead to inconsistent data unless special concurrency control mechanisms are implemented.
One solution: choose one copy as primary copy and apply concurrency control operations on
primary copy.
Data Fragmentation
Division of relation r into fragments r1, r2, ..., rn which contain sufficient
information to reconstruct relation r.
Horizontal fragmentation : each tuple of r is assigned to one or more fragments
Vertical fragmentation : the schema for relation r is split into several smaller schemas
All schemas must contain a common candidate key (or superkey) to ensure lossless join property.
A special attribute, the tuple-id attribute may be added to each schema to serve as a candidate key.
Example: relation account with the following schema:
Account = (account_number, branch_name, balance)
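With the account schema above, the two fragmentation styles can be sketched on plain Python tuples; the branch names, the predicate used for the horizontal split, and the added tuple-id are assumptions made for the example.

# account relation: (account_number, branch_name, balance)
account = [
    ("A-101", "Hillside", 500),
    ("A-215", "Hillside", 700),
    ("A-102", "Valleyview", 400),
]

# Horizontal fragmentation: each tuple goes to the fragment for its branch.
account_hillside   = [t for t in account if t[1] == "Hillside"]
account_valleyview = [t for t in account if t[1] == "Valleyview"]

# Vertical fragmentation: split the schema; keep a tuple-id in both fragments
# so the original relation can be reconstructed by a join on tuple_id.
deposit_branch  = [(tid, acc, br)  for tid, (acc, br, bal) in enumerate(account)]
deposit_balance = [(tid, acc, bal) for tid, (acc, br, bal) in enumerate(account)]

print(account_hillside)
print(deposit_branch[0], deposit_balance[0])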
Data transparency : Degree to which system user may remain unaware of the details of
how and where the data items are stored in a distributed system.
Consider transparency issues in relation to:
Fragmentation transparency
Replication transparency
Location transparency
Naming of data items: criteria
Every data item must have a system-wide unique name.
It should be possible to find the location of data items efficiently.
It should be possible to change the location of data items transparently.
Each site should be able to create new data items autonomously.

CLIENT/ SERVER TECHNOLOGY


Client/ Server technology is a means for separating the functions of an application into two
or more distinct parts. Client/ server describes the relationship between two computer programs in
which one program, the client, makes a service request from another program, the server, which
fulfills the request. The client presents and manipulates data on the desktop computer. The server
acts like a mainframe to store and retrieve protected data. It is a network architecture in which each

computer or process on the network is either a client or a server.


Servers are powerful computers or processes dedicated to managing disk drives (file
servers), printers (print servers), or network traffic (network servers). Clients are PCs or
workstations on which users run applications. Clients rely on servers for resources, such as files,
devices, and even processing power.
Client - Server Technology
A client/ server model has following three distinct components, each focusing on a specific
job:

Database server

Client application

Network.

Database Server
A server (or "back end") manages the resources such as database, efficiently and optimally among
various clients that simultaneously request the server for the same resources. Database server
mainly concentrates on the following tasks.
Managing a single database of information among many concurrent users.
Controlling database access and other security requirements.
Protecting database of information with backup and recovery features.
Centrally enforcing global data integrity rules across all client applications.
Client Application
A client application (the "front end") is the part of the system that users apply to interact with
data. The client application in a client/ server model focus on the following job:
Presenting an interface between the user and the resource to complete the job; managing
presentation logic, performing application logic, and validating data entry; managing the
request traffic of receiving and sending information from the database server.
Network
The third component of a client/ server system is the network. The communication software is the
vehicle that transmits data between the clients and the server in a client/ server system. Both the
client and the server run communication software that allows them to talk across the network.
Three-Tier Technology
Client Server technology is also called 3-tier technology as illustrated in Figure below.
Client/server is an important idea in a network, however, it can be used by programs within a

single computer. In a network, the client/ server model provides a convenient way to interconnect
programs that are distributed efficiently across different locations. Computer transactions using the
client/ server model are very common. For example, to check your bank account from your
computer, a client program in your computer forwards your request to a server program at the
bank. That program may in turn forward the request to its own client program that sends a request
to a database server at another bank computer to retrieve your account balance.

4.9 MULTIDIMENSIONAL AND PARALLEL DATABASES


MULTIDIMENSIONAL DATABASES
A multidimensional database is a specific type of database that has been optimized for data
warehousing and OLAP (online analytical processing). A multi-dimensional database is structured
by a combination of data from various sources that work amongst databases simultaneously and
that offer networks, hierarchies, arrays, and other data formatting methods. In a multidimensional
database, the data is presented to its users through multidimensional arrays, and each individual
value of data is contained within a cell which can be accessed by multiple indexes.
A multidimensional database uses the concept of a data cube (also referred to as a
hypercube) to represent the dimensions of data currently available to its user(s). The
multidimensional database concept is designed to assist with decision support systems.
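As a rough illustration (not from the source), the data cube idea can be pictured with a NumPy array in Python. The Time, Product and Market dimensions and the values below are invented for the example; the point is only that each cell is reached through one index per dimension and that slices and roll-ups follow naturally.

import numpy as np

# Hypothetical cube with three dimensions: Time x Product x Market.
time_periods, products, markets = 4, 3, 2          # e.g. 4 quarters, 3 products, 2 markets
cube = np.arange(time_periods * products * markets).reshape(time_periods, products, markets)

cell = cube[1, 2, 0]                        # a single cell, addressed by one index per dimension
quarter1 = cube[0, :, :]                    # a 2-D slice: every product and market in quarter 1
sales_by_product = cube.sum(axis=(0, 2))    # a simple roll-up: total value per product
print(cell, sales_by_product)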

This detailed organization of the data allows for advanced and complex query generation while
providing outstanding performance in certain cases when compared to traditional relational
structures and databases. This type of database is usually structured in an order that optimizes
OLAP and data warehouse applications.


Fig. Multidimensional Database


Dimensions and Members
This section introduces the concepts of outlines, dimensions, and members within a
multidimensional database. If you understand dimensions and members, you are well on your way
to understanding the power of a multidimensional database.
A dimension represents the highest consolidation level in the database outline. The database
outline presents dimensions and members in a tree structure to indicate a consolidation
relationship.
Standard dimensions represent the core components of a business plan and often relate to
departmental functions. Typical standard dimensions: Time, Accounts, Product Line, Market, and
Division. Dimensions change less frequently than members.
Attribute dimensions are associated with standard dimensions. Members are the individual
components of a dimension. For example, Product A, Product B, and Product C might be members
of the Product dimension. Each member has a unique name. Essbase can store the data associated
with a member (referred to as a stored member in this chapter), or it can dynamically calculate the
data when a user retrieves it.
PARALLEL DATABASES

Data can be partitioned across multiple disks for parallel I/O.


Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel: data can be
partitioned and each processor can work independently on its own partition.
Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes

parallelization easier.
Different queries can be run in parallel with each other. Concurrency control takes care of
conflicts. Thus, databases naturally lend themselves to parallelism.
Reduce the time required to retrieve relations from disk by partitioning the relations on multiple
disks.
Horizontal partitioning: tuples of a relation are divided among many disks such that each
tuple resides on one disk.
Partitioning techniques (number of disks = n):
Round-robin:
Send the ith tuple inserted in the relation to disk i mod n.
Hash partitioning:
Choose one or more attributes as the partitioning attributes.
Choose a hash function h with range 0 ... n - 1.
Let i denote result of hash function h applied to the partitioning attribute value of a tuple.
Send tuple to disk i.
Range partitioning:
Choose an attribute as the partitioning attribute.
A partitioning vector [vo, v1, ..., vn-2] is chosen.
Let v be the partitioning attribute value of a tuple. Tuples such that vi <= v < vi+1 go to disk
i + 1. Tuples with v < v0 go to disk 0 and tuples with v >= vn-2 go to disk n - 1.
E.g., with a partitioning vector [5,11], a tuple with partitioning attribute value of 2 will
go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to
disk 2. A small Python sketch of the three partitioning techniques follows.
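The following sketch is an illustration only: the n disks are modelled as in-memory Python lists and the partitioning attribute is a made-up key named k; a real DBMS would route tuples to physical disks.

n = 3
disks = [[] for _ in range(n)]

def round_robin(i, tuple_):
    # send the i-th inserted tuple to disk i mod n
    disks[i % n].append(tuple_)

def hash_partition(tuple_, key):
    # send the tuple to the disk chosen by hashing its partitioning attribute
    disks[hash(tuple_[key]) % n].append(tuple_)

def range_partition(tuple_, key, vector):
    # vector = [v0, v1, ..., vn-2]; values < v0 go to disk 0,
    # values in [vi, vi+1) go to disk i + 1, values >= vn-2 go to disk n - 1
    v = tuple_[key]
    for i, bound in enumerate(vector):
        if v < bound:
            disks[i].append(tuple_)
            return
    disks[len(vector)].append(tuple_)

# With partitioning vector [5, 11]: value 2 -> disk 0, 8 -> disk 1, 20 -> disk 2.
for val in (2, 8, 20):
    range_partition({"k": val}, "k", [5, 11])
print(disks)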

INTERQUERY PARALLELISM
Queries/transactions execute in parallel with one another.
Increases transaction throughput; used primarily to scale up a transaction processing system
to support a larger number of transactions per second.
Easiest form of parallelism to support, particularly in a shared-memory parallel database,
because even sequential database systems support concurrent processing.
More complicated to implement on shared-disk or shared-nothing architectures
Locking and logging must be coordinated by passing messages between processors.
Data in a local buffer may have been updated at another processor.

Cache coherency has to be maintained: reads and writes of data in the buffer must find the
latest version of the data.

INTRAQUERY PARALLELISM
Execution of a single query in parallel on multiple processors/disks; important for
speeding up long-running queries.
Two complementary forms of intraquery parallelism:
Intraoperation parallelism - parallelize the execution of each individual operation in the query.
Interoperation parallelism - execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed
by each operation is typically much larger than the number of operations in a query. A minimal
sketch of intraoperation parallelism is given below.
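The sketch below is only an analogy, not a DBMS implementation: a single aggregation (a sum) is executed in parallel, one worker process per partition of an already horizontally partitioned relation, and the partial results are then combined. The partition data and function names are invented for the example.

from multiprocessing import Pool

def partial_sum(partition):
    # each worker operates independently on its own partition
    return sum(partition)

if __name__ == "__main__":
    partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]       # horizontally partitioned relation
    with Pool(processes=len(partitions)) as pool:
        total = sum(pool.map(partial_sum, partitions))   # combine the partial aggregates
    print(total)                                         # 45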

4.9 MOBILE DATABASES


4.9.1 Mobile Computing Architecture
The general architecture of a mobile platform


It is a distributed architecture in which a number of computers, generally referred to as Fixed
Hosts and Base Stations, are interconnected through a high-speed wired network.
Fixed hosts are general purpose computers configured to manage mobile units.
Base stations function as gateways to the fixed network for the Mobile Units.
Wireless Communications
The wireless medium has bandwidth significantly lower than that of a
wired network.
The current generation of wireless technology has data rates ranging from
tens to hundreds of kilobits per second (2G cellular telephony) to tens of
megabits per second (wireless Ethernet, popularly known as WiFi).
Modern (wired) Ethernet, by comparison, provides data rates on the order
of hundreds of megabits per second.
Other characteristics that distinguish wireless connectivity options:
interference,
locality of access,
range,
support for packet switching,
seamless roaming throughout a geographical region.

Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of
the frequency spectrum, which may cause interference with other appliances,
such as cordless telephones.
Modern wireless networks can transfer data in units called packets, as wired
networks do, in order to conserve bandwidth.
Client/Network Relationships
Mobile units can move freely in a geographic mobility domain, an area that
is circumscribed by wireless network coverage.
To manage mobility, the entire mobility domain is divided into one or more
smaller domains, called cells, each of which is supported by at least one base
station.
Mobile units can move unrestricted throughout the cells of a domain while

maintaining information access contiguity.


The communication architecture described earlier is designed to give the
mobile unit the impression that it is attached to a fixed network, emulating a
traditional client-server architecture.
Wireless communications, however, make other architectures possible.
In a MANET (mobile ad hoc network), co-located mobile units do not need to communicate via a
fixed network; instead, they form their own network using cost-effective
technologies such as Bluetooth.

In a MANET, mobile units are responsible for routing their own data,
effectively acting as base stations as well as clients.

Moreover, they must be robust enough to handle changes in the


network topology, such as the arrival or departure of other mobile
units.

MANET applications can be considered as peer-to-peer, meaning that a


mobile unit is simultaneously a client and a server.
Transaction processing and data consistency control become
more difficult since there is no central control in this
architecture.
Resource discovery and data routing by mobile units make computing in
a MANET even more complicated.

Sample MANET applications are multi-user games, shared


whiteboard, distributed calendars, and battle information sharing.

4.9.2 Characteristics of Mobile Environments


The characteristics of mobile computing include:
Communication latency
Intermittent connectivity
Limited battery life
Changing client location
The server may not be able to reach a client.
A client may be unreachable because it is dozing in an energy-conserving
state in which many subsystems are shut down or because it is out of range
of a base station.

In either case, neither client nor server can reach the other, and
modifications must be made to the architecture in order to compensate
for this case.

Proxies for unreachable components are added to the architecture.


For a client (and symmetrically for a server), the proxy can cache
updates intended for the server.
Mobile computing poses challenges for servers as well as clients.
The latency involved in wireless communication makes scalability a problem.

Since latency due to wireless communications increases the


time to service each client request, the server can handle fewer
clients.

One way servers relieve this problem is by broadcasting data whenever possible.
A server can simply broadcast data periodically.

Broadcast also reduces the load on the server, as clients do not have to
maintain active connections to it.

Client mobility also poses many data management challenges.

Servers must keep track of client locations in order to efficiently route


messages to them.

Client data should be stored in the network location that minimizes the traffic
necessary to access it.
The act of moving between cells must be transparent to the client.
The server must be able to gracefully divert the shipment of data from one
base station to another, without the client noticing.
Client mobility also allows new applications that are location-based.
WEB DATABASES
A web database is a system for storing information that can then be accessed via a website.
For example, an online community may have a database that stores the username, password, and
other details of all its members. The most commonly used database system for the internet is
MySQL, due to its integration with PHP, one of the most widely used server-side programming
languages.

At its most simple level, a web database is a set of one or more tables that contain data.
Each table has different fields for storing information of various types. These tables can then be
linked together in order to manipulate data in useful or interesting ways. In many cases, a table
will use a primary key, which must be unique for each entry and allows for unambiguous selection
of data.
A web database can be used for a range of different purposes. Each field in a table has to
have a defined data type. For example, numbers, strings, and dates can all be inserted into a web
database. Proper database design involves choosing the correct data type for each field in order to
reduce memory consumption and increase the speed of access. Although for small databases this
often isn't so important, big web databases can grow to millions of entries and need to be well
designed to work effectively.
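A minimal sketch using Python's built-in sqlite3 module (the members table, its fields and the sample row are invented for illustration) shows a web-style table with typed fields and a primary key.

import sqlite3

conn = sqlite3.connect(":memory:")          # a throw-away database for the example
conn.execute("""
    CREATE TABLE members (
        id       INTEGER PRIMARY KEY,       -- unique for each entry
        username TEXT    NOT NULL,
        joined   DATE
    )
""")
conn.execute("INSERT INTO members (username, joined) VALUES (?, ?)",
             ("alice", "2015-06-01"))
row = conn.execute("SELECT * FROM members WHERE username = ?", ("alice",)).fetchone()
print(row)
conn.close()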
SPATIAL AND MULTIMEDIA DATABASES
SPATIAL DATABASE
Types of Spatial Data
Point Data
Points in a multidimensional space
E.g., Raster data such as satellite imagery, where
each pixel stores a measured value
E.g., Feature vectors extracted from text
Region Data
Objects have spatial extent with location and boundary.
DB typically uses geometric approximations constructed using line
segments, polygons, etc., called vector data.
Types of Spatial Queries
Spatial Range Queries
Find all cities within 50 miles of Madison
Query has associated region (location, boundary)
Answer includes overlapping or contained data regions
Nearest-Neighbor Queries
Find the 10 cities nearest to Madison
Results must be ordered by proximity

Spatial Join Queries


Find all cities near a lake
Expensive, join condition involves regions and proximity
Applications of Spatial Data
Geographic Information Systems (GIS)
E.g., ESRI's ArcInfo; OpenGIS Consortium
Geospatial information
All classes of spatial queries and data are common
Computer-Aided Design/Manufacturing
Store spatial objects such as surface of airplane fuselage
Range queries and spatial join queries are common
Multimedia Databases
Images, video, text, etc. stored and retrieved by content
First converted to feature vector form; high dimensionality
Nearest-neighbor queries are the most common

Single-Dimensional Indexes
B+ trees are fundamentally single-dimensional indexes.
When we create a composite search key B+ tree, e.g., an index on <age, sal>, we
effectively linearize the 2-dimensional space since we sort entries first by age and then by
sal.
Multi-dimensional Indexes
A multidimensional index clusters entries so as to exploit nearness in multidimensional
space.
Keeping track of entries and maintaining a balanced index structure presents a
challenge.
Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD).
Find all hotels within a radius of 5 miles from the conference venue.
Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
Find all cities that lie on the Nile in Egypt.
Find all parts that touch the fuselage (in a plane design).

Similarity queries (content-based retrieval).


Given a face, find the five most similar faces.
Multidimensional range queries.
50 < age < 55 AND 80K < sal < 90K
Drawbacks
An index based on spatial location is needed.
One-dimensional indexes don't support multidimensional searching efficiently.
Hash indexes only support point queries; want to support range queries as well.
Must support inserts and deletes gracefully.
Ideally, want to support non-point data as well (e.g., lines, shapes).
The R-tree meets these requirements, and variants are widely used today.

MULTIMEDIA DATABASES
To provide such database functions as indexing and consistency, it is desirable to store
multimedia data in a database
Rather than storing them outside the database, in a file system
The database must handle large object representation.
Similarity-based retrieval must be provided by special index structures.
Must provide guaranteed steady retrieval rates for continuous-media data.
Multimedia Data Formats
Store and transmit multimedia data in compressed form
JPEG and GIF are the most widely used formats for image data.
The MPEG standards for video data use commonalities among a sequence of
frames to achieve a greater degree of compression.
MPEG-1 quality comparable to VHS video tape.
Stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB
MPEG-2 designed for digital broadcast systems and digital video disks; negligible
loss of video quality.
Compresses 1 minute of audio-video to approximately 17 MB.
Several alternatives of audio encoding
MPEG-1 Layer 3 (MP3), RealAudio, WindowsMedia format, etc.
Continuous-Media Data
Most important types are video and audio data.
Characterized by high data volumes and real-time information-delivery requirements.
Data must be delivered sufficiently fast that there are no gaps in the audio or
video.
Data must be delivered at a rate that does not cause overflow of system buffers.
Synchronization among distinct data streams must be maintained
video of a person speaking must show lips moving synchronously
with the audio


Video Servers

Video-on-demand systems deliver video from central video servers, across a


network, to terminals
must guarantee end-to-end delivery rates

Current video-on-demand servers are based on file systems; existing database


systems do not meet real-time response requirements.
Multimedia data are stored on several disks (RAID configuration), or on tertiary
storage for less frequently accessed data.
Head-end terminals - used to view multimedia data
PCs or TVs attached to a small, inexpensive computer called a set-top box.
Similarity-Based Retrieval
Examples of similarity based retrieval

Pictorial data:

Two pictures or images that are slightly different as

represented in the database may be considered the same by a user.


e.g., identify similar designs for registering a new trademark.
Audio data: Speech-based user interfaces allow the user to give a command or
identify a data item by speaking.
e.g., test user input against stored commands.
Handwritten data: Identify a handwritten data item or command stored in the
database

DATA WAREHOUSING AND MINING


Data Warehouse:
Large organizations have complex internal organizations, and have data
stored at different locations, on different operational (transaction processing)
systems, under different schemas
Data sources often store only current data, not historical data
Corporate decision making requires a unified view of all organizational data,
including historical data
A data warehouse is a repository (archive) of information gathered from

multiple sources, stored under a unified schema, at a single site


Greatly simplifies querying, permits study of historical trends
Shifts decision support query load away from transaction processing systems
Components of a Datawarehouse
When and how to gather data
Source driven architecture: data sources transmit new information to warehouse,
either continuously or periodically (e.g. at night)
Destination driven architecture: warehouse periodically requests new information
from data sources
Keeping the warehouse exactly synchronized with data sources (e.g. using two-phase commit) is too expensive

Usually OK to have slightly out-of-date data at warehouse


Data/updates are periodically downloaded from online transaction processing (OLTP)
systems.
What schema to use
Schema integration
Data cleansing
E.g. correct mistakes in addresses

E.g. misspellings, zip code errors


Merge address lists from different sources and purge duplicates
Keep only one address record per household (householding)
How to propagate updates
Warehouse schema may be a (materialized) view of schema from data sources
Efficient techniques for update of materialized views
What data to summarize
Raw data may be too large to store on-line
Aggregate values (totals/subtotals) often suffice
Queries on raw data can often be transformed by query optimizer to use
aggregate values.

WAREHOUSE SCHEMA
Typically warehouse data is multidimensional, with very large fact tables
Examples of dimensions: item-id, date/time of sale, store where sale was made,
customer identifier
Examples of measures: number of items sold, price of items
Dimension values are usually encoded using small integers and mapped to full
values via dimension tables
Resultant schema is called a star schema
More complicated schema structures
Snowflake schema: multiple levels of dimension tables
Constellation: multiple fact tables


Data Mining
Broadly speaking, data mining is the process of semi-automatically analyzing
large databases to find useful patterns.
Like knowledge discovery in artificial intelligence, data mining discovers statistical
rules and patterns.
Differs from machine learning in that it deals with large volumes of data stored
primarily on disk.
Some types of knowledge discovered from a database can be represented by a set
of rules, e.g.: Young women with annual incomes greater than $50,000 are most
likely to buy sports cars.
Other types of knowledge represented by equations, or by prediction functions.
Some manual intervention is usually required
Pre-processing of data, choice of which type of pattern to find,
postprocessing to find novel patterns
Applications of Data Mining
Prediction based on past history
Predict if a credit card applicant poses a good credit risk, based on
some attributes (income, job type, age, ..) and past history
Predict if a customer is likely to switch brand loyalty
Predict if a customer is likely to respond to junk mail


Predict if a pattern of phone calling card usage is likely to be fraudulent


Some examples of prediction mechanisms:
Classification
Given a training set consisting of items belonging to different classes, and a
new item whose class is unknown, predict which class it belongs to.
Regression formulae
Given a set of parameter-value to function-result mappings for an
unknown function, predict the function result for a new parameter value
Descriptive Patterns

Associations
Find books that are often bought by the same customers. If a new customer
buys one such book, suggest that he buys the others too.
Other similar applications: camera accessories,
clothes, etc. Associations may also be used as a first step in
detecting causation
E.g. association between exposure to chemical X and cancer, or new medicine
and cardiac problems
Clusters
E.g. typhoid cases were clustered in an area surrounding a contaminated well
Detection of clusters remains important in detecting epidemics

DATA MART
A data mart is the access layer of the data warehouse environment that is used to get data out to
the users. The data mart is a subset of the data warehouse that is usually oriented to a specific
business line or team. Data marts are small slices of the data warehouse. Whereas data
warehouses have an enterprise-wide depth, the information in data marts pertains to a single
department. In some deployments, each department or business unit is considered the owner of

its data mart including all the hardware, software and data. This enables each department to use,
manipulate and develop their data any way they see fit; without altering information inside other
data marts or the data warehouse. In other deployments where conformed dimensions are used,
this business unit ownership will not hold true for shared dimensions like customer, product, etc.
Organizations build data warehouses and data marts because the information in their operational
databases is not organized in a way that makes it easy for them to find what they need. Also,
complicated queries might take a long time to answer, since those database systems are designed
to process millions of transactions per day.
Transactional databases are designed to be updated; data warehouses or marts, however, are read-only.
Data warehouses are designed to access large groups of related records.
Data marts improve end-user response time by allowing users to have access to the specific type
of data they need to view most often by providing the data in a way that supports the collective
view of a group of users.
A data mart is basically a condensed and more focused version of a data warehouse that reflects
the regulations and process specifications of each business unit within an organization. Each data
mart is dedicated to a specific business function or region. This subset of data may span across
many or all of an enterprise's functional subject areas. It is common for multiple data marts to be
used in order to serve the needs of each individual business unit (different data marts can be used
to obtain specific information for various enterprise departments, such as accounting, marketing,
sales, etc.)


UNIT V
ADVANCED TOPICS
DATABASE SECURITY: Data Classification - Threats and Risks - Database Access Control -
Types of Privileges - Cryptography - Statistical Databases - Distributed Databases - Architecture -
Transaction Processing - Data Warehousing and Mining - Classification - Association Rules -
Clustering - Information Retrieval - Relevance Ranking - Crawling and Indexing the Web - Object
Oriented Databases - XML Databases.
5.1 DATABASE SECURITY: Data Classification
Database security concerns the use of a broad range of information security controls to
protect databases (potentially including the data, the database applications or stored functions,
the database systems, the database servers and the associated network links) against
compromises of their confidentiality, integrity and availability. It involves various types or
categories of controls, such as technical, procedural/administrative and physical. Database
security is a specialist topic within the broader realms of computer security, information security
and risk management.
Security risks to database systems include, for example:

Unauthorized or unintended activity or misuse by authorized database users, database


administrators, or network/systems managers, or by unauthorized users or hackers (e.g.
inappropriate access to sensitive data, metadata or functions within databases, or
inappropriate changes to the database programs, structures or security configurations);

Malware infections causing incidents such as unauthorized access, leakage or disclosure


of personal or proprietary data, deletion of or damage to the data or programs,
interruption or denial of authorized access to the database, attacks on other systems and
the unanticipated failure of database services;

Overloads, performance constraints and capacity issues resulting in the inability of


authorized users to use databases as intended;


Physical damage to database servers caused by computer room fires or floods, overheating,
lightning, accidental liquid spills, static discharge, electronic breakdowns/equipment
failures and obsolescence;

Design flaws and programming bugs in databases and the associated programs and
systems, creating various security vulnerabilities (e.g. unauthorized privilege escalation),
data loss/corruption, performance degradation etc.;

Data corruption and/or loss caused by the entry of invalid data or commands, mistakes in
database or system administration processes, sabotage/criminal damage etc.

Types of Security

Legal and ethical issues

Policy issues

System-related issues

The need to identify multiple security levels


Threats to databases
- Loss of integrity
- Loss of availability
- Loss of confidentiality
5.2 Threats and Risks
Types of threats to database security
1. Privilege abuse: When database users are provided with privileges that exceed their
day-to-day job requirements, these privileges may be abused intentionally or unintentionally.
Take, for instance, a database administrator in a financial institution. What will happen if
he turns off audit trails or creates bogus accounts? He will be able to transfer money from one
account to another, thereby abusing the excessive privilege intentionally.
Having seen how privilege can be abused intentionally, let us see how privilege can be
abused unintentionally. A company is providing a work from home option to its employees

and the employee takes a backup of sensitive data to work on from his home. This not only
violates the security policies of the organization, but also may result in data security breach if the
system at home is compromised.
2. Operating System vulnerabilities: Vulnerabilities in underlying operating systems
like Windows, UNIX, Linux, etc., and the services that are related to the databases could lead to
unauthorized access. This may lead to a Denial of Service (DoS) attack. This could be prevented
by updating the operating system related security patches as and when they become available.
3. Database rootkits: A database rootkit is a program or a procedure that is hidden inside
the database and that provides administrator-level privileges to gain access to the data in the
database. These rootkits may even turn off alerts triggered by Intrusion Prevention Systems
(IPS). It is possible to install a rootkit only after compromising the underlying operating system.
This can be detected by periodic audit trails; otherwise, the presence of the database rootkit may go
undetected.
4. Weak authentication: Weak authentication models allow attackers to employ
strategies such as social engineering and brute force to obtain database login credentials and
assume the identity of legitimate database users.
5. Weak audit trails: A weak audit logging mechanism in a database server represents a
critical risk to an organization especially in retail, financial, healthcare, and other industries with
stringent regulatory compliance. Regulations such as PCI, SOX, and HIPAA demand extensive
logging of actions to reproduce an event at a later point of time in case of an incident. Logging of
sensitive or unusual transactions happening in a database must be done in an automated manner
for resolving incidents. Audit trails act as the last line of database defense. Audit trails can detect
the existence of a violation that could help trace back the violation to a particular point of time
and a particular user.
5.3 Database Access Control

To protect databases against these types of threats, four kinds of countermeasures can be
implemented: access control, inference control, flow control, and encryption.

A DBMS typically includes a database security and authorization subsystem that is


responsible for ensuring the security of portions of a database against unauthorized access.
Two types of database security mechanisms:
Discretionary security mechanisms

Mandatory security mechanisms

The security mechanism of a DBMS must include provisions for restricting access to the
database as a whole; this function is called access control and is handled by creating user
accounts and passwords to control the login process by the DBMS.
Discretionary Access Control Based on Granting and Revoking Privileges
The typical method of enforcing discretionary access control in a database system is
based on the granting and revoking of privileges.
5.3.1 Types of Discretionary Privileges
The account level: At this level, the DBA specifies the particular privileges that each
account holds independently of the relations in the database.
The relation (or table level): At this level, the DBA can control the privilege to access
each individual relation or view in the database.
The privileges at the account level apply to the capabilities provided to the account itself
and can include the CREATE SCHEMA or CREATE TABLE privilege, to create a
schema or base relation; the CREATE VIEW privilege; the ALTER privilege, to apply
schema changes such as adding or removing attributes from relations; the DROP privilege,
to delete relations or views; the MODIFY privilege, to insert, delete, or update tuples;
and the SELECT privilege, to retrieve information from the database by using a SELECT
query.
The second level of privileges applies to the relation level, whether they are base
relations or virtual (view) relations. The granting and revoking of privileges generally follow an
authorization model for discretionary privileges known as the access matrix model, where the
rows of a matrix M represent subjects (users, accounts, programs) and the columns represent


objects (relations, records, columns, views, operations). Each position M(i,j) in the matrix
represents the types of privileges (read, write, update) that subject i holds on object j.
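A toy Python illustration of the access matrix model; the subjects, objects and privilege sets below are made up, and a real DBMS stores this information in its catalog rather than in a dictionary.

# Rows are subjects, columns are objects; M[i][j] is the set of privileges
# that subject i holds on object j.
M = {
    "alice": {"EMPLOYEE": {"read", "update"}, "SALARY": {"read"}},
    "bob":   {"EMPLOYEE": {"read"}},
}

def allowed(subject, obj, privilege):
    # look up position M(i, j) and test whether the privilege is present
    return privilege in M.get(subject, {}).get(obj, set())

print(allowed("bob", "SALARY", "read"))   # False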
To control the granting and revoking of relation privileges, each relation R in a
database is assigned an owner account, which is typically the account that was used when the
relation was created in the first place. The owner of a relation is given all privileges on that
relation. In SQL2, the DBA can assign an owner to a whole schema by creating the schema and
associating the appropriate authorization identifier with that schema, using the CREATE
SCHEMA command. The owner account holder can pass privileges on any of the owned relations
to other users by granting privileges to their accounts.
In SQL the following types of privileges can be granted on each individual relation R:
SELECT (retrieval or read) privilege on R: Gives the account retrieval privilege. In SQL
this gives the account the privilege to use the SELECT statement to retrieve tuples from
R.
MODIFY privileges on R: This gives the account the capability to modify tuples of R. In
SQL this privilege is further divided into UPDATE, DELETE, and INSERT privileges to
apply the corresponding SQL command to R. In addition, both the INSERT and
UPDATE privileges can specify that only certain attributes can be updated by the
account.
REFERENCES privilege on R: This gives the account the capability to reference relation
R when specifying integrity constraints. The privilege can also be restricted to specific
attributes of R.
Notice that to create a view, the account must have SELECT privilege on all relations involved
in the view definition.
5.3.2 Specifying Privileges Using Views
The mechanism of views is an important discretionary authorization mechanism in its
own right. For example, if the owner A of a relation R wants another account B to be able to
retrieve only some fields of R, then A can create a view V of R that includes only those

attributes and then grant SELECT on V to B. The same applies to limiting B to retrieving only
certain tuples of R; a view V can be created by defining the view by means of a query that
selects only those tuples from R that A wants to allow B to access.
5.3.3 Revoking Privileges
In some cases it is desirable to grant a privilege to a user temporarily. For example, the
owner of a relation may want to grant the SELECT privilege to a user for a specific task and then
revoke that privilege once the task is completed. Hence, a mechanism for revoking privileges is
needed. In SQL, a REVOKE command is included for the purpose of canceling privileges.
5.3.4 Mandatory Access Control and Role-Based Access Control for Multilevel Security
The discretionary access control techniques of granting and revoking privileges on
relations have traditionally been the main security mechanism for relational database systems.
This is an all-or-nothing method: a user either has or does not have a certain privilege. In many
applications, an additional security policy is needed that classifies data and users based on
security classes. This approach, known as mandatory access control, would typically be combined with
the discretionary access control mechanisms.
Typical security classes are top secret (TS), secret (S), confidential (C), and
unclassified (U), where TS is the highest level and U the lowest: TS ≥ S ≥ C ≥ U. The commonly
used model for multilevel security, known as the Bell-LaPadula model, classifies each subject
(user, account, program) and object (relation, tuple, column, view, operation) into one of the
security classifications TS, S, C, or U. We refer to the clearance (classification) of a subject S as
class(S) and to the classification of an object O as class(O).
Two restrictions are enforced on data access based on the subject/object classifications:
1. A subject S is not allowed read access to an object O unless class(S) ≥ class(O). This is
known as the simple security property.
2. A subject S is not allowed to write an object O unless class(S) ≤ class(O). This is known as
the star property (or * property). A small code sketch of these two rules is given below.
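The Python sketch below only encodes the two restrictions above; the ordering of the four classes is as stated, and the function names are invented for the example.

# Security classes ordered U < C < S < TS.
LEVEL = {"U": 0, "C": 1, "S": 2, "TS": 3}

def can_read(subject_class, object_class):
    # simple security property: class(S) >= class(O)
    return LEVEL[subject_class] >= LEVEL[object_class]

def can_write(subject_class, object_class):
    # star property: class(S) <= class(O)
    return LEVEL[subject_class] <= LEVEL[object_class]

print(can_read("S", "C"), can_write("S", "C"))   # True False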


In a multilevel relation, each attribute value in a tuple has an associated classification attribute
Ci, and each tuple t has a tuple classification attribute TC. The value of the TC attribute in each
tuple t, which is the highest of all attribute classification values within t, provides a general
classification for the tuple itself, whereas each Ci provides a finer security classification for each
attribute value within the tuple.
The apparent key of a multilevel relation is the set of attributes that would have
formed the primary key in a regular(single-level) relation.
A multilevel relation will appear to contain different data to subjects (users) with
different clearance levels. In some cases, it is possible to store a single tuple in the relation at a
higher classification level and produce the corresponding tuples at a lower-level classification
through a process known as filtering.
In other cases, it is necessary to store two or more tuples at different classification
levels with the same value for the apparent key. This leads to the concept of polyinstantiation
where several tuples can have the same apparent key value but have different attribute values for
users at different classification levels.
In general, the entity integrity rule for multilevel relations states that all attributes that
are members of the apparent key must not be null and must have the same security classification
within each individual tuple.
In addition, all other attribute values in the tuple must have a security classification
greater than or equal to that of the apparent key. This constraint ensures that a user can see the
key if the user is permitted to see any part of the tuple at all.
Other integrity rules, called null integrity and interinstance integrity, informally
ensure that if a tuple value at some security level can be filtered (derived) from a higher-classified
tuple, then it is sufficient to store the higher-classified tuple in the multilevel relation.
5.3.5 Role-Based Access Control
Role-based access control (RBAC) emerged rapidly in the 1990s as a proven
technology for managing and enforcing security in large-scale enterprise-wide systems. Its basic
notion is that permissions are associated with roles, and users are assigned to appropriate roles.
Roles can be created using the CREATE ROLE and DESTROY ROLE commands. The GRANT

and REVOKE commands discussed under DAC can then be used to assign and revoke privileges
from roles.
RBAC appears to be a viable alternative to traditional discretionary and mandatory access
controls; it ensures that only authorized users are given access to certain data or
resources.
Many DBMSs have allowed the concept of roles, where privileges can be assigned to
roles.
Role hierarchy in RBAC is a natural way of organizing roles to reflect the organization's
lines of authority and responsibility.
Another important consideration in RBAC systems is the possible temporal constraints
that may exist on roles, such as time and duration of role activations, and timed triggering
of a role by an activation of another role.
Using an RBAC model is a highly desirable goal for addressing the key security
requirements of Web-based applications.
In contrast, discretionary access control (DAC) and mandatory access control (MAC) models
lack the capabilities needed to support the security requirements of emerging enterprises and
Web-based applications.
5.4 Types of Privileges
Privileges: Privileges define the access rights provided to a user on a database object.
There are two types of privileges.
1) System privileges - This allows the user to CREATE, ALTER, or DROP
database objects.
2) Object privileges - This allows the user to EXECUTE, SELECT, INSERT,
UPDATE, or DELETE data from database objects to which the privileges apply.
A few CREATE system privileges are listed below:


System Privilege        Description
CREATE object           Allows users to create the specified object in their own schema.
CREATE ANY object       Allows users to create the specified object in any schema.

The above rules also apply for ALTER and DROP system privileges.
Few of the object privileges are listed below:
Object Privilege        Description
INSERT                  Allows users to insert rows into a table.
SELECT                  Allows users to select data from a database object.
UPDATE                  Allows users to update data in a table.
EXECUTE                 Allows users to execute a stored procedure or a function.
Roles: Roles are a collection of privileges or access rights. When there are many users in a
database it becomes difficult to grant or revoke privileges for each user individually. Therefore, if you
define roles, you can grant or revoke privileges to a role, thereby automatically granting or revoking
those privileges for every user assigned to the role. You can either create roles or use the system roles
pre-defined by Oracle. Some of the privileges granted to the system roles are as given below:
System Role     Privileges Granted to the Role
CONNECT         CREATE TABLE, CREATE VIEW, CREATE SYNONYM, CREATE SEQUENCE, CREATE SESSION, etc.
RESOURCE        CREATE PROCEDURE, CREATE SEQUENCE, CREATE TABLE, CREATE TRIGGER, etc. The primary
                usage of the RESOURCE role is to restrict access to database objects.
DBA             ALL SYSTEM PRIVILEGES


5.5 Cryptography
A DBMS can use encryption to protect information in certain situations where the normal
security mechanisms of the DBMS are not adequate. For example, an intruder may steal tapes
containing some data or tap a communication line. By storing and transmitting data in an
encrypted form, the DBMS ensures that such stolen data is not intelligible to the intruder. Thus,
encryption is a technique to provide privacy of data.

In encryption, the message to be encrypted is known as plaintext. The plaintext is


transformed by a function that is parameterized by a key. The output of the encryption process is
known as the ciphertext. The ciphertext is then transmitted over the network. The process of
converting the plaintext to ciphertext is called encryption and the process of converting the
ciphertext to plaintext is called decryption. Encryption is performed at the transmitting end
and decryption is performed at the receiving end. For the encryption process we need the encryption
key and for the decryption process we need the decryption key, as shown in the figure. Without
knowledge of the decryption key, an intruder cannot recover the plaintext from the ciphertext. This
process is also called cryptography.
The basic idea behind encryption is to apply an encryption algorithm, which may be
accessible to the intruder, to the original data and a user-specified or DBA-specified encryption
key, which is kept secret. The output of the algorithm is the encrypted version of the data. There
is also a decryption algorithm, which takes the encrypted data and the decryption key as input
and then returns the original data. Without the correct decryption key, the decryption algorithm
produces gibberish. The encryption and decryption keys may be the same or different, but there must
be a relation between the two, and that relation must be kept secret.

5.5.1Techniques used for Encryption


There are following techniques used for encryption process:
Substitution Ciphers
Transposition Ciphers
Substitution Ciphers: In a substitution cipher each letter or group of letters is replaced by
another letter or group of letters to mask them. For example: a is replaced with D, b with E, c
with F, and z with C. In this way attack becomes DWWDFN. Substitution ciphers are not
very secure because an intruder can easily guess the substitution characters.
Transposition Ciphers: Substitution ciphers preserve the order of the plaintext symbols but
mask them; a transposition cipher, in contrast, reorders the letters but does not mask them. For
this process a key is used. For example, iliveinqadian may be coded as divienaniqnli. Transposition
ciphers are more secure than substitution ciphers.
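A short Python sketch of the substitution cipher in the example above (a shift of three positions, written in upper case); it only illustrates the idea and is not a secure cipher.

def substitute(plaintext, shift=3):
    # replace each letter by the letter 'shift' positions later in the alphabet
    out = []
    for ch in plaintext.lower():
        if ch.isalpha():
            out.append(chr((ord(ch) - ord('a') + shift) % 26 + ord('A')))
        else:
            out.append(ch)
    return "".join(out)

print(substitute("attack"))   # DWWDFN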
5.5.2 Algorithms for Encryption Process
There are commonly used algorithms for encryption process. These are:
Data Encryption Standard (DES)
Public Key Encryption
Data Encryption Standard (DES)
It uses both a substitution of characters and a rearrangement of their order on the basis of
an encryption key. The main weakness of this approach is that authorized users must be told the
encryption key, and the mechanism for communicating this information is vulnerable to clever
intruders.


Public Key Encryption


Another approach to encryption, called public-key encryption, has become increasingly
popular in recent years. The encryption scheme proposed by Rivest, Shamir, and Adleman,
called RSA, is a well-known example of public-key encryption. Each authorized user has a
public encryption key, known to everyone, and a private decryption key (used by the decryption
algorithm), chosen by the user and known only to him or her. The encryption and decryption
algorithms themselves are assumed to be publicly known.
Consider a user called Suneet. Anyone can send Suneet a secret message by encrypting the
message using Suneet's publicly known encryption key. Only Suneet can decrypt this secret
message because the decryption algorithm requires Suneet's decryption key, known only to
Suneet. Since users choose their own decryption keys, the weakness of DES is avoided.
The main issue for public-key encryption is how encryption and decryption keys are
chosen. Technically, public-key encryption algorithms rely on the existence of one-way
functions, which are functions whose inverse is computationally very hard to determine.
The RSA algorithm, for example, is based on the observation that although checking
whether a given number is prime is easy, determining the prime factors of a nonprime number is
extremely hard. (Determining the prime factors of a number with over 100 digits can take years
of CPU time on the fastest available computers today.)
We now sketch the intuition behind the RSA algorithm, assuming that the data to be
encrypted is an integer I. To choose an encryption key and a decryption key, our friend Suneet
creates a public key by computing the product of two large prime numbers P1 and P2. The private
key consists of the pair (P1, P2), and the decryption algorithm cannot be used if only the product
of P1 and P2 is known. So we publish the product P1*P2, but an unauthorized user would need to be
able to factor P1*P2 to steal data. By choosing P1 and P2 to be sufficiently large (over 100 digits),
we can make it very difficult (or nearly impossible) for an intruder to factor it.
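A toy numeric sketch of the RSA idea in Python, using tiny primes so the arithmetic is visible; real keys use primes hundreds of digits long, and the numbers and variable names below are chosen only for illustration (the modular inverse via pow needs Python 3.8+).

# Toy RSA with small primes p1, p2 (illustrative only; far too small to be secure).
p1, p2 = 61, 53
n = p1 * p2                     # published
phi = (p1 - 1) * (p2 - 1)       # kept secret; computing it needs the factors of n
e = 17                          # public encryption exponent, coprime with phi
d = pow(e, -1, phi)             # private decryption exponent

message = 65                    # data to encrypt, an integer smaller than n
cipher = pow(message, e, n)     # encryption with the public key (e, n)
plain = pow(cipher, d, n)       # decryption needs d, i.e. effectively the factors of n
print(cipher, plain)            # plain == 65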


Although this technique is secure, it is also computationally expensive. A hybrid scheme
used for secure communication is to exchange DES keys via a public-key encryption
scheme, and DES encryption is then used on the data transmitted subsequently.
5.5.3 Disadvantages of encryption
There are following problems of Encryption:

Key management (i.e. keeping keys secret) is a problem. Even in public-key encryption
the decryption key must be kept secret.

Even in a system that supports encryption, data must often be processed in plaintext form.
Thus sensitive data may still be accessible to transaction programs.

Encrypting data gives rise to serious technical problems at the level of physical storage
organization. For example indexing over data, which is stored in encrypted form, can be
very difficult.

5.6 STATISTICAL DATABASE


A statistical database is a database used for statistical analysis purposes. It is an OLAP
(online analytical processing) system, rather than an OLTP (online transaction processing) system.
Classical statistical databases are often closer to the relational model than to the
multidimensional model commonly used in OLAP systems today.


Statistical databases typically contain parameter data and the measured data for these
parameters. For example, parameter data consists of the different values for varying conditions in
an experiment (e.g., temperature, time). The measured data (or variables) are the measurements
taken in the experiment under these varying conditions.
Many statistical databases are sparse with many null or zero values. It is not uncommon
for a statistical database to be 40% to 50% sparse. There are two options for dealing with the
sparseness: (1) leave the null values in there and use compression techniques to squeeze them out
or (2) remove the entries that only have null values.
5.7 DISTRIBUTED DATABASE
A distributed database is a database in which storage devices are not all attached to a
common processing unit such as the CPU; it is controlled by a distributed database management
system (together sometimes called a distributed database system).
5.7.1 Architecture of DBMS


5.7.2 Component Architecture of Distributed Database


5.8 TRANSACTION PROCESSING


In computer science, transaction processing is information processing that is divided
into individual, indivisible operations, called transactions. Each transaction must succeed or fail
as a complete unit; it can never be only partially complete.
Transaction processing systems consist of computer hardware and software hosting a
transaction-oriented application that performs the routine transactions necessary to conduct
business. Examples include systems that manage sales order entry, airline reservations, payroll,
employee records, manufacturing, and shipping.
Transaction processing is designed to maintain a system's integrity (typically a database
or some modern file systems) in a known, consistent state, by ensuring that interdependent
operations on the system are either all completed successfully or all canceled successfully.
5.8.1 ACID Properties
A transaction may contain several low-level tasks, and a transaction is a very small
unit of any program. A transaction in a database system must maintain certain properties in order
to ensure the accuracy of its completeness and data integrity. These properties are referred to as
ACID properties and are mentioned below:

Atomicity: Though a transaction involves several low-level operations, this property
states that a transaction must be treated as an atomic unit; that is, either all of its
operations are executed or none is. There must be no state in the database where a transaction
is left partially completed. States should be defined either before the execution of the
transaction or after the execution/abortion/failure of the transaction (a minimal code sketch of
this property follows the list).

Consistency: This property states that after the transaction is finished, its database must
remain in a consistent state. There must not be any possibility that some data is
incorrectly affected by the execution of transaction. If the database was in a consistent
state before the execution of the transaction, it must remain in consistent state after the
execution of the transaction.

Durability: This property states that in any case all updates made on the database will
persist even if the system fails and restarts. If a transaction writes or updates some data in
database and commits that data will always be there in the database. If the transaction

commits but data is not written on the disk and the system fails, that data will be updated
once the system comes up.

Isolation: In a database system where more than one transaction is being executed
simultaneously and in parallel, the property of isolation states that each transaction will
be carried out and executed as if it were the only transaction in the system. No transaction
will affect the existence of any other transaction.
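A minimal sketch of atomicity using Python's sqlite3 module; the account table and the transfer amounts are invented. Either both updates of the transfer are committed, or the whole transaction is rolled back and the database is left unchanged.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

try:
    # the two updates form one transaction: both succeed or neither does
    conn.execute("UPDATE account SET balance = balance - 30 WHERE id = 1")
    conn.execute("UPDATE account SET balance = balance + 30 WHERE id = 2")
    conn.commit()
except sqlite3.Error:
    conn.rollback()          # atomicity: no partially completed transfer

print(conn.execute("SELECT * FROM account").fetchall())   # [(1, 70), (2, 80)]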

5.9 DATAWAREHOUSING
Data sources often store only current data, not historical data. Corporate decision making
requires a unified view of all organizational data, including historical data.A data warehouse is a
repository (archive) of information gathered from multiple sources, stored under a unified
schema, at a single site
Greatly simplifies querying and permits study of historical trends. Shifts decision support query load
away from transaction processing systems.

DESIGN ISSUES

When and how to gather data


Source driven architecture: data sources transmit new information to the warehouse, either
continuously or periodically (e.g. at night). Destination driven architecture: the warehouse
periodically requests new information from data sources. Keeping the warehouse exactly
synchronized with data sources (e.g. using two-phase commit) is too expensive.

What schema to use


Schema integration
Data cleansing
E.g. correct mistakes in addresses (misspellings, zip code errors)
Merge address lists from different sources and purge duplicates
How to propagate updates

Warehouse schema may be a (materialized) view of schema from data sources


What data to summarize
Raw data may be too large to store on-line
Aggregate values (totals/subtotals) often suffice
Queries on raw data can often be transformed by query optimizer to use aggregate
values
Dimension values are usually encoded using small integers and mapped to full values via
dimension tables. Resultant schema is called a star schema.
More complicated schema structures

Snowflake schema: multiple levels of dimension tables

Constellation: multiple fact tables


5.10 DATAMINING

Data mining is the process of semi-automatically analyzing large databases to find useful
patterns. Prediction is based on past history: predict if a credit card applicant poses a good credit
risk, based on some attributes (income, job type, age, ...) and past history, or predict if a pattern of
phone calling card usage is likely to be fraudulent. Some examples of prediction mechanisms:
Classification
Given a new item whose class is unknown, predict to which
class it belongs
Regression formulae
Given a set of mappings for an unknown function, predict the
function result for a new parameter value
Descriptive Patterns
Associations
Find books that are often bought by similar customers. If a
new customer buys one such book, suggest the others too.
Associations may be used as a first step in detecting causation
E.g. association between exposure to chemical X and cancer
Clusters
E.g. typhoid cases were clustered in an area surrounding a
contaminated well. Detection of clusters remains important in
detecting epidemics.
5.11 Classification:

Classification rules help assign new objects to classes.


E.g., given a new automobile insurance applicant, should he or she be classified as low risk,
medium risk or high risk?
Classification rules for above example could use a variety of data, such as educational level,
salary, age, etc.
∀ person P, P.degree = masters and P.income > 75,000
⇒ P.credit = excellent

∀ person P, P.degree = bachelors and
(P.income ≥ 25,000 and P.income ≤ 75,000)
⇒ P.credit = good

Rules are not necessarily exact: there may be some misclassifications. Classification rules can be
shown compactly as a decision tree.

CONSTRUCTION OF DECISION TREES

Training set: a data sample in which the classification is already known. Decision trees are
generated greedily, top down.
Each internal node of the tree partitions the data into groups based on a partitioning attribute, and
a partitioning condition for the node
Leaf node:
all (or most) of the items at the node belong to the same class, or
all attributes have been considered, and no further partitioning is possible.


The purity of a set S of training instances can be measured quantitatively in several ways.
Notation: number of classes = k, number of instances = |S|, fraction of instances in class i = pi.
The Gini measure of purity is defined as
Gini(S) = 1 - sum over i = 1..k of pi^2
When all instances are in a single class, the Gini value is 0.
It reaches its maximum (of 1 - 1/k) if each class has the same number of instances.
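A small Python sketch of the Gini measure as defined above; the class labels in the example are arbitrary.

from collections import Counter

def gini(classes):
    # Gini(S) = 1 - sum over classes of p_i^2
    counts = Counter(classes)
    total = len(classes)
    return 1 - sum((c / total) ** 2 for c in counts.values())

print(gini(["a", "a", "a"]))            # 0.0   -> pure set
print(round(gini(["a", "b", "c"]), 3))  # 0.667 -> the maximum 1 - 1/k for k = 3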
DECISION TREE CONSTRUCTION ALGORITHM

Procedure GrowTree (S)
    Partition (S);

Procedure Partition (S)
    if (purity (S) > dp or |S| < ds) then return;   (dp and ds are threshold parameters)
    for each attribute A
        evaluate splits on attribute A;
    Use the best split found (across all attributes) to partition S into S1, S2, ..., Sr;
    for i = 1, 2, ..., r
        Partition (Si);

NAIVE BAYESIAN CLASSIFIERS


By Bayes' rule, p(cj | d) = p(d | cj) * p(cj) / p(d), so Bayesian classifiers require
computation of p(d | cj)
precomputation of p(cj)
p(d) can be ignored since it is the same for all classes. To simplify the task, naive Bayesian
classifiers assume attributes have independent distributions, and thereby estimate
p(d | cj) = p(d1 | cj) * p(d2 | cj) * ... * p(dn | cj)
(a small code sketch of this estimate appears at the end of this subsection).

Each of the p (di | cj ) can be estimated from a histogram on di values for


each class cj

the histogram is computed from the training instances

Histograms on multiple attributes are more expensive to compute and store
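A rough Python sketch of the naive Bayesian estimate; the tiny training set and attribute values are invented, and per-class counters play the role of the histograms mentioned above.

from collections import Counter, defaultdict

# training instances: (attribute values, class)
training = [(("masters", "high"), "excellent"),
            (("bachelors", "high"), "good"),
            (("bachelors", "low"), "good")]

class_counts = Counter(c for _, c in training)
attr_counts = defaultdict(Counter)   # (class, attribute position) -> value histogram
for attrs, c in training:
    for i, v in enumerate(attrs):
        attr_counts[(c, i)][v] += 1

def score(attrs, c):
    # p(c) times the product of p(d_i | c), each estimated from the per-class histogram
    p = class_counts[c] / len(training)
    for i, v in enumerate(attrs):
        p *= attr_counts[(c, i)][v] / class_counts[c]
    return p

d = ("masters", "high")
print(max(class_counts, key=lambda c: score(d, c)))   # "excellent"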

REGRESSION

Regression deals with the prediction of a value, rather than a class.Given values for a set of
variables, X1, X2, , Xn, we wish to predict the value of a variable Y.
One way is to infer coefficients a0, a1, ..., an
such that Y = a0 + a1 * X1 + a2 * X2 + ... + an * Xn

Finding such a linear polynomial is called linear regression. In general, the process of finding a
curve that fits the data is also called curve fitting.
The fit may only be approximate
-because of noise in the data, or
-because the relationship is not exactly a polynomial
Regression aims to find coefficients that give the best possible fit.
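A short Python sketch of linear regression by least squares; the data points below are generated from Y = 1 + 2*X1 + 3*X2 purely for illustration, and NumPy's lstsq recovers the coefficients a0, a1, a2.

import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])   # values of X1, X2
Y = np.array([9.0, 8.0, 19.0, 18.0])                             # observed Y values

A = np.column_stack([np.ones(len(X)), X])        # add a column of ones for the constant a0
coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)   # best-fit coefficients in the least-squares sense
print(np.round(coeffs, 2))                       # [1. 2. 3.] -> a0, a1, a2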

5.12 ASSOCIATION RULES


Retail shops are often interested in associations between different items that people buy.
Someone who buys bread is quite likely also to buy milk. A person who bought the book
Database System Concepts is quite likely also to buy the book Operating System Concepts.
Association information can be used in several ways.
E.g. when a customer buys a particular book, an online shop may suggest associated books.
Association rules:
        bread ⇒ milk
        DB-Concepts, OS-Concepts ⇒ Networks
Left hand side: antecedent; right hand side: consequent.
An association rule must have an associated population; the population consists of a set of
instances
E.g. each transaction (sale) at a shop is an instance, and the set of all transactions
is the population
Rules have an associated support, as well as an associated confidence.
Support is a measure of what fraction of the population satisfies both the antecedent and the
consequent of the rule.
E.g. suppose only 0.001 percent of all purchases include both milk and screwdrivers. The support
for the rule milk ⇒ screwdrivers is then low.
Confidence is a measure of how often the consequent is true when the antecedent is true.
E.g. the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that
include bread also include milk.
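To make the definitions concrete, here is a small Python sketch (the transactions are invented)
that computes support and confidence directly from their definitions.

def support(transactions, itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent with consequent) / support(antecedent)."""
    return (support(transactions, set(antecedent) | set(consequent))
            / support(transactions, antecedent))

transactions = [
    {"bread", "milk"}, {"bread", "milk", "eggs"},
    {"bread", "butter"}, {"milk", "screwdriver"}, {"bread", "milk", "jam"},
]
print(support(transactions, {"bread", "milk"}))       # 0.6
print(confidence(transactions, {"bread"}, {"milk"}))  # 0.75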
FINDING ASSOCIATION RULES
We are generally only interested in association rules with reasonably high support (e.g.
support of 2% or greater)
Naive algorithm

1. Consider all possible sets of relevant items.
2. For each set find its support (i.e. count how many transactions purchase all items in the set).
   Large itemsets: sets with sufficiently high support.
3. Use large itemsets to generate association rules:
   from itemset A generate the rule A − {b} ⇒ b for each b ∈ A.
   Support of rule = support(A).
   Confidence of rule = support(A) / support(A − {b}).
Determine support of itemsets via a single pass on set of transactions
Large itemsets: sets with a high count at the end of the pass
If memory is not enough to hold all counts for all itemsets, use multiple passes,
considering only some itemsets in each pass.
Optimization: once an itemset is eliminated because its count (support) is too small, none
of its supersets needs to be considered.
The a priori technique to find large itemsets:
Pass 1: count support of all sets with just 1 item. Eliminate those items with low support
Pass i: candidates: every set of i items such that all its i-1 item subsets are large
Count support of all candidates
Stop if there are no candidates
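A compact Python sketch of the a priori passes described above; it keeps all candidate counts in
memory, so the multi-pass refinement for limited memory is not shown, and the sample
transactions are invented.

from itertools import combinations

def apriori(transactions, min_support):
    """Return all itemsets whose support is at least min_support."""
    n = len(transactions)
    transactions = [set(t) for t in transactions]

    # Pass 1: count single items, keep those with enough support.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items
               if sum(i in t for t in transactions) / n >= min_support}
    large = set(current)

    size = 2
    while current:
        # Candidates: size-k sets all of whose (k-1)-subsets are large.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == size
                      and all(frozenset(s) in current
                              for s in combinations(a | b, size - 1))}
        current = {c for c in candidates
                   if sum(c <= t for t in transactions) / n >= min_support}
        large |= current
        size += 1
    return large

transactions = [{"bread", "milk"}, {"bread", "milk", "eggs"},
                {"bread", "butter"}, {"milk", "jam"}, {"bread", "milk", "jam"}]
for itemset in sorted(apriori(transactions, 0.4), key=len):
    print(set(itemset))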

Basic association rules have several limitations. Deviations from the expected probability are
more interesting.
E.g. if many people purchase bread, and many people purchase cereal, quite a few
would be expected to purchase both. We are interested in positive as well as
negative correlations between sets of items:
Positive correlation: co-occurrence is higher than predicted
Negative correlation: co-occurrence is lower than predicted
Sequence associations / correlations
E.g. whenever bonds go up, stock prices go down in 2 days
Deviations from temporal patterns
E.g. deviation from a steady growth

E.g. sales of winter wear go down in summer.
This is not surprising, as it is part of a known pattern.
Look for deviations from the value predicted using the past pattern.
5.13 CLUSTERING
Clustering: intuitively, finding clusters of points in the given data such that similar points lie in
the same cluster. This can be formalized using distance metrics in several ways, e.g.:
Group points into k sets (for a given k) such that the average distance of points from the
centroid of their assigned group is minimized.
Centroid: the point defined by taking the average of the coordinates in each dimension.
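A small Python sketch (not from the original notes) of this centroid-based formulation,
essentially the standard k-means iteration; the random initialization, fixed iteration count, and
sample points are illustrative choices.

import random

def centroid(points):
    """Average of the coordinates in each dimension."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, iterations=20):
    centers = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist2(p, centers[i]))].append(p)
        centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

points = [(1, 1), (1.5, 2), (1, 1.8), (8, 8), (8.5, 8), (9, 8.8)]
centers, clusters = k_means(points, 2)
print(centers)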
Hierarchical Clustering
Example from biological classification
Other examples: Internet directory systems (e.g. Yahoo)
Agglomerative Clustering Algorithm
Build small clusters, then cluster small clusters into bigger clusters, and
so on
Divisive clustering algorithms
Start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones.
Another metric: minimize the average distance between every pair of points in a cluster.
Clustering has been studied extensively in statistics, but mostly on small data sets. Data mining
systems aim at clustering techniques that can handle very large data sets,
e.g. the BIRCH clustering algorithm.

5.14 INFORMATION RETRIEVAL

Information retrieval is the activity of obtaining information resources relevant to an
information need from a collection of information resources. Searches can be based on metadata
or on full-text indexing.

Typically it refers to the automatic (rather than manual) retrieval of documents



Information Retrieval System (IRS)

Document is the generic term for an information holder (book, chapter, article,
webpage, etc)

Systemic approach

Goal (for a known information need):

Return as many relevant documents as possible and as few non-relevant documents as


possible

Cognitive approach
Goal (in an interactive information-seeking environment, with a given IRS):
Support the user's exploration of the problem domain and the task completion.

5.15 RELEVANCE RANKING

Relevancy ranking is the process of sorting the document results so that those
documents which are most likely to be relevant to your query are shown at the top.
Relevance ranking is usually best for searches that are not either/or types of searches.
For example, in most traditional title searches, the result is either the library has the book, or it
does not. The relevancy program would either show the entry for the book, or an alphabetical list
that has a statement in the appropriate place that says, "Your search would be here." Such
concrete, well-known sorting methods work well for either/or searches.
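Relevance ranking is often implemented with term-weighting schemes such as TF-IDF; the
following Python sketch (not from the original notes) ranks a toy document collection against a
query by summing TF-IDF weights.

import math
from collections import Counter

documents = {
    "d1": "database system concepts and database design",
    "d2": "operating system concepts",
    "d3": "introduction to information retrieval and ranking",
}

def tf_idf_scores(query, docs):
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n = len(docs)
    scores = Counter()
    for term in query.lower().split():
        df = sum(term in toks for toks in tokenized.values())
        if df == 0:
            continue
        idf = math.log(n / df)
        for d, toks in tokenized.items():
            tf = toks.count(term) / len(toks)
            scores[d] += tf * idf
    return scores.most_common()          # documents sorted by descending relevance

print(tf_idf_scores("database system", documents))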

5.16 CRAWLING AND INDEXING THE WEB

A Web crawler is an Internet bot that systematically browses the World Wide Web,
typically for the purpose of Web indexing. A Web crawler may also be called a Web spider, an
ant, an automatic indexer, or (in the FOAF software context) a Web scutter. Web search engines
and some other sites use Web crawling or spidering software to update their own web content or
indexes of other sites' web content. Web crawlers can copy all the pages they visit for later
processing by a search engine that indexes the downloaded pages, so that users can search them
much more quickly.
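A deliberately simplified crawler sketch using only the Python standard library; real crawlers also
respect robots.txt, politeness delays, and scale concerns, none of which are shown, and the
breadth-first strategy and page limit are assumptions added for illustration.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    """Breadth-first crawl starting from seed_url; returns {url: html}."""
    seen, pages = set(), {}
    queue = deque([seed_url])
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
        except Exception:
            continue
        pages[url] = html                      # would be handed to the indexer
        parser = LinkExtractor()
        parser.feed(html)
        queue.extend(urljoin(url, link) for link in parser.links)
    return pages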


Web indexing (or Internet indexing) refers to various methods for indexing the contents
of a website or of the Internet as a whole. Individual websites or intranets may use a
back-of-the-book index, while search engines usually use keywords and metadata to provide a
more useful vocabulary for Internet or onsite searching.
Search engine indexing collects, parses, and stores data to facilitate fast and accurate
information retrieval. Index design incorporates interdisciplinary concepts from linguistics,
cognitive psychology, mathematics, informatics, and computer science. An alternate name for
the process in the context of search engines designed to find web pages on the Internet is web
indexing.

5.17 OBJECT ORIENTED DATABASE

Basically, an OODBMS is an object database that provides DBMS capabilities to objects


that have been created using an object-oriented programming language (OOPL). The basic
principle is to add persistence to objects, that is, to make objects persistent. Consequently,
application programmers who use OODBMSs typically write programs in a native OOPL such as
Java, C++ or Smalltalk, and the language has some kind of Persistent class, Database class,
Database Interface, or Database API that provides DBMS functionality as, effectively, an
extension of the OOPL.
Object-oriented DBMSs, however, go much beyond simply adding persistence to any one
object-oriented programming language. This is because, historically, many object-oriented
DBMSs were built to serve the market for computer-aided design/computer-aided manufacturing
(CAD/CAM) applications in which features like fast navigational access, versions, and long
transactions are extremely important.
Object-oriented DBMSs, therefore, support advanced object-oriented database applications
with features like support for persistent objects from more than one programming language,
distribution of data, advanced transaction models, versions, schema evolution, and dynamic
generation of new types.
Object data modeling
An object consists of three parts: structure (attributes, and relationships to other objects such as
aggregation and association), behavior (a set of operations) and characteristics of types
(generalization/specialization). An object is similar to an entity in the ER model; therefore we
begin with an example to demonstrate the structure and relationships.

Attributes are like the fields in a relational model. However, in the Book example we have, for
attributes publishedBy and writtenBy, complex types Publisher and Author, which are also
objects. Attributes with complex objects, in an RDBMS, are usually other tables linked by keys to
the Book table.
Relationships: publishedBy and writtenBy are associations with 1:N and 1:1 relationships;
composedOf is an aggregation (a Book is composed of chapters). The 1:N relationship is
usually realized as attributes through complex types and at the behavioral level.
Generalization/Specialization is the "is a" relationship, which is supported in OODB through the
class hierarchy. An ArtBook is a Book, therefore the ArtBook class is a subclass of the Book
class. A subclass inherits all the attributes and methods of its superclass.

Message: the means by which objects communicate; it is a request from one object to
another to execute one of its methods. For example, Publisher_object.insert("Rose", 123)
is a request to execute the insert method on a Publisher object.
Method: defines the behavior of an object. Methods can be used to change state by modifying
attribute values, or to query the values of selected attributes. The method that responds to the
message example is the method insert defined in the Publisher class. The main differences
between relational database design and object oriented database design include:
Many-to-many relationships must be removed before entities can be translated into relations.
Many-to-many relationships can be implemented directly in an object-oriented database.
Operations are not represented in the relational data model. Operations are one of the main
components in an object-oriented database. In the relational data model relationships are
implemented by primary and foreign keys. In the object model objects communicate through
their interfaces. The interface describes the data (attributes) and operations (methods) that are
visible to other objects.
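A small Python sketch illustrating these ideas; the class and attribute names follow the Book
example above, but the exact attribute lists are assumptions added for illustration.

class Publisher:
    def __init__(self, name, publisher_id):
        self.name = name
        self.publisher_id = publisher_id
        self.books = []                    # 1:N association: a publisher publishes many books

    def insert(self, name, publisher_id):  # method invoked by the message Publisher_object.insert("Rose", 123)
        self.name, self.publisher_id = name, publisher_id

class Author:
    def __init__(self, name):
        self.name = name

class Book:
    def __init__(self, title, published_by, written_by, chapters):
        self.title = title
        self.publishedBy = published_by    # complex-typed attribute (a Publisher object)
        self.writtenBy = written_by        # complex-typed attribute (an Author object)
        self.chapters = chapters           # aggregation: a Book is composed of chapters

class ArtBook(Book):                       # "is a" relationship: ArtBook inherits from Book
    def __init__(self, title, published_by, written_by, chapters, art_style):
        super().__init__(title, published_by, written_by, chapters)
        self.art_style = art_style

publisher = Publisher("Rose", 123)
book = ArtBook("Impressionism", publisher, Author("Smith"), ["Ch1", "Ch2"], "impressionist")
publisher.insert("Rose", 123)              # sending a message to the Publisher object
print(book.publishedBy.name, book.art_style)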

5.18 XML DATABASE

XML: Extensible Markup Language. It is defined by the WWW Consortium (W3C). It is derived
from SGML (Standard Generalized Markup Language), but is simpler to use than SGML.
Documents have tags giving extra information about sections of the document.
E.g. <title> XML </title> <slide> Introduction </slide>
XML is extensible, unlike HTML: users can add new tags, and separately specify how the tag
should be handled for display. The ability to specify new tags, and to create nested tag structures,
makes XML a great way to exchange data, not just documents. Much of the use of XML has been
in data exchange applications, rather than as a replacement for HTML. Tags make data
(relatively) self-documenting.
Example:
<bank>
    <account>
        <account_number> A-101 </account_number>
        <branch_name> Downtown </branch_name>
        <balance> 500 </balance>
    </account>
    <depositor>
        <account_number> A-101 </account_number>
        <customer_name> Johnson </customer_name>
    </depositor>
</bank>

Data interchange is critical in today's networked world.

Examples:
Banking: funds transfer
Order processing (especially inter-company orders)
Scientific data

Chemistry: ChemML,

Genetics: BSML (Bio-Sequence Markup Language),


Paper flow of information between organizations is being replaced by electronic flow of
information. Each application area has its own set of standards for representing
information. XML has become the basis for all new-generation data interchange formats.
Earlier-generation formats were based on plain text with line headers indicating the meaning of
fields (similar in concept to email headers). They did not allow for nested structures, had no
standard type language, and were tied too closely to low-level document structure (lines, spaces,
etc).
Each XML-based standard defines what the valid elements are, using XML type specification
languages to specify the syntax:
        DTD (Document Type Descriptors)
        XML Schema
plus textual descriptions of the semantics. XML allows new tags to be defined as required;
however, this may be constrained by DTDs. A wide variety of tools is available for parsing,
browsing and querying XML documents/data.
Compared with relational data, XML is inefficient: tags, which in effect represent schema
information, are repeated. It is nevertheless better than relational tuples as a data-exchange
format: unlike relational tuples, XML data is self-documenting due to the presence of tags;
the format is non-rigid (tags can be added); it allows nested structures; and it enjoys wide
acceptance, not only in database systems, but also in browsers, tools, and applications.

STRUCTURE OF XML
Tag: label for a section of data
Element: section of data beginning with <tagname> and ending with matching </tagname>.
Elements must be properly nested.
Proper nesting:
        <account> <balance> ... </balance> </account>
Improper nesting:
        <account> <balance> ... </account> </balance>
Formally: every start tag must have a unique matching end tag that is in the context of the
same parent element. Every document must have a single top-level element.
Example
<bank-1>
    <customer>
        <customer_name> Hayes </customer_name>
        <customer_street> Main </customer_street>
        <customer_city> Harrison </customer_city>
        <account>
            <account_number> A-102 </account_number>
            <branch_name> Perryridge </branch_name>
            <balance> 400 </balance>
        </account>
        <account>
            ...
        </account>
    </customer>
</bank-1>
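As an illustration of how such a document can be processed programmatically (not part of the
original notes), a short Python sketch using the standard xml.etree.ElementTree module to parse
the bank document shown earlier.

import xml.etree.ElementTree as ET

document = """
<bank>
    <account>
        <account_number> A-101 </account_number>
        <branch_name> Downtown </branch_name>
        <balance> 500 </balance>
    </account>
    <depositor>
        <account_number> A-101 </account_number>
        <customer_name> Johnson </customer_name>
    </depositor>
</bank>
"""

root = ET.fromstring(document)               # the single top-level element <bank>
for account in root.findall("account"):      # nested elements are reached by tag name
    number = account.find("account_number").text.strip()
    balance = float(account.find("balance").text)
    print(number, balance)                   # A-101 500.0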

