INTRODUCTION TO DBMS
File Systems Organization - Sequential, Pointer, Indexed, Direct - Purpose of Database System - Database System Terminologies - Database Characteristics - Data Models - Types of Data Models - Components of DBMS - Relational Algebra. LOGICAL DATABASE DESIGN: Relational DBMS - Codd's Rule - Entity-Relationship Model - Extended ER - Normalization: Functional Dependencies, Anomaly - 1NF to 5NF - Domain Key Normal Form - Denormalization.
A record can be of fixed or variable size; most commonly, fixed-size records are used. The file structure has been illustrated here in tabular form, but note that a file is not just a table: its records can be organized in many different ways, and each way defines a unique organization.
Operations on file
Create: Create a new file.
One of the main objectives of file organization is to speed up the file retrieval time, that is,
to reduce the I/O time. This can be done by improving disk access speed but that has a limit
because of the hardware constraints.
The other way is to store file records in such a way that records can be accessed
concurrently or in parallel. A hardware solution to this is using RAID structure. A RAID is an
array of several disks which can be accessed in parallel.
No single file organization is the most efficient for all types of data access. For this reason
a database management system uses a number of different file organizations for storing the same
set of data.
There are basically three categories of file organizations: (a) sequential organization, (b) organization using an index, and (c) random organization. We begin with the sequential category; files of this category are called sequential files.
For each organization we will (a) understand the file organization, (b) understand how a set of operations is performed, and (c) analyze the time required to perform each operation. This will help us to identify which file organization is suitable for which kind of data processing.
Sequential file
In this organization records are written consecutively when the file is created. Records in a
sequential file can be stored in two ways. Each way identifies a file organization.
Pile file: Records are placed one after another as they arrive (no sorting of any kind).
Sorted file: Records are placed in ascending or descending values of the primary key.
Pile file: The total time to fetch (read) a record from a pile file depends on the seek time (s), rotational latency (r), block transfer time (btt), and the number of blocks in the file (B). We also need a key field of the record (K). In a pile file we do not know the address of a record, so we search
sequentially. The possibilities are (a) the desired record may be in the first block or (b) it may be in
some middle block or (c) it may be in the last block.
The fetching (searching) process is shown in the flow diagram (Figure 1) and the example
illustrates the time involved.
Figure 1. Search in a sequential file organization.
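As a rough sketch of the cost (a standard analysis; the worked example itself is not reproduced here): if the blocks are scanned linearly and the desired record is equally likely to be in any block, the average fetch time is approximately
T(avg) = s + r + (B/2) * btt
and the worst case (record in the last block, or not present at all) is s + r + B * btt.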
File Reorganization: In file reorganization, all records that are marked as deleted are removed and all inserted records are moved to their correct places (sorting). The file reorganization steps are:
read the entire file (all blocks) in RAM.
remove all the deleted records.
write all the modified blocks at a different place on the disk.
Inserting a record: To insert a record, it is placed at the end of the file. No need to sort
(ascending or descending order) the file. However, the file may have duplicate records.
Deleting or modifying a record: This requires fetching the block containing the record, finding the record in the block and marking it deleted, and then writing the modified block back to the disk.
Sorted Sequential File: In a sorted file, first the record is inserted at the end of the file and then
moved to its correct location (ascending or descending). Records are stored in order of the values
of the key field.
A sequential file usually has an overflow area. This area avoids sorting the file after every deletion, insertion and/or modification. All records added after the file was first populated go to this overflow area. The overflow area is not itself sorted; it is a pile file with fixed-size records. At some convenient time the entire file is sorted, at which point records from the overflow area go to their correct positions in the main file.
Retrieval: Records are retrieved in a consecutive order. The order of record storage determines
order of retrieval. During retrieval several required operations (partial result output etc.) can be
performed simultaneously.
Insert, delete and modify (update): Sequential files are usually not updated in place; each operation regenerates a new version of the file. The new version contains the up-to-date information, and the old version is usually saved for recovery. The new version then becomes the old version, and the next update uses it to generate the next new version. The intensity or frequency of use of a sequential file is described by a parameter called the hit ratio, which is defined as follows:
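The standard definition (supplied here as an assumption, since the original formula is missing at the page break):
Hit ratio = (number of records accessed by a run) / (total number of records in the file)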
A high hit ratio value is desirable: it means that a large number of the records read are actually used in responding to the query. Interactive transactions have a very low hit ratio.
Disadvantages
Not good for interactive transactions
High overheads in file processing for simple queries.
Index organization tries to reduce access time. It may not reduce the storage requirement of a file. Some important terms of this organization are:
Index: An index is similar to a pointer with an additional property. This additional property allows
an index to identify the record it is pointing to.
For example, an index to a record of employees points and identifies an employee record.
This means that an index has some semantics associated with it. Formally, an index maps the key
space to the record space. Index for a file may be created on the primary or secondary keys.
Index types: Primary, Non dense, and Dense.
Primary Index: An ordered file of index records. This file contains index records of fixed length. Each index record has two fields:
one field holds the primary key of a data file record;
the other holds a pointer to the disk block where that record is stored.
Nondense index: No. of entries in the index file << no. of records in the data file.
Dense index: No. of entries in the index file = no. of records in the data file.
Example: How record search time is reduced by indexing.
Direct File
Indexing provides the most powerful and flexible tools for organizing files. However, (a) indexes take up space, and time is needed to keep them organized, and (b) the amount of I/O increases with the size of the index.
This problem can be minimized by direct file organization, where the address of the desired record can be found directly (no need for indexing or sequential search). Such files are created using some hashing function, so this is called a hashing organization and the files are called hashed files.
Hashing
Two types of hashing: (a) Static and (b) Dynamic.
Static: The address space size is predefined and does not grow or shrink with file.
Dynamic: The address space size can grow and shrink with file.
Key Space: Set of Primary keys.
Address Space: Set of home addresses.
Hash function: h(primary key) → home address of the record.
Home address → (file manager) → disk address.
Requirements: 1. A predefined set of home addresses (address space). 2. A hashing function. 3.
Primary key (Hashed key).
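As an illustration (the source does not give a concrete function), a common choice is the division-remainder hash:
h(K) = K mod M
where K is the numeric primary key and M is the number of home addresses in the address space (often chosen to be prime); the record with key K is stored at home address h(K).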
There is a unique relationship between the key and the record address. Any add, delete or change
in a record must not break this relationship. If this bond is broken then the record is no longer
accessible by hashing.
Address distribution: There should be no obvious connection between the key and the home address. (For this reason hashing is sometimes referred to as randomizing.)
Purpose of Database System: The database approach addresses several problems of file-processing systems, including data redundancy and inconsistency, integrity, atomicity, concurrent access, and security; these are discussed below.
(i) Data redundancy and inconsistency:
In file processing, files and application programs are created by different programmers over a long period, because each requires some data not available from the other users' files. This redundancy in the definition and storage of data results in wasted storage space, and leads to inconsistency of data, i.e., various copies of the same data may not agree.
(iv) Integrity:
In a database system, the DBMS provides capabilities for defining and enforcing constraints. The constraints are maintained in the system catalog, so application programs work independently of the addition or modification of constraints. Hence integrity problems are avoided.
(v) Atomicity:
A computer system is subject to failure. If a failure occurs, the data has to be restored to the consistent state that existed prior to the failure. Transactions must be atomic: a transaction must happen in its entirety or not at all. It is difficult to ensure atomicity in a file-processing system.
In the database approach, the DBMS ensures atomicity using its built-in transaction manager. The DBMS supports online transaction processing and recovery techniques to maintain atomicity.
(vi) Concurrent access or sharing of data:
When multiple users update the data simultaneously, the result may be inconsistent data. The system must maintain supervision, which is difficult because data may be accessed by many different application programs that have not been coordinated previously.
The DBMS includes concurrency control software to ensure that several programs/users trying to update the same data do so in a controlled manner, so that the result of the updates is correct.
(vii) Security:
Not every user of the database system should be able to access all the data. Since application programs are added to the system in an ad hoc manner, enforcing such security constraints is difficult in a file system.
A DBMS provides a security and authorization subsystem, which the DBA uses to create accounts and to specify account restrictions.
(viii) Support for multiple views of data:
The database approach supports multiple views of data. A database has many users, each of whom may require a different view of the database. A view may be a subset of the database, or virtual data retrieved from the database that is not explicitly stored. A DBMS provides multiple views of the data; with file processing, different application programs would have to be written for different views of the data.
3. Database System Terminologies:
a) File system:
A file system is a method for storing and organizing computer files and the data they contain, to make it easy to find and access them. It is controlled by the operating system.
b) Data:
Data are known facts that can be recorded and that have implicit meaning.
c) Database:
A database is a collection of related data with some inherent meaning. It is designed, built, and populated with data for a specific purpose. It represents some aspect of the real world.
d) Database Management System:
It is a collection of programs that enables users to create and maintain a database. It is a general-purpose software system that facilitates the processes of defining, constructing, and manipulating databases for various applications.
Defining involves specifying data types, structures and constraints for the data to be stored in the
database
Constructing is the process of storing the database itself on some storage medium that is
controlled by DBMS.
Manipulating includes functions such as querying the database to retrieve specific data, updating
database to reflect changes and generating reports from the data.
Eg. Oracle, MS-Access, Sybase, Informix, FoxPro.
[Figure: A simplified database system environment. Users/programmers submit application programs and queries to the database system; the DBMS software processes them against the stored database and the stored database definition (meta-data).]
4. Database Characteristics:
Self-Describing Nature of a Database System
In the database approach, a single repository of data is maintained that is defined once and then accessed by various users. The main characteristics of the database approach versus the file-processing approach are the following.
A fundamental characteristic of the database approach is that the database system contains not
only the database itself but also a complete definition or description of the database structure and
constraints.
This definition is stored in the system catalog, which contains information such as the structure
of each file, the type and storage format of each data item, and various constraints on the data. The
information stored in the catalog is called meta-data, and it describes the structure of the primary
database.
The catalog is used by the DBMS software and also by database users who need information
about the database structure. A general purpose DBMS software package is not written for a
specific database application, and hence it must refer to the catalog to know the structure of the
files in a specific database, such as the type and format of data it will access.
The DBMS software must work equally well with any number of database applications (for example, a university database, a banking database, or a company database) as long as the database definition is stored in the catalog.
DBMS software can access diverse databases by extracting the database definitions from the
catalog and then using these definitions.
Whenever a request is made to access, say, the Name of a STUDENT record, the DBMS
software refers to the catalog to determine the structure of the STUDENT file and the position and
size of the Name data item within a STUDENT record.
In traditional file processing, adding a new data item would require changing every program that accesses the file. In a DBMS environment, we just need to change the description of STUDENT records in the catalog to reflect the inclusion of the new data item Birthdate; no programs are changed. The next time a DBMS program refers to the catalog, the new structure of STUDENT records will be accessed and used.
An operation (also called a function) is specified in two parts. The interface (or signature) of
an operation includes the operation name and the data types of its arguments (or parameters). The
implementation (or method) of the operation is specified separately and can be changed without
affecting the interface.
User application programs can operate on the data by invoking these operations through their
names and arguments, regardless of how the operations are implemented. This may be termed
program-operation independence.
The characteristic that allows program-data independence and program-operation independence is called data abstraction. A DBMS provides users with a conceptual representation of data that does not include many of the details of how the data is stored or how the operations are implemented.
The data model uses logical concepts, such as objects, their properties, and their
interrelationships, that may be easier for most users to understand than computer storage concepts.
Hence, the data model hides storage and implementation details that are not of interest to most
database users.
Support of Multiple Views of the Data
A database typically has many users, each of whom may require a different perspective or view of
the database. A view may be a subset of the database or it may contain virtual data that is derived
from the database files but is not explicitly stored. Some users may not need to be aware of
whether the data they refer to is stored or derived. A multiuser DBMS whose users have a variety
of applications must provide facilities for defining multiple views.
For example, one user of the database may be interested only in the transcript of each student; the view for this user displays only that information. A second user, who is interested only in checking that students have taken all the prerequisites of each course they register for, may require a different view.
Sharing of Data and Multiuser Transaction Processing
A multiuser DBMS must allow multiple users to access the database at the same time. This is
essential if data for multiple applications is to be integrated and maintained in a single database.
The DBMS must include concurrency control software to ensure that several users trying to
update the same data do so in a controlled manner so that the result of the updates is correct.
For example, when several reservation clerks try to assign a seat on an airline flight, the
DBMS should ensure that each seat can be accessed by only one clerk at a time for assignment to a
passenger. These types of applications are generally called on-line transaction processing (OLTP)
applications. A fundamental role of multiuser DBMS software is to ensure that concurrent
transactions operate correctly.
5. Data Models
5.1 Definitions:
a) Data Model:
Data model is a collection of concepts that can be used to describe the structure of a
database (ie., Data, Relationships, Datatypes and constraints)
b) Schema:
The complete definition and description of the database is known as the database schema. Each object in the schema is known as a schema construct. The schema is also known as the intension.
c) Data Base State:
The data in the database at a particular moment in time is called a database state or snapshot. It is known as the extension of the database schema.
The DBMS stores the description of the schema constructs and constraints (the meta-data) in the DBMS catalog.
Representational (or implementational) data models: concepts that may be understood by end users but are not too far from the way data is organized within the computer. Eg. the Network and Hierarchical models.
The three-schema architecture:
(i) Internal level: describes the complete details of data storage and access paths for the database.
(ii) Conceptual level: describes the structure of the whole database: what data are stored and what relationships exist among the data.
(iii) External (view) level: each external schema describes the part of the database that a particular user group is interested in and hides the rest.
The client end communicates with the application server, usually through a form interface. The application server (which contains the business logic) communicates with the database system.
[Figure: Two-tier and three-tier client/server architectures. Users work at clients; clients connect over a network to servers; in the three-tier case clients connect to an application server, which in turn communicates with the database system.]
Database Engine:
The Database Engine is the core service for storing, processing, and securing data. The
Database Engine provides controlled access and rapid transaction processing to meet the
requirements of the most demanding data consuming applications within your enterprise.
Use the Database Engine to create relational databases for online transaction processing or
online analytical processing data. This includes creating tables for storing data, and database
objects such as indexes, views, and stored procedures for viewing, managing, and securing data.
You can use SQL Server Management Studio to manage the database objects, and SQL Server
Profiler for capturing server events.
Data dictionary:
A data dictionary is a reserved space within a database which is used to store information about the database itself. A data dictionary is a set of tables and views which can only be read, never altered.
Most data dictionaries contain different information about the data used in the enterprise. In terms of the database representation of the data, the data dictionary defines all schema objects, including views, tables, clusters, indexes, sequences, synonyms, procedures, packages, functions, triggers and many more. This ensures that all these objects follow one standard defined in the dictionary.
The data dictionary also records how much space has been allocated for, and/or is currently in use by, each of the schema objects. A data dictionary is used when finding information about users,
objects, and schema and storage structures. Every time a data definition language (DDL) statement
is issued, the data dictionary becomes modified.
The dictionary also records user permissions and user statistics.
Query Processor :
A relational database consists of many parts, but at its heart are two major components: the storage engine and the query processor. The storage engine writes data to and reads data from the disk; it manages records, controls concurrency, and maintains log files. The query processor accepts SQL syntax, selects a plan for executing the syntax, and then executes the chosen plan.
The user or program interacts with the query processor, and the query processor in turn interacts
with the storage engine. The query processor isolates the user from the details of execution: The
user specifies the result, and the query processor determines how this result is obtained.
The query processor components include the DDL interpreter, the DML compiler, and the query evaluation engine.
Report writers allow you to select records that meet certain conditions and to display selected fields in rows and columns.
You can also format data into pie charts, bar charts, and other diagrams. Once you have
created a format for a report, you can save the format specifications in a file and continue reusing
it for new data.
7. Relational Algebra:
Relational algebra is a procedural query language consisting of a basic set of relational model operations. These operations enable the user to specify basic retrieval requests. Each operation takes one or more relations as input, and the result of a retrieval is a relation.
Relational algebra operations are divided into two groups:
(i) set operations from mathematical set theory;
(ii) operations developed specifically for relational databases.
Operation | Keyword | Symbol
Selection | SELECT | σ
Projection | PROJECT | π
Renaming | RENAME | ρ
Assignment | | <-
Union | UNION | ∪
Intersection | INTERSECTION | ∩
Cartesian product | | ×
Other operators:
Join | JOIN | ⋈
Left outer join | LEFT OUTER JOIN | ⟕
Right outer join | RIGHT OUTER JOIN | ⟖
Full outer join | FULL OUTER JOIN | ⟗
Division | DIVIDE | ÷
Aggregate (group) | AGGREGATE | ℑ
The SELECT operation is used to select a set of tuples from a relation that satisfy
the selection condition.
The selection condition is applied to each tuple t in R. This is done by substituting each occurrence of an attribute Ai in the selection condition with its value in the tuple, t[Ai]. If the condition evaluates to true, then tuple t is selected. All the selected tuples appear in the result of the SELECT operation.
The number of tuples in the resulting relation is always less than or equal to the number of tuples in R, i.e., |σc(R)| ≤ |R|. The fraction of tuples selected by a selection condition is referred to as the selectivity of the condition.
A cascade of SELECT operations can be combined into a single SELECT with a conjunctive (AND) condition. Examples:
σSALARY>30000(EMPLOYEE)
σ(DNO=4 AND SALARY>25000) OR (DNO=5 AND SALARY>30000)(EMPLOYEE)
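The same selections can be written in SQL (a sketch; the EMPLOYEE column names follow the running example):

SELECT *
FROM EMPLOYEE
WHERE SALARY > 30000;

SELECT *
FROM EMPLOYEE
WHERE (DNO = 4 AND SALARY > 25000)
   OR (DNO = 5 AND SALARY > 30000);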
The PROJECT operation is used to select certain columns from the table and discard the other columns. The symbol π represents the PROJECT operator. It is a unary operator because it operates on one relation.
The result of the PROJECT operation has only the attributes specified in <attribute list>, in the same order as they appear in the list.
The degree of the relation resulting from PROJECT equals the number of attributes in <attribute list>.
If the attribute list contains only non-key attributes of R, duplicate tuples are likely to occur; the PROJECT operation removes any duplicates, so the result is a set of tuples. This is known as duplicate elimination.
The number of tuples in the resulting relation is always less than or equal to the number of tuples in R.
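In SQL, PROJECT corresponds to the SELECT list, and DISTINCT performs the duplicate elimination described above (a sketch; the attribute names are assumed from the EMPLOYEE example):

SELECT DISTINCT SEX, SALARY
FROM EMPLOYEE;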
The set operations are: (i) Union (∪), (ii) Intersection (∩), and (iii) Set difference (−).
(i) Union: The result of the operation denoted by R ∪ S is a relation that includes all tuples that are either in R, or in S, or in both R and S. Duplicate tuples are eliminated.
(ii) Intersection: The result of the operation denoted by R ∩ S is a relation that includes all tuples that are in both R and S.
Both union and intersection are commutative operations, can be treated as n-ary operations applicable to any number of relations, and are associative operations:
R ∪ S = S ∪ R and R ∩ S = S ∩ R
R ∪ (S ∪ T) = (R ∪ S) ∪ T ; R ∩ (S ∩ T) = (R ∩ S) ∩ T
(iii) Set difference: The result of this operation, denoted by R − S, is a relation that includes all tuples that are in R but not in S.
The minus operation is not commutative: R − S ≠ S − R.
STUDENT
FN | LN
Suresh | Rao
Bala | Ganesh
Ravi | Sankar
Mohan | Varma
Kishore | Das
Sushil | Sharma

INSTRUCTOR
FNAME | LNAME
Suresh | Rao
Rakesh | Agarwal
Bala | Ganesh
Sushil | Sharma
Ashwin | Kumar

STUDENT ∪ INSTRUCTOR
FN | LN
Suresh | Rao
Bala | Ganesh
Ravi | Sankar
Mohan | Varma
Kishore | Das
Sushil | Sharma
Rakesh | Agarwal
Ashwin | Kumar

STUDENT ∩ INSTRUCTOR
FN | LN
Suresh | Rao
Bala | Ganesh
Sushil | Sharma

STUDENT − INSTRUCTOR
FN | LN
Ravi | Sankar
Mohan | Varma
Kishore | Das

INSTRUCTOR − STUDENT
FN | LN
Rakesh | Agarwal
Ashwin | Kumar
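The same set operations written in SQL (a sketch against the tables above; note that Oracle spells EXCEPT as MINUS):

SELECT FN, LN FROM STUDENT
UNION
SELECT FNAME, LNAME FROM INSTRUCTOR;

SELECT FN, LN FROM STUDENT
INTERSECT
SELECT FNAME, LNAME FROM INSTRUCTOR;

SELECT FN, LN FROM STUDENT
EXCEPT
SELECT FNAME, LNAME FROM INSTRUCTOR;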
CARTESIAN PRODUCT (×): This operation combines every tuple of one relation with every tuple of the other; applied on its own, it generally leads to meaningless tuples. It is useful when followed by a selection that matches values of attributes coming from the component relations. For example:
FEMALE_EMPS ← σSEX='F'(EMPLOYEE)
EMPNAMES ← πFNAME, LNAME, SSN(FEMALE_EMPS)
EMP_DEPENDENTS ← EMPNAMES × DEPENDENT
JOIN Operation:
The JOIN operation, denoted by R ⋈<join condition> S, is used to combine related tuples from two relations into single tuples. Each term of the join condition has the form Ai θ Bj, where Ai is an attribute of R, Bj is an attribute of S, Ai and Bj have the same domain, and θ is one of the comparison operators {=, <, ≤, >, ≥, ≠}. A JOIN with such a general join condition is called a theta join.
The result of the join is a relation Q with n + m attributes, Q(A1, ..., An, B1, ..., Bm). Q has one combined tuple from R and S whenever the combination satisfies the join condition.
The join condition specified on attributes from the two relations R and S is evaluated for each tuple combination. Each combination for which the condition evaluates to true is included in the result.
Equi Join: A join that involves join conditions with equality comparisons only, i.e., only the = operator appears in the join condition, is known as an equi join.
DEPT_MGR ← DEPARTMENT ⋈MGRSSN=SSN EMPLOYEE
Natural Join: Denoted by *, a natural join is an equi join in which the two join attributes have the same name in both relations. The attribute involved in the join condition is known as the join attribute.
DEPT_LOCS ← DEPARTMENT * DEPT_LOCATIONS
Outer Join: In a regular (inner) JOIN, tuples without a matching tuple in the other relation, and tuples with null values for the join attributes, are eliminated from the result. Outer joins can be used to keep all the tuples in R, or all those in S, or all those in both relations in the result of the join, even if they do not have matching tuples in the other relation.
Left Outer Join: Keeps every tuple in the first (left) relation R in R ⟕ S; if no matching tuple is found in S, the attributes of S in the join result are filled with null values.
To find the names of all employees, along with the names of the departments they manage (if any), we could use:
T1 ← EMPLOYEE ⟕SSN=MGRSSN DEPARTMENT
Right Outer Join: Keeps every tuple in the second (right) relation S in R ⟖ S; if no matching tuple is found in R, the attributes of R in the join result are filled with null values.
Full Outer Join: Keeps all tuples in both relations R and S in R ⟗ S; where no matching tuples are found, the missing attributes are filled with null values as needed.
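The left outer join example in SQL (a sketch; ENAME and DNAME are assumed from the running EMPLOYEE/DEPARTMENT schema):

SELECT E.ENAME, D.DNAME
FROM EMPLOYEE E
LEFT OUTER JOIN DEPARTMENT D ON D.MGRSSN = E.SSN;

Employees who manage no department appear in the result with a null DNAME.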
Null Values
It is possible for tuples to have a null value, denoted by null, for some of their attributes; null signifies an unknown value or a value that does not exist.
The result of any arithmetic expression involving null is null.
Aggregate functions simply ignore null values (as in SQL)
For duplicate elimination and grouping, null is treated like any other value, and two nulls are
assumed to be the same (as in SQL)
4. Aggregate Functions and Grouping
The relational algebra allows us to specify mathematical aggregate functions on collections of values from the database. Examples of such functions include retrieving the average or total salary of all employees, or the number of employee tuples. Common functions applied to collections of numeric values include SUM, AVERAGE, MAXIMUM, and MINIMUM. The COUNT function is used for counting tuples or values.
Another type may be grouping the tuples in a relation by the value of some of their attributes and
then applying an aggregate function independently to each group. An example would be to group
employee tuples by DNO, so that each group includes the tuples for employees working in the
same department. We can then list each DNO value along with, say, the average salary of
employees within the department.
<grouping attributes> ℑ <function list> (R)
where <grouping attributes> is a list of attributes of the relation specified by R, and <function list> is a list of (<function> <attribute>) pairs. In each such pair, <function> is one of the allowed functions (such as SUM, AVERAGE, MAXIMUM, MINIMUM, COUNT) and <attribute> is an attribute of the relation specified by R.
The resulting relation has the grouping attributes plus one attribute for each element in the function
list.
1. ℑ MAXIMUM Salary(EMPLOYEE) retrieves the maximum salary value from the EMPLOYEE relation:
MAXIMUM_Salary
55000
2. Retrieve each department number, the number of employees in the department, and their average salary:
DNO ℑ COUNT SSN, AVERAGE Salary(EMPLOYEE)
3. Without the grouping attribute, you get the result for all tuples together:
ℑ COUNT SSN, AVERAGE Salary(EMPLOYEE)
Note: The results of aggregate functions are still relations, not numbers.
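The three aggregate examples above in SQL (a sketch):

-- 1. Maximum salary over the whole relation
SELECT MAX(SALARY) FROM EMPLOYEE;

-- 2. Per-department count of employees and average salary
SELECT DNO, COUNT(SSN), AVG(SALARY)
FROM EMPLOYEE
GROUP BY DNO;

-- 3. Without grouping: one row summarizing all tuples together
SELECT COUNT(SSN), AVG(SALARY) FROM EMPLOYEE;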
7. Logical Database Design: Relational DBMS - Codd's Rule
A relational database management system (RDBMS) is a database management system
(DBMS) that is based on the relational model as introduced by E. F. Codd. Most popular
commercial and open source databases currently in use are based on the relational model.
A short definition of an RDBMS may be a DBMS in which data is stored in the form of tables
and the relationship among the data is also stored in the form of tables.
E. F. Codd, the famous mathematician, introduced 12 rules for the relational model for databases, commonly known as Codd's rules. The rules mainly define what is required for a DBMS to be considered relational, i.e., an RDBMS. There is also one more rule, Rule 0, which specifies that a relational system must use its relational facilities to manage the database. The rules and their descriptions are as follows:
Rule 0: Foundation Rule
A relational database management system should be capable of using its relational facilities (exclusively) to manage the database.
Rule 1: Information Rule
All information in the database is to be represented in one and only one way. This is achieved by
values in column positions within rows of tables.
Rule 2: Guaranteed Access Rule
All data must be accessible with no ambiguity; that is, each and every datum (atomic value) is guaranteed to be logically accessible by a combination of table name, primary key value, and column name.
Rule 3: Systematic treatment of null values
Null values (distinct from empty character string or a string of blank characters and distinct from
zero or any other number) are supported in the fully relational DBMS for representing missing
information in a systematic way, independent of data type.
Rule 4: Dynamic On-line Catalog Based on the Relational Model
The database description is represented at the logical level in the same way as ordinary data, so
authorized users can apply the same relational language to its interrogation as they apply to regular
data. The authorized users can access the database structure by using common language i.e. SQL.
Rule 5: Comprehensive Data Sublanguage Rule
A relational system may support several languages and various modes of terminal use. However, there must be at least one language whose statements are expressible, per some well-defined syntax, as character strings, and that is comprehensive in supporting all of the following:
a. data definition
b. view definition
c. data manipulation (interactive and by program)
d. integrity constraints
e. authorization
f. Transaction boundaries (begin, commit, and rollback).
Rule 6: View Updating Rule
All views that are theoretically updatable are also updatable by the system.
Rule 7: High-level Insert, Update, and Delete
The system must support insert, update, and delete operations fully, and must be able to perform these operations on multiple rows simultaneously, not just one record at a time.
Rule 8: Physical Data Independence
Application programs and terminal activities remain logically unimpaired whenever any changes
are made in either storage representation or access methods.
Rule 9: Logical Data Independence
Application programs and terminal activities remain logically unimpaired when information
preserving changes of any kind that theoretically permit unimpairment are made to the base tables.
Rule 10: Integrity Independence
Integrity constraints specific to a particular relational database must be definable in the relational
data sublanguage and storable in the catalog, not in the application programs.
Rule 11: Distribution Independence
The data manipulation sublanguage of a relational DBMS must enable application programs and
terminal activities to remain logically unimpaired whether and whenever data are physically
centralized or distributed.
Rule 12: Nonsubversion Rule
If a relational system has or supports a low-level (single-record-at-a-time) language, that low-level
language cannot be used to subvert or bypass the integrity rules or constraints expressed in the
higher-level (multiple-records-at-a-time) relational language.
ENTITY-RELATIONSHIP MODEL
The basic object that the ER model represents is an entity, which is a "thing" in the real
world with an independent existence.
An entity may be an object with a physical existence (a particular person, car, house, or employee) or an object with a conceptual existence (a company, a job, or a university course).
For example, an employee entity may be described by the employee's name, age, address, salary, and job.
The attribute values that describe each entity become a major part of the data stored in the
database.
Types of attributes:
Composite attributes are useful to model situations in which a user sometimes refers to the
composite attribute as a unit but at other times refers specifically to its components.
If the composite attribute is referenced only as a whole, there is no need to subdivide it into component attributes. For example, if there is no need to refer to the individual components of an address (Zip, Street, and so on), then the whole address is designated as a simple attribute.
Single-valued Versus Multivalued Attributes
Most attributes have a single value for a particular entity; such attributes are called single-valued. For example, Age is a single-valued attribute of a person. In some cases an attribute can have a set of values for the same entity; for example, a Colors attribute for a car, or a CollegeDegrees attribute for a person. Cars with one color have a single value, whereas two-tone cars have two values for Colors. Similarly, one person may not have a college degree, another person may have one, and a third person may have two or more degrees; so different persons can have different numbers of values for the CollegeDegrees attribute. Such attributes are called multivalued.
A multivalued attribute may have lower and upper bounds on the number of values allowed
for each individual entity. For example, the Colors attribute of a car may have between one and
three values, if we assume that a car can have at most three colors.
Stored Versus Derived Attributes
In some cases two (or more) attribute values are related; for example, the Age and BirthDate attributes of a person. For a particular person entity, the value of Age can be determined from the current (today's) date and the value of that person's BirthDate. The Age attribute is hence called a derived attribute and is said to be derivable from the BirthDate attribute, which is called a
stored attribute. Some attribute values can be derived from related entities; for example, an
attribute NumberOfEmployees of a department entity can be derived by counting the number of
employees related to (working for) that department.
o An entity type defines a collection (or set) of entities that have the same attributes.
Each entity type in the database is described by its name and attributes.
o Figure shows two entity types, named EMPLOYEE and COMPANY, and a list of
attributes for each. A few individual entities of each type are also illustrated, along
with the values of their attributes.
o The collection of all entities of a particular entity type in the database at any point
in time is called an entity set; the entity set is usually referred to using the same
name as the entity type. For example, EMPLOYEE refers to both a type of entity as
well as the current set of all employee entities in the database.
o An entity type is represented in ER diagrams as a rectangular box enclosing the
entity type name.
o Attribute names are enclosed in ovals and are attached to their entity type by
straight lines.
o Composite attributes are attached to their component attributes by straight lines.
o Multivalued attributes are displayed in double ovals.
o An entity type describes the schema or intension for a set of entities that share the
same structure.
o The collection of entities of a particular entity type are grouped into an entity set,
which is also called the extension of the entity type.
Key Attributes of an Entity Type
o An important constraint on the entities of an entity type is the key or uniqueness
constraint on attributes.
o An entity type usually has an attribute whose values are distinct for each individual
entity in the collection. Such an attribute is called a key attribute, and its values can
be used to identify each entity uniquely.
o For example, the Name attribute is a key of the COMPANY entity type in Figure,
because no two companies are allowed to have the same name. For the PERSON
entity type, a typical key attribute is SocialSecurityNumber.
o Sometimes, several attributes together form a key, meaning that the combination of
the attribute values must be distinct for each entity. If a set of attributes possesses
this property, we can define a composite attribute that becomes a key attribute of
the entity type.
o In ER diagrammatic notation, each key attribute has its name underlined inside the
oval,
A functional dependency (F.D.) between two sets of attributes X and Y of R is denoted X → Y; X is the L.H.S. and Y the R.H.S. of the F.D.
Definition 1:
X functionally determines Y in a relation schema R iff whenever two tuples of r(R) agree on their X value, they must necessarily agree on their Y value.
An F.D. is a property of the semantics, or meaning, of attributes.
Definition 2:
For any two tuples t1 and t2 in r, if t1[X] = t2[X], we must have t1[Y] = t2[Y]; i.e., the values of the Y component of a tuple in r depend on, and are determined by, the values of the X component.
Examples:
SSN → ENAME
DNO → {DNAME, MGRSSN}
Inference rules:
Let F be the set of f.d.s specified on R. Numerous other functional dependencies that are implied by F also hold in R.
The set of all dependencies inferable from F is called the closure of F, denoted by F+.
Given F:
SSN → {Ename, Bdate, Address, Dno}
DNO → {Dname, MGRSSN}
Inferred F.D.s in F+:
SSN → {Dname, MGRSSN}
SSN → Ename
IR1 (Reflexive): If X ⊇ Y, then X → Y.
IR2 (Augmentation): {X → Y} ⊨ XZ → YZ.
IR3 (Transitive): {X → Y, Y → Z} ⊨ X → Z.
IR4 (Decomposition): {X → YZ} ⊨ X → Y.
IR5 (Union): {X → Y, X → Z} ⊨ X → YZ.
IR6 (Pseudotransitive): {X → Y, WY → Z} ⊨ WX → Z.
Closure of X under F:
For each set of attributes X in the L.H.S. of an f.d. in F, determine the set of all attributes dependent on X. Thus for each X we determine the set X+ of attributes functionally determined by X based on F; X+ is called the closure of X under F.
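A short worked example using the dependencies above: with F = {SSN → {Ename, Bdate, Address, Dno}, DNO → {Dname, MGRSSN}}, start from {SSN}+ = {SSN}; the first f.d. adds Ename, Bdate, Address and Dno; the second then adds Dname and MGRSSN; no further rule applies, so {SSN}+ = {SSN, Ename, Bdate, Address, Dno, Dname, MGRSSN}.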
Equivalence of sets of f.d.s:
A set of F.D.s E is covered by a set of f.d.s F (or F covers E) if every F.D. in E can be inferred from F (i.e., every F.D. in E is also in F+).
Two sets of F.D.s E and F are said to be equivalent if E+ = F+; i.e., every F.D. in E can be inferred from F and vice versa, so E covers F and F covers E.
Minimal set of functional dependency:
A set of functional dependencies F is said to be minimal if
1) Every dependency in F has a single attribute on its R.H.S.
2) We cannot replace any f.d. X → A in F with a dependency Y → A, where Y is a proper subset of X, and still have a set of functional dependencies equivalent to F.
3) We cannot remove any dependency from F and still have a set of dependencies equivalent to F.
Normalization takes a relational schema through a series of tests to certify whether it satisfies a certain normal form.
Codd defined the first, second, and third normal forms, and a stronger definition of 3NF known as BCNF (Boyce-Codd Normal Form).
All these normal forms are based on the f.d.s among the attributes of a relation.
Normalization:
It is the process of analyzing the given relational schemas based on their FDs and primary keys to achieve the desirable properties of minimizing redundancy and minimizing insertion and update anomalies.
Relational schemas that do not satisfy the normal form tests are decomposed into smaller relational schemas that meet the tests and possess the desirable properties.
The normalization procedure provides:
1) a framework for analyzing relational schemas based on keys and f.d.s;
2) a series of normal form tests carried out on each relational schema so that the database is normalized to the desired degree.
The normal form of a relation is the highest normal form condition that it meets, and hence indicates the degree to which it has been normalized.
Normalization through decomposition must meet:
1) the lossless join (non-additive join) property: the spurious tuple problem does not occur with respect to the schemas after decomposition;
2) the dependency preservation property: each f.d. is represented by some individual relation after decomposition.
The process of storing the join of higher normal form relations as a base relation in lower normal
form is known as denormalization.
Superkey: A superkey S of R is a set of attributes with the property that no two tuples t1 and t2 in r(R) will have t1[S] = t2[S]. In other words, any set of attributes able to identify any row uniquely is a superkey.
Candidate Key: A candidate key is a minimal superkey K: the removal of any attribute from K causes K to no longer be a superkey. That is, if K = {A1, ..., Am} is a candidate key, then K − {Ai} is not a superkey for any Ai.
Eg. {SSN, Ename} is a superkey and {SSN} is a candidate key.
Primary Key: Primary key could be any key, which is able to identify a specific row in database
in a unique manner.
An attribute of a relation schema R, which is a member of some candidate key of R is called Prime
Attribute otherwise it is a non-prime attribute.
First Normal Form:
Statement of First normal form:
The domain of an attribute must include only atomic (simple, indivisible) values, and the value of any attribute in a tuple must be a single value from the domain of that attribute.
1NF disallows sets of values and tuples of values; i.e., it disallows a relation within a relation, or relations as attribute values (it disallows composite and multivalued attributes).
Eg. consider the DEPARTMENT relational schema, whose primary key is DNUMBER, and in particular its DLOCATIONS attribute: each department can have a number of locations.
DEPARTMENT (not in 1NF)
DNAME | DNUMBER | DMGRSSN | DLOCATIONS
Research | 334 | ... | {Bellaire, Alaska, Newyork}

The domain of DLOCATIONS contains atomic values, but some tuples have a set of these values; therefore DLOCATIONS is not functionally dependent on the primary key DNUMBER.

There are three ways to achieve 1NF:
1. Remove the attribute DLOCATIONS that violates 1NF and place it in a separate relation DEPT_LOCATIONS(DNUMBER, DLOCATIONS), whose primary key is the combination {DNUMBER, DLOCATIONS}; the remaining relation DEPARTMENT(DNAME, DNUMBER, DMGRSSN) keeps the dependency DNUMBER → DMGRSSN.
2. Expand the key so that there is a separate tuple for each location of a department:
DNAME | DNUMBER | DMGRSSN | DLOCATIONS
Research | 334 | ... | Bellaire
Research | 334 | ... | Alaska
Research | 334 | ... | Newyork
But this introduces redundancy in the relation.
3. If the maximum number of values of DLOCATIONS is known, e.g. 3, replace DLOCATIONS by three atomic attributes LOC1, LOC2, LOC3. This introduces more null values in the relation.
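Remedy 1 as SQL tables (a sketch; the data types and sizes are assumptions, only the relation and attribute names come from the example):

CREATE TABLE DEPARTMENT (
    DNUMBER INT PRIMARY KEY,
    DNAME   VARCHAR(30),
    DMGRSSN CHAR(9)
);

CREATE TABLE DEPT_LOCATIONS (
    DNUMBER   INT REFERENCES DEPARTMENT(DNUMBER),
    DLOCATION VARCHAR(30),
    -- each (department, location) pair is stored once; the whole row is the key
    PRIMARY KEY (DNUMBER, DLOCATION)
);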
The 1 NF also disallows multivalued attributes that are themselves composite. These are called
nested relations.
Test for First Normal Form:
A relation should not have non-atomic attributes or nested relations.
Remedy:
Form new relations for each non-atomic attribute or nested relation.
Second Normal Form:
2NF is based on Full Functional Dependency.
Full Functional Dependency:
An FD X → Y is a full functional dependency if the removal of any attribute A from X breaks the dependency; i.e., for any A ∈ X, (X − {A}) does not functionally determine Y. Eg. {SSN, PNUMBER} → HOURS.
Partial Functional Dependency:
An FD X → Y is a partial dependency if some attribute A ∈ X can be removed from X and the dependency still holds; i.e., for some A ∈ X, (X − {A}) → Y.
If a relational schema R is not in 2NF, it can be normalized into a number of 2NF relations in which non-prime attributes are associated only with the part of the primary key on which they are fully functionally dependent. Eg. the EMP_PROJ relation (the relation and sub-relation names are assumed; the attributes appear in the original):

EMP_PROJ (SSN, PNUMBER, HOURS, ENAME, PNAME, PLOCATION) - NOT IN 2NF
FD1: {SSN, PNUMBER} → HOURS
FD2: SSN → ENAME
FD3: PNUMBER → {PNAME, PLOCATION}

ENAME, PNAME and PLOCATION violate 2NF, since FD2 and FD3 are partial dependencies on the primary key {SSN, PNUMBER}.

Normalization (decomposition) to achieve 2NF:
EP1 (SSN, PNUMBER, HOURS) carrying FD1
EP2 (SSN, ENAME) carrying FD2
EP3 (PNUMBER, PNAME, PLOCATION) carrying FD3
Test for Second Normal Form:
For relations where the primary key contains multiple attributes, no non-key (non-prime) attribute should be functionally dependent on only a part of the primary key.
Remedy:
Decompose and set up a new relation for each partial key together with its dependent attribute(s). Make sure to keep a relation with the original primary key and any attributes that are fully functionally dependent on it.
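The 2NF decomposition expressed as SQL tables (a sketch; the data types are assumptions):

CREATE TABLE EP1 (
    SSN     CHAR(9),
    PNUMBER INT,
    HOURS   DECIMAL(4,1),          -- fully dependent on the whole key
    PRIMARY KEY (SSN, PNUMBER)
);

CREATE TABLE EP2 (
    SSN   CHAR(9) PRIMARY KEY,
    ENAME VARCHAR(40)              -- depends on SSN alone
);

CREATE TABLE EP3 (
    PNUMBER   INT PRIMARY KEY,
    PNAME     VARCHAR(30),         -- PNAME and PLOCATION depend on PNUMBER alone
    PLOCATION VARCHAR(30)
);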
Third Normal Form:
A relation schema R is in 3NF if, whenever a nontrivial functional dependency X → A holds in R, either
X is a superkey of R, or
A is a prime attribute of R.
Eg. EMPLOYEE (Ename, SSN, Dno, Dname, Dmgrssn):
SSN → Dno and Dno → {Dname, Dmgrssn}, so SSN → {Dname, Dmgrssn} is a transitive dependency; decompose into EMPLOYEE (Ename, SSN, Dno) and DEPARTMENT (Dno, Dname, Dmgrssn).
Test for Third Normal Form:
A relation should not have a non-key attribute functionally determined by another non-key attribute; i.e., there should be no transitive dependency of a non-key attribute on the primary key.
Remedy:
Decompose and set up a separate relation that includes the non-key attribute(s) that functionally determine the other non-key attribute(s).
Boyce-Codd Normal Form (BCNF):
BCNF is stricter than 3NF: every relation in BCNF is also in 3NF. A relation R is in BCNF if, whenever a nontrivial dependency X → A holds in R, X is a superkey of R.
The condition of 3NF that allows A to be prime is absent from BCNF.
Equivalently, a relation is in BCNF if and only if every determinant is a candidate key.
ClientInterview
clientNo | interviewDate | interviewTime | staffNo | roomNo
CR76 | 13-May-02 | 10.30 | SG5 | G101
CR76 | 13-May-02 | 12.00 | SG5 | G101
CR74 | 13-May-02 | 12.00 | SG37 | G102
CR56 | 1-Jul-02 | 10.30 | SG5 | G102

FD1: clientNo, interviewDate → interviewTime, staffNo, roomNo (primary key; the relation also has other candidate keys)
The relation additionally satisfies staffNo, interviewDate → roomNo, whose determinant is not a candidate key, so ClientInterview is not in BCNF. It is decomposed into:

Interview
clientNo | interviewDate | interviewTime | staffNo
CR76 | 13-May-02 | 10.30 | SG5
CR76 | 13-May-02 | 12.00 | SG5
CR74 | 13-May-02 | 12.00 | SG37
CR56 | 1-Jul-02 | 10.30 | SG5

StaffRoom
staffNo | interviewDate | roomNo
SG5 | 13-May-02 | G101
SG37 | 13-May-02 | G102
SG5 | 1-Jul-02 | G102
The ultimate goal during normalization is to separate the logically related attributes into
tables to minimize redundancy, and thereby avoid the update anomalies that lead to an extra
processing overhead to maintain consistency in the database.
The above ideals are sometimes sacrificed in favor of faster execution of frequently
occurring queries and transactions. This process of storing the logical database design (which may
be in BCNF or 4NF) in a weaker normal form, say 2NF or 1NF, is called denormalization.
Typically, the designer adds to a table attributes that are needed for answering queries or
producing reports so that a join with another table, which contains the newly added attribute, is
avoided. This reintroduces a partial functional dependency or a transitive dependency into the
table, thereby creating the associated redundancy problems.
Other forms of denormalization consist of storing extra tables to maintain original
functional dependencies that are lost during a BCNF decomposition.
Differentiate 3 NF and BCNF:
The difference between 3NF and BCNF is that, for a functional dependency A → B, 3NF allows this dependency in a relation if B is a prime (primary-key) attribute even when A is not a candidate key, whereas BCNF insists that for this dependency to remain in a relation, A must be a candidate key.
Note:
A non-prime attribute of R is an attribute that does not belong to any candidate key of R.
Denormalization
Denormalization doesn't mean not normalizing; it's a step in the process.
First we normalize, then we realize that we now have hundreds or thousands of small tables and that the performance cost of all those joins will be prohibitive, and then we carefully apply some denormalization techniques so that the application will return results before the werewolves have overrun the city.
Since normalization is about reducing redundancy, denormalizing is about deliberately adding redundancy to improve performance. Before beginning, consider whether or not it's necessary:
Is the system's performance unacceptable with fully normalized data? Mock up a client and do some testing.
If the performance is unacceptable, figure out whether denormalizing will make it acceptable.
Denormalization Strategies
Materialized Views
If we're lucky, we won't need to denormalize our logical data design at all. We can let the database management system store additional information on disk, and it is the responsibility of the DBMS software to keep the redundant information consistent.
Oracle does this with materialized views, which are pretty much what they sound like: SQL views, but made material on the disk. Like regular views, they change when the underlying data does. Microsoft SQL Server has something called indexed views, which I gather are similar, although I've never used them. Other SQL databases have standard hacks to simulate materialized views; a little Googling will tell you whether or not yours does.
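A minimal sketch (the view name and the EMPLOYEE summary are illustrative assumptions; the statement shown is the basic Oracle/PostgreSQL form):

CREATE MATERIALIZED VIEW dept_salary_mv AS
SELECT DNO,
       COUNT(*)    AS emp_count,
       AVG(SALARY) AS avg_salary
FROM   EMPLOYEE
GROUP  BY DNO;

Queries can then read the precomputed rows, and the DBMS keeps them consistent with EMPLOYEE according to the refresh options chosen.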
Database Constraints
The more common approach is to denormalize some carefully chosen parts of the database design, adding redundancies to speed performance. Danger lies ahead. It is now the database designer's responsibility to create database constraints specifying how redundant pieces of data will be kept consistent.
These constraints introduce a tradeoff. They do speed up reads, just as we wish, but they slow down writes. If the database's regular usage is write-heavy, such as the Council of Lights message forum, then this may actually be worse than the non-denormalized version.
Mirror Tables
If a particular table is too active, with constant querying and updating leading to slowdowns and timeouts and deadlocks, consider a miniature and internal version of "double your database fun": you can have a background and a foreground copy of the same table. The background table takes only writes, the foreground table is for reading, and synchronizing the two is a low-priority process that happens when less is going on.
Split Tables
If different queries use different subsets of a table's rows, the table can be split into those subsets. The original table could also be maintained as a master, in which case the split tables are special cases of mirror tables (above) which happen to contain a subset instead of a complete copy. If few people want to query the original table, I could maintain it as a view which joins the split tables, and treat the split tables as masters.
That's a horizontal split, pulling rows out into different tables. For other needs we might do a vertical split, keeping all the same rows/keys but giving each table a separate set of columns of information.
This is only worthwhile when there are distinct kinds of queries which are, in effect,
already treating the table as more than one table.
Combined Tables
Instead of splitting one table into several, we might combine several into one. If tables have a one-to-one relationship, it might speed performance to combine them into one table even if that isn't part of the logical database design. We might even combine tables with a one-to-many relationship, although that would significantly complicate our update process and would likely work out to be a net loss.
Combined tables differ from joined tables (above) in that they already share some data in
some relationship, while any two tables which are frequently joined are potential candidates for
joined tables. They also differ in that a joined table usually populates itself from the co-existing
normalized tables, while a combined table will usually be the master copy.
Index Tables
These are nice for searching hierarchies. Remember how Normal 5 taught us that weapon composition is actually a hierarchy of materials rather than a simple list? Well, now we're stuck with searching that hierarchy by searching all the leaves under particular parent nodes and combining the results with a bunch of union statements.
UNIT II
SQL & QUERY OPTIMIZATION
SQL Standards -Data types -Database Objects-DDL-DML-DCL-TCL-Embedded SQL-Static Vs
Dynamic SQL -QUERY OPTIMIZATION: Query Processing and Optimization -Heuristics and
cost Estimates in Query Optimization.
Embedded SQL and Dynamic SQL - SQL statements can be embedded in a general-purpose programming language.
Integrity - SQL DDL commands can specify integrity constraints that the data stored in the database must satisfy; updates that violate these constraints are disallowed.
Authorization - SQL DDL includes commands for specifying access rights to relations and views.
Basic Structure of SQL:
The basic structure of SQL expression consists of three clauses: select, from and where
The SELECT clause corresponds to the project operator of the relational algebra to list the
attributes desired in the result of a query.
The FROM clause corresponds to the Cartesian product operation of the relational algebra
to list the relations in the evaluation of the expression.
The WHERE clause corresponds to the selection predicate of the relational algebra. It consists of a predicate involving attributes of the relations that appear in the FROM clause.
A SQL query is of the form:
SELECT A1, A2, ..., An
FROM r1, r2, ..., rm
WHERE P
or
SELECT <attribute list>
FROM <table list>
WHERE <condition>;
where:
<attribute list> is a list of attribute names whose values are to be retrieved by the query.
<table list> is a list of the relation names required to process the query.
<condition> is a conditional (Boolean) expression that identifies the tuples to be
retrieved by
the query.
The SQL query is equivalent to the relational algebra expression
πA1, A2, ..., An(σP(r1 × r2 × ... × rm))
SQL forms the Cartesian product of the relations named in the from clause, performs a
relational algebra selection using the where clause predicate and then projects the result onto the
attributes of the select clause.
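A concrete instance of the pattern (a sketch; the EMPLOYEE schema is assumed from the earlier relational algebra examples):

SELECT ENAME, SALARY
FROM EMPLOYEE
WHERE DNO = 5 AND SALARY > 30000;

which corresponds to πENAME, SALARY(σDNO=5 AND SALARY>30000(EMPLOYEE)).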
2.2 DDL:
SQL uses the terms table, row, and column for relation, tuple, and attribute, respectively.
The SQL2 commands for data definition are CREATE, ALTER, TRUNCATE and DROP;
The SQL DDL allows specification of not only a set of relations, but also information
about each relation, including:
CREATE TABLE r (A1 D1, A2 D2, ..., An Dn, <integrity constraint 1>, ..., <integrity constraint k>);
where r is the name of the relation, each Ai is the name of an attribute in the schema of relation r, and Di is the domain type of the values of Ai.
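For example (a sketch; the account table anticipates the dynamic SQL example later in this unit, and the types are assumptions):

CREATE TABLE account (
    account_number CHAR(10) PRIMARY KEY,
    branch_name    VARCHAR(30),
    balance        NUMBER(12,2) NOT NULL   -- a column-level integrity constraint
);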
DML(Data Manipulation Language)
It is a language that provides a set of operations to support the basic data manipulation operations on the data held in the database.
Commands used are:
Insert
Delete
Select
Update
DCL (Data Control Language)
Grant
Revoke
Data/Domain types in SQL
The SQL standard supports a variety of built-in domain types, including:
char(n) - a fixed-length character string of length n
varchar(n) - a variable-length character string with maximum length n
int - integer
smallint - a small integer
number(n) - a fixed-point number with n digits
number(p,d) - a fixed-point number with p digits, d of which are to the right of the decimal point
real, float(n) - floating-point numbers
date - a calendar date (year, month, day)
time - the time of day
timestamp - a combination of date and time
Database Objects
View
A view is a subset of a database; it is a personalized model of the database. A view can hide data that a user does not need to see; this simplifies the usage of the system and enhances security. A user who is not allowed to directly access a relation may be allowed to access a part of that relation through a view. Views may also be called virtual tables.
Views provide a mechanism to hide certain data from the view of certain users. To create a view we use the command:
create view v as <query expression>
where <query expression> is any legal query expression and v is the view name.
create view <viewname> as select <fields> from <table name>;
Update of a View
Example: create a view of all loan data in the loan relation, hiding the amount attribute, as sketched below.
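A sketch of that view (loan_number, branch_name and amount as attributes of loan are assumed from the classic loan example):

create view branch_loan as
select loan_number, branch_name
from loan;

Users of branch_loan see loan data without the amount attribute.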
Updates on more complex views are difficult or impossible to translate, and hence are
disallowed.
Most SQL implementations allow updates only on simple views (without aggregates)
defined on a single relation
Advantages of Views:
1. Provide automatic security for hidden data.
2. Different views of same data for different users.
3. Provide logical data independence.
4. Provides the principle of interchangeability and principle of database relativity.
Sequence
A sequence is a database object created by a user and can be shared by multiple users.
Use:
To generate primary key values, which must be unique for each row.
Syntax
To Create Sequence
CREATE SEQUENCE sequence_name
[INCREMENT BY n]
[START WITH n]
[{MAXVALUE n | NOMAXVALUE}]
[{MINVALUE n | NOMINVALUE}]
[{CYCLE | NOCYCLE}]
[{CACHE n | NOCACHE}];
To use a Sequence
Reference its pseudocolumns: sequence_name.NEXTVAL returns the next value in the
sequence, and sequence_name.CURRVAL returns the current value.
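A minimal sketch of a sequence in use (an employee table with emp_id and emp_name columns
is assumed; the syntax follows the Oracle-style form above):
CREATE SEQUENCE emp_seq
    START WITH 1
    INCREMENT BY 1
    NOCACHE;
INSERT INTO employee (emp_id, emp_name)
VALUES (emp_seq.NEXTVAL, 'Smith');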
Synonyms:
Synonyms are direct references to objects. They are used to provide public access to an object,
to mask the real name or owner of an object, etc. A user may create a private synonym that is
available only to that user.
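For example (the schema and table names are assumed), a synonym can be created with:
CREATE SYNONYM emp FOR hr.employee;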
The program passes the SQL statement to the DBMS with a PREPARE statement, which requests
that the DBMS parse, validate, and optimize the statement and generate an execution plan for it.
The program can use the EXECUTE statement repeatedly, supplying different parameter values
each time the dynamic statement is executed.
Example of the use of dynamic SQL from within a C program.
char *sqlprog = "update account set balance = balance * 1.05 where account_number = ?";
EXEC SQL prepare dynprog from :sqlprog;
char account[10] = "A-101";
EXEC SQL execute dynprog using :account;
The dynamic SQL program contains a ?, which is a place holder for a value that is provided when
the SQL program is executed.
Difference between Static and Dynamic SQL:
1. In static (embedded) SQL, how the database will be accessed is predetermined in the
embedded SQL statement (hard coded). In dynamic SQL, how the database will be accessed is
determined at run time.
2. Static SQL statements are parsed, validated, and optimized at compile time. Dynamic SQL
statements are parsed, validated, and optimized at run time.
3. Static SQL is generally used for situations where data is distributed uniformly. Dynamic SQL
is generally used for situations where data is distributed non-uniformly.
4. Static SQL is less flexible. Dynamic SQL is more flexible.
The scanner identifies the language tokens (such as SQL keywords, attribute names, and
relation names) in the text of the query, whereas the parser checks the query syntax to determine
whether it is formulated according to the syntax rules (rules of grammar) of the query language.
The query must also be validated, by checking that all attribute and relation names are
valid and semantically meaningful names in the schema of the particular database being queried.
An internal representation of the query is then created, usually as a tree data structure called a
query tree. It is also possible to represent the query using a graph data structure called a query
graph.
The DBMS must then devise an execution strategy for retrieving the result of the query
from the database files. A query typically has many possible execution strategies, and the process
of choosing a suitable one for processing a query is known as query optimization.
Finding the optimal strategy is usually too time-consuming except for the simplest of
queries, and may require information on how the files are implemented and even on the contents
of the files, information that may not be fully available in the DBMS catalog. Hence, planning of
an execution strategy may be a more accurate description than query optimization.
Steps in Query Processing
The steps involved in processing a query appear in Figure. The basic steps are
1. Parsing and translation
2. Optimization
3. Evaluation
Given a query, there are generally a variety of methods for computing the answer.
For example, we have seen that, in SQL, a query could be expressed in several different ways.
Each SQL query can itself be translated into a relational-algebra expression in one of several ways.
Furthermore, the relational-algebra representation of a query specifies only partially how to
evaluate a query; there are usually several ways to evaluate relational-algebra expressions. As an
illustration, consider the query
select balance
from account
where balance < 2500
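This query can be translated into either of the following relational-algebra expressions, as one
sketch of the choice the evaluator faces (one applies the selection first, the other the projection
first):
σ balance<2500 (Π balance (account))
Π balance (σ balance<2500 (account))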
A drawback of cost-based optimization is the cost of optimization itself. Although the cost
of query processing can be reduced by clever optimizations, cost-based optimization is still
expensive. Hence, many systems use heuristics to reduce the number of choices that must be
made in a cost-based fashion. Some systems even choose to use only heuristics, and do not use
cost-based optimization at all.
An example of a heuristic rule is the following rule for transforming relational algebra
queries:
Perform selection operations as early as possible.
A heuristic optimizer would use this rule without finding out whether the cost is reduced by
this transformation. In the first transformation the selection operation was pushed into a join.
We say that the preceding rule is a heuristic because it usually, but not always, helps to
reduce the cost. For an example of where it can result in an increase in cost, consider an expression
σθ(r ⋈ s), where the condition θ refers only to attributes in s. The selection can certainly be
performed before the join. However, if r is extremely small compared to s, and if there is an index
on the join attributes of s, but no index on the attributes used by θ, then it is probably a bad idea to
perform the selection early.
Performing the selection early, that is, directly on s, would require doing a scan of all
tuples in s. It is probably cheaper, in this case, to compute the join by using the index, and then to
reject tuples that fail the selection.
The projection operation, like the selection operation, reduces the size of relations. Thus,
whenever we need to generate a temporary relation, it is advantageous to apply immediately any
projections that are possible. This advantage suggests a companion to the "perform selections
early" heuristic:
Perform projections early.
It is usually better to perform selections earlier than projections, since selections have the
potential to reduce the sizes of relations greatly, and selections enable the use of indices to access
tuples.
An example similar to the one used for the selection heuristic should convince you that this
heuristic does not always reduce the cost.
A heuristic optimization algorithm will reorder the components of an initial query tree to
achieve improved query execution. Heuristics can be understood by visualizing a query expression
as a tree, as illustrated in Figure
1. Deconstruct conjunctive selections into a sequence of single selection operations. This step,
based on equivalence rule 1, facilitates moving selection operations down the query tree.
2. Move selection operations down the query tree for the earliest possible execution. This step uses
the commutativity and distributivity properties of the selection operation noted in equivalence
rules 2, 7.a, 7.b, and 11.
For instance, this step transforms σθ(r ⋈ s) into either σθ(r) ⋈ s or r ⋈ σθ(s) whenever possible.
Performing value-based selections as early as possible reduces the cost of sorting and merging
intermediate results. The degree of reordering permitted for a particular selection is determined by
the attributes involved in that selection condition.
3. Determine which selection operations and join operations will produce the smallest relations
that is, will produce the relations with the least number of tuples. Using associativity of the
operation, rearrange the tree so that the leaf-node relations with these restrictive selections are
executed first.
This step considers the selectivity of a selection or join condition. Recall that the most restrictive
selection, that is, the condition with the smallest selectivity, retrieves the fewest records. This
step relies on the associativity of binary operations given in equivalence rule 6.
4. Replace with join operations those Cartesian product operations that are followed by a selection
condition (rule 4.a). The Cartesian product operation is often expensive to implement, since r1 × r2
includes a record for each combination of records from r1 and r2. The selection may significantly
reduce the number of records, making the join much less expensive than the Cartesian product.
5. Deconstruct and move as far down the tree as possible lists of projection attributes, creating new
projections where needed. This step draws on the properties of the projection operation given in
equivalence rules 3, 8.a, 8.b, and 12.
6. Identify those subtrees whose operations can be pipelined, and execute them using pipelining.
In summary, the heuristics listed here reorder an initial query-tree representation in such a
way that the operations that reduce the size of intermediate results are applied first; early selection
reduces the number of tuples, and early projection reduces the number of attributes. The heuristic
transformations also restructure the tree so that the system performs the most restrictive selection
and join operations before other similar operations.
Heuristic optimization further maps the heuristically transformed query expression into
alternative sequences of operations to produce a set of candidate evaluation plans. An evaluation
plan includes not only the relational operations to be performed, but also the indices to be used, the
order in which tuples are to be accessed, and the order in which the operations are to be performed.
The access-plan selection phase of a heuristic optimizer chooses the most efficient strategy for
each operation.
UNIT III
TRANSACTION PROCESSING AND CONCURRENCY CONTROL
Introduction - Properties of Transaction - Serializability - Concurrency Control - Locking Mechanisms - Two Phase Commit Protocol - Deadlock.
3.1 INTRODUCTION
A transaction is a unit of program execution that accesses and possibly updates various data
items. A transaction consists of collection of operations used to perform a particular task. Each
transaction begins with BEGIN TRANSACTION statement and ends with END TRANSACTION
statement.
Transaction -States
Active
This is the initial state, the transaction stays in this state while it is executing.
Partially committed
A transaction is in this state when it has executed the final statement.
Failed
A transaction is in this state once the normal execution of the transaction cannot
proceed.
Aborted
A transaction is said to be aborted when the transaction has rolled back and the database is
being restored to the consistent state prior to the start of the transaction.
Committed
A transaction is in the committed state once it has been successfully executed and the database is
transformed into a new consistent state.
A transaction starts in the active state. A transaction contains a group of statements that form
a logical unit of work. When the transaction has finished executing the last statement, it enters the
partially committed state. At this point the transaction has completed execution, but it is still
possible that it may have to be aborted, because the actual output may still be in main
memory and a hardware failure can still prevent successful completion. The database system
then writes enough information to disk; when the last of this information is written, the
transaction enters the committed state.
A transaction enters the failed state once the system determines that the transaction can no
longer proceed with its normal execution. This could be due to hardware failures or logical errors.
Such a transaction should be rolled back. When the rollback is complete, the transaction enters the
aborted state. When a transaction aborts, the system has two options: it can restart the transaction
(if the abort was due to a hardware or software failure rather than the transaction's internal logic),
or it can kill the transaction.
The four basic properties that all transactions should possess are called ACID properties.
Atomicity: The 'all or nothing' property. A transaction is an indivisible unit that is either
performed in its entirety or is not performed at all. It is the responsibility of the recovery subsystem
of the DBMS to ensure atomicity.
Consistency: A transaction must transform the database from one consistent state to another
consistent state. It is the responsibility of both the DBMS and the application developers to ensure
consistency. The DBMS can ensure consistency by enforcing all the constraints that have been
specified on the database schema, such as integrity and enterprise constraints. However, in itself
this is insufficient to ensure consistency.
Example:
If a transaction intended to transfer money from one bank account to another debits one account
but, because of an error in the transaction logic, credits the wrong account, the database is left in
an inconsistent state.
Isolation: Transactions execute independently of one another, i.e. the partial effects of incomplete
transactions should not be visible to other transactions. It is the responsibility of the concurrency
control subsystem to ensure isolation.
Durability: The effects of a successfully completed transaction are permanently recorded in the
database and must not be lost because of a subsequent failure. It is the responsibility of the recovery
subsystem to ensure durability.
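As a sketch (an account table with account_number and balance columns is assumed, and the
transaction-start keyword varies by system), a funds transfer is wrapped in a single transaction
so that either both updates persist or neither does:
BEGIN TRANSACTION;
UPDATE account SET balance = balance - 50 WHERE account_number = 'A-101';
UPDATE account SET balance = balance + 50 WHERE account_number = 'A-102';
COMMIT;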
3.3 SERIALIZABILITY
Basic assumption: Each transaction preserves database consistency. Thus serial execution of a set
of transactions preserves database consistency.
A (possibly concurrent) schedule is serializable if it is equivalent to a serial schedule. Two forms
of schedule equivalence give rise to the notions of:
1. Conflict Serializability
2. View Serializability
Intuitively, a conflict between Ii and Ij forces a temporal order between them. If Ii and Ij are
consecutive in a schedule and they do not conflict, their results would remain the same even if they
had been interchanged in the schedule.
If a schedule S can be transformed into a schedule S′ by a series of swaps of non-conflicting
instructions, we say that S and S′ are conflict equivalent.
A schedule S is conflict serializable if it is conflict equivalent to a serial schedule.
It is not possible to swap instructions in the above schedule to obtain either the serial
schedule < T3, T4 >, or the serial schedule < T4, T3 >. Schedule 3 below can be transformed
into Schedule 1, a serial schedule where T2 follows T1, by a series of swaps of non-conflicting
instructions. Therefore Schedule 3 is conflict serializable.
Two schedules S and S′ are view equivalent if the following three conditions are met:
1. For each data item Q, if transaction Ti reads the initial value of Q in schedule S, then
transaction Ti must, in schedule S′, also read the initial value of Q.
2. For each data item Q, if transaction Ti executes read(Q) in schedule S, and that value was
produced by transaction Tj (if any), then transaction Ti must in schedule S′ also read the value
of Q that was produced by transaction Tj.
3. For each data item Q, the transaction (if any) that performs the final write(Q) operation in
schedule S must perform the final write(Q) operation in schedule S′.
A schedule S is view serializable if it is view equivalent to a serial schedule.
Every view serializable schedule that is not conflict serializable has blind writes.
Precedence graph: It is a directed graph where the vertices are the transactions (names). We
draw an arc from Ti to Tj if the two transactions conflict, and Ti accessed the data item on which the
conflict arose earlier. We may label the arc by the item that was accessed.
Figure: Precedence graph for an example schedule (Schedule A).
o Increased processor and disk utilization, leading to better transaction throughput: one
transaction can be using the CPU while another is reading from or writing to the disk.
o Reduced average response time for transactions: short transactions need not wait behind long
ones.
Concurrency control schemes: mechanisms to control the interaction among the concurrent
transactions in order to prevent them from destroying the consistency of the database.
Schedules: sequences that indicate the chronological order in which instructions of concurrent
transactions are executed. A schedule for a set of transactions must consist of all instructions of
those transactions, and must preserve the order in which the instructions appear in each
individual transaction.
Recoverable schedule: if a transaction Tj reads a data item previously written by transaction
Ti, the commit operation of Ti must appear before the commit operation of Tj.
3.5 LOCKING MECHANISMS
Concurrency control protocols, which ensure serializability of transactions, are most desirable.
Lock-based protocols use locks, which can be broadly divided into two categories:
Binary locks: a lock on a data item can be in two states; it is either locked or unlocked.
Shared/exclusive locks: this type of locking mechanism differentiates locks based on their use.
If a lock is acquired on a data item to perform a write operation, it is an exclusive lock, because
allowing more than one transaction to write to the same data item would lead the database into
an inconsistent state. Read locks are shared, because no data value is being changed.
Two-Phase Locking (2PL)
This locking protocol divides the transaction execution phase into three parts. In the first
part, when the transaction starts executing, it seeks grants for the locks it needs as it
executes. The second part is where the transaction acquires all the locks and no other lock is
required; the transaction keeps executing its operations. As soon as the transaction releases its
first lock, the third phase starts. In this phase the transaction cannot demand any new lock, but
can only release the acquired locks.
Strict Two-Phase Locking (Strict-2PL)
The first phase of Strict-2PL is the same as 2PL. After acquiring all locks in the first phase, the
transaction continues to execute normally. But in contrast to 2PL, Strict-2PL does not release a
lock as soon as it is no longer required; it holds all locks until the commit state arrives, and
releases them all at once at the commit point.
Timestamp Ordering Protocol
If a transaction Ti issues a read(X) operation:
If TS(Ti) < W-timestamp(X): operation rejected.
If TS(Ti) >= W-timestamp(X): operation executed.
If a transaction Ti issues a write(X) operation:
If TS(Ti) < R-timestamp(X): operation rejected.
If TS(Ti) < W-timestamp(X): operation rejected and Ti rolled back.
Otherwise: operation executed.
Timestamp ordering rules can be modified to make the schedule view serializable: instead of
rolling Ti back, the 'write' operation itself is ignored (Thomas' write rule).
Two-Phase Commit Protocol
Basic Algorithm
During phase 1, the coordinator sends a query-to-commit message to all cohorts and
then waits for all cohorts to report back with an agreement message. The cohorts, if the
transaction was successful, write an entry to the undo log and an entry to the redo log. Then the
cohorts reply with an agree message, or an abort message if the transaction failed at a cohort node.
During phase 2, if the coordinator receives an agree message from all cohorts, it writes
a commit record into its log and sends a commit message to all the cohorts. If not all agreement
messages come back, the coordinator sends an abort message. Next the coordinator waits for
acknowledgements from the cohorts. When acks are received from all cohorts, the coordinator
writes a complete record to its log. Note that the coordinator will wait forever for all the
acknowledgements to come back. If a cohort receives a commit message, it releases all the locks
and resources held during the transaction and sends an acknowledgement to the coordinator. If the
message is abort, the cohort undoes the transaction using the undo log, releases the resources and
locks held during the transaction, and then sends an acknowledgement.
Disadvantages
The greatest disadvantage of the two phase commit protocol is the fact that it is a blocking
protocol. A node will block while it is waiting for a message. This means that other processes
competing for resource locks held by the blocked processes will have to wait for the locks to be
released. A single node will continue to wait even if all other sites have failed. If the coordinator
fails permanently, some cohorts will never resolve their transactions. This has the effect that
resources are tied up forever.
3.6 DEADLOCK:
When dealing with locks two problems can arise, the first of which is deadlock.
Deadlock refers to a situation where two or more processes are each waiting for another
to release a resource, or more than two processes are waiting for resources in a circular chain.
Deadlock is a common problem in multiprocessing where many processes share a specific type of
mutually exclusive resource. Some computers, usually those intended for the time-sharing and/or
real-time markets, are often equipped with a hardware lock, or hard lock, which guarantees
exclusive access to processes, forcing serialization. Deadlocks are particularly disconcerting
because there is no general solution to avoid them.
Wait-Die Scheme:
In this scheme, if a transaction requests to lock a resource (data item) which is already held with
a conflicting lock by another transaction, one of two possibilities may occur:
If TS(Ti) < TS(Tj), that is, Ti, which is requesting a conflicting lock, is older than Tj, then Ti is
allowed to wait until the data item is available.
If TS(Ti) > TS(Tj), that is, Ti is younger than Tj, then Ti dies. Ti is restarted later with a random
delay but with the same timestamp.
This scheme allows the older transaction to wait but kills the younger one.
Wound-Wait Scheme:
In this scheme, if a transaction request to lock a resource (data item), which is already held with
conflicting lock by some other transaction, one of the two possibilities may occur:
If TS(Ti) < TS(Tj), that is Ti, which is requesting a conflicting lock, is older than Tj, Ti
forces Tj to be rolled back, that is Ti wounds Tj. Tj is restarted later with random delay but
with same timestamp.
If TS(Ti) > TS(Tj), that is Ti is younger than Tj, Ti is forced to wait until the resource is
available.
This scheme allows the younger transaction to wait; but when an older transaction requests an
item held by a younger one, the older transaction forces the younger one to abort and release the
item.
In both schemes, the transaction that enters the system later is aborted.
3.6.2 Deadlock Avoidance
Aborting a transaction is not always a practical approach. Instead, deadlock avoidance
mechanisms can be used to detect any deadlock situation in advance. Methods like the "wait-for
graph" are available, but they suit systems where transactions are lightweight and hold fewer
instances of resources; in a bulky system, deadlock prevention techniques may work better.
Wait-for Graph
This is a simple method available to track whether any deadlock situation may arise. For each
transaction entering the system, a node is created. When transaction Ti requests a lock on an
item, say X, which is held by some other transaction Tj, a directed edge is created from Ti to Tj.
If Tj releases item X, the edge between them is dropped and Ti locks the data item.
The system maintains this wait-for graph for every transaction waiting for some data items held
by others, and keeps checking whether there is any cycle in the graph.
It is not always feasible to roll back the younger transaction, as it may be more important than
the older one. With the help of a selection algorithm, a transaction is chosen to be aborted;
this transaction is called the victim, and the process is known as victim selection.
UNIT IV
TRENDS IN DATABASE TECHNOLOGY
Overview of Physical Storage Media - Magnetic Disks - RAID - Tertiary Storage - File
Organization - Organization of Records in Files - Indexing and Hashing - Ordered Indices - B+
tree Index Files - B tree Index Files - Static Hashing - Dynamic Hashing - Introduction to
Distributed Databases - Client server technology - Multidimensional and Parallel databases -
Spatial and multimedia databases - Mobile and web databases - Data Warehouse - Mining - Data marts
4.1 OVERVIEW OF PHYSICAL STORAGE MEDIA
Classification of Physical Storage Media
Storage media can be classified by speed of access, cost per unit of data, and reliability.
Tertiary storage: lowest level in the hierarchy, non-volatile, slow access time. Also called off-line
storage. E.g. magnetic tape, optical storage.
Cache
The fastest and most costly form of storage; volatile.
Main memory
Generally too small (or too expensive) to store the entire database
Capacities have gone up and per-byte costs have decreased steadily and rapidly
Volatile contents of main memory are usually lost if a power failure or system crash
occurs.
Flash memory
Data can be written at a location only once, but location can be erased and written to again.
Widely used in embedded devices such as digital cameras. Also known as EEPROM.
Magnetic-disk
Primary medium for the long-term storage of data; typically stores entire database.
Data must be moved from disk to main memory for access, and written back for storage.
Direct access: it is possible to read data on disk in any order, unlike magnetic tape (hard disks
vs. floppy disks).
Capacities range up to roughly 100 GB currently; much larger capacity and much lower cost per
byte than main memory or flash memory.
Optical storage
Non-volatile; data is read optically from a spinning disk using a laser. CD-ROM (640 MB)
and DVD (4.7 to 17 GB) are the most popular forms.
Write-once, read-many (WORM) optical disks are used for archival storage (CD-R and DVD-R).
Juke-box systems, with large numbers of removable disks, a few drives, and a mechanism
for automatic loading/unloading of disks, are used for storing large volumes of data.
Optical Disks
High seek times of about 100 msec (the optical read head is heavier and slower).
Higher latency (3000 RPM) and lower data-transfer rates (3-6 MB/s) compared to
magnetic disks
DVD-10 and DVD-18 are double sided formats with capacities of 9.4 GB & 17 GB
Tape storage
Non-volatile, used primarily for backup (to recover from disk failure), and for archival data
Tape can be removed from the drive; storage costs are much cheaper than disk, but drives are
expensive.
Capacities range from hundreds of terabytes (1 terabyte = 10^12 bytes) to even a petabyte
(1 petabyte = 10^15 bytes).
Magnetic Tapes
Few GB for DAT (Digital Audio Tape) format, 10-40 GB with DLT (Digital Linear Tape)
format, 100 GB+ with Ultrium format, and 330 GB with Ampex helical scan format
Very slow access time in comparison to magnetic disks and optical disks
Used mainly for backup, for storage of infrequently used information, and as an offline
medium for transferring information from one system to another.
Tape jukeboxes are used for very large capacity storage (terabytes (10^12 bytes) to petabytes
(10^15 bytes)).
Surface of platter divided into circular tracks. Over 17,000 tracks per platter on typical hard
disks.
Each track is divided into sectors. A sector is the smallest unit of data that can be read or
written.
Sector size typically 512 bytes. Typical sectors per track: 200 (on inner tracks) to 400 (on
outer tracks).
Read-write head is positioned very close to the platter surface (almost touching it). Reads
or writes magnetically encoded information.
To read/write a sector, the disk arm swings to position the head on the right track; the platter
spins continually, and data is read/written as the sector passes under the head.
Access Time - the time it takes from when a read or write request is issued to when the data
transfer begins.
Seek Time - the time it takes to reposition the arm over the correct track. The average seek time
is 1/2 the worst-case seek time: 4 to 10 milliseconds on typical disks.
Rotational latency - the time it takes for the sector to be accessed to appear under the head.
The average latency is 1/2 of the worst-case latency: 4 to 11 milliseconds on typical disks
(5400 to 15000 r.p.m.).
Data-Transfer Rate - the rate at which data can be retrieved from or stored to the disk: 4
to 8 MB per second is typical. Multiple disks may share a controller, so the rate the controller
can handle also matters.
Mean Time To Failure (MTTF) - the average time the disk is expected to run
continuously without any failure; typically 3 to 5 years. The probability of failure of new disks
is quite low, corresponding to a high theoretical MTTF.
RAID provides high reliability by storing data redundantly, so that data can be recovered even
if a disk fails.
The chance that some disk out of a set of N disks will fail is much higher than the chance that a
specific single disk will fail.
E.g., a system with 100 disks, each with MTTF of 100,000 hours (approx. 11 years), will
have a system MTTF of 1000 hours (approx. 41 days)
o Techniques for using redundancy to avoid data loss are critical with large
numbers of disks
RAID was originally a cost-effective alternative to large, expensive disks.
A request for a long sequence of blocks can utilize all disks in parallel.
RAID Levels
o Schemes to provide redundancy at lower cost by using disk striping combined with parity bits.
Different RAID organizations, or RAID levels, have differing cost, performance and reliability
characteristics
RAID Level 0: Block striping; non-redundant.
o Used in high-performance applications where data lost is not critical.
RAID Level 1: Mirrored disks with block striping.
o Offers best write performance.
o Popular for applications such as storing log files in a database system.
RAID Level 2: Memory-Style Error-Correcting-Codes (ECC)
RAID Level 3: Bit-Interleaved Parity
o a single parity bit is enough for error correction, not just detection, since we know which disk
has failed
When writing data, corresponding parity bits must also be computed and written to a parity bit
disk.
To recover data in a damaged disk, compute XOR of bits from other disks (including parity bit
disk).
o Faster data transfer than with a single disk, but fewer I/Os per second since every disk has to
participate in every I/O.
o Subsumes Level 2 (provides all its benefits, at lower cost).
RAID Level 4: Block-Interleaved Parity; uses block-level striping, and keeps a parity block on a
separate disk for corresponding blocks from N other disks.
o When writing data block, corresponding block of parity bits must also be computed and written
to parity disk.
o To find value of a damaged block, compute XOR of bits from corresponding blocks (including
parity block) from other disks.
o Provides higher I/O rates for independent block reads than Level 3
block read goes to a single disk, so blocks stored on different disks can be read in parallel.
o Provides higher transfer rates for reads of multiple blocks than a non-striped configuration.
o Before writing a block, parity data must be computed
Can be done by using old parity block, old value of current block and new value of current
block (2 block reads + 2 block writes).
Or by recomputing the parity value using the new values of blocks corresponding to the parity
block. More efficient for writing large amounts of data sequentially.
o Parity block becomes a bottleneck for independent block writes since every block write also
writes to parity disk.
RAID Level 5: Block-Interleaved Distributed Parity; partitions data and parity among all N + 1
disks, rather than storing data in N disks and parity in 1 disk.
o E.g., with 5 disks, parity block for nth set of blocks is stored on disk (n mod 5) + 1, with the data
blocks stored on the other 4 disks.
o Higher I/O rates than Level 4.
Block writes occur in parallel if the blocks and their parity blocks are on different disks.
o Subsumes Level 4: provides same benefits, but avoids bottleneck of parity disk.
RAID Level 6: P+Q Redundancy scheme; similar to Level 5, but stores extra redundant
information to guard against multiple disk failures.
o Better reliability than Level 5 at a higher cost; not used as widely.
Multiple write versions of optical disks are also available (CD-RW, DVD-RW, and DVD-RAM);
reads and writes are slower than with magnetic disk.
Optical Disks
CD-ROM has become a popular medium for distributing software, multimedia data, and
other electronic published information.
Capacity of CD-ROM: 500 MB. Disks are cheap to mass-produce, and so are the drives.
CD-ROM: much longer seek time (250 msec) and lower rotation speed (400 rpm), leading to
high latency and a lower data-transfer rate (about 150 KB/sec). Drives that spin at the standard
audio CD speed are available.
More recently, a new optical format, the digital video disk (DVD), has become standard. These
disks hold between 4.7 and 17 GB of data.
WORM (write-once, read many) disks are popular for archival storage of data since they
have a high capacity (about 500 MB), a longer lifetime than hard disks, and can be removed from
the drive -- good for audit trails (hard to tamper with).
o Store record i starting from byte n * (i - 1), where n is the size of each record.
o Record access is simple but records may cross blocks.
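o For example, with fixed-size records of n = 40 bytes, record 5 begins at byte 40 * (5 - 1) = 160.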
Modification: do not allow records to cross block boundaries.
Deletion of record i:
alternatives:
o move records i + 1, ..., n to i, ..., n - 1
o move record n to i
o do not move records, but link all free records on a free list
Free Lists
o Store the address of the first deleted record in the file header.
o Use this first record to store the address of the second deleted record, and so on.
o Can think of these stored addresses as pointers since they point to the location of a record.
o More space efficient representation: reuse space for normal attributes of free records to store
pointers. (No pointers stored in in-use records.)
Storage of multiple record types in a file. Record types that allow variable lengths for one or
more fields.
Record types that allow repeating fields (used in some older data models).
o Byte string representation
Attach an end-of-record control character to the end of each record.
Difficulty with deletion.
Difficulty with growth.
Variable-Length Records: Slotted Page Structure
Hashing - a hash function is computed on some attribute of each record; the result specifies in
which block of the file the record should be placed.
Records of each relation may be stored in a separate file. In a clustering file organization records
of several different relations can be stored in the same file
Motivation: store related records on the same block to minimize I/O
4.6.1 Sequential File Organization
Suitable for applications that require sequential processing of the entire file
The records in the file are ordered by a search-key
Deletion - use pointer chains.
Insertion - locate the position where the record is to be inserted:
if there is free space, insert there;
if there is no free space, insert the record in an overflow block. In either case, the pointer
chain must be updated.
Need to reorganize the file from time to time to restore sequential order
INDEXING
Indexing mechanisms used to speed up access to desired data.
o E.g., author catalog in library
Search Key - attribute or set of attributes used to look up records in a file.
An index file consists of records (called index entries) of the form (search-key, pointer).
Index files are typically much smaller than the original file.
Multilevel Index
If primary index does not fit in memory, access becomes expensive.
To reduce number of disk accesses to index records, treat primary index kept on disk as
a sequential file and construct a sparse index on it.
o outer index - a sparse index of the primary index
o inner index - the primary index file
If even the outer index is too large to fit in main memory, yet another level of index can
be created, and so on.
Indices at all levels must be updated on insertion or deletion from the file.
Index Update: Deletion
o Sparse indices - if an entry for the search key exists in the index, it is deleted by replacing the
entry in the index with the next search-key value in the file (in search-key order). If the next
search-key value already has an index entry, the entry is deleted instead of being replaced.
Index Update: Insertion
o Single-level index insertion:
o Perform a lookup using the search-key value appearing in the record to be inserted.
o Dense indices - if the search-key value does not appear in the index, insert it.
o Sparse indices - if the index stores an entry for each block of the file, no change needs to be
made to the index unless a new block is created. In this case, the first search-key value appearing
in the new block is inserted into the index.
o Multilevel insertion (as well as deletion) algorithms are simple extensions of the single-level algorithms.
Secondary Indices
o Frequently, one wants to find all the records whose values in a certain field (which is not
the search-key of the primary index) satisfy some condition.
o Example 1: In the account database stored sequentially by account number, we may want
to find all accounts in a particular branch.
o Example 2: as above, but where we want to find all accounts with a specified balance or
range of balances
o We can have a secondary index with an index record for each search-key value; the index
record points to a bucket that contains pointers to all the actual records with that particular
search-key value.
Secondary Index on balance field of account
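As a sketch (index syntax varies slightly across systems; the table and column names follow the
examples above), the secondary index of Example 2 could be created with:
create index balance_index on account (balance);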
For i = 1, 2, ..., n-1, pointer Pi either points to a file record with search-key value Ki, or to a
bucket of pointers to file records, each record having search-key value Ki. The bucket structure
is only needed if the search-key does not form a primary key.
If Li, Lj are leaf nodes and i < j, Li's search-key values are less than Lj's search-key values.
Pn points to next leaf node in search-key order.
Example of a B+-tree
o If the node reached by following the pointer above is not a leaf node, repeat the above procedure
on the node, and follow the corresponding pointer.
o Eventually reach a leaf node. If for some i, key Ki = k follow pointer Pi to the desired record or
bucket. Else no record with search-key value k exists.
In processing a query, a path is traversed in the tree from the root to some leaf node.
If there are K search-key values in the file, the path is no longer than log⌈n/2⌉(K).
A node is generally the same size as a disk block, typically 4 kilobytes, and n is typically around
100 (40 bytes per index entry).
With 1 million search key values and n = 100, at most log50(1,000,000) = 4 nodes are accessed
in a lookup.
Contrast this with a balanced binary tree with 1 million search key values: around 20 nodes
are accessed in a lookup.
o The above difference is significant since every node access may need a disk I/O, costing around
20 milliseconds!
Updates on B+-Trees: Insertion
o Find the leaf node in which the search-key value would appear
o If the search-key value is already there in the leaf node, record is added to file and if necessary a
pointer is inserted into the bucket.
o If the search-key value is not there, then add the record to the main file and create a bucket if
necessary. Then:
If there is room in the leaf node, insert (key-value, pointer) pair in the leaf node
Otherwise, split the node (along with the new (key-value, pointer) entry) as discussed
below.
o Splitting a node:
take the n (search-key value, pointer) pairs (including the one being inserted) in sorted order.
Place the first ⌈n/2⌉ in the original node, and the rest in a new node.
let the new node be p, and let k be the least key value in p. Insert (k,p) in the parent of the node
being split. If the parent is full, split it and propagate the split further up.
The splitting of nodes proceeds upwards till a node that is not full is found. In the worst case the
root node may be split increasing the height of the tree by 1.
Delete the pair (Ki-1, Pi), where Pi is the pointer to the deleted node, from its parent,
recursively using the above procedure.
Otherwise, if the node has too few entries due to the removal, and the entries in the node and a
sibling fit into a single node,
then
Redistribute the pointers between the node and a sibling such that both have more than the
minimum number of entries.
Update the corresponding search-key value in the parent of the node.
The node deletions may cascade upwards till a node which has ⌈n/2⌉ or more pointers is found.
If the root node has only one pointer after deletion, it is deleted and the sole child becomes the
root.
Examples of B+-Tree Deletion
HASHING
In hashing, a hash function is computed on some attribute of each record; the result specifies
in which block of the file the record should be placed.
Static Hashing
A bucket is a unit of storage containing one or more records (a bucket is typically a disk
block). In a hash file organization we obtain the bucket of a record directly from its search-key
value using a hash function. Hash function h is a function from the set of all search-key values K
to the set of all bucket addresses B. Hash function is used to locate records for access, insertion as
well as deletion. Records with different search-key values may be mapped to the same bucket; thus
entire bucket has to be searched sequentially to locate a record.
Example of Hash File Organization
Hash file organization of account file, using branch-name as key
o There are 10 buckets.
o The binary representation of the ith character is assumed to be the integer i.
o The hash function returns the sum of the binary representations of the characters
modulo 10.
o E.g. h(Perryridge) = 5; h(Round Hill) = 3; h(Brighton) = 3.
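o As a worked check (treating a = 1, ..., z = 26 and ignoring case and spaces, which is an assumed
reading of the convention above): for Perryridge the character values sum to
16+5+18+18+25+18+9+4+7+5 = 125, and 125 mod 10 = 5.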
Hash Functions
o The worst hash function maps all search-key values to the same bucket; this makes access
time proportional to the number of search-key values in the file.
o An ideal hash function is uniform, i.e., each bucket is assigned the same number of
search-key values from the set of all possible values.
o Ideal hash function is random, so each bucket will have the same number of records
assigned to it irrespective of the actual distribution of search-key values in the file.
o Typical hash functions perform computation on the internal binary representation
of the search-key.
o For example, for a string search-key, the binary representations of all the characters in the
string could be added and the sum modulo the number of buckets could be returned.
Handling of Bucket Overflows
o Bucket overflow can occur because of
Insufficient buckets
Skew in distribution of records.
This can occur due to two reasons:
multiple records have same search-key value
chosen hash function produces non-uniform distribution of key values
o Although the probability of bucket overflow can be reduced, it cannot be eliminated; it is
handled by using overflow buckets.
o Overflow chaining - the overflow buckets of a given bucket are chained together in a
linked list.
o The Above scheme is called closed hashing.
o An alternative, called open hashing, which does not use overflow buckets, is not suitable
for database applications.
Hash Indices
o Hashing can be used not only for file organization, but also for index-structure creation.
o A hash index organizes the search keys, with their associated record pointers, into a hash
file structure.
o Strictly speaking, hash indices are always secondary indices
o If the file itself is organized using hashing, a separate primary hash index on it using the
same search-key is unnecessary.
o However, we use the term hash index to refer to both secondary index structures and hash
organized files.
Example of Hash Index
o If i > ij (more than one pointer to bucket j):
o allocate a new bucket z and set ij = iz = ij + 1.
o make the second half of the bucket address table entries pointing to j point to z.
o remove and reinsert each record in bucket j.
o recompute the new bucket for Kj and insert the record in that bucket (further splitting is
required if the bucket is still full).
o If i = ij (only one pointer to bucket j)
o increment i and double the size of the bucket address table.
o replace each entry in the table by two entries that point to the same bucket.
o recompute new bucket address table entry for Kj
Now i > ij so use the first case above.
o When inserting a value, if the bucket is full after several splits (that is, i reaches some
limit), create an overflow bucket instead of splitting the bucket address table further.
o Note: decreasing the bucket address table size is an expensive operation and should be done
only if the number of buckets becomes much smaller than the size of the table.
Use of Extendable Hash Structure:
Example
Linear hashing is an alternative mechanism which avoids these disadvantages at the possible
cost of more bucket overflows.
4.8 INTRODUCTION TO DISTRIBUTED DATABASES AND CLIENT/SERVER
TECHNOLOGY
DISTRIBUTED DATABASE SYSTEM
A distributed database system consists of loosely coupled sites that share no physical
component; the database systems that run on each site are independent of each other.
In a homogeneous distributed database, the sites are aware of each other and agree to cooperate
in processing user requests, and each site surrenders part of its autonomy in terms of the right
to change schemas or software.
In a heterogeneous distributed database, sites may not be aware of each other and may provide
only limited facilities for cooperation in transaction processing.
Disadvantages of Replication
Increased cost of updates: each replica of relation r must be updated.
Increased complexity of concurrency control: concurrent updates to distinct replicas may
lead to inconsistent data unless special concurrency control mechanisms are implemented.
One solution: choose one copy as primary copy and apply concurrency control operations on
primary copy.
Data Fragmentation
Division of relation r into fragments r1, r2, ..., rn which contain sufficient
information to reconstruct relation r.
Horizontal fragmentation : each tuple of r is assigned to one or more fragments
Vertical fragmentation : the schema for relation r is split into several smaller schemas
All schemas must contain a common candidate key (or superkey) to ensure lossless join property.
A special attribute, the tuple-id attribute may be added to each schema to serve as a candidate key.
Example : relation account with following schema
Account = (account_number, branch_name , balance )
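As a sketch of the two kinds of fragments (the branch name and the create-table-as form are
assumptions here, and the exact syntax varies by system):
-- horizontal fragment: only the tuples of one branch
create table account_hillside as
    select * from account where branch_name = 'Hillside';
-- vertical fragment: a subset of the columns, keyed by account_number
create table account_balances as
    select account_number, balance from account;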
Data transparency: the degree to which a system user may remain unaware of the details of
how and where the data items are stored in a distributed system.
Consider transparency issues in relation to:
Fragmentation transparency
Replication transparency
Location transparency
Naming of data items: criteria
Every data item must have a system-wide unique name.
It should be possible to find the location of data items efficiently.
It should be possible to change the location of data items transparently.
Each site should be able to create new data items autonomously.
A client/server database system has three main components:
Database server
Client application
Network
Database Server
A server (or "back end") manages resources, such as a database, efficiently and optimally among
various clients that simultaneously request the server for the same resources. A database server
mainly concentrates on the following tasks.
Managing a single database of information among many concurrent users.
Controlling database access and other security requirements.
Protecting database of information with backup and recovery features.
Centrally enforcing global data integrity rules across all client applications.
Client Application
A client application (the "front end") is the part of the system that users employ to interact with
data. The client application in a client/server model focuses on the following jobs: presenting an
interface between the user and the resource to complete the job, managing presentation logic,
performing application logic, validating data entry, and managing the request traffic of receiving
and sending information from the database server.
Network
The third component of a client/server system is the network. The communication software is
the vehicle that transmits data between the clients and the server. Both the client and the server
run communication software that allows them to talk across the network.
Three-Tier Technology
Client Server technology is also called 3-tier technology as illustrated in Figure below.
Client/server is an important idea in a network, however, it can be used by programs within a
single computer. In a network, the client/ server model provides a convenient way to interconnect
programs that are distributed efficiently across different locations. Computer transactions using the
client/ server model are very common. For example, to check your bank account from your
computer, a client program in your computer forwards your request to a server program at the
bank. That program may in turn forward the request to its own client program that sends a request
to a database server at another bank computer to retrieve your account balance.
This detailed organization of the data allows for advanced and complex query generation while
providing outstanding performance in certain cases when compared to traditional relational
structures and databases. This type of database is usually structured in an order that optimizes
OLAP and data warehouse applications.
Queries expressed in a high-level language such as SQL make parallelization easier.
Different queries can be run in parallel with each other. Concurrency control takes care of
conflicts. Thus, databases naturally lend themselves to parallelism.
Reduce the time required to retrieve relations from disk by partitioning the relations on multiple
disks.
Horizontal partitioning - the tuples of a relation are divided among many disks such that each
tuple resides on one disk.
Partitioning techniques (number of disks = n):
Round-robin:
Send the ith tuple inserted in the relation to disk i mod n.
Hash partitioning:
Choose one or more attributes as the partitioning attributes.
Choose a hash function h with range 0 ... n - 1.
Let i denote result of hash function h applied to the partitioning attribute value of a tuple.
Send tuple to disk i.
Range partitioning:
Choose an attribute as the partitioning attribute.
A partitioning vector [vo, v1, ..., vn-2] is chosen.
Let v be the partitioning attribute value of a tuple. Tuples such that vi <= v < vi+1 go to disk
i + 1. Tuples with v < v0 go to disk 0 and tuples with v >= vn-2 go to disk n - 1.
E.g., with a partitioning vector [5,11], a tuple with partitioning attribute value of 2 will
go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to
disk 2.
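Many SQL systems expose range partitioning directly. A sketch in MySQL-style syntax (the
table definition and boundary values are illustrative, mirroring the vector [5,11] above):
CREATE TABLE account (
    account_number INT,
    balance INT
)
PARTITION BY RANGE (balance) (
    PARTITION p0 VALUES LESS THAN (5),
    PARTITION p1 VALUES LESS THAN (11),
    PARTITION p2 VALUES LESS THAN MAXVALUE
);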
INTERQUERY PARALLELISM
Queries/transactions execute in parallel with one another.
Increases transaction throughput; used primarily to scale up a transaction processing system
to support a larger number of transactions per second.
Easiest form of parallelism to support, particularly in a shared-memory parallel database,
because even sequential database systems support concurrent processing.
More complicated to implement on shared-disk or shared-nothing architectures
Locking and logging must be coordinated by passing messages between processors.
Data in a local buffer may have been updated at another processor.
Cache coherency has to be maintained - reads and writes of data in the buffer must find the
latest version of the data.
INTRAQUERY PARALLELISM
Execution of a single query in parallel on multiple processors/disks; important for
speeding up long-running queries.
Two complementary forms of intraquery parallelism :
Intraoperation parallelism - parallelize the execution of each individual operation in the query.
Interoperation parallelism - execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed
by each operation is typically larger than the number of operations in a query.
Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of
the frequency spectrum, which may cause interference with other appliances,
such as cordless telephones.
Modern wireless networks can transfer data in units called packets, as used in wired networks,
in order to conserve bandwidth.
Client/Network Relationships
Mobile units can move freely in a geographic mobility domain, an area that
is circumscribed by wireless network coverage.
To manage mobility, the entire mobility domain is divided into one or more smaller
domains, called cells, each of which is supported by at least one base station.
Mobile units can move unrestricted throughout the cells of a domain while retaining access
to the network.
In a MANET, mobile units are responsible for routing their own data,
effectively acting as base stations as well as clients.
In either case, neither client nor server can reach the other, and
modifications must be made to the architecture in order to compensate
for this case.
One way servers relieve this problem is by broadcasting data whenever possible.
A server can simply broadcast data periodically.
Broadcast also reduces the load on the server, as clients do not have to
maintain active connections to it.
Client data should be stored in the network location that minimizes the traffic
necessary to access it.
The act of moving between cells must be transparent to the client.
The server must be able to gracefully divert the shipment of data from one
base to another, without the client noticing.
Client mobility also allows new applications that are location-based.
WEB DATABASES
A web database is a system for storing information that can then be accessed via a website.
For example, an online community may have a database that stores the username, password, and
other details of all its members. The most commonly used database system for the internet is
MySQL, due to its integration with PHP, one of the most widely used server-side programming
languages.
At its most simple level, a web database is a set of one or more tables that contain data.
Each table has different fields for storing information of various types. These tables can then be
linked together in order to manipulate data in useful or interesting ways. In many cases, a table
will use a primary key, which must be unique for each entry and allows for unambiguous selection
of data.
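A minimal sketch of such a members table in MySQL-style SQL (all names and types here are
assumptions for illustration):
CREATE TABLE members (
    member_id INT AUTO_INCREMENT PRIMARY KEY,
    username VARCHAR(30) NOT NULL UNIQUE,
    password_hash CHAR(60) NOT NULL,
    joined_on DATE
);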
A web database can be used for a range of different purposes. Each field in a table has to
have a defined data type. For example, numbers, strings, and dates can all be inserted into a web
database. Proper database design involves choosing the correct data type for each field in order to
reduce memory consumption and increase the speed of access. Although for small databases this
often isn't so important, big web databases can grow to millions of entries and need to be well
designed to work effectively.
SPATIAL AND MULTIMEDIA DATABASES
SPATIAL DATABASE
Types of Spatial Data
Point Data
Points in a multidimensional space
E.g., Raster data such as satellite imagery, where
each pixel stores a measured value
E.g., Feature vectors extracted from text
Region Data
Objects have spatial extent with location and boundary.
DB typically uses geometric approximations constructed using line
segments, polygons, etc., called vector data.
Types of Spatial Queries
Spatial Range Queries
Find all cities within 50 miles of Madison
Query has associated region (location, boundary)
Answer includes overlapping or contained data regions
Nearest-Neighbor Queries
Find the 10 cities nearest to Madison
Results must be ordered by proximity
Single-Dimensional Indexes
B+ trees are fundamentally single-dimensional indexes.
When we create a composite search key B+ tree, e.g., an index on <age, sal>, we
effectively linearize the 2-dimensional space since we sort entries first by age and then by
sal.
Multi-dimensional Indexes
A multidimensional index clusters entries so as to exploit nearness in multidimensional
space.
Keeping track of entries and maintaining a balanced index structure presents a
challenge.
Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD).
Find all hotels within a radius of 5 miles from the conference venue.
Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
Find all cities that lie on the Nile in Egypt.
Find all parts that touch the fuselage (in a plane design).
MULTIMEDIA DATABASES
To provide such database functions as indexing and consistency, it is desirable to store
multimedia data in a database
Rather than storing them outside the database, in a file system
The database must handle large object representation.
Similarity-based retrieval must be provided by special index structures.
Must provide guaranteed steady retrieval rates for continuous-media data.
Multimedia Data Formats
Store and transmit multimedia data in compressed form
JPEG and GIF the most widely used formats for image data.
The MPEG standards for video data use commonalities among a sequence of
frames to achieve a greater degree of compression.
MPEG-1 quality comparable to VHS video tape.
Stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB
MPEG-2 designed for digital broadcast systems and digital video disks; negligible
loss of video quality.
Compresses 1 minute of audio-video to approximately 17 MB.
Several alternatives exist for audio encoding:
MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.
Continuous-Media Data
Most important types are video and audio data.
Characterized by high data volumes and real-time information-delivery requirements.
Data must be delivered sufficiently fast that there are no gaps in the audio or
video.
Data must be delivered at a rate that does not cause overflow of system buffers.
Synchronization among distinct data streams must be maintained
video of a person speaking must show lips moving synchronously
with the audio
Video Servers
Pictorial data:
WAREHOUSE SCHEMA
Typically warehouse data is multidimensional, with very large fact tables.
Examples of dimensions: item-id, date/time of sale, store where the sale was made,
customer identifier.
Examples of measures: number of items sold, price of items.
Dimension values are usually encoded using small integers and mapped to full
values via dimension tables.
The resultant schema is called a star schema.
More complicated schema structures:
Snowflake schema: multiple levels of dimension tables.
Constellation: multiple fact tables.
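A minimal star-schema sketch in SQL (all table and column names are hypothetical), showing a
fact table whose small-integer keys are mapped to full values via dimension tables:

    -- Dimension tables: small integer keys mapped to full values.
    CREATE TABLE item_dim  (item_id  INT PRIMARY KEY, item_name  VARCHAR(50));
    CREATE TABLE store_dim (store_id INT PRIMARY KEY, store_city VARCHAR(50));
    CREATE TABLE date_dim  (date_id  INT PRIMARY KEY, sale_date  DATE);

    -- Fact table: one (typically very large) set of rows, measures plus dimension keys.
    CREATE TABLE sales_fact (
        item_id    INT REFERENCES item_dim(item_id),
        store_id   INT REFERENCES store_dim(store_id),
        date_id    INT REFERENCES date_dim(date_id),
        items_sold INT,             -- measure
        price      DECIMAL(10,2)    -- measure
    );

A snowflake schema would further normalize the dimension tables (e.g. store_dim referencing a
separate city table), and a constellation would add more fact tables sharing these dimensions.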
Data Mining
Broadly speaking, data mining is the process of semi-automatically analyzing
large databases to find useful patterns.
Like knowledge discovery in artificial intelligence, data mining discovers statistical
rules and patterns.
It differs from machine learning in that it deals with large volumes of data stored
primarily on disk.
Some types of knowledge discovered from a database can be represented by a set
of rules, e.g.: young women with annual incomes greater than $50,000 are the most
likely buyers of sports cars.
Other types of knowledge are represented by equations or by prediction functions.
Some manual intervention is usually required:
pre-processing of data, choice of which type of pattern to find, and
post-processing to find novel patterns.
Applications of Data Mining
Prediction based on past history
Predict if a credit card applicant poses a good credit risk, based on
some attributes (income, job type, age, ..) and past history
Predict if a customer is likely to switch brand loyalty
Predict if a customer is likely to respond to junk mail
Associations
Find books that are often bought by the same customers. If a new customer
buys one such book, suggest that he buys the others too.
Other similar applications: camera accessories, clothes, etc.
Associations may also be used as a first step in detecting causation,
e.g. an association between exposure to chemical X and cancer, or between a new medicine
and cardiac problems.
Clusters
E.g. typhoid cases were clustered in an area surrounding a contaminated well
Detection of clusters remains important in detecting epidemics
DATA MART
A data mart is the access layer of the data warehouse environment that is used to get data out to
the users. The data mart is a subset of the data warehouse that is usually oriented to a specific
business line or team. Data marts are small slices of the data warehouse. Whereas data
warehouses have an enterprise-wide depth, the information in data marts pertains to a single
department. In some deployments, each department or business unit is considered the owner of
its data mart, including all the hardware, software and data. This enables each department to use,
manipulate and develop its data any way it sees fit, without altering information inside other
data marts or the data warehouse. In other deployments where conformed dimensions are used,
this business unit ownership will not hold true for shared dimensions like customer, product, etc.
Organizations build data warehouses and data marts because the information in their operational
databases is not organized in a way that makes it easy to find what they need. Also, complicated
queries might take a long time to answer, since the operational database systems are designed to
process millions of transactions per day.
Transactional databases are designed to be updated; data warehouses or marts, however, are
read-only. Data warehouses are designed to access large groups of related records.
Data marts improve end-user response time by allowing users to access the specific type
of data they need to view most often, providing the data in a way that supports the collective
view of a group of users.
A data mart is basically a condensed and more focused version of a data warehouse that reflects
the regulations and process specifications of each business unit within an organization. Each data
mart is dedicated to a specific business function or region. This subset of data may span across
many or all of an enterprise's functional subject areas. It is common for multiple data marts to be
used in order to serve the needs of each individual business unit (different data marts can be used
to obtain specific information for various enterprise departments, such as accounting, marketing,
sales, etc.).
UNIT V
ADVANCED TOPICS
DATABASE SECURITY: Data Classification - Threats and Risks - Database Access Control -
Types of Privileges - Cryptography - Statistical Databases - Distributed Databases - Architecture -
Transaction Processing - Data Warehousing and Mining - Classification - Association Rules -
Clustering - Information Retrieval - Relevance Ranking - Crawling and Indexing the Web -
Object-Oriented Databases - XML Databases.
5.1 DATABASE SECURITY: Data Classification
Database security concerns the use of a broad range of information security controls to
protect databases (potentially including the data, the database applications or stored functions,
the database systems, the database servers and the associated network links) against
compromises of their confidentiality, integrity and availability. It involves various types or
categories of controls, such as technical, procedural/administrative and physical. Database
security is a specialist topic within the broader realms of computer security, information security
and risk management.
Security risks to database systems include, for example:
Physical damage caused by lightning, accidental liquid spills, static discharge, or
electronic equipment failures;
Design flaws and programming bugs in databases and the associated programs and
systems, creating various security vulnerabilities (e.g. unauthorized privilege escalation),
data loss/corruption, performance degradation, etc.;
Data corruption and/or loss caused by the entry of invalid data or commands, mistakes in
database or system administration processes, sabotage/criminal damage, etc.
Types of Security
Policy issues
System-related issues
and the employee takes a backup of sensitive data to work on from his home. This not only
violates the security policies of the organization, but also may result in data security breach if the
system at home is compromised.
2. Operating System vulnerabilities: Vulnerabilities in underlying operating systems
like Windows, UNIX, Linux, etc., and the services that are related to the databases could lead to
unauthorized access. This may lead to a Denial of Service (DoS) attack. This could be prevented
by updating the operating system related security patches as and when they become available.
3. Database rootkits: A database rootkit is a program or a procedure that is hidden inside
the database and that provides administrator-level privileges to gain access to the data in the
database. These rootkits may even turn off alerts triggered by Intrusion Prevention Systems
(IPS). It is possible to install a rootkit only after compromising the underlying operating system.
This can be avoided by periodic audits; otherwise the presence of the database rootkit may go
undetected.
4. Weak authentication: Weak authentication models allow attackers to employ
strategies such as social engineering and brute force to obtain database login credentials and
assume the identity of legitimate database users.
5. Weak audit trails: A weak audit logging mechanism in a database server represents a
critical risk to an organization, especially in retail, financial, healthcare, and other industries with
stringent regulatory compliance. Regulations such as PCI, SOX, and HIPAA demand extensive
logging of actions to reproduce an event at a later point in time in case of an incident. Logging of
sensitive or unusual transactions happening in a database must be done in an automated manner
for resolving incidents. Audit trails act as the last line of database defense. Audit trails can detect
the existence of a violation that could help trace back the violation to a particular point of time
and a particular user.
5.3 Database Access Control
To protect databases against these types of threats four kinds of countermeasures can be
implemented: access control, inference control, flow control, and encryption.
The security mechanism of a DBMS must include provisions for restricting access to the
database as a whole; this function is called access control and is handled by creating user
accounts and passwords to control the login process by the DBMS.
Discretionary Access Control Based on Granting and Revoking Privileges
The typical method of enforcing discretionary access control in a database system is
based on granting and revoking privileges.
5.3.1 Types of Discretionary Privileges
The account level: At this level, the DBA specifies the particular privileges that each
account holds independently of the relations in the database.
The relation (or table) level: At this level, the DBA can control the privilege to access
each individual relation or view in the database.
The privileges at the account level apply to the capabilities provided to the account itself
and can include the CREATE SCHEMA or CREATE TABLE privilege, to create a
schema or base relation; the CREATE VIEW privilege; the ALTER privilege, to apply
schema changes such as adding or removing attributes from relations; the DROP privilege,
to delete relations or views; the MODIFY privilege, to insert, delete, or update tuples;
and the SELECT privilege, to retrieve information from the database by using a SELECT
query.
The second level of privileges applies to the relation level, whether the relations are base
relations or virtual (view) relations. The granting and revoking of privileges generally follow an
authorization model for discretionary privileges known as the access matrix model, where the
rows of a matrix M represent subjects (users, accounts, programs) and the columns represent
objects (relations, records, columns, views, operations). Each position M(i,j) in the matrix
represents the types of privileges (read, write, update) that subject i holds on object j.
To control the granting and revoking of relation privileges, each relation R in a
database is assigned an owner account, which is typically the account that was used when the
relation was created in the first place. The owner of a relation is given all privileges on that
relation. In SQL2, the DBA can assign an owner to a whole schema by creating the schema and
associating the appropriate authorization identifier with that schema, using the CREATE
SCHEMA command. The owner account holder can pass privileges on any of the owned
relations to other users by granting privileges to their accounts.
In SQL the following types of privileges can be granted on each individual relation R:
SELECT (retrieval or read) privilege on R: Gives the account retrieval privilege. In SQL
this gives the account the privilege to use the SELECT statement to retrieve tuples from
R.
MODIFY privileges on R: This gives the account the capability to modify tuples of R. In
SQL this privilege is further divided into UPDATE, DELETE, and INSERT privileges to
apply the corresponding SQL command to R. In addition, both the INSERT and
UPDATE privileges can specify that only certain attributes can be updated by the
account.
REFERENCES privilege on R: This gives the account the capability to reference relation
R when specifying integrity constraints. The privilege can also be restricted to specific
attributes of R.
Notice that to create a view, the account must have SELECT privilege on all relations involved
in the view definition.
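As a brief sketch of how such privileges are granted in SQL (the relation EMPLOYEE and the
accounts A1 and A2 are hypothetical, and the account-level privilege name varies by system):

    -- Hypothetical account-level privilege (the name is system-specific).
    GRANT CREATETAB TO A1;

    -- Relation-level privileges on EMPLOYEE.
    GRANT SELECT ON EMPLOYEE TO A2;
    GRANT INSERT, DELETE ON EMPLOYEE TO A2;
    GRANT UPDATE (Salary) ON EMPLOYEE TO A2;           -- UPDATE limited to one attribute
    GRANT REFERENCES (Dno) ON EMPLOYEE TO A2;          -- referencing limited to Dno
    GRANT SELECT ON EMPLOYEE TO A2 WITH GRANT OPTION;  -- A2 may propagate the privilege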
5.3.2 Specifying Privileges Using Views
The mechanism of views is an important discretionary authorization mechanism in its
own right. For example, if the owner A of a relation R wants another account B to be able to
retrieve only some fields of R, then A can create a view V of R that includes only those
attributes and then grant SELECT on V to B. The same applies to limiting B to retrieving only
certain tuples of R; a view V can be created by defining the view by means of a query that
selects only those tuples from R that A wants to allow B to access.
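A small sketch of this idea, assuming account A owns a hypothetical EMPLOYEE relation and
wants account B to see only the Name and Dept attributes of department 5:

    -- A creates a view exposing only selected attributes and tuples...
    CREATE VIEW EMP_DEPT5 AS
        SELECT Name, Dept
        FROM   EMPLOYEE
        WHERE  Dept = 5;

    -- ...and grants retrieval on the view, not on the base relation.
    GRANT SELECT ON EMP_DEPT5 TO B;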
5.3.3 Revoking Privileges
In some cases it is desirable to grant a privilege to a user temporarily. For example, the
owner of a relation may want to grant the SELECT privilege to a user for a specific task and then
revoke that privilege once the task is completed. Hence, a mechanism for revoking privileges is
needed. In SQL, a REVOKE command is included for the purpose of canceling privileges.
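Continuing the hypothetical view example above, the temporary privilege can be canceled as
follows:

    REVOKE SELECT ON EMP_DEPT5 FROM B;
    -- Many systems also accept CASCADE here, which additionally revokes any
    -- privileges that B had propagated onward to other accounts.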
5.3.4 Mandatory Access Control and Role-Based Access Control for Multilevel Security
The discretionary access control techniques of granting and revoking privileges on
relations have traditionally been the main security mechanism for relational database systems.
This is an all-or-nothing method: a user either has or does not have a certain privilege. In many
applications, an additional security policy is needed that classifies data and users based on
security classes. This approach, known as mandatory access control, would typically be
combined with the discretionary access control mechanisms.
Typical security classes are top secret (TS), secret (S), confidential (C), and
unclassified (U), where TS is the highest level and U the lowest: TS ≥ S ≥ C ≥ U. The commonly
used model for multilevel security, known as the Bell-LaPadula model, classifies each subject
(user, account, program) and object (relation, tuple, column, view, operation) into one of the
security classifications TS, S, C, or U. We refer to the clearance (classification) of a subject S as
class(S) and to the classification of an object O as class(O).
Two restrictions are enforced on data access based on the subject/object classifications:
1. A subject S is not allowed read access to an object O unless class(S) ≥ class(O). This is
known as the simple security property.
2. A subject S is not allowed to write an object O unless class(S) ≤ class(O). This is known
as the star property (or *-property).
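For example, under these two rules a subject with clearance S may read objects classified S, C,
or U but not TS (no read up), and may write objects classified S or TS but not C or U (no write
down); together the rules prevent information from flowing from higher to lower classifications.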
The value of the TC attribute in each tuple t, which is the highest of all attribute
classification values within t, provides a general classification for the tuple itself, whereas each
Ci provides a finer security classification for each attribute value within the tuple.
The apparent key of a multilevel relation is the set of attributes that would have
formed the primary key in a regular (single-level) relation.
A multilevel relation will appear to contain different data to subjects (users) with
different clearance levels. In some cases, it is possible to store a single tuple in the relation at a
higher classification level and produce the corresponding tuples at a lower-level classification
through a process known as filtering.
In other cases, it is necessary to store two or more tuples at different classification
levels with the same value for the apparent key. This leads to the concept of polyinstantiation
where several tuples can have the same apparent key value but have different attribute values for
users at different classification levels.
In general, the entity integrity rule for multilevel relations states that all attributes that
are members of the apparent key must not be null and must have the same security classification
within each individual tuple.
In addition, all other attribute values in the tuple must have a security classification
greater than or equal to that of the apparent key. This constraint ensures that a user can see the
key if the user is permitted to see any part of the tuple at all.
Other integrity rules, called null integrity and interinstance integrity, informally
ensure that if a tuple value at some security level can be filtered (derived) from a
higher-classified tuple, then it is sufficient to store the higher-classified tuple in the multilevel
relation.
5.3.5 Role-Based Access Control
Role-based access control (RBAC) emerged rapidly in the 1990s as a proven
technology for managing and enforcing security in large-scale enterprise-wide systems. Its basic
notion is that permissions are associated with roles, and users are assigned to appropriate roles.
Roles can be created using the CREATE ROLE and DESTROY ROLE commands. The GRANT
and REVOKE commands discussed under DAC can then be used to assign and revoke privileges
from roles.
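A minimal sketch of the idea (the role, table, and account names are hypothetical; note that many
SQL systems spell the removal command DROP ROLE rather than DESTROY ROLE):

    CREATE ROLE payroll_clerk;

    -- Permissions are attached to the role, not to individual users.
    GRANT SELECT, UPDATE ON EMPLOYEE TO payroll_clerk;

    -- Users are then assigned to the role...
    GRANT payroll_clerk TO A3;

    -- ...and removing the role revokes all of its privileges at once.
    REVOKE payroll_clerk FROM A3;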
RBAC appears to be a viable alternative to traditional discretionary and mandatory access
controls; it ensures that only authorized users are given access to certain data or
resources.
Many DBMSs have allowed the concept of roles, where privileges can be assigned to
roles.
A role hierarchy in RBAC is a natural way of organizing roles to reflect the organization's
lines of authority and responsibility.
Another important consideration in RBAC systems is the possible temporal constraints
that may exist on roles, such as the time and duration of role activations, and the timed
triggering of a role by an activation of another role.
Using an RBAC model is a highly desirable goal for addressing the key security
requirements of Web-based applications.
In contrast, discretionary access control (DAC) and mandatory access control (MAC) models
lack the capabilities needed to support the security requirements of emerging enterprises and
Web-based applications.
5.4 Types of Privileges
Privileges: Privileges define the access rights provided to a user on a database object.
There are two types of privileges:
1) System privileges - These allow the user to CREATE, ALTER, or DROP
database objects.
2) Object privileges - These allow the user to EXECUTE, SELECT, INSERT,
UPDATE, or DELETE data from database objects to which the privileges apply.
A few CREATE system privileges are listed below:
System Privilege    Description
CREATE object       allows users to create the specified database object.
The above rules also apply for the ALTER and DROP system privileges.
A few of the object privileges are listed below:
Object Privilege    Description
INSERT              allows users to insert rows into a table.
SELECT              allows users to select data from a database object.
UPDATE              allows users to update data in a table.
EXECUTE             allows users to execute a stored procedure or a function.
Roles: Roles are a collection of privileges or access rights. When there are many users in a
database it becomes difficult to grant or revoke privileges for each user individually. Therefore,
if you define roles, you can grant or revoke privileges to roles, thereby automatically granting or
revoking privileges for all users assigned to those roles. You can either create roles or use the
system roles pre-defined by Oracle, which already have commonly needed privileges granted to
them.
5.5 Cryptography
A DBMS can use encryption to protect information in certain situations where the normal
security mechanisms of the DBMS are not adequate. For example, an intruder may steal tapes
containing some data or tap a communication line. By storing and transmitting data in an
encrypted form, the DBMS ensures that such stolen data is not intelligible to the intruder. Thus,
encryption is a technique to provide privacy of data.
Although this technique is secure, it is also computationally expensive. A hybrid scheme
used for secure communication is to exchange DES keys via a public-key encryption scheme,
and then use DES encryption on the data transmitted subsequently.
5.5.3 Disadvantages of encryption
Encryption has the following problems:
Key management (i.e. keeping keys secret) is a problem. Even in public-key encryption
the decryption key must be kept secret.
Even in a system that supports encryption, data must often be processed in plaintext form.
Thus sensitive data may still be accessible to transaction programs.
Encrypting data gives rise to serious technical problems at the level of physical storage
organization. For example, indexing over data stored in encrypted form can be very
difficult.
5.6 STATISTICAL DATABASES
Statistical databases typically contain parameter data and the measured data for these
parameters. For example, parameter data consist of the different values for varying conditions in
an experiment (e.g., temperature, time). The measured data (or variables) are the measurements
taken in the experiment under these varying conditions.
Many statistical databases are sparse, with many null or zero values. It is not uncommon
for a statistical database to be 40% to 50% sparse. There are two options for dealing with the
sparseness: (1) leave the null values in place and use compression techniques to squeeze them
out, or (2) remove the entries that have only null values.
5.7 DISTRIBUTED DATABASE
A distributed database is a database in which the storage devices are not all attached to a
common processing unit such as the CPU; it is controlled by a distributed database management
system (together sometimes called a distributed database system).
5.7.1 Architecture of DBMS
5.8 TRANSACTION PROCESSING
A transaction must satisfy the following ACID properties:
Atomicity: Though a transaction involves several low-level operations, this property
states that a transaction must be treated as an atomic unit, that is, either all of its
operations are executed or none. There must be no state in the database where the
transaction is left partially completed. States should be defined either before the execution
of the transaction or after the execution/abortion/failure of the transaction.
Consistency: This property states that after the transaction has finished, the database must
remain in a consistent state. There must not be any possibility that some data is
incorrectly affected by the execution of the transaction. If the database was in a consistent
state before the execution of the transaction, it must remain in a consistent state after the
execution of the transaction.
Durability: This property states that in any case all updates made on the database will
persist even if the system fails and restarts. If a transaction writes or updates some data in
the database and commits, that data will always be there in the database. If the transaction
commits but the data is not yet written to disk when the system fails, that data will be
updated once the system comes up.
Isolation: In a database system where more than one transaction is being executed
simultaneously and in parallel, the property of isolation states that all the transactions will
be carried out and executed as if each were the only transaction in the system. No
transaction will affect the existence of any other transaction.
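As a small sketch of atomicity (the accounts table and account numbers are hypothetical, and
transaction syntax varies slightly by DBMS), the funds transfer below either applies both
updates or, if a failure occurs before COMMIT, neither:

    BEGIN;                               -- start the transaction
    UPDATE accounts SET balance = balance - 100 WHERE account_no = 'A-101';
    UPDATE accounts SET balance = balance + 100 WHERE account_no = 'A-102';
    COMMIT;                              -- both updates become durable together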
5.9 DATA WAREHOUSING
Data sources often store only current data, not historical data. Corporate decision making
requires a unified view of all organizational data, including historical data. A data warehouse is a
repository (archive) of information gathered from multiple sources, stored under a unified
schema, at a single site.
A warehouse greatly simplifies querying and permits the study of historical trends, and it
shifts the decision support query load away from transaction processing systems.
DESIGN ISSUES
5.10 DATA MINING
Data mining is the process of semi-automatically analyzing large databases to find useful
patterns, such as prediction based on past history: predict whether a credit card applicant poses a
good credit risk, based on some attributes (income, job type, age, ...) and past history, or predict
whether a pattern of phone calling card usage is likely to be fraudulent. Some examples of
prediction mechanisms:
Classification
Given a new item whose class is unknown, predict to which
class it belongs
Regression formulae
Given a set of mappings for an unknown function, predict the
function result for a new parameter value
Descriptive Patterns
Associations
Find books that are often bought by similar customers. If a
new such customer buys one such book, suggest the others too.
Associations may be used as a first step in detecting causation,
e.g. an association between exposure to chemical X and cancer.
Clusters
E.g. typhoid cases were clustered in an area surrounding a
contaminated well. Detection of clusters remains important in
detecting epidemics.
5.11 Classification:
Rules are not necessarily exact: there may be some misclassifications. Classification rules can
be shown compactly as a decision tree.
Training set: a data sample in which the classification is already known. Decision trees are
typically built by greedy top-down generation.
Each internal node of the tree partitions the data into groups based on a partitioning attribute, and
a partitioning condition for the node
Leaf node:
all (or most) of the items at the node belong to the same class, or
all attributes have been considered, and no further partitioning is possible.
The purity of a set S of training instances can be measured quantitatively in several ways.
Notation: number of classes = k, number of instances = |S|, fraction of instances in class i = pi.
The Gini measure of purity is defined as
Gini(S) = 1 - Σ pi^2 (summed over i = 1, ..., k)
When all instances are in a single class, the Gini value is 0.
It reaches its maximum (of 1 - 1/k) if each class has the same number of instances.
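For instance, with k = 2 classes split evenly (p1 = p2 = 0.5), Gini(S) = 1 - (0.25 + 0.25) = 0.5,
which is the maximum 1 - 1/2; if all instances fall in one class (p1 = 1, p2 = 0),
Gini(S) = 1 - 1 = 0.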
DECISION TREE CONSTRUCTION ALGORITHM
Procedure GrowTree(S)
    Partition(S);

Procedure Partition(S)
    if (S is sufficiently pure or |S| is sufficiently small) then return;
    for each attribute A
        evaluate splits on attribute A;
    use the best split found (across all attributes) to partition S into S1, S2, ..., Sr;
    for i = 1, 2, ..., r
        Partition(Si);
Naive Bayesian classifiers require computation of p(d | cj) and precomputation of p(cj);
p(d) can be ignored, since it is the same for all classes. (By Bayes' theorem,
p(cj | d) = p(d | cj) p(cj) / p(d).) To simplify the task, naive Bayesian classifiers assume
attributes have independent distributions, and thereby estimate
p(d | cj) = p(d1 | cj) * p(d2 | cj) * ... * p(dn | cj)
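A tiny illustrative calculation (all numbers invented for the example): with two classes where
p(c1) = 0.7 and p(c2) = 0.3, and an instance d = (d1, d2) with p(d1 | c1) = 0.2, p(d2 | c1) = 0.5,
p(d1 | c2) = 0.4, and p(d2 | c2) = 0.25, we get p(d | c1) p(c1) = 0.2 * 0.5 * 0.7 = 0.07 against
p(d | c2) p(c2) = 0.4 * 0.25 * 0.3 = 0.03, so the classifier predicts class c1.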
REGRESSION
Regression deals with the prediction of a value, rather than a class. Given values for a set of
variables X1, X2, ..., Xn, we wish to predict the value of a variable Y.
One way is to infer coefficients a0, a1, ..., an such that
Y = a0 + a1*X1 + a2*X2 + ... + an*Xn
Finding such a linear polynomial is called linear regression. In general, the process of finding a
curve that fits the data is also called curve fitting.
The fit may only be approximate
-because of noise in the data, or
-because the relationship is not exactly a polynomial
Regression aims to find coefficients that give the best possible fit.
Basic association rules have several limitations. Deviations from the expected probability are
often more interesting:
e.g. if many people purchase bread, and many people purchase cereal, quite a few
would be expected to purchase both. We are interested in positive as well as
negative correlations between sets of items.
Positive correlation: co-occurrence is higher than predicted.
Negative correlation: co-occurrence is lower than predicted.
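As an illustrative calculation (numbers invented): if 60% of customers buy bread and 50% buy
cereal, independence would predict that 0.6 * 0.5 = 30% buy both; an observed rate of 45%
indicates a positive correlation, while an observed rate of 15% would indicate a negative one.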
Sequence associations / correlations
E.g. whenever bonds go up, stock prices go down in 2 days
Deviations from temporal patterns
E.g. deviation from a steady growth
Document is the generic term for an information holder (book, chapter, article,
webpage, etc.).
Systemic approach
Cognitive approach
Goal (in an interactive information-seeking environment, with a given
IRS): support the user's exploration of the problem domain and the task completion.
Relevancy ranking is the process of sorting the document results so that those
documents which are most likely to be relevant to your query are shown at the top.
Relevance ranking is usually best for searches that are not either/or types of searches.
For example, in most traditional title searches, the result is that either the library has the book or
it does not. The relevancy program would either show the entry for the book, or an alphabetical
list with a statement in the appropriate place that says, "Your search would be here." This is a
very good place for this concrete, well-known sorting method.
A Web crawler is an Internet bot that systematically browses the World Wide Web,
typically for the purpose of Web indexing. A Web crawler may also be called a Web spider, an
ant, an automatic indexer, or (in the FOAF software context) a Web scutter. Web search engines
and some other sites use Web crawling or spidering software to update their own web content or
their indexes of other sites' web content. Web crawlers can copy all the pages they visit for later
processing by a search engine that indexes the downloaded pages so that users can search them.
Attributes are like the fields in a relational model. However, in the Book example we have, for
the attributes publishedBy and writtenBy, the complex types Publisher and Author, which are
also objects. Attributes with complex objects, in an RDBMS, are usually other tables linked by
keys to the main table.
Relationships: publishedBy and writtenBy are associations with 1:N and 1:1 relationships;
composed-of is an aggregation (a Book is composed of chapters). The 1:N relationship is
usually realized as attributes through complex types and at the behavioral level.
Generalization/specialization is the "is-a" relationship, which is supported in OODB through
the class hierarchy. An ArtBook is a Book, therefore the ArtBook class is a subclass of the Book
class. A subclass inherits all the attributes and methods of its superclass.
Message: the means by which objects communicate; a message is a request from one object to
another to execute one of its methods. For example, Publisher_object.insert("Rose", 123) is a
request to execute the insert method on a Publisher object.
Method: defines the behavior of an object. Methods can be used to change the object's state by
modifying its attribute values, or to query the values of selected attributes. The method that
responds to the message example is the method insert defined in the Publisher class.
The main differences between relational database design and object-oriented database design
include:
Many-to-many relationships must be removed before entities can be translated into relations;
in an object-oriented database, many-to-many relationships can be implemented directly.
Operations are not represented in the relational data model, whereas operations are one of the
main components in an object-oriented database.
In the relational data model, relationships are implemented by primary and foreign keys; in the
object model, objects communicate through their interfaces. The interface describes the data
(attributes) and operations (methods) that are visible to other objects.
XML: Extensible Markup Language. It is defined by the WWW Consortium (W3C). It is derived
from SGML (Standard Generalized Markup Language), but is simpler to use than SGML.
Documents have tags giving extra information about sections of the document.
E.g. <title> XML </title> <slide> Introduction </slide>
XML is extensible, unlike HTML: users can add new tags and separately specify how each tag
should be handled for display. The ability to specify new tags, and to create nested tag structures,
makes XML a great way to exchange data, not just documents. Much of the use of XML has
been in data exchange applications, not as a replacement for HTML. Tags make data (relatively)
self-documenting.
Example:
<bank>
    <account>
        <account_number> A-101 </account_number>
        <branch_name> Downtown </branch_name>
        <balance> 500 </balance>
    </account>
    <depositor>
        <account_number> A-101 </account_number>
    </depositor>
</bank>
Examples:
Banking: funds transfer
Order processing (especially inter-company orders)
Scientific data, e.g. chemistry (ChemML) and other markup formats for scientific
information
XML has become the basis for all new-generation data interchange formats. Earlier-generation
formats were based on plain text with line headers indicating the meaning of fields, similar in
concept to email headers. Such formats do not allow for nested structures, have no standard type
language, and are tied too closely to low-level document structure (lines, spaces, etc.).
Each XML-based standard defines what the valid elements are, using XML type specification
languages to specify the syntax:
DTD (Document Type Descriptors)
XML Schema
plus textual descriptions of the semantics. XML allows new tags to be defined as required;
however, this may be constrained by DTDs. A wide variety of tools is available for parsing,
browsing and querying XML documents/data.
XML is inefficient: tags, which in effect represent schema information, are repeated. Even so,
it is better than relational tuples as a data-exchange format: unlike relational tuples, XML data is
self-documenting due to the presence of tags; the format is non-rigid (tags can be added); nested
structures are allowed; and it enjoys wide acceptance, not only in database systems, but also in
browsers, tools, and applications.
STRUCTURE OF XML
Tag: label for a section of data
Element: section of data beginning with <tagname> and ending with the matching
</tagname>. Elements must be properly nested.
Proper nesting
<account> <balance> . </balance> </account>
Improper nesting
<account> <balance> . </account> </balance>
Formally: every start tag must have a unique matching end tag that is in the context of the
same parent element. Every document must have a single top-level element.
Example
<bank-1>
    <customer>
        <customer_name> Hayes </customer_name>
        <customer_city> Harrison </customer_city>
        <account>
            <account_number> A-102 </account_number>
            <branch_name> Perryridge </branch_name>
            <balance> 400 </balance>
        </account>
        <account>
        </account>
    </customer>
</bank-1>