
BA7205

INFORMATION MANAGEMENT

UNIT III DATABASE MANAGEMENT SYSTEMS

DBMS – HDBMS, NDBMS, RDBMS, OODBMS, Query Processing, SQL, Concurrency Management,
Data warehousing and Data Mart.

Author: Dr. K.Suresh Kumar, MBA., PGDPM&IL., Ph.D.,


Associate Professor, MBA Department

Panimalar Engineering College


Chennai.
February, 2017
DATABASE MANAGEMENT SYSTEMS

3.1 DBMS- Data Base Management Systems


Database
o A database is a collection of data related in some way. Data is a collection of facts
and figures that can be processed to produce information. For example, a student's name, age, class
and subjects can all be recorded as data. Databases store sets of data about
products, customers, orders and other facets of business operations.
 The database is a collection of related tables and other structures.
 A database is a self-describing collection of related records
DBMS
o A Database Management System is a software package designed to create and
maintain a shared collection of logically related data. A Database Management System
generally facilitates the processes of defining, constructing and manipulating data. In
addition, it provides various background services, including transaction
management, disaster recovery and security. Data management has become a high
priority issue in modern business management. To support it, database
management system providers keep improving their products using innovative technologies.
 A database management system stores data in a way that makes it easier to retrieve and
manipulate, and that helps to produce information.
 Mostly, data represents recordable facts. Data aids in producing information, which is
based on facts.
 For example, from the data about marks obtained by all students, one can
identify the toppers, the average marks and so on.
 Data is a vital part of any organization. It needs to be stored, organized, managed,
accessed, protected and manipulated.
 A DBMS is a computer program used to create, process and administer the database.
 The DBMS receives requests encoded in SQL and translates these requests into actions on the
database.
 A DBMS is a large, complicated program that is usually licensed software. The vast majority of
companies never write their own DBMS program.
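The student-marks example above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table name, column names and marks are invented for the sketch, and SQLite stands in for whatever DBMS an organization actually uses.

```python
import sqlite3

# In-memory database; the table, columns and marks are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE marks (student TEXT PRIMARY KEY, score INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO marks (student, score) VALUES (?, ?)",
    [("Asha", 82), ("Bala", 91), ("Chitra", 67)],
)

# Each SQL request below is translated by the DBMS into actions on the stored data.
topper = conn.execute(
    "SELECT student, score FROM marks ORDER BY score DESC LIMIT 1"
).fetchone()
average = conn.execute("SELECT AVG(score) FROM marks").fetchone()[0]

print(topper)   # ('Bala', 91)
print(average)  # 80.0
conn.close()
```

The raw marks are the data; the topper and the average are the information produced from them.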
A DBMS is the software system that allows users to define, create and maintain a database and provides controlled
access to the data. A Database Management System (DBMS) is basically a collection of programs that enables
users to store, modify, and extract information from a database as per the requirements. DBMS is an intermediate
layer between programs and the data. Programs access the DBMS, which then accesses the data. There are different
types of DBMS ranging from small systems that run on personal computers to huge systems that run on mainframes.
The following are main examples of database applications:
• Computerized library systems
• Automated teller machines
• Flight reservation systems
• Computerized parts inventory systems
A database management system is a piece of software that provides services for accessing a database, while
maintaining all the required features of the data. Commercially available DBMS products include dBASE, FoxPro, IMS,
Oracle, MySQL, SQL Server and DB2. These systems allow users to create, update and extract information
from their databases. Compared to a manual filing system, the biggest advantages of a computerized database system
are speed, accuracy and accessibility.
A relational database management system is the most popular type of database management system for business uses.
Database Application
o A set of one or more computer programs that serves as an intermediary between the users
and the DBMS.
o Application programs read or modify database data by sending SQL statements to the
DBMS.
o Application programs present data to users in the form of forms and reports.

Database Application: Functions


 Create and process forms
 Process user queries
 Create and process reports
 Execute application logic
 Control database applications
The four components of a database system are:
 Users
 Database Application
 Database Management System (DBMS)
 Database

Database System Environment


Components of the Database System Environment: There are five major, interrelated components in the database system
environment:
• Hardware
• Software
• Data
• Users
• Procedures

Functions of DBMS
 Create databases
 Create tables
 Create supporting structures
 Read database data
 Modify database data (insert, update, delete)
 Maintain database structures
 Enforce rules
 Control concurrency
 Provide security
 Perform backup and recovery
Advantages of DBMS
The database management system has promising potential advantages, which are explained below:
a) Controlling Redundancy: In a file system, each application has its own private files, which cannot be shared
between multiple applications. This can often lead to considerable redundancy in the stored data, which results in
wasted storage space. With a centralized database, most of this can be avoided. Not all
redundancy can or should be eliminated: sometimes there are sound business and technical reasons for maintaining
multiple copies of the same data. In a database system, however, this redundancy can be controlled.
b) For example: In a college database, there may be a number of applications such as General Office, Library,
Account Office and Hostel, each of which would otherwise keep its own copy of the same student data.
c) Integrity can be enforced: Integrity of data means that the data in the database is always accurate, such that incorrect
information cannot be stored in the database. In order to maintain the integrity of data, integrity constraints are
enforced on the database. A DBMS should provide capabilities for defining and enforcing these constraints.
d) Inconsistency can be avoided: When the same data is duplicated and a change made at one site is not
propagated to the other site, the two entries for the same data no longer agree.
At such times the data is said to be inconsistent. So, if the redundancy is removed, the chance of having inconsistent
data is also removed.
e) Data can be shared: As explained earlier, data such as Name, Class and Father_name held by the General Office is
shared by multiple applications in a centralized DBMS, unlike in a file system, so new applications can be
developed to operate against the same stored data. These applications may be developed without having to create any
new stored files.
f) Standards can be enforced: Since a DBMS is a central system, standards can be enforced easily, whether at
company, department, national or international level. Standardized data is very helpful during
migration or interchange of data. A file system is an independent system, so standards cannot be easily enforced
on multiple independent applications.
g) Restricting unauthorized access: When multiple users share a database, it is likely that some users will not be
authorized to access all information in the database. For example, account office data is often considered
confidential, and hence only authorized persons are allowed to access such data. In addition, some users may be
permitted only to retrieve data, whereas others are allowed both to retrieve and to update. Hence, the type of access
operation (retrieval or update) must also be controlled. Typically, users or user groups are given account numbers
protected by passwords, which they can use to gain access to the database. A DBMS should provide a security and
authorization subsystem, which the DBA uses to create accounts and to specify account restrictions. The DBMS
should then enforce these restrictions automatically.
h) Serving Enterprise Requirements over Individual Requirements: Since many types of users with varying levels of
technical knowledge use a database, a DBMS should provide a variety of user interfaces. The overall requirements of
the enterprise are more important than the individual user requirements. So, the DBA can structure the database
system to provide an overall service that is "best for the enterprise".
i) Providing Backup and Recovery: A DBMS must provide facilities for recovering from hardware or software
failures. The backup and recovery subsystem of the DBMS is responsible for recovery. For example, if the
computer system fails in the middle of a complex update program, the recovery subsystem is responsible for
making sure that the database is restored to the state it was in before the program started executing.
j) Cost of developing and maintaining system is lower: It is much easier to respond to unanticipated requests
when data is centralized in a database than when it is stored in a conventional file system. Although the initial cost
of setting up a database can be large, the cost of developing and maintaining application programs tends to be far
lower than for a similar service using conventional systems. The productivity of programmers can be higher when using
non-procedural languages that have been developed with the DBMS than when using procedural languages.
k) Data Model can be developed: A centralized system is able to represent complex data and interfile
relationships, which results in better data modeling properties. The data modeling properties of the relational model are
based on entities and their relationships, which are discussed in detail in chapter 4 of the book.
l) Concurrency Control: DBMS systems provide mechanisms for concurrent access to data by multiple
users.
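Item (c) above, enforcing integrity, can be illustrated with a small sketch. It assumes SQLite (through Python's sqlite3 module) as the DBMS, and the student table and its rules are hypothetical; the point is only that the DBMS, not the application, rejects rows that break the declared constraints.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: PRIMARY KEY, NOT NULL and CHECK are the integrity constraints.
conn.execute(
    """
    CREATE TABLE student (
        roll_no INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        age     INTEGER CHECK (age BETWEEN 3 AND 120)
    )
    """
)
conn.execute("INSERT INTO student VALUES (1, 'Asha', 20)")

# Three invalid rows: missing name, impossible age, duplicate roll number.
rejected = []
for row in [(2, None, 21), (3, "Bala", -5), (1, "Chitra", 19)]:
    try:
        conn.execute("INSERT INTO student VALUES (?, ?, ?)", row)
    except sqlite3.IntegrityError:
        rejected.append(row)

count = conn.execute("SELECT COUNT(*) FROM student").fetchone()[0]
print(count, len(rejected))  # 1 3
```

Only the one valid row is stored; each of the three invalid rows is refused by a different constraint.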
Disadvantages of DBMS: The disadvantages of the database approach are summarized as follows:
a. Complexity : The provision of the functionality that is expected of a good DBMS makes the DBMS an extremely
complex piece of software. Database designers, developers, database administrators and end-users must
understand this functionality to take full advantage of it. Failure to understand the system can lead to bad design
decisions, which can have serious consequences for an organization.
b. Size : The complexity and breadth of functionality makes the DBMS an extremely large piece of software,
occupying many megabytes of disk space and requiring substantial amounts of memory to run efficiently.
c. Performance: Typically, a file-based system is written for a specific application, such as invoicing. As a result,
performance is generally very good. However, the DBMS is written to be more general, to cater for many
applications rather than just one. The effect is that some applications may not run as fast as they used to.
d. Higher impact of a failure: The centralization of resources increases the vulnerability of the system. Since all
users and applications rely on the availability of the DBMS, the failure of any component can bring operations to
a halt.
e. Cost of DBMS: The cost of DBMS varies significantly, depending on the environment and functionality
provided. There is also the recurrent annual maintenance cost.
f. Additional Hardware costs: The disk storage requirements for the DBMS and the database may necessitate the
purchase of additional storage space. Furthermore, to achieve the required performance it may be necessary to
purchase a larger machine, perhaps even a machine dedicated to running the DBMS. The procurement of
additional hardware results in further expenditure.
g. Cost of Conversion: In some situations, the cost of the DBMS and extra hardware may be insignificant compared
with the cost of converting existing applications to run on the new DBMS and hardware. This cost also includes
the cost of training staff to use these new systems and possibly the employment of specialist staff to help with
conversion and running of the system. This cost is one of the main reasons why some organizations feel tied to
their current systems and cannot switch to modern database technology.

Stages of Information System


 Stage 0: Manual Information System
 Records
 Files
 Index Cards
 Stage 1: Sequential Information Systems
 Tapes
 Files
 slow, non-interactive, redundancy
 Stage 2: File Based Information Systems
 Disk (direct access)
 each application program has its own files; program-data dependence
 data redundancy
 Stage 3: DBMS based Information Systems
 Generalized data management software
 Transaction processing

Organizational DBMS
Organizational database systems typically:
 Support several users simultaneously
 Include more than one application
 Involve multiple computers
 Are complex in design
 Have many tables
 Have many databases
Conventional Data Processing techniques:

File Based Information Systems


1. How does a DBMS solve the problems of traditional file management? Primary Key vs. Foreign Key. Mapping
Cardinalities. (Unit-3)
 A DBMS Solves the Problems of the Traditional File Environment: Traditional file management techniques
make it difficult for organizations to keep track of all of the pieces of data they use in a systematic way and to
organize these data so that they can be easily accessed. Different functional areas and groups were allowed to
develop their own files independently. Over time, this traditional file management environment creates such
problems as data redundancy and inconsistency, program–data dependence, inflexibility, poor security, and lack of
data sharing and availability. A database management system (DBMS) solves these problems with software that
permits centralization of data and data management so that businesses have a single, consistent source for all their
data needs. Using a DBMS minimizes redundant and inconsistent files.

It reduces data redundancy and inconsistency by minimizing isolated files.
- It cannot eliminate data redundancy as a whole, but can help control it
- It uncouples data and programs, enabling data to stand on their own
- Access to and availability of information increase
- Program development and maintenance costs decrease
- Users and programmers can perform ad hoc queries of data in the database
- It enables the organization to centrally manage the data, their use and their security through a data dictionary
Relational DBMS
- Contemporary DBMSs use different database models
- The most popular type is the relational DBMS
- A relational DBMS represents data as two-dimensional tables (called relations)
- Tables are also referred to as files
- Each table contains data on an entity and its attributes
- Example: Microsoft Access is a relational DBMS
- Each element of data for each entity is stored as a separate field
- Each field represents an attribute of that entity
- Fields in a relational database are also called columns
- The actual information about a single entity (for example, one supplier) that resides in a table is called a row
- Rows are referred to as records or as tuples
- When a field uniquely identifies each record, so that it can be retrieved, updated or sorted, it is called a key field
- Every table in a relational database has one field designated as its primary key
- The key field is the unique identifier for all the information in any row of the table
- The primary key cannot be duplicated
 A primary key is a column or a set of columns that uniquely identifies a row in a table. A primary key should be
short, stable and simple. A foreign key is a field (or collection of fields) in a table whose value is required to
match the value of the primary key of a second table.
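The difference between the two kinds of key can be sketched as follows. The supplier and part tables are hypothetical, and SQLite (via Python's sqlite3 module) is assumed as the DBMS; note that SQLite only enforces foreign keys once the `foreign_keys` pragma is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.execute("CREATE TABLE supplier (supplier_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """
    CREATE TABLE part (
        part_id     INTEGER PRIMARY KEY,
        description TEXT NOT NULL,
        supplier_id INTEGER NOT NULL REFERENCES supplier (supplier_id)
    )
    """
)
conn.execute("INSERT INTO supplier VALUES (1, 'Acme Traders')")
conn.execute("INSERT INTO part VALUES (100, 'Door latch', 1)")  # supplier 1 exists

def try_insert(sql):
    """Return True if the DBMS accepts the statement, False if a key rule rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

# A second supplier with the same primary key value is refused.
duplicate_pk_ok = try_insert("INSERT INTO supplier VALUES (1, 'Other Co')")
# A part pointing at a supplier that does not exist is refused.
dangling_fk_ok = try_insert("INSERT INTO part VALUES (101, 'Seat', 99)")
print(duplicate_pk_ok, dangling_fk_ok)  # False False
```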

3.1.1 Characteristics of DBMS


Traditionally, data was organized in file formats. DBMS was a new concept then, and much
research was done to make it overcome the deficiencies of the traditional style of data
management. A modern DBMS has the following characteristics:
 Real-world entity: A modern DBMS is more realistic and uses real-world entities to design
its architecture. It uses their behavior and attributes too. For example, a school database may
use students as an entity and their age as an attribute.
 Relation-based tables: A DBMS allows entities and the relations among them to be represented as tables.
This simplifies how data is stored. A user can understand the architecture of a database
just by looking at the table names.
 Isolation of data and application: A database system is entirely different from its data.
The database is an active entity, whereas data is the passive one on which the database
works and which it organizes. A DBMS also stores metadata, which is data about data, to ease its own
processing.
 Less redundancy: A DBMS follows the rules of normalization, which split a relation when any
of its attributes has redundancy in its values. Following normalization, which is itself a
mathematically rich and scientific process, leaves the entire database with as little
redundancy as possible.
 Consistency: A DBMS maintains a state of consistency, whereas earlier forms of data
storing applications, like file processing, do not guarantee this. Consistency is a state where
every relation in the database remains consistent. There exist methods and techniques that
can detect an attempt to leave the database in an inconsistent state.
 Query Language: A DBMS is equipped with a query language, which makes it more efficient to
retrieve and manipulate data. A user can apply as many different filtering options as he
or she wants. This was not possible where a file-processing system was used.
 ACID Properties: A DBMS follows the concept of ACID properties, which stands for
Atomicity, Consistency, Isolation and Durability. These concepts are applied to transactions,
which manipulate data in the database. ACID properties keep the database in a healthy state in
a multi-transactional environment and in case of failure.
 Multiuser and Concurrent Access: A DBMS supports a multi-user environment and allows
users to access and manipulate data in parallel. Although there are restrictions on transactions
when they attempt to handle the same data item, users are generally unaware of them.
 Multiple views: A DBMS offers multiple views for different users. A user in the sales
department will have a different view of the database than a person working in the production
department. This gives users a focused view of the database according to their
requirements.
 Security: Features like multiple views offer security to some extent, since users are unable
to access the data of other users and departments. A DBMS offers methods to impose constraints
while entering data into the database and retrieving it at a later stage. A DBMS offers many
different levels of security features, which enable multiple users to have different views with
different features. For example, a user in the sales department cannot see the data of the purchase
department; additionally, how much of the sales department's data he or she can see can also
be managed. Because the data in a DBMS is not stored as plain files as in a traditional file system, it is harder for
an unauthorized person to read it directly.
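Of the characteristics listed above, atomicity is the easiest to demonstrate in a few lines. The sketch below assumes SQLite through Python's sqlite3 module; the account table and the `transfer` helper are hypothetical names introduced only for illustration. A transfer is made of two updates, and if the second one fails, the first one is undone as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

def transfer(src, dst, amount):
    """Move money between accounts as one transaction; return True on success."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst))
            conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(1, 2, 500)  # would overdraw account 1, so the CHECK constraint fails
balances = conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall()
print(ok, balances)  # False [(1, 100), (2, 50)]
```

The credit to account 2 had already been applied when the debit failed, yet neither change survives: the transaction is all or nothing.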
3.1.2 Users of DBMS
A DBMS is used by various users for various purposes. Some are involved in retrieving data
and some in backing it up. Some of them are described as follows:
 Administrators: A group of users who maintain the DBMS and are responsible for
administering the database. They are responsible for overseeing its usage and deciding by whom it
should be used. They create user access and apply limitations to maintain isolation and
enforce security. Administrators also look after DBMS resources like the system license,
required software applications and tools, and other hardware-related maintenance. The
database administrator (DBA):
 Works with programmers and analysts to design and implement the database
 Works with users and managers to establish database policies
 Implements security features and establishes database permissions
 Designers: This is the group of people who actually work on the design of the database.
A database starts with requirement analysis followed by a good design
process. These people keep a close watch on what data should be kept and in what format.
They identify and design the whole set of entities, relations, constraints and views.
 End Users: This group contains the people who actually take advantage of the database
system. End users can be simple viewers who pay attention to the logs or market rates, or
they can be as sophisticated as business analysts who make the most of it.
A user of a database system will
o Use a database application to track things
o Use forms to enter, read, delete and query data
o Produce reports
3.1.3 Advantages of the Database Approach
 Program-data independence
 Minimal data redundancy
 Improved data consistency
 Improved data sharing
 Increased productivity of application development
 Enforcement of standards
 Improved data quality
 Improved data accessibility
 Reduced program maintenance
3.1.4 DBMS - Architecture
The design of a Database Management System depends heavily on its architecture. It can be
centralized, decentralized or hierarchical. DBMS architecture can be seen as single-tier or
multi-tier.
An n-tier architecture divides the whole system into n related but independent modules, which
can be independently modified, altered, changed or replaced.
 1-tier architecture: The DBMS is the only entity, and the user works directly on the DBMS and
uses it. Any changes made here are made directly on the DBMS itself. It does not provide
handy tools for end users; database designers and programmers are the ones who prefer to use single-tier
architecture.
 2-tier architecture: There must be some application through which the DBMS is used.
Programmers use 2-tier architecture, where they access the DBMS by means of an application.
Here the application tier is entirely independent of the database in terms of operation, design and
programming.
 3-tier architecture: The most widely used architecture is the 3-tier architecture. It
separates its tiers from each other on the basis of users. It is described as follows:
 Database (Data) Tier: Only the database resides at this tier. The database, along with its query
processing languages, sits in layer 3 of the 3-tier architecture. It also contains all relations and
their constraints.
 Application (Middle) Tier: The application server and the programs that
access the database reside at this tier. For a user, the application tier works as an abstracted view of the
database. Users are unaware of any existence of the database beyond the application. For the
database tier, the application tier is its user. The database tier is not aware of any other user
beyond the application tier. This tier works as a mediator between the two.
 User (Presentation) Tier: The end user sits on this tier. From a user's perspective this tier is
everything. He or she does not know about the existence or form of the database beyond this
layer. At this layer multiple views of the database can be provided by the application. All
views are generated by applications, which reside in the application tier.
Multiple-tier database architecture is highly modifiable, as almost all its components are
independent and can be changed independently.
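The separation of the three tiers can be sketched in a few lines of Python. All class, method and table names here are hypothetical and chosen only for the illustration, and an in-memory SQLite database stands in for the data tier; a real deployment would normally place each tier in a separate process or machine.

```python
import sqlite3

class DataTier:
    """Database tier: the only layer that holds the data and runs SQL."""
    def __init__(self):
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute("CREATE TABLE product (name TEXT, price REAL, cost REAL)")
        self._conn.execute("INSERT INTO product VALUES ('Pen', 10.0, 6.0)")

    def query(self, sql, params=()):
        return self._conn.execute(sql, params).fetchall()

class ApplicationTier:
    """Middle tier: the only 'user' the data tier ever sees."""
    def __init__(self, data):
        self._data = data

    def sales_view(self):
        # Sales staff see prices but not costs.
        return self._data.query("SELECT name, price FROM product")

    def production_view(self):
        # Production staff see costs but not prices.
        return self._data.query("SELECT name, cost FROM product")

def present(rows):
    """Presentation tier: formats rows; it never touches SQL or the database."""
    return [f"{name}: {value:.2f}" for name, value in rows]

app = ApplicationTier(DataTier())
print(present(app.sales_view()))       # ['Pen: 10.00']
print(present(app.production_view()))  # ['Pen: 6.00']
```

Because the presentation function only receives rows, the data tier could be replaced by another DBMS without changing it, which is the modifiability the text describes.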
3.1.4 DBMS - Data Models
A data model describes how the logical structure of a database is modeled. Data models are
fundamental entities for introducing abstraction in a DBMS. They define how data items are
connected to each other and how they will be processed and stored inside the system.
The very first data models were flat data models, where all the data was kept in the
same plane. Because these early data models were not very scientific, they were prone to introducing
a lot of duplication and update anomalies.
3.1.5 Entity-Relationship Model
The Entity-Relationship model is based on the notion of real-world entities and the relationships
among them. While formulating a real-world scenario into a database model, the ER Model creates
entity sets, relationship sets, general attributes and constraints. The ER Model is best used for the
conceptual design of a database.
The ER Model is based on:
Entities and their attributes
Relationships among entities
Entity: An entity in the ER Model is a real-world entity that has some properties called
attributes. Every attribute is defined by its set of values, called its domain. For example, in a
school database, a student is considered an entity. A student has various attributes like
name, age and class.

Relationship: The logical association among entities is called a relationship. Relationships
are mapped to entities in various ways. Mapping cardinalities define the number of
associations between two entities.
Mapping cardinalities:
one to one
one to many
many to one
many to many
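One common way to realize the mapping cardinalities above in tables is sketched below, assuming SQLite through Python's sqlite3 module. The school tables are hypothetical. Many to one is simply one to many read from the other side, so three patterns cover all four cases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One to many: each student belongs to one class; a class has many students.
    CREATE TABLE class   (class_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT,
                          class_id INTEGER REFERENCES class (class_id));

    -- One to one: UNIQUE on the foreign key allows at most one locker per student.
    CREATE TABLE locker  (locker_id INTEGER PRIMARY KEY,
                          student_id INTEGER UNIQUE REFERENCES student (student_id));

    -- Many to many: a separate table pairs students with subjects.
    CREATE TABLE subject (subject_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE enrolment (student_id INTEGER REFERENCES student (student_id),
                            subject_id INTEGER REFERENCES subject (subject_id),
                            PRIMARY KEY (student_id, subject_id));

    INSERT INTO class VALUES (1, 'MBA I');
    INSERT INTO student VALUES (1, 'Asha', 1), (2, 'Bala', 1);
    INSERT INTO subject VALUES (10, 'Statistics'), (11, 'Marketing');
    INSERT INTO enrolment VALUES (1, 10), (1, 11), (2, 10);
    """
)

students_in_class = conn.execute(
    "SELECT COUNT(*) FROM student WHERE class_id = 1").fetchone()[0]
subjects_of_asha = conn.execute(
    "SELECT COUNT(*) FROM enrolment WHERE student_id = 1").fetchone()[0]
students_in_stats = conn.execute(
    "SELECT COUNT(*) FROM enrolment WHERE subject_id = 10").fetchone()[0]
print(students_in_class, subjects_of_asha, students_in_stats)  # 2 2 2
```

One class has two students (one to many), while one student takes two subjects and one subject has two students (many to many).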

3.1.6 Relational Model


The most popular data model in DBMS is the Relational Model. It is a more scientific model than
the others. This model is based on first-order predicate logic and defines a table as an n-ary
relation.
The main highlights of this model are:
Data is stored in tables called relations.
Relations can be normalized.
In normalized relations, the values stored are atomic values.
Each row in a relation is unique.
Each column in a relation contains values from the same domain.
3.1.7 DBMS - Data Schemas
3.1.7.1 Database schema: A database schema is the skeleton structure of the database, and it represents the
logical view of the entire database. It describes how the data is organized and how the relations
among them are associated. It formulates all the constraints that are to be applied to the data in
the relations that reside in the database.
A database schema defines its entities and the relationships among them. It is
a descriptive detail of the database, which can be depicted by means of schema diagrams.
All these activities are done by the database designer to help programmers
understand all aspects of the database.
A database schema can be divided broadly into two categories:
Physical Database Schema: This schema pertains to the actual storage of data and its
form of storage, such as files and indices. It defines how the data will be stored in secondary
storage.
Logical Database Schema: This defines all the logical constraints that need to be applied to the
stored data. It defines tables, views, integrity constraints and so on.

3.1.7.2 Database Instance


It is important to distinguish these two terms. The database schema is the
skeleton of the database. It is designed before the database exists at all, and it is very hard to
change once the database is operational. A database schema does not contain any data or
information.
A database instance is the state of an operational database, with its data, at a given time. It is a
snapshot of the database. Database instances tend to change with time. The DBMS ensures that
every instance (state) is a valid state by upholding all the validations, constraints and
conditions that the database designers have imposed or that are expected from the DBMS itself.
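The distinction between schema and instance can be observed directly. The sketch below assumes SQLite through Python's sqlite3 module, where table definitions are kept in the built-in `sqlite_master` catalog; the student table and the two helper functions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (roll_no INTEGER PRIMARY KEY, name TEXT NOT NULL)")

def schema():
    """The stored table definitions (SQLite keeps them in sqlite_master)."""
    return conn.execute("SELECT sql FROM sqlite_master WHERE type = 'table'").fetchall()

def instance():
    """The rows currently in the table: a snapshot of the data."""
    return conn.execute("SELECT roll_no, name FROM student ORDER BY roll_no").fetchall()

schema_before, instance_before = schema(), instance()
conn.execute("INSERT INTO student VALUES (1, 'Asha')")
schema_after, instance_after = schema(), instance()

print(schema_before == schema_after)    # True: the schema did not change
print(instance_before, instance_after)  # [] [(1, 'Asha')]
```

Inserting a row produces a new instance of the database while the schema stays exactly as it was designed.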

3.1.8 Flat File System


A flat file is a database that stores its data in a plain text file. Each line of the file stores a
single record, and the fields within it are separated by delimiters such as commas or tabs. Although a flat-file system can
hold multiple tables, it cannot define relationships between them as a relational database can.
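A comma-delimited flat file of this kind can be read with Python's standard csv module. The file contents and column names below are invented for the sketch, and the file is held in a string so that the example is self-contained.

```python
import csv
import io

# A small flat file held in a string; the column names are made up.
flat_file = io.StringIO(
    "roll_no,name,class\n"
    "1,Asha,MBA I\n"
    "2,Bala,MBA II\n"
)

# Each line becomes one record; the first line supplies the field names.
records = list(csv.DictReader(flat_file))
print(len(records))        # 2
print(records[0]["name"])  # Asha
```

Nothing in the file itself links it to any other file, which is the limitation the text points out.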
3.2 HDBMS Hierarchical Database Management System
A hierarchical database model is a data model in which the data is organized into a tree-like
structure. The data is stored as records which are connected to one another through links. A
record is a collection of fields, with each field containing only one value. The entity type of
a record defines which fields the record contains.
3.2.1 Example of a hierarchical model
A record in the hierarchical database model corresponds to a row in the relational database
model, and an entity type corresponds to a table.
The hierarchical database model mandates that each child record has only one parent,
whereas each parent record can have one or more child records. In order to retrieve data
from a hierarchical database, the whole tree needs to be traversed starting from the root node.
This model is recognized as the first database model; it was created by IBM in the 1960s.
The Hierarchical Data Model is a way of organizing a database with multiple one-to-many
relationships. The structure is based on the rule that one parent can have many children but
children are allowed only one parent. This structure allows information to be repeated
through the parent-child relations. It was created by IBM and was implemented mainly in their
Information Management System (IMS).
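The single-parent rule and the root-first retrieval described above can be sketched with a small tree in Python. The `Record` class and `find` function are hypothetical names for this illustration and do not reflect how IMS itself is implemented.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Record:
    name: str
    # Exactly one parent (None only for the root record).
    parent: Optional["Record"] = field(default=None, repr=False, compare=False)
    children: List["Record"] = field(default_factory=list)

    def add_child(self, name: str) -> "Record":
        child = Record(name, parent=self)
        self.children.append(child)
        return child

def find(root: Record, name: str) -> Optional[List[str]]:
    """Depth-first search from the root; returns the path to the record, if any."""
    if root.name == name:
        return [root.name]
    for child in root.children:
        path = find(child, name)
        if path is not None:
            return [root.name] + path
    return None

president = Record("President")
manager = president.add_child("Manager A")
president.add_child("Manager B")
manager.add_child("Employee 1")

print(find(president, "Employee 1"))  # ['President', 'Manager A', 'Employee 1']
```

Every lookup starts at the root, so records near the top are found quickly and records near the bottom take longer.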

3.2.2 Advantages of hierarchical model


The model allows easy addition and deletion of information. Data at the top of the
hierarchy is very fast to access. The model was very easy to work with because it worked
well with linear data storage media such as tapes. The model relates very well to natural
hierarchies such as assembly plants and employee organization in corporations. It relates
well to anything that works through one-to-many relationships. For example, there is a
president with many managers below him or her, and those managers have many employees
below them, but each employee has only one manager.
3.2.3 Disadvantages of hierarchical model
This model has many issues that hold it back now that we require more sophisticated
relationships. It requires data to be stored repetitively in many different entities. The
database can be very slow when searching for information in the lower entities. We no
longer use linear data storage media such as tapes, so that advantage no longer applies. Searching for
data requires the DBMS to run through the model from top to bottom until the
required information is found, making queries very slow. The model can only represent one-to-many
relationships; many-to-many relationships are not supported directly, and clever manipulation of the
model is required to represent them.
3.3 NDBMS-Network Database Management System
Network Database: Network databases are mainly used on large digital computers. Because more
connections can be made between different types of data, network databases are considered
more efficient. They also have limitations that must be considered when using this kind of
database.

Network databases are similar to hierarchical databases in that they also have a hierarchical
structure. A network database, however, looks more like a cobweb or an interconnected network of
records.
In network databases, children are called members and parents are called owners. The
difference is that each child or member can have more than one parent. The popularity of
the network data model coincided with the popularity of the hierarchical data model. Some data
were more naturally modeled with more than one parent per child. The network model
permitted the modeling of many-to-many relationships in data.
The network model is very similar to the hierarchical model. In fact, the hierarchical
model is a subset of the network model. However, instead of using a single-parent tree
hierarchy, the network model uses set theory to provide a tree-like hierarchy, with the
exception that child tables are allowed to have more than one parent. It supports many-to-many
relationships.
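The owner and member idea can be sketched with two dictionaries in Python. The supplier and part names are invented, and this is only a toy model of the relationships, not of how a network DBMS stores its records.

```python
from collections import defaultdict

# owner -> members and member -> owners, kept in step with each other
members_of = defaultdict(set)
owners_of = defaultdict(set)

def link(owner, member):
    """Record that `member` belongs to a set owned by `owner`."""
    members_of[owner].add(member)
    owners_of[member].add(owner)

link("Supplier A", "Part 1")
link("Supplier B", "Part 1")   # a second owner: not allowed in a strict hierarchy
link("Supplier A", "Part 2")

print(sorted(owners_of["Part 1"]))       # ['Supplier A', 'Supplier B']
print(sorted(members_of["Supplier A"]))  # ['Part 1', 'Part 2']
```

One supplier owns many parts and one part has many suppliers, which is the many-to-many relationship that the hierarchical model cannot express directly.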
3.4 RDBMS-Relational Database Management System
In relational databases, the relationship between data files is relational. Hierarchical and
network databases require the user to pass down through a hierarchy in order to access needed data. Relational
databases instead connect data in different files by using common data elements or a key field.
Data in relational databases is stored in different tables, each having a key
field that uniquely identifies each row. Relational databases are considered more reliable than either
the hierarchical or network database structures. In relational databases, tables or files filled
with data are called relations, a tuple designates a row or record, and columns are referred to as
attributes or fields.
The main advantage of a relational database over a flat file system is the proper
organization of its data. A relational database also defines the relationships between
its tables. In relational databases, queries are used to fetch data with the help of
indexes. Relational database technology makes databases efficient, lighter and faster.
3.4.1 Advantages of the RDBMS:
 Bringing tables together using relations
 Provides Structured Query Language (SQL) to define and manipulate data
 Security
 Makes the databases efficient, lighter and faster
 Defines the relationships between tables
 Queries fetch data with the help of indexes

Relational databases work on the principle that each table has a key field that uniquely
identifies each row, and that these key fields can be used to connect one table of data to another.
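The way key fields connect one table to another can be illustrated with Python's built-in sqlite3 module. The table and column names (customer, orders, customer_id) are assumptions made for this sketch and do not come from these notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),  -- key field shared with customer
    amount REAL
);
INSERT INTO customer VALUES (1, 'Asha'), (2, 'Ravi');
INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 100.0), (12, 2, 75.0);
""")

# The shared key field customer_id connects each order row to its customer row.
rows = conn.execute("""
SELECT c.name, SUM(o.amount)
FROM customer AS c JOIN orders AS o ON o.customer_id = c.customer_id
GROUP BY c.customer_id
ORDER BY c.customer_id
""").fetchall()
print(rows)
```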
3.4.2 Two major reasons for the popularity of relational databases
1. Relational databases can be used with little or no training.
2. Database entries can be modified without specifying the entire body.
3.4.3 Properties of Relational Tables
In a relational database, tables have the following properties.
 Values are atomic.
 Each row is unique.
 Column values are of the same kind.
 The sequence of columns is insignificant.
 The sequence of rows is insignificant.
 Each column has a unique name.
Distinguish between DBMS & RDBMS. Explain the advantages & disadvantages of both.
3.5 OODBMS – Object oriented Database Management System
An object-oriented database is a collection of objects in persistent storage that holds
information. It is quite similar to object-oriented programming languages. It can be
described as a fifth-generation database technology that began to develop in the mid-1980s.
Real-world entities are represented as objects in the object-oriented data model.
This model builds on the functionality of object-oriented programming and does more than
store programming language objects. Object DBMSs extend the semantics of C++ and Java
to provide full-featured database programming capability while retaining native language
compatibility. They add database functionality to object programming languages. This
approach unifies application and database development into a consistent data model and
language environment. Applications require less code, use more natural data modeling, and
code bases are easier to maintain. Object developers can write complete database
applications with a modest amount of additional effort.

Object-oriented databases are the integration of object-oriented programming language
systems and persistent systems. The power of object-oriented databases comes from the
seamless treatment of both persistent data, as found in databases, and transient data, as
found in executing programs.

Object-oriented databases use small, reusable chunks of software called objects. The
objects themselves are stored in the object-oriented database. Each object consists of two
elements:
1. Piece of data (e.g., sound, video, text, or graphics).
2. Instructions or software programs called methods, for what to do with the data.
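The two elements of an object, data and methods, can be sketched with a small Python class. The class name AudioClip and its fields are hypothetical, and the sketch shows only the object itself, not how an object database would store it.

```python
class AudioClip:
    """An object that bundles a piece of data with the methods that act on it."""

    def __init__(self, title, samples):
        # Element 1: the piece of data (here, a list of sound samples).
        self.title = title
        self.samples = list(samples)

    def duration_seconds(self, sample_rate):
        # Element 2: a method that says what to do with the data.
        return len(self.samples) / sample_rate


clip = AudioClip("greeting", [0, 3, -2, 5])
print(clip.duration_seconds(sample_rate=2))
```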
3.5.1 Disadvantage of Object-oriented databases
Object-oriented databases have these disadvantages.
Object-oriented databases are more expensive to develop.
Most organizations are unwilling to abandon their existing databases and convert to
object-oriented ones.
The benefits of object-oriented databases are nonetheless compelling. The ability to mix
and match reusable objects provides incredible multimedia capability.
3.6 Object-Relational Model (Hybrid Model): It is also a relational data model but with
object orientation in it. It reduces the gap between the conceptual data modeling techniques
and object-relational mapping.

Multidimensional Database: It is a database system that is usually structured to optimize
online analytical processing (OLAP) and data warehouse applications. A multidimensional
database can receive data from a variety of relational databases and structure the
information into categories and sections that can be accessed in a number of different ways.

3.7 Query Processing


3.7.1 Upper levels of the data integration problem
How to construct mappings from sources to a single mediated schema
How queries posed over the mediated schema are reformulated over the sources
3.7.2 Basic Steps in Query Processing
1. Parsing and translation
2. Optimization
3. Evaluation
Parsing and translation: Translate the query into its internal form. This is then
translated into relational algebra. Parser checks syntax, verifies relations.
Query Optimization: Amongst all equivalent evaluation plans choose the one with
lowest cost. Cost is estimated using statistical information from the database catalog.
Evaluation: The query-execution engine takes a query-evaluation plan, executes that
plan, and returns the answers to the query. A relational algebra expression may have
many equivalent expressions. Each relational algebra operation can be evaluated using
one of several different algorithms. Correspondingly, a relational-algebra expression can
be evaluated in many ways. An annotated expression specifying a detailed evaluation
strategy is called an evaluation plan.
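The point that one query has several equivalent evaluation plans with different costs can be sketched in plain Python. The toy relations and the helper names (join, select) are assumptions made for illustration. A real optimizer estimates cost from catalog statistics rather than running both plans.

```python
# Hypothetical toy relations stored as lists of dictionaries.
employees = [{"emp_id": i, "dept_id": i % 3} for i in range(9)]
departments = [
    {"dept_id": 0, "city": "Chennai"},
    {"dept_id": 1, "city": "Pune"},
    {"dept_id": 2, "city": "Delhi"},
]


def join(left, right, key):
    """Nested-loop join of two relations on a common key."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


def select(rows, predicate):
    """Relational selection: keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def in_pune(row):
    return row["city"] == "Pune"


# Plan A: join everything first, then filter the large intermediate result.
joined_a = join(employees, departments, "dept_id")
plan_a = select(joined_a, in_pune)

# Plan B: filter the small relation first, then join a much smaller input.
filtered_b = select(departments, in_pune)
plan_b = join(employees, filtered_b, "dept_id")

print(plan_a == plan_b, len(joined_a), len(filtered_b))
```

Both plans return the same rows, but Plan B builds a much smaller intermediate result, which is the kind of difference the optimizer looks for when it picks the plan with the lowest cost.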
3.8 SQL
SQL is Structured Query Language, a computer language for storing, manipulating and
retrieving data stored in a relational database. SQL is the standard language for relational
database systems. Relational database management systems such as MySQL, MS Access,
Oracle, Sybase, Informix, PostgreSQL and SQL Server use SQL as their standard database
language.
2. What is SQL? What is Query Processing? Concurrency Management. (Unit-3)
What is SQL?
 SQL stands for Structured Query Language
 SQL lets you access and manipulate databases
 SQL is an ANSI (American National Standards Institute) standard
What Can SQL do?
 SQL can execute queries against a database
 SQL can retrieve data from a database
 SQL can insert records in a database
 SQL can update records in a database
 SQL can delete records from a database
 SQL can create new databases
 SQL can create new tables in a database
 SQL can create stored procedures in a database
 SQL can create views in a database
 SQL can set permissions on tables, procedures, and views
Some of the Most Important SQL Commands
 SELECT - extracts data from a database
 UPDATE - updates data in a database
 DELETE - deletes data from a database
 INSERT INTO - inserts new data into a database
 CREATE DATABASE - creates a new database
 ALTER DATABASE - modifies a database
 CREATE TABLE - creates a new table
 ALTER TABLE - modifies a table
 DROP TABLE - deletes a table
 CREATE INDEX - creates an index (search key)
 DROP INDEX - deletes an index
3. Write any 2 DDL & DML Commands in SQL with example. ( Unit – 3 )
SQL language is divided into four types of primary language statements: DML, DDL, DCL and TCL. Using these
statements, we can define the structure of a database by creating and altering database objects, and we can manipulate
data in a table through updates or deletions. We also can control which user can read/write data or manage
transactions to create a single unit of work.
The four main categories of SQL statements are as follows:
 DML (Data Manipulation Language)
 DDL (Data Definition Language)
 DCL (Data Control Language)
 TCL (Transaction Control Language)
DML (Data Manipulation Language): DML statements affect records in a table. These are basic operations we
perform on data such as selecting a few records from a table, inserting new records, deleting unnecessary records, and
updating/modifying existing records. DML statements include the following:
SELECT – select records from a table: SELECT * FROM class;
INSERT – insert new records: INSERT INTO table-name VALUES (data1, data2, ..)
UPDATE – update/modify existing records: UPDATE table-name SET column-name = value WHERE condition;
DELETE – delete existing records
DDL (Data Definition Language): DDL statements are used to alter/modify a database or table structure and
schema. These statements handle the design and storage of database objects.
CREATE – create a new Table, database, schema
ALTER – alter existing table, column description
DROP – delete existing objects from database
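The DDL and DML statements listed above can be tried out with Python's built-in sqlite3 module. This is a minimal sketch: the table name class follows the SELECT example above, while the columns (roll_no, name, marks) and the data are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define the table structure, then alter it.
cur.execute("CREATE TABLE class (roll_no INTEGER PRIMARY KEY, name TEXT)")
cur.execute("ALTER TABLE class ADD COLUMN marks INTEGER")

# DML: insert, update, delete and select records.
cur.executemany(
    "INSERT INTO class VALUES (?, ?, ?)",
    [(1, "Asha", 82), (2, "Ravi", 74), (3, "Meena", 91)],
)
cur.execute("UPDATE class SET marks = 80 WHERE roll_no = 2")
cur.execute("DELETE FROM class WHERE roll_no = 1")
remaining = cur.execute("SELECT * FROM class ORDER BY roll_no").fetchall()
print(remaining)

# DDL again: remove the table from the database.
cur.execute("DROP TABLE class")
```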

3.8.1 Why SQL?


Allows users to access data in relational database management systems.
Allows users to describe the data.
Allows users to define the data in database and manipulate that data.
Allows SQL to be embedded within other languages using SQL modules, libraries and
pre-compilers.
Allows users to create and drop databases and tables.
Allows users to create view, stored procedure, functions in a database.
Allows users to set permissions on tables, procedures, and views

3.8.2 History
 1970 -- Dr. Edgar F. "Ted" Codd of IBM is known as the father of relational databases.
He described a relational model for databases.
 1974 -- Structured Query Language appeared.
 1978 -- IBM worked to develop Codd's ideas and released a product named System/R.
 1986 -- IBM developed the first prototype of a relational database, and SQL was
standardized by ANSI. The first relational database was released by Relational Software,
which later became Oracle.
3.8.3 SQL Process
When you are executing an SQL command for any RDBMS, the system determines the best
way to carry out your request and SQL engine figures out how to interpret the task.
There are various components included in the process. These components are Query
Dispatcher, Optimization Engines, Classic Query Engine and SQL Query Engine, etc.
Classic query engine handles all non-SQL queries but SQL query engine won't handle
logical files.
3.8.4 SQL Commands
The standard SQL commands to interact with relational databases are CREATE, SELECT,
INSERT, UPDATE, DELETE and DROP. These commands can be classified into groups
based on their nature.
3.8.4.1 DDL - Data Definition Language
Command Description
CREATE Creates a new table, a view of a table, or other object in database
ALTER Modifies an existing database object, such as a table.
DROP Deletes an entire table, a view of a table or other object in the database.

3.8.4.2 DML - Data Manipulation Language


Command Description
SELECT Retrieves certain records from one or more tables
INSERT Creates a record
UPDATE Modifies records
DELETE Deletes records
3.8.4.3 DCL - Data Control Language
Command Description
GRANT Gives a privilege to user
REVOKE Takes back privileges granted from user
3.9 Concurrency Management
In a multiprogramming environment where more than one transaction can be executed
concurrently, protocols are needed to control the concurrency of transactions and to ensure
their atomicity and isolation properties. Concurrency control protocols that ensure
serializability of transactions are most desirable. Concurrency control protocols can
be broadly divided into two categories:
Lock based protocols
Time stamp based protocols

3.9.1 Lock based protocols: Database systems equipped with lock-based protocols use a
mechanism by which a transaction cannot read or write data until it first acquires an
appropriate lock on it. Locks are of two kinds:
 Binary Locks: a lock on a data item can be in two states; it is either locked or unlocked.
 Shared/exclusive: this type of locking mechanism differentiates locks based on their
uses. If a lock is acquired on a data item to perform a write operation, it is an exclusive
lock, because allowing more than one transaction to write on the same data item would
lead the database into an inconsistent state. Read locks are shared because no data value
is being changed.
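The shared/exclusive rule can be written as a small compatibility check. This is only a sketch: the function name compatible and the mode letters "S" (shared, read) and "X" (exclusive, write) are assumptions made for illustration.

```python
def compatible(requested, held_modes):
    """Return True if a lock in mode `requested` can be granted on a data item
    while other transactions already hold locks in `held_modes`."""
    if not held_modes:
        return True  # nothing is held, so any lock can be granted
    # Only shared (read) locks may coexist; an exclusive lock needs the item alone.
    return requested == "S" and all(mode == "S" for mode in held_modes)


print(compatible("S", ["S", "S"]))  # several readers together
print(compatible("X", ["S"]))       # a writer while a reader holds the item
```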
3.9.1.1 Types of lock protocols
 Simplistic: Simplistic lock-based protocols allow a transaction to obtain a lock on every
object before a 'write' operation is performed. As soon as the 'write' has been done, the
transaction may unlock the data item.
 Pre-claiming: In this protocol, a transaction evaluates its operations and creates a list
of data items on which it needs locks. Before starting execution, the transaction requests
from the system all the locks it needs. If all the locks are granted, the transaction
executes and releases all the locks when all its operations are over. If not all the locks
are granted, the transaction rolls back and waits until all locks are granted.
 Two Phase Locking - 2PL: This locking protocol divides the execution of a transaction
into three parts.
1. In the first part, when the transaction starts executing, it seeks grants for the locks it
needs as it executes.
2. In the second part, the transaction has acquired all its locks and no other lock is
required. The transaction keeps executing its operations. As soon as the transaction
releases its first lock, the third part starts.
3. In the third part, the transaction cannot demand any new lock; it only releases the
acquired locks.

Two phase locking has two phases, one is growing; where all locks are being acquired by
transaction and second one is shrinking, where locks held by the transaction are being
released. To claim an exclusive (write) lock, a transaction must first acquire a shared (read)
lock and then upgrade it to exclusive lock.
Strict Two Phase Locking: The first phase of Strict-2PL is the same as in 2PL. After
acquiring all locks in the first phase, the transaction continues to execute normally. But in
contrast to 2PL, Strict-2PL does not release a lock as soon as it is no longer required; it
holds all locks until the commit point arrives and then releases them all at once.
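The growing and shrinking phases can be sketched as a small Python class that refuses any new lock once the first lock has been released. The class and method names are hypothetical, and the sketch tracks the locks of a single transaction only. It does not model conflicts between transactions.

```python
class TwoPhaseTransaction:
    """Tracks the locks of one transaction and enforces the two-phase rule."""

    def __init__(self):
        self.held = set()
        self.shrinking = False  # becomes True after the first release

    def lock(self, item):
        # Growing phase only: no new lock once shrinking has started.
        if self.shrinking:
            raise RuntimeError("2PL violation: cannot lock after first unlock")
        self.held.add(item)

    def unlock(self, item):
        # The first unlock starts the shrinking phase.
        self.held.remove(item)
        self.shrinking = True

    def commit(self):
        # Strict 2PL style: release every lock at once at the commit point.
        self.held.clear()
        self.shrinking = True


t = TwoPhaseTransaction()
t.lock("A")
t.lock("B")
t.unlock("A")  # shrinking phase begins here
try:
    t.lock("C")
    violation = False
except RuntimeError:
    violation = True
print(violation)
```

A transaction following Strict-2PL would never call unlock directly and would only call commit.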

3.9.2 Time stamp based protocols: The most commonly used concurrency protocol is the
time-stamp based protocol. This protocol uses either system time or a logical counter as
a time-stamp. Lock based protocols manage the order between conflicting pairs among
transactions at the time of execution, whereas time-stamp based protocols start working as
soon as a transaction is created.
Every transaction has a time-stamp associated with it, and the ordering is determined by the
age of the transaction. A transaction created at clock time 0002 would be older than all other
transactions that come after it. For example, any transaction 'y' entering the system at 0004
is two seconds younger, and priority may be given to the older one. In addition, every data
item is given the latest read and write time-stamps. This lets the system know when the last
read and write operations were made on the data item.
3.9.2.1 Time-stamp ordering protocol: The timestamp-ordering protocol ensures
serializability among transactions in their conflicting read and write operations. It is the
responsibility of the protocol system that the conflicting pair of tasks should be executed
according to the timestamp values of the transactions.
 Time-stamp of Transaction Ti is denoted as TS(Ti).
 Read time-stamp of data-item X is denoted by R-timestamp(X).
 Write time-stamp of data-item X is denoted by W-timestamp(X).
Timestamp ordering protocol works as follows:
If a transaction Ti issues read(X) operation:
If TS(Ti) < W-timestamp(X)
o Operation rejected.
If TS(Ti) >= W-timestamp(X)
o Operation executed.
All data-item Timestamps updated.
If a transaction Ti issues write(X) operation:
If TS(Ti) < R-timestamp(X)
o Operation rejected.
If TS(Ti) < W-timestamp(X)
o Operation rejected and Ti rolled back.
Otherwise, operation executed.
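The read and write rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the class DataItem and the functions read and write are hypothetical names, timestamps are plain integers, and a rejected operation simply returns False instead of actually rolling back the transaction.

```python
class DataItem:
    """A data item X with its latest read and write timestamps."""

    def __init__(self):
        self.r_ts = 0  # R-timestamp(X)
        self.w_ts = 0  # W-timestamp(X)


def read(ts, item):
    """Transaction with timestamp ts issues read(X)."""
    if ts < item.w_ts:
        return False  # X was already written by a younger transaction: reject
    item.r_ts = max(item.r_ts, ts)  # record the latest read
    return True


def write(ts, item):
    """Transaction with timestamp ts issues write(X)."""
    if ts < item.r_ts or ts < item.w_ts:
        return False  # a younger transaction already read or wrote X: reject
    item.w_ts = ts  # record the latest write
    return True


x = DataItem()
results = [write(2, x), read(4, x), write(3, x), read(1, x)]
print(results)
```

In this run the write by the transaction with timestamp 3 is rejected because a younger transaction (timestamp 4) has already read X, and the read by the transaction with timestamp 1 is rejected because X was already written at timestamp 2.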

3.10 Data warehouse / Data Warehousing / What is a Data Warehouse?


History of data warehousing
 The concept of data warehousing dates back to the late 1980s when IBM researchers
Barry Devlin and Paul Murphy developed the "business data warehouse".
 1960s - General Mills and Dartmouth College, in a joint research project, develop the
terms dimensions and facts.
 1970s - ACNielsen and IRI provide dimensional data marts for retail sales.
 1983 – Teradata introduces a database management system specifically designed for
decision support.
 1988 - Barry Devlin and Paul Murphy publish the article "An architecture for a business
and information system" in IBM Systems Journal, where they introduce the term "business
data warehouse".
Q)Define data warehousing.

Data Warehouse

 A data warehouse is a collection of corporate information, derived
directly from operational systems and some external data sources. "A
data warehouse is a subject-oriented, integrated, non-volatile, time-variant
collection of data in support of management's decision-making process".

 According to Bill Inmon, "A collection of non-volatile data of different
business subjects and objects, which is time variant and integrated
from various sources and applications and stored in a manner to make
a quick analysis of business situation."

Data warehousing is the process of constructing and using a data warehouse. It is a process
of transforming data into information and making it available to users in a timely enough
manner to make a difference. A data warehouse is constructed by integrating data from
multiple heterogeneous sources that support analytical reporting, structured and/or ad hoc
queries, and decision making. Data warehousing involves data cleaning, data integration,
and data consolidation, and it combines data management with data analysis. Its main goal
is to integrate enterprise-wide corporate data into a single repository from which users can
easily run queries.
• The data are selected from various sources and then integrated and stored in a single,
particular format.
• Data warehouses contain current detailed data, historical detailed data, lightly and highly
summarized data, and metadata.
• Current and historical data are voluminous because they are stored at the highest level of
detail.
• Lightly and highly summarized data are necessary to save processing time when users
request them and are readily accessible.
• Metadata are “data about data”. It is important for designing, constructing, retrieving, and
controlling the warehouse data.
• Technical metadata include where the data come from, how the data were changed,
how the data are organized, how the data are stored, who owns the data, who is
responsible for the data and how to contact them, who can access the data , and the
date of last update.
• Business metadata include what data are available, where the data are, what the data
mean, how to access the data, predefined reports and queries, and how current the data
are.
A producer wants to know….
 Which are our lowest/highest margin customers?
 Who are my customers and what products are they buying?
 Which customers are most likely to go to the competition?
 What impact will new products/services have on revenue and margins?
 What product promotions have the biggest impact on revenue?
 What is the most effective distribution channel?
Features of Data warehousing: Data warehousing is a single, complete and consistent store
of data obtained from a variety of different sources made available to end users in a way
they can understand and use in a business context.
• Subject Oriented: Data that gives information about a particular subject instead of
about a company's ongoing operations.
• Integrated: Data that is gathered into the data warehouse from a variety of sources
and merged into a coherent whole.
• Time-variant: All data in the data warehouse is identified with a particular time
period.
• Non-volatile: Data is stable in a data warehouse. More data is added but data is never
removed. This enables management to gain a consistent picture of the business.
• Data warehousing is combining data from multiple and usually varied sources into one
comprehensive and easily manipulated database.
• Common accessing systems of data warehousing include queries, analysis and
reporting.
• Because data warehousing creates one database in the end, the number of sources can
be anything you want it to be, provided that the system can handle the volume, of
course.
• The final result, however, is homogeneous data, which can be more easily
manipulated.
• It is a relational or multidimensional database management system designed to
support management decision making.
• A data warehouse is a copy of transaction data specifically structured for querying
and reporting.
• It is a technique for assembling and managing data from various sources for the purpose
of answering business questions, thus enabling decisions that were not previously possible.
3.10.1 Benefits / Business advantages of Data warehouse: There are decision support
technologies that help utilize the data available in a data warehouse. These technologies help
executives to use the warehouse quickly and effectively. They can gather data, analyze it,
and take decisions based on the information present in the warehouse. The information
gathered in a warehouse can be used in any of the following domains:
 Tuning Production Strategies - The product strategies can be well tuned by repositioning
the products and managing the product portfolios by comparing the sales quarterly or yearly.
 Customer Analysis - Customer analysis is done by analyzing the customer's buying
preferences, buying time, budget cycles, etc.
 Operations Analysis - Data warehousing also helps in customer relationship management,
and making environmental corrections. The information also allows us to analyze business
operations.
 High returns on investment.
 Increased productivity of corporate decision-makers.
 It provides business users with a “customer-centric” view of the company’s heterogeneous
data by helping to integrate data from sales, service, manufacturing and distribution, and
other customer-related business systems.
 It provides added value to the company’s customers by allowing them to access better
information when data warehousing is coupled with internet technology.
 It consolidates data about individual customers and provides a repository of all customer
contacts for segmentation modeling, customer retention planning, and cross sales analysis.
 It removes barriers among functional areas by offering a way to reconcile views from
multiple areas, thus providing a look at activities that cross functional lines.
 It reports on trends across multidivisional, multinational operating units, including trends or
relationships in areas such as merchandising, production planning etc.
Strategic uses of data warehousing

Industry | Functional areas of use | Strategic use
Airline | Operations; marketing | Crew assignment, aircraft development, mix of fares, analysis of route profitability, frequent flyer program promotions
Banking | Product development; Operations; marketing | Customer service, trend analysis, product and service promotions, reduction of IS expenses
Credit card | Product development; marketing | Customer service, new information service, fraud detection
Health care | Operations | Reduction of operational expenses
Investment and Insurance | Product development; Operations; marketing | Risk management, market movements analysis, customer tendencies analysis, portfolio management
Retail chain | Distribution; marketing | Trend analysis, buying pattern analysis, pricing policy, inventory control, sales promotions, optimal distribution channel
Telecommunications | Product development; Operations; marketing | New product and service promotions, reduction of IS budget, profitability analysis
Personal care | Distribution; marketing | Distribution decisions, product promotions, sales decisions, pricing policy
Public sector | Operations | Intelligence gathering

3.10.2 Problems / Disadvantages of data warehouses


 Underestimation of resources for data loading
 Hidden problems with source systems
 Required data not captured
 Increased end-user demands
 Data homogenization
 High demand for resources
 Data ownership
 High maintenance
 Long-duration projects
 Complexity of integration
 Data warehouses are not the optimal environment for unstructured data.
 Because data must be extracted, transformed and loaded into the warehouse, there is an
element of latency in data warehouse data.
 Over their life, data warehouses can have high costs. Maintenance costs are high.
 Data warehouses can get outdated relatively quickly. There is a cost of delivering
suboptimal information to the organization.
 There is often a fine line between data warehouses and operational systems. Duplicate,
expensive functionality may be developed. Or, functionality may be developed in the data
warehouse that, in retrospect, should have been developed in the operational systems and
vice versa.
3.10.3 Main components
 Operational data sources: Data for the DW is supplied from mainframe operational data
held in first generation hierarchical and network databases, departmental data held in
proprietary file systems, private data held on workstations and private servers, and
external systems such as the Internet, commercially available databases, or databases
associated with an organization's suppliers or customers.
 Operational datastore (ODS): ODS is a repository of current and integrated operational
data used for analysis. It is often structured and supplied with data in the same way as the
data warehouse, but may in fact simply act as a staging area for data to be moved into the
warehouse.
 Query manager: The query manager, also called the backend component, performs all
the operations associated with the management of user queries. The operations performed
by this component include directing queries to the appropriate tables and scheduling the
execution of queries.
 End-user access tools: End-user access tools can be categorized into five main groups:
data reporting and query tools, application development tools, executive information
system (EIS) tools, online analytical processing (OLAP) tools, and data mining tools.
Data warehousing Concepts: Several concepts are of particular importance to data
warehousing. They are discussed in detail in this section.
 Dimensional Data Model: Dimensional data model is most commonly used in data
warehousing systems. This modeling technique includes two common schema types, star
schema and snowflake schema. This is different from the 3rd normal form, commonly used
for transactional (OLTP) type systems. The same data would be stored differently in a
dimensional model than in a 3rd normal form model. To understand dimensional data
modeling, some of the terms commonly used in this type of modeling are:
 Dimension: A category of information. For example, the time dimension.
 Attribute: A unique level within a dimension. For example, Month is an attribute in the
Time Dimension.
 Hierarchy: The specification of levels that represents relationship between different
attributes within a dimension. For example, one possible hierarchy in the Time dimension is
Year → Quarter → Month → Day.
 Fact Table: A fact table is a table that contains the measures of interest. For example, sales
amount would be such a measure. This measure is stored in the fact table with the
appropriate granularity. For example, it can be sales amount by store by day. In this case,
the fact table would contain three columns: A date column, a store column, and a sales
amount column.
 Lookup Table: The lookup table provides the detailed information about the attributes.
For example, the lookup table for the Quarter attribute would include a list of all of the
quarters available in the data warehouse. Each row (each quarter) may have several fields,
one for the unique ID that identifies the quarter, and one or more additional fields that
specifies how that particular quarter is represented on a report (for example, first quarter of
2001 may be represented as "Q1 2001" or "2001 Q1").
 A dimensional model includes fact tables and lookup tables. Fact tables connect to one or
more lookup tables, but fact tables do not have direct relationships to one another.
Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key
columns in the lookup tables.
 In designing data models for data warehouses / data marts, the most commonly used
schema types are Star Schema and Snowflake Schema.
 Whether one uses a star or a snowflake largely depends on personal preference and
business needs.
 Star Schema: In the star schema design, a single object (the fact table) sits in the
middle and is radially connected to other surrounding objects (dimension lookup
tables) like a star. Each dimension is represented as a single table. The primary key in
each dimension table is related to a foreign key in the fact table.
Sample star schema

All measures in the fact table are related to all the dimensions that fact table is related
to. In other words, they all have the same level of granularity. A star schema can be
simple or complex. A simple star consists of one fact table; a complex star can have
more than one fact table. Let's look at an example: Assume data warehouse keeps
store sales data, and the different dimensions are time, store, product, and customer.
In this case, the figure on the left represents our star schema. The lines between two
tables indicate that there is a primary key / foreign key relationship between the two
tables. Note that different dimensions are not related to one another.
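A star schema of this kind can be sketched with Python's built-in sqlite3 module. To keep the sketch short it uses only two of the four dimensions mentioned above (store and product). The table names, column names and rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension (lookup) tables: one table per dimension.
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);

-- Fact table: foreign keys to each dimension plus the measure of interest.
CREATE TABLE fact_sales (
    store_id INTEGER REFERENCES dim_store(store_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    sales_amount REAL
);

INSERT INTO dim_store VALUES (1, 'Chennai'), (2, 'Pune');
INSERT INTO dim_product VALUES (1, 'Grocery'), (2, 'Apparel');
INSERT INTO fact_sales VALUES (1, 1, 100.0), (1, 2, 40.0), (2, 1, 60.0), (2, 1, 25.0);
""")

# The fact table joins to each dimension; the dimensions do not join to each other.
totals = conn.execute("""
SELECT s.city, p.category, SUM(f.sales_amount)
FROM fact_sales AS f
JOIN dim_store AS s ON s.store_id = f.store_id
JOIN dim_product AS p ON p.product_id = f.product_id
GROUP BY s.city, p.category
ORDER BY s.city, p.category
""").fetchall()
print(totals)
```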
 Snowflake Schema: The snowflake schema is an extension of the star schema, where
each point of the star explodes into more points. In a star schema, each dimension is
represented by a single dimensional table, whereas in a snowflake schema, that
dimensional table is normalized into multiple lookup tables, each representing a level
in the dimensional hierarchy.
Sample snowflake schema
For example, consider a Time Dimension that consists of 2 different hierarchies:
1. Year → Month → Day
2. Week → Day
This gives 4 lookup tables in a snowflake schema: a lookup table for year, a lookup table
for month, a lookup table for week, and a lookup table for day. Year is connected to Month,
which is then connected to Day. Week is only connected to Day. A sample snowflake
schema illustrating the above relationships in the Time Dimension is shown to the right.
The main advantage of the snowflake schema is the improvement in query
performance due to minimized disk storage requirements and joining smaller lookup
tables. The main disadvantage of the snowflake schema is the additional maintenance
effort needed due to the increased number of lookup tables.
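The four lookup tables of the Time Dimension can be sketched with sqlite3 as follows. The table and column names and the single sample date are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lookup_year (year_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE lookup_month (
    month_id INTEGER PRIMARY KEY,
    month_name TEXT,
    year_id INTEGER REFERENCES lookup_year(year_id)    -- Year -> Month
);
CREATE TABLE lookup_week (week_id INTEGER PRIMARY KEY, week_label TEXT);
CREATE TABLE lookup_day (
    day_id INTEGER PRIMARY KEY,
    day_date TEXT,
    month_id INTEGER REFERENCES lookup_month(month_id), -- Month -> Day
    week_id INTEGER REFERENCES lookup_week(week_id)     -- Week -> Day
);
INSERT INTO lookup_year VALUES (1, 2017);
INSERT INTO lookup_month VALUES (1, 'February', 1);
INSERT INTO lookup_week VALUES (1, '2017-W06');
INSERT INTO lookup_day VALUES (1, '2017-02-06', 1, 1);
""")

# Walking both hierarchies from Day requires joining the smaller lookup tables.
row = conn.execute("""
SELECT d.day_date, m.month_name, y.year, w.week_label
FROM lookup_day AS d
JOIN lookup_month AS m ON m.month_id = d.month_id
JOIN lookup_year AS y ON y.year_id = m.year_id
JOIN lookup_week AS w ON w.week_id = d.week_id
""").fetchone()
print(row)
```

In a star schema the same information would sit in one time dimension table. The snowflake version normalizes it at the price of extra joins.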
 Slowly Changing Dimension: This is a common issue facing data warehousing
practitioners.
 Conceptual Data Model: What is a conceptual data model, its features, and an example of
this type of data model.
 Logical Data Model: What is a logical data model, its features, and an example of this type
of data model.
 Physical Data Model: What is a physical data model, its features, and an example of this
type of data model.
 Conceptual, Logical, and Physical Data Model: Different levels of abstraction for a data
model.
 Data Integrity: What is data integrity and how it is enforced in data warehousing.
 What is OLAP: Definition of OLAP.
o OLTP: Online Transaction Processing
o Special data organization, access methods and implementation methods are needed to
support data warehouse queries (typically multidimensional queries)
o OLTP systems are tuned for known transactions and workloads
o OLTP Systems are used to “run” a business
o e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of
December
OLTP vs Data Warehouse:
• OLTP • Data Warehouse (DSS)
• Application Oriented • Subject Oriented
• Used to run business • Used to analyze business
• Detailed data • Summarized and refined
• Current up to date • Snapshot data
• Isolated Data • Integrated Data
• Clerical User • Knowledge User (Manager)
• Few Records accessed at a time (tens) • Large volumes accessed at a time (millions)
• Read/Update Access • Mostly Read (Batch Update)
• No data redundancy • Redundancy present
• Database Size 100MB -100 GB • Database Size \100 GB - few terabytes
• Transaction throughput is the performance metric • Query throughput is the performance metric
• Thousands of users • Hundreds of users
• Managed in entirety • Managed by subsets
• OLTP Systems are • The Data Warehouse helps to “optimize” the
used to “run” a business business

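The contrast between the two workloads can be illustrated with a small sketch using Python's sqlite3 module. The calls table, its columns and the sample rows are hypothetical. The first statement is an OLTP-style transaction that touches one record, and the second is a warehouse-style query that scans and summarizes many records (the phone-call example mentioned above).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE calls (
    call_id   INTEGER PRIMARY KEY,
    city      TEXT,
    call_time TEXT,   -- 'YYYY-MM-DD HH:MM:SS'
    amount    REAL)""")

# Hypothetical sample data.
rows = [
    ("Pune",    "2016-12-05 10:15:00", 12.0),
    ("Pune",    "2016-12-20 16:40:00",  8.0),
    ("Pune",    "2016-12-21 20:05:00", 30.0),  # outside 9AM-5PM
    ("Chennai", "2016-12-05 11:00:00", 50.0),  # different city
    ("Pune",    "2016-11-30 12:00:00", 40.0),  # different month
]
with conn:
    conn.executemany(
        "INSERT INTO calls (city, call_time, amount) VALUES (?, ?, ?)", rows)

# OLTP style: a short, known transaction that updates a single record.
with conn:
    conn.execute("UPDATE calls SET amount = 10.0 WHERE call_id = ?", (2,))

# Warehouse style: a read-only query that summarizes many records.
avg_amount = conn.execute("""
    SELECT AVG(amount)
    FROM calls
    WHERE city = 'Pune'
      AND strftime('%m', call_time) = '12'
      AND time(call_time) >= '09:00:00'
      AND time(call_time) <  '17:00:00'
""").fetchone()[0]

print(avg_amount)
```

In practice the two workloads would run on separate systems; a single table is used here only to keep the sketch short.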
 Bill Inmon vs. Ralph Kimball: These two data warehousing heavyweights have a different
view of the role between data warehouse and data mart.
 Factless Fact Table: A fact table without any fact may sound silly, but there are real life
instances when a factless fact table is useful in data warehousing.
 Junk Dimension: Discusses the concept of a junk dimension: When to use it and why it is
useful.
 Conformed Dimension: Discusses the concept of a conformed dimension: What is it and
why is it important.
3.10.4 Data flow
 Inflow: The processes associated with the extraction, cleansing, and loading of the data
from the source systems into the data warehouse.
 Upflow: The process associated with adding value to the data in the warehouse through
summarizing, packaging, and distribution of the data.
 Downflow: The processes associated with archiving and backing-up of data in the
warehouse.
3.10.5 Tools and Technologies
The critical steps in the construction of a data warehouse:
 Extraction
 Cleansing
 Transformation
After the critical steps, loading the results into the target system can be carried out either by
separate products or by a single integrated solution from one of the following categories:
 code generators
 database data replication tools
 dynamic transformation engines
For the various types of meta-data and the day-to-day operations of the data warehouse, the
administration and management tools must be capable of supporting those tasks:
 Monitoring data loading from multiple sources
 Data quality and integrity checks
 Managing and updating meta-data
 Monitoring database performance to ensure efficient query response times and resource
utilization
 Auditing data warehouse usage to provide user chargeback information
 Replicating, subsetting, and distributing data
 Maintaining efficient data storage management
 Purging data
 Archiving and backing-up data
 Implementing recovery following failure
Virtual Warehouse: A set of views over operational databases is known as a virtual
warehouse. A virtual warehouse is easy to build, but because its queries run against the
operational systems, it requires excess capacity on the operational database servers.
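A minimal sketch of the idea, using Python's sqlite3 module: the orders table and the sales_by_region view are hypothetical names. The view stores no data of its own, so every query against it is recomputed on the operational server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Operational table (hypothetical).
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'East', 100.0), (2, 'East', 50.0), (3, 'West', 70.0);

-- The "virtual warehouse": a view defined directly over the operational table.
CREATE VIEW sales_by_region AS
    SELECT region, SUM(amount) AS total
    FROM orders
    GROUP BY region;
""")

before = dict(conn.execute("SELECT region, total FROM sales_by_region"))

# A new operational transaction is visible in the view immediately,
# because the view is evaluated at query time.
conn.execute("INSERT INTO orders VALUES (4, 'West', 30.0)")
after = dict(conn.execute("SELECT region, total FROM sales_by_region"))

print(before)
print(after)
```

Because each analytical query aggregates the operational table again, the analysis competes with day-to-day transactions for server resources.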
3.11 Data Mart
A data mart is a simple form of a data warehouse that is focused on a single subject (or
functional area), such as sales, finance or marketing. Data marts are often built and
controlled by a single department within an organization. Given their single-subject focus,
data marts usually draw data from only a few sources. The sources could be internal
operational systems, a central data warehouse, or external data. Data marts contain a subset
of organization-wide data that is valuable to specific groups of people in an organization. In
other words, a data mart contains only the data that is specific to a particular group. For
example, the marketing data mart may contain only data related to items, customers, and
sales. Data marts are confined to subjects.
• A data mart is a scaled down version of a data warehouse that focuses on a particular
subject area.
• A data mart is a subset of an organizational data store, usually oriented to a specific
purpose or major data subject, that may be distributed to support business needs.
• Implemented as the first step in proving the usefulness of the technologies to solve
business problems
Reasons for creating a data mart
• Easy access to frequently needed data
• Creates a collective view for a group of users
• Improves end-user response time
• Ease of creation in less time
• Lower cost than implementing a full Data warehouse
• Potential users are more clearly defined than in a full Data warehouse
The following figure shows a graphical representation of data marts.
Figure: From the Data Warehouse to Data Marts
3.11.1 Dependent and Independent Data Marts
There are two basic types of data marts: dependent and independent. The categorization is
based primarily on the data source that feeds the data mart. Dependent data marts draw data
from a central data warehouse that has already been created. Independent data marts, in
contrast, are standalone systems built by drawing data directly from operational or external
sources of data, or both.
The main difference between independent and dependent data marts is how the data mart is
populated, i.e., how data is moved out of the sources and into the data mart. This step is called
the Extraction, Transformation, and Loading (ETL) process, which involves moving data
from operational systems, filtering it, and loading it into the data mart. With dependent data
marts, this process is somewhat simplified because formatted and summarized (clean) data
has already been loaded into the central data warehouse.
The ETL process for dependent data marts is mostly a process of identifying the right subset
of data relevant to the chosen data mart subject and moving a copy of it, perhaps in a
summarized form. With independent data marts, however, you must deal with all aspects of
the ETL process, much as you do with a central data warehouse. However, the number of
sources is likely to be smaller, and the amount of data associated with the data mart is less than
in the warehouse, given the focus on a single subject. The motivations behind the creation of
these two types of data marts are also typically different.
Dependent data marts are usually built to achieve improved performance and availability,
better control, and lower telecommunication costs resulting from local access of data
relevant to a specific department. The creation of independent data marts is often driven by
the need to have a solution within a shorter time.
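For a dependent data mart, the ETL step reduces to selecting the relevant subset from the warehouse and copying it, perhaps in summarized form. The sketch below uses Python's sqlite3 module with two in-memory databases; the sales_fact and marketing_sales tables and their contents are hypothetical.

```python
import sqlite3

# Central data warehouse: already cleaned and integrated (hypothetical data).
warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
CREATE TABLE sales_fact (dept TEXT, region TEXT, amount REAL);
INSERT INTO sales_fact VALUES
    ('marketing', 'East', 100.0),
    ('marketing', 'East',  40.0),
    ('marketing', 'West',  25.0),
    ('finance',   'East', 999.0);
""")

# Dependent data mart for the marketing department.
mart = sqlite3.connect(":memory:")
mart.execute("CREATE TABLE marketing_sales (region TEXT PRIMARY KEY, total REAL)")

# Identify the right subset and summarize it; no cleansing is needed
# because the warehouse data is already clean.
subset = warehouse.execute(
    "SELECT region, SUM(amount) FROM sales_fact WHERE dept = ? GROUP BY region",
    ("marketing",),
).fetchall()

with mart:
    mart.executemany("INSERT INTO marketing_sales VALUES (?, ?)", subset)

mart_rows = mart.execute(
    "SELECT region, total FROM marketing_sales ORDER BY region").fetchall()
print(mart_rows)
```

An independent data mart would instead read from the operational or external sources directly and would have to perform the full extraction, cleansing and transformation itself.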
3.11.2 Features of a Data Mart
 Windows-based or Unix/Linux-based servers are used to implement data marts. They are
implemented on low-cost servers.
 The implementation cycle of a data mart is measured in short periods of time, i.e., in
weeks rather than months or years.
 The life cycle of data marts may be complex in the long run, if their planning and design
are not organization-wide.
 Data marts are small in size.
 Data marts are customized by department.
 The source of a data mart is a departmentally structured data warehouse.
 Data marts are flexible.
3.11.3 Steps in Implementing a Data Mart
Simply stated, the major steps in implementing a data mart are to design the schema,
construct the physical storage, populate the data mart with data from source systems, access
it to make informed decisions, and manage it over time.
1) Designing
2) Constructing
3) Populating
4) Accessing
5) Managing
1. Designing: The design step is first in the data mart process. This step covers all of the
tasks from initiating the request for a data mart through gathering information about the
requirements, and developing the logical and physical design of the data mart. The design
step involves the following tasks:
 Gathering the business and technical requirements
 Identifying data sources
 Selecting the appropriate subset of data
 Designing the logical and physical structure of the data mart
2. Constructing: This step includes creating the physical database and the logical structures
associated with the data mart to provide fast and efficient access to the data. This step
involves the following tasks:
 Creating the physical database and storage structures, such as table spaces, associated
with the data mart
 Creating the schema objects, such as tables and indexes defined in the design step
 Determining how best to set up the tables and the access structures
3. Populating: The populating step covers all of the tasks related to getting the data from
the source, cleaning it up, modifying it to the right format and level of detail, and moving
it into the data mart. More formally stated, the populating step involves the following
tasks:
 Mapping data sources to target data structures
 Extracting data
 Cleansing and transforming the data
 Loading data into the data mart
 Creating and storing metadata
4. Accessing: The accessing step involves putting the data to use: querying the data,
analyzing it, creating reports, charts, and graphs, and publishing these. Typically, the end
user uses a graphical front-end tool to submit queries to the database and display the
results of the queries. The accessing step requires that you perform the following tasks:
 Set up an intermediate layer for the front-end tool to use. This layer, the metalayer,
translates database structures and object names into business terms, so that the end user
can interact with the data mart using terms that relate to the business function.
 Maintain and manage these business interfaces.
 Set up and manage database structures, like summarized tables that help queries
submitted through the front-end tool execute quickly and efficiently.
5. Managing: This step involves managing the data mart over its lifetime. In this step, you
perform management tasks such as the following:
 Providing secure access to the data
 Managing the growth of the data
 Optimizing the system for better performance
 Ensuring the availability of data even with system failures
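The constructing, populating and accessing steps above can be sketched in a few lines using Python's sqlite3 module. Everything named here (the source rows, the sales and sales_by_item tables, the cleanse function and the METALAYER mapping) is a hypothetical illustration, not a prescribed design.

```python
import sqlite3

# Hypothetical raw rows extracted from a source system.
source_rows = [
    {"cust": " alice ", "item": "Dishwasher", "amt": "450.00"},
    {"cust": "BOB",     "item": "dishwasher", "amt": "500.00"},
    {"cust": "Carol",   "item": "Toaster",    "amt": None},     # missing amount
    {"cust": "alice",   "item": "Toaster",    "amt": "30.50"},
]

def cleanse(row):
    """Cleanse and transform one row; return None to reject it."""
    if row["amt"] is None:
        return None
    return (row["cust"].strip().title(),
            row["item"].strip().title(),
            float(row["amt"]))

mart = sqlite3.connect(":memory:")

# Constructing: create the physical tables and access structures.
mart.execute("CREATE TABLE sales (customer TEXT, item TEXT, amount REAL)")
mart.execute("CREATE INDEX idx_sales_item ON sales(item)")

# Populating: extract, cleanse, transform and load.
clean_rows = [r for r in map(cleanse, source_rows) if r is not None]
with mart:
    mart.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean_rows)

# A summarized table that helps front-end queries run quickly.
mart.execute("""CREATE TABLE sales_by_item AS
                SELECT item, SUM(amount) AS total FROM sales GROUP BY item""")

# Accessing: a tiny metalayer that maps business terms to column names.
METALAYER = {"Product": "item", "Revenue": "total"}

def report(dimension, measure):
    dim_col, measure_col = METALAYER[dimension], METALAYER[measure]
    sql = f"SELECT {dim_col}, {measure_col} FROM sales_by_item ORDER BY {dim_col}"
    return mart.execute(sql).fetchall()

result = report("Product", "Revenue")
print(result)
```

The end user asks for "Revenue" by "Product" and never sees the underlying column names. Column names are taken only from the fixed METALAYER mapping, so no user-supplied text is placed into the SQL string.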
3.11.4 Data Mart issues
 Data mart functionality: the capabilities of data marts have increased with the growth in their
popularity
 Data mart size: performance deteriorates as data marts grow in size, so their size needs to
be reduced to gain improvements in performance
 Data mart load performance: two critical components are end-user response time and
data loading performance. Data loading performance can be improved by incremental
database updating, so that only the cells affected by a change are updated and not the
entire multidimensional database (MDDB) structure.
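Incremental updating can be illustrated with a simplified model in which the multidimensional structure is a Python dictionary keyed by (product, region). The function names and sample facts are hypothetical; a real MDDB product would use its own storage format.

```python
from collections import defaultdict

def full_rebuild(facts):
    """Recompute every cell from all facts (the expensive approach)."""
    cube = defaultdict(float)
    for product, region, amount in facts:
        cube[(product, region)] += amount
    return dict(cube)

def incremental_update(cube, new_facts):
    """Update only the cells affected by new facts; return the touched cells."""
    touched = set()
    for product, region, amount in new_facts:
        key = (product, region)
        cube[key] = cube.get(key, 0.0) + amount
        touched.add(key)
    return touched

facts = [
    ("Dishwasher", "East", 100.0),
    ("Dishwasher", "West",  80.0),
    ("Toaster",    "East",  20.0),
]
cube = full_rebuild(facts)

new_facts = [("Toaster", "East", 5.0)]
touched = incremental_update(cube, new_facts)

print(touched)
print(cube)
```

Only one cell is rewritten, yet the result is the same as rebuilding the whole structure from the combined facts.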
4. How do you use databases to improve business performance and decision making? (Unit-3)
Using Databases to Improve Business Performance and Decision-Making
Businesses use their databases to:
 Keep track of basic transactions
 Provide information that will help the company run the business more efficiently
 Help managers and employees make better decisions
Figure: Components Of A Data Warehouse
In a large company, special capabilities and tools are required for analyzing vast quantities of data and for accessing
data from multiple systems, such as:
 Data warehouse: a database that stores current and historical data from core operational transactional
systems for use in management analysis, but this data cannot be altered.
 Data mart: A subset of a data warehouse in which a summarized or highly focused portion of the
organization's data is placed in a separate database for a specific population of users.
 Business intelligence (BI) tools: Data analysis tools used for consolidating, analyzing, and accessing vast
stores of data to help in decision making, such as software for database query and reporting, tools for
multidimensional data analysis (online analytical processing), and data mining.
The data warehouse extracts current and historical data from multiple operational systems inside the organization.
These data are combined with data from external sources and reorganized into a central database designed for
management reporting and analysis. The information directory provides users with information about the data
available in the warehouse.
Figure: Business Intelligence
A series of analytical tools works with data stored in databases to find patterns and insights for helping managers and
employees make better decisions to improve organizational performance.
Online Analytical Processing (OLAP) supports multidimensional data analysis, enabling users to view the same
data in different ways using multiple dimensions, for example: How many dishwashers were sold in the East in June?
Figure: Multidimensional Data Model
The view that is showing is product versus region. If you rotate the cube 90 degrees, the face that will show is product
versus actual and projected sales. If you rotate the cube 90 degrees again, you will see region versus actual and
projected sales. Other views are possible.
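The cube rotation described above can be sketched in plain Python. The fact rows, the DIMS mapping and the view function are hypothetical. Each call to view picks two dimensions as the row and column of a two-dimensional face of the cube and sums the units that fall into each cell.

```python
from collections import defaultdict

# Each fact: (product, region, measure, units). Sample data is hypothetical.
facts = [
    ("Dishwasher", "East", "actual",    120),
    ("Dishwasher", "East", "projected", 100),
    ("Dishwasher", "West", "actual",     80),
    ("Dishwasher", "West", "projected",  90),
    ("Dryer",      "East", "actual",     60),
    ("Dryer",      "East", "projected",  70),
]
DIMS = {"product": 0, "region": 1, "measure": 2}

def view(rows, row_dim, col_dim, where=None):
    """Return one face of the cube as {(row_value, col_value): total_units}."""
    where = where or {}
    table = defaultdict(int)
    for fact in rows:
        # Keep only facts that match every filter, e.g. measure == 'actual'.
        if all(fact[DIMS[d]] == v for d, v in where.items()):
            table[(fact[DIMS[row_dim]], fact[DIMS[col_dim]])] += fact[3]
    return dict(table)

# Product versus region (actual sales only).
product_by_region = view(facts, "product", "region", where={"measure": "actual"})
# "Rotate" the cube: product versus actual and projected sales.
product_by_measure = view(facts, "product", "measure")
# Rotate again: region versus actual and projected sales.
region_by_measure = view(facts, "region", "measure")

print(product_by_region)
print(product_by_measure)
print(region_by_measure)
```

A question such as the dishwasher example would add a month dimension to each fact and pass it as a further filter in where.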
Data mining finds hidden patterns and relationships and infers rules from these to predict future behavior. The types
of information obtainable from data mining include
 Associations
 Sequences
 Classifications
 Clustering
 Forecasting
Predictive analysis uses data mining techniques, historical data, and assumptions about future conditions to predict
outcomes of events, such as the probability a customer will respond to an offer or purchase a specific product.
Databases can be linked to the Web by using middleware software products, allowing users or clients to access
corporate data through a Web browser interface. Such software might consist of an application server, a custom
software program, or CGI (common gateway interface) scripts. In a client/server environment, the DBMS might
reside on a special dedicated computer called a database server. Web interfaces are easy to use and require few or no
changes to the internal database.
Figure: Linking Internal Databases To The Web
5. What do you understand by data warehousing & Data Mart? What are the advantages of Data warehousing?
(Unit – 3)
 A database is a structured collection of data. It can be anything from a list of names in a text file to a relational
database. It is commonly confused with the database management system. For example, MySQL is a relational
database management system, and the data you store in it is the database, so the common statement ‘I use MySQL
as my database’ is not strictly correct.
 A data warehouse is a structured collection of [ideally] all of the organisation’s data.
The concept of a data warehouse is not difficult to understand. Basically, the idea is to create a permanent storage
space for the data needed to support reporting, analysis, and other BI functions. Although it may seem wasteful to
store data in multiple places (source systems and the data warehouse), the many advantages of doing so more than
justify the effort and expense.
Data warehouses reside on servers dedicated to this function running a database management system (DBMS)
such as SQL Server and using Extract, Transform, and Load (ETL) software such as SQL Server Integration
Services (SSIS) to pull data from the source systems and into the data warehouse.
Benefits of a Data Warehouse and BI solution: Once a data warehouse is in place and populated with data, it will
become a part of a BI solution that will deliver benefits to business users in many ways:
 End user creation of reports: The creation of reports directly by end users is much easier to accomplish in a BI
environment. They can also create much more useful reports because of the power and capability of BI tools
compared to a source application. And moving the creation of reports to a BI system increases consistency and
accuracy and usually reduces cost
 Ad-hoc reporting and analysis: Since the data warehouse eliminates the need for BI tools to compete with the
transactional source systems, users can analyze data faster and generate reports more easily, and slice-and-dice in
ways they could never do before. The Microsoft BI toolset vastly improves the ability to analyze data
 Dynamic presentation through dashboards: Managers want access to an interactive display of up-to-date
critical management data. That is accomplished via dashboards, which are sophisticated displays that show
information in creative and highly graphical forms, much like the instrument panel on an automobile
 Drill-down capability: Users can drill down into the details underlying the summaries on dashboards and
reports. This allows users to slice and dice to find underlying problems
 Support for regulations: Sarbanes-Oxley and other related regulations have requirements that transactional
systems are sometimes not able to support. With a data warehouse, the necessary data can be retained as long as
the law requires
 Metadata creation: Descriptions of the data can be stored with the data warehouse to make it a lot easier for
users to understand the data in the warehouse. This will make report creation much simpler for the end-user
 Support for operational processes: A data warehouse can help support business needs, such as the ability to
consolidate financial results within a complex company that uses different software for different divisions
 Data mining: Once you have built out a data warehouse, there are data mining tools that you can use to help find
hidden patterns using automatic methodologies. While reporting tools can tell you where you have been, data
mining tools can tell you where you are going
 Security: A data warehouse makes it much easier to provide secure access to those that have a legitimate need to
specific data and to exclude others
 Analytical tool support: There are many vendors who have analytical tools (e.g., QlikView, Tableau) that allow
business units to slice and dice the data and create reports and dashboards. These tools will all work best when
extracting data from a data warehouse
This long list of benefits is what makes BI based on a data warehouse an essential management tool for companies.
A data mart is the access layer of the data warehouse environment that is used to get data out to the users. The data
mart is a subset of the data warehouse and is usually oriented to a specific business line or team. Whereas data
warehouses have an enterprise-wide depth, the information in data marts pertains to a single department. While
transactional databases are designed to be updated, data warehouses or marts are read only. Data mart is a subset
of the data warehouse structured to allow easy user access.