
08/16/15

TCS Confidential

Course Roadmap

Why we use Data Warehousing
Difference between Operational System and Data Warehouse
Introduction to Data Warehousing
Emergence of Decision Support Systems
Data Warehousing Approaches
Data Warehouse Technical Architecture
Data Modelling concepts
Operational Data Store
Schema Design of Data Warehouse
Data Acquisition

Why We Need Data Warehousing ?


Better business intelligence for end-users
Reduction in time to locate, access, and analyze information
Consolidation of disparate information sources
To store large volumes of historical detail data from mission-critical applications
Strategic advantage over competitors
Faster time-to-market for products and services
Replacement of older, less-responsive decision support systems
Reduction in demand on IS to generate reports

OPERATIONAL DATABASE:
Online Transaction Processing (OLTP)
Designed for running the business. It is not suitable for analyzing the business from the perspective of business executives, because the data is volatile (it keeps changing).
It does not maintain historical data.
It contains only current data.
If you insert a new value, it overwrites the existing one.
Eg:
Acnthno 1072: Acnthsal 13,000 is overwritten by 20,000
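The overwrite behaviour in the example can be sketched in Python. This is only an illustration: the dictionary, the history list, the function name and the dates are hypothetical and do not come from the slides. It contrasts an operational store that keeps only the current value with a warehouse-style history that keeps every loaded value.

```python
from datetime import date

# Operational store: one entry per account, holding only the current value.
operational = {1072: 13_000}

# Warehouse-style history: every value is kept together with its load date.
history = [(1072, 13_000, date(2015, 1, 1))]  # initial load (date is made up)


def update_salary(account_no, new_salary, as_of):
    """Overwrite the operational value and append a snapshot to the history."""
    operational[account_no] = new_salary  # the old value is lost here
    history.append((account_no, new_salary, as_of))


update_salary(1072, 20_000, date(2015, 8, 16))

print(operational[1072])            # only the current value survives
print([row[1] for row in history])  # both values are still available
```

After the update the operational store can no longer answer "what was the value before?", while the history list still can.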

OLTP Systems Vs Data Warehouse


Users are different
Data content is different
Data structures are different
Hardware is different
Understanding The Differences Is The Key

OLTP Vs Data Warehouse

Operational System | Data Warehouse
Transaction Processing | Query Processing
Predictable CPU Usage | Random CPU Usage
Time Sensitive | History Oriented
Operator View | Managerial View
Normalized, Efficient Design for TP | Denormalized Design for Query Processing

OLTP Vs Warehouse

Operational System | Data Warehouse
Designed for Atomicity, Consistency, Isolation and Durability | Designed for quiet or static database
Organized by transactions (Order, Input, Inventory) | Organized by subject (Customer, Product)
Relatively smaller database | Large database size
Many concurrent users | Relatively few concurrent users
Volatile Data | Non Volatile Data

Operational System | Data Warehouse
Stores all data | Stores relevant data
Performance Sensitive | Less Sensitive to performance
Not Flexible | Flexible
Efficiency | Effectiveness

What is a Data Warehouse ?


A Data Warehouse is a
Subject-Oriented
Integrated
Time-Variant
Non-volatile
collection of data in support of management's decision-making process.
W. H. Inmon - regarded as the father of data warehousing

Subject Oriented Analysis


Process Oriented (Transactional Storage)
Entry: Sales Rep, Quantity Sold, Part Number, Date, Customer Name, Product Description, Unit Price, Mail Address

Subject Oriented (Data Warehouse Storage)
Sales
Customers
Products

Integration of Data
Aspect | Transactional Storage | Data Warehouse Storage (after integration)
Encoding | Appl. A - M, F; Appl. B - 1, 0; Appl. C - X, Y | M, F
Unit of Attributes | Appl. A - pipeline cm; Appl. B - pipeline inches; Appl. C - pipeline mcf | pipeline cm
Physical Attributes | Appl. A - balance dec(13,2); Appl. B - balance PIC 9(9)V99; Appl. C - balance float | balance dec(13,2)
Naming Conventions | Appl. A - bal-on-hand; Appl. B - current_balance; Appl. C - balance | balance
Data Consistency | Appl. A - date (Julian); Appl. B - date (yymmdd); Appl. C - date (absolute) | date (Julian)
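The integration step can be illustrated with a small Python sketch. It only covers the encoding and unit rows, and it rests on assumptions that the slides do not state: which of 1/0 and X/Y stands for M or F, the record layout, and the function name are all hypothetical. Application C's mcf values are left out because the source gives no conversion for them.

```python
# Hypothetical per-application gender encodings, mapped to the warehouse's "M"/"F".
# Which source code means "M" and which means "F" is an assumption.
GENDER_MAP = {
    "A": {"M": "M", "F": "F"},
    "B": {"1": "M", "0": "F"},
    "C": {"X": "M", "Y": "F"},
}
CM_PER_INCH = 2.54


def integrate(app, record):
    """Convert one source record to the warehouse encoding and unit (cm)."""
    gender = GENDER_MAP[app][str(record["gender"])]
    length = record["pipeline"]
    if app == "B":  # application B measures the pipeline in inches
        length = length * CM_PER_INCH
    return {"gender": gender, "pipeline_cm": round(length, 2)}


print(integrate("A", {"gender": "F", "pipeline": 30}))
print(integrate("B", {"gender": 1, "pipeline": 10}))
```

Both records come out in the same shape, which is the point of integration: downstream queries never need to know which application a row came from.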

Volatility of Data
Volatile (Transactional Storage)
Insert, Change, Delete, Access
Record-by-Record Data Manipulation

Non-Volatile (Data Warehouse Storage)
Load, Access
Mass Load / Access of Data

Time Variant Data Analysis


Transactional Storage: Current Data
Data Warehouse Storage: Historical Data
[Chart: Sales (Region, Year) for Year 97, 1st Qtr. Sales (in lakhs, axis 0 to 20) for East, West and North across January, February and March.]

Decision Support Systems (DSS)

What is DSS?
Need for DSS
Comparison of OLTP & DSS
Transition from Data Processing to Information Processing

What is DSS?
Decision Support Systems (DSS) are interactive computer-based systems intended to help decision makers utilize data and models to identify and solve problems and make decisions.
Data Warehouse is the foundation of the DSS process. It is a Strategy and a Process for Staging Corporate Data.

Enable users to get a Business View of the data
Facilitate data-based decision making that would drive and improve the business
Discover Hidden Trends

Why DSS?: How to answer these Business Queries?


What is the sales distribution region wise?
How did my revenue improve in the past 5 years?
What are the slow movers in my product line?
Which of my Sales Agents are doing better?
What is the Defaulters' Profile?
Which channel costs me more and pays less?
Strategic Planning / Budgeting
Currency Risk, Interest Rate Risk, Liquidity Risk
Who are my profitable customers?

OLTP v/s DSS Environment

OLTP Environment | DSS Environment
get data IN | get information OUT
large volumes of simple transaction queries | small number of diverse queries
continuous data changes | periodic updates only
low processing time | high processing time
mode of processing | mode of discovery
transaction details | subject oriented summaries
data inconsistency | data consistency
mostly current data | historical data is relevant

OLTP v/s DSS Environment

OLTP Environment | DSS Environment
high concurrent usage | low concurrent usage
highly normalized data structure | fewer tables, but more columns per table
static applications | dynamic applications
automates routines | facilitates creativity

DW Implementation Approaches

Top Down
Bottom-up
Combination of both
Choices depend on:

current infrastructure
resources
architecture
ROI
Implementation speed

Top Down Implementation

Bottom Up Implementation

DW Implementation Approaches
Top Down | Bottom Up
More planning and design initially | Can plan initially without waiting for global infrastructure
Involve people from different work-groups, departments | Built incrementally
Data marts may be built later from Global DW | Can be built before or in parallel with Global DW
Overall data model to be decided up-front | Less complexity in design

DW Implementation Approaches
Top Down
Consistent data definition
and enforcement of business
rules across enterprise
High cost, lengthy process,
time consuming
Works well when there is
centralized IS department
responsible for all H/W and
resources

Bottom Up
Data redundancy and
inconsistency between
data marts may occur
Integration requires
great planning
Less cost of H/W and
other resources
Faster pay-back

DW Architectures


Data warehousing Architecture

[Figure: data warehousing architecture. Components recoverable from the diagram, from sources to reporting:]
Sources: Source 1, Source 2, Source 3, ... Source n
Extract (Push/Pull)
Staging Layer: Cleansing, Transformation & Loading
ODS
Data Warehouse: Detail Data, Summaries / Aggregations
Metadata
Transformation, Summarization, Aggregation
Data Marts: Cubes, Conformed Dimensions
Reporting Layer: Canned Reports, Ad-hoc analysis

Benefits of DWH
To formulate effective business, marketing
and sales strategies.
To precisely target promotional activity.
To discover and penetrate new markets.
To successfully compete in the marketplace
from a position of informed strength.
To build predictive rather than retrospective models.

Data Modeling

WHAT IS A DATA MODEL?
A data model is an abstraction of some aspect of the real world (system).
WHY A DATA MODEL?
Helps to visualize the business
A model is a means of communication.
Models help elicit and document requirements.
Models reduce the cost of change.
Model is the essence of DW architecture based on
which DW will be implemented

STEPS in DATA MODELING


Problem & scope definition
Requirement Gathering

Analysis
Logical Database Design
Deciding Database
Physical Database design
Schema Generation

Levels of modeling
Conceptual modeling
Describe data requirements from a business
point of view without technical details

Logical modeling
Refine conceptual models
Data structure oriented, platform independent

Physical modeling
Detailed specification of what is physically
implemented using specific technology

Conceptual Model
A conceptual model shows data through
business eyes.
All entities which have business meaning.
Important relationships
Few significant attributes in the entities.
Few identifiers or candidate keys.

Logical Model
Replaces many-to-many relationships with
associative entities.
Defines a full population of entity attributes.
May use non-physical entities for domains
and sub-types.
Establishes entity identifiers.
Has no specifics for any RDBMS or
configuration.

Physical Model
A Physical data model may include

Referential Integrity
Indexes
Views
Alternate keys and other constraints
Tablespaces and physical storage objects.

Modeling Techniques
Entity-Relationship Modeling
Traditional modeling technique
Technique of choice for OLTP
Suited for corporate data warehouse

Dimensional Modeling
Analyzing business measures in the specific business context
Helps visualize very abstract business questions
End users can easily understand and navigate the data structure

Entity-Relationship Modeling - Basic Concepts


Relationship
Relationship between entities - structural interaction and
association
described by a verb
Cardinality
1-1
1-M
M-M

Example : Books belong to Printed Media

Entity-Relationship Modeling - Basic Concepts


Attributes
Characteristics and properties of entities
Example :
Book Id, Description, book category are attributes of entity
Book

Attribute name should be unique and self-explanatory


Primary Key, Foreign Key, Constraints are defined on
Attributes

Examples: ER Model


Limitations of E-R Modeling

Poor Performance
Tend to be very complex and difficult to
navigate.

Dimensional Modeling

Dimensional modeling uses three basic
concepts : measures, facts, dimensions.
Is powerful in representing the requirements
of the business user in the context of
database tables.
Focuses on numeric data, such as values,
counts, weights, balances and occurrences.

Dimensional modeling
Must identify

Business process to be supported


Grain (level of detail)
Dimensions
Facts

What is a Fact?
A fact is a collection of related data items,
consisting of measures and context data.
Each fact typically represents a business item,
a business transaction, or an event that can be
used in analyzing the business or business
process.
Facts are measured, continuously valued,
rapidly changing information. Can be
calculated and/or derived.

Types of Facts
Additive
Able to add the facts along all the dimensions
Discrete numerical measures, e.g. retail sales in $
Semi Additive
Snapshot, taken at a point in time
Measures of intensity
Not additive along the time dimension, e.g. account balance, inventory balance
Added and divided by the number of time periods to get a time average
Non Additive
Numeric measures that cannot be added across any dimension
Intensity measures averaged across all dimensions, e.g. room temperature
Textual facts - AVOID THEM
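The difference between additive and semi-additive facts can be shown with a short Python sketch. The rows, column layout and variable names below are made up for illustration. Deposits stand in for an additive fact, and account balance is the semi-additive fact named above.

```python
# Made-up daily snapshot rows: (day, account, deposits, balance)
rows = [
    (1, "A", 100, 100),
    (1, "B", 50, 50),
    (2, "A", 20, 120),
    (2, "B", 0, 50),
]

# Additive fact: deposits can be summed across accounts and across days.
total_deposits = sum(r[2] for r in rows)

# Semi-additive fact: balances can be summed across accounts for a single day ...
balance_day2 = sum(r[3] for r in rows if r[0] == 2)

# ... but across time they are added and divided by the number of periods.
days = sorted({r[0] for r in rows})
daily_totals = [sum(r[3] for r in rows if r[0] == d) for d in days]
average_balance = sum(daily_totals) / len(days)

print(total_deposits, balance_day2, average_balance)
```

Summing the balance column over both days would give 320, a number with no business meaning, which is why the time dimension needs an average instead.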

Dimensions
A dimension is a collection of members or
units of the same type of views.
Dimensions determine the contextual
background for the facts.
Dimensions represent the way business
people talk about the data resulting from a
business process, e.g., who, what, when,
where, why, how

Dimensional Hierarchy
Geography Dimension (each level is in a parent relation to the level below it)
World Level: World
Continent Level: America, Europe, Asia
Country Level: USA, Canada, Argentina
State Level: FL, GA, VA, CA, WA
City Level: Miami, Tampa, Orlando, Naples
Each name in the hierarchy is a Dimension Member / Business Entity.
Attributes: Population, Tourists Place
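A hierarchy like this one can be represented as a parent map, which makes rolling a measure up from one level to the next a simple lookup. The sketch below is a hypothetical illustration: the parent assignments follow ordinary geography rather than anything stated on the slide, and the visitor counts are made up.

```python
# Parent relation for the Geography dimension (child -> parent).
PARENT = {
    "Miami": "FL", "Tampa": "FL", "Orlando": "FL", "Naples": "FL",
    "FL": "USA", "GA": "USA", "VA": "USA", "CA": "USA", "WA": "USA",
    "USA": "America", "Canada": "America", "Argentina": "America",
    "America": "World", "Europe": "World", "Asia": "World",
}


def path_to_root(member):
    """Return the member followed by all of its ancestors up to World."""
    path = [member]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def roll_up(values, levels_up=1):
    """Aggregate values keyed by member to the ancestor `levels_up` levels above."""
    totals = {}
    for member, value in values.items():
        ancestor = path_to_root(member)[levels_up]
        totals[ancestor] = totals.get(ancestor, 0) + value
    return totals


visitors = {"Miami": 3, "Tampa": 2, "Orlando": 5}  # made-up measure per city
print(path_to_root("Miami"))
print(roll_up(visitors, levels_up=1))  # city level -> state level
print(roll_up(visitors, levels_up=2))  # city level -> country level
```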

Dimensions Types

Conformed Dimension
Junk Dimension
Dirty Dimension
Monster Dimension
Slowly Changing Dimension
Degenerate Dimension

Data marts (DM)


A data mart is a
Powerful and natural extension of the data warehouse
Extends information to the departmental environment from an enterprise environment
Interprets and structures data to suit departments' specific needs
Several names for DMs:
departmental DSS DBs
OLAP databases
multi-dimensional DBs (MDDB)
lightly summarized tables
Data marts

DM - Types

Embedded data marts are marts that are stored within the central DW. They can be stored relationally as files or cubes.
Dependent data marts are marts that are fed directly by the DW, sometimes supplemented with other feeds, such as external data.
Independent data marts are marts that are fed directly by external sources and do not use the DW.

Operational Data Store (ODS)


An ODS
pulls together, validates, cleanses and integrates data
foundation for providing integrated view of enterprise data
tactical decision support, day-to-day operations and
management reporting
Characteristics
Integrated
Subject-oriented
Volatile (including update)
Current valued

ODS


ODS - Types

Class I: Immediate Load
Class II: Delayed Load
Class III: Overnight Load
Class IV: Data Warehouse Load

OLTP Vs ODS Vs DWH


Characteristic | OLTP | ODS | Data Warehouse
Data redundancy | Non-redundant within system; unmanaged redundancy among systems | Somewhat redundant with operational databases | Managed redundancy
Data stability | Dynamic | Somewhat dynamic | Static
Data update | Field by field | Field by field | Controlled batch
Data usage | Highly structured, repetitive | Somewhat structured, some analytical | Highly unstructured, heuristic or analytical
Database size | Moderate | Moderate | Large to very large
Database structure stability | Stable | Somewhat stable | Dynamic

Star Schema Design


Single fact table surrounded by denormalized
dimension tables
The fact table primary key is the composite of the
foreign keys (primary keys of dimension tables)
Fact table contains transaction type information.
Many star schemas in a data mart
Easily understood by end users, more disk storage
required

Example of Star Schema
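Since the original diagram is not available here, the following is only a minimal sketch of a star schema using Python's built-in sqlite3 module. All table names, column names and rows are hypothetical. It shows one fact table whose composite primary key is built from the foreign keys of two denormalized dimension tables, and a typical query that joins the fact table to its dimensions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE fact_sales (
    product_id  INTEGER REFERENCES dim_product(product_id),
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    quantity    INTEGER,
    amount      REAL,
    PRIMARY KEY (product_id, customer_id)  -- composite of the dimension keys
);
INSERT INTO dim_product  VALUES (1, 'Pen', 'Stationery'), (2, 'Desk', 'Furniture');
INSERT INTO dim_customer VALUES (10, 'Asha', 'East'), (11, 'Ravi', 'West');
INSERT INTO fact_sales   VALUES (1, 10, 5, 50.0), (2, 10, 1, 300.0), (1, 11, 2, 20.0);
""")

# One join per dimension is enough, because each dimension is a single flat table.
rows = conn.execute("""
    SELECT c.region, p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product  p ON p.product_id  = f.product_id
    JOIN dim_customer c ON c.customer_id = f.customer_id
    GROUP BY c.region, p.category
    ORDER BY c.region, p.category
""").fetchall()

for row in rows:
    print(row)
```

The category is repeated on every product row. That repetition is the extra disk storage a star schema trades for simpler queries.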

Snowflake Schema
Single fact table surrounded by normalized dimension
tables
Normalizes dimension table to save data storage space.
Used when dimensions become very large
Less intuitive, slower performance due to joins

May want to use both approaches, especially if supporting multiple end-user tools.

Example of Snowflake Schema

Snowflake - Disadvantages
Normalization of dimension makes it
difficult for user to understand
Decreases the query performance because it
involves more joins
Dimension tables are normally smaller than
fact tables - space may not be a major issue
to warrant snowflaking
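The extra join that snowflaking introduces can be seen in a small sqlite3 sketch. The tables, columns and rows are hypothetical. Here the product dimension is normalized so that the category lives in its own table, and a query by category has to travel through two dimension tables instead of one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_product (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)  -- normalized out
);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    amount     REAL
);
INSERT INTO dim_category VALUES (1, 'Stationery'), (2, 'Furniture');
INSERT INTO dim_product  VALUES (1, 'Pen', 1), (2, 'Pencil', 1), (3, 'Desk', 2);
INSERT INTO fact_sales   VALUES (1, 50.0), (2, 10.0), (3, 300.0);
""")

# Two joins are needed to reach the category, one more than a flat dimension would need.
by_category = conn.execute("""
    SELECT cat.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product  p   ON p.product_id    = f.product_id
    JOIN dim_category cat ON cat.category_id = p.category_id
    GROUP BY cat.category
    ORDER BY cat.category
""").fetchall()

print(by_category)
```

Each category name is now stored once, but the saving is small when the dimension has only a few rows compared with the fact table.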

On-Line Analytical Processing (OLAP)


OLAP is a category of applications/technology for
collecting
managing
processing
presenting
multidimensional data for analysis and management purposes

OLAP Cubes

OLAP Features
Subject oriented approach to Decision Support
Calculations applied across dimensions, through hierarchies
and/or across members
Trend analysis over sequential time periods, What If
scenarios.
Slicing/Dicing subsets for on-screen viewing
Drill-down/up along the hierarchy
Reach-through to underlying detail data
Rotation to new dimensional comparisons in the viewing
area
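Slicing, drilling up and rotation can be sketched on a tiny in-memory cube. This is an illustration under stated assumptions, not how any particular OLAP product works: the dimension names, the cell values and both helper functions are hypothetical.

```python
from collections import defaultdict

# Cube cells keyed by (region, product, month); values are made-up sales figures.
DIMS = ("region", "product", "month")
cube = {
    ("East", "Pen", "Jan"): 10, ("East", "Pen", "Feb"): 12,
    ("East", "Desk", "Jan"): 5, ("West", "Pen", "Jan"): 7,
    ("West", "Desk", "Feb"): 9,
}


def slice_cube(cells, **fixed):
    """Keep only the cells whose coordinates match every fixed dimension value."""
    return {k: v for k, v in cells.items()
            if all(k[DIMS.index(d)] == val for d, val in fixed.items())}


def roll_up(cells, *keep):
    """Sum the measure over every dimension that is not listed in `keep`."""
    totals = defaultdict(int)
    for k, v in cells.items():
        totals[tuple(k[DIMS.index(d)] for d in keep)] += v
    return dict(totals)


print(slice_cube(cube, region="East"))                 # slice: one region only
print(roll_up(cube, "region"))                         # drill up to region totals
print(roll_up(slice_cube(cube, month="Jan"), "product"))  # dice, then summarize
print(roll_up(cube, "month", "region"))                # rotation: new axes for viewing
```

Drill-down is the reverse of `roll_up`: listing more dimensions in `keep` returns finer-grained cells.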

OLAP Categories
Multi-dimensional OLAP (MOLAP)
Relational OLAP (ROLAP)
Hybrid OLAP (HOLAP)


MOLAP
Use pre-calculated data set CUBE
Cube contains all possible answers to given range of
questions
Features:
Very fast response
Ability to quickly write data into the cube
Downsides:
Limited Scalability
Inability to contain detailed data
Load time
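The trade-off listed above can be made concrete with a sketch of pre-calculation. It is a simplified illustration with hypothetical names and made-up rows, not the storage format of any MOLAP engine. Every combination of dimensions is aggregated once at load time, so that a later question is answered by a single lookup.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("region", "product")
facts = [("East", "Pen", 10), ("East", "Desk", 5), ("West", "Pen", 7)]  # made-up rows


def build_cube(rows):
    """Pre-aggregate the measure for every subset of dimensions (done at load time)."""
    cube = defaultdict(int)
    for *coords, value in rows:
        for size in range(len(DIMS) + 1):
            for idx in combinations(range(len(DIMS)), size):
                # "*" marks a dimension that has been aggregated away.
                key = tuple(coords[i] if i in idx else "*" for i in range(len(DIMS)))
                cube[key] += value
    return dict(cube)


cube = build_cube(facts)
print(cube[("East", "*")])  # answered by lookup, no scan of the fact rows
print(cube[("*", "Pen")])
print(len(cube))            # number of stored cells
```

Three fact rows already produce eight stored cells, and each added dimension doubles the number of groupings, which is one way to read the "limited scalability" and "load time" downsides.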

ROLAP
Do not use pre-calculated CUBE
Intercept query & pose it to the Relational DB

Features:
Ask any question (not limited to the contents of the
cube)
Ability to drill down
Downsides:
Slow Response
Some limitations on scalability

HOLAP
Combines MOLAP & ROLAP
Utilizes both pre-calculated cubes & relational data
sources
Features:
For summary type info, uses the cube (faster response)
Ability to drill down relational data sources (drill
through detail to underlying data)
Source of data transparent to end-user


Data Acquisition
Data Extraction
Data Transformation
Data Loading
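The three steps can be sketched end to end with the Python standard library. Everything specific here is an assumption made for illustration: the CSV layout, the cleansing rule, the table name and the function names are not taken from the slides.

```python
import csv
import io
import sqlite3

# Made-up source extract; a real source would be a file or an operational database.
SOURCE_CSV = "cust_id,gender,balance\n1,M,100.50\n2,f,20\n3,,7.25\n"


def extract(text):
    """Extraction: read the raw source rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: standardize codes, convert types, drop unusable rows."""
    clean = []
    for row in rows:
        gender = row["gender"].strip().upper()
        if gender not in ("M", "F"):
            continue  # assumed cleansing rule: skip rows without a usable gender code
        clean.append((int(row["cust_id"]), gender, float(row["balance"])))
    return clean


def load(conn, rows):
    """Loading: bulk-insert the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer (cust_id INTEGER, gender TEXT, balance REAL)"
    )
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute("SELECT * FROM customer ORDER BY cust_id").fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors the outline above and lets each step be tested or replaced on its own.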
