
Data Warehousing Fundamentals

Yasim Kolathayil
yasimk@gmail.com
yasim@damaninc.com

Yasim Kolathayil – yasimk@gmail.com 1
What is a Data Warehouse?

Inmon definition of a Data Warehouse

A data warehouse is a collection of data that supports management's decision-making process and is:
• Subject-oriented
• Integrated
• Time-variant
• Nonvolatile

Kimball definition of a Data Warehouse

• The DW is nothing more than the union of all the constituent data marts

Why Data Warehousing?

• Which are our lowest/highest margin customers?
• Who are my customers and what products are they buying?
• What is the most effective distribution channel?
• What product promotions have the biggest impact on revenue?
• Which customers are most likely to go to the competition?
• What impact will new products/services have on revenue and margins?

Historical DSS before EDW

[Diagram: Source OLTP Systems, Replicated Data Sets, User Grown DSS Systems, and an End User Developed Reporting Environment. Result: Multiple Versions of Truth.]

Common DW Components
• Staging Area
  - A preparatory repository where transaction data can be transformed for use in the data warehouse (EDW_DEV_ST, EDW_PROD_ST)
• Data Warehouse
  - ER-modeled, with historical data pulled from source systems
• Data Mart
  - Traditional dimensionally modeled set of dimension and fact tables
• Operational Data Store (ODS)
  - Modeled to support near real-time reporting needs
  - Contains traits of both relational and dimensional modeling techniques
• OLTP and OLAP Systems
  - Online Transaction Processing and Online Analytical Processing

DW Arch. – Hub and Spoke (Inmon Model)
[Diagram: Source OLTP Systems feed the Data Warehouse through ETL; a second ETL step feeds the Data Marts, which serve real-time dashboards, static and ad-hoc reporting, and graphical data analysis. Result: Single Version of Truth.]

Data Warehouse Architecture
[Diagram: Source OLTP Systems, Staging Area, Data Warehouse, and Data Marts connected by ETL steps, with an ODS and OLAP Systems alongside. Queries run from the EDW and from the data marts.]
• Data Warehouse: integrated, scrubbed, subject-oriented, history, non-volatile, summary
• ODS: integrated, scrubbed, subject-oriented, no history, volatile

Kimball DW – BUS Architecture
[Diagram: Source OLTP Systems, Data Staging Area, and Data Marts 1 to 4 connected by ETL steps; the data marts feed OLAP Servers and End User Applications (ad-hoc query, end user applications, reports).]

The collection of data marts is called the data warehouse.

Inmon Vs Kimball DW Differences

Inmon DW | Kimball DW
Hub and Spoke Model (Corporate Information Factory) - SWA Model | Bus Architecture
Integrated DW with data at the atomic level | Integration through conformed dimensions and facts
Staging area and data warehouse constitute the backroom | Transient staging area is the backroom
Top-down approach (enterprise-wide) | Bottom-up approach (department-wide)
Increased startup time, integrated | Quick development, fast to build
E-R model for the data warehouse | Only star schema for the data warehouse

DW Design Strategies

Top-down DW design (Inmon Model): the data warehouse design is based on the enterprise model itself. It implies a strategic, rather than operational, perspective of the data.

Bottom-up DW design (Kimball Model): focuses more on making use of the data available in the current system. This takes less effort than the top-down approach, but may end up with a DW that does not satisfy all of the organization's information needs.

Hybrid Model: rapid development within an enterprise context. The DW is populated only as data is needed for data marts.

From Data Warehouse to Data Marts

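[Diagram: a pyramid running from Data at the base to Information at the top. The levels are organizationally structured (the data warehouse), departmentally structured, and individually structured. A side scale marked History, Normalized, Detailed runs from More at the data warehouse level to Less at the individual level.]
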
What are the differences?

Properties | Operational Source Systems | Operational Data Store | Data Warehouse | Data Mart
Contents | Detailed Data | Detailed Data + Appr. Summary | Summary Info + Appr. Detail | Single Function Summary
Timeliness | Current | Nearly Current | Point-in-Time | Point-in-Time
Updated | Continuously | Frequently | Periodically | Periodically
Performance Needs | Tuned for Update | Tuned for Query & Extraction | Tuned for Query | Not Applicable
Volatility of Contents | Very Volatile | Volatile | Non-Volatile | Non-Volatile
Amt of Data Accessed | Low | Controlled for Performance | May Be Very High | Moderate

The Multi-Dimensional Data Model

Data is divided into:
• Facts
• Dimensions

Facts are the important entity
• Facts have measures that can be aggregated, e.g. sales price

Dimensions describe facts
• A sale has the dimensions Product, Store and Time

Goal for dimensional modeling:
• Surround facts with as much context (dimensions) as possible

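As a minimal sketch of the fact/dimension split, the Python snippet below treats each sale as a fact row that carries measures (`units`, `dollars`) and dimension values (`product`, `store`, `month`), then aggregates one measure along one dimension. The sample rows and the helper name `total_by` are illustrative assumptions, not part of the original slides.

```python
from collections import defaultdict

# Hypothetical fact rows: each sale carries dimension values and measures.
SALES = [
    {"product": "Shirt", "store": "Dallas", "month": "2024-01", "units": 3, "dollars": 60.0},
    {"product": "Shirt", "store": "Austin", "month": "2024-01", "units": 1, "dollars": 20.0},
    {"product": "Hat", "store": "Dallas", "month": "2024-02", "units": 2, "dollars": 30.0},
]


def total_by(facts, dimension, measure):
    """Sum one measure, grouped by the values of one dimension."""
    totals = defaultdict(float)
    for row in facts:
        totals[row[dimension]] += row[measure]
    return dict(totals)


print(total_by(SALES, "product", "dollars"))
print(total_by(SALES, "store", "units"))
```

Adding another dimension to each row adds context without changing the measures, which is the sense in which dimensions "surround" the facts.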
The “Classic” Star Schema

Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price
Store Dimension: STORE KEY, Store Description, City, State, District ID, District Desc., Region_ID, Region Desc., Regional Mgr.
Time Dimension: PERIOD KEY, Period Desc, Year, Quarter, Month, Day
Product Dimension: PRODUCT KEY, Product Desc., Brand, Color, Size, Manufacturer

Baggage Data Mart

The “Classic” Star Schema

• A single fact table, with detail and summary data
• Fact table primary key has only one key column per dimension
• Each key is generated
• Each dimension is a single table, highly denormalized

Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price
Store Dimension: STORE KEY, Store Description, City, State, District ID, District Desc., Region_ID, Region Desc., Regional Mgr., Level
Time Dimension: PERIOD KEY, Period Desc, Year, Quarter, Month, Day, Current Flag, Resolution, Sequence
Product Dimension: PRODUCT KEY, Product Desc., Brand, Color, Size, Manufacturer, Level

Benefits: Easy to understand, easy to define hierarchies, reduces the number of physical joins, low maintenance, very simple metadata
Drawbacks: Summary data in the fact table yields poorer performance for summary levels; huge dimension tables can be a problem

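The star layout can be sketched with Python's built-in `sqlite3` module. This is an illustration under assumptions: the table names (`sales_fact`, `store_dim`, `product_dim`, `time_dim`), the trimmed column lists, and the sample rows are hypothetical stand-ins for the tables on the slide. The fact table's primary key has one key column per dimension, and a query needs one join per dimension it uses.

```python
import sqlite3


def build_star(conn):
    """Create a tiny star schema: one fact table keyed by three dimensions."""
    conn.executescript("""
        CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, city TEXT, state TEXT);
        CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_desc TEXT, brand TEXT);
        CREATE TABLE time_dim    (period_key INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
        CREATE TABLE sales_fact (
            store_key   INTEGER REFERENCES store_dim(store_key),
            product_key INTEGER REFERENCES product_dim(product_key),
            period_key  INTEGER REFERENCES time_dim(period_key),
            dollars REAL,
            units   INTEGER,
            PRIMARY KEY (store_key, product_key, period_key)
        );
    """)
    conn.executemany("INSERT INTO store_dim VALUES (?, ?, ?)",
                     [(1, "Dallas", "TX"), (2, "Phoenix", "AZ")])
    conn.executemany("INSERT INTO product_dim VALUES (?, ?, ?)",
                     [(1, "Shirt", "Acme"), (2, "Hat", "Acme")])
    conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                     [(1, 2024, 1), (2, 2024, 2)])
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)",
                     [(1, 1, 1, 60.0, 3), (1, 2, 1, 30.0, 2),
                      (2, 1, 1, 20.0, 1), (2, 1, 2, 40.0, 2)])
    conn.commit()


def dollars_by_state_and_quarter(conn):
    """Join the fact table to each dimension used, then aggregate the measure."""
    return conn.execute("""
        SELECT s.state, t.quarter, SUM(f.dollars)
        FROM sales_fact f
        JOIN store_dim s ON s.store_key = f.store_key
        JOIN time_dim  t ON t.period_key = f.period_key
        GROUP BY s.state, t.quarter
        ORDER BY s.state, t.quarter
    """).fetchall()


conn = sqlite3.connect(":memory:")
build_star(conn)
print(dollars_by_state_and_quarter(conn))
```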
EDW Baggage Data

Star and Snowflake Model

• In a snowflake schema, complex dimensions are normalized

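To make the normalization concrete, here is a small `sqlite3` sketch in which the region attributes of a store dimension are moved into their own table. The table and column names (`region_dim`, `store_dim`, `sales_fact`) and the rows are assumptions chosen for illustration. The trade-off it shows: reaching a region description now takes two joins from the fact table instead of one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Snowflaked store dimension: region attributes live in their own table.
    CREATE TABLE region_dim (region_id INTEGER PRIMARY KEY, region_desc TEXT);
    CREATE TABLE store_dim  (store_key INTEGER PRIMARY KEY, city TEXT,
                             region_id INTEGER REFERENCES region_dim(region_id));
    CREATE TABLE sales_fact (store_key INTEGER REFERENCES store_dim(store_key),
                             dollars REAL);
    INSERT INTO region_dim VALUES (10, 'South'), (20, 'West');
    INSERT INTO store_dim  VALUES (1, 'Dallas', 10), (2, 'Austin', 10), (3, 'Phoenix', 20);
    INSERT INTO sales_fact VALUES (1, 60.0), (2, 20.0), (3, 40.0);
""")


def dollars_by_region(conn):
    """Aggregate sales by region, walking fact -> store -> region."""
    return conn.execute("""
        SELECT r.region_desc, SUM(f.dollars)
        FROM sales_fact f
        JOIN store_dim  s ON s.store_key = f.store_key
        JOIN region_dim r ON r.region_id = s.region_id
        GROUP BY r.region_desc
        ORDER BY r.region_desc
    """).fetchall()


print(dollars_by_region(conn))
```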
OLTP vs. OLAP
OLTP: Online Transaction Processing
• Describes processing at operational sites
• Source data for the data warehouse

OLAP: Online Analytical Processing
• Multi-dimensional view of business data
• Used mainly for decision making by the business analyst
• An element of Decision Support Systems

OLTP vs. OLAP

OLTP | OLAP
Atomized | Summarized, consolidated data
Many small transactions | GB to TB of historical data
Clerical users | Decision-makers (analysts)
Present data | Historical data (viewing trends)
Critical, application oriented | Subject oriented (Which product sells more?)

What is METADATA?

• Data about data
• Anything other than the real data
• Warehouse metadata answers questions like:
  - When was this data placed in the DW?
  - What is the time frame of the data?
  - What is the source of the data?
  - What is the meaning of the codes?
  - Was this data derived, translated or summarized?

Types of Meta Data
Business Metadata - Captures business definitions, structure and hierarchy of data, subject areas, definitions of metrics, etc.

Process Metadata - Describes the where (origin of data), the when (date of capture, frequency of capture, history of extracts and loads, etc.) and the how (tools used, transformations applied).

Technical Metadata - Describes the physical locations, formats, file and table structures, and DB indexes.

Application Metadata - Describes how the data is accessed and used, when and how frequently it is accessed, who is using the data, and how it is being used.

Metadata Architecture

Business Metadata
• Data Definitions
• Metrics Definitions
• Subject Models
• Data Models
• Business Rules
• Data Rules

Process Metadata
• Source/Target Maps
• Transformation rules
• Data Cleansing rules
• Extract Audit trail
• Load Audit trail
• Data Quality audit

Technical Metadata
• Data locations
• Data Formats
• Technical names
• Data sizes
• Data Types
• Indexes

Application Metadata
• Data/access history
• Who is accessing
• Frequency of access
• When accessed
• How accessed
• Etc.

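One way to picture process metadata such as a load audit trail is a small record written once per load. The sketch below is hypothetical: the `LoadAudit` class, its field names, and the `record_load` helper are assumptions for illustration, not a structure defined in the slides. Each field answers one of the questions warehouse metadata is meant to answer (source, time of load, how the data was changed).

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class LoadAudit:
    """Hypothetical process-metadata record written once per load."""
    source_system: str    # where the data came from
    target_table: str     # where it was placed
    loaded_at: str        # when it was placed in the DW (ISO 8601, UTC)
    rows_extracted: int
    rows_loaded: int
    transformation: str   # how: derived, translated, or summarized

    @property
    def rows_rejected(self) -> int:
        return self.rows_extracted - self.rows_loaded


def record_load(source, target, extracted, loaded, transformation, now=None):
    """Build an audit record; `now` can be injected to make runs repeatable."""
    now = now or datetime.now(timezone.utc)
    return LoadAudit(source, target, now.isoformat(), extracted, loaded, transformation)


audit = record_load("orders_oltp", "sales_fact", 1000, 990, "summarized",
                    now=datetime(2024, 1, 2, 3, 0, tzinfo=timezone.utc))
print(asdict(audit), audit.rows_rejected)
```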
Meta Data Population/Maintenance

[Diagram: Meta Data Population and Maintenance spans every stage: User Needs Gathering, Mapping, Extraction, Scrubbing, Transformation, Loading, Aggregation, Replication, and Access.]

Extraction, Transformation & Load-ETL

• Data extraction
• Data cleaning
• Data transformation
  - Convert from legacy/host format to warehouse format
• Load
  - Sort, summarize, consolidate, compute views, check integrity, build indexes, partition
• Refresh
  - Propagate updates from sources to the warehouse

Data Extraction Concepts

Which data to extract?
• Select only known sources of data
• Taking all available data from the source without filtering is a widely accepted practice (Application Triage)

Extract changes or full file extracts?
• Deciding factors are volumes of data, frequency of changes and ease of change detection

How to recognize the changes?
• It is important to understand when data is changed in the source system

When to extract the data?
• Choosing when to extract the data is a key consideration

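One common way to extract only changes is a timestamp watermark, sketched below. It assumes the source rows carry a last-modified column (called `updated_at` here, a hypothetical name) and that the previous extract time is remembered between runs. Sources without such a column need another technique, such as comparing against the previous full extract.

```python
from datetime import datetime

# Hypothetical source rows; `updated_at` is an assumed last-modified column.
SOURCE_ROWS = [
    {"id": 1, "name": "Ann", "updated_at": datetime(2024, 1, 1, 9, 0)},
    {"id": 2, "name": "Bob", "updated_at": datetime(2024, 1, 2, 9, 0)},
    {"id": 3, "name": "Cy", "updated_at": datetime(2024, 1, 3, 9, 0)},
]


def extract_changes(rows, last_extracted_at):
    """Return rows changed after the previous extract, plus the new watermark."""
    changed = [r for r in rows if r["updated_at"] > last_extracted_at]
    # If nothing changed, keep the old watermark so the next run starts there.
    watermark = max((r["updated_at"] for r in changed), default=last_extracted_at)
    return changed, watermark


changed, watermark = extract_changes(SOURCE_ROWS, datetime(2024, 1, 1, 12, 0))
print([r["id"] for r in changed], watermark)
```

Passing the returned watermark into the next run means a second extract with no new changes returns an empty list.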
Data Transformation Concepts

Why cleansing?
• The data warehouse contains data that is analyzed for business decisions
• More data and multiple sources could mean more errors in the data, and such errors are harder to trace
• Errors result in incorrect analysis
• Detecting data anomalies and rectifying them early has huge payoffs (Validate component at the beginning of the graph)
• Filter unnecessary data at the beginning

Transformation Rules
• Example: translate "gender" to "sex"
• Make the date format consistent: "YYYY-MM-DD"
• Check for valid states and zip codes

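The three transformation rules can be sketched as a single function. Several details are assumptions: the rule about "gender" is read here as renaming the field to `sex`, the list of accepted source date formats is invented, `VALID_STATES` is a deliberately tiny subset, and the zip check only tests for five digits. The field names `order_date`, `state`, and `zip` are hypothetical.

```python
from datetime import datetime

VALID_STATES = {"TX", "AZ", "CA"}                 # illustrative subset, not a full list
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d")  # assumed source formats


def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no assumed format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


def transform(record):
    """Apply the rules and return (clean_record, errors)."""
    clean = dict(record)
    errors = []
    if "gender" in clean:                  # rename the field, keep its value
        clean["sex"] = clean.pop("gender")
    clean["order_date"] = normalize_date(record.get("order_date", ""))
    if clean["order_date"] is None:
        errors.append("bad date")
    if clean.get("state") not in VALID_STATES:
        errors.append("bad state")
    zip_code = str(clean.get("zip", ""))
    if not (len(zip_code) == 5 and zip_code.isdigit()):
        errors.append("bad zip")
    return clean, errors


print(transform({"gender": "F", "order_date": "03/15/2024", "state": "TX", "zip": "75001"}))
```

Returning the errors alongside the record, rather than raising, lets the caller route bad rows to a reject file early in the flow.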
Load Concepts

Issues:
• Huge volumes of data to be loaded
• Small time window (usually at night) when the warehouse can be taken off-line
• When to build indexes and summary tables
• Truncate and load, or incremental load
• Restart after failure with no loss of data integrity

Techniques:
• API or load utilities
• Use parallelism and other incremental techniques

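The difference between truncate-and-load and incremental load, and the idea of failing without losing integrity, can be sketched with `sqlite3`. The table `customer_dim` and its rows are hypothetical, and a production warehouse would normally use the database's bulk load utility rather than row inserts. Each load runs inside one transaction, so a failure part-way through rolls the table back to its previous state and the load can simply be rerun.

```python
import sqlite3


def truncate_and_load(conn, rows):
    """Replace the whole table inside one transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("DELETE FROM customer_dim")
        conn.executemany("INSERT INTO customer_dim VALUES (?, ?)", rows)


def incremental_load(conn, rows):
    """Insert new keys and overwrite changed ones, leaving other rows alone."""
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO customer_dim (customer_key, name) VALUES (?, ?)",
            rows,
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, name TEXT NOT NULL)")
truncate_and_load(conn, [(1, "Ann"), (2, "Bob")])
incremental_load(conn, [(2, "Robert"), (3, "Cy")])

try:
    # The second row violates NOT NULL, so the whole batch is rolled back.
    truncate_and_load(conn, [(9, "Zed"), (10, None)])
except sqlite3.IntegrityError:
    pass

print(conn.execute("SELECT * FROM customer_dim ORDER BY customer_key").fetchall())
```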
Questions? About TDWI?

Remember to fill out the questionnaire.

Appendix

E.g. of Graphical Breakeven Analysis (Taken from Internet)

[Chart: Breakeven Analysis. Benefits and Costs in dollars (0 to 4,000,000) plotted against years (0 to 6).]

Wal-Mart - 500-600% Growth with No Decline

Integrated Data Warehouse Becomes a Reality

A Present Day History Lesson

Wal-Mart

Kmart

