
Lecture 01, Tue, Jan 20, 2009, 18:00 to 21:00, FAST NU, Karachi

Course Outline

Introduction to Data Warehousing and Background
Dimension Modeling
Architecture and Infrastructure
Extract Transform Load
Data Quality Management
OLAP
Implementation Methods of Data Warehouse
Data Mining Overview

Course Material
Data Warehousing Fundamentals by Paulraj Ponniah, John Wiley and Sons
Articles
Class Notes

Marks Distribution

Objective of the course


Why exactly does the world need a Data Warehouse?
How does a Data Warehouse differ from traditional databases and RDBMS?
Where does OLAP stand in the Data Warehouse picture?
What are the different Data Warehouse and OLAP models/schemas?
How to perform ETL?
What is data cleansing? How to perform it? What are the famous algorithms?
Which different Data Warehouse architectures are there? What are their strengths and weaknesses?

What is a Data Warehouse?


The Data Warehouse is an integrated, subject-oriented, time-variant, non-volatile database that provides support for decision making
Subject Oriented
Organized along the lines of the subjects of the corporation. Typical subjects are customer, product, vendor and transaction.

Integrated
Single, Enterprise-Wide view.

Time Variant
Every record in the data warehouse has some form of time dimension attached to it.

Non Volatile
Refers to the inability of data to be updated. Every record in the data warehouse is time stamped in one form or the other.
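
A minimal sketch of what "time variant" and "non volatile" can look like in practice, assuming a hypothetical customer_address_history table in an in-memory SQLite database. The table, columns and sample rows are illustrative only. A change is appended as a new time-stamped row instead of updating the old one, so the state at any past date can still be queried.

```python
import sqlite3

def build_history():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer_address_history ("
        " customer_id INTEGER, city TEXT, valid_from TEXT)"
    )
    # Non-volatile: a change of address is appended as a new row,
    # the earlier row is never updated or deleted.
    rows = [
        (1, "Karachi", "2008-01-01"),
        (1, "Lahore", "2008-09-15"),
    ]
    conn.executemany(
        "INSERT INTO customer_address_history VALUES (?, ?, ?)", rows
    )
    return conn

def city_as_of(conn, customer_id, date):
    # Time-variant: every row carries a date, so the latest row that was
    # valid on the requested date gives the state at that point in time.
    row = conn.execute(
        "SELECT city FROM customer_address_history"
        " WHERE customer_id = ? AND valid_from <= ?"
        " ORDER BY valid_from DESC LIMIT 1",
        (customer_id, date),
    ).fetchone()
    return row[0] if row else None

conn = build_history()
print(city_as_of(conn, 1, "2008-06-30"))  # Karachi
print(city_as_of(conn, 1, "2009-01-20"))  # Lahore
```

Dates are stored as ISO strings so that plain string comparison orders them correctly.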

Decision Support is a methodology (or a series of methodologies) designed to extract information from data and to use such information as a basis for decision making

What is a Data Warehouse?


[Figure: data warehouse overview. Inputs: legacy data; large-scale data collection, generation or digitization exercises; several online operational sources. Corporate decision support infrastructure: DW servers, reporting, end user.]

Needs for Strategic Information


Retain the present customer base
Increase the customer base by 15% over the next 5 years
Gain market share by 10% in the next 3 years
Improve product quality levels in the top five product groups
Enhance customer service level in shipments
Bring three new products to market in 2 years
Increase sales by 15% in the Northern Division

Need of a Data Warehouse


The amount of data the average business collects and stores is doubling each year
Total hardware and software cost to store and manage 1 Mbyte of data
1990: ~ $15
2002: ~ 15¢ (down 100 times)
2005: ~ 1¢ (down 1500 times)
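
A quick arithmetic check of the "down N times" figures, assuming the 2002 and 2005 costs are in US cents. The unit is an assumption inferred from the stated ratios.

```python
# Cost to store and manage 1 MB, expressed in US cents.
# Assumption: the 2002 and 2005 figures are cents, inferred from the ratios.
cost_cents = {1990: 1500, 2002: 15, 2005: 1}

def times_cheaper(base_year, year):
    """How many times cheaper storage is in `year` than in `base_year`."""
    return cost_cents[base_year] // cost_cents[year]

print(times_cheaper(1990, 2002))  # 100
print(times_cheaper(1990, 2005))  # 1500
```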

A Few Examples

CERN: up to 20 PB by 2006
Stanford Linear Accelerator Center (SLAC): 500 TB
France Telecom: ~ 100 TB
WalMart: 24 TB

Operational Systems
User needs information
User requests reports from IT
IT places request on backlog
IT creates ad hoc queries
IT sends requested reports
User hopes to find the right answer
User needs information (cycle repeats)

Operational vs. Informational


                   Operational                   Informational
Data Content       Current values                Archived, derived, summarized
Data Structure     Optimized for transactions    Optimized for complex queries
Access Frequency   High                          Medium to low
Access Type        Read, update, delete          Read
Usage              Predictable, repetitive       Ad hoc, random, heuristic
Response Time      Sub seconds                   Several seconds to minutes
Users              Large number                  Relatively small number

Data Warehouse
[Figure: three-tier data warehouse architecture. Information sources (operational DBs, semistructured sources) are extracted, transformed, loaded and refreshed into the Data Warehouse Server (Tier 1), alongside data marts. The warehouse serves the OLAP Servers (Tier 2, e.g., MOLAP, ROLAP), which serve the Clients (Tier 3): analysis, query/reporting, data mining.]
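
The extract, transform, load path between the information sources and the warehouse server can be sketched roughly as follows. This is a toy illustration under stated assumptions: the two source lists, the sales_fact table and all field names are hypothetical, and a real ETL job would also handle refresh, errors and much larger volumes.

```python
import sqlite3

# Hypothetical rows from two operational sources with inconsistent formats.
pos_rows = [{"sku": "a-1", "amount": "10.50", "day": "2009-01-19"}]
web_rows = [{"sku": "A-1 ", "amount": "4.50", "day": "2009-01-19"}]

def extract():
    # Extract: pull the raw rows from every source.
    return pos_rows + web_rows

def transform(rows):
    # Transform: normalize keys and types so both sources share
    # one integrated representation.
    return [
        (r["sku"].strip().upper(), float(r["amount"]), r["day"])
        for r in rows
    ]

def load(conn, rows):
    # Load: append the cleaned rows to the warehouse table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact"
        " (sku TEXT, amount REAL, day TEXT)"
    )
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))
total = conn.execute(
    "SELECT SUM(amount) FROM sales_fact WHERE sku = 'A-1'"
).fetchone()[0]
print(total)  # 15.0
```

After the transform step, the two differently spelled product keys collapse into one, which is what makes the enterprise-wide total possible.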


Online Transaction Processing (OLTP)


Also known as operational sources
Day-to-day handling of transactions that result from enterprise operation
Airline reservation systems, electronic point of sale systems, automatic teller machines, etc.
Typically several systems within the same enterprise
Read and Update mostly
Standard, Predefined, less complex queries
Queries based on an individual record or a relatively small number of records (Single-Hit Queries)
Typically used in Tactical Management

Decision Support Systems


Decision Support is a methodology (or a series of methodologies) designed to extract information from data and to use such information as a basis for decision making
Communication Driven DSS
Data Driven DSS
Document Driven DSS
Knowledge Driven DSS
Model Driven DSS

Data Driven DSS


Online Analytical Processing (OLAP)


Goal of OLAP is to support ad-hoc querying for the business analyst
Multidimensional view of data is the foundation of OLAP
Extend spreadsheet analysis model to work with warehouse data
Read Only Access
Semantically enriched to understand business terms (e.g., time, geography)
Combined with reporting features
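
A rough sketch of the multidimensional view, assuming a handful of hypothetical fact rows with three dimensions (year, region, product) and one measure (sales). The helper names roll_up and slice_ are illustrative and do not belong to any OLAP product.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, region, product, sales)
facts = [
    (2008, "North", "Widget", 100),
    (2008, "South", "Widget", 80),
    (2008, "North", "Gadget", 50),
    (2009, "North", "Widget", 120),
]
DIMS = {"year": 0, "region": 1, "product": 2}

def roll_up(rows, *dims):
    """Aggregate sales over the chosen dimensions, summing out the others."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[DIMS[d]] for d in dims)
        totals[key] += row[3]
    return dict(totals)

def slice_(rows, dim, value):
    """Keep only the rows where one dimension has a fixed value."""
    return [r for r in rows if r[DIMS[dim]] == value]

print(roll_up(facts, "year"))
# {(2008,): 230, (2009,): 120}
print(roll_up(slice_(facts, "region", "North"), "product"))
# {('Widget',): 220, ('Gadget',): 50}
```

The analyst picks the dimensions at query time, which is the ad hoc part. The data itself is only read, never modified.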



OLTP vs. Data Driven DSS


Trait              OLTP                               Data Driven DSS
User               Sales Staff, IT Professionals      Knowledge worker
Function           Day to day operations              Decision support
DB Design          Application-oriented (E-R based)   Subject-oriented (Star, snowflake)
Data               Current, Isolated                  Historical, Consolidated
View               Detailed, Flat relational          Summarized, Multidimensional
Usage              Structured, Repetitive             Ad hoc
Unit of work       Short, Simple transaction          Complex query
Access             Read/write                         Read Mostly
Operations         Index/hash on primary key          Lots of Scans
Records accessed   Tens to Hundreds                   Thousands to Millions
#Users             Thousands                          Hundreds
Db size            100 MB-GB                          100 GB-TB
Metric             Trans. throughput                  Query throughput, response
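
To make the contrast concrete, here is a small sketch with a hypothetical one-dimension star schema in SQLite. The product_dim and sales_fact tables and their rows are invented for illustration. The first query is the OLTP-style single-hit read through the primary key; the second is the DSS-style scan that joins a dimension and summarizes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE sales_fact (
    sale_id INTEGER PRIMARY KEY, product_id INTEGER, year INTEGER, amount REAL
);
INSERT INTO product_dim VALUES (1, 'Toys'), (2, 'Books');
INSERT INTO sales_fact VALUES
    (1, 1, 2008, 20.0), (2, 2, 2008, 15.0),
    (3, 1, 2009, 30.0), (4, 1, 2009, 10.0);
""")

# OLTP style: one record, located through the primary key.
single = conn.execute(
    "SELECT amount FROM sales_fact WHERE sale_id = ?", (3,)
).fetchone()

# DSS style: scan the fact table, join a dimension, and summarize.
summary = conn.execute("""
    SELECT d.category, f.year, SUM(f.amount)
    FROM sales_fact f JOIN product_dim d ON d.product_id = f.product_id
    GROUP BY d.category, f.year
    ORDER BY d.category, f.year
""").fetchall()

print(single)   # (30.0,)
print(summary)  # [('Books', 2008, 15.0), ('Toys', 2008, 20.0), ('Toys', 2009, 40.0)]
```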

Data Mining
Knowledge Extraction
Verification: OLAP type analyses, hypothesis testing
Discovery: Extracting rules or patterns

Data Mining is finding hidden patterns in data

Predict which customers will buy new policies
Identify behavior patterns of risky customers
Identify fraudulent behavior
Characterize patient behavior to predict office visits
Identify successful medical therapies for different illnesses
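
As one possible illustration of "extracting rules", the sketch below scores a candidate rule by its support and confidence over a few hypothetical market baskets. The slides do not name a specific algorithm, so this is an assumed example of rule evaluation and not the course's method.

```python
# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing `antecedent`, the share that also contain `consequent`."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))       # 0.5
print(confidence({"bread"}, {"milk"}))  # 2 of the 3 bread baskets also hold milk
```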


Knowledge Discovery in Databases (KDD)


Non-trivial extraction of implicit, previously unknown and potentially useful knowledge from data
KDD stages:
Problem definition
Data selection
Cleaning
Enrichment
Coding and organization
Data mining
Reporting
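
The preparation stages can be sketched as small functions chained together. This is a toy example that covers only selection, cleaning and coding; the records, field names and the age bands are all hypothetical.

```python
# Hypothetical customer records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "region": "North"},
    {"id": 1, "age": 34, "region": "North"},    # duplicate
    {"id": 2, "age": None, "region": "North"},  # missing age
    {"id": 3, "age": 61, "region": "North"},
    {"id": 4, "age": 45, "region": "South"},
]

def select(rows, region):
    # Data selection: keep only the rows relevant to the problem.
    return [r for r in rows if r["region"] == region]

def clean(rows):
    # Cleaning: drop duplicates and rows with missing values.
    seen, out = set(), []
    for r in rows:
        if r["age"] is None or r["id"] in seen:
            continue
        seen.add(r["id"])
        out.append(r)
    return out

def code(rows):
    # Coding: turn raw ages into coarse bands a mining step can work with.
    return [
        {"id": r["id"], "age_band": "60+" if r["age"] >= 60 else "under 60"}
        for r in rows
    ]

prepared = code(clean(select(records, "North")))
print(prepared)
```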

DW and DB Clarifying Confusions


Is DW different from DB?
No
The difference is historical, not technical
DW is a DB inside and out
DW is to Data Driven DSS what DB is to OLTP

Brief History of DB Design


Master file design
Integrated, subject-oriented design
Relational design
Star join design
