Course Outline
Introduction to Data Warehousing and Background
Dimension Modeling
Architecture and Infrastructure
Extract Transform Load
Data Quality Management
OLAP
Implementation Methods of Data Warehouse
Data Mining Overview
Course Material
Data Warehousing Fundamentals by Paulraj Ponniah, John Wiley and Sons
Articles
Class Notes
Marks Distribution
and RDBMS?
Where does OLAP stand in the Data Warehouse picture?
What are the different Data Warehouse and OLAP models/schemas?
How to perform ETL?
What is data cleansing? How to perform it? What are the famous algorithms?
Which different Data Warehouse architectures are there? What are their strengths and weaknesses?
A data warehouse is an integrated, subject-oriented, time-variant, non-volatile database that provides support for decision making
Subject Oriented
Organized along the lines of the subjects of the corporation. Typical subjects are customer, product, vendor and transaction.
Integrated
Single, Enterprise-Wide view.
Time Variant
Every record in the data warehouse has some form of time dimension attached to it.
Non Volatile
Data, once loaded into the warehouse, is not updated. Every record in the data warehouse is time stamped in one form or another.
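The time-variant and non-volatile properties can be illustrated with a small sketch. This is only a toy model, not how any particular warehouse product works: the class name `AppendOnlyStore`, its methods, and the sample customer data are all hypothetical. Each loaded record carries a date, and loading a newer value never overwrites an older one.

```python
from datetime import date


class AppendOnlyStore:
    """Toy store: every row carries a date, and loaded rows are never changed."""

    def __init__(self):
        self._rows = []  # list of (as_of_date, subject_key, value) tuples

    def load(self, as_of, key, value):
        # Non-volatile: new facts are appended, nothing is overwritten.
        self._rows.append((as_of, key, value))

    def history(self, key):
        # Time-variant: every dated version of the subject stays available.
        return sorted((d, v) for d, k, v in self._rows if k == key)

    def as_of(self, key, when):
        # Latest value recorded on or before the given date, or None.
        versions = [(d, v) for d, v in self.history(key) if d <= when]
        return versions[-1][1] if versions else None


store = AppendOnlyStore()
store.load(date(2004, 1, 1), "customer-42", "Silver")  # hypothetical data
store.load(date(2005, 6, 1), "customer-42", "Gold")
```

Asking `store.as_of("customer-42", date(2004, 12, 31))` then answers with the value that was valid at that time, which an operational system that updates records in place could no longer do.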
methodologies) designed to extract information from data and to use such information as a basis for decision making
years
Gain market share by 10% in the next 3 years
Improve product quality levels in the top five product groups
Enhance customer service level in shipments
Bring three new products to market in 2 years
Increase sales by 15% in the Northern Division
stores is doubling each year
Total hardware and software cost to store and manage 1 Mbyte of data
1990: ~ $15
2002: ~ 15¢ (down 100 times)
2005: ~ 1¢ (down 1,500 times)
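The cost figures can be checked with a few lines of arithmetic. The sketch below assumes the 2002 and 2005 values are in cents, which is the reading consistent with the stated reduction factors. The variable names are illustrative only.

```python
cost_1990 = 15.00                  # dollars per Mbyte, from the slide
factors = {2002: 100, 2005: 1500}  # "down N times" factors from the slide

# Divide the 1990 cost by each reduction factor to get the later cost in dollars.
costs = {year: cost_1990 / factor for year, factor in factors.items()}

for year, cost in costs.items():
    # Convert dollars to cents for comparison with the slide's figures.
    print(f"{year}: about {cost * 100:.0f} cents per Mbyte")
```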
A Few Examples
CERN: Up to 20 PB by 2006
Stanford Linear Accelerator Center (SLAC): 500 TB
France Telecom: ~ 100 TB
WalMart: 24 TB
Operational Systems
User needs information
User requests reports from IT
IT places request on backlog
IT creates ad hoc queries
IT sends requested reports
User hopes to find the right answer
User needs information
Informational
Archived, derived, summarized
Optimized for complex queries
Medium to low
Response Time
Users
Sub seconds
Large number
Data Warehouse
[Figure: three-tier data warehouse architecture]
Information Sources: Operational DBs, Semistructured Sources (extract, transform, load, refresh, etc.)
Data Warehouse Server (Tier 1): Data Warehouse, Data Marts
OLAP Servers (Tier 2): e.g., MOLAP, ROLAP
Clients (Tier 3): Analysis, Query/Reporting, Data Mining
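The flow from information sources into the warehouse and on to a reporting client can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a real ETL tool: the two sources (`pos_rows`, `web_rows`), their field names, and the target schema are all hypothetical.

```python
# Hypothetical rows from two operational sources with different conventions.
pos_rows = [{"sku": "A1", "amt": "19.50", "day": "2005-03-01"},
            {"sku": "B2", "amt": "5.00", "day": "2005-03-01"}]
web_rows = [{"product": "a1", "total_cents": 1000, "date": "2005-03-02"}]


def extract():
    # Extract: pull raw rows from each source, tagged with their origin.
    yield from (("pos", r) for r in pos_rows)
    yield from (("web", r) for r in web_rows)


def transform(source, row):
    # Transform: map each source's layout onto one shared schema.
    if source == "pos":
        return {"product": row["sku"].upper(),
                "amount": float(row["amt"]),
                "date": row["day"]}
    return {"product": row["product"].upper(),
            "amount": row["total_cents"] / 100,
            "date": row["date"]}


def load(warehouse, rows):
    # Load: append the cleaned rows to the warehouse table.
    warehouse.extend(rows)


warehouse = []
load(warehouse, (transform(s, r) for s, r in extract()))

# A reporting client asks a question neither source could answer alone.
total_a1 = sum(r["amount"] for r in warehouse if r["product"] == "A1")
```

The transform step is where the "integrated" property comes from: product codes and amounts arrive in different forms and leave in one.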
enterprise operation
Airline reservation systems, Electronic point of sale systems, Automatic teller machines, etc.
Typically several systems within the same enterprise
Read and Update mostly
Standard, Predefined, less complex queries
Queries based on an individual record or a relatively small number of records (Single-Hit Queries)
Typically used in Tactical Management
methodologies) designed to extract information from data and to use such information as a basis for decision making
Communication Driven DSS
Data Driven DSS
Document Driven DSS
Knowledge Driven DSS
Model Driven DSS
business analyst
Multidimensional view of data is the foundation of OLAP
Extend spreadsheet analysis model to work with warehouse data
Read Only Access
Semantically enriched to understand business terms
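A multidimensional view can be sketched with plain Python before any OLAP server is involved. The example below is an illustration only: the fact rows, the three dimensions, and the helper names `roll_up` and `slice_` are hypothetical. It treats each fact as a point in product, region, and quarter space and summarizes along the dimensions the analyst chooses.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, region, quarter, sales).
facts = [
    ("Widget", "North", "Q1", 100),
    ("Widget", "North", "Q2", 120),
    ("Widget", "South", "Q1", 80),
    ("Gadget", "North", "Q1", 50),
    ("Gadget", "South", "Q2", 70),
]
DIMENSIONS = ("product", "region", "quarter")


def roll_up(rows, keep):
    """Sum sales over every dimension not named in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def slice_(rows, dimension, value):
    """Keep only the rows where one dimension has a fixed value."""
    i = DIMENSIONS.index(dimension)
    return [r for r in rows if r[i] == value]


by_product = roll_up(facts, ["product"])
north_by_quarter = roll_up(slice_(facts, "region", "North"), ["quarter"])
```

Both queries only read the facts, in line with the read-only access noted above. The same rows answer different questions depending on which dimensions are kept.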
OLTP                                Data Driven DSS
Sales Staff, IT Professionals       Knowledge worker
Day to day operations               Decision support
Application-oriented (E-R based)    Subject-oriented (Star, snowflake)
Current, Isolated                   Historical, Consolidated
Detailed, Flat relational           Summarized, Multidimensional
Structured, Repetitive              Ad hoc
Short, Simple transaction           Complex query
Read/write                          Read Mostly
Index/hash on primary key           Lots of Scans
Tens to Hundreds                    Thousands to Millions
Thousands                           Hundreds
100 MB-GB                           100 GB-TB
Transaction throughput              Query throughput, response
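The contrast in access patterns can be shown with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the `sales` table and its rows are hypothetical, and a single in-memory database stands in for what would normally be separate operational and decision-support systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (id, region, year, amount) VALUES (?, ?, ?, ?)",
    [(1, "North", 2004, 100.0), (2, "North", 2005, 150.0),
     (3, "South", 2004, 80.0), (4, "South", 2005, 60.0)],
)

# OLTP style: a short read/write transaction touching one record by primary key.
with conn:
    conn.execute("UPDATE sales SET amount = amount + 10 WHERE id = ?", (4,))
single_hit = conn.execute("SELECT amount FROM sales WHERE id = ?", (4,)).fetchone()[0]

# Decision-support style: a read-mostly query that scans and summarizes many rows.
summary = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
```

The first query is judged by how many such transactions complete per second. The second is judged by how quickly one large query returns.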
Data Mining
Knowledge Extraction
Verification: OLAP type analyses, hypothesis testing
Discovery: Extracting rules or patterns