Warehousing
Enrico Franconi
CS 636
Problem: Heterogeneous Information Sources
Heterogeneities are everywhere
Personal Databases
Scientific Databases
Digital Libraries
Different interfaces
Different data representations
Duplicate and inconsistent information
CS 336
World Wide Web
Sales Administration
Finance
Manufacturing
...
[Figure: an Integration System placed over the sources: World Wide Web, Digital Libraries, Scientific Databases, Personal Databases]
Why a Warehouse?
Two Approaches:
Query-Driven (Lazy)
Warehouse (Eager)
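The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration under assumptions: the two sources, the `QueryDrivenMediator` and `Warehouse` classes, and the `source_calls` counter are hypothetical names invented here, not part of the course material.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]

# Hypothetical sources: each one is just a function returning its current rows.
def personal_db() -> List[Row]:
    return [{"title": "Data Warehousing", "year": 1997}]

def digital_library() -> List[Row]:
    return [{"title": "Query Optimization", "year": 1995},
            {"title": "View Maintenance", "year": 1998}]

SOURCES = [personal_db, digital_library]


class QueryDrivenMediator:
    """Lazy: nothing is stored; every query is sent to all sources."""

    def __init__(self, sources):
        self.sources = sources
        self.source_calls = 0

    def query(self, predicate: Callable[[Row], bool]) -> List[Row]:
        result = []
        for source in self.sources:
            self.source_calls += 1  # one round trip per source per query
            result.extend(r for r in source() if predicate(r))
        return result


class Warehouse:
    """Eager: data is extracted ahead of time, then queried locally."""

    def __init__(self, sources):
        self.sources = sources
        self.source_calls = 0
        self.store: List[Row] = []

    def load(self) -> None:
        self.store = []
        for source in self.sources:
            self.source_calls += 1  # sources are contacted only at load time
            self.store.extend(source())

    def query(self, predicate: Callable[[Row], bool]) -> List[Row]:
        return [r for r in self.store if predicate(r)]


def recent(row: Row) -> bool:
    return row["year"] >= 1997

lazy = QueryDrivenMediator(SOURCES)
eager = Warehouse(SOURCES)
eager.load()

for _ in range(3):
    lazy_result = lazy.query(recent)
    eager_result = eager.query(recent)

print(lazy.source_calls, eager.source_calls)
```

Both objects return the same rows here; the difference is when the sources are contacted. The warehouse pays that cost once at load time, and in exchange its answers can be stale until the next load.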
[Figure: query-driven architecture. An Integration System with Metadata sits above one Wrapper per Source.]
Disadvantages of Query-Driven Approach
[Figure: warehouse architecture. Clients access the Data Warehouse, which is fed by an Integration System with Metadata and one Extractor/Monitor per Source.]
Subject-oriented
Organized by subject, not by application
Used for analysis, data mining, etc.
Continued
Large volume of data (GB, TB)
Non-volatile
Historical
Time attributes are important
Updates infrequent
May be append-only
Examples
All transactions ever at Sainsbury's
Complete client histories at insurance firm
LSE financial information and portfolios
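The non-volatile, historical, append-only properties listed above can be made concrete with a small sketch using Python's built-in `sqlite3` module. The `client_history` table, its columns, and the two helper functions are assumptions made for illustration; the slides do not prescribe a schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE client_history (
        client_id  INTEGER NOT NULL,
        address    TEXT    NOT NULL,
        valid_from TEXT    NOT NULL   -- time attribute: when this fact became true
    )
""")

def record_address(client_id, address, valid_from):
    # Append-only: a change of address is a new row, never an UPDATE.
    conn.execute("INSERT INTO client_history VALUES (?, ?, ?)",
                 (client_id, address, valid_from))

def address_as_of(client_id, date):
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    row = conn.execute(
        """SELECT address FROM client_history
           WHERE client_id = ? AND valid_from <= ?
           ORDER BY valid_from DESC LIMIT 1""",
        (client_id, date)).fetchone()
    return row[0] if row else None

record_address(1, "12 High Street", "1995-03-01")
record_address(1, "7 Station Road", "1998-09-15")

print(address_as_of(1, "1996-01-01"))
print(address_as_of(1, "1999-01-01"))
```

Because old rows are never overwritten, the complete client history stays queryable, which is what makes the time attribute important.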
[Figure: warehouse architecture. Clients run Query & Analysis against the Warehouse and its Metadata; an Integrator, fed by several Extractor/Monitor components, loads the Warehouse. Labelled activities: Design Phase, Loading, Maintenance, Optimization.]
Single-layer
Every data element is stored once only
Virtual warehouse
Two-layer
Real-time + derived data
Most commonly used approach in industry today
[Figure, single-layer: Operational systems and Informational systems both on Real-time data]
[Figure, two-layer: Operational systems on Real-time data; Informational systems on Derived Data]
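One way to picture the difference between the two architectures is a database view versus a stored summary table. The sketch below uses Python's built-in `sqlite3` module; the `orders` table and both summary names are hypothetical, and treating a SQL view as the "virtual warehouse" is an interpretation rather than something the slides state.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL);
    INSERT INTO orders (region, amount)
        VALUES ('north', 10.0), ('north', 5.0), ('south', 7.0);

    -- Single-layer: a view stores nothing and reads the real-time data on every query.
    CREATE VIEW sales_by_region_virtual AS
        SELECT region, SUM(amount) AS total FROM orders GROUP BY region;

    -- Two-layer: derived data is computed once and stored separately.
    CREATE TABLE sales_by_region_derived AS
        SELECT region, SUM(amount) AS total FROM orders GROUP BY region;
""")

# The operational system keeps changing after the derived copy was built.
conn.execute("INSERT INTO orders (region, amount) VALUES ('south', 3.0)")

virtual = dict(conn.execute("SELECT region, total FROM sales_by_region_virtual"))
derived = dict(conn.execute("SELECT region, total FROM sales_by_region_derived"))
print(virtual)
print(derived)
```

The view always reflects the latest order but recomputes the sum on each read. The derived table answers from stored data and stays at its load-time values until it is refreshed.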
Three-layer Architecture: Conceptual View
Transformation of real-time data to derived data really requires two steps
[Figure, three layers:]
Operational systems: Real-time data
Physical Implementation of the Data Warehouse: Reconciled Data
Informational systems: Derived Data (view level, particular informational needs)
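The two steps can be sketched as two separate functions, one per layer boundary. Everything concrete here is assumed for illustration: the two operational systems, their field names, and the pence-to-pounds conversion are hypothetical.

```python
from collections import defaultdict

# Real-time data: two hypothetical operational systems with different conventions.
sales_system = [{"cust": "C1", "amount_pence": 1250, "date": "1998-01-05"},
                {"cust": "C2", "amount_pence": 800, "date": "1998-01-06"}]
web_shop = [{"customer_id": "c1", "amount_pounds": 4.5, "day": "1998-01-07"}]


def reconcile(sales, web):
    """Step 1: real-time data -> reconciled data (one schema, one unit, one key format)."""
    reconciled = []
    for r in sales:
        reconciled.append({"customer": r["cust"].upper(),
                           "amount": r["amount_pence"] / 100,
                           "date": r["date"]})
    for r in web:
        reconciled.append({"customer": r["customer_id"].upper(),
                           "amount": r["amount_pounds"],
                           "date": r["day"]})
    return reconciled


def derive_totals(reconciled):
    """Step 2: reconciled data -> derived data for one particular informational need."""
    totals = defaultdict(float)
    for r in reconciled:
        totals[r["customer"]] += r["amount"]
    return dict(totals)


reconciled = reconcile(sales_system, web_shop)
derived = derive_totals(reconciled)
print(derived)
```

Keeping the reconciled layer as its own result means several derived views can be built from it without repeating the schema and unit conversions.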
Integration
Cleansing & merging
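A minimal sketch of what cleansing and merging might look like for customer records from two sources. The matching rule (records with the same normalised e-mail address describe the same customer) and the conflict rule (the first non-missing value wins) are assumptions chosen for brevity; the records and the `clean` and `merge` functions are hypothetical.

```python
def clean(record):
    """Cleansing: trim whitespace, lower-case e-mails, turn empty strings into None."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
            if key == "email":
                value = value.lower()
            value = value or None
        cleaned[key] = value
    return cleaned


def merge(records, key="email"):
    """Merging: one output record per key; the first non-missing value of each field wins."""
    merged = {}
    for record in map(clean, records):
        target = merged.setdefault(record[key], {})
        for field, value in record.items():
            if target.get(field) is None:
                target[field] = value
    return list(merged.values())


source_a = [{"email": "Ann@Example.org ", "name": "Ann Lee", "phone": ""}]
source_b = [{"email": "ann@example.org", "name": "", "phone": "555-0100"},
            {"email": "bob@example.org", "name": "Bob Ray", "phone": None}]

customers = merge(source_a + source_b)
print(customers)
```

The two records for Ann collapse into one, with the phone number filled in from the second source. Real integration work usually needs fuzzier matching than an exact key.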
Warehouse is a Specialized DB

| Standard DB (OLTP) | Warehouse (OLAP) |
|---|---|
| Mostly updates | Mostly reads |
| Many small transactions | Queries are long and complex |
| MB to GB of data | GB to TB of data |
| Current snapshot | History |
| Index/hash on primary key | Lots of scans |
| Raw data | Summarized, reconciled data |
| Thousands of users (e.g., clerical users) | Hundreds of users (e.g., decision-makers, analysts) |
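The two workload styles in the comparison can be illustrated on a toy table with Python's built-in `sqlite3` module. The `account` table and its rows are hypothetical, and a three-row table only shows the shape of each query, not realistic performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, branch TEXT, balance REAL)")
conn.executemany("INSERT INTO account VALUES (?, ?, ?)",
                 [(1, "leeds", 100.0), (2, "leeds", 250.0), (3, "york", 75.0)])

# OLTP-style: a small transaction that updates one row, located via the primary key.
with conn:
    conn.execute("UPDATE account SET balance = balance - 20 WHERE id = ?", (1,))

# OLAP-style: a read-only query that scans the whole table and summarises it.
summary = conn.execute(
    "SELECT branch, COUNT(*), SUM(balance) FROM account GROUP BY branch ORDER BY branch"
).fetchall()
print(summary)

# EXPLAIN QUERY PLAN shows the access path SQLite picks (wording varies by version).
for sql in ("SELECT * FROM account WHERE id = 1",
            "SELECT branch, SUM(balance) FROM account GROUP BY branch"):
    print(conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall())
```

The point lookup can use the primary-key index, while the aggregate has to read every row, which is why the table above pairs "Index/hash on primary key" with "Lots of scans".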