Data Warehousing
In the beginning, life was simple. But our information needs kept growing (the "spider web").
SOURCE: William H. Inmon
The producer wants to know...
- Which are our lowest/highest margin customers?
- Who are my customers, and what products are they buying?
- What is the most effective distribution channel?
- What product promotions have the biggest impact on revenue?
- What impact will new products/services have on revenue and margins?
- Which customers are most likely to go to the competition?
Data, data everywhere, yet...
- I can't find the data I need
  - data is scattered over the network
  - many versions, subtle differences
- I can't get the data I need
  - need an expert to get the data
- I can't understand the data I found
  - available data is poorly documented
- I can't use the data I found
  - results are unexpected
  - data needs to be transformed from one form to another
Scenario 1
ABC Pvt Ltd is a company with branches in Mumbai, Delhi, Chennai and Bangalore. The Sales Manager wants a quarterly sales report. Each branch has a separate operational system.
Scenario 1: ABC Pvt Ltd.
[Diagram: the Sales Manager asks the four branch systems (Mumbai, Delhi, Chennai, Bangalore) for sales per item type per branch for the first quarter.]
Solution 1: ABC Pvt Ltd.
Extract sales information from each database. Store the information in a common repository at a single site.
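The extract-and-consolidate idea can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the company's actual schema: the branch figures, item types, and column names are all invented.

```python
import sqlite3

# Hypothetical per-branch extracts; in practice each branch has its own
# operational system, here they are just lists of (item_type, quarter, amount).
branch_sales = {
    "Mumbai":    [("soap", 1, 1200), ("oil", 1, 800)],
    "Delhi":     [("soap", 1, 950)],
    "Chennai":   [("oil", 1, 400), ("soap", 2, 300)],
    "Bangalore": [("soap", 1, 700)],
}

# The common repository at a single site (the "data warehouse" of the scenario).
repo = sqlite3.connect(":memory:")
repo.execute("CREATE TABLE sales (branch TEXT, item_type TEXT, quarter INT, amount REAL)")
for branch, rows in branch_sales.items():
    repo.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)",
                     [(branch, item, q, amt) for item, q, amt in rows])

# The Sales Manager's report: sales per item type per branch, first quarter.
report = repo.execute(
    "SELECT branch, item_type, SUM(amount) FROM sales "
    "WHERE quarter = 1 GROUP BY branch, item_type ORDER BY branch, item_type"
).fetchall()
for row in report:
    print(row)
```

Once the branch data sits in one repository, the quarterly report is a single GROUP BY query instead of four separate extractions.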
Solution 1: ABC Pvt Ltd.
[Diagram: the four branch databases (Mumbai, Delhi, Chennai, Bangalore) feed a central Data Warehouse; the Sales Manager uses query and analysis tools against it to produce the report.]
Scenario 2
One Stop Shopping Super Market has a huge operational database. Whenever executives want some report, the OLTP system becomes slow and data entry operators have to wait for some time.
Scenario 2: One Stop Shopping
[Diagram: data entry operators and management reports compete for the same operational database, so the operators must wait.]
Solution 2
Extract the data needed for analysis from the operational database. Store it in a warehouse. Refresh the warehouse at regular intervals so that it contains up-to-date information for analysis. The warehouse will contain data with a historical perspective.
Solution 2
[Diagram: transactions from the data entry operators go to the operational database; data is extracted into the Data Warehouse, from which the manager gets reports.]
Scenario 3
Cakes & Cookies is a small, new company. The President of the company wants his company to grow. He needs information so that he can make correct decisions.
Solution 3
Improve the quality of data before loading it into the warehouse. Perform data cleaning and transformation before loading the data. Use query and analysis tools to support ad hoc queries.
Solution 3
[Diagram: the President uses query and analysis tools over the Data Warehouse; charts show sales improvement and expansion over time.]
Why Do We Need Data Warehouses?
- Consolidation of information resources
- Improved query performance
- Separate research and decision support functions from the operational systems
- Foundation for data mining, data visualization, advanced reporting and OLAP tools
What Is a Data Warehouse Used For?
Knowledge discovery:
- Making consolidated reports
- Finding relationships and correlations
- Data mining
- Examples:
  - Banks identifying credit risks
  - Insurance companies searching for fraud
  - Medical research
How Do Data Warehouses Differ From
Operational Systems?
Goals
Structure
Size
Performance optimization
Technologies used
Comparison Chart of Database Types

Data warehouse                                                     | Operational system
Subject oriented                                                   | Transaction oriented
Large (hundreds of GB up to several TB)                            | Small (MB up to several GB)
Historic data                                                      | Current data
De-normalized table structure (few tables, many columns per table) | Normalized table structure (many tables, few columns per table)
Batch updates                                                      | Continuous updates
Usually very complex queries                                       | Simple to complex queries
Design Differences
- Operational System: ER diagram
- Data Warehouse: star schema

Supporting a Complete Solution
- Operational System: data entry
- Data Warehouse: data retrieval
Introduction to Data Warehousing
Data, data, data everywhere! Information: that's another story! Especially the right information at the right time!
Data warehousing's goal is to make the right information available at the right time.
Data warehousing is a data store (e.g., a database of some sort) plus a process for bringing together disparate data from throughout an organization for decision-support purposes.
Introduction, contd.
Data warehouses are natural allies for data mining (they work together well). Data mining can help fulfill some of the goals of data warehouses: the right information at the right time.
Relational database management systems (RDBMS), such as Oracle, DB2, Sybase, Informix, Focus, SQL Server, etc., are often used for data warehousing.
What Is a Data Warehouse?
Loosely speaking, a data warehouse refers to a database that is maintained separately from an organization's operational database.
Officially speaking:
"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." - William H. Inmon
Relational Database Theory
In the relational database modeling process (normalization), relations or tables are progressively decomposed into smaller relations, to the point where all attributes in a relation are very tightly coupled with the primary key of the relation.
- First normal form: data items are atomic.
- Second normal form: attributes fully depend on the primary key.
- Third normal form: all non-key attributes are completely independent of each other.
CSE601 29
University Tables

Staff (staffNum, firstName, lastName, gender):
  1234    Jane    Smith   F
  2323    Tom     Green   M
  1111    Jim     Brown   M

Student (matricNum, fName, lName, gender, yearReg, supervisor):
  121212  Mary    Hill    F   2003  1234
  232323  Steve   Gray    M   2005  1234
  123456  Jimmy   Smith   M   2000  1111

Enrolled (courseCode, studentNum):
  c1  121212
  c3  121212
  c3  123456
  c1  232323
  ...

Course (courseCode, creditValue):
  c1  120
  c3  60
  c5  60
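The university tables can be loaded into sqlite3 to make the normalization concrete; even a simple question already needs a multi-table join. This is a minimal sketch: the query below (credits per student supervised by Jane Smith) is an invented example, not from the slides.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Staff   (staffNum INT PRIMARY KEY, fName TEXT, lName TEXT, gender TEXT);
CREATE TABLE Student (matricNum INT PRIMARY KEY, fName TEXT, lName TEXT,
                      gender TEXT, yearReg INT, supervisor INT REFERENCES Staff);
CREATE TABLE Course  (courseCode TEXT PRIMARY KEY, creditValue INT);
CREATE TABLE Enrolled(courseCode TEXT REFERENCES Course,
                      studentNum INT REFERENCES Student);
""")
db.executemany("INSERT INTO Staff VALUES (?,?,?,?)",
               [(1234, "Jane", "Smith", "F"), (2323, "Tom", "Green", "M"),
                (1111, "Jim", "Brown", "M")])
db.executemany("INSERT INTO Student VALUES (?,?,?,?,?,?)",
               [(121212, "Mary", "Hill", "F", 2003, 1234),
                (232323, "Steve", "Gray", "M", 2005, 1234),
                (123456, "Jimmy", "Smith", "M", 2000, 1111)])
db.executemany("INSERT INTO Course VALUES (?,?)", [("c1", 120), ("c3", 60), ("c5", 60)])
db.executemany("INSERT INTO Enrolled VALUES (?,?)",
               [("c1", 121212), ("c3", 121212), ("c3", 123456), ("c1", 232323)])

# "Total credits each of Jane Smith's students is enrolled for" already
# requires a four-way join -- the query-cost issue discussed below.
rows = db.execute("""
  SELECT st.fName, SUM(c.creditValue)
  FROM Staff s
  JOIN Student  st ON st.supervisor = s.staffNum
  JOIN Enrolled e  ON e.studentNum  = st.matricNum
  JOIN Course   c  ON c.courseCode  = e.courseCode
  WHERE s.fName = 'Jane' AND s.lName = 'Smith'
  GROUP BY st.matricNum
""").fetchall()
print(rows)
```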
Relational Database Theory, contd.
The process of normalization generally breaks a table into many independent tables. A normalized database yields a flexible model, making it easy to maintain dynamic relationships between business entities.
A relational database system is effective and efficient for operational databases with a lot of updates (it aims at optimizing update performance).
Problems
A fully normalized data model can perform very inefficiently for queries. Historical data are usually large, with static relationships:
- Unnecessary joins may take an unacceptably long time
- Historical data are diverse
Problem: Heterogeneous Information Sources
Heterogeneities are everywhere:
- Different interfaces
- Different data representations
- Duplicate and inconsistent information
[Diagram: personal databases, digital libraries, scientific databases, and the World Wide Web as disparate sources.]
Goal: Unified Access to Data
Integration system:
- Collects and combines information
- Provides an integrated view and a uniform user interface
- Supports sharing
[Diagram: the integration system sits between users and the same sources (personal databases, digital libraries, scientific databases, the World Wide Web).]
The Traditional Research Approach
Query-driven (lazy, on-demand).
[Diagram: clients query an integration system (with metadata), which forwards queries through wrappers to each source.]
Disadvantages of the Query-Driven Approach
- Delay in query processing
  - Slow or unavailable information sources
  - Complex filtering and integration
- Inefficient and potentially expensive for frequent queries
- Competes with local processing at sources
- Hasn't caught on in industry
The Warehousing Approach
Information is integrated in advance and stored in the warehouse for direct querying and analysis.
[Diagram: extractors/monitors pull data from each source into an integration system (with metadata), which loads the Data Warehouse queried by clients.]
Advantages of the Warehousing Approach
- High query performance
  - But not necessarily the most current information
- Doesn't interfere with local processing at sources
  - Complex queries run at the warehouse
  - OLTP stays at the information sources
- Information is copied at the warehouse
  - Can modify, annotate, summarize, restructure, etc.
  - Can store historical information
  - Security, no auditing
- Has caught on in industry
Not an Either-Or Decision
The query-driven approach is still better for:
- Rapidly changing information
- Rapidly changing information sources
- Truly vast amounts of data from large numbers of sources
- Clients with unpredictable needs
What Is a Data Warehouse? A Practitioner's Viewpoint
"A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context."
-- Barry Devlin, IBM Consultant
What is a Data Warehouse?
An Alternative Viewpoint
A DW is a
subject-oriented,
integrated,
time-varying,
non-volatile
collection of data that is used primarily in
organizational decision making.
-- W.H. Inmon, Building the Data Warehouse, 1992
A Data Warehouse is...
Stored collection of diverse data
A solution to data integration problem
Single repository of information
Subject-oriented
Organized by subject, not by application
Used for analysis, data mining, etc.
Optimized differently from a transaction-oriented DB
User interface aimed at executives
Contd
Large volume of data (Gb, Tb)
Non-volatile
Historical
Time attributes are important
Updates infrequent
May be append-only
Examples
- All transactions ever at Sainsbury's
- Complete client histories at an insurance firm
- LSE financial information and portfolios
Generic Warehouse Architecture
[Diagram: extractors/monitors feed an integrator, which loads the warehouse (with metadata); clients run query and analysis against it. Phases: design, loading, maintenance, optimization.]
Data Warehouse Architectures: Conceptual View
- Single-layer
  - Every data element is stored once only
  - Virtual warehouse
- Two-layer
  - Real-time + derived data
  - The most commonly used approach in industry today
[Diagrams: in the single-layer view, operational and informational systems share one real-time data store; in the two-layer view, informational systems work on derived data kept separate from the operational systems' real-time data.]
Three-Layer Architecture: Conceptual View
Transformation of real-time data to derived data really requires two steps.
[Diagram: operational systems hold real-time data, which is first consolidated into reconciled data, from which derived data is produced for informational systems.]
Physical Implementation of the Data Warehouse
- View level: particular informational needs
Data Warehousing: Two Distinct Issues
(1) How to get information into the warehouse: "data warehousing"
(2) What to do with data once it's in the warehouse: "warehouse DBMS"
Both are rich research areas. Industry has focused on (2).
Issues in Data Warehousing
Warehouse Design
Extraction
Wrappers, monitors (change detectors)
Integration
Cleansing & merging
Warehousing specification & Maintenance
Optimizations
Miscellaneous (e.g., evolution)
Definitions of a Data Warehouse
1. "A subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process." - W.H. Inmon
2. "A copy of transaction data, specifically structured for query and analysis." - Ralph Kimball
Data Warehouse
For organizational learning to take place, data from
many sources must be gathered together and
organized in a consistent and useful way - hence,
Data Warehousing (DW)
DW allows an organization (enterprise) to remember
what it has noticed about its data
Data Mining techniques make use of the data in a
Data Warehouse
Data Warehouse
[Diagram: enterprise database tables (customers, transactions, vendors, orders, etc.) are copied, organized, and summarized into the Data Warehouse, on which data mining is performed. Data miners: "farmers" (they know what they are looking for) and "explorers" (unpredictable).]
Data Warehouse
A data warehouse is a copy of transaction data specifically
structured for querying, analysis, reporting, and more
rigorous data mining
Note that the data warehouse contains a copy of the
transactions which are not updated or changed later by the
transaction system
Also note that this data is specially structured, and may have
been transformed when it was copied into the data
warehouse
Data Warehouse
Subject oriented
Data integrated
Time variant
Nonvolatile
Data Warehouse: Subject-Oriented
- Organized around major subjects, such as customer, product, sales
- Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
- Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
Data Warehouse: Integrated
- Constructed by integrating multiple, heterogeneous data sources: relational databases, flat files, on-line transaction records
- Data cleaning and data integration techniques are applied:
  - Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
  - When data is moved to the warehouse, it is converted
Data Warehouse: Time-Variant
- The time horizon for the data warehouse is significantly longer than that of operational systems
  - Operational database: current value data
  - Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
Data Warehouse: Nonvolatile
- A physically separate store of data transformed from the operational environment
- Operational update of data does not occur in the data warehouse environment
  - Does not require transaction processing, recovery, and concurrency control mechanisms
  - Requires only two operations in data accessing: initial loading of data and access of data
Characteristics of Data Warehouse
Subject oriented. Data are organized based on how
the users refer to them.
Integrated. All inconsistencies regarding naming
convention and value representations are removed.
Nonvolatile. Data are stored in read-only format and
do not change over time.
Time variant. Data are not current but normally time
series.
Characteristics of Data Warehouse
- Summarized. Operational data are mapped into a decision-usable format.
- Large volume. Time series data sets are normally quite large.
- Not normalized. DW data can be, and often are, redundant.
- Metadata. Data about data are stored.
- Data sources. Data come from internal and external unintegrated operational systems.
A Data Warehouse Is Subject Oriented

Subject Orientation
- Application environment: design activities must be equally focused on both process and database design.
- Data warehouse environment: the DW world is primarily void of process design and tends to focus exclusively on issues of data modeling and database design.
Data Integrated
Integration means consistency of naming conventions and measurement attributes, accuracy, and common aggregation. It requires the establishment of a common unit of measure for all synonymous data elements from dissimilar databases. The data must be stored in the DW in an integrated, globally acceptable manner.
Time Variant
In an operational application system, the expectation is that all data within the database are accurate as of the moment of access. In the DW, data are simply assumed to be accurate as of some moment in time, and not necessarily right now.
One of the places where DW data display time variance is in the structure of the record key. Every primary key contained within the DW must contain, either implicitly or explicitly, an element of time (day, week, month, etc.).
Time Variant
Every piece of data contained within the
warehouse must be associated with a
particular point in time if any useful analysis is
to be conducted with it.
Another aspect of time variance in DW data is
that, once recorded, data within the
warehouse cannot be updated or changed.
Nonvolatility
Typical activities such as deletes, inserts, and
changes that are performed in an operational
application environment are completely
nonexistent in a DW environment.
Only two data operations are ever performed
in the DW: data loading and data access
Nonvolatility

Application: The design issues must focus on data integrity and update anomalies. Complex processes must be coded to ensure that the data update processes allow for high integrity of the final product.
DW: Such issues are of no concern in a DW environment, because data update is never performed.

Application: Data is placed in normalized form to ensure minimal redundancy (totals that could be calculated would never be stored).
DW: Designers find it useful to store many such calculations and summarizations.

Application: The technologies necessary to support issues of transaction and data recovery, rollback, and detection and remedy of deadlock are quite complex.
DW: Relative simplicity in technology.
Data Warehouse
In order for data to be effective, DW must be:
- Consistent.
- Well integrated.
- Well defined.
- Time stamped.
DW environment:
- The data store, data mart & the metadata.
The Data Store
An operational data store (ODS) stores data for a
specific application. It feeds the data warehouse a
stream of desired raw data.
Is the most common component of DW environment.
Data store is generally subject oriented, volatile,
current commonly focused on customers, products,
orders, policies, claims, etc
The Data Store, contd.
Its day-to-day function is to store the data for a single specific set of operational applications. Its other function is to feed the data warehouse data for the purpose of analysis.
The Data Mart
It is a lower-cost, scaled-down version of the DW. Data marts offer a targeted and less costly method of gaining the advantages associated with data warehousing, and can be scaled up to a full DW environment over time.
The Metadata
The last component of DW environments. It is information that is kept about the warehouse, rather than information kept within the warehouse. Legacy systems generally don't keep a record of the characteristics of the data (such as what pieces of data exist and where they are located). Metadata is simply data about data.
Data Mart
A Data Mart is a smaller, more focused Data
Warehouse - a mini-warehouse.
A Data Mart typically reflects the business
rules of a specific business unit within an
enterprise.
Data Warehouse to Data Mart
[Diagram: the Data Warehouse feeds several Data Marts, each delivering information for decision support.]
General Architecture for Data Warehousing
[Diagram: source systems (transaction data) feed Extraction, (Clean), Transformation & Load (ETL) into a central repository with a metadata repository; data marts serve business end users, with operational feedback to the source systems.]
Data Warehouse vs.
Heterogeneous DBMS
Traditional heterogeneous DB integration: A query driven
approach
- Build wrappers/mediators on top of heterogeneous databases
Data warehouse: update-driven, high performance
- Information from heterogeneous sources is integrated in advance and
stored in warehouses for direct query and analysis
Data Warehouse vs.
Operational DBMS
OLTP (on-line transaction (query) processing)
- Major task of traditional relational DBMS
- Day-to-day operations: purchasing, inventory, banking, manufacturing,
payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
- Major task of data warehouse system
- Data analysis and decision making
Data Warehouse vs. Operational DBMS

Feature            | OLTP                                                     | OLAP
users              | clerk, IT professional                                   | knowledge worker
function           | day-to-day operations                                    | decision support
DB design          | application-oriented                                     | subject-oriented
data               | current, up-to-date; detailed, flat relational; isolated | historical; summarized, multidimensional; integrated, consolidated
usage              | repetitive                                               | ad-hoc
access             | read/write; index/hash on prim. key                      | lots of scans
unit of work       | short, simple transaction                                | complex query
# records accessed | tens                                                     | millions
# users            | thousands                                                | hundreds
DB size            | 100MB-GB                                                 | 100GB-TB
metric             | transaction throughput                                   | query throughput, response
Design of Data Warehouse
Four views regarding the design of a data warehouse must be considered:
- Top-down view: allows selection of the relevant information necessary for the data warehouse
- Data source view: exposes the information being captured, stored, and managed by operational systems
- Data warehouse view: consists of fact tables and dimension tables
- Business query view: sees the perspectives of data in the warehouse from the view of the end user
Data Warehouse Design
Process
Top-down, bottom-up approaches or a combination of both
- Top-down: Starts with overall design and planning (mature)
- Bottom-up: Starts with experiments and prototypes (rapid)
From software engineering point of view
- Waterfall: structured and systematic analysis at each step before
proceeding to the next
- Spiral: rapid generation of increasingly functional systems, short turnaround time
Typical data warehouse design process
- Choose a business process to model, e.g., orders, invoices, etc.
- Choose the grain (atomic level of data) of the business process
- Choose the dimensions that will apply to each fact table record
- Choose the measure that will populate each fact table record
Data Warehouse: A Multi-Tiered Architecture
[Diagram, four tiers: data sources (operational DBs and other sources) pass through extract/transform/load/refresh, with a monitor & integrator, into the Data Warehouse and Data Marts (data storage, with metadata); OLAP servers form the OLAP engine tier; front-end tools provide analysis, query, reports, and data mining.]
The Data Warehouse Architecture
The architecture consists of various interconnected elements:
- Operational and external database layer: the source data for the DW
- Information access layer: the tools the end user uses to extract and analyze the data
- Data access layer: the interface between the operational and information access layers
- Metadata layer: the data directory or repository of metadata information
Components of the Data Warehouse Architecture
[Diagram only.]
Three Data Warehouse Models
- Enterprise warehouse: collects all of the information about subjects spanning the entire organization
- Data mart: a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, as in a marketing data mart. Independent vs. dependent (loaded directly from the warehouse) data marts.
- Virtual warehouse: a set of views over operational databases. Only some of the possible summary views may be materialized.
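The virtual-warehouse idea (views over the operational database, nothing copied) can be shown in a few lines of sqlite3; the table, view, and customer names here are purely illustrative.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)",
               [("Alice", 100.0), ("Alice", 50.0), ("Bob", 75.0)])

# A summary view computed on demand over the operational table --
# nothing is extracted or materialized.
db.execute("""CREATE VIEW customer_totals AS
              SELECT customer, SUM(amount) AS total
              FROM orders GROUP BY customer""")

rows = db.execute("SELECT * FROM customer_totals ORDER BY customer").fetchall()
print(rows)
```

The trade-off matches the text: no storage or refresh cost, but every query re-runs against the operational database.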
Data Warehouse Development: A Recommended Approach
[Diagram: define a high-level corporate data model; build (1) an enterprise data warehouse and/or (2) data marts; through model refinement these grow into (3) distributed data marts with a warehouse, and finally (4) a multi-tier data warehouse.]
Data Warehouse Back-End Tools and Utilities
- Data extraction: get data from multiple, heterogeneous, and external sources
- Data cleaning: detect errors in the data and rectify them when possible
- Data transformation: convert data from legacy or host format to warehouse format
- Load: sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
- Refresh: propagate the updates from the data sources to the warehouse
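The extract/clean/transform/load steps above can be sketched as a toy pipeline of plain Python functions. This is a minimal illustration under invented data: the source records, field names, and cleaning rules are all assumptions, not part of any real tool.

```python
def extract():
    # Stand-in for pulling from multiple heterogeneous sources.
    return [
        {"cust": " Alice ", "amount": "100.50", "date": "2024-01-15"},
        {"cust": "Bob",     "amount": "n/a",    "date": "2024-01-16"},
        {"cust": "alice",   "amount": "49.50",  "date": "2024-01-15"},
    ]

def clean(records):
    # Detect errors and rectify them when possible; drop what can't be fixed.
    out = []
    for rec in records:
        try:
            amount = float(rec["amount"])
        except ValueError:
            continue  # a real tool would log this bad record
        out.append({"cust": rec["cust"].strip().title(),  # normalize names
                    "amount": amount, "date": rec["date"]})
    return out

def transform(records):
    # Summarize/consolidate into a warehouse-friendly shape: total per customer.
    totals = {}
    for rec in records:
        totals[rec["cust"]] = totals.get(rec["cust"], 0.0) + rec["amount"]
    return totals

# "Load" is reduced here to binding the result; a real load step would
# also sort, check integrity, and build indices.
warehouse = transform(clean(extract()))
print(warehouse)
```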
Metadata Repository
Metadata is the data defining warehouse objects. The repository stores:
- Description of the structure of the data warehouse: schema, views, dimensions, hierarchies, derived data definitions, data mart locations and contents
- Operational metadata: data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
- The algorithms used for summarization
- The mapping from the operational environment to the data warehouse
- Data related to system performance: warehouse schema, view and derived data definitions
- Business data: business terms and definitions, ownership of data, charging policies
Building a Data Warehouse
Data Warehouse Lifecycle
- Analysis
- Design
- Import data
- Install front-end tools
- Test and deploy
Stage 1: Analysis
- Identify:
  - Target questions
  - Data needs
  - Timeliness of data
  - Granularity
- Create an enterprise-level data dictionary
- Dimensional analysis: identify facts and dimensions
Stage 2: Design
- Star schema
- Data transformation
- Aggregates
- Pre-calculated values
- HW/SW architecture
Dimensional Modeling
- Fact table: the primary table in a dimensional model, meant to contain measurements of the business.
- Dimension table: one of a set of companion tables to a fact table. Most dimension tables contain many textual attributes that are the basis for constraining and grouping within data warehouse queries.
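A fact table with its dimension tables can be sketched in sqlite3. This is a minimal, illustrative star schema: every table name, column, and value below is invented for the example, not prescribed by the slides.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Dimension tables: textual/descriptive attributes used to constrain and group.
CREATE TABLE dim_date    (date_key INT PRIMARY KEY, day INT, month INT,
                          quarter INT, year INT);
CREATE TABLE dim_product (product_key INT PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_branch  (branch_key INT PRIMARY KEY, city TEXT);

-- Fact table: one row per business measurement, with a foreign key
-- into each dimension.
CREATE TABLE fact_sales (
    date_key    INT REFERENCES dim_date,
    product_key INT REFERENCES dim_product,
    branch_key  INT REFERENCES dim_branch,
    units_sold  INT,
    revenue     REAL
);
""")
db.execute("INSERT INTO dim_date VALUES (1, 15, 2, 1, 2024)")
db.execute("INSERT INTO dim_product VALUES (1, 'soap', 'toiletries')")
db.execute("INSERT INTO dim_branch VALUES (1, 'Mumbai')")
db.execute("INSERT INTO fact_sales VALUES (1, 1, 1, 40, 1200.0)")

# Typical warehouse query: constrain by dimension attributes, sum a measure.
total = db.execute("""
  SELECT SUM(f.revenue)
  FROM fact_sales f
  JOIN dim_date   d ON d.date_key   = f.date_key
  JOIN dim_branch b ON b.branch_key = f.branch_key
  WHERE d.quarter = 1 AND b.city = 'Mumbai'
""").fetchone()[0]
print(total)
```

Note the shape: the fact table holds only keys and measures, while the dimensions carry the attributes users filter and group by.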
Stage 3: Import Data
- Identify data sources
- Extract the needed data from existing systems to a data staging area
- Transform and clean the data:
  - Resolve data type conflicts
  - Resolve naming and key conflicts
  - Remove, correct, or flag bad data
  - Conform dimensions
- Load the data into the warehouse
Importing Data Into the Warehouse
[Diagram: the operational source systems (OLTP 1, 2, 3) feed a data staging area, which loads the Data Warehouse.]
Stage 4: Install Front-End Tools
- Reporting tools
- Data mining tools
- GIS
- Etc.
Stage 5: Test and Deploy
- Usability tests
- Software installation
- User training
- Performance tweaking based on usage
Special Concerns
Time and expense
Managing the complexity
Update procedures and maintenance
Changes to source systems over time
Changes to data needs over time
Data Warehouse Usage
Three kinds of data warehouse applications
- Information processing
supports querying, basic statistical analysis, and reporting using
crosstabs, tables, charts and graphs
- Analytical processing
multidimensional analysis of data warehouse data
supports basic OLAP operations, slice-dice, drilling, pivoting
Data Warehouse Usage
- Data mining
knowledge discovery from hidden patterns
supports associations, constructing analytical models,
performing classification and prediction, and presenting
the mining results using visualization tools
Data Warehousing Typology
The virtual data warehouse - the end users have
direct access to the data stores, using tools enabled
at the data access layer
The central data warehouse - a single physical
database contains all of the data for a specific
functional area
The distributed data warehouse - the components
are distributed across several physical databases
12 Rules of a Data Warehouse
- Data warehouse and operational environments are separated
- Data is integrated
- Contains historical data over a long period of time
- Data is snapshot data, captured at a given point in time
- Data is subject-oriented
12 Rules of a Data Warehouse, contd.
- Mainly read-only, with periodic batch updates
- The development life cycle has a data-driven approach, versus the traditional process-driven approach
- Data contains several levels of detail: current, old, lightly summarized, highly summarized
12 Rules of a Data Warehouse, contd.
- The environment is characterized by read-only transactions against very large data sets
- The system traces data sources, transformations, and storage
- Metadata is a critical component: source, transformation, integration, storage, relationships, history, etc.
- Contains a chargeback mechanism for resource usage that enforces optimal use of data by end users
Implementing the Data Warehouse
Kozar's list of seven deadly sins of data warehouse implementation:
1. "If you build it, they will come" - the DW needs to be designed to meet people's needs
2. Omission of an architectural framework - you need to consider the number of users, volume of data, update cycle, etc.
3. Underestimating the importance of documenting assumptions - the assumptions and potential conflicts must be included in the framework
Seven Deadly Sins, continued
4. Failure to use the right tool - a DW project needs
different tools than those used to develop an application
5. Life cycle abuse - in a DW, the life cycle really never ends
6. Ignorance about data conflicts - resolving these takes a
lot more effort than most people realize
7. Failure to learn from mistakes - since one DW project
tends to beget another, learning from the early mistakes
will yield higher quality later
Issues of Data Redundancy Between DW and Operational Environments
The lack of relevancy of issues such as data normalization in the DW environment may suggest the existence of massive data redundancy within the data warehouse and between the operational and DW environments.
The data being loaded into the DW are filtered and cleansed as they pass from the operational database to the warehouse. Because of this cleansing, much of the data that exists in the operational environment never passes to the data warehouse. Only the data necessary for processing by the DSS or EIS are ever actually loaded into the DW.
Issues of Data Redundancy Between DW and Operational Environments, contd.
The time horizons for warehouse and operational data elements are unique. Data in the operational environment are fresh, whereas warehouse data are generally much older (so there is minimal opportunity for the data to overlap between the two environments).
The data loaded into the DW often undergo a radical transformation as they pass from the operational to the DW environment, so the data in the DW are not the same.
Data Warehouse Architecture
At the top sits a centralized database:
- Generally configured for queries and appends, not transactions
- Many indices, materialized views, etc.
Data is loaded and periodically updated via Extract/Transform/Load (ETL) tools.
[Diagram: ETL pipelines feed outputs from sources (RDBMS1, RDBMS2, HTML1, XML1) into the Data Warehouse.]
ETL Tools
ETL tools are the equivalent of schema mappings in virtual integration, but are more powerful: arbitrary pieces of code that take data from a source and convert it into data for the warehouse:
- import filters: read and convert from data sources
- data transformations: join, aggregate, filter, convert data
- de-duplication: finds multiple records referring to the same entity and merges them
- profiling: builds tables, histograms, etc. to summarize data
- quality management: test against master values, known business rules, constraints, etc.
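The de-duplication step in the list above can be sketched as follows. This is a deliberately simplistic matching rule (records normalizing to the same name + email are treated as one entity); real tools use fuzzy matching. All records and field names are invented.

```python
def norm_key(rec):
    # Normalization used for matching: case- and whitespace-insensitive.
    return (rec["name"].strip().lower(), rec["email"].strip().lower())

def dedupe(records):
    merged = {}
    for rec in records:
        key = norm_key(rec)
        if key in merged:
            # Merge policy: keep the first non-empty value for each field.
            for field, value in rec.items():
                if not merged[key].get(field):
                    merged[key][field] = value
        else:
            merged[key] = dict(rec)
    return list(merged.values())

records = [
    {"name": "Jane Smith",  "email": "JSMITH@example.com", "phone": ""},
    {"name": " jane smith", "email": "jsmith@example.com", "phone": "555-0101"},
    {"name": "Tom Green",   "email": "tgreen@example.com", "phone": ""},
]
print(dedupe(records))  # the two Jane Smith records collapse into one
```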
Example ETL Tool Chain
[Diagram: invoice line items have their date-time split and invalid dates/times filtered out; a join with item records filters invalid items; a match against customer records filters non-matches; records are then grouped by customer to produce customer balances, with the invalid records logged at each stage.]
This is an example for e-commerce loading. Note the multiple stages of filtering (using selection or join-like operations) and the logging of bad records before we group and load.
Data Warehouse Architectures
Generic Two-Level Architecture
Independent Data Mart
Dependent Data Mart and Operational
Data Store
Logical Data Mart and @ctive Warehouse
Three-Layer architecture
All involve some form of extraction, transformation and loading (ETL)
Generic Two-Level Data Warehousing Architecture
[Diagram: E, T, L steps feed one company-wide warehouse. Extraction is periodic, so data is not completely current in the warehouse.]
Independent Data Mart Data Warehousing Architecture
[Diagram: a separate ETL feeds each independent data mart. Data marts are mini-warehouses, limited in scope; having multiple data marts adds data access complexity.]
Dependent Data Mart with Operational Data Store: A Three-Level Architecture
[Diagram: a single ETL feeds the enterprise data warehouse (EDW); dependent data marts are loaded from the EDW; the ODS provides an option for obtaining current data. Result: simpler data access.]
Logical Data Mart and Real-Time Warehouse Architecture
[Diagram: near real-time ETL; the ODS and data warehouse are one and the same. Data marts are NOT separate databases, but logical views of the data warehouse, which makes it easier to create new data marts.]
[Figure: example of a DBMS log entry.]
Data Characteristics: Status vs. Event Data
Event = a database action (create/update/delete) that results from a transaction.
[Diagram: the status before and after each event.]

Data Characteristics: Transient vs. Periodic Data
With transient operational data, changes to existing records are written over previous records, thus destroying the previous data content.
Data Characteristics: Transient vs. Periodic Data, contd.
Periodic (warehouse) data are never physically altered or deleted once they have been added to the store.
Other Data Warehouse Changes
- New descriptive attributes
- New business activity attributes
- New classes of descriptive attributes
- Descriptive attributes become more refined
- Descriptive data are related to one another
- New sources of data
The Reconciled Data Layer
Typical operational data is:
- Transient: not historical
- Not normalized (perhaps due to denormalization for performance)
- Restricted in scope: not comprehensive
- Sometimes poor quality: inconsistencies and errors
After ETL, data should be:
- Detailed: not summarized yet
- Historical: periodic
- Normalized: 3rd normal form or higher
- Comprehensive: enterprise-wide perspective
- Timely: data should be current enough to assist decision-making
- Quality controlled: accurate with full integrity
The ETL Process
Capture/Extract
Scrub or data cleansing
Transform
Load and Index
ETL = Extract, transform, and load
Capture/Extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
Steps in data
reconciliation
Static extract = capturing a snapshot of the source data at a point in time
Incremental extract = capturing changes that have occurred since the last static extract
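The two extract styles above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the row data and the `updated_at` audit column are assumed.

```python
from datetime import datetime

# Illustrative source rows; "updated_at" is an assumed audit column.
source_rows = [
    {"id": 1, "amount": 100, "updated_at": datetime(2024, 1, 5)},
    {"id": 2, "amount": 250, "updated_at": datetime(2024, 2, 9)},
    {"id": 3, "amount": 75, "updated_at": datetime(2024, 3, 1)},
]

def static_extract(rows):
    """Static extract: a snapshot of the full source at a point in time."""
    return list(rows)

def incremental_extract(rows, last_extract_time):
    """Incremental extract: only rows changed since the last extract."""
    return [r for r in rows if r["updated_at"] > last_extract_time]

snapshot = static_extract(source_rows)                             # all rows
changes = incremental_extract(source_rows, datetime(2024, 2, 1))   # recent rows only
```

In practice the "last extract time" would be persisted between ETL runs; here it is passed in directly to keep the sketch self-contained.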
Scrub/Cleanse: uses pattern recognition and AI techniques to upgrade data quality
Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
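A toy cleansing pass covering three of the fixes listed above (misspelling correction, duplicate removal, and logging records with missing data) might look like this. The correction table and sample records are invented for illustration.

```python
# Assumed correction table mapping known misspellings to canonical values.
CITY_FIXES = {"Bomaby": "Mumbai", "Dehli": "Delhi"}

records = [
    {"cust_id": 1, "city": "Bomaby"},
    {"cust_id": 2, "city": "Dehli"},
    {"cust_id": 2, "city": "Dehli"},   # duplicate record
    {"cust_id": 3, "city": None},      # missing data
]

def scrub(rows):
    """Return (clean rows, bad rows) after fixing, deduplicating, and logging."""
    clean, bad, seen = [], [], set()
    for r in rows:
        if r["city"] is None:
            bad.append(r)              # log records with missing data
            continue
        r = dict(r, city=CITY_FIXES.get(r["city"], r["city"]))  # fix misspellings
        key = (r["cust_id"], r["city"])
        if key in seen:
            continue                   # drop exact duplicates
        seen.add(key)
        clean.append(r)
    return clean, bad

clean, bad = scrub(records)
```

Real scrubbing tools apply fuzzy matching and learned patterns rather than a fixed lookup table; the structure (clean stream plus a logged reject stream) is the point here.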
Transform = convert data from the format of the operational system to the format of the data warehouse
Record-level:
- Selection: data partitioning
- Joining: data combining
- Aggregation: data summarization
Field-level:
- Single-field: from one field to one field
- Multi-field: from many fields to one, or one field to many
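The three record-level transformations can be sketched over in-memory rows; the sales and branch data below are illustrative, not from the slides.

```python
sales = [
    {"branch": "Mumbai", "item": "Soap", "qty": 10},
    {"branch": "Delhi", "item": "Soap", "qty": 5},
    {"branch": "Mumbai", "item": "Oil", "qty": 3},
]
branches = {"Mumbai": "West", "Delhi": "North"}

# Selection (data partitioning): keep only one branch's rows.
mumbai_sales = [s for s in sales if s["branch"] == "Mumbai"]

# Joining (data combining): attach the branch's region to each sale.
joined = [dict(s, region=branches[s["branch"]]) for s in sales]

# Aggregation (data summarization): total quantity per item.
totals = {}
for s in sales:
    totals[s["item"]] = totals.get(s["item"], 0) + s["qty"]
```

In a real ETL tool these would be relational operations pushed to the database engine; the dictionary version only shows the shape of each operation.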
Load/Index = place transformed data into the warehouse and create indexes
Refresh mode: bulk rewriting of target data at periodic intervals.
Update mode: only changes in source data are written to the data warehouse.
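The contrast between the two load modes can be shown with a dictionary standing in for the target table (an assumption made purely for illustration):

```python
# Target "table" keyed by row id; contents are illustrative.
warehouse = {1: "jan snapshot", 2: "jan snapshot"}

def refresh(target, source):
    """Refresh mode: bulk-rewrite the entire target at a periodic interval."""
    target.clear()
    target.update(source)

def update(target, changed_rows):
    """Update mode: write only the changed source rows to the target."""
    target.update(changed_rows)

update(warehouse, {2: "feb change"})       # only row 2 is touched
after_update = dict(warehouse)
refresh(warehouse, {1: "full reload", 3: "full reload"})  # whole target rewritten
```

Refresh is simpler but rewrites everything; update touches less data but requires change capture on the source side.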
Single-field transformation
In general, some transformation function translates data from the old form to the new form.
Algorithmic transformation uses a formula or logical expression.
Table lookup: another approach, uses a separate table keyed by source record code.
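Both single-field approaches fit in a few lines; the state codes and temperature formula are illustrative examples, not from the slides.

```python
# Table lookup: a separate table keyed by source record code.
STATE_LOOKUP = {"MH": "Maharashtra", "DL": "Delhi", "TN": "Tamil Nadu"}

def lookup_transform(state_code):
    """Translate a source code via the lookup table."""
    return STATE_LOOKUP.get(state_code, "UNKNOWN")

# Algorithmic transformation: a formula translates old form to new form.
def celsius_from_fahrenheit(temp_f):
    return (temp_f - 32) * 5 / 9
```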
Multifield transformation
M:1: from many source fields to one target field
1:M: from one source field to many target fields
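Two common multifield cases, sketched with invented field names: combining name parts (M:1) and splitting a phone field (1:M).

```python
# M:1 - many source fields combined into one target field.
def full_name(first, last):
    return f"{first} {last}"

# 1:M - one source field split into many target fields.
def split_phone(phone):
    """Split an assumed 'area-number' format, e.g. '022-5551234'."""
    area, number = phone.split("-", 1)
    return {"area_code": area, "number": number}
```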
Derived Data
Objectives
- Ease of use for decision support applications
- Fast response to predefined user queries
- Customized data for particular target audiences
- Ad-hoc query support
- Data mining capabilities
Characteristics
- Detailed (mostly periodic) data
- Aggregate (for summary)
- Distributed (to departmental servers)
Most common data model = star schema
(also called dimensional model)
Components of a star schema
Fact tables contain factual or quantitative data.
Dimension tables contain descriptions about the subjects of the business.
1:N relationship between dimension tables and fact tables.
Dimension tables are denormalized to maximize performance.
Excellent for ad-hoc queries, but bad for online transaction processing.
Figure 11-14: Star schema example. The fact table provides statistics for sales broken down by product, period and store dimensions.
Star schema with sample data
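A minimal working star schema along the lines of Figure 11-14 can be built in SQLite. Table names, columns, and sample rows here are invented for illustration (the store dimension stands in for the period dimension as well, to keep the sketch short).

```python
import sqlite3

# One fact table with foreign keys into two denormalized dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE store_dim   (store_key   INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE sales_fact  (product_key INTEGER, store_key INTEGER, units INTEGER,
                          FOREIGN KEY(product_key) REFERENCES product_dim(product_key),
                          FOREIGN KEY(store_key)   REFERENCES store_dim(store_key));
INSERT INTO product_dim VALUES (1, 'Soap'), (2, 'Oil');
INSERT INTO store_dim   VALUES (1, 'Mumbai'), (2, 'Delhi');
INSERT INTO sales_fact  VALUES (1, 1, 10), (1, 2, 5), (2, 1, 3);
""")

# Ad-hoc query: total units per product per city, joining fact to dimensions.
rows = con.execute("""
    SELECT p.name, s.city, SUM(f.units)
    FROM sales_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    JOIN store_dim   s ON s.store_key   = f.store_key
    GROUP BY p.name, s.city
    ORDER BY p.name, s.city
""").fetchall()
```

The query illustrates why the star schema suits ad-hoc analysis: every question is a join from the central fact table out to the descriptive dimensions, followed by a grouping.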
Issues Regarding Star Schema
Dimension table keys must be surrogate (non-intelligent and non-business related), because:
- Keys may change over time
- Length/format consistency
Granularity of fact table: what level of detail do you want?
- Transactional grain: finest level
- Aggregated grain: more summarized
- Finer grain: better market basket analysis capability
- Finer grain: more dimension tables, more rows in fact table
Duration of the database: how much history should be kept?
- Natural duration: 13 months or 5 quarters
- Financial institutions may need longer duration
- Older data is more difficult to source and cleanse
Modeling dates
Fact tables contain time-period data
Date dimensions are important
On-Line Analytical Processing (OLAP)
The use of a set of graphical tools that provides users
with multidimensional views of their data and allows
them to analyze the data using simple windowing
techniques
Relational OLAP (ROLAP)
- Traditional relational representation
Multidimensional OLAP (MOLAP)
- Cube structure
OLAP Operations
- Cube slicing: produce a 2-D view of the data
- Drill-down: going from summary to more detailed views
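Both OLAP operations can be mimicked over a tiny in-memory cube. The (product, period, store) cells below are invented for illustration.

```python
# A tiny 3-D cube: (product, period, store) -> units sold.
cube = {
    ("Soap", "Q1", "Mumbai"): 10,
    ("Soap", "Q1", "Delhi"): 5,
    ("Soap", "Q2", "Mumbai"): 8,
    ("Oil", "Q1", "Mumbai"): 3,
}

def slice_cube(cube, period):
    """Slicing: fix one dimension to get a 2-D (product, store) view."""
    return {(p, s): v for (p, q, s), v in cube.items() if q == period}

def summary_by_product(cube):
    """Summary level: roll everything up to product totals."""
    out = {}
    for (p, _, _), v in cube.items():
        out[p] = out.get(p, 0) + v
    return out

def drill_down(cube, product):
    """Drill-down: from a product total back to its (period, store) detail cells."""
    return {(q, s): v for (p, q, s), v in cube.items() if p == product}
```

A MOLAP engine stores the cube in a dedicated multidimensional structure and a ROLAP engine derives it from relational tables; the operations themselves are the same.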
Slicing a data cube
Summary report
Example of drill-down
Starting with summary data, users can obtain details for particular cells.
Drill-down with color added
Data Mining and Visualization
Knowledge discovery using a blend of statistical, AI, and
computer graphics techniques
Goals:
- Explain observed events or conditions
- Confirm hypotheses
- Explore data for new or unexpected relationships
Techniques
- Case-based reasoning
- Rule discovery
- Signal processing
- Neural nets
- Fractals
Data visualization - representing data in graphical/multimedia
formats for analysis