
Introduction to Data Warehousing

In the beginning, life was simple.

But our information needs kept growing. (The "spider web")

SOURCE: William H. Inmon

The producer wants to know:

Which are our lowest/highest margin customers?

Who are my customers, and what products are they buying?

Which customers are most likely to go to the competition?

What is the most effective distribution channel?

What product promotions have the biggest impact on revenue?

What impact will new products/services have on revenue and margins?

Data, data everywhere, yet ...

I can't find the data I need

- data is scattered over the network

- many versions, subtle differences

I can't get the data I need

- need an expert to get the data

I can't understand the data I found

- available data poorly documented

I can't use the data I found

- results are unexpected

- data needs to be transformed from one form to another

Scenario 1

ABC Pvt Ltd is a company with branches at Mumbai, Delhi, Chennai and Bangalore. The Sales Manager wants a quarterly sales report. Each branch has a separate operational system.

Scenario 1: ABC Pvt Ltd.

[Diagram: the Sales Manager asks for sales per item type per branch for the first quarter, but the data sits in separate operational systems at Mumbai, Delhi, Chennai and Bangalore.]

Solution 1: ABC Pvt Ltd.

Extract sales information from each database.

Store the information in a common repository at
a single site.
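As a rough illustration of this solution, the sketch below pulls one quarter of sales from each branch database into a single repository and produces the Sales Manager's "sales per item type per branch" report. It assumes each branch exposes a SQLite file with a sales(item_type, sale_date, amount) table; all file, table, and column names are illustrative, not from the slides.

```python
# Minimal sketch of Solution 1: copy quarterly sales from each branch's
# operational database into one common repository (names are illustrative).
import sqlite3

BRANCH_DBS = {
    "Mumbai": "mumbai.db",
    "Delhi": "delhi.db",
    "Chennai": "chennai.db",
    "Bangalore": "bangalore.db",
}

def load_quarter_into_repository(repo_path="repository.db",
                                 quarter_start="2024-01-01",
                                 quarter_end="2024-03-31"):
    repo = sqlite3.connect(repo_path)
    repo.execute("""CREATE TABLE IF NOT EXISTS sales_repo
                    (branch TEXT, item_type TEXT, sale_date TEXT, amount REAL)""")
    for branch, path in BRANCH_DBS.items():
        src = sqlite3.connect(path)          # assumed branch operational DB
        rows = src.execute(
            "SELECT item_type, sale_date, amount FROM sales "
            "WHERE sale_date BETWEEN ? AND ?",
            (quarter_start, quarter_end)).fetchall()
        repo.executemany(
            "INSERT INTO sales_repo VALUES (?, ?, ?, ?)",
            [(branch, item, day, amt) for item, day, amt in rows])
        src.close()
    repo.commit()
    # The Sales Manager's report: sales per item type per branch for the quarter
    return repo.execute("""SELECT branch, item_type, SUM(amount)
                           FROM sales_repo GROUP BY branch, item_type""").fetchall()
```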

Solution 1: ABC Pvt Ltd.

[Diagram: the Mumbai, Delhi, Chennai and Bangalore operational systems feed a central Data Warehouse; the Sales Manager uses query and analysis tools against it to produce the report.]

Scenario 2

One Stop Shopping Super Market has a huge operational database. Whenever executives want a report, the OLTP system becomes slow and data entry operators have to wait for some time.

Scenario 2: One Stop Shopping

[Diagram: data entry operators and management both hit the single operational database; report requests force the data entry operators to wait.]

Solution 2

Extract the data needed for analysis from the operational database.

Store it in the warehouse.

Refresh the warehouse at regular intervals so that it contains up-to-date information for analysis.

The warehouse will contain data with a historical perspective.

Solution 2

[Diagram: data entry operators run transactions against the operational database; data is extracted from it into the Data Warehouse, which the manager queries for reports.]

Scenario 3

Cakes & Cookies is a small, new company. The president of the company wants the company to grow, and he needs information so that he can make correct decisions.

Solution 3

Improve the quality of data before loading it into the warehouse.

Perform data cleaning and transformation before loading the data.

Use query and analysis tools to support ad hoc queries.

Solution 3

[Diagram: sales data flows into the Data Warehouse; the president uses a query and analysis tool against it; charts show sales improvement and expansion over time.]

Why Do We Need Data Warehouses?

Consolidation of information resources

Improved query performance

Separate research and decision support
functions from the operational systems

Foundation for data mining, data
visualization, advanced reporting and OLAP
tools

What Is a Data Warehouse Used For?

Knowledge discovery

- Making consolidated reports

- Finding relationships and correlations

- Data mining

- Examples

Banks identifying credit risks

Insurance companies searching for fraud

Medical research

How Do Data Warehouses Differ From
Operational Systems?

Goals

Structure

Size

Performance optimization

Technologies used

Comparison Chart of Database Types

Data warehouse vs. operational system:

- Subject oriented vs. transaction oriented

- Large (hundreds of GB up to several TB) vs. small (MB up to several GB)

- Historic data vs. current data

- De-normalized table structure (few tables, many columns per table) vs. normalized table structure (many tables, few columns per table)

- Batch updates vs. continuous updates

- Usually very complex queries vs. simple to complex queries

Design Differences

Operational System

ER Diagram

Data Warehouse

Star Schema

Supporting a Complete Solution

Operational System-
Data Entry

Data Warehouse-
Data Retrieval

Introduction to Data Warehousing

Data, data, data everywhere!

Information: that's another story!

Especially the right information @ the right time!

Data warehousing's goal is to make the right information available @ the right time.

Data warehousing is a data store (e.g., a database of some sort) and a process for bringing together disparate data from throughout an organization for decision-support purposes.

Introduction, Contd.

Data warehouses are natural allies for data mining (they work well together).

Data mining can help fulfill some of the goals of data warehouses: the right information @ the right time.

Relational database management systems (RDBMS), such as Oracle, DB2, Sybase, Informix, Focus, SQL Server, etc., are often used for data warehousing.

What is a Data Warehouse?

Loosely speaking, a data warehouse refers to a database that is maintained separately from an organization's operational database.

Officially speaking:

"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." - William H. Inmon

Relational Database Theory

In the relational database modeling process (normalization), relations or tables are progressively decomposed into smaller relations, to a point where all attributes in a relation are very tightly coupled with the primary key of the relation.

First normal form: data items are atomic.

Second normal form: non-key attributes fully depend on the primary key.

Third normal form: all non-key attributes are completely independent of each other (no transitive dependencies on the key).
University Tables

Staff (staffNum, firstName, lastName, gender):
  1234    Jane   Smith   F
  2323    Tom    Green   M
  1111    Jim    Brown   M

Student (matricNum, fName, lName, gender, yearReg, supervisor):
  121212  Mary   Hill    F   2003   1234
  232323  Steve  Gray    M   2005   1234
  123456  Jimmy  Smith   M   2000   1111

Enrolled (courseCode, studentNum):
  c1  121212
  c3  121212
  c3  123456
  c1  232323
  etc.

Course (courseCode, creditValue):
  c1  120
  c3   60
  c5   60
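A minimal sketch of how these normalized university tables could be declared, assuming SQLite; it only restates the schema shown above, with each non-key attribute depending on its table's primary key (3NF).

```python
# Sketch of the normalized university schema above, expressed as SQLite DDL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Staff (
    staffNum   INTEGER PRIMARY KEY,
    firstName  TEXT,
    lastName   TEXT,
    gender     TEXT
);
CREATE TABLE Student (
    matricNum  INTEGER PRIMARY KEY,
    fName      TEXT,
    lName      TEXT,
    gender     TEXT,
    yearReg    INTEGER,
    supervisor INTEGER REFERENCES Staff(staffNum)
);
CREATE TABLE Course (
    courseCode  TEXT PRIMARY KEY,
    creditValue INTEGER
);
-- Enrolled resolves the many-to-many student/course relationship
CREATE TABLE Enrolled (
    courseCode TEXT REFERENCES Course(courseCode),
    studentNum INTEGER REFERENCES Student(matricNum),
    PRIMARY KEY (courseCode, studentNum)
);
""")
```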
Relational Database Theory, Contd.

The process of normalization generally breaks a table into many independent tables.

A normalized database yields a flexible model, making it easy to maintain dynamic relationships between business entities.

A relational database system is effective and efficient for operational databases with a lot of updates (it aims at optimizing update performance).
Problems

A fully normalized data model can perform very inefficiently for queries.

Historical data are usually large, with static relationships:

- Unnecessary joins may take an unacceptably long time

- Historical data are diverse
Problem: Heterogeneous Information Sources

Heterogeneities are everywhere:

- Different interfaces

- Different data representations

- Duplicate and inconsistent information

[Diagram: example sources include personal databases, digital libraries, scientific databases and the World Wide Web.]
Goal: Unified Access to Data

Integration System

- Collects and combines information

- Provides an integrated view and a uniform user interface

- Supports sharing

[Diagram: the integration system sits over the World Wide Web, digital libraries, scientific databases and personal databases.]
The Traditional Research Approach

[Diagram: clients query an integration system (with metadata), which forwards queries through wrappers to each source.]

Query-driven (lazy, on-demand)
Disadvantages of the Query-Driven Approach

Delay in query processing

- Slow or unavailable information sources

- Complex filtering and integration

Inefficient and potentially expensive for frequent queries

Competes with local processing at sources

Hasn't caught on in industry
The Warehousing Approach

[Diagram: extractors/monitors pull data from each source into an integration system (with metadata), which loads the Data Warehouse queried by clients.]

Information is integrated in advance.

It is stored in the warehouse for direct querying and analysis.
Advantages of the Warehousing Approach

High query performance

- But not necessarily the most current information

Doesn't interfere with local processing at sources

- Complex queries at the warehouse

- OLTP at the information sources

Information copied at the warehouse

- Can modify, annotate, summarize, restructure, etc.

- Can store historical information

- Security, no auditing

Has caught on in industry
Not Either-Or Decision
Query-driven approach still better for
Rapidly changing information
Rapidly changing information sources
Truly vast amounts of data from large numbers of
sources
Clients with unpredictable needs
What is a Data Warehouse? A Practitioner's Viewpoint

"A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context."

-- Barry Devlin, IBM Consultant
What is a Data Warehouse?
An Alternative Viewpoint
A DW is a
subject-oriented,
integrated,
time-varying,
non-volatile
collection of data that is used primarily in
organizational decision making.
-- W.H. Inmon, Building the Data Warehouse, 1992
A Data Warehouse is...

Stored collection of diverse data

- A solution to the data integration problem

- Single repository of information

Subject-oriented

- Organized by subject, not by application

- Used for analysis, data mining, etc.

Optimized differently from transaction-oriented DBs

User interface aimed at executives
Contd.

Large volume of data (GB, TB)

Non-volatile

- Historical

- Time attributes are important

Updates infrequent

May be append-only

Examples

- All transactions ever at Sainsbury's

- Complete client histories at an insurance firm

- LSE financial information and portfolios
Generic Warehouse Architecture

[Diagram: extractors/monitors feed an integrator, which loads the warehouse (with metadata); clients run query & analysis against it. Phases shown: design, loading, maintenance, optimization.]
Data Warehouse Architectures: Conceptual View

Single-layer

- Every data element is stored only once

- Virtual warehouse

- [Diagram: operational and informational systems share the same real-time data.]

Two-layer

- Real-time + derived data

- Most commonly used approach in industry today

- [Diagram: operational systems hold the real-time data; informational systems use the derived data.]
Three-layer Architecture: Conceptual View

Transformation of real-time data into derived data really requires two steps.

[Diagram: operational systems hold real-time data, which is first transformed into reconciled data and then into the derived data used by informational systems.]
Physical Implementation of the Data Warehouse

View level: particular informational needs
Data Warehousing: Two Distinct Issues

(1) How to get information into the warehouse: "data warehousing"

(2) What to do with the data once it's in the warehouse: "warehouse DBMS"

Both are rich research areas.

Industry has focused on (2).
Issues in Data Warehousing
Warehouse Design
Extraction
Wrappers, monitors (change detectors)
Integration
Cleansing & merging
Warehousing specification & Maintenance
Optimizations
Miscellaneous (e.g., evolution)
Definitions of a Data Warehouse

1. "A subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process."

- W. H. Inmon

2. "A copy of transaction data, specifically structured for query and analysis."

- Ralph Kimball

Data Warehouse

For organizational learning to take place, data from
many sources must be gathered together and
organized in a consistent and useful way - hence,
Data Warehousing (DW)

DW allows an organization (enterprise) to remember
what it has noticed about its data

Data Mining techniques make use of the data in a
Data Warehouse

Data Warehouse

[Diagram: the enterprise database (customers, transactions, vendors, orders, etc.) is copied, organized and summarized into the Data Warehouse, which is used for data mining.]

Data miners:

- Farmers: they know what they are looking for

- Explorers: unpredictable

Data Warehouse

A data warehouse is a copy of transaction data specifically
structured for querying, analysis, reporting, and more
rigorous data mining

Note that the data warehouse contains a copy of the

transactions which are not updated or changed later by the

transaction system

Also note that this data is specially structured, and may have
been transformed when it was copied into the data

warehouse

Data Warehouse

Subject oriented

Data integrated

Time variant

Nonvolatile

Data Warehouse: Subject-Oriented

Organized around major subjects, such as customer,
product, sales

Focusing on the modeling and analysis of data for

decision makers, not on daily operations or transaction

processing

Provide a simple and concise view around particular

subject issues by excluding data that are not useful in the
decision support process

Data Warehouse: Integrated

Constructed by integrating multiple, heterogeneous data sources

- relational databases, flat files, on-line transaction records

Data cleaning and data integration techniques are applied.

- Ensure consistency in naming conventions, encoding structures,
attribute measures, etc. among different data sources

- When data is moved to the warehouse, it is converted.

Data Warehouse: Time Variant

The time horizon for the data warehouse is
significantly longer than that of operational
systems
- Operational database: current value data

- Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)

Data Warehouse: Nonvolatile

A physically separate store of data transformed from the operational
environment

Operational update of data does not occur in the data warehouse
environment

- Does not require transaction processing, recovery, and concurrency control
mechanisms

- Requires only two operations in data accessing:

initial loading of data and access of data

Characteristics of Data Warehouse

Subject oriented. Data are organized based on how
the users refer to them.

Integrated. All inconsistencies regarding naming

convention and value representations are removed.

Nonvolatile. Data are stored in read-only format and
do not change over time.

Time variant. Data are not current but normally time
series.

Characteristics of Data Warehouse

Summarized. Operational data are mapped into a decision-usable format.

Large volume. Time series data sets are normally
quite large.

Not normalized. DW data can be, and often are,
redundant.

Metadata. Data about data are stored.

Data sources. Data come from internal and external
unintegrated operational systems.

A Data Warehouse is Subject Oriented

Subject Orientation

Application environment: design activities must be equally focused on both process design and database design.

Data warehouse environment: the DW world is primarily void of process design and tends to focus exclusively on issues of data modeling and database design.

Data Integrated

Integration: consistency of naming conventions and measurement attributes, accuracy, and common aggregation.

Establishment of a common unit of measure for all synonymous data elements from dissimilar databases.

The data must be stored in the DW in an integrated, globally acceptable manner.


Time Variant

In an operational application system, the expectation is that all data within the database are accurate as of the moment of access. In the DW, data are simply assumed to be accurate as of some moment in time, and not necessarily right now.

One of the places where DW data display time variance is in the structure of the record key. Every primary key contained within the DW must contain, either implicitly or explicitly, an element of time (day, week, month, etc.).

Time Variant

Every piece of data contained within the
warehouse must be associated with a
particular point in time if any useful analysis is
to be conducted with it.

Another aspect of time variance in DW data is
that, once recorded, data within the
warehouse cannot be updated or changed.

Nonvolatility

Typical activities such as deletes, inserts, and
changes that are performed in an operational
application environment are completely
nonexistent in a DW environment.

Only two data operations are ever performed
in the DW: data loading and data access

Nonvolatility

Application vs. DW:

- Application: the design issues must focus on data integrity and update anomalies; complex processes must be coded to ensure that the data update processes allow for high integrity of the final product. DW: such issues are of no concern in a DW environment, because data update is never performed.

- Application: data is placed in normalized form to ensure minimal redundancy (totals that could be calculated would never be stored). DW: designers find it useful to store many such calculations or summarizations.

- Application: the technologies necessary to support issues of transaction and data recovery, rollback, and detection and remedy of deadlock are quite complex. DW: relative simplicity in technology.

Data Warehouse

In order for data to be effective, DW must be:

- Consistent.

- Well integrated.

- Well defined.

- Time stamped.

DW environment:

- The data store, data mart & the metadata.

The Data Store

An operational data store (ODS) stores data for a specific application. It feeds the data warehouse a stream of desired raw data.

It is the most common component of the DW environment.

A data store is generally subject oriented, volatile and current, commonly focused on customers, products, orders, policies, claims, etc.

Data Store & Data Warehouse

The Data Store, Contd.

Its day-to-day function is to store the data for a single, specific set of operational applications.

Its other function is to feed the data warehouse data for the purpose of analysis.

The Data Mart

It is a lower-cost, scaled-down version of the DW.

Data marts offer a targeted and less costly method of gaining the advantages associated with data warehousing, and can be scaled up to a full DW environment over time.

The Metadata

The last component of the DW environment.

It is information that is kept about the warehouse rather than information kept within the warehouse.

Legacy systems generally don't keep a record of the characteristics of the data (such as what pieces of data exist and where they are located).

The metadata is simply data about data.

Data Mart

A Data Mart is a smaller, more focused Data
Warehouse - a mini-warehouse.

A Data Mart typically reflects the business
rules of a specific business unit within an

enterprise.

Data Warehouse to Data Mart

[Diagram: the Data Warehouse feeds several Data Marts, each of which provides information for decision support.]

General Architecture for Data Warehousing

- Source systems (synonym: transaction data)

- Extraction, (cleaning), transformation, and load (ETL)

- Central repository

- Metadata repository

- Data marts

- Operational feedback

- End users (business)

Data Warehouse vs. Heterogeneous DBMS

Traditional heterogeneous DB integration: A query driven
approach

- Build wrappers/mediators on top of heterogeneous databases

Data warehouse: update-driven, high performance

- Information from heterogeneous sources is integrated in advance and
stored in warehouses for direct query and analysis

Data Warehouse vs. Operational DBMS

OLTP (on-line transaction (query) processing)

- Major task of traditional relational DBMS

- Day-to-day operations: purchasing, inventory, banking, manufacturing,
payroll, registration, accounting, etc.

OLAP (on-line analytical processing)

- Major task of data warehouse system

- Data analysis and decision making

Data Warehouse vs. Operational DBMS

OLTP vs. OLAP:

- users: clerk, IT professional vs. knowledge worker

- function: day-to-day operations vs. decision support

- DB design: application-oriented vs. subject-oriented

- data: current, up-to-date, detailed, flat relational, isolated vs. historical, summarized, multidimensional, integrated, consolidated

- usage: repetitive vs. ad hoc

- access: read/write, index/hash on primary key vs. lots of scans

- unit of work: short, simple transaction vs. complex query

- # records accessed: tens vs. millions

- # users: thousands vs. hundreds

- DB size: 100 MB-GB vs. 100 GB-TB

- metric: transaction throughput vs. query throughput, response time

Design of a Data Warehouse

Four views regarding the design of a data warehouse must be considered:

- Top-down view

allows selection of the relevant information necessary for the data warehouse

- Data source view

exposes the information being captured, stored, and managed by operational systems

- Data warehouse view

consists of fact tables and dimension tables

- Business query view

sees the perspectives of data in the warehouse from the view of the end user

Data Warehouse Design Process

Top-down, bottom-up approaches, or a combination of both

- Top-down: starts with overall design and planning (mature)

- Bottom-up: starts with experiments and prototypes (rapid)

From a software engineering point of view

- Waterfall: structured and systematic analysis at each step before proceeding to the next

- Spiral: rapid generation of increasingly functional systems, with short turnaround time

Typical data warehouse design process

- Choose a business process to model, e.g., orders, invoices, etc.

- Choose the grain (atomic level of data) of the business process

- Choose the dimensions that will apply to each fact table record

- Choose the measure that will populate each fact table record
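As a purely illustrative example (not from the slides), the four choices might come out like this for a retail sales warehouse:

```python
# Illustrative outcome of the four design choices for a retail sales warehouse.
design = {
    "business_process": "retail sales (point-of-sale transactions)",
    "grain": "one fact row per product, per store, per day",
    "dimensions": ["date", "product", "store", "promotion"],
    "measures": ["units_sold", "sales_amount", "cost_amount"],
}
```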

Data Warehouse: A Multi-Tiered Architecture

[Diagram: data sources (operational DBs and other sources, with metadata) go through extract, transform, load, and refresh, coordinated by a monitor & integrator, into data storage (the data warehouse and data marts); OLAP servers serve the data to front-end tools for analysis, query, reports, and data mining.]

Tiers: Data Sources, Data Storage, OLAP Engine, Front-End Tools

The Data Warehouse Architecture

The architecture consists of various interconnected elements:

- Operational and external database layer: the source data for the DW

- Information access layer: the tools the end users use to extract and analyze the data

- Data access layer: the interface between the operational and information access layers

- Metadata layer: the data directory or repository of metadata information

Components of the Data Warehouse Architecture

Three Data Warehouse Models

Enterprise warehouse

- collects all of the information about subjects spanning the entire organization

Data mart

- a subset of corporate-wide data that is of value to a specific group of users; its scope is confined to specific, selected groups, such as a marketing data mart

- independent vs. dependent (loaded directly from the warehouse) data marts

Virtual warehouse

- a set of views over operational databases; only some of the possible summary views may be materialized

Data Warehouse Development: A Recommended Approach

[Diagram: define a high-level corporate data model, then, with ongoing model refinement, build (1) an enterprise data warehouse and (2) data marts, leading to (3) distributed data marts and (4) a multi-tier data warehouse.]

Data Warehouse Back-End Tools and
Utilities

Data extraction

- get data from multiple, heterogeneous, and external sources

Data cleaning

- detect errors in the data and rectify them when possible

Data transformation

- convert data from legacy or host format to warehouse format

Load

- sort, summarize, consolidate, compute views, check integrity, and build indices and partitions

Refresh

- propagate the updates from the data sources to the
warehouse

Metadata Repository

Metadata is the data defining warehouse objects. It stores:

Description of the structure of the data warehouse

- schema, view, dimensions, hierarchies, derived data definitions, data mart locations and contents

Operational meta-data

- data lineage (history of migrated data and transformation path), currency
of data (active, archived, or purged), monitoring information (warehouse
usage statistics, error reports, audit trails)

The algorithms used for summarization

The mapping from operational environment to the data warehouse

Data related to system performance

- warehouse schema, view and derived data definitions

Business data

- business terms and definitions, ownership of data, charging policies

Building a Data Warehouse

Data Warehouse Lifecycle

- Analysis

- Design

- Import data

- Install front-end tools

- Test and deploy

Stage 1: Analysis

Identify:

- Target questions

- Data needs

- Timeliness of data

- Granularity


Create an enterprise-level data dictionary

Dimensional analysis

Identify facts and dimensions

Stage 2: Design

Star schema

Data Transformation

Aggregates

Pre-calculated Values

HW/SW Architecture


Dimensional Modeling

Fact Table - The primary table in a

dimensional model that is meant to contain
measurements of the business.

Dimension Table - One of a set of

companion tables to a fact table. Most
dimension tables contain many textual
attributes that are the basis for
constraining and grouping within data
warehouse queries.
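A minimal star schema sketch following these definitions: one fact table holding the measurements of the business, referencing denormalized dimension tables. All table and column names are illustrative assumptions, not taken from the slides.

```python
# Minimal star schema sketch: one fact table referencing three dimension tables.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_key    INTEGER PRIMARY KEY, day TEXT, month TEXT,
                          quarter TEXT, year INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT,
                          category TEXT, brand TEXT);
CREATE TABLE dim_store   (store_key   INTEGER PRIMARY KEY, city TEXT, region TEXT);

-- The fact table holds the numeric measurements of the business,
-- keyed by foreign keys into the dimension tables.
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    store_key    INTEGER REFERENCES dim_store(store_key),
    units_sold   INTEGER,
    sales_amount REAL
);
""")
```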

Stage 3: Import Data

Identify data sources

Extract the needed data
from existing systems to a
data staging area

Transform and Clean the
data
- Resolve data type conflicts

- Resolve naming and key
conflicts

- Remove, correct, or flag bad
data

- Conform Dimensions

Load the data into the
warehouse


Importing Data Into the Warehouse

[Diagram: operational systems (source systems) OLTP 1, OLTP 2 and OLTP 3 feed a data staging area, which loads the Data Warehouse.]

Stage 4: Install Front-end Tools

Reporting tools

Data mining tools

GIS

Etc.


Stage 5: Test and Deploy

Usability tests

Software installation

User training


Performance tweaking based on usage

Special Concerns

Time and expense

Managing the complexity

Update procedures and maintenance

Changes to source systems over time

Changes to data needs over time

Data Warehouse Usage

Three kinds of data warehouse applications

- Information processing

supports querying, basic statistical analysis, and reporting using
crosstabs, tables, charts and graphs

- Analytical processing

multidimensional analysis of data warehouse data

supports basic OLAP operations, slice-dice, drilling, pivoting

Data Warehouse Usage

- Data mining

knowledge discovery from hidden patterns

supports associations, constructing analytical models,
performing classification and prediction, and presenting

the mining results using visualization tools

Data Warehousing Typology

The virtual data warehouse - the end users have
direct access to the data stores, using tools enabled
at the data access layer

The central data warehouse - a single physical
database contains all of the data for a specific
functional area

The distributed data warehouse - the components
are distributed across several physical databases

12 Rules of a Data Warehouse

Data Warehouse and Operational
Environments are Separated

Data is integrated

Contains historical data over a long period of
time

Data is snapshot data, captured at a given point in time

Data is subject-oriented

12 Rules of Data Warehouse

Mainly read-only with periodic batch updates

Development Life Cycle has a data driven

approach versus the traditional process-driven
approach

Data contains several levels of detail

- Current, Old, Lightly Summarized, Highly
Summarized

12 Rules of Data Warehouse

Environment is characterized by Read-only
transactions to very large data sets

System that traces data sources, transformations,
and storage

Metadata is a critical component
- Source, transformation, integration, storage, relationships,
history, etc

Contains a chargeback mechanism for resource
usage that enforces optimal use of data by end users

Implementing the Data Warehouse

Kozar's list of seven deadly sins of data warehouse implementation:

1. "If you build it, they will come": the DW needs to be designed to meet people's needs

2. Omission of an architectural framework - you need to
consider the number of users, volume of data, update
cycle, etc.

3. Underestimating the importance of documenting
assumptions - the assumptions and potential conflicts
must be included in the framework

Seven Deadly Sins, continued

4. Failure to use the right tool - a DW project needs
different tools than those used to develop an application

5. Life cycle abuse - in a DW, the life cycle really never ends

6. Ignorance about data conflicts - resolving these takes a
lot more effort than most people realize

7. Failure to learn from mistakes - since one DW project

tends to beget another, learning from the early mistakes
will yield higher quality later

Issues of Data Redundancy between DW and Operational Environments

The lack of relevancy of issues such as data normalization in the DW environment may suggest the existence of massive data redundancy within the data warehouse and between the operational and DW environments.

The data being loaded into the DW are filtered and cleansed as they pass from the operational database to the warehouse. Because of this cleansing, much of the data that exists in the operational environment never passes to the data warehouse. Only the data necessary for processing by the DSS or EIS are ever actually loaded into the DW.

Issues of Data Redundancy between DW and Operational Environments

The time horizons for warehouse and operational data elements are different. Data in the operational environment are fresh, whereas warehouse data are generally much older (so there is minimal opportunity for the data to overlap between the two environments).

The data loaded into the DW often undergo a radical transformation as they pass from the operational to the DW environment, so the data in the DW are not the same as the operational data.

Data Warehouse Architecture

At the top: a centralized database

- Generally configured for queries and appends, not transactions

- Many indices, materialized views, etc.

Data is loaded and periodically updated via Extract/Transform/Load (ETL) tools.

[Diagram: sources such as RDBMS1, RDBMS2, HTML1 and XML1 feed ETL pipelines whose outputs load the Data Warehouse.]

ETL Tools

ETL tools are the equivalent of schema mappings in
virtual integration, but are more powerful

Arbitrary pieces of code to take data from a source,
convert it into data for the warehouse:

- import filters - read and convert from data sources

- data transformations - join, aggregate, filter, convert data

- de-duplication - finds multiple records referring to the
same entity, merges them

- profiling - builds tables, histograms, etc. to summarize
data

- quality management - test against master values, known
business rules, constraints, etc.
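As a small illustration of the de-duplication step listed above, a naive merge keyed on a normalized (name, email) pair; real ETL tools use far richer matching rules, and the field names here are assumptions.

```python
# Naive de-duplication sketch: merge customer records that share a
# normalized (name, email) key. Field names are illustrative.
def normalize_key(record):
    return (record["name"].strip().lower(), record["email"].strip().lower())

def deduplicate(records):
    merged = {}
    for rec in records:
        key = normalize_key(rec)
        if key in merged:
            # keep the last non-empty value seen for each field
            for field, value in rec.items():
                if value:
                    merged[key][field] = value
        else:
            merged[key] = dict(rec)
    return list(merged.values())

customers = [
    {"name": "Jane Smith",  "email": "jane@example.com", "phone": ""},
    {"name": "jane smith ", "email": "Jane@Example.com", "phone": "555-0100"},
]
print(deduplicate(customers))   # one merged record, with the phone filled in
```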

Example ETL Tool Chain

[Diagram: invoice line items have their date-time split and invalid dates/times filtered out and logged; the stream is joined with item records and invalid items filtered out; it is then matched against customer records, with non-matching rows logged as invalid customers; finally the records are grouped by customer to produce customer balances.]

This is an example for e-commerce loading.

Note the multiple stages of filtering (using selection or join-like operations), logging bad records, before we group and load.
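A condensed sketch of that chain in plain Python. The field names, lookup structures, and log format are all illustrative assumptions; the point is the repeated filter-and-log pattern before the final grouping.

```python
# Condensed sketch of the e-commerce ETL chain above (all names illustrative).
from datetime import datetime

def run_chain(invoice_lines, item_records, customer_records, bad_rows):
    items = {i["item_id"]: i for i in item_records}
    customers = {c["customer_id"]: c for c in customer_records}
    balances = {}
    for row in invoice_lines:
        # 1. split date-time and filter invalid dates/times
        try:
            row["date"], row["time"] = row["datetime"].split(" ")
            datetime.strptime(row["datetime"], "%Y-%m-%d %H:%M")
        except (KeyError, ValueError):
            bad_rows.append(("invalid date/time", row)); continue
        # 2. join with item records, filter invalid items
        item = items.get(row.get("item_id"))
        if item is None:
            bad_rows.append(("invalid item", row)); continue
        # 3. match against customer records, filter non-matches
        cust_id = row.get("customer_id")
        if cust_id not in customers:
            bad_rows.append(("invalid customer", row)); continue
        # 4. group by customer to produce customer balances
        amount = row.get("quantity", 0) * item["unit_price"]
        balances[cust_id] = balances.get(cust_id, 0.0) + amount
    return balances
```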

Data Warehouse Architectures

Generic Two-Level Architecture

Independent Data Mart

Dependent Data Mart and Operational
Data Store

Logical Data Mart and @ctive Warehouse

Three-Layer architecture

All involve some form of extraction, transformation and loading (ETL)

Generic two-level data warehousing architecture

[Diagram: source systems feed, via E-T-L, one company-wide warehouse.]

Periodic extraction: data is not completely current in the warehouse.

Independent data mart data warehousing architecture

[Diagram: separate E-T-L for each independent data mart.]

Data marts: mini-warehouses, limited in scope.

Data access complexity due to multiple data marts.

Dependent data mart with operational data store: a three-level architecture

[Diagram: a single E-T-L feeds the enterprise data warehouse (EDW); dependent data marts are loaded from the EDW.]

The ODS provides an option for obtaining current data.

Simpler data access.

Logical data mart and real-time data warehouse architecture

[Diagram: near real-time E-T-L feeds the data warehouse; the ODS and the data warehouse are one and the same.]

Data marts are NOT separate databases, but logical views of the data warehouse.

Easier to create new data marts.

[Figure: example of a DBMS log entry]

Data Characteristics: Status vs. Event Data

Event = a database action (create/update/delete) that results from a transaction.

[Diagram: an event takes the data from one status to a new status.]

Data Characteristics: Transient vs. Periodic Data

With transient data, changes to existing records are written over previous records, thus destroying the previous data content.

[Diagram: transient operational data vs. periodic warehouse data.]

Data Characteristics: Transient vs. Periodic Data

Periodic data are never physically altered or deleted once they have been added to the store.

Other Data Warehouse

Changes

New descriptive attributes

New business activity attributes

New classes of descriptive attributes

Descriptive attributes become more refined

Descriptive data are related to one another

New source of data

The Reconciled Data Layer

Typical operational data is:

- Transient-not historical

- Not normalized (perhaps due to denormalization for
performance)

- Restricted in scope-not comprehensive

- Sometimes poor quality-inconsistencies and errors

After ETL, data should be:

- Detailed-not summarized yet

- Historical-periodic

- Normalized-3rd normal form or higher

- Comprehensive-enterprise-wide perspective

- Timely-data should be current enough to assist decision-making

- Quality controlled-accurate with full integrity

The ETL Process

- Capture/Extract

- Scrub or data cleansing

- Transform

- Load and Index

ETL = Extract, Transform, and Load

Capture/Extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse.

Steps in data reconciliation

Static extract = capturing a snapshot of the source data at a point in time.

Incremental extract = capturing changes that have occurred since the last static extract.
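A small sketch contrasting the two extract styles; the table and column names (orders, updated_at) are assumptions for illustration only.

```python
# Sketch contrasting a static extract with an incremental extract.
import sqlite3

def static_extract(conn: sqlite3.Connection):
    """Snapshot of the whole source table at a point in time."""
    return conn.execute("SELECT * FROM orders").fetchall()

def incremental_extract(conn: sqlite3.Connection, last_extract_time: str):
    """Only the rows changed since the previous extract."""
    return conn.execute(
        "SELECT * FROM orders WHERE updated_at > ?", (last_extract_time,)
    ).fetchall()
```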

Steps in data reconciliation (cont.)

Scrub/Cleanse: uses pattern recognition and AI techniques to upgrade data quality.

Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies.

Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data.
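A minimal cleansing sketch along these lines: standardize dates, correct known misspellings via a lookup table, and log missing fields. All rules, field names, and corrections here are illustrative assumptions.

```python
# Cleansing sketch: reformat dates, fix known misspellings, flag missing data.
from datetime import datetime

CITY_CORRECTIONS = {"Banglore": "Bangalore", "Bombay": "Mumbai"}  # illustrative

def cleanse(record, error_log):
    rec = dict(record)
    # decode / reformat: normalize several date formats to ISO
    raw = rec.get("sale_date")
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y"):
        try:
            rec["sale_date"] = datetime.strptime(raw, fmt).date().isoformat()
            break
        except (ValueError, TypeError):
            continue
    else:
        error_log.append(("bad date", record))
    # fix known misspellings via table lookup
    rec["city"] = CITY_CORRECTIONS.get(rec.get("city"), rec.get("city"))
    # flag missing data rather than silently dropping it
    if not rec.get("customer_id"):
        error_log.append(("missing customer_id", record))
    return rec
```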

Steps in data reconciliation (cont.)

Transform = convert data from the format of the operational system to the format of the data warehouse.

Record-level:

- Selection: data partitioning

- Joining: data combining

- Aggregation: data summarization

Field-level:

- Single-field: from one field to one field

- Multi-field: from many fields to one, or from one field to many

Steps in data reconciliation (cont.)

Load/Index = place transformed data into the warehouse and create indexes.

Refresh mode: bulk rewriting of target data at periodic intervals.

Update mode: only changes in source data are written to the data warehouse.

Single-field transformation

In general, some transformation function translates data from the old form to the new form.

- Algorithmic transformation uses a formula or logical expression.

- Table lookup is another approach: it uses a separate table keyed by source record code.

Multifield transformation

- M:1 - from many source fields to one target field

- 1:M - from one source field to many target fields
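Tiny illustrations of each transformation type; every field name and lookup value below is an assumption made for the example.

```python
# Single-field, algorithmic: a formula converts one field to one field.
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32

# Single-field, table lookup: a separate table keyed by source record code.
PRODUCT_CODE_LOOKUP = {"A1": "Appliances", "B2": "Books", "C3": "Clothing"}
def decode_product(code):
    return PRODUCT_CODE_LOOKUP.get(code, "Unknown")

# Multifield, M:1 - several source fields map to one target field.
def full_address(street, city, postcode):
    return f"{street}, {city} {postcode}"

# Multifield, 1:M - one source field split into many target fields.
def split_name(full_name):
    first, _, last = full_name.partition(" ")
    return {"first_name": first, "last_name": last}
```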


Derived Data

Objectives

- Ease of use for decision support applications

- Fast response to predefined user queries

- Customized data for particular target audiences

- Ad hoc query support

- Data mining capabilities

Characteristics

- Detailed (mostly periodic) data

- Aggregate (for summary)

- Distributed (to departmental data marts)

Most common data model = star schema (also called the dimensional model)

Components of a star schema

- Fact tables contain factual or quantitative data.

- Dimension tables contain descriptions about the subjects of the business.

- There is a 1:N relationship between dimension tables and fact tables.

- Dimension tables are denormalized to maximize performance.

- Excellent for ad hoc queries, but bad for online transaction processing.

Figure 11-14: Star schema example

The fact table provides statistics for sales broken down by product, period and store dimensions.

Star schema with sample data
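Against a star schema like the one sketched earlier, a "sales by product and period" report of the kind shown in Figure 11-14 reduces to a join-and-group query. The table and column names below are the illustrative ones from that earlier sketch, not from the figure.

```python
# Illustrative query against the star schema sketched earlier:
# total sales broken down by product and quarter.
query = """
SELECT p.name              AS product,
       d.quarter           AS period,
       SUM(f.sales_amount) AS total_sales
FROM   fact_sales f
JOIN   dim_product p ON f.product_key = p.product_key
JOIN   dim_date    d ON f.date_key    = d.date_key
GROUP BY p.name, d.quarter
ORDER BY p.name, d.quarter;
"""
# rows = conn.execute(query).fetchall()   # using the sqlite3 connection from before
```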

Issues Regarding Star Schema

Dimension table keys must be surrogate (non-intelligent and non-
business related), because:

- Keys may change over time

- Length/format consistency

Granularity of Fact Table: what level of detail do you want?

- Transactional grain: finest level

- Aggregated grain: more summarized

- Finer grain means better market basket analysis capability

- Finer grain means more dimension tables and more rows in the fact table

Duration of the database-how much history should be kept?

- Natural duration-13 months or 5 quarters

- Financial institutions may need longer duration

- Older data is more difficult to source and cleanse

Modeling dates

Fact tables contain time-period data
Date dimensions are important

On-Line Analytical Processing (OLAP)

The use of a set of graphical tools that provides users
with multidimensional views of their data and allows

them to analyze the data using simple windowing
techniques

Relational OLAP (ROLAP)

- Traditional relational representation

Multidimensional OLAP (MOLAP)

- Cube structure

OLAP Operations

- Cube slicing - come up with 2-D view of data

- Drill-down - going from summary to more detailed views

[Figures: slicing a data cube; summary report; example of drill-down (starting with summary data, users can obtain details for particular cells); drill-down with color added.]
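A toy sketch of the two operations on a tiny in-memory "cube"; the dimension and measure names are assumptions, and a real OLAP engine would of course work on pre-aggregated cube structures rather than a Python list.

```python
# Slicing and drilling down on a tiny sales "cube" held as a list of cells.
cube = [
    {"product": "Shoes", "region": "North", "month": "Jan", "sales": 120},
    {"product": "Shoes", "region": "South", "month": "Jan", "sales": 95},
    {"product": "Hats",  "region": "North", "month": "Jan", "sales": 40},
    {"product": "Shoes", "region": "North", "month": "Feb", "sales": 130},
]

def slice_cube(cells, **fixed):
    """Cube slicing: fix one dimension (e.g. month='Jan') to get a 2-D view."""
    return [c for c in cells if all(c[d] == v for d, v in fixed.items())]

def rollup(cells, *dims):
    """Summarize sales by the given dimensions; re-running with an extra
    dimension is a drill-down from summary toward detail."""
    totals = {}
    for c in cells:
        key = tuple(c[d] for d in dims)
        totals[key] = totals.get(key, 0) + c["sales"]
    return totals

print(rollup(cube, "product"))            # summary by product
print(rollup(cube, "product", "region"))  # drill-down: product x region
print(slice_cube(cube, month="Jan"))      # slice: January only
```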

Data Mining and Visualization

Knowledge discovery using a blend of statistical, AI, and
computer graphics techniques

Goals:
- Explain observed events or conditions

- Confirm hypotheses

- Explore data for new or unexpected relationships

Techniques
- Case-based reasoning

- Rule discovery

- Signal processing

- Neural nets

- Fractals

Data visualization - representing data in graphical/multimedia
formats for analysis
