Data Warehouse

OLAP and Data Warehousing
Slides courtesy of: Julia Stoyanovitch

Columbia University
Surajit Chaudhuri
Microsoft Research, Redmond, WA, USA surajitc@microsoft.com
Umeshwar Dayal
Hewlett-Packard Labs., Palo Alto, CA, USA dayal@hpl.hp.com
What is OLAP?
On-Line Analytical Processing
Information technology to help the knowledge worker (executive, manager, analyst) make faster and better decisions.
OLAP is an element of decision support systems (DSS).
Running Example: Car Sales

Cars: carId, make, model, color
Dealers: dealerId, city, state
Time of Sale: tid, year, month, day
Sales: carId, dealerId, tid, price
OLTP Queries: Examples

Create a new sales record that indicates that a red VW Golf was sold in Boston, MA
See how many black and silver VW Passats were sold at dealership #123 on April 11th 2005
OLAP Queries: Examples

Analyze comparative sales of the different colors of VW Golf by state
See which months are particularly favorable to the sale of different VW models and colors
Rank VW dealerships by revenue, displaying a ranked list of dealerships and % differences in sales between each dealership and the one ranked 1 place higher
OLAP vs. OLTP

                     OLTP                                OLAP
User                 Clerk, IT professional              Knowledge worker
Function             Day to day operations               Decision support
DB design            Application-oriented (E-R based)    Subject-oriented (Star, snowflake)
Data                 Current, Isolated                   Historical, Consolidated
View                 Detailed, Flat relational           Summarized, Multidimensional
Usage                Structured, Repetitive              Ad hoc
Unit of work         Short, simple transaction           Complex query
Access               Read/write                          Read mostly
Operations           Index/hash on prim. key             Lots of scans
# Records accessed   Tens                                Millions
# Users              Thousands                           Hundreds
Db size              100 MB - GB                         100 GB - TB
Metric               Trans. throughput                   Query throughput, response
OLAP Queries: Challenges

Many AND, OR in the WHERE clause
Self-join, nested sub-queries
  Last year's sales vs. this year's sales for each product
  Show reps for whom every sale has been more than $15000
Extensive use of aggregation, often on related datasets
Aggregation over time periods
Ranking
Use of statistical functions
Very large datasets
Expectation of an interactive response time
OLAP Query Tools

Goal of OLAP is to support ad-hoc querying for the business analyst (power user)
Business analysts are familiar with spreadsheets
Extend spreadsheet analysis model to work with warehouse data
  Large data set
  Semantically enriched to understand business terms (e.g., time, geography)
  Combined with reporting features
Multidimensional view of data is the foundation of OLAP.
Multidimensional Data Model

Database is a set of facts (points) in a multidimensional space
A fact has
  A measure dimension: quantity that is analyzed, e.g., sale amount, budget
  A set of dimensions with respect to which data is analyzed, e.g., store, product, date associated with a sale amount
Dimensions form a sparsely populated coordinate system
Each dimension has a set of attributes, e.g., owner, city and county of store
Attribute Hierarchies
Attributes of a dimension may be related
An m:1 dependency is most common
Dependency graph may be:
  Hierarchy: e.g., city -> state -> country
  Lattice: date -> month -> year, date -> week -> year
Hierarchies are most common
Dependencies influence choice of operations and data representation
Multidimensional Data
Sales volume as a function of product, time, geography
Dimensions: Color, State, Date
Attributes: date (year, month, day)
[Figure: data cube with axes Color (Red, Green, Blue, White, Silver, Black), State (WI, CA, NY), and Date (1 to 7). Fact data: sales volume in $100]
Attribute Hierarchies and Lattice
[Figure: hierarchy levels shown are Industry, Category, Product; Country, State, City; Year, Quarter, Month, Week, Date]

ROLAP and MOLAP

Relational OLAP (ROLAP)
  Relational and Specialized Relational DBMS to store and manage warehouse data
  OLAP middleware to support missing pieces
    Optimize for each DBMS backend
    Aggregation Navigation Logic
    Additional tools and services
Multidimensional OLAP (MOLAP)
  Array-based storage structures
  Direct access to array data structures
Multiple Aggregations
Create a 2-dimensional spreadsheet that shows sum of sales by year as well as by model of car
Each subtotal requires a separate aggregate query
[Figure: spreadsheet grid with STATE and YEAR axes, plus Sum by State and Sum by Year margins]
Example: Multiple Aggregations

         2003   2004   2005   Total
WI         63     38     75     176
CA         81    107     35     223
Total     144    145    110     399
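The cross tabulation above can be reproduced with separate aggregations over the same base rows, one for each margin. The Python sketch below takes the six cell values from the table as its base data; the helper name `aggregate` is illustrative and not part of the slides.

```python
from collections import defaultdict

# Base data taken from the cross tabulation: (state, year) -> sales.
sales = {
    ("WI", 2003): 63, ("WI", 2004): 38, ("WI", 2005): 75,
    ("CA", 2003): 81, ("CA", 2004): 107, ("CA", 2005): 35,
}

def aggregate(rows, key):
    """Sum sales grouped by key(state, year); one call is one aggregate query."""
    totals = defaultdict(int)
    for (state, year), amount in rows.items():
        totals[key(state, year)] += amount
    return dict(totals)

by_state = aggregate(sales, lambda s, y: s)         # row subtotals
by_year = aggregate(sales, lambda s, y: y)          # column subtotals
grand_total = aggregate(sales, lambda s, y: "ALL")  # bottom-right cell

print(by_state)
print(by_year)
print(grand_total)
```

Each margin costs one more pass over the base data, which is the overhead the data cube generalization is meant to organize.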
Generalization: The Data Cube

Base tuples
Aggregate tuples:
  one aggregation for each subset of dimensions (powerset)
  exponential number of subsets, but can optimize the computation
Example: N = 3 dimensions
  model = {Golf, Jetta}
  color = {red, black, white}
  state = {NY, CA, WI}
How many aggregate tuples in the data cube?
  face: 1D agg; edge: 2D agg; corner: 3D agg
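One way to count the tuples in this example is to enumerate the cube directly. The sketch below assumes a dense cube, that is, every (model, color, state) combination occurs in the base data, so the counts are upper bounds for a sparse one. The "*" marker for an aggregated dimension follows the notation used later in the slides.

```python
from itertools import product

# Domains from the example; "*" stands for "all values" of a dimension.
dims = {
    "model": ["Golf", "Jetta"],
    "color": ["red", "black", "white"],
    "state": ["NY", "CA", "WI"],
}

base = list(product(*dims.values()))
cube = list(product(*[values + ["*"] for values in dims.values()]))
aggregates = [t for t in cube if "*" in t]

def count_by_stars(cells, k):
    """Number of cube cells that aggregate over exactly k dimensions."""
    return sum(1 for t in cells if t.count("*") == k)

print(len(base), len(cube), len(aggregates))
print([count_by_stars(cube, k) for k in range(4)])
```

Under the dense assumption the cube has (2+1) x (3+1) x (3+1) = 48 cells, of which 18 are base tuples and 30 are aggregate tuples.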
Operations on Multidimensional Data Model

Aggregation (roll-up) of detailed data to create summary data
Navigation to detailed data (drill-down) from summary
Selection (slice) defines a subcube
  Project the cube on fewer dimensions by specifying coordinates of remaining dimensions
  e.g., sales where state = NY and month = Jan
Calculation
  Within a dimension, e.g., (sales - expense) by state
  Across dimensions
Ranking
  top 3% of states by average sales
Window Queries
Roll-up and Drill-Down

Roll-Up: Use of aggregation
  dimension reduction: e.g., total sales by state by color -> total sales by state
  navigating attribute hierarchy:
    e.g., sales by city -> total sales by state -> total sales by country
    e.g., total sales by city and year -> total sales by state and year -> total sales by country
Drill-Down: Inverse operation of roll-up
  Provides the data set that was aggregated
  e.g., show base data for total sales figure for CA state
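Navigating an attribute hierarchy can be sketched as repeated regrouping through an m:1 mapping. The cities, mappings, and sales figures below are made up for illustration, and `roll_up` and `drill_down` are hypothetical helper names.

```python
from collections import defaultdict

# Hypothetical hierarchy and sales figures, for illustration only.
city_to_state = {"Boston": "MA", "Albany": "NY", "Buffalo": "NY"}
state_to_country = {"MA": "USA", "NY": "USA"}

sales_by_city = {"Boston": 120, "Albany": 40, "Buffalo": 60}

def roll_up(totals, parent_of):
    """Aggregate totals one level up an attribute hierarchy (m:1 mapping)."""
    rolled = defaultdict(int)
    for child, amount in totals.items():
        rolled[parent_of[child]] += amount
    return dict(rolled)

def drill_down(totals, parent_of, parent):
    """Return the finer-grained rows that were aggregated into `parent`."""
    return {c: v for c, v in totals.items() if parent_of[c] == parent}

sales_by_state = roll_up(sales_by_city, city_to_state)
sales_by_country = roll_up(sales_by_state, state_to_country)

print(sales_by_state)
print(sales_by_country)
print(drill_down(sales_by_city, city_to_state, "NY"))
```

Drill-down needs the finer-grained data to still be available; the summary alone cannot be split back into its parts.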
Slice and Dice

What colors of Golf are not doing so well?
  SELECT color, SUM(price)
  FROM SALES
  WHERE model = 'Golf'    -- slicing
  GROUP BY color;         -- dicing
Keep slicing if results are uniform
More Examples
Q: Given a query, which values from the CUBE do we need to retrieve?
A: To answer a query Q use tuples T s.t.
  If Q groups by A, T must have a non-* value in its component for A
  If Q slices by A = b, T must have the value b (not * or any other value) in its component for A
  If Q neither groups nor slices by A, then T has to have * in its component for A
Pivot (Rotate)
[Figure: data cube with axes City (LA, SF, NY), Product (Juice, Cola, Milk, Cream, Toothpaste, Soap), and Month (1 to 7), rotated to a City by Product view. Fact data: sales volume in $100]
Result: cross tabulation

Warehouse Database Schema

Entity-Relationship design techniques not appropriate
Design should reflect multidimensional view
Typical schemas:
  Star Schema
  Snowflake Schema
  Fact Constellation Schema
Example of a Star Schema

Fact table: OrderNo, SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalPrice
Dimension tables:
  Order: OrderNo, OrderDate
  Customer: CustomerNo, CustomerName, CustomerAddress, City
  Salesperson: SalespersonID, SalespersonName, City, Quota
  Product: ProdNo, ProdName, ProdDescr, Category, CategoryDescr, UnitPrice, QOH
  Date: DateKey, Date, Month, Year
  City: CityName, State, Country
Star Schema and Variants

A single fact table and a single table for each dimension
Generated keys are used for performance and maintenance reasons
Fact constellation: Multiple fact tables that share common dimension tables
  Example: ProjectedExpense and ActualExpense may share dimensional tables
Snowflake Schema: Represents dimensional hierarchy by normalization
Example of a Snowflake Schema

Fact table: OrderNo, SalespersonID, CustomerNo, DateKey, CityName, ProdNo, Quantity, TotalPrice
Dimension tables:
  Order: OrderNo, OrderDate
  Customer: CustomerNo, CustomerName, CustomerAddress, City
  Salesperson: SalespersonID, SalespersonName, City, Quota
  Product: ProdNo, ProdName, ProdDescr, Category, UnitPrice, QOH
    Category: CategoryName, CategoryDescr
  Date: DateKey, Date, Month
    Month: Month, Year
    Year: Year
  City: CityName, State
    State: StateName, Country
Performance Considerations
Normalization for dimension tables of summary tables
  Read-only data, so no update anomalies
  Fewer joins mean better performance
Pre-computation
  Re-use can speed up performance
  How can we use pre-computed results effectively?
Data is very large, dimension data often sparse
  Crucial to use indexes effectively
  Need for new indexing techniques: bitmap indexes, join indexes
Bit Map Index

An alternative representation of RID-list
Comparison, join and aggregation operations are reduced to bit arithmetic
Especially advantageous for low-cardinality domains
  Significant reduction in space and I/O (30:1)
  Adapted for higher cardinality domains
  Compression (e.g., run-length encoding) exploited
  Upper bound of 2R words for any bitmap over R rows [Hasan & Sinha, 1997]
Bitmap Index Example

custid  name  gender  rating
112     Joe   M       3
115     Ram   M       5
119     Sue   F       5
116     Woo   M       4

Bitmap index on gender, one row per record:
M F
1 0
1 0
0 1
1 0

Bitmap index on rating, one row per record:
1 2 3 4 5
0 0 1 0 0
0 0 0 0 1
0 0 0 0 1
0 0 0 1 0
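The example table can be indexed with one bitmap per distinct column value, after which a conjunctive selection becomes a bitwise AND. The sketch below stores each bitmap as a Python integer with bit i standing for row i; that encoding and the helper names are choices made for illustration, not how any particular DBMS lays out its bitmaps.

```python
# Rows from the example table, in row order.
rows = [
    (112, "Joe", "M", 3),
    (115, "Ram", "M", 5),
    (119, "Sue", "F", 5),
    (116, "Woo", "M", 4),
]

def build_bitmap_index(rows, column):
    """Map each distinct value to an int whose bit i is set if row i has it."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index

def matching_rows(bitmap):
    """Decode a bitmap back into the list of row positions it marks."""
    return [i for i in range(len(rows)) if bitmap >> i & 1]

gender = build_bitmap_index(rows, 2)
rating = build_bitmap_index(rows, 3)

# gender = 'M' AND rating = 5 becomes a single bitwise AND.
hits = gender["M"] & rating[5]
print([rows[i][1] for i in matching_rows(hits)])

# Counting males needs no row access at all: count the set bits.
print(bin(gender["M"]).count("1"))
```

Only Ram is both male and rated 5, so the AND leaves a single bit set.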
Join Index
Traditional index maps the value in a column to a list of rows with that value
Join index maintains relationships between attribute value of a dimension and the matching rows in the fact table
Join index may span multiple dimensions (composite join index)
  Use join index to identify regions of cartesian product that are of interest
  Few people in Southern California may buy umbrellas
Algorithm Using Bitmapped Join Indexes

[O'Neil & Graefe 95]
Maintain bit mapped join indexes between each dimension table and the fact table
To answer a query over multiple dimensions
  Take intersection of join indexes until the set of candidate fact tuples is small
  Do foreign key joins with rest of the dimension tables
  Look up the fact table
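The intersection step can be sketched with integer bitmaps, one per dimension value, marking the matching fact rows. The fact rows below are made up, the dimension values are stored inline rather than in separate dimension tables, and the remaining foreign key joins are left out, so this shows only the "intersect, then look up the fact table" part of the algorithm.

```python
# Hypothetical fact rows: (fact_id, city, product, amount).
facts = [
    (0, "LA", "Umbrella", 5),
    (1, "SF", "Juice", 7),
    (2, "LA", "Juice", 3),
    (3, "NY", "Umbrella", 9),
    (4, "LA", "Umbrella", 2),
]

def join_index(facts, column):
    """Bitmapped join index: dimension value -> bitmap of matching fact rows."""
    index = {}
    for pos, row in enumerate(facts):
        index[row[column]] = index.get(row[column], 0) | (1 << pos)
    return index

by_city = join_index(facts, 1)
by_product = join_index(facts, 2)

def candidates(*bitmaps):
    """Intersect join-index bitmaps and return the surviving fact positions."""
    combined = (1 << len(facts)) - 1          # start with all rows set
    for bitmap in bitmaps:
        combined &= bitmap
    return [pos for pos in range(len(facts)) if combined >> pos & 1]

# Umbrella sales in LA: intersect first, then look up only those fact rows.
positions = candidates(by_city.get("LA", 0), by_product.get("Umbrella", 0))
total = sum(facts[pos][3] for pos in positions)
print(positions, total)
```

The fact table is touched only for the rows that survive every intersection, which is the point of intersecting first.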
Join Index over Star Schema

Fact table: OrderNo, SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalPrice
Dimension tables:
  Order: OrderNo, OrderDate
  Customer: CustomerNo, CustomerName, CustomerAddress, City
  Salesperson: SalespersonID, SalespersonName, City, Quota
  Product: ProdNo, ProdName, ProdDescr, Category, CategoryDescr, UnitPrice, QOH
  Date: DateKey, Date, Month, Year
  City: CityName, State, Country
ROLAP: Handling of Aggregate Views

Important component for ROLAP Servers
Choice of aggregate views to materialize
Physical representation of Materialized Views in the star schema
Logic for Aggregation Navigation
  make optimum use of materialized aggregates to answer a query
ROLAP: Choice of Aggregate Views to Materialize

Storage can increase dramatically if precomputed views are not chosen properly
Must take into account queries in the workload, their frequencies and their costs
The decision must be taken in the broader context of physical database design
  e.g., should take into account the choice of indexes
Heuristic approaches adopted in products
ROLAP: Using Materialized Views Through Selection

A query can use a view through a selection if
  Each selection condition C on each dimension d in the query logically implies a condition C' on dimension d in the view
Example: A view has sum(sales) by product and by year for products introduced after 1991
  OK to use for sum(sales) by product for products introduced after 1992
  CANNOT use for sum(sales) for products introduced after 1989
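For the "introduced after year Y" conditions in this example, logical implication reduces to comparing the two years. The sketch below is a simplified illustration that handles only this one kind of condition; the dictionary layout and function names are assumptions, and a real optimizer would need a general implication test.

```python
def implies_after(query_year, view_year):
    """True if 'introduced after query_year' implies 'introduced after view_year'."""
    return query_year >= view_year

def can_use_view(query_conditions, view_conditions):
    """A view is usable if, on every dimension the view restricts, the query
    states a condition that logically implies the view's condition."""
    for dim, view_year in view_conditions.items():
        if dim not in query_conditions:
            return False                      # query needs rows the view dropped
        if not implies_after(query_conditions[dim], view_year):
            return False
    return True

view = {"product_introduced_after": 1991}
print(can_use_view({"product_introduced_after": 1992}, view))
print(can_use_view({"product_introduced_after": 1989}, view))
```

The 1989 query fails because it needs products introduced in 1990 and 1991, which the view has already filtered out.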
Using Materialized Views through Group By (Roll Up)

The view V may be applicable via roll-up if for every grouping attribute g of the query Q:
  Q has Group By a1, .., g, .., an
  V has Group By a1, .., h, .., an
  Attribute g is higher than h in the attribute hierarchy
  Aggregation functions are distributive
Example: Compute sum(sales) by category from the view sum(sales) by product
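The distributivity requirement can be seen by answering a category-level query from a product-level view. In the sketch below the view contents and the product-to-category mapping are hypothetical; the view is assumed to store both a sum and a count per product so that an average can also be rebuilt correctly.

```python
from collections import defaultdict

# Hypothetical view: per product, (sum of sales, number of sales).
view = {"Golf": (300, 3), "Jetta": (200, 1), "Touareg": (500, 2)}

# Hypothetical hierarchy: product -> category (category is the higher level).
category_of = {"Golf": "compact", "Jetta": "compact", "Touareg": "SUV"}

def sum_by_category(view, category_of):
    """SUM is distributive: partial sums per product add up per category."""
    totals = defaultdict(int)
    for product, (total, _count) in view.items():
        totals[category_of[product]] += total
    return dict(totals)

def avg_by_category(view, category_of):
    """AVG is not distributive, so it is rebuilt from rolled-up SUM and COUNT."""
    sums, counts = defaultdict(int), defaultdict(int)
    for product, (total, count) in view.items():
        sums[category_of[product]] += total
        counts[category_of[product]] += count
    return {c: sums[c] / counts[c] for c in sums}

print(sum_by_category(view, category_of))
print(avg_by_category(view, category_of))
```

Averaging the per-product averages for the compact category would give (100 + 200) / 2 = 150, while the correct average over all four sales is 500 / 4 = 125.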
Data Warehouse
A decision support database that is maintained separately from the organization's operational databases.
A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.
-- W.H. Inmon, Building the Data Warehouse, 1992.
Why Separate Data Warehouse

Performance
  Op dbs designed & tuned for known trans. workloads.
  Complex OLAP queries would degrade performance for operational transactions.
  Special data organization, access & implementation methods needed for multidimensional views & queries.
Function
  Missing data: Decision support requires historical data, which op dbs do not typically maintain.
  Data consolidation: Decision support requires data consolidation (aggregation, summarization) from many heterogeneous sources: op dbs, external sources.
  Data quality: Different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.
Data Warehousing Architecture

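[Figure: data warehousing architecture. Data sources (operational dbs, external sources) feed the data warehouse and data marts through extract, transform, and transport steps. OLAP servers serve the front-end tools (OLAP, query/reporting, data mining). Monitoring & administration and a metadata repository sit alongside the warehouse.]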
Data Warehouse vs. Data Marts

Enterprise data warehouse: collects all information about subjects (customers, products, sales, assets, personnel) that span the entire organization.
  Requires extensive business modeling.
  May take years to design and build.
Data Marts: Departmental subsets that focus on selected subjects.
  Marketing data mart: customer, products, sales.
  Faster roll out, but complex integration in the long run.
Virtual warehouse: views over operational dbs
  materialize some summary views for efficient query processing
  easier to build
  requisite excess capacity on operational db servers.
Three-Tier Architecture
Warehouse database server
  almost always a relational DBMS; rarely flat files.
OLAP servers
  Relational OLAP (ROLAP): extended relational DBMS that maps operations on multidimensional data to standard relational operations.
  Multidimensional OLAP (MOLAP): special purpose server that directly implements multidimensional data and operations.
Clients
  Query and reporting tools.
  Analysis tools.
  Data mining tools.
Populating & Refreshing the Warehouse

Data extraction
Data cleaning
Data transformation
  Convert from legacy/host format to warehouse format
Load
  Sort, summarize, consolidate, compute views, check integrity, build indexes, partition
Refresh
  Propagate updates from sources to the warehouse.
Data Cleaning
Why?
  Data warehouse contains data that is analyzed for business decisions
  More data and multiple sources could mean more errors
  Results in incorrect analysis
Detecting data anomalies and rectifying them early has huge payoffs
Important to identify tools that work together well
Long Term Solution
  Change business practices and data entry tools
  Repository for metadata
Load
Issues:
  huge volumes of data to be loaded
  small time window (usually at night) when the warehouse can be taken off-line
  when to build indexes and summary tables
  allow system administrator to monitor status, cancel, suspend, resume load, or change load rate
  restart after failure with no loss of data integrity.
Techniques:
  batch load utility: sort input records on clustering key and use sequential I/O; build indexes and derived tables
  sequential loads still too long (~100 days for TB)
  use parallelism and incremental techniques.
Parallel Load
Pipelined and partitioned parallelism
[Figure: dataflow pipeline. Table path: Source tables -> Scan -> Sort runs -> Merge runs -> Table insert -> Target tables. Index path: Build index record -> Sort runs -> Merge runs -> Index insert -> Target index]
[Barclay, Barnes, Gray, Sundaresan: Loading Databases Using Dataflow Parallelism]

Incremental Load
Full load may still take too long.
  entire load is a (long) batch transaction
  replace old table with new after transaction commits
  use periodic checkpoints; after failure, restart from last checkpoint.
Use incremental loads during refresh to reduce data volume
  insert only updated tuples
  now, incremental load conflicts with queries
  break into sequence of shorter transactions (every ~1000 records, every few seconds)
  coordinate this sequence of transactions: must ensure consistency between base tables and derived tables & indices.
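Breaking the load into short transactions, each of which updates the base table and a derived table together, can be sketched with Python's built-in `sqlite3` module. The table `sales_by_dealer`, the batch size default, and the in-memory checkpoint value are assumptions for illustration; a production loader would persist the checkpoint and use the warehouse's own bulk-load facilities.

```python
import sqlite3

def incremental_load(conn, rows, batch_size=1000, start_at=0):
    """Insert rows in short transactions; return the position reached.
    After a failure, pass the last returned checkpoint as start_at."""
    checkpoint = start_at
    for begin in range(start_at, len(rows), batch_size):
        batch = rows[begin:begin + batch_size]
        with conn:  # commits on success, rolls back this batch on error
            conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", batch)
            # Keep the derived table consistent within the same transaction.
            conn.executemany(
                "INSERT OR IGNORE INTO sales_by_dealer (dealerId, total) VALUES (?, 0)",
                [(dealer,) for _car, dealer, _tid, _price in batch],
            )
            conn.executemany(
                "UPDATE sales_by_dealer SET total = total + ? WHERE dealerId = ?",
                [(price, dealer) for _car, dealer, _tid, price in batch],
            )
        checkpoint = begin + len(batch)
    return checkpoint

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (carId INTEGER, dealerId INTEGER, tid INTEGER, price REAL)")
conn.execute("CREATE TABLE sales_by_dealer (dealerId INTEGER PRIMARY KEY, total REAL)")

# Hypothetical new sales: 2500 rows spread over three dealers.
rows = [(i, i % 3, 1, 100.0) for i in range(2500)]
reached = incremental_load(conn, rows)
print(reached)
```

Because each batch commits on its own, queries see a base table and a summary table that always agree, and a failed batch loses at most `batch_size` rows of progress.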
Refresh
Issues:
when to refresh
  on every update: too expensive, only necessary if OLAP queries need current data (e.g., up-to-the-minute stock quotes)
  periodically (e.g., every 24 hours, every week) or after significant events
  refresh policy set by administrator based on user needs and traffic
  possibly different policies for different sources.
how to refresh.
Refresh Techniques
Full extract from base tables
  read entire source table or database: expensive
  may be the only choice for legacy databases or files.
Incremental techniques (related to work on active dbs)
  detect & propagate changes on base tables: replication servers
    snapshots & triggers (Oracle)
    transaction shipping (Sybase)
  logical correctness
    computing changes to star tables
    computing changes to derived and summary tables
    optimization: only significant changes
  transactional correctness: incremental load.
