Dhruv Nath
Slides on OLAP
DW : Contents
ER Model vs Dimensional Model
Designing a Data Warehouse
Starting with the ER Model
Facts and Dimensions
BI Products and Vendors
Data Warehouse Optimisation
OLAP Implementation
Problems with using the ER Model / 3NF for Querying
Complex to understand and query
All kinds of tables being joined to all kinds of other tables. Maybe OK for joining a few tables. Not OK when lots of tables are involved
The E-R Model is designed for capturing / updating detailed data, not for querying it
A different model is required for querying this data by Management
[Figure: diagram with entities including Geography, Dealer and Year]
Dimensional Model
Very clear which data are business numbers (changing : facts) and which are constant (e.g. Regions, Products : dimensions)
[Figure: Fact table (Cust. Id, Month & Yr, Region Code, Balance) surrounded by Dimension tables; dimension attributes shown include Phone and Manager]
What is the primary key in each dimension ?
What is the primary key in the Fact table ?
What are the foreign keys ? What relationships do they define ?
What do we call this schema ? Star Schema
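One possible answer to the questions above can be sketched with Python's built-in sqlite3 module. The table names, column names and sample rows are assumptions made for illustration (including the guess that Manager belongs to the region dimension); only the Fact columns come from the slide. Each dimension has its own primary key, and the Fact table's primary key is the combination of its foreign keys.

```python
import sqlite3

# In-memory database; all names and rows below are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dim (cust_id INTEGER PRIMARY KEY, cust_name TEXT);
CREATE TABLE region_dim   (region_code TEXT PRIMARY KEY, manager TEXT);
CREATE TABLE balance_fact (
    cust_id     INTEGER REFERENCES customer_dim(cust_id),
    month_yr    TEXT,
    region_code TEXT REFERENCES region_dim(region_code),
    balance     REAL,
    PRIMARY KEY (cust_id, month_yr, region_code)  -- composite key of the fact
);
""")
conn.executemany("INSERT INTO customer_dim VALUES (?, ?)",
                 [(1, "Asha"), (2, "Ravi")])
conn.executemany("INSERT INTO region_dim VALUES (?, ?)",
                 [("N", "Mehta"), ("S", "Iyer")])
conn.executemany("INSERT INTO balance_fact VALUES (?, ?, ?, ?)", [
    (1, "2024-01", "N", 500.0),
    (2, "2024-01", "N", 300.0),
    (2, "2024-01", "S", 200.0),
])


def balance_by_manager(connection):
    """Join the fact table to one dimension and sum the business number."""
    return connection.execute("""
        SELECT r.manager, SUM(f.balance)
        FROM balance_fact AS f
        JOIN region_dim AS r ON r.region_code = f.region_code
        GROUP BY r.manager
        ORDER BY r.manager
    """).fetchall()


totals = balance_by_manager(conn)
```

The query joins only the Fact table to a dimension table, which is the join pattern a Star Schema is meant to allow.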
Dimensional Model
[Figure: Star Schema. Fact table (Cust. Id, Month & Yr, Region Code, Balance); Customer dimension (Cust. Id, Cust Name, Address, Region Code); other dimension attributes shown include Phone and Manager]
Exercise : Compare the ER Model with the Dimensional Model of Data

ER Model
Designed for entering / storing data (transactions)
Optimized for transactions : single row entry and retrieval
Thousands of concurrent users
No way to figure out what data is business numbers (changing) and what is constant / static / near-static (e.g. Regions, Products). All of them are fields or relations. Therefore tough to implement a query
JOINs needed between any combination of tables. Therefore tough to implement a query

Dimensional Model
Designed for analysis / querying by the user
Optimized for bulk load and large, complex, unpredictable queries
Few concurrent users
What is constant / static / near-static (dimensions) and what are business numbers (facts) is very clear. Therefore easier to implement a query
JOINs only between the Fact Table and each Dimension Table. Therefore easier to implement a query
Data Marts
[Figure: Star Schema. Fact table (Cust. Id, Month & Yr, Region Code, Balance); Customer dimension (Cust. Id, Cust Name, Address, Region Code); other dimension attributes shown include Phone and Manager]
How would Data Marts created out of such a Data Warehouse look ? Similar. Some fields may be missing
Examples ?
Corporate customers : No personal details
Retail customers : No organisational details
Data Cubes are usually formed in Data Marts
Designing a Data Warehouse : Starting with the ER Model
[Figure: ER Model with entities CUSTOMER, PRODUCT, LINE_ITEM and ORDER]
Dimensional Model
What are the Foreign Keys in the Fact Table ? What is the primary key in the Fact Table ?
CUSTOMER : Cust Id, Cust Name, Address
SALES_REP : Emp. Id, Name, Qualifications
Fact LINE_ITEM : Emp Id, Cust Id, Date, Order Num, Product Code, Quantity
TIME : Date, Quarter
ORDER
Star : Instead of keeping a relationship from Sales_Rep to Customer, the relationship is from both to Line_Item
New Dimension created : Time. Time will always be a dimension in a Data Warehouse
So no anomalies : lack of normalisation is not a problem
The E-R Model tries to remove redundancy completely
The Dimensional Model tries to simplify the schema, and therefore brings in redundancy, e.g. the relationship between sales_rep and customer is repeated in every line_item where these two are involved
Constellation
Multiple STARs
Facts and Dimensions
Facts
Examples ? Sales, Collections, Revenue, Expenses
Continuously valued (even counts, e.g. no. of complaints / no. of transactions, are considered continuously valued)
Dimensions
Determined by what you want as row and column headers in your query reports
Usually : Textual, Discrete
BI Products and Vendors
[Figure: OLTP Databases, Data Warehouse and Data Marts, marked with the parts covered by DBMS Vendors and by BI Tool Vendors]
BI Tool Vendors : Provide everything except the OLTP DBMS and DW. ETL included
Examples : SAS, Cognos (IBM), Business Objects (SAP), QlikView
Data Warehouse Optimisation
Exercise : How big are the Fact and Dimension Tables ?
a) Number of records
b) Size in bytes
Dimension : Cust. Id, Cust Name, Address, Phone, Region Code
Fact : Cust. Id, Month & Yr, Region Code, Balance
Dimension : Month & Yr, Quarter
1 lakh customers, 10 regions. Data stored for the past 10 years
What if we store daily balances, and for each of the 1000 branches ?
Implications ? Space, speed
So what do we do ? Optimise on Fact table size. Ignore dimension tables !!!
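The record counts in this exercise can be worked out with a few lines of arithmetic. The sketch below assumes the worst case in which every customer has one Fact row for every period (and, in the second scenario, for every branch), and it assumes a hypothetical 20 bytes per Fact row; neither assumption comes from the slides.

```python
CUSTOMERS = 100_000        # 1 lakh customers
REGIONS = 10
YEARS = 10
BRANCHES = 1_000
BYTES_PER_FACT_ROW = 20    # assumption: three 4-byte keys plus an 8-byte balance


def fact_rows(customers, periods, branches=1):
    """Upper bound on Fact rows: one row per customer, period and branch."""
    return customers * periods * branches


# Monthly balances for 10 years.
monthly_rows = fact_rows(CUSTOMERS, YEARS * 12)
monthly_bytes = monthly_rows * BYTES_PER_FACT_ROW

# Daily balances for 10 years, kept separately for each of the 1000 branches.
daily_branch_rows = fact_rows(CUSTOMERS, YEARS * 365, BRANCHES)

# Dimension tables for comparison: customers, regions, months.
dimension_rows = {"customer": CUSTOMERS, "region": REGIONS, "month": YEARS * 12}

print(f"monthly fact rows      : {monthly_rows:,}")
print(f"monthly fact bytes     : {monthly_bytes:,}")
print(f"daily x branch rows    : {daily_branch_rows:,}")
print(f"largest dimension rows : {max(dimension_rows.values()):,}")
```

Under these assumptions the Fact table has 12 million rows in the monthly case and 365 billion in the daily, per-branch case, against at most 1 lakh rows in any dimension, which is why the effort goes into the Fact table.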
Optimisation : Exercise : Can we modify this Star Schema to cut down space ?
CUSTOMER : Cust Id, Cust Name, Address
SALES_REP : Emp. Id, Name, Qualifications
Fact LINE_ITEM : Emp Id, Cust Id, Date, Order Num, Product Code, Quantity
TIME : Date, Quarter
ORDER
[Figure: first modified schema, with tables CUSTOMER, SALES_REP, Fact LINE_ITEM, TIME and ORDER]
Advantage / Disadvantage ? Fact Table space vs. Ease of Querying
Which one would you use ?
[Figure: second modified schema, including a table with fields Order Num, Credit Terms, Lead Time, Cust Id, Cust Name, Address, Emp. Id, Name, Qualifications]
Advantage / Disadvantage ? Table space vs. Ease of Querying
Which one would you use ?
[Figure: Star Schema with Fact (Cust. Id, Month & Yr, Region Code, Balance) and Dimension (Month & Yr, Quarter)]
Keys
How do we reduce the size of the keys ? Use surrogate keys
Disadvantage ?
Processing required to transform from operational keys to surrogate keys
In any case, when the data comes from multiple sources, keys in all but one of the sources need to change
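The transformation from operational keys to surrogate keys can be sketched as a lookup built during loading. This is a minimal illustration, not a description of any particular ETL tool; the function name, the employee id format and the sample rows are all hypothetical.

```python
def build_key_map(operational_ids):
    """Assign a small sequential integer (surrogate key) to each distinct
    operational key, in order of first appearance."""
    key_map = {}
    for op_id in operational_ids:
        if op_id not in key_map:
            key_map[op_id] = len(key_map) + 1
    return key_map


# Hypothetical rows arriving from the operational (OLTP) system.
source_rows = [
    {"emp_id": "EMP-DEL-000417", "quantity": 5},
    {"emp_id": "EMP-BOM-001093", "quantity": 2},
    {"emp_id": "EMP-DEL-000417", "quantity": 7},
]

emp_keys = build_key_map(row["emp_id"] for row in source_rows)

# Dimension rows keep both keys, so reports can still show the operational id.
sales_rep_dim = [{"emp_key": key, "emp_id": op_id}
                 for op_id, key in emp_keys.items()]

# Fact rows carry only the short surrogate key.
fact_rows = [{"emp_key": emp_keys[row["emp_id"]], "quantity": row["quantity"]}
             for row in source_rows]
```

The extra lookup on every loaded row is the processing cost mentioned above; the saving is that each Fact row stores a small integer instead of a long operational key.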
CUSTOMER
SALES_REP : Emp. Id, Name, Qualifications, Emp Key (PK)
Fact LINE_ITEM : Emp Id, Emp Key, Cust Id, Cust Key, Date, Order Num, Order Key, Product Code, Product Key, Quantity
TIME : Date, Quarter
ORDER
Do we need both the original and the surrogate key in the Dimension Table ? Fact Table ?
SALES_REP : Emp. Id, Name, Qualifications, Emp Key (PK)
Fact LINE_ITEM : Emp Id, Emp Key, Cust Id, Cust Key, Date Key, Date, Order Num, Order Key, Product Code, Product Key, Quantity
TIME : Date Key (PK), Date, Quarter
ORDER
Based on this exercise, what is the process for converting an ER Model into a Dimensional Model (Data Warehouse) ?
Thinking question : Is there any situation where we would normalise a dimension table ?
[Figure: Fact (Cust. Id, Month, Region Code, Balance) with three Dimension tables]
Add fields to each dimension to make it denormalised
Now, what does the schema look like if we normalise each dimension table ? Snowflake Schema
Are Snowflake Schemas desirable ? Why ? Speed of querying. Complexity of querying for the user
OLAP Implementation
Representing dimensions (hierarchy levels, lowest first)
Product : SKU, Brand, Product, Product Category, Department, All products
Store : Store, Locality, City, Region, All
Date : Date, Month, Year, All
Promotion : Promotion, All
How do we represent a query - eg. Get Sales by SKU by Store by Date by Promotion ? How do we show a Roll-up / Drill-down ?
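A Roll-up moves up a dimension hierarchy by summing child members into their parent, and a Drill-down goes back to the children of one parent. The sketch below illustrates this on a store hierarchy; the store names, cities, regions and sales figures are made up for the example.

```python
from collections import defaultdict

# Hypothetical store hierarchy: store -> city -> region.
store_to_city = {"Store A": "Delhi", "Store B": "Delhi", "Store C": "Chennai"}
city_to_region = {"Delhi": "North", "Chennai": "South"}

# Hypothetical sales already summed to store level.
sales_by_store = {"Store A": 120.0, "Store B": 80.0, "Store C": 50.0}


def roll_up(values, parent_of):
    """Sum each member's value into its parent, one level up the hierarchy."""
    totals = defaultdict(float)
    for member, value in values.items():
        totals[parent_of[member]] += value
    return dict(totals)


def drill_down(values, parent_of, parent):
    """Return the lower-level members (and values) that belong to one parent."""
    return {m: v for m, v in values.items() if parent_of[m] == parent}


sales_by_city = roll_up(sales_by_store, store_to_city)
sales_by_region = roll_up(sales_by_city, city_to_region)
delhi_stores = drill_down(sales_by_store, store_to_city, "Delhi")
```

Here `sales_by_region` is two roll-ups away from store level, and `delhi_stores` shows the store-level detail behind the Delhi total.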
For this query, we need to add fields across rows in the Fact Table. How many rows need to be summed ?
Problems ? Speed
Solution ?
Aggregation : Issues
When are aggregates computed ? During every update
Where are Aggregations stored ? Separate Fact table. Families of Stars (Constellations)
Users should not be aware of aggregation. The software automatically uses the aggregate Fact table to answer the query. Why ?
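The idea of a separate aggregate Fact table that is used without the user knowing can be sketched as follows. This is an illustration only: the row layout, the monthly grain of the aggregate and the `sales` function are assumptions, and a real OLAP engine would choose among many aggregates.

```python
from collections import defaultdict

# Hypothetical base Fact rows: (sku, store, date, amount).
daily_sales = [
    ("S1", "Store A", "2024-01-05", 10.0),
    ("S1", "Store A", "2024-01-20", 15.0),
    ("S2", "Store A", "2024-02-03", 7.0),
    ("S1", "Store B", "2024-01-11", 4.0),
]


def build_monthly_aggregate(rows):
    """Pre-compute sales by (store, month); rebuilt whenever the base
    Fact table is updated."""
    totals = defaultdict(float)
    for _sku, store, date, amount in rows:
        month = date[:7]  # "YYYY-MM" prefix of the date string
        totals[(store, month)] += amount
    return dict(totals)


monthly_aggregate = build_monthly_aggregate(daily_sales)


def sales(store, month=None, date=None):
    """Single entry point for the user: month-level questions are answered
    from the aggregate, date-level questions scan the base Fact rows."""
    if date is not None:
        return sum(a for _sku, s, d, a in daily_sales
                   if s == store and d == date)
    return monthly_aggregate.get((store, month), 0.0)


january_store_a = sales("Store A", month="2024-01")
one_day_store_a = sales("Store A", date="2024-01-20")
```

The caller uses the same `sales` function in both cases, so the aggregate table can be added, changed or dropped without changing how queries are written.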
Implementing OLAP : ROLAP vs MOLAP

ROLAP (Relational OLAP)
Implemented using a regular Relational DBMS
Linked list structure, therefore slow
Space optimised : only records that have some value are stored

MOLAP
Array structure, therefore fast
Pre-aggregated data, therefore fast
All cells in the Fact Table are stored whether they exist or not, therefore huge space
e.g. (Bank example) A customer does not have any account in a given branch. A customer does not perform any transaction in most of his accounts on specific days
Therefore only a small DW can be handled. For a large DW, summarised data can be kept in the MDDB. Drilling down requires going back to ROLAP (called HOLAP : Hybrid OLAP)
Sparse Matrix techniques are used to optimise space
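The space difference between the two storage styles can be seen in a small sketch based on the bank example. The cube sizes and transactions are made up, and a Python dict stands in for the "store only records that have a value" approach; it is not the linked list structure the slides attribute to ROLAP.

```python
CUSTOMERS, BRANCHES, DAYS = 100, 10, 30

# Hypothetical transactions, keyed by (customer, branch, day).
transactions = {
    (3, 1, 0): 250.0,
    (3, 1, 7): -40.0,
    (57, 4, 12): 900.0,
}

# MOLAP-style dense array: every cell exists, whether or not it holds data.
dense = [[[0.0] * DAYS for _ in range(BRANCHES)] for _ in range(CUSTOMERS)]
for (customer, branch, day), amount in transactions.items():
    dense[customer][branch][day] = amount

dense_cells = CUSTOMERS * BRANCHES * DAYS   # cells allocated in the array
sparse_cells = len(transactions)            # records actually stored


def lookup_dense(customer, branch, day):
    """Direct index into the array: position is computed, nothing is searched."""
    return dense[customer][branch][day]


def lookup_sparse(customer, branch, day):
    """Only stored records exist; a missing cell is treated as zero."""
    return transactions.get((customer, branch, day), 0.0)
```

With these numbers the dense array allocates 30,000 cells to hold 3 values, which is the space problem that Sparse Matrix techniques are meant to reduce.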
ROLAP vs MOLAP
DBMS vendors started off with ROLAP (know-how already existed), but are now adding MOLAP
Pure BI vendors are largely into MOLAP (proprietary)
Book
The Data Warehouse Toolkit, Ralph Kimball and Margy Ross, Wiley