Content
1 An Overview of Data Warehouse
2 Data Warehouse Architecture
3 Data Modeling for Data Warehouse
4 Overview of Data Cleansing
Content [contd]
6 Metadata Management
7 OLAP
8 Data Warehouse Testing
An Overview
Understanding what a data warehouse is
Components of Warehouse
Source Tables: real-time, volatile data held in relational databases for transaction processing (OLTP). Sources can be any relational databases or flat files.
ETL Tools: extract, cleanse, transform (aggregates, joins) and load the data from the sources to the target.
Maintenance and Administration Tools: authorize and monitor access to the data, set up users, and schedule jobs to run during off-peak periods.
Modeling Tools: used to design the data warehouse for high performance with the dimensional data modeling technique, and to map the source and target files.
Databases: the target databases and data marts that make up the data warehouse, structured for analysis and reporting purposes.
End-user Tools for Analysis and Reporting: retrieve reports and analyze the data from the target tables. Various querying, data mining, and OLAP tools are used for this purpose.
This architecture has a staging area, where the data is loaded and tested after cleansing and transformation. It is then loaded directly into the target database/warehouse, which is divided into data marts that different users can access for their reporting and analysis purposes.
Data Modeling
Effective way of using a Data Warehouse
Data Modeling
The E-R data model is commonly used for OLTP systems, while the dimensional data model is the common choice for OLAP.
E-R (Entity-Relationship) Data Model
Entity: an object that can be observed and classified by its properties and characteristics, such as an employee, a book, or a student.
Relationship: an association relating entities to other entities.
Star Schema

Dimension table: product
prodId  name  price
p1      bolt  10
p2      nut   5

Dimension table: store
storeId  city
c1       nyc
c2       sfo
c3       la

Fact table: sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
o105     3/8/97  111     p1      c3       5    50

Dimension table: customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
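The star schema above can be exercised directly with SQL; the following is a minimal sketch using Python's built-in sqlite3, with the tables and values taken from the slide's example (a typical warehouse query joins the fact table to a dimension and aggregates):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables surround a central fact table (star schema).
cur.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INT);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INT PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE sale     (orderId TEXT, date TEXT, custId INT,
                       prodId TEXT, storeId TEXT, qty INT, amt INT);
""")
cur.executemany("INSERT INTO product VALUES (?,?,?)",
                [("p1", "bolt", 10), ("p2", "nut", 5)])
cur.executemany("INSERT INTO store VALUES (?,?)",
                [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
cur.executemany("INSERT INTO customer VALUES (?,?,?,?)",
                [(53, "joe", "10 main", "sfo"),
                 (81, "fred", "12 main", "sfo"),
                 (111, "sally", "80 willow", "la")])
cur.executemany("INSERT INTO sale VALUES (?,?,?,?,?,?,?)",
                [("o100", "1/7/97", 53, "p1", "c1", 1, 12),
                 ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
                 ("o105", "3/8/97", 111, "p1", "c3", 5, 50)])

# Reporting query: total sales amount per store city.
cur.execute("""
SELECT s.city, SUM(f.amt)
FROM sale f JOIN store s ON f.storeId = s.storeId
GROUP BY s.city ORDER BY s.city
""")
sales_by_city = cur.fetchall()
print(sales_by_city)  # [('la', 50), ('nyc', 23)]
```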
Snowflake Schema
Dimension table: store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

Dimension table: sType
tId  size   location
t1   small  downtown
t2   large  suburbs

Dimension table: city
cityId  pop  regId
sfo     1M   north
la      5M   south
The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval matters more than efficiency of data manipulation. The tables in these schemas are therefore not heavily normalized, and are frequently designed at a level of normalization short of third normal form.
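The extra normalization of the snowflake schema shows up as extra joins. A minimal sqlite3 sketch using the slide's store/sType/city tables (the pairing of regId values with cities is assumed here for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# In a snowflake schema the store dimension is further normalized:
# store references city and sType lookup tables.
cur.executescript("""
CREATE TABLE store (storeId TEXT PRIMARY KEY, cityId TEXT, tId TEXT, mgr TEXT);
CREATE TABLE sType (tId TEXT PRIMARY KEY, size TEXT, location TEXT);
CREATE TABLE city  (cityId TEXT PRIMARY KEY, pop TEXT, regId TEXT);
""")
cur.executemany("INSERT INTO store VALUES (?,?,?,?)",
                [("s5", "sfo", "t1", "joe"),
                 ("s7", "sfo", "t2", "fred"),
                 ("s9", "la", "t1", "nancy")])
cur.executemany("INSERT INTO sType VALUES (?,?,?)",
                [("t1", "small", "downtown"), ("t2", "large", "suburbs")])
cur.executemany("INSERT INTO city VALUES (?,?,?)",
                [("sfo", "1M", "north"), ("la", "5M", "south")])

# Resolving a store's attributes now takes one extra join per level.
cur.execute("""
SELECT st.storeId, c.pop, t.size
FROM store st
JOIN city  c ON st.cityId = c.cityId
JOIN sType t ON st.tId = t.tId
ORDER BY st.storeId
""")
rows = cur.fetchall()
print(rows)  # [('s5', '1M', 'small'), ('s7', '1M', 'large'), ('s9', '5M', 'small')]
```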
Continuous Monitoring
- Identify and correct the cause of defects
- Refine data capture mechanisms at the source
- Educate users on the importance of DQ
2009 Wipro Ltd - Confidential
ETL Architecture
[Figure: ETL architecture for a web data warehouse — visitors' web browsers reach the site over the Internet; web server logs and e-commerce transaction data are gathered (Data Collection); scheduled extraction writes flat files (Data Extraction); a staging area cleans, transforms, matches, and merges the data (Data Transformation); scheduled loading moves it into the RDBMS (Data Loading).]
ETL Architecture
Data Extraction:
- Rummages through a file or database
- Uses some criteria for selection
- Identifies qualified data
- Transports the data over onto another file or database
Data Transformation:
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data
Data Loading:
- Initial and incremental loading
- Updating the metadata
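The extract, transform, and load steps above can be sketched in a few lines of Python. This is an illustrative toy, not a real tool: the source records, the selection criterion, and the derived-amount rule are all assumptions.

```python
from datetime import date

# Illustrative source records (e.g., rows pulled from an OLTP table).
source = [
    {"prodId": "p1", "qty": 1, "price": 10},
    {"prodId": "p2", "qty": 2, "price": 5},
    {"prodId": "p1", "qty": -3, "price": 10},  # fails the selection criterion
]

def extract(rows):
    """Selection criterion (assumed): keep only rows with positive quantity."""
    return [r for r in rows if r["qty"] > 0]

def transform(rows, load_date):
    """Derive amt = qty * price and add a time attribute to each row."""
    return [{"prodId": r["prodId"], "amt": r["qty"] * r["price"],
             "loadDate": load_date.isoformat()} for r in rows]

def load(target, rows):
    """Initial/incremental load: append the transformed rows to the target."""
    target.extend(rows)
    return target

warehouse = []
load(warehouse, transform(extract(source), date(2012, 1, 13)))
print(warehouse)
```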
Why ETL ?
Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats. To solve this problem, companies use extract, transform and load (ETL) software. The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.
ETL Tools
- Provide a facility to specify a large number of transformation rules with a GUI
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment
ETL Tools - Second Generation:
- PowerCentre/Mart from Informatica
- Data Mart Solution from Sagent Technology
- DataStage from Ascential
Metadata Management
What Is Metadata?
Metadata is information...
- that describes the WHAT, WHEN, WHO, WHERE, and HOW of the data warehouse
- about the data being captured and loaded into the warehouse
- documented in IT tools that improve both business and technical understanding of data and data-related processes
Importance Of Metadata
Locating information:
- How much time is spent looking for information?
- How often is the information found?
- What poor decisions were made based on incomplete information?
- How much money was lost or earned as a result?
Interpreting information:
- How many times have businesses needed to rework or recall products?
- What impact does it have on the bottom line?
- How many mistakes were due to misinterpretation of existing documentation?
- How much misinterpretation results from too much metadata?
- How much time is spent trying to determine if any of the metadata is accurate?
Integrating information:
- How do the various data perspectives connect together?
- How much time is spent trying to figure that out?
- How much does inefficiency and lack of metadata affect decision making?
Consumers of Metadata
Technical Users: warehouse administrator, application developer
Business Users - business metadata: meanings, definitions, business rules
Software Tools used in DW life-cycle development:
- Metadata requirements for each tool must be identified
- Tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
- Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool
Reischmann-Informatik-Toolbus
Features include facilitation of selective bridging of metadata
OLAP
Agenda
- OLAP Definition
- Distinction between OLTP and OLAP
- MDDB Concepts
- Implementation Techniques
- Architectures
- Features
- Representative Tools
1/13/2012
OLTP vs. OLAP

Source of data: OLTP - operational data; OLTPs are the original source of the data. OLAP - consolidation data; OLAP data comes from the various OLTP databases.
Purpose of data: OLTP - to control and run fundamental business tasks. OLAP - decision support.
What the data reveals: OLTP - a snapshot of ongoing business processes. OLAP - multi-dimensional views of various kinds of business activities.
Inserts and updates: OLTP - short and fast inserts and updates initiated by end users. OLAP - periodic long-running batch jobs refresh the data.
MDDB Concepts
A multidimensional database (MDDB) is a software system designed for efficient and convenient storage and retrieval of data that is closely related and is stored, viewed, and analyzed from different perspectives (dimensions). A hypercube represents a collection of multidimensional data. The edges of the cube are called dimensions; individual items within each dimension are called members.
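The dimension/member/cell vocabulary can be sketched as a mapping from member tuples to cell values. The dimension and member names below follow the deck's car-sales example; the cell values are illustrative.

```python
# Each dimension is a named list of members.
dimensions = {
    "MODEL": ["Mini Van", "Coupe", "Sedan"],
    "COLOR": ["Blue", "Red", "White"],
    "DEALERSHIP": ["Clyde", "Carr", "Gleason"],
}

# A cell is addressed by one member from every dimension; only cells
# with data need to be stored (illustrative values).
cube = {
    ("Mini Van", "Blue", "Clyde"): 6,
    ("Mini Van", "Red", "Clyde"): 5,
    ("Sedan", "White", "Carr"): 4,
}

# The hypercube's size is the product of the dimension sizes.
total_cells = 1
for members in dimensions.values():
    total_cells *= len(members)
print(total_cells)  # 3 x 3 x 3 = 27
```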
MDDB
[Figure: a Sales Volumes cube with dimensions MODEL (Mini Van, Coupe, Sedan), COLOR, and DEALERSHIP. With three members per dimension, 3 x 3 x 3 = 27 cells; adding a fourth dimension with four members gives 27 x 4 = 108 cells.]
Sparsity
- Input data in applications are typically sparse
- Sparsity increases with the number of dimensions
Data Explosion
- Due to sparsity
- Due to summarization
Performance
- Does not perform better than an RDBMS at high data volumes (>20-30 GB)
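Sparsity can be made concrete as the fraction of hypercube cells that hold no data. The numbers below are a toy illustration of the point that total cells grow multiplicatively with each added dimension while the actual data often does not.

```python
def sparsity(filled_cells, dimension_sizes):
    """Fraction of cells in the hypercube that hold no data."""
    total = 1
    for size in dimension_sizes:
        total *= size
    return 1 - filled_cells / total

# 27 data points fill a 3x3x3 cube completely...
print(sparsity(27, [3, 3, 3]))     # 0.0
# ...but adding a fourth dimension of four members makes 75% of cells empty.
print(sparsity(27, [3, 3, 3, 4]))  # 0.75
```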
[Figure: two example data sets — an employee table (LAST NAME, EMP#, AGE) with rows such as SMITH/01/21 and REGAN/12/19, shown alongside a Sales Volumes cube (MODEL x COLOR) whose cells hold quantities, contrasting tabular and multidimensional views of data.]
OLAP Features
- Calculations applied across dimensions, through hierarchies, and/or across members
- Trend analysis over sequential time periods; what-if scenarios
- Slicing/dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through/drill-through to underlying detail data
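Three of these operations can be sketched on a small tuple-keyed cube. The data and axis order (MODEL, COLOR) are illustrative, not taken from a real tool.

```python
# Illustrative 2-D cube: (model, color) -> sales volume.
cube = {
    ("Mini Van", "Blue"): 6, ("Mini Van", "Red"): 5, ("Mini Van", "White"): 4,
    ("Coupe", "Blue"): 3, ("Coupe", "Red"): 5, ("Coupe", "White"): 5,
    ("Sedan", "Blue"): 4, ("Sedan", "Red"): 3, ("Sedan", "White"): 2,
}

def slice_(cube, axis, member):
    """Slicing: fix one dimension to a single member."""
    return {k: v for k, v in cube.items() if k[axis] == member}

def rotate(cube):
    """Rotation: swap the two dimensions shown in the viewing area."""
    return {(color, model): v for (model, color), v in cube.items()}

def roll_up(cube, axis):
    """Drill-up: aggregate one dimension away."""
    out = {}
    for key, v in cube.items():
        reduced = key[:axis] + key[axis + 1:]
        out[reduced] = out.get(reduced, 0) + v
    return out

print(slice_(cube, 1, "Blue"))          # only the Blue cells
print(rotate(cube)[("Blue", "Sedan")])  # 4 (same cell, axes swapped)
print(roll_up(cube, 1))                 # volume per model across all colors
```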
[Figure: rotating the Sales Volumes cube 90 degrees. View #1 plots MODEL (Mini Van, Coupe, Sedan) against COLOR (Blue, Red, White); after rotation, View #2 plots COLOR against MODEL with the same cell values transposed.]
[Figure: successive 90-degree rotations of a three-dimensional cube (MODEL x COLOR x DEALERSHIP) produce six distinct two-dimensional views (View #1 through View #6), each pairing a different combination of dimensions in the viewing area.]
[Figure: slicing the Sales Volumes cube — restricting COLOR to particular members (Normal Blue, Metal Blue) and DEALERSHIP to Carr and Clyde yields a sub-cube for on-screen viewing.]
ORGANIZATION DIMENSION (hierarchy)
REGION: Midwest
  DISTRICTs: Chicago, St. Louis, Gary
    DEALERSHIPs: Clyde, Gleason, Carr, Levi, Lucas, Bolton
Moving up and moving down in a hierarchy are referred to as drill-up (roll-up) and drill-down, respectively.
[Figure: MOLAP architecture — an OLAP cube with an OLAP calculation engine, accessed from web browsers and OLAP tools.]
MOLAP - Features
- Powerful analytical capabilities (e.g., financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for maximum query performance and optimum space utilization
[Figure: ROLAP architecture — an OLAP calculation engine issues SQL against a relational DW and serves web browsers, OLAP tools, and OLAP applications.]
ROLAP - Features
Three-tier hardware/software architecture:
- GUI on the client; multidimensional processing on a mid-tier server; target database on a database server
- Processing is split between the mid-tier and database servers
[Figure: HOLAP architecture — as in ROLAP, an OLAP calculation engine issues SQL against a relational DW and serves web browsers, OLAP tools, and OLAP applications; HOLAP adds an MDDB for summary data.]
HOLAP - Features
- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of an RDBMS with the performance of an MDDB
- Calculation engine provides full analysis features
- The source of the data is transparent to the end user
Architecture Comparison
Definition:
  MOLAP - MDDB OLAP = transaction-level data + summary in MDDB
  ROLAP - Relational OLAP = transaction-level data + summary in RDBMS
  HOLAP - Hybrid OLAP = ROLAP + summary in MDDB
Data explosion due to sparsity:
  MOLAP - High (may go beyond control; estimation is very important)
  ROLAP - No sparsity
  HOLAP - Sparsity exists only in the MDDB part
Data explosion due to summarization:
  MOLAP - With good design, 3-10 times
  ROLAP - To the necessary extent
  HOLAP - To the necessary extent
Query execution speed:
  MOLAP - Fast (depends upon the size of the MDDB)
  ROLAP - Slow
  HOLAP - Optimum; if the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP
Cost:
  HOLAP - High: RDBMS + disk space + MDDB server cost
Where to apply:
  MOLAP - Small transactional data + complex model + frequent summary analysis
  ROLAP - Very large transactional data that needs to be viewed/sorted
  HOLAP - Large transactional data + frequent summary analysis
Representative Tools
- Oracle Express products
- Hyperion Essbase
- Cognos PowerPlay
- Seagate Holos
- SAS
- MicroStrategy DSS Agent
- Informix MetaCube
- Brio Query
- Business Objects / WebIntelligence
- Sales Analysis
- Financial Analysis
- Profitability Analysis
- Performance Analysis
- Risk Management
- Profiling & Segmentation
- Scorecard Applications
- NPA Management
- Strategic Planning
- Customer Relationship Management (CRM)
Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested against the following factors:
- Are the requirements complete?
- Are the requirements singular?
- Are the requirements unambiguous?
- Are the requirements developable?
- Are the requirements testable?
Unit Testing
Unit testing for data warehouses is white-box: it should check the ETL procedures/mappings/jobs and the reports developed.
Unit testing the ETL procedures covers:
- Whether the ETLs access and pick up the right data from the right source
- Whether all data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data
- Testing the rejected records that don't fulfil the transformation rules
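A white-box unit test for one transformation rule can be sketched with the standard library's unittest. The rule and records here are assumptions for illustration, not a real mapping.

```python
import unittest

def transform(record):
    """Business rule (assumed): qty must be positive; amt = qty * price.
    Records breaking the rule are rejected (returned as None)."""
    if record["qty"] <= 0:
        return None  # rejected record
    return {"prodId": record["prodId"], "amt": record["qty"] * record["price"]}

class TestTransform(unittest.TestCase):
    def test_valid_record_is_transformed(self):
        out = transform({"prodId": "p1", "qty": 2, "price": 10})
        self.assertEqual(out, {"prodId": "p1", "amt": 20})

    def test_invalid_record_is_rejected(self):
        # Records that don't fulfil the rule must be rejected, not loaded.
        self.assertIsNone(transform({"prodId": "p2", "qty": 0, "price": 5}))

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestTransform)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```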
Unit Testing
Unit testing the report data:
- Verify report data against the source: data in a data warehouse is stored at an aggregate level compared to the source systems; the QA team should verify the granular data stored in the data warehouse against the source data available.
- Field-level data verification: the QA team must understand the linkages for the fields displayed in the report, and should trace each field back and compare it with the source systems.
- Derivation formulae/calculation rules should be verified.
Integration Testing
Integration testing involves the following:
- Sequence of ETL jobs in a batch
- Initial loading of records into the data warehouse
- Incremental loading of records at a later date, to verify newly inserted or updated data
- Testing the rejected records that don't fulfil the transformation rules
- Error log generation
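The difference between the initial and incremental loads can be sketched as an upsert keyed on a business key. The key name and records below are illustrative assumptions.

```python
def load(target, batch):
    """Insert-or-update each record by its business key (custId, assumed)."""
    for rec in batch:
        target[rec["custId"]] = rec
    return target

warehouse = {}
# Initial load: the warehouse starts empty.
load(warehouse, [{"custId": 53, "name": "joe"}, {"custId": 81, "name": "fred"}])
# Incremental load at a later date: one new record, one updated record.
load(warehouse, [{"custId": 111, "name": "sally"},
                 {"custId": 81, "name": "freddy"}])
print(sorted(warehouse))      # [53, 81, 111]
print(warehouse[81]["name"])  # freddy
```

Integration tests would assert exactly this: after the incremental run, new keys exist, changed records are updated, and nothing else has moved.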
Performance Testing
Performance testing should check:
- That ETL processes complete within the time window
- Monitoring and measurement of data quality issues
- Refresh times for standard and complex reports
Acceptance testing
Here the system is tested with full functionality and is expected to function as it would in production. At the end of UAT, the system should be acceptable to the client in terms of ETL process integrity, business functionality, and reporting.
Questions
Thank You
Content
1 An Overview of Data Warehouse 2 Data Warehouse Architecture 3 Data Modeling for Data Warehouse 4 Overview of Data Cleansing
103
Content [contd]
6 Metadata Management 7 OLAP 8 Data Warehouse Testing
104
An Overview
Understanding What is a Data Warehouse
105
106
107
108
Content
1 An Overview of Data Warehouse 2 Data Warehouse Architecture 3 Data Modeling for Data Warehouse 4 Overview of Data Cleansing
110
Content [contd]
6 Metadata Management 7 OLAP 8 Data Warehouse Testing
111
An Overview
Understanding What is a Data Warehouse
112
113
114
115
Content
1 An Overview of Data Warehouse 2 Data Warehouse Architecture 3 Data Modeling for Data Warehouse 4 Overview of Data Cleansing
117
Content [contd]
6 Metadata Management 7 OLAP 8 Data Warehouse Testing
118
An Overview
Understanding What is a Data Warehouse
119
120
121
122
Components of Warehouse
Source Tables: These are real-time, volatile data in relational databases for transaction processing (OLTP). These can be any relational databases or flat files. ETL Tools: To extract, cleansing, transform (aggregates, joins) and load the data from sources to target. Maintenance and Administration Tools: To authorize and monitor access to the data, set-up users. Scheduling jobs to run on offshore periods. Modeling Tools: Used for data warehouse design for high-performance using dimensional data modeling technique, mapping the source and target files. Databases: Target databases and data marts, which are part of data warehouse. These are structured for analysis and reporting purposes. End-user tools for analysis and reporting: get the reports and analyze the data from target tables. Different types of Querying, Data Mining, OLAP tools are used for this purpose.
123
This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.
124
Data Modeling
Effective way of using a Data Warehouse
125
Data Modeling
Commonly E-R Data Model is used in OLTP, In OLAP Dimensional Data Model is used commonly. E-R (Entity-Relationship) Data Model
Entity: Object that can be observed and classified based on its properties and characteristics. Like employee, book, student Relationship: relating entities to other entities.
Star Schema
Dimension Table
product prodId p1 p2 name price bolt 10 nut 5
Dimension Table
store storeId c1 c2 c3 city nyc sfo la
Fact Table
sale oderId date o100 1/7/97 o102 2/7/97 105 3/8/97 custId 53 53 111 prodId p1 p2 p1 storeId c1 c1 c3 qty 1 2 5 amt 12 11 50
Dimension Table
customer custId 53 81 111 name joe fred sally address 10 main 12 main 80 willow city sfo sfo la
128
Snowflake Schema
Dimension Table Fact Table
store storeId s5 s7 s9 cityId sfo sfo la tId t1 t2 t1 mgr joe fred nancy sType tId t1 t2 size small large location downtown suburbs regId north south
Dimension Table
city cityId pop sfo 1M la 5M
The star and snowflake schema are most commonly found in dimensional data warehouses and data marts where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schema are not normalized much, and are frequently designed at a level of normalization short of third normal form.
129
130
131
Continuous Monitoring
y Identify & Correct Cause of Defects y Refine data capture mechanisms at source y Educate users on importance of DQ
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
132
133
135
136
ETL Architecture
Visitors
Web Browsers
The Internet
Staging Area
Web Server Logs & E-comm Transaction Data Flat Files Clean Transform Match Merge
Scheduled Extraction
RDBMS
Scheduled Loading
Data Collection
Data Extraction
Data Transformation
Data Loading
137
ETL Architecture
Data Extraction:
Rummages through a file or database Uses some criteria for selection Identifies qualified data and Transports the data over onto another file or database
Data transformation
Integrating dissimilar data types Changing codes Adding a time attribute Summarizing data Calculating derived values Renormalizing data
Data loading
Initial and incremental loading Updation of metadata
138
Why ETL ?
Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems,and therefore in all sorts of formats. To solve the problem, companies use extract, transform and load (ETL) software. The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, and an Excel spreadsheet.
139
140
141
ETL Tools
Provides facility to specify a large number of transformation rules with a GUI Generate programs to transform data Handle multiple data sources Handle data redundancy Generate metadata as output Most tools exploit parallelism by running on multiple low-cost servers in multi-threaded environment ETL Tools - Second-Generation PowerCentre/Mart from Informatica Data Mart Solution from Sagent Technology DataStage from Ascential
142
Metadata Management
143
What Is Metadata?
Metadata is Information...
That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse About the data being captured and loaded into the Warehouse Documented in IT tools that improves both business and technical understanding of data and data-related processes
144
Importance Of Metadata
Locating Information Time spent in looking for information. How often information is found? What poor decisions were made based on the incomplete information? How much money was lost or earned as a result? Interpreting information How many times have businesses needed to rework or recall products? What impact does it have on the bottom line ? How many mistakes were due to misinterpretation of existing How much interpretation results form too much metadata? How much time is spent trying to determine if any of the metadata is accurate? Integrating information How various data perspectives connect together? How much time is spent trying to figure out that? How much does the inefficiency and lack of metadata affect decision making documentation?
145
Consumers of Metadata
Technical Users Warehouse administrator Application developer Business Users -Business metadata Meanings Definitions Business Rules Software Tools Used in DW life-cycle development Metadata requirements for each tool must be identified The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool
147
Reischmann-Informatik-Toolbus
Features include facilitation of selective bridging of metadata
149
OLAP
151
Agenda
OLAP Definition Distinction between OLTP and OLAP MDDB Concepts Implementation Techniques Architectures Features Representative Tools
1/13/2012
152
152
1/13/2012
153
153
Operational data; OLTPs are Consolidation data; OLAP the original source of the data comes from the data various OLTP databases To control and run fundamental business tasks A snapshot of ongoing business processes Decision support Multi-dimensional views of various kinds of business activities Periodic long-running batch jobs refresh the data
154
Inserts and Updates Short and fast inserts and updates initiated by end users
1/13/2012
154
MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is intimately related and stored, viewed and analyzed from different perspectives (Dimensions). A hypercube represents a collection of multidimensional data. The edges of the cube are called dimensions Individual items within each dimensions are called members
155
MDDB
Sales Volumes
M O D E L
Mini Van
Sedan
DEALERSHIP
COLOR
27 x 4 = 108 cells
156
3 x 3 x 3 = 27 cells
157
Sparsity
- Input data in applications are typically sparse -Increases with increased dimensions
Data Explosion
-Due to Sparsity -Due to Summarization
Performance
-Doesnt perform better than RDBMS at high data volumes (>20-30 GB)
1/13/2012
158
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
158
LAST NAME EMP# AGE SMITH 01 21 REGAN 12 Sales Volumes 19 FOX 31 63 Miini Van WELD 14 6 5 31 4 M O KELLY 54 3 5 27 D Coupe 5 E L LINK 03 56 4 3 2 Sedan KRANZ 41 45 Blue Red White LUCUS 33 COLOR 41 WEISS 23 19
Smith
Regan
Fox
L A S T N A M E
Weld
Kelly
Link
Kranz
Lucas
Weiss
EMPLOYEE #
1/13/2012
159
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
159
OLAP Features
Calculations applied across dimensions, through hierarchies and/or across members Trend analysis over sequential time periods, What-if scenarios. Slicing / Dicing subsets for on-screen viewing Rotation to new dimensional comparisons in the viewing area Drill-down/up along the hierarchy Reach-through / Drill-through to underlying detail data
1/13/2012
160
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
160
M O D E L
Mini Van
6 3 4
Blue
5 5 3
Red
4 5 2
( ROTATE 90 )
White
o
Coupe
C O L O R
Blue
6 5 4
3 5 5
MODEL
4 3 2
Sedan
Red
Sedan
White
COLOR
View #1
View #2
161
M O D E L
Mini Van
Sedan
C O L O R
Blue
C O L O R
Blue
COLOR
( ROTATE 90 )
MODEL
DEALERSHIP
( ROTATE 90 )
o
( ROTATE 90 )
DEALERSHIP
DEALERSHIP
MODEL
View #1
D E A L E R S H I P D E A L E R S H I P
View #2
View #3
Carr
Mini Van
Clyde
Clyde
M O D E L
Sedan
COLOR
( ROTATE 90 )
MODEL
DEALERSHIP
( ROTATE 90 )
o
MODEL
COLOR
COLOR
View #4
View #5
View #6
162
Sales Volumes
M O D E L
Carr Clyde
Carr Clyde
Normal Blue
Metal Blue
DEALERSHIP
COLOR
1/13/2012
163
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
163
ORGANIZATION DIMENSION
REGION Midwest
DISTRICT
Chicago
St. Louis
Gary
DEALERSHIP
Clyde
Gleason
Carr
Levi
Lucas
Bolton
Moving Up and moving down in a hierarchy is referred to as drill-up / roll-up and drill-down
1/13/2012
164
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
164
1/13/2012
165
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
165
166
167
1/13/2012
168
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
168
OLAP
Cube
OLAP Calculation Engine
Web Browser
OLAP Tools
169
MOLAP - Features
Powerful analytical capabilities (e.g., financial, forecasting, statistical) Aggregation and calculation capabilities Read/write analytic applications Specialized data structures for
Maximum query performance. Optimum space utilization.
1/13/2012
170
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
170
Relational DW
Web Browser
OLAP Calculation Engine
SQL
OLAP Tools
OLAP Applications
1/13/2012
171
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
171
ROLAP - Features
Three-tier hardware/software architecture:
GUI on client; multidimensional processing on midtier server; target database on database server Processing split between mid-tier & database servers
1/13/2012
172
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
172
Relational DW
Web Browser
OLAP Calculation Engine
SQL
OLAP Tools
OLAP Applications
1/13/2012
173
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
173
HOLAP - Features
RDBMS used for detailed data stored in large databases MDDB used for fast, read/write OLAP analysis and calculations Scalability of RDBMS and MDDB performance Calculation engine provides full analysis features Source of data transparent to end user
1/13/2012
174
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
174
Architecture Comparison
MOLAP
Definition
ROLAP
HOLAP
Hybrid OLAP = ROLAP + summary in MDDB Sparsity exists only in MDDB part To the necessary extent
MDDB OLAP = Relational OLAP = Transaction level data + Transaction level data + summary in MDDB summary in RDBMS Good Design 3 10 times High (May go beyond control. Estimation is very important) Fast - (Depends upon the size of the MDDB) No Sparsity To the necessary extent
Data explosion due to Sparsity Data explosion due to Summarization Query Execution Speed
Slow
Optimum - If the data is fetched from RDBMS then its like ROLAP otherwise like MOLAP. High: RDBMS + disk space + MDDB Server cost Large transactional data + frequent summary analysis
Cost
Where to apply?
Very large transactional Small transactional data + complex model + data & it needs to be viewed / sorted frequent summary analysis
1/13/2012
175
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
175
Oracle Express Products Hyperion Essbase Cognos -PowerPlay Seagate - Holos SAS
Micro Strategy - DSS Agent Informix MetaCube Brio Query Business Objects / Web Intelligence
1/13/2012
176
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
176
Sales Analysis Financial Analysis Profitability Analysis Performance Analysis Risk Management Profiling & Segmentation Scorecard Application NPA Management Strategic Planning Customer Relationship Management (CRM)
1/13/2012
177
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
177
178
179
180
181
182
183
184
Requirements testing
The main aim for doing Requirements testing is to check stated requirements for completeness. Requirements can be tested on following factors. Are the requirements Complete? Are the requirements Singular? Are the requirements Ambiguous? Are the requirements Developable? Are the requirements Testable?
185
Unit Testing
Unit testing for data warehouses is WHITEBOX. It should check the ETL procedures/mappings/jobs and the reports developed. Unit testing the ETL procedures: Whether ETLs are accessing and picking up right data from right source. All the data transformations are correct according to the business rules and data warehouse is correctly populated with the transformed data. Testing the rejected records that dont fulfil transformation rules.
186
Unit Testing
Unit Testing the Report data: Verify Report data with source: Data present in a data warehouse will be stored at an aggregate level compare to source systems. QA team should verify the granular data stored in data warehouse against the source data available Field level data verification: QA team must understand the linkages for the fields displayed in the report and should trace back and compare that with the source systems Derivation formulae/calculation rules should be verified
187
Integration Testing
Integration testing will involve following: Sequence of ETLs jobs in batch. Initial loading of records on data warehouse. Incremental loading of records at a later date to verify the newly inserted or updated data. Testing the rejected records that dont fulfil transformation rules. Error log generation
188
Performance Testing
Performance Testing should check for : ETL processes completing within time window. Monitoring and measuring the data quality issues. Refresh times for standard/complex reports.
189
Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity and business functionality and reporting.
190
Questions
191
Thank You
192
Data Modeling
Effective way of using a Data Warehouse
195
Data Modeling
The E-R data model is commonly used in OLTP systems, while the dimensional data model is commonly used in OLAP. E-R (Entity-Relationship) Data Model
Entity: an object that can be observed and classified based on its properties and characteristics, such as an employee, a book or a student. Relationship: an association relating entities to other entities.
Star Schema
Dimension table: product
  prodId  name  price
  p1      bolt  10
  p2      nut   5

Dimension table: store
  storeId  city
  c1       nyc
  c2       sfo
  c3       la

Fact table: sale
  orderId  date    custId  prodId  storeId  qty  amt
  o100     1/7/97  53      p1      c1       1    12
  o102     2/7/97  53      p2      c1       2    11
  o105     3/8/97  111     p1      c3       5    50

Dimension table: customer
  custId  name   address    city
  53      joe    10 main    sfo
  81      fred   12 main    sfo
  111     sally  80 willow  la
198
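A star-schema query is a join of the fact table to one or more dimension tables. Using the sample tables above, this sketch totals the sales amount per store city in plain Python (the data matches the slide; the query itself is an illustrative example):

```python
# Join the sale fact table to the store dimension and total amt by city.

store = {"c1": "nyc", "c2": "sfo", "c3": "la"}
sale = [  # (orderId, custId, prodId, storeId, qty, amt)
    ("o100", 53, "p1", "c1", 1, 12),
    ("o102", 53, "p2", "c1", 2, 11),
    ("o105", 111, "p1", "c3", 5, 50),
]

amount_by_city = {}
for _order, _cust, _prod, store_id, _qty, amt in sale:
    city = store[store_id]               # join to the store dimension
    amount_by_city[city] = amount_by_city.get(city, 0) + amt
```

In SQL terms this is a single fact-to-dimension join with a GROUP BY on the dimension attribute.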
Snowflake Schema
Dimension tables (the store dimension snowflaked into sType, city and region):

store
  storeId  cityId  tId  mgr
  s5       sfo     t1   joe
  s7       sfo     t2   fred
  s9       la      t1   nancy

sType
  tId  size   location
  t1   small  downtown
  t2   large  suburbs

city
  cityId  pop
  sfo     1M
  la      5M

region: regId — north, south
The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval matters more than efficiency of data manipulation. As such, the tables in these schemas are not highly normalized, and are frequently designed at a level of normalization short of third normal form.
199
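The practical difference from the star schema is the extra join: store attributes now live in separate sType and city tables. A sketch using the snowflaked tables above (attribute names follow the slide; the lookup function is illustrative):

```python
# Resolving a store's attributes in the snowflake schema requires a
# second hop through the sType table that a star schema would avoid.

store = {"s5": {"cityId": "sfo", "tId": "t1", "mgr": "joe"},
         "s7": {"cityId": "sfo", "tId": "t2", "mgr": "fred"},
         "s9": {"cityId": "la", "tId": "t1", "mgr": "nancy"}}
s_type = {"t1": {"size": "small", "location": "downtown"},
          "t2": {"size": "large", "location": "suburbs"}}

def store_profile(store_id):
    """Join store -> sType to assemble the denormalized view."""
    s = store[store_id]
    t = s_type[s["tId"]]          # the extra snowflake join
    return {"city": s["cityId"], "size": t["size"],
            "location": t["location"]}

profile = store_profile("s9")
```

The snowflake saves storage on the dimension side at the cost of this additional join at query time.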
200
201
Continuous Monitoring
Identify & correct causes of defects. Refine data capture mechanisms at source. Educate users on the importance of DQ.
2009 Wipro Ltd - Confidential
202
203
205
206
ETL Architecture
[Diagram: visitors' web browsers reach the site over the Internet; web server logs and e-commerce transaction data land as flat files; scheduled extraction brings them into a staging area (clean, transform, match, merge); scheduled loading moves the result into the target RDBMS. Phases: data collection, data extraction, data transformation, data loading.]
207
ETL Architecture
Data extraction: rummages through a file or database, uses some criteria for selection, identifies qualified data, and transports the data over onto another file or database.
Data transformation: integrating dissimilar data types, changing codes, adding a time attribute, summarizing data, calculating derived values, renormalizing data.
Data loading: initial and incremental loading; updating of metadata.
208
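The three phases above can be strung together as one minimal pipeline. A sketch with assumed details: the selection criterion (completed orders), the derived value (line total) and the time attribute are illustrative choices, not prescribed by the material:

```python
# Extract -> transform -> load in miniature.
from datetime import date

def extract(rows):
    """Selection criterion: only completed orders qualify."""
    return [r for r in rows if r["status"] == "complete"]

def transform(rows, load_date):
    """Add a time attribute and a derived value (line total)."""
    return [{**r, "load_date": load_date, "total": r["qty"] * r["price"]}
            for r in rows]

def load(target, rows):
    """Initial or incremental load keyed on the order id."""
    for r in rows:
        target[r["id"]] = r

target = {}
raw = [{"id": 1, "status": "complete", "qty": 2, "price": 5},
       {"id": 2, "status": "open", "qty": 1, "price": 9}]
load(target, transform(extract(raw), date(2012, 1, 13)))
```

Keyed loading makes the same `load` step serve both the initial run and later incremental upserts.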
Why ETL ?
Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats. To solve the problem, companies use extract, transform and load (ETL) software. The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.
209
210
211
ETL Tools
ETL tools typically: provide a facility to specify a large number of transformation rules through a GUI; generate programs to transform data; handle multiple data sources; handle data redundancy; and generate metadata as output. Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment. Second-generation ETL tools include PowerCenter/PowerMart from Informatica, Data Mart Solution from Sagent Technology, and DataStage from Ascential.
212
Metadata Management
213
What Is Metadata?
Metadata is Information...
It describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse; it is about the data being captured and loaded into the warehouse; and it is documented in IT tools that improve both business and technical understanding of data and data-related processes.
214
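The what/when/who/where/how view of metadata can be made concrete as a load-metadata record. A sketch with invented field names and an in-memory repository standing in for a real metadata store:

```python
# Sketch: technical metadata captured for one warehouse load.
# All names (table, job, account) are illustrative.
from datetime import datetime

def record_load_metadata(repository, *, table, source, rows_loaded,
                         loaded_by, job, loaded_at):
    """Append one load's metadata record to the repository."""
    repository.append({
        "what": table,             # what was loaded
        "when": loaded_at,         # when the load ran
        "who": loaded_by,          # who / which account ran it
        "where": source,           # where the data came from
        "how": job,                # how: the ETL job responsible
        "rows_loaded": rows_loaded,
    })

repo = []
record_load_metadata(repo, table="sale_fact", source="orders_db",
                     rows_loaded=1042, loaded_by="etl_svc",
                     job="nightly_sales_load",
                     loaded_at=datetime(2012, 1, 13, 2, 0))
```

Records like this are what later make questions such as "where did this number come from?" answerable.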
Importance Of Metadata
Locating information: How much time is spent looking for information? How often is the information found? What poor decisions were made based on incomplete information? How much money was lost or earned as a result?
Interpreting information: How many times have businesses needed to rework or recall products? What impact does that have on the bottom line? How many mistakes were due to misinterpretation of existing documentation? How much misinterpretation results from too much metadata? How much time is spent trying to determine whether any of the metadata is accurate?
Integrating information: How do the various data perspectives connect together? How much time is spent trying to figure that out? How much does the inefficiency and lack of metadata affect decision making?
215
Consumers of Metadata
Technical users: the warehouse administrator and application developers.
Business users (business metadata): meanings, definitions, business rules.
Software tools used in DW life-cycle development: metadata requirements for each tool must be identified; the tool-specific metadata should be analysed for inclusion in the enterprise metadata repository; previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool.
217
Reischmann-Informatik-Toolbus
Features include facilitation of selective bridging of metadata
219
OLAP
257
Agenda
OLAP Definition Distinction between OLTP and OLAP MDDB Concepts Implementation Techniques Architectures Features Representative Tools
1/13/2012
258
259
OLTP vs OLAP:
Source of data — OLTP: operational data; OLTP systems are the original source of the data. OLAP: consolidation data; OLAP data comes from the various OLTP databases.
Purpose — OLTP: to control and run fundamental business tasks. OLAP: decision support.
What the data reveals — OLTP: a snapshot of ongoing business processes. OLAP: multi-dimensional views of various kinds of business activities.
Inserts and updates — OLTP: short and fast inserts and updates initiated by end users. OLAP: periodic long-running batch jobs refresh the data.
260
MDDB Concepts
A multidimensional database is a software system designed for efficient and convenient storage and retrieval of data that is closely related and is stored, viewed and analyzed from different perspectives (dimensions). A hypercube represents a collection of multidimensional data; the edges of the cube are called dimensions, and the individual items within each dimension are called members.
261
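The dimensions/members/cells vocabulary can be sketched directly. The member names below follow the Sales Volumes example used later in the deck; the cell values are invented for illustration:

```python
# A toy hypercube: each cell is addressed by one member from each
# dimension; the full cube has (members per dimension) multiplied out.

dimensions = {
    "MODEL": ["Mini Van", "Coupe", "Sedan"],
    "COLOR": ["Blue", "Red", "White"],
    "DEALERSHIP": ["Clyde", "Carr", "Gleason"],
}
cube = {}  # sparse store keyed by (model, color, dealership)

def set_cell(model, color, dealership, volume):
    cube[(model, color, dealership)] = volume

set_cell("Sedan", "Blue", "Clyde", 6)
set_cell("Sedan", "Red", "Clyde", 5)

# Total cells in the dense cube: the product of the member counts.
n_cells = 1
for members in dimensions.values():
    n_cells *= len(members)
```

With three members along each of three dimensions, the cube has 27 addressable cells.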
MDDB

[Figure: a Sales Volumes cube with dimensions MODEL (Mini Van, ..., Sedan), DEALERSHIP and COLOR; 27 x 4 = 108 cells.]
262
[Figure: a cube with three members along each of three dimensions; 3 x 3 x 3 = 27 cells.]
263
Sparsity: input data in applications is typically sparse, and sparsity increases with the number of dimensions.
Data Explosion: due to sparsity and due to summarization.
Performance: an MDDB does not perform better than an RDBMS at high data volumes (>20-30 GB).
264
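The sparsity point above can be made concrete by storing only populated cells. A sketch with invented members and values:

```python
# Sparsity sketch: a dense 3x3x3 cube allocates 27 cells, but sparse
# input data populates only a few of them.

members = ["a", "b", "c"]
dense_cells = len(members) ** 3              # cells a dense store allocates

sparse = {("a", "a", "a"): 5, ("b", "c", "a"): 2}   # only the filled cells
fill_ratio = len(sparse) / dense_cells
```

The fill ratio shrinks further as dimensions (or members per dimension) are added, which is why sparsity handling dominates MDDB storage design.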
[Figure: relational vs multidimensional views — an employee table (LAST NAME, EMP#, AGE: Smith 01 21, Regan 12 19, Fox 31 63, Weld 14 31, Kelly 54 27, Link 03 56, Kranz 41 45, Lucas 33 41, Weiss 23 19) shown alongside the Sales Volumes cube (MODEL: Mini Van, Coupe, Sedan; COLOR: Blue, Red, White).]
265
OLAP Features
Calculations applied across dimensions, through hierarchies and/or across members. Trend analysis over sequential time periods; what-if scenarios. Slicing/dicing subsets for on-screen viewing. Rotation to new dimensional comparisons in the viewing area. Drill-down/drill-up along the hierarchy. Reach-through/drill-through to underlying detail data.
266
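Slicing, rotation and drill-up from the feature list above can each be written as a one-step transformation of a cube. A sketch on a tiny MODEL x COLOR cube (cell values are illustrative):

```python
# Slice, rotate and drill-up on a dict-based two-dimensional cube.

cube = {("Mini Van", "Blue"): 6, ("Mini Van", "Red"): 5,
        ("Sedan", "Blue"): 4, ("Sedan", "Red"): 2}

def slice_cube(cube, color):
    """Slice: fix one COLOR member, leaving a 1-D view by MODEL."""
    return {model: v for (model, c), v in cube.items() if c == color}

def rotate(cube):
    """Rotate 90 degrees: swap the two dimensions in the viewing area."""
    return {(c, m): v for (m, c), v in cube.items()}

def roll_up(cube):
    """Drill-up: aggregate COLOR away, totalling volume per MODEL."""
    totals = {}
    for (model, _color), v in cube.items():
        totals[model] = totals.get(model, 0) + v
    return totals
```

Rotation only changes how the cells are addressed; slicing and roll-up actually reduce the data shown.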
[Figure: rotating the cube 90° — View #1 shows MODEL (Mini Van, Coupe, Sedan) by COLOR (Blue, Red, White) with cell values 6 5 4 / 3 5 5 / 4 3 2; View #2 shows COLOR by MODEL with the same cells transposed.]
267
[Figure: successive 90° rotations of the MODEL x COLOR x DEALERSHIP cube bring each pair of dimensions into the viewing area in turn, giving six views (View #1-#6); DEALERSHIP members include Carr and Clyde.]
268
[Figure: drill-down on the Sales Volumes cube — DEALERSHIP members Carr and Clyde, with the COLOR member Blue split into Normal Blue and Metal Blue.]
269
ORGANIZATION DIMENSION
REGION: Midwest
DISTRICT: Chicago, St. Louis, Gary
DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton
Moving up and moving down in a hierarchy is referred to as drill-up/roll-up and drill-down.
270
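Roll-up along the organization hierarchy above is just repeated aggregation from child to parent level. A sketch using the hierarchy members from the slide (the sales volumes are invented):

```python
# Drill-up: dealership -> district -> region.

district_of = {"Clyde": "Chicago", "Gleason": "Chicago",
               "Carr": "St. Louis", "Levi": "St. Louis",
               "Lucas": "Gary", "Bolton": "Gary"}
region_of = {"Chicago": "Midwest", "St. Louis": "Midwest",
             "Gary": "Midwest"}

sales_by_dealership = {"Clyde": 10, "Gleason": 7, "Carr": 4,
                       "Levi": 6, "Lucas": 3, "Bolton": 5}

def roll_up(values, parent_of):
    """Aggregate child-level values up to their parent level."""
    out = {}
    for child, v in values.items():
        parent = parent_of[child]
        out[parent] = out.get(parent, 0) + v
    return out

by_district = roll_up(sales_by_dealership, district_of)
by_region = roll_up(by_district, region_of)
```

Drill-down is the reverse navigation: from a region total back to its districts and dealerships.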
271
272
273
274
[Diagram: MOLAP architecture — OLAP tools in a web browser query an OLAP calculation engine backed by an OLAP cube.]
275
MOLAP - Features
Powerful analytical capabilities (e.g., financial, forecasting, statistical). Aggregation and calculation capabilities. Read/write analytic applications. Specialized data structures for maximum query performance and optimum space utilization.
276
[Diagram: ROLAP architecture — OLAP tools and applications in a web browser drive an OLAP calculation engine, which issues SQL against the relational data warehouse.]
277
ROLAP - Features
Three-tier hardware/software architecture: the GUI on the client, multidimensional processing on a mid-tier server, and the target database on the database server. Processing is split between the mid-tier and database servers.
278
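The core of the ROLAP idea — the mid-tier engine translating a multidimensional request into SQL against the relational warehouse — can be sketched as a query generator. The table and column names are made up for illustration:

```python
# Sketch: the GROUP BY query a ROLAP engine might emit for a roll-up
# request over chosen dimensions of a fact table.

def build_rollup_sql(fact, measure, dimensions):
    """Generate SQL that aggregates `measure` by the given dimensions."""
    dims = ", ".join(dimensions)
    return (f"SELECT {dims}, SUM({measure}) AS total "
            f"FROM {fact} GROUP BY {dims}")

sql = build_rollup_sql("sale", "amt", ["storeId", "prodId"])
```

A production engine would also generate the dimension-table joins, filters for slices, and hierarchy-aware groupings, but the pattern is the same.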
279
HOLAP - Features
The RDBMS is used for the detailed data stored in large databases; the MDDB is used for fast, read/write OLAP analysis and calculations. Combines the scalability of an RDBMS with MDDB performance. The calculation engine provides full analysis features. The source of the data is transparent to the end user.
280
Architecture Comparison
Definition — MOLAP: MDDB OLAP = transaction-level data + summary in the MDDB. ROLAP: Relational OLAP = transaction-level data + summary in the RDBMS. HOLAP: Hybrid OLAP = ROLAP + summary in an MDDB.
Data explosion due to sparsity — MOLAP: high (may go beyond control; estimation is very important). ROLAP: no sparsity. HOLAP: sparsity exists only in the MDDB part.
Data explosion due to summarization — MOLAP: with good design, 3-10 times. ROLAP: to the necessary extent. HOLAP: to the necessary extent.
Query execution speed — MOLAP: fast (depends upon the size of the MDDB). ROLAP: slow. HOLAP: optimum — if the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP.
Cost — HOLAP: high (RDBMS + disk space + MDDB server cost).
Where to apply — MOLAP: small transactional data + complex model + frequent summary analysis. ROLAP: very large transactional data that needs to be viewed/sorted. HOLAP: large transactional data + frequent summary analysis.
281
Representative tools: Oracle Express products, Hyperion Essbase, Cognos PowerPlay, Seagate Holos, SAS; MicroStrategy DSS Agent, Informix MetaCube, Brio Query, Business Objects / WebIntelligence.
282
Typical OLAP application areas: Sales Analysis, Financial Analysis, Profitability Analysis, Performance Analysis, Risk Management, Profiling & Segmentation, Scorecard Applications, NPA Management, Strategic Planning, Customer Relationship Management (CRM).
283
284
285
286
287
288
289
290
Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors: Are the requirements complete? Are the requirements singular? Are the requirements unambiguous? Are the requirements developable? Are the requirements testable?
291
Unit Testing
Unit testing for data warehouses is white-box: it should cover the ETL procedures/mappings/jobs and the reports developed. Unit testing the ETL procedures checks: whether the ETLs are accessing and picking up the right data from the right source; that all data transformations are correct according to the business rules, and that the data warehouse is correctly populated with the transformed data; and the rejected records that don't fulfil the transformation rules.
292
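A single transformation rule can be unit-tested in isolation, white-box style. A sketch with an assumed rule (map country codes, reject unknowns) chosen purely for illustration:

```python
# Sketch: unit-testing one ETL transformation rule and its rejection
# path. The country-code mapping is an invented example rule.

COUNTRY_CODES = {"US": "United States", "IN": "India"}

def transform_record(rec):
    """Apply the rule; raise ValueError for records that fail it."""
    code = rec.get("country")
    if code not in COUNTRY_CODES:
        raise ValueError(f"unknown country code: {code!r}")
    return {**rec, "country": COUNTRY_CODES[code]}

# The two checks a unit test would make: the happy path transforms
# correctly, and a bad record is rejected rather than loaded.
ok = transform_record({"id": 1, "country": "US"})
try:
    transform_record({"id": 2, "country": "XX"})
    rejected = False
except ValueError:
    rejected = True
```

In a real suite each business rule gets its own small fixture like this, so a failing transformation is pinpointed before integration testing.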
Unit Testing
Unit Testing the Report data: Verify Report data with source: Data present in a data warehouse will be stored at an aggregate level compare to source systems. QA team should verify the granular data stored in data warehouse against the source data available Field level data verification: QA team must understand the linkages for the fields displayed in the report and should trace back and compare that with the source systems Derivation formulae/calculation rules should be verified
293
Integration Testing
Integration testing will involve following: Sequence of ETLs jobs in batch. Initial loading of records on data warehouse. Incremental loading of records at a later date to verify the newly inserted or updated data. Testing the rejected records that dont fulfil transformation rules. Error log generation
294
Performance Testing
Performance Testing should check for : ETL processes completing within time window. Monitoring and measuring the data quality issues. Refresh times for standard/complex reports.
295
Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity and business functionality and reporting.
296
Questions
297
Thank You
298
Content [contd]
6 Metadata Management 7 OLAP 8 Data Warehouse Testing
299
An Overview
Understanding What is a Data Warehouse
300
301
302
303
Components of Warehouse
Source Tables: These are real-time, volatile data in relational databases for transaction processing (OLTP). These can be any relational databases or flat files. ETL Tools: To extract, cleansing, transform (aggregates, joins) and load the data from sources to target. Maintenance and Administration Tools: To authorize and monitor access to the data, set-up users. Scheduling jobs to run on offshore periods. Modeling Tools: Used for data warehouse design for high-performance using dimensional data modeling technique, mapping the source and target files. Databases: Target databases and data marts, which are part of data warehouse. These are structured for analysis and reporting purposes. End-user tools for analysis and reporting: get the reports and analyze the data from target tables. Different types of Querying, Data Mining, OLAP tools are used for this purpose.
304
This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.
305
Data Modeling
Effective way of using a Data Warehouse
306
Data Modeling
Commonly E-R Data Model is used in OLTP, In OLAP Dimensional Data Model is used commonly. E-R (Entity-Relationship) Data Model
Entity: Object that can be observed and classified based on its properties and characteristics. Like employee, book, student Relationship: relating entities to other entities.
Star Schema
Dimension Table
product prodId p1 p2 name price bolt 10 nut 5
Dimension Table
store storeId c1 c2 c3 city nyc sfo la
Fact Table
sale oderId date o100 1/7/97 o102 2/7/97 105 3/8/97 custId 53 53 111 prodId p1 p2 p1 storeId c1 c1 c3 qty 1 2 5 amt 12 11 50
Dimension Table
customer custId 53 81 111 name joe fred sally address 10 main 12 main 80 willow city sfo sfo la
309
Snowflake Schema
Dimension Table Fact Table
store storeId s5 s7 s9 cityId sfo sfo la tId t1 t2 t1 mgr joe fred nancy sType tId t1 t2 size small large location downtown suburbs regId north south
Dimension Table
city cityId pop sfo 1M la 5M
The star and snowflake schema are most commonly found in dimensional data warehouses and data marts where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schema are not normalized much, and are frequently designed at a level of normalization short of third normal form.
310
311
312
Continuous Monitoring
y Identify & Correct Cause of Defects y Refine data capture mechanisms at source y Educate users on importance of DQ
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
313
314
316
317
ETL Architecture
Visitors
Web Browsers
The Internet
Staging Area
Web Server Logs & E-comm Transaction Data Flat Files Clean Transform Match Merge
Scheduled Extraction
RDBMS
Scheduled Loading
Data Collection
Data Extraction
Data Transformation
Data Loading
318
ETL Architecture
Data Extraction:
Rummages through a file or database Uses some criteria for selection Identifies qualified data and Transports the data over onto another file or database
Data transformation
Integrating dissimilar data types Changing codes Adding a time attribute Summarizing data Calculating derived values Renormalizing data
Data loading
Initial and incremental loading Updation of metadata
319
Why ETL ?
Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems,and therefore in all sorts of formats. To solve the problem, companies use extract, transform and load (ETL) software. The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, and an Excel spreadsheet.
320
321
322
ETL Tools
Provides facility to specify a large number of transformation rules with a GUI Generate programs to transform data Handle multiple data sources Handle data redundancy Generate metadata as output Most tools exploit parallelism by running on multiple low-cost servers in multi-threaded environment ETL Tools - Second-Generation PowerCentre/Mart from Informatica Data Mart Solution from Sagent Technology DataStage from Ascential
323
Metadata Management
324
What Is Metadata?
Metadata is Information...
That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse About the data being captured and loaded into the Warehouse Documented in IT tools that improves both business and technical understanding of data and data-related processes
325
Importance Of Metadata
Locating Information Time spent in looking for information. How often information is found? What poor decisions were made based on the incomplete information? How much money was lost or earned as a result? Interpreting information How many times have businesses needed to rework or recall products? What impact does it have on the bottom line ? How many mistakes were due to misinterpretation of existing How much interpretation results form too much metadata? How much time is spent trying to determine if any of the metadata is accurate? Integrating information How various data perspectives connect together? How much time is spent trying to figure out that? How much does the inefficiency and lack of metadata affect decision making documentation?
326
Consumers of Metadata
Technical Users Warehouse administrator Application developer Business Users -Business metadata Meanings Definitions Business Rules Software Tools Used in DW life-cycle development Metadata requirements for each tool must be identified The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool
328
Reischmann-Informatik-Toolbus
Features include facilitation of selective bridging of metadata
330
OLAP
332
Agenda
OLAP Definition Distinction between OLTP and OLAP MDDB Concepts Implementation Techniques Architectures Features Representative Tools
1/13/2012
333
333
1/13/2012
334
334
Operational data; OLTPs are Consolidation data; OLAP the original source of the data comes from the data various OLTP databases To control and run fundamental business tasks A snapshot of ongoing business processes Decision support Multi-dimensional views of various kinds of business activities Periodic long-running batch jobs refresh the data
335
Inserts and Updates Short and fast inserts and updates initiated by end users
1/13/2012
335
MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is intimately related and stored, viewed and analyzed from different perspectives (Dimensions). A hypercube represents a collection of multidimensional data. The edges of the cube are called dimensions Individual items within each dimensions are called members
336
MDDB
Sales Volumes
M O D E L
Mini Van
Sedan
DEALERSHIP
COLOR
27 x 4 = 108 cells
337
3 x 3 x 3 = 27 cells
338
Sparsity
- Input data in applications are typically sparse -Increases with increased dimensions
Data Explosion
-Due to Sparsity -Due to Summarization
Performance
-Doesnt perform better than RDBMS at high data volumes (>20-30 GB)
1/13/2012
339
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
339
LAST NAME EMP# AGE SMITH 01 21 REGAN 12 Sales Volumes 19 FOX 31 63 Miini Van WELD 14 6 5 31 4 M O KELLY 54 3 5 27 D Coupe 5 E L LINK 03 56 4 3 2 Sedan KRANZ 41 45 Blue Red White LUCUS 33 COLOR 41 WEISS 23 19
Smith
Regan
Fox
L A S T N A M E
Weld
Kelly
Link
Kranz
Lucas
Weiss
EMPLOYEE #
1/13/2012
340
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
340
OLAP Features
Calculations applied across dimensions, through hierarchies and/or across members Trend analysis over sequential time periods, What-if scenarios. Slicing / Dicing subsets for on-screen viewing Rotation to new dimensional comparisons in the viewing area Drill-down/up along the hierarchy Reach-through / Drill-through to underlying detail data
1/13/2012
341
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
341
M O D E L
Mini Van
6 3 4
Blue
5 5 3
Red
4 5 2
( ROTATE 90 )
White
o
Coupe
C O L O R
Blue
6 5 4
3 5 5
MODEL
4 3 2
Sedan
Red
Sedan
White
COLOR
View #1
View #2
342
M O D E L
Mini Van
Sedan
C O L O R
Blue
C O L O R
Blue
COLOR
( ROTATE 90 )
MODEL
DEALERSHIP
( ROTATE 90 )
o
( ROTATE 90 )
DEALERSHIP
DEALERSHIP
MODEL
View #1
D E A L E R S H I P D E A L E R S H I P
View #2
View #3
Carr
Mini Van
Clyde
Clyde
M O D E L
Sedan
COLOR
( ROTATE 90 )
MODEL
DEALERSHIP
( ROTATE 90 )
o
MODEL
COLOR
COLOR
View #4
View #5
View #6
343
Sales Volumes
M O D E L
Carr Clyde
Carr Clyde
Normal Blue
Metal Blue
DEALERSHIP
COLOR
1/13/2012
344
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
344
ORGANIZATION DIMENSION
REGION Midwest
DISTRICT
Chicago
St. Louis
Gary
DEALERSHIP
Clyde
Gleason
Carr
Levi
Lucas
Bolton
Moving Up and moving down in a hierarchy is referred to as drill-up / roll-up and drill-down
1/13/2012
345
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
345
1/13/2012
346
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
346
347
348
1/13/2012
349
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
349
OLAP
Cube
OLAP Calculation Engine
Web Browser
OLAP Tools
350
MOLAP - Features
Powerful analytical capabilities (e.g., financial, forecasting, statistical) Aggregation and calculation capabilities Read/write analytic applications Specialized data structures for
Maximum query performance. Optimum space utilization.
1/13/2012
351
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
351
Relational DW
Web Browser
OLAP Calculation Engine
SQL
OLAP Tools
OLAP Applications
1/13/2012
352
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
352
ROLAP - Features
Three-tier hardware/software architecture:
GUI on client; multidimensional processing on midtier server; target database on database server Processing split between mid-tier & database servers
1/13/2012
353
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
353
OLAP
375
Agenda
OLAP Definition Distinction between OLTP and OLAP MDDB Concepts Implementation Techniques Architectures Features Representative Tools
376
377
OLTP vs OLAP
Source of data. OLTP: operational data; OLTP systems are the original source of the data. OLAP: consolidated data that comes from the various OLTP databases.
Purpose. OLTP: to control and run fundamental business tasks. OLAP: decision support.
What the data reveals. OLTP: a snapshot of ongoing business processes. OLAP: multi-dimensional views of various kinds of business activities.
Inserts and updates. OLTP: short, fast inserts and updates initiated by end users. OLAP: periodic long-running batch jobs refresh the data.
378
MDDB Concepts
A multidimensional database is a software system designed for efficient and convenient storage and retrieval of data that is intimately related and is stored, viewed and analyzed from different perspectives (dimensions). A hypercube represents a collection of multidimensional data: the edges of the cube are called dimensions, and the individual items within each dimension are called members.
379
MDDB
[Figure: a Sales Volumes cube whose edges are the dimensions MODEL (Mini Van, Sedan, ...), COLOR and DEALERSHIP. With 3 members per dimension, 3 x 3 x 3 = 27 cells; adding a fourth dimension with 4 members gives 27 x 4 = 108 cells.]
380
381
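The cell counts above can be sketched in Python; the dimension members beyond those named on the slide are illustrative:

```python
from itertools import product

# Hypothetical member lists for the three dimensions of the Sales Volumes cube.
models = ["Mini Van", "Coupe", "Sedan"]
colors = ["Blue", "Red", "White"]
dealerships = ["Clyde", "Carr", "Gleason"]

# A fully dense cube has one cell per member combination: 3 x 3 x 3 = 27.
cube = {cell: 0 for cell in product(models, colors, dealerships)}
print(len(cube))  # 27

# Adding a fourth dimension with 4 members multiplies the cell count: 27 x 4 = 108.
quarters = ["Q1", "Q2", "Q3", "Q4"]
cube4 = {cell: 0 for cell in product(models, colors, dealerships, quarters)}
print(len(cube4))  # 108
```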
Sparsity
- Input data in applications is typically sparse
- Sparsity increases with the number of dimensions
Data Explosion
- Due to sparsity
- Due to summarization
Performance
- Doesn't perform better than an RDBMS at high data volumes (>20-30 GB)
382
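As a sketch of sparsity, a cube can be stored with only its populated cells; all member names and figures below are illustrative:

```python
from itertools import product

# Member lists for a small 3 x 3 x 3 cube.
models = ["Mini Van", "Coupe", "Sedan"]
colors = ["Blue", "Red", "White"]
dealerships = ["Clyde", "Carr", "Gleason"]

# Hypothetical input: only 5 of the 27 possible cells carry sales figures,
# so the cube is stored sparsely as a dict of non-empty cells.
sales = {
    ("Mini Van", "Blue", "Clyde"): 6,
    ("Mini Van", "Red", "Clyde"): 5,
    ("Coupe", "Blue", "Carr"): 3,
    ("Sedan", "White", "Carr"): 2,
    ("Sedan", "Blue", "Gleason"): 4,
}

dense_cells = len(list(product(models, colors, dealerships)))
density = len(sales) / dense_cells
print(f"{density:.0%}")  # 19% of cells populated; the rest are empty
```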
[Figure: the same idea in two forms, a relational table with columns LAST NAME, EMP#, AGE (rows Smith, Regan, Fox, Weld, Kelly, Link, Kranz, Lucas, Weiss) contrasted with the Sales Volumes cube (MODEL x COLOR).]
383
OLAP Features
- Calculations applied across dimensions, through hierarchies and/or across members
- Trend analysis over sequential time periods; what-if scenarios
- Slicing/dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through/drill-through to underlying detail data
384
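A few of these operations can be sketched over a toy cube held as a Python dict; all values are illustrative:

```python
# Toy cube stored as {(model, color, dealership): qty}.
sales = {
    ("Mini Van", "Blue", "Clyde"): 6,
    ("Mini Van", "Red", "Clyde"): 5,
    ("Sedan", "Blue", "Clyde"): 4,
    ("Sedan", "Blue", "Carr"): 2,
}

# Slice: fix one dimension member (dealership = "Clyde") to get a 2-D sub-cube.
slice_clyde = {(m, c): q for (m, c, d), q in sales.items() if d == "Clyde"}

# Dice: keep a subset of members on several dimensions at once.
dice = {k: q for k, q in sales.items()
        if k[0] == "Sedan" and k[1] in ("Blue", "Red")}

# Rotate (pivot): swap the axes of the 2-D view; the cell values are unchanged.
rotated = {(c, m): q for (m, c), q in slice_clyde.items()}

print(slice_clyde[("Mini Van", "Blue")], rotated[("Blue", "Mini Van")])  # 6 6
```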
[Figure: rotating the cube 90 degrees exchanges the axes. View #1 shows MODEL by COLOR; View #2 shows COLOR by MODEL. The cell values are unchanged; only the orientation differs.]
385
[Figure: successive 90-degree rotations of the MODEL x COLOR x DEALERSHIP cube yield six orientations, Views #1-#6, each pairing a different two dimensions on the visible face.]
386
[Figure: Sales Volumes cube with DEALERSHIP members Carr and Clyde, and COLOR drilled down from Blue to Normal Blue and Metal Blue.]
387
ORGANIZATION DIMENSION
REGION: Midwest
DISTRICT: Chicago, St. Louis, Gary
DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton
Moving up and moving down a hierarchy is referred to as drill-up (roll-up) and drill-down.
388
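Roll-up along such a hierarchy can be sketched in Python; the dealership-to-district assignments below are illustrative, since the slide lists the members but not every parent-child pairing:

```python
# Hypothetical parent maps for DEALERSHIP -> DISTRICT -> REGION.
district_of = {"Clyde": "Chicago", "Gleason": "Chicago",
               "Carr": "St. Louis", "Levi": "St. Louis",
               "Lucas": "Gary", "Bolton": "Gary"}
region_of = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}

# Illustrative leaf-level sales figures.
sales_by_dealership = {"Clyde": 10, "Gleason": 7, "Carr": 5,
                       "Levi": 3, "Lucas": 8, "Bolton": 2}

def roll_up(values, parent_of):
    """Aggregate values one level up the hierarchy by summing over children."""
    out = {}
    for member, v in values.items():
        out[parent_of[member]] = out.get(parent_of[member], 0) + v
    return out

by_district = roll_up(sales_by_dealership, district_of)  # drill-up one level
by_region = roll_up(by_district, region_of)              # and one more
print(by_district["Chicago"], by_region["Midwest"])  # 17 35
```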
389
390
391
392
[MOLAP architecture diagram: Web Browser / OLAP Tools -> OLAP Calculation Engine -> OLAP Cube]
393
MOLAP - Features
- Powerful analytical capabilities (e.g. financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for maximum query performance and optimum space utilization
394
[ROLAP architecture diagram: Web Browser / OLAP Tools / OLAP Applications -> OLAP Calculation Engine -> SQL -> Relational DW]
395
ROLAP - Features
Three-tier hardware/software architecture: GUI on the client; multidimensional processing on a mid-tier server; target database on the database server. Processing is split between the mid-tier and database servers.
396
[HOLAP architecture diagram: Web Browser / OLAP Tools / OLAP Applications -> OLAP Calculation Engine -> SQL -> Relational DW]
397
HOLAP - Features
- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of the RDBMS combined with MDDB performance
- Calculation engine provides full analysis features
- Source of data is transparent to the end user
398
Architecture Comparison
MOLAP. Definition: MDDB OLAP = transaction-level data + summary in MDDB. Data explosion due to sparsity: high (may go beyond control; estimation is very important). Data explosion due to summarization: 3-10 times with good design. Query execution speed: fast (depends upon the size of the MDDB). Where to apply: small transactional data + complex model + frequent summary analysis.
ROLAP. Definition: Relational OLAP = transaction-level data + summary in RDBMS. Data explosion due to sparsity: no sparsity. Data explosion due to summarization: to the necessary extent. Query execution speed: slow. Where to apply: very large transactional data that needs to be viewed/sorted.
HOLAP. Definition: Hybrid OLAP = ROLAP + summary in MDDB. Data explosion due to sparsity: exists only in the MDDB part. Data explosion due to summarization: to the necessary extent. Query execution speed: optimum (like ROLAP when the data is fetched from the RDBMS, otherwise like MOLAP). Cost: high (RDBMS + disk space + MDDB server cost). Where to apply: large transactional data + frequent summary analysis.
399
Representative tools
- MOLAP: Oracle Express products, Hyperion Essbase, Cognos PowerPlay, Seagate Holos, SAS
- ROLAP: MicroStrategy DSS Agent, Informix MetaCube, Brio Query, Business Objects / Web Intelligence
400
Typical OLAP applications: Sales Analysis, Financial Analysis, Profitability Analysis, Performance Analysis, Risk Management, Profiling & Segmentation, Scorecard Applications, NPA Management, Strategic Planning, Customer Relationship Management (CRM)
401
Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors: Are the requirements complete? Are the requirements singular? Are the requirements unambiguous? Are the requirements developable? Are the requirements testable?
409
Unit Testing
Unit testing for data warehouses is white-box testing: it should check the ETL procedures/mappings/jobs and the reports developed. Unit testing the ETL procedures verifies: whether the ETLs are accessing and picking up the right data from the right source; that all data transformations are correct according to the business rules and the data warehouse is correctly populated with the transformed data; and the rejected records that don't fulfil the transformation rules.
410
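A minimal sketch of such a unit test in Python, with a hypothetical business rule (non-negative amount, known currency); rows that fail the rule go to the reject pile instead of the target:

```python
VALID_CURRENCIES = {"USD", "EUR", "INR"}  # hypothetical reference set

def transform(rows):
    """Apply the rule; return (loaded, rejected) record lists."""
    loaded, rejected = [], []
    for row in rows:
        if row["amount"] >= 0 and row["currency"] in VALID_CURRENCIES:
            # transformation: derive the amount in cents for the target table
            loaded.append({**row, "amount_cents": round(row["amount"] * 100)})
        else:
            rejected.append(row)  # kept so reject handling can be tested too
    return loaded, rejected

# Unit test: the right data is picked up and bad records are rejected.
rows = [{"amount": 12.5, "currency": "USD"},
        {"amount": -3.0, "currency": "USD"},
        {"amount": 7.0, "currency": "XXX"}]
loaded, rejected = transform(rows)
assert len(loaded) == 1 and loaded[0]["amount_cents"] == 1250
assert len(rejected) == 2
print("transformation rule tests passed")
```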
Unit Testing
Unit testing the report data: Verify report data against the source: data in a data warehouse is stored at an aggregate level compared to the source systems, so the QA team should verify the granular data stored in the data warehouse against the available source data. Field-level data verification: the QA team must understand the linkages for the fields displayed in the report, trace them back, and compare them with the source systems. Derivation formulae and calculation rules should be verified.
411
Integration Testing
Integration testing involves the following: the sequence of ETL jobs in a batch; initial loading of records into the data warehouse; incremental loading of records at a later date, to verify newly inserted or updated data; testing the rejected records that don't fulfil the transformation rules; and error-log generation.
412
Performance Testing
Performance testing should check: that ETL processes complete within the time window; monitoring and measurement of data quality issues; and refresh times for standard and complex reports.
413
Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client in terms of ETL process integrity, business functionality and reporting.
414
Questions
415
Thank You
416
Data Modeling
Effective way of using a Data Warehouse
419
Data Modeling
The E-R data model is commonly used in OLTP systems, while the dimensional data model is commonly used in OLAP. E-R (Entity-Relationship) Data Model
- Entity: an object that can be observed and classified by its properties and characteristics, e.g. employee, book, student
- Relationship: an association relating entities to other entities
Star Schema
Fact Table
sale(orderId, date, custId, prodId, storeId, qty, amt):
o100  1/7/97  53   p1  c1  1  12
o102  2/7/97  53   p2  c1  2  11
o105  3/8/97  111  p1  c3  5  50
Dimension Tables
product(prodId, name, price): p1 bolt 10; p2 nut 5
store(storeId, city): c1 nyc; c2 sfo; c3 la
customer(custId, name, address, city): 53 joe "10 main" sfo; 81 fred "12 main" sfo; 111 sally "80 willow" la
422
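A typical star-schema query over this sample data can be sketched with Python's built-in sqlite3; table and column names follow the slide:

```python
import sqlite3

# Load the star schema: one fact table (sale) joined directly to dimensions.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product(prodId TEXT PRIMARY KEY, name TEXT, price INT);
CREATE TABLE store(storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE sale(orderId TEXT, date TEXT, custId INT,
                  prodId TEXT, storeId TEXT, qty INT, amt INT);
INSERT INTO product VALUES ('p1','bolt',10), ('p2','nut',5);
INSERT INTO store VALUES ('c1','nyc'), ('c2','sfo'), ('c3','la');
INSERT INTO sale VALUES ('o100','1/7/97',53,'p1','c1',1,12),
                        ('o102','2/7/97',53,'p2','c1',2,11),
                        ('o105','3/8/97',111,'p1','c3',5,50);
""")

# Typical star query: aggregate the fact table, grouped via a dimension.
rows = con.execute("""
    SELECT s.city, SUM(f.amt)
    FROM sale f JOIN store s ON f.storeId = s.storeId
    GROUP BY s.city ORDER BY s.city
""").fetchall()
print(rows)  # [('la', 50), ('nyc', 23)]
```

Each dimension is one join away from the fact table, which is what makes star queries simple and fast.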
Snowflake Schema
Fact table: sale (as in the star schema), joined to snowflaked dimension tables:
store(storeId, cityId, tId, mgr): s5 sfo t1 joe; s7 sfo t2 fred; s9 la t1 nancy
sType(tId, size, location): t1 small downtown; t2 large suburbs
city(cityId, pop, regId): sfo 1M north; la 5M south
The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than efficiency of data manipulation. As such, the tables in these schemas are not highly normalized and are frequently designed at a level of normalization short of third normal form.
423
424
425
Continuous Monitoring
- Identify & correct causes of defects
- Refine data capture mechanisms at source
- Educate users on the importance of DQ
426
427
429
430
ETL Architecture
[Figure: clickstream ETL flow. Data Collection: visitors' web browsers -> the Internet -> web server logs & e-commerce transaction data, landed as flat files. Data Extraction: scheduled extraction into the Staging Area. Data Transformation: clean, transform, match, merge. Data Loading: scheduled loading into the RDBMS.]
431
ETL Architecture
Data Extraction
- Rummages through a file or database
- Uses some criteria for selection
- Identifies qualified data
- Transports the data over onto another file or database
Data Transformation
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data
Data Loading
- Initial and incremental loading
- Updating of metadata
432
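The three steps can be sketched in Python over in-memory rows; the field names and selection rule are illustrative:

```python
import datetime

# Extract: select qualified rows by some criterion (here, qty must be present).
source = [
    {"sku": "p1", "qty": "2", "price": "10"},
    {"sku": "p2", "qty": "", "price": "5"},   # disqualified: missing qty
    {"sku": "p1", "qty": "3", "price": "10"},
]
extracted = [r for r in source if r["qty"]]

def transform(r):
    # Changing codes, calculating a derived value, adding a time attribute.
    return {"sku": r["sku"].upper(),
            "revenue": int(r["qty"]) * int(r["price"]),
            "load_date": datetime.date(2012, 1, 13).isoformat()}

# Load: initial load into the target "table".
target = []
target.extend(transform(r) for r in extracted)
print(len(target), target[0]["revenue"])  # 2 20
```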
Why ETL ?
Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats. To solve the problem, companies use extract, transform and load (ETL) software. The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.
433
434
435
ETL Tools
- Provide a GUI for specifying a large number of transformation rules
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment
Second-generation ETL tools: PowerCenter/PowerMart from Informatica; Data Mart Solution from Sagent Technology; DataStage from Ascential
436
Metadata Management
437
What Is Metadata?
Metadata is Information...
- That describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse
- About the data being captured and loaded into the warehouse
- Documented in IT tools that improve both business and technical understanding of data and data-related processes
438
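As a sketch, one metadata entry answering those questions for a single warehouse column; all values are hypothetical:

```python
# Hypothetical metadata record for one warehouse column, keyed by the
# WHAT / WHEN / WHO / WHERE / HOW questions named above.
column_metadata = {
    "what":  "sale.amt: sale amount in whole dollars",
    "when":  "loaded nightly by the 02:00 batch refresh",
    "who":   "owned by the finance data steward",
    "where": "sourced from the order-entry OLTP system",
    "how":   "summed from order line items during ETL transformation",
}
for question, answer in column_metadata.items():
    print(f"{question.upper():6}: {answer}")
```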
Importance Of Metadata
Locating information
- How much time is spent looking for information? How often is it found?
- What poor decisions were made based on incomplete information? How much money was lost or earned as a result?
Interpreting information
- How many times have businesses needed to rework or recall products? What impact does that have on the bottom line?
- How many mistakes were due to misinterpretation of existing documentation?
- How much misinterpretation results from too much metadata? How much time is spent trying to determine whether any of the metadata is accurate?
Integrating information
- How do the various data perspectives connect together? How much time is spent trying to figure that out?
- How much do the inefficiency and lack of metadata affect decision making?
439
Consumers of Metadata
Technical users
- Warehouse administrator
- Application developer
Business users (business metadata)
- Meanings, definitions, business rules
Software tools
- Used in DW life-cycle development
- Metadata requirements for each tool must be identified
- The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
- Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool
441
Reischmann-Informatik-Toolbus
Features include facilitation of selective bridging of metadata
443