
Data Warehouse Concepts

Avinash Kanumuru, Diya Jana, Debyajit Majumder

2009 Wipro Ltd - Confidential

Content
1. An Overview of Data Warehouse
2. Data Warehouse Architecture
3. Data Modeling for Data Warehouse
4. Overview of Data Cleansing
5. Data Extraction, Transformation, Load

Content [contd]
6. Metadata Management
7. OLAP
8. Data Warehouse Testing


An Overview
Understanding What is a Data Warehouse


What is a Data Warehouse?

Definitions of Data Warehouse
- W. H. Inmon: A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions.
- Babcock: A data warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End-user-oriented data access and reporting tools let users get at the data for decision support.
- Ralph Kimball: A data warehouse is a relational database, a copy of transaction data specifically structured for query and analysis.
- In simple terms: data warehousing is the collection of data from different systems, which helps in business decisions, analysis and reporting.


Data Warehouse def. by WH Inmon


A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:
- Subject Oriented: Data that gives information about a particular subject instead of about a company's ongoing operations.
- Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
- Nonvolatile: Data is stable in a data warehouse. More data is added, but data is never removed. This enables management to gain a consistent picture of the business.
- Time Variant: In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.


Data Warehouse Architecture


What makes a Data Warehouse


Components of Warehouse
- Source Tables: Real-time, volatile data held in relational databases for transaction processing (OLTP). Sources can be any relational databases or flat files.
- ETL Tools: Extract, cleanse, transform (aggregates, joins) and load the data from the sources to the target.
- Maintenance and Administration Tools: Authorize and monitor access to the data, set up users, and schedule jobs to run in off-peak periods.
- Modeling Tools: Used to design the data warehouse for high performance using the dimensional data modeling technique, and to map the source and target files.
- Databases: Target databases and data marts, which are part of the data warehouse. These are structured for analysis and reporting purposes.
- End-user tools for analysis and reporting: Get reports and analyze the data from the target tables. Different types of querying, data mining and OLAP tools are used for this purpose.


Data Warehouse Architecture


Basic design: source files are loaded into a warehouse, and users query the data for different purposes.

Design with a staging area: after cleansing and transformation, the data is loaded into the staging area and tested there. It is then loaded into the target database/warehouse, which is divided into data marts that different users access for reporting and analysis.


Data Modeling
Effective way of using a Data Warehouse


Data Modeling
The E-R data model is commonly used in OLTP; the dimensional data model is commonly used in OLAP.

E-R (Entity-Relationship) Data Model
- Entity: An object that can be observed and classified based on its properties and characteristics, such as an employee, a book or a student.
- Relationship: Relates entities to other entities.

Different perspectives of data modeling:
  o Conceptual Data Model
  o Logical Data Model
  o Physical Data Model

Types of dimensional data models most commonly used:
  o Star Schema
  o Snowflake Schema

Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:
- Dimension: A category of information. For example, the Time dimension.
- Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.
- Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year > Quarter > Month > Day.
- Fact Table: A table that contains the measures of interest.
- Lookup Table: Provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.
- Surrogate Keys: Used to avoid data integrity problems. They are helpful for Slowly Changing Dimensions and act as index/primary keys.

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
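The role of surrogate keys in a slowly changing dimension can be illustrated with a minimal sketch. It assumes a history-keeping approach in which a changed attribute produces a new row with a new warehouse-generated key; the class and field names (`CustomerDimension`, `CustomerRow`, `cust_id`, `current`) are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class CustomerRow:
    surrogate_key: int   # warehouse-generated key, acts as the primary key
    cust_id: str         # natural (business) key from the source system
    city: str
    current: bool = True


class CustomerDimension:
    """Hypothetical dimension table that keeps history by adding rows."""

    def __init__(self):
        self.rows = []
        self._next_key = count(1)

    def upsert(self, cust_id, city):
        """Return the surrogate key of the current version of cust_id."""
        for row in self.rows:
            if row.cust_id == cust_id and row.current:
                if row.city == city:
                    return row.surrogate_key   # nothing changed
                row.current = False            # expire the old version
                break
        new_row = CustomerRow(next(self._next_key), cust_id, city)
        self.rows.append(new_row)
        return new_row.surrogate_key


dim = CustomerDimension()
k1 = dim.upsert("53", "sfo")
k2 = dim.upsert("53", "sfo")   # unchanged: same surrogate key
k3 = dim.upsert("53", "la")    # changed attribute: new row, new key
```

Fact rows loaded before the change keep pointing at the old surrogate key, so historical reports still see the customer's earlier city.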

Star Schema
Dimension Table: product

prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store

storeId  city
c1       nyc
c2       sfo
c3       la

Fact Table: sale

orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension Table: customer

custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
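A query against a star schema typically joins the fact table to the dimension tables it references and aggregates a measure. The sketch below loads the example rows into an in-memory SQLite database; the column types and the particular query are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
-- The fact table holds the measures (qty, amt) plus one key per dimension.
CREATE TABLE sale (
    orderId TEXT, date TEXT,
    custId  INTEGER REFERENCES customer(custId),
    prodId  TEXT REFERENCES product(prodId),
    storeId TEXT REFERENCES store(storeId),
    qty INTEGER, amt INTEGER
);
INSERT INTO product VALUES ('p1','bolt',10), ('p2','nut',5);
INSERT INTO store VALUES ('c1','nyc'), ('c2','sfo'), ('c3','la');
INSERT INTO customer VALUES
    (53,'joe','10 main','sfo'), (81,'fred','12 main','sfo'), (111,'sally','80 willow','la');
INSERT INTO sale VALUES
    ('o100','1/7/97',53,'p1','c1',1,12),
    ('o102','2/7/97',53,'p2','c1',2,11),
    ('105','3/8/97',111,'p1','c3',5,50);
""")

# Total sales amount per store city and product name.
totals = conn.execute("""
    SELECT st.city, p.name, SUM(s.amt)
    FROM sale s
    JOIN store st  ON st.storeId = s.storeId
    JOIN product p ON p.prodId  = s.prodId
    GROUP BY st.city, p.name
    ORDER BY st.city, p.name
""").fetchall()
```

Every dimension is one join away from the fact table, which is what keeps star-schema queries simple.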


Snowflake Schema
Tables shown (labeled on the slide as dimension and fact tables):

store

storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

sType

tId  size   location
t1   small  downtown
t2   large  suburbs

city

cityId  pop  regId
sfo     1M   north
la      5M   south

region

regId  name
north  cold region
south  warm region

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schemas are not normalized much, and are frequently designed at a level of normalization short of third normal form.
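In a snowflake schema a dimension is itself split into related tables, so a query may need to follow several joins. The sketch below uses the example tables in an in-memory SQLite database. It assumes that the `regId` column belongs to the `city` table (the flattened slide text does not make this explicit) and that all columns are text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE region (regId TEXT PRIMARY KEY, name TEXT);
-- Assumption: regId is a column of city, linking each city to a region.
CREATE TABLE city   (cityId TEXT PRIMARY KEY, pop TEXT,
                     regId TEXT REFERENCES region(regId));
CREATE TABLE sType  (tId TEXT PRIMARY KEY, size TEXT, location TEXT);
CREATE TABLE store  (storeId TEXT PRIMARY KEY,
                     cityId TEXT REFERENCES city(cityId),
                     tId TEXT REFERENCES sType(tId), mgr TEXT);
INSERT INTO region VALUES ('north','cold region'), ('south','warm region');
INSERT INTO city   VALUES ('sfo','1M','north'), ('la','5M','south');
INSERT INTO sType  VALUES ('t1','small','downtown'), ('t2','large','suburbs');
INSERT INTO store  VALUES ('s5','sfo','t1','joe'), ('s7','sfo','t2','fred'),
                          ('s9','la','t1','nancy');
""")

# Reaching the region of a store takes two hops: store -> city -> region.
rows = conn.execute("""
    SELECT s.storeId, t.size, r.name
    FROM store s
    JOIN sType t  ON t.tId    = s.tId
    JOIN city c   ON c.cityId = s.cityId
    JOIN region r ON r.regId  = c.regId
    ORDER BY s.storeId
""").fetchall()
```

Compared with the star layout, the extra normalization removes repeated city and region attributes at the cost of longer join paths.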


Overview of Data Cleansing


The Need For Data Quality

- Difficulty in decision making
- Time delays in operation
- Organizational mistrust
- Data ownership conflicts
- Customer attrition
- Costs associated with:
  o error detection
  o error rework
  o customer service
  o fixing customer problems


Six Steps To Data Quality


1. Understand Information Flow In Organization
  o Identify authoritative data sources
  o Interview employees & customers

2. Identify Potential Problem Areas & Assess Impact
  o Data entry points
  o Cost of bad data

3. Measure Quality Of Data
  o Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values

4. Clean & Load Data
  o Use data cleansing tools to clean data at the source
  o Load only clean data into the data warehouse

5. Continuous Monitoring
  o Schedule periodic cleansing of source data

6. Identify Areas of Improvement
  o Identify & correct cause of defects
  o Refine data capture mechanisms at source
  o Educate users on importance of DQ
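Measuring data quality and loading only clean data can be sketched in a few lines. This is not how the commercial tools work; it is a minimal illustration, and the records, field names and the 0 to 120 age rule are all hypothetical.

```python
from collections import Counter

# Hypothetical source records; field names and rules are illustrative only.
records = [
    {"cust_id": 53,  "name": "joe",   "city": "sfo", "age": 34},
    {"cust_id": 81,  "name": "fred",  "city": "",    "age": 41},   # missing city
    {"cust_id": 111, "name": "sally", "city": "la",  "age": 230},  # implausible age
    {"cust_id": 53,  "name": "joe",   "city": "sfo", "age": 34},   # duplicate key
]


def profile(rows):
    """Count simple data-quality defects per category."""
    key_counts = Counter(r["cust_id"] for r in rows)
    return {
        "missing": sum(1 for r in rows if any(v in ("", None) for v in r.values())),
        "duplicate": sum(n - 1 for n in key_counts.values() if n > 1),
        "incorrect": sum(1 for r in rows if not 0 <= r["age"] <= 120),
    }


def clean(rows):
    """Keep the first occurrence of each key and drop rows that fail a check."""
    seen, kept = set(), []
    for r in rows:
        ok = all(v not in ("", None) for v in r.values()) and 0 <= r["age"] <= 120
        if ok and r["cust_id"] not in seen:
            seen.add(r["cust_id"])
            kept.append(r)
    return kept


report = profile(records)     # step 3: measure quality
clean_rows = clean(records)   # step 4: only these rows would be loaded
```

Running `profile` on a schedule against the source would correspond to the continuous-monitoring step.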

Data Quality Solution


Customized Programs
- Strengths:
  o Addresses specific needs
  o No bulky one-time investment
- Limitations:
  o Tons of custom programs in different environments are difficult to manage
  o Minor alterations demand coding efforts

Data Quality Assessment Tools
- Strength:
  o Provide automated assessment
- Limitation:
  o No measure of data accuracy


Data Quality Solution


Business Rule Discovery Tools
- Strengths:
  o Detect correlation in data values
  o Can detect patterns of behavior that indicate fraud
- Limitations:
  o Not all variables can be discovered
  o Some discovered rules might not be pertinent
  o There may be performance problems with large files or with many fields

Data Reengineering & Cleansing Tools
- Strengths:
  o Usually are integrated packages with cleansing features as add-ons
- Limitations:
  o Error prevention at source is usually absent
  o The ETL tools have limited cleansing facilities

Tools In The Market

- Business Rule Discovery Tools
  o Integrity Data Reengineering Tool from Vality Technology
  o Trillium Software System from Harte-Hanks Data Technologies
  o Migration Architect from DB Star
- Data Reengineering & Cleansing Tools
  o Carleton Pureview from Oracle
  o ETI-Extract from Evolutionary Technologies
  o PowerMart from Informatica Corp
  o Sagent Data Mart from Sagent Technology
- Data Quality Assessment Tools
  o Migration Architect, Evoke Axio from Evoke Software
  o Wizrule from Wizsoft
- Name & Address Cleansing Tools
  o Centrus Suite from Sagent
  o I.d.centric from First Logic


Data Extraction, Transformation, Load


ETL Architecture

[Figure: ETL architecture. Pipeline stages: Data Collection, Data Extraction, Data Transformation, Data Loading, Data Storage & Integration. Labeled components: Visitors, Web Browsers, The Internet; Web Server Logs & E-comm Transaction Data; External Data (Demographics, Household, Webographics, Income); Flat Files; RDBMS; Other OLTP Systems; Scheduled Extraction; Staging Area (Clean, Transform, Match, Merge); Meta Data Repository; Scheduled Loading; Enterprise Data Warehouse.]


ETL Architecture
Data Extraction:
- Rummages through a file or database
- Uses some criteria for selection
- Identifies qualified data
- Transports the data over onto another file or database

Data Transformation:
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data

Data Extraction Cleanup:
- Restructuring of records or fields
- Removal of operational-only data
- Supply of missing field values
- Data integrity checks
- Data consistency and range checks, etc.

Data Loading:
- Initial and incremental loading
- Updating of metadata
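The extraction, transformation, cleanup and loading steps above can be sketched as three small functions. Everything in this example is assumed for illustration: the source rows, the field names, the code mapping, the choice of 0.0 for a missing price, and the `sales` target table in an in-memory SQLite database.

```python
import sqlite3
from datetime import date

# Hypothetical source rows, standing in for a file or an OLTP table.
source = [
    {"order_id": "o100", "status": "SHIPPED",   "gender": "M", "qty": "1", "price": "12.0"},
    {"order_id": "o101", "status": "CANCELLED", "gender": "F", "qty": "3", "price": "4.0"},
    {"order_id": "o102", "status": "SHIPPED",   "gender": "f", "qty": "2", "price": None},
]


def extract(rows):
    """Select the rows that meet a criterion (here: shipped orders)."""
    return [r for r in rows if r["status"] == "SHIPPED"]


def transform(rows, load_date):
    out = []
    for r in rows:
        # Supply a missing field value (0.0 is an arbitrary illustrative default).
        price = float(r["price"]) if r["price"] is not None else 0.0
        qty = int(r["qty"])  # integrate dissimilar data types
        out.append({
            "order_id": r["order_id"],
            "gender": {"M": "Male", "F": "Female"}[r["gender"].upper()],  # change codes
            "amount": qty * price,                # calculate a derived value
            "load_date": load_date.isoformat(),   # add a time attribute
        })
    return out  # the operational-only field "status" is not carried over


def load(conn, rows):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id TEXT PRIMARY KEY, gender TEXT, amount REAL, load_date TEXT)"
    )
    # INSERT OR REPLACE lets the same function serve initial and incremental loads.
    conn.executemany(
        "INSERT OR REPLACE INTO sales VALUES (:order_id, :gender, :amount, :load_date)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(source), date(2009, 1, 1)))
loaded = conn.execute("SELECT order_id, gender, amount FROM sales ORDER BY order_id").fetchall()
```

A real ETL tool would also record what was loaded and when in its metadata repository, which this sketch leaves out.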


Why ETL?
- Companies have valuable data lying around throughout their networks that needs to be moved from one place to another.
- The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats.
- To solve the problem, companies use extract, transform and load (ETL) software.
- The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.



Major components involved in ETL Processing

- Design manager: Lets developers define source-to-target mappings, transformations, process flows, and jobs.
- Meta data management: Provides a repository to define, document, and manage information about the ETL design and runtime processes.
- Extract: The process of reading data from a database.
- Transform: The process of converting the extracted data.
- Load: The process of writing the data into the target database.
- Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components.
- Administration and operation: ETL utilities let administrators schedule, run and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems.

ETL Tools
- Provide a facility to specify a large number of transformation rules with a GUI
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second-Generation
- PowerCenter/PowerMart from Informatica
- Data Mart Solution from Sagent Technology
- DataStage from Ascential

Metadata Management


What Is Metadata?
Metadata is information...
- That describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse
- About the data being captured and loaded into the warehouse
- Documented in IT tools, that improves both business and technical understanding of data and data-related processes


Importance Of Metadata
Locating information
- How much time is spent looking for information?
- How often is the information found?
- What poor decisions were made based on incomplete information?
- How much money was lost or earned as a result?

Interpreting information
- How many times have businesses needed to rework or recall products?
- What impact does it have on the bottom line?
- How many mistakes were due to misinterpretation of existing documentation?
- How much interpretation results from too much metadata?
- How much time is spent trying to determine if any of the metadata is accurate?

Integrating information
- How do the various data perspectives connect together?
- How much time is spent trying to figure that out?
- How much does the inefficiency and lack of metadata affect decision making?


Requirements for DW Metadata Management

- Provide a simple catalogue of business metadata descriptions and views
- Document/manage metadata descriptions from an integrated development environment
- Enable DW users to identify and invoke pre-built queries against the data stores
- Design and enhance new data models and schemas for the data warehouse
- Capture data transformation rules between the operational and data warehousing databases
- Provide change impact analysis, and update across these technologies

Consumers of Metadata
- Technical Users
  o Warehouse administrator
  o Application developer
- Business Users (business metadata)
  o Meanings
  o Definitions
  o Business rules
- Software Tools
  o Used in DW life-cycle development
  o Metadata requirements for each tool must be identified
  o The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
  o Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool


Trends in the Metadata Management Tools


Third Party Bridging Tools
- Oracle Exchange
  o Technology of choice for a long list of repository, enterprise and workgroup vendors
- Reischmann-Informatik-Toolbus
  o Features include facilitation of selective bridging of metadata
- Ardent Software / Dovetail Software - Interplay
  o Hub and Spoke solution for enabling metadata interoperability
  o Ardent focusing on own engagements, not selling it as an independent product
- Informix's Metadata Plug-ins
  o Available with Ardent Datastage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase Powerdesigner, Brio, Microstrategy

Trends in the Metadata Management Tools


Metadata Repositories
- IBM, Oracle and Microsoft to offer free or near-free basic repository services
- Enable organisations to reuse metadata across technologies
- Integrate DB design, data transformation and BI tools from different vendors
- Multi-tool vendors taking a bridged or federated rather than integrated approach to sharing metadata
- Both IBM and Oracle have multiple repositories for different lines of products, e.g., one for AD and one for DW, with bridges between them


Trends in the Metadata Management Tools


Metadata Interchange Standards
- CDIF (CASE Data Interchange Format)
  o Most frequently used interchange standard
  o Addresses only a limited subset of metadata artifacts
- OMG (Object Management Group) - CWM
  o XML addresses context and data meaning, not presentation
  o Can enable exchange over the web employing industry standards for storing and sharing programming data
  o Will allow sharing of UML and MOF objects between various development tools and repositories
- MDC (Metadata Coalition)
  o Based on XML/UML standards
  o Promoted by Microsoft along with 20 partners including Object Management Group (OMG), Oracle Carleton Group, CA-PLATINUM Technology (Founding Member), Viasoft

OLAP


Agenda
- OLAP Definition
- Distinction between OLTP and OLAP
- MDDB Concepts
- Implementation Techniques
- Architectures
- Features
- Representative Tools

1/13/2012


OLAP: On-Line Analytical Processing

- OLAP can be defined as a technology which allows users to view aggregate data across measurements (like Maturity Amount, Interest Rate, etc.) along with a set of related parameters called dimensions (like Product, Organization, Customer, etc.)
- Used interchangeably with BI
- A multidimensional view of data is the foundation of OLAP
- Users: analysts, decision makers


Distinction between OLTP and OLAP


Source of data
- OLTP System: Operational data; OLTPs are the original source of the data
- OLAP System: Consolidation data; OLAP data comes from the various OLTP databases

Purpose of data
- OLTP System: To control and run fundamental business tasks
- OLAP System: Decision support

What the data reveals
- OLTP System: A snapshot of ongoing business processes
- OLAP System: Multi-dimensional views of various kinds of business activities

Inserts and Updates
- OLTP System: Short and fast inserts and updates initiated by end users
- OLAP System: Periodic long-running batch jobs refresh the data

MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is
- intimately related, and
- stored, viewed and analyzed from different perspectives (dimensions).

A hypercube represents a collection of multidimensional data.
- The edges of the cube are called dimensions
- Individual items within each dimension are called members


RDBMS v/s MDDB: Increased Complexity...


Relational DBMS

MODEL         COLOR  DEALER   VOL.
MINI VAN      BLUE   Clyde    6
MINI VAN      BLUE   Gleason  3
MINI VAN      BLUE   Carr     2
MINI VAN      RED    Clyde    5
MINI VAN      RED    Gleason  3
MINI VAN      RED    Carr     1
MINI VAN      WHITE  Clyde    3
MINI VAN      WHITE  Gleason  1
MINI VAN      WHITE  Carr     4
SPORTS COUPE  BLUE   Clyde    3
SPORTS COUPE  BLUE   Gleason  3
SPORTS COUPE  BLUE   Carr     3
SPORTS COUPE  RED    Clyde    4
SPORTS COUPE  RED    Gleason  3
SPORTS COUPE  RED    Carr     6
SPORTS COUPE  WHITE  Clyde    2
SPORTS COUPE  WHITE  Gleason  3
SPORTS COUPE  WHITE  Carr     5
SEDAN         BLUE   Clyde    4
SEDAN         BLUE   Gleason  3
SEDAN         BLUE   Carr     2
...           ...    ...      ...

27 x 4 = 108 cells

MDDB

[Figure: Sales Volumes cube with dimensions MODEL (Mini Van, Coupe, Sedan), DEALERSHIP (Carr, Gleason, Clyde) and COLOR (Blue, Red, White).]

3 x 3 x 3 = 27 cells
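The cell counts can be reproduced with a short sketch that holds the same data in both forms. It uses only the 21 volumes visible in the relational listing; the remaining sedan rows are elided in the source and are left out here too. Representing the cube as a dictionary keyed by member tuples is an illustrative choice, not how an MDDB stores data internally.

```python
from itertools import product

models = ["MINI VAN", "SPORTS COUPE", "SEDAN"]
colors = ["BLUE", "RED", "WHITE"]
dealers = ["Clyde", "Gleason", "Carr"]

# The 21 volumes visible in the listing, in row order
# (model varies slowest, dealer fastest).
volumes = [6, 3, 2, 5, 3, 1, 3, 1, 4,
           3, 3, 3, 4, 3, 6, 2, 3, 5,
           4, 3, 2]

# Relational form: one (model, color, dealer, volume) row per combination.
keys = list(product(models, colors, dealers))
rows = [(m, c, d, v) for (m, c, d), v in zip(keys, volumes)]

# Multidimensional form: the three dimension members address a cell directly.
cube = {(m, c, d): v for m, c, d, v in rows}

relational_cells = len(keys) * 4                        # 27 rows x 4 columns
cube_cells = len(models) * len(colors) * len(dealers)   # 3 x 3 x 3
```

In the relational form the dimension values are repeated in every row, whereas in the cube they appear once along each edge.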


Benefits of MDDB over RDBMS

- Ease of Data Presentation & Navigation
  o A great deal of information is gleaned immediately upon direct inspection of the array
  o The user is able to view data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than the one offered by the relational table
- Storage Space
  o Very low space consumption compared to a relational DB
- Performance
  o Gives much better performance. A relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad-hoc queries
- Ease of Maintenance
  o No overhead, as data is stored in the same way it is viewed. In a relational DB, indexes, sophisticated joins, etc. are used, which require considerable storage and maintenance

Issues with MDDB

Sparsity
- Input data in applications is typically sparse
- Increases with increased dimensions

Data Explosion
- Due to sparsity
- Due to summarization

Performance
- Doesn't perform better than RDBMS at high data volumes (>20-30 GB)


Issues with MDDB - Sparsity Example


If dimension members of different dimensions do not interact, a blank cell is left behind.

LAST NAME  EMP#  AGE
SMITH      01    21
REGAN      12    19
FOX        31    63
WELD       14    31
KELLY      54    27
LINK       03    56
KRANZ      41    45
LUCAS      33    41
WEISS      23    19

[Figure: Employee Age array with LAST NAME (Smith, Regan, Fox, Weld, Kelly, Link, Kranz, Lucas, Weiss) on one axis and EMPLOYEE # (31, 41, 23, 01, 14, 54, 03, 12, 33) on the other. Each employee fills a single cell and the rest stay blank. Shown for contrast: the fully populated Sales Volumes array of MODEL by COLOR.]
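The sparsity in this example can be quantified with a small sketch that lays the employee data out as a two-dimensional array. The dictionary layout and the use of `None` for blank cells are illustrative assumptions.

```python
# Employee data from the example: (last name, employee number) -> age.
ages = {
    ("SMITH", "01"): 21, ("REGAN", "12"): 19, ("FOX", "31"): 63,
    ("WELD", "14"): 31, ("KELLY", "54"): 27, ("LINK", "03"): 56,
    ("KRANZ", "41"): 45, ("LUCAS", "33"): 41, ("WEISS", "23"): 19,
}

names = sorted({name for name, _ in ages})
emp_numbers = sorted({emp for _, emp in ages})

# Dense array: one cell per (name, employee number) combination,
# with None standing in for a blank cell.
dense = [[ages.get((n, e)) for e in emp_numbers] for n in names]

total_cells = len(names) * len(emp_numbers)
filled_cells = len(ages)
sparsity = 1 - filled_cells / total_cells
```

Last name and employee number identify the same thing, so the two dimensions never combine freely and most of the array is empty.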


OLAP Features
- Calculations applied across dimensions, through hierarchies and/or across members
- Trend analysis over sequential time periods
- What-if scenarios
- Slicing / dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through / drill-through to underlying detail data


Features of OLAP - Rotation

Complex queries and sorts in a relational environment translate to a simple rotation.

View #1: Sales Volumes, MODEL by COLOR

          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2

(Rotate 90°)

View #2: Sales Volumes, COLOR by MODEL

       Mini Van  Coupe  Sedan
Blue   6         3      4
Red    5         5      3
White  4         5      2

A 2-dimensional array has 2 views.
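For a two-dimensional array, the rotation amounts to swapping rows and columns. A minimal sketch using the Sales Volumes figures from the two views, with nested lists as an illustrative representation:

```python
models = ["Mini Van", "Coupe", "Sedan"]
colors = ["Blue", "Red", "White"]

# View #1: rows are models, columns are colors.
view1 = [
    [6, 5, 4],   # Mini Van
    [3, 5, 5],   # Coupe
    [4, 3, 2],   # Sedan
]

# View #2: rotating the array swaps the two dimensions (a transpose),
# so rows become colors and columns become models.
view2 = [list(column) for column in zip(*view1)]
```

No data is re-queried or re-sorted; only the orientation of the same cells changes.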



Features of OLAP - Rotation


[Figure: the Sales Volumes cube, with dimensions MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Carr, Gleason, Clyde), shown in six orientations, View #1 through View #6, with 90° rotations between views.]

A 3-dimensional array has 6 views.
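The count of six follows from the number of ways to order three dimensions (3! = 6). The sketch below enumerates those orderings and re-keys a cube for one of them; the `reorient` helper and the dictionary-based cube are hypothetical, and the single cell value is taken from the relational listing shown earlier.

```python
from itertools import permutations

dimensions = ("MODEL", "COLOR", "DEALERSHIP")

# Each ordering of the three dimensions corresponds to one view of the cube.
views = list(permutations(dimensions))


def reorient(cube, order, dims=dimensions):
    """Re-key a {(model, color, dealership): value} cube for a new axis order."""
    idx = [dims.index(d) for d in order]
    return {tuple(key[i] for i in idx): value for key, value in cube.items()}


cube = {("Mini Van", "Blue", "Clyde"): 6}
rotated = reorient(cube, ("DEALERSHIP", "MODEL", "COLOR"))
```

The same reasoning gives 2 views for two dimensions and 24 for four.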



Features of OLAP - Slicing / Filtering

- MDDB allows the end user to quickly slice in on the exact view of the data required.

[Figure: the Sales Volumes cube sliced down to MODEL (Mini Van, Coupe), DEALERSHIP (Carr, Clyde) and COLOR (Normal Blue, Metal Blue).]
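Slicing can be thought of as keeping only the cells whose members fall within a chosen subset on each dimension. In the sketch below the volumes are illustrative only, since the slide gives no numbers for this cube, and `slice_cube` is a hypothetical helper.

```python
DIMS = ("model", "color", "dealer")

# Illustrative volumes only; the slide does not give numbers for this cube.
cube = {
    ("Mini Van", "Normal Blue", "Carr"): 2,
    ("Mini Van", "Metal Blue", "Clyde"): 6,
    ("Coupe", "Normal Blue", "Clyde"): 3,
    ("Sedan", "Red", "Carr"): 4,
}


def slice_cube(cube, **wanted):
    """Keep the cells whose member on each named dimension is in the wanted set."""
    def keep(key):
        return all(key[DIMS.index(dim)] in members for dim, members in wanted.items())
    return {key: value for key, value in cube.items() if keep(key)}


blue_vans_and_coupes = slice_cube(
    cube,
    model={"Mini Van", "Coupe"},
    color={"Normal Blue", "Metal Blue"},
    dealer={"Carr", "Clyde"},
)
```

Dimensions that are not named in the call are left unrestricted, so the same helper covers a slice on one dimension or a dice on several.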

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION
- REGION: Midwest
- DISTRICT: Chicago, St. Louis, Gary
- DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton

Sales can be viewed at the Region, District or Dealership level.

Moving up and moving down in a hierarchy is referred to as drill-up (roll-up) and drill-down, respectively.
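Rolling up along the organization hierarchy means summing a measure from each level into its parent level. In the sketch below, the assignment of dealerships to districts is a guess made for illustration (the extracted slide text lists the members of each level but not which dealership belongs to which district), and the sales figures are invented.

```python
from collections import defaultdict

# Assumed hierarchy: the dealership-to-district assignment is a guess.
district_of = {
    "Clyde": "Chicago", "Gleason": "Chicago",
    "Carr": "St. Louis", "Levi": "St. Louis",
    "Lucas": "Gary", "Bolton": "Gary",
}
region_of = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}

# Illustrative sales figures at the lowest (dealership) level.
sales_by_dealership = {
    "Clyde": 10, "Gleason": 7, "Carr": 5, "Levi": 8, "Lucas": 4, "Bolton": 6,
}


def roll_up(values, parent_of):
    """Aggregate a measure from one hierarchy level to its parent level."""
    totals = defaultdict(int)
    for member, value in values.items():
        totals[parent_of[member]] += value
    return dict(totals)


sales_by_district = roll_up(sales_by_dealership, district_of)
sales_by_region = roll_up(sales_by_district, region_of)
```

Drill-down is the reverse navigation: starting from the Midwest total and showing the district or dealership figures that make it up.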


OLAP Reporting - Drill Down

Inflows (Region, Year)

[Chart: Inflows ($M) by Region (East, West, Central) for Year 1999 and Year 2000; vertical axis from 0 to 200.]


OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999)

[Chart: Inflows ($M) by Region (East, West, Central) for each quarter of Year 1999 (1st Qtr to 4th Qtr); vertical axis from 0 to 90.]

Drill-down from Year to Quarter



OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999 - 1st Qtr)

[Chart: Inflows ($M) by Region (East, West, Central) for January, February and March of Year 1999; vertical axis from 0 to 20.]

Drill-down from Quarter to Month


Implementation Techniques - OLAP Architectures

MOLAP - Multidimensional OLAP
A multidimensional database provides both the database layer and the application logic layer.

ROLAP - Relational OLAP
OLAP analysis accesses data stored in a relational data warehouse. The database and the application logic are provided as separate layers.

HOLAP - Hybrid OLAP
The OLAP server routes queries first to the MDDB, then to the RDBMS, and processes the result on the fly in the server.

DOLAP - Desktop OLAP
A personal MDDB server and application run on the desktop.


MOLAP - MDDB storage

[Figure: MOLAP architecture. An OLAP cube (MDDB) feeds the OLAP calculation engine, which serves web browsers, OLAP tools and OLAP applications.]



MOLAP - Features

o Powerful analytical capabilities (e.g., financial, forecasting, statistical)
o Aggregation and calculation capabilities
o Read/write analytic applications
o Specialized data structures for:
  o Maximum query performance
  o Optimum space utilization

ROLAP - Standard SQL storage

[Figure: ROLAP architecture. A relational DW is accessed through SQL by the OLAP calculation engine, which applies an MDDB-to-relational mapping and serves web browsers, OLAP tools and OLAP applications.]

ROLAP - Features
o Three-tier hardware/software architecture:
  o GUI on the client; multidimensional processing on a mid-tier server; target database on a database server
  o Processing split between the mid-tier and database servers
o Ad hoc query capabilities against very large databases
o DW integration
o Data scalability
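The ROLAP idea of answering a multidimensional request with standard SQL can be sketched with Python's built-in `sqlite3` module. The `sales` table, its sample rows and the `volume_by` helper are assumptions made for illustration; a real ROLAP engine generates far more elaborate SQL against the warehouse schema.

```python
import sqlite3

# In-memory relational store standing in for the relational DW.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (model TEXT, color TEXT, dealer TEXT, volume INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("MINI VAN", "BLUE", "Clyde", 6),
        ("MINI VAN", "BLUE", "Gleason", 3),
        ("MINI VAN", "RED", "Clyde", 5),
        ("SEDAN", "BLUE", "Clyde", 4),
        ("SEDAN", "BLUE", "Carr", 2),
    ],
)


def volume_by(conn, dimension):
    """Translate 'total volume along one dimension' into a GROUP BY query."""
    # Column names cannot be bound as parameters, so validate against a whitelist.
    if dimension not in {"model", "color", "dealer"}:
        raise ValueError(f"unknown dimension: {dimension}")
    sql = (
        f"SELECT {dimension}, SUM(volume) FROM sales "
        f"GROUP BY {dimension} ORDER BY {dimension}"
    )
    return conn.execute(sql).fetchall()


by_model = volume_by(conn, "model")
by_color = volume_by(conn, "color")
```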


HOLAP - Combination of RDBMS and MDDB


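[Figure: HOLAP architecture. The OLAP calculation engine reads summaries from an OLAP cube and detail data from a relational DW (through SQL), and serves any client: web browsers, OLAP tools and OLAP applications.]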

HOLAP - Features

o RDBMS used for detailed data stored in large databases
o MDDB used for fast, read/write OLAP analysis and calculations
o Scalability of the RDBMS combined with MDDB performance
o Calculation engine provides full analysis features
o Source of the data is transparent to the end user
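The HOLAP routing described above (try the MDDB summaries first, fall back to relational detail) might look like the sketch below. The two in-memory structures merely stand in for an MDDB and an RDBMS, and `query_volume` is a hypothetical name; the sample rows are illustrative.

```python
# Stand-in for the MDDB: pre-aggregated summaries for the model dimension only.
mddb_summaries = {("model", "MINI VAN"): 14, ("model", "SEDAN"): 6}

# Stand-in for the RDBMS: detail rows (model, color, dealer, volume).
rdbms_rows = [
    ("MINI VAN", "BLUE", "Clyde", 6),
    ("MINI VAN", "BLUE", "Gleason", 3),
    ("MINI VAN", "RED", "Clyde", 5),
    ("SEDAN", "BLUE", "Clyde", 4),
    ("SEDAN", "BLUE", "Carr", 2),
]
COLUMN_INDEX = {"model": 0, "color": 1, "dealer": 2}


def query_volume(dimension, member):
    """Return (total volume, source); the caller need not know the source."""
    key = (dimension, member)
    if key in mddb_summaries:
        # Fast path: the summary is already stored in the cube.
        return mddb_summaries[key], "MDDB"
    # Slow path: aggregate the detail rows on the fly.
    index = COLUMN_INDEX[dimension]
    total = sum(row[3] for row in rdbms_rows if row[index] == member)
    return total, "RDBMS"


sedan_total = query_volume("model", "SEDAN")
clyde_total = query_volume("dealer", "Clyde")
```

Returning the source alongside the total is only for demonstration; in the architecture on the slide the source of the data stays transparent to the end user.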


Architecture Comparison

Definition
  MOLAP: MDDB OLAP = transaction-level data + summary in MDDB
  ROLAP: Relational OLAP = transaction-level data + summary in RDBMS
  HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to sparsity
  MOLAP: With good design, 3 to 10 times
  ROLAP: No sparsity
  HOLAP: Sparsity exists only in the MDDB part

Data explosion due to summarization
  MOLAP: High (may go beyond control; estimation is very important)
  ROLAP: To the necessary extent
  HOLAP: To the necessary extent

Query execution speed
  MOLAP: Fast (depends upon the size of the MDDB)
  ROLAP: Slow
  HOLAP: Optimum. If the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP.

Cost
  MOLAP: Medium: MDDB server + large disk space cost
  ROLAP: Low: only RDBMS + disk space cost
  HOLAP: High: RDBMS + disk space + MDDB server cost

Where to apply?
  MOLAP: Small transactional data + complex model + frequent summary analysis
  ROLAP: Very large transactional data that needs to be viewed / sorted
  HOLAP: Large transactional data + frequent summary analysis


Representative OLAP Tools:

o Oracle Express Products
o Hyperion Essbase
o Cognos - PowerPlay
o Seagate - Holos
o SAS
o MicroStrategy - DSS Agent
o Informix MetaCube
o Brio Query
o Business Objects / Web Intelligence


Sample OLAP Applications

o Sales Analysis
o Financial Analysis
o Profitability Analysis
o Performance Analysis
o Risk Management
o Profiling & Segmentation
o Scorecard Application
o NPA Management
o Strategic Planning
o Customer Relationship Management (CRM)

Data Warehouse Testing


Data Warehouse Testing Overview

o The cost of fixing a software defect rises exponentially the later in the development lifecycle it is found. In data warehousing this is compounded by the additional business cost of using incorrect data to make critical business decisions.
o The methodology required for testing a data warehouse differs from that for testing a typical transaction system.


Difference In Testing Data warehouse and Transaction System


Data warehouse testing is different on the following counts:
o User-triggered vs. system-triggered
o Volume of test data
o Possible scenarios / test cases
o Programming for testing challenge


Difference In Testing Data warehouse and Transaction System

User-triggered vs. system-triggered
In a data warehouse, most of the testing is system triggered. Most production/source system testing covers the processing of individual transactions, which are driven by some input from the users (application form, servicing request). Very few test cycles cover system-triggered scenarios (such as billing or valuation).


Difference In Testing Data warehouse and Transaction System

Volume of test data
The test data in a transaction system is a very small sample of the overall production data. A data warehouse typically has a large volume of test data, because one tries to cover the maximum possible combinations of dimensions and facts.

Possible scenarios / test cases
In a data warehouse, the permutations and combinations one can possibly test are virtually unlimited, because the core objective of a data warehouse is to allow all possible views of the data.


Difference In Testing Data warehouse and Transaction System


Programming for testing challenge
In transaction systems, users and business analysts typically test the output of the system. In a data warehouse, most of the data quality testing and ETL testing is done at the back end by running separate stand-alone scripts. These scripts compare the data before transformation with the data after transformation.
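A stand-alone comparison script of the kind mentioned above might look like this minimal sketch. The row layout, the `reconcile` function and the cents-to-currency transformation rule are all assumptions made for illustration.

```python
def reconcile(source_rows, target_rows, key, measure, transform):
    """Compare pre-transformation source data with post-transformation target data.

    The expected target value is computed by applying the transformation rule
    to each source row; differences are reported by key.
    """
    expected = {row[key]: transform(row[measure]) for row in source_rows}
    actual = {row[key]: row[measure] for row in target_rows}
    shared = expected.keys() & actual.keys()
    return {
        "missing_in_target": sorted(expected.keys() - actual.keys()),
        "unexpected_in_target": sorted(actual.keys() - expected.keys()),
        "value_mismatches": sorted(k for k in shared if expected[k] != actual[k]),
    }


# Hypothetical source (amounts in cents) and warehouse target (currency units).
source = [
    {"id": 1, "amount": 1200},
    {"id": 2, "amount": 500},
    {"id": 3, "amount": 75},
]
target = [
    {"id": 1, "amount": 12.0},
    {"id": 2, "amount": 50.0},  # wrong: should be 5.0
]

report = reconcile(source, target, "id", "amount", lambda cents: cents / 100)
```

In practice such a script would read both sides with SQL and would usually also compare row counts and control totals before going row by row.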


Data Warehouse Testing Process


Data warehouse testing is basically divided into two parts:
o 'Back-end' testing, where the source systems' data is compared to the end-result data in the loaded area.
o 'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by end-user tools such as OLAP.

Testing phases consist of:
o Requirements testing
o Unit testing
o Integration testing
o Performance testing
o Acceptance testing


Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
o Are the requirements complete?
o Are the requirements singular?
o Are the requirements unambiguous?
o Are the requirements developable?
o Are the requirements testable?


Unit Testing
Unit testing for data warehouses is white-box testing. It should check the ETL procedures/mappings/jobs and the reports developed.

Unit testing the ETL procedures:
o Check that the ETLs access and pick up the right data from the right source.
o Check that all data transformations are correct according to the business rules and that the data warehouse is correctly populated with the transformed data.
o Test the rejected records that don't fulfil the transformation rules.
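As a rough illustration of unit testing one ETL transformation, including its rejected records, consider the sketch below. The business rule (country codes are upper-cased; rows with a missing or negative amount are rejected) and the names `transform` and `test_transform` are hypothetical.

```python
def transform(rows):
    """Apply a hypothetical business rule and separate loaded from rejected rows."""
    loaded, rejected = [], []
    for row in rows:
        amount = row.get("amount")
        if isinstance(amount, (int, float)) and amount >= 0:
            # Rule satisfied: standardize the country code before loading.
            loaded.append({**row, "country": row["country"].upper()})
        else:
            # Rule violated: keep the original row for the reject log.
            rejected.append(row)
    return loaded, rejected


def test_transform():
    rows = [
        {"id": 1, "country": "in", "amount": 100},
        {"id": 2, "country": "us", "amount": -5},
        {"id": 3, "country": "uk", "amount": None},
    ]
    loaded, rejected = transform(rows)
    # Transformed data is correct according to the rule.
    assert loaded == [{"id": 1, "country": "IN", "amount": 100}]
    # Rows that break the rule are rejected, not silently dropped.
    assert [row["id"] for row in rejected] == [2, 3]
    # No row is lost between input and output.
    assert len(loaded) + len(rejected) == len(rows)


test_transform()
```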


Unit Testing
Unit testing the report data:
o Verify report data against the source: data in a data warehouse is stored at an aggregate level compared to the source systems. The QA team should verify the granular data stored in the data warehouse against the available source data.
o Field-level data verification: the QA team must understand the linkages for the fields displayed in the report, trace them back, and compare them with the source systems.
o Derivation formulae and calculation rules should be verified.


Integration Testing
Integration testing will involve the following:
o Sequence of ETL jobs in a batch.
o Initial loading of records into the data warehouse.
o Incremental loading of records at a later date, to verify the newly inserted or updated data.
o Testing the rejected records that don't fulfil the transformation rules.
o Error log generation.


Performance Testing
Performance testing should check for:
o ETL processes completing within the time window.
o Monitoring and measuring the data quality issues.
o Refresh times for standard/complex reports.


Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity and business functionality and reporting.


Questions


Thank You


An Overview
Understanding What is a Data Warehouse


What is Data Warehouse?


Definitions of Data Warehouse
o A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions. (W. H. Inmon)
o A data warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End-user-oriented data access and reporting tools let users get at the data for decision support. (Babcock)
o A data warehouse is a relational database: a copy of transaction data specifically structured for query and analysis. (Ralph Kimball)
o In simple terms: data warehousing is the collection of data from different systems, which helps in business decisions, analysis and reporting.


Data Warehouse def. by WH Inmon


A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:
o Subject Oriented: data that gives information about a particular subject instead of about a company's ongoing operations.
o Integrated: data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
o Nonvolatile: data is stable in a data warehouse. More data is added, but data is never removed. This enables management to gain a consistent picture of the business.
o Time Variant: in order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.


Data Warehouse Architecture


What makes a Data Warehouse


Components of Warehouse
o Source tables: real-time, volatile data held in relational databases for transaction processing (OLTP). The sources can be any relational databases or flat files.
o ETL tools: used to extract, cleanse, transform (aggregate, join) and load the data from the sources to the target.
o Maintenance and administration tools: used to authorize and monitor access to the data, set up users, and schedule jobs to run during off-peak periods.
o Modeling tools: used to design the data warehouse for high performance with the dimensional data modeling technique, and to map the source and target files.
o Databases: the target databases and data marts that are part of the data warehouse. These are structured for analysis and reporting purposes.
o End-user tools for analysis and reporting: used to get reports and analyze the data from the target tables. Different types of querying, data mining and OLAP tools are used for this purpose.


Data Warehouse Architecture


Basic design: source files are loaded into a warehouse, and users query the data for different purposes.

Design with a staging area: after cleansing and transformation, the data is loaded into the staging area and tested there. It is then loaded directly into the target database/warehouse, which is divided into data marts that different users can access for their reporting and analysis purposes.


Data Modeling
Effective way of using a Data Warehouse


Data Modeling
The E-R data model is commonly used in OLTP; the dimensional data model is commonly used in OLAP.

E-R (Entity-Relationship) Data Model
o Entity: an object that can be observed and classified based on its properties and characteristics, such as an employee, a book or a student.
o Relationship: an association that relates entities to other entities.

Different perspectives of data modeling:
o Conceptual Data Model
o Logical Data Model
o Physical Data Model

Types of dimensional data models most commonly used:
o Star Schema
o Snowflake Schema

Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:
o Dimension: a category of information. For example, the Time dimension.
o Attribute: a unique level within a dimension. For example, Month is an attribute in the Time dimension.
o Hierarchy: the specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.
o Fact table: a table that contains the measures of interest.
o Lookup table: provides detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.
o Surrogate keys: used to avoid data integrity problems. They are helpful for Slowly Changing Dimensions and act as index/primary keys.

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.

Star Schema
Dimension Table: product
prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store
storeId  city
c1       nyc
c2       sfo
c3       la

Fact Table: sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension Table: customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
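The star schema above can be loaded into an in-memory SQLite database to show how a fact table is joined to its dimension tables. This is a sketch only: the column types and the choice of query (total amount by store city and product name) are assumptions, while the rows are taken from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
    CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
    CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT,
                           address TEXT, city TEXT);
    -- Fact table: one row per sale, with foreign keys into each dimension.
    CREATE TABLE sale (orderId TEXT, "date" TEXT, custId INTEGER,
                       prodId TEXT, storeId TEXT, qty INTEGER, amt INTEGER);
    """
)
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [("p1", "bolt", 10), ("p2", "nut", 5)])
conn.executemany("INSERT INTO store VALUES (?, ?)",
                 [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)",
                 [(53, "joe", "10 main", "sfo"),
                  (81, "fred", "12 main", "sfo"),
                  (111, "sally", "80 willow", "la")])
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?, ?, ?)",
                 [("o100", "1/7/97", 53, "p1", "c1", 1, 12),
                  ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
                  ("105", "3/8/97", 111, "p1", "c3", 5, 50)])

# Typical star join: measures from the fact table, labels from the dimensions.
amt_by_city_and_product = conn.execute(
    """
    SELECT store.city, product.name, SUM(sale.amt)
    FROM sale
    JOIN store   ON sale.storeId = store.storeId
    JOIN product ON sale.prodId  = product.prodId
    GROUP BY store.city, product.name
    ORDER BY store.city, product.name
    """
).fetchall()
```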


Snowflake Schema
store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

sType
tId  size   location
t1   small  downtown
t2   large  suburbs

city
cityId  pop  regId
sfo     1M   north
la      5M   south

region
regId  name
north  cold region
south  warm region

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulation. As such, the tables in these schemas are not heavily normalized and are frequently designed at a level of normalization short of third normal form.


Overview of Data Cleansing


The Need For Data Quality

o Difficulty in decision making
o Time delays in operation
o Organizational mistrust
o Data ownership conflicts
o Customer attrition
o Costs associated with:
  o error detection
  o error rework
  o customer service
  o fixing customer problems


Six Steps To Data Quality


1. Understand Information Flow In Organization
o Identify authoritative data sources
o Interview employees & customers

2. Identify Potential Problem Areas & Assess Impact
o Data entry points
o Cost of bad data

3. Measure Quality Of Data
o Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values

4. Clean & Load Data
o Use data cleansing tools to clean data at the source
o Load only clean data into the data warehouse

5. Continuous Monitoring
o Schedule periodic cleansing of source data

6. Identify Areas of Improvement
o Identify & correct cause of defects
o Refine data capture mechanisms at source
o Educate users on importance of DQ

Data Quality Solution


Customized Programs
o Strengths:
  o Address specific needs
  o No bulky one-time investment
o Limitations:
  o Large numbers of custom programs in different environments are difficult to manage
  o Minor alterations demand coding effort

Data Quality Assessment Tools
o Strength:
  o Provide automated assessment
o Limitation:
  o No measure of data accuracy


Data Quality Solution


Business Rule Discovery Tools
o Strengths:
  o Detect correlation in data values
  o Can detect patterns of behavior that indicate fraud
o Limitations:
  o Not all variables can be discovered
  o Some discovered rules might not be pertinent
  o There may be performance problems with large files or with many fields

Data Reengineering & Cleansing Tools
o Strengths:
  o Usually are integrated packages with cleansing features as an add-on
o Limitations:
  o Error prevention at source is usually absent
  o The ETL tools have limited cleansing facilities

Tools In The Market

Business Rule Discovery Tools
o Integrity Data Reengineering Tool from Vality Technology
o Trillium Software System from Harte-Hanks Data Technologies
o Migration Architect from DB Star

Data Reengineering & Cleansing Tools
o Carlton Pureview from Oracle
o ETI-Extract from Evolutionary Technologies
o PowerMart from Informatica Corp
o Sagent Data Mart from Sagent Technology

Data Quality Assessment Tools
o Migration Architect, Evoke Axio from Evoke Software
o Wizrule from Wizsoft

Name & Address Cleansing Tools
o Centrus Suite from Sagent
o I.d.centric from First Logic


Data Extraction, Transformation, Load


ETL Architecture

[Figure: ETL architecture in five stages: Data Collection, Data Extraction, Data Transformation, Data Loading, Data Storage & Integration. Sources: web server logs and e-commerce transaction data (visitors reaching the site through web browsers and the Internet), external data (demographics, household, webographics, income), flat files, RDBMS and other OLTP systems. Scheduled extraction moves the data into a staging area (clean, transform, match, merge), supported by a metadata repository; scheduled loading moves it into the Enterprise Data Warehouse.]


ETL Architecture
Data Extraction
o Rummages through a file or database
o Uses some criteria for selection
o Identifies qualified data
o Transports the data over onto another file or database

Data Transformation
o Integrating dissimilar data types
o Changing codes
o Adding a time attribute
o Summarizing data
o Calculating derived values
o Renormalizing data

Data Extraction Cleanup
o Restructuring of records or fields
o Removal of operational-only data
o Supply of missing field values
o Data integrity checks
o Data consistency and range checks, etc.

Data Loading
o Initial and incremental loading
o Updating of metadata
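The four activities above can be strung together in a minimal Python sketch. Everything here is illustrative: the row layout, the status codes, the default region and the function names are assumptions, and cleanup is applied before transformation simply because that keeps the example short.

```python
# Hypothetical source rows from an operational system.
source = [
    {"order": "A1", "status": "S", "qty": 2, "region": None},
    {"order": "A2", "status": "C", "qty": 1, "region": "east"},
    {"order": "A3", "status": "S", "qty": -4, "region": "west"},
    {"order": "A4", "status": "S", "qty": 3, "region": "east"},
]
STATUS_NAMES = {"S": "SHIPPED", "C": "CANCELLED"}


def extract(rows, predicate):
    """Extraction: select the qualified rows and copy them out of the source."""
    return [dict(row) for row in rows if predicate(row)]


def clean(rows, defaults):
    """Cleanup: supply missing field values and apply a range check on qty."""
    cleaned = []
    for row in rows:
        filled = {k: (defaults.get(k) if v is None else v) for k, v in row.items()}
        if filled["qty"] is not None and filled["qty"] >= 0:
            cleaned.append(filled)
    return cleaned


def transform(rows, load_date):
    """Transformation: change codes and add a time attribute."""
    return [
        {**row, "status": STATUS_NAMES[row["status"]], "load_date": load_date}
        for row in rows
    ]


def load(target, rows):
    """Loading: append rows to the target (works for initial and incremental loads)."""
    target.extend(rows)
    return len(rows)


warehouse = []
selected = extract(source, lambda row: row["status"] == "S")
loaded_count = load(
    warehouse, transform(clean(selected, {"region": "unknown"}), "2009-01-01")
)
```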


Why ETL?
o Companies have valuable data lying around throughout their networks that needs to be moved from one place to another.
o The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats.
o To solve the problem, companies use extract, transform and load (ETL) software.
o The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file or an Excel spreadsheet.



Major components involved in ETL Processing

o Design manager: lets developers define source-to-target mappings, transformations, process flows and jobs.
o Metadata management: provides a repository to define, document and manage information about the ETL design and runtime processes.
o Extract: the process of reading data from a database.
o Transform: the process of converting the extracted data.
o Load: the process of writing the data into the target database.
o Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components.
o Administration and operation: ETL utilities let administrators schedule, run and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems.

ETL Tools
o Provide a facility to specify a large number of transformation rules with a GUI
o Generate programs to transform data
o Handle multiple data sources
o Handle data redundancy
o Generate metadata as output
o Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second-Generation
o PowerCentre/Mart from Informatica
o Data Mart Solution from Sagent Technology
o DataStage from Ascential

Metadata Management


What Is Metadata?
Metadata is information...
o That describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse
o About the data being captured and loaded into the warehouse
o Documented in IT tools, that improves both business and technical understanding of data and data-related processes


Importance Of Metadata
Locating information
o How much time is spent looking for information?
o How often is information found?
o What poor decisions were made based on incomplete information?
o How much money was lost or earned as a result?

Interpreting information
o How many times have businesses needed to rework or recall products?
o What impact does it have on the bottom line?
o How many mistakes were due to misinterpretation of existing documentation?
o How much misinterpretation results from too much metadata?
o How much time is spent trying to determine if any of the metadata is accurate?

Integrating information
o How do various data perspectives connect together?
o How much time is spent trying to figure that out?
o How much do the inefficiency and lack of metadata affect decision making?


Requirements for DW Metadata Management

o Provide a simple catalogue of business metadata descriptions and views
o Document/manage metadata descriptions from an integrated development environment
o Enable DW users to identify and invoke pre-built queries against the data stores
o Design and enhance new data models and schemas for the data warehouse
o Capture data transformation rules between the operational and data warehousing databases
o Provide change impact analysis, and update across these technologies

Consumers of Metadata
o Technical users
  o Warehouse administrator
  o Application developer
o Business users (business metadata)
  o Meanings
  o Definitions
  o Business rules
o Software tools
  o Used in DW life-cycle development
  o Metadata requirements for each tool must be identified
  o The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
  o Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool


Trends in the Metadata Management Tools


Third-Party Bridging Tools
o Oracle Exchange: technology of choice for a long list of repository, enterprise and workgroup vendors.
o Reischmann-Informatik Toolbus: features include facilitation of selective bridging of metadata.
o Ardent Software / Dovetail Software Interplay: hub-and-spoke solution for enabling metadata interoperability. Ardent is focusing on its own engagements and is not selling it as an independent product.
o Informix's Metadata Plug-ins: available with Ardent DataStage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase PowerDesigner, Brio and MicroStrategy.

Trends in the Metadata Management Tools


Metadata Repositories
o IBM, Oracle and Microsoft to offer free or near-free basic repository services
o Enable organisations to reuse metadata across technologies
o Integrate DB design, data transformation and BI tools from different vendors
o Multi-tool vendors are taking a bridged or federated rather than integrated approach to sharing metadata
o Both IBM and Oracle have multiple repositories for different lines of products, e.g., one for AD and one for DW, with bridges between them


Trends in the Metadata Management Tools


Metadata Interchange Standards
o CDIF (CASE Data Interchange Format)
  o Most frequently used interchange standard
  o Addresses only a limited subset of metadata artifacts
o OMG (Object Management Group) - CWM
  o XML: addresses context and data meaning, not presentation
  o Can enable exchange over the web, employing industry standards for storing and sharing programming data
  o Will allow sharing of UML and MOF objects between various development tools and repositories
o MDC (Metadata Coalition)
  o Based on XML/UML standards
  o Promoted by Microsoft along with 20 partners, including Object Management Group (OMG), Oracle Carleton Group, CA-PLATINUM Technology (founding member) and Viasoft

OLAP


Agenda
o OLAP Definition
o Distinction between OLTP and OLAP
o MDDB Concepts
o Implementation Techniques
o Architectures
o Features
o Representative Tools


OLAP: On-Line Analytical Processing

- OLAP is a technology that lets users view aggregate data across measurements (such as Maturity Amount, Interest Rate) along a set of related parameters called dimensions (such as Product, Organization, Customer)
- The term is often used interchangeably with BI
- A multidimensional view of data is the foundation of OLAP
- Users: analysts, decision makers

Distinction between OLTP and OLAP

Source of data
  OLTP System: Operational data; OLTPs are the original source of the data
  OLAP System: Consolidation data; OLAP data comes from the various OLTP databases

Purpose of data
  OLTP System: To control and run fundamental business tasks
  OLAP System: Decision support

What the data reveals
  OLTP System: A snapshot of ongoing business processes
  OLAP System: Multi-dimensional views of various kinds of business activities

Inserts and Updates
  OLTP System: Short and fast inserts and updates initiated by end users
  OLAP System: Periodic long-running batch jobs refresh the data

MDDB Concepts
A multidimensional database is a computer software system designed to allow efficient and convenient storage and retrieval of data that is
- intimately related, and
- stored, viewed and analyzed from different perspectives (dimensions).
A hypercube represents a collection of multidimensional data.
- The edges of the cube are called dimensions
- Individual items within each dimension are called members

RDBMS v/s MDDB: Increased Complexity...

Relational DBMS: 27 rows x 4 columns = 108 cells

MODEL         COLOR  DEALER   VOL.
MINI VAN      BLUE   Clyde    6
MINI VAN      BLUE   Gleason  3
MINI VAN      BLUE   Carr     2
MINI VAN      RED    Clyde    5
MINI VAN      RED    Gleason  3
MINI VAN      RED    Carr     1
MINI VAN      WHITE  Clyde    3
MINI VAN      WHITE  Gleason  1
MINI VAN      WHITE  Carr     4
SPORTS COUPE  BLUE   Clyde    3
SPORTS COUPE  BLUE   Gleason  3
SPORTS COUPE  BLUE   Carr     3
SPORTS COUPE  RED    Clyde    4
SPORTS COUPE  RED    Gleason  3
SPORTS COUPE  RED    Carr     6
SPORTS COUPE  WHITE  Clyde    2
SPORTS COUPE  WHITE  Gleason  3
SPORTS COUPE  WHITE  Carr     5
SEDAN         BLUE   Clyde    4
SEDAN         BLUE   Gleason  3
SEDAN         BLUE   Carr     2
...

MDDB: 3 x 3 x 3 = 27 cells

[Figure: Sales Volumes cube with dimensions MODEL (Mini Van, Coupe, Sedan), DEALERSHIP (Clyde, Gleason, Carr) and COLOR (Blue, Red, White)]
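
The contrast between the two layouts can be sketched in Python. This is only an illustration, not how an MDDB engine stores data: the nine Mini Van rows of the relational table are indexed by their dimension members, so a cell is addressed by its coordinates instead of by scanning rows. The names `rows`, `build_cube`, `relational_cells` and `cube_cells` are hypothetical.

```python
from collections import defaultdict

# Mini Van rows from the relational table: (model, color, dealer, volume)
rows = [
    ("MINI VAN", "BLUE", "Clyde", 6),
    ("MINI VAN", "BLUE", "Gleason", 3),
    ("MINI VAN", "BLUE", "Carr", 2),
    ("MINI VAN", "RED", "Clyde", 5),
    ("MINI VAN", "RED", "Gleason", 3),
    ("MINI VAN", "RED", "Carr", 1),
    ("MINI VAN", "WHITE", "Clyde", 3),
    ("MINI VAN", "WHITE", "Gleason", 1),
    ("MINI VAN", "WHITE", "Carr", 4),
]


def build_cube(rows):
    """Index volumes by (model, color, dealer) coordinates."""
    cube = defaultdict(lambda: defaultdict(dict))
    for model, color, dealer, volume in rows:
        cube[model][color][dealer] = volume
    return cube


cube = build_cube(rows)

# Relational storage repeats every dimension value on every row.
relational_cells = len(rows) * 4
# The cube stores one value per coordinate.
cube_cells = sum(
    len(dealers) for colors in cube.values() for dealers in colors.values()
)

print(cube["MINI VAN"]["RED"]["Clyde"])
print(relational_cells, cube_cells)
```

The same 4:1 ratio gives the 108 versus 27 cells quoted for the full table.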

Benefits of MDDB over RDBMS

- Ease of Data Presentation & Navigation
  A great deal of information is gleaned immediately upon direct inspection of the array. The user views data along presorted dimensions, arranged in an inherently more organized and accessible fashion than a relational table offers.
- Storage Space
  Very low space consumption compared to a relational DB.
- Performance
  Gives much better performance. A relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad hoc queries.
- Ease of Maintenance
  No overhead, as data is stored in the same way it is viewed. A relational DB uses indexes, sophisticated joins, etc., which require considerable storage and maintenance.

Issues with MDDB

Sparsity
- Input data in applications is typically sparse
- Increases with increased dimensions

Data Explosion
- Due to sparsity
- Due to summarization

Performance
- Doesn't perform better than RDBMS at high data volumes (>20-30 GB)
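
A rough way to see why cell counts explode is to count them. The sketch below assumes a deliberately simple model that is not taken from the slides: the base cube has one cell per combination of members, and summarization adds exactly one "All" member to each dimension. Real hierarchies have several levels, so actual growth is usually larger. The function names and dimension sizes are hypothetical.

```python
from math import prod


def base_cells(members_per_dimension):
    """Cells in the base cube: one per combination of members."""
    return prod(members_per_dimension)


def cells_with_totals(members_per_dimension):
    """Cells once every dimension also carries a single 'All' member."""
    return prod(n + 1 for n in members_per_dimension)


# Hypothetical sizes: adding a fourth dimension multiplies the cell count.
three_dims = [3, 3, 3]
four_dims = [3, 3, 3, 12]

print(base_cells(three_dims), cells_with_totals(three_dims))
print(base_cells(four_dims), cells_with_totals(four_dims))
```

If the input data does not grow with the new dimension, most of the added cells stay empty, which is the sparsity issue listed above.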

Issues with MDDB - Sparsity Example

If members of different dimensions do not interact, a blank cell is left behind.

LAST NAME  EMP#  AGE
SMITH      01    21
REGAN      12    19
FOX        31    63
WELD       14    31
KELLY      54    27
LINK       03    56
KRANZ      41    45
LUCAS      33    41
WEISS      23    19

[Figure: the Employee Age array with LAST NAME and EMPLOYEE # as dimensions, where each employee fills a single cell, shown beside the dense Sales Volumes array (MODEL by COLOR)]
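
The sparsity in the employee example can be counted directly. The sketch below lays the employee table out as a dense two-dimensional array, one row per last name and one column per employee number, and counts the filled cells. The variable names are hypothetical, and a nested list stands in for whatever structure an MDDB would really use.

```python
# Employee table from the slide: (last name, employee number, age)
employees = [
    ("SMITH", "01", 21), ("REGAN", "12", 19), ("FOX", "31", 63),
    ("WELD", "14", 31), ("KELLY", "54", 27), ("LINK", "03", 56),
    ("KRANZ", "41", 45), ("LUCAS", "33", 41), ("WEISS", "23", 19),
]

names = sorted({name for name, _, _ in employees})
numbers = sorted({number for _, number, _ in employees})

# Dense 2-D array: every (name, number) pair gets a cell, filled or not.
grid = [[None] * len(numbers) for _ in names]
for name, number, age in employees:
    grid[names.index(name)][numbers.index(number)] = age

total_cells = len(names) * len(numbers)
filled_cells = sum(cell is not None for row in grid for cell in row)
print(filled_cells, total_cells)
```

Because each last name pairs with exactly one employee number, the two dimensions do not interact and most of the array is empty.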

OLAP Features
- Calculations applied across dimensions, through hierarchies and/or across members
- Trend analysis over sequential time periods
- What-if scenarios
- Slicing / dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through / drill-through to underlying detail data

Features of OLAP - Rotation

Complex queries and sorts in a relational environment translate to a simple rotation.

View #1: Sales Volumes, MODEL by COLOR

           Blue  Red  White
Mini Van   6     5    4
Coupe      3     5    5
Sedan      4     3    2

( ROTATE 90° )

View #2: Sales Volumes, COLOR by MODEL

        Mini Van  Coupe  Sedan
Blue    6         3      4
Red     5         5      3
White   4         5      2

A 2-dimensional array has 2 views.

Features of OLAP - Rotation

[Figure: the Sales Volumes cube (MODEL x COLOR x DEALERSHIP) shown in six orientations, View #1 to View #6, each obtained by a 90° rotation]

A 3-dimensional array has 6 views.
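
The view counts (2 for a 2-dimensional array, 6 for a 3-dimensional one) match the number of ways to order the axes, n! for n dimensions. The sketch below assumes that reading of "rotation": a view is an ordering of the dimensions, and in two dimensions a rotation amounts to swapping rows and columns. The function names are hypothetical.

```python
from itertools import permutations


def views(dimensions):
    """Every ordering of the axes is one way to lay out the same array."""
    return list(permutations(dimensions))


def rotate_2d(table):
    """Swap rows and columns of a 2-D table given as a list of rows."""
    return [list(column) for column in zip(*table)]


# Sales volumes: rows = MODEL (Mini Van, Coupe, Sedan),
# columns = COLOR (Blue, Red, White)
by_model = [[6, 5, 4], [3, 5, 5], [4, 3, 2]]
by_color = rotate_2d(by_model)

print(len(views(["MODEL", "COLOR"])))
print(len(views(["MODEL", "COLOR", "DEALERSHIP"])))
print(by_color)
```

No data is re-queried or re-sorted in either case; only the order in which the axes are read changes.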

Features of OLAP - Slicing / Filtering

- MDDB allows the end user to quickly slice in on the exact view of the data required.

[Figure: the Sales Volumes cube sliced to MODEL = Mini Van, Coupe; DEALERSHIP = Carr, Clyde; COLOR = Normal Blue, Metal Blue]
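
Slicing can be sketched as keeping only the cells whose members fall in a requested subset of each dimension. The sales figures below are invented for illustration, and `slice_cube` is a hypothetical helper, not the API of any OLAP product.

```python
# Hypothetical sales volumes keyed by (model, dealership, color).
sales = {
    ("Mini Van", "Carr", "Normal Blue"): 4,
    ("Mini Van", "Clyde", "Metal Blue"): 2,
    ("Coupe", "Carr", "Metal Blue"): 3,
    ("Coupe", "Gleason", "Red"): 5,
    ("Sedan", "Clyde", "White"): 1,
}


def slice_cube(cube, models=None, dealers=None, colors=None):
    """Keep only cells whose members are in the requested sets (None = all)."""
    def keep(value, allowed):
        return allowed is None or value in allowed

    return {
        key: volume
        for key, volume in cube.items()
        if keep(key[0], models) and keep(key[1], dealers) and keep(key[2], colors)
    }


blue_slice = slice_cube(
    sales,
    models={"Mini Van", "Coupe"},
    dealers={"Carr", "Clyde"},
    colors={"Normal Blue", "Metal Blue"},
)
print(len(blue_slice), sum(blue_slice.values()))
```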

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION
REGION: Midwest
DISTRICT: Chicago, St. Louis, Gary
DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton

Sales can be viewed at Region / District / Dealership level.

Moving up and moving down in a hierarchy is referred to as drill-up (roll-up) and drill-down, respectively.
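
Roll-up along the organization hierarchy can be sketched as repeated aggregation to the parent level. The slide does not say which dealership belongs to which district, so the `district_of` mapping below is an assumption made only for illustration, as are the sales figures and the `roll_up` helper.

```python
from collections import defaultdict

# Assumed hierarchy: the dealership-to-district assignment is not stated
# in the slide, so this mapping is illustrative only.
district_of = {
    "Clyde": "Chicago", "Gleason": "Chicago",
    "Carr": "St. Louis", "Levi": "St. Louis",
    "Lucas": "Gary", "Bolton": "Gary",
}
region_of = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}

# Hypothetical sales at the lowest (dealership) level.
dealership_sales = {
    "Clyde": 10, "Gleason": 7, "Carr": 4, "Levi": 6, "Lucas": 3, "Bolton": 5,
}


def roll_up(sales, parent_of):
    """Aggregate sales one level up the hierarchy."""
    totals = defaultdict(int)
    for member, amount in sales.items():
        totals[parent_of[member]] += amount
    return dict(totals)


district_sales = roll_up(dealership_sales, district_of)
region_sales = roll_up(district_sales, region_of)
print(district_sales)
print(region_sales)
```

Drill-down is the reverse navigation: starting from a region total and looking up the district or dealership figures that make it up.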

OLAP Reporting - Drill Down

Inflows ( Region , Year)
[Chart: Inflows ($M), scale 0 to 200, by year (1999, 2000) for regions East, West, Central]

Inflows ( Region , Year - Year 1999)
[Chart: Inflows ($M), scale 0 to 90, by quarter of 1999 (1st Qtr to 4th Qtr) for regions East, West, Central]
Drill-down from Year to Quarter

Inflows ( Region , Year - Year 1999 - 1st Qtr)
[Chart: Inflows ($M), scale 0 to 20, by month of the 1st quarter of 1999 (January, February, March) for regions East, West, Central]
Drill-down from Quarter to Month

Implementation Techniques - OLAP Architectures

MOLAP - Multidimensional OLAP
  Multidimensional databases for the database and application logic layers

ROLAP - Relational OLAP
  Accesses data stored in a relational data warehouse for OLAP analysis. Database and application logic are provided as separate layers

HOLAP - Hybrid OLAP
  OLAP server routes queries first to the MDDB, then to the RDBMS, and the result is processed on the fly in the server

DOLAP - Desktop OLAP
  Personal MDDB server and application on the desktop

MOLAP - MDDB storage

[Diagram: an OLAP cube with an OLAP calculation engine serves web browsers, OLAP tools and OLAP applications]

MOLAP - Features
- Powerful analytical capabilities (e.g., financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for
  Maximum query performance
  Optimum space utilization

ROLAP - Standard SQL storage

[Diagram: a relational DW is queried via SQL through an MDDB-relational mapping layer and an OLAP calculation engine, which serves web browsers, OLAP tools and OLAP applications]

ROLAP - Features
- Three-tier hardware/software architecture:
  GUI on client; multidimensional processing on mid-tier server; target database on database server
  Processing split between mid-tier and database servers
- Ad hoc query capabilities to very large databases
- DW integration
- Data scalability

HOLAP - Combination of RDBMS and MDDB

[Diagram: an OLAP cube and a relational DW (accessed via SQL) both feed an OLAP calculation engine, which serves any client: web browsers, OLAP tools and OLAP applications]

HOLAP - Features
- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of RDBMS and MDDB performance
- Calculation engine provides full analysis features
- Source of data transparent to end user
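
The HOLAP routing idea, in which the MDDB is tried first and the RDBMS is the fallback with the caller not needing to know which one answered, can be sketched as follows. A dictionary plays the role of the summary cube and a list of rows plays the role of the relational detail data. All names and figures are hypothetical.

```python
# Hypothetical stores: a small summary cube (MDDB role) and detail rows (RDBMS role).
summary_cube = {("East", 1999): 120, ("West", 1999): 95}
detail_rows = [
    {"region": "Central", "year": 1999, "inflow": 30},
    {"region": "Central", "year": 1999, "inflow": 45},
    {"region": "East", "year": 1999, "inflow": 120},
]


def query_inflow(region, year):
    """Answer from the cube if it holds the cell, otherwise aggregate detail rows."""
    key = (region, year)
    if key in summary_cube:
        return summary_cube[key], "MDDB"
    total = sum(
        row["inflow"]
        for row in detail_rows
        if row["region"] == region and row["year"] == year
    )
    return total, "RDBMS"


print(query_inflow("East", 1999))
print(query_inflow("Central", 1999))
```

The second element of the result is returned only to make the routing visible here; in a HOLAP server the source would stay hidden from the end user.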

Architecture Comparison

Definition
  MOLAP: MDDB OLAP = Transaction level data + summary in MDDB
  ROLAP: Relational OLAP = Transaction level data + summary in RDBMS
  HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to Sparsity
  MOLAP: High (may go beyond control; estimation is very important)
  ROLAP: No sparsity
  HOLAP: Sparsity exists only in MDDB part

Data explosion due to Summarization
  MOLAP: With good design, 3 to 10 times
  ROLAP: To the necessary extent
  HOLAP: To the necessary extent

Query Execution Speed
  MOLAP: Fast (depends upon the size of the MDDB)
  ROLAP: Slow
  HOLAP: Optimum. If the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP

Cost
  MOLAP: Medium: MDDB server + large disk space cost
  ROLAP: Low: only RDBMS + disk space cost
  HOLAP: High: RDBMS + disk space + MDDB server cost

Where to apply?
  MOLAP: Small transactional data + complex model + frequent summary analysis
  ROLAP: Very large transactional data that needs to be viewed / sorted
  HOLAP: Large transactional data + frequent summary analysis

Representative OLAP Tools:
- Oracle Express Products
- Hyperion Essbase
- Cognos - PowerPlay
- Seagate - Holos
- SAS
- MicroStrategy - DSS Agent
- Informix MetaCube
- Brio Query
- Business Objects / Web Intelligence

Sample OLAP Applications
- Sales Analysis
- Financial Analysis
- Profitability Analysis
- Performance Analysis
- Risk Management
- Profiling & Segmentation
- Scorecard Application
- NPA Management
- Strategic Planning
- Customer Relationship Management (CRM)

Data Warehouse Testing


Data Warehouse Testing Overview

- The cost of fixing software defects rises exponentially the later they are found in the development lifecycle. In data warehousing, this is compounded by the additional business cost of using incorrect data to make critical business decisions
- The methodology required for testing a data warehouse is different from that for testing a typical transaction system

Difference In Testing Data warehouse and Transaction System

Data warehouse testing is different on the following counts:
- User-triggered vs. system-triggered
- Volume of test data
- Possible scenarios / test cases
- Programming for testing challenge

Difference In Testing Data warehouse and Transaction System

- User-Triggered vs. System-Triggered
  In a data warehouse, most of the testing is system-triggered. Most production/source system testing covers the processing of individual transactions, which are driven by some input from the users (application form, servicing request). Very few test cycles cover system-triggered scenarios (such as billing or valuation).
- Volume of Test Data
  The test data in a transaction system is a very small sample of the overall production data. A data warehouse typically has large test data, as one tries to fill up the maximum possible combinations of dimensions and facts.
- Possible Scenarios / Test Cases
  In a data warehouse, the permutations and combinations one can possibly test are virtually unlimited, because the core objective of a data warehouse is to allow all possible views of data.
- Programming for Testing Challenge
  In transaction systems, users/business analysts typically test the output of the system. In a data warehouse, most of the data quality testing and ETL testing is done at the back end by running separate stand-alone scripts. These scripts compare the data before and after transformation.
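
A stand-alone comparison script of the kind described here might look like the sketch below. It assumes the source and target rows have already been fetched into lists of dictionaries. The `reconcile` helper, the column names and the sample rows are all hypothetical, and a real script would read from the source system and the warehouse instead.

```python
# Hypothetical source rows and the rows found in the warehouse after loading.
source_rows = [
    {"order_id": "o100", "qty": 1, "amt": 12.0},
    {"order_id": "o102", "qty": 2, "amt": 11.0},
    {"order_id": "o105", "qty": 5, "amt": 50.0},
]
target_rows = [
    {"order_id": "o100", "qty": 1, "amt": 12.0},
    {"order_id": "o102", "qty": 2, "amt": 11.0},
]


def reconcile(source, target, key, measures):
    """Compare row counts, missing keys and measure totals between two row sets."""
    source_keys = {row[key] for row in source}
    target_keys = {row[key] for row in target}
    report = {
        "row_count_diff": len(source) - len(target),
        "missing_in_target": sorted(source_keys - target_keys),
        "unexpected_in_target": sorted(target_keys - source_keys),
    }
    for measure in measures:
        source_total = sum(row[measure] for row in source)
        target_total = sum(row[measure] for row in target)
        report[measure + "_diff"] = source_total - target_total
    return report


report = reconcile(source_rows, target_rows, key="order_id", measures=["qty", "amt"])
print(report)
```

A non-empty `missing_in_target` list or a non-zero difference points to rows lost or altered between extraction and load.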

Data Warehouse Testing Process

Data warehouse testing is basically divided into two parts:
- 'Back-end' testing, where the source system data is compared to the end-result data in the loaded area
- 'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by end-user tools like OLAP
Testing phases consist of:
- Requirements testing
- Unit testing
- Integration testing
- Performance testing
- Acceptance testing

Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
- Are the requirements complete?
- Are the requirements singular?
- Are the requirements unambiguous?
- Are the requirements developable?
- Are the requirements testable?

Unit Testing
Unit testing for data warehouses is white-box testing. It should check the ETL procedures/mappings/jobs and the reports developed.
Unit testing the ETL procedures:
- Whether the ETLs access and pick up the right data from the right source
- Whether all data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data
- Testing the rejected records that don't fulfil transformation rules
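
A unit test for rejected records can be sketched as follows. The business rule (quantity must be a positive integer), the `transform` and `run_etl` functions and the sample records are all hypothetical, chosen only to show the shape of such a test.

```python
def transform(record):
    """Hypothetical business rule: qty must be a positive integer;
    the amount is rounded to two decimals."""
    qty = record.get("qty")
    if not isinstance(qty, int) or qty <= 0:
        raise ValueError("qty must be a positive integer")
    return {"order_id": record["order_id"], "qty": qty, "amt": round(record["amt"], 2)}


def run_etl(records):
    """Split input into loaded rows and rejected rows with a reason."""
    loaded, rejected = [], []
    for record in records:
        try:
            loaded.append(transform(record))
        except ValueError as error:
            rejected.append({"record": record, "reason": str(error)})
    return loaded, rejected


records = [
    {"order_id": "o100", "qty": 1, "amt": 12.004},
    {"order_id": "o101", "qty": 0, "amt": 3.0},
    {"order_id": "o102", "qty": None, "amt": 11.0},
]
loaded, rejected = run_etl(records)

# Every input row must end up either loaded or rejected, never lost.
assert len(loaded) + len(rejected) == len(records)
print(len(loaded), [item["record"]["order_id"] for item in rejected])
```

The check that loaded plus rejected equals the input count is the part worth keeping in any real test: it catches rows that silently disappear.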

Unit Testing
Unit testing the report data:
- Verify report data with source: data in a data warehouse is stored at an aggregate level compared to the source systems. The QA team should verify the granular data stored in the data warehouse against the available source data
- Field-level data verification: the QA team must understand the linkages for the fields displayed in the report, trace them back, and compare them with the source systems
- Derivation formulae / calculation rules should be verified

Integration Testing
Integration testing will involve the following:
- Sequence of ETL jobs in a batch
- Initial loading of records into the data warehouse
- Incremental loading of records at a later date to verify the newly inserted or updated data
- Testing the rejected records that don't fulfil transformation rules
- Error log generation

Performance Testing
Performance testing should check for:
- ETL processes completing within the time window
- Monitoring and measuring the data quality issues
- Refresh times for standard/complex reports

Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT (user acceptance testing), the system should be acceptable to the client for use in terms of ETL process integrity, business functionality and reporting.

Questions


Thank You


Components of Warehouse
- Source Tables: Real-time, volatile data in relational databases used for transaction processing (OLTP). Sources can be any relational databases or flat files.
- ETL Tools: Extract, cleanse, transform (aggregates, joins) and load the data from sources to target.
- Maintenance and Administration Tools: Authorize and monitor access to the data, set up users, and schedule jobs to run during off-peak periods.
- Modeling Tools: Used to design the data warehouse for high performance using dimensional data modeling techniques, and to map the source and target files.
- Databases: Target databases and data marts, which are part of the data warehouse. These are structured for analysis and reporting purposes.
- End-user tools for analysis and reporting: Get reports and analyze the data from target tables. Different types of querying, data mining and OLAP tools are used for this purpose.

Data Warehouse Architecture

This is a basic design: source files are loaded into a warehouse, and users query the data for different purposes.

This design has a staging area, where the data is loaded and tested after cleansing and transformation. It is later loaded directly into the target database/warehouse, which is divided into data marts that different users access for reporting and analysis.

Data Modeling
Effective way of using a Data Warehouse


Data Modeling
The E-R data model is commonly used in OLTP; the dimensional data model is commonly used in OLAP.
E-R (Entity-Relationship) Data Model
- Entity: Object that can be observed and classified based on its properties and characteristics, such as employee, book, student
- Relationship: Relates entities to other entities

- Different perspectives of data modeling:
  o Conceptual Data Model
  o Logical Data Model
  o Physical Data Model

- Types of dimensional data models most commonly used:
  o Star Schema
  o Snowflake Schema

Terms used in Dimensional Data Model

To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:
- Dimension: A category of information. For example, the Time dimension.
- Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.
- Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.
- Fact Table: A table that contains the measures of interest.
- Lookup Table: Provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.
- Surrogate Keys: Used to avoid data integrity problems. They are helpful for Slowly Changing Dimensions and act as index/primary keys.
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.

Star Schema

Dimension Table: product
prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store
storeId  city
c1       nyc
c2       sfo
c3       la

Fact Table: sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension Table: customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
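
The star schema above can be loaded into an in-memory SQLite database to show how a typical query joins the fact table to its dimension tables. This is a sketch using Python's standard `sqlite3` module. The column types and the choice of query are assumptions; the table and column names and the data come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE sale (
    orderId TEXT, date TEXT,
    custId INTEGER REFERENCES customer(custId),
    prodId TEXT REFERENCES product(prodId),
    storeId TEXT REFERENCES store(storeId),
    qty INTEGER, amt INTEGER
);
INSERT INTO product VALUES ('p1', 'bolt', 10), ('p2', 'nut', 5);
INSERT INTO store VALUES ('c1', 'nyc'), ('c2', 'sfo'), ('c3', 'la');
INSERT INTO customer VALUES (53, 'joe', '10 main', 'sfo'),
                            (81, 'fred', '12 main', 'sfo'),
                            (111, 'sally', '80 willow', 'la');
INSERT INTO sale VALUES ('o100', '1/7/97', 53, 'p1', 'c1', 1, 12),
                        ('o102', '2/7/97', 53, 'p2', 'c1', 2, 11),
                        ('105', '3/8/97', 111, 'p1', 'c3', 5, 50);
""")

# Total amount per product name and store city: one join per dimension.
query = """
SELECT p.name, s.city, SUM(f.amt)
FROM sale AS f
JOIN product AS p ON p.prodId = f.prodId
JOIN store AS s ON s.storeId = f.storeId
GROUP BY p.name, s.city
ORDER BY p.name, s.city
"""
totals = conn.execute(query).fetchall()
print(totals)
```

Each dimension is exactly one join away from the fact table, which is what gives the schema its star shape.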

Snowflake Schema

Fact Table: store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

Dimension Table: sType
tId  size   location
t1   small  downtown
t2   large  suburbs

Dimension Table: city
cityId  pop  regId
sfo     1M   north
la      5M   south

Dimension Table: region
regId  name
north  cold region
south  warm region

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schemas are not normalized much, and are frequently designed at a level of normalization short of third normal form.

Overview of Data Cleansing


The Need For Data Quality

- Difficulty in decision making
- Time delays in operation
- Organizational mistrust
- Data ownership conflicts
- Customer attrition
- Costs associated with
  error detection
  error rework
  customer service
  fixing customer problems

Six Steps To Data Quality

1. Understand Information Flow In Organization
   - Identify authoritative data sources
2. Identify Potential Problem Areas & Assess Impact
   - Interview employees & customers
   - Data entry points
   - Cost of bad data
3. Measure Quality Of Data
   - Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values
4. Clean & Load Data
   - Use data cleansing tools to clean data at the source
   - Load only clean data into the data warehouse
5. Continuous Monitoring
   - Schedule periodic cleansing of source data
6. Identify Areas of Improvement
   - Identify & correct cause of defects
   - Refine data capture mechanisms at source
   - Educate users on importance of DQ
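
The "measure" and "clean" steps can be illustrated with a small sketch that counts missing, duplicate and out-of-range values and then keeps only clean rows for loading. The records, the list of valid cities and the helper names `measure_quality` and `clean` are hypothetical. Dedicated data quality tools apply far richer rules than these.

```python
from collections import Counter

# Hypothetical customer records with typical defects.
records = [
    {"cust_id": 53, "name": "joe", "city": "sfo"},
    {"cust_id": 81, "name": "fred", "city": ""},
    {"cust_id": 111, "name": "sally", "city": "la"},
    {"cust_id": 53, "name": "joe", "city": "sfo"},
    {"cust_id": 120, "name": None, "city": "xx"},
]
valid_cities = {"nyc", "sfo", "la"}  # assumed reference list


def measure_quality(rows):
    """Count missing, duplicate and out-of-range values."""
    missing = sum(1 for row in rows for value in row.values() if value in (None, ""))
    id_counts = Counter(row["cust_id"] for row in rows)
    duplicates = sum(count - 1 for count in id_counts.values() if count > 1)
    invalid_city = sum(
        1 for row in rows if row["city"] and row["city"] not in valid_cities
    )
    return {"missing": missing, "duplicates": duplicates, "invalid_city": invalid_city}


def clean(rows):
    """Keep the first row per cust_id, dropping rows that fail any check."""
    seen, kept = set(), []
    for row in rows:
        complete = all(value not in (None, "") for value in row.values())
        if complete and row["city"] in valid_cities and row["cust_id"] not in seen:
            seen.add(row["cust_id"])
            kept.append(row)
    return kept


print(measure_quality(records))
print([row["cust_id"] for row in clean(records)])
```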

Data Quality Solution

Customized Programs
- Strengths
  Address specific needs
  No bulky one-time investment
- Limitations
  Tons of custom programs in different environments are difficult to manage
  Minor alterations demand coding efforts

Data Quality Assessment Tools
- Strength
  Provide automated assessment
- Limitation
  No measure of data accuracy

Business Rule Discovery Tools
- Strengths
  Detect correlation in data values
  Can detect patterns of behavior that indicate fraud
- Limitations
  Not all variables can be discovered
  Some discovered rules might not be pertinent
  There may be performance problems with large files or with many fields

Data Reengineering & Cleansing Tools
- Strengths
  Usually are integrated packages with cleansing features as add-ons
- Limitations
  Error prevention at source is usually absent
  The ETL tools have limited cleansing facilities

Tools In The Market

- Business Rule Discovery Tools
  Integrity Data Reengineering Tool from Vality Technology
  Trillium Software System from Harte-Hanks Data Technologies
  Migration Architect from DB Star
- Data Reengineering & Cleansing Tools
  Carlton Pureview from Oracle
  ETI-Extract from Evolutionary Technologies
  PowerMart from Informatica Corp
  Sagent Data Mart from Sagent Technology
- Data Quality Assessment Tools
  Migration Architect, Evoke Axio from Evoke Software
  Wizrule from Wizsoft
- Name & Address Cleansing Tools
  Centrus Suite from Sagent
  I.d.centric from First Logic

Data Extraction, Transformation, Load


ETL Architecture

[Diagram: data flows through five stages.
 Data Collection: visitors' web browsers via the Internet; external data (demographics, household, webographics, income); other OLTP systems
 Data Extraction: scheduled extraction of web server logs and e-commerce transaction data, flat files and RDBMS data
 Data Transformation: clean, transform, match and merge in the staging area, with a metadata repository
 Data Loading: scheduled loading
 Data Storage & Integration: enterprise data warehouse]

ETL Architecture

Data Extraction
- Rummages through a file or database
- Uses some criteria for selection
- Identifies qualified data
- Transports the data over onto another file or database

Data Extraction Cleanup
- Restructuring of records or fields
- Removal of operational-only data
- Supply of missing field values
- Data integrity checks
- Data consistency and range checks, etc.

Data Transformation
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data

Data Loading
- Initial and incremental loading
- Updating of metadata
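
The extract, transform and load steps listed above can be strung together in a few lines. The sketch below uses only Python's standard library (`csv`, `io`, `sqlite3`). The flat-file layout, the code mapping and the `sale_fact` table are hypothetical, and an ETL tool would generate and schedule equivalent logic rather than have it hand-written.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file extract; the column names and codes are illustrative.
flat_file = io.StringIO(
    "order_id,order_date,gender,qty,unit_price\n"
    "o100,1997-07-01,M,1,12.0\n"
    "o102,1997-07-02,F,2,5.5\n"
    "o103,1997-07-02,,3,\n"
)


def extract(handle):
    """Read rows and keep only those that meet the selection criterion."""
    return [row for row in csv.DictReader(handle) if row["unit_price"]]


def transform(rows, load_date):
    """Change codes, calculate a derived value and add a time attribute."""
    gender_codes = {"M": "Male", "F": "Female"}
    return [
        (
            row["order_id"],
            row["order_date"],
            gender_codes.get(row["gender"], "Unknown"),
            int(row["qty"]) * float(row["unit_price"]),
            load_date,
        )
        for row in rows
    ]


def load(conn, rows):
    """Write transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sale_fact "
        "(order_id TEXT, order_date TEXT, gender TEXT, amount REAL, load_date TEXT)"
    )
    conn.executemany("INSERT INTO sale_fact VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(flat_file), load_date="1997-07-03"))
loaded = conn.execute(
    "SELECT order_id, gender, amount FROM sale_fact ORDER BY order_id"
).fetchall()
print(loaded)
```

The row with no unit price is dropped at extraction. In practice it would be written to a reject file so that it can be reconciled later.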

Why ETL?
- Companies have valuable data lying around throughout their networks that needs to be moved from one place to another.
- The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats.
- To solve the problem, companies use extract, transform and load (ETL) software.
- The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.

Major components involved in ETL Processing

- Design manager: Lets developers define source-to-target mappings, transformations, process flows, and jobs
- Metadata management: Provides a repository to define, document, and manage information about the ETL design and runtime processes
- Extract: The process of reading data from a database
- Transform: The process of converting the extracted data
- Load: The process of writing the data into the target database
- Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components
- Administration and operation: ETL utilities let administrators schedule, run and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems

ETL Tools
- Provide a facility to specify a large number of transformation rules with a GUI
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second Generation
- PowerCenter/PowerMart from Informatica
- Data Mart Solution from Sagent Technology
- DataStage from Ascential

Metadata Management


What Is Metadata?
Metadata is information...
- That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse
- About the data being captured and loaded into the warehouse
- Documented in IT tools, that improves both business and technical understanding of data and data-related processes

Importance Of Metadata
Locating information
- How much time is spent looking for information?
- How often is information found?
- What poor decisions were made based on incomplete information?
- How much money was lost or earned as a result?
Interpreting information
- How many times have businesses needed to rework or recall products?
- What impact does it have on the bottom line?
- How many mistakes were due to misinterpretation of existing documentation?
- How much interpretation results from too much metadata?
- How much time is spent trying to determine whether any of the metadata is accurate?
Integrating information
- How do the various data perspectives connect together?
- How much time is spent trying to figure that out?
- How much do the inefficiency and lack of metadata affect decision making?

Requirements for DW Metadata Management

- Provide a simple catalogue of business metadata descriptions and views
- Document/manage metadata descriptions from an integrated development environment
- Enable DW users to identify and invoke pre-built queries against the data stores
- Design and enhance new data models and schemas for the data warehouse
- Capture data transformation rules between the operational and data warehousing databases
- Provide change impact analysis, and update across these technologies

Consumers of Metadata
- Technical Users
  Warehouse administrator
  Application developer
- Business Users - business metadata
  Meanings
  Definitions
  Business rules
- Software Tools
  Used in DW life-cycle development
  Metadata requirements for each tool must be identified
  The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
  Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool

Trends in the Metadata Management Tools

Third Party Bridging Tools
- Oracle Exchange
  Technology of choice for a long list of repository, enterprise and workgroup vendors
- Reischmann-Informatik - Toolbus
  Features include facilitation of selective bridging of metadata
- Ardent Software / Dovetail Software - Interplay
  Hub-and-spoke solution for enabling metadata interoperability
  Ardent is focusing on its own engagements, not selling it as an independent product
- Informix's Metadata Plug-ins
  Available with Ardent DataStage version 3.6.2 free of cost for ERwin, Oracle Designer, Sybase PowerDesigner, Brio, MicroStrategy



An Overview
Understanding What is a Data Warehouse


What is a Data Warehouse?

Definitions of Data Warehouse
- A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions. (W. H. Inmon)
- A data warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End-user-oriented data access and reporting tools let users get at the data for decision support. (Babcock)
- A data warehouse is a relational database: a copy of transaction data specifically structured for query and analysis. (Ralph Kimball)
- In simple terms: data warehousing is the collection of data from different systems to support business decisions, analysis and reporting.

Data Warehouse def. by WH Inmon

A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:
- Subject Oriented: Data that gives information about a particular subject instead of about a company's ongoing operations.
- Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
- Nonvolatile: Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business.
- Time Variant: In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.

Data Warehouse Architecture


What makes a Data Warehouse


Components of Warehouse
- Source Tables: Real-time, volatile data held in relational databases for transaction processing (OLTP). Sources can be any relational databases or flat files.
- ETL Tools: Extract, cleanse, transform (aggregates, joins) and load the data from sources to target.
- Maintenance and Administration Tools: Authorize and monitor access to the data, set up users, and schedule jobs to run in off-peak periods.
- Modeling Tools: Used to design the data warehouse for high performance with the dimensional data modeling technique, and to map source files to target files.
- Databases: Target databases and data marts, which are part of the data warehouse. These are structured for analysis and reporting.
- End-user tools for analysis and reporting: Produce reports and analyze the data in the target tables. Different types of querying, data mining and OLAP tools are used for this purpose.

Data Warehouse Architecture

Basic design: source files are loaded into a warehouse, and users query the data for different purposes.

Design with a staging area: data is loaded into the staging area and tested there after cleansing and transformation. It is then loaded directly into the target database/warehouse, which is divided into data marts that different users can access for reporting and analysis.

Data Modeling
Effective way of using a Data Warehouse


Data Modeling
The E-R data model is commonly used in OLTP; the dimensional data model is commonly used in OLAP.

E-R (Entity-Relationship) Data Model
- Entity: An object that can be observed and classified based on its properties and characteristics, such as an employee, book or student.
- Relationship: An association relating entities to other entities.

Different perspectives of data modeling:
o Conceptual Data Model
o Logical Data Model
o Physical Data Model

Types of dimensional data models most commonly used:
o Star Schema
o Snowflake Schema

Terms used in Dimensional Data Model

To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:
- Dimension: A category of information. For example, the time dimension.
- Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.
- Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year > Quarter > Month > Day.
- Fact Table: A table that contains the measures of interest.
- Lookup Table: Provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.
- Surrogate Keys: Used to avoid data integrity problems. They are helpful for Slowly Changing Dimensions and act as index/primary keys.

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.

Star Schema

Dimension Table: product
prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store
storeId  city
c1       nyc
c2       sfo
c3       la

Fact Table: sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension Table: customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
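A minimal sketch of the star schema above, built in an in-memory SQLite database so that a fact-to-dimension join can be tried out. The column types, the key constraints and the function name `sales_by_store_city` are assumptions for illustration; the slide gives only the table contents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
-- The fact table holds the measures (qty, amt) plus one key per dimension.
CREATE TABLE sale (
    orderId TEXT PRIMARY KEY,
    date    TEXT,
    custId  INTEGER REFERENCES customer(custId),
    prodId  TEXT REFERENCES product(prodId),
    storeId TEXT REFERENCES store(storeId),
    qty     INTEGER,
    amt     INTEGER
);
INSERT INTO product  VALUES ('p1', 'bolt', 10), ('p2', 'nut', 5);
INSERT INTO store    VALUES ('c1', 'nyc'), ('c2', 'sfo'), ('c3', 'la');
INSERT INTO customer VALUES (53, 'joe', '10 main', 'sfo'),
                            (81, 'fred', '12 main', 'sfo'),
                            (111, 'sally', '80 willow', 'la');
INSERT INTO sale VALUES ('o100', '1/7/97', 53,  'p1', 'c1', 1, 12),
                        ('o102', '2/7/97', 53,  'p2', 'c1', 2, 11),
                        ('105',  '3/8/97', 111, 'p1', 'c3', 5, 50);
""")


def sales_by_store_city(connection):
    """Total sale amount per store city: one join from the fact table to a dimension."""
    return connection.execute("""
        SELECT st.city, SUM(s.amt)
        FROM sale AS s
        JOIN store AS st ON s.storeId = st.storeId
        GROUP BY st.city
        ORDER BY st.city
    """).fetchall()


star_totals = sales_by_store_city(conn)
print(star_totals)
```

Every analytical query follows the same shape: start from the fact table, join to whichever dimensions supply the grouping attributes, and aggregate the measures.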

Snowflake Schema

Dimension Table: store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

Dimension Table: sType
tId  size   location
t1   small  downtown
t2   large  suburbs

Dimension Table: city
cityId  pop  regId
sfo     1M   north
la      5M   south

Dimension Table: region
regId  name
north  cold region
south  warm region

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulation. As such, the tables in these schemas are not heavily normalized and are frequently designed at a level of normalization short of third normal form.

Overview of Data Cleansing


The Need For Data Quality

- Difficulty in decision making
- Time delays in operation
- Organizational mistrust
- Data ownership conflicts
- Customer attrition
- Costs associated with:
  o error detection
  o error rework
  o customer service
  o fixing customer problems

Six Steps To Data Quality

1. Understand Information Flow In Organization
   - Identify authoritative data sources
   - Interview employees & customers
2. Identify Potential Problem Areas & Assess Impact
   - Data entry points
   - Cost of bad data
3. Measure Quality Of Data
   - Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values
4. Clean & Load Data
   - Use data cleansing tools to clean data at the source
   - Load only clean data into the data warehouse
5. Continuous Monitoring
   - Schedule periodic cleansing of source data
6. Identify Areas of Improvement
   - Identify & correct cause of defects
   - Refine data capture mechanisms at source
   - Educate users on importance of DQ
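As a rough illustration of the "Measure Quality Of Data" step, the sketch below counts records with missing required values and records whose key repeats. It is a hand-rolled check, not one of the commercial tools the deck refers to; the function name `profile_records` and the sample records (including the deliberately blank city and the repeated custId) are made up for the example.

```python
def profile_records(records, required_fields, key_field):
    """Count records with missing required fields and records whose key repeats."""
    missing = 0
    duplicates = 0
    seen_keys = set()
    for rec in records:
        # A field counts as missing when it is absent, None, or an empty string.
        if any(rec.get(field) in (None, "") for field in required_fields):
            missing += 1
        key = rec.get(key_field)
        if key in seen_keys:
            duplicates += 1
        else:
            seen_keys.add(key)
    return {"total": len(records), "missing": missing, "duplicates": duplicates}


# Hypothetical sample: one blank city and one repeated custId.
sample = [
    {"custId": 53, "name": "joe", "city": "sfo"},
    {"custId": 81, "name": "fred", "city": ""},
    {"custId": 53, "name": "joe", "city": "sfo"},
]
quality_report = profile_records(sample, ["name", "city"], "custId")
print(quality_report)
```

Counts like these give a baseline to compare against after cleansing and during continuous monitoring.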

Data Quality Solution

Customized Programs
- Strengths:
  o Address specific needs
  o No bulky one-time investment
- Limitations:
  o Tons of custom programs in different environments are difficult to manage
  o Minor alterations demand coding efforts

Data Quality Assessment tools
- Strength:
  o Provide automated assessment
- Limitation:
  o No measure of data accuracy

Data Quality Solution

Business Rule Discovery tools
- Strengths:
  o Detect correlation in data values
  o Can detect patterns of behavior that indicate fraud
- Limitations:
  o Not all variables can be discovered
  o Some discovered rules might not be pertinent
  o There may be performance problems with large files or with many fields

Data Reengineering & Cleansing tools
- Strengths:
  o Usually are integrated packages with cleansing features as add-ons
- Limitations:
  o Error prevention at source is usually absent
  o The ETL tools have limited cleansing facilities

Tools In The Market

- Business Rule Discovery Tools
  o Integrity Data Reengineering Tool from Vality Technology
  o Trillium Software System from Harte-Hanks Data Technologies
  o Migration Architect from DB Star
- Data Reengineering & Cleansing Tools
  o Carlton Pureview from Oracle
  o ETI-Extract from Evolutionary Technologies
  o PowerMart from Informatica Corp
  o Sagent Data Mart from Sagent Technology
- Data Quality Assessment Tools
  o Migration Architect, Evoke Axio from Evoke Software
  o Wizrule from Wizsoft
- Name & Address Cleansing Tools
  o Centrus Suite from Sagent
  o I.d.centric from First Logic

Data Extraction, Transformation, Load


ETL Architecture

[Figure: ETL architecture. Sources: visitors, web browsers, the Internet, web server logs & e-comm transaction data, external data (demographics, household, webographics, income), RDBMS, other OLTP systems. Scheduled extraction feeds a staging area (flat files; clean, transform, match, merge), scheduled loading feeds the enterprise data warehouse, supported by a metadata repository. Stages: Data Collection, Data Extraction, Data Transformation, Data Loading, Data Storage & Integration.]

ETL Architecture

Data Extraction:
- Rummages through a file or database
- Uses some criteria for selection
- Identifies qualified data
- Transports the data over onto another file or database

Data Transformation:
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data

Data Extraction Cleanup:
- Restructuring of records or fields
- Removal of operational-only data
- Supply of missing field values
- Data integrity checks
- Data consistency and range checks, etc.

Data Loading:
- Initial and incremental loading
- Updating of metadata
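The extract, cleanup, transform and load steps listed above can be sketched as three small functions. This is only an illustrative outline under stated assumptions: the field names (`gender`, `qty`, `unit_price`, `status`), the code map and the in-memory list standing in for the warehouse are all hypothetical and do not come from any ETL product.

```python
from datetime import date

# Hypothetical code map used for the "changing codes" transformation.
GENDER_CODES = {"M": "male", "F": "female"}


def extract(rows, predicate):
    """Select the qualifying rows from a source and copy them out."""
    return [dict(row) for row in rows if predicate(row)]


def transform(rows, load_date):
    """Apply code changes, derived values, a time attribute and cleanup."""
    result = []
    for row in rows:
        row = dict(row)
        # Changing codes, with a default to supply a value for unknown codes.
        row["gender"] = GENDER_CODES.get(row["gender"], "unknown")
        # Calculating a derived value.
        row["total"] = row["qty"] * row["unit_price"]
        # Adding a time attribute.
        row["load_date"] = load_date.isoformat()
        # Removal of operational-only data.
        row.pop("status", None)
        result.append(row)
    return result


def load(target, rows):
    """Append the transformed rows to the target and report how many were loaded."""
    target.extend(rows)
    return len(rows)


source = [
    {"id": 1, "gender": "M", "qty": 2, "unit_price": 5, "status": "active"},
    {"id": 2, "gender": "X", "qty": 1, "unit_price": 10, "status": "active"},
    {"id": 3, "gender": "F", "qty": 4, "unit_price": 3, "status": "test"},
]
warehouse = []
selected = extract(source, lambda row: row["status"] == "active")
loaded = load(warehouse, transform(selected, date(2009, 1, 1)))
print(loaded, warehouse)
```

Keeping the three stages as separate functions mirrors the staged architecture: each stage can be tested and scheduled on its own.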

Why ETL?
- Companies have valuable data lying around throughout their networks that needs to be moved from one place to another.
- The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats.
- To solve the problem, companies use extract, transform and load (ETL) software.
- The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.

Major components involved in ETL Processing

- Design manager: Lets developers define source-to-target mappings, transformations, process flows, and jobs.
- Meta data management: Provides a repository to define, document, and manage information about the ETL design and runtime processes.
- Extract: The process of reading data from a database.
- Transform: The process of converting the extracted data.
- Load: The process of writing the data into the target database.
- Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components.
- Administration and operation: ETL utilities let administrators schedule, run and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems.

ETL Tools
- Provide a facility to specify a large number of transformation rules with a GUI
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second-Generation
- PowerCenter/PowerMart from Informatica
- Data Mart Solution from Sagent Technology
- DataStage from Ascential

Metadata Management


What Is Metadata?
Metadata is information...
- That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse
- About the data being captured and loaded into the warehouse
- Documented in IT tools that improve both business and technical understanding of data and data-related processes

Importance Of Metadata

Locating information
- How much time is spent looking for information?
- How often is the information found?
- What poor decisions were made based on incomplete information?
- How much money was lost or earned as a result?

Interpreting information
- How many times have businesses needed to rework or recall products?
- What impact does it have on the bottom line?
- How many mistakes were due to misinterpretation of existing documentation?
- How much misinterpretation results from too much metadata?
- How much time is spent trying to determine if any of the metadata is accurate?

Integrating information
- How do the various data perspectives connect together?
- How much time is spent trying to figure that out?
- How much do the inefficiency and lack of metadata affect decision making?

Requirements for DW Metadata Management

- Provide a simple catalogue of business metadata descriptions and views
- Document/manage metadata descriptions from an integrated development environment
- Enable DW users to identify and invoke pre-built queries against the data stores
- Design and enhance new data models and schemas for the data warehouse
- Capture data transformation rules between the operational and data warehousing databases
- Provide change impact analysis, and update across these technologies

Consumers of Metadata
- Technical Users
  o Warehouse administrator
  o Application developer
- Business Users (business metadata)
  o Meanings
  o Definitions
  o Business rules
- Software Tools
  o Used in DW life-cycle development
  o Metadata requirements for each tool must be identified
  o The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
  o Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool

Trends in the Metadata Management Tools

Third Party Bridging Tools
- Oracle Exchange
  o Technology of choice for a long list of repository, enterprise and workgroup vendors
- Reischmann-Informatik-Toolbus
  o Features include facilitation of selective bridging of metadata
- Ardent Software / Dovetail Software - Interplay
  o Hub-and-spoke solution for enabling metadata interoperability
  o Ardent is focusing on its own engagements and is not selling it as an independent product
- Informix's Metadata Plug-ins
  o Available with Ardent DataStage version 3.6.2 free of cost for ERwin, Oracle Designer, Sybase PowerDesigner, Brio, MicroStrategy

Trends in the Metadata Management Tools

Metadata Repositories
- IBM, Oracle and Microsoft to offer free or near-free basic repository services
- Enable organisations to reuse metadata across technologies
- Integrate DB design, data transformation and BI tools from different vendors
- Multi-tool vendors are taking a bridged or federated, rather than integrated, approach to sharing metadata
- Both IBM and Oracle have multiple repositories for different product lines, e.g. one for AD and one for DW, with bridges between them

Trends in the Metadata Management Tools

Metadata Interchange Standards
- CDIF (CASE Data Interchange Format)
  o Most frequently used interchange standard
  o Addresses only a limited subset of metadata artifacts
- OMG (Object Management Group) - CWM
  o XML addresses context and data meaning, not presentation
  o Can enable exchange over the web, employing industry standards for storing and sharing programming data
  o Will allow sharing of UML and MOF objects between various development tools and repositories
- MDC (Metadata Coalition)
  o Based on XML/UML standards
  o Promoted by Microsoft along with 20 partners including Object Management Group (OMG), Oracle Carleton Group, CA-PLATINUM Technology (founding member), Viasoft

OLAP


Agenda
- OLAP Definition
- Distinction between OLTP and OLAP
- MDDB Concepts
- Implementation Techniques
- Architectures
- Features
- Representative Tools

OLAP: On-Line Analytical Processing

- OLAP can be defined as a technology that allows users to view aggregate data across measurements (like Maturity Amount, Interest Rate, etc.) along with a set of related parameters called dimensions (like Product, Organization, Customer, etc.)
- The term is used interchangeably with BI
- A multidimensional view of data is the foundation of OLAP
- Users: analysts, decision makers

Distinction between OLTP and OLAP

Source of data
  OLTP System: Operational data; OLTPs are the original source of the data
  OLAP System: Consolidation data; OLAP data comes from the various OLTP databases

Purpose of data
  OLTP System: To control and run fundamental business tasks
  OLAP System: Decision support

What the data reveals
  OLTP System: A snapshot of ongoing business processes
  OLAP System: Multi-dimensional views of various kinds of business activities

Inserts and Updates
  OLTP System: Short and fast inserts and updates initiated by end users
  OLAP System: Periodic long-running batch jobs refresh the data

MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is
- intimately related and
- stored, viewed and analyzed from different perspectives (dimensions).

A hypercube represents a collection of multidimensional data.
- The edges of the cube are called dimensions
- Individual items within each dimension are called members

RDBMS v/s MDDB: Increased Complexity...

Relational DBMS (27 x 4 = 108 cells)

MODEL         COLOR  DEALER   VOL.
MINI VAN      BLUE   Clyde    6
MINI VAN      BLUE   Gleason  3
MINI VAN      BLUE   Carr     2
MINI VAN      RED    Clyde    5
MINI VAN      RED    Gleason  3
MINI VAN      RED    Carr     1
MINI VAN      WHITE  Clyde    3
MINI VAN      WHITE  Gleason  1
MINI VAN      WHITE  Carr     4
SPORTS COUPE  BLUE   Clyde    3
SPORTS COUPE  BLUE   Gleason  3
SPORTS COUPE  BLUE   Carr     3
SPORTS COUPE  RED    Clyde    4
SPORTS COUPE  RED    Gleason  3
SPORTS COUPE  RED    Carr     6
SPORTS COUPE  WHITE  Clyde    2
SPORTS COUPE  WHITE  Gleason  3
SPORTS COUPE  WHITE  Carr     5
SEDAN         BLUE   Clyde    4
SEDAN         BLUE   Gleason  3
SEDAN         BLUE   Carr     2
...

MDDB (3 x 3 x 3 = 27 cells)

[Figure: Sales Volumes cube with dimensions MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Carr, Gleason, Clyde)]
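To make the relational-versus-cube comparison concrete, the sketch below pivots a few rows of the relational table into a 3-dimensional array indexed by model, colour and dealer. Nested Python lists stand in for MDDB storage purely for illustration; a real multidimensional engine uses its own structures. Only the first six rows of the slide's table are loaded, and `build_cube` is a hypothetical helper name.

```python
MODELS = ["MINI VAN", "SPORTS COUPE", "SEDAN"]
COLORS = ["BLUE", "RED", "WHITE"]
DEALERS = ["Clyde", "Gleason", "Carr"]

# First six rows of the relational table on the slide: (model, color, dealer, volume).
relational_rows = [
    ("MINI VAN", "BLUE", "Clyde", 6),
    ("MINI VAN", "BLUE", "Gleason", 3),
    ("MINI VAN", "BLUE", "Carr", 2),
    ("MINI VAN", "RED", "Clyde", 5),
    ("MINI VAN", "RED", "Gleason", 3),
    ("MINI VAN", "RED", "Carr", 1),
]


def build_cube(rows):
    """Place each volume at the cell addressed by its three dimension members."""
    cube = [[[None for _ in DEALERS] for _ in COLORS] for _ in MODELS]
    for model, color, dealer, volume in rows:
        cube[MODELS.index(model)][COLORS.index(color)][DEALERS.index(dealer)] = volume
    return cube


cube = build_cube(relational_rows)

# A full relational table repeats the dimension values on every row (27 rows x 4 columns),
# while the cube stores one cell per combination of members.
relational_cells = len(MODELS) * len(COLORS) * len(DEALERS) * 4
cube_cells = len(MODELS) * len(COLORS) * len(DEALERS)
print(relational_cells, cube_cells)
```

In the cube, the dimension members are stored once as axis labels, so a lookup such as "Mini Van, Red, Carr" becomes direct indexing rather than a search over rows.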

Benefits of MDDB over RDBMS

- Ease of Data Presentation & Navigation
  o A great deal of information is gleaned immediately upon direct inspection of the array
  o The user is able to view data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than the one offered by the relational table
- Storage Space
  o Very low space consumption compared to a relational DB
- Performance
  o Gives much better performance. A relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad-hoc queries
- Ease of Maintenance
  o No overhead, as data is stored in the same way it is viewed. In a relational DB, indexes, sophisticated joins, etc. are used, which require considerable storage and maintenance

Issues with MDDB

Sparsity
- Input data in applications are typically sparse
- Increases with increased dimensions

Data Explosion
- Due to sparsity
- Due to summarization

Performance
- Does not perform better than RDBMS at high data volumes (>20-30 GB)

Issues with MDDB - Sparsity Example

If dimension members of different dimensions do not interact, then a blank cell is left behind.

LAST NAME  EMP#  AGE
SMITH      01    21
REGAN      12    19
FOX        31    63
WELD       14    31
KELLY      54    27
LINK       03    56
KRANZ      41    45
LUCAS      33    41
WEISS      23    19

[Figure: Employee Age array with dimensions LAST NAME and EMPLOYEE #; each last name pairs with exactly one employee number, so only one cell per row holds an age. Shown alongside the fully populated Sales Volumes array (MODEL x COLOR) for contrast.]
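The sparsity in the employee example can be quantified with a short calculation: nine last names by nine employee numbers gives 81 cells, of which only nine hold an age. The helper name `sparsity` is an assumption made for this sketch.

```python
# (last name, employee number, age) as listed in the relational table on the slide.
employees = [
    ("SMITH", "01", 21), ("REGAN", "12", 19), ("FOX", "31", 63),
    ("WELD", "14", 31), ("KELLY", "54", 27), ("LINK", "03", 56),
    ("KRANZ", "41", 45), ("LUCAS", "33", 41), ("WEISS", "23", 19),
]


def sparsity(filled_cells, dimension_sizes):
    """Fraction of cells left empty in an array with the given dimension sizes."""
    total_cells = 1
    for size in dimension_sizes:
        total_cells *= size
    return 1 - filled_cells / total_cells


last_names = {name for name, _, _ in employees}
emp_numbers = {number for _, number, _ in employees}
empty_fraction = sparsity(len(employees), [len(last_names), len(emp_numbers)])
print(f"{empty_fraction:.1%} of the cells are empty")
```

Adding a further dimension multiplies the total cell count while the number of filled cells stays the same, which is why sparsity grows with dimensionality.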

OLAP Features
- Calculations applied across dimensions, through hierarchies and/or across members
- Trend analysis over sequential time periods
- What-if scenarios
- Slicing / dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through / drill-through to underlying detail data

Features of OLAP - Rotation

Complex queries & sorts in a relational environment are translated to a simple rotation.

View #1: Sales Volumes, MODEL by COLOR

           Blue  Red  White
Mini Van   6     5    4
Coupe      3     5    5
Sedan      4     3    2

( ROTATE 90 degrees )

View #2: Sales Volumes, COLOR by MODEL

           Mini Van  Coupe  Sedan
Blue       6         3      4
Red        5         5      3
White      4         5      2

A 2-dimensional array has 2 views.

Features of OLAP - Rotation

[Figure: the Sales Volumes cube with dimensions MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Carr, Gleason, Clyde), shown in six orientations, View #1 to View #6, each reached from the previous one by a 90-degree rotation]

A 3-dimensional array has 6 views.
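A brief sketch of rotation, using the MODEL by COLOR values from the 2-dimensional example. Rotating the 2-dimensional view is a transpose, and the view counts quoted on the slides (2 and 6) match the number of ways the dimensions can be ordered on the axes, which is how this sketch interprets them. The function names are assumptions for illustration.

```python
from itertools import permutations

# View #1 of the 2-dimensional example: rows are models, columns are colours.
view1 = [
    [6, 5, 4],  # Mini Van: Blue, Red, White
    [3, 5, 5],  # Coupe
    [4, 3, 2],  # Sedan
]


def rotate(cells):
    """Swap the row and column dimensions of a 2-dimensional view."""
    return [list(column) for column in zip(*cells)]


def count_views(dimensions):
    """Number of distinct orderings of the dimensions across the display axes."""
    return len(list(permutations(dimensions)))


view2 = rotate(view1)  # rows are now colours, columns are models
views_2d = count_views(["MODEL", "COLOR"])
views_3d = count_views(["MODEL", "COLOR", "DEALERSHIP"])
print(view2, views_2d, views_3d)
```

No data is re-queried or re-sorted during a rotation; only the assignment of dimensions to axes changes.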

Features of OLAP - Slicing / Filtering

- MDDB allows the end user to quickly slice in on the exact view of the data required.

[Figure: the Sales Volumes cube (MODEL x COLOR x DEALERSHIP) sliced down to models Mini Van and Coupe, colours Normal Blue and Metal Blue, and dealerships Carr and Clyde]

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION
- REGION: Midwest
- DISTRICT: Chicago, St. Louis, Gary
- DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton

Sales at Region / District / Dealership level

Moving up and moving down in a hierarchy is referred to as drill-up / roll-up and drill-down.
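Roll-up and drill-down along the organization hierarchy can be sketched with two lookup maps. The slide names the regions, districts and dealerships but does not show which dealership belongs to which district, so the parent assignments and the sales figures below are assumptions invented for this example.

```python
# Assumed hierarchy: the slide does not state which dealership sits in which district.
DISTRICT_OF = {
    "Clyde": "Chicago", "Gleason": "Chicago",
    "Carr": "St. Louis", "Levi": "St. Louis",
    "Lucas": "Gary", "Bolton": "Gary",
}
REGION_OF = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}

# Made-up sales volumes at the lowest (dealership) level.
dealership_sales = {"Clyde": 10, "Gleason": 20, "Carr": 5, "Levi": 15, "Lucas": 7, "Bolton": 3}


def roll_up(values, parent_of):
    """Aggregate values one level up the hierarchy."""
    totals = {}
    for member, value in values.items():
        parent = parent_of[member]
        totals[parent] = totals.get(parent, 0) + value
    return totals


def drill_down(parent, values, parent_of):
    """Return the detail values that make up one parent member."""
    return {member: value for member, value in values.items() if parent_of[member] == parent}


district_sales = roll_up(dealership_sales, DISTRICT_OF)
region_sales = roll_up(district_sales, REGION_OF)
chicago_detail = drill_down("Chicago", dealership_sales, DISTRICT_OF)
print(region_sales, district_sales, chicago_detail)
```

Each level of the hierarchy is just another grouping of the same base values, so drill-down never needs data beyond what the lowest level already stores.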

OLAP Reporting - Drill Down

[Chart: Inflows ($M) by Region (East, West, Central) and Year (1999, 2000)]

OLAP Reporting - Drill Down

[Chart: Inflows ($M) by Region (East, West, Central) for Year 1999, by quarter (1st Qtr to 4th Qtr)]

Drill-down from Year to Quarter

OLAP Reporting - Drill Down

[Chart: Inflows ($M) by Region (East, West, Central) for Year 1999, 1st Qtr, by month (January, February, March)]

Drill-down from Quarter to Month

Implementation Techniques - OLAP Architectures

MOLAP - Multidimensional OLAP
- Multidimensional databases provide the database and application logic layers

ROLAP - Relational OLAP
- Accesses data stored in a relational data warehouse for OLAP analysis
- Database and application logic are provided as separate layers

HOLAP - Hybrid OLAP
- The OLAP server routes queries first to the MDDB, then to the RDBMS, and the result is processed on the fly in the server

DOLAP - Desktop OLAP
- Personal MDDB server and application on the desktop

MOLAP - MDDB storage

[Figure: MOLAP architecture. An OLAP cube with an OLAP calculation engine serves web browsers, OLAP tools and OLAP applications.]

MOLAP - Features

- Powerful analytical capabilities (e.g., financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for
  o Maximum query performance
  o Optimum space utilization

ROLAP - Standard SQL storage

[Figure: ROLAP architecture. A relational DW is accessed through SQL via an MDDB-relational mapping; an OLAP calculation engine serves web browsers, OLAP tools and OLAP applications.]

ROLAP - Features
- Three-tier hardware/software architecture:
  o GUI on client; multidimensional processing on mid-tier server; target database on database server
  o Processing split between mid-tier & database servers
- Ad hoc query capabilities to very large databases
- DW integration
- Data scalability

HOLAP - Combination of RDBMS and MDDB

[Figure: HOLAP architecture. An OLAP cube and a relational DW (accessed through SQL) sit behind an OLAP calculation engine that serves any client: web browsers, OLAP tools and OLAP applications.]

HOLAP - Features

- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of RDBMS and MDDB performance
- Calculation engine provides full analysis features
- Source of data transparent to end user

Architecture Comparison

Definition
  MOLAP: MDDB OLAP = Transaction level data + summary in MDDB
  ROLAP: Relational OLAP = Transaction level data + summary in RDBMS
  HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to Sparsity
  MOLAP: High (may go beyond control; estimation is very important)
  ROLAP: No sparsity
  HOLAP: Sparsity exists only in MDDB part

Data explosion due to Summarization
  MOLAP: With good design, 3 - 10 times
  ROLAP: To the necessary extent
  HOLAP: To the necessary extent

Query Execution Speed
  MOLAP: Fast (depends upon the size of the MDDB)
  ROLAP: Slow
  HOLAP: Optimum. If the data is fetched from RDBMS then it's like ROLAP, otherwise like MOLAP.

Cost
  MOLAP: Medium: MDDB server + large disk space cost
  ROLAP: Low: only RDBMS + disk space cost
  HOLAP: High: RDBMS + disk space + MDDB server cost

Where to apply?
  MOLAP: Small transactional data + complex model + frequent summary analysis
  ROLAP: Very large transactional data & it needs to be viewed / sorted
  HOLAP: Large transactional data + frequent summary analysis

Representative OLAP Tools:

- Oracle Express Products
- Hyperion Essbase
- Cognos - PowerPlay
- Seagate - Holos
- SAS
- MicroStrategy - DSS Agent
- Informix MetaCube
- Brio Query
- Business Objects / Web Intelligence

Sample OLAP Applications

- Sales Analysis
- Financial Analysis
- Profitability Analysis
- Performance Analysis
- Risk Management
- Profiling & Segmentation
- Scorecard Application
- NPA Management
- Strategic Planning
- Customer Relationship Management (CRM)

Data Warehouse Testing


Data Warehouse Testing Overview

- The cost of finding software defects increases exponentially the later in the development lifecycle they are found. In data warehousing, this is compounded by the additional business cost of using incorrect data to make critical business decisions.
- The methodology required for testing a data warehouse is different from that for testing a typical transaction system.

Difference In Testing Data Warehouse and Transaction System

Data warehouse testing is different on the following counts:
- User-triggered vs. system-triggered
- Volume of test data
- Possible scenarios / test cases
- Programming for testing challenge

Difference In Testing Data Warehouse and Transaction System

- User-triggered vs. system-triggered
  In a data warehouse, most of the testing is system-triggered. Most production/source system testing covers the processing of individual transactions, which are driven by some input from the users (application form, servicing request). Very few test cycles there cover system-triggered scenarios (like billing or valuation).

Difference In Testing Data Warehouse and Transaction System

- Volume of test data
  The test data in a transaction system is a very small sample of the overall production data. A data warehouse typically has large test data, as one tries to cover the maximum possible combinations of dimensions and facts.
- Possible scenarios / test cases
  In a data warehouse, the permutations and combinations one can possibly test are virtually unlimited, because the core objective of a data warehouse is to allow all possible views of the data.

Difference In Testing Data Warehouse and Transaction System

- Programming for testing challenge
  In transaction systems, users/business analysts typically test the output of the system. In a data warehouse, most of the data quality testing and ETL testing is done at the back end by running separate stand-alone scripts. These scripts compare the data before transformation with the data after transformation.
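A stand-alone reconciliation script of the kind described above might look like the following sketch. It re-applies the expected transformation to the source rows and compares the result with what was actually loaded. The function names, the cents-to-whole-units rule and the sample rows are all hypothetical.

```python
def reconcile(source_rows, target_rows, transform, key):
    """Compare transformed source rows with loaded target rows, matched by key."""
    expected = {}
    for row in source_rows:
        transformed = transform(row)
        expected[transformed[key]] = transformed
    actual = {row[key]: row for row in target_rows}
    return {
        # Keys that should have been loaded but are absent from the target.
        "missing": sorted(k for k in expected if k not in actual),
        # Keys present in the target with no corresponding source row.
        "unexpected": sorted(k for k in actual if k not in expected),
        # Keys present on both sides whose values differ.
        "mismatched": sorted(k for k in expected if k in actual and expected[k] != actual[k]),
    }


def cents_to_units(row):
    """Hypothetical transformation rule: amounts are stored in cents at the source."""
    return {"id": row["id"], "amt": row["amt_cents"] // 100}


source = [
    {"id": 1, "amt_cents": 1200},
    {"id": 2, "amt_cents": 1100},
    {"id": 3, "amt_cents": 5000},
]
target = [
    {"id": 1, "amt": 12},
    {"id": 2, "amt": 99},  # wrong value loaded
    {"id": 4, "amt": 7},   # row with no source
]
differences = reconcile(source, target, cents_to_units, "id")
print(differences)
```

Because the script is independent of the ETL tool, it checks the business rule itself rather than the tool's implementation of it.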

Data Warehouse Testing Process

Data warehouse testing is basically divided into two parts:
- 'Back-end' testing, where the source systems' data is compared to the end-result data in the loaded area
- 'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by end-user tools such as OLAP

Testing phases consist of:
- Requirements testing
- Unit testing
- Integration testing
- Performance testing
- Acceptance testing

Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
- Are the requirements complete?
- Are the requirements singular?
- Are the requirements unambiguous?
- Are the requirements developable?
- Are the requirements testable?

Unit Testing
Unit testing for data warehouses is white-box testing. It should check the ETL procedures/mappings/jobs and the reports developed.

Unit testing the ETL procedures:
- Check whether the ETLs are accessing and picking up the right data from the right source.
- Check that all data transformations are correct according to the business rules and that the data warehouse is correctly populated with the transformed data.
- Test the rejected records that do not fulfil the transformation rules.
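A white-box unit test of an ETL mapping, including its reject path, might be sketched as follows. The rule (quantity must be a positive integer), the derivation (amount = qty * unit_price) and every field name are assumptions made for illustration only.

```python
# Hypothetical ETL mapping with a reject path, plus white-box checks on both outputs.

def apply_mapping(rows):
    """Split rows into (loaded, rejected) using an assumed rule: qty must be a positive integer."""
    loaded, rejected = [], []
    for row in rows:
        if isinstance(row.get("qty"), int) and row["qty"] > 0:
            # Assumed derivation rule for this sketch: amount = qty * unit_price.
            loaded.append({**row, "amount": row["qty"] * row["unit_price"]})
        else:
            rejected.append(row)
    return loaded, rejected

rows = [
    {"order_id": "o100", "qty": 1, "unit_price": 10},
    {"order_id": "o101", "qty": 0, "unit_price": 5},     # violates the rule
    {"order_id": "o102", "qty": None, "unit_price": 5},  # missing value
]
loaded, rejected = apply_mapping(rows)

# Every input row must end up in exactly one of the two outputs.
assert len(loaded) + len(rejected) == len(rows)
assert [r["order_id"] for r in rejected] == ["o101", "o102"]
assert loaded[0]["amount"] == 10
```

Checking that loaded plus rejected equals the input count is a cheap way to confirm that no record silently disappears between source and target.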

Unit Testing
Unit testing the report data:
- Verify report data with source: Data in a data warehouse is stored at an aggregate level compared to the source systems. The QA team should verify the granular data stored in the data warehouse against the available source data.
- Field-level data verification: The QA team must understand the linkages for the fields displayed in the report, and should trace them back and compare them with the source systems.
- Derivation formulae and calculation rules should be verified.

Integration Testing
Integration testing involves the following:
- Sequence of ETL jobs in the batch.
- Initial loading of records into the data warehouse.
- Incremental loading of records at a later date, to verify the newly inserted or updated data.
- Testing the rejected records that do not fulfil the transformation rules.
- Error log generation.
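The initial and incremental load scenarios listed above can be exercised with a small sketch like the one below. It assumes a dictionary standing in for the warehouse table and an integer `updated` column used as a watermark; both are hypothetical simplifications.

```python
# Hypothetical initial + incremental load keyed on a business key, using a last-updated watermark.

def load(warehouse, source_rows, watermark):
    """Insert or update rows changed after `watermark`; return the new watermark."""
    new_watermark = watermark
    for row in source_rows:
        if row["updated"] > watermark:
            warehouse[row["id"]] = row  # insert a new key or overwrite an existing one
            new_watermark = max(new_watermark, row["updated"])
    return new_watermark

warehouse = {}
day1 = [{"id": 1, "city": "sfo", "updated": 1}, {"id": 2, "city": "la", "updated": 1}]
mark = load(warehouse, day1, watermark=0)     # initial load

day2 = day1 + [{"id": 2, "city": "nyc", "updated": 2}, {"id": 3, "city": "la", "updated": 2}]
mark = load(warehouse, day2, watermark=mark)  # incremental load picks up only the changes

print(sorted(warehouse))
```

An integration test would then assert that the updated record carries its new value and that unchanged records were not reloaded.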

Performance Testing
Performance testing should check for:
- ETL processes completing within the time window.
- Monitoring and measuring the data quality issues.
- Refresh times for standard and complex reports.

Acceptance testing
Here the system is tested with full functionality and is expected to function as it would in production. At the end of user acceptance testing (UAT), the system should be acceptable to the client for use in terms of ETL process integrity, business functionality and reporting.

Questions


Thank You


Content [contd]
6 Metadata Management
7 OLAP
8 Data Warehouse Testing

An Overview
Understanding What is a Data Warehouse


What is Data Warehouse?


Definitions of Data Warehouse
- A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions. (W. H. Inmon)
- A data warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End-user-oriented data access and reporting tools let users get at the data for decision support. (Babcock)
- A data warehouse is a relational database, a copy of transaction data specifically structured for query and analysis. (Ralph Kimball)
- In simple terms: data warehousing is the collection of data from different systems, which helps in business decisions, analysis and reporting.

Data Warehouse def. by WH Inmon


A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:
- Subject Oriented: Data that gives information about a particular subject instead of about a company's ongoing operations.
- Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
- Nonvolatile: Data is stable in a data warehouse. More data is added, but data is never removed. This enables management to gain a consistent picture of the business.
- Time Variant: In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. All data in the data warehouse is identified with a particular time period.

Data Warehouse Architecture


What makes a Data Warehouse


Components of Warehouse
- Source Tables: Real-time, volatile data in relational databases used for transaction processing (OLTP). The sources can be any relational databases or flat files.
- ETL Tools: Extract, cleanse, transform (aggregates, joins) and load the data from the sources to the target.
- Maintenance and Administration Tools: Authorize and monitor access to the data, set up users, and schedule jobs to run during off-peak periods.
- Modeling Tools: Used to design the data warehouse for high performance using the dimensional data modeling technique, and to map the source and target files.
- Databases: Target databases and data marts, which are part of the data warehouse. These are structured for analysis and reporting purposes.
- End-user tools for analysis and reporting: Get the reports and analyze the data from the target tables. Different types of querying, data mining and OLAP tools are used for this purpose.

Data Warehouse Architecture


This is a basic design: source files are loaded into a warehouse, and users query the data for different purposes.

This design adds a staging area, where the data is loaded and tested after cleansing and transformation. It is later loaded directly into the target database/warehouse, which is divided into data marts that different users can access for their reporting and analysis purposes.

Data Modeling
Effective way of using a Data Warehouse


Data Modeling
The E-R data model is commonly used in OLTP; the dimensional data model is commonly used in OLAP.

E-R (Entity-Relationship) Data Model
Entity: An object that can be observed and classified based on its properties and characteristics, such as an employee, a book or a student.
Relationship: An association relating entities to other entities.

Different perspectives of data modeling:
o Conceptual Data Model
o Logical Data Model
o Physical Data Model

Types of dimensional data models most commonly used:
o Star Schema
o Snowflake Schema

Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:
- Dimension: A category of information. For example, the time dimension.
- Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.
- Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.
- Fact Table: A table that contains the measures of interest.
- Lookup Table: Provides detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.
- Surrogate Keys: Used to avoid data integrity problems. They are helpful for Slowly Changing Dimensions and act as index/primary keys.

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
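To illustrate why surrogate keys help with slowly changing dimensions, here is a minimal sketch. It assumes the common "Type 2" approach (a new row per change), which the slides do not name explicitly; the `CustomerDimension` class and its fields are hypothetical.

```python
# Hypothetical Type 2 slowly changing dimension: each change gets a new surrogate key.
from itertools import count

class CustomerDimension:
    def __init__(self):
        self.rows = []             # one row per version of a customer
        self._next_key = count(1)  # surrogate key generator

    def upsert(self, cust_id, city):
        """Add a new version if the customer is new or the city changed; return the surrogate key."""
        current = next((r for r in self.rows if r["cust_id"] == cust_id and r["current"]), None)
        if current and current["city"] == city:
            return current["sk"]
        if current:
            current["current"] = False  # expire the old version but keep it for history
        row = {"sk": next(self._next_key), "cust_id": cust_id, "city": city, "current": True}
        self.rows.append(row)
        return row["sk"]

dim = CustomerDimension()
first = dim.upsert(53, "sfo")
same = dim.upsert(53, "sfo")   # no change, so the same surrogate key is returned
moved = dim.upsert(53, "la")   # change, so a new row and a new surrogate key are created
print(first, same, moved)
```

Because fact rows reference the surrogate key rather than the business key, older facts keep pointing at the version of the customer that was valid when they were recorded.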

Star Schema
Dimension Table: product
prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store
storeId  city
c1       nyc
c2       sfo
c3       la

Fact Table: sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension Table: customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
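The star schema above can be tried out with Python's built-in `sqlite3` module. The table contents are taken from the slide; the column types and the query itself are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
    CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
    CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
    CREATE TABLE sale     (orderId TEXT, date TEXT, custId INTEGER, prodId TEXT,
                           storeId TEXT, qty INTEGER, amt INTEGER);
    INSERT INTO product  VALUES ('p1','bolt',10), ('p2','nut',5);
    INSERT INTO store    VALUES ('c1','nyc'), ('c2','sfo'), ('c3','la');
    INSERT INTO customer VALUES (53,'joe','10 main','sfo'), (81,'fred','12 main','sfo'),
                                (111,'sally','80 willow','la');
    INSERT INTO sale VALUES ('o100','1/7/97',53,'p1','c1',1,12),
                            ('o102','2/7/97',53,'p2','c1',2,11),
                            ('105','3/8/97',111,'p1','c3',5,50);
""")

# Total sales amount per product name and store city: one join per dimension used.
totals = conn.execute("""
    SELECT p.name, s.city, SUM(f.amt)
    FROM sale f
    JOIN product p ON p.prodId = f.prodId
    JOIN store   s ON s.storeId = f.storeId
    GROUP BY p.name, s.city
    ORDER BY p.name, s.city
""").fetchall()
print(totals)
```

Each dimension is reached from the fact table in a single join, which is the property that gives the star schema its name and its query simplicity.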

Snowflake Schema
Dimension Table: store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

Dimension Table: sType
tId  size   location
t1   small  downtown
t2   large  suburbs

Dimension Table: city
cityId  pop  regId
sfo     1M   north
la      5M   south

Dimension Table: region
regId  name
north  cold region
south  warm region

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulation. As such, the tables in these schemas are not normalized much, and are frequently designed at a level of normalization short of third normal form.

Overview of Data Cleansing


The Need For Data Quality

- Difficulty in decision making
- Time delays in operation
- Organizational mistrust
- Data ownership conflicts
- Customer attrition
- Costs associated with:
  o error detection
  o error rework
  o customer service
  o fixing customer problems

Six Steps To Data Quality


1. Understand Information Flow in Organization
   - Identify authoritative data sources
   - Interview employees & customers
2. Identify Potential Problem Areas & Assess Impact
   - Data entry points
   - Cost of bad data
3. Measure Quality of Data
   - Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values
4. Clean & Load Data
   - Use data cleansing tools to clean data at the source
   - Load only clean data into the data warehouse
5. Continuous Monitoring
   - Schedule periodic cleansing of source data
6. Identify Areas of Improvement
   - Identify & correct cause of defects
   - Refine data capture mechanisms at source
   - Educate users on importance of DQ
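Measuring the quality of data, as called for above, can start with a very simple profiling pass. The sketch below counts missing, duplicate and out-of-range values; the sample customer rows, the `profile` helper and the 0 to 120 age range are all hypothetical.

```python
# Hypothetical profiling pass: count missing, duplicate and out-of-range values.
from collections import Counter

def profile(rows, key, required, valid_range):
    """Return counts of common data quality defects for a list of dict rows."""
    field, low, high = valid_range
    key_counts = Counter(r.get(key) for r in rows)
    return {
        "missing": sum(1 for r in rows for f in required if r.get(f) in (None, "")),
        "duplicate_keys": sum(n - 1 for n in key_counts.values() if n > 1),
        "out_of_range": sum(1 for r in rows
                            if r.get(field) is not None and not low <= r[field] <= high),
    }

customers = [
    {"cust_id": 53, "name": "joe", "age": 34},
    {"cust_id": 53, "name": "joe", "age": 34},      # duplicate key
    {"cust_id": 81, "name": "", "age": 29},         # missing name
    {"cust_id": 111, "name": "sally", "age": 212},  # implausible age
]
report = profile(customers, key="cust_id", required=["name", "age"], valid_range=("age", 0, 120))
print(report)
```

Running such a pass on a schedule gives the periodic measurement that the continuous monitoring step relies on.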

Data Quality Solution


Customized Programs
- Strengths: Address specific needs. No bulky one-time investment.
- Limitations: Large numbers of custom programs in different environments are difficult to manage. Minor alterations demand coding effort.

Data Quality Assessment Tools
- Strength: Provide automated assessment.
- Limitation: No measure of data accuracy.

Data Quality Solution


Business Rule Discovery Tools
- Strengths: Detect correlation in data values. Can detect patterns of behavior that indicate fraud.
- Limitations: Not all variables can be discovered. Some discovered rules might not be pertinent. There may be performance problems with large files or with many fields.

Data Reengineering & Cleansing Tools
- Strengths: Usually are integrated packages with cleansing features as an add-on.
- Limitations: Error prevention at source is usually absent. The ETL tools have limited cleansing facilities.

Tools In The Market

Business Rule Discovery Tools
- Integrity Data Reengineering Tool from Vality Technology
- Trillium Software System from Harte-Hanks Data Technologies
- Migration Architect from DB Star

Data Reengineering & Cleansing Tools
- Carlton Pureview from Oracle
- ETI-Extract from Evolutionary Technologies
- PowerMart from Informatica Corp
- Sagent Data Mart from Sagent Technology

Data Quality Assessment Tools
- Migration Architect, Evoke Axio from Evoke Software
- Wizrule from Wizsoft

Name & Address Cleansing Tools
- Centrus Suite from Sagent
- I.d.centric from First Logic

Data Extraction, Transformation, Load


ETL Architecture

[Figure: ETL architecture in five stages: Data Collection, Data Extraction, Data Transformation, Data Loading, Data Storage & Integration. Sources are web server logs and e-commerce transaction data (from visitors' web browsers over the Internet), external data (demographics, household, webographics, income) and other OLTP systems. Scheduled extraction moves data (flat files, RDBMS) into a staging area where it is cleaned, transformed, matched and merged, supported by a metadata repository; scheduled loading then moves it into the enterprise data warehouse.]

ETL Architecture
Data Extraction:
- Rummages through a file or database
- Uses some criteria for selection
- Identifies qualified data
- Transports the data over onto another file or database

Data Extraction Cleanup:
- Restructuring of records or fields
- Removal of operational-only data
- Supply of missing field values
- Data integrity checks
- Data consistency and range checks, etc.

Data Transformation:
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data

Data Loading:
- Initial and incremental loading
- Updating of metadata
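The four stages listed above can be strung together in a few lines. The sketch below is a toy pipeline under stated assumptions: in-memory lists stand in for the source and the target, and the selection criterion, the city code table and the field names are hypothetical.

```python
# Hypothetical end-to-end pass over the four stages, using in-memory lists as source and target.
from datetime import date

def extract(source, min_amount):
    """Select qualifying rows using a simple criterion."""
    return [r for r in source if r["amt"] >= min_amount]

def clean(rows, default_city="unknown"):
    """Supply missing field values and drop operational-only fields."""
    return [{"order": r["order"], "amt": r["amt"], "city": r.get("city") or default_city}
            for r in rows]

def transform(rows, load_date):
    """Change codes to names and add a time attribute."""
    city_names = {"sfo": "San Francisco", "la": "Los Angeles"}  # assumed code table
    return [{**r, "city": city_names.get(r["city"], r["city"]), "load_date": load_date}
            for r in rows]

def load(target, rows):
    """Append rows to the target and return how many were loaded."""
    target.extend(rows)
    return len(rows)

source = [
    {"order": "o100", "amt": 12, "city": "sfo", "terminal_id": 7},
    {"order": "o102", "amt": 11, "city": None, "terminal_id": 7},
    {"order": "o103", "amt": 0, "city": "la", "terminal_id": 9},
]
warehouse = []
loaded = load(warehouse, transform(clean(extract(source, min_amount=1)), date(2009, 1, 1)))
print(loaded, warehouse)
```

Keeping each stage as a separate function mirrors the way ETL tools let each step be tested and scheduled on its own.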

Why ETL?
- Companies have valuable data lying around throughout their networks that needs to be moved from one place to another.
- The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats.
- To solve the problem, companies use extract, transform and load (ETL) software.
- The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file or an Excel spreadsheet.


Major components involved in ETL Processing

- Design manager: Lets developers define source-to-target mappings, transformations, process flows and jobs.
- Metadata management: Provides a repository to define, document and manage information about the ETL design and runtime processes.
- Extract: The process of reading data from a database.
- Transform: The process of converting the extracted data.
- Load: The process of writing the data into the target database.
- Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components.
- Administration and operation: ETL utilities let administrators schedule, run and monitor ETL jobs, log all events, manage errors, recover from failures and reconcile outputs with source systems.

ETL Tools
- Provide a facility to specify a large number of transformation rules with a GUI
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second-Generation
- PowerCentre/Mart from Informatica
- Data Mart Solution from Sagent Technology
- DataStage from Ascential

Metadata Management


What Is Metadata?
Metadata is information:
- That describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse
- About the data being captured and loaded into the warehouse
- Documented in IT tools, that improves both business and technical understanding of data and data-related processes

Importance Of Metadata
Locating information
- How much time is spent looking for information?
- How often is information found?
- What poor decisions were made based on incomplete information?
- How much money was lost or earned as a result?

Interpreting information
- How many times have businesses needed to rework or recall products?
- What impact does it have on the bottom line?
- How many mistakes were due to misinterpretation of existing documentation?
- How much interpretation results from too much metadata?
- How much time is spent trying to determine whether any of the metadata is accurate?

Integrating information
- How do the various data perspectives connect together?
- How much time is spent trying to figure that out?
- How much do the inefficiency and lack of metadata affect decision making?

Requirements for DW Metadata Management

- Provide a simple catalogue of business metadata descriptions and views
- Document/manage metadata descriptions from an integrated development environment
- Enable DW users to identify and invoke pre-built queries against the data stores
- Design and enhance new data models and schemas for the data warehouse
- Capture data transformation rules between the operational and data warehousing databases
- Provide change impact analysis, and update across these technologies

Consumers of Metadata

Technical Users
- Warehouse administrator
- Application developer

Business Users (business metadata)
- Meanings
- Definitions
- Business rules

Software Tools
- Used in DW life-cycle development
- Metadata requirements for each tool must be identified
- The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
- Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool

Trends in the Metadata Management Tools


Third Party Bridging Tools
- Oracle Exchange: Technology of choice for a long list of repository, enterprise and workgroup vendors.
- Reischmann-Informatik-Toolbus: Features include facilitation of selective bridging of metadata.
- Ardent Software / Dovetail Software - Interplay: Hub-and-spoke solution for enabling metadata interoperability. Ardent is focusing on its own engagements and is not selling it as an independent product.
- Informix's Metadata Plug-ins: Available with Ardent Datastage version 3.6.2 free of cost for Erwin, Oracle Designer, Sybase Powerdesigner, Brio and Microstrategy.

Trends in the Metadata Management Tools


Metadata Repositories
- IBM, Oracle and Microsoft to offer free or near-free basic repository services
- Enable organisations to reuse metadata across technologies
- Integrate DB design, data transformation and BI tools from different vendors
- Multi-tool vendors are taking a bridged or federated rather than an integrated approach to sharing metadata
- Both IBM and Oracle have multiple repositories for different lines of products, e.g., one for AD and one for DW, with bridges between them

Trends in the Metadata Management Tools


Metadata Interchange Standards
- CDIF (CASE Data Interchange Format): Most frequently used interchange standard. Addresses only a limited subset of metadata artifacts.
- OMG (Object Management Group) - CWM: XML addresses context and data meaning, not presentation. Can enable exchange over the web, employing industry standards for storing and sharing programming data. Will allow sharing of UML and MOF objects between various development tools and repositories.
- MDC (Metadata Coalition): Based on XML/UML standards. Promoted by Microsoft along with 20 partners including Object Management Group (OMG), Oracle Carleton Group, CA-PLATINUM Technology (Founding Member), Viasoft.

OLAP


Agenda
- OLAP Definition
- Distinction between OLTP and OLAP
- MDDB Concepts
- Implementation Techniques
- Architectures
- Features
- Representative Tools

1/13/2012

OLAP: On-Line Analytical Processing

- OLAP can be defined as a technology that allows users to view aggregate data across measurements (such as Maturity Amount, Interest Rate, etc.) along with a set of related parameters called dimensions (such as Product, Organization, Customer, etc.).
- The term is used interchangeably with BI.
- A multidimensional view of data is the foundation of OLAP.
- Users: analysts, decision makers.

Distinction between OLTP and OLAP


Source of data
  OLTP System: Operational data; OLTPs are the original source of the data
  OLAP System: Consolidation data; OLAP data comes from the various OLTP databases

Purpose of data
  OLTP System: To control and run fundamental business tasks
  OLAP System: Decision support

What the data reveals
  OLTP System: A snapshot of ongoing business processes
  OLAP System: Multi-dimensional views of various kinds of business activities

Inserts and Updates
  OLTP System: Short and fast inserts and updates initiated by end users
  OLAP System: Periodic long-running batch jobs refresh the data

MDDB Concepts
A multidimensional database is a computer software system designed to allow efficient and convenient storage and retrieval of data that is:
- intimately related, and
- stored, viewed and analyzed from different perspectives (dimensions).

A hypercube represents a collection of multidimensional data.
- The edges of the cube are called dimensions.
- Individual items within each dimension are called members.

RDBMS v/s MDDB: Increased Complexity...


Relational DBMS

MODEL         COLOR  DEALER   VOL.
MINI VAN      BLUE   Clyde    6
MINI VAN      BLUE   Gleason  3
MINI VAN      BLUE   Carr     2
MINI VAN      RED    Clyde    5
MINI VAN      RED    Gleason  3
MINI VAN      RED    Carr     1
MINI VAN      WHITE  Clyde    3
MINI VAN      WHITE  Gleason  1
MINI VAN      WHITE  Carr     4
SPORTS COUPE  BLUE   Clyde    3
SPORTS COUPE  BLUE   Gleason  3
SPORTS COUPE  BLUE   Carr     3
SPORTS COUPE  RED    Clyde    4
SPORTS COUPE  RED    Gleason  3
SPORTS COUPE  RED    Carr     6
SPORTS COUPE  WHITE  Clyde    2
SPORTS COUPE  WHITE  Gleason  3
SPORTS COUPE  WHITE  Carr     5
SEDAN         BLUE   Clyde    4
SEDAN         BLUE   Gleason  3
SEDAN         BLUE   Carr     2
...

27 rows x 4 columns = 108 cells

MDDB

[Figure: Sales Volumes cube with axes MODEL (Mini Van, Coupe, Sedan), DEALERSHIP (Clyde, Gleason, Carr) and COLOR (Blue, Red, White).]

3 x 3 x 3 = 27 cells
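The cell counts quoted above can be reproduced with a short sketch. It assumes the relational table holds one row per MODEL/COLOR/DEALER combination and models the cube as a nested Python list; the volumes are placeholders except for the one Sedan/Blue/Carr value taken from the table.

```python
# Hypothetical comparison of the two layouts for the same sales volumes.
from itertools import product

models = ["MINI VAN", "SPORTS COUPE", "SEDAN"]
colors = ["BLUE", "RED", "WHITE"]
dealers = ["Clyde", "Gleason", "Carr"]

# Relational layout: one row per combination, four columns (MODEL, COLOR, DEALER, VOL.).
# The volumes are placeholders; only the shape matters here.
rows = [(m, c, d, 0) for m, c, d in product(models, colors, dealers)]
relational_cells = len(rows) * len(rows[0])

# Multidimensional layout: the dimension members become array positions,
# so only the volume itself is stored in each cell.
cube = [[[0 for _ in dealers] for _ in colors] for _ in models]
cube_cells = sum(len(line) for plane in cube for line in plane)

# Addressing a cell uses member positions instead of scanning rows.
cube[models.index("SEDAN")][colors.index("BLUE")][dealers.index("Carr")] = 2

print(relational_cells, cube_cells)
```

The relational form repeats the member names in every row, while the array form stores them once as axis labels.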

Benefits of MDDB over RDBMS

- Ease of Data Presentation & Navigation: A great deal of information is gleaned immediately upon direct inspection of the array. The user is able to view data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than the one offered by the relational table.
- Storage Space: Very low space consumption compared to a relational DB.
- Performance: Gives much better performance. A relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad hoc queries.
- Ease of Maintenance: No overhead, as data is stored in the same way it is viewed. In a relational DB, indexes, sophisticated joins, etc. are used, which require considerable storage and maintenance.

Issues with MDDB

Sparsity
- Input data in applications are typically sparse
- Increases with increased dimensions

Data Explosion
- Due to sparsity
- Due to summarization

Performance
- Doesn't perform better than RDBMS at high data volumes (>20-30 GB)

Issues with MDDB - Sparsity Example


If members of different dimensions do not interact, a blank cell is left behind.

Employee Age

LAST NAME  EMP#  AGE
SMITH      01    21
REGAN      12    19
FOX        31    63
WELD       14    31
KELLY      54    27
LINK       03    56
KRANZ      41    45
LUCAS      33    41
WEISS      23    19

[Figure: the same employee data laid out as a LAST NAME x EMPLOYEE # array, in which only one cell per employee holds an age, shown beside a fully populated Sales Volumes array of MODEL x COLOR:]

Sales Volumes
          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2
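How sparse the employee array is can be worked out directly. The sketch below uses the nine employees from the table above; the density calculation itself is an illustration added here, not part of the original slide.

```python
# Hypothetical density calculation for the employee example above.
employees = [("SMITH", "01", 21), ("REGAN", "12", 19), ("FOX", "31", 63),
             ("WELD", "14", 31), ("KELLY", "54", 27), ("LINK", "03", 56),
             ("KRANZ", "41", 45), ("LUCAS", "33", 41), ("WEISS", "23", 19)]

names = sorted({name for name, _, _ in employees})
emp_numbers = sorted({emp for _, emp, _ in employees})

# A LAST NAME x EMPLOYEE # array needs one cell per combination...
total_cells = len(names) * len(emp_numbers)
# ...but each employee fills exactly one of them.
filled_cells = len(employees)
density = filled_cells / total_cells

print(f"{filled_cells} of {total_cells} cells filled ({density:.0%})")
```

Every added employee adds a whole row and a whole column to the array but only one filled cell, which is why sparsity grows as dimensions grow.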

OLAP Features
- Calculations applied across dimensions, through hierarchies and/or across members
- Trend analysis over sequential time periods
- What-if scenarios
- Slicing / dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through / drill-through to underlying detail data

Features of OLAP - Rotation

Complex queries and sorts in a relational environment translate to a simple rotation.

Sales Volumes, View #1 (MODEL by COLOR)
          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2

(ROTATE 90°)

Sales Volumes, View #2 (COLOR by MODEL)
       Mini Van  Coupe  Sedan
Blue   6         3      4
Red    5         5      3
White  4         5      2

A 2-dimensional array has 2 views.

Features of OLAP - Rotation


[Figure: the Sales Volumes cube rotated 90° at a time through six views (View #1 to View #6), each assigning MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Carr, Gleason, Clyde) to different axes.]

A 3-dimensional array has 6 views.
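The view counts for 2 and 3 dimensions follow from counting the orderings of the dimensions across the display axes, which is n! for n dimensions. A small sketch, using only the standard library (the `views` helper is a name invented here):

```python
from itertools import permutations

def views(dimensions):
    """Every ordering of the dimensions across the display axes is one view."""
    return list(permutations(dimensions))

two_d = views(["MODEL", "COLOR"])
three_d = views(["MODEL", "COLOR", "DEALERSHIP"])
print(len(two_d), len(three_d))

# Rotating a 2-dimensional view is a transpose of the array.
view1 = [[6, 5, 4],   # Mini Van: Blue, Red, White
         [3, 5, 5],   # Coupe
         [4, 3, 2]]   # Sedan
view2 = [list(col) for col in zip(*view1)]
print(view2)
```

No data is re-queried or re-sorted during a rotation; only the assignment of dimensions to axes changes.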

Features of OLAP - Slicing / Filtering

MDDB allows the end user to quickly slice in on the exact view of the data required.

[Figure: the Sales Volumes cube sliced down to MODEL (Mini Van, Coupe), DEALERSHIP (Carr, Clyde) and COLOR (Normal Blue, Metal Blue).]
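Slicing amounts to fixing one or more dimension members and keeping only the matching cells. The sketch below assumes the cube is stored as a dictionary keyed by (model, color, dealership); the `slice_cube` helper is hypothetical, and the volumes are a subset of the relational table shown earlier.

```python
# Hypothetical cube stored as a dict keyed by (model, color, dealership).
cube = {
    ("Mini Van", "Blue", "Clyde"): 6, ("Mini Van", "Blue", "Gleason"): 3,
    ("Mini Van", "Blue", "Carr"): 2,  ("Mini Van", "Red", "Clyde"): 5,
    ("Sedan", "Blue", "Clyde"): 4,    ("Sedan", "Blue", "Carr"): 2,
}

def slice_cube(cube, model=None, color=None, dealership=None):
    """Keep only the cells whose members match every filter that was given."""
    wanted = (model, color, dealership)
    return {key: vol for key, vol in cube.items()
            if all(w is None or w == k for w, k in zip(wanted, key))}

blue_at_clyde = slice_cube(cube, color="Blue", dealership="Clyde")
print(blue_at_clyde)
```

Leaving a dimension unfiltered keeps all of its members, so the same helper covers both a single slice and a narrower dice.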

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION
REGION: Midwest
DISTRICT: Chicago, St. Louis, Gary
DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton

Sales can be viewed at the Region, District or Dealership level.

Moving up and moving down in a hierarchy are referred to as drill-up / roll-up and drill-down, respectively.
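Roll-up and drill-down along the organization hierarchy can be sketched as below. The slide does not say which dealership belongs to which district, so that mapping is an assumption made purely for illustration, and the sales figures are made up.

```python
# Hypothetical hierarchy: the dealership-to-district mapping is assumed for this sketch,
# and the sales figures are made up.
from collections import defaultdict

district_of = {"Clyde": "Chicago", "Gleason": "Chicago", "Carr": "St. Louis",
               "Levi": "St. Louis", "Lucas": "Gary", "Bolton": "Gary"}
region_of = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}
sales_by_dealership = {"Clyde": 6, "Gleason": 3, "Carr": 2, "Levi": 4, "Lucas": 1, "Bolton": 5}

def roll_up(values, parent_of):
    """Aggregate one level of the hierarchy into its parent level."""
    totals = defaultdict(int)
    for member, value in values.items():
        totals[parent_of[member]] += value
    return dict(totals)

def drill_down(parent, values, parent_of):
    """Show the children that make up one parent total."""
    return {m: v for m, v in values.items() if parent_of[m] == parent}

sales_by_district = roll_up(sales_by_dealership, district_of)  # drill-up one level
sales_by_region = roll_up(sales_by_district, region_of)        # drill-up again
gary_detail = drill_down("Gary", sales_by_dealership, district_of)
print(sales_by_region, sales_by_district, gary_detail)
```

The same pair of helpers works for any hierarchy, for example Year, Quarter and Month in a time dimension.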

OLAP Reporting - Drill Down

[Figure: bar chart "Inflows (Region, Year)" showing Inflows ($M) for the East, West and Central regions in the years 1999 and 2000.]

OLAP Reporting - Drill Down

[Figure: bar chart "Inflows (Region, Year - Year 1999)" showing Inflows ($M) for the East, West and Central regions in each quarter of 1999.]

Drill-down from Year to Quarter

OLAP Reporting - Drill Down

[Figure: bar chart "Inflows (Region, Year - Year 1999 - 1st Qtr)" showing Inflows ($M) for the East, West and Central regions in January, February and March 1999.]

Drill-down from Quarter to Month

Implementation Techniques -OLAP Architectures

MOLAP - Multidimensional OLAP
Multidimensional databases provide the database and application logic layer.

ROLAP - Relational OLAP
Accesses data stored in a relational data warehouse for OLAP analysis. Database and application logic are provided as separate layers.

HOLAP - Hybrid OLAP
The OLAP server routes queries first to the MDDB, then to the RDBMS, and the result is processed on the fly in the server.

DOLAP - Desktop OLAP
Personal MDDB server and application on the desktop.
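The HOLAP routing described above (MDDB first, then RDBMS) can be sketched as follows. This is a deliberately simplified illustration: a dictionary stands in for the cube summaries, a list of tuples for the relational detail, and `make_router` is a hypothetical helper.

```python
# Hypothetical HOLAP router: summaries come from the cube, anything else falls through to the RDBMS.

def make_router(cube_summaries, detail_rows):
    def query(model):
        if model in cube_summaries:  # first try the MDDB
            return cube_summaries[model], "MDDB"
        total = sum(vol for m, vol in detail_rows if m == model)  # then the RDBMS
        return total, "RDBMS"
    return query

# Pre-computed summary held in the cube (Mini Van total from the earlier relational table).
cube_summaries = {"Mini Van": 28}
# Detail rows held in the relational warehouse (Sedan/Blue rows from the same table).
detail_rows = [("Sedan", 4), ("Sedan", 3), ("Sedan", 2)]

query = make_router(cube_summaries, detail_rows)
print(query("Mini Van"))
print(query("Sedan"))
```

The caller gets an answer either way, which is the sense in which the source of data is transparent to the end user.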

MOLAP - MDDB storage

[Figure: MOLAP architecture. An OLAP cube with an OLAP calculation engine serves web browsers, OLAP tools and OLAP applications.]

MOLAP - Features

- Powerful analytical capabilities (e.g., financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for:
  o Maximum query performance
  o Optimum space utilization

ROLAP - Standard SQL storage

[Figure: ROLAP architecture. An OLAP calculation engine with an MDDB-to-relational mapping accesses the relational DW through SQL and serves web browsers, OLAP tools and OLAP applications.]

ROLAP - Features
Three-tier hardware/software architecture:
- GUI on the client; multidimensional processing on the mid-tier server; target database on the database server
- Processing split between the mid-tier and database servers

- Ad hoc query capabilities for very large databases
- DW integration
- Data scalability

HOLAP - Combination of RDBMS and MDDB


[Figure: HOLAP architecture. An OLAP calculation engine draws on both an OLAP cube and, through SQL, the relational DW, and serves any client: web browsers, OLAP tools and OLAP applications.]

HOLAP - Features

- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of RDBMS and performance of MDDB
- Calculation engine provides full analysis features
- Source of data is transparent to the end user

Architecture Comparison

Definition
  MOLAP: MDDB OLAP = transaction-level data + summary in MDDB
  ROLAP: Relational OLAP = transaction-level data + summary in RDBMS
  HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to sparsity
  MOLAP: With good design, 3-10 times
  ROLAP: No sparsity
  HOLAP: Sparsity exists only in the MDDB part

Data explosion due to summarization
  MOLAP: High (may go beyond control; estimation is very important)
  ROLAP: To the necessary extent
  HOLAP: To the necessary extent

Query execution speed
  MOLAP: Fast (depends upon the size of the MDDB)
  ROLAP: Slow
  HOLAP: Optimum. If the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP.

Cost
  MOLAP: Medium: MDDB server + large disk space cost
  ROLAP: Low: only RDBMS + disk space cost
  HOLAP: High: RDBMS + disk space + MDDB server cost

Where to apply?
  MOLAP: Small transactional data + complex model + frequent summary analysis
  ROLAP: Very large transactional data that needs to be viewed / sorted
  HOLAP: Large transactional data + frequent summary analysis

Representative OLAP Tools:

- Oracle Express Products
- Hyperion Essbase
- Cognos - PowerPlay
- Seagate - Holos
- SAS

- Micro Strategy - DSS Agent
- Informix MetaCube
- Brio Query
- Business Objects / Web Intelligence

Sample OLAP Applications

- Sales Analysis
- Financial Analysis
- Profitability Analysis
- Performance Analysis
- Risk Management
- Profiling & Segmentation
- Scorecard Application
- NPA Management
- Strategic Planning
- Customer Relationship Management (CRM)

Data Warehouse Testing


Data Warehouse Testing Overview

- The cost of finding software defects increases exponentially the later they are found in the development lifecycle. In data warehousing, this is compounded by the additional business cost of using incorrect data to make critical business decisions.
- The methodology required for testing a data warehouse is different from that for testing a typical transaction system.

Difference In Testing Data warehouse and Transaction System


Data warehouse testing is different on the following counts:
- User-triggered vs. system-triggered
- Volume of test data
- Possible scenarios / test cases
- Programming for testing challenge

Difference In Testing Data warehouse and Transaction System

User-Triggered vs. System-Triggered: In a data warehouse, most of the testing is system-triggered. Most testing of production/source systems covers the processing of individual transactions, which are driven by some input from the users (application form, servicing request, etc.). Very few test cycles there cover system-triggered scenarios (such as billing or valuation).

Difference In Testing Data warehouse and Transaction System


 Volume of Test Data The test data in a transaction system is a very small sample of the overall production data. Data Warehouse has typically large test data as one does try to fill-up maximum possible combination of dimensions and facts.  Possible scenarios/ Test Cases In case of Data Warehouse, the permutations and combinations one can possibly test is virtually unlimited due to the core objective of Data Warehouse is to allow all possible views of data.


Difference In Testing Data warehouse and Transaction System


Programming for testing challenge
- In transaction systems, users/business analysts typically test the output of the system.
- In a data warehouse, most of the data quality testing and ETL testing is done at the back end by running separate stand-alone scripts. These scripts compare the data before transformation with the data after transformation.
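
A stand-alone back-end check of this kind could be sketched as follows. This is only an illustration in Python, not a method the slides prescribe: the `transform` rule, the field names and the sample rows are all hypothetical.

```python
from collections import Counter


def transform(row):
    # Hypothetical business rule: trim and upper-case the product code,
    # and convert an amount held in cents into whole currency units.
    return (row["product"].strip().upper(), row["amount_cents"] / 100)


def reconcile(source_rows, target_rows):
    """Compare pre-transformation source rows with post-transformation target rows."""
    expected = Counter(transform(row) for row in source_rows)
    actual = Counter((row["product"], row["amount"]) for row in target_rows)
    return {
        "source_count": len(source_rows),
        "target_count": len(target_rows),
        "missing_in_target": sorted((expected - actual).elements()),
        "unexpected_in_target": sorted((actual - expected).elements()),
    }


# Hypothetical sample data: the third source row never reached the target.
source = [
    {"product": " p1 ", "amount_cents": 1200},
    {"product": "p2", "amount_cents": 1100},
    {"product": "p1", "amount_cents": 5000},
]
target = [
    {"product": "P1", "amount": 12.0},
    {"product": "P2", "amount": 11.0},
]
report = reconcile(source, target)
print(report)
```

The script re-applies the expected rule to the source rows and reports rows that are missing from, or unexpected in, the target, which is one simple way to compare both sides of a transformation.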


Data Warehouse Testing Process


Data warehouse testing is basically divided into two parts:
- 'Back-end' testing, where the source system data is compared with the end-result data in the loaded area.
- 'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by end-user tools such as OLAP.

The testing phases consist of:
- Requirements testing
- Unit testing
- Integration testing
- Performance testing
- Acceptance testing


Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
- Are the requirements complete?
- Are the requirements singular?
- Are the requirements unambiguous?
- Are the requirements developable?
- Are the requirements testable?


Unit Testing
Unit testing for data warehouses is white-box testing. It should check the ETL procedures/mappings/jobs and the reports developed.

Unit testing the ETL procedures:
- Check that the ETLs access and pick up the right data from the right source.
- Check that all data transformations are correct according to the business rules and that the data warehouse is correctly populated with the transformed data.
- Test the rejected records that don't fulfil the transformation rules.
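
As a rough illustration of testing rejected records, the sketch below routes each input row either to the loaded set or to a reject list with a reason. The two rules (`cust_id` must be present, `qty` must be a positive integer) and the sample rows are hypothetical stand-ins for real business rules.

```python
def apply_rules(records):
    """Split records into (loaded, rejected) using two hypothetical transformation rules."""
    loaded, rejected = [], []
    for rec in records:
        errors = []
        if rec.get("cust_id") is None:
            errors.append("missing cust_id")
        qty = rec.get("qty")
        if not isinstance(qty, int) or qty <= 0:
            errors.append("qty must be a positive integer")
        if errors:
            # Keep the original record and the reasons so the reject can be audited.
            rejected.append({"record": rec, "errors": errors})
        else:
            loaded.append({"cust_id": rec["cust_id"], "qty": qty})
    return loaded, rejected


records = [
    {"cust_id": 53, "qty": 1},
    {"cust_id": None, "qty": 2},
    {"cust_id": 111, "qty": 0},
]
loaded, rejected = apply_rules(records)

# Unit-test style check: every input row is either loaded or rejected, never lost.
assert len(loaded) + len(rejected) == len(records)
```

A unit test can then assert both that valid rows reach the target and that each invalid row appears in the reject list with the expected reason.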


Unit Testing
Unit testing the report data:
- Verify report data with source: data in a data warehouse is often stored at an aggregate level compared to the source systems. The QA team should verify the granular data stored in the data warehouse against the available source data.
- Field-level data verification: the QA team must understand the linkages for the fields displayed in the report, trace them back, and compare them with the source systems.
- Derivation formulae/calculation rules should be verified.


Integration Testing
Integration testing involves the following:
- Sequence of ETL jobs in the batch.
- Initial loading of records into the data warehouse.
- Incremental loading of records at a later date, to verify the newly inserted or updated data.
- Testing the rejected records that don't fulfil the transformation rules.
- Error log generation.


Performance Testing
Performance testing should check for:
- ETL processes completing within the time window.
- Monitoring and measuring of data quality issues.
- Refresh times for standard/complex reports.


Acceptance testing
At this stage the system is tested with full functionality and is expected to function as it would in production. At the end of user acceptance testing (UAT), the system should be acceptable to the client in terms of ETL process integrity, business functionality, and reporting.


Questions


Thank You


Trends in the Metadata Management Tools


Metadata Interchange Standards

- CDIF (CASE Data Interchange Format)
  - Most frequently used interchange standard
  - Addresses only a limited subset of metadata artifacts
- OMG (Object Management Group) - CWM
  - XML: addresses context and data meaning, not presentation
  - Can enable exchange over the web, employing industry standards for storing and sharing programming data
  - Will allow sharing of UML and MOF objects between various development tools and repositories
- MDC (Metadata Coalition)
  - Based on XML/UML standards
  - Promoted by Microsoft along with 20 partners, including Object Management Group (OMG), Oracle Carleton Group, CA-PLATINUM Technology (founding member), Viasoft

OLAP


Agenda
- OLAP Definition
- Distinction between OLTP and OLAP
- MDDB Concepts
- Implementation Techniques
- Architectures
- Features
- Representative Tools


OLAP: On-Line Analytical Processing

- OLAP can be defined as a technology that lets users view aggregate data across measurements (such as Maturity Amount, Interest Rate, etc.) along a set of related parameters called dimensions (such as Product, Organization, Customer, etc.).
- The term is often used interchangeably with BI.
- A multidimensional view of data is the foundation of OLAP.
- Users: analysts, decision makers.


Distinction between OLTP and OLAP


Source of data
- OLTP system: Operational data; OLTPs are the original source of the data.
- OLAP system: Consolidation data; OLAP data comes from the various OLTP databases.

Purpose of data
- OLTP system: To control and run fundamental business tasks.
- OLAP system: Decision support.

What the data reveals
- OLTP system: A snapshot of ongoing business processes.
- OLAP system: Multi-dimensional views of various kinds of business activities.

Inserts and updates
- OLTP system: Short and fast inserts and updates initiated by end users.
- OLAP system: Periodic long-running batch jobs refresh the data.


MDDB Concepts
A multidimensional database is a computer software system designed to allow efficient and convenient storage and retrieval of data that is
- intimately related, and
- stored, viewed and analyzed from different perspectives (dimensions).

A hypercube represents a collection of multidimensional data.
- The edges of the cube are called dimensions.
- Individual items within each dimension are called members.


RDBMS v/s MDDB: Increased Complexity...


Relational DBMS

MODEL         COLOR  DEALER   VOL.
MINI VAN      BLUE   Clyde    6
MINI VAN      BLUE   Gleason  3
MINI VAN      BLUE   Carr     2
MINI VAN      RED    Clyde    5
MINI VAN      RED    Gleason  3
MINI VAN      RED    Carr     1
MINI VAN      WHITE  Clyde    3
MINI VAN      WHITE  Gleason  1
MINI VAN      WHITE  Carr     4
SPORTS COUPE  BLUE   Clyde    3
SPORTS COUPE  BLUE   Gleason  3
SPORTS COUPE  BLUE   Carr     3
SPORTS COUPE  RED    Clyde    4
SPORTS COUPE  RED    Gleason  3
SPORTS COUPE  RED    Carr     6
SPORTS COUPE  WHITE  Clyde    2
SPORTS COUPE  WHITE  Gleason  3
SPORTS COUPE  WHITE  Carr     5
SEDAN         BLUE   Clyde    4
SEDAN         BLUE   Gleason  3
SEDAN         BLUE   Carr     2
...           ...    ...      ...

27 rows x 4 columns = 108 cells

MDDB

[Figure: Sales Volumes cube with axes MODEL (Mini Van, Coupe, Sedan), DEALERSHIP (Carr, Gleason, Clyde) and COLOR (Blue, Red, White)]

3 x 3 x 3 = 27 cells
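
The cell-count comparison can be reproduced with a small sketch. It is an illustration in Python rather than how any MDDB product stores data: the cube is modelled as a plain dictionary keyed by (model, color, dealer), and only the first nine rows of the relational table are used.

```python
from math import prod

# First nine rows of the relational table (MODEL, COLOR, DEALER, VOL.).
rows = [
    ("MINI VAN", "BLUE", "Clyde", 6),
    ("MINI VAN", "BLUE", "Gleason", 3),
    ("MINI VAN", "BLUE", "Carr", 2),
    ("MINI VAN", "RED", "Clyde", 5),
    ("MINI VAN", "RED", "Gleason", 3),
    ("MINI VAN", "RED", "Carr", 1),
    ("MINI VAN", "WHITE", "Clyde", 3),
    ("MINI VAN", "WHITE", "Gleason", 1),
    ("MINI VAN", "WHITE", "Carr", 4),
]


def build_cube(rows):
    """Store each volume once, addressed by its (model, color, dealer) coordinates."""
    return {(model, color, dealer): vol for model, color, dealer, vol in rows}


def cell_counts(rows):
    """Return (relational cells, cube cells) for the given rows."""
    relational_cells = len(rows) * len(rows[0])  # every row repeats its key columns
    members = [set(column) for column in list(zip(*rows))[:-1]]  # members per dimension
    cube_cells = prod(len(m) for m in members)
    return relational_cells, cube_cells


cube = build_cube(rows)
print(cube[("MINI VAN", "RED", "Clyde")])
print(cell_counts(rows))
```

For the full 27-row table the same arithmetic gives 27 x 4 = 108 relational cells against 3 x 3 x 3 = 27 cube cells, the figures quoted on the slide.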


Benefits of MDDB over RDBMS

- Ease of data presentation and navigation: a great deal of information is gleaned immediately upon direct inspection of the array. The user can view data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than that offered by a relational table.
- Storage space: very low space consumption compared to a relational DB.
- Performance: gives much better performance. A relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad-hoc queries.
- Ease of maintenance: no overhead, as data is stored in the same way it is viewed. In a relational DB, indexes, sophisticated joins, etc. are used, which require considerable storage and maintenance.

Issues with MDDB

Sparsity
- Input data in applications is typically sparse
- Increases with increased dimensions

Data Explosion
- Due to sparsity
- Due to summarization

Performance
- Doesn't perform better than RDBMS at high data volumes (>20-30 GB)


Issues with MDDB - Sparsity Example


If members of different dimensions do not interact, a blank cell is left behind.

LAST NAME  EMP#  AGE
SMITH      01    21
REGAN      12    19
FOX        31    63
WELD       14    31
KELLY      54    27
LINK       03    56
KRANZ      41    45
LUCAS      33    41
WEISS      23    19

[Figure: Employee Age grid with axes LAST NAME (Smith, Regan, Fox, Weld, Kelly, Link, Kranz, Lucas, Weiss) and EMPLOYEE #; only one cell per employee holds an age and the rest are blank. Shown beside the dense Sales Volumes array (MODEL x COLOR) for contrast.]
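
The degree of sparsity in the employee example can be worked out with a short sketch. The `sparsity` helper is a hypothetical name introduced here; it counts how many cells of a LAST NAME x EMPLOYEE # grid actually hold a value.

```python
# (last name, employee number, age) for the nine employees on the slide.
employees = [
    ("SMITH", "01", 21),
    ("REGAN", "12", 19),
    ("FOX", "31", 63),
    ("WELD", "14", 31),
    ("KELLY", "54", 27),
    ("LINK", "03", 56),
    ("KRANZ", "41", 45),
    ("LUCAS", "33", 41),
    ("WEISS", "23", 19),
]


def sparsity(records):
    """Return (filled cells, total cells, fraction of blank cells) for a 2-D grid."""
    names = {name for name, _, _ in records}
    emp_numbers = {emp for _, emp, _ in records}
    total_cells = len(names) * len(emp_numbers)
    filled_cells = len({(name, emp) for name, emp, _ in records})
    return filled_cells, total_cells, 1 - filled_cells / total_cells


filled, total, blank_fraction = sparsity(employees)
print(filled, total, round(blank_fraction, 3))
```

Because each last name pairs with exactly one employee number, only 9 of the 9 x 9 = 81 cells are populated; every further dimension with the same property multiplies the number of blank cells again.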


OLAP Features
- Calculations applied across dimensions, through hierarchies and/or across members
- Trend analysis over sequential time periods
- What-if scenarios
- Slicing / dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through / drill-through to underlying detail data


Features of OLAP - Rotation

Complex queries and sorts in a relational environment translate to a simple rotation.

Sales Volumes, View #1 (rows: MODEL, columns: COLOR)

          Blue  Red  White
Mini Van  6     5    4
Coupe     3     5    5
Sedan     4     3    2

( ROTATE 90° )

Sales Volumes, View #2 (rows: COLOR, columns: MODEL)

       Mini Van  Coupe  Sedan
Blue   6         3      4
Red    5         5      3
White  4         5      2

A 2-dimensional array has 2 views.



Features of OLAP - Rotation


[Figure: six views (View #1 to View #6) of the Sales Volumes cube, each reached by a 90° rotation. Every view assigns MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Carr, Gleason, Clyde) to a different arrangement of the three axes.]

A 3-dimensional array has 6 views.
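
Rotation can be sketched in a few lines of Python. The example below is an illustration, not an OLAP engine: a 2-dimensional view is held as a nested dictionary using the Model x Color sales volumes from the rotation example, `rotate` swaps rows and columns, and `axis_arrangements` (a hypothetical helper) lists the possible orderings of the dimensions.

```python
from itertools import permutations

# View #1: rows are models, columns are colors.
view_1 = {
    "Mini Van": {"Blue": 6, "Red": 5, "White": 4},
    "Coupe": {"Blue": 3, "Red": 5, "White": 5},
    "Sedan": {"Blue": 4, "Red": 3, "White": 2},
}


def rotate(view):
    """Swap the row and column dimensions of a 2-D view."""
    rotated = {}
    for row_member, cells in view.items():
        for col_member, value in cells.items():
            rotated.setdefault(col_member, {})[row_member] = value
    return rotated


def axis_arrangements(dimensions):
    """Every ordering of the dimensions corresponds to one view of the array."""
    return list(permutations(dimensions))


view_2 = rotate(view_1)
print(view_2["Blue"])
print(len(axis_arrangements(["MODEL", "COLOR"])))
print(len(axis_arrangements(["MODEL", "COLOR", "DEALERSHIP"])))
```

The number of views is the number of orderings of the dimensions, which is why two dimensions give 2 views and three dimensions give 6.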



Features of OLAP - Slicing / Filtering

- MDDB allows the end user to quickly slice in on the exact view of the data required.

[Figure: Sales Volumes cube sliced to MODEL (Mini Van, Coupe), DEALERSHIP (Carr, Clyde) and COLOR (Normal Blue, Metal Blue)]
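
Slicing can be illustrated with a minimal Python sketch. The cube is again a dictionary keyed by (model, color, dealer), populated with a few cells from the relational Sales Volumes table shown earlier; `slice_cube` and the `DIMENSIONS` tuple are hypothetical names introduced for the example.

```python
DIMENSIONS = ("model", "color", "dealer")

cube = {
    ("MINI VAN", "BLUE", "Clyde"): 6,
    ("MINI VAN", "BLUE", "Carr"): 2,
    ("MINI VAN", "RED", "Clyde"): 5,
    ("SPORTS COUPE", "BLUE", "Clyde"): 3,
    ("SPORTS COUPE", "BLUE", "Carr"): 3,
    ("SEDAN", "BLUE", "Gleason"): 3,
}


def slice_cube(cube, **allowed):
    """Keep only cells whose member on each named dimension is in the allowed set."""
    positions = {name: DIMENSIONS.index(name) for name in allowed}
    return {
        coords: value
        for coords, value in cube.items()
        if all(coords[pos] in allowed[name] for name, pos in positions.items())
    }


# Slice: blue cars sold by Clyde or Carr, across all models.
blue_slice = slice_cube(cube, color={"BLUE"}, dealer={"Clyde", "Carr"})
print(blue_slice)
```

Each keyword argument fixes one dimension to a subset of its members; dimensions that are not mentioned stay unrestricted.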

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION

- REGION: Midwest
- DISTRICT: Chicago, St. Louis, Gary
- DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton

Sales at Region / District / Dealership level

Moving up and moving down in a hierarchy is referred to as drill-up (roll-up) and drill-down, respectively.
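
Roll-up along such a hierarchy amounts to summing child members into their parent. The sketch below is an illustration under loud assumptions: the slide does not say which dealership belongs to which district, so the `DEALER_TO_DISTRICT` mapping and the sales figures are invented for the example.

```python
# Assumption: this dealership-to-district assignment is hypothetical.
DEALER_TO_DISTRICT = {
    "Clyde": "Chicago",
    "Gleason": "Chicago",
    "Carr": "St. Louis",
    "Levi": "St. Louis",
    "Lucas": "Gary",
    "Bolton": "Gary",
}
DISTRICT_TO_REGION = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}

# Hypothetical sales volumes at the lowest (dealership) level.
dealer_sales = {"Clyde": 10, "Gleason": 5, "Carr": 7, "Levi": 3, "Lucas": 4, "Bolton": 6}


def roll_up(sales, parent_of):
    """Aggregate sales one level up the hierarchy."""
    totals = {}
    for member, value in sales.items():
        parent = parent_of[member]
        totals[parent] = totals.get(parent, 0) + value
    return totals


district_sales = roll_up(dealer_sales, DEALER_TO_DISTRICT)  # drill-up to district
region_sales = roll_up(district_sales, DISTRICT_TO_REGION)  # drill-up to region
print(district_sales)
print(region_sales)
```

Drill-down is the reverse navigation: starting from the region total and showing the district or dealership figures that make it up.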


OLAP Reporting - Drill Down

Inflows (Region, Year)

[Chart: Inflows ($M), scale 0 to 200, for Year 1999 and Year 2000, with one series per region: East, West, Central]


OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999)

[Chart: Inflows ($M), scale 0 to 90, for the 1st to 4th quarters of Year 1999, with one series per region: East, West, Central]

Drill-down from Year to Quarter



OLAP Reporting - Drill Down

Inflows (Region, Year - Year 1999 - 1st Qtr)

[Chart: Inflows ($M), scale 0 to 20, for January, February and March of Year 1999, with one series per region: East, West, Central]

Drill-down from Quarter to Month


Implementation Techniques -OLAP Architectures

MOLAP - Multidimensional OLAP
- Multidimensional databases provide both the database and the application logic layer.

ROLAP - Relational OLAP
- Accesses data stored in a relational data warehouse for OLAP analysis. Database and application logic are provided as separate layers.

HOLAP - Hybrid OLAP
- The OLAP server routes queries first to the MDDB, then to the RDBMS, and the result is processed on the fly in the server.

DOLAP - Desktop OLAP
- Personal MDDB server and application on the desktop.
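
The HOLAP routing idea can be sketched as a toy Python class. This is an illustration only, not how any OLAP server is implemented: `HolapServer` is a hypothetical name, a dictionary of pre-aggregated totals stands in for the MDDB, and a list of transaction-level rows stands in for the RDBMS.

```python
class HolapServer:
    """Toy router: answer from the summary cube when possible, else from detail rows."""

    def __init__(self, summary_cube, detail_rows):
        self.summary_cube = summary_cube  # stands in for the MDDB (pre-aggregated)
        self.detail_rows = detail_rows    # stands in for the RDBMS (transaction level)

    def total_volume(self, model, color):
        key = (model, color)
        if key in self.summary_cube:
            return self.summary_cube[key], "MDDB"
        # Not pre-aggregated: compute the total on the fly from the detail rows.
        total = sum(vol for m, c, _dealer, vol in self.detail_rows if (m, c) == key)
        return total, "RDBMS"


detail_rows = [
    ("MINI VAN", "BLUE", "Clyde", 6),
    ("MINI VAN", "BLUE", "Gleason", 3),
    ("MINI VAN", "BLUE", "Carr", 2),
    ("MINI VAN", "RED", "Clyde", 5),
    ("MINI VAN", "RED", "Gleason", 3),
    ("MINI VAN", "RED", "Carr", 1),
]
# Only the BLUE total has been summarized into the cube.
server = HolapServer({("MINI VAN", "BLUE"): 11}, detail_rows)
print(server.total_volume("MINI VAN", "BLUE"))
print(server.total_volume("MINI VAN", "RED"))
```

The caller receives the same kind of answer either way, which reflects the point that the source of the data stays transparent to the end user.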


MOLAP - MDDB storage

[Diagram: components shown are the OLAP Cube, the OLAP Calculation Engine, and the clients: Web Browser, OLAP Tools and OLAP Applications]



MOLAP - Features

- Powerful analytical capabilities (e.g., financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for
  - maximum query performance
  - optimum space utilization

ROLAP - Standard SQL storage

[Diagram: components shown are the Relational DW, an MDDB - Relational Mapping layer, SQL access, the OLAP Calculation Engine, and the clients: Web Browser, OLAP Tools and OLAP Applications]

ROLAP - Features
- Three-tier hardware/software architecture:
  - GUI on client; multidimensional processing on mid-tier server; target database on database server
  - Processing split between mid-tier and database servers
- Ad hoc query capabilities to very large databases
- DW integration
- Data scalability


HOLAP - Combination of RDBMS and MDDB


[Diagram: components shown are the OLAP Cube, the Relational DW accessed through SQL, the OLAP Calculation Engine, and any client: Web Browser, OLAP Tools and OLAP Applications]

HOLAP - Features

- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of RDBMS and MDDB performance
- Calculation engine provides full analysis features
- Source of data transparent to end user


Architecture Comparison

Definition
- MOLAP: MDDB OLAP = transaction-level data + summary in MDDB
- ROLAP: Relational OLAP = transaction-level data + summary in RDBMS
- HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to sparsity
- MOLAP: 3 - 10 times with good design
- ROLAP: no sparsity
- HOLAP: sparsity exists only in the MDDB part

Data explosion due to summarization
- MOLAP: high (may go beyond control; estimation is very important)
- ROLAP: to the necessary extent
- HOLAP: to the necessary extent

Query execution speed
- MOLAP: fast (depends upon the size of the MDDB)
- ROLAP: slow
- HOLAP: optimum; if the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP

Cost
- MOLAP: medium (MDDB server + large disk space cost)
- ROLAP: low (only RDBMS + disk space cost)
- HOLAP: high (RDBMS + disk space + MDDB server cost)

Where to apply?
- MOLAP: small transactional data + complex model + frequent summary analysis
- ROLAP: very large transactional data that needs to be viewed / sorted
- HOLAP: large transactional data + frequent summary analysis


Representative OLAP Tools:

- Oracle Express Products
- Hyperion Essbase
- Cognos - PowerPlay
- Seagate - Holos
- SAS
- MicroStrategy - DSS Agent
- Informix MetaCube
- Brio Query
- Business Objects / Web Intelligence


Sample OLAP Applications

- Sales Analysis
- Financial Analysis
- Profitability Analysis
- Performance Analysis
- Risk Management
- Profiling & Segmentation
- Scorecard Application
- NPA Management
- Strategic Planning
- Customer Relationship Management (CRM)


Components of Warehouse
- Source tables: real-time, volatile data in relational databases used for transaction processing (OLTP). The sources can be any relational databases or flat files.
- ETL tools: extract, cleanse, transform (aggregates, joins) and load the data from the sources to the target.
- Maintenance and administration tools: authorize and monitor access to the data, set up users, and schedule jobs to run during off-peak periods.
- Modeling tools: used to design the data warehouse for high performance using the dimensional data modeling technique, and to map the source and target files.
- Databases: target databases and data marts, which are part of the data warehouse. These are structured for analysis and reporting purposes.
- End-user tools for analysis and reporting: produce reports and analyze the data from the target tables. Different types of querying, data mining and OLAP tools are used for this purpose.


Data Warehouse Architecture


This is a basic design, in which source files are loaded into a warehouse and users query the data for different purposes.

This design adds a staging area, where the data is loaded and tested after cleansing and transformation. It is then loaded directly into the target database/warehouse, which is divided into data marts that different users can access for their reporting and analysis purposes.


Data Modeling
Effective way of using a Data Warehouse


Data Modeling
The E-R data model is commonly used in OLTP; the dimensional data model is commonly used in OLAP.

E-R (Entity-Relationship) Data Model
- Entity: an object that can be observed and classified based on its properties and characteristics, such as employee, book, or student.
- Relationship: relates entities to other entities.

Different perspectives of data modeling:
- Conceptual Data Model
- Logical Data Model
- Physical Data Model

Types of dimensional data models most commonly used:
- Star Schema
- Snowflake Schema

Terms used in Dimensional Data Model


To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:
- Dimension: a category of information. For example, the Time dimension.
- Attribute: a unique level within a dimension. For example, Month is an attribute in the Time dimension.
- Hierarchy: the specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year -> Quarter -> Month -> Day.
- Fact table: a table that contains the measures of interest.
- Lookup table: provides detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse.
- Surrogate keys: used to avoid data integrity problems. They are helpful for slowly changing dimensions and act as index/primary keys.

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.

Star Schema
Dimension Table: product

prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store

storeId  city
c1       nyc
c2       sfo
c3       la

Fact Table: sale

orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension Table: customer

custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
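
A typical star-schema query joins the fact table to one dimension table and aggregates the measures. The sketch below loads the slide's sample rows into an in-memory SQLite database using Python's standard `sqlite3` module; the column types and the two query functions are assumptions made for the illustration.

```python
import sqlite3


def build_star_schema():
    """Create the sample star schema in memory and load the slide's rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE product (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
        CREATE TABLE store (storeId TEXT PRIMARY KEY, city TEXT);
        CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
        CREATE TABLE sale (orderId TEXT, date TEXT, custId INTEGER, prodId TEXT,
                           storeId TEXT, qty INTEGER, amt INTEGER);
    """)
    conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                     [("p1", "bolt", 10), ("p2", "nut", 5)])
    conn.executemany("INSERT INTO store VALUES (?, ?)",
                     [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)",
                     [(53, "joe", "10 main", "sfo"), (81, "fred", "12 main", "sfo"),
                      (111, "sally", "80 willow", "la")])
    conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?, ?, ?)",
                     [("o100", "1/7/97", 53, "p1", "c1", 1, 12),
                      ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
                      ("105", "3/8/97", 111, "p1", "c3", 5, 50)])
    return conn


def sales_by_store_city(conn):
    """Join the fact table to the store dimension and total qty and amt per city."""
    return conn.execute("""
        SELECT store.city, SUM(sale.qty), SUM(sale.amt)
        FROM sale JOIN store ON sale.storeId = store.storeId
        GROUP BY store.city ORDER BY store.city
    """).fetchall()


def sales_by_product(conn):
    """Join the fact table to the product dimension and total qty and amt per product."""
    return conn.execute("""
        SELECT product.name, SUM(sale.qty), SUM(sale.amt)
        FROM sale JOIN product ON sale.prodId = product.prodId
        GROUP BY product.name ORDER BY product.name
    """).fetchall()


conn = build_star_schema()
print(sales_by_store_city(conn))
print(sales_by_product(conn))
```

Each query needs only one join from the fact table to the dimension of interest, which is the retrieval pattern the star layout is designed around.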


Snowflake Schema
store

storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

sType

tId  size   location
t1   small  downtown
t2   large  suburbs

city

cityId  pop  regId
sfo     1M   north
la      5M   south

region

regId  name
north  cold region
south  warm region

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schemas are not normalized much, and are frequently designed at a level of normalization short of third normal form.


Overview of Data Cleansing


The Need For Data Quality

- Difficulty in decision making
- Time delays in operation
- Organizational mistrust
- Data ownership conflicts
- Customer attrition
- Costs associated with
  - error detection
  - error rework
  - customer service
  - fixing customer problems


Six Steps To Data Quality


1. Understand Information Flow In Organization
   - Identify authoritative data sources
2. Identify Potential Problem Areas & Assess Impact
   - Interview employees & customers
   - Data entry points
   - Cost of bad data
3. Measure Quality Of Data
   - Use business rule discovery tools to identify data with inconsistent, missing, incomplete, duplicate or incorrect values
4. Clean & Load Data
   - Use data cleansing tools to clean data at the source
   - Load only clean data into the data warehouse
5. Continuous Monitoring
   - Schedule periodic cleansing of source data
6. Identify Areas of Improvement
   - Identify & correct cause of defects
   - Refine data capture mechanisms at source
   - Educate users on importance of DQ

Data Quality Solution


Customized Programs
- Strengths:
  - Addresses specific needs
  - No bulky one-time investment
- Limitations:
  - Tons of custom programs in different environments are difficult to manage
  - Minor alterations demand coding efforts

Data Quality Assessment Tools
- Strength: provide automated assessment
- Limitation: no measure of data accuracy


Data Quality Solution


Business Rule Discovery Tools
- Strengths:
  - Detect correlation in data values
  - Can detect patterns of behavior that indicate fraud
- Limitations:
  - Not all variables can be discovered
  - Some discovered rules might not be pertinent
  - There may be performance problems with large files or with many fields

Data Reengineering & Cleansing Tools
- Strengths:
  - Usually integrated packages with cleansing features as an add-on
- Limitations:
  - Error prevention at source is usually absent
  - The ETL tools have limited cleansing facilities

Tools In The Market

- Business Rule Discovery Tools
  - Integrity Data Reengineering Tool from Vality Technology
  - Trillium Software System from Harte-Hanks Data Technologies
  - Migration Architect from DB Star
- Data Reengineering & Cleansing Tools
  - Carlton Pureview from Oracle
  - ETI-Extract from Evolutionary Technologies
  - PowerMart from Informatica Corp
  - Sagent Data Mart from Sagent Technology
- Data Quality Assessment Tools
  - Migration Architect, Evoke Axio from Evoke Software
  - Wizrule from Wizsoft
- Name & Address Cleansing Tools
  - Centrus Suite from Sagent
  - I.d.centric from First Logic


Data Extraction, Transformation, Load


ETL Architecture

[Diagram: ETL architecture.
Stages, left to right: Data Collection, Data Extraction, Data Transformation, Data Loading, Data Storage & Integration.
Components shown: Visitors, Web Browsers, The Internet; External Data (Demographics, Household, Webographics, Income); Web Server Logs & E-comm Transaction Data; Flat Files; RDBMS; Other OLTP Systems; Scheduled Extraction; Staging Area (Clean, Transform, Match, Merge); Meta Data Repository; Scheduled Loading; Enterprise Data Warehouse.]


ETL Architecture
Data extraction
- Scans through a file or database
- Uses some criteria for selection
- Identifies qualified data
- Transports the data onto another file or database

Data transformation
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data

Data extraction cleanup
- Restructuring of records or fields
- Removal of operational-only data
- Supply of missing field values
- Data integrity checks
- Data consistency and range checks, etc.

Data loading
- Initial and incremental loading
- Updating of metadata
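
The four activities above can be strung together in a minimal Python sketch. It is an illustration under stated assumptions, not a description of any ETL tool: the field names, the `GENDER_CODES` mapping, the selection criterion and the sample rows are all hypothetical, and the target warehouse is simply a dictionary keyed by `id`.

```python
from datetime import date

GENDER_CODES = {"M": "Male", "F": "Female"}  # hypothetical code mapping


def extract(source_rows, since):
    """Select qualifying rows: here, those updated on or after `since`."""
    return [row for row in source_rows if row["updated"] >= since]


def clean(rows, default_region="UNKNOWN"):
    """Range-check quantities and supply missing field values."""
    cleaned = []
    for row in rows:
        if row["qty"] <= 0:  # range check: drop impossible quantities
            continue
        cleaned.append({**row, "region": row.get("region") or default_region})
    return cleaned


def transform(rows, load_date):
    """Change codes, calculate a derived value and add a time attribute."""
    return [
        {
            "id": row["id"],
            "gender": GENDER_CODES.get(row["gender"], "Unknown"),
            "region": row["region"],
            "amount": row["qty"] * row["unit_price"],
            "load_date": load_date,
        }
        for row in rows
    ]


def load(target, rows):
    """Insert new keys and overwrite existing ones; returns (inserted, updated)."""
    inserted = updated = 0
    for row in rows:
        if row["id"] in target:
            updated += 1
        else:
            inserted += 1
        target[row["id"]] = row
    return inserted, updated


source = [
    {"id": 1, "gender": "M", "region": None, "qty": 2, "unit_price": 5, "updated": date(2012, 1, 10)},
    {"id": 2, "gender": "F", "region": "north", "qty": -1, "unit_price": 5, "updated": date(2012, 1, 12)},
    {"id": 3, "gender": "F", "region": "south", "qty": 1, "unit_price": 5, "updated": date(2011, 12, 1)},
]
warehouse = {}
staged = transform(clean(extract(source, since=date(2012, 1, 1))), load_date=date(2012, 1, 13))
initial = load(warehouse, staged)      # initial load inserts the new key
incremental = load(warehouse, staged)  # re-running the same batch only updates
print(initial, incremental)
```

Keeping each stage as a separate function mirrors the extract, cleanup, transform and load split and lets each stage be unit tested on its own.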


Why ETL ?
- Companies have valuable data lying around throughout their networks that needs to be moved from one place to another.
- The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats.
- To solve the problem, companies use extract, transform and load (ETL) software.
- The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.



Major components involved in ETL Processing

- Design manager: lets developers define source-to-target mappings, transformations, process flows, and jobs.
- Meta data management: provides a repository to define, document, and manage information about the ETL design and runtime processes.
- Extract: the process of reading data from a database.
- Transform: the process of converting the extracted data.
- Load: the process of writing the data into the target database.
- Transport services: ETL tools use network and file protocols to move data between source and target systems, and in-memory protocols to move data between ETL run-time components.
- Administration and operation: ETL utilities let administrators schedule, run and monitor ETL jobs, log all events, manage errors, recover from failures, and reconcile outputs with source systems.

ETL Tools
- Provide a facility to specify a large number of transformation rules with a GUI
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second-Generation
- PowerCenter/PowerMart from Informatica
- Data Mart Solution from Sagent Technology
- DataStage from Ascential
436

2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential

Metadata Management


What Is Metadata?
Metadata is information:
- That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse
- About the data being captured and loaded into the warehouse
- Documented in IT tools that improve both business and technical understanding of data and data-related processes

Importance Of Metadata
Locating information
- How much time is spent looking for information?
- How often is the information found?
- What poor decisions were made based on incomplete information?
- How much money was lost or earned as a result?

Interpreting information
- How many times have businesses needed to rework or recall products?
- What impact does it have on the bottom line?
- How many mistakes were due to misinterpretation of existing documentation?
- How much interpretation results from too much metadata?
- How much time is spent trying to determine whether any of the metadata is accurate?

Integrating information
- How do the various data perspectives connect together?
- How much time is spent trying to figure that out?
- How much do inefficiency and lack of metadata affect decision making?

Requirements for DW Metadata Management
- Provide a simple catalogue of business metadata descriptions and views
- Document/manage metadata descriptions from an integrated development environment
- Enable DW users to identify and invoke pre-built queries against the data stores
- Design and enhance new data models and schemas for the data warehouse
- Capture data transformation rules between the operational and data warehousing databases
- Provide change impact analysis, and update across these technologies
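Two of these requirements, capturing transformation rules and providing change impact analysis, can be sketched with a very small in-memory catalogue. The `Mapping` class, the column names and the rules below are hypothetical and serve only to show the idea; a real metadata repository would hold far more detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """One source-to-target rule captured in the metadata repository."""
    source: str  # operational column
    target: str  # data warehouse column
    rule: str    # transformation rule in plain words


# Hypothetical catalogue entries (all names are illustrative).
CATALOGUE = [
    Mapping("crm.customer.dob", "dw.dim_customer.birth_date", "cast to DATE"),
    Mapping("crm.customer.dob", "dw.dim_customer.age_band", "derive band from age"),
    Mapping("erp.orders.amt", "dw.fact_sales.amount", "convert to USD"),
]


def impacted_targets(catalogue, source_column):
    """Change impact analysis: warehouse columns that depend on a source column."""
    return sorted(m.target for m in catalogue if m.source == source_column)


print(impacted_targets(CATALOGUE, "crm.customer.dob"))
```

Because every rule is stored as data, the question "what breaks if this source column changes?" becomes a simple lookup rather than a code review.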

Consumers of Metadata
- Technical users
  - Warehouse administrator
  - Application developer
- Business users (business metadata)
  - Meanings
  - Definitions
  - Business rules
- Software tools
  - Used in DW life-cycle development
  - Metadata requirements for each tool must be identified
  - The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
  - Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool

Trends in the Metadata Management Tools
Third Party Bridging Tools
- Oracle Exchange
  - Technology of choice for a long list of repository, enterprise and workgroup vendors
- Reischmann-Informatik-Toolbus
  - Features include facilitation of selective bridging of metadata
- Ardent Software / Dovetail Software - Interplay
  - Hub-and-spoke solution for enabling metadata interoperability
  - Ardent is focussing on its own engagements, not selling it as an independent product
- Informix's Metadata Plug-ins
  - Available with Ardent DataStage version 3.6.2 free of cost for ERwin, Oracle Designer, Sybase PowerDesigner, Brio, MicroStrategy

Trends in the Metadata Management Tools
Metadata Repositories
- IBM, Oracle and Microsoft to offer free or near-free basic repository services
- Enable organisations to reuse metadata across technologies
- Integrate DB design, data transformation and BI tools from different vendors
- Multi-tool vendors are taking a bridged or federated rather than integrated approach to sharing metadata
- Both IBM and Oracle have multiple repositories for different lines of products, e.g., one for AD and one for DW, with bridges between them

Trends in the Metadata Management Tools
Metadata Interchange Standards
- CDIF (CASE Data Interchange Format)
  - Most frequently used interchange standard
  - Addresses only a limited subset of metadata artifacts
- OMG (Object Management Group) - CWM
  - XML addresses context and data meaning, not presentation
  - Can enable exchange over the web employing industry standards for storing and sharing programming data
  - Will allow sharing of UML and MOF objects between various development tools and repositories
- MDC (Metadata Coalition)
  - Based on XML/UML standards
  - Promoted by Microsoft along with 20 partners including Object Management Group (OMG), Oracle Carleton Group, CA-PLATINUM Technology (Founding Member), Viasoft

OLAP


Agenda
- OLAP Definition
- Distinction between OLTP and OLAP
- MDDB Concepts
- Implementation Techniques
- Architectures
- Features
- Representative Tools

OLAP: On-Line Analytical Processing
- OLAP can be defined as a technology that allows users to view aggregate data across measurements (like Maturity Amount, Interest Rate, etc.) along with a set of related parameters called dimensions (like Product, Organization, Customer, etc.)
- The term is often used interchangeably with BI
- A multidimensional view of data is the foundation of OLAP
- Users: analysts, decision makers
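The idea of viewing an aggregated measure along chosen dimensions can be sketched in a few lines. The fact rows, member names and amounts below are hypothetical; only the dimension names (Product, Organization, Customer) and the measure (Maturity Amount) come from the definition above.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, organization, customer, maturity_amount)
FACTS = [
    ("Term Deposit", "North", "C1", 1000.0),
    ("Term Deposit", "South", "C2", 500.0),
    ("Savings", "North", "C1", 200.0),
    ("Savings", "North", "C3", 300.0),
]
DIMENSIONS = ("product", "organization", "customer")


def aggregate(facts, by):
    """Sum the measure, grouping by the dimensions named in `by`."""
    positions = [DIMENSIONS.index(dim) for dim in by]
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[pos] for pos in positions)
        totals[key] += row[-1]  # the measure is the last field
    return dict(totals)


by_product = aggregate(FACTS, by=("product",))
by_product_org = aggregate(FACTS, by=("product", "organization"))
print(by_product)
print(by_product_org)
```

The same facts answer different questions simply by changing the `by` argument, which is the essence of a multidimensional view.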

Distinction between OLTP and OLAP

Source of data
- OLTP System: operational data; OLTPs are the original source of the data
- OLAP System: consolidation data; OLAP data comes from the various OLTP databases

Purpose of data
- OLTP System: to control and run fundamental business tasks
- OLAP System: decision support

What the data reveals
- OLTP System: a snapshot of ongoing business processes
- OLAP System: multi-dimensional views of various kinds of business activities

Inserts and updates
- OLTP System: short and fast inserts and updates initiated by end users
- OLAP System: periodic long-running batch jobs refresh the data

MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is
- intimately related and
- stored, viewed and analyzed from different perspectives (dimensions).

A hypercube represents a collection of multidimensional data.
- The edges of the cube are called dimensions
- Individual items within each dimension are called members
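A hypercube with dimensions and members can be sketched as a dictionary keyed by one member per dimension. The dimension and member names follow the deck's sales-volume example; the `cell` helper and the choice of a dictionary as storage are assumptions made for illustration, not how any particular MDDB product stores data.

```python
# Dimensions are the edges of the cube; each lists its members.
DIMENSIONS = {
    "model": ("Mini Van", "Coupe", "Sedan"),
    "color": ("Blue", "Red", "White"),
    "dealership": ("Clyde", "Gleason", "Carr"),
}

# One entry per populated cell, keyed by (model, color, dealership).
cube = {
    ("Mini Van", "Blue", "Clyde"): 6,
    ("Mini Van", "Blue", "Gleason"): 3,
    ("Mini Van", "Blue", "Carr"): 2,
}


def cell(cube, **members):
    """Return the value for one member per dimension, or None if the cell is blank."""
    key = []
    for dim, allowed in DIMENSIONS.items():
        member = members[dim]
        if member not in allowed:
            raise KeyError(f"{member!r} is not a member of dimension {dim!r}")
        key.append(member)
    return cube.get(tuple(key))


print(cell(cube, model="Mini Van", color="Blue", dealership="Clyde"))
```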

RDBMS v/s MDDB: Increased Complexity...

Relational DBMS (27 rows x 4 columns = 108 cells)

MODEL         COLOR  DEALER   VOL.
MINI VAN      BLUE   Clyde    6
MINI VAN      BLUE   Gleason  3
MINI VAN      BLUE   Carr     2
MINI VAN      RED    Clyde    5
MINI VAN      RED    Gleason  3
MINI VAN      RED    Carr     1
MINI VAN      WHITE  Clyde    3
MINI VAN      WHITE  Gleason  1
MINI VAN      WHITE  Carr     4
SPORTS COUPE  BLUE   Clyde    3
SPORTS COUPE  BLUE   Gleason  3
SPORTS COUPE  BLUE   Carr     3
SPORTS COUPE  RED    Clyde    4
SPORTS COUPE  RED    Gleason  3
SPORTS COUPE  RED    Carr     6
SPORTS COUPE  WHITE  Clyde    2
SPORTS COUPE  WHITE  Gleason  3
SPORTS COUPE  WHITE  Carr     5
SEDAN         BLUE   Clyde    4
SEDAN         BLUE   Gleason  3
SEDAN         BLUE   Carr     2
...

MDDB (3 x 3 x 3 = 27 cells)

[Figure: Sales Volumes cube with MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White) and DEALERSHIP (Clyde, Gleason, Carr) as its dimensions]
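The two cell counts on this slide follow from simple arithmetic, sketched below. The function names are hypothetical; the member counts (three models, three colors, three dealerships) are taken from the slide.

```python
from math import prod


def mddb_cells(member_counts):
    """Cells in a cube: the product of the member counts of its dimensions."""
    return prod(member_counts)


def relational_cells(member_counts):
    """Cells in the equivalent table: one row per combination,
    one column per dimension plus one column for the measure."""
    rows = prod(member_counts)
    columns = len(member_counts) + 1
    return rows * columns


counts = [3, 3, 3]  # models, colors, dealerships
print(mddb_cells(counts))        # 3 x 3 x 3
print(relational_cells(counts))  # 27 rows x 4 columns
```

The relational form repeats the dimension values on every row, which is why it needs four times as many cells for the same 27 volumes.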

Benefits of MDDB over RDBMS
- Ease of data presentation and navigation
  - A great deal of information is gleaned immediately upon direct inspection of the array
  - The user is able to view data along presorted dimensions, with data arranged in an inherently more organized and accessible fashion than the one offered by the relational table
- Storage space
  - Very low space consumption compared to a relational DB
- Performance
  - Gives much better performance. A relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad-hoc queries
- Ease of maintenance
  - No overhead, as data is stored in the same way it is viewed. In a relational DB, indexes, sophisticated joins, etc. are used, which require considerable storage and maintenance

Issues with MDDB
- Sparsity
  - Input data in applications is typically sparse
  - Increases with increased dimensions
- Data explosion
  - Due to sparsity
  - Due to summarization
- Performance
  - Does not perform better than an RDBMS at high data volumes (>20-30 GB)

Issues with MDDB - Sparsity Example

If dimension members of different dimensions do not interact, then a blank cell is left behind.

Employee Age

LAST NAME  EMP#  AGE
SMITH      01    21
REGAN      12    19
FOX        31    63
WELD       14    31
KELLY      54    27
LINK       03    56
KRANZ      41    45
LUCAS      33    41
WEISS      23    19

[Figure: the same data as a LAST NAME x EMPLOYEE # array holding the age; each last name has a value under exactly one employee number and every other cell is blank]

Sales Volumes (for contrast, every cell is filled)

MODEL      Blue  Red  White
Mini Van   6     5    4
Coupe      3     5    5
Sedan      4     3    2
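The degree of sparsity in the employee example can be worked out directly. The sketch below uses the nine employees listed on the slide; treating LAST NAME and EMPLOYEE # as the two dimensions follows the figure's axis labels, and the variable names are illustrative.

```python
# (last name, employee number, age) as listed on the slide
EMPLOYEES = [
    ("Smith", "01", 21), ("Regan", "12", 19), ("Fox", "31", 63),
    ("Weld", "14", 31), ("Kelly", "54", 27), ("Link", "03", 56),
    ("Kranz", "41", 45), ("Lucas", "33", 41), ("Weiss", "23", 19),
]

last_names = {name for name, _, _ in EMPLOYEES}
emp_numbers = {number for _, number, _ in EMPLOYEES}

total_cells = len(last_names) * len(emp_numbers)  # size of the 2-D array
populated = len(EMPLOYEES)                        # one age per employee
sparsity = 1 - populated / total_cells            # share of blank cells

print(total_cells, populated, round(sparsity, 3))
```

Each additional dimension whose members do not interact multiplies the total cell count while the populated count stays the same, which is why sparsity grows with the number of dimensions.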

OLAP Features
- Calculations applied across dimensions, through hierarchies and/or across members
- Trend analysis over sequential time periods
- What-if scenarios
- Slicing / dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through / drill-through to underlying detail data

Features of OLAP - Rotation

Complex queries and sorts in a relational environment translate to a simple rotation.

View #1: Sales Volumes, MODEL by COLOR

MODEL      Blue  Red  White
Mini Van   6     5    4
Coupe      3     5    5
Sedan      4     3    2

( ROTATE 90 degrees )

View #2: Sales Volumes, COLOR by MODEL

COLOR   Mini Van  Coupe  Sedan
Blue    6         3      4
Red     5         5      3
White   4         5      2

A 2-dimensional array has 2 views.

Features of OLAP - Rotation

[Figure: six views (View #1 to View #6) of the Sales Volumes cube, each obtained by a 90-degree rotation that puts the MODEL, COLOR and DEALERSHIP dimensions on different axes]

A 3-dimensional array has 6 views.
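The view counts on the rotation slides (2 views for 2 dimensions, 6 views for 3) are the number of ways to order the axes, that is n! for n dimensions. The sketch below enumerates the axis orderings and rotates the 2-dimensional sales-volume view by swapping rows and columns; the function names are illustrative.

```python
from itertools import permutations
from math import factorial


def views(axes):
    """Every ordering of the axes is one view of the array."""
    return list(permutations(axes))


def rotate(table):
    """Swap rows and columns of a 2-dimensional view."""
    return [list(column) for column in zip(*table)]


# View #1: rows are models (Mini Van, Coupe, Sedan),
# columns are colors (Blue, Red, White).
view1 = [[6, 5, 4], [3, 5, 5], [4, 3, 2]]
view2 = rotate(view1)  # rows are now colors, columns are models

print(view2)
print(len(views(("MODEL", "COLOR"))))
print(len(views(("MODEL", "COLOR", "DEALERSHIP"))))
```

No data is re-queried or re-sorted in a rotation; only the order in which the axes are presented changes.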

Features of OLAP - Slicing / Filtering
- MDDB allows the end user to quickly slice in on the exact view of the data required.

[Figure: the Sales Volumes cube sliced down to MODEL = Mini Van, Coupe; COLOR = Normal Blue, Metal Blue; DEALERSHIP = Carr, Clyde]
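Slicing can be sketched as filtering the cube's cells by a set of allowed members per dimension. The member names match the figure on this slide; the volumes and the `slice_cube` helper are hypothetical.

```python
DIMS = ("model", "color", "dealership")

# Hypothetical cells keyed by (model, color, dealership).
cube = {
    ("Mini Van", "Normal Blue", "Carr"): 2,
    ("Mini Van", "Metal Blue", "Clyde"): 4,
    ("Coupe", "Normal Blue", "Clyde"): 3,
    ("Sedan", "Red", "Gleason"): 5,
    ("Coupe", "White", "Carr"): 1,
}


def slice_cube(cube, **keep):
    """Keep only cells whose member on each named dimension is allowed."""
    positions = {DIMS.index(dim): set(allowed) for dim, allowed in keep.items()}
    return {
        key: value
        for key, value in cube.items()
        if all(key[pos] in allowed for pos, allowed in positions.items())
    }


blue_slice = slice_cube(
    cube,
    model={"Mini Van", "Coupe"},
    color={"Normal Blue", "Metal Blue"},
    dealership={"Carr", "Clyde"},
)
print(blue_slice)
```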

Features of OLAP - Drill Down / Up

ORGANIZATION DIMENSION
- REGION: Midwest
- DISTRICT: Chicago, St. Louis, Gary
- DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton

Sales can be viewed at the Region / District / Dealership level.

Moving up and moving down in a hierarchy is referred to as drill-up / roll-up and drill-down.
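Roll-up and drill-down along the organization hierarchy can be sketched as follows. The level members come from the slide, but the assignment of dealerships to districts and all sales figures are assumptions made for this example, since the slide text does not state them.

```python
# ASSUMED parent links: the slide does not say which dealership
# belongs to which district.
DISTRICT_OF = {
    "Clyde": "Chicago", "Gleason": "Chicago",
    "Carr": "St. Louis", "Levi": "St. Louis",
    "Lucas": "Gary", "Bolton": "Gary",
}
REGION_OF = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}

# Hypothetical sales per dealership.
SALES = {"Clyde": 10, "Gleason": 20, "Carr": 5, "Levi": 15, "Lucas": 30, "Bolton": 20}


def roll_up(values, parent_of):
    """Drill-up: sum child values into their parent level."""
    totals = {}
    for child, value in values.items():
        parent = parent_of[child]
        totals[parent] = totals.get(parent, 0) + value
    return totals


def drill_down(parent, values, parent_of):
    """Drill-down: show the child values that make up one parent."""
    return {c: v for c, v in values.items() if parent_of[c] == parent}


by_district = roll_up(SALES, DISTRICT_OF)
by_region = roll_up(by_district, REGION_OF)
print(by_region)
print(drill_down("Chicago", SALES, DISTRICT_OF))
```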

OLAP Reporting - Drill Down

[Chart: Inflows ($M) by Region (East, West, Central) and Year (1999, 2000)]

Drill-down from Year to Quarter

[Chart: Inflows ($M) by Region (East, West, Central) for Year 1999, 1st Qtr to 4th Qtr]

Drill-down from Quarter to Month

[Chart: Inflows ($M) by Region (East, West, Central) for Year 1999, 1st Qtr: January, February, March]

Implementation Techniques - OLAP Architectures
- MOLAP - Multidimensional OLAP
  - Multidimensional databases for the database and application logic layer
- ROLAP - Relational OLAP
  - Accesses data stored in a relational data warehouse for OLAP analysis. Database and application logic are provided as separate layers
- HOLAP - Hybrid OLAP
  - The OLAP server routes queries first to the MDDB, then to the RDBMS, and the result is processed on-the-fly in the server
- DOLAP - Desktop OLAP
  - Personal MDDB server and application on the desktop

MOLAP - MDDB storage

[Figure: an OLAP cube feeds the OLAP calculation engine, which serves web browsers, OLAP tools and OLAP applications]

MOLAP - Features
- Powerful analytical capabilities (e.g., financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for
  - Maximum query performance
  - Optimum space utilization

ROLAP - Standard SQL storage

[Figure: a relational DW is queried with SQL through an MDDB-relational mapping by the OLAP calculation engine, which serves web browsers, OLAP tools and OLAP applications]

ROLAP - Features
- Three-tier hardware/software architecture:
  - GUI on client; multidimensional processing on mid-tier server; target database on database server
  - Processing split between mid-tier and database servers
- Ad hoc query capabilities to very large databases
- DW integration
- Data scalability
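In a ROLAP setup the mid-tier translates a multidimensional request into SQL against the relational warehouse. The sketch below shows one possible form of that mapping with an in-memory SQLite table; the table name `sales`, the `build_query` helper and the use of SQLite are assumptions for illustration, while the rows are the Mini Van rows from the deck's sales-volume example.

```python
import sqlite3

ROWS = [
    ("MINI VAN", "BLUE", "Clyde", 6), ("MINI VAN", "BLUE", "Gleason", 3),
    ("MINI VAN", "BLUE", "Carr", 2), ("MINI VAN", "RED", "Clyde", 5),
    ("MINI VAN", "RED", "Gleason", 3), ("MINI VAN", "RED", "Carr", 1),
    ("MINI VAN", "WHITE", "Clyde", 3), ("MINI VAN", "WHITE", "Gleason", 1),
    ("MINI VAN", "WHITE", "Carr", 4),
]
ALLOWED_DIMENSIONS = ("model", "color", "dealer")


def build_query(group_by):
    """Map a list of requested dimensions to a GROUP BY statement."""
    for dim in group_by:
        if dim not in ALLOWED_DIMENSIONS:  # never interpolate unknown names
            raise ValueError(f"unknown dimension: {dim}")
    cols = ", ".join(group_by)
    return f"SELECT {cols}, SUM(vol) FROM sales GROUP BY {cols} ORDER BY {cols}"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (model TEXT, color TEXT, dealer TEXT, vol INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", ROWS)

by_color = conn.execute(build_query(["color"])).fetchall()
by_model = conn.execute(build_query(["model"])).fetchall()
print(by_color)
print(by_model)
```

The aggregation work is pushed down to the database server, which matches the split of processing between mid-tier and database servers noted above.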

HOLAP - Combination of RDBMS and MDDB

[Figure: the OLAP calculation engine draws on an OLAP cube and, via SQL, on a relational DW, and serves any client: web browsers, OLAP tools and OLAP applications]

HOLAP - Features
- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of RDBMS and MDDB performance
- Calculation engine provides full analysis features
- Source of data transparent to end user
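The HOLAP routing idea (try the MDDB summary first, fall back to detailed relational data, and keep the source transparent to the user) can be sketched as below. The data, the `query` function and the two in-memory stores standing in for the MDDB and the RDBMS are all hypothetical.

```python
# Stand-in for the MDDB: pre-aggregated summaries keyed by (region, year).
SUMMARY_CUBE = {("Midwest", 1999): 120.0}

# Stand-in for the RDBMS: detailed rows (region, year, amount).
DETAIL_ROWS = [
    ("Midwest", 1999, 70.0),
    ("Midwest", 1999, 50.0),
    ("East", 1999, 30.0),
]


def query(region, year):
    """Answer from the cube when a summary exists, else aggregate the detail rows.

    Returns (total, source) so the routing decision can be inspected;
    an end user would only see the total.
    """
    key = (region, year)
    if key in SUMMARY_CUBE:
        return SUMMARY_CUBE[key], "MDDB"
    total = sum(amount for r, y, amount in DETAIL_ROWS if (r, y) == key)
    return total, "RDBMS"


print(query("Midwest", 1999))
print(query("East", 1999))
```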

Architecture Comparison

Definition
- MOLAP: MDDB OLAP = transaction-level data + summary in MDDB
- ROLAP: Relational OLAP = transaction-level data + summary in RDBMS
- HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to sparsity
- MOLAP: 3 to 10 times with good design
- ROLAP: no sparsity
- HOLAP: sparsity exists only in the MDDB part

Data explosion due to summarization
- MOLAP: high (may go beyond control; estimation is very important)
- ROLAP: to the necessary extent
- HOLAP: to the necessary extent

Query execution speed
- MOLAP: fast (depends upon the size of the MDDB)
- ROLAP: slow
- HOLAP: optimum; if the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP

Cost
- MOLAP: medium (MDDB server + large disk space cost)
- ROLAP: low (only RDBMS + disk space cost)
- HOLAP: high (RDBMS + disk space + MDDB server cost)

Where to apply?
- MOLAP: small transactional data + complex model + frequent summary analysis
- ROLAP: very large transactional data that needs to be viewed / sorted
- HOLAP: large transactional data + frequent summary analysis

Representative OLAP Tools:
- Oracle Express Products
- Hyperion Essbase
- Cognos - PowerPlay
- Seagate - Holos
- SAS
- MicroStrategy - DSS Agent
- Informix MetaCube
- Brio Query
- Business Objects / Web Intelligence

Sample OLAP Applications
- Sales Analysis
- Financial Analysis
- Profitability Analysis
- Performance Analysis
- Risk Management
- Profiling & Segmentation
- Scorecard Application
- NPA Management
- Strategic Planning
- Customer Relationship Management (CRM)

Data Warehouse Testing


Data Warehouse Testing Overview
- There is an exponentially increasing cost associated with finding software defects later in the development lifecycle. In data warehousing, this is compounded by the additional business costs of using incorrect data to make critical business decisions
- The methodology required for testing a data warehouse is different from that for testing a typical transaction system

Difference In Testing Data Warehouse and Transaction System
Data warehouse testing is different on the following counts:
- User-triggered vs. system-triggered
- Volume of test data
- Possible scenarios / test cases
- Programming for testing challenge

Difference In Testing Data Warehouse and Transaction System
- User-triggered vs. system-triggered
  - In a data warehouse, most of the testing is system-triggered.
  - Most production/source system testing covers the processing of individual transactions, which are driven by some input from the users (application form, servicing request). There, very few test cycles cover system-triggered scenarios (like billing or valuation).

Difference In Testing Data Warehouse and Transaction System
- Volume of test data
  - The test data in a transaction system is a very small sample of the overall production data. A data warehouse typically has large test data, as one tries to fill up the maximum possible combinations of dimensions and facts.
- Possible scenarios / test cases
  - In a data warehouse, the permutations and combinations one can possibly test are virtually unlimited, because the core objective of a data warehouse is to allow all possible views of data.

Difference In Testing Data Warehouse and Transaction System
- Programming for testing challenge
  - In transaction systems, users/business analysts typically test the output of the system.
  - In a data warehouse, most of the data quality testing and ETL testing is done at the back end by running separate stand-alone scripts. These scripts compare pre-transformation data to post-transformation data.
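A stand-alone back-end script of the kind described here might reconcile row counts, keys and a measure total between the source and the loaded data. The following is a minimal sketch under that assumption; the `reconcile` function, the field names and the sample rows are hypothetical.

```python
def reconcile(source_rows, target_rows, key, measure):
    """Compare pre-transformation (source) and post-transformation (target) data."""
    source_keys = {row[key] for row in source_rows}
    target_keys = {row[key] for row in target_rows}
    source_total = sum(row[measure] for row in source_rows)
    target_total = sum(row[measure] for row in target_rows)
    return {
        "row_count_match": len(source_rows) == len(target_rows),
        "missing_in_target": sorted(source_keys - target_keys),
        "unexpected_in_target": sorted(target_keys - source_keys),
        "measure_difference": round(source_total - target_total, 2),
    }


# Hypothetical extracts: one source row did not reach the target.
source = [
    {"id": 1, "amount": 100.0},
    {"id": 2, "amount": 50.0},
    {"id": 3, "amount": 25.5},
]
target = [
    {"id": 1, "amount": 100.0},
    {"id": 2, "amount": 50.0},
]

report = reconcile(source, target, key="id", measure="amount")
print(report)
```

A non-empty `missing_in_target` list or a non-zero `measure_difference` points the tester to the records to investigate.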

Data Warehouse Testing Process
Data warehouse testing is basically divided into two parts:
- 'Back-end' testing, where the source systems' data is compared to the end-result data in the loaded area
- 'Front-end' testing, where the user checks the data by comparing their MIS with the data displayed by end-user tools like OLAP

Testing phases consist of:
- Requirements testing
- Unit testing
- Integration testing
- Performance testing
- Acceptance testing

Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
- Are the requirements complete?
- Are the requirements singular?
- Are the requirements unambiguous?
- Are the requirements developable?
- Are the requirements testable?

Unit Testing
Unit testing for data warehouses is white-box testing. It should check the ETL procedures/mappings/jobs and the reports developed.

Unit testing the ETL procedures:
- Whether the ETLs are accessing and picking up the right data from the right source
- Whether all data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data
- Testing the rejected records that do not fulfil the transformation rules
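A unit test for an ETL transformation typically checks both the rows that load and the rows that are rejected. The sketch below assumes two illustrative business rules (the amount must be a non-negative number and the date must be a valid `YYYY-MM-DD` value); the `transform` function and the sample rows are hypothetical.

```python
from datetime import datetime


def transform(rows):
    """Apply the transformation rules; return (loaded, rejected) records."""
    loaded, rejected = [], []
    for row in rows:
        try:
            amount = float(row["amount"])
            if amount < 0:
                raise ValueError("negative amount")
            day = datetime.strptime(row["date"], "%Y-%m-%d").date()
        except (KeyError, ValueError) as exc:
            # Keep the offending record and the reason for the error log.
            rejected.append((row, str(exc)))
        else:
            loaded.append({"date": day, "amount": amount})
    return loaded, rejected


rows = [
    {"date": "2009-01-15", "amount": "120.50"},  # valid
    {"date": "2009-02-30", "amount": "10"},      # invalid calendar date
    {"date": "2009-03-01", "amount": "-5"},      # violates the amount rule
    {"amount": "7"},                             # date missing
]
loaded, rejected = transform(rows)
print(len(loaded), len(rejected))
```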

Unit Testing
Unit testing the report data:
- Verify report data against the source: data present in a data warehouse will be stored at an aggregate level compared to source systems. The QA team should verify the granular data stored in the data warehouse against the source data available
- Field-level data verification: the QA team must understand the linkages for the fields displayed in the report, and should trace them back and compare them with the source systems
- Derivation formulae / calculation rules should be verified

Integration Testing
Integration testing will involve the following:
- Sequence of ETL jobs in the batch
- Initial loading of records into the data warehouse
- Incremental loading of records at a later date, to verify the newly inserted or updated data
- Testing the rejected records that do not fulfil the transformation rules
- Error log generation
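The initial-load and incremental-load checks can be sketched with a dictionary standing in for the warehouse table. The `load` function, the keys and the records are hypothetical; the point is that a test can assert how many records an incremental run inserted and updated.

```python
def load(target, batch):
    """Apply a batch to the target; return (inserted, updated) counts."""
    inserted = updated = 0
    for key, record in batch.items():
        if key not in target:
            inserted += 1
        elif target[key] != record:
            updated += 1
        target[key] = record
    return inserted, updated


warehouse = {}

# Initial load: every record is new.
initial_counts = load(warehouse, {1: "A", 2: "B"})

# Incremental load at a later date: one changed record, one new record.
incremental_counts = load(warehouse, {2: "B2", 3: "C"})

print(initial_counts, incremental_counts, warehouse)
```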

Performance Testing
Performance testing should check for:
- ETL processes completing within the time window
- Monitoring and measuring the data quality issues
- Refresh times for standard/complex reports

Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of user acceptance testing (UAT), the system should be acceptable to the client for use in terms of ETL process integrity, business functionality and reporting.

Questions


Thank You

