Textbook: Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals, by Paulraj Ponniah. Publisher: John Wiley & Sons, 2nd Edition.
Objectives
Understand the desperate need for strategic information
Recognize the information crisis at every enterprise
Distinguish between operational and informational systems
Learn why past attempts to provide strategic information failed
Clearly see why data warehousing is the viable solution
We're told we live in the information age. People often talk about data and information as if they were the same. They are not; in many regards they are opposites. A datum is just a fact: your name is a fact, your phone number is a fact. Information is data that is presented in a meaningful, understandable and beneficial format. Information is data that has been organized, sequenced, correlated and summarized, such as a phone book.
A phone book is information. It not only contains names and phone numbers, but it correctly associates each person's phone number with their name. It presents this list of correlated names and phone numbers in alphabetical sequence, so that we can find the phone number from the name. In addition, it divides the phone numbers into two types: personal and business. It is the function of the computer to convert data to information.
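The phone book idea can be sketched in a few lines of Python. This is only an illustration: the names, numbers, record layout and the `build_phone_book` and `lookup` helpers are all invented for the example.

```python
# Raw data: isolated facts in no particular order (all values are invented).
raw_facts = [
    ("Mehta Traders", "555-0142", "business"),
    ("Anita Rao", "555-0117", "personal"),
    ("City Bakery", "555-0199", "business"),
    ("Vikram Shah", "555-0103", "personal"),
]

def build_phone_book(facts):
    """Correlate each name with its number, split by type, sort by name."""
    book = {"personal": [], "business": []}
    for name, number, kind in facts:
        book[kind].append((name, number))
    for entries in book.values():
        entries.sort()  # alphabetical sequence by name
    return book

def lookup(book, name):
    """Find a number from a name, searching both sections."""
    for entries in book.values():
        for entry_name, number in entries:
            if entry_name == name:
                return number
    return None

phone_book = build_phone_book(raw_facts)
```

The same facts are present before and after; what changes is that they are now correlated, sequenced and divided by type, which is what makes a lookup by name possible.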
Definitions
Database: The database is a place where you put your data; data that you wish to convert to information at some future time.
Database Management System (DBMS): A DBMS is the software that converts the data in your database to information. It is the DBMS that provides you the capability for cross-referencing, correlating, sorting, summarizing, etc.
Database
Integrated: Must have a single, enterprise-wide view.
Data Integrity: Information must be accurate and must conform to business rules.
Accessible: Easily accessible with intuitive access paths, and responsive for analysis.
Credible: Every business factor must have one and only one value.
Timely: Information must be available within the stipulated time frame.
A software solution that addresses enterprise needs, taking the process view of an organization to meet the organization's goals. It integrates all the departments and functions across a company into a single computer system that can serve all those different departments' particular needs. It is a single application that supports (manages) all aspects (domains) of a company. It supports the day-to-day operations of the company. To ensure that transactions are fast, it maintains only recent data.
... of relational database tables, designed and normalized for running the business operations, were not at all suitable for providing strategic information.
... data repositories lacked data from external sources and from other operational systems in the company.
ERP
IT receives too many ad hoc requests, resulting in a large overload. With limited resources, IT is unable to respond to the numerous requests in a timely fashion.
Requests keep changing all the time.
The users require more reports to expand and understand the earlier reports.
Users go into a spiral of asking for more, thereby increasing the IT load.
Users have to depend on IT to provide information. The environment is not user-centric.
IT is unable to provide a flexible and conducive environment for strategic decision making.
Operational System
Informational Systems
Operational vs DSS
Since the 1970s, organizations gained competitive advantage through systems that automate business processes to offer more efficient and cost-effective services to the customer.
Organizations now focus on ways to use operational data to support decision-making, as a means of gaining competitive advantage. However, operational systems were never designed to support such business activities; they are involved with day-to-day transactions only. Businesses typically have numerous operational systems with overlapping and sometimes contradictory definitions.
Organizations need to turn their archives of data into a source of knowledge, so that a single integrated or consolidated view of the organization's data is presented to the user. The data warehouse was deemed the solution to meet the requirements of a system capable of supporting decision-making, receiving data from multiple operational data sources.
Easily summarize and roll up the information across subject areas and business dimensions
Data is scattered in many types of incompatible structures.
Lack of documentation has prevented the integration of older legacy systems with newer systems.
Accurate and accessible metadata across multiple organizations is hard to get.
Data is designed for analytical tasks
Data from multiple applications
Easy to use and conducive to long interactive sessions by users
Read-intensive data usage
Direct interaction with the system by the users without IT assistance
Content updated periodically and stable
Content to include current and historical data
Ability for users to run queries and get results online
Ability for users to initiate reports
In a modern organization, at least four levels of analytical processing should be supported by information systems.
First level: Consists of simple queries and reports against current and historical data.
Second level: Goes deeper and requires the ability to do what-if processing across data store dimensions.
Data warehousing is a decision support system. It extracts data from various source systems, e.g. ERP, CRM. It keeps historical data in a single uniform format.
So, summarizing, a DW is:
An ideal environment for data analysis and decision support.
Flexible and interactive.
100% user-driven.
Very responsive and conducive to the ask-answer-ask-again pattern.
Able to provide answers to complex, unpredictable questions.
Characteristics
The new concept is not to generate fresh data, but to make use of the large volumes of existing data and to transform it into forms suitable for providing strategic information.
It is a user-centric environment, not a product.
A computing environment where users can find strategic information.
A central database that is loaded from multiple operational databases for the purpose of end-user access and decision support.
A data warehouse differs from an operational system in that the data it contains is normally static and is updated in a scheduled manner through massive loading procedures.
A data warehouse is developed to accommodate random, ad hoc queries and to allow users to drill down to minute levels of detail.
Take all the data from the operational systems.
Where necessary, include relevant data from outside, such as industry benchmark indicators.
Integrate all the data from the various sources.
Remove inconsistencies and transform the data.
Store the data in formats suitable for easy access for decision making.
This simple concept involves several functions: data extraction, loading the data, transformation, storage, and providing user interfaces.
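The steps above can be sketched as a tiny extract, transform and load pipeline. This is a minimal sketch, not a reference implementation: the source records, field names, table names and the `extract`, `transform` and `load` helpers are all assumptions made for illustration, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical operational extracts; field names and values are invented.
orders_system = [
    {"cust": " alice ", "amount": "120.50", "date": "2024-01-15"},
    {"cust": "BOB", "amount": "80", "date": "2024-02-03"},
]
external_benchmarks = [{"month": "2024-01", "industry_avg": 95.0}]

def extract():
    """Take data from the operational system plus relevant outside data."""
    return list(orders_system), list(external_benchmarks)

def transform(orders):
    """Remove inconsistencies: normalize names, convert amounts to numbers."""
    return [
        (o["cust"].strip().title(), float(o["amount"]), o["date"][:7])
        for o in orders
    ]

def load(conn, rows, benchmarks):
    """Store the integrated data in a form that is easy to query."""
    conn.execute("CREATE TABLE sales (customer TEXT, amount REAL, month TEXT)")
    conn.execute("CREATE TABLE benchmark (month TEXT, industry_avg REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.executemany(
        "INSERT INTO benchmark VALUES (:month, :industry_avg)", benchmarks
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
orders, benchmarks = extract()
load(conn, transform(orders), benchmarks)

# A decision-support query: monthly sales next to the industry benchmark.
report = conn.execute(
    "SELECT s.month, SUM(s.amount), b.industry_avg "
    "FROM sales s LEFT JOIN benchmark b ON b.month = s.month "
    "GROUP BY s.month ORDER BY s.month"
).fetchall()
```

The user-interface function is represented here only by the final query; in practice it would be a reporting or analysis tool.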
Blend of Technologies
Different technologies are needed to support data warehousing functions.
Scenario 1
ABC Pvt Ltd is a company with branches at Mumbai, Delhi, Chennai and Bangalore. The Sales Manager wants a quarterly sales report. Each branch has a separate operational system.
[Figure: branch operational systems (Delhi, Chennai, Bangalore) and the Sales Manager, who needs sales per item type per branch for the first quarter.]
Solution 1
Extract sales information from each database. Store the information in a common repository at a single site.
[Figure: data from the branches (Delhi, Chennai, Bangalore) flows into the Data Warehouse; Query & Analysis tools produce the report for the Sales Manager.]
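As a rough illustration of Scenario 1, the sketch below consolidates per-branch extracts into one repository and derives sales per item type per branch for the first quarter. The item types, dates, amounts and the helper names (`consolidate`, `quarter_of`, `quarterly_sales`) are invented for the example.

```python
from collections import defaultdict

# Hypothetical extracts from each branch's operational system:
# (item_type, sale_date as YYYY-MM-DD, amount). All values are invented.
branch_extracts = {
    "Mumbai": [("Shirts", "2024-01-10", 500.0), ("Shoes", "2024-04-02", 900.0)],
    "Delhi": [("Shirts", "2024-02-14", 300.0), ("Shirts", "2024-03-01", 200.0)],
    "Chennai": [("Shoes", "2024-03-20", 750.0)],
    "Bangalore": [("Shirts", "2024-01-05", 400.0)],
}

def consolidate(extracts):
    """Store every branch's rows in one common repository, tagged by branch."""
    return [
        (branch, item_type, sale_date, amount)
        for branch, rows in extracts.items()
        for item_type, sale_date, amount in rows
    ]

def quarter_of(sale_date):
    """Map a YYYY-MM-DD date string to its calendar quarter (1 to 4)."""
    month = int(sale_date[5:7])
    return (month - 1) // 3 + 1

def quarterly_sales(repository, quarter):
    """Total sales per (branch, item type) for the given quarter."""
    totals = defaultdict(float)
    for branch, item_type, sale_date, amount in repository:
        if quarter_of(sale_date) == quarter:
            totals[(branch, item_type)] += amount
    return dict(totals)

repository = consolidate(branch_extracts)
q1_report = quarterly_sales(repository, quarter=1)
```

Because every row carries its branch name, the Sales Manager's report becomes a single query over one repository rather than four separate requests.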
Scenario 2
One Stop Shopping Super Market has a huge operational database. Whenever executives want a report, the OLTP system becomes slow and the data entry operators have to wait for some time.
Solution 2
Extract data from the operational database.
Store it in a warehouse.
Refresh the warehouse at regular intervals so that it contains up-to-date information for analysis.
The warehouse will contain data with a historical perspective.
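One common way to refresh a warehouse at regular intervals is to copy only the rows added since the previous refresh. The sketch below assumes that approach; the slides do not say how the refresh is done. The table names, columns and the `refresh_warehouse` function are hypothetical, and two in-memory SQLite databases stand in for the OLTP system and the warehouse.

```python
import sqlite3

def refresh_warehouse(oltp, warehouse, last_loaded_id):
    """Copy rows added to the OLTP table since the previous refresh.

    Returns the new high-water mark. Rows already in the warehouse are
    never modified, so history accumulates across refreshes.
    """
    new_rows = oltp.execute(
        "SELECT id, item, qty, sold_on FROM transactions "
        "WHERE id > ? ORDER BY id",
        (last_loaded_id,),
    ).fetchall()
    warehouse.executemany(
        "INSERT INTO sales_history VALUES (?, ?, ?, ?)", new_rows
    )
    warehouse.commit()
    return new_rows[-1][0] if new_rows else last_loaded_id

# Hypothetical operational database and warehouse (both in memory here).
oltp = sqlite3.connect(":memory:")
oltp.execute(
    "CREATE TABLE transactions "
    "(id INTEGER PRIMARY KEY, item TEXT, qty INTEGER, sold_on TEXT)"
)
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales_history (id INTEGER, item TEXT, qty INTEGER, sold_on TEXT)"
)

oltp.executemany(
    "INSERT INTO transactions (item, qty, sold_on) VALUES (?, ?, ?)",
    [("rice", 2, "2024-01-01"), ("soap", 1, "2024-01-01")],
)
mark = refresh_warehouse(oltp, warehouse, last_loaded_id=0)

# Data entry continues; the next scheduled refresh picks up only new rows.
oltp.execute(
    "INSERT INTO transactions (item, qty, sold_on) VALUES ('rice', 5, '2024-01-02')"
)
mark = refresh_warehouse(oltp, warehouse, mark)

# Reports now run against the warehouse, not the OLTP system.
rice_total = warehouse.execute(
    "SELECT SUM(qty) FROM sales_history WHERE item = 'rice'"
).fetchone()[0]
```

Executives query `sales_history`, so their reports no longer compete with the data entry operators for the operational database.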
[Figure, Solution 2: the Data Entry Operator records transactions in the operational database; data is extracted into the Data Warehouse, which supplies reports to the Manager.]
Scenario 3
Cakes & Cookies is a small, new company. The President of the company wants the company to grow. He needs information so that he can make correct decisions.
Solution 3
Improve the quality of data before loading it into the warehouse.
Perform data cleaning and transformation before loading the data.
Use query and analysis tools to support ad hoc queries.
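A minimal sketch of cleaning and transformation before loading follows. The rows, field names and the specific cleaning rules (title-casing text, borrowing a missing price from another row of the same product, treating a missing quantity as zero) are assumptions chosen for illustration; real rules depend on the business.

```python
# Hypothetical raw rows for Cakes & Cookies; names and values are invented.
raw_rows = [
    {"product": "Choco Cake", "city": "mumbai", "units": 4, "price": 250.0},
    {"product": "choco cake ", "city": "Mumbai", "units": None, "price": 250.0},
    {"product": "Butter Cookie", "city": "PUNE", "units": 10, "price": None},
    {"product": "Butter Cookie", "city": "Pune", "units": 6, "price": 40.0},
]

def clean(rows):
    """Standardize text fields and fill missing numeric values."""
    cleaned = [
        {
            "product": row["product"].strip().title(),
            "city": row["city"].strip().title(),
            "units": row["units"],
            "price": row["price"],
        }
        for row in rows
    ]
    # Fill a missing price from another row of the same product, if any.
    known_price = {
        r["product"]: r["price"] for r in cleaned if r["price"] is not None
    }
    for r in cleaned:
        if r["price"] is None:
            r["price"] = known_price.get(r["product"])
        if r["units"] is None:
            r["units"] = 0  # assumption: treat an unknown quantity as zero
    return cleaned

def transform(rows):
    """Derive the revenue measure that will be loaded into the warehouse."""
    return [dict(r, revenue=r["units"] * r["price"]) for r in rows]

warehouse_rows = transform(clean(raw_rows))
```

After cleaning, "mumbai" and "Mumbai" count as the same city and both spellings of the cake count as the same product, so ad hoc queries group them correctly.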
[Figure, Solution 3: labels recoverable from the diagram are Expansion and President.]
Industry has a huge amount of operational data. Knowledge workers want to turn this data into useful information.
A data warehouse is a platform for consolidated historical data for analysis. It stores data of good quality so that knowledge workers can make correct decisions.
From a business perspective, it is the latest marketing weapon: it helps to keep customers by learning more about their needs. It is a valuable tool in today's competitive, fast-evolving world.
Inmon's definition
A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management's decision-making process.
Subject-oriented
A data warehouse is organized around subjects such as sales, product and customer.
It focuses on modeling and analysis of data for decision makers.
It excludes data not useful in the decision support process.
Integration
A data warehouse is constructed by integrating multiple heterogeneous sources.
Data preprocessing techniques are applied to ensure consistency.
[Figure: RDBMS, legacy system and flat file sources are integrated into the Data Warehouse.]
Integration
In terms of data.
physical attribute.
of data
remarks
Time-variant
Provides information from a historical perspective, e.g. the past 5-10 years.
Every key structure contains, either implicitly or explicitly, an element of time.
Nonvolatile
Data once recorded cannot be updated.
A data warehouse requires only two operations in data accessing:
Initial loading of data (load)
Access of data (access)
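The load and access operations can be pictured with a toy store that deliberately offers no update or delete. The `NonvolatileStore` class and its rows are hypothetical; the sketch only illustrates the idea that later loads add history instead of overwriting it.

```python
class NonvolatileStore:
    """A toy warehouse table that supports only load and access."""

    def __init__(self):
        self._rows = []

    def load(self, rows):
        """Append a batch of rows; existing rows are never changed."""
        self._rows.extend(tuple(r) for r in rows)

    def access(self, predicate=lambda row: True):
        """Return the rows that satisfy the predicate (read-only query)."""
        return [row for row in self._rows if predicate(row)]

# Hypothetical (year, product, units) rows.
store = NonvolatileStore()
store.load([(2023, "Cheese", 40), (2023, "Bread", 55)])
store.load([(2024, "Cheese", 48)])  # a later load adds history

cheese_history = store.access(lambda row: row[1] == "Cheese")
```

Rows are stored as tuples, so a caller cannot modify a recorded fact in place; the 2023 figure for Cheese stays available next to the 2024 one.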
Feature | Operational | Informational
Characteristic | Operational processing | Informational processing
Orientation | Transaction | Analysis
User | Clerk, DBA, database professional | Knowledge workers
Function | Day-to-day operation | Decision support
Data | Current | Historical
View | Detailed, flat relational | Summarized, multidimensional
Design | Application oriented | Subject oriented
Access | Read/write | Mostly read
Focus | Data in | Information out
Records accessed | Tens | Millions
Number of users | Thousands | Hundreds
Database size | 100 MB to GB | 100 GB to TB
Priority | High performance, high availability | High flexibility, end-user autonomy
Metric | Transaction throughput | Query throughput
[Figure: data warehouse architecture. DATA SOURCES (operational DBs, external sources) supply reconciled data to the warehouse and DATA MARTS, which serve the TOOLS: analysis, query/reporting and data mining.]
Warehouse server: almost always a relational DBMS, rarely flat files.
OLAP servers: support and operate on multi-dimensional data structures.
Clients: query and reporting tools, analysis tools, data mining tools.
Star Schema
A single, large, central fact table and one table for each dimension.
Every fact points to one tuple in each of the dimensions and has additional attributes.
Does not capture hierarchies directly.
Benefits: easy to understand, easy to define hierarchies, reduces the number of physical joins.
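A star schema along these lines can be sketched with SQLite. The table and column names (`fact_sales`, `dim_product`, `dim_city`, `dim_time`) and all rows are invented for illustration; only the shape, one central fact table with one table per dimension, comes from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_city    (city_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, month TEXT, year INTEGER);
-- Each fact row points to exactly one row in every dimension
-- and carries the additional measure attributes.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    city_id    INTEGER REFERENCES dim_city(city_id),
    time_id    INTEGER REFERENCES dim_time(time_id),
    units      INTEGER,
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'Wheat Bread', 'Bread'), (2, 'Cheese', 'Dairy');
INSERT INTO dim_city    VALUES (1, 'Mumbai', 'West'), (2, 'Pune', 'West');
INSERT INTO dim_time    VALUES (1, 'Jan', 2024), (2, 'Feb', 2024);
INSERT INTO fact_sales  VALUES (1, 1, 1, 3, 30.0), (2, 1, 1, 4, 20.0),
                               (1, 2, 2, 5, 50.0), (2, 2, 2, 2, 10.0);
""")

# One join per dimension is enough to answer a typical question.
units_by_city = conn.execute("""
    SELECT c.name, SUM(f.units)
    FROM fact_sales f JOIN dim_city c ON c.city_id = f.city_id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
```

Note that the product category is just a column of `dim_product` here: the hierarchy from product to category is implied by the data rather than modeled as a separate table.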
SnowFlake Schema
A variant of the star schema model.
A single, large, central fact table and one or more tables for each dimension.
Dimension tables are normalized, i.e. dimension table data is split into additional tables.
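To illustrate the normalization step, the sketch below splits a product dimension so that category details live in their own table. Table names, column names and rows are hypothetical, and only one dimension is shown to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The product dimension is split: category details move to their own table.
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_product (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)
);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    units      INTEGER
);
INSERT INTO dim_category VALUES (1, 'Bread'), (2, 'Dairy');
INSERT INTO dim_product  VALUES (1, 'Wheat Bread', 1), (2, 'Swiss Rolls', 1),
                                (3, 'Cheese', 2);
INSERT INTO fact_sales   VALUES (1, 3), (2, 6), (3, 4);
""")

# Reaching the category now takes two joins instead of one.
units_by_category = conn.execute("""
    SELECT cat.name, SUM(f.units)
    FROM fact_sales f
    JOIN dim_product p    ON p.product_id = f.product_id
    JOIN dim_category cat ON cat.category_id = p.category_id
    GROUP BY cat.name ORDER BY cat.name
""").fetchall()
```

The trade-off is visible in the query: the category name is stored once instead of on every product row, at the cost of an extra join.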
Fact Constellation
Multiple fact tables share dimension tables.
This schema is viewed as a collection of stars, hence it is called a galaxy schema or fact constellation.
Sophisticated applications require such a schema.
Selection
Data Preprocessing: fill missing values, remove inconsistency
Data Transformation & Integration
Data Loading
Data in the warehouse is stored in the form of fact tables and dimension tables.
Case Study
Afco Foods & Beverages is a new company which produces dairy, bread and meat products, with its production unit located at Baroda. Their products are sold in the North, North West and Western regions of India. They have sales units at Mumbai, Pune, Ahmedabad, Delhi and Baroda. The President of the company wants sales information.
Sales Information
Report: The number of units sold.
Month | Units
January | 14
February | 41
March | 33
April | 25
Total | 113
Sales Information
Report: The number of items sold for each product with time.
[Figure: chart of items sold per product over time.]
Sales Information
Report: The number of items sold in each City for each product with time
[Table: rows are the cities Mumbai and Pune, each broken down by product (Wheat Bread, Cheese, Swiss Rolls); columns are the months Jan, Feb, Mar and Apr; cells hold units sold. Axes: Product, Time.]
Sales Information
Report: The number of items sold and income in each region for each product with time.
[Table: rows are Mumbai and Pune, each broken down by product (Wheat Bread, Cheese, Swiss Rolls); columns are Jan, Feb, Mar and Apr, each split into income (Rs) and units (U).]
[Figure: case-study schema. Tables: Sales Fact (with a Units measure), Product (with Product_Category_ID), Product Category (with Product_Category_Id) and Region (with City_ID).]
OLAP (online analytical processing) enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.
[Figure: an OLAP cube built from the Data Warehouse, with Product and Time among its dimensions.]
City: All, Mumbai, Mumbai, Mumbai, Mumbai, Mumbai
Product: All, All, White Bread
Time: All, All, All
Units: 113, 64, 38, 13, 3, 3
Dollars: 251.26, 146.07, 98.49, 32.24, 7.44, 7.44
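Cube cells such as (All, All, All) or (Mumbai, All, All) are aggregates in which one or more dimensions have been rolled up to "All". The sketch below shows one simple way to compute every such cell from fact rows. The fact rows are invented and do not reproduce the case-study figures; `build_cube` is a hypothetical helper.

```python
from itertools import product as cartesian

# Hypothetical fact rows: (city, product, month, units). Values are invented.
facts = [
    ("Mumbai", "Cheese", "Jan", 4),
    ("Mumbai", "Cheese", "Feb", 6),
    ("Mumbai", "Bread", "Jan", 3),
    ("Pune", "Cheese", "Jan", 5),
]

def build_cube(rows):
    """Aggregate units for every combination of dimension value or 'All'."""
    cube = {}
    for city, prod, month, units in rows:
        # Each fact contributes to 2**3 cells: every dimension is either
        # kept at its own value or rolled up to 'All'.
        for key in cartesian((city, "All"), (prod, "All"), (month, "All")):
            cube[key] = cube.get(key, 0) + units
    return cube

cube = build_cube(facts)
grand_total = cube[("All", "All", "All")]
mumbai_total = cube[("Mumbai", "All", "All")]
```

Precomputing these cells is what lets an OLAP tool answer summary questions quickly; the cost is storage, since the number of cells grows with every added dimension.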
OLAP Operations
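Drill Down: move from summary to detail along a hierarchy, viewed over Time. Product Category (e.g. Electrical Appliance), then Sub Category (e.g. Kitchen), then Product (e.g. Toaster).
Drill Up: the reverse of drill down. Product (e.g. Toaster), then Sub Category (e.g. Kitchen), then Product Category (e.g. Electrical Appliance).
Slice and Dice: select a subset of the cube by fixing dimension values, e.g. Product = Toaster, viewed over Time.
Pivot: rotate the axes of the view. [Figure labels: Product, Time, Region.]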
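The four operations can be sketched over a small in-memory fact list. The rows, the `COLUMNS` layout and the helper names (`aggregate`, `slice_`, `dice`, `pivot`) are assumptions for illustration; real OLAP servers expose these operations through their own query languages and tools.

```python
from collections import defaultdict

# Hypothetical rows: (category, sub_category, product, region, quarter, units).
rows = [
    ("Electrical Appliance", "Kitchen", "Toaster", "North", "Q1", 10),
    ("Electrical Appliance", "Kitchen", "Toaster", "West", "Q2", 7),
    ("Electrical Appliance", "Kitchen", "Mixer", "North", "Q1", 4),
    ("Electrical Appliance", "Laundry", "Iron", "West", "Q1", 6),
]
COLUMNS = ("category", "sub_category", "product", "region", "quarter")

def aggregate(data, *dims):
    """Sum units grouped by the named dimensions."""
    idx = [COLUMNS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in data:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)

def slice_(data, dim, value):
    """Slice: fix one dimension to a single value."""
    i = COLUMNS.index(dim)
    return [row for row in data if row[i] == value]

def dice(data, **allowed):
    """Dice: keep rows whose value is in the allowed set on each dimension."""
    return [
        row for row in data
        if all(row[COLUMNS.index(d)] in vals for d, vals in allowed.items())
    ]

def pivot(cells):
    """Pivot: swap the two axes of a two-dimensional result."""
    return {(b, a): v for (a, b), v in cells.items()}

by_category = aggregate(rows, "category")                      # drilled up
by_sub_category = aggregate(rows, "category", "sub_category")  # drill down
toasters = slice_(rows, "product", "Toaster")
north_q1 = dice(rows, region={"North"}, quarter={"Q1"})
product_by_quarter = aggregate(rows, "product", "quarter")
quarter_by_product = pivot(product_by_quarter)
```

Drill down and drill up are the same aggregation run at different levels of the hierarchy, which is why a single `aggregate` function covers both.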
OLAP Server
An OLAP server is a high-capacity, multi-user data manipulation engine specifically designed to support and operate on multi-dimensional data structures.
The available OLAP server types are:
MOLAP server
ROLAP server
HOLAP server
Presentation
[Figure: the Client uses a Reporting Tool to produce a Report (Product by Time); a Flat File also appears in the diagram.]
Warehouse: SQL Server 2000 DTS, Oracle 8i Warehouse Builder
OLAP tools: SQL Server Analysis Services, Oracle Express Server
Reporting tools: MS Excel Pivot Chart, VB Applications