
Data Warehousing

Basics of Data Warehousing


Session Objectives

At the end of this session, you will be able to answer

 What, Why and Where -- Data Warehouse

 Terminology / technical jargon used

 OLTP and OLAP

 DW Architecture

 ETL process in a Data Warehousing environment

 Evolution of ETL tools


DW – 3 W’s

Data Warehouse --- What? Why? Where?

Decision Support System

In order to make correct decisions, accurate, meaningful information about
business environments, external issues, and internal workings must be available
in a timely fashion.
What is a Data Warehouse?

A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant
collection of data in support of management's decisions.
- W. H. Inmon

W. H. Inmon - Regarded as the Father of Data Warehousing


Integrated - Characteristics of a Data Warehouse

Appl A - m,f
Appl B - 1,0
Appl C - male,female
Data Warehouse - m,f

Appl A - balance dec fixed (13,2)
Appl B - balance pic 9(9)V99
Appl C - balance pic S9(7)V99 comp-3
Data Warehouse - balance dec fixed (13,2)

Appl A - bal-on-hand
Appl B - current-balance
Appl C - cash-on-hand
Data Warehouse - Current balance

Appl A - date (julian)
Appl B - date (yymmdd)
Appl C - date (absolute)
Data Warehouse - date (julian)

Integrated View Is The Essence Of A Data Warehouse
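
The integration idea can be sketched in Python. This is only an illustration: the keys `appl_a`, `appl_b` and `appl_c`, the function `integrate`, and the reading of Appl B's codes as 1 = male and 0 = female are assumptions, not something the slide states.

```python
# Hypothetical per-application code tables. The 1 -> "m", 0 -> "f" mapping
# for Appl B is an assumption made for illustration only.
GENDER_MAPS = {
    "appl_a": {"m": "m", "f": "f"},
    "appl_b": {"1": "m", "0": "f"},
    "appl_c": {"male": "m", "female": "f"},
}

# Each application names the balance field differently.
BALANCE_FIELDS = {
    "appl_a": "bal-on-hand",
    "appl_b": "current-balance",
    "appl_c": "cash-on-hand",
}


def integrate(source, record):
    """Convert one source record to the warehouse's common representation."""
    gender = GENDER_MAPS[source][str(record["gender"]).lower()]
    balance = round(float(record[BALANCE_FIELDS[source]]), 2)
    return {"gender": gender, "current_balance": balance}


rows = [
    integrate("appl_a", {"gender": "m", "bal-on-hand": "120.50"}),
    integrate("appl_b", {"gender": 0, "current-balance": 75}),
    integrate("appl_c", {"gender": "Female", "cash-on-hand": 10.129}),
]
```

After integration, every row carries the same gender encoding and the same balance field name, whichever application it came from.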


Non-volatile - Characteristics of a Data Warehouse

Operational - insert, change, delete, replace
Data Warehouse - load, read-only access

Data Warehouse Is Relatively Static In Nature


Historical Look at Informational Processing
The goal of Informational Processing is to turn data into information!

Why?

Because business questions are answered using information and the knowledge of
how to apply that information to a given problem.

Data -> Information -> Knowledge


A Need For New Technology
 Government and industrial entities have been collecting data in electronic format
since the 1960s.
 Today, organizations collect millions of pieces of information about every
aspect of their operation on a daily basis.
 Data is obtained from multiple disparate sources.
 Often information is replicated, leading to confusion.
 Related data is often retained in seemingly heterogeneous and incompatible
platforms.
 Common data attributes are represented in nonstandard formats and naming
constructs across systems.
 Most systems are built for data collection (transaction based).
 Designed to support On-Line Transaction Processing (OLTP).
 Designed to support day-to-day business operations.
 Very specific applications built to support interaction with the data.
 Perform best when handling small specific volumes of data.
 Do not readily accept information from dissimilar sources.
A Need for New Technology (contd.)
 Capable of answering questions of a specific nature and time frame.
 How many items do I have in stock today?
 How many tickets were sold on a specific date?
 What is the current price of an item?

 Transaction based systems experience great difficulty in answering analytical
and decision support questions.
 Analysis takes a long time, interfering with:
 transaction performance
 daily operations
 The nature of the data is dynamic and dispersed.
A Need for New Technology (contd.)
Most organizations have created a “spider web” of systems and data sources.

Databases Applications
A Need for New Technology

All of this has created “data overload” and “data confusion”.


 What do I do with all of this data?
 What does it mean?
 Do I really need this data?
 I am overwhelmed with the amount of data I am confronted with.
 I cannot make a timely decision (too much data from too many sources).
Data, Data everywhere
 I can’t get the data I need
– need an expert to get the data
 I can’t find the data I need
– data is scattered over the network
– many versions, subtle differences
 I can’t understand the data I found
– available data poorly documented
 I can’t use the data I found
– results are unexpected
– data needs to be transformed from one form to another
Data Warehousing

Data warehousing is:

A large historical database designed to accept key analytical data from
multiple, disparate sources that handle the day-to-day management of
enterprise data. Furthermore, the role of the warehouse is to transform
transaction data into corporate information. The warehouse is provided to the
user in a read-only fashion.
Data Warehousing

A data warehouse will provide:

The ability to ask business analysis questions in a real-time, iterative
fashion, obtaining decision support information readily and quickly.
Data Warehousing
 A data warehouse is not:
A repository for all corporate data.
 A data warehouse will not:
Single-handedly solve all of the problems associated with an enterprise.
DW Jargons
 OLTP – Online Transaction Processing
 OLAP – Online Analytical Processing
 ETL – Extraction, Transformation and Loading
 DSS – Decision Support System
 Metadata – Data About Data
 Fact – Numeric values
 Dimensions – Subject Areas of Business
 ODS – Operational Data Store
 STG – Data Staging Layer
 Cube
 Star Schema
 Snowflake Schema
What is an Operational System / OLTP?
 Operational systems are just what their name implies; they are the systems that help
us run the day-to-day enterprise operations.

 These are the backbone systems of any enterprise, such as order entry and inventory.

 The classic examples are airline reservations, credit-card authorizations, and ATM
withdrawals.


Characteristics of Operational Systems

 Continuous availability

 Predefined access paths

 Transaction integrity

 Volume of transaction - High

 Data volume per query - Low

 Used by operational staff

 Supports day to day control operations


What is OLAP?

 OLAP (On-Line Analytical Processing) applications are designed for online
ad-hoc data access and analysis.

 Data is organized into multiple dimensions.

 Access to analytical content such as time series and trend analysis views
and summary level information.

 A set of functionality that attempts to facilitate multidimensional analysis.

 Offers drill-down, drill-across and slice and dice capabilities.
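
To contrast the two access patterns, here is a minimal sketch using Python's built-in `sqlite3` module. The `orders` table and its columns are hypothetical. The OLTP-style query touches a single row through a predefined access path (the primary key), while the OLAP-style query summarizes many rows along a dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, "
    "order_month TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "East", "2024-01", 100.0),
        (2, "East", "2024-02", 150.0),
        (3, "West", "2024-01", 200.0),
        (4, "West", "2024-02", 50.0),
    ],
)

# OLTP style: predefined access path, low data volume per query.
oltp_row = conn.execute(
    "SELECT amount FROM orders WHERE order_id = ?", (3,)
).fetchone()

# OLAP style: ad-hoc summary across many rows, grouped by a dimension.
olap_rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
```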


OLAP - Fast Analysis

• On Line - No piles of paper, please!

• Analytical - Establish patterns

• Processing - Data-based

• Fast Analysis of Shared Multidimensional Information


What is ETL?
 ETL (Extraction, Transformation and Loading) is the process by which data is
extracted from the operational systems, transformed and integrated, and loaded
into the Data Warehouse environment.
[Figure: ETL flow. Operational systems -> Filters and Extractors -> Cleanser
(Cleaning Rules 1-3; Error View, Check, Correct) -> Transformation Engine
(Transformation Rules 1-3) -> Integrator (Error View, Check, Correct) ->
Loader -> Warehouse]
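
The extract, cleanse, transform and load stages can be sketched as a small Python pipeline. This is an illustration only: the CSV layout in `RAW`, the table `customer_balance`, the cleaning rule (reject rows whose balance is not numeric) and the gender code table are all assumptions.

```python
import csv
import io
import sqlite3

# Hypothetical extract from an operational system (assumed CSV layout).
RAW = """customer,gender,balance
alice,F,100.50
bob,1,not-a-number
carol,male,20
"""

GENDER_CODES = {"m": "m", "male": "m", "1": "m", "f": "f", "female": "f", "0": "f"}


def extract(text):
    """Extract: read rows from the source export."""
    return list(csv.DictReader(io.StringIO(text)))


def cleanse(rows):
    """Cleanse: apply a cleaning rule; rejected rows go to an error list."""
    clean, errors = [], []
    for row in rows:
        try:
            row["balance"] = float(row["balance"])
            clean.append(row)
        except ValueError:
            errors.append(row)
    return clean, errors


def transform(rows):
    """Transform: convert each row to the warehouse representation."""
    return [
        (r["customer"].title(), GENDER_CODES[r["gender"].lower()], r["balance"])
        for r in rows
    ]


def load(conn, rows):
    """Load: write the transformed rows into the warehouse table."""
    conn.execute("CREATE TABLE customer_balance (name TEXT, gender TEXT, balance REAL)")
    conn.executemany("INSERT INTO customer_balance VALUES (?, ?, ?)", rows)


warehouse = sqlite3.connect(":memory:")
clean_rows, error_rows = cleanse(extract(RAW))
load(warehouse, transform(clean_rows))
loaded = warehouse.execute(
    "SELECT name, gender, balance FROM customer_balance ORDER BY name"
).fetchall()
```

Rows that fail the cleaning rule end up in `error_rows`, where they could be viewed, checked and corrected before being reprocessed.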
Data Warehousing Architecture Overview

[Figure: Data Warehouse Architecture. Components shown: Legacy Data, External
Data Source, Data Transformation, Data Warehouse, Data Warehouse Management
Layer, Information Directory Repository]

Data Warehousing

Model concepts:
 Fact table(s)
 A table containing multiple measurable descriptors relating to a
specific area of business


 Each fact can be viewed, calculated, and aggregated against various
defining areas of the business (time, geography, customer)
Data Warehousing

Model concepts:
 Dimension Table(s)
 Retains information (product description, geography description,
customer description) that is descriptive and remains moderately
constant over time
Data Warehousing

Data Warehouse Modeling


 Special modeling techniques must be applied to provide rapid response to
queries on large volumes of data.
 OLTP systems are built with update operations in mind, resulting in
normalization and greatly reduced browse performance.
Data Warehousing

Common data model techniques are as follows:


 star schema
 snowflake
 fact constellation
 relational
Data Warehousing
Sample Star Schema Model

[Figure: a central SALES fact table (Sales Facts) surrounded by the TIME,
GEOGRAPHY, STORE and CUSTOMER dimensions]
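
A star schema of this shape can be sketched with Python's built-in `sqlite3` module. The table and column names (`sales_fact`, `time_dim`, `store_dim`, `units_sold`, `revenue`) and the sample rows are hypothetical, and only two of the four dimensions are included to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE store_dim (store_id INTEGER PRIMARY KEY, store_name TEXT);
    -- Fact table: numeric measures plus a foreign key to each dimension.
    CREATE TABLE sales_fact (
        time_id INTEGER REFERENCES time_dim(time_id),
        store_id INTEGER REFERENCES store_dim(store_id),
        units_sold INTEGER,
        revenue REAL
    );
    INSERT INTO time_dim VALUES (1, 2024, 1), (2, 2024, 2);
    INSERT INTO store_dim VALUES (10, 'Downtown'), (20, 'Airport');
    INSERT INTO sales_fact VALUES
        (1, 10, 5, 50.0), (1, 20, 3, 30.0), (2, 10, 7, 70.0);
    """
)

# Aggregate a fact measure against one dimension with a single join.
revenue_by_store = conn.execute(
    """
    SELECT s.store_name, SUM(f.revenue)
    FROM sales_fact AS f
    JOIN store_dim AS s ON s.store_id = f.store_id
    GROUP BY s.store_name
    ORDER BY s.store_name
    """
).fetchall()
```

Each dimension is one join away from the fact table, which is what keeps analytical queries on a star schema simple.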
Data Warehousing
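Sample Snowflake Model

[Figure: a central SALES fact table (Sales Facts) with the TIME, GEOGRAPHY,
STORE and CUSTOMER dimensions. TIME branches into Year, Qtr and Month;
GEOGRAPHY branches into North, South, East and West; further branches are
labeled East Region and West Region.]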
Data Warehousing
Sample Fact Constellation Model

[Figure: the TIME, GEOGRAPHY, STORE and CUSTOMER dimensions shared by several
fact tables: Regional Sales, District Sales and Store Sales]
Data Mart

Data mart is:

A functional segment of an enterprise restricted for purposes of security,
locality, performance, or business necessity using modeling and information
delivery techniques identical to data warehousing.
Data Mart

Why build a data mart?


 Allows an organization to visualize the large but focus on the small and
attainable.
 Provides a platform for rapid delivery of an operational system.
 Minimizes risk.
 A corporate warehouse can be constructed from the union of the enterprise
data marts.
Data Mart

The data warehouse populates the data marts.

[Figure: Data From Transaction Sources -> Data Warehouse -> Update From the
Warehouse -> Financial Data Mart, Logistics Data Mart, Contract Data Mart]
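
The two directions of data flow between a warehouse and its data marts can be sketched in a few lines of Python. The row layout and the `area` tag are hypothetical; real marts would usually also reshape and summarize the data.

```python
# Hypothetical warehouse rows tagged with a subject area.
WAREHOUSE = [
    {"area": "financial", "item": "invoice", "amount": 500},
    {"area": "logistics", "item": "shipment", "amount": 3},
    {"area": "contract", "item": "renewal", "amount": 1},
    {"area": "financial", "item": "refund", "amount": -50},
]


def populate_marts(warehouse_rows):
    """Top-down: each data mart receives its own subject-area subset."""
    marts = {}
    for row in warehouse_rows:
        marts.setdefault(row["area"], []).append(row)
    return marts


def rebuild_warehouse(marts):
    """Bottom-up: the warehouse is the union of the data marts."""
    return [row for rows in marts.values() for row in rows]


MARTS = populate_marts(WAREHOUSE)
```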
Data Mart

The data marts populate the data warehouse.

[Figure: Data From Transaction Sources -> Financial Data Mart, Logistics Data
Mart, Contract Data Mart -> Update From the Data Marts -> Data Warehouse]
Data Mart

Virtual Data Warehouse

The data warehouse layer manages the data marts as a warehouse.
Data is moved through the abstract layer on demand.

[Figure: Data From Transaction Sources -> Financial Data Mart, Logistics Data
Mart, Contract Data Mart -> Abstract Data Warehouse Access Layer]
OLAP
 OLAP is a powerful graphics-oriented tool used to access the data warehouse
 OLAP supports
 Business analysis queries
 Data visualization
 Trend analysis
 Scenario analysis
 User defined queries
OLAP
 Drill Down
 Move from summary to detail
 Roll Up
 Move from detail to summary
 Slice and Dice
 Look at a specific interest of the business
 Pivot and Rotate
 Looking at data from varying perspectives
 Drill Through
 Move to a near transaction level of detail
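
These operations can be illustrated on a tiny in-memory data set. The sketch below uses plain Python; the fact rows, the column names and the helper functions `aggregate` and `slice_` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales facts: (year, quarter, region, product, amount).
FACTS = [
    (2024, "Q1", "East", "Radio", 100),
    (2024, "Q1", "West", "Radio", 80),
    (2024, "Q2", "East", "TV", 200),
    (2024, "Q2", "West", "TV", 150),
]
COLUMNS = ("year", "quarter", "region", "product")


def aggregate(facts, by):
    """Sum the amount measure grouped by the named dimension columns."""
    idx = [COLUMNS.index(name) for name in by]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def slice_(facts, column, value):
    """Slice: keep only the facts where one dimension has a fixed value."""
    i = COLUMNS.index(column)
    return [row for row in facts if row[i] == value]


roll_up = aggregate(FACTS, ["year"])                 # detail -> summary
drill_down = aggregate(FACTS, ["year", "quarter"])   # summary -> detail
east_only = aggregate(slice_(FACTS, "region", "East"), ["product"])
pivoted = aggregate(FACTS, ["region", "quarter"])    # same data, other axes
```

Roll up and drill down differ only in how many dimension columns are kept in the grouping key; slicing restricts the facts before they are aggregated.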
OLAP
 The flavors of OLAP
 Multidimensional On-Line Analytical Processing (MOLAP)
 Relational On-Line Analytical Processing (ROLAP)
 Hybrid On-Line Analytical Processing (HOLAP)
OLAP
 MOLAP
 Produces a hypercube
 Pre-aggregated and pre-calculated
 Rapid response times
 Limited in the amount of data that can be managed
 ROLAP
 Data remains in a relational format
 Some degree of aggregation
 Slower response times
 Scales to large amounts of data
 HOLAP
 Can manage data both as ROLAP and MOLAP
 Currently evolving
 MOLAP vendors are finding it easier to move into the HOLAP market space
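
The trade-off between the two main flavors can be sketched in Python. This is a simplified illustration with hypothetical names (`rolap_query`, `build_cube`, `molap_query`), not how any particular OLAP product works: the ROLAP-style function aggregates detail rows at query time, while the MOLAP-style cube pre-calculates every grouping so that a query becomes a lookup.

```python
from itertools import combinations

# Hypothetical detail rows: (region, product, amount).
ROWS = [("East", "Radio", 100), ("East", "TV", 200), ("West", "Radio", 80)]
DIMS = ("region", "product")


def rolap_query(rows, by):
    """ROLAP style: aggregate the detail rows at query time."""
    idx = [DIMS.index(d) for d in by]
    out = {}
    for row in rows:
        key = tuple(row[i] for i in idx)
        out[key] = out.get(key, 0) + row[-1]
    return out


def build_cube(rows):
    """MOLAP style: pre-calculate every combination of dimensions up front."""
    cube = {}
    for size in range(len(DIMS) + 1):
        for by in combinations(DIMS, size):
            cube[by] = rolap_query(rows, by)
    return cube


CUBE = build_cube(ROWS)


def molap_query(by):
    """Answer a query by lookup; `by` must list dimensions in DIMS order."""
    return CUBE[tuple(by)]
```

With n dimensions the cube holds 2^n groupings, which hints at why pre-aggregation gives rapid responses but limits the amount of data that can be managed.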
Data Mining

As defined by the Gartner Group in 1995, data mining is:

“…the process of discovering meaningful new correlations, patterns, and
trends by sifting through large amounts of data stored in a repository, using
pattern recognition technologies and statistical and mathematical
techniques.”
Data Mining
 Data mining requires an analyst who is familiar with the domain to appropriately
model scenarios.
 Data mining assists analysts in uncovering nontrivial data relationships.
 Analysis must be conducted to determine the meanings of these newly identified
relationships.
Why Use a Data Warehouse?
 Data warehousing is a must for anyone who uses multiple data sources to make
decisions and understand business (trends, forecasting).
 Those who do not move to warehousing will not be capable of responding to
problems and business conditions, thus falling behind the competition.
 For organizations wanting to minimize costs and maximize productivity,
warehousing is a must.
 Individuals who spend time gathering data instead of analyzing data require the
assistance of a warehouse.
 Organizations that collect data but have difficulty determining meanings and
impacts need a data warehouse.
Making the Warehouse a Reality
 Think big but work small.
 Match technology to requirements.
 Build for the future (scalability).
 Work closely with the users.
 Requirements
 Rapid Application Development (RAD)
 Periodic releases to the user community
Real World Success Stories
 Radio Shack
 Sales and stocking analysis
 Marketing (regionalized mailings)
 Wal-Mart
 Sales and stock analysis
 Trend analysis
 Vendor analysis
 Naval Surface Warfare Center (NSWC)
 Procurement
 Supply
 Workload
 Harris Semiconductor
 Yield
 Product
 Personnel productivity
A Few Observations About Data Warehouses
 Industry and our experience indicate that:
 Warehouses that succeed average an ROI of 400% with the top end being as
much as 600% in the first year.
 The incremental approach is most successful (build the warehouse a
functional area at a time).
 The average time to gather requirements, perform a design, and deploy a
warehouse increment is six months.
 New tools may be required that differ from the transaction environment.
 Software oriented toward intelligent analysis and query of the data
warehouse
 Hardware oriented to support the massive storage requirements and
analytical queries
Keys to Success
 Do you understand why you are building the warehouse?
 Have you identified both technical and business professionals that you will need
to build the warehouse?
 Do you have a strong management sponsor?
 Are you managing the expectations of the users?
Careers in Data Warehousing
 System Administration
 DW Architect
 Data Architect
 DW Manager
 DW Administrator
 Decision Support Analysts
 DBA
 Application Developer
 Data Cleansing / Transformation Analyst
 Business Analyst
 Management
