Sunteți pe pagina 1din 50

Data Warehouses Lecture 2

Introduction to Data Warehouses, Definitions, Concepts, Architecture

Agenda
Definitions, Concepts, Architecture, Requirements Sources used for this lecture
Stefano Rizzi, Open problems in data warehousing: eight years later, DMDW, 2003 (Keynote slides). http://sunsite.informatik.rwth-aachen.de/Publications/CEURWS//Vol-77/keynote.pdf Ralph Kimball, Joe Caserta, The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming and Delivering Data H.W. Inmon, The Data Warehouse Environment: Building the Data Warehouse Ralph Kimball, Margy Ross, Data Warehouse Toolkit: The complete Guide to Dimensional Modeling

R. Kimballs definition of a DW
A data warehouse is a copy of transactional data specifically structured for querying and analysis. According to this definition: The form of the stored data (RDBMS, flat file) has nothing to do with whether something is a data warehouse. Data warehousing is not necessarily for the needs of "decision makers" or used in the process of decision making.

Inmons Definition of a Data Warehouse


A data warehouse is a
subject-oriented, integrated, nonvolatile, and time-variant

collection of data in support of managements decisions. The data warehouse contains granular corporate data.

Subject-Oriented Data Collections


Classical operations systems are organized around the applications of the company. For an insurance company, the applications may be auto, health, life, and casualty. The major subject areas of the insurance corporation might be customer, policy, premium, and claim. For a manufacturer, the major subject areas might be product, order, vendor, bill of material, and raw goods. For a retailer, the major subject areas may be product, SKU, sale, vendor, and so forth. Each type of company has its own unique set of subjects

Integrated Data Collections


Of all the aspects of a data warehouse, integration is the most important. Data is fed from multiple disparate sources into the data warehouse. As the data is fed it is converted, reformatted, resequenced, summarized, and so forth. The result is that dataonce it resides in the data warehousehas a single physical corporate image.

Non-volatile Data Collections


Data is updated in the operational environment as a regular matter of course, but warehouse data exhibits a very different set of characteristics. Data warehouse data is loaded (usually en masse) and accessed, but it is not updated (in the general sense). Instead, when data in the data warehouse is loaded, it is loaded in a snapshot, static format. When subsequent changes occur, a new snapshot record is written. In doing so a history of data is kept in the data warehouse.

Time-variant Data Collections


Time variancy implies that every unit of data in the data warehouse is accurate as of some one moment in time. In some cases, a record is time stamped. In other cases, a record has a date of transaction. But in every case, there is some form of time marking to show the moment in time during which the record is accurate. A 60-to-90-day time horizon is normal for operational systems; a 5-to-10-year time horizon is normal for the data warehouse. As a result of this difference in time horizons, the data warehouse contains much more history than any other environment.

Operational Data Store (ODS)


The Operational Data Store is used for tactical decision making while the DW supports strategic decisions. It contains transaction data, at the lowest level of detail for the subject area
subject-oriented, just like a DW integrated, just like a DW volatile (or updateable) , unlike a DW
an ODS is like a transaction processing system information gets overwritten with updated data no history is maintained (other than audit trail) or operational history

current, i.e., not time-variant, unlike a DW


current data, up to a few years no history is maintained (other than audit trail) or operational history

Data Warehouses and Data Marts


A data warehouse is a central repository for all or significant parts of the data that an enterprise's various business systems collect. Enables strategic decision making. A data mart is a repository of data gathered from operational data and other sources that is designed to serve a particular community of knowledge workers. In scope, the data may derive from an enterprise-wide database or data warehouse or be more specialized. The emphasis of a data mart is on meeting the specific demands of a particular group of knowledge users in terms of analysis, content, presentation, and ease-of-use. Users of a data mart can expect to have data presented in terms that are familiar. In practice, the terms data mart and data warehouse each tend to imply the presence of the other in some form. However, most writers using the term seem to agree that the design of a data mart tends to start from an analysis of user needs and that a data warehouse tends to start from an analysis of what data already exists and how it can be collected in such a way that the data can later be used. A data warehouse is a central aggregation of data (which can be distributed physically); a data mart is a data repository that may derive from a data warehouse or not and that emphasizes ease of access and usability for a particular designed purpose.

A data warehouse tends to be a strategic but somewhat unfinished concept; a data mart tends to be tactical and aimed at meeting an immediate need.

History of DWs: The early 90s


W.H. Inmon coins the term data warehousing Widening interest from enterprises Widening interest from vendors Almost ignored in the academic world Some topics from traditional databases:
integration of heterogeneous sources materialized views aggregation queries ....

History of DWs: Back to 1995


Start of the Stanford DW Project
develop algorithms and tools for the efficient collection and integration of information from heterogeneous sources

Widening interest from the academic world


dedicated workshops gaining attention in major conferences

Dedicated commercial tools for Data Warehousing The CIKM95 paper by J. Widom Jennifer Widom, Research Problems in Data Warehousing, Int'l Conf. on Information and Knowledge Management, 1995.
change detection, view maintenance, data scrubbing, optimization, design, evolution

History of DWs: Into 2k


End of the Data Warehouse Quality European Project
study of quality of service factors and their relationship with design, operation, and evolution of data warehouse systems\M.
Jarke and Y. Vassiliou, Data Warehouse Quality Design: A Review of the DWQ Project, Information Quality, 1997.

On-going interest from research A large selection of commercial tools and platforms available Many mature implementations of DWs in enterprises

History of DWs: Into 2K


The DMDW00 paper by Vassiliades
P. Vassiliadis. Gulliver in the land of data warehousing: practical experiences and observations of a researcher. In Proc. of Design and Management of Data Warehouses (DMDW'2000) 2nd International Workshop in conjunction with the 12th Conference on Advanced Information Systems Engineering CAiSE*00, June 5-6, 2000, Stockholm, Sweden. [MS powerpoint presentation]

A significant gap between researchers and practictioners


researchers overlook practical problems little acceptance of research results by the industrial world increasing market for Data Warehousing systems about 20 papers in VLDB, PODS, SIGMOD problems and failures
no rextbook design methodology no standards for metadata no solutions for ETL no approach for view size estimation

The goals of a Data Warehouse


We have mountains of data in this company, but we can't access it." "We need to slice and dice the data every which way." "You've got to make it easy for business people to get at the data directly." "Just show me what is important." "It drives me crazy to have two people present the same business metrics at a meeting, but with different numbers." "We want people to use information to support more factbased decision making."

The goals of a Data Warehouse


The data warehouse must make an organization's information easily accessible. The data warehouse must present the organization's information consistently. The data warehouse must be adaptive and resilient to change. The data warehouse must be a secure bastion that protects our information assets. The data warehouse must serve as the foundation for improved decision making. The business community must accept the data warehouse if it is to be deemed successful.

Data Warehouse Architecture

Requirements for DW
Business Needs
data expectations, data sources, end user interviews, complexities and limitations

Compliance Requirements
Sarbanes-Oxley 2002 (SOX) archived copies, data staging, proof of data flow, algorithms for data adjustments, proof of security of on-line and off-line data copies

Data Profiling
quality, scope, context of a data source, dirty data sources, missing data, best-guesses, human intervention, data elimination, realistic development schedules

Requirements for DW (contd)


Security Requirements
a paradox:
Data Warehouse: publish data widely Security: restrict data to those with a need to know

role-based security at the final applications (not grant or revoke at the DBMS level)
security for developers (separate subnet), backups (tapes, disks)

Requirements for DW(contd)


Data Integration
at the core of the IT business, aka, the 360 degree view of the business specific to Data Warehouses: establishing common attributes (conforming dimensions), agreeing on common business metrics (conforming facts) so that one can perform mathematical calculations (differences, ratios, etc)

Requirements for DW (contd)


Data Latency
how quickly to deliver data to the end user improvements with algorithms, parallel processing, streaming

Archiving and Lineage


change calculations legal compliance lineage requirements

End User
reports, OLAP, data handoff

Technical Requirements for DW (contd)


Architecture
ETL tool versus hand coding batch updates versus data streaming horizontal (orders/shipments) versus vertical (customers/orders) task dependency scheduler automation quality handling/data cleansing metadata security staging

Data Warehouses and Design


Design of the interface with the operational systems Design of the data warehouse itself Iterative approach to DW design -- a waterfall approach does not work well

Beginning with Operational Data


Data are not simply extracted Data across operational systems are severely un-integrated (extract programmers nightmare!)
same data, different name same name, different data different keys, same data, etc non-standard data encodings (male/female, 0/1, etc) and field transformations different measurements, measurements at different levels of detail different source system technologies (DB2, SQL Server, OS/390 DB2, VSAM, IMS, etc)

Loading archival data, first-time load, ongoing deltas The next lecture is devoted to the concept of Extract-TransformLoad or aka ETL process

Data Warehouse Design


Start with the ER-Diagram that represents the corporate data model or with one ore more operational data models to be integrated Remove data used purely in the operational environment Enhance key structures with an element of time Add derived (calculated) data (i.e., summaries) Turn relationships of the ER model into artifacts in the data warehouse Perform stability analysis (see next slide)

Stability Analysis
Perform stability analysis group attributes together based on their likelihood to change, i.e., create different groups of data

Data Models
High-Level Data Model
ERD model, ER-model, conceptual model, corporate model, enterprise model defines the scope of source and data integration must be agreed upon by the modeler, the end-user and management

Mid-Level Data Model


Logical data model, DIS model, data information model, derives from the High-Level model might not be developed at once in its entirety

Low-Level Data Model Physical data model

Logical or DIS Data Model


1. Primary Grouping of data Represents primary entities Represents major subject areas Attributes and keys of major subject areas 2. Secondary Grouping of data Can exist multiple times for each subject area 3. Connectors relationships in the ER model foreign keys 4. Type of Data Supertypes to the left Subtypes to the right

An example DIS model

Another example: Subcategorization

Design Techniques: Merging Tables


Many tables imply much dynamic I/O Merging many tables together makes access faster

Design Techniques: Introduction of Redundant Data

Design Technique: Separation of Data when there is a disparity of probability of access

Design Technique: Introduce Derived Data


Calculated once Forever available

Design Techniques: Creative Indexes

The top 10 customers The average transaction value for this extract The largest transaction The number of customers who showed activity without purchasing.

Design Technique: Forget Referential Integrity


In the operational environment, referential integrity appears as a dynamic link among tables of data. Not in a data warehouse because
volume of data is too large the data warehouse is not updated, just appended to the warehouse represents data over time and relationships do not remain static

relationships of data are represented by an artifact in the data warehouse environment. Therefore, some data will be duplicated, and some data will be deleted when other data is still in the warehouse. In any case, trying to replicate referential integrity in the data warehouse environment is a patently incorrect approach.

ERASE THIS SLIDE


Understand star schemas and snowflake designs Understand dimensional modeling Understand dimension outriggers, degenerate dimensions and other variations in the data warehouse model Create a star schema Determine which dimensions to include in the data warehouse and select the dimension granularity Understand the three kinds of fact tables

Dimensional Modeling
Data modeling for Data Warehouses Based on
fact tables dimension tables

Fact Tables
Represent a business process, i.e., models the business process as an artifact in the data model contain the measurements or metrics or facts of business processes
"monthly sales number" in the Sales business process most are additive (sales this month), some are semi-additive (balance as of), some are not additive (unit price)

the level of detail is called the grain of the table contain foreign keys for the dimension tables

Dimension Tables
Represent the who, what, where, when and how of a measurement/artifact Represent real-world entities not business processes Give the context of a measurement (subject) For example for the Sales fact table, the characteristics of the 'monthly sales number' measurement can be a Location (Where), Time (When), Product Sold (What). The Dimension Attributes are the various columns in a dimension table. In the Location dimension, the attributes can be Location Code, State, Country, Zip code. Generally the Dimension Attributes are used in report labels, and query constraints such as where Country='USA'. The dimension attributes also contain one or more hierarchical relationships. Before designing your data warehouse, you need to decide what this data warehouse contains. Say if you want to build a data warehouse containing monthly sales numbers across multiple store locations, across time and across products then your dimensions are:

Location Time Product

Possible OLTP Location Design

In order to query for all locations that are in country 'USA' we will have to join these three tables:

SELECT * FROM Locations, States, Countries where Locations.State_Id = States.State_Id AND Locations.Country_id=Countries.Country_Id AND Country_Name='USA'

Location Dimension
Dim_id 1001 1002 1003 1004 1005 Loc_cd IL01 IL02 NY01 TO01 MX01 Name Chicago Loop Arlington Brooklyn Toronto Mexico City State_NM Illinois Illinois New York Ontario Distrito Federal Country_NM USA USA USA Canada Mexico

In order to query for all locations that are in country 'USA' SELECT * FROM Location_dim where Country_Name='USA'

Note the redundancy

Time Dimension
Dim_id 1001 1002 1003 1004 1005 Month 1 2 3 4 5 MonthName Jan Feb Mar Apr May Quarter 1 1 1 2 2 QuarterName Q1 Q1 Q1 Q2 Q2 Year 2005 2005 2005 2005 2005

Not as trivial at it seems ;-)

Product Dimension
Prod_id Prod_cd Name Category

1001
1002 1003 1004 1005

STD
LTD GUL PA VADD

Short-Term-Disability
Long-Term Disability Group Universal Life Personal Accident Voluntary Accident

Disability
Disability Life Accident Accident

Star Schemas
Select the measurements SELECT P.Name, SUM(F.Sales) JOIN the FACT table with Dimensions FROM Sales F, Time T, Product P, Location L WHERE F.TM_Dim_Id = T.Dim_Id AND F.PR_Dim_Id = P.Dim_Id AND F.LOC_Dim_Id = L.Dim_Id Constrain the Dimensions AND T.Month='Jan' AND T.Year='2003' AND L.Country_Name='USA'

Advantages: -easy to understand -better performance -extensible

Group by' for the aggregation level GROUP BY P.Category

Snow-flake Schemas

If we did not de-normalize the dimensions

Rule of thumb: dont use them

Dimensional Modeling Steps


Identify the business process Identify the level of detail needed (grain) Identify the dimensions Identify the facts

Dimensional Modeling Myths


Myth #1: Dimensional models and data marts are for summary data only
cant predict all the queries users will ask, therefore information must be stored at the detail level in a DW and summarized levels for a DM

Myth #2: Dimensional models and data marts are departmental, not enterprise, solutions Myth #3: Dimensional models and data marts are not scalable Myth #4: Dimensional models and data marts are only appropriate when there is a predictable usage pattern Myth #5: Dimensional models and data marts can't be integrated and therefore lead to stovepipe solutions.

Dimensional Modeling Pitfalls


Pitfall 5. Make the supposedly query-able data in the presentation area overly complex. Database designers who prefer a more complex presentation should spend a year supporting business users; they'd develop a much better appreciation for the need to seek simpler solutions. Pitfall 4. Populate dimensional models on a standalone basis without regard to a data architecture that ties them together using shared, conformed dimensions. Pitfall 3. Load only summarized data into the presentation area's dimensional structures. Pitfall 2. Presume that the business, its requirements and analytics, and the underlying data and the supporting technology are static. Pitfall 1. Neglect to acknowledge that data warehouse success is tied directly to user acceptance. If the users haven't accepted the data warehouse as a foundation for improved decision making, then your efforts have been exercises in futility.

Dimensional Modeling Pitfalls


Pitfall 10. Become overly enamored with technology and data rather than focusing on the business's requirements and goals. Pitfall 9. Fail to embrace or recruit an influential, accessible, and reasonable management visionary as the business sponsor of the data warehouse. Pitfall 8. Tackle a galactic multiyear project rather than pursuing more manageable, while still compelling, iterative development efforts. Pitfall 7. Allocate energy to construct a normalized data structure, yet run out of budget before building a viable presentation area based on dimensional models. Pitfall 6. Pay more attention to backroom operational performance and ease of development than to front-room query performance and ease of use.

S-ar putea să vă placă și