
Data Warehouse

Data Warehouse Architecture

Data Warehouse definition by William H. Inmon:

A data warehouse is a:
subject-oriented
integrated
time-variant
non-volatile
collection of data in support of management's decision-making process.

A data warehouse is a centralized repository that stores data from multiple information
sources and transforms them into a common, multidimensional data model for efficient
querying and analysis.
OLTP vs. OLAP
We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general
we can assume that OLTP systems provide source data to data warehouses, whereas
OLAP systems help to analyze it.

- OLTP (On-line Transaction Processing) is characterized by a large number of short
on-line transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP
systems is on very fast query processing, maintaining data integrity in multi-access
environments, and effectiveness measured by the number of transactions per second. An
OLTP database holds detailed and current data, and the schema used to store
transactional data is the entity model (usually 3NF).

- OLAP (On-line Analytical Processing) is characterized by a relatively low volume of
transactions. Queries are often very complex and involve aggregations. For OLAP
systems, response time is the measure of effectiveness. OLAP applications are widely
used in data mining. An OLAP database holds aggregated, historical data,
stored in multi-dimensional schemas (usually a star schema).
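
The contrast between the two workloads can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module with an in-memory database, and the orders table and the place_order and revenue_by_region helpers are hypothetical names, not part of any product mentioned here. The first helper does OLTP-style work (one short write transaction per call); the second does OLAP-style work (one read-only aggregation over many rows).

```python
import sqlite3

# Hypothetical in-memory database standing in for an operational system.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)


def place_order(region: str, amount: float) -> int:
    """OLTP-style work: one short transaction that writes a single row."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "INSERT INTO orders (region, amount) VALUES (?, ?)", (region, amount)
        )
    return cur.lastrowid


def revenue_by_region() -> dict:
    """OLAP-style work: one read-only query that aggregates many rows."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
    )
    return dict(rows.fetchall())


# Many small writes, as an operational system would issue them.
for region, amount in [("North", 120.0), ("North", 80.0), ("South", 50.0)]:
    place_order(region, amount)

print(revenue_by_region())
```

In a real deployment the two workloads would normally run on separate systems, with the analytical side fed from the operational side.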

The following table summarizes the major differences between OLTP and OLAP system
design.

OLTP System: Online Transaction Processing (Operational System)
OLAP System: Online Analytical Processing (Data Warehouse)

Source of data
  OLTP: Operational data; OLTPs are the original source of the data.
  OLAP: Consolidation data; OLAP data comes from the various OLTP databases.

Purpose of data
  OLTP: To control and run fundamental business tasks.
  OLAP: To help with planning, problem solving, and decision support.

What the data reveals
  OLTP: A snapshot of ongoing business processes.
  OLAP: Multi-dimensional views of various kinds of business activities.

Inserts and Updates
  OLTP: Short and fast inserts and updates initiated by end users.
  OLAP: Periodic long-running batch jobs refresh the data.

Queries
  OLTP: Relatively standardized and simple queries returning relatively few records.
  OLAP: Often complex queries involving aggregations.

Processing Speed
  OLTP: Typically very fast.
  OLAP: Depends on the amount of data involved; batch data refreshes and complex
        queries may take many hours; query speed can be improved by creating indexes.

Space Requirements
  OLTP: Can be relatively small if historical data is archived.
  OLAP: Larger due to the existence of aggregation structures and history data;
        requires more indexes than OLTP.

Database Design
  OLTP: Highly normalized with many tables.
  OLAP: Typically de-normalized with fewer tables; use of star and/or snowflake
        schemas.

Backup and Recovery
  OLTP: Backup religiously; operational data is critical to run the business, and
        data loss is likely to entail significant monetary loss and legal liability.
  OLAP: Instead of regular backups, some environments may consider simply
        reloading the OLTP data as a recovery method.

OLTP

Current data
Short database transactions
Online update/insert/delete
Normalization is promoted
High volume transactions
Transaction recovery is necessary

OLAP
Current and historical data
Long database transactions
Batch update/insert/delete
Denormalization is promoted
Low volume transactions
Transaction recovery is not necessary

What is Business Intelligence?


Business Intelligence (BI) is the technology infrastructure for gaining maximum information
from available data for the purpose of improving business processes. A typical BI
infrastructure consists of software for gathering, cleansing,
integrating, analyzing and sharing data. Business Intelligence produces analyses and
provides credible information that helps in making effective, high-quality business
decisions.

The most common kinds of Business Intelligence systems are:

EIS - Executive Information Systems


DSS - Decision Support Systems
MIS - Management Information Systems
GIS - Geographic Information Systems
OLAP - Online Analytical Processing and multidimensional analysis
CRM - Customer Relationship Management

Business Intelligence systems are based on Data Warehouse technology. A Data
Warehouse (DW) gathers information from a wide range of a company's operational
systems, and Business Intelligence systems are built on top of it. Data loaded into the DW
is usually well integrated and cleaned, which makes it possible to produce credible
information reflecting the so-called 'one version of the truth'.

Business Intelligence tools


The most popular BI tools on the market are:

Oracle - Siebel Business Analytics Applications


SAS - Business Intelligence
SAP - BusinessObjects XI
IBM - Cognos 8 BI
Oracle - Hyperion System 9 BI+
Microsoft - Analysis Services
MicroStrategy - Dynamic Enterprise Dashboards
Pentaho - Open BI Suite
Information Builders - WebFOCUS Business Intelligence
QlikTech - QlikView
TIBCO Spotfire - Enterprise Analytics
Sybase - InfoMaker
KXEN - IOLAP
SPSS ShowCase

ETL tools
List of the most popular ETL tools:

Informatica - Power Center


IBM - WebSphere DataStage (Formerly known as Ascential DataStage)
SAP - BusinessObjects Data Integrator
IBM - Cognos Data Manager (Formerly known as Cognos DecisionStream)
Microsoft - SQL Server Integration Services
Oracle - Data Integrator (Formerly known as Sunopsis Data Conductor)
SAS - Data Integration Studio
Oracle - Warehouse Builder
Ab Initio
Information Builders - Data Migrator
Pentaho - Pentaho Data Integration
Embarcadero Technologies - DT/Studio
IKAN - ETL4ALL
IBM - DB2 Warehouse Edition
Pervasive - Data Integrator
ETL Solutions Ltd. - Transformation Manager
Group 1 Software (Sagent) - DataFlow
Sybase - Data Integrated Suite ETL
Talend - Talend Open Studio
Expressor Software - Expressor Semantic Data Integration System
Elixir - Elixir Repertoire
OpenSys - CloverETL

ETL process
ETL (Extract, Transform and Load) is a process in data warehousing responsible for
pulling data out of the source systems and placing it into a data warehouse. ETL involves
the following tasks:

- extracting the data from source systems (SAP, ERP, other operational systems); data
from different source systems is converted into one consolidated data warehouse format
which is ready for transformation processing.

- transforming the data, which may involve the following tasks:

applying business rules (so-called derivations, e.g., calculating new measures and
dimensions),
cleaning (e.g., mapping NULL to 0 or "Male" to "M" and "Female" to "F" etc.),
filtering (e.g., selecting only certain columns to load),
splitting a column into multiple columns and vice versa,
joining together data from multiple sources (e.g., lookup, merge),
transposing rows and columns,
applying any kind of simple or complex data validation (e.g., if the first 3 columns in
a row are empty then reject the row from processing)

- loading the data into a data warehouse, a data repository, or other reporting applications
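
The three tasks above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not how any of the listed ETL tools work internally: the source rows are made up, the "warehouse" is just a Python list, and the function and column names (extract, transform, load, run_etl, qty, unit_price, note) are hypothetical. The transform step applies several of the rules named in the text: validation, splitting a column, cleaning, a derivation, and filtering.

```python
from typing import Optional

# Cleaning rule taken from the text: "Male" -> "M", "Female" -> "F".
GENDER_CODES = {"male": "M", "female": "F"}


def extract() -> list:
    """Stand-in for reading from a source system; the rows are made up."""
    return [
        {"name": "Ann Lee", "gender": "Female", "qty": 2, "unit_price": 10.0, "note": "x"},
        {"name": "Bob Ray", "gender": "Male", "qty": None, "unit_price": 5.0, "note": "y"},
        {"name": None, "gender": None, "qty": None, "unit_price": 1.0, "note": "z"},
    ]


def transform(row: dict) -> Optional[dict]:
    """Apply validation, splitting, cleaning, derivation and filtering to one row."""
    # Validation: reject the row if the first three columns are all empty.
    if all(row[col] is None for col in ("name", "gender", "qty")):
        return None
    # Splitting: one "name" column becomes two columns.
    first, _, last = (row["name"] or "").partition(" ")
    # Cleaning: map NULL to 0 and long gender labels to one-letter codes.
    qty = row["qty"] if row["qty"] is not None else 0
    gender = GENDER_CODES.get((row["gender"] or "").lower())
    return {
        "first_name": first,
        "last_name": last,
        "gender": gender,
        "qty": qty,
        # Derivation: a new measure calculated from existing columns.
        "revenue": qty * row["unit_price"],
        # Filtering: "unit_price" and "note" are not carried into the target row.
    }


def load(rows: list, warehouse: list) -> None:
    """Stand-in for writing to the warehouse; here it appends to a list."""
    warehouse.extend(rows)


def run_etl() -> list:
    warehouse = []
    cleaned = [r for r in map(transform, extract()) if r is not None]
    load(cleaned, warehouse)
    return warehouse


print(run_etl())
```

Of the three source rows, the third is rejected by the validation rule and the other two reach the target in cleaned form.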

Magic Quadrant for Business Intelligence


In the picture below you can see a global view of the market of the main Business
Intelligence software vendors prepared by Gartner, Inc. - the world's leading information
technology research and advisory company.

Magic Quadrant for Data Integration Tools


Data Integration Tools are strategic basics for enterprise data management. In the picture
below you can see a global view of the market of the main Data Integration Tools
software vendors prepared by Gartner, Inc. - the world's leading information technology
research and advisory company.
The discipline of data integration comprises the practices, architectural techniques and
tools for achieving consistent access to, and delivery of, data across the spectrum of data
subject areas and data structure types in the enterprise, to meet the data consumption
requirements of all applications and business processes. As such, data integration
capabilities are at the heart of the information-centric infrastructure and will power the
frictionless sharing of data across all organizational and system boundaries.
Contemporary pressures are leading to an increased investment in data integration in all
industries and geographic regions. Business drivers, such as the imperative for speed to
market and agility to change business processes and models, are forcing organizations to
manage their data assets differently. Simplification of processes and the IT infrastructure
are necessary to achieve transparency, and transparency requires a consistent and
complete view of the data, which represents the performance and operation of the
business. Data integration is a critical component of an overall enterprise information
management (EIM) strategy that can address these data-oriented issues.

Data Warehouse Database Management Systems


The data warehouse database management system (DBMS) market leaders are:

- IBM DB2 Warehouse 9.5


- Microsoft SQL Server 2008
- Oracle Database 11g
- Teradata Enterprise Data Warehouse 12.0
- Sybase IQ
- Netezza Performance Server

The table below lists Worldwide Vendor Revenue Estimates from RDBMS Software,
Based on Total Software Revenue, 2006 (Millions of Dollars).

Company           2006      2006 Market    2005      2005 Market    2005-2006
                Revenue      Share (%)    Revenue     Share (%)     Growth (%)
Oracle           7,168.0       47.1       6,238.2       46.8          14.9
IBM              3,204.1       21.1       2,945.7       22.1           8.8
Microsoft        2,654.4       17.4       2,073.2       15.6          28.0
Teradata           494.2        3.2         467.6        3.5           5.7
Sybase             486.7        3.2         449.9        3.4           8.2
Other Vendors    1,206.3        7.9       1,149.0        8.6           5.0
Total           15,213.7      100.0      13,323.5      100.0          14.2

source: Gartner Dataquest (June 2007)
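
The share and growth columns follow directly from the two revenue columns. The short sketch below recomputes them from the revenue figures in the table; the dictionary layout and the function names market_share_2006 and growth are illustrative choices, not part of the Gartner report.

```python
# Revenue figures in millions of dollars, copied from the table: (2006, 2005).
revenue = {
    "Oracle": (7168.0, 6238.2),
    "IBM": (3204.1, 2945.7),
    "Microsoft": (2654.4, 2073.2),
    "Teradata": (494.2, 467.6),
    "Sybase": (486.7, 449.9),
    "Other Vendors": (1206.3, 1149.0),
}

total_2006 = sum(r2006 for r2006, _ in revenue.values())


def market_share_2006(company: str) -> float:
    """Share of total 2006 revenue, in percent, rounded to one decimal."""
    return round(100 * revenue[company][0] / total_2006, 1)


def growth(company: str) -> float:
    """Revenue growth from 2005 to 2006, in percent, rounded to one decimal."""
    r2006, r2005 = revenue[company]
    return round(100 * (r2006 - r2005) / r2005, 1)


for name in revenue:
    print(f"{name:15s} share {market_share_2006(name):5.1f}%  growth {growth(name):5.1f}%")
```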

Comparing the table with the list we can see that the same database
vendors offer software for relational database systems as well as for data warehouses.

Business Intelligence market consolidation


There were a lot of acquisitions in the Business Intelligence market in 2007, making it the
most turbulent year so far in BI. The BI industry has seen a wave of takeovers since the
mid 1990s, with a dramatic boost in 1994 and 2003, but none of them can compare to
2007.

The following acquisitions are among the most spectacular ones:


Hyperion Solutions by Oracle
Business Objects by SAP
Cognos by IBM
Some of them were non-consolidating acquisitions, like SAP with Business Objects and
IBM with Cognos, where Business Objects and Cognos became separate business units,
whereas others were genuine consolidations, like Oracle with Hyperion, where
Hyperion was integrated to a much greater extent.

The chart below shows consolidation trends and the market share of the top few vendors in
1994-2006. The top three OLAP vendors control about 64% of the market, and the top
five control almost 80%.
Data Warehouse Schema Architecture
A data warehouse environment usually transforms the relational data model into some
special architectures. There are many schema models designed for data warehousing, but
the most commonly used are:

- Star schema

- Snowflake schema

- Fact constellation schema

The choice of schema model for a data warehouse should be
based upon an analysis of project requirements, available tools and project team
preferences.

Star schema
What is a star schema? The star schema architecture is the simplest data warehouse
schema. It is called a star schema because the diagram resembles a star, with points
radiating from a center. The center of the star consists of the fact table and the points of the
star are the dimension tables. Usually the fact tables in a star schema are in third normal
form (3NF) whereas dimension tables are de-normalized. Although the star
schema is the simplest architecture, it is the most commonly used nowadays and is
recommended by Oracle.

Fact Tables

A fact table typically has two types of columns: foreign keys to dimension tables and
measures, which contain numeric facts. A fact table can contain facts at a detailed or
aggregated level.

Dimension Tables

A dimension is a structure, usually composed of one or more hierarchies, that categorizes
data. If a dimension has no hierarchies and levels, it is called a flat dimension or list.
The primary keys of each of the dimension tables are part of the composite primary key
of the fact table. Dimensional attributes help to describe the dimensional value. They are
normally descriptive, textual values. Dimension tables are generally smaller in size than
the fact table.

Typical fact tables store data about sales, while dimension tables store data about geographic
regions (markets, cities), clients, products, times and channels.

The main characteristics of star schema:

Simple structure -> easy to understand schema
High query efficiency -> small number of tables to join
Relatively long time of loading data into dimension tables -> because of de-normalization
and data redundancy, the tables can be large
The most commonly used in data warehouse implementations -> widely supported
by a large number of business intelligence tools
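
A star schema can be sketched with one fact table and two de-normalized dimension tables. The example below is an illustration only, using Python's built-in sqlite3 module; every table, column and data value (dim_product, dim_region, fact_sales and so on) is a hypothetical choice. The fact table holds foreign keys and measures, its composite primary key is built from the dimension keys, and a typical query needs one join per dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: de-normalized, descriptive attributes.
CREATE TABLE dim_product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT
);
CREATE TABLE dim_region (
    region_id INTEGER PRIMARY KEY,
    city      TEXT,
    market    TEXT
);
-- Fact table: foreign keys to the dimensions plus numeric measures.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    region_id  INTEGER REFERENCES dim_region(region_id),
    units      INTEGER,
    amount     REAL,
    PRIMARY KEY (product_id, region_id)
);
""")
conn.executemany(
    "INSERT INTO dim_product VALUES (?, ?, ?)",
    [(1, "Laptop", "Hardware"), (2, "Mouse", "Hardware"), (3, "Editor", "Software")],
)
conn.executemany(
    "INSERT INTO dim_region VALUES (?, ?, ?)",
    [(1, "Lyon", "Europe"), (2, "Boston", "Americas")],
)
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
    [(1, 1, 2, 2000.0), (2, 1, 10, 200.0), (3, 2, 5, 500.0), (1, 2, 1, 1000.0)],
)


def sales_by_category_and_market() -> list:
    """Aggregate the measure along two dimensions: one join per dimension."""
    return conn.execute("""
        SELECT p.category, r.market, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        JOIN dim_region  AS r ON r.region_id  = f.region_id
        GROUP BY p.category, r.market
        ORDER BY p.category, r.market
    """).fetchall()


for row in sales_by_category_and_market():
    print(row)
```

Note the redundancy in dim_product: the category name is repeated on every product row, which is the de-normalization the characteristics above refer to.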

Snowflake schema
What is snowflake schema? The snowflake schema architecture is a more complex
variation of the star schema used in a data warehouse, because the tables which describe
the dimensions are normalized.
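
To illustrate what normalizing a dimension means, the sketch below splits a product dimension into a product table and a separate category table. It uses Python's built-in sqlite3 module, and all table, column and data names are hypothetical. Compared with a star schema, the category name is stored once, but the same report needs one more join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The product dimension is normalized: the category lives in its own table.
CREATE TABLE dim_category (
    category_id   INTEGER PRIMARY KEY,
    category_name TEXT
);
CREATE TABLE dim_product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    category_id  INTEGER REFERENCES dim_category(category_id)
);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    amount     REAL
);
""")
conn.executemany(
    "INSERT INTO dim_category VALUES (?, ?)", [(1, "Hardware"), (2, "Software")]
)
conn.executemany(
    "INSERT INTO dim_product VALUES (?, ?, ?)",
    [(1, "Laptop", 1), (2, "Mouse", 1), (3, "Editor", 2)],
)
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?)", [(1, 2000.0), (2, 200.0), (3, 500.0)]
)


def sales_by_category() -> list:
    """Reaching the category name takes two joins instead of one."""
    return conn.execute("""
        SELECT c.category_name, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product  AS p ON p.product_id  = f.product_id
        JOIN dim_category AS c ON c.category_id = p.category_id
        GROUP BY c.category_name
        ORDER BY c.category_name
    """).fetchall()


print(sales_by_category())
```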

Fact constellation schema

What is a fact constellation schema? For each star schema it is possible to construct a fact
constellation schema (for example, by splitting the original star schema into several star
schemas, each of which describes facts at a different level of the dimension hierarchies). The fact
constellation architecture contains multiple fact tables that share many dimension tables.

The main shortcoming of the fact constellation schema is a more complicated design,
because many variants for particular kinds of aggregation must be considered and
selected. Moreover, dimension tables are still large.
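
As a rough sketch of the idea, the example below keeps two fact tables at different levels of the time hierarchy (daily and monthly) that share the same product dimension. It uses Python's built-in sqlite3 module; the table names, columns and data are hypothetical, and filling the monthly table from the daily one is just one possible way to populate an aggregated fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions shared by the fact tables.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT);

-- Fact table 1: detailed facts at the day level.
CREATE TABLE fact_sales_daily (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    amount     REAL
);
-- Fact table 2: the same facts aggregated to the month level.
CREATE TABLE fact_sales_monthly (
    product_id INTEGER REFERENCES dim_product(product_id),
    month      TEXT,
    amount     REAL
);
""")
conn.executemany(
    "INSERT INTO dim_product VALUES (?, ?)", [(1, "Laptop"), (2, "Mouse")]
)
conn.executemany(
    "INSERT INTO dim_date VALUES (?, ?, ?)",
    [(1, "2008-01-05", "2008-01"), (2, "2008-01-20", "2008-01"), (3, "2008-02-03", "2008-02")],
)
conn.executemany(
    "INSERT INTO fact_sales_daily VALUES (?, ?, ?)",
    [(1, 1, 100.0), (1, 2, 50.0), (1, 3, 70.0), (2, 1, 30.0)],
)
# Populate the aggregated fact table from the detailed one.
conn.execute("""
    INSERT INTO fact_sales_monthly (product_id, month, amount)
    SELECT f.product_id, d.month, SUM(f.amount)
    FROM fact_sales_daily AS f
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY f.product_id, d.month
""")


def monthly_sales() -> list:
    """Query the aggregated fact table through the shared product dimension."""
    return conn.execute("""
        SELECT p.product_name, m.month, m.amount
        FROM fact_sales_monthly AS m
        JOIN dim_product AS p ON p.product_id = m.product_id
        ORDER BY p.product_name, m.month
    """).fetchall()


print(monthly_sales())
```

Monthly reports can read the small aggregated table directly, at the cost of having to design and maintain each aggregation level.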
Magic Quadrant for Data Quality Tools 2008
Data quality is one of the main factors that influence whether good decisions can be made with
BI tools. With 'dirty data', even the newest and most sophisticated reporting tools
can lead to conclusions that are partly or wholly untrue. The
main task of data quality tools is therefore to work as a process that takes in data from any source,
in any condition (e.g. from third-party or custom applications, eCommerce or Web
service platforms, ETL tools or legacy systems; records may be jumbled, the fields
inconsistent, the values misplaced) and outputs clean, standardized and actionable data
that can be used with confidence. For those reasons BI vendors often include
tools for data governance and data quality management in their all-in-one products.

Data quality tools generally fall into one of three categories:

auditing
cleansing
migration
Among the leaders of that market are Business Objects, DataFlux, IBM, Trillium
Software and Informatica. The picture below gives the full data quality tools
market overview prepared by Gartner, the world's leading information technology research and
advisory company. A yellow dot shows the 2007 position and an orange dot the
2008 position, whereas a green line between them shows the change:

source:it.toolbox.com
Business Intelligence Platforms
A modern Business Intelligence platform should provide end-to-end infrastructure,
solutions and technologies that support the following areas:

information integration
master data management
data warehousing
BI tools
repository of best practices and business models

Among the many solutions available on the market we might pay attention to the
following solution providers:

- Microsoft

- Sybase

- IBM

- QlikTech

- SAS
