A data warehouse is a:
- subject-oriented
- integrated
- time-varying
- non-volatile
collection of data in support of the management's decision-making process.
A data warehouse is a centralized repository that stores data from multiple information
sources and transforms them into a common, multidimensional data model for efficient
querying and analysis.
OLTP vs. OLAP
We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general
we can assume that OLTP systems
provide source data to data warehouses, whereas OLAP systems help to analyze it.
The following table summarizes the major differences between OLTP and OLAP system
design.
OLTP                                   OLAP
Current data                           Current and historical data
Short database transactions            Long database transactions
Online update/insert/delete            Batch update/insert/delete
Normalization is promoted              Denormalization is promoted
High volume transactions               Low volume transactions
Transaction recovery is necessary      Transaction recovery is not necessary
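The contrast above can be illustrated with a toy example. The sketch below uses a hypothetical single-table schema and Python's built-in sqlite3 module: it runs many short single-row transactions in OLTP style, then one read-only aggregate query in OLAP style.

```python
import sqlite3

# Hypothetical single-table schema used only to contrast the two workloads.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# OLTP style: many short transactions, each touching a single current row.
for region, amount in [("EU", 10.0), ("US", 25.0), ("EU", 5.0)]:
    with con:  # each iteration is its own short transaction
        con.execute("INSERT INTO orders (region, amount) VALUES (?, ?)",
                    (region, amount))

# OLAP style: one long read-only query scanning the history and aggregating.
rows = con.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('EU', 15.0), ('US', 25.0)]
```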
ETL tools
List of the most popular ETL tools:
ETL process
ETL (Extract, Transform and Load) is a process in data warehousing responsible for pulling data out of the source systems and placing it into a data warehouse. ETL involves the following tasks:
- extracting the data from source systems (SAP, ERP, other operational systems); data from the different source systems is converted into one consolidated data warehouse format which is ready for transformation processing.
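A minimal ETL sketch, assuming two hypothetical source feeds (erp_feed and crm_feed) with different field layouts that are consolidated into one warehouse format:

```python
import sqlite3

# Hypothetical feeds from two source systems with different field layouts.
erp_feed = [{"cust": "Acme", "total": "120.50"}]
crm_feed = [{"customer_name": "Beta Ltd", "amount_eur": 80.0}]

def extract():
    # Extract: pull raw records out of each source system.
    yield from (("erp", r) for r in erp_feed)
    yield from (("crm", r) for r in crm_feed)

def transform(source, record):
    # Transform: convert every source layout into one consolidated format.
    if source == "erp":
        return (record["cust"], float(record["total"]))
    return (record["customer_name"], float(record["amount_eur"]))

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dw_sales (customer TEXT, amount REAL)")

# Load: place the consolidated rows into the warehouse table.
with con:
    con.executemany("INSERT INTO dw_sales VALUES (?, ?)",
                    (transform(s, r) for s, r in extract()))

result = con.execute("SELECT * FROM dw_sales ORDER BY customer").fetchall()
print(result)  # [('Acme', 120.5), ('Beta Ltd', 80.0)]
```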
The table below lists Worldwide Vendor Revenue Estimates from RDBMS Software, Based on Total Software Revenue, 2006 (Millions of Dollars).
Comparing the table with the list, we can see that the same database vendors offer software for relational database systems as well as for data warehouses.
The chart below shows consolidation trends and the market share of the top few vendors in 1994-2006. The top three OLAP vendors control circa 64% of the market, while the top five control almost 80%.
Data Warehouse Schema Architecture
Data Warehouse environment usually transforms the relational data model into some
special architectures. There are many schema models designed for data warehousing but
the most commonly used are:
- Star schema
- Snowflake schema
The choice of schema model for a data warehouse should be based on an analysis of the project requirements, the available tools and the project team's preferences.
Star schema
What is a star schema? The star schema architecture is the simplest data warehouse schema. It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of a fact table, and the points of the star are the dimension tables. Usually the fact tables in a star schema are in third normal form (3NF), whereas the dimension tables are de-normalized. Despite being the simplest architecture, the star schema is the most commonly used nowadays and is recommended by Oracle.
Fact Tables
A fact table typically has two types of columns: foreign keys to dimension tables and measures, i.e. columns that contain numeric facts. A fact table can contain data at a detail or an aggregated level.
Dimension Tables
Typical fact tables store data about sales, while dimension tables store data about geographic regions (markets, cities), clients, products, times and channels.
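The star schema, fact table and dimension tables described above can be sketched as follows; the table names and data are hypothetical, and Python's built-in sqlite3 module stands in for a real warehouse engine:

```python
import sqlite3

# Minimal star schema: one fact table in the center with foreign keys
# pointing at de-normalized dimension tables (hypothetical names and data).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_region  (region_id  INTEGER PRIMARY KEY, city TEXT, market TEXT);
CREATE TABLE fact_sales  (
    product_id INTEGER REFERENCES dim_product(product_id),
    region_id  INTEGER REFERENCES dim_region(region_id),
    amount     REAL    -- the numeric measure
);
INSERT INTO dim_product VALUES (1, 'Widget', 'Tools');
INSERT INTO dim_region  VALUES (1, 'Berlin', 'EU');
INSERT INTO fact_sales  VALUES (1, 1, 99.0), (1, 1, 1.0);
""")

# A typical star query: join the fact table to its dimensions and aggregate.
row = con.execute("""
    SELECT r.market, p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_region  r ON r.region_id  = f.region_id
    GROUP BY r.market, p.category
""").fetchone()
print(row)  # ('EU', 'Tools', 100.0)
```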
Snowflake schema
What is snowflake schema? The snowflake schema architecture is a more complex
variation of the star schema used in a data warehouse, because the tables which describe
the dimensions are normalized.
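A minimal snowflake sketch (hypothetical tables, with Python's built-in sqlite3 module as a stand-in engine): the product dimension is normalized, so the category sits in its own table and is reached through an extra join.

```python
import sqlite3

# Snowflake variant: the dimension describing products is normalized into
# dim_product and dim_category (hypothetical names and data).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name TEXT,
                           category_id INTEGER REFERENCES dim_category(category_id));
CREATE TABLE fact_sales   (product_id INTEGER, amount REAL);
INSERT INTO dim_category VALUES (10, 'Tools');
INSERT INTO dim_product  VALUES (1, 'Widget', 10);
INSERT INTO fact_sales   VALUES (1, 42.0);
""")

# The same kind of query now needs one more join than in a star schema.
row = con.execute("""
    SELECT c.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product  p ON p.product_id  = f.product_id
    JOIN dim_category c ON c.category_id = p.category_id
    GROUP BY c.name
""").fetchone()
print(row)  # ('Tools', 42.0)
```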
What is a fact constellation schema? For each star schema it is possible to construct a fact constellation schema (for example by splitting the original star schema into several star schemas, each of which describes facts at a different level of the dimension hierarchies). The fact constellation architecture contains multiple fact tables that share many dimension tables. The main shortcoming of the fact constellation schema is a more complicated design, because many variants for particular kinds of aggregation must be considered and selected. Moreover, the dimension tables are still large.
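A fact constellation can be sketched the same way; here two hypothetical fact tables at different aggregation levels share one dimension table (sqlite3 again stands in for a real engine):

```python
import sqlite3

# Two fact tables at different grains sharing the same dimension table
# (hypothetical names and data).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, market TEXT);
CREATE TABLE fact_sales_daily   (region_id INTEGER, day   TEXT, amount REAL);
CREATE TABLE fact_sales_monthly (region_id INTEGER, month TEXT, amount REAL);
INSERT INTO dim_region VALUES (1, 'EU');
INSERT INTO fact_sales_daily   VALUES (1, '2008-01-01', 5.0), (1, '2008-01-02', 7.0);
INSERT INTO fact_sales_monthly VALUES (1, '2008-01', 12.0);
""")

# Both fact tables join to the shared dimension; the monthly table holds
# the same facts pre-aggregated to another level of the time hierarchy.
daily = con.execute("""
    SELECT SUM(f.amount) FROM fact_sales_daily f
    JOIN dim_region r ON r.region_id = f.region_id WHERE r.market = 'EU'
""").fetchone()[0]
monthly = con.execute("""
    SELECT SUM(f.amount) FROM fact_sales_monthly f
    JOIN dim_region r ON r.region_id = f.region_id WHERE r.market = 'EU'
""").fetchone()[0]
print(daily, monthly)  # 12.0 12.0
```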
Magic Quadrant for Data Quality Tools 2008
Data quality is one of the main factors which influence making good decisions based on various BI tools. Even with the newest and most sophisticated reporting tools, 'dirty data' can lead us to conclusions which are partly or wholly untrue. So the main task of data quality tools is to work as a process which takes in data from any source, in any condition (e.g. from third-party or custom applications, eCommerce or Web service platforms, ETL tools or legacy systems; records may be jumbled, fields inconsistent, values misplaced) and outputs clean, standardized and actionable data. It's data you can use with confidence. For those reasons BI vendors often include tools for data governance and data quality management in their all-in-one products.
Data quality tools typically cover:
- auditing
- cleansing
- migration
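A toy cleansing and auditing pass over hypothetical 'dirty' records, sketching how standardization, deduplication and auditing of incomplete values might look:

```python
# Hypothetical 'dirty' input records with jumbled casing, stray whitespace,
# a duplicate and a missing value.
raw = [
    {"name": "  acme corp ", "country": "de"},
    {"name": "ACME CORP",    "country": "DE"},   # duplicate after cleansing
    {"name": "Beta Ltd",     "country": ""},     # incomplete value to audit
]

def cleanse(rec):
    # Standardize: collapse whitespace, normalize casing, flag empty fields.
    return {
        "name": " ".join(rec["name"].split()).title(),
        "country": rec["country"].strip().upper() or None,
    }

seen, clean, audit = set(), [], []
for rec in raw:
    c = cleanse(rec)
    if c["country"] is None:
        audit.append(c["name"])          # auditing: note incomplete records
    key = (c["name"], c["country"])
    if key not in seen:                  # deduplicate standardized records
        seen.add(key)
        clean.append(c)

print(clean)  # [{'name': 'Acme Corp', 'country': 'DE'}, {'name': 'Beta Ltd', 'country': None}]
print(audit)  # ['Beta Ltd']
```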
Among the leaders of that market are Business Objects, DataFlux, IBM, Trillium Software and Informatica. In the picture below we can find the full data quality tools market overview prepared by the world's leading information technology research and advisory company, Gartner. A yellow dot shows the 2007 position and an orange dot the 2008 position, whereas a green line between them shows the change:
source:it.toolbox.com
Business Intelligence Platforms
A modern Business Intelligence platform should provide an end-to-end infrastructure, solutions and technologies that support the following:
- information integration
- master data management
- data warehousing
- BI tools
- repository of best practices and business models
Among the many solutions available on the market we might pay attention to the
following solution providers:
- Microsoft
- Sybase
- IBM
- QlikTech
- SAS