
AIM:- To learn the architectural framework for data warehousing

THEORY:-

Overview of Data Warehousing


Data warehouse databases provide a decision support system (DSS) environment in which you
can evaluate the performance of an entire enterprise over time. In the broadest sense, the
term data warehouse is used to refer to a database that contains very large stores of historical
data. The data is stored as a series of snapshots, in which each record represents data at a
specific time. By analyzing these snapshots you can make comparisons between different time
periods. You can then use these comparisons to help make important business decisions.
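
As a minimal sketch of this snapshot idea, the Python example below compares total revenue between two snapshot dates. The records, the field names (snapshot_date, region, revenue) and the helper compare_periods are hypothetical and only illustrate the comparison; a real warehouse would normally run such a comparison as a query in the database.

```python
from datetime import date

# Hypothetical snapshot records: each one captures a measure at a specific time.
snapshots = [
    {"snapshot_date": date(2023, 1, 31), "region": "North", "revenue": 120_000},
    {"snapshot_date": date(2023, 1, 31), "region": "South", "revenue": 95_000},
    {"snapshot_date": date(2024, 1, 31), "region": "North", "revenue": 150_000},
    {"snapshot_date": date(2024, 1, 31), "region": "South", "revenue": 90_000},
]


def total_revenue(records, snapshot_date):
    """Sum the revenue of all records taken on the given snapshot date."""
    return sum(r["revenue"] for r in records if r["snapshot_date"] == snapshot_date)


def compare_periods(records, earlier, later):
    """Return the absolute and percentage change in total revenue."""
    before = total_revenue(records, earlier)
    after = total_revenue(records, later)
    change = after - before
    pct = 100.0 * change / before
    return change, pct


change, pct = compare_periods(snapshots, date(2023, 1, 31), date(2024, 1, 31))
print(f"Revenue change: {change} ({pct:.1f}%)")
```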

1. A data warehouse is a database that is optimized for data retrieval to facilitate reporting and analysis.
2. A data warehouse incorporates information about many subject areas, often the entire
enterprise.
3. The design of a data warehouse often starts from an analysis of what data already exists
and how it can be collected in such a way that the data can later be used.

PROCEDURE:- ARCHITECTURAL FRAMEWORK

Generally, a data warehouse adopts a three-tier architecture. Following are
the three tiers of the data warehouse architecture.
 Bottom Tier − The bottom tier of the architecture is the data warehouse
database server. It is the relational database system. We use the back end tools
and utilities to feed data into the bottom tier. These back end tools and utilities
perform the Extract, Clean, Load, and refresh functions.

 Middle Tier − In the middle tier, we have the OLAP Server that can be
implemented in either of the following ways.

o By Relational OLAP (ROLAP), which is an extended relational database
management system. ROLAP maps the operations on
multidimensional data to standard relational operations.

o By Multidimensional OLAP (MOLAP), which directly implements
multidimensional data and operations.

 Top-Tier − This tier is the front-end client layer. This layer holds the query
tools and reporting tools, analysis tools and data mining tools.
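
The ROLAP mapping described for the middle tier can be sketched with Python's built-in sqlite3 module: a roll-up over a dimension becomes a GROUP BY, and a slice becomes a WHERE filter. The sales table, its columns and the sample rows are assumptions made for illustration, and SQLite merely stands in for a warehouse database server.

```python
import sqlite3

# In-memory database standing in for the bottom-tier relational server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (year INTEGER, region TEXT, product TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (2023, "North", "Widget", 100.0),
        (2023, "North", "Gadget", 50.0),
        (2023, "South", "Widget", 70.0),
        (2024, "North", "Widget", 120.0),
        (2024, "South", "Gadget", 30.0),
    ],
)


def roll_up_by_year(connection):
    """Roll-up: aggregate away region and product, keeping only the year."""
    return connection.execute(
        "SELECT year, SUM(amount) FROM sales GROUP BY year ORDER BY year"
    ).fetchall()


def slice_by_region(connection, region):
    """Slice: fix one value of the region dimension."""
    return connection.execute(
        "SELECT year, product, amount FROM sales "
        "WHERE region = ? ORDER BY year, product",
        (region,),
    ).fetchall()


print(roll_up_by_year(conn))
print(slice_by_region(conn, "South"))
```

A MOLAP server would instead keep the aggregated cells in a multidimensional structure, so no such translation to SQL is needed.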

BUILDING BLOCKS OF DWH:-


Operational Source Systems

 Operational systems are used to process everyday transactions of an organization


 The operational systems are designed in such a way that the transactions occur
smoothly and the data-integrity is maintained efficiently
 The operational systems have very fast insert/update since minimal data is affected
each time a transaction occurs
 In order to improve performance the old data is purged systematically

Data Staging Area

ETL - Extraction, Transformation and Loading.


Extraction

 The extraction methods in a data warehouse depend on the performance of the source
system and the demands of the business.
 Full extraction is applied when the data is required to be retrieved and loaded the first
time. Hence, this extraction represents the current data available in the source system
 Incremental extraction is a process where the differences in the source data since the
last extraction are captured. Only the changes will be loaded based on the last changed
timestamp
 Online extraction is a process where the data is extracted from the source system
directly
 Offline extraction is a process of extraction where the source system is emptied into a
flat file outside of the source. This flat file is used to extract the data
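
The difference between full and incremental extraction can be sketched as follows. This is a simplified illustration, assuming each source row carries a last_changed timestamp; the function names and the row layout are hypothetical.

```python
from datetime import datetime


def extract_full(source_rows):
    """Full extraction: return every row currently in the source."""
    return list(source_rows)


def extract_incremental(source_rows, last_extracted_at):
    """Incremental extraction: return rows changed since the last run.

    Also returns the new watermark to store for the next extraction.
    """
    changed = [r for r in source_rows if r["last_changed"] > last_extracted_at]
    new_watermark = max(
        (r["last_changed"] for r in changed), default=last_extracted_at
    )
    return changed, new_watermark


# Hypothetical source rows with a last-changed timestamp.
source_rows = [
    {"id": 1, "last_changed": datetime(2024, 1, 1, 8, 0)},
    {"id": 2, "last_changed": datetime(2024, 1, 2, 9, 30)},
    {"id": 3, "last_changed": datetime(2024, 1, 3, 12, 0)},
]

changed, watermark = extract_incremental(source_rows, datetime(2024, 1, 1, 23, 59))
print([r["id"] for r in changed], watermark)
```

Storing the returned watermark after each run is one simple way to remember where the next incremental extraction should start.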

Transformation

 The data is transformed based on the transformation rules provided by the business.
The data is converted to a standard format and common semantics
 Data cleansing is the process of identifying and correcting inconsistent data in a
database or table. Data cleansing also involves standardizing data values, for example
mapping Male/Female to M/F
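
The Male/Female to M/F example can be sketched as a small cleansing step. The mapping table, the field name gender and the choice to set unrecognized values aside for review are assumptions for illustration, not rules taken from a particular ETL tool.

```python
# Assumed mapping from free-form source values to the standard codes.
GENDER_CODES = {"male": "M", "m": "M", "female": "F", "f": "F"}


def standardize_gender(value):
    """Map a source value to M or F; return None if it is not recognized."""
    if not isinstance(value, str):
        return None
    return GENDER_CODES.get(value.strip().lower())


def cleanse(records):
    """Split records into cleaned rows and rejected rows needing review."""
    cleaned, rejected = [], []
    for rec in records:
        code = standardize_gender(rec.get("gender"))
        if code is None:
            rejected.append(rec)
        else:
            cleaned.append({**rec, "gender": code})
    return cleaned, rejected


cleaned, rejected = cleanse(
    [
        {"id": 1, "gender": "Male"},
        {"id": 2, "gender": " f "},
        {"id": 3, "gender": "unknown"},
    ]
)
print(cleaned)
print(rejected)
```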

Loading

 Once the data is cleansed and transformed into a structure consistent with the data
warehouse requirements, the data is ready to be loaded into the data warehouse
 The first step of loading is populating the tables in the data warehouse and verifying
that the data is ready for use
 After loading the facts and dimensions, a DBA should check for referential integrity, i.e.
each record in the fact table should be related to a dimension record
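
The referential integrity check mentioned in the last point can be sketched as a search for orphan fact rows, i.e. fact records whose key has no matching dimension record. The table contents and key names below are hypothetical.

```python
def find_orphan_facts(fact_rows, dimension_rows, fact_key, dimension_key):
    """Return fact rows whose foreign key matches no dimension row."""
    dimension_keys = {d[dimension_key] for d in dimension_rows}
    return [f for f in fact_rows if f[fact_key] not in dimension_keys]


# Hypothetical product dimension and sales fact table.
dim_product = [
    {"product_id": 10, "name": "Widget"},
    {"product_id": 20, "name": "Gadget"},
]
fact_sales = [
    {"sale_id": 1, "product_id": 10, "amount": 99.0},
    {"sale_id": 2, "product_id": 30, "amount": 15.0},  # no such product
    {"sale_id": 3, "product_id": 20, "amount": 42.0},
]

orphans = find_orphan_facts(fact_sales, dim_product, "product_id", "product_id")
print(orphans)
```

An empty result means every fact record is related to a dimension record; any rows returned need to be corrected or reloaded.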

Data Presentation Area

 The presentation area represents a collection of data marts. A data mart is a subset of a
data warehouse
 Data marts are preferred for smaller data volumes and fewer data sources. They make the
data cleaning process easier
 Dependent data marts retrieve data from a central data warehouse whereas the
independent data marts are standalone systems that extract data directly from the operational
systems or external sources
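
A dependent data mart can be sketched as a subset copied out of the central warehouse. The example uses Python's built-in sqlite3 module; the table names warehouse_sales and mart_sales, the columns and the rows are assumptions for illustration only.

```python
import sqlite3

# In-memory database standing in for the central data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE warehouse_sales (department TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [
        ("Retail", 2023, 500.0),
        ("Retail", 2024, 650.0),
        ("Wholesale", 2023, 900.0),
        ("Wholesale", 2024, 1100.0),
    ],
)


def build_dependent_mart(connection, department):
    """Rebuild the mart from warehouse rows that belong to one department."""
    connection.execute("DROP TABLE IF EXISTS mart_sales")
    connection.execute("CREATE TABLE mart_sales (year INTEGER, amount REAL)")
    connection.execute(
        "INSERT INTO mart_sales "
        "SELECT year, amount FROM warehouse_sales WHERE department = ?",
        (department,),
    )
    return connection.execute(
        "SELECT year, amount FROM mart_sales ORDER BY year"
    ).fetchall()


print(build_dependent_mart(conn, "Retail"))
```

An independent data mart would fill the same kind of table directly from operational or external sources instead of from warehouse_sales.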

Data Access Tools

 Business Intelligence tools are used for accessing the data for strategic, operational, and
analytical purposes
 Senior executives and managers access the data warehouse to make critical decisions.
They devise strategies and observe the business performance

E.g. Balanced Scorecards

 Operational managers execute the details of the strategies against the targets.

E.g. Sales Forecasts

 Analytical operations are performed by analysts to evaluate the outcomes of a business
process and understand the functioning of the business

E.g. Financial and Sales Analysis


MANAGEMENT MODULE
INFRASTRUCTURE SUPPORT REQUIRED BY DWH

This is a generic section on what you need to consider as you set up the infrastructure for a Data Warehouse. These
are the considerations for the Data Warehouse platform alone. If you are also placing an OLAP server or end-user tools
(like data mining, enterprise reporting, analytics), they will have their own considerations.
A Data Warehouse is a 'business infrastructure'. In practice, it does not do anything on its own, but provides
sanitized, consistent and integrated information for a host of applications and end-user tools. Therefore, the stability,
availability and response time of this platform are critical. Just like a foundation pillar, its strength is core to your
information management success.

Data Warehouse Data Size


Data warehouses grow fast in size. The growth is not only the incremental data under your current design.
A data warehouse will have frequent additions of new dimensions, attributes and measures. With each such
addition the data could take a quantum jump, as you may bring in the entire historical data related to that
additional dimensional model element. Therefore, as you estimate your data size, be on the conservative side and
leave headroom for such growth.

Data Dynamics for Data Warehouse:


The volume and frequency of data increments determine the processing speed and memory required of the hardware
platform. Data should typically be added on a daily basis. However, the size of each increment can differ
depending on which data you are pulling in. Often, you may pull a huge amount of data from the
source system into the staging area, but load much smaller summary data into the data warehouse.
TIP- Even if you are placing only summary data in the data warehouse, I would advise you to assume
that you will soon have granular data in the data warehouse as well. Sometimes the demand for more detailed data
comes within months of implementing the summary data model. Refer to "Data Warehouse can have broader
applications" and keep granular data in the data warehouse.

Number of Users of Data Warehouse:


The number of users is essentially the number of concurrent logins on a data warehouse platform.
Estimating the number of users of a data warehouse has the following complication:
Sometimes the user is an end-user tool, which may hide the actual number of users.
For example, an enterprise reporting server can access the data warehouse as a few users to generate all the
enterprise reports. After that, the actual users access the database and reports repository of the enterprise
reporting system and not the data warehouse. Similarly, you might be using an analytics system that
creates its own local cube from the data warehouse. The actual users may access that cube without logging into
the data warehouse. Sometimes the users could be referring to a cache of the data warehouse's distributed
database and not to the main data warehouse.

Number of Star-Schemas in Data Warehouse


We are assuming that you may have multiple star-schemas (which finally get translated into cubes as per OLAP).

Nature of Use:
There are expert opinions that the nature of use (ad-hoc vs. scheduled, simple vs. complex queries, data
mining vs. reporting) should be considered when planning the hardware. Our view is that it is not a relevant factor.
Once you have put a DW platform in place, you cannot control the kinds of applications and tools that will be placed
on top of the DW.

Financial Readiness:
It depends on the money that one can spend. If you have a monetary constraint, then instead of a DW platform with all the
attendant infrastructure elements, we would suggest you go for a few data marts first. Building an enterprise data
warehouse that may run out of memory or disk space is risky.

Skills Readiness:
We feel that it should not be a major issue as skills can be bought, especially in today's world where more and more
companies are outsourcing their IT services.

HARDWARE SUPPORT REQUIRED BY DWH


Data Warehouse Hardware
Data warehouse designers and administrators should always plan for
Input/Output performance when implementing a data warehouse. Data warehouse operations
mainly consist of huge data loads and index builds, generation of materialized views, and queries
over large volumes of data. The underlying I/O system of a data warehouse should be built to
meet these heavy requirements.

Number of CPUs
 CPUs are responsible for the calculation abilities of a data warehouse
 Parallel operations are CPU-intensive compared to serial operations
 The number of CPUs is based on the highest throughput required. It is calculated
roughly using the formula below:

No. of CPUs = Maximum throughput in MB/s / 200

200 MB/s is roughly the amount of data a single CPU can sustain

Memory of data warehouse


 A large sort is an example of a memory-intensive operation
 The memory requirements of a data warehouse differ from those of mission-critical
OLTP applications
 The amount of memory is calculated from the number of CPUs as stated below:

Amount of memory in GB = 2 * Number of CPUs
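
The two rules of thumb for CPUs and memory can be combined in a short calculation. This is a rough sketch, assuming throughput is measured in MB/s and that a fractional result is rounded up to a whole CPU; the rounding and the function names are assumptions, not part of the formulas.

```python
import math

# Rule-of-thumb figure from the text: data volume one CPU can sustain.
CPU_SUSTAINED_MB_PER_S = 200


def cpus_needed(max_throughput_mb_s):
    """No. of CPUs = maximum throughput in MB/s / 200, rounded up (assumption)."""
    return max(1, math.ceil(max_throughput_mb_s / CPU_SUSTAINED_MB_PER_S))


def memory_gb_needed(num_cpus):
    """Amount of memory in GB = 2 * number of CPUs."""
    return 2 * num_cpus


cpus = cpus_needed(1500)          # 1500 / 200 = 7.5, rounded up to 8
memory_gb = memory_gb_needed(cpus)
print(cpus, memory_gb)
```

For a target of 1500 MB/s this gives 8 CPUs and 16 GB of memory; treat the figures as a starting estimate rather than a final sizing.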

Number of Disks
 The number of disks is based on the maximum throughput. The storage provider’s specifications
should be used to find out the throughput a disk array can sustain
 The number of disks is calculated as stated below:

Number of disks = Throughput in MB/s / Individual controller throughput in MB/s
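
The disk formula can be sketched in the same way. The function name, the example throughput figures and the rounding up to a whole unit are assumptions; the per-controller throughput should come from the storage provider's specifications.

```python
import math


def disks_needed(throughput_mb_s, controller_throughput_mb_s):
    """Throughput in MB/s divided by individual controller throughput in MB/s.

    The result is rounded up to a whole number (an assumption).
    """
    if controller_throughput_mb_s <= 0:
        raise ValueError("controller throughput must be positive")
    return math.ceil(throughput_mb_s / controller_throughput_mb_s)


# Example with made-up figures: 1500 MB/s target, 180 MB/s per controller.
print(disks_needed(1500, 180))
```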


Disk Redundancy
 Data warehouses are among the largest storage systems in many enterprises. They
have many disks, any of which is liable to fail
 Disk redundancy is important to avoid the failure of the entire system in case of a hardware
malfunction
 The level of redundancy should be chosen based on the cost constraints and performance requirements of the data
warehouse. Ensuring that the data remains available on another disk when one disk fails is always critical for a data
warehouse

Plan for expansion


 The data in a data warehouse keeps growing. The data warehouse designer should plan for the
growth of the I/O system without reducing the I/O bandwidth
 To prevent unwanted data from burdening the systems, businesses should pay attention to
the age and overall quality of the archived data. Archiving should be performed periodically, using methods
suited to the needs of the business
 An example is IBM InfoSphere Balanced Warehouse, which is available in a range of configurations and sizes and
helps designers discover, model and standardize the data. The IBM Optim software helps automate the
archival and storage of historical records

OS AND DATABASE SUPPORT REQUIRED BY DWH

Data warehousing demands the following prominent features when deciding on a platform:
 A prospect of combining various management systems
 A possibility of enhancing the structure of the queries
 A possibility of improving the load processes

The table below lists the software and versions considered for a data warehouse:

Vendor Name    Products and Version
IBM            InfoSphere Balanced Warehouse 9.5
SAP            NetWeaver BI (Business Warehouse) 7.0
Teradata       Active Enterprise Data Warehouse 5550, Data Warehouse Appliance 2550, Data Mart Appliance 550, 12
Microsoft      SQL Server 2008
Oracle         Optimized Warehouses, Database 11g, Warehouse Builder 11g
Sybase         Analytic Appliance, IQ 12.7
Netezza        Performance Server 1000 Series Data Warehousing Appliance 4.5
