THEORY:-
1. A data warehouse is a database that is optimized for data retrieval to facilitate reporting and analysis.
2. A data warehouse incorporates information about many subject areas, often the entire
enterprise.
3. The design of a data warehouse often starts from an analysis of what data already exists
and how it can be collected in such a way that it can later be used.
Middle Tier − In the middle tier, we have the OLAP Server that can be
implemented in either of the following ways.
Top Tier − This tier is the front-end client layer. It holds the query, reporting,
analysis, and data mining tools.
The extraction methods in a data warehouse depend on the performance of the source
system and the demands of the business.
Full extraction is applied when the data must be retrieved and loaded for the first
time. It therefore captures all the data currently available in the source system.
Incremental extraction is a process where the differences in the source data since the
last extraction are captured. Only the changes are loaded, based on the last-changed
timestamp.
Online extraction is a process where the data is extracted from the source system
directly.
Offline extraction is a process where the source data is first dumped into a
flat file outside the source system. The data is then extracted from this flat file.
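The difference between full and incremental extraction can be sketched in a few lines of Python. This is only an illustration: the in-memory `SOURCE_ROWS` list stands in for a real source system, and the `last_changed` column name and both function names are assumptions made for the example.

```python
from datetime import datetime

# Hypothetical source rows; "last_changed" is an assumed column name.
SOURCE_ROWS = [
    {"id": 1, "name": "Alice", "last_changed": datetime(2024, 1, 5)},
    {"id": 2, "name": "Bob", "last_changed": datetime(2024, 2, 10)},
    {"id": 3, "name": "Carol", "last_changed": datetime(2024, 3, 1)},
]


def full_extract(rows):
    """Full extraction: return every row currently in the source."""
    return list(rows)


def incremental_extract(rows, last_extracted_at):
    """Incremental extraction: return only rows changed after the last run."""
    return [row for row in rows if row["last_changed"] > last_extracted_at]


# First load takes everything; later runs take only what changed.
first_load = full_extract(SOURCE_ROWS)
changed = incremental_extract(SOURCE_ROWS, datetime(2024, 2, 1))
changed_ids = [row["id"] for row in changed]
```

In a real pipeline the timestamp of the last successful extraction would be stored persistently between runs, so that each run picks up where the previous one stopped.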
Transformation
The data is transformed based on the transformation rules provided by the business.
The data is converted to a standard format and common semantics.
Data cleansing is the process of identifying and correcting inconsistent data in a
database or table. Data cleansing also involves the standardization of data, for example the
conversion of Male/Female to M/F.
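The Male/Female to M/F example could look roughly like the sketch below. The mapping table and the `standardize_gender` function are hypothetical; an actual cleansing rule set would come from the business.

```python
# Assumed mapping from raw spellings to the standard codes.
GENDER_CODES = {"male": "M", "m": "M", "female": "F", "f": "F"}


def standardize_gender(value):
    """Return 'M' or 'F' for a recognized value, otherwise None."""
    if value is None:
        return None
    return GENDER_CODES.get(value.strip().lower())


raw_values = ["Male", "female ", "M", "f", "unknown"]
cleaned = [standardize_gender(v) for v in raw_values]
```

Values that cannot be mapped are returned as `None` here so that they can be reviewed rather than silently guessed.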
Loading
Once the data is cleansed and transformed into a structure consistent with the data
warehouse requirements, it is ready to be loaded into the data warehouse.
Populating the tables of the data warehouse and verifying that the data
is ready for use is the first step of loading.
After loading the facts and dimensions, a DBA should check for referential integrity, i.e.
each record in the fact table should be related to a dimension record.
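A referential integrity check of this kind can be sketched as follows. The table contents, the `product_id` key, and the `find_orphan_facts` helper are all made up for illustration; in practice the check would usually be run as a query inside the warehouse itself.

```python
def find_orphan_facts(fact_rows, dimension_rows, key):
    """Return fact rows whose foreign key has no matching dimension row."""
    dimension_keys = {row[key] for row in dimension_rows}
    return [row for row in fact_rows if row[key] not in dimension_keys]


# Hypothetical dimension and fact tables.
dim_product = [
    {"product_id": 1, "name": "Pen"},
    {"product_id": 2, "name": "Book"},
]
fact_sales = [
    {"sale_id": 10, "product_id": 1, "amount": 5.0},
    {"sale_id": 11, "product_id": 3, "amount": 7.5},  # no such product
]

orphans = find_orphan_facts(fact_sales, dim_product, "product_id")
orphan_sale_ids = [row["sale_id"] for row in orphans]
```

An empty result means every fact record is related to a dimension record; any rows returned need to be corrected or reloaded.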
Data Presentation Area
The presentation area represents a collection of data marts. A data mart is a subset of a
data warehouse.
Data marts are preferred for smaller data volumes and fewer data sources. They make
the data cleaning process easier.
Dependent data marts retrieve data from a central data warehouse, whereas
independent data marts are standalone systems that extract data directly from the operational
systems or external sources.
Business Intelligence tools are used for accessing the data for strategic, operational, and
analytical purposes.
Senior executives and managers access the data warehouse to make critical decisions.
They devise strategies and observe the business performance.
Operational managers execute the details of the strategies against the targets.
Nature of Use:
Some experts recommend considering the nature of use (ad-hoc vs. scheduled, simple vs. complex queries, data
mining vs. reporting) when planning the hardware. Our view is that it is not a relevant factor. Once a
DW platform is in place, you cannot control the kinds of applications and tools that will be run on top of
it.
Financial Readiness:
It depends on the money you can spend. If you have a budget constraint, we would suggest starting with a few data
marts instead of a DW platform with all the attendant infrastructure elements. Building an enterprise data
warehouse on an insufficient budget carries the risk of running out of memory or disk space.
Skills Readiness:
We feel that it should not be a major issue as skills can be bought, especially in today's world where more and more
companies are outsourcing their IT services.
Number of CPUs
CPUs determine the processing capacity of a data warehouse.
Parallel operations are more CPU-intensive than serial operations.
The number of CPUs is based on the highest throughput required. It is calculated
roughly using the formula below:
Number of Disks
The number of disks is based on the maximum throughput required. The storage provider’s specifications
should be used to find out the throughput a disk array can sustain.
The number of disks is calculated as stated below:
The table below lists the software and versions considered for a data warehouse:
Vendor Name | Products and Version