Design aspects
Starflake Schemas
Facts are transactions that have occurred at some point in the
past, and are unlikely to change in the future
Facts can be analyzed in different ways by cross-referencing
the facts with different reference information
One of the major technical challenges within the design of a
data warehouse is to structure a solution that will be effective
for a reasonable period of time
data should not have to be restructured when the business
changes or the query profiles change
Star schemas exploit the fact that the content of factual transactions is unlikely to change, regardless of how it is analyzed.
it can be very effective to treat fact data as primarily read-only data, and reference data as data that will change over a period of time
Star schemas are physical database structures that store the factual data in the "center," surrounded by the
reference (dimension) data
A customer-profiling star schema, used in retail banking
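To make the structure concrete, here is a minimal sketch in Python using the standard-library sqlite3 module; the table and column names (fact_sales, dim_customer, and so on) are illustrative assumptions, not taken from the figure.

```python
import sqlite3

# Minimal star schema: a central fact table surrounded by dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, segment TEXT);
CREATE TABLE dim_time     (time_key     INTEGER PRIMARY KEY, day TEXT);
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, name TEXT);

-- fact rows are read-only transactions referencing the dimensions
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    time_key     INTEGER REFERENCES dim_time(time_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    amount       REAL
);
""")

# Cross-reference the facts with a dimension in one query.
rows = con.execute("""
    SELECT c.segment, SUM(f.amount)
    FROM fact_sales f JOIN dim_customer c USING (customer_key)
    GROUP BY c.segment
""").fetchall()
```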
Guidelines
Look for denormalized dimensions within the candidate fact table. It may be the case that the candidate fact table is a dimension containing a repeating group of factual attributes
Design fact tables to store rows that will not vary over time. If the entity appears to vary over time, consider creating a fact table that represents events that change the status of that entity.
STEP 4: CHECK THAT A CANDIDATE DIMENSION IS NOT ACTUALLY A FACT TABLE
to ensure that none of the key dimensions are in themselves fact tables
a summarized dimension may be inappropriate to satisfy campaign management, which requires access to all the detailed customer records
Example:
retail sales analysis: basket transactions are stored for a 15% spread of stores across the US; this sample can be used to spot trends across all stores, but is inappropriate for determining buying patterns in all stores located at seaside resorts
Unless you are absolutely certain that the identifiers will not change or be reassigned in the future, it is safer to use non-intelligent keys.
The performance loss of the query can be recovered by other means, such as pre-building appropriate
summary tables
View expanding a stock position fact table using date ranges, by joining against a time dimension
the query can be sped up if all the constraining information is in the same table; this is achieved by denormalizing all the additional entities related to the product entity into a single product table
Fig 1: Product by store by day cube
Fig 2: Sales by store on a specific day
In most cases, the user access tool will generate SQL queries to populate the data on demand.
The design of the multidimensional cube will determine to what extent data is preloaded as part of the load
process, and how much data is pulled through on the fly.
Partitioning Strategy
Introduction
Horizontal Partitioning
o Partitioning By Time Into Equal Segments
o Partitioning By Time Into Different-Sized Segments
o Partitioning On A Different Dimension
o Partitioning By Size Of Table
o Partitioning Dimensions
o Using Round-Robin Partitions
Vertical Partitioning
Hardware Partitioning
o Maximizing Processing Power And Avoiding Bottlenecks
o Striping Data Across MPP Nodes
o Horizontal Hardware Partitioning
Which Key To Partition By
Sizing The Partition
Introduction
Partitioning is performed for a number of performance-related and manageability reasons
the strategy as a whole must balance all the various requirements
the objective is to determine the appropriate database-partitioning strategy
Horizontal Partitioning
in many organizations, not all the information in the fact table may be actively used at a single point in time
horizontally partitioning the fact table is a good way to speed up queries
by minimizing the set of data to be scanned
if we were to partition a fact table into time segments, each segment may not be of the same size
Example:
a higher number of transactions during peak periods such as festival seasons
To address this discrepancy, we need to consider the various ways in which fact data could be partitioned,
before deciding on the optimum solution
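A minimal sketch of horizontal partitioning by time, assuming in-memory fact rows keyed by transaction date (all names and data are illustrative):

```python
from collections import defaultdict
from datetime import date

# Hypothetical fact rows: (transaction_date, amount).
facts = [(date(2023, 11, 5), 120.0), (date(2023, 12, 24), 480.0),
         (date(2023, 12, 26), 310.0), (date(2024, 1, 3), 95.0)]

# Route each row to a monthly segment. Segments need not be equal in
# size: festival months naturally hold more rows.
partitions = defaultdict(list)
for txn_date, amount in facts:
    partitions[(txn_date.year, txn_date.month)].append((txn_date, amount))

# A query constrained to December 2023 now scans one segment, not the table.
december = partitions[(2023, 12)]
```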
Partitioning Dimensions
In some cases, the dimension may contain such a large number of entries that it may need to be partitioned
in the same way as a fact table
we need to check the size of the dimension over the lifetime of the data warehouse
Example: large product catalog
Vertical Partitioning
data is split vertically
Normalization
a standard relational method of database organization.
It allows common fields to be collapsed into single rows, thereby reducing space usage
In a data warehouse, large tables are often denormalized to avoid the overhead of joining the data
Vertical partitioning can sometimes be used in a data warehouse to split less used column information out
from a frequently accessed fact table
Row Splitting
The aim of row splitting is to speed up access to the large table by reducing its size
Normalization versus row splitting
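A minimal sketch of row splitting on a single illustrative record (the column names are assumed): the split is one-to-one per row, with the key kept on both sides.

```python
# One wide fact row mixing frequently and rarely accessed columns.
row = {"sale_id": 1, "amount": 99.0, "store": "S1",
       "audit_note": "loaded ok", "source_file": "batch_17.dat"}

# Row splitting: less-used columns go to a second table, the key stays
# in both, so the frequently scanned half stays small and fast.
hot  = {k: row[k] for k in ("sale_id", "amount", "store")}
cold = {k: row[k] for k in ("sale_id", "audit_note", "source_file")}

# Rejoin on the key only when a query needs the cold columns.
full = {**hot, **cold}
```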
Hardware Partitioning
The precise mechanism used varies depending on the specific hardware platform, but must address the
following areas
maximizing the processing power available;
maximizing disk and I/O performance;
avoiding bottlenecking on a single CPU;
avoiding bottlenecking on I/O throughput
Aggregations
Introduction
Why Aggregate
What Is An Aggregation
Designing Summary Tables
Which Summaries To Create
o When Is A Summary Table Too Big To Be Useful
Introduction
Aggregation is performed in order to speed up common queries, and must offset the operational costs of creating and managing the aggregations against the benefits of doing so.
let us see how to determine the appropriate aggregation strategy
Why Aggregate
provide cost-effective query performance
by avoiding the need for substantial investment in processing power to address performance requirements
Too many aggregations will lead to unacceptable operational costs;
too few will lead to an overall lack of system performance
o 70% of the queries respond within a reasonable amount of time
allow us to spot trends within the data that may be difficult to see by looking at the detail
What Is An Aggregation
the process of optimizing query performance by performing the bulk of the processing in advance and storing intermediate results for repeated use
rely on the fact that most queries will analyze either a subset or an aggregation of the detailed data.
the aggregations are geared around the popular queries at a specific point in time
the value of a summary is that the bulk of the processing has been carried out prior to the query's executing
A database view that performs the same aggregation might provide the same interface, but not the same
saving of processing time
it also allows us to look at a specific set of trends more easily
Summary table that identifies meat-buying customers
Guideline
If common queries use more than one aggregated value on the same dimension, aggregate the required values
into a set of columns within the same summary table.
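A minimal sketch of this guideline, assuming a small in-memory detail table (names and data illustrative): several aggregated values on the store dimension are kept as columns of one summary structure.

```python
from collections import defaultdict

# Detailed fact rows: (store, amount).
sales = [("S1", 10.0), ("S1", 30.0), ("S2", 5.0), ("S2", 7.0), ("S2", 9.0)]

# One summary table keyed on the store dimension, holding several
# aggregated values (sum, count, max) as columns, per the guideline.
summary = defaultdict(lambda: {"sum": 0.0, "count": 0, "max": float("-inf")})
for store, amount in sales:
    s = summary[store]
    s["sum"] += amount
    s["count"] += 1
    s["max"] = max(s["max"], amount)

# Queries now read precomputed results instead of rescanning the detail.
avg_s2 = summary["S2"]["sum"] / summary["S2"]["count"]
```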
Data marting
Introduction
When is a data mart appropriate?
o Identify Functional Splits
o Identify User Access Tool Requirements
o Identify Access Control Issues
Designing Data Marts
o Data Subsets And Summaries
o Different Database Designs
Costs Of Data Marting
o Hardware And Software
o Network Access
o Time Window Constraints
Introduction
Data mart is a subset of the information content of a data warehouse that is stored in its own database,
summarized or in detail.
The data in the data mart may or may not be sourced from an enterprise data warehouse
the choice of user access tool may require the use of data marting
can improve query performance,
o simply by reducing the volume of data that needs to be scanned to satisfy a query
Example
a retail organization in which each merchant is responsible for maximizing the sales of a group of products
information in the data warehouse
o sales transactions on a daily level, to monitor actual sales;
o sales forecasts on a weekly basis;
o stock positions on a daily basis, to monitor stock levels
o stock movements on a daily basis, to monitor supplier or shrinkage issues
the merchant is not interested in products that he or she is not responsible for
the merchant is extremely unlikely to query information about other products
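A minimal sketch of such a subset, with hypothetical product and sales data standing in for the warehouse content:

```python
# Warehouse-level daily sales rows: (product, store, amount).
warehouse_sales = [("shampoo", "S1", 4.0), ("bread", "S1", 2.0),
                   ("soap", "S2", 3.0), ("bread", "S2", 2.5)]

# The merchant responsible for toiletries never queries other products,
# so the data mart holds only that subset of the warehouse content.
toiletries = {"shampoo", "soap"}
mart_rows = [r for r in warehouse_sales if r[0] in toiletries]
```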
Guideline
ensure that you do not support more than a handful of data marts for a data warehouse that has an overnight load process.
Higher numbers of data marts are likely to lead to load window problems
Network Access
In many cases, each data mart will be in a different geographical location from the data warehouse
you need to ensure that the LAN or WAN has sufficient capacity to handle the data volumes being transferred within the data mart load process
Meta data
Introduction
Data Transformation And Load
Data Management
Query Generation
Metadata And Tools
Introduction
Metadata is data about data
Metadata is used for:
o data transformation and load
o data management
o query generation
base data: the term used to distinguish the original data load destination from any copies, such as aggregations, in which the data may later appear
Issues
different data type notations
o source systems may hold non-relational data, while the DWH might be a relational system
different vendors for different RDBMSs
o date and time data might be stored in separate fields in some systems, whereas others may store them in the same field
choice of char over varchar
byte ordering of numeric fields
Another common difficulty is how to deal with many-to-one, one-to-many, and many-to-many mappings
from source to destination
Data Management
Metadata is required to describe the data as it resides in the data warehouse.
This is needed by the warehouse manager to allow it to track and control all data movements.
Every object in the database needs to be described.
Metadata is needed for all of the following:
tables
o columns
  name
  type
indexes
o columns
  name
  type
views
o columns
  name
  type
constraints
o name
o type
o table
o columns
field
o unique identifier
o field_name
o description
One of the main requirements is to be able to cross-reference columns in different tables that contain the same data, even if they have different names.
For each table, the following information needs to be stored:
table
o columns
  column_name
  reference identifier
  aggregation
aggregation
o aggregation_name
o table_name
o columns
  column_name
  reference identifier
o aggregation functions
  min
  max
  average
  sum
partition
o partition_name
o table_name
o columns
  column_name
  reference identifier
o range allowed
o range contained
o partition key
o reference identifier
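The lists above map naturally onto simple records; a sketch using Python dataclasses follows, where the field types and defaults are assumptions layered on the reconstructed lists.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Records mirroring the metadata lists above; exact types are assumptions.
@dataclass
class ColumnMeta:
    column_name: str
    reference_identifier: str
    aggregation: Optional[str] = None   # how this column may be aggregated

@dataclass
class PartitionMeta:
    partition_name: str
    table_name: str
    columns: List[ColumnMeta] = field(default_factory=list)
    range_allowed: str = ""     # key range the partition may hold
    range_contained: str = ""   # key range actually present
    partition_key: str = ""
```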
Query Generation
Metadata is also required by the query manager to enable it to generate queries.
The same metadata used by the warehouse manager to describe the data in the data warehouse is also
required by the query manager.
The query manager will also generate metadata about the queries it has run.
o This metadata can be used to build a history of all queries run and generate a query profile for each
user, group of users and the data warehouse as a whole.
The metadata that is required for each query is:
tables accessed
o table name
o columns accessed
  name
  reference identifier
restrictions applied
o column name
o table name
o reference identifier
o restriction
aggregate functions used
o column name
o reference identifier
o aggregate function
group by criteria
o column name
o reference identifier
join criteria applied
o column name
o table name
o reference identifier
o formula
sort criteria
o column name
o reference identifier
o sort direction
syntax
execution plan
resources
o disk
  read
  write
  temporary
o CPU
o memory
start date and time
end date and time
user name
Introduction
A manager is a process or part of the data warehouse application as a whole
The system managers refer to areas of management of the underlying hardware and operating system or
systems themselves, such as scheduling and backup recovery.
The data warehouse process managers refer more specifically to the data warehouse itself
A specific manager will consist of the tools and procedures required to manage its area of responsibility
System Managers
The most important system managers are:
configuration manager
schedule manager
event manager
database manager
backup recovery manager
Configuration Manager
responsible for the setup and configuration of the hardware
ranging from configuration of the machine to configuration and management of the disk farm
the single most important configuration tool is the I/O manager
Schedule Manager
every successful data warehouse has a good solid scheduler
Almost all operations in the data warehouse will require some scheduling control
List of the features that the scheduling software must have
Handle multiple queues.
Deal with inter-queue processing.
Work across cluster or MPP boundaries.
Maintain job schedules across system outages.
Deal with international time differences.
Handle job failures.
Restart or re-queue failed jobs.
Support job priorities.
Support the stopping and starting of queues.
Log queued jobs.
Notify a user or process when a job has completed.
Re-queue jobs to other queues.
Some of the jobs the scheduler has to be able to handle are
overnight processing
o data load
o data transformation
o index creation
o aggregation creation
o data movement operations
o backup
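A minimal sketch of two of the scheduler features listed above, handling job failures and re-queuing failed jobs (illustrative only, not a production scheduler):

```python
import queue

# Overnight job list: (name, callable); the second one fails on purpose.
overnight = [("data load", lambda: None),
             ("index creation", lambda: 1 / 0)]

# Failed jobs are re-queued to a retry queue instead of halting the run,
# covering the "handle job failures" / "re-queue failed jobs" features.
retry = queue.Queue()
for name, job in overnight:
    try:
        job()
    except Exception as exc:
        retry.put((name, job))
        print(f"job {name!r} failed ({exc}); re-queued")
```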
Event Manager
An event is a measurable, observable occurrence of a defined action
An event manager is a piece of software that deals with defined events on the data warehouse system.
It will monitor continuously for the occurrence of any events, and deal with them
a tool should be used because a DWH environment is just too complex to be managed manually
Having an event manager tool will free the DBA and the system manager to manage the data warehouse
proactively, rather than reactively
observability and measurability are the attributes that make an event manageable by the tool
List of the most common events that need to be tracked
running out of space on certain key disks
a process dying
a process using excessive resource (such as CPU)
a process returning an error
disks exhibiting I/O bottlenecks
hardware failure
a table reaching its maximum size
a table failing to extend because of lack of space
CPU usage exceeding an 80% threshold
excessive memory swapping
usage of temporary or sort area reaching a certain threshold
buffer cache hit ratios exceeding or falling below thresholds
any other database shared memory usage
internal contention on database serialization points
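A minimal sketch of tracking one event from this list, CPU usage exceeding the 80% threshold; read_cpu_percent() is a hypothetical stand-in for a real system probe.

```python
# Minimal event-manager sketch: poll one measurable event and notify
# when it fires. read_cpu_percent() is a stand-in, not a real probe.
def read_cpu_percent() -> float:
    return 91.5  # hypothetical measurement

CPU_THRESHOLD = 80.0

def check_events() -> list:
    alerts = []
    if read_cpu_percent() > CPU_THRESHOLD:
        alerts.append(f"CPU usage exceeded {CPU_THRESHOLD}% threshold")
    return alerts

for alert in check_events():
    print(alert)  # in practice: page the DBA or trigger a corrective job
```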
Load Manager
Functional Description
The load manager is responsible for any data transformation required and for the loading of data into the
database
These responsibilities can be summarized as follows:
data source interaction
data transformation
data load
Data Transformation
Some amount of data transformation will almost certainly need to be performed
Examples
stripping of extra fields from the source
change of format from the RDBMS
splitting (1 to many) & merging (many to 1) of fields
adding extra fields
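A minimal sketch of the splitting and merging cases, with assumed field layouts:

```python
from datetime import datetime

# Splitting (1 to many): one source field carrying a composite code
# is split into separate warehouse fields (layout "NE-042-07" is assumed).
region, store, till = "NE-042-07".split("-")

# Merging (many to 1): separate date and time source fields are merged
# into the single timestamp field the warehouse stores.
txn_date, txn_time = "1997-03-14", "09:30:00"
timestamp = datetime.fromisoformat(f"{txn_date} {txn_time}")
```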
Data Load
The ultimate goal and the responsibility of the load manager is to get data loaded into the data warehouse
database
Commonly used approaches are
loading from flat files,
3GL programs for extracting and loading data,
gateway products for connecting different RDBMSs,
copy management tools.
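A minimal sketch of the first approach, loading from a flat file, using Python's csv and sqlite3 modules; the file contents and table layout are illustrative.

```python
import csv, io, sqlite3

# In practice the load manager would stream a large extract file;
# here the "flat file" is an in-memory sample.
flat_file = io.StringIO("sale_id,store,amount\n1,S1,10.0\n2,S2,5.5\n")

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_sales (sale_id INTEGER, store TEXT, amount REAL)")

reader = csv.DictReader(flat_file)
con.executemany(
    "INSERT INTO fact_sales VALUES (:sale_id, :store, :amount)",
    reader,  # each row dict maps straight onto the named parameters
)
con.commit()
```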
Warehouse Manager
responsible for maintaining the data while it is in the data warehouse
creates and maintains the metadata that describes where everything resides within the data warehouse
The responsibilities of the warehouse manager are
data movement
metadata management
performance monitoring and tuning
data archiving
Data Movement
The warehouse manager is responsible for any data movement within the warehouse, such as aggregation
creation and maintenance.
create and maintain any tables, indexes, or other objects required
create and maintain the indexes on the fact and dimension data and on the aggregations
schedule any refreshes of the data marts
may also be responsible for pushing the data out to the different data mart machines
Metadata Management
The warehouse manager is responsible for managing the metadata
warehouse manager will need to be aware of the partitioning key, and the range of key values that each
partition contains
Data Archiving
data will need to be archived off the data warehouse, either to nearline storage, or to tape
a number of details need to be established while designing the archiving process
data life expectancy: in raw form, in aggregated form
data archiving: start date, cycle, workload
Query Manager
Query manager used to control all of the following:
o user access to the data
o query scheduling
o query monitoring
each area requires its own tools, bespoke software and procedures
Ideally, all query access to the data warehouse should go via the query manager
Query Scheduling
Scheduling of ad hoc queries is a responsibility of the query manager.
Simultaneous large ad hoc queries, if not controlled, can severely affect the performance of any system:
in particular if the queries are run using parallelism where a single query can potentially use all the CPU
resource made available to it.
Typically this will be achieved using queuing mechanisms to ensure that large jobs do not clash
a key difficulty is the inability to predict how long a query will take to complete
Query Monitoring
One of the main functions of the query manager will be to monitor the queries as they run
As the usage of the data warehouse grows, and the number of concurrent users and the number of ad hoc queries grow, the success of the data warehouse will depend on how these queries are managed
the query profiles of the different groups of users need to be known to tune the ad hoc environment
statistics of the resources used, and the query syntax itself are to be stored for tuning the query
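A minimal sketch of such monitoring, recording a few of the query metadata fields listed earlier (the function and its arguments are illustrative names):

```python
import time

# Record syntax, timings, and user for each query so per-user and
# per-group profiles can be built later for tuning.
query_history = []

def run_monitored(sql, user, execute):
    start = time.time()
    execute(sql)  # run the actual query
    query_history.append({
        "user name": user,
        "syntax": sql,
        "start date and time": start,
        "end date and time": time.time(),
    })

run_monitored("SELECT 1", "analyst1", lambda sql: None)  # stub executor
```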
Disclaimer
Intended for educational purposes only. Not intended for any sort of commercial use
Purely created to help students with limited preparation time. No guarantee on the adequacy or accuracy
Text and picture used were taken from the reference items
Reference
Sam Anahory & Dennis Murray, Data Warehousing in the Real World, Addison-Wesley, 1997
http://books.google.co.in/books?id=f_C30AM0NkIC&printsec=frontcover#v=onepage&q&f=false
Credits
Thanks to my family members who supported me while I spent a considerable amount of time preparing these notes.
Feedback is always welcome at GHCRajan@gmail.com