Kamesh Chalasani

DATA WAREHOUSE CONCEPTS


Data warehousing is the coordinated, architected, and periodic copying of data from
various sources, both inside and outside the enterprise, into an environment optimized
for analytical and informational processing.

A data warehouse system has the following characteristics:


✓ It provides centralization of corporate data assets.
✓ It’s contained in a well-managed environment.
✓ It has consistent and repeatable processes defined for loading data from
corporate applications.
✓ It’s built on an open and scalable architecture that can handle future expansion of
data.
✓ It provides tools that allow its users to effectively process the data into information
without a high degree of technical support.

Bill Inmon defined a data warehouse as a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.
Subject-Oriented: A data warehouse can be used to analyze a particular subject
area. For example, "sales" can be a particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For
example, source A and source B may have different ways of identifying a product, but
in a data warehouse, there will be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can
retrieve data from 3 months, 6 months, 12 months, or even older from a data
warehouse. This contrasts with a transaction system, where often only the most
recent data is kept. For example, a transaction system may hold only the most recent
address of a customer, whereas a data warehouse can hold all addresses associated
with a customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical
data in a data warehouse should never be altered.
Ralph Kimball provided a more concise definition of a data warehouse:
A data warehouse is a copy of transaction data specifically structured for query
and analysis.
This is a functional view of a data warehouse. Kimball did not address how the data
warehouse is built, as Inmon did; rather, he focused on the functionality of a data
warehouse.

Different data warehousing systems have different structures. Some may have an
ODS (operational data store), while some may have multiple data marts. Some may
have a small number of data sources, while some may have dozens of data sources.
In view of this, it is far more reasonable to present the different layers of a data
warehouse architecture rather than discussing the specifics of any one system.
In general, all data warehouse systems have the following layers:
• Data Source Layer
• Data Extraction Layer
• Staging Area
• ETL Layer
• Data Storage Layer
• Data Logic Layer
• Data Presentation Layer
• Metadata Layer
• System Operations Layer
Each component is discussed individually below:


Data Source Layer
This represents the different data sources that feed data into the data warehouse. The
data source can be of any format: a plain text file, a relational database, another type of
database, an Excel file, and so on can all act as a data source.
Many different types of data can be a data source:
• Operations -- such as sales data, HR data, product data, inventory data,
marketing data, systems data.
• Web server logs with user browsing data.
• Internal market research data.
• Third-party data, such as census data, demographics data, or survey data.
All these data sources together form the Data Source Layer.
Data Extraction Layer
Data gets pulled from the data source into the data warehouse system. There is likely
some minimal data cleansing, but it is unlikely that any major data transformation takes place here.
Staging Area
This is where data sits prior to being scrubbed and transformed into a data warehouse
/ data mart. Having one common area makes subsequent data processing / integration easier.
ETL Layer
This is where data gains its "intelligence", as logic is applied to transform the data from
a transactional nature to an analytical nature. This layer is also where data cleansing
happens.
Data Storage Layer
This is where the transformed and cleansed data sit. Based on scope and functionality,
3 types of entities can be found here: data warehouse, data mart, and operational data
store (ODS). In any given system, you may have just one of the three, two of the three,
or all three types.
Data Logic Layer
This is where business rules are stored. Business rules stored here do not affect the
underlying data transformation rules, but they do affect what the report looks like.
Data Presentation Layer
This refers to the information that reaches the users. This can be in the form of a
tabular / graphical report in a browser, an emailed report that gets automatically
generated and sent every day, or an alert that warns users of exceptions, among
others.
Metadata Layer
This is where information about the data stored in the data warehouse system is
stored. A logical data model would be an example of something that's in the metadata
layer.
System Operations Layer
This layer includes information on how the data warehouse system operates, such as
ETL job status, system performance, and user access history.

Data Warehouse Design


After the tools and team personnel selections are made, the data warehouse design
can begin. The following are the typical steps involved in the data warehousing project
cycle.
• Requirement Gathering
• Physical Environment Setup
• Data Modeling
• ETL
• OLAP Cube Design
• Front End Development
• Report Development
• Performance Tuning
• Query Optimization
• Quality Assurance
• Rolling out to Production
• Production Maintenance
• Incremental Enhancements
Each item listed above represents a typical data warehouse design phase, and each phase has
several sections:
• Task Description: This section describes what typically needs to be
accomplished during this particular data warehouse design phase.
• Time Requirement: A rough estimate of the amount of time this particular data
warehouse task takes.
• Deliverables: Typically at the end of each data warehouse task, one or more
documents are produced that fully describe the steps and results of that
particular task. This is especially important for consultants to communicate their
results to the clients.
• Possible Pitfalls: Things to watch out for. Some of them are obvious, some of them
not so obvious. All of them are real.
The Additional Observations section contains my own observations on data warehouse
processes not included in any of the design steps.

Requirement Gathering
Task Description
The first thing that the project team should engage in is gathering requirements from end
users. Because end users are typically not familiar with the data warehousing process or
concept, the help of the business sponsor is essential. Requirement gathering can happen as
one-to-one meetings or as Joint Application Development (JAD) sessions, where multiple
people are talking about the project scope in the same meeting.
The primary goal of this phase is to identify what constitutes success for this particular
phase of the data warehouse project. In particular, end user reporting / analysis requirements
are identified, and the project team will spend the remaining period of time trying to satisfy
these requirements.
Associated with the identification of user requirements is a more concrete definition of other
details such as hardware sizing information, training requirements, data source identification,
and most importantly, a concrete project plan indicating the finishing date of the data
warehousing project.
Based on the information gathered above, a disaster recovery plan needs to be developed so
that the data warehousing system can recover from accidents that disable the system.
Without an effective backup and restore strategy, the system will only last until the first major
disaster, and, as many data warehousing DBAs will attest, this can happen very quickly after
the project goes live.
Time Requirement
2 - 8 weeks.
Deliverables
• A list of reports / cubes to be delivered to the end users by the end of this current
phase.
• An updated project plan that clearly identifies resource loads and milestone delivery
dates.
Possible Pitfalls
This phase often turns out to be the trickiest phase of the data warehousing
implementation. The reason is that, because data warehousing by definition includes data
from multiple sources spanning many different departments within the enterprise, there are
often political battles that center on the willingness to share information. Even though a
successful data warehouse benefits the enterprise, there are occasions where departments
may not feel the same way. As a result of the unwillingness of certain groups to release data or to
participate in the data warehousing requirement definition, the data warehouse effort either
never gets off the ground or cannot proceed in the direction originally defined.
When this happens, it would be ideal to have a strong business sponsor. If the sponsor is at
the CXO level, she can often exert enough influence to make sure everyone cooperates.
Physical Environment Setup
Task Description
Once the requirements are somewhat clear, it is necessary to set up the physical servers and
databases. At a minimum, it is necessary to set up a development environment and a
production environment. There are also many data warehousing projects where there are
three environments: Development, Testing, and Production.
It is not enough to simply have different physical environments set up. The different processes
(such as ETL, OLAP Cube, and reporting) also need to be set up properly for each
environment.
It is best for the different environments to use distinct application and database servers. In
other words, the development environment will have its own application server and database
servers, and the production environment will have its own set of application and database
servers.
Having different environments is very important for the following reasons:
• All changes can be tested and QA'd first without affecting the production environment.
• Development and QA can occur during the time users are accessing the data
warehouse.
• When there is any question about the data, having separate environment(s) will allow
the data warehousing team to examine the data without impacting the production
environment.
Time Requirement
Getting the servers and databases ready should take less than 1 week.
Deliverables
• Hardware / Software setup document for all of the environments, including hardware
specifications, and scripts / settings for the software.
Possible Pitfalls
To save on capital, often data warehousing teams will decide to use only a single database
and a single server for the different environments. Environment separation is achieved by
either a directory structure or setting up distinct instances of the database. This is problematic
for the following reasons:
1. Sometimes the server needs to be rebooted for work in the development environment.
Having a separate development environment prevents the production environment from
being impacted by this.
2. There may be interference when having different database environments on a single box.
For example, having multiple long queries running on the development database could affect
the performance on the production database.

Data Modeling

Task Description
This is a very important step in the data warehousing project. Indeed, it is fair to say that the
foundation of the data warehousing system is the data model. A good data model will allow
the data warehousing system to grow easily, as well as allowing for good performance.
In a data warehousing project, the logical data model is built based on user requirements, and
then it is translated into the physical data model. The detailed steps can be found in the
Conceptual, Logical, and Physical Data Modeling section.
Part of the data modeling exercise is often the identification of data sources. Sometimes this
step is deferred until the ETL step. However, my feeling is that it is better to find out where the
data exists, or, better yet, whether it even exists anywhere in the enterprise at all. Should
the data not be available, this is a good time to raise the alarm. If this is delayed until the
ETL phase, rectifying it becomes a much tougher and more complex process.
Time Requirement
2 - 6 weeks.

Deliverables
• Identification of data sources.
• Logical data model.
• Physical data model.
Possible Pitfalls
It is essential to have a subject-matter expert as part of the data modeling team. This person
can be an outside consultant or can be someone in-house who has extensive experience in
the industry. Without this person, it becomes difficult to get a definitive answer on many of the
questions, and the entire project gets dragged out.

ETL
Task Description
The ETL (Extraction, Transformation, Loading) process typically takes the longest to develop,
and this can easily take up to 50% of the data warehouse implementation cycle or longer. The
reason for this is that it takes time to get the source data, understand the necessary columns,
understand the business rules, and understand the logical and physical data models.
Time Requirement
1 - 6 weeks.
Deliverables
• Data Mapping Document
• ETL Script / ETL Package in the ETL tool
Possible Pitfalls
There is a tendency to give this particular phase too little development time. This can prove
suicidal to the project because end users will usually tolerate less formatting, longer time to
run reports, less functionality (slicing and dicing), or fewer delivered reports; one thing that
they will not tolerate is wrong information.
A second common problem is that some people make the ETL process more complicated
than necessary. In ETL design, the primary goal should be to optimize load speed without
sacrificing quality. This is, however, sometimes not followed. There are cases where the
design goal is to cover all possible future uses, whether they are practical or just a figment of
someone's imagination. When this happens, ETL performance suffers, and often so does the
performance of the entire data warehousing system.

OLAP Cube Design


Task Description
Usually the design of the OLAP cube can be derived from the Requirement Gathering phase.
More often than not, however, users have some idea of what they want, but it is difficult for
them to specify the exact report / analysis they want to see. When this is the case, it is usually
a good idea to include enough information so that they feel they have gained something
through the data warehouse, but not so much that it stretches the data warehouse scope by a
mile. Remember that data warehousing is an iterative process - no one can ever meet all the
requirements all at once.
Time Requirement
1 - 2 weeks.
Deliverables
• Documentation specifying the OLAP cube dimensions and measures.
• Actual OLAP cube / report.
Possible Pitfalls
Make sure your OLAP cube-building process is optimized. It is common for the data warehouse
to be at the bottom of the nightly batch load, and after the loading of the data warehouse,
there usually isn't much time remaining for the OLAP cube to be refreshed. As a result, it is
worthwhile to experiment with the OLAP cube generation paths to ensure optimal performance.
Front End Development
Regardless of the strength of the OLAP engine and the integrity of the data, if the users
cannot visualize the reports, the data warehouse brings zero value to them. Hence front end
development is an important part of a data warehousing initiative.
So what are the things to look out for in selecting a front-end deployment methodology? The
most important thing is that the reports should be delivered over the web, so the only
thing the user needs is a standard browser. These days it is neither desirable nor
feasible to have the IT department install programs on end users' desktops just so
that they can view reports. So, whatever strategy one pursues, the ability to deliver
over the web is a must.
The front-end options range from internal front-end development using scripting
languages such as ASP, PHP, or Perl, to off-the-shelf products such as Seagate Crystal
Reports, to higher-end products such as Actuate. In addition, many OLAP vendors
offer front ends of their own. When choosing a vendor tool, make sure it can be easily
customized to suit the enterprise, especially with respect to possible changes in the reporting
requirements of the enterprise. Possible changes include not just differences in report
layout and report content, but also possible changes in the back-end structure. For
example, if the enterprise decides to change from Solaris/Oracle to Windows/SQL
Server, will the front-end tool be flexible enough to adjust to the change without much
modification?
Another area to be concerned with is the complexity of the reporting tool. For example, do the
reports need to be published on a regular interval? Are there very specific formatting
requirements? Is there a need for a GUI interface so that each user can customize her
reports?
Time Requirement
1 - 4 weeks.
Deliverables
Front End Deployment Documentation
Possible Pitfalls
Just remember that the end users do not care how complex or how technologically advanced
your front-end infrastructure is. All they care about is that they receive their information in a timely
manner and in the way they specified.

Report Development
Task Description
Report specification typically comes directly from the requirements phase. To the end user,
the only direct touchpoint he or she has with the data warehousing system is the reports they
see. So, report development, although not as time consuming as some of the other steps
such as ETL and data modeling, nevertheless plays a very important role in determining the
success of the data warehousing project.
One would think that report development is an easy task. How hard can it be to just follow
instructions to build the report? Unfortunately, this is not true. There are several points the
data warehousing team needs to pay attention to before releasing the report.
User customization: Do users need to be able to select their own metrics? And how do
users need to be able to filter the information? The report development process needs to take
those factors into consideration so that users can get the information they need in the shortest
amount of time possible.
Report delivery: What report delivery methods are needed? In addition to delivering the
report to the web front end, other possibilities include delivery via email, via text messaging,
or in some form of spreadsheet. There are reporting solutions in the marketplace that support
report delivery as a flash file. Such a flash file essentially acts as a mini-cube, and allows
end users to slice and dice the data on the report without having to pull data from an external
source.

Access privileges: Special attention needs to be paid to who has what access to what
information. A sales report can show 8 metrics covering the entire company to the company
CEO, while the same report may only show 5 of the metrics covering only a single district to a
District Sales Director.
Report development does not happen only during the implementation phase. After the system
goes into production, there will certainly be requests for additional reports. These types of
requests generally fall into two broad categories:
1. Data is already available in the data warehouse. In this case, it should be fairly
straightforward to develop the new report in the front end. There is no need to wait for a
major production push before making new reports available.
2. Data is not yet available in the data warehouse. This means that the request needs to be
prioritized and put into a future data warehousing development cycle.
Time Requirement
1 - 2 weeks.
Deliverables
• Report Specification Documentation.
• Reports set up in the front end / reports delivered to user's preferred channel.
Possible Pitfalls
Make sure the exact definitions of the report are communicated to the users. Otherwise, user
interpretation of the report can be erroneous.
Performance Tuning
Task Description
There are three major areas where a data warehousing system can use a little performance
tuning:
• ETL - Given that the data load is usually a very time-consuming process (and hence
is typically relegated to a nightly load job) and that data warehousing-related
batch jobs are typically of lower priority, the window for data loading is
not very long. A data warehousing system whose ETL process finishes just barely on
time is going to have a lot of problems, simply because the jobs often do not get started
on time due to factors beyond the control of the data warehousing team. As a
result, it is always an excellent idea for the data warehousing group to tune the ETL
process as much as possible.
• Query Processing - Sometimes, especially in a ROLAP environment or in a system
where the reports are run directly against the relational database, query performance
can be an issue. A study has shown that users typically lose interest after 30 seconds
of waiting for a report to return. My experience has been that ROLAP reports or reports
that run directly against the RDBMS often exceed this time limit, and it is hence ideal
for the data warehousing team to invest some time in tuning the queries, especially the
most popular ones. A number of query optimization ideas are presented in the Query Optimization section below.
• Report Delivery - It is also possible that end users are experiencing significant delays
in receiving their reports due to factors other than query performance. For example,
network traffic, server setup, and even the way the front end was built can all play
significant roles. It is important for the data warehouse team to look into these
areas for performance tuning.
Time Requirement
3 - 5 days.
Deliverables
• Performance tuning document - Goal and Result
Possible Pitfalls
Make sure the development environment mimics the production environment as much as
possible - Performance enhancements seen on less powerful machines sometimes do not
materialize on the larger, production-level machines.
Query Optimization
For any production database, SQL query performance becomes an issue sooner or
later. Long-running queries not only consume system resources that make
the server and application run slowly, but may also lead to table locking and data
corruption issues. So, query optimization becomes an important task.
First, we offer some guiding principles for query optimization:
1. Understand how your database is executing your query
Nowadays all databases have their own query optimizers, and offer a way for users to
understand how a query is executed. For example, which index from which table is being
used to execute the query? The first step of query optimization is understanding what the
database is doing. Different databases have different commands for this. For example, in
MySQL, one can use the "EXPLAIN [SQL Query]" command to see the query plan. In Oracle, one
can use "EXPLAIN PLAN FOR [SQL Query]" to see the query plan.
2. Retrieve as little data as possible
The more data returned from the query, the more resources the database needs to expend to
process and store these data. For example, if you only need to retrieve one column from a
table, do not use 'SELECT *'.
3. Store intermediate results
Sometimes logic for a query can be quite complex. Often, it is possible to achieve the desired
result through the use of subqueries, inline views, and UNION-type statements. For those
cases, the intermediate results are not stored in the database, but are immediately used
within the query. This can lead to performance issues, especially when the intermediate
results have a large number of rows.
The way to increase query performance in those cases is to store the intermediate results in a
temporary table, and break up the initial SQL statement into several SQL statements. In many
cases, you can even build an index on the temporary table to speed up the query
performance even more. Granted, this adds a little complexity in query management (i.e., the
need to manage temporary tables), but the speedup in query performance is often worth the
trouble.
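As an illustration of this technique (a sketch only: the table names are hypothetical, and the exact temporary-table syntax varies by database), a complex subquery can be broken out into a stored intermediate result with its own index:

    -- Step 1: materialize the intermediate result that was previously computed inline
    CREATE TABLE tmp_electronics AS
    SELECT product_id
    FROM   product
    WHERE  category = 'Electronics';

    -- Step 2: index the intermediate result to speed up the join
    CREATE INDEX idx_tmp_electronics ON tmp_electronics (product_id);

    -- Step 3: run the simplified query against the intermediate result
    SELECT s.store_id,
           SUM(s.sales_amount) AS total_sales
    FROM   sales s
    JOIN   tmp_electronics t ON s.product_id = t.product_id
    GROUP  BY s.store_id;

    -- Step 4: clean up
    DROP TABLE tmp_electronics;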

Below are several specific query optimization strategies.


• Use Index
Using an index is the first strategy one should use to speed up a query. In fact, this
strategy is so important that index optimization is discussed separately.
• Aggregate Table
Pre-populate tables at higher levels of aggregation so that a smaller amount of data needs to be parsed (see the sketch after this list).
• Vertical Partitioning
Partition the table by columns. This strategy decreases the amount of data a SQL
query needs to process.
• Horizontal Partitioning
Partition the table by data value, most often time. This strategy decreases the amount
of data a SQL query needs to process.
• Denormalization
The process of denormalization combines multiple tables into a single table. This
speeds up query performance because fewer table joins are needed.
• Server Tuning
Each server has its own parameters, and often tuning server parameters so that it can
fully take advantage of the hardware resources can significantly speed up query
performance.
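To make the Aggregate Table strategy concrete, here is a minimal sketch (table and column names are hypothetical): daily sales are pre-summarized to the monthly level so that month-level reports read far fewer rows.

    -- Pre-populate a monthly aggregate from the detailed daily fact table
    CREATE TABLE sales_fact_monthly AS
    SELECT store_key,
           product_key,
           EXTRACT(YEAR  FROM sales_date) AS sales_year,
           EXTRACT(MONTH FROM sales_date) AS sales_month,
           SUM(sales_amount)              AS sales_amount
    FROM   sales_fact_daily
    GROUP  BY store_key,
              product_key,
              EXTRACT(YEAR  FROM sales_date),
              EXTRACT(MONTH FROM sales_date);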

Quality Assurance
Task Description
Once the development team declares that everything is ready for further testing, the QA team
takes over. The QA team is always from the client. Usually the QA team members will know
little about data warehousing, and some of them may even resent the need to have to learn
another tool or tools. This makes the QA process a tricky one.
Sometimes the QA process is overlooked. On my very first data warehousing project, the
project team worked very hard to get everything ready for Phase 1, and everyone thought that
we had met the deadline. There was one mistake, though: the project managers failed to
recognize that it is necessary to go through the client QA process before the project can go
into production. As a result, it took five extra months to bring the project to production (the
original development time had been only 2 1/2 months).
Time Requirement
1 - 4 weeks.
Deliverables
• QA Test Plan
• QA verification that the data warehousing system is ready to go to production
Possible Pitfalls
As mentioned above, usually the QA team members know little about data warehousing, and
some of them may even resent the need to have to learn another tool or tools. Make sure the
QA team members get enough education so that they can complete the testing themselves.
Rollout To Production
Task Description
Once the QA team gives the thumbs up, it is time for the data warehouse system to go live. Some
may think this is as easy as flipping a switch, but usually that is not true. Depending on the
number of end users, it sometimes takes up to a full week to bring everyone online!
Fortunately, nowadays most end users access the data warehouse over the web, making
going into production sometimes as easy as sending out a URL via email.
Time Requirement
1 - 3 days.
Deliverables
• Delivery of the data warehousing system to the end users.
Possible Pitfalls
Take care to address user education needs. There is nothing more frustrating than spending
several months developing and QA'ing the data warehousing system, only to have little usage
because the users are not properly trained. Regardless of how intuitive or easy the interface
may be, it is always a good idea to send the users to at least a one-day course to let them
understand what they can achieve by properly using the data warehouse.

Production Maintenance
Task Description
Once the data warehouse goes into production, it needs to be maintained. Tasks such as regular
backup and crisis management become important and should be planned out. In addition, it
is very important to consistently monitor end user usage. This serves two purposes: 1. To
capture any runaway requests so that they can be fixed before slowing the entire system
down, and 2. To understand how much users are utilizing the data warehouse for return-on-
investment calculations and future enhancement considerations.
Time Requirement
Ongoing.
Deliverables
Consistent availability of the data warehousing system to the end users.
Possible Pitfalls
Usually by this time most, if not all, of the developers will have left the project, so it is
essential that proper documentation is left for those who are handling production
maintenance. There is nothing more frustrating than staring at something another person did,
yet being unable to figure it out due to the lack of proper documentation.
Another pitfall is that the maintenance phase is usually boring. So, if there is another phase of
the data warehouse planned, start on that as soon as possible.
Incremental Enhancements
Task Description
Once the data warehousing system goes live, there are often needs for incremental
enhancements. I am not talking about new data warehousing phases, but simply small
changes that follow the business itself. For example, the original geographical designations
may change: the company may originally have had 4 sales regions, but now that sales
are going so well, it has 10 sales regions.
Deliverables
• Change management documentation
• Actual change to the data warehousing system
Possible Pitfalls
Because a lot of times the changes are simple to make, it is very tempting to just go ahead
and make the change in production. This is a definite no-no. Many unexpected problems will
pop up if this is done. I would very strongly recommend that the typical cycle of development
--> QA --> Production be followed, regardless of how simple the change may seem.

Observations
1) Quick Implementation Time
2) Lack Of Collaboration With Data Mining Efforts
3) Industry Consolidation
4) How To Measure Success

Business Intelligence
Business intelligence is a term commonly associated with data warehousing. In fact, many
of the tool vendors position their products as business intelligence software rather than
data warehousing software. There are other occasions where the two terms are used
interchangeably. So, exactly what is business intelligence?
Business intelligence usually refers to the information that is available for the enterprise to
make decisions on. A data warehousing (or data mart) system is the backend, or the
infrastructural, component for achieving business intelligence. Business intelligence also
includes the insight gained from doing data mining analysis, as well as unstructured data (thus
the need for content management systems). For our purposes here, we will discuss business
intelligence in the context of using a data warehouse infrastructure.
1) Tools
The most common tools used for business intelligence are listed below, in order of
increasing cost, increasing functionality, increasing business intelligence complexity, and
decreasing number of total users.
Excel
Take a guess: what is the most common business intelligence tool? You might be surprised to
find out it's Microsoft Excel. There are several reasons for this:
1. It's relatively cheap.
2. It's commonly used. You can easily send an Excel sheet to another person without worrying
whether the recipient knows how to read the numbers.
3. It has most of the functionalities users need to display data.
In fact, it is still so popular that all third-party reporting / OLAP tools have an "export to Excel"
functionality. Even for home-built solutions, the ability to export numbers to Excel usually
needs to be built.
Excel is best used for business operations reporting and goals tracking.
Reporting tool
In this discussion, I am including both custom-built reporting tools and commercial
reporting tools. They provide some flexibility in terms of the ability for each user to
create, schedule, and run their own reports. The Reporting Tool Selection section discusses
how one should select a reporting tool.
Business operations reporting and dashboards are the most common applications for a
reporting tool.

OLAP tool
OLAP tools are usually used by advanced users. They make it easy for users to look at the
data from multiple dimensions. The OLAP Tool Selection section discusses how one should
select an OLAP tool.
OLAP tools are used for multidimensional analysis.
Data mining tool
Data mining tools are usually used only by very specialized users; even in large
organizations, there are usually only a handful of users working with data mining tools.
Data mining tools are used for finding correlation among different factors.
2) Uses
Business intelligence usage can be categorized into the following categories:

1. Business operations reporting


The most common form of business intelligence is business operations reporting. This
includes the actuals and how the actuals stack up against the goals. This type of business
intelligence often manifests itself in the standard weekly or monthly reports that need to be
produced.
2. Forecasting
Many of you have no doubt run into the need for forecasting, and all of you would agree that
forecasting is both a science and an art. It is an art because one can never be sure what the
future holds. What if competitors decide to spend a large amount of money in advertising?
What if the price of oil shoots up to $80 a barrel? At the same time, it is also a science
because one can extrapolate from historical data, so it's not a total guess.
3. Dashboard
The primary purpose of a dashboard is to convey the information at a glance. For this
audience, there is little, if any, need for drilling down on the data. At the same time,
presentation and ease of use are very important for a dashboard to be useful.
4. Multidimensional analysis
Multidimensional analysis is the "slicing and dicing" of the data. It offers good insight into the
numbers at a more granular level. This requires a solid data warehousing / data mart
backend, as well as business-savvy analysts to get to the necessary data.
5. Finding correlation among different factors
This is diving very deep into business intelligence. Questions asked are like, "How do different
factors correlate to one another?" and "Are there significant time trends that can be
leveraged/anticipated?"
Dimensional Data Model
The dimensional data model is most often used in data warehousing systems. This is different
from the 3rd normal form, commonly used for transactional (OLTP) systems. As you can
imagine, the same data would be stored differently in a dimensional model than in a 3rd
normal form model.
To understand dimensional data modeling, let's define some of the terms commonly used in
this type of modeling:
Dimension: A category of information. For example, the time dimension.
Attribute: A unique level within a dimension. For example, Month is an attribute in the Time
Dimension.
Hierarchy: The specification of levels that represents the relationships between different attributes
within a dimension. For example, one possible hierarchy in the Time dimension is Year →
Quarter → Month → Day.
Fact Table: A fact table is a table that contains the measures of interest. For example, sales
amount would be such a measure. This measure is stored in the fact table with the
appropriate granularity. For example, it can be sales amount by store by day. In this case, the
fact table would contain three columns: A date column, a store column, and a sales amount
column.
Lookup Table: The lookup table provides the detailed information about the attributes. For
example, the lookup table for the Quarter attribute would include a list of all of the quarters
available in the data warehouse. Each row (each quarter) may have several fields, one for the
unique ID that identifies the quarter, and one or more additional fields that specify how that
particular quarter is represented on a report (for example, first quarter of 2001 may be
represented as "Q1 2001" or "2001 Q1").
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or
more lookup tables, but fact tables do not have direct relationships to one another.
Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key
columns in the lookup tables.
In designing data models for data warehouses / data marts, the most commonly used schema
types are the Star Schema and the Snowflake Schema.
Whether one uses a star or a snowflake largely depends on personal preference and
business needs. Personally, I am partial to snowflakes when there is a business case to
analyze the information at that particular level.
Fact Table Granularity
Granularity
The first step in designing a fact table is to determine the granularity of the fact table.
By granularity, we mean the lowest level of information that will be stored in the fact table.
This constitutes two steps:
1. Determine which dimensions will be included.
2. Determine where along the hierarchy of each dimension the information will be kept.
The determining factors usually go back to the requirements.
Which Dimensions To Include
Determining which dimensions to include is usually a straightforward process, because
business processes will often dictate clearly what the relevant dimensions are.
For example, in an off-line retail world, the dimensions for a sales fact table are usually time,
geography, and product. This list, however, is by no means a complete list for all off-line
retailers. A supermarket with a Rewards Card program, where customers provide some
personal information in exchange for lower prices on certain items when they present the
card at checkout, will also have the ability to track the customer dimension. Whether the
data warehousing system includes the customer dimension is then a decision that needs
to be made.
What Level Within Each Dimension To Include
Determining along which part of the hierarchy the information is stored for each dimension is a bit
trickier. This is where user requirements (both stated and possibly future) play a major
role.

In the above example, will the supermarket want to do analysis at the hourly level?
(i.e., looking at how certain products may sell during different hours of the day.) If so, it makes
sense to use 'hour' as the lowest level of granularity in the time dimension. If daily analysis is
sufficient, then 'day' can be used as the lowest level of granularity. Since the lower the level of
detail, the larger the amount of data in the fact table, the granularity exercise is in essence
figuring out the sweet spot in the tradeoff between detailed level of analysis and data storage.
Note that sometimes the users will not specify certain requirements, but based on the industry
knowledge, the data warehousing team may foresee that certain requirements will be
forthcoming that may result in the need for additional details. In such cases, it is prudent for
the data warehousing team to design the fact table such that lower-level information is
included. This will avoid possibly needing to re-design the fact table in the future. On the other
hand, trying to anticipate all future requirements is an impossible and hence futile exercise,
and the data warehousing team needs to fight the urge to dump the lowest level of detail
into the data warehouse, and include only what is practically needed. Sometimes this can
be more of an art than a science, and prior experience will become invaluable here.
Fact And Fact Table Types

Types of Facts
There are three types of facts:
• Additive: Additive facts are facts that can be summed up through all of the dimensions
in the fact table.
• Semi-Additive: Semi-additive facts are facts that can be summed up for some of the
dimensions in the fact table, but not the others.
• Non-Additive: Non-additive facts are facts that cannot be summed up for any of the
dimensions present in the fact table.
Let us use examples to illustrate each of the three types of facts. The first example assumes
that we are a retailer, and we have a fact table with the following columns:
Date
Store
Product
Sales_Amount
The purpose of this table is to record the sales amount for each product in each store on a
daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact,
because you can sum up this fact along any of the three dimensions present in the fact table
-- date, store, and product. For example, the sum of Sales_Amount for all 7 days in a week
represents the total sales amount for that week.
Say we are a bank with the following fact table:
Date
Account
Current_Balance
Profit_Margin

The purpose of this table is to record the current balance for each account at the end of each
day, as well as the profit margin for each account for each
day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive
fact, as it makes sense to add them up for all accounts (what's the total current balance for all
accounts in the bank?), but it does not make sense to add them up through time (adding up
all current balances for a given account for each day of the month does not give us any useful
information). Profit_Margin is a non-additive fact, for it does not make sense to add it up
at either the account level or the day level.
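To illustrate the distinction in SQL (a sketch only; the fact table names are hypothetical, with the columns as listed above, and 'Date' may need quoting or renaming in some databases):

    -- Additive: summing Sales_Amount across the date dimension is meaningful
    SELECT Store, SUM(Sales_Amount) AS weekly_sales
    FROM   sales_fact
    WHERE  Date BETWEEN DATE '2003-01-06' AND DATE '2003-01-12'
    GROUP  BY Store;

    -- Semi-additive: summing Current_Balance across accounts for a single day is
    -- meaningful, but summing it across days is not; aggregate along the time
    -- dimension with an average or the last value instead
    SELECT Date, SUM(Current_Balance) AS total_balance
    FROM   account_snapshot_fact
    GROUP  BY Date;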

Types of Fact Tables


Based on the above classifications, there are two types of fact tables:
• Cumulative: This type of fact table describes what has happened over a period of
time. For example, this fact table may describe the total sales by product by store by
day. The facts for this type of fact table are mostly additive facts. The first example
presented here is a cumulative fact table.
• Snapshot: This type of fact table describes the state of things at a particular instant
in time, and usually includes more semi-additive and non-additive facts. The second
example presented here is a snapshot fact table.

In the star schema design, a single object (the fact table) sits in the middle and is radially
connected to other surrounding objects (dimension lookup tables) like a star. Each dimension
is represented as a single table. The primary key in each dimension table is related to a
foreign key in the fact table.
Sample star schema
All measures in the fact table are related to all the dimensions that the fact table is related to.
In other words, they all have the same level of granularity.
A star schema can be simple or complex. A simple star consists of one fact table; a complex
star can have more than one fact table.
Let's look at an example: assume our data warehouse keeps store sales data, and the
different dimensions are time, store, product, and customer. In this case, the sample star
schema above represents our design. The lines between two tables indicate that there is a
primary key / foreign key relationship between the two tables. Note that different dimensions
are not related to one another.
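A minimal physical sketch of this star schema (the table and column names are illustrative only) looks like the following: one fact table carrying the measure, with a foreign key to each of the four dimension lookup tables.

    CREATE TABLE dim_time     (time_key     INT PRIMARY KEY, full_date DATE, month_name VARCHAR(20), year_num INT);
    CREATE TABLE dim_store    (store_key    INT PRIMARY KEY, store_name VARCHAR(50), region VARCHAR(50));
    CREATE TABLE dim_product  (product_key  INT PRIMARY KEY, product_name VARCHAR(50), category VARCHAR(50));
    CREATE TABLE dim_customer (customer_key INT PRIMARY KEY, customer_name VARCHAR(50), state VARCHAR(30));

    -- The fact table references every dimension; the dimensions do not reference one another
    CREATE TABLE sales_fact (
        time_key     INT NOT NULL REFERENCES dim_time     (time_key),
        store_key    INT NOT NULL REFERENCES dim_store    (store_key),
        product_key  INT NOT NULL REFERENCES dim_product  (product_key),
        customer_key INT NOT NULL REFERENCES dim_customer (customer_key),
        sales_amount DECIMAL(12,2)
    );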

The snowflake schema is an extension of the star schema, where each point of the star
explodes into more points. In a star schema, each dimension is represented by a single
dimensional table, whereas in a snowflake schema, that dimensional table is normalized into
multiple lookup tables, each representing a level in the dimensional hierarchy.
Sample snowflake schema
For example, consider a Time Dimension that consists of 2 different hierarchies:
1. Year → Month → Day
2. Week → Day
We will have 4 lookup tables in a snowflake schema: a lookup table for year, a lookup table
for month, a lookup table for week, and a lookup table for day. Year is connected to Month,
which is then connected to Day. Week is only connected to Day. The sample snowflake
schema above illustrates these relationships in the Time Dimension.
The main advantage of the snowflake schema is the improvement in query performance due
to minimized disk storage requirements and joining smaller lookup tables. The main
disadvantage of the snowflake schema is the additional maintenance effort needed due to
the increased number of lookup tables.
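A minimal sketch of the snowflaked Time dimension described above (illustrative names only; Day is the lowest level, rolling up to Month and Year along one hierarchy and to Week along the other):

    CREATE TABLE lookup_year  (year_key  INT PRIMARY KEY, year_num  INT);
    CREATE TABLE lookup_month (month_key INT PRIMARY KEY, month_num INT,
                               year_key  INT REFERENCES lookup_year (year_key));
    CREATE TABLE lookup_week  (week_key  INT PRIMARY KEY, week_num  INT);
    CREATE TABLE lookup_day   (day_key   INT PRIMARY KEY, full_date DATE,
                               month_key INT REFERENCES lookup_month (month_key),
                               week_key  INT REFERENCES lookup_week  (week_key));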
Slowly Changing Dimension
The "Slowly Changing Dimension" problem is a common one particular to data warehousing.
In a nutshell, this applies to cases where the attribute for a record varies over time. We give
an example below:
Christina is a customer with ABC Inc. She first lived in Chicago, Illinois. So, the original entry
in the customer lookup table has the following record:
Customer Key | Name      | State
1001         | Christina | Illinois
At a later date, she moved to Los Angeles, California in January 2003. How should ABC Inc.
now modify its customer table to reflect this change? This is the "Slowly Changing Dimension"
problem.
There are in general three ways to solve this type of problem, and they are categorized as
follows:
Type 1: The new record replaces the original record. No trace of the old record exists.
Type 2: A new record is added into the customer dimension table. Therefore, the customer is
treated essentially as two people.
Type 3: The original record is modified to reflect the change.

We next take a look at each of the scenarios and how the data model and the data looks like
for each of them. Finally, we compare and contrast among the three alternatives.

In Type 1 Slowly Changing Dimension, the new information simply overwrites the original
information. In other words, no history is kept.
In our example, recall we originally have the following table:
Customer Key | Name      | State
1001         | Christina | Illinois
After Christina moved from Illinois to California, the new information replaces the original record,
and we have the following table:
Customer Key | Name      | State
1001         | Christina | California
Advantages:
- This is the easiest way to handle the Slowly Changing Dimension problem, since there is no
need to keep track of the old information.
Disadvantages:
- All history is lost. By applying this methodology, it is not possible to trace back in history. For
example, in this case, the company would not be able to know that Christina lived in Illinois
before.
Usage:
About 50% of the time.
When to use Type 1:
Type 1 slowly changing dimension should be used when it is not necessary for the data
warehouse to keep track of historical changes.
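In SQL terms, a Type 1 change is a simple in-place update (a sketch using the customer lookup table from the example above; the table name is assumed):

    -- Overwrite the attribute; no trace of the old value remains
    UPDATE customer_dim
    SET    State = 'California'
    WHERE  Customer_Key = 1001;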

In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the
new information. Therefore, both the original and the new record will be present. The new
record gets its own primary key.
In our example, recall we originally have the following table:
Customer Key | Name      | State
1001         | Christina | Illinois
After Christina moved from Illinois to California, we add the new information as a new row into
the table:
Customer Key | Name      | State
1001         | Christina | Illinois
1005         | Christina | California
Advantages:
- This allows us to accurately keep all historical information.
Disadvantages:
- This will cause the size of the table to grow fast. In cases where the number of rows for the
table is very high to start with, storage and performance can become a concern.
- This necessarily complicates the ETL process.
Usage:
About 50% of the time.
When to use Type 2:
Type 2 slowly changing dimension should be used when it is necessary for the data
warehouse to track historical changes.
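In SQL terms, a Type 2 change adds a row rather than updating one (a sketch using the same assumed customer table; many real implementations also carry effective-date or current-flag columns, which the example above omits):

    -- The original row for Customer_Key 1001 is left untouched;
    -- the new row gets its own surrogate key
    INSERT INTO customer_dim (Customer_Key, Name, State)
    VALUES (1005, 'Christina', 'California');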
In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular
attribute of interest, one indicating the original value, and one indicating the current value.
There will also be a column that indicates when the current value becomes active.
In our example, recall we originally have the following table:
Customer Key | Name      | State
1001         | Christina | Illinois
To accommodate Type 3 Slowly Changing Dimension, we will now have the following
columns:
• Customer Key
• Name
• Original State
• Current State
• Effective Date
After Christina moved from Illinois to California, the original information gets updated, and we
have the following table (assuming the effective date of change is January 15, 2003):
Customer Key | Name      | Original State | Current State | Effective Date
1001         | Christina | Illinois       | California    | 15-JAN-2003
Advantages:
- This does not increase the size of the table, since new information is updated.
- This allows us to keep some part of history.
Disadvantages:
- Type 3 will not be able to keep all history where an attribute is changed more than once. For
example, if Christina later moves to Texas on December 15, 2003, the California information
will be lost.
Usage:
Type 3 is rarely used in actual practice.
When to use Type 3:
Type 3 slowly changing dimension should only be used when it is necessary for the data
warehouse to track historical changes, and when such changes will only occur a finite
number of times.
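In SQL terms, a Type 3 change updates the current-value and effective-date columns in place, while the original-value column keeps the prior state (a sketch using the assumed customer table and the columns listed above):

    UPDATE customer_dim
    SET    Current_State  = 'California',
           Effective_Date = DATE '2003-01-15'   -- Original_State keeps 'Illinois'
    WHERE  Customer_Key   = 1001;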
The three levels of data modeling are the conceptual data model, the logical data model, and
the physical data model. Here we compare these three types of data models.
The table below compares the different features:
Feature              | Conceptual | Logical | Physical
Entity Names         |     ✓      |    ✓    |
Entity Relationships |     ✓      |    ✓    |
Attributes           |            |    ✓    |
Primary Keys         |            |    ✓    |    ✓
Foreign Keys         |            |    ✓    |    ✓
Table Names          |            |         |    ✓
Column Names         |            |         |    ✓
Column Data Types    |            |         |    ✓
Below are the conceptual, logical, and physical versions of a single data model (Conceptual Model Design, Logical Model Design, and Physical Model Design diagrams).
We can see that the complexity increases from conceptual to logical to physical. This is why
we always start with the conceptual data model (so we understand at a high level what the
different entities in our data are and how they relate to one another), then move on to the
logical data model (so we understand the details of our data without worrying about how they
will actually be implemented), and finally the physical data model (so we know exactly how to
implement our data model in the database of choice). In a data warehousing project,
sometimes the conceptual data model and the logical data model are considered as a single
deliverable.

Data integrity refers to the validity of data, meaning data is consistent and correct. In the
data warehousing field, we frequently hear the term, "Garbage In, Garbage Out." If there is no
data integrity in the data warehouse, any resulting report and analysis will not be useful.
In a data warehouse or a data mart, there are three areas where data integrity needs to be
enforced:
Database level
We can enforce data integrity at the database level. Common ways of enforcing data integrity
include:
Referential integrity
The relationship between the primary key of one table and the foreign key of another table
must always be maintained. For example, a primary key cannot be deleted if there is still a
foreign key that refers to this primary key.
Primary key / Unique constraint
Primary keys and the UNIQUE constraint are used to make sure every row in a table can be
uniquely identified.
Not NULL vs NULL-able
Columns identified as NOT NULL may not have a NULL value.
Valid Values
Only allowed values are permitted in the database. For example, if a column can only have
positive integers, a value of '-1' cannot be allowed.
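The following sketch (hypothetical table and column names) shows all four database-level mechanisms together: a primary key, a NOT NULL column, a valid-values CHECK constraint, and a foreign key enforcing referential integrity from the fact table:

    CREATE TABLE customer_dim (
        customer_key INT           PRIMARY KEY,                  -- uniqueness
        name         VARCHAR(50)   NOT NULL,                     -- NOT NULL column
        state        VARCHAR(30),                                -- NULL-able column
        credit_limit DECIMAL(12,2) CHECK (credit_limit >= 0)     -- valid values only
    );

    CREATE TABLE sales_fact (
        customer_key INT NOT NULL REFERENCES customer_dim (customer_key),  -- referential integrity
        sales_amount DECIMAL(12,2)
    );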
ETL process
For each step of the ETL process, data integrity checks should be put in place to ensure that
source data is the same as the data in the destination. Most common checks include record
counts or record sums.
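A typical reconciliation check (a sketch; the table, column, and load-date names are hypothetical) compares the record count and a measure sum in the target against the same figures captured from the source extract for that load:

    -- Run against the target after the load; the results should match the
    -- count and sum recorded when the source data was extracted
    SELECT COUNT(*)          AS row_count,
           SUM(sales_amount) AS total_amount
    FROM   sales_fact
    WHERE  load_date = DATE '2003-01-15';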
Access level
We need to ensure that data is not altered by any unauthorized means either during the ETL
process or in the data warehouse. To do this, there needs to be safeguards against
unauthorized access to data (including physical access to the servers), as well as logging of
all data access history. Data integrity can only be ensured if there is no unauthorized access to
the data.

Source System
A database, application, file, or other storage facility from which the data in a
data warehouse is derived.
Mapping
The definition of the relationship and data flow between source and target
objects.
Metadata
Data that describes data and other structures, such as objects, business rules,
and processes. For example, the schema design of a data warehouse is
typically stored in a repository as metadata, which is used to generate scripts
used to build and populate the data warehouse. A repository contains
metadata.
Staging Area
A place where data is processed before entering the warehouse.
Cleansing
The process of resolving inconsistencies and fixing the anomalies in source
data, typically as part of the ETL process.
Transformation
The process of manipulating data. Any manipulation beyond copying is a
transformation. Examples include cleansing, aggregating, and integrating
data from multiple sources.
Transportation
The process of moving copied or transformed data from a source to a data
warehouse.
Target System
A database, application, file, or other storage facility to which the
"transformed source data" is loaded in a data warehouse.

Guidelines to work with Informatica Power Center


• Repository: This is where all the metadata information is stored in the
Informatica suite. The Power Center Client and the Repository Server
would access this repository to retrieve, store and manage metadata.
• Power Center Client: The Informatica client is used for managing users,
identifying source and target system definitions, creating mappings and
mapplets, creating sessions, running workflows, etc.
• Repository Server: This repository server takes care of all the
connections between the repository and the Power Center Client.
• Power Center Server: The Power Center Server does the extraction from
the source and then loads the data into the targets.
• Designer: Source Analyzer, Mapping Designer and Warehouse
Designer are tools that reside within the Designer wizard. Source Analyzer
is used for extracting metadata from source systems.
Mapping Designer is used to create mapping between sources and
targets. Mapping is a pictorial representation about the flow of data
from source to target.
Warehouse Designer is used for extracting metadata from target
systems or metadata can be created in the Designer itself.
• Data Cleansing: PowerCenter's data cleansing technology
improves data quality by validating, correctly naming, and
standardizing address data. A person's address may not be the same in
all source systems because of typos, and the postal code or city name may
not match the address. These errors can be corrected using the data
cleansing process, and standardized data can then be loaded into target systems
(the data warehouse).
• Transformation: Transformations help to transform the source data
according to the requirements of target system. Sorting, Filtering,
Aggregation, Joining are some of the examples of transformation.
Transformations ensure the quality of the data being loaded into target
and this is done during the mapping process from source to target.
• Workflow Manager: Workflows help to load the data from source to
target in a sequential manner. For example, if the fact tables are loaded
before the lookup tables, the target system will raise an error, since the
fact table would violate foreign key validation. To avoid this, workflows
can be created to ensure the correct flow of data from source to target.
• Workflow Monitor: This monitor is helpful in monitoring and tracking
the workflows created in each Power Center Server.
• Power Center Connect: This component helps to extract data and
metadata from ERP systems like IBM's MQSeries, Peoplesoft, SAP,
Siebel etc. and other third party applications.
• Power Center Exchange: This component helps to extract data and
metadata from ERP systems like IBM's MQSeries, Peoplesoft, SAP,
Siebel etc. and other third party applications.

Power Exchange:
Informatica Power Exchange, as a standalone service or along with Power
Center, helps organizations leverage data by avoiding manual coding of data
extraction programs. Power Exchange supports batch, real-time and changed
data capture options on mainframe (DB2, VSAM, IMS, etc.), midrange
(AS/400 DB2, etc.), and relational databases (Oracle, SQL Server, DB2, etc.),
as well as flat files on UNIX, Linux and Windows systems.

Power Channel:
This helps to transfer large amounts of encrypted and compressed data over
LAN or WAN, through firewalls, transfer files over FTP, etc.
Meta Data Exchange:
Metadata Exchange enables organizations to take advantage of the time and effort already invested in defining data structures within their IT environment when used with Power Center. For example, an organization may be using data modeling tools such as Erwin, Embarcadero, Oracle Designer, or Sybase PowerDesigner for developing data models. The functional and technical teams will have spent considerable time and effort creating the data model's data structures (tables, columns, data types, procedures, functions, triggers, etc.). By using Metadata Exchange, these data structures can be imported into Power Center to identify source and target mappings, which saves that time and effort. There is no need for the Informatica developer to create these data structures again.

Power Analyzer:
Power Analyzer provides organizations with reporting facilities. PowerAnalyzer makes accessing, analyzing, and sharing enterprise data simple and easily available to decision makers, enabling them to gain insight into business processes and develop business intelligence.
With PowerAnalyzer, an organization can extract, filter, format, and analyze corporate information from data stored in a data warehouse, data mart, operational data store, or other data storage model. PowerAnalyzer works best with a dimensional data warehouse in a relational database, but it can also run reports on data in any relational table that does not conform to the dimensional model.
Super Glue:
SuperGlue is used to load metadata from several sources into a centralized place. Reports can be run against SuperGlue to analyze this metadata.
Power Mart:
Power Mart is a departmental version of Informatica for building, deploying, and managing data warehouses and data marts. Power Center is used for the corporate enterprise data warehouse, while Power Mart is used for departmental data warehouses such as data marts. Power Center supports global and networked repositories and can be connected to many sources; Power Mart supports a single repository and can be connected to fewer sources than Power Center. Power Mart can grow into an enterprise implementation, and its codeless environment makes developers more productive.

Active Transformation
An active transformation can change the number of rows that pass through it from source to target; for example, it may eliminate rows that do not meet the transformation's condition.
Passive Transformation
A passive transformation does not change the number of rows that pass through it; every row that enters the transformation is passed on.
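As a rough Python analogy (not Informatica syntax), the difference can be pictured as follows; the sample rows and field names are assumptions.

    rows = [{"id": 1, "amount": 50}, {"id": 2, "amount": 150}, {"id": 3, "amount": 200}]

    # Active-style step: the row count can change (a filter drops rows here).
    filtered = [r for r in rows if r["amount"] > 100]

    # Passive-style step: values are derived but every row passes through.
    with_tax = [{**r, "amount_with_tax": round(r["amount"] * 1.1, 2)} for r in rows]

    print(len(rows), len(filtered), len(with_tax))   # 3 2 3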
Transformations can be Connected or UnConnected.
Connected Transformation
A connected transformation is connected to other transformations or directly to the target table in the mapping.


UnConnected Transformation
An unconnected transformation is not connected to other transformations in
the mapping. It is called within another transformation, and returns a value to
that transformation.
The following transformations are available in Informatica:
• Aggregator Transformation
• Expression Transformation
• Filter Transformation
• Joiner Transformation
• Lookup Transformation
• Normalizer Transformation
• Rank Transformation
• Router Transformation
• Sequence Generator Transformation
• Stored Procedure Transformation
• Sorter Transformation
• Source Qualifier Transformation
• Update Strategy Transformation
• XML Source Qualifier Transformation
• Advanced External Procedure Transformation
• External Procedure Transformation

Aggregator Transformation
Aggregator transformation is an Active and Connected transformation. It is used to perform calculations such as averages and sums, mainly calculations across multiple rows or groups; for example, to calculate the total of daily sales or the average of monthly or yearly sales. Aggregate functions such as AVG, FIRST, COUNT, PERCENTILE, MAX, and SUM can be used in the Aggregator transformation.
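The grouping-and-aggregating idea can be sketched in plain Python as below; the sample sales rows are assumptions, and a real Aggregator transformation is configured in the Designer rather than coded.

    from collections import defaultdict

    sales = [{"region": "East", "amount": 100.0},
             {"region": "East", "amount": 250.0},
             {"region": "West", "amount": 300.0}]

    groups = defaultdict(list)
    for row in sales:                          # group rows by the group-by field
        groups[row["region"]].append(row["amount"])

    for region, amounts in groups.items():     # SUM and AVG per group
        print(region, sum(amounts), sum(amounts) / len(amounts))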

Expression Transformation
Expression transformation is a Passive and Connected transformation. It can be used to calculate values in a single row before writing to the target; for example, to calculate the discount for each product, to concatenate first and last names, or to convert a date to a string field.
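A minimal Python sketch of the same row-by-row derivations; the field names and the 10% discount rule are assumptions.

    from datetime import date

    row = {"first_name": "Jane", "last_name": "Doe",
           "price": 80.0, "order_date": date(2020, 1, 15)}

    # Expression-style logic: one row in, one row out, with derived fields added.
    row["full_name"] = row["first_name"] + " " + row["last_name"]
    row["discounted_price"] = round(row["price"] * 0.9, 2)           # assumed 10% discount
    row["order_date_str"] = row["order_date"].strftime("%Y-%m-%d")   # date -> string

    print(row)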


Filter Transformation
Filter transformation is an Active and Connected transformation. It is used to filter out rows in a mapping that do not meet a condition; for example, to find all employees working in Department 10, or to find the products priced between $500 and $1000.
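A minimal Python sketch of those two filter conditions; the sample rows are assumptions.

    employees = [{"name": "A", "dept": 10}, {"name": "B", "dept": 20}, {"name": "C", "dept": 10}]
    products = [{"name": "X", "price": 450}, {"name": "Y", "price": 750}, {"name": "Z", "price": 1200}]

    dept_10 = [e for e in employees if e["dept"] == 10]              # only rows meeting the condition pass
    mid_range = [p for p in products if 500 <= p["price"] <= 1000]   # price between $500 and $1000

    print(dept_10)     # employee B is dropped
    print(mid_range)   # only product Y passes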

Joiner Transformation
Joiner Transformation is an Active and Connected transformation. It can be used to join two sources coming from two different locations or from the same location; for example, to join a flat file and a relational source, to join two flat files, or to join a relational source and an XML source.
In order to join two sources, there must be at least one matching port. While joining two sources, it is mandatory to specify one source as the master and the other as the detail.
The Joiner transformation supports the following types of joins:
• Normal
• Master Outer
• Detail Outer
• Full Outer
Normal join discards all the rows of data from the master and detail source
that do not match, based on the condition.

A master outer join keeps all the rows from the detail source and the matching rows from the master source; it discards the unmatched rows from the master source.

Detail outer join keeps all rows of data from the master source and the
matching rows from the detail source. It discards the unmatched rows from
the detail source.

Full outer join keeps all rows of data from both the master and detail sources.
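The four join types can be pictured with this small Python sketch (plain dictionaries, not Informatica syntax); the master/detail rows and the join key are assumptions.

    master = [{"prod_id": 1, "prod_name": "Pen"}, {"prod_id": 2, "prod_name": "Book"}]
    detail = [{"prod_id": 1, "qty": 5}, {"prod_id": 3, "qty": 7}]

    def join(master, detail, keep_unmatched_master=False, keep_unmatched_detail=False):
        by_key = {m["prod_id"]: m for m in master}
        out, matched = [], set()
        for d in detail:
            m = by_key.get(d["prod_id"])
            if m:
                matched.add(d["prod_id"])
                out.append({**m, **d})
            elif keep_unmatched_detail:                 # detail row with no master match
                out.append({**d, "prod_name": None})
        if keep_unmatched_master:                       # master rows with no detail match
            out += [{**m, "qty": None} for m in master if m["prod_id"] not in matched]
        return out

    print(join(master, detail))                              # Normal: only matching rows
    print(join(master, detail, keep_unmatched_detail=True))  # Master outer: all detail rows kept
    print(join(master, detail, keep_unmatched_master=True))  # Detail outer: all master rows kept
    print(join(master, detail, True, True))                  # Full outer: all rows from both sources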

Lookup Transformation
Lookup transformation is Passive, and it can be either Connected or UnConnected. It is used to look up data in a relational table, view, or synonym. The lookup definition can be imported either from the source or from the target tables.

Differences between Connected and UnConnected Lookup Transformations:

A connected lookup receives input values directly from the mapping pipeline, whereas an unconnected lookup receives values from a :LKP expression in another transformation.

A connected lookup returns multiple columns from the same row, whereas an unconnected lookup has one return port and returns one column from each row.

A connected lookup supports user-defined default values, whereas an unconnected lookup does not support user-defined default values.
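A rough Python analogy of the two lookup styles; the lookup table, ports, and default values are assumptions (:LKP is Informatica expression syntax, not Python).

    # Assumed lookup "table": customer_id -> several attribute columns.
    customers = {101: {"name": "Asha", "city": "Pune"},
                 102: {"name": "Ravi", "city": "Delhi"}}

    # Connected-style: sits in the pipeline and can return several columns per row,
    # with user-defined default values for non-matching rows.
    def connected_lookup(row):
        match = customers.get(row["customer_id"], {"name": "UNKNOWN", "city": "UNKNOWN"})
        return {**row, **match}

    # Unconnected-style: invoked like a function (as a :LKP expression is) and
    # returns a single value through one return port.
    def unconnected_lookup(customer_id):
        match = customers.get(customer_id)
        return match["city"] if match else None

    print(connected_lookup({"customer_id": 101, "amount": 20}))
    print(unconnected_lookup(999))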
Normalizer Transformation
Normalizer Transformation is an Active and Connected transformation. It is used mainly with COBOL sources, where the data is often stored in denormalized format. The Normalizer transformation can also be used to create multiple rows from a single row of data.
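A minimal Python sketch of turning one denormalized row (the typical COBOL OCCURS layout) into several rows; the quarterly column names are assumptions.

    # One denormalized row with a repeating quarterly group.
    row = {"store": "S1", "q1_sales": 100, "q2_sales": 120, "q3_sales": 90, "q4_sales": 150}

    # Normalizer-style output: one row per occurrence of the repeating group.
    normalized = [{"store": row["store"], "quarter": q, "sales": row[f"q{q}_sales"]}
                  for q in range(1, 5)]

    for r in normalized:
        print(r)   # 4 output rows produced from 1 input row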
Rank Transformation
Rank transformation is an Active and Connected transformation. It is used to select the top or bottom rank of data; for example, to select the top 10 regions by sales volume or the 10 lowest-priced products.
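A short Python sketch of a top-N / bottom-N selection; the sample figures are assumptions.

    import heapq

    regions = [("East", 500), ("West", 900), ("North", 300), ("South", 700)]
    products = [("Pen", 2.0), ("Book", 8.5), ("Bag", 30.0), ("Lamp", 12.0)]

    top_regions = heapq.nlargest(2, regions, key=lambda r: r[1])      # highest sales volume
    cheapest = heapq.nsmallest(2, products, key=lambda p: p[1])       # lowest priced products

    print(top_regions)   # [('West', 900), ('South', 700)]
    print(cheapest)      # [('Pen', 2.0), ('Book', 8.5)]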
Router Transformation
Router is an Active and Connected transformation. It is similar to the Filter transformation; the difference is that the Filter transformation drops the data that does not meet the condition, whereas the Router has an option to capture the data that does not meet the condition. It is useful for testing multiple conditions, and it has input, output, and default groups. For example, to split data by State=Michigan, State=California, State=New York and all other states, the Router makes it easy to route the rows to different tables.
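A minimal Python sketch of routing rows into state groups plus a default group; the sample rows and group names are assumptions.

    rows = [{"id": 1, "state": "Michigan"}, {"id": 2, "state": "Texas"},
            {"id": 3, "state": "California"}, {"id": 4, "state": "New York"}]

    groups = {"Michigan": [], "California": [], "New York": [], "DEFAULT": []}

    for row in rows:                                   # one pass, several output groups
        groups.get(row["state"], groups["DEFAULT"]).append(row)

    for name, members in groups.items():
        print(name, members)                           # the Texas row lands in the default group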

Sequence Generator Transformation


Sequence Generator transformation is a Passive and Connected transformation. It is used to create unique primary key values, to cycle through a sequential range of numbers, or to replace missing keys.

It has two output ports, NEXTVAL and CURRVAL, which connect to other transformations; you cannot add ports to this transformation. The NEXTVAL port generates a sequence of numbers when connected to a transformation or target. CURRVAL is NEXTVAL plus the Increment By value (which is 1 by default).
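A rough Python sketch of generating NEXTVAL-style surrogate keys; the start value, increment, and sample rows are assumptions.

    import itertools

    def sequence_generator(start=100, increment=1):
        # Toy NEXTVAL stream of unique surrogate key values.
        return itertools.count(start, increment)

    nextval = sequence_generator()
    orders = [{"order": "A"}, {"order": "B"}, {"order": "C"}]
    keyed = [{**row, "order_key": next(nextval)} for row in orders]   # keys 100, 101, 102

    print(keyed)
    # In PowerCenter terms, CURRVAL would be NEXTVAL plus the Increment By value
    # (101 while NEXTVAL is 100 in this assumed configuration).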
Stored Procedure Transformation
Stored Procedure transformation is a Passive transformation and can be Connected or UnConnected. It is useful for automating time-consuming tasks; it is also used for error handling, for dropping and recreating indexes, for determining free space in the database, for specialized calculations, and so on.
The stored procedure must exist in the database before a Stored Procedure transformation is created, and it can exist in the source, target, or any database with a valid connection to the Informatica Server. A stored procedure is an executable script containing SQL statements, control statements, user-defined variables, and conditional statements.
Sorter Transformation
Sorter transformation is a Connected and Active transformation. It allows data to be sorted in ascending or descending order according to a specified field. It can also be configured for case-sensitive sorting and to specify whether the output rows should be distinct.
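A short Python sketch of sorting with an optional distinct output; the field name and rows are assumptions.

    rows = [{"city": "Pune"}, {"city": "Delhi"}, {"city": "Pune"}, {"city": "Agra"}]

    # Sort ascending on the chosen field; reverse=True would sort descending.
    sorted_rows = sorted(rows, key=lambda r: r["city"])

    # Optional "distinct output rows" behaviour: drop duplicates after sorting.
    distinct_rows, seen = [], set()
    for r in sorted_rows:
        if r["city"] not in seen:
            seen.add(r["city"])
            distinct_rows.append(r)

    print(sorted_rows)     # 4 rows, ordered by city
    print(distinct_rows)   # 3 rows, duplicates removed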
Source Qualifier Transformation
Source Qualifier transformation is an Active and Connected transformation. When a relational or flat file source definition is added to a mapping, it must be connected to a Source Qualifier transformation. The Source Qualifier performs tasks such as overriding the default SQL query, filtering records, and joining data from two or more tables.
Update Strategy Transformation
Update Strategy transformation is an Active and Connected transformation. It is used to control how data is written to the target table, either to maintain a full history of the data or to keep only the most recent changes. You can specify how to treat source rows: insert, update, delete, or data driven.
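A minimal Python sketch of data-driven row flagging; the existing target keys, source rows, and the cancellation rule are assumptions.

    # Assumed existing target keys and incoming source rows.
    target_keys = {1, 2}
    source_rows = [{"id": 1, "status": "active"},
                   {"id": 3, "status": "active"},
                   {"id": 2, "status": "cancelled"}]

    def flag(row):
        # Rough analogue of DD_INSERT / DD_UPDATE / DD_DELETE flagging.
        if row["status"] == "cancelled":
            return "DELETE"
        return "UPDATE" if row["id"] in target_keys else "INSERT"

    for row in source_rows:
        print(row["id"], flag(row))   # 1 UPDATE, 3 INSERT, 2 DELETE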

XML Source Qualifier Transformation



XML Source Qualifier is a Passive and Connected transformation. The XML Source Qualifier is used only with an XML source definition. It represents the data elements that the Informatica Server reads when it executes a session with XML sources.

Advanced External Procedure Transformation


Advanced External Procedure transformation is an Active and Connected
transformation. It operates in conjunction with procedures, which are created
outside of the Designer interface to extend PowerCenter/PowerMart
functionality. It is useful in creating external transformation applications,
such as sorting and aggregation, which require all input rows to be processed
before emitting any output rows.
External Procedure Transformation
External Procedure transformation is a Passive transformation and can be Connected or UnConnected. Sometimes the standard transformations, such as the Expression transformation, may not provide the functionality you need. In such cases an External Procedure is useful for developing complex functions within a dynamic link library (DLL) or UNIX shared library, instead of creating the necessary Expression transformations in a mapping.
Differences between Advanced External Procedure and External Procedure Transformations:

An External Procedure returns a single value, whereas an Advanced External Procedure returns multiple values.
An External Procedure supports both COM and Informatica procedures, whereas an Advanced External Procedure supports only Informatica procedures.
