
The Conformed ETL Architecture

Pounding Square Pegs Into Round Holes (Without Cutting Corners)


By Pat Minton and Don Steffen

Requirements have been defined, the data model is complete, source systems have been identified, tool selections have been made, and now the only thing left to do is connect the dots. Sounds easy, right? Tasked with creating the extract, transform and load (ETL) architecture to move the source data into the warehouse, you begin drawing boxes and lines depicting the individual ETL processes that must be created in order to reconcile the idiosyncratic source system data into its generic business view of the world. The solution seems obvious: create an ETL process for each source you must introduce.

And they're off! The ETL analysts charge forward, creating the detail design (source-to-target mappings) for each of the processes you identified. However, upon reviewing said designs, you begin to discover that many of them are repetitively performing many of the same actions (and not always consistently). Maybe there's a better way.
ETL architects and data warehouse designers are faced with the task of homogenizing data into a standard and consistent format. Whether by geography, business unit or anything (and everything!) else, data in the source systems feeding the data warehouse is inherently different. This article examines the conformed ETL architecture, an approach that eliminates proliferating, redundant ETL actions by distinguishing between where to apply source-specific and non source-specific business rules during the ETL load processes within the data warehouse.

Defining ETL Actions


Creating a stable, timely, maintainable and extensible framework for
converging like, but disparate, data
sources into homogenized entities is
the goal of any successful ETL architecture. This meshing of source system-specific data into the business
view represented in the warehouse
requires two distinctly different types
of ETL actions:
Source-Specific Actions: Those
actions which apply source-specific
business rules in order to transform
and decode the data from a format
that is recognizable only by the
source systems into a standardized
format.

Non Source-Specific Actions: Those actions performed on data, regardless of the source system of origination. Non source-specific actions include the application of business rules, as well as the ETL logic necessary to effect the change in the warehouse (insert/update).


A logical ETL design outlines the necessary actions and mapping of specific elements that must be performed in order to successfully integrate disparate data sources. This detailed abstraction provides insight into the transformations that the source data must undergo at an atomic level, regardless of the ETL tool used for the physical implementation. Accurately categorizing these actions into their appropriate designations (source- or non source-specific) can be a difficult undertaking. Consider the following scenario, in which the customer's status (either ACTIVE or INACTIVE) and the customer's billing status (either BILLED or NOT BILLED) are being determined (see Figure 1).
[Figure 1: Source-Specific and Non Source-Specific Business Rules]
Source-specific rules for SRC X Customer (CUST_STATUS, BILLDT): If CUST_STATUS = A or X, then STATUS = ACTIVE, else INACTIVE. LAST_BILL_DATE = BILLDT.
Source-specific rules for SRC Y Customer (DISCONNECT_DATE, LAST_BILL): If DISCONNECT_DATE is NULL, then STATUS = ACTIVE, else INACTIVE. LAST_BILL_DATE = LAST_BILL.
Non source-specific (generic) rule for billing status: If STATUS = ACTIVE and LAST_BILL_DATE is NULL, then BILLING_STATUS = NOT BILLED, else BILLING_STATUS = BILLED.
It would be easy to mistakenly determine that the assignment of the customer's billing status is an action specific to the source. However, by isolating the individual transformations and business rules to be applied to the source data, we gain better insight into the true scope of each business rule, allowing us to create generic, reusable business rules and definitions.
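To make the distinction concrete, here is a minimal Python sketch of the Figure 1 rules; the function names, dictionary layout and sample values are hypothetical, not taken from any actual implementation. Each source gets its own decoding routine, while the billing-status rule is written once against the conformed fields.

# Illustrative sketch only: hypothetical function and field names.
def conform_src_x(row):
    """Source-specific: decode SRC X fields into the standardized format."""
    return {
        "STATUS": "ACTIVE" if row["CUST_STATUS"] in ("A", "X") else "INACTIVE",
        "LAST_BILL_DATE": row["BILLDT"],
    }

def conform_src_y(row):
    """Source-specific: decode SRC Y fields into the standardized format."""
    return {
        "STATUS": "ACTIVE" if row["DISCONNECT_DATE"] is None else "INACTIVE",
        "LAST_BILL_DATE": row["LAST_BILL"],
    }

def billing_status(conformed):
    """Non source-specific: one generic rule, applied regardless of source."""
    if conformed["STATUS"] == "ACTIVE" and conformed["LAST_BILL_DATE"] is None:
        return "NOT BILLED"
    return "BILLED"

# The generic rule never needs to know which system a record came from.
examples = [
    ({"CUST_STATUS": "A", "BILLDT": None}, conform_src_x),
    ({"DISCONNECT_DATE": None, "LAST_BILL": "2004-08-31"}, conform_src_y),
]
for raw, conform in examples:
    print(billing_status(conform(raw)))   # prints NOT BILLED, then BILLED

Misclassifying the billing-status rule as source-specific would mean writing, testing and maintaining it once per source.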

Traditional Versus Conformed ETL Architecture
Following is a simple example as context for this discussion. Company ABC has two sales systems (systems X and Y), which contain information surrounding the sale of widgets. Each sale of a widget needs to be captured as part of the data warehouse implementation. The sales are qualified by the date of the sale, the customer purchasing the widget, the type of widget purchased (commercial or residential) and the total dollar amount of the sale (see Figure 2).

[Figure 2: Widget Sales Example Sources and Targets. Source X Data (Date of Sale, Customer Number, Sale Category, Sale Amount, Sales Tax) and Source Y Data (Customer Code, Sales Date, Widget, Amount) flow through ETL into the target star schema: a Widget Sales fact (Widget Type ID, Calendar ID, Customer ID, Total Sale Amount) joined to the Customer, Calendar and Widget Type dimensions.]
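For readers who prefer code to diagrams, the Figure 2 layouts can be sketched roughly as follows; the class and field names are illustrative only.

from dataclasses import dataclass
from datetime import date

@dataclass
class SourceXSale:            # Source X Data
    date_of_sale: date
    customer_number: str
    sale_category: str        # assumed here to correspond to the widget type
    sale_amount: float
    sales_tax: float

@dataclass
class SourceYSale:            # Source Y Data
    customer_code: str
    sales_date: date
    widget: str
    amount: float

@dataclass
class WidgetSaleFact:         # target fact row in the warehouse
    widget_type_id: int       # foreign key to the Widget Type dimension
    calendar_id: int          # foreign key to the Calendar dimension
    customer_id: int          # foreign key to the Customer dimension
    total_sale_amount: float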

When faced with the issue of how to create the necessary ETL processes to converge like data sources into standardized entities within the warehouse, there are two choices:
Traditional ETL Architecture: Create individual ETL processes for each source system (as shown in Figure 3). A traditional ETL architecture would create one ETL process to perform all of the logic necessary to transform the source data into its target destination. The advantage of this approach is that there are fewer ETL processes to create and maintain. The hidden shortcoming is that each individual ETL process must redundantly perform many of the same actions (the non source-specific actions). This duplication results in additional work throughout the life cycle of the data warehouse: during development, the redundancy increases the amount and complexity of ETL code to create and test, thereby lengthening the development timeline; upon implementation, modifications require more effort because changes potentially must be made in multiple places.

[Figure 3: Traditional ETL Architecture. Source X Data and Source Y Data each feed their own ETL process, which performs both the source-specific and the non source-specific actions and loads the Widget Sales fact and its Customer, Calendar and Widget Type dimensions directly.]

Conformed ETL Architecture: Create individual ETL processes for each source to apply the source-specific business rules necessary to conform the source into a generic format (conformed widget sales), and create reusable, generic processes to execute non source-specific business rules and actions (see Figure 4). A rough code sketch of the conform step follows.
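The sketch below is a minimal, hypothetical Python version of that source-specific conform step; the routine names, the assumption that Source X's Sale Category identifies the widget type, the Source Y date format and the handling of tax are illustrative guesses rather than details given in the example.

from datetime import datetime

def conform_source_x(row):
    """Source-specific conform step for Source X widget sales."""
    return {
        "widget_type_lkp_key": row["Sale Category"],   # assumed to identify the widget type
        "calendar_lkp_key": row["Date of Sale"],
        "customer_lkp_key": row["Customer Number"],
        "sale_amount": row["Sale Amount"],
        "tax_amount": row["Sales Tax"],
    }

def conform_source_y(row):
    """Source-specific conform step for Source Y widget sales."""
    return {
        "widget_type_lkp_key": row["Widget"],
        "calendar_lkp_key": datetime.strptime(row["Sales Date"], "%m/%d/%Y").date(),
        "customer_lkp_key": row["Customer Code"],
        "sale_amount": row["Amount"],
        "tax_amount": 0.0,    # assumption: Source Y does not carry tax separately
    }

Each routine knows one source intimately and nothing about the warehouse; everything downstream of the conformed widget sales layout can be shared.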

[Figure 4: Conformed ETL Architecture. Source X Data and Source Y Data each feed a source-specific ETL process that loads a Conformed Widget Sales table (Widget Type LKP Key, Calendar LKP Key, Customer LKP Key, Sale Amount, Tax Amount); a single non source-specific ETL process then resolves the Customer, Calendar and Widget Type dimension keys and loads the Widget Sales fact.]

The conformed ETL architecture involves creating an intermediary definition of the data entity (the conformed table). This operational data entity contains the standardized format of the converged source systems, clearly defines the boundary between source-specific and reusable generic actions and rules, and serves as the gateway for moving data into the warehouse. Obviously, the disadvantage of this approach is that there are more objects and processes to create and maintain. However, the underlying advantages of this architecture are plentiful:

Modularization of ETL processes: The creation of smaller, less complex ETL processes makes troubleshooting problems and creating future enhancements easier.

Reusability of post-conform processes: Often, the simple process of enforcing referential integrity (looking up foreign key assignments) and performing inserts and/or updates to the final target destination is not so simple. The conform approach prevents having to duplicate the efforts and logic necessary to apply non source-specific actions to data (see the sketch after this list). Eliminating redundant processing and ETL logic reduces the chances of error, ultimately improving data quality.

Extensibility: The reusable aspects of the conformed ETL architecture allow for the rapid acquisition of additional source systems. When new source data is introduced to the warehouse, only the source-specific ETL processes to conform the data must be created.
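A minimal sketch of such a reusable post-conform process is shown below, assuming the conformed layout from Figure 4 and assuming Total Sale Amount is simply sale amount plus tax; the dictionary-based dimension lookups and fact table stand in for real warehouse tables purely for illustration.

def load_widget_sales(conformed_rows, dims, fact_table):
    """Non source-specific load: resolve foreign keys, then insert/update the fact."""
    for row in conformed_rows:
        key = (
            dims["widget_type"][row["widget_type_lkp_key"]],   # FK lookups enforce
            dims["calendar"][row["calendar_lkp_key"]],          # referential integrity
            dims["customer"][row["customer_lkp_key"]],
        )
        total = row["sale_amount"] + row["tax_amount"]   # assumed definition of Total Sale Amount
        fact_table[key] = total    # insert when the key is new, update when it already exists

dims = {
    "widget_type": {"commercial": 1, "residential": 2},
    "calendar": {"2004-09-01": 10},
    "customer": {"C-100": 7},
}
fact = {}
load_widget_sales(
    [{"widget_type_lkp_key": "commercial", "calendar_lkp_key": "2004-09-01",
      "customer_lkp_key": "C-100", "sale_amount": 99.00, "tax_amount": 5.00}],
    dims, fact)
print(fact)   # {(1, 10, 7): 104.0}

Because this loader never inspects the originating system, introducing a third sales system later means writing only its conform routine; the post-conform load and dimension lookups are reused unchanged.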

To Conform or Not to Conform?


While the conform ETL architecture has many benefits, consideration
should be made concerning where to
apply the approach, as it is not appli-

Reprinted from DM Review September 2004

cable in all scenarios. Analysis should


be performed to decide which sources
to conform to ensure that the extra
development overhead will have
quantifiable benefits (increased performance, flexibility in ETL scheduling, modularization of processes,
etc.). In making the decision to follow
a conformed architecture or not, be
sure to keep in mind the future
requirements of the warehouse (additional sources of data, etc.), not just
the current state.
Redundancy within a data warehouse is a given where the data is concerned. By properly identifying and categorizing the business rules and logic to be applied to source data, many processing redundancies can be reduced as well. The conformed ETL architecture allows for the creation of a scalable, extensible framework, which helps to ensure that timely information is consistently delivered to the masses.
Pat Minton is a manager with DiamondCluster
International, a premier global management consulting
firm that helps leading organizations develop and
implement growth strategies, improve operations and
capitalize on technology. He has more than 10 years of
database architecture experience in the telecommunications, media and transportation industries. He may
be reached at patrick.minton@diamondcluster.com.
Don Steffen is a cofounder and partner of AmberLeaf
Partners, Inc. (formerly BI Solutions, Inc.), a consulting
firm dedicated to enabling innovative companies with
the information to make critical investment decisions.
Steffen has been designing and delivering technical
architecture and solutions in the business intelligence
and data warehouse industry for more than a decade.
He can be reached at dsteffen@amberleaf.net.

