Description
Prior to the release of PowerCenter 5, the only variables inherent to the product were specific to individual transformations, along with server variables that were global in nature. Transformation variables were defined as variable ports in a transformation and could only be used in that specific transformation object (e.g., Expression, Aggregator, and Rank transformations). Similarly, global parameters defined within Server Manager would affect the subdirectories for source files, target files, log files, and so forth.
More current versions of PowerCenter made variables and parameters available across the entire mapping rather
than for a specific transformation object. In addition, they provide built-in parameters for use within Workflow
Manager. Using parameter files, these values can change from session-run to session-run. With the addition of
workflows, parameters can now be passed to every session contained in the workflow, providing more flexibility
and reducing parameter file maintenance. Other important functionality that has been added in recent releases is
the ability to dynamically create parameter files that can be used in the next session in a workflow or in other
workflows.
To specify the parameter file that the Integration Service uses with a workflow, worklet, or session, do either of the following:
Enter the parameter file name and directory in the workflow, worklet, or session properties.
Start the workflow, worklet, or session using pmcmd and enter the parameter filename and directory in the
command line.
If entering a parameter file name and directory in the workflow, worklet, or session properties and in the pmcmd
command line, the Integration Service uses the information entered in the pmcmd command line.
For example, a session might use the following parameters and variables:
Parameter and Variable Type               Parameter and Variable Name   Desired Definition
String Mapping Parameter                  $$State                       MA
Datetime Mapping Variable                 $$Time                        10/1/2000 00:00:00
Source File (Session Parameter)           $InputFile1                   Sales.txt
Database Connection (Session Parameter)   $DBConnection_Target          Sales (database connection)
Session Log File (Session Parameter)      $PMSessionLogFile             d:/session logs/firstrun.txt
The parameter file for the session includes the folder and session name, as well as each parameter and variable:
[Production.s_MonthlyCalculations]
$$State=MA
$$Time=10/1/2000 00:00:00
$InputFile1=sales.txt
$DBConnection_target=sales
$PMSessionLogFile=D:/session logs/firstrun.txt
The next time the session runs, edit the parameter file to change the state to MD and delete the $$Time variable. Deleting the variable allows the Integration Service to use the value for the variable that was saved in the repository from the previous session run.
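For illustration, after those edits the parameter file for the second run would contain:
[Production.s_MonthlyCalculations]
$$State=MD
$InputFile1=sales.txt
$DBConnection_target=sales
$PMSessionLogFile=D:/session logs/firstrun.txt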
Mapping Variables
Declare mapping variables in PowerCenter Designer using the menu option Mappings -> Parameters and
Variables (See the first figure, below). After selecting mapping variables, use the pop-up window to create a
variable by specifying its name, data type, initial value, aggregation type, precision, and scale. This is similar to
creating a port in most transformations (See the second figure, below).
Variables, by definition, are objects that can change value dynamically. PowerCenter provides four functions for changing the value of mapping variables:
SetVariable
SetMaxVariable
SetMinVariable
SetCountVariable
A mapping variable can store the last value from a session run in the repository to be used as the starting value for
the next session run.
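As a rough sketch (the port and variable names other than $$Procedure_Start_Date and $$Post_Date are illustrative, not taken from this document), these functions are typically called from ports in an Expression transformation, one call per port:
SETVARIABLE($$Procedure_Start_Date, SESSSTARTTIME)    -- store the session start time
SETMAXVARIABLE($$Post_Date, DATE_ENTERED)             -- keep the largest DATE_ENTERED seen so far
SETMINVARIABLE($$Earliest_Order, DATE_ENTERED)        -- keep the smallest DATE_ENTERED seen so far
SETCOUNTVARIABLE($$Row_Count)                         -- adjust the count as rows pass through
At the end of a successful session run, the final values are written to the repository as described above.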
Name. The name of the variable should be descriptive and be preceded by $$ (so that it is easily identifiable
as a variable). A typical variable name is: $$Procedure_Start_Date.
Aggregation type. This entry creates specific functionality for the variable and determines how it stores
data. For example, with an aggregation type of Max, the value stored in the repository at the end of each
session run would be the maximum value across ALL records until the value is deleted.
Initial value. This value is used during the first session run when there is no corresponding and overriding
parameter file. This value is also used if the stored repository value is deleted. If no initial value is identified,
then a data-type specific default value is used.
Variable values are not stored in the repository when the session:
Fails to complete.
Is configured for a test load.
Is a debug session.
Runs in debug mode and is configured to discard session output.
Order of Evaluation
The start value is the value of the variable at the start of the session. The start value can be a value defined in the
parameter file for the variable, a value saved in the repository from the previous run of the session, a user-defined
initial value for the variable, or the default value based on the variable data type. The Integration Service looks for
the start value in the following order:
1. Value in the parameter file
2. Value saved in the repository from the previous run
3. Initial value defined in the mapping
4. Default value based on the variable data type
A parameter file can contain headings for multiple sessions; each heading is followed by the parameters and variables for that session:
[folder_name.session_name]
parameter_name=value
variable_name=value
mapplet_name.parameter_name=value
[folder2_name.session_name]
parameter_name=value
variable_name=value
mapplet_name.parameter_name=value
Specify headings in any order. Place headings in any order in the parameter file. However, if defining the
same parameter or variable more than once in the file, the Integration Service assigns the parameter or
variable value using the first instance of the parameter or variable.
Specify parameters and variables in any order. Below each heading, the parameters and variables can
be specified in any order.
When defining parameter values, do not use unnecessary line breaks or spaces. The Integration
Service may interpret additional spaces as part of the value.
List all necessary mapping parameters and variables. Values entered for mapping parameters and
variables become the start value for parameters and variables in a mapping. Mapping parameter and
variable names are not case sensitive.
List all session parameters. Session parameters do not have default values. An undefined session
parameter can cause the session to fail. Session parameter names are not case sensitive.
Use correct date formats for datetime values. When entering datetime values, use the following date
formats:
MM/DD/RR
MM/DD/RR HH24:MI:SS
MM/DD/YYYY
MM/DD/YYYY HH24:MI:SS
Do not enclose parameter or variable values in quotes in the parameter file. The Integration Service interprets everything after the equal sign as part of the value.
Do enclose parameters in single quotes in SQL overrides. In a Source Qualifier SQL override, use single quotes around the parameter if it represents a string or date/time value.
Precede parameters and variables created in mapplets with the mapplet name as follows:
mapplet_name.parameter_name=value
mapplet2_name.variable_name=value
SETMAXVARIABLE($$Post_Date,DATE_ENTERED)
The function evaluates each value for DATE_ENTERED and updates the variable with the Max value to be passed
forward. For example:
DATE_ENTERED   Resultant POST_DATE
9/1/2000       9/1/2000
10/30/2001     10/30/2001
9/2/2000       10/30/2001
The first time this mapping is run, the SQL will select from the source where Date_Entered is > 01/01/1900
providing an initial load. As data flows through the mapping, the variable gets updated to the Max Date_Entered it
encounters. Upon successful completion of the session, the variable is updated in the repository for use in the next
session run. To view the current value for a particular variable associated with the session, right-click on the session in the Workflow Manager and choose View Persistent Values.
The following graphic shows that after the initial run, the Max Date_Entered was 02/03/1998. The next time this
session is run, based on the variable in the Source Qualifier Filter, only sources where Date_Entered > 02/03/1998
will be processed.
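For illustration only (assuming an Oracle source and a date format mask that matches how the variable value is expanded), the Source Qualifier source filter that drives this incremental behavior might look like:
DATE_ENTERED > TO_DATE('$$Post_Date', 'MM/DD/YYYY HH24:MI:SS')
Because the variable is expanded as plain text before the SQL is issued to the database, the single quotes are required, as noted in the parameter file rules above.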
To reset the persistent value to the initial value declared in the mapping, view the persistent value from Workflow
Manager (see graphic above) and press Delete Values. This deletes the stored value from the repository, causing
the Order of Evaluation to use the Initial Value declared from the mapping.
If a session run is needed for a specific date, use a parameter file. There are two basic ways to accomplish this:
Create a generic parameter file, place it on the server, and point all sessions to that parameter file. A
session may (or may not) have a variable, and the parameter file need not have variables and parameters
defined for every session using the parameter file. To override the variable, either change, uncomment, or
delete the variable in the parameter file.
Run pmcmd for that session, but declare the specific parameter file within the pmcmd command.
The next graphic shows the parameter filename and location specified in the Workflow.
In this example, after the initial session is run, the parameter file contents may look like:
[Test.s_Incremental]
;$$Post_Date=
By using the semicolon, the variable override is ignored and the Initial Value or Stored Value is used. If, in the
subsequent run, the data processing date needs to be set to a specific date (for example: 04/21/2001), then a
simple Perl script or manual change can update the parameter file to:
[Test.s_Incremental]
$$Post_Date=04/21/2001
Upon running the sessions, the order of evaluation looks to the parameter file first, sees a valid variable and value
and uses that value for the session run. After successful completion, run another script to reset the parameter file.
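A minimal sketch of such scripts, assuming the parameter file is named incremental.parm (the file name and the date are placeholders):
perl -pi -e 's|^;?\$\$Post_Date=.*|\$\$Post_Date=04/21/2001|' incremental.parm
perl -pi -e 's|^\$\$Post_Date=.*|;\$\$Post_Date=|' incremental.parm
The first command sets the override date before the run; the second re-comments the variable afterward so that the stored repository value is used again.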
Consider five source database instances that contain the same sales order data, each with its own schema, table name, and login:
DB Instance   Schema     Table        User    Password
ORC1          aardso     orders       Sam     max
ORC99         environ    orders       Help    me
HALC          hitme      order_done   Hi      Lois
UGLY          snakepit   orders       Punch   Judy
GORF          gmer       orders       Brer    Rabbit
Each sales order table has a different name, but the same definition:
ORDER_ID        NUMBER (28)    NOT NULL,
DATE_ENTERED    DATE           NOT NULL,
DATE_PROMISED   DATE           NOT NULL,
DATE_SHIPPED    DATE           NOT NULL,
EMPLOYEE_ID     NUMBER (28)    NOT NULL,
CUSTOMER_ID     NUMBER (28)    NOT NULL,
SALES_TAX_RATE  NUMBER (5,4)   NOT NULL,
STORE_ID        NUMBER (28)    NOT NULL
Sample Solution
Using Workflow Manager, create multiple relational connections. In this example, the strings are named according
to the DB Instance name. Using Designer, create the mapping that sources the commonly defined table. Then
create a Mapping Parameter named $$Source_Schema_Table with the following attributes:
Note that the parameter attributes vary based on the specific environment. Also, the initial value is not required
since this solution uses parameter files.
Open the Source Qualifier and use the mapping parameter in the SQL Override as shown in the following graphic.
Open the Expression Editor and select Generate SQL. The generated SQL statement shows the columns.
Override the table names in the SQL statement with the mapping parameter.
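As a sketch (the exact column list comes from the generated SQL), the overridden statement would take roughly this form:
SELECT ORDER_ID, DATE_ENTERED, DATE_PROMISED, DATE_SHIPPED, EMPLOYEE_ID, CUSTOMER_ID, SALES_TAX_RATE, STORE_ID
FROM $$Source_Schema_Table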
Using Workflow Manager, create a session based on this mapping. Within the Source Database connection dropdown box, choose the following parameter:
$DBConnection_Source.
Point the target to the corresponding target and finish.
Now create the parameter files. In this example, there are five separate parameter files.
Parmfile1.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=aardso.orders
$DBConnection_Source=ORC1
Parmfile2.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=environ.orders
$DBConnection_Source=ORC99
Parmfile3.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=hitme.order_done
$DBConnection_Source=HALC
Parmfile4.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=snakepit.orders
$DBConnection_Source=UGLY
Parmfile5.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=gmer.orders
$DBConnection_Source=GORF
Use pmcmd to run the five sessions in parallel. The syntax for starting a workflow with a particular parameter file is as follows:
pmcmd startworkflow -uv USERNAME -pv PASSWORD -s SALES:6258 -f east -w wSalesAvg -paramfile '\$PMRootDir/myfile.txt'
For Windows command prompt users, the parameter file name cannot have beginning or trailing spaces. If the name includes spaces, enclose the file name in double quotes:
-paramfile "$PMRootDir/myfile.txt"
In the event that it is necessary to run the same workflow with different parameter files, use the following five
separate commands:
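As a sketch only (the workflow name wf_Incremental_SOURCE_CHANGES is a placeholder, and the connect arguments are carried over from the syntax shown above), the five invocations might look like:
pmcmd startworkflow -uv USERNAME -pv PASSWORD -s SALES:6258 -f Test -w wf_Incremental_SOURCE_CHANGES -paramfile '\$PMRootDir/Parmfile1.txt'
pmcmd startworkflow -uv USERNAME -pv PASSWORD -s SALES:6258 -f Test -w wf_Incremental_SOURCE_CHANGES -paramfile '\$PMRootDir/Parmfile2.txt'
pmcmd startworkflow -uv USERNAME -pv PASSWORD -s SALES:6258 -f Test -w wf_Incremental_SOURCE_CHANGES -paramfile '\$PMRootDir/Parmfile3.txt'
pmcmd startworkflow -uv USERNAME -pv PASSWORD -s SALES:6258 -f Test -w wf_Incremental_SOURCE_CHANGES -paramfile '\$PMRootDir/Parmfile4.txt'
pmcmd startworkflow -uv USERNAME -pv PASSWORD -s SALES:6258 -f Test -w wf_Incremental_SOURCE_CHANGES -paramfile '\$PMRootDir/Parmfile5.txt'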
When values change frequently, the parameter files themselves can be generated dynamically, as noted in the introduction. A dynamically created parameter file follows the same structure, with a placeholder code in place of a literal value:
[folder_name.session_name]
parameter_name= <parameter_code>
variable_name=value
mapplet_name.parameter_name=value
[folder2_name.session_name]
parameter_name= <parameter_code>
variable_name=value
mapplet_name.parameter_name=value
In place of the text <parameter_code> one could place the text filename_<timestamp>.dat. The mapping would
then perform a string replace wherever the text <timestamp> occurred and the output might look like:
Src_File_Name=filename_20080622.dat
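A rough sketch of how the mapping might perform that replacement in an Expression transformation (the input port name PARM_TEMPLATE is hypothetical):
REPLACESTR(0, PARM_TEMPLATE, '<timestamp>', TO_CHAR(SESSSTARTTIME, 'YYYYMMDD'))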
This method works well when values change often and parameter groupings utilize different parameter sets. The
overall benefits of using this method are such that if many mappings use the same parameter file, changes can be
made by updating the source table and recreating the file. Using this process is faster than manually updating the
file line by line.
Description
On hardware systems that are under-utilized, you may be able to improve performance by processing partitioned data sets in parallel in multiple threads of the same session instance running on the PowerCenter Server engine. However, parallel execution may impair performance on over-utilized systems or systems with smaller I/O capacity.
In addition to hardware, consider these other factors when determining if a session is an ideal candidate for
partitioning: source and target database setup, target type, mapping design, and certain assumptions that are
explained in the following paragraphs. Use the Workflow Manager client tool to implement session partitioning.
Assumptions
The following assumptions pertain to the source and target systems of a session that is a candidate for partitioning.
These factors can help to maximize the benefits that can be achieved through partitioning.
Indexing has been implemented on the partition key when using a relational source.
Source files are located on the same physical machine as the PowerCenter Server process when
partitioning flat files, COBOL, and XML, to reduce network overhead and delay.
All possible constraints are dropped or disabled on relational targets.
All possible indexes are dropped or disabled on relational targets.
Table spaces and database partitions are properly managed on the target system.
Target files are written to same physical machine that hosts the PowerCenter process in order to reduce
network overhead and delay.
Oracle External Loaders are utilized whenever possible.
First, determine if you should partition your session. Parallel execution benefits systems that have the following
characteristics:
Check idle time and busy percentage for each thread. This provides high-level information about the bottleneck point or points. To do this, open the session log and look for messages starting with PETL_ under the RUN INFO FOR TGT LOAD ORDER GROUP section. These PETL messages give the following details for the reader, transformation, and writer threads:
Total Run Time
Total Idle Time
Busy Percentage
Under-utilized or intermittently-used CPUs. To determine if this is the case, check the CPU usage of your machine. If there are CPU cycles available (i.e., twenty percent or more idle time), then this session's performance may be improved by adding a partition.
Windows 2000/2003 - check the Task Manager Performance tab.
UNIX - type vmstat 1 10 on the command line. The id column displays the percentage of time the CPU was idle during the specified interval, excluding I/O wait.
Sufficient I/O. To determine the I/O statistics:
Windows 2000/2003 - check the Task Manager Performance tab.
UNIX - type iostat on the command line. The %iowait column displays the percentage of CPU time spent idling while waiting for I/O requests. The %idle column displays the total percentage of the time that the CPU spends idling (i.e., the unused capacity of the CPU).
Sufficient memory. If too much memory is allocated to your session, you will receive a memory allocation error. Check to see that you are using as much memory as you can. If the session is paging, increase the memory. To determine if the session is paging:
Windows 2000/2003 - check the Task Manager Performance tab.
UNIX - type vmstat 1 10 on the command line. The pi column displays the number of pages swapped in from the page space during the specified interval, and the po column displays the number of pages swapped out to the page space during the specified interval. If these values indicate that paging is occurring, it may be necessary to allocate more memory, if possible.
If you determine that partitioning is practical, you can begin setting up the partition.
Partition Types
PowerCenter provides increased control of the pipeline threads. Session performance can be improved by adding
partitions at various pipeline partition points. When you configure the partitioning information for a pipeline, you
must specify a partition type. The partition type determines how the PowerCenter Server redistributes data across
partition points. The Workflow Manager allows you to specify the following partition types:
Round-robin Partitioning
The PowerCenter Server distributes data evenly among all partitions. Use round-robin partitioning when you need
to distribute rows evenly and do not need to group data among partitions.
In a pipeline that reads data from file sources of different sizes, use round-robin partitioning. For example, consider
a session based on a mapping that reads data from three flat files of different sizes.
Source file 1: 100,000 rows
Source file 2: 5,000 rows
Source file 3: 20,000 rows
In this scenario, the recommended best practice is to set a partition point after the Source Qualifier and set the
partition type to round-robin. The PowerCenter Server distributes the data so that each partition processes
approximately one third of the data.
Hash Partitioning
The PowerCenter Server applies a hash function to a partition key to group data among partitions.
Use hash partitioning when you want to ensure that the PowerCenter Server processes groups of rows with the same partition key in the same partition. For example, you might need to sort items by item ID but not know how many items have a particular ID number. If you select hash auto-keys, the PowerCenter Server uses all grouped or sorted ports as the partition key. If you select hash user keys, you specify a number of ports to form the partition key.
An example of this type of partitioning is when you are using Aggregators and need to ensure that groups of data
based on a primary key are processed in the same partition.
Pass-through Partitioning
In this type of partitioning, the PowerCenter Server passes all rows at one partition point to the next partition point
without redistributing them.
Use pass-through partitioning where you want to create an additional pipeline stage to improve performance, but
do not want to (or cannot) change the distribution of data across partitions. The Data Transformation Manager
spawns a master thread on each session run, which in turn creates three threads (reader, transformation, and
writer threads) by default. Each of these threads can, at the most, process one data set at a time and hence, three
data sets simultaneously. If there are complex transformations in the mapping, the transformation thread may take
a longer time than the other threads, which can slow data throughput.
It is advisable to define partition points at these transformations. This creates another pipeline stage and reduces
the overhead of a single transformation thread.
When you have considered all of these factors and selected a partitioning strategy, you can begin the iterative
process of adding partitions. Continue adding partitions to the session until you meet the desired performance
threshold or observe degradation in performance.
External loaders. You can only use Oracle external loaders for partitioning. Refer to the Session and Server Guide for more information on using and setting up the Oracle external loader for partitioning.
Write throughput. Check the session statistics to see if you have increased the write throughput.
Paging. Check to see if the session is now causing the system to page. When you partition a session and
there are cached lookups, you must make sure that DTM memory is increased to handle the lookup caches.
When you partition a source that uses a static lookup cache, the PowerCenter Server creates one memory
cache for each partition and one disk cache for each transformation. Thus, memory requirements grow for
each partition. If the memory is not bumped up, the system may start paging to disk, causing degradation in
performance.
When you finish partitioning, monitor the session to see if the partition is degrading or improving session performance. If the session performance is improved and the session meets your requirements, add another partition.
Dynamic Partitioning
Dynamic partitioning is also called parameterized partitioning because a single parameter can determine the number of partitions. With the Session on Grid option, more partitions can be added when more resources are available. The number of partitions in a session can also be tied to partitions in the database, making it easier to keep PowerCenter partitioning aligned with database partitioning.
Description
PowerCenter with real-time option can be used to process data from real-time data sources. PowerCenter supports
the following types of real-time data:
Messages and message queues. PowerCenter with the real-time option can be used to integrate third-party messaging applications using a specific PowerExchange data access product. Each PowerExchange product supports a specific industry-standard messaging application, such as WebSphere MQ, JMS, MSMQ, SAP NetWeaver, TIBCO, and webMethods. You can read from messages and message queues and write to messages, messaging applications, and message queues. WebSphere MQ uses a queue to store and exchange data. Other applications, such as TIBCO and JMS, use a publish/subscribe model. In this case, the message exchange is identified using a topic.
Web service messages. PowerCenter can receive a web service message from a web service client
through the Web Services Hub, transform the data, and load the data to a target or send a message back to
a web service client. A web service message is a SOAP request from a web service client or a SOAP
response from the Web Services Hub. The Integration Service processes real-time data from a web service
client by receiving a message request through the Web Services Hub and processing the request. The
Integration Service can send a reply back to the web service client through the Web Services Hub or write
the data to a target.
Changed source data. PowerCenter can extract changed data in real time from a source table using the
PowerExchange Listener and write data to a target. Real-time sources supported by PowerExchange are
ADABAS, DATACOM, DB2/390, DB2/400, DB2/UDB, IDMS, IMS, MS SQL Server, Oracle and VSAM.
Connection Setup
PowerCenter uses some attribute values in order to correctly connect and identify the third-party messaging
application and message itself. Each PowerExchange product supplies its own connection attributes that need to
be configured properly before running a real-time session.
Message Count - Controls the number of messages the PowerCenter Server reads from the source before
the session stops reading from the source.
Idle Time - Indicates how long the PowerCenter Server waits when no messages arrive before it stops
reading from the source.
Time Slice Mode - Indicates a specific range of time during which the server reads messages from the source. Only PowerExchange for WebSphere MQ uses this option.
Reader Time Limit - Indicates the number of seconds the PowerCenter Server spends reading messages
from the source.
The specific filter conditions and options available to you depend on which Real-Time source is being used. For
example -Attributes for PowerExchange for DB2 for i5/OS:
Set the attributes that control how the reader ends. One or more attributes can be used to control the end of
session.
For example, if the Reader Time Limit attribute is set to 3600, the reader ends after 3600 seconds. If the Idle Time limit is set to 500 seconds, the reader ends if it doesn't process any changes for 500 seconds (i.e., it remains idle for 500 seconds).
If more than one attribute is selected, the first attribute that satisfies the condition is used to control the end of
session.
Note: The real-time attributes can be found in the Reader Properties for PowerExchange for JMS, TIBCO, webMethods, and SAP iDoc. For PowerExchange for WebSphere MQ, the real-time attributes must be specified as a filter condition.
The next step is to set the Real-time Flush Latency attribute. The Flush Latency defines how often PowerCenter should flush messages, expressed in milliseconds.
For example, if the Real-time Flush Latency is set to 2000, PowerCenter flushes messages every two seconds.
The messages will also be flushed from the reader buffer if the Source Based Commit condition is reached. The
Source Based Commit condition is defined in the Properties tab of the session.
The message recovery option can be enabled to ensure that no messages are lost if a session fails as a result of an unpredictable error, such as power loss. This is especially important for real-time sessions because some messaging applications do not store messages after they are consumed by another application.
A unit of work (UOW) is a collection of changes within a single commit scope made by a transaction on the source
system from an external application. Each UOW may consist of a different number of rows depending on the
transaction to the source system. When you use the UOW Count Session condition, the Integration Service
commits source data to the target when it reaches the number of UOWs specified in the session condition.
For example, if the value for UOW Count is 10, the Integration Service commits all data read from the source after
the 10th UOW enters the source. The lower you set the value, the faster the Integration Service commits data to
the target. The lower value also causes the system to consume more resources.
The result may or may not be correct depending on the requirement. Use an active transformation with a real-time session if you want to process the data per transaction.
Custom transformations can also be defined to handle data per transaction so that they can be used in a real-time
session.
JMSAdmin.config Settings:
INITIAL_CONTEXT_FACTORY
PROVIDER_URL
PROVIDER_USERDN (JNDI UserName)
PROVIDER_PASSWORD (JNDI Password)
The JMS connection is defined using a tool in JMS called jmsadmin, which is available in the WebSphere MQ Java
installation/bin directory. Use this tool to configure the JMS Connection Factory.
The JMS Connection Factory can be a Queue Connection Factory or Topic Connection Factory.
When a Queue Connection Factory is used, define a JMS queue as the destination.
When a Topic Connection Factory is used, define a JMS topic as the destination.
The command to define a queue connection factory (qcf) is:
def qcf(<qcf_name>) qmgr(queue_manager_name)
hostname (QM_machine_hostname) port (QM_machine_port)
The command to define JMS queue is:
def q(<JMS_queue_name>) qmgr(queue_manager_name) qu(queue_manager_queue_name)
The command to define JMS topic connection factory (tcf) is:
def tcf(<tcf_name>) qmgr(queue_manager_name)
hostname (QM_machine_hostname) port (QM_machine_port)
The command to define the JMS topic is:
def t(<JMS_topic_name>) topic(pub/sub_topic_name)
The topic name must be unique. For example: topic (application/infa)
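For illustration only (the queue manager, host, and object names below are placeholders), a complete set of definitions following the syntax above might look like:
def qcf(infaQCF) qmgr(QM1) hostname(mqhost01) port(1414)
def q(infaQueue) qmgr(QM1) qu(PC.INPUT.QUEUE)
def tcf(infaTCF) qmgr(QM1) hostname(mqhost01) port(1414)
def t(infaTopic) topic(application/infa)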
The following table shows the JMS object types and the corresponding attributes in the JMS application connection
in the Workflow Manager:
QueueConnectionFactory or
TopicConnectionFactory
JMS Destination
INITIAL_CONTEXT_FACTORY=com.ibm.websphere.naming.wsInitialContextFactory
PROVIDER_URL=iiop://<hostname>/
For example:
PROVIDER_URL=iiop://localhost/
PROVIDER_USERDN=cn=informatica,o=infa,c=rc
PROVIDER_PASSWORD=test
JMS Connection
The JMS configuration is similar to the JMS Connection for WebSphere MQ.
JMS Destination
In addition to the JNDI and JMS settings, BEA WebLogic also offers a function called JMS Store, which can be used for persistent messaging when reading and writing JMS messages. The JMS Stores configuration is available from the Console pane: select Services > JMS > Stores under your domain.
Before configuring transports, ensure that Rendezvous transports are enabled in the TIBCO Enterprise for JMS server configuration:
tibrv_transports = enabled
1. Enter the following transports in the transports.conf file:
[RV]
type = tibrv // type of external messaging system
topic_import_dm = TIBJMS_RELIABLE // only reliable/certified messages can transfer
daemon = tcp:localhost:7500 // default daemon for the Rendezvous server
The transports in the transports.conf configuration file specify the communication protocol between TIBCO
Enterprise for JMS and the TIBCO Rendezvous system. The import and export properties on a destination
can list one or more transports to use to communicate with the TIBCO Rendezvous system.
2. Optionally, specify the name of one or more transports for reliable and certified message delivery in the export property in the topics.conf file.
The export property allows messages published to a topic by a JMS client to be exported to the external systems
with configured transports. Currently, you can configure transports for TIBCO Rendezvous reliable and certified
messaging protocols.
Use crpc23232 instead of crpc23232.crp.informatica.com as the host name when importing webMethods source
definition. This step is only required for importing PowerExchange for webMethods sources into the Designer.
If you are using the request/reply model in webMethods, PowerCenter needs to send an appropriate document
back to the broker for every document it receives. PowerCenter populates some of the envelope fields of the
webMethods target to enable webMethods broker to recognize that the published document is a reply from
PowerCenter. The envelope fields destid and tag are populated for the request/reply model. Destid should be
populated from the pubid of the source document and tag should be populated from tag of the source
document. Use the option Create Default Envelope Fields when importing webMethods sources and targets into
the Designer in order to make the envelope fields available in PowerCenter.
If you are using multiple request/reply document pairs, you need to set up different webMethods connections for each pair because they cannot share a client ID.
Description
A logical view of the MDM Hub, the data flow through the Hub, and the physical architecture of the Hub are
described in the following sections.
Logical View
A logical view of the MDM Hub is shown below:
The Hub supports access to data in the form of batch, real-time, and/or asynchronous messaging. Typically, this access is supported through a combination of data integration tools, such as Informatica PowerCenter, and embedded Hub functionality. In order to master the data in the Hub optimally, the source data needs to be analyzed. This analysis typically takes place using a data quality tool, such as Informatica Data Quality.
The goal of the Hub is to master data for one or more domains within a customer's environment. In the MDM Hub, a significant amount of metadata is maintained in order to support data mastering functionality, such as lineage, history, survivorship, and the like. The MDM Hub data model is completely flexible: it can start from a customer's existing model or an industry-standard model, or a model may be created from scratch.
Once the data model has been defined, data needs to be cleansed and standardized. The MDM Hub provides an open architecture that allows a customer to use any cleanse engine they may already have in place, and it provides an optimized interface for Informatica Data Quality.
Data is then matched in the system using a combination of deterministic and fuzzy matching. Informatica Identity Recognition is the underlying match technology in the Hub; its interfaces have been optimized for Hub use and abstracted so that they are easily leveraged by business users.
After matching has been performed, the Hub can consolidate records by linking them together to produce a registry
of related records or by merging them to produce a Golden Record or a Best Version of the Truth (BVT). When a
BVT is produced, survivorship rules defined in the MDM trust framework are applied such that the appropriate
attributes from the contributing source records are promoted into the BVT.
The BVT provides a basis for identifying and managing relationships across entities and sources. By building on top of the BVT, the MDM Hub can expose relationships that span sources or entities and are not visible within an individual source.
A data governance framework is exposed to data stewards through the Informatica Data Director (IDD). IDD
provides data governance task management functionality, rudimentary data governance workflows, and data
steward views of the data. If more complex workflows are required, external workflow engines can be easily
integrated into the Hub. Individual views of data from within the IDD can also be exposed directly into applications
through Informatica Data Controls.
There is an underlying security framework within the MDM Hub that provides fine-grained control over access to data within the Hub. The framework supports configuration of the security policies locally, or by consuming them from external sources, based on a customer's desired infrastructure.
Data Flow
A typical data flow through the Hub is shown below:
Implementations of the MDM hub start by defining the data model into which all of the data will be consolidated.
This target data model will contain the BVT and the associated metadata to support it. Source data is brought into
the hub by putting it into a set of Landing Tables. A Landing Table is a representation of the data source in the
general form of the source. There is an equivalent table known as a Staging Table, which represents the source
data, but in the format of the Target Data model. Therefore, data needs to be transformed from the Landing Table
to the Staging table, and this happens within the MDM Hub as follows:
1. The incoming data is run through a Delta Detection process to determine if it has changed since the last
time it was processed. Only records that have changed are processed.
2. Records are run through a staging process which transforms the data to the form of the Target Model. The
staging process is a mapping within the MDM Hub which may perform any number of standardization,
cleansing or transformation processes. The mappings also allow for external cleanse engines to be invoked.
3. Records are then loaded into the staging table. The pre-cleansed version of each record is stored in a RAW table, and records that are inappropriate to stage (for example, because of structural deficiencies such as a duplicate PKEY) are written to a REJECT table to be manually corrected at a later time.
The data in the Staging Table is then loaded into the Base Objects. This process first applies trust scores to the attributes for which trust has been defined. Trust scores represent the relative survivorship of an attribute and are calculated at the time the record is loaded, based on the currency of the data, the data source, and other characteristics of the attribute.
Records are then pushed through a matching process which generates a set of candidates for merging. Depending
on which match rules caused a record to match, the record will be queued either for automatic merging or for
manual merging. Records that do not match will be loaded into the Base Object as unique records. Records
queued for automatic merge will be processed by the Hub without human intervention; those queued for manual
merge will be displayed to a Data Steward for further processing.
All data in the hub is available for consumption as a batch, as a set of outbound asynchronous messages or
through a real-time services interface.
Physical Architecture
The MDM Hub is designed as a three-tier architecture. These tiers consist of the MDM Hub Store, the MDM Hub Server(s) (including the Cleanse-Match Servers), and the MDM User Interface.
The Hub Store is where business data is stored and consolidated. The Hub Store contains common information
about all of the databases that are part of an MDM Hub implementation. It resides in a supported database server
environment. The Hub Server is the run-time component that manages core and common services for the MDM
Hub. The Hub Server is a J2EE application, deployed on a supported application server that orchestrates the data
processing within the Hub Store, as well as integration with external applications. Refer to the latest Product
Availability Matrix for which versions of databases, application servers, and operating systems are currently
supported for the MDM Hub.
The Hub may be implemented in either a standard architecture or in a high availability architecture. In order to
achieve high availability, Informatica recommends the configuration shown below:
This configuration employs a properly sized database server and application server(s). The database server is configured as multiple DB cluster nodes, and the database is distributed on a SAN architecture. The application server requires
sufficient file space to support efficient match batch group sizes. Refer to the MDM Sizing Guidelines to properly
size each of these tiers.
Database redundancy is provided through the use of the database cluster, and application server redundancy is
provided through application server clustering.
To support geographic distribution, the HA architecture described above is replicated in a second node, with
failover provided using a log replication approach. This configuration is intended to support Hot/Warm or Hot/Cold
environments, but does not support Hot/Hot operation.
Description
Use Case Scenarios
Message Queue Processing
When data is read from a message queue, the data values in the queue can be used to determine which source
data to process and which targets to load the processed data. In this scenario, different instances of the same
workflow should run concurrently and pass different connection parameters to the instances of the workflow
depending on the parameters read from the message queue. One example is a hosted data warehouse for 120
financial institutions where it is necessary to execute workflows for all the institutions in a small time frame.
Web Services
Different consumers of a web service need the capability to launch workflows to extract data from different external
systems and integrate it with internal application data. Each instance of the workflow can accept different
parameters to determine where to extract the data from and where to load the data to. For example, the Web
Services Hub needs to execute multiple instances of the same web service workflow when web services requests
increase.
Target contention is one concern when running Concurrent Workflows. If the target is a database system, database partitioning can be used to prevent contention issues when inserting into or updating the same table. When database partitioning is used, concurrent writes to the same table are less likely to encounter deadlock issues.
Competing resources such as lookups are another source of concern that should be addressed when running
Concurrent Workflows. Lookup caches as well as log files should be exclusive for concurrent workflows to avoid
contention.
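One common approach, sketched here with hypothetical folder, session, and connection names, is to give each concurrent instance its own parameter file so that connection values and session log files never collide:
inst1.parm
[Finance.s_Load_Institution]
$DBConnection_Source=BANK001
$PMSessionLogFile=d:/session logs/institution001.txt
inst2.parm
[Finance.s_Load_Institution]
$DBConnection_Source=BANK002
$PMSessionLogFile=d:/session logs/institution002.txt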
Partitioning should also be considered. Mapping Partitioning or data partitioning is not impacted by the Concurrent
Workflow feature and can be used with minimal impact.
On the other hand, parameter files should be created dynamically for the dynamic concurrent workflow option. This requires the development of a methodology to generate the parameter files at run time. A database-driven option can be used for maintaining the parameters in database tables. During the execution of the Concurrent Workflows, the parameter files can be generated from the database, as sketched below.
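A minimal sketch of that database-driven option, assuming a hypothetical ETL_PARMS table with WORKFLOW_INSTANCE, PARM_NAME, PARM_VALUE, and PARM_ORDER columns:
-- one row per parameter; a wrapper script writes the [folder.session] heading, then these lines
SELECT PARM_NAME || '=' || PARM_VALUE
FROM ETL_PARMS
WHERE WORKFLOW_INSTANCE = 'inst1'
ORDER BY PARM_ORDER;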