Best Practices
Configuring Security
Data Analyzer Security
Database Sizing
Deployment Groups
Migration Procedures - PowerCenter
Migration Procedures - PowerExchange
Running Sessions in Recovery Mode
Using PowerCenter Labels
Deploying Data Analyzer Objects
Installing Data Analyzer

Data Connectivity
  Data Connectivity using PowerCenter Connect for BW Integration Server
  Data Connectivity using PowerCenter Connect for MQSeries
  Data Connectivity using PowerCenter Connect for SAP
  Data Connectivity using PowerCenter Connect for Web Services

Data Migration
  Data Migration Principles
  Data Migration Project Challenges
  Data Migration Velocity Approach
  Build Data Audit/Balancing Processes
  Data Cleansing
  Data Profiling

Data Quality Mapping Rules
Data Quality Project Estimation and Scheduling Factors
Effective Data Matching Techniques
Effective Data Standardizing Techniques
Managing Internal and External Reference Data
Testing Data Quality Plans
Tuning Data Quality Plans
Using Data Explorer for Data Discovery and Analysis
Working with Pre-Built Plans in Data Cleanse and Match

Development Techniques
  Designing Data Integration Architectures
  Development FAQs
  Event Based Scheduling
  Key Management in Data Warehousing Solutions
  Mapping Design
  Mapping Templates
  Naming Conventions
  Performing Incremental Loads
  Real-Time Integration with PowerCenter
  Session and Data Partitioning
  Using Parameters, Variables and Parameter Files
  Using PowerCenter with UDB
  Using Shortcut Keys in PowerCenter Designer
  Working with JAVA Transformation Object

Error Handling
  Error Handling Process
  Error Handling Strategies - Data Warehousing
  Error Handling Strategies - General
  Error Handling Techniques - PowerCenter Mappings
  Error Handling Techniques - PowerCenter Workflows and Data Analyzer

Planning the ICC Implementation
Selecting the Right ICC Model
Creating Inventories of Reusable Objects & Mappings
Metadata Reporting and Sharing
Repository Tables & Metadata Management
Using Metadata Extensions
Using PowerCenter Metadata Manager and Metadata Exchange Views for Quality Assurance
Configuring Standard XConnects
Custom XConnect Implementation
Customizing the Metadata Manager Interface
Estimating Metadata Manager Volume Requirements
Metadata Manager Load Validation
Metadata Manager Migration Procedures
Metadata Manager Repository Administration
Upgrading Metadata Manager

Operations
  Daily Operations
  Data Integration Load Traceability
  High Availability
  Load Validation
  Repository Administration
  Third Party Scheduler
  Updating Repository Statistics

Determining Bottlenecks
Performance Tuning Databases (Oracle)
Performance Tuning Databases (SQL Server)
Performance Tuning Databases (Teradata)
Performance Tuning UNIX Systems
Performance Tuning Windows 2000/2003 Systems
Recommended Performance Tuning Procedures
Tuning and Configuring Data Analyzer and Data Analyzer Reports
Tuning Mappings for Better Performance
Tuning Sessions for Better Performance
Tuning SQL Overrides and Environment for Better Performance
Using Metadata Manager Console to Tune the XConnects

PowerCenter Configuration
  Advanced Client Configuration Options
  Advanced Server Configuration Options
  Causes and Analysis of UNIX Core Files
  Domain Configuration
  Managing Repository Size
  Organizing and Maintaining Parameter Files & Variables
  Platform Sizing
  PowerCenter Admin Console
  Understanding and Setting UNIX Resources for PowerCenter Installations

PowerExchange Configuration
  PowerExchange CDC for Oracle
  PowerExchange Installation (for Mainframe)

Project Management
  Assessing the Business Case
  Defining and Prioritizing Requirements
  Developing a Work Breakdown Structure (WBS)
  Developing and Maintaining the Project Plan
  Developing the Business Case
  Managing the Project Lifecycle
  Using Interviews to Determine Corporate Data Integration Requirements

Upgrades
  Upgrading Data Analyzer
  Upgrading PowerCenter
  Upgrading PowerExchange

INFORMATICA CONFIDENTIAL
BEST PRACTICE
Description
Security is an often-overlooked area of the Informatica ETL domain, yet repository security is a crucial component of ETL code management. Determining an optimal security configuration for a PowerCenter environment requires a thorough understanding of business requirements, data content, and end-user access requirements. Knowledge of PowerCenter's security functionality and facilities is also a prerequisite to security design. Implement security with the goals of easy maintenance and scalability. When establishing repository security, keep it simple: although PowerCenter includes the utilities for a complex web of security, the simpler the configuration, the easier it is to maintain. Securing the PowerCenter environment involves the following basic principles:
- Create users and groups
- Define access requirements
- Grant privileges and permissions
Before implementing security measures, ask and answer the following questions:
- Who will administer the repository?
- How many projects need to be administered? Will the administrator be able to manage security for all PowerCenter projects or just a select few?
- How many environments will be supported in the repository?
- Who needs access to the repository? What do they need the ability to do?
- How will the metadata be organized in the repository? How many folders will be required?
- Where can we limit repository privileges by granting folder permissions instead?
- Who will need Administrator or Super User-type access?
After you evaluate the needs of the repository users, you can create appropriate user groups and assign repository privileges and folder permissions. In most implementations, the administrator maintains the repository. Limit the number of administrator accounts for PowerCenter; while this is important in a development/unit test environment, it is critical for protecting the production environment.
Informatica offers multiple layers of security, which enables you to customize security within your data warehouse environment. Metadata-level security controls access to PowerCenter repositories, which contain objects grouped by folders. Access to metadata is determined by the privileges granted to the user or to a group of users and by the access permissions granted on each folder. Some privileges do not apply by folder; they are granted by privilege alone (i.e., repository-level tasks).

Just beyond PowerCenter authentication is the connection to the repository database. All client connectivity to the repository is handled by the PowerCenter Repository Service over a TCP/IP connection. The database account and password are specified at installation and during the configuration of the Repository Service. Developers do not need to know this database account and password; they should use only their individual repository user IDs and passwords. This information should be restricted to the administrator.

Other forms of security available in PowerCenter include permissions for connections. Connections include database, FTP, and external loader connections. These permissions are useful when you want to limit access to schemas in a relational database, and can be set up in the Workflow Manager when source and target connections are defined.

Occasionally, you may want to restrict changes to source and target definitions in the repository. A common way to approach this is to use shared folders owned by an Administrator or Super User. Granting read access to developers on these folders allows them to create read-only copies in their work folders.
Enter the password environment variable in the Variable field, and enter the encrypted password in the Value field.
Users
Users are the fundamental objects of security in a PowerCenter environment. Each individual logging into the PowerCenter repository should have a unique user account; Informatica does not recommend shared accounts. Each repository user needs a user name and password, provided by the PowerCenter Repository Administrator, to access the repository. Users are created and managed through the Repository Manager. Users should change their passwords from the default immediately after receiving their initial user IDs from the Administrator. Passwords can be reset by users who are granted the Use Repository Manager privilege. When you create the repository, the repository automatically creates two default users:
- Administrator. The default password for Administrator is Administrator.
- Database user. The user name and password used when you created the repository.
These default users are in the Administrators user group, with full privileges within the repository. They cannot be deleted from the repository, nor can their group affiliation be changed. To administer repository users, you must have one of the following privileges:

- Administer Repository
- Super User
With LDAP authentication, user logins are validated against the external directory. The repository maintains a status for each user; users can be enabled or disabled by modifying this status. Prior to implementing LDAP, the administrator must know:
- The Repository Server user name and password
- An administrator or super user name and password for the repository
- An external login name and password
To configure LDAP, follow these steps:

1. Edit ldap_authen.xml and modify the following attributes:
   - NAME: the .dll that implements the authentication
   - OSTYPE: the host operating system
2. Register ldap_authen.xml in the Repository Server Administration Console.
3. In the Repository Server Administration Console, configure the authentication module.
User Groups
When you create a repository, the Repository Manager creates two repository user groups. These two groups exist so you can immediately create users and begin developing repository objects. These groups cannot be deleted from the repository, nor can their configured privileges be changed. The default repository user groups are:

- Administrators, which has super-user access
- Public, which has a subset of default repository privileges
You should create custom user groups to manage users and repository privileges effectively. The number and types of groups that you create should reflect the needs of your development teams, administrators, and operations group. Informatica recommends minimizing the number of custom user groups that you create in order to facilitate the maintenance process. A starting point is to create a group for each type of combination of privileges needed to support the development cycle and production process. This is the recommended method for assigning privileges. After creating a user group, you assign a set of privileges for that group. Each repository user must be assigned to at least one user group. When you assign a user to a group, the user:
- Receives all group privileges.
- Inherits any changes to group privileges.
- Loses and gains privileges if you change the user's group membership.
You can also assign users to multiple groups, which grants the user the privileges of each group. Use the Repository Manager to create and edit repository user groups.
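The union-of-privileges behavior described above can be sketched as follows. This is an illustration only, not PowerCenter's implementation; the group and privilege names are borrowed from examples elsewhere in this document.

```python
# Sketch: a user assigned to several groups receives the union of the
# privileges granted to each group. Group and privilege names are
# illustrative, not actual PowerCenter metadata.
GROUP_PRIVILEGES = {
    "DEVELOPERS": {"Use Designer", "Browse Repository", "Use Workflow Manager"},
    "OPERATIONS": {"Browse Repository", "Workflow Operator"},
}

def effective_privileges(user_groups):
    """Union of privileges across all groups the user belongs to."""
    privs = set()
    for group in user_groups:
        privs |= GROUP_PRIVILEGES.get(group, set())
    return privs

# A user in both groups holds every privilege from each group.
print(sorted(effective_privileges(["DEVELOPERS", "OPERATIONS"])))
```

Because membership is purely additive, moving a user between groups never silently removes an individually granted privilege; only group-derived privileges change.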
Folder Permissions
When you create or edit a folder, you define permissions for the folder. The permissions can be set at three different levels:

- Owner
- Owner's group
- Repository (all remaining users)

First, choose an owner (i.e., a user) and a group for the folder. If the owner belongs to more than one group, you must select one of the groups listed. Once the folder is defined and the owner is selected, determine what level of permissions to grant to the users within the group, then determine the permission level for the remainder of the repository users.

The permissions that can be set are read, write, and execute; any combination of these can be granted to the owner, group, or repository. Consider folder permissions very carefully: they offer the easiest way to restrict users and/or groups from accessing folders. The following table gives some examples of folders, their types, and recommended ownership.
Folder Name    Folder Type                                          Proposed Owner
DEVELOPER_1    Initial development, temporary work area, unit test  Individual developer
DEVELOPMENT    Integrated development                               Development lead, Administrator or Super User
UAT            User acceptance test                                 UAT lead, Administrator or Super User
                                                                    Administrator or Super User
PRODUCTION     Production                                           Production support lead, Administrator or Super User
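The owner/group/repository resolution described above can be sketched as a small model. This is an illustrative sketch of the resolution order, not PowerCenter's actual implementation; the user, group, and folder names are hypothetical.

```python
# Sketch of PowerCenter-style folder permissions: a folder grants separate
# read/write/execute sets to its owner, to the owner's group, and to the
# rest of the repository. All names below are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Folder:
    owner: str
    group: str
    owner_perms: set = field(default_factory=lambda: {"read", "write", "execute"})
    group_perms: set = field(default_factory=lambda: {"read"})
    repo_perms: set = field(default_factory=set)

def permissions_for(folder, user, user_groups):
    """Resolve a user's folder permissions: owner, then group, then repository."""
    if user == folder.owner:
        return folder.owner_perms
    if folder.group in user_groups:
        return folder.group_perms
    return folder.repo_perms

dev = Folder(owner="dev_lead", group="DEVELOPERS")
print(permissions_for(dev, "dev_lead", []))           # owner level
print(permissions_for(dev, "alice", ["DEVELOPERS"]))  # group level
print(permissions_for(dev, "bob", ["OPERATIONS"]))    # repository level
```

Leaving the repository-level set empty, as here, is the restrictive default this Best Practice recommends: users outside the owner's group see nothing unless explicitly granted access.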
Repository Privileges
Repository privileges work in conjunction with folder permissions to give a user or group authority to perform tasks. Repository privileges are the most granular way of controlling a user's activity. Consider the privileges that each user group requires, as well as folder permissions, when determining the breakdown of users into groups. Informatica recommends creating one group for each distinct combination of folder permissions and privileges. When you assign a user to a user group, the user receives all privileges granted to the group. You can also assign privileges to users individually. When you grant a privilege to an individual user, the user retains that privilege even if his or her user group affiliation changes. For example, suppose you have a user in a Developer group who has limited group privileges, and you want this user to act as a backup administrator when you are not available. For the user to perform every task in every folder in the repository, and to administer the Integration Service, the user must have the Super User privilege. For tighter security, grant the Super User privilege to the individual user, not the entire Developer group. This limits the number of users with the Super User privilege, and ensures that the user retains the privilege even if you remove the user from the Developer group. The Repository Manager grants a default set of privileges to each new user and group for working within the repository. You can add or remove privileges from any user or group except:
- Administrators and Public (the default read-only repository groups)
- Administrator and the database user who created the repository (the users automatically created in the Administrators group)
The Repository Manager automatically grants each new user and new group the default privileges. These privileges allow you to perform basic tasks in Designer, Repository Manager, Workflow Manager, and Workflow Monitor. The following table lists the default repository privileges:
Default Repository Privileges

Use Designer (no folder permission required)
- Connect to the repository using the Designer.
- Configure connection information.

Use Designer (read permission)
- View objects in the folder.
- Change folder versions.
- Create shortcuts to objects in the folder.
- Copy objects from the folder.
- Export objects.

Use Designer (read/write permission)
- Create or edit metadata.
- Create shortcuts from shared folders.
- Copy objects into the folder.
- Import objects.

Browse Repository (no folder permission required)
- Connect to the repository using the Repository Manager.
- Add and remove reports.
- Import, export, or remove the registry.
- Search by keywords.
- Change your user password.
- View dependencies.
- Unlock objects, versions, and folders locked by your username.

Browse Repository (read permission)
- Edit folder properties for folders you own.
- Copy a version. (You must also have the Administer Repository or Super User privilege in the target repository and write permission on the target folder.)
- Copy a folder. (You must also have the Administer Repository or Super User privilege in the target repository.)

Use Workflow Manager (no folder permission required)
- Connect to the repository using the Workflow Manager.
- Create database, FTP, and external loader connections in the Workflow Manager.
- Edit database, FTP, and external loader connections in the Workflow Manager.
- Run the Workflow Monitor.

Use Workflow Manager (read permission)
- Export sessions.
- View workflows, sessions, and tasks.
- View session details and session performance details.

Use Workflow Manager (read/write permission)
- Create and edit workflows and tasks.
- Import sessions.
- Validate workflows and tasks.
- Create and edit sessions.

Use Workflow Manager (read/execute permission)
- View session log.
- Schedule or unschedule workflows.
- Start workflows immediately.

Use Workflow Manager (execute permission)
- Restart, stop, abort, and resume workflows.

Versioning and deployment tasks (folder write permission)
- Check in.
- Check out/undo check-out.
- Delete objects from folder.
- Mass validation (needs write permission if options are selected).
- Recover after delete.
- Export objects.
- Add to deployment group.
- Copy objects.
- Import objects.
- Apply label.
- Remove label references.
- Delete from deployment group.
- Change an object's version comments if not the owner.
- Change the status of an object.
Extended Privileges
In addition to the default privileges listed above, Repository Manager provides extended privileges that you can assign to users and groups. These privileges are granted to the Administrator group by default. The following table lists the extended repository privileges:
Extended Repository Privileges

Administer Repository (no folder permission required)
- Create, upgrade, back up, delete, and restore the repository.
- Manage passwords, users, groups, and privileges.
- Start, stop, enable, disable, and check the status of the repository.
- Check in or undo check out for other users.
- Purge (in a version-enabled repository).
- Disable the Integration Service using the infacmd program.
- Connect to the Integration Service from PowerCenter client applications when running the Integration Service in safe mode.

Super User (no folder permission required)
- Perform all tasks, across all folders in the repository.
- Manage connection object permissions.
- Manage global object permissions.
- Perform mass validate.

Workflow Operator
- Connect to the Integration Service.
- With read permission on a folder: view the session log, view the workflow log, view session details and performance details.
- With execute permission on a folder: abort, restart, resume, and stop workflows; schedule and unschedule workflows; start workflows immediately; use pmcmd to start workflows in folders for which you have execute permission.

Connection management (no folder permission required)
- Create and edit connection objects.
- Delete connection objects.
- Manage connection object permissions.

Manage Label (no folder permission required)
- Create labels.
- Delete labels.
Extended privileges allow you to perform more tasks and expand the access you have to repository objects. Informatica recommends that you reserve extended privileges for individual users and grant default privileges to groups.
Audit Trails
You can track changes to repository users, groups, privileges, and permissions by selecting the SecurityAuditTrail configuration option in the Repository Service properties in the PowerCenter Administration Console. When you enable the audit trail, the Repository Service logs security changes to the Repository Service log. The audit trail logs the following operations:

- Changing the owner, owner's group, or permissions for a folder.
- Changing the password of another user.
- Adding or removing a user.
- Adding or removing a group.
- Adding or removing users from a group.
- Changing global object permissions.
- Adding or removing user and group privileges.
GROUP NAME   FOLDER       PERMISSIONS    PRIVILEGES
DEVELOPERS   DEVELOPMENT  Read           Use Designer, Browse Repository, Use Workflow Manager
DEVELOPERS   UAT          Read           Use Designer, Browse Repository, Use Workflow Manager
UAT          UAT          Read           Use Designer, Browse Repository, Use Workflow Manager
OPERATIONS   Production   Read, Execute  Browse Repository, Workflow Operator
             Production   Read           Browse Repository
Typical repository administrator responsibilities include:

- Creating user accounts.
- Defining and creating groups.
- Defining and granting folder permissions.
- Defining and granting repository privileges.
- Enforcing changes in passwords.
- Controlling requests for changes in privileges.
- Creating and maintaining database, FTP, and external loader connections in conjunction with the database administrator.
- Working with the operations group to ensure tight security in the production environment.
Remember, you must have one of the following privileges to administer repository users:

- Administer Repository
- Super User
Summary of Recommendations
When implementing your security model, keep the following recommendations in mind:

- Do not use shared accounts.
- Limit user and group access to multiple repositories.
- Customize user privileges.
- Limit the Super User privilege.
- Limit the Administer Repository privilege.
- Restrict the Workflow Operator privilege.
- Follow a naming convention for user accounts and group names.
- For more secure environments, turn Audit Trail logging on.
Description
Four main architectural layers must be completely secure: the user layer, transmission layer, application layer, and data layer. Users must be authenticated and authorized to access data. Data Analyzer integrates with the following LDAP-compliant directory servers:

- SunOne/iPlanet Directory Server 4.1
- Novell eDirectory Server
- IBM SecureWay Directory
- IBM Tivoli Directory Server
- Microsoft Active Directory
In addition to the directory server, Data Analyzer supports Netegrity SiteMinder for centralizing authentication and access control for the various web applications in the organization.
Transmission Layer
Data transmissions must be protected against interception and tampering. Data Analyzer supports the standard Secure Sockets Layer (SSL) protocol to provide a secure environment.
Application Layer
Only appropriate application functionality should be provided to users with associated privileges. Data Analyzer provides three basic types of application-level security:

- Report, Folder and Dashboard Security. Restricts access for users or groups to specific reports, folders, and/or dashboards.
- Column-level Security. Restricts users and groups to particular metric and attribute columns.
- Row-level Security. Restricts users to specific attribute values within an attribute column of a table.
- Roles. A role can consist of one or more privileges. You can use system roles or create custom roles. You can grant roles to groups and/or individual users. When you edit a custom role, all groups and users with the role automatically inherit the change.
- Groups. A group can consist of users and/or groups. You can assign one or more roles to a group. Groups are created to organize logical sets of users and roles. After you create groups, you can assign users to the groups. You can also assign groups to other groups to organize privileges for related users. When you edit a group, all users and groups within the edited group inherit the change.
- Users. A user has a user name and password. Each person accessing Data Analyzer must have a unique user name. To set the tasks a user can perform, you can assign roles to the user or to the user's groups.
Types of Roles
- System roles. Data Analyzer provides a set of roles when the repository is created. Each role has a set of privileges assigned to it.
- Custom roles. The end user can create these roles and assign privileges to them.
Managing Groups
Groups allow you to classify users according to a particular function. You may organize users into groups based on their departments or management level. When you assign roles to a group, you grant the same privileges to all members of the group. When you change the roles assigned to a group, all users in the group inherit the changes. If a user belongs to more than one group, the user has the privileges from all groups.

To organize related users into related groups, you can create group hierarchies. With hierarchical groups, each subgroup automatically receives the roles assigned to the group it belongs to. When you edit a group, all subgroups contained within it inherit the changes.

For example, you may create a Lead group and assign it the Advanced Consumer role. Within the Lead group, you create a Manager group with a custom role, Manage Data Analyzer. Because the Manager group is a subgroup of the Lead group, it has both the Manage Data Analyzer and Advanced Consumer role privileges.
Belonging to multiple groups has an inclusive effect. For example, if group 1 has access to something but group 2 is excluded from that object, a user belonging to both groups 1 and 2 will have access to the object.
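The Lead/Manager example above can be sketched as follows. This is an illustration of role inheritance, not Data Analyzer's implementation; the group and role names come from the example in the text.

```python
# Sketch of hierarchical group role inheritance, Data Analyzer style:
# each subgroup automatically receives the roles assigned to its parent
# group. Group and role names mirror the Lead/Manager example above.
GROUP_ROLES = {"Lead": {"Advanced Consumer"}, "Manager": {"Manage Data Analyzer"}}
PARENT = {"Manager": "Lead"}  # Manager is a subgroup of Lead

def inherited_roles(group):
    """Collect a group's own roles plus those inherited from its ancestors."""
    roles = set()
    while group is not None:
        roles |= GROUP_ROLES.get(group, set())
        group = PARENT.get(group)
    return roles

print(sorted(inherited_roles("Manager")))
```

Walking the parent chain makes the inclusive effect explicit: the Manager subgroup ends up with both its own custom role and the role assigned to Lead.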
3. To prevent Data Analyzer from updating group information in the Data Analyzer repository, change the value of the enableGroupSynchronization property to false:

   <init-param>
     <param-name>
       InfSchedulerStartup.com.informatica.ias.scheduler.enableGroupSynchronization
     </param-name>
     <param-value>false</param-value>
   </init-param>

   When the value of the enableGroupSynchronization property is false, Data Analyzer does not synchronize the groups in the repository with the groups in the Windows Domain or LDAP directory service.
4. Save the web.xml file and add it back to the Data Analyzer EAR file.
5. Restart Data Analyzer.

When the enableGroupSynchronization property in the web.xml file is set to false, Data Analyzer updates only the user accounts the next time it synchronizes with the Windows Domain or LDAP authentication server. You must create and manage groups, and assign users to groups, in Data Analyzer.
Managing Users
Each user must have a unique user name to access Data Analyzer. To perform Data Analyzer tasks, a user must have the appropriate privileges, which you can assign with roles or groups. Data Analyzer creates a System Administrator user account when you create the repository; the default user name for this account is admin.

The system daemon, ias_scheduler/padaemon, runs the updates for all time-based schedules. System daemons must have a unique user name and password in order to perform Data Analyzer system functions and tasks. You can change the password for a system daemon, but you cannot change the system daemon user name via the GUI. Data Analyzer permanently assigns the daemon role to system daemons; you cannot assign new roles to system daemons or assign them to groups. To change the password for a system daemon, complete the following steps:

1. Change the password on the Administration tab in Data Analyzer.
2. Change the password in the web.xml file in the Data Analyzer folder.
3. Restart Data Analyzer.
To access contacts in the LDAP directory service, you can add the LDAP server on the LDAP Settings page. After you set up the connection to the LDAP directory service, users can email reports and shared documents to LDAP directory contacts. When you add an LDAP server, you must provide a value for the BaseDN (distinguished name) property. In the BaseDN property, enter the Base DN entries for your LDAP directory. The Base distinguished name entries define the type of information that is stored in the LDAP directory. If you do not know the value for BaseDN, contact your LDAP system administrator.
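As an illustration of the Base DN format, a directory whose entries live under a company domain might use something like the following (the domain components here are placeholders, not a real directory):

```
BaseDN: dc=mycompany,dc=com
```

Individual entries are then addressed relative to that Base DN, for example cn=jsmith,ou=people,dc=mycompany,dc=com.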
- Access permissions. Restrict user and/or group access to folders, reports, dashboards, attributes, metrics, template dimensions, or schedules. Use access permissions to restrict access to a particular folder or object in the repository.
- Data restrictions. Restrict user and/or group access to information in fact and dimension tables and operational schemas. Use data restrictions to prevent certain users or groups from accessing specific values when they create reports.
- Password restrictions. Restrict users from changing their passwords. Use password restrictions when you do not want users to alter their passwords.
When you create an object in the repository, every user has default read and write permissions for that object. By customizing access permissions for an object, you determine which users and/or groups can read, write, delete, or change access permissions for that object. When you set data restrictions, you determine which users and groups can view particular attribute values. If a user with a data restriction runs a report, Data Analyzer does not display the restricted data to that user.
- Read. Allows you to view a folder or object.
- Write. Allows you to edit an object. Also allows you to create and edit folders and objects within a folder.
- Delete. Allows you to delete a folder or an object from the repository.
- Change permission. Allows you to change the access permissions on a folder or object.

By default, Data Analyzer grants read and write access permissions to every user in the repository. You can use the General Permissions area to modify default access permissions for an object, or turn off default access permissions.
Data Restrictions
You can restrict access to data based on the values of related attributes. Data restrictions are set to keep sensitive data from appearing in reports. For example, you may want to restrict data related to the performance of a new store from outside vendors. You can set a data restriction that excludes the store ID from their reports. You can set data restrictions using one of the following methods:
- Set data restrictions by object. Restrict access to attribute values in a fact table, operational schema, real-time connector, or real-time message stream. You can apply the data restriction to users and groups in the repository. Use this method to apply the same data restrictions to more than one user or group.
- Set data restrictions for one user at a time. Edit a user account or group to restrict user or group access to specified data. You can set one or more data restrictions for each user or group. Use this method to set custom data restrictions for different users or groups.

A data restriction rule can be:

- Inclusive. Use the IN option to allow users to access data related to the attributes you select. For example, to allow users to view only data from the year 2001, create an IN 2001 rule.
- Exclusive. Use the NOT IN option to restrict users from accessing data related to the attributes you select. For example, to allow users to view all data except from the year 2001, create a NOT IN 2001 rule.
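The IN and NOT IN rules behave like simple row filters. A sketch, using the year-2001 example from the text (the function, field names, and sample data are hypothetical, not Data Analyzer's API):

```python
# Sketch: applying an inclusive (IN) or exclusive (NOT IN) data
# restriction rule to report rows before returning them to a user.
def apply_restriction(rows, attribute, values, inclusive=True):
    """Keep rows whose attribute value is IN (or NOT IN) the given set."""
    if inclusive:
        return [r for r in rows if r[attribute] in values]
    return [r for r in rows if r[attribute] not in values]

sales = [{"year": 2000, "amount": 10}, {"year": 2001, "amount": 20}]
print(apply_restriction(sales, "year", {2001}))                   # IN 2001
print(apply_restriction(sales, "year", {2001}, inclusive=False))  # NOT IN 2001
```

The restricted rows are filtered out before the report is rendered, so a restricted user never sees that the excluded values exist.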
When Data Analyzer runs scheduled reports that have provider-based security, it runs reports against the data restrictions for the report owner. However, if the reports have consumer-based security, the Data Analyzer Server creates a separate report for each unique security profile.
The following steps apply to changing the admin user on WebLogic only.

To change the Data Analyzer system administrator user name on WebLogic 8.1 (Data Analyzer 8.1):
- Repository authentication. Use the Update System Accounts utility to change the system administrator account name in the repository.
- LDAP or Windows Domain authentication. Set up the new system administrator account in the Windows Domain or LDAP directory service. Then use the Update System Accounts utility to change the system administrator account name in the repository.
Duser=userName -Dpassword=userPassword -Dias_scheduler=pa_scheduler -Dadmin=paadmin repositoryutil.refresh.InfChangeSystemUserNames REM END OF BATCH FILE 7. Make changes in the batch file as directed in the remarks [REM lines] 8. Save the file and open up a command prompt window and navigate to D:\Temp\Repository Utils \Refresh\ 9. At the prompt, type change_sys_user.bat and press Enter. The user "ias_scheduler" and "admin" will be changed to "pa_scheduler" and "paadmin", respectively. 10. Modify web.xml, and weblogic.xml (located at .\bea\wlserver6.1\config\informatica\applications\ias \WEB-INF) by replacing ias_scheduler with 'pa_scheduler' 11. Replace ias_scheduler with pa_scheduler in the xml file weblogic-ejb-jar.xml This file is in iasEjb.jar file located in the directory .\bea\wlserver6.1\config\informatica\applications\ To edit the file Make a copy of the iasEjb.jar:
mkdir \tmp
cd \tmp
jar xvf \bea\wlserver6.1\config\informatica\applications\iasEjb.jar META-INF
cd META-INF
(update META-INF/weblogic-ejb-jar.xml, replacing ias_scheduler with pa_scheduler)
cd \
jar uvf \bea\wlserver6.1\config\informatica\applications\iasEjb.jar -C \tmp .
Note: There is a trailing period at the end of the jar uvf command above.

12. Restart the server.
Description
The first step in database sizing is to review system requirements to define such things as:
Expected data architecture elements (will there be staging areas? operational data stores? a centralized data warehouse and/or master data? data marts?). Each additional database element requires more space, even more so where data is replicated across multiple systems, such as a data warehouse that also maintains an operational data store: the same data in the ODS is present in the warehouse as well, albeit in a different format.
Expected source data volume It is useful to analyze how each row in the source system translates into the target system. In most situations the row count in the target system can be calculated by following the data flows from the source to the target. For example, say a sales order table is being built by denormalizing a source table. The source table holds sales data for 12 months in a single row (one column for each month). Each row in the source translates to 12 rows in the target. So a source table with one million rows ends up as a 12 million row table.
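The source-to-target expansion in this example reduces to simple arithmetic. The sketch below uses the illustrative numbers from the text (one million source rows, 12 monthly columns per row):

```shell
# Each source row holds 12 monthly columns, so denormalizing it
# produces 12 target rows per source row.
SOURCE_ROWS=1000000
MONTHS_PER_ROW=12
TARGET_ROWS=$(( SOURCE_ROWS * MONTHS_PER_ROW ))
echo "Estimated target rows: ${TARGET_ROWS}"
```

The same pattern applies to any source-to-target flow: follow each transformation from source to target and multiply or divide the row count accordingly.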
Data granularity and periodicity Granularity refers to the lowest level of information that is going to be stored in a fact table. Granularity affects the size of a database to a great extent, especially for aggregate tables. The level at which a table has been aggregated
increases or decreases a table's row count. For example, a sales order fact table's size is likely to be greatly affected by whether the table is being aggregated at a monthly level or at a quarterly level. The granularity of fact tables is determined by the dimensions linked to that table. The number of dimensions that are connected to the fact tables affects the granularity of the table and hence the size of the table.
Load frequency and method (full refresh? incremental updates?). Load frequency affects the space requirements for the staging areas. A load plan that updates a target less frequently is likely to load more data in each run, so the staging areas require more space. A full refresh requires more space for the same reason.

Estimated growth rates over time and the history to be retained.
FROM TABLE(DBMS_SPACE.object_growth_trend('schema','tablename','TABLE'))
ORDER BY timepoint;

The results of this query are shown below:

TIMEPOINT                      SPACE_USAGE SPACE_ALLOC QUALITY
------------------------------ ----------- ----------- ------------
11-APR-04 02.55.14.116000 PM          6372       65536 INTERPOLATED
12-APR-04 02.55.14.116000 PM          6372       65536 INTERPOLATED
13-APR-04 02.55.14.116000 PM          6372       65536 INTERPOLATED
13-MAY-04 02.55.14.116000 PM          6372       65536 PROJECTED
14-MAY-04 02.55.14.116000 PM          6372       65536 PROJECTED
15-MAY-04 02.55.14.116000 PM          6372       65536 PROJECTED
16-MAY-04 02.55.14.116000 PM          6372       65536 PROJECTED
GOOD - The data for the timepoint relates to data within the AWR repository with a timestamp within 10 percent of the interval.

INTERPOLATED - The data for this timepoint did not meet the GOOD criteria but was based on data gathered before and after the timepoint.

PROJECTED - The timepoint is in the future, so the data is estimated based on previous growth statistics.
Baseline Volumetric
Next, use the physical data models for the sources and the target architecture to develop a baseline sizing estimate. The administration guides for most DBMSs contain sizing guidelines for the various database structures such as tables, indexes, sort space, data files, log files, and database cache.

Develop a detailed sizing estimate using a worksheet inventory of the tables and indexes from the physical data model, along with field data types and field sizes. Database products use different storage methods for the various data types, so be sure to consult the database manuals to determine the size of each data type. Add up the field sizes to determine row size. Then use the data volume projections to determine the number of rows to multiply by the row size. The default estimate for index size is to assume the same size as the table.

Also estimate the temporary space for sort operations. For data warehouse applications, where summarizations are common, plan on large temporary spaces; the temporary space can be as much as 1.5 times larger than the largest table in the database.

Another approach that is sometimes useful is to load the data architecture with representative data and determine the resulting database sizes. This test load can be a
fraction of the actual data and is used only to gather basic sizing statistics. You then need to apply growth projections to these statistics. For example, after loading ten thousand sample records to the fact table, you determine the size to be 10MB. Based on the scenario analysis, you can expect this fact table to contain 60 million records after five years. So, the estimated size for the fact table is about 60GB [i.e., 10 MB * (60,000,000/10,000)]. Don't forget to add indexes and summary tables to the calculations.
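The scale-up in this example is a straight linear extrapolation, which can be scripted as below. The sample figures (10,000 rows at 10 MB, 60 million projected rows) are the ones from the text, and 1 GB is approximated as 1,000 MB:

```shell
# Linear scale-up from a representative test load.
SAMPLE_ROWS=10000        # rows in the sample load
SAMPLE_MB=10             # measured size of the sample load, in MB
PROJECTED_ROWS=60000000  # five-year row projection from scenario analysis

EST_MB=$(( SAMPLE_MB * PROJECTED_ROWS / SAMPLE_ROWS ))
EST_GB=$(( EST_MB / 1000 ))
echo "Estimated fact table size: ${EST_MB} MB (about ${EST_GB} GB)"
```

Apply the same extrapolation to the indexes and summary tables before totaling the database size.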
Guesstimating
When there is not enough information to calculate an estimate as described above, use educated guesses and rules of thumb to develop as reasonable an estimate as possible.
If you don't have the source data model, use what you do know of the source data to estimate the average field size and the average number of fields in a row to determine table size.

Based on your understanding of transaction volume over time, determine your growth metrics for each type of data and calculate your source data volume (SDV) from table size and growth metrics.

If your target data architecture is not complete enough to determine table sizes, base your estimates on multiples of the SDV:
If it includes staging areas, add another SDV for any source subject area that you will stage, multiplied by the number of loads you'll retain in staging.

If you intend to consolidate data into an operational data store, add the SDV multiplied by the number of loads to be retained in the ODS for historical purposes (e.g., keeping one year's worth of monthly loads = 12 x SDV).

Data warehouse architectures are based on the periodicity and granularity of the warehouse; this may be another SDV + (0.3n x SDV, where n = number of time periods loaded in the warehouse over time).

If your data architecture includes aggregates, add a percentage of the warehouse volumetrics based on how much of the warehouse data will be aggregated and to what level (e.g., if the rollup level represents 10 percent of the dimensions at the detail level, use 10 percent).

Similarly, for data marts, add a percentage of the data warehouse based on how much of the warehouse data is moved into the data mart.

Be sure to consider the growth projections over time and the history to be retained in all of your calculations.
And finally, remember that there is always much more data than you expect, so you may want to add a reasonable fudge factor to the calculations as a margin of safety.
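As a hedged illustration only, the rules of thumb above can be strung together into a back-of-the-envelope script. Every number here (the 100 GB SDV, the load counts, and the percentages) is an assumed sample value, not a recommendation:

```shell
# Back-of-the-envelope target sizing from a source data volume (SDV), in GB.
SDV=100            # assumed source data volume
STAGING_LOADS=3    # assumed loads retained in staging
ODS_LOADS=12       # e.g., one year of monthly loads kept in the ODS
N_PERIODS=36       # assumed time periods loaded into the warehouse
AGG_PCT=10         # assumed aggregate volume as a percent of the warehouse
MART_PCT=25        # assumed data mart volume as a percent of the warehouse

STAGING=$(( SDV * STAGING_LOADS ))
ODS=$(( SDV * ODS_LOADS ))
# Warehouse = SDV + (0.3 * n * SDV); 0.3 is written as 3/10 for integer math.
WAREHOUSE=$(( SDV + 3 * N_PERIODS * SDV / 10 ))
AGGREGATES=$(( WAREHOUSE * AGG_PCT / 100 ))
MARTS=$(( WAREHOUSE * MART_PCT / 100 ))

TOTAL=$(( STAGING + ODS + WAREHOUSE + AGGREGATES + MARTS ))
echo "Estimated total footprint: ${TOTAL} GB"
```

Add the fudge factor mentioned above on top of the computed total.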
Description
Deployment Groups are containers that hold references to objects that need to be migrated. This includes objects such as mappings, mapplets, reusable transformations, sources, targets, workflows, sessions and tasks, as well as the object holders (i.e., the repository folders). Deployment groups are faster and more flexible than folder moves for incremental changes. In addition, they allow for migration rollbacks if necessary. Migrating a deployment group allows you to copy objects in a single copy operation from across multiple folders in the source repository into multiple folders in the target repository. Copying a deployment group also allows you to specify individual objects to copy, rather than the entire contents of a folder. There are two types of deployment groups: static and dynamic.
Static deployment groups contain direct references to the versions of objects that need to be moved. Users explicitly add the version of the object to be migrated to the deployment group. Create a static deployment group if you do not expect the set of deployment objects to change between deployments.

Dynamic deployment groups contain a query that is executed at the time of deployment. The results of the query (i.e., object versions in the repository) are then selected and copied to the target repository/folder. Create a dynamic deployment group if you expect the deployment objects to change frequently between deployments.
Dynamic deployment groups are generated from a query. While any available criteria can be used, it is advisable to have developers use labels to simplify the query. See the Strategies for Labels section of the Using PowerCenter Labels Best Practice for further information. When generating a query for deployment groups with mappings and mapplets that contain non-reusable objects, you must use a query condition in addition to the specific selection criteria. The query must include a condition for Is Reusable with a qualifier of either Reusable or Non-Reusable. Without this qualifier, the
deployment may encounter errors if there are non-reusable objects held within the mapping or mapplet.

A deployment group exists in a specific repository and can be used to move items to any other accessible repository/folder. A deployment group maintains a history of all migrations it has performed: it tracks which versions of objects were moved from which folders in which source repositories, and into which folders in which target repositories those versions were copied. In other words, it provides a complete audit trail of all migrations performed. Because the deployment group knows what it moved and where, an administrator can, if necessary, have the deployment group undo the most recent deployment, reverting the target repository to its pre-deployment state. Using labels (as described in the Using PowerCenter Labels Best Practice) allows objects in the subsequent repository to be tracked back to a specific deployment.

It is important to note that the deployment group only migrates the objects it contains to the target repository/folder. The deployment group itself does not move; it remains in the source repository.
deployment group. Instead of migrating the group itself to the next repository, you can use a query to select the objects for migration and save them to a single XML file, which can then be transmitted to the on-shore environment through alternative methods. If the on-shore repository is versioned, importing the file activates the import wizard as if a deployment group were being received.
Description
Ensuring a smooth migration process between development, QA, and production environments is essential to deploying an application. Deciding which migration strategy works best for a project depends on two primary factors:
How is the PowerCenter repository environment designed? Are there individual repositories for development, QA, and production, or are there just one or two environments that share one or all of these phases?

How has the folder architecture been defined?
Each of these factors plays a role in determining the migration procedure that is most beneficial to the project. PowerCenter offers flexible migration options that can be adapted to fit the needs of each application. PowerCenter migration options include repository migration, folder migration, object migration, and XML import/export. In versioned PowerCenter repositories, users can also use static or dynamic deployment groups for migration, which provide the capability to migrate any combination of objects within the repository with a single command.

This Best Practice is intended to help the development team decide which technique is most appropriate for the project. The following sections discuss the various options that are available, based on the environment and architecture selected. Each section describes the major advantages of its use, as well as its disadvantages.
Repository Environments
The following section outlines the migration procedures for standalone and distributed repository environments. The distributed environment section touches on several migration architectures, outlining the pros and cons of each. Also, please note that any methods described in the Standalone section may also be used in a Distributed environment.
separate development folders for each of the individual developers for development and unit test purposes. A single shared or common development folder, SHARED_MARKETING_DEV, holds all of the common objects, such as sources, targets, and reusable mapplets. In addition, two test folders are created for QA purposes. The first contains all of the unit-tested mappings from the development folder. The second is a common or shared folder that contains all of the tested shared objects. Eventually, as the following paragraphs explain, two production folders will also be built.
The first, and most common, method is object migration via an object copy. In this case, a user opens the SHARED_MARKETING_TEST folder and drags the object from SHARED_MARKETING_DEV into the appropriate workspace (i.e., Source Analyzer, Warehouse Designer, etc.). This is similar to dragging a file from one folder to another using Windows Explorer.

The second approach is object migration via object XML import/export. A user can export each of the objects in the SHARED_MARKETING_DEV folder to XML, and then re-import each object into SHARED_MARKETING_TEST via XML import. With XML import/export, the XML files can be uploaded to a third-party versioning tool, if the organization has standardized on such a tool. Otherwise, versioning can be enabled in PowerCenter. Migration with versioned PowerCenter repositories is covered later in this document.
After you've copied all common or shared objects, the next step is to copy the individual mappings from each development folder into the MARKETING_TEST folder. Again, you can use either of the two object-level migration methods described above to copy the mappings to the folder, although the XML import/export method is the most intuitive for resolving shared object conflicts. However, the migration method is slightly different here because, when copying the mappings, you must ensure that the shortcuts in the mapping are associated with the SHARED_MARKETING_TEST folder. Designer prompts you to choose the correct shortcut folder created in the previous example, which points to the
SHARED_MARKETING_TEST (see image below). You can then continue the migration process until all mappings have been successfully migrated. In PowerCenter 7 and later versions, you can export multiple objects into a single XML file, and then import them at the same time.
The final step in the process is to migrate the workflows that use those mappings. Again, the object-level migration can be completed either through drag-and-drop or by using XML import/export. In either case, this process is very similar to the steps described above for migrating mappings, but differs in that the Workflow Manager provides a Workflow Copy Wizard to guide you through the process. The following steps outline the full process for successfully copying a workflow and all of its associated tasks.

1. The Wizard prompts for the name of the new workflow. If a workflow with the same name exists in the destination folder, the Wizard prompts you to rename or replace it. If no such workflow exists, a default name is used. Click Next to continue the copy process.
2. The next step, for each task, is to see if it already exists (as shown below). If the task is present, you can rename or replace the current one. If it does not exist, the default name is used. Click Next.
3. Next, the Wizard prompts you to select the mapping associated with each session task in the workflow. Select the mapping and continue by clicking "Next".
4. If connections exist in the target repository, the Wizard prompts you to select the connection to use for the source and target. If no connections exist, the default settings are used. When this step is completed, click "Finish" and save the work.
to production migration.

1. Open the PowerCenter Repository Manager client tool and log into the repository.
2. To make a shared folder for the production environment, highlight the SHARED_MARKETING_TEST folder, drag it, and drop it on the repository name.
3. The Copy Folder Wizard appears to guide you through the copying process.
4. The first Wizard screen asks if you want to use the typical folder copy options or the advanced options. In this example, we'll use the advanced options.
5. The second Wizard screen prompts you to enter a folder name. By default, the folder name that appears on this screen is the folder name followed by the date. In this case, enter the name as
SHARED_MARKETING_PROD.
6. The third Wizard screen prompts you to select a folder to override. Because this is the first time you are transporting the folder, you won't need to select anything.
7. The final screen begins the actual copy process. Click "Finish" when the process is complete.
Repeat this process to create the MARKETING_PROD folder. Use the MARKETING_TEST folder as the original to copy, and associate the shared objects with the SHARED_MARKETING_PROD folder that you just created. At the end of the migration, you should have two additional folders in the repository environment for production: SHARED_MARKETING_PROD and MARKETING_PROD (as shown below). These folders contain the initially migrated objects. Before you can actually run the workflows in these production folders, you need to modify the session source and target connections to point to the production environment.
occur. These types of changes in production take place on a case-by-case or periodically-scheduled basis. The following steps outline the process of moving these objects individually.

1. Log into PowerCenter Designer. Open the destination folder and expand the source folder. Click on the object to copy and drag-and-drop it into the appropriate workspace window.
2. Because this is a modification to an object that already exists in the destination folder, Designer prompts you to choose whether to Rename or Replace the object (as shown below). Choose the option to Replace the object.
3. In PowerCenter 7 and later versions, you can choose to compare conflicts whenever migrating any object in Designer or Workflow Manager. By comparing the objects, you can ensure that the changes that you are making are what you intend. See below for an example of the mapping compare window.
4. After the object has been successfully copied, save the folder so the changes take effect.
5. The newly copied mapping is now tied to any sessions that the replaced mapping was tied to.
6. Log into Workflow Manager and make the appropriate changes so the session or workflow is updated with the changes.
Drag all of the newly copied objects from the SHARED_MARKETING_TEST folder to MARKETING_TEST. Save your changes.
2. Copy the mapping from Development into Test.

In the PowerCenter Designer, open the MARKETING_TEST folder, and drag and drop the mapping from each development folder into the MARKETING_TEST folder.
When copying each mapping in PowerCenter, Designer prompts you to either Replace, Rename, or Reuse the object, or Skip for each reusable object, such as source and target definitions. Choose to Reuse the object for all shared objects in the mappings copied into the MARKETING_TEST folder. Save your changes.
3. If a reusable session task is being used, follow these steps; otherwise, skip to step 4.

In the PowerCenter Workflow Manager, open the MARKETING_TEST folder and drag and drop each reusable session from the developers' folders into the MARKETING_TEST folder. A Copy Session Wizard guides you through the copying process.
Open each newly copied session and click on the Source tab. Change the source to point to the source database for the Test environment. Click the Target tab. Change each connection to point to the target database for the Test environment. Be sure to double-check the workspace from within the Target tab to ensure that the load options are correct. Save your changes.
4. While the MARKETING_TEST folder is still open, copy each workflow from Development to Test.

Drag each workflow from the development folders into the MARKETING_TEST folder. The Copy Workflow Wizard appears. Follow the same steps listed above to copy the workflow to the new folder.
As mentioned earlier, in PowerCenter 7 and later versions, the Copy Wizard allows you to compare conflicts from within Workflow Manager to ensure that the correct migrations are being made. Save your changes.
5. Implement the appropriate security.

In Development, the owner of the folders should be a user (or users) in the development group.
In Test, change the owner of the test folder to a user (or users) in the test group.

In Production, change the owner of the folders to a user in the production group.

Revoke all rights to Public other than Read for the production folders.
work performed in development cannot impact QA or production. With a fully distributed approach, separate repositories function much like the separate folders in a standalone environment. Each repository has a similar name, like the folders in the standalone environment. For instance, in our Marketing example we would have three repositories, INFADEV, INFATEST, and INFAPROD. In the following example, we discuss a distributed repository architecture. There are four techniques for migrating from development to production in a distributed repository architecture, with each involving some advantages and disadvantages.
Repository Copy

Folder Copy

Object Copy

Deployment Groups
Repository Copy
So far, this document has covered object-level migrations and folder migrations through drag-and-drop object copying and object XML import/export. This section discusses migrations in a distributed repository environment through repository copies. The main advantages of this approach are:
The ability to copy all objects (i.e., mappings, workflows, mapplets, reusable transformations, etc.) at once from one environment to another.

The ability to automate this process using pmrep commands, thereby eliminating many of the manual processes that users typically perform.

The ability to move everything without breaking or corrupting any of the objects.
The first disadvantage is that everything is moved at once (which is also an advantage). The problem is that everything is moved, ready or not. For example, we may have 50 mappings in QA, but only 40 of them are production-ready. The 10 untested mappings are moved into production along with the 40 production-ready mappings, which leads to the second disadvantage: significant maintenance is required to remove any unwanted or excess objects. There is also a need to adjust server variables, sequences, parameters/variables, database connections, etc. Everything must be set up correctly before the actual production runs can take place.

Lastly, the repository copy process requires that the existing Production repository be deleted before the Test repository can be copied. This results in a loss of production environment operational metadata such as load statuses, session run times, etc. High-performance organizations leverage operational metadata to track trends over time related to load success/failure and duration; this metadata can be a competitive advantage for organizations that use it to plan for future growth.
Now that we've discussed the advantages and disadvantages, we'll look at three ways to accomplish the Repository Copy method:
Copying the repository contents

Backing up and restoring the repository

Using the PMREP command line utility
If the Production repository already exists, you must delete it before you can copy the Test repository. Before you can delete the repository, you must run it in exclusive mode.

1. Click on the INFA_PROD repository in the left pane to select it, then change the running mode to exclusive by clicking the edit button in the right pane under the Properties tab.
2. Delete the Production repository by selecting it and choosing Delete from the context menu.
3. Click on the Action drop-down list and choose Copy contents from
4. In the new window, choose the domain name and the repository service INFA_TEST from the drop-down menus. Enter the username and password of the Test repository.
5. Click OK to begin the copy process.
6. When you've successfully copied the repository to the new location, exit the PowerCenter Administration Console.
7. In the Repository Manager, double-click on the newly copied repository and log in with a valid username and password.
8. Verify connectivity, then highlight each folder individually and rename it. For example, rename the MARKETING_TEST folder to MARKETING_PROD, and SHARED_MARKETING_TEST to SHARED_MARKETING_PROD.
9. Be sure to remove all objects that are not pertinent to the Production environment from the folders before beginning the actual testing process.
10. When this cleanup is finished, log into the repository through the Workflow Manager. Modify the server information and all connections so they point to the new Production locations for all existing tasks and workflows.
2. A screen appears and prompts you to supply a name for the backup file as well as the Administrator username and password. The file is saved to the Backup directory within the repository server's home directory.
3. After you've selected the location and file name, click OK to begin the backup process.
4. The backup process creates a .rep file containing all repository information. Stay logged into the Manage Repositories screen. When the backup is complete, select the repository connection to which the backup will be restored (i.e., the Production repository).
5. The system prompts you to supply a username, password, and the name of the file to be restored. Enter the appropriate information and click OK. When the restoration process is complete, repeat the steps listed in the copy repository option to delete the unused objects and rename the folders.
PMREP
Using the PMREP commands is essentially the same as the Backup and Restore Repository method, except that it is run from the command line rather than through the GUI client tools. PMREP utilities can be used from the Informatica server or from any client machine connected to the server. Refer to the Repository Manager Guide for a list of PMREP commands.

The following is a sample of the command syntax used within a Windows batch file to connect to and back up a repository. Using this code example as a model, you can write scripts to be run on a daily basis to perform functions such as connect, backup, restore, etc.

backupproduction.bat
REM This batch file uses pmrep to connect to and back up the repository Production on the server Central
@echo off
echo Connecting to Production repository...
"C:\Program Files\Informatica PowerCenter\RepositoryServer\bin\pmrep" connect -r INFAPROD -n Administrator -x Adminpwd -h infarepserver -o 7001
echo Backing up Production repository...
"C:\Program Files\Informatica PowerCenter\RepositoryServer\bin\pmrep" backup -o c:\backup\Production_backup.rep
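Building on the sample batch file, a scheduled script can stamp each backup file with the date so that nightly runs don't overwrite one another. The sketch below is in POSIX shell; the path, repository name, and connection details are the illustrative ones from the example above, and the pmrep invocations are left commented out so the file-naming logic stands on its own:

```shell
#!/bin/sh
# Build a date-stamped backup file name, e.g. Production_backup_20240131.rep.
STAMP=$(date +%Y%m%d)
BACKUP_FILE="c:\\backup\\Production_backup_${STAMP}.rep"
echo "Backing up Production repository to ${BACKUP_FILE}"

# The actual pmrep calls, as in the batch file above:
# pmrep connect -r INFAPROD -n Administrator -x Adminpwd -h infarepserver -o 7001
# pmrep backup -o "${BACKUP_FILE}"
```

Retaining dated backup files also gives you a rollback point for each day's repository state.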
Disable the workflows not being used by opening the workflow properties in the Workflow Manager and checking the Disabled checkbox under the General tab.

Delete the tasks not being used in the Workflow Manager, and the unused mappings in the Designer.
2. Modify the database connection strings to point to the production sources and targets.
In the Workflow Manager, select Relational connections from the Connections menu.

Edit each relational connection by changing the connect string to point to the production sources and targets. If you are using Lookup transformations in the mappings and the connect string is anything other than $SOURCE or $TARGET, you will need to modify the connect strings appropriately.
In the Workflow Manager, open the session task properties, and from the Components tab make the required changes to the pre- and post-session scripts.
In Development, ensure that the owner of the folders is a user in the development group.

In Test, change the owner of the test folders to a user in the test group.

In Production, change the owner of the folders to a user in the production group.

Revoke all rights to Public other than Read for the Production folders.
Folder Copy
Although deployment groups are becoming a very popular migration method, the folder copy method has historically been the most popular way to migrate in a distributed environment. Copying an entire folder allows you to quickly promote all of the objects located within that folder. All source and target objects, reusable transformations, mapplets, mappings, tasks, worklets and workflows are promoted at once. Because of this, however, everything in the folder must be ready to migrate forward. If some mappings or workflows are not valid, then developers (or the Repository Administrator) must manually delete these mappings or workflows from the new folder after the folder is copied. The three advantages of using the folder copy method are:
The Repository Manager's Folder Copy Wizard makes it almost seamless to copy an entire folder and all the objects within it.

If the project uses a common or shared folder and this folder is copied first, then all shortcut relationships are automatically converted to point to the newly copied common or shared folder.

All connections, sequences, mapping variables, and workflow variables are copied automatically.
The primary disadvantage of the folder copy method is that the repository is locked while the folder copy is being performed. Therefore, it is necessary to schedule this migration task for a time when the repository is least utilized. Remember that a locked repository means that no jobs can be launched during this process, which can be a serious consideration in real-time or near real-time environments.

The following example steps through the process of copying folders from each of the different environments. The first example uses three separate repositories for development, test, and production.

1. If using shortcuts, follow these sub-steps; otherwise, skip to step 2:
- Open the Repository Manager client tool.
- Connect to both the Development and Test repositories.
- Highlight the folder to copy and drag it to the Test repository.
- The Copy Folder Wizard appears to step you through the copy process.
- When the folder copy process is complete, open the newly copied folder in both the Repository Manager and Designer to ensure that the objects were copied properly.
2. Copy the Development folder to Test. If you skipped step 1, follow these sub-steps:
- Open the Repository Manager client tool.
- Connect to both the Development and Test repositories.
- Highlight the folder to copy and drag it to the Test repository. The Copy Folder Wizard will appear.
3. Use the advanced options when copying the folder across. Select Next to use the default name of the folder.
4. If the folder already exists in the destination repository, choose to replace the folder.
The following screen appears to prompt you to select the folder where the new shortcuts are located.
In a situation where the folder names do not match, a folder compare will take place. The Copy Folder Wizard then completes the folder copy process. Rename the folder as appropriate and implement the security.

5. When testing is complete, repeat the steps above to migrate to the Production repository. When the folder copy process is complete, log onto the Workflow Manager and change the connections to point to the appropriate target location. Ensure that all tasks updated correctly and that folder and repository security is modified for test and production.
Object Copy
Copying mappings into the next stage in a networked environment involves many of the same advantages and disadvantages as in the standalone environment, but the process of handling shortcuts is simplified in the networked environment. For additional information, see the earlier description of Object Copy for the standalone environment. One advantage of Object Copy in a distributed environment is that it provides more granular control over objects. Two distinct disadvantages of Object Copy in a distributed environment are:
- Much more work to deploy an entire group of objects
- Shortcuts must exist prior to importing/copying mappings
Below are the steps to complete an object copy in a distributed repository environment:

1. If using shortcuts, follow these sub-steps, otherwise skip to step 2:
- In each of the distributed repositories, create a common folder with the exact same name and case.
- Copy the shortcuts into the common folder in Production, making sure the shortcut has the exact same name.
2. In the Designer, connect to both the Test and Production repositories and open the appropriate folders in each. Drag-and-drop the mapping from Test into Production. During the mapping copy process, PowerCenter 7 and later versions allow a comparison of this mapping to an existing copy of the mapping already in Production. Note that the ability to compare objects is not limited to mappings, but is available for all repository objects including workflows, sessions, and tasks.
3. Create or copy a workflow with the corresponding session task in the Workflow Manager to run the mapping (first ensure that the mapping exists in the current repository).
- If copying the workflow, follow the Copy Wizard.
- If creating the workflow, add a session task that points to the mapping and enter all the appropriate information.
In Development, ensure the owner of the folders is a user in the development group. In Test, change the owner of the test folders to a user in the test group. In Production, change the owner of the folders to a user in the production group. Revoke all rights to Public other than Read for the Production folders.
Deployment Groups
For versioned repositories, the use of Deployment Groups for migrations between distributed environments allows the most flexibility and convenience. With Deployment Groups, you can migrate individual objects as you would in an object copy migration, but you also have the convenience of a repository- or folder-level migration because all objects are deployed at once. The objects included in a deployment group have no restrictions and can come from one or multiple folders. For additional convenience, you can set up a dynamic deployment group that allows the objects in the deployment group to be defined by a repository query, rather than being added to the deployment group manually. Lastly, because deployment groups are available on versioned repositories, they can also be rolled back, reverting to the previous versions of the objects when necessary.
- Backup and restore of the repository needs to be performed only once.
- Copying a folder replaces the previous copy.
- Copying a mapping allows for different names to be used for the same object.

Uses for Deployment Groups
- Deployment Groups are containers that hold references to objects that need to be migrated.
- Allows for version-based object migration.
- Faster and more flexible than folder moves for incremental changes.
- Allows for migration rollbacks.
- Allows specifying individual objects to copy, rather than the entire contents of a folder.
Static
- Contain direct references to versions of objects that need to be moved.
- Users explicitly add the version of the object to be migrated to the deployment group.
Dynamic
- Contain a query that is executed at the time of deployment.
- The results of the query (i.e., object versions in the repository) are then selected and copied to the target repository.
Pre-Requisites
Create required folders in the Target Repository
Creating Labels
A label is a versioning object that you can associate with any versioned object or group of versioned objects in a repository.
Advantages:

- Tracks versioned objects during development.
- Improves query results.
- Associates groups of objects for deployment.
- Associates groups of objects for import and export.
Create label
- Create labels through the Repository Manager.
- After creating the labels, go to edit mode and lock them.
- The "Lock" option is used to prevent other users from editing or applying the label.
- This option can be enabled only when the label is edited.
- Some standard label examples are:
  - Development
  - Deploy_Test
Apply Label
- Create a query to identify the objects that need to be labeled.
- Run the query and apply the labels.
Queries
A query is an object used to search for versioned objects in the repository that meet specific conditions.
Advantages:

- Tracks objects during development.
- Associates a query with a deployment group.
- Finds deleted objects you want to recover.
- Finds groups of invalidated objects you want to validate.
Create a query
- The Query Browser allows you to create, edit, run, or delete object queries.
Execute a query
- Execute through the Query Browser.
- Execute through the pmrep command line: ExecuteQuery -q query_name -t query_type -u persistent_output_file_name -a append -c column_separator -r end-of-record_separator -l end-of-listing_indicator -b verbose
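The command-line form can also be driven from a script. The sketch below builds the argument list for a hypothetical saved query; the query name and output file are examples, and only a subset of the ExecuteQuery options shown above is used.

```python
# Sketch: invoking pmrep ExecuteQuery from a script. Assumes pmrep is on
# the PATH and a repository connection was already established with
# "pmrep connect"; names here are placeholders, not real repository objects.
import subprocess

def build_execute_query(query_name, output_file, query_type="shared"):
    """Build the pmrep ExecuteQuery argument list for a saved object query."""
    return [
        "pmrep", "executequery",
        "-q", query_name,    # saved query to run
        "-t", query_type,    # shared or personal
        "-u", output_file,   # persistent output file for the result set
    ]

cmd = build_execute_query("RELEASE_20050130_objs", "release_objs.txt")
# subprocess.run(cmd, check=True)  # enable in a real environment
print(" ".join(cmd))
```

The persistent output file produced by -u is what a dynamic deployment process would consume downstream.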
1. Launch the Repository Manager client tool and log in to the source repository.
2. Expand the repository, right-click on Deployment Groups and choose New Group.
3. In the dialog window, give the deployment group a name, and choose whether it should be static or dynamic. In this example, we are creating a static deployment group. Click OK.
1. In Designer, Workflow Manager, or Repository Manager, right-click an object that you want to add to the deployment group and choose Versioning -> View History. The View History window appears.
2. In the View History window, right-click the object and choose Add to Deployment Group.
3. In the Deployment Group dialog window, choose the deployment group that you want to add the object to, and click OK.
4. In the final dialog window, choose whether you want to add dependent objects. In most cases, you will want to add dependent objects to the deployment group so that they will be migrated as well. Click OK.
NOTE: The All Dependencies option should be used for any new code that is migrating forward. However, this option can cause issues when moving existing code forward because All Dependencies also flags shortcuts. During the deployment, PowerCenter tries to re-insert or replace the shortcuts. This does not work, and causes the deployment to fail.

The object will be added to the deployment group at this time. Although the deployment group allows the most flexibility, the task of adding each object to the deployment group is similar to the effort required for an object copy migration. To make deployment groups easier to use, PowerCenter allows the capability to create dynamic deployment groups.
1. Create a deployment group, just as you did for a static deployment group, but in this case, choose the dynamic option. Also, select the Queries button.
2. The Query Browser window appears. Choose New to create a query for the dynamic deployment group.
3. In the Query Editor window, provide a name and query type (Shared). Define criteria for the objects that should be migrated. The drop-down list of parameters lets you choose from 23 predefined metadata categories. In this case, the developers have assigned the RELEASE_20050130 label to all objects that need to be migrated, so the query is defined as Label Is Equal To RELEASE_20050130. The creation and application of labels are discussed in Using PowerCenter Labels.
4. Save the Query and exit the Query Editor. Click OK on the Query Browser window, and close the Deployment Group editor window.
Automated Deployments
For the optimal migration method, you can set up a UNIX shell or Windows batch script that calls the pmrep DeployDeploymentGroup command, which can execute a deployment group migration without human intervention. This is ideal because the deployment group allows the greatest flexibility and convenience: the script can be scheduled to run overnight, thereby causing minimal impact on developers and the PowerCenter administrator. You can also use the pmrep utility to automate importing objects via XML.
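As a rough illustration, such a script might assemble the pmrep calls as below. All repository names, the control file, and the connection flags are assumptions for this sketch; consult the pmrep command reference for the exact connect and DeployDeploymentGroup options in your version.

```python
# Sketch of an automated nightly deployment driver (hypothetical names and
# paths; the exact pmrep flags vary by PowerCenter version, so treat this
# as an outline rather than a definitive script).
import subprocess

PMREP = "pmrep"  # assumed to be on the PATH

def deploy(group, control_file, target_repo):
    """Return the two pmrep calls needed for a deployment group migration."""
    return [
        # 1. Connect to the source repository (credentials would normally
        #    come from secured configuration, not literals).
        [PMREP, "connect", "-r", "DEV_REPO", "-n", "deploy_user", "-X", "PMPASS"],
        # 2. Deploy the group using a deployment control file.
        [PMREP, "deploydeploymentgroup", "-p", group,
         "-c", control_file, "-r", target_repo],
    ]

for cmd in deploy("RELEASE_20050130", "deploy_control.xml", "TEST_REPO"):
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # enable in a real environment
```

Scheduling this script overnight through cron or the Windows Task Scheduler gives the unattended migration described above.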
Recommendations
Informatica recommends using the following process when running in a three-tiered environment with development, test, and production servers.
Non-Versioned Repositories
For migrating from development into test, Informatica recommends using the Object Copy method. This method gives you total granular control over the objects that are being moved. It also ensures that the latest development mappings can be moved over manually as they are completed. For recommendations on performing this copy procedure correctly, see the steps listed in the Object Copy section.
Versioned Repositories
For versioned repositories, Informatica recommends using the Deployment Groups method for repository migration in a distributed repository environment. This method provides the greatest flexibility in that you can promote any object from within a development repository (even across folders) into any destination repository. Also, by using labels, dynamic deployment groups, and the enhanced pmrep command line utility, the use of the deployment group migration method results in automated migrations that can be executed without manual intervention.
Third-Party Versioning
Some organizations have standardized on third-party version control software. PowerCenter's XML import/export functionality offers integration with such software and provides a means to migrate objects. This method is most useful in a distributed environment because objects can be exported into an XML file from one repository and imported into the destination repository. The XML Object Copy Process allows you to copy nearly all repository objects, including sources, targets, reusable transformations, mappings, mapplets, workflows, worklets, and tasks. Beginning with PowerCenter 7, the export/import functionality allows the export/import of multiple objects to a single XML file. This can significantly cut down on the work associated with object-level XML import/export.
The following steps outline the process of exporting the objects from source repository and importing them into the destination repository:
Exporting
1. From Designer or Workflow Manager, log in to the source repository. Open the folder and highlight the object to be exported.
2. Select Repository -> Export Objects.
3. The system prompts you to select a directory location on the local workstation. Choose the directory to save the file. Using the default name for the XML file is generally recommended.
4. Open Windows Explorer and go to the C:\Program Files\Informatica PowerCenter 7.x\Client directory. (This may vary depending on where you installed the client tools.)
5. Find the powrmart.dtd file, make a copy of it, and paste the copy into the directory where you saved the XML file.
6. Together, these files are now ready to be added to the version control software.
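Before checking an exported file into version control, it can be useful to verify what it contains. The following sketch lists the named objects in an export file; the element and attribute names (POWERMART, MAPPING, NAME, and so on) reflect a typical PowerCenter export but should be verified against your own powrmart.dtd.

```python
# Sketch: inspecting an exported object XML before adding it to version
# control. The sample document below is illustrative, not a real export.
import xml.etree.ElementTree as ET

def list_named_objects(xml_text):
    """Return (tag, NAME) pairs for every element carrying a NAME attribute."""
    root = ET.fromstring(xml_text)
    return [(el.tag, el.get("NAME")) for el in root.iter() if el.get("NAME")]

sample = """<POWERMART>
  <REPOSITORY NAME="DEV_REPO">
    <FOLDER NAME="SALES">
      <MAPPING NAME="m_load_customers"/>
    </FOLDER>
  </REPOSITORY>
</POWERMART>"""

for tag, name in list_named_objects(sample):
    print(tag, name)
```

A quick listing like this makes it easy to confirm that the expected mappings and folders made it into the export before the file is committed.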
Importing
1. Log into Designer or the Workflow Manager client tool and log in to the destination repository. Open the folder where the object is to be imported.
2. Select Repository -> Import Objects.
3. The system prompts you to select a directory location and file to import into the repository.
4. The following screen appears with the steps for importing the object.
5. Click "Next", and then click "Import". Since the shortcuts have been added to the folder, the mapping will now point to the new shortcuts and their parent folder. 6. It is important to note that the pmrep command line utility was greatly enhanced in PowerCenter 7 and later versions, allowing the activities associated with XML import/export to be automated through pmrep. 7. Click on the destination repository service on the left pane and choose the Action drop-down list box -> Restore. Remember, if the destination repository has content, it has to be deleted prior to restoring).
Description
There are two approaches to perform a migration.
- Using the DTLURDMO utility
- Using the PowerExchange Client tool (Detail Navigator)
DTLURDMO Utility

Step 1: Validate connectivity between the client and listeners
Test communication between clients and all listeners in the production environment with: dtlrexe prog=ping loc=<nodename>.
Run selected jobs to exercise data access through PowerExchange data maps.
- On MVS, the input statements for this utility are taken from SYSIN.
- On non-MVS platforms, the input argument points to a file containing the input definition. If no input argument is provided, the utility looks for a file dtlurdmo.ini in the current path.
- The utility runs on all capture platforms.
DTLURDMO definition file specification: this file specifies how the DTLURDMO utility operates.

- On non-MVS platforms, if no definition file is specified, the utility looks for a file dtlurdmo.ini in the current path.
- On MVS, the definition is read from the SYSIN card.
AS/400 utility
Syntax: CALL PGM(<location and name of DTLURDMO executable file>)

For example: CALL PGM(dtllib/DTLURDMO)
- DTLURDMO definition file specification: this file specifies how the DTLURDMO utility operates. By default, the definition is in the member CFG/DTLURDMO in the current datalib library.
If you want to create a separate DTLURDMO definition file rather than use the default location, you must give the library and filename of the definition file as a parameter. For example: CALL PGM(dtllib/DTLURDMO) parm ('datalib/deffile(dtlurdmo)')
Running DTLURDMO
The utility should be run extracting information from the files locally, then writing out the datamaps through the new PowerExchange V8.x.x Listener. This causes the datamaps to be written out in the format required for the upgraded PowerExchange. DTLURDMO must be run once for the datamaps, then again for the registrations, and then the extract maps if this is a capture environment. Commands for mixed datamaps, registrations, and extract maps cannot be run together.
If only a subset of the PowerExchange datamaps, registrations, and extract maps are required, then selective copies can be carried out. Details of performing selective copies are documented fully in the PowerExchange Utilities Guide. This document assumes that everything is going to be migrated from the existing environment to the new V8.x.x format.
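The three separate runs described above can be scripted. In the sketch below the definition file names are hypothetical; each file would hold the input statements for one object type, as documented in the PowerExchange Utilities Guide.

```python
# Sketch: driving the three required DTLURDMO passes (datamaps, then
# registrations, then extract maps) from a script. Mixed object types
# cannot share a run, hence one definition file per pass.
import subprocess

PASSES = [
    "dtlurdmo_datamaps.ini",       # pass 1: datamaps
    "dtlurdmo_registrations.ini",  # pass 2: capture registrations
    "dtlurdmo_extractmaps.ini",    # pass 3: extract maps (capture only)
]

def build_runs(executable="dtlurdmo"):
    """One command per pass, each pointing at its own definition file."""
    return [[executable, ini] for ini in PASSES]

for cmd in build_runs():
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # enable on a real system
```

In a non-capture environment the third pass would simply be dropped from the list.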
PowerExchange Client tool (Detail Navigator)

Step 1: Validate connectivity between the client and listeners
Test communication between clients and all listeners in the production environment with: dtlrexe prog=ping loc=<nodename>.
Run selected jobs to exercise data access through PowerExchange data maps.
- Select the datamap that is going to be promoted to production.
- On the menu bar, select a file to send to the remote node.
- On the drop-down list box, choose the appropriate location (in this case, mvs_prod).
Description
When a task in the workflow fails at any point, one option is to truncate the target and run the workflow again from the beginning. Load Manager architecture offers an alternative to this scenario: the workflow can be suspended and the user can fix the error rather than re-processing the portion of the workflow with no errors. This option, "Suspend on Error", results in accurate and complete target data, as if the session completed successfully with one run.
The Integration Service saves session recovery information and updates recovery tables for a target database. If the session is interrupted, the Integration Service uses the saved recovery information to recover it.
Restart task
- The Integration Service does not save session recovery information.
- If the session is interrupted, the Integration Service reruns the session during recovery.
When a task fails in the workflow, the Integration Service stops running tasks in the path. The Integration Service does not evaluate the output link of the failed task. If no other task is running in the workflow, the Workflow Monitor displays the status of the workflow as "Suspended." If one or more tasks are still running in the workflow when a task fails, the Integration Service stops running the failed task and continues running tasks in other paths. The Workflow Monitor displays the status of the workflow as "Suspending."

When the status of the workflow is "Suspended" or "Suspending," you can fix the error, such as a target database error, and recover the workflow in the Workflow Monitor. When you recover a workflow, the Integration Service restarts the failed tasks and continues evaluating the rest of the tasks in the workflow. The Integration Service does not run any task that already completed successfully.

Note: You can no longer recover individual sessions in a workflow. To recover a session, you recover the workflow.
If the truncate table option is enabled in a recovery-enabled session, the target table is not truncated during the recovery process.
Session Logs
In a suspended workflow scenario, the Integration Service uses the existing session log when it resumes the workflow from the point of suspension. However, the earlier runs that caused the suspension are recorded in the historical run information in the repository.
Suspension Email
The workflow can be configured to send an email when the Integration Service suspends the workflow. When a task fails, the server suspends the workflow and sends the suspension email. The user can then fix the error and resume the workflow. If another task fails while the Integration Service is suspending the workflow, the server does not send another suspension email. The Integration Service only sends out another suspension email if another task fails after the workflow resumes. Check the "Browse Emails" button on the General tab of the Workflow Designer Edit sheet to configure the suspension email.
Suspending Worklets
When the "Suspend On Error" option is enabled for the parent workflow, the Integration Service also suspends the worklet if a task within the worklet fails. When a task in the worklet fails, the server stops executing the failed task and other tasks in its path. If no other task is running in the worklet, the status of the worklet is "Suspended". If other tasks are still running in the worklet, the status of the worklet is "Suspending". The parent workflow is also suspended when the worklet is "Suspended" or "Suspending".
Starting Recovery
The recovery process can be started using the Workflow Manager or Workflow Monitor client tools. Alternatively, the recovery process can be started using pmcmd in command-line mode or using a script.
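When starting recovery from a script, the pmcmd invocation might be assembled as follows. The service, domain, folder, and workflow names are placeholders, and the recoverworkflow option names should be confirmed against the pmcmd reference for your version.

```python
# Sketch: starting workflow recovery via pmcmd from an operations script.
# All names are hypothetical; in practice the password would come from an
# environment variable or encrypted value, never a literal.
import subprocess

def build_recover_command(service, domain, folder, workflow, user):
    """Assemble a pmcmd recoverworkflow call for a suspended workflow."""
    return [
        "pmcmd", "recoverworkflow",
        "-sv", service,   # Integration Service name
        "-d", domain,     # domain name
        "-u", user,       # repository user
        "-f", folder,     # folder containing the workflow
        workflow,
    ]

cmd = build_recover_command("IS_PROD", "Domain_Prod", "SALES",
                            "wf_load_sales", "ops_user")
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # enable in a real environment
```

An operations team can wrap this in error-handling logic so that a failed load is fixed and recovered without reloading targets from scratch.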
When the Integration Service recovers the session, it uses information in the recovery tables to determine where to begin loading data to target tables. If you want the Integration Service to create the recovery tables, grant table creation privilege to the database user name for the target database connection. If you do not want the Integration Service to create the recovery tables, create the recovery tables manually. The Integration Service creates the following recovery tables in the target database:
- PM_RECOVERY. Contains target load information for the session run. The Integration Service removes the information from this table after each successful session and initializes the information at the beginning of subsequent sessions.
- PM_TGT_RUN_ID. Contains information the Integration Service uses to identify each target on the database. The information remains in the table between session runs. If you manually create this table, you must create a row and enter a value other than zero for LAST_TGT_RUN_ID to ensure that the session recovers successfully.
Do not edit or drop the recovery tables before you recover a session. If you disable recovery, the Integration Service does not remove the recovery tables from the target database. You must manually remove the recovery tables.
Unrecoverable Sessions
The following options affect whether the session is incrementally recoverable:
- Output is deterministic. A property that determines if the transformation generates the same set of data for each session run. You can set this property for SDK sources and Custom transformations.
- Output is repeatable. A property that determines if the transformation generates the data in the same order for each session run. You can set this property for Custom transformations.
- Lookup source is static. A Lookup transformation property that determines if the lookup source is the same between the session and recovery. The Integration Service uses this property to determine if the output is deterministic.
For recovery to be effective, the recovery session must produce the same set of rows, in the same order. Any change after the initial failure, whether in the mapping, the session, or the server, that affects the ability to produce repeatable data results in inconsistent data during the recovery process. The following cases may produce inconsistent data during a recovery session:
- Session performs incremental aggregation and the server stops unexpectedly.
- Mapping uses a Sequence Generator transformation.
- Mapping uses a Normalizer transformation.
- Source and/or target changes after the initial session failure.
- Data movement mode changes after the initial session failure.
- Code page (server, source, or target) changes after the initial session failure.
- Mapping changes in a way that causes the server to distribute, filter, or aggregate rows differently.
- Session configurations are not supported by PowerCenter for session recovery.
- Mapping uses a lookup table and the data in the lookup table changes between session runs.
- Session sort order changes when the server is running in Unicode mode.
HA Recovery
Highly-available recovery allows the workflow to resume automatically in case the Integration Service fails over. The following options are available in the properties tab of the workflow:
- Enable HA recovery. Allows the workflow to be configured for high availability.
- Automatically recover terminated tasks. Recovers terminated Session or Command tasks without user intervention. You must have high availability and the workflow must still be running.
- Maximum automatic recovery attempts. When you automatically recover terminated tasks, you can choose the number of times the Integration Service attempts to recover the task. Default is 5.
Note: To run a workflow in HA recovery, you must have an HA license for the Repository Service.
Description
A label is a versioning object that can be associated with any versioned object or group of versioned objects in a repository. Labels provide a way to tag a number of object versions with a name for later identification. Therefore, a label is a named object in the repository, whose purpose is to be a pointer or reference to a group of versioned objects. For example, a label called Project X version X can be applied to all object versions that are part of that project and release. Labels can be used for many purposes:
- Track versioned objects during development.
- Improve object query results.
- Create logical groups of objects for future deployment.
- Associate groups of objects for import and export.
Note that labels apply to individual object versions, and not objects as a whole. So if a mapping has ten versions checked in, and a label is applied to version 9, then only version 9 has that label. The other versions of that mapping do not automatically inherit that label. However, multiple labels can point to the same object for greater flexibility. The Use Repository Manager privilege is required in order to create or edit labels. To create a label, choose Versioning > Labels from the Repository Manager.
When creating a new label, choose a name that is as descriptive as possible. For example, a suggested naming convention for labels is: Project_Version_Action. Include comments for further meaningful description. Locking the label is also advisable. This prevents anyone from accidentally associating additional objects with the label or removing object references for the label. Labels, like other global objects such as Queries and Deployment Groups, can have user and group privileges attached to them. This allows an administrator to create a label that can only be used by specific individuals or groups. Only those people working on a specific project should be given read/write/execute permissions for labels that are assigned to that project.
Once a label is created, it should be applied to related objects. To apply the label to objects, invoke the Apply Label wizard from the Versioning >> Apply Label menu option from the menu bar in the Repository Manager (as shown in the following figure).
Applying Labels
Labels can be applied to any object and cascaded upwards and downwards to parent and/or child objects. For example, to group dependencies for a workflow, apply a label to all children objects. The Repository Server applies labels to sources, targets, mappings, and tasks associated with the workflow. Use the Move label property to point the label to the latest version of the object(s). Note: Labels can be applied to any object version in the repository except checked-out versions. Execute permission is required for applying labels. After the label has been applied to related objects, it can be used in queries and deployment groups (see the Best Practice on Deployment Groups). Labels can also be used to manage the size of the repository (i.e., to purge object versions).
Repository Administrators and other individuals in charge of migrations should develop their own label strategies and naming conventions in the early stages of a data integration project. Be sure that developers are aware of the uses of these labels and when they should apply labels. For each planned migration between repositories, choose three labels for the development and subsequent repositories:
- The first is to identify the objects that developers can mark as ready for migration.
- The second should apply to migrated objects, thus developing a migration audit trail.
- The third is to apply to objects as they are migrated into the receiving repository, completing the migration audit trail.
When preparing for the migration, use the first label to construct a query to build a dynamic deployment group. The second and third labels in the process are optionally applied by the migration wizard when copying folders between versioned repositories. Developers and administrators do not need to apply the second and third labels manually. Additional labels can be created with developers to allow the progress of mappings to be tracked if desired. For example, when an object is successfully unit-tested by the developer, it can be marked as such. Developers can also label the object with a migration label at a later time if necessary. Using labels in this fashion along with the query feature allows complete or incomplete objects to be identified quickly and easily, thereby providing an object-based view of progress.
Description
Data Analyzer repository objects can be exported to and imported from Extensible Markup Language (XML) files. Export/import facilitates archiving the Data Analyzer repository and deploying Data Analyzer Dashboards and reports from development to production. The following repository objects in Data Analyzer can be exported and imported:
- Schemas
- Reports
- Time Dimensions
- Global Variables
- Dashboards
- Security profiles
- Schedules
- Users
- Groups
- Roles
The XML file created after exporting objects should not be modified. Any change might invalidate the XML file and result in failure when importing objects into a Data Analyzer repository. For more information on exporting objects from the Data Analyzer repository, refer to the Data Analyzer Administration Guide.
Exporting Schema(s)
To export the definition of a star schema or an operational schema, you need to select a metric or folder from the Metrics system folder in the Schema Directory. When you export a folder, you export the schema associated with the definitions of the metrics in that folder and its subfolders. If the folder you select for export does not contain any objects, Data Analyzer does not export any schema definition and displays the following message: There is no content to be exported. There are two ways to export metrics or folders containing metrics:
- Select the Export Metric Definitions and All Associated Schema Table and Attribute Definitions option. If you select to export a metric and its associated schema objects, Data Analyzer exports the definitions of the metric and the schema objects associated with that metric. If you select to export an entire metric folder and its associated objects, Data Analyzer exports the definitions of all metrics in the folder, as well as schema objects associated with every metric in the folder.
- Alternatively, select the Export Metric Definitions Only option. When you choose to export only the definition of the selected metric, Data Analyzer does not export the definition of the schema table from which the metric is derived or any other associated schema object.
1. Log in to Data Analyzer as a System Administrator.
2. Click the Administration tab > XML Export/Import > Export Schemas.
3. All the metric folders in the schema directory are displayed. Click Refresh Schema to display the latest list of folders and metrics in the schema directory.
4. Select the check box for the folder or metric to be exported and click the Export as XML option.
5. Enter the XML file name and click Save to save the XML file.
6. The XML file is stored locally on the client machine.
Exporting Report(s)
To export the definitions of more than one report, select multiple reports or folders. Data Analyzer exports only report definitions. It does not export the data or the schedule for cached reports. As part of the Report Definition export, Data Analyzer exports the report table, report chart, filters, indicators (i.e., gauge, chart, and table indicators), custom metrics, links to similar reports, and all reports in an analytic workflow, including links to similar reports.
Reports can have public or personal indicators associated with them. By default, Data Analyzer exports only the public indicators associated with a report. To export the personal indicators as well, select the Export Personal Indicators check box.

To export an analytic workflow, you need to export only the originating report. When you export the originating report of an analytic workflow, Data Analyzer exports the definitions of all the workflow reports. If a report in the analytic workflow has similar reports associated with it, Data Analyzer exports the links to the similar reports.

Data Analyzer does not export the alerts, schedules, or global variables associated with the report. Although Data Analyzer does not export global variables, it lists all global variables it finds in the report filter. You can, however, export these global variables separately.

1. Log in to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Export Reports.
3. Select the folder or report to be exported.
4. Click Export as XML.
5. Enter the XML file name and click Save to save the XML file.
6. The XML file is stored locally on the client machine.
Exporting a Dashboard
Whenever a dashboard is exported, Data Analyzer exports the reports, indicators, shared documents, and gauges associated with the dashboard. Data Analyzer does not, however, export the alerts, access permissions, attributes or metrics in the report(s), or real-time objects.

You can export any of the public dashboards defined in the repository, and can export more than one dashboard at a time.

1. Log in to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Export Dashboards.
3. Select the dashboard to be exported.
4. Click Export as XML.
5. Enter the XML file name and click Save to save the XML file.
6. The XML file is stored locally on the client machine.
Exporting a Schedule
You can export a time-based or event-based schedule to an XML file. Data Analyzer runs a report with a time-based schedule on a configured schedule. Data Analyzer runs a report with an event-based schedule when a PowerCenter session completes. When you export a schedule, Data Analyzer does not export the history of the schedule.

1. Log in to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Export Schedules.
3. Select the schedule to be exported.
4. Click Export as XML.
5. Enter the XML file name and click Save to save the XML file.
6. The XML file is stored locally on the client machine.
Exporting Users

Each user definition includes the following information:

- Login name
- Description
- First, middle, and last name
- Title
- Password
- Change password privilege
- Password never expires indicator
- Account status
- Groups to which the user belongs
- Roles assigned to the user
- Query governing settings
Data Analyzer does not export the email address, reply-to address, department, or color scheme assignment associated with the exported user(s).

1. Log in to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Export User/Group/Role.
3. Click Export Users/Group(s)/Role(s).
4. Select the user(s) to be exported.
5. Click Export as XML.
6. Enter the XML file name and click Save to save the XML file.
7. The XML file is stored locally on the client machine.
Exporting Groups
You can export any group defined in the repository, and can export the definitions of multiple groups. You can also export the definitions of all the users within a selected group. Use the asterisk (*) or percent symbol (%) as wildcard characters to search for groups to export. Each group definition includes the following information:
- Name
- Description
- Department
- Color scheme assignment
- Group hierarchy
- Roles assigned to the group
- Users assigned to the group
- Query governing settings
Data Analyzer does not export the color scheme associated with an exported group.

1. Log in to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Export User/Group/Role.
3. Click Export Users/Group(s)/Role(s).
4. Select the group to be exported.
5. Click Export as XML.
6. Enter the XML file name and click Save to save the XML file.
7. The XML file is stored locally on the client machine.
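The wildcard search for groups described above accepts both the asterisk (*) and the percent symbol (%). The guide does not spell out the exact matching rules, but the sketch below illustrates the usual interpretation, where either character matches any sequence of characters. The group names here are hypothetical examples, not Data Analyzer defaults.

```python
import re

def wildcard_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a search pattern where '*' and '%' each match any sequence
    of characters into a compiled regular expression. The exact semantics
    Data Analyzer uses are an assumption, not documented behavior."""
    parts = []
    for ch in pattern:
        if ch in ("*", "%"):
            parts.append(".*")       # wildcard: match any run of characters
        else:
            parts.append(re.escape(ch))  # everything else matches literally
    return re.compile("^" + "".join(parts) + "$", re.IGNORECASE)

# Hypothetical group names to illustrate the search behavior
groups = ["Finance", "Finance_EU", "Sales", "HR_Admins"]
matcher = wildcard_to_regex("Fin%")
print([g for g in groups if matcher.match(g)])  # → ['Finance', 'Finance_EU']
```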
Exporting Roles
You can export the definitions of the custom roles defined in the repository. However, you cannot export the definitions of the system roles defined by Data Analyzer. You can export the definitions of more than one role. Each role definition includes the name and description of the role and the permissions assigned to it.

1. Log in to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Export User/Group/Role.
3. Click Export Users/Group(s)/Role(s).
4. Select the role to be exported.
5. Click Export as XML.
6. Enter the XML file name and click Save to save the XML file.
7. The XML file is stored locally on the client machine.
Importing Objects
You can import objects into the same repository or a different repository. If you import objects that already exist in the repository, you can choose to overwrite the existing objects. However, you can import only global variables that do not already exist in the repository.

When you import objects, you can validate the XML file against the DTD provided by Data Analyzer. Informatica recommends that you do not modify the XML files after you export them from Data Analyzer. Ordinarily, you do not need to validate an XML file that you create by exporting from Data Analyzer. However, if you are not sure of the validity of an XML file, you can validate it against the Data Analyzer DTD file when you start the import process.

To import repository objects, you must have the System Administrator role or the Access XML Export/Import privilege. When you import a repository object, you become the owner of the object as if you created it. However, other system administrators can also access imported repository objects. You can limit access to reports for users who are not system administrators. If you choose to publish imported reports to everyone, all users in Data Analyzer have read and write access to them. You can change the access permissions to reports after you import them.
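Because a hand-edited or incorrectly transferred XML file can fail to import, a cheap sanity check before starting the import is to confirm the file still parses. The sketch below uses only the Python standard library, so it checks well-formedness only; full validation against the Data Analyzer DTD is done by Data Analyzer itself during import (or would require an external XML library).

```python
import xml.etree.ElementTree as ET

def is_well_formed(path: str) -> bool:
    """Pre-import sanity check: confirm the exported XML file still parses.
    This does NOT replace Data Analyzer's own DTD validation; it only
    catches files corrupted by manual edits or a non-binary transfer."""
    try:
        ET.parse(path)
        return True
    except ET.ParseError:
        return False
```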
Importing Schemas
When importing schemas, if the XML file contains only the metric definition, you must make sure that the fact table for the metric exists in the target repository. You can import a metric only if its associated fact table exists in the target repository or the definition of its associated fact table is also in the XML file.

When you import a schema, Data Analyzer displays a list of all the definitions contained in the XML file. It then displays a list of all the object definitions in the XML file that already exist in the repository. You can choose to overwrite objects in the repository. If you import a schema that contains time keys, you must import or create a time dimension.

1. Log in to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Import Schema.
3. Click Browse to choose an XML file to import.
4. Select Validate XML against DTD.
5. Click Import XML.
6. Verify all attributes on the summary page, and choose Continue.
Importing Reports
A valid XML file of exported report objects can contain definitions of cached or on-demand reports, including prompted reports. When you import a report, you must make sure that all the metrics and attributes used in the report are defined in the target repository. If you import a report that contains attributes and metrics not defined in the target repository, you can cancel the import process. If you choose to continue the import process, you may not be able to run the report correctly. To run the report, you must import or add the attribute and metric definitions to the target repository.

You are the owner of all the reports you import, including the personal or public indicators associated with the reports. You can publish the imported reports to all Data Analyzer users. If you publish reports to everyone, Data Analyzer provides read access to the reports to all users. However, it does not provide access to the folder that contains the imported reports. If you want another user to access an imported report, you can put the imported report in a public folder and have the user save or move it to his or her personal folder. Any public indicator associated with the report also becomes accessible to the user.

If you import a report and its corresponding analytic workflow, the XML file contains all workflow reports. If you choose to overwrite the report, Data Analyzer also overwrites the workflow reports. Also, when importing multiple workflows, note that Data Analyzer does not import analytic workflows containing the same workflow report names, so ensure that all imported analytic workflows have unique report names prior to import.

1. Log in to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Import Report.
3. Click Browse to choose an XML file to import.
4. Select Validate XML against DTD.
5. Click Import XML.
6. Verify all attributes on the summary page, and choose Continue.
Importing Global Variables

You can import global variables that are not defined in the target repository. If the XML file contains global variables already in the repository, you can cancel the process. If you continue the import process, Data Analyzer imports only the global variables not already in the target repository.

1. Log in to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Import Global Variables.
3. Click Browse to choose an XML file to import.
4. Select Validate XML against DTD.
5. Click Import XML.
6. Verify all attributes on the summary page, and choose Continue.
Importing Dashboards
Dashboards display links to reports, shared documents, alerts, and indicators. When you import a dashboard, Data Analyzer imports the following objects associated with the dashboard:
- Reports
- Indicators
- Shared documents
- Gauges
Data Analyzer does not import the following objects associated with the dashboard:
- Alerts
- Access permissions
- Attributes and metrics in the report
- Real-time objects
If an object already exists in the repository, Data Analyzer provides an option to overwrite it. Data Analyzer does not import the attributes and metrics in the reports associated with the dashboard. If the attributes or metrics in a report associated with the dashboard do not exist, the report does not display on the imported dashboard.

1. Log in to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Import Dashboard.
3. Click Browse to choose an XML file to import.
4. Select Validate XML against DTD.
5. Click Import XML.
6. Verify all attributes on the summary page, and choose Continue.
Importing Security Profile(s)

- To associate the imported security profiles with all the users on the page, select the "Users" check box at the top of the list.
- To associate the imported security profiles with all the users in the repository, select Import to All.
- To overwrite the selected users' current security profiles with the imported security profile, select Overwrite.
- To append the imported security profile to the selected users' current security profiles, select Append.
5. Click Browse to choose an XML file to import.
6. Select Validate XML against DTD.
7. Click Import XML.
8. Verify all attributes on the summary page, and choose Continue.
Importing Schedule(s)
A time-based schedule runs reports on a configured schedule. An event-based schedule runs reports when a PowerCenter session completes. You can import time-based or event-based schedules from an XML file. When you import a schedule, Data Analyzer does not attach the schedule to any reports.

1. Log in to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Import Schedule.
3. Click Browse to choose an XML file to import.
4. Select Validate XML against DTD.
5. Click Import XML.
6. Verify all attributes on the summary page, and choose Continue.
- Perform import and export operations during periods of low Data Analyzer activity, when most of the users are not accessing the Data Analyzer repository. This helps prevent users from experiencing timeout errors or degraded response time. Only the System Administrator should perform import/export operations.
- Take a backup of the Data Analyzer repository prior to performing an import/export operation. This backup should be completed using the Repository Backup Utility provided with Data Analyzer.
- Manually add user/group permissions for the report. These permissions are not exported as part of exporting reports and should be manually added after the report is imported on the desired server.
- Use a version control tool. Prior to importing objects into a new environment, it is advisable to check the XML documents into a version control tool such as Microsoft Visual SourceSafe or PVCS. This facilitates the versioning of repository objects and provides a means for rollback to a prior version of an object, if necessary.
- Attach cached reports to schedules. Data Analyzer does not import the schedule with a cached report. When you import cached reports, you must attach them to schedules in the target repository. You can attach multiple imported reports to schedules in the target repository in one process immediately after you import them.
- Ensure that global variables exist in the target repository. If you import a report that uses global variables in the attribute filter, ensure that the global variables already exist in the target repository. If they are not in the target repository, you must either import the global variables from the source repository or recreate them in the target repository.
- Manually add indicators to the dashboard. When you import a dashboard, Data Analyzer imports all indicators for the originating report and workflow reports in a workflow. However, indicators for workflow reports do not display on the dashboard until added manually after the import.
- Check with your System Administrator to understand what level of LDAP integration has been configured (if any). Users, groups, and roles need to be exported and imported during deployment when using repository authentication. If Data Analyzer has been integrated with an LDAP (Lightweight Directory Access Protocol) tool, then users, groups, and/or roles may not require deployment.
When you import users into a Microsoft SQL Server or IBM DB2 repository, Data Analyzer blocks all user authentication requests until the import process is complete.
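The backup and version-control recommendations above can be partially automated. The sketch below snapshots a folder of exported XML files into a timestamped archive folder before an import, so a prior version can be restored if the import goes wrong. The directory layout and file names are illustrative assumptions, not Data Analyzer conventions, and this supplements rather than replaces the Repository Backup Utility.

```python
import shutil
from datetime import datetime
from pathlib import Path

def archive_exports(export_dir: str, archive_root: str) -> Path:
    """Copy every exported .xml file into a timestamped archive folder.
    Paths and naming are hypothetical; adapt to your deployment layout."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    target = Path(archive_root) / stamp
    target.mkdir(parents=True, exist_ok=True)
    for xml_file in Path(export_dir).glob("*.xml"):
        # copy2 preserves timestamps, which helps when auditing archives
        shutil.copy2(xml_file, target / xml_file.name)
    return target
```

A checked-in copy under a version control tool gives the same rollback ability with full history; this script is only a lightweight alternative for environments without one.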
Description
Consider the following questions when determining what type of hardware to use for Data Analyzer.

If the hardware already exists:

1. Are the processor, operating system, and database software supported by Data Analyzer?
2. Are the necessary operating system and database patches applied?
3. How many CPUs does the machine currently have? Can the CPU capacity be expanded?
4. How much memory does the machine have? How much is available to the Data Analyzer application?
5. Will Data Analyzer share the machine with other applications? If yes, what are the CPU and memory requirements of the other applications?
If the hardware does not already exist:

1. Has the organization standardized on a hardware or operating system vendor?
2. What type of operating system is preferred and supported (e.g., Solaris, Windows, AIX, HP-UX, Red Hat AS, SuSE)?
3. What database and version is preferred and supported for the Data Analyzer repository?

Regardless of the hardware vendor chosen, the hardware must be configured and sized appropriately to support the reporting response-time requirements for Data Analyzer. The following questions should be answered in order to estimate the size of a Data Analyzer server:

1. How many users are predicted for concurrent access?
2. On average, how many rows will be returned in each report?
3. On average, how many charts will there be for each report?
4. Do the business requirements mandate an SSL Web server?
The hardware requirements for the Data Analyzer environment depend on the number of concurrent users, types of reports being used (i.e., interactive vs. static), average number of records in a report, application server and operating system used, among other factors. The following table should be used as a general guide for hardware recommendations for a Data Analyzer installation. Actual results may vary depending upon exact hardware configuration and user volume. For exact sizing recommendations, contact Informatica Professional Services for a Data Analyzer Sizing and Baseline Architecture engagement.
Windows
| # of Concurrent Users | Avg. Rows per Report | Avg. Charts per Report | Est. CPUs for Peak Usage | Est. Total RAM (Data Analyzer alone) | Est. App Servers in a Clustered Environment |
|---|---|---|---|---|---|
| 50 | 1000 | — | — | 1 GB | 1 |
| 100 | 1000 | 2 | 3 | — | — |
| 200 | 1000 | 2 | 6 | — | — |
| 400 | 1000 | 2 | 12 | — | — |
| 100 | 1000 | 2 | 3 | — | — |
| 100 | 2000 | 2 | 3 | — | — |
| 100 | 5000 | 2 | 4 | — | — |
| 100 | 10000 | 2 | 5 | — | — |
| 100 | 1000 | 2 | 3 | — | — |
| 100 | 1000 | 5 | 3 | — | — |
| 100 | 1000 | 7 | 3 | — | — |
| 100 | 1000 | 10 | 3-4 | — | — |
Notes:
1. This estimating guide is based on experiments conducted in the Informatica lab.
2. The sizing estimates are based on PowerAnalyzer 5 running BEA WebLogic 8.1 SP3 on Windows 2000, on a 4-CPU 2.5 GHz Xeon processor. This estimate may not be accurate for other environments.
3. The number of concurrent users under peak volume can be estimated as the total number of users multiplied by the percentage of concurrent users. In practice, typically 10 percent of the user base is concurrent, although this percentage can be as high as 50 percent or as low as 5 percent in some organizations.
4. For every two CPUs on the server, Informatica recommends one managed server (instance) of the application server. For servers with at least four CPUs, clustering multiple logical instances of the application server on one physical server can result in increased performance.
5. There is an increase in overhead for an SSL Web server architecture, depending on the strength of encryption.
6. CPU utilization can be reduced by 10 to 25 percent by using SVG charts, otherwise known as interactive charting, rather than the default PNG charting.
7. Clustering is recommended for instances with more than 50 concurrent users. (Clustering does not have to be across multiple boxes if the server has four or more CPUs.)
8. Informatica Professional Services should be engaged for a thorough and accurate sizing estimate.
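Notes 3 and 4 give two rules of thumb that can be combined into a quick back-of-the-envelope calculation: concurrent users ≈ total users × concurrency rate (typically 10 percent), and one application server instance per two CPUs. The sketch below encodes only those two rules; real sizing also depends on report size, charting, SSL, and the other factors listed above.

```python
import math

def estimate_sizing(total_users: int, concurrency_rate: float = 0.10,
                    cpus: int = 4) -> dict:
    """Back-of-the-envelope sizing from the notes above:
    - peak concurrent users = total users x concurrency rate (default 10%)
    - one app server instance per two CPUs (at least one instance)
    This is a rough planning aid, not a substitute for a sizing engagement."""
    concurrent = math.ceil(total_users * concurrency_rate)
    instances = max(1, cpus // 2)
    return {"concurrent_users": concurrent, "app_server_instances": instances}

# 1000 total users at the typical 10% concurrency on a 4-CPU server
print(estimate_sizing(1000))  # → {'concurrent_users': 100, 'app_server_instances': 2}
```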
IBM AIX
| # of Concurrent Users | Avg. Rows per Report | Avg. Charts per Report | Est. CPUs for Peak Usage | Est. Total RAM (Data Analyzer alone) | Est. App Servers in a Clustered Environment |
|---|---|---|---|---|---|
| 50 | 1000 | 2 | 2 | 1 GB | 1 |
| 100 | 1000 | 2 | 2-3 | 2 GB | 1 |
| 200 | 1000 | 2 | 4-5 | 3.5 GB | 2-3 |
| 400 | 1000 | 2 | 9-10 | 6 GB | 4-5 |
| 100 | 1000 | 2 | — | 2 GB | — |
| 100 | 2000 | 2 | — | 2 GB | — |
| 100 | 5000 | 2 | — | 3 GB | — |
| 100 | 10000 | 2 | — | 4 GB | — |
| 100 | 1000 | 2 | — | 2 GB | — |
| 100 | 1000 | 5 | — | 2 GB | — |
| 100 | 1000 | 7 | — | 2 GB | — |
| 100 | 1000 | 10 | — | 2.5 GB | — |
Notes:
1. This estimating guide is based on experiments conducted in the Informatica lab.
2. The sizing estimates are based on PowerAnalyzer 5 running IBM WebSphere 5.1.1.1 and AIX 5.2.02 on a 4-CPU 2.4 GHz IBM p630. This estimate may not be accurate for other environments.
3. The number of concurrent users under peak volume can be estimated as the total number of users multiplied by the percentage of concurrent users. In practice, typically 10 percent of the user base is concurrent, although this percentage can be as high as 50 percent or as low as 5 percent in some organizations.
4. For every two CPUs on the server, Informatica recommends one managed server (instance) of the application server. For servers with at least four CPUs, clustering multiple logical instances of the application server on one physical server can result in increased performance.
5. Add 30 to 50 percent overhead for an SSL Web server architecture, depending on the strength of encryption.
6. CPU utilization can be reduced by 10 to 25 percent by using SVG charts, otherwise known as interactive charting, rather than the default PNG charting.
7. Clustering is recommended for instances with more than 50 concurrent users. (Clustering does not have to be across multiple boxes if the server has four or more CPUs.)
8. Informatica Professional Services should be engaged for a thorough and accurate sizing estimate.
- Verify that the hardware meets the minimum system requirements for Data Analyzer.
- Ensure that the combination of hardware, operating system, application server, repository database, and, optionally, authentication software is supported by Data Analyzer.
- Ensure that sufficient space has been allocated to the Data Analyzer repository.
- Apply all necessary patches to the operating system and database software.
- Verify connectivity to the data warehouse database (or other reporting source) and the repository database.
- If LDAP or NT Domain is used for Data Analyzer authentication, verify connectivity to the LDAP directory server or the NT primary domain controller.
- Obtain the Data Analyzer license file from technical support.
- On UNIX/Linux installations, verify that the OS user running Data Analyzer has execute privileges on all Data Analyzer installation executables.
In addition to the standard Data Analyzer components that are installed by default, you can also install Metadata Manager. With Version 8.0, the Data Analyzer SDK and Portal Integration Kit are now installed with Data Analyzer. Refer to the Data Analyzer documentation for detailed information for these components.
Repository Configuration
To properly install Data Analyzer, you need the connectivity information for the database server where the repository will reside. This typically includes:

- Database type and version
- Database server host name and port number
- Database or schema name
- Database user ID and password
- Beginning with PowerAnalyzer 5, multiple Data Analyzer instances can be installed on a single instance of JBoss, and other applications can coexist with Data Analyzer on that instance. Although this architecture should be considered during hardware sizing estimates, it allows greater flexibility during installation.
- For JBoss installations on UNIX, the JBoss Server installation program requires an X-Windows server. If JBoss Server is installed on a machine where an X-Windows server is not installed, an X-Windows server must be installed on another machine in order to render graphics for the GUI-based installation program. For more information on installing on UNIX, see the UNIX Servers section of the installation and configuration tips below.
- If the Data Analyzer installation files are transferred to the Data Analyzer server, they must be FTP'd in binary format.
- To enable an installation error log, read the Knowledge Base article "HOW TO: Debug PowerAnalyzer Installations," available through My Informatica (http://my.informatica.com).
- During the Data Analyzer installation process, the user is prompted to choose an authentication method for Data Analyzer, such as repository, NT Domain, or LDAP. If LDAP or NT Domain authentication is used, have the configuration parameters available during installation, as the installer configures all properties files at installation.
- The Data Analyzer license file must be applied prior to starting Data Analyzer.
Configuration Screen
1. Set up the Data Analyzer repository database. The Data Analyzer Server installation process creates the repository tables, but an empty database schema must exist and be reachable via JDBC prior to installation.
2. Install BEA WebLogic and apply the BEA license.
3. Install Data Analyzer.
4. Apply the Data Analyzer license key.
5. Install the Data Analyzer Online Help.
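Step 1 requires that the empty repository schema be reachable via JDBC before the installer runs. A quick network-level pre-check can save a failed installation attempt; the sketch below only proves the database listener port answers (it does not verify credentials, the JDBC driver, or that the schema exists), and the host and port shown are placeholders.

```python
import socket

def db_port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Check that the repository database listener answers on its port
    before starting the installer. Network reachability only: this does
    not validate the JDBC URL, credentials, or the schema itself."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections and timeouts
        return False

# Example with placeholder values (1521 is Oracle's default listener port):
# db_port_reachable("repo-db.example.com", 1521)
```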
TIP: When creating a repository in an Oracle database, make sure the storage parameters specified for the tablespace that contains the repository are not set too large. Because many target tablespaces are initially set with very large INITIAL and NEXT values, large storage parameters cause the repository to use excessive amounts of space. Also verify that the default tablespace for the user that owns the repository tables is set correctly. The following example shows how to set the recommended storage parameters, assuming the repository is stored in the REPOSITORY tablespace:

ALTER TABLESPACE REPOSITORY
DEFAULT STORAGE (
  INITIAL 10K
  NEXT 10K
  MAXEXTENTS UNLIMITED
  PCTINCREASE 50
);
- Beginning with PowerAnalyzer 5, multiple Data Analyzer instances can be installed on a single instance of WebLogic, and other applications can coexist with Data Analyzer on that instance. Although this architecture should be factored into hardware sizing estimates, it allows greater flexibility during installation.
- With Data Analyzer 8, a console version of the installation is available; X-Windows is no longer required for WebLogic installations.
- If the Data Analyzer installation files are transferred to the Data Analyzer server, they must be FTP'd in binary format.
- To enable an installation error log, read the Knowledge Base article "HOW TO: Debug PowerAnalyzer Installations," available through My Informatica (http://my.informatica.com).
- During the Data Analyzer installation process, the user is prompted to choose an authentication method for Data Analyzer, such as repository, NT Domain, or LDAP. If LDAP or NT Domain authentication is used, have the configuration parameters available during installation, since the installer configures all properties files at installation.
- The Data Analyzer license file and BEA WebLogic license must be applied prior to starting Data Analyzer.
Configuration Screen
- Starting with Data Analyzer 5, multiple Data Analyzer instances can be installed on a single instance of WebSphere, and other applications can coexist with Data Analyzer on that instance. Although this architecture should be considered during sizing estimates, it allows greater flexibility during installation.
- With Data Analyzer 8, a console version of the installation is available; X-Windows is no longer required for WebSphere installations.
- For WebSphere on UNIX installations, Data Analyzer must be installed using the root user or system administrator account. Two groups (mqm and mqbrkrs) must be created prior to the installation, and the root account should be added to both of these groups.
- For WebSphere on Windows installations, ensure that Data Analyzer is installed under the padaemon local Windows user ID, which is in the Administrative group and has the advanced user rights "Act as part of the operating system" and "Log on as a service." During the installation, the padaemon account must be added to the mqm group.
- If the Data Analyzer installation files are transferred to the Data Analyzer server, they must be FTP'd in binary format.
- To enable an installation error log, read the Knowledge Base article "HOW TO: Debug PowerAnalyzer Installations," available through My Informatica (http://my.informatica.com).
- During the WebSphere installation process, the user is prompted to enter a directory for the application server and the HTTP (web) server. In both cases, it is advisable to keep the default installation directory. Directory names for the application server and HTTP server that include spaces may result in errors.
- During the Data Analyzer installation process, the user is prompted to choose an authentication method for Data Analyzer, such as repository, NT Domain, or LDAP. If LDAP or NT Domain authentication is used, have the configuration parameters available during installation, as the installer configures all properties files at installation.
- The Data Analyzer license file must be applied prior to starting Data Analyzer.
Configuration Screen
Configuration
Data Analyzer requires a means to render graphics for charting and indicators. When graphics rendering is not configured properly, charts and indicators do not display properly on dashboards or reports.

For Data Analyzer installations using an application server with JDK 1.4 or later, the java.awt.headless=true setting can be set in the application server startup scripts to facilitate graphics rendering for Data Analyzer. If the application server does not use JDK 1.4 or later, use an X-Windows server or XVFB to render graphics. The DISPLAY environment variable should be set to the IP address of the X-Windows or XVFB server prior to starting Data Analyzer.

The application server heap size is the memory allocation for the JVM. The recommended heap size depends on the memory available on the machine hosting the application server and the server load, but the recommended starting point is 512 MB. This is the first setting that should be examined when tuning a Data Analyzer instance.
Description
The PowerCenter Connect for SAP NetWeaver - BW Option supports the SAP Business Information Warehouse as both a source and target.
BW uses a pull model: BW must request data from a source system before the source system can send data to it. PowerCenter must first register with BW using SAP's Remote Function Call (RFC) protocol.

The native interface for communicating with BW is the Staging BAPI, an API published and supported by SAP. Three products in the PowerCenter suite use this API: PowerCenter Designer uses the Staging BAPI to import metadata for the target transfer structures; the PowerCenter Integration Server for BW uses the Staging BAPI to register with BW and receive requests to run sessions; and the PowerCenter Server uses the Staging BAPI to perform metadata verification and load data into BW.

Programs communicating with BW use the SAP standard saprfc.ini file, which is similar to the tnsnames file in Oracle or the interface file in Sybase. The PowerCenter Designer reads metadata from BW, and the PowerCenter Server writes data to BW.

BW requires that all metadata extensions be defined in the BW Administrator Workbench, and the definition must be imported into Designer. An active structure is the target for PowerCenter mappings loading BW.

Because of the pull model, BW must control all scheduling: BW invokes the PowerCenter session when the InfoPackage is scheduled to run in BW. BW supports only insertion of data; there is no concept of updates or deletes through the Staging BAPI.
...which calls the workflow created in the Workflow Manager.

3. Create a mapping. Create a mapping in the Designer that uses the database table or file output target as a source.
4. Create a workflow to extract data from BW. Create a workflow and session task to automate data extraction from BW.
5. Create a Process Chain. A BW Process Chain links programs together to run in sequence. Create a Process Chain to link the InfoSpoke and ABAP programs together.
6. Schedule the data extraction from BW. Set up a schedule in BW to automate data extraction.
- Transports for SAP versions 4.0B to 4.6B, 4.6C, and non-Unicode versions 4.7 and above.
- Transports for SAP Unicode versions 4.7 and above; this category has been added for Unicode extraction support, which was not previously available in SAP versions 4.6 and earlier.
2. Build the BW Components. To load data into BW, you must build components in both BW and PowerCenter. You must first build the BW components in the Administrator Workbench:
- Define PowerCenter as a source system to BW. BW requires an external source definition for all non-R/3 sources.
3. Configure the saprfc.ini file. This file is required for PowerCenter and the Connect for BW server to connect to BW. PowerCenter uses two types of entries to connect to BW through the saprfc.ini file:
- Type A. Used by the PowerCenter Client and PowerCenter Server. Specifies the BW application server.
- Type R. Used by the PowerCenter Connect for SAP NetWeaver - BW Option. Specifies the external program, which is registered at the SAP gateway.

Note: Do not use Notepad to edit the saprfc.ini file because Notepad can corrupt the file. Set the RFC_INI environment variable on all Windows NT, Windows 2000, and Windows 95/98 machines that have a saprfc.ini file; RFC_INI is used to locate the saprfc.ini file.
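As an illustrative sketch (logical names, host names, program ID, and system number below are placeholders; verify the entry format against the SAP RFC documentation), a saprfc.ini might contain one entry of each type:

```
DEST=BWDEV
TYPE=A
ASHOST=bwhost.example.com
SYSNR=00

DEST=INFACBW
TYPE=R
PROGID=INFACBW
GWHOST=bwhost.example.com
GWSERV=sapgw00
```

The Type A entry points the PowerCenter Client and Server at the BW application server; the Type R PROGID must match the program ID under which the Connect for BW server registers at the SAP gateway.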
4. Start the Connect for BW server. Start the Connect for BW server after you start the PowerCenter Server and before you create an InfoPackage in BW.
5. Build mappings. Import the InfoSource into the PowerCenter repository and build a mapping using the InfoSource as a target. The following restrictions apply to building mappings with a BW InfoSource target:

- You cannot use BW as a lookup table.
- You can use only one transfer structure for each mapping.
- You cannot execute stored procedures in a BW target.
- You cannot partition pipelines with a BW target.
- You cannot copy fields that are prefaced with /BIC/ from the InfoSource definition into other transformations.
- You cannot build an update strategy in a mapping. BW supports only inserts; it does not support updates or deletes. You can use an Update Strategy transformation in a mapping, but the Connect for BW Server attempts to insert all records, even those marked for update or delete.
6. Load data. To load data into BW from PowerCenter, both PowerCenter and the BW system must be configured. Use the following steps to load data into BW:

- Configure a workflow to load data into BW. Create a session in a workflow that uses a mapping with an InfoSource target definition.
- Create and schedule an InfoPackage. The InfoPackage associates the PowerCenter session with the InfoSource.

When the Connect for BW Server starts, it communicates with BW to register itself as a server. It then waits for a request from BW to start the workflow. When the InfoPackage starts, BW communicates with the registered Connect for BW Server and sends the workflow name to be scheduled with the PowerCenter Server. The Connect for BW Server reads information about the workflow and sends a request to the PowerCenter Server to run the workflow. The PowerCenter Server validates the workflow name in the repository and the
workflow name in the InfoPackage. The PowerCenter Server executes the session and loads the data into BW. You must start the Connect for BW Server after you restart the PowerCenter Server.
Supported Datatypes
The PowerCenter Server transforms data based on the Informatica transformation datatypes. BW can only receive data in packets of 250 bytes. The PowerCenter Server converts all data to a CHAR datatype and puts it into packets of 250 bytes, plus one byte for a continuation flag. BW receives data until it reads a continuation flag set to zero. Within the transfer structure, BW then converts the data to the BW datatype. Currently, BW only supports the following datatypes in transfer structures assigned to BAPI source systems (PowerCenter): CHAR, CUKY, CURR, DATS, NUMC, TIMS, UNIT. All other datatypes result in the following error in BW: Invalid data type (data type name) for source system of type BAPI.
Date/Time Datatypes
The transformation date/time datatype supports dates with precision to the second. If you import a date/time value that includes milliseconds, the PowerCenter Server truncates to seconds. If you write a date/time value to a target column that supports milliseconds, the PowerCenter Server inserts zeros for the millisecond portion of the date.
Binary Datatypes
BW does not allow you to build a transfer structure with binary datatypes. Therefore, you cannot load binary data from PowerCenter into BW.
Numeric Datatypes
PowerCenter does not support the INT1 datatype.
If you see a performance slowdown for sessions that load into SAP BW, increase the default buffer block size to 15MB to 20MB. You can put 5,000 to 10,000 rows per block, so you can calculate the buffer block size needed with the following formula:

Row size x Rows per block = Default buffer block size

For example, if your target row size is 2KB: 2KB x 10,000 rows = 20MB.
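The formula can be checked with a quick calculation (the values are the example from the text; the division by 1000 mirrors the text's decimal rounding):

```shell
# Buffer block size = row size x rows per block (example: 2KB rows, 10,000 rows/block).
row_size_kb=2
rows_per_block=10000
block_size_kb=$((row_size_kb * rows_per_block))
block_size_mb=$((block_size_kb / 1000))   # decimal MB, matching the text's rounding
echo "Default buffer block size: ${block_size_kb} KB (~${block_size_mb} MB)"
```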
Description
MQSeries applications communicate by sending messages asynchronously rather than by calling each other directly. Applications can also request data using a "request message" on a message queue. Because no open connection is required between systems, they can run independently of one another. MQSeries enforces no structure on the content or format of the message; this is defined by the application.

With more and more requirements for on-demand or real-time data integration, as well as the development of Enterprise Application Integration (EAI) capabilities, MQSeries has become an important vehicle for providing information to data warehouses in real time. PowerCenter provides data integration for transactional data generated by continuously running messaging systems such as MQSeries. For these types of messaging systems, PowerCenter's Zero Latency (ZL) engine provides immediate processing of trickle-feed data, allowing real-time data flow in both a uni-directional and a bi-directional manner.
TIP In order to enable PowerCenter's ZL engine to process MQ messages in real time, the workflow must be configured to run continuously, and a real-time MQ filter (such as idle time, reader time limit, or message count) needs to be applied to the MQ source qualifier.
MQSeries Architecture
IBM MQSeries is a messaging and queuing application that permits programs to communicate with one another across heterogeneous platforms and network protocols using a consistent application-programming interface. MQSeries architecture has three parts:

1. Queue Manager
2. Message Queue, which is a destination to which messages can be sent
3. MQSeries Message, which incorporates a header and a data component
Queue Manager
- PowerCenter connects to the Queue Manager to send and receive messages.
- A Queue Manager may publish one or more MQ queues.
- Every message queue belongs to a Queue Manager.
- The Queue Manager administers queues, creates queues, and controls queue operation.
Message Queue
- PowerCenter connects to the Queue Manager to send and receive messages to one or more message queues.
- PowerCenter is responsible for deleting the message from the queue after processing it.
TIP There are several ways to maintain transactional consistency (i.e., clean up the queue after reading). Refer to the Informatica Webzine article on Transactional Consistency for details on the various ways to delete messages from the queue.
MQSeries Message
An MQSeries message is composed of two distinct sections:
- MQSeries header. This section contains data about the queue message itself. Message header data includes the message identification number, message format, and other message descriptor data. In PowerCenter, MQSeries sources and dynamic MQSeries targets automatically incorporate MQSeries message header fields.
- MQSeries message data block. A single data element that contains the application data (sometimes referred to as the "message body"). The content and format of the message data is defined by the application that puts the message on the queue.
- XML
- COBOL
- Binary

When reading a message from a queue, the PowerCenter mapping must contain an MQ Source Qualifier (MQSQ). If the mapping also needs to read the message data block, then an Associated Source Qualifier (ASQ) is also needed. When developing an MQSeries mapping, the MESSAGE_DATA block is re-defined by the ASQ. Based on the format of the source data, PowerCenter generates the appropriate transformation for parsing the MESSAGE_DATA. Once associated, the MSG_ID field is linked within the associated source qualifier transformation.
Using MQ Functions
PowerCenter provides built-in functions that can also be used to filter message data.
Functions can be used to enable PowerCenter real-time data extraction. Available functions:

- Idle(n): Time RT remains idle before stopping.
- MsgCount(n): Number of messages read from the queue before stopping.
- StartTime(time): GMT time when RT begins reading the queue.
- EndTime(time): GMT time when RT stops reading the queue.
- FlushLatency(n): Time period RT waits before committing messages read from the queue.
- ForcedEOQ(n): Time period RT reads messages from the queue before stopping.
- RemoveMsg(TRUE): Removes messages from the queue.
TIP In order to enable real-time message processing, use the FlushLatency() or ForcedEOQ() MQ functions as part of the filter expression in the MQSQ.
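For illustration only, a hypothetical real-time filter condition in the MQSQ combining the functions above (verify the exact syntax against the PowerCenter Connect for MQSeries documentation):

```
FlushLatency(2) AND ForcedEOQ(3600) AND RemoveMsg(TRUE)
```

Per the function descriptions above, this would commit messages read from the queue every 2 seconds, stop the reader after 3600 seconds, and remove processed messages from the queue.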
- Static MQ targets. Used for loading message data (instead of header data) to the target. A static target does not load data to the message header fields. Use the target definition specific to the format of the message data (i.e., flat file, XML, or COBOL). Design the mapping as if it were not using MQSeries, then configure the target connection to point to an MQ message queue in the session when using MQSeries.
- Dynamic MQ targets. Used for binary targets only, and when loading data to a message header. Note that certain message headers in an MQSeries message require a predefined set of values assigned by IBM.
- UserIdentifier
- AccountingToken
- ApplIdentityData
- PutApplType
- PutApplName
- PutDate
- PutTime
- ApplOriginData
- Flat file
- XML
- COBOL
XML targets with multiple hierarchies can generate one or more MQ messages (configurable).
MQSeries mappings cannot be partitioned if an associated source qualifier is used. For MQ Series sources, set the Source Type to the following:
- Heterogeneous - when there is an associated source definition in the mapping. This indicates that the source data is coming from an MQ source, and the message data is in flat file, COBOL, or XML format.
- Message Queue - when there is no associated source definition in the mapping.
Note that there are two pages on the Source Options dialog: XML and MQSeries. You can alternate between the two pages to set configurations for each.
For Static MQSeries targets, select File Target type from the list. When the target is an XML file or XML message data for a target message queue, the target type is automatically set to XML.
- If you load data to a dynamic MQ target, the target type is automatically set to Message Queue.
- On the MQSeries page, select the MQ connection to use for the source message queue, and click OK.
- Be sure to select the MQ checkbox in Target Options for the Associated file type. Then click Edit Object Properties and type:
  - the connection name of the target message queue.
  - the format of the message data in the target queue (e.g., MQSTR).
  - the number of rows per message (only applies to flat file MQ targets).
The following features and functions are not available to PowerCenter when using MQSeries:
- Lookup transformations can be used in an MQSeries mapping, but lookups on MQSeries sources are not allowed.
- No Debug "Sessions". You must run an actual session to debug a queue mapping.
- Certain considerations are necessary when using AEPs, Aggregators, Joiners, Sorters, Rank, or Transaction Control transformations, because they can only be performed on one queue, as opposed to a full data set.
- The MQSeries mapping cannot contain a flat file target definition if you are trying to target an MQSeries queue.
- PowerCenter version 6 and earlier performs a browse of the MQ queue. PowerCenter version 7 provides the ability to perform a destructive read of the MQ queue (instead of a browse).
- PowerCenter version 7 also provides support for active transformations (i.e., Aggregators) in an MQ source mapping.
- PowerCenter version 7 provides MQ message recovery on restart of failed sessions.
- PowerCenter version 7 offers enhanced XML capabilities for mid-stream XML parsing.
Appendix Information
PowerCenter uses the following datatypes in MQSeries mappings:
- IBM MQSeries datatypes. IBM MQSeries datatypes appear in the MQSeries source and target definitions in a mapping.
- Native datatypes. Flat file, XML, or COBOL datatypes associated with MQSeries message data. Native datatypes appear in flat file, XML, and COBOL source definitions. Native datatypes also appear in flat file and XML target definitions in the mapping.
- Transformation datatypes. Transformation datatypes are generic datatypes that PowerCenter uses during the transformation process. They appear in all the transformations in the mapping.
MQSeries message header fields:

- StrucId: Structure identifier
- Version: Structure version number
- Report: Options for report messages
- MsgType: Message type
- Expiry: Message lifetime
- Feedback: Feedback or reason code
- Encoding: Data encoding
- CodedCharSetId: Coded character set identifier
- Format: Format name
- Priority: Message priority
- Persistence: Message persistence
- MsgId: Message identifier
- CorrelId: Correlation identifier
- BackoutCount: Backout counter
- ReplyToQ: Name of reply queue
- ReplyToQMgr: Name of reply queue manager
- UserIdentifier: Defined by the environment. If the MQSeries server cannot determine this value, the value for the field is null.
- AccountingToken: Defined by the environment. If the MQSeries server cannot determine this value, the value for the field is MQACT_NONE.
- ApplIdentityData: Application data relating to identity. The value for ApplIdentityData is null.
- PutApplType: Type of application that put the message on the queue. Defined by the environment.
- PutApplName: Name of application that put the message on the queue. Defined by the environment. If the MQSeries server cannot determine this value, the value for the field is null.
- PutDate: Date when the message arrives in the queue.
- PutTime: Time when the message arrives in the queue.
- ApplOriginData: Application data relating to origin. The value for ApplOriginData is null.
- GroupId: Group identifier
- MsgSeqNumber: Sequence number of logical message within group
- Offset: Offset of data in physical message from start of logical message
- MsgFlags: Message flags
- OriginalLength: Length of original message
Description
SAP R/3 is ERP software that provides multiple business applications/modules, such as financial accounting, materials management, sales and distribution, human resources, CRM, and SRM. The core R/3 system (BASIS layer) is programmed in Advanced Business Application Programming (ABAP/4, or ABAP), a fourth-generation language proprietary to SAP. PowerCenter Connect for SAP R/3 can write, read, and change data in R/3 via the BAPI/RFC and IDoc interfaces; the ABAP interface of PowerCenter Connect can only read data from SAP R/3.

PowerCenter Connect for SAP R/3 provides the ability to extract SAP R/3 data into data warehouses, data integration applications, and other third-party applications, all without writing complex ABAP code. PowerCenter Connect for SAP R/3 generates ABAP programs and is capable of extracting data from transparent tables, pool tables, and cluster tables. When integrated with R/3 using ALE (Application Link Enabling), PowerCenter Connect for SAP R/3 can also extract data from R/3 using outbound IDocs (Intermediate Documents) in near real time. The ALE concept, available since R/3 Release 3.0, supports the construction and operation of distributed applications. It incorporates controlled exchange of business data messages while ensuring data consistency across loosely-coupled SAP applications. The integration of the various applications is achieved by using synchronous and asynchronous communication, rather than by means of a central database.

The database server stores the physical tables in the R/3 system, while the application server stores the logical tables. A transparent table definition on the application server is represented by a single physical table on the database server. Pool and cluster tables are logical definitions on the application server that do not have a one-to-one relationship with a physical table on the database server.
Communication Interfaces
TCP/IP is the native communication interface between PowerCenter and SAP R/3. Other interfaces between the two include:

Common Program Interface-Communications (CPI-C). The CPI-C communication protocol enables online data exchange and data conversion between R/3 and PowerCenter. To initialize CPI-C communication with PowerCenter, SAP R/3 requires information such as the host name of the application server and the SAP gateway. This information is stored on the PowerCenter Server in a configuration file named sideinfo. The PowerCenter Server uses parameters in the sideinfo file to execute ABAP stream mode sessions.

Remote Function Call (RFC). RFC is the remote communication protocol used by SAP and is based on RPC (Remote Procedure Call). To execute remote calls from PowerCenter, SAP R/3 requires information such as the connection type and the service name and gateway on the application server. This information is stored on the PowerCenter Client and PowerCenter Server in a configuration file named saprfc.ini. PowerCenter makes remote function calls when importing source definitions, installing ABAP programs, and running ABAP file mode sessions.

Transport system. The transport system in SAP is a mechanism to transfer objects developed on one system to another system. The transport system is primarily used to migrate code and configuration from development to QA and production systems. It can be used in the following cases:
- PowerCenter Connect for SAP R/3 installation transports
- PowerCenter Connect generated ABAP programs
Note: If the ABAP programs are installed in the $TMP development class, they cannot be transported from development to production. Ensure you have a transportable development class/package for the ABAP mappings.

Security

You must have proper authorizations on the R/3 system to perform integration tasks. The R/3 administrator needs to create authorizations, profiles, and users for PowerCenter users.
Integration Feature: Import Definitions, Install Programs

- S_DEVELOP: All activities. Also need to set Development Object ID to PROG.

Integration Feature: Extract Data

- S_TABU_DIS: READ
- S_DATASET: WRITE
- S_PROGRAM: BTCSUBMIT, SUBMIT
- S_BTCH_JOB: DELE, LIST, PLAN, SHOW. Also need to set Job Operation to RELE.
- S_CPIC
- S_RFC
You also need access to the SAP GUI, as described by the following SAP GUI parameters:

- User ID ($SAP_USERID): the username that connects to the SAP GUI and is authorized for read-only access to the following transactions: SE12, SE15, SE16, and SPRO.
- Password ($SAP_PASSWORD): the password for the above user.
- System Number ($SAP_SYSTEM_NUMBER): the SAP system number.
- Client Number ($SAP_CLIENT_NUMBER): the SAP client number.
- Server ($SAP_SERVER): the server on which this instance of SAP is running.
- Extract data from SAP R/3 using the ABAP, BAPI/RFC, and IDoc interfaces.
- Migrate/load data from any source into R/3 using the IDoc, BAPI/RFC, and DMI interfaces. Generate DMI files ready to be loaded into SAP via SXDA TOOLS, LSMW, or SAP standard delivered programs.
- Support calling BAPI and RFC functions dynamically from PowerCenter for data integration. PowerCenter Connect for SAP R/3 can make BAPI and RFC function calls dynamically from mappings to extract or load.
- Capture changes to the master and transactional data in SAP R/3 using ALE. PowerCenter Connect for SAP R/3 can receive outbound IDocs from SAP R/3 in real time and load into SAP R/3 using inbound IDocs. To receive IDocs in real time using ALE, install PowerCenter Connect for SAP R/3 on PowerCenterRT.
- Provide rapid development of the data warehouse based on R/3 data using Analytic Business Components for SAP R/3 (ABC). ABC is a set of business content that includes mappings, mapplets, source objects, targets, and transformations.
- Set partition points in a pipeline for outbound/inbound IDoc sessions; sessions that fail when reading outbound IDocs from an SAP R/3 source can be configured for recovery. You can also receive data from outbound IDoc files and write data to inbound IDoc files.
- Insert an ABAP code block to add functionality to the ABAP program flow, and use static/dynamic filters to reduce returned rows.
- Customize the ABAP program flow with joins, filters, SAP functions, and code blocks. For example: qualifying table = table1-field1 = table2-field2, where the qualifying table is the last table in the condition based on the join order, including outer joins.
- Create ABAP program variables to represent SAP R/3 structures, structure fields, or values in the ABAP program.
- Remove ABAP program information from SAP R/3 and the repository when a folder is deleted.
- Provide enhanced platform support by running on 64-bit AIX and HP-UX (Itanium).
You can install PowerCenter Connect for SAP R/3 for the PowerCenter Server and Repository Server on SuSe Linux or on Red Hat Linux.
- Transport the development objects on the PowerCenter CD to R/3. PowerCenter calls these objects each time it makes a request to the R/3 system.
- Run the transport program that generates unique IDs.
- Establish profiles in the R/3 system for PowerCenter users.
- Create a development class for the ABAP programs that PowerCenter installs on the SAP R/3 system.
For PowerCenter
The PowerCenter server and client need drivers and connection files to communicate with SAP R/3. Preparing PowerCenter for integration involves the following tasks:
- Run installation programs on PowerCenter Server and Client machines.
- Configure the connection files:
  - The sideinfo file on the PowerCenter Server allows PowerCenter to initiate CPI-C with the R/3 system. The required parameters for sideinfo are:

    DEST: logical name of the R/3 system
    TYPE: set to A to indicate connection to a specific R/3 system
    ASHOST: host name of the SAP R/3 application server
    SYSNR: system number of the SAP R/3 application server
  - The saprfc.ini file on the PowerCenter Client and Server allows PowerCenter to connect to the R/3 system as an RFC client. The required parameters for saprfc.ini are:

    DEST: logical name of the R/3 system
    LU: host name of the SAP application server machine
    TP: set to sapdp<system number>
    GWHOST: host name of the SAP gateway machine
    GWSERV: set to sapgw<system number>
    PROTOCOL: set to I for TCP/IP connection
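As a concrete sketch of the two files (host names, system numbers, and the logical name R3DEV are placeholders; verify the exact entry format against the PowerCenter Connect for SAP R/3 documentation):

```
sideinfo entry:

DEST=R3DEV
TYPE=A
ASHOST=r3app01.example.com
SYSNR=00

saprfc.ini entry:

DEST=R3DEV
LU=r3app01
TP=sapdp00
GWHOST=r3gw01
GWSERV=sapgw00
PROTOCOL=I
```

Typically the same DEST value is used in both files and in the PowerCenter application connection so that sessions resolve to the same R/3 system.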
Following is a summary of the required steps:

1. Install PowerCenter Connect for SAP R/3 on PowerCenter.
2. Configure the sideinfo file.
3. Configure the saprfc.ini file.
4. Set the RFC_INI environment variable.
5. Configure an application connection for SAP R/3 sources in the Workflow Manager.
6. Configure an SAP/ALE IDoc connection in the Workflow Manager to receive IDocs generated by the SAP R/3 system.
7. Configure the FTP connection to access staging files through FTP.
8. Install the repository plug-in in the PowerCenter repository.
UNIX
The services file is located in /etc. Add an entry for the SAP gateway service:
sapgw<system number> <port# of gateway service>/TCP

The system number and port numbers are provided by the BASIS administrator.
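A sketch of adding that entry (system number 00 and port 3300 are placeholders that follow the common sapgw<NN> / 33<NN> convention; confirm both with the BASIS administrator, and note that the script edits a working copy rather than /etc/services directly):

```shell
# Work on a copy of the services file; copy it back into /etc as root once verified.
cp /etc/services ./services.new 2>/dev/null || touch ./services.new
# Append the SAP gateway entry only if it is not already present.
grep -q '^sapgw00' ./services.new || printf 'sapgw00\t3300/tcp\n' >> ./services.new
tail -n 1 ./services.new
```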
- Stream mode does not create any intermediate files on the R/3 system. This method is faster, but uses more CPU cycles on the R/3 system.
- File mode creates an intermediate file on the SAP R/3 system, which is then transferred to the machine running the PowerCenter Server. If you want to run file mode sessions, you must provide either FTP access or NFS access from the machine running the PowerCenter Server to the machine running SAP R/3. This, of course, assumes that PowerCenter and SAP R/3 are not running on the same machine; it is possible to run PowerCenter and R/3 on the same system, but highly unlikely.

If you want to use file mode sessions and your R/3 system is on a UNIX system, you need to do one of the following:
- Provide the login and password for the UNIX account used to run the SAP R/3 system.
- Provide a login and password for a UNIX account belonging to the same group as the UNIX account used to run the SAP R/3 system.
- Create a directory on the machine running SAP R/3, and run chmod g+s on that directory. Provide the login and password for the account used to create this directory.

Configure database connections in the Server Manager to access the SAP R/3 system when running a session, then configure an FTP connection to access the staging file through FTP.
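The third option above can be sketched as follows (the directory name is a placeholder; run this as the account that will own the staging directory on the R/3 host):

```shell
# Create a staging directory and set the setgid bit so that files written into
# it inherit the directory's group, giving the SAP R/3 account's group access.
mkdir -p ./sap_staging        # placeholder; e.g., /staging/infa on the R/3 host
chmod g+s ./sap_staging
ls -ld ./sap_staging          # the group permission column shows 's' when setgid is set
```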
Extraction Process
R/3 source definitions can be imported from the logical tables using the RFC protocol. Extracting data from R/3 is a four-step process:

1. Import source definitions. The PowerCenter Designer connects to the R/3 application server using RFC. The Designer calls a function in the R/3 system to import source definitions.

Note: If you plan to join two or more tables in SAP, be sure you have optimized join conditions. Make sure you have identified your driving table (e.g., if you plan to extract data from the bkpf and bseg accounting tables, be sure to drive your extracts from the bkpf table). There is a significant difference in performance if the joins are properly defined.

2. Create a mapping. When creating a mapping using an R/3 source definition, you must use an ERP source qualifier. In the ERP source qualifier, you can customize properties of the ABAP program that the R/3 server uses to extract source data. You can also use joins, filters, ABAP program variables, ABAP code blocks, and SAP functions to customize the ABAP program.

3. Generate and install the ABAP program. You can install two types of ABAP programs for each mapping:
- File mode. Extract data to a file. The PowerCenter Server accesses the file through FTP or NFS mount. This mode is used for large extracts because SAP has timeouts for long-running queries.
- Stream mode. Extract data to buffers. The PowerCenter Server accesses the buffers through CPI-C, the SAP protocol for program-to-program communication. This mode is preferred for short-running extracts.

You can modify the ABAP program block and customize it according to your requirements (e.g., if you want to get data incrementally, create a mapping variable/parameter and use it in the ABAP program).
Stream mode. In stream mode, the installed ABAP program creates buffers on the application server. The program extracts source data and loads it into the buffers. When a buffer fills, the program streams the data to the PowerCenter Server using CPI-C. With this method, the PowerCenter Server can process data as it is received.

File mode. When running a session in file mode, the session must be configured to access the file through NFS mount or FTP. When the session runs, the installed ABAP program creates a file on the application server. The program extracts source data and loads it into the file. When the file is complete, the PowerCenter Server accesses the file through FTP or NFS mount and continues processing the session.
- Financial Accounting
- Controlling
- Materials Management
- Sales and Distribution

Refer to the ABC Guide for complete installation and configuration information.
Description
PowerCenter Connect for Web Services (Web Services Consumer) allows PowerCenter to act as a web services client to consume external web services. PowerCenter Connect for Web Services uses the Simple Object Access Protocol (SOAP) to communicate with the external web service provider. An external web service can be invoked from PowerCenter in three ways:

- As a web service source
- As a web service target
- As a midstream Web Services Consumer transformation
PowerCenter also supports a request-response type of operation using the Web Services Consumer transformation. You can use the web service as a transformation if your input data is available midstream and you want to capture the response values from the web service. The following steps serve as an example for invoking a Stock Quote web service to learn the price for each of the ticker symbols available in a flat file:

1. In the Transformation Developer, create a Web Services Consumer transformation.
2. Specify the URL http://services.xmethods.net/soap/urn:xmethods-delayed-quotes.wsdl and pick the operation getQuote.
3. Connect the input port of this transformation to the field containing the ticker symbols.
4. To invoke the web service for each input row, change to source-based commit and set the interval to 1. Also change the Transaction Scope to Transaction in the Web Services Consumer transformation.
This is helpful because Informatica Professional Services will most likely not have access to the client code or the Web Service.
In Web Services Provider, PowerCenter acts as a service provider and exposes many key functionalities as web services. In PowerCenter Connect for Web Services, PowerCenter acts as a web service client and consumes external web services. It is not necessary to install or configure Web Services Provider in order to use PowerCenter Connect for Web Services. Web services exposed through PowerCenter have two formats:

- Real-time: In real-time mode, web-enabled workflows are exposed. The Web Services Provider must be used and point to the web service that the mapping is going to consume. Workflows can be started and protected.
- Batch: In batch mode, a preset of services is exposed to run and monitor workflows in your system. Good for reporting engines, etc.
- Truststore. The truststore holds the public keys for the entities it can trust. PowerCenter uses the entries in the truststore file to authenticate the external web services servers.
- Keystore (Clientstore). The clientstore holds both the entity's public and private keys. PowerCenter sends the entries in the clientstore file to the web services server so that the web services server can authenticate the PowerCenter server.
By default, the keystore files jssecacerts and cacerts in the $(JAVA_HOME)/lib/security directory are used for truststores. You can also create new keystore files and configure the Truststore and Clientstore parameters in the PowerCenter Server setup to point to these files. Keystore files can contain multiple certificates and are managed using utilities like keytool. SSL authentication can be performed in three ways:
- Server authentication
- Client authentication
- Mutual authentication

Server authentication:
When establishing an SSL session with server authentication, the web services server sends its certificate to PowerCenter and PowerCenter verifies whether the server certificate can be trusted. Only the truststore file needs to be configured in this case.

Assumptions:

- The web services server certificate is stored in the server.cer file.
- The PowerCenter Server (client) public/private key pair is available in the keystore client.jks.

Steps:

1. Import the server's certificate into the PowerCenter Server's truststore file. You can use either the default keystores (jssecacerts, cacerts) or create your own keystore file:

   keytool -import -file server.cer -alias wserver -keystore trust.jks -trustcacerts -storepass changeit

2. At the prompt for trusting this certificate, type yes.
3. Configure PowerCenter to use this truststore file. Open the PowerCenter Server setup -> JVM Options tab and, in the value for Truststore, give the full path and name of the keystore file (i.e., c:\trust.jks).
Client authentication:
When establishing an SSL session with client authentication, PowerCenter sends its certificate to the web services server. The web services server then verifies whether the PowerCenter Server can be trusted. In this case, you need only the clientstore file.

Steps:
1. Ensure the keystore containing the private/public key pair is called client.jks, and that the client private key password and the keystore password are the same (e.g., changeit).
2. Configure PowerCenter to use this clientstore file. Open the PowerCenter Server setup > JVM Options tab and, in the value for Clientstore, type the full path and name of the keystore file (e.g., c:\client.jks).
3. Add an additional JVM parameter in the PowerCenter Server setup with the value -Djavax.net.ssl.keyStorePassword=changeit
Mutual authentication:
When establishing an SSL session with mutual authentication, both the PowerCenter Server and the web services server send their certificates to each other, and each verifies whether the other can be trusted. You need to configure both the clientstore and the truststore files.

Steps:
1. Import the server's certificate into the PowerCenter Server's truststore file:
   keytool -import -file server.cer -alias wserver -keystore trust.jks -trustcacerts -storepass changeit
2. Configure PowerCenter to use this truststore file. Open the PowerCenter Server setup > JVM Options tab and, in the value for Truststore, type the full path and name of the keystore file (e.g., c:\trust.jks).
3. Ensure the keystore containing the client public/private key pair is called client.jks, and that the client private key password and the keystore password are the same (e.g., changeit).
4. Configure PowerCenter to use this clientstore file. Open the PowerCenter Server setup > JVM Options tab and, in the value for Clientstore, type the full path and name of the keystore file (e.g., c:\client.jks).
5. Add an additional JVM parameter in the PowerCenter Server setup with the value -Djavax.net.ssl.keyStorePassword=changeit

Note: If your client private key is not already present in the keystore file, you cannot use the keytool command to import it. Keytool can only generate a private key; it cannot import a private key into a keystore. In this case, use an external Java utility such as utils.ImportPrivateKey (WebLogic) or KeystoreMove (to convert PKCS#12 format to JKS) to move it into the JKS keystore.
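The private-key caveat above can be sketched with openssl: package the key pair as PKCS#12, after which a utility such as keytool -importkeystore (JDK 6 and later) or KeystoreMove can move it into a JKS keystore. The key pair generated here is a throwaway for illustration; the alias and password are assumptions.

```shell
# Throwaway client key pair (in practice, your existing key and certificate).
openssl req -x509 -newkey rsa:2048 -nodes -keyout client.key \
  -out client.crt -days 1 -subj "/CN=pc-client"
# Package key and certificate together as PKCS#12.
openssl pkcs12 -export -inkey client.key -in client.crt \
  -name pcclient -out client.p12 -passout pass:changeit
# Confirm the bundle holds the certificate under the expected subject;
# client.p12 can now be converted to JKS with an external utility.
openssl pkcs12 -in client.p12 -passin pass:changeit -nokeys -clcerts |
  openssl x509 -noout -subject
```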
Refer to the openssl documentation for complete information on such conversions. A few examples are given below.

To convert from PEM to DER, assuming that you have a PEM file called server.pem:
openssl x509 -in server.pem -inform PEM -out server.der -outform DER
To convert a PKCS12 file, you must first convert it to PEM, and then from PEM to DER. Assuming that your PKCS12 file is called server.pfx, the two commands are:
openssl pkcs12 -in server.pfx -out server.pem
openssl x509 -in server.pem -inform PEM -out server.der -outform DER
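A self-contained version of the PEM-to-DER conversion: the certificate is generated on the spot (with an illustrative subject) rather than taken from an existing server.pfx, and the DER file is read back to confirm the conversion.

```shell
# Generate a throwaway PEM certificate to convert.
openssl req -x509 -newkey rsa:2048 -nodes -keyout srv.key \
  -out server.pem -days 1 -subj "/CN=example-ws"
# PEM to DER, as above.
openssl x509 -in server.pem -inform PEM -out server.der -outform DER
# Reading the DER file back proves the conversion worked.
openssl x509 -in server.der -inform DER -noout -subject
```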
Description
Leverage Staging Strategies
As discussed elsewhere in Velocity, data migration projects should employ both a legacy staging area and a pre-load staging area. The reason is simple: it provides the ability to pull data from the production system and use it for data cleansing and harmonization activities without interfering with the production systems. By leveraging this strategy you are able to see real production data sooner and follow the guiding principle of 'Convert Early, Convert Often, and with Real Production Data'.
Developers frequently find themselves needing to perform a large amount of cross-referencing, hard-coding of values, or other repeatable transformations during a data migration. These transformations are likely to change over time. Without a table-driven approach, this causes code changes, bug fixes, re-testing, and re-deployments during the development effort. This work is often unnecessary and can be avoided with the use of configuration or reference data tables. It is recommended to use table-driven approaches such as these whenever possible. Some common table-driven approaches include:
Default Values: hard-coded values for a given column, stored in a table where the values can be changed whenever a requirement changes. For example, if you have a hard-coded value of NA for any value not populated and later want to change that value to NV, you can simply change the value in a default-value table rather than change numerous hard-coded values.

Cross-Reference Values: data migration projects frequently need to take values from the source system and convert them to the values of the target system. These values are usually identified up-front, but as the source system changes, additional values are needed. In a typical mapping development situation this would require adding values to a series of IIF or DECODE statements. With a table-driven approach, new data can be added to a cross-reference table and no coding, testing, or deployment is required.

Parameter Values: by using a table-driven parameter file you can reduce the need for scripting and accelerate the development process.

Code-Driven Table: in some instances a set of understood rules is known. By taking those rules and building code against them, a table-driven/code solution can be very productive. For example, if you had a rules table keyed by table/column/rule ID, then whenever that combination was found, a pre-set piece of code would be executed. If at a later date the rules change to a different set of pre-determined rules, the rule table can be changed for the column and no additional coding is required.
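The cross-reference idea can be sketched with nothing more than a delimited file and awk; the file name, codes, and lookup value below are illustrative only, not part of any Informatica component.

```shell
# A cross-reference 'table' of source_value|target_value pairs.
cat > state_xref.txt <<'EOF'
CALIF|CA
TEX|TX
N.Y.|NY
EOF
# A new source value appears: one row is added, with no code change,
# re-test, or re-deployment of the mapping logic.
echo 'FLA|FL' >> state_xref.txt
# The lookup a load process would perform:
awk -F'|' '$1 == "FLA" { print $2 }' state_xref.txt
```

The same pattern holds when the cross-reference lives in a relational table read by a Lookup transformation: new rows extend behavior without code changes.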
Focus On Re-Use
Re-use should always be considered during Informatica development. However, because data migration projects have such a high degree of repeatability, re-use is paramount to their success. There is often tremendous opportunity for re-use of mappings, strategies, processes, scripts, and testing documents. This reduces staff time for migration projects and lowers project costs.
projects and add to the large set of typical best practice items that are available in Velocity. The key to data migration projects is to architect well, design better, and execute best.
Last updated: 01-Feb-07 18:52
Description
Unlike other Velocity Best Practices, this one does not specify the full solution to each challenge. Rather, it is more important to understand these three challenges and take action to address them throughout the implementation.
Migration Specifications
During the execution of data migration projects, a challenge that always arises is problems with the migration specifications. Projects require completed functional specs that identify what is required of each migration interface. Definitions:
A migration interface is defined as one to many mappings/sessions/workflows or scripts used to migrate a data entity from one source system to one target system.

A Functional Requirements Specification normally comprises a document covering details including security, database join needs, audit needs, and primary contact details. These details are normally at the interface level rather than at the column level. It also includes a Target-Source Matrix, which identifies details at the column level, such as how source tables/columns map to target tables/columns, business rules, data cleansing rules, validation rules, and other column-level specifics.
Many projects attempt to complete these migrations without such specifications. These projects often have little to no chance of completing on-time or on-budget. Time and subject matter expertise
is needed to complete this analysis; this is the baseline for project success. Projects are disadvantaged when functional specifications are not completed on time. Developers can be left waiting for extended periods when these specs are not completed by the time specified in the project plan. Another project risk occurs when the wrong individuals write these specs, or when inappropriate levels of importance are attached to the exercise. These situations produce inaccurate or incomplete specifications, which prevent data integration developers from successfully building the migration processes. To address the specification challenge, migration projects must have specifications that are completed accurately and delivered on time.
Data Quality
Most projects are affected by data quality due to the need to address problems in the source data that fit into the six dimensions of data quality:
Data Quality Dimension   Description
Completeness             What data is missing or unusable?
Conformity               What data is stored in a non-standard format?
Consistency              What data values give conflicting information?
Accuracy                 What data is incorrect or out of date?
Duplicates               What data records or attributes are repeated?
Integrity                What data is missing or not referenced?
Data migration data quality problems are typically worse than planned for. Projects need to allow enough time to identify and fix data quality problems before loading the data into the new target system. Informatica's data integration platform provides data quality capabilities that can help to identify data quality problems efficiently, but subject-matter experts are required to decide how these data problems should be addressed within the business context and process.
Project Management
Project managers are often disadvantaged on these types of projects, as they are typically much larger, more expensive, and more complex than any prior project they have been involved with. They need to
understand early in the project the importance of correctly completed specs and of addressing data quality, and establish a set of tools to plan the project accurately and objectively, with the ability to evaluate progress. Informatica's Velocity migration methodology, its tool sets, and its metadata reporting capabilities are key to addressing these project challenges. The key is to fully understand the pitfalls early in the project, how PowerCenter and Informatica Data Quality can address these challenges, and how metadata reporting can provide objective information on project status. In summary, data migration projects are challenged by specification issues, data quality issues, and project management difficulties. By understanding the Velocity methodology's focus on data migration and how Informatica's products can address these challenges, they can be minimized.
Description
The Velocity approach to data migration is illustrated here. While it is possible to migrate data in one step, it is more productive to break the process into two or three simpler steps. The goal for data migration is to get the data into the target application as early as possible for large-scale implementations. Typical implementations have three to four trial cutovers, or mock runs, before the final implementation at go-live. The mantra for an Informatica-based migration is to 'Convert Early, Convert Often, and Convert with Real Production Data'. To do this, the following approach is encouraged:
Analysis
In the Analysis phase the specifications are completed; these include both the functional specs and the Target-Source Matrix. See the Best Practice Data Migration Project Challenges for related information.
Acquire
In the Acquire phase the Target-Source Matrix is reviewed and all source systems/tables are identified. These tables are used to develop one mapping per source table to populate a mirrored structure in a legacy database schema. For example, if 50 source tables were identified across all the Target-Source Matrix documents, 50 legacy tables would be created and 50 mappings would be developed, one for each table. It is recommended to perform the initial development against test data but, once complete, to run a single extract of the current production data. This assists in addressing data quality problems without impacting production systems. It is recommended to run these extracts in low-use time periods and with the cooperation of the operations group responsible for these systems. It is also recommended to take advantage of the Visio generation option if available. These mappings are very straightforward, and the use of auto-generation can increase consistency and lower the staff time required for the project.
Convert
In this phase data is extracted from the legacy stage tables (merged, transformed, and cleansed) to populate a mirror of the target application. As part of this process a standard exception process should be developed to identify exceptions and expedite data cleansing activities. The results of this convert process should be profiled, and the appropriate data quality scorecards should be reviewed. During the Convert phase the basic set of exception tests should be executed, with exception details collected for future reporting and correction. The basic exception tests include:
1. Data Type
2. Data Size
3. Data Length
4. Valid Values
5. Range of Values
Data Type: Will the source data value load correctly to the target data type, such as a numeric date loading into an Oracle date type?

Data Size: Will a numeric value from a source load correctly to the target column, or will a numeric overflow occur?

Data Length: Is the input value too large for the target column? (This applies to all data types but is of particular interest for strings. For example, in one system a field could be char(256) but most of the values are char(10). In the target the new field is varchar(20), so any value over char(20) should raise an exception.)

Range of Values: Is the input value within a tolerable range for the new system? (For example, does the birth date for an insurance subscriber fall between Jan 1, 1900 and Jan 1, 2006? If this test fails, the date is unreasonable and should be addressed.)

Valid Values: Is the input value in the list of tolerated values in the target system? (For example, does the state code of an input record match the list of states in the new target system? If not, the data should be corrected prior to entry into the new system.)
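The Data Length and Range of Values tests can be sketched against a delimited extract with awk; the two-field layout (name|birth_year), the 20-character limit, and the 1900-2006 range are illustrative assumptions.

```shell
# Flag rows whose name exceeds the target varchar(20) or whose year
# falls outside the tolerable range; the first record trips both tests.
printf 'SMITHSONIAN-LONG-NAME-OVER-20|1899\nJONES|1975\n' |
awk -F'|' '
  length($1) > 20        { print "LENGTH_EXCEPTION: " $1 }
  $2 < 1900 || $2 > 2006 { print "RANGE_EXCEPTION: "  $2 }'
```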
Once the profiling exercises, exception reports, and data quality scorecards are complete, a list of data quality issues should be created. This list should then be reviewed with the functional business owners to generate new data quality rules to correct the data. These details should be added to the spec, and the original convert process should be modified with the new data quality
rules. The convert process should then be re-executed, along with the profiling, exception reporting, and data scorecarding, until the data is correct and ready to load to the target application.
Migrate
In the Migrate phase the data from the Convert phase is loaded into the target application. The expectation is that there will be no load failures; the data should have been corrected in the Convert phase prior to loading the target application. Once the Migrate phase is complete, validation should occur. It is recommended to complete an audit/balancing step prior to validation; this is discussed in the Best Practice Build Data Audit/Balancing Processes. Additional detail about these steps is defined in the Best Practice Data Migration Principles.
Last updated: 06-Feb-07 12:08
Description
The common practice for audit and balancing solutions is to produce a set of common tables that can hold various control metrics regarding the data integration process. Ultimately, business intelligence reports provide insight at a glance to verify that the correct data has been pulled from the source and completely loaded to the target. Each control measure being tracked requires development of a corresponding PowerCenter process to load the metrics to the audit/balancing detail table. To drive out this type of solution, execute the following tasks:
1. Work with business users to identify which audit/balancing processes are needed. Some examples:
   a. Customers (number of customers, or number of customers by country)
   b. Orders (quantity of units sold, or net sales amount)
   c. Deliveries (number of shipments, quantity of units shipped, or value of all shipments)
   d. Accounts Receivable (number of accounts receivable shipments, or total accounts receivable outstanding)
2. For each process defined in task 1, define which columns should be used for tracking purposes in both the source and target systems.
3. Develop a data integration process that reads from the source system and populates the detail audit/balancing table with the control totals.
4. Develop a data integration process that reads from the target system and populates the detail audit/balancing table with the control totals.
5. Develop a reporting mechanism that queries the audit/balancing table and identifies whether the source and target entries match or there is a discrepancy.
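Tasks 3 and 4 both reduce to deriving control totals from an extract. A minimal awk sketch, assuming a hypothetical order_id|amount layout and the CONTROL_COUNT_1/CONTROL_SUM_1 columns of the detail table:

```shell
# Compute a row count and a net-amount sum as control totals.
printf '1001|250.00\n1002|99.50\n1003|150.50\n' |
awk -F'|' '{ n++; sum += $2 }
           END { printf "CONTROL_COUNT_1=%d CONTROL_SUM_1=%.2f\n", n, sum }'
```

Running the same computation against both source and target extracts yields the two rows that the reporting step compares.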
CONTROL_SUB_AREA    VARCHAR2 (50)
CONTROL_COUNT_1     NUMBER
CONTROL_COUNT_2     NUMBER
CONTROL_COUNT_3     NUMBER
CONTROL_COUNT_4     NUMBER
CONTROL_COUNT_5     NUMBER
CONTROL_SUM_1       NUMBER (p,s)
CONTROL_SUM_2       NUMBER (p,s)
CONTROL_SUM_3       NUMBER (p,s)
CONTROL_SUM_4       NUMBER (p,s)
CONTROL_SUM_5       NUMBER (p,s)
UPDATE_TIMESTAMP    TIMESTAMP
UPDATE_PROCESS      VARCHAR2 (50)
The following is a screenshot of a single mapping that populates both the source and target values:
The following two screenshots show how two mappings could be used to provide the same results:
Note: One key challenge is how to capture the appropriate control values from the source system if it is continually being updated. The first example, with one mapping, will not work because of changes that occur between the extraction of the data from the source and the completion of the load to the target application. In those cases you may want to take advantage of an Aggregator transformation to collect the appropriate control totals, as illustrated in this screenshot:
The following are straw-man examples of an audit/balancing report, the end result of this type of process:
Data area    Leg count   TT count   Diff   Leg amt    TT amt     Diff
Customer     11000       10099      901
Orders       9827        9827       0      11230.21   11230.21   0
Deliveries   1298        1288       10     21294.22   21011.21   283.01
1. Identifying what the control totals should be
2. Building processes that collect the correct information at the correct granularity

There is also a set of basic tasks that can be leveraged and shared across any audit/balancing needs. By building a common model for meeting audit/balancing needs, projects can lower the time needed to develop these solutions while still reducing risk by having this type of solution in place.
Description
A significant portion of time in the project development process should be dedicated to data quality, including the implementation of data cleansing processes. In a production environment, data quality reports should be generated after each data warehouse implementation or when new source systems are integrated into the environment. There should also be a provision for rolling back if data quality testing indicates that the data is unacceptable. Informatica offers two application suites for tackling data quality issues: Informatica Data Explorer (IDE) and Informatica Data Quality (IDQ). IDE focuses on data profiling, and its results can feed into the data integration process; its unique strength, however, is its metadata profiling and discovery capability. IDQ has been developed as a data analysis, cleansing, correction, and de-duplication tool, one that provides a complete solution for identifying and resolving all types of data quality problems and preparing data for the consolidation and load processes.
Concepts
Following are some key concepts in the field of data quality. These concepts provide a foundation for developing a clear picture of the subject data, which can improve both efficiency and effectiveness. The list can be read as a process, leading from profiling and analysis to consolidation.

Profiling and Analysis - although data profiling and data analysis are often synonymous terms, in Informatica terminology these tasks are assigned to IDE and IDQ respectively. Thus, profiling is primarily concerned with metadata discovery and definition, and IDE is ideally suited to these tasks. IDQ can discover data quality issues at a record and field level, and Velocity best practices recommend the use of IDQ for such purposes.

Note: The remaining items in this document therefore focus on the context of IDQ usage.
Parsing - the process of extracting individual elements within records, files, or data entry forms in order to check the structure and content of each field and to create discrete fields devoted to specific information types. Examples include: name, title, company name, phone number, and SSN.

Cleansing and Standardization - refers to arranging information in a consistent manner or preferred format. Examples include the removal of dashes from phone numbers or SSNs. For more information, see the Best Practice Effective Data Standardizing Techniques.

Enhancement - refers to adding useful, but optional, information to existing data, or to completing data. Examples include: sales volume, number of employees for a given business, and zip+4 codes.

Validation - the process of correcting data using algorithmic components and secondary reference data sources to check and validate information. Example: validating addresses against postal directories.

Matching and De-duplication - refers to removing, or flagging for removal, redundant or poor-quality records where high-quality records of the same information exist. Use matching components and business rules to identify records that may refer, for example, to the same customer. For more information, see the Best Practice Effective Data Matching Techniques.

Consolidation - using the data sets defined during the matching process to combine all cleansed or approved data into a single, consolidated view. Examples are building a best record or master record, or house-holding.
Informatica Applications
The Informatica Data Quality software suite has been developed to resolve a wide range of data quality issues, including data cleansing. The suite comprises the following elements:
IDQ Workbench - a stand-alone desktop tool that provides a complete set of data quality functionality on a single computer (Windows only).
IDQ Server - a set of processes that enables the deployment and management of data quality procedures and resources across a network of any size through TCP/IP.

IDQ Integration - a plug-in component that integrates Workbench with PowerCenter, enabling PowerCenter users to embed data quality procedures defined in IDQ in their mappings.

IDQ stores all its processes as XML in the Data Quality repository (MySQL). IDQ Server enables the creation and management of multiple repositories.
In stage 1, you analyze the quality of the project data according to several metrics, in consultation with the business or project sponsor. This stage is performed in Workbench, which enables the creation of versatile and easy-to-use dashboards to communicate data quality metrics to all interested parties.

In stage 2, you verify the target levels of quality for the business according to the data quality measurements taken in stage 1, and in accordance with project resourcing and scheduling.

In stage 3, you use Workbench to design the data quality plans and projects to achieve the targets. Capturing business rules and testing the plans are also covered in this stage.

In stage 4, you deploy the data quality plans. If you are using IDQ Workbench and Server, you can deploy plans and resources to remote repositories and file systems through the user interface. If you are running Workbench alone on remote computers, you can export your plans as XML. Stage 4 is the phase in which data cleansing and other data quality tasks are performed on the project data.

In stage 5, you test and measure the results of the plans and compare them to the initial data quality assessment to verify that targets have been met. If targets have not been met, this information feeds into another iteration of data quality operations in which the plans are tuned and optimized.

In a large data project, you may find that data quality processes of varying sizes and impact are necessary at many points in the project plan. At a high level, stages 1 and 2 ideally occur very early in the project, at a point defined as the Manage Phase within Velocity. Stages 3 and 4 typically occur during the Design Phase of Velocity. Stage 5 can occur during the Design and/or Build Phase of Velocity, depending on the level of unit testing required.
Data Quality Integration is a plug-in component that enables PowerCenter to connect to the Data Quality repository and import data quality plans to a PowerCenter transformation. With the Integration component, you can apply IDQ plans to your data without necessarily interacting with or being aware of IDQ Workbench or Server. The Integration interacts with PowerCenter in two ways:
On the PowerCenter client side, it enables you to browse the Data Quality repository and add data quality plans to custom transformations. The data quality plan's functional details are saved as XML in the PowerCenter repository.

On the PowerCenter server side, it enables the PowerCenter Server (or Integration Service) to send data quality plan XML to the Data Quality engine for execution.
The Integration requires that at least the following IDQ components are available to PowerCenter:
Client side: PowerCenter needs access to a Data Quality repository from which to import plans.
Server side: PowerCenter needs an instance of the Data Quality engine to execute the plan instructions.
An IDQ-trained consultant can build the data quality plans, or you can use the pre-built plans provided by Informatica. Currently, Informatica provides a set of plans dedicated to cleansing and de-duplicating North American name and postal address records. The Integration component enables the following process:
Data quality plans are built in Data Quality Workbench and saved from there to the Data Quality repository.

The PowerCenter Designer user opens a Data Quality Integration transformation and configures it to read from the Data Quality repository. Next, the user selects a plan from the Data Quality repository and adds it to the transformation.

The PowerCenter Designer user saves the transformation, and the mapping containing it, to the PowerCenter repository. The plan information is saved with the transformation as XML.
The PowerCenter Integration service can then run a workflow containing the saved mapping. The relevant source data and plan information will be sent to the Data Quality engine, which processes the data (in conjunction with any reference data files used by the plan) and returns the results to PowerCenter.
Using Data Explorer for Data Discovery and Analysis
Description
Creating a Custom or Auto Profile
The Data Profiling option provides visibility into the data contained in source systems and enables users to measure changes in the source data over time. This information can help to improve the quality of the source data.

An auto profile is particularly valuable when you are profiling a source for the first time, since it offers a good overall perspective of the source. It provides a row count, candidate key evaluation, and redundancy evaluation at the source level, and domain inference, distinct and null value counts, and min, max, and average (if numeric) at the column level. Creating and running an auto profile is quick and helps you gain a reasonably thorough understanding of a source in a short amount of time.

A custom data profile is useful when there is a specific question about a source, such as validating a business rule or testing whether data matches a particular pattern.
Open the Profile Manager and choose Tools > Options. If you are profiling data using a database user that is not the owner of the tables to be sourced, check the 'Use source owner name during profile mapping generation' option. If you are in the analysis phase of your project, choose 'Always run profile interactively', since most of your data-profiling tasks will be interactive. (In later phases of the project, uncheck this option because more permanent data profiles are useful in those phases.)
Use Interactive to create quick, single-use data profiles. The sessions are created with default configuration parameters. For data-profiling tasks that are likely to be reused on a regular basis, create the sessions manually in Workflow Manager and configure and schedule them appropriately.
For greater flexibility, you can also use Data Analyzer to view reports. Each PowerCenter client includes a Data Analyzer schema and reports xml file. The xml files are located in the \Extensions\DataProfile\IPAReports subdirectory of the client installation. You can create additional metrics, attributes, and reports in Data Analyzer to meet specific business requirements. You can also schedule Data Analyzer reports and alerts to send notifications in cases where data does not meet preset quality limits.
Sampling Techniques
Four types of sampling techniques are available with the PowerCenter data profiling option:
No sampling - Uses all source data. Use for relatively small data sources.
Automatic random sampling - PowerCenter determines the appropriate percentage to sample, then samples random rows. Use for larger data sources where you want a statistically significant data analysis.
Manual random sampling - PowerCenter samples random rows of the source data based on a user-specified percentage. Use to sample more or fewer rows than the automatic option chooses.
select 'analyze table ' || table_name || ' compute statistics;' from user_tables where table_name like 'PMDP%';
select 'analyze index ' || index_name || ' compute statistics;' from user_indexes where index_name like 'DP%';
Microsoft SQL Server
select 'update statistics ' + name from sysobjects where name like 'PMDP%'
SYBASE
select 'update statistics ' + name from sysobjects where name like 'PMDP%'
INFORMIX
select 'update statistics low for table ', tabname, ' ; ' from systables where tabname like 'PMDP%'
IBM DB2
select 'runstats on table ' || rtrim(tabschema) || '.' || tabname || ' and indexes all;' from syscat.tables where tabname like 'PMDP%'
TERADATA
select 'collect statistics on ', tablename, ' index ', indexname from dbc.indices where tablename like 'PMDP%' and databasename = 'database_name'

where database_name is the name of the repository database.
Description
Poor data quality frequently hinders the success of data integration projects. It can produce inconsistent or faulty results and ruin the credibility of the system with the business users. This Best Practice focuses on techniques for use with PowerCenter and third-party or add-on software. Comments that are specific to the use of PowerCenter are enclosed in brackets.

Bear in mind that you can augment or supplant the data quality handling capabilities of PowerCenter with Informatica Data Quality (IDQ), the Informatica application suite dedicated to data quality issues. Data analysis and data enhancement processes, or plans, defined in IDQ can deliver significant data quality improvements to your project data.

A data project that has built-in data quality steps, such as those described in the Analyze and Design phases of Velocity, enjoys a significant advantage over a project that has not audited and resolved issues of poor data quality. If you have added these data quality steps to your project, you are likely to avoid the issues described below. A description of the range of IDQ capabilities is beyond the scope of this document. For a summary of Informatica's data quality methodology, as embodied in IDQ, consult the Best Practice Data Cleansing.
data synchronization, or data migration. Certain questions need to be considered for all of these projects, with the answers driven by the project's requirements and the business users being serviced. Ideally, these questions should be addressed during the Analyze and Design phases of the project because they can require a significant amount of re-coding if identified later. Some of the areas to consider are:
Text Formatting
The most common hurdle here is capitalization and trimming of spaces. Often, users want to see data in its raw format, without any capitalization, trimming, or formatting applied to it. This is easily achievable, as it is the default behavior, but there is danger in taking this requirement literally, since it can lead to duplicate records when some of these fields are used to identify uniqueness and the system is combining data from various source systems.

One solution to this issue is to create additional fields that act as a unique key to a given table, but which are formatted in a standard way. Since the raw data is stored in the table, users can still see it in this format, but the additional columns mitigate the risk of duplication. Another possibility is to explain to the users that raw data in unique, identifying fields is not as clean and consistent as data in a common format; in other words, push back on this requirement.

This issue can be particularly troublesome in data migration projects where matching the source data is a high priority. Failing to trim leading/trailing spaces from data can often lead to mismatched results, since the spaces are stored as part of the data value. The project team must understand how spaces are handled by the source systems to determine the amount of coding required to correct this. (When using PowerCenter and sourcing flat files, the options provided while configuring the File Properties may be sufficient.) Remember that certain RDBMS products use the data type CHAR, which stores the data with trailing blanks. These blanks need to be trimmed before matching can occur. It is usually only advisable to use CHAR for 1-character flag fields.
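The standardized-key approach can be sketched in a few lines of Python. This is an illustration only, not Informatica code; the column names are hypothetical.

```python
# Illustrative sketch: keep the raw value for display, but derive a
# standardized key column (trimmed and upper-cased) for uniqueness checks.

def make_match_key(raw: str) -> str:
    """Trim leading/trailing spaces and upper-case a raw value."""
    return raw.strip().upper()

records = [
    {"customer_name": "  Acme Corp"},
    {"customer_name": "ACME CORP  "},
]

# Add the standardized key without altering the raw column.
for rec in records:
    rec["customer_name_key"] = make_match_key(rec["customer_name"])

# Both rows now share one key, exposing the duplicate the raw values hid.
assert records[0]["customer_name_key"] == records[1]["customer_name_key"]
```

The raw column remains available to users, while joins and uniqueness checks run against the key column.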
Note that many fixed-width files use spaces, not nulls, to pad columns. Therefore, developers must enter one space beside the text radio button and also tell the product that the space is repeating, so that it fills out the rest of the precision of the column. The strip trailing blanks facility then strips any remaining spaces from the end of the data value.

In PowerCenter, avoid embedding database text manipulation functions in lookup transformations: the resulting SQL override forces the lookup table to be cached, and on very large tables caching is not always realistic or feasible.
Datatype Conversions
It is advisable to use explicit tool functions when converting the data type of a particular data value. [In PowerCenter, if the TO_CHAR function is not used, an implicit conversion is performed and 15 digits are carried forward, even when they are not needed or desired. PowerCenter can handle some conversions without function calls (these are detailed in the product documentation), but relying on them may cause subsequent support or maintenance headaches.]
Dates
Dates can cause many problems when moving and transforming data from one place to another because an assumption must be made that all data values are in a designated format. [Informatica recommends first checking a piece of data to ensure it is in the proper format before trying to convert it to a Date data type. If the check is not performed first, the developer increases the risk of transformation errors, which can cause data to be lost.] An example piece of code would be:

IIF(IS_DATE(in_RECORD_CREATE_DT, 'YYYYMMDD'), TO_DATE(in_RECORD_CREATE_DT, 'YYYYMMDD'), NULL)

If the majority of the dates coming from a source system arrive in the same format, it is often wise to create a reusable expression that handles dates, so that the proper checks are made. It is also advisable to determine whether any default dates should be defined, such as a low date or high date; these should then be used throughout the system for consistency. However, do not fall into the trap of always using default dates, as some dates are meant to be NULL until the appropriate time (e.g., birth date or death date). The NULL in the example above could be changed to one of the standard default dates described here.
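The same validate-before-convert pattern can be sketched in Python. This is an analogy to the IIF(IS_DATE(...)) expression above, not PowerCenter code; the function name is hypothetical.

```python
# Illustrative sketch: check that a string parses in the expected date
# format before converting it; return None (the NULL analogue) otherwise.
from datetime import datetime

def to_date_checked(value, fmt="%Y%m%d"):
    """Return a datetime if value parses in fmt, else None."""
    try:
        return datetime.strptime(value, fmt)
    except (ValueError, TypeError):
        return None

assert to_date_checked("20240229") is not None   # valid leap-day date
assert to_date_checked("20230230") is None       # February 30 rejected
assert to_date_checked("02/03/2023") is None     # wrong format rejected
```

Rejected values can then be routed to an error table rather than causing transformation errors and lost rows.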
Decimal Precision
With numeric data columns, developers must determine the expected or required precision of the columns. (By default, to increase performance, PowerCenter treats all numeric columns as 15-digit floating point decimals, regardless of how they are defined in the transformations. The maximum numeric precision in PowerCenter is 28 digits.) If it is determined that a column realistically needs a higher precision, the Enable Decimal Arithmetic option in the Session Properties needs to be checked. However, be aware that enabling this option can slow performance by as much as 15 percent. The Enable Decimal Arithmetic option must be enabled when comparing two numbers for equality.
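Why equality comparisons need decimal arithmetic can be illustrated outside PowerCenter with Python's decimal module (an analogy, not PowerCenter behavior):

```python
# Illustrative sketch: 15-digit floating point cannot represent every
# value in a high-precision column exactly, so distinct values can
# compare as equal unless decimal arithmetic is used.
from decimal import Decimal

a = "1000000000000000000.01"   # more than 15 significant digits
b = "1000000000000000000.02"

# As floats, the two values collapse to the same number...
assert float(a) == float(b)

# ...but decimal arithmetic keeps them distinct.
assert Decimal(a) != Decimal(b)
```

The trade-off is the same in both environments: exactness costs performance, so enable it only where the precision genuinely matters.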
The most important technique for ensuring good data quality is to prevent incorrect, inconsistent, or incomplete data from ever reaching the target system. This goal may be difficult to achieve in a data synchronization or data migration project, but it is very relevant when discussing data warehousing or ODS. This section discusses techniques that you can use to prevent bad data from reaching the system.
Another solution applicable in cases with a small number of code values is to try to anticipate any mistyped error codes and translate them back to the correct codes. The cross-reference translation data can be accumulated over time. Each time an error is corrected, both the incorrect and correct values should be put into the table and used to correct future errors automatically.
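A minimal Python sketch of such an accumulating correction table follows; the table layout and function names are hypothetical.

```python
# Illustrative sketch: each time an analyst corrects a mistyped code, the
# (bad, good) pair is stored and applied automatically to future loads.

corrections = {}   # bad_code -> correct_code, accumulated over time

def record_correction(bad, good):
    corrections[bad] = good

def translate(code, valid_codes):
    """Return the code if valid, a known correction if one exists,
    or the original value (to be flagged for review) otherwise."""
    if code in valid_codes:
        return code
    return corrections.get(code, code)

valid = {"USD", "EUR", "GBP"}
record_correction("USF", "USD")           # analyst fixed this typo once

assert translate("EUR", valid) == "EUR"   # valid code passes through
assert translate("USF", valid) == "USD"   # known typo auto-corrected
```

In a production system the corrections dictionary would live in a cross-reference table so that fixes persist across loads.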
Description
At a high level, there are three ways to add data quality to your project:

- Add a discrete and self-contained data quality stage, such as that enabled by using pre-built Informatica Data Quality (IDQ) processes, or plans, in conjunction with Informatica Data Cleanse and Match.
- Add an expanded but finite set of data quality actions to the project, for example in cases where pre-built plans do not fit the project parameters.
- Incorporate data quality actions throughout the project.
This document should help you decide which of these methods best suits your project and assist in estimating the time and resources needed for the first and second methods.
The relevant IDQ components include:

- Data Quality Workbench, a user-interface application for building and executing data quality processes, or plans.
- Data Quality Integration, a plug-in component for PowerCenter that integrates PowerCenter and IDQ.
- At least one set of reference data files that can be read by data quality plans to validate and enrich certain types of project data. For example, Data Cleanse and Match can be used with the North America Content Pack, which includes pre-built data quality plans and complete address reference datasets for the United States and Canada.
discrete stage. In a more complex scenario, a Data Quality Developer may wish to modify the underlying data quality plans or create new plans to focus on quality analysis or enhancements in particular areas. This expansion of the data quality operations beyond the pre-built plans can also be handled within a discrete data quality stage.

The Project Manager may decide to implement a more thorough approach to data quality and integrate data quality actions throughout the project plan. In many cases, a convincing case can be made for enlarging the data quality aspect to encompass the full data project. (Velocity contains several tasks and subtasks concerned with such an endeavor.) This is well worth considering. Often, businesses do not realize the extent to which their business and project goals depend on the quality of their data.

The project impact of these three types of data quality activity can be summarized as follows:
- Simple stage: 10 days, 1-2 Data Quality Developers.
- Expanded data quality stage: 15-20 days, 2 Data Quality Developers, high visibility to business.
- Data quality integrated with the data project: duration of the data project, 2 or more project roles, impact on business and project objectives.
Note: The actual time that should be allotted to the data quality stages noted above depends on the factors discussed in the remainder of this document.
The factors to weigh are:

- Matching requirements
- Data volumes
- Geography

Determine which scenario (out-of-the-box Data Cleanse and Match, expanded Data Cleanse and Match, or a thorough data quality integration) best fits your data project by considering the project's overall objectives and its mix of factors.
The simple, out-of-the-box scenario applies when:

- Baseline functionality of the pre-built data quality plans meets 80 percent of the project needs.
- The project relates to a single country.
- The source data quality level is not a major concern.

The simple stage follows three steps:

- Run pre-built plans.
- Review plan results.
- Transfer data to the next stage in the project and (optionally) add data quality plans to PowerCenter transformations.
While every project is different, a single iteration of the simple model may take approximately five days, as indicated below:
- Pass data to the next stage in the project and add plans to PowerCenter transformations (2 days)

Note that these estimates fit neatly into a five-day week but may be conservative in some cases. Note also that a Data Quality Developer can tune plans on an ad-hoc basis to suit the project. Therefore, you should plan for a two-week simple data quality stage.
Steps in the simple stage, week 1:

- Run pre-built plans (2 days)
- Review plan results (1 day)
- Fine-tune pre-built plans if necessary (2 days)

Steps in the simple stage, week 2:

- Re-run pre-built plans and review plan results with stakeholders (2 days)
- Add plans to PowerCenter transformations and define mappings (1 day)
- Run PowerCenter workflows (1 day)
- Review results, obtain approval from stakeholders, and pass all files to the next project stage (1 day)
If a data quality audit is added early in the project, the data quality stage grows into a project-length endeavor. If the data quality audit is included in the discrete data quality stage, the expanded, three-week Data Quality stage may look like this:
Steps in the enhanced DQ stage (three weeks):

- Set up and run data analysis plans
- Review plan results
- Conduct advanced tuning of pre-built plans
- Run pre-built plans
- Review plan results with stakeholders
- Modify pre-built plans or build new plans from scratch
- Re-run the plans
- Review plan results/obtain approval from stakeholders
- Add approved plans to PowerCenter transformations and define mappings
- Run PowerCenter workflows
- Review results/obtain approval from stakeholders
- Approve and pass all files to the next project stage
Cleansing.
Matching Requirements
Data matching plans are the most performance-intensive type of data quality plan. Moreover, matching plans are often coupled to a type of data standardization plan (i.e., grouping plan) that prepares the data for
match analysis. Matching plans are not necessarily more complex to design than other types of plans, although they may contain sophisticated business rules. However, the time taken to execute a matching plan grows geometrically with the volume of data records passed through the plan. (Specifically, the time taken is proportional to the size and number of data groups created in the grouping plans.)

Action: Consult the Best Practice on Effective Data Matching Techniques and determine how long your matching plans may take to run.
Data Volumes
Data matching requirements and data volumes are closely related. As stated above, the time taken to execute a matching plan grows geometrically with the volume of data records passed through it; in other types of plans, this geometric relationship does not exist. However, the general rule applies: the larger your data volumes, the longer it takes for plans to execute.

Action: Although IDQ can handle data volumes measurable in eight figures, a dataset of more than 1.5 million records is considered larger than average. If your dataset is measurable in millions of records, and high levels of matching/de-duplication are required, consult the Best Practice on Effective Data Matching Techniques.
Geography
Geography affects the project plan in two ways:
- First, the geographical spread of data sites is likely to affect the time needed to run plans, collate data, and engage with key business personnel. Working hours in different time zones can mean that one site is starting its business day while others are ending theirs, and this can affect the tight scheduling of the simple data quality stage.
- Second, project data that is sourced from several countries typically means multiple data sources, with opportunities for data quality issues to arise that may be specific to the country or the division of the organization providing the data source.
There is also a high correlation between the scale of the data project and the scale of the enterprise in which the project will take place. For multi-national corporations, there is rarely such a thing as a small data project!
Action: Consider the geographical spread of your source data. If the data sites are spread across several time zones or countries, you may need to factor in time lags to your data quality planning.
This Best Practice has two aims:

- To identify the key performance variables that affect the design and execution of IDQ matching plans.
- To describe plan design and plan execution actions that will optimize plan performance and results.
To optimize your data matching operations in IDQ, you must be aware of the factors that are discussed below.
Description
All too often, an organization's datasets contain duplicate data in spite of numerous attempts to cleanse the data or prevent duplicates from occurring. In other scenarios, the datasets may lack common keys (such as customer numbers or product ID fields) that, if present, would allow clear joins between the datasets and improve business knowledge.

Identifying and eliminating duplicates in datasets can serve several purposes. It enables the creation of a single view of customers; it can help control costs associated with mailing lists by preventing multiple pieces of mail from being sent to the same person or household; and it can assist marketing efforts by identifying households or individuals who are heavy users of a product or service. Data can be enriched by matching across production data and reference data sources. Business intelligence operations can be improved by identifying links between two or more systems to provide a more complete picture of how customers interact with a business.

IDQ's matching capabilities can help to resolve dataset duplications and deliver business results. However, a user's ability to design and execute a matching plan that meets the key requirements of performance and match quality depends on understanding the best-practice approaches described in this document. An integrated approach to data matching involves several steps that prepare the data for matching and improve the overall quality of the matches. The following table outlines the processes in each step.
- Profiling: Typically the first stage of the data quality process, profiling generates a picture of the data and indicates the data elements that can comprise effective group keys. It also highlights the data elements that require standardizing to improve match scores.
- Standardization: Removes noise, excess punctuation, variant spellings, and other extraneous data elements. Standardization reduces the likelihood that match quality will be affected by data elements that are not relevant to match determination.
- Grouping: A post-standardization function in which the group key fields identified in the profiling stage are used to segment data into logical groups that facilitate matching plan performance.
- Matching: The process whereby the data values in the created groups are compared against one another and record matches are identified according to user-defined criteria.
- Consolidation: The process whereby duplicate records are cleansed. It identifies the master record in a duplicate cluster and permits the creation of a new dataset or the elimination of subordinate records. Any child data associated with subordinate records is linked to the master record.
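The steps above can be sketched end-to-end as simple Python functions. This is a hypothetical illustration of the flow, not IDQ code; real plans implement each stage with dedicated components.

```python
# Illustrative pipeline: standardize, group by a key suggested by
# profiling, then match pairwise within each group.

def standardize(rec):
    """Trim and upper-case all field values."""
    return {k: v.strip().upper() for k, v in rec.items()}

def group(records):
    """Segment records into groups keyed on the country field."""
    groups = {}
    for rec in records:
        groups.setdefault(rec["country"], []).append(rec)
    return groups

def match(group_recs):
    """Naive pairwise comparison on the name field within one group."""
    pairs = []
    for i in range(len(group_recs)):
        for j in range(i + 1, len(group_recs)):
            if group_recs[i]["name"] == group_recs[j]["name"]:
                pairs.append((i, j))
    return pairs

records = [
    {"name": "  Jane Doe", "country": "CA"},
    {"name": "JANE DOE ", "country": "CA"},
    {"name": "Sean Ryan", "country": "IE"},
]
std = [standardize(r) for r in records]
dupes = {k: match(v) for k, v in group(std).items()}

assert dupes["CA"] == [(0, 1)]   # duplicate found only within its group
assert dupes["IE"] == []
```

Consolidation would then pick a master record from each duplicate pair and re-link any child data, which is omitted here.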
The sections below identify the key factors that affect the performance (or speed) of a matching plan and the quality of the matches identified. They also outline the best practices that ensure that each matching plan is implemented with the highest probability of success. (This document does not make any recommendations on profiling, standardization or consolidation strategies. Its focus is grouping and matching.) The following table identifies the key variables that affect matching plan performance and the quality of matches identified.
- Group size (affects plan performance): The number and size of groups have a significant impact on plan execution speed.
- Group keys (affect quality of matches): The proper selection of group keys ensures that the maximum number of possible matches is identified in the plan.
- Hardware resources (affect plan performance): Processors, disk performance, and memory require consideration.
- Size of dataset(s) (affects plan performance): Not a high-priority issue, but it should be considered when designing the plan.
- IDQ components (affect plan performance): The plan designer must weigh file-based versus database matching approaches when considering plan requirements.
- Time window (affects plan performance): The time taken for a matching plan to complete execution depends on its scale. Timing requirements must be understood up-front.
- Match identification (affects quality of matches): The plan designer must weigh deterministic versus probabilistic approaches.
Group Size
Grouping breaks large datasets down into smaller ones to reduce the number of record-to-record comparisons performed in the plan, which directly impacts the speed of plan execution. When matching on grouped data, a matching plan compares the records within each group with one another. When grouping is implemented properly, plan execution speed is increased significantly, with no meaningful effect on match quality.

The most important determinant of plan execution speed is the size of the groups to be processed, that is, the number of data records in each group. For example, consider a dataset of 1,000,000 records, for which a grouping strategy generates 10,000 groups. If 9,999 of these groups have an average of 50 records each, the remaining group will contain more than 500,000 records; based on this one large group, the matching plan would require 87 days to complete, processing 1,000,000 comparisons a minute! In comparison, the remaining 9,999 groups could be matched in about 12 minutes if the group sizes were evenly distributed.

Group size can also have an impact on the quality of the matches returned in the matching plan. Large groups perform more record comparisons, so more likely matches are potentially identified. The reverse is true for small groups: as groups
get smaller, fewer comparisons are possible, and the potential for missing good matches is increased. Therefore, groups must be defined intelligently through the use of group keys.
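The arithmetic behind the 87-day example can be verified with a short sketch, assuming the stated rate of roughly one million comparisons per minute:

```python
# Worked check: pairwise comparisons within a group of n records is
# n*(n-1)/2, processed at about 1,000,000 comparisons per minute.

def comparisons(n):
    return n * (n - 1) // 2

RATE = 1_000_000  # comparisons per minute

big_group_minutes = comparisons(500_000) / RATE
small_groups_minutes = 9_999 * comparisons(50) / RATE

assert round(big_group_minutes / (60 * 24)) == 87   # ~87 days
assert round(small_groups_minutes) == 12            # ~12 minutes
```

Because the comparison count grows with the square of the group size, one oversized group dominates the plan's total run time.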
Group Keys
Group keys determine which records are assigned to which groups. Group key selection, therefore, has a significant effect on the success of matching operations. Grouping splits data into logical chunks and thereby reduces the total number of comparisons performed by the plan. The selection of group keys, based on key data fields, is critical to ensuring that relevant records are compared against one another. When selecting a group key, two main criteria apply:

- Candidate group keys should represent a logical separation of the data into distinct units where there is a low probability that matches exist between records in different units. This can be determined by profiling the data and uncovering the structure and quality of the content prior to grouping.
- Candidate group keys should also have high scores in three key areas of data quality: completeness, conformity, and accuracy. Problems in these areas can be improved by standardizing the data prior to grouping.
For example, geography is a logical separation criterion when comparing name and address data. A record for a person living in Canada is unlikely to match someone living in Ireland. Thus, the country-identifier field can provide a useful group key. However, if you are working with national data (e.g. Swiss data), duplicate data may exist for an individual living in Geneva, who may also be recorded as living in Genf or Geneve. If the group key in this case is based on city name, records for Geneva, Genf, and Geneve will be written to different groups and never compared unless variant city names are standardized.
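The Geneva example can be sketched as follows; the variant mapping shown is hypothetical and would come from reference data in practice:

```python
# Illustrative sketch: standardize variant city names before grouping so
# that records for "Geneva", "Genf", and "Geneve" share one group key
# and can actually be compared.

CITY_VARIANTS = {"GENF": "GENEVA", "GENEVE": "GENEVA"}

def city_group_key(city):
    c = city.strip().upper()
    return CITY_VARIANTS.get(c, c)

keys = {city_group_key(c) for c in ["Geneva", "Genf", "Geneve"]}
assert keys == {"GENEVA"}   # all three variants land in the same group
```

Without this standardization step, the three records would be written to different groups and never compared.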
Size of Dataset
In matching, the size of the dataset typically does not have as significant an impact on plan performance as the definition of the groups within the plan. However, in general terms, the larger the dataset, the more time required to produce a matching plan both in terms of the preparation of the data and the plan execution.
IDQ Components
All IDQ components serve specific purposes, and very little functionality is duplicated across the components. However, there are performance implications for certain component types, combinations of components, and the quantity of components used in the plan. Several tests have been conducted on IDQ (version 2.11) to test source/sink combinations and various operational components. In tests comparing file-based matching against database matching, file-based matching outperformed database matching in UNIX and Windows environments for plans containing up to 100,000 groups. Also, matching plans that wrote output to a CSV Sink outperformed plans with a DB Sink or Match Key Sink. Plans with a Mixed Field Matcher component performed more slowly than plans without a Mixed Field Matcher. Raw performance should not be the only consideration when selecting the components to use in a matching plan. Different components serve different needs and may offer advantages in a given scenario.
Time Window
IDQ can perform millions or billions of comparison operations in a single matching plan. The time available for the completion of a matching plan can have a significant impact on the perception that the plan is running correctly. Knowing the time window for plan completion helps to determine the appropriate hardware configuration and grouping strategy.
Frequency of Execution
The frequency with which plans are executed is linked to the time window available. Matching plans may need to be tuned to fit within the cycle in which they are run. The more frequently a matching plan is run, the more the execution time will have to be considered.
Match Identification
The method used by IDQ to identify good matches has a significant effect on the success of the plan. Two key methods for assessing matches are:
- Deterministic matching applies a series of checks to determine if a match can be found between two records. IDQ's fuzzy matching algorithms can be combined with this method. For example, a deterministic check may first check whether the last name comparison score was greater than 85 percent. If this is true, it next checks the address. If an 80 percent match is found, it then checks the first name. If a 90 percent match is found on the first name, the entire record is considered successfully matched. The advantages of deterministic matching are: (1) it follows a logical path that can be easily communicated to others, and (2) it is similar to the methods employed when manually checking for matches. The disadvantages of this method are its rigidity and its requirement that each dependency be true. This can result in matches being missed, or can require several different rule checks to cover all likely combinations.
- Probabilistic matching takes the match scores from fuzzy matching components and assigns weights to them in order to calculate a weighted average that indicates the degree of similarity between two pieces of information. The advantage of probabilistic matching is that it is less rigid than deterministic matching. There are no dependencies on certain data elements matching in order for a full match to be found. Weights assigned to individual components can place emphasis on different fields or areas in a record. However, even if a heavily-weighted score falls below a defined threshold, match scores from less heavily-weighted components may still produce a match. The disadvantages of this method are a higher degree of required tweaking on the user's part to get the right balance of weights in order to optimize successful matches. This can be difficult for users to understand and communicate to one another. Also, the cut-off mark for good matches versus bad matches can be difficult to assess.
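The contrast between the two methods can be sketched as follows; the field weights and thresholds here are hypothetical illustrations, not IDQ defaults:

```python
# Illustrative sketch: the same per-field fuzzy match scores evaluated
# by a deterministic rule chain and by a probabilistic weighted average.

def deterministic(scores):
    """Chained rule checks: every dependency must hold."""
    return (scores["last_name"] > 0.85
            and scores["address"] >= 0.80
            and scores["first_name"] >= 0.90)

def probabilistic(scores, threshold=0.85):
    """Weighted average of field scores against a single cut-off."""
    weights = {"last_name": 0.5, "address": 0.3, "first_name": 0.2}
    total = sum(scores[f] * w for f, w in weights.items())
    return total >= threshold

scores = {"last_name": 0.95, "address": 0.75, "first_name": 0.95}

assert not deterministic(scores)   # fails the rigid address check
assert probabilistic(scores)       # 0.475 + 0.225 + 0.19 = 0.89 passes
```

The example shows the trade-off described above: the deterministic chain misses a record that two strong fields support, while the weighted average tolerates one weak field at the cost of a threshold that must be tuned.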
For example, a matching plan with 95 to 100 percent success may have found all good matches, but matching plan success between 90 and 94 percent may map to only 85 percent genuine matches. Matches between 85 and 89 percent may correspond to only 65 percent genuine matches, and so on. The following table illustrates this principle.
Close analysis of the match results is required because of the relationship between match quality and the match threshold scores assigned: there may not be a one-to-one mapping between the plan's weighted score and the number of records that can be considered genuine matches.
Key questions to consider include:

- How large is the dataset to be matched?
- How often will the matching plans be executed?
- When will the match process need to be completed? Are there any other dependent processes?
- What are the rules for determining a match?
- What process is required to sign off on the quality of match results?
- What processes exist for merging records?
Test Results
Performance tests demonstrate the following:
- IDQ has near-linear scalability in a multi-processor environment.
- Scalability in standard installations, as achieved in the allocation of matching plans to multiple processors, will eventually level off.
Performance is the key to success in high-volume matching solutions. IDQ's architecture supports massive scalability by allowing large jobs to be subdivided and executed across several processors. This scalability greatly enhances IDQ's ability to meet the service levels required by users without sacrificing quality or requiring an overly complex solution.
As best-practice guidelines:

- 1,000 groups per one-million-record dataset
- 500,000,000 comparisons per 1 million records, +/- 20 percent
In cases where the datasets are large, multiple group keys may be required to segment the data to ensure that best practice guidelines are followed. Informatica Corporation can provide sample grouping plans that automate these requirements as far as is practicable.
Hardware Specifications
Matching is a resource-intensive operation, especially in terms of processor capability. Three key variables determine the effect of hardware on a matching plan: processor speed, disk performance, and memory.

The majority of the activity required in matching is tied to the processor. Therefore, the speed of the processor has a significant effect on how fast a matching plan completes. Although the average computational speed for IDQ is one million comparisons per minute, the speed can range from as low as 250,000 comparisons to 6.5 million comparisons per minute, depending on the hardware specification, background processes running, and components used. As a best practice, higher-specification processors (e.g., 1.5 GHz minimum) should be used for high-volume matching plans.

Hard disk capacity and available memory can also determine how fast a plan completes. The hard disk reads and writes data required by IDQ sources and sinks. The speed of the disk and the level of defragmentation affect how quickly data can be read from, and written to, the hard disk. Information that cannot be stored in memory during plan execution must be temporarily written to the hard disk. This increases the time required to retrieve information that otherwise could be stored in memory, and also increases the load on the hard disk. A RAID drive may be appropriate for datasets of 3 to 4 million records, and a minimum of 512MB of memory should be available.

The following table is a rough guide for hardware estimates based on IDQ Runtime on Windows platforms. Specifications for UNIX-based systems vary.
- Match volumes under 1,500,000 records: 1.5 GHz computer, 512MB RAM
- Larger match volumes: multi-processor server, 1GB RAM
- Largest match volumes: multi-processor server, 2GB RAM, RAID 5 hard disk
The timing formulas for multi-processor execution are:

- Grouping and standardization: the single-processor time depends on the operations and the size of the dataset (time equals Y); the multi-processor time equals the single-processor time plus 20 percent (time equals Y * 1.20).
- Matching: the multi-processor time equals the single-processor matching time divided by the number of processors (NP), plus 25 percent (time equals [X / NP] * 1.25).
For example, if a single-processor plan takes one hour to group and standardize the data and eight hours to match, a four-processor plan should require approximately one hour and 12 minutes to group and standardize and two and a half hours to match. The difference between the single- and multi-processor plans in this case would be more than five hours (roughly nine hours for the single-processor plan versus under four hours for the quad-processor plan).
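The scaling formulas can be checked with a short sketch, using the example figures (times in hours):

```python
# Worked check of the timing formulas: Y hours to group/standardize and
# X hours to match on a single processor, run on NP processors.

Y, X, NP = 1.0, 8.0, 4

group_std_multi = Y * 1.20       # grouping does not parallelize; +20%
match_multi = (X / NP) * 1.25    # matching splits across processors

assert group_std_multi == 1.2    # 1 hour 12 minutes
assert match_multi == 2.5        # 2 hours 30 minutes
assert (Y + X) - (group_std_multi + match_multi) > 5   # > 5 hours saved
```

Note that only the matching step benefits from additional processors; the grouping and standardization step actually incurs a small overhead.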
Possible limit to the number of groups that can be created: None / Low / High
This section discusses the challenges facing IDQ matching plan designers in optimizing their plans for speed of execution and quality of results. It highlights the key factors affecting matching performance and discusses the results of IDQ performance testing in single- and multi-processor environments.

Checking for duplicate records where no clear connection exists among data elements is a resource-intensive activity. In order to detect matching information, a record must be compared against every other record in a dataset. For a single data source, the quantity of comparisons required to check an entire dataset increases quadratically as the volume of data increases. A similar situation arises when matching between two datasets, where the number of comparisons required is the product of the volumes of data in each dataset. When the volume of data increases into the tens of millions, the number of comparisons required to identify matches, and consequently the amount of time required to check for matches, reaches impractical levels.
Two factors determine the total time required to check a dataset for matches:
- The number of comparisons required to check the data.
- The number of comparisons that can be performed per minute.
The first factor can be controlled in IDQ through grouping, which involves logically segmenting the dataset into distinct elements, or groups, so that there is a high probability that records within a group are not duplicates of records outside of the group. Grouping data greatly reduces the total number of required comparisons without affecting match accuracy. IDQ affects the number of comparisons per minute in two ways:
- Its matching components maximize the comparison activities assigned to the computer processor. This reduces the amount of disk I/O communication in the system and increases the number of comparisons per minute; hardware with higher processor speeds therefore achieves higher match throughput.
- Its architecture allows matching tasks to be broken into smaller tasks and shared across multiple processors. The use of multiple processors to handle matching operations greatly enhances IDQ scalability with regard to high-volume matching problems.
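The effect of grouping on the comparison count, described above, can be illustrated with a short calculation (a sketch; the group sizes chosen are arbitrary):

```python
def pair_comparisons(n):
    """Pairwise comparisons needed for n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2

def grouped_comparisons(group_sizes):
    """Comparisons needed when records are compared only within groups."""
    return sum(pair_comparisons(size) for size in group_sizes)

# One million records compared exhaustively, versus the same records
# split into 1,000 groups of 1,000 records each.
ungrouped = pair_comparisons(1_000_000)         # 499,999,500,000
grouped = grouped_comparisons([1_000] * 1_000)  # 499,500,000
```

Grouping here cuts the comparison count by a factor of roughly 1,000, which is why a well-chosen group key matters more than raw hardware for very large volumes.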
The following section outlines how a multi-processor matching solution can be implemented and illustrates the results obtained in Informatica Corporation testing.
Execution times for multiple processors were based on the longest execution time of the jobs run in parallel. Therefore, having an even distribution of records across all processors was important to maintaining scalability. When the data was
not evenly distributed, some match plans ran longer than others, and the benefits of scaling over multiple processors were not as evident.
Description
Data cleansing refers to operations that remove non-relevant information and noise from the content of the data. Examples of cleansing operations include the removal of person names, care-of information, excess character spaces, or punctuation from postal addresses. Data standardization refers to operations that modify the appearance of the data so that it takes on a more uniform structure, and that enrich the data by deriving additional details from existing content.
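As a minimal illustration of the cleansing operations just described, the following sketch removes care-of markers, punctuation noise, and excess spaces from an address string. The rules shown are simplified examples, not IDQ's built-in logic:

```python
import re

def cleanse_address(text):
    """Remove noise from a postal address string: care-of markers,
    stray punctuation, and excess character spaces."""
    text = re.sub(r"\bc/o\b|\bcare of\b", "", text, flags=re.IGNORECASE)
    text = re.sub(r"[^\w\s/#-]", "", text)  # drop punctuation noise
    text = re.sub(r"\s{2,}", " ", text)     # collapse repeated spaces
    return text.strip()

print(cleanse_address("c/o  John Smith,, 100  Main St."))  # John Smith 100 Main St
```

In practice such rules are applied through IDQ components and reference dictionaries rather than hand-written expressions, but the principle is the same: strip noise first, then normalize structure.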
Within IDQ, the Profile Standardizer is a powerful tool for parsing unsorted data into the correct fields. However, when using the Profile Standardizer, be aware that there is a finite number of profiles (500) that can be contained within a cleansing plan. Users can extend the number of profiles by using the first 500 profiles within one component and then feeding the data overflow into a second Profile Standardizer via the Token Parser component.
After the data is parsed and labeled, it should be evident if reference dictionaries will be needed to further standardize the data. It may take several iterations of dictionary construction and review before the data is standardized to an acceptable level. Once acceptable standardization has been achieved, data quality scorecard or dashboard reporting can be introduced. For information on dashboard reporting, see the Report Viewer chapter of the Informatica Data Quality 3.1 User Guide.
Conclusion
Using the data cleansing and standardization techniques described in this Best Practice can help an organization to recognize the value of incorporating IDQ into its development methodology. Because data quality is an iterative process, the business rules initially developed may require ongoing modification, as the results produced by IDQ are affected by the starting condition of the data and the requirements of the business users.

When data arrives in multiple languages, it is worth creating similar IDQ plans for each country and applying the same rules across these plans. The data would typically be staged in a database, and the plans developed using a SQL statement as input, with a "where country_code = 'DE'" clause, for example. Country dictionaries are identifiable by country code to facilitate such statements. Remember that IDQ installs with a large set of reference dictionaries, and additional dictionaries are available from Informatica.

IDQ provides several components that focus on verifying and correcting the accuracy of name and postal address data. These components leverage address reference data that originates from national postal carriers such as the United States Postal Service. Such datasets enable IDQ to validate an address to premise level. Note that the reference datasets are licensed and installed as discrete Informatica products; it is therefore important to discuss their inclusion in the project with the business in advance, so as to avoid budget and installation issues. Several types of reference data, with differing levels of address granularity, are available from Informatica. Pricing for the licensing of these components may vary and should be discussed with the Informatica Account Manager.
Description
Reference data files can be used by a plan to verify or enhance the accuracy of the data inputs to the plan. A reference data file is a list of verified-correct terms and, where appropriate, acceptable variants on those terms. It may be a list of employees, package measurements, or valid postal addresses: any data set that provides an objective reference against which project data sources can be checked or corrected. Reference files are essential to some, but not all, data quality processes.

Reference data can be internal or external in origin. Internal data is specific to a particular project or client. Such data is typically generated from internal company information, and it may be custom-built for the project. External data has been sourced or purchased from outside the organization. External data is used when authoritative, independently-verified data is needed to provide the desired level of data quality to a particular aspect of the source data. Examples include the dictionary files that install with IDQ, postal address data sets that have been verified as current and complete by a national postal carrier such as the United States Postal Service, or company registration and identification information from an industry-standard source such as Dun & Bradstreet.

Reference data can be stored in a file format recognizable to Informatica Data Quality or in a format that requires intermediary (third-party) software in order to be read by Informatica applications. Internal data files, as they are often created specifically for data quality projects, are typically saved in the dictionary file format or as delimited text files, which are easily portable into dictionary format. Databases can also be used as a source for internal data. External files are more likely to remain in their original format. For example, external data may be contained in a database or in a library whose files cannot be edited or opened on the desktop to reveal discrete data values.
The question arises, are internal data sources sufficiently reliable for use as reference? Bear in mind that in some cases the reference data does not need to be 100 percent accurate. It can be good enough to compare project data against reference data and to flag inconsistencies between them, particularly in cases where both sets of data are highly unlikely to share common errors.
There are several ways to create a dictionary:
- Save an appropriately formatted delimited file as a .DIC file into the Dictionaries folders of your IDQ (client or server) installation.
- Use the Dictionary Manager within Data Quality Workbench. This method allows you to create text and database dictionaries.
- Write from plan files directly to a dictionary using the IDQ Report Viewer (see below).
The figure below shows a dictionary file open in IDQ Workbench and its underlying .DIC file open in a text editor. Note that the dictionary file has at least two columns of data. The Label column contains the correct or standardized form of each datum from the dictionary's perspective. The Item columns contain versions of each datum that the dictionary recognizes as identical or equivalent to the Label entry. Therefore, each datum in the dictionary must have at least two entries in the DIC file (see the text editor illustration below). A dictionary can have multiple Item columns.
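Because a .DIC file is essentially a delimited file with a Label column followed by one or more Item columns, its layout can also be read outside IDQ. A sketch, assuming the comma-delimited Label/Item layout described above:

```python
import csv
import io

def load_dictionary(dic_text):
    """Build a map from each recognized Item variant to its standardized
    Label. Each row of the file: Label, Item1 [, Item2, ...]."""
    mapping = {}
    for row in csv.reader(io.StringIO(dic_text)):
        if not row:
            continue
        label, items = row[0], row[1:]
        for item in items:
            mapping[item] = label
    return mapping

# Two sample rows in the Label,Item layout described above.
lookup = load_dictionary("Street,St,Str\nAvenue,Ave,Av")
print(lookup["St"])   # Street
print(lookup["Ave"])  # Avenue
```

A database-backed dictionary works the same way conceptually; the file read is simply replaced by a query against the linked table.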
To edit a dictionary value, open the DIC file and make your changes, either in a text editor or by opening the dictionary in the Dictionary Manager.
To add a value to a dictionary, open the DIC file in Dictionary Manager, place the cursor in an empty row, and add a Label string and at least one Item string. You can also add values in a text editor by placing the cursor on a new line and typing Label and Item values separated by commas. Once saved, the dictionary is ready for use in IDQ. Note: IDQ users with database expertise can create and specify dictionaries that are linked to database tables, and that thus can be updated dynamically when the underlying data is updated. Database dictionaries are useful when the reference data has been originated for other purposes and is likely to change independently of data quality. By making use of a dynamic connection, data quality plans can always point to the current version of the reference data.
Dictionary files must reside in one of the following locations:
- The Dictionaries folders installed with Workbench and Server.
- The user's file space in the Data Quality service domain.
IDQ does not recognize a dictionary file that is not in such a location, even if you can browse to the file when designing the data quality plan. Thus, any plan that uses a dictionary in a non-standard location will fail. This is most relevant when you publish or export a plan to another machine on the network. When the plan is copied to the server-side repository, you must ensure that copies of any dictionary files used in the local plan are available in a suitable location on the service domain: either in the user space on the server, or at a location in the server's Dictionaries folders that corresponds to the dictionaries' location on Workbench. Note: You can change the locations in which IDQ looks for plan dictionaries by editing the config.xml file. However, this is the master configuration file for the product and you should not edit it without consulting Informatica Support. Bear in mind that Data Quality looks only in the locations set in the config.xml file.
Database reference data can also be version controlled, although this presents difficulties if the database is very large in size. Bear in mind that third-party reference data, such as postal address data, should not ordinarily be changed, and so the need for a versioning strategy for these files is debatable.
Let's say you have designed a data quality plan that identifies invalid or anomalous records in a customer database. Using IDQ, you can create an exception file of these bad records, and subsequently use this file to create a dictionary-compatible file. For example, suppose you have an exception file containing suspect or invalid customer account records. Using a very simple data quality plan, you can quickly parse the account numbers from this file to create a new text file containing the account serial numbers only. This file effectively constitutes the Labels column of your dictionary. By opening this file in Microsoft Excel or a comparable program, copying the contents of Column A into Column B, and then saving the spreadsheet as a CSV file, you create a file with Label and Item1 columns. Rename the file with a .DIC suffix and add it to the Dictionaries folder of your IDQ installation: the dictionary is now visible to the IDQ Dictionary Manager. You now have a dictionary file of bad account numbers that you can use in any plans checking the validity of the organization's account records.
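The spreadsheet steps above (copy Column A into Column B, save as CSV, rename to .DIC) can equally be scripted. A sketch; the file name and account numbers are hypothetical:

```python
import csv

def build_bad_account_dictionary(account_numbers, out_path):
    """Write a Label,Item1 dictionary file in which each bad account
    number appears as both the Label and its only Item."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for acct in account_numbers:
            writer.writerow([acct, acct])  # Label and Item1 identical

# Hypothetical account numbers and output file name.
build_bad_account_dictionary(["AC-1001", "AC-2047"], "bad_accounts.DIC")
```

Scripting the step makes it repeatable, which matters because exception files are regenerated each time the identification plan runs.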
As a general rule, it is a best practice to follow the dictionary organization structure installed by the application, adding to that structure as necessary to accommodate specialized and supplemental dictionaries. Subsequent users are then relieved of the need to examine the config.xml file for possible modifications, thereby lowering the risk of accidental errors during migration. When following the original dictionary organization structure is not practical or contravenes other requirements, take care to document
the customizations. Since external data may be obtained from third parties and may not be in file format, the most efficient way to share its content across the organization is to locate it on the Data Quality Server machine. (Specifically, this is the machine that hosts the Execution Service.)
There are several ways to redirect plans to a new dictionary location:
- You can reset the location to which IDQ looks by default for dictionary files.
- You can reconfigure the plan components that employ the dictionaries to point to the new location. Depending on the complexity of the plan concerned, this can be very labor-intensive.
- If deploying plans in a batch or scheduled task, you can append the new location to the plan execution command. You can do this by appending a parameter file to the plan execution instructions on the command line. The parameter file is an XML file that can contain a simple command to use one file path instead of another.
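The exact schema of the parameter file is defined by the IDQ product documentation. Purely as an illustration of the idea (all element names below are hypothetical placeholders, not the real schema), such a file could be generated as follows:

```python
import xml.etree.ElementTree as ET

# All element names here are hypothetical placeholders: the real
# parameter-file schema is defined by the IDQ product documentation.
root = ET.Element("parameters")
param = ET.SubElement(root, "parameter")
ET.SubElement(param, "name").text = "DictionaryPath"
ET.SubElement(param, "value").text = "/shared/dictionaries"

xml_text = ET.tostring(root, encoding="unicode")
```

Consult the product documentation for the actual element names and the command-line syntax for attaching the file to a plan execution.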
Description
Testing data quality plans is an iterative process that occurs as part of the Design Phase of Velocity. That is, plan testing often precedes the project's main testing activities, as the tested plan outputs will be used as inputs in the Build Phase. It is not necessary to formally test the plans used in the Analyze Phase of Velocity.

The development of data quality plans typically follows a prototyping methodology of create, execute, analyze. Testing is performed as part of the third step, in order to determine that the plans are being developed in accordance with design and project requirements. This method of iterative testing helps support rapid identification and resolution of bugs.

Bear in mind that data quality plans are designed to analyze and resolve data content issues. These are not typically cut-and-dried problems; more often they represent a continuum of data improvement issues, where every data instance may be unique and there is a target level of data quality rather than a right or wrong answer. Data quality plans tend to resolve problems in terms of percentages and probabilities that a problem is fixed. For example, the project may set a target of 95 percent accuracy in its customer addresses.
- What dataset will you use to test the plans? While the ideal situation is to use a data set that exactly mimics the project production data, you may not gain access to this data. If you obtain a full cloned set of the project data for testing purposes, bear in mind that some plans (specifically, some data matching plans) can take several hours to complete. Consider testing data matching plans overnight.
- Are the plans using reference dictionaries? Reference dictionary
management is an important factor, since it is possible to make changes to a reference dictionary independently of IDQ and without making any changes to the plan itself. When you pass an IDQ plan as tested, you must ensure that no additional work is carried out on any dictionaries referenced in the plan. Moreover, you must ensure that the dictionary files reside in locations that IDQ recognizes as valid.
- How will the plans be executed? Will they be executed on a remote IDQ Server, and/or via a scheduler? In cases like these, it's vital to ensure that your plan resources, including source data files and reference data files, are in valid locations for use by the Data Quality engine. For details on the local and remote locations in which IDQ looks for source and reference data files, refer to the Informatica Data Quality 3.1 User Guide.
- Will the plans be integrated into a PowerCenter transformation? If so, the plans must have realtime-enabled data source and sink components.
Description
You should consider the following questions prior to making changes to a data quality plan:
- What is the purpose of changing the plan? Consider changing a plan if you believe it is not optimally configured, if it is not functioning properly and there is a problem at execution time, or if it is not delivering the results expected under its design principles.
- Are you trained to change the plan? Data quality plans can be complex. Do not alter a plan unless you have been trained or are highly experienced with IDQ methodology.
- Is the plan properly documented? Ensure that all plan documentation on the data flow and the data components is up-to-date. For guidelines on documenting IDQ plans, see the Sample Deliverable Data Quality Plan Design.
- Have you backed up the plan before editing? If you are using IDQ in a client-server environment, you can create a baseline version of the plan using IDQ version-control functionality. In addition, copy the plan to a new project folder (e.g., Work_Folder) in the Workbench for changing and testing, and leave the original plan untouched during testing.
- Is the plan operating directly on production data? This applies especially to standardization plans. When editing a plan, always work on staged data (database or flat file). You can migrate the plan to the production environment after complete and thorough testing.
You should have a clear goal whenever you plan to change an existing plan. An event may prompt the change: for example, input data changing (in format or content), or changes in business rules or business/project targets. You should take into account all current change-management procedures, and the updated plans should be thoroughly tested before production processes are updated. This includes integration and regression testing too. (See also Testing Data Quality Plans.) Bear in mind that at a high level there are two types of data quality plans: data analysis and data enhancement plans.
- Data analysis plans produce reports on data patterns and data quality across the input data. The key objective in data analysis is to determine the levels of completeness, conformity, and consistency in the dataset. In pursuing these objectives, data analysis plans can also identify cases of missing, inaccurate, or noisy data.
- Data enhancement plans correct completeness, conformity, and consistency problems; they can also identify duplicate data entries and fix accuracy issues through the use of reference data.
Your goal in a data analysis plan is to discover the quality and usability of your data. It is not necessarily your goal to obtain the best scores for your data. Your goal in a data enhancement plan is to resolve the data quality issues discovered in the data analysis.
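Two of the measures named above, completeness and conformity, reduce to simple ratios. A sketch; the conformity pattern below is an arbitrary example:

```python
import re

def completeness(values):
    """Share of records carrying a non-empty value."""
    filled = sum(1 for v in values if v not in (None, ""))
    return filled / len(values)

def conformity(values, pattern):
    """Share of non-empty values that match an expected format."""
    present = [v for v in values if v]
    return sum(1 for v in present if re.fullmatch(pattern, v)) / len(present)

zips = ["90210", "10001", "", "ABC12"]
print(completeness(zips))          # 0.75
print(conformity(zips, r"\d{5}"))  # 2 of the 3 non-empty values conform
```

Expressing the measures as ratios is what lets a project state targets such as the 95-percent accuracy goal mentioned earlier.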
Adding Components
In general, simply adding a component to a plan is not likely to affect results directly if no further changes are made to the plan. However, once the outputs from the new component are integrated into existing components, the data process flow changes, and the plan must be re-tested and its results reviewed in detail before the plan is migrated into production.

Bear in mind, particularly in data analysis plans, that improved plan statistics do not always mean that the plan is performing better. It is possible to configure a plan that moves beyond the point of truth by focusing on certain data elements and excluding others.

When added to existing plans, some components have a larger impact than others. For example, adding a To Upper component to convert text into upper case may not cause the plan results to change meaningfully, although the presentation of the output data will change. However, adding and integrating a Rule Based Analyzer component
(designed to apply business rules) may cause a severe impact, as the rules are likely to change the plan logic.

As well as adding a new component (that is, a new icon) to the plan, you can add a new instance to an existing component. This can have the same effect as adding and integrating a new component icon. To avoid overloading a plan with too many components, it is good practice to add multiple instances to a single component, within reason. Good plan design suggests that instances within a single component should be logically similar and work on the selected inputs in similar ways. If you add a new instance to a component, and that instance behaves very differently from the other instances in that component (for example, if it acts on an unrelated set of outputs or performs an unrelated type of action on the data), you should probably add a new component for this instance. This will also help you keep track of your changes onscreen.

To avoid making plans over-complicated, it is often good practice to split tasks into multiple plans when a large number of data quality measures need to be checked. This makes plans and business rules easier to maintain and provides a good framework for future development. For example, in an environment where a large number of attributes must be evaluated against the six standard data quality criteria (i.e., completeness, conformity, consistency, accuracy, duplication, and consolidation), using one plan per data quality criterion may be a good way to move forward. Alternatively, splitting plans up by data entity may be advantageous. Similarly, during standardization, you can create plans for specific functional areas (e.g., address, product, or name) rather than adding all standardization tasks to a single large plan. For more information on the six standard data quality criteria, see Data Cleansing.
Removing Components
Removing a component from a plan is likely to have a major impact since, in most cases, the data flow in the plan will be broken. If you remove an integrated component, configuration changes are required to all components that use the outputs of the removed component; the plan cannot run until these configuration changes are complete. The only exceptions are cases where the output(s) of the removed component are used solely by a CSV Sink component or by a frequency component. Even in these cases, note that the plan output changes, since the column(s) no longer appear in the result set.
Changing the configuration of a component can have an impact on the overall plan comparable to adding or removing a component: the plan's logic changes, and therefore so do the results it produces. However, although adding or removing a component may make a plan non-executable, changing the configuration of a component can affect the results in more subtle ways. For example, changing the reference dictionary used by a parsing component does not break a plan, but it may have a major impact on the resulting output. Similarly, changing the name of a component instance output does not break a plan. By default, component output names cascade through the other components in the plan, so when you change an output name, all subsequent components automatically update with the new output name; it is not necessary to change the configuration of dependent components.
Description
Data profiling and data mapping involve a combination of automated and human analyses to reveal the quality, content, and structure of project data sources. Data profiling analyzes several aspects of data structure and content, including the characteristics of each column or field, the relationships between fields, and the commonality of data values between fields, which is often an indicator of redundant data.
Data Profiling
Data profiling involves the explicit analysis of source data and the comparison of observed data characteristics against data quality standards. Data quality and integrity issues include invalid values, multiple formats within a field, non-atomic fields (such as long address strings), duplicate entities, cryptic field names, and others. Quality standards may either be the native rules expressed in the source data's metadata, or an external standard (e.g., corporate, industry, or government) to which the source data must be mapped in order to be assessed.
Data profiling comprises two basic activities:
- Inference of characteristics from the data.
- Comparison of those characteristics with specified standards, as an assessment of data quality.
Data mapping involves establishing relationships among data elements in various data structures or sources, in terms of how the same information is expressed or stored in different ways in different sources. By performing these processes early in a data project, IT organizations can preempt the code/load/explode syndrome, wherein a project fails at the load stage because the data is not in the anticipated form. Data profiling and mapping are fundamental techniques applicable to virtually any project. The following figure summarizes and abstracts these scenarios into a single depiction of the IDE solution.
1. Data is prepared for IDE. Metadata is imported into IDE.
2. IDE profiles the data, generates accurate metadata (including a normalized schema), and documents cleansing and transformation requirements based on the source and normalized schemas.
3. The resultant metadata are exported to and managed in the IDE Repository.
4. In a derived-target scenario, the project team designs the target database by modeling the existing data sources and then modifying the model as required to meet current business and performance requirements. In this scenario, IDE is used to develop the normalized schema into a target database. The normalized and target schemas are then exported to IDE's FTM/XML tool, which documents transformation requirements between fields in the source, normalized, and target schemas.
OR
5. In a fixed-target scenario, the design of the target database is a given (i.e., because another organization is responsible for developing it, or because an off-the-shelf package or industry standard is to be used). In this scenario, the schema development process is bypassed. Instead, FTM/XML is used to map the source data fields to the corresponding fields in an externally-specified target schema, and to document transformation requirements between fields in the normalized and target schemas. FTM is used for SQL-based metadata structures, and FTM/XML is used to map SQL and/or XML-based metadata structures. Externally-specified targets are typical for ERP package migrations, business-to-business integration projects, or situations where a data modeling team is independently designing the target schema.
6. The IDE Repository is used to export or generate reports documenting the cleansing, transformation, and loading or formatting specs developed with IDE applications.
Table Structural profiling - uses the sample data to infer relationships among the columns in a table. This process can discover primary and foreign keys, functional dependencies, and sub-tables.
Cross-Table profiling - determines the overlap of values across a set of columns, which may come from multiple tables.
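Column profiling of the kind described above infers characteristics from sample values. A minimal sketch with deliberately simplified heuristics:

```python
def profile_column(values):
    """Infer basic column characteristics from sample values:
    null rate, distinct count, and whether all values are numeric."""
    non_null = [v for v in values if v not in (None, "")]
    return {
        "null_rate": 1 - len(non_null) / len(values),
        "distinct": len(set(non_null)),
        "all_numeric": all(str(v).isdigit() for v in non_null),
    }

stats = profile_column(["42", "17", "", "42"])
# {'null_rate': 0.25, 'distinct': 2, 'all_numeric': True}
```

Table-structural and cross-table profiling build on the same idea: key candidates emerge where the distinct count equals the row count, and overlap analysis compares the value sets of columns across tables.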
Profiling against external standards requires that the data source be mapped to the standard before being assessed (as shown in the following figure). Note that the mapping is performed by IDE's Fixed Target Mapping tool (FTM). IDE can also be used in the development and application of corporate standards, making them relevant to existing systems as well as to new systems.
Data profiling projects may involve iterative profiling and cleansing as well, since data cleansing may improve the quality of the results obtained through dependency and redundancy profiling. Note that Informatica Data Quality should be considered as an alternative tool for data cleansing.
the data source(s), while IDE's Fixed Target Mapping tool (FTM) is used to map from the normalized schema to the fixed target. The general sequence of activities for a fixed-target migration project, as shown in the figure below, is as follows:
1. Data is prepared for IDE. Metadata is imported into IDE.
2. IDE profiles the data, generates accurate metadata (including a normalized schema), and documents cleansing and transformation requirements based on the source and normalized schemas. The cleansing requirements can be reviewed and modified by the Data Quality team.
3. The resultant metadata are exported to and managed by the IDE Repository.
4. FTM maps the source data fields to the corresponding fields in an externally-specified target schema, and documents transformation requirements between fields in the normalized and target schemas. Externally-specified targets are typical for ERP migrations or projects where a data modeling team is independently designing the target schema.
5. The IDE Repository is used to export or generate reports documenting the cleansing, transformation, and loading or formatting specs developed with IDE and FTM.
6. The cleansing, transformation, and formatting specs can be used by the application development or Data Quality team to cleanse the data, implement any required edits and integrity management functions, and develop the transforms or configure an ETL product to perform the data conversion and migration.
The following screen shot shows how IDE can be used to generate a suggested normalized schema, which may discover hidden tables within tables.
Depending on the staging architecture used, IDE can generate the data definition language (DDL) needed to establish several of the staging databases between the sources and target, as shown below:
Derived-Target Migration
Derived-target migration projects involve the conversion and migration of data from one or more sources to a target database defined by the migration team. IDE is used to profile the data and develop a normalized schema representing the data source(s), and to further develop the normalized schema into a target schema by adding tables and/or fields, eliminating unused tables and/or fields, changing the relational structure, and/or denormalizing the schema to enhance performance. When the target schema is developed from the normalized schema within IDE, the product automatically maintains the mappings from the source to normalized schema, and from the normalized to target schemas. The figure below shows that the general sequence of activities for a derived-target migration project is as follows:
1. Data is prepared for IDE. Metadata is imported into IDE.
2. IDE is used to profile the data, generate accurate metadata (including a normalized schema), and document cleansing and transformation requirements based on the source and normalized schemas. The cleansing requirements can be reviewed and modified by the Data Quality team.
3. IDE is used to modify and develop the normalized schema into a target schema. This generally involves removing obsolete or spurious data elements, incorporating new business requirements and data elements, adapting to corporate data standards, and denormalizing to enhance performance.
4. The resultant metadata are exported to and managed by the IDE Repository.
5. FTM is used to develop and document transformation requirements between the normalized and target schemas. The mappings between the data elements are automatically carried over from the IDE-based schema development process.
6. The IDE Repository is used to export an XSLT document containing the transformation and formatting specs developed with IDE and FTM/XML.
7. The cleansing, transformation, and formatting specs are used by the application development or Data Quality team to cleanse the data, implement any required edits and integrity management functions, and develop the transforms or configure an ETL product to perform the data conversion and migration.
- Data Cleanse and Match Workbench, the desktop application in which data quality processes - or plans - can be designed, tested, and executed. Workbench installs with its own Data Quality repository, where plans are stored until needed.
- Data Quality Integration, a plug-in component that integrates Informatica Data Quality and PowerCenter. The plug-in adds a transformation to PowerCenter, called the Data Quality Integration transformation; PowerCenter Designer users can connect to the Data Quality repository and read data quality plan information into this transformation.
Informatica Data Cleanse and Match has been developed to work with Content Packs from Informatica. This document focuses on the plans that install with the North America Content Pack, which was developed in conjunction with the components of Data Cleanse and Match. The North America Content Pack delivers data parsing, cleansing, standardization, and de-duplication functionality for United States and Canadian name and address data through a series of pre-built data quality plans and address reference data files. Specifically, this document covers the following areas:
- when to use one plan vs. another for data cleansing
- what behavior to expect from the plans
- how best to manage exception data
Description
The North America Content Pack installs several plans to the Data Quality Repository:
- Plans 01-04 are designed to parse, standardize, and validate United States name and address data.
- Plans 05-07 are designed to enable single-source matching operations (identifying duplicates within a data set) or dual-source matching operations (identifying matching records between two datasets).
The processing logic for data matching is split between PowerCenter and Informatica Data Quality (IDQ) applications.
In cases where the address is well structured and specific data elements (i.e., city, state, and zip) are mapped to specific fields, only the address validation plan may be required. Where the city, state, and zip are mapped to address fields, but not specifically labeled as such (e.g., as Address1 through Address5), a combination of the address standardization and validation plans is required. In extreme cases, where the data is not mapped to any address columns, a combination of the general parser, address standardization, and validation plans may be required to obtain meaning from the data. The purpose of making the plans modular is twofold:
- It is possible to apply these plans to the data on an individual basis. There is no requirement that the plans be run in sequence with each other. For example, the address validation plan (plan 03) can be run successfully to validate input addresses discretely from the other plans. In fact, the Data Quality Developer will not run all seven plans consecutively on the same dataset. Plans 01 and 02 are not designed to operate in sequence, nor are plans 06 and 07.
- Modular plans facilitate faster performance. Designing a single plan to perform all the processing tasks contained in the seven plans, even if it were desirable from a functional point of view, would result in significant performance degradation and extremely complex plan logic that would be difficult to modify and maintain.
01 General Parser
The General Parser plan was developed to handle highly unstructured data and to parse it into type-specific fields. For example, consider data stored in the following format:
Field1            Field2            Field3            Field4                 Field5
100 Cardinal Way  Informatica Corp  CA 94063          info@informatica.com   Redwood City
Redwood City      38725             100 Cardinal Way  CA 94063               info@informatica.com
While it is unusual to see data fragmented and spread across a number of fields in this way, it can and does happen. In cases such as this, data is not stored in any specific fields. Street addresses, email addresses, company names, and dates are scattered throughout the data. Using a combination of dictionaries and pattern recognition, the General Parser plan sorts such data into type-specific fields of address, names, company names, Social Security Numbers, dates, telephone numbers, and email addresses, depending on the profile of the content. As a result, the above data will be parsed into the following format:
[Parsed output table not fully reproduced here; the data is sorted into type-specific columns such as Address, Company, Email, and Date (e.g., 08/01/2006).]
The General Parser does not attempt to apply any structure or meaning to the data. Its purpose is to identify and sort data by information type. As demonstrated with the address fields in the above example, the contents are not arranged in a standard address format; they are simply flagged as addresses in the order in which they were processed in the file.

The General Parser does not attempt to validate the correctness of a field. For example, the dates are accepted as valid because they have a structure of symbols and numbers that represents a date. A value of 99/99/9999 would also be parsed as a date.

The General Parser does not attempt to handle multiple information types in a single field. For example, if a person name and an address element are contained in the same field, the General Parser labels the entire field either a name or an address - or leaves it unparsed - depending on the elements in the field it can identify first (if any). While the General Parser does not make any assumptions about the data prior to parsing, it parses based on the elements of data that it can make sense of first. In cases where no elements of information can be labeled, the field is left in a pipe-delimited form containing unparsed data.

The effectiveness of the General Parser in recognizing various information types is a function of the dictionaries used to identify that data and the rules used to sort it. Adding or deleting dictionary entries can greatly affect the effectiveness of this plan. Overall, the General Parser is likely to be used only in limited cases where certain types of information may be mixed together (e.g., telephone and email in the same contact field), or where the data has been badly managed, such as when several files of differing structures have been merged into a single file.
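The dictionary-and-pattern approach described above can be sketched as follows. This is a simplified illustration, not the General Parser's actual logic; the regular expressions and the company-suffix list are assumptions standing in for the real dictionaries:

```python
import re

# Simplified illustration of dictionary/pattern-based type classification,
# loosely modeled on the General Parser's behavior. The patterns and the
# company-suffix list are assumptions, not the shipped dictionaries.
PATTERNS = [
    ("email",   re.compile(r"^\S+@\S+\.\S+$")),
    ("date",    re.compile(r"^\d{2}/\d{2}/\d{4}$")),   # structural check only: 99/99/9999 also passes
    ("ssn",     re.compile(r"^\d{3}-\d{2}-\d{4}$")),
    ("phone",   re.compile(r"^\(?\d{3}\)?[ -]?\d{3}-\d{4}$")),
    ("address", re.compile(r"^\d+\s+\w+")),
]
COMPANY_SUFFIXES = {"corp", "inc", "llc", "ltd"}

def classify(value):
    """Return the first information type that matches, else 'unparsed'."""
    tokens = value.lower().split()
    if tokens and tokens[-1].rstrip(".") in COMPANY_SUFFIXES:
        return "company"
    for info_type, pattern in PATTERNS:
        if pattern.match(value.strip()):
            return info_type
    return "unparsed"

row = ["100 Cardinal Way", "Informatica Corp", "CA 94063",
       "info@informatica.com", "Redwood City"]
print([classify(v) for v in row])
```

Note how 99/99/9999 would be classified as a date on structure alone, mirroring the behavior described above; values the patterns cannot label (such as "Redwood City" here) fall through as unparsed.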
02 Name Standardization
The Name Standardization plan is designed to take in person name or company name information and apply parsing and standardization logic to it. Name Standardization follows two different tracks: one for person names and one for company names.

The plan input fields include two inputs for company names. Data entered in these fields is assumed to be a valid company name, and no additional tests are performed to validate that the data is an existing company name. Any combination of letters, numbers, and symbols can represent a company; therefore, in the absence of an external reference data source, further tests to validate a company name are not likely to yield usable results. Any data entered into the company name fields is subjected to two processes. First, the company name is standardized using the Word Manager component, standardizing any company suffixes included in the field. Second, the standardized company name is matched against the company_names.dic dictionary, which returns the standardized Dun & Bradstreet company name, if found.

The second track is person name standardization. While this track is dedicated to standardizing person names, it does not necessarily assume that all data entered here is a person name. Person names in North America tend to follow a set structure and typically do not contain company suffixes or digits. Therefore, values entered in this field that contain a company suffix or a company name are taken out of the person name track and moved to the company name track. Additional logic is applied to identify people whose last name is similar (or equal) to a valid company name (for example, John Sears); inputs that contain an identified first name and a company name are treated as a person name. If the company name track inputs are already fully populated for the record in question, then any company name detected in a person name column is moved to a field for unparsed company name output.
If the name is not recognized as a company name (e.g., by the presence of a company suffix) but contains digits, the data is parsed into the non-name data output field. Any remaining data is accepted as being a valid person name and parsed as such.

North American person names are typically entered in one of two styles: either firstname middlename surname or surname, firstname middlename. The name parsing algorithms have been built on this assumption. Name parsing occurs in two passes. The first pass applies a series of dictionaries to the name fields, attempting to parse out name prefixes, name suffixes, first names, and any extraneous data (noise) present. Any remaining details are assumed to be middle name or surname details. A rule is applied to the parsed details to check whether the name has been parsed correctly. If not, best-guess
parsing is applied to the field based on the possible assumed formats. When name details have been parsed into first, last, and middle name formats, the first name is used to derive additional details, including gender and the name prefix. Finally, using all parsed and derived name elements, salutations are generated. In cases where no clear gender can be derived from the first name, the gender field is typically left blank or indeterminate. The salutation field is generated according to the derived gender information. This can be easily replicated outside the data quality plan if the salutation is not immediately needed as an output from the process (assuming the gender field is an output).

Depending on the data entered in the person name fields, certain companies may be treated as person names and parsed according to person name processing rules. Likewise, some person names may be identified as companies and standardized according to company name processing logic. This is typically a result of the dictionary content. If this is a significant problem when working with name data, some adjustments to the dictionaries and the rule logic for the plan may be required. Non-name data encountered in the name standardization plan may be standardized as names depending on the contents of the fields. For example, an address datum such as Corporate Parkway may be standardized as a business name, as Corporate is also a business suffix. Any text data entered in a person name field is always treated as a person or company, depending on whether or not the field contains a recognizable company suffix.

To ensure that the name standardization plan is delivering adequate results, Informatica strongly recommends pre- and post-execution analysis of the data. Based on the following input:
ROW ID  IN NAME1
1       Steven King
2       Chris Pope Jr.
3       Shannon C. Prince
4       Dean Jones
5       Mike Judge
6       Thomas Staples
7       Eugene F. Sears
8       Roy Jones Jr.
9       Thomas Smith, Sr
10      Eddie Martin III
11      Martin Luther King, Jr.
12      Staples Corner
13      Sears Chicago
14      Robert Tyre
15      Chris News
The last entry (Chris News) is identified as a company in the current plan configuration; such results can be refined by changing the underlying dictionary entries used to identify company and person names.
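The person/company routing logic described above might be sketched as follows. The suffix and first-name dictionaries here are tiny illustrative stand-ins for the plan's real dictionary files, so the routing decisions are only indicative:

```python
# Hedged sketch of the person/company routing logic described above.
# The suffix and first-name dictionaries are tiny illustrative stand-ins
# for the plan's real dictionary files.
COMPANY_SUFFIXES = {"corp", "inc", "corner", "chicago"}  # hypothetical entries
FIRST_NAMES = {"steven", "chris", "shannon", "john", "eugene", "roy"}

def route_name(value):
    """Decide which track a name-field value follows."""
    tokens = value.replace(",", "").replace(".", "").lower().split()
    has_suffix = any(t in COMPANY_SUFFIXES for t in tokens)
    has_digit = any(c.isdigit() for c in value)
    # A recognized first name keeps 'John Sears'-style values on the person track.
    if tokens and tokens[0] in FIRST_NAMES:
        return "person"
    if has_suffix:
        return "company"
    if has_digit:
        return "non-name data"
    return "person"

for name in ["Eugene F. Sears", "Staples Corner", "Acme Inc"]:
    print(name, "->", route_name(name))
```

As in the real plan, the quality of the routing depends entirely on the dictionary contents: adding or removing entries changes which values land on which track.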
03 US Canada Standardization
This plan is designed to apply basic standardization processes to city, state/province, and zip/postal code information for United States and Canadian postal address data. The purpose of the plan is to deliver basic standardization to address elements where processing time is critical and one hundred percent validation is not possible due to time constraints. The plan also organizes key search elements into discrete fields, thereby speeding up the validation process. The plan accepts up to six generic address fields and attempts to parse out city, state/province, and zip/postal code information. All remaining information is assumed to be address information and is absorbed into the address line 1-3 fields. Any information that cannot be parsed into the remaining fields is merged into the non-address data field. The plan makes a number of assumptions that may or may not suit your data:
- When parsing city, state, and zip details, the address standardization dictionaries assume that these data elements are spelled correctly. Variation in town/city names is very limited, and in cases where punctuation differences exist or where town names are commonly misspelled, the standardization plan may not correctly parse the information.
- Zip codes are all assumed to be five-digit. In some files, zip codes that begin with 0 may lack this first number and so appear as four-digit codes, and these may be missed during parsing. Adding four-digit zips to the dictionary is not recommended, as these will conflict with the Plus 4 element of a zip code. Zip codes may also be confused with other five-digit numbers in an address line, such as street numbers.
- City names are also commonly found in street names and other address elements. For example, United is part of a country name (United States of America) and is also a town name in the U.S. Bear in mind that the dictionary parsing operates from right to left across the data, so that country name and zip code fields are analyzed before city names and street addresses. Therefore, the word United may be parsed and written as the town name for a given address before the actual town name datum is reached.
- The plan appends a country code to the end of a parsed address if it can identify it as U.S. or Canadian. Therefore, there is no need to include any country code field in the address inputs when configuring the plan.
Most of these issues can be dealt with, if necessary, by minor adjustments to the plan logic or to the dictionaries, or by adding some pre-processing logic to a workflow prior to passing the data into the plan. The plan assumes that all data entered into it are valid address elements. Therefore, once city, state, and zip details have been parsed out, the plan assumes all remaining elements are street address lines and parses them in the order they occurred as address lines 1-3.
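A minimal sketch of the right-to-left parsing behavior described above, using stand-in state and city dictionaries (not the plan's actual dictionaries), might look like this:

```python
import re

# Illustrative sketch of right-to-left parsing of generic address fields
# into city / state / zip. The STATES and CITIES sets are tiny stand-ins
# for the real standardization dictionaries.
STATES = {"CA", "NY", "TX", "ON"}
CITIES = {"redwood city", "toronto"}

def parse_address(fields):
    """Scan tokens right-to-left, peeling off zip and state, then city."""
    tokens = " ".join(fields).split()
    out = {"zip": None, "state": None, "city": None, "address": None}
    remaining = list(tokens)
    for token in reversed(tokens):
        if out["zip"] is None and re.fullmatch(r"\d{5}", token):
            out["zip"] = token          # note: a 5-digit street number could be mistaken for a zip
            remaining.remove(token)
        elif out["state"] is None and token.upper() in STATES:
            out["state"] = token.upper()
            remaining.remove(token)
    # Match a (possibly multi-word) city name on the tail of what remains.
    rest = " ".join(remaining)
    for city in CITIES:
        if rest.lower().endswith(city):
            out["city"] = rest[-len(city):]
            rest = rest[:-len(city)].strip()
            break
    out["address"] = rest
    return out

print(parse_address(["100 Cardinal Way", "Redwood City CA 94063"]))
```

The sketch also exposes the caveats listed above: a five-digit street number could be captured as a zip, and a city-name word appearing earlier in the street address could be claimed before the actual city is reached.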
04 NA Address Validation
The purposes of the North America Address Validation plan are:
- To match input addresses against known valid addresses in an address database, and
- To parse, standardize, and enrich the input addresses.
Performing these operations is a resource-intensive process. Using the US Canada Standardization plan before the NA Address Validation plan helps to improve validation plan results in cases where city, state, and zip code information are not already in discrete fields. City, state, and zip are key search criteria for the address validation engine, and they need to be mapped into discrete fields. Not having these fields correctly mapped prior to plan execution leads to poor results and slow execution times.

The address validation APIs store specific area information in memory and continue to use that information from one record to the next, when applicable. Therefore, when running validation plans, it is advisable to sort address data by zip/postal code in order to maximize the usage of data in memory.

In cases where status codes, error codes, or invalid results are generated as plan outputs, refer to the Informatica Data Quality 3.1 User Guide for information on how to interpret them.
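The sort-by-postal-code advice can be illustrated with a trivial sketch; the records here are invented, and the validation call is only indicated in a comment:

```python
# Sketch of the sort-by-postal-code advice above: validation engines cache
# locality data, so processing records in zip order maximizes cache hits.
records = [
    {"id": 3, "zip": "94063"},
    {"id": 1, "zip": "10001"},
    {"id": 2, "zip": "94063"},
]
for rec in sorted(records, key=lambda r: r["zip"]):
    # validate(rec) would be called here; consecutive identical zips
    # reuse the locality data already loaded in memory.
    print(rec["id"], rec["zip"])
```

With the two 94063 records now adjacent, the engine would load that locality's reference data once instead of twice.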
- 05 Match Standardization and Grouping. This plan is used to perform basic standardization and grouping operations on the data prior to matching.
- 06 Single Source Matching. Single source matching seeks to identify duplicate records within a single data set.
- 07 Dual Source Matching. Dual source matching seeks to identify duplicate records between two datasets.
- Note that the matching plans are designed for use within a PowerCenter mapping and do not deliver optimal results when executed directly from IDQ Workbench.
- Note also that the Standardization and Matching plans are geared towards North American English data. Although they work with datasets in other languages, the results may be sub-optimal.
Matching Concepts
To ensure the best possible matching results and performance, match plans usually use a pre-processing step to standardize and group the data. The aim of standardization here is different from that of a classic standardization plan: the intent is to ensure that different spellings, abbreviations, etc. are as similar to each other as possible in order to return a better match set. For example, 123 Main Rd. and 123 Main Road will obtain an imperfect match score, although they clearly refer to the same street address.

Grouping, in a matching context, means sorting input records based on identical values in one or more user-selected fields. When a matching plan is run on grouped data, serial matching operations are performed on a group-by-group basis, so that data records within a group are matched but records across groups are not. A well-designed grouping plan can dramatically cut plan processing time while minimizing the likelihood of missed matches in the dataset.

Grouping performs two functions: it sorts the records in a dataset to increase matching plan performance, and it creates new data columns to provide group key options for the matching plan. (In PowerCenter, the Sorter transformation can organize the data to facilitate matching performance. Therefore, the main function of grouping in a PowerCenter context is to create candidate group keys. In both Data Quality and PowerCenter, grouping operations do not affect the source dataset itself.)

Matching on un-grouped data involves a large number of comparisons that realistically will not generate a meaningful quantity of additional matches. For example, when looking for duplicates in a customer list, there is little value in comparing the record for John Smith with the record for Angela Murphy, as they are obviously not going to be considered duplicate entries. The type of grouping used depends on the type of information being matched; in general, productive fields for grouping name and address data are location-based (e.g.
city name, zip codes) or person/company-based (surname and company name composites). For more information on grouping strategies for the best balance of results and performance, see the Best Practice Effective Data Matching Techniques.

Plan 05 (Match Standardization and Grouping) performs cleansing and standardization operations on the data before group keys are generated. It offers a number of grouping options. The plan generates the following group keys:
- OUT_ZIP_NAME3_GROUP: first 5 digits of ZIP code and the first 3 characters of the last name
- OUT_ZIP_NAME5_GROUP: first 5 digits of ZIP code and the first 5 characters of the last name
- OUT_ZIP_COMPANY3_GROUP: first 5 digits of ZIP code and the first 3 characters of the cleansed company name
- OUT_ZIP_COMPANY5_GROUP: first 5 digits of ZIP code and the first 5 characters of the cleansed company name
The grouping output used depends on the data contents and data volume.
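As an illustration of the grouping concept, the following sketch builds an OUT_ZIP_NAME3_GROUP-style key (zip5 plus the first three characters of the surname) and then generates candidate match pairs only within each group. The records and field names are invented for the example:

```python
from itertools import combinations

# Hedged sketch of grouping for matching: build a key from the first 5
# digits of the zip and the first 3 characters of the surname (the
# OUT_ZIP_NAME3_GROUP pattern described above), then compare record
# pairs only within each group.
def zip_name3_key(record):
    return record["zip"][:5] + record["surname"][:3].upper()

records = [
    {"id": 1, "surname": "Smith",    "zip": "94063"},
    {"id": 2, "surname": "Smithson", "zip": "94063"},
    {"id": 3, "surname": "Murphy",   "zip": "10001"},
]

groups = {}
for rec in records:
    groups.setdefault(zip_name3_key(rec), []).append(rec)

# Candidate pairs are generated per group; Smith and Murphy are never compared.
candidate_pairs = [
    (a["id"], b["id"])
    for members in groups.values()
    for a, b in combinations(members, 2)
]
print(candidate_pairs)
```

The sketch also shows the trade-off discussed above: a tighter key (e.g., five surname characters instead of three) reduces comparisons further but increases the risk of missed matches across groups.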
PowerCenter Mappings
When configuring the Data Quality Integration transformation for the matching plan, the Developer must select a valid grouping field.
To ensure best matching results, the PowerCenter mapping that contains plan 05 should include a Sorter transformation that sorts data according to the group key to be used during matching. This transformation should follow standardization and grouping operations. Note that a single mapping can contain multiple Data Quality Integration transformations, so that the Data Quality Developer or Data Integration Developer can add plan 05 to one Integration transformation and plan 06 or 07 to another in the same mapping. The standardization plan requires a passive transformation, whereas the matching plan requires an active transformation.
The developer can add a Sequence Generator transformation to the mapping to generate a unique identifier for each input record if these are not present in the source data. (Note that a unique identifier is not required for matching processes.) When working with the dual-source matching plan, additional PowerCenter transformations are required to pre-process the data for the Integration transformation. Expression transformations are used to label each input with a source tag of A and B respectively. The data from the two sources is then joined using a Union transformation before being passed to the Integration transformation containing the standardization and grouping plan. From here on, the mapping has the same design as the single-source version.
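The dual-source pre-processing flow (tag each source, union the streams, then match) can be approximated outside PowerCenter as a simple sketch; the field names and records here are illustrative:

```python
# Sketch of the dual-source pre-processing described above: tag each
# source's records (as the Expression transformations do), then union
# them (as the Union transformation does) before standardization,
# grouping, and matching. Field names and records are illustrative.
source_a = [{"name": "ACME Corp"}]
source_b = [{"name": "Acme Corporation"}]

tagged = [dict(rec, source="A") for rec in source_a] + \
         [dict(rec, source="B") for rec in source_b]

# Downstream matching can now restrict candidate pairs to A-vs-B records only.
print(tagged)
```

Carrying the source tag through the combined stream is what lets the matching step pair records across the two sources rather than within one.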
Description
Historically, organizations have approached the development of a "data warehouse" or "data mart" as a departmental effort, without considering an enterprise perspective. The result has been silos of corporate data and analysis, which very often conflict with each other in terms of both the detailed data and the business conclusions implied by it. Taking an enterprise-wide, architectural stance in developing data integration solutions provides many advantages, including:
- A sound architectural foundation ensures the solution can evolve and scale with the business over time.
- Proper architecture can isolate the application component (business context) of the data integration solution from the technology.
- Lastly, architectures allow for reuse - reuse of skills, design objects, and knowledge.
As the evolution of data integration solutions (and the corresponding nomenclature) has progressed, the necessity of building these solutions on a solid architectural framework has become more and more clear. To understand why, a brief review of the history of data integration solutions and their predecessors is warranted.
Historical Perspective
Online Transaction Processing Systems (OLTPs) have always provided a very detailed, transaction-oriented view of an organization's data. While this view was indispensable for the day-to-day operation of a business, its ability to provide a "big picture" view of the operation, critical for management decision-making, was severely limited. Initial attempts to address this problem took several directions:

- Reporting directly against the production system. This approach minimized the effort associated with developing management reports, but introduced a number of significant issues:
  - The nature of OLTP data is, by definition, "point-in-time." Thus, reports run at different times of the year, month, or even the day, were inconsistent with each other.
  - Ad hoc queries against the production database introduced uncontrolled performance issues, resulting in slow reporting results and degradation of OLTP system performance.
  - Trending and aggregate analysis was difficult (or impossible) with the detailed data available in the OLTP systems.

- Mirroring the production system in a reporting database. While this approach alleviated the performance degradation of the OLTP system, it did nothing to address the other issues noted above.

- Reporting databases. To address the fundamental issues associated with reporting against the OLTP schema, organizations began to move toward dedicated reporting databases. These databases were optimized for the types of queries typically run by analysts, rather than those used by systems supporting data entry clerks or customer service representatives. These databases may or may not have included pre-aggregated data, and took several forms, including traditional RDBMS as well as newer-technology Online Analytical Processing (OLAP) solutions.
The initial attempts at reporting solutions were typically point solutions; they were developed internally to provide very targeted data to a particular department within the enterprise. For example, the Marketing department might extract sales and demographic data in order to infer customer purchasing habits. Concurrently, the Sales department was also extracting sales data for the purpose of awarding commissions to the sales force. Over time, these isolated silos of information became irreconcilable, since the extracts and business rules applied to the data during the extract process differed from department to department. The result of this evolution was that the Sales and Marketing departments might report completely different sales figures to executive management, resulting in a lack of confidence in both departments' "data marts." From a technical perspective, the uncoordinated extracts of the same data from the source systems multiple times placed undue strain on system resources.

The solution seemed to be the "centralized" or "galactic" data warehouse. This warehouse would be supported by a single set of periodic extracts of all relevant data into the data warehouse (or Operational Data Store), with the data being cleansed and made consistent as part of the extract process. The problem with this solution was its enormous complexity, typically resulting in project failure. The scale of these failures led many organizations to abandon the concept of the enterprise data warehouse in favor of the isolated, "stovepipe" data marts described earlier. While these solutions still had all of the issues discussed previously, they had the clear advantage of providing individual departments with the data they needed without the unmanageability of the enterprise solution. As individual departments pursued their own data and data integration needs, they not only created data stovepipes, they also created technical islands.
The approaches to populating the data marts and performing the data integration tasks varied widely, resulting in a single enterprise evaluating, purchasing, and being trained on multiple tools and adopting multiple methods for performing these tasks. If, at any point, the organization did attempt to undertake an enterprise effort, it was likely to face the daunting challenge of integrating the disparate data as well as the widely varying technologies. To deal with these issues, organizations began developing approaches that considered the enterprise-level requirements of a data integration solution.
Advantages
The centralized model offers a number of benefits to the overall architecture, including:
- Centralized control. Since a single project drives the entire process, there is centralized control over everything occurring in the data warehouse. This makes it easier to manage a production system while concurrently integrating new components of the warehouse.
- Consistent metadata. Because the warehouse environment is contained in a single database and the metadata is stored in a single repository, the entire enterprise can be queried whether you are looking at data from Finance, Customers, or Human Resources.
- Enterprise view. Developing the entire project at one time provides a global view of how data from one workgroup coordinates with data from others. Since the warehouse is highly integrated, different workgroups often share common tables such as customer, employee, and item lists.
- High data integrity. A single, integrated data repository for the entire enterprise would naturally avoid all data integrity issues that result from duplicate copies and versions of the same business data.
Disadvantages
Of course, the centralized data warehouse also involves a number of drawbacks, including:
- Lengthy implementation cycle. With the complete warehouse environment developed simultaneously, many components of the warehouse become daunting tasks, such as analyzing all of the source systems and developing the target data model. Even minor tasks, such as defining how to measure profit and establishing naming conventions, snowball into major issues.
- Substantial up-front costs. Many analysts who have studied the costs of this approach agree that this type of effort nearly always runs into the millions. While this level of investment is often justified, the problem lies in the delay between the investment and the delivery of value back to the business.
- Scope too broad. The centralized data warehouse requires a single database to satisfy the needs of the entire organization. Attempts to develop an enterprise-wide warehouse using this approach have rarely succeeded, since the goal is simply too ambitious. As a result, this wide scope has been a strong contributor to project failure.
- Impact on the operational systems. Different tables within the warehouse often read data from the same source tables, but manipulate it differently before loading it into the targets. Since the centralized approach extracts data directly from the operational systems, a source table that feeds into three different target tables is queried three times to load the appropriate target tables in the warehouse. When combined with all the other loads for the warehouse, this can create an unacceptable performance hit on the operational systems.
Advantages
The independent data mart is the logical opposite of the centralized data warehouse. The disadvantages of the centralized approach are the strengths of the independent data mart:
- Impact on operational databases localized. Because the independent data mart is trying to solve the DSS needs of a single department or workgroup, only the few operational databases containing the required information need to be analyzed.
- Reduced scope of the data model. The target data modeling effort is vastly reduced since it only needs to serve a single department or workgroup, rather than the entire company.
- Lower up-front costs. The data mart serves only a single department or workgroup; thus hardware and software costs are reduced.
- Fast implementation. The project can be completed in months, not years. The process of defining business terms and naming conventions is simplified since "players from the same team" are working on the project.
Disadvantages
Of course, independent data marts also have some significant disadvantages:
- Lack of centralized control. Because several independent data marts are needed to
BEST PRACTICE 220 of 702
INFORMATICA CONFIDENTIAL
solve the decision support needs of an organization, there is no centralized control. Each data mart or project controls itself, but there is no central control from a single location.
q
Redundant data . After several data marts are in production throughout the organization, all of the problems associated with data redundancy surface, such as inconsistent definitions of the same data object or timing differences that make reconciliation impossible. Metadata integration . Due to their independence, the opportunity to share metadata - for example, the definition and business rules associated with the Invoice data object - is lost. Subsequent projects must repeat the development and deployment of common data objects. Manageability . The independent data marts control their own scheduling routines and therefore store and report their metadata differently, with a negative impact on the manageability of the data warehouse. There is no centralized scheduler to coordinate the individual loads appropriately or metadata browser to maintain the global metadata and share development work among related projects.
providing a common understanding of core business concepts that can be shared across the semi-autonomous data marts. These data marts each have their own Local Repository, which typically includes a combination of purely local metadata and shared metadata by way of links to the Global Repository.
This environment allows for relatively independent development of individual data marts, but also supports metadata sharing without obstacles. The common business model and names described above can be captured in metadata terms and stored in the Global Repository. The data marts use the common business model as a basis, but extend the model by developing departmental metadata and storing it locally.

A typical characteristic of the federated architecture is the existence of an Operational Data Store (ODS). Although this component is optional, it can be found in many implementations that extract data from multiple source systems and load multiple targets. The ODS was originally designed to extract and hold operational data that would be sent to a centralized data warehouse, working as a time-variant database to support end-user reporting directly from operational systems. A typical ODS had to be organized by data subject area because it did not retain the data model from the operational system. Informatica's approach to the ODS, by contrast, makes virtually no change to the data model from the operational system, so it need not be organized by subject area. The ODS does not permit direct end-user reporting, and its refresh policies are more closely aligned with the refresh schedules of the enterprise data marts it may be feeding. It can also perform more sophisticated consolidation functions than a traditional ODS.
Advantages
The Federated architecture brings together the best features of the centralized data warehouse and independent data mart:
- Room for expansion. While the architecture is designed to quickly deploy the initial data mart, it is also easy to share project deliverables across subsequent data marts by migrating local metadata to the Global Repository. Reuse is built in.
- Centralized control. A single platform controls the environment from development to test to production. Mechanisms to control and monitor the data movement from operational databases into the data integration environment are applied across the data marts, easing the system management task.
- Consistent metadata. A Global Repository spans all the data marts, providing a consistent view of metadata.
- Enterprise view. Viewing all the metadata from a central location also provides an enterprise view, easing the maintenance burden for the warehouse administrators. Business users can also access the entire environment when necessary (assuming that security privileges are granted).
- High data integrity. Using a set of integrated metadata repositories for the entire enterprise removes data integrity issues that result from duplicate copies of data.
- Minimized impact on operational systems. Frequently accessed source data, such as customer, product, or invoice records, is moved into the decision support environment once, leaving the operational systems unaffected by the number of target data marts.
Disadvantages
Disadvantages of the federated approach include:
- Data propagation. This approach moves data twice: first to the ODS, then into the individual data mart. This requires extra database space to store the staged data as well as extra time to move the data. However, this disadvantage can be mitigated by not saving the data permanently in the ODS. After the warehouse is refreshed, the ODS can be truncated, or a rolling three months of data can be saved.
- Increased development effort during initial installations. For each table in the target, one load needs to be developed from the ODS to the target, in addition to all the loads from the sources to the ODS.
Data from the various operational sources is staged in the ODS for subsequent extraction by target systems. In the ODS, data is cleaned and remains normalized, tables from different databases are joined, and a refresh policy is carried out (a change/capture facility may be used to schedule ODS refreshes, for instance). The ODS and the data marts may reside in a single database or be distributed across several physical databases and servers.
Within an enterprise data mart, the ODS can consolidate data from disparate systems in a number of ways:
- Normalizes data where necessary (such as non-relational mainframe data), preparing it for storage in a relational system.
- Cleans data by enforcing commonalities in dates, names, and other data types that appear across multiple systems.
- Maintains reference data to help standardize other formats; references might range from zip codes and currency conversion rates to product-code-to-product-name translations.

The ODS may apply fundamental transformations to some database tables in order to reconcile common definitions, but the ODS is not intended to be a transformation processor for end-user reporting requirements.
Its role is to consolidate detailed data within common formats. This enables users to create wide varieties of data integration reports, with confidence that those reports will be based on the same detailed data, using common definitions and formats.

The following table compares the key differences in the three architectures:

                      Centralized      Independent   Federated
Architecture          Data Warehouse   Data Mart     Data Warehouse
Centralized Control   Yes              No            Yes
Consistent Metadata   Yes              No            Yes
Cost Effective        No               Yes           Yes
Enterprise View       Yes              No            Yes
Fast Implementation   No               Yes           Yes
High Data Integrity   Yes              No            Yes
Immediate ROI         No               Yes           Yes
Repeatable Process    No               Yes           Yes
Description
The following pages summarize some of the questions that typically arise during development and suggest potential resolutions.
Mapping Design
Q: How does source format affect performance? (i.e., is it more efficient to source from a flat file rather than a database?)

In general, a flat file that is located on the server machine loads faster than a database located on the server machine. Fixed-width files are faster than delimited files because delimited files require extra parsing. However, if there is an intent to perform intricate transformations before loading to target, it may be advisable to first load the flat file into a relational database, which allows the PowerCenter mappings to access the data in an optimized fashion by using filters, custom transformations, and custom SQL SELECTs where appropriate.

Q: What are some considerations when designing the mapping? (i.e., what is the impact of having multiple targets populated by a single map?)

With PowerCenter, it is possible to design a mapping with multiple targets. If each target has a separate source qualifier, you can then load the targets in a specific order using Target Load Ordering. However, the recommendation is to limit the amount of complex logic in a mapping. Not only is it easier to debug a mapping with a limited number of objects, but such mappings can also be run concurrently and make use of more system resources. When using multiple output files (targets), consider writing to multiple disks or file systems simultaneously. This minimizes disk writing contention and applies to a session writing to multiple targets, and to multiple sessions running simultaneously.

Q: What are some considerations for determining how many objects and transformations to include in a single mapping?

The business requirement is always the first consideration, regardless of the number of objects it takes to fulfill the requirement. Beyond this, consideration should be given to having objects that stage data at certain points to allow both easier debugging and better understandability, as well as to create potential partition points.
This should be balanced against the fact that more objects means more overhead for the DTM process. It should also be noted that the most expensive use of the DTM is passing unnecessary data through the mapping. It is best to use filters as early as possible in the mapping to remove rows of data that are not needed. This is the SQL equivalent of the WHERE clause. Using the filter condition in the Source Qualifier to filter out the rows at the database level is a good way to increase the performance of the mapping. If this is not possible, a filter or router transformation can be used instead.
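The filter-early principle can be illustrated outside PowerCenter with a short, generic sketch; the row data and field names below are invented for the example, not taken from any real mapping:

```python
# Illustration (not PowerCenter code): filtering early in a pipeline means
# downstream transformations only touch the rows they actually need.
# Row contents and field names here are made up for the example.

rows = [
    {"region": "EAST", "amount": 100},
    {"region": "WEST", "amount": 250},
    {"region": "EAST", "amount": 75},
]

def transform(row):
    # Stand-in for an expensive transformation step.
    return {**row, "amount_usd": row["amount"] * 1.0}

# Filter first (like a Source Qualifier filter / SQL WHERE), then transform:
filtered = [transform(r) for r in rows if r["region"] == "EAST"]
print(len(filtered))  # only the EAST rows reach the transformation
```

The same reasoning applies inside a mapping: the earlier a row is discarded, the less DTM work it costs.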
The Service Manager and the application services send log events to the Log Manager. When the Log Manager receives log events, it generates log event files, which can be viewed in the Administration Console. The Log Agent runs on the nodes to collect and process log events for sessions and workflows. Log events for workflows include information about tasks performed by the Integration Service, workflow processing, and workflow errors. Log events for sessions include information about the tasks performed by the Integration Service, session errors, and load summary and transformation statistics for the session. You can view log events for the last workflow run with the Log Events window in the Workflow Monitor.

Log event files are binary files that the Administration Console Log Viewer uses to display log events. When you view log events in the Administration Console, the Log Manager uses the log event files to display the log events for the domain or application service. For more information, please see Chapter 16: Managing Logs in the Administrator Guide.

Q: Where can I view the logs?

Logs can be viewed in two locations: the Administration Console or the Workflow Monitor. The Administration Console displays domain-level operational and error messages. The Workflow Monitor displays session and workflow level processing and error messages.

Q: Where is the best place to maintain Session Logs?

One often-recommended location is a shared directory location that is accessible to the gateway node. If you have more than one gateway node, store the logs on a shared disk. This keeps all the logs in the same directory. The location can be changed in the Administration Console. If you have more than one PowerCenter domain, you must configure a different directory path for each domain's Log Manager. Multiple domains cannot use the same shared directory path. For more information, please refer to Chapter 16: Managing Logs of the Administrator Guide.
Q: What documentation is available for the error codes that appear within the error log files?

Log file errors and descriptions appear in Chapter 39: LGS Messages of the PowerCenter Troubleshooting Guide. Error information also appears in the PowerCenter Help File within the PowerCenter client applications. For other database-specific errors, consult your Database User Guide.
Scheduling Techniques
Q: What are the benefits of using workflows with multiple tasks rather than a workflow with a stand-alone session?

Using a workflow to group logical sessions minimizes the number of objects that must be managed to successfully load the warehouse. For example, a hundred individual sessions can be logically grouped into twenty workflows. The Operations group can then work with twenty workflows to load the warehouse, which simplifies the operations tasks associated with loading the targets. Workflows can be created to run tasks sequentially or concurrently, or to combine both approaches in different paths.
- A sequential workflow runs sessions and tasks one at a time, in a linear sequence. Sequential workflows help ensure that dependencies are met as needed. For example, a sequential workflow ensures that session1 runs before session2 when session2 is dependent on the load of session1, and so on. It's also possible to set up conditions to run the next session only if the previous session was successful, or to stop on errors, etc.
- A concurrent workflow groups logical sessions and tasks together, like a sequential workflow, but runs all the tasks at one time. This can reduce the load times into the warehouse, taking advantage of hardware platforms' symmetric multiprocessing (SMP) architecture.
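As a rough analogy, the two execution styles can be sketched with ordinary Python threads standing in for sessions; the session names below are made up:

```python
# Sketch of sequential vs. concurrent execution, with plain Python calls and
# a thread pool standing in for workflow sessions. Session names are invented.
from concurrent.futures import ThreadPoolExecutor

def session(name):
    # Stand-in for a session load; returns its name when "done".
    return name

names = ["s_load_dim1", "s_load_dim2", "s_load_fact"]  # hypothetical sessions

# Sequential: one at a time, in order -- dependencies are guaranteed.
sequential_results = [session(n) for n in names]

# Concurrent: all at once -- faster on SMP hardware, but no ordering guarantees
# between the sessions while they run.
with ThreadPoolExecutor(max_workers=3) as pool:
    concurrent_results = list(pool.map(session, names))

print(sequential_results == concurrent_results)  # pool.map preserves input order
```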
Other workflow options, such as nesting worklets within workflows, can further reduce the complexity of loading the warehouse. This capability allows for the creation of very complex and flexible workflow streams without the use of a third-party scheduler.
Q: Assuming a workflow failure, does PowerCenter allow restart from the point of failure?

No. When a workflow fails, you can choose to start a workflow from a particular task but not from the point of failure. It is possible, however, to create tasks and flows based on error handling assumptions. If a previously running real-time workflow fails, first recover and then restart that workflow from the Workflow Monitor.

Q: How can a failed workflow be recovered if it is not visible from the Workflow Monitor?

Start the Workflow Manager and open the corresponding workflow. Find the failed task and right-click to "Recover Workflow From Task."

Q: What guidelines exist regarding the execution of multiple concurrent sessions / workflows within or across applications?

Workflow execution needs to be planned around two main constraints:
- Processor availability. The number of sessions that can run efficiently at one time depends on the number of processors available on the server. The load manager is always running as a process. If bottlenecks with regard to I/O and network are addressed, a session will be compute-bound, meaning its throughput is limited by the availability of CPU cycles. Most sessions are transformation intensive, so the DTM always runs. However, some sessions require more I/O, so they use less processor time. A general rule is that a session needs about 120 percent of a processor for the DTM, reader, and writer in total. For concurrent sessions, one session per processor is about right; you can run more, but that requires a "trial and error" approach to determine what number of sessions starts to affect session performance and possibly adversely affect other executing tasks on the server. If possible, sessions should run at "off-peak" hours to have as many available resources as possible.
- Memory usage. Even after available processors are determined, it is necessary to look at overall system resource usage. Determining memory usage is more difficult than the processor calculation; it tends to vary according to system load and the number of PowerCenter sessions running. The first step is to estimate memory usage, accounting for:
  - Operating system kernel and miscellaneous processes
  - Database engine
  - Informatica Load Manager
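The baseline components just listed, together with per-session figures, can be rolled into a quick back-of-the-envelope estimate. All numbers in this sketch are hypothetical placeholders, not Informatica sizing guidance:

```python
# Back-of-the-envelope memory estimate for a concurrent session run.
# Every figure below is a made-up placeholder; substitute measured values.

BASELINE_MB = {
    "os_kernel_and_misc": 512,
    "database_engine": 1024,
    "load_manager": 64,
}

# Per-session estimate: DTM buffer size plus transformation caches
# (lookup, aggregator, rank, sorter, joiner), in MB.
sessions = {
    "s_small": 64 + 32,
    "s_medium": 128 + 256,
    "s_large": 512 + 1024,
}

def estimate_total_mb(baseline, session_mb):
    # Fixed overhead plus the sum of all concurrently running sessions.
    return sum(baseline.values()) + sum(session_mb.values())

total = estimate_total_mb(BASELINE_MB, sessions)
print(total)  # compare against physical memory before scheduling the run
```

If the total exceeds physical memory, stagger the large sessions rather than running them concurrently.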
Next, each session being run needs to be examined with regard to memory usage, including the DTM buffer size and any cache/memory allocations for transformations such as lookups, aggregators, ranks, sorters, and joiners. At this point, you should have a good idea of what memory is utilized during concurrent sessions. It is important to arrange the production run to maximize use of this memory. Remember to account for sessions with large memory requirements; you may be able to run only one large session, or several small sessions concurrently.

Load-order dependencies are also an important consideration because they often create additional constraints. For example, load the dimensions first, then facts. Also, some sources may only be available at specific times; some network links may become saturated if overloaded; and some target tables may need to be available to end users earlier than others.

Q: Is it possible to perform two "levels" of event notification? At the application level and the PowerCenter Server level to notify the Server Administrator?

The application level of event notification can be accomplished through post-session email. Post-session email allows you to
create two different messages; one to be sent upon successful completion of the session, the other to be sent if the session fails. Messages can be a simple notification of session completion or failure, or a more complex notification containing specifics about the session. You can use the following variables in the text of your post-session email:

Email Variable   Description
%s               Session name
%l               Total records loaded
%r               Total records rejected
%e               Session status
%t               Table details, including read throughput in bytes/second and write throughput in rows/second
%b               Session start time
%c               Session completion time
%i               Session elapsed time (session completion time - session start time)
%g               Attaches the session log to the message
%m               Name and version of the mapping used in the session
%n               Name of the folder containing the session
%d               Name of the repository containing the session
%a<filename>     Attaches the named file. The file must be local to the Informatica Server. The following are valid filenames: %a<c:\data\sales.txt> or %a</users/john/data/sales.txt>. On Windows NT, you can attach a file of any type. On UNIX, you can only attach text files. If you attach a non-text file, the send may fail. Note: The filename cannot include the greater-than character (>) or a line break.
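As an illustration of what the substitution amounts to, here is a hypothetical re-implementation in Python. PowerCenter performs this expansion itself, so the code is purely explanatory:

```python
# Hypothetical re-implementation of post-session email variable expansion,
# for illustration only -- PowerCenter does this substitution internally.

def expand(template, values):
    # Replace each %x variable with its session value.
    for var, value in sorted(values.items(), key=lambda kv: -len(kv[0])):
        template = template.replace(var, str(value))
    return template

# Sample session values (invented for the example).
values = {"%s": "sInstrTest", "%l": 1, "%r": 0, "%e": "Succeeded"}
message = expand("Session %s finished with status %e: %l loaded, %r rejected.",
                 values)
print(message)
```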
The PowerCenter Server on UNIX uses rmail to send post-session email. The repository user who starts the PowerCenter Server must have the rmail tool installed in the path in order to send email. To verify the rmail tool is accessible:

1. Log in to the UNIX system as the PowerCenter user who starts the PowerCenter Server.
2. Type rmail <fully qualified email address> at the prompt and press Enter.
3. Type '.' to indicate the end of the message and press Enter.
4. You should receive a blank email from the PowerCenter user's email account. If not, locate the directory where rmail resides and add that directory to the path.
5. When you have verified that rmail is installed correctly, you are ready to send post-session email.
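The manual check above can also be scripted. This sketch uses Python's shutil.which, which performs the same PATH lookup the shell does:

```python
# Scripted version of the manual rmail check: confirm the tool is on the PATH
# before relying on post-session email.
import shutil

def find_tool(name):
    """Return the full path of the tool, or None if it is not on the PATH."""
    return shutil.which(name)

if find_tool("rmail") is None:
    print("rmail not found; add its directory to the PATH")
```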
The output should look like the following:

Session complete.
Session name: sInstrTest
Total Rows Loaded = 1
No errors encountered.
Start Time: Tue Sep 14 12:26:31 1999
Completion Time: Tue Sep 14 12:26:41 1999
Elapsed time: 0:00:10 (h:m:s)

This information, or a subset, can also be sent to any text pager that accepts email.
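The elapsed time in the sample output is simply the difference between the two timestamps, as this small sketch shows:

```python
# Computing the elapsed time shown in the sample output from the start and
# completion timestamps, using the format the server prints.
from datetime import datetime

FMT = "%a %b %d %H:%M:%S %Y"
start = datetime.strptime("Tue Sep 14 12:26:31 1999", FMT)
end = datetime.strptime("Tue Sep 14 12:26:41 1999", FMT)

h, rem = divmod(int((end - start).total_seconds()), 3600)
m, s = divmod(rem, 60)
line = f"Elapsed time: {h}:{m:02d}:{s:02d} (h:m:s)"
print(line)
```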
Server Administration
Q: What built-in functions does PowerCenter provide to notify someone in the event that the server goes down, or some other significant event occurs?

The Repository Service can be used to send messages notifying users that the server will be shut down. Additionally, the Repository Service can be used to send notification messages about repository objects that are created, modified, or deleted by another user. Notification messages are received through the PowerCenter Client tools.

Q: What system resources should be monitored? What should be considered normal or acceptable server performance levels?

The pmprocs utility, which is available for UNIX systems only, shows the currently executing PowerCenter processes. Pmprocs is a script that combines the ps and ipcs commands. It is available through Informatica Technical Support. The utility provides the following information:
- CPID - Creator PID (process ID)
- LPID - Last PID that accessed the resource
- Semaphores - used to sync the reader and writer
- 0 or 1 - shows slot in LM shared memory
A variety of UNIX and Windows NT commands and utilities are also available. Consult your UNIX and/or Windows NT documentation.

Q: What cleanup (if any) should be performed after a UNIX server crash? Or after an Oracle instance crash?

If the UNIX server crashes, you should first check to see if the repository database is able to come back up successfully. If this is the case, then you should try to start the PowerCenter server. Use the pmserver.err log to check if the server has started correctly. You can also use ps -ef | grep pmserver to see if the server process (the Load Manager) is running.
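A post-crash health check can be partially automated by scanning the error log for a startup marker. The marker strings below are hypothetical; match them to what your pmserver.err actually contains:

```python
# Sketch of an automated post-crash check: scan the server error log for the
# startup marker. The log text and marker strings here are invented examples.

SAMPLE_LOG = """\
ERROR: repository connection lost
INFO: server initialization started
INFO: server initialized and ready
"""

def server_started(log_text, marker="server initialized and ready"):
    # True if any line of the log contains the startup marker.
    return any(marker in line for line in log_text.splitlines())

print(server_started(SAMPLE_LOG))
```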
Custom Transformations
Q: What is the relationship between the Java or SQL transformation and the Custom transformation?

Many advanced transformations, including Java and SQL, were built using the Custom transformation. Custom transformations operate in conjunction with procedures you create outside of the Designer interface to extend PowerCenter functionality. Other transformations that were built using Custom transformations include HTTP, SQL, Union, XML Parser, XML Generator, and many others. Below is a summary of noticeable differences.
Transformation    # of Input Groups   # of Output Groups   Type
Custom            Multiple            Multiple             Active/Passive
HTTP              One                 One                  Passive
Java              One                 One                  Active/Passive
SQL               One                 One                  Active/Passive
Union             Multiple            One                  Active
XML Parser        One                 Multiple             Active
XML Generator     Multiple            One                  Active
For further details, please see the Transformation Guide.

Q: What is the main benefit of a Custom transformation over an External Procedure transformation?

A Custom transformation allows for the separation of input and output functions, whereas an External Procedure transformation handles both the input and output simultaneously. Additionally, an External Procedure transformation's parameters consist of all the ports of the transformation. The ability to separate input and output functions is especially useful for sorting and aggregation, which require all input rows to be processed before any output rows are produced.

Q: How do I change a Custom transformation from Active to Passive, or vice versa?

After the creation of the Custom transformation, the transformation type cannot be changed. In order to set the appropriate type, delete and recreate the transformation.

Q: What is the difference between active and passive Java transformations? When should one be used over the other?

An active Java transformation allows for the generation of more than one output row for each input row. Conversely, a passive Java transformation only allows for the generation of one output row per input row.
Use active if you need to generate multiple rows with each input. For example, a Java transformation contains two input ports that represent a start date and an end date. You can generate an output row for each date between the start and end date. Use passive when you need one output row for each input.

Q: What are the advantages of a SQL transformation over a Source Qualifier?

A SQL transformation allows for the processing of SQL queries in the middle of a mapping. It allows you to insert, delete, update, and retrieve rows from a database. For example, you might need to create database tables before adding new transactions. The SQL transformation allows for the creation of these tables from within the workflow.

Q: What is the difference between the SQL transformation's Script and Query modes?

Script mode allows for the execution of externally located ANSI SQL scripts. Query mode executes a query that you define in a query editor. You can pass strings or parameters to the query to define dynamic queries or change the selection parameters. For more information, please see Chapter 22: SQL Transformation in the Transformation Guide.
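The start-date/end-date example for an active Java transformation can be sketched in Python, showing one input row expanding into several output rows:

```python
# Sketch (in Python, not Java) of the date-range example above: an "active"
# transformation emits one output row per date between the start and end
# dates of each input row. The sample dates are invented.
from datetime import date, timedelta

def expand_dates(row):
    start, end = row  # one input row: (start_date, end_date)
    d = start
    while d <= end:
        yield d  # multiple output rows per input row -> active behavior
        d += timedelta(days=1)

rows = list(expand_dates((date(2007, 1, 1), date(2007, 1, 3))))
print(len(rows))  # 3 output rows from 1 input row
```

A passive transformation, by contrast, would return exactly one value per input row.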
Metadata
Q: What recommendations or considerations exist as to naming standards or repository administration for metadata that may be extracted from the PowerCenter repository and used in others?

With PowerCenter, you can enter description information for all repository objects (sources, targets, transformations, etc.), but the amount of metadata that you enter should be determined by the business requirements. You can also drill down to the column level and give descriptions of the columns in a table if necessary. All information about column size and scale, data types, and primary keys is stored in the repository.

The decision on how much metadata to create is often driven by project timelines. While it may be beneficial for a developer to enter detailed descriptions of each column, expression, variable, etc., it is also very time consuming to do so. Therefore, this decision should be made on the basis of how much metadata is likely to be required by the systems that use the metadata. There are some time-saving tools available to better manage a metadata strategy and content, such as third-party metadata software and, for sources and targets, data modeling tools.

Q: What procedures exist for extracting metadata from the repository?

Informatica offers an extremely rich suite of metadata-driven tools for data warehousing applications. All of these tools store, retrieve, and manage their metadata in Informatica's PowerCenter repository. The motivation behind the original Metadata Exchange (MX) architecture was to provide an effective and easy-to-use interface to the repository. Today, Informatica and several key Business Intelligence (BI) vendors, including Brio, Business Objects, Cognos, and MicroStrategy, are effectively using the MX views to report and query the Informatica metadata.
Informatica strongly discourages accessing the repository directly, even for SELECT access, because some releases of PowerCenter change the structure of the repository tables, resulting in a maintenance task for you. Rather, views have been created to provide access to the metadata stored in the repository. Additionally, Informatica's Metadata Manager and Data Analyzer allow for more robust reporting against the repository database and are able to present reports to the end user and/or management.
Versioning
Q: How can I keep multiple copies of the same object within PowerCenter? A: With PowerCenter, you can use version control to maintain previous copies of every changed object.
You can enable version control after you create a repository. Version control allows you to maintain multiple versions of an object, control development of the object, and track changes. You can configure a repository for versioning when you create it, or you can upgrade an existing repository to support versioned objects. When you enable version control for a repository, the repository assigns all versioned objects version number 1 and each object has an active status. You can perform the following tasks when you work with a versioned object:
- View object version properties. Each versioned object has a set of version properties and a status. You can also configure the status of a folder to freeze all objects it contains or make them active for editing.
- Track changes to an object. You can view a history that includes all versions of a given object, and compare any version of the object in the history to any other version. This allows you to determine changes made to an object over time.
- Check the object version in and out. You can check out an object to reserve it while you edit the object. When you check in an object, the repository saves a new version of the object and allows you to add comments to the version. You can also find objects checked out by yourself and other users.
- Delete or purge the object version. You can delete an object from view and continue to store it in the repository. You can recover, or undelete, deleted objects. If you want to permanently remove an object version, you can purge it from the repository.
Q: Is there a way to migrate only the changed objects from Development to Production without having to spend too much time on making a list of all changed/affected objects? A: Yes there is. You can create Deployment Groups that allow you to group versioned objects for migration to a different repository. You can create the following types of deployment groups:
- Static. You populate the deployment group by manually selecting objects.
- Dynamic. You use the result set from an object query to populate the deployment group.
To make a smooth transition/migration to Production, you need to have a query associated with your Dynamic deployment group. When you associate an object query with the deployment group, the Repository Agent runs the query at the time of deployment. You can associate an object query with a deployment group when you edit or create a deployment group. If the repository is enabled for versioning, you may also copy the objects in a deployment group from one repository to another. Copying a deployment group allows you to copy objects in a single copy operation from across multiple folders in the source repository into multiple folders in the target repository. Copying a deployment group also allows you to specify individual objects to copy, rather than the entire contents of a folder.
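Conceptually, a dynamic deployment group behaves like the query below: the object list is computed at deployment time rather than maintained by hand. The object records and the cutoff predicate are invented for illustration:

```python
# Sketch of what a dynamic deployment group's object query does: select the
# changed objects automatically instead of listing them manually.
# The object records and the check-in-date predicate are hypothetical.

objects = [
    {"name": "m_load_customers", "folder": "DEV", "checked_in": "2007-06-01"},
    {"name": "m_load_orders",    "folder": "DEV", "checked_in": "2007-06-20"},
    {"name": "m_load_products",  "folder": "DEV", "checked_in": "2007-05-10"},
]

def dynamic_group(objs, checked_in_after):
    # Run the "query" at deployment time, as the Repository Agent does.
    return [o["name"] for o in objs if o["checked_in"] > checked_in_after]

deploy_set = dynamic_group(objects, "2007-05-31")
print(deploy_set)  # only objects checked in after the cutoff
```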
Performance
Q: Can PowerCenter sessions be load balanced?

A: Yes, if the grid option is available. The Load Balancer is a component of the Integration Service that dispatches tasks to Integration Service processes running on nodes in a grid. It matches task requirements with resource availability to identify the best Integration Service process to run a task. It can dispatch tasks on a single node or across nodes. Tasks can be dispatched in three ways: round-robin, metric-based, and adaptive. Additionally, you can set the Service Levels to change the priority of each task waiting to be dispatched. This can be changed in the Administration Console's domain properties. For more information, please refer to Chapter 11: Configuring the Load Balancer in the Administrator Guide.
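Of the three dispatch modes, round-robin is the simplest to picture. This sketch shows the idea with made-up task and node names; the real Load Balancer also weighs resource availability and metrics:

```python
# Sketch of round-robin dispatch, the simplest Load Balancer mode:
# tasks go to nodes in strict rotation. Task and node names are invented.
from itertools import cycle

def round_robin(tasks, nodes):
    assignment = {}
    node_cycle = cycle(nodes)
    for task in tasks:
        assignment[task] = next(node_cycle)
    return assignment

plan = round_robin(["t1", "t2", "t3", "t4"], ["node_a", "node_b"])
print(plan)  # t1/t3 -> node_a, t2/t4 -> node_b
```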
Web Services
Q: How does Web Services Hub work in PowerCenter? A: The Web Services Hub is a web service gateway for external clients. It processes SOAP requests from web service clients that want to access PowerCenter functionality through web services. Web service clients access the Integration Service and Repository Service through the Web Services Hub. The Web Services Hub hosts Batch and Real-time Web Services. When you install PowerCenter Services, the PowerCenter installer installs the Web Services Hub. Use the Administration Console to configure and manage the Web Services Hub. For more information, please refer to Creating and Configuring the Web Services Hub in the Administrator Guide. The Web Services Hub connects to the Repository Server and the PowerCenter Server through TCP/IP. Web service clients log in to the Web Services Hub through HTTP(s). The Web Services Hub authenticates the client based on repository user name and password. You can use the Web Services Hub console to view service information and download Web Services Description Language (WSDL) files necessary for running services and workflows.
Description
Event-based scheduling with versions of PowerCenter prior to 6.0 was achieved through the use of indicator files. Users specified the indicator file configuration in the session configuration under advanced options. When the session started, the PowerCenter Server looked for the specified file name; if it wasn't there, it waited until it appeared, then deleted it and triggered the session. In PowerCenter 6.0 and above, event-based scheduling is triggered by Event-Wait and Event-Raise tasks. These tasks can be used to define task execution order within a workflow or worklet. They can even be used to control sessions across workflows.
- An Event-Raise task represents a user-defined event (i.e., an indicator file).
- An Event-Wait task waits for an event to occur within a workflow. After the event triggers, the PowerCenter Server continues executing the workflow from the Event-Wait task forward.
The following paragraphs describe events that can be triggered by an Event-Wait task.
Pre-defined Event

A pre-defined event is a file-watch event. For pre-defined events, use an Event-Wait task to instruct the PowerCenter Server to wait for the specified indicator file to appear before continuing with the rest of the workflow. When the PowerCenter Server locates the indicator file, it starts the task downstream of the Event-Wait.

When you specify the indicator file in the Event-Wait task, specify the directory in which the file will appear and the name of the indicator file. Do not use a source or target file name as the indicator file name. You must also provide the absolute path for the file, and the directory must be local to the PowerCenter Server. If you specify only the file name and not the directory, Workflow Manager looks for the indicator file in the system directory. For example, on Windows NT, the system directory is C:\winnt\system32. You can enter the actual name of the file or use server variables to specify the location of the files. The PowerCenter Server writes the time the file appears in the workflow log.

Follow these steps to set up a pre-defined event in the workflow:

1. Create an Event-Wait task and double-click it to open the Edit Tasks dialog box.
2. In the Events tab of the Edit Tasks dialog box, select Pre-defined.
3. Enter the path of the indicator file.
4. If you want the PowerCenter Server to delete the indicator file after it detects the file, select the Delete Indicator File option in the Properties tab.
5. Click OK.
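The file-watch behavior described above can be sketched in a few lines. This is an illustration of the concept, not PowerCenter's implementation; the polling interval is an arbitrary choice for the sketch.

```python
import os
import time

def wait_for_indicator(path, poll_seconds=30.0, delete_on_detect=True):
    """Block until the indicator file appears, then optionally delete it,
    mirroring what a Pre-defined Event-Wait task does. The poll interval
    is an arbitrary choice for this sketch."""
    while not os.path.exists(path):
        time.sleep(poll_seconds)
    if delete_on_detect:  # analogous to the "Delete Indicator File" option
        os.remove(path)
    return path
```

Downstream work would only run after this call returns, just as tasks downstream of an Event-Wait only start once the file is located.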
User-defined Event
A user-defined event is defined at the workflow or worklet level and the Event-Raise task triggers the event at one point of the workflow/worklet. If an Event-Wait task is configured in the same workflow/worklet to listen for that event, then execution will continue from the Event-Wait task forward. The following is an example of using user-defined events: Assume that you have four sessions that you want to execute in a workflow. You want P1_session and P2_session to execute concurrently to save time. You also want to execute Q3_session after P1_session completes. You want to execute Q4_session only when P1_session, P2_session, and Q3_session complete. Follow these steps:
1. Link P1_session and P2_session concurrently.
2. Add Q3_session after P1_session.
3. Declare an event called P1Q3_Complete in the Events tab of the workflow properties.
4. In the workspace, add an Event-Raise task after Q3_session.
5. Specify the P1Q3_Complete event in the Event-Raise task properties. This allows the Event-Raise task to trigger the event when P1_session and Q3_session complete.
6. Add an Event-Wait task after P2_session.
7. Specify the P1Q3_Complete event for the Event-Wait task.
8. Add Q4_session after the Event-Wait task. When the PowerCenter Server processes the Event-Wait task, it waits until the Event-Raise task triggers P1Q3_Complete before it executes Q4_session.

The PowerCenter Server executes the workflow in the following order:

1. The PowerCenter Server executes P1_session and P2_session concurrently.
2. When P1_session completes, the PowerCenter Server executes Q3_session.
3. The PowerCenter Server finishes executing P2_session.
4. The Event-Wait task waits for the Event-Raise task to trigger the event.
5. The PowerCenter Server completes Q3_session.
6. The Event-Raise task triggers the event P1Q3_Complete.
7. The PowerCenter Server executes Q4_session because the event P1Q3_Complete has been triggered.
Be sure to take care in setting the links, though. If they are left as the default and Q3_session fails, the Event-Raise never happens. The Event-Wait then waits forever and the workflow runs until it is stopped. To avoid this, check the workflow option "suspend on error". With this option, if a session fails, the whole workflow goes into suspended mode and can send an email to notify developers.
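The ordering guaranteed by the Event-Raise/Event-Wait pair can be sketched with a threading event. This is a conceptual analogy only: the two threads stand in for the two workflow branches, and the session names follow the example above.

```python
import threading

p1q3_complete = threading.Event()  # stands in for the P1Q3_Complete event
log = []

def branch_one():
    log.append("P1_session")
    log.append("Q3_session")
    p1q3_complete.set()          # Event-Raise task fires after Q3_session

def branch_two():
    log.append("P2_session")
    p1q3_complete.wait()         # Event-Wait task blocks until the raise
    log.append("Q4_session")

t1 = threading.Thread(target=branch_one)
t2 = threading.Thread(target=branch_two)
t2.start(); t1.start()
t1.join(); t2.join()
```

Whatever the interleaving of the two branches, Q4_session always runs last, after Q3_session has raised the event.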
Description
Key management in a decision support RDBMS comprises three techniques for handling the following common situations:
- Key merging/matching
- Missing keys
- Unknown keys
All three methods are applicable to a Reference Data Store, whereas only the missing and unknown keys are relevant for an Operational Data Store (ODS). Key management should be handled at the data integration level, thereby making it transparent to the Business Intelligence layer.
Key Merging/Matching
When companies source data from more than one transaction system of a similar type, the same object may have different, non-unique legacy keys. Additionally, a single key may have several descriptions or attributes in each of the source systems. The independence of these systems can result in incongruent coding, which poses a greater problem than records being sourced from multiple systems.
A business can resolve this inconsistency by undertaking a complete code standardization initiative (often as part of a larger metadata management effort) or applying a Universal Reference Data Store (URDS). Standardizing code requires an object to be uniquely represented in the new system. Alternatively, URDS contains universal codes for common reference values. Most companies adopt this pragmatic approach, while embarking on the longer term solution of code standardization. The bottom line is that nearly every data warehouse project encounters this issue and needs to find a solution in the short term.
Missing Keys
A problem arises when a transaction is sent through without a value in a column where a foreign key should exist (i.e., a reference to a key in a reference table). This normally occurs during the loading of transactional data, although it can also occur when loading reference data into hierarchy structures. In many older data warehouse solutions, this condition would be identified as an error and the transaction row would be rejected. The row would have to be processed through some other mechanism to find the correct code and loaded at a later date. This is often a slow and cumbersome process that leaves the data warehouse incomplete until the issue is resolved. The more practical way to resolve this situation is to allocate a special key in place of the missing key, which links it with a dummy 'missing key' row in the related table. This enables the transaction to continue through the loading process and end up in the warehouse without further processing. Furthermore, the row ID of the bad transaction can be recorded in an error log, allowing the addition of the correct key value at a later time. The major advantage of this approach is that any aggregate values derived from the transaction table will be correct because the transaction exists in the data warehouse rather than being in some external error processing file waiting to be fixed.

Simple example:

PRODUCT     CUSTOMER   SALES REP   QUANTITY   UNIT PRICE
Audi TT18   Doe10224               1          35,000
In the transaction above, there is no code in the SALES REP column. As this row is processed, a dummy sales rep key (9999999) is added to the record to link to a "Missing Rep" row in the SALES REP table. A data warehouse key (8888888) is also added to the transaction.

PRODUCT     CUSTOMER   SALES REP   QUANTITY   UNIT PRICE   DWKEY
Audi TT18   Doe10224   9999999     1          35,000       8888888
The related sales rep record may look like this:

REP CODE   REP NAME      REP MANAGER
1234567    David Jones   Mark Smith
7654321    Mark Smith
9999999    Missing Rep
An error log entry to identify the missing key on this transaction may look like:

ERROR CODE   TABLE NAME   KEY NAME    KEY
MSGKEY       ORDERS       SALES REP   8888888
This type of error reporting is not usually necessary because the transactions with missing keys can be identified using standard end-user reporting tools against the data warehouse.
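The missing-key substitution described above can be sketched as a small row-processing function. This is an illustrative sketch only; the column and error-code names follow the examples above.

```python
MISSING_KEY = "9999999"  # dummy key that links to the 'Missing Rep' row

def resolve_missing_keys(txn, key_columns, error_log):
    """Substitute the dummy key for any empty foreign-key column so the
    row still loads, and record the row's warehouse key so the correct
    value can be added later."""
    for col in key_columns:
        if not txn.get(col):
            txn[col] = MISSING_KEY
            error_log.append({"ERROR CODE": "MSGKEY",
                              "KEY NAME": col,
                              "KEY": txn["DWKEY"]})
    return txn
```

Because the row loads with the dummy key, aggregates over the transaction table remain correct while the error log preserves a pointer for later repair.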
Unknown Keys
Unknown keys need to be treated much like missing keys, except that the load process has to add the unknown key value to the referenced table to maintain integrity, rather than explicitly allocating a dummy key to the transaction. The process also needs to make two error log entries: the first logs the fact that a new, unknown key has been added to the reference table; the second records the transaction in which the unknown key was found. Simple example: The sales rep reference data record might look like the following:
A transaction comes into the ODS with the record below:

PRODUCT     CUSTOMER   SALES REP   QUANTITY   UNIT PRICE
Audi TT18   Doe10224   2424242     1          35,000
In the transaction above, the code 2424242 appears in the SALES REP column. As this row is processed, a new row has to be added to the Sales Rep reference table. This allows the transaction to be loaded successfully.

DWKEY     REP NAME   REP MANAGER
2424242   Unknown
A data warehouse key (8888889) is also added to the transaction.

PRODUCT     CUSTOMER   SALES REP   QUANTITY   UNIT PRICE   DWKEY
Audi TT18   Doe10224   2424242     1          35,000       8888889
Some warehouse administrators like to have an error log entry generated to identify the addition of a new reference table entry. This can be achieved simply by adding the following entries to an error log.

ERROR CODE   TABLE NAME   KEY NAME    KEY
NEWROW       SALES REP    SALES REP   2424242
A second log entry can be added with the data warehouse key of the transaction in which the unknown key was found.
KEY
8888889
As with missing keys, error reporting is not essential because the unknown status is clearly visible through the standard end-user reporting. Moreover, regardless of the error logging, the system is self-healing because the newly added reference data entry will be updated with full details as soon as these changes appear in a reference data feed. This would result in the reference data entry looking complete.

DWKEY     REP NAME      REP MANAGER
2424242   David Digby   Mark Smith
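The unknown-key handling above can be sketched similarly. Again, this is illustrative only; the placeholder row and error code follow the examples in this section.

```python
def resolve_unknown_key(txn, key_col, reference, error_log):
    """If the transaction carries a key with no row in the reference
    table, insert a placeholder row (later overwritten by the reference
    feed, making the process self-healing) and log the addition."""
    key = txn[key_col]
    if key not in reference:
        reference[key] = {"REP NAME": "Unknown", "REP MANAGER": None}
        error_log.append({"ERROR CODE": "NEWROW", "KEY": key})
    return txn
```

The transaction loads immediately; when the full reference record arrives in a later feed, it simply updates the placeholder row.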
Employing the Informatica-recommended key management strategy produces the following benefits:

- All rows can be loaded into the data warehouse
- All objects are allocated a unique key
- Referential integrity is maintained
- Load dependencies are removed
Description
Although PowerCenter environments vary widely, most sessions and/or mappings can benefit from the implementation of common objects and optimization procedures. Follow these procedures and rules of thumb when creating mappings to help ensure optimization.
- Avoid calculating or testing the same value over and over. Calculate it once in an expression, and set a True/False flag.
- Within an expression, use variable ports to calculate a value that can be used multiple times within that transformation.
- Delete unnecessary links between transformations to minimize the amount of data moved, particularly in the Source Qualifier. This is also helpful for maintenance: if a transformation needs to be reconnected, it is best to have only the necessary ports set as input and output.
- In Lookup transformations, change unused ports to be neither input nor output. This makes the transformations cleaner looking and keeps the generated SQL override as small as possible, which cuts down on the amount of cache necessary and thereby improves performance.
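The "calculate once, reuse the flag" advice can be illustrated outside of PowerCenter with a few lines of ordinary code. The field names here are invented for the example; the point is that the shared condition is evaluated once per row, the way a variable port would be.

```python
def route_rows(rows):
    """Evaluate the shared condition once per row (like a variable port),
    then reuse the flag instead of re-testing it for every output column.
    Field names are invented for this example."""
    out = []
    for row in rows:
        is_domestic = row["country"] == "US"   # calculated once
        out.append({**row,
                    "tax_rule": "state" if is_domestic else "vat",
                    "ship_rule": "ground" if is_domestic else "intl"})
    return out
```

Without the local flag, the same comparison would run once per derived column instead of once per row.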
- The engine automatically converts compatible types, and sometimes data conversion is excessive. Data types are automatically converted when types differ between connected ports. Minimize data type changes between transformations by planning data flow prior to developing the mapping.
- Facilitate reuse. Plan for reusable transformations upfront. Use variables: both mapping variables and variable ports. Variable ports are especially beneficial when they can be used to calculate a complex expression or perform a disconnected lookup call only once instead of multiple times. Use mapplets to encapsulate multiple reusable transformations and to leverage the work of critical developers while minimizing mistakes when performing similar functions.
- Reduce the number of non-essential records that are passed through the entire mapping. Use active transformations that reduce the number of records as early in the mapping as possible (i.e., place filters and aggregators as close to the source as possible).
- Select the appropriate driving/master table when using joins. The table with the lesser number of rows should be the driving/master table for a faster join.
- Redesign mappings to utilize one Source Qualifier to populate multiple targets. This way the server reads the source only once. If you have different Source Qualifiers for the same source (e.g., one for delete and one for update/insert), the server reads the source once for each Source Qualifier.
- Remove or reduce field-level stored procedures.
- Utilize Pushdown Optimization. Design mappings so they can take advantage of the Pushdown Optimization feature. This improves performance by allowing the source and/or target database to perform the mapping logic.
1. When your source is large, cache lookup table columns for those lookup tables of 500,000 rows or less. This typically improves performance by 10 to 20 percent.
2. The rule of thumb is not to cache any table over 500,000 rows. This is only true if the standard row byte count is 1,024 or less; if the row byte count is more than 1,024, adjust the 500K-row standard down as the number of bytes increases (i.e., a 2,048-byte row can drop the cache row count to between 250K and 300K, so the lookup table should not be cached in this case). This is just a general rule, though. Try running the session with a large lookup cached and not cached; caching is often faster on very large lookup tables.
3. When using a Lookup Table Transformation, improve lookup performance by placing all conditions that use the equality operator (=) first in the list of conditions under the Condition tab.
4. Cache lookup tables only if the number of lookup calls is more than 10 to 20 percent of the lookup table rows. For fewer lookup calls, do not cache if the number of lookup table rows is large. For small lookup tables (i.e., less than 5,000 rows), cache for more than 5 to 10 lookup calls.
5. Replace the lookup with a decode or IIF (for small sets of values).
6. If caching lookups and performance is poor, consider replacing with an unconnected, uncached lookup.
7. For overly large lookup tables, use dynamic caching along with a persistent cache. Cache the entire table to a persistent file on the first run, enable the "update else insert" option on the dynamic cache, and the engine never has to go back to the database to read data from this table. You can also partition this persistent cache at run time for further performance gains.
8. When handling multiple matches, use the "Return any matching value" setting whenever possible. Also use this setting if the lookup is being performed only to determine that a match exists but the value returned is irrelevant. The lookup creates an index based on the key ports rather than all lookup transformation ports; this simplified indexing process can improve performance.
9. Review complex expressions. Examine mappings via Repository Reporting and Dependency Reporting within the mapping.

To optimize expressions and general mapping design:

1. Minimize aggregate function calls.
2. Replace an Aggregator Transformation object with an Expression Transformation object and an Update Strategy Transformation for certain types of aggregations.
3. Operators are faster than functions (i.e., || vs. CONCAT).
4. Optimize IIF expressions.
5. Avoid date comparisons in lookups; replace them with string comparisons.
6. Test expression timing by replacing the expression with a constant.
7. Use flat files. Flat files located on the server machine load faster than a database located on the server machine. Fixed-width files are faster to load than delimited files because delimited files require extra parsing. If processing intricate transformations, consider first loading the source flat file into a relational database, which allows the PowerCenter mappings to access the data in an optimized fashion by using filters and custom SQL SELECTs where appropriate.
8. If working with data that cannot be returned sorted (e.g., web logs), consider using the Sorter Advanced External Procedure.
9. Use a Router Transformation to separate data flows instead of multiple Filter Transformations.
10. Use a Sorter Transformation or hash auto-keys partitioning before an Aggregator Transformation to optimize the aggregate. With a Sorter Transformation, the Sorted Ports option can be used even if the original source cannot be ordered.
11. Use a Normalizer Transformation to pivot rows rather than multiple instances of the same target.
12. Rejected rows from an Update Strategy are logged to the bad file. Consider filtering before the Update Strategy if retaining these rows is not critical, because logging causes extra overhead on the engine. Choose the option in the Update Strategy to discard rejected rows.
13. When using a Joiner Transformation, be sure to make the source with the smallest amount of data the Master source.
14. If an update override is necessary in a load, consider using a Lookup transformation just in front of the target to retrieve the primary key. The primary key update is much faster than the non-indexed lookup override.
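The caching rules of thumb above combine row count, row width, and probe ratio. A small helper makes the arithmetic explicit; the cut-offs are the heuristics stated in the text, not hard limits, and a real decision should still be confirmed by timing the session both ways.

```python
def should_cache_lookup(row_count, row_bytes=1024, lookup_calls=None):
    """Heuristic from the rules of thumb above: cache up to ~500K rows at
    ~1,024 bytes per row, scaling the row threshold down for wider rows,
    and skip caching when the probe ratio is under roughly 10 percent.
    These cut-offs are guidelines, not hard limits."""
    threshold = (500_000 * 1024) // max(row_bytes, 1)
    if row_count > threshold:
        return False                      # table too large to cache
    if lookup_calls is not None and row_count:
        if lookup_calls / row_count < 0.10:
            return False                  # too few probes to repay the cache
    return True
```

For example, at 2,048 bytes per row the threshold drops to 250,000 rows, matching the 250K-300K range quoted in the text.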
Description
Reuse Transformation Logic
Templates can be heavily used in a data integration and warehouse environment, when loading information from multiple source providers into the same target structure, or when similar source system structures are employed to load different target instances. Using templates guarantees that any transformation logic that is developed and tested correctly, once, can be successfully applied across multiple mappings as needed. In some instances, the process can be further simplified if the source/target structures have the same attributes, by simply creating multiple instances of the session, each with its own connection/execution attributes, instead of duplicating the mapping.
Transport Mechanism
Once Mapping Templates have been developed, they can be distributed by any of the following procedures:
- Copy the mapping from the development area to the desired repository/folder.
- Export the mapping template to XML and import it into the desired repository/folder.
- Aggregation using Sorted Input
- Tracking Dimension History
- Constraint-Based Loading
- Loading Incremental Updates
- Tracking History and Current
- Inserts or Updates
Transformation Techniques
- Error Handling Strategy
- Flat File Creation with Headers and Footers
- Removing Duplicate Source Records
- Transforming One Record into Multiple Records
- Dynamic Caching
- Sequence Generator Alternative
- Streamline a Mapping with a Mapplet
- Reusable Transformations (Customers)
- Using a Sorter
- Pipeline Partitioning Mapping Template
- Using Update Strategy to Delete Rows
- Loading Heterogeneous Targets
- Load Using External Procedure
- Aggregation Using Expression Transformation
- Building a Parameter File
- Best Build Logic
- Comparing Values Between Records
- Transaction Control Transformation
Source-Specific Requirements
- Processing VSAM Source Files
- Processing Data from an XML Source
- Joining a Flat File with a Relational Table
Industry-Specific Requirements
Description
Although naming conventions are important for all repository and database objects, the suggestions in this Best Practice focus on the former. Choosing a convention and sticking with it is the key.

Having a good naming convention facilitates smooth migration and improves readability for anyone reviewing or carrying out maintenance on the repository objects by helping them understand the processes being affected. If consistent names and descriptions are not used, significant time may be needed to understand the working of mappings and transformation objects. If no description is provided, a developer is likely to spend considerable time going through an object or mapping to understand its objective.

The following pages offer some suggestions for naming conventions for various repository objects. Whatever convention is chosen, it is important to make the selection very early in the development cycle and communicate the convention to project staff working on the repository. The policy can be enforced by peer review and at test phases by adding convention checks to test plans and test execution documents.
Mapplet Target
Aggregator Transformation
Expression Transformation
Filter Transformation: FIL_ or FILT_{FUNCTION} that leverages the expression, or a name that describes the processing being done.
Java Transformation: JV_{FUNCTION} that leverages the expression, or a name that describes the processing being done.
INFORMATICA CONFIDENTIAL
BEST PRACTICE
250 of 702
Joiner Transformation: JNR_{DESCRIPTION}
Lookup Transformation: LKP_{TABLE_NAME}, or suffix with _{descriptor} if there are multiple lookups on a single table. For unconnected lookups, use ULKP in place of LKP.
Mapplet Input Transformation: MPLTI_{DESCRIPTOR} indicating the data going into the mapplet.
Mapplet Output Transformation: MPLTO_{DESCRIPTOR} indicating the data coming out of the mapplet.
MQ Source Qualifier Transformation
Normalizer Transformation: NRM_{FUNCTION} that leverages the expression, or a name that describes the processing being done.
Rank Transformation: RNK_{FUNCTION} that leverages the expression, or a name that describes the processing being done.
Router Transformation: RTR_{DESCRIPTOR}
Sequence Generator Transformation: SEQ_{DESCRIPTOR}; if generating keys for a target table entity, refer to that entity.
Sorter Transformation: SRT_{DESCRIPTOR}
Source Qualifier Transformation: SQ_{SOURCE_TABLE1}_{SOURCE_TABLE2}. Using all source tables can be impractical if there are a lot of tables in a source qualifier, so refer to the type of information being obtained, for example SQ_SALES_INSURANCE_PRODUCTS for a certain type of product.
Stored Procedure Transformation: SP_{STORED_PROCEDURE_NAME}
Transaction Control Transformation: TCT_ or TRANS_{DESCRIPTOR} indicating the function of the transaction control.
Union Transformation: UN_{DESCRIPTOR}
Update Strategy Transformation: UPD_{UPDATE_TYPE(S)}, or UPD_{UPDATE_TYPE(S)}_{TARGET_NAME} if there are multiple targets in the mapping, e.g., UPD_UPDATE_EXISTING_EMPLOYEES.
XML Generator Transformation: XMG_{DESCRIPTOR} defines the target message.
XML Parser Transformation: XMP_{DESCRIPTOR} defines the message being selected.
XML Source Qualifier Transformation: XMSQ_{DESCRIPTOR} defines the data being selected.
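Conventions like these are easiest to enforce when checks are automated, for example in the peer-review step mentioned earlier. A minimal sketch of such a check is below; the prefix table covers only a subset of the conventions above and would be extended to match a project's own standards.

```python
import re

# A subset of the prefixes above; extend per project standards.
NAME_PATTERNS = {
    "Joiner":             r"^JNR_\w+",
    "Lookup":             r"^U?LKP_\w+",
    "Router":             r"^RTR_\w+",
    "Sequence Generator": r"^SEQ_\w+",
    "Sorter":             r"^SRT_\w+",
    "Source Qualifier":   r"^SQ_\w+",
    "Update Strategy":    r"^UPD_\w+",
}

def check_name(transform_type, name):
    """Return True if the object name follows the agreed prefix convention."""
    pattern = NAME_PATTERNS.get(transform_type)
    return bool(pattern and re.match(pattern, name))
```

Run against an exported repository listing, such a check flags objects like SORT_ORDERS that miss the agreed SRT_ prefix.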
Port Names
Port names should remain the same as the source unless some other action is performed on the port. In that case, the port should be prefixed with the appropriate name.
When the developer brings a source port into a lookup, the port should be prefixed with in_. This helps the user immediately identify the ports that are being input without having to line up the ports with the input checkbox. In any other transformation, if the input port is transformed into an output port with the same name, prefix the input port with in_.

Generated output ports can also be prefixed. This helps trace the port value throughout the mapping as it travels through many other transformations. If the autolink feature based on names is to be used, however, outputs may be better left as the name of the target port in the next transformation.

For variables inside a transformation, the developer can use the prefix v, v_, or var_ plus a meaningful name.

With some exceptions, port standards apply when creating a transformation object. The exceptions are the Source Definition, the Source Qualifier, the Lookup, and the Target Definition ports, which must not change, since the port names are used to retrieve data from the database. Other transformations that are not applicable to the port standards are:
- Normalizer: the ports created in the Normalizer are automatically formatted when the developer configures it.
- Sequence Generator: the ports are reserved words.
- Router: because output ports are created automatically, prefixing the input ports with I_ prefixes the output ports with I_ as well. Port names should not have any prefix.
- Sorter, Update Strategy, Transaction Control, and Filter: these ports are always input and output. There is no need to rename them unless they are prefixed; prefixed port names should be removed.
- Union: the group ports are automatically assigned to the input and output, so any prefix is reflected in both the input and output. The port names should not have any prefix.
The standard prefixes are:

- in_ or i_ for input ports
- o_ or _out for output ports
- io_ for input/output ports
- v, v_, or var_ for variable ports
- lkp_ for returns from lookups
- mplt_ for returns from mapplets
Prefixes are preferable because they are generally easier to see; developers do not need to expand the columns to see the suffix for longer port names. Transformation object ports can also:
- Have the Source Qualifier port name.
- Be unique.
- Be meaningful.
- Be given the target port name.
Transformation Descriptions
This section defines the standards to be used for transformation descriptions in the Designer.
- Source Qualifier Descriptions. Should include the aim of the source qualifier and the data it is intended to select. Should also indicate if any overrides are used; if so, describe the filters or settings used. Some projects prefer items such as the SQL statement to be included in the description as well.
- Lookup Transformation Descriptions. Describe the lookup along the lines of "the [lookup attribute] obtained from [lookup table name] to retrieve the [lookup attribute name]", where:
  - Lookup attribute is the name of the column being passed into the lookup and used as the lookup criteria.
  - Lookup table name is the table on which the lookup is performed.
  - Lookup attribute name is the name of the attribute being returned from the lookup. If appropriate, specify the condition under which the lookup is actually executed.
  It is also important to note lookup features such as persistent cache or dynamic lookup.
- Expression Transformation Descriptions. Must adhere to the following format: "This expression [explanation of what the transformation does]." Expressions can be distinctly different depending on the situation, so the explanation should be specific to the actions being performed. Within each Expression transformation, ports have their own description in the format: "This port [explanation of what the port is used for]."
- Aggregator Transformation Descriptions. Must adhere to the following format: "This Aggregator [explanation of what the transformation does]." Aggregators can be distinctly different depending on the situation, so the explanation should be specific to the actions being performed. Within each Aggregator transformation, ports have their own description in the format: "This port [explanation of what the port is used for]."
- Sequence Generator Transformation Descriptions. Must adhere to the following format: "This Sequence Generator provides the next value for the [column name] on the [table name]", where table name is the table being populated by the sequence number, and column name is the column within that table being populated.
- Joiner Transformation Descriptions. Must adhere to the following format: "This Joiner uses [joining field names] from [joining table names]", where joining field names are the names of the columns on which the join is done.
- Normalizer Transformation Descriptions. Must adhere to the following format: "This Normalizer [explanation]."
- Filter Transformation Descriptions. Must adhere to the following format: "This Filter processes [explanation]", where explanation describes what the filter criteria are and what they do.
- Stored Procedure Transformation Descriptions. Explain the stored procedure's functionality within the mapping (i.e., what does it return in relation to the input ports?).
- Mapplet Input Transformation Descriptions. Describe the input values and their intended use in the mapplet.
- Mapplet Output Transformation Descriptions. Describe the output ports and the subsequent use of those values. As an example, for an exchange rate mapplet, describe what currency the output value will be in. Answer questions like: Is the currency fixed or based on other data? What kind of rate is used? Is it a fixed inter-company rate, an inter-bank rate, a business rate, or a tourist rate? Has the conversion gone through an intermediate currency?
- Update Strategy Transformation Descriptions. Describe the Update Strategy and whether it is fixed in its function or determined by a calculation.
- Sorter Transformation Descriptions. Explain the port(s) being sorted and their sort direction.
- Router Transformation Descriptions. Describe the groups and their functions.
- Union Transformation Descriptions. Describe the source inputs and indicate what further processing on those inputs (if any) is expected to take place in later transformations in the mapping.
- Transaction Control Transformation Descriptions. Describe the process behind the transaction control and the function of the control to commit or roll back.
- Custom Transformation Descriptions. Describe the function the custom transformation accomplishes, what data is expected as input, and what data is generated as output. Also indicate the module name (and location) and the procedure that is used.
- External Procedure Transformation Descriptions. Describe the function of the external procedure, what data is expected as input, and what data is generated as output. Also indicate the module name (and location) and the procedure that is used.
- Java Transformation Descriptions. Describe the function of the Java code, what data is expected as input, and what data is generated as output. Also indicate whether the Java code makes the object an Active or Passive transformation.
- Rank Transformation Descriptions. Indicate the columns used in the rank, the number of records returned from the rank, the rank direction, and the purpose of the transformation.
- XML Generator Transformation Descriptions. Describe the data expected for the generation of the XML and indicate the purpose of the XML being generated.
- XML Parser Transformation Descriptions. Describe the input XML expected and the output from the parser, and indicate the purpose of the transformation.
Mapping Comments
These comments describe the source data obtained and the structure (file, table, or facts and dimensions) that it populates. Remember to use business terms along with technical details such as table names. This is beneficial when maintenance is required or when issues arise that need to be discussed with business analysts.
Mapplet Comments
These comments explain the process that the mapplet carries out. Always be sure to see the notes above regarding descriptions for the Mapplet Input and Output transformations.
INFORMATICA CONFIDENTIAL BEST PRACTICE 254 of 702
Repository Objects
Repositories, as well as repository-level objects, should also have meaningful names. Repository names should be prefixed with either L_ for local or G_ for global, followed by a descriptor. Descriptors usually include information about the project and/or the level of the environment (e.g., PROD, TEST, DEV).
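As an illustration, the prefix convention above could be enforced with a small naming check. The pattern and function name here are hypothetical, not part of any Informatica tooling:

```python
import re

# Hypothetical validator for the repository naming convention described above:
# an L_ (local) or G_ (global) prefix followed by one or more descriptor parts,
# e.g. L_SALES_DEV or G_FINANCE_PROD.
REPO_NAME_PATTERN = re.compile(r"^(L|G)_[A-Z0-9]+(_[A-Z0-9]+)*$")

def is_valid_repository_name(name: str) -> bool:
    """Return True if the name follows the L_/G_ prefix convention."""
    return REPO_NAME_PATTERN.match(name) is not None
```

A check like this could run as part of a deployment review script to catch non-conforming repository names early.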
wait_ or ew_{DESCRIPTOR}: Event-Wait task. Waits for an event to occur. Once the event triggers, the PowerCenter Server continues executing the rest of the workflow.

raise_ or er_{DESCRIPTOR}: Event-Raise task. Represents a user-defined event. When the PowerCenter Server runs the Event-Raise task, it triggers the event. Use the Event-Raise task with the Event-Wait task to define events.
Domain Node: DOM_ or DMN_[PROJECT]_[ENVIRONMENT] (e.g., DOM_PROCURE_DEV)

PowerExchange application connections (Connection Type: Application):

PWX DB2 CDC Real Time: PWXR_DB2_Instance_Name
PWX NRDB Batch: PWXB_IMS_Instance_Name
PWX NRDB CDC Change: PWXC_IMS_Instance_Name
PWX NRDB CDC Real Time: PWXR_IMS_Instance_Name
PWX Oracle CDC Change: PWXC_ORA_Instance_Name
PWX Oracle CDC Real Time: PWXR_ORA_Instance_Name

PowerExchange target connections (Target Type / Connection Type):

DB2/390: PWX DB2390 relational database connection, PWXT_DB2_Instance_Name
DB2/400: PWX DB2400 relational database connection, PWXT_DB2_Instance_Name
Description
As time windows shrink and data volumes increase, it is important to understand the impact of a suitable incremental load strategy. The design should allow data to be incrementally added to the data warehouse with minimal impact on the overall system. This Best Practice describes several possible load strategies.
Incremental Aggregation
Incremental aggregation is useful for applying incrementally-captured changes in the source to aggregate calculations in a session. If the source changes only incrementally, and you can capture those changes, you can configure the session to process only those changes with each run. This allows the PowerCenter Integration Service to update the target incrementally, rather than forcing it to process the entire source and recalculate the same calculations each time you run the session. If the session performs incremental aggregation, the PowerCenter Integration Service saves index and data cache information to disk when the session finishes. The next time the session runs, the PowerCenter Integration Service uses this historical information to perform the incremental aggregation. To use this functionality, set the Incremental Aggregation session attribute. For details, see Chapter 24 in the Workflow Administration Guide. Use incremental aggregation under the following conditions:
- Your mapping includes an aggregate function.
- The source changes only incrementally.
- You can capture incremental changes (e.g., by filtering source data by timestamp).
- You get only delta records (e.g., you may have implemented the CDC (Change Data Capture) feature of PowerExchange).

Do not use incremental aggregation under the following conditions:

- You cannot capture new source data.
- Processing the incrementally-changed source significantly changes the target. If processing the incrementally-changed source alters more than half the existing target, the session may not benefit from using incremental aggregation.
- Your mapping contains percentile or median functions.
Some conditions that may help in making a decision on an incremental strategy include:

- Error handling, loading, and unloading strategies for recovering, reloading, and unloading data.
- History-tracking requirements for keeping track of what has been loaded and when.
- Slowly-changing dimensions.

The Informatica Mapping Wizards are a good start to an incremental load strategy; the wizards generate generic mappings as a starting point (refer to Chapter 15 in the Designer Guide).
Source Analysis
Data sources typically fall into the following possible scenarios:

- Delta records. Records supplied by the source system include only new or changed records. In this scenario, all records are generally inserted or updated into the data warehouse.
- Record indicators or flags. Records include columns that specify the intention of the record to be populated into the warehouse. Records can be selected based upon this flag for all inserts, updates, and deletes.
- Date-stamped data. Data is organized by timestamps and loaded into the warehouse based upon the last processing date or the effective date range.
- Key values are present. When only key values are present, data must be checked against what has already been entered into the warehouse. All values must be checked before entering the warehouse.
- No key values present. When no key values are present, surrogate keys are created and all data is inserted into the warehouse based upon the validity of the records.
Several techniques can be used to determine whether, and how, a source record should be applied to the target:

- Compare with the target table. When source delta loads are received, determine if the record exists in the target table. The timestamps and natural keys of the record are the starting point for identifying whether the record is new, modified, or should be archived. If the record does not exist in the target, insert it as a new row. If it does exist, determine whether the record needs to be updated, inserted as a new record, removed (deleted from the target), or filtered out and not added to the target.
- Record indicators. Record indicators can be beneficial when lookups into the target are not necessary. Take care to ensure that the record exists for update or delete scenarios, or does not exist for successful inserts. Some design effort may be needed to manage errors in these situations.
- Joins of sources to targets. Records are directly joined to the target using Source Qualifier join conditions or using Joiner transformations after the Source Qualifiers (for heterogeneous sources). When using Joiner transformations, take care to ensure the data volumes are manageable and that the smaller of the two datasets is configured as the Master side of the join.
- Lookup on target. Using the Lookup transformation, look up the keys or critical columns in the target relational database. Consider the caching and indexing possibilities.
- Load table log. Generate a log table of records that have already been inserted into the target system. You can use this table for comparison with lookups or joins, depending on the need and volume. For example, store keys in a separate table and compare source records against this log table to determine the load strategy. Another example is to store the dates associated with the data already loaded into a log table.
- MD5 checksum function. Generate a unique checksum value for each row of data, then compare the previous and current checksum values to determine whether the record has changed.
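The MD5 checksum comparison can be sketched as follows, assuming rows are represented as dictionaries. In PowerCenter itself this would typically be done with the MD5() function in an Expression transformation; the helper names here are illustrative only:

```python
import hashlib

def row_checksum(row: dict) -> str:
    """Concatenate column values in a fixed column order and return an MD5 digest.
    A fixed order is essential: the same data must always produce the same hash."""
    concat = "|".join(str(row[col]) for col in sorted(row))
    return hashlib.md5(concat.encode("utf-8")).hexdigest()

def has_changed(source_row: dict, previous_checksum: str) -> bool:
    """Compare the current row's checksum with the one stored at the last load."""
    return row_checksum(source_row) != previous_checksum
```

Storing only the checksum (rather than every column value) keeps the comparison table small while still detecting a change in any column.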
Date-Stamped Data
This method involves data that has been stamped using effective dates or sequences. The incremental load can be determined by dates greater than the previous load date or by data that has an effective key greater than the last key processed. With relational sources, records can be selected based on this effective date so that only records past a certain date are loaded into the warehouse. Views can also be created to perform the selection criteria, so that the processing does not have to be incorporated into the mappings but is kept on the source component. However, placing the load strategy into the mapping components is more flexible, controllable by the data integration developers, and visible in the associated metadata. To compare the effective dates, you can use mapping variables to provide the previous date processed (see the description below). An alternative to repository-maintained mapping variables is the use of control tables to store the dates and update the control table after each load. Non-relational data can be filtered as records are loaded, based upon the effective dates or sequenced keys. A Router or Filter transformation can be placed after the Source Qualifier to remove old records.
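A minimal sketch of the date-based selection, with a hypothetical CREATE_DATE column standing in for the source's effective-date field (in a mapping this would be a source-qualifier filter such as CREATE_DATE > $$INCREMENT_DATE):

```python
from datetime import datetime

def select_incremental(rows, last_load_date):
    """Keep only rows whose effective date is later than the previous load date,
    mirroring a date-based incremental filter on the source."""
    return [r for r in rows if r["CREATE_DATE"] > last_load_date]
```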
With key values, you can also check whether you need to update these records or discard the source record. It may be possible to perform a join with the target tables so that new data can be selected and loaded into the target. It may also be feasible to perform a lookup on the target to see if the data already exists.
Target-based load strategies include:

- Loading directly into the target. Loading directly into the target is possible when the data is going to be bulk loaded. The mapping is then responsible for error control, recovery, and update strategy.
- Loading into flat files and bulk loading using an external loader. The mapping loads data directly into flat files. You can then invoke an external loader to bulk load the data into the target. This method reduces load times (with less downtime for the data warehouse) and provides a means of maintaining a history of the data being loaded into the target. Typically, this method is only used for updates into the warehouse.
- Loading into a mirror database. The data is loaded into a mirror database to avoid downtime of the active data warehouse. After the data has been loaded, the databases are switched, making the mirror the active database and the active database the mirror.
The initial value of the mapping variable is used in the first run of the session and, as such, should represent a date earlier than the earliest desired data. The date can use any of the standard PowerCenter date formats.
Step 3: Refresh the mapping variable for the next session run using an Expression Transformation
Use an Expression transformation and the pre-defined variable functions to set and use the mapping variable. In the Expression transformation, create a variable port and use the SETMAXVARIABLE variable function to capture the maximum source date selected during each run:

SETMAXVARIABLE($$INCREMENT_DATE, CREATE_DATE)

CREATE_DATE in this example is the date field from the source that should be used to identify incremental rows. You can use mapping variables in the Expression, Filter, Router, and Update Strategy transformations.
As the session runs, the variable is refreshed with the max date value encountered between the source and variable. So, if one row comes through with 9/1/2004, then the variable gets that value. If all subsequent rows are LESS than that, then 9/1/2004 is preserved. Note: This behavior has no effect on the date used in the source qualifier. The initial select always contains the maximum data value encountered during the previous, successful session run. When the mapping completes, the PERSISTENT value of the mapping variable is stored in the repository for the next run of your session. You can view the value of the mapping variable in the session log file. The advantage of the mapping variable and incremental loading is that it allows the session to use only the new rows of data. No table is needed to store the max(date) since the variable takes care of it. After a successful session run, the PowerCenter Integration Service saves the final value of each variable in the repository. So when you run your session the next time, only new data from the source system is captured. If necessary, you can override the value saved in the repository with a value saved in a parameter file.
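The SETMAXVARIABLE behavior described above can be sketched as follows. The function and session loop are illustrative only, not PowerCenter APIs; ISO-formatted date strings are used so that ordinary comparison gives date order:

```python
def set_max_variable(current_value, row_value):
    """Emulates SETMAXVARIABLE: the variable keeps the maximum value seen so far."""
    return row_value if row_value > current_value else current_value

def run_session(row_dates, persisted_value):
    """Process every row's date, then return the value that would be persisted
    to the repository for the next session run."""
    value = persisted_value
    for row_date in row_dates:
        value = set_max_variable(value, row_date)
    return value
```

Note how a row earlier than the current maximum leaves the variable unchanged, matching the behavior described in the text.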
In addition, by leveraging group source processing, where multiple sources are placed in a single mapping, the PowerCenter session reads the committed changes for multiple sources in a single efficient pass, and in the order they occurred. The changes are then propagated to the targets, and upon session completion, restart tokens (markers) are written out to a PowerCenter file so that the next session run knows the point to extract from.
After installing PWX, ensure the PWX Listener is up and running and that connectivity is established to the Listener. For best performance, the Listener should be co-located with the source system.

In the PWX Navigator client tool, use metadata to configure data access. This means creating data maps for the non-relational-to-relational view of mainframe sources (such as IMS and VSAM) and capture registrations for all sources (mainframe, Oracle, DB2, etc.). Registrations define the specific tables and columns desired for change capture. There should be one registration per source. Group the registrations logically, for example, by source database.

For an initial test, make changes in the source system to the registered sources and ensure that the changes are committed. Still working in PWX Navigator (and before using PowerCenter), perform Row Tests to verify the returned change records, including the transaction action flag (the DTL__CAPXACTION column) and the timestamp. Set the required access mode: CAPX for change and CAPXRT for real time. If desired, edit the PWX extraction maps to add the Change Indicator (CI) column. This CI flag (Y or N) allows for field-level capture and can be filtered in the PowerCenter mapping.

Use PowerCenter to materialize the targets (i.e., to ensure that sources and targets are in sync prior to starting the change capture process). This can be accomplished with a simple pass-through batch mapping. This same bulk mapping can be reused for CDC purposes, but only if specific CDC columns are not included, and by changing the session connection/mode.

Import the PWX extraction maps into Designer. This requires the PWXPC component.
Use group sourcing to create the CDC mapping by including multiple sources in the mapping. This enhances performance because only one read/connection is made to the PWX Listener and all changes (for the sources in the mapping) are pulled at one time.

Keep the CDC mappings simple. There are some limitations; for instance, you cannot use active transformations. If loading to a staging area, store the transaction types (i.e., insert/update/delete) and the timestamp for subsequent processing downstream, and include an Update Strategy transformation in the mapping with DD_INSERT or DD_UPDATE in order to override the default behavior and store the action flags.

Set up the Application Connection in Workflow Manager to be used by the CDC session. This requires the PWXPC component. There should be one connection and token file per CDC mapping/session. Set the UOW (unit of work) count to a low value for faster commits to the target for real-time sessions. Specify the restart token location and file on the PowerCenter Integration Service (within the infa_shared directory) and specify the location of the PWX Listener.

In the CDC session properties, enable session recovery (i.e., set the Recovery Strategy to Resume from last checkpoint). Use post-session commands to archive the restart token files for restart/recovery purposes. Also archive the session logs.
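The post-session restart-token archiving might look like the following sketch. The paths and function name are hypothetical; in practice this is typically a small script invoked as a post-session command:

```python
import shutil
import time
from pathlib import Path

def archive_restart_token(token_file: str, archive_dir: str) -> str:
    """Copy a PWX restart token file into an archive directory with a timestamp
    suffix, so a known-good token can be restored for restart/recovery.
    The token file normally lives under the infa_shared directory."""
    src = Path(token_file)
    dest_dir = Path(archive_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d_%H%M%S")
    dest = dest_dir / f"{src.name}.{stamp}"
    shutil.copy2(src, dest)  # copy2 preserves file timestamps
    return str(dest)
```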
Description
PowerCenter with real-time option can be used to process data from real-time data sources. PowerCenter supports the following types of real-time data:
- Messages and message queues. PowerCenter with the real-time option can be used to integrate third-party messaging applications using a specific version of PowerCenter Connect. Each PowerCenter Connect version supports a specific industry-standard messaging application, such as IBM MQSeries, JMS, MSMQ, SAP NetWeaver mySAP Option, TIBCO, and webMethods. You can read from messages and message queues and write to messages, messaging applications, and message queues. IBM MQSeries uses a queue to store and exchange data. Other applications, such as TIBCO and JMS, use a publish/subscribe model; in this case, the message exchange is identified using a topic.
- Web service messages. PowerCenter can receive a web service message from a web service client through the Web Services Hub, transform the data, and load the data to a target or send a message back to a web service client. A web service message is a SOAP request from a web service client or a SOAP response from the Web Services Hub. The Integration Service processes real-time data from a web service client by receiving a message request through the Web Services Hub and processing the request. The Integration Service can send a reply back to the web service client through the Web Services Hub or write the data to a target.
- Changed source data. PowerCenter can extract changed data in real time from a source table using the PowerExchange Listener and write the data to a target. Real-time sources supported by PowerExchange are ADABAS, DATACOM, DB2/390, DB2/400, DB2/UDB, IDMS, IMS, MS SQL Server, Oracle, and VSAM.
Connection Setup
PowerCenter uses some attribute values in order to correctly connect to and identify the third-party messaging application and the message itself. Each version of PowerCenter Connect supplies its own connection attributes that need to be configured properly before running a real-time session.
The PowerCenter real-time option uses a zero latency engine to process data from the messaging system. Depending on the messaging systems and the application that sends and receives messages, there may be a period when there are many messages and, conversely, there may be a period when there are no messages. PowerCenter uses the attribute Flush Latency to determine how often the messages are being flushed to the target. PowerCenter also provides various attributes to control when the session ends. The following reader attributes determine when a PowerCenter session should end:
- Message Count. Controls the number of messages the PowerCenter Server reads from the source before the session stops reading from the source.
- Idle Time. Indicates how long the PowerCenter Server waits when no messages arrive before it stops reading from the source.
- Time Slice Mode. Indicates a specific range of time during which the server reads messages from the source. Only PowerCenter Connect for MQSeries uses this option.
- Reader Time Limit. Indicates the number of seconds the PowerCenter Server spends reading messages from the source.
The specific filter conditions and options available to you depend on which real-time source is being used (for example, the attributes for PowerExchange real-time CDC for DB2/400).
Set the attributes that control how the reader ends. One or more attributes can be used to control the end of the session. For example: if the Reader Time Limit attribute is set to 3600, the reader ends after 3600 seconds. If the Idle Time limit is set to 500 seconds, the reader ends if it doesn't process any changes for 500 seconds (i.e., it remains idle for 500 seconds). If more than one attribute is set, the first attribute that satisfies its condition ends the session. Note: The real-time attributes can be found in the Reader Properties for PowerCenter Connect for JMS, TIBCO, webMethods, and SAP IDoc. For PowerCenter Connect for MQSeries, the real-time attributes must be specified as a filter condition.
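How the reader-termination attributes interact can be sketched as a polling loop, where whichever condition is satisfied first ends the read. Here `source.poll` is a hypothetical call (returning a message, or None on timeout), not a PowerCenter Connect API:

```python
import time

def read_messages(source, message_count_limit=None, idle_time=None,
                  reader_time_limit=None):
    """Read messages until one of the configured end-of-session conditions fires:
    Reader Time Limit, Idle Time, or Message Count."""
    messages = []
    start = time.monotonic()
    while True:
        if reader_time_limit is not None and time.monotonic() - start >= reader_time_limit:
            break  # Reader Time Limit reached
        msg = source.poll(timeout=idle_time)
        if msg is None:
            break  # Idle Time elapsed with no message
        messages.append(msg)
        if message_count_limit is not None and len(messages) >= message_count_limit:
            break  # Message Count reached
    return messages
```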
The next step is to set the Real-time Flush Latency attribute. The Flush Latency defines how often PowerCenter should flush messages, expressed in milliseconds. For example, if the Real-time Flush Latency is set to 2000, PowerCenter flushes messages every two seconds. The messages are also flushed from the reader buffer if the Source-Based Commit condition is reached. The Source-Based Commit condition is defined in the Properties tab of the session.

The message recovery option can be enabled to ensure that no messages are lost if a session fails as a result of an unpredictable error, such as a power loss. This is especially important for real-time sessions because some messaging applications do not store messages after they are consumed by another application.

A unit of work (UOW) is a collection of changes within a single commit scope made by a transaction on the source system from an external application. Each UOW may consist of a different number of rows, depending on the transaction to the source system. When you use the UOW Count session condition, the Integration Service commits source data to the target when it reaches the number of UOWs specified in the session condition. For example, if the value for UOW Count is 10, the Integration Service commits all data read from the source after the 10th UOW enters the source. The lower you set the value, the faster the Integration Service commits data to the target; a lower value also causes the system to consume more resources.
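The interplay of UOW Count and Flush Latency can be sketched as follows. This is illustrative only: `apply_commit` is a hypothetical callback standing in for the target commit, and each unit of work is modeled as a list of rows in one commit scope:

```python
import time

def process_changes(units_of_work, apply_commit, uow_count=10, flush_latency_ms=2000):
    """Buffer incoming units of work and flush to the target whenever the UOW
    Count threshold is reached or the flush-latency interval has elapsed."""
    buffer = []
    uows_seen = 0
    last_flush = time.monotonic()
    for uow in units_of_work:
        buffer.extend(uow)  # each UOW is a list of rows in one commit scope
        uows_seen += 1
        elapsed_ms = (time.monotonic() - last_flush) * 1000
        if uows_seen >= uow_count or elapsed_ms >= flush_latency_ms:
            apply_commit(list(buffer))
            buffer.clear()
            uows_seen = 0
            last_flush = time.monotonic()
    if buffer:
        apply_commit(list(buffer))  # final flush at end of read
```

Lowering `uow_count` commits sooner (lower latency to the target) at the cost of more frequent commit overhead, which is the trade-off the text describes.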
To run a workflow continuously, edit the workflow properties, open the Scheduler, and select Run Continuously from Run Options. A continuous workflow starts automatically when the Integration Service initializes. When the workflow stops, it restarts immediately.
Compression can reduce a network bottleneck; however, using compression also increases the CPU and memory usage on the source system.
- Queue Manager: the Queue Manager name for the message queue (in Windows, the default Queue Manager name is QM_<machine name>).
- Queue Name: the message queue name.

To verify these names, open the MQSeries Administration Console; the Queue Manager appears in the left panel. Expand the Queue Manager icon, and a list of the queues for that queue manager appears. Note that the Queue Manager name and Queue Name are case-sensitive.
- JNDI Application Connection, which is used to connect to a JNDI server during a session run. Its attributes are: Name, JNDI Context Factory, JNDI Provider URL, JNDI UserName, and JNDI Password.
- JMS Application Connection, which is used to connect to a JMS provider during a session run.
PROVIDER_USERDN=cn=myname,o=infa,c=rc
PROVIDER_PASSWORD=test

The following table shows the JMSAdmin.config settings and the corresponding attributes in the JNDI application connection in the Workflow Manager:

JMSAdmin.config Setting: JNDI Application Connection Attribute
INITIAL_CONTEXT_FACTORY: JNDI Context Factory
PROVIDER_URL: JNDI Provider URL
PROVIDER_USERDN: JNDI UserName
PROVIDER_PASSWORD: JNDI Password
When a Queue Connection Factory is used, define a JMS queue as the destination. When a Topic Connection Factory is used, define a JMS topic as the destination.
The command to define a queue connection factory (qcf) is:

def qcf(<qcf_name>) qmgr(queue_manager_name) hostname(QM_machine_hostname) port(QM_machine_port)

The command to define a JMS queue is:

def q(<JMS_queue_name>) qmgr(queue_manager_name) qu(queue_manager_queue_name)

The command to define a JMS topic connection factory (tcf) is:

def tcf(<tcf_name>) qmgr(queue_manager_name) hostname(QM_machine_hostname) port(QM_machine_port)
The command to define the JMS topic is:

def t(<JMS_topic_name>) topic(pub/sub_topic_name)

The topic name must be unique, for example: topic(application/infa)

The following table shows the JMS object types and the corresponding attributes in the JMS application connection in the Workflow Manager:

JMS Object Type: JMS Application Connection Attribute
QueueConnectionFactory or TopicConnectionFactory: JMS Connection Factory Name
JMS Queue Name or JMS Topic Name: JMS Destination
PROVIDER_USERDN=cn=informatica,o=infa,c=rc
PROVIDER_PASSWORD=test

JMS Connection: the JMS configuration is similar to the JMS Connection for IBM MQSeries.
To configure a JMS queue or topic, select Services > JMS > Destinations under your domain, then click Configure a New JMSQueue or Configure a New JMSTopic. The following table shows the WebLogic Server JMS objects and the corresponding attributes in the JMS application connection in the Workflow Manager:

WebLogic Server JMS Object: JMS Application Connection Attribute
Connection Factory Settings: JNDIName - JMS Connection Factory Name
Destination Settings: JNDIName - JMS Destination
In addition to the JNDI and JMS settings, BEA WebLogic also offers a function called JMS Store, which can be used for persistent messaging when reading and writing JMS messages. The JMS Stores configuration is available from the Console pane: select Services > JMS > Stores under your domain.
JNDI Context Factory: com.tibco.tibjms.naming.TibjmsInitialContextFactory
Provider URL: tibjmsnaming://<host>:<port>, where host and port are the host name and port number of the Enterprise Server.
To make a connection bridge between the TIBCO Rendezvous Server and the TIBCO Enterprise Server:

1. In the file tibjmsd.conf, enable the tibrv transport configuration parameter as in the example below, so that TIBCO Enterprise Server can communicate with TIBCO Rendezvous messaging systems:

tibrv_transports = enabled

2. Enter the following transports in the transports.conf file:

[RV]
type = tibrv // type of external messaging system
topic_import_dm = TIBJMS_RELIABLE // only reliable/certified messages can transfer
daemon = tcp:localhost:7500 // default daemon for the Rendezvous server

The transports in the transports.conf configuration file specify the communication protocol between TIBCO Enterprise for JMS and the TIBCO Rendezvous system. The import and export properties on a destination can list one or more transports to use to communicate with the TIBCO Rendezvous system.

3. Optionally, specify the name of one or more transports for reliable and certified message delivery in the export property in the file topics.conf, as in the following example:

topicname export="RV"

The export property allows messages published to a topic by a JMS client to be exported to the external systems with configured transports. Currently, you can configure transports for TIBCO Rendezvous reliable and certified messaging protocols.
For example, if the fully-qualified host name is crpc23232.crp.informatica.com, use crpc23232 as the host name when importing the webMethods source definition. This step is only required for importing PowerCenter Connect for webMethods sources into the Designer.

If you are using the request/reply model in webMethods, PowerCenter needs to send an appropriate document back to the broker for every document it receives. PowerCenter populates some of the envelope fields of the webMethods target to enable the webMethods broker to recognize that the published document is a reply from PowerCenter. The envelope fields destid and tag are populated for the request/reply model: destid should be populated from the pubid of the source document, and tag should be populated from the tag of the source document. Use the option Create Default Envelope Fields when importing webMethods sources and targets into the Designer in order to make the envelope fields available in PowerCenter.
The webMethods application connection includes the following attributes: Name, Broker Host, Broker Name, Client ID, Client Group, Application Name, Automatic Reconnect, and Preserve Client State.

Enter the connection to the Broker Host in the format <hostname:port>. If you are using the request/reply method in webMethods, you must specify a client ID in the connection. Be sure that the client ID used in the request connection is the same as the client ID used in the reply connection. Note that if you are using multiple request/reply document pairs, you need to set up a different webMethods connection for each pair because they cannot share a client ID.
Description
On hardware systems that are under-utilized, you may be able to improve performance by processing partitioned data sets in parallel in multiple threads of the same session instance running on the PowerCenter Server engine. However, parallel execution may impair performance on over-utilized systems or systems with smaller I/O capacity. In addition to hardware, consider these other factors when determining whether a session is an ideal candidate for partitioning: source and target database setup, target type, mapping design, and the assumptions explained in the following paragraphs. Use the Workflow Manager client tool to implement session partitioning.
Assumptions
The following assumptions pertain to the source and target systems of a session that is a candidate for partitioning. These factors can help to maximize the benefits that can be achieved through partitioning.
- Indexing has been implemented on the partition key when using a relational source.
- Source files are located on the same physical machine as the PowerCenter Server process when partitioning flat files, COBOL, and XML sources, to reduce network overhead and delay.
- All possible constraints are dropped or disabled on relational targets.
- All possible indexes are dropped or disabled on relational targets.
- Table spaces and database partitions are properly managed on the target system.
- Target files are written to the same physical machine that hosts the PowerCenter Server process, to reduce network overhead and delay.
First, determine if you should partition your session. Parallel execution benefits systems that have the following characteristics:

Under-utilized or intermittently-used CPUs. To determine if this is the case, check the CPU usage of your machine:

- Windows 2000/2003: check the Task Manager Performance tab.
- UNIX: type vmstat 1 10 on the command line. The ID column displays the percentage of time the CPU is idling during the specified interval without any I/O wait.

If there are CPU cycles available (i.e., twenty percent or more idle time), this session's performance may be improved by adding a partition.

Sufficient I/O capacity. To check I/O usage:

- Windows 2000/2003: check the Task Manager Performance tab.
- UNIX: type iostat on the command line. The %iowait column displays the percentage of CPU time spent idling while waiting for I/O requests; the %idle column displays the total percentage of time that the CPU spends idling (i.e., the unused capacity of the CPU).

Sufficient memory. If too much memory is allocated to your session, you will receive a memory allocation error. Check to see that you're using as much memory as you can. If the session is paging, increase the memory. To determine if the session is paging:

- UNIX: type vmstat 1 10 on the command line. PI displays the number of pages swapped in from the page space during the specified interval; PO displays the number of pages swapped out to the page space. If these values indicate that paging is occurring, it may be necessary to allocate more memory, if possible.

Also check idle time and busy percentage for each thread. This gives high-level information about the bottleneck point(s). To do this, open the session log and look for messages starting with PETL_ under the RUN INFO FOR TGT LOAD ORDER GROUP section. These PETL messages give details for the reader, transformation, and writer threads.
If you determine that partitioning is practical, you can begin setting up the partition.
Partition Types
PowerCenter provides increased control of the pipeline threads. Session performance can be improved by adding partitions at various pipeline partition points. When you configure the partitioning information for a pipeline, you must specify a partition type. The partition type determines how the PowerCenter Server redistributes data across partition points. The Workflow Manager allows you to specify the following partition types:
Round-robin Partitioning
The PowerCenter Server distributes data evenly among all partitions. Use round-robin partitioning when you need to distribute rows evenly and do not need to group data among partitions. In a pipeline that reads data from file sources of different sizes, use round-robin partitioning. For example, consider a session based on a mapping that reads data from three flat files of different sizes.
- Source file 1: 100,000 rows
- Source file 2: 5,000 rows
- Source file 3: 20,000 rows
In this scenario, the recommended best practice is to set a partition point after the Source Qualifier and set the partition type to round-robin. The PowerCenter Server distributes the data so that each partition processes approximately one third of the data.
Hash Partitioning
The PowerCenter Server applies a hash function to a partition key to group data among
partitions. Use hash partitioning when you want to ensure that the PowerCenter Server processes groups of rows with the same partition key in the same partition. For example, use it when you need to sort items by item ID but do not know how many items have a particular ID number. If you select hash auto-keys, the PowerCenter Server uses all grouped or sorted ports as the partition key. If you select hash user keys, you specify the ports that form the partition key. A typical use for this type of partitioning is with Aggregators, to ensure that groups of data based on a primary key are processed in the same partition.
Pass-through Partitioning
In this type of partitioning, the PowerCenter Server passes all rows at one partition point to the next partition point without redistributing them. Use pass-through partitioning when you want to create an additional pipeline stage to improve performance but do not want to (or cannot) change the distribution of data across partitions. The Data Transformation Manager spawns a master thread on each session run, which in turn creates three threads by default (reader, transformation, and writer). Each of these threads can process at most one data set at a time, so three data sets can be processed simultaneously. If there are complex transformations in the mapping, the transformation thread may take longer than the other threads, which can slow data throughput.
It is advisable to define partition points at these transformations. This creates another pipeline stage and reduces the overhead of a single transformation thread. When you have considered all of these factors and selected a partitioning strategy, you can begin the iterative process of adding partitions. Continue adding partitions to the session until you meet the desired performance threshold or observe degradation in performance.
- Add one partition at a time. To best monitor performance, add one partition at a time, and note your session settings before adding additional partitions. Refer to the Workflow Administrator Guide for more information on restrictions on the number of partitions.
- Set DTM buffer memory. For a session with n partitions, set this value to at least n times the original value for the non-partitioned session.
- Set cached values for the Sequence Generator. For a session with n partitions, there is generally no need to use the Number of Cached Values property of the Sequence Generator. If you must set this value to a value greater than zero, make sure it is at least n times the original value for the non-partitioned session.
- Partition the source data evenly. The source data should be partitioned into equal-sized chunks for each partition.
- Partition tables. A notable increase in performance can also be realized when the actual source and target tables are partitioned. Work with the DBA to discuss the partitioning of source and target tables and the setup of tablespaces.
- Consider using an external loader. As with any session, using an external loader may increase session performance. You can only use Oracle external loaders for partitioning. Refer to the Session and Server Guide for more information on using and setting up the Oracle external loader for partitioning.
- Check write throughput. Check the session statistics to see if you have increased the write throughput.
- Check paging. Check to see if the session is now causing the system to page. When you partition a session that has cached lookups, you must make sure that DTM memory is increased to handle the lookup caches. When you partition a source that uses a static lookup cache, the PowerCenter Server creates one memory cache for each partition and one disk cache for each transformation. Thus, memory requirements grow for each partition. If the memory is not increased, the system may start paging to disk, causing
degradation in performance. When you finish partitioning, monitor the session to see if the partition is degrading or improving session performance. If session performance improves and the session meets your requirements, add another partition.
Dynamic Partitioning
Dynamic partitioning is also called parameterized partitioning because a single parameter can determine the number of partitions. With the Session on Grid option, more partitions can be added when more resources are available. The number of partitions in a session can also be tied to the number of partitions in the database, so that PowerCenter partitioning leverages, and stays in step with, database partitioning.
Description
Prior to the release of PowerCenter 5, the only variables inherent to the product were those scoped to specific transformations, plus server variables that were global in nature. Transformation variables were defined as variable ports in a transformation and could only be used in that specific transformation object (e.g., Expression, Aggregator, and Rank transformations). Similarly, global parameters defined within Server Manager would affect the subdirectories for source files, target files, log files, and so forth.

More recent versions of PowerCenter make variables and parameters available across the entire mapping rather than for a specific transformation object. In addition, they provide built-in parameters for use within Workflow Manager. Using parameter files, these values can change from session run to session run. With the addition of workflows, parameters can now be passed to every session contained in the workflow, providing more flexibility and reducing parameter file maintenance. Other important functionality added in recent releases is the ability to dynamically create parameter files that can be used in the next session in a workflow or in other workflows.
Parameter files can contain values for the following object types:

- Workflow variables
- Worklet variables
- Session parameters
- Mapping parameters and variables
When using parameters or variables in a workflow, worklet, mapping, or session, the PowerCenter Server checks the parameter file to determine the start value of the parameter or variable. Use a parameter file to initialize workflow variables, worklet variables, mapping parameters, and mapping variables. If start values are not defined for these parameters and variables, the PowerCenter Server checks for the start value of the parameter or variable in other places.

Session parameters must be defined in a parameter file. Because session parameters do not have default values, if the PowerCenter Server cannot locate the value of a session parameter in the parameter file, it fails to initialize the session.

To include parameter or variable information for more than one workflow, worklet, or session in a single parameter file, create separate sections for each object within the parameter file. You can also create multiple parameter files for a single workflow, worklet, or session and change the file that these tasks use, as necessary. To specify the parameter file that the PowerCenter Server uses with a workflow, worklet, or session, do either of the following:
- Enter the parameter file name and directory in the workflow, worklet, or session properties.
- Start the workflow, worklet, or session using pmcmd and enter the parameter file name and directory in the command line.
If a parameter file name and directory are entered both in the workflow, worklet, or session properties and in the pmcmd command line, the PowerCenter Server uses the information entered in the pmcmd command line.
Use the following section headings for each object type:

- Workflow variables: [folder name.WF:workflow name]
- Worklet variables: [folder name.WF:workflow name.WT:worklet name]
- Worklet variables in nested worklets: [folder name.WF:workflow name.WT:worklet name.WT:worklet name...]
- Session parameters, plus mapping parameters and variables: [folder name.WF:workflow name.ST:session name] or [folder name.session name] or [session name]
For example, a session in the production folder, s_MonthlyCalculations, uses a string mapping parameter, $$State, that needs to be set to MA, and a datetime mapping variable, $$Time. $$Time already has an initial value of 9/30/2000 00:00:00 saved in the repository, but this value needs to be overridden to 10/1/2000 00:00:00. The session also uses session parameters to connect to source files and target databases, as well as to write the session log to the appropriate session log file. The following table shows the parameters and variables that can be defined in the parameter file:
Parameter and Variable Type             Parameter and Variable Name   Desired Definition
String Mapping Parameter                $$State                       MA
Datetime Mapping Variable               $$Time                        10/1/2000 00:00:00
Source File (Session Parameter)         $InputFile1                   Sales.txt
Database Connection (Session Parameter) $DBConnection_Target          Sales (database connection)
Session Log File (Session Parameter)    $PMSessionLogFile             d:/session logs/firstrun.txt
The parameter file for the session includes the folder and session name, as well as each parameter and variable:
[Production.s_MonthlyCalculations]
$$State=MA
$$Time=10/1/2000 00:00:00
$InputFile1=Sales.txt
$DBConnection_Target=Sales
$PMSessionLogFile=d:/session logs/firstrun.txt
The next time the session runs, edit the parameter file to change the state to MD and delete the $$Time variable. This allows the PowerCenter Server to use the value for the variable that was set in the previous session run.
Mapping Variables
Declare mapping variables in PowerCenter Designer using the menu option Mappings -> Parameters and Variables (See the first figure, below). After selecting mapping variables, use the pop-up window to create a variable by specifying its name, data type, initial value, aggregation type, precision, and scale. This is similar to creating a port in most transformations (See the second figure, below).
Variables, by definition, are objects that can change value dynamically. PowerCenter has four functions that can change the value of a mapping variable:

- SetCountVariable
- SetMaxVariable
- SetMinVariable
- SetVariable
A mapping variable can store the last value from a session run in the repository to be used as the starting value for the next session run.
- Name. The name of the variable should be descriptive and be preceded by $$ (so that it is easily identifiable as a variable). A typical variable name is $$Procedure_Start_Date.
- Aggregation type. This entry creates specific functionality for the variable and determines how it stores data. For example, with an aggregation type of Max, the value stored in the repository at the end of each session run is the maximum value across ALL records until the value is deleted.
- Initial value. This value is used during the first session run when there is no corresponding and overriding parameter file. This value is also used if the stored repository value is deleted. If no initial value is identified, then a data-type-specific default value is used.
Variable values are not stored in the repository when the session:
- Fails to complete
- Is configured for a test load
- Runs in debug mode and discards session output
Order of Evaluation
The start value is the value of the variable at the start of the session. The start value can be a value defined in the parameter file for the variable, a value saved in the repository from the previous run of the session, a user-defined initial value for the variable, or the default value based on the variable data type. The PowerCenter Server looks for the start value in the following order:

1. Value in session parameter file
2. Value saved in the repository
3. Initial value
4. Default value
Once defined, mapping parameters and variables can be used in the Expression Editor section of the following transformations:
- Aggregator
- Expression
- Filter
- Router
- Update Strategy
Mapping parameters and variables can also be used within the Source Qualifier in the SQL query, user-defined join, and source filter sections, as well as in a SQL override in the Lookup transformation. The lookup SQL override is similar to entering a custom query in a Source Qualifier transformation. When entering a lookup SQL override, enter the entire override, or generate and edit the default SQL statement. When the Designer generates the default SQL statement for the lookup SQL override, it includes the lookup/output ports in the lookup condition and the lookup/return port.

Note: Although you can use mapping parameters and variables when entering a lookup SQL override, the Designer cannot expand them in the query override and does not validate the lookup SQL override. When running a session with a mapping parameter or variable in the lookup SQL override, the PowerCenter Server expands the mapping parameters and variables and connects to the lookup database to validate the query override. Also note that Workflow Manager does not recognize variable connection parameters (such as $DBConnection) with Lookup transformations. At this time, Lookups can use $Source, $Target, or exact database connections.
- Capitalize folder and session names as necessary. Folder and session names are case-sensitive in the parameter file.
- Enter folder names for non-unique session names. When a session name exists more than once in a repository, enter the folder name to indicate the location of the session.
- Create one or more parameter files. Assign parameter files to workflows, worklets, and sessions individually. Specify the same parameter file for all of these tasks or create several parameter files. If including parameter and variable information for more than one session in the file, create a new section for each session as follows (the folder name is optional):

[folder_name.session_name]
parameter_name=value
variable_name=value
mapplet_name.parameter_name=value

[folder2_name.session_name]
parameter_name=value
variable_name=value
mapplet_name.parameter_name=value
- Specify headings in any order. Place headings in any order in the parameter file. However, if the same parameter or variable is defined more than once in the file, the PowerCenter Server assigns the parameter or variable value using the first instance of the parameter or variable.
- Specify parameters and variables in any order. Below each heading, the parameters and variables can be specified in any order.
- When defining parameter values, do not use unnecessary line breaks or spaces. The PowerCenter Server may interpret additional spaces as part of the value.
- List all necessary mapping parameters and variables. Values entered for mapping parameters and variables become the start value for parameters and variables in a mapping. Mapping parameter and variable names are not case sensitive.
- List all session parameters. Session parameters do not have default values. An undefined session parameter can cause the session to fail. Session parameter names are not case sensitive.
- Use correct date formats for datetime values. When entering datetime values, use the following date formats:

MM/DD/RR
MM/DD/RR HH24:MI:SS
MM/DD/YYYY
MM/DD/YYYY HH24:MI:SS
- Do not enclose parameters or variables in quotes. The PowerCenter Server interprets everything after the equal sign as part of the value. Do enclose parameters in single quotes in a Source Qualifier SQL override if the parameter represents a string or date/time value to be used in the SQL override.
- Precede parameters and variables created in mapplets with the mapplet name, as follows:

mapplet_name.parameter_name=value
mapplet2_name.variable_name=value
Scenario
Company X wants to start with an initial load of all data, but wants subsequent process runs to select only new information. The source data carries an inherent post date, held in a column named Date_Entered, that can be used. The process will run once every twenty-four hours.
Sample Solution
Create a mapping with source and target objects. From the menu, create a new mapping variable named $$Post_Date with the following attributes:
- TYPE: Variable
- DATATYPE: Date/Time
- AGGREGATION TYPE: MAX
- INITIAL VALUE: 01/01/1900
Note that there is no need to encapsulate the INITIAL VALUE in quotation marks. However, if this value is used within the Source Qualifier SQL, it may be necessary to use native RDBMS functions to convert it (e.g., TO_DATE). Within the Source Qualifier transformation, use the following in the Source Filter attribute (this sample assumes Oracle as the source RDBMS):

DATE_ENTERED > TO_DATE('$$Post_Date','MM/DD/YYYY HH24:MI:SS')

Also note that the initial value 01/01/1900 will be expanded by the PowerCenter Server to 01/01/1900 00:00:00, hence the need to convert the parameter to a datetime.

The next step is to forward $$Post_Date and Date_Entered to an Expression transformation. This is where the function for setting the variable will reside. An output port named Post_Date is created with a data type of date/time. In the expression code section, place the following function:

SETMAXVARIABLE($$Post_Date,DATE_ENTERED)

The function evaluates each value for DATE_ENTERED and updates the variable with the Max value to be passed forward. For example:
DATE_ENTERED    Resultant POST_DATE
9/1/2000        9/1/2000
10/30/2001      10/30/2001
9/2/2000        10/30/2001
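The running-maximum behavior in the table above can be sketched outside PowerCenter. The snippet below mimics SETMAXVARIABLE by converting each incoming date to epoch seconds (GNU date -d) and keeping the largest value seen so far; the data values come from the table, everything else is illustrative.

```shell
# Mimic SETMAXVARIABLE: carry forward the maximum DATE_ENTERED seen so far.
# Requires GNU date (date -d) to parse MM/DD/YYYY values.
max_epoch=0
max_date=""
for d in 9/1/2000 10/30/2001 9/2/2000; do
    e=$(date -d "$d" +%s)          # convert to epoch seconds for comparison
    if [ "$e" -gt "$max_epoch" ]; then
        max_epoch=$e
        max_date=$d
    fi
    echo "$d -> $max_date"         # row value and the running maximum
done
```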
Consider the following with regard to the functionality:

1. In order for the function to assign a value, and ultimately store it in the repository, the port must be connected to a downstream object. It need not go to the target, but it must go to another Expression transformation. The reason is that the memory will not be instantiated unless it is used in a downstream transformation object.
2. In order for the function to work correctly, the rows have to be marked for insert. If the mapping is an update-only mapping (i.e., Treat Rows As is set to Update in the session properties), the function will not work. In this case, make the session Data Driven and add an Update Strategy after the transformation containing the SETMAXVARIABLE function, but before the target.
3. If the intent is to store the original Date_Entered per row, and not the evaluated date value, then add an ORDER BY clause to the Source Qualifier. This way, the dates are processed and set in order and data is preserved.
The first time this mapping is run, the SQL will select from the source where Date_Entered is > 01/01/1900 providing an initial load. As data flows through the mapping, the variable gets updated to the Max Date_Entered it encounters. Upon successful completion of the session, the variable is updated in the repository for use in the next session run. To view the current value for a particular variable associated with the session, right-click on the session in the Workflow Monitor and choose View Persistent Values. The following graphic shows that after the initial run, the Max Date_Entered was 02/03/1998. The next time this session is run, based on the variable in the Source Qualifier Filter, only sources where Date_Entered > 02/03/1998 will be processed.
- Create a generic parameter file, place it on the server, and point all sessions to that parameter file. A session may (or may not) have a variable, and the parameter file need not have variables and parameters defined for every session using the parameter file. To override a variable, either change, uncomment, or delete the variable in the parameter file.
- Run pmcmd for that session and declare the specific parameter file within the pmcmd command.
Select either the workflow or session, choose Edit, and click the Properties tab. Enter the parameter directory and name in the Parameter Filename field. Enter either a direct path or a server variable directory. Use the appropriate delimiter for the PowerCenter Server operating system.
The following graphic shows the parameter filename and location specified in the session task.
The next graphic shows the parameter filename and location specified in the Workflow.
In this example, after the initial session is run, the parameter file contents may look like:

[Test.s_Incremental]
;$$Post_Date=
By using the semicolon, the variable override is ignored and the initial value or stored value is used. If, in a subsequent run, the data processing date needs to be set to a specific date (for example, 04/21/2001), then a simple Perl script or manual change can update the parameter file to:

[Test.s_Incremental]
$$Post_Date=04/21/2001

Upon running the session, the order of evaluation looks to the parameter file first, sees a valid variable and value, and uses that value for the session run. After successful completion, run another script to reset the parameter file.
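A sketch of that override-and-reset cycle in shell (using sed rather than Perl; the file name pf.txt is illustrative):

```shell
# Start from the post-run state: the override commented out with ';'.
cat > pf.txt <<'EOF'
[Test.s_Incremental]
;$$Post_Date=
EOF

# Before a special run: activate the override with a specific date.
sed -i 's|^;\$\$Post_Date=.*|$$Post_Date=04/21/2001|' pf.txt
after_override=$(sed -n '2p' pf.txt)

# After the run completes: comment the variable back out so the stored
# repository value is used on the next normal run.
sed -i 's|^\$\$Post_Date=.*|;$$Post_Date=|' pf.txt
```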
Scenario
Company X maintains five Oracle database instances. All instances have a common table definition for sales orders, but each instance has a unique instance name, schema, and login.
DB Instance   Schema     Table        User    Password
ORC1          aardso     orders       Sam     max
ORC99         environ    orders       Help    me
HALC          hitme      order_done   Hi      Lois
UGLY          snakepit   orders       Punch   Judy
GORF          gmer       orders       Brer    Rabbit
Each sales order table has a different name, but the same definition:
ORDER_ID        NUMBER (28)    NOT NULL,
DATE_ENTERED    DATE           NOT NULL,
DATE_PROMISED   DATE           NOT NULL,
DATE_SHIPPED    DATE           NOT NULL,
EMPLOYEE_ID     NUMBER (28)    NOT NULL,
CUSTOMER_ID     NUMBER (28)    NOT NULL,
SALES_TAX_RATE  NUMBER (5,4)   NOT NULL,
STORE_ID        NUMBER (28)    NOT NULL
Sample Solution
Using Workflow Manager, create multiple relational connections. In this example, the connections are named according to the DB instance name. Using Designer, create the mapping that sources the commonly defined table. Then create a mapping parameter named $$Source_Schema_Table with the following attributes:
Note that the parameter attributes vary based on the specific environment. Also, the initial value is not required since this solution uses parameter files. Open the Source Qualifier and use the mapping parameter in the SQL Override as shown in the following graphic.
Open the Expression Editor and select Generate SQL. The generated SQL statement shows the columns.
Override the table names in the SQL statement with the mapping parameter. Using Workflow Manager, create a session based on this mapping. Within the Source Database connection drop-down box, choose the following parameter: $DBConnection_Source. Point the target to the corresponding target and finish. Now create the parameter files. In this example, there are five separate parameter files.
Parmfile1.txt
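As a sketch based on the mapping parameter and connection parameter defined above, Parmfile1.txt might contain the following (the folder name Test is an assumption; the schema, table, and connection values come from the tables above):

```
[Test.s_Incremental]
$$Source_Schema_Table=aardso.orders
$DBConnection_Source=ORC1
```

Parmfile2.txt through Parmfile5.txt would substitute environ.orders/ORC99, hitme.order_done/HALC, snakepit.orders/UGLY, and gmer.orders/GORF, respectively.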
Use pmcmd to run the five sessions in parallel. The syntax for pmcmd for starting sessions with a particular parameter file is as follows:

pmcmd startworkflow -s serveraddress:portno -u Username -p Password -paramfile parmfilename s_Incremental

You may also use "-pv pwdvariable" if the named environment variable contains the encrypted form of the actual password.
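A sketch of launching all five runs from shell; serveraddress:4001, Username, and Password are placeholders, and the pmcmd invocations are only echoed here as a dry run (remove the echo, and append & to background each call, to actually launch them in parallel):

```shell
# One workflow run per parameter file. Echoed as a dry run so the sketch
# does not require a live PowerCenter server.
launched=0
for pf in Parmfile1.txt Parmfile2.txt Parmfile3.txt Parmfile4.txt Parmfile5.txt; do
    echo pmcmd startworkflow -s serveraddress:4001 -u Username -p Password \
        -paramfile "$pf" s_Incremental
    launched=$((launched + 1))
done
echo "launched $launched sessions"
```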
Description
UDB Overview
UDB is used for a variety of purposes and with various environments. UDB servers run on Windows, OS/2, AS/400 and UNIX-based systems like AIX, Solaris, and HP-UX. UDB supports two independent types of parallelism: symmetric multi-processing (SMP) and massively parallel processing (MPP). Enterprise-Extended Edition (EEE) is the most common UDB edition used in conjunction with the Informatica product suite. UDB EEE introduces a dimension of parallelism that can be scaled to very high performance. A UDB EEE database can be partitioned across multiple machines that are connected by a network or a high-speed switch. Additional machines can be added to an EEE system as application requirements grow. The individual machines participating in an EEE installation can be either uniprocessors or symmetric multiprocessors.
Connection Setup
You must set up a remote database connection to connect to DB2 UDB via PowerCenter. This is necessary because DB2 UDB sets a very small limit on the number of attachments per user to the shared memory segments when the user is using the local (or indirect) connection/protocol. The PowerCenter server runs into this limit when it is acting as the database agent or user. This is especially apparent when the repository is installed on DB2 and the target data source is on the same DB2 database. The local protocol limit will definitely be reached when using the same connection node
for the repository via the PowerCenter Server and for the targets. This occurs when the session is executed and the server sends requests for multiple agents to be launched. Whenever the limit on the number of database agents is reached, the following error occurs:

CMN_1022 [[IBM][CLI Driver] SQL1224N A database agent could not be started to service a request, or was terminated as a result of a database system shutdown or a force command. SQLSTATE=55032]

The following recommendations may resolve this problem:
- Increase the number of connections permitted by DB2.
- Catalog the database as if it were remote. (For information on how to catalog the database with a remote node, refer to Knowledge Base article 14745 in the my.Informatica.com support knowledge base.)
- Be sure to close connections when programming exceptions occur.
- Verify that connections obtained in one method are returned to the pool via close(). (The PowerCenter Server is very likely already doing this.)
- Verify that your application does not try to access pre-empted connections (i.e., idle connections that are now used by other resources).
DB2 Timestamp
DB2 has a timestamp data type that is precise to the microsecond and uses a 26-character format:

YYYY-MM-DD-HH.MI.SS.MICROS

(where MICROS, after the last period, represents six decimal places of fractional seconds). The PowerCenter Date/Time datatype only supports precision to the second (using a 19-character format), so under normal circumstances when a timestamp source is read into PowerCenter, the six decimal places after the second are lost. This is sufficient for most data warehousing applications, but can cause significant problems where this timestamp is used as part of a key.

If the MICROS need to be retained, this can be accomplished by changing the format of the column from a timestamp data type to a character 26 in the source and target definitions. When the timestamp is read from DB2, it will be read in and converted to character in the YYYY-MM-DD-HH.MI.SS.MICROS format. Likewise,
when writing to a timestamp, pass the date as a character in the YYYY-MM-DD-HH.MI.SS.MICROS format. If this format is not retained, the records are likely to be rejected due to an invalid date format error. It is also possible to maintain the timestamp correctly using the timestamp data type itself; this is done by setting a flag at the PowerCenter Server level, as described in Knowledge Base article 10220 at my.Informatica.com.
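As a quick sketch of the character-based handling, the snippet below truncates a DB2 26-character timestamp to the 19-character second precision that the PowerCenter Date/Time datatype retains, then pads it back out to 26 characters for writing; the sample value is illustrative.

```shell
# DB2 timestamp format: YYYY-MM-DD-HH.MI.SS.MICROS (26 characters).
db2_ts="2000-09-30-14.05.59.123456"

# Reading into a Date/Time port keeps only second precision:
# the first 19 characters.
seconds_only=$(printf '%s' "$db2_ts" | cut -c1-19)

# Writing back as character data must restore the 26-character format;
# here the lost microseconds are zero-filled.
padded="${seconds_only}.000000"
echo "$seconds_only"
echo "$padded"
```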
Unsupported Datatypes
PowerMart and PowerCenter do not support the following DB2 datatypes:
- Blob
- Clob
- Dbclob
- Graphic
The DB2 EE external loader invokes the db2load executable located in the PowerCenter Server installation directory. The DB2 EE external loader can load data to a DB2 server on a machine that is remote to the PowerCenter Server.
The DB2 EEE external loader invokes the IBM DB2 Autoloader program to load data. The Autoloader program uses the db2atld executable. The DB2 EEE external loader can partition data and load the partitioned data simultaneously to the corresponding database partitions. When you use the DB2 EEE external loader, the PowerCenter Server and the DB2 EEE server must be on the same machine.
The DB2 external loaders load from a delimited flat file. Be sure that the target table columns are wide enough to store all of the data. If you configure multiple targets in the same pipeline to use DB2 external loaders, each loader must load to a different tablespace on the target database. For information on selecting external loaders, see "Configuring External Loading in a Session" in the PowerCenter User Guide.
The DB2 external loaders support the following loading modes:

- Insert. Adds loaded data to the table without changing existing table data.
- Replace. Deletes all existing data from the table, and inserts the loaded data. The table and index definitions do not change.
- Restart. Restarts a previously interrupted load operation.
- Terminate. Terminates a previously interrupted load operation and rolls back the operation to the starting point, even if consistency points were passed. The tablespaces return to normal state, and all table objects are made consistent.

To use the DB2 external loaders, the database user must have one of the following authorities:

- SYSADM authority
- DBADM authority
- LOAD authority on the database, with INSERT privilege
In addition, you must have proper read access and read/write permissions:
- The database instance owner must have read access to the external loader input files.
- If you run DB2 as a service on Windows, you must configure the service start account with a user account that has read/write permissions to use LAN resources, including drives, directories, and files.
- If you load to DB2 EEE, the database instance owner must have write access to the load dump file and the load temporary file.
Remember, the target file must be delimited when using the DB2 AutoLoader.
In UDB, a LOGBUFSZ value of 8 is too small; try setting it to 128. Also, set an INTRA_PARALLEL value of YES for CPU parallelism. The database configuration parameter DFT_DEGREE should be set to a value between ANY and 1, depending on the number of CPUs available and the number of processes that will be running simultaneously. Setting DFT_DEGREE to ANY can prove to be a CPU hog, since one process can take up all the processing power with this setting; setting it to 1 provides no parallelism at all. Note: DFT_DEGREE and INTRA_PARALLEL are applicable only for EEE databases.

Data warehouse databases perform numerous sorts, many of which can be very large. SORTHEAP memory is also used for hash joins, which a surprising number of DB2 users fail to enable. To do so, use the db2set command to set environment variable DB2_HASH_JOIN=ON. For a data warehouse database, at a minimum, double or triple the SHEAPTHRES (to between 40,000 and 60,000) and set the SORTHEAP size between 4,096 and 8,192. If real memory is available, some clients use even larger values for these configuration parameters.

SQL is very complex in a data warehouse environment and often consumes large quantities of CPU and I/O resources. Therefore, set DFT_QUERYOPT to 7 or 9.

UDB uses NUM_IO_CLEANERS for writing to TEMPSPACE, temporary intermediate tables, index creations, and more. Set NUM_IO_CLEANERS equal to the number of CPUs on the UDB server and focus on your disk layout strategy instead.

Lastly, for RAID devices where several disks appear as one to the operating system, be sure to do the following:

1. db2set DB2_STRIPED_CONTAINERS=YES (do this before creating tablespaces or before a redirected restore)
2. db2set DB2_PARALLEL_IO=* (or use TablespaceID numbers for tablespaces residing on the RAID devices, for example DB2_PARALLEL_IO=4,5,6,7,8,10,12,13)
3. Alter the tablespace PREFETCHSIZE for each tablespace residing on RAID devices such that the PREFETCHSIZE is a multiple of the EXTENTSIZE.
You may experience slow and erratic behavior resulting from the way UDB handles database locks. Out of the box, DB2 UDB database and client connections are configured on the assumption that they will be part of an OLTP system, and they place several locks on records and tables. Because PowerCenter typically works with OLAP systems, where it is the only process writing to the database and users are primarily reading from the database, this default locking behavior can have a significant impact on performance.

Connections to DB2 UDB databases are set up using the DB2 Client Configuration utility. To minimize problems with the default settings, make the following changes to all remote clients accessing the database for read-only purposes. To help replicate these settings, you can export the settings from one client and then import the resulting file into all the other clients.
- Enable Cursor Hold is the default setting for the Cursor Hold option. Edit the configuration settings and make sure the Enable Cursor Hold option is not checked.
- Connection Mode should be Shared, not Exclusive.
- Isolation Level should be Read Uncommitted (the minimum level) or Read Committed (if updates by other applications are possible and dirty reads must be avoided).
To set the isolation level to dirty read at the PowerCenter Server level, you can set a flag in the PowerCenter configuration file. For details on this process, refer to KB article 13575 in the my.Informatica.com support knowledgebase.

If you are not sure how to adjust these settings, launch the IBM DB2 Client Configuration utility, highlight the database connection you use, and select Properties. In Properties, select Settings and then select Advanced. You will see these options and their settings on the Transaction tab.

To export the settings from the main screen of the IBM DB2 Client Configuration utility, highlight the database connection you use, then select Export and All. Use the same process to import the settings on another client.

If users run hand-coded queries against the target table using DB2's Command Center, be sure they know to use script mode and avoid interactive mode (by choosing the Script tab instead of the Interactive tab when writing queries). Interactive mode can lock returned records, while script mode merely returns the results and does not hold locks on them.
If your target DB2 table is partitioned and resides across different nodes in DB2, you can use a target partition type DB Partitioning in PowerCenter session properties. When DB partitioning is selected, separate connections are opened directly to each node and the load starts in parallel. This improves performance and scalability.
Description
After you are familiar with the normal operation of PowerCenter Mapping Designer and Workflow Manager, you can use a variety of shortcuts to speed up their operation. PowerCenter provides two types of shortcuts:
- keyboard shortcuts to edit repository objects and maneuver through the Mapping Designer and Workflow Manager as efficiently as possible, and
- shortcuts that simplify the maintenance of repository objects.
To add more toolbars, select Tools | Customize. Select the Toolbar tab to add or remove toolbars.
Follow these steps to use drop-down menus without the mouse:
1. Press and hold the <Alt> key. You will see an underline under one letter of each of the menu titles.
2. Press the underlined letter for the desired drop-down menu. For example, press 'r' for the 'Repository' drop-down menu.
3. Press the underlined letter to select the command/operation you want. For example, press 't' for 'Close All Tools'.
4. Alternatively, after you have pressed the <Alt> key, use the right/left arrows to navigate across the menu bar and the up/down arrows to expand and navigate through the drop-down menu. Press Enter when the desired command is highlighted.
- To create a customized toolbar for the functions you frequently use, press <Alt><T> (expands the Tools drop-down menu) then <C> (for Customize).
- To delete customized icons, select Tools | Customize, and then remove the icons by dragging them directly off the toolbar.
- To add an icon to an existing (or new) toolbar, select Tools | Customize and navigate to the Commands tab. Find your desired command, then drag and drop the icon onto your toolbar.
- To rearrange the toolbars, click and drag the toolbar to the new location. You can insert more than one toolbar at the top of the designer tool to avoid having the buttons go off the edge of the screen. Alternatively, you can position the toolbars at the bottom, side, or between the workspace and the message windows.
- To dock or undock a window (e.g., the Repository Navigator), double-click on the window's title bar. If you are having trouble docking the window again, right-click somewhere in the white space of the runaway window (not the title bar) and make sure that the "Allow Docking" option is checked. When it is checked, drag the window to its proper place and, when an outline of where the window used to be appears, release the window.
Keyboard Shortcuts
Use the following keyboard shortcuts to perform various operations in Mapping Designer and Workflow Manager.
To:                                              Press:
Cancel editing in an object                      Esc
Check and uncheck a check box                    Space Bar
Copy text from an object onto the clipboard      Ctrl+C
Cut text from an object onto the clipboard       Ctrl+X
Edit the text of an object                       F2, then move the cursor to the desired location
To:                                                         Press:
Find all combination and list boxes
Find tables or fields in the workspace
Move around objects in a dialog box (when no objects
are selected, this pans within the workspace)
Paste copied or cut text from the clipboard into an object  Ctrl+V
Select the text of an object                                F2
To start help                                               F1
Editing Tables/Transformations
Follow these steps to move one port in a transformation:
1. Double-click the transformation and make sure you are in the Ports tab. (You go directly to the Ports tab if you double-click a port instead of the colored title bar.)
2. Highlight the port and click the up/down arrow button to reposition the port.
3. Or, highlight the port and then press <Alt><w> to move the port down or <Alt><u> to move the port up.

Note: You can hold down <Alt> and press <w> or <u> multiple times to reposition the currently highlighted port downwards or upwards, respectively.

Alternatively, you can accomplish the same thing by following these steps:
1. Highlight the port you want to move by clicking the number beside the port.
2. Grab the port by its number and continue holding down the left mouse button.
3. Drag the port to the desired location (the list of ports scrolls when you reach the end). A red line indicates the new location.
4. When the red line is pointing to the desired location, release the mouse button.

Note: You cannot move more than one port at a time with this method. See below for instructions on moving more than one port at a time.

If you are using PowerCenter version 6.x, 7.x, or 8.x and the ports you are moving are adjacent, you can follow these steps to move more than one port at a time:
1. Highlight the ports you want to move by clicking the number beside each port while holding down the <Ctrl> key.
2. Use the up/down arrow buttons to move the ports to the desired location.
- To add a new field or port, first highlight an existing field or port, then press <Alt><f> to insert the new field/port below it.
- To validate a defined default value, first highlight the port you want to validate, and then press <Alt><v>. A message box will confirm the validity of the default value.
- After creating a new port, simply begin typing the name you wish to give the port. There is no need to remove the default "NEWFIELD" text before labeling the new port. The same method applies when modifying existing port names: simply highlight the existing port by clicking on the port number and begin typing the modified name.
- To prefix a port name, press <Home> to bring the cursor to the beginning of the port name. To add a suffix to a port name, press <End> to bring the cursor to the end of the port name.
- Checkboxes can be checked (or unchecked) by highlighting the desired checkbox and pressing the space bar to toggle the checkmark on and off.
Follow either of these steps to quickly open the Expression Editor of an output or variable port:
1. Highlight the expression so that there is a box around the cell and press <F2> followed by <F3>.
2. Or, highlight the expression so that the cursor is somewhere in the expression, then press <F2>.
- To cancel an edit in the grid, press <Esc> so the changes are not saved.
- For all combo/drop-down list boxes, type the first letter on the list to select the item you want. For example, you can highlight a port's Datatype box without displaying the drop-down; to change it to 'binary', type <b>. Then use the arrow keys to go down to the next port. This is very handy if you want to change all fields to string, for example, because using the up and down arrows and typing a letter is much faster than opening the drop-down menu and making a choice each time.
- To copy a selected item in the grid, press <Ctrl><c>.
- To paste a selected item from the clipboard to the grid, press <Ctrl><v>.
- To delete a selected field or port from the grid, press <Alt><c>.
- To copy a selected row from the grid, press <Alt><o>.
- To paste a selected row from the grid, press <Alt><p>.
You can use either of the following methods to delete more than one port at a time:

- Repeatedly click the Cut button; or
- Highlight several ports and then click the Cut button. Use <Shift> to highlight many items in a row or <Ctrl> to highlight multiple non-contiguous items. Be sure to click on the number beside the port, not the port name, while you are holding <Shift> or <Ctrl>.
Editing Expressions
Follow either of these steps to expedite validation of a newly created expression:

- Click the <Validate> button or press <Alt><v>. Note: This validates the expression and leaves the Expression Editor open.
- Or, press <OK> to initiate parsing/validation of the expression. The system closes the Expression Editor if the validation is successful. If you click OK once again in the "Expression parsed successfully" pop-up, the Expression Editor remains open.
There is little need to type in the Expression Editor. The tabs list all functions, ports, and variables that are currently available. If you want an item to appear in the Formula box, just double-click on it in the appropriate list on the left. This helps to avoid typographical errors and mistakes (such as including an output-only port name in an expression formula). In version 6.x and later, if you change a port name, PowerCenter automatically updates any expression that uses that port with the new name. Be careful about changing data types. Any expression using the port with the new data type may remain valid, but not perform as expected. If the change invalidates the expression, it will be detected when the object is saved or if the Expression Editor is active for that expression. The following table summarizes additional shortcut keys that are applicable only when working with Mapping Designer:
To:                                    Press:
Add a new field or port                Alt+F
Copy a row                             Alt+O
Cut a row                              Alt+C
Move current row down                  Alt+W
Move current row up                    Alt+U
Paste a row                            Alt+P
Validate the default value in a transformation         Alt+V
Open the Expression Editor from the expression field   F2, then F3
To start the debugger
4. A dialog box appears to confirm that you want to create a shortcut. If you want to copy an object from a shared folder instead of creating a shortcut, hold down the <Ctrl> key before dropping the object into the workspace.
To:                                              Press:
Link tasks in the workspace                      Press Ctrl+F2 to select the first task you want to link; press Tab to select the rest of the tasks you want to link; press Ctrl+F2 again to link all the tasks you selected
Edit a task name in the workspace
Expand a selected node and all its children
Move across to select tasks in the workspace
Select multiple tasks
Description
The Java Transformation (JTX) introduced in PowerCenter 8.0 provides a uniform means of entering and maintaining program code written in Java that is executed for every record processed during a session run. The Java code is maintained, entered, and viewed within the PowerCenter Designer tool. Below is a summary of some typical questions about the JTX.
- Static initialization blocks can be defined on the Helper Code tab.
- Import statements can be listed on the Import Packages tab.
- Static variables of the Java class as a whole (e.g., counters for instances of this class), as well as non-static member variables (one for every single instance), can be defined on the Helper Code tab.
- Auxiliary member functions or static functions may be declared and defined on the Helper Code tab.
- Static final variables may be defined on the Helper Code tab. However, they are private by nature; no object of any other Java class will be able to utilize them.
Important Note: Before trying to start a session utilizing additional import clauses in the Java code, make sure that the environment variable CLASSPATH contains the necessary .jar files or directories before the PowerCenter Integration Service has been started. All non-static member variables declared on the tab Helper Code are automatically available to every partition of a partitioned session without any precautions. In other words, one object of the respective Java class that is generated by PowerCenter will be instantiated for every single instance of the JTX and for every session partition. For example, if you utilize two instances of the same reusable JTX and have set the session to run with three partitions, then six individual objects of that Java class will be instantiated for this session run.
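The instantiation behavior described above can be illustrated with a standalone sketch. The class and variable names below are hypothetical, and in a real JTX the declarations would go on the Helper Code tab rather than into a hand-written class; this is only a stand-in for the class the Designer generates:

```java
public class HelperCodeSketch {
    // A static variable is shared by all objects of the generated class,
    // i.e., by every JTX instance in every session partition.
    static int objectsInstantiated = 0;

    // A non-static member variable exists once per object, i.e., once per
    // JTX instance per session partition.
    int rowsSeenByThisObject = 0;

    HelperCodeSketch() {
        objectsInstantiated++;
    }

    // Stand-in for the On Input Row code of the JTX.
    void onInputRow() {
        rowsSeenByThisObject++;
    }

    public static void main(String[] args) {
        // Two reusable JTX instances x three session partitions = six objects.
        HelperCodeSketch[] objects = new HelperCodeSketch[6];
        for (int i = 0; i < objects.length; i++) {
            objects[i] = new HelperCodeSketch();
        }
        objects[0].onInputRow();
        objects[0].onInputRow();
        System.out.println(objectsInstantiated);             // 6
        System.out.println(objects[0].rowsSeenByThisObject); // 2
        System.out.println(objects[1].rowsSeenByThisObject); // 0
    }
}
```

The static counter sees all six objects, while each per-object counter only sees the rows routed to that JTX instance in that partition.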
The following are not available in a JTX:

- Standard and user-defined constructors
- Standard and user-defined destructors
- Any kind of direct user interface, be it a Swing GUI or a console-based user interface
As a general rule of thumb, a passive JTX will usually execute faster than an active JTX.
- If one input record equals one output record of the JTX, you will probably want to use a passive JTX.
- If you have to produce a varying number of output records per input record (i.e., for some input values the JTX will generate one output record, for some values no output records, and for some values two or even more output records), you will have to utilize an active JTX. There is no other choice.
- If you have to accumulate one or more input records before generating one or more output records, you will have to utilize an active JTX. There is no other choice.
- If you have to do some initialization work before processing the first input record, this fact in no way determines whether to utilize an active or a passive JTX.
- If you have to do some cleanup work after having processed the last input record, this fact in no way determines whether to utilize an active or a passive JTX.
- If you have to generate one or more output records after the last input record has been processed, you have to use an active JTX. There is no other choice, except changing the mapping to produce these additional records by other means.
Follow these steps to create a JTX:

1. Click the button showing the Java icon, then click on the background in the main window of the Mapping Designer.
2. Choose whether to generate a passive or an active JTX (see How do I choose between an active and a passive JTX above). Remember, you cannot change this setting later.
3. Rename the JTX accordingly (e.g., rename it to JTX_SplitString).
4. Go to the Ports tab; define all input-only ports in the Input Group, and define all output-only and input-output ports in the Output Group. Make sure that every output-only and every input-output port is defined correctly, and that you define the port structure correctly from the onset, as changing data types of ports after the JTX has been saved to the repository will not always work. Click Apply.
5. On the Properties tab you may want to change certain properties. For example, the setting "Is Partitionable" is mandatory if the session will be partitioned. Follow the hints in the lower part of the screen form that explain the selection lists in detail.
6. Activate the Java Code tab. Enter code pieces where necessary. Be aware that all ports marked as input-output ports on the Ports tab are automatically processed as pass-through ports by the Integration Service. You do not have to (and should not) enter any code referring to pass-through ports. See the Notes section below for more details.
7. Click the Compile link near the lower right corner of the screen form to compile the Java code you have entered.
8. Check the output window at the lower border of the screen form for all compilation errors and work through each error message encountered; then click Compile again. Repeat this step as often as necessary until the Java code compiles without any error messages.
9. Click OK.
10. Connect only ports of the same data type to every input-only or input-output port of the JTX, and connect output-only and input-output ports of the JTX only to ports of the same data type in transformations downstream. If any downstream transformation expects a different data type than the type of the respective output port of the JTX, insert an EXP to convert data types. Refer to the Notes below for more detail.
11. Save the mapping.
Notes:
- The primitive Java data types available in a JTX that can be used for ports connecting to other transformations are Integer, Double, and Date/Time. Date/time values are delivered to or by a JTX by means of a Java long value indicating the difference of the respective date/time value from midnight, Jan 1st, 1970 (the so-called Epoch) in milliseconds; to interpret this value, utilize the appropriate methods of the Java class GregorianCalendar. Smallint values cannot be delivered to or by a JTX.
- The Java object data types available in a JTX that can be used for ports are String, byte arrays (for Binary ports), and BigDecimal (for Decimal values of arbitrary precision).
- In a JTX you check whether an input port has a NULL value by calling the function isNull("name_of_input_port"). If an input value is NULL, you should explicitly set all depending output ports to NULL by calling setNull("name_of_output_port"). Both functions take the name of the respective input/output port as a string.
- You retrieve the value of an input port (provided this port is not NULL, see the previous paragraph) simply by referring to the name of this port in your Java source code. For example, if you have two input ports i_1 and i_2 of type Integer and one output port o_1 of type String, you might set the output value with a statement like this one:

  o_1 = "First value = " + i_1 + ", second value = " + i_2;
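A hedged sketch of how the null checks and the assignment combine on the On Input Row tab. In a real JTX the port variables and the isNull()/setNull() functions are supplied by the generated class; this standalone version fakes them with maps purely so that the logic can be run and inspected:

```java
import java.util.HashMap;
import java.util.Map;

public class OnInputRowSketch {
    // Stand-in for the framework: in a real JTX the ports i_1, i_2, o_1 and
    // the isNull()/setNull() functions are provided by the generated class.
    static Map<String, Object> inPorts = new HashMap<>();
    static Map<String, Object> outPorts = new HashMap<>();

    static boolean isNull(String port) { return inPorts.get(port) == null; }
    static void setNull(String port) { outPorts.put(port, null); }

    // The logic as it might appear on the On Input Row tab:
    static void onInputRow() {
        if (isNull("i_1") || isNull("i_2")) {
            setNull("o_1");  // propagate NULL explicitly to the output port
        } else {
            outPorts.put("o_1",
                "First value = " + inPorts.get("i_1")
                + ", second value = " + inPorts.get("i_2"));
        }
    }

    public static void main(String[] args) {
        inPorts.put("i_1", 1);
        inPorts.put("i_2", 2);
        onInputRow();
        System.out.println(outPorts.get("o_1"));
        // First value = 1, second value = 2
    }
}
```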
- In contrast to a Custom Transformation, it is not possible to retrieve the names, data types, and/or values of pass-through ports unless these pass-through ports have been defined on the Ports tab in advance. In other words, it is impossible for a JTX to adapt to its port structure at runtime (which would be necessary, for example, for something like a Sorter JTX).
- If you have to transfer 64-bit values into a JTX, deliver them to the JTX by means of a string representing the 64-bit number and convert this string into a Java long variable using the static method Long.parseLong(). Likewise, to deliver a 64-bit integer from a JTX to downstream transformations, convert the long variable to a string which will be an output port of the JTX (e.g., using the statement o_Int64 = "" + myLongVariable).
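The round trip described above can be sketched as follows (the method and port names are hypothetical; in a real JTX the parse and format calls would sit in the On Input Row code):

```java
public class Int64PortSketch {
    // Inbound: an upstream transformation delivers the 64-bit number as a
    // string port; parse it into a Java long for processing.
    static long fromStringPort(String s_Int64) {
        return Long.parseLong(s_Int64);
    }

    // Outbound: convert the long back to a string to feed a string output
    // port of the JTX (same idea as o_Int64 = "" + myLongVariable).
    static String toStringPort(long myLongVariable) {
        return "" + myLongVariable;
    }

    public static void main(String[] args) {
        long value = fromStringPort("9223372036854775807"); // Long.MAX_VALUE
        System.out.println(toStringPort(value));
    }
}
```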
- As of version 8.1.1, the PowerCenter Designer is very sensitive regarding data types of ports connected to a JTX. Supplying a JTX with other than exactly the expected data types, or connecting output ports to other transformations expecting other data types (e.g., a string instead of an integer), may cause the Designer to invalidate the mapping such that the only remedy is to delete the JTX, save the mapping, and re-create the JTX.
- Initialization Properties and Metadata Extensions can be neither defined nor retrieved in a JTX.
- The code entered on the Java Code sub-tab On Input Row is inserted into other code; only this complete code constitutes the method execute() of the resulting Java class associated with the JTX (see the output of the View Code link near the lower-right corner of the Java Code screen form). The same holds true for the code entered on the On End Of Data and On Receiving Transactions tabs with regard to their respective methods. This fact has a couple of implications, which are explained in more detail below.
- If you connect input and/or output ports to transformations with differing data types, you might get error messages during mapping validation. One error message that occurs quite often indicates that the byte code of the class cannot be retrieved from the repository. In this case, rectify the port connections to all input and/or output ports of the JTX, edit the Java code (inserting one blank comment line usually suffices), and recompile the Java code.
- A JTX does not currently allow true pass-through ports; they have to be simulated by splitting each one into one input port and one output port, and assigning the value of each input port to the respective output port. The key here is that the input port of every pair has to be in the Input Group while the respective output port has to be in the Output Group. If you do not do this, there is no warning in the Designer, but the JTX will not function correctly.
Where and how to insert what pieces of Java code into a JTX?
A JTX always contains a code skeleton that is generated by the Designer. Every piece of code written by a mapping designer is inserted into this skeleton at designated places. Because these code pieces do not constitute the sole content of the respective functions, there are certain rules and recommendations on how to write such code. As mentioned previously, a mapping designer can neither write his or her own constructor nor insert any code into the default constructor or the default destructor generated by the Designer. Initialization and cleanup work can instead be done in either of the following two ways:

- by inserting code that in a standalone class would be part of the destructor into the On End Of Data tab;
- by inserting code that in a standalone class would be part of the constructor into the On Input Row tab.

The last case (constructor code being part of the On Input Row code) requires a little trick: constructor code is supposed to be executed only once, namely before the first method is called. To resemble this behavior, follow these steps:

1. On the Helper Code tab, define a boolean variable (e.g., constructorMissing) and initialize it to true.
2. At the beginning of the On Input Row code, insert code that looks like the following:

   if (constructorMissing)
   {
       // do whatever the constructor should have done
       constructorMissing = false;
   }
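Sketched as a standalone class (with onInputRow() standing in for the On Input Row tab and the member declarations standing in for the Helper Code tab), the trick behaves like this:

```java
public class LazyInitSketch {
    // Helper Code tab: the guard variable plus whatever state the
    // "constructor" is supposed to set up (names are hypothetical).
    boolean constructorMissing = true;
    int timesInitialized = 0;
    int rowsProcessed = 0;

    // On Input Row tab:
    void onInputRow() {
        if (constructorMissing) {
            // do whatever the constructor should have done
            timesInitialized++;
            constructorMissing = false;
        }
        rowsProcessed++;
    }

    public static void main(String[] args) {
        LazyInitSketch jtx = new LazyInitSketch();
        for (int row = 0; row < 5; row++) {
            jtx.onInputRow();
        }
        // The init block ran exactly once, before the first row.
        System.out.println(jtx.timesInitialized + " " + jtx.rowsProcessed); // 1 5
    }
}
```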
This ensures that this piece of code is executed only once, directly before the very first input row is processed.

The code pieces on the On Input Row, On End Of Data, and On Receiving Transaction tabs are embedded in other code. There is code that runs before the code entered here executes, and there is more code to follow; for example, exceptions raised within code written by a developer will be caught there. As a mapping developer you cannot change this order, so you need to be aware of the following important implication.

Suppose you are writing a Java class that performs some checks on an input record and, if the checks fail, issues an error message and then skips processing to the next record. Such a piece of code might look like this:

if (firstCheckPerformed( inputRecord) && secondCheckPerformed( inputRecord))
{
    logMessage( "ERROR: one of the two checks failed!");
    return;
}
// else
insertIntoTarget( inputRecord);
countOfSucceededRows ++;

This code will not compile in a JTX because it would lead to unreachable code. Why? Because the return at the end of the if statement would allow the respective function (in this case, the method named execute()) to skip the subsequent code that is part of the framework created by the Designer. In order to make this code work in a JTX, change it to look like this:

if (firstCheckPerformed( inputRecord) && secondCheckPerformed( inputRecord))
{
    logMessage( "ERROR: one of the two checks failed!");
}
else
{
    insertIntoTarget( inputRecord);
    countOfSucceededRows ++;
}

The same principle (never use return in these code pieces) applies to all three tabs: On Input Row, On End Of Data, and On Receiving Transaction. Another important point is that the code entered on these tabs is embedded in a try-catch block, so never include any try-catch code of your own there.
The reasons a JTX runs more slowly than native transformations include:

- the additional process switches between the PowerCenter Integration Service and the Java Virtual Machine (JVM), which executes as another operating-system process;
- Java being compiled not to machine code but to portable byte code that is interpreted by the JVM (although this has been largely remedied in recent years by the introduction of Just-In-Time compilers);
- the inherent complexity of the genuine object model in Java (except for most kinds of numbers and characters, everything in Java is an object that occupies space and execution time).
So it is obvious that a JTX cannot perform as fast as, for example, a carefully written Custom Transformation. As a rule of thumb, a simple JTX requires approximately 50% more total running time than an EXP of comparable functionality. Java code that utilizes several of the fairly complex standard classes can be assumed to need even more total runtime compared to an EXP performing the same tasks.
- The Designer is very sensitive in regard to the data types of ports that are connected to the ports of a JTX. However, most of the trouble arising from this sensitivity can be remedied rather easily by simply recompiling the Java code.
- Working with long values representing days and times within, for example, the GregorianCalendar can be extremely difficult and demanding in terms of runtime resources (memory, execution time). Date/time ports in PowerCenter are far easier to use, so it is advisable to split date/time ports into their individual components, such as year, month, and day, and to process these singular attributes within a JTX if needed.
- In general, a JTX can reduce performance simply by the nature of the architecture. Only use a JTX when necessary.
- A JTX always has exactly one input group and one output group. For example, it is impossible to write a Joiner as a JTX.

Significant advantages to using a JTX are:
- Java knowledge and experience are generally easier to find than comparable skills in other languages.
- Writing a simple JTX that calculates the calendar week and calendar year for a given date takes approximately 10-20 minutes, whereas writing Custom Transformations (even for easy tasks) can take several hours.
- Not every data integration environment has access to a C compiler for compiling Custom Transformations written in C. Because PowerCenter is installed with its own JDK, this problem does not arise with a JTX.
In Summary
- If you need a transformation that adapts its processing behavior to its ports at runtime, a JTX is not the way to go. In such a case, write a Custom Transformation in C, C++, or Java to perform the necessary tasks. The CT API is considerably more complex than the JTX API, but it is also far more flexible.
- Use a JTX for development whenever a task cannot be easily completed using other standard options in PowerCenter (as long as performance requirements do not dictate otherwise).
- If performance measurements are slightly below expectations, try optimizing the Java code and the remainder of the mapping in order to increase processing speed.
Last updated: 01-Feb-07 18:53
This Best Practice describes how each of these steps can be facilitated within the PowerCenter environment.
Description
A typical error handling process leverages the best-of-breed error management technology available in PowerCenter, such as:

- Relational database error logging
- Email notification of workflow failures
- Session error thresholds
- The reporting capabilities of PowerCenter Data Analyzer
- Data profiling

These capabilities can be integrated to facilitate error identification, retrieval, and correction as described in the flow chart below:
Error Identification
The first step in the error handling process is error identification. Error identification is often achieved through the use of the ERROR() function within mappings, the enablement of relational error logging in PowerCenter, and referential integrity constraints at the database. Enabling the relational error logging functionality automatically writes row-level data to a set of four error handling tables (PMERR_MSG, PMERR_DATA, PMERR_TRANS, and PMERR_SESS). These tables can be centralized in the PowerCenter repository and store information such as error messages, error data, and source row data. Row-level errors trapped in this manner include any database errors (e.g., referential integrity failures), transformation errors, and business rule exceptions for which the ERROR() function was called within the mapping.
Error Retrieval
The second step in the error handling process is error retrieval. After errors have been captured in the PowerCenter repository, it is important to make their retrieval simple and automated so that the process is as efficient as possible. Data Analyzer can be customized to create error retrieval reports from the information stored in the PowerCenter repository. A typical error report prompts a user for the folder and workflow name, and returns a report with information such as the session, error message, and data that caused the error. In this way, the error is successfully captured in the repository and can be easily retrieved through a Data Analyzer report, or through an email alert that notifies a user when a certain threshold is crossed (such as the number of errors being greater than zero).
Error Correction
The final step in the error handling process is error correction. As PowerCenter automates the process of error identification, and Data Analyzer can be used to simplify error retrieval, error correction is straightforward. After retrieving an error through Data Analyzer, the error report (which contains information such as workflow name, session name, error date, error message, error data, and source row data) can be exported to various file formats, including Microsoft Excel, Adobe PDF, CSV, and others. Upon retrieval of an error, the error report can be extracted into a supported format and emailed to a developer or DBA to resolve the issue, or it can be entered into a defect management tracking tool. The Data Analyzer interface supports emailing a report directly through the web-based interface to make the process even easier. For further automation, a report broadcasting rule that emails the error report to a developer's inbox can be set up to run on a pre-defined schedule.

After the developer or DBA identifies the condition that caused the error, a fix for the error can be implemented. The exact method of data correction depends on various factors, such as the number of records with errors, data availability requirements per SLA, the level of data criticality to the business unit(s), and the type of error that occurred. Considerations made during error correction include:

- The owner of the data should always fix the data errors. For example, if the source data comes from an external system, the errors should be sent back to the source system to be fixed.
- In some situations, simply re-executing the session will reprocess the data.
- Partial data that has been loaded into the target systems may need to be backed out in order to avoid duplicate processing of rows.
- Lastly, errors can also be corrected through a manual SQL load of the data.
If the volume of errors is low, the rejected data can be exported from the Data Analyzer error reports to Microsoft Excel or CSV format and corrected in a spreadsheet. The corrected data can then be inserted manually into the target table using a SQL statement. Any approach to correcting erroneous data should be precisely documented and followed as a standard.
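As a rough illustration of this manual correction path, the sketch below loads a corrected CSV export back into a target table through parameterized SQL. The table and column names are hypothetical, and SQLite stands in for the actual target RDBMS; in practice the statement would be run against the warehouse database under change control.

```python
import csv
import io
import sqlite3

def load_corrected_rows(conn, table, csv_text):
    """Insert manually corrected rows (exported from an error report and
    fixed in a spreadsheet, then saved as CSV) into the target table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    cols = reader.fieldnames
    sql = "INSERT INTO %s (%s) VALUES (%s)" % (
        table, ", ".join(cols), ", ".join("?" for _ in cols))
    rows = [tuple(r[c] for c in cols) for r in reader]
    conn.executemany(sql, rows)   # parameterized values avoid quoting problems
    conn.commit()
    return len(rows)
```

Logging which rows were loaded this way (and by whom) keeps the manual fix auditable, in line with the documented-standard requirement above.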
INFORMATICA CONFIDENTIAL
BEST PRACTICE
330 of 702
If data errors occur frequently, reprocessing can be automated by designing a special mapping or session to correct the errors and load the corrected data into the ODS or staging area.
Error handling underpins the data integration system from end to end. Each of the load components performs validation checks, the results of which must be reported to the operational team. These components are not just PowerCenter processes such as business rule and field validation, but cover the entire data integration architecture, for example:

- Process Validation. Are all the resources in place for processing to begin (e.g., connectivity to source systems)?
- Source File Validation. Is the source file datestamp later than the previous load?
- File Check. Does the number of rows successfully loaded match the source rows read?
These checks cover both high-level concerns (i.e., related to the process or a load as a whole) and low-level concerns (i.e., field- or column-related errors).
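The high-level checks listed above might be sketched as a pre-load routine like the following. The check names, the assumption of a header row, and the minimum-row threshold are all illustrative; a real implementation would read its expectations from load-management metadata rather than hard-coded parameters.

```python
import datetime
import os

def preload_checks(path, last_load_time, expected_min_rows=1):
    """Run high-level validation checks before a load begins.
    Returns a list of (check, message) tuples; an empty list means OK."""
    if not os.path.exists(path):                      # Process Validation
        return [("process", "source file not found: %s" % path)]
    errors = []
    mtime = datetime.datetime.fromtimestamp(os.path.getmtime(path))
    if mtime <= last_load_time:                       # Source File Validation
        errors.append(("source_file", "datestamp not later than previous load"))
    with open(path) as f:                             # File Check (row count)
        n_rows = sum(1 for _ in f) - 1                # assumes one header row
    if n_rows < expected_min_rows:
        errors.append(("file_check", "only %d data rows present" % n_rows))
    return errors
```

A non-empty result would abort the load and be written to the error-management tables discussed later in this chapter.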
Description
In an ideal world, when analysis is complete, you have a precise definition of source and target data; you can be sure that every source element is populated correctly, with meaningful values, never missing a value, and fulfilling all relational constraints. At the same time, source data sets always have a fixed structure, are always available on time (and in the correct order), and are never corrupted during transfer to the data warehouse. In addition, the OS and RDBMS never run out of resources or have permissions and privileges change.

Realistically, however, operational applications are rarely able to cope with every possible business scenario or combination of events; operational systems crash, networks fall over, and users may not use the transactional systems in quite the way they were designed. Operational systems also typically need some flexibility to allow non-fixed data to be stored (typically as free-text comments). In every case, there is a risk that the source data does not match what the data warehouse expects.

Because the credibility of the warehouse depends on its accuracy, in-error data must not be propagated to the metrics and measures used by business managers. If erroneous data does reach the warehouse, it must be identified and removed immediately (before the current version of the warehouse is published). Preferably, error data should
be identified during the load process and prevented from reaching the warehouse at all. Ideally, erroneous source data should be identified before a load even begins, so that no resources are wasted trying to load it.

As a principle, data errors should be corrected at the source. As soon as any attempt is made to correct errors within the warehouse, there is a risk that the lineage and provenance of the data will be lost. From that point on, it becomes impossible to guarantee that a metric or data item came from a specific source via a specific chain of processes. As a by-product, adopting this principle also helps to tie both the end users and those responsible for the source data into the warehouse process; source data staff understand that their professionalism directly affects the quality of the reports, and end users become owners of their data.

As a final consideration, error management (the implementation of an error handling strategy) complements and overlaps load management, data quality and key management, and operational processes and procedures. Load management processes record at a high level whether a load was unsuccessful; error management records the details of why the failure occurred. Quality management defines the criteria whereby data can be identified as in error; error management identifies the specific error(s), thereby allowing the source data to be corrected. Operational reporting shows a picture of loads over time, and error management allows analysis to identify systematic errors, perhaps indicating a failure in operational procedure. Error management must therefore be tightly integrated within the data warehouse load process. This is shown in the high level flow chart below:
Process Dependency checks in load management can identify when a source data set is missing, duplicates a previous version, or has been presented out of sequence, and when the previous load failed but has not yet been corrected. Load management prevents this source data from being loaded. At the same time, error management processes should record the details of the failed load, noting the source instance, the load affected, and when and why the load was aborted.

Source file structures can be compared to expected structures stored as metadata, either from header information or by attempting to read the first data row. Source table structures can be compared to expectations; typically this is done by interrogating the RDBMS catalogue directly (and comparing to the expected structure held in metadata), or by simply running a describe command against the table (again comparing to a pre-stored version in metadata).

Control file totals (for file sources) and row counts (for table sources) are also used to determine whether files have been corrupted or truncated during transfer, or whether tables contain no new data (suggesting a fault in an operational application). In every case, information should be recorded to identify where and when an error occurred, what sort of error it was, and any other relevant process-level details.
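The structure comparison described above can be sketched as follows. The delimiter and the idea of holding expected column names in metadata are assumptions for illustration; in a real deployment the expected structure would come from the repository metadata, not a hard-coded list.

```python
def check_structure(header_line, expected_cols, delimiter="|"):
    """Compare a source file's header row to the expected structure held
    in metadata. Returns [] when the structures match, otherwise a list
    of (check, message) tuples describing the discrepancy."""
    actual = [c.strip().upper() for c in header_line.split(delimiter)]
    expected = [c.upper() for c in expected_cols]
    if actual == expected:
        return []
    missing = [c for c in expected if c not in actual]
    extra = [c for c in actual if c not in expected]
    return [("structure", "missing=%s extra=%s" % (missing, extra))]
```

The same comparison can be made for table sources by substituting the output of a describe command for the header line.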
Low-Level Issues
Assuming that the load is to be processed normally (i.e., that the high-level checks have not caused the load to abort), further error management processes need to be applied to the individual source rows and fields.
Individual source fields can be compared to expected data-types against standard metadata within the repository, or against additional information added by the development team. In some instances, this is enough to abort the rest of the load; if the field structure is incorrect, it is much more likely that the source data set as a whole either cannot be processed at all or (more worryingly) will be processed unpredictably.

Data conversion errors can be identified on a field-by-field basis within the body of a mapping. Built-in error handling can be used to spot failed date conversions, conversions of strings to numbers, or missing required data. In rare cases, stored procedures can be called if a specific conversion fails; however, this cannot be generally recommended because of the potentially crushing impact on performance if a particularly error-filled load occurs.

Business rule breaches can then be picked up. It is possible to define allowable values, or acceptable value ranges, within PowerCenter mappings (if the rules are few, and it is clear from the mapping metadata that the business rules are included in the mapping itself). A more flexible approach is to use external tables to codify the business rules; in this way, only the rules tables need to be amended when a new business rule must be applied. Informatica has suggested methods to implement such a process.

Missing Key/Unknown Key issues are covered in the best practice document Key Management in Data Warehousing Solutions, with suggested techniques for identifying and handling them. From an error handling perspective, however, such errors must still be identified and recorded, even when key management techniques do not formally fail source rows with key errors. Unless a record is kept of the frequency with which particular source data fails, it is difficult to realize when there is a systematic problem in the source systems.

Inter-row errors may also have to be considered.
These may occur when a business process expects a certain hierarchy of events (e.g., a customer query, followed by a booking request,
followed by a confirmation, followed by a payment). If the events arrive from the source system in the wrong order, or where key events are missing, it may indicate a major problem with the source system, or the way in which the source system is being used.
An important principle to follow is to identify all of the errors on a particular row before halting processing, rather than rejecting the row at the first error. This seems to break the rule of not wasting resources trying to load a source data set we already know is in error; however, since the row must be corrected at source and then reprocessed, it is sensible to identify all the corrections that need to be made before reloading, rather than fixing the first error, re-running, and then identifying a second error (which halts the load a second time).
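The accumulate-all-errors principle can be sketched as a row validator that keeps checking fields after the first failure. The field names, rule shapes, and date format are hypothetical; the point is that every rule is evaluated and every failure recorded before the row is rejected.

```python
import datetime

def validate_row(row, rules):
    """Validate one source row against externally held rules, collecting
    every error on the row rather than stopping at the first failure."""
    errors = []
    for field, rule in rules.items():
        value = row.get(field)
        if rule.get("required") and value in (None, ""):
            errors.append((field, "missing required value"))
            continue
        if rule.get("type") == "date" and value:
            try:
                datetime.datetime.strptime(value, "%Y-%m-%d")
            except ValueError:
                errors.append((field, "date conversion failed"))
        allowed = rule.get("allowed")   # business rule from an external table
        if allowed and value not in allowed:
            errors.append((field, "value not in allowed set"))
    return errors
```

Because the rules live in a data structure rather than in the mapping logic, adding a new business rule means amending the rules table only, as recommended above.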
In addition, such automatic correction of data might hide the fact that one or another of the source systems had a generic fault or, more importantly, had acquired a fault because of ongoing development of the transactional applications or a failure in user training. The principle to apply here is to identify the errors in the load, and then alert the source system users that the data should be corrected in the source system itself, ready for the next load to pick up the right data. This maintains the data lineage, allows source system errors to be identified and ameliorated in good time, and permits extra training needs to be identified and managed.
The example defines three main sets of information. The ERROR_DEFINITION table stores descriptions for the various types of errors, including:

- process-level (e.g., incorrect source file, load started out-of-sequence);
- row-level (e.g., missing foreign key, incorrect data-type, conversion errors); and
- reconciliation (e.g., incorrect row numbers, incorrect file total, etc.).
The ERROR_HEADER table provides a high-level view of the process, allowing quick identification of the frequency of errors for particular loads and of the distribution of error types. It is linked to the load management processes via the SRC_INST_ID and PROC_INST_ID, from which other process-level information can be gathered.

The ERROR_DETAIL table stores information about the actual rows in error, including how to identify the specific row that was in error (using the source natural keys and row number) together with a string of field identifier/value pairs concatenated together. It is not expected that this
INFORMATICA CONFIDENTIAL
BEST PRACTICE
339 of 702
information will be deconstructed as part of an automatic correction load, but if necessary this can be pivoted (e.g., using simple UNIX scripts) to separate out the field/value pairs for subsequent reporting.
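The pivoting of the concatenated field/value string can be sketched as below. The document suggests simple UNIX scripts for this; a Python equivalent is shown here, and the pair and key/value separators are assumptions about how ERROR_DETAIL stores the string.

```python
def pivot_error_detail(detail, pair_sep=";", kv_sep="="):
    """Split an ERROR_DETAIL field/value string such as
    'CUST_ID=123;ORDER_DATE=2020-13-01' into (field, value) rows
    suitable for subsequent reporting."""
    rows = []
    for pair in detail.split(pair_sep):
        if kv_sep in pair:
            field, value = pair.split(kv_sep, 1)  # split on first separator only
            rows.append((field.strip(), value.strip()))
    return rows
```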
Description
Regardless of target data structure, your loading process must validate that the data conforms to known rules of the business. When the source system data does not meet these rules, the process needs to handle the exceptions in an appropriate manner. The business needs to be aware of the consequences of either permitting invalid data to enter the target or rejecting it until it is fixed. Both approaches present complex issues. The business must decide what is acceptable and prioritize two conflicting goals:
- The need for accurate information.
- The ability to analyze or process the most complete information available, with the understanding that errors can exist.
Reject All. This is the simplest approach to implement, since all errors are rejected from entering the target when they are detected. This provides a very reliable target that users can count on as being correct, although it may not be complete. Both dimensional and factual data can be rejected when any errors are encountered, and reports indicate what the errors are and how they affect the completeness of the data.

Dimensional or master data errors can cause valid factual data to be rejected because a foreign key relationship cannot be created. These errors need to be fixed in the source systems and reloaded on a subsequent load. Once the corrected rows have been loaded, the factual data is reprocessed and loaded, assuming that all errors have been fixed. This delay may cause some user dissatisfaction, since users need to take into account that the data they are looking at may not be a complete picture of the operational systems until the errors are fixed. For an operational system, this delay may affect downstream transactions.

The development effort required for a Reject All scenario is minimal, since the rejected data can be processed through existing mappings once it has been fixed. Little additional code needs to be written, because data enters the target only if it is correct, and it is then loaded into the data mart using the normal process.
Reject None. This approach gives users a complete picture of the available data, without having to consider data that was rejected during the load process. The problem is that the data may not be complete or accurate: any of the target data structures may contain incorrect information that can lead to incorrect decisions or faulty transactions. With Reject None, the complete set of data is loaded, but the data may not support correct transactions or
aggregations. Factual data can be allocated to dummy or incorrect dimension rows, resulting in grand-total numbers that are correct but detail numbers that are not. After the data is fixed, reports may change, with detail information being redistributed along different hierarchies. The development effort to fix this scenario is significant. After the errors are corrected, a new loading process needs to correct all of the target data structures, which can be a time-consuming effort depending on the delay between an error being detected and fixed. The development strategy may include removing information from the target, restoring backup tapes for each night's load, and reprocessing the data. Once the target is fixed, these changes need to be propagated to all downstream data structures or data marts.
Reject Critical. This method provides a balance between missing information and incorrect information. It involves examining each row of data and determining which particular data elements to reject. All changes that are valid are processed into the target to allow for the most complete picture, while rejected elements are reported as errors so that they can be fixed in the source systems and loaded on a subsequent run of the ETL process.

This approach requires categorizing the data in two ways: 1) as key elements or attributes, and 2) as inserts or updates. Key elements are required fields that maintain the data integrity of the target and allow hierarchies to be summarized at various levels in the organization. Attributes provide additional descriptive information per key element. Inserts are important for dimensions or master data because subsequent factual data may rely on the existence of the dimension data row in order to load properly. Updates do not affect data integrity as much, because the factual data can usually be loaded with the existing dimensional data unless the update is to a key element.

The development effort for this method is more extensive than Reject All, since it involves classifying fields as critical or non-critical and developing logic to update the target and flag the fields that are in error. The effort also incorporates some tasks from the Reject None approach, in that processes must be developed to fix incorrect data in the entire target data architecture.

Informatica generally recommends the Reject Critical strategy to maintain the accuracy of the target. By providing the most fine-grained analysis of errors, this method allows the greatest amount of valid data to enter the target on each run of the ETL process, while screening out the unverifiable data fields.
However, business management needs to understand that some information may be held out of the target, and also that some of the information in the target data structures may be at least temporarily allocated to the wrong hierarchies.
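The row-level decision at the heart of a Reject Critical strategy might be sketched as follows. The field classification and the returned status/flag shape are illustrative, not PowerCenter constructs: a key-element error rejects the whole row, while attribute errors are merely flagged and the row is loaded.

```python
def reject_critical(row, errors, key_fields):
    """Apply the Reject Critical decision to one row: reject only when a
    key element is in error; otherwise load the row and flag the bad
    attributes for correction at source."""
    bad_fields = {field for field, _ in errors}
    if bad_fields & set(key_fields):
        return ("reject", row, sorted(bad_fields))
    return ("load", row, sorted(bad_fields))  # attribute errors flagged only
```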
Three methods exist for handling the creation and update of profiles:

1. The first method produces a new profile record each time a change is detected in the source. If a field value was invalid, then the original field value is maintained.
Date       Profile Date  Field 1 Value  Field 2 Value  Field 3 Value
1/1/2000   1/1/2000      Closed Sunday  Black          Open 9-5
1/5/2000   1/5/2000      Open Sunday    Black          Open 9-5
1/10/2000  1/10/2000     Open Sunday    Black          Open 24hrs
1/15/2000  1/15/2000     Open Sunday    Red            Open 24hrs
By applying all corrections as new profiles, this method simplifies the process: every change in the source system is applied directly to the target. Each change, regardless of whether it is a fix to a previous error, creates a new profile. This incorrectly shows in the target that two changes occurred to the source information when, in reality, a mistake was entered on the first change and should have been reflected in the first profile; the second profile should never have been created.

2. The second method updates the first profile created on 1/5/2000 until all fields are corrected on 1/15/2000, which loses the profile record for the change to Field 3. If we try to apply changes to the existing profile, as in this method, we run the risk of losing profile information. If the third field changes before the second field is fixed, we show the third field changing at the same time as the first. When the second field is fixed, it is also added to the existing profile, which incorrectly reflects the changes in the source system.
3. The third method creates only two new profiles, but then causes an update to the profile records on 1/15/2000 to fix the Field 2 value in both.
Date      Profile Date  Field 1 Value  Field 2 Value  Field 3 Value
1/1/2000  1/1/2000      Closed Sunday  Black          Open 9-5
If we try to implement a method that updates old profiles when errors are fixed, as in this option, we need to create complex algorithms to handle the process correctly: determining when the error occurred, examining all profiles generated since then, and updating them appropriately. And even if we create the algorithms to handle these methods, we still have the problem of determining whether a value is a correction or a new value. If an error is never fixed in the source system but a new value is entered, we would identify it as a fix to a previous error, causing an automated process to update old profile records when in reality a new profile record should have been entered.
Recommended Method
A method exists to track old errors so that we know when a value was rejected. Then, when the process encounters a new, correct value it flags it as part of the load strategy as a potential fix that should be applied to old Profile records. In this way, the corrected data enters the target as a new Profile record, but the process of fixing old Profile records, and potentially deleting the newly inserted record, is delayed until the data is examined and an action is decided. Once an action is decided, another process examines the existing Profile records and corrects them as necessary. This method only delays the As-Was analysis of the data until the correction method is determined because the current information is reflected in the new Profile.
The quality indicator approach should:

- Show the record- and field-level quality associated with a given record at the time of extract.
- Identify data sources and errors encountered in specific records.
- Support the resolution of specific record error types via an update and resubmission process.
Quality indicators can be used to record several types of errors, e.g., fatal errors (missing primary key value), missing data in a required field, wrong data type/format, or invalid data value. If a record contains even one error, data quality (DQ) fields are appended to the end of the record, one field for every field in the record. A data quality indicator code is included in the DQ fields corresponding to the original fields in the record where the errors were encountered. Records containing a fatal error are stored in a Rejected Record Table and associated with the original file name and record number. These records cannot be loaded to the target because they lack a primary key field to be used as a unique record identifier in the target. The following types of errors cannot be processed:
- A source record does not contain a valid key. This record would be sent to a reject queue. Metadata will be saved and used to generate a notice to the sending system indicating that x number of invalid records were received and could not be processed. However, in the absence of a primary key, no tracking is possible to determine whether the invalid record has been replaced or not.
- The source file or record is illegible. The file or record would be sent to a reject queue. Metadata indicating
that x number of invalid records were received and could not be processed may or may not be available for a general notice to be sent to the sending system. In this case, due to the nature of the error, no tracking is possible to determine whether the invalid record has been replaced or not. If the file or record is illegible, it is likely that individual unique records within the file are not identifiable. While information can be provided to the source system site indicating there are file errors for x number of records, specific problems may not be identifiable on a record-by-record basis. In these error types, the records can be processed, but they contain errors:
- A required (non-key) field is missing.
- The value in a numeric or date field is non-numeric.
- The value in a field does not fall within the range of acceptable values identified for the field. Typically, a reference table is used for this validation.
When an error is detected during ingest and cleansing, the identified error type is recorded.
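The appending of DQ indicator fields described above can be sketched as follows. The `_DQ` suffix, the "0" clean code, and the code values are assumptions for illustration; actual indicator codes would come from the project's quality-indicator standard.

```python
def append_dq_fields(record, errors, codes):
    """Append one data-quality indicator field per original field:
    '0' for clean fields, the mapped error code otherwise."""
    error_map = dict(errors)                       # field -> error type
    dq = [codes.get(error_map.get(f), "0") for f in record]
    return dict(record, **{f + "_DQ": v for f, v in zip(record, dq)})
```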
Attributes provide additional descriptive information about a dimension concept, such as the color of a product or the address of a store. Attribute errors are typically things like an invalid color or inappropriate characters in the address. These types of errors do not generally affect the aggregated facts and statistics in the target data; the attributes are most useful as qualifiers and filtering criteria for drilling into the data (e.g., to find specific patterns for market research). Attribute errors can be fixed by waiting for the source system to be corrected and reapplying the data to the target. When attribute errors are encountered for a new dimensional value, default values can be assigned to let the new record enter the target. Some rules that have been proposed for handling defaults are as follows:
Value Types       Description                                       Default
Reference Values  Attributes that are foreign keys to other tables  Unknown
Small Value Sets  Y/N indicator fields                              No
Other             Any other type of attribute                       Null or business-provided value
Reference tables are used to normalize the target model to prevent the duplication of data. When a source value does not translate into a reference table value, we use the Unknown value (all reference tables contain a value of Unknown for this purpose). The business should provide default values for each identified attribute. Fields that are restricted to a limited domain of values (e.g., On/Off or Yes/No indicators) are referred to as small value sets; when errors are encountered in translating these values, we use the value that represents Off or No as the default. Other values, such as numbers, are handled on a case-by-case basis. In many cases, the data integration process is set to populate Null into these fields, meaning undefined in the target. After a source system value is corrected and passes validation, it is corrected in the target.
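The default-value rules in the table above might be applied as in the sketch below. The field-to-type classification is hypothetical and would normally be driven by metadata maintained with the target model.

```python
DEFAULTS = {                  # defaults from the rules table above
    "reference": "Unknown",   # foreign keys to reference tables
    "small_value_set": "No",  # Y/N indicator fields
    "other": None,            # Null / business-provided value
}

def apply_attribute_defaults(row, errors, field_types):
    """Let a new dimension record enter the target by substituting the
    documented default for each attribute that failed validation."""
    fixed = dict(row)
    for field, _ in errors:
        fixed[field] = DEFAULTS[field_types[field]]
    return fixed
```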
If information is captured as dimensional data from the source, but used as measures residing on the fact records in the target, we must decide how to handle the facts. From a data accuracy view, we would like to reject the fact until the value is corrected. If we load the facts with the incorrect data, the process to fix the target can be time consuming and difficult to implement. If we let the facts enter downstream target structures, we need to create processes that update them after the dimensional data is fixed. If we reject the facts when these types of errors are encountered, the fix process becomes simpler. After the errors are fixed, the affected rows can simply be loaded and applied to the target data.
Fact Errors
If there are no business rules that reject fact records except for relationship errors to dimensional data, then when we encounter errors that would cause a fact to be rejected, we save these rows to a reject table for reprocessing the following night. This nightly reprocessing continues until the data successfully enters the target data structures. Initial and periodic analyses should be performed on the errors to determine why they are not being loaded.
Data Stewards
Data Stewards are generally responsible for maintaining reference tables and translation tables, creating new entities in dimensional data, and designating one primary data source when multiple sources exist. Reference data and translation tables enable the target data architecture to maintain consistent descriptions across multiple source systems, regardless of how the source system stores the data. New entities in dimensional data include new locations, products, hierarchies, etc. Multiple source data occurs when two source systems can contain different data for the same dimensional entity.
Reference Tables
The target data architecture may use reference tables to maintain consistent descriptions. Each table contains a short code value as a primary key and a long description for reporting purposes. A translation table is associated with each reference table to map the codes to the source system values. Using both of these tables, the ETL process can load data from the source systems into the target structures. The translation tables contain one or more rows for each source value and map the value to a matching row in the reference table. For example, the SOURCE column in FILE X on System X can contain O, S or W. The data steward would be responsible for entering in the translation table the following values:
Source Value  Code Translation
O             OFFICE
S             STORE
W             WAREHSE
These values are used by the data integration process to correctly load the target. Other source systems that maintain a similar field may use a two-letter abbreviation like OF, ST and WH. The data steward would make the following entries into the translation table to maintain consistency across systems:
Source Value  Code Translation
OF            OFFICE
ST            STORE
WH            WAREHSE
The data stewards are also responsible for maintaining the reference table that translates the codes into descriptions. The ETL process uses the reference table to populate the following values into the target:
Code Translation  Code Description
OFFICE            Office
STORE             Retail Store
WAREHSE           Distribution Warehouse
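The two-step lookup through the translation and reference tables can be sketched as below, using the values from the tables above. The system names (SYS_X, SYS_Y) and the in-memory dictionaries are illustrative; in the real process these would be database tables maintained by the data stewards.

```python
TRANSLATION = {                 # (source system, source value) -> code
    ("SYS_X", "O"): "OFFICE",
    ("SYS_X", "S"): "STORE",
    ("SYS_X", "W"): "WAREHSE",
    ("SYS_Y", "OF"): "OFFICE",
    ("SYS_Y", "ST"): "STORE",
    ("SYS_Y", "WH"): "WAREHSE",
}
REFERENCE = {"OFFICE": "Office", "STORE": "Retail Store",
             "WAREHSE": "Distribution Warehouse"}

def translate(system, value):
    """Map a raw source value to its code and reporting description,
    falling back to Unknown when no translation exists."""
    code = TRANSLATION.get((system, value), "UNKNOWN")
    return code, REFERENCE.get(code, "Unknown")
```

Note how sources with different value formats resolve to the same code, which is what keeps descriptions consistent across systems.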
Error handling is required when the data steward enters incorrect information for these mappings and needs to correct it after data has been loaded. Correcting the above example could be complex (e.g., if the data steward entered ST as translating to OFFICE by mistake). The only way to determine which rows should be changed is to restore and reload the source data from the first time the mistake was entered. Processes should be built to handle these types of situations, including correction of the entire target data architecture.
Dimensional Data
New entities in dimensional data present a more complex issue. New entities in the target may include locations and products, at a minimum. Dimensional data uses the same concept of translation as reference tables: translation tables map the source system value to the target value. For locations this is straightforward, but over time products may have multiple source system values that map to the same product in the target. (Other similar translation issues may also exist, but products serve as a good example for error handling.)

There are two possible methods for loading new dimensional entities: either require the data steward to enter the translation data before allowing the dimensional data into the target, or create the translation data through the ETL process and force the data steward to review it. The first option requires the data steward to create the translation for new entities. The second lets the ETL process create the translation, but marks the record as Pending Verification until the data steward reviews it and changes the status to Verified before any facts that reference it can be loaded. While the dimensional value is left as Pending Verification, however, facts may be rejected or allocated to dummy values, which requires the data stewards to review the status of new values on a daily basis. A potential solution is to generate an email each night if any translation table entries are pending verification; the data steward then opens a report that lists them.

A problem specific to products is that a product created as new may really be just a changed SKU number. This causes additional fact rows to be created, which produces an inaccurate view of the product when reporting. When this is fixed, the fact rows for the various SKU numbers need to be merged and the original rows deleted. Profiles would also have to be merged, requiring manual intervention.
The situation is more complicated when the opposite condition occurs (i.e., two products are mapped to the same product, but really represent two different products). In this case, it is necessary to restore the source information for all loads since the error was introduced. Affected records from the target should be deleted and then reloaded from
the restore to correctly split the data. Facts should be split to allocate the information correctly and dimensions split to generate correct profile information.
Manual Updates
Over time, any system is likely to encounter errors that are not correctable using source systems. A method needs to be established for manually entering fixed data and applying it correctly to the entire target data architecture, including beginning and ending effective dates. These dates are useful for both profile and date event fixes. Further, a log of these fixes should be maintained to enable identifying the source of the fixes as manual rather than part of the normal load process.
Multiple Sources
The data stewards are also involved when multiple sources exist for the same data. This occurs when two sources contain subsets of the required information. For example, one system may contain warehouse and store information while another contains store and hub information. Because they share store information, it is difficult to decide which source contains the correct information, and both sources have the ability to update the same row in the target.

If both sources are allowed to update the shared information, data accuracy and profile problems are likely to occur. If the shared information is updated on only one source system, the two systems then contain different information. If the changed system is loaded into the target, it creates a new profile indicating the information changed. When the second system is loaded, it compares its old, unchanged value to the new profile, assumes a change occurred, and creates another new profile with the old, unchanged value. If the two systems remain different, this process loads two profiles every day until the two source systems are synchronized with the same information.

To avoid this situation, the business analysts and developers need to designate, at a field level, the source that should be considered primary for the field; the field is then changed only if it changes on the primary source. While this sounds simple, it requires complex logic when creating profiles, because multiple sources can contribute information to the one profile record created for that day. One solution is to develop a system of record for all sources; this allows developers to pull the information from the system of record, knowing that there are no conflicts among multiple sources. Another solution is to indicate, at the field level, a primary source for information that is shared across multiple sources.
Developers can use the field level information to update only the fields that are marked as primary. However, this requires additional effort by the data stewards to mark the correct source fields as primary and by the data integration team to customize the load process.
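The field-level primary-source rule can be sketched as follows. This is an illustrative Python model only; the source names and field assignments are hypothetical, not part of any actual load process.

```python
# Each field is taken only from the source system designated as primary
# for that field. The map below is hypothetical.
PRIMARY_SOURCE = {"store_name": "warehouse_sys", "store_hours": "hub_sys"}

def merge(records_by_source):
    """records_by_source: {source_name: {field: value}}.
    Build the target row field by field from each field's primary source."""
    return {field: records_by_source[src][field]
            for field, src in PRIMARY_SOURCE.items()}

row = merge({"warehouse_sys": {"store_name": "Main St", "store_hours": "9-5"},
             "hub_sys":       {"store_name": "Main Street", "store_hours": "9-6"}})
```

Because only the primary source is consulted per field, a change on a non-primary source does not churn the profile.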
INFORMATICA CONFIDENTIAL
BEST PRACTICE
349 of 702
Description
Identifying errors and creating an error handling strategy is an essential part of a data integration project. In the production environment, data must be checked and validated prior to entry into the target system. One strategy for catching data errors is to use PowerCenter mappings and error logging capabilities to catch specific data validation errors and unexpected transformation or database constraint errors.
When defining the error handling strategy, consider the following questions:

- What types of data errors are likely to be encountered?
- Of these errors, which ones should be captured?
- What process can capture the possible errors?
- Should errors be captured before they have a chance to be written to the target database?
- Will any of these errors need to be reloaded or corrected?
- How will the users know if errors are encountered?
- How will the errors be stored?
- Should descriptions be assigned for individual errors?
- Can a table be designed to store captured errors and the error descriptions?
Capturing data errors within a mapping and re-routing them to an error table facilitates analysis by end users and improves performance. One practical application of the mapping approach is to capture foreign key constraint errors (e.g., executing a lookup on a dimension table prior to loading a fact table). Referential integrity is assured by including this sort of functionality in a mapping. While the database still enforces the foreign key constraints, erroneous data is not written to the target table; constraint errors are captured within the mapping so that the PowerCenter Server does not have to write them to the session log and the reject/bad file, thus improving performance. Data content errors can also be captured in a mapping. Mapping logic can identify content errors and attach descriptions to them. This approach can be effective for many types of data content errors, including date conversions, null values intended for NOT NULL target fields, and incorrect data formats or data types.
An expression transformation can be employed to validate the source data, applying rules and flagging records with one or more errors. A router transformation can then separate valid rows from those containing the errors. It is good practice to append error rows with a unique key; this can be a composite consisting of a MAPPING_ID and ROW_ID, for example. The MAPPING_ID would refer to the mapping name and the ROW_ID would be created by a sequence generator. The composite key is designed to allow developers to trace rows written to the error tables that store information useful for error reporting and investigation. In this example, two error tables are suggested, namely: CUSTOMER_ERR and ERR_DESC_TBL.
The table ERR_DESC_TBL is designed to hold information about the error, such as the mapping name, the ROW_ID, and the error description. This table can be used to hold all data validation error descriptions for all mappings, giving a single point of reference for reporting. The CUSTOMER_ERR table can be an exact copy of the target CUSTOMER table appended with two additional columns: ROW_ID and MAPPING_ID. These columns allow the two error tables to be joined. The CUSTOMER_ERR table stores the entire row that was rejected, enabling the user to trace the error rows back to the source and potentially build mappings to reprocess them. The mapping logic must assign a unique description for each error in the rejected row. In this example, any null value intended for a NOT NULL target field could generate an error message such as "NAME is NULL" or "DOB is NULL". This step can be done in an expression transformation (e.g., EXP_VALIDATION in the sample mapping). After the field descriptions are assigned, the error row can be split into several rows, one for each possible error, using a normalizer transformation. After a single source row is normalized, the resulting rows can be
filtered to leave only errors that are present (i.e., each record can have zero to many errors). For example, if a row has three errors, three error rows would be generated with appropriate error descriptions (ERROR_DESC) in the table ERR_DESC_TBL. The following table shows how the error data produced may look.
Table Name: CUSTOMER_ERR

NAME  DOB   ADDRESS  ROW_ID  MAPPING_ID
NULL  NULL  NULL     1       DIM_LOAD

Table Name: ERR_DESC_TBL

FOLDER_NAME  MAPPING_ID  ROW_ID  ERROR_DESC       LOAD_DATE   SOURCE       TARGET
CUST         DIM_LOAD    1       Name is NULL     10/11/2006  CUSTOMER_FF  CUSTOMER
CUST         DIM_LOAD    1       DOB is NULL      10/11/2006  CUSTOMER_FF  CUSTOMER
CUST         DIM_LOAD    1       Address is NULL  10/11/2006  CUSTOMER_FF  CUSTOMER
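The validation-and-normalization step described above can be sketched in Python. This is an illustrative model of the mapping logic, not actual PowerCenter code; the field rules and table shapes simply mirror the example.

```python
# Stand-in for the sequence generator that supplies ROW_ID values.
ROW_ID = iter(range(1, 10**6))

def validate_customer(row, mapping_id="DIM_LOAD"):
    """Apply NOT NULL rules; emit the CUSTOMER_ERR row plus one
    ERR_DESC_TBL row per error found (the normalizer step)."""
    rules = [("name", "Name is NULL"),
             ("dob", "DOB is NULL"),
             ("address", "Address is NULL")]
    errors = [msg for field, msg in rules if row.get(field) is None]
    if not errors:
        return None, []           # valid row: nothing goes to the error tables
    row_id = next(ROW_ID)
    customer_err = {**row, "ROW_ID": row_id, "MAPPING_ID": mapping_id}
    err_desc = [{"MAPPING_ID": mapping_id, "ROW_ID": row_id, "ERROR_DESC": m}
                for m in errors]  # one error row per failed rule
    return customer_err, err_desc

bad, descs = validate_customer({"name": None, "dob": None, "address": None})
```

A row failing three rules yields three ERR_DESC_TBL entries sharing one composite key, matching the sample data above.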
The efficiency of a mapping approach can be increased by employing reusable objects. Common logic should be placed in mapplets, which can be shared by multiple mappings. This improves productivity in implementing and managing the capture of data validation errors. Data validation error handling can be extended by including mapping logic to grade error severity, for example flagging data validation errors as soft or hard.
A hard error can be defined as one that would fail when being written to the database, such as a constraint error. A soft error can be defined as a data content error.
A record flagged as hard can be filtered from the target and written to the error tables, while a record flagged as soft can be written to both the target system and the error tables. This gives business analysts an opportunity to evaluate and correct data imperfections while still allowing the records to be processed for end-user reporting. Ultimately, business organizations need to decide if the analysts should fix the data in the reject table or in the source systems. The advantage of the mapping approach is that all errors are identified as either data errors or constraint errors and can be properly addressed. The mapping approach also reports errors based on projects or categories by identifying the mappings that contain errors. The most important aspect of the mapping approach however, is its flexibility. Once an error type is identified, the error handling logic can be placed anywhere within a mapping. By using the mapping approach to capture identified errors, the operations team can effectively communicate data quality issues to the business users.
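The hard/soft routing rule can be sketched as a simple function. This is an illustrative model of the router logic, not PowerCenter syntax; the record and error shapes are invented.

```python
def route(record, errors):
    """Route a validated record: hard errors are filtered from the target
    and written only to the error tables; soft errors go to both the
    target system and the error tables."""
    to_target, to_error = [], []
    if any(e["severity"] == "hard" for e in errors):
        to_error.append(record)       # constraint-style error: reject outright
    elif errors:
        to_target.append(record)      # content error: load it, but log it too
        to_error.append(record)
    else:
        to_target.append(record)      # clean row: target only
    return to_target, to_error
```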
If a session is configured to stop on errors, the rejected data must be corrected, and the entire source may need to be reloaded or recovered. This is not always an acceptable approach. An alternative is to let the load process continue when records are rejected, and then reprocess only the records that were found to be in error. This can be achieved by setting the session's Stop On Errors property to 0 and switching on relational error logging. By default, error messages from the RDBMS and any uncaught transformation errors are written to the session log. Switching on relational error logging redirects these messages to a selected database, in which four tables are automatically created: PMERR_MSG, PMERR_DATA, PMERR_TRANS, and PMERR_SESS. The PowerCenter Workflow Administration Guide contains detailed information on the structure of these tables. The PMERR_MSG table stores the error messages encountered in a session; the following four columns allow any RDBMS errors to be retrieved:

- SESS_INST_ID: A unique identifier for the session. Joining this table with the Metadata Exchange (MX) view REP_LOAD_SESSIONS in the repository allows the MAPPING_ID to be retrieved.
- TRANS_NAME: Name of the transformation where an error occurred. For an RDBMS error, this is the name of the target transformation.
- TRANS_ROW_ID: The row ID generated by the last active source. This field contains the row number at the target when the error occurred.
- ERROR_MSG: The error message generated by the RDBMS.
With this information, all RDBMS errors can be extracted and stored in an applicable error table. A post-load session (i.e., an additional PowerCenter session) can be implemented to read the PMERR_MSG table, join it with the MX view REP_LOAD_SESSIONS in the repository, and insert the error details into ERR_DESC_TBL. When the post process ends, ERR_DESC_TBL contains both soft errors and hard errors.

One problem with capturing RDBMS errors in this way is mapping them back to the relevant source key to provide lineage. This can be difficult when the source and target rows are not directly related (i.e., one source row can result in zero or more rows at the target). In this case, the mapping that loads the source must write translation data to a staging table (including the source key and target row number). The translation table can then be used by the post-load session to identify the source key from the target row number retrieved from the error log. The source key stored in the translation table could be a row number in the case of a flat file, or a primary key in the case of a relational data source.
Reprocessing
After the load and post-load sessions are complete, the error table (e.g., ERR_DESC_TBL) can be analyzed by members of the business or operational teams. The rows listed in this table have not been loaded into the target database. The operations team can, therefore, fix the data in the source that resulted in soft errors and may be able to explain and remediate the hard errors. Once the errors have been fixed, the source data can be reloaded.

Ideally, only the rows that resulted in errors during the first run should be reprocessed in the reload. This can be achieved by including a filter and a lookup in the original load mapping and using a parameter to configure the mapping for an initial load or for a reprocess load. When reprocessing, the lookup searches for each source row number in the error table, while the filter removes source rows for which the lookup has not found errors. When initial loading, all rows are passed through the filter, validated, and loaded. With this approach, the same mapping can be used for initial and reprocess loads.

During a reprocess run, the records successfully loaded should be deleted (or marked for deletion) from the error table, while any new errors encountered should be inserted as in an initial run. On completion, the post-load process is executed to capture any new RDBMS errors. This ensures that reprocessing loads are repeatable and result in a reducing number of records in the error table over time.
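The filter/lookup reprocessing logic can be sketched as follows. This is an illustrative model; in a real mapping the error row IDs would come from the error table via the lookup, and the run mode from a parameter file.

```python
def rows_to_process(source, error_row_ids, reprocess):
    """Initial run: pass every source row through for validation and load.
    Reprocess run: keep only rows whose ROW_ID was logged in the error
    table on a previous run (the lookup), dropping the rest (the filter)."""
    if not reprocess:
        return list(source)
    return [r for r in source if r["ROW_ID"] in error_row_ids]

src = [{"ROW_ID": 1}, {"ROW_ID": 2}, {"ROW_ID": 3}]
```

Because the same function serves both modes, one mapping handles initial and reprocess loads, as described above.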
Last updated: 01-Feb-07 18:53
Description
Identifying errors and creating an error handling strategy is an essential part of a data warehousing project. The errors in an ETL process can be broadly categorized into two types: data errors in the load process, which are defined by the standards of acceptable data quality; and process errors, which are driven by the stability of the process itself. The first step in implementing an error handling strategy is to understand and define the error handling requirement. Consider the following questions:
- What tools and methods can help in detecting all the possible errors?
- What tools and methods can help in correcting the errors?
- What is the best way to reconcile data across multiple systems?
- Where and how will the errors be stored (i.e., relational tables or flat files)?
A robust error handling strategy can be implemented using PowerCenter's built-in error handling capabilities along with Data Analyzer as follows:
- Process Errors: Configure an email task to notify the PowerCenter Administrator immediately of any process failures.
- Data Errors: Set up the ETL process to:
  - Use the Row Error Logging feature in PowerCenter to capture data errors in the PowerCenter error tables for analysis, correction, and reprocessing.
  - Set up Data Analyzer alerts to notify the PowerCenter Administrator in the event of any rejected rows.
  - Set up customized Data Analyzer reports and dashboards at the project level to provide information on failed sessions, sessions with failed rows, load time, etc.
When you configure the subject and body of a post-session email, use email variables to include information about the session run, such as session name, mapping name, status, total number of records loaded, and total number of records rejected. The following table lists all the available email variables:
Email Variables for Post-Session Email

%s            Session name.
%e            Session status.
%b            Session start time.
%c            Session completion time.
%i            Session elapsed time (session completion time minus session start time).
%l            Total rows loaded.
%r            Total rows rejected.
%t            Source and target table details, including read throughput in bytes per second and write throughput in rows per second. The PowerCenter Server includes all information displayed in the session detail dialog box.
%m            Name of the mapping used in the session.
%n            Name of the folder containing the session.
%d            Name of the repository containing the session.
%g            Attach the session log to the message.
%a<filename>  Attach the named file. The file must be local to the PowerCenter Server. The following are valid file names: %a<c:\data\sales.txt> or %a</users/john/data/sales.txt>. Note: The file name cannot include the greater than character (>) or a line break.
Note: The PowerCenter Server ignores %a, %g, or %t when you include them in the email subject. Include these variables in the email message only.
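As an illustration of how these variables expand, the sketch below performs the substitution in user code. In practice the PowerCenter Server does this itself, and the session statistics shown here are invented.

```python
def expand(template, stats):
    """Substitute a subset of the post-session email variables."""
    mapping = {"%s": stats["session"], "%e": stats["status"],
               "%l": str(stats["loaded"]), "%r": str(stats["rejected"]),
               "%m": stats["mapping"]}
    for var, value in mapping.items():
        template = template.replace(var, value)
    return template

body = expand("Session %s (%m) finished %e: %l loaded, %r rejected",
              {"session": "s_m_Load_Customer", "status": "Succeeded",
               "loaded": 9998, "rejected": 2, "mapping": "m_Load_Customer"})
```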
The Row Error Logging feature writes rejected rows and their context to four error tables:

- PMERR_DATA: Stores data and metadata about a transformation row error and its corresponding source row.
- PMERR_MSG: Stores metadata about an error and the error message.
- PMERR_SESS: Stores metadata about the session.
- PMERR_TRANS: Stores metadata about the source and transformation ports, such as name and datatype, when a transformation error occurs.

Row error logging also:

- Appends error data to the same tables cumulatively, if they already exist, for further runs of the session.
- Allows you to specify a prefix for the error tables. For instance, if you want all your EDW session errors to go to one set of error tables, you can specify the prefix as EDW_.
- Allows you to collect row errors from multiple sessions in a centralized set of four error tables by specifying the same error log table name prefix for all sessions.
Example:
In the following example, the session s_m_Load_Customer loads customer data into the EDW Customer table, which has the following structure:

CUSTOMER_ID      NUMBER        NOT NULL (PRIMARY KEY)
CUSTOMER_NAME    VARCHAR2(30)  NULL
CUSTOMER_STATUS  VARCHAR2(10)  NULL

There is a primary key constraint on the column CUSTOMER_ID. To take advantage of PowerCenter's built-in error handling features, set the session properties as follows: Error Log Type is set to Relational Database, and the Error Log DB Connection and Table Name Prefix values are given accordingly. When the PowerCenter Server detects rejected rows because of the primary key constraint violation, it writes information into the error tables as shown below:
EDW_PMERR_DATA

SESS_INST_ID  TRANS_NAME      TRANS_ROW_ID  SOURCE_ROW_TYPE  SOURCE_ROW_DATA  LINE_NO
3             Customer_Table  1             -1               N/A              1
3             Customer_Table  2             -1               N/A              1
3             Customer_Table  3             -1               N/A              1

EDW_PMERR_MSG

WORKFLOW_RUN_ID  SESS_INST_ID  SESS_START_TIME  REPOSITORY_NAME  FOLDER_NAME  WORKFLOW_NAME  TASK_INST_PATH  MAPPING_NAME  LINE_NO
6                3             9/15/2004 18:31  pc711            Folder1      wf_test1       s_m_test1       m_test1       1
6                3             9/15/2004 18:33  pc711            Folder1      wf_test1       s_m_test1       m_test1       1
6                3             9/15/2004 18:34  pc711            Folder1      wf_test1       s_m_test1       m_test1       1

EDW_PMERR_SESS

WORKFLOW_RUN_ID  SESS_INST_ID  SESS_START_TIME  REPOSITORY_NAME  FOLDER_NAME  WORKFLOW_NAME  TASK_INST_PATH  MAPPING_NAME  LINE_NO
6                3             9/15/2004 18:31  pc711            Folder1      wf_test1       s_m_test1       m_test1       1

EDW_PMERR_TRANS

SESS_INST_ID  TRANS_NAME      TRANS_GROUP  TRANS_ATTR                                           LINE_NO
3             Customer_Table  Input        Customer_Id:3, Customer_Name:12, Customer_Status:12  1
By looking at the workflow run id and other fields, you can analyze the errors and reprocess them after fixing the errors.
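That analysis step, tying each rejected row back to its session context, can be sketched as a simple join. The values mirror the sample tables; the record shapes are simplified for illustration.

```python
# Rejected rows from EDW_PMERR_DATA (simplified).
err_data = [{"WORKFLOW_RUN_ID": 6, "SESS_INST_ID": 3,
             "TRANS_NAME": "Customer_Table", "TRANS_ROW_ID": i}
            for i in (1, 2, 3)]

# Session context from EDW_PMERR_SESS, keyed by (workflow run, session instance).
err_sess = {(6, 3): {"FOLDER_NAME": "Folder1", "WORKFLOW_NAME": "wf_test1",
                     "MAPPING_NAME": "m_test1"}}

# Join: each error row gains the folder/workflow/mapping it came from.
report = [{**row, **err_sess[(row["WORKFLOW_RUN_ID"], row["SESS_INST_ID"])]}
          for row in err_data]
```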
You can use the Operations Dashboard provided with the repository reports as one central location to gain insight into production environment ETL activities. In addition, the following capabilities of Data Analyzer are recommended best practices:
- Configure alerts to send an email or a pager message to the PowerCenter Administrator whenever an entry is made into the error tables PMERR_DATA or PMERR_TRANS.
- Configure reports and dashboards to provide detailed session run information grouped by projects/PowerCenter folders for easy analysis.
- Configure reports to provide detailed information on the row-level errors for each session. This can be accomplished by using the four error tables as sources of data for the reports.
Description
At the end of the ICC selection process, one of the ICC models will have been selected for implementation. The next steps are to:

- Select the initial project
- Identify the resource needs
- Establish the 30/60/90-day plan
30 Day Plan
The following plan outlines the people, process, and technology steps that should occur during the first 30 days of the ICC rollout:
People
- Identify, assemble, and budget for the human resources necessary to support the ICC rollout; typically, one technical FTE is a good place to start.
- Identify, estimate, and budget for the necessary technical resources (e.g., hardware, software). Note: To encourage projects to utilize the ICC model, it can often be effective to provide hardware and software resources without any internal chargeback for the first year of the ICC. Alternatively, the hardware and software costs can be funded by the projects that are likely to leverage the ICC.
Processes
- Identify and start planning for the initial projects that will utilize the ICC shared services.
Technology
- Implement a short-term technical infrastructure for the ICC. This includes implementing the hardware and software required to support the initial five projects (or projects within the scope of the first year of the ICC) in both a development and production capacity. Typically, this technical infrastructure is not the end-goal configuration, but it should include a hardware and software configuration that can easily meld into the end-goal configuration. The hardware and software requirements of the short-term technical infrastructure are generally limited to the components required for the projects that will leverage the infrastructure during the first year.
60 Day Plan
People
- Provide the shared resources to support the ongoing projects that are utilizing the ICC shared services in development. These resources need to support the deployment of objects between environments (dev/test/production) and support the monitoring of ongoing production processes (a.k.a. production support).
Processes
- Start building, establishing, and communicating the processes that will be required to support the ICC.
Technology
- Build out additional features into the short-term technical infrastructure that can improve service levels of the ICC and reduce costs. Examples include:
  - PowerCenter Metadata Reporter
  - PowerCenter Team Based Development
  - Metadata Manager
  - Data Profiling and Cleansing options
  - Various connectivity options, including PowerExchange and PowerCenter Connects
90 Day Plan
People
- Continue to provide production support shared services to projects leveraging the ICC infrastructure.
- Provide training to project teams on additional ICC capabilities available to them (i.e., those implemented during the 60-day plan).
Processes
- Finalize and fully communicate all ICC processes (i.e., the processes listed in the 30-day plan).
- Develop a governance plan such that all objects/code developed for projects leveraging the ICC are reviewed by a governing board of architects and senior developers before being migrated into production. This governance ensures that only high-quality projects are placed in production.
- Establish SLAs between the projects leveraging the ICC shared services and the ICC itself.
- Begin work on a chargeback model such that projects that join the ICC after the first year provide an internal transfer of funds to support the ICC based on their usage of the ICC shared services. Typically, chargeback models are based on the CPU utilization used in production by the project on a monthly basis.
  - PowerCenter 8.1 logs metadata in the repository regarding the amount of CPU used by a particular process. For this reason, PowerCenter 8.1 is a key technology that should be leveraged for ICC implementations.
  - The ICC chargeback model should be flexible in that the project manager can choose between a number of options for levels of support. For example, projects have different SLA requirements, and a project that requires 24/7 high availability and dedicated hardware should have a different, more expensive chargeback than a similar project that does not require high availability or dedicated hardware.
  - The ICC chargeback model should reflect costs that are lower than the costs the project would otherwise pay to hardware, software, and services vendors if it were to go down the path of a project silo approach.
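A CPU-based chargeback computation of this kind might look like the following sketch. The rate and tier multipliers are entirely invented for illustration; real figures would come from the ICC's cost model.

```python
# Hypothetical internal rate and support-tier multipliers (not real figures).
RATE_PER_CPU_HOUR = 2.50
TIER_MULTIPLIER = {"standard": 1.0, "high_availability": 1.8}

def monthly_charge(cpu_seconds, tier="standard"):
    """Compute a project's monthly chargeback from CPU-seconds logged
    in the repository, scaled by its chosen support tier."""
    hours = cpu_seconds / 3600
    return round(hours * RATE_PER_CPU_HOUR * TIER_MULTIPLIER[tier], 2)
```

A project with 24/7 high-availability requirements pays the higher multiplier, while a standard project pays the base rate, matching the flexibility described above.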
Technology
- As projects join the ICC that have disaster recovery/failover needs, the appropriate implementation of DR/failover should be completed for the ICC infrastructure. This usually happens in the first 90 days of the ICC.
- Implement a long-term technical infrastructure, including both hardware and software. This long-term infrastructure can generally provide cost-effective options for horizontal scaling, such as leveraging Informatica's Enterprise Grid capabilities with a relatively inexpensive hardware platform such as Linux or Windows.
- Proactively implement additional software components that can be leveraged by ICC customers/projects. Examples include:
  - High Availability
- After the initial project successes leveraging the ICC shared services model, establish the ICC as the enterprise standard for all data integration project needs.
- Provide additional chargeback models offering greater flexibility to the ICC customers/projects.
- Expand the ICC's service offerings beyond development and production support to include shared services resources that can be shared across projects during the development and testing phases, such as Data Architects, Data Modelers, Development resources, and/or QA resources.
- Establish an ICC Help Desk that provides 24x7 production support, similar to an operator in the mainframe environment.
- Consider negotiating with hardware vendors for more flexible offerings.
Description
Objectives
Typical ICC objectives include:
- Promoting data integration as a formal discipline.
- Developing a set of experts with data integration skills and processes, and leveraging their knowledge across the organization.
- Building and developing skills, capabilities, and best practices for integration processes and operations.
- Monitoring, assessing, and selecting integration technology and tools.
- Managing integration pilots.
- Leading and supporting integration projects with the cooperation of subject matter experts.
- Reusing development work such as source definitions, application interfaces, and codified business rules.
Benefits
Although a successful project that shares its lessons with other teams can be a great way to begin developing organizational awareness of the value of an ICC, setting up a more formal ICC requires
upper management buy-in and funding. Here are some of the typical benefits that can be realized from doing so:
- Rapid development of in-house expertise through coordinated training and shared knowledge.
- Leverage of shared resources and "best practice" methods and solutions.
- More rapid project deployments.
- Higher-quality, reduced-risk data integration projects.
- Reduced costs of project development and maintenance.
When examining the move toward an ICC model that optimizes and (in certain situations) centralizes integration functions, consider two things: the problems, costs and risks associated with a project silo-based approach, and the potential benefits of an ICC environment.
Having considered the service categories, the appropriate ICC Organizational Model can be selected.
Knowledge Management
Training
- Standards Training (Training Coordinator) - Training on best practices, including but not limited to naming conventions, unit test plans, configuration management strategy, and project methodology.
- Product Training (Training Coordinator) - Coordination of vendor-offered or internally-sponsored training on specific technology products.
Standards
- Standards Development (Knowledge Coordinator) - Creating best practices, including but not limited to naming conventions, unit test plans, and coding standards.
- Standards Enforcement (Knowledge Coordinator) - Ensuring development teams use documented best practices through formal development reviews, metadata reports, project audits, or other means.
- Methodology (Knowledge Coordinator) - Creating methodologies to support development initiatives. Examples include methodologies for rolling out data warehouses and data integration projects.
- Mapping Patterns (Knowledge Coordinator) - Developing and maintaining mapping patterns (templates) to speed up development time and promote mapping standards across projects.
Technology
- Emerging Technologies (Technology Leader) - Assessing emerging technologies, determining if/where they fit in the organization, and defining policies around their adoption/use.
- Benchmarking (Technology Leader) - Conducting and documenting tests on hardware and software in the organization to establish performance benchmarks.
Metadata
- Metadata Standards (Metadata Administrator) - Creating standards for capturing and maintaining metadata. For example, database column descriptions can be captured in ERwin and pushed to PowerCenter via Metadata Exchange.
- Metadata Enforcement (Metadata Administrator) - Ensuring development teams conform to documented metadata standards.
- Data Integration Catalog (Metadata Administrator) - Tracking the list of systems involved in data integration efforts, the integration between systems, and the use of/subscription to data integration feeds. This information is critical to managing the interconnections in the environment in order to avoid duplication of integration efforts. The catalog can also assist in understanding when particular integration feeds are no longer needed.
Environment
Hardware
- Vendor Selection and Management (Vendor Manager) - Selecting vendors for the hardware needed for integration efforts, which may span servers, storage, and network facilities.
- Hardware Procurement (Vendor Manager) - Responsible for the purchasing process for hardware items, which may include receiving and cataloging the physical hardware.
- Hardware Architecture (Technical Architect) - Developing and maintaining the physical layout and details of the hardware used to support the Integration Competency Center.
- Hardware Installation (Product Specialist) - Setting up and activating new hardware as it becomes part of the physical architecture supporting the Integration Competency Center.
- Hardware Upgrades (Product Specialist) - Managing the upgrade of hardware, including operating system patches, CPU/memory upgrades, replacement of old technology, etc.
Software
- Vendor Selection and Management (Vendor Manager) - Selecting vendors for the software tools needed for integration efforts. Activities may include formal RFPs, vendor presentation reviews, software selection criteria, maintenance renewal negotiations, and all activities related to managing the software vendor relationship.
- Software Procurement (Vendor Manager) - Responsible for the purchasing process for software packages and licenses.
- Software Architecture (Technical Architect) - Developing and maintaining the architecture of the software packages used in the competency center. This may include flowcharts and decision trees of what software to select for specific tasks.
- Software Installation (Product Specialist) - Setting up and installing new software as it becomes part of the physical architecture supporting the Integration Competency Center.
- Software Upgrade (Product Specialist) - Managing the upgrade of software, including patches and new releases. Depending on the nature of the upgrade, significant planning and rollout efforts may be required (training, testing, physical installation on client machines, etc.).
- Compliance (Licensing) (Vendor Manager) - Monitoring and ensuring proper licensing compliance across development teams. Formal audits or reviews may be scheduled. Documentation should be kept matching installed software with purchased licenses.
Professional Services
- Vendor Selection and Management (Vendor Manager) - Selecting vendors for professional services related to integration efforts. Activities may include managing vendor rates and bulk discount negotiations, payment of vendors, reviewing past vendor work efforts, and managing the list of "preferred" vendors.
- Vendor Qualification (Vendor Manager) - Conducting formal vendor interviews as consultants/contractors are proposed for projects, checking vendor references and certifications, and formally qualifying selected vendors for specific work tasks (e.g., Vendor A is qualified for Java development while Vendor B is qualified for ETL and EAI work).
Security
- Security Administration (Security Administrator) - Providing access to the tools and technology needed to complete data integration development efforts, including software user IDs, source system user IDs/passwords, and overall data security of the integration efforts. Ensures enterprise security processes are followed.
- Disaster Recovery (Technical Architect) - Performing risk analysis in order to develop and execute a plan for disaster recovery, including repository backups, off-site backups, failover hardware, notification procedures, and other tasks related to a catastrophic failure (e.g., a server room fire that destroys development and production servers).
Financial
- Budget (ICC Manager) - Yearly budget management for the Integration Competency Center. Responsible for managing outlays for services, support, hardware, software, and other costs.
- Departmental Cost Allocation (ICC Manager) - For clients where shared services costs are to be spread across departments/business units. Activities include defining the metrics used for cost allocation, reporting on the metrics, and applying cost factors for billing on a weekly, monthly, or quarterly basis as dictated.
Scalability/Availability
- High Availability (Technical Architect) - Designing and implementing hardware, software, and procedures to ensure high availability of the data integration environment.
- Capacity Planning (Technical Architect) - Designing and planning for additional integration capacity to address future growth in the size and volume of data integration for the organization.
Development Support
Performance
- Performance and Tuning (Product Specialist) - Providing targeted performance and tuning assistance for integration efforts. Providing ongoing assessments of load windows and schedules to ensure service level agreements are being met.
Shared Objects
- Shared Object Quality Assurance (Quality Assurance) - Providing quality assurance services for shared objects so that objects conform to standards and do not adversely affect the various projects that may be using them.
- Shared Object Change Management (Change Control Coordinator) - Managing the migration to production of shared objects, which may impact multiple project teams. Activities include defining the schedule for production moves, notifying teams of changes, and coordinating the migration of the object to production.
- Shared Object Acceptance (Change Control Coordinator) - Defining and documenting the criteria for a shared object and officially certifying an object as one that will be shared across project teams.
- Shared Object Documentation (Change Control Coordinator) - Defining the standards for documentation of shared objects and maintaining a catalog of all shared objects and their functions.
Project Support
- Development Helpdesk (Data Integration Developer) - Providing a helpdesk of expert product personnel to support project teams. This gives project teams new to developing data integration routines a place to turn for experienced guidance.
- Software/Method Selection (Technical Architect) - Providing a workflow or decision tree to use when deciding which data integration technology to use for a given technology request.
- Requirements Definition (Business/Technical Analyst) - Developing the process to gather and document integration requirements. Depending on the level of service, this activity may include assisting with, or even fully gathering, the requirements for the project.
- Project Estimation (Project Manager) - Developing project estimation models and providing estimation assistance for data integration efforts.
- Project Management (Project Manager) - Providing full-time management resources experienced in data integration to ensure successful projects.
- Project Architecture Review (Data Integration Architect) - Providing project-level architecture review as part of the design process for data integration projects. Helping ensure standards are met and the project architecture fits within the enterprise architecture vision.
- Detailed Design Review (Data Integration Developer) - Reviewing design specifications in detail to ensure conformance to standards and identifying any issues upfront before development work begins.
- Development Resources (Data Integration Developer) - Providing product-skilled resources for completion of the development efforts.
- Data Profiling (Data Integration Developer) - Providing data profiling services to identify data quality issues and developing plans to address the issues found.
- Data Quality (Data Integration Developer) - Defining and meeting data quality levels and thresholds for data integration efforts.
Testing
- Unit Testing (Quality Assurance) - Defining and executing unit testing of data integration processes. Deliverables include documented test plans, test cases, and verification against end-user acceptance criteria.
- System Testing (Quality Assurance) - Defining and performing system testing to ensure that data integration efforts work seamlessly across multiple projects and teams.
- Schedule Management/Planning (Data Integration Developer) - Providing a single point for managing load schedules across the physical architecture to make the best use of available resources and appropriately handle integration dependencies.
- Impact Analysis (Data Integration Developer) - Providing impact analysis on proposed and scheduled changes that may impact the integration environment. Changes include, but are not limited to, system enhancements, new systems, retirement of old systems, data volume changes, shared object changes, hardware migration, and system outages.
Production Support
Issue Resolution
- Operations Helpdesk (Production Operator) - First line of support for operations issues, providing high-level issue resolution. The helpdesk should provide field support for cases and issues related to scheduled jobs, system availability, and other production support tasks.
- Data Validation (Quality Assurance) - Providing data validation on integration load tasks. Data may be "held" from end-user access until some level of data validation has been performed. This may range from manual review of load statistics to automated review of record counts, including grand total comparisons, expected size thresholds, or any other metric an organization may define to catch potential data inconsistencies before they reach end users.

Production Monitoring
- Schedule Monitoring (Production Operator) - Nightly/daily monitoring of the data integration load jobs: ensuring jobs are properly initiated, are not being delayed, and complete successfully. May provide first-level support to the load schedule while escalating issues to the appropriate support teams.
- Operations Metadata Delivery (Production Operator) - Responsible for providing metadata to system owners and end users regarding the production load process, including load times, completion status, known issues, and other pertinent information regarding the current state of the integration job stream.

Change Management
- Object Migration (Change Control Coordinator) - Coordinating movement of development objects and processes to production. May even physically control migration such that all migration is scheduled, managed, and performed by the ICC.
- Change Control Review (Change Control Coordinator) - Conducting formal and informal reviews of production changes before migration is approved. At this time, standards may be enforced, system tuning reviewed, production schedules updated, and formal sign-off obtained.
- Process Definition (Change Control Coordinator) - Developing and documenting the change management process such that development objects are efficiently and flawlessly migrated into the production environment. This may include notification rules, schedule migration plans, emergency fix procedures, etc.
The higher the degree of centralization, the greater the potential cost savings. Some organizations have the flexibility to easily move toward central services, while others don't, due to organizational or regulatory constraints. There is no ideal model, just one that is appropriate to the environment in which it operates. To assist in selecting the appropriate ICC model, the services described above are mapped to the organizational models below:
Adopting the Central Services model does not necessarily mandate the inclusion of all applications within the orbit of the ICC. Some projects require SLAs (Service Level Agreements) that are much more stringent than those of other projects, and as such they may require a less stringent ICC model.
Last updated: 09-Feb-07 14:51
Description
Reusable Objects
Prior to creating an inventory of reusable objects or shortcut objects, be sure to review the business requirements and look for any common routines and/or modules that may appear in more than one data movement. These common routines are excellent candidates for reusable objects or shortcut objects. In PowerCenter, these objects can be created as:
- single transformations (e.g., lookups, filters, etc.)
- a reusable mapping component (i.e., a group of transformations: a mapplet)
- single tasks in Workflow Manager (e.g., command, email, or session)
- a reusable workflow component (i.e., a group of tasks in Workflow Manager: a worklet)
Please note that shortcuts are not supported for workflow-level objects (tasks). Identify the need for reusable objects based on the following criteria:

- Is there enough usage and complexity to warrant the development of a common object?
- Are the data types of the information passing through the reusable object the same from case to case, or is it simply the same high-level steps with different fields and data?

Then decide how widely the objects need to be shared:

- Do these objects need to be shared within the same folder? If so, create reusable objects within the folder.
- Do these objects need to be shared in several other PowerCenter repository folders? If so, create local shortcuts.
- Do these objects need to be shared across repositories? If so, create a global repository and maintain these reusable objects in the global repository. Create global shortcuts to these reusable objects from the local repositories.
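The scope-based decision above can be sketched as a small helper. This is a hypothetical illustration only; the function name and the returned strings are not part of any Informatica API, they simply restate the three rules in executable form.

```python
# Hypothetical sketch of the reuse decision described above. The function
# and its return strings are illustrative only, not an Informatica API.
def reuse_strategy(shared_across_repos, shared_across_folders, shared_in_folder):
    """Map sharing scope to the recommended PowerCenter reuse mechanism."""
    if shared_across_repos:
        # Widest scope: maintain the object in a global repository and
        # create global shortcuts from each local repository.
        return "global repository with global shortcuts"
    if shared_across_folders:
        return "local shortcuts"
    if shared_in_folder:
        return "reusable object within the folder"
    return "standalone (non-reusable) object"
```

Note that the checks run widest scope first, mirroring the order of the questions in the text: sharing across repositories takes precedence over sharing across folders.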
do session overrides at the session instance level for database connections and pre-/post-session commands. Logging load statistics, failure criteria, and success criteria are usually common pieces of code that are executed for multiple loads in most projects. Some of these common tasks include:

- Notification when there are any reject rows, using email tasks and link conditions
- Successful completion notification based on success criteria, such as the number of rows loaded, using email tasks and link conditions
- Failing the load based on failure criteria, such as load statistics or the status of some critical session, using a Control task
- Calculating, based on previous session completion times, how long a downstream session must wait before it can start, using worklet variables, a Timer task, and an Assignment task

Reusable worklets can be developed to encapsulate the above-mentioned tasks and used in multiple loads. By passing workflow variable values to the worklets and assigning them to worklet variables, one can easily encapsulate common workflow logic.
Mappings
A mapping is a set of source and target definitions linked by transformation objects that define the rules for data transformation. Mappings represent the data flow between sources and targets. In a simple world, a single source table would populate a single target table. However, in practice, this is usually not the case. Sometimes multiple sources of data need to be combined to create a target table, and sometimes a single source of data creates many target tables. The latter is especially true for mainframe data sources where COBOL OCCURS statements litter the landscape. In a typical warehouse or data mart model, each OCCURS statement decomposes to a separate table. The goal here is to create an inventory of the mappings needed for the project. For this exercise, the challenge is to think in individual components of data movement. While the business may consider a fact table and its three related dimensions as a single object in the data mart or warehouse, five mappings may be needed to populate the corresponding star schema with data (i.e., one for each of the dimension tables and two for the fact table, each from a different source system).
Typically, when creating an inventory of mappings, the focus is on the target tables, with an assumption that each target table has its own mapping, or sometimes multiple mappings. While often true, if a single source of data populates multiple tables, this approach yields multiple mappings. Efficiencies can sometimes be realized by loading multiple tables from a single source; by focusing only on the target tables, these efficiencies can be overlooked.

A more comprehensive approach is to create a spreadsheet listing all of the target tables, with a numbered row for each. For each target table, in another column, list the source file or table that will be used to populate it. In the case of multiple source tables per target, create two rows for the target, each with the same number, and list the additional source(s) of data. The table would look similar to the following:

Number  Target Table   Source
1       Customers      Cust_File
2       Products       Items
3       Customer_Type  Cust_File
4       Orders_Item    Tickets
4       Orders_Item    Ticket_Items

When completed, the spreadsheet can be sorted either by target table or by source table. Sorting by source table can help identify potential mappings that populate multiple targets. When using a single source to populate multiple tables for efficiency, be sure to keep restartability and reloadability in mind: the mapping will always load two or more target tables from the source, so there will be no easy way to rerun a single table. In this example, the Customers and Customer_Type tables can potentially be loaded in the same mapping. When merging targets into one mapping in this manner, give both targets the same number. Then re-sort the spreadsheet by number. For the mappings with multiple sources or targets, merge the data back into a single row to generate the inventory of mappings, with each number representing a separate mapping. The resulting inventory would look similar to the following:

Number  Target Table              Source
1       Customers, Customer_Type  Cust_File
2       Products                  Items
4       Orders_Item               Tickets, Ticket_Items
At this point, it is often helpful to record some additional information about each mapping to help with planning and maintenance. First, give each mapping a name, applying the naming standards generated in 3.2 Design Development Architecture. These names distinguish mappings from one another and can also be put on the project plan as individual tasks.
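The sort-by-source step of the inventory exercise can be sketched in a few lines. The data below is the example spreadsheet from the text; the grouping shows how sorting by source surfaces candidate mappings that could load multiple targets from a single source read.

```python
# Minimal sketch of the mapping-inventory exercise: group the
# (number, target, source) rows from the example spreadsheet by source
# to spot sources that feed more than one target table.
from collections import defaultdict

rows = [  # (number, target_table, source) as in the example spreadsheet
    (1, "Customers", "Cust_File"),
    (2, "Products", "Items"),
    (3, "Customer_Type", "Cust_File"),
    (4, "Orders_Item", "Tickets"),
    (4, "Orders_Item", "Ticket_Items"),
]

by_source = defaultdict(set)
for _, target, source in rows:
    by_source[source].add(target)

# Sources feeding more than one target are candidates for a merged mapping:
candidates = {src: tgts for src, tgts in by_source.items() if len(tgts) > 1}
```

Here `candidates` contains Cust_File feeding both Customers and Customer_Type, matching the merge performed in the second table above.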
Next, determine project thresholds for a high, medium, or low number of target rows. For example, in a warehouse where dimension tables are likely to number in the thousands and fact tables in the hundred thousands, the following thresholds might apply:

- High: 100,000 rows +

Assign a likely row volume (high, medium, or low) to each of the mappings based on the expected volume of data to pass through it. These high-level estimates will help determine which mappings are high volume; those mappings will be the first candidates for performance tuning. Add any other columns of information that might be useful to capture about each mapping, such as a high-level description of the mapping functionality, resource (developer) assigned, initial estimate, actual completion time, or complexity rating.
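The volume-rating step above can be sketched as a tiny classifier. Only the "high" threshold (100,000+ rows) appears in the text; the "medium" cutoff below is an illustrative assumption, not part of the guideline.

```python
# Sketch of assigning a volume rating to each mapping. Only the "high"
# threshold (100,000+ rows) comes from the text; the "medium" cutoff is
# an illustrative assumption a project would set for itself.
def volume_rating(expected_rows, high=100_000, medium=10_000):
    """Classify a mapping's expected row volume as high/medium/low."""
    if expected_rows >= high:
        return "high"
    if expected_rows >= medium:
        return "medium"
    return "low"
```

Applied across the inventory spreadsheet, this immediately yields the short list of high-volume mappings to tune first.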
Description
The Informatica tool suite can capture extensive metadata, but the amount of metadata entered depends on the metadata strategy. Detailed information or metadata comments can be entered for all repository objects (e.g., mappings, sources, targets, transformations, ports, etc.). All information about column size and scale, data types, and primary keys is also stored in the repository. The decision on how much metadata to create is often driven by project timelines. While it may be beneficial for a developer to enter detailed descriptions of each column, expression, variable, etc., doing so requires extra time and effort. But once that information is in the Informatica repository, it can be retrieved at any time using the Metadata Reporter. Several out-of-the-box reports are provided, and customized reports can also be created. Several options are available for exporting these reports (e.g., Excel spreadsheet, Adobe .pdf file, etc.). Informatica offers two ways to access the repository metadata:
- Metadata Reporter, a web-based application that allows you to run reports against the repository metadata. This comprehensive tool is powered by the functionality of Informatica's BI reporting tool, Data Analyzer, and is included on the PowerCenter CD.
- Views written using Metadata Exchange (MX). Because Informatica does not support or recommend direct reporting access to the repository, even for select-only queries, MX views are the second way of reporting on repository metadata.
Metadata Reporter
The need for the Informatica Metadata Reporter arose from the number of clients requesting custom and complete metadata reports from their repositories. Metadata Reporter is based on the Data Analyzer and PowerCenter products. It provides Data Analyzer dashboards and metadata reports to help you administer your day-to-day PowerCenter operations, reports to access every Informatica object stored in the repository, and even reports to access objects in the Data Analyzer repository. The architecture of the Metadata Reporter is web-based, with an Internet browser front end. Because Metadata Reporter runs on Data Analyzer, you must have Data Analyzer installed and running before you proceed with Metadata Reporter setup. Setup requires the following .XML files to be imported from the PowerCenter CD, in the sequence listed below:
- Schemas.xml
- Schedule.xml
- GlobalVariables_Oracle.xml (this file is database-specific; Informatica provides GlobalVariables files for DB2, SQL Server, Sybase, and Teradata. Select the appropriate file based on your PowerCenter repository environment.)
- Reports.xml
- Dashboards.xml
Note: If you have set up a new instance of Data Analyzer exclusively for Metadata Reporter, you should have no problem importing these files. However, if you are using an existing instance of Data Analyzer for other reporting purposes, be careful while importing: some of the objects in these files (e.g., global variables, schedules, etc.) may already exist with the same names. You can rename the conflicting objects.
The following folders are created in Data Analyzer when you import the files listed above:

- Data Analyzer Metadata Reporting - contains reports for the Data Analyzer repository itself (e.g., Todays Logins, Reports Accessed by Users Today).
- PowerCenter Metadata Reports - contains reports for the PowerCenter repository. To better organize reports based on their functionality, these reports are grouped into the following subfolders:
  - Configuration Management - a set of reports that provide detailed information on configuration management, including deployment and label details. Contains the subfolders Deployment, Label, and Object Version.
  - Operations - a set of reports that enable users to analyze operational statistics, including server load, connection usage, run times, load times, and number of runtime errors for workflows, worklets, and sessions. Contains the subfolders Session Execution and Workflow Execution.
  - PowerCenter Objects - a set of reports that enable users to identify all types of PowerCenter objects, their properties, and their interdependencies on other objects within the repository. Contains the subfolders Mappings, Mapplets, Metadata Extension, Server Grids, Sessions, Sources, Target, Transformations, Workflows, and Worklets.
  - Security - a set of reports that provide detailed information on users, groups, and their association within the repository.
Informatica recommends retaining this folder organization, adding new folders if necessary. The Metadata Reporter provides 44 standard reports, which can be customized with the use of parameters and wildcards. Metadata Reporter is accessible from any computer with a browser that has access to the web server where the Metadata Reporter is installed, even without the other Informatica client tools installed on that computer. The Metadata Reporter connects to the PowerCenter repository using JDBC drivers. Be sure the proper JDBC drivers are installed for your database platform. (Note: You can also use the JDBC-to-ODBC bridge to connect to the repository, e.g., jdbc:odbc:<data_source_name>.)
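For programmatic access, the same repository metadata can be read through the MX views mentioned above, over an ODBC or JDBC connection. The sketch below is hedged: the DSN name, credentials, and the REP_ALL_MAPPINGS view and column names are assumptions that should be verified against the MX documentation for your PowerCenter version.

```python
# Hedged sketch: querying repository metadata through an MX view rather
# than the raw repository tables (which Informatica does not support for
# reporting). The DSN, credentials, and the REP_ALL_MAPPINGS view/column
# names are assumptions to verify against your PowerCenter version.

def mx_mapping_query(folder_pattern="%"):
    """Build a SELECT against an MX mapping view, filtered with the same
    %/_ wildcard semantics the Metadata Reporter uses."""
    return ("SELECT mapping_name, subject_area FROM REP_ALL_MAPPINGS "
            f"WHERE subject_area LIKE '{folder_pattern}'")

if __name__ == "__main__":
    import pyodbc  # third-party ODBC client; any DB-API driver works
    conn = pyodbc.connect("DSN=pc_repository;UID=repo_user;PWD=secret")
    for mapping_name, folder in conn.cursor().execute(mx_mapping_query()):
        print(folder, mapping_name)
    conn.close()
```

Keeping the query construction separate from the connection makes the folder filter reusable across database platforms.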
- Metadata Reporter is comprehensive. You can run reports on any repository, and the reports provide information about all types of metadata objects.
- Metadata Reporter is easily accessible. Because it is web-based, you can generate reports from any machine that has access to the web server.
- The reports are customizable. The Metadata Reporter allows you to set parameters for the metadata objects to include in the report.
- The Metadata Reporter allows you to move easily from one report to another. The name of any metadata object that displays on a report links to an associated report, so as you view a report, you can generate reports for objects on which you need more information.
The following list shows the reports provided by the Metadata Reporter, along with their locations and brief descriptions.

Reports for the PowerCenter repository (under Public Folders > PowerCenter Metadata Reports):

Configuration Management
- Deployment Group (Deployment)
- Deployment Group History (Deployment) - Displays, by group, deployment groups and the dates they were deployed. It also displays the source and target repository names of the deployment group for all deployment dates. This is a primary report in an analytic workflow.
- Labels (Labels) - Displays labels created in the repository for any versioned object, by repository.
- All Object Version History (Object Version) - Displays all versions of an object by the date the object is saved in the repository. This is a standalone report.

Operations
- Server Load by Day of Week (Session Execution) - Displays the total number of sessions that ran, and the total session run duration, for any day of week in any given month of the year, by server and repository. For example, all Mondays in September are represented in one row if that month had 4 Mondays.
- Session Run Details (Session Execution) - Displays session run details for any start date, by repository and folder. This is a primary report in an analytic workflow.
- Target Table Load Analysis (Last Month) (Session Execution) - Displays the load statistics for each table for the last month, by repository and folder. This is a primary report in an analytic workflow.
- Workflow Run Details (Workflow Execution) - Displays the run statistics of all workflows, by repository and folder. This is a primary report in an analytic workflow.
- Worklet Run Details (Workflow Execution) - Displays the run statistics of all worklets, by repository and folder. This is a primary report in an analytic workflow.

PowerCenter Objects
- Mapping List (Mappings) - Displays mappings by repository and folder, along with properties of the mapping such as the number of sources used in a mapping, the number of transformations, and the number of targets. This is a primary report in an analytic workflow.
- Mapping Lookup Transformations (Mappings) - Displays Lookup transformations used in a mapping, by repository and folder. This is a standalone report and also the first node in the analytic workflow associated with the Mapping List primary report.
- Mapping Shortcuts (Mappings) - Displays mappings defined as a shortcut, by repository and folder.
- Source to Target Dependency (Mappings) - Displays the data flow from source to target, by repository and folder. The report lists all the source and target ports, the mappings in which the ports are connected, and the transformation expression that shows how data for the target port is derived.
- Mapplet List (Mapplets) - Displays mapplets available by repository and folder, along with properties of the mapplet such as the number of sources used in a mapplet, the number of transformations, or the number of targets. This is a primary report in an analytic workflow.
- Mapplet Lookup Transformations (Mapplets) - Displays all Lookup transformations used in a mapplet, by folder and repository. This is a standalone report and also the first node in the analytic workflow associated with the Mapplet List primary report.
- Mapplet Shortcuts (Mapplets) - Displays mapplets defined as a shortcut, by repository and folder.
- Unused Mapplets in Mappings (Mapplets) - Displays mapplets defined in a folder but not used in any mapping in that folder.
- Metadata Extensions Usage (Metadata Extensions) - Displays, by repository and folder, reusable metadata extensions used by any object, along with counts of all objects using each metadata extension.
- (Server grids report) (Server Grids) - Displays all server grids and the servers associated with each grid, including host name, port number, and internet protocol address of the servers.
- Session List (Sessions) - Displays all sessions and their properties, by repository and folder. This is a primary report in an analytic workflow.
- Source List (Sources) - Displays relational and non-relational sources by repository and folder, along with the source properties. This is a primary report in an analytic workflow.
- Source Shortcuts (Sources) - Displays sources that are defined as shortcuts, by repository and folder.
- Target List (Targets) - Displays relational and non-relational targets available by repository and folder, along with the target properties. This is a primary report in an analytic workflow.
- Target Shortcuts (Targets) - Displays targets that are defined as shortcuts, by repository and folder.
- Transformation List (Transformations) - Displays transformations defined by repository and folder. This is a primary report in an analytic workflow.
- Transformation Shortcuts (Transformations) - Displays transformations that are defined as shortcuts, by repository and folder.
- Scheduler (Reusable) List (Workflows) - Displays all the reusable schedulers defined in the repository, with their descriptions and properties, by repository and folder. This is a primary report in an analytic workflow.
- Workflow List (Workflows) - Displays workflows and workflow properties, by repository and folder. This is a primary report in an analytic workflow.
- Worklet List (Worklets) - Displays worklets and worklet properties, by repository and folder. This is a primary report in an analytic workflow.

Security
- Users By Group - Displays users by repository and group.

Reports for the Data Analyzer repository (under Public Folders > Data Analyzer Metadata Reporting):

- Bottom 10 Least Accessed Reports this Year - Displays the ten least accessed reports for the current year. Its analytic workflow provides access details such as user name and access time.
- (Report access details) - Part of the analytic workflows "Top 10 Most Accessed Reports This Year", "Bottom 10 Least Accessed Reports this Year", and "Usage by Login (Month To Date)".
- Report Activity Details for Current Month - Provides information about reports accessed in the current month to date.
- Report Refresh Schedule - Provides information about the next scheduled update for scheduled reports. It can be used to decide schedule timing for various reports for optimum system performance.
- (Todays report activity details) - Part of the analytic workflow for "Todays Logins". It provides detailed information on the reports accessed by users today, and can be used independently to get comprehensive information about today's report activity.
- Todays Logins - Provides the login count and average login duration for users who logged in today.
- Todays Report Usage by Hour - Provides the number of reports accessed today for each hour. The attached analytic workflow provides more details on the reports accessed, and the users who accessed them, during the selected hour.
- Top 10 Most Accessed Reports this Year - Shows the ten most accessed reports for the current year. Its analytic workflow provides access details such as user name and access time.
- Top 5 Logins (Month To Date) - Provides users and their corresponding login counts for the current month to date. The attached analytic workflow provides more details about the reports accessed by a selected user.
- Top 5 Longest Running On-Demand Reports (Month To Date) - Shows the five longest running on-demand reports for the current month to date, displaying the average total response time, average DB response time, and average Data Analyzer response time (all in seconds) for each report shown.
- Top 5 Longest Running Scheduled Reports (Month To Date) - Shows the five longest running scheduled reports for the current month to date, displaying the average response time (in seconds) for each report shown.
- (Scheduled report errors) - Provides the number of errors encountered during execution of reports attached to schedules. The analytic workflow "Scheduled Report Error Details for Today" is attached to it.
- User Logins (Month To Date) - Provides users and their corresponding login counts for the current month to date. The attached analytic workflow provides more details about the reports accessed by a selected user.
- (Users never logged in) - Provides information about users who exist in the repository but have never logged in. This information can be used to make administrative decisions about disabling accounts.
Wildcards
The Metadata Reporter supports two wildcard characters:
- Percent symbol (%): represents any number of characters and spaces.
- Underscore (_): represents one character or space.
You can use wildcards in any number and combination in the same parameter. Leaving a parameter blank returns all values, the same as using %. The following examples show how you can use the wildcards to set parameters. Suppose you have the following values available to select: items, items_in_promotions, order_items, and promotions.
The following list shows the return values for some wildcard combinations you can use:

Wildcard Combination    Return Values
%                       items, items_in_promotions, order_items, promotions
<blank>                 items, items_in_promotions, order_items, promotions
%items                  items, order_items
item_                   items
item%                   items, items_in_promotions
___m%                   items, items_in_promotions, promotions
%pr_mo%                 items_in_promotions, promotions
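The wildcard semantics above mirror SQL LIKE. As an illustrative sketch (not part of the product), the following Python snippet reproduces the table's matching rules by translating % and _ into a regular expression:

```python
import re

def wildcard_to_regex(pattern: str) -> str:
    """Translate Metadata Reporter wildcards to a regex:
    % matches any run of characters, _ matches exactly one.
    A blank parameter behaves like %."""
    if pattern == "":
        pattern = "%"
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)

def match_values(pattern, values):
    # fullmatch: the whole value must satisfy the pattern
    rx = re.compile(wildcard_to_regex(pattern))
    return [v for v in values if rx.fullmatch(v)]

# the sample values from the table above
values = ["items", "items_in_promotions", "order_items", "promotions"]
```

For example, `match_values("%pr_mo%", values)` returns the two values containing "pr", one character, then "mo", matching the last row of the table.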
A printout of the mapping object flow is also useful for clarifying how objects are connected. To produce such a printout, arrange the mapping in Designer so the full mapping appears on the screen, and then use Alt+PrtSc to copy the active window to the clipboard. Use Ctrl+V to paste the copy into a Word document. For a detailed description of how to run these reports, consult the Metadata Reporter Guide included in the PowerCenter documentation.
- Self-contained Software Development Kit (SDK). One of the key advantages of MX views is that they are part of the repository database and thus can be used independently of any of the Informatica software products. The same requirement also holds for MX2, leading to the development of a self-contained API Software Development Kit that can be used independently of the client or server products.
- Extensive metadata content, especially multidimensional models for OLAP. A number of BI tools and upstream data warehouse modeling tools require complex multidimensional metadata, such as hierarchies, levels, and various relationships. This type of metadata was specifically designed and implemented in the repository to accommodate the needs of the Informatica partners by means of the new MX2 interfaces.
- Ability to write (push) metadata into the repository. Because of the limitations associated with relational views, MX could not be used for writing or updating metadata in the Informatica repository. As a result, such tasks could only be accomplished by directly manipulating the repository's relational tables. The MX2 interfaces provide metadata write capabilities along with the appropriate verification and validation features to ensure the integrity of the metadata in the repository.
- Complete encapsulation of the underlying repository organization by means of an API. One of the main challenges with MX views and the interfaces that access the repository tables is that they are directly exposed to any schema changes of the underlying repository database. As a result, maintaining the MX views and direct interfaces requires a major effort with every major upgrade of the repository. MX2 alleviates this problem by offering a set of object-based APIs that are abstracted away from the details of the underlying relational tables, thus providing an easier mechanism for managing schema evolution.
- Integration with third-party tools. MX2 offers the object-based interfaces needed to develop more sophisticated procedural programs that can tightly integrate the repository with third-party data warehouse modeling and query/reporting tools.
- Synchronization of metadata based on changes from upstream and downstream tools. Given that metadata is likely to reside in various databases and files in a distributed software environment, synchronizing changes and updates ensures the validity and integrity of the metadata. The object-based technology used in MX2 provides the infrastructure needed to implement automatic metadata synchronization and change propagation across the different tools that access the PowerCenter repository.
- Interoperability with other COM-based programs and repository interfaces. MX2 interfaces comply with Microsoft's Component Object Model (COM) interoperability protocol. Therefore, any existing or future program that is COM-compliant can seamlessly interface with the PowerCenter repository by means of MX2.
Description
Regular actions, such as taking backups, testing backup and restore procedures, and deleting unwanted information from the repository, keep the repository performing well.
Managing the Repository
The PowerCenter Administrator plays a vital role in managing and maintaining the repository and metadata. The role involves tasks such as securing the repository, managing the users and roles, maintaining backups, and managing the repository through such activities as removing unwanted metadata, analyzing tables, and updating statistics.
Repository backup
Repository backup can be performed using the Repository Server Administration Console client tool or the pmrep command-line program. Backups using pmrep can be automated and scheduled to run regularly.
A shell script that calls pmrep can be scheduled to run as a cron job for regular backups. Alternatively, the script can be called from PowerCenter via a command task; the command task can be placed in a workflow and scheduled to run daily.
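A minimal sketch of such an automated backup, here composed in Python rather than shell; the pmrep flag names (-r, -d, -n, -x, -o) follow the pmrep command-line reference and should be verified against your PowerCenter release, and the file-naming scheme is an assumption:

```python
import datetime
import subprocess

def backup_commands(repo, domain, user, password, backup_dir):
    """Compose the two pmrep calls for a scheduled repository backup:
    connect to the repository, then write a date-stamped backup file."""
    stamp = datetime.date.today().strftime("%Y%m%d")
    outfile = f"{backup_dir}/repo_backup_{stamp}.rep"
    connect = ["pmrep", "connect", "-r", repo, "-d", domain,
               "-n", user, "-x", password]
    backup = ["pmrep", "backup", "-o", outfile]
    return connect, backup

def run_backup(repo, domain, user, password, backup_dir):
    # check=True raises on a non-zero exit, so a cron job or
    # PowerCenter command task sees the failure.
    for cmd in backup_commands(repo, domain, user, password, backup_dir):
        subprocess.run(cmd, check=True)
```

The same two-step sequence (connect, then backup) applies whether the wrapper is shell or Python; only the scheduling mechanism differs.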
The following practices are useful for maintaining backups:

- Frequency: Backup frequency depends on the activity in the repository. For production repositories, backup is recommended once a month or prior to a major release. For development repositories, backup is recommended once a week or once a day, depending upon the team size.
- Backup file sizes: Because backup files can be very large, Informatica recommends compressing them using a utility such as winzip or gzip.
- Storage: For security reasons, Informatica recommends maintaining backups on a different physical device than the repository itself.
- Move backups offline: Review the backups on a regular basis to determine how long they need to remain online. Any that are not required online should be moved offline, to tape, as soon as possible.
Restore repository
Although the repository restore function is used primarily as part of disaster recovery, it is also useful for testing the validity of the backup files and for exercising the recovery process on a regular basis. Informatica recommends testing the backup files and recovery process at least once each quarter. The repository can be restored using the Repository Server Administration Console client tool or the pmrepagent command-line program.
Restore folders
There is no easy way to restore a single folder from backup. First restore the backup into a new repository; then use the Repository Manager client tool to copy the entire folder from the restored repository into the target repository.
After an object has been deleted from the repository, you cannot create another object with the same name until the deleted object has been completely removed from the repository. Use the purge command to completely remove deleted objects. Keep in mind, however, that you must purge all versions of a deleted object to completely remove it from the repository.
Truncating Logs
You can truncate the log information (for sessions and workflows) stored in the repository using either Repository Manager or the pmrep command-line program. Logs can be truncated for the entire repository or for a particular folder. Options allow truncating all log entries or selected entries based on date and time.
Repository Performance
Analyzing the repository tables (updating their statistics) can help improve repository performance. Because this process should be carried out for all tables in the repository, a script offers the most efficient means. You can schedule the script to run using either an external scheduler or a PowerCenter workflow with a command task that calls the script.
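A minimal sketch of such a script generator, assuming the repository resides on Oracle; the OPB_* table names are illustrative (PowerCenter repository tables typically carry the OPB_ prefix), and on current Oracle releases DBMS_STATS.GATHER_TABLE_STATS is the preferred alternative to ANALYZE:

```python
def stats_script(tables):
    """Emit one ANALYZE statement per repository table, producing a
    SQL script that an external scheduler or command task can run."""
    return "\n".join(
        f"ANALYZE TABLE {t} COMPUTE STATISTICS;" for t in tables
    )

# In practice the table list would come from a query such as
# SELECT table_name FROM user_tables WHERE table_name LIKE 'OPB_%'
print(stats_script(["OPB_TASK", "OPB_SESS_TASK_LOG"]))
```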
Managing Metadata
The following paragraphs list the queries that are most often used to report on PowerCenter metadata. The queries are written for PowerCenter repositories on Oracle and are based on PowerCenter 6 and PowerCenter 7. Minor changes in the queries may be required for PowerCenter repositories residing on other databases.
Failed Sessions
The following query lists the sessions that failed in the last day. To make it work for the last n days, replace SYSDATE - 1 with SYSDATE - n.

SELECT Subject_Area AS Folder,
       Session_Name,
       Last_Error AS Error_Message,
       DECODE (Run_Status_Code, 3, 'Failed', 4, 'Stopped', 5, 'Aborted') AS Status,
       Actual_Start AS Start_Time,
       Session_TimeStamp
FROM   rep_sess_log
WHERE  run_status_code != 1
AND    TRUNC(Actual_Start) BETWEEN TRUNC(SYSDATE - 1) AND TRUNC(SYSDATE)
Invalid Tasks
The following query lists the folder name, task name, task type, version number, and last-saved date for all invalid tasks.

SELECT SUBJECT_AREA AS FOLDER_NAME,
       DECODE(IS_REUSABLE, 1, 'Reusable', ' ') || ' ' || TASK_TYPE_NAME AS TASK_TYPE,
       TASK_NAME AS OBJECT_NAME,
       VERSION_NUMBER,   -- comment out for V6
       LAST_SAVED
FROM   REP_ALL_TASKS
WHERE  IS_VALID = 0
--AND  CHECKOUT_USER_ID = 0   -- comment out for V6
--AND  is_visible = 1         -- comment out for V6
ORDER BY SUBJECT_AREA, TASK_NAME
Load Counts
The following query lists the load counts (number of rows loaded and rejected) for each session run in the last day, along with its status.

SELECT subject_area,
       workflow_name,
       session_name,
       DECODE (Run_Status_Code, 1, 'Succeeded', 3, 'Failed', 4, 'Stopped', 5, 'Aborted') AS Session_Status,
       successful_rows,
       failed_rows,
       actual_start
FROM   REP_SESS_LOG
WHERE  TRUNC(Actual_Start) BETWEEN TRUNC(SYSDATE - 1) AND TRUNC(SYSDATE)
ORDER BY subject_area, workflow_name, session_name, Session_Status
Description
Metadata extensions, as the name implies, help you extend the metadata stored in the repository by associating information with individual objects in the repository. Informatica Client applications can contain two types of metadata extensions: vendor-defined and user-defined.
- Vendor-defined. Third-party application vendors create vendor-defined metadata extensions. You can view and change the values of vendor-defined metadata extensions, but you cannot create, delete, or redefine them.
- User-defined. You create user-defined metadata extensions using PowerCenter clients. You can create, edit, delete, and view user-defined metadata extensions. You can also change the values of user-defined extensions.
You can create reusable or non-reusable metadata extensions. You associate reusable metadata extensions with all repository objects of a certain type; so, when you create a reusable extension for a mapping, it is available for all mappings. Vendor-defined metadata extensions are always reusable.

Non-reusable extensions are associated with a single repository object. Therefore, if you edit a target and create a non-reusable extension for it, that extension is available only for the target you edit; it is not available for other targets. You can promote a non-reusable metadata extension to reusable, but you cannot change a reusable metadata extension to non-reusable.

Metadata extensions can be created for the following repository objects:
- Transformations (Expressions, Filters, etc.)
- Mappings
- Mapplets
- Sessions
- Tasks
- Workflows
- Worklets
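The one-way promotion rule described above (non-reusable extensions can become reusable, never the reverse) can be sketched as a toy model; this is purely illustrative and not an Informatica API:

```python
class MetadataExtension:
    """Toy model of the reusability rule for metadata extensions."""

    def __init__(self, name, reusable=False):
        self.name = name
        self.reusable = reusable

    def promote(self):
        # promotion is a one-way transition
        self.reusable = True

    def demote(self):
        # the reverse transition is disallowed by the repository
        if self.reusable:
            raise ValueError("reusable extensions cannot become non-reusable")
```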
Metadata extensions offer an easy and efficient method of documenting important information associated with repository objects. For example, when you create a mapping, you can store the mapping owner's name and contact information with the mapping; when you create a source definition, you can enter the name of the person who created or imported the source.

The power of metadata extensions is most evident in the reusable type. When you create a reusable metadata extension for any type of repository object, that metadata extension becomes part of the properties of that type of object. For example, suppose you create a reusable metadata extension for source definitions called SourceCreator. When you create or edit any source definition in the Designer, the SourceCreator extension appears on the Metadata Extensions tab, and anyone who creates or edits a source can enter the name of the person who created the source into this field.

You can create, edit, and delete non-reusable metadata extensions for sources, targets, transformations, mappings, and mapplets in the Designer. You can create, edit, and delete non-reusable metadata extensions for sessions, workflows, and worklets in the Workflow Manager. You can also promote non-reusable metadata extensions to reusable extensions using the Designer or the Workflow Manager, and you can create reusable metadata extensions in either tool.

You can create, edit, and delete reusable metadata extensions for all types of repository objects using the Repository Manager. If you want to create, edit, or delete metadata extensions for multiple objects at one time, use the Repository Manager. When you edit a reusable metadata extension, you can modify the properties Default Value, Permissions, and Description.

Note: You cannot create non-reusable metadata extensions in the Repository Manager. All metadata extensions created in the Repository Manager are reusable. Reusable metadata extensions are repository-wide.
You can also migrate metadata extensions from one environment to another. When you perform a copy folder operation, the Copy Folder Wizard copies the metadata extension values associated with those objects to the target repository. A non-reusable metadata extension is copied as non-reusable in the target repository; a reusable metadata extension is copied as reusable, and the object retains the individual values. You can edit and delete those extensions, as well as modify the values.

Metadata extensions provide extended metadata reporting capabilities. Using the Informatica MX2 API, you can create useful reports on metadata extensions; for example, you can create and view a report on all the mappings owned by a specific team member. You can use various programming environments such as Visual Basic, Visual C++, C++, and the Java SDK to write API modules. The Informatica Metadata Exchange SDK 6.0 installation CD includes sample Visual Basic and Visual C++ applications.

Additionally, metadata extensions can be populated from data modeling tools such as ERwin, Oracle Designer, and PowerDesigner via Informatica Metadata Exchange for Data Models. With Informatica Metadata Exchange for Data Models, the Informatica Repository interface can retrieve and update the extended properties of source and target definitions in PowerCenter repositories. Extended properties are the descriptive, user-defined, and other properties derived from your data modeling tool; you can map any of these properties to the metadata extensions already defined in the source or target object in the Informatica repository.
Using PowerCenter Metadata Manager and Metadata Exchange Views for Quality Assurance

Challenge
The role that the PowerCenter repository can play in an automated QA strategy is often overlooked and underappreciated. The repository is essentially a database about the transformation process and the software developed to implement it; the challenge is to devise a method to exploit this resource for QA purposes. To address this challenge, Informatica PowerCenter provides several pre-packaged reports (PowerCenter Repository Reports) that can be installed on a Data Analyzer or Metadata Manager installation. These reports provide a wealth of useful information about PowerCenter object metadata and operational metadata that can be used for quality assurance.
Description
Before considering the mechanics of an automated QA strategy, it is worth emphasizing that quality should be built in from the outset. If the project involves multiple mappings repeating the same basic transformation pattern(s), it is probably worth constructing a virtual production line. This is essentially a template-driven approach to accelerate development and enforce consistency through the use of the following aids:
- A shared template for each type of mapping.
- Checklists to guide the developer through the process of adapting the template to the mapping requirements.
- Macros/scripts to generate productivity aids such as SQL overrides, etc.
It is easier to ensure quality from a standardized base than to rely on developers to accurately repeat the same basic keystrokes.

Underpinning the exploitation of the repository for QA purposes is the adoption of naming standards that categorize components. By running the appropriate query on the repository, it is possible to identify those components whose attributes differ from those predicted for the category; thus, it is quite possible to automate some aspects of QA. Clearly, the function of naming conventions is not just to standardize, but also to provide logical access paths into the information in the repository: names can be used to identify patterns and/or categories and thus allow assumptions to be made about object attributes. Along with the facilities provided to query the repository, such as the Metadata Exchange (MX) Views and PowerCenter Metadata Manager, this opens the door to an automated QA strategy.

For example, consider the following situation: an EXTRACT mapping/session should always truncate the target table before loading; conversely, the TRANSFORM and LOAD phases should never truncate a target. Possible code errors in this respect can be identified as follows:
- Define a mapping/session naming standard to indicate EXTRACT, TRANSFORM, or LOAD.
- Develop a query on the repository to search for sessions named EXTRACT that do not have the truncate target option set.
- Develop a query on the repository to search for sessions named TRANSFORM or LOAD that do have the truncate target option set.
- Provide a facility to allow developers to run both queries before releasing code to the test environment.
Alternatively, a standard may have been defined to prohibit unconnected output ports from transformations (such as expressions) in a mapping. These can be identified very easily from the MX view REP_MAPPING_UNCONN_PORTS.

The following bullets represent a high-level overview of the steps involved in automating QA:
- Review the transformations/mappings/sessions/workflows and allocate them to broadly representative categories.
- Identify the key attributes of each category.
- Define naming standards to identify the category for transformations/mappings/sessions/workflows.
- Analyze the MX Views to source the key attributes.
- Develop the query to compare actual and expected attributes for each category.
After you have completed these steps, it is possible to develop a utility that compares actual and expected attributes, which developers can run before releasing code into any test environment. Such a utility may incorporate the following processing stages:
- Execute a profile to assign environment variables (e.g., repository schema user, password, etc.).
- Select the folder to be reviewed.
- Execute the query to find exceptions.
- Report the exceptions in an accessible format.
- Exit with failure if exceptions are found.
TIP: Remember that any queries that bypass the MX views and read the repository tables directly will require modification if PowerCenter is subsequently upgraded; Informatica therefore recommends against them.
The principal objective of any QA strategy is to ensure that developed components adhere to standards and to identify defects before incurring overhead during the migration from development to test/production environments. Qualitative, peer-based reviews of PowerCenter objects due for release obviously have their part to play in this process.
Using Metadata Manager and PowerCenter Repository Reports for Quality Assurance
The need for the Informatica Metadata Reporter was identified from a number of clients requesting custom and complete metadata reports from their repositories. Metadata Reporter provides Data Analyzer dashboards and metadata reports to help you administer your day-to-day PowerCenter operations. This section focuses primarily on how these reports, and custom reports, can help ease the QA process. The following reports can help identify regressions in load performance:
- Server Load by Day of the Week can help determine the load on the server before and after QA migrations, and may help balance loads through the week by modifying the schedules.
- The Target Table Load Analysis can help identify any data regressions through the number of records loaded in each target (if a baseline was established before the migration/upgrade).
The Failed Session report lists failed sessions at a glance, which is very helpful after a major QA migration or QA of the Informatica upgrade process. During large deployments to QA, the code review team can look at the following reports to determine whether standards (i.e., naming standards, comments for repository objects, metadata extension usage, etc.) were followed. Accessing this information from PowerCenter Repository Reports typically reduces the time required for review because the reviewer doesn't need to open each mapping and check for these details. All of the following are out-of-the-box reports provided by Informatica:
- Label report
- Mappings list
- Mapping shortcuts
- Mapplet list
- Mapplet shortcuts
- Sessions list
- Worklets list
- Workflows list
- Source list
- Target list
- Custom reports based on the review requirements

In addition, note that the following reports are also useful during migration and upgrade processes:
- Invalid object reports and the deployment group report in the QA repository help determine which deployments caused the invalidations.
- An invalid object report against the Development repository helps identify the invalid objects that are part of a deployment before QA migration.
The following table summarizes some of the reports that Informatica ships with a PowerCenter Repository Reports installation:

1. Deployment Group: Displays deployment groups by repository.
2. Deployment Group History: Displays, by group, deployment groups and the dates they were deployed. It also displays the source and target repository names of the deployment group for all deployment dates.
3. Labels: Displays labels created in the repository for any versioned object, by repository.
4. All Object Version History: Displays all versions of an object by the date the object is saved in the repository.
5. Server Load by Day of Week: Displays the total number of sessions that ran, and the total session run duration, for any day of the week in any given month of the year, by server by repository. For example, all Mondays in September are represented in one row if that month had four Mondays.
6. Session Run Details: Displays session run details for any start date, by repository by folder.
7. Target Table Load Analysis (Last Month): Displays the load statistics for each table for last month, by repository by folder.
8. Workflow Run Details: Displays the run statistics of all workflows, by repository by folder.
9. Worklet Run Details: Displays the run statistics of all worklets, by repository by folder.
10. Mapping List: Displays mappings by repository and folder, along with properties of the mapping such as the number of sources, transformations, and targets used in the mapping.
11. Displays Lookup transformations used in a mapping, by repository and folder.
12. Displays mappings defined as a shortcut, by repository and folder.
13. Displays the data flow from the source to the target, by repository and folder. The report lists all the source and target ports, the mappings in which the ports are connected, and the transformation expression that shows how data for the target port is derived.
14. Mapplet List: Displays mapplets available by repository and folder, along with properties of the mapplet such as the number of sources, transformations, and targets used in the mapplet.
15. Displays all Lookup transformations used in a mapplet, by folder and repository.
16. Displays mapplets defined as a shortcut, by repository and folder.
17. Displays mapplets defined in a folder but not used in any mapping in that folder.
18. Displays, by repository by folder, reusable metadata extensions used by any object, along with counts of all objects using each metadata extension.
19. Displays all server grids and the servers associated with each grid, including host name, port number, and internet protocol address of the servers.
20. Session List: Displays all sessions and their properties, by repository by folder. This is a primary report in a data integration workflow.
21. Source List: Displays relational and non-relational sources by repository and folder, along with the source properties. This is a primary report in a data integration workflow.
22. Displays sources that are defined as shortcuts, by repository and folder.
23. Displays relational and non-relational targets available by repository and folder, along with the target properties. This is a primary report in a data integration workflow.
24. Displays targets that are defined as shortcuts, by repository and folder.
25. Displays transformations defined, by repository and folder. This is a primary report in a data integration workflow.
26. Transformation Shortcuts: Displays transformations that are defined as shortcuts, by repository and folder.
27. Scheduler (Reusable) List: Displays all the reusable schedulers defined in the repository, with their descriptions and properties, by repository by folder.
28. Workflow List: Displays workflows and workflow properties, by repository by folder.
29. Worklet List: Displays worklets and worklet properties, by repository by folder.
Description
Metadata Manager Console Settings

Logging into the Metadata Manager Warehouse

You can use the Metadata Manager console to access one Metadata Manager Warehouse repository at a time. When logging in to the Metadata Manager console for the first time, you need to set up the connection information along with the data source for the Integration Repository. In subsequent logins, you need to enter only the Metadata Manager Warehouse database password.
For Windows:
- \\Informatica Server Name\PM SrcFiles directory.
- Click Save when you are finished.
For UNIX:
- Select the ftp option.
- FTP Server Name: Integration Service host name
- Port Number: 21 (default)
- User Name: UNIX login name to the Integration Server
- ftp directory: /Integration Service home directory/SrcFiles
Note: Metadata Manager 8.1 does not support secure ftp connections to the Integration server.
- Add a new SQL repository from Metadata Manager (Web interface).
- Log in to the Metadata Manager Console. Click the Source Repository Management tab. The new SQL Server XConnect added above should show up in the console. Select the SQL Server XConnect and click the Configuration Properties tab. Enter the following information related to the XConnect:
  - Database user name and password to access the SQL Server data dictionary
  - ODBC connection name to connect to the SQL Server data dictionary
  - Database Type: Microsoft SQL Server
  - Connection String: for a default instance, SQL Server Name@Database Name; for a named instance, Server Name\Instance Name@Database Name
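The two connect-string formats can be sketched as a small helper; the server, instance, and database names used here are placeholders, and the Server@Database / Server\Instance@Database formats are taken from the settings above:

```python
def sqlserver_connect_string(server, database, instance=None):
    """Build the Metadata Manager connect string for SQL Server:
    Server@Database for a default instance,
    Server\\Instance@Database for a named instance."""
    host = server if instance is None else f"{server}\\{instance}"
    return f"{host}@{database}"
```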
- Click Save when you have finished entering this information.
- To configure the list of user schemas to load, click the Parameter Setup tab and select the list of schemas to load (these are listed in the Included Schemas). Click Save when you are finished.

The XConnect is ready to be loaded. After a successful load of the SQL Server metadata, you can see the metadata in the Web interface.
To configure SQL Server out-of-the-box XConnects to run on the PowerCenter server in a UNIX environment, follow these steps:
- Install DataDirect ODBC drivers on the PowerCenter server location.
- Configure .odbc.ini just like any other ODBC setup.
- Create a repository of type Microsoft SQL Server using the Metadata Manager browser.
- When configuring the repository in the configuration console, specify a connect string as <SQLserverhost>@DBname and save the configuration.
- Using Workflow Manager, delete the connection it created (R<RepoUID>) and create an ODBC connection with the same name as R<RepoUID>, specifying the same connect string as configured in .odbc.ini.
Oracle XConnect
Specify a user name and password to access the Oracle database metadata. Be sure that the user has the Select Any Table privilege and Select permissions on the following objects in the specified schemas: tables, views, indexes, packages, procedures, functions, sequences, triggers, and synonyms. Also ensure that the user has Select permissions on the SYS.v_$instance view. One XConnect is needed for each Oracle instance. To extract metadata from Oracle, perform the following steps:
- Add a new SQL repository from Metadata Manager (Web interface).
- Log in to the Metadata Manager Console. Click the Source Repository Management tab. The Oracle XConnect added above should show up in the console. Select the Oracle XConnect and click the Configuration Properties tab. Enter the following information related to the XConnect:
  - Database user name and password to access the Oracle instance data dictionary
  - ODBC connection name to connect to the Oracle instance data dictionary
  - Database Type: Oracle
  - Connection String: Oracle instance name
- Click Save when you have finished entering this information.
- To configure the list of schemas to load, click the Parameter Setup tab and select the list of schemas to load (these are listed in the Included Schemas). Click Save when finished.

The XConnect is ready to be loaded. After a successful load of the Oracle metadata, you can see the metadata in the Web interface.
Teradata XConnect
Specify a user name and password to access the Teradata metadata. Be sure that the user has access to all the system DBC tables. To extract metadata from Teradata Server, perform the following steps:
- Add a new SQL repository from Metadata Manager (Web interface).
- Log in to the Metadata Manager Console and click the Source Repository Management tab. The new Teradata XConnect added above should show up in the console. Select the Teradata XConnect and click the Configuration Properties tab. Enter the following information related to the XConnect:
  - Database user name and password to access the Teradata data dictionary
  - ODBC connection name to connect to the Teradata data dictionary
  - Database Type: Teradata
  - Connection String
- Click Save when you have finished entering this information.
- To configure the list of user databases to load, click the Parameter Setup tab. Select the list of databases to load (these are listed in the Included Schemas). Click Save when you are finished.

The XConnect is ready to be loaded. After a successful load of the Teradata metadata, you can see the metadata in the Web interface.
ERwin XConnect
The following formats are required to extract metadata from ERwin:

- For ERwin 3.5, save the data model in ER1 format.
- For ERwin 4.x, save the ERwin model that you want to load into Metadata Manager in XML format.
Log in to Metadata Manager (Web interface) and select the Administration tab. Under Repository Management, select Repositories. Click Add to add a new repository. Enter all the information related to the ERwin XConnect. Repository Type and Name are mandatory fields.) Log into the Metadata Manager Console and click the Source Repository Management tab. The ERwin XConnect added above should show up in the console. Select the ERwin XConnect and click the Configuration Properties tab.
r r r
Each XConnect allows you to add multiple files. Source System Version = Select the appropriate option. Click Add to add the ERwin file. Browse to the location of the ERwin file. The directory path of the file is stored locally. To load a new ERwin file, select the current file, then click Delete and add the new file. Select the Refresh? checkbox to refresh the metadata from the file. If you do not want to update the metadata from a particular metadata file (i.e., if the file does not contain any changes since the last metadata load), uncheck this box. Click Save when you are finished.
If you select Edit/assigned Connections for Lineage Report, set the connection assignments between the ERwin model and the underlying database schemas. Click OK when you are finished. The XConnect is ready to be loaded. After a successful load of the ERwin metadata, you can see the metadata in the Web interface.
ER-Studio XConnect
To extract metadata from ER-Studio, perform the following steps:

- Log in to Metadata Manager (Web interface) and select the Administration tab. Under Repository Management, select Repositories and click Add to add a new repository. Enter all the information related to the ER-Studio XConnect (Note: Repository Type and Name are mandatory fields).
- Log in to the Metadata Manager Console and click the Source Repository Management tab. The ER-Studio XConnect added above should show up in the console.
- Select the ER-Studio XConnect and click the Configuration Properties tab. Each XConnect allows you to add multiple files:
  - Source System Version: select the appropriate option.
  - Click Add to add the ER-Studio file. Browse to the location of the ER-Studio file. The directory path of the file is stored locally. To load a new ER-Studio file, select the current file, click Delete, and then add the new file.
  - Select the Refresh? checkbox to refresh the metadata from the file. If you do not want to update the metadata from a particular metadata file (i.e., if the file does not contain any changes since the last metadata load), uncheck this box.
  - Click Save when you are finished.
- If you select Edit Assigned Connections for Lineage Report, set the connection assignments between the ER-Studio model and the underlying database schemas. Click OK when you are finished.

The XConnect is ready to be loaded. After a successful load of the ER-Studio metadata, you can see the metadata in the Web interface.
PowerCenter XConnect
Specify a user name and password to access the PowerCenter repository metadata. Be sure that the user has the Select Any Table privilege and the ability to drop and create views. If the Oracle user that pulls PowerCenter metadata into the metadata warehouse differs from the user under which PowerCenter created the metadata, create synonyms in the new user's schema for all tables and views in the PowerCenter user's schema. When the XConnect runs, it can then successfully create the views it needs in the new user's schema. To extract metadata from PowerCenter, perform the following steps:
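Creating one synonym per repository table and view by hand is tedious, so the setup is often scripted. The sketch below only generates the DDL text; the schema and object names are illustrative assumptions, and in practice the object list would come from Oracle's ALL_OBJECTS view for the PowerCenter schema.

```python
# Generate CREATE SYNONYM DDL so a secondary Oracle user can reference
# the PowerCenter repository schema. Names below are illustrative.

def synonym_ddl(pc_schema, mm_user, objects):
    """Build one CREATE SYNONYM statement per table/view in the
    PowerCenter user's schema, to be run as the Metadata Manager user."""
    return [
        f"CREATE SYNONYM {mm_user}.{obj} FOR {pc_schema}.{obj};"
        for obj in objects
    ]

# In practice the object list would come from a query such as:
#   SELECT object_name FROM all_objects
#   WHERE owner = :pc_schema AND object_type IN ('TABLE', 'VIEW')
ddl = synonym_ddl("PCREPO", "MMUSER", ["OPB_SUBJECT", "OPB_MAPPING"])
for stmt in ddl:
    print(stmt)
```

Running the emitted statements as a DBA (or as the Metadata Manager user with CREATE SYNONYM privilege) completes the setup.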
- Log in to Metadata Manager (Web interface) and select Add to add a new repository. Enter all the information related to the PowerCenter XConnect.
- Log in to the Metadata Manager Console and click the Source Repository Management tab. The PowerCenter XConnect added above should show up in the console.
- Select the PowerCenter XConnect and click the Configuration Properties tab.
Enter the following information related to the XConnect:

- Description
- Database user name and password to access the PowerCenter repository tables
- ODBC connection name to connect to the database (provides information about how to connect to the machine containing the source repository database)
- Database Type: database type of the PowerCenter repository database (refer to the appropriate RDBMS XConnect based on the database type)
- Click Save when you have finished entering this information.
- To configure the list of folders to load, click the Parameter Setup tab and select the folders to load (these are listed under Included Folders).
- Select Enable Operational Metadata Extraction to extract operational metadata (e.g., run details, including times and statuses for workflow, worklet, and session runs).
- Leave the Source Incremental Extract Window (in days) at its default value of 4000. (To ensure a full extract during the initial workflow run, the workflow is configured to extract records that have been inserted or updated within the past 4000 days.)
- Click Save when you are finished.
- Browse to the Parameter File Directory. Click the Add button to select the appropriate parameter file for each workflow that is being used. Click Save when you are finished selecting parameter files.

The XConnect is ready to be loaded. After a successful load of the PowerCenter metadata, you can see the metadata in the Web interface.
Business Objects XConnect

To extract metadata from Business Objects, perform the following steps:

- Add a new Business Objects repository from Metadata Manager (Web interface).
- Log in to the Metadata Manager Console and click the Source Repository Management tab. The new Business Objects XConnect added above should show up in the console.
- Select the Business Objects XConnect and click the Configuration Properties tab.
To configure the Business Objects repository connection setup for the first time:

- Click Configure to set up the Business Objects configuration file. The Metadata Administrator needs to define the Business Objects configuration the first time.
- Select the Business Objects repository, then enter the Business Objects login name and password to connect to the Business Objects repository.
- Select the list of universes you need to extract.
- Select the list of documents.
- Click Save to create the Business Objects configuration file used to extract metadata from Business Objects.
- Browse to select the path and file name for the Business Objects connection configuration file.
- Click Save when you are finished.

The XConnect is now ready to be loaded. After a successful load of the Business Objects metadata, you can see the metadata in the Web interface.
Custom XConnect Implementation

Description
This document organizes all steps into phases, where each phase and step must be performed in the order presented. To integrate custom metadata, complete tasks for the following phases:
- Design the metamodel.
- Implement the metamodel design.
- Set up and run the custom XConnect.
- Configure the reports and schema.
Common Warehouse Metamodel (CWM) and Informatica-Defined Metamodels. The CWM metamodel includes industry-standard packages, classes, and class associations. The Informatica metamodel components supplement the CWM metamodel by providing repository-specific packages, classes, and class associations. For more information about CWM, see http://www.omg.org/cwm. For more information about the Informatica-defined metamodel components, run and review the metamodel reports.

PowerCenter Functionality. During the metadata integration process, XConnects are configured and run. The XConnects run PowerCenter workflows that extract custom metadata and load it into the Metadata Manager Warehouse.
Data Analyzer Functionality. Metadata Manager embeds Data Analyzer functionality to create, run, and maintain a metadata reporting environment. Knowledge of creating, modifying, and deleting reports, dashboards, and analytic workflows in Data Analyzer is required. Knowledge of creating, modifying, and deleting table definitions, metrics, and attributes is required to update the schema with new or changed objects.
the structure of the metadata contained in the source repositories of the given repository type. Create packages to group related classes and class associations. To see an example of sample metamodel design specifications, see Appendix A in the Metadata Manager Custom Metadata Integration Guide.
1. Create the Originator (i.e., owner) of the metamodel. When creating a new metamodel, specify the originator of each metamodel. An originator is the organization that creates and owns the metamodel. When defining a new custom originator in Metadata Manager, select Customer as the originator type. To create the originator:
- Go to the Administration tab.
- Click Originators under Metamodel Management.
- Click Add to add a new originator.
- Fill out the requested information (Note: Domain Name, Name, and Type are mandatory fields).
- Click OK when you are finished.
2. Create the packages that contain the classes and associations of the subject metamodel. Define the packages to which custom classes and associations are assigned. Packages contain classes and their class associations. Packages have a hierarchical structure, where one package can be the parent of another package. Parent packages are generally used to group child packages together.
- Click OK when you are finished.

3. Create Custom Classes. In this step, create the custom classes identified in the metamodel design task.
- From the drop-down menu, select the package that you created in the step above.
- Fill out the requested information (Note: Name, Package, and Class Label are mandatory fields).
- Base Classes: To see the metadata in the Metadata Manager metadata browser, you need to add at least the base class TreeElement. To do this:
  a. Click Add under Base Classes.
  b. Select the package.
  c. Under Classes, select TreeElement.
  d. Click OK. (You should now see the class properties in the Properties section.)
- To add custom properties to your class, click Add. Fill out the property information (Name, Data Type, and Display Label are mandatory fields). Click OK when you are done.
- Click OK at the top of the page to create the class.
Repeat the above steps for additional classes.

4. Create Custom Class Associations. In this step, implement the custom class associations identified in the metamodel design phase. In the previous step, CWM classes were added as base classes; any of the class associations from the CWM base classes can be reused. Define only those custom class associations that cannot be reused. If you only need the ElementOwnership association, skip this step.
- Fill out the requested information (all bold fields are required).
- Click OK when you are finished.

5. Create the Repository Type. Each type of repository contains unique metadata. For example, a PowerCenter data integration repository type contains workflows and mappings, but a Data Analyzer business intelligence repository type does not. Repository types maintain the uniqueness of each repository.
- Fill out the requested information (Note: Name and Product Type are mandatory fields).
- Click OK when you are finished.

6. Configure a Repository Type Root Class. Root classes display under the source repository in the metadata tree. All other objects appear under the root class. To configure a repository root class:
Determine which Metadata Manager Warehouse tables to load. Although you do not have to load all Metadata Manager Warehouse tables, you must load the following:

- IMW_ELEMENT: The IME_ELEMENT interface file loads the element names from the source repository into the IMW_ELEMENT table. Note that "element" is used generically to mean packages, classes, or properties.
- IMW_ELMNT_ATTR: The IME_ELMNT_ATTR interface file loads the attributes belonging to elements from the source repository into the IMW_ELMNT_ATTR table.
- IMW_ELMNT_ASSOC: The IME_ELMNT_ASSOC interface file loads the associations between elements of a source repository into the IMW_ELMNT_ASSOC table.

To stop the metadata load into particular Metadata Manager Warehouse tables, disable the worklets that load those tables.
2. Reformat the source metadata. In this step, reformat the source metadata so that it conforms to the format specified in each required IME interface file. (The IME files are packaged with the Metadata Manager documentation.) Present the reformatted metadata in a valid source type format: the integration workflows require that the reformatted metadata be in one or more of the following formats: database table, database view, or flat file. Note that you can load metadata into a Metadata Manager Warehouse table using more than one of the accepted source type formats.

3. Register the Source Repository Instance in Metadata Manager. Before the custom XConnect can extract metadata, the source repository must be registered in Metadata Manager. When registering the source repository, Metadata Manager assigns a unique repository ID that identifies the source repository and adds an XConnect for it in the Configuration Console. All packages, classes, and class associations defined for the custom repository type apply to all repository instances registered to that repository type. When defining the repository, provide descriptive information about the repository instance. To create the repository that will hold the metadata extracted from the source system:

- Go to the Administration tab.
- Click Repositories under Repository Management.
- Click Add to add a new repository.
- Fill out the requested information (Note: Name and Repository Type are mandatory fields). Choose the repository type created above.
- Click OK when finished.
4. Configure the Custom Parameter Files. Custom XConnects require that the parameter file be updated with the following information:

- The names of the database views or tables used to load the Metadata Manager Warehouse, if applicable.
- The list of all flat files used to load a particular Metadata Manager Warehouse table, if applicable.
- The worklets you want to enable and disable.

Understanding Metadata Manager Workflows for Custom Metadata
- wf_Load_IME: a custom workflow, created by a developer, that extracts and transforms metadata from the source repository into IME format.

Metadata Manager prepackages the following integration workflows for custom metadata. These workflows read the IME files mentioned above and load them into the Metadata Manager Warehouse:

- WF_STATUS: Extracts and transforms statuses from any source repository and loads them into the Metadata Manager Warehouse. To resolve status IDs correctly, this workflow is configured to run before the WF_CUSTOM workflow.
- WF_CUSTOM: Extracts and transforms custom metadata from IME files and loads that metadata into the Metadata Manager Warehouse.
5. Configure the Custom XConnect. The XConnect loads metadata into the Metadata Manager Warehouse based on classes and class associations specified in the custom metamodel. When the custom repository type is defined, Metadata Manager registers the corresponding XConnect in the Configuration Console. The following information in the Configuration Console configures the XConnect:
- Under the Administration tab, select Custom Workflow Configuration and choose the repository type to which the custom repository belongs.
- Workflows to load the metadata.
(worklets and sessions required to populate all Metadata Manager Warehouse tables, except the IMW_STATUS table)

Note: The Metadata Manager Server does not load Metadata Manager Warehouse tables that have disabled worklets.
- Under the Administration tab, select Custom Workflow Configuration and choose the parameter file used by the workflows to load the metadata (the parameter file name is assigned at first data load). This parameter file name has the form nnnnn.par, where nnnnn is a five-digit integer assigned at the time of the first load of this source repository. The script promoting Metadata Manager from the development environment to test, and from test to production, preserves this file name.
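A custom parameter file might look like the sketch below, which uses PowerCenter's conventional [Folder.WF:Workflow] section-header syntax. The folder name and the file-list parameter are illustrative assumptions; $$SRC_INCR_DATE and WF_CUSTOM are the names used in this document.

```python
# Sketch of a custom XConnect parameter file in PowerCenter's
# parameter-file format. Folder and file-list names are illustrative.

def build_param_file(folder, workflow, params):
    """Render a one-workflow parameter file: a [Folder.WF:Workflow]
    section header followed by name=value lines."""
    lines = [f"[{folder}.WF:{workflow}]"]
    lines += [f"{name}={value}" for name, value in params.items()]
    return "\n".join(lines)

content = build_param_file(
    "MM_CUSTOM",                       # assumed folder name
    "WF_CUSTOM",                       # prepackaged custom workflow
    {
        "$$SRC_INCR_DATE": "4000",     # incremental extract window (days)
        "$$IME_ELEMENT_FILES": "elements.csv",  # hypothetical parameter
    },
)
print(content)
```

The rendered text would be saved under the nnnnn.par name assigned at the first load, in the configured Parameter File Directory.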
6. Reset the $$SRC_INCR_DATE Parameter. After completing the first metadata load, reset the $$SRC_INCR_DATE parameter to extract metadata in shorter intervals, such as every few days. The value depends on how often the Metadata Manager Warehouse needs to be updated. If the source does not provide the date when records were last updated, records are extracted regardless of the $$SRC_INCR_DATE parameter setting.

7. Run the Custom XConnect. Using the Configuration Console, Metadata Manager administrators can run the custom XConnect and verify that the metadata loads correctly.

Note: When loading metadata with Effective From and Effective To Dates, Metadata Manager does not validate that the Effective From Date is earlier than the Effective To Date. Ensure that each Effective To Date is greater than the Effective From Date. If you do not supply Effective From and Effective To Dates, Metadata Manager sets the Effective From Date to 1/1/1899 and the Effective To Date to 1/1/3714.
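The two date-related rules above are easy to get wrong, so a small pre-load check is worthwhile. The sketch below assumes dates are already parsed; the defaults mirror the 1/1/1899 and 1/1/3714 values noted above.

```python
from datetime import date, timedelta

# Defaults applied by Metadata Manager when effective dates are omitted.
DEFAULT_EFF_FROM = date(1899, 1, 1)
DEFAULT_EFF_TO = date(3714, 1, 1)

def effective_range_ok(eff_from, eff_to):
    """Metadata Manager does not reject From >= To, so check it
    before loading."""
    return eff_from < eff_to

def incremental_cutoff(run_date, window_days):
    """Oldest insert/update date covered by the $$SRC_INCR_DATE window."""
    return run_date - timedelta(days=window_days)

assert effective_range_ok(DEFAULT_EFF_FROM, DEFAULT_EFF_TO)
print(incremental_cutoff(date(2007, 2, 1), 4000))
```

Rows failing effective_range_ok should be corrected in the staging data before the XConnect runs.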
Use the existing schema and reports. Metadata Manager contains prepackaged reports that can be used to analyze business intelligence metadata, data integration metadata, data modeling tool metadata, and database catalog metadata. Metadata Manager also provides impact analysis and lineage reports that provide information on any type of metadata.

Create new reports using the existing schema. Build new reports using the existing Metadata Manager metrics and attributes.

Create new Metadata Manager Warehouse tables and views to support the schema and reports. If the prepackaged Metadata Manager schema does not meet the reporting requirements, create new Metadata Manager Warehouse tables and views. Prefix the names of custom-built tables with Z_IMW_ and custom-built views with Z_IMA_. If you build new Metadata Manager Warehouse tables or views, register the tables in the Metadata Manager schema and create new metrics and attributes in the Metadata Manager schema. Note that the Metadata Manager schema is built on the Metadata Manager views.
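The Z_IMW_ and Z_IMA_ prefixes prescribed above can be enforced with a trivial naming check in a build or review script, for example:

```python
def custom_object_name_ok(name, kind):
    """Check the Z_ prefixes this Best Practice prescribes for
    custom-built Metadata Manager Warehouse objects."""
    prefixes = {"table": "Z_IMW_", "view": "Z_IMA_"}
    return name.upper().startswith(prefixes[kind])

assert custom_object_name_ok("Z_IMW_ORDER_FACTS", "table")
assert custom_object_name_ok("z_ima_order_summary", "view")
assert not custom_object_name_ok("IMW_ELEMENT", "table")  # prepackaged, not custom
```

Keeping custom objects behind a distinct prefix makes them easy to exclude from product upgrades and to migrate as a group.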
After the environment setup is complete, test all schema objects, such as dashboards, analytic workflows, reports, metrics, attributes, and alerts.
Customizing the Metadata Manager Interface

Description
Configuring Metamodels
You may need to configure metamodels for a repository type in order to integrate additional metadata into a Metadata Manager Warehouse and/or to adapt to changes in metadata reporting and browsing requirements. For more information about creating a metamodel for a new repository type, see the Metadata Manager Custom Metadata Integration Guide. Use Metadata Manager to define a metamodel, which consists of the following objects:
- Originator: the party that creates and owns the metamodel.
- Packages: contain related classes that model metadata for a particular application domain or specific application. Multiple packages can be defined under the newly defined originator. Each package stores the classes and associations that represent the metamodel.
- Classes and Class Properties: define a type of object, with its properties, contained in a repository. Multiple classes can be defined under a single package. Each class has multiple properties associated with it. These properties can be inherited from one or many base classes already available; additional properties can be defined directly under the new class.
- Associations: define the relationships among classes and their objects. The cardinality defines 1-1, 1-n, or n-n relationships. These relationships mirror real-life associations of the logical, physical, or design-level building blocks of systems and processes.
For more information about metamodels, originators, packages, classes, and associations, see Metadata Manager Concepts in the Metadata Manager Administration Guide. After you define the metamodel, you need to associate it with a repository type. When registering a repository under a repository type, all classes and associations assigned to the repository type through packages apply to the repository.
Repository Types
You can configure types of repositories for the metadata you want to store and manage in the Metadata Manager Warehouse. You must configure a repository type when you develop an XConnect. You can modify some attributes for existing XConnects and XConnect repository types.
Create an association from Object B to Object A. 'From Objects' in an association display as parent objects; 'To Objects' display as child objects. The 'To Object' displays in the metadata tree only if the 'From Object' in the association already displays in the metadata tree. For more information about adding associations, refer to Adding Object Associations in the Metadata Manager User Guide.
- Add the association to the IMM.properties file. Metadata Manager only displays objects in the metadata tree if the corresponding association between their classes is included in the IMM.properties file.

Note: Some associations are not explicitly defined among the classes of objects; some objects reuse associations based on the ancestors of their classes. The metadata tree displays objects that have explicit or reused associations.

To determine the ID of an association, click Administration > Metamodel Management > Associations, and then click the association on the Associations page.

3. Save and close the IMM.properties file.
4. Stop and then restart the Metadata Manager Server to apply the changes.
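Editing IMM.properties can be scripted so the association ID is appended exactly once. The property key used below is an assumption for illustration only; check an existing entry in your IMM.properties for the actual key and value syntax.

```python
# Append an association ID to IMM.properties so its objects display in
# the metadata tree. The property key here is an assumed placeholder;
# consult an existing IMM.properties entry for the real syntax.
def add_association(lines, association_id, key="metadata.tree.associations"):
    out = []
    found = False
    for line in lines:
        if line.startswith(key + "="):
            found = True
            ids = [v for v in line.split("=", 1)[1].split(",") if v]
            if association_id not in ids:  # idempotent: no duplicates
                ids.append(association_id)
            line = key + "=" + ",".join(ids)
        out.append(line)
    if not found:
        out.append(key + "=" + association_id)
    return out

print(add_association(["metadata.tree.associations=101"], "205"))
```

As step 4 above notes, the Metadata Manager Server must be restarted for the edited file to take effect.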
- Query Task Area: allows you to search for metadata objects stored in the Metadata Manager Warehouse.
- Metadata Tree Task Area: allows you to navigate to a metadata object in a particular repository.
- Results Task Area: displays metadata objects based on an object search in the Query Task Area or on the object selected in the Metadata Tree Task Area.
- Details Task Area: displays properties of the selected object. You can also view associations between the object and other objects, and run related reports, from the Details Task Area.
For more information about the Metadata Directory page on the Find tab, refer to the Accessing Source Repository Metadata chapter in the Metadata Manager User Guide. You can perform the following customizations while browsing the source repository metadata:
- Class: displays an icon that represents the class of the selected object. The class name appears when you place the pointer over the icon.
- Label: the label of the object.
- Source Update Date: the date the object was last updated in the source repository.
- Repository Name: the name of the source repository from which the object originates.
- Description: describes the object.
The default properties that appear in the Results Task Area can be rearranged, added to, or removed for a Metadata Manager user account. For example, you can remove the default Class and Source Update Date properties, move the Repository Name property to precede the Label property, and add a different property, such as the Warehouse Insertion Date, to the list. You can also add other properties that are specific to the class of the selected object. With the exception of Label, all default properties can be removed. You can select up to ten properties to display in the Results Task Area, and Metadata Manager displays them in the configured order. If there are more than ten properties to display, Metadata Manager displays the first ten: the common properties first, in the order specified, and then the remaining properties in alphabetical order based on the property display label.
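The selection rule just described, configured common properties first in their configured order, then the rest alphabetically by display label, capped at ten, can be sketched as:

```python
def displayed_properties(configured, remaining, limit=10):
    """Configured properties keep their order; remaining properties are
    sorted by display label; at most `limit` properties are shown."""
    ordered = list(configured) + sorted(remaining)
    return ordered[:limit]

props = displayed_properties(
    ["Label", "Repository Name"],                       # configured order
    ["Warehouse Insertion Date", "Description", "Source Update Date"],
)
print(props)
```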
The same settings can be applied to the other classes of objects that currently display in the 'Results Task' area. If the settings are not applied to the other classes, then the settings apply to the objects of the same class as the object selected in the metadata tree.
Estimating Metadata Manager Volume Requirements

Description
The size of the Metadata Manager warehouse is directly proportional to the size of metadata being loaded into it. The size is also dependent on the number of element attributes being captured in source metadata and the associations defined in the metamodel. When estimating volume requirements for a Metadata Manager implementation, consider the following Metadata Manager components:
- Metadata Manager Server
- Metadata Manager Console
- Metadata Manager Integration Repository
- Metadata Manager Warehouse
Note: Refer to the Metadata Manager Installation Guide for complete information on minimum system requirements for server, console and integration repository.
Considerations
Volume estimation for Metadata Manager is an iterative process. Use the Metadata Manager development environment to get accurate size estimates for the Metadata Manager production environment. The required steps are as follows:

1. Identify the source metadata that needs to be loaded in the Metadata Manager production warehouse.
2. Size the Metadata Manager development warehouse based on the initial sizing estimates (as explained in the next section of this document).
3. Run the XConnects and monitor the disk usage. The data loaded during the initial run of the XConnects should be used as a baseline for further sizing estimates.
4. If a failure due to lack of disk space is encountered, add disk space and restart the XConnect.
Repeat steps 1 through 4 until the XConnect run is successful. The following figures illustrate the initial sizing estimates for a typical Metadata Manager implementation:
The following table is an initial estimation matrix that should be helpful in deriving a reasonable initial estimate. For increased input sizes, expect the Metadata Manager warehouse target size to increase in direct proportion.

XConnect                      Expected Metadata Manager Warehouse Target Size
Metamodel and other tables    50 MB
PowerCenter                   10 MB
Data Analyzer                 4 MB
Database                      5 MB
Other XConnect                4.5 MB
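Because the target size grows in direct proportion to the input size, a first-cut estimate can scale the baseline figures linearly. The target sizes below come from the matrix above; the baseline input sizes are illustrative assumptions, not figures from this document, and should be replaced with measurements from your own development loads.

```python
# First-cut warehouse sizing: scale a measured baseline linearly.
# Target sizes (MB) are from the estimation matrix; the baseline input
# sizes are illustrative assumptions only.
BASELINE = {
    # xconnect: (assumed_input_mb, target_mb)
    "PowerCenter":    (100.0, 10.0),
    "Data Analyzer":  (50.0, 4.0),
    "Database":       (60.0, 5.0),
    "Other XConnect": (50.0, 4.5),
}
FIXED_MB = 50.0  # metamodel and other tables, roughly constant

def estimate_target_mb(inputs_mb):
    """Estimate total warehouse size from per-XConnect input sizes (MB)."""
    total = FIXED_MB
    for name, size in inputs_mb.items():
        base_in, base_out = BASELINE[name]
        total += size / base_in * base_out  # direct proportion
    return total

print(estimate_target_mb({"PowerCenter": 200.0, "Database": 120.0}))
```

After the first development load, replace the assumed baselines with the observed input and target sizes, per the iterative process above.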
Metadata Manager Load Validation

Description
The process for validating Metadata Manager metadata loads is very simple using the Metadata Manager Configuration Console. In the Metadata Manager Configuration Console, you can view the run history for each of the XConnects. For those who are familiar with PowerCenter, the Run History portion of the Metadata Manager Configuration Console is similar to the Workflow Monitor in PowerCenter. To view XConnect run history, first log into the Metadata Manager Configuration Console.
After logging into the console, click Administration > Repositories. The XConnect Repositories are displayed with their last load date and status.
The XConnect run history is displayed (see below) on the Source Repository Management screen. A Metadata Manager Administrator should log into the Metadata Manager Configuration Console on a regular basis and verify that all XConnects that were scheduled ran to successful completion.
If any XConnects have a status of Failed in the Last Refresh Status column, the issue should be investigated and corrected, and the XConnect should be re-executed. XConnects can fail for a variety of common reasons, such as unavailability of the database, network failure, or improper configuration. More detailed error messages can be found in the activity log or in the workflow log files. By clicking the Output tab of the selected XConnect in the Metadata Manager Console, you can view the output for the most recent run of the selected XConnect. In most cases, logging is set up to write to the <PowerCenter installation directory>\client\Console\ActivityLog file.
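Routine checking of the activity log can be partially automated by scanning for failed runs. The sketch below assumes a simple line-per-run log and a "Failed" marker mirroring the Last Refresh Status column; the actual ActivityLog layout may differ, so adjust the match accordingly.

```python
def failed_xconnects(log_lines):
    """Return log lines that look like failed XConnect runs.
    The 'Failed' marker is an assumption about the log format."""
    return [line for line in log_lines if "Failed" in line]

# Illustrative sample; real lines would be read from the ActivityLog file.
sample = [
    "2007-02-01 02:00 Oracle XConnect Succeeded",
    "2007-02-01 03:00 PowerCenter XConnect Failed",
]
for line in failed_xconnects(sample):
    print(line)
```

Any hits would then be investigated in the Console's Output tab and the workflow logs, as described above.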
After investigating and correcting the issue, the XConnect that failed should be re-executed at the next available time in order to load the most recent metadata.
Last updated: 01-Feb-07 18:53
Metadata Manager Migration Procedures

Changes that may need to be migrated across environments fall into the following areas:

- Reports: changes to the reporting schema and the out-of-the-box reports, as well as any new reports or schema elements created to meet the custom reporting needs of the specific implementation of the product.
- Metamodel: the creation of new metamodel components to associate custom metadata with repository types and domains that are not covered by the out-of-the-box Metadata Manager repository types.
- Metadata: the creation of new metadata objects, their properties, or associations against repository instances configured within Metadata Manager. These repository instances can belong either to the repository types supported out of the box by Metadata Manager or to new repository types configured through custom additions to the metamodels.
- Integration Repository: changes to the out-of-the-box PowerCenter workflows or mappings, as well as any new PowerCenter objects (mappings, transformations, etc.) or associated workflows.
Description
Report Changes
The following chart depicts the various scenarios related to the reporting area and the actions that need to be taken to deploy the changed components. It is always advisable to create new schema elements (metrics, attributes, etc.) or new reports in a new Data Analyzer folder to facilitate exporting and importing the Data Analyzer objects across development, test, and production.
Development: Do an XML export of the changed components.
Test: Import the XML exported in the development environment. Answer Yes to overriding the definitions that already exist for the changed schema components.
Production: Import the XML exported in the development environment. Answer Yes to overriding the definitions that already exist for the changed schema components.
Nature of Report Change: Modify an existing report (add or delete metrics, attributes, filters, change formatting etc.)
Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the changed report.
Test: Import the XML exported in the development environment. Answer Yes to overriding the definitions that already exist for the changed report.
Production: Import the XML exported in the development environment. Answer Yes to overriding the definitions that already exist for the changed report.
Nature of Report Change: Add new schema component (metric, attribute, etc.)

Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the new schema components.
Test: Import the XML exported in the development environment.
Production: Import the XML exported in the development environment.
Test
Import the XML exported in the development environment.
Production
Import the XML exported in the development environment.
Metamodel Changes
The following chart depicts the various scenarios related to the metamodel area and the actions to be taken as they relate to deployment of the changed components.
Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the new metamodel components (the export can be done at three levels: Originators, Repository Types, and Entry Points) using the Export Metamodel option.
Test: Import the XML exported in the development environment using the Import Metamodel option.
Production: Import the XML exported in the development environment using the Import Metamodel option.
Nature of the Change: Modify an existing mapping, transformation and/or the associated workflows etc.
Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the changed objects.
Test: Import the XML exported in the development environment. Answer Yes to overriding the definitions that already exist for the changed object.
Production: Import the XML exported in the development environment. Answer Yes to overriding the definitions that already exist for the changed object.
Nature of the Change: Add a new ETL object (mapping, transformation, etc.) and create an associated workflow
Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the new objects.
Test: Import the XML exported in the development environment.
Production: Import the XML exported in the development environment.
Description
A Metadata Manager administrator needs to be involved in the following areas to ensure that the Metadata Manager metadata warehouse is fulfilling end-user needs:

- Migration of Metadata Manager objects created in the Development environment to the QA or Production environment
- Creation and maintenance of access and privileges of Metadata Manager objects
- Repository backups
- Job monitoring
- Metamodel creation
- Install a new Metadata Manager instance for the QA/Production environment. This involves creating a new integration repository and Metadata Manager warehouse.
- Export the metamodel from the Development environment and import it into QA or Production via the XML Import/Export functionality (in the Metadata Manager Administration tab) or via the Metadata Manager command line utility.
- Export the custom or modified reports created or configured in the Development environment and import them into QA or Production via the XML Import/Export functionality in the SG Administration tab. This functionality is identical to the function in Data Analyzer; refer to the Data Analyzer Administration Guide for details on the import/export function.
- Configure reports. Users can view particular reports, create reports, and/or modify the reporting schema.
- Configure the Metadata Manager warehouse. Users can add, edit, and delete repository objects using Metadata Manager.
- Configure metamodels. Users can add, edit, and delete metamodels.
Metadata Manager also allows the Administrator to create access permissions on specific source repository objects for specific users. Users can be restricted to reading, writing, or deleting source repository objects that appear in Metadata Manager. Similarly, the Administrator can establish access permissions for source repository objects in the Metadata Manager warehouse. Access permissions determine the tasks that users can perform on specific objects. When the Administrator sets access permissions, he or she determines which users have access to the source repository objects that appear in Metadata Manager. The Administrator can assign the following types of access permissions to objects:
- Read: grants permission to view the details of an object and the names of any objects it contains.
- Write: grants permission to edit an object and create new repository objects in the Metadata Manager warehouse.
- Delete: grants permission to delete an object from a repository.
- Change permission: grants permission to change the access permissions for an object.
When a repository is first loaded into the Metadata Manager warehouse, Metadata Manager provides all permissions to users with the System Administrator role. All other users receive read permissions. The Administrator can then set inclusive and exclusive access permissions.
Metamodel Creation
In cases where a client needs to create custom metamodels for sourcing custom metadata, the Metadata Manager Administrator needs to create new packages, originators, repository types and class associations. For details on how to create new metamodels for custom metadata loading and rendering in Metadata Manager, refer to the Metadata Manager Installation and Administration Guide.
Job Monitoring
When Metadata Manager XConnects are running in the Production environment, Informatica recommends monitoring loads through the Metadata Manager console. The Configuration Console Activity Log in the Metadata Manager console can identify the total time it takes for an XConnect to complete. The console maintains a history of all runs of an XConnect, enabling a Metadata Manager Administrator to verify that load times are meeting the SLA agreed upon with end users and that load times are not increasing inordinately as data accumulates in the Metadata Manager warehouse. The Activity Log provides the following details about each repository load:
- Repository Name: name of the source repository defined in Metadata Manager
- Run Start Date: day of week and date the XConnect run began
- Start Time: time the XConnect run started
- End Time: time the XConnect run completed
- Duration: number of seconds the XConnect run took to complete
- Ran From: machine hosting the source repository
- Last Refresh Status: status of the XConnect run, indicating whether it completed successfully or failed
Repository Backups
When Metadata Manager is running in either the Production or QA environment, Informatica recommends taking periodic backups of the following areas:
- Database backups of the Metadata Manager warehouse
- Backups of the integration repository; Informatica recommends either of two methods for this backup:
  - The PowerCenter Repository Server Administration Console or the pmrep command line utility
  - The traditional, native database backup method

The native PowerCenter backup is required, but Informatica recommends using both methods because, if database corruption occurs, the native PowerCenter backup provides a clean backup that can be restored to a new database.
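As an illustration, the pmrep backup could be scripted along these lines. This is only a sketch: the repository name, user, host, port, and output path are placeholder values, and the exact pmrep option letters vary by PowerCenter version, so verify them against the pmrep command reference for your release.

```python
# Sketch: assemble pmrep command lines to back up the integration repository.
# All connection values below are hypothetical placeholders.
def pmrep_backup_commands(repo, user, host, port, outfile):
    connect = ["pmrep", "connect", "-r", repo, "-n", user, "-x", "<password>",
               "-h", host, "-o", str(port)]      # connect to the repository
    backup = ["pmrep", "backup", "-o", outfile]  # write the backup file
    return [connect, backup]

cmds = pmrep_backup_commands("MM_REPO", "admin", "repserver01", 5001,
                             "/backups/mm_repo.rep")
for cmd in cmds:
    print(" ".join(cmd))
```

A wrapper like this is typically scheduled alongside the native database backup so both recovery paths stay current.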
The Metadata Manager upgrade involves the following components:

- Metadata Manager and Metadata Manager Client
- Web browser
- Databases
- Third-party software
- Code pages
- Application server
For more information about requirements for each component, see Chapter 3, "PowerCenter Prerequisites," in the PowerCenter Installation and Configuration Guide. As we already know from the existing installation, Metadata Manager is made up of various components. Except for the Metadata Manager repository, all other Metadata Manager components (i.e., the Metadata Manager Server, PowerCenter repository, PowerCenter Clients, and Metadata Manager Clients) should be uninstalled and then reinstalled with the latest version of Metadata Manager as part of the upgrade process. Keep in mind that all modifications and/or customizations to the standard version of Metadata Manager will be lost and will need to be re-created and re-tested after the upgrade.
Description
Upgrade Steps
1. Set up new repository database and user account.
- Set up a new database/schema for the PowerCenter Metadata Manager repository.
- For Oracle, set the appropriate storage parameters.
- For IBM DB2, use a single-node tablespace to optimize PowerCenter performance.
- For IBM DB2, configure the system temporary table spaces and update the heap sizes.
- Create a database user account for the PowerCenter Metadata Manager repository. The database user must have permissions to create and drop tables and indexes, and to select, insert, update, and delete data from tables.

For more information, see the PowerCenter Installation and Configuration Guide.
2. Back up the existing Metadata Manager repository. You can use any backup or copy utility provided with the database to make a copy of the working Metadata Manager repository prior to upgrading Metadata Manager. Use the copy of the Metadata Manager repository for the new Metadata Manager installation.
3. Make a copy of the existing parameter files. If you have custom XConnects and their parameter, attribute, and data files are kept in a different location, back those up as well. You may need to refer to these files when you later configure the parameters for the custom XConnects as part of the Metadata Manager client upgrade. For PowerCenter 8.0, you can find the parameter files in the following directory: PowerCenter_Home\server\infa_shared\SrcFiles. For Metadata Manager, you can find the parameter files in the following directory: PowerCenter_Home\Server\SrcFiles.
4. Export the Metadata Manager mappings that you customized or created for your environment.
- If you made any changes to the standard Metadata Manager mappings, or created new mappings within the Metadata Manager integration repository, export these mappings and any associated workflows and/or sessions.
- If you created additional reports, export these reports too.
5. Install the new version of Metadata Manager. Select the Custom installation set and install Metadata Manager. The installer creates a Repository Service and an Integration Service in the PowerCenter domain and creates a PowerCenter repository for Metadata Manager. For more information about installing Metadata Manager, see the PowerCenter Installation and Configuration Guide.
6. Stop the Metadata Manager server. You must stop the Metadata Manager server before you upgrade the Metadata Manager repository contents. For more information about stopping Metadata Manager, see Appendix C, "Starting and Stopping Application Servers," in the PowerCenter Installation and Configuration Guide.
7. Upgrade the Metadata Manager repository. Use the Metadata Manager upgrade utility shipped with the latest version of Metadata Manager to upgrade the Metadata Manager repository. For instructions on running the Metadata Manager upgrade utility, see the PowerCenter Installation and Configuration Guide.
8. Complete the Metadata Manager post-upgrade tasks. After you upgrade the Metadata Manager repository, perform the following tasks:
- Update metamodels for Business Objects and Cognos ReportNet Content Manager.
- Delete obsolete Metadata Manager objects.
- Refresh Metadata Manager views.
- For a DB2 Metadata Manager repository, import metamodels.
For more information about the post-upgrade tasks, see the PowerCenter Installation and Configuration Guide.

9. Upgrade the Metadata Manager Client.
For instructions on upgrading the Metadata Manager Client, see the PowerCenter Installation and Configuration Guide. After you complete the upgrade steps, verify that all dashboards and reports are working correctly in Metadata Manager. When you are sure that the new version is working properly, you can delete the old instance of Metadata Manager and switch to the new version.
10. Compare and redeploy the exported Metadata Manager mappings that were customized or created for your environment.
- If you had any modified Metadata Manager mappings in the previous release of Metadata Manager, check whether the modifications are still necessary. If they are, override or rebuild the changes into the new PowerCenter mappings.
- Import the customized reports into the new environment and verify that the reports still work with the new Metadata Manager environment. If not, make the necessary modifications to make them compatible with the new structure.
If you have any custom XConnects in your environment, you need to regenerate the XConnect mappings that were generated by the previous version of the custom XConnect configuration wizard. Before starting the regeneration process, ensure that the absolute paths to the .csv files are the same as the previous version. If all the paths are the same, no further actions are required after the regeneration of the workflows and mappings.
Verify that the browser and all reports are working correctly in Metadata Manager 8.1. If the upgrade is successful, you can uninstall the previous version of Metadata Manager.
Description
In most organizations, the day-to-day operation of the data warehouse is the responsibility of a Production Support team. This team is typically involved with the support of other systems and has expertise in database systems and various operating systems. The Data Warehouse Development team becomes, in effect, a customer to the Production Support team. To that end, the Production Support team needs two documents, a Service Level Agreement and an Operations Manual, to help in the support of the production data warehouse.
The Service Level Agreement should define:

- Times when the system should be available to users
- The scheduled maintenance window
- Who is expected to monitor the operating system
- Who is expected to monitor the database
- Who is expected to monitor the PowerCenter sessions
- How quickly the support team is expected to respond to notifications of system failures
- Escalation procedures that include data warehouse team contacts in the event that the support team cannot resolve the system failure
Operations Manual
The Operations Manual is crucial to the Production Support team because it provides the information needed to perform the data warehouse system maintenance. This manual should be self-contained, providing all of the information necessary for a production support operator to maintain the system and resolve most problems that can arise. This manual should contain information on how to maintain all data warehouse system components. At a minimum, the Operations Manual should contain:
- Information on how to stop and re-start the various components of the system
- IDs and passwords (or how to obtain passwords) for the system components
- Information on how to re-start failed PowerCenter sessions and recovery procedures
- A listing of all jobs that are run, their frequency (daily, weekly, monthly, etc.), and the average run times
- Error handling strategies
- Who to call in the event of a component failure that cannot be resolved by the Production Support team
Application ID
PowerExchange documentation talks about consuming applications: the processes that extract changes, whether realtime or change (periodic batch extraction). Each consuming application must identify itself to PowerExchange. Realistically, this means that each session must have an application ID parameter containing a unique label.
Restart Tokens
PowerExchange remembers each time a consuming application successfully extracts changes. The end-point of the extraction (an address in the database log, an RBA or SCN) is stored in a file on the server hosting the Listener that reads the changed data. Each of these memorized end-points (i.e., Restart Tokens) is a potential restart point. It is possible, using the Navigator interface directly, or by updating the restart file,
to force the next extraction to restart from any of these points. If you're using the ODBC interface for PowerExchange, this is the best solution to implement.

If you are running periodic extractions of changes and everything finishes cleanly, the restart token history is a good approach for recovering back to a previous extraction: you simply choose the recovery point from the list and re-use it. There are more likely scenarios, though. If you are running realtime extractions, potentially never-ending or running until there's a failure, there are no end-points to memorize for restarts. If a batch extraction fails, you may already have processed and committed many changes. You can't afford to miss any changes and you don't want to reapply the same changes you've just processed, but the previous restart token does not correspond to the reality of what you've processed.

If you are using the PowerExchange Client for PowerCenter (PWXPC), the best answer to the recovery problem lies with PowerCenter, which has historically been able to deal with restarting this type of process through Guaranteed Message Delivery. This functionality is applicable to both realtime and change CDC options. The PowerExchange Client for PowerCenter stores the Restart Token of the last successful extraction run for each Application ID in files on the PowerCenter Server. The directory and file name are required parameters when configuring the PWXPC connection in the Workflow Manager. This functionality greatly simplifies recovery procedures compared to using the ODBC interface to PowerExchange.

To enable recovery, select the Enable Recovery option in the Error Handling settings of the Configuration tab in the session properties. During normal session execution, the PowerCenter Server stores recovery information in cache files in the directory specified for $PMCacheDir.
Recovery
If a CDC session fails and it was executed with recovery enabled, you can restart it in recovery mode either from the PowerCenter Client interfaces or by using the pmcmd command line instruction. Obviously, this assumes that you are able to identify that the session failed previously. In recovery mode:

1. PowerCenter starts from the point in time specified by the Restart Token in the GMD cache.
2. PowerCenter reads the change records from the GMD cache.
3. PowerCenter processes and commits the records to the target system(s).
4. Once the records in the GMD cache have been processed and committed, PowerCenter purges the records from the GMD cache and writes a restart token to the restart file.
5. The PowerCenter session ends cleanly.
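As a sketch of how the command line restart could be scripted, a small wrapper might construct the pmcmd call. The service, domain, folder, and workflow names below are hypothetical, and pmcmd options differ across PowerCenter releases, so check the pmcmd reference for your version before using this pattern.

```python
# Sketch: build a pmcmd command line to recover a failed CDC workflow.
# All names below are placeholders for your environment.
def pmcmd_recover(service, domain, folder, workflow):
    return ["pmcmd", "recoverworkflow",
            "-sv", service, "-d", domain,   # Integration Service and domain
            "-f", folder, workflow]         # folder and workflow to recover

cmd = pmcmd_recover("IS_CDC", "Domain_Prod", "CDC_FOLDER", "wf_cdc_orders")
print(" ".join(cmd))
```

A scheduler or monitoring script can call such a wrapper automatically when it detects a failed CDC session.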
The CDC session is now ready for you to execute in normal mode again.
When you re-execute a failed CDC session, you receive all the changed data since the last PowerExchange restart token. Your session has to cope with re-processing some of the same changes already processed at the start of the failed execution, either by using lookups/joins to the target to see if you have already applied the change you are processing, or by simply ignoring database error messages such as those produced by trying to delete a record you already deleted.

If you run DTLUAPPL to generate a restart token periodically during the execution of your CDC extraction and save the results, you can use the generated restart token to force a recovery at a more recent point in time than the last session-end restart token. This is especially useful if you are running realtime extractions using ODBC; otherwise you may find yourself re-processing several days of changes you have already processed.

Finally, you can always re-initialize the target and the CDC processing:
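The "lookup before apply" approach can be sketched as follows, where the target keeps a per-key high-water mark of change sequence numbers so that replayed changes are skipped. The data layout here is purely illustrative; in practice the high-water mark would be a column or lookup table in the target database.

```python
# Sketch: skip changes that were already applied before the failure,
# using a per-key high-water mark of change sequence numbers.
def apply_changes(target, applied_seq, changes):
    for key, seq, op, row in changes:
        if applied_seq.get(key, -1) >= seq:
            continue                      # already processed in the failed run
        if op == "DELETE":
            target.pop(key, None)         # tolerate "already deleted"
        else:                             # INSERT or UPDATE become an upsert
            target[key] = row
        applied_seq[key] = seq

target, applied = {}, {}
batch = [(1, 10, "INSERT", {"amt": 5}), (1, 11, "UPDATE", {"amt": 7})]
apply_changes(target, applied, batch)
apply_changes(target, applied, batch)     # replay after a restart: no effect
print(target)                             # → {1: {'amt': 7}}
```

Re-running the same batch is a no-op, which is exactly the idempotence a CDC session needs when the restart token points earlier than the last committed change.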
1. Take an image copy of the tablespace containing the table to be captured, with the QUIESCE option.
2. Monitor the EDMMSG output from the PowerExchange Logger job. Look for message DTLEDM172774I, which identifies the PowerExchange Logger sequence number corresponding to the QUIESCE event. The Logger output shows detail in the following format:
DB2 QUIESCE of TABLESPACE TSNAME.TBNAME at DB2 RBA/LRSN 000849C56185 EDP Logger RBA . . . . . . . . . : D5D3D3D34040000000084E0000000000 Sequence number . . . . . . . . . : 000000084E0000000000 Edition number . . . . . . . . . : B93C4F9C2A79B000 Source EDMNAME(s) . . . . . . . . : DB2DSN1CAPTNAME1
3. Take note of the log sequence number. Repeat for all tables that form part of the same PowerExchange Application.
4. Run the DTLUAPPL utility, specifying the application name and the registration name for each table in the application. Alter the SYSIN as follows:
MOD APPL REGDEMO DSN1 (where REGDEMO is Registration name on Navigator) add RSTTKN CAPDEMO (where CAPDEMO is Capture name from Navigator) SEQUENCE 000000084E0000000000000000084E0000000000 RESTART D5D3D3D34040000000084E0000000000 END APPL REGDEMO (where REGDEMO is Registration name from Navigator)
Note that the SEQUENCE value is a repeated string of the sequence number found in the Logger messages after the copy/QUIESCE, and that the RESTART parameter specified in the DTLUAPPL job is the EDP Logger RBA generated in the same message sequence. This sets the extraction start point on the PowerExchange Logger to the point at which the QUIESCE was done above. The image copy obtained above can be used for the initial materialization of the target tables.
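Following the example above, a small helper could generate the DTLUAPPL SYSIN from the Logger message fields: the SEQUENCE value is the Logger sequence number repeated, and RESTART is the EDP Logger RBA. The names are taken from the example; verify the exact card layout against the PowerExchange utilities documentation before relying on it.

```python
# Sketch: build DTLUAPPL SYSIN cards from the Logger QUIESCE message fields.
# Application/registration/capture names come from Navigator, as in the example.
def dtluappl_sysin(application, registration, capture, seq, rba):
    sequence = seq + seq          # SEQUENCE is the Logger sequence, repeated
    return "\n".join([
        f"MOD APPL {application} {registration}",
        f" ADD RSTTKN {capture}",
        f"  SEQUENCE {sequence}",
        f"  RESTART {rba}",
        f"END APPL {application}",
    ])

sysin = dtluappl_sysin("REGDEMO", "DSN1", "CAPDEMO",
                       "000000084E0000000000",
                       "D5D3D3D34040000000084E0000000000")
print(sysin)
```

Generating the cards programmatically avoids the easy mistake of mistyping the 40-character repeated sequence string by hand.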
Task, Stop Command, and Notes

Listener:
- Preferred method: /F DTLLST,CLOSE
- Escalation commands apply if CLOSE doesn't work, if FORCE doesn't work, and if STOP doesn't work (see the notes below).
- /DTLA DRAIN and SHUTDOWN COMPLETELY can be used only at the request of Informatica Support.
Description of Task
Listener
The PowerExchange listener is used for bulk data movement and registering sources for Change Data Capture
Agent
The PowerExchange Agent, used to manage connections to the PowerExchange Logger and handle repository and other tasks. This must be started before the Logger.
Logger

The PowerExchange Logger, used to manage the linear data sets and hiperspace that hold change capture data. If you are installing, you need to run setup2 before starting the Logger. Use /F DTLL,DISPLAY to display status.
ECCR (DB2)
The STOP command just cancels the ECCR; QUIESCE waits for open UOWs to complete. There must be registrations present prior to bringing up most adaptor ECCRs. /F DTLDB2EC,DISPLAY publishes stats into the ECCR sysout.
Condense

The PowerExchange Condenser, used to run condense jobs against the PowerExchange Logger. This is used with PowerExchange CHANGE to organize the data by table, allow for interval-based extraction, and optionally fully condense multiple changes to a single row. Start with /S DTLC; stop with /F DTLC,SHUTDOWN.
Apply

The PowerExchange Apply process, used in situations where straight replication is required and the data is not moved through PowerCenter before landing in the target. To start Apply, submit the JCL or issue /S DBN2 (the apply name). To stop it: (1) identify all tasks running through a certain listener by issuing F <Listener job>,DTLAPP; (2) then stop the Apply by issuing F DTLLST,STOPTASK name, where name = DA (the apply name). If the CAPX access and apply is running locally, not through a listener, issue F <Listener job>,CLOSE.
Notes:
1. /P is an MVS STOP command; /F is an MVS MODIFY command.
2. Remove the / if the command is issued from the console rather than from SDSF.
If you attempt to shut down the Logger before the ECCR(s), a message indicates that there are still active ECCRs and that the Logger will come down after the ECCRs go away. You can shut the Listener and the ECCR(s) down at the same time.

The Listener:
1. F <Listener_job>,CLOSE
2. If this isn't coming down fast enough for you, issue F <Listener_job>,CLOSE FORCE
3. If it still isn't coming down fast enough, issue C <Listener_job>
Note that these commands are listed in order from the most to the least desirable method for bringing the Listener down.

The DB2 ECCR:
1. F <DB2 ECCR>,QUIESCE - waits for all open UOWs to finish, which can be a while if a long-running batch job is active
2. F <DB2 ECCR>,STOP - terminates immediately
3. P <DB2 ECCR> - also terminates immediately
Once the ECCR(s) are down, you can then bring the Logger down.

The Logger: P <Logger job_name>
The Agent: CMDPREFIX SHUTDOWN

If you know that you are headed for an IPL, you can issue all of these commands at the same time. The Listener and ECCR(s) should start down; if you are looking for speed, issue F <Listener_job>,CLOSE FORCE to shut down the Listener, then F <DB2 ECCR>,STOP to terminate the DB2 ECCR, then shut down the Logger and the Agent.

Note: Bringing the Agent down before the ECCR(s) are down can result in a loss of captured data. If a new file, DB2 table, or IMS database is updated during this shutdown process and the Agent is not available, the call to see if the source is registered returns a "not being captured" answer. The update therefore occurs without being captured, leaving your target in a broken state (which you won't know about until too late!).
The appropriate size for the active log data sets depends on:

- Resource availability requirements
- Performance requirements
- Whether you are running near-realtime or batch replication
- Data recovery requirements
An inverse relationship exists between the size of the log data sets and the frequency of archiving required. Larger data sets need to be archived less often than smaller data sets. Note: Although smaller data sets require more frequent archiving, the archiving process requires less time. Use the following formulas to estimate the total space you need for each active log data set. For an example of the calculated data set size, refer to the PowerExchange Reference Guide.
active log data set size in bytes = (average size of captured change record * number of changes captured per hour * desired number of hours between archives) * (1 + overhead rate)

active log data set size in tracks = active log data set size in bytes / number of usable bytes per track

active log data set size in cylinders = active log data set size in tracks / number of tracks per cylinder
When determining the average size of your captured change records, note the following information:
- PWX Change Capture captures the full object that is changed. For example, if one field in an IMS segment has changed, the product captures the entire segment.
- The PWX header adds overhead to the size of the change record. Per record, the overhead is approximately 300 bytes plus the key length.
- The type of change transaction affects whether PWX Change Capture includes a before-image, an after-image, or both:
  - DELETE includes a before-image.
  - INSERT includes an after-image.
  - UPDATE includes both.
Informatica suggests using an overhead rate of 5 to 10 percent, which includes the following factors:
- Overhead for control information
- Overhead for writing recovery-related information, such as system checkpoints
You have some control over the frequency of system checkpoints when you define your PWX Logger parameters. See CHKPT_FREQUENCY in the PowerExchange Reference Guide for more information about this parameter. DASD Capacity Conversion Table
Consider an example with the following assumptions:

- estimated average size of a changed record = 600 bytes
- estimated rate of captured changes = 40,000 changes per hour
- desired number of hours between archives = 12
- overhead rate = 5 percent
- DASD model = 3390
The estimated size of each active log data set in bytes is calculated as follows:

600 * 40,000 * 12 * 1.05 = 302,400,000

The number of cylinders to allocate is calculated as follows:

302,400,000 / 49,152 = approximately 6,152 tracks
6,152 / 15 = approximately 410 cylinders

The following example shows an IDCAMS DEFINE statement that uses the above calculations:
DEFINE CLUSTER (NAME(HLQ.EDML.PRILOG.DS01) -
         LINEAR -
         VOLUMES(volser) -
         SHAREOPTIONS(2,3) -
         CYL(410)) -
       DATA (NAME(HLQ.EDML.PRILOG.DS01.DATA))
The variable HLQ represents the high-level qualifier that you defined for the log data sets during installation.
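The sizing arithmetic above can be expressed directly. This sketch reproduces the worked example, assuming the 3390 geometry of 49,152 usable bytes per track and 15 tracks per cylinder:

```python
# Sketch: estimate active log data set size from the formulas above.
def log_dataset_size(avg_record_bytes, changes_per_hour, hours_between_archives,
                     overhead_rate, bytes_per_track=49152, tracks_per_cyl=15):
    size_bytes = (avg_record_bytes * changes_per_hour *
                  hours_between_archives) * (1 + overhead_rate)
    tracks = round(size_bytes / bytes_per_track)      # tracks on this DASD model
    cylinders = round(tracks / tracks_per_cyl)        # cylinders to allocate
    return int(size_bytes), tracks, cylinders

size_bytes, tracks, cyls = log_dataset_size(600, 40_000, 12, 0.05)
print(size_bytes, tracks, cyls)   # 302400000 6152 410
```

Re-running the function with your own change-record size and capture rate gives the CYL value to plug into the IDCAMS DEFINE.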
When defining the log data sets, specify:

- No secondary allocation
- A single VOLSER in the VOLUME parameter
- An SMS STORCLAS, if used, without GUARANTEED SPACE=YES
Use PowerExchange Logger commands to:

- Display PowerExchange Logger log data sets, units of work (UOWs), and reader/writer connections
- Resolve in-doubt UOWs
- Stop a PowerExchange Logger
- Print the contents of the PowerExchange active log file (in hexadecimal format)
You use batch commands primarily in batch change utility jobs to make changes to parameters and configurations when the PowerExchange Logger is stopped. Use PowerExchange Logger batch commands to:

- Define PowerExchange Loggers and PowerExchange Logger options, including PowerExchange Logger names, archive log options, buffer options, and mode (single or dual)
- Add log definitions to the restart data set
- Delete data set records from the restart data set
- Display log data sets, UOWs, and reader/writer connections
See the PowerExchange Reference Guide (8.1.1), Chapter 4, page 59, for detailed information on Logger commands.
- How can the team keep track of what has been loaded?
- What order should the data be loaded in?
- What happens when there is a load failure?
- How can bad data be removed and replaced?
- How can the source of data be identified, and when it was loaded?
Description
Load management provides an architecture to allow all of the above questions to be answered with minimal operational effort.
Data Lineage
The term Data Lineage describes the ability to track data from its final resting place in the target back to its original source. This requires tagging every row of data in the target with an ID from the load management metadata model, which serves as a direct link between the actual data in the target and the original source data.

To give an example of the usefulness of this ID: a data warehouse or integration competency center operations team, or possibly end users, can, on inspection of any row of data in the target schema, link back to see when it was loaded, where it came from, any other metadata about the set it was loaded with, validation check results, the number of other rows loaded at the same time, and so forth. It is also possible to use this ID to link one row of data with all of the other rows loaded at the same time. This can be useful when a data issue is detected in one row and the operations team needs to see if the same error exists in all of the other rows.

More than this, the ability to easily identify the source data for a specific row in the target enables the operations team to quickly identify where a data issue may lie. It is often assumed that data issues are produced by the transformation processes executed as part of the target schema load. Using the source ID to link back to the source data makes it easy to identify whether the issues were in the source data when it was first encountered by the target schema load processes, or whether those load processes caused the issue. This ability can save a huge amount of time, expense, and frustration, particularly in the initial launch of any new subject areas.
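As a minimal illustration of the tagging idea, each load run gets a record in a load-audit structure and stamps its ID on every row it writes. The table and column names here are hypothetical, standing in for the load management metadata model:

```python
# Sketch: stamp every loaded row with a load ID that links back to
# load-management metadata (source file, load time, and so on).
import datetime

load_audit = {}   # load_id -> metadata about that load run

def register_load(source_name):
    load_id = len(load_audit) + 1
    load_audit[load_id] = {"source": source_name,
                           "loaded_at": datetime.datetime.now()}
    return load_id

def load_rows(target, rows, load_id):
    for row in rows:
        target.append({**row, "load_id": load_id})   # lineage tag on each row

target = []
lid = register_load("orders_20240101.csv")
load_rows(target, [{"order": 1}, {"order": 2}], lid)
# Lineage: from any target row, look up where and when it was loaded.
print(load_audit[target[0]["load_id"]]["source"])
```

The same ID also answers the "which other rows arrived with this one" question: a single filter on load_id returns the whole set.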
Process Lineage
Tracking the order that data was actually processed in is often the key to resolving processing and data issues. Because choices are often made during the processing of data based on business rules and logic, the order and path of processing differs from one run to the next. Only by actually tracking these processes as they act upon the data can issue resolution be simplified.
Robustness
Using load management metadata to control the loading process also offers two other big advantages, both of which fall under the heading of robustness because they allow for a degree of resilience to load failure.
Load Ordering
Load ordering is a set of processes that use the load management metadata to identify the order in which the source data should be loaded. This can be as simple as making sure the data is loaded in the sequence it arrives, or as complex as having a pre-defined load sequence planned in the metadata.

There are a number of techniques used to manage these processes. The most common is an automated process that generates a PowerCenter load list from flat files in a directory, then archives the files in that list after the load is complete. This process can use embedded data in file names or can read header records to identify the correct ordering of the data. Alternatively, the correct order can be pre-defined in the load management metadata using load calendars.

Either way, load ordering should be employed in any data integration or data warehousing implementation because it allows the load process to be automatically paused when there is a load failure, and ensures that the data that has been put on hold is loaded in the correct order as soon as possible after a failure. The essential part of the load management process is that it operates without human intervention, helping to make the system self-healing.
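A simple version of the file-list generator could look like this, assuming each file name embeds a sortable timestamp (the naming pattern is an assumption for illustration; a header-record reader would serve the same purpose):

```python
# Sketch: build an ordered PowerCenter load list from files in a directory,
# using a timestamp embedded in each file name to enforce load order.
import re

def build_load_list(filenames):
    pattern = re.compile(r"_(\d{8})\.dat$")      # e.g. orders_20240102.dat
    dated = [(pattern.search(f).group(1), f)
             for f in filenames if pattern.search(f)]
    return [f for _, f in sorted(dated)]         # oldest first

files = ["orders_20240103.dat", "orders_20240101.dat", "orders_20240102.dat"]
print(build_load_list(files))
```

After a failure, the same routine naturally picks up the held-back files in the correct order on the next run, since anything not yet archived is still in the directory.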
Rollback
If there is a loading failure or a data issue in normal daily load operations, it is usually preferable to remove all of the data loaded as one set. Load management metadata allows the operations team to selectively roll back a specific set of source data, the data processed by a specific process, or a combination of both. This can be done using manual intervention or by a developed automated feature.
As you can see from the simple load management metadata model above, there are two sets of data linked to every transaction in the target tables. These represent the two major types of load management metadata:
Source Tracking
Source tracking looks at how the target schema validates and controls the loading of source data. The aim is to automate as much of the load processing as possible and track every load from the source through to the target schema.
Source Definitions
Most data integration projects use batch load operations for the majority of data loading. The sources for these come in a variety of forms, including flat file formats (ASCII, XML etc), relational databases, ERP systems, and legacy mainframe systems. The first control point for the target schema is to maintain a definition of how each source is structured, as well as other validation parameters. These definitions should be held in a Source Master table like the one shown in the data model above. These definitions can and should be used to validate that the structure of the source data has not changed. A great example of this practice is the use of DTD files in the validation of XML feeds.
For flat-file sources, the definition typically covers:

- Header information (if any)
- How many columns
- Data types for each column
- Expected number of rows
For RDBMS sources, the Source Master record might hold the definition of the source tables or store the structure of the SQL statement used to extract the data (i.e., the SELECT, FROM, and ORDER BY clauses). These definitions can be used to manage and understand the initial validation of the source data structures. Quite simply, if the system is validating the source against a definition, there is an inherent control point at which problem notifications and recovery processes can be implemented. It's better to catch a bad data structure than to start loading bad data.
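A hedged sketch of such a structure check for a flat file follows. The expected values would come from the Source Master table; here they are hard-coded for illustration:

```shell
# Sketch of validating an inbound flat file against its Source Master
# definition before loading: expected column count and minimum row count.
set -e
work=$(mktemp -d)

cat > "$work/ff_customer.dat" <<'EOF'
CUST001,Acme,2007-03-01
CUST002,Globex,2007-03-01
EOF

# Values that would be read from the Source Master definition.
expected_cols=3
expected_min_rows=1

actual_cols=$(awk -F, 'NR==1{print NF}' "$work/ff_customer.dat")
actual_rows=$(wc -l < "$work/ff_customer.dat" | tr -d ' ')

status=VALID
[ "$actual_cols" -eq "$expected_cols" ] || status=STRUCTURE_CHANGED
[ "$actual_rows" -ge "$expected_min_rows" ] || status=TOO_FEW_ROWS
echo "ff_customer.dat: $status"
```

A non-VALID status is the control point: it should raise the problem notification and hold the file out of the load list rather than letting bad data start loading.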
Source Instances
A Source Instance table (as shown in the load management metadata model) is designed to hold one record for each separate set of data of a specific source type being loaded. It should have a direct key link back to the Source Master table which defines its type. The various source types may need slightly different source instance metadata to enable optimal control over each individual load. Unlike the source definitions, this metadata will change every time a new extract and load is performed. In the case of flat files, this would be a new file name and possibly date / time information from its header record. In the case of relational data, it would be the selection criteria (i.e., the SQL WHERE clause) used for each specific extract, and the date and time it was executed. This metadata needs to be stored in the source tracking tables so that the operations team can identify a specific set of source data if the need arises. This need may arise if the data needs to be removed and reloaded after an error has been spotted in the target schema.
Process Tracking
Process tracking describes the use of load management metadata to track and control the loading processes rather than the specific data sets themselves. There can often be many load processes acting upon a single source instance set of data. While it is not always necessary to be able to identify when each individual process completes, it is very beneficial to know when a set of sessions that move data from one stage to the next has completed. Not all sessions are tracked this way because, in most cases, the individual processes are simply storing data into temporary tables that will be flushed at a later date. Since load management process IDs are intended to track back from a record in the target schema to the process used to load it, it only makes sense to generate a new process ID if the data is being stored permanently in one of the major staging areas.
Process Definition
Process definition metadata is held in the Process Master table (as shown in the load management metadata model). This, in its basic form, holds a description of the process and its overall status. It can also be extended, with the introduction of other tables, to reflect any dependencies among processes, as well as processing holidays.
Process Instances
A process instance is represented by an individual row in the load management metadata Process Instance table, one for each instance of a load process that is actually run. The row holds metadata about when the process started and stopped, as well as its current status. Most importantly, this table allocates a unique ID to each instance; that ID is used to tag every row of source data and is then stored with each row of data in the target table.
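The allocate-and-tag step can be sketched as follows. In PowerCenter this is normally done with a sequence and the mapping itself; here a counter file stands in for the Process Instance table, purely for illustration:

```shell
# Sketch of allocating a unique process instance ID and tagging every
# source row with it on the way into the target.
set -e
work=$(mktemp -d)
echo 100 > "$work/process_counter"

# Allocate the next ID and record the instance start.
pid=$(( $(cat "$work/process_counter") + 1 ))
echo "$pid" > "$work/process_counter"
echo "$pid,2007-03-01,RUNNING" >> "$work/process_instance.csv"

# Staged source rows awaiting load.
cat > "$work/stage.csv" <<'EOF'
CUST001,500
CUST002,750
EOF

# Tag each row with the process ID as it lands in the target.
awk -F, -v p="$pid" '{print $0","p}' "$work/stage.csv" > "$work/target.csv"

tagged=$(awk -F, 'NR==1{print $3}' "$work/target.csv")
echo "process $pid tagged rows; first row carries ID $tagged"
```

Every row in the target now carries the ID that links it back to its process instance record, which is what makes the lineage and rollback mechanisms described earlier possible.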
Tracking Transactions
This is the simplest data to track since it is loaded incrementally and not updated. This means that the process and source tracking discussed earlier in this document can be applied as is.
Tracking Aggregations
Aggregation also causes additional complexity for load management because the resulting aggregate row very often contains the aggregation across many source data sets. As with reference data, this means that the aggregated row cannot be backed out in the same way as transactions. This problem is managed by treating the source of the aggregate as if it were an original source: rather than trying to track the original source, the load management metadata only tracks back to the transactions in the target that have been aggregated. So the mechanism is the same as that used for transactions, but the resulting load management metadata only tracks back from the aggregate to the fact table in the target schema.
Description
When considering HA recovery, be sure to explore the following two components of HA that exist on all enterprise systems:
External Resilience
External resilience has to do with the integration and specification of domain name servers, database servers, FTP servers, and network access servers in a defined, tested 24x7 configuration. The nature of Informatica's data integration setup places it at many interface points in system integration. Before placing and configuring PowerCenter within an infrastructure that has an HA expectation, the following questions should be answered:
- Is the pre-existing set of servers already in a sustained HA configuration? Is there a schematic with applicable settings to use for reference? If so, is there a unit test or system test to exercise before installing PowerCenter products? It is important to remember that the external systems must be HA before the PowerCenter architecture they support can be.
- What are the bottlenecks or perceived failure points of the existing system? Are these bottlenecks likely to be exposed or heightened by placing PowerCenter in the infrastructure? (e.g., five times the amount of Oracle traffic, ten times the amount of DB2 traffic, a UNIX server that always shows 10% idle may now have twice as many processes running.)
- Finally, if a proprietary solution (such as IBM HACMP or Veritas Storage Foundation for Windows) has been implemented with success at a customer site, this sets a different expectation. The customer may merely want the grid capability of multiple PowerCenter nodes to splay/recover Informatica tasks, and expect their back-end systems (such as those listed above) to provide file system or server bootstrap recovery upon a fundamental failure of those back-end systems. If these back-end systems have a script/command capability to, for example, restart a repository service, PowerCenter can be installed in this fashion. However, PowerCenter's HA capability extends as far as the PowerCenter components.
Internal Resilience
Internal resilience occurs within the PowerCenter environment among PowerCenter services, the PowerCenter Client tools, and other client applications such as pmrep and pmcmd. It depends on:

- Rapid and constant connectivity to the repository metadata.
- Rapid and constant network connectivity between all gateway and worker nodes in the PowerCenter domain.
- A common highly-available storage system accessible to all PowerCenter domain nodes with one service name and one file protocol. Only domain nodes on the same operating system can share gateway and log files (see Admin Console->Domain->Properties->Log and Gateway Configuration).

Internal resilience can be configured at the following levels:

- Domain. Configure service connection resilience at the domain level in the general properties for the domain. The domain resilience timeout determines how long services attempt to connect as clients to application services or the Service Manager. The domain resilience properties are the default values for all services in the domain.
- Service. It is possible to configure service connection resilience in the advanced properties for an application service. Configuring connection resilience for an application service overrides the resilience values from the domain settings.
- Gateway. The master gateway node maintains a connection to the domain configuration database. If the domain configuration database becomes unavailable, the master gateway node tries to reconnect. The resilience timeout period depends on user activity and whether the domain has one or multiple gateway nodes:
  - Single gateway node. If the domain has one gateway node, the gateway node tries to reconnect until a user or service tries to perform a domain operation. When a user tries to perform a domain operation, the master gateway node shuts down.
  - Multiple gateway nodes. If the domain has multiple gateway nodes and the master gateway node cannot reconnect, the master gateway node shuts down. If a user tries to perform a domain operation while the master gateway node is trying to reconnect, the master gateway node shuts down. If another gateway node is available, the domain elects a new master gateway node. The domain tries to connect to the domain configuration database with each gateway node. If none of the gateway nodes can connect, the domain shuts down and all domain operations fail.
Process
Be aware that your implementation has a dependency on the installation environment. For example, you may want to combine multiple disparate ETL repositories onto a single upgraded PowerCenter platform. This has the benefits of:

- A single point of access/administration from the Admin Console.
- A group of repositories that can now become a repository domain.
- A group of repositories that can be shaped into common processing/backup/schedule patterns for optimal performance and administration.

It also carries risks:

- A single point of failure for the whole PowerCenter domain.
- One repository, possibly heavy in processing or poorly designed, degrading that entire PowerCenter domain.
If the primary node running the service process becomes unavailable, the service fails over to a backup node. The primary node may be unavailable if it shuts down or if the connection to the node becomes unavailable. If the primary node running the service process is available, the domain tries to restart the process based on the restart options configured in the domain properties. If the process does not restart, the Service Manager can mark the process as failed. The service then fails over to a backup node and starts another process. If the Service Manager marks the process as failed, the administrator must enable the process after addressing any configuration problem.
If a service process fails over to a backup node, it does not fail back to the primary node when the node becomes available. You can disable the service process on the backup node to cause it to fail back to the primary node.
Recovery
Recovery is the completion of operations after an interrupted service is restored. When a service recovers, it restores the state of operation and continues processing the job from the point of interruption.
The state of operation for a service contains information about the service process. The PowerCenter services include the following states of operation:
- Service Manager. The Service Manager for each node in the domain maintains the state of service processes running on that node. If the master gateway shuts down, the newly elected master gateway collects the state information from each node to restore the state of the domain.
- Repository Service. The Repository Service maintains the state of operation in the repository. This includes information about repository locks, requests in progress, and connected clients.
- Integration Service. The Integration Service maintains the state of operation in the shared storage configured for the service. This includes information about scheduled, running, and completed tasks for the service. The Integration Service maintains session and workflow state of operation based on the recovery strategy you configure for the session and workflow.
When designing a system that has HA recovery as a core component, be sure to include both architectural and procedural recovery. Architectural recovery for a PowerCenter domain involves the three services above restarting in a complete, sustainable, and traceable manner. If the Service Manager and Repository Service recover but the Integration Service cannot, the restart is not successful and has little value to a production environment. Field experience with PowerCenter has yielded these key items in planning a proper recovery upon a systemic failure:
- A PowerCenter domain cannot be established without at least one gateway node running. Even if you have established a domain with ten worker nodes and one gateway node, none of the worker nodes can run ETL jobs without a gateway node managing the domain.
- An Integration Service cannot run without its associated Repository Service being started and connected to its metadata repository.
- A Repository Service cannot run without its metadata repository DBMS being started and accepting database connections. Often database connections are established on periodic windows that expire, which puts the repository offline.
- If the installed domain configuration is running from Authentication Module Configuration and the LDAP principal user account becomes corrupt or inactive, all PowerCenter repository access is lost. If the installation uses any additional authentication outside PowerCenter (such as LDAP), an additional recovery and restart plan is required.
Procedural recovery is supported with many features of PowerCenter 8. Consider the following very simple mapping that might run in production for many ETL applications:
Suppose the FTP server sending this ff_customer file is inconsistent. Many times the file is not there, but the processes depending on it must always run, and the process is insert only. You do not want the succession of ETL that follows this small process to fail; it can run to customer_stg with current records only. This setting in the Workflow Manager (Session, Properties) would fit the need:

Since it is not critical that the ff_customer records load each time, record the failure but continue the process. Now say the situation has changed: sessions are failing on a PowerCenter server due to target database timeouts. A requirement is given that the session must recover from this:
Resuming from last checkpoint restarts the process from its prior commit, allowing no loss of ETL work. To finish this second case, consider three basic items on the workflow side when HA is incorporated in your environment:
An Integration Service in an HA environment can only recover those workflows marked with Enable HA recovery. For all critical workflows, this should be considered. For a mature set of ETL code running in QA or Production, you may consider the following workflow property:
This would automatically recover tasks from where they failed in a workflow upon an application or system-wide failure. Consider carefully the use of this feature, however. Remember, automated restart of critical ETL processes without interaction can have vast unintended side effects. For instance, if a database alias or synonym was dropped, all ETL targets may now refer to different objects than the original intent. Only PowerCenter environments with HA, mature production support practices, and a complete operations manual per Velocity should expect complete recovery with this feature.

In an HA environment, certain components of the domain can go offline while the domain stays up to execute ETL jobs. This is a time to use the Suspend On Error feature from the General tab of the workflow settings. The backup Integration Service would then pick up this workflow and resume processing based on the resume settings of this workflow:
Features
A variety of HA features exist in PowerCenter 8. Specifically, they include:
- Integration Service HA option
- Integration Service Grid option
- Repository Service HA option
First, proceed from an assumption that nodes have been provided to you such that a basic HA configuration of PowerCenter 8 can take place. A lab-tested version completed by Informatica is configured as below with an HP solution. Your solution can be completed with any reliable clustered file system. Your first step would always be implementing and thoroughly exercising a clustered file system:
If a failure were to occur on this HA configuration, the Integration Service INT_SVCS_DEV would poll the domain Domain_Corp_RD for another gateway node, then assign INT_SVCS_DEV over to that node, in this case Node_Corp_RD02. The B button would then highlight, showing this node as providing INT_SVCS_DEV.

A vital component of configuring the Integration Service for HA is making sure the Integration Service files are stored in a shared persistent environment. You must specify the paths for Integration Service files for each Integration Service process. Examples of Integration Service files include run-time files, state of operation files, and session log files.

Each Integration Service process uses run-time files to process workflows and sessions. If you configure an Integration Service to run on a grid or on backup nodes, the run-time files must be stored in a shared location. Each node must have access to the run-time files used to process a session or workflow, including parameter files, cache files, input files, and output files.

State of operation files must be accessible by all Integration Service processes. When you enable an Integration Service, it creates files to store the state of operations for the service, including information such as active service requests, scheduled tasks, and completed and running processes. If the service fails, the Integration Service can restore the state and recover operations from the point of interruption.

All Integration Service processes associated with an Integration Service must use the same shared location. However, each Integration Service can use a separate location. By default, the installation program creates a set of Integration Service directories in the server\infa_shared directory. You can set the shared location for these directories by configuring the process variable $PMRootDir to point to the same location for each Integration Service process. The key HA concern is that $PMRootDir should be on the highly-available clustered file system mentioned above.
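A minimal pre-flight sketch of the $PMRootDir concern: confirm the path exists and is writable before enabling the service. Here $PMRootDir is an ordinary environment variable standing in for the PowerCenter process variable; in production it would point at the clustered file system, and the same check would be run from every node:

```shell
# Sketch of a pre-flight check that the shared root directory for
# Integration Service files exists and is writable from this node.
set -e
PMRootDir=$(mktemp -d)   # in production: the clustered file system path

status=OK
if [ ! -d "$PMRootDir" ]; then
  status=MISSING
else
  probe="$PMRootDir/.ha_probe.$$"
  if touch "$probe" 2>/dev/null; then
    rm -f "$probe"
  else
    status=NOT_WRITABLE
  fi
fi
echo "PMRootDir check: $status"
```

Running this from each node in the grid before enabling the service catches mount or permission problems that would otherwise surface only during a failover.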
PowerCenter supports nodes with heterogeneous operating systems, bit modes, and other characteristics within the same domain. However, with heterogeneous nodes in a grid, you can only run a workflow on the grid, not a session. Session on grid does not support heterogeneous operating systems because a session may share cache files and other objects that are not compatible across all operating systems. For session on grid, you need a homogeneous grid.
In short, scenarios such as a production failure are the worst possible time to discover that a multi-OS grid does not meet your needs. If you have a large volume of disparate hardware, it is certainly possible to create two grids centered on two different operating systems. In either case, the performance of your clustered file system affects the performance of your server grid and should be considered part of your performance/maintenance strategy.
There are two methods for configuring Repository Service HA. The first is during install. When the install program prompts for your nodes to do a repository install (after answering Yes to Create Repository), you can enter a second node where the install program can create and invoke the PowerCenter service and Repository Service for a backup repository node. Keep in mind that all of the database, OS, and server preparation steps referred to in the PowerCenter Installation and Configuration Guide still hold true for this backup node. When the install is complete, the Repository Service displays a P/B link similar to that illustrated above for the INT_SVCS_DEV example Integration Service.

A second method for configuring Repository Service HA allows for measured, incremental implementation of HA from a tested base configuration. After ensuring that your initial Repository Service settings (e.g., resilience timeout, codepage, connection timeout) and the DBMS repository containing the metadata are running and stable, you can add a second node and make it the repository backup. Install the PowerCenter service on this second server following the PowerCenter Installation and Configuration Guide; in particular, skip creating repository content or an Integration Service on the node. Following this, go to Admin Console->Domain and select Create->Node. The server to contain this node should be of the exact same configuration/clustered file system/OS as the primary Repository Service. The following dialog should appear:
Assign a logical name to the node to describe its place, and select Create. The node should now be running as part of your domain, but if it isn't, refer to the PowerCenter Command Line Reference with the infaservice and infacmd commands to ensure the node is running on the domain. When it is running, go to Domain->Repository->Properties->Node Assignments->Edit and the browser window displays:
Click OK and the Repository Service is now configured in a Primary/Backup setup for the domain. To ensure the P/B setting, test the following elements of the configuration: 1. Be certain the same version of the DBMS client is installed on the server and can access the metadata. 2. Both nodes must be on the same clustered file system. 3. Log onto the OS for the Backup Repository Service and ping the Domain Master Gateway Node. Be sure a reasonable response time is being given at an OS level (i.e., less than 5 seconds). 4. Take the Primary Repository Service Node offline and validate that the polling, failover, restart process takes place in a methodical, traceable manner for the Repository Service on the Domain. This should be clearly visible from the node logs on the Primary and Secondary Repository Service boxes [$INFA_HOME/server/tomcat/logs] or from Admin Console->Repository->Logs. Note: Remember that when a node is taken offline, you cannot access Admin Console from it.
Description
Methods for validating the load process range from simple to complex. Use the following steps to plan a load validation process:

1. Determine what information you need for load validation (e.g., workflow names, session names, session start times, session completion times, successful rows, and failed rows).
2. Determine the source of the information. All of this information is stored as metadata in the PowerCenter repository, but you must have a means of extracting it.
3. Determine how you want the information presented to you. Should the information be delivered in a report? Do you want it emailed to you? Do you want it available in a relational table so that history is easily preserved? Do you want it stored as a flat file? Weigh all of these factors to find the correct solution for your project.

Below are descriptions of five possible load validation solutions, ranging from fairly simple to increasingly complex:
- %s Session name
- %e Session status
- %b Session start time
- %c Session completion time
- %i Session elapsed time
- %l Total records loaded
- %r Total records rejected
- %t Target table details
- %m Name of the mapping used in the session
- %n Name of the folder containing the session
- %d Name of the repository containing the session
- %g Attach the session log to the message
- %a <file path> Attach a file to the message
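As an illustration, a post-session email body combining several of these variables might look like the following. The layout is an assumption; PowerCenter substitutes the values when the message is sent:

```
Session %s (%e) in folder %n
Started:   %b
Finished:  %c
Elapsed:   %i
Loaded:    %l rows   Rejected: %r rows
Targets:   %t
%g
```

Including %g in the body attaches the session log, which gives the support team immediate detail without logging into the Workflow Monitor.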
- Load statistics and operational metadata that enable load validation.
- Table dependencies and impact analysis that enable change management.
- PowerCenter object statistics to aid in development assistance.
- Historical load statistics that enable planning for growth.
In addition to the 130 pre-packaged reports and dashboards that come standard with PCR, you can develop additional custom reports and dashboards that are based upon the PCR limited-use license that allows you to source reports from the PowerCenter repository. Examples of custom components that can be created include:
- Repository-wide reports and/or dashboards with indicators of daily load success/failure.
- Customized project-based dashboards with visual indicators of daily load success/failure.
- Detailed daily load statistics reports for each project that can be exported to Microsoft Excel or PDF.
- Error handling reports that deliver error messages and source data for row-level errors that may have occurred during a load.
Below is an example of a custom dashboard that gives instant insight into the load validation across an entire repository through four custom indicators.
TIP: Informatica strongly advises against querying directly from the repository tables. Because future versions of PowerCenter are likely to alter the underlying repository tables, PowerCenter supports queries from the unaltered MX views, not the repository tables.
5. Mapping Approach
A more complex approach, and the most customizable, is to create a PowerCenter mapping to populate a table or a flat file with desired information. You can do this by sourcing the MX view REP_SESS_LOG and then performing lookups to other repository tables or views for additional information. The following graphic illustrates a sample mapping:
This mapping selects data from REP_SESS_LOG and performs lookups to retrieve the absolute minimum and maximum run times for that particular session. This enables you to compare the current execution time with the minimum and maximum durations.
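The comparison the mapping performs can be sketched outside PowerCenter as simple threshold logic. The values below are illustrative; a real implementation would read the current elapsed time from REP_SESS_LOG and the historical bounds from the lookup targets:

```shell
# Sketch of flagging a session whose current duration falls outside its
# historical min/max band. Values are illustrative stand-ins for the
# metadata mart lookups.
set -e
min_secs=120       # historical minimum run time from the lookup
max_secs=300       # historical maximum run time from the lookup
current_secs=450   # this run's elapsed time

flag=NORMAL
if [ "$current_secs" -lt "$min_secs" ]; then
  flag=FASTER_THAN_USUAL
fi
if [ "$current_secs" -gt "$max_secs" ]; then
  flag=SLOWER_THAN_USUAL
fi
echo "session duration check: $flag"
```

A run flagged as faster than its historical minimum is often as suspicious as a slow one: it may indicate an empty or truncated source rather than a genuine improvement.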
Note: Unless you have acquired additional licensing, a customized metadata data mart cannot be a source for a PCR report. However, you can use a business intelligence tool of your choice instead.
Description
The PowerCenter Administrator has many responsibilities. In addition to regularly backing up the domain and repository, truncating logs, and updating the database statistics, he or she also typically performs the following functions:
- Determines metadata strategy
- Installs/configures client/server software
- Migrates development to test and production
- Maintains PowerCenter servers
- Upgrades software
- Administers security and folder organization
- Monitors and tunes environment
Note: The Administrator is also typically responsible for maintaining domain and repository passwords; changing them on a regular basis and keeping a record of them in a secure place.
This responsibility includes installing and configuring the application servers in all applicable environments (e.g., development, QA, production, etc.). The Administrator must have a thorough understanding of the working environment, along with access to resources such as a Windows 2000/2003 or UNIX Admin and a DBA. The Administrator is also responsible for installing and configuring the client tools. Although end users can generally install the client software, the configuration of the client tool connections benefits from being consistent throughout the repository environment. The Administrator, therefore, needs to enforce this consistency in order to maintain an organized repository.
Upgrade Software
If and when the time comes to upgrade software, the Administrator is responsible for overseeing the installation and upgrade process.
Description
Tasks such as getting server and session properties, checking session status, or starting or stopping a workflow or task can be performed either through the Workflow Monitor or by integrating a third-party scheduler with PowerCenter. A third-party scheduler can be integrated with PowerCenter at any of several levels; the level of integration depends on the complexity of the workflow/schedule and the skill sets of production support personnel.

Many companies want to automate the scheduling process by using scripts or third-party schedulers. In some cases, they are using a standard scheduler and want to continue using it to drive the scheduling process. A third-party scheduler can start or stop a workflow or task, obtain session statistics, and get server details using pmcmd, a command-line program used to communicate with the PowerCenter server.
Low Level
Low-level integration refers to a third-party scheduler kicking off the initial PowerCenter workflow. This process subsequently kicks off the rest of the tasks or sessions. The PowerCenter scheduler handles all processes and dependencies after the third-party scheduler has kicked off the initial workflow. In this level of integration, nearly all control lies with the PowerCenter scheduler.
This type of integration is very simple to implement because the third-party scheduler kicks off only one process; it is often used simply to satisfy a corporate mandate for a standard scheduler. This type of integration also takes advantage of the robust functionality offered by the Workflow Monitor. Low-level integration requires production support personnel to have a thorough knowledge of PowerCenter. Because production support personnel in many companies are only knowledgeable about the company's standard scheduler, one of the main disadvantages of this level of integration is that if a batch fails at some point, the production support personnel may not be able to determine the exact breakpoint. Thus, the majority of the production support burden falls back on the project development team.
Medium Level
With medium-level integration, a third-party scheduler kicks off some, but not all, workflows or tasks. Within the tasks, many sessions may be defined with dependencies, and PowerCenter controls the dependencies within the tasks. At this level, control is shared between PowerCenter and the third-party scheduler, which requires tighter coordination between the two. Medium-level integration requires Production Support personnel to have a fairly good knowledge of both PowerCenter and the scheduling tool. If they do not have in-depth knowledge of both tools, they may be unable to fix problems that arise, so the production support burden is shared between the Project Development team and the Production Support team.
High Level
With high-level integration, the third-party scheduler has full control of scheduling and kicks off all PowerCenter sessions. In this case, the third-party scheduler is responsible for controlling all dependencies among the sessions. This type of integration is the most complex to implement because there are many more interactions between the third-party scheduler and PowerCenter. Production Support personnel may have limited knowledge of PowerCenter but must have thorough knowledge of the scheduling tool. Because Production Support personnel in many companies are knowledgeable only about the company's standard scheduler, one of the main advantages of this level of integration is that if the batch fails at some point, the Production Support personnel are usually able to determine the exact breakpoint. Thus, the production support burden lies with the Production Support team.
pmcmd startworkflow -sv "$INT_SERVICE" -d "$DOMAIN" -u "$PM_USER" -p "$PM_PASS" -f "$FOLDER" -wait wf_stg_tmp_product_xref_table
# The above line needs to be edited to include the name of the workflow or the
# task that you are attempting to start. (The service, domain, and credential
# variables are placeholders for site-specific values.)
TG_TMP_PRODUCT_XREF_TABLE
# Check whether to abort the current process or not
RetVal=$?
echo "Status = $RetVal"
if [ $RetVal -ge 1 ]
then
    jobend abnormal "Step 1: Failed wf_stg_tmp_product_xref_table...\n"
    exit 1
fi
echo "Step 1: Successful"
fi
Description
For PowerCenter, statistics are updated during copy, backup, or restore operations. In addition, the pmrep command has an option to update statistics that can be scheduled as part of a regularly run script. For PowerCenter 6 and earlier there are specific strategies for Oracle, Sybase, SQL Server, DB2, and Informix, discussed below. Each example shows how to extract the information out of the PowerCenter repository and incorporate it into a custom stored procedure.
Features in PowerCenter Version 7 and Later

Copy, Backup and Restore Repositories
PowerCenter automatically identifies and updates the statistics of all repository tables and indexes when a repository is copied, backed up, or restored. If you follow a strategy of regular repository backups, the statistics will also be updated.
PMREP Command
PowerCenter also has a command line option to update statistics in the database, which allows the command to be placed in a Windows batch file or UNIX shell script. The format of the command is:

pmrep updatestatistics {-s filelistfile}

The -s option lets you supply a file listing tables that you do not want to update statistics for.
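A minimal sketch of scripting this, assuming a skip-list file of repository tables to exclude (the table names below are illustrative, and in practice a pmrep connect call with your repository credentials precedes the update):

```shell
# Build a skip list of repository tables to exclude from the statistics
# update. Table names here are illustrative placeholders.
cat > skiplist.txt <<'EOF'
OPB_SESSION_LOG
OPB_SWIDGET_LOG
EOF

# Compose the pmrep call; a real script would run this after pmrep connect.
CMD="pmrep updatestatistics -s skiplist.txt"
echo "$CMD"
```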
In addition, the statistics update can be scheduled to run on a daily, weekly, or monthly basis. Updating the statistics regularly keeps them current so that performance does not degrade.
Oracle
Run the following queries:

select 'analyze table ', table_name, ' compute statistics;' from user_tables where table_name like 'OPB_%'

select 'analyze index ', INDEX_NAME, ' compute statistics;' from user_indexes where INDEX_NAME like 'OPB_%'

This will produce output like:
'ANALYZETABLE' TABLE_NAME 'COMPUTESTATISTICS;'
analyze table OPB_ANALYZE_DEP compute statistics;
analyze table OPB_ATTR compute statistics;
analyze table OPB_BATCH_OBJECT compute statistics;

'ANALYZEINDEX' INDEX_NAME 'COMPUTESTATISTICS;'
analyze index OPB_DBD_IDX compute statistics;
analyze index OPB_DIM_LEVEL compute statistics;
analyze index OPB_EXPR_IDX compute statistics;

Save the output to a file. Then, edit the file and remove all the header lines (i.e., the lines that look like 'ANALYZEINDEX' INDEX_NAME 'COMPUTESTATISTICS;') and run it as a SQL script. This updates statistics for the repository tables.
MS SQL Server
Run the following query:

select 'update statistics ', name from sysobjects where name like 'OPB_%'

This will produce output like:

name
update statistics OPB_ANALYZE_DEP
update statistics OPB_ATTR
update statistics OPB_BATCH_OBJECT

Save the output to a file, then edit the file, remove the header information (i.e., the top two lines), and add a 'go' at the end of the file. Run this as a SQL script. This updates statistics for the repository tables.
Sybase
Run the following query:

select 'update statistics ', name from sysobjects where name like 'OPB_%'

This will produce output like:

name
update statistics OPB_ANALYZE_DEP
update statistics OPB_ATTR
update statistics OPB_BATCH_OBJECT

Save the output to a file, then remove the header information (i.e., the top two lines), and add a 'go' at the end of the file. Run this as a SQL script. This updates statistics for the repository tables.
Informix
Run the following query:

select 'update statistics low for table ', tabname, ' ;' from systables where tabname like 'opb_%' or tabname like 'OPB_%';

This will produce output like:

(constant) tabname (constant)
update statistics low for table OPB_ANALYZE_DEP ;
update statistics low for table OPB_ATTR ;
update statistics low for table OPB_BATCH_OBJECT ;

Save the output to a file, then edit the file and remove the header line (i.e., the top line that looks like (constant) tabname (constant)). Run this as a SQL script. This updates statistics for the repository tables.
DB2
Run the following query:

select 'runstats on table ', (rtrim(tabschema)||'.')||tabname, ' and indexes all;' from sysstat.tables where tabname like 'OPB_%'

This will produce output like:

runstats on table PARTH.OPB_ANALYZE_DEP and indexes all;
runstats on table PARTH.OPB_ATTR and indexes all;
runstats on table PARTH.OPB_BATCH_OBJECT and indexes all;

Save the output to a file. Run this as a SQL script to update statistics for the repository tables.
Description
The first step in performance tuning is to identify performance bottlenecks. Carefully consider the following five areas to determine where bottlenecks exist; using a process of elimination, investigate each area in the order indicated:

1. Target
2. Source
3. Mapping
4. Session
5. System
Best Practice Considerations

Use Thread Statistics to Identify Target, Source, and Mapping Bottlenecks
Use thread statistics to identify source, target, or mapping (transformation) bottlenecks. By default, an Integration Service uses one reader, one transformation, and one target thread to process a session. Within each session log, the following thread statistics are available:

- Run time: the amount of time the thread was running.
- Idle time: the amount of time the thread was idle due to other threads within the application or Integration Service. This value does not include time the thread is blocked due to the operating system.
- Busy: the percentage of the overall run time that the thread was not idle. This percentage is calculated using the following formula:
(run time - idle time) / run time x 100

By analyzing the thread statistics found in an Integration Service session log, it is possible to determine which thread is being used the most. If a transformation thread is 100 percent busy and there are additional resources (e.g., CPU cycles and memory) available on the Integration Service server, add a partition point in the segment. If the reader or writer thread is 100 percent busy, consider using string data types in source or target ports, since non-string ports require more processing.
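For example, applying the busy formula to hypothetical thread statistics from a session log:

```shell
# Hypothetical thread statistics (seconds) taken from a session log.
run_time=200   # time the transformation thread existed
idle_time=50   # time the thread sat idle

# Busy percentage = (run time - idle time) / run time x 100
busy=$(( (run_time - idle_time) * 100 / run_time ))
echo "Transformation thread busy: ${busy}%"
```

A thread reporting 75 percent busy, as here, is working but not yet the clear bottleneck; a thread pinned at 100 percent busy is the one to tune or partition.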
Target Bottlenecks

To identify a target (write) bottleneck, use the following procedure:

1. Make a copy of the original workflow.
2. Configure the session in the test workflow to write to a flat file and run the session.
3. Read the thread statistics in the session log.

If session performance increases significantly when writing to a flat file, you have a write bottleneck. Consider performing the following tasks to improve performance:
- Drop indexes and key constraints
- Increase checkpoint intervals
- Use bulk loading
- Use external loading
- Minimize deadlocks
- Increase database network packet size
- Optimize target databases
Source Bottlenecks

Using a Filter Transformation. Add a filter transformation in the mapping after each source qualifier. Set the filter condition to false so that no data is processed past the filter transformation. If the time it takes to run the new session remains about the same, then you have a source bottleneck.

Using a Read Test Session. You can create a read test mapping to identify source bottlenecks. A read test mapping isolates the read query by removing any transformation logic from the mapping. Use the following steps to create a read test mapping:

1. Make a copy of the original mapping.
2. In the copied mapping, retain only the sources, source qualifiers, and any custom joins or queries.
3. Remove all transformations.
4. Connect the source qualifiers to a file target.

Use the read test mapping in a test session. If the test session performance is similar to the original session, you have a source bottleneck.

Using a Database Query. You can also identify source bottlenecks by executing a read query directly against the source database. To do so, perform the following steps:
- Copy the read query directly from the session log.
- Run the query against the source database with a query tool such as SQL*Plus, measuring the query execution time and the time it takes for the query to return the first row.
If there is a long delay between the two time measurements, you have a source bottleneck. If your session reads from a relational source and is constrained by a source bottleneck, review the following suggestions for improving performance:
- Use conditional filters.
- Increase database network packet size.
- Connect to Oracle databases using the IPC protocol.
Mapping Bottlenecks
If you have eliminated the reading and writing of data as bottlenecks, you may have a mapping bottleneck. Use the swap method to determine whether the bottleneck is in the mapping. Begin by adding a Filter transformation in the mapping immediately before each target definition. Set the filter condition to false so that no data is loaded into the target tables. If the time it takes to run the new session is the same as the original session, you have a mapping bottleneck.

You can also use the performance details to identify mapping bottlenecks: high Rowsinlookupcache and high Errorrows counters indicate mapping bottlenecks. Follow these steps to identify mapping bottlenecks:

Create a test mapping without transformations:

1. Make a copy of the original mapping.
2. In the copied mapping, retain only the sources, source qualifiers, and any custom joins or queries.
3. Remove all transformations.
4. Connect the source qualifiers to the target.

Check for high Rowsinlookupcache counters. Multiple lookups can slow the session. You may improve session performance by locating the largest lookup tables and tuning those lookup expressions.
Check for high Errorrows counters. Transformation errors affect session performance. If a session has large numbers in any of the Transformation_errorrows counters, you may improve performance by eliminating the errors.

For further details on eliminating mapping bottlenecks, refer to the Best Practice: Tuning Mappings for Better Performance.
Session Bottlenecks
Session performance details can be used to flag other problem areas. Create performance details by selecting Collect Performance Data in the session properties before running the session. View the performance details through the Workflow Monitor as the session runs, or view the resulting file. The performance details provide counters about each source qualifier, target definition, and individual transformation within the mapping to help you understand session and mapping efficiency. To view the performance details during the session run:
1. Right-click the session in the Workflow Monitor.
2. Choose Properties.
3. Click the Properties tab in the details dialog box.
To view the resulting performance data file, look for the file session_name.perf in the same directory as the session log and open it in any text editor.

All transformations have basic counters that indicate the number of input rows, output rows, and error rows. Source qualifiers, normalizers, and targets have additional counters indicating the efficiency of data moving into and out of buffers. Some transformations have counters specific to their functionality. When reading performance details, the first column displays the transformation name as it appears in the mapping, the second column contains the counter name, and the third column holds the resulting number or efficiency percentage.

Low buffer input and buffer output counters
If the BufferInput_efficiency and BufferOutput_efficiency counters are low for all sources and targets, increasing the session DTM buffer pool size may improve performance.

Aggregator, Rank, and Joiner readfromdisk and writetodisk counters

If a session contains Aggregator, Rank, or Joiner transformations, examine each Transformation_readfromdisk and Transformation_writetodisk counter. If these counters display any number other than zero, you can improve session performance by increasing the index and data cache sizes. If the session performs incremental aggregation, the Aggregator_readfromdisk and writetodisk counters display a number other than zero because the Integration Service reads historical aggregate data from the local disk during the session and writes to disk when saving historical data. Evaluate the incremental Aggregator_readfromdisk and writetodisk counters during the session. If the counters show any numbers other than zero during the session run, you can increase performance by tuning the index and data cache sizes.

Note: PowerCenter versions 6.x and above include the ability to assign memory allocation per object. In versions earlier than 6.x, aggregator, rank, and joiner caches were assigned at a global/session level.

For further details on eliminating session bottlenecks, refer to the Best Practices: Tuning Sessions for Better Performance and Tuning SQL Overrides and Environment for Better Performance.
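A sketch of scanning a performance-details file for non-zero disk counters. The three-column layout and counter names follow the description above; the file name and sample values are invented for illustration:

```shell
# Fabricated sample of a session_name.perf file: transformation name,
# counter name, value.
cat > s_m_daily_load.perf <<'EOF'
SQ_ORDERS BufferInput_efficiency 48
AGG_TOTALS Aggregator_readfromdisk 0
AGG_TOTALS Aggregator_writetodisk 12
RNK_TOP10 Rank_readfromdisk 0
EOF

# Flag any readfromdisk/writetodisk counter above zero: a hint that the
# transformation's index and data cache sizes need tuning.
awk '$2 ~ /(readfromdisk|writetodisk)$/ && $3 > 0 {print $1, $2, $3}' s_m_daily_load.perf
```

Here only the Aggregator's writetodisk counter is flagged, pointing at its cache sizes as the tuning candidate.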
System Bottlenecks
After tuning the source, target, mapping, and session, you may also consider tuning the system hosting the Integration Service. The Integration Service uses system resources to process transformations, session execution, and the reading and writing of data. The Integration Service also uses system memory for other data tasks such as creating aggregator, joiner, rank, and lookup table caches. You can use system performance monitoring tools to monitor the amount of system resources the Server uses and identify system bottlenecks.
- Windows. Use the Processes tab in the Task Manager to view CPU usage and total memory usage. You can also view more detailed performance information by using the Performance Monitor in the Administrative Tools on Windows.
- UNIX. Use the following system tools to monitor system performance and identify system bottlenecks:
  - lsattr -E -l sys0, to view current system settings
  - iostat, to monitor loading operation for every disk attached to the database server
  - vmstat or sar -w, to monitor disk swapping actions
  - sar -u, to monitor CPU loading
For further information regarding system tuning, refer to the Best Practices: Performance Tuning UNIX Systems and Performance Tuning Windows 2000/2003 Systems.
Description
Performance Tuning Tools
Oracle offers many tools for tuning an Oracle instance. Most DBAs are already familiar with these tools, so we've included only a short description of some of the major ones here.
V$ Views
V$ views are dynamic performance views that provide real-time information on database activity, enabling the DBA to draw conclusions about database performance. Because SYS is the owner of these views, only SYS can query them. Keep in mind that querying these views impacts database performance; each query incurs an immediate performance hit. With this in mind, carefully consider which users should be granted the privilege to query these views. You can grant viewing privileges with either the SELECT privilege, which allows a user to view individual V$ views, or the SELECT ANY TABLE privilege, which allows the user to view all V$ views. Using the SELECT ANY TABLE option requires that the O7_DICTIONARY_ACCESSIBILITY parameter be set to TRUE, which allows the ANY keyword to apply to SYS-owned objects.
Explain Plan
Explain Plan, SQL Trace, and TKPROF are powerful tools for revealing bottlenecks and developing a strategy to avoid them. Explain Plan allows the DBA or developer to determine the execution path of a block of SQL code. The SQL in a source qualifier or in a lookup that is running for a long time should be generated, copied to SQL*Plus or another SQL tool, and tested to avoid inefficient execution of these statements. Review the PowerCenter session log for long initialization time (an indicator that the source qualifier may need tuning) and the time it takes to build a lookup cache to determine whether the SQL for these transformations should be tested.
SQL Trace
SQL Trace extends the functionality of Explain Plan by providing statistical information about the SQL statements executed in a session that has tracing enabled. Tracing is enabled for a session with the ALTER SESSION SET SQL_TRACE = TRUE statement.
TKPROF
The output of SQL Trace is provided in a dump file that is difficult to read. TKPROF formats this dump file into a more understandable report.
Disk I/O
Disk I/O at the database level provides the highest level of performance gain in most systems. Database files should be separated and identified. Rollback files should be separated onto their own disks because they have significant disk I/O. Co-locate tables that are heavily used with tables that are rarely used to help minimize disk contention. Place indexes on separate disks from their tables so that queries touching both indexes and tables are not fighting for the same resource. Also be sure to implement disk striping; this, or RAID technology, can help immensely in reducing disk contention. While this type of planning is time consuming, the payoff is well worth the effort in terms of performance gains.
Dynamic Sampling
Dynamic sampling enables the server to improve performance by:

- Estimating single-table predicate statistics where available statistics are missing or may lead to bad estimations.
- Estimating statistics for tables and indexes with missing statistics.
- Estimating statistics for tables and indexes with out-of-date statistics.
Dynamic sampling is controlled by the OPTIMIZER_DYNAMIC_SAMPLING parameter, which accepts values from "0" (off) to "10" (aggressive sampling), with a default value of "2". At compile time, Oracle determines whether dynamic sampling can improve query performance. If so, it issues recursive statements to estimate the necessary statistics. Dynamic sampling can be beneficial when:

- The sample time is small compared to the overall query execution time.
- Dynamic sampling results in a better-performing query.
Statistics Analysis. The optimizer recommends the gathering of statistics on objects with missing or stale statistics. Additional statistics for these objects are stored in an SQL profile.

SQL Profiling. The optimizer may be able to improve performance by gathering additional statistics and altering session-specific parameters such as the OPTIMIZER_MODE. If such improvements are possible, the information is stored in an SQL profile. If accepted, this information can then be used by the optimizer when running in normal mode. Unlike a stored outline, which fixes the execution plan, an SQL profile may still be of benefit when the contents of the table alter drastically. Even so, it is sensible to update profiles periodically. SQL profiling is not performed when the tuning optimizer is run in limited mode.

Access Path Analysis. The optimizer investigates the effect of new or modified indexes on the access path. Because its index recommendations relate to a specific statement, where practical it also suggests the use of the SQL Access Advisor to check the impact of these indexes on a representative SQL workload.

SQL Structure Analysis. The optimizer suggests alternatives for SQL statements that contain structures that may affect performance. Be aware that implementing these suggestions requires human intervention to check their validity.
TIP The automatic SQL tuning features are accessible from Enterprise Manager on the "Advisor Central" page
Useful Views
Useful views related to automatic SQL tuning include:
- DBA_ADVISOR_TASKS
- DBA_ADVISOR_FINDINGS
- DBA_ADVISOR_RECOMMENDATIONS
- DBA_ADVISOR_RATIONALE
- DBA_SQLTUNE_STATISTICS
- DBA_SQLTUNE_BINDS
- DBA_SQLTUNE_PLANS
- DBA_SQLSET
- DBA_SQLSET_BINDS
- DBA_SQLSET_STATEMENTS
- DBA_SQLSET_REFERENCES
- DBA_SQL_PROFILES
- V$SQL
- V$SQLAREA
- V$ACTIVE_SESSION_HISTORY
TIP Changes made in the init.ora file take effect after a restart of the instance. Use svrmgr to issue the shutdown and startup commands (or shutdown immediate) to the instance. Note that svrmgr is no longer available as of Oracle 9i because Oracle is moving to a web-based Server Manager in Oracle 10g. If you are using Oracle 9i, install the Oracle client tools and log onto Oracle Enterprise Manager. Some other tools, such as DBArtisan, also expose the initialization parameters.
The settings presented here are those used on a four-CPU AIX server running Oracle 7.3.4, set to make use of the parallel query option to facilitate parallel processing of queries and indexes. We've also included the descriptions and documentation from Oracle for each setting to help DBAs of other (i.e., non-Oracle) systems determine what the commands do in the Oracle environment, so they can set their native database commands and settings in a similar fashion.
HASH_AREA_SIZE = 16777216

- Default value: 2 times the value of SORT_AREA_SIZE
- Range of values: any integer
- This parameter specifies the maximum amount of memory, in bytes, to be used for the hash join. If this parameter is not set, its value defaults to twice the value of the SORT_AREA_SIZE parameter. The value of this parameter can be changed without shutting down the Oracle instance by using the ALTER SESSION command. (Note: ALTER SESSION refers to the Database Administration command issued at the svrmgr command prompt.)

HASH_JOIN_ENABLED

- In Oracle 7 and Oracle 8 the hash_join_enabled parameter must be set to true. In Oracle 8i and above, hash_join_enabled=true is the default value.

HASH_MULTIBLOCK_IO_COUNT

- Allows multiblock reads against the TEMP tablespace.
- It is advisable to set the NEXT extent size to greater than the value of hash_multiblock_io_count to reduce disk I/O. This is the same behavior seen when setting the db_file_multiblock_read_count parameter for data tablespaces, except that this one applies only to multiblock access of segments of the TEMP tablespace.

STAR_TRANSFORMATION_ENABLED
- When set to TRUE, the optimizer will consider performing a cost-based query transformation on the n-way join table (star queries).
OPTIMIZER_INDEX_COST_ADJ

- Numeric parameter set between 0 and 1000 (default 1000).
- This parameter lets you tune the optimizer behavior for access path selection to be more or less index friendly.
OPTIMIZER_PERCENT_PARALLEL = 33

This parameter defines the amount of parallelism that the optimizer uses in its cost functions. The default of 0 means that the optimizer chooses the best serial plan. A value of 100 means that the optimizer uses each object's degree of parallelism in computing the cost of a full-table scan operation. The value of this parameter can be changed without shutting down the Oracle instance by using the ALTER SESSION command. Low values favor indexes, while high values favor table scans. Cost-based optimization is always used for queries that reference an object with a nonzero degree of parallelism. For such queries, a RULE hint or optimizer mode or goal is ignored. Use of a FIRST_ROWS hint or optimizer mode overrides a nonzero setting of OPTIMIZER_PERCENT_PARALLEL.

PARALLEL_MAX_SERVERS = 40

- Used to enable parallel query.
- Initially not set on install.
- Maximum number of query servers or parallel recovery processes for an instance.

PARALLEL_MIN_SERVERS = 8

- Used to enable parallel query.
- Initially not set on install.
- Minimum number of query server processes for an instance. Also the number of query-server processes Oracle creates when the instance is started.
SORT_AREA_SIZE=8388608
- Default value: operating system-dependent
- Minimum value: the value equivalent to two database blocks
- This parameter specifies the maximum amount, in bytes, of program global area (PGA) memory to use for a sort. After the sort is complete and all that remains is to fetch the rows out, the memory is released down to the size specified by SORT_AREA_RETAINED_SIZE. After the last row is fetched out, all memory is freed. The memory is released back to the PGA, not to the operating system. Increasing SORT_AREA_SIZE improves the efficiency of large sorts. Multiple allocations never exist; there is only one memory area of SORT_AREA_SIZE for each user process at any time. The default is usually adequate for most database operations. However, if very large indexes are created, this parameter may need to be adjusted. For example, if one process is doing all database access, as in a full database import, then an increased value for this parameter may speed the import, particularly the CREATE INDEX statements.
If these parameters are set to a non-zero value, they represent the minimum size for the pool. These minimum values may be necessary if you experience application errors when certain pool sizes drop below a specific threshold. The following parameters must be set manually and take memory from the quota allocated by the SGA_TARGET parameter:
- DB_KEEP_CACHE_SIZE
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011133 ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011134 ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011131 ;

Dropping or disabling primary keys also speeds loads. Run the results of this SQL statement after disabling the foreign key constraints:

SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE PRIMARY KEY ;'
FROM USER_CONSTRAINTS
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
AND CONSTRAINT_TYPE = 'P'

This produces output that looks like:

ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE PRIMARY KEY ;
ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE PRIMARY KEY ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE PRIMARY KEY ;

Finally, disable any unique constraints with the following (unique constraints are disabled by name, so the generated statement references CONSTRAINT_NAME rather than PRIMARY KEY):

SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE CONSTRAINT ' || CONSTRAINT_NAME || ' ;'
FROM USER_CONSTRAINTS
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
AND CONSTRAINT_TYPE = 'U'

This produces output that looks like:

ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011070 ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011071 ;

Save the results in a single file and name it something like DISABLE.SQL. To re-enable the constraints, rerun these queries after replacing DISABLE with ENABLE. Save the results in another file with a name such as ENABLE.SQL and run it as a post-session command. Re-enable constraints in the reverse order that you disabled them: re-enable the unique constraints first, and re-enable primary keys before foreign keys.
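Generating ENABLE.SQL from DISABLE.SQL can also be scripted. A sketch using sed and tac (GNU coreutils; tac naively reverses line order so constraints are re-enabled in the opposite order they were disabled), with a fabricated sample DISABLE.SQL:

```shell
# Fabricated sample DISABLE.SQL, in the order the constraints were disabled
# (foreign key first, then primary key, then unique constraint).
cat > DISABLE.SQL <<'EOF'
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011133 ;
ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE PRIMARY KEY ;
ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011070 ;
EOF

# Swap DISABLE for ENABLE and reverse the statement order.
sed 's/ DISABLE / ENABLE /' DISABLE.SQL | tac > ENABLE.SQL
cat ENABLE.SQL
```

This simple line reversal matches the per-file ordering described above; with multiple constraint types mixed in one file, verify the resulting order before running it as a post-session command.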
TIP Dropping or disabling foreign keys often boosts loading, but also slows queries (such as lookups) and updates. If you do not use lookups or updates on your target tables, you should get a boost by using this SQL statement to generate scripts. If you use lookups and updates (especially on large tables), you can exclude the index that will be used for the lookup from your script. You may want to experiment to determine which method is faster.
A further advantage of bitmap indexes is the ability to create and drop them very quickly. Since most dimension tables in a warehouse have nearly every column indexed, the space savings is dramatic. It is important to note, however, that when a bitmap-indexed column is updated, every row associated with that bitmap entry is locked, making bitmap indexing a poor choice for OLTP database tables with constant insert and update traffic. Also, bitmap indexes are rebuilt after each DML statement (e.g., inserts and updates), which can make loads very slow. For this reason, it is a good idea to drop or disable bitmap indexes prior to the load and recreate or re-enable them after the load.

The relationship between Fact and Dimension keys is another example of low cardinality. With a b-tree index on the Fact table, a query processes by joining all the Dimension tables in a Cartesian product based on the WHERE clause, then joins back to the Fact table. With a bitmapped index on the Fact table, a star query may be created that accesses the Fact table first, followed by the Dimension table joins, avoiding a Cartesian product of all possible Dimension attributes. This star query access method is used only if the STAR_TRANSFORMATION_ENABLED parameter is set to TRUE in the init.ora file and if there are single-column bitmapped indexes on the fact table foreign keys.

Creating bitmap indexes is similar to creating b-tree indexes. To specify a bitmap index, add the word bitmap between create and index. All other syntax is identical.
Bitmap Indexes
drop index emp_active_bit;
drop index emp_gender_bit;
create bitmap index emp_active_bit on emp (active_flag);
create bitmap index emp_gender_bit on emp (gender);
B-tree Indexes
drop index emp_active;
drop index emp_gender;
create index emp_active on emp (active_flag);
create index emp_gender on emp (gender);
Information for bitmap indexes is stored in the data dictionary in dba_indexes, all_indexes, and user_indexes with the word BITMAP in the Uniqueness column rather than the word UNIQUE. Bitmap indexes cannot be unique. To enable bitmap indexes, you must set the following items in the instance initialization file:
- compatible = 7.3.2.0.0  # or higher
- event = "10111 trace name context forever"
- event = "10112 trace name context forever"
- event = "10114 trace name context forever"
Also note that the parallel query option must be installed in order to create bitmap indexes. If you try to create bitmap indexes without the parallel query option, a syntax error appears in the SQL statement; the keyword bitmap won't be recognized.
TIP To check if the parallel query option is installed, start and log into SQL*Plus. If the parallel query option is installed, the word parallel appears in the banner text.
ANALYZE TABLE CUSTOMER_DIM COMPUTE STATISTICS;
ANALYZE TABLE MARKET_DIM COMPUTE STATISTICS;
ANALYZE TABLE VENDOR_DIM COMPUTE STATISTICS;

The following SQL statement can be used to generate ANALYZE statements for the indexes in the database:

SELECT 'ANALYZE INDEX ' || INDEX_NAME || ' COMPUTE STATISTICS;'
FROM USER_INDEXES
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')

This generates results such as:

ANALYZE INDEX SYS_C0011125 COMPUTE STATISTICS;
ANALYZE INDEX SYS_C0011119 COMPUTE STATISTICS;
ANALYZE INDEX SYS_C0011105 COMPUTE STATISTICS;

Save these results as a SQL script to be executed before or after a load.
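As a sketch of how this can be automated outside the database, a small shell helper can generate the same ANALYZE script from a plain list of table names. The table names and the awk filter below are illustrative; in practice the list would come from a query against USER_TABLES.

```shell
#!/bin/sh
# Sketch: emit ANALYZE statements for *_DIM and *_FACT tables read from stdin.
gen_analyze() {
  awk '/_DIM$|_FACT$/ { printf "ANALYZE TABLE %s COMPUTE STATISTICS;\n", $1 }'
}

# Illustrative table names; EMP is filtered out because it is neither a
# dimension nor a fact table.
printf '%s\n' CUSTOMER_DIM MARKET_DIM ORDER_FACT EMP | gen_analyze
```

Redirect the output to a .sql file and execute it with sqlplus before or after the load.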
Schema method
Another way to update index statistics is to compute statistics by schema rather than by table. If the data warehouse indexes are the only indexes located in a single schema, you can use the following command to update the statistics:

EXECUTE SYS.DBMS_UTILITY.Analyze_Schema ('BDB', 'compute');

In this example, BDB is the schema for which the statistics should be updated. Note that the DBA must grant the execute privilege for dbms_utility to the database user executing this command.
TIP
These SQL statements can be very resource intensive, especially for very large tables. For this reason, Informatica recommends running them at off-peak times when no other process is using the database. If you find the exact computation of the statistics consumes too much time, it is often acceptable to estimate the statistics rather than compute them. Use estimate instead of compute in the above examples.
Parallelism
Parallel execution can be implemented at the SQL statement, database object, or instance level for many SQL operations. The degree of parallelism should be identified based on the number of processors and disk drives on the server, with the number of processors being the minimum degree.
Example of improper use of an alias:

SELECT /*+ PARALLEL (EMP, 4) */ EMPNO, ENAME FROM EMP A

Here, the parallel hint is not used because the table EMP has been aliased as A. The correct form is:

SELECT /*+ PARALLEL (A, 4) */ EMPNO, ENAME FROM EMP A
The following statement demonstrates how to set a table's degree of parallelism to four for all eligible SQL statements on the table:

ALTER TABLE order_fact PARALLEL 4;

Ensure that Oracle is not contending with other processes for these resources, or you may end up with degraded performance due to resource contention.
Additional Tips Executing Oracle SQL Scripts as Pre- and Post-Session Commands on UNIX
You can execute queries as both pre- and post-session commands. For a UNIX environment, the format of the command is:

sqlplus -s user_id/password@database @script_name.sql

For example, to execute the ENABLE.SQL file created earlier (assuming the data warehouse is on a database named infadb), you would execute the following as a post-session command:

sqlplus -s user_id/password@infadb @enable.sql

In some environments, this can be a security issue since both the username and password are hard-coded and unencrypted. To avoid this, use the operating system's authentication to log onto the database instance. In the following example, the Informatica id pmuser is used to log onto the Oracle database. Create the Oracle user pmuser with the following SQL statement:

CREATE USER PMUSER IDENTIFIED EXTERNALLY
DEFAULT TABLESPACE . . .
TEMPORARY TABLESPACE . . .

In the following pre-session command, pmuser (the id under which Informatica is logged onto the operating system) is automatically passed from the operating system to the database and used to execute the script:

sqlplus -s /@infadb @/informatica/powercenter/Scripts/ENABLE.SQL
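A minimal post-session wrapper script along these lines can surface script failures to the session log. This is a sketch assuming sqlplus is on the PATH and OS authentication is configured as above; the script and database names are placeholders, and SQLPLUS can be overridden for testing.

```shell
#!/bin/sh
# Sketch: run a SQL script via sqlplus with OS authentication and fail
# loudly on a non-zero exit status. SQLPLUS defaults to sqlplus but can
# be overridden (e.g., for testing).
SQLPLUS="${SQLPLUS:-sqlplus}"

run_sql_script() {
  script="$1"
  db="$2"
  # -s suppresses the banner; /@db logs in as the externally-identified user
  "$SQLPLUS" -s "/@${db}" "@${script}" || {
    echo "ERROR: ${script} failed" >&2
    return 1
  }
  echo "OK: ${script} completed"
}
```

Called as `run_sql_script ENABLE.SQL infadb` from a post-session command, this avoids hard-coding a password while still reporting failures.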
You may want to use the init.ora parameter os_authent_prefix to distinguish between normal Oracle users and externally-identified ones.

DRIVING_SITE Hint

If the source and target are on separate instances, the Source Qualifier transformation is executed on the target instance. For example, suppose you want to join two source tables (A and B), which may reduce the number of selected rows. Oracle fetches all of the data from both tables, moves the data across the network to the target instance, and then processes everything on the target instance. If either data source is large, this causes a great deal of network traffic. To force the Oracle optimizer to process the join on the source instance, use the Generate SQL option in the Source Qualifier and include the driving_site hint in the SQL statement:

SELECT /*+ DRIVING_SITE */ ;
Description
Proper tuning of the source and target database is a very important consideration in the scalability and usability of a business data integration environment. Managing performance on a SQL Server involves the following:

- Manage system memory usage (RAM caching).
- Create and maintain good indexes.
- Partition large data sets and indexes.
- Monitor disk I/O subsystem performance.
- Tune applications and queries.
- Optimize active data.
Taking advantage of grid computing is another option for improving the overall SQL Server performance. To set up a SQL Server cluster environment, you need to set up a cluster where the databases are split among the nodes. This provides the ability to distribute the load across multiple nodes. To achieve high performance, Informatica recommends using a fibre-attached SAN device for shared storage.
Several settings in SQL Server can be adjusted to take advantage of SQL Server RAM usage:

- Max async I/O is used to specify the number of simultaneous disk I/O operations that SQL Server can submit to the operating system. Note that this setting is automated in SQL Server 2000.

SQL Server allows several selectable models for database recovery; these include:

- Full
- Bulk-Logged
- Simple
Note: In SQL Server 2000, Microsoft introduced enhancements to distributed partitioned views that enable the creation of federated databases (commonly referred to as scale-out), which spread resource load and I/O activity across multiple servers. Federated databases are appropriate for some high-end online transaction processing (OLTP) applications, but this approach is not recommended for addressing the needs of a data warehouse.
Segregating tempdb
SQL Server creates a database, tempdb, on every server instance to be used by the server as a shared working area for various activities, including temporary tables, sorting, processing subqueries, building aggregates to support GROUP BY or ORDER BY clauses, queries using DISTINCT (temporary worktables have to be created to remove duplicate rows), cursors, and hash joins. To move the tempdb database, use the ALTER DATABASE command to change the physical file location of the SQL Server logical file name associated with tempdb. For example, to move tempdb and its associated log to the new file locations E:\mssql7 and C:\temp, use the following commands:
ALTER DATABASE tempdb MODIFY FILE (NAME = 'tempdev', FILENAME = 'e:\mssql7\tempnew_location.mdf')

ALTER DATABASE tempdb MODIFY FILE (NAME = 'templog', FILENAME = 'c:\temp\tempnew_loglocation.mdf')
The master, msdb, and model databases are not used much during production (as compared to user databases), so it is generally not necessary to consider them in I/O performance tuning. The master database is usually used only for adding new logins, databases, devices, and other system objects.
Database Partitioning
Databases can be partitioned using files and/or filegroups. A filegroup is simply a named collection of individual files grouped together for administration purposes. A file cannot be a member of more than one filegroup. Tables, indexes, text, ntext, and image data can all be associated with a specific filegroup. This means that all their pages are allocated from the files in that filegroup. The three types of filegroups are:
- Primary filegroup. Contains the primary data file and any other files not placed into another filegroup. All pages for the system tables are allocated from the primary filegroup.
- User-defined filegroup. Any filegroup specified using the FILEGROUP keyword in a CREATE DATABASE or ALTER DATABASE statement, or on the Properties dialog box within SQL Server Enterprise Manager.
- Default filegroup. Contains the pages for all tables and indexes that do not have a filegroup specified when they are created. In each database, only one filegroup at a time can be the default filegroup. If no default filegroup is specified, the default is the primary filegroup.
Files and filegroups are useful for controlling the placement of data and indexes and eliminating device contention. Quite a few installations also leverage files and filegroups as a mechanism that is more granular than a database in order to exercise more control over their database backup/recovery strategy.
Horizontal partitioning segments a table into multiple tables, each containing the same number of columns but fewer rows. Determining how to partition tables horizontally depends on how the data is analyzed. A general rule of thumb is to partition tables so that queries reference as few tables as possible. Otherwise, excessive UNION queries, used to merge the tables logically at query time, can impair performance.

When you partition data across multiple tables or multiple servers, queries accessing only a fraction of the data can run faster because there is less data to scan. If the tables are located on different servers, or on a computer with multiple processors, each table involved in the query can also be scanned in parallel, thereby improving query performance. Additionally, maintenance tasks, such as rebuilding indexes or backing up a table, can execute more quickly. By using a partitioned view, the data still appears as a single table and can be queried as such without having to reference the correct underlying table manually.
Use this option to reserve physical memory space for SQL Server that is equal to the server memory setting. The server memory setting is configured automatically by SQL Server based on workload and available resources, and can vary dynamically between min server memory and max server memory. Setting set working set size means the operating system does not attempt to swap out SQL Server pages, even if those pages could be used more readily by another process while SQL Server is idle.
- bcp is a command prompt utility that copies data into or out of SQL Server.
- BULK INSERT is a Transact-SQL statement that can be executed from within the database environment. Unlike bcp, BULK INSERT can only pull data into SQL Server. An advantage of using BULK INSERT is that it can copy data into instances of SQL Server using a Transact-SQL statement, rather than having to shell out to the command prompt.
TIP

Both of these mechanisms enable you to exercise control over the batch size. Unless you are working with small volumes of data, it is good to get in the habit of specifying a batch size for recoverability reasons. If none is specified, SQL Server commits all rows to be loaded as a single batch. For example, suppose you attempt to load 1,000,000 rows of new data into a table and the server suddenly loses power just as it finishes processing row number 999,999. When the server recovers, those 999,999 rows will need to be rolled back out of the database before you attempt to reload the data. By specifying a batch size of 10,000, you could have saved significant recovery time, because SQL Server would have had to roll back only 9,999 rows instead of 999,999.
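As an illustration of specifying a batch size, a small helper can assemble the bcp command line. This is a sketch: server, database, table, and file names are placeholders, and the command is printed rather than executed so it can be reviewed first.

```shell
#!/bin/sh
# Sketch: build a bcp command line with an explicit batch size (-b) so a
# failed load only rolls back the current batch. All names are placeholders.
build_bcp_cmd() {
  table="$1"; datafile="$2"; batch="${3:-10000}"
  # -c character mode, -T trusted connection, -b rows per committed batch
  echo "bcp $table in $datafile -c -b $batch -S ${SQLSERVER:-dbserver} -T"
}

build_bcp_cmd SalesDB.dbo.order_fact order_fact.dat 10000
```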
When loading into an empty table:

- Remove indexes.
- Use BULK INSERT or bcp.
- Load in parallel using partitioned data files into partitioned tables, running one load stream for each available CPU.
- Set the Bulk-Logged or Simple Recovery model.
- Use the TABLOCK option.
- Create indexes.
- Switch to the appropriate recovery model.
- Perform backups.

When loading into a table with existing data:

- Load data with indexes in place.
- Use performance and concurrency requirements to determine locking granularity (sp_indexoption).
- Change from Full to Bulk-Logged Recovery model unless there is an overriding need to preserve point-in-time recovery, such as online users modifying the database during bulk loads. Read operations should not affect bulk loads.
Description
Teradata offers several bulk load utilities including:
- MultiLoad, which supports inserts, updates, deletes, and upserts to any table.
- FastExport, which is a high-performance bulk export utility.
- BTEQ, which allows you to export data to a flat file but is suitable for smaller volumes than FastExport.
- FastLoad, which is used for loading inserts into an empty table.
- TPump, which is a light-weight utility that does not lock the table that is being loaded.
Tuning MultiLoad
There are many aspects to tuning a Teradata database. Several aspects of tuning can be controlled by setting MultiLoad parameters to maximize write throughput. Other areas to analyze when performing a MultiLoad job include estimating space requirements and monitoring MultiLoad performance.
MultiLoad parameters
Below are the MultiLoad-specific parameters that are available in PowerCenter:
- TDPID. A client-based operand that is part of the logon string.
- Date Format. Ensure that the date format used in your target flat file is equivalent to the date format parameter in your MultiLoad script. Also validate that your date format is compatible with the date format specified in the Teradata database.
- Checkpoint. A checkpoint interval is similar to a commit interval for other databases. When you set the checkpoint value to less than 60, it represents the interval in minutes between checkpoint operations. If the checkpoint is set to a value greater than 60, it represents the number of records to write before performing a checkpoint operation. To maximize write speed to the database, try to limit the number of checkpoint operations that are performed.
- Tenacity. Interval in hours between MultiLoad attempts to log on to the database when the maximum number of sessions are already running.
- Load Mode. Available load methods include Insert, Update, Delete, and Upsert. Consider creating separate external loader connections for each method, selecting the one that will be most efficient for each target table.
- Drop Error Tables. Allows you to specify whether to drop or retain the three error tables for a MultiLoad session. Set this parameter to 1 to drop error tables or 0 to retain them.
- Max Sessions. This parameter specifies the maximum number of sessions that are allowed to log on to the database. This value should not exceed one per working AMP (Access Module Processor).
- Sleep. This parameter specifies the number of minutes that MultiLoad waits before retrying a logon operation.
Note: Spool space cannot be used for MultiLoad work tables, error tables, or the restart log table. Spool space is freed at each restart. By using permanent space for the MultiLoad tables, data is preserved for restart operations after a system failure. Work tables, in particular, require a lot of extra permanent space. Also remember to account for the size of error tables since error tables are generated for each target table. Use the following formula to prepare the preliminary space estimate for one target table, assuming no fallback protection, no journals, and no non-unique secondary indexes:
PERM = (using data size + 38) x (number of rows processed) x (number of apply conditions satisfied) x (number of Teradata SQL statements within the applied DML)

Make adjustments to your preliminary space estimates according to the requirements and expectations of your MultiLoad job.
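The formula above can be sketched as a quick shell calculation. The input values here are illustrative, and this is only the preliminary estimate described in the text (no fallback, no journals, no non-unique secondary indexes).

```shell
#!/bin/sh
# Sketch: preliminary PERM space estimate (in bytes) for one MultiLoad
# target table, per the formula above.
perm_estimate() {
  row_size="$1"; rows="$2"; apply_conds="$3"; dml_stmts="$4"
  echo $(( (row_size + 38) * rows * apply_conds * dml_stmts ))
}

# e.g., 100-byte rows, 1,000,000 rows, 1 apply condition, 1 DML statement
perm_estimate 100 1000000 1 1   # → 138000000
```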
1. Determine the phase of the MultiLoad job in which the bottleneck occurs. If the performance bottleneck is during the acquisition phase, as data is acquired from the client system, then the issue may be with the client system. If it is during the application phase, as data is applied to the target tables, then the issue is not likely to be with the client system. The MultiLoad job output lists the job phases and other useful information. Save these listings for evaluation.
2. Use the Teradata RDBMS Query Session utility to monitor the progress of the MultiLoad job.
3. Check for locks on the MultiLoad target tables and error tables.
4. Check the DBC.Resusage table for problem areas, such as data bus or CPU capacities at or near 100 percent for one or more processors.
5. Determine whether the target tables have non-unique secondary indexes (NUSIs). NUSIs degrade MultiLoad performance because the utility builds a separate NUSI change row to be applied to each NUSI sub-table after all of the rows have been applied to the primary table.
6. Check the size of the error tables. Write operations to the fallback error tables are performed at normal SQL speed, which is much slower than normal MultiLoad tasks.
7. Verify that the primary index is unique. Non-unique primary indexes can cause severe MultiLoad performance problems.
8. Watch for input data that is skewed with respect to the Primary Index of the table. Teradata depends upon random and well-distributed data for data input and retrieval. For example, a file containing a million rows with a single value 'AAAAAA' for the Primary Index will take an extremely long time to load.
9. One common tool for diagnosing load issues, skewed data, and locks is Performance Monitor (PMON). PMON requires MONITOR access on the Teradata system. If you do not have MONITOR access, the DBA can help you to look at the system.
10. SQL against the system catalog can also be used to find performance bottlenecks. The following query shows whether the load is inserting data into the system. Spool space (a type of work space) is built up as data is transferred to the database, so if the load is going well, spool will grow rapidly:

SELECT SUM(currentspool)
FROM dbc.diskspace
WHERE databasename = '<userid that is loading the database>';

After spool has reached its peak, it will fall rapidly as data is inserted from spool into the table. If the spool grows slowly, then the input data is probably skewed.
FastExport
FastExport is a bulk export Teradata utility. One way to pull data for lookups and sources is by using ODBC, since there is no native connectivity to Teradata; however, ODBC is slow. For higher performance, use FastExport when the number of rows to be pulled is on the order of a million or more. FastExport writes to a file, and the Lookup or Source Qualifier then reads this file. FastExport is integrated with PowerCenter.
BTEQ
BTEQ is a SQL executor utility similar to SQL*Plus. Like FastExport, BTEQ allows you to export data to a flat file, but it is suitable for smaller volumes of data. It provides faster performance than ODBC without taxing Teradata system resources the way FastExport can. A possible use for BTEQ with PowerCenter is to export smaller volumes of data to a flat file (i.e., less than 1 million rows), which is then read by PowerCenter. BTEQ is not integrated with PowerCenter but can be called from a pre-session script.
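A pre-session command can generate and run such an export script. The following is a sketch: the logon string, table, and file names are placeholders, and the script assumes standard BTEQ dot-command syntax.

```shell
#!/bin/sh
# Sketch: write a BTEQ script that exports a small reference table to a
# flat file for PowerCenter to read. All names are placeholders.
cat > export_lookup.bteq <<'EOF'
.LOGON tdpid/user,password;
.EXPORT REPORT FILE = lookup_data.out;
SELECT * FROM refdb.country_codes;
.EXPORT RESET;
.LOGOFF;
EOF

# In the pre-session command: bteq < export_lookup.bteq
echo "wrote export_lookup.bteq"
```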
TPump
TPump is a load utility primarily intended for streaming data (think of loading bundles of messages arriving from MQ using PowerCenter Real Time). TPump can also load from a file or a named pipe. While FastLoad and MultiLoad are bulk load utilities, TPump is a lightweight utility. Another important difference between MultiLoad and TPump is that TPump locks at the row-hash level instead of the table level, thus providing users read access to fresher data. Teradata reports that it has improved the speed of TPump for loading files to be comparable with that of MultiLoad, so try a test load using TPump first. Also, be cautious about using TPump to load streaming data if the data throughput is large.
- Is the network fast enough and tuned effectively to support the necessary data transfer?
- Is the hardware on which PowerCenter is running sufficiently robust, with high processing capability and high memory capacity?

ELT (Extract, Load, Transform) is a relatively new design and runtime paradigm that became popular with the advent of high-performance RDBMS platforms that handle both DSS and OLTP workloads. Because Teradata typically runs on well-tuned operating systems and well-tuned hardware, the ELT paradigm tries to push as much of the transformation logic as possible onto the Teradata system. The ELT design paradigm can be achieved through the Pushdown Optimization option offered with PowerCenter.
ETL or ELT
Because many database vendors and consultants advocate using ELT (Extract, Load and Transform) over ETL (Extract, Transform and Load), the use of Pushdown Optimization can be somewhat controversial. Informatica advocates using Pushdown Optimization as an option to solve specific performance situations rather than as the default design of a mapping.
The following scenarios can help in deciding when to use ETL with PowerCenter and when to use ELT (i.e., Pushdown Optimization):

1. When the load needs to look up only dimension tables, there may be no need to use Pushdown Optimization. In this context, PowerCenter's ability to build dynamic, persistent caches is significant. If a daily load involves tens or hundreds of fact files to be loaded throughout the day, then dimension surrogate keys can be easily obtained from PowerCenter's in-memory cache. Compare this with the cost of running the same dimension lookup queries on the database.
2. In many cases, even large Teradata systems process loads that involve only a small amount of data. In such cases there may be no need to push down.
3. When only simple filters or expressions need to be applied to the data, there may be no need to push down. A special case is applying filters or expression logic to non-unique columns in incoming data in PowerCenter. Compare this to loading the same data into the database and then applying a WHERE clause on a non-unique column, which is highly inefficient for a large table. The principle here is: filter and resolve the data as it gets loaded instead of loading it into a database, querying the RDBMS to filter/resolve, and re-loading it into the database. In other words, ETL instead of ELT.
4. Pushdown Optimization needs to be considered only if a large set of data must be merged or queried to produce the final load set.
The following restrictions apply to Teradata sessions that use Pushdown Optimization:

- Teradata sessions fail if the session requires a conversion to a numeric data type and the precision is greater than 18.
- Teradata sessions fail when you use full pushdown optimization for a session containing a Sorter transformation.
- A sort on a distinct key may give inconsistent results if the sort is not case sensitive and one port is a character port.
- A session containing an Aggregator transformation may produce different results from PowerCenter if the group by port is a string data type and it is not case-sensitive.
- A session containing a Lookup transformation fails if it is configured for target-side pushdown optimization.
- A session that requires type casting fails if the casting is from x to date/time.
- A session that contains a date to string conversion fails.
Running a Query
If the Integration Service did not successfully drop the view, you can run a query against the source database to search for the views generated by the Integration Service. When the Integration Service creates a view, it uses a prefix of PM_V. You can search for views with this prefix to locate the views created during pushdown optimization. Teradata-specific SQL:

SELECT TableName
FROM DBC.Tables
WHERE CreatorName = USER
AND TableKind = 'V'
AND TableName LIKE 'PM\_V%' ESCAPE '\'
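If orphaned views are found, the query output can be turned into DROP statements with a small filter. The view names below are illustrative; in practice the names would come from the query above.

```shell
#!/bin/sh
# Sketch: convert a list of leftover PM_V* view names into DROP VIEW
# statements for manual review and execution.
gen_drops() {
  awk '/^PM_V/ { printf "DROP VIEW %s;\n", $1 }'
}

printf '%s\n' PM_V_ORDERS_1 PM_V_CUST_2 | gen_drops
```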
Description
This section provides an overview of the subject area, followed by discussion of the use of specific tools.
Overview
All system performance issues are fundamentally resource contention issues. In any computer system, there are three essential resources: CPU, memory, and I/O (namely disk and network I/O). From this standpoint, performance tuning for PowerCenter means ensuring that PowerCenter and its sub-processes have adequate resources to execute in a timely and efficient manner.

Each resource has its own particular set of problems, and resource problems are complicated because all resources interact with each other. Performance tuning is about identifying bottlenecks and making trade-offs to improve the situation. Your best approach is to take a baseline measurement first and obtain a good understanding of how the system behaves, then evaluate any bottleneck revealed on each system resource during your load window and remove whichever resource contention offers the greatest opportunity for performance enhancement.

Here is a summary of each system resource area and the problems it can have.
CPU
On any multiprocessing and multi-user system, many processes want to use the CPUs at the same time. The UNIX kernel is responsible for allocating a finite number of CPU cycles across all running processes. If the total demand on the CPU exceeds its finite capacity, then all processing is likely to reflect a negative impact on performance; the system scheduler puts each process in a queue to wait for CPU availability.
An average of the count of active processes in the system for the last 1, 5, and 15 minutes is reported as the load average when you execute the uptime command. The load average provides a basic indicator of the number of contenders for CPU time. Likewise, the vmstat command provides the average usage of all the CPUs, along with the number of processes contending for CPU (the value under the r column). On SMP (symmetric multiprocessing) servers, watch for even utilization of all the CPUs; how well all the CPUs are utilized depends on how well an application can be parallelized. If a process incurs a high degree of involuntary context switching by the kernel, binding the process to a specific CPU may improve performance.
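As a sketch, the load-average check described above can be scripted. In practice the load value would come from uptime or /proc/loadavg, and the threshold of one runnable process per CPU is only a rule of thumb.

```shell
#!/bin/sh
# Sketch: flag CPU contention when a load average exceeds the CPU count.
check_load() {
  load="$1"; ncpu="$2"
  # awk handles the floating-point comparison
  if awk -v l="$load" -v n="$ncpu" 'BEGIN { exit !(l > n) }'; then
    echo "CPU contention: load $load exceeds $ncpu CPUs"
  else
    echo "load OK"
  fi
}

check_load 6.42 4
```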
Memory
Memory contention arises when the memory requirements of the active processes exceed the physical memory available on the system; at this point, the system is out of memory. To handle this lack of memory, the system starts paging, or moving portions of active processes to disk in order to reclaim physical memory. When this happens, performance decreases dramatically. Paging is distinguished from swapping, which means moving entire processes to disk and reclaiming their space. Paging and excessive swapping indicate that the system can't provide enough memory for the processes that are currently running. Commands such as vmstat and pstat show whether the system is paging; ps, prstat and sar can report the memory requirements of each process.
Disk I/O
The I/O subsystem is a common source of resource contention problems. A finite amount of I/O bandwidth must be shared by all the programs (including the UNIX kernel) that are currently running. The system's I/O buses can transfer only so many megabytes per second, and individual devices are even more limited. Each type of device has its own peculiarities and, therefore, its own problems. Tools are available to evaluate specific parts of the I/O subsystem:

- iostat can give you information about the transfer rates for each disk drive.
- ps and vmstat can give some information about how many processes are blocked waiting for I/O.
- sar can provide voluminous information about I/O efficiency.
- sadp can give detailed information about disk access patterns.
Network I/O
The source data, the target data, or both are likely to be connected through an Ethernet channel to the system where PowerCenter resides. Be sure to consider the number of Ethernet channels and the bandwidth available to avoid congestion.

- netstat shows packet activity on a network; watch for a high collision rate of output packets on each interface.
- nfsstat monitors NFS traffic; execute nfsstat -c from a client machine (not from the NFS server) and watch for a high timeout rate relative to total calls, and for "not responding" messages.
Given that these issues all boil down to access to some computing resource, mitigation of each issue consists of making some adjustment to the environment to provide more (or preferential) access to the resource; for instance:

- Adjusting execution schedules to leverage low-usage times may improve availability of memory, disk, network bandwidth, CPU cycles, etc.
- Migrating other applications to other hardware is likely to reduce demand on the hardware hosting PowerCenter.
- For CPU-intensive sessions, raising CPU priority (or lowering the priority of competing processes) provides more CPU time to the PowerCenter sessions.
- Adding hardware resources, such as memory, can make more resource available to all processes.
- Re-configuring existing resources may provide for more efficient usage, such as assigning different disk devices for input and output, striping disk devices, or adjusting network packet sizes.
Detailed Usage
The following tips have proven useful in performance tuning UNIX-based machines. While some of these tips are likely to be more helpful than others in a particular environment, all are worthy of consideration. Availability, syntax and format of each varies across UNIX versions.
Running ps -axu
- Are there any processes waiting for disk access or for paging? If so, check the I/O and memory subsystems.
- What processes are using most of the CPU? This may help you distribute the workload better.
- What processes are using most of the memory? This may also help you distribute the workload better.
- Does ps show that your system is running many memory-intensive jobs? Look for jobs with a large set (RSS) or a high storage integral.
- Are page-outs occurring consistently? If so, you are short of memory.
- Are there a high number of address translation faults? (System V only.) This suggests a memory shortage.
- Are swap-outs occurring consistently? If so, you are extremely short of memory. Occasional swap-outs are normal; BSD systems swap out inactive jobs. Long bursts of swap-outs mean that active jobs are probably falling victim and indicate an extreme memory shortage. If you don't have vmstat -S, look at the w and de fields of vmstat. These should always be zero.
- Reduce the size of the buffer cache (if your system has one) by decreasing BUFPAGES.
- If you have statically allocated STREAMS buffers, reduce the number of large (e.g., 2048- and 4096-byte) buffers. This may reduce network performance, but netstat -m should give you an idea of how many buffers you really need.
- Reduce the size of your kernel's tables. This may limit the system's capacity (i.e., number of files, number of processes, etc.).
- Try running jobs requiring a lot of memory at night. This may not help the memory problems, but you may not care about them as much.
- Try running jobs requiring a lot of memory in a batch queue. If only one memory-intensive job is running at a time, your system may perform satisfactorily.
- Try to limit the time spent running sendmail, which is a memory hog.
- If you don't see any significant improvement, add more memory.
- Reorganize your file systems and disks to distribute I/O activity as evenly as possible. Using symbolic links helps to keep the directory structure the same while still moving the data files that are causing I/O contention.
- Use your fastest disk drive and controller for your root file system; this almost certainly has the heaviest activity. Alternatively, if single-file throughput is important, put performance-critical files into one file system and use the fastest drive for that file system.
- Put performance-critical files on a file system with a large block size: 16KB or 32KB (BSD).
- Increase the size of the buffer cache by increasing BUFPAGES (BSD). This may hurt your system's memory performance.
INFORMATICA CONFIDENTIAL
BEST PRACTICE
535 of 702
Rebuild your file systems periodically to eliminate fragmentation (i.e., back up, build a new file system, and restore). If you are using NFS and remote files, look at your network situation; you don't have local disk I/O problems. Check memory statistics again by running vmstat 5 (sar -rwpg). If your system is paging or swapping consistently, you have memory problems; fix the memory problem first, because swapping makes performance worse.
If your system has a disk capacity problem and is constantly running out of disk space, try the following actions:
Write a find script that detects old core dumps, editor backup and auto-save files, and other trash, and deletes them automatically. Run the script through cron. Use the disk quota system (if your system has one) to prevent individual users from accumulating too much storage. Use a smaller block size on file systems that hold mostly small files (e.g., source code files, object modules, and small data files).
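A sketch of such a cleanup script is shown below. The directory, file patterns, and 7-day age threshold are illustrative, not from this document; the demo creates and backdates its own test files so the behavior is visible.

```shell
# Delete week-old core dumps and editor backup files.
# Schedule via cron, e.g.:  0 3 * * * /usr/local/bin/cleantrash.sh
SCRATCH=/tmp/cleandemo
mkdir -p "$SCRATCH"
touch "$SCRATCH/core" "$SCRATCH/notes.txt"
touch -t 202001010000 "$SCRATCH/core"   # backdate to simulate an old core dump
find "$SCRATCH" -type f \
    \( -name core -o -name 'core.*' -o -name '*~' -o -name '*.bak' \) \
    -mtime +7 -print -delete
```

Only the backdated core file matches both the name patterns and the age test; recent files such as notes.txt are left alone.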
Eliminate unnecessary daemon processes. rwhod and routed are particularly likely to be performance problems, but any savings will help. Get users to run jobs at night with at or any queuing system that's available. You may not care if the CPU (or the memory or I/O system) is overloaded at night, provided the work is done by morning.
Using nice to lower the priority of CPU-bound jobs improves interactive performance. Likewise, using nice to raise the priority of CPU-bound jobs expedites them, but may hurt interactive performance. In general, though, using nice is only a temporary solution. If your workload grows, it will soon become insufficient. Consider upgrading your system, replacing it, or buying another system to share the load.
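In shell terms, this looks like the sketch below; the batch command and PID are illustrative placeholders.

```shell
# Start a CPU-bound batch job at the lowest scheduling priority.
nice -n 19 sh -c 'echo "low-priority batch job"'

# Demote a job that is already running, by PID (owner or root):
# renice -n 19 -p 12345
```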
Reorganize the computers and disks on your network so that as many users as possible can do as much work as possible on a local system. Use systems with good network performance as file servers. lsattr -E -l sys0 displays some current kernel settings in some UNIX environments. (In Solaris, you execute prtenv.) Of particular interest is maxuproc, the setting that determines the maximum number of processes per user. On most UNIX environments this defaults to 40, but it should be increased to 250 on most systems. Choose a file system. Be sure to check the database vendor documentation to determine the best file system for the specific machine. Typical choices include: s5, the UNIX System V file system; ufs, the UNIX file system derived from Berkeley (BSD); vxfs, the Veritas file system; and raw devices, which in reality are not a file system at all. Additionally, for the PowerCenter Grid option, cluster file system (CFS) products such as GFS for Red Hat Linux, Veritas CFS, and GPFS for IBM AIX are some of the available choices.
PowerCenter Options
The Integration Service Monitor is available to display system resource usage information about associated nodes. The window displays resource usage information about the running tasks, including CPU%, memory, and swap usage. The PowerCenter 64-bit option can allocate more memory to sessions and achieve higher throughput than the 32-bit version of PowerCenter.
Description
The following tips have proven useful in performance-tuning Windows Servers. While some are likely to be more helpful than others in any particular environment, all are worthy of consideration. The two places to begin tuning an NT server are:
Performance Monitor.
Performance tab (press Ctrl+Alt+Del, choose Task Manager, and click the Performance tab).
Although the Performance Monitor can be tracked in real-time, creating a result-set representative of a full day is more likely to render an accurate view of system performance.
Device Drivers: The device drivers for some types of hardware are notorious for wasting CPU clock cycles. Be sure to obtain the latest drivers from the hardware vendor to minimize this problem. Memory and services: Although adding memory to Windows Server is always a good solution, it is also expensive and usually must be planned in advance. Before adding memory, check the Services in Control Panel, because many background applications do not uninstall the old service when installing a new version; thus both the unused old service and the new service may be consuming valuable CPU and memory resources. I/O Optimization: This is, by far, the best tuning option for database applications in the Windows Server environment. If necessary, level the load across the disk devices by moving files. In situations where there are multiple controllers, be sure to level the load across the controllers too. Using electrostatic devices and fast-wide SCSI can also help to increase performance. Further, fragmentation can usually be eliminated by using a Windows Server disk defragmentation product. Finally, on Windows Servers, be sure to implement disk striping to split single data files across multiple disk drives and take advantage of RAID (Redundant Arrays of Inexpensive Disks) technology. Also increase the priority of the disk devices; Windows Server sets the disk device priority low by default.
These Windows Server monitoring tools enable you to analyze usage and detect bottlenecks:
System Monitor
The System Monitor displays a flexible, configurable graph. You can copy counter paths and settings from the System Monitor display to the Clipboard, and paste counter paths from Web pages or other sources into the System Monitor display. Because the System Monitor is portable, it is useful for monitoring other systems that require administration.
Performance Logs and Alerts
The Performance Logs and Alerts tool provides two types of performance-related logs (counter logs and trace logs) and an alerting function. Counter logs record sampled data about hardware resources and system services based on performance objects and counters, in the same manner as System Monitor; they can, therefore, be viewed in System Monitor. Data in counter logs can be saved as comma-separated or tab-separated files that are easily viewed with Excel. Trace logs collect event traces that measure performance statistics associated with events such as disk and file I/O, page faults, or thread activity. The alerting function allows you to define a counter value that triggers actions such as sending a network message, running a program, or starting a log. Alerts are useful if you are not actively monitoring a particular counter threshold value but want to be notified when it exceeds or falls below a specified value so that you can investigate and determine the cause of the change. You may want to set alerts based on established performance baseline values for your system. Note: You must have Full Control access to a subkey in the registry in order to create or modify a log configuration. (The subkey is HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\SysmonLog\Log_Queries.) The predefined log settings under Counter Logs (i.e., System Overview) are configured to create a binary log that, after manual start-up, updates every 15 seconds and logs continuously until it reaches a maximum size. If you start logging with the default settings, data is saved to the Perflogs folder on the root directory and includes the counters Memory\Pages/sec, PhysicalDisk(_Total)\Avg. Disk Queue Length, and Processor(_Total)\% Processor Time. If you want to create your own log setting, right-click one of the log types.
PowerCenter Options
The Integration Service Monitor is available to display system resource usage information about associated nodes. The window displays resource usage information about running tasks, including CPU%, memory, and swap usage. PowerCenter's 64-bit option, running on Intel Itanium processor-based machines and 64-bit Windows Server 2003, can allocate more memory to sessions and achieve higher throughput than the 32-bit version of PowerCenter on Windows Server. Using the PowerCenter Grid option on Windows Server enables distribution of one or more sessions in a workflow to multiple servers and reduces the processing load window. The PowerCenter Grid option requires that the directories for each integration service be shared with other servers. This allows integration services to share files such as cache files among various session runs. With a Cluster File System (CFS), integration services running on various servers can perform concurrent reads and writes to the same block of data.
Description
When a PowerCenter session or workflow is not performing at the expected or desired speed, there is a methodology that can help to diagnose problems that may be adversely affecting various components of the data integration architecture. While PowerCenter has its own performance settings that can be tuned, you must consider the entire data integration architecture, including the UNIX/Windows servers, network, disk array, and the source and target databases to achieve optimal performance. More often than not, an issue external to PowerCenter is the cause of the performance problem. In order to correctly and scientifically determine the most logical cause of the performance problem, you need to execute the performance tuning steps in a specific order. This enables you to methodically rule out individual pieces and narrow down the specific areas on which to focus your tuning efforts.
1. Perform Benchmarking
You should always have a baseline of current load times for a given workflow or session with a similar row count. Perhaps you are not achieving your required load window, or you simply think your processes could run more efficiently based on comparison with similar tasks that run faster. Use the benchmark to estimate your desired performance goal and tune to that goal. Begin with the problem mapping, along with a session and workflow that use all default settings. This helps to identify which changes have a positive impact on performance.
The methodology steps you through a series of tests using PowerCenter to identify trends that point where next to focus. Remember to go through these tests in a scientific manner, running them multiple times before reaching any conclusion, and always keep in mind that fixing one bottleneck area may create a different bottleneck. For more information, see Determining Bottlenecks.
For source database-related bottlenecks, refer to Tuning SQL Overrides and Environment for Better Performance.
For target database-related problems, refer to Performance Tuning Databases - Oracle, SQL Server, or Teradata.
For operating system problems, refer to Performance Tuning UNIX Systems or Performance Tuning Windows 2000/2003 Systems for more information.
Problems inside PowerCenter refer to anything that PowerCenter controls, such as the actual transformation logic and PowerCenter Workflow/Session settings. The session settings contain quite a few memory settings and partitioning options that can greatly improve performance. Refer to Tuning Sessions for Better Performance for more information. Although there are certain procedures to follow to optimize mappings, keep in mind that, in most cases, the mapping design is dictated by business logic; there may be a
more efficient way to perform the business logic within the mapping, but you cannot ignore the necessary business logic to improve performance. Refer to Tuning Mappings for Better Performance for more information.
Tuning and Configuring Data Analyzer and Data Analyzer Reports Challenge
A Data Analyzer report that is slow to return data means lag time for a manager or business analyst, and can be a crucial point of failure in the acceptance of a data warehouse. This Best Practice offers suggestions for tuning Data Analyzer and Data Analyzer reports.
Description
Performance tuning reports occurs both at the environment level and the reporting level. Often report performance can be enhanced by looking closely at the objective of the report rather than its suggested appearance. The following guidelines should help with tuning the environment and the report itself.
1. Perform Benchmarking. Benchmark the reports to determine an expected rate of return. Perform benchmarks at various points throughout the day and evening hours to account for inconsistencies in network traffic, database server load, and application server load. This provides a baseline to measure changes against.
2. Review Report. Confirm that all data elements are required in the report. Eliminate any unnecessary data elements, filters, and calculations. Also be sure to remove any extraneous charts or graphs. Consider whether the report can be broken into multiple reports or presented at a higher level. These are often ways to create more visually appealing reports and allow for linked detail reports or drill-down to the detail level.
3. Scheduling of Reports. If the report is on-demand but can be changed to a scheduled report, schedule the report to run during hours when system use is minimized. Consider scheduling large numbers of reports to run overnight. If mid-day updates are required, test the performance at lunch hours and consider scheduling for that time period. Reports that require filters by users can often be copied and filters pre-created to allow for scheduling of the report.
4. Evaluate Database. Database tuning occurs on multiple levels. Begin by reviewing the tables used in the report. Ensure that indexes have been created on dimension keys. If filters are used on attributes, test the creation of secondary indexes to improve the efficiency of the query. Next, execute reports while a DBA monitors the database environment. This gives the DBA the opportunity to tune the database for querying. Finally, look into changes in database settings. Increasing the database memory in the initialization file often improves Data Analyzer performance significantly.
5. Investigate Network. Reports are simply database queries, which can be viewed by clicking the "View SQL" button on the report. Run the query from the report against the database using a client tool on the server where the database resides. One caveat is that even the database tool on the server may contact the outside network; work with the DBA during this test to use a local database connection (e.g., Bequeath/IPC, Oracle's local database communication protocol) and monitor the database throughout the process. This test may pinpoint whether the bottleneck is occurring on the network or in the database. If, for instance, the query performs well regardless of where it is executed, but the report continues to be slow, this indicates an application server bottleneck. Common locations for network bottlenecks include router tables, web server demand, and server input/output. Informatica recommends installing Data Analyzer on a dedicated application server.
6. Tune the Schema. Having tuned the environment and minimized the report requirements, the final level of tuning involves changes to the database tables. Review the underperforming reports. Can any of them be generated from aggregate tables instead of from base tables? Data Analyzer makes efficient use of linked aggregate tables by determining on a report-by-report basis whether the report can utilize an aggregate table. By studying the existing reports and future requirements, you can determine what key aggregates can be created in the ETL tool and stored in the database. Calculated metrics can also be created in an ETL tool and stored in the database instead of created in Data Analyzer. Each time a calculation must be done in Data Analyzer, it is performed as part of the query process. To determine if a query can be improved by building these elements in the database, try removing
them from the report and comparing report performance. Consider whether these elements appear in a multitude of reports or only a few.
7. Database Queries. As a last resort for underperforming reports, you may want to edit the actual report query. To determine if the query is the bottleneck, select the View SQL button on the report. Next, copy the SQL into a query utility and execute it (DBA assistance may be beneficial here). If the query appears to be the bottleneck, revisit Steps 2 and 6 above to ensure that no additional report changes are possible. Once you have confirmed that the report is as required, work to edit the query while continuing to re-test it in a query utility. Additional options include utilizing database views to cache data prior to report generation; reports are then built based on the view.
Note: Editing the report query requires query editing for each report change and may require editing during migrations. Be aware that this is a time-consuming process and a difficult-to-maintain method of performance tuning. The Data Analyzer repository database should be tuned for an OLTP workload.
JVM Layout
The Java Virtual Machine (JVM) heap is the repository for all live objects, dead objects, and free memory. Its primary jobs are allocating memory for new objects and reclaiming the memory of dead objects through garbage collection.
The size of the JVM heap determines how often and how long garbage collection runs. The JVM parameters can be set in startWebLogic.cmd or startWebLogic.sh if you are using the WebLogic application server.
When the new generation fills up, it triggers a minor collection, in which surviving objects are moved to the old generation. When the old generation fills up, it triggers a major collection, which involves the entire object heap. This is more expensive in terms of resources than a minor collection.
6. If you increase the new generation size, the old generation size decreases. Minor collections occur less often, but the frequency of major collections increases.
7. If you decrease the new generation size, the old generation size increases. Minor collections occur more often, but the frequency of major collections decreases.
8. As a general rule, keep the new generation smaller than half the heap size (i.e., 1/4 or 1/3 of the heap size).
9. Enable additional JVMs if you expect large numbers of users. Informatica typically recommends two to three CPUs per JVM.
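Heap and generation sizes are set through JVM options in startWebLogic.sh (or startWebLogic.cmd). The values below are an illustrative sketch applying the quarter-heap rule above on a HotSpot JVM; they are not recommended settings from this document.

```shell
# Illustrative JAVA_OPTIONS for startWebLogic.sh: a fixed 1 GB heap
# with the new generation pinned at one quarter of the heap.
JAVA_OPTIONS="-Xms1024m -Xmx1024m -XX:NewSize=256m -XX:MaxNewSize=256m"
export JAVA_OPTIONS
```

Setting -Xms equal to -Xmx avoids heap resizing at run time, which parallels the connection-pool advice later in this chapter.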
Execute Threads
Execute threads are the threads available to process simultaneous operations in WebLogic. Too few threads means CPUs are under-utilized and jobs wait for threads to become available. Too many threads means the system wastes resources managing threads, and the OS performs unnecessary context switching. The default is 15 threads. Informatica recommends using the default value, but you may need to experiment to determine the optimal value for your environment.
Connection Pooling
The application borrows a connection from the pool, uses it, and then returns it to the pool by closing it.
Initial capacity = 15 Maximum capacity = 15 Sum of connections of all pools should be equal to the number of execution threads.
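In WebLogic, these pool sizes might be pinned in config.xml roughly as follows. This is a hypothetical sketch: the pool name, driver class, URL, and target are placeholders, and the exact element layout depends on the WebLogic version.

```xml
<!-- Illustrative JDBC pool with initial and maximum capacity pinned
     to the same value, per the sizing guidance above. -->
<JDBCConnectionPool Name="repositoryPool"
    DriverName="oracle.jdbc.driver.OracleDriver"
    URL="jdbc:oracle:thin:@dbhost:1521:orcl"
    InitialCapacity="15"
    MaxCapacity="15"
    CapacityIncrement="0"
    Targets="myserver"/>
```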
Setting the initial and maximum pool sizes to the same level avoids the overhead of growing and shrinking the pool dynamically. Performance packs use platform-optimized (i.e., native) sockets to improve server performance. They are available on: Windows NT/2000 (installed by default), Solaris 2.6/2.7, AIX 4.3, HP-UX, and Linux.
Check Enable Native I/O on the server attribute tab. This adds <NativeIOEnabled>true</NativeIOEnabled> to config.xml.
For WebSphere, use the Performance Tuner to modify the configurable parameters. For optimal configuration, separate the application server, the data warehouse database, and the repository database onto separate dedicated machines.
minProcessors. The number of threads created initially in the pool.
maxProcessors. The maximum number of threads that can ever be created in the pool.
acceptCount. Controls the length of the queue of waiting requests when no more threads are available from the pool to process a request.
connectionTimeout. The amount of time to wait before a URI is received from the stream. The default is 20 seconds. This avoids problems where a client opens a connection and does not send any data.
tcpNoDelay. Set to true when data should be sent to the client without waiting for the buffer to be full. This reduces latency at the cost of more packets being sent over the network. The default is true.
enableLookups. Determines whether a reverse DNS lookup is performed. This can be enabled to prevent IP spoofing, but it can cause problems when a DNS server is misbehaving. The enableLookups parameter can be turned off when you implicitly trust all clients.
connectionLinger. How long connections should linger after they are closed. Informatica recommends using the default value: -1 (no linger).
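The parameters above would be set on the HTTP connector in the embedded Tomcat configuration. The fragment below is a hypothetical example, not taken from a Data Analyzer installation; the port and values are illustrative.

```xml
<!-- Illustrative connector sized for roughly 20 concurrent users. -->
<Connector port="8080"
    minProcessors="5"
    maxProcessors="100"
    acceptCount="50"
    connectionTimeout="20000"
    enableLookups="false"
    tcpNoDelay="true"
    connectionLinger="-1"/>
```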
In the Data Analyzer application, each web page can potentially make more than one request to the application server. Hence, maxProcessors should always be greater than the actual number of concurrent users. For an installation with 20 concurrent users, a minProcessors of 5 and a maxProcessors of 100 are suitable values. If the number of threads is too low, the following message may appear in the log files: ERROR [ThreadPool] All threads are busy, waiting. Please increase maxThreads
JSP Optimization. To avoid having the application server compile JSP scripts when they are executed for the first time, Informatica ships Data Analyzer with pre-compiled JSPs. The following is a typical configuration: <JBOSS_HOME>/server/informatica/deploy/jbossweb-tomcat.sar/web.xml
<servlet>
<servlet-name>jsp</servlet-name>
<servlet-class>org.apache.jasper.servlet.JspServlet</servlet-class>
<init-param>
<param-name>logVerbosityLevel</param-name>
<param-value>WARNING</param-value>
</init-param>
<init-param>
<param-name>development</param-name>
<param-value>false</param-value>
</init-param>
<load-on-startup>3</load-on-startup>
</servlet>
The following parameter may need tuning:
Database Connection Pool. Data Analyzer accesses the repository database to retrieve metadata information. When it runs reports, it accesses the data sources to get the report information. Data Analyzer keeps a pool of database connections for the repository. It also keeps a separate database connection pool for each data source. To optimize Data Analyzer database connections, you can tune the database connection pools. Repository Database Connection Pool. To optimize the repository database connection pool, modify the JBoss configuration file: <JBOSS_HOME>/server/informatica/deploy/<DB_Type>_ds.xml The name of the file includes the database type. <DB_Type> can be Oracle, DB2, or other databases. For example, for an Oracle repository, the configuration file name is oracle_ds.xml. With some versions of Data Analyzer, the configuration file may simply be named DataAnalyzer-ds.xml. The following is a typical configuration: <datasources> <local-tx-datasource> <jndi-name>jdbc/IASDataSource</jndi-name> <connection-url> jdbc:informatica:oracle://aries:1521;SID=prfbase8</connection-url> <driver-class>com.informatica.jdbc.oracle.OracleDriver</driver-class> <user-name>powera</user-name> <password>powera</password> <exception-sorter-class-name>org.jboss.resource.adapter.jdbc.vendor.OracleExceptionSorter </exception-sorter-class-name> <min-pool-size>5</min-pool-size> <max-pool-size>50</max-pool-size> <blocking-timeout-millis>5000</blocking-timeout-millis> <idle-timeout-minutes>1500</idle-timeout-minutes> </local-tx-datasource> </datasources> The following parameters may need tuning:
min-pool-size. The minimum number of connections in the pool. (The pool is lazily constructed; it is empty until it is first accessed. Once used, it always has at least min-pool-size connections.)
max-pool-size. The strict maximum size of the connection pool.
blocking-timeout-millis. The maximum time in milliseconds that a caller waits to get a connection when no more free connections are available in the pool.
idle-timeout-minutes. The length of time an idle connection remains in the pool before it is closed and removed.
The max-pool-size value is recommended to be at least five more than the maximum number of concurrent users, because there may be several scheduled reports running in the background and each of them needs a database connection. A higher value is recommended for idle-timeout-minutes. Because Data Analyzer accesses the repository very frequently, it is inefficient to spend resources checking for idle connections and cleaning them out. Checking for idle connections may block other threads that require new connections. Data Source Database Connection Pool. Similar to the repository database connection pools, the data source also has a pool of connections that Data Analyzer dynamically creates as soon as the first client requests a connection. The tuning parameters for these dynamic pools are in the following file: <JBOSS_HOME>/bin/IAS.properties The following is a typical configuration:
#
# Datasource definition
#
dynapool.initialCapacity=5
dynapool.maxCapacity=50
dynapool.capacityIncrement=2
dynapool.allowShrinking=true
dynapool.shrinkPeriodMins=20
dynapool.waitForConnection=true
dynapool.waitSec=1
dynapool.poolNamePrefix=IAS_
dynapool.refreshTestMinutes=60
datamart.defaultRowPrefetch=20
The following JBoss-specific parameters may need tuning:
dynapool.initialCapacity. The minimum number of initial connections in the data source pool.
dynapool.maxCapacity. The maximum number of connections that the data source pool may grow to.
dynapool.poolNamePrefix. A prefix added to the dynamic JDBC pool name for identification purposes.
dynapool.waitSec. The maximum amount of time (in seconds) a client waits to grab a connection from the pool if none is readily available.
dynapool.refreshTestMinutes. Determines the frequency at which a health check is performed on the idle connections in the pool. This should not be done too frequently, because it locks up the connection pool and may prevent other clients from grabbing connections.
dynapool.shrinkPeriodMins. Determines the amount of time (in minutes) an idle connection is allowed to remain in the pool. After this period, the number of connections in the pool shrinks back to the value of the initialCapacity parameter. This is done only if the allowShrinking parameter is set to true.
EJB Container
Data Analyzer uses EJBs extensively. It has more than 50 stateless session beans (SLSB) and more than 60 entity beans (EB). In addition, there are six message-driven beans (MDBs) that are used for the scheduling and real-time functionalities. Stateless Session Beans (SLSB). For SLSBs, the most important tuning parameter is the EJB pool. You can tune the EJB pool parameters in the following file:
<JBOSS_HOME>/server/Informatica/conf/standardjboss.xml. The following is a typical configuration: <container-configuration> <container-name> Standard Stateless SessionBean</container-name> <call-logging>false</call-logging> <invoker-proxy-binding-name> stateless-rmi-invoker</invoker-proxy-binding-name> <container-interceptors> <interceptor>org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor </interceptor> <interceptor> org.jboss.ejb.plugins.LogInterceptor</interceptor> <interceptor> org.jboss.ejb.plugins.SecurityInterceptor</interceptor> <!-- CMT --> <interceptor transaction="Container"> org.jboss.ejb.plugins.TxInterceptorCMT</interceptor> <interceptor transaction="Container" metricsEnabled="true"> org.jboss.ejb.plugins.MetricsInterceptor</interceptor> <interceptor transaction="Container"> org.jboss.ejb.plugins.StatelessSessionInstanceInterceptor </interceptor> <!-- BMT --> <interceptor transaction="Bean"> org.jboss.ejb.plugins.StatelessSessionInstanceInterceptor </interceptor> <interceptor transaction="Bean"> org.jboss.ejb.plugins.TxInterceptorBMT</interceptor> <interceptor transaction="Bean" metricsEnabled="true"> org.jboss.ejb.plugins.MetricsInterceptor</interceptor> <interceptor> org.jboss.resource.connectionmanager.CachedConnectionInterceptor </interceptor> </container-interceptors> <instance-pool> org.jboss.ejb.plugins.StatelessSessionInstancePool</instance-pool> <instance-cache></instance-cache> <persistence-manager></persistence-manager> <container-pool-conf> <MaximumSize>100</MaximumSize> </container-pool-conf> </container-configuration> The following parameter may need tuning:
MaximumSize. Represents the maximum number of objects in the pool. If <strictMaximumSize> is set to true, then <MaximumSize> is a strict upper limit for the number of objects that can be created. If <strictMaximumSize> is set to false, the number of active objects can exceed <MaximumSize> if there are requests for more objects; however, only <MaximumSize> objects can be returned to the pool.
Additionally, there are two other parameters that you can set to fine-tune the EJB pool. These two parameters are not set by default in Data Analyzer. They can be tuned after you have performed proper iterative testing in Data Analyzer to increase the throughput for high-concurrency installations.
strictMaximumSize. When the value is set to true, <strictMaximumSize> enforces a rule that only <MaximumSize> objects can be active. Any subsequent requests must wait for an object to be returned to the pool.
strictTimeout. If you set <strictMaximumSize> to true, then <strictTimeout> is the amount of time that requests wait for an object to be made available in the pool.
Message-Driven Beans (MDB). MDB tuning parameters are very similar to stateless bean tuning parameters. The main difference is that MDBs are not invoked by clients. Instead, the messaging system delivers messages to the MDB when they are available. To tune the MDB parameters, modify the following configuration file: <JBOSS_HOME>/server/informatica/conf/standardjboss.xml The following is a typical configuration: <container-configuration> <container-name>Standard Message Driven Bean</container-name> <call-logging>false</call-logging> <invoker-proxy-binding-name>message-driven-bean </invoker-proxy-binding-name> <container-interceptors> <interceptor>org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor </interceptor> <interceptor>org.jboss.ejb.plugins.LogInterceptor</interceptor> <interceptor>org.jboss.ejb.plugins.RunAsSecurityInterceptor </interceptor> <!-- CMT --> <interceptor transaction="Container"> org.jboss.ejb.plugins.TxInterceptorCMT</interceptor> <interceptor transaction="Container" metricsEnabled="true"> org.jboss.ejb.plugins.MetricsInterceptor </interceptor> <interceptor transaction="Container"> org.jboss.ejb.plugins.MessageDrivenInstanceInterceptor </interceptor> <!-- BMT --> <interceptor transaction="Bean"> org.jboss.ejb.plugins.MessageDrivenInstanceInterceptor </interceptor> <interceptor transaction="Bean"> org.jboss.ejb.plugins.MessageDrivenTxInterceptorBMT </interceptor> <interceptor transaction="Bean" metricsEnabled="true"> org.jboss.ejb.plugins.MetricsInterceptor</interceptor> <interceptor> org.jboss.resource.connectionmanager.CachedConnectionInterceptor </interceptor> </container-interceptors> <instance-pool>org.jboss.ejb.plugins.MessageDrivenInstancePool </instance-pool> <instance-cache></instance-cache> <persistence-manager></persistence-manager> <container-pool-conf> <MaximumSize>100</MaximumSize> </container-pool-conf> </container-configuration> The following parameter may need tuning: MaximumSize. 
Represents the maximum number of objects in the pool. If <strictMaximumSize> is set to true, <MaximumSize> is a strict upper limit on the number of objects that can be created. If <strictMaximumSize> is set to false, the number of active objects can exceed <MaximumSize> if there are requests for more objects; however, only <MaximumSize> objects can be returned to the pool.
Additionally, there are two other parameters that you can set to fine tune the EJB pool. These two parameters are not set by default in Data Analyzer. They can be tuned after you have performed proper iterative testing in Data Analyzer to increase the throughput for high-concurrency installations.
strictMaximumSize. When set to true, the <strictMaximumSize> parameter enforces a rule that only <MaximumSize> objects can be active; any subsequent requests must wait for an object to be returned to the pool.
strictTimeout. If you set <strictMaximumSize> to true, <strictTimeout> is the amount of time that requests wait for an object to become available in the pool.
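The MaximumSize / strictMaximumSize semantics described above can be sketched as a toy object pool in Python. This is a hypothetical illustration of the documented behavior only; the actual JBoss pool implementation differs.

```python
class ToyPool:
    """Toy pool illustrating MaximumSize / strictMaximumSize semantics (sketch)."""

    def __init__(self, maximum_size, strict_maximum_size=False):
        self.maximum_size = maximum_size
        self.strict = strict_maximum_size
        self.idle = []       # objects currently sitting in the pool
        self.active = 0      # objects currently handed out to callers

    def acquire(self):
        if self.idle:
            self.active += 1
            return self.idle.pop()
        if self.strict and self.active >= self.maximum_size:
            # a real container would block here for up to strictTimeout
            raise RuntimeError("pool exhausted; caller must wait")
        self.active += 1     # non-strict: active count may exceed MaximumSize
        return object()

    def release(self, obj):
        self.active -= 1
        if len(self.idle) < self.maximum_size:
            self.idle.append(obj)   # only MaximumSize objects return to the pool
```

With strict_maximum_size=False, a burst of requests can push the active count past MaximumSize (extra objects are simply discarded on release); with strict_maximum_size=True, an acquire beyond the limit must wait instead.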
Enterprise Java Beans (EJB). Data Analyzer EJBs use BMP (bean-managed persistence) as opposed to CMP (container-managed persistence). The EJB tuning parameters are very similar to the stateless bean tuning parameters and are in the following configuration file:

<JBOSS_HOME>/server/informatica/conf/standardjboss.xml

The following is a typical configuration:

<container-configuration>
  <container-name>Standard BMP EntityBean</container-name>
  <call-logging>false</call-logging>
  <invoker-proxy-binding-name>entity-rmi-invoker</invoker-proxy-binding-name>
  <sync-on-commit-only>false</sync-on-commit-only>
  <container-interceptors>
    <interceptor>org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor</interceptor>
    <interceptor>org.jboss.ejb.plugins.LogInterceptor</interceptor>
    <interceptor>org.jboss.ejb.plugins.SecurityInterceptor</interceptor>
    <interceptor>org.jboss.ejb.plugins.TxInterceptorCMT</interceptor>
    <interceptor metricsEnabled="true">org.jboss.ejb.plugins.MetricsInterceptor</interceptor>
    <interceptor>org.jboss.ejb.plugins.EntityCreationInterceptor</interceptor>
    <interceptor>org.jboss.ejb.plugins.EntityLockInterceptor</interceptor>
    <interceptor>org.jboss.ejb.plugins.EntityInstanceInterceptor</interceptor>
    <interceptor>org.jboss.ejb.plugins.EntityReentranceInterceptor</interceptor>
    <interceptor>org.jboss.resource.connectionmanager.CachedConnectionInterceptor</interceptor>
    <interceptor>org.jboss.ejb.plugins.EntitySynchronizationInterceptor</interceptor>
  </container-interceptors>
  <instance-pool>org.jboss.ejb.plugins.EntityInstancePool</instance-pool>
  <instance-cache>org.jboss.ejb.plugins.EntityInstanceCache</instance-cache>
  <persistence-manager>org.jboss.ejb.plugins.BMPPersistenceManager</persistence-manager>
  <locking-policy>org.jboss.ejb.plugins.lock.QueuedPessimisticEJBLock</locking-policy>
  <container-cache-conf>
    <cache-policy>org.jboss.ejb.plugins.LRUEnterpriseContextCachePolicy</cache-policy>
    <cache-policy-conf>
      <min-capacity>50</min-capacity>
      <max-capacity>1000000</max-capacity>
      <overager-period>300</overager-period>
      <max-bean-age>600</max-bean-age>
      <resizer-period>400</resizer-period>
      <max-cache-miss-period>60</max-cache-miss-period>
      <min-cache-miss-period>1</min-cache-miss-period>
      <cache-load-factor>0.75</cache-load-factor>
    </cache-policy-conf>
  </container-cache-conf>
  <container-pool-conf>
    <MaximumSize>100</MaximumSize>
  </container-pool-conf>
  <commit-option>A</commit-option>
</container-configuration>

The following parameter may need tuning:

MaximumSize. Represents the maximum number of objects in the pool. If <strictMaximumSize> is set to true, <MaximumSize> is a strict upper limit on the number of objects that can be created. If <strictMaximumSize> is set to false, the number of active objects can exceed <MaximumSize> if there are requests for more objects; however, only <MaximumSize> objects are returned to the pool.

Additionally, there are two other parameters that you can set to fine-tune the EJB pool. These two parameters are not set by default in Data Analyzer. They can be tuned, after proper iterative testing in Data Analyzer, to increase throughput for high-concurrency installations.
strictMaximumSize. When set to true, the <strictMaximumSize> parameter enforces a rule that only <MaximumSize> objects can be active; any subsequent requests must wait for an object to be returned to the pool.
strictTimeout. If you set <strictMaximumSize> to true, <strictTimeout> is the amount of time that requests wait for an object to become available in the pool.
RMI Pool
The JBoss Application Server can be configured to have a pool of threads to accept connections from clients for remote method invocation (RMI). If you use the Java RMI protocol to access the Data Analyzer API from other custom applications, you can optimize the RMI thread pool parameters. To optimize the RMI pool, modify the following configuration file:

<JBOSS_HOME>/server/informatica/conf/jboss-service.xml

The following is a typical configuration:

<mbean code="org.jboss.invocation.pooled.server.PooledInvoker" name="jboss:service=invoker,type=pooled">
  <attribute name="NumAcceptThreads">1</attribute>
  <attribute name="MaxPoolSize">300</attribute>
  <attribute name="ClientMaxPoolSize">300</attribute>
  <attribute name="SocketTimeout">60000</attribute>
  <attribute name="ServerBindAddress"></attribute>
  <attribute name="ServerBindPort">0</attribute>
  <attribute name="ClientConnectAddress"></attribute>
  <attribute name="ClientConnectPort">0</attribute>
  <attribute name="EnableTcpNoDelay">false</attribute>
  <depends optional-attribute-name="TransactionManagerService">
    jboss:service=TransactionManager
  </depends>
</mbean>

The following parameters may need tuning:
NumAcceptThreads. The number of threads used to accept connections from clients.
MaxPoolSize. A strict maximum size for the pool of threads that service requests on the server.
ClientMaxPoolSize. A strict maximum size for the pool of threads that service requests on the client.
Backlog. The number of requests that can queue while all processing threads are in use.
EnableTcpNoDelay. Indicates whether data should be sent before the buffer is full. Setting it to true may increase network traffic because more packets are sent across the network.
WebSphere Application Server 5.1. The Tivoli Performance Viewer can be used to observe the behavior of these parameters and arrive at good settings.
Web Container
Navigate to Application Servers > [your_server_instance] > Web Container > Thread Pool to tune the following parameters.
Minimum Size. Specifies the minimum number of threads to allow in the pool. The default value of 10 is appropriate.
Maximum Size. Specifies the maximum number of threads to allow in the pool. For a highly concurrent usage scenario (with a 3-VM load-balanced configuration), a value of 50-60 has been found to be optimal.
Thread Inactivity Timeout. Specifies the number of milliseconds of inactivity that should elapse before a thread is reclaimed. The default of 3500 ms is considered optimal.
Is Growable. Specifies whether the number of threads can increase beyond the maximum size configured for the thread pool. Leave this option unchecked so that the thread count is hard-limited to the Maximum Size value.
Note: In a load-balanced environment, there is likely to be more than one server instance that may be spread across multiple machines. In such a scenario, be sure that the changes have been properly propagated to all of the server instances.
Transaction Services
Total transaction lifetime timeout. In certain circumstances (e.g., importing large XML files), the default value of 120 seconds may not be sufficient and should be increased. This parameter can also be modified at runtime.
Debugging Services
Ensure that tracing is disabled in a production environment. Navigate to Application Servers > [your_server_instance] > Logging and Tracing > Diagnostic Trace Service > Debugging Service and make sure Startup is not checked.
-Xnocompactgc. Switches off heap compaction altogether, which results in heap fragmentation. Because Data Analyzer frequently allocates large objects, heap fragmentation can result in OutOfMemory exceptions.
-Xcompactgc. Causes every garbage collection cycle to carry out compaction, regardless of whether it is useful.
-Xgcthreads. Controls the number of garbage collection helper threads created by the JVM during startup. The default is N-1 threads for an N-processor machine. These threads provide the parallelism in parallel mark and parallel sweep modes, which reduces the pause time during garbage collection.
-Xclassnogc. Disables collection of class objects.
-Xinitsh. Sets the initial size of the application-class system heap. The system heap is expanded as needed and is never garbage collected.
You may want to alter the following parameters after carefully examining the application server processes:
Navigate to Application Servers > [your_server_instance] > Process Definition > Java Virtual Machine.
Verbose garbage collection. Check this option to turn on verbose garbage collection, which can help in understanding the garbage collection behavior of the application. It has very low performance overhead and can be turned on even in a production environment.
Initial heap size. This is the -ms value; specify only the numeric value (without MB). For concurrent usage, start the initial heap size at 1000 and, depending on the garbage collection behavior, potentially increase it up to 2000. A value beyond 2000 may
actually reduce throughput, because the garbage collection cycles take more time to traverse the large heap even though they may occur less frequently.
Maximum heap size. This is the -mx value. It should equal the Initial heap size value.
RunHProf. This should remain unchecked in production mode because it slows down the VM considerably.
Debug Mode. This should remain unchecked in production mode because it slows down the VM considerably.
Disable JIT. This should remain unchecked (i.e., the JIT should never be disabled).
Connection Timeout. The default value of 180 seconds is usually adequate. After 180 seconds, a request to grab a connection from the pool times out and Data Analyzer throws an exception; in that case, the pool size may need to be increased.
Max Connections. The maximum number of connections in the pool. Informatica recommends a value of 50.
Min Connections. The minimum number of connections in the pool. Informatica recommends a value of 10.
Reap Time. Specifies the frequency of the pool maintenance thread. This should not run very often, because while the maintenance thread is running it blocks the whole pool and no process can grab a new connection from the pool. If the database and the network are reliable, this should have a very high value (e.g., 1000).
Unused Timeout. Specifies the time in seconds after which an unused connection is discarded, until the pool shrinks to its minimum size. In highly concurrent usage, this should be a high value; the default of 1800 seconds should be fine.
Aged Timeout. Specifies the interval in seconds before a physical connection is discarded. If the database and the network are stable, there should be no reason to age out connections; the default is 0 (i.e., connections do not age). If the database or the network connection to the repository database goes down frequently (compared to the life of the application server), this can be used to age out stale connections.
Much like the repository database connection pools, the data source or data warehouse databases also have a pool of connections that are created dynamically by Data Analyzer as soon as the first client makes a request. The tuning parameters for these dynamic pools are present in <WebSphere_Home>/AppServer/IAS.properties file.
#
# Datasource definition
#
dynapool.initialCapacity=5
dynapool.maxCapacity=50
dynapool.capacityIncrement=2
dynapool.allowShrinking=true
dynapool.shrinkPeriodMins=20
dynapool.waitForConnection=true
dynapool.waitSec=1
dynapool.poolNamePrefix=IAS_
dynapool.refreshTestMinutes=60
datamart.defaultRowPrefetch=20
The various parameters that may need tuning are:
dynapool.initialCapacity - the minimum number of initial connections in the data-source pool.
dynapool.maxCapacity - the maximum number of connections to which the data-source pool may grow.
dynapool.poolNamePrefix - a prefix added to the dynamic JDBC pool name for identification purposes.
dynapool.waitSec - the maximum amount of time (in seconds) that a client waits to grab a connection from the pool if none is readily available.
dynapool.refreshTestMinutes - determines how frequently a health check is performed on the idle connections in the pool. Such checks should not run too frequently because they lock up the connection pool and may prevent other clients from grabbing connections from the pool.
dynapool.shrinkPeriodMins - determines the amount of time (in minutes) an idle connection is allowed to remain in the pool. After this period, the number of connections in the pool decreases (to its initialCapacity). This is done only if allowShrinking is set to true.
Navigate to Application Servers > [your_server_instance] > Message Listener Service > Listener Ports > IAS_ScheduleMDB_ListenerPort. The parameters that can be tuned are:
Maximum sessions. The default value is one. In a highly concurrent user scenario, Informatica does not recommend going beyond five.
Maximum messages. This should remain one, which means each report in a schedule is executed in a separate transaction instead of in a batch. Setting it to more than one may have unwanted effects, such as transaction timeouts, and the failure of one report may cause all the reports in the batch to fail.
Description
Analyze mappings for tuning only after you have tuned the target and source for peak performance. To optimize mappings, you generally reduce the number of transformations in the mapping and delete unnecessary links between transformations. For transformations that use data cache (such as Aggregator, Joiner, Rank, and Lookup transformations), limit connected input/output or output ports. Doing so can reduce the amount of data the transformations store in the data cache. Having too many Lookups and Aggregators can encumber performance because each requires index cache and data cache. Since both are fighting for memory space, decreasing the number of these transformations in a mapping can help improve speed. Splitting them up into different mappings is another option. Limit the number of Aggregators in a mapping. A high number of Aggregators can increase I/O activity on the cache directory. Unless the seek/access time is fast on the directory itself, having too many Aggregators can cause a bottleneck. Similarly, too many Lookups in a mapping causes contention of disk and memory, which can lead to thrashing, leaving insufficient memory to run a mapping efficiently.
In Mapping X, the source and lookup contain the following number of records:
ITEMS (source): 5,000 records
MANUFACTURER: 200 records
DIM_ITEMS: 100,000 records
LKP_Manufacturer
             Build Cache   Read Source Records   Execute Lookup   Total # of Disk Reads
Cached       200           5,000                 0                5,200
Un-cached    0             5,000                 5,000            10,000
LKP_DIM_ITEMS
             Build Cache   Read Source Records   Execute Lookup   Total # of Disk Reads
Cached       100,000       5,000                 0                105,000
Un-cached    0             5,000                 5,000            10,000
Consider the case where MANUFACTURER is the lookup table. If the lookup table is cached, it takes a total of 5,200 disk reads to build the cache and execute the lookup. If the lookup table is not cached, it takes a total of 10,000 disk reads to execute the lookup. In this case, the number of records in the lookup table is small in comparison with the number of times the lookup is executed, so this lookup should be cached. This is the more likely scenario.

Consider the case where DIM_ITEMS is the lookup table. If the lookup table is cached, it results in 105,000 total disk reads to build and execute the lookup. If the lookup table is not cached, the disk reads total 10,000. In this case the number of records in the lookup table is not small in comparison with the number of times the lookup is executed, so the lookup should not be cached.

Use the following eight-step method to determine if a lookup should be cached:
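The disk-read arithmetic in the two cases above can be checked with a short Python sketch. The simple cost model (one read per lookup-table row to build the cache plus one per source row, versus one read per source row plus one per lookup execution) is taken directly from the worked example, not from PowerCenter internals:

```python
def lookup_disk_reads(lookup_rows, source_rows, cached):
    """Total disk reads under the example's simple cost model."""
    if cached:
        # build the cache (one read per lookup-table row), then read the source
        return lookup_rows + source_rows
    # read the source, then issue one lookup query (one disk read) per source row
    return source_rows + source_rows

# MANUFACTURER (200 rows) as the lookup table, 5,000 source rows:
assert lookup_disk_reads(200, 5000, cached=True) == 5200       # cheaper: cache it
assert lookup_disk_reads(200, 5000, cached=False) == 10000

# DIM_ITEMS (100,000 rows) as the lookup table:
assert lookup_disk_reads(100000, 5000, cached=True) == 105000
assert lookup_disk_reads(100000, 5000, cached=False) == 10000  # cheaper: don't cache
```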
1. Code the lookup into the mapping.
2. Select a standard set of data from the source. For example, add a "where" clause on a relational source to load a sample of 10,000 rows.
3. Run the mapping with caching turned off and save the log.
4. Run the mapping with caching turned on and save the log under a different name than the log created in step 3.
5. Look in the cached lookup log and determine how long it takes to cache the lookup object. Note this time in seconds: LOOKUP TIME IN SECONDS = LS.
6. In the non-cached log, take the time from the last lookup cache to the end of the load in seconds and divide it into the number of rows being processed: NON-CACHED ROWS PER SECOND = NRS.
7. In the cached log, take the time from the last lookup cache to the end of the load in seconds and divide it into the number of rows being processed: CACHED ROWS PER SECOND = CRS.
8. Use the following formula to find the breakeven row point:
(LS*NRS*CRS)/(CRS-NRS) = X
where X is the breakeven point. If the expected number of source records is less than X, it is better not to cache the lookup; if it is more than X, it is better to cache the lookup.
For example, assume the lookup takes 166 seconds to cache (LS=166), a cached lookup loads 232 rows per second (CRS=232), and a non-cached lookup loads 147 rows per second (NRS=147). The formula gives: (166*147*232)/(232-147) = 66,603. Thus, if the source has fewer than 66,603 records, the lookup should not be cached; if it has more, the lookup should be cached.
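The breakeven formula from step 8 is easy to encode; this sketch reproduces the worked example's numbers:

```python
def cache_breakeven_rows(ls, nrs, crs):
    """Breakeven source-row count X = (LS * NRS * CRS) / (CRS - NRS).

    ls:  seconds to build the lookup cache
    nrs: non-cached throughput, rows per second
    crs: cached throughput, rows per second
    Cache the lookup only when the expected source row count exceeds X.
    """
    return (ls * nrs * crs) / (crs - nrs)

x = cache_breakeven_rows(ls=166, nrs=147, crs=232)
assert round(x) == 66603   # matches the worked example above
```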
Within a specific session run, if the same lookup is used multiple times in a mapping, the PowerCenter Server re-uses the cache for the multiple instances of the lookup. Even so, using the same lookup multiple times in the mapping is more resource-intensive with each successive instance. If multiple cached lookups are from the same table but are expected to return different columns of data, it may be better to set up the multiple lookups to bring back the same columns, even though not all return ports are used in all lookups; bringing back a common set of columns may reduce the number of disk reads.

Across sessions of the same mapping, the use of an unnamed persistent cache allows multiple runs to use an existing cache file stored on the PowerCenter Server. If the option to create a persistent cache is set in the lookup properties, the memory cache created for the lookup during the initial run is saved to the PowerCenter Server. This can improve performance because the Server builds the memory cache from cache files instead of the database. This feature should only be used when the lookup table is not expected to change between session runs. Across different mappings and sessions, the use of a named persistent cache allows sharing an existing cache file.
There is an option to use a SQL override in the creation of a lookup cache. Options can be added to the WHERE clause to reduce the set of records included in the resulting cache. Note: If you use a SQL override in a lookup, the lookup must be cached.
In the case of a cached lookup, an ORDER BY condition is issued in the SQL statement used to create the cache. Columns used in the ORDER BY condition should be indexed. The session log will contain the ORDER BY statement.
In the case of an un-cached lookup, since a SQL statement is created for each row passing into the lookup transformation, performance can be helped by indexing columns in the lookup condition.
There are two ways to join data originating from the same source:
Join two branches of the same pipeline.
Create two instances of the same source and join the pipelines from these source instances.
You may want to join data from the same source if you want to perform a calculation on part of the data and join the transformed data with the original data. When you join the data using this method, you can maintain the original data and transform parts of that data within one mapping. When you join data from the same source, you can create two branches of the pipeline. When you branch a pipeline, you must add a transformation between the Source Qualifier and the Joiner transformation in at least one branch of the pipeline. You must join sorted data and configure the Joiner transformation for sorted input.
If you want to join unsorted data, you must create two instances of the same source and join the pipelines. For example, you may have a source containing employee, department, and sales data.
In the target table, you want to view the employees who generated sales that were greater than the average sales for their respective departments. To accomplish this, you create a mapping with the following transformations:
Sorter transformation. Sorts the data.
Sorted Aggregator transformation. Averages the sales data and groups by department. When you perform this aggregation, you lose the data for individual employees. To maintain employee data, you must pass a branch of the pipeline to the Aggregator transformation and pass a branch with the same data to the Joiner transformation to maintain the original data. When you join both branches of the pipeline, you join the aggregated data with the original data.
Sorted Joiner transformation. Joins the sorted aggregated data with the original data.
Filter transformation. Compares the average sales data against the sales data for each employee and filters out employees whose sales are not above average.
Note: You can also join data from output groups of the same transformation, such as the Custom transformation or XML Source Qualifier transformations. Place a Sorter transformation between each output group and the Joiner transformation and configure the Joiner transformation to receive sorted input.

Joining two branches can affect performance if the Joiner transformation receives data from one branch much later than the other branch. The Joiner transformation caches all the data from the first branch, and writes the cache to disk if the cache fills. The Joiner transformation must then read the data from disk when it receives the data from the second branch. This can slow processing.

You can also join same-source data by creating a second instance of the source. After you create the second source instance, you can join the pipelines from the two source instances. Note: When you join data using this method, the PowerCenter Server reads the source data for each source instance, so performance can be slower than joining two branches of a pipeline.

Use the following guidelines when deciding whether to join branches of a pipeline or join two instances of a source:
Join two branches of a pipeline when you have a large source or if you can read the source data only once. For example, you can read source data from a message queue only once.
Join two branches of a pipeline when you use sorted data. If the source data is unsorted and you use a Sorter transformation to sort the data, branch the pipeline after you sort the data.
Join two instances of a source when you need to add a blocking transformation to the pipeline between the source and the Joiner transformation.
Join two instances of a source if one pipeline may process much more slowly than the other
pipeline.
Performance Tips
Use the database to do the join when sourcing data from the same database schema. Database systems can usually perform the join more quickly than the PowerCenter Server, so a SQL override or a join condition should be used when joining multiple tables from the same database schema.
Use Normal joins whenever possible. Normal joins are faster than outer joins, and the resulting data set is smaller.
Join sorted data when possible. You can improve session performance by configuring the Joiner transformation to use sorted input. When you configure the Joiner transformation to use sorted data, the PowerCenter Server improves performance by minimizing disk input and output. You see the greatest performance improvement when you work with large data sets.
For an unsorted Joiner transformation, designate the source with fewer rows as the master source. For optimal performance and disk storage, the master source should be the source with fewer rows. During a session, the Joiner transformation compares each row of the master source against the detail source. The fewer unique rows in the master, the fewer iterations of the join comparison occur, which speeds the join process.
For a sorted Joiner transformation, designate the source with fewer duplicate key values as the master source. When the PowerCenter Server processes a sorted Joiner transformation, it caches rows for one hundred keys at a time. If the master source contains many rows with the same key value, the PowerCenter Server must cache more rows, and performance can be slowed.
Optimize sorted Joiner transformations with partitions. When you use partitions with a sorted Joiner transformation, you may optimize performance by grouping data and using n:n partitions.
Processing field-level transformations takes time, and if the transformation expressions are complex, processing is even slower. It is often possible to gain a 10 to 20 percent performance improvement by optimizing complex field-level transformations. Use the target table mapping reports or the Metadata Reporter to examine the transformations. Likely candidates for optimization are the fields with the most complex expressions. Keep in mind that there may be more than one field causing performance problems.
Aggregate function calls can sometimes be reduced. For each aggregate function call, the PowerCenter Server must search and group the data. Thus, the following expression:
SUM(Column_A) + SUM(Column_B)
can be optimized to:
SUM(Column_A + Column_B)
In general, operators are faster than functions, so operators should be used whenever possible. For example, an expression that involves a CONCAT function such as:
CONCAT(CONCAT(FIRST_NAME, ' '), LAST_NAME)
can be optimized to:
FIRST_NAME || ' ' || LAST_NAME
Remember that IIF() is a function that returns a value, not just a logical test. This allows many logical statements to be written in a more compact fashion. For example:
IIF(FLG_A='Y' AND FLG_B='Y' AND FLG_C='Y', VAL_A+VAL_B+VAL_C,
IIF(FLG_A='Y' AND FLG_B='Y' AND FLG_C='N', VAL_A+VAL_B,
IIF(FLG_A='Y' AND FLG_B='N' AND FLG_C='Y', VAL_A+VAL_C,
IIF(FLG_A='Y' AND FLG_B='N' AND FLG_C='N', VAL_A,
IIF(FLG_A='N' AND FLG_B='Y' AND FLG_C='Y', VAL_B+VAL_C,
IIF(FLG_A='N' AND FLG_B='Y' AND FLG_C='N', VAL_B,
IIF(FLG_A='N' AND FLG_B='N' AND FLG_C='Y', VAL_C,
IIF(FLG_A='N' AND FLG_B='N' AND FLG_C='N', 0.0))))))))
can be optimized to:
IIF(FLG_A='Y', VAL_A, 0.0) + IIF(FLG_B='Y', VAL_B, 0.0) + IIF(FLG_C='Y', VAL_C, 0.0)
The original expression had 8 IIFs, 16 ANDs, and 24 comparisons. The optimized expression has three IIFs, three comparisons, and two additions.
Be creative in making expressions more efficient. The following rework of an expression reduces three comparisons to one:
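The equivalence of the eight-branch nested IIF and the three-term rewrite can be verified exhaustively. This Python sketch models the two expressions over all flag combinations (the VAL_* constants are arbitrary sample values, not from the original):

```python
from itertools import product

VAL_A, VAL_B, VAL_C = 10.0, 20.0, 40.0   # arbitrary sample values

def nested_iif(flg_a, flg_b, flg_c):
    """Truth table of the original eight-branch nested IIF expression."""
    branches = {
        ('Y', 'Y', 'Y'): VAL_A + VAL_B + VAL_C,
        ('Y', 'Y', 'N'): VAL_A + VAL_B,
        ('Y', 'N', 'Y'): VAL_A + VAL_C,
        ('Y', 'N', 'N'): VAL_A,
        ('N', 'Y', 'Y'): VAL_B + VAL_C,
        ('N', 'Y', 'N'): VAL_B,
        ('N', 'N', 'Y'): VAL_C,
        ('N', 'N', 'N'): 0.0,
    }
    return branches[(flg_a, flg_b, flg_c)]

def flat_iif(flg_a, flg_b, flg_c):
    """The optimized form: IIF(FLG='Y', VAL, 0.0) summed per flag."""
    return ((VAL_A if flg_a == 'Y' else 0.0)
            + (VAL_B if flg_b == 'Y' else 0.0)
            + (VAL_C if flg_c == 'Y' else 0.0))

# the two forms agree on all eight flag combinations
for flags in product('YN', repeat=3):
    assert nested_iif(*flags) == flat_iif(*flags)
```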
IIF(X=1 OR X=5 OR X=9, 'yes', 'no')
can be optimized to:
IIF(MOD(X, 4) = 1, 'yes', 'no')
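One caveat worth checking before applying this kind of rewrite: the two expressions agree only on a bounded domain. Over an assumed range of 0..11 the rewrite holds, but outside it (e.g., X = 13) MOD(X, 4) = 1 is also true, so the trick is valid only when X is known to stay in the narrower range:

```python
# Over the assumed domain 0..11, "X in {1, 5, 9}" and "MOD(X, 4) = 1" agree.
for x in range(12):
    assert (x in (1, 5, 9)) == (x % 4 == 1)

# Outside that domain the rewrite diverges: 13 % 4 == 1 but 13 is not in {1, 5, 9}.
assert (13 % 4 == 1) and (13 not in (1, 5, 9))
```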
You can use any command that is valid for the database type; however, the PowerCenter Server does not allow nested comments, even though the database may. You can use mapping parameters and variables in SQL executed against the source, but not against the target. Use a semi-colon (;) to separate multiple statements. The PowerCenter Server ignores semi-colons within single quotes, double quotes, or within /* ... */ comments; if you need to use a semi-colon outside of quotes or comments, escape it with a backslash (\). The Workflow Manager does not validate the SQL.
For example, filter conditions can be stored in variable ports and used in conditional aggregate functions:

V_CONDITION1: JOB_STATUS = 'Full-time'
V_CONDITION2: OFFICE_ID = 1000

AVG( SALARY, V_CONDITION1 AND V_CONDITION2 )
SUM( SALARY, V_CONDITION1 AND V_CONDITION2 )
You can use variables to store data from prior rows, which can help you perform procedural calculations. To compare the previous state to the state just read:
IIF( PREVIOUS_STATE = STATE, STATE_COUNTER + 1, 1 )
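The prior-row variable technique can be sketched in Python: the `previous_state` variable plays the role of the variable port holding the prior row's value, and the counter resets whenever the state changes (the sample state values are illustrative only):

```python
def state_counters(states):
    """Running count of consecutive rows with the same STATE, mimicking the
    variable-port expression IIF(PREVIOUS_STATE = STATE, STATE_COUNTER + 1, 1)."""
    previous_state = None
    counter = 0
    out = []
    for state in states:
        # increment while the state repeats; reset to 1 on a new state
        counter = counter + 1 if state == previous_state else 1
        out.append(counter)
        previous_state = state   # the variable carries data from the prior row
    return out

assert state_counters(['CA', 'CA', 'NY', 'NY', 'NY', 'TX']) == [1, 2, 1, 2, 3, 1]
```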
Description
Once you have optimized the source and target databases and the mapping, you can focus on optimizing the session. The greatest area for improvement at the session level usually involves tweaking memory cache settings. The Aggregator (without sorted ports), Joiner, Rank, Sorter, and Lookup transformations (with caching enabled) use caches. The PowerCenter Server uses index and data caches for each of these transformations. If the allocated data or index cache is not large enough to store the data, the PowerCenter Server stores the data in a temporary disk file as it processes the session data. Each time the PowerCenter Server pages to the temporary file, performance slows.

You can see when the PowerCenter Server pages to the temporary file by examining the performance details. The transformation_readfromdisk or transformation_writetodisk counters for any Aggregator, Rank, Lookup, Sorter, or Joiner transformation indicate the number of times the PowerCenter Server must page to disk to process the transformation.

Index and data caches should both be sized according to the requirements of the individual lookup. The sizing can be done using the estimation tools provided in the Transformation Guide, or through observation of actual cache sizes in the session cache directory.

The PowerCenter Server creates the index and data cache files by default in the PowerCenter Server variable directory, $PMCacheDir. The naming convention used by the PowerCenter Server for these files is PM[type of transformation][generated session instance id number]_[transformation instance id number]_[partition index].dat or .idx. For example, an aggregate data cache file would be named PMAGG31_19.dat. The cache directory may be changed, however, if disk space is a constraint. Informatica recommends that the cache directory be local to the PowerCenter Server. A RAID 0 arrangement that gives maximum performance with no redundancy is
recommended for volatile cache file directories (i.e., no persistent caches). If the PowerCenter Server requires more memory than the configured cache size, it stores the overflow values in these cache files. Since paging to disk can slow session performance, the RAM allocated needs to be available on the server. If the server doesn't have available RAM and uses paged memory, your session is again accessing the hard disk; in this case, it is more efficient to allow PowerCenter to page the data rather than the operating system. Adding additional memory to the server is, of course, the best solution. Refer to Session Caches in the Workflow Administration Guide for detailed information on determining cache sizes. The PowerCenter Server writes to the index and data cache files during a session in the following cases:
- The mapping contains one or more Aggregator transformations, and the session is configured for incremental aggregation.
- The mapping contains a Lookup transformation that is configured to use a persistent lookup cache, and the PowerCenter Server runs the session for the first time.
- The mapping contains a Lookup transformation that is configured to initialize the persistent lookup cache.
- The Data Transformation Manager (DTM) process in a session runs out of cache memory and pages to the local cache files. The DTM may create multiple files when processing large amounts of data. The session fails if the local directory runs out of disk space.
When a session is running, the PowerCenter Server writes a message in the session log indicating the cache file name and the transformation name. When a session completes, the DTM generally deletes the overflow index and data cache files. However, index and data files may exist in the cache directory if the session is configured for either incremental aggregation or to use a persistent lookup cache. Cache files may also remain if the session does not complete successfully.
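A quick way to check for cache spill after the fact is to scan the cache directory for leftover index and data files. The sketch below is a hypothetical monitoring helper, not an Informatica utility; it only assumes the PM*.dat / PM*.idx naming convention described above.

```python
import os

def report_cache_files(cache_dir):
    """Scan a PowerCenter cache directory (e.g., $PMCacheDir) for
    index (.idx) and data (.dat) cache files and report their sizes.
    Large or lingering files for non-persistent caches suggest the
    configured cache was too small and the server paged to disk."""
    report = []
    for name in sorted(os.listdir(cache_dir)):
        if name.startswith("PM") and name.endswith((".dat", ".idx")):
            path = os.path.join(cache_dir, name)
            report.append((name, os.path.getsize(path)))
    return report
```

Running this between sessions can also reveal orphaned cache files left behind by failed runs.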
Session caches include index and data caches for the Aggregator, Rank, Joiner, and Lookup transformations, as well as Sorter and XML target caches. The values stored in the data and index caches depend upon the requirements of the transformation. For example, the Aggregator index cache stores group values as configured in the group by ports, and the data cache stores calculations based on the group by ports. When the Integration Service processes a Sorter transformation or writes data to an XML target, it also creates a cache.
session containing a Lookup transformation, the Integration Service overrides or disables the automatic memory settings and uses the default values. When you run a session on a grid and you configure Maximum Memory Allowed for Auto Memory Attributes, the Integration Service divides the allocated memory among all the nodes in the grid. When you configure Maximum Percentage of Total Memory Allowed for Auto Memory Attributes, the Integration Service allocates the specified percentage of memory on each node in the grid.
Aggregator Caches
Keep the following items in mind when configuring the aggregate memory cache sizes:
- Allocate enough space to hold at least one row in each aggregate group. Remember that you only need to configure cache memory for an Aggregator transformation that does not use sorted ports. The PowerCenter Server uses Session Process memory to process an Aggregator transformation with sorted ports, not cache memory.
- Incremental aggregation can improve session performance. When it is used, the PowerCenter Server saves index and data cache information to disk at the end of the session. The next time the session runs, the PowerCenter Server uses this historical information to perform the incremental aggregation. The PowerCenter Server names these files PMAGG*.dat and PMAGG*.idx and saves them to the cache directory. Mappings that have sessions which use incremental aggregation should be set up so that only new detail records are read with each subsequent run.
- When configuring Aggregator data cache size, remember that the data cache holds row data for variable ports and connected output ports only. As a result, the data cache is generally larger than the index cache. To reduce the data cache size, connect only the necessary output ports to subsequent transformations.
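As a rough illustration of the sizing exercise above, the sketch below estimates aggregator index and data cache sizes from the number of distinct groups and the row widths. The formula and the 17-byte per-row overhead are simplifying assumptions for illustration only, not Informatica's published algorithm; use the Transformation Guide's estimation tools for real sizing.

```python
def estimate_agg_caches(n_groups, group_key_bytes, output_row_bytes,
                        row_overhead=17):
    """Rough aggregator cache estimate (a sketch): the index cache
    holds one entry per distinct group key, and the data cache holds
    one aggregate row per group. row_overhead is an assumed per-row
    bookkeeping cost, not a vendor-documented constant."""
    index_cache = n_groups * (group_key_bytes + row_overhead)
    data_cache = n_groups * (output_row_bytes + row_overhead)
    return index_cache, data_cache
```

For example, one million groups with a 16-byte key and 64 bytes of connected output ports would need on the order of tens of megabytes per cache; compare the estimate against the actual PMAGG*.idx / PMAGG*.dat file sizes after a run.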
Joiner Caches
When a session is run with a Joiner transformation, the PowerCenter Server reads from master and detail sources concurrently and builds index and data caches based on the master rows. The PowerCenter Server then performs the join based on the detail source data and the cache data. The number of rows the PowerCenter Server stores in the cache depends on the
partitioning scheme, the data in the master source, and whether or not you use sorted input. After the memory caches are built, the PowerCenter Server reads the rows from the detail source and performs the joins. The PowerCenter Server uses the index cache to test the join condition. When it finds source data and cache data that match, it retrieves row values from the data cache.
Lookup Caches
Several options can be explored when dealing with Lookup transformation caches.
- Persistent caches should be used when lookup data is not expected to change often. Lookup cache files are saved after a session with a persistent cache lookup is run for the first time. These files are reused for subsequent runs, bypassing the querying of the database for the lookup. If the lookup table changes, you must be sure to set the Recache from Database option to ensure that the lookup cache files are rebuilt. You can also delete the cache files before the session run to force the session to rebuild the caches.
- Lookup caching should be enabled for relatively small tables. Refer to the Best Practice Tuning Mappings for Better Performance to determine when lookups should be cached.
- When the Lookup transformation is not configured for caching, the PowerCenter Server queries the lookup table for each input row. The result of the lookup query and processing is the same whether or not the lookup table is cached; the only difference is that the PowerCenter Server queries the lookup table instead of the lookup cache. Using a lookup cache can usually increase session performance.
- As with the Joiner transformation, the PowerCenter Server aligns all data for lookup caches on an eight-byte boundary, which helps increase the performance of the lookup.
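The performance difference between cached and uncached lookups comes down to one query versus one query per source row. This sketch against a hypothetical `products` table shows both access patterns side by side:

```python
import sqlite3

def cached_lookup(conn, product_ids):
    # Build the lookup cache once -- the way a cached Lookup
    # transformation loads the table into memory -- then probe the
    # in-memory dict for every source row.
    cache = dict(conn.execute("SELECT product_id, name FROM products"))
    return [cache.get(pid) for pid in product_ids]

def uncached_lookup(conn, product_ids):
    # Issue one query per source row, the way an uncached lookup
    # round-trips to the database for every input row.
    results = []
    for pid in product_ids:
        row = conn.execute(
            "SELECT name FROM products WHERE product_id = ?", (pid,)
        ).fetchone()
        results.append(row[0] if row else None)
    return results
```

Both functions return the same answers; the cached version trades memory (the whole table in the dict) for the elimination of per-row round trips, which is exactly the trade-off the bullets above describe.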
You can also configure DTM buffer size and the default buffer block size in the session properties. When the PowerCenter Server initializes a session, it allocates blocks of
memory to hold source and target data. Sessions that use a large number of sources and targets may require additional memory blocks. To configure these settings, first determine the number of memory blocks the PowerCenter Server requires to initialize the session. Then you can calculate the buffer size and/or the buffer block size based on the default settings, to create the required number of session blocks. If there are XML sources or targets in the mappings, use the number of groups in the XML source or target in the total calculation for the total number of sources and targets.
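The block arithmetic above can be sketched as follows. The rule of thumb that each source and target needs two buffer blocks, the 64KB default block size, and the 0.9 usable-fraction divisor are assumptions to verify against your PowerCenter version's Workflow Administration Guide.

```python
def required_buffer_blocks(n_sources, n_targets):
    # Assumed rule of thumb: each source and each target connection
    # needs two buffer blocks. For XML sources/targets, count the
    # number of groups rather than one per object.
    return (n_sources + n_targets) * 2

def min_dtm_buffer_size(n_sources, n_targets, block_size=64 * 1024):
    # Assume only ~90% of the DTM buffer is available as buffer
    # blocks, hence the 0.9 divisor.
    blocks = required_buffer_blocks(n_sources, n_targets)
    return int(blocks * block_size / 0.9)
```

For example, a session with two sources and three targets would need ten blocks, so under these assumptions the DTM buffer size should be at least about 712KB at the default 64KB block size.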
To achieve better performance, you can create a workflow that runs several sessions in parallel on one PowerCenter Server. This technique should only be employed on servers with multiple CPUs available.
Partitioning Sessions
Performance can be improved by processing data in parallel in a single session by creating multiple partitions of the pipeline. If you have PowerCenter partitioning available, you can increase the number of partitions in a pipeline to improve session performance. Increasing the number of partitions allows the PowerCenter Server to create multiple connections to sources and process partitions of source data concurrently. When you create or edit a session, you can change the partitioning information for each pipeline in a mapping. If the mapping contains multiple pipelines, you can specify multiple partitions in some pipelines and single partitions in others. Keep the following attributes in mind when specifying partitioning information for a pipeline:
- Location of partition points. The PowerCenter Server sets partition points at several transformations in a pipeline by default. If you have PowerCenter partitioning available, you can define additional partition points. Select those transformations where you think redistributing the rows in a different way is likely to increase performance considerably.
- Number of partitions. By default, the PowerCenter Server sets the number of partitions to one. You can generally define up to 64 partitions at any partition point. When you increase the number of partitions, you increase the number of processing threads, which can improve session performance. Increasing the number of partitions or partition points also increases the load on the server. If the server has ample CPU bandwidth, processing rows of data concurrently can increase session performance. However, if you create a large number of partitions or partition points in a session that processes large amounts of data, you can overload the system. You can also overload the source and target systems, so consider that as well.
- Partition types. The partition type determines how the PowerCenter Server redistributes data across partition points. The Workflow Manager allows you to specify the following partition types:
  1. Round-robin partitioning. PowerCenter distributes rows of data evenly to all partitions. Each partition processes approximately the same number of rows. In a pipeline that reads data from file sources of different sizes, you can use round-robin partitioning to ensure that each
partition receives approximately the same number of rows.
  2. Hash partitioning. The PowerCenter Server uses a hash function to group rows of data among partitions based on a partition key. There are two types of hash partitioning:
     - Hash auto-keys. The PowerCenter Server uses all grouped or sorted ports as a compound partition key. You can use hash auto-keys partitioning at or before Rank, Sorter, and unsorted Aggregator transformations to ensure that rows are grouped properly before they enter these transformations.
     - Hash user keys. The PowerCenter Server uses a hash function to group rows of data among partitions based on a user-defined partition key. You choose the ports that define the partition key.
  3. Key range partitioning. The PowerCenter Server distributes rows of data based on a port or set of ports that you specify as the partition key. For each port, you define a range of values. The PowerCenter Server uses the key and ranges to send rows to the appropriate partition. Choose key range partitioning where the sources or targets in the pipeline are partitioned by key range.
  4. Pass-through partitioning. The PowerCenter Server processes data without redistributing rows among partitions. Therefore, all rows in a single partition stay in that partition after crossing a pass-through partition point.
  5. Database partitioning. You can optimize session performance by using the database partitioning partition type instead of the pass-through partition type for IBM DB2 targets.

If you find that your system is under-utilized after you have tuned the application, databases, and system for maximum single-partition performance, you can reconfigure the session to use two or more partitions so that it utilizes more of the hardware. Use the following tips when you add partitions to a session:
- Add one partition at a time. To best monitor performance, add one partition at a time, and note your session settings before you add each partition.
- Set DTM buffer memory. For a session with n partitions, this value should be at least n times the value for the session with one partition.
- Set cached values for the Sequence Generator. For a session with n partitions, there should be no need to use the Number of Cached Values property of the Sequence Generator transformation. If you must set this value to a value greater than zero, make sure it is at least n times the original value for the session with one partition.
- Partition the source data evenly. Configure each partition to extract the same number of rows, or redistribute the data among partitions early using a round-robin partition point. This is a good way to prevent hammering of the source system. You could have a session with multiple partitions where one partition returns all the data and the override SQL in the other partitions is set to return zero rows (WHERE 1 = 2 in the WHERE clause prevents any rows from being returned). Some source systems react better to multiple concurrent SQL queries; others prefer smaller numbers of queries.
- Monitor the system while running the session. If there are CPU cycles available (twenty percent or more idle time), then performance may improve for this session by adding a partition.
- Monitor the system after adding a partition. If the CPU utilization does not go up, the wait for I/O time goes up, or the total data transformation rate goes down, then there is probably a hardware or software bottleneck. If the wait for I/O time goes up a significant amount, check the system for hardware bottlenecks. Otherwise, check the database configuration.
- Tune databases and system. Make sure that your databases are tuned properly for parallel ETL and that your system has no bottlenecks.
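To make the partition types concrete, here is a small PowerCenter-independent sketch of how round-robin, hash user-keys, and key-range schemes assign rows to partitions. The functions and the md5-based hash are illustrative assumptions, not PowerCenter's internal algorithms.

```python
import hashlib

def round_robin(rows, n):
    # Deal rows out evenly, one partition at a time.
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def hash_key(rows, n, key):
    # Hash user-keys style: rows with the same key value always land
    # in the same partition, which is what grouping transformations
    # (Rank, Sorter, unsorted Aggregator) need.
    parts = [[] for _ in range(n)]
    for row in rows:
        h = int(hashlib.md5(str(row[key]).encode()).hexdigest(), 16)
        parts[h % n].append(row)
    return parts

def key_range(rows, boundaries, key):
    # boundaries: ascending upper bounds, one per partition except the
    # last partition, which takes everything at or above the top bound.
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        for i, b in enumerate(boundaries):
            if row[key] < b:
                parts[i].append(row)
                break
        else:
            parts[-1].append(row)
    return parts
```

Round-robin balances row counts; hash keeps related rows together at the cost of possible skew; key range matches the physical partitioning of the source or target.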
The Decimal datatype is a numeric datatype with a maximum precision of 28. To use a high-precision Decimal datatype in a session, you must configure it so that the PowerCenter Server recognizes this datatype by selecting Enable High Precision in the session property sheet. However, since reading and manipulating high-precision datatypes can slow the PowerCenter Server down, session performance may be improved by disabling decimal arithmetic. When you disable high precision, the PowerCenter Server reverts to using a datatype of Double.
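Python's decimal module illustrates the trade-off described above: at 28 digits of decimal precision an increment is exact, while a Double (about 15-16 significant digits) silently loses it.

```python
from decimal import Decimal, getcontext

# High precision enabled: Decimal carries up to 28 significant
# digits, so adding 1 to a 28-digit number is exact.
getcontext().prec = 28
exact = Decimal("1234567890123456789012345678") + Decimal(1)

# High precision disabled: the value falls back to a Double, and the
# same addition is lost in floating-point rounding.
approx = float("1234567890123456789012345678") + 1.0
```

This is the cost you avoid (and the accuracy you give up) when you disable high precision for performance.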
Verbose initialization logs initialization details in addition to normal tracing, including the names of index and data files used and detailed transformation statistics. Verbose data logs each row that passes into the mapping. It also notes where the PowerCenter Server truncates string data to fit the precision of a column, and provides detailed transformation statistics. When you configure the tracing level to verbose data, the PowerCenter Server writes row data for all rows in a block when it processes a transformation.
However, the verbose initialization and verbose data logging options significantly affect the session performance. Do not use Verbose tracing options except when testing sessions. Always remember to switch tracing back to Normal after the testing is complete. The session tracing level overrides any transformation-specific tracing levels within the mapping. Informatica does not recommend reducing error tracing as a long-term response to high levels of transformation errors. Because there are only a handful of
reasons why transformation errors occur, it makes sense to fix and prevent any recurring transformation errors. PowerCenter uses the mapping tracing level when the session tracing level is set to none.
Pushdown Optimization
You can push transformation logic to the source or target database using pushdown optimization. The amount of work you can push to the database depends on the pushdown optimization configuration, the transformation logic, and the mapping and session configuration. When you run a session configured for pushdown optimization, the Integration Service analyzes the mapping and writes one or more SQL statements based on the mapping transformation logic. The Integration Service analyzes the transformation logic, mapping, and session configuration to determine the transformation logic it can push to the database. At run time, the Integration Service executes any SQL statement generated against the source or target tables, and it processes any transformation logic that it cannot push to the database. Use the Pushdown Optimization Viewer to preview the SQL statements and mapping logic that the Integration Service can push to the source or target database. You can also use the Pushdown Optimization Viewer to view the messages related to Pushdown Optimization.
Integration Service analyzes the mapping from the target to the source or until it reaches an upstream transformation it cannot push to the database. It generates an INSERT, DELETE, or UPDATE statement based on the transformation logic for each transformation it can push to the database, starting with the first transformation in the pipeline it can push to the database. The Integration Service processes the transformation logic up to the point that it can push the transformation logic to the target database. Then, it executes the generated SQL.
A long transaction uses more database resources. A long transaction locks the database for longer periods of time, and thereby reduces the database concurrency and increases the likelihood of deadlock. A long transaction can increase the likelihood that an unexpected event may occur.
The Rank transformation cannot be pushed to the database. If you configure the session for full pushdown optimization, the Integration Service pushes the Source Qualifier transformation and the Aggregator transformation to the source. It pushes the Expression transformation and target to the target database, and it processes the Rank transformation. The Integration Service does not fail the session if it can push only part of the transformation logic to the database and the session is configured for full optimization.
Using a Grid
You can use a grid to increase session and workflow performance. A grid is an alias assigned to a group of nodes that allows you to automate the distribution of workflows and sessions across nodes.
When you use a grid, the Integration Service distributes workflow tasks and session threads across multiple nodes. Running workflows and sessions on the nodes of a grid provides the following performance gains:
- Balances the Integration Service workload.
- Processes concurrent sessions faster.
- Processes partitions faster.
When you run a session on a grid, you improve scalability and performance by distributing session threads to multiple DTM processes running on nodes in the grid. To run a workflow or session on a grid, you assign resources to nodes, create and configure the grid, and configure the Integration Service to run on a grid.
On Node 1, the master service process runs workflow tasks. It also starts a
temporary preparer DTM process, which becomes the master DTM process. The Load Balancer dispatches the Command task and session threads to nodes in the grid.
- On Node 2, the worker service process runs the Command task and starts the worker DTM processes that run the session threads.
- On Node 3, the worker service process starts the worker DTM processes that run the session threads.
For information about configuring and managing a grid, refer to the PowerCenter Administrator Guide. For information about how the DTM distributes session threads into partition groups, see "Running Workflows and Sessions on a Grid" in the Workflow Administration Guide.
Description
SQL Queries Performing Data Extractions
Optimizing SQL queries is perhaps the most complex portion of performance tuning. When tuning SQL, the developer must look at the type of execution being forced by hints, the execution plan, the indexes on the query tables, the logic of the SQL statement itself, and the SQL syntax. The following paragraphs discuss each of these areas in more detail.
Surmounting the Single SQL Statement Limitation in Oracle or DB2: In-line Views
In source qualifiers and lookup objects, you are limited to a single SQL statement. There are several ways to get around this limitation.

You can create views in the database and use them as you would tables, either as source tables or in the FROM clause of the SELECT statement. This can simplify the SQL and make it easier to understand, but it also makes it harder to maintain: the logic is now in two places, in an Informatica mapping and in a database view.

You can use in-line views, which are SELECT statements in the FROM or WHERE clause. These can help focus the query to a subset of data in the table and work more efficiently than a traditional join. Here is an example of an in-line view in the FROM clause:

SELECT N.DOSE_REGIMEN_TEXT AS DOSE_REGIMEN_TEXT,
       N.DOSE_REGIMEN_COMMENT AS DOSE_REGIMEN_COMMENT,
       N.DOSE_VEHICLE_BATCH_NUMBER AS DOSE_VEHICLE_BATCH_NUMBER,
       N.DOSE_REGIMEN_ID AS DOSE_REGIMEN_ID
FROM DOSE_REGIMEN N,
     (SELECT DISTINCT R.DOSE_REGIMEN_ID AS DOSE_REGIMEN_ID
      FROM EXPERIMENT_PARAMETER R, NEW_GROUP_TMP TMP
      WHERE R.EXPERIMENT_PARAMETERS_ID = TMP.EXPERIMENT_PARAMETERS_ID
        AND R.SCREEN_PROTOCOL_ID = TMP.BDS_PROTOCOL_ID) X
WHERE N.DOSE_REGIMEN_ID = X.DOSE_REGIMEN_ID
ORDER BY N.DOSE_REGIMEN_ID
Surmounting the Single SQL Statement Limitation in DB2: Using the Common Table Expression temp tables and the WITH Clause
The Common Table Expression (CTE) stores data in temp tables during the execution of the SQL statement. The WITH clause lets you assign a name to a CTE block. You can then reference the CTE block in multiple places in the query by specifying the query name. For example:

WITH maxseq AS
  (SELECT MAX(seq_no) AS seq_no
   FROM data_load_log
   WHERE load_status = 'P')
SELECT DISTINCT bio.experiment_group_id, bio.database_site_code
FROM bio_result bio, data_load_log log
WHERE bio.update_date BETWEEN log.start_time AND log.end_time
  AND COALESCE(bio.species_type_code, 'X') IN ('mice', 'rats', 'X')
  AND log.seq_no = maxseq.seq_no

Here is another example of a WITH clause, this time using recursive SQL:

WITH PERSON_TEMP (PERSON_ID, NAME, PARENT_ID) AS
  (SELECT PERSON_ID, NAME, PARENT_ID
   FROM PARENT_CHILD
   WHERE NAME IN ('FRED', 'SALLY', 'JIM')
   UNION ALL
   SELECT C.PERSON_ID, C.NAME, C.PARENT_ID
   FROM PARENT_CHILD C, PERSON_TEMP RECURS
   WHERE C.PERSON_ID = RECURS.PARENT_ID
   AND LEVEL < 5)
SELECT * FROM PERSON_TEMP

The PARENT_ID in any particular row refers to the PERSON_ID of that person's parent, so the recursive step walks from each seed person up through the parents. Because every person has two parents, the recursion can fan out; the LEVEL clause prevents infinite recursion.
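SQLite's WITH RECURSIVE can be used to try the same pattern locally. This sketch uses a hypothetical three-row PARENT_CHILD table and an explicit lvl counter column in place of the LEVEL pseudo-column, which portable SQL dialects do not provide:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parent_child (person_id INTEGER, name TEXT, parent_id INTEGER);
INSERT INTO parent_child VALUES
  (1, 'GRANDMA', NULL),
  (2, 'FRED',    1),
  (3, 'SALLY',   2);
""")

# Recursive CTE: seed with one person, then repeatedly join the temp
# result back to the table to fetch each row's parent. The lvl
# column caps the recursion depth, like the LEVEL clause above.
rows = conn.execute("""
WITH RECURSIVE person_temp (person_id, name, parent_id, lvl) AS (
  SELECT person_id, name, parent_id, 1
  FROM parent_child WHERE name = 'SALLY'
  UNION ALL
  SELECT c.person_id, c.name, c.parent_id, r.lvl + 1
  FROM parent_child c JOIN person_temp r ON c.person_id = r.parent_id
  WHERE r.lvl < 5
)
SELECT name FROM person_temp ORDER BY lvl
""").fetchall()
```

The query walks from SALLY up to FRED and then GRANDMA, stopping when a row has no parent or the depth cap is reached.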
SELECT EMPLOYEE, FNAME, LNAME,
  CASE
    WHEN SALARY < 10000 THEN 'NEED RAISE'
    WHEN SALARY > 1000000 THEN 'OVERPAID'
    ELSE 'THE REST OF US'
  END AS COMMENT
FROM EMPLOYEE
DB2 provides system variables, called special registers, such as CURRENT DATE, CURRENT TIME, and CURRENT TIMESTAMP. Here is an example for DB2:

SELECT FNAME, LNAME, CURRENT DATE AS TODAY
FROM EMPLOYEE
| Hint | Description | When to Use |
|------|-------------|-------------|
| ALL_ROWS | The database engine creates an execution plan that optimizes for throughput. Favors full table scans; the optimizer favors sort/merge joins. | Joining all or most of the rows of large tables |
| FIRST_ROWS | The database engine creates an execution plan that optimizes for response time, returning the first row of data as quickly as possible. Favors index lookups; the optimizer favors nested loops. | Joining small sub-sets of data where an index is available |
| CHOOSE | The database engine creates an execution plan that uses cost-based execution if statistics have been run on the tables. If statistics have not been run, the engine uses rule-based execution. If statistics have been run on empty tables, the engine still uses cost-based execution, but performance is extremely poor. | |
| RULE | The database engine creates an execution plan based on a fixed set of rules. | |
| USE_NL | Use nested loop joins. | |
| USE_MERGE | Use sort merge joins. | |
| HASH | The database engine performs a hash scan of the table. This hint is ignored if the table is not clustered. | |
| USE_CONCAT | Rewrites combined OR conditions in the WHERE clause as a UNION ALL of simpler queries. | |
The syntax for using a hint in a SQL statement is as follows:

SELECT /*+ FIRST_ROWS */ empno, ename
FROM emp;
- The number of executions
- The elapsed time of the statement execution
- The CPU time used to execute the statement
The SQL Trace Utility adds value because it definitively shows the statements that are using the most resources, and can immediately show the change in resource consumption after the statement has been tuned and a new explain plan has been run.
Using Indexes
The explain plan also shows whether indexes are being used to facilitate execution. The data warehouse team should compare the indexes being used to those available. If necessary, the administrative staff should identify new indexes that are needed to improve execution and ask the database administration team to add them to the appropriate tables. Once implemented, the explain plan should be executed again to ensure that the indexes are being used. If an index is not being used, it is possible to force the query to use it by using an access method hint, as described earlier.
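The check-the-plan workflow described above can be rehearsed with any database that exposes its plan. Here is a sketch using SQLite's EXPLAIN QUERY PLAN (the idea carries over to Oracle's EXPLAIN PLAN); the table and index names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE managers (dept_id INTEGER, name TEXT)")

def plan(sql):
    # Return the plan text so we can confirm whether an index is used.
    # EXPLAIN QUERY PLAN rows carry the description in column 3.
    return " ".join(r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Before indexing: the query must scan the whole table.
before = plan("SELECT * FROM managers WHERE dept_id = 7")

# After indexing the predicate column, the plan switches to the index.
conn.execute("CREATE INDEX ix_mgr_dept ON managers (dept_id)")
after = plan("SELECT * FROM managers WHERE dept_id = 7")
```

Re-running the plan after each index change, as the text recommends, is the only way to confirm the optimizer actually picked the index up.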
EXISTS clauses are almost always used in correlated sub-queries. They are executed for each row of the parent query and cannot take advantage of indexes, while an IN clause is executed once, does use indexes, and may be translated to a JOIN by the optimizer. If possible, replace EXISTS with an IN clause. For example:

-- Faster
SELECT * FROM DEPARTMENTS
WHERE DEPT_ID IN (SELECT DISTINCT DEPT_ID FROM MANAGERS)

-- Slower
SELECT * FROM DEPARTMENTS D
WHERE EXISTS (SELECT * FROM MANAGERS M WHERE M.DEPT_ID = D.DEPT_ID)

| Situation | EXISTS | IN |
|-----------|--------|----|
| Index supports the sub-query | Yes | Yes |
| No index to support the sub-query | No; table scans per parent row | Yes; table scan once |
| Sub-query returns many rows | Probably not | Yes |
| Sub-query returns one or a few rows | Yes | Yes |
| Most of the sub-query rows are eliminated by the parent query | No | Yes |
| Index in parent matches sub-query columns | Possibly not, since EXISTS cannot use the index | Yes; IN uses the index |
- Where possible, use the EXISTS clause instead of the INTERSECT clause. Simply modifying the query in this way can improve performance by more than 100 percent.
- Where possible, limit the use of outer joins on tables. Remove the outer joins from the query and create lookup objects within the mapping to fill in the optional information.
Anti-join with NOT IN, NOT EXISTS, MINUS or EXCEPT, OUTER JOIN
- Avoid use of the NOT IN clause. This clause causes the database engine to perform a full table scan. While this may not be a problem on small tables, it can become a performance drain on large tables.

  SELECT NAME_ID FROM CUSTOMERS
  WHERE NAME_ID NOT IN (SELECT NAME_ID FROM EMPLOYEES)
- Avoid use of the NOT EXISTS clause. This clause is better than NOT IN, but may still cause a full table scan.

  SELECT C.NAME_ID FROM CUSTOMERS C
  WHERE NOT EXISTS (SELECT * FROM EMPLOYEES E WHERE C.NAME_ID = E.NAME_ID)
- In Oracle, use the MINUS operator to do the anti-join, if possible. In DB2, use the equivalent EXCEPT operator.

  SELECT C.NAME_ID FROM CUSTOMERS C
  MINUS
  SELECT E.NAME_ID FROM EMPLOYEES E
- Also consider using outer joins with IS NULL conditions for anti-joins.

  SELECT C.NAME_ID
  FROM CUSTOMERS C, EMPLOYEES E
  WHERE C.NAME_ID = E.NAME_ID (+)
    AND E.NAME_ID IS NULL
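The four anti-join formulations return the same rows when the join keys are NULL-free, which is easy to verify with a small in-memory database. The CUSTOMERS/EMPLOYEES tables below are hypothetical; note that NOT IN behaves differently if the sub-query can return NULLs (it then returns no rows), which is one more reason to prefer the alternatives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (name_id INTEGER);
CREATE TABLE employees (name_id INTEGER);
INSERT INTO customers VALUES (1), (2), (3), (4);
INSERT INTO employees VALUES (2), (4);
""")

# Four ways to ask: which customers are not employees?
# (SQLite has EXCEPT, standing in for Oracle's MINUS, and uses
# LEFT JOIN syntax rather than the (+) outer-join notation.)
queries = {
    "not_in": "SELECT name_id FROM customers "
              "WHERE name_id NOT IN (SELECT name_id FROM employees)",
    "not_exists": "SELECT c.name_id FROM customers c WHERE NOT EXISTS "
                  "(SELECT 1 FROM employees e WHERE e.name_id = c.name_id)",
    "except": "SELECT name_id FROM customers "
              "EXCEPT SELECT name_id FROM employees",
    "outer_join": "SELECT c.name_id FROM customers c "
                  "LEFT JOIN employees e ON c.name_id = e.name_id "
                  "WHERE e.name_id IS NULL",
}
results = {k: sorted(r[0] for r in conn.execute(q))
           for k, q in queries.items()}
```

Equivalent results do not mean equivalent plans; as the bullets above note, the database may table-scan for NOT IN and NOT EXISTS while using a more efficient strategy for EXCEPT/MINUS or the outer join.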
Review the database SQL manuals to determine the cost benefits or liabilities of certain SQL clauses as they may change based on the database engine.
- In lookups from large tables, try to limit the rows returned to the set of rows matching the set in the source qualifier. Add the WHERE clause conditions to the lookup. For example, if the source qualifier selects sales orders entered into the system since the previous load of the database, then, in the product information lookup, only select the products that match the distinct product IDs in the incremental sales orders.
- Avoid range lookups. A range lookup is a SELECT that uses a BETWEEN in the WHERE clause, with limits retrieved from a table. Here is an example:

  SELECT R.BATCH_TRACKING_NO,
         R.SUPPLIER_DESC,
         R.SUPPLIER_REG_NO,
         R.SUPPLIER_REF_CODE,
         R.GCW_LOAD_DATE
  FROM CDS_SUPPLIER R,
       (SELECT LOAD_DATE_PREV AS LOAD_DATE_PREV,
               L.LOAD_DATE AS LOAD_DATE
        FROM ETL_AUDIT_LOG L
        WHERE L.LOAD_DATE_PREV IN
          (SELECT MAX(Y.LOAD_DATE_PREV) AS LOAD_DATE_PREV
           FROM ETL_AUDIT_LOG Y)) Z
  WHERE R.LOAD_DATE BETWEEN Z.LOAD_DATE_PREV AND Z.LOAD_DATE

  The work-around is to use an in-line view to get the lower range in the FROM clause and join it to the main query, which limits the higher date range in its WHERE clause. Use an ORDER BY on the lower limit in the in-line view. This is likely to reduce the throughput time from hours to seconds. Here is the improved SQL:

  SELECT R.BATCH_TRACKING_NO,
         R.SUPPLIER_DESC,
         R.SUPPLIER_REG_NO,
         R.SUPPLIER_REF_CODE,
         R.LOAD_DATE
  FROM
    /* In-line view for lower limit */
    (SELECT R1.BATCH_TRACKING_NO,
            R1.SUPPLIER_DESC,
            R1.SUPPLIER_REG_NO,
            R1.SUPPLIER_REF_CODE,
            R1.LOAD_DATE
     FROM CDS_SUPPLIER R1,
          (SELECT MAX(Y.LOAD_DATE_PREV) AS LOAD_DATE_PREV
           FROM ETL_AUDIT_LOG Y) Z
     WHERE R1.LOAD_DATE >= Z.LOAD_DATE_PREV
     ORDER BY R1.LOAD_DATE) R,
    /* end in-line view for lower limit */
    (SELECT MAX(D.LOAD_DATE) AS LOAD_DATE
     FROM ETL_AUDIT_LOG D) A  /* upper limit */
  WHERE R.LOAD_DATE <= A.LOAD_DATE
System Resources
The PowerCenter Server uses the following system resources:
- CPU
- Load Manager shared memory
- DTM buffer memory
- Cache memory
When tuning the system, evaluate the following considerations during the implementation process.
- Determine if the network is running at an optimal speed. Recommended best practice is to minimize the number of network hops between the PowerCenter Server and the databases.
- Use multiple PowerCenter Servers on separate systems to potentially improve session performance.
- When all character data processed by the PowerCenter Server is US-ASCII or EBCDIC, configure the PowerCenter Server for ASCII data movement mode. In ASCII mode, the PowerCenter Server uses one byte to store each character; in Unicode mode, it uses two bytes for each character, which can potentially slow session performance.
- Check hard disks on related machines. Slow disk access on source and target databases, source and target file systems, and the PowerCenter Server and repository machines can slow session performance.
- When an operating system runs out of physical memory, it starts paging to disk to free physical memory. Configure the physical memory for the PowerCenter Server machine to minimize paging to disk. Increase system memory when sessions use large cached lookups or have many partitions.
- In a multi-processor UNIX environment, the PowerCenter Server may use a large amount of system resources. Use processor binding to control processor usage by the PowerCenter Server:
  - In a Sun Solaris environment, use the psrset command to create and manage a processor set. After creating a processor set, use the pbind command to bind the PowerCenter Server to the processor set so that the processor set only runs the PowerCenter Server. For details, see the project system administrator and Sun Solaris documentation.
  - In an HP-UX environment, use the Process Resource Manager utility to control CPU usage in the system. The Process Resource Manager allocates minimum system resources and uses a maximum cap of resources. For details, see the project system administrator and HP-UX documentation.
  - In an AIX environment, use the Workload Manager in AIX 5L to manage system resources during peak demands. The Workload Manager can allocate resources and manage CPU, memory, and disk I/O bandwidth. For details, see the project system administrator and AIX documentation.
- Denormalization. The DBA can use denormalization to improve performance by eliminating the constraints and primary key to foreign key relationships, and by eliminating join tables.
- Indexes. Proper indexing can significantly improve query response time. The trade-off of heavy indexing is a degradation of the time required to load data rows into the target tables. Carefully written pre-session scripts are recommended to drop indexes before the load, rebuilding them after the load with post-session scripts.
- Constraints. Avoid constraints if possible; try to enforce integrity by incorporating that additional logic in the mappings.
- Rollback and Temporary Segments. Rollback and temporary segments are primarily used to store data for queries (temporary) and INSERTs and UPDATEs (rollback). The rollback area must be large enough to hold all the data prior to a COMMIT. Proper sizing can be crucial to ensuring successful completion of load sessions, particularly on initial loads.
- OS Priority. The priority of background processes is an often-overlooked problem that can be difficult to diagnose after the fact. DBAs must work with the system administrator to ensure all the database processes have the same priority.
- Striping. Database I/O throughput can be increased significantly by implementing either RAID 0 (striping) or RAID 5 (pooled disk sharing).
- Disk Controllers. Although expensive, striping and RAID 5 can be further enhanced by placing the disks on separate disk controllers.
INFORMATICA CONFIDENTIAL
BEST PRACTICE
599 of 702
Description
Remember that the minimum system requirements for a machine hosting the Metadata Manager console are:
- Windows operating system (2000 or NT 4.0 SP 6a)
- 400MB disk space
- 128MB RAM (256MB recommended)
- 133 MHz processor
If the system meets or exceeds the minimum requirements but an XConnect still takes an inordinately long time to run, use the following steps to improve the performance of XConnect loads from database catalogs:
- Modify the inclusion/exclusion schema list (if the schema list to be loaded is larger than the exclusion list, use exclusion).
- Carefully examine how many old objects the project needs by default. Modify the sysdate -5000 default to a smaller value to reduce the result set.
- Load only the production folders that are needed for a particular project. Run the XConnects with just one folder at a time, or select the list of folders for a particular run.
Description
Ensuring Consistent Data Source Names
To ensure the use of consistent data source names for the same data sources across the domain, the Administrator can create a single "official" set of data sources, then use the Repository Manager to export that connection information to a file. This file can then be distributed, and the connection information imported, on each client machine. Solution:
- From Repository Manager, choose Export Registry from the Tools drop-down menu.
- For all subsequent client installs, simply choose Import Registry from the Tools drop-down menu.
- While logged in as the installation user with administrator authority, use regedt32 to edit the registry.
- Under HKEY_LOCAL_MACHINE, open Software/Informatica/PowerMart Client Tools/.
- From the menu bar, select Security/Permissions, and grant read access to the users that should be permitted to use the PowerMart Client. (Note that the registry entries for both PowerMart and PowerCenter server and client tools are stored as PowerMart Server and PowerMart Client tools.)
In PowerCenter versions 6.0 to 7.1.2, the session and workflow log editor defaults to WordPad within the Workflow Monitor client tool. To choose a different editor, select Tools > Options in the Workflow Monitor, then browse for the editor that you want on the General tab. For PowerCenter versions earlier than 6.0, the editor does not default to WordPad unless wordpad.exe can be found in the path statement. Instead, a window appears the first time a session log is viewed from the PowerCenter Server Manager, prompting the user to enter the full path name of the editor to be used to view the logs. Users often set this parameter incorrectly and must access the registry to change it. Solution:
- While logged in as the installation user with administrator authority, use regedt32 to open the registry.
- Move to registry path location: HKEY_CURRENT_USER\Software\Informatica\PowerMart Client Tools\[CLIENT VERSION]\Server Manager\Session Files.
- From the menu bar, select View Tree and Data.
- Select the Log File Editor entry by double-clicking on it.
- Replace the entry with the appropriate editor entry (i.e., typically WordPad.exe or Write.exe).
- Select Registry --> Exit from the menu bar to save the entry.
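The same registry change can be expressed with the Windows reg.exe command-line tool. The sketch below only prints the command; the key path comes from the text, the editor path is an example, and [CLIENT VERSION] must be replaced with your actual client version.

```shell
# Dry run: build the equivalent Windows "reg add" command as a string.
KEY='HKEY_CURRENT_USER\Software\Informatica\PowerMart Client Tools\[CLIENT VERSION]\Server Manager\Session Files'
CMD="reg add \"$KEY\" /v \"Log File Editor\" /d \"C:\\Windows\\write.exe\" /f"
echo "$CMD"    # run this on the Windows client, not on UNIX
```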
For PowerCenter version 7.1 and above, you should set the log editor option in the Workflow Monitor. The following figure shows the Workflow Monitor Options Dialog box to use for setting the editor for workflow and session logs.
Once this is done, you can easily call another program from the PowerCenter client tools. In the following example, TOAD can be called quickly from the Repository Manager tool.
Solution:
- In the Workflow Manager tool, choose Tools > Options and go to the Miscellaneous tab.
- Click the button for either 'normal' or 'bulk', as desired.
- Click OK, then close and reopen the Workflow Manager tool.
After this, every time a session is created, the target load type for all relational targets will default to your choice.
Solution:
- To get the window correctly docked, right-click in the white space of the Navigator window and make sure that the Allow Docking option is checked. If it is checked, double-click on the title bar of the Navigator window.
- Clicking View > Navigator
- Toggling the menu bar
- Uninstalling and reinstalling the client tools
Note: If none of the above solutions resolve the problem, you may want to try the following solution using the Registry Editor. Be aware, however, that using the Registry Editor incorrectly can cause serious problems that may require reinstalling the operating system. Informatica does not guarantee that any problems caused by using Registry Editor incorrectly can be resolved. Use the Registry Editor at your own risk. Solution: Starting with PowerCenter 7.x, the settings for the client tools are in the registry. Display issues can often be resolved as follows:
- Close the client tool.
- Go to Start > Run and type "regedit".
- Go to the key HKEY_CURRENT_USER\Software\Informatica\PowerMart Client Tools\x.y.z, where x.y.z is the version and maintenance release level of the PowerCenter client.
- Open the key of the affected tool (for the Repository Manager, open Repository Manager Options).
- Export all of the Toolbars sub-folders and rename them.
- Re-open the client tool.
The PowerCenter client tools allow you to customize the look and feel of the display. Here are a few examples of what you can do.
Designer
- From the Menu bar, select Tools > Options.
- In the dialog box, choose the Format tab.
- Select the feature that you want to modify (i.e., workspace colors, caption colors, or fonts).
Changing the background workspace colors can help identify which workspace is currently open. For example, changing the Source Analyzer workspace color to green or the Target Designer workspace to purple to match their respective metadata definitions helps to identify the workspace. Alternatively, click the Select Theme button to choose a color theme, which displays background colors based on predefined themes.
Workflow Manager
You can modify the Workflow Manager using the same approach as the Designer tool. From the Menu bar, select Tools > Options and click the Format tab. Select a color theme or customize each element individually.
Workflow Monitor
You can modify the colors in the Gantt Chart view to represent the various states of a task. You can also select two colors for one task to give it a dimensional appearance; this can be helpful in
distinguishing between running tasks, succeeded tasks, etc. To modify the Gantt Chart appearance, select Tools > Options from the Menu bar, then choose Gantt Chart.
If the Visio macro security level is set to High, you cannot run the Data Stencil macros. To set the security level in Visio, select Tools > Macros > Security from the menu. On the Security Level tab, select Medium. When you start Data Stencil, Visio displays a security warning about viruses in macros; click Enable Macros to enable the macros for Data Stencil.
Description
Configuring Advanced Integration Service Properties
Use the Administration Console to configure the advanced properties, such as the character set of the Integration Service logs. To edit the advanced properties, select the Integration Service in the Navigator, and click the Properties tab > Advanced Properties > Edit. The following Advanced properties are included:
Limit on Resilience Timeouts (Optional). Maximum amount of time (in seconds) that the service holds on to resources for resilience purposes. This property places a restriction on clients that connect to the service; any resilience timeouts that exceed the limit are cut off at the limit. If the value of this property is blank, the value is derived from the domain-level settings. Valid values are between 0 and 2592000, inclusive. Default is blank.

Resilience Timeout (Optional). Period of time (in seconds) that the service tries to establish or reestablish a connection to another service. If blank, the value is derived from the domain-level settings. Valid values are between 0 and 2592000, inclusive. Default is blank.
- Ease of deployment across environments (DEV > TEST > PRD).
- Ease of switching sessions from one Integration Service to another without manually editing all the sessions to change directory paths.

All the variables are related to directory paths used by a given Integration Service.
You must specify the paths for Integration Service files for each Integration Service process. Examples of Integration Service files include run-time files, state-of-operation files, and session log files.

Each Integration Service process uses run-time files to process workflows and sessions. If you configure an Integration Service to run on a grid or on backup nodes, the run-time files must be stored in a shared location. Each node must have access to the run-time files used to process a session or workflow. This includes files such as parameter files, cache files, input files, and output files.

State-of-operation files must be accessible by all Integration Service processes. When you enable an Integration Service, it creates files to store the state of operations for the service. The state of operations includes information such as the active service requests, scheduled tasks, and completed and running processes. If the service fails, the Integration Service can restore the state and recover operations from the point of interruption. All Integration Service processes associated with an Integration Service must use the same shared location; however, each Integration Service can use a separate location.

By default, the installation program creates a set of Integration Service directories in the server\infa_shared directory. You can set the shared location for these directories by configuring the process variable $PMRootDir to point to the same location for each Integration Service process. You must specify the directory path for each type of file. You specify the following directories using service process variables:
Each registered server has its own set of variables. The list is fixed, not user-extensible.
Service Process Variable      Value
$PMRootDir                    (no default; user must insert a path)
$PMSessionLogDir              $PMRootDir/SessLogs
$PMBadFileDir                 $PMRootDir/BadFiles
$PMCacheDir                   $PMRootDir/Cache
$PMTargetFileDir              $PMRootDir/TargetFiles
$PMSourceFileDir              $PMRootDir/SourceFiles
$PMExtProcDir                 $PMRootDir/ExtProc
$PMTempDir                    $PMRootDir/Temp
$PMSuccessEmailUser           (no default; user must insert a value)
$PMFailureEmailUser           (no default; user must insert a value)
$PMSessionLogCount            0
$PMSessionErrorThreshold      0
$PMWorkflowLogCount           0
$PMWorkflowLogDir             $PMRootDir/WorkflowLogs
$PMLookupFileDir              $PMRootDir/LkpFiles
$PMStorageDir                 $PMRootDir/Storage
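The directory layout implied by these variables can be created once on the shared file system. In the sketch below, /tmp/infa_shared is only a stand-in for your real shared, highly available mount point.

```shell
# Sketch: lay out the shared $PMRootDir tree for all Integration Service
# processes. /tmp/infa_shared stands in for a shared file-system path.
PMROOT=/tmp/infa_shared
mkdir -p "$PMROOT/SessLogs" "$PMROOT/BadFiles" "$PMROOT/Cache" \
         "$PMROOT/TargetFiles" "$PMROOT/SourceFiles" "$PMROOT/ExtProc" \
         "$PMROOT/Temp" "$PMROOT/WorkflowLogs" "$PMROOT/LkpFiles" \
         "$PMROOT/Storage"
ls "$PMROOT"
```

Each Integration Service process would then have $PMRootDir configured to point at this same path.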
Integration Service Custom Properties (undocumented server parameters) can be entered here as well:
1. At the bottom of the list, enter the Name and Value of the custom property.
2. Click OK.
When PowerCenter runs on a UNIX platform, it uses operating system semaphores to keep processes synchronized and to prevent collisions when accessing shared data structures. You may need to increase these semaphore settings before installing the server. Seven semaphores are required to run a session. Most installations require between 64 and 128 available semaphores, depending on the number of sessions the server runs concurrently. This is in addition to any semaphores required by other software, such as database servers. The total number of available operating system semaphores is an operating system configuration parameter, with a limit per user and system. The method used to change the parameter depends on the operating system:
- HP-UX: Use sam (1M) to change the parameters.
- Solaris: Use admintool or edit /etc/system to change the parameters.
- AIX: Use smit to change the parameters.
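The sizing rule above (seven semaphores per concurrently running session) lends itself to a quick back-of-the-envelope check; the workload figure below is hypothetical.

```shell
# Rough semaphore sizing: about seven semaphores per concurrently running
# session, on top of whatever other software (e.g. database servers) uses.
CONCURRENT_SESSIONS=12                    # hypothetical peak workload
NEEDED=$((7 * CONCURRENT_SESSIONS))
echo "PowerCenter sessions need about $NEEDED semaphores"
# On Linux, the current limits (SEMMSL SEMMNS SEMOPM SEMMNI) can be read with:
#   cat /proc/sys/kernel/sem
```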
HP-UX
For HP-UX release 11i, the CDLIMIT and NOFILES parameters are not implemented. In some versions, SEMMSL is hard-coded to 500. NCALL is referred to as NCALLOUT. Use the HP System V IPC Shared-Memory Subsystem to update parameters. To change a value, perform the following steps:
1. Enter the /usr/sbin/sam command to start the System Administration Manager (SAM) program.
2. Double-click the Kernel Configuration icon.
3. Double-click the Configurable Parameters icon.
4. Double-click the parameter you want to change and enter the new value in the Formula/Value field.
5. Click OK.
6. Repeat these steps for all kernel configuration parameters that you want to change.
7. When you are finished setting all of the kernel configuration parameters, select Process New Kernel from the Action menu.
The HP-UX operating system automatically reboots after you change the values for the kernel configuration parameters.
IBM AIX
None of the listed parameters requires tuning because each is dynamically adjusted as needed by the kernel.
SUN Solaris
Keep the following points in mind when configuring and tuning the SUN Solaris platform:
1. Edit the /etc/system file and add the following variables to increase shared memory segments:
set shmsys:shminfo_shmmax=value
set shmsys:shminfo_shmmin=value
set shmsys:shminfo_shmmni=value
set shmsys:shminfo_shmseg=value
set semsys:seminfo_semmap=value
set semsys:seminfo_semmni=value
set semsys:seminfo_semmns=value
set semsys:seminfo_semmsl=value
set semsys:seminfo_semmnu=value
set semsys:seminfo_semume=value
2. Verify the shared memory value changes:
# grep shmsys /etc/system
3. Restart the system:
# init 6
SuSE Linux
The default shared memory limits (shmmax and shmall) on SuSE Linux platforms can be changed in the proc file system without a restart. For example, to allow 512MB, type the following commands:
echo 536870912 >/proc/sys/kernel/shmall    # sets shmall to 512 MB
echo 536870912 >/proc/sys/kernel/shmmax    # sets shmmax to 512 MB
You can also put these commands into a script run at startup. Also change the settings for the system memory user limits by modifying the /etc/profile file. Add lines similar to the following:
ulimit -v 512000    # set virtual (swap) memory to 512 MB
ulimit -m 512000    # set physical memory to 512 MB
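The current limits can be read back from the proc file system on any modern Linux, which is a quick way to verify the change took effect; the sysctl.conf entries shown in comments are the persistent equivalent of the 512 MB example above.

```shell
# Read the current shared-memory limits back from the proc file system.
SHMMAX=$(cat /proc/sys/kernel/shmmax)
SHMALL=$(cat /proc/sys/kernel/shmall)
echo "shmmax=$SHMMAX shmall=$SHMALL"
# For a change that survives reboots, the equivalent /etc/sysctl.conf
# entries (applied with "sysctl -p") would be:
#   kernel.shmmax = 536870912
#   kernel.shmall = 536870912
```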
Configure buffer memory and cache memory settings in the Transformation and Session Properties. When you configure buffer memory and cache memory settings, consider the overall memory usage for best performance. Enable automatic memory settings by configuring a value for the Maximum Memory Allowed for Auto Memory Attributes or the Maximum Percentage of Total Memory Allowed for Auto Memory Attributes. If the value is set to zero for either of these attributes, the Integration Service disables automatic memory settings and uses default values.
Description
Fatal run-time errors in UNIX programs usually result in the termination of the UNIX process by the operating system. Usually, when the operating system terminates a process, a "core dump" file is also created, which can be used to analyze the reason for the abnormal termination.
execution that causes the OS to terminate it and produce a core dump. Mixing incompatible versions of UNIX, vendor, and database libraries can often trigger behavior that causes unexpected core dumps. For example, using an ODBC driver library from one vendor with an ODBC driver manager from another vendor may result in a core dump if the libraries are not compatible. A similar situation can occur if a process uses libraries from different versions of a database client, such as a mixed installation of Oracle 8i and 9i. An installation like this should not exist, but if it does, core dumps are often the result.
The second step is to log in with the same UNIX user ID that started the process that crashed. This sets the debugger's environment to be the same as that of the process at startup time. The third step is to go to the directory where the program is installed and run the "file" command on the core file, which returns the name of the process that created it:
file <fullpathtocorefile>/core.ddmmyyhhmi
Core files can be generated by the PowerCenter executables (i.e., pmserver, infaservices, and pmdtm) as well as by other UNIX commands executed by the Integration Service, typically from command tasks and pre- or post-session commands. If a PowerCenter process is terminated by the OS and a core is generated, the session or server log typically shows "Process terminating on Signal/Exception" as its last entry.
Pmstack also supports a -p option, which can be used to extract a stack trace from a running process. This is sometimes useful when a process appears to be hung, to determine what it is doing.
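The triage steps can be sketched as a dry run; the core-file path and the PID below are examples only, and pmstack's exact options should be confirmed against your PowerCenter release.

```shell
# Dry run of the core-file triage described above (path and PID are examples).
CORE=/opt/informatica/server/core.180505
FILE_CMD="file $CORE"             # identifies the executable that dumped core
PMSTACK_CMD="pmstack -p 12345"    # stack trace from a running (hung?) process
echo "$FILE_CMD"
echo "$PMSTACK_CMD"
```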
Description
The domain architecture allows PowerCenter to provide a service-oriented architecture in which you can specify which services run on which node or physical machine from one central location. The components in the domain are aware of each other's presence and continually monitor one another via heartbeats. The various services within the domain can move from one physical machine to another without any interruption to the PowerCenter environment. As long as clients can connect to the domain, the domain can route their requests to the appropriate physical machine. From a monitoring perspective, the domain provides the ability to monitor all services in the domain from a central location. You no longer have to log into and ping multiple machines in a robust PowerCenter environment; instead, a single screen displays the current availability state of all services. For more details on the individual components and detailed configuration of a domain, refer to the PowerCenter Administrator Guide.
Master Gateway. The node designated as the master gateway (or domain controller) is the main entry point to the domain, and should be the most reliable and available machine in the architecture. It is the first point of entry for all clients wishing to connect to one of the PowerCenter services; if the master gateway is unavailable, the entire domain is unavailable. You may designate more than one node to run the gateway service. One gateway is always the master or primary, but by running the gateway service on more than one node in a multi-node configuration, your domain can continue to function if the master gateway is no longer available. In a high-availability environment, it is critical to have one or more nodes running the gateway service as a backup to the master gateway.
Shared File System. The PowerCenter domain architecture provides centralized logging capability and, when high availability is enabled, a highly available environment with automatic fail-over of workflows and sessions. To achieve this, the base PowerCenter server file directories must reside on a file system that is accessible by all nodes in the domain. When PowerCenter is initially installed, this directory is called infa_shared and is located under the server directory of the PowerCenter installation. It includes logs and checkpoint information that is shared among nodes of the domain. Ideally, this file system is both high-performance and highly available.

Domain Metadata. As of PowerCenter 8, a store of metadata holds all of the configuration for the domain. This domain repository is separate from the one or more PowerCenter repositories in your domain; it is a handful of tables that replace the older pmserver.cfg, pmrep.cfg, and other PowerCenter configuration files. Upon installation, you will be prompted for the RDBMS location for the domain repository. This information should be treated like a PowerCenter repository, with regularly-scheduled backups and a disaster recovery plan; without this metadata, your domain is unable to function. The RDBMS user provided to PowerCenter requires permissions to create and drop tables, as well as to insert, update, and delete records. Ideally, if you are going to group multiple independent nodes within this domain, the domain configuration database should reside on a separate and independent server, to eliminate the single point of failure that would exist if the node hosting the domain configuration database fails.
Domain Architecture
Just as in other PowerCenter architectures, the premise of the architecture is to maintain flexibility and scalability across the environment. There is no single best way to deploy the architecture; rather, each environment should be assessed for external factors, and PowerCenter then configured to function best in that particular environment. The advantage of the service-oriented architecture is that components in the architecture (i.e., repository services, integration services, and others) can be moved among nodes without needing to make changes to the mappings or workflows. In this way, it is very simple to alter architecture components if you find a suboptimal configuration and want to change it in your environment. The key here is that you are not tied to any choices you make at installation time, and you have the flexibility to change your architecture as your business needs change.
TIP
While the architecture is very flexible and provides easy movement of services throughout the environment, one area to consider carefully at installation time is the naming of the domain and its nodes. These names are somewhat troublesome to change later because of the
nature of their criticality to the domain. It is not recommended to embed server IP addresses or names in the domain name or the node names; you never know when you may need to move to new hardware or move nodes to new locations. For example, instead of naming your domain PowerCenter_11.5.8.20, consider naming it Enterprise_Dev_Test. This makes it much more intuitive to understand what domain you are attaching to, and if you ever decide to move the main gateway to another server, you don't need to change the domain or node name. While these names can be changed, the change is not easy and requires using command-line programs to alter the domain metadata.
Production environment. In most implementations, these are separate PowerCenter repositories and associated servers. It is possible to define a single domain to include one or more of these environments. However, there are a few points to consider:
- If the domain gateway is unavailable for any reason, the entire domain is inaccessible. Keep in mind that if you place your development, quality assurance, and production services in a single domain, development and quality assurance work can affect your production environment. If you decide to restart the domain in development for some reason, you are effectively restarting development, quality assurance, and production at the same time. And if a failure affects the domain in production, you have also brought down your development environment and have no place to test a fix, since your entire environment is compromised.
- For the domain you should have a common, shared, high-performance file system for the centralized logging and checkpoint files. If all three environments share one domain, you are mixing production logs with development logs and other files on the same physical disk, and your production backups and disaster recovery files will contain more than just production information.
- For a future upgrade, it is very likely that you will need to upgrade all components of the domain at once to the new version of PowerCenter. If you have placed development, quality assurance, and production in the same domain, you may need to upgrade all of them at once. This is an undesirable situation in most data integration environments.
For these reasons, Informatica generally recommends having at least two separate domains in any environment:
Some architects choose to deploy a separate domain for each environment to further isolate them and ensure that changes in the development environment cause no disruptions in the Quality Assurance environment. The trade-off is an additional administration console to log into and maintain. Keep in mind that while you may have separate domains with separate domain metadata repositories, there is no need to migrate any metadata between the development, Quality Assurance, and production domain repositories. The domain metadata repositories hold information on the physical location and connectivity of the components, so it makes no sense to migrate it between environments. You do need to provide separate database locations for each, but there are no migration needs for the data within; each one is specific to the environment it services.
Administration
The domain administrator has permission to start and shut down all services within the domain, as well as the ability to create other users and delegate roles and responsibilities to them. Keep in mind that if the domain is shut down, it has to be restarted via the command line or the host operating system GUI. PowerCenter's High Availability option provides the ability to create multiple gateway nodes in a domain, such that if the Master Gateway Node fails, another can assume its responsibilities, including authentication, logging, and service management.
Provisions of the Public Company Accounting Reform and Investor Protection Act of 2002 (also known as SOX, SarbOx, and Sarbanes-Oxley) have been widely interpreted to place many restrictions on the ability of persons in development roles to have direct write access to production systems; consequently, you may have to plan your administration roles accordingly. Your organization may simply need to use different folders to group objects in Development, Quality Assurance, and Production roles with separate administrators. In some instances, systems may need to be entirely separate, with different domains for the Development, Quality Assurance, and Production systems. Sharing of metadata remains simple between separate domains, with PowerCenter's ability to link domains and copy data between linked domains. For data migration projects, it is recommended to establish a standardized architecture that includes a set of folders, connections, and developer access in accordance with the needs of the project. Typically this includes folders for:
- Acquiring data
- Converting data to match the target system
- The final load to the target application
- Establishing reference data structures
Maintenance
As part of your regular metadata backups, you should schedule a recurring backup of the PowerCenter domain configuration database. This can be accomplished through PowerCenter by using the infasetup command, which is explained further in the Command Line Reference. You should also add the schema to your normal RDBMS backup schedule, providing two reliable backup methods for disaster recovery purposes.
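A scheduled backup might look like the dry run below. All connection values are placeholders, and the option names should be verified against the Command Line Reference for your PowerCenter version before use.

```shell
# Dry run: a nightly domain-configuration backup via infasetup.
# Every value here (host, port, user, password, paths) is a placeholder.
BACKUP_FILE="/backup/domain_config.bak"
CMD="infasetup BackupDomain -da dbhost:1521 -du infadba -dp secret -dt Oracle -ds domaindb -bf $BACKUP_FILE"
echo "$CMD"
# e.g. driven from cron:  0 2 * * * /opt/informatica/server/backup_domain.sh
```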
Licensing
As part of PowerCenter's Service-Oriented Architecture (SOA), licensing for PowerCenter services has been centralized within the domain. You receive your license key file(s) from Informatica at the same time the download location for your software is provided. Adding license object(s) and assigning individual PowerCenter Services to the license(s) is how you enable a PowerCenter Service. You can do this during installation, or add initial/incremental license keys afterwards via the Administration Console web-based utility or the infacmd command-line utility.
Description
Why should we manage the size of the repository? Repository size affects the following:
- Database backups and restores. If database backups are being performed, the size required for the backup can be reduced. If PowerCenter backups are being used, you can limit what gets backed up.
- Overall query time of the repository, which slows performance of the repository over time. Analyzing tables on a regular basis can aid in repository table performance.
- Migrations (i.e., copying from one repository to the next). Limit data transfer between repositories to avoid locking up the repository for a long period of time. Some options are available to avoid transferring all run statistics when migrating.

A typical repository starts off small (i.e., 50MB to 60MB for an empty repository) and grows to upwards of 1GB for a large repository. The type of information stored in the repository includes:
Variables
Folders
Remove folders and objects that are no longer used or referenced. Unnecessary folders increase the size of repository backups. Such folders should not be part of production, but they may exist in development or test repositories.
Run Statistics
Remove old run statistics from the repository if you no longer need them. History is important for determining trending, scaling, and performance-tuning needs, but you can always generate reports with the PowerCenter Metadata Reporter and save the data you need. To remove the run statistics, go to Repository Manager and truncate the logs based on dates.
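The same cleanup can be scripted from the pmrep command line. The sketch below is a dry run: the connection values are placeholders, and the exact option spellings for your PowerCenter version should be checked in the Command Line Reference.

```shell
# Dry run: truncate repository logs older than a cutoff date via pmrep.
# Repository, domain, user, password, and the date are all placeholders.
CONNECT_CMD="pmrep connect -r MyRepo -d MyDomain -n Administrator -x secret"
TRUNCATE_CMD="pmrep truncatelog -t '01/31/2010 23:59:59'"
echo "$CONNECT_CMD"
echo "$TRUNCATE_CMD"
```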
Recommendations
Informatica strongly recommends upgrading to the latest version of PowerCenter, since the most recent release includes backup options such as Skip Workflow and Session Logs, Skip Deployment Group History, and Skip MX Data. The repository in version 8.x and above is larger than in previous versions of PowerCenter, but the added size does not significantly affect repository performance. It is still advisable to
analyze the tables and gather statistics to keep them optimized. Informatica does not recommend directly querying the repository tables or performing deletes on them; use the client tools unless otherwise advised by Informatica technical support personnel.
Description
Parameter files are a means of providing run-time values for parameters and variables defined in a workflow, worklet, session, mapplet, or mapping. A parameter file can hold values for multiple workflows, sessions, and mappings, and can be created with a text editor such as Notepad or vi, or generated by a shell script or an Informatica mapping. Variable values are stored in the repository and can be changed within mappings. However, variable values specified in parameter files supersede values stored in the repository. The values stored in the repository can be cleared or reset using the Workflow Manager.
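A parameter file is plain text, with bracketed section headings naming the scope each value applies to. The sketch below writes an illustrative file; the folder, workflow, session, and connection names are all hypothetical.

```shell
# Illustrative parameter-file layout (names are hypothetical). The single-quoted
# string keeps the $$ mapping-parameter prefixes literal.
PARAM_TEXT='[Global]
$PMFailureEmailUser=admin@example.com
[MyFolder.WF:wf_orders]
$$LoadDate=2010-01-31
[MyFolder.WF:wf_orders.ST:s_m_load_orders]
$DBConnection_Source=ORA_SRC'
printf '%s\n' "$PARAM_TEXT" > /tmp/wf_orders.par   # a path the service can reach
cat /tmp/wf_orders.par
```

Values under [Global] apply everywhere; values under a workflow or session heading apply only within that scope.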
- Service Variable. Defines a service variable for an Integration Service.
- Service Process Variable. Defines a service process variable for an Integration Service that runs on a specific node.
- Workflow Variable. References values and records information in a workflow. For example, use a workflow variable in a decision task to determine whether the previous task ran properly.
- Worklet Variable. References values and records information in a worklet. You can use predefined worklet variables in a parent workflow, but cannot use workflow variables from the parent workflow in a worklet.
- Session Parameter. Defines a value that can change from session to session, such as a database connection or file name.
- Mapping Parameter. Defines a value that remains constant throughout a session, such as a state sales tax rate.
- Mapping Variable. Defines a value that can change during the session. The Integration Service saves the value of a mapping variable to the repository at the end of each successful
BEST PRACTICE 630 of 702
INFORMATICA CONFIDENTIAL
session run and uses that value the next time the session runs.
The following options control how much transformation logic the Integration Service pushes to the database:

- None. The Integration Service processes all transformation logic for the session.
- Source. The Integration Service pushes part of the transformation logic to the source database.
- Source with View. The Integration Service creates a view to represent the SQL override value, and runs an SQL statement against this view to push part of the transformation logic to the source database.
- Target. The Integration Service pushes part of the transformation logic to the target database.
- Full. The Integration Service pushes all transformation logic to the database.
- Full with View. The Integration Service creates a view to represent the SQL override value, and runs an SQL statement against this view to push part of the transformation logic to the source database. The Integration Service pushes any remaining transformation logic to the target database.
Each Integration Service process uses run-time files to process workflows and sessions. If you configure an Integration Service to run on a grid or on backup nodes, the run-time files must be stored in a shared location, and each node must have access to the run-time files used to process a session or workflow. This includes files such as parameter files, cache files, input files, and output files.

Place parameter files in a directory that can be accessed using a server variable. This makes it possible to move sessions and workflows to a different server without modifying workflow or session properties. You can also override the location and name of the parameter file specified in the session or workflow when executing workflows via the pmcmd command.

The following points apply to both parameter and variable files; however, they are more relevant to parameters and parameter files, and are therefore detailed accordingly.
Method 1:
1. The workflow is configured to use a parameter file.
2. The workflow has a decision task before running the session, comparing the current system date against the date in the parameter file.
3. Use a shell (or batch) script to create a parameter file. Use an SQL query to extract a single date that is greater than the system date (today) from the table, and write it to a file in the required format.
4. The shell script uses pmcmd to run the workflow.
5. The shell script is scheduled using cron or an external scheduler to run daily.

The following figure shows the use of a shell script to generate a parameter file.
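Method 1's driver script might be sketched as follows. The folder, workflow, session, service, and parameter names are hypothetical, and the SQL extraction step is simplified to today's date:

```shell
#!/bin/sh
# Sketch of Method 1. Paths and names are hypothetical -- substitute your own.
PARAM_FILE=/tmp/wf_daily_load.par
RUN_DATE=$(date +%m/%d/%Y)

# Step 3: build the parameter file (here the "extracted" date is simply
# today's date; in practice an SQL query against the table would supply it).
cat > "$PARAM_FILE" <<EOF
[PROJ_DP.WF:wf_daily_load.ST:s_daily_load]
\$\$RunDate=$RUN_DATE
EOF

# Step 4: start the workflow with pmcmd, pointing it at the generated file.
# (Authentication options omitted; guarded so the sketch runs without pmcmd.)
if command -v pmcmd >/dev/null 2>&1; then
    pmcmd startworkflow -sv IntSvc_01 -d Domain_01 \
        -f PROJ_DP -paramfile "$PARAM_FILE" wf_daily_load
else
    echo "pmcmd not on PATH; parameter file written to $PARAM_FILE"
fi
```

Step 5 would then schedule this script via cron or an external scheduler.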
Method 2:
1. The workflow is configured to use a parameter file.
2. The initial value for the date parameter is the first date on which the workflow is to run.
3. The workflow has a decision task before running the session, comparing the current system date against the date in the parameter file.
4. The last task in the workflow generates the parameter file for the next run of the workflow, using a command task calling a shell script or a session task that uses a mapping. This task extracts a date that is greater than the system date (today) from the table and writes it into the parameter file in the required format.
5. Schedule the workflow using the Scheduler to run daily (as shown in the following figure).
In such cases, a parameter file template (i.e., a parameter file containing values for some parameters and placeholders for others) may prove useful. Use a shell (or batch) script at run time to create the actual parameter file (for a specific client), replacing the placeholders with actual values, and then execute the workflow using pmcmd.

[PROJ_DP.WF:Client_Data]
$InputFile_1=/app/data/Client_ID/input/client_info.dat
$LogFile=/app/data/Client_ID/logfile/wfl_client_data_curdate.log

Using a script, replace Client_ID and curdate with actual values before executing the workflow.

The following text is an excerpt from a parameter file that contains service variables for one Integration Service and parameters for four workflows:

[Service:IntSvs_01]
$PMSuccessEmailUser=pcadmin@mail.com
$PMFailureEmailUser=pcadmin@mail.com

[HET_TGTS.WF:wf_TCOMMIT_INST_ALIAS]
$$platform=unix

[HET_TGTS.WF:wf_TGTS_ASC_ORDR.ST:s_TGTS_ASC_ORDR]
$$platform=unix
$DBConnection_ora=qasrvrk2_hp817

[ORDERS.WF:wf_PARAM_FILE.WT:WL_PARAM_Lvl_1]
$$DT_WL_lvl_1=02/01/2005 01:05:11
$$Double_WL_lvl_1=2.2

[ORDERS.WF:wf_PARAM_FILE.WT:WL_PARAM_Lvl_1.WT:NWL_PARAM_Lvl_2]
$$DT_WL_lvl_2=03/01/2005 01:01:01
$$Int_WL_lvl_2=3
$$String_WL_lvl_2=ccccc
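A run-time substitution script for such a template might look like the following sketch. The template and output paths and the client value are hypothetical; the placeholder names Client_ID and curdate come from the excerpt above:

```shell
#!/bin/sh
# Create a template like the excerpt above, then instantiate it for one client.
TEMPLATE=/tmp/client_template.par
PARAM_FILE=/tmp/client_1001.par
CLIENT_ID=1001
CURDATE=$(date +%Y%m%d)

# Template with placeholders (quoted heredoc keeps the $ names literal).
cat > "$TEMPLATE" <<'EOF'
[PROJ_DP.WF:Client_Data]
$InputFile_1=/app/data/Client_ID/input/client_info.dat
$LogFile=/app/data/Client_ID/logfile/wfl_client_data_curdate.log
EOF

# Replace the placeholders with actual values for this run.
sed -e "s/Client_ID/$CLIENT_ID/g" -e "s/curdate/$CURDATE/g" \
    "$TEMPLATE" > "$PARAM_FILE"
```

The workflow would then be started with pmcmd, using -paramfile to point at the generated file.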
Use the variable in tasks within that workflow. You can edit and delete user-defined workflow variables.

Use user-defined variables when you need to make a workflow decision based on criteria you specify. For example, suppose you create a workflow to load data to an orders database nightly. You also need to load a subset of this data to headquarters periodically, every tenth time you update the local orders database. Create separate sessions to update the local database and the one at headquarters, and use a user-defined variable to determine when to run the session that updates the orders database at headquarters.

To configure user-defined workflow variables, set up the workflow as follows:

1. Create a persistent workflow variable, $$WorkflowCount, to represent the number of times the workflow has run.
2. Add a Start task and both sessions to the workflow.
3. Place a Decision task after the session that updates the local orders database. Set up the decision condition to check whether the number of workflow runs is evenly divisible by 10. Use the modulus (MOD) function to do this.
4. Create an Assignment task to increment the $$WorkflowCount variable by one.
5. Link the Decision task to the session that updates the database at headquarters when the decision condition evaluates to true. Link it to the Assignment task when the decision condition evaluates to false.

When you configure workflow variables using conditions, the session that updates the local database runs every time the workflow runs. The session that updates the database at headquarters runs every tenth time the workflow runs.
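Based on the steps above, the Decision-task condition and the Assignment-task expression might look like the following sketch (the variable name $$WorkflowCount comes from the text; treat the exact expressions as illustrative):

```
-- Decision task condition: true on every tenth run
MOD($$WorkflowCount, 10) = 0

-- Assignment task: increment the run counter
$$WorkflowCount = $$WorkflowCount + 1
```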
Description
The required platform size to support PowerCenter depends on each customer's unique environment and processing requirements. The Integration Service allocates resources for individual extraction, transformation, and load (ETL) jobs or sessions, and each session has its own resource requirements. The resources required for the Integration Service depend on the number of sessions, what each session does while moving data, and how many sessions run concurrently. This Best Practice discusses the questions pertinent to estimating platform requirements.
TIP: An important concept regarding platform sizing is not to size your environment too early in the project lifecycle. Too often, clients size their machines before any ETL is designed or developed, and in many cases these platforms turn out to be too small for the resulting system. It is better to analyze sizing requirements after the data transformation processes have been well defined during the design and development phases.
Environment Questions
To determine platform size, consider the following questions regarding your environment:
- What sources do you plan to access? How do you currently access those sources?
- Have you decided on the target environment (i.e., database, hardware, operating system)? If so, what is it?
- Have you decided on the PowerCenter environment (i.e., hardware, operating system)? Is it possible for the PowerCenter services to be on the same machine as the target?
- How do you plan to access your information (i.e., cube, ad-hoc query tool)?
- What other applications or services, if any, run on the PowerCenter server?
- What are the latency requirements for the PowerCenter loads?

To estimate the engine requirements, also consider the following questions:

- Is the overall ETL task currently being performed? If so, how is it being done, and how long does it take?
- What is the total volume of data to move?
- What is the largest table (i.e., bytes and rows)? Is there any key on this table that can be used to partition load sessions, if needed?
- How often does the refresh occur? Will the refresh be scheduled at a certain time, or driven by external events?
- Is there a "modified" timestamp on the source table rows?
- What is the batch window available for the load?
- Are you doing a load of detail data, aggregations, or both? If you are doing aggregations, what is the ratio of source to target rows for the largest result set? How large is the result set (bytes and rows)?
The answers to these questions provide an approximation guide to the factors that affect PowerCenter's resource requirements. To simplify the analysis, focus on large jobs that drive the resource requirement.
Processor
1 to 1.5 CPUs per concurrent non-partitioned session or transformation job.
Memory
- 20 to 30MB of memory per session, if there are no aggregations, lookups, or heterogeneous data joins. Note that 32-bit systems have an operating system limitation of 2GB per session.
- Caches for aggregation, lookups, or joins use additional memory:
  - Lookup tables are cached in full; the memory consumed depends on the size of the tables.
  - Aggregate caches store the individual groups; more memory is used if there are more groups. Sorting the input to aggregations greatly reduces the need for memory.
  - Joins cache the master table in a join; the memory consumed depends on the size of the master.
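As a back-of-the-envelope illustration of the processor and memory guidelines above (1 to 1.5 CPUs and 20 to 30MB per concurrent session), the following sketch sizes for the worst case; the concurrent-session count is a hypothetical input:

```shell
#!/bin/sh
# Rough sizing from the rules of thumb in this section.
SESSIONS=8                             # expected concurrent sessions (hypothetical)
CPUS=$(( (SESSIONS * 15 + 9) / 10 ))   # 1.5 CPUs per session, rounded up
MEM_MB=$(( SESSIONS * 30 ))            # 30MB per session (no large caches)
echo "plan for roughly $CPUS CPUs and ${MEM_MB}MB of session memory"
```

Cache-heavy sessions (large lookups, unsorted aggregations) would add to the memory figure, as described above.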
System Recommendations
PowerCenter has a service-oriented architecture that provides the ability to scale services and share resources across multiple machines. Below are the recommendations for the system.
Minimum server
- 1 node, 4 CPUs, and 8GB of memory (instead of the minimal requirement of 4GB of RAM).
Disk Space
Disk space is not a factor if the machine is used only for PowerCenter services, unless the following conditions exist:
- Data is staged to flat files on the PowerCenter machine.
- Data is stored in incremental aggregation files for adding data to aggregates. The space consumed is about the size of the data aggregated.
- Temporary space is needed for paging for transformations that require large caches that cannot be entirely held in system memory.
- Session logs are saved by timestamp.
If any of these factors is true, Informatica recommends monitoring disk space on a regular basis or maintaining some type of script to purge unused files.
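The purge script mentioned above might be as simple as the following sketch; the directory layout and retention period are assumptions, not Informatica defaults:

```shell
#!/bin/sh
# Purge aged run-time files. Point LOG_DIR at your actual session/workflow
# log location (the default below is hypothetical).
LOG_DIR=${LOG_DIR:-/tmp/infa_shared/SessLogs}
RETAIN_DAYS=14

mkdir -p "$LOG_DIR"
# Delete timestamped log files older than the retention period.
find "$LOG_DIR" -type f -name '*.log*' -mtime +"$RETAIN_DAYS" -exec rm -f {} \;
echo "purged *.log* files older than $RETAIN_DAYS days from $LOG_DIR"
```

A cron entry can run this nightly alongside regular disk-space monitoring.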
Sizing Analysis
The basic goal is to size the machine so that all jobs can complete within the specified load window. You should consider the answers to the questions in the "Environment" and "Engine Sizing" sections to estimate the required number of sessions, the volume of data that each session moves, and its lookup table, aggregation, and heterogeneous join caching requirements. Use these estimates with the recommendations in the "Engine Resource Consumption" section to determine the number of processors, the memory, and the disk space required to meet the load window.

Note that the deployment environment often creates performance constraints that hardware capacity cannot overcome. The engine throughput is usually constrained by one or more of the environmental factors addressed by the questions in the "Environment" section. For example, if the data sources and target are both remote from the PowerCenter machine, the network is often the constraining factor. At some point, additional sessions, processors, and memory may not yield faster execution because the network (not the PowerCenter services) imposes the performance limit. The hardware sizing analysis is highly dependent on the environment in which the server is deployed, and you need to understand the performance characteristics of that environment before making any sizing conclusions.

It is also vitally important to remember that other applications (in addition to PowerCenter) are likely to use the platform. PowerCenter often runs on a server with a database engine and query/analysis tools. In fact, in an environment where PowerCenter, the target database, and query/analysis tools all run on the same machine, the query/analysis tool often drives the hardware requirements. However, if the loading is performed after business hours, the query/analysis tool requirements may not be a sizing limitation.
Description
PowerCenter has a service-oriented architecture that provides the ability to scale services and share resources across multiple machines. The PowerCenter domain is the fundamental administrative unit in PowerCenter: a domain is a collection of nodes and services that you can group in folders based on administration ownership. The Administration Console consolidates administrative tasks for domain objects such as services, nodes, licenses, and grids. For more information on domain configuration, refer to the Best Practice on Domain Configuration.
Folders can be organized in several ways:

- Functionality-type folders group services based on a functional area, such as Sales or Marketing.
- Object-type folders group objects based on the service type; for example, an Integration Services folder.
- Environment-type folders group objects based on the environment. For example, if you have development and testing on the same domain, group the services according to the environment.
Create user accounts in the Administration Console, then set permissions and privileges on the folders the users need access to. It is a good practice for the administrator to monitor user activity in the domain periodically and to save the reports for audit purposes.
Description
UNIX systems impose per-process limits on resources such as processor usage, memory, and file handles. Understanding and setting these resources correctly is essential for PowerCenter installations.
These per-process limits typically include:

- Process data
- Process stack
- Number of open files
- Total virtual memory
These limits are implemented on an individual-process basis and are inherited by child processes when they are created. In practice, this means that the resource limits are typically set at log-on time and apply to all processes started from the login shell. In the case of PowerCenter, any limits in effect before the Integration Service is started also apply to all sessions (pmdtm) started from that node. Any limits in effect when the Repository Service is started also apply to all pmrepagents started from that repository service (a repository service process is an instance of the Repository Service running on a particular machine or node).

When a process exceeds its resource limit, UNIX fails the operation that caused the limit to be exceeded. Depending on the limit that is reached, memory allocations fail, files cannot be opened, and processes are terminated when they exceed their processor time. Since PowerCenter sessions often use a large amount of processor time, open many files, and can use large amounts of memory, it is important to set resource limits correctly so that the operating system does not deny access to required resources.
To display the current soft limits and the hard limits:

ulimit -a
ulimit -a -H
ulimit -S -t unlimited
ulimit -S -f 2097152
ulimit -S -n 1024
ulimit -S -v unlimited

After running these commands, the limits are changed:

% ulimit -S -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) 1232896
file size (blocks, -f) 2097152
max memory size (kbytes, -m) unlimited
open files (-n) 1024
stack size (kbytes, -s) 32768
cpu time (seconds, -t) unlimited
virtual memory (kbytes, -v) unlimited
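On Linux systems that use PAM, such limits can also be persisted per user in /etc/security/limits.conf so they apply at every log-on. This is a sketch: the user name pmuser is hypothetical, and the values should match your site's requirements:

```
# /etc/security/limits.conf -- per-user limits applied at login (Linux/PAM)
pmuser  soft  nofile  1024
pmuser  hard  nofile  4096
pmuser  soft  stack   32768
```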
Description
The performance of PowerExchange CDC on Oracle databases is dependent upon a variety of factors, including:

- The type of connection that PowerExchange has to the Oracle database being captured.
- The amount of data that is being written to the Oracle redo logs.
- The workload of the server where the Oracle database being captured resides.
Connection Type
Ensure that, wherever possible, PowerExchange has a local-mode connection to the source database. Connections over slow networks and via SQL*Net should be avoided.
Volume of Data
The volume of data that Oracle LogMiner has to process in order to provide changed data to PowerExchange has a significant impact on performance. Bear in mind that other processes may be writing large volumes of data to the Oracle redo logs, in addition to the changed data rows. These include, but are not restricted to:

- Oracle Catalog dumps.
- Oracle Workload Monitor customizations.
- Other (non-Oracle) tools that use the redo logs to provide proprietary information.
In order to optimize PowerExchange CDC performance, the amount of data these processes write to the Oracle redo logs needs to be minimized, both in terms of volume and frequency. Review the processes that are actively writing data to the Oracle redo logs and tune them within the context of a production environment. For example, is it strictly necessary to perform a Catalog dump every 30 minutes? In a production environment, schema changes are less frequent than in a development environment, where Catalog dumps may be needed at this frequency.
Server Workload
Optimize the performance of the Oracle database server by reducing the number of unnecessary tasks it performs concurrently with the PowerExchange CDC components. This may include a full review of the scheduling of backups and restores, Oracle import and export processing, and other application software utilized within the production server environment.
Description
PowerExchange installation is very straightforward and can generally be accomplished in a timely fashion. When considering a PowerExchange installation, be sure that the appropriate resources are available. These include, but are not limited to:
- MVS systems operator
- Appropriate database administrator; this depends on what (if any) databases are going to be sources and/or targets (e.g., IMS, IDMS, etc.)
- MVS security resources
Be sure to adhere to the sequence of the following steps to successfully install PowerExchange. Note that in this very typical scenario, the mainframe source data is going to be pulled across to a server box.

1. Complete the PowerExchange pre-install checklist and obtain valid license keys.
2. Install PowerExchange on the mainframe.
3. Start the PowerExchange jobs/tasks on the mainframe.
4. Install the PowerExchange client (Navigator) on a workstation.
5. Test connectivity to the mainframe from the workstation.
6. Install Navigator on the UNIX/NT server.
7. Test connectivity to the mainframe from the server.
Complete the PowerExchange Pre-install Checklist and Obtain Valid License Keys
Reviewing the environment and recording the information in a detailed checklist facilitates the PowerExchange install. The checklist (which is a prerequisite) is installed in the Documentation Folder when the PowerExchange software is installed. It is also
available within the client from the PowerExchange Program Group. Be sure to complete all relevant sections.

You will need a valid license key in order to run any of the PowerExchange components. This is a 44-byte key that uses hyphens every 4 bytes. For example:

1234-ABCD-1234-EF01-5678-A9B2-E1E2-E3E4-A5F1

The key is not case-sensitive and uses hexadecimal digits and letters (0-9 and A-F). Keys are valid for a specific time period and are also linked to an exact or generic TCP/IP address. They also control access to certain databases and determine whether the PowerCenter Mover can be used. You cannot successfully install PowerExchange without a valid key for all required components.

Note: When copying software from one machine to another, you may encounter license key problems since the license key is IP-specific. Be prepared to deal with this eventuality, especially if you are going to a backup site for disaster recovery testing.
Step 4: Edit JOBCARD in RUNLIB and configure it for the environment (e.g., execution class, message class, etc.).

Step 5: Edit the SETUP member in RUNLIB. Copy in the JOBCARD and SUBMIT. This process can submit from 5 to 24 jobs. All jobs should end with return code 0 (success) or 1; a list of the needed installation jobs can be found in the XJOBS member.
Description
A Business Case should include both qualitative and quantitative measures of potential benefits.

The Qualitative Assessment portion of the Business Case is based on the Statement of Problem/Need and the Statement of Project Goals and Objectives (both generated in Subtask 1.1.1 Establish Business Project Scope) and focuses on discussions with the project beneficiaries regarding the expected benefits in terms of problem alleviation, cost savings or controls, and increased efficiencies and opportunities. Many qualitative items are intangible, but you may be able to cite examples of the potential costs or risks if the system is not implemented; for example, the cost of bad data quality resulting in the loss of a key customer, or an invalid analysis resulting in bad business decisions.

Risk factors may be classified as business, technical, or execution in nature. Examples of these risks are uncertainty of value or the unreliability of collected information, new technology employed, or a major change in business thinking for the personnel executing the change. It is important to identify an estimated value added or cost eliminated to strengthen the business case: the better defined the factors, the stronger the business case.

The Quantitative Assessment portion of the Business Case provides specific, measurable details of the proposed project, such as the estimated ROI. This may involve the following calculations:
- Cash flow analysis - Projects positive and negative cash flows for the anticipated life of the project. Typically, ROI measurements use the cash flow formula to depict results.
- Net present value - Evaluates cash flow according to the long-term value of current investment. Net present value shows how much capital needs to be invested currently, at an assumed interest rate, in order to create a stream of payments over time. For instance, to generate an income stream of $500 per month over six months at an interest rate of eight percent would require an investment (i.e., a net present value) of $2,311.44.
- Return on investment - Calculates the net present value of total incremental cost savings and revenue divided by the net present value of total costs, multiplied by 100. This type of ROI calculation is frequently referred to as return-on-equity or return-on-capital.
- Payback period - Determines how much time must pass before an initial capital investment is recovered.
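The net present value figure quoted above follows from the standard annuity present-value formula; note that for the numbers to work out, the eight percent rate in the example is applied per monthly period:

```latex
PV = P \cdot \frac{1 - (1+r)^{-n}}{r}
   = 500 \cdot \frac{1 - (1.08)^{-6}}{0.08}
   \approx 2{,}311.44
```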
The following are steps to calculate the quantitative business case or ROI:

Step 1: Develop Enterprise Deployment Map. This is a model of the project phases over a timeline, estimating as specifically as possible the participants, requirements, and systems involved. A data integration or migration initiative or amendment may require estimating customer participation (e.g., by department and location), subject area and type of information/analysis, numbers of users, numbers and complexity of target data systems (data marts or operational databases, for example) and data sources, types of sources, and size of the data set. A data migration project may require customer participation, legacy system migrations, and retirement procedures. The types of estimations vary by project types and goals. Note that the more details you have for estimations, the more precise your phased solutions are likely to be. The scope of the project should also be made known in the deployment map.

Step 2: Analyze Potential Benefits. Discussions with representative managers and users or the Project Sponsor should reveal the tangible and intangible benefits of the project. The most effective format for presenting this analysis is often a "before" and "after" comparison of the current situation to the project expectations. Include in this step costs that can be avoided by the deployment of this project.

Step 3: Calculate Net Present Value for all Benefits. Information gathered in this step should help the customer representatives to understand how the expected benefits are going to be allocated throughout the organization over time, using the enterprise deployment map as a guide.

Step 4: Define Overall Costs. Customers need specific cost information in order to assess the dollar impact of the project. Cost estimates should address the following fundamental cost components:
- Hardware
- Networks
- RDBMS software
- Back-end tools
- Query/reporting tools
- Internal labor
- External labor
- Ongoing support
- Training
Step 5: Calculate Net Present Value for all Costs. Use either actual cost estimates or percentage-of-cost values (based on cost allocation assumptions) to calculate costs for each cost component, projected over the timeline of the enterprise deployment map. Actual cost estimates are more accurate than percentage-of-cost allocations, but much more time-consuming. The percentage-of-cost allocation process may be valuable for initial ROI snapshots until costs can be more clearly predicted.

Step 6: Assess Risk, Adjust Costs and Benefits Accordingly. Review potential risks to the project and make corresponding adjustments to the costs and/or benefits. Some of the major risks to consider are:

- Scope creep, which can be mitigated by thorough planning and tight project scope.
- Integration complexity, which may be reduced by standardizing on vendors with integrated product sets or open architectures.
- An architectural strategy that is inappropriate.
- A current support infrastructure that may not meet the needs of the project.
- Conflicting priorities, which may impact resource availability.
- Other miscellaneous risks: from management or end users who may withhold project support, from the entanglements of internal politics, and from technologies that don't function as promised.
- Unexpected data quality, complexity, or definition issues, which are often discovered late in the project and can adversely affect effort, cost, and schedule. This can be somewhat mitigated by early source analysis.
Step 7: Determine Overall ROI. When all other portions of the business case are complete, calculate the project's "bottom line". Determining the overall ROI is simply a matter of subtracting the net present value of total costs from the net present value of total incremental cost savings and revenue.
Final Deliverable
The final deliverable of this phase of development is a complete Business Case that documents both tangible (quantified) and intangible (non-quantified, but with estimated benefits and risks) items, to be presented to the Project Sponsor and key stakeholders. This allows them to review the Business Case in order to justify the development effort.

If your organization has the concept of a Project Office, which provides the governance for projects and priorities, much of this is often part of the original Project Charter, which states items like scope, initial high-level requirements, and key project stakeholders. However, developing a full Business Case can validate any initial analysis and provide additional justification. Additionally, the Project Office should provide guidance in building and communicating the Business Case. Once completed, the Project Manager is responsible for scheduling the review and socialization of the Business Case.
- Data names to be assessed
- Data definitions
- Data formats and physical attributes
- Required business rules, including allowed values
- Data usage
- Expected quality levels
By gathering and documenting some of the key detailed data requirements, a solid understanding of the business rules involved is reached. Certainly, not all elements can be analyzed in detail, but this helps in getting to the heart of the business system so you are better prepared when speaking with business and technical users.
Description
The following steps are key for successfully defining and prioritizing requirements:
Step 1: Discovery
Gathering business requirements is one of the most important stages of any data integration project. Business requirements affect virtually every aspect of the data integration project starting from Project Planning and Management to End-User
Application Specification. They are like a hub that sits in the middle and touches the various stages (spokes) of the data integration project. There are two basic techniques for gathering requirements and investigating the underlying operational data: interviews and facilitated sessions.
Data Profiling
Informatica Data Explorer (IDE) is an automated data profiling and analysis software product that can be extremely beneficial in defining and prioritizing requirements. It provides a detailed description of data content, structure, rules, and quality by profiling the actual data that is loaded into the product. Some industry examples of why data profiling is crucial prior to beginning the development process are:
- 37 percent of projects are cancelled; 50 percent are completed but with 20 percent overruns, leaving only 13 percent completed on time and within budget.
- Using a data profiling tool can lower the risk and the cost of the project and increase the chances of success.
- Data profiling reports can be posted to a central location where all team members can review results and track accuracy. IDE provides the ability to promote collaboration through tags, notes, action items, transformations, and rules. By profiling the information, the framework is set for an effective interview process with business and technical users.
Interviews
Conduct interview research before starting the requirements-gathering process so that interviewees can be categorized into functional business management and Information Technology (IT) management. This, in conjunction with effective data profiling, helps to establish a comprehensive set of business requirements.
Business Interviewees. Depending on the needs of the project, even though you may be focused on a single primary business area, it is always beneficial to interview horizontally to achieve a good cross-functional perspective of the enterprise. This also provides insight into how extensible your project is across the enterprise. Before you interview, be sure to develop an interview questionnaire based upon profiling results, as well as business questions; schedule the interview time and place; and prepare the interviewees by sending a sample agenda. When interviewing business people, it is always important to start with the upper echelons of management so as to understand the overall vision, assuming you have the business background, confidence and credibility to converse at those levels. If not adequately prepared, the safer approach is to interview middle management. If you are interviewing across multiple teams, you might want to scramble interviews among teams. This way if you hear different perspectives from finance and marketing, you can resolve the discrepancies with a scrambled interview schedule. A note to keep in mind is that business is sponsoring the data integration project and is going to be the end-users of the application. They will decide the success criteria of your data integration project and determine future sponsorship. Questioning during these sessions should include the following:
- Who are the stakeholders for this milestone delivery (IT, field business analysts, executive management)?
- What are the target business functions, roles, and responsibilities?
- What are the key relevant business strategies, decisions, and processes (in brief)?
- What information is important to drive, support, and measure success for those strategies and processes? What are the key metrics? What dimensions apply to those metrics?
- What current reporting and analysis is applicable? Who provides it? How is it presented? How is it used? How can it be improved?
IT Interviewees. The IT interviews have a different flavor from those with the business user community. Interviewing the IT team is generally very beneficial because it includes the data gurus who deal with the data on a daily basis. They can provide great insight into data quality issues, help in the systematic exploration of legacy source systems, and aid in understanding business users' needs around critical reports. If you are developing a prototype, they can help get things done quickly and address important business reports. Questioning during these sessions should include the following:
- What day-to-day maintenance issues does the operations team encounter with these systems? Ask for their insight into data quality issues.
- What business users do they support?
- What reports are generated on a daily, weekly, or monthly basis? What are the current service level agreements for these reports?
- How can the DI project support the IS department's needs?
- Review data profiling reports and analyze the anomalies in the data. Note and record each of the comments from the more detailed analysis. What are the key business rules involved in each item?
Facilitated Sessions
Facilitated sessions - sometimes known as JAD (Joint Application Development) or RAD (Rapid Application Development) sessions - are a way for a group of business and technical users to work together to capture the requirements. They can be very valuable in gathering comprehensive requirements and building the project team. The difficulty is the amount of preparation and planning required to make the session a pleasant and worthwhile experience.

Facilitated sessions provide quick feedback by gathering people from the various teams into one meeting and initiating the requirements process. You need a facilitator who is experienced in these meetings to ensure that all the participants get a chance to speak and provide feedback. During individual (or small group) interviews with high-level management, there is often a focus and clarity of vision that may be hindered in large meetings. It is therefore extremely important to encourage all attendees to participate and to prevent a small number from dominating the requirements process.

A challenge of facilitated sessions is matching everyone's busy schedules and actually getting them into a meeting room. This part of the process must be focused and brief or it can become unwieldy, with too much time expended just trying to coordinate calendars among worthy forum participants. Set a time period and target list of participants with the Project Sponsor, but avoid lengthening the process if some participants aren't available. Questions asked during facilitated sessions are similar to those asked of business and IS interviewees.
The interviews and facilitated sessions complete the discovery process with business and IT management. The next step is to define the business requirements specification. The resulting Business Requirements Specification includes a matrix linking the specific business requirements to their functional requirements. Defining the business requirements is a time-consuming process and should be facilitated by forming a working group team. A working group usually consists of business users, business analysts, the project manager, and other individuals who can help to define the business requirements. The working group should meet weekly to define and finalize business requirements. The working group helps to:
- Design the current state and future state
- Identify supply format and transport mechanism
- Identify required message types
- Develop Service Level Agreement(s), including timings
- Identify supply management and control requirements
- Identify common verifications, validations, business validations, and transformation rules
- Identify common reference data requirements
- Identify common exceptions
- Produce the physical message specification
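The matrix linking business requirements to functional requirements can be sketched as a simple mapping. The requirement IDs and names below are invented for illustration; a real matrix would use the identifiers from your Business Requirements Specification:

```python
# Illustrative traceability matrix: each business requirement maps to the
# functional requirements that satisfy it. All IDs here are hypothetical.
matrix = {
    "BR-01 Single view of customer": ["FR-03 Customer match rules", "FR-07 Survivorship"],
    "BR-02 Daily sales reporting":   ["FR-01 Nightly load", "FR-05 Sales aggregates"],
    "BR-03 Audit trail of loads":    [],
}

# An empty mapping flags a business requirement with no functional coverage yet,
# which is exactly the gap a review of the matrix should surface.
uncovered = [br for br, frs in matrix.items() if not frs]
print("No functional coverage:", uncovered)
```

Even this trivial structure makes the review concrete: the working group walks the matrix and resolves every uncovered requirement before sign-off.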
At this time, the Architect also develops the Information Requirements Specification to clearly represent the structure of the information requirements. This document, based on the business requirements findings, can facilitate discussion of informational details and provide the starting point for the target model definition. The detailed business requirements and information requirements should be reviewed with the project beneficiaries and prioritized based on business need and the stated project objectives and scope.
The Project Manager, Business Analyst, and Architect then develop consensus on a project "phasing" approach. Items of secondary priority and those with poor near-term feasibility are relegated to subsequent phases of the project. In this way they develop a phased, or incremental, "roadmap" for the project (the Project Roadmap).
Final Deliverable
The final deliverable of this phase of development is a complete list of business requirements, a diagram of the current and future states, and a list of the high-level business rules that will effect the change from current to future. This provides the development team with much of the information needed to begin the design effort for the system modifications. Once the deliverable is complete, the Project Manager is responsible for scheduling its review and socialization in order to achieve sign-off. It is presented to the Project Sponsor for approval and becomes the first "increment," or starting point, for the Project Plan.
Description
The WBS is a deliverable-oriented hierarchical tree that allows large tasks to be visualized as a group of related smaller, more manageable subtasks. These tasks and subtasks can then be assigned to various resources, which helps to identify accountability and is invaluable for tracking progress. The WBS serves as a starting point as well as a monitoring tool for the project.

One challenge in developing a thorough WBS is striking the correct balance between sufficient detail and too much detail. The WBS shouldn't include every minor detail in the project, but it does need to break the tasks down to a manageable level of detail. One general guideline is to keep task detail to a duration of at least a day. It is also important to maintain a consistent level of detail across the project. A well-designed WBS can be extracted at a higher level to communicate overall project progress, as shown in the following sample. The working WBS for the project manager may be a level of detail deeper than the overall project WBS to ensure that all steps are completed, but the communication can roll up a level or two to make things clearer.
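The roll-up from leaf-level tasks to a higher-level progress figure can be sketched as a small tree structure. This is an illustrative model only: the task names and hour figures are invented, and equating percent complete with actual-versus-budget hours is a simplification of real progress tracking.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """One WBS node; leaf tasks carry hour estimates, parents roll them up."""
    name: str
    budget_hours: float = 0.0
    actual_hours: float = 0.0
    subtasks: list = field(default_factory=list)

    def rollup_budget(self) -> float:
        if not self.subtasks:
            return self.budget_hours
        return sum(t.rollup_budget() for t in self.subtasks)

    def rollup_actual(self) -> float:
        if not self.subtasks:
            return self.actual_hours
        return sum(t.rollup_actual() for t in self.subtasks)

    def percent_complete(self) -> float:
        # Simplification: hours burned vs. hours budgeted.
        budget = self.rollup_budget()
        return 100.0 * self.rollup_actual() / budget if budget else 0.0

# Hypothetical two-task plan for illustration.
plan = Task("Plan", subtasks=[
    Task("Develop analytic solution architecture", budget_hours=32, actual_hours=19),
    Task("Design development architecture", budget_hours=28, actual_hours=13),
])
print(f"{plan.name}: {plan.percent_complete():.0f}% complete")  # rolls up both subtasks
```

The same recursion supports arbitrarily deep trees, so the detailed WBS and the summarized communication view are just different depths of the one structure.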
[Sample WBS progress roll-up: tasks such as Plan; Architecture - Set up of Informatica Environment (develop analytic solution architecture, design development architecture, customize and implement Iterative Framework); Data Profiling (Legacy Stage, Pre-Load Stage, Reference Data); and Analysis - Target-to-Source Data Mapping (Customer, Product, Inventory, Shipping, Invoicing, and Orders tables, plus review and signoff of the Functional Specification), each tracked by % Complete, Budget Hours, and Actual Hours - e.g., the Analysis task at 52% complete with 1,167 budget hours and 602 actual hours.]
A fundamental question is whether to include activities as part of a WBS. The following statements are generally true for most projects, most of the time, and are therefore appropriate as the basis for resolving this question.
- The project manager should have the right to decompose the WBS to whatever level of detail he or she requires to effectively plan and manage the project. The WBS is a project management tool that can be used in different ways, depending upon the needs of the project manager. The lowest level of the WBS can be activities.
- The hierarchical structure should be organized by deliverables and milestones, with process steps detailed within it. Alternatively, the WBS can be structured on a process or life-cycle basis (i.e., the accepted concept of Phases), with non-deliverables detailed within it.
- At the lowest level in the WBS, an individual should be identified and held accountable for the result. This person should be an individual contributor, creating the deliverable personally, or a manager who will in turn create a set of tasks to plan and manage the results.
- The WBS is not necessarily a sequential document. Tasks in the hierarchy are often completed in parallel. In part, the goal is to list every task that must be completed; it is not necessary to determine the critical path for completing these tasks.
  - For example, consider multiple subtasks under a task (e.g., 4.3.1 through 4.3.7 under task 4.3). Subtasks 4.3.1 through 4.3.4 may have sequential requirements that force them to be completed in order, while subtasks 4.3.5 through 4.3.7 can - and should - be completed in parallel if they have no sequential requirements.
  - It is important to remember that a task is not complete until all of its corresponding subtasks are completed, whether sequentially or in parallel. For example, the Build Phase is not complete until tasks 4.1 through 4.7 are complete, but some work can (and should) begin for the Deploy Phase long before the Build Phase is complete.
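The distinction between sequential and parallel subtasks can be made concrete with a small scheduling sketch. The task IDs echo the hypothetical 4.3.x numbering above, and the dependency sets are invented for illustration:

```python
# Group subtasks into "waves": tasks within a wave have no unmet dependencies
# on each other and can be worked in parallel, while each wave must wait for
# the previous one to finish.
def parallel_waves(dependencies):
    """dependencies: {task: set of prerequisite tasks}."""
    remaining = dict(dependencies)
    waves = []
    while remaining:
        # A task is ready when none of its prerequisites are still unfinished.
        ready = {t for t, deps in remaining.items() if not deps & set(remaining)}
        if not ready:
            raise ValueError("cyclic dependencies")
        waves.append(sorted(ready))
        for t in ready:
            del remaining[t]
    return waves

deps = {
    "4.3.1": set(), "4.3.2": {"4.3.1"}, "4.3.3": {"4.3.2"}, "4.3.4": {"4.3.3"},
    "4.3.5": set(), "4.3.6": set(), "4.3.7": set(),   # no ordering constraints
}
for i, wave in enumerate(parallel_waves(deps), 1):
    print(f"Wave {i}: {', '.join(wave)}")
```

With these assumed dependencies, 4.3.5 through 4.3.7 land in the first wave alongside 4.3.1, while 4.3.2 through 4.3.4 fall into successive waves - exactly the sequential-versus-parallel split the example describes.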
The Project Plan provides a starting point for further development of the project WBS. This sample is a Microsoft Project file that has been "pre-loaded" with the phases, tasks, and subtasks that make up the Informatica methodology. The Project Manager can use this WBS as a starting point, but should review it to ensure that it corresponds to the specific development effort, removing any steps that aren't relevant or adding steps as necessary. Many projects require the addition of detailed steps to accurately represent the development effort. If the Project Manager chooses not to use Microsoft Project, an Excel version of the Work Breakdown Structure is also available. The phases, tasks, and subtasks can be exported from Excel into many other project management tools, simplifying the effort of developing the WBS.

Sometimes it is best to build an initial task list and timeline with the project team using a facilitator. The project manager can act as facilitator or can appoint one, freeing up the project manager and enabling team members to focus on determining the actual tasks and effort needed. Depending on the size and scope of the project, sub-projects may be beneficial, with multiple project teams creating their own project plans. The overall project manager then brings the plans together into a master project plan. This group of projects can be defined as a program, with the project manager and project architect managing the interaction among the various development teams.
Caution: Do not expect plans to be set in stone. Plans inevitably change as the project progresses; new information becomes available; scope, resources and priorities change; deliverables are (or are not) completed on time, etc. The process of estimating and modifying the plan should be repeated many times throughout the project. Even initial planning is likely to take several iterations to gather enough information. Significant changes to the project plan become the basis to communicate with the project sponsor(s) and/or key stakeholders with regard to decisions to be made and priorities rearranged. The goal of the project manager is to be non-biased toward any decision, but to place the responsibility with the sponsor to shape direction.
Data integration projects differ from OLTP (online transaction processing) projects in several ways:

- Business requirements are less tangible and predictable than in OLTP projects.
- Database queries are very data-intensive, involving few or many tables but many, many rows. In OLTP, transactions are data-selective, involving few or many tables and comparatively few rows.
- Metadata is important, but in OLTP the meaning of fields is predetermined on a screen or report. In a data integration project (e.g., a warehouse, common data management, etc.), metadata and traceability are much more critical.
Data integration projects, like all development projects, must be managed, and to manage them they must follow a clear plan. Data integration project managers often have a more difficult job than those managing OLTP projects because there are so many pieces and sources to manage. Two purposes of the WBS are to manage work and ensure success. Although this is the same as for any project, data integration projects are unlike typical waterfall projects in that they are based on an iterative approach. Three of the main principles of iteration are as follows:

- Iteration. Division of work into small chunks of effort, using lessons learned from earlier iterations.
- Time boxing. Delivery of capability in short intervals, with the first release typically requiring three to nine months (depending on complexity) and quarterly releases thereafter.
- Prototyping. Early delivery of a prototype, with a working database delivered approximately one-third of the way through.

Incidentally, most iterative projects follow an essentially waterfall process within a given increment. The danger is that projects can iterate or spiral out of control.
The three principles listed above are very important because even the best data integration plans are likely to invite failure if these principles are ignored. An example of a failure waiting to happen, even with a fully detailed plan, is a large common data management project that gathers all requirements upfront and delivers the application all-at-once after three years. It is not the "large" that is the problem, but the "all requirements upfront" and the "all-at-once in three years." Even enterprise data warehouses are delivered piece-by-piece using these three (and other) principles. The feedback you can gather from increment to increment is critical to the success of the future increments. The benefit is that such incremental deliveries establish patterns for development that can be used and leveraged for future deliveries.
In many cases, a manual technique is used to identify and record the high-level phases and tasks, and the information is then transferred to project tracking software such as Microsoft Project. Project team members typically begin by identifying the high-level phases and tasks, writing the relevant information on large sticky notes or index cards, and then mounting the notes or cards on a wall or whiteboard. Use one sticky note or card per phase or task so that you can easily rearrange them as the project order evolves. As the project plan progresses, you can add information to the cards or notes to flesh out the details, such as task owner, time estimates, and dependencies. This information can then be fed into the project tracking software.

Once you have a fairly detailed methodology, you can enter the phase and task information into your project tracking software. When the project team is assembled, you can enter additional tasks and details directly into the software. Be aware, however, that the project team can better understand a project and its various components if they actually participate in the high-level development activities, as they do in the manual approach. Using software alone, without input from relevant project team members, to designate phases, tasks, dependencies, and timelines can be difficult and prone to errors and omissions. Benefits of developing the project timeline manually, with input from team members, include:
- Team members have an opportunity to work with each other and set the foundation. This is particularly important if the team is geographically dispersed and cannot work face-to-face throughout much of the project.
Use the task estimates and dependencies to prepare a project schedule. The end result is the Project Plan; refer to Developing and Maintaining the Project Plan for further information. Use the project plan to track progress, and be sure to review and modify estimates and keep the plan updated throughout the project.
- A designated begin and end date.
- Well-defined business and technical requirements.
- Adequate assigned resources.

Without these components, the project is subject to slippage and to setting incorrect expectations with the Project Sponsor.

2. Project Plans are subject to revision and change throughout the project. It is imperative to establish a communication plan with the Project Sponsor; such communication may involve a weekly status report of accomplishments and/or a report on issues and plans for the following week. This type of forum is very helpful in involving the Project Sponsor in actively making decisions with regard to changes in scope or timeframes.

If your organization has the concept of a Project Office that provides governance for projects and priorities, look for a Project Charter that contains items such as scope, initial high-level requirements, and key project stakeholders. Additionally, the Project Office should provide guidance on funding and resource allocation for key projects.

Projects built on Informatica's PowerCenter and Data Quality are not exempt from this project planning process. The purpose here is to provide some key elements that can be used to develop and maintain a plan for a data integration, data migration, or data quality project.
Description
Use the following steps as a guide for developing the initial project plan:

1. Define major milestones based on the project scope. (Be sure to list all key items such as analysis, design, development, and testing.)
2. Break the milestones down into major tasks and activities. The Project Plan should be helpful as a starting point or for recommending tasks for inclusion.
3. Continue the detail breakdown, if possible, to a level at which logical chunks of work can be completed and assigned to resources for accountability purposes. This level provides satisfactory detail to facilitate estimation, assignment of resources, and tracking of progress. If the detail tasks are too broad in scope, such as requiring multiple resources, estimates are much less likely to be accurate and resource accountability becomes difficult to maintain.
4. Confer with technical personnel to review the task definitions and effort estimates (or even to help define them, if applicable). This helps to build commitment for the project plan.
5. Establish the dependencies among tasks, where one task cannot be started until another is completed (or must start or complete concurrently with another).
6. Define the resources based on the role definitions and the estimated number of resources needed for each role.
7. Assign resources to each task. If a resource will be only part-time on a task, indicate this in the plan.
8. Ensure that the project plan follows your organization's system development methodology.

Note: Informatica Professional Services has found success in projects that blend the waterfall method with the iterative method. The waterfall method works well in the early stages of a project, such as analysis and initial design. Iterative methods work well in accelerating development and testing, where feedback from extensive testing validates the design of the system.

At this point, especially when using Microsoft Project, it is advisable to create dependencies (i.e., predecessor relationships) between tasks assigned to the same resource in order to indicate the sequence of that person's activities. Set the constraint type to As Soon As Possible and avoid setting a constraint date. Use the effort-driven approach so that the Project Plan can be easily modified as adjustments are made.

By setting the initial definition of tasks and efforts, the resulting schedule should provide a realistic picture of the project, unfettered by concerns about ideal user-requested completion dates. In other words, be as realistic as possible in your initial estimates, even if the resulting schedule is likely to miss Project Sponsor expectations. This helps to establish good communications with your Project Sponsor so you can begin to negotiate scope and resources in good faith.

This initial schedule becomes a starting point. Expect to review and rework it, perhaps several times. Look for opportunities for parallel activities, perhaps adding resources if necessary, to improve the schedule. When a satisfactory initial plan is complete, review it with the Project Sponsor and discuss the assumptions, dependencies, assignments, milestone dates, etc. Expect to modify the plan as a result of this review.
Key Accomplishments
Listing key accomplishments provides an audit trail of activities completed for comparison against the initial plan. This is an opportunity to bring in the lead developers and have them report to management on what they have accomplished; it also provides them with an opportunity to raise concerns, which is very good from a motivation perspective since they own the work and account for it to management. Keep accomplishments at a high level and coach the team members to be brief, keeping each presentation to a five- to ten-minute maximum during this portion of the meeting.
[Sample weekly progress-tracking view: tasks (Reference Data; Reusable Objects; review and signoff of Architecture; and Analysis - Target-to-Source Data Mapping for the Customer, Product, Inventory, Shipping, Invoicing, and Orders tables, ending with review and signoff of the Functional Specification) tracked week by week from 17-Apr through 29-May, with planned hours per week, percent complete per task, and a total of 1,167 budget hours for the Analysis effort.]
Key Issues
This is the most important part of the meeting. Presenting key issues, such as resource commitment, user roadblocks, and key design concerns, to the Project Sponsor and Key Stakeholders as they occur allows them to make immediate decisions and minimizes the risk of impact to the project.
Tracking Changes
One approach is to establish a baseline schedule (and budget, if applicable) and then track changes against it. With Microsoft Project, this involves creating a "Baseline" that remains static as changes are applied to the schedule. If company and project management do not require tracking against a baseline, simply maintain the plan through updates without a baseline. Maintain all records of Project Sponsor meetings and recap changes in scope after each meeting is completed.
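The baseline idea can be illustrated with a minimal sketch: baseline dates are captured once and left untouched, and slippage is reported as the difference between the current schedule and the baseline. The task names and dates below are invented:

```python
from datetime import date

# Baseline finish dates captured at plan approval; these are never edited.
baseline = {"Business requirements": date(2024, 3, 1),
            "Source analysis":       date(2024, 3, 15)}

# Current forecast finish dates, updated as the project progresses.
current  = {"Business requirements": date(2024, 3, 8),
            "Source analysis":       date(2024, 3, 15)}

for task, planned in baseline.items():
    slip = (current[task] - planned).days   # positive = behind baseline
    status = f"slipped {slip} day(s)" if slip > 0 else "on baseline"
    print(f"{task}: {status}")
```

Because the baseline dictionary is static, every status report compares against the same reference point, which is exactly what makes cumulative slippage visible to the Project Sponsor.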
Summary
Managing a data integration, data migration, or data quality project requires good project planning and communications. Many data integration projects fail because of issues such as poor data quality or the complexity of integration issues. Good communication and expectation setting with the Project Sponsor can prevent such issues from causing a project to fail.
Description
The following four steps summarize business case development and lay a good foundation for proceeding into detailed business requirements for the project. 1. One of the first steps in establishing the business scope is identifying the project beneficiaries and understanding their business roles and project participation. In many cases, the Project Sponsor can help to identify the beneficiaries and the various departments they represent. This information can then be summarized in an organization chart that is useful for ensuring that all project team members understand the corporate/business organization.
- Activity - Interview the Project Sponsor to identify beneficiaries and define their business roles and project participation.
- Deliverable - Organization chart of corporate beneficiaries and participants.
2. The next step in establishing the business scope is to understand the business problem or need that the project addresses. This information should be clearly defined in a Problem/Needs Statement, using business terms to describe the problem. For example, the problem may be expressed as "a lack of information" rather than "a lack of technology" and should detail the business decisions or analysis that is required to resolve the lack of information. The best way to gather this type of information is by interviewing the Project Sponsor and/or the project beneficiaries.
- Activity - Interview (individually or in a forum) the Project Sponsor and/or beneficiaries regarding problems and needs related to the project.
- Deliverable - Problem/Needs Statement
3. The next step in creating the project scope is defining the business goals and objectives for the project and detailing them in a comprehensive Statement of Project Goals and Objectives. This statement should be a high-level expression of the desired business solution (e.g., what strategic or tactical benefits the business expects to gain from the project) and should avoid any technical considerations at this point. Again, the Project Sponsor and beneficiaries are the best sources for this type of information. It may be practical to combine information gathering for the needs assessment and goals definition, using individual interviews or general meetings to elicit the information.
- Activity - Interview (individually or in a forum) the Project Sponsor and/or beneficiaries regarding business goals and objectives for the project.
- Deliverable - Statement of Project Goals and Objectives
4. The final step is creating a Project Scope and Assumptions statement that clearly defines the boundaries of the project based on the Statement of Project Goals and Objectives and the associated project assumptions. This statement should focus on the type of information or analysis that will be included in the project rather than what will not. The assumptions statements are optional and may include qualifiers on the scope, such as assumptions of feasibility, specific roles and responsibilities, or availability of resources or data.
- Activity - The Business Analyst develops the Project Scope and Assumptions statement for presentation to the Project Sponsor.
- Deliverable - Project Scope and Assumptions statement
Description
The quality of a project can be directly correlated to the amount of review that occurs during its lifecycle and the involvement of the Project Sponsor and Key Stakeholders.
- Project scope and business case review.
- Business requirements review.
- Source analysis and business rules reviews.
- Data architecture review.
- Technical infrastructure review (hardware and software capacity and configuration planning).
- Data integration logic review (source-to-target mappings, cleansing and transformation logic, etc.).
- Source extraction process review.
- Operations review (operations and maintenance of load sessions, etc.).
- Reviews of the operations plan, QA plan, and deployment and support plan.
- Key Accomplishments.
- Activities Next Week.
- Tracking of Progress to Date (Budget vs. Actual).
- Key Issues / Roadblocks.
It is the Project Manager's role to stay neutral on any issue and to effectively state the facts, allowing the Project Sponsor or other key executives to make decisions. Many times this process builds the partnership necessary for success.
Change in Scope
Directly address and evaluate any changes to the planned project activities, priorities, or staffing as they arise, or are proposed, in terms of their impact on the project plan. The Project Manager should institute a change management process in response to any issue or request that appears to add or alter expected activities and has the potential to affect the project plan.

Use the Scope Change Assessment to record the background problem or requirement and the recommended resolution that constitutes the potential scope change. Note that such a change-in-scope document helps capture key documentation that is particularly useful if the project overruns or fails to deliver on Project Sponsor expectations. Review each potential change with the technical team to assess its impact on the project, evaluating the effect in terms of schedule, budget, staffing requirements, and so forth. Present the Scope Change Assessment to the Project Sponsor for acceptance (with formal sign-off, if applicable). Discuss the assumptions involved in the impact estimate and any potential risks to the project.
Even if there is no evident effect on the schedule, it is important to document these changes because they may affect project direction and it may become necessary, later in the project cycle, to justify these changes to management.
Management of Issues
Any questions, problems, or issues that arise and are not immediately resolved should be tracked to ensure that someone is accountable for resolving them and that their effect remains visible. Use the Issues Tracking template, or something similar, to track issues, their owner, and the dates of entry and resolution, as well as the details of each issue and its solution. Significant or "showstopper" issues should also be mentioned on the status report and communicated through the weekly project sponsor meeting. This way, the Project Sponsor has the opportunity to resolve a potential issue before it damages the project.
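An issue log with the fields described above (owner, entry and resolution dates, details, and a showstopper flag for escalation) might be sketched as follows. The record layout and the sample issues are illustrative, not the actual Issues Tracking template:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Issue:
    """One row of a hypothetical issue log."""
    summary: str
    owner: str
    entered: date
    showstopper: bool = False
    resolved: Optional[date] = None
    resolution: str = ""

log = [
    Issue("Source DBA unavailable for profiling", "PM", date(2024, 4, 2),
          showstopper=True),
    Issue("Naming standard for staging tables", "Architect", date(2024, 4, 3),
          resolved=date(2024, 4, 5), resolution="Adopted existing DW convention"),
]

# Open showstoppers are the items that belong on the weekly status report
# and in front of the Project Sponsor.
escalate = [i for i in log if i.showstopper and i.resolved is None]
for i in escalate:
    print(f"ESCALATE: {i.summary} (owner: {i.owner}, open since {i.entered})")
```

The point of the structure is accountability: every issue carries an owner and an entry date, so nothing open can silently drop off the list.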
Even for relatively short projects, use the Project Close Report to finalize the project with a final status report detailing:
- What was accomplished.
- Any justification for tasks expected but not completed.
- Recommendations.
Prepare for the close by considering what the project team has learned about the environments, procedures, data integration design, data architecture, and other project plans. Formulate the recommendations based on issues or problems that need to be addressed. Succinctly describe each problem or recommendation and, if applicable, briefly describe a recommended approach.
Description
Determining the data integration requirements in satisfactory detail and clarity is a difficult task, especially while ensuring that the requirements are representative of all the potential stakeholders. This Best Practice summarizes the recommended interview and prioritization process for this requirements analysis.
Process Steps
The first step in the process is to identify and interview all major sponsors and stakeholders. This typically includes the executive staff and CFO, since they are likely to be the key decision makers who will depend on the data integration. At a minimum, figure on 10 to 20 interview sessions.

The next step is to interview representative information providers. These individuals include the decision makers who provide the strategic perspective on what information to pursue, as well as details on that information and how it is currently used (i.e., reported and/or analyzed).

Be sure to provide feedback to all of the sponsors and stakeholders regarding the findings of the interviews and the recommended subject areas and information profiles. It is often helpful to facilitate a Prioritization Workshop with the major stakeholders, sponsors, and information providers in order to set priorities on the subject areas.
Conduct Interviews
The following paragraphs offer some tips on the actual interviewing process. Two sections at the end of this document provide sample interview outlines for the executive staff and information providers.

Remember to keep executive interviews brief (i.e., an hour or less) and to the point. A focused, consistent interview format is desirable. Don't feel bound to the script, however, since interviewees are likely to raise some interesting points that may not be included in the original interview format. Pursue these subjects as they come up, asking detailed questions. This approach often leads to discoveries of strategic uses for information that may be exciting to the client and provide sparkle and focus to the project.

Questions to the executives or decision-makers should focus on what business strategies and decisions need information to support or monitor them (refer to Outline for Executive Interviews at the end of this document). Coverage here is critical: if key managers are left out, you may miss an important viewpoint and an important buy-in.

Interviews of information providers are secondary but can be very useful. These are the business analyst-types who report to decision-makers and currently provide reports and analyses, using Excel or Lotus or a database program to consolidate data from more than one source, provide regular and ad hoc reports, or conduct sophisticated analysis. In subsequent phases of the project, you must identify all of these individuals, learn what information they access, and how they process it. At this stage, however, you should focus on the basics, building a foundation for the project and discovering what tools are currently in use and where gaps may exist in the analysis and reporting functions.

Be sure to take detailed notes throughout the interview process.
If there are a lot of interviews, you may want the interviewer to partner with someone who can take good notes, perhaps on a laptop to save note transcription time later. It is important to take down the details of what each person says because, at this stage, it is difficult to know what is likely to be important.

While some interviewees may want to see detailed notes from their interviews, this is not very efficient since it takes time to clean up the notes for review. The most efficient approach is to simply consolidate the interview notes into a summary format following the interviews.

Be sure to review previous interviews as you go through the interviewing process. You can often use information from earlier interviews to pursue topics in later interviews in more detail and with varying perspectives.
The executive interviews must be carried out in business terms. There can be no mention of the data warehouse or systems of record, or of particular source data entities or issues related to sourcing, cleansing, or transformation. It is strictly forbidden to use any technical language. It can be valuable to have an industry expert prepare and even accompany the interviewer to provide business terminology and focus.

If the interview falls into technical details, for example, into a discussion of whether certain information is currently available or could be integrated into the data warehouse, it is up to the interviewer to re-focus immediately on business needs. If this focus is not maintained, the opportunity for brainstorming is likely to be lost, which will reduce the quality and breadth of the business drivers.

Because of the above caution, it is rarely acceptable to have IS resources present at the executive interviews. These resources are likely to engage the executive (or vice versa) in a discussion of current reporting problems or technical issues and thereby destroy the interview opportunity.

Keep the interview groups small. One or two Professional Services personnel should suffice, with at most one client project person. Especially for executive interviews, there should be one interviewee. There is sometimes a need to interview a group of middle managers together, but if there are more than two or three, you are likely to get much less input from the participants.
I. Introduction
II. Executive Summary
   A. Objectives for the Data Warehouse
   B. Summary of Requirements
   C. High Priority Information Categories
   D. Issues
III. Recommendations
   A. Strategic Information Requirements
   B. Issues Related to Availability of Data
   C. Suggested Initial Increments
   D. Data Warehouse Model
IV. Summary of Findings
   A. Description of Process Used
   B. Key Business Strategies (includes descriptions of processes, decisions, and other drivers)
   C. Key Departmental Strategies and Measurements
   D. Existing Sources of Information
   E. How Information is Used
   F. Issues Related to Information Access
V. Appendices
   A. Organizational structure, departmental roles
   B. Departmental responsibilities and relationships
- Agenda and Introductions
- Project Background and Objectives
- Validate Interview Findings: Key Issues
- Validate Information Needs
- Reality Check: Feasibility
- Prioritize Information Needs
- Data Integration Plan
- Wrap-up and Next Steps
Keep the presentation as simple and concise as possible, and avoid technical discussions or detailed sidetracks.
As much as possible, categorize the information needs by function, maybe even by specific driver (i.e., a strategic process or decision). Considering the information needs on a function-by-function basis fosters discussion of how the information is used and by whom.
- Definitions of each subject area, categorized by functional area
- Within each subject area, descriptions of the business drivers and information metrics
- Lists of the feasibility issues
- The subject area priorities and the implementation timeline
- Interviews to understand business information strategies and expectations
- Document strategy findings
- Consensus-building meeting to prioritize information requirements and identify quick hits
- Model strategic subject areas
- Produce multi-phase Business Intelligence strategy
III. Goals for this meeting
   A. Description of business vision, strategies
   B. Perspective on strategic business issues and how they drive information needs

The interviewee may provide this information before the actual interview. In this case, simply review with the interviewee and ask if there is anything to add.
A. How do corporate strategic initiatives impact your group? These may include MBOs (personal performance objectives) and workgroup objectives or strategies.
B. What do you see as the Critical Success Factors for an Enterprise Information Strategy?
C. What information do you need to achieve or support key decisions related to your business objectives?
D. How will your organization's progress and final success be measured (e.g., metrics, critical success factors)?
E. What information or decisions from other groups affect your success?
F. What are other valuable information sources (i.e., computer reports, industry reports, email, key people, meetings, phone)?
G. Do you have regular strategy meetings? What information is shared as you develop your strategy?
H. If it is difficult for the interviewee to brainstorm about information needs, try asking the question this way: "When you return from a two-week vacation, what information do you want to know first?"
I. Of all the information you now receive, what is the most valuable?
J. What information do you need that is not now readily available?
K. How accurate is the information you are now getting?
L. To whom do you provide information?
M. Who provides information to you?
N. Who would you recommend be involved in the cross-functional Consensus Workshop?
Goals for this meeting
   A. Understanding of how business issues drive information needs
   B. High-level understanding of what information is currently provided, to whom, and how it is processed

The interviewee may provide this information before the actual interview. In this case, simply review with the interviewee and ask if there is anything to add.
A. What information do you provide to help support or measure the progress/success of their key business decisions?
B. Of all the information you now provide, what is the most requested or most widely used?
C. What are your sources for the information (both in terms of systems and personnel)?
D. What types of analysis do you regularly perform (i.e., trends, investigating problems)?
E. How do you provide these analyses (e.g., charts, graphs, spreadsheets)?
F. How do you change/add value to the information?
G. Are there quality or usability problems with the information you work with? How accurate is it?
Description
Upgrading Data Analyzer involves two steps:

1. Upgrading the Data Analyzer application.
2. Upgrading the Data Analyzer repository.
For WebLogic:
1. Install WebLogic 8.1 without uninstalling the existing Application Server (WebLogic 6.1).
2. Install the Data Analyzer application on the new WebLogic 8.1 Application Server, making sure to use a different port than the one used in the old installation. When prompted for a repository, choose the existing-repository option and give the connection details of the database that hosts the backed-up old Data Analyzer repository.
3. When the installation is complete, use the Upgrade utility to connect to the database that hosts the backed-up Data Analyzer repository and perform the upgrade.
When all the reports open without problems, the upgrade can be considered complete. Once the upgrade is complete, repeat the above process on the actual repository.

Note: This upgrade process creates two instances of Data Analyzer, so when the upgrade is successful, uninstall the older version, following the steps in the Data Analyzer manual.
- Limiting development downtime.
- Ensuring that development work performed during the upgrade is accurately migrated to the upgraded environment.
- Testing the upgraded environment to ensure that data integration results are identical to the previous version.
- Ensuring that all elements of the various environments (e.g., Development, Test, and Production) are upgraded successfully.
Description
Some typical reasons for initiating a PowerCenter upgrade include:
- Additional features and capabilities in the new version of PowerCenter that enhance development productivity and administration.
- To keep pace with higher demands for data integration.
- To achieve process performance gains.
- To maintain an environment of fully supported software as older PowerCenter versions reach end-of-support status.
Upgrade Team
Assembling a team of knowledgeable individuals to carry out the PowerCenter upgrade is key to completing the process within schedule and budgetary guidelines. Typically, the team includes:

- PowerCenter Administrator
- Database Administrator
- System Administrator
- Informatica team - the business and technical users that "own" the various areas in the Informatica environment. These resources are required for knowledge transfer and testing during the upgrade process and after the upgrade is complete.
Upgrade Paths
The upgrade process details depend on which of the existing PowerCenter versions you are upgrading from and which version you are moving to. The following bullet items summarize the upgrade paths for the various PowerCenter versions:
- Direct upgrade from PowerCenter 6.x to 8.1.1
- Direct upgrade from PowerCenter 7.x to 8.1.1
- Direct upgrade from PowerCenter 8.0 to 8.1.1

Other versions:

- For version 4.6 or earlier: upgrade to 5.x, then to 7.x, and then to 8.1.1
- For version 4.7 or later: upgrade to 6.x and then to 8.1.1
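The paths above amount to a small lookup from starting version to hop sequence. The sketch below simply restates the bullets as data; the keys and helper function are illustrative, not part of any Informatica utility.

```python
# Hop sequence to 8.1.1, keyed by the release you are starting from.
# This table restates the upgrade paths listed above.
UPGRADE_PATHS = {
    "4.6 or earlier": ["5.x", "7.x", "8.1.1"],
    "4.7 or later":   ["6.x", "8.1.1"],
    "6.x":            ["8.1.1"],   # direct upgrade
    "7.x":            ["8.1.1"],   # direct upgrade
    "8.0":            ["8.1.1"],   # direct upgrade
}

def hops_to_811(start: str) -> list:
    """Return the sequence of releases to install, in order, to reach 8.1.1."""
    return UPGRADE_PATHS[start]

print(hops_to_811("4.6 or earlier"))  # ['5.x', '7.x', '8.1.1']
```

Planning the hop count up front matters because each intermediate release requires its own repository backup, upgrade, and regression test.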
Upgrade Tips
Some of the following items may seem obvious, but adhering to these tips should help to ensure that the upgrade process goes smoothly.
- Be sure to have sufficient memory and disk space (database) for the installed software. As new features are added to PowerCenter, the repository grows in size anywhere from 5 to 25 percent per release to accommodate the metadata for the new features. Plan for this increase in all of your PowerCenter repositories.
- Always read and save the upgrade log file.
- Back up the Repository Server and PowerCenter Server configuration files prior to beginning the upgrade process.
- Test the AEP/EP (Advanced External Procedure/External Procedure) prior to the upgrade.
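The 5 to 25 percent growth figure compounds when an upgrade path involves multiple hops, so a quick sizing estimate is worth doing before requesting database space. A minimal sketch, assuming a hypothetical 2 GB repository:

```python
def projected_size_mb(current_mb: float, releases: int, growth_rate: float) -> float:
    """Compound the per-release repository growth over a number of upgrade hops."""
    return current_mb * (1 + growth_rate) ** releases

# A hypothetical 2048 MB repository upgraded across two releases:
low = projected_size_mb(2048, 2, 0.05)    # 5% per release
high = projected_size_mb(2048, 2, 0.25)   # 25% per release
print(round(low), round(high))  # prints 2258 3200
```

Budgeting for the upper bound avoids a mid-upgrade tablespace extension.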
- PowerCenter 8.x and beyond require Domain Metadata in addition to the standard PowerCenter Repositories. Work with your DBA to create a location for the Domain Metadata Repository, which is created at install time.
- Ensure that all repositories for upgrade are backed up and that they can be restored successfully. Repositories can be restored to the same database in a different schema to allow an upgrade to be carried out in parallel. This is especially useful if the PowerCenter test and development environments reside in a single repository.
- When naming your nodes and domains in PowerCenter 8, think carefully about the naming convention before the upgrade. While changing the name of a node or the domain later is possible, it is not an easy task since it is embedded in much of the general operation of the product. Avoid using IP addresses and machine names for the domain and node names, since over time machine IP addresses and server names may change.
- With PowerCenter 8, a central location exists for shared files (i.e., log files, error files, checkpoint files, etc.) across the domain. If using the Grid option or High Availability option, it is important that this file structure is on a high-performance file system and viewable by all nodes in the domain. If High Availability is configured, this file system should also be highly available.
- All projects sharing a repository must upgrade at the same time (test concurrently).
- Projects using multiple repositories must all upgrade at the same time.
- After the upgrade, each project should undergo full regression testing.
When an upgrade is scheduled in conjunction with other development work, it is prudent to have it occur within a separate test environment that mimics (or at least closely resembles) production. This reduces the risk of unexpected errors and can decrease the effort spent on the upgrade. It may also allow the development work to continue in parallel with the upgrade effort, depending on the specific site setup.
Environmental Impact
With each new PowerCenter release, there is the potential for the upgrade to affect your data integration environment based on new components and features. The PowerCenter 8 upgrade changes the architecture from PowerCenter version 7, so you should spend time planning the upgrade strategy concerning domains, nodes, domain metadata, and the other architectural components with PowerCenter 8. Depending on the complexity of your data integration environment, this may be a minor or major impact. Single integration server/single repository installations are not likely to notice much of a difference to the architecture, but customers striving for highly available systems with enterprise scalability may need to spend time understanding how to alter their physical architecture to take advantage of these new features in PowerCenter 8. For more information on these architecture changes, reference the PowerCenter documentation and the Best Practice on Domain Configuration.
Upgrade Process
Informatica recommends using the following approach to handle the challenges inherent in an upgrade effort.
Alternatively, you can begin the upgrade process in the Development environment or create a parallel environment in which to start the effort. The decision to use or copy an existing platform depends on the state of project work across all environments. If it is not possible to set up a parallel environment, the upgrade may start in Development, then progress to the Test and Production systems. However, using a parallel environment is likely to minimize development downtime. The important thing is to understand the upgrade process and your own business and technical requirements, then adapt the approaches described in this document to one that suits your particular situation.
- Ensure that the PowerCenter client is installed on at least one workstation to be used for upgrade testing, and that connections to repositories are updated if parallel repositories are being used.
- Re-compile any Advanced External Procedures/External Procedures if necessary, and test them.
- The PowerCenter license key is now in the form of a file. During the installation of PowerCenter, you'll be asked for the location of this key file. The key should be saved on the server prior to beginning the installation process.
- When installing PowerCenter 8.x, you'll configure the domain, node, repository service, and the integration service at the same time. Ensure that you have all necessary database connections ready before beginning the installation process.
- If upgrading to PowerCenter 8.x from PowerCenter 7.x (or earlier), you must gather all of the configuration files that are going to be used in the automated process to upgrade the Integration Services and Repositories. See the PowerCenter Upgrade Manual for more information on how to gather them and where to locate them for the upgrade process.
- Once the installation has been completed, use the Repository Server Administration Console to perform the upgrade. Unlike previous versions of PowerCenter, in version 8 the Administration Console is a web application. The Administration Console URL is http://hostname:portnumber, where hostname is the name of the server where the PowerCenter services are installed and portnumber is the port identified during the installation process. The default port number is 6001.
- Re-register any plug-ins (such as PowerExchange) to the newly upgraded environment.
- You can start both the repository and integration services in the Admin Console.
- Analyze upgrade activity logs to identify areas where changes may be required, and re-run full regression tests on the upgraded repository.
- Execute test plans. Ensure that there are no failures and all the loads run successfully in the upgraded environment.
Verify the data to ensure that there are no changes and no additional or missing records.
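The record verification above is commonly reduced to comparing per-table row counts captured before and after the upgrade test runs. A minimal sketch; the table names and counts are hypothetical, and a production check would also compare checksums or column aggregates, not counts alone.

```python
def compare_counts(before: dict, after: dict) -> list:
    """Return (table, before_count, after_count) for every table whose counts differ."""
    mismatches = []
    for table in sorted(set(before) | set(after)):
        b = before.get(table, 0)   # a table missing on one side counts as 0
        a = after.get(table, 0)
        if a != b:
            mismatches.append((table, b, a))
    return mismatches

# Hypothetical counts, normally populated from SELECT COUNT(*) per target table
pre = {"CUSTOMER_DIM": 120_000, "ORDER_FACT": 4_500_000, "DATE_DIM": 3_650}
post = {"CUSTOMER_DIM": 120_000, "ORDER_FACT": 4_500_001, "DATE_DIM": 3_650}

print(compare_counts(pre, post))  # the extra ORDER_FACT row is flagged
```

An empty result means no additional or missing records at the row-count level; any mismatch warrants a row-level investigation before sign-off.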
As a best practice, make any required changes in the development environment and migrate them forward to test and production. Assess the changes when the results from the test runs are available. If you decide to deviate from best practice and make changes in test and migrate them forward to production, remember that you'll still need to implement the changes in development. Otherwise, these changes will be lost the next time work is migrated from development to the test environment.

When you are satisfied with the results of testing, upgrade the other environments by backing up and restoring the appropriate repositories. Be sure to closely monitor the production environment and check the results after the upgrade. Also remember to archive and remove old repositories from the previous version.
- If multiple nodes were configured and you own the PowerCenter Grid option, you can create a server grid to test performance gains.
- If you own the high-availability option, you should configure your environment for high availability, including setting up failover gateway node(s) and designating primary and backup nodes for your various PowerCenter services. In addition, your shared file location for the domain should be located on a highly available, high-performance file server.
- Start measuring data quality by creating a sample data profile.
- If LDAP is in use, associate LDAP users with PowerCenter users.
- Install PowerCenter Reports and configure the built-in reports for the PowerCenter repository.
Repository Versioning
After upgrading to version 8.x, you can set the repository to versioned if you purchased the Team-Based Management option and enabled it via the license key. Keep in mind that once the repository is set to versioned, it cannot be set back to non-versioned. You can invoke the team-based development option in the Administration Console.
- The folder with the highest version number becomes the current folder.
- Other versions of the folders are named folder_<folder_version_number>.
- Shortcuts are created to mappings from the current folder.

- Folder versions are no longer supported in pmrep and pmrepagent scripts.
- Ensure that the workflow/session folder names match the upgraded names.
- Note that the pmcmd command structure changes significantly after version 5. Version 5 pmcmd commands can still run in version 8, but may not be backwards-compatible in future versions.

- Version 8 supports XML schema. The upgrade removes namespaces and prefixes for multiple namespaces.
- Circular reference definitions are read-only after the upgrade.
- Some datatypes are changed in XML definitions by the upgrade.
For more information on the specific changes to the PowerCenter software for your particular upgraded version, reference the release notes as well as the PowerCenter documentation.
Description
The PowerExchange upgrade is essentially an installation with a few additional steps and some changes to the steps of a new installation. Planning a PowerExchange upgrade requires the same resources as the initial implementation. These include, but are not limited to:

- MVS systems operator
- Appropriate database administrator; this depends on what (if any) databases are going to be sources and/or targets (e.g., IMS, IDMS, etc.)
- MVS security resources
Since an upgrade is so similar to an initial implementation of PowerExchange, this document does not address the details of the installation. It addresses the steps that are not documented in the Best Practices installation document, as well as changes to existing steps in that document. For details on installing a new PowerExchange release, see the Best Practice PowerExchange Installation (for Mainframe).
paths on the client workstations and the PowerCenter server.

3. When executing the MVS Install Assistant and providing values on each screen, make sure the following parameters differ from those used in the existing version of PowerExchange:
   - Specify new high-level qualifiers for the PowerExchange datasets, libraries, and VSAM files. The value needs to match the qualifier used for the RUNLIB and BINLIB datasets allocated earlier. Consider including the version of PowerExchange in the high-level nodes of the datasets; an example could be SYSB.PWX811.
   - The PowerExchange Agent/Logger three-character prefix needs to be unique and differ from that used in the existing version of PowerExchange. Make sure the values on the Logger/Agent/Condenser Parameters screen reflect the new prefix.
   - For DB2, the plan name specified should differ from that used in the existing release.
4. Run the jobs listed in the XJOBS member in the RUNLIB.
5. Before starting the Listener, rename the DBMOVER member in the new RUNLIB dataset.
6. Copy the DBMOVER member from the current PowerExchange RUNLIB to the corresponding library for the new release of PowerExchange. Update the port numbers to reflect the new ports, and update any dataset names specified in the NETPORTS to reflect the new high-level qualifier.
7. Start the Listener and make sure the PING works. See the other document or the Implementation Guide for more details.
8. The existing Datamaps must now be migrated to the new release using the DTLURDMO utility. Details and examples can be found in the PWX Utilities Guide and the PWX Migration Guide.
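As an illustration of the DBMOVER port update described above, the copied member for the new release might carry statements of the following general shape. This is a hedged sketch only: the node names, host, and port values are placeholders, and the full statement set for your site will differ; consult the PowerExchange Reference Manual for the actual syntax your release supports.

```
/* Illustrative DBMOVER entries for the new PowerExchange release.   */
/* The Listener port must differ from the one the old release uses,  */
/* so both Listeners can run side by side during the upgrade.        */
LISTENER=(node1,TCPIP,13480)
NODE=(mvs1,TCPIP,mvshost,13480)
```

Keeping both releases on distinct ports is what allows the PING test in the next step to verify the new Listener without disturbing the production one.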
At this point, the mainframe upgrade is complete for bulk processing. For PowerExchange Change Data Capture or Change Data Capture Real-time, complete the additional steps in the installation manual. Also perform the following steps:

1. Use the DTLURDMO utility to migrate existing Capture Registrations and Capture Extractions to the new release.
2. Create a Registration Group for each source.
3. Open and save each Extraction Map in the new Extraction Groups.
4. Ensure the values for the CHKPT_BASENAME and EXT_CAPT_MASK parameters are correct before running a Condense.