
ETL Standards and Naming Conventions

Change History
Version No.   Date of Issue   Author            Comment
1.0                           Architecture      Initial
1.1           5/22/2008       Sharath Nitturi   Added sections
1.2           11/24/2008      Sharath Nitturi   Updated
1.3           12/04/2008      Sharath Nitturi   Updated


1. Introduction

The Informatica application requires several directories on the server, sometimes called application directories, for source files, target files, logs, and so on. A default installation keeps the application binaries and these directories in the same location. Combining binaries and application directories in this way poses a challenge, and sometimes a risk, in a multi-user development environment: the two have different requirements and need different access and security levels.

2. Requirements:

• Give application developers write access to the application directories, without exposing the binaries to them.
• Provide the same script path on dev/QA/prod. Using the same path across all three environments makes migration easier and less error prone. At present the dev path differs (and the FTP file system is also not yet aligned); this will be fixed.

3. Standards:

A. Informatica Servers and Repositories:


There will be four distinct environments: Development, QA, UAT, and Production.

Each of these environments will have its own Informatica server and associated repository. This provides the flexibility to support a multi-stream setup in which a development stream, a QA stream, and a user-acceptance stream can exist in parallel. It also brings the need to manage object migration between the individual environments (repositories).

B. Folders:

Folders are set up according to the subject areas defined in the application groups. The migration of folders and objects therefore becomes critical: while migrating an entire folder is the preferred approach when changes need to be propagated between environments, individual object migration must also be supported.

There should be a common folder called <FolderName>_SHARED that hosts shared objects such as mapplets. This provides reusability and encapsulates common business logic across subject areas. Take care to migrate the <FolderName>_SHARED folder first, before migrating any dependent folders across environments.

When folder migrations are performed, use the XML export/import method; likewise, use XML export/import for individual object migrations. In a non-versioned repository, the object migration method is preferred and recommended.

The primary disadvantage of the folder copy method is that the repository is locked while the folder copy is being performed. It is therefore necessary to schedule the migration during a time when the repository is least utilized. Keep in mind that a locked repository means that NO jobs can be launched during this process, which can be a serious consideration in real-time or near-real-time environments.

Another method that can be considered for migration is deployment groups. It requires further expertise in Informatica administration (refer to the manual) and is used when the repository is versioned.

C. Connection Objects:

1. The database connections in Workflow Manager should be logical and generic.

Examples: PMX rather than PMX_edsi_ora_Dev, or GLASS rather than GLASS_DEV. Environment-specific names would otherwise stick with us in QA and production.

D. Mappings:
1. Mapping names should be no more than 50 bytes. As a guideline, the standard is m_<SourceName>_<Target>_<Description> rather than a mixed-case name such as “m_myMappingName”. The description should be meaningful, but it must not repeat database types or database names. Names may be in lower case or upper case (upper case preferred), but never mixed case: session logs inherit mapping names, file names on the UNIX server are case sensitive, and long, mixed-case log file names are hard to follow.

Examples: m_OCC_GMI_OPTIONS
m_AML_S3D_SECURITY_UPD

2. Transformation names should follow the conventions in the Informatica help. When multiple shortcuts to the same transformation (such as a lookup) are used in a mapping, each should be given a name that indicates its purpose.
3. The database short names in sources (derived from ODBC names) should be in UPPER CASE and generic, e.g., AML instead of AML_DEV or AML_ORACLE.
4. As far as possible, the source qualifier for a database table should NOT have an override query. If a filter is needed, add the filter rather than generating a query. Where an override query is genuinely required, it must be discussed and handled with care.
5. It is good practice to place an Expression transformation after the source qualifier and another before the target. Without them, remembering and connecting the ports to the target is painful.
6. Leverage shortcuts into shared folders.
7. If object-level migration is performed, use the XML export/import method.
8. Use parameters and variables in mappings to keep them as generic as possible. Reusability can then be achieved by changing the parameter files and the corresponding values within them (see the parameter file sketch after the table below).
9. The following table summarizes the naming standards for the mapping transformations.

Transformation Object: Suggested Naming Convention

Application Source Qualifier: ASQ_{TransformationName}_{SourceTable1}_{SourceTable2}. Represents data from an application source.
Expression: EXP_{Function}, reflecting the expression used, and/or a name that describes the processing being done.
Custom: CT_{TransformationName}, describing the processing being done.
Sequence Generator: SEQ_{Descriptor}; if generating keys for a target table entity, refer to that entity.
Lookup: LKP_{LookupTableName}, or the item being obtained by the lookup, since there can be different lookups on a single table.
Source Qualifier: SQ_{SourceTable1}_{SourceTable2}. Using all source table names can be impractical if a source qualifier covers many tables, so refer to the type of information being obtained, for example SQ_SALES_INSURANCE_PRODUCTS.
Aggregator: AGG_{Function}, reflecting the expression used, or a name that describes the processing being done.
Filter: FIL_ or FILT_{Function}, reflecting the expression used, or a name that describes the processing being done.
Update Strategy: UPD_{TargetTableName(s)}, reflecting the expression used, or a name that describes the processing being done. If the update strategy performs inserts or updates only, add ins/upd to the name, e.g., UPD_UPDATE_EXISTING_EMPLOYEES.
MQ Source Qualifier: SQ_MQ_{Descriptor}, defining the message being selected.
Normalizer: NRM_{TargetTableName(s)}, reflecting the expression used, or a name that describes the processing being done.
Union: UN_{Descriptor}
Router: RTR_{Descriptor}
XML Generator: XMG_{Descriptor}, defining the target message.
XML Parser: XMP_{Descriptor}, defining the message being selected.
XML Source Qualifier: XMSQ_{Descriptor}, defining the data being selected.
Rank: RNK_{TargetTableName(s)}, reflecting the expression used, or a name that describes the processing being done.
Stored Procedure: SP_{StoredProcedureName}
External Procedure: EXT_{ProcedureName}
Joiner: JNR_{SourceTable/FileName1}_{SourceTable/FileName2}, or a more general description of the content in the data flows, since joiners are not used only for pure joins between heterogeneous source tables and files.
Target: TGT_{TargetName}
Mapplet: mplt_{Description}
Mapping: m_{Target}_{Descriptor}
Email Object: email_{Descriptor}
Variable ports in Expression transformations: V_{ColumnName}, where the column name reflects the functionality of the port.
Output-only ports: O_{ColumnName}, where the column name reflects the functionality of the port; used for ports defined as output only, as opposed to ports that are both input and output.

(Source: Velocity@Informatica)
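As an illustration of item 8 above and of the .par files described in sections F and 4, a minimal parameter file sketch follows. The folder name is hypothetical; the workflow and session names are derived from the m_OCC_GMI_OPTIONS example above, the connection names come from section C, and the section-header layout is the standard PowerCenter parameter file form.

    [GLCT_OPTIONS.WF:WF_OCC_GMI_OPTIONS.ST:s_m_OCC_GMI_OPTIONS]
    $DBConnection_Source=PMX
    $DBConnection_Target=GLASS
    $$COB_DATE=12/01/2008
    $$LOAD_TYPE=DAILY

Saved under the ParamFiles directory with a .par extension, such a file can then be referenced from the session or workflow properties, so the same mapping can be reused simply by changing the parameter values.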

E. Workflows:

1. Workflow names should be no more than 50 bytes. As a guideline, the standard is WF_<mapping_name> (without the m_ prefix) rather than a mixed-case name. If the same mapping/session is used for both a daily and a weekly run, the session should be of the reusable type, and the workflow name should indicate the purpose, e.g., WF_<wfname>_DAILY.
2. Pre/post-session commands and other workflow tasks should not contain hard-coded (absolute) paths. Commands should instead use environment/session variables, e.g., $PMDIR/<script name>. The environment variables are sourced from the .profile file of the pwrmart user on UNIX (see the examples after the table below).
3. Script names should carry an extension: .ksh for shell scripts and .pl for Perl scripts.
4. Do not put jobs such as FTP or update scripts in post-session commands, because they cannot be re-run from the monitor. For example, if “ftp” and “archive” are run as two post-session commands and the archive job fails, the task cannot be re-run, because the file has already been FTP'ed to the client. Group the jobs judiciously and logically.
5. The links between workflow tasks should carry a condition checking that the previous task succeeded, e.g., $s_m_pme_option_pme_flat.PrevTaskStatus = SUCCEEDED (see the examples after the table below).
6. Most importantly, lookup transformation connections should, as far as possible, use the $Source or $Target database connection rather than individually assigned database connections. The session inherits the $Source/$Target names from the mapping.
7. If object-level migration is performed, use the XML export/import method to migrate workflows across environments.
8. Parameter files should be used to capture project-specific inputs to mappings.
9. The following table summarizes the naming standards for workflow objects.

Workflow Object: Suggested Naming Convention

Session: s_{MappingName}
Command Object: cmd_{Descriptor}
Worklet: wk_ or wklt_{Descriptor}
Workflow: wkf_ or wf_{WorkflowDescriptor}
Email Task: email_ or eml_{EmailDescriptor}
Decision Task: dcn_{ConditionDescriptor}
Assignment Task: asgn_{VariableDescriptor}
Timer Task: timer_ or tim_{Descriptor}
Control Task: ctl_{WorkflowDescriptor}. Use the Control task to specify when and how the PowerCenter Server should stop or abort a workflow.
Event-Wait Task: wait_ or evtw_{EventDescriptor}. The Event-Wait task waits for an event to occur; once the event triggers, the PowerCenter Server continues executing the rest of the workflow.
Event-Raise Task: raise_ or evtr_{EventDescriptor}. The Event-Raise task represents a user-defined event; when the PowerCenter Server runs the task, it triggers the event. Use the Event-Raise task together with the Event-Wait task to define events.

(Source: Velocity@Informatica)
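Two small illustrations of items 2 and 5 above. The script name comes from section F and the session name from item 5; the exact values are only examples.

    Post-session command (uses the environment variable, not an absolute path):
        $PMDIR/weekly_options.ksh

    Link condition from the session to the next task:
        $s_m_pme_option_pme_flat.PrevTaskStatus = SUCCEEDED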

F. Scripts:

1. The environment variable script cdw_infa.env is created to hold all high-level variables required for archive directories, logs, etc. It is located under $HOME.
2. Do not create too many .env files; append to the cdw_infa.env file instead. Before creating new variables, check whether existing ones can be used.
3. Provide a footprint in each script: who created it, its purpose, assumptions, etc. (see weekly_options.ksh).
4. Do not hard-code paths to scripts, parameter files, or anything else within the scripts. Use environment variables (sourced from the cdw_infa.env file) instead.
5. If multiple scripts need to be run (e.g., concatenate, archive), create a driver script that calls them with error checking (see the sketch after this list).
6. Every driver script should source the cdw_infa.env file at the beginning; for example, in Korn shell scripts, ". $HOME/cdw_infa.env". If file names are passed as parameters, the driver script should pass the complete path (not hard coded), e.g., the input file as $PMTGTDIR/<target file>.
7. Group the scripts logically; everything does not have to be cluttered under the Scripts directory. Subdirectories or libraries can be created, but driver scripts should stay under Scripts.
8. All related scripts should log to the same file; this makes it much easier for production support to go through in case of failures (imagine at midnight). Log the output with the "tee" command, which is also useful when executing or testing from the command line.
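A minimal driver script sketch along the lines of items 5, 6, and 8 follows. Only cdw_infa.env, $HOME, $PMTGTDIR, the Logs directory, and the use of tee come from this document; the step script names, the log file name, and the placement of the step scripts under a Scripts directory beneath $OGC_HOME are assumptions.

    #!/bin/ksh
    # drv_weekly_options.ksh - hypothetical driver script (sketch only)
    # Purpose     : run the concatenate and archive steps with error checking
    # Author      : <developer>
    # Assumptions : cdw_infa.env defines OGC_HOME and PMTGTDIR; the two step
    #               scripts called at the bottom are placeholders

    . $HOME/cdw_infa.env                          # always source the env file first

    LOGFILE=$OGC_HOME/Logs/weekly_options_drv.log # related scripts share one log

    run_step()
    {
        echo "`date` running: $*" | tee -a $LOGFILE
        "$@" >> $LOGFILE 2>&1                     # step output goes to the shared log
        rc=$?
        if [ $rc -ne 0 ]; then
            echo "`date` FAILED : $* (rc=$rc)" | tee -a $LOGFILE
            exit $rc                              # stop here so the failed step can be re-run
        fi
        echo "`date` OK      : $*" | tee -a $LOGFILE
    }

    run_step $OGC_HOME/Scripts/concat_files.ksh  $PMTGTDIR/target_file
    run_step $OGC_HOME/Scripts/archive_files.ksh $PMTGTDIR/target_file

Writing each step's output straight to the shared log keeps the step's exit status reliable, while tee keeps the status lines visible when the script is run or tested from the command line.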

G. Documentation:

1. Design document name:

GLCT_ETL_Design Specifications_<SourceName>_<TargetTable>_<Description>.doc

2. Production run book:

Provide reasonable documentation of the mappings/workflows; this can serve as the production run book (a separate template is available). It should consist of the following:
• purpose and logic of the programs
• flow of transformations
• error handling
• source/lookup/target database connections for dev/QA/prod
• where and what to check in the log files

H. Dates Used in ETL:


1. Effective Date: defaults to 01/01/1900.
2. Expiration Date: defaults to 12/31/9999.
3. COB Date: COB Date = Load Date; its dim key is taken from the “Load Date Dim”.

I. Dim Key Logic:


1. Value -1 = “UNKNOWN”: the source input value is null, so the dim key is considered not found in the lookup (source data quality issue).
2. Value -2 = “NOT APPLICABLE”: the dim key is not mapped.
3. Value -3 = “UNRESOLVED”: the source input value is not null, but the value is not found in the lookup, so the dim key is considered not found (CDS data quality issue).
4. Value -4 = not yet decided.
5. Fact key: COB Date concatenated with the generated surrogate key (see the sketch after this list). For example, if the generated key is 10000 and the COB Date is 12/01/2008, the fact key is 1201200810000.
6. Dim key: the generated surrogate key.
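The fact key rule in item 5 can be checked with a few illustrative shell lines; the variable names are hypothetical, and in the actual mappings this concatenation is done in a transformation.

    COB_DATE=12/01/2008
    SURROGATE_KEY=10000
    FACT_KEY=`echo $COB_DATE | tr -d '/'`$SURROGATE_KEY   # drop the slashes, then append the key
    echo $FACT_KEY                                         # prints 1201200810000, matching the example above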
J. Reference Data:
1. Reference data: if reference data will be looked up multiple times, build a cache file and perform the lookup against the cache instead of the table.
2. All natural keys of reference data (Account, Party, Organization, etc.) should be converted to upper case.
K. Staging Tables:
1. Staging Table Name: <SourceName>_<FeedName>_STG
2. Archive Table Name: <SourceName>_<FeedName>_STG_ARC

L. Developer Corner - Best Practices

(Reproduced from the Informatica web site)


This article provides a step-by-step approach to troubleshooting issues that arise when using an Oracle database with PowerCenter.
Background
PowerCenter connects to Oracle from the client tools (when importing database object definitions), the Repository Server (when the repository is in Oracle), and the PowerCenter Server (when the server accesses one or more database objects in Oracle).
This troubleshooting section focuses on cases where connectivity fails and the steps to take in those cases.
Connectivity
The following steps should be taken when there are issues connecting to Oracle from a PowerCenter component (client, server, or repository server).

General
• Confirm that the Oracle client is installed on the machine where PowerCenter is installed.
• Confirm that Oracle connectivity can be established: using the same user that starts PowerCenter, connect to the Oracle database with SQL*Plus (do not use other tools such as Toad). See the sketch after this list.
• Confirm that the type of Oracle client libraries matches the type of PowerCenter libraries: only 32-bit Oracle client libraries can be used with 32-bit PowerCenter, and only 64-bit Oracle client libraries with 64-bit PowerCenter.
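A quick check along the lines of the bullets above might look like the following; the user, password, and TNS alias are placeholders.

    # Confirm the Oracle client is installed and on the PATH
    which sqlplus
    sqlplus -V                                  # prints the SQL*Plus client version

    # Confirm connectivity as the same UNIX user that starts PowerCenter (e.g., pwrmart)
    sqlplus <db_user>/<password>@<tns_alias>    # if SQL*Plus cannot connect, PowerCenter will not connect either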
PowerCenter Server and Repository Server
• Confirm that ORACLE_HOME points to the Oracle installation directory.
• UNIX: confirm that the library path (LD_LIBRARY_PATH, LIBPATH, etc.) points to the Oracle client libraries.
• Confirm that ORACLE_HOME/bin is in the PATH environment variable.
• If there is a TNS_ADMIN environment variable, confirm that it points to the directory containing the correct tnsnames.ora file.
• Confirm that the NLS_LANG environment variable is set to the correct value for reading and writing data; the client character set must match the setting of the Oracle database.
• If there is an Oracle end-of-file-on-communication-channel error, check the Oracle logs for details.
• HP-UX: confirm that the Oracle 9i client is being used when connecting to an Oracle 9i database. On HP-UX there are known issues with using the Oracle 8i client against an Oracle 9i database.
• If multiple Oracle clients are installed on the same machine, confirm that the path and library path environment variables are set to the correct Oracle client directory.
• Check the Oracle listener status from the command line (see the sketch after this list).
• A TNS packet failure error is usually due to a network error; contact the DBA. There is also a known product issue that can cause this error.
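A sketch of the checks above. The listener command is not given in the original text; lsnrctl status is assumed to be the intended command, and <tns_alias> is a placeholder.

    # Environment checks for the user running the PowerCenter/Repository Server
    echo $ORACLE_HOME                       # Oracle installation directory
    echo $LD_LIBRARY_PATH                   # (or LIBPATH on AIX) should include the Oracle client libraries
    echo $PATH | grep "$ORACLE_HOME/bin"    # ORACLE_HOME/bin must be on the PATH
    echo $TNS_ADMIN                         # directory holding the correct tnsnames.ora
    echo $NLS_LANG                          # client character set must match the database

    # Listener and network checks (assumed commands)
    lsnrctl status                          # listener status, run on the database server
    tnsping <tns_alias>                     # round-trip time from this machine to the listener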
PowerCenter Client
• Confirm that the Data Direct Oracle ODBC drivers (supplied by Informatica) are
being used and not any other ODBC drivers.
• Confirm that the correct relational connection information is provided in
Workflow Manager.
• Confirm that the definition of the source or target (or other database object) in the
repository matches the table (or function, etc.) definition in the database.
Performance
The following lists reasons for poor performance and failures when using PowerCenter
with Oracle.
• Performance will be drastically reduced if Oracle tracing is turned on in
tnsnames.ora. In this scenario each and every packet sent from client to server is traced.
• Logs to check on the Oracle side in case of failure: trace logs and alert logs.
• Running tnsping against the connect string can be used to determine the speed of the connection between the client and the server.
• If the PowerCenter server is on the same machine as the Oracle database, Oracle IPC connectivity can be used (this is faster than TCP/IP).
• Confirm that a recommended version of the Oracle client is being used (versions 8.1.7.4 and 9.2.0.5 or later are recommended).
Compatibility
The following lists issues that can arise when connecting to Oracle.
Datatypes
Confirm that the Oracle datatype is supported in PowerCenter (Timestamp is currently unsupported).
Triggers
If a trigger is used on an Oracle target table, commit behavior will be affected: it may commit more often than the specified commit interval.
Oracle Events
Check for events in the init.ora file and turn them off. Note: this might cause behavioral changes.
Bulk mode
To run a session in bulk mode, the minimum Oracle client version is 8.1.7.0. Bulk mode cannot be used if there are indexes on the target table.

Informatica Installation (with the team):

• Give application developers write access to the application directories, without exposing the binaries to them.
• Currently, developers are given sudo access to 'pwrmart'. Once they have pwrmart, they have rights on the binaries, which is not acceptable; the developers may need to be given a different group (see the sketch after this list).
• All jobs run under pwrmart, so the logs/files are created by pwrmart. Developers need to access them in the dev environment, so the users need r/w/x rights on the application directories through the groups.
• The binaries consume a fixed amount of disk space. Disk space for the application directories varies and increases over time, depending on many parameters: the type and number of jobs, how much history is kept, etc.
• Provide the same script path on dev/QA/prod; using the same path across all three environments makes migration easier and less error prone. At present, the path on dev is '/mif-as1/u01/pwrmart/informatica', while on QA/prod it is '/inform-as1/pwrmart/informatica/Server'.
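One possible way to implement the group-based access described above is sketched below. The group name infadev is hypothetical, the directory paths are those given in section 4, and the exact user-administration command syntax varies by UNIX flavor.

    # Run as root: create a developer group and grant it access to the
    # application directories while keeping the binaries closed (sketch only)
    groupadd infadev                          # hypothetical developer group
    usermod -G infadev <developer_login>      # add each developer (flag syntax varies by platform)

    chgrp -R infadev /inform-as1/pwrmart/informatica/app
    chmod -R g+rwX   /inform-as1/pwrmart/informatica/app      # group read/write, execute on directories
    chmod -R o-rwx   /inform-as1/pwrmart/informatica/Server   # binaries not open to others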

4. Changes:

a. The Informatica home directory is /inform-as1/pwrmart/informatica.
b. A soft link /inform-as1 is created, pointing to /mif-as1/u01 (see the sketch after this list). This makes migration easier, since the same path /inform-as1/… is referenced on dev/QA/prod. Henceforth, development scripts inside or outside Informatica should not refer to /mif-as1.
c. The Informatica binaries remain, as before, under /inform-as1/pwrmart/informatica/Server; developers do not need to go to the binaries. All other application directories are located under /inform-as1/pwrmart/informatica/app/. Developers will be given access to this directory for maintenance and clean-up of session logs, bad files, etc.
d. The following environment variables are created: $PMSERVER (refers to /inform-as1/pwrmart/informatica/Server) and $OGC_HOME (refers to /inform-as1/pwrmart/informatica/app). They are available to the pwrmart user (under which the Informatica server runs, and hence under which all files are created).
e. All server parameters will be updated by the administrator to reflect the paths for logs, sources, targets, and scripts.
f. A separate directory, ParamFiles, is created under $OGC_HOME. Any and all parameter files should be located there and named with a .par extension.
g. All logs (SessLogs, WorkflowLogs, ScriptLogs) are located under the $OGC_HOME/Logs directory.
h. User folders will be named “~<ntlogin>”.
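The soft link and environment variables in items b and d might be set up roughly as follows; this is a sketch, and the .profile fragment is illustrative.

    # One-time setup (as root): same /inform-as1 path on dev as on QA/prod
    ln -s /mif-as1/u01 /inform-as1

    # In ~pwrmart/.profile (sourced before the Informatica server starts)
    PMSERVER=/inform-as1/pwrmart/informatica/Server
    OGC_HOME=/inform-as1/pwrmart/informatica/app
    export PMSERVER OGC_HOME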
