
ETL Best Practices for IBM DataStage

Version 1.2 Date: 04-10-2008

Submitted by

Approved by: <name> Date: <mm-dd-yyyy>



CONTENTS
1 Introduction
  1.1 Objective
  1.2 Scope
2 IBM WebSphere DataStage
3 DataStage Job Types
  3.1 Server
  3.2 Parallel
  3.3 Mainframe
  3.4 Server vs. Parallel
4 DataStage Stages
  4.1 Sequential File
  4.2 Complex Flat File Stage
  4.3 DB2 UDB Stages
    4.3.1 DB2 UDB API
    4.3.2 DB2 UDB Load
    4.3.3 DB2 Enterprise
    4.3.4 Database Operations
  4.4 FTP Enterprise Stage
  4.5 Lookup vs. Join
  4.6 Transformer
  4.7 Sort Stage
  4.8 DataSet
  4.9 Change Data Capture Stage
  4.10 CDC vs. UPSERT mode in DB2 stages
  4.11 Parameter Sets
  4.12 Slowly Changing Dimension Stage
5 Job Representation
6 Performance Tuning
  6.1 General
  6.2 Sequential File Stage
  6.3 Complex Flat File Stage
  6.4 DB2 UDB API
  6.5 DB2 Enterprise Stage



Revision History

Version 1.0 (3-21-2008)
Version 1.1 (3-31-2008): Added examples. Reason for change: LabCorp review comments.
Version 1.2 (4-10-2008): Added examples and details on sections. Reason for change: LabCorp review comments.

Affected Groups: Enterprise Results Repository Team, Wipro ETL COE team


1 Introduction
Laboratory Corporation of America, referred to as LabCorp, has invited Wipro Technologies to set up an ETL Center of Excellence at LabCorp. In doing so, Wipro will understand the existing ETL architecture and processes and recommend best practices. Wipro will also mentor the LabCorp ETL development team in implementing the best practices on a project chosen for the Proof of Concept.

1.1 Objective
The purpose of this document is to suggest DataStage best practices to ETL developers for the Results Repository implementation. It is assumed that the developers understand the stages and the terminology used in this document. In preparing it, the Wipro team considered all the components involved in the Enterprise Results Repository.

1.2 Scope
The components involved in the Enterprise Results Repository are:
Input: Flat files and DB2 tables
Output: DB2 tables
Reject Handling: Flat files
Restartability: Datasets/flat files will be used as intermediate sources
Business Logic: Identify the records for insert and update; check for referential integrity

The DataStage stages discussed in this document are all the stages required to cover the components mentioned above. What each stage does and why a particular stage should be chosen are discussed in detail. For a few of the components, multiple implementations have been discussed; the most suitable implementation should be selected based on the considerations mentioned.


2 IBM WebSphere DataStage


The Results Repository project has been chosen for the proof of concept and is currently in the development phase.

IBM WebSphere DataStage, a core component of the IBM WebSphere Data Integration Suite, enables you to tightly integrate enterprise information, despite having many sources or targets and short time frames. Whether you are building an enterprise data warehouse to support the information needs of the entire company, building a "real time" data warehouse, or integrating dozens of source systems to support enterprise applications like customer relationship management (CRM), supply chain management (SCM), and enterprise resource planning (ERP), WebSphere DataStage helps ensure that you will have information you can trust.

Although primarily aimed at data warehousing environments, DataStage can also be used in any data handling, data migration, or data reengineering project. It is basically an ETL tool that simplifies the data warehousing process. It is an integrated product that supports extraction of the source data, cleansing, decoding, transformation, integration, aggregation, and loading of target databases. DataStage has the following features to aid the design and processing required to build a data warehouse:
Uses graphical design tools. With simple point-and-click techniques you can draw a scheme to represent your processing requirements.
Extracts data from any number or type of database.
Handles all the metadata definitions required to define your data warehouse. You can view and modify the table definitions at any point during the design of your application.
Aggregates data. You can modify SQL SELECT statements used to extract data.
Transforms data. DataStage has a set of predefined transforms and functions you can use to convert your data. You can easily extend the functionality by defining your own transforms.
Loads the data warehouse.


3 DataStage Job Types


3.1 Server
These are compiled and run on the DataStage server. A server job will connect to databases on other machines as necessary, extract data, process it, and then write the data to the target data warehouse.

3.2 Parallel
These are compiled and run on the DataStage server in a similar way to server jobs, but support parallel processing on SMP (Symmetric Multiprocessing), MPP (Massively Parallel Processing), and cluster systems. Parallel jobs can significantly improve performance because the different stages of a job are run concurrently rather than sequentially. There are two basic types of parallel processing, pipeline and partitioning.

3.3 Mainframe
These are available only if you have Enterprise MVS Edition installed. A mainframe job is compiled and run on the mainframe. Data extracted by such jobs is then loaded into the data warehouse.

3.4 Server vs. Parallel


DataStage offers two distinct types of technology, each with its own costs and benefits. When deciding whether to use DataStage Server or DataStage Parallel, you need to assess the strengths of each and apply the type that best suits the needs of the particular module under development. The acquired license covers both. Several factors determine whether an interface will be constructed using DataStage Server or DataStage Parallel:
Volume
Complexity
Batch window

As a general guide, larger volume jobs (2 million rows or more) with a small batch window constraint (30 minutes or less) should be considered for Parallel jobs. This is an important decision because Parallel jobs consume more hardware resources (memory and processors) than Server jobs, so it is advisable to implement simpler jobs in Server and complex jobs in Parallel, and then manage them using a Sequencer (another DataStage component used to manage the flow of multiple jobs). Complexity depends on the business rules to be applied in the job and the kind of table it will load. A job that loads a fact table may be categorized as complex, as it involves one or more Lookup, Join, Merge, or Funnel stages. A dimension table load is simple if all the data is available in one data source. It is difficult to quantify complexity, as it is very relative; as a rule of thumb, if the business rules dictate the use of more than one source of data to load a table, the job may be considered complex. If the load window is small, the data volumes are high, and the business rules are complex, always prefer a parallel job to a server job.


4 DataStage Stages
4.1 Sequential File
The Sequential File stage is used to read from or write to flat files. The files can be fixed length or delimited. The stage can operate on a single file or on a set of files using a file pattern. When reading or writing a single file the stage operates in sequential mode; when the operation is performed on multiple files, the mode of operation is parallel.

Reading from and Writing to Fixed-Length Files
Particular attention must be paid when processing fixed-length fields using the Sequential File stage:
If the incoming columns are variable-length data types (e.g. Integer, Decimal, Varchar), the field width column property must be set to match the fixed width of the input column. Double-click on the column number in the grid dialog to set this column property.
If a field is nullable, you must define the null field value and length in the nullable section of the column property. Double-click on the column number in the grid dialog to set these properties.
When writing fixed-length files from variable-length fields (e.g. Integer, Decimal, Varchar), the field width and pad string column properties must be set to match the fixed width of the output column. Double-click on the column number in the grid dialog to set these column properties.

Reading Bounded-Length VARCHAR Columns
Care must be taken when reading delimited, bounded-length Varchar columns (Varchars with the length option set). By default, if the source file has fields with values longer than the maximum Varchar length, the extra characters will be silently truncated.

Converting Binary to ASCII
Select the Export EBCDIC as ASCII option to specify that EBCDIC characters are written as ASCII characters. This applies to fields of the string data type and to records.

Reject Links
The reject link can be used to write records that do not satisfy the specified format to a reject file. For writing files, the link uses the column definitions of the input link. For reading files, the link uses a single column called rejected, containing the raw data of columns rejected after reading because they do not match the schema. The data written to the reject file is raw binary.

4.2 Complex Flat File Stage


The Complex Flat File (CFF) stage is a file stage. You can use the stage to read a file or write to a file, but you cannot use the same stage to do both. As a source, the CFF stage can have multiple output links and a single reject link. You can read data from one or more complex flat files, including MVS data sets with QSAM and VSAM files. You can also read data from files that contain multiple record types.



For reading a single file, the CFF stage is no different from the Sequential File stage; use the default options on the File Options and Record Options tabs. When reading a file with varying record lengths and writing the output to multiple links, the column list for each record type should be identified in the Records tab. The recommendations for the Sequential File stage apply to this stage as well. Choose CFF over the Sequential File stage only if you intend to read files with varying record lengths; the Sequential File stage can achieve everything else that a CFF does. For example: if the TEST data lab file contains records for test, test order, and test result, the records will be of varying length. A test record might have 3 columns, a test order 4 columns, and a test result 5 columns, and you want to divert the 3-column records to output3, the 4-column records to output4, and the 5-column records to output5.

The following snapshots represent a typical implementation for reading a COBOL file with varying record lengths using CFF. On the Stage page, specify information about the stage data:
a. On the File Options tab, provide details about the file that the stage will read. For a COBOL file with varying records, select the options as shown in the snapshot below:

Read from multiple nodes: Select this check box if you want to speed up the read operation by using multiple nodes.


Report progress: Select this check box to display a progress report at each 10% interval when the stage can determine the file size.
Missing file mode: Specifies the action to take if a file to be read does not exist. It should be set to Error to stop processing if the file does not exist.
Reject mode: Specifies the action to take if any source records do not match the specified record format, or if any records are not written to the target file (for target stages). It should be set to Save so that a reject file is created through the reject link.
Filter: Allows you to use a UNIX command to split input files as the data is read from each file. This makes file reading faster by using parallelism and multiple nodes.
b. On the Record Options tab, describe the format of the data in the file. For COBOL files, the typical settings are shown in the snapshot below:

c. If the stage is reading a file that contains multiple record types, create record definitions for the data on the Records tab, then create or load the column definitions. To load the columns for COBOL files, import the column definition file (.cfd).


On the Output tab, specify the columns that should be mapped to the output link or stage.

4.3 DB2 UDB Stages


4.3.1 DB2 UDB API
This stage allows you to read from or write to a DB2 database table. The execution mode for the DB2 API stage is sequential by default. When data is read using a DB2 API stage, the data is loaded entirely onto the coordinating node only. If the execution mode is set to Parallel for the DB2 API stage, the data is duplicated from the coordinating node to all the participating nodes. Unless you have business scenarios that require such data duplication, set the DB2 API stage to run sequentially. The DB2 API plug-in should only be used to read from and write to DB2 on other, non-UNIX platforms. You might, for example, use it to access mainframe editions through DB2 Connect.

4.3.2 DB2 UDB Load


This is a bulk load option. If the job writes 2 GB of data or more, consider using this stage instead of the DB2 API stage. The Restart Count option can be used for restartability. By default it is set to 0 to signal that the load has to start at row 1. For example: when loading 10,000 records to a table using the DB2 Load stage, if the job aborts after 5,000 records, the job can be rerun with the restart count set to 5000. This ensures that the job restarts the load from record 5001. For better code management, this value can be passed as a parameter.



4.3.3 DB2 Enterprise
This stage executes in parallel and handles partitioned data properly. Consider it over the DB2 API stage if you intend to read large volumes in partitions. IBM DB2 databases distribute data across multiple partitions, and the DB2/UDB Enterprise stage can match this partitioning when reading data from or writing data to an IBM DB2 database. By default, the DB2 Enterprise stage partitions data in DB2 partition mode, i.e., it takes the partitioning method from the selected IBM DB2 database. The DB2 Data Partitioning Feature (DPF) offers the scalability needed to distribute a large database over multiple partitions (logical or physical). ETL processing of large volumes of data across whole tables is very time-expensive using traditional plug-in stages like DB2 API. The DB2 Enterprise stage, however, provides a parallel execution engine, using direct communication with each database partition to achieve the best possible performance. DataStage starts up processes across the ETL and DB2 nodes in the cluster. The DB2/UDB Enterprise stage passes data to and from each DB2 node through the DataStage parallel framework, not the DB2 client. The parallel execution instances can be examined from the job monitor of the DataStage Director. If the setup to support the DB2 Enterprise stage is in place, i.e., both the DB2 database and the DataStage server are on the same platform, then always prefer the Enterprise stage over the API and Load stages.

4.3.4 Database Operations


4.3.4.1 Appropriate Use of SQL and DataStage Stages

When using relational database sources, there is often a functional overlap between SQL and DataStage stages. Although it is possible to use either SQL or DataStage to solve a given business problem, the optimal implementation leverages the strengths of each technology to provide maximum throughput and developer productivity. While there are extreme scenarios where the appropriate technology choice is clear, there are gray areas where the decision should be made on factors such as developer productivity, metadata capture and re-use, and ongoing application maintenance costs. The following guidelines can assist with the appropriate use of SQL and DataStage technologies in a given job flow (a sketch of the first two guidelines follows this list):
When possible, use a SQL filter (WHERE clause) to limit the number of rows sent to the DataStage job. This minimizes the impact on network and memory resources and leverages the database capabilities.
Use a SQL join to combine data from tables with a small number of rows in the same database instance, especially when the join columns are indexed.
When combining data from very large tables, or when the source includes a large number of database tables, the DataStage EE Sort and Join stages can be significantly faster than an equivalent SQL query. In this scenario, it can still be beneficial to use database filters (WHERE clause) if appropriate.
Avoid using database stored procedures on a per-row basis within a high-volume data flow. For maximum scalability and parallel performance, it is best to implement business rules natively using DataStage components.
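As a sketch of the first two guidelines, the user-defined SQL of a source DB2 stage could push both the row filter and a small indexed join into the database; the TEST_ORDER columns and the date predicate below are illustrative assumptions, only LAB, LAB_CD, and LAB_TYPE_CD come from this document.

-- Filter and small indexed join pushed to the source database, so only the
-- required rows and columns reach the DataStage job.
SELECT ord.TEST_ORDER_ID,
       ord.LAB_CD,
       lab.LAB_TYPE_CD
FROM   TEST_ORDER ord
JOIN   LAB lab ON lab.LAB_CD = ord.LAB_CD      -- small reference table, indexed join column
WHERE  ord.ORDER_DT >= CURRENT DATE - 7 DAYS   -- WHERE clause limits rows sent to DataStage
;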

4.3.4.2 Optimizing Select Lists

For best performance and optimal memory usage, it is best to explicitly specify column names on all source database stages, instead of using an unqualified Table or SQL SELECT * read.
For the Table read method, always specify the Select List sub-property. For Auto-Generated SQL, the DataStage Designer automatically populates the select list based on the stage's output column definitions. For example, consider reading LAB_CD and LAB_TYPE_CD from the LAB table, which contains 37 columns. If the Table read method is used, always specify the select list as shown below.

This ensures that data for only 2 columns is loaded into memory instead of all 37 columns, saving time and precious memory. Alternatively, if the read method is set to User-Defined SQL, ensure that you mention the column names as shown in the screenshot below (and sketched after it). Do not use SELECT * FROM LAB to achieve this.
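A minimal sketch of the corresponding user-defined SQL is shown below; only the table and column names come from the example above, anything else would be project-specific.

-- Read only the two required columns from the 37-column LAB table.
SELECT LAB_CD,
       LAB_TYPE_CD
FROM   LAB;
-- Avoid: SELECT * FROM LAB (loads all 37 columns into memory)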


The only exception to this rule is when building dynamic database jobs that use runtime column propagation to process all rows in a source table. For example: while doing an UPSERT using the Enterprise DB2 stage, it is possible to create a reject link to trap rows that fail the update or insert statements. A reject link out of the Enterprise stage can carry two additional fields, SQLSTATE and SQLCODE, which hold the return codes from the database engine for failed UPSERT transactions. By default this reject link holds just the columns written to the stage; it does not show any columns indicating why the row was rejected, and often no warnings or error messages appear in the job log. To trap these values, add SQLSTATE and SQLCODE to the list of output columns on the output Columns tab and check the "Runtime column propagation" check box; this turns the two new columns from invalid (red) to black and lets the job compile. When the job runs and a reject occurs, the record is sent down the reject link, the two new columns are propagated down that link, and they can then be written out to an error-handling table or file.

4.4 FTP Enterprise Stage


The FTP Enterprise stage transfers multiple files in parallel. These are sets of files that are transferred from one or more FTP servers into WebSphere DataStage, or from WebSphere DataStage to one or more FTP servers. The source or target for the file is identified by a URI (Universal Resource Identifier). The FTP Enterprise stage invokes an FTP client program and transfers files to or from a remote host using the FTP protocol. When reading files with this stage, the exact record format must be specified, which is a drawback if you want to read files of different record formats. Consider the FTP stage if you will be reading multiple files of the same record format from the same or different directory paths. Consider a UNIX script if you will be reading multiple files of different record formats from the same or different directory paths.


For example: if sales files are written to the same directory on a daily basis and your requirement is to read the files on a weekly basis, you could consider using the FTP Enterprise stage to read all the files. The format of the URI is very important when reading the files, especially from a mainframe server. The syntax for an absolute path on UNIX or Linux servers is:
ftp://host//path/filename
While connecting to a mainframe system, the syntax for an absolute path is:
ftp://host/\path.filename\
Scenario 1: FTP multiple files of the same record format
For instance, if the Test_Order table is loaded on a weekly basis and files are provided daily, the FTP stage can be used to transfer all 7 files in a single job. The following snapshot describes the metadata for the 7 daily test order files. A single job can be used to FTP the files.

Scenario 2: FTP multiple files of different record formats
For instance, if the Test, Test_Order, and Test_Result tables use files of different formats, a single DataStage job will not serve the purpose, as the FTP stage depends on the record format to transfer the file. Instead, a single UNIX script can be used to transfer these files. The following snapshots represent the different record formats:
File Name: Test.txt


File Name: Test_Order.txt

File Name: Test_result.txt


4.5 Lookup vs. Join


The Lookup stage is most appropriate when the reference data for all lookup stages in a job is small enough to fit into available physical memory. Each lookup reference requires a contiguous block of physical memory. If the dataset is larger than the available contiguous block of physical memory, the Join or Merge stage should be used.

Lookup Stage Partitioning Considerations
The Lookup stage does not require the input data to be sorted. There are, however, some special partitioning considerations: you need to ensure that the data being looked up in the lookup table is in the same partition as the input data referencing it. To ensure this, partition the lookup tables using the Entire method.


Join Stage Partitioning Consideration The data sets input to the Join stage must be hash partitioned and sorted on key columns. This ensures that rows with the same key column values are located in the same partition and will be processed by the same node.


For example: if the contiguous block of physical memory is 4 KB and the dataset is 8 KB, use Join or Merge. When deciding between Join and Merge, consider the following:
Join can have only one output link and no reject links. Merge can have multiple output and reject links.
Merge requires its input datasets to be sorted and free of duplicates.
When the unsorted input is very large or sorting is not feasible, Lookup is preferred.
When all inputs are of manageable size or are pre-sorted, Join is the preferred solution.
If the reference to a Lookup is directly from a table, and the number of input rows is significantly smaller (e.g. 1:100 or more) than the number of reference rows, a Sparse Lookup may be appropriate.

For example: when loading a fact table, we look up the dimension table to get the appropriate surrogate keys. When a normal lookup is performed on a dimension table, the dimension table data is loaded into memory and the comparison is done in memory. Consider a scenario where the input for the fact table is only 100 records and it has to be compared to 10,000 dimension records. A sparse lookup is advised in such a scenario. When the lookup type is changed to sparse, the data from the lookup table is not loaded into memory; instead, the fact records are sent to the database to perform the lookup. Note that a sparse lookup can only be done when the lookup is against a database table.

Sparse Lookup on a database table


The lookup type can be changed to Sparse on the database stage used in the lookup (a sketch of such a query follows).
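For orientation, the following is a minimal sketch of the kind of query a sparse lookup issues against the reference table once per input row; the dimension table and surrogate key column are hypothetical, and the exact key-placeholder syntax depends on the database stage in use.

-- Hypothetical sparse lookup: executed per incoming fact row, with the stream
-- value substituted for the key placeholder at run time.
SELECT LAB_SK
FROM   LAB_DIM
WHERE  LAB_CD = ORCHESTRATE.LAB_CD;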

4.6 Transformer
The parallel Transformer stage generates C++ code, which is then compiled into a parallel component. For this reason, it is important to minimize the number of Transformers and to use other stages (Copy, Filter, Switch, etc.) when derivations are not needed. The Copy stage should be used instead of a Transformer for simple operations including:
Job design placeholder between stages (unless the Force option is set to true, EE will optimize this out at runtime)
Renaming columns
Dropping columns
Default type conversions
Note that rename, drop (if runtime column propagation is disabled), and default type conversion can also be performed on the output Mapping tab of any stage.

Example 1: Loading data from Table A to Table B without any transformation. A Copy stage should be used in these types of scenarios instead of a Transformer to improve performance.



Example 2: Using the Copy stage, we can also drop unwanted columns from the source table before loading the data into the target table.

Example 3: Loading the same set of data from Table A to Table B while adding a current-timestamp column or applying other transformation logic requires a Transformer. A Transformer stage should be used only when there is a need to apply business logic.

1. Where possible, consider implementing complex derivation expressions that follow regular patterns with lookup tables instead of a Transformer with nested derivations.


For example, the derivation expression:
If A = 0, 1, 2 or 3 Then B = X
If A = 4, 5, 6 or 7 Then B = C
could be implemented with a lookup table containing the values of column A and the corresponding values of column B (a sketch of such a lookup table appears after Example 1 below).
2. Optimize the overall job flow design to combine derivations from multiple Transformers into a single Transformer stage when possible.
3. The Filter and/or Switch stages can be used to separate rows into multiple output links based on SQL-like link constraint expressions.
Example 1: Data is to be loaded from Table A to Table B where the Billing_Parm_CD value should be greater than 5. A Filter stage can be used to filter these records during the load. The snapshot below shows how records are rejected based on the Where clause; the output link number must be specified for the link that receives the rejected records.

The picture below shows the link labels and link names in the job. We need to decide which link label will carry the rejected records, and the same link label must be specified as the output link, as shown in the snapshot above.
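Returning to point 1 above, a minimal sketch of the lookup table that replaces the nested derivation is shown below; the table name and the literal values for column B are illustrative assumptions.

-- Mapping table replacing the nested If-Then-Else derivation: each value of A
-- maps to its corresponding value of B, and the job performs a lookup on A_VAL.
CREATE TABLE A_TO_B_MAP (
    A_VAL INTEGER     NOT NULL PRIMARY KEY,
    B_VAL VARCHAR(10) NOT NULL
);
INSERT INTO A_TO_B_MAP (A_VAL, B_VAL) VALUES (0, 'X'), (1, 'X'), (2, 'X'), (3, 'X');
INSERT INTO A_TO_B_MAP (A_VAL, B_VAL) VALUES (4, 'C'), (5, 'C'), (6, 'C'), (7, 'C');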



Example 2: Switch stage over Transformer
The Switch stage can be used to route data to different target tables, datasets, or a reject link based on a selector value. The main difference between the Transformer and Switch stages is that a Transformer is needed when validations or derivations such as If..Then..Else, Max(), Min(), Len(), CurrentTimestamp(), etc. must be applied to the data, whereas the Switch stage only routes rows.



The snapshot below shows the properties of the Switch stage. Here we select the key column as the Selector, and rows are routed to the different output stages based on the values defined in the User-Defined Mapping section.

The snapshot below shows the column mapping for the output datasets in the job. Here we can drop columns based on the table structure of the target table.


The snapshot below shows the list of output links used in the job.

A Transformer stage should be used when we need to apply transformation logic such as If..Then..Else, UpCase(), CurrentTimestamp(), null handling, assigning default values, etc.


Example: applying business validations to the data, such as If..Then..Else, UpCase(), CurrentTimestamp(), null handling, and assigning default values.

4. The Modify stage can be used for non-default type conversions, null handling, and character string trimming.

Transformer NULL Handling and Reject Link
When evaluating expressions for output derivations or link constraints, the Transformer will reject (through the reject link, indicated by a dashed line) any row that has a NULL value used in the expression. To create a Transformer reject link in DataStage Designer, right-click on an output link and choose Convert to Reject. The Transformer rejects NULL derivation results because the rules for arithmetic and string handling of NULL values are by definition undefined. For this reason, always test for null values before using a column in an expression, for example:
If IsNull(link.col) Then <default value> Else <expression using link.col>
Note that if an incoming column is only used in a pass-through derivation, the Transformer will allow the row to be output.

Transformer Derivation Evaluation
Output derivations are evaluated BEFORE any type conversions on the assignment. For example, the PadString function uses the length of the source type, not the target. Therefore, it is important to make sure the type conversion is done before a row reaches the Transformer. For example, TrimLeadingTrailing(string) works only if string is a Varchar field; thus, the incoming column must be of type Varchar before it is evaluated in the Transformer.

Conditionally Aborting Jobs
The Transformer can be used to conditionally abort a job when incoming data matches a specific rule. Create a new output link that will handle rows that match the abort rule. Within the link constraints dialog box, apply the abort rule to this output link, and set the Abort After Rows count to the number of rows allowed before the job should be aborted (e.g. 1). Since the Transformer aborts the entire job flow immediately, it is possible that valid rows will not have been flushed from Sequential File (export) buffers, or committed to database tables. It is therefore important to set the Sequential File buffer flush (see section 6.2) or database commit parameters.

4.7 Sort Stage


Try not to use a Sort stage when you can use an ORDER BY clause in the database. Sort the data as much as possible in the database and reduce the use of the DataStage sort for better job performance (a sketch follows this paragraph). If necessary, sorting can also be done on the input links of a stage using the stage's properties tab; this is appropriate for simple sorting or partitioning, and the Sort stage should be used only when more complex sorting is required. If the data has already been partitioned and sorted on a set of key columns, specify the "don't sort, previously sorted" option for those key columns in the Sort stage; this reduces the cost of sorting. When writing to parallel data sets, sort order and partitioning are preserved; when reading from these data sets, try to maintain the sorting by using the Same partitioning method. Use hash partitioning on the key columns for the Sort stage.
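As a sketch of pushing the sort into the database, the source stage's user-defined SQL could carry the ORDER BY directly; the columns reuse the earlier LAB example and the sort keys are illustrative assumptions.

-- Let the database sort the rows so a separate Sort stage is not needed
-- downstream (note that a parallel read may still require per-partition sorting).
SELECT LAB_CD,
       LAB_TYPE_CD
FROM   LAB
ORDER BY LAB_CD, LAB_TYPE_CD;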

For example, assume you sort a data set on a system with four processing nodes and store the results to a data set stage. The data set will therefore have four partitions. You then use that data set as input to a stage executing on a different number of nodes, possibly due to node constraints. DataStage automatically repartitions a data set to spread out the data set to all nodes in the system, unless you tell it not to, possibly destroying the sort order of the data. You could avoid this by specifying the Same partitioning method. The stage does not perform any repartitioning as it reads the input data set; the original partitions are preserved.

4.8 DataSet
The Data Set stage allows you to store the data being operated on in a persistent form, which can then be used by other WebSphere DataStage jobs. Data sets are operating system files, each referred to by a control file, which by convention has the suffix .ds. These files are stored on multiple disks in your system. A data set is organized in terms of partitions and segments. Each partition of a data set is stored on a single processing node. Each data segment contains all the records written by a single WebSphere DataStage job, so a segment can contain files from many partitions, and a partition has files from many segments. Because the dataset stores its values across the nodes, it is difficult to view the data directly in UNIX.

4.9 Change Data Capture Stage


The Change Capture stage compares two datasets based on the key columns provided and marks the differences using a change code:
If the key is present in both datasets and the non-key values differ, the record is marked for update, CHANGE_CODE = 3.
If the key is present in the target dataset but not in the source dataset, the record is marked for delete, CHANGE_CODE = 2.
If the source dataset has a key value that is not present in the target, the record is marked for insert, CHANGE_CODE = 1.
Records whose key and values are identical in both datasets are treated as copies, CHANGE_CODE = 0 (see the Drop Output for Copy option in section 4.10).

Note 1: The Change Data Capture stage does not insert, update, or delete records in a dataset or table. It only adds an extra column, CHANGE_CODE, that records the change. To apply the changes you will have to use a DB2 stage as the target.
Note 2: Change data capture can be used to compare identically structured data sets. By dataset we do not mean only the DataStage dataset, but any set of records: flat files, database input, or DataStage datasets.
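To illustrate Note 1, the change codes are typically acted on by downstream database stages; a hedged sketch of the statements such stages would execute against a hypothetical LAB_TARGET table is shown below.

-- One statement per CHANGE_CODE value produced by the Change Capture stage.
-- CHANGE_CODE = 1 (insert):
INSERT INTO LAB_TARGET (LAB_CD, LAB_TYPE_CD) VALUES (?, ?);
-- CHANGE_CODE = 3 (update):
UPDATE LAB_TARGET SET LAB_TYPE_CD = ? WHERE LAB_CD = ?;
-- CHANGE_CODE = 2 (delete):
DELETE FROM LAB_TARGET WHERE LAB_CD = ?;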

4.10 CDC vs. UPSERT mode in DB2 stages
There are different ways of capturing the inserts and updates on a table, and no single stage is correct for all scenarios. What CDC does can also be achieved using the UPSERT mode of the DB2 stage. UPSERT operates on a record-by-record basis: it tries to update first and, if the update fails, it inserts the record (a sketch of these statements appears at the end of this section). Alternatively, you could change the order to insert first and update when the insert fails, to match your business logic. CDC, on the other hand, compares the source and target and identifies the insert, update, and delete records. Once the records are identified, they can be processed by separate DB2 stages to insert, update, or delete. The inserts and updates can then be handled in batch mode, saving crucial load time. Although the up-front CDC comparison uses system resources and time, the batch-mode inserts and updates compensate for the loss. If the incoming file contains incremental data only, then processing records for "no change" and "delete" does not make sense; you could set Drop Output for Copy and Drop Output for Delete to True. Consider the following points before making your decision:
How many records will be processed in this job? If the number of records is 100,000 or fewer, choose UPSERT over CDC.
Will you use just the key column(s) to identify the change, or do you want to compare the entire row? Consider CDC over UPSERT if you will compare the entire source record to the target record to identify the changes.
What percentage of the records being processed are updates? This could be a definitive figure, say 95%, for today and a few months to come, but in the long run, say 3 years down the line, will the percentage of updates or inserts remain the same? As this is hard to predict, jobs are developed to cope with any insert/update percentage fluctuations; CDC is the better candidate when designing for such changes.
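For orientation, a minimal sketch of the update-then-insert pair that UPSERT mode issues per record is shown below; the target table and columns are hypothetical, and in an actual job the auto-generated statements reflect the stage's column metadata.

-- Per-record UPSERT logic: update first, insert only if the update matched no row.
UPDATE TEST_ORDER_TGT
SET    ORDER_STATUS = ?,
       LAST_UPD_TS  = CURRENT TIMESTAMP
WHERE  TEST_ORDER_ID = ?;
INSERT INTO TEST_ORDER_TGT (TEST_ORDER_ID, ORDER_STATUS, LAST_UPD_TS)
VALUES (?, ?, CURRENT TIMESTAMP);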

4.11 Parameter Sets


Parameters help eliminate hard coding and reduce redundancy. Prior to version 8, parameters could be set at the job and at the project level. The following parameters were typically handled at the project level:
Source/target database connection details
Input and output directory paths
However, when project code moved from one environment to another, all the parameter values had to be changed to match the new environment details. Parameter sets help reduce this effort: you define the parameters as a set and assign value sets to it. For example, for database connectivity you need:
SERVER_NAME
USER_ID
PASSWORD
These details differ between the development, test, and production environments. You create a parameter set with the three parameters and assign value sets to them. The next time you run a job, DataStage will prompt you for the value set to use for this parameter set and assigns the corresponding values to the parameters. For instance, consider a parameter set DATABASE_CONNECTIVITY defined at the project level and used by the jobs in that project. At run time the user is asked to select a value from a drop-down list (e.g. DEV, TEST, PROD). Once the selection is made, the corresponding parameter settings are used for that job execution; otherwise, if nothing is selected, the default values are used.


If DEV is selected from the Value drop-down list above, the development server name, user name, and password will be passed to the job as parameters, as shown below.

If no selection is made from the Value drop-down list, the default server name, user name, and password will be passed, as shown below.


The parameter values differ across the development, test, and production environments, so the parameter set holds three value sets, one for each environment. When a job runs, DataStage prompts for the value set to use for this parameter set and assigns the corresponding values to the parameters.

4.12 Slowly Changing Dimension Stage


Dimensions that change very slowly are termed slowly changing dimensions (SCD). Consider the example of an employee dimension in which you capture the employee's work location. If the chance of an employee moving locations is very low, a change record arrives rarely, but you still need to capture the change. How you capture the change determines the type of SCD. SCD is classified into three types:
SCD1: no history is maintained. If the employee's work location changes, you simply update the work location field with the new value.
SCD2: maintains unlimited history. For every change of work location for an employee, a new record is inserted into the table; in doing so, you must also identify the active record.
SCD3: maintains limited history. If the company wants to maintain only the current and previous work location, a new column is added to capture the previous work location.
The SCD stage can handle SCD1 and SCD2 only, so if your business requires SCD3 handling, the SCD stage is not the answer; consider using CDC to identify the change and process it separately. The SCD stage takes two input links and writes to two output links, one that maintains the dimension table and one that passes the source rows downstream. If your dimension is an SCD1 and the data volume is very low, consider using UPSERT mode on the DB2 stages (a sketch of typical SCD2 statements follows).
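For reference, a hedged sketch of the SQL an SCD2 load typically applies to the dimension (expire the active row, then insert the new version) is shown below; the table, surrogate key, and flag columns are illustrative assumptions.

-- SCD2 handling for an employee work-location change.
-- 1. Expire the currently active row for the employee.
UPDATE EMPLOYEE_DIM
SET    ACTIVE_FLAG = 'N',
       EFFECTIVE_END_DT = CURRENT DATE
WHERE  EMPLOYEE_ID = ? AND ACTIVE_FLAG = 'Y';
-- 2. Insert the new version as the active row.
INSERT INTO EMPLOYEE_DIM
    (EMPLOYEE_SK, EMPLOYEE_ID, WORK_LOCATION, EFFECTIVE_START_DT, EFFECTIVE_END_DT, ACTIVE_FLAG)
VALUES (?, ?, ?, CURRENT DATE, '9999-12-31', 'Y');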


5 Job Representation

1. If too many stages are being used in a job, try splitting the job into multiple jobs; 15 stages in a job is a good number to start with. If your requirement cannot be handled with multiple jobs, try grouping the stages into local containers to reduce the number of stages on the job canvas.
2. Annotations should be added to all the complex logic stages.
3. Add a short description to every job.
4. Add the job version history in the Full Description, capturing the following information:
User Name: name of the user who changed the job
Description: a brief description of the change
Date of change: date on which the change was done
Reason: reason for changing the job (defect, business logic change, or new business requirement)


6 Performance Tuning
6.1 General
When dealing with large volumes of data, consider separating the extract, transform, and load operations and using datasets to capture the intermediate results. This helps with job maintenance and restartability. Stage the data coming from ODBC, OCI, or DB2 UDB stages, or any database, on the server using hash or sequential files for optimum performance and for data recovery in case the job aborts. Filter out the rows and columns that will not be used as early as possible in the job. For example, when reading from databases, use a select list to read just the columns required rather than the entire table; please refer to section 4.3.4.2 for more details. If a sub-query is being used in multiple jobs, store the result set of the sub-query in a dataset and use that dataset instead. Convert some of the complex joins in DataStage into jobs that populate tables, which can then be used as reference stages for lookups or joins. If an input file has an excessive number of rows and can be split up, use standard logic to run jobs in parallel. For example, a large data file containing data for the whole year can be split into multiple files based on monthly data. In this case you can specify a UNIX filter command such as grep Month (e.g. grep January) to limit each sequential file to the records of a particular month, and in the job you can specify 12 Sequential File stages, one for each month. This ensures that the single file is read and processed in parallel, saving time.

Avoid unnecessary stages, such as redundant transformations and functions; please refer to section 4.6 on the sparing use of the Transformer stage. Using a Lookup instead of a Join for small amounts of reference data is more efficient; please refer to section 4.5 for details. Share the query load sensibly with DataStage; it is often better to implement the logic in DataStage instead of writing a complex query.


One of the main flaws in a job design that eventually hampers job performance is the unnecessary use of a staging area; ideally, datasets should be used for any intermediate data unless you require the data in the database for an audit trail.

6.2 Sequential File Stage


Improving File Performance If the source file is fixed/de-limited, the Readers Per Node option can be used to read a single input file in parallel at evenly spaced offsets. Note that in this manner, input row order is not maintained.

Performance can be improved by separating the file I/O from the column parsing operation. To accomplish this, define a single large string column for the non-parallel Sequential File read, and then pass this to a Column Import stage to parse the file in parallel. The formatting and column properties of the Column Import stage match those of the Sequential File stage.

Sequential File (Export) Buffering
By default, the Sequential File (export operator) stage buffers its writes to optimize performance. When a job completes successfully, the buffers are always flushed to disk. The environment variable $APT_EXPORT_FLUSH_COUNT allows the job developer to specify how frequently (in number of rows) the Sequential File stage flushes its internal buffer on writes. Setting this value to a low number (such as 1) is useful for real-time applications, but there is a small performance penalty associated with the increased I/O. Consider this option with care.

6.3 Complex Flat File Stage
Improving File Performance If the source file is fixed/de-limited, the Multiple Node Reading sub section under the File options can be used to read a single input file in parallel at evenly spaced offsets. Note that in this manner, input row order is not maintained.

Performance can be improved by separating the file I/O from the column parsing operation. To accomplish this, define a single large string column for the non-parallel Sequential File read, and then pass this to a Column Import stage to parse the file in parallel. The formatting and column properties of the Column Import stage match those of the Complex Flat File stage.

6.4 DB2 UDB API
The recommendation is to separate the extract, transform, and load operations. However, if the data volume is less than 100,000 records, the extraction, transformation, and loading could be combined. In doing so, the same table may be used both as a lookup and as a target, which will impact new and changed record identification. To ensure that the new records inserted into the table are not considered by the lookup, set the TRANSACTION ISOLATION to Cursor Stability, as shown in the screenshot below.
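The same isolation can also be expressed in user-defined lookup SQL using DB2's isolation clause; the following is a minimal sketch against the hypothetical LAB_TARGET table, while the stage's Transaction Isolation property (as in the screenshot) is the usual place to set it.

-- Lookup query run with Cursor Stability isolation.
SELECT LAB_CD, LAB_TYPE_CD
FROM   LAB_TARGET
WHERE  LAB_CD = ?
WITH CS;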


6.5 DB2 Enterprise Stage


Array Size and Row Commit Interval should be set according to the volume of records to be inserted. It is advisable to set these options at the job level instead of the project level. Set the options to a large value when dealing with large volumes of data. For example, if 100,000 or more records are loaded into the database, the Row Commit Interval and Array Size may be set to 10,000 or more for better performance. This reduces the I/O cycles between the ETL and database servers.

