2  IBM WebSphere DataStage ............................................ 5
3  DataStage Job Types ................................................ 6
   3.1  Server ......................................................... 6
   3.2  Parallel ....................................................... 6
   3.3  Mainframe ...................................................... 6
   3.4  Server vs. Parallel ............................................ 6
4  DataStage Stages ................................................... 7
   4.1  Sequential File ................................................ 7
   4.2  Complex Flat File Stage ........................................ 7
   4.3  DB2 UDB Stages ................................................. 10
        4.3.1  DB2 UDB API ............................................. 10
        4.3.2  DB2 UDB Load ............................................ 10
        4.3.3  DB2 Enterprise .......................................... 11
        4.3.4  Database Operations ..................................... 11
   4.4  FTP Enterprise Stage ........................................... 13
   4.5  Lookup vs. Join ................................................ 16
   4.6  Transformer .................................................... 19
   4.7  Sort Stage ..................................................... 27
   4.8  DataSet ........................................................ 27
   4.9  Change Data Capture Stage ...................................... 27
   4.10 CDC vs. UPSERT Mode in DB2 Stages .............................. 28
   4.11 Parameter Sets ................................................. 28
   4.12 Slowly Changing Dimension Stage ................................ 30
5  Job Representation ................................................. 31
6  Performance Tuning ................................................. 32
   6.1  General ........................................................ 32
   6.2  Sequential File Stage .......................................... 33
   6.3  Complex Flat File Stage ........................................ 34
   6.4  DB2 UDB API .................................................... 35
   6.5  DB2 Enterprise Stage ........................................... 36
CONFIDENTIAL LabCorp.
Page 2 of 36
Revision History
Version 1.2 (4-10-2008): Added examples.
Affected Groups: Enterprise Results Repository Team, Wipro ETL COE team
1 Introduction
Laboratory Corporation of America, referred to as LabCorp, has invited Wipro Technologies to set up an ETL Center of Excellence at LabCorp. In doing so, Wipro will study the existing ETL architecture and processes and recommend best practices. Wipro will also mentor the LabCorp ETL development team in implementing the best practices on a project chosen for a Proof of Concept.
1.1 Objective
The purpose of this document is to suggest DataStage best practices to ETL developers for the Results Repository implementation. It is assumed that the developers understand the stages and the terminology used in this document. In preparing it, the Wipro team considered all the components involved in the Enterprise Results Repository.
1.2 Scope
The components involved in the Enterprise Results Repository are:
- Input: flat files and DB2 tables
- Output: DB2 tables
- Reject Handling: flat files
- Restartability: datasets/flat files used as intermediate sources
- Business Logic: identify the records for insert and update; check for referential integrity
The DataStage stages discussed in this document are all the stages required to cover the components mentioned above. What each stage does, and why a particular stage should be chosen, is discussed in detail. For a few of the components, multiple implementations are discussed; the most suitable implementation should be chosen based on the considerations mentioned.
3.2 Parallel
These are compiled and run on the DataStage server in a similar way to server jobs, but support parallel processing on SMP (Symmetric Multiprocessing), MPP (Massively Parallel Processing), and cluster systems. Parallel jobs can significantly improve performance because the different stages of a job are run concurrently rather than sequentially. There are two basic types of parallel processing, pipeline and partitioning.
3.3 Mainframe
These are available only if you have Enterprise MVS Edition installed. A mainframe job is compiled and run on the mainframe. Data extracted by such jobs is then loaded into the data warehouse.
3.4 Server vs. Parallel

As a general guide, larger-volume jobs (e.g. 2 million rows or more) with a small batch-window constraint (30 minutes or less) should be considered for a Parallel job. This is an important decision because Parallel jobs consume more hardware resources (memory and processors) than Server jobs, so it is advisable to implement simpler jobs in Server and complex ones in Parallel, and then manage them using a Sequencer (another DataStage component used to manage the flow of multiple jobs). Complexity depends on the business rules applied in the job and the kind of table it loads. A job that loads a fact table may be categorized as complex, as it involves one or more lookup, join, merge or funnel stages. A dimension table load is simple if all the data is available in one data source. Complexity is difficult to quantify because it is relative; as a rule of thumb, if the business rules dictate the use of more than one source of data to load a table, the job may be considered complex. If the load window is small, the data volumes are high and the business rules are complex, always prefer a Parallel job to a Server job.
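The rule of thumb above can be sketched as a simple decision function. This is a hypothetical helper for illustration, not part of DataStage; the thresholds come from the guideline text, and "complex" is approximated as "more than one data source".

```python
def choose_job_type(row_count, batch_window_minutes, source_count):
    """Suggest Server vs. Parallel per the rule of thumb above.

    A job is treated as complex when it combines more than one
    data source (lookups, joins, merges, funnels).
    """
    is_complex = source_count > 1
    high_volume = row_count >= 2_000_000       # 2 million rows or more
    tight_window = batch_window_minutes <= 30  # small batch window
    if high_volume and (tight_window or is_complex):
        return "Parallel"
    return "Server"

assert choose_job_type(5_000_000, 20, 3) == "Parallel"  # big, tight, complex
assert choose_job_type(50_000, 120, 1) == "Server"      # small, simple
```

In practice the decision also weighs hardware availability and operational factors, so treat the function only as a first-pass filter.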
4 DataStage Stages
4.1 Sequential File
The Sequential File stage is used to read from or write to flat files. The files can be fixed-length or delimited. The stage can operate on a single file, or on a set of files using a file pattern. When reading or writing a single file the stage operates in sequential mode; when the operation is performed on multiple files, it operates in parallel.

Reading from and Writing to Fixed-Length Files

Particular attention must be paid when processing fixed-length fields using the Sequential File stage:
- If the incoming columns are variable-length data types (e.g. Integer, Decimal, Varchar), the field width column property must be set to match the fixed width of the input column. Double-click on the column number in the grid dialog to set this property.
- If a field is nullable, you must define the null field value and length in the nullable section of the column properties. Double-click on the column number in the grid dialog to set these properties.
- When writing fixed-length files from variable-length fields (e.g. Integer, Decimal, Varchar), the field width and pad string column properties must be set to match the fixed width of the output column. Double-click on the column number in the grid dialog to set these properties.
Reading Bounded-Length VARCHAR Columns

Care must be taken when reading delimited, bounded-length Varchar columns (Varchars with the length option set). By default, if the source file has fields with values longer than the maximum Varchar length, the extra characters are silently truncated.

Converting Binary to ASCII

Select Export EBCDIC as ASCII to specify that EBCDIC characters are written as ASCII characters. This applies to fields of the string data type and to the record.

Reject Links

The reject link can be used to write records that do not satisfy the specified format to a reject file. When writing files, the link uses the column definitions of the input link. When reading files, the link uses a single column called rejected, containing the raw data of records rejected after reading because they do not match the schema. The data written to the reject file is raw binary.
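The two pitfalls above can be illustrated with a small sketch. This is illustrative Python, not DataStage behaviour verbatim: it mimics the silent truncation of a bounded-length Varchar and the padding a fixed-width write requires.

```python
def read_bounded_varchar(value, max_len):
    """Mimic a bounded-length Varchar read: characters beyond
    max_len are dropped without any warning, which is exactly the
    behaviour to watch for in the Sequential File stage."""
    return value[:max_len]

def write_fixed_width(value, width, pad=" "):
    """Mimic a fixed-width write from a variable-length field: the
    value must be padded to the declared field width (the 'field
    width' and 'pad string' column properties)."""
    return value[:width].ljust(width, pad)

assert read_bounded_varchar("ABCDEFGH", 5) == "ABCDE"  # silent truncation
assert write_fixed_width("42", 6) == "42    "          # padded to width 6
```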
The following snapshots represent a typical implementation that reads a COBOL file of varying record lengths using the CFF stage. On the Stage page, specify information about the stage data: a. On the File Options tab, provide details about the file that the stage will read. For a COBOL file with varying records, select the options as shown in the snapshot below:
Read from multiple nodes: select this check box if you want to speed up the read operation by using multiple nodes.
c. If the stage is reading a file that contains multiple record types, create record definitions for the data on the Records tab. Then create or load the column definitions: to load the columns for the COBOL file, import the column definition file (.cfd).
On the Output tab, specify the columns that should be mapped to the output file or stage.
When using relational database sources, there is often a functional overlap between SQL and DataStage stages. Although it is possible to use either SQL or DataStage to solve a given business problem, the optimal implementation leverages the strengths of each technology to provide maximum throughput and developer productivity. While there are extreme scenarios where the appropriate technology choice is clear, there are gray areas where the decision should be made on factors such as developer productivity, metadata capture and re-use, and ongoing application maintenance costs. The following guidelines can assist with the appropriate use of SQL and DataStage in a given job flow:
- When possible, use a SQL filter (WHERE clause) to limit the number of rows sent to the DataStage job. This minimizes the impact on network and memory resources, and leverages the database's capabilities.
- Use a SQL join to combine data from tables with a small number of rows in the same database instance, especially when the join columns are indexed.
- When combining data from very large tables, or when the source includes a large number of database tables, the DataStage EE Sort and Join stages can be significantly faster than an equivalent SQL query. In this scenario, it can still be beneficial to use database filters (WHERE clauses) if appropriate.
- Avoid using database stored procedures on a per-row basis within a high-volume data flow. For maximum scalability and parallel performance, it is best to implement business rules natively using DataStage components.

4.3.4.2 Optimizing Select Lists
For best performance and optimal memory usage, explicitly specify column names on all source database stages, instead of using an unqualified Table read or a SQL SELECT * read. For the Table read method, list the required columns in the stage's select list.
This ensures that data for only the 2 required columns is loaded into memory instead of all 37 columns, saving time and precious memory. Alternatively, if the read method is set to User-Defined SQL, ensure that you name the columns explicitly, as shown in the screenshot below. Do not use SELECT * FROM LAB to achieve this.
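The memory effect of an explicit select list can be sketched outside the database as well. The sketch below (illustrative data; the table and column names are made up) keeps only the required columns from each record, which is what an explicit select list achieves server-side before any data reaches the job.

```python
import csv
import io

# Sample delimited source with several columns; only two are needed.
raw = "LAB_ID,NAME,CITY,STATE\n101,Alpha,Burlington,NC\n102,Beta,Raleigh,NC\n"

needed = ["LAB_ID", "NAME"]
rows = []
for record in csv.DictReader(io.StringIO(raw)):
    # Keep only the required columns, mirroring an explicit select
    # list (SELECT LAB_ID, NAME FROM LAB) instead of SELECT *.
    rows.append({col: record[col] for col in needed})

assert rows[0] == {"LAB_ID": "101", "NAME": "Alpha"}
```

With a real database source the filtering happens in the database engine, which is cheaper still: the unused columns never cross the network.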
The only exception to this rule is when building dynamic database jobs that use runtime column propagation to process all rows in a source table. For example, while doing an UPSERT using the DB2 Enterprise stage, it is possible to create a reject link to trap rows that fail the update or insert statements. A reject link out of this stage can carry two additional fields, SQLSTATE and SQLCODE, which hold the return codes from the database engine for failed UPSERT transactions. By default the reject link holds just the columns written to the stage; it does not show any columns indicating why a row was rejected, and often no warnings or error messages appear in the job log. To trap these values, add SQLSTATE and SQLCODE to the list of output columns on the Output Columns tab and check the "Runtime column propagation" check box; this turns the two new columns from invalid (red) to valid (black) and lets the job compile. When the job runs and a reject occurs, the record is sent down the reject link, the two new columns are propagated down that link, and they can then be written out to an error-handling table or file.
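The UPSERT-with-reject pattern can be sketched in miniature. This is an in-memory stand-in, not database code: a dict plays the DB2 table, a None value simulates a NOT NULL constraint violation, and the reject list carries SQLSTATE/SQLCODE-style fields (23502 / -407 are DB2's codes for assigning NULL to a NOT NULL column).

```python
target = {1: "old"}   # key -> value, stands in for the DB2 table
rejects = []          # stands in for the reject link

def upsert(key, value):
    # Simulate a NOT NULL constraint violation to exercise the
    # reject path; a real reject row would carry the engine's codes.
    if value is None:
        rejects.append({"row": (key, value),
                        "SQLSTATE": "23502", "SQLCODE": -407})
        return
    # Update if the key exists, insert otherwise -- the essence of
    # the update-then-insert UPSERT mode.
    target[key] = value

upsert(1, "updated")   # existing key -> update path
upsert(2, "inserted")  # new key -> insert path
upsert(3, None)        # fails, routed to the reject link

assert target == {1: "updated", 2: "inserted"}
assert rejects[0]["SQLCODE"] == -407
```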
Scenario 2: FTP multiple files of different record formats. If, for instance, the Test, Test_Order and Test_Result tables use files of different formats, a single DataStage job will not serve the purpose, because the FTP stage depends on the record format to transfer a file. Instead, a single Unix script can be used to transfer these files. The following snapshots represent the different record formats: File Name: Test.txt
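Such a transfer script could be sketched with Python's standard ftplib instead of a shell script; a plain binary transfer does not care about each file's record format. The host, credentials and destination here are placeholders.

```python
from ftplib import FTP

FILES = ["Test.txt", "Test_Order.txt", "Test_Result.txt"]

def retr_commands(files):
    """Build the RETR command for each file; the differing record
    formats are irrelevant to a binary transfer."""
    return [f"RETR {name}" for name in files]

def fetch_files(host, user, password, files, dest_dir="."):
    """Transfer several files of differing record formats in one
    pass, unlike the record-format-bound FTP Enterprise stage."""
    with FTP(host) as ftp:
        ftp.login(user, password)
        for name, cmd in zip(files, retr_commands(files)):
            with open(f"{dest_dir}/{name}", "wb") as out:
                ftp.retrbinary(cmd, out.write)

# fetch_files("ftp.example.com", "etl_user", "secret", FILES)
#   ^ uncomment with real connection details
```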
Join Stage Partitioning Consideration The data sets input to the Join stage must be hash partitioned and sorted on key columns. This ensures that rows with the same key column values are located in the same partition and will be processed by the same node.
For example, if the available contiguous block of physical memory is 4 KB and the reference dataset is 8 KB, use Join or Merge. When deciding between Join and Merge, consider the following: Join can have only one output link and no reject links, while Merge can have multiple output and reject links; Merge also requires its input datasets to be sorted and free of duplicates. When the unsorted input is very large or sorting is not feasible, Lookup is preferred. When all inputs are of manageable size or are pre-sorted, Join is the preferred solution. If the reference of a Lookup comes directly from a table, and the number of input rows is significantly smaller (e.g. 1:100 or more) than the number of reference rows, a Sparse Lookup may be appropriate. For example, when loading a fact table, we look up the dimension table to get the appropriate surrogate keys. When a normal lookup is performed on a dimension table, the dimension data is loaded into memory and the comparison is done in memory. Consider a scenario where the input for the fact table is only 100 records that must be compared against 10,000 dimension records: a sparse lookup is advised. When the lookup type is changed to sparse, the data from the lookup table is not loaded into memory; instead, the fact records are sent to the database to perform the lookup. Note that a sparse lookup can only be done when the lookup is performed on a database table.

Sparse Lookup on a database table
The lookup type can be changed to Sparse on the database stage used in the lookup.
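The normal-versus-sparse trade-off can be sketched with plain data structures. This is illustrative only: the dict stands in for the dimension table, and `query()` stands in for the per-row SQL a sparse lookup would issue against the database.

```python
# 10,000 reference rows mapping a business key to a surrogate key.
dimension = {f"TEST-{i}": 1000 + i for i in range(10000)}
# Only 100 incoming fact rows -- the 1:100 ratio mentioned above.
facts = [f"TEST-{i}" for i in range(100)]

def normal_lookup(facts, dim):
    """Normal lookup: the entire dimension is loaded into memory
    first, then every fact row is matched in memory."""
    in_memory = dict(dim)               # full reference load
    return [in_memory[f] for f in facts]

def sparse_lookup(facts, query):
    """Sparse lookup: each fact row triggers one 'query' against
    the source instead of loading the reference into memory."""
    return [query(f) for f in facts]

keys_normal = normal_lookup(facts, dimension)
keys_sparse = sparse_lookup(facts, dimension.get)
assert keys_normal == keys_sparse   # same result, different cost profile
```

Both paths return identical surrogate keys; the difference is where the work happens, which is why sparse wins only when the input is much smaller than the reference.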
4.6 Transformer
The parallel Transformer stage generates C++ code, which is then compiled into a parallel component. For this reason, it is important to minimize the number of Transformers and to use other stages (Copy, Filter, Switch, etc.) when derivations are not needed. The Copy stage should be used instead of a Transformer for simple operations, including:
- a job-design placeholder between stages (unless the Force option is set to true, EE will optimize this out at runtime)
- renaming columns
- dropping columns
- default type conversions
Note that renaming, dropping (if runtime column propagation is disabled) and default type conversion can also be performed on the output Mapping tab of any stage.

Example 1: Loading data from Table A to Table B without any transformation. A Copy stage should be used in such scenarios instead of a Transformer, to improve performance.
Example 3: Loading the same set of data from Table A to Table B while adding a current-timestamp column or other transformation logic, using a Transformer. A Transformer stage should be used only when there is a need to apply business logic.
1. Where possible, implement complex derivation expressions that follow regular patterns using lookup tables instead of a Transformer with nested derivations.
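The lookup-table approach in point 1 can be sketched as follows. The mapping is illustrative (a made-up state-to-region rule); the point is that the table form is flatter and easier to extend than the nested derivation.

```python
# Nested If..Then..Else derivation (hard to read and maintain):
def region_nested(state):
    return ("East" if state == "NC" else
            "East" if state == "VA" else
            "West" if state == "AZ" else
            "Unknown")

# Equivalent lookup-table form, as point 1 suggests:
REGION = {"NC": "East", "VA": "East", "AZ": "West"}

def region_lookup(state):
    # One table probe replaces the whole nested expression.
    return REGION.get(state, "Unknown")

for s in ("NC", "VA", "AZ", "TX"):
    assert region_nested(s) == region_lookup(s)
```

Adding a new state to the table form is a one-line data change; in the nested form it means editing, retesting and recompiling the expression.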
The picture below shows the link labels and link names in the job. We need to decide the link label for the rejected records, and the same link label should be defined on the output link, as shown in the snapshot above.
The snapshot below shows the column mapping for the output datasets in the job. Here we can drop columns based on the table structure of the target table.
The snapshot below shows the list of output links used in the job.
A Transformer stage should be used when we need transformation logic such as If..Then..Else, UpCase(), CurrentTimestamp(), null handling, assigning default values, etc.
4. The Modify stage can be used for non-default type conversions, null handling, and character string trimming.

Transformer NULL Handling and Reject Link

When evaluating expressions for output derivations or link constraints, the Transformer rejects (through the reject link, indicated by a dashed line) any row that has a NULL value used in the expression. To create a Transformer reject link in DataStage Designer, right-click on an output link and choose Convert to Reject. The Transformer rejects NULL derivation results because the rules for arithmetic and string handling of NULL values are by definition undefined. For this reason, always test for null values before using a column in an expression, for example: If ISNULL(link.col) Then ... Else ... Note that if an incoming column is used only in a pass-through derivation, the Transformer will allow the row to be output.

Transformer Derivation Evaluation

Output derivations are evaluated BEFORE any type conversions on the assignment. For example, the PadString function uses the length of the source type, not the target, so it is important to make sure the type conversion is done before a row reaches the Transformer. Similarly, TrimLeadingTrailing(string) works only if string is a Varchar field; the incoming column must therefore be of type Varchar before it is evaluated in the Transformer.

Conditionally Aborting Jobs

The Transformer can be used to conditionally abort a job when incoming data matches a specific rule. Create a new output link that will handle rows matching the abort rule. Within the link constraints dialog box, apply the abort rule to this output link, and set Abort After Rows to the number of rows allowed before the job should be aborted (e.g. 1). Since the Transformer aborts the entire job flow immediately, it is possible that valid rows will not have been flushed from Sequential File (export) buffers, or committed to database tables.
It is therefore important to set the Sequential File buffer flush (see section 6.2) or database commit parameters.
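The NULL-handling and conditional-abort rules above can be sketched together. This is an illustrative stand-in for Transformer logic: the derivation tests for NULL first, and the abort check raises once the configured row count is reached (the rule, column names and default value are made up for the example).

```python
ABORT_AFTER_ROWS = 1   # the "Abort After Rows" setting
bad_rows = 0

def derive(col):
    """Mirror: If ISNULL(link.col) Then default Else UpCase(col).
    Testing for NULL first keeps the row off the reject link."""
    if col is None:
        return "DEFAULT"
    return col.upper()

def check_abort(row):
    """Mirror a link constraint used as an abort rule: stop the
    whole flow once enough rows have matched it."""
    global bad_rows
    if row.get("amount", 0) < 0:        # the hypothetical abort rule
        bad_rows += 1
        if bad_rows >= ABORT_AFTER_ROWS:
            raise RuntimeError("abort rule matched: job stopped")

assert derive(None) == "DEFAULT"   # NULL handled, not rejected
assert derive("abc") == "ABC"
```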
4.7 Sort Stage

Avoid using a Sort stage when you can use an ORDER BY clause in the database. Sort the data as much as possible in the database, and reduce the use of the DataStage sort for better job performance. If necessary, sorting can also be done on the input links of stages, using the Properties tab of the stage; do this for simple sorting or partitioning, and use the Sort stage only when complex sorting is required. If the data has already been partitioned and sorted on a set of key columns, specify the "don't sort, previously sorted" option for those key columns in the Sort stage; this reduces the cost of sorting. When writing to parallel data sets, sort order and partitioning are preserved; when reading from these data sets, try to maintain this sorting by using the Same partitioning method. Use hash partitioning on the key columns for the Sort stage.
For example, assume you sort a data set on a system with four processing nodes and store the results to a Data Set stage; the data set will then have four partitions. You then use that data set as input to a stage executing on a different number of nodes, possibly due to node constraints. Unless you tell it not to, DataStage automatically repartitions the data set to spread it across all nodes in the system, possibly destroying the sort order of the data. You can avoid this by specifying the Same partitioning method: the stage then performs no repartitioning as it reads the input data set, and the original partitions are preserved.
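The property that hash partitioning guarantees can be sketched directly: every row with a given key value lands in exactly one partition, so one node sees all rows for that key. This is an illustrative model, not DataStage's actual hash function.

```python
def hash_partition(rows, key, nodes):
    """Assign each row to a partition by hashing its key column;
    identical keys always map to the same partition."""
    partitions = [[] for _ in range(nodes)]
    for row in rows:
        partitions[hash(row[key]) % nodes].append(row)
    return partitions

# Nine rows over three distinct keys, spread across four "nodes".
rows = [{"id": i % 3, "val": i} for i in range(9)]
parts = hash_partition(rows, "id", 4)

# Every row with a given id sits in exactly one partition --
# the precondition the Join stage relies on.
for key in (0, 1, 2):
    holders = [p for p in parts if any(r["id"] == key for r in p)]
    assert len(holders) == 1
```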
4.8 DataSet
The Data Set stage allows you to store data being operated on in a persistent form, which can then be used by other WebSphere DataStage jobs. Data sets are operating-system files, each referred to by a control file, which by convention has the suffix .ds. These files are stored on multiple disks in your system. A data set is organized in terms of partitions and segments. Each partition of a data set is stored on a single processing node, and each data segment contains all the records written by a single WebSphere DataStage job. A segment can therefore contain files from many partitions, and a partition has files from many segments. Because a data set stores its values across the nodes, it is difficult to view the data directly in Unix.
4.9 Change Data Capture Stage

Note 1: The Change Data Capture stage does not insert, update or delete records in a dataset or table. It only adds an extra column, CHANGE_CODE, recording the change. To apply the changes, you have to use a DB2 stage as the target.

Note 2: Change Data Capture can be used to compare identical data sets. By "data set" we do not mean only the DataStage dataset, but any set of records: flat files, database input or DataStage datasets.
4.10 CDC vs. UPSERT Mode in DB2 Stages
There are different ways of capturing the inserts and updates on a table, and no single stage is correct for all scenarios. What a CDC does can also be achieved using the UPSERT mode of a DB2 stage. UPSERT operates on a record-by-record basis: it tries to update first, and if the update fails it inserts the record. Alternatively, you can change the order to insert first and update when the insert fails, to match your business logic. CDC, on the other hand, compares the source and target and identifies the insert, update and delete records. Once the records are identified, they can be processed by separate DB2 stages to insert, update or delete. The inserts and updates can then be handled in batch mode, saving crucial load time. Though the up-front CDC comparison uses system resources and time, the batch-mode inserts and updates compensate for it. If the incoming file contains incremental data only, then processing records for "no change" and "delete" does not make sense; you can set Drop Output for Copy and Drop Output for Delete to True. Consider the following points before making your decision:
- How many records will this job process? If the number of records is 100,000 or fewer, choose UPSERT over CDC.
- Will you use just the key column(s) to identify a change, or compare entire rows? Choose CDC over UPSERT if you will compare the entire source record to the target record to identify the changes.
- What percentage of the records being processed are updates? This may be a definitive figure, say 95%, for today and a few months to come; but in the long run, say three years down the line, will the percentage of updates and inserts remain the same? As this is hard to predict, jobs are developed to cope with any insert/update percentage fluctuation, and CDC is the better candidate when designing for such changes.
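The CDC comparison can be sketched in miniature. This is an in-memory model of the idea (the key and column names are made up); the CHANGE_CODE values follow the stage's default convention of 0 = copy, 1 = insert, 2 = delete, 3 = edit.

```python
def capture_changes(before, after, key="id"):
    """Compare the 'after' (source) rows to the 'before' (target)
    rows on the key and tag each row with a CHANGE_CODE."""
    before_by_key = {r[key]: r for r in before}
    after_by_key = {r[key]: r for r in after}
    changes = []
    for k, row in after_by_key.items():
        if k not in before_by_key:
            changes.append({**row, "CHANGE_CODE": 1})   # insert
        elif row != before_by_key[k]:
            changes.append({**row, "CHANGE_CODE": 3})   # edit (update)
        else:
            changes.append({**row, "CHANGE_CODE": 0})   # copy (no change)
    for k, row in before_by_key.items():
        if k not in after_by_key:
            changes.append({**row, "CHANGE_CODE": 2})   # delete
    return changes

before = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
after = [{"id": 1, "v": "a"}, {"id": 2, "v": "B"}, {"id": 3, "v": "c"}]
codes = sorted(r["CHANGE_CODE"] for r in capture_changes(before, after))
assert codes == [0, 1, 3]   # one copy, one insert, one edit, no deletes
```

Dropping the copy and delete outputs for incremental feeds corresponds to simply filtering out codes 0 and 2 from the result.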
If DEV is selected from the Value drop-down list above, the development server name, user name and password will be passed to the job as parameters, as shown below.
If no selection is made from the Value drop-down list, the default server name, user name and password will be passed, as shown below.
The parameters differ between the development, test and production environments. You create a parameter set with three value sets, one for each environment. When running a job, DataStage prompts you for the value set to use for this parameter set and assigns the corresponding values to the parameters.
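The parameter-set behaviour described above can be modeled as a mapping from value-set name to parameter values. The server, user and password values here are placeholders for illustration.

```python
# One value set per environment; values are placeholders.
PARAMETER_SET = {
    "DEV":  {"server": "dev_db",  "user": "dev_usr"},
    "TEST": {"server": "test_db", "user": "test_usr"},
    "PROD": {"server": "prod_db", "user": "prod_usr"},
}
DEFAULT = {"server": "default_db", "user": "default_usr"}

def resolve(value_set=None):
    """Return the parameters for the chosen value set; with no
    selection, the defaults are passed, as described above."""
    return PARAMETER_SET.get(value_set, DEFAULT)

assert resolve("DEV")["server"] == "dev_db"
assert resolve()["server"] == "default_db"   # no selection -> defaults
```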
5 Job Representation
1. If too many stages are being used in a job, try splitting the job into multiple jobs; 15 stages in a job is a good number to start with. If your requirement cannot be handled with multiple jobs, try grouping the stages into local containers to reduce the number of stages on the job canvas.
2. Annotations should be added to all the complex logic stages.
3. Add a short description to all jobs.
4. Add the job version history in the full description, capturing the following information:
   - User Name: name of the user who changed the job
   - Description: a brief description of the change
   - Date of Change: date on which the change was made
   - Reason: reason for changing the job (defect, business logic change, or new business requirement)
6 Performance Tuning
6.1 General
When dealing with large volumes of data, consider separating the extract, transform and load operations and using datasets to capture the intermediate results. This helps with job maintenance and restartability. Stage the data coming from ODBC, OCI, DB2 UDB stages or any database on the server using hash or sequential files, both for optimum performance and for data recovery in case the job aborts. Filter out rows and columns that will not be used as early as possible in the job; for example, when reading from databases, use a select list to read just the columns required rather than the entire table (refer to section 4.3.4.2 for details). If a sub-query is used in multiple jobs, store its result set in a dataset and use that dataset instead. Convert some of the complex joins into separate jobs that populate tables, which can then be used as reference stages for lookups or joins. If an input file has an excessive number of rows and can be split up, use standard logic to run jobs in parallel. For example, if you are reading a large data file containing data for the whole year, it can be split into multiple files based on monthly data: use a Unix command such as grep Month (e.g. grep January) to limit each sequential file to the records of a particular month, and specify 12 Sequential File stages in the job, one for each month. This ensures the single sequential file is read and processed in parallel, saving time.
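The split-by-month step can be sketched as follows. The record layout is an assumption for illustration (month as the first comma-separated field); each group would be written to its own file and read by a dedicated Sequential File stage.

```python
from collections import defaultdict

# Sample yearly feed; month is assumed to be the first field.
lines = [
    "January,1001,ok",
    "February,1002,ok",
    "January,1003,ok",
]

by_month = defaultdict(list)
for line in lines:
    month = line.split(",", 1)[0]
    by_month[month].append(line)

# Each entry would be written to its own file, e.g. input_January.txt,
# mirroring `grep January input.txt > input_January.txt`.
assert len(by_month["January"]) == 2
assert len(by_month["February"]) == 1
```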
Avoid unnecessary stages, such as redundant transformations and functions; refer to Section 4.6 on the sparing use of the Transformer stage. Using a lookup instead of a join for small amounts of reference data is more efficient; refer to section 4.5 for details. Share the query load properly with DataStage: it is often better to implement the logic in DataStage instead of writing a complex query.
6.2 Sequential File Stage

Performance can be improved by separating the file I/O from the column-parsing operation. To accomplish this, define a single large string column for the non-parallel Sequential File read, and then pass it to a Column Import stage to parse the file in parallel. The formatting and column properties of the Column Import stage match those of the Sequential File stage.
Sequential File (Export) Buffering

By default, the Sequential File stage (export operator) buffers its writes to optimize performance. When a job completes successfully, the buffers are always flushed to disk. The environment variable $APT_EXPORT_FLUSH_COUNT allows the job developer to specify how frequently (in number of rows) the Sequential File stage flushes its internal buffer on writes. Setting this value to a low number (such as 1) is useful for real-time applications, but carries a small performance penalty from the increased I/O; use this option with care.
6.3 Complex Flat File Stage
Improving File Performance

If the source file is fixed-width or delimited, the Multiple Node Reading options under File Options can be used to read a single input file in parallel at evenly spaced offsets. Note that input row order is not maintained in this case.
Performance can be improved by separating the file I/O from the column parsing operation. To accomplish this, define a single large string column for the non-parallel Sequential File read, and then pass this to a Column Import stage to parse the file in parallel. The formatting and column properties of the Column Import stage match those of the Complex Flat File stage.
6.4 DB2 UDB API
The recommendation is to separate the extract, transform and load operations. However, if the data volume is less than 100,000 records, extraction, transformation and loading can be combined. In doing so, the same table may be used both as a lookup and as a target, which affects new- and changed-record identification. To ensure that records newly inserted into the table are not considered by the lookup, set the TRANSACTION ISOLATION to CURSOR STABILITY, as shown in the screenshot below.