
14 good design tips in DataStage

1. When you need to run the same set of jobs repeatedly, create a sequencer containing all the jobs you need to run. Running the sequencer runs all the jobs, in whatever order you specify.

2. If you use a Copy or a Filter stage immediately before or after a Transformer stage, you reduce efficiency by adding stages, because a Transformer can do the work of both a Copy stage and a Filter stage.

3. Use Sort stages instead of Remove Duplicates stages. The Sort stage has more grouping options and sort-indicator options.

4. Turn off Runtime Column Propagation wherever it is not required.

5. Use Modify, Filter, Aggregator, Column Generator, and similar stages instead of a Transformer stage only if the anticipated volumes are high and performance becomes a problem; otherwise use the Transformer. It is much easier to code a Transformer than a Modify stage.

6. Avoid propagating unnecessary metadata between stages. Use a Modify stage to drop it; note that Modify drops metadata only when explicitly told to with a DROP clause.

7. Add reject links wherever you need to reprocess rejected records, or wherever you think considerable data loss may happen. At a minimum, keep reject handling on Sequential File stages and on stages that write to databases.

8. Use an ORDER BY clause when a database stage feeds a Join, so that the database's power is used for sorting instead of DataStage resources. Keep the Join partitioning as Auto, and when using ORDER BY, place a Sort stage with the "don't sort (previously sorted)" option between the database stage and the Join stage.

9. When doing outer joins, you can use dummy variables for null checking instead of fetching an explicit column from the table.

10. Data partitioning is a very important part of parallel job design. It is advisable to leave partitioning as Auto unless you are comfortable with partitioning, since all DataStage stages are designed to perform correctly with Auto partitioning.

11. Remember that Modify drops metadata only when explicitly asked to do so with KEEP/DROP clauses.

12. Range lookup: a range lookup is the equivalent of the BETWEEN operator. Looking up against a range of values was difficult to implement in earlier DataStage versions. With this functionality in the Lookup stage, comparing a source column to a range of two lookup columns, or a lookup column to a range of two source columns, is easy to implement.

13. Use a Copy stage to dump data out to intermediate Peek stages or sequential debug files. Copy stages are removed at compile time, so they add no overhead.

14. Where you use a Copy stage with a single input and a single output and want it kept, set the Force property to True in the stage editor. This prevents DataStage from deciding that the Copy operation is superfluous and optimizing it out of the job.
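The range lookup in tip 12 has BETWEEN semantics: each source row matches the reference row whose low/high bounds contain its key. As a rough illustration of that idea outside DataStage, here is a minimal Python sketch; the column names, ranges, and band labels are invented for the example:

```python
# Sketch of range-lookup (BETWEEN) semantics: match a source key to the
# reference row whose [low, high] interval contains it. Assumes the
# reference ranges are non-overlapping and sorted by lower bound.
import bisect

ranges = [
    {"low": 0,   "high": 99,  "band": "bronze"},
    {"low": 100, "high": 499, "band": "silver"},
    {"low": 500, "high": 999, "band": "gold"},
]
lows = [r["low"] for r in ranges]

def range_lookup(key):
    """Return the reference row where low <= key <= high, else None."""
    i = bisect.bisect_right(lows, key) - 1   # last interval starting <= key
    if i >= 0 and key <= ranges[i]["high"]:
        return ranges[i]
    return None

for amount in (42, 250, 750, 2000):
    match = range_lookup(amount)
    print(amount, match["band"] if match else "no match")
```

The binary search keeps each probe cheap even for large reference sets, which is the same reason the Lookup stage handles this more efficiently than a row-by-row comparison.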

14 design tips for better performance in DataStage


1. Avoid unnecessary type conversions: set the OSH_PRINT_SCHEMAS environment variable to verify that run-time schemas match the job design column definitions. If you are using stage variables in a Transformer stage, ensure that their data types match the expected result types.

2. Use Transformer stages sparingly and wisely; they can slow down your job. Do not use multiple stages where the functionality could be incorporated into a single stage, and use other stage types to perform simple transformation operations.

3. Increase sort performance where possible. Careful job design can improve the performance of sort operations, both in standalone Sort stages and in on-link sorts specified on the Inputs page Partitioning tab of other stage types.

4. Remove unneeded columns as early as possible within the job flow. Every additional unused column requires additional buffer memory, which can impact performance and makes each row transfer from one stage to the next more expensive. If possible, when reading from databases, use a select list to read just the columns required rather than the entire table.

5. Avoid reading from sequential files using the Same partitioning method. Unless you have specified more than one source file, this results in the entire file being read into a single partition, making the entire downstream flow run sequentially unless you explicitly repartition. (Note: the environment variable APT_STRING_PADCHAR can be used to change the default pad character from an ASCII NUL (0x00) to another character, for example an ASCII space (0x20).)

6. Note that the individual SQL statements required by a sparse lookup are an expensive operation from a performance perspective. In most cases, it is faster to use a DataStage Join stage between the input and the DB2 reference data than to perform a sparse lookup.

7. For scenarios where the number of input rows is significantly smaller (1:100 or more) than the number of reference rows in a DB2 or Oracle table, a sparse lookup may be appropriate. CPU-intensive applications, which typically perform multiple CPU-demanding operations on each record, benefit from the greatest possible parallelism, up to the capacity supported by your system.

8. Parallel jobs with large memory requirements can benefit from parallelism if they act on data that has been partitioned and if the required memory is also divided among partitions.

9. Applications that are disk- or I/O-intensive, such as those that extract data from and load data into RDBMSs, benefit from configurations in which the number of logical nodes equals the number of disk spindles being accessed. For example, if a table is fragmented 16 ways inside a database, or if a data set is spread across 16 disk drives, set up a node pool consisting of 16 processing nodes.

10. For some jobs, especially those that are disk-intensive, you must sometimes configure your system to prevent the RDBMS from having either to redistribute load data or to re-partition the data from an extract operation.

11. Your configuration should optimize the speed of communication among stages. For example, jobs whose stages exchange large amounts of data should be assigned to nodes where stages communicate by either shared memory (in an SMP environment) or a high-speed link (in an MPP environment). The relative placement of jobs whose stages share small amounts of data is less important.

12. Turn off Runtime Column Propagation wherever it is not required.

13. If you use a Copy or a Filter stage immediately before or after a Transformer stage, you reduce efficiency by adding stages, because a Transformer can do the work of both a Copy stage and a Filter stage.

14. If a Copy stage has a single input and a single output, set the Force option to True so that DataStage does not optimize the job by removing the Copy operation.
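The trade-off in tips 6 and 7 is between loading the whole reference table once (normal lookup or join) and issuing one SQL statement per input row (sparse lookup). The following Python sketch contrasts the two access patterns, using SQLite as a stand-in for DB2/Oracle; the table and column names are invented for illustration:

```python
# Contrast a normal (in-memory) lookup with a sparse (per-row SQL)
# lookup. SQLite stands in for the reference database here.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ref (id INTEGER PRIMARY KEY, descr TEXT)")
conn.executemany("INSERT INTO ref VALUES (?, ?)",
                 [(i, f"item-{i}") for i in range(1000)])

source_rows = [7, 42, 999]  # very few input rows vs 1000 reference rows

# Normal lookup: pull the whole reference table into memory once, then
# probe a hash table per input row. Cheap per row, costly up front.
ref = dict(conn.execute("SELECT id, descr FROM ref"))
normal = [(k, ref.get(k)) for k in source_rows]

# Sparse lookup: fire one SQL statement per input row. Expensive per
# row, but avoids materializing 1000 rows to serve only 3 probes --
# worthwhile only when inputs are far fewer than reference rows.
sparse = []
for k in source_rows:
    row = conn.execute("SELECT descr FROM ref WHERE id = ?", (k,)).fetchone()
    sparse.append((k, row[0] if row else None))

assert normal == sparse  # both strategies return the same matches
```

With 3 input rows against 1000 reference rows (well past the 1:100 ratio), the per-row statements win; invert the ratio and the up-front load (the join-like pattern) becomes the faster choice.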
