
Data Warehouse Project

Amendment History
Legend: Version 1.0 - Draft

Document Author: Arunkumar Sathiamoorthy

Change Record:
Version | Action By | Company | Date & Time | Action Description
1.0     |           | HCL     | 17-Aug-2007 | Document Creation

Reviewers:
Version | Name | Company | Position

Distribution:
Version | Date | Name | Company | Position


Table of Contents
1 Purpose
2 Scope
3 Assumptions
4 Standards
  4.1 Parameters
  4.2 General
  4.3 Job Naming Conventions
  4.4 Naming convention for the stages and links
  4.5 ODBC stage / Native Stages
  4.6 Bulk Load stage
  4.7 Lookups and references
  4.8 Transformation
  4.9 Before / After Job Sub Routines
  4.10 Sequential File
  4.11 Directory Structures and File Naming Conventions
  4.12 Programming Tips and Practices
  4.13 Scheduler Using DataStage

ETL Standards

1 Purpose
The purpose of this document is to describe the standards for the ETL process. Following these standards saves a significant amount of development time, makes the application easier to maintain and support, keeps the ETL process consistent, and cuts down the number of rework iterations in the project.

2 Scope
The scope of this document is limited to the standards to be used in ETL development; the document will be used in all phases of this project.

3 Assumptions
This document is prepared based on the following assumptions: the DataStage server version is used to perform the ETL process, Oracle 9i is the target data warehouse database, and no source data is available in the form of ASCII text files.

4 Standards
4.1 Parameters

Parameters are important for making a DataStage job independent of the operating system and the database being used; hence, decide on parameters up front and start designing jobs only after the required parameters have been defined. Eg. in an ODBC stage the user name is referenced as #userid#. The following table lists the parameters required for this project. Note that all the parameter names are in lower case.

Parameter Name | Prompt                                 | Type      | Help Text
userid         | Enter User ID                          | String    | User ID that is used to connect to the source/target.
passwd         | Enter Password                         | Encrypted | Password that is used to connect to the source/target.
srcdsn         | Enter Source Data Source Name          | String    | The source data source name.
srcschema      | Enter Source Schema                    | String    | The owner name of the source tables, which are in the Siebel DB.
tgtdsn         | Enter Target Data Source Name          | String    | The target data source name.
stgschema      | Enter Staging Schema Name              | String    | The staging schema name.
tgtschema      | Enter Target Schema                    | String    | The owner name of the target tables, which are in the Oracle DB.
path           | DS File Path                           | String    | Base directory path where all intermediate files are created during the ETL process. If the OS is NT, use \ as the directory separator; if the OS is UNIX, use /.
release        | Job Release Number                     | String    | DataStage job release number. Eg. 25.1.1.
loadfreq       | Frequency of Loading                   | String    | How a particular file is loaded into the data warehouse: Daily, Weekly or Monthly.
dirsep         | Directory Separator (For NT \, Unix /) | String    | If the OS is UNIX use /; if the OS is NT use \.
bizdate        | Business Date                          | Date      | The actual business date for which the run is performed.
lastbizdate    | Last Business Date                     | Date      | The business date for which the last run was performed.
runtype        | Run Type                               | String    | The value can be FR (Fresh Run), RE (Rerun) or RR (Resume Run).
rundate        | Run Date                               | Date      | The date on which the run is to be performed.
srcstm         | Source Name                            | String    | Source system code.
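As an illustration of how these parameters are used (a minimal sketch; the table name CUSTOMER and the file name CustExtractSeq.txt are hypothetical and not part of this project), a parameter is referenced in stage properties and derivations by enclosing its name in # characters:

User name property of an ODBC stage  : #userid#
Table reference in a generated query : #srcschema#.CUSTOMER
Sequential file path                 : #path##dirsep#CustExtractSeq.txt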

4.2 General

Describe the functionality of a job in the Short job description, and give a detailed description in the Full job description. The Full job description should include the following: detailed description, creator name, creation date, modified date, and reason for modification.

For the System Integration Test (SIT), use a job sequencer to run the child jobs and select the 'Reset if required, then run' option in the job sequencer. This enables a job to be re-executed in the event that it aborted due to some environmental problem.

While running the jobs, if one job fails, the job sequencer should stop and abort all the other jobs.


Jobs should use table structures, file structures, etc. only from the metadata repository, to eliminate any sort of confusion or erroneous usage. This helps keep the central metadata repository up to date and reliable.

Jobs are to be compiled and their logs cleaned, which means a job should not have any warnings.

Auto-purge of the job log should be set at the Administrator level, using 'Up to previous runs' with a value of 2 (two). This can be changed if Business / Operations specifically request it.

Keep each job as simple as possible; do not try to do everything in a single job. For example: a. Extract data from the source DB and load it into a sequential file before doing transformations. b. Load the transformed data into a sequential file before loading it into the target DB.

4.3 Job Naming Conventions


A tree hierarchy is to be maintained, as follows: under the project name there are subcategories named Extraction, Transformation, Lookup and Load, and these are further divided into dimension and fact subgroups.

Job Prefix | Category       | Remarks
Ext        | Extraction     | Jobs which extract data from the source DB and check field & row counts.
Xfm        | Transformation | Jobs which do derivations and transformations such as local code to standard code and data type conversions.
Agg        | Transformation | Jobs which do aggregations.
Cal        | Transformation | Jobs that do the business calculations.
Del        | Transformation | Jobs that delete data from the tables.
Pro        | Transformation | Jobs that call a stored procedure to do complex transformations.
Mrg        | Transformation | Jobs that merge a few sequential files into one.
Jng        | Transformation | Jobs that join sequential files using key columns.
LCont      | Transformation | Jobs that use a local container.
SCont      | Transformation | Jobs that use a shared container.
Upd        | Load           | Jobs that update the data in the table.
Oblk       | Load           | Jobs which load into the Oracle database.
JSeq       | General        | Job sequencer.


If there is a point where two or more sources are functionally combined, the combined jobs are grouped under a common name and then follow the naming convention mentioned above. I.e. if two sources provide customer information, those jobs are initially present in the respective source directories; once the data is combined, create a third directory called Customer and continue in that directory. Eg.

MOBILY (project name)
    EXTRACTION (extraction hierarchy)
        ENTITY (dimension/fact name)
            EXTCustDataFromBill (job name with source Bill as prefix)
            EXTCustDataFromCRM (job name with source CRM as prefix)
    TRANSFORMATION
        ENTITY
            XFMCustomerData
    LOAD
        ENTITY
            OBLKCustomerData

4.4 Naming convention for the stages and links


These naming conventions are helpful when monitoring jobs. A stage that retrieves data from a database has a three-letter identity that states whether a table, view, etc. is being referenced, plus the name of the object used to retrieve the data. If more than one object is used, give the name of the more important table.

Stage names are constructed as <Prefix><Functionality><Suffix>.
Link names are constructed as <To/From><Functionality><Suffix>.

The following prefixes and suffixes are used in stage names.

Prefix | Description
Src    | Source
Tgt    | Target
Lkp    | Lookup file
Rej    | Rejections
Xfm    | Transformation
Agg    | Aggregator
Pvt    | Pivot
Srt    | Sort
Mrg    | Merge
Lcl    | Link Collector
Lpr    | Link Partitioner
Ipc    | Inter Process Collector
Cont   | Container

Suffix | Description
Seq    | Sequential file
Tbl    | Table
Vew    | View
Hsh    | Hash file

Eg.
ODBC stage      : SrcAccountTbl
Link            : ToAccountAgg
Lookup link     : FromEntityHsh
Aggregator      : Agg<Functionality>
Transformer     : Xfm<Functionality>
Hash file       : TgtEntityHsh
Reject          : RejCustSeq
Sequential file : TgtErrCustCallSeq
BCP stage       : TgtCustCallFactTbl

4.5 ODBC stage / Native Stages.


Always use a native stage rather than an ODBC stage, to avoid an extra layer in extraction/loading. Wherever possible use the Generated Query method instead of a User Defined query: specify the columns and the WHERE clause under the Selection tab. Set the quote character to 000 in the native stage (Stage -> General page).

Use a fully qualified path, built from a parameter, when referring to an object in the database. Mention the schema name using the schema parameter to refer to a table in the database, e.g. #schema#.<Table Name>.


Also use the schema name parameter in derivations, e.g. #schema#.<Table Name>.<Column Name>.

If selection criteria are involved, they are also to be qualified with the schema name parameter. Enclose the parameter in single quotes if the parameter is of string type. Eg. #schema#.job_control.job_id = #seqno#
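A minimal sketch of the quoting rule (the customer table and its src_stm_cd column are hypothetical; #srcstm# is the string parameter from section 4.1, while #seqno#, as in the example above, is numeric and therefore not quoted):

Numeric parameter : #schema#.job_control.job_id = #seqno#
String parameter  : #schema#.customer.src_stm_cd = '#srcstm#'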

4.6 Bulk Load stage


Always write to a text file before loading into the table.

If there is a scale factor in the source file, it should be carried all the way through to the text file that is input to the Bulk Load stage. The Bulk Load input text file and the Bulk Load stage must both use the same metadata definition; do not tamper with the column formats.

4.7 Lookups and references


Wherever lookups are required, use a hashed file.

Create all hash files in a predefined directory created for this purpose. Choose the Create file option, together with Delete file before create, when creating hash files. Always choose the Type 30 file, which is a dynamic hash file, unless you are aware of the impact of creating static hashed files.

Choose the Allow Stage Write Cache option while creating Hash files.


4.8 Transformation

If writing to a text file, the display length is the one that is used. Where a column's values need to be pivoted, use the star-transformer technique, i.e. write the results to a number of text files, create a job to append them all into one, and bulk insert it into the target table. Use this only when the requirement cannot be met using a Pivot stage.

Use a Pivot stage if you want to do a horizontal data pivot.

When transforming a large number of records, use a Link Partitioner to partition the data and a Link Collector to collect it back into a single stream. The number of Link Partitioner and Link Collector stages used depends on the number of processors in the ETL server.

When a single Transformer stage is used to transform the data, enable the Inter Process option in the job properties; this gives better performance. Use built-in functions rather than creating new functions. For example, to assign 0 to a column that could have a null value, use the NullToZero function, as in the sketch below.
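A minimal derivation sketch (the input link name in_Cust and the column name AMOUNT are hypothetical):

* Transformer output column derivation: substitute 0 when the incoming amount is null
NullToZero(in_Cust.AMOUNT)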

4.9 Before / After Job Sub Routines


Call a batch file or shell script from the job control code of the job rather than using a Before / After job subroutine (see the sketch at the end of this section).

Where data in the target table needs to be deleted before a bulk insert, use the REPLACE option in the Load mode of the Bulk Insert stage.

Where data in the target table needs to be deleted before an insert, use the Before SQL option to delete the data in the target table. Add a short note in the job description regarding the activity / purpose of the routine.
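A minimal job control sketch, assuming a UNIX server; the script name and location /apps/etl/scripts/archive_files.sh are hypothetical. DSExecute is the DataStage BASIC interface for running operating system commands:

* Run a housekeeping shell script from job control instead of a Before/After subroutine
Cmd = "/apps/etl/scripts/archive_files.sh"   ;* hypothetical script location
Call DSExecute("UNIX", Cmd, ScreenOutput, ReturnCode)
If ReturnCode <> 0 Then
   Call DSLogFatal("archive_files.sh failed with return code " : ReturnCode, "JobControl")
End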

4.10 Sequential File

To combine multiple sequential files of the same format into one sequential file, read the sequential files and use a Link Collector to combine all records from the sources, or use an operating system command (cat or type) to combine all the sequential files into one.

Use the round robin method (default) in the Link Collector stage. Use the Sort / Merge option only if there is a requirement to sort the incoming data. Use the following settings when defining the sequential stage:

Delimiter              : 009
Quote character        : 000
Default padding        :
Missing columns action : Pad with SQL NULL (default)

When extracting source data for the first time, all character fields are to be cleansed by removing all leading and trailing white space. If there is more than one white space between two words, it is also to be reduced to a single space using the Trim function (see the derivation sketch below). Eg.

1. space(2) + SubString1 + space(1) + SubString2 + space(3)  =>  SubString1 + space(1) + SubString2
2. SubString1 + space(1) + SubString2 + space(3)             =>  SubString1 + space(1) + SubString2
3. SubString1 + space(3) + SubString2                        =>  SubString1 + space(1) + SubString2
4. space(2) + SubString1 + space(1) + SubString2             =>  SubString1 + space(1) + SubString2
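A minimal derivation sketch (the link name in_Src and the column name CUST_NAME are hypothetical); with a single argument, the DataStage Trim function removes leading and trailing spaces and reduces multiple embedded spaces to one:

* Transformer derivation: cleanse a character field on first extraction
Trim(in_Src.CUST_NAME)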

4.11 Directory Structures and File Naming Conventions

A few subdirectories are created under the main path, as follows:

Source  - All extracted files are placed in this directory.
Target  - All load files are placed in this directory.
Staging - All intermediate files are placed in this directory.
Lookup  - All lookup and hash files are placed in this directory. The CDC Siebel stage requires a file to store the last extraction date and time; those files are also placed in this directory.

File names must be in initial caps, must not contain any underscore, and should be created using the convention mentioned below. The first 3 letters identify the dimension/fact. The next 3 letters identify whether the file belongs to source/target/staging/lookup, and the remaining characters should be meaningful to the process. Sequential files must be created under one of the above-mentioned directories, depending on the process.
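Eg. (hypothetical, not taken from this project): CusStgCleansedData.txt - Cus identifies the Customer dimension, Stg identifies it as a staging file, and CleansedData describes the process.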

4.12 Programming Tips and Practices.


Use DataStage functions wherever possible instead of user-defined logic. Eg. 1. For composite field comparison, or when several fields need to be compared between a source and a lookup for any kind of value difference, use the CRC32 function instead of comparing individual fields (see the sketches after this list).


2. Use OCONV and ICONV functions for date conversions.
3. When a large number of records need to be processed, use a Link Partitioner and a Link Collector.
4. When using a lookup, capture the unmatched records into a sequential file for reconciliation purposes.
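Minimal derivation sketches for points 1 and 2 (the link name in_Src and the column names are hypothetical; the date conversion codes shown are common OCONV/ICONV formats, given only as an illustration):

* Point 1: compare a composite of several fields at once using a single CRC value
CRC32(in_Src.CUST_NAME : in_Src.ADDRESS : in_Src.PHONE)

* Point 2: convert an external date (YYYY-MM-DD) to internal format, then output it as DD-MM-YYYY
OCONV(ICONV(in_Src.BIZ_DATE, "D-YMD[4,2,2]"), "D-DMY[2,2,4]")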

4.13 Scheduler Using DataStage


Use Job Sequencer to schedule a job.
