
IBM DATASTAGE

Day 1

Prepared by : Devendra Kumar Yadav


Agenda

1 Basic Data warehousing concepts

2 ETL Concepts

3 Datastage Architecture

4 Datastage Components

5 Datastage Jobs – Parallel Job features



Data Warehousing Concepts and Terms

Two terms are of great importance in understanding data warehousing concepts:

Operational Data:
It is the data that is used to run a business. This data is typically stored,
retrieved and updated by an Online Transaction Processing (OLTP) system.
Operational data is typically stored in a relational database, but may be stored
in legacy, hierarchical or flat file formats as well.

Informational Data:
It is stored in a format that makes analysis much easier. Analysis can be in the
form of decision support (queries), report generation, executive information
systems, and more in-depth statistical analysis. Informational data is created
from the wealth of operational data that exists in the business. Informational
data is what makes up a data warehouse.
Definition of Data Warehouse

Definition
 A structured extensible environment designed for the analysis of non-
volatile data, logically and physically transformed from multiple source
applications to align with business structure, updated and maintained for
a long time period, expressed in simple business terms and summarized
for quick analysis.

Why did the data warehouse come into the picture?

 Data warehouse implementation solved the following user problems:
 I cannot find the data I need
 I cannot get the data I need
 I cannot understand the data I found
 I cannot use the data I found
BI Architecture

(Diagram, showing the Meta Data layer.)

EDW – Top Down Architecture

Bottom Up Approach
Characteristics of a Data Warehouse: Subject-Oriented

• It focuses on modeling and analysis of data for decision makers.

• It excludes data that is not useful in the decision support process.
• The data warehouse is organized around subjects such as sales, product and
customer.
Characteristics of a Data Warehouse: Non-Volatile

(Diagram) Operational system: records are inserted, replaced and deleted – highly volatile.
Data warehouse: data is loaded and then accessed read-only – non-volatile.


Characteristics of a Data Warehouse: Integration

• The data warehouse is constructed by integrating multiple heterogeneous
sources.
• Data preprocessing is applied to ensure consistency.

(Diagram) RDBMS, legacy system and flat file sources feed the data warehouse
through data processing and data transformation.
Characteristics of a Data Warehouse: Time Variant

(Diagram) Operational system: current data, time horizon of 60-90 days.
Data warehouse: historic data, time horizon of 5-10 years.

Data Warehouse Application - OLAP

The data undergoes a series of transformations from raw data to strategic
information.

(Diagram, pyramid from bottom to top) Disparate data sources – raw data;
transaction processing – operational data; data warehouse – information;
analysis – strategic planning.

Multidimensional View of Information - CUBE

 Structures data around natural business concepts.

 Provides foundation for efficient, sophisticated business analysis.

(Diagram) A sample cube with dimensions Region (East, West), Month (January,
February), Scenario (Actual, Budget) and Product (TV, VCR), holding the measures
Sales and Margin. Different users – Accountant, Region Manager, Product Manager,
Financial Analyst – look at different slices of the same cube.
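
To make the cube idea concrete, here is a minimal Python sketch; the sales figures and the slice_cube helper are illustrative assumptions, not data from the slide. It stores facts keyed by the Region, Month, Scenario and Product dimensions and slices them the way the users above would.

# Minimal multidimensional "cube" sketch; all figures are made up.
facts = {
    # (region, month, scenario, product): sales
    ("East", "January", "Actual", "TV"): 120.0,
    ("East", "January", "Budget", "TV"): 110.0,
    ("West", "January", "Actual", "VCR"): 80.0,
    ("West", "February", "Actual", "TV"): 95.0,
}

def slice_cube(region=None, month=None, scenario=None, product=None):
    # Aggregate sales over every dimension that is left as None.
    total = 0.0
    for (r, m, s, p), sales in facts.items():
        if region and r != region:
            continue
        if month and m != month:
            continue
        if scenario and s != scenario:
            continue
        if product and p != product:
            continue
        total += sales
    return total

# Region Manager's view: actual sales for the East region, all products and months.
print(slice_cube(region="East", scenario="Actual"))
# Product Manager's view: actual TV sales across all regions.
print(slice_cube(product="TV", scenario="Actual"))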


OLTP vs. Data Warehouse

OLTP                                             | Data Warehouse
Performance sensitive                            | Performance relaxed
Few records accessed at a time (tens)            | Large volumes accessed at a time (millions)
Read/update access                               | Mostly read access
Data redundancy is absent                        | Redundancy present
Database size: 100 MB - 100 GB                   | Database size: 100 GB - a few terabytes
Transaction throughput is the performance metric | Query throughput is the performance metric
Has to deal with thousands of users              | Has to deal with hundreds of users
Application oriented                             | Subject oriented
Detailed data                                    | Summarized and refined data
Current, up-to-date data                         | Snapshot data
Isolated data                                    | Integrated data
Repetitive access                                | Ad-hoc access
Normal users                                     | Knowledge users


ETL Overview

 Extract Transform and Load is used to populate a data warehouse


 Extract is where data is pulled from source systems
– SQL connect over networks
– Flat files
– Transaction messaging (MSMQ)
 Transformations can be the most complex part of data warehousing
– Convert text to numbers
– Apply business logic in this stage
 Load is where data is loaded into the data warehouse
– Sequential or bulk loading
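
As an illustration only (this is plain Python, not DataStage, and the file and column names are assumptions), the sketch below mimics the three phases against a flat file: extract rows from a source CSV, transform text amounts into numbers plus a simple business rule, and load the result to a load-ready file.

# Minimal E-T-L sketch over flat files; file names and columns are hypothetical.
import csv

def extract(path):
    # Extract: pull rows from a source flat file.
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    # Transform: convert text to numbers and apply simple business logic.
    for row in rows:
        row["total_sales"] = float(row["total_sales"])   # text -> number
        row["high_value"] = "Y" if row["total_sales"] > 50 else "N"
        yield row

def load(rows, path):
    # Load: write load-ready data sequentially to the target file.
    rows = list(rows)
    if not rows:
        return
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

load(transform(extract("sales_raw.csv")), "sales_load_ready.csv")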
WHAT IS ETL?

 ETL - extract, transform and load

– is the set of processes by which data is extracted from numerous


databases, applications and systems, transformed as appropriate, and
loaded into target systems - data warehouses, data marts, analytical
applications, etc.

– More than half of all development work for data warehousing projects
is typically dedicated to the design and implementation of ETL
processes.

– Poorly designed ETL processes are costly to maintain, change and


update, so it is critical to make the right choices in terms of the right
technology and tools that will be used for developing and maintaining
the ETL processes.

Sample Process: File to Load Ready

(Diagram) Raw data files and system-generated files are landed in a staging DB;
the DataStage ETL process, using lookup data sets, produces load-ready data for
the data warehouse.
Transforming Examples

Change (recode a value, e.g. Roman to numeric region IDs):
  Before (buyer_name, reg_id, total_sales):
    Barr, Adam      II    17.60
    Chai, Sean      IV    52.80
    O’Melia, Erin   VI    8.82
  After (buyer_name, reg_id, total_sales):
    Barr, Adam      2     17.60
    Chai, Sean      4     52.80
    O’Melia, Erin   6     8.82

Combine (merge columns, e.g. first and last name):
  Before (buyer_first, buyer_last, reg_id, total_sales):
    Adam   Barr      2    17.60
    Sean   Chai      4    52.80
    Erin   O’Melia   6    8.82
  After (buyer_name, reg_id, total_sales):
    Barr, Adam      2     17.60
    Chai, Sean      4     52.80
    O’Melia, Erin   6     8.82

Calculate (derive a new column, e.g. total_sales = price_id * qty_id):
  Before (buyer_name, price_id, qty_id):
    Barr, Adam      .55   32
    Chai, Sean      1.10  48
    O’Melia, Erin   .98   9
  After (buyer_name, price_id, qty_id, total_sales):
    Barr, Adam      .55   32   17.60
    Chai, Sean      1.10  48   52.80
    O’Melia, Erin   .98   9    8.82
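
A minimal Python sketch of the same three transforms; the column names come from the tables above, while the helper functions themselves are illustrative.

# Sketch of the Change / Combine / Calculate transforms shown above.
ROMAN = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6}

def change(row):
    # Change: recode the Roman-numeral region id to a number.
    row["reg_id"] = ROMAN[row["reg_id"]]
    return row

def combine(row):
    # Combine: merge first and last name into "last, first".
    row["buyer_name"] = row.pop("buyer_last") + ", " + row.pop("buyer_first")
    return row

def calculate(row):
    # Calculate: derive total_sales from price and quantity.
    row["total_sales"] = round(row["price_id"] * row["qty_id"], 2)
    return row

print(change({"buyer_name": "Barr, Adam", "reg_id": "II", "total_sales": 17.60}))
print(combine({"buyer_first": "Adam", "buyer_last": "Barr", "reg_id": 2, "total_sales": 17.60}))
print(calculate({"buyer_name": "Barr, Adam", "price_id": 0.55, "qty_id": 32}))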
Datastage - Overview

 DataStage is one of the leading ETL products on the BI market.

 DataStage allows integration of data across multiple systems and processing of
high volumes of data.

 DataStage was formerly known as Ascential DataStage; in 2005 it was acquired by
IBM and added to the WebSphere family. From 2006 its official name was IBM
WebSphere DataStage, and in 2008 it was renamed IBM InfoSphere DataStage.


Datastage Features

 DataStage has the following features to aid the design and processing required
to build a data warehouse:
 Uses graphical design tools. With simple point-and-click techniques you can
draw a scheme to represent your processing requirements.
 Extracts data from any number or type of database.
 Handles all the meta data definitions required to define your data warehouse.
You can view and modify the table definitions at any point during the design of
your application.
 Aggregates data. You can modify SQL SELECT statements used to extract data.
 Transforms data.
 DataStage has a set of predefined transforms and functions you can use to
convert your data. You can easily extend the functionality by defining your own
transforms to use.
 Loads the data warehouse.
 DataStage consists of a number of client and server components.



IBM InfoSphere Information Server



Client Components

 DataStage has four client components which are installed on any PC running
Windows 2000 or Windows NT 4.0 with Service Pack 4 or later:

 DataStage Designer. A design interface used to create DataStage applications


(known as jobs). Each job specifies the data sources, the transforms required,
and the destination of the data. Jobs are compiled to create executables that
are scheduled by the Director and run by the Server (mainframe jobs are
transferred and run on the mainframe).

 DataStage Director. A user interface used to validate, schedule, run, and


monitor DataStage server jobs and parallel jobs.

 DataStage Administrator. A user interface used to perform administration tasks
such as setting up DataStage users, creating and moving projects, and setting
up purging criteria.

 DataStage Manager. A user interface used to view and edit the contents of the
repository, such as table definitions and user-defined components.


Client Logon
DataStage Designer
DataStage Director
Table Definition
DataStage Administrator
SERVER COMPONENTS

 There are three server components:


 Repository. A central store that contains all the information required
to build a data mart or data warehouse.

 DataStage Server. Runs executable jobs that extract, transform, and


load data into a data warehouse.

 DataStage Package Installer. A user interface used to install packaged


DataStage jobs and plug-ins.



DATASTAGE PROJECTS

 You always enter DataStage through a DataStage project; when you start a
DataStage client you are prompted to attach to a project.
 Each project contains:
 DataStage jobs.
 Built-in components. These are predefined components used in a
job.
 User-defined components. These are customized components
created using the DataStage Manager. Each user-defined component
performs a specific task in a job.
 A complete project may contain several jobs and user-defined
components.
 There is a special class of project called a protected project.
 Nothing can be added, deleted, or changed in a protected project.
 Users can view objects in the project but cannot change a job’s design.


DataStage Jobs

Parallel Jobs
 Executed under control of the DataStage Server runtime environment
 Built-in functionality for pipeline and partitioning parallelism
 Compiled into OSH (Orchestrate Scripting Language) – executable C++ class
instances
 Runtime monitoring in DataStage Director
Job Sequences (Batch jobs, Controlling jobs)
 Master controlling jobs that kick off other jobs and activities
 Can kick off server or parallel jobs
 Runtime monitoring in DataStage Director
Server Jobs (requires Server Edition license)
 Executed by the DataStage Server Edition
 Compiled into BASIC (interpreted pseudo-code)
 Runtime monitoring in DataStage Director
Mainframe Jobs (requires Mainframe Edition license)
 Compiled into COBOL
 Executed on the mainframe, outside of DataStage
Design Elements of Parallel Job

Stages
 Implemented as OSH operators (pre-built components)
 Passive stages (E and L of ETL)
• Read data
• Write data
• e.g. sequential file, Oracle
 Processor (active) stages (T of ETL)
• Transform data
• Filter data
• Aggregate data
• Generate data
• Split/merge data
• e.g. Transformer, Aggregator, Join, Sort stages
Links
 “Pipes” through which the data moves from stage to stage
CREATING A SAMPLE JOB

 Requirement

 To sync data from production to test



CREATING A SAMPLE JOB CONTD.

(Screenshot walkthrough, one slide per step.)

1. Log in to the Designer by attaching to the required project (enter Server:Port and select the project to attach to).
2. Create a new job by clicking File -> New.
3. Select "Parallel Job" and click "OK".
4. Select View -> Palette; the Palette bar appears in the Designer.
5. Drag and drop the source stage from the Palette to the design area.
6. Drag and drop the processing stage from the Palette to the design area.
7. Drag and drop the target stage from the Palette to the design area.
8. Add links between the stages.
9. Rename the stages with proper names.
10. Save the job.
11. Edit the properties of the source stage.
12. Provide the mapping in the Copy stage.
13. Edit the properties of the target stage.
14. Compile the job.
15. Run the job; the job status moves from "Running" to "Finished".
16. View and monitor the job log through the Director.
17. Check the data in the target table.


STEPS TO CREATE A DATASTAGE JOB:

 1. Prepare a list of the parameters you will have to use in your job.

 2. Check whether those parameters are present in the parameter list; if not,
add them.
 3. Add those parameters in the job parameters tab.
 4. Drag and drop your source, target and the other stages that are required to
build the job.
 5. Link the stages as per your requirement.
 6. Enter metadata, user name and password in the source and target stages. Do
the necessary transformations if required. Put appropriate annotations.
 7. Compile the job and run it.
 8. Go to Tools -> Run Director to check the status of your job.


Datastage Parallelism

Pipeline Parallelism

Stages fill pipelines, so the next stage can start processing data before the
previous stage has finished.


Datastage Parallelism

Pipeline Parallelism

 Process of pulling records from the source system and moving them through the
sequence of processing functions that are defined in the data flow.
 Data can be buffered in blocks so that each process is not slowed when other
components are running.
 This approach avoids deadlocks and speeds performance by allowing both
upstream and downstream processes to run concurrently.
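
As a rough analogy only (this is not how the parallel engine is implemented), the Python sketch below chains generator "stages" so that each downstream stage consumes a record as soon as the upstream stage produces it, rather than waiting for the whole data set.

# Pipeline-parallelism analogy: each stage is a generator, so records flow
# downstream one at a time instead of waiting for the previous stage to finish.
def extract_stage():
    for i in range(5):
        print("extract: produced record", i)
        yield {"id": i, "amount": i * 10}

def transform_stage(records):
    for rec in records:
        rec["amount_with_tax"] = rec["amount"] * 1.1
        print("transform: processed record", rec["id"])
        yield rec

def load_stage(records):
    for rec in records:
        print("load: wrote record", rec["id"])

# The interleaved output shows a downstream stage starting before the upstream one finishes.
load_stage(transform_stage(extract_stage()))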


Datastage Parallelism

Partition Parallelism

 Using partition parallelism the same job would effectively be run simultaneously
by several processors, each handling a separate subset of the total data.
 An approach to parallelism that involves breaking the record set into partitions,
or subsets of records

Data is partitioned by customer surname before it flows into the Transformer stage.


Datastage Parallelism

Combining Pipeline & Partition Parallelism

Stages process partitioned data and fill pipelines, so the next stage can start
on a partition before the previous stage has finished with it.


Datastage Parallelism

Repartitioning data

DataStage allows you to repartition data between stages as and when needed.


Parallel Processing Environments

 SMP - Symmetric Multiprocessing:


 Some hardware resources may be shared among processors. The processors
communicate via shared memory and have a single operating system.
 Involves a multiprocessor computer hardware architecture where two or more
identical processors are connected to a single shared main memory and are
controlled by a single OS instance.

 Cluster/MPP – Massively Parallel Processing:


 Each processor has exclusive access to hardware resources.
 The processors each have their own operating system, and communicate
via a high-speed network.

 Types of Jobs:
 CPU-limited jobs
 Memory-limited jobs
 Disk I/O-limited jobs
The configuration file

 The configuration file describes available processing power in terms of


processing nodes.
 DataStage learns about the shape and size of the system from the configuration
file.
 Processing resources include nodes; and storage resources include both disks for
the permanent storage of data and disks for the temporary storage of data
(scratch disks).
 It organizes the resources needed for a job according to what is defined in the
configuration file. When your system changes, you change the file not the jobs.
node "borodin1“
{ fastname "borodin"
pools "compute_1" ""
resource disk "/sfiles/node1" {pools ""}
resource scratchdisk "/scratch1" {pools "" "sort"}
}
Number of nodes = Number of Instance of process generated
The configuration file Contd.

Number of nodes <= number of processors.

Some of the processors can be left free to deal with other activities.
What is a pool?
A pool defines a group of related nodes and resources.
In a DataStage job you can specify which pool you want to use.
How is it useful?
Some processors (nodes) can be dedicated to the RDBMS, or some to the mainframe.

 The default name of the configuration file is default.apt and it is located in the
Configurations directory in the Server directory of your installation.

Environment Variable : APT_CONFIG_FILE


Default Name : default.apt
Default Path : /opt/IBM/InformationServer/Server/Configurations/
So, depending upon your job, you can select which configuration file will work better for it.
Datastage Partitioning

 (Auto).
– Datastage attempts to work out the best partitioning method depending on
execution modes of current and preceding stages, and how many nodes are
specified in the Configuration file. This is the default partitioning method.
 Entire.
– Each file written to receives the entire data set.
– Every instance of a stage on every processing node receives the complete data
set as input
 Hash.
– The records are hashed into partitions based on the value of a key column or
columns selected from the Available list.

 Modulus.
– The records are partitioned using a modulus function on the key column
selected from the Available list. This is commonly used to partition on tag
fields.
Datastage Partitioning, cont’d

 Random.
– The records are partitioned randomly, based on the output of a random
number generator.

 Round Robin.
– The records are partitioned on a round robin basis as they enter the stage.

 Same.
– Preserves the partitioning already in place.

 DB2.
– Replicates the DB2 partitioning method of a specific DB2 table. Requires extra
properties to be set.

 Range.
– Divides a data set into approximately equal size partitions, based on one or
more partitioning keys. Range partitioning is often a preprocessing step to
performing a total sort on a data set. Requires extra properties to be set.
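
A simplified Python sketch of how some of these methods decide which partition (processing node) a record is sent to; the node count, key columns and sample rows are illustrative, and the real DataStage hash function differs.

# Illustrative partitioners: given a row, return the partition it is sent to.
import itertools
import random

NUM_NODES = 4

def hash_partition(row, key):
    # Hash: rows with the same key value always land in the same partition.
    return hash(row[key]) % NUM_NODES

def modulus_partition(row, key):
    # Modulus: numeric key value modulo the number of partitions.
    return row[key] % NUM_NODES

round_robin_cycle = itertools.cycle(range(NUM_NODES))
def round_robin_partition(row):
    # Round robin: rows are dealt out to the partitions in turn.
    return next(round_robin_cycle)

def random_partition(row):
    # Random: partition chosen from a random number generator.
    return random.randrange(NUM_NODES)

rows = [{"surname": "Barr", "reg_id": 2}, {"surname": "Chai", "reg_id": 4}]
for row in rows:
    print(row["surname"],
          hash_partition(row, "surname"),
          modulus_partition(row, "reg_id"),
          round_robin_partition(row),
          random_partition(row))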
Partition Icons

(Diagram) Link icons indicating the partitioning behaviour: Auto, the ‘Same’
partitioning method, repartitioning, and sequential-to-parallel.

Collection

 Collecting is the process of joining the multiple partitions of a single data set back
together again into a single partition.

 Round Robin.
– The records are collected on a round robin basis as they enter the stage.

 Ordered.
– Reads all records from first partition, then all from second partition and so on.

 Sorted Merge.
– Reads records in an order based on one or more columns of the record. These
columns are called collecting keys

 Auto.
– DataStage eagerly reads any row from any input partition as soon as it becomes
available. This is the fastest collection method.
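
A rough Python sketch of three of the collection methods, merging several partitions back into a single stream; the partition contents are made up.

# Illustrative collectors: combine multiple partitions back into one data set.
import heapq
from itertools import chain

partitions = [
    [{"id": 1}, {"id": 4}],   # partition 0
    [{"id": 2}, {"id": 5}],   # partition 1
    [{"id": 3}, {"id": 6}],   # partition 2
]

def ordered_collect(parts):
    # Ordered: all records from the first partition, then the second, and so on.
    return list(chain.from_iterable(parts))

def round_robin_collect(parts):
    # Round robin: take one record from each partition in turn.
    out, iterators = [], [iter(p) for p in parts]
    while iterators:
        for it in list(iterators):
            try:
                out.append(next(it))
            except StopIteration:
                iterators.remove(it)
    return out

def sorted_merge_collect(parts, key):
    # Sorted merge: read records in order of the collecting key
    # (assumes each partition is already sorted on that key).
    return list(heapq.merge(*parts, key=lambda rec: rec[key]))

print(ordered_collect(partitions))
print(round_robin_collect(partitions))
print(sorted_merge_collect(partitions, "id"))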
Collecting Icons
Sorting

Processing sorted data reduces overhead and in turn can increase performance. Developers need to
consider whether the data they are processing is sorted or not. This decision may impact the
partitioning and collection methods used to optimize performance.

 The Partitioning tab allows you to specify that data arriving on the input link should be sorted
before being written to the file or files. The sort is always carried out within data partitions.

 When partitioning incoming data, the sort occurs after the partitioning.
 When collecting data, the sort occurs before the collection.

 The availability of sorting options depends on the partitioning or collecting method chosen (it is
not available with the Auto methods).

 Options include:
– Perform Sort. Select this to specify that data coming in on the link should be sorted. Select
the column or columns to sort on from the Available list.

– Stable. Select this if you want to preserve previously sorted data sets. This is the default.

– Unique. Select this to specify that, if multiple records have identical sorting key values,
only one record is retained. If stable sort is also set, the first record is retained.
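
A small Python sketch of the Stable and Unique behaviours; the field names are illustrative, and Python's sorted() is used because it is a stable sort, which mirrors the Stable option.

# Stable sort keeps the original relative order of records with equal keys;
# Unique then retains only the first record for each key value.
rows = [
    {"surname": "Barr", "seq": 1},
    {"surname": "Chai", "seq": 2},
    {"surname": "Barr", "seq": 3},
]

# Stable: sorted() preserves input order among records with equal keys.
stable_sorted = sorted(rows, key=lambda rec: rec["surname"])

# Unique: keep only the first record seen for each sorting key value.
seen, unique_rows = set(), []
for rec in stable_sorted:
    if rec["surname"] not in seen:
        seen.add(rec["surname"])
        unique_rows.append(rec)

print(stable_sorted)   # Barr seq 1, Barr seq 3, Chai seq 2
print(unique_rows)     # Barr seq 1, Chai seq 2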
