
IBM DATASTAGE

Day 1

Prepared by : Devendra Kumar Yadav


Agenda

1 Basic Data warehousing concepts

2 ETL Concepts

3 Datastage Architecture

4 Datastage Components

5 Datastage Jobs – Parallel Job features



Data Warehousing Concepts and Terms

Two terms are of great importance in understanding data warehousing concepts:

Operational Data:
It is the data that is used to run a business. This data is typically stored,
retrieved and updated by an Online Transaction Processing (OLTP) system.
Operational data is typically stored in a relational database, but may be stored
in legacy, hierarchical or flat file formats as well.

Informational Data:
It is stored in a format that makes analysis much easier. Analysis can be in the
form of decision support (queries), report generation, executive information
systems, and more in-depth statistical analysis. Informational data is created
from the wealth of operational data that exists in the business. Informational
data is what makes up a data warehouse.
Definition of Data Warehouse

Definition
 A structured extensible environment designed for the analysis of non-
volatile data, logically and physically transformed from multiple source
applications to align with business structure, updated and maintained for
a long time period, expressed in simple business terms and summarized
for quick analysis.

Why did the data warehouse come into the picture?

 Data warehouse implementation solved the following user problems:
 I cannot find the data I need
 I cannot get the data I need
 I cannot understand the data I found
 I cannot use the data I found
BI Architecture

(Diagram, showing the Meta Data layer.)

EDW – Top Down Architecture

Bottom Up Approach
Characteristics of a Data Warehouse: Subject-Oriented

• It focuses on modeling and analysis of data for decision makers.

• It excludes data that is not useful in the decision support process.
• The data warehouse is organized around subjects such as sales, product and
customer.
Characteristics of a Data Warehouse: Non-Volatile

(Diagram) Operational system: records are inserted, replaced and deleted – highly volatile.
Data warehouse: data is loaded and then accessed read-only – non-volatile.


Characteristics of a Data Warehouse: Integration

• The data warehouse is constructed by integrating multiple heterogeneous
sources.
• Data preprocessing is applied to ensure consistency.

(Diagram) RDBMS, legacy system and flat file sources feed the data warehouse
through data processing and data transformation.
Characteristics of a Data Warehouse: Time Variant

(Diagram) Operational system: current data, time horizon of 60-90 days.
Data warehouse: historic data, time horizon of 5-10 years.

Data Warehouse Application - OLAP

The data undergoes a series of transformations from raw data to strategic
information.

(Diagram, pyramid from bottom to top) Disparate data sources – raw data;
transaction processing – operational data; data warehouse – information;
analysis – strategic planning.

Multidimensional View of Information - CUBE

 Structures data around natural business concepts.

 Provides foundation for efficient, sophisticated business analysis.

(Diagram) A sample cube with dimensions Region (East, West), Month (January,
February), Scenario (Actual, Budget) and Product (TV, VCR), holding the measures
Sales and Margin. Different users – Accountant, Region Manager, Product Manager,
Financial Analyst – look at different slices of the same cube.
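
To make the cube idea concrete, here is a minimal Python sketch; the sales figures and the slice_cube helper are illustrative assumptions, not data from the slide. It stores facts keyed by the Region, Month, Scenario and Product dimensions and slices them the way the users above would.

# Minimal multidimensional "cube" sketch; all figures are made up.
facts = {
    # (region, month, scenario, product): sales
    ("East", "January", "Actual", "TV"): 120.0,
    ("East", "January", "Budget", "TV"): 110.0,
    ("West", "January", "Actual", "VCR"): 80.0,
    ("West", "February", "Actual", "TV"): 95.0,
}

def slice_cube(region=None, month=None, scenario=None, product=None):
    # Aggregate sales over every dimension that is left as None.
    total = 0.0
    for (r, m, s, p), sales in facts.items():
        if region and r != region:
            continue
        if month and m != month:
            continue
        if scenario and s != scenario:
            continue
        if product and p != product:
            continue
        total += sales
    return total

# Region Manager's view: actual sales for the East region, all products and months.
print(slice_cube(region="East", scenario="Actual"))
# Product Manager's view: actual TV sales across all regions.
print(slice_cube(product="TV", scenario="Actual"))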


OLTP vs. Data Warehouse

OLTP                                             | Data Warehouse
Performance sensitive                            | Performance relaxed
Few records accessed at a time (tens)            | Large volumes accessed at a time (millions)
Read/update access                               | Mostly read access
Data redundancy is absent                        | Redundancy present
Database size: 100 MB - 100 GB                   | Database size: 100 GB - a few terabytes
Transaction throughput is the performance metric | Query throughput is the performance metric
Has to deal with thousands of users              | Has to deal with hundreds of users
Application oriented                             | Subject oriented
Detailed data                                    | Summarized and refined data
Current, up-to-date data                         | Snapshot data
Isolated data                                    | Integrated data
Repetitive access                                | Ad-hoc access
Normal users                                     | Knowledge users


ETL Overview

 Extract Transform and Load is used to populate a data warehouse


 Extract is where data is pulled from source systems
– SQL connect over networks
– Flat files
– Transaction messaging (MSMQ)
 Transformations can be the most complex part of data warehousing
– Convert text to numbers
– Apply business logic in this stage
 Load is where data is loaded into the data warehouse
– Sequential or bulk loading
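
As an illustration only (this is plain Python, not DataStage, and the file and column names are assumptions), the sketch below mimics the three phases against a flat file: extract rows from a source CSV, transform text amounts into numbers plus a simple business rule, and load the result to a load-ready file.

# Minimal E-T-L sketch over flat files; file names and columns are hypothetical.
import csv

def extract(path):
    # Extract: pull rows from a source flat file.
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    # Transform: convert text to numbers and apply simple business logic.
    for row in rows:
        row["total_sales"] = float(row["total_sales"])   # text -> number
        row["high_value"] = "Y" if row["total_sales"] > 50 else "N"
        yield row

def load(rows, path):
    # Load: write load-ready data sequentially to the target file.
    rows = list(rows)
    if not rows:
        return
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

load(transform(extract("sales_raw.csv")), "sales_load_ready.csv")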
WHAT IS ETL?

 ETL - extract, transform and load

– is the set of processes by which data is extracted from numerous


databases, applications and systems, transformed as appropriate, and
loaded into target systems - data warehouses, data marts, analytical
applications, etc.

– More than half of all development work for data warehousing projects
is typically dedicated to the design and implementation of ETL
processes.

– Poorly designed ETL processes are costly to maintain, change and


update, so it is critical to make the right choices in terms of the right
technology and tools that will be used for developing and maintaining
the ETL processes.

Sample Process: File to Load Ready

(Diagram) Raw data files and system-generated files are landed in a staging DB;
the DataStage ETL process, using lookup data sets, produces load-ready data for
the data warehouse.
Transforming Examples

Change (recode a value, e.g. Roman to numeric region IDs):
  Before (buyer_name, reg_id, total_sales):
    Barr, Adam      II    17.60
    Chai, Sean      IV    52.80
    O’Melia, Erin   VI    8.82
  After (buyer_name, reg_id, total_sales):
    Barr, Adam      2     17.60
    Chai, Sean      4     52.80
    O’Melia, Erin   6     8.82

Combine (merge columns, e.g. first and last name):
  Before (buyer_first, buyer_last, reg_id, total_sales):
    Adam   Barr      2    17.60
    Sean   Chai      4    52.80
    Erin   O’Melia   6    8.82
  After (buyer_name, reg_id, total_sales):
    Barr, Adam      2     17.60
    Chai, Sean      4     52.80
    O’Melia, Erin   6     8.82

Calculate (derive a new column, e.g. total_sales = price_id * qty_id):
  Before (buyer_name, price_id, qty_id):
    Barr, Adam      .55   32
    Chai, Sean      1.10  48
    O’Melia, Erin   .98   9
  After (buyer_name, price_id, qty_id, total_sales):
    Barr, Adam      .55   32   17.60
    Chai, Sean      1.10  48   52.80
    O’Melia, Erin   .98   9    8.82
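
A minimal Python sketch of the same three transforms; the column names come from the tables above, while the helper functions themselves are illustrative.

# Sketch of the Change / Combine / Calculate transforms shown above.
ROMAN = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6}

def change(row):
    # Change: recode the Roman-numeral region id to a number.
    row["reg_id"] = ROMAN[row["reg_id"]]
    return row

def combine(row):
    # Combine: merge first and last name into "last, first".
    row["buyer_name"] = row.pop("buyer_last") + ", " + row.pop("buyer_first")
    return row

def calculate(row):
    # Calculate: derive total_sales from price and quantity.
    row["total_sales"] = round(row["price_id"] * row["qty_id"], 2)
    return row

print(change({"buyer_name": "Barr, Adam", "reg_id": "II", "total_sales": 17.60}))
print(combine({"buyer_first": "Adam", "buyer_last": "Barr", "reg_id": 2, "total_sales": 17.60}))
print(calculate({"buyer_name": "Barr, Adam", "price_id": 0.55, "qty_id": 32}))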
Datastage - Overview

 DataStage is one of the leading ETL products on the BI market.

 DataStage allows integration of data across multiple systems and processing of
high volumes of data.

 DataStage was formerly known as Ascential DataStage; in 2005 it was acquired by
IBM and added to the WebSphere family. From 2006 its official name was IBM
WebSphere DataStage, and in 2008 it was renamed IBM InfoSphere DataStage.


Datastage Features

 DataStage has the following features to aid the design and processing required
to build a data warehouse:
 Uses graphical design tools. With simple point-and-click techniques you can
draw a scheme to represent your processing requirements.
 Extracts data from any number or type of database.
 Handles all the meta data definitions required to define your data warehouse.
You can view and modify the table definitions at any point during the design of
your application.
 Aggregates data. You can modify SQL SELECT statements used to extract data.
 Transforms data.
 DataStage has a set of predefined transforms and functions you can use to
convert your data. You can easily extend the functionality by defining your own
transforms to use.
 Loads the data warehouse.
 DataStage consists of a number of client and server components.



IBM InfoSphere Information Server



Client Components

 DataStage has four client components which are installed on any PC running
Windows 2000 or Windows NT 4.0 with Service Pack 4 or later:

 DataStage Designer. A design interface used to create DataStage applications


(known as jobs). Each job specifies the data sources, the transforms required,
and the destination of the data. Jobs are compiled to create executables that
are scheduled by the Director and run by the Server (mainframe jobs are
transferred and run on the mainframe).

 DataStage Director. A user interface used to validate, schedule, run, and


monitor DataStage server jobs and parallel jobs.

 DataStage Administrator. A user interface used to perform administration tasks
such as setting up DataStage users, creating and moving projects, and setting
up purging criteria.

 DataStage Manager. A user interface used to view and edit the contents of the
repository, such as table definitions and user-defined components.


Client Logon
DataStage Designer
DataStage Director
Table Definition
DataStage Administrator
SERVER COMPONENTS

 There are three server components:


 Repository. A central store that contains all the information required
to build a data mart or data warehouse.

 DataStage Server. Runs executable jobs that extract, transform, and


load data into a data warehouse.

 DataStage Package Installer. A user interface used to install packaged


DataStage jobs and plug-ins.



DATASTAGE PROJECTS

 You always enter DataStage through a DataStage project; when you start a
DataStage client you are prompted to attach to a project.
 Each project contains:
 DataStage jobs.
 Built-in components. These are predefined components used in a
job.
 User-defined components. These are customized components
created using the DataStage Manager. Each user-defined component
performs a specific task in a job.
 A complete project may contain several jobs and user-defined
components.
 There is a special class of project called a protected project.
 Nothing can be added, deleted, or changed in a protected project.
 Users can view objects in the project but cannot change a job’s design.


DataStage Jobs

Parallel Jobs
 Executed under control of the DataStage Server runtime environment
 Built-in functionality for pipeline and partitioning parallelism
 Compiled into OSH (Orchestrate Scripting Language) – executable C++ class
instances
 Runtime monitoring in DataStage Director
Job Sequences (Batch jobs, Controlling jobs)
 Master controlling jobs that kick off other jobs and activities
 Can kick off server or parallel jobs
 Runtime monitoring in DataStage Director
Server Jobs (requires Server Edition license)
 Executed by the DataStage Server Edition
 Compiled into BASIC (interpreted pseudo-code)
 Runtime monitoring in DataStage Director
Mainframe Jobs (requires Mainframe Edition license)
 Compiled into COBOL
 Executed on the mainframe, outside of DataStage
Design Elements of Parallel Job

Stages
 Implemented as OSH operators (pre-built components)
 Passive stages (E and L of ETL)
• Read data
• Write data
• e.g. sequential file, Oracle
 Processor (active) stages (T of ETL)
• Transform data
• Filter data
• Aggregate data
• Generate data
• Split/merge data
• e.g. Transformer, Aggregator, Join, Sort stages
Links
 “Pipes” through which the data moves from stage to stage
CREATING A SAMPLE JOB

 Requirement

 To sync data from production to test



CREATING A SAMPLE JOB CONTD.

(Screenshot walkthrough, one slide per step.)

1. Log in to the Designer by attaching to the required project (enter Server:Port and select the project to attach to).
2. Create a new job by clicking File -> New.
3. Select "Parallel Job" and click "OK".
4. Select View -> Palette; the Palette bar appears in the Designer.
5. Drag and drop the source stage from the Palette to the design area.
6. Drag and drop the processing stage from the Palette to the design area.
7. Drag and drop the target stage from the Palette to the design area.
8. Add links between the stages.
9. Rename the stages with proper names.
10. Save the job.
11. Edit the properties of the source stage.
12. Provide the mapping in the Copy stage.
13. Edit the properties of the target stage.
14. Compile the job.
15. Run the job; the job status moves from "Running" to "Finished".
16. View and monitor the job log through the Director.
17. Check the data in the target table.


STEPS TO CREATE A DATASTAGE JOB:

 1. Prepare a list of the parameters you will have to use in your job.

 2. Check whether those parameters are present in the parameter list; if not,
add them.
 3. Add those parameters in the job parameters tab.
 4. Drag and drop your source, target and the other stages that are required to
build the job.
 5. Link the stages as per your requirement.
 6. Enter metadata, user name and password in the source and target stages. Do
the necessary transformations if required. Put appropriate annotations.
 7. Compile the job and run it.
 8. Go to Tools -> Run Director to check the status of your job.


Datastage Parallelism

Pipeline Parallelism

Stages fill pipelines, so the next stage can start processing data before the
previous stage has finished.


Datastage Parallelism

Pipeline Parallelism

 Process of pulling records from the source system and moving them through the
sequence of processing functions that are defined in the data flow.
 Data can be buffered in blocks so that each process is not slowed when other
components are running.
 This approach avoids deadlocks and speeds performance by allowing both
upstream and downstream processes to run concurrently.
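
As a rough analogy only (this is not how the parallel engine is implemented), the Python sketch below chains generator "stages" so that each downstream stage consumes a record as soon as the upstream stage produces it, rather than waiting for the whole data set.

# Pipeline-parallelism analogy: each stage is a generator, so records flow
# downstream one at a time instead of waiting for the previous stage to finish.
def extract_stage():
    for i in range(5):
        print("extract: produced record", i)
        yield {"id": i, "amount": i * 10}

def transform_stage(records):
    for rec in records:
        rec["amount_with_tax"] = rec["amount"] * 1.1
        print("transform: processed record", rec["id"])
        yield rec

def load_stage(records):
    for rec in records:
        print("load: wrote record", rec["id"])

# The interleaved output shows a downstream stage starting before the upstream one finishes.
load_stage(transform_stage(extract_stage()))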


Datastage Parallelism

Partition Parallelism

 Using partition parallelism the same job would effectively be run simultaneously
by several processors, each handling a separate subset of the total data.
 An approach to parallelism that involves breaking the record set into partitions,
or subsets of records

Data is partitioned by customer surname before it flows into the Transformer stage.


Datastage Parallelism

Combining Pipeline & Partition Parallelism

Stages process partitioned data and fill pipelines, so the next stage can start
on a partition before the previous stage has finished with it.


Datastage Parallelism

Repartitioning data

DataStage allows you to repartition data between stages as and when needed.


Parallel Processing Environments

 SMP - Symmetric Multiprocessing:


 Some hardware resources may be shared among processors. The processors
communicate via shared memory and have a single operating system.
 Involves a multiprocessor computer hardware architecture where two or more
identical processors are connected to a single shared main memory and are
controlled by a single OS instance.

 Cluster/MPP – Massively Parallel Processing:


 Each processor has exclusive access to hardware resources.
 The processors each have their own operating system, and communicate
via a high-speed network.

 Types of Jobs:
 CPU-limited jobs
 Memory-limited jobs
 Disk I/O-limited jobs
The configuration file

 The configuration file describes available processing power in terms of


processing nodes.
 DataStage learns about the shape and size of the system from the configuration
file.
 Processing resources include nodes; and storage resources include both disks for
the permanent storage of data and disks for the temporary storage of data
(scratch disks).
 It organizes the resources needed for a job according to what is defined in the
configuration file. When your system changes, you change the file not the jobs.
node "borodin1“
{ fastname "borodin"
pools "compute_1" ""
resource disk "/sfiles/node1" {pools ""}
resource scratchdisk "/scratch1" {pools "" "sort"}
}
Number of nodes = Number of Instance of process generated
The configuration file Contd.

Number of nodes <= number of processors.

Some of the processors can be left free to deal with other activities.
What is a pool?
A pool defines a group of related nodes and resources.
In a DataStage job you can specify which pool you want to use.
How is it useful?
Some processors (nodes) can be dedicated to the RDBMS, or some to the mainframe.

 The default name of the configuration file is default.apt and it is located in the
Configurations directory in the Server directory of your installation.

Environment Variable : APT_CONFIG_FILE


Default Name : default.apt
Default Path : /opt/IBM/InformationServer/Server/Configurations/
So, depending upon your job, you can select which configuration file will work better for it.
Datastage Partitioning

 (Auto).
– Datastage attempts to work out the best partitioning method depending on
execution modes of current and preceding stages, and how many nodes are
specified in the Configuration file. This is the default partitioning method.
 Entire.
– Each file written to receives the entire data set.
– Every instance of a stage on every processing node receives the complete data
set as input
 Hash.
– The records are hashed into partitions based on the value of a key column or
columns selected from the Available list.

 Modulus.
– The records are partitioned using a modulus function on the key column
selected from the Available list. This is commonly used to partition on tag
fields.
Datastage Partitioning, cont’d

 Random.
– The records are partitioned randomly, based on the output of a random
number generator.

 Round Robin.
– The records are partitioned on a round robin basis as they enter the stage.

 Same.
– Preserves the partitioning already in place.

 DB2.
– Replicates the DB2 partitioning method of a specific DB2 table. Requires extra
properties to be set.

 Range.
– Divides a data set into approximately equal size partitions, based on one or
more partitioning keys. Range partitioning is often a preprocessing step to
performing a total sort on a data set. Requires extra properties to be set.
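
A simplified Python sketch of how some of these methods decide which partition (processing node) a record is sent to; the node count, key columns and sample rows are illustrative, and the real DataStage hash function differs.

# Illustrative partitioners: given a row, return the partition it is sent to.
import itertools
import random

NUM_NODES = 4

def hash_partition(row, key):
    # Hash: rows with the same key value always land in the same partition.
    return hash(row[key]) % NUM_NODES

def modulus_partition(row, key):
    # Modulus: numeric key value modulo the number of partitions.
    return row[key] % NUM_NODES

round_robin_cycle = itertools.cycle(range(NUM_NODES))
def round_robin_partition(row):
    # Round robin: rows are dealt out to the partitions in turn.
    return next(round_robin_cycle)

def random_partition(row):
    # Random: partition chosen from a random number generator.
    return random.randrange(NUM_NODES)

rows = [{"surname": "Barr", "reg_id": 2}, {"surname": "Chai", "reg_id": 4}]
for row in rows:
    print(row["surname"],
          hash_partition(row, "surname"),
          modulus_partition(row, "reg_id"),
          round_robin_partition(row),
          random_partition(row))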
Partition Icons

(Diagram) Link icons indicating the partitioning behaviour: Auto, the ‘Same’
partitioning method, repartitioning, and sequential-to-parallel.

Collection

 Collecting is the process of joining the multiple partitions of a single data set back
together again into a single partition.

 Round Robin.
– The records are collected on a round robin basis as they enter the stage.

 Ordered.
– Reads all records from first partition, then all from second partition and so on.

 Sorted Merge.
– Reads records in an order based on one or more columns of the record. These
columns are called collecting keys

 Auto.
– DataStage eagerly reads any row from any input partition as soon as it becomes
available. This is the fastest collection method.
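
A rough Python sketch of three of the collection methods, merging several partitions back into a single stream; the partition contents are made up.

# Illustrative collectors: combine multiple partitions back into one data set.
import heapq
from itertools import chain

partitions = [
    [{"id": 1}, {"id": 4}],   # partition 0
    [{"id": 2}, {"id": 5}],   # partition 1
    [{"id": 3}, {"id": 6}],   # partition 2
]

def ordered_collect(parts):
    # Ordered: all records from the first partition, then the second, and so on.
    return list(chain.from_iterable(parts))

def round_robin_collect(parts):
    # Round robin: take one record from each partition in turn.
    out, iterators = [], [iter(p) for p in parts]
    while iterators:
        for it in list(iterators):
            try:
                out.append(next(it))
            except StopIteration:
                iterators.remove(it)
    return out

def sorted_merge_collect(parts, key):
    # Sorted merge: read records in order of the collecting key
    # (assumes each partition is already sorted on that key).
    return list(heapq.merge(*parts, key=lambda rec: rec[key]))

print(ordered_collect(partitions))
print(round_robin_collect(partitions))
print(sorted_merge_collect(partitions, "id"))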
Collecting Icons
Sorting

Processing sorted data reduces overhead and in turn can increase performance. Developers need to
consider whether the data they are processing is sorted or not. This decision may impact the
partitioning and collection methods used to optimize performance.

 The Partitioning tab allows you to specify that data arriving on the input link should be sorted
before being written to the file or files. The sort is always carried out within data partitions.

 When partitioning incoming data, the sort occurs after the partitioning.
 When collecting data, the sort occurs before the collection.

 The availability of sorting options depends on the partitioning or collecting method chosen (it is
not available with the Auto methods).

 Options include:
– Perform Sort. Select this to specify that data coming in on the link should be sorted. Select
the column or columns to sort on from the Available list.

– Stable. Select this if you want to preserve previously sorted data sets. This is the default.

– Unique. Select this to specify that, if multiple records have identical sorting key values,
only one record is retained. If stable sort is also set, the first record is retained.
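
A small Python sketch of the Stable and Unique behaviours; the field names are illustrative, and Python's sorted() is used because it is a stable sort, which mirrors the Stable option.

# Stable sort keeps the original relative order of records with equal keys;
# Unique then retains only the first record for each key value.
rows = [
    {"surname": "Barr", "seq": 1},
    {"surname": "Chai", "seq": 2},
    {"surname": "Barr", "seq": 3},
]

# Stable: sorted() preserves input order among records with equal keys.
stable_sorted = sorted(rows, key=lambda rec: rec["surname"])

# Unique: keep only the first record seen for each sorting key value.
seen, unique_rows = set(), []
for rec in stable_sorted:
    if rec["surname"] not in seen:
        seen.add(rec["surname"])
        unique_rows.append(rec)

print(stable_sorted)   # Barr seq 1, Barr seq 3, Chai seq 2
print(unique_rows)     # Barr seq 1, Chai seq 2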
