Enterprise Edition
Proposed Course Agenda
Day 1: Review of EE Concepts; Sequential Access; Best Practices; DBMS as Source
Day 2: EE Architecture; Transforming Data; DBMS as Target; Sorting Data
Day 3: Combining Data; Configuration Files; Extending EE; Meta Data in EE
Day 4: Job Sequencing; Testing and Debugging
The Course Material
Course Manual
Online Help
Using the Course Material
Introduction to DataStage EE
What is DataStage?
Configuring Projects
Module Objectives
After this module you will be able to:
Explain how to create and delete projects
Set project properties in Administrator
Set EE global properties in Administrator
Project Properties
[Diagram: data flows from Source through Transform to Target; meta data from each feeds the Repository]
DataStage Manager
Manager Contents
Job properties
Compile
Tools Palette
Adding Stages and Links
Job Properties
Short and long descriptions
Shows in Manager
Annotation stage
Is a stage on the tool palette
Shows on the job GUI (work area)
Job Properties Documentation
Annotation Stage on the Palette
Annotation Stage Properties
Final Job Work Area with Documentation
Compiling a Job
Errors or success message
Intro, Part 5
Running Jobs
Module Objectives
After this module you will be able to:
Validate your job
Use DataStage Director to run your job
Set run options
Monitor your job's progress
View job log messages
Prerequisite to Job Execution
DSEE (DataStage Enterprise Edition) Review
Ascential's Enterprise Data Integration Platform
Parallel Execution
Day 1: Review of EE Concepts; Sequential Access; Standards; DBMS Access
Day 2: EE Architecture; Transforming Data; Sorting Data
Day 3: Combining Data; Configuration Files
Day 4: Extending EE; Meta Data Usage; Job Control; Testing
Module Objectives
DataStage architecture
DataStage client review
Administrator
Manager
Designer
Director
Server Repository
Functions specific to a project.
Administrator Project Properties
Variables for parallel processing
Administrator Environment Variables
Variables are category specific
OSH is what is run by the EE Framework
DataStage Manager
Export Objects to MetaStage
Push meta data to MetaStage
Designer Workspace
Can execute the job from Designer
DataStage Generated OSH
The EE Framework runs OSH
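A minimal sketch of invoking OSH by hand, using the generator and peek operators that underlie the Row Generator and Peek stages (options shown are illustrative, not a complete reference):

  $ osh "generator -schema record(a:int32) | peek"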
Director Executing Jobs
Messages from the previous run are shown in a different color
Stages
Row Generator
Peek
Row Generator
Edit row in column tab
Repeatable property
Peek
note
SMP: Shared Everything
[Diagram: multiple CPUs sharing common memory and disk]
When used with Enterprise Edition:
Data transport uses shared memory
Simplified startup
[Diagram: Operational Data and Archived Data (Source) pass through Transform, Clean, and Load, landing to disk between steps, into the Data Warehouse (Target)]
Traditional approach to batch processing:
Write to disk and read from disk before each processing operation
Sub-optimal utilization of resources
a 10 GB stream leads to 70 GB of I/O
processing resources can sit idle during I/O
Very complex to manage (lots and lots of small jobs)
Becomes impractical with big data volumes
disk I/O consumes the available processing time
terabytes of disk required for temporary staging
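One way to arrive at the "10 GB stream leads to 70 GB of I/O" figure above, assuming the extract and each of the three operations (transform, clean, load) write their full 10 GB output to disk and each operation reads its 10 GB input back:

  1 write (extract) + 3 x (read + write) = 7 passes x 10 GB = 70 GB of I/O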
Pipeline Multiprocessing
Data Pipelining
Transform, clean and load processes are executing simultaneously on the same processor
rows are moving forward through the flow
[Diagram: Operational Data (Source) flows through Transform, Clean, and Load to the Target as one pipeline]
Start a downstream process while an upstream process is still running.
This eliminates intermediate storing to disk, which is critical for big data.
This also keeps the processors busy.
Still has limits on scalability
Think of a conveyor belt moving the rows from process to process!
Partition Parallelism
Data Partitioning
Break up big data into partitions
[Diagram: source data is split into partitions; each partition is pipelined through Transform, Clean, and Load into the Data Warehouse]
Repartitioning
[Diagram: data partitioned A-F, G-M, N-T, U-Z is repartitioned between Transform, Clean, and Load while pipelining continues from Source to Data Warehouse]
[Diagram: Orchestrate Application Framework and Runtime System. A parallel flow imports flat files and relational data, runs Clean1 and Clean2, then Merge and Analyze. The framework provides centralized error handling, configuration file and event logging, performance visualization, parallel access to data in files and in RDBMS, parallel pipelining, inter-node communications, and parallelization of operations]
DSEE:
Automatically scales to fit the machine
Handles data flow among multiple CPUs and disks
Exercise
Sequential
Fixed or variable length
File Set
Lookup File Set
Data Set
Sequential Stage Introduction
Recordization
Divides input stream into records
Set on the format tab
Columnization
Divides the record into columns
Default set on the Format tab but can be overridden on the Columns tab
Can be incomplete if using a schema, or not specified in the stage at all if using RCP (see the schema sketch below)
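A sketch of an Orchestrate record schema that carries both record-level format and column definitions (property values are illustrative):

  record {record_delim='\n', delim=','} (
    CustID:  int32;
    Name:    string[max=30];
    Balance: decimal[8,2];
  )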
File Format Example
Final Delimiter = end:    Field1,Field1,Field1,Last field<nl>
Final Delimiter = comma:  Field1,Field1,Field1,Last field,<nl>
(<nl> is the record delimiter; the comma is the field delimiter)
Sequential File Stage
Stage categories
General Tab Sequential Source
Descriptor file
File Set Usage
Key column specified
Key column dropped in descriptor file
Data Set
node1:/local/disk1/
node2:/local/disk2/
Quiz!
True or False?
Everything that has been data-partitioned must be collected in the same job
Data Set Stage
Occurs on import
From sequential files or file sets
From RDBMS
Occurs on export
From datasets to file sets or sequential files
From datasets to RDBMS
dsrecords
Lists number of records in a dataset
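A quick sketch of dsrecords from the command line (data set name hypothetical; output format may differ by release):

  $ dsrecords mydata.ds
  156999 records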
Data Set Management
Display data
Schema
Data Set Management From Unix
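A sketch of common orchadmin invocations (subcommand spellings as recalled; verify against your release):

  $ orchadmin describe mydata.ds    # schema and partition layout
  $ orchadmin dump mydata.ds        # print the records
  $ orchadmin delete mydata.ds      # remove descriptor and data files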
Document using the Annotation stage
Job Properties Documentation
Description shows in DS Manager and MetaStage
Naming conventions
Container
Use Iterative Job Design
Copy stage
Transformer Stage
Techniques
Suggestions:
Always include a reject link.
Always test for null values before using a column in a function (see the sketch after this list).
Try to use RCP and map only columns that have a derivation other than a copy. More on RCP later.
Be aware of column and stage-variable data types; users often overlook the stage-variable type.
Avoid type conversions; try to maintain the data type as imported.
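A sketch of the null-test pattern as a Transformer derivation (link and column names are hypothetical):

  If IsNull(lnk_in.Amount) Then 0 Else lnk_in.Amount * 1.05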
The Copy Stage
1. Keep it simple
Jobs with many stages are hard to debug and maintain.
Click to add environment variables
DUMP SCORE Output
Double-click to see Partitioner and Collector
Mapping: node --> partition
Use Multiple Configuration Files
DBMS Access
Objectives
[Diagram: traditional client-server — each client performs its own Sort and Load against the DBMS; Enterprise Edition — Sort and Load run in parallel inside the EE framework]
DB2/UDB Enterprise
Informix Enterprise
Oracle Enterprise
Teradata Enterprise
RDBMS Usage
As a source
Extract data from table (stream link)
Extract as table, generated SQL, or user-defined SQL
User-defined can perform joins, access views
Lookup (reference link)
Normal lookup is memory-based (all table data read into memory)
Can perform one lookup at a time in DBMS (sparse option)
Continue/drop/fail options
As a target
Inserts
Upserts (Inserts and updates)
Loader
RDBMS Source Stream Link
Stream link
DBMS Source - User-defined SQL
Columns in the SQL statement must match the meta data on the Columns tab (see the sketch below)
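A sketch of user-defined source SQL (table and column names hypothetical); the selected columns must match the Columns tab in number, order, and type:

  SELECT c.custid, c.name, o.total
  FROM   customers c, orders o
  WHERE  c.custid = o.custid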
Exercise
User-defined SQL
Exercise 4-1
DBMS Source Reference Link
Reject link
Lookup Reject Link
Link name
Lookup Stage Properties
Reference link
Write Methods
Delete
Load
Upsert
Write (DB2)
Generated code can be copied
Upsert mode determines options
Checking for Nulls
Platform Architecture
Objectives
[Diagram: an EE stage consists of an input interface, business logic, and an output interface]
DSEE Stage Execution
$ osh "op < in.ds > out.ds"
Where:
op is an Orchestrate operator
in.ds is the input data set
out.ds is the output data set
OSH Operators
Will be enabled for all projects
View OSH in Designer
Operator
Schema
OSH Practice
What gets generated for x.ds:
  $ osh "operator_A > x.ds"
A descriptor file plus data files: multiple files per partition, each up to 2 GB (or larger)
Computing Architectures: Definition
[Diagram: a partitioner distributes data across node 1 and node 2]
Partitioning Methods
Auto
Hash
Entire
Range
Range Map
Collectors
[Diagram: a collector merges partitioned data into a single stream feeding a sequential stage]
Collectors do NOT synchronize data
Partitioning and Repartitioning Are Visible on the Job Design
Partitioning and Collecting Icons
Partitioner Collector
Setting a Node Constraint in the GUI
Reading Messages in Director
Transforming Data
Module Objectives
Constraint
Other/log option
Show/Hide button
Transforming Data
Derivations
Using expressions
Using functions
Date/time
Constraint Rejects
All expressions are false and "reject row" is checked
Transformer: Execution Order
Sorting Data
Objectives
Important because:
Some stages require sorted input
Some stages may run faster, e.g., the Aggregator
Can be performed:
As an option within stages (use the Input > Partitioning tab and set partitioning to anything other than Auto)
As a separate stage (for more complex sorts)
Sorting Alternatives
Sort Stage
Tunable to use more memory before spilling to scratch.
Note: spread I/O by adding more scratch file systems to each node in the APT_CONFIG_FILE (see the sketch below)
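A sketch of a node entry with several scratch file systems (host name and paths hypothetical):

  node "n1" {
    fastname "s1"
    pools ""
    resource disk "/data/n1/d1" {}
    resource scratchdisk "/scratch1" {}
    resource scratchdisk "/scratch2" {}
  }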
Removing Duplicates
OR
Combining Data
Objectives
Horizontally:
Several input links; one output link (+ optional rejects) made of columns from different input links, e.g.:
Joins
Lookup
Merge
Vertically:
One input link; one output link with columns combining values from all input rows, e.g.:
Aggregator
Join, Lookup & Merge Stages
Tip: check the "Input Ordering" tab to make sure the intended primary is listed first
Join Stage Editor
Link order is immaterial for Inner and Full Outer joins (but VERY important for Left/Right Outer, Lookup, and Merge)
Four types:
Inner
Left Outer
Right Outer
Full Outer
Combines:
one source link with
one or more duplicate-free table links
[Diagram: Lookup stage with source link 0 and table link 1; output link and reject link]
The Lookup Stage
RDBMS LOOKUP
NORMAL: loads the table into an in-memory hash table first
SPARSE: issues a select for each input row; might become a performance bottleneck
3. The Merge Stage
Combines
one sorted, duplicate-free master (primary) link with
one or more sorted update (secondary) links.
Pre-sort makes merge "lightweight": few rows need to be in RAM (as with joins, but opposite to lookup).
[Diagram: Merge stage with master link 0 and update links 1 and 2; one output link plus one reject link per update link]
In this table, a comma separates the primary and secondary input links (and the output and reject links).
The Aggregator Stage
Sum
Min, max
Mean
Missing value count
Non-missing value count
Percent coefficient of variation
Aggregator Properties
Aggregation Types
Aggregation types
Containers
Two varieties
Local
Shared
Local
Simplifies a large, complex diagram
Shared
Creates reusable object that many jobs can include
Creating a Container
Create a job
Select (loop) portions to containerize
Edit > Construct container > local or shared
Using a Container
Configuration Files
Objectives
{
  node "Node1"
  {
    fastname "BlackHole"
    pools "" "node1"
    resource disk "/usr/dsadm/Ascential/DataStage/Datasets" {pools ""}
    resource scratchdisk "/usr/dsadm/Ascential/DataStage/Scratch" {pools ""}
  }
}
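A sketch of pointing a run at this configuration via the APT_CONFIG_FILE environment variable (the path is hypothetical):

  $ export APT_CONFIG_FILE=/usr/dsadm/Ascential/DataStage/Configurations/one_node.apt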
Disk Pools
{
  node "n1" {
    fastname "s1"
    pools "" "n1" "s1" "sort"
    resource disk "/data/n1/d1" {}
    resource disk "/data/n1/d2" {}
    resource scratchdisk "/scratch" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pools "" "n2" "s2" "app1"
    resource disk "/data/n2/d1" {}
    resource scratchdisk "/scratch" {}
  }
  node "n3" {
    fastname "s3"
    pools "" "n3" "s3" "app1"
    resource disk "/data/n3/d1" {}
    resource scratchdisk "/scratch" {}
  }
  node "n4" {
    fastname "s4"
    pools "" "n4" "s4" "app1"
    resource disk "/data/n4/d1" {}
    resource scratchdisk "/scratch" {}
  }
  ...
}
Resource Types
Disk
Scratchdisk
DB2
Oracle
Saswork
Sortwork
Can exist in a pool
Groups resources together
Using Different Configurations
Extending DataStage EE
Objectives
Wrappers
Buildops
Custom Stages
When To Leverage EE Extensibility
Types of situations:
Complex business logic, not easily accomplished using standard
EE stages
Reuse of existing C, C++, Java, COBOL, etc.
Wrappers vs. Buildop vs. Custom
Name of stage
Conscientiously maintaining the Creator page for all your wrapped stages will eventually earn you the thanks of others.
Wrapper Properties Page
stdout or named pipe
import
output schema
Wrapped stage
Job Run
Hardware Environment:
IBM SP2, 2 nodes with 4 CPUs per node.
Software:
DB2/EEE, COBOL, EE
Original COBOL Application:
Extracted a source table, performed a lookup against a table in DB2, and loaded results to a target table.
4 hours 20 minutes sequential execution
Enterprise Edition Solution:
Used EE to perform Parallel DB2 Extracts and Loads
Used EE to execute COBOL application in Parallel
EE Framework handled data transfer between DB2/EEE and the COBOL application
30 minutes 8-way parallel execution
Buildops
"Build" stages
from within Enterprise Edition
Identical to Wrappers, except: under the Build tab, your program!
Logic Tab for Business Logic
Temporary variables declared [and initialized] here
First line:
output 0
First line:
Transfer of index 0
Example - sumNoTransfer
Add input columns "a" and "b"; ignore other columns that might be present in input
Produce a new "sum" column
Do not transfer input columns
input:  a:int32; b:int32
stage:  sumNoTransfer
output: sum:int32
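A sketch of the Per-Record logic for sumNoTransfer, assuming the buildop exposes input and output columns directly by name (check the generated C++ for the exact referencing in your release):

  // Per-Record tab: runs once per input row
  sum = a + b;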
No Transfer
From Peek:
NO TRANSFER
- RCP set to "False" in stage definition
and
- Transfer page left blank, or Auto Transfer = "False"
Effects:
- input columns "a" and "b" are not transferred
- only new column "sum" is transferred
TRANSFER
- RCP set to "True" in stage definition
or
- Auto Transfer set to "True"
Effects:
- new column "sum" is transferred, as well as
- input columns "a" and "b" and
- input column "ignored" (present in input, but
not mentioned in stage)
Columns vs. Temporary C++ Variables
Columns: value refreshed from row to row
Temporary C++ variables: value persistent throughout the "loop" over rows, unless modified in code
Exercise
Use EE API
Use Custom Stage to add a new operator to the EE canvas
Custom Stage
Name of the Orchestrate operator to be used
Custom Stage Properties Tab
The Result
Module 11
Data definitions
Recordization and columnization
Fields have properties that can be set at the individual field level
Data types in the GUI are translated to types used by EE
Described as properties on the Format/Columns tabs (Outputs or Inputs pages), OR using a schema file (can be full or partial)
Schemas
Can be imported into Manager
Can be pointed to by some job stages (e.g., Sequential)
Data Formatting Record Level
Format tab
Meta data described on a record basis
Record level properties
Data Formatting Column Level
Field and string settings
Extended Properties String Type
Properties depend on the data type
Schema data types:
Date, Decimal, Floating point, Integer, String, Time, Timestamp
Vector, Subrecord, Raw, Tagged
Runtime Column Propagation
Job Sequencer
Build a controlling job much the same way you build other jobs
Comprised of stages and links
No BASIC coding required
Job Sequencer
Stages
Example
Job Activity stage contains conditional triggers
Job Activity Properties
Job to be executed; select from dropdown
Job parameters to be passed
Job Activity Trigger
Different links having different triggers
Sequencer Stage
Can be set to all or any
Notification Stage
Notification
Notification Activity
Sample DataStage log from Mail Notification
Notification Activity Message
E-Mail Message
Exercise
Environment variables
Configuration File information
Framework Info/Warning/Error messages
Output from the Peek Stage
Additional info with "Reporting" environment variables
Tracing/Debug output
Must compile job in trace mode
Adds overhead
Job-Level Environment Variables
Very little integrity checking is done during compile; you should run Validate from Director.