Day 1
– ETL Concepts
– Datastage Architecture
– Datastage Components
Informational Data:
It is stored in a format that makes analysis much easier. Analysis can be in the form of decision support (queries), report generation, executive information systems, and more in-depth statistical analysis. Informational data is created from the wealth of operational data that exists in the business. Informational data is what makes up a data warehouse.
Definition of Data Warehouse
A structured, extensible environment designed for the analysis of non-volatile data, logically and physically transformed from multiple source applications to align with business structure, updated and maintained for a long time period, expressed in simple business terms, and summarized for quick analysis.
Meta Data
EDW – Top Down Architecture
Bottom Up Approach
Characteristics of a Data Warehouse: Subject-oriented
[Diagram: operational systems, legacy systems, and RDBMS sources are maintained with insert, replace, and delete operations; their data is loaded into the data warehouse, which provides read-only access.]
[Diagram: a multidimensional view of sales data – Sales and Margin measures for TV and VCR products, by month (January, February) and Actual vs. Budget, across East and West regions, as seen by an accountant and a region manager.]
OLTP                                             | Data Warehouse
-------------------------------------------------|-------------------------------------------
Transaction throughput is the performance metric | Query throughput is the performance metric
Has to deal with thousands of users              | Has to deal with hundreds of users
Application oriented                             | Subject oriented
Detailed data                                    | Summarized and refined data
Current, up-to-date data                         | Snapshot data
Isolated data                                    | Integrated data
Repetitive access                                | Ad-hoc access
Normal users                                     | Knowledge users
– More than half of all development work for data warehousing projects
is typically dedicated to the design and implementation of ETL
processes.
[Diagram: raw data files, system-generated files, and a staging DB feed the DataStage ETL process, which uses lookup data sets and produces load-ready data for the data warehouse.]
Transforming Examples

Change – convert a column's representation (here, reg_id from Roman numerals to integers):

buyer_name     reg_id  total_sales   →   buyer_name     reg_id  total_sales
Barr, Adam     II      17.60             Barr, Adam     2       17.60
Chai, Sean     IV      52.80             Chai, Sean     4       52.80
O’Melia, Erin  VI      8.82              O’Melia, Erin  6       8.82
...

Combine – merge buyer_first and buyer_last into a single buyer_name column:

buyer_first  buyer_last  reg_id  total_sales   →   buyer_name     reg_id  total_sales
Adam         Barr        2       17.60             Barr, Adam     2       17.60
Sean         Chai        4       52.80             Chai, Sean     4       52.80
Erin         O’Melia     6       8.82              O’Melia, Erin  6       8.82
...

Calculate – derive a new column, total_sales = price_id × qty_id:

buyer_name     price_id  qty_id   →   buyer_name     price_id  qty_id  total_sales
Barr, Adam     .55       32           Barr, Adam     .55       32      17.60
Chai, Sean     1.10      48           Chai, Sean     1.10      48      52.80
O’Melia, Erin  .98       9            O’Melia, Erin  .98       9       8.82
...
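These three transforms are simple enough to prototype outside DataStage. Below is a minimal Python sketch of the same change, combine, and calculate steps; the field names come from the tables above, while the ROMAN lookup table and the helper function names are assumptions made for illustration, not DataStage APIs.

    # Sketch of the three transform types above; ROMAN and the helper
    # names are illustrative, not part of DataStage.
    ROMAN = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6}

    def change(row):
        # Change: convert reg_id from a Roman numeral to an integer.
        row["reg_id"] = ROMAN[row["reg_id"]]
        return row

    def combine(row):
        # Combine: merge buyer_first/buyer_last into buyer_name.
        row["buyer_name"] = f'{row.pop("buyer_last")}, {row.pop("buyer_first")}'
        return row

    def calculate(row):
        # Calculate: derive total_sales = price_id * qty_id.
        row["total_sales"] = round(row["price_id"] * row["qty_id"], 2)
        return row

    print(change({"buyer_name": "Barr, Adam", "reg_id": "II", "total_sales": 17.60}))
    print(combine({"buyer_first": "Adam", "buyer_last": "Barr", "reg_id": 2}))
    print(calculate({"buyer_name": "Barr, Adam", "price_id": 0.55, "qty_id": 32}))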
Datastage – Overview
DataStage has the following features to aid the design and processing required
to build a data warehouse:
– Uses graphical design tools. With simple point-and-click techniques you can draw a scheme to represent your processing requirements.
– Extracts data from any number or type of database.
– Handles all the meta data definitions required to define your data warehouse. You can view and modify the table definitions at any point during the design of your application.
– Aggregates data. You can modify the SQL SELECT statements used to extract data.
– Transforms data. DataStage has a set of predefined transforms and functions you can use to convert your data. You can easily extend the functionality by defining your own transforms to use.
– Loads the data warehouse.
DataStage consists of a number of client and server components. The four client components are installed on any PC running Windows 2000 or Windows NT 4.0 with Service Pack 4 or later.

DataStage supports four types of jobs:
Parallel Jobs
– Executed under control of the DataStage Server runtime environment
– Built-in functionality for pipeline and partitioning parallelism
– Compiled into OSH (Orchestrate Scripting Language) – executable C++ class instances
– Runtime monitoring in DataStage Director

Job Sequences (Batch jobs, Controlling jobs)
– Master Server jobs that kick off jobs and other activities
– Can kick off Server or Parallel jobs
– Runtime monitoring in DataStage Director

Server Jobs (requires Server Edition license)
– Executed by the DataStage Server Edition
– Compiled into BASIC (interpreted pseudo-code)
– Runtime monitoring in DataStage Director

Mainframe Jobs (requires Mainframe Edition license)
– Compiled into COBOL
– Executed on the mainframe, outside of DataStage
Design Elements of Parallel Job
Stages
Implemented as OSH operators (pre-built components)
Passive stages (E and L of ETL)
• Read data
• Write data
• e.g. sequential file, Oracle
Processor (active) stages (T of ETL)
• Transform data
• Filter data
• Aggregate data
• Generate data
• Split/merge data
• e.g. Transformer, Aggregator, Join, Sort stages
Links
“Pipes” through which the data moves from stage to stage
CREATING A SAMPLE JOB
Requirement
Server:Port
Attaching to a Project
Palette
Drag and drop the source stage from the Palette to the design area
Drag and drop the processing stage from the Palette to the design area
Drag and drop the target stage from the Palette to the design area
Job – "Running"
Job – "Finished"
Pipeline Parallelism
Here stages fill pipelines, so the next stage can start processing data before the previous one has finished.

Pipeline Parallelism, cont’d
– The process of pulling records from the source system and moving them through the sequence of processing functions defined in the data flow
– Data can be buffered in blocks so that each process is not slowed when other components are running
– This approach avoids deadlocks and speeds performance by allowing both upstream and downstream processes to run concurrently; a rough code analogy follows
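As an analogy only (the actual engine runs stages as concurrently executing processes with block buffering, which plain Python does not reproduce), Python generators illustrate the pipelining idea: each stage pulls records from the previous one, so downstream work starts before upstream work completes.

    # Each stage yields records one at a time; no stage waits for the
    # previous stage to finish the whole data set.
    def extract():
        for i in range(3):
            print(f"extract row {i}")
            yield {"id": i}

    def transform(rows):
        for row in rows:
            row["doubled"] = row["id"] * 2
            print(f"transform row {row['id']}")
            yield row

    def load(rows):
        for row in rows:
            print(f"load row {row['id']}")

    # The interleaved output (extract 0, transform 0, load 0, extract 1, ...)
    # shows records flowing through the pipeline record by record.
    load(transform(extract()))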
Partition Parallelism
– Using partition parallelism, the same job is effectively run simultaneously by several processors, each handling a separate subset of the total data
– An approach to parallelism that involves breaking the record set into partitions, or subsets of records
– Here stages process partitioned data and fill pipelines, so the next stage can start on a partition before the previous one has finished (see the sketch below)
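A minimal sketch of the idea in Python, assuming a hypothetical key-hash partitioner and a worker pool; DataStage itself decides the degree of parallelism from its configuration file rather than from code like this.

    from multiprocessing import Pool

    def partition(rows, n):
        # Split the record set into n partitions by hashing a key column.
        parts = [[] for _ in range(n)]
        for row in rows:
            parts[hash(row["key"]) % n].append(row)
        return parts

    def process_partition(rows):
        # The "same job" applied independently to one subset of the data.
        return [{**row, "value": row["value"] * 2} for row in rows]

    if __name__ == "__main__":
        data = [{"key": k, "value": v} for v, k in enumerate("abcdefgh")]
        with Pool(4) as pool:                   # four workers, four partitions
            results = pool.map(process_partition, partition(data, 4))
        print([row for part in results for row in part])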
Repartitioning data
Types of Jobs:
– CPU-limited jobs
– Memory-limited jobs
– Disk I/O-limited jobs
The configuration file
The default name of the configuration file is default.apt and it is located in the
Configurations directory in the Server directory of your installation.
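For reference, a parallel configuration file describes the processing nodes and their disk resources. A minimal two-node example is sketched below; the host name and paths are placeholders, so treat the exact values as assumptions.

    {
        node "node1" {
            fastname "etl_host"
            pools ""
            resource disk "/data/datasets" {pools ""}
            resource scratchdisk "/data/scratch" {pools ""}
        }
        node "node2" {
            fastname "etl_host"
            pools ""
            resource disk "/data/datasets" {pools ""}
            resource scratchdisk "/data/scratch" {pools ""}
        }
    }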
Datastage Partitioning
Auto.
– Datastage attempts to work out the best partitioning method depending on the execution modes of the current and preceding stages, and how many nodes are specified in the configuration file. This is the default partitioning method.
Entire.
– Each file written to receives the entire data set.
– Every instance of a stage on every processing node receives the complete data
set as input
Hash.
– The records are hashed into partitions based on the value of a key column or
columns selected from the Available list.
Modulus.
– The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields. (Hash, modulus, and round robin are contrasted in the sketch after this list.)
Datastage Partitioning, cont’d
Random.
– The records are partitioned randomly, based on the output of a random
number generator.
Round Robin.
– The records are partitioned on a round robin basis as they enter the stage.
Same.
– Preserves the partitioning already in place.
DB2.
– Replicates the DB2 partitioning method of a specific DB2 table. Requires extra
properties to be set.
Range.
– Divides a data set into approximately equal size partitions, based on one or
more partitioning keys. Range partitioning is often a preprocessing step to
performing a total sort on a data set. Requires extra properties to be set.
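The key-based and keyless methods above are easy to contrast in code. Below is a small Python sketch of hash, modulus, and round robin partition assignment; the function names and demo rows are hypothetical, for illustration only.

    import itertools

    def hash_partition(row, keys, n):
        # Hash: identical key values always land in the same partition.
        return hash(tuple(row[k] for k in keys)) % n

    def modulus_partition(row, key, n):
        # Modulus: for a single integer key column, partition = value mod n.
        return row[key] % n

    _turn = itertools.count()
    def round_robin_partition(n):
        # Round robin: keyless; rows are dealt to partitions in turn.
        return next(_turn) % n

    rows = [{"cust_id": i, "region": r} for i, r in enumerate("EWEW")]
    for row in rows:
        print(row,
              "hash:", hash_partition(row, ["region"], 4),
              "modulus:", modulus_partition(row, "cust_id", 4),
              "round robin:", round_robin_partition(4))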
Partition Icons
Collecting is the process of joining the multiple partitions of a single data set back
together again into a single partition.
Round Robin.
– The records are collected on a round robin basis as they enter the stage.
Ordered.
– Reads all records from the first partition, then all from the second partition, and so on.
Sorted Merge.
– Reads records in an order based on one or more columns of the record. These columns are called collecting keys. (Ordered and sorted merge are contrasted in the sketch below.)
Auto.
– Datastage eagerly reads rows from any input partition as they become available. This is the fastest collection method.
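A small Python sketch of how ordered and sorted merge collection differ when gluing partitions back together; the data is made up, and heapq.merge simply stands in for the stage's merge logic.

    import heapq

    partitions = [
        [{"key": 1}, {"key": 4}],   # partition 0, already sorted on "key"
        [{"key": 2}, {"key": 3}],   # partition 1, already sorted on "key"
    ]

    # Ordered: all of partition 0, then all of partition 1.
    ordered = [row for part in partitions for row in part]
    print([r["key"] for r in ordered])    # [1, 4, 2, 3]

    # Sorted merge: interleave partitions on the collecting key, preserving
    # a global sort order (each partition must already be sorted on it).
    merged = list(heapq.merge(*partitions, key=lambda r: r["key"]))
    print([r["key"] for r in merged])     # [1, 2, 3, 4]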
Collecting Icons
Sorting
Processing sorted data reduces overhead and, in turn, can increase performance. Developers need to consider whether the data they are processing is sorted or not; this decision may affect the partitioning and collection methods used to optimize performance.
The Partitioning tab allows you to specify that data arriving on the input link should be sorted
before being written to the file or files. The sort is always carried out within data partitions.
When partitioning incoming data, the sort occurs after the partitioning.
When collecting data, the sort occurs before the collection.
The availability of sorting options depends on the partitioning or collecting method chosen (it is
not available with the Auto methods).
Options include:
– Perform Sort. Select this to specify that data coming in on the link should be sorted. Select
the column or columns to sort on from the Available list.
– Stable. Select this if you want to preserve previously sorted data sets. This is the default.
– Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained (illustrated in the sketch below).
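Stable and Unique combine naturally: a stable sort keeps the original relative order of records with equal keys, so Unique then retains the first such record. A short Python illustration (Python's sorted() is stable, standing in for the stage's stable sort; the rows are made up):

    rows = [
        {"key": "a", "seq": 1},
        {"key": "b", "seq": 2},
        {"key": "a", "seq": 3},   # duplicate sorting-key value
    ]

    # Perform Sort + Stable: equal keys keep their original relative order.
    stable_sorted = sorted(rows, key=lambda r: r["key"])

    # Unique: keep only one record per key value; with a stable sort
    # that is the first record encountered.
    seen, unique = set(), []
    for row in stable_sorted:
        if row["key"] not in seen:
            seen.add(row["key"])
            unique.append(row)
    print(unique)   # [{'key': 'a', 'seq': 1}, {'key': 'b', 'seq': 2}]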