DSX InfoSphere DataStage Is Big Data Integration 2013-05-13

Luncheon Webinar Series
May 13, 2013

InfoSphere DataStage is Big Data Integration
Presented by: Tony Curcio, InfoSphere Product Management

Questions and suggestions regarding presentation topics? Send them to editor@dsxchange.net.
Downloading the presentation: click "Presentation YES" on the poll question.
A replay will be available within one day; details will follow by email.
Bonus offer: free premium membership for your DataStage management. Submit your manager's email address and we will offer him/her access on your behalf. Email Info@dsxchange.net with the subject line "Managers special".
Join us on LinkedIn: http://tinyurl.com/DSXmembers
ISXchange will sponsor trial membership for new requests at the LinkedIn DSX members site.
© 2013 IBM Corporation
Bigger Data Integration Challenges

New types of data stores:
Big Data introduces additional data stores that need to be integrated, both Hadoop-based and noSQL-based.
These data stores don't easily lend themselves to conventional methods for data movement.

New data types and formats:
Unstructured data; poly-structured data stores; JSON, Avro, and more to come.
Video, docs, web logs, ...

Larger volumes:
Solutions need to move, transform, cleanse and otherwise prepare huge data volumes.
Big Data requires data scalability.
Benefits of InfoSphere DataStage
Speeds Productivity: graphical design is easier to use than hand coding.
Promotes Object Reuse: build once, share, and run anywhere (ETL/ELT/real-time).
Simplifies Heterogeneity: common method for diverse data sources.
Reduces Operational Cost: provides a robust framework to manage data integration.
Shortens Project Cycles: pre-built components reduce cost and timelines.
Protects from Changes: isolation from underlying technologies as they continue to evolve.
Big Data is part of the Information Supply Chain
[Diagram: Information Supply Chain. Stages: Integrate, Manage, Analyze, Govern. Labels: Transactional & Collaborative Applications; Business Analytics Applications; External Information Sources; Streaming Information; Content; Master Data; Data Warehouses; Cubes; Big Data; Streams; Information Governance (Standards, Quality, Lifecycle, Security & Privacy).]
Gartner Magic Quadrant

IBM is the only DBMS vendor that can offer an information architecture across the
entire organization, covering information on all systems
4 Key Analytical Use Cases for Big Data
Big Data Exploration: find, visualize, and understand all big data to improve decision making.
Enhanced 360° View of the Customer: extend existing customer views by incorporating additional information sources.
Data Warehouse Augmentation: integrate big data and data warehouse capabilities to increase operational efficiency.
Operations Analysis: analyze a variety of machine data for improved business results.
Data Warehouse Augmentation
Integrate big data and data warehouse capabilities to increase operational efficiency.

Challenges:
Leveraging structured, unstructured, and streaming data sources for deep analysis
Low latency requirements
Query access to data
Optimizing warehouse for big data volumes
Metadata management to support impact analysis and data lineage

Required capabilities:
Data Integration Hub Processing: high-speed, massively scalable read from and write to big data sources and new data.
Big Data Expert: automatically build MapReduce logic through simple data flow design and coordinate workflow across traditional and big data platforms.
Data Integration Hub Processing
Connectivity Hub
InfoSphere DataStage
Effectively handle the complexity of enterprise information sources and types with a common design paradigm across a heterogeneous landscape, with a high-speed, scalable solution to speed the delivery of analytics.

[Diagram: a data flow (Source Data, Transform, Cleanse, Enrich, EDW) shown on three topologies: a sequential uniprocessor (CPU, memory, disk), a 4-way parallel SMP system (four CPUs, shared memory, disk), and a 64-way parallel MPP clustered system.]

Dynamic: instantly get better performance as hardware resources are added to any topology.
Extendable: add a new server to scale out through a simple text file edit (or, in a grid configuration, automatically via integration with grid management software).
Data Partitioned: in true MPP fashion (like Hadoop), data persisted in the data integration platform is stored in parallel to scale out the I/O.
Hadoop Integrated: push all or parts of the process out to Hadoop to take advantage of its scalability in ELT fashion.
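The "Data Partitioned" point can be illustrated with a minimal Python sketch. This is not DataStage code: it only shows the general idea of hash-partitioning rows by key so that each partition can be processed by its own worker. The function names, the `customer_id` key, and the sample rows are assumptions made for illustration, and threads stand in for the separate nodes of a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32


def partition_rows(rows, key, n_partitions):
    """Hash-partition rows so equal keys always land in the same partition."""
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        # crc32 is stable across runs, unlike Python's salted hash() for str.
        index = crc32(str(row[key]).encode("utf-8")) % n_partitions
        partitions[index].append(row)
    return partitions


def total_amount(partition):
    """Stand-in for a transform stage: sum one partition's amounts."""
    return sum(row["amount"] for row in partition)


def run_parallel(rows, key, n_partitions):
    """Process every partition on its own worker, then combine the results."""
    partitions = partition_rows(rows, key, n_partitions)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        return sum(pool.map(total_amount, partitions))


# Hypothetical sample data.
rows = [{"customer_id": i % 7, "amount": i} for i in range(100)]
print(run_parallel(rows, "customer_id", 4))
```

Changing the partition count changes only how the work is spread, not the result, which is the property the "Dynamic" and "Extendable" points rely on.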
Big Data Source Types
Hadoop Distributed File System: massively scalable and resilient storage
noSQL (not-only SQL): record storage optimized for read (or write)
InfoSphere Streams: massive real-time analytics
Blazing Fast HDFS
Available since v8.7 in 2011
Extends the simple flat file paradigm: just add your Hadoop server name and port number
Parallelization techniques to pipe data in and out at massive scale
Performance study ran at up to 5.2 TB/hr before the HDFS disks were completely saturated (5-node Hadoop cluster)
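To put the quoted peak rate in more familiar units, the short calculation below converts 5.2 TB/hr on a 5-node cluster into MB/s. It assumes decimal units (1 TB = 1,000,000 MB) and an even spread of I/O across the nodes; neither assumption is stated in the presentation.

```python
TB_PER_HOUR = 5.2        # peak rate quoted in the performance study
NODES = 5                # cluster size quoted in the performance study
MB_PER_TB = 1_000_000    # assumption: decimal units
SECONDS_PER_HOUR = 3600

aggregate_mb_per_s = TB_PER_HOUR * MB_PER_TB / SECONDS_PER_HOUR
per_node_mb_per_s = aggregate_mb_per_s / NODES  # assumption: even spread

print(f"aggregate: {aggregate_mb_per_s:.0f} MB/s")
print(f"per node:  {per_node_mb_per_s:.0f} MB/s")
```

Under those assumptions the figure works out to roughly 1.4 GB/s in aggregate, or a little under 300 MB/s per node.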
Simple data flow design for HDFS
Job design callouts:
Read from an HDFS file in parallel
Join two HDFS files
Transform/restructure the data
Create new HDFS file, fully parallelized
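The four callouts on this slide (read, join, transform/restructure, create a new file) describe a job drawn on the DataStage canvas. As a rough illustration of the same logical flow, here is a minimal Python sketch. It is not DataStage or HDFS code: local CSV files stand in for the two HDFS files, and every function name, column name, and sample value is an assumption.

```python
import csv
import tempfile
from pathlib import Path


def read_rows(path):
    """Read stage: load a delimited file into a list of dicts."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def join_rows(left, right, key):
    """Join stage: inner join of two row sets on a shared key."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]


def restructure(rows):
    """Transform stage: keep and rename only the columns the target needs."""
    return [
        {"customer": row["name"].upper(), "total": float(row["amount"])}
        for row in rows
    ]


def write_rows(path, rows):
    """Write stage: create the new output file."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["customer", "total"])
        writer.writeheader()
        writer.writerows(rows)


# Hypothetical inputs; local files stand in for the two HDFS files.
workdir = Path(tempfile.mkdtemp())
(workdir / "orders.csv").write_text("id,amount\n1,10.5\n2,4.0\n3,7.25\n")
(workdir / "customers.csv").write_text("id,name\n1,ada\n2,grace\n")

joined = join_rows(
    read_rows(workdir / "orders.csv"), read_rows(workdir / "customers.csv"), "id"
)
write_rows(workdir / "result.csv", restructure(joined))
print((workdir / "result.csv").read_text())
```

Each function corresponds to one stage on the canvas; the sketch runs them on a single process, whereas the slide describes every stage running in parallel.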
Agile Connector Accelerators for noSQL
New connectors available on developerWorks
Plugs into InfoSphere DataStage and operates just like any other stage.
Includes features to exploit specific data sources
Open code
Sample Job with MongoDB and Hive
Job design callouts:
Accepts specific MongoDB directives
Selects what HDFS data to send downstream
Writing data to MongoDB
Writing data to Hive
Parse and Compose JSON (beta)
Parsing and composing of JSON data format
Included in the advanced transformation framework already provided for XML capabilities
Beta available on InfoSphere DataStage 9.1 FP1
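For readers unfamiliar with the terms, "parse" here means turning a nested JSON document into flat rows, and "compose" means the reverse. The following Python sketch shows both directions using only the standard `json` module. It does not represent the DataStage stage itself; the document shape, field names, and function names are assumptions.

```python
import json


def parse_orders(document):
    """Parse: flatten one nested JSON document into flat rows."""
    data = json.loads(document)
    return [
        {"customer": data["customer"], "sku": item["sku"], "qty": item["qty"]}
        for item in data["items"]
    ]


def compose_orders(rows):
    """Compose: rebuild a nested JSON document from flat rows.

    Assumes a non-empty list of rows that all share one customer.
    """
    document = {
        "customer": rows[0]["customer"],
        "items": [{"sku": row["sku"], "qty": row["qty"]} for row in rows],
    }
    return json.dumps(document, sort_keys=True)


# Hypothetical input document.
source = (
    '{"customer": "ada", '
    '"items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}'
)
rows = parse_orders(source)
print(rows)
print(compose_orders(rows))
```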
Big Data Expert
InfoSphere DataStage
Automatically push transformational processing close to where the data resides, both SQL for DBMS and MapReduce for Hadoop, leveraging the same simple data flow design process, and coordinate workflow across all platforms.
Automated MapReduce Job Generation
New in 9.1: leverage the same UI and the same stages to build MapReduce.
Drag and drop stages onto the canvas to create a job, rather than having to learn MapReduce programming.
Push the processing to Hadoop for patterns where you don't want to transport the data over the network.
Build integration jobs with the same data flow tool and stages
Automatically creates MapReduce code.
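To make concrete what kind of logic a designer is spared from hand-writing, here is a minimal in-memory Python sketch of the MapReduce pattern for a simple "total per key" aggregation. It is not the code DataStage generates, and it does not run on Hadoop; the record layout and function names are assumptions.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit one (key, value) pair per input record."""
    for record in records:
        yield record["region"], record["sales"]


def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's values into one total."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input records.
records = [
    {"region": "east", "sales": 10},
    {"region": "west", "sales": 5},
    {"region": "east", "sales": 7},
]
totals = reduce_phase(shuffle(map_phase(records)))
print(totals)
```

In a data flow tool the same result would come from a single aggregation stage; the three explicit phases are what a generated MapReduce job has to express.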
Job includes other database on separate system
Recognizes what processing can run natively in Hadoop and what requires the DataStage engine to move the data
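The presentation does not say how the product decides which stages run in Hadoop and which run on the engine. Purely as an illustration of the idea, the Python sketch below applies one very simple, assumed rule: consecutive stages whose data lives in Hadoop are grouped for pushdown, and anything touching another system is assigned to the engine. The `Stage` class, the `plan` function, and the stage names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    system: str  # where the stage's data lives, e.g. "hadoop" or "db2"


def plan(stages, pushdown_system="hadoop"):
    """Group consecutive stages by where they would run (assumed rule)."""
    groups = []
    for stage in stages:
        target = pushdown_system if stage.system == pushdown_system else "engine"
        if groups and groups[-1][0] == target:
            groups[-1][1].append(stage.name)
        else:
            groups.append((target, [stage.name]))
    return groups


# Hypothetical job: three Hadoop stages and one lookup against another database.
stages = [
    Stage("read_clicks", "hadoop"),
    Stage("filter_bots", "hadoop"),
    Stage("lookup_customer", "db2"),
    Stage("aggregate", "hadoop"),
]
for target, names in plan(stages):
    print(target, names)
```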
Architecture for Warehouse Landing Zone
Use Case Requirements: Data Warehouse Landing Zone
Large Scale: large data volumes; scale-out requires an open MPP platform
Low Cost: low-cost storage, compute and commodity hardware
Many Data Types: un/semi-structured and social datatype coverage
Many Access Patterns: exploratory, iterative and discovery oriented
[Diagram labels: sources (clickstream, sensors, transactions, content, all sources); Landing Zone on BigInsights / Hadoop (JAQL, Hive, HBase, Custom MR); Information Server (ETL, Replication, Lineage, Quality); Analytics Warehouse Zone; Operational Warehouse Zone; Guardium; Optim; Masking.]
Combined Workflows for Big Data

Oozie Integration:
Same design paradigm for workflows as for job design.
Directly call an Oozie activity that is invoking custom MapReduce code.

End-to-end Workflows:
Sequence right alongside other data integration and analytics activities.
Allows users to have the data sourcing, ETL, analytics and delivery of information all controlled through a single process.
Monitor all stages through the Operations Console's web-based interface.
Cross Tool Impact Analysis and Traceability
Understand how traditional and big data sources are being used
Assess impact of change and mitigate risks
Show impact on downstream applications and BI reports
Navigate through impacted areas and drill down
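Impact analysis of this kind amounts to walking a lineage graph from a changed asset to everything downstream of it. A minimal Python sketch of that traversal follows. It is not the Information Server metadata API; the asset names and the shape of the `lineage` mapping are assumptions made for illustration.

```python
from collections import deque


def downstream(lineage, changed):
    """Return every asset reachable from `changed`, in breadth-first order."""
    seen = {changed}
    order = []
    queue = deque([changed])
    while queue:
        asset = queue.popleft()
        for consumer in lineage.get(asset, []):
            if consumer not in seen:
                seen.add(consumer)
                order.append(consumer)
                queue.append(consumer)
    return order


# Hypothetical lineage: each asset maps to the assets that consume it.
lineage = {
    "hdfs:/landing/clicks": ["job:sessionize"],
    "job:sessionize": ["table:dw.sessions"],
    "table:dw.sessions": ["report:traffic", "report:conversion"],
}
print(downstream(lineage, "hdfs:/landing/clicks"))
```

Starting from the HDFS source, the traversal reaches the job, the warehouse table, and both reports, which is the "impact on downstream applications and BI reports" the slide refers to.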
Wrap-up
The IBM Big Data Platform
New analytic applications drive the requirements for a big data platform:
Integrate and manage the full variety, velocity and volume of data
Apply advanced analytics to information in its native form
Visualize all available data for ad hoc analysis
Development environment for building new analytic applications
Workload optimization and scheduling
Security and Governance
[Diagram: BIG DATA PLATFORM. Systems Management; Application Development; Discovery; Accelerators; Hadoop System; Stream Computing; Data Warehouse; Information Integration & Governance. Source labels: Data, Media, Content, Machine, Social.]
Information Integration & Governance for Big Data

Cleanse and Validate Big Data:
Accuracy and Entity Matching with Social Data
De-duplication and Standardization of Machine Data
In-line Cleansing with Integration
Trusted Data Dashboard and Reporting on Data Quality

Integrate & Link Big Data:
Big Data as a Source
Big Data as a Target
Data Transformations
Data Movement
Integrate w/existing Enterprise
Data Lineage & Impact Analysis
Metadata Integration w/Analytics
Realtime & Data Federation

Master Big Data:
Big Data as a Supplier
Big Data as a Consumer
Links between Big Data and Trusted Golden Records
Leverage Master Data in Big Data Analytics
Entity Resolution at Extreme Scale Out Levels
Probabilistic Entity Matching

Protect Big Data:
Activity Monitoring
Data Masking
Data Encryption
On-Demand / In-Place Protection
In-Line Protection (w/ETL etc.)
Active Detection & Alerting

Audit & Archive Big Data:
Queryable Archive
Structured and Semi-Structured
Optimized Connectors to existing Apps
Hot-Restorable On-the-Fly
Immutable and Secure Access
Automated Legal Hold Capability for Data Freeze
Where to go to learn more
If you'd like to explore this topic further: contact your IBM account team or your preferred IBM Partner.
If you'd like to explore more about InfoSphere DataStage and the Information Server platform: http://www-01.ibm.com/software/data/integration/info_server/
If you're looking for an enterprise-level Hadoop distribution: InfoSphere BigInsights, http://www-01.ibm.com/software/data/infosphere/biginsights/
Thanks
