
Luncheon Webinar Series

May 13, 2013


InfoSphere DataStage is Big Data Integration

Sponsored By:
Presented by: Tony Curcio, InfoSphere Product Management



Questions and suggestions regarding presentation topics? Send them to editor@dsxchange.net
Downloading the presentation
Click Presentation YES on the poll question
A replay will be available within one day; details will follow by email
Bonus offer: free premium membership for your DataStage management! Submit your manager's email address and we will offer him/her access on your behalf.
Email Info@dsxchange.net with the subject line "Managers special".
Join us all on LinkedIn: http://tinyurl.com/DSXmembers
ISXchange will sponsor trial membership for new requests at the LinkedIn DSX members site


© 2013 IBM Corporation

Bigger Data Integration Challenges

New types of data stores


Big Data introduces additional data stores that need to be
integrated, both Hadoop-based and noSQL-based
These data stores don't easily lend themselves to conventional
methods for data movement

New data types and formats


Unstructured data; poly-structured data stores; JSON, Avro,
and more to come
Video, documents, web logs, ...

Larger volumes
Solutions need to move, transform, cleanse and otherwise
prepare huge data volumes
Big Data requires data scalability

Benefits of InfoSphere DataStage


Speeds Productivity
Graphical design easier to use than hand coding

Promotes Object Reuse


Build once, share, and run anywhere (ETL/ELT/real-time)

Simplifies Heterogeneity
Common method for diverse data sources

Reduces Operational Cost


Provides a robust framework to manage data integration

Shortens Project Cycles


Pre-built components reduce cost and timelines

Protects from Changes


Isolation from changes in underlying technologies as they
continue to evolve

Big Data is part of the Information Supply Chain

[Diagram: the Information Supply Chain. Stages: Integrate, Manage, Analyze, Govern. Sources: Transactional & Collaborative Applications, Business Analytics Applications, External Information Sources, Content, Streaming Information. Managed stores: Big Data, Master Data, Cubes, Streams, Data Warehouses, Content. Information Governance: Standards, Quality, Security & Privacy, Lifecycle.]

Gartner Magic Quadrant


IBM is the only DBMS vendor that can offer an information architecture across the
entire organization, covering information on all systems

4 Key Analytical Use Cases for Big Data


Big Data Exploration
Find, visualize, understand all big data to improve decision making

Enhanced 360° View of the Customer
Extend existing customer views by incorporating additional information sources

Data Warehouse Augmentation
Integrate big data and data warehouse capabilities to increase operational efficiency

Operations Analysis
Analyze a variety of machine data for improved business results

Data Warehouse Augmentation


Integrate big data and data warehouse
capabilities to increase operational efficiency

Challenges
Leveraging structured, unstructured,
and streaming data sources for deep
analysis
Low latency requirements
Query access to data
Optimizing warehouse for big data
volumes
Metadata management to support
impact analysis and data lineage

Required capabilities
Data Integration Hub Processing
High-speed, massively scalable
read from and write to big data
sources and new data
Big Data Expert
Automatically build MapReduce
logic through simple data flow
design and coordinate workflow
across traditional and big data
platforms

Data Integration
Hub Processing

Connectivity Hub

InfoSphere DataStage
Effectively handle the complexity of enterprise information sources
and types with a common design paradigm across a heterogeneous
landscape, and with a high-speed, scalable solution that speeds
the delivery of analytics.


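[Diagram: the same job design (Source Data to EDW, with Transform, Cleanse and Enrich stages) running Sequential on a Uniprocessor (CPU, Memory, Disk), 4-way Parallel on an SMP System (four CPUs, Shared Memory, Disk), and 64-way Parallel on an MPP Clustered System.]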

Dynamic
Instantly get better performance
as hardware resources are
added to any topology
Extendable
Add a new server to scale out
through a simple text file edit (or, in
a grid configuration, automatically via
integration with grid management
software).
Data Partitioned
In true MPP fashion (like
Hadoop) data persisted in the
data integration platform is stored
in parallel to scale out the I/O.

Hadoop Integrated
Push all or parts of the process
out to Hadoop to take advantage
of its scalability in ELT fashion.
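To make the "Data Partitioned" point concrete, here is a minimal sketch of hash partitioning in plain Python. It is not DataStage's partitioner: the function name `partition_records`, the `customer_id` field and the 4-way split are assumptions chosen for illustration. The idea it shows is that rows with the same key always land in the same partition, so each partition can be processed and stored independently.

```python
import zlib
from collections import defaultdict


def partition_records(records, key, num_partitions):
    """Assign each record to a partition by hashing its key field.

    Illustrative only: a stand-in for the hash partitioning that a
    parallel engine applies before running stages on each partition.
    """
    partitions = defaultdict(list)
    for record in records:
        # crc32 is stable across runs, unlike the built-in hash() on strings.
        bucket = zlib.crc32(str(record[key]).encode("utf-8")) % num_partitions
        partitions[bucket].append(record)
    return dict(partitions)


# Hypothetical sample rows.
records = [{"customer_id": i, "amount": i * 10} for i in range(8)]
four_way = partition_records(records, "customer_id", 4)
```

Changing `num_partitions` is the only thing that changes when the degree of parallelism changes; the per-record logic stays the same.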

Big Data Source Types

Hadoop Distributed File System
massively scalable and resilient storage

noSQL (not-only SQL)
record storage optimized for read (or write)

InfoSphere Streams
massive real-time analytics

Blazing Fast HDFS

Available since v8.7 in 2011


Extends the simple flat file paradigm: just add your Hadoop server name and port number
Parallelization techniques to pipe data in and out at massive scale

Performance study ran at up to 5.2 TB/hr before the HDFS disks were completely saturated (5-node Hadoop cluster)
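As a rough back-of-the-envelope check on that figure, the arithmetic below converts 5.2 TB/hr into per-second and per-node rates. It assumes decimal units (1 TB = 10^12 bytes) and an even spread of I/O across the five nodes; neither assumption is stated on the slide.

```python
TB = 10**12  # assumption: decimal terabytes
MB = 10**6   # assumption: decimal megabytes

total_tb_per_hour = 5.2  # figure quoted in the performance study
nodes = 5                # size of the Hadoop cluster in the study

# Aggregate throughput in bytes per second.
bytes_per_second = total_tb_per_hour * TB / 3600

# Per-node share, assuming the load is spread evenly.
per_node_mb_per_second = bytes_per_second / nodes / MB
```

Under these assumptions the aggregate rate is about 1.44 GB/s, or roughly 289 MB/s per node.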


Simple data flow design for HDFS


Read from an HDFS file in parallel
Join two HDFS files
Transform/restructure the data
Create new HDFS file, fully parallelized
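The shape of that job (read, join, transform, write) can be sketched in plain Python. This is only an analogy for the graphical flow: in-memory lists stand in for HDFS files, and every name here (`read_source`, `join_on`, `transform`, the sample rows) is hypothetical rather than anything DataStage exposes.

```python
def read_source(rows):
    """Stand-in for a parallel HDFS read; here it just yields in-memory rows."""
    yield from rows


def join_on(left, right, key):
    """Inner-join two row streams on a shared key (right side held in memory)."""
    lookup = {}
    for row in right:
        lookup.setdefault(row[key], []).append(row)
    for row in left:
        for match in lookup.get(row[key], []):
            yield {**row, **match}


def transform(rows):
    """Restructure each joined row into the output layout."""
    for row in rows:
        yield {"id": row["id"], "label": f'{row["name"].upper()}:{row["total"]}'}


# Hypothetical inputs standing in for two HDFS files.
customers = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
orders = [{"id": 1, "total": 30}, {"id": 1, "total": 5}, {"id": 3, "total": 9}]

# The list() call stands in for writing the new output file.
output = list(transform(join_on(read_source(orders), read_source(customers), "id")))
```

Each function corresponds to one stage on the canvas; chaining generators mirrors how rows stream from stage to stage without being materialized in between.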


Agile Connector Accelerators for noSQL


New connectors available on developerWorks
Plug into InfoSphere DataStage and operate just like any other stage
Include features to exploit specific data sources

Open
Code


Sample Job with MongoDB and Hive


Accepts specific MongoDB directives
Selects what HDFS data to send downstream
Writing data to MongoDB
Writing data to Hive


Parse and Compose JSON (beta)

Parsing and composing of the JSON data format
Includes the advanced transformation framework already provided for XML capabilities

Beta available on InfoSphere DataStage 9.1 FP1
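For readers unfamiliar with the terms, "parse" here means flattening a nested JSON document into rows, and "compose" means building a document back up from rows. The sketch below shows both directions with Python's standard `json` module; it is not the DataStage stage itself, and the order document and the helper names `parse_order` and `compose_order` are made up for illustration.

```python
import json

# Hypothetical nested input document.
raw = '{"order": {"id": 42, "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}}'


def parse_order(text):
    """Parse: flatten the nested document into one row per line item."""
    order = json.loads(text)["order"]
    return [
        {"order_id": order["id"], "sku": item["sku"], "qty": item["qty"]}
        for item in order["items"]
    ]


def compose_order(rows):
    """Compose: rebuild the nested document from flat rows of one order."""
    if not rows:
        raise ValueError("at least one row is required")
    items = [{"sku": row["sku"], "qty": row["qty"]} for row in rows]
    return json.dumps({"order": {"id": rows[0]["order_id"], "items": items}})


rows = parse_order(raw)
round_trip = json.loads(compose_order(rows))
```

Parsing followed by composing returns the original structure, which is a handy sanity check when designing either side of such a job.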


Big Data Expert

InfoSphere DataStage
Automatically push transformational processing close to where the
data resides, using SQL for a DBMS and MapReduce for Hadoop,
leveraging the same simple data flow design process, and coordinate
workflow across all platforms

Automated MapReduce Job Generation

New in 9.1: leverage the same UI and the same stages to build
MapReduce.
Drag and drop stages to the canvas to create a job, rather than having to
learn MapReduce programming.
Push the processing to Hadoop for patterns where you don't want to
transport the data on the network.
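For context on what a designer is spared from writing, the sketch below walks through the map, shuffle and reduce phases of a simple aggregation in plain Python. It runs in a single process and is not the code DataStage generates; the sales rows and function names are hypothetical.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (key, value) pair for every input record."""
    for record in records:
        yield record["region"], record["amount"]


def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input rows.
sales = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 7},
    {"region": "east", "amount": 5},
]
totals = reduce_phase(shuffle(map_phase(sales)))
```

On a real cluster the map and reduce functions run on many nodes and the shuffle moves data between them; in a data flow design the same aggregation is a single stage on the canvas.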


Automated MapReduce Job Generation

Build integration
jobs with the
same data flow
tool and stages

Automatically
creates
MapReduce
code.


Automated MapReduce Job Generation

Job includes another database on a separate system

Recognizes what processing can run natively in Hadoop
and what requires the DataStage engine to move the data
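One way to picture that decision is as a split of the job into a part that stays in Hadoop and a part that the engine must run. The sketch below is a toy model under stated assumptions, not the product's optimizer: the set `PUSHDOWN_TYPES`, the `input_on` field and the stage names are all hypothetical, and the rule used (push down the leading run of eligible stages) is a deliberate simplification.

```python
# Hypothetical set of stage types that could be expressed as MapReduce.
PUSHDOWN_TYPES = {"filter", "join", "aggregate", "sort"}


def split_job(stages):
    """Split an ordered list of stages into (pushed_down, engine_side).

    A stage is pushed down only while every earlier stage was pushed down,
    its input lives on Hadoop, and its type is in PUSHDOWN_TYPES.
    """
    pushed_down, engine_side = [], []
    still_pushing = True
    for stage in stages:
        eligible = stage["input_on"] == "hadoop" and stage["type"] in PUSHDOWN_TYPES
        if still_pushing and eligible:
            pushed_down.append(stage["name"])
        else:
            still_pushing = False
            engine_side.append(stage["name"])
    return pushed_down, engine_side


# Hypothetical job: two Hadoop-side stages, then a join against a separate database.
job = [
    {"name": "filter_clicks", "type": "filter", "input_on": "hadoop"},
    {"name": "sum_by_user", "type": "aggregate", "input_on": "hadoop"},
    {"name": "lookup_accounts", "type": "join", "input_on": "external_db"},
    {"name": "load_warehouse", "type": "write", "input_on": "engine"},
]
pushed_down, engine_side = split_job(job)
```

In this example the filter and aggregation stay in Hadoop, while the join against the external database and the final load fall to the engine because data has to move between systems.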


Architecture for Warehouse Landing Zone


Use Case Requirements: Data Warehouse Landing Zone
Large Scale: large data volumes, scale out requires open MPP platform
Low Cost: low cost storage, compute and commodity hardware
Many Data Types: un/semi-structured and social datatype coverage
Many Access Patterns: exploratory, iterative and discovery oriented

[Architecture diagram. Sources: clickstream, sensors, transactions, content, all sources. Landing Zone on BigInsights / Hadoop: JAQL, Hive, HBase, Custom MR. Information Server: ETL, Replication, Lineage, Quality. Targets: Analytics Warehouse Zone, Operational Warehouse Zone. Also shown: Guardium, Optim, Masking.]

Combined Workflows for Big Data


Oozie Integration
Same design paradigm for
workflows as for job design.
Directly call an Oozie activity that is
invoking custom MapReduce code.

End-to-end Workflows
Sequence right alongside other
data integration and analytics
activities
Allows users to have the data
sourcing, ETL, Analytics and
delivery of information all controlled
through a single process.
Monitor all stages through
the Operations Console's web-based
interface

Cross Tool Impact Analysis and Traceability

Understand how traditional and big data sources are being used
Assess impact of change and mitigate risks
Show impact on downstream applications and BI reports
Navigate through impacted areas and drill down
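Impact analysis of this kind boils down to walking a lineage graph downstream from the asset that is about to change. The sketch below shows that traversal as a breadth-first search in plain Python; the asset names and the `lineage` dictionary are invented for illustration and do not reflect how Information Server stores its metadata.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets that consume it.
lineage = {
    "hdfs:/logs/clicks": ["job:sessionize"],
    "job:sessionize": ["table:dw.sessions"],
    "table:dw.sessions": ["report:weekly_traffic", "job:churn_model"],
    "job:churn_model": ["report:churn_dashboard"],
}


def downstream(graph, start):
    """Return every asset reachable from start, in breadth-first order."""
    seen = set()
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


impacted = downstream(lineage, "hdfs:/logs/clicks")
```

Here a change to the click log on HDFS reaches two jobs, one warehouse table and two BI reports, which is the list a reviewer would want before approving the change.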

Wrap-up

The IBM Big Data Platform

New analytic applications drive the requirements for a big data platform
Integrate and manage the full variety, velocity and volume of data
Apply advanced analytics to information in its native form
Visualize all available data for ad hoc analysis
Development environment for building new analytic applications
Workload optimization and scheduling
Security and Governance

[Diagram: BIG DATA PLATFORM. Systems Management, Application Development, Discovery; Accelerators; Hadoop System, Stream Computing, Data Warehouse; Information Integration & Governance. Data types underneath: Data, Media, Content, Machine, Social.]

Information Integration & Governance for Big Data


Cleanse and Validate Big Data
Accuracy and Entity Matching with Social Data
De-duplication and Standardization of Machine Data
In-line Cleansing with Integration
Trusted Data Dashboard and Reporting on Data Quality

Integrate & Link Big Data
Big Data as a Source
Big Data as a Target
Data Transformations
Data Movement
Integrate w/existing Enterprise
Data Lineage & Impact Analysis
Metadata Integration w/Analytics
Realtime & Data Federation

Master Big Data
Big Data as a Supplier
Big Data as a Consumer
Links between Big Data and Trusted Golden Records
Leverage Master Data in Big Data Analytics
Entity Resolution at Extreme Scale Out Levels
Probabilistic Entity Matching

Protect Big Data
Activity Monitoring
Data Masking
Data Encryption
On-Demand / In-Place Protection
In-Line Protection (w/ETL etc.)
Active Detection & Alerting

Audit & Archive Big Data
Queryable Archive
Structured and Semi-Structured
Optimized Connectors to existing Apps
Hot-Restorable On-the-Fly
Immutable and Secure Access
Automated Legal Hold Capability for Data Freeze

Where to go to learn more

If you'd like to explore this topic further
Contact your IBM account team or your preferred IBM Partner.

If you'd like to explore more about InfoSphere DataStage and the Information Server platform
http://www-01.ibm.com/software/data/integration/info_server/

If you're looking for an enterprise-level Hadoop distribution
InfoSphere BigInsights: http://www-01.ibm.com/software/data/infosphere/biginsights/

Thanks
