Apache Tez: Accelerating Hadoop Query Processing

Bikas Saha
@bikassaha
Hortonworks Inc. 2013
Page 1
Tez Introduction
Distributed execution framework targeted towards data-processing applications.
Based on expressing a computation as a dataflow graph.
Highly customizable to meet a broad spectrum of use cases.
Built on top of YARN, the resource management framework for Hadoop.
Open source Apache project.
Page 2
Tez Design Themes

Empowering End Users
Execution Performance
Page 3
Tez Empowering End Users

Expressive dataflow definition APIs
Flexible Input-Processor-Output runtime model
Data type agnostic
Simplifying deployment
Page 4

Expressive dataflow definition APIs
Enable definition of complex data flow pipelines using simple graph connection APIs. Tez expands the logical plan at runtime.
Targeted towards data processing applications like Hive/Pig, but not limited to them. Hive/Pig query plans naturally map to Tez dataflow graphs with no translation impedance.
[Diagram: example dataflow graph with tasks TaskA-1, TaskA-2, TaskB-1, TaskB-2, TaskC-1, TaskC-2, TaskD-1, TaskD-2, TaskE-1 and TaskE-2.]
Page 5

[Diagram: Distributed Sort example with a Preprocessor Stage (Sampler, producing Samples and Ranges), a Partition Stage and an Aggregate Stage, each running Task-1 and Task-2.]
Page 6

Flexible Input-Processor-Output runtime model
Construct physical runtime executors dynamically by connecting different inputs, processors and outputs.
End goal is to have a library of inputs, outputs and processors that can be programmatically composed to generate useful tasks.
[Diagram: three example compositions.
Mapper: HDFSInput, MapProcessor, FileSortedOutput
FinalReduce: ShuffleInput, ReduceProcessor, HDFSOutput
IntermediateJoiner: Input1, Input2, JoinProcessor, FileSortedOutput]
Page 7
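The idea of composing a task from an input, a processor and an output can be sketched in a few lines of Java. This is a minimal illustration, not the Tez runtime API: the class IpoSketch and its methods compose and runUpperCaseTask are hypothetical names, and the three roles are modeled with plain java.util.function interfaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical illustration only: these types are NOT the Tez runtime API.
final class IpoSketch {

    /** A task is one input, one processor and one output wired together. */
    static <A, B> Runnable compose(Supplier<List<A>> input,
                                   Function<List<A>, List<B>> processor,
                                   Consumer<List<B>> output) {
        return () -> output.accept(processor.apply(input.get()));
    }

    /** Builds a mapper-like task that upper-cases each record and collects the result. */
    static List<String> runUpperCaseTask(List<String> records) {
        List<String> sink = new ArrayList<>();
        Supplier<List<String>> input = () -> records;          // stands in for an input such as HDFSInput
        Function<List<String>, List<String>> processor = in -> {
            List<String> out = new ArrayList<>();
            for (String r : in) {
                out.add(r.toUpperCase(Locale.ROOT));
            }
            return out;
        };
        Consumer<List<String>> output = sink::addAll;          // stands in for an output such as HDFSOutput
        compose(input, processor, output).run();
        return sink;
    }
}
```

Swapping any one of the three pieces (for example a different output) yields a different task without touching the other two, which is the point of the library approach described on the slide.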

Data type agnostic
Tez is only concerned with the movement of data: files and streams of bytes.
Does not impose any data format on the user application.
An MR application can use key-value pairs on top of Tez. Hive and Pig can use tuple-oriented formats that are natural and native to them.
[Diagram: a Tez Task moves Bytes to and from a File or Stream; User Code interprets them as Key Value pairs or Tuples.]
Page 8

Simplifying deployment
Tez is a completely client side application.
No deployments to do. Simply upload to any accessible FileSystem and change local Tez configuration to point to that.
Enables running different versions concurrently. Easy to test new functionality while keeping stable versions for production.
Leverages YARN local resources.
[Diagram: Tez Lib 1 and Tez Lib 2 stored on HDFS; a TezClient on each Client Machine; TezTasks running under Node Managers.]
Page 9

Expressive dataflow definition APIs
Flexible Input-Processor-Output runtime model
Data type agnostic
Simplifying usage
With great power APIs come great responsibilities.
Tez is a framework on which end user applications can be built.
Page 10
Tez Execution Performance

Performance gains over Map Reduce
Optimal resource management
Plan reconfiguration at runtime
Dynamic physical data flow decisions
Page 11

Performance gains over Map Reduce
Eliminate replicated write barrier between successive computations.
Eliminate job launch overhead of workflow jobs.
Eliminate extra stage of map reads in every workflow job.
Eliminate queue and resource contention suffered by workflow jobs that are started after a predecessor job completes.
[Diagram: Pig/Hive on MR compared with Pig/Hive on Tez.]
Page 12

Plan reconfiguration at runtime
Dynamic runtime concurrency control based on data size, user operator resources, available cluster resources and locality.
Advanced changes in dataflow graph structure.
Progressive graph construction in concert with user optimizer.
[Diagram: Stage 1 reads HDFS blocks with 50 maps and writes 100 partitions; Stage 2 is planned with 100 reducers. At runtime, with only 10 GB of data and the available YARN resources taken into account, Stage 2 is reduced from 100 to 10 reducers.]
Page 13
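The arithmetic behind shrinking a stage from 100 to 10 reducers can be sketched as follows. This is only an illustration: the class ReducerEstimate, the method reducersFor and the target of 1 GiB per reducer are assumptions chosen so that the slide's 10 GB example yields 10 reducers. The slides do not state how Tez actually picks the number.

```java
// Hypothetical sketch: derive runtime reducer parallelism from observed data size.
final class ReducerEstimate {

    /** Returns ceil(totalBytes / bytesPerReducer), clamped to the range [1, plannedReducers]. */
    static int reducersFor(long totalBytes, long bytesPerReducer, int plannedReducers) {
        if (totalBytes < 0 || bytesPerReducer <= 0 || plannedReducers <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and limits positive");
        }
        long needed = (totalBytes + bytesPerReducer - 1) / bytesPerReducer; // ceiling division
        return (int) Math.max(1, Math.min(plannedReducers, needed));
    }

    public static void main(String[] args) {
        long tenGiB = 10L << 30;
        long oneGiB = 1L << 30; // assumed per-reducer target, not a Tez default
        System.out.println(reducersFor(tenGiB, oneGiB, 100));
    }
}
```

The planned value acts as an upper bound, so the plan is only ever narrowed at runtime, never widened beyond the partitions the maps have already written.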

Optimal resource management
Reuse YARN containers to launch new tasks.
Reuse YARN containers to enable shared objects across tasks.
[Diagram: the Tez Application Master receives Task Done and sends Start Task to a TezTask Host running in a YARN container; TezTask1 and TezTask2 run in the same YARN container and use Shared Objects.]
Page 14

Dynamic physical data flow decisions
Decide the type of physical byte movement and storage on the fly.
Store intermediate data on distributed store, local store or in-memory.
Transfer bytes via blocking files or streaming and the spectrum in between.
[Diagram: at runtime, a producer with small output hands data to its consumer in-memory, while another producer uses a local file.]
Page 15
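A runtime decision between in-memory, local and distributed storage could look like the sketch below. Everything here is hypothetical: the class DataFlowChoice, the Store enum, the memory budget and the mustSurviveNodeLoss flag are illustrative assumptions, not Tez settings.

```java
// Hypothetical sketch: pick where a producer's intermediate output is kept.
final class DataFlowChoice {

    enum Store { IN_MEMORY, LOCAL_FILE, DISTRIBUTED_STORE }

    /**
     * Small outputs stay in memory, larger ones go to a local file, and outputs
     * that must outlive the producing node go to the distributed store.
     */
    static Store choose(long outputBytes, long memoryBudgetBytes, boolean mustSurviveNodeLoss) {
        if (outputBytes < 0 || memoryBudgetBytes < 0) {
            throw new IllegalArgumentException("sizes must be non-negative");
        }
        if (mustSurviveNodeLoss) {
            return Store.DISTRIBUTED_STORE;
        }
        return outputBytes <= memoryBudgetBytes ? Store.IN_MEMORY : Store.LOCAL_FILE;
    }
}
```

Because the output size is only known once the producer has run, a rule of this kind has to be evaluated at runtime rather than when the plan is built.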
Tez Deep Dive

DAG API
Runtime API and Event Model
Dynamic Graph Reconfiguration
Tez Session
Page 16
Tez Deep Dive DAG API

Simple DAG definition API
DAG dag = new DAG();
// One vertex per processing step; each vertex names its processor class.
Vertex map1 = new Vertex(MapProcessor.class);
Vertex map2 = new Vertex(MapProcessor.class);
Vertex reduce1 = new Vertex(ReduceProcessor.class);
Vertex reduce2 = new Vertex(ReduceProcessor.class);
Vertex join1 = new Vertex(JoinProcessor.class);

// Edge arguments: producer, consumer, data movement, data source, scheduling, output class, input class.
Edge edge1 = new Edge(map1, reduce1, SCATTER_GATHER, PERSISTED, SEQUENTIAL,
    MOutput.class, RInput.class);
Edge edge2 = new Edge(map2, reduce2, SCATTER_GATHER, PERSISTED, SEQUENTIAL,
    MOutput.class, RInput.class);
// The remaining arguments of edge3 and edge4 are cut off in the source slide.
Edge edge3 = new Edge(reduce1, join1, SCATTER_GATHER, ...
Edge edge4 = new Edge(reduce2, join1, SCATTER_GATHER, ...

dag.addVertex(map1).addVertex(map2)
    .addVertex(reduce1).addVertex(reduce2)
    .addVertex(join1)
    .addEdge(edge1).addEdge(edge2)
    .addEdge(edge3).addEdge(edge4);
[Diagram: map1 feeds reduce1 and map2 feeds reduce2; reduce1 and reduce2 feed join1. Edges are labeled Scatter_Gather, Bipartite, Sequential.]
Page 17

Edge properties define the connection between producer and consumer vertices in the DAG.
Data movement: Defines routing of data between tasks.
One-To-One: Data from the i-th producer task routes to the i-th consumer task.
Broadcast: Data from a producer task routes to all consumer tasks.
Scatter-Gather: Producer tasks scatter data into shards and consumer tasks gather the data. The i-th shard from all producer tasks routes to the i-th consumer task.
Scheduling: Defines when a consumer task is scheduled.
Sequential: Consumer task may be scheduled after a producer task completes.
Concurrent: Consumer task must be co-scheduled with a producer task.
Data source: Defines the lifetime/reliability of a task output.
Persisted: Output will be available after the task exits. Output may be lost later on.
Page 18
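The three data movement types can be expressed as a small routing function. This is a sketch under assumptions: the class EdgeRouting, the DataMovement enum and the method consumersFor are hypothetical names used only to restate the definitions above; they are not the Tez API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the routing rules for the three data movement types.
final class EdgeRouting {

    enum DataMovement { ONE_TO_ONE, BROADCAST, SCATTER_GATHER }

    /** Returns the consumer task indices that receive shard `shard` written by producer task `producer`. */
    static List<Integer> consumersFor(DataMovement type, int producer, int shard, int numConsumers) {
        if (numConsumers <= 0 || producer < 0 || shard < 0) {
            throw new IllegalArgumentException("indices must be non-negative and numConsumers positive");
        }
        List<Integer> targets = new ArrayList<>();
        switch (type) {
            case ONE_TO_ONE:
                // The i-th producer task feeds the i-th consumer task.
                if (producer >= numConsumers) {
                    throw new IllegalArgumentException("one-to-one needs a consumer per producer");
                }
                targets.add(producer);
                break;
            case BROADCAST:
                // Every consumer task receives the producer's data.
                for (int c = 0; c < numConsumers; c++) {
                    targets.add(c);
                }
                break;
            case SCATTER_GATHER:
                // The i-th shard of every producer goes to the i-th consumer task.
                if (shard >= numConsumers) {
                    throw new IllegalArgumentException("shard index exceeds consumer count");
                }
                targets.add(shard);
                break;
        }
        return targets;
    }
}
```

Note that for Scatter-Gather the target depends only on the shard index, not on which producer wrote it, which is what lets each consumer gather its shard from all producers.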

[Diagram: DAG with vertices map1, map2, reduce1, reduce2 and join1.]
Page 19
Tez Deep Dive Runtime API
Page 20
Tez Deep Dive Task Execution

Start task shell with user specified env, resources etc.
Fetch and instantiate Input, Processor, Output objects.
Receive (incremental) input information and process the input.
Provide output information.
Provide control/error events.
[Diagram: the Task Attempt (logical, in the AM) supplies env, cmd line and resources to start a container; the Task Attempt (real, on the machine) runs in the Tez Task JVM, which gets the task, instantiates Input, Processor and Output, and exchanges control/data information, data events and control events.]
Page 21
Tez Deep Dive Runtime Events

Events used to communicate between the tasks and between task and ApplicationMaster (AM).
Data Movement Event used by producer task to inform the consumer task about data location, size etc.
Input Error event sent by task to AM to inform about errors in reading input. AM then takes action by re-generating the input.
Other events to send task completion notification, data statistics and other control plane information.
[Diagram: Map Task 1 and Map Task 2 write Output1, Output2 and Output3 over a Scatter-Gather Edge; the AM Router sits between the map tasks and Reduce Task 2, which reads Output2 as Input1 and Input2; an Error Event is shown going to the AM.]
Page 22

Tez Deep Dive Core Engine

Vertex Manager
Determines task parallelism.
Determines when tasks in a vertex can start.
DAG Scheduler
Determines priority of task.
Task Scheduler
Allocates containers from YARN and assigns them to tasks.
[Diagram: for vertices map1 and reduce1, the Vertex Manager starts the vertex and its tasks, the DAG Scheduler is asked for a priority, and the Task Scheduler is asked for a container.]
Page 25
Tez Automatic Reduce Parallelism

Event Model
Map tasks send data statistics events to the Reduce Vertex Manager.
Vertex Manager
Pluggable user logic that understands the data statistics and can formulate the correct parallelism. Advises vertex controller on parallelism.
[Diagram: in the App Master, the Vertex Manager sits between the Map Vertex and the Vertex State Machine of the Reduce Vertex; label: Cancel Task.]
Page 26

[Diagram, continued: the Map Vertex sends Data Size Statistics to the Vertex Manager, which issues Set Parallelism to the Vertex State Machine; surplus reduce tasks are cancelled (Cancel Task) and data is re-routed (Re-Route).]
Page 28
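Once the reduce parallelism is lowered, the partitions already written by the maps have to be re-routed to the smaller set of reducers. One simple way to do that, shown here purely as an assumption and not as what Tez does internally, is to hand each reducer a contiguous range of the original partitions. The class PartitionRegroup and the method reducerFor are hypothetical names.

```java
// Hypothetical sketch: re-route original partitions to a reduced number of reducers.
final class PartitionRegroup {

    /**
     * Maps an original partition index to the reducer that now reads it.
     * Uses floor(partition * newReducers / numPartitions), which keeps partitions
     * contiguous and spreads them as evenly as possible.
     */
    static int reducerFor(int partition, int numPartitions, int newReducers) {
        if (numPartitions <= 0 || newReducers <= 0 || newReducers > numPartitions) {
            throw new IllegalArgumentException("need 0 < newReducers <= numPartitions");
        }
        if (partition < 0 || partition >= numPartitions) {
            throw new IllegalArgumentException("partition out of range");
        }
        return (int) ((long) partition * newReducers / numPartitions);
    }
}
```

With 100 partitions and 10 reducers, partitions 0 to 9 go to reducer 0, partitions 10 to 19 to reducer 1, and so on up to partitions 90 to 99 for reducer 9.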
Tez Reduce Slow Start/Pre-launch

Event Model
Map completion events sent to the Reduce Vertex Manager.
Vertex Manager
Pluggable user logic that understands the data size. Advises the vertex controller to launch the reducers before all maps have completed so that shuffle can start.
[Diagram: in the App Master, the Vertex Manager sits between the Map Vertex and the Vertex State Machine of the Reduce Vertex.]
Page 29

[Diagram, continued: the Map Vertex reports Task Completed to the Vertex Manager, which issues Start Tasks to the Vertex State Machine to start the Reduce Vertex.]
Page 31
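A slow-start policy of this kind can be sketched as a ramp over the fraction of completed maps. The class SlowStart, the method reducersToLaunch and the two threshold parameters are hypothetical; the slides do not give the actual policy or its settings.

```java
// Hypothetical sketch: decide how many reducers may be launched as maps complete.
final class SlowStart {

    /**
     * Launches no reducers below minFraction of completed maps, all reducers at or
     * above maxFraction, and a linearly growing share in between.
     */
    static int reducersToLaunch(int completedMaps, int totalMaps, int totalReducers,
                                double minFraction, double maxFraction) {
        if (totalMaps <= 0 || totalReducers < 0 || completedMaps < 0 || completedMaps > totalMaps) {
            throw new IllegalArgumentException("invalid task counts");
        }
        if (minFraction < 0.0 || maxFraction > 1.0 || minFraction > maxFraction) {
            throw new IllegalArgumentException("need 0 <= minFraction <= maxFraction <= 1");
        }
        double done = (double) completedMaps / totalMaps;
        if (done >= maxFraction) {
            return totalReducers;
        }
        if (done < minFraction) {
            return 0;
        }
        // Here minFraction <= done < maxFraction, so the denominator is positive.
        double ramp = (done - minFraction) / (maxFraction - minFraction);
        return (int) Math.floor(ramp * totalReducers);
    }
}
```

Launching reducers early lets them begin fetching completed map outputs while the remaining maps are still running, at the cost of holding containers that are partly idle.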
Tez Automatic Map Parallelism

Input vertex manager gets block locations and estimates the number of mappers based on data size, cluster capacity and map data limits. Groups blocks by locality.
Consumer vertex parallelism gets recursively determined through the chain of consumer vertices.
[Diagram: Map Vertex followed by consumer vertices connected through 1-1 Edges.]
Page 32

[Diagram, continued: Get Block Locations is issued to HDFS and Set Parallelism is applied to the Map Vertex and, over the 1-1 Edges, to its consumers.]
Page 34
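The estimate described above (data size, cluster capacity, per-map data limits, grouping by locality) can be sketched as below. The class MapParallelism, its methods and the example limits are assumptions made for illustration; they are not Tez classes or defaults.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: estimate mapper count and group blocks by host.
final class MapParallelism {

    private static long ceilDiv(long a, long b) {
        return (a + b - 1) / b;
    }

    /**
     * Starts from one mapper per available slot, then clamps the data per mapper
     * to [minBytesPerMap, maxBytesPerMap] and derives the mapper count from that.
     */
    static int estimateMappers(long totalBytes, int clusterSlots,
                               long minBytesPerMap, long maxBytesPerMap) {
        if (totalBytes < 0 || clusterSlots <= 0 || minBytesPerMap <= 0 || minBytesPerMap > maxBytesPerMap) {
            throw new IllegalArgumentException("invalid sizes or limits");
        }
        long bytesPerMap = ceilDiv(totalBytes, clusterSlots);
        bytesPerMap = Math.max(minBytesPerMap, Math.min(maxBytesPerMap, bytesPerMap));
        return (int) Math.max(1, ceilDiv(totalBytes, bytesPerMap));
    }

    /** Groups block ids by the host that stores them, so each group can be read locally. */
    static Map<String, List<String>> groupByHost(Map<String, String> blockToHost) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (Map.Entry<String, String> e : blockToHost.entrySet()) {
            groups.computeIfAbsent(e.getValue(), host -> new ArrayList<>()).add(e.getKey());
        }
        return groups;
    }
}
```

With 100 GiB of input, 50 slots and limits of 256 MiB to 1 GiB per mapper, the upper limit applies and the estimate is 100 mappers; with 1 GiB of input the lower limit applies and the estimate is 4.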
Tez - Sessions
Key for interactive queries
Analogous to database sessions and represents a connection between the user and the cluster.
Run multiple DAGs/queries in the same session.
Maintains a pool of reusable containers for low latency execution of tasks within and across queries.
Takes care of data locality and releasing resources when idle.
Session caches in the Application Master and in the container pool reduce re-computation and re-initialization.
[Diagram: the Client starts a session and submits DAGs to the Application Master, whose Task Scheduler manages a Container Pool of pre-warmed JVMs with a Shared Object Registry.]
Page 35
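The latency benefit of a session comes from reusing containers instead of requesting new ones. A minimal sketch of such a pool follows. ContainerPool and its methods are hypothetical names, and containers are represented only by integer ids rather than real YARN containers.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: a pool that prefers idle containers over new allocations.
final class ContainerPool {

    private final Deque<Integer> idle = new ArrayDeque<>();
    private int allocated = 0;

    /** Returns an idle container id if one exists, otherwise allocates a new id. */
    int acquire() {
        Integer reused = idle.pollFirst();
        if (reused != null) {
            return reused;
        }
        return allocated++;
    }

    /** Hands a container back to the pool once its task is done. */
    void release(int containerId) {
        idle.addFirst(containerId);
    }

    /** Number of containers that had to be newly allocated so far. */
    int allocatedCount() {
        return allocated;
    }
}
```

A task that finishes releases its container, and the next task, possibly from a later query in the same session, picks it up without a fresh allocation.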
Tez Now and Next
Page 36
Tez Bridge the Data Spectrum

Broadcast join for small data sets.
Typical pattern in a TPC-DS query.
Based on data size, the query optimizer can run either plan as a single Tez job.
[Diagram: a Fact Table is joined with Dimension Table 1 by a Broadcast Join (Result Table 1), then with Dimension Table 2 by a Broadcast Join (Result Table 2), then with Dimension Table 3 by a Shuffle Join (Result Table 3).]
Page 37
Tez Benchmark Performance
Significant (but not all) speedups due to Tez.
DAG support and runtime graph reconfiguration enable utilizing the parallelism of the cluster.
Tez Session and container reuse enable efficient and low latency execution.
Page 38
Tez Performance Analysis

Tez Session populates container pool.
Dimension table calculation and HDFS split generation in parallel.
Dimension tables broadcast to Hive MapJoin tasks.
Final Reducer pre-launched and fetches completed inputs.
TPC-DS Query-27 with Hive on Tez
Page 39
Tez Current status

Apache Incubator Project
Rapid development. Over 600 jiras opened. Over 400 resolved.
Growing community of contributors and users.
Focus on stability
Testing and quality are highest priority.
Code ready and deployed on multi-node environments.
Support for a vast topology of DAGs
Already functionally equivalent to Map Reduce. Existing Map Reduce jobs can be executed on Tez with few or no changes.
Hive retargeted to use Tez for execution of queries (HIVE-4660).
Work started on Pig to use Tez for execution of scripts (PIG-3446).
Page 40
Tez Roadmap
Richer DAG support
Support for co-scheduling and streaming
Better fault tolerance with checkpoints
Performance optimizations
More efficiencies in transfer of data
Improve session performance
Usability.
Stability and testability
Recovery and history
Tools for performance analysis and debugging
Page 41
Tez Community
Early adopters and code contributors welcome
Adopters to drive more scenarios. Contributors to make them happen.
Hive and Pig communities are on-board and making great progress: HIVE-4660 and PIG-3446.
Tez meetup for developers and users
http://www.meetup.com/Apache-Tez-User-Group
Technical blog series
http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing/ (will soon be available on the Apache Wiki)
Useful links
Work tracking: https://issues.apache.org/jira/browse/TEZ
Code: https://github.com/apache/incubator-tez
Developer list: dev@tez.incubator.apache.org
User list: user@tez.incubator.apache.org
Page 42
Tez Takeaways
Distributed execution framework that works on computations represented as dataflow graphs.
Naturally maps to execution plans produced by query optimizers.
Customizable execution architecture designed to enable dynamic performance optimizations at runtime.
Works out of the box with the platform figuring out the hard stuff.
Span the spectrum of interactive latency to batch.
Open source Apache project: your use-cases and code are welcome.
Page 43
Tez
Thanks for your time and attention!
Questions?
@bikassaha
Page 44
