Goals
[Stack layers: Application, Data Processing, Storage, Infrastructure]
• One stack to rule them all!
– Batch, interactive, and streaming computations
• Increase parallelism
• Why?
– Reduce work per node to improve latency
• Techniques:
– Low-latency parallel scheduler that achieves high locality
– Optimized parallel communication patterns (e.g., shuffle, broadcast)
– Efficient recovery from failures and straggler mitigation
[Figure: the same result computed in time Tnew < T by using more nodes]
Support Interactive and Streaming Comp.
• Trade between result accuracy and response times
• Why?
– In-memory processing does not guarantee interactive query processing
[Note: 128-512 GB of memory per node; capacity doubles every 18 months]
• Data Management (storage layer): efficient data sharing across frameworks
• Resource Management (infrastructure layer): share infrastructure across
frameworks (multi-programming for datacenters)
Berkeley AMPLab
• “Launched” January 2011: 6-year plan
– 8 CS faculty
– ~40 students
– 3 software engineers
• Organized for collaboration around Algorithms, Machines, and People
Berkeley AMPLab
• Funding:
– XData, CISE Expedition Grant
[Stack diagram of the existing open analytics stack; Data Processing: HIVE, Pig, HBase, Storm, MPI, …, Hadoop; Data Management: HDFS; Resource Management: not yet filled]
Mesos
• Management platform that allows multiple frameworks to share a cluster
• Compatible with existing open analytics stack
• Deployed in production at Twitter on 3,500+ servers
[Stack diagram: Mesos added at the Resource Management layer, beneath HDFS and the Data Processing frameworks (HIVE, Pig, HBase, Storm, MPI, Hadoop)]
Spark
• In-memory framework for interactive and iterative
computations
– Resilient Distributed Dataset (RDD): a fault-tolerant, in-memory storage abstraction
• Scala interface, Java and Python APIs
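A toy, single-machine sketch can make the RDD idea concrete. The `MiniRDD` class below is hypothetical (it is not Spark's API): transformations are lazy and only record lineage, an action forces the computation, and `cache()` keeps a computed result in memory so later actions reuse it.

```scala
// Hypothetical single-machine sketch of the RDD idea (not Spark's API):
// transformations are lazy and only record lineage; an action forces
// computation; cache() keeps the result in memory for reuse.
var evaluations = 0  // counts how often the base dataset is materialized

class MiniRDD[A](compute: () => Seq[A]) {
  private var cached: Option[Seq[A]] = None
  private var shouldCache = false
  def map[B](f: A => B): MiniRDD[B] = new MiniRDD(() => force().map(f))
  def filter(p: A => Boolean): MiniRDD[A] = new MiniRDD(() => force().filter(p))
  def cache(): MiniRDD[A] = { shouldCache = true; this }
  def count(): Int = force().length  // an "action": triggers the work
  private def force(): Seq[A] = cached.getOrElse {
    val result = compute()
    if (shouldCache) cached = Some(result)
    result
  }
}

val base  = new MiniRDD(() => { evaluations += 1; (1 to 100).toSeq })
val evens = base.filter(_ % 2 == 0).cache()
println(evens.count())  // first action materializes the lineage
println(evens.count())  // second action hits the in-memory cache
```

Real RDDs additionally partition the data across machines and use the recorded lineage to recompute only lost partitions after a failure.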
[Stack diagram: Spark added at the Data Processing layer, next to Hadoop]
Spark Streaming [Alpha Release]
• Large scale streaming computation
• Ensures exactly-once semantics
• Integration with Spark unifies batch, interactive, and streaming
computations!
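Spark Streaming achieves this unification with a discretized-stream model: the live stream is chopped into small batches, and each batch is processed with ordinary Spark batch code. A plain-Scala sketch of that idea (the names and batch size are illustrative, not the real API):

```scala
// Plain-Scala sketch of the discretized-stream idea behind Spark
// Streaming (illustrative names, not the real API): chop the stream
// into micro-batches and reuse the same batch function on each.
def batchWordCount(lines: Seq[String]): Map[String, Int] =
  lines.flatMap(_.split(" ")).groupBy(identity).map { case (w, ws) => (w, ws.length) }

val liveStream   = Seq("a b a", "b c", "a a", "c c b")  // lines arriving over time
val microBatches = liveStream.grouped(2).toSeq          // batch every 2 lines
val results      = microBatches.map(batchWordCount)     // same code as batch mode
results.foreach(println)
```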
[Stack diagram: Spark Streaming added on top of Spark, alongside HIVE, Pig, Storm, MPI, …]
Shark (Spark SQL)
• HIVE over Spark: SQL-like interface (supports Hive 0.9)
– up to 100x faster for in-memory data, and 5-10x for disk
• In tests on a hundreds-of-nodes cluster at …
[Stack diagram: Shark added on top of Spark, next to Spark Streaming]
Tachyon
• High-throughput, fault-tolerant in-memory storage
• Interface compatible with HDFS
• Support for Spark and Hadoop
[Stack diagram: Tachyon added at the Data Management layer, alongside HDFS]
BlinkDB
• Large scale approximate query engine
• Allows users to specify error or time bounds
• Preliminary prototype being tested at Facebook
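The accuracy/latency trade can be sketched without BlinkDB itself: answer an aggregate query from a random sample instead of a full scan, and report a confidence bound alongside the estimate. Everything below (the data, sample size, and the 95% normal approximation) is an illustrative assumption, not BlinkDB code.

```scala
import scala.util.Random

// Illustrative sketch of approximate querying (not BlinkDB's API):
// estimate a table's mean from a small random sample instead of a
// full scan, with a ~95% error bound from the central limit theorem.
val rng    = new Random(42)
val table  = Seq.tabulate(1000000)(i => (i % 100).toDouble)  // true mean: 49.5
val sample = Seq.fill(10000)(table(rng.nextInt(table.length)))

val estimate = sample.sum / sample.length
val variance = sample.map(v => (v - estimate) * (v - estimate)).sum / (sample.length - 1)
val bound    = 1.96 * math.sqrt(variance / sample.length)  // 95% normal bound
println(f"mean is about $estimate%.2f, give or take $bound%.2f (scanned 1%% of the rows)")
```

Shrinking the sample tightens the response time but widens the bound, which is exactly the knob BlinkDB exposes to users.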
[Stack diagram: BlinkDB added on top of Spark, alongside Shark and HIVE]
SparkGraph
• GraphLab API and Toolkits on top of Spark
• Fault tolerance by leveraging Spark
[Stack diagram: SparkGraph added on top of Spark, alongside Spark Streaming, Shark, and BlinkDB]
MLlib
• Declarative approach to ML
• Develop scalable ML algorithms
• Make ML accessible to non-experts
[Stack diagram: MLbase added on top of Spark, alongside SparkGraph, Spark Streaming, Shark, and BlinkDB]
Compatible with Open Source Ecosystem
• Support existing interfaces whenever possible
[Annotated stack diagram: Spark Streaming accepts inputs from Kafka, Flume, Twitter, TCP sockets, …; Shark supports the Hive API; SparkGraph supports the GraphLab API; Spark and Hadoop support the HDFS API, S3 API, and Hive metadata]
Summary
• Support interactive and streaming computations
– In-memory, fault-tolerant storage abstraction, low-latency
scheduling,...
• Easy to combine batch, streaming, and interactive computations
– Spark execution engine supports all computation models
[Figure: Spark at the center of batch, interactive, and streaming]
– Scala interface, APIs for Java, Python, Hive QL, …
– New frameworks targeted to graph based and ML algorithms
• Compatible with existing open source ecosystem
• Open source (Apache/BSD) and fully committed to releasing high-quality
software
– Three-person software engineering team led by Matt Massie
(creator of Ganglia, 5th Cloudera engineer)
Spark
In-Memory Cluster Computing for
Iterative and Interactive Applications
UC Berkeley
Background
• Commodity clusters have become an important computing
platform for a variety of applications
– In industry: search, machine translation, ad targeting, …
– In research: bioinformatics, NLP, climate simulation, …
• High-level cluster programming models like MapReduce
power many of these apps
• Theme of this work: provide similarly powerful abstractions
for a broader class of applications
Motivation
Current popular programming models for
clusters transform data flowing from stable
storage to stable storage
e.g., MapReduce:
[Figure: map tasks read from stable storage and feed reduce tasks, which write results back to stable storage]
Motivation
• Acyclic data flow is a powerful abstraction, but is
not efficient for applications that repeatedly reuse a
working set of data:
– Iterative algorithms (many in machine learning)
– Interactive data mining tools (R, Excel, Python)
• Spark makes working sets a first-class concept to
efficiently support these apps
Spark Goal
• Provide distributed memory abstractions for clusters to
support apps with working sets
• Retain the attractive properties of MapReduce:
– Fault tolerance (for crashes & stragglers)
– Data locality
– Scalability
Logistic Regression Code
val data = spark.textFile(...).map(readPoint).cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS)
  w -= data.map(p => (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x).reduce(_ + _)
println("Final w: " + w)
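The snippet above needs a Spark deployment; as a sanity check, the same gradient loop runs on plain Scala collections standing in for the cached RDD. The tiny 1-D dataset, the deterministic starting weight, and the fixed iteration count are assumptions for illustration.

```scala
import scala.math.exp

// Local, single-machine version of the slide's logistic regression loop
// (plain Scala collections stand in for the cached RDD; the data is a
// made-up 1-D example with labels y in {-1, +1}).
case class Point(x: Double, y: Double)

val data = Seq(Point(1.0, 1.0), Point(2.0, 1.0), Point(-1.0, -1.0), Point(-2.0, -1.0))
var w = 0.0  // deterministic start instead of Vector.random(D)
for (_ <- 1 to 10) {
  val gradient = data.map { p =>
    (1.0 / (1.0 + exp(-p.y * (w * p.x))) - 1.0) * p.y * p.x
  }.sum
  w -= gradient
}
println("Final w: " + w)  // a positive w separates the two classes
```

Because `data` never changes across iterations, caching it in memory (as `cache()` does in Spark) is what turns each iteration from a disk scan into a fast in-memory pass.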
Logistic Regression Performance
[Performance chart: 127 s / iteration on Hadoop]
• RDDs provide:
– Lineage info for fault recovery and debugging
– Adjustable in-memory caching
– Locality-aware parallel operations