
Berkeley Data Analytics Stack

Prof. Harold Liu


15 December 2014
Data Processing Goals
• Low-latency (interactive) queries on historical data: enable faster decisions
  – E.g., identify why a site is slow and fix it
• Low-latency queries on live data (streaming): enable decisions on real-time data
  – E.g., detect & block worms in real time (a worm may infect 1 million hosts in 1.3 seconds)
• Sophisticated data processing: enable "better" decisions
  – E.g., anomaly detection, trend analysis
Today’s Open Analytics Stack…
• …mostly focused on large on-disk datasets: great for batch but slow
  [Diagram: today's stack layers: Application, Data Processing, Storage, Infrastructure]

Goals
  [Diagram: Batch, Interactive, and Streaming workloads: one stack to rule them all!]
 Easy to combine batch, streaming, and interactive computations
 Easy to develop sophisticated algorithms
 Compatible with existing open source ecosystem (Hadoop/HDFS)
Support Interactive and Streaming Comp.
• Aggressive use of memory
• Why?
  1. Memory transfer rates >> disk or SSDs
  2. Many datasets already fit into memory
     • Inputs of over 90% of jobs in Facebook, Yahoo!, and Bing clusters fit into memory
     • e.g., 1TB = 1 billion records @ 1KB each
  3. Memory density (still) grows with Moore's law
     • RAM/SSD hybrid memories on the horizon
  [Diagram: a high-end datacenter node: 16 cores, 128-512GB RAM at 40-60GB/s, 1-4TB of SSD at 4GB/s (x4 SSDs), 10-30TB of disk at 0.2-1GB/s (x10 disks), 10Gbps network]
Support Interactive and Streaming Comp.
• Increase parallelism
• Why?
  – Reduce work per node  improve latency
• Techniques:
  – Low-latency parallel scheduler that achieves high locality
  – Optimized parallel communication patterns (e.g., shuffle, broadcast)
  – Efficient recovery from failures and straggler mitigation
  [Diagram: the same result computed in time T on few nodes vs. time Tnew (< T) with more parallelism]
Support Interactive and Streaming Comp.
• Trade between result accuracy and response times
• Why?
  – In-memory processing does not guarantee interactive query processing
    • e.g., ~10's of seconds just to scan 512GB of RAM!
    • Gap between memory capacity and transfer rate keeps increasing
  [Diagram: memory capacity (128-512GB) doubles every 18 months, while memory bandwidth (40-60GB/s) doubles every 36 months]
• Challenges:
  – accurately estimate error and running time for…
  – … arbitrary computations
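To make the accuracy/time trade concrete, here is a minimal, hypothetical sketch in Spark (not the BlinkDB implementation): instead of scanning the full dataset, count over a small random sample and scale the result, accepting some error in exchange for a much faster answer. The path and sampling fraction are illustrative.

// Hypothetical sketch: trade accuracy for latency by sampling.
// Count ERROR lines approximately from a 1% sample instead of a full scan.
val logs = sc.textFile("hdfs://...")   // full dataset
val fraction = 0.01
val sample = logs.sample(false, fraction, 42)   // no replacement, fixed seed
val approxErrors = sample.filter(_.contains("ERROR")).count() / fraction
println("Approximate number of ERROR lines: " + approxErrors.toLong)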
Berkeley Data Analytics Stack (BDAS)
New apps: AMP-Genomics, Carat, …
  [Diagram: the BDAS layers mapped onto the generic stack]
• Application
• Data Processing: in-memory processing; trade between time, quality, and cost
• Data Management (Storage): efficient data sharing across frameworks
• Resource Management (Infrastructure): share infrastructure across frameworks (multi-programming for datacenters)
Berkeley AMPLab (Algorithms, Machines, People)
• "Launched" January 2011: 6-year plan
  – 8 CS faculty
  – ~40 students
  – 3 software engineers
• Organized for collaboration
Berkeley AMPLab
• Funding:
  – XData, CISE Expedition Grant
  – Industrial, founding sponsors
  – 18 other sponsors, including …
• Goal: next generation of analytics data stack for industry & research:
  • Berkeley Data Analytics Stack (BDAS)
  • Released as open source
Berkeley Data Analytics Stack (BDAS)
• Existing stack components…
  [Diagram: Data Processing: HIVE, Pig, HBase, Storm, MPI, Hadoop; Data Management: HDFS; Resource Management]
Mesos
• Management platform that allows multiple frameworks to share a cluster
• Compatible with the existing open analytics stack
• Deployed in production at Twitter on 3,500+ servers
  [Diagram: Mesos added as the resource-management layer beneath Hadoop, HIVE, Pig, HBase, Storm, MPI, and HDFS]
Spark
• In-memory framework for interactive and iterative computations
  – Resilient Distributed Dataset (RDD): fault-tolerant, in-memory storage abstraction
• Scala interface, Java and Python APIs
  [Diagram: Spark added to the data-processing layer alongside HIVE, Pig, Storm, MPI, and Hadoop, running over HDFS and Mesos]
Spark Streaming [Alpha Release]
• Large-scale streaming computation
• Ensures exactly-once semantics
• Integrated with Spark  unifies batch, interactive, and streaming computations!
  [Diagram: Spark Streaming added on top of Spark in the data-processing layer]
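As a rough illustration of the streaming model, here is a minimal sketch using the DStream API (the master, host, port, and 1-second batch interval are assumptions): word counts are computed over small batches of text arriving on a TCP socket.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal DStream sketch; master, host, port, and batch interval are illustrative.
val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingSketch")
val ssc = new StreamingContext(conf, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)   // text arriving over TCP
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)
counts.print()                                        // print each batch's counts

ssc.start()             // start receiving and processing
ssc.awaitTermination()  // run until stopped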
Shark  Spark SQL
• HIVE over Spark: SQL-like interface (supports Hive 0.9)
  – up to 100x faster for in-memory data, and 5-10x for on-disk data
• Tested on clusters of hundreds of nodes
  [Diagram: Shark added next to Spark Streaming on top of Spark]
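A minimal sketch of the SQL-like interface from Scala, assuming the Spark SQL HiveContext that Shark evolved into (the table and column names are hypothetical, and sc is an existing SparkContext):

import org.apache.spark.sql.hive.HiveContext

// Hypothetical HiveQL-style query; table and column names are illustrative.
val hiveCtx = new HiveContext(sc)
val slowPages = hiveCtx.sql(
  "SELECT url, AVG(latency_ms) AS avg_latency " +
  "FROM access_logs GROUP BY url ORDER BY avg_latency DESC LIMIT 10")
slowPages.collect().foreach(println)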
Tachyon
• High-throughput, fault-tolerant in-memory storage
• Interface compatible with HDFS
• Support for Spark and Hadoop
  [Diagram: Tachyon added next to HDFS in the data-management layer]
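Because the interface is HDFS-compatible, a Spark job can read and write Tachyon by URI alone; a minimal sketch (the master address, port, and paths are assumptions):

// Illustrative: Spark reads from and writes to Tachyon through its
// HDFS-compatible filesystem interface (addresses and paths are assumed).
val events = sc.textFile("tachyon://tachyon-master:19998/data/events")
val errors = events.filter(_.contains("ERROR"))
errors.saveAsTextFile("tachyon://tachyon-master:19998/data/errors")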
BlinkDB
• Large-scale approximate query engine
• Allows users to specify error or time bounds
• Preliminary prototype being tested at Facebook
  [Diagram: BlinkDB added next to Shark and HIVE in the data-processing layer]
SparkGraph
• GraphLab API and toolkits on top of Spark
• Fault tolerance by leveraging Spark
  [Diagram: SparkGraph added next to Spark Streaming and Shark on top of Spark]
MLlib
• Declarative approach to ML
• Develop scalable ML algorithms
• Make ML accessible to non-experts
  [Diagram: MLbase added next to Spark Streaming, SparkGraph, and Shark on top of Spark]
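A minimal sketch of invoking a scalable ML algorithm from MLlib (k-means clustering; the input path, the one-point-per-line format of space-separated doubles, k = 10, and 20 iterations are all assumptions):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Illustrative MLlib usage: cluster points read from a text file.
val points = sc.textFile("hdfs://...")   // one point per line, space-separated doubles
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

val model = KMeans.train(points, 10, 20)   // k = 10 clusters, 20 iterations
println("Cluster centers: " + model.clusterCenters.mkString(", "))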
Compatible with Open Source Ecosystem
• Support existing interfaces whenever possible
  [Diagram annotations: SparkGraph exposes the GraphLab API; Shark exposes the Hive interface and shell; Tachyon exposes the HDFS API; a compatibility layer lets Hadoop, Storm, MPI, etc. run over Mesos]
Compatible with Open Source Ecosystem
• Use existing interfaces whenever possible
  [Diagram annotations: Spark Streaming accepts inputs from Kafka, Flume, Twitter, TCP sockets, …; Shark supports the Hive API; Spark supports the HDFS API, S3 API, and Hive metadata]
Summary
• Support interactive and streaming computations
  – In-memory, fault-tolerant storage abstraction, low-latency scheduling, …
• Easy to combine batch, streaming, and interactive computations
  – The Spark execution engine supports all three computation models
• Easy to develop sophisticated algorithms
  – Scala interface, APIs for Java, Python, Hive QL, …
  – New frameworks targeted at graph-based and ML algorithms
• Compatible with existing open source ecosystem
• Open source (Apache/BSD) and fully committed to releasing high-quality software
  – Three-person software engineering team led by Matt Massie (creator of Ganglia, 5th Cloudera engineer)
Spark
In-Memory Cluster Computing for
Iterative and Interactive Applications

UC Berkeley
Background
• Commodity clusters have become an important computing
platform for a variety of applications
– In industry: search, machine translation, ad targeting, …
– In research: bioinformatics, NLP, climate simulation, …
• High-level cluster programming models like MapReduce
power many of these apps
• Theme of this work: provide similarly powerful abstractions
for a broader class of applications
Motivation
Current popular programming models for clusters transform data flowing from stable storage to stable storage
e.g., MapReduce:
  [Diagram: Input read by Map tasks, shuffled to Reduce tasks, written to Output]
Motivation
• Acyclic data flow is a powerful abstraction, but is
not efficient for applications that repeatedly reuse a
working set of data:
– Iterative algorithms (many in machine learning)
– Interactive data mining tools (R, Excel, Python)
• Spark makes working sets a first-class concept to
efficiently support these apps
Spark Goal
• Provide distributed memory abstractions for clusters to
support apps with working sets
• Retain the attractive properties of MapReduce:
– Fault tolerance (for crashes & stragglers)
– Data locality
– Scalability

Solution: augment data flow model with


“resilient distributed datasets” (RDDs)
Programming Model
• Resilient distributed datasets (RDDs)
– Immutable collections partitioned across cluster that
can be rebuilt if a partition is lost
– Created by transforming data in stable storage using
data flow operators (map, filter, group-by, …)
– Can be cached across parallel operations
• Parallel operations on RDDs
– Reduce, collect, count, save, …
• Restricted shared variables
– Accumulators, broadcast variables
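The restricted shared variables can be sketched briefly (a hypothetical example; the lookup table, file path, and record format are assumptions): a broadcast variable ships a read-only value once to each worker, and an accumulator lets workers add to a value that only the driver reads.

// Illustrative use of Spark's restricted shared variables.
val countryNames = sc.broadcast(Map("US" -> "United States", "FR" -> "France"))
val badRecords = sc.accumulator(0)   // workers add; the driver reads the result

val resolved = sc.textFile("hdfs://...").map { line =>   // assumed format: "code,..."
  val code = line.split(',')(0)
  if (!countryNames.value.contains(code)) badRecords += 1
  countryNames.value.getOrElse(code, "unknown")
}
resolved.count()   // action triggers the computation (and the accumulator updates)
println("Bad records seen: " + badRecords.value)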
Example: Log Mining
• Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")           // base RDD
errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()                  // cached RDD

cachedMsgs.filter(_.contains("foo")).count     // parallel operation
cachedMsgs.filter(_.contains("bar")).count
. . .

  [Diagram: the driver sends tasks to workers, each of which scans its block of the data and caches the results in memory]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
RDDs in More Detail
• An RDD is an immutable, partitioned, logical collection of records
  – Need not be materialized, but rather contains information to rebuild a dataset from stable storage
• Partitioning can be based on a key in each record (using hash or range partitioning)
• Built using bulk transformations on other RDDs
• Can be cached for future reuse
RDD Operations
• Transformations (define a new RDD): map, filter, sample, union, groupByKey, reduceByKey, join, cache, …
• Parallel operations / actions (return a result to the driver): reduce, collect, count, save, lookupKey, …
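A short sketch chaining a few of these operations (the paths, the "userId,name" / "userId,url" record formats, and the key "42" are assumptions; Spark's actual lookup action is named lookup): transformations are lazy and only describe new RDDs, while the final actions trigger the computation.

// Transformations (lazy): describe new RDDs without computing anything yet.
val users  = sc.textFile("hdfs://...")   // assumed lines of "userId,name"
               .map(l => (l.split(',')(0), l.split(',')(1)))
val visits = sc.textFile("hdfs://...")   // assumed lines of "userId,url"
               .map(l => (l.split(',')(0), l.split(',')(1)))
val joined = users.join(visits).cache()  // (userId, (name, url))

// Actions: trigger the computation and return results to the driver.
println("visits joined to users: " + joined.count())
joined.lookup("42").foreach(println)     // all (name, url) pairs for one key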

RDD Fault Tolerance
• RDDs maintain lineage information that can be used to reconstruct lost partitions
• e.g.:

cachedMsgs = textFile(...).filter(_.contains("error"))
                          .map(_.split('\t')(2))
                          .cache()

  [Lineage: HdfsRDD (path: hdfs://…)  FilteredRDD (func: contains(...))  MappedRDD (func: split(…))  CachedRDD]
Example 1: Logistic Regression
• Goal: find the best line separating two sets of points
  [Diagram: two point sets in the plane, a random initial line, and the target separating line]
Logistic Regression Code

val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
Logistic Regression Performance
  [Chart: running time per iteration. Hadoop: 127 s / iteration. Spark: 174 s for the first iteration (loading the data into cache), 6 s for further iterations]
Example 2: MapReduce
• MapReduce data flow can be expressed using RDD transformations

res = data.flatMap(rec => myMapFunc(rec))
          .groupByKey()
          .map((key, vals) => myReduceFunc(key, vals))

Or with combiners:

res = data.flatMap(rec => myMapFunc(rec))
          .reduceByKey(myCombiner)
          .map((key, val) => myReduceFunc(key, val))
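As a concrete instance of the combiner form (the input path is an assumption), word count maps each record to (word, 1) pairs and uses addition as the combiner:

// Word count expressed with the combiner-based pattern above.
val data = sc.textFile("hdfs://...")   // assumed input path
val res  = data.flatMap(rec => rec.split(" ").map(w => (w, 1)))
               .reduceByKey(_ + _)     // myCombiner = addition
res.take(5).foreach(println)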
Example 3
Other Spark Applications
• Twitter spam classification (Justin Ma)
• EM alg. for traffic prediction (Mobile Millennium)
• K-means clustering
• Alternating Least Squares matrix factorization
• In-memory OLAP aggregation on Hive data
• SQL on Spark (future work)
Conclusion
• By making distributed datasets a first-class primitive,
Spark provides a simple, efficient programming model for
stateful data analytics

• RDDs provide:
– Lineage info for fault recovery and debugging
– Adjustable in-memory caching
– Locality-aware parallel operations

• We plan to make Spark the basis of a suite of batch and interactive data analysis tools
