James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman,
Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh,
Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura,
David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak,
Christopher Taylor, Ruth Wang, Dale Woodford
Google, Inc.
Abstract
Spanner is Google's scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This paper describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: non-blocking reads in the past, lock-free read-only transactions, and atomic schema changes, across all of Spanner.
Introduction
Spanner is a scalable, globally-distributed database designed, built, and deployed at Google. At the highest level of abstraction, it is a database that shards data
across many sets of Paxos [21] state machines in datacenters spread all over the world. Replication is used for
global availability and geographic locality; clients automatically failover between replicas. Spanner automatically reshards data across machines as the amount of data
or the number of servers changes, and it automatically
migrates data across machines (even across datacenters)
to balance load and in response to failures. Spanner is
designed to scale up to millions of machines across hundreds of datacenters and trillions of database rows.
Applications can use Spanner for high availability,
even in the face of wide-area natural disasters, by replicating their data within or even across continents. Our
initial customer was F1 [35], a rewrite of Google's advertising backend. F1 uses five replicas spread across
the United States. Most other applications will probably
replicate their data across 3 to 5 datacenters in one geographic region, but with relatively independent failure
modes. That is, most applications will choose lower latency over higher availability, as long as they can survive 1 or 2 datacenter failures.
Implementation
This section describes the structure of and rationale underlying Spanner's implementation. It then describes the
directory abstraction, which is used to manage replication and locality, and is the unit of data movement. Finally, it describes our data model, why Spanner looks
like a relational database instead of a key-value store, and
how applications can control data locality.
A Spanner deployment is called a universe. Given
that Spanner manages data globally, there will be only
a handful of running universes. We currently run a
test/playground universe, a development/production universe, and a production-only universe.
Spanner is organized as a set of zones, where each
zone is the rough analog of a deployment of Bigtable servers.
2.3
Data Model
Many applications at Google use Megastore because its data model is simpler to manage than Bigtable's, and because of its support for synchronous replication across datacenters. (Bigtable only
supports eventually-consistent replication across datacenters.) Examples of well-known Google applications
that use Megastore are Gmail, Picasa, Calendar, Android
Market, and AppEngine. The need to support a SQL-like query language in Spanner was also clear, given the popularity of Dremel [28] as an interactive data-analysis tool. Finally, the lack of cross-row transactions
in Bigtable led to frequent complaints; Percolator [32]
was in part built to address this failing. Some authors
have claimed that general two-phase commit is too expensive to support, because of the performance or availability problems that it brings [9, 10, 19]. We believe it
is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack
of transactions. Running two-phase commit over Paxos
mitigates the availability problems.
The application data model is layered on top of the
directory-bucketed key-value mappings supported by the
implementation. An application creates one or more
databases in a universe. Each database can contain an
unlimited number of schematized tables. Tables look
like relational-database tables, with rows, columns, and
versioned values. We will not go into detail about the
query language for Spanner. It looks like SQL with some
extensions to support protocol-buffer-valued fields.
Spanner's data model is not purely relational, in that
rows must have names. More precisely, every table is required to have an ordered set of one or more primary-key
columns. This requirement is where Spanner still looks
like a key-value store: the primary keys form the name
for a row, and each table defines a mapping from the
primary-key columns to the non-primary-key columns.
A row has existence only if some value (even if it is
NULL) is defined for the row's keys. Imposing this structure is useful because it lets applications control data locality through their choices of keys.
Figure 4 contains an example Spanner schema for storing photo metadata on a per-user, per-album basis. The
schema language is similar to Megastore's, with the additional requirement that every Spanner database must
be partitioned by clients into one or more hierarchies
of tables. Client applications declare the hierarchies in
database schemas via the INTERLEAVE IN declarations. The table at the top of a hierarchy is a directory
table. Each row in a directory table with key K, together
with all of the rows in descendant tables that start with K
in lexicographic order, forms a directory. ON DELETE
CASCADE says that deleting a row in the directory table
deletes any associated child rows. The figure also illustrates the interleaved layout for the example database.
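The interleaving described above can be pictured as a key-encoding scheme: rows of descendant tables carry their ancestor's primary key as a prefix, so a directory-table row and all rows beneath it sort together and form one directory. The following Python sketch is purely illustrative; the table names (Users, Albums) and the tuple-based key encoding are hypothetical stand-ins consistent with the per-user, per-album example, not Spanner's actual storage format.

```python
# Illustrative sketch of interleaved, directory-bucketed keys.
# Users is the directory table; Albums is declared INTERLEAVE IN Users.
# Keys are tuples, so Python's sort order mimics lexicographic order.

rows = {
    ("Users", 1): {"name": "alice"},
    ("Users", 1, "Albums", 7): {"title": "hiking"},
    ("Users", 1, "Albums", 9): {"title": "beach"},
    ("Users", 2): {"name": "bob"},
    ("Users", 2, "Albums", 3): {"title": "city"},
}

def directory(user_id):
    """All rows whose keys start with the directory-table row's key:
    this contiguous group is the unit of data movement (Section 2)."""
    prefix = ("Users", user_id)
    return {k: v for k, v in sorted(rows.items()) if k[:2] == prefix}

print(directory(1))  # the Users(1) row plus both of its Albums rows
```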
TrueTime
Method       | Returns
TT.now()     | TTinterval: [earliest, latest]
TT.after(t)  | true if t has definitely passed
TT.before(t) | true if t has definitely not arrived

Table 1: TrueTime API.
Denote the absolute time of an event e by the function t_abs(e). In more formal terms, TrueTime guarantees that for an invocation tt = TT.now(), tt.earliest ≤ t_abs(e_now) ≤ tt.latest, where e_now is the invocation event.
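A minimal sketch of this interface in Python, under the simplifying assumption of a single symmetric uncertainty bound (the real bound is derived from the time masters as described below):

```python
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float
    latest: float

class TrueTime:
    """Toy model of the TrueTime API. epsilon here is a made-up constant,
    not the bound Spanner derives from GPS and atomic-clock masters."""
    def __init__(self, epsilon_s: float = 0.004):
        self.epsilon = epsilon_s

    def now(self) -> TTInterval:
        t = time.time()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t: float) -> bool:
        # True only if t has definitely passed.
        return self.now().earliest > t

    def before(self, t: float) -> bool:
        # True only if t has definitely not arrived yet.
        return self.now().latest < t
```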
The underlying time references used by TrueTime
are GPS and atomic clocks. TrueTime uses two forms
of time reference because they have different failure
modes. GPS reference-source vulnerabilities include antenna and receiver failures, local radio interference, correlated failures (e.g., design faults such as incorrect leap-second handling and spoofing), and GPS system outages.
Atomic clocks can fail in ways uncorrelated to GPS and
each other, and over long periods of time can drift significantly due to frequency error.
TrueTime is implemented by a set of time master machines per datacenter and a timeslave daemon per machine. The majority of masters have GPS receivers with
dedicated antennas; these masters are separated physically to reduce the effects of antenna failures, radio interference, and spoofing. The remaining masters (which
we refer to as Armageddon masters) are equipped with
atomic clocks. An atomic clock is not that expensive:
the cost of an Armageddon master is of the same order
as that of a GPS master. All masters' time references
are regularly compared against each other. Each master also cross-checks the rate at which its reference advances time against its own local clock, and evicts itself
if there is substantial divergence. Between synchronizations, Armageddon masters advertise a slowly increasing
time uncertainty that is derived from conservatively applied worst-case clock drift. GPS masters advertise uncertainty that is typically close to zero.
Every daemon polls a variety of masters [29] to reduce vulnerability to errors from any one master. Some are GPS masters chosen from nearby datacenters; the rest are GPS masters from farther datacenters, as well as some Armageddon masters. Daemons apply a variant of Marzullo's algorithm [27] to detect and reject liars, and synchronize the local machine clocks to the non-liars. To protect against broken local clocks, machines that exhibit frequency excursions larger than the worst-case bound derived from component specifications and operating environment are evicted.
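As a rough illustration of the interval-intersection idea (a sketch only, not the exact variant Spanner's daemons use), the function below finds the point contained in the largest number of advertised [earliest, latest] intervals; sources whose intervals exclude that point would be treated as liars. The function name and example values are invented for illustration.

```python
def best_interval(intervals):
    """Marzullo-style sweep: return (lo, hi, count) for the sub-interval
    covered by the most sources. `intervals` is a list of
    (earliest, latest) pairs reported by the time masters."""
    events = []
    for lo, hi in intervals:
        events.append((lo, -1))   # interval opens
        events.append((hi, +1))   # interval closes
    events.sort()
    best, count, best_lo, best_hi = 0, 0, None, None
    for i, (t, typ) in enumerate(events):
        if typ == -1:
            count += 1
            if count > best:
                best, best_lo = count, t
                # covered until the next event boundary
                best_hi = events[i + 1][0] if i + 1 < len(events) else t
        else:
            count -= 1
    return best_lo, best_hi, best

# Example: two agreeing masters and one "liar".
print(best_interval([(10.0, 10.4), (10.1, 10.5), (12.0, 12.2)]))
```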
Between synchronizations, a daemon advertises a slowly increasing time uncertainty ε. ε is derived from conservatively applied worst-case local clock drift. ε also depends on time-master uncertainty and communication delay to the time masters. In our production environment, ε is typically a sawtooth function of time, varying from about 1 to 7 ms over each poll interval; ε therefore averages about 4 ms most of the time. The daemon's poll interval is currently 30 seconds, and the current applied drift rate is set at 200 microseconds/second, which together account for the sawtooth bounds from 0 to 6 ms. The remaining 1 ms comes from the communication delay to the time masters. Excursions from this sawtooth are possible in the presence of failures. For example, occasional time-master unavailability can cause datacenter-wide increases in ε. Similarly, overloaded machines and network links can result in occasional localized spikes.

Operation | Concurrency Control | Replica Required | Timestamp Discussion
Read-Write Transaction | pessimistic | leader | 4.1.2
Read-Only Transaction | lock-free | leader for timestamp; any for read, subject to 4.1.3 | 4.1.4
Snapshot Read (client-provided timestamp) | lock-free | any, subject to 4.1.3 | —
Snapshot Read (client-provided bound) | lock-free | any, subject to 4.1.3 | 4.1.3

Table 2: Types of reads and writes in Spanner, and how they compare.
Concurrency Control
This section describes how TrueTime is used to guarantee the correctness properties around concurrency control, and how those properties are used to implement
features such as externally consistent transactions, lock-free read-only transactions, and non-blocking reads in
the past. These features enable, for example, the guarantee that a whole-database audit read at a timestamp t
will see exactly the effects of every transaction that has
committed as of t.
Going forward, it will be important to distinguish
writes as seen by Paxos (which we will refer to as Paxos
writes unless the context is clear) from Spanner client
writes. For example, two-phase commit generates a
Paxos write for the prepare phase that has no corresponding Spanner client write.
4.1
Timestamp Management
Table 2 lists the types of operations that Spanner supports. The Spanner implementation supports read-write transactions, read-only transactions (predeclared
snapshot-isolation transactions), and snapshot reads.
Standalone writes are implemented as read-write transactions; non-snapshot standalone reads are implemented
as read-only transactions. Both are internally retried
(clients need not write their own retry loops).
A read-only transaction is a kind of transaction that
has the performance benefits of snapshot isolation [6].
A read-only transaction must be predeclared as not having any writes; it is not simply a read-write transaction
without any writes. Reads in a read-only transaction execute at a system-chosen timestamp without locking, so
that incoming writes are not blocked. The execution of the reads in a read-only transaction can proceed on any replica that is sufficiently up-to-date.

4.1.2
Assigning Timestamps to RW Transactions
Transactional reads and writes use two-phase locking. As a result, they can be assigned timestamps at any time when all locks have been acquired, but before any locks
have been released. For a given transaction, Spanner assigns it the timestamp that Paxos assigns to the Paxos
write that represents the transaction commit.
Spanner depends on the following monotonicity invariant: within each Paxos group, Spanner assigns timestamps to Paxos writes in monotonically increasing order, even across leaders. A single leader replica can trivially assign timestamps in monotonically increasing order. This invariant is enforced across leaders by making
use of the disjointness invariant: a leader must only assign timestamps within the interval of its leader lease.
Note that whenever a timestamp s is assigned, s_max is advanced to s to preserve disjointness.
Spanner also enforces the following external-consistency invariant: if the start of a transaction T_2 occurs after the commit of a transaction T_1, then the commit timestamp of T_2 must be greater than the commit timestamp of T_1. Define the start and commit events for a transaction T_i by e_i^start and e_i^commit, and the commit timestamp of a transaction T_i by s_i. The invariant becomes: t_abs(e_1^commit) < t_abs(e_2^start) ⇒ s_1 < s_2.
The protocol for executing transactions and assigning
timestamps obeys two rules, which together guarantee
this invariant, as shown below. Define the arrival event of the commit request at the coordinator leader for a write T_i to be e_i^server.

Start. The coordinator leader for a write T_i assigns a commit timestamp s_i no less than the value of TT.now().latest, computed after e_i^server. Note that the participant leaders do not matter here; Section 4.2.1 describes how they are involved in the implementation of the next rule.
Commit Wait. The coordinator leader ensures that clients cannot see any data committed by T_i until TT.after(s_i) is true. Commit wait ensures that s_i is less than the absolute commit time of T_i, i.e., s_i < t_abs(e_i^commit). The implementation of commit wait is described in Section 4.2.1. Proof:

  s_1 < t_abs(e_1^commit)                (commit wait)
  t_abs(e_1^commit) < t_abs(e_2^start)   (assumption)
  t_abs(e_2^start) ≤ t_abs(e_2^server)   (causality)
  t_abs(e_2^server) ≤ s_2                (start)
  s_1 < s_2                              (transitivity)
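The two rules amount to a small amount of coordinator-side logic. A hedged sketch using a TrueTime-like object as in Section 3; `apply_writes` is a hypothetical placeholder for the Paxos write of the commit record, and lock handling is elided:

```python
import time

def commit_rw_transaction(tt, apply_writes):
    """Sketch of the Start and Commit Wait rules at a coordinator leader.
    `tt` is any object exposing now().latest and after(t) as in Section 3."""
    # Start: pick s no less than TT.now().latest, computed after the
    # commit request arrives at the coordinator leader.
    s = tt.now().latest
    apply_writes(s)
    # Commit Wait: do not let clients see the data (or release locks)
    # until s is guaranteed to be in the past, i.e. TT.after(s) holds.
    while not tt.after(s):
        time.sleep(0.001)
    return s  # any transaction that starts after this point sees timestamp s
```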
4.1.3
Serving Reads at a Timestamp
The monotonicity invariant described in Section 4.1.2 allows Spanner to correctly determine whether a replica's state is sufficiently up-to-date to satisfy a read. Every replica tracks a value called safe time t_safe, which is the maximum timestamp at which a replica is up-to-date.
4.2
Details
4.2.1
Read-Write Transactions
4.2.2
Read-Only Transactions
Assigning a timestamp requires a negotiation phase between all of the Paxos groups that are involved in the
reads. As a result, Spanner requires a scope expression
for every read-only transaction, which is an expression
that summarizes the keys that will be read by the entire
transaction. Spanner automatically infers the scope for
standalone queries.
If the scope's values are served by a single Paxos group, then the client issues the read-only transaction to that group's leader. (The current Spanner implementation only chooses a timestamp for a read-only transaction at a Paxos leader.) That leader assigns s_read and executes the read. For a single-site read, Spanner generally does better than TT.now().latest. Define LastTS() to be the timestamp of the last committed write at a Paxos group. If there are no prepared transactions, the assignment s_read = LastTS() trivially satisfies external consistency: the transaction will see the result of the last write, and therefore be ordered after it.
If the scope's values are served by multiple Paxos groups, there are several options. The most complicated option is to do a round of communication with all of the groups' leaders to negotiate s_read based on LastTS(). Spanner currently implements a simpler choice. The client avoids a negotiation round, and just has its reads execute at s_read = TT.now().latest (which may wait for safe time to advance). All reads in the transaction can be sent to replicas that are sufficiently up-to-date.
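The timestamp choice described above reduces to a short decision procedure. A sketch under stated assumptions: the `groups` and `leader` objects and the `last_committed_ts` / `has_prepared` helpers are invented names standing in for LastTS() and the prepared-transaction check.

```python
def choose_sread(tt, groups):
    """Sketch of read-only-transaction timestamp assignment.
    `groups` are the Paxos groups covered by the transaction's scope."""
    if len(groups) == 1:
        leader = groups[0].leader
        if not leader.has_prepared():
            # Single group, no prepared transactions: LastTS() suffices
            # and avoids waiting out the TrueTime uncertainty.
            return leader.last_committed_ts()
    # Multiple groups (or prepared transactions): skip the negotiation
    # round and read at TT.now().latest; the read may then have to wait
    # for each replica's safe time to advance past this timestamp.
    return tt.now().latest
```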
4.2.3
Schema-Change Transactions
replicas | write latency (ms) | read-only transaction latency (ms) | snapshot read latency (ms) | write throughput (Kops/sec) | read-only transaction throughput (Kops/sec) | snapshot read throughput (Kops/sec)
1D | 9.4±.6   | —       | —       | 4.0±.3  | —        | —
1  | 14.4±1.0 | 1.4±.1  | 1.3±.1  | 4.1±.05 | 10.9±.4  | 13.5±.1
3  | 13.9±.6  | 1.3±.1  | 1.2±.1  | 2.2±.5  | 13.8±3.2 | 38.5±.3
5  | 14.4±.4  | 1.4±.05 | 1.3±.04 | 2.8±.3  | 25.3±5.2 | 50.0±1.1

Table 3: Operation microbenchmarks. Mean and standard deviation over 10 runs. 1D means one replica with commit wait disabled.
4.2.4
Refinements
t_safe^TM as defined above has a weakness, in that a single prepared transaction prevents t_safe from advancing. As a result, no reads can occur at later timestamps, even if the reads do not conflict with the transaction. Such false conflicts can be removed by augmenting t_safe^TM with a fine-grained mapping from key ranges to prepared-transaction timestamps. This information can be stored in the lock table, which already maps key ranges to lock metadata. When a read arrives, it only needs to be checked against the fine-grained safe time for key ranges with which the read conflicts.
LastTS() as defined above has a similar weakness: if a transaction has just committed, a non-conflicting read-only transaction must still be assigned s_read so as to follow that transaction. As a result, the execution of the read could be delayed. This weakness can be remedied similarly by augmenting LastTS() with a fine-grained mapping from key ranges to commit timestamps in the lock table. (We have not yet implemented this optimization.) When a read-only transaction arrives, its timestamp can be assigned by taking the maximum value of LastTS() for the key ranges with which the transaction conflicts, unless there is a conflicting prepared transaction (which can be determined from fine-grained safe time).
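One way to picture the fine-grained refinement is a lock-table map from key ranges to prepared-transaction timestamps; a read only has to respect entries for ranges it actually touches. A hedged sketch, with an invented half-open range representation and helper names:

```python
def fine_grained_tm_safe(lock_table, read_ranges, coarse_paxos_safe):
    """Sketch: per-key-range safe time instead of one global t_safe^TM.
    `lock_table` maps half-open key ranges (lo, hi) to the timestamps of
    prepared but not yet committed transactions; `read_ranges` are the
    ranges the incoming read touches."""
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    conflicting = [ts for rng, ts in lock_table.items()
                   for r in read_ranges if overlaps(rng, r)]
    if not conflicting:
        return coarse_paxos_safe            # no false conflicts for this read
    # A read must not observe a prepared-but-uncommitted transaction, so it
    # is only safe below the smallest conflicting prepare timestamp.
    return min(min(conflicting) - 1, coarse_paxos_safe)
```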
t_safe^Paxos as defined above has a weakness in that it cannot advance in the absence of Paxos writes. That is, a snapshot read at t cannot execute at Paxos groups whose last write happened before t. Spanner addresses this problem by taking advantage of the disjointness of leader-lease intervals. Each Paxos leader advances t_safe^Paxos by keeping a threshold above which future writes' timestamps will occur: it maintains a mapping MinNextTS(n) from Paxos sequence number n to the minimum timestamp that may be assigned to Paxos sequence number n + 1. A replica can advance t_safe^Paxos to MinNextTS(n) − 1 when it has applied through n.
A single leader can enforce its MinNextTS() promises easily. Because the timestamps promised by MinNextTS() lie within a leader's lease, the disjointness invariant enforces MinNextTS() promises across leaders. If a leader wishes to advance MinNextTS() beyond the end of its leader lease, it must first extend its lease.
Evaluation
5.1
Microbenchmarks
participants | mean latency (ms) | 99th percentile latency (ms)
1    | 17.0 ±1.4   | 75.0 ±34.9
2    | 24.5 ±2.5   | 87.6 ±35.9
5    | 31.5 ±6.2   | 104.5 ±52.2
10   | 30.0 ±3.7   | 95.6 ±25.4
25   | 35.5 ±5.6   | 100.4 ±42.7
50   | 42.7 ±4.1   | 93.7 ±22.9
100  | 71.4 ±7.6   | 131.2 ±17.6
200  | 150.5 ±11.0 | 320.3 ±35.1

Table 4: Two-phase commit scalability. Mean and standard deviations over 10 runs.
5.2
Availability

[Figure 5: effect of killing servers on throughput (0 to 1.4M) over time in seconds (0–20), for the non-leader, leader-soft, and leader-hard kill cases.]
5.3
TrueTime
Two questions must be answered with respect to TrueTime: is ε truly a bound on clock uncertainty, and how bad does ε get? For the former, the most serious problem would be if a local clock's drift were greater than 200 μs/sec: that would break assumptions made by TrueTime. Our machine statistics show that bad CPUs are 6 times more likely than bad clocks. That is, clock issues are extremely infrequent, relative to much more serious hardware problems. As a result, we believe that TrueTime's implementation is as trustworthy as any other piece of software upon which Spanner depends.
Figure 6 presents TrueTime data taken at several thousand spanserver machines across datacenters up to 2200 km apart.
[Figure 6: distribution of TrueTime ε values (in ms), sampled right after the timeslave daemon polls the time masters; the 90th, 99th, and 99.9th percentiles are graphed, by date (Mar 29 – Apr 1) and by hour (6AM – 12PM).]

# fragments | # directories
1       | >100M
2–4     | 341
5–9     | 5336
10–14   | 232
15–99   | 34
100–500 | 7

Table 5: Distribution of directory-fragment counts in F1.
5.4
F1
The F1 team had stored some data in external Bigtables, which compromised transactional behavior and the ability to query across all data.
The F1 team chose to use Spanner for several reasons. First, Spanner removes the need to manually reshard. Second, Spanner provides synchronous replication and automatic failover. With MySQL master-slave
replication, failover was difficult, and risked data loss
and downtime. Third, F1 requires strong transactional
semantics, which made using other NoSQL systems impractical. Application semantics requires transactions
across arbitrary data, and consistent reads. The F1 team
also needed secondary indexes on their data (since Spanner does not yet provide automatic support for secondary
indexes), and was able to implement their own consistent
global indexes using Spanner transactions.
All application writes are now by default sent through
F1 to Spanner, instead of the MySQL-based application
stack. F1 has 2 replicas on the west coast of the US, and
3 on the east coast. This choice of replica sites was made
to cope with outages due to potential major natural disasters, and also to be near their frontend sites. Anecdotally, Spanner's automatic failover has been nearly invisible to them. Although there have been unplanned cluster failures in the last few months, the most that the F1 team has had to do is update their database's schema to tell
Spanner where to preferentially place Paxos leaders, so
as to keep them close to where their frontends moved.
Spanner's timestamp semantics made it efficient for
F1 to maintain in-memory data structures computed from
the database state. F1 maintains a logical history log of
all changes, which is written into Spanner itself as part
of every transaction. F1 takes full snapshots of data at a
timestamp to initialize its data structures, and then reads
incremental changes to update them.
Table 5 illustrates the distribution of the number of
fragments per directory in F1. Each directory typically
corresponds to a customer in the application stack above
F1. The vast majority of directories (and therefore customers) consist of only 1 fragment, which means that
reads and writes to those customers' data are guaranteed to occur on only a single server. The directories with more than 100 fragments are all tables that contain F1 secondary indexes: writes to more than a few fragments of such tables are extremely rare.
operation          | latency mean (ms) | latency std dev (ms) | count
all reads          | 8.7               | 376.4                | 21.5B
single-site commit | 72.3              | 112.8                | 31.2M
multi-site commit  | 103.0             | 52.2                 | 32.1M

Table 6: F1-perceived operation latencies.
Related Work
The notion of layering transactions on top of a replicated store dates at least as far back as Gifford's dissertation [16]. Scatter [17] is a recent DHT-based key-value store that layers transactions on top of consistent replication. Spanner focuses on providing a higher-level interface than Scatter does. Gray and Lamport [18] describe a non-blocking commit protocol based on Paxos. Their protocol incurs more messaging costs than two-phase commit, which would aggravate the cost of commit over widely distributed groups. Walter [36] provides
a variant of snapshot isolation that works within, but not
across datacenters. In contrast, our read-only transactions provide a more natural semantics, because we support external consistency over all operations.
There has been a spate of recent work on reducing
or eliminating locking overheads. Calvin [40] eliminates concurrency control: it pre-assigns timestamps and
then executes the transactions in timestamp order. H-Store [39] and Granola [11] each supported their own
classification of transaction types, some of which could
avoid locking. None of these systems provides external
consistency. Spanner addresses the contention issue by
providing support for snapshot isolation.
VoltDB [42] is a sharded in-memory database that
supports master-slave replication over the wide area for
disaster recovery, but not more general replication configurations. It is an example of what has been called
NewSQL, which is a marketplace push to support scalable SQL [38]. A number of commercial databases implement reads in the past, such as MarkLogic [26] and Oracle's Total Recall [30]. Lomet and Li [24] describe an
implementation strategy for such a temporal database.
Farsite derived bounds on clock uncertainty (much looser than TrueTime's) relative to a trusted clock reference [13]: server leases in Farsite were maintained in the same way that Spanner maintains Paxos leases. Loosely synchronized clocks have been used for concurrency-control purposes in prior work [2, 23]. We have shown
that TrueTime lets one reason about global time across
sets of Paxos state machines.
Future Work
Conclusions
Acknowledgements
Many people have helped to improve this paper: our
shepherd Jon Howell, who went above and beyond
his responsibilities; the anonymous referees; and many
Googlers: Atul Adya, Fay Chang, Frank Dabek, Sean
Dorward, Bob Gruber, David Held, Nick Kline, Alex
Thomson, and Joel Wein. Our management has been
very supportive of both our work and of publishing this
paper: Aristotle Balogh, Bill Coughran, Urs Hölzle, Doron Meyer, Cos Nicolaou, Kathy Polizzi, Sridhar Ramaswamy, and Shivakumar Venkataraman.
We have built upon the work of the Bigtable and
Megastore teams. The F1 team, and Jeff Shute in particular, worked closely with us in developing our data model
and helped immensely in tracking down performance and
correctness bugs. The Platforms team, and Luiz Barroso
and Bob Felderman in particular, helped to make TrueTime happen. Finally, a lot of Googlers used to be on our
team: Ken Ashcraft, Paul Cychosz, Krzysztof Ostrowski,
Amir Voskoboynik, Matthew Weaver, Theo Vassilakis,
and Eric Veach; or have joined our team recently: Nathan
Bales, Adam Beberg, Vadim Borisov, Ken Chen, Brian
Cooper, Cian Cullinan, Robert-Jan Huijsman, Milind
Joshi, Andrey Khorlin, Dawid Kuroczko, Laramie Leavitt, Eric Li, Mike Mammarella, Sunil Mushran, Simon
Nielsen, Ovidiu Platon, Ananth Shrinivas, Vadim Suvorov, and Marcel van der Holst.
References
[1]
[2]
[3]
[4] Michael Armbrust et al. PIQL: Success-Tolerant Query Processing in the Cloud. Proc. of VLDB. 2011, pp. 181–192.
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13] John Douceur and Jon Howell. Scalable Byzantine-Fault-Quantifying Clock Synchronization. Tech. rep. MSR-TR-2003-67. MS Research, 2003.
[14]
[15]
[16] David K. Gifford. Information Storage in a Decentralized Computer System. Tech. rep. CSL-81-8. PhD dissertation. Xerox PARC, July 1982.
[17]
[18] Jim Gray and Leslie Lamport. Consensus on transaction commit. ACM TODS 31.1 (Mar. 2006), pp. 133–160.
[19] Pat Helland. Life beyond Distributed Transactions: an Apostate's Opinion. Proc. of CIDR. 2007, pp. 132–141.
[20]
[21]
[22] Leslie Lamport, Dahlia Malkhi, and Lidong Zhou. Reconfiguring a state machine. SIGACT News 41.1 (Mar. 2010), pp. 63–73.
[23] Barbara Liskov. Practical uses of synchronized clocks in distributed systems. Distrib. Comput. 6.4 (July 1993), pp. 211–219.
[24]
[25]
[26]
[27]
[28] Sergey Melnik et al. Dremel: Interactive Analysis of Web-Scale Datasets. Proc. of VLDB. 2010, pp. 330–339.
[29]
[30]
[31]
[32] Daniel Peng and Frank Dabek. Large-scale incremental processing using distributed transactions and notifications. Proc. of OSDI. 2010, pp. 1–15.
[33]
[34] Alexander Shraer et al. Dynamic Reconfiguration of Primary/Backup Clusters. Proc. of USENIX ATC. 2012, pp. 425–438.
[35]
[36]
[37]
[38]
[39]
[40] Alexander Thomson et al. Calvin: Fast Distributed Transactions for Partitioned Database Systems. Proc. of SIGMOD. 2012, pp. 1–12.
[41] Ashish Thusoo et al. Hive — A Petabyte Scale Data Warehouse Using Hadoop. Proc. of ICDE. 2010, pp. 996–1005.
[42]
The simplest means to ensure the disjointness of Paxos-leader-lease intervals would be for a leader to issue a synchronous Paxos write of the lease interval, whenever it would be extended. A subsequent leader would read the interval and wait until that interval has passed.
TrueTime can be used to ensure disjointness without these extra log writes. The potential i-th leader keeps a lower bound on the start of a lease vote from replica r as v_{i,r}^leader = TT.now().earliest, computed before e_{i,r}^send (defined as when the lease request is sent by the leader). Each replica r grants a lease at event e_{i,r}^grant, which happens after e_{i,r}^receive (when the replica receives a lease request); the lease ends at t_{i,r}^end = TT.now().latest + 10, computed after e_{i,r}^receive. A replica r obeys the single-vote rule: it will not grant another lease vote until TT.after(t_{i,r}^end) is true. To enforce this rule across different incarnations of r, Spanner logs a lease vote at the granting replica before granting the lease; this log write can be piggybacked upon existing Paxos-protocol log writes.

When the i-th leader receives a quorum of votes (event e_i^quorum), it computes its lease interval as lease_i = [TT.now().latest, min_r(v_{i,r}^leader) + 10]. The lease is deemed to have expired at the leader when TT.before(min_r(v_{i,r}^leader) + 10) is false. To prove disjointness, we make use of the fact that the i-th and (i+1)-th leaders must have one replica in common in their quorums. Call that replica r0. Proof:

  lease_i.end = min_r(v_{i,r}^leader) + 10                  (by definition)
  min_r(v_{i,r}^leader) + 10 ≤ v_{i,r0}^leader + 10          (min)
  v_{i,r0}^leader + 10 ≤ t_abs(e_{i,r0}^send) + 10           (by definition)
  t_abs(e_{i,r0}^send) + 10 ≤ t_abs(e_{i,r0}^receive) + 10   (causality)
  t_abs(e_{i,r0}^receive) + 10 ≤ t_{i,r0}^end                (by definition)
  t_{i,r0}^end < t_abs(e_{i+1,r0}^grant)                     (single-vote)
  t_abs(e_{i+1,r0}^grant) ≤ t_abs(e_{i+1}^quorum)            (causality)
  t_abs(e_{i+1}^quorum) ≤ lease_{i+1}.start                  (by definition)
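The lease protocol above can be restated as a brief sketch. The `grant_vote` RPC, the per-replica `state` object, and its `log_vote` helper are hypothetical names; the 10-second value is the lease length used above, and message passing and Paxos are elided.

```python
def request_lease_votes(tt, replicas, lease_seconds=10):
    """Sketch of TrueTime-based leader-lease acquisition for leader i."""
    votes = {}
    for r in replicas:
        v_leader = tt.now().earliest      # lower bound, taken before sending
        if r.grant_vote(lease_seconds):   # hypothetical RPC to the replica
            votes[r] = v_leader
    if len(votes) <= len(replicas) // 2:
        return None                       # no quorum of votes
    lease_start = tt.now().latest         # computed once the quorum arrives
    lease_end = min(votes.values()) + lease_seconds
    return lease_start, lease_end

def replica_grant_vote(tt, state, lease_seconds=10):
    """Single-vote rule at a replica: refuse to vote again until the
    previous vote's end time has definitely passed."""
    if state.last_vote_end is not None and not tt.after(state.last_vote_end):
        return False
    state.last_vote_end = tt.now().latest + lease_seconds
    state.log_vote(state.last_vote_end)   # piggybacked on Paxos log writes
    return True
```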
ABSTRACT
Reliability at massive scale is one of the biggest challenges we
face at Amazon.com, one of the largest e-commerce operations in
the world; even the slightest outage has significant financial
consequences and impacts customer trust. The Amazon.com
platform, which provides services for many web sites worldwide,
is implemented on top of an infrastructure of tens of thousands of
servers and network components located in many datacenters
around the world. At this scale, small and large components fail
continuously and the way persistent state is managed in the face
of these failures drives the reliability and scalability of the
software systems.
This paper presents the design and implementation of Dynamo, a
highly available key-value storage system that some of Amazon's core services use to provide an "always-on" experience. To
achieve this level of availability, Dynamo sacrifices consistency
under certain failure scenarios. It makes extensive use of object
versioning and application-assisted conflict resolution in a manner
that provides a novel interface for developers to use.
General Terms
Algorithms, Management, Measurement, Performance, Design,
Reliability.
1. INTRODUCTION
Amazon runs a world-wide e-commerce platform that serves tens
of millions customers at peak times using tens of thousands of
servers located in many data centers around the world. There are
strict operational requirements on Amazon's platform in terms of
performance, reliability and efficiency, and to support continuous
growth the platform needs to be highly scalable. Reliability is one
of the most important requirements because even the slightest
outage has significant financial consequences and impacts
customer trust. In addition, to support continuous growth, the
platform needs to be highly scalable.
2. BACKGROUND
Amazon's e-commerce platform is composed of hundreds of services that work in concert to deliver functionality ranging from recommendations to order fulfillment to fraud detection. Each service is exposed through a well defined interface and is accessible over the network. These services are hosted in an infrastructure that consists of tens of thousands of servers located across many data centers world-wide. Some of these services are stateless (i.e., services which aggregate responses from other services) and some are stateful (i.e., a service that generates its response by executing business logic on its state stored in persistent store).

2.1
The storage system for this class of services has the following requirements:
Query Model: simple read and write operations to a data item that is uniquely identified by a key. State is stored as binary objects (i.e., blobs) identified by unique keys. No operations span multiple data items and there is no need for relational schema. This requirement is based on the observation that a significant portion of Amazon's services can work with this simple query model and do not need any relational schema. Dynamo targets applications that need to store objects that are relatively small (usually less than 1 MB).
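The query model above amounts to a two-operation, key-addressed store. A minimal sketch of such an interface follows; the class name and exact signatures are illustrative, and the opaque `context` returned by reads (and passed back into writes) is an assumption modeled on the versioning discussion later in the paper.

```python
class KeyValueStore:
    """Sketch of the simple query model: single-item reads and writes
    addressed by a key, with values treated as opaque blobs."""
    def __init__(self):
        self._data = {}

    def get(self, key: bytes):
        # Returns the stored blob plus an opaque context, or (None, None).
        blob = self._data.get(key)
        if blob is None:
            return None, None
        return blob, {"key": key}

    def put(self, key: bytes, context, blob: bytes):
        # No operation spans multiple items; no relational schema is needed.
        self._data[key] = blob

store = KeyValueStore()
store.put(b"cart:42", None, b"...serialized shopping cart...")
print(store.get(b"cart:42"))
```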
2.2
2.3
Design Considerations
3. RELATED WORK
3.1 Peer to Peer Systems
There are several peer-to-peer (P2P) systems that have looked at
the problem of data storage and distribution. The first generation
of P2P systems, such as Freenet and Gnutella1, were
predominantly used as file sharing systems. These were examples
of unstructured P2P networks where the overlay links between
peers were established arbitrarily. In these networks, a search
query is usually flooded through the network to find as many
peers as possible that share the data. P2P systems evolved to the
next generation into what is widely known as structured P2P
networks. These networks employ a globally consistent protocol
to ensure that any node can efficiently route a search query to
some peer that has the desired data. Systems like Pastry [16] and
Chord [20] use routing mechanisms to ensure that queries can be
answered within a bounded number of hops. To reduce the
additional latency introduced by multi-hop routing, some P2P
systems (e.g., [14]) employ O(1) routing where each peer
maintains enough routing information locally so that it can route
requests (to access a data item) to the appropriate peer within a
constant number of hops.
1
http://freenetproject.org/, http://www.gnutella.org
[Figure 2: partitioning and replication of keys in the Dynamo ring; nodes B, C and D store keys in range (A, B), including key K.]

Problem | Technique | Advantage
Partitioning | Consistent hashing | Incremental scalability
High availability for writes | Vector clocks with reconciliation during reads | Version size is decoupled from update rates
Handling temporary failures | Sloppy quorum and hinted handoff | Provides high availability and durability guarantee when some of the replicas are not available
Recovering from permanent failures | Anti-entropy using Merkle trees | Synchronizes divergent replicas in the background
Membership and failure detection | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information

Table 1: Summary of techniques used in Dynamo and their advantages.

3.3
Discussion
4. SYSTEM ARCHITECTURE
The architecture of a storage system that needs to operate in a production setting is complex. In addition to the actual data persistence component, the system needs to have scalable and robust solutions for load balancing, membership and failure detection, failure recovery, replica synchronization, overload handling, state transfer, concurrency and job scheduling, request marshalling, request routing, system monitoring and alarming, and configuration management. Describing the details of each of the solutions is not possible, so this paper focuses on the core distributed systems techniques used in Dynamo: partitioning, replication, versioning, membership, failure handling and scaling.

4.1
System Interface

4.2
Partitioning Algorithm
Thus, each node becomes responsible for the region in the ring between it and its predecessor node on the ring. The principal advantage of consistent hashing is that departure or arrival of a node only affects its immediate neighbors; other nodes remain unaffected.
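A minimal sketch of consistent hashing with virtual nodes (tokens): each node claims several positions on the ring, and a key is served by the first token clockwise from the key's hash. The hash function and token count below are arbitrary choices for illustration, not Dynamo's.

```python
import bisect
import hashlib

class Ring:
    """Consistent-hash ring: a key is coordinated by the owner of the
    first token at or after the key's position (wrapping around)."""
    def __init__(self, nodes, tokens_per_node=8):
        self._ring = []                                 # (position, node)
        for node in nodes:
            for i in range(tokens_per_node):            # virtual nodes
                self._ring.append((self._hash(f"{node}:{i}"), node))
        self._ring.sort()

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def coordinator(self, key: str) -> str:
        pos = self._hash(key)
        idx = bisect.bisect(self._ring, (pos, ""))       # first token >= pos
        return self._ring[idx % len(self._ring)][1]      # wrap around

ring = Ring(["A", "B", "C", "D"])
print(ring.coordinator("user:42"))
```

Adding or removing a node only moves the keys adjacent to its tokens, which is the incremental-scalability property claimed above.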
4.3
Replication

4.4
Data Versioning
A put() call may return to its caller before the update has been applied at all the replicas, which can result in scenarios where a subsequent get() operation may return an object that does not have the latest updates. If there are no failures, then there is a bound on the update propagation times. However, under certain failure scenarios (e.g., server outages or network partitions), updates may not arrive at all replicas for an extended period of time.
The size of a vector clock can grow if many different servers coordinate writes to the same object. In practice, this is not likely because the writes are usually handled by one of the top N nodes in the preference list. In case of network partitions or multiple server failures, write requests may be handled by nodes that are not in the top N nodes in the preference list, causing the size of the vector clock to grow. In these scenarios, it is desirable to limit the size of the vector clock. To this end, Dynamo employs the following clock truncation scheme: along with each (node, counter) pair, Dynamo stores a timestamp that indicates the last time the node updated the data item. When the number of (node, counter) pairs in the vector clock reaches a threshold (say 10), the oldest pair is removed from the clock. Clearly, this truncation scheme can lead to inefficiencies in reconciliation, as the descendant relationships cannot be derived accurately. However, this problem has not surfaced in production and therefore this issue has not been thoroughly investigated.
4.5
Next assume a different client reads D2 and then tries to update it,
and another node (say Sz) does the write. The system now has D4
(descendant of D2) whose version clock is [(Sx, 2), (Sz, 1)]. A
node that is aware of D1 or D2 could determine, upon receiving
D4 and its clock, that D1 and D2 are overwritten by the new data
and can be garbage collected. A node that is aware of D3 and
receives D4 will find that there is no causal relation between
them. In other words, there are changes in D3 and D4 that are not
reflected in each other. Both versions of the data must be kept and
presented to a client (upon a read) for semantic reconciliation.
Now assume some client reads both D3 and D4 (the context will
reflect that both values were found by the read). The read's
context is a summary of the clocks of D3 and D4, namely [(Sx, 2),
(Sy, 1), (Sz, 1)]. If the client performs the reconciliation and node
Sx coordinates the write, Sx will update its sequence number in
the clock. The new data D5 will have the following clock: [(Sx,
3), (Sy, 1), (Sz, 1)].
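The D2–D5 walk-through above can be reproduced with a small vector-clock helper. This is a sketch only; Dynamo's actual clock representation and wire format are not shown.

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock `a` dominates `b`, i.e. b's history is contained in a's."""
    return all(a.get(node, 0) >= cnt for node, cnt in b.items())

def write(clock: dict, coordinator: str) -> dict:
    new = dict(clock)
    new[coordinator] = new.get(coordinator, 0) + 1
    return new

def merge(*clocks: dict) -> dict:
    """Summary context returned by a read that saw several siblings."""
    out = {}
    for c in clocks:
        for node, cnt in c.items():
            out[node] = max(out.get(node, 0), cnt)
    return out

d2 = {"Sx": 2}
d3 = write(d2, "Sy")             # [(Sx, 2), (Sy, 1)]
d4 = write(d2, "Sz")             # [(Sx, 2), (Sz, 1)]
assert not descends(d3, d4) and not descends(d4, d3)   # siblings: keep both
d5 = write(merge(d3, d4), "Sx")  # client reconciles, Sx coordinates the write
print(d5)                        # {'Sx': 3, 'Sy': 1, 'Sz': 1}
```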
4.6
Using hinted handoff, Dynamo ensures that the read and write operations do not fail due to temporary node or network failures. Applications that need the highest level of availability can set W to 1, which ensures that a write is accepted as long as a single node in the system has durably written the key to its local store. Thus, the write request is only rejected if all nodes in the system are unavailable. However, in practice, most Amazon services in production set a higher W to meet the desired level of durability. A more detailed discussion of configuring N, R and W follows in Section 6.
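A simplified sketch of a sloppy-quorum write with hinted handoff: the coordinator walks the preference list, skips unreachable nodes, and parks their replicas on the next healthy nodes along with a hint naming the intended owner. The `store` call and `is_healthy` predicate are hypothetical, and the sketch stops after W acknowledgements rather than writing all N replicas.

```python
def coordinate_write(key, value, preference_list, is_healthy, w):
    """Sketch of a sloppy-quorum write with hinted handoff."""
    pending_hints = []      # intended owners that were unreachable
    acks = 0
    for node in preference_list:
        if acks >= w:
            break
        if is_healthy(node):
            hint = pending_hints.pop(0) if pending_hints else None
            node.store(key, value, hint=hint)   # hypothetical storage call
            acks += 1
        else:
            pending_hints.append(node)          # hand this replica off later
    return acks >= w   # the write succeeds once W nodes have acknowledged
```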
4.8
4.8.1
When a node starts for the first time, it chooses its set of tokens
(virtual nodes in the consistent hash space) and maps nodes to
their respective token sets. The mapping is persisted on disk and
initially contains only the local node and token set. The mappings
stored at different Dynamo nodes are reconciled during the same
communication exchange that reconciles the membership change
histories. Therefore, partitioning and placement information also
propagates via the gossip-based protocol and each storage node is
aware of the token ranges handled by its peers. This allows each
node to forward a key's read/write operations to the right set of
nodes directly.
4.8.2
External Discovery
4.8.3
Failure Detection

5. IMPLEMENTATION
In Dynamo, each storage node has three main software components: request coordination, membership and failure detection, and a local persistence engine. All these components are implemented in Java.
The request coordination component is built on top of an event-driven messaging substrate where the message processing pipeline is split into multiple stages similar to the SEDA architecture [24].
All communications are implemented using Java NIO channels.
The coordinator executes the read and write requests on behalf of
clients by collecting data from one or more nodes (in the case of
reads) or storing data at one or more nodes (for writes). Each
client request results in the creation of a state machine on the node
that received the client request. The state machine contains all the
logic for identifying the nodes responsible for a key, sending the
requests, waiting for responses, potentially doing retries,
processing the replies and packaging the response to the client.
Each state machine instance handles exactly one client request.
For instance, a read operation implements the following state
machine: (i) send read requests to the nodes, (ii) wait for
minimum number of required responses, (iii) if too few replies
were received within a given time bound, fail the request, (iv)
otherwise gather all the data versions and determine the ones to be
returned and (v) if versioning is enabled, perform syntactic
reconciliation and generate an opaque write context that contains
the vector clock that subsumes all the remaining versions. For the
sake of brevity the failure handling and retry states are left out.
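The read path in steps (i)–(v) translates roughly into the following sketch. The node objects, the `gather` and `reconcile` helpers, and the reply shape are assumptions; the failure-handling and retry states are omitted, as above.

```python
def merge_clocks(clocks):
    out = {}
    for c in clocks:
        for node, cnt in c.items():
            out[node] = max(out.get(node, 0), cnt)
    return out

def coordinate_read(key, nodes, r, timeout_s, gather, reconcile):
    """Sketch of the read state machine described above."""
    for node in nodes:
        node.send_read(key)                    # (i) send read requests
    replies = gather(nodes, r, timeout_s)      # (ii) wait for responses
    if len(replies) < r:                       # (iii) too few replies in time
        raise TimeoutError("read failed: fewer than R replies")
    versions = reconcile(replies)              # (iv) keep causally latest versions
    context = merge_clocks(v["clock"] for v in versions)   # (v) summary clock
    return versions, context
```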
4.9
After the read response has been returned to the caller, the state machine waits for a small period of time to receive any outstanding responses.
Traditional wisdom holds that durability and availability go hand-in-hand. However, this is not necessarily true here. For instance,
the vulnerability window for durability can be decreased by
increasing W. This may increase the probability of rejecting
requests (thereby decreasing availability) because more storage
hosts need to be alive to process a write request.
The common (N,R,W) configuration used by several instances of
Dynamo is (3,2,2). These values are chosen to meet the necessary
levels of performance, durability, consistency, and availability
SLAs.
6.2
To study the load imbalance and its correlation with request load, the total number of requests received by each node was measured for a period of 24 hours, broken down into intervals of 30 minutes. In a given time window, a node is considered to be "in-balance" if the node's request load deviates from the average load by less than a certain threshold (here 15%). Otherwise the node was deemed "out-of-balance". Figure 6 presents the fraction of nodes that are out-of-balance (henceforth, the "imbalance ratio") during this time period. For reference, the corresponding request load received by the entire system during this time period is also plotted. As seen in the figure, the imbalance ratio decreases with increasing load. For instance, during low loads the imbalance ratio is as high as 20% and during high loads it is close to 10%. Intuitively, this can be explained by the fact that under high loads a large number of popular keys are accessed and, due to the uniform distribution of keys, the load is evenly distributed. However, during low loads (where load is 1/8th of the measured peak load), fewer popular keys are accessed, resulting in higher load imbalance.
Figure 7: Partitioning and placement of keys in the three strategies. A, B, and C depict the three unique nodes that form the
preference list for the key k1 on the consistent hashing ring (N=3). The shaded area indicates the key range for which nodes A,
B, and C form the preference list. Dark arrows indicate the token locations for various nodes.
The fundamental issue with this strategy is that the schemes for
data partitioning and data placement are intertwined. For instance,
in some cases, it is preferred to add more nodes to the system in
order to handle an increase in request load. However, in this
scenario, it is not possible to add nodes without affecting data
partitioning. Ideally, it is desirable to use independent schemes for partitioning and placement. To this end, the following strategies were evaluated:
[Figure: comparison of Strategies 1, 2, and 3; the y-axis ranges from 0.4 to 0.9 and the x-axis from 0 to 35000.]
              | 99.9th percentile read latency (ms) | 99.9th percentile write latency (ms) | Average read latency (ms) | Average write latency (ms)
Server-driven | 68.9 | 68.5 | 3.9  | 4.02
Client-driven | 30.4 | 30.4 | 1.55 | 1.9

Table 2: Performance of client-driven and server-driven coordination approaches.

6.6
Discussion
7. CONCLUSIONS
This paper described Dynamo, a highly available and scalable
data store, used for storing state of a number of core services of
Amazon.com's e-commerce platform. Dynamo has provided the desired levels of availability and performance and has been successful in handling server failures, data center failures and network partitions. Dynamo is incrementally scalable and allows service owners to scale up and down based on their current request load.

The production use of Dynamo for the past year demonstrates that decentralized techniques can be combined to provide a single highly-available system. Its success in one of the most challenging application environments shows that an eventually-consistent storage system can be a building block for highly-available applications.

ACKNOWLEDGEMENTS
The authors would like to thank Pat Helland for his contribution to the initial design of Dynamo. We would also like to thank Marvin Theimer and Robert van Renesse for their comments. Finally, we would like to thank our shepherd, Jeff Mogul, for his detailed comments and inputs while preparing the camera ready version that vastly improved the quality of the paper.

REFERENCES
[1] Adya, A., Bolosky, W. J., Castro, M., Cermak, G., Chaiken, R., Douceur, J. R., Howell, J., Lorch, J. R., Theimer, M., and Wattenhofer, R. P. 2002. Farsite: federated, available, and reliable storage for an incompletely trusted environment. SIGOPS Oper. Syst. Rev. 36, SI (Dec. 2002), 1-14.
[2]
[3] Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. 2006. Bigtable: a distributed storage system for structured data. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (Seattle, WA, November 06-08, 2006). USENIX Association, Berkeley, CA.
[6] Ghemawat, S., Gobioff, H., and Leung, S. 2003. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (Bolton Landing, NY, USA, October 19-22, 2003). SOSP '03. ACM Press, New York, NY, 29-43.
[7] Gray, J., Helland, P., O'Neil, P., and Shasha, D. 1996. The dangers of replication and a solution. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (Montreal, Quebec, Canada, June 04-06, 1996). SIGMOD '96. ACM Press, New York, NY, 173-182.
[9] Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Wells, C., and Zhao, B. 2000. OceanStore: an architecture for global-scale persistent storage. SIGARCH Comput. Archit. News 28, 5 (Dec. 2000), 190-201.
[10] Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine, M., and Lewin, D. 1997. Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing (El Paso, Texas, United States, May 04-06, 1997). STOC '97. ACM Press, New York, NY, 654-663.
[15] Reiher, P., Heidemann, J., Ratner, D., Skinner, G., and Popek, G. 1994. Resolving file conflicts in the Ficus file system. In Proceedings of the USENIX Summer 1994 Technical Conference (Boston, Massachusetts, June 06-10, 1994). USENIX Association, Berkeley, CA.
[16] Rowstron, A., and Druschel, P. 2001. Pastry: scalable, decentralized object location and routing for large-scale peer-to-peer systems. In Proceedings of Middleware, pages 329-350, November 2001.
[20] Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., and Balakrishnan, H. 2001. Chord: a scalable peer-to-peer lookup service for internet applications. In Proceedings of the 2001 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (San Diego, California, United States). SIGCOMM '01. ACM Press, New York, NY, 149-160.
220
210
ABSTRACT
Reliability at massive scale is one of the biggest challenges we
face at Amazon.com, one of the largest e-commerce operations in
the world; even the slightest outage has significant financial
consequences and impacts customer trust. The Amazon.com
platform, which provides services for many web sites worldwide,
is implemented on top of an infrastructure of tens of thousands of
servers and network components located in many datacenters
around the world. At this scale, small and large components fail
continuously and the way persistent state is managed in the face
of these failures drives the reliability and scalability of the
software systems.
This paper presents the design and implementation of Dynamo, a
highly available key-value storage system that some of Amazons
core services use to provide an always-on experience. To
achieve this level of availability, Dynamo sacrifices consistency
under certain failure scenarios. It makes extensive use of object
versioning and application-assisted conflict resolution in a manner
that provides a novel interface for developers to use.
General Terms
Algorithms, Management, Measurement, Performance, Design,
Reliability.
1. INTRODUCTION
Amazon runs a world-wide e-commerce platform that serves tens
of millions customers at peak times using tens of thousands of
servers located in many data centers around the world. There are
strict operational requirements on Amazons platform in terms of
performance, reliability and efficiency, and to support continuous
growth the platform needs to be highly scalable. Reliability is one
of the most important requirements because even the slightest
outage has significant financial consequences and impacts
customer trust. In addition, to support continuous growth, the
platform needs to be highly scalable.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
SOSP07, October 1417, 2007, Stevenson, Washington, USA.
Copyright 2007 ACM 978-1-59593-591-5/07/0010...$5.00.
205
195
2.1
The storage system for this class of services has the following
requirements:
Query Model: simple read and write operations to a data item that
is uniquely identified by a key. State is stored as binary objects
(i.e., blobs) identified by unique keys. No operations span
multiple data items and there is no need for relational schema.
This requirement is based on the observation that a significant
portion of Amazons services can work with this simple query
model and do not need any relational schema. Dynamo targets
applications that need to store objects that are relatively small
(usually less than 1 MB).
2. BACKGROUND
Amazons e-commerce platform is composed of hundreds of
services that work in concert to deliver functionality ranging from
recommendations to order fulfillment to fraud detection. Each
service is exposed through a well defined interface and is
accessible over the network. These services are hosted in an
infrastructure that consists of tens of thousands of servers located
across many data centers world-wide. Some of these services are
stateless (i.e., services which aggregate responses from other
services) and some are stateful (i.e., a service that generates its
response by executing business logic on its state stored in
persistent store).
2.2
206
196
2.3
Design Considerations
207
197
3.2
3. RELATED WORK
3.1 Peer to Peer Systems
There are several peer-to-peer (P2P) systems that have looked at
the problem of data storage and distribution. The first generation
of P2P systems, such as Freenet and Gnutella1, were
predominantly used as file sharing systems. These were examples
of unstructured P2P networks where the overlay links between
peers were established arbitrarily. In these networks, a search
query is usually flooded through the network to find as many
peers as possible that share the data. P2P systems evolved to the
next generation into what is widely known as structured P2P
networks. These networks employ a globally consistent protocol
to ensure that any node can efficiently route a search query to
some peer that has the desired data. Systems like Pastry [16] and
Chord [20] use routing mechanisms to ensure that queries can be
answered within a bounded number of hops. To reduce the
additional latency introduced by multi-hop routing, some P2P
systems (e.g., [14]) employ O(1) routing where each peer
maintains enough routing information locally so that it can route
requests (to access a data item) to the appropriate peer within a
constant number of hops.
1
http://freenetproject.org/, http://www.gnutella.org
208
198
Key K
A
G
Technique
Advantage
Partitioning
Consistent Hashing
Incremental
Scalability
High Availability
for writes
Version size is
decoupled from
update rates.
Handling temporary
failures
Provides high
availability and
durability guarantee
when some of the
replicas are not
available.
Recovering from
permanent failures
Anti-entropy using
Merkle trees
Synchronizes
divergent replicas in
the background.
Membership and
failure detection
Gossip-based
membership protocol
and failure detection.
Preserves symmetry
and avoids having a
centralized registry
for storing
membership and
node liveness
information.
Nodes B, C
and D store
keys in
range (A,B)
including
K.
3.3
Problem
Discussion
4.1
System Interface
4.2
Partitioning Algorithm
4. SYSTEM ARCHITECTURE
The architecture of a storage system that needs to operate in a
production setting is complex. In addition to the actual data
persistence component, the system needs to have scalable and
robust solutions for load balancing, membership and failure
detection, failure recovery, replica synchronization, overload
handling, state transfer, concurrency and job scheduling, request
marshalling, request routing, system monitoring and alarming,
and configuration management. Describing the details of each of
the solutions is not possible, so this paper focuses on the core
distributed systems techniques used in Dynamo: partitioning,
replication, versioning, membership, failure handling and scaling.
209
199
Thus, each node becomes responsible for the region in the ring
between it and its predecessor node on the ring. The principle
advantage of consistent hashing is that departure or arrival of a
node only affects its immediate neighbors and other nodes remain
unaffected.
return to its caller before the update has been applied at all the
replicas, which can result in scenarios where a subsequent get()
operation may return an object that does not have the latest
updates.. If there are no failures then there is a bound on the
update propagation times. However, under certain failure
scenarios (e.g., server outages or network partitions), updates may
not arrive at all replicas for an extended period of time.
4.3
Replication
4.4
Data Versioning
210
200
object. In practice, this is not likely because the writes are usually
handled by one of the top N nodes in the preference list. In case of
network partitions or multiple server failures, write requests may
be handled by nodes that are not in the top N nodes in the
preference list causing the size of vector clock to grow. In these
scenarios, it is desirable to limit the size of vector clock. To this
end, Dynamo employs the following clock truncation scheme:
Along with each (node, counter) pair, Dynamo stores a timestamp
that indicates the last time the node updated the data item. When
the number of (node, counter) pairs in the vector clock reaches a
threshold (say 10), the oldest pair is removed from the clock.
Clearly, this truncation scheme can lead to inefficiencies in
reconciliation as the descendant relationships cannot be derived
accurately. However, this problem has not surfaced in production
and therefore this issue has not been thoroughly investigated.
4.5
Next assume a different client reads D2 and then tries to update it,
and another node (say Sz) does the write. The system now has D4
(descendant of D2) whose version clock is [(Sx, 2), (Sz, 1)]. A
node that is aware of D1 or D2 could determine, upon receiving
D4 and its clock, that D1 and D2 are overwritten by the new data
and can be garbage collected. A node that is aware of D3 and
receives D4 will find that there is no causal relation between
them. In other words, there are changes in D3 and D4 that are not
reflected in each other. Both versions of the data must be kept and
presented to a client (upon a read) for semantic reconciliation.
Now assume some client reads both D3 and D4 (the context will
reflect that both values were found by the read). The read's
context is a summary of the clocks of D3 and D4, namely [(Sx, 2),
(Sy, 1), (Sz, 1)]. If the client performs the reconciliation and node
Sx coordinates the write, Sx will update its sequence number in
the clock. The new data D5 will have the following clock: [(Sx,
3), (Sy, 1), (Sz, 1)].
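To make the causality checks in this example concrete, here is a small self-contained sketch (our own illustration) of how one version is recognized as an ancestor of another, using the clocks of D3, D4 and D5 from above:

// A version clock as a map from node id to counter, e.g. D3 = [(Sx, 2), (Sy, 1)].
type Clock = Map[String, Long]

// b descends from a iff b has seen at least every update that a has seen.
def descends(b: Clock, a: Clock): Boolean =
  a.forall { case (node, counter) => b.getOrElse(node, 0L) >= counter }

// Two versions conflict (and must both be kept for semantic reconciliation)
// when neither descends from the other.
def conflict(a: Clock, b: Clock): Boolean = !descends(a, b) && !descends(b, a)

val d3 = Map("Sx" -> 2L, "Sy" -> 1L)
val d4 = Map("Sx" -> 2L, "Sz" -> 1L)
val d5 = Map("Sx" -> 3L, "Sy" -> 1L, "Sz" -> 1L)

assert(conflict(d3, d4))                      // concurrent: both are returned to the client
assert(descends(d5, d3) && descends(d5, d4))  // D5 subsumes both after reconciliation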
4.6 Handling Failures: Hinted Handoff
Using hinted handoff, Dynamo ensures that the read and write
operations are not failed due to temporary node or network
failures. Applications that need the highest level of availability
can set W to 1, which ensures that a write is accepted as long as a
single node in the system has durably written the key to its local
store. Thus, the write request is only rejected if all nodes in the
system are unavailable. However, in practice, most Amazon
services in production set a higher W to meet the desired level of
durability. A more detailed discussion of configuring N, R and W
follows in section 6.
4.8 Membership and Failure Detection

4.8.1 Ring Membership
When a node starts for the first time, it chooses its set of tokens (virtual nodes in the consistent hash space) and maps nodes to their respective token sets. The mapping is persisted on disk and initially contains only the local node and token set. The mappings stored at different Dynamo nodes are reconciled during the same communication exchange that reconciles the membership change histories. Therefore, partitioning and placement information also propagates via the gossip-based protocol and each storage node is aware of the token ranges handled by its peers. This allows each node to forward a key's read/write operations to the right set of nodes directly.
4.8.2 External Discovery

4.8.3 Failure Detection
5. IMPLEMENTATION
In Dynamo, each storage node has three main software
components: request coordination, membership and failure
detection, and a local persistence engine. All these components
are implemented in Java.
The request coordination component is built on top of an event-driven messaging substrate where the message processing pipeline is split into multiple stages similar to the SEDA architecture [24].
All communications are implemented using Java NIO channels.
The coordinator executes the read and write requests on behalf of
clients by collecting data from one or more nodes (in the case of
reads) or storing data at one or more nodes (for writes). Each
client request results in the creation of a state machine on the node
that received the client request. The state machine contains all the
logic for identifying the nodes responsible for a key, sending the
requests, waiting for responses, potentially doing retries,
processing the replies and packaging the response to the client.
Each state machine instance handles exactly one client request.
For instance, a read operation implements the following state
machine: (i) send read requests to the nodes, (ii) wait for
minimum number of required responses, (iii) if too few replies
were received within a given time bound, fail the request, (iv)
otherwise gather all the data versions and determine the ones to be
returned and (v) if versioning is enabled, perform syntactic
reconciliation and generate an opaque write context that contains
the vector clock that subsumes all the remaining versions. For the
sake of brevity the failure handling and retry states are left out.
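A rough sketch of the read path of such a coordinator state machine (a simplification in our own types; the actual component is an event-driven Java implementation and, as noted above, also has failure handling and retry states):

import scala.concurrent.duration._

case class Version(value: Array[Byte], clock: Map[String, Long])

// Fan a read out to the replicas, wait for at least R responses within a time
// bound, and keep only the versions not subsumed by another returned version.
def coordinateRead(
    replicas: Seq[String],
    r: Int,
    timeout: FiniteDuration,
    readFrom: String => Option[Version]   // stand-in for sending a read request to one node
): Either[String, Seq[Version]] = {
  val deadline = timeout.fromNow
  val responses = replicas.iterator
    .takeWhile(_ => deadline.hasTimeLeft())
    .flatMap(readFrom(_))
    .take(r)
    .toList
  if (responses.size < r) Left("too few replies were received within the time bound")
  else {
    def descends(b: Map[String, Long], a: Map[String, Long]) =
      a.forall { case (n, c) => b.getOrElse(n, 0L) >= c }
    // Syntactic reconciliation: drop versions that are strict ancestors of another response.
    val frontier = responses.filterNot(v =>
      responses.exists(o => (o ne v) && o.clock != v.clock && descends(o.clock, v.clock)))
    Right(frontier)
  }
}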
4.9
After the read response has been returned to the caller the state
2 http://www.oracle.com/database/berkeley-db.html
Traditional wisdom holds that durability and availability go hand-in-hand. However, this is not necessarily true here. For instance,
the vulnerability window for durability can be decreased by
increasing W. This may increase the probability of rejecting
requests (thereby decreasing availability) because more storage
hosts need to be alive to process a write request.
The common (N,R,W) configuration used by several instances of
Dynamo is (3,2,2). These values are chosen to meet the necessary
levels of performance, durability, consistency, and availability
SLAs.
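To make the tradeoff concrete, a small sketch (our own naming, not Dynamo's code) of the bookkeeping implied by an (N, R, W) configuration:

// N replicas per key, W acknowledgements for a write, R responses for a read.
// R + W > N means every read quorum overlaps the most recent write quorum.
case class Quorum(n: Int, r: Int, w: Int) {
  require(r <= n && w <= n, "R and W cannot exceed N")
  def overlapping: Boolean = r + w > n
  def writeSucceeds(acks: Int): Boolean = acks >= w
  def readSucceeds(replies: Int): Boolean = replies >= r
}

val common = Quorum(3, 2, 2)          // the (3, 2, 2) configuration mentioned above
val writeAvailable = Quorum(3, 1, 1)  // W = 1: a write succeeds if any single node stores it

assert(common.overlapping)            // 2 + 2 > 3
assert(!writeAvailable.overlapping)   // higher availability, weaker durability window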
6.2
6.1
To study the load imbalance and its correlation with request load,
the total number of requests received by each node was measured
for a period of 24 hours - broken down into intervals of 30
minutes. In a given time window, a node is considered to be "in-balance" if the node's request load deviates from the average load
by less than a certain threshold (here, 15%); otherwise,
the node was deemed out-of-balance. Figure 6 presents the
fraction of nodes that are out-of-balance (henceforth,
imbalance ratio) during this time period. For reference, the
corresponding request load received by the entire system during
this time period is also plotted. As seen in the figure, the
imbalance ratio decreases with increasing load. For instance,
during low loads the imbalance ratio is as high as 20% and during
high loads it is close to 10%. Intuitively, this can be explained by
the fact that under high loads, a large number of popular keys are
accessed and, due to the uniform distribution of keys, the load is
evenly distributed. However, during low loads (where load is 1/8th
Figure 7: Partitioning and placement of keys in the three strategies. A, B, and C depict the three unique nodes that form the
preference list for the key k1 on the consistent hashing ring (N=3). The shaded area indicates the key range for which nodes A,
B, and C form the preference list. Dark arrows indicate the token locations for various nodes.
The fundamental issue with this strategy is that the schemes for
data partitioning and data placement are intertwined. For instance,
in some cases, it is preferred to add more nodes to the system in
order to handle an increase in request load. However, in this
scenario, it is not possible to add nodes without affecting data
partitioning. Ideally, it is desirable to use independent schemes for
partitioning and placement. To this end, the following strategies were
evaluated:
[Figure: comparison of Strategy 1, Strategy 2, and Strategy 3 (plot residue; axis values omitted).]
                99.9th percentile    99.9th percentile    Average read    Average write
                read latency (ms)    write latency (ms)   latency (ms)    latency (ms)
Server-driven   68.9                 68.5                 3.9             4.02
Client-driven   30.4                 30.4                 1.55            1.9
6.6 Discussion
7. CONCLUSIONS
This paper described Dynamo, a highly available and scalable
data store, used for storing state of a number of core services of
Amazon.com's e-commerce platform. Dynamo has provided the
desired levels of availability and performance and has been
successful in handling server failures, data center failures and
network partitions. Dynamo is incrementally scalable and allows
service owners to scale up and down based on their current
The production use of Dynamo for the past year demonstrates that decentralized techniques can be combined to provide a single highly-available system. Its success in one of the most challenging application environments shows that an eventually-consistent storage system can be a building block for highly-available applications.

ACKNOWLEDGEMENTS
The authors would like to thank Pat Helland for his contribution to the initial design of Dynamo. We would also like to thank Marvin Theimer and Robert van Renesse for their comments. Finally, we would like to thank our shepherd, Jeff Mogul, for his detailed comments and inputs while preparing the camera-ready version that vastly improved the quality of the paper.

REFERENCES
[1] Adya, A., Bolosky, W. J., Castro, M., Cermak, G., Chaiken, R., Douceur, J. R., Howell, J., Lorch, J. R., Theimer, M., and Wattenhofer, R. P. 2002. Farsite: federated, available, and reliable storage for an incompletely trusted environment. SIGOPS Oper. Syst. Rev. 36, SI (Dec. 2002), 1-14.
[2]
[3] Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. 2006. Bigtable: a distributed storage system for structured data. In Proceedings of the 7th Conference on USENIX Symposium on Operating Systems Design and Implementation - Volume 7 (Seattle, WA, November 06 - 08, 2006). USENIX Association, Berkeley, CA, 15-15.
[6] Ghemawat, S., Gobioff, H., and Leung, S. 2003. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (Bolton Landing, NY, USA, October 19 - 22, 2003). SOSP '03. ACM Press, New York, NY, 29-43.
[7] Gray, J., Helland, P., O'Neil, P., and Shasha, D. 1996. The dangers of replication and a solution. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (Montreal, Quebec, Canada, June 04 - 06, 1996). J. Widom, Ed. SIGMOD '96. ACM Press, New York, NY, 173-182.
[9] Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Wells, C., and Zhao, B. 2000. OceanStore: an architecture for global-scale persistent storage. SIGARCH Comput. Archit. News 28, 5 (Dec. 2000), 190-201.
[10] Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine, M., and Lewin, D. 1997. Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing (El Paso, Texas, United States, May 04 - 06, 1997). STOC '97. ACM Press, New York, NY, 654-663.
[15] Reiher, P., Heidemann, J., Ratner, D., Skinner, G., and Popek, G. 1994. Resolving file conflicts in the Ficus file system. In Proceedings of the USENIX Summer 1994 Technical Conference - Volume 1 (Boston, Massachusetts, June 06 - 10, 1994). USENIX Association, Berkeley, CA, 12-12.
[16] Rowstron, A., and Druschel, P. 2001. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In Proceedings of Middleware, pages 329-350, November 2001.
[20] Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., and Balakrishnan, H. 2001. Chord: A scalable peer-to-peer lookup service for internet applications. In Proceedings of the 2001 Conference on Applications, Technologies, Architectures, and Protocols For Computer Communications (San Diego, California, United States). SIGCOMM '01. ACM Press, New York, NY, 149-160.
Abstract
We present Mesos, a platform for sharing commodity clusters between multiple diverse cluster computing
frameworks, such as Hadoop and MPI. Sharing improves
cluster utilization and avoids per-framework data replication. Mesos shares resources in a fine-grained manner, allowing frameworks to achieve data locality by
taking turns reading data stored on each machine. To
support the sophisticated schedulers of today's frameworks, Mesos introduces a distributed two-level scheduling mechanism called resource offers. Mesos decides
how many resources to offer each framework, while
frameworks decide which resources to accept and which
computations to run on them. Our results show that
Mesos can achieve near-optimal data locality when sharing the cluster among diverse frameworks, can scale to
50,000 (emulated) nodes, and is resilient to failures.
Introduction
[Figure 1: CDF of MapReduce job durations and of map and reduce task durations in the Facebook workload (x-axis: duration in seconds, from 1 to 100,000).]
Target Environment
As an example of a workload we aim to support, consider the Hadoop data warehouse at Facebook [5]. Facebook loads logs from its web services into a 2000-node
Hadoop cluster, where they are used for applications
such as business intelligence, spam detection, and ad
optimization. In addition to production jobs that run
periodically, the cluster is used for many experimental
jobs, ranging from multi-hour machine learning computations to 1-2 minute ad-hoc queries submitted interactively through an SQL interface called Hive [3]. Most
jobs are short (the median job being 84s long), and the
jobs are composed of fine-grained map and reduce tasks
(the median task being 23s), as shown in Figure 1.
To meet the performance requirements of these jobs,
Facebook uses a fair scheduler for Hadoop that takes advantage of the fine-grained nature of the workload to allocate resources at the level of tasks and to optimize data
locality [38]. Unfortunately, this means that the cluster
can only run Hadoop jobs. If a user wishes to write an ad
targeting algorithm in MPI instead of MapReduce, perhaps because MPI is more efficient for this job's communication pattern, then the user must set up a separate MPI
cluster and import terabytes of data into it. This problem
is not hypothetical; our contacts at Yahoo! and Facebook
report that users want to run MPI and MapReduce Online
(a streaming MapReduce) [11, 10]. Mesos aims to provide fine-grained sharing between multiple cluster computing frameworks to enable these usage scenarios.
[Figure: Mesos architecture. Framework schedulers (e.g., a Hadoop scheduler and an MPI scheduler) register with the Mesos master, which is backed by standby masters coordinated through a ZooKeeper quorum; Mesos slaves run framework executors (e.g., Hadoop and MPI executors) that launch tasks.]
Architecture
Design Philosophy
We begin our description of Mesos by discussing our design philosophy. We then describe the components of
Mesos, our resource allocation mechanisms, and how
Mesos achieves isolation, scalability, and fault tolerance.
3.1 Overview

[Figure: resource offer example. A Mesos slave reports its free resources (e.g., <s1, 4cpu, 4gb, >) to the master's allocation module, which offers them to a framework scheduler.]
Isolation
Mesos provides performance isolation between framework executors running on the same slave by leveraging
existing OS isolation mechanisms. Since these mechanisms are platform-dependent, we support multiple isolation mechanisms through pluggable isolation modules.
We currently isolate resources using OS container
technologies, specifically Linux Containers [9] and Solaris Projects [13]. These technologies can limit the
CPU, memory, network bandwidth, and (in new Linux
kernels) I/O usage of a process tree. These isolation technologies are not perfect, but using containers is already
an advantage over frameworks like Hadoop, where tasks
from different jobs simply run in separate processes.
Resource Allocation
3.5
Mesos delegates allocation decisions to a pluggable allocation module, so that organizations can tailor allocation to their needs. So far, we have implemented two
allocation modules: one that performs fair sharing based
on a generalization of max-min fairness for multiple resources [21] and one that implements strict priorities.
Similar policies are used in Hadoop and Dryad [25, 38].
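As a sketch of what a pluggable allocation module might look like (our own simplification: a single-resource max-min fair allocator, whereas the module mentioned above generalizes max-min fairness to multiple resources):

// An allocation policy decides how much of a resource each framework should
// hold, given the total capacity and each framework's current demand.
trait AllocationModule {
  def allocate(capacity: Double, demands: Map[String, Double]): Map[String, Double]
}

// Single-resource max-min fairness: satisfy the smallest demands first, then
// split the remaining capacity evenly among the still-unsatisfied frameworks.
object MaxMinFair extends AllocationModule {
  def allocate(capacity: Double, demands: Map[String, Double]): Map[String, Double] = {
    var remaining = capacity
    var pending = demands
    var shares = Map.empty[String, Double]
    while (pending.nonEmpty) {
      val fairShare = remaining / pending.size
      val (satisfied, unsatisfied) = pending.partition { case (_, d) => d <= fairShare }
      if (satisfied.isEmpty)
        return shares ++ pending.keys.map(_ -> fairShare)  // equal split for the rest
      shares ++= satisfied
      remaining -= satisfied.values.sum
      pending = unsatisfied
    }
    shares
  }
}

// Example: 10 CPUs split among demands of 2, 4, and 8 yields 2.0, 4.0, and 4.0.
val shares = MaxMinFair.allocate(10.0, Map("A" -> 2.0, "B" -> 4.0, "C" -> 8.0))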
In normal operation, Mesos takes advantage of the
fact that most tasks are short, and only reallocates resources when tasks finish. This usually happens frequently enough so that new frameworks acquire their
share quickly. For example, if a framework's share is
10% of the cluster, it needs to wait approximately 10%
of the mean task length to receive its share. However,
if a cluster becomes filled by long tasks, e.g., due to a
buggy job or a greedy framework, the allocation module
can also revoke (kill) tasks. Before killing a task, Mesos
gives its framework a grace period to clean it up.
We leave it up to the allocation module to select the
policy for revoking tasks, but describe two related mechanisms here. First, while killing a task has a low impact
on many frameworks (e.g., MapReduce), it is harmful for
frameworks with interdependent tasks (e.g., MPI). We allow these frameworks to avoid being killed by letting al-
Because task scheduling in Mesos is a distributed process, it needs to be efficient and robust to failures. Mesos
includes three mechanisms to help with this goal.
First, because some frameworks will always reject certain resources, Mesos lets them short-circuit the rejection
process and avoid communication by providing filters to
the master. We currently support two types of filters:
"only offer nodes from list L" and "only offer nodes with at least R resources free". However, other types of predicates could also be supported. Note that unlike generic
constraint languages, filters are Boolean predicates that
specify whether a framework will reject one bundle of
resources on one node, so they can be evaluated quickly
on the master. Any resource that does not pass a frameworks filter is treated exactly like a rejected resource.
Second, because a framework may take time to respond to an offer, Mesos counts resources offered to a
framework towards its allocation of the cluster. This is
a strong incentive for frameworks to respond to offers
quickly and to filter resources that they cannot use.
Third, if a framework has not responded to an offer
for a sufficiently long time, Mesos rescinds the offer and
re-offers the resources to other frameworks.
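A small sketch of such filters as cheaply evaluable Boolean predicates (the names and the offer representation are ours; the text above does not specify how filters are encoded):

// One bundle of resources offered from one node, plus filters a framework can
// install on the master so clearly unusable offers are rejected without a round trip.
case class Offer(node: String, cpus: Double, memGb: Double)

sealed trait Filter { def accepts(o: Offer): Boolean }
case class OnlyNodes(allowed: Set[String]) extends Filter {
  def accepts(o: Offer): Boolean = allowed.contains(o.node)
}
case class MinResources(cpus: Double, memGb: Double) extends Filter {
  def accepts(o: Offer): Boolean = o.cpus >= cpus && o.memGb >= memGb
}

// The master forwards an offer only if every installed filter accepts it;
// anything filtered out is treated exactly like a rejected resource.
def passesFilters(o: Offer, filters: Seq[Filter]): Boolean = filters.forall(_.accepts(o))

val installed = Seq(OnlyNodes(Set("node1", "node7")), MinResources(cpus = 2, memGb = 4))
assert(!passesFilters(Offer("node3", cpus = 8, memGb = 16), installed)) // wrong node: rejected on the master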
API Summary

Scheduler Callbacks:
  resourceOffer(offerId, offers)
  offerRescinded(offerId)
  statusUpdate(taskId, status)
  slaveLost(slaveId)

Executor Callbacks:
  launchTask(taskDescriptor)
  killTask(taskId)

Scheduler Actions:
  replyToOffer(offerId, tasks)
  setNeedsOffers(bool)
  setFilters(filters)
  getGuaranteedShare()
  killTask(taskId)

Executor Actions:
  sendStatus(taskId, status)
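To illustrate how a framework might use this interface, here is a hedged sketch (not Mesos's actual code; the types, the task-fitting rule, and the wiring to the master are our own) of a scheduler that answers resource offers with tasks:

// Simplified stand-ins for the entities named in the API summary above.
case class Resources(cpus: Double, memGb: Double)
case class Offer(offerId: String, slaveId: String, resources: Resources)
case class TaskDescriptor(taskId: String, slaveId: String, resources: Resources)

// A framework scheduler reacts to master callbacks and replies to each offer
// with the tasks it wants to launch there (an empty reply rejects the offer).
class QueueScheduler(var pendingWork: List[Resources]) {
  def resourceOffer(offerId: String, offers: Seq[Offer]): Map[String, Seq[TaskDescriptor]] =
    offers.map { offer =>
      val (fits, rest) = pendingWork.partition(need =>
        need.cpus <= offer.resources.cpus && need.memGb <= offer.resources.memGb)
      val launch = fits.take(1).zipWithIndex.map { case (need, i) =>
        TaskDescriptor(s"task-${offer.offerId}-$i", offer.slaveId, need)
      }
      pendingWork = rest ++ fits.drop(1)
      offer.offerId -> launch                  // i.e., replyToOffer(offerId, tasks)
    }.toMap

  def statusUpdate(taskId: String, status: String): Unit =
    println(s"task $taskId is now $status")

  def slaveLost(slaveId: String): Unit =
    println(s"slave $slaveId lost; its work goes back into the queue for future offers")
}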
3.6 Fault Tolerance

4 Mesos Behavior

4.1

4.2 Homogeneous Tasks
                  Elastic Framework                      Rigid Framework
                  Constant dist.   Exponential dist.     Constant dist.   Exponential dist.
Ramp-up time      T                T ln k                T                T ln k
Completion time   (1/2 + β)T       (1 + β)T              (1 + β)T         (ln k + β)T
Utilization       1                1                     β/(1/2 + β)      β/(ln k - 1 + β)

Table 2: Ramp-up time, job completion time and utilization for both elastic and rigid frameworks, and for both constant and exponential task duration distributions. The framework starts with no slots. k is the number of slots the framework is entitled to under the scheduling policy, and T represents the time it takes a job to complete assuming the framework gets all k slots at once.
Framework ramp-up time: If task durations are constant, it will take framework f at most T time to acquire
k slots. This is simply because during a T interval, every
slot will become available, which will enable Mesos to
offer the framework all k of its preferred slots. If the duration distribution is exponential, the expected ramp-up
time can be as high as T ln k [23].
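One way to see the T ln k figure (a standard calculation, not spelled out in the text): if each occupied slot frees up after an exponentially distributed time with mean T, the expected time until all k preferred slots have freed up is the expected maximum of k such variables,

\mathbb{E}\left[\max(X_1,\dots,X_k)\right] \;=\; T\sum_{i=1}^{k}\frac{1}{i} \;=\; T\,H_k \;\approx\; T\ln k, \qquad X_i \sim \mathrm{Exp}(1/T)\ \text{i.i.d.}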
Job completion time: The expected completion time3 of an elastic job is at most (1 + β)T, which is within T (i.e., the mean task duration) of the completion time of the job when it gets all its slots instantaneously. Rigid jobs achieve similar completion times for constant task durations, but exhibit much higher completion times for exponential task durations, i.e., (ln k + β)T. This is simply because it takes a framework T ln k time on average to acquire all its slots and be able to start its job.
System utilization: Elastic jobs fully utilize their allocated slots, because they can use every slot as soon
as they get it. As a result, assuming infinite demand, a
system running only elastic jobs is fully utilized. Rigid
frameworks achieve slightly worse utilizations, as their
jobs cannot start before they get their full allocations, and
thus they waste the resources held while ramping up.
4.3 Placement Preferences

4.4 Heterogeneous Tasks
So far we have assumed that frameworks have homogeneous task duration distributions, i.e., that all frameworks have the same task duration distribution. In this
section, we discuss frameworks with heterogeneous task
duration distributions. In particular, we consider a workload where tasks are either short or long, and where the
mean duration of the long tasks is significantly longer
than the mean of the short tasks. Such heterogeneous
3 When computing job completion time we assume that the last tasks
of the job running on the frameworks k slots finish at the same time.
reducing latency for new jobs and wasted work for revocation. If frameworks are elastic, they will opportunistically utilize all the resources they can obtain. Finally,
if frameworks do not accept resources that they do not
understand, they will leave them for frameworks that do.
We also note that these properties are met by many
current cluster computing frameworks, such as MapReduce and Dryad, simply because using short independent
tasks simplifies load balancing and fault recovery.
4.6
Framework Incentives
Interdependent framework constraints: It is possible to construct scenarios where, because of esoteric interdependencies between frameworks (e.g., certain tasks
from two frameworks cannot be colocated), only a single global allocation of the cluster performs well. We
argue such scenarios are rare in practice. In the model
discussed in this section, where frameworks only have
preferences over which nodes they use, we showed that
allocations approximate those of optimal schedulers.
Scale elastically: The ability of a framework to use resources as soon as it acquires them, instead of waiting
to reach a given minimum allocation, would allow the
framework to start (and complete) its jobs earlier. In addition, the ability to scale up and down allows a framework to grab unused resources opportunistically, as it can
later release them with little negative impact.
works cannot predict task times and must be able to handle failures and stragglers [18, 40, 38]. These policies
are easy to implement over resource offers.
as an executor, which may be terminated if it is not running tasks. This would make map output files unavailable
to reduce tasks. We solved this problem by providing a
shared file server on each node in the cluster to serve
local files. Such a service is useful beyond Hadoop, to
other frameworks that write data locally on each node.
In total, our Hadoop port is 1500 lines of code.
Implementation
5.2 Hadoop Port

5.3 Spark Framework
[Figure: data flow of a logistic regression job in (a) Dryad and (b) Spark; the computation repeatedly evaluates f(x, w) against a shared parameter vector w.]

Evaluation

Macrobenchmark

Macrobenchmark Workloads

Table 3: Job types for each bin in our Facebook Hadoop mix.

Bin   Job Type      Map Tasks   Reduce Tasks   # Jobs Run
1     selection     1           NA             38
2     text search   2           NA             18
3     aggregation   10          2              14
4     selection     50          NA             12
5     aggregation   100         10             6
6     selection     200         NA             6
7     text search   400         NA             4
8     join          400         30             2

4 We scaled down the largest jobs in [38] to have the workload fit a quarter of our cluster size.
Figure 5: Comparison of cluster shares (fraction of CPUs) over time for each of the frameworks in the Mesos and static partitioning
macrobenchmark scenarios. On Mesos, frameworks can scale up when their demand is high and that of other frameworks is low, and
thus finish jobs faster. Note that the plots' time axes are different (e.g., the large Hadoop mix takes 3200s with static partitioning).
Figure 6: Framework shares on Mesos during the macrobenchmark. By pooling resources, Mesos lets each workload scale
up to fill gaps in the demand of others. In addition, fine-grained
sharing allows resources to be reallocated in tens of seconds.
Torque / MPI Our Torque framework ran eight instances of the tachyon raytracing job [35] that is part of
the SPEC MPI2007 benchmark. Six of the jobs ran small
problem sizes and two ran large ones. Both types used 24
parallel tasks. We submitted these jobs at fixed times to
both clusters. The tachyon job is CPU-intensive.
6.1.2 Macrobenchmark Results
Framework             Static partitioning (s)   Mesos (s)   Speedup
Facebook Hadoop Mix   -                         6319        1.14
Large Hadoop Mix      3143                      1494        2.10
Spark                 1684                      1338        1.26
Torque / MPI          3210                      3352        0.96

Table 4: Aggregate performance of each framework in the macrobenchmark (sum of running times of all the jobs in the framework). The speedup column shows the relative gain on Mesos.
[Figure: data locality (%) and job durations under static partitioning and under Mesos with no delay scheduling, 1s delay scheduling, and 5s delay scheduling.]
Data Locality
Framework
Overhead
6.4
[Figure: running time of Hadoop vs. Spark as the number of iterations grows.]

[Figure: task launch overhead (seconds) as the number of emulated nodes grows toward 50,000.]

Spark Framework
6.6
Mesos Scalability
Failure Recovery
To evaluate Mesos scalability, we emulated large clusters by running up to 50,000 slave daemons on 99 Amazon EC2 nodes, each with 8 CPU cores and 6 GB RAM.
We used one EC2 node for the master and the rest of the
nodes to run slaves. During the experiment, each of 200
Performance Isolation
Condor. The Condor cluster manager uses the ClassAds language [32] to match nodes to jobs. Using a resource specification language is not as flexible for frameworks as resource offers, since not all requirements may
be expressible. Also, porting existing frameworks, which
have their own schedulers, to Condor would be more difficult than porting them to Mesos, where existing schedulers fit naturally into the two-level scheduling model.
Next-Generation Hadoop. Recently, Yahoo! announced a redesign for Hadoop that uses a two-level
scheduling model, where per-application masters request
resources from a central manager [14]. The design aims
to support non-MapReduce applications as well. While
details about the scheduling model in this system are currently unavailable, we believe that the new application
masters could naturally run as Mesos frameworks.
Related Work
Acknowledgements
References
5 Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the NSF.
Abstract
We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a
fault-tolerant manner. RDDs are motivated by two types
of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data
mining tools. In both cases, keeping data in memory
can improve performance by an order of magnitude.
To achieve fault tolerance efficiently, RDDs provide a
restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates
to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these
models do not capture. We have implemented RDDs in a
system called Spark, which we evaluate through a variety
of user applications and benchmarks.
Introduction
This section provides an overview of RDDs. We first define RDDs (2.1) and introduce their programming interface in Spark (2.2). We then compare RDDs with finer-grained shared memory abstractions (2.3). Finally, we
discuss limitations of the RDD model (2.4).
2.1
RDD Abstraction
differentiate them from other operations on RDDs. Examples of transformations include map, filter, and join.2
RDDs do not need to be materialized at all times. Instead, an RDD has enough information about how it was
derived from other datasets (its lineage) to compute its
partitions from data in stable storage. This is a powerful property: in essence, a program cannot reference an
RDD that it cannot reconstruct after a failure.
Finally, users can control two other aspects of RDDs:
persistence and partitioning. Users can indicate which
RDDs they will reuse and choose a storage strategy for
them (e.g., in-memory storage). They can also ask that
an RDD's elements be partitioned across machines based
on a key in each record. This is useful for placement optimizations, such as ensuring that two datasets that will
be joined together are hash-partitioned in the same way.
2.2
[Figure: lineage graph for the log mining example: lines → filter(_.startsWith("ERROR")) → errors → filter(_.contains("HDFS")) → HDFS errors → map(_.split('\t')(3)) → time fields.]

[Table residue: comparison of RDDs with distributed shared memory by aspect, including reads and writes (coarse-grained vs. fine-grained), consistency (trivial for immutable RDDs vs. up to the app/runtime), and work placement (automatic based on data locality vs. up to the app, with runtimes aiming for transparency).]
2.3
[Figure 2: Spark runtime. The user's driver program launches multiple long-lived workers, which read input data and can store computed RDD partitions in RAM; the driver sends tasks to the workers and receives results back.]
Spark provides the RDD abstraction through a languageintegrated API similar to DryadLINQ [31] in Scala [2],
a statically typed functional programming language for
the Java VM. We chose Scala due to its combination of
conciseness (which is convenient for interactive use) and
efficiency (due to static typing). However, nothing about
the RDD abstraction requires a functional language.
To use Spark, developers write a driver program that
connects to a cluster of workers, as shown in Figure 2.
The driver defines one or more RDDs and invokes actions on them. Spark code on the driver also tracks the
RDDs' lineage. The workers are long-lived processes
that can store RDD partitions in RAM across operations.
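For concreteness, a driver program of the kind described here, based on the log-mining example whose lineage appears in the figure above (a sketch; the SparkContext value `spark` and the HDFS path are placeholders for whatever the deployment provides):

// Assumes an already-constructed SparkContext named `spark`.
val lines  = spark.textFile("hdfs://...")              // RDD of log lines
val errors = lines.filter(_.startsWith("ERROR"))       // transformation: nothing computed yet
errors.persist()                                       // ask Spark to keep this RDD in RAM

// Action: count the errors, which materializes `errors` on the workers.
val total = errors.count()

// Reuse the in-memory RDD: extract the time field (column 4) of HDFS-related errors.
val hdfsTimes = errors.filter(_.contains("HDFS"))
                      .map(_.split('\t')(3))
                      .collect()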
As we showed in the log mining example in Section 2.2.1, users provide arguments to RDD opera-
Example Applications
Logistic Regression
Transformations:
  map(f : T ⇒ U)                  : RDD[T] ⇒ RDD[U]
  filter(f : T ⇒ Bool)            : RDD[T] ⇒ RDD[T]
  flatMap(f : T ⇒ Seq[U])         : RDD[T] ⇒ RDD[U]
  sample(fraction : Float)        : RDD[T] ⇒ RDD[T] (Deterministic sampling)
  groupByKey()                    : RDD[(K, V)] ⇒ RDD[(K, Seq[V])]
  reduceByKey(f : (V, V) ⇒ V)     : RDD[(K, V)] ⇒ RDD[(K, V)]
  union()                         : (RDD[T], RDD[T]) ⇒ RDD[T]
  join()                          : (RDD[(K, V)], RDD[(K, W)]) ⇒ RDD[(K, (V, W))]
  cogroup()                       : (RDD[(K, V)], RDD[(K, W)]) ⇒ RDD[(K, (Seq[V], Seq[W]))]
  crossProduct()                  : (RDD[T], RDD[U]) ⇒ RDD[(T, U)]
  mapValues(f : V ⇒ W)            : RDD[(K, V)] ⇒ RDD[(K, W)] (Preserves partitioning)
  sort(c : Comparator[K])         : RDD[(K, V)] ⇒ RDD[(K, V)]
  partitionBy(p : Partitioner[K]) : RDD[(K, V)] ⇒ RDD[(K, V)]

Actions:
  count()                         : RDD[T] ⇒ Long
  collect()                       : RDD[T] ⇒ Seq[T]
  reduce(f : (T, T) ⇒ T)          : RDD[T] ⇒ T
  lookup(k : K)                   : RDD[(K, V)] ⇒ Seq[V] (On hash/range partitioned RDDs)
  save(path : String)             : Outputs RDD to a storage system, e.g., HDFS

Table 2: Transformations and actions available on RDDs in Spark. Seq[T] denotes a sequence of elements of type T.
PageRank
[Figure 3: lineage graph for the PageRank datasets. An input file is mapped into links and ranks0; each iteration joins links with the current ranks to produce contribs, and a reduce plus map produces the next ranks (ranks1, contribs1, ranks2, contribs2, ...).]
This program leads to the RDD lineage graph in Figure 3. On each iteration, we create a new ranks dataset
based on the contribs and ranks from the previous iteration and the static links dataset.6 One interesting feature of this graph is that it grows longer with the number
6 Note that although RDDs are immutable, the variables ranks and
contribs in the program point to different RDDs on each iteration.
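The program referred to above is not reproduced in this text; a minimal Scala sketch of the iteration it describes (the variable names, iteration count, and damping constant are ours) might look like:

// Assumes an existing `links` RDD of (url, list of neighbour urls) pairs, kept
// in memory because it is reused on every iteration.
val ITERATIONS = 10
var ranks = links.mapValues(_ => 1.0)          // ranks0: every page starts with rank 1

for (i <- 1 to ITERATIONS) {
  // Each page divides its rank among its neighbours...
  val contribs = links.join(ranks).flatMap {
    case (_, (neighbours, rank)) => neighbours.map(dest => (dest, rank / neighbours.size))
  }
  // ...and the next ranks dataset sums the contributions and applies damping.
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(sum => 0.15 + 0.85 * sum)
}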
Representing RDDs
One of the challenges in providing RDDs as an abstraction is choosing a representation for them that can track
lineage across a wide range of transformations. Ideally,
a system implementing RDDs should provide as rich
a set of transformation operators as possible (e.g., the
ones in Table 2), and let users compose them in arbitrary
ways. We propose a simple graph-based representation
for RDDs that facilitates these goals. We have used this
representation in Spark to support a wide range of transformations without adding special logic to the scheduler
for each one, which greatly simplified the system design.
In a nutshell, we propose representing each RDD
through a common interface that exposes five pieces of
information: a set of partitions, which are atomic pieces
of the dataset; a set of dependencies on parent RDDs;
a function for computing the dataset based on its parents; and metadata about its partitioning scheme and data
placement. For example, an RDD representing an HDFS
file has a partition for each block of the file and knows
which machines each block is on. Meanwhile, the result
Operation       Meaning
partitions()    Return a list of Partition objects
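A sketch of that common interface as a Scala trait (our own rendering of the five pieces of information listed above; the real internal interface is not shown in this text and may differ):

// The five pieces of information the text says every RDD exposes.
trait Partition { def index: Int }
trait Dependency[T] { def parent: RDD[T] }

trait RDD[T] {
  def partitions: Seq[Partition]                      // atomic pieces of the dataset
  def dependencies: Seq[Dependency[_]]                // parent RDDs this one was derived from
  def compute(p: Partition): Iterator[T]              // recompute one partition from the parents
  def partitioner: Option[AnyRef]                     // metadata about the partitioning scheme, if any
  def preferredLocations(p: Partition): Seq[String]   // data placement hints (e.g., HDFS block hosts)
}

// Example: an RDD backed by an HDFS file would have one partition per block and
// report the machines holding each block as that partition's preferred locations.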
[Figure: examples of narrow dependencies (e.g., map, filter, union, and join with co-partitioned inputs) versus wide dependencies (e.g., groupByKey, groupBy).]

[Figure: example of how a lineage graph over RDDs A-G is divided into Stage 1, Stage 2, and Stage 3 for scheduling.]

Implementation

Job Scheduling
[Figure: example of how the Spark interpreter translates user input into Java objects. Line 1 (var query = "hello") becomes a Line1 object holding query: String = "hello"; Line 2 (rdd.filter(_.contains(query)).count()) becomes a closure whose eval(s) returns s.contains(line1.query) and that references the Line1 object.]
5.2 Interpreter Integration
Memory Management
in-memory storage as serialized data, and on-disk storage. The first option provides the fastest performance,
because the Java VM can access each RDD element
natively. The second option lets users choose a more
memory-efficient representation than Java object graphs
when space is limited, at the cost of lower performance.8
The third option is useful for RDDs that are too large to
keep in RAM but costly to recompute on each use.
To manage the limited memory available, we use an
LRU eviction policy at the level of RDDs. When a new
RDD partition is computed but there is not enough space
to store it, we evict a partition from the least recently accessed RDD, unless this is the same RDD as the one with
the new partition. In that case, we keep the old partition
in memory to prevent cycling partitions from the same
RDD in and out. This is important because most operations will run tasks over an entire RDD, so it is quite
likely that the partition already in memory will be needed
in the future. We found this default policy to work well in
all our applications so far, but we also give users further
control via a persistence priority for each RDD.
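A small sketch of the eviction rule just described (our own data structures, not Spark's memory manager):

import scala.collection.mutable

case class CachedPartition(rddId: Int, partitionId: Int, sizeBytes: Long)

class BlockStore(capacityBytes: Long) {
  private val cached = mutable.Buffer.empty[CachedPartition]
  private val lastAccess = mutable.Map.empty[Int, Long]  // rddId -> logical access clock
  private var clock = 0L
  private def used = cached.map(_.sizeBytes).sum

  def touch(rddId: Int): Unit = { clock += 1; lastAccess(rddId) = clock }

  // Insert a newly computed partition, evicting partitions from the least recently
  // accessed RDD, but never from the same RDD as the new partition, to avoid
  // cycling partitions of an RDD that is currently being scanned in and out.
  def insert(p: CachedPartition): Unit = {
    touch(p.rddId)
    while (used + p.sizeBytes > capacityBytes) {
      val evictable = cached.filter(_.rddId != p.rddId)
      if (evictable.isEmpty) return             // nothing evictable: skip caching this partition
      val victimRdd = evictable.map(_.rddId).minBy(lastAccess.getOrElse(_, 0L))
      cached.remove(cached.indexWhere(_.rddId == victimRdd))
    }
    cached += p
  }
}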
Finally, each instance of Spark on a cluster currently
has its own separate memory space. In future work, we
plan to investigate sharing RDDs across instances of
Spark through a unified memory manager.
5.4
6.1
We implemented two iterative machine learning applications, logistic regression and k-means, to compare the
performance of the following systems:
Hadoop: The Hadoop 0.20.2 stable release.
HadoopBinMem: A Hadoop deployment that converts the input data into a low-overhead binary format
in the first iteration to eliminate text parsing in later
ones, and stores it in an in-memory HDFS instance.
Evaluation

We evaluated Spark and RDDs through a series of experiments on Amazon EC2, as well as benchmarks of user applications. Overall, our results show the following:

Spark outperforms Hadoop by up to 20x in iterative machine learning and graph applications. The speedup comes from avoiding I/O and deserialization costs by storing data in memory as Java objects.

When nodes fail, Spark can recover quickly by rebuilding only the lost RDD partitions.

[Figure: first-iteration and later-iteration times (s) of logistic regression and k-means for Hadoop, HadoopBinMem, and Spark on 25, 50, and 100 machines.]
them simpler to checkpoint than general shared memory. Because consistency is not a concern, RDDs can be
written out in the background without requiring program
pauses or distributed snapshot schemes.
6.2 PageRank
[Figures: PageRank iteration times for basic Spark vs. Spark with controlled partitioning as the number of machines grows; per-iteration times with no failure vs. a failure in the 6th iteration; and a comparison of text vs. binary input.]
6.3 Fault Recovery
Traffic Modeling Researchers in the Mobile Millennium project at Berkeley [18] parallelized a learning algorithm for inferring road traffic congestion from sporadic automobile GPS measurements. The source data
were a 10,000 link road network for a metropolitan area,
as well as 600,000 samples of point-to-point trip times
for GPS-equipped automobiles (travel times for each
path may include multiple road links). Using a traffic
model, the system can estimate the time it takes to travel
across individual road links. The researchers trained this
model using an expectation maximization (EM) algorithm that repeats two map and reduceByKey steps iteratively. The application scales nearly linearly from 20 to
80 nodes with 4 cores each, as shown in Figure 13(a).
In-Memory Analytics Conviva Inc., a video distribution company, used Spark to accelerate a number of data
analytics reports that previously ran over Hadoop. For
example, one report ran as a series of Hive [1] queries
that computed various statistics for a customer. These
queries all worked on the same subset of the data (records
matching a customer-provided filter), but performed aggregations (averages, percentiles, and COUNT DISTINCT)
over different grouping fields, requiring separate MapReduce jobs. By implementing the queries in Spark and
loading the subset of data shared across them once into
an RDD, the company was able to speed up the report by
40x. A report on 200 GB of compressed data that took
20 hours on a Hadoop cluster now runs in 30 minutes
using only two Spark machines. Furthermore, the Spark
program only required 96 GB of RAM, because it only
stored the rows and columns matching the customer's filter in an RDD, not the whole decompressed file.
[Figures: scaling with the number of machines, and query response times on 100 GB, 500 GB, and 1 TB of data (plot residue; detailed values omitted).]

6.5
Discussion
Although RDDs seem to offer a limited programming interface due to their immutable nature and coarse-grained
transformations, we have found them suitable for a wide
class of applications. In particular, RDDs can express a
surprising number of cluster programming models that
have so far been proposed as separate frameworks, allowing users to compose these models in one program
(e.g., run a MapReduce operation to build a graph, then
run Pregel on it) and share data between them. In this section, we discuss which programming models RDDs can
express and why they are so widely applicable (7.1). In
addition, we discuss another benefit of the lineage information in RDDs that we are pursuing, which is to facilitate debugging across these models (7.2).
7.1
RDDs can efficiently express a number of cluster programming models that have so far been proposed independently. By efficiently, we mean that not only can
RDDs be used to produce the same output as programs
written in these models, but that RDDs can also capture
the optimizations that these frameworks perform, such as
keeping specific data in memory, partitioning it to minimize communication, and recovering from failures efficiently. The models expressible using RDDs include:
MapReduce: This model can be expressed using the
flatMap and groupByKey operations in Spark, or reduceByKey if there is a combiner.
DryadLINQ: The DryadLINQ system provides a
wider range of operators than MapReduce over the more
general Dryad runtime, but these are all bulk operators
that correspond directly to RDD transformations available in Spark (map, groupByKey, join, etc).
SQL: Like DryadLINQ expressions, SQL queries perform data-parallel operations on sets of records.
Pregel: Google's Pregel [22] is a specialized model for
iterative graph applications that at first looks quite different from the set-oriented programming models in other
systems. In Pregel, a program runs as a series of coordinated supersteps. On each superstep, each vertex in the
graph runs a user function that can update state associated with the vertex, change the graph topology, and send
messages to other vertices for use in the next superstep.
This model can express many graph algorithms, including shortest paths, bipartite matching, and PageRank.
The key observation that lets us implement this model
with RDDs is that Pregel applies the same user function
to all the vertices on each iteration. Thus, we can store the
vertex states for each iteration in an RDD and perform
a bulk transformation (flatMap) to apply this function
and generate an RDD of messages. We can then join this
RDD with the vertex states to perform the message exchange. Equally importantly, RDDs allow us to keep vertex states in memory like Pregel does, to minimize communication by controlling their partitioning, and to support partial recovery on failures. We have implemented
Pregel as a 200-line library on top of Spark and refer the
reader to [33] for more details.
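A hedged sketch of one such superstep over RDDs (our own simplification of the library mentioned above, written against the present-day Spark API for concreteness; the message and vertex types are placeholders):

import org.apache.spark.rdd.RDD

case class VertexState(value: Double, halted: Boolean)
case class Message(payload: Double)

// One Pregel-style superstep: join vertex states with their incoming messages,
// apply the user's vertex program in bulk, and regroup the outgoing messages
// by destination vertex for the next superstep.
def superstep(
    vertices: RDD[(String, VertexState)],
    messages: RDD[(String, Seq[Message])],
    vprog: (String, VertexState, Seq[Message]) => (VertexState, Seq[(String, Message)])
): (RDD[(String, VertexState)], RDD[(String, Seq[Message])]) = {
  val updated = vertices.join(messages).map {
    case (id, (state, msgs)) => (id, vprog(id, state, msgs))
  }
  val newStates = updated.mapValues { case (state, _) => state }
  val outbound  = updated.flatMap { case (_, (_, out)) => out }   // (destination, message) pairs
  (newStates, outbound.groupByKey().mapValues(_.toSeq))
}

In a real implementation one would also persist the vertex-state RDD in memory and control its partitioning so that the join does not reshuffle it on every superstep, which is exactly the benefit described above.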
Iterative MapReduce: Several recently proposed systems, including HaLoop [7] and Twister [11], provide an
iterative MapReduce model where the user gives the system a series of MapReduce jobs to loop. The systems
keep data partitioned consistently across iterations, and
Twister can also keep it in memory. Both optimizations
are simple to express with RDDs, and we were able to
implement HaLoop as a 200-line library using Spark.
Batched Stream Processing: Researchers have recently proposed several incremental processing systems
for applications that periodically update a result with
new data [21, 15, 4]. For example, an application updating statistics about ad clicks every 15 minutes should be
able to combine intermediate state from the previous 15-minute window with data from new logs. These systems
perform bulk operations similar to Dryad, but store application state in distributed filesystems. Placing the intermediate state in RDDs would speed up their processing.
Explaining the Expressivity of RDDs Why are RDDs
able to express these diverse programming models? The
reason is that the restrictions on RDDs have little impact in many parallel applications. In particular, although
RDDs can only be created through bulk transformations,
many parallel programs naturally apply the same operation to many records, making them easy to express. Similarly, the immutability of RDDs is not an obstacle because one can create multiple RDDs to represent versions
of the same dataset. Indeed, many of today's MapReduce
applications run over filesystems that do not allow updates to files, such as HDFS.
One final question is why previous frameworks have
not offered the same level of generality. We believe that
this is because these systems explored specific problems
that MapReduce and Dryad do not handle well, such as
iteration, without observing that the common cause of
these problems was a lack of data sharing abstractions.
7.2
Related Work
Cluster Programming Models: Related work in cluster programming models falls into several classes. First,
data flow models such as MapReduce [10], Dryad [19]
and Ciel [23] support a rich set of operators for processing data but share it through stable storage systems.
RDDs represent a more efficient data sharing abstraction
than stable storage because they avoid the cost of data
replication, I/O and serialization.10
Second, several high-level programming interfaces
for data flow systems, including DryadLINQ [31] and
FlumeJava [8], provide language-integrated APIs where
the user manipulates parallel collections through operators like map and join. However, in these systems,
the parallel collections represent either files on disk or
ephemeral datasets used to express a query plan. Although the systems will pipeline data across operators
in the same query (e.g., a map followed by another
map), they cannot share data efficiently across queries.
We based Spark's API on the parallel collection model
due to its convenience, and do not claim novelty for the
language-integrated interface, but by providing RDDs as
the storage abstraction behind this interface, we allow it
to support a far broader class of applications.
A third class of systems provide high-level interfaces
for specific classes of applications requiring data sharing.
For example, Pregel [22] supports iterative graph applications, while Twister [11] and HaLoop [7] are iterative
MapReduce runtimes. However, these frameworks perform data sharing implicitly for the pattern of computation they support, and do not provide a general abstraction that the user can employ to share data of her choice
among operations of her choice. For example, a user cannot use Pregel or Twister to load a dataset into memory
and then decide what query to run on it. RDDs provide
a distributed storage abstraction explicitly and can thus
support applications that these specialized systems do
not capture, such as interactive data mining.
Finally, some systems expose shared mutable state
to allow the user to perform in-memory computation.
For example, Piccolo [27] lets users run parallel functions that read and update cells in a distributed hash
table. Distributed shared memory (DSM) systems [24]
9 Unlike these systems, an RDD-based debugger will not replay non-
and key-value stores like RAMCloud [25] offer a similar model. RDDs differ from these systems in two ways.
First, RDDs provide a higher-level programming interface based on operators such as map, sort and join,
whereas the interface in Piccolo and DSM is just reads
and updates to table cells. Second, Piccolo and DSM systems implement recovery through checkpoints and rollback, which is more expensive than the lineage-based
strategy of RDDs in many applications. Finally, as discussed in Section 2.3, RDDs also provide other advantages over DSM, such as straggler mitigation.
Caching Systems: Nectar [12] can reuse intermediate
results across DryadLINQ jobs by identifying common
subexpressions with program analysis [16]. This capability would be compelling to add to an RDD-based system.
However, Nectar does not provide in-memory caching (it
places the data in a distributed file system), nor does it
let users explicitly control which datasets to persist and
how to partition them. Ciel [23] and FlumeJava [8] can
likewise cache task results but do not provide in-memory
caching or explicit control over which data is cached.
Ananthanarayanan et al. have proposed adding an in-memory cache to distributed file systems to exploit the
temporal and spatial locality of data access [3]. While
this solution provides faster access to data that is already
in the file system, it is not as efficient a means of sharing intermediate results within an application as RDDs,
because it would still require applications to write these
results to the file system between stages.
Lineage: Capturing lineage or provenance information
for data has long been a research topic in scientific computing and databases, for applications such as explaining
results, allowing them to be reproduced by others, and
recomputing data if a bug is found in a workflow or if
a dataset is lost. We refer the reader to [5] and [9] for
surveys of this work. RDDs provide a parallel programming model where fine-grained lineage is inexpensive to
capture, so that it can be used for failure recovery.
Our lineage-based recovery mechanism is also similar
to the recovery mechanism used within a computation
(job) in MapReduce and Dryad, which track dependencies among a DAG of tasks. However, in these systems,
the lineage information is lost after a job ends, requiring
the use of a replicated storage system to share data across
computations. In contrast, RDDs apply lineage to persist
in-memory data efficiently across computations, without
the cost of replication and disk I/O.
Relational Databases: RDDs are conceptually similar
to views in a database, and persistent RDDs resemble
materialized views [28]. However, like DSM systems,
databases typically allow fine-grained read-write access
to all records, requiring logging of operations and data
for fault tolerance and additional overhead to maintain
Conclusion
Acknowledgements
We thank the first Spark users, including Tim Hunter,
Lester Mackey, Dilip Joseph, and Jibin Zhan, for trying
out our system in their real applications, providing many
good suggestions, and identifying a few research challenges along the way. We also thank our shepherd, Ed
Nightingale, and our reviewers for their feedback. This
research was supported in part by Berkeley AMP Lab
sponsors Google, SAP, Amazon Web Services, Cloudera, Huawei, IBM, Intel, Microsoft, NEC, NetApp and
VMWare, by DARPA (contract #FA8650-11-C-7136),
by a Google PhD Fellowship, and by the Natural Sciences and Engineering Research Council of Canada.
References
[1] Apache Hive. http://hadoop.apache.org/hive.
[2] Scala. http://www.scala-lang.org.
[3] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica.
Disk-locality in datacenter computing considered irrelevant. In
HotOS 11, 2011.
[4] P. Bhatotia, A. Wieder, R. Rodrigues, U. A. Acar, and
R. Pasquin. Incoop: MapReduce for incremental computations.
In ACM SOCC 11, 2011.
[5] R. Bose and J. Frew. Lineage retrieval for scientific data
processing: a survey. ACM Computing Surveys, 37:1-28, 2005.
[6] S. Brin and L. Page. The anatomy of a large-scale hypertextual
web search engine. In WWW, 1998.
[7] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. HaLoop:
efficient iterative data processing on large clusters. Proc. VLDB
Endow., 3:285-296, September 2010.
[8] C. Chambers, A. Raniwala, F. Perry, S. Adams, R. R. Henry,
R. Bradshaw, and N. Weizenbaum. FlumeJava: easy, efficient
data-parallel pipelines. In PLDI 10. ACM, 2010.
[9] J. Cheney, L. Chiticariu, and W.-C. Tan. Provenance in
databases: Why, how, and where. Foundations and Trends in
Databases, 1(4):379-474, 2009.
[10] J. Dean and S. Ghemawat. MapReduce: Simplified data
processing on large clusters. In OSDI, 2004.
ABSTRACT
Cassandra is a distributed storage system for managing very
large amounts of structured data spread out across many
commodity servers, while providing highly available service
with no single point of failure. Cassandra aims to run on top
of an infrastructure of hundreds of nodes (possibly spread
across different data centers). At this scale, small and large
components fail continuously. The way Cassandra manages the persistent state in the face of these failures drives
the reliability and scalability of the software systems relying on this service. While in many ways Cassandra resembles a database and shares many design and implementation
strategies therewith, Cassandra does not support a full relational data model; instead, it provides clients with a simple
data model that supports dynamic control over data layout and format. The Cassandra system was designed to run on
cheap commodity hardware and handle high write throughput while not sacrificing read efficiency.
1. INTRODUCTION
Prashant Malik
Facebook
2. RELATED WORK
Distributing data for performance, availability and durability has been widely studied in the file system and database
communities. Compared to P2P storage systems that only
support flat namespaces, distributed file systems typically
support hierarchical namespaces. Systems like Ficus[14] and
Coda[16] replicate files for high availability at the expense
of consistency. Update conflicts are typically managed using specialized conflict resolution procedures. Farsite[2] is
a distributed file system that does not use any centralized
server. Farsite achieves high availability and scalability using replication. The Google File System (GFS)[9] is another
distributed file system built for hosting the state of Google's
internal applications. GFS uses a simple design with a single master server for hosting the entire metadata and where
the data is split into chunks and stored in chunk servers.
However the GFS master is now made fault tolerant using
the Chubby[3] abstraction. Bayou[18] is a distributed relational database system that allows disconnected operations
and provides eventual data consistency. Among these systems, Bayou, Coda and Ficus allow disconnected operations
and are resilient to issues such as network partitions and
outages. These systems differ in their conflict resolution
procedures. For instance, Coda and Ficus perform system
level conflict resolution and Bayou allows application level
resolution. All of them however, guarantee eventual consistency. Similar to these systems, Dynamo[6] allows read and
write operations to continue even during network partitions
and resolves update conflicts using different conflict resolution mechanisms, some client-driven. Traditional replicated
relational database systems focus on the problem of guaranteeing strong consistency of replicated data. Although
strong consistency provides the application writer a convenient programming model, these systems are limited in
scalability and availability [10]. These systems are not capable of handling network partitions because they typically
provide strong consistency guarantees.
Dynamo[6] is a storage system that is used by Amazon
to store and retrieve user shopping carts. Dynamo's Gossip-
based membership algorithm helps every node maintain information about every other node. Dynamo can be defined
as a structured overlay with at most one-hop request routing. Dynamo detects update conflicts using a vector clock
scheme, but prefers a client side conflict resolution mechanism. A write operation in Dynamo also requires a read to
be performed for managing the vector timestamps. This
can be very limiting in environments where systems need
to handle a very high write throughput. Bigtable[4] provides both structure and data distribution but relies on a
distributed file system for its durability.
3. DATA MODEL

4. API
columnName can refer to a specific column within a column family, a column family, a super column family, or a
column within a super column.
5. SYSTEM ARCHITECTURE
The architecture of a storage system that needs to operate in a production setting is complex. In addition to
the actual data persistence component, the system needs to
have the following characteristics: scalable and robust solutions for load balancing, membership and failure detection,
failure recovery, replica synchronization, overload handling,
state transfer, concurrency and job scheduling, request marshalling, request routing, system monitoring and alarming,
and configuration management. Describing the details of
each of the solutions is beyond the scope of this paper, so
we will focus on the core distributed systems techniques used
in Cassandra: partitioning, replication, membership, failure
handling and scaling. All these modules work in synchrony
to handle read/write requests. Typically a read/write request for a key gets routed to any node in the Cassandra
cluster. The node then determines the replicas for this particular key. For writes, the system routes the requests to
the replicas and waits for a quorum of replicas to acknowledge the completion of the writes. For reads, based on the
consistency guarantees required by the client, the system either routes the requests to the closest replica or routes the
requests to all replicas and waits for a quorum of responses.
5.1 Partitioning

5.2 Replication

5.3 Membership
Cluster membership in Cassandra is based on Scuttlebutt[19], a very efficient anti-entropy Gossip-based mechanism. The salient feature of Scuttlebutt is that it has very
efficient CPU utilization and very efficient utilization of the
gossip channel. Within the Cassandra system Gossip is not
only used for membership but also to disseminate other system related control state.
5.3.1 Failure Detection

5.4 Bootstrapping

5.5

5.6 Local Persistence

5.7 Implementation Details
data we look into the latest file first and return if we find
the data. Over time the number of data files will increase
on disk. We perform a compaction process, very much like
the Bigtable system, which merges multiple files into one;
essentially merge sort on a bunch of sorted data files. The
system will always compact files that are close to each other
with respect to size, i.e., there will never be a situation where a
100GB file is compacted with a file which is less than 50GB.
Periodically a major compaction process is run to compact
all related data files into one big file. This compaction process is a disk I/O intensive operation. Many optimizations
can be put in place to not affect incoming read requests.
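A rough sketch of that size-similarity rule for choosing compaction candidates (our own illustration, with an assumed size ratio; the actual compaction code is not shown in this text):

// Pick groups of on-disk data files that are close to each other in size, so a
// very large file is never merged with a much smaller one.
case class DataFile(path: String, sizeBytes: Long)

def compactionCandidates(files: Seq[DataFile], sizeRatio: Double = 2.0, minGroup: Int = 2): Seq[Seq[DataFile]] = {
  val bySize = files.sortBy(_.sizeBytes)
  // Group consecutive files whose sizes stay within `sizeRatio` of the group's smallest member.
  val groups = bySize.foldLeft(List.empty[List[DataFile]]) {
    case (Nil, f) => List(List(f))
    case (current :: done, f) if f.sizeBytes <= current.head.sizeBytes * sizeRatio =>
      (current :+ f) :: done
    case (done, f) => List(f) :: done
  }
  groups.filter(_.size >= minGroup).map(_.toSeq)
}

// Example: two 1 GB files form a candidate group; the 100 GB file is left alone.
val picked = compactionCandidates(Seq(
  DataFile("a", 1L << 30), DataFile("b", 1L << 30), DataFile("c", 100L * (1L << 30))))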
6. PRACTICAL EXPERIENCES
6.1 Facebook Inbox Search
For Inbox Search we maintain a per-user index of all messages that have been exchanged between the sender and the recipients of the message. There are two kinds of search features that are enabled today: (a) term search; (b) interactions, where given the name of a person we return all messages that the user might have ever sent to or received from that person. The schema consists of two column families. For query (a) the user id is the key and the words that make up the message become the super column. Individual message identifiers of the messages that contain the word become the columns within the super column. For query (b) again the user id is the key and the recipients' ids are the super columns. For each of these super columns the individual message identifiers are the columns. In order to make the searches fast Cassandra provides certain hooks for intelligent caching of data. For instance when a user clicks into the search bar an asynchronous message is sent to the Cassandra cluster to prime the buffer cache with that user's index. This way when the actual search query is executed the search results are likely to already be in memory. The system currently stores about 50+TB of data on a 150 node cluster, which is spread out between east and west coast data centers. We show some production measured numbers for read performance.
Latency Stat    Search Interactions    Term Search
Min             7.69ms                 7.78ms
Median          15.69ms                18.27ms
Max             26.13ms                44.41ms
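As an illustration of the schema just described (hypothetical keys and message ids, not production data), the two column families can be pictured as nested maps from row key to super column to columns:

# Row key -> super column -> column -> value.
term_search = {
    "user:1001": {                                   # row key: the user id
        "hello": {"msg:9001": "", "msg:9057": ""},   # super column: a word
        "offsite": {"msg:9057": ""},                 # columns: message identifiers
    }
}

interactions = {
    "user:1001": {                                        # row key: the user id
        "user:2002": {"msg:9001": "", "msg:8432": ""},    # super column: recipient id
        "user:3003": {"msg:7110": ""},                    # columns: message identifiers
    }
}

def messages_containing(user, word):
    return sorted(term_search.get(user, {}).get(word, {}))

def messages_with(user, other):
    return sorted(interactions.get(user, {}).get(other, {}))

print(messages_containing("user:1001", "hello"))   # ['msg:9001', 'msg:9057']
print(messages_with("user:1001", "user:2002"))     # ['msg:8432', 'msg:9001']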
We experimented with various implementations of Failure Detectors such as the ones described in [15] and [5]. Our experience has been that the time to detect failures increased beyond an acceptable limit as the size of the cluster grew. In one particular experiment, in a cluster of 100 nodes, the time taken to detect a failed node was on the order of two minutes. This is practically unworkable in our environments. With the accrual failure detector with a slightly conservative value of PHI, set to 5, the average time to detect failures in the above experiment was about 15 seconds.
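A minimal sketch of a phi accrual failure detector follows, assuming exponentially distributed heartbeat inter-arrival times (a common simplification); the class and threshold handling are illustrative rather than Cassandra's implementation:

import math, time

class AccrualFailureDetector:
    """A node is suspected once phi crosses a threshold such as 5."""

    def __init__(self, threshold=5.0):
        self.threshold = threshold
        self.intervals = []            # recent heartbeat inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self, now=None):
        now = now if now is not None else time.time()
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
            self.intervals = self.intervals[-1000:]    # sliding window
        self.last_heartbeat = now

    def phi(self, now=None):
        if not self.intervals:
            return 0.0
        now = now if now is not None else time.time()
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # P(no heartbeat yet | exponential) = exp(-elapsed/mean); phi = -log10(P)
        return elapsed / mean * math.log10(math.e)

    def suspect(self, now=None):
        return self.phi(now) > self.threshold

d = AccrualFailureDetector()
for t in [0, 1, 2, 3, 4]:
    d.heartbeat(t)
print(d.phi(now=4.5), d.suspect(now=4.5))    # low phi, not suspected
print(d.phi(now=20), d.suspect(now=20))      # high phi, suspected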
Monitoring is not to be taken for granted. The Cassandra system is well integrated with Ganglia [12], a distributed performance monitoring tool. We expose various system level metrics to Ganglia and this has helped us understand the behavior of the system when subjected to our production workload. Disks fail for no apparent reason. The bootstrap algorithm has some
7. CONCLUSION
We have built, implemented, and operated a storage system providing scalability, high performance, and wide applicability. We have empirically demonstrated that Cassandra can support a very high update throughput while delivering low latency. Future work involves adding compression, the ability to support atomicity across keys, and secondary index support.
8. ACKNOWLEDGEMENTS
9. REFERENCES
Microsoft
Abstract
General Terms
Keywords
1. Introduction
http(s)://AccountName.<service>.core.windows.net/PartitionName/ObjectName
<service> specifies the service type, which can be blob, table, or queue. APIs for Windows Azure Blobs, Tables, and Queues can be found here: http://msdn.microsoft.com/en-us/library/dd179355.aspx
Figure 1: High-level architecture. A DNS lookup on https://AccountName.service.core.windows.net/ routes an account's Blob, Table, and Queue requests to a storage stamp's VIP. The Location Service handles account management and DNS updates, and inter-stamp replication runs between storage stamps; within each stamp, Front-Ends sit above the Partition Layer and the Stream Layer, which performs intra-stamp replication.
Figure 1 shows the location service with two storage stamps and
the layers within the storage stamps. The LS tracks the resources
used by each storage stamp in production across all locations.
When an application requests a new account for storing data, it
specifies the location affinity for the storage (e.g., US North).
The LS then chooses a storage stamp within that location as the
primary stamp for the account using heuristics based on the load
information across all stamps (which considers the fullness of the
stamps and other metrics such as network and transaction
utilization). The LS then stores the account metadata information
in the chosen storage stamp, which tells the stamp to start taking
traffic for the assigned account. The LS then updates DNS to allow requests to now route from the name https://AccountName.service.core.windows.net/ to that storage stamp's virtual IP (VIP, an IP address the storage stamp exposes for external traffic).
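The account-creation flow can be sketched roughly as follows; the load score, stamp records, and function names are hypothetical stand-ins for the heuristics the LS actually uses:

def choose_primary_stamp(stamps, location):
    """Pick the least 'full' stamp in the requested location; a toy score stands
    in for the real heuristics (fullness, network and transaction utilization)."""
    candidates = [s for s in stamps if s["location"] == location]
    return min(candidates, key=lambda s: s["fullness"])

stamps = [
    {"name": "stamp-1", "location": "US North", "fullness": 0.62, "vip": "10.0.0.1"},
    {"name": "stamp-2", "location": "US North", "fullness": 0.35, "vip": "10.0.0.2"},
    {"name": "stamp-3", "location": "US South", "fullness": 0.20, "vip": "10.0.0.3"},
]

dns = {}
def create_account(account, location):
    stamp = choose_primary_stamp(stamps, location)
    stamp.setdefault("accounts", []).append(account)             # store account metadata in the stamp
    dns[f"{account}.service.core.windows.net"] = stamp["vip"]    # route traffic to the stamp's VIP
    return stamp["name"]

print(create_account("myaccount", "US North"))   # stamp-2
print(dns)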
4. Stream Layer
The two main architecture components of the stream layer are the
Stream Manager (SM) and Extent Node (EN) (shown in Figure 3).
Figure 3: The stream layer. A stream (e.g., //foo) is an ordered list of pointers to extents (here E1, E2, and E3 are sealed and E4 is unsealed). Stream Manager (SM) replicas, kept consistent with Paxos, handle extent creation (A) and allocate the extent's replica set (B); the partition layer client then writes to the primary EN, which forwards the append to the two secondary ENs and acknowledges the client once all replicas have written.
The SM periodically polls (syncs) the state of the ENs and what
extents they store. If the SM discovers that an extent is replicated
on fewer than the expected number of ENs, a re-replication of the
extent will lazily be created by the SM to regain the desired level
of replication. For extent replica placement, the SM randomly
chooses ENs across different fault domains, so that they are stored
on nodes that will not have correlated failures due to power,
network, or being on the same rack.
The SM does not know anything about blocks, just streams and
extents. The SM is off the critical path of client requests and does
not track each block append, since the total number of blocks can
be huge and the SM cannot scale to track those. Since the stream
and extent state is only tracked within a single stamp, the amount
of state can be kept small enough to fit in the SM's memory. The
only client of the stream layer is the partition layer, and the
partition layer and stream layer are co-designed so that they will
not use more than 50 million extents and no more than 100,000
streams for a single storage stamp given our current stamp sizes.
This parameterization can comfortably fit into 32GB of memory
for the SM.
Extent Nodes (EN). Each extent node maintains the storage for
a set of extent replicas assigned to it by the SM. An EN has N
disks attached, which it completely controls for storing extent
replicas and their blocks. An EN knows nothing about streams,
and only deals with extents and blocks. Internally on an EN
server, every extent on disk is a file, which holds data blocks and
their checksums, and an index which maps extent offsets to blocks
and their file location. Each extent node contains a view about the
extents it owns and where the peer replicas are for a given extent.
This view is a cache kept by the EN of the global state the SM
keeps. ENs only talk to other ENs to replicate block writes
(appends) sent by a client, or to create additional copies of an
existing replica when told to by the SM. When an extent is no
longer referenced by any stream, the SM garbage collects the
extent and notifies the ENs to reclaim the space.
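A toy model of an extent replica on an EN is sketched below (the on-disk layout, names, and checksum choice are assumptions, not the WAS format); it shows the index mapping extent offsets to block locations and checksums:

import zlib

class ExtentReplica:
    """Appended blocks are stored with checksums, and an index maps the extent
    offset of each block to its position in the replica's file."""

    def __init__(self):
        self.file = bytearray()      # stands in for the extent's file on disk
        self.index = {}              # extent offset -> (file offset, length, checksum)
        self.commit_length = 0       # extent offset after the last committed block

    def append_block(self, data: bytes) -> int:
        extent_offset = self.commit_length
        self.index[extent_offset] = (len(self.file), len(data), zlib.crc32(data))
        self.file.extend(data)
        self.commit_length += len(data)
        return extent_offset

    def read_block(self, extent_offset: int) -> bytes:
        file_offset, length, checksum = self.index[extent_offset]
        data = bytes(self.file[file_offset:file_offset + length])
        assert zlib.crc32(data) == checksum, "corrupt block"
        return data

replica = ExtentReplica()
off = replica.append_block(b"record-1")
replica.append_block(b"record-2")
print(replica.read_block(off))       # b'record-1'
print(replica.commit_length)         # 16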
appends for an extent are committed in order, and how extents are
sealed upon failures (discussed in Section 4.3.2).
When a stream is opened, the metadata for its extents is cached at
the client, so the client can go directly to the ENs for reading and
writing without talking to the SM until the next extent needs to be
allocated for the stream. If, during writing, one of the replica's ENs is not reachable or there is a disk failure for one of the replicas, a write failure is returned to the client. The client then
contacts the SM, and the extent that was being appended to is
sealed by the SM at its current commit length (see Section 4.3.2).
At this point the sealed extent can no longer be appended to. The
SM will then allocate a new extent with replicas on different
(available) ENs, which makes it now the last extent of the stream.
The information for this new extent is returned to the client. The
client then continues appending to the stream with its new extent.
This process of sealing by the SM and allocating the new extent is
done on average within 20ms. A key point here is that the client
can continue appending to a stream as soon as the new extent has
been allocated, and it does not rely on a specific node to become
available again.
For the newly sealed extent, the SM will create new replicas to
bring it back to the expected level of redundancy in the
background if needed.
4.3.2 Sealing
To seal an extent, the SM asks all three ENs their current length.
During sealing, either all replicas have the same length, which is
the simple case, or a given replica is longer or shorter than another
replica for the extent. This latter case can only occur during an
append failure where some but not all of the ENs for the replica
are available (i.e., some of the replicas get the append block, but
not all of them). We guarantee that the SM will seal the extent
even if the SM may not be able to reach all the ENs involved.
When sealing the extent, the SM will choose the smallest commit
length based on the available ENs it can talk to. This will not
cause data loss since the primary EN will not return success
unless all replicas have been written to disk for all three ENs. This
means the smallest commit length is sure to contain all the writes
that have been acknowledged to the client. In addition, it is also
fine if the final length contains blocks that were never
acknowledged back to the client, since the client (partition layer)
correctly deals with these as described in Section 4.2. During the
sealing, all of the extent replicas that were reachable by the SM
are sealed to the commit length chosen by the SM.
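The sealing rule can be expressed in a few lines; this is a simplified sketch with hypothetical inputs, not the SM's code:

def seal_extent(replica_lengths, reachable):
    """replica_lengths: current commit length per EN; reachable: ENs the SM can talk to.
    Seal at the smallest reachable length, which is guaranteed to contain every
    append that was acknowledged to the client (acks require all three replicas)."""
    commit_length = min(replica_lengths[en] for en in reachable)
    for en in reachable:
        replica_lengths[en] = commit_length   # reachable replicas are synced to this length
    return commit_length

# Example: an append reached ENs "a" and "b" but not "c", so it was never acknowledged.
lengths = {"a": 1024, "b": 1024, "c": 896}
print(seal_extent(lengths, reachable=["a", "c"]))   # 896: acknowledged data all lies below 896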
When reads are issued for an extent that has three replicas, they
are submitted with a deadline value which specifies that the
read should not be attempted if it cannot be fulfilled within the
deadline. If the EN determines the read cannot be fulfilled within
the time constraint, it will immediately reply to the client that the
deadline cannot be met. This mechanism allows the client to
select a different EN to read that data from, likely allowing the
read to complete faster.
Once the sealing is done, the commit length of the extent will
never be changed. If an EN was not reachable by the SM during
the sealing process but later becomes reachable, the SM will force
the EN to synchronize the given extent to the chosen commit
length. This ensures that once an extent is sealed, all its available
replicas (the ones the SM can eventually reach) are bitwise
identical.
This method is also used with erasure coded data. When reads
cannot be serviced in a timely manner due to a heavily loaded spindle holding the data fragment, the read may be serviced faster by
doing a reconstruction rather than reading that data fragment. In
this case, reads (for the range of the fragment needed to satisfy the
client request) are issued to all fragments of an erasure coded
extent, and the first N responses are used to reconstruct the desired
fragment.
The durability contract for the stream layer is that when data is
acknowledged as written by the stream layer, there must be at
least three durable copies of the data stored in the system. This
contract allows the system to maintain data durability even in the
face of a cluster-wide power failure. We operate our storage
system in such a way that all writes are made durable to power
safe storage before they are acknowledged back to the client.
As part of maintaining the durability contract while still achieving
good performance, an important optimization for the stream layer
is that on each extent node we reserve a whole disk drive or SSD
as a journal drive for all writes into the extent node. The journal
drive [11] is dedicated solely for writing a single sequential
journal of data, which allows us to reach the full write throughput
potential of the device. When the partition layer does a stream
append, the data is written by the primary EN while in parallel
sent to the two secondaries to be written. When each EN
performs its append, it (a) writes all of the data for the append to
the journal drive and (b) queues up the append to go to the data
disk where the extent file lives on that EN. Once either succeeds,
success can be returned. If the journal succeeds first, the data is
also buffered in memory while it goes to the data disk, and any
reads for that data are served from memory until the data is on the
data disk. From that point on, the data is served from the data
disk. This also enables the combining of contiguous writes into
larger writes to the data disk, and better scheduling of concurrent
writes and reads to get the best throughput. It is a tradeoff for
good latency at the cost of an extra write off the critical path.
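A rough sketch of this journal-drive write path follows (hypothetical class and method names; the real system overlaps the journal and data-disk writes rather than flushing on demand):

class ExtentNode:
    def __init__(self):
        self.journal = []         # dedicated journal drive: one sequential log of all writes
        self.data_disk = {}       # extent files on the data disks, keyed by (extent, offset)
        self.mem_buffer = {}      # appends whose data-disk write is still queued

    def append(self, extent, offset, data):
        self.journal.append((extent, offset, data))   # (a) sequential write to the journal
        self.mem_buffer[(extent, offset)] = data      # (b) queue for the data disk, buffer in memory
        return "ok"                                    # ack once the journal write has landed

    def flush_to_data_disk(self):
        # Queued writes reach the data disk later; contiguous appends could be combined here.
        while self.mem_buffer:
            key, data = self.mem_buffer.popitem()
            self.data_disk[key] = data

    def read(self, extent, offset):
        key = (extent, offset)
        if key in self.mem_buffer:     # data not yet on the data disk: serve from memory
            return self.mem_buffer[key]
        return self.data_disk[key]

en = ExtentNode()
en.append("E4", 0, b"block-0")
print(en.read("E4", 0))    # served from memory
en.flush_to_data_disk()
print(en.read("E4", 0))    # now served from the data disk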
Each of the above OTs has a fixed schema stored in the Schema
Table. The primary key for the Blob Table, Entity Table, and
Message Table consists of three properties: AccountName,
PartitionName, and ObjectName. These properties provide the
indexing and sort order for those Object Tables.
The property types supported for an OT's schema are the standard simple types (bool, binary, string, DateTime, double, GUID, int32, int64). In addition, the system supports two special types: DictionaryType and BlobType. The DictionaryType allows for flexible properties (i.e., without a fixed schema) to be added to a row at any time. These flexible properties are stored inside of the dictionary type as (name, type, value) tuples. From a data access standpoint, these flexible properties behave like first-order properties of the row and are queryable just like any other property in the row. The BlobType is a special property used to store large amounts of data and is currently used only by the Blob Table. BlobType avoids storing the blob data bits with the row properties in the row data stream. Instead, the blob data bits are stored in a separate blob data stream and a pointer to the blob's data bits (list of extent + offset, length pointers) is stored in the BlobType's property in the row. This keeps the large data bits separated from the OT's queryable row property values stored in the row data stream.
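To illustrate how DictionaryType properties sit alongside fixed-schema properties, here is a small hypothetical row representation and lookup (not the actual storage format):

row = {
    "AccountName": "myaccount",          # fixed-schema key properties
    "PartitionName": "photos",
    "ObjectName": "img001",
    "_dictionary": [                     # flexible properties, no fixed schema
        ("camera", "string", "X100"),
        ("iso", "int32", 800),
    ],
}

def get_property(row, name):
    """Fixed and flexible properties look the same to a query."""
    if name in row and name != "_dictionary":
        return row[name]
    for prop_name, _type, value in row.get("_dictionary", []):
        if prop_name == name:
            return value
    return None

print(get_property(row, "ObjectName"))   # img001
print(get_property(row, "iso"))          # 800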
5. Partition Layer
Figure: Partition layer architecture. The Partition Manager (PM) assigns RangePartitions to partition servers (PS1, PS2, PS3), monitors their lease status through the Lock Service, and performs load balancing; the Front-End/Client looks up a partition in the Partition Map Table, which the PM updates, and each partition server persists and reads its partition state from streams in the stream layer.
Figure: A RangePartition's state: a memory table, adaptive range profiling, bloom filters, and load metrics in memory, together with its streams (each a list of extent pointers, including a metadata stream) in the stream layer.
Row Data Stream - Stores the checkpoint row data and index for the RangePartition.
Blob Data Stream - Is only used by the Blob Table to store the blob data bits.
Each of the above is a separate stream in the stream layer owned by an Object Table's RangePartition.
For the Blob Table's RangePartitions, we also store the Blob data
bits directly into the commit log stream (to minimize the number
of stream writes for Blob operations), but those data bits are not
part of the row data so they are not put into the memory table.
Instead, the BlobType property for the row tracks the location of
the Blob data bits (extent+offset, length). During checkpoint, the
extents that would be removed from the commit log are instead
concatenated to the RangePartition's Blob data stream. Extent
concatenation is a fast operation provided by the stream layer
since it consists of just adding pointers to extents at the end of the
Blob data stream without copying any data.
For each stamp, we typically see 75 splits and merges and 200
RangePartition load balances per day.
1. The PM moves C and D so that they are served by the same PS. The PM then tells the PS to merge (C,D) into E.
2. The PS performs a checkpoint for both C and D, and then briefly pauses traffic to C and D during step 3.
3. The PS uses the MultiModify stream command to create a new commit log and data streams for E. Each of these streams is the concatenation of all of the extents from their respective streams in C and D. This merge means that the extents in the new commit log stream for E will be all of C's extents in the order they were in C's commit log stream followed by all of D's extents in their original order. This layout is the same for the new row and Blob data stream(s) for E.
6. The PM then updates the Partition Map Table and its metadata information to reflect the merge.
6. Application Throughput
WAS cloud storage service, which they can then access from any
XBox console they sign into. The backing storage for this feature
leverages Blob and Table storage.
The XBox Telemetry service stores console-generated diagnostics
and telemetry information for later secure retrieval and offline
processing. For example, various Kinect related features running
on Xbox 360 generate detailed usage files which are uploaded to
the cloud to analyze and improve the Kinect experience based on
customer opt-in. The data is stored directly into Blobs, and
Tables are used to maintain metadata information about the files.
Queues are used to coordinate the processing and the cleaning up
of the Blobs.
Microsoft's Zune backend uses Windows Azure for media file
storage and delivery, where files are stored as Blobs.
Table 1 shows the relative breakdown among Blob, Table, and
Queue usage across all (All) services (internal and external) using
WAS as well as for the services described above. The table
shows the breakdown of requests, capacity usage, and ingress and
egress traffic for Blobs, Tables and Queues.
Notice that the percentage of requests for all services shows that
about 17.9% of all requests are Blob requests, 46.88% of the
requests are Table operations and 35.22% are Queue requests for
all services using WAS. But in terms of capacity, 70.31% of
capacity is in Blobs, 29.68% of capacity is used by Tables, and
0.01% used by Queues. %Ingress is the percentage breakdown
of incoming traffic (bytes) among Blob, Table, and Queue;
%Egress is the same for outbound traffic (bytes). The results
show that different customers have very different usage patterns.
In terms of capacity usage, some customers (e.g., Zune and Xbox
GameSaves) have mostly unstructured data (such as media files)
and put those into Blobs, whereas other customers like Bing and
XBox Telemetry that have to index a lot of data have a significant
amount of structured data in Tables. Queues use very little space
compared to Blobs and Tables, since they are primarily used as a
communication mechanism instead of storing data over a long
period of time.
7. Workload Profiles
Table 1: Usage breakdown among Blobs, Tables, and Queues (percent of requests, capacity, ingress, and egress) for all services and for the services described above.

Service          Type    %Requests  %Capacity  %Ingress  %Egress
All              Blob        17.9      70.31     48.28     66.17
All              Table       46.88     29.68     49.61     33.07
All              Queue       35.22      0.01      2.11      0.76
Bing             Blob         0.46     60.45     16.73     29.11
Bing             Table       98.48     39.55     83.14     70.79
Bing             Queue        1.06      0         0.13      0.1
XBox GameSaves   Blob        99.68     99.99     99.84     99.88
XBox GameSaves   Table        0.32      0.01      0.16      0.12
XBox GameSaves   Queue        0         0         0         0
XBox Telemetry   Blob        26.78     19.57     50.25     11.26
XBox Telemetry   Table       44.98     80.43     49.25     88.29
XBox Telemetry   Queue       28.24      0         0.5       0.45
Zune             Blob        94.64     99.9      98.22     96.21
Zune             Table        5.36      0.1       1.78      3.79
Zune             Queue        0         0         0         0
Given this decision, our goal from the start has been to allow
computation to efficiently access storage with high bandwidth
without the data being on the same node or even in the same rack.
To achieve this goal we are in the process of moving towards our
next generation data center networking architecture [10], which
flattens the data center networking topology and provides full
bisection bandwidth between compute and storage.
After doing the split pass, the PM sorts all of the PSs based on
each of the load balancing metrics - request load, CPU load and
network load. It then uses this to identify which PSs are
overloaded versus lightly loaded. The PM then chooses the PSs
that are heavily loaded and, if there was a recent split from the
prior split pass, the PM will offload one of those RangePartitions
to a lightly loaded server. If there are still highly loaded PSs
(without a recent split to offload), the PM offloads
RangePartitions from them to the lightly loaded PSs.
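A simplified sketch of such a load-balancing pass is shown below; the combined load metric, threshold, and data structures are illustrative assumptions rather than the PM's real heuristics:

def load(ps):
    return ps["request_load"] + ps["cpu_load"] + ps["network_load"]

def balance(partition_servers, overload_threshold=2.0):
    ranked = sorted(partition_servers, key=load)      # lightest first
    light, heavy = ranked[0], ranked[-1]
    moves = []
    while load(heavy) > overload_threshold and heavy["partitions"]:
        # Prefer offloading a RangePartition produced by a recent split, if any.
        candidates = [p for p in heavy["partitions"] if p.get("recent_split")] or heavy["partitions"]
        victim = candidates[0]
        heavy["partitions"].remove(victim)
        light["partitions"].append(victim)
        heavy["request_load"] -= victim["load"]
        light["request_load"] += victim["load"]
        moves.append((victim["name"], heavy["name"], light["name"]))
    return moves

servers = [
    {"name": "PS1", "request_load": 2.5, "cpu_load": 0.5, "network_load": 0.2,
     "partitions": [{"name": "A", "load": 1.5, "recent_split": True},
                    {"name": "B", "load": 1.0}]},
    {"name": "PS2", "request_load": 0.3, "cpu_load": 0.1, "network_load": 0.1,
     "partitions": []},
]
print(balance(servers))   # [('A', 'PS1', 'PS2')] -- the recent split moves to the light server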
spread evenly across different fault and upgrade domains for the
storage service. This way, if a fault domain goes down, we lose at
most 1/X of the servers for a given layer, where X is the number
of fault domains. Similarly, during a service upgrade at most 1/Y
of the servers for a given layer are upgraded at a given time,
where Y is the number of upgrade domains. To achieve this, we
use rolling upgrades, which enable us to maintain high availability
when upgrading the storage service, and we upgrade a single
upgrade domain at a time. For example, if we have ten upgrade
domains, then upgrading a single domain would potentially
upgrade ten percent of the servers from each layer at a time.
9. Related Work
10. Conclusions
various peak usage profiles from many customers on the same set
of hardware. This significantly reduces storage cost since the
amount of resources to be provisioned is significantly less than the
sum of the peak resources required to run all of these workloads
on dedicated hardware.
Acknowledgements
References
[3] M. Burrows, "The Chubby Lock Service for Loosely-Coupled Distributed Systems," in OSDI, 2006.
[17] P. O'Neil, E. Cheng, D. Gawlick, and E. O'Neil, "The Log-Structured Merge-Tree (LSM-tree)," Acta Informatica, vol. 33, no. 4, 1996.
[18] H. Patterson et al., "SnapMirror: File System Based Asynchronous Mirroring for Disaster Recovery," in USENIX FAST, 2002.
[19] I. S. Reed and G. Solomon, "Polynomial Codes over Certain Finite Fields," Journal of the Society for Industrial and Applied Mathematics, vol. 8, no. 2, pp. 300-304, 1960.
[20] R. Renesse and F. Schneider, "Chain Replication for Supporting High Throughput and Availability," in USENIX OSDI, 2004.
Abstract
Large-scale data analytics frameworks are shifting towards shorter task durations and larger degrees of parallelism to provide low latency. Scheduling highly parallel
jobs that complete in hundreds of milliseconds poses a
major challenge for task schedulers, which will need to
schedule millions of tasks per second on appropriate machines while offering millisecond-level latency and high
availability. We demonstrate that a decentralized, randomized sampling approach provides near-optimal performance while avoiding the throughput and availability
limitations of a centralized design. We implement and
deploy our scheduler, Sparrow, on a 110-machine cluster and demonstrate that Sparrow performs within 12%
of an ideal scheduler.
1 Introduction
Today's data analytics clusters are running ever shorter and higher-fanout jobs. Spurred by demand for lower-latency interactive data processing, efforts in research and industry alike have produced frameworks
(e.g., Dremel [12], Spark [26], Impala [11]) that stripe
work across thousands of machines or store data in
memory in order to analyze large volumes of data in
seconds, as shown in Figure 1. We expect this trend to
continue with a new generation of frameworks targeting sub-second response times. Bringing response times
into the 100ms range will enable powerful new applications; for example, user-facing services will be able
to run sophisticated parallel computations, such as language translation and highly personalized search, on a
per-query basis.
Figure 1: Job latencies in analytics frameworks, decreasing from tens of seconds toward 100 ms and below.
2 Design Goals
This paper focuses on fine-grained task scheduling for
low-latency applications.
Low-latency workloads have more demanding
scheduling requirements than batch workloads do,
because batch workloads acquire resources for long periods of time and thus require infrequent task scheduling.
To support a workload composed of sub-second tasks,
a scheduler must provide millisecond-scale scheduling
delay and support millions of task scheduling decisions
per second. Additionally, because low-latency frameworks may be used to power user-facing services, a
scheduler for low-latency workloads should be able to
tolerate scheduler failure.
Sparrow provides fine-grained task scheduling, which
is complementary to the functionality provided by cluster resource managers. Sparrow does not launch new
processes for each task; instead, Sparrow assumes that
a long-running executor process is already running on
each worker machine for each framework, so that Sparrow need only send a short task description (rather than
a large binary) when a task is launched. These executor processes may be launched within a static portion
of a cluster, or via a cluster resource manager (e.g.,
YARN [16], Mesos [8], Omega [20]) that allocates resources to Sparrow along with other frameworks (e.g.,
traditional batch workloads).
Sparrow also makes approximations when scheduling
and trades off many of the complex features supported
by sophisticated, centralized schedulers in order to provide higher scheduling throughput and lower latency. In
particular, Sparrow does not allow certain types of placement constraints (e.g., "my job should not be run on machines where User X's jobs are running"), does not perform bin packing, and does not support gang scheduling.
Sparrow supports a small set of features in a way that
can be easily scaled, minimizes latency, and keeps the
design of the system simple. Many applications run low-latency queries from multiple users, so Sparrow enforces
strict priorities or weighted fair shares when aggregate
demand exceeds capacity. Sparrow also supports basic
constraints over job placement, such as per-task constraints (e.g. each task needs to be co-resident with input data) and per-job constraints (e.g., all tasks must be
placed on machines with GPUs). This feature set is similar to that of the Hadoop MapReduce scheduler [23] and
the Spark [26] scheduler.
3 Sample-Based Scheduling for Parallel Jobs
We assume a single wave job model when we evaluate scheduling techniques because single wave jobs are
most negatively affected by the approximations involved
in our distributed scheduling approach: even a single
delayed task affects the job's response time. However,
Sparrow also handles multiwave jobs.
With a probe ratio d, the expected amount of time is upper bounded by $\sum_{i=1}^{\infty} \rho^{\frac{d^i - d}{d - 1}} + o(1)$ [14].
2 The omniscient scheduler uses a greedy scheduling algorithm
based on complete information about which worker machines are busy.
For each incoming job, the scheduler places the jobs tasks on idle
workers, if any exist, and otherwise uses FIFO queueing.
Figure 2: Placing a parallel, two-task job. Batch sampling outperforms per-task sampling because tasks are placed in the least loaded of the entire batch of sampled queues.
Figure: Comparison of Random, Per-Task, Batch, Batch+Late Binding, and Omniscient scheduling as a function of cluster load.
Batch sampling improves on per-task sampling by sharing information across all of the probes for a particular
job. Batch sampling is similar to a technique recently
proposed in the context of storage systems [18]. With
per-task sampling, one pair of probes may have gotten
unlucky and sampled two heavily loaded machines (e.g.,
Task 1 in Figure 2(a)), while another pair may have gotten lucky and sampled two lightly loaded machines (e.g,
Task 2 in Figure 2(a)); one of the two lightly loaded machines will go unused. Batch sampling aggregates load
3 We use this distribution because it puts the most stress on our
approximate, distributed scheduling technique. When tasks within a
job are of different duration, the shorter tasks can have longer wait
times without affecting job response time.
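The difference between the two strategies can be seen in a toy simulation (illustrative only; it ignores queue dynamics and late binding):

import random

def per_task_sampling(queues, m, d=2):
    """Each task independently probes d workers and takes the shorter queue."""
    placements = []
    for _ in range(m):
        probes = random.sample(range(len(queues)), d)
        placements.append(min(probes, key=lambda w: queues[w]))
    return placements

def batch_sampling(queues, m, d=2):
    """Pool all d*m probes for the job and take the m least-loaded probed workers."""
    probes = random.sample(range(len(queues)), d * m)
    return sorted(probes, key=lambda w: queues[w])[:m]

random.seed(1)
queues = [random.randint(0, 5) for _ in range(100)]   # queued tasks per worker
print(sum(queues[w] for w in per_task_sampling(queues, m=10)))
print(sum(queues[w] for w in batch_sampling(queues, m=10)))   # usually no larger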
does not support many types of constraints (e.g., inter-job constraints) supported by some general-purpose resource managers.
Per-job constraints (e.g., all tasks should be run on
a worker with a GPU) are trivially handled at a Sparrow scheduler. Sparrow randomly selects the d·m candidate workers from the subset of workers that satisfy the constraint. Once the d·m workers to probe are selected,
scheduling proceeds as described previously.
Sparrow also handles jobs with per-task constraints,
such as constraints that limit tasks to run on machines
where input data is located. Co-locating tasks with input
data typically reduces response time, because input data
does not need to be transferred over the network. For
jobs with per-task constraints, each task may have a different set of machines on which it can run, so Sparrow
cannot aggregate information over all of the probes in
the job using batch sampling. Instead, Sparrow uses pertask sampling, where the scheduler selects the two machines to probe for each task from the set of machines
that the task is constrained to run on, along with late
binding.
Sparrow implements a small optimization over pertask sampling for jobs with per-task constraints. Rather
than probing individually for each task, Sparrow shares
information across tasks when possible. For example,
consider a case where task 0 is constrained to run in
machines A, B, and C, and task 1 is constrained to run
on machines C, D, and E. Suppose the scheduler probed
machines A and B for task 0, which were heavily loaded,
and probed machines C and D for task 1, which were
both idle. In this case, Sparrow will place task 0 on machine C and task 1 on machine D, even though both machines were selected to be probed for task 1.
Although Sparrow cannot use batch sampling for jobs
with per-task constraints, our distributed approach still
provides near-optimal response times for these jobs, because even a centralized scheduler has only a small number of choices for where to place each task. Jobs with
per-task constraints can still use late binding, so the
scheduler is guaranteed to place each task on whichever
of the two probed machines where the task will run
sooner. Storage layers like HDFS typically replicate data
on three different machines, so tasks that read input data
will be constrained to run on one of three machines
where the input data is located. As a result, even an
ideal, omniscient scheduler would only have one additional choice for where to place each task.
5 Analysis
Before delving into our experimental evaluation, we analytically show that batch sampling achieves near-optimal
performance, regardless of the task duration distribution, given some simplifying assumptions. Section 3
demonstrated that Sparrow performs well, but only under one particular workload; this section generalizes
those results to all workloads. We also demonstrate that
with per-task sampling, performance decreases exponentially with the number of tasks in a job, making it
poorly suited for parallel workloads.
The probability that all m tasks of a job are placed on idle machines is $(1 - \rho)^m$ with random placement and $(1 - \rho^d)^m$ with per-task sampling, while with batch sampling it is $\sum_{i=m}^{dm} \binom{dm}{i} (1 - \rho)^{i} \rho^{dm - i}$, where m is the number of tasks per job, d is the probe ratio, and ρ is the cluster load.
Figure: Probability that a job experiences zero wait time as a function of load, for per-task and batch sampling with 10 and 100 tasks per job.
4 With the larger, 100-task job, the drop happens more rapidly because the job uses more total probes, which decreases the variance in the proportion of probes that hit idle machines.
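These expressions are easy to evaluate numerically; the short sketch below just computes the three probabilities for a couple of load values:

from math import comb

def p_random(rho, m):
    return (1 - rho) ** m

def p_per_task(rho, m, d):
    return (1 - rho ** d) ** m

def p_batch(rho, m, d):
    # At least m of the d*m probed workers must be idle.
    return sum(comb(d * m, i) * (1 - rho) ** i * rho ** (d * m - i)
               for i in range(m, d * m + 1))

for rho in (0.2, 0.8):
    print(rho, p_random(rho, 10), p_per_task(rho, 10, 2), p_batch(rho, 10, 2))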
Figure 6: Frameworks that use Sparrow are decomposed into frontends, which generate tasks, and executors, which run tasks. Frameworks schedule jobs by communicating with any one of a set of distributed Sparrow schedulers. Sparrow node monitors run on each worker machine and federate resource usage.
Figure 7: Timeline of a task launch. An application frontend (e.g., Spark) calls submitRequest() on a Sparrow scheduler; the scheduler calls enqueueReservation() on worker node monitors (reserve time); the reservation waits in the worker's queue (queue time); when resources free up, the node monitor calls getTask() back to the scheduler (get task time) and then launchTask() on the co-resident executor, which runs the task (service time) and reports taskComplete().
queries or job specifications (e.g., a SQL query) from exogenous sources (e.g., a data analyst, web service, business application, etc.) and compile them into parallel
tasks for execution on workers. Frontends are typically
distributed over multiple machines to provide high performance and availability. Because Sparrow schedulers
are lightweight, in our deployment, we run a scheduler
on each machine where an application frontend is running to ensure minimum scheduling latency.
Executor processes are responsible for executing
tasks, and are long-lived to avoid startup overhead such
as shipping binaries or caching large datasets in memory.
Executor processes for multiple frameworks may run coresident on a single machine; the node monitor federates
resource usage between co-located frameworks. Sparrow requires executors to accept a launchTask()
RPC from a local node monitor, as shown in Figure 7;
Sparrow uses the launchTask() RPC to pass on the
task description (opaque to Sparrow) originally supplied
by the application frontend.
6 Implementation
We implemented Sparrow to evaluate its performance
on a cluster of 110 Amazon EC2 virtual machines. The
Sparrow code, including scripts to replicate our experimental evaluation, is publicly available at http://
github.com/radlab/sparrow.
7 Experimental Evaluation
We evaluate Sparrow using a cluster composed of 100 worker machines and 10 schedulers running on Amazon EC2. Unless otherwise specified, we use a probe ratio of 2. First, we use Sparrow to schedule tasks for a TPC-H workload, which features heterogeneous analytics queries. We provide fine-grained tracing of the overhead that Sparrow incurs and quantify its performance in comparison with an ideal scheduler. Second, we demonstrate Sparrow's ability to handle scheduler failures. Third, we evaluate Sparrow's ability to isolate users from one another in accordance with cluster-wide scheduling policies. Finally, we perform a sensitivity analysis of key parameters in Sparrow's design.
Figure: TPC-H response times for queries q3, q4, q6, and q12 under Random placement, Per-task sampling, and Batch sampling, with labeled medians of 4217 ms (q3), 5396 ms (q4), and 7881 ms (q6).
Figure: Cumulative probability of reserve time and queue time, in milliseconds.
Figure: Delay (ms) for constrained and unconstrained stages under per-task sampling and Sparrow, with labeled values of 535 ms and 219 ms.
Figure 8 demonstrates that Sparrow outperforms alternate techniques and provides response times within 12% of an ideal scheduler. Compared to randomly assigning tasks to workers, Sparrow (batch sampling with late binding) reduces median query response time by 4x to 8x and reduces 95th percentile response time by over 10x. Sparrow also reduces response time compared to per-task sampling (a naive implementation based on the power of two choices): batch sampling with late binding provides query response times an average of 0.8x those provided by per-task sampling. Ninety-fifth percentile response times drop by almost a factor of two with Sparrow, compared to per-task sampling. Late binding reduces median query response time by an average of 14% compared to batch sampling alone. Sparrow also provides good absolute performance: Sparrow provides median response times just 12% higher than those provided by an ideal scheduler.
Figure: Response times over time (s) for frontends on Node 1 and Node 2 around a scheduler failure.
Figure: Task duration (ms) distribution.
constrained tasks, Sparrow provides a performance improvement over per-task sampling due to its use of late
binding.
Figure 13: Number of running tasks (0-400) over time (s) for User 0 and User 1 under Sparrow's fair sharing.

HP load   LP load   HP response time in ms   LP response time in ms
0.25      0         106 (111)                N/A
0.25      0.25      108 (114)                108 (115)
0.25      0.5       110 (148)                110 (449)
0.25      0.75      136 (170)                40.2k (46.2k)
0.25      1.75      141 (226)                255k (270k)

Table 3: Median and 95th percentile (shown in parentheses) response times for a high priority (HP) and low priority (LP) user running jobs composed of 10 100ms tasks in a 100-node cluster. Sparrow successfully shields the high priority user from a low priority user. When aggregate load is 1 or more, response time will grow to be unbounded for at least one user.
A cluster with tens of thousands of machines running sub-second tasks may require millions of scheduling decisions per second; supporting such an environment would require 1000x higher scheduling throughput, which is difficult to imagine even with a significant rearchitecting of the scheduler. Clusters running low latency workloads will need to shift from using centralized task schedulers like Spark's native scheduler to using more scalable distributed schedulers like Sparrow.
Figure 13 demonstrates that Sparrow's distributed fairness mechanism enforces cluster-wide fair shares and quickly adapts to changing user demand. Users 0 and 1 are both given equal shares in a cluster with 400 slots. Unlike other experiments, we use 100 4-core EC2 machines; Sparrow's distributed enforcement works better as the number of cores increases, so to avoid overstating performance, we evaluate it under the smallest number of cores we would expect in a cluster today. User 0 submits at a rate to fully utilize the cluster for the entire duration of the experiment. User 1 changes her demand every 10 seconds: she submits at a rate to consume 0%, 25%, 50%, 25%, and finally 0% of the cluster's available slots. Under max-min fairness, each user is allocated her fair share of the cluster unless the user's demand is less than her share, in which case the unused share is distributed evenly amongst the remaining users. Thus, user 1's max-min share for each 10-second interval is 0 concurrently running tasks, 100 tasks, 200 tasks, 100 tasks, and finally 0 tasks; user 0's max-min fair share is the remaining resources. Sparrow's fairness mechanism lacks any central authority with a complete view of how many tasks each user is running, leading to imperfect fairness over short time intervals. Nonetheless, as shown in Figure 13, Sparrow quickly allocates enough resources to User 1 when she begins submitting scheduling requests (10 seconds into the experiment), and the cluster share allocated by Sparrow exhibits only small fluctuations from the correct fair share.
Figure: Response time versus probe ratio (1, 1.1, 1.2, 1.5, and 2) at 80% and 90% cluster load, compared to an ideal scheduler.
Figure 15: Response times of short jobs when sharing the cluster with long jobs of 10s and 100s duration.
to sustain 80% cluster load. Figure 15 illustrates the response time of short jobs when sharing the cluster with
long jobs. We vary the percentage of jobs that are long,
the duration of the long jobs, and the number of cores
on the machine, to illustrate where performance breaks
down. Sparrow provides response times for short tasks within 11% of ideal (100ms) when running on 16-core machines, even when 50% of tasks are 3 orders of magnitude longer. When 50% of tasks are 3 orders of magnitude longer, over 99% of the execution time across all jobs is spent executing long tasks; given this, Sparrow's performance is impressive. Short tasks see more significant performance degradation in a 4-core environment.
rely on centralized architectures. Among logically decentralized schedulers, Sparrow is the first to schedule all of a job's tasks together, rather than scheduling each task independently, which improves performance for parallel jobs.
Dean's work on reducing the latency tail in serving systems [5] is most similar to ours. He proposes using hedged requests, where the client sends each request to two workers and cancels remaining outstanding requests when the first result is received. He also describes tied requests, where clients send each request to two servers, but the servers communicate directly about the status of the request: when one server begins executing the request, it cancels the counterpart. Both mechanisms are similar to Sparrow's late binding, but target an environment where each task needs to be scheduled independently (for data locality), so information cannot be shared across the tasks in a job.
Work on load sharing in distributed systems (e.g., [7])
also uses randomized techniques similar to Sparrow's.
In load sharing systems, each processor both generates
and processes work; by default, work is processed where
it is generated. Processors re-distribute queued tasks if
the number of tasks queued at a processor exceeds some
threshold, using either receiver-initiated policies, where
lightly loaded processors request work from randomly
selected other processors, or sender-initiated policies,
where heavily loaded processors offload work to randomly selected recipients. Sparrow represents a combination of sender-initiated and receiver-initiated policies:
schedulers (senders) initiate the assignment of tasks
to workers (receivers) by sending probes, but workers finalize the assignment by responding to probes and
requesting tasks as resources become available.
Projects that explore load balancing tasks in multiprocessor shared-memory architectures (e.g., [19]) echo
many of the design tradeoffs underlying our approach,
such as the need to avoid centralized scheduling points.
They differ from our approach because they focus
on a single machine where the majority of the effort is spent determining when to reschedule processes
amongst cores to balance load.
Quincy [9] targets task-level scheduling in compute
clusters, similar to Sparrow. Quincy maps the scheduling problem onto a graph in order to compute an optimal
schedule that balances data locality, fairness, and starvation freedom. Quincy's graph solver supports more sophisticated scheduling policies than Sparrow but takes
over a second to compute a scheduling assignment in
a 2500 node cluster, making it too slow for our target
workload.
In the realm of data analytics frameworks,
Dremel [12] achieves response times of seconds
with extremely high fanout. Dremel uses a hierarchical
without adding significant complexity is a focus of future work. Adding pre-emption, for example, would be a
simple way to mitigate the effects of low-priority users' jobs on higher priority users.
Constraints Our current design does not handle interjob constraints (e.g. the tasks for job A must not run on
racks with tasks for job B). Supporting inter-job constraints across frontends is difficult to do without significantly altering Sparrows design.
Gang scheduling Some applications require gang
scheduling, a feature not implemented by Sparrow. Gang
scheduling is typically implemented using bin-packing
algorithms that search for and reserve time slots in which
an entire job can run. Because Sparrow queues tasks on
several machines, it lacks a central point from which
to perform bin-packing. While Sparrow often places all
jobs on entirely idle machines, this is not guaranteed,
and deadlocks between multiple jobs that require gang
scheduling may occur. Sparrow is not alone: many cluster schedulers do not support gang scheduling [8, 9, 16].
Query-level policies Sparrow's performance could be
improved by adding query-level scheduling policies. A
user query (e.g., a SQL query executed using Shark)
may be composed of many stages that are each executed using a separate Sparrow scheduling request; to
optimize query response time, Sparrow should schedule queries in FIFO order. Currently, Sparrows algorithm attempts to schedule jobs in FIFO order; adding
query-level scheduling policies should improve end-toend query performance.
Worker failures Handling worker failures is complicated by Sparrows distributed design, because when a
worker fails, all schedulers with outstanding requests
at that worker must be informed. We envision handling
worker failures with a centralized state store that relies
on occasional heartbeats to maintain a list of currently
alive workers. The state store would periodically disseminate the list of live workers to all schedulers. Since the
information stored in the state store would be soft state,
it could easily be recreated in the event of a state store
failure.
Dynamically adapting the probe ratio Sparrow
could potentially improve performance by dynamically
adapting the probe ratio based on cluster load; however,
such an approach sacrifices some of the simplicity of
Sparrow's current design. Exploring whether dynamically changing the probe ratio would significantly increase performance is the subject of ongoing work.
9 Related Work
Scheduling in distributed systems has been extensively
studied in earlier work. Most existing cluster schedulers
10 Conclusion
This paper presents Sparrow, a stateless decentralized scheduler that provides near optimal performance using two key techniques: batch sampling and late binding. We use a TPC-H workload to demonstrate that Sparrow can provide median response times within 12% of an ideal scheduler and survives scheduler failures. Sparrow enforces popular scheduler policies, including fair sharing and strict priorities. Experiments using a synthetic workload demonstrate that Sparrow is resilient to different probe ratios and distributions of task durations. In light of these results, we believe that distributed scheduling using Sparrow presents a viable alternative to centralized schedulers for low latency parallel workloads.

11 Acknowledgments
We are indebted to Aurojit Panda for help with debugging EC2 performance anomalies, Shivaram Venkataraman for insightful comments on several drafts of this paper and for help with Spark integration, Sameer Agarwal for help with running simulations, Satish Rao for help with theoretical models of the system, and Peter Bailis, Ali Ghodsi, Adam Oliner, Sylvia Ratnasamy, and Colin Scott for helpful comments on earlier drafts of this paper. We also thank our shepherd, John Wilkes, for helping to shape the final version of the paper. Finally, we thank the reviewers from HotCloud 2012, OSDI 2012, NSDI 2013, and SOSP 2013 for their helpful feedback.
This research is supported in part by a Hertz Foundation Fellowship, the Department of Defense through the National Defense Science & Engineering Graduate Fellowship Program, NSF CISE Expeditions award CCF-1139158, DARPA XData Award FA8750-12-2-0331, Intel via the Intel Science and Technology Center for Cloud Computing (ISTC-CC), and gifts from Amazon Web Services, Google, SAP, Cisco, Clearstory Data, Cloudera, Ericsson, Facebook, FitWave, General Electric, Hortonworks, Huawei, Microsoft, NetApp, Oracle, Samsung, Splunk, VMware, WANdisco and Yahoo!.

References
[1] Apache Thrift. http://thrift.apache.org.
[5] J. Dean and L. A. Barroso. The Tail at Scale. Communications of the ACM, 56(2), February 2013.
[6] A. Demers, S. Keshav, and S. Shenker. Analysis and Simulation of a Fair Queueing Algorithm. In Proc. SIGCOMM, 1989.
[7] D. L. Eager, E. D. Lazowska, and J. Zahorjan. Adaptive Load Sharing in Homogeneous Distributed Systems. IEEE Transactions on Software Engineering, 1986.
[8] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. In Proc. NSDI, 2011.
[17] K. Ousterhout, A. Panda, J. Rosen, S. Venkataraman, R. Xin, S. Ratnasamy, S. Shenker, and I. Stoica. The Case for Tiny Tasks in Compute Clusters. In Proc. HotOS, 2013.
[21] B. Sharma, V. Chudnovsky, J. L. Hellerstein, R. Rifaat, and C. R. Das. Modeling and Synthesizing Task Placement Constraints in Google Compute Clusters. In Proc. SOCC, 2011.
[22] D. Shue, M. J. Freedman, and A. Shaikh. Performance Isolation and Fairness for Multi-Tenant Cloud Storage. In Proc. OSDI, 2012.
[23] T. White. Hadoop: The Definitive Guide. O'Reilly Media, 2009.
Neha Narkhede
LinkedIn Corp.
nnarkhede@linkedin.com
ABSTRACT
Log processing has become a critical component of the data
pipeline for consumer internet companies. We introduce Kafka, a
distributed messaging system that we developed for collecting and
delivering high volumes of log data with low latency. Our system
incorporates ideas from existing log aggregators and messaging
systems, and is suitable for both offline and online message
consumption. We made quite a few unconventional yet practical
design choices in Kafka to make our system efficient and scalable.
Our experimental results show that Kafka has superior
performance when compared to two popular messaging systems.
We have been using Kafka in production for some time and it is
processing hundreds of gigabytes of new data each day.
General Terms
Management, Performance, Design, Experimentation.
Keywords
messaging, distributed, log processing, throughput, online.
1. Introduction
There is a large amount of log data generated at any sizable
internet company. This data typically includes (1) user activity
events corresponding to logins, pageviews, clicks, likes,
sharing, comments, and search queries; (2) operational metrics
such as service call stack, call latency, errors, and system metrics
such as CPU, memory, network, or disk utilization on each
machine. Log data has long been a component of analytics used to
track user engagement, system utilization, and other metrics.
However, recent trends in internet applications have made activity
data a part of the production data pipeline used directly in site
features. These uses include (1) search relevance, (2) recommendations, which may be driven by item popularity or co-occurrence in the activity stream, (3) ad targeting and reporting, (4) security applications that protect against abusive behaviors such as spam or unauthorized data scraping, and (5) newsfeed
features that aggregate user status updates or actions for their
friends or connections to read.
This production, real-time usage of log data creates new
challenges for data systems because its volume is orders of
magnitude larger than the real data. For example, search,
recommendations, and advertising often require computing
Jun Rao
LinkedIn Corp.
jrao@linkedin.com
2. Related Work
Traditional enterprise messaging systems [1][7][15][17] have
existed for a long time and often play a critical role as an event
bus for processing asynchronous data flows. However, there are a
few reasons why they tend not to be a good fit for log processing.
First, there is a mismatch in features offered by enterprise
systems. Those systems often focus on offering a rich set of
delivery guarantees. For example, IBM Websphere MQ [7] has
transactional support that allows an application to insert messages
into multiple queues atomically. The JMS [14] specification
allows each individual message to be acknowledged after
consumption, potentially out of order. Such delivery guarantees
are often overkill for collecting log data. For instance, losing a
few pageview events occasionally is certainly not the end of the
world. Those unneeded features tend to increase the complexity of
both the API and the underlying implementation of those systems.
Second, many systems do not focus as strongly on throughput as
their primary design constraint. For example, JMS has no API to
allow the producer to explicitly batch multiple messages into a single request.
Figure: Kafka architecture. A producer publishes messages to topics, each divided into partitions (e.g., topic1/part1, topic1/part2, topic2/part1) spread across brokers (BROKER 1, 2, 3); consumers read from the brokers.
Figure: Kafka log for a topic partition. Messages are appended to the latest segment file and old segments are eventually deleted; an in-memory index holds the first message id of each segment file (e.g., msg-00000000000 for segment file 1 and msg-02050706778 for segment file N), so reads can locate the segment containing a given message.
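The offset-to-segment lookup implied by this layout can be sketched with a sorted list of segment start offsets (the offsets here are taken from the figure and are illustrative):

import bisect

segment_start_offsets = [0, 14_517_018, 30_706_778, 2_050_706_778]

def segment_for(offset):
    """Return the start offset of the segment file containing `offset`."""
    i = bisect.bisect_right(segment_start_offsets, offset) - 1
    return segment_start_offsets[i]

print(segment_for(9_000))            # 0            -> segment file 1
print(segment_for(20_000_000))       # 14517018     -> second segment
print(segment_for(2_060_000_000))    # 2050706778   -> segment file N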
In general, Kafka only guarantees at-least-once delivery. Exactly-once delivery typically requires two-phase commits and is not
necessary for our applications. Most of the time, a message is
delivered exactly once to each consumer group. However, in the
case when a consumer process crashes without a clean shutdown,
the consumer process that takes over those partitions owned by
the failed consumer may get some duplicate messages that are
after the last offset successfully committed to zookeeper. If an
application cares about duplicates, it must add its own deduplication logic, either using the offsets that we return to the
consumer or some unique key within the message. This is usually
a more cost-effective approach than using two-phase commits.
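A minimal sketch of such offset-based de-duplication on the consumer side might look as follows (a hypothetical consumer loop, not Kafka's API):

def deduplicated(messages, last_processed_offset):
    """Skip messages at or below the last offset the application already processed."""
    for offset, payload in messages:
        if offset <= last_processed_offset:
            continue                      # duplicate redelivered after a crash
        yield offset, payload
        last_processed_offset = offset

redelivered = [(100, "a"), (101, "b"), (102, "c")]
print(list(deduplicated(redelivered, last_processed_offset=101)))   # [(102, 'c')]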
Figure: Kafka deployment at LinkedIn. Frontend services in the live datacenter publish through a load balancer to a set of brokers, which real-time services consume from; the data is also replicated to brokers in an analysis datacenter, where it is loaded into Hadoop and the data warehouse (DWH).
5. Experimental Results
We conducted an experimental study, comparing the performance
of Kafka with Apache ActiveMQ v5.4 [1], a popular open-source
implementation of JMS, and RabbitMQ v2.4 [16], a message
system known for its performance. We used ActiveMQ's default
persistent message store KahaDB. Although not presented here,
we also tested an alternative AMQ message store and found its
performance very similar to that of KahaDB. Whenever possible,
we tried to use comparable settings in all systems.
We ran our experiments on 2 Linux machines, each with 8 2GHz
cores, 16GB of memory, 6 disks with RAID 10. The two
machines are connected with a 1Gb network link. One of the
machines was used as the broker and the other machine was used
as the producer or the consumer.
Producer Test: We configured the broker in all systems to
asynchronously flush messages to its persistence store. For each
system, we ran a single producer to publish a total of 10 million
messages, each of 200 bytes. We configured the Kafka producer
to send messages in batches of size 1 and 50. ActiveMQ and
RabbitMQ don't seem to have an easy way to batch messages, and
we assume that it used a batch size of 1. The results are shown in
Figure 4. The x-axis represents the amount of data sent to the
broker over time in MB, and the y-axis corresponds to the
producer throughput in messages per second. On average, Kafka
can publish messages at the rate of 50,000 and 400,000 messages
per second for batch size of 1 and 50, respectively. These numbers
7. REFERENCES
[1] http://activemq.apache.org/
[2] http://avro.apache.org/
[4] http://developer.yahoo.com/blogs/hadoop/posts/2010/06/enabling_hadoop_batch_processi_1/
[8] http://hadoop.apache.org/
[9] http://hadoop.apache.org/hdfs/
[10] http://hadoop.apache.org/zookeeper/
[11] http://www.slideshare.net/cloudera/hw09-hadoop-based-data-mining-platform-for-the-telecom-industry
[12] http://www.slideshare.net/prasadc/hive-percona-2009
[13] https://issues.apache.org/jira/browse/ZOOKEEPER-775
Google, Inc.
Abstract
Bigtable is a distributed storage system for managing
structured data that is designed to scale to a very large
size: petabytes of data across thousands of commodity
servers. Many projects at Google store data in Bigtable,
including web indexing, Google Earth, and Google Finance. These applications place very different demands
on Bigtable, both in terms of data size (from URLs to
web pages to satellite imagery) and latency requirements
(from backend bulk processing to real-time data serving).
Despite these varied demands, Bigtable has successfully
provided a flexible, high-performance solution for all of
these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients
dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
1 Introduction
Over the last two and a half years we have designed,
implemented, and deployed a distributed storage system
for managing structured data at Google called Bigtable.
Bigtable is designed to reliably scale to petabytes of
data and thousands of machines. Bigtable has achieved
several goals: wide applicability, scalability, high performance, and high availability. Bigtable is used by
more than sixty Google products and projects, including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth. These products use Bigtable for a variety of demanding workloads,
which range from throughput-oriented batch-processing
jobs to latency-sensitive serving of data to end users.
The Bigtable clusters used by these products span a wide
range of configurations, from a handful to thousands of
servers, and store up to several hundred terabytes of data.
In many ways, Bigtable resembles a database: it shares
many implementation strategies with databases. Parallel databases [14] and main-memory databases [13] have
To appear in OSDI 2006
2 Data Model
A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row
key, column key, and a timestamp; each value in the map
is an uninterpreted array of bytes.
(row:string, column:string, time:int64) → string
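To make the shape of this map concrete, the following is a small self-contained sketch (illustrative only, not Bigtable code) that models the conceptual (row, column, timestamp) → value map as an in-memory ordered map; the CellKey type and the sample values are assumptions for the example.

#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <tuple>

// Conceptual key: (row, column, timestamp). Bigtable stores versions in
// decreasing timestamp order; swapping the timestamp operands in the
// comparison models "most recent version first".
struct CellKey {
  std::string row;
  std::string column;   // "family:qualifier"
  int64_t timestamp;
  bool operator<(const CellKey& o) const {
    return std::tie(row, column, o.timestamp) <
           std::tie(o.row, o.column, timestamp);  // timestamps sort descending
  }
};

int main() {
  std::map<CellKey, std::string> table;  // value: an uninterpreted string of bytes
  table[{"com.cnn.www", "anchor:cnnsi.com", 9}] = "CNN";
  table[{"com.cnn.www", "contents:", 6}] = "<html>...";
  table[{"com.cnn.www", "contents:", 5}] = "<html>...";
  for (const auto& [key, value] : table)
    std::cout << key.row << " " << key.column << " @t" << key.timestamp
              << " -> " << value << "\n";
}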
[Figure 1: the row "com.cnn.www" holds cells contents: = "<html>..." at timestamps t3, t5, and t6; anchor:cnnsi.com = "CNN" at t9; and anchor:my.look.ca = "CNN.com" at t8.]
Figure 1: A slice of an example table that stores Web pages. The row name is a reversed URL. The contents column family contains the page contents, and the anchor column family contains the text of any anchors that reference the page. CNN's home page is referenced by both the Sports Illustrated and the MY-look home pages, so the row contains columns named anchor:cnnsi.com and anchor:my.look.ca. Each anchor cell has one version; the contents column has three versions, at timestamps t3, t5, and t6.
Rows
The row keys in a table are arbitrary strings (currently up
to 64KB in size, although 10-100 bytes is a typical size
for most of our users). Every read or write of data under
a single row key is atomic (regardless of the number of
different columns being read or written in the row), a
design decision that makes it easier for clients to reason
about the system's behavior in the presence of concurrent
updates to the same row.
Bigtable maintains data in lexicographic order by row
key. The row range for a table is dynamically partitioned.
Each row range is called a tablet, which is the unit of distribution and load balancing. As a result, reads of short
row ranges are efficient and typically require communication with only a small number of machines. Clients
can exploit this property by selecting their row keys so
that they get good locality for their data accesses. For
example, in Webtable, pages in the same domain are
grouped together into contiguous rows by reversing the
hostname components of the URLs. For example, we
store data for maps.google.com/index.html under the
key com.google.maps/index.html. Storing pages from
the same domain near each other makes some host and
domain analyses more efficient.
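As a small illustration of this row-key scheme (an assumed helper, not code from the paper), the hostname components of a URL can be reversed before the path is appended:

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Turn "maps.google.com/index.html" into "com.google.maps/index.html" so
// that pages from the same domain sort next to each other.
std::string RowKeyForUrl(const std::string& url) {
  std::string::size_type slash = url.find('/');
  std::string host = url.substr(0, slash);
  std::string path = (slash == std::string::npos) ? "" : url.substr(slash);

  std::vector<std::string> parts;
  std::stringstream ss(host);
  for (std::string part; std::getline(ss, part, '.');) parts.push_back(part);

  std::string key;
  for (auto it = parts.rbegin(); it != parts.rend(); ++it) {
    if (!key.empty()) key += '.';
    key += *it;
  }
  return key + path;
}

int main() {
  std::cout << RowKeyForUrl("maps.google.com/index.html") << "\n";  // com.google.maps/index.html
}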
Column Families
Column keys are grouped into sets called column families, which form the basic unit of access control. All data
stored in a column family is usually of the same type (we
compress data in the same column family together). A
column family must be created before data can be stored
under any column key in that family; after a family has
been created, any column key within the family can be
used. It is our intent that the number of distinct column
families in a table be small (in the hundreds at most), and
that families rarely change during operation. In contrast,
a table may have an unbounded number of columns.
A column key is named using the following syntax:
family:qualifier. Column family names must be printable, but qualifiers may be arbitrary strings. An example column family for the Webtable is language, which
stores the language in which a web page was written. We
use only one column key in the language family, and it
stores each web page's language ID. Another useful column family for this table is anchor; each column key in
this family represents a single anchor, as shown in Figure 1. The qualifier is the name of the referring site; the
cell contents is the link text.
Access control and both disk and memory accounting are performed at the column-family level. In our
Webtable example, these controls allow us to manage
several different types of applications: some that add new
base data, some that read the base data and create derived
column families, and some that are only allowed to view
existing data (and possibly not even to view all of the
existing families for privacy reasons).
Timestamps
Each cell in a Bigtable can contain multiple versions of
the same data; these versions are indexed by timestamp.
Bigtable timestamps are 64-bit integers. They can be assigned by Bigtable, in which case they represent real
time in microseconds, or be explicitly assigned by client applications.
3 API
The Bigtable API provides functions for creating and
deleting tables and column families. It also provides
functions for changing cluster, table, and column family
metadata, such as access control rights.
Client applications can write or delete values in
Bigtable, look up values from individual rows, or iterate over a subset of the data in a table. Figure 2 shows
C++ code that uses a RowMutation abstraction to perform a series of updates. (Irrelevant details were elided
to keep the example short.) The call to Apply performs
an atomic mutation to the Webtable: it adds one anchor
to www.cnn.com and deletes a different anchor.
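Figure 2 itself is not reproduced in this extract. The following self-contained sketch shows what such a RowMutation-style update could look like, with stub types and hypothetical anchor names standing in for the real client library.

#include <iostream>
#include <string>
#include <utility>

// Minimal stand-ins for the client-library types named in the text; these
// stubs only echo the requested operations.
struct Table { std::string name; };

struct RowMutation {
  RowMutation(Table* t, std::string r) : table(t), row(std::move(r)) {}
  void Set(const std::string& col, const std::string& val) {
    std::cout << "SET    " << row << " " << col << " = " << val << "\n";
  }
  void Delete(const std::string& col) {
    std::cout << "DELETE " << row << " " << col << "\n";
  }
  Table* table;
  std::string row;
};

struct Operation {};

// In the real client library, Apply performs the row mutation atomically.
void Apply(Operation*, RowMutation*) {}

int main() {
  Table webtable{"/bigtable/web/webtable"};
  RowMutation r1(&webtable, "com.cnn.www");
  r1.Set("anchor:referrer.example.com", "CNN");  // hypothetical new anchor
  r1.Delete("anchor:old.example.org");           // hypothetical stale anchor
  Operation op;
  Apply(&op, &r1);  // one atomic mutation to the Webtable row
}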
Figure 3 shows C++ code that uses a Scanner abstraction to iterate over all anchors in a particular row.
Clients can iterate over multiple column families, and
there are several mechanisms for limiting the rows,
columns, and timestamps produced by a scan. For example, we could restrict the scan above to only produce
anchors whose columns match the regular expression
anchor:*.cnn.com, or to only produce anchors whose
timestamps fall within ten days of the current time.
Scanner scanner(T);
ScanStream *stream;
stream = scanner.FetchColumnFamily("anchor");
stream->SetReturnAllVersions();
scanner.Lookup("com.cnn.www");
for (; !stream->Done(); stream->Next()) {
  printf("%s %s %lld %s\n",
         scanner.RowName(),
         stream->ColumnName(),
         stream->MicroTimestamp(),
         stream->Value());
}
4 Building Blocks
Bigtable is built on several other pieces of Google infrastructure. Bigtable uses the distributed Google File
System (GFS) [17] to store log and data files. A Bigtable
cluster typically operates in a shared pool of machines
that run a wide variety of other distributed applications,
and Bigtable processes often share the same machines
with processes from other applications. Bigtable depends on a cluster management system for scheduling
jobs, managing resources on shared machines, dealing
with machine failures, and monitoring machine status.
The Google SSTable file format is used internally to
store Bigtable data. An SSTable provides a persistent,
ordered immutable map from keys to values, where both
keys and values are arbitrary byte strings. Operations are
provided to look up the value associated with a specified key, and to iterate over all key/value pairs in a specified key range.
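As a toy illustration of this abstraction, the sketch below models an immutable, ordered map with point lookup and range iteration using an in-memory std::map; real SSTables are immutable block-structured files, which this does not attempt to model.

#include <iostream>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Toy stand-in for an SSTable: an immutable, ordered map from byte-string
// keys to byte-string values.
class ToySSTable {
 public:
  explicit ToySSTable(std::map<std::string, std::string> entries)
      : entries_(std::move(entries)) {}

  std::optional<std::string> Lookup(const std::string& key) const {
    auto it = entries_.find(key);
    if (it == entries_.end()) return std::nullopt;
    return it->second;
  }

  // Iterate over all key/value pairs whose key lies in [begin, end).
  template <typename Fn>
  void Scan(const std::string& begin, const std::string& end, Fn fn) const {
    for (auto it = entries_.lower_bound(begin);
         it != entries_.end() && it->first < end; ++it)
      fn(it->first, it->second);
  }

 private:
  const std::map<std::string, std::string> entries_;  // never modified after construction
};

int main() {
  ToySSTable t({{"com.cnn.www", "<html>..."}, {"com.google.maps", "<html>..."}});
  if (auto v = t.Lookup("com.cnn.www")) std::cout << *v << "\n";
  t.Scan("com.a", "com.z", [](const std::string& k, const std::string& v) {
    std::cout << k << " -> " << v << "\n";
  });
}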
5 Implementation
The Bigtable implementation has three major components: a library that is linked into every client, one master server, and many tablet servers. Tablet servers can be dynamically added (or removed) from a cluster to accommodate changes in workloads.
[Figure: tablet location hierarchy. A Chubby file points to the root tablet, which points to metadata tablets, which in turn point to the user tablets (UserTable1 ... UserTableN).]

[Figure: tablet representation. Write operations go to a tablet log in GFS and to an in-memory memtable; read operations see a merged view of the memtable and the SSTable files stored in GFS.]
5.4 Compactions
As write operations execute, the size of the memtable increases. When the memtable size reaches a threshold, the
memtable is frozen, a new memtable is created, and the
frozen memtable is converted to an SSTable and written
to GFS. This minor compaction process has two goals:
it shrinks the memory usage of the tablet server, and it
reduces the amount of data that has to be read from the
commit log during recovery if this server dies. Incoming read and write operations can continue while compactions occur.
Every minor compaction creates a new SSTable. If this
behavior continued unchecked, read operations might
need to merge updates from an arbitrary number of
SSTables. Instead, we bound the number of such files
by periodically executing a merging compaction in the
background. A merging compaction reads the contents
of a few SSTables and the memtable, and writes out a
new SSTable. The input SSTables and memtable can be
discarded as soon as the compaction has finished.
A merging compaction that rewrites all SSTables
into exactly one SSTable is called a major compaction.
SSTables produced by non-major compactions can contain special deletion entries that suppress deleted data in
older SSTables that are still live. A major compaction,
on the other hand, produces an SSTable that contains
no deletion information or deleted data. Bigtable cycles through all of its tablets and regularly applies major
compactions to them. These major compactions allow
Bigtable to reclaim resources used by deleted data, and
also allow it to ensure that deleted data disappears from
the system in a timely fashion, which is important for
services that store sensitive data.
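The triggering logic can be sketched roughly as follows; the thresholds, types, and structure are illustrative assumptions rather than Bigtable's actual implementation.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Memtable = std::map<std::string, std::string>;
using SSTable = std::map<std::string, std::string>;  // toy stand-in for an immutable file

struct Tablet {
  Memtable memtable;
  std::vector<SSTable> sstables;  // oldest first, newest last
};

// Minor compaction: freeze the memtable and write it out as a new SSTable.
void MinorCompaction(Tablet& t) {
  t.sstables.push_back(SSTable(t.memtable.begin(), t.memtable.end()));
  t.memtable.clear();  // a fresh memtable takes subsequent writes
}

// Merging compaction: fold a few of the newest SSTables and the memtable into
// one new SSTable, bounding the number of files a read must consult.
void MergingCompaction(Tablet& t, std::size_t inputs) {
  inputs = std::min(inputs, t.sstables.size());
  SSTable merged(t.memtable.begin(), t.memtable.end());  // newest data first
  for (std::size_t i = 0; i < inputs; ++i) {
    const SSTable& s = t.sstables[t.sstables.size() - 1 - i];
    merged.insert(s.begin(), s.end());  // insert() keeps the newer value already present
  }
  t.sstables.resize(t.sstables.size() - inputs);
  t.sstables.push_back(std::move(merged));
  t.memtable.clear();
}

void Write(Tablet& t, std::string key, std::string value) {
  const std::size_t kMemtableLimit = 4;  // illustrative thresholds only
  const std::size_t kSSTableLimit = 3;
  t.memtable[std::move(key)] = std::move(value);
  if (t.memtable.size() >= kMemtableLimit) MinorCompaction(t);
  if (t.sstables.size() >= kSSTableLimit) MergingCompaction(t, kSSTableLimit);
}

int main() {
  Tablet tablet;
  for (int i = 0; i < 20; ++i) Write(tablet, "row" + std::to_string(i), "value");
  std::cout << "sstables: " << tablet.sstables.size()
            << ", memtable entries: " << tablet.memtable.size() << "\n";
}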
6 Refinements
The implementation described in the previous section
required a number of refinements to achieve the high
performance, availability, and reliability required by our
users. This section describes portions of the implementation in more detail in order to highlight these refinements.
Locality groups
Clients can group multiple column families together into
a locality group. A separate SSTable is generated for
each locality group in each tablet. Segregating column
families that are not typically accessed together into separate locality groups enables more efficient reads. For
example, page metadata in Webtable (such as language
and checksums) can be in one locality group, and the
contents of the page can be in a different group: an application that wants to read the metadata does not need to read all of the page contents.
Compression
Clients can control whether or not the SSTables for a
locality group are compressed, and if so, which compression format is used. The user-specified compression format is applied to each SSTable block (whose size
is controllable via a locality group specific tuning parameter). Although we lose some space by compressing each block separately, we benefit in that small portions of an SSTable can be read without decompressing the entire file. Many clients use a two-pass custom
compression scheme. The first pass uses Bentley and
McIlroy's scheme [6], which compresses long common
strings across a large window. The second pass uses a
fast compression algorithm that looks for repetitions in
a small 16 KB window of the data. Both compression
passes are very fast: they encode at 100–200 MB/s, and
decode at 400–1000 MB/s on modern machines.
Even though we emphasized speed instead of space reduction when choosing our compression algorithms, this
two-pass compression scheme does surprisingly well.
For example, in Webtable, we use this compression
scheme to store Web page contents. In one experiment,
we stored a large number of documents in a compressed
locality group. For the purposes of the experiment, we
limited ourselves to one version of each document instead of storing all versions available to us. The scheme
achieved a 10-to-1 reduction in space. This is much
better than typical Gzip reductions of 3-to-1 or 4-to-1
on HTML pages because of the way Webtable rows are
laid out: all pages from a single host are stored close
to each other. This allows the Bentley-McIlroy algorithm to identify large amounts of shared boilerplate in
pages from the same host. Many applications, not just
Webtable, choose their row names so that similar data
ends up clustered, and therefore achieve very good compression ratios. Compression ratios get even better when
we store multiple versions of the same value in Bigtable.
were co-mingled in the same physical log file. One approach would be for each new tablet server to read this
full commit log file and apply just the entries needed for
the tablets it needs to recover. However, under such a
scheme, if 100 machines were each assigned a single
tablet from a failed tablet server, then the log file would
be read 100 times (once by each server).
We avoid duplicating log reads by first sorting the commit log entries in order of the keys ⟨table, row name, log sequence number⟩. In the
sorted output, all mutations for a particular tablet are
contiguous and can therefore be read efficiently with one
disk seek followed by a sequential read. To parallelize
the sorting, we partition the log file into 64 MB segments, and sort each segment in parallel on different
tablet servers. This sorting process is coordinated by the
master and is initiated when a tablet server indicates that
it needs to recover mutations from some commit log file.
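A small sketch of the recovery-time sort order just described, assuming a simplified log-entry record (the field names are illustrative):

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <tuple>
#include <vector>

// One commit-log entry; the fields are illustrative, not Bigtable's format.
struct LogEntry {
  std::string table;
  std::string row;
  int64_t sequence_number;
  std::string mutation;
};

int main() {
  std::vector<LogEntry> segment = {
      {"Webtable", "com.cnn.www", 12, "set contents:"},
      {"UserTable1", "alice", 7, "set prefs:"},
      {"Webtable", "com.cnn.www", 3, "set anchor:cnnsi.com"},
  };
  // Sort by <table, row name, log sequence number> so that all mutations for
  // one tablet become contiguous and can be read with a single disk seek.
  std::sort(segment.begin(), segment.end(),
            [](const LogEntry& a, const LogEntry& b) {
              return std::tie(a.table, a.row, a.sequence_number) <
                     std::tie(b.table, b.row, b.sequence_number);
            });
  for (const auto& e : segment)
    std::cout << e.table << " " << e.row << " #" << e.sequence_number << " "
              << e.mutation << "\n";
}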
Writing commit logs to GFS sometimes causes performance hiccups for a variety of reasons (e.g., a GFS server
machine involved in the write crashes, or the network
paths traversed to reach the particular set of three GFS
servers is suffering network congestion, or is heavily
loaded). To protect mutations from GFS latency spikes,
each tablet server actually has two log writing threads,
each writing to its own log file; only one of these two
threads is actively in use at a time. If writes to the active log file are performing poorly, the log file writing is
switched to the other thread, and mutations that are in
the commit log queue are written by the newly active log
writing thread. Log entries contain sequence numbers
to allow the recovery process to elide duplicated entries
resulting from this log switching process.
Speeding up tablet recovery
If the master moves a tablet from one tablet server to
another, the source tablet server first does a minor compaction on that tablet. This compaction reduces recovery time by reducing the amount of uncompacted state in
the tablet server's commit log. After finishing this compaction, the tablet server stops serving the tablet. Before
it actually unloads the tablet, the tablet server does another (usually very fast) minor compaction to eliminate
any remaining uncompacted state in the tablet server's
log that arrived while the first minor compaction was
being performed. After this second minor compaction
is complete, the tablet can be loaded on another tablet
server without requiring any recovery of log entries.
Exploiting immutability
Besides the SSTable caches, various other parts of the
Bigtable system have been simplified by the fact that all
To appear in OSDI 2006
of the SSTables that we generate are immutable. For example, we do not need any synchronization of accesses
to the file system when reading from SSTables. As a result, concurrency control over rows can be implemented
very efficiently. The only mutable data structure that is
accessed by both reads and writes is the memtable. To reduce contention during reads of the memtable, we make
each memtable row copy-on-write and allow reads and
writes to proceed in parallel.
Since SSTables are immutable, the problem of permanently removing deleted data is transformed to garbage
collecting obsolete SSTables. Each tablet's SSTables are
registered in the METADATA table. The master removes
obsolete SSTables as a mark-and-sweep garbage collection [25] over the set of SSTables, where the METADATA
table contains the set of roots.
Finally, the immutability of SSTables enables us to
split tablets quickly. Instead of generating a new set of
SSTables for each child tablet, we let the child tablets
share the SSTables of the parent tablet.
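This garbage collection can be pictured as a small mark-and-sweep pass with the METADATA registrations as the root set; the sketch below is a toy illustration, not Bigtable code.

#include <iostream>
#include <set>
#include <string>

int main() {
  // All SSTable files currently present in the file system (toy names).
  std::set<std::string> all_sstables = {"sst-001", "sst-002", "sst-003", "sst-004"};

  // Roots: SSTables registered in the METADATA table for live tablets.
  std::set<std::string> registered = {"sst-002", "sst-004"};

  // Sweep: any SSTable not reachable from the METADATA roots is obsolete
  // and can be deleted by the master.
  for (const auto& file : all_sstables)
    if (!registered.count(file)) std::cout << "delete obsolete " << file << "\n";
}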
7 Performance Evaluation
We set up a Bigtable cluster with N tablet servers to
measure the performance and scalability of Bigtable as
N is varied. The tablet servers were configured to use 1
GB of memory and to write to a GFS cell consisting of
1786 machines with two 400 GB IDE hard drives each.
N client machines generated the Bigtable load used for
these tests. (We used the same number of clients as tablet
servers to ensure that clients were never a bottleneck.)
Each machine had two dual-core Opteron 2 GHz chips,
enough physical memory to hold the working set of all
running processes, and a single gigabit Ethernet link.
The machines were arranged in a two-level tree-shaped
switched network with approximately 100-200 Gbps of
aggregate bandwidth available at the root. All of the machines were in the same hosting facility and therefore the
round-trip time between any pair of machines was less
than a millisecond.
The tablet servers and master, test clients, and GFS
servers all ran on the same set of machines. Every machine ran a GFS server. Some of the machines also ran
either a tablet server, or a client process, or processes
from other jobs that were using the pool at the same time
as these experiments.
R is the distinct number of Bigtable row keys involved
in the test. R was chosen so that each benchmark read or
wrote approximately 1 GB of data per tablet server.
The sequential write benchmark used row keys with
names 0 to R−1. This space of row keys was partitioned into 10N equal-sized ranges. These ranges were
assigned to the N clients by a central scheduler that assigned the next available range to a client as soon as the client finished processing its previous range.
Experiment              # of Tablet Servers
                        1       50      250     500
random reads            1212    593     479     241
random reads (mem)      10811   8511    8000    6250
random writes           8850    3745    3425    2000
sequential reads        4425    2463    2625    2469
sequential writes       8547    3623    2451    1905
scans                   15385   10526   9524    7843

Figure 6: Number of 1000-byte values read/written per second. The table shows the rate per tablet server; the graph shows the aggregate rate.
# of tablet servers    0..19   20..49   50..99   100..499   >500
# of clusters          259     47       20       50         12
8 Real Applications
As of August 2006, there are 388 non-test Bigtable clusters running in various Google machine clusters, with a
combined total of about 24,500 tablet servers. Table 1
shows a rough distribution of tablet servers per cluster.
Many of these clusters are used for development purposes and therefore are idle for significant periods. One
group of 14 busy clusters with 8069 total tablet servers
saw an aggregate volume of more than 1.2 million requests per second, with incoming RPC traffic of about
741 MB/s and outgoing RPC traffic of about 16 GB/s.
Table 2 provides some data about a few of the tables
currently in use. Some tables store data that is served
to users, whereas others store data for batch processing;
the tables range widely in total size, average cell size,
Project name          Table size  Compression  # Cells     # Column  # Locality  % in    Latency-
                      (TB)        ratio        (billions)  Families  Groups      memory  sensitive?
Crawl                 800         11%          1000        16        8           0%      No
Crawl                 50          33%          200         2         2           0%      No
Google Analytics      20          29%          10          1         1           0%      Yes
Google Analytics      200         14%          80          1         1           0%      Yes
Google Base           2           31%          10          29        3           15%     Yes
Google Earth          0.5         64%          8           7         2           33%     Yes
Google Earth          70          -            9           8         3           0%      No
Orkut                 9           -            0.9         8         5           1%      Yes
Personalized Search   4           47%          6           93        11          5%      Yes

Table 2: Characteristics of a few tables in production use. Table size (measured before compression) and # Cells indicate approximate sizes. Compression ratio is not given for tables that have compression disabled.
Each row in the imagery table corresponds to a single geographic segment. Rows are named to ensure that
adjacent geographic segments are stored near each other.
The table contains a column family to keep track of the
sources of data for each segment. This column family
has a large number of columns: essentially one for each
raw data image. Since each segment is only built from a
few images, this column family is very sparse.
The preprocessing pipeline relies heavily on MapReduce over Bigtable to transform data. The overall system
processes over 1 MB/sec of data per tablet server during
some of these MapReduce jobs.
The serving system uses one table to index data stored
in GFS. This table is relatively small (500 GB), but it
must serve tens of thousands of queries per second per
datacenter with low latency. As a result, this table is
hosted across hundreds of tablet servers and contains in-memory column families.
The Personalized Search data is replicated across several Bigtable clusters to increase availability and to reduce latency due to distance from clients. The Personalized Search team originally built a client-side replication
mechanism on top of Bigtable that ensured eventual consistency of all replicas. The current system now uses a
replication subsystem that is built into the servers.
The design of the Personalized Search storage system
allows other groups to add new per-user information in
their own columns, and the system is now used by many
other Google properties that need to store per-user configuration options and settings. Sharing a table amongst
many groups resulted in an unusually large number of
column families. To help support sharing, we added a
simple quota mechanism to Bigtable to limit the storage consumption by any particular client in shared tables; this mechanism provides some isolation between
the various product groups using this system for per-user
information storage.
9 Lessons
In the process of designing, implementing, maintaining,
and supporting Bigtable, we gained useful experience
and learned several interesting lessons.
One lesson we learned is that large distributed systems are vulnerable to many types of failures, not just
the standard network partitions and fail-stop failures assumed in many distributed protocols. For example, we
have seen problems due to all of the following causes:
memory and network corruption, large clock skew, hung
machines, extended and asymmetric network partitions,
bugs in other systems that we are using (Chubby for example), overflow of GFS quotas, and planned and unplanned hardware maintenance. As we have gained more
experience with these problems, we have addressed them
by changing various protocols. For example, we added
checksumming to our RPC mechanism. We also handled
the behavior of Chubby features that were seldom exercised by other applications. We discovered that we were
spending an inordinate amount of time debugging obscure corner cases, not only in Bigtable code, but also in
Chubby code. Eventually, we scrapped this protocol and
moved to a newer simpler protocol that depends solely
on widely-used Chubby features.
10 Related Work
The Boxwood project [24] has components that overlap
in some ways with Chubby, GFS, and Bigtable, since it
provides for distributed agreement, locking, distributed
chunk storage, and distributed B-tree storage. In each
case where there is overlap, it appears that the Boxwood's component is targeted at a somewhat lower level
than the corresponding Google service. The Boxwood
project's goal is to provide infrastructure for building
higher-level services such as file systems or databases,
while the goal of Bigtable is to directly support client
applications that wish to store data.
Many recent projects have tackled the problem of providing distributed storage or higher-level services over
wide area networks, often at Internet scale. This includes work on distributed hash tables that began with
projects such as CAN [29], Chord [32], Tapestry [37],
and Pastry [30]. These systems address concerns that do
not arise for Bigtable, such as highly variable bandwidth,
untrusted participants, or frequent reconfiguration; decentralized control and Byzantine fault tolerance are not
Bigtable goals.
In terms of the distributed data storage model that one
might provide to application developers, we believe the
key-value pair model provided by distributed B-trees or
distributed hash tables is too limiting. Key-value pairs
are a useful building block, but they should not be the
only building block one provides to developers. The
model we chose is richer than simple key-value pairs,
and supports sparse semi-structured data. Nonetheless,
it is still simple enough that it lends itself to a very efficient flat-file representation, and it is transparent enough
(via locality groups) to allow our users to tune important
behaviors of the system.
Several database vendors have developed parallel
databases that can store large volumes of data. Oracle's
Real Application Cluster database [27] uses shared disks
to store data (Bigtable uses GFS) and a distributed lock
manager (Bigtable uses Chubby). IBM's DB2 Parallel
Edition [4] is based on a shared-nothing [33] architecture
similar to Bigtable. Each DB2 server is responsible for
a subset of the rows in a table which it stores in a local
relational database. Both products provide a complete
relational model with transactions.
11 Conclusions

We have described Bigtable, a distributed system for storing structured data at Google. Bigtable clusters have been in production use since April 2005, and we spent roughly seven person-years on design and implementation before that date. As of August 2006, more than sixty projects are using Bigtable. Our users like the performance and high availability provided by the Bigtable implementation, and that they can scale the capacity of their clusters by simply adding more machines to the system as their resource demands change over time.

Given the unusual interface to Bigtable, an interesting question is how difficult it has been for our users to adapt to using it. New users are sometimes uncertain of how to best use the Bigtable interface, particularly if they are accustomed to using relational databases that support general-purpose transactions. Nevertheless, the fact that many Google products successfully use Bigtable demonstrates that our design works well in practice.

We are in the process of implementing several additional Bigtable features, such as support for secondary indices and infrastructure for building cross-data-center replicated Bigtables with multiple master replicas. We have also begun deploying Bigtable as a service to product groups, so that individual groups do not need to maintain their own clusters. As our service clusters scale, we will need to deal with more resource-sharing issues within Bigtable itself [3, 5].

Finally, we have found that there are significant advantages to building our own storage solution at Google. We have gotten a substantial amount of flexibility from designing our own data model for Bigtable. In addition, our control over Bigtable's implementation, and the other Google infrastructure upon which Bigtable depends, means that we can remove bottlenecks and inefficiencies as they arise.

Acknowledgements

We thank the anonymous reviewers, David Nagle, and our shepherd Brad Calder, for their feedback on this paper. The Bigtable system has benefited greatly from the feedback of our many users within Google. In addition, we thank the following people for their contributions to Bigtable: Dan Aguayo, Sameer Ajmani, Zhifeng Chen, Bill Coughran, Mike Epstein, Healfdene Goguen, Robert Griesemer, Jeremy Hylton, Josh Hyman, Alex Khesin, Joanna Kulik, Alberto Lerner, Sherry Listgarten, Mike Maloney, Eduardo Pinheiro, Kathy Polizzi, Frank Yellin, and Arthur Zwiegincew.

References
[6] Bentley, J. L., and McIlroy, M. D. Data compression using long common strings. In Data Compression Conference (1999), pp. 287–295.
[8] Burrows, M. The Chubby lock service for loosely-coupled distributed systems. In Proc. of the 7th OSDI (Nov. 2006).
[9] Chandra, T., Griesemer, R., and Redstone, J. Paxos made live: An engineering perspective. In Proc. of PODC (2007).
[10] Comer, D. Ubiquitous B-tree. Computing Surveys 11, 2 (June 1979), 121–137.
[11] Copeland, G. P., Alexander, W., Boughter, E. E., and Keller, T. W. Data placement in Bubba. In Proc. of SIGMOD (1988), pp. 99–108.
[12] Dean, J., and Ghemawat, S. MapReduce: Simplified data processing on large clusters. In Proc. of the 6th OSDI (Dec. 2004), pp. 137–150.
[13] DeWitt, D., Katz, R., Olken, F., Shapiro, L., Stonebraker, M., and Wood, D. Implementation techniques for main memory database systems. In Proc. of SIGMOD (June 1984), pp. 1–8.
[14] DeWitt, D. J., and Gray, J. Parallel database systems: The future of high performance database systems. CACM 35, 6 (June 1992), 85–98.
[15] French, C. D. One size fits all database architectures do not work for DSS. In Proc. of SIGMOD (May 1995), pp. 449–450.
[16] Gawlick, D., and Kinkade, D. Varieties of concurrency control in IMS/VS fast path. Database Engineering Bulletin 8, 2 (1985), 3–10.
[17] Ghemawat, S., Gobioff, H., and Leung, S.-T. The Google file system. In Proc. of the 19th ACM SOSP (Dec. 2003), pp. 29–43.
[18] Gray, J. Notes on database operating systems. In Operating Systems: An Advanced Course, vol. 60 of Lecture Notes in Computer Science. Springer-Verlag, 1978.
[19] Greer, R. Daytona and the fourth-generation language Cymbal. In Proc. of SIGMOD (1999), pp. 525–526.
[20] Hagmann, R. Reimplementing the Cedar file system using logging and group commit. In Proc. of the 11th SOSP (Dec. 1987), pp. 155–162.
[21] Hartman, J. H., and Ousterhout, J. K. The Zebra striped network file system. In Proc. of the 14th SOSP (Asheville, NC, 1993), pp. 29–43.
[22] KX.COM. Product page.
[25] McCarthy, J. Recursive functions of symbolic expressions and their computation by machine. CACM 3, 4 (Apr. 1960), 184–195.
[27] ORACLE.COM. Product page.
[28] Pike, R., Dorward, S., Griesemer, R., and Quinlan, S. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming Journal 13, 4 (2005), 227–298.
[29] Ratnasamy, S., Francis, P., Handley, M., Karp, R., and Shenker, S. A scalable content-addressable network. In Proc. of SIGCOMM (Aug. 2001), pp. 161–172.
[30] Rowstron, A., and Druschel, P. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In Proc. of Middleware 2001 (Nov. 2001), pp. 329–350.
[31] SENSAGE.COM. Product page. sensage.com/products-sensage.htm.
SYBASE.COM. Product page. www.sybase.com/products/databaseservers/sybaseiq.
Abstract
We present Mesos, a platform for sharing commodity clusters between multiple diverse cluster computing
frameworks, such as Hadoop and MPI. Sharing improves
cluster utilization and avoids per-framework data replication. Mesos shares resources in a fine-grained manner, allowing frameworks to achieve data locality by
taking turns reading data stored on each machine. To
support the sophisticated schedulers of today's frameworks, Mesos introduces a distributed two-level scheduling mechanism called resource offers. Mesos decides
how many resources to offer each framework, while
frameworks decide which resources to accept and which
computations to run on them. Our results show that
Mesos can achieve near-optimal data locality when sharing the cluster among diverse frameworks, can scale to
50,000 (emulated) nodes, and is resilient to failures.
Introduction
[Figure 1: CDF of MapReduce job durations and of map & reduce task durations (seconds, log scale) in the Facebook workload.]
Target Environment
As an example of a workload we aim to support, consider the Hadoop data warehouse at Facebook [5]. Facebook loads logs from its web services into a 2000-node
Hadoop cluster, where they are used for applications
such as business intelligence, spam detection, and ad
optimization. In addition to production jobs that run
periodically, the cluster is used for many experimental
jobs, ranging from multi-hour machine learning computations to 1-2 minute ad-hoc queries submitted interactively through an SQL interface called Hive [3]. Most
jobs are short (the median job being 84s long), and the
jobs are composed of fine-grained map and reduce tasks
(the median task being 23s), as shown in Figure 1.
To meet the performance requirements of these jobs,
Facebook uses a fair scheduler for Hadoop that takes advantage of the fine-grained nature of the workload to allocate resources at the level of tasks and to optimize data
locality [38]. Unfortunately, this means that the cluster
can only run Hadoop jobs. If a user wishes to write an ad
targeting algorithm in MPI instead of MapReduce, perhaps because MPI is more efficient for this job's communication pattern, then the user must set up a separate MPI
cluster and import terabytes of data into it. This problem
is not hypothetical; our contacts at Yahoo! and Facebook
report that users want to run MPI and MapReduce Online
(a streaming MapReduce) [11, 10]. Mesos aims to provide fine-grained sharing between multiple cluster computing frameworks to enable these usage scenarios.
[Figure 2: Mesos architecture. Framework schedulers (e.g., a Hadoop scheduler and an MPI scheduler) register with the Mesos master, which is backed by standby masters coordinated through a ZooKeeper quorum; Mesos slaves run Hadoop and MPI executors that launch the frameworks' tasks.]

Architecture

We begin our description of Mesos by discussing our design philosophy. We then describe the components of Mesos, our resource allocation mechanisms, and how Mesos achieves isolation, scalability, and fault tolerance.

3.1 Design Philosophy

Overview

[Figure 3: Resource offer example. The allocation module in the Mesos master offers the free resources on a slave (e.g., <s1, 4cpu, 4gb, ...>) to a framework scheduler, which replies with tasks for the slave's executor to launch.]
Isolation
Mesos provides performance isolation between framework executors running on the same slave by leveraging
existing OS isolation mechanisms. Since these mechanisms are platform-dependent, we support multiple isolation mechanisms through pluggable isolation modules.
We currently isolate resources using OS container
technologies, specifically Linux Containers [9] and Solaris Projects [13]. These technologies can limit the
CPU, memory, network bandwidth, and (in new Linux
kernels) I/O usage of a process tree. These isolation technologies are not perfect, but using containers is already
an advantage over frameworks like Hadoop, where tasks
from different jobs simply run in separate processes.
Resource Allocation
Mesos delegates allocation decisions to a pluggable allocation module, so that organizations can tailor allocation to their needs. So far, we have implemented two
allocation modules: one that performs fair sharing based
on a generalization of max-min fairness for multiple resources [21] and one that implements strict priorities.
Similar policies are used in Hadoop and Dryad [25, 38].
In normal operation, Mesos takes advantage of the
fact that most tasks are short, and only reallocates resources when tasks finish. This usually happens frequently enough so that new frameworks acquire their
share quickly. For example, if a framework's share is
10% of the cluster, it needs to wait approximately 10%
of the mean task length to receive its share. However,
if a cluster becomes filled by long tasks, e.g., due to a
buggy job or a greedy framework, the allocation module
can also revoke (kill) tasks. Before killing a task, Mesos
gives its framework a grace period to clean it up.
We leave it up to the allocation module to select the
policy for revoking tasks, but describe two related mechanisms here. First, while killing a task has a low impact
on many frameworks (e.g., MapReduce), it is harmful for
frameworks with interdependent tasks (e.g., MPI). We allow these frameworks to avoid being killed by letting allocation modules expose a guaranteed allocation to each framework: a quantity of resources that the framework may hold without losing tasks.
Because task scheduling in Mesos is a distributed process, it needs to be efficient and robust to failures. Mesos
includes three mechanisms to help with this goal.
First, because some frameworks will always reject certain resources, Mesos lets them short-circuit the rejection
process and avoid communication by providing filters to
the master. We currently support two types of filters:
"only offer nodes from list L" and "only offer nodes with
at least R resources free". However, other types of predicates could also be supported. Note that unlike generic
constraint languages, filters are Boolean predicates that
specify whether a framework will reject one bundle of
resources on one node, so they can be evaluated quickly
on the master. Any resource that does not pass a framework's filter is treated exactly like a rejected resource.
Second, because a framework may take time to respond to an offer, Mesos counts resources offered to a
framework towards its allocation of the cluster. This is
a strong incentive for frameworks to respond to offers
quickly and to filter resources that they cannot use.
Third, if a framework has not responded to an offer
for a sufficiently long time, Mesos rescinds the offer and
re-offers the resources to other frameworks.
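A rough sketch of how such filters might be represented and evaluated on the master; the structure and field names are assumptions, since the text only specifies the two predicate types.

#include <iostream>
#include <set>
#include <string>

// The two filter types described above: "only offer nodes from list L" and
// "only offer nodes with at least R resources free".
struct Filters {
  std::set<std::string> allowed_nodes;  // empty means "any node"
  double min_free_cpus = 0;
  double min_free_mem_gb = 0;
};

struct Offer {
  std::string node;
  double free_cpus;
  double free_mem_gb;
};

// Filters are cheap Boolean predicates, so the master can evaluate them before
// sending an offer; anything rejected here is treated like a declined resource.
bool PassesFilters(const Filters& f, const Offer& o) {
  if (!f.allowed_nodes.empty() && !f.allowed_nodes.count(o.node)) return false;
  return o.free_cpus >= f.min_free_cpus && o.free_mem_gb >= f.min_free_mem_gb;
}

int main() {
  Filters f{{"slave1", "slave2"}, 4.0, 4.0};
  std::cout << PassesFilters(f, {"slave1", 2.0, 8.0}) << " "
            << PassesFilters(f, {"slave2", 8.0, 16.0}) << "\n";  // prints: 0 1
}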
3.6 Fault Tolerance

API Summary

Scheduler Callbacks:
resourceOffer(offerId, offers)
offerRescinded(offerId)
statusUpdate(taskId, status)
slaveLost(slaveId)

Scheduler Actions:
replyToOffer(offerId, tasks)
setNeedsOffers(bool)
setFilters(filters)
getGuaranteedShare()
killTask(taskId)

Executor Callbacks:
launchTask(taskDescriptor)
killTask(taskId)

Executor Actions:
sendStatus(taskId, status)
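To illustrate how these callbacks and actions fit together, here is a toy framework scheduler that answers a resource offer by building the task list it would hand to replyToOffer; all of the types below are stand-ins, not the real Mesos API.

#include <iostream>
#include <string>
#include <vector>

struct Resources { double cpus; double mem_gb; };
struct Offer { std::string slave; Resources available; };
struct TaskDescriptor { std::string name; std::string slave; Resources uses; };

// Toy scheduler policy: pack as many 1-CPU/1-GB tasks as each offer can hold
// and return that list (an empty list would amount to rejecting the offer).
std::vector<TaskDescriptor> OnResourceOffer(const std::vector<Offer>& offers) {
  std::vector<TaskDescriptor> tasks;
  for (const auto& offer : offers) {
    Resources left = offer.available;
    while (left.cpus >= 1.0 && left.mem_gb >= 1.0) {
      tasks.push_back({"task-" + std::to_string(tasks.size()), offer.slave, {1.0, 1.0}});
      left.cpus -= 1.0;
      left.mem_gb -= 1.0;
    }
  }
  return tasks;  // would be passed to replyToOffer(offerId, tasks)
}

int main() {
  std::vector<Offer> offers = {{"s1", {4.0, 4.0}}, {"s2", {2.0, 8.0}}};
  for (const auto& t : OnResourceOffer(offers))
    std::cout << t.name << " on " << t.slave << "\n";
}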
4 Mesos Behavior

4.2 Homogeneous Tasks
                  Elastic Framework                     Rigid Framework
                  Constant dist.   Exponential dist.    Constant dist.   Exponential dist.
Ramp-up time      T                T ln k               T                T ln k
Completion time   (1/2 + β)T       (1 + β)T             (1 + β)T         (ln k + β)T
Utilization       1                1                    β/(1/2 + β)      β/(ln k − 1 + β)

Table 2: Ramp-up time, job completion time and utilization for both elastic and rigid frameworks, and for both constant and exponential task duration distributions. The framework starts with no slots. k is the number of slots the framework is entitled to under the scheduling policy, and βT is the time it takes a job to complete assuming the framework gets all k slots at once.
Framework ramp-up time: If task durations are constant, it will take framework f at most T time to acquire
k slots. This is simply because during a T interval, every
slot will become available, which will enable Mesos to
offer the framework all k of its preferred slots. If the duration distribution is exponential, the expected ramp-up
time can be as high as T ln k [23].
Job completion time: The expected completion time3
of an elastic job is at most (1 + β)T, which is within T
(i.e., the mean task duration) of the completion time of
the job when it gets all its slots instantaneously. Rigid
jobs achieve similar completion times for constant task
durations, but exhibit much higher completion times for
exponential job durations, i.e., (ln k + β)T. This is simply because it takes a framework T ln k time on average
to acquire all its slots and be able to start its job.
System utilization: Elastic jobs fully utilize their allocated slots, because they can use every slot as soon
as they get it. As a result, assuming infinite demand, a
system running only elastic jobs is fully utilized. Rigid
frameworks achieve slightly worse utilizations, as their
jobs cannot start before they get their full allocations, and
thus they waste the resources held while ramping up.
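As a quick numeric illustration of the exponential case, using the expressions from Table 2 with assumed values k = 100 and β = 1: ln 100 ≈ 4.6, so ramp-up takes about 4.6T; an elastic job finishes in about (1 + β)T = 2T, while a rigid job needs about (ln k + β)T ≈ 5.6T and achieves a utilization of roughly β/(ln k − 1 + β) ≈ 0.22.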
4.3 Placement Preferences

4.4 Heterogeneous Tasks
So far we have assumed that frameworks have homogeneous task duration distributions, i.e., that all frameworks have the same task duration distribution. In this
section, we discuss frameworks with heterogeneous task
duration distributions. In particular, we consider a workload with tasks that are either short or long, where the
mean duration of the long tasks is significantly longer
than the mean of the short tasks. Such heterogeneous
3 When computing job completion time we assume that the last tasks
of the job running on the framework's k slots finish at the same time.
reducing latency for new jobs and wasted work for revocation. If frameworks are elastic, they will opportunistically utilize all the resources they can obtain. Finally,
if frameworks do not accept resources that they do not
understand, they will leave them for frameworks that do.
We also note that these properties are met by many
current cluster computing frameworks, such as MapReduce and Dryad, simply because using short independent
tasks simplifies load balancing and fault recovery.
Framework Incentives
Interdependent framework constraints: It is possible to construct scenarios where, because of esoteric interdependencies between frameworks (e.g., certain tasks
from two frameworks cannot be colocated), only a single global allocation of the cluster performs well. We
argue such scenarios are rare in practice. In the model
discussed in this section, where frameworks only have
preferences over which nodes they use, we showed that
allocations approximate those of optimal schedulers.
Scale elastically: The ability of a framework to use resources as soon as it acquires them, instead of waiting
to reach a given minimum allocation, would allow the
framework to start (and complete) its jobs earlier. In addition, the ability to scale up and down allows a framework to grab unused resources opportunistically, as it can
later release them with little negative impact.
works cannot predict task times and must be able to handle failures and stragglers [18, 40, 38]. These policies
are easy to implement over resource offers.
as an executor, which may be terminated if it is not running tasks. This would make map output files unavailable
to reduce tasks. We solved this problem by providing a
shared file server on each node in the cluster to serve
local files. Such a service is useful beyond Hadoop, to
other frameworks that write data locally on each node.
In total, our Hadoop port is 1500 lines of code.
Implementation

5.2 Hadoop Port

5.3 Spark Framework
[Figure: data flow of an iterative job in (a) Dryad and (b) Spark; each iteration applies a function f(x, w), and in the Spark version the parameter w is carried across iterations.]

Evaluation

Macrobenchmark

Macrobenchmark Workloads

Bin   Job Type      Map Tasks   Reduce Tasks   # Jobs Run
1     selection     1           NA             38
2     text search   2           NA             18
3     aggregation   10          2              14
4     selection     50          NA             12
5     aggregation   100         10             6
6     selection     200         NA             6
7     text search   400         NA             4
8     join          400         30             2

Table 3: Job types for each bin in our Facebook Hadoop mix.

4 We scaled down the largest jobs in [38] to have the workload fit a quarter of our cluster size.
[Figure 5 plots: share of cluster (fraction of CPUs) over time under static partitioning and under Mesos; panel (c) shows Spark.]

Figure 5: Comparison of cluster shares (fraction of CPUs) over time for each of the frameworks in the Mesos and static partitioning macrobenchmark scenarios. On Mesos, frameworks can scale up when their demand is high and that of other frameworks is low, and thus finish jobs faster. Note that the plots' time axes are different (e.g., the large Hadoop mix takes 3200s with static partitioning).
Figure 6: Framework shares on Mesos during the macrobenchmark. By pooling resources, Mesos lets each workload scale
up to fill gaps in the demand of others. In addition, fine-grained
sharing allows resources to be reallocated in tens of seconds.
Torque / MPI Our Torque framework ran eight instances of the tachyon raytracing job [35] that is part of
the SPEC MPI2007 benchmark. Six of the jobs ran small
problem sizes and two ran large ones. Both types used 24
parallel tasks. We submitted these jobs at fixed times to
both clusters. The tachyon job is CPU-intensive.
6.1.2 Macrobenchmark Results
Framework              Static Partitioning (s)   Mesos (s)   Speedup
Facebook Hadoop Mix    ...                       6319        1.14
Large Hadoop Mix       3143                      1494        2.10
Spark                  1684                      1338        1.26
Torque / MPI           3210                      3352        0.96

Table 4: Aggregate performance of each framework in the macrobenchmark (sum of running times of all the jobs in the framework). The speedup column shows the relative gain on Mesos.
[Figure: fraction of local map tasks (and job durations in seconds) under static partitioning and under Mesos with no, 1s, and 5s delay scheduling.]

Data Locality

Framework Overhead
6.4 Spark Framework

[Figure: running time (seconds) of Hadoop and Spark as the number of iterations of an iterative job grows.]

Mesos Scalability

[Figure: task launch overhead (seconds) as the number of emulated nodes grows from 10,000 to 50,000.]

To evaluate Mesos scalability, we emulated large clusters by running up to 50,000 slave daemons on 99 Amazon EC2 nodes, each with 8 CPU cores and 6 GB RAM.
We used one EC2 node for the master and the rest of the nodes to run slaves. During the experiment, each of 200

6.6 Failure Recovery

Performance Isolation

Related Work
Condor. The Condor cluster manager uses the ClassAds language [32] to match nodes to jobs. Using a resource specification language is not as flexible for frameworks as resource offers, since not all requirements may
be expressible. Also, porting existing frameworks, which
have their own schedulers, to Condor would be more difficult than porting them to Mesos, where existing schedulers fit naturally into the two-level scheduling model.
Next-Generation Hadoop. Recently, Yahoo! announced a redesign for Hadoop that uses a two-level
scheduling model, where per-application masters request
resources from a central manager [14]. The design aims
to support non-MapReduce applications as well. While
details about the scheduling model in this system are currently unavailable, we believe that the new application
masters could naturally run as Mesos frameworks.
Acknowledgements
References
5 Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the NSF.
Nick Mathewson
The Free Haven Project
nickm@freehaven.net

Paul Syverson
Naval Research Lab
syverson@itd.nrl.navy.mil

Abstract

We present Tor, a circuit-based low-latency anonymous communication service. This second-generation Onion Routing system addresses limitations in the original design by adding perfect forward secrecy, congestion control, directory servers, integrity checking, configurable exit policies, and a practical design for location-hidden services via rendezvous points. Tor works on the real-world Internet, requires no special privileges or kernel modifications, requires little synchronization or coordination between nodes, and provides a reasonable tradeoff between anonymity, usability, and efficiency. We briefly describe our experiences with an international network of more than 30 nodes. We close with a list of open problems in anonymous communication.

Overview

Related work
Modern anonymity systems date to Chaum's Mix-Net design [10]. Chaum proposed hiding the correspondence between sender and recipient by wrapping messages in layers
of public-key cryptography, and relaying them through a path
composed of mixes. Each mix in turn decrypts, delays, and
re-orders messages before relaying them onward.
Subsequent relay-based anonymity designs have diverged
in two main directions. Systems like Babel [28], Mixmaster [36], and Mixminion [15] have tried to maximize
anonymity at the cost of introducing comparatively large
and variable latencies. Because of this decision, these highlatency networks resist strong global adversaries, but introduce too much lag for interactive tasks like web browsing,
Internet chat, or SSH connections.
Tor belongs to the second category: low-latency designs
that try to anonymize interactive network traffic. These systems handle a variety of bidirectional protocols. They also
provide more convenient mail delivery than the high-latency
anonymous email networks, because the remote mail server
provides explicit and timely delivery confirmation. But because these designs typically involve many packets that must
be delivered quickly, it is difficult for them to prevent an attacker who can eavesdrop both ends of the communication
from correlating the timing and volume of traffic entering the
anonymity network with traffic leaving it [45]. These protocols are similarly vulnerable to an active adversary who introduces timing patterns into traffic entering the network and
looks for correlated patterns among exiting traffic. Although
some work has been done to frustrate these attacks, most designs protect primarily against traffic analysis rather than traffic confirmation (see Section 3.1).
The simplest low-latency designs are single-hop proxies
such as the Anonymizer [3]: a single trusted server strips
the data's origin before relaying it. These designs are easy to
analyze, but users must trust the anonymizing proxy. Concentrating the traffic to this single point increases the anonymity
set (the people a given user is hiding among), but it is vulnerable if the adversary can observe all traffic entering and
leaving the proxy.
protocol-layer decision requires a compromise between flexibility and anonymity. For example, a system that understands
HTTP can strip identifying information from requests, can
take advantage of caching to limit the number of requests that
leave the network, and can batch or encode requests to minimize the number of connections. On the other hand, an IP-level anonymizer can handle nearly any protocol, even ones
unforeseen by its designers (though these systems require
kernel-level modifications to some operating systems, and so
are more complex and less portable). TCP-level anonymity
networks like Tor present a middle approach: they are application neutral (so long as the application supports, or can
be tunneled across, TCP), but by treating application connections as data streams rather than raw TCP packets, they avoid
the inefficiencies of tunneling TCP over TCP.
Distributed-trust anonymizing systems need to prevent attackers from adding too many servers and thus compromising
user paths. Tor relies on a small set of well-known directory
servers, run by independent parties, to decide which nodes
can join. Tarzan and MorphMix allow unknown users to run
servers, and use a limited resource (like IP addresses) to prevent an attacker from controlling too much of the network.
Crowds suggests requiring written, notarized requests from
potential crowd members.
Anonymous communication is essential for censorship-resistant systems like Eternity [2], Free Haven [19], Publius [53], and Tangler [52]. Tor's rendezvous points enable
connections between mutually anonymous entities; they are a
building block for location-hidden servers, which are needed
by Eternity and Free Haven.
Goals
Like other low-latency anonymity designs, Tor seeks to frustrate attackers from linking communication partners, or from
linking multiple communications to or from a single user.
Within this main goal, however, several considerations have
directed Tor's evolution.
Deployability: The design must be deployed and used in
the real world. Thus it must not be expensive to run (for
example, by requiring more bandwidth than volunteers are
willing to provide); must not place a heavy liability burden
on operators (for example, by allowing attackers to implicate
onion routers in illegal activities); and must not be difficult
or expensive to implement (for example, by requiring kernel
patches, or separate proxies for every protocol). We also cannot require non-anonymous parties (such as websites) to run
our software. (Our rendezvous point design does not meet
this goal for non-anonymous users talking to hidden servers,
however; see Section 5.)
Usability: A hard-to-use system has fewer users; and because anonymity systems hide users among users, a system
with fewer users provides less anonymity. Usability is thus not only a convenience: it is a security requirement.
Non-goals
In favoring simple, deployable designs, we have explicitly deferred several possible goals, either because they are solved
elsewhere, or because they are not yet solved.
Not peer-to-peer: Tarzan and MorphMix aim to scale
to completely decentralized peer-to-peer environments with
thousands of short-lived servers, many of which may be controlled by an adversary. This approach is appealing, but still
has many open problems [24, 43].
Not secure against end-to-end attacks: Tor does not
claim to completely solve end-to-end timing or intersection
attacks. Some approaches, such as having users run their own
onion routers, may help; see Section 9 for more discussion.
No protocol normalization: Tor does not provide protocol normalization like Privoxy or the Anonymizer. If senders
want anonymity from responders while using complex and
variable protocols like HTTP, Tor must be layered with a
filtering proxy such as Privoxy to hide differences between
clients, and expunge protocol features that leak identity. Note
that by this separation Tor can also provide services that are
anonymous to the network yet authenticated to the responder,
like SSH. Similarly, Tor does not integrate tunneling for nonstream-based protocols like UDP; this must be provided by
an external service if appropriate.
Not steganographic: Tor does not try to conceal who is
connected to the network.
3.1 Threat Model
ephemeral keys. The TLS protocol also establishes a short-term link key when communicating between ORs. Short-term
keys are rotated periodically and independently, to limit the
impact of key compromise.
Section 4.1 presents the fixed-size cells that are the unit
of communication in Tor. We describe in Section 4.2 how
circuits are built, extended, truncated, and destroyed. Section 4.3 describes how TCP streams are routed through the
network. We address integrity checking in Section 4.4, and
resource limiting in Section 4.5. Finally, Section 4.6 talks
about congestion control and fairness issues.
4.1 Cells
[Figure: cell format. A cell carries a 2-byte CircID and a CMD field followed by DATA: 509 bytes of DATA in a control cell, 498 bytes in a relay cell (whose payload also carries a relay header). Links between nodes are TLS-encrypted.]

4.2 Constructing a circuit

[Figure: circuit construction and use. Alice's OP sends "Create c1, E(g^x1)" to OR 1 over a TLS-encrypted link, extends the circuit to OR 2, and then opens a TCP connection to a website; relay cells such as Relay c1{{Connected}} and Relay c1{{Data, "HTTP GET..."}} travel along the circuit, while the final link to the website is unencrypted. Legend: E(x) = RSA encryption, {X} = AES encryption, cN = a circID.]
A user's OP constructs circuits incrementally, negotiating a symmetric key with each OR on the circuit, one hop at a time. To begin creating a new circuit, the OP (call her Alice) sends a create cell to the first node in her chosen path (call him Bob). (She chooses a new circID CAB not currently used on the connection from her to Bob.) The create cell's payload contains the first half of the Diffie-Hellman handshake (g^x), encrypted to the onion key of the OR (call him Bob). Bob responds with a created cell containing g^y along with a hash of the negotiated key K = g^xy.
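As a toy numeric illustration of this key agreement (tiny parameters only; real Tor wraps the Diffie-Hellman handshake in RSA and TLS over a large prime group, none of which is modeled here):

#include <cstdint>
#include <iostream>

// Modular exponentiation with small toy parameters.
uint64_t PowMod(uint64_t base, uint64_t exp, uint64_t mod) {
  uint64_t result = 1 % mod;
  base %= mod;
  while (exp > 0) {
    if (exp & 1) result = (result * base) % mod;
    base = (base * base) % mod;
    exp >>= 1;
  }
  return result;
}

int main() {
  const uint64_t p = 2147483647, g = 5;  // toy group parameters
  const uint64_t x = 123456;  // Alice's secret; g^x travels in the create cell
  const uint64_t y = 654321;  // Bob's secret; g^y comes back in the created cell

  uint64_t gx = PowMod(g, x, p);
  uint64_t gy = PowMod(g, y, p);

  // Both ends derive the same key material K = g^(xy).
  std::cout << "K(Alice) = " << PowMod(gy, x, p)
            << ", K(Bob) = " << PowMod(gx, y, p) << "\n";
}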
Once the circuit has been established, Alice and Bob can send one another relay cells encrypted with the negotiated key.
Relay cells
Once Alice has established the circuit (so she shares keys with
each OR on the circuit), she can send relay cells. Upon receiving a relay cell, an OR looks up the corresponding circuit,
and decrypts the relay header and payload with the session
key for that circuit. If the cell is headed away from Alice the
OR then checks whether the decrypted cell has a valid digest
(as an optimization, the first two bytes of the integrity check
are zero, so in most cases we can avoid computing the hash).
If valid, it accepts the relay cell and processes it as described
below. Otherwise, the OR looks up the circID and OR for the
next step in the circuit, replaces the circID as appropriate, and
sends the decrypted relay cell to the next OR. (If the OR at
the end of the circuit receives an unrecognized relay cell, an
error has occurred, and the circuit is torn down.)
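A compressed sketch of that per-hop decision; the layout, the XOR "decryption", and the two-byte digest below are toy stand-ins for the real AES layers and running digest.

#include <cstdint>
#include <iostream>
#include <string>

struct RelayCell {
  uint16_t circ_id;
  uint16_t digest;     // toy integrity check: zero means "valid at this hop"
  std::string payload;
};

// Toy per-hop processing: peel one layer, then either deliver the cell
// (digest recognized) or forward it to the next OR on the circuit.
void ProcessAtOR(RelayCell cell, uint8_t session_key, uint16_t next_circ_id) {
  for (char& c : cell.payload) c ^= session_key;  // stand-in for one decryption layer

  // Optimization noted above: a valid digest starts with zero bytes, so most
  // cells destined for another hop are rejected without hashing anything.
  if (cell.digest == 0) {
    std::cout << "deliver locally: " << cell.payload << "\n";
  } else {
    cell.circ_id = next_circ_id;  // replace the circID for the next hop
    std::cout << "forward on circ " << cell.circ_id << "\n";
  }
}

int main() {
  ProcessAtOR({7, 0, "HTTP GET ..."}, 0, 9);   // exit hop; key 0 keeps the toy payload readable
  ProcessAtOR({7, 42, "opaque bytes"}, 3, 9);  // middle hop forwards the cell onward
}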
1 Actually, the negotiated key is used to derive two symmetric keys: one
for each direction.
4.3
address to the Tor client. If the application does DNS resolution first, Alice thereby reveals her destination to the remote
DNS server, rather than sending the hostname through the Tor
network to be resolved at the far end. Common applications
like Mozilla and SSH have this flaw.
With Mozilla, the flaw is easy to address: the filtering
HTTP proxy called Privoxy gives a hostname to the Tor
client, so Alice's computer never does DNS resolution. But
a portable general solution, such as is needed for SSH, is an
open problem. Modifying or replacing the local nameserver
can be invasive, brittle, and unportable. Forcing the resolver
library to prefer TCP rather than UDP is hard, and also has
portability problems. Dynamically intercepting system calls
to the resolver library seems a promising direction. We could
also provide a tool similar to dig to perform a private lookup
through the Tor network. Currently, we encourage the use of
privacy-aware proxies like Privoxy wherever possible.
Closing a Tor stream is analogous to closing a TCP stream:
it uses a two-step handshake for normal operation, or a one-step handshake for errors. If the stream closes abnormally,
the adjacent node simply sends a relay teardown cell. If the
stream closes normally, the node sends a relay end cell down
the circuit, and the other side responds with its own relay end
cell. Because all relay cells use layered encryption, only the
destination OR knows that a given relay cell is a request to
close a stream. This two-step handshake allows Tor to support
TCP-based applications that use half-closed connections.
4.6 Congestion control

6.1
Providing Tor as a public service creates many opportunities for denial-of-service attacks against the network. While flow control and rate limiting (discussed in Section 4.6) prevent users from consuming more bandwidth than routers are willing to provide, opportunities remain for users to consume more network resources than their fair share, or to render the network unusable for others.
Passive attacks
Observing user traffic patterns. Observing a user's connection will not reveal
traffic patterns (both sent and received). Profiling via user
connection patterns requires further processing, because multiple application streams may be operating simultaneously or
in series over a single circuit.
Observing user content. While content at the user end is
encrypted, connections to responders may not be (indeed, the
responding website itself may be hostile). While filtering
content is not a primary goal of Onion Routing, Tor can directly use Privoxy and related filtering services to anonymize
application data streams.
Option distinguishability. We allow clients to choose configuration options. For example, clients concerned about request linkability should rotate circuits more often than those
concerned about traceability. Allowing choice may attract
users with different needs; but clients who are in the minority may lose more anonymity by appearing distinct than they
gain by optimizing their behavior [1].
End-to-end timing correlation. Tor only minimally hides
such correlations. An attacker watching patterns of traffic at
the initiator and the responder will be able to confirm the correspondence with high probability. The greatest protection
currently available against such confirmation is to hide the
connection between the onion proxy and the first Tor node,
by running the OP on the Tor node or behind a firewall. This
approach requires an observer to separate traffic originating at
the onion router from traffic passing through it: a global observer can do this, but it might be beyond a limited observer's
capabilities.
Active attacks
Compromise keys. An attacker who learns the TLS session
key can see control cells and encrypted relay cells on every
circuit on that connection; learning a circuit session key lets
him unwrap one layer of the encryption. An attacker who
learns an OR's TLS private key can impersonate that OR for the TLS key's lifetime, but he must also learn the onion key
to decrypt create cells (and because of perfect forward secrecy, he cannot hijack already established circuits without
also compromising their session keys). Periodic key rotation
limits the window of opportunity for these attacks. On the
other hand, an attacker who learns a node's identity key can
replace that node indefinitely by sending new forged descriptors to the directory servers.
Iterated compromise. A roving adversary who can compromise ORs (by system intrusion, legal coercion, or extralegal coercion) could march down the circuit compromising the
nodes until he reaches the end. Unless the adversary can complete this attack within the lifetime of the circuit, however,
the ORs will have discarded the necessary information before
the attack can be completed. (Thanks to the perfect forward
secrecy of session keys, the attacker cannot force nodes to decrypt recorded traffic once the circuits have been closed.) Additionally, building circuits that cross jurisdictions can make
legal coercion harder; this phenomenon is commonly called
jurisdictional arbitrage. The Java Anon Proxy project recently experienced the need for this approach, when a German court forced them to add a backdoor to their nodes [51].
Run a recipient. An adversary running a webserver trivially learns the timing patterns of users connecting to it, and can introduce arbitrary patterns in its responses. End-to-end attacks become easier: if the adversary can induce users to connect to his webserver (perhaps by advertising content targeted to those users), he now holds one end of their connection. There is also a danger that application protocols and associated programs can be induced to reveal information about the initiator. Tor depends on Privoxy and similar protocol cleaners to solve this latter problem.
4 Note that this fingerprinting attack should not be confused with the much more complicated latency attacks of [5], which require a fingerprint of the latencies of all circuits through the network, combined with those from the network edges to the target user and the responder website.
Run an onion proxy. It is expected that end users will nearly
always run their own local onion proxy. However, in some
settings, it may be necessary for the proxy to run remotely, typically in institutions that want to monitor the activity of
those connecting to the proxy. Compromising an onion proxy
compromises all future connections through it.
DoS non-observed nodes. An observer who can only watch
some of the Tor network can increase the value of this traffic
by attacking non-observed nodes to shut them down, reduce
their reliability, or persuade users that they are not trustworthy. The best defense here is robustness.
Run a hostile OR. In addition to being a local observer, an
isolated hostile node can create circuits through itself, or alter
traffic patterns to affect traffic at other nodes. Nonetheless, a
hostile node must be immediately adjacent to both endpoints
to compromise the anonymity of a circuit. If an adversary can
run multiple ORs, and can persuade the directory servers that
those ORs are trustworthy and independent, then occasionally
some user will choose one of those ORs for the start and another as the end of a circuit. If an adversary controls m > 1 of N nodes, he can correlate at most (m/N)^2 of the traffic, although an adversary could still attract a disproportionately
large amount of traffic by running an OR with a permissive
exit policy, or by degrading the reliability of other routers.
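As a purely illustrative calculation (the numbers here are chosen for exposition and do not come from the paper), the quadratic dependence keeps the correlated fraction small when the adversary runs only a few routers:

    # Expected fraction of circuits whose entry and exit both land on
    # adversary-controlled ORs, under the uniform-choice assumption above.
    m, N = 5, 100          # hypothetical: adversary runs 5 of 100 ORs
    print((m / N) ** 2)    # 0.0025, i.e. about 0.25% of the traffic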
Introduce timing into messages. This is simply a stronger
version of passive timing attacks already discussed earlier.
Tagging attacks. A hostile node could tag a cell by altering it. If the stream were, for example, an unencrypted
request to a Web site, the garbled content coming out at the
appropriate time would confirm the association. However, integrity checks on cells prevent this attack.
Replace contents of unauthenticated protocols. When relaying an unauthenticated protocol like HTTP, a hostile exit
node can impersonate the target server. Clients should prefer
protocols with end-to-end authentication.
Replay attacks. Some anonymity protocols are vulnerable
to replay attacks. Tor is not; replaying one side of a handshake will result in a different negotiated session key, and so
the rest of the recorded session can't be used.
Smear attacks. An attacker could use the Tor network for
socially disapproved acts, to bring the network into disrepute
and get its operators to shut it down. Exit policies reduce
the possibilities for abuse, but ultimately the network requires
volunteers who can tolerate some political heat.
Distribute hostile code. An attacker could trick users into running subverted Tor software that does not, in fact, anonymize their connections.
Directory attacks
Destroy directory servers. If a few directory servers disappear, the others still decide on a valid directory. So long
as any directory servers remain in operation, they will still
broadcast their views of the network and generate a consensus
directory. (If more than half are destroyed, this directory will
not, however, have enough signatures for clients to use it automatically; human intervention will be necessary for clients
to decide whether to trust the resulting directory.)
Subvert a directory server. By taking over a directory
server, an attacker can partially influence the final directory.
Since ORs are included or excluded by majority vote, the corrupt directory can at worst cast a tie-breaking vote to decide
whether to include marginal ORs. It remains to be seen how
often such marginal cases occur in practice.
Subvert a majority of directory servers. An adversary who
controls more than half the directory servers can include as
many compromised ORs in the final directory as he wishes.
We must ensure that directory server operators are independent and attack-resistant.
Encourage directory server dissent. The directory agreement protocol assumes that directory server operators agree
on the set of directory servers. An adversary who can persuade some of the directory server operators to distrust one
another could split the quorum into mutually hostile camps,
thus partitioning users based on which directory they use. Tor
does not address this attack.
Trick the directory servers into listing a hostile OR. Our
threat model explicitly assumes directory server operators
will be able to filter out most hostile ORs.
Convince the directories that a malfunctioning OR is
working. In the current Tor implementation, directory servers
assume that an OR is running correctly if they can start a
TLS connection to it. A hostile OR could easily subvert this
test by accepting TLS connections from ORs but ignoring all
cells. Directory servers must actively test ORs by building
circuits and streams as appropriate. The tradeoffs of a similar
approach are discussed in [18].
Attacks against rendezvous points

Make many introduction requests. An attacker could attempt to deny Bob service by flooding his introduction points with requests. Because the introduction points can block requests
that lack authorization tokens, however, Bob can restrict the
volume of requests he receives, or require a certain amount of
computation for every request he receives.
Attack an introduction point. An attacker could disrupt a
location-hidden service by disabling its introduction points.
But because a service's identity is attached to its public key,
the service can simply re-advertise itself at a different introduction point. Advertisements can also be done secretly so
that only high-priority clients know the address of Bob's introduction points or so that different clients know of different
introduction points. This forces the attacker to disable all possible introduction points.
Compromise an introduction point. An attacker who controls Bob's introduction point can flood Bob with introduction
requests, or prevent valid introduction requests from reaching
him. Bob can notice a flood, and close the circuit. To notice
blocking of valid requests, however, he should periodically
test the introduction point by sending rendezvous requests
and making sure he receives them.
Compromise a rendezvous point. A rendezvous point is no
more sensitive than any other OR on a circuit, since all data
passing through the rendezvous is encrypted with a session
key shared by Alice and Bob.
Based in part on our restrictive default exit policy (we reject SMTP requests) and our low profile, we have had no
abuse issues since the network was deployed in October 2003.
Our slow growth rate gives us time to add features, resolve
bugs, and get a feel for what users actually want from an
anonymity system. Even though having more users would
bolster our anonymity sets, we are not eager to attract the
Kazaa or warez communities; we feel that we must build a
reputation for privacy, human rights, research, and other socially laudable activities.
As for performance, profiling shows that Tor spends almost
all its CPU time in AES, which is fast. Current latency is
attributable to two factors. First, network latency is critical:
we are intentionally bouncing traffic around the world several
times. Second, our end-to-end congestion control algorithm
focuses on protecting volunteer servers from accidental DoS
rather than on optimizing performance. To quantify these effects, we did some informal tests using a network of 4 nodes
on the same machine (a heavily loaded 1GHz Athlon). We
downloaded a 60 megabyte file from debian.org every 30
minutes for 54 hours (108 sample points). It arrived in about
300 seconds on average, compared to 210s for a direct download. We ran a similar test on the production Tor network,
fetching the front page of cnn.com (55 kilobytes): while
a direct download consistently took about 0.3s, the performance through Tor varied. Some downloads were as fast as
0.4s, with a median at 2.8s, and 90% finishing within 5.3s. It
seems that as the network expands, the chance of building a
slow circuit (one that includes a slow or heavily loaded node
or link) is increasing. On the other hand, as our users remain
satisfied with this increased latency, we can address our performance incrementally as we proceed with development.
Although Tor's clique topology and full-visibility directories present scaling problems, we still expect the network to
support a few hundred nodes and maybe 10,000 users before
we're forced to become more distributed. With luck, the experience we gain running the current topology will help us
choose among alternatives when the time comes.
10 Future Directions
Tor brings together many innovations into a unified deployable system. The next immediate steps include:
Scalability: Tor's emphasis on deployability and design
simplicity has led us to adopt a clique topology, semicentralized directories, and a full-network-visibility model
for client knowledge. These properties will not scale past
a few hundred servers. Section 9 describes some promising
approaches, but more deployment experience will be helpful
in learning the relative importance of these bottlenecks.
Bandwidth classes: This paper assumes that all ORs have
good bandwidth and latency. We should instead adopt the
MorphMix model, where nodes advertise their bandwidth
level (DSL, T1, T3), and Alice avoids bottlenecks by choosing nodes that match or exceed her bandwidth. In this way
DSL users can usefully join the Tor network.
Incentives: Volunteers who run nodes are rewarded with
publicity and possibly better anonymity [1]. More nodes
means increased scalability, and more users can mean more
anonymity. We need to continue examining the incentive
structures for participating in Tor. Further, we need to explore more approaches to limiting abuse, and understand why
most people don't bother using privacy systems.
Cover traffic: Currently Tor omits cover traffic; its costs
in performance and bandwidth are clear but its security benefits are not well understood. We must pursue more research
on link-level cover traffic and long-range cover traffic to determine whether some simple padding method offers provable
protection against our chosen adversary.
Caching at exit nodes: Perhaps each exit node should run
a caching web proxy [47], to improve anonymity for cached
pages (Alice's request never leaves the Tor network), to improve speed, and to reduce bandwidth cost. On the other
hand, forward security is weakened because caches constitute a record of retrieved files. We must find the right balance
between usability and security.
Acknowledgments
We thank Peter Palfrader, Geoff Goodell, Adam Shostack,
Joseph Sokol-Margolis, John Bashinski, and Zack Brown for
editing and comments; Matej Pfajfar, Andrei Serjantov, Marc
Rennhard for design discussions; Bram Cohen for congestion
control discussions; Adam Back for suggesting telescoping
circuits; and Cathy Meadows for formal analysis of the extend protocol. This work has been supported by ONR and
DARPA.
References
[1] A. Acquisti, R. Dingledine, and P. Syverson. On the economics of anonymity. In R. N. Wright, editor, Financial Cryptography. Springer-Verlag, LNCS 2742, 2003.
[2] R. Anderson. The eternity service. In Pragocrypt '96, 1996.
[3] The Anonymizer. <http://anonymizer.com/>.
[4] A. Back, I. Goldberg, and A. Shostack. Freedom systems 2.1 security issues and analysis. White paper, Zero Knowledge Systems, Inc., May 2001.
[5] A. Back, U. Möller, and A. Stiglic. Traffic analysis attacks and trade-offs in anonymity providing systems. In I. S. Moskowitz, editor, Information Hiding (IH 2001), pages 245-257. Springer-Verlag, LNCS 2137, 2001.
[6] M. Bellare, P. Rogaway, and D. Wagner. The EAX mode of operation: A two-pass authenticated-encryption scheme optimized for simplicity and efficiency. In Fast Software Encryption 2004, February 2004.
[44] M. Rennhard, S. Rafaeli, L. Mathy, B. Plattner, and D. Hutchison. Analysis of an anonymity network for web browsing. In IEEE 7th Intl. Workshop on Enterprise Security (WET ICE 2002), Pittsburgh, USA, June 2002.
[45] A. Serjantov and P. Sewell. Passive attack analysis for connection-based anonymity systems. In Computer Security - ESORICS 2003. Springer-Verlag, LNCS 2808, October 2003.
[46] R. Sherwood, B. Bhattacharjee, and A. Srinivasan. P5: A protocol for scalable anonymous communication. In IEEE Symposium on Security and Privacy, pages 58-70. IEEE CS, 2002.
[47] A. Shubina and S. Smith. Using caching for browsing anonymity. ACM SIGEcom Exchanges, 4(2), Sept 2003.
[48] P. Syverson, M. Reed, and D. Goldschlag. Onion Routing access configurations. In DARPA Information Survivability Conference and Exposition (DISCEX 2000), volume 1, pages 34-40. IEEE CS Press, 2000.
[49] P. Syverson, G. Tsudik, M. Reed, and C. Landwehr. Towards an analysis of Onion Routing security. In H. Federrath, editor, Designing Privacy Enhancing Technologies: Workshop on Design Issues in Anonymity and Unobservability, pages 96-114. Springer-Verlag, LNCS 2009, July 2000.
[50] A. Tanenbaum. Computer networks, 1996.
[51] The AN.ON Project. German police proceeds against anonymity service. Press release, September 2003. <http://www.datenschutzzentrum.de/material/themen/presse/anon-bka_e.htm>.
[52] M. Waldman and D. Mazières. Tangler: A censorship-resistant publishing system based on document entanglements. In 8th ACM Conference on Computer and Communications Security (CCS-8), pages 86-135. ACM Press, 2001.
[53] M. Waldman, A. Rubin, and L. Cranor. Publius: A robust, tamper-evident, censorship-resistant and source-anonymous web publishing system. In Proc. 9th USENIX Security Symposium, pages 59-72, August 2000.
[54] M. Wright, M. Adler, B. N. Levine, and C. Shields. Defending anonymous communication against passive logging attacks. In IEEE Symposium on Security and Privacy, pages 28-41. IEEE CS, May 2003.
ZooKeeper: Wait-free coordination for Internet-scale systems

Patrick Hunt and Mahadev Konar
Yahoo! Grid
{phunt,mahadev}@yahoo-inc.com

Flavio P. Junqueira and Benjamin Reed
Yahoo! Research
{fpj,breed}@yahoo-inc.com
Abstract
In this paper, we describe ZooKeeper, a service for coordinating processes of distributed applications. Since
ZooKeeper is part of critical infrastructure, ZooKeeper
aims to provide a simple and high performance kernel
for building more complex coordination primitives at the
client. It incorporates elements from group messaging,
shared registers, and distributed lock services in a replicated, centralized service. The interface exposed by ZooKeeper has the wait-free aspects of shared registers with
an event-driven mechanism similar to cache invalidations
of distributed file systems to provide a simple, yet powerful coordination service.
The ZooKeeper interface enables a high-performance
service implementation. In addition to the wait-free
property, ZooKeeper provides a per client guarantee of
FIFO execution of requests and linearizability for all requests that change the ZooKeeper state. These design decisions enable the implementation of a high performance
processing pipeline with read requests being satisfied by
local servers. We show that for the target workloads, with read-to-write ratios of 2:1 to 100:1, ZooKeeper can handle tens to hundreds of thousands of transactions per second.
This performance allows ZooKeeper to be used extensively by client applications.
Introduction
While the wait-free property is important for performance and fault tolerance, it is not sufficient for coordination. We also have to provide order guarantees for
operations. In particular, we have found that guaranteeing both FIFO client ordering of all operations and linearizable writes enables an efficient implementation of
the service and it is sufficient to implement coordination
primitives of interest to our applications. In fact, we can
implement consensus for any number of processes with
our API, and according to the hierarchy of Herlihy, ZooKeeper implements a universal object [14].
The ZooKeeper service comprises an ensemble of
servers that use replication to achieve high availability
and performance. Its high performance enables applications comprising a large number of processes to use
such a coordination kernel to manage all aspects of coordination. We were able to implement ZooKeeper using a simple pipelined architecture that allows us to have
hundreds or thousands of requests outstanding while still
achieving low latency. Such a pipeline naturally enables
the execution of operations from a single client in FIFO
order. Guaranteeing FIFO client order enables clients to
submit operations asynchronously. With asynchronous
operations, a client is able to have multiple outstanding
operations at a time. This feature is desirable, for example, when a new client becomes a leader and it has to manipulate metadata and update it accordingly. Without the
possibility of multiple outstanding operations, the time
of initialization can be of the order of seconds instead of
sub-second.
To guarantee that update operations satisfy linearizability, we implement a leader-based atomic broadcast
protocol [23], called Zab [24]. A typical workload
of a ZooKeeper application, however, is dominated by
read operations and it becomes desirable to scale read
throughput. In ZooKeeper, servers process read operations locally, and we do not use Zab to totally order them.
Caching data on the client side is an important technique to increase the performance of reads. For example,
it is useful for a process to cache the identifier of the
current leader instead of probing ZooKeeper every time
it needs to know the leader. ZooKeeper uses a watch
mechanism to enable clients to cache data without managing the client cache directly. With this mechanism,
a client can watch for an update to a given data object,
and receive a notification upon an update. Chubby manages the client cache directly. It blocks updates to invalidate the caches of all clients caching the data being
changed. Under this design, if any of these clients is
slow or faulty, the update is delayed. Chubby uses leases
to prevent a faulty client from blocking the system indefinitely. Leases, however, only bound the impact of slow
or faulty clients, whereas ZooKeeper watches avoid the
problem altogether.
In this paper we discuss our design and implementa-
tion of ZooKeeper. With ZooKeeper, we are able to implement all coordination primitives that our applications
require, even though only writes are linearizable. To validate our approach we show how we implement some
coordination primitives with ZooKeeper.
To summarize, in this paper our main contributions are:
Coordination kernel: We propose a wait-free coordination service with relaxed consistency guarantees
for use in distributed systems. In particular, we describe our design and implementation of a coordination kernel, which we have used in many critical applications to implement various coordination
techniques.
Coordination recipes: We show how ZooKeeper can
be used to build higher level coordination primitives, even blocking and strongly consistent primitives, that are often used in distributed applications.
Experience with Coordination: We share some of the
ways that we use ZooKeeper and evaluate its performance.
2.1 Service overview

Data model. The data model of ZooKeeper is essentially a file system with a simplified API and only full data reads and writes, or a key/value table with hierarchical keys. The hierarchical namespace is useful for allocating subtrees for the namespace of different applications and for setting access rights to those subtrees. We
also exploit the concept of directories on the client side to
build higher level primitives as we will see in section 2.4.
Unlike files in file systems, znodes are not designed
for general data storage. Instead, znodes map to abstractions of the client application, typically corresponding
to meta-data used for coordination purposes. To illustrate, in Figure 1 we have two subtrees, one for Application 1 (/app1) and another for Application 2 (/app2).
The subtree for Application 1 implements a simple group
membership protocol: each client process p_i creates a znode p_i under /app1, which persists as long as the
process is running.
Although znodes have not been designed for general
data storage, ZooKeeper does allow clients to store some
information that can be used for meta-data or configuration in a distributed computation. For example, in a
leader-based application, it is useful for an application
server that is just starting to learn which other server is
currently the leader. To accomplish this goal, we can
have the current leader write this information in a known
location in the znode space. Znodes also have associated
meta-data with time stamps and version counters, which
allow clients to track changes to znodes and execute conditional updates based on the version of the znode.
[Figure 1: Illustration of the ZooKeeper hierarchical name space, with znodes /app1, /app1/p_1, /app1/p_2, /app1/p_3, and /app2.]
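As an illustration of the group-membership pattern above, a hedged sketch with the third-party kazoo Python client (the library, the ensemble address, and the znode names are assumptions made for the example; the paper does not prescribe a client library):

    import os
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")   # assumed ZooKeeper ensemble address
    zk.start()
    zk.ensure_path("/app1")

    # Ephemeral znode: removed automatically when this process's session ends.
    zk.create(f"/app1/p_{os.getpid()}", value=b"worker metadata", ephemeral=True)

    @zk.ChildrenWatch("/app1")
    def on_membership_change(children):
        # Called initially and whenever a member joins or leaves.
        print("current members:", sorted(children))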
2.2 Client API
2.3 ZooKeeper guarantees
Because the API is asynchronous, a client can have multiple outstanding operations, and consequently we can choose to guarantee no specific order for outstanding operations of the same client or to guarantee FIFO order. We choose the latter for our property.
It is important to observe that all results that hold for
linearizable objects also hold for A-linearizable objects
because a system that satisfies A-linearizability also satisfies linearizability. Because only update requests are A-linearizable, ZooKeeper processes read requests locally
at each replica. This allows the service to scale linearly
as servers are added to the system.
To see how these two guarantees interact, consider the
following scenario. A system comprising a number of
processes elects a leader to command worker processes.
When a new leader takes charge of the system, it must
change a large number of configuration parameters and
notify the other processes once it finishes. We then have
two important requirements:
As the new leader starts making changes, we do not
want other processes to start using the configuration
that is being changed;
If the new leader dies before the configuration has
been fully updated, we do not want the processes to
use this partial configuration.
Observe that distributed locks, such as the locks provided by Chubby, would help with the first requirement
but are insufficient for the second. With ZooKeeper,
the new leader can designate a path as the ready znode;
other processes will only use the configuration when that
znode exists. The new leader makes the configuration
change by deleting ready, updating the various configuration znodes, and creating ready. All of these changes
can be pipelined and issued asynchronously to quickly
update the configuration state. Although the latency of a
change operation is of the order of 2 milliseconds, a new
leader that must update 5000 different znodes will take
10 seconds if the requests are issued one after the other;
by issuing the requests asynchronously the requests will
take less than a second. Because of the ordering guarantees, if a process sees the ready znode, it must also see
all the configuration changes made by the new leader. If
the new leader dies before the ready znode is created, the
other processes know that the configuration has not been
finalized and do not use it.
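A sketch of the ready-znode pattern just described, again using the third-party kazoo client (the library, the ensemble address, and the /config paths and payloads are assumptions for illustration):

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()
    zk.ensure_path("/config")

    def publish_config(new_values: dict):
        # Hide the configuration, rewrite it with pipelined asynchronous
        # requests, and only then re-create the ready znode.
        if zk.exists("/config/ready"):
            zk.delete("/config/ready")
        pending = []
        for name, value in new_values.items():
            zk.ensure_path(f"/config/{name}")
            pending.append(zk.set_async(f"/config/{name}", value))  # pipelined, FIFO order
        for result in pending:
            result.get()                          # surface any errors
        zk.create("/config/ready", b"")           # signal: configuration is complete

    def read_config_if_ready():
        # A reader that sees /config/ready (with a watch) is guaranteed, by the
        # ordering rules above, to see all earlier configuration writes.
        if zk.exists("/config/ready", watch=lambda ev: print("config changing:", ev)):
            return {c: zk.get(f"/config/{c}")[0]
                    for c in zk.get_children("/config") if c != "ready"}
        return None

    publish_config({"timeout_ms": b"500", "replicas": b"3"})
    print(read_config_if_ready())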
The above scheme still has a problem: what happens
if a process sees that ready exists before the new leader
starts to make a change and then starts reading the configuration while the change is in progress. This problem
is solved by the ordering guarantee for the notifications:
if a client is watching for a change, the client will see
the notification event before it sees the new state of the
system after the change is made. Consequently, if the
process that reads the ready znode requests to be notified
of changes to that znode, it will see a notification informing the client of the change before it can read any of the
new configuration.
Another problem can arise when clients have their own
communication channels in addition to ZooKeeper. For
example, consider two clients A and B that have a shared
configuration in ZooKeeper and communicate through a
shared communication channel. If A changes the shared
configuration in ZooKeeper and tells B of the change
through the shared communication channel, B would expect to see the change when it re-reads the configuration.
If B's ZooKeeper replica is slightly behind A's, it may
not see the new configuration. Using the above guarantees B can make sure that it sees the most up-to-date
information by issuing a write before re-reading the configuration. To handle this scenario more efficiently, ZooKeeper provides the sync request: when followed by a read, it constitutes a slow read. sync causes a server
to apply all pending write requests before processing the
read without the overhead of a full write. This primitive
is similar in idea to the flush primitive of ISIS [5].
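A small sketch of the sync-then-read pattern (kazoo exposes this primitive as KazooClient.sync; the library, paths, and the out-of-band channel are assumptions for the example):

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()
    zk.ensure_path("/config/current")

    def read_latest_config(path="/config/current"):
        # B was told over an external channel that A changed the configuration;
        # sync makes B's server apply all pending writes before the read.
        zk.sync(path)
        data, stat = zk.get(path)
        return data, stat.version

    value, version = read_latest_config()
    print(version, value)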
ZooKeeper also has the following two liveness and
durability guarantees: if a majority of ZooKeeper servers
are active and communicating the service will be available; and if the ZooKeeper service responds successfully
to a change request, that change persists across any number of failures as long as a quorum of servers is eventually able to recover.
2.4 Examples of primitives
Simple Locks Although ZooKeeper is not a lock service, it can be used to implement locks. Applications
using ZooKeeper usually use synchronization primitives
tailored to their needs, such as those shown above. Here
we show how to implement locks with ZooKeeper to
show that it can implement a wide variety of general synchronization primitives.
The simplest lock implementation uses lock files.
The lock is represented by a znode. To acquire a lock,
a client tries to create the designated znode with the
EPHEMERAL flag. If the create succeeds, the client
holds the lock. Otherwise, the client can read the znode with the watch flag set to be notified if the current
leader dies. A client releases the lock when it dies or explicitly deletes the znode. Other clients that are waiting
for a lock try again to acquire a lock once they observe
the znode being deleted.
While this simple locking protocol works, it does have
some problems. First, it suffers from the herd effect. If
there are many clients waiting to acquire a lock, they will
all vie for the lock when it is released even though only
one client can acquire the lock. Second, it only implements exclusive locking. The following two primitives
show how both of these problems can be overcome.
Lock
1 n = create(l + "/lock-", EPHEMERAL|SEQUENTIAL)
2 C = getChildren(l, false)
3 if n is lowest znode in C, exit
4 p = znode in C ordered just before n
5 if exists(p, true) wait for watch event
6 goto 2
Unlock
1 delete(n)
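The recipe above can be transcribed almost line for line with a real client. A hedged sketch using the third-party kazoo library (kazoo, the ensemble address, and the lock path are assumptions; in practice kazoo's own kazoo.recipe.lock.Lock is the more idiomatic choice):

    import threading
    from kazoo.client import KazooClient
    from kazoo.exceptions import NoNodeError

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()
    LOCK_PATH = "/locks/resource-1"          # the lock znode l (assumed path)
    zk.ensure_path(LOCK_PATH)

    def acquire():
        # line 1: n = create(l + "/lock-", EPHEMERAL|SEQUENTIAL)
        n = zk.create(LOCK_PATH + "/lock-", ephemeral=True, sequence=True)
        my_name = n.rsplit("/", 1)[1]
        while True:
            children = sorted(zk.get_children(LOCK_PATH))          # line 2
            if my_name == children[0]:                             # line 3: lowest znode
                return n
            predecessor = children[children.index(my_name) - 1]    # line 4
            released = threading.Event()
            # line 5: watch only the znode just before ours, then wait
            if zk.exists(f"{LOCK_PATH}/{predecessor}", watch=lambda event: released.set()):
                released.wait()
            # line 6: goto 2 (re-check; the predecessor may already be gone)

    def release(n):
        try:
            zk.delete(n)                                           # Unlock: delete(n)
        except NoNodeError:
            pass                                                   # session already expired

    lock_node = acquire()
    try:
        print("holding lock as", lock_node)
    finally:
        release(lock_node)

Because each waiter watches only its immediate predecessor, a release wakes a single client rather than the whole herd.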
Double Barrier Double barriers enable clients to synchronize the beginning and the end of a computation.
When enough processes, defined by the barrier threshold, have joined the barrier, processes start their computation and leave the barrier once they have finished. We
represent a barrier in ZooKeeper with a znode, referred
to as b. Every process p registers with b by creating
a znode as a child of b on entry, and unregisters by removing the child when it is ready to leave. Processes
can enter the barrier when the number of child znodes
of b exceeds the barrier threshold. Processes can leave
the barrier when all of the processes have removed their
children. We use watches to efficiently wait for the enter and leave conditions to be satisfied.
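A corresponding sketch of the double barrier (kazoo also ships kazoo.recipe.barrier.DoubleBarrier; this hand-rolled version merely mirrors the description, and the barrier path and threshold are assumptions chosen for the example):

    import os
    import threading
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()
    B = "/barriers/step-1"                  # the barrier znode b (assumed path)
    THRESHOLD = 3                           # assumed barrier threshold
    zk.ensure_path(B)
    ME = f"p_{os.getpid()}"

    def wait_for_children(predicate):
        # Re-check the children on every change, using a watch to avoid polling.
        while True:
            changed = threading.Event()
            children = zk.get_children(B, watch=lambda event: changed.set())
            if predicate(children):
                return
            changed.wait()

    def enter():
        zk.create(f"{B}/{ME}", ephemeral=True)
        wait_for_children(lambda kids: len(kids) >= THRESHOLD)

    def leave():
        zk.delete(f"{B}/{ME}")
        wait_for_children(lambda kids: len(kids) == 0)

    # Run one copy per process; enter() blocks until THRESHOLD processes have joined.
    enter()
    print("all processes entered; computing...")
    leave()
    print("all processes left the barrier")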
Katta Katta [17] is a distributed indexer that uses ZooKeeper for coordination, and it is an example of a non-Yahoo! application. Katta divides the work of indexing
using shards. A master server assigns shards to slaves
and tracks progress. Slaves can fail, so the master must
redistribute load as slaves come and go. The master can
also fail, so other servers must be ready to take over in
case of failure. Katta uses ZooKeeper to track the status
of slave servers and the master (group membership),
and to handle master failover (leader election). Katta
also uses ZooKeeper to track and propagate the assignments of shards to slaves (configuration management).
ZooKeeper Applications
[Figure 2: Number of read and write operations per second for a ZooKeeper server used by FS, over a three-day period (0h to 66h).]
[Figure 3: Layout of Yahoo! Message Broker (YMB) structures in ZooKeeper; the znode labels include nodes, shutdown, migration_prohibited, broker_disabled, topics, <topic>, <hostname>, load, primary, backup, and hostname.]
Figure 2 shows the read and write traffic for a ZooKeeper server used by FS through a period of three days.
To generate this graph, we count the number of operations for every second during the period, and each point
corresponds to the number of operations in that second.
We observe that the read traffic is much higher compared
to the write traffic. During periods in which the rate is
higher than 1,000 operations per second, the read:write
ratio varies between 10:1 and 100:1. The read operations
in this workload are getData(), getChildren(),
and exists(), in increasing order of prevalence.
Each topic znode has child znodes that indicate the primary and backup server for that topic, along with the
subscribers of that topic. The primary and backup
server znodes not only allow servers to discover the
servers in charge of a topic, but they also manage leader
election and server crashes.
ZooKeeper Implementation

[Figure: The components of the ZooKeeper service. Write requests pass through the request processor and atomic broadcast and are applied as transactions (txn) to the replicated database; read requests are served from the local replica.]

4.1 Request Processor
4.2 Atomic Broadcast
4.3 Replicated Database
4.4 Client-Server Interactions
If the client has a more recent view than the server, the server does not reestablish the session with the client until the server has caught up. The client is guaranteed to
be able to find another server that has a recent view of the
system since the client only sees changes that have been
replicated to a majority of the ZooKeeper servers. This
behavior is important to guarantee durability.
To detect client session failures, ZooKeeper uses timeouts. The leader determines that there has been a failure
if no other server receives anything from a client session within the session timeout. If the client sends requests frequently enough, then there is no need to send
any other message. Otherwise, the client sends heartbeat
messages during periods of low activity. If the client
cannot communicate with a server to send a request or
heartbeat, it connects to a different ZooKeeper server to
re-establish its session. To prevent the session from timing out, the ZooKeeper client library sends a heartbeat
after the session has been idle for s/3 ms and switches to a
new server if it has not heard from a server for 2s/3 ms,
where s is the session timeout in milliseconds.
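For concreteness, the idle-time rule can be written down directly (a sketch only; s and the clock handling are simplified, and the function below is hypothetical rather than part of any ZooKeeper client library):

    def next_action(session_timeout_ms: int, idle_ms: int, silent_ms: int) -> str:
        # Heartbeat after s/3 ms of idleness; switch servers after 2s/3 ms of silence.
        if silent_ms >= 2 * session_timeout_ms / 3:
            return "reconnect to a different ZooKeeper server"
        if idle_ms >= session_timeout_ms / 3:
            return "send heartbeat"
        return "nothing to do"

    print(next_action(30_000, idle_ms=11_000, silent_ms=5_000))   # send heartbeat
    print(next_action(30_000, idle_ms=2_000, silent_ms=21_000))   # reconnect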
[Figure 5: The throughput performance of a saturated system as the ratio of reads to writes varies (operations per second versus the percentage of read requests).]
Evaluation

5.1 Throughput
Throughput of a saturated system at the two extremes of the workload:

Servers    100% Reads    0% Reads
13         460k          8k
9          296k          12k
7          257k          14k
5          165k          18k
3          87k           21k

[Figure: throughput in operations per second versus the percentage of read requests, for ensembles of 3, 5, 7, 9, and 13 servers; a companion plot shows throughput versus the size of the ensemble.]
Client communication, ACL checks, and request-to-transaction conversions all require CPU. The contention for CPU lowers ZooKeeper throughput to substantially less than the
atomic broadcast component in isolation. Because ZooKeeper is a critical production component, up to now our
development focus for ZooKeeper has been correctness
and robustness. There are plenty of opportunities for improving performance significantly by eliminating things
like extra copies, multiple serializations of the same object, more efficient internal data structures, etc.
[Figures: additional throughput plots in operations per second for ensembles of 3 to 13 servers, and throughput over time (seconds since the start of the series) with marked events A, B, and C.]
5.2 Latency of requests

To assess the latency of requests, we created a benchmark modeled after the Chubby benchmark [6]. We create a worker process that simply sends a create, waits for it to finish, sends an asynchronous delete of the new node, and then starts the next create. We vary the number of workers accordingly, and for each run, we have each worker create 50,000 nodes. We calculate the throughput by dividing the number of create requests completed by the total time it took for all the workers to complete.

Table 2: Create requests completed per second, by number of workers and ensemble size.

             Number of servers
Workers      3        5        7        9
1            776      748      758      711
10           2074     1832     1572     1540
20           2740     2336     1934     1890

5.3 Performance of barriers

In this experiment, we execute a number of barriers sequentially to assess the performance of primitives implemented with ZooKeeper. For a given number of barriers b, each client first enters all b barriers, and then it leaves all b barriers in succession. As we use the double-barrier algorithm of Section 2.4, a client first waits for all other clients to execute the enter() procedure before moving to the next call (similarly for leave()).

We report the results of our experiments in Table 3. In this experiment, we have 50, 100, and 200 clients entering a number b of barriers in succession, b ∈ {200, 400, 800, 1600}. Although an application can have thousands of ZooKeeper clients, quite often a much smaller subset participates in each coordination operation as clients are often grouped according to the specifics of the application.

Two interesting observations from this experiment are that the time to process all barriers increases roughly linearly with the number of barriers, showing that concurrent access to the same part of the data tree did not produce any unexpected delay, and that latency increases proportionally to the number of clients. This is a consequence of not saturating the ZooKeeper service. In fact, we observe that even with clients proceeding in lock-step, the throughput of barrier operations (enter and leave) is between 1,950 and 3,100 operations per second in all cases. In ZooKeeper operations, this corresponds to throughput values between 10,700 and 17,000 operations per second. As in our implementation we have a ratio of reads to writes of 4:1 (80% of read operations), the throughput our benchmark code uses is much lower compared to the raw throughput ZooKeeper can achieve (over 40,000 according to Figure 5). This is due to clients waiting on other clients.

Table 3: Time in seconds to process b barriers, for 50, 100, and 200 clients.

# of barriers    50 clients    100 clients    200 clients
200              9.4           19.8           41.0
400              16.4          34.1           62.0
800              28.9          55.9           112.1
1600             54.0          102.7          234.4
Related work
ZooKeeper has the goal of providing a service that mitigates the problem of coordinating processes in distributed applications. To achieve this goal, its design uses
ideas from previous coordination services, fault tolerant
systems, distributed algorithms, and file systems.
We are not the first to propose a system for the coordination of distributed applications. Some early systems
propose a distributed lock service for transactional applications [13], and for sharing information in clusters
of computers [19]. More recently, Chubby proposes a
system to manage advisory locks for distributed applications [6]. Chubby shares several of the goals of ZooKeeper. It also has a file-system-like interface, and it uses
an agreement protocol to guarantee the consistency of the
replicas. However, ZooKeeper is not a lock service. It
can be used by clients to implement locks, but there are
no lock operations in its API. Unlike Chubby, ZooKeeper
allows clients to connect to any ZooKeeper server, not
just the leader. ZooKeeper clients can use their local
replicas to serve data and manage watches since its consistency model is much more relaxed than Chubby. This
enables ZooKeeper to provide higher performance than
Chubby, allowing applications to make more extensive
use of ZooKeeper.
There have been fault-tolerant systems proposed in
the literature with the goal of mitigating the problem of
building fault-tolerant distributed applications. One early
system is ISIS [5]. The ISIS system transforms abstract
type specifications into fault-tolerant distributed objects,
thus making fault-tolerance mechanisms transparent to
users. Horus [30] and Ensemble [31] are systems that
evolved from ISIS. ZooKeeper embraces the notion of
virtual synchrony of ISIS. Finally, Totem guarantees total
order of message delivery in an architecture that exploits
hardware broadcasts of local area networks [22]. ZooKeeper works with a wide variety of network topologies
which motivated us to rely on TCP connections between
server processes and not assume any special topology or
hardware features. We also do not expose any of the ensemble communication used internally in ZooKeeper.
One important technique for building fault-tolerant
services is state-machine replication [26], and Paxos [20]
is an algorithm that enables efficient implementations
of replicated state-machines for asynchronous systems.
We use an algorithm that shares some of the characteristics of Paxos, but that combines transaction logging
needed for consensus with write-ahead logging needed
for data tree recovery to enable an efficient implementation. There have been proposals of protocols for practical
implementations of Byzantine-tolerant replicated statemachines [7, 10, 18, 1, 28]. ZooKeeper does not assume
that servers can be Byzantine, but we do employ mechanisms such as checksums and sanity checks to catch
non-malicious Byzantine faults. Clement et al. discuss an approach to make ZooKeeper fully Byzantine
fault-tolerant without modifying the current server code
base [9]. To date, we have not observed faults in production that would have been prevented using a fully Byzantine fault-tolerant protocol [29].
Conclusions

Acknowledgements

References
[1] M. Abd-El-Malek, G. R. Ganger, G. R. Goodson, M. K. Reiter, and J. J. Wylie. Fault-scalable Byzantine fault-tolerant services. In SOSP '05: Proceedings of the twentieth ACM symposium on Operating systems principles, pages 59-74, New York, NY, USA, 2005. ACM.
[2] M. Aguilera, A. Merchant, M. Shah, A. Veitch, and C. Karamanolis. Sinfonia: A new paradigm for building scalable distributed systems. In SOSP '07: Proceedings of the 21st ACM symposium on Operating systems principles, New York, NY, 2007.
[6] M. Burrows. The Chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th ACM/USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2006.
[11] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. In SOSP '07: Proceedings of the 21st ACM symposium on Operating systems principles, New York, NY, USA, 2007. ACM Press.
[12] J. Gray, P. Helland, P. O'Neil, and D. Shasha. The dangers of replication and a solution. In Proceedings of SIGMOD '96, pages 173-182, New York, NY, USA, 1996. ACM.
[13] A. Hastings. Distributed lock management in a transaction processing environment. In Proceedings of IEEE 9th Symposium on Reliable Distributed Systems, Oct. 1990.
[14] M. Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems, 13(1), 1991.
[15] M. Herlihy and J. Wing. Linearizability: A correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, 12(3), July 1990.
[16] J. H. Howard, M. L. Kazar, S. G. Menees, D. A. Nichols, M. Satyanarayanan, R. N. Sidebotham, and M. J. West. Scale and performance in a distributed file system. ACM Trans. Comput. Syst., 6(1), 1988.
[17] Katta. Katta - distribute Lucene indexes in a grid. http://katta.wiki.sourceforge.net/, 2008.
[18] R. Kotla, L. Alvisi, M. Dahlin, A. Clement, and E. Wong. Zyzzyva: speculative Byzantine fault tolerance. SIGOPS Oper. Syst. Rev., 41(6):45-58, 2007.
[19] N. P. Kronenberg, H. M. Levy, and W. D. Strecker. VAXclusters (extended abstract): a closely-coupled distributed system. SIGOPS Oper. Syst. Rev., 19(5), 1985.
[20] L. Lamport. The part-time parliament. ACM Transactions on Computer Systems, 16(2), May 1998.
[21] J. MacCormick, N. Murphy, M. Najork, C. A. Thekkath, and L. Zhou. Boxwood: Abstractions as the foundation for storage infrastructure. In Proceedings of the 6th ACM/USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2004.
[22] L. Moser, P. Melliar-Smith, D. Agarwal, R. Budhia, C. Lingley-Papadopoulos, and T. Archambault. The Totem system. In Proceedings of the 25th International Symposium on Fault-Tolerant Computing, June 1995.
[23] S. Mullender, editor. Distributed Systems, 2nd edition. ACM Press, New York, NY, USA, 1993.
[24] B. Reed and F. P. Junqueira. A simple totally ordered broadcast protocol. In LADIS '08: Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware, pages 1-6, New York, NY, USA, 2008. ACM.
[25] N. Schiper and S. Toueg. A robust and lightweight stable leader election service for dynamic systems. In DSN, 2008.
[26] F. B. Schneider. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Computing Surveys, 22(4), 1990.
[27] A. Sherman, P. A. Lisiecki, A. Berkheimer, and J. Wein. ACMS: The Akamai configuration management system. In NSDI, 2005.
[28] A. Singh, P. Fonseca, P. Kuznetsov, R. Rodrigues, and P. Maniatis. Zeno: eventually consistent Byzantine-fault tolerance. In NSDI '09: Proceedings of the 6th USENIX symposium on Networked systems design and implementation, pages 169-184, Berkeley, CA, USA, 2009. USENIX Association.
[29] Y. J. Song, F. Junqueira, and B. Reed. BFT for the skeptics. http://www.net.t-labs.tu-berlin.de/~petr/BFTW3/abstracts/talk-abstract.pdf.
[30] R. van Renesse and K. Birman. Horus, a flexible group communication system. Communications of the ACM, 39(16), Apr. 1996.
[31] R. van Renesse, K. Birman, M. Hayden, A. Vaysburd, and D. Karr. Building adaptive systems using Ensemble. Software - Practice and Experience, 28(5), July 1998.