
Spanner: Google's Globally-Distributed Database

James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman,
Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh,
Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura,
David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak,
Christopher Taylor, Ruth Wang, Dale Woodford
Google, Inc.
Abstract
Spanner is Google's scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This paper describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: non-blocking reads in the past, lock-free read-only transactions, and atomic schema changes, across all of Spanner.

1 Introduction

Spanner is a scalable, globally-distributed database designed, built, and deployed at Google. At the highest level of abstraction, it is a database that shards data
across many sets of Paxos [21] state machines in datacenters spread all over the world. Replication is used for
global availability and geographic locality; clients automatically failover between replicas. Spanner automatically reshards data across machines as the amount of data
or the number of servers changes, and it automatically
migrates data across machines (even across datacenters)
to balance load and in response to failures. Spanner is
designed to scale up to millions of machines across hundreds of datacenters and trillions of database rows.
Applications can use Spanner for high availability,
even in the face of wide-area natural disasters, by replicating their data within or even across continents. Our
initial customer was F1 [35], a rewrite of Google's advertising backend. F1 uses five replicas spread across
the United States. Most other applications will probably
replicate their data across 3 to 5 datacenters in one geographic region, but with relatively independent failure
modes. That is, most applications will choose lower latency over higher availability, as long as they can survive 1 or 2 datacenter failures.
Spanner's main focus is managing cross-datacenter
replicated data, but we have also spent a great deal of
time in designing and implementing important database
features on top of our distributed-systems infrastructure.
Even though many projects happily use Bigtable [9], we
have also consistently received complaints from users
that Bigtable can be difficult to use for some kinds of applications: those that have complex, evolving schemas,
or those that want strong consistency in the presence of
wide-area replication. (Similar claims have been made
by other authors [37].) Many applications at Google
have chosen to use Megastore [5] because of its semirelational data model and support for synchronous replication, despite its relatively poor write throughput. As a
consequence, Spanner has evolved from a Bigtable-like
versioned key-value store into a temporal multi-version
database. Data is stored in schematized semi-relational
tables; data is versioned, and each version is automatically timestamped with its commit time; old versions of
data are subject to configurable garbage-collection policies; and applications can read data at old timestamps.
Spanner supports general-purpose transactions, and provides a SQL-based query language.
As a globally-distributed database, Spanner provides
several interesting features. First, the replication configurations for data can be dynamically controlled at a
fine grain by applications. Applications can specify constraints to control which datacenters contain which data,
how far data is from its users (to control read latency),
how far replicas are from each other (to control write latency), and how many replicas are maintained (to control durability, availability, and read performance). Data
can also be dynamically and transparently moved between datacenters by the system to balance resource usage across datacenters. Second, Spanner has two features
that are difficult to implement in a distributed database: it provides externally consistent [16] reads and writes, and globally-consistent reads across the database at a timestamp. These features enable Spanner to support consistent backups, consistent MapReduce executions [12],
and atomic schema updates, all at global scale, and even
in the presence of ongoing transactions.
These features are enabled by the fact that Spanner assigns globally-meaningful commit timestamps to transactions, even though transactions may be distributed.
The timestamps reflect serialization order. In addition,
the serialization order satisfies external consistency (or
equivalently, linearizability [20]): if a transaction T1
commits before another transaction T2 starts, then T1's commit timestamp is smaller than T2's. Spanner is the
first system to provide such guarantees at global scale.
The key enabler of these properties is a new TrueTime
API and its implementation. The API directly exposes
clock uncertainty, and the guarantees on Spanner's timestamps depend on the bounds that the implementation provides. If the uncertainty is large, Spanner slows down to wait out that uncertainty. Google's cluster-management
software provides an implementation of the TrueTime
API. This implementation keeps uncertainty small (generally less than 10ms) by using multiple modern clock
references (GPS and atomic clocks).
Section 2 describes the structure of Spanner's implementation, its feature set, and the engineering decisions that went into their design. Section 3 describes our new TrueTime API and sketches its implementation. Section 4 describes how Spanner uses TrueTime to implement externally-consistent distributed transactions, lock-free read-only transactions, and atomic schema updates. Section 5 provides some benchmarks on Spanner's performance and TrueTime behavior, and discusses the experiences of F1. Sections 6, 7, and 8 describe related and
future work, and summarize our conclusions.

2 Implementation

This section describes the structure of and rationale underlying Spanner's implementation. It then describes the
directory abstraction, which is used to manage replication and locality, and is the unit of data movement. Finally, it describes our data model, why Spanner looks
like a relational database instead of a key-value store, and
how applications can control data locality.
A Spanner deployment is called a universe. Given
that Spanner manages data globally, there will be only
a handful of running universes. We currently run a
test/playground universe, a development/production universe, and a production-only universe.
Spanner is organized as a set of zones, where each
zone is the rough analog of a deployment of Bigtable servers [9]. Zones are the unit of administrative deployment. The set of zones is also the set of locations across which data can be replicated. Zones can be added to or removed from a running system as new datacenters are brought into service and old ones are turned off, respectively. Zones are also the unit of physical isolation: there may be one or more zones in a datacenter, for example, if different applications' data must be partitioned across different sets of servers in the same datacenter.

Figure 1: Spanner server organization.

Figure 1 illustrates the servers in a Spanner universe.
A zone has one zonemaster and between one hundred
and several thousand spanservers. The former assigns
data to spanservers; the latter serve data to clients. The
per-zone location proxies are used by clients to locate
the spanservers assigned to serve their data. The universe master and the placement driver are currently singletons. The universe master is primarily a console that
displays status information about all the zones for interactive debugging. The placement driver handles automated movement of data across zones on the timescale
of minutes. The placement driver periodically communicates with the spanservers to find data that needs to be
moved, either to meet updated replication constraints or
to balance load. For space reasons, we will only describe
the spanserver in any detail.

2.1 Spanserver Software Stack

This section focuses on the spanserver implementation to illustrate how replication and distributed transactions
have been layered onto our Bigtable-based implementation. The software stack is shown in Figure 2. At the
bottom, each spanserver is responsible for between 100
and 1000 instances of a data structure called a tablet. A
tablet is similar to Bigtable's tablet abstraction, in that it
implements a bag of the following mappings:
(key:string, timestamp:int64) → string
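As a rough illustration only (not Spanner's storage format, and with invented names), the mapping can be pictured as a multi-version map in which a read at a timestamp returns the newest version at or below that timestamp:

    from collections import defaultdict

    class TabletMap:
        # Toy model of the (key:string, timestamp:int64) -> string mapping above.
        # A real tablet keeps this state in B-tree-like files and a write-ahead log.
        def __init__(self):
            self._versions = defaultdict(list)   # key -> [(timestamp, value), ...]

        def write(self, key, timestamp, value):
            self._versions[key].append((timestamp, value))
            self._versions[key].sort()           # keep versions in timestamp order

        def read(self, key, timestamp):
            # Newest value whose timestamp is <= the read timestamp, if any.
            older = [(ts, v) for ts, v in self._versions.get(key, []) if ts <= timestamp]
            return older[-1][1] if older else None

    tablet = TabletMap()
    tablet.write("users/1", 100, "alice")
    tablet.write("users/1", 200, "alice v2")
    assert tablet.read("users/1", 150) == "alice"   # reads at old timestamps see old versions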

Figure 2: Spanserver software stack.

Figure 3: Directories are the unit of data movement between Paxos groups.

Unlike Bigtable, Spanner assigns timestamps to data, which is an important way in which Spanner is more like a multi-version database than a key-value store. A tablet's state is stored in a set of B-tree-like files and a
write-ahead log, all on a distributed file system called
Colossus (the successor to the Google File System [15]).
To support replication, each spanserver implements a
single Paxos state machine on top of each tablet. (An
early Spanner incarnation supported multiple Paxos state
machines per tablet, which allowed for more flexible
replication configurations. The complexity of that design led us to abandon it.) Each state machine stores
its metadata and log in its corresponding tablet. Our
Paxos implementation supports long-lived leaders with
time-based leader leases, whose length defaults to 10
seconds. The current Spanner implementation logs every Paxos write twice: once in the tablet's log, and once
in the Paxos log. This choice was made out of expediency, and we are likely to remedy this eventually. Our
implementation of Paxos is pipelined, so as to improve
Spanner's throughput in the presence of WAN latencies;
but writes are applied by Paxos in order (a fact on which
we will depend in Section 4).
The Paxos state machines are used to implement a
consistently replicated bag of mappings. The key-value
mapping state of each replica is stored in its corresponding tablet. Writes must initiate the Paxos protocol at the
leader; reads access state directly from the underlying
tablet at any replica that is sufficiently up-to-date. The
set of replicas is collectively a Paxos group.
At every replica that is a leader, each spanserver implements a lock table to implement concurrency control.
The lock table contains the state for two-phase locking: it maps ranges of keys to lock states. (Note that
having a long-lived Paxos leader is critical to efficiently
managing the lock table.) In both Bigtable and Spanner, we designed for long-lived transactions (for example, for report generation, which might take on the order
of minutes), which perform poorly under optimistic concurrency control in the presence of conflicts. Operations that require synchronization, such as transactional reads, acquire locks in the lock table; other operations bypass
the lock table.
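As a minimal sketch of the idea (hypothetical names; the real lock table maps ranges of keys to richer lock states and detects overlapping-range conflicts, which this sketch does not):

    import threading

    class LockTable:
        # Toy two-phase-locking table: identical key ranges conflict, others do not.
        def __init__(self):
            self._guard = threading.Lock()
            self._locks = {}                     # (start_key, end_key) -> threading.Lock

        def acquire(self, start_key, end_key):
            with self._guard:
                lock = self._locks.setdefault((start_key, end_key), threading.Lock())
            lock.acquire()                       # synchronizing operations block here

        def release(self, start_key, end_key):
            self._locks[(start_key, end_key)].release()

Snapshot operations simply never call acquire(), which is what bypassing the lock table means above.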
At every replica that is a leader, each spanserver also
implements a transaction manager to support distributed
transactions. The transaction manager is used to implement a participant leader; the other replicas in the group
will be referred to as participant slaves. If a transaction involves only one Paxos group (as is the case for
most transactions), it can bypass the transaction manager,
since the lock table and Paxos together provide transactionality. If a transaction involves more than one Paxos
group, those groups' leaders coordinate to perform two-phase commit. One of the participant groups is chosen as
the coordinator: the participant leader of that group will
be referred to as the coordinator leader, and the slaves of
that group as coordinator slaves. The state of each transaction manager is stored in the underlying Paxos group
(and therefore is replicated).

2.2 Directories and Placement

On top of the bag of key-value mappings, the Spanner implementation supports a bucketing abstraction called a
directory, which is a set of contiguous keys that share a
common prefix. (The choice of the term directory is a
historical accident; a better term might be bucket.) We
will explain the source of that prefix in Section 2.3. Supporting directories allows applications to control the locality of their data by choosing keys carefully.
A directory is the unit of data placement. All data in
a directory has the same replication configuration. When
data is moved between Paxos groups, it is moved directory by directory, as shown in Figure 3. Spanner might
move a directory to shed load from a Paxos group; to put
directories that are frequently accessed together into the
same group; or to move a directory into a group that is
closer to its accessors. Directories can be moved while
client operations are ongoing. One could expect that a
50MB directory can be moved in a few seconds.
The fact that a Paxos group may contain multiple directories implies that a Spanner tablet is different from
a Bigtable tablet: the former is not necessarily a single lexicographically contiguous partition of the row space.
Instead, a Spanner tablet is a container that may encapsulate multiple partitions of the row space. We made this
decision so that it would be possible to colocate multiple
directories that are frequently accessed together.
Movedir is the background task used to move directories between Paxos groups [14]. Movedir is also used
to add or remove replicas to Paxos groups [25], because Spanner does not yet support in-Paxos configuration changes. Movedir is not implemented as a single
transaction, so as to avoid blocking ongoing reads and
writes on a bulky data move. Instead, movedir registers
the fact that it is starting to move data and moves the data
in the background. When it has moved all but a nominal
amount of the data, it uses a transaction to atomically
move that nominal amount and update the metadata for
the two Paxos groups.
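A self-contained toy of that two-step structure (groups are plain dicts here; every name is invented for the sketch and atomicity in the final step is only simulated):

    def movedir(directory, src_group, dst_group, nominal=16):
        # Phase 1: copy most of the data in the background while client
        # reads and writes continue against src_group.
        source = src_group[directory]
        copied = {}
        keys = list(source)
        while len(keys) > nominal:
            k = keys.pop()
            copied[k] = source[k]
        # Phase 2: one transaction moves the nominal remainder and updates
        # the metadata of both groups.
        for k in keys:
            copied[k] = source[k]
        dst_group[directory] = copied
        del src_group[directory]

    src = {"user_17": {f"key{i}": i for i in range(100)}}
    dst = {}
    movedir("user_17", src, dst)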
A directory is also the smallest unit whose geographic-replication properties (or placement, for short) can
be specified by an application. The design of our
placement-specification language separates responsibilities for managing replication configurations. Administrators control two dimensions: the number and types of
replicas, and the geographic placement of those replicas.
They create a menu of named options in these two dimensions (e.g., North America, replicated 5 ways with 1 witness). An application controls how data is replicated, by tagging each database and/or individual directories with a combination of those options. For example, an application might store each end-user's data in its own directory, which would enable user A's data to have three replicas in Europe, and user B's data to have five replicas
in North America.
For expository clarity we have over-simplified. In fact,
Spanner will shard a directory into multiple fragments
if it grows too large. Fragments may be served from
different Paxos groups (and therefore different servers).
Movedir actually moves fragments, and not whole directories, between groups.

2.3 Data Model

Spanner exposes the following set of data features to applications: a data model based on schematized semi-relational tables, a query language, and general-purpose transactions.
The move towards supporting these features was driven by many factors. The
need to support schematized semi-relational tables and
synchronous replication is supported by the popularity of Megastore [5]. At least 300 applications within
Google use Megastore (despite its relatively low performance) because its data model is simpler to manage than Bigtable's, and because of its support for synchronous replication across datacenters. (Bigtable only
supports eventually-consistent replication across datacenters.) Examples of well-known Google applications
that use Megastore are Gmail, Picasa, Calendar, Android
Market, and AppEngine. The need to support a SQL-like query language in Spanner was also clear, given the popularity of Dremel [28] as an interactive data-analysis tool. Finally, the lack of cross-row transactions
in Bigtable led to frequent complaints; Percolator [32]
was in part built to address this failing. Some authors
have claimed that general two-phase commit is too expensive to support, because of the performance or availability problems that it brings [9, 10, 19]. We believe it
is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack
of transactions. Running two-phase commit over Paxos
mitigates the availability problems.
The application data model is layered on top of the
directory-bucketed key-value mappings supported by the
implementation. An application creates one or more
databases in a universe. Each database can contain an
unlimited number of schematized tables. Tables look
like relational-database tables, with rows, columns, and
versioned values. We will not go into detail about the
query language for Spanner. It looks like SQL with some
extensions to support protocol-buffer-valued fields.
Spanner's data model is not purely relational, in that
rows must have names. More precisely, every table is required to have an ordered set of one or more primary-key
columns. This requirement is where Spanner still looks
like a key-value store: the primary keys form the name
for a row, and each table defines a mapping from the
primary-key columns to the non-primary-key columns.
A row has existence only if some value (even if it is
NULL) is defined for the row's keys. Imposing this structure is useful because it lets applications control data locality through their choices of keys.
Figure 4 contains an example Spanner schema for storing photo metadata on a per-user, per-album basis. The
schema language is similar to Megastore's, with the additional requirement that every Spanner database must
be partitioned by clients into one or more hierarchies
of tables. Client applications declare the hierarchies in
database schemas via the INTERLEAVE IN declarations. The table at the top of a hierarchy is a directory
table. Each row in a directory table with key K, together
with all of the rows in descendant tables that start with K
in lexicographic order, forms a directory. ON DELETE
CASCADE says that deleting a row in the directory table
deletes any associated child rows. The figure also illustrates the interleaved layout for the example database: for
4

CREATE TABLE Users {
  uid INT64 NOT NULL, email STRING
} PRIMARY KEY (uid), DIRECTORY;

CREATE TABLE Albums {
  uid INT64 NOT NULL, aid INT64 NOT NULL,
  name STRING
} PRIMARY KEY (uid, aid),
  INTERLEAVE IN PARENT Users ON DELETE CASCADE;

Figure 4: Example Spanner schema for photo metadata, and the interleaving implied by INTERLEAVE IN.

For example, Albums(2,1) represents the row from the Albums table for user id 2, album id 1. This
interleaving of tables to form directories is significant
because it allows clients to describe the locality relationships that exist between multiple tables, which is necessary for good performance in a sharded, distributed
database. Without it, Spanner would not know the most
important locality relationships.

3 TrueTime

Method         Returns
TT.now()       TTinterval: [earliest, latest]
TT.after(t)    true if t has definitely passed
TT.before(t)   true if t has definitely not arrived

Table 1: TrueTime API. The argument t is of type TTstamp.


This section describes the TrueTime API and sketches
its implementation. We leave most of the details for another paper: our goal is to demonstrate the power of
having such an API. Table 1 lists the methods of the
API. TrueTime explicitly represents time as a TTinterval,
which is an interval with bounded time uncertainty (unlike standard time interfaces that give clients no notion
of uncertainty). The endpoints of a TTinterval are of
type TTstamp. The TT.now() method returns a TTinterval
that is guaranteed to contain the absolute time during
which TT.now() was invoked. The time epoch is analogous to UNIX time with leap-second smearing. Define the instantaneous error bound as ε, which is half of the interval's width, and the average error bound as ε̄.
The TT.after() and TT.before() methods are convenience
wrappers around TT.now().
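A minimal sketch of the interface in Table 1, assuming an underlying clock source that can report its current uncertainty (the uncertainty_fn hook is an assumption of the sketch, not part of the published API):

    import time
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class TTinterval:
        earliest: float          # TTstamp endpoints; here, seconds since the UNIX epoch
        latest: float

    class TrueTime:
        def __init__(self, uncertainty_fn):
            self._uncertainty = uncertainty_fn   # returns the current epsilon, in seconds

        def now(self):
            eps = self._uncertainty()
            t = time.time()
            # The real implementation guarantees this interval contains absolute time.
            return TTinterval(t - eps, t + eps)

        def after(self, t):
            return self.now().earliest > t       # t has definitely passed

        def before(self, t):
            return self.now().latest < t         # t has definitely not arrived

    tt = TrueTime(lambda: 0.007)                 # a fixed 7 ms bound stands in for the real epsilon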
Denote the absolute time of an event e by the function t_abs(e). In more formal terms, TrueTime guarantees that for an invocation tt = TT.now(), tt.earliest ≤ t_abs(e_now) ≤ tt.latest, where e_now is the invocation event.
The underlying time references used by TrueTime
are GPS and atomic clocks. TrueTime uses two forms
of time reference because they have different failure
modes. GPS reference-source vulnerabilities include antenna and receiver failures, local radio interference, correlated failures (e.g., design faults such as incorrect leap-second handling and spoofing), and GPS system outages.
Atomic clocks can fail in ways uncorrelated to GPS and
each other, and over long periods of time can drift significantly due to frequency error.
TrueTime is implemented by a set of time master machines per datacenter and a timeslave daemon per machine. The majority of masters have GPS receivers with
dedicated antennas; these masters are separated physically to reduce the effects of antenna failures, radio interference, and spoofing. The remaining masters (which
we refer to as Armageddon masters) are equipped with
atomic clocks. An atomic clock is not that expensive:
the cost of an Armageddon master is of the same order
as that of a GPS master. All masters' time references
are regularly compared against each other. Each master also cross-checks the rate at which its reference advances time against its own local clock, and evicts itself
if there is substantial divergence. Between synchronizations, Armageddon masters advertise a slowly increasing
time uncertainty that is derived from conservatively applied worst-case clock drift. GPS masters advertise uncertainty that is typically close to zero.
Every daemon polls a variety of masters [29] to reduce vulnerability to errors from any one master. Some
are GPS masters chosen from nearby datacenters; the
rest are GPS masters from farther datacenters, as well
as some Armageddon masters. Daemons apply a variant
of Marzullo's algorithm [27] to detect and reject liars,
and synchronize the local machine clocks to the nonliars. To protect against broken local clocks, machines
that exhibit frequency excursions larger than the worstcase bound derived from component specifications and
operating environment are evicted.
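As a rough illustration of the liar-rejection step, here is textbook Marzullo-style interval intersection, not Google's actual variant: find the range contained in the largest number of advertised [earliest, latest] offsets and treat masters whose intervals miss it as liars.

    def marzullo(intervals):
        # intervals: list of (earliest, latest) clock offsets from different masters.
        events = []
        for lo, hi in intervals:
            events.append((lo, -1))              # interval opens
            events.append((hi, +1))              # interval closes
        events.sort()
        best, count, best_count = None, 0, 0
        for i, (point, kind) in enumerate(events):
            count -= kind                        # +1 on open, -1 on close
            if count > best_count:
                best_count = count
                best = (point, events[i + 1][0]) # agreement runs to the next event
        return best, best_count

    offsets = [(10.1, 10.4), (10.2, 10.5), (10.0, 10.3), (42.0, 42.2)]
    print(marzullo(offsets))                     # ((10.2, 10.3), 3): the fourth master is outvoted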
Between synchronizations, a daemon advertises a slowly increasing time uncertainty. ε is derived from conservatively applied worst-case local clock drift. ε also depends on time-master uncertainty and communication delay to the time masters. In our production environment, ε is typically a sawtooth function of time, varying from about 1 to 7 ms over each poll interval. ε̄ is therefore 4 ms most of the time. The daemon's poll interval is currently 30 seconds, and the current applied drift rate is set at 200 microseconds/second, which together account for the sawtooth bounds from 0 to 6 ms. The remaining 1 ms comes from the communication delay to the time masters. Excursions from this sawtooth are possible in the presence of failures. For example, occasional time-master unavailability can cause datacenter-wide increases in ε. Similarly, overloaded machines and network links can result in occasional localized ε spikes.

Operation                                  Concurrency Control   Replica Required                                       Timestamp Discussion
Read-Write Transaction                     pessimistic           leader                                                 4.1.2
Read-Only Transaction                      lock-free             leader for timestamp; any for read, subject to 4.1.3   4.1.4
Snapshot Read, client-provided timestamp   lock-free             any, subject to 4.1.3                                  -
Snapshot Read, client-provided bound       lock-free             any, subject to 4.1.3                                  4.1.3

Table 2: Types of reads and writes in Spanner, and how they compare.

4 Concurrency Control

This section describes how TrueTime is used to guarantee the correctness properties around concurrency control, and how those properties are used to implement
features such as externally consistent transactions, lockfree read-only transactions, and non-blocking reads in
the past. These features enable, for example, the guarantee that a whole-database audit read at a timestamp t
will see exactly the effects of every transaction that has
committed as of t.
Going forward, it will be important to distinguish
writes as seen by Paxos (which we will refer to as Paxos
writes unless the context is clear) from Spanner client
writes. For example, two-phase commit generates a
Paxos write for the prepare phase that has no corresponding Spanner client write.

4.1 Timestamp Management

Table 2 lists the types of operations that Spanner supports. The Spanner implementation supports read-write transactions, read-only transactions (predeclared
snapshot-isolation transactions), and snapshot reads.
Standalone writes are implemented as read-write transactions; non-snapshot standalone reads are implemented
as read-only transactions. Both are internally retried
(clients need not write their own retry loops).
A read-only transaction is a kind of transaction that
has the performance benefits of snapshot isolation [6].
A read-only transaction must be predeclared as not having any writes; it is not simply a read-write transaction
without any writes. Reads in a read-only transaction execute at a system-chosen timestamp without locking, so
that incoming writes are not blocked. The execution of
the reads in a read-only transaction can proceed on any replica that is sufficiently up-to-date (Section 4.1.3).
A snapshot read is a read in the past that executes without locking. A client can either specify a timestamp for a
snapshot read, or provide an upper bound on the desired
timestamp's staleness and let Spanner choose a timestamp. In either case, the execution of a snapshot read proceeds at any replica that is sufficiently up-to-date.
For both read-only transactions and snapshot reads, commit is inevitable once a timestamp has been chosen, unless the data at that timestamp has been garbage-collected. As a result, clients can avoid buffering results
inside a retry loop. When a server fails, clients can internally continue the query on a different server by repeating the timestamp and the current read position.
4.1.1 Paxos Leader Leases

Spanner's Paxos implementation uses timed leases to make leadership long-lived (10 seconds by default). A
potential leader sends requests for timed lease votes;
upon receiving a quorum of lease votes the leader knows
it has a lease. A replica extends its lease vote implicitly
on a successful write, and the leader requests lease-vote
extensions if they are near expiration. Define a leader's
lease interval as starting when it discovers it has a quorum of lease votes, and as ending when it no longer has
a quorum of lease votes (because some have expired).
Spanner depends on the following disjointness invariant:
for each Paxos group, each Paxos leader's lease interval is disjoint from every other leader's. Appendix A describes how this invariant is enforced.
The Spanner implementation permits a Paxos leader
to abdicate by releasing its slaves from their lease votes.
To preserve the disjointness invariant, Spanner constrains
when abdication is permissible. Define s_max to be the maximum timestamp used by a leader. Subsequent sections will describe when s_max is advanced. Before abdicating, a leader must wait until TT.after(s_max) is true.
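A small sketch of that abdication guard (illustrative names; the TrueTime object is the one sketched in Section 3):

    import time

    class LeaderLease:
        def __init__(self, truetime):
            self._tt = truetime
            self.s_max = float("-inf")           # maximum timestamp this leader has used

        def note_assigned_timestamp(self, s):
            self.s_max = max(self.s_max, s)      # s_max advances whenever a timestamp is assigned

        def abdicate(self, release_lease_votes):
            # Preserve disjointness: only release the slaves from their lease
            # votes once s_max is definitely in the past.
            while not self._tt.after(self.s_max):
                time.sleep(0.001)
            release_lease_votes()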
4.1.2 Assigning Timestamps to RW Transactions

Transactional reads and writes use two-phase locking. As a result, they can be assigned timestamps at any time when all locks have been acquired, but before any locks
have been released. For a given transaction, Spanner assigns it the timestamp that Paxos assigns to the Paxos
write that represents the transaction commit.
Spanner depends on the following monotonicity invariant: within each Paxos group, Spanner assigns timestamps to Paxos writes in monotonically increasing order, even across leaders. A single leader replica can trivially assign timestamps in monotonically increasing order. This invariant is enforced across leaders by making
use of the disjointness invariant: a leader must only assign timestamps within the interval of its leader lease.
Note that whenever a timestamp s is assigned, s_max is advanced to s to preserve disjointness.
Spanner also enforces the following external-consistency invariant: if the start of a transaction T2 occurs after the commit of a transaction T1, then the commit timestamp of T2 must be greater than the commit timestamp of T1. Define the start and commit events for a transaction Ti by e_i^start and e_i^commit; and the commit timestamp of a transaction Ti by s_i. The invariant becomes t_abs(e_1^commit) < t_abs(e_2^start) ⇒ s_1 < s_2. The protocol for executing transactions and assigning timestamps obeys two rules, which together guarantee this invariant, as shown below. Define the arrival event of the commit request at the coordinator leader for a write Ti to be e_i^server.

Start. The coordinator leader for a write Ti assigns a commit timestamp s_i no less than the value of TT.now().latest, computed after e_i^server. Note that the participant leaders do not matter here; Section 4.2.1 describes how they are involved in the implementation of the next rule.

Commit Wait. The coordinator leader ensures that clients cannot see any data committed by Ti until TT.after(s_i) is true. Commit wait ensures that s_i is less than the absolute commit time of Ti, or s_i < t_abs(e_i^commit). The implementation of commit wait is described in Section 4.2.1. Proof:

    s_1 < t_abs(e_1^commit)                (commit wait)
    t_abs(e_1^commit) < t_abs(e_2^start)   (assumption)
    t_abs(e_2^start) ≤ t_abs(e_2^server)   (causality)
    t_abs(e_2^server) ≤ s_2                (start)
    s_1 < s_2                              (transitivity)
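The two rules condense into a short sketch (reusing the TrueTime sketch from Section 3; the Paxos logging call is a placeholder supplied by the caller):

    import time

    def commit_with_truetime(tt, log_commit_via_paxos):
        # Start: pick a commit timestamp no less than TT.now().latest,
        # evaluated after the commit request has arrived.
        s = tt.now().latest
        log_commit_via_paxos(s)
        # Commit Wait: data committed at s stays invisible until TT.after(s)
        # holds, which guarantees s < t_abs(e^commit).
        while not tt.after(s):
            time.sleep(0.0005)               # expected wait is roughly 2 * average epsilon
        return s                             # now safe to make the commit visible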

4.1.3 Serving Reads at a Timestamp

The monotonicity invariant described in Section 4.1.2 allows Spanner to correctly determine whether a replica's state is sufficiently up-to-date to satisfy a read. Every replica tracks a value called safe time t_safe, which is the maximum timestamp at which a replica is up-to-date. A replica can satisfy a read at a timestamp t if t ≤ t_safe.

Define t_safe = min(t_safe^Paxos, t_safe^TM), where each Paxos state machine has a safe time t_safe^Paxos and each transaction manager has a safe time t_safe^TM. t_safe^Paxos is simpler: it is the timestamp of the highest-applied Paxos write. Because timestamps increase monotonically and writes are applied in order, writes will no longer occur at or below t_safe^Paxos with respect to Paxos.

t_safe^TM is ∞ at a replica if there are zero prepared (but not committed) transactions, that is, transactions in between the two phases of two-phase commit. (For a participant slave, t_safe^TM actually refers to the replica's leader's transaction manager, whose state the slave can infer through metadata passed on Paxos writes.) If there are any such transactions, then the state affected by those transactions is indeterminate: a participant replica does not know yet whether such transactions will commit. As we discuss in Section 4.2.1, the commit protocol ensures that every participant knows a lower bound on a prepared transaction's timestamp. Every participant leader (for a group g) for a transaction Ti assigns a prepare timestamp s_i,g^prepare to its prepare record. The coordinator leader ensures that the transaction's commit timestamp s_i ≥ s_i,g^prepare over all participant groups g. Therefore, for every replica in a group g, t_safe^TM = min_i(s_i,g^prepare) − 1 over all transactions Ti prepared at g.
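Putting the definitions together, a replica's safe-time check looks roughly like this (a sketch of the definitions above, not Spanner's code):

    import math

    def t_safe(highest_applied_write_ts, prepare_timestamps):
        # highest_applied_write_ts: timestamp of the highest-applied Paxos write.
        # prepare_timestamps: s^prepare values of transactions prepared (but not
        # committed) at this group.
        t_safe_paxos = highest_applied_write_ts
        if prepare_timestamps:
            # A prepared transaction may still commit at any timestamp >= its
            # prepare timestamp, so reads are only safe strictly below the minimum.
            t_safe_tm = min(prepare_timestamps) - 1
        else:
            t_safe_tm = math.inf
        return min(t_safe_paxos, t_safe_tm)

    def can_serve_read(read_ts, highest_applied_write_ts, prepare_timestamps):
        return read_ts <= t_safe(highest_applied_write_ts, prepare_timestamps)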
4.1.4 Assigning Timestamps to RO Transactions

A read-only transaction executes in two phases: assign a timestamp s_read [8], and then execute the transaction's reads as snapshot reads at s_read. The snapshot reads can
execute at any replicas that are sufficiently up-to-date.
The simple assignment of s_read = TT.now().latest, at any time after a transaction starts, preserves external consistency by an argument analogous to that presented for writes in Section 4.1.2. However, such a timestamp may require the execution of the data reads at s_read to block if t_safe has not advanced sufficiently. (In addition, note that choosing a value of s_read may also advance s_max to preserve disjointness.) To reduce the chances of blocking, Spanner should assign the oldest timestamp that preserves external consistency. Section 4.2.2 explains how
such a timestamp can be chosen.

4.2 Details

This section explains some of the practical details of read-write transactions and read-only transactions elided
earlier, as well as the implementation of a special transaction type used to implement atomic schema changes.
It then describes some refinements of the basic schemes as described.
4.2.1 Read-Write Transactions

Like Bigtable, writes that occur in a transaction are buffered at the client until commit. As a result, reads in a transaction do not see the effects of the transaction's
writes. This design works well in Spanner because a read
returns the timestamps of any data read, and uncommitted writes have not yet been assigned timestamps.
Reads within read-write transactions use wound-wait [33] to avoid deadlocks. The client issues reads
to the leader replica of the appropriate group, which
acquires read locks and then reads the most recent
data. While a client transaction remains open, it sends
keepalive messages to prevent participant leaders from
timing out its transaction. When a client has completed
all reads and buffered all writes, it begins two-phase
commit. The client chooses a coordinator group and
sends a commit message to each participant's leader with
the identity of the coordinator and any buffered writes.
Having the client drive two-phase commit avoids sending data twice across wide-area links.
A non-coordinator-participant leader first acquires
write locks. It then chooses a prepare timestamp that
must be larger than any timestamps it has assigned to previous transactions (to preserve monotonicity), and logs a
prepare record through Paxos. Each participant then notifies the coordinator of its prepare timestamp.
The coordinator leader also first acquires write locks,
but skips the prepare phase. It chooses a timestamp for
the entire transaction after hearing from all other participant leaders. The commit timestamp s must be greater than or
equal to all prepare timestamps (to satisfy the constraints
discussed in Section 4.1.3), greater than TT.now().latest
at the time the coordinator received its commit message,
and greater than any timestamps the leader has assigned
to previous transactions (again, to preserve monotonicity). The coordinator leader then logs a commit record
through Paxos (or an abort if it timed out while waiting
on the other participants).
Before allowing any coordinator replica to apply
the commit record, the coordinator leader waits until
TT.after(s), so as to obey the commit-wait rule described
in Section 4.1.2. Because the coordinator leader chose s
based on TT.now().latest, and now waits until that timestamp is guaranteed to be in the past, the expected wait
is at least 2ε̄. This wait is typically overlapped with
Paxos communication. After commit wait, the coordinator sends the commit timestamp to the client and all
other participant leaders. Each participant leader logs the
transactions outcome through Paxos. All participants
apply at the same timestamp and then release locks.
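The coordinator leader's commit path can be summarized as follows (a sketch reusing the TrueTime object from Section 3; the callables stand in for Paxos logging and messaging, and the strict-versus-non-strict inequalities are glossed over):

    import time

    def coordinator_commit(tt, prepare_timestamps, last_assigned_ts,
                           log_commit_via_paxos, reply_to_client, notify_participants):
        arrival_latest = tt.now().latest         # sampled when the commit message arrived
        # s must be >= every participant's prepare timestamp, greater than
        # TT.now().latest at arrival, and greater than previously assigned
        # timestamps (monotonicity).
        s = max(prepare_timestamps + [arrival_latest, last_assigned_ts])
        log_commit_via_paxos(s)                  # commit record (or an abort on timeout)
        while not tt.after(s):                   # commit wait, typically overlapped
            time.sleep(0.0005)                   # with Paxos communication
        reply_to_client(s)
        notify_participants(s)                   # participants apply at s, then release locks
        return s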
4.2.2 Read-Only Transactions

Assigning a timestamp requires a negotiation phase between all of the Paxos groups that are involved in the
reads. As a result, Spanner requires a scope expression
for every read-only transaction, which is an expression
that summarizes the keys that will be read by the entire
transaction. Spanner automatically infers the scope for
standalone queries.
If the scope's values are served by a single Paxos group, then the client issues the read-only transaction to that group's leader. (The current Spanner implementation only chooses a timestamp for a read-only transaction at a Paxos leader.) That leader assigns s_read and executes the read. For a single-site read, Spanner generally does better than TT.now().latest. Define LastTS() to be the timestamp of the last committed write at a Paxos group. If there are no prepared transactions, the assignment s_read = LastTS() trivially satisfies external consistency: the transaction will see the result of the last write,
and therefore be ordered after it.
If the scope's values are served by multiple Paxos groups, there are several options. The most complicated option is to do a round of communication with all of the groups' leaders to negotiate s_read based on LastTS(). Spanner currently implements a simpler choice. The client avoids a negotiation round, and just has its reads execute at s_read = TT.now().latest (which may wait for
safe time to advance). All reads in the transaction can be
sent to replicas that are sufficiently up-to-date.
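The resulting timestamp choice reduces to roughly the following (sketch; LastTS() and the prepared-transaction check are represented by arguments rather than real leader state):

    def choose_s_read(tt, groups, last_ts, has_prepared_txns):
        # last_ts: group -> timestamp of its last committed write (LastTS()).
        # has_prepared_txns: group -> whether prepared-but-uncommitted transactions exist.
        if len(groups) == 1 and not has_prepared_txns[groups[0]]:
            # Single-site read: LastTS() is ordered after the last write and is
            # usually older (so less likely to block) than TT.now().latest.
            return last_ts[groups[0]]
        # Multiple groups, or prepared transactions pending: skip negotiation and
        # use TT.now().latest, possibly waiting for safe time to advance.
        return tt.now().latest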

4.2.3 Schema-Change Transactions

TrueTime enables Spanner to support atomic schema changes. It would be infeasible to use a standard transaction, because the number of participants (the number of
groups in a database) could be in the millions. Bigtable
supports atomic schema changes in one datacenter, but
its schema changes block all operations.
A Spanner schema-change transaction is a generally
non-blocking variant of a standard transaction. First, it
is explicitly assigned a timestamp in the future, which
is registered in the prepare phase. As a result, schema
changes across thousands of servers can complete with
minimal disruption to other concurrent activity. Second, reads and writes, which implicitly depend on the
schema, synchronize with any registered schema-change
timestamp at time t: they may proceed if their timestamps precede t, but they must block behind the schema-change transaction if their timestamps are after t. Without TrueTime, defining the schema change to happen at t
would be meaningless.
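The synchronization amounts to a comparison against the registered schema-change timestamp (sketch; wait_until_applied stands in for blocking behind the schema change):

    def execute_op(op_timestamp, schema_change_ts, run_op, wait_until_applied):
        # Reads and writes whose timestamps precede the registered schema-change
        # timestamp t may proceed; those after t must block behind it.
        if schema_change_ts is not None and op_timestamp >= schema_change_ts:
            wait_until_applied(schema_change_ts)
        return run_op(op_timestamp)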
latency (ms):
replicas   write       read-only transaction   snapshot read
1D         9.4 ±.6
1          14.4 ±1.0   1.4 ±.1                  1.3 ±.1
3          13.9 ±.6    1.3 ±.1                  1.2 ±.1
5          14.4 ±.4    1.4 ±.05                 1.3 ±.04

throughput (Kops/sec):
replicas   write       read-only transaction   snapshot read
1D         4.0 ±.3
1          4.1 ±.05    10.9 ±.4                 13.5 ±.1
3          2.2 ±.5     13.8 ±3.2                38.5 ±.3
5          2.8 ±.3     25.3 ±5.2                50.0 ±1.1

Table 3: Operation microbenchmarks. Mean and standard deviation over 10 runs. 1D means one replica with commit wait disabled.
4.2.4 Refinements

t_safe^TM as defined above has a weakness, in that a single prepared transaction prevents t_safe from advancing. As a result, no reads can occur at later timestamps, even if the reads do not conflict with the transaction. Such false conflicts can be removed by augmenting t_safe^TM with a fine-grained mapping from key ranges to prepared-transaction timestamps. This information can be stored in the lock table, which already maps key ranges to lock metadata. When a read arrives, it only needs to be checked against the fine-grained safe time for key ranges with which the read conflicts.

LastTS() as defined above has a similar weakness: if a transaction has just committed, a non-conflicting read-only transaction must still be assigned s_read so as to follow that transaction. As a result, the execution of the read could be delayed. This weakness can be remedied similarly by augmenting LastTS() with a fine-grained mapping from key ranges to commit timestamps in the lock table. (We have not yet implemented this optimization.) When a read-only transaction arrives, its timestamp can be assigned by taking the maximum value of LastTS() for the key ranges with which the transaction conflicts, unless there is a conflicting prepared transaction (which can be determined from fine-grained safe time).

t_safe^Paxos as defined above has a weakness in that it cannot advance in the absence of Paxos writes. That is, a snapshot read at t cannot execute at Paxos groups whose last write happened before t. Spanner addresses this problem by taking advantage of the disjointness of leader-lease intervals. Each Paxos leader advances t_safe^Paxos by keeping a threshold above which future writes' timestamps will occur: it maintains a mapping MinNextTS(n) from Paxos sequence number n to the minimum timestamp that may be assigned to Paxos sequence number n + 1. A replica can advance t_safe^Paxos to MinNextTS(n) − 1 when it has applied through n.
A single leader can enforce its MinNextTS()
promises easily. Because the timestamps promised
by MinNextTS() lie within a leader's lease, the disjointness invariant enforces MinNextTS() promises across
leaders. If a leader wishes to advance MinNextTS()
beyond the end of its leader lease, it must first extend its
lease. Note that s_max is always advanced to the highest value in MinNextTS() to preserve disjointness.
A leader by default advances MinNextTS() values every 8 seconds. Thus, in the absence of prepared transactions, healthy slaves in an idle Paxos group can serve
reads at timestamps greater than 8 seconds old in the
worst case. A leader may also advance MinNextTS() values on demand from slaves.
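The refinement for t_safe^Paxos can be sketched as follows (illustrative only; MinNextTS is just a dict from Paxos sequence number to a timestamp promise):

    def t_safe_paxos(highest_applied_seq, applied_write_ts, min_next_ts):
        # applied_write_ts: sequence number -> timestamp of that applied Paxos write.
        # min_next_ts: n -> minimum timestamp that may be assigned to write n + 1.
        base = applied_write_ts[highest_applied_seq]
        promise = min_next_ts.get(highest_applied_seq)
        if promise is None:
            return base
        # Having applied through n, no future write can receive a timestamp below
        # MinNextTS(n), so reads are safe up to MinNextTS(n) - 1.
        return max(base, promise - 1)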

5 Evaluation

We first measure Spanner's performance with respect to replication, transactions, and availability. We then provide some data on TrueTime behavior, and a case study
of our first client, F1.

5.1 Microbenchmarks

Table 3 presents some microbenchmarks for Spanner. These measurements were taken on timeshared machines: each spanserver ran on scheduling units of 4GB
RAM and 4 cores (AMD Barcelona 2200MHz). Clients
were run on separate machines. Each zone contained one
spanserver. Clients and zones were placed in a set of datacenters with network distance of less than 1ms. (Such a
layout should be commonplace: most applications do not
need to distribute all of their data worldwide.) The test
database was created with 50 Paxos groups with 2500 directories. Operations were standalone reads and writes of
4KB. All reads were served out of memory after a compaction, so that we are only measuring the overhead of
Spanner's call stack. In addition, one unmeasured round
of reads was done first to warm any location caches.
For the latency experiments, clients issued sufficiently
few operations so as to avoid queuing at the servers.
From the 1-replica experiments, commit wait is about
5ms, and Paxos latency is about 9ms. As the number
of replicas increases, the latency stays roughly constant
with less standard deviation because Paxos executes in
parallel at a group's replicas. As the number of replicas
increases, the latency to achieve a quorum becomes less
sensitive to slowness at one slave replica.
participants   mean latency (ms)   99th percentile latency (ms)
1              17.0 ±1.4           75.0 ±34.9
2              24.5 ±2.5           87.6 ±35.9
5              31.5 ±6.2           104.5 ±52.2
10             30.0 ±3.7           95.6 ±25.4
25             35.5 ±5.6           100.4 ±42.7
50             42.7 ±4.1           93.7 ±22.9
100            71.4 ±7.6           131.2 ±17.6
200            150.5 ±11.0         320.3 ±35.1

Table 4: Two-phase commit scalability. Mean and standard deviations over 10 runs.

For the throughput experiments, clients issued sufficiently many operations so as to saturate the servers' CPUs. Snapshot reads can execute at any up-to-date replicas, so their throughput increases almost linearly
with the number of replicas. Single-read read-only transactions only execute at leaders because timestamp assignment must happen at leaders. Read-only-transaction
throughput increases with the number of replicas because
the number of effective spanservers increases: in the
experimental setup, the number of spanservers equaled
the number of replicas, and leaders were randomly distributed among the zones. Write throughput benefits
from the same experimental artifact (which explains the
increase in throughput from 3 to 5 replicas), but that benefit is outweighed by the linear increase in the amount of
work performed per write, as the number of replicas increases.
Table 4 demonstrates that two-phase commit can scale
to a reasonable number of participants: it summarizes
a set of experiments run across 3 zones, each with 25
spanservers. Scaling up to 50 participants is reasonable
in both mean and 99th-percentile, and latencies start to
rise noticeably at 100 participants.

5.2 Availability

Figure 5 illustrates the availability benefits of running Spanner in multiple datacenters. It shows the results of
three experiments on throughput in the presence of datacenter failure, all of which are overlaid onto the same
time scale. The test universe consisted of 5 zones Zi ,
each of which had 25 spanservers. The test database was
sharded into 1250 Paxos groups, and 100 test clients constantly issued non-snapshot reads at an aggregate rate
of 50K reads/second. All of the leaders were explicitly placed in Z1 . Five seconds into each test, all of
the servers in one zone were killed: non-leader kills Z2 ;
leader-hard kills Z1 ; leader-soft kills Z1 , but it gives notifications to all of the servers that they should handoff
leadership first.
Figure 5: Effect of killing servers on throughput.

Killing Z2 has no effect on read throughput. Killing Z1 while giving the leaders time to handoff leadership to a different zone has a minor effect: the throughput drop
is not visible in the graph, but is around 3-4%. On the
other hand, killing Z1 with no warning has a severe effect: the rate of completion drops almost to 0. As leaders
get re-elected, though, the throughput of the system rises
to approximately 100K reads/second because of two artifacts of our experiment: there is extra capacity in the
system, and operations are queued while the leader is unavailable. As a result, the throughput of the system rises
before leveling off again at its steady-state rate.
We can also see the effect of the fact that Paxos leader
leases are set to 10 seconds. When we kill the zone,
the leader-lease expiration times for the groups should
be evenly distributed over the next 10 seconds. Soon after each lease from a dead leader expires, a new leader is
elected. Approximately 10 seconds after the kill time, all
of the groups have leaders and throughput has recovered.
Shorter lease times would reduce the effect of server
deaths on availability, but would require greater amounts
of lease-renewal network traffic. We are in the process of
designing and implementing a mechanism that will cause
slaves to release Paxos leader leases upon leader failure.

5.3 TrueTime

Two questions must be answered with respect to TrueTime: is ε truly a bound on clock uncertainty, and how bad does ε get? For the former, the most serious problem would be if a local clock's drift were greater than 200us/sec: that would break assumptions made by TrueTime. Our machine statistics show that bad CPUs are 6 times more likely than bad clocks. That is, clock issues are extremely infrequent, relative to much more serious hardware problems. As a result, we believe that TrueTime's implementation is as trustworthy as any other piece of software upon which Spanner depends.
Figure 6 presents TrueTime data taken at several thousand spanserver machines across datacenters up to 2200 km apart. It plots the 90th, 99th, and 99.9th percentiles of ε, sampled at timeslave daemons immediately after polling the time masters. This sampling elides the sawtooth in ε due to local-clock uncertainty, and therefore measures time-master uncertainty (which is generally 0) plus communication delay to the time masters.

Figure 6: Distribution of TrueTime ε values, sampled right after timeslave daemon polls the time masters. 90th, 99th, and 99.9th percentiles are graphed.

# fragments   # directories
1             >100M
2-4           341
5-9           5336
10-14         232
15-99         34
100-500       7

Table 5: Distribution of directory-fragment counts in F1.

The data shows that these two factors in determining the base value of ε are generally not a problem. However, there can be significant tail-latency issues that cause higher values of ε. The reduction in tail latencies beginning on March 30 was due to networking improvements that reduced transient network-link congestion. The increase in ε on April 13, approximately one hour in duration, resulted from the shutdown of 2 time masters at a datacenter for routine maintenance. We continue to investigate and remove causes of TrueTime spikes.

5.4 F1

Spanner started being experimentally evaluated under production workloads in early 2011, as part of a rewrite of Google's advertising backend called F1 [35]. This
backend was originally based on a MySQL database that
was manually sharded many ways. The uncompressed
dataset is tens of terabytes, which is small compared to
many NoSQL instances, but was large enough to cause
difficulties with sharded MySQL. The MySQL sharding
scheme assigned each customer and all related data to a
fixed shard. This layout enabled the use of indexes and
complex query processing on a per-customer basis, but
required some knowledge of the sharding in application
business logic. Resharding this revenue-critical database
as it grew in the number of customers and their data was
extremely costly. The last resharding took over two years
of intense effort, and involved coordination and testing
across dozens of teams to minimize risk. This operation
was too complex to do regularly: as a result, the team had
to limit growth on the MySQL database by storing some data in external Bigtables, which compromised transactional behavior and the ability to query across all data.
The F1 team chose to use Spanner for several reasons. First, Spanner removes the need to manually reshard. Second, Spanner provides synchronous replication and automatic failover. With MySQL master-slave
replication, failover was difficult, and risked data loss
and downtime. Third, F1 requires strong transactional
semantics, which made using other NoSQL systems impractical. Application semantics requires transactions
across arbitrary data, and consistent reads. The F1 team
also needed secondary indexes on their data (since Spanner does not yet provide automatic support for secondary
indexes), and was able to implement their own consistent
global indexes using Spanner transactions.
All application writes are now by default sent through
F1 to Spanner, instead of the MySQL-based application
stack. F1 has 2 replicas on the west coast of the US, and
3 on the east coast. This choice of replica sites was made
to cope with outages due to potential major natural disasters, and also the choice of their frontend sites. Anecdotally, Spanner's automatic failover has been nearly invisible to them. Although there have been unplanned cluster
failures in the last few months, the most that the F1 team
has had to do is update their database's schema to tell
Spanner where to preferentially place Paxos leaders, so
as to keep them close to where their frontends moved.
Spanner's timestamp semantics made it efficient for
F1 to maintain in-memory data structures computed from
the database state. F1 maintains a logical history log of
all changes, which is written into Spanner itself as part
of every transaction. F1 takes full snapshots of data at a
timestamp to initialize its data structures, and then reads
incremental changes to update them.
Table 5 illustrates the distribution of the number of
fragments per directory in F1. Each directory typically
corresponds to a customer in the application stack above
F1. The vast majority of directories (and therefore customers) consist of only 1 fragment, which means that
reads and writes to those customers' data are guaranteed
to occur on only a single server. The directories with
more than 100 fragments are all tables that contain F1
secondary indexes: writes to more than a few fragments
of such tables are extremely uncommon. The F1 team has only seen such behavior when they do untuned bulk data loads as transactions.

operation            mean latency (ms)   std dev   count
all reads            8.7                 376.4     21.5B
single-site commit   72.3                112.8     31.2M
multi-site commit    103.0               52.2      32.1M

Table 6: F1-perceived operation latencies measured over the course of 24 hours.
Table 6 presents Spanner operation latencies as measured from F1 servers. Replicas in the east-coast data
centers are given higher priority in choosing Paxos leaders. The data in the table is measured from F1 servers
in those data centers. The large standard deviation in
write latencies is caused by a pretty fat tail due to lock
conflicts. The even larger standard deviation in read latencies is partially due to the fact that Paxos leaders are
spread across two data centers, only one of which has
machines with SSDs. In addition, the measurement includes every read in the system from two datacenters:
the mean and standard deviation of the bytes read were
roughly 1.6KB and 119KB, respectively.

6 Related Work

Consistent replication across datacenters as a storage service has been provided by Megastore [5] and DynamoDB [3]. DynamoDB presents a key-value interface,
and only replicates within a region. Spanner follows
Megastore in providing a semi-relational data model,
and even a similar schema language. Megastore does
not achieve high performance. It is layered on top of
Bigtable, which imposes high communication costs. It
also does not support long-lived leaders: multiple replicas may initiate writes. All writes from different replicas necessarily conflict in the Paxos protocol, even if
they do not logically conflict: throughput collapses on
a Paxos group at several writes per second. Spanner provides higher performance, general-purpose transactions,
and external consistency.
Pavlo et al. [31] have compared the performance of
databases and MapReduce [12]. They point to several
other efforts that have been made to explore database
functionality layered on distributed key-value stores [1,
4, 7, 41] as evidence that the two worlds are converging.
We agree with the conclusion, but demonstrate that integrating multiple layers has its advantages: integrating
concurrency control with replication reduces the cost of
commit wait in Spanner, for example.
The notion of layering transactions on top of a replicated store dates at least as far back as Gifford's dissertation [16]. Scatter [17] is a recent DHT-based key-value
store that layers transactions on top of consistent replication. Spanner focuses on providing a higher-level interface than Scatter does. Gray and Lamport [18] describe a non-blocking commit protocol based on Paxos.
Their protocol incurs more messaging costs than twophase commit, which would aggravate the cost of commit over widely distributed groups. Walter [36] provides
a variant of snapshot isolation that works within, but not
across datacenters. In contrast, our read-only transactions provide a more natural semantics, because we support external consistency over all operations.
There has been a spate of recent work on reducing
or eliminating locking overheads. Calvin [40] eliminates concurrency control: it pre-assigns timestamps and
then executes the transactions in timestamp order. H-Store [39] and Granola [11] each supported their own
classification of transaction types, some of which could
avoid locking. None of these systems provides external
consistency. Spanner addresses the contention issue by
providing support for snapshot isolation.
VoltDB [42] is a sharded in-memory database that
supports master-slave replication over the wide area for
disaster recovery, but not more general replication configurations. It is an example of what has been called
NewSQL, which is a marketplace push to support scalable SQL [38]. A number of commercial databases implement reads in the past, such as MarkLogic [26] and
Oracle's Total Recall [30]. Lomet and Li [24] describe an
implementation strategy for such a temporal database.
Farsite derived bounds on clock uncertainty (much
looser than TrueTime's) relative to a trusted clock reference [13]: server leases in Farsite were maintained in the
same way that Spanner maintains Paxos leases. Loosely
synchronized clocks have been used for concurrency-control purposes in prior work [2, 23]. We have shown
that TrueTime lets one reason about global time across
sets of Paxos state machines.

Future Work

We have spent most of the last year working with the F1 team to transition Google's advertising backend from MySQL to Spanner. We are actively improving its monitoring and support tools, as well as tuning its performance. In addition, we have been working on improving
the functionality and performance of our backup/restore
system. We are currently implementing the Spanner
schema language, automatic maintenance of secondary
indices, and automatic load-based resharding. Longer
term, there are a couple of features that we plan to investigate. Optimistically doing reads in parallel may be a valuable strategy to pursue, but initial experiments have
indicated that the right implementation is non-trivial. In
addition, we plan to eventually support direct changes of
Paxos configurations [22, 34].
Given that we expect many applications to replicate
their data across datacenters that are relatively close to
each other, TrueTime ε may noticeably affect performance. We see no insurmountable obstacle to reducing ε below 1ms. Time-master-query intervals can be
reduced, and better clock crystals are relatively cheap.
Time-master query latency could be reduced with improved networking technology, or possibly even avoided
through alternate time-distribution technology.
Finally, there are obvious areas for improvement. Although Spanner is scalable in the number of nodes, the
node-local data structures have relatively poor performance on complex SQL queries, because they were designed for simple key-value accesses. Algorithms and
data structures from DB literature could improve single-node performance a great deal. Second, moving data automatically between datacenters in response to changes
in client load has long been a goal of ours, but to make
that goal effective, we would also need the ability to
move client-application processes between datacenters in
an automated, coordinated fashion. Moving processes
raises the even more difficult problem of managing resource acquisition and allocation between datacenters.

Conclusions

To summarize, Spanner combines and extends on ideas from two research communities: from the database community, a familiar, easy-to-use, semi-relational interface,
transactions, and an SQL-based query language; from
the systems community, scalability, automatic sharding,
fault tolerance, consistent replication, external consistency, and wide-area distribution. Since Spanners inception, we have taken more than 5 years to iterate to the
current design and implementation. Part of this long iteration phase was due to a slow realization that Spanner
should do more than tackle the problem of a globallyreplicated namespace, and should also focus on database
features that Bigtable was missing.
One aspect of our design stands out: the linchpin of
Spanner's feature set is TrueTime. We have shown that
reifying clock uncertainty in the time API makes it possible to build distributed systems with much stronger time
semantics. In addition, as the underlying system enforces tighter bounds on clock uncertainty, the overhead
of the stronger semantics decreases. As a community, we
should no longer depend on loosely synchronized clocks
and weak time APIs in designing distributed algorithms.

Acknowledgements
Many people have helped to improve this paper: our
shepherd Jon Howell, who went above and beyond
his responsibilities; the anonymous referees; and many
Googlers: Atul Adya, Fay Chang, Frank Dabek, Sean
Dorward, Bob Gruber, David Held, Nick Kline, Alex
Thomson, and Joel Wein. Our management has been
very supportive of both our work and of publishing this
paper: Aristotle Balogh, Bill Coughran, Urs Hölzle,
Doron Meyer, Cos Nicolaou, Kathy Polizzi, Sridhar Ramaswany, and Shivakumar Venkataraman.
We have built upon the work of the Bigtable and
Megastore teams. The F1 team, and Jeff Shute in particular, worked closely with us in developing our data model
and helped immensely in tracking down performance and
correctness bugs. The Platforms team, and Luiz Barroso
and Bob Felderman in particular, helped to make TrueTime happen. Finally, a lot of Googlers used to be on our
team: Ken Ashcraft, Paul Cychosz, Krzysztof Ostrowski,
Amir Voskoboynik, Matthew Weaver, Theo Vassilakis,
and Eric Veach; or have joined our team recently: Nathan
Bales, Adam Beberg, Vadim Borisov, Ken Chen, Brian
Cooper, Cian Cullinan, Robert-Jan Huijsman, Milind
Joshi, Andrey Khorlin, Dawid Kuroczko, Laramie Leavitt, Eric Li, Mike Mammarella, Sunil Mushran, Simon
Nielsen, Ovidiu Platon, Ananth Shrinivas, Vadim Suvorov, and Marcel van der Holst.

References

[1] Azza Abouzeid et al. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proc. of VLDB. 2009, pp. 922–933.
[2] A. Adya et al. Efficient optimistic concurrency control using loosely synchronized clocks. Proc. of SIGMOD. 1995, pp. 23–34.
[3] Amazon. Amazon DynamoDB. 2012.
[4] Michael Armbrust et al. PIQL: Success-Tolerant Query Processing in the Cloud. Proc. of VLDB. 2011, pp. 181–192.
[5] Jason Baker et al. Megastore: Providing Scalable, Highly Available Storage for Interactive Services. Proc. of CIDR. 2011, pp. 223–234.
[6] Hal Berenson et al. A critique of ANSI SQL isolation levels. Proc. of SIGMOD. 1995, pp. 1–10.
[7] Matthias Brantner et al. Building a database on S3. Proc. of SIGMOD. 2008, pp. 251–264.
[8] A. Chan and R. Gray. Implementing Distributed Read-Only Transactions. IEEE TOSE SE-11.2 (Feb. 1985), pp. 205–212.
[9] Fay Chang et al. Bigtable: A Distributed Storage System for Structured Data. ACM TOCS 26.2 (June 2008), 4:1–4:26.
[10] Brian F. Cooper et al. PNUTS: Yahoo!'s hosted data serving platform. Proc. of VLDB. 2008, pp. 1277–1288.
[11] James Cowling and Barbara Liskov. Granola: Low-Overhead Distributed Transaction Coordination. Proc. of USENIX ATC. 2012, pp. 223–236.
[12] Jeffrey Dean and Sanjay Ghemawat. MapReduce: a flexible data processing tool. CACM 53.1 (Jan. 2010), pp. 72–77.
[13] John Douceur and Jon Howell. Scalable Byzantine-Fault-Quantifying Clock Synchronization. Tech. rep. MSR-TR-2003-67. MS Research, 2003.
[14] John R. Douceur and Jon Howell. Distributed directory service in the Farsite file system. Proc. of OSDI. 2006, pp. 321–334.
[15] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. Proc. of SOSP. Dec. 2003, pp. 29–43.
[16] David K. Gifford. Information Storage in a Decentralized Computer System. Tech. rep. CSL-81-8. PhD dissertation. Xerox PARC, July 1982.
[17] Lisa Glendenning et al. Scalable consistency in Scatter. Proc. of SOSP. 2011.
[18] Jim Gray and Leslie Lamport. Consensus on transaction commit. ACM TODS 31.1 (Mar. 2006), pp. 133–160.
[19] Pat Helland. Life beyond Distributed Transactions: an Apostate's Opinion. Proc. of CIDR. 2007, pp. 132–141.
[20] Maurice P. Herlihy and Jeannette M. Wing. Linearizability: a correctness condition for concurrent objects. ACM TOPLAS 12.3 (July 1990), pp. 463–492.
[21] Leslie Lamport. The part-time parliament. ACM TOCS 16.2 (May 1998), pp. 133–169.
[22] Leslie Lamport, Dahlia Malkhi, and Lidong Zhou. Reconfiguring a state machine. SIGACT News 41.1 (Mar. 2010), pp. 63–73.
[23] Barbara Liskov. Practical uses of synchronized clocks in distributed systems. Distrib. Comput. 6.4 (July 1993), pp. 211–219.
[24] David B. Lomet and Feifei Li. Improving Transaction-Time DBMS Performance and Functionality. Proc. of ICDE (2009), pp. 581–591.
[25] Jacob R. Lorch et al. The SMART way to migrate replicated stateful services. Proc. of EuroSys. 2006, pp. 103–115.
[26] MarkLogic. MarkLogic 5 Product Documentation. 2012.
[27] Keith Marzullo and Susan Owicki. Maintaining the time in a distributed system. Proc. of PODC. 1983, pp. 295–305.
[28] Sergey Melnik et al. Dremel: Interactive Analysis of Web-Scale Datasets. Proc. of VLDB. 2010, pp. 330–339.
[29] D.L. Mills. Time synchronization in DCNET hosts. Internet Project Report IEN-173. COMSAT Laboratories, Feb. 1981.
[30] Oracle. Oracle Total Recall. 2012.
[31] Andrew Pavlo et al. A comparison of approaches to large-scale data analysis. Proc. of SIGMOD. 2009, pp. 165–178.
[32] Daniel Peng and Frank Dabek. Large-scale incremental processing using distributed transactions and notifications. Proc. of OSDI. 2010, pp. 1–15.
[33] Daniel J. Rosenkrantz, Richard E. Stearns, and Philip M. Lewis II. System level concurrency control for distributed database systems. ACM TODS 3.2 (June 1978), pp. 178–198.
[34] Alexander Shraer et al. Dynamic Reconfiguration of Primary/Backup Clusters. Proc. of USENIX ATC. 2012, pp. 425–438.
[35] Jeff Shute et al. F1 - The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business. Proc. of SIGMOD. May 2012, pp. 777–778.
[36] Yair Sovran et al. Transactional storage for geo-replicated systems. Proc. of SOSP. 2011, pp. 385–400.
[37] Michael Stonebraker. Why Enterprises Are Uninterested in NoSQL. 2010.
[38] Michael Stonebraker. Six SQL Urban Myths. 2010.
[39] Michael Stonebraker et al. The end of an architectural era: (it's time for a complete rewrite). Proc. of VLDB. 2007, pp. 1150–1160.
[40] Alexander Thomson et al. Calvin: Fast Distributed Transactions for Partitioned Database Systems. Proc. of SIGMOD. 2012, pp. 1–12.
[41] Ashish Thusoo et al. Hive - A Petabyte Scale Data Warehouse Using Hadoop. Proc. of ICDE. 2010, pp. 996–1005.
[42] VoltDB. VoltDB Resources. 2012.


Paxos Leader-Lease Management

The simplest means to ensure the disjointness of Paxos-leader-lease intervals would be for a leader to issue a synchronous Paxos write of the lease interval, whenever it would be extended. A subsequent leader would read the interval and wait until that interval has passed.

TrueTime can be used to ensure disjointness without these extra log writes. The potential i-th leader keeps a lower bound on the start of a lease vote from replica r as v_{i,r}^{leader} = TT.now().earliest, computed before e_{i,r}^{send} (defined as when the lease request is sent by the leader). Each replica r grants a lease at event e_{i,r}^{grant}, which happens after e_{i,r}^{receive} (when the replica receives a lease request); the lease ends at t_{i,r}^{end} = TT.now().latest + 10, computed after e_{i,r}^{receive}. A replica r obeys the single-vote rule: it will not grant another lease vote until TT.after(t_{i,r}^{end}) is true. To enforce this rule across different incarnations of r, Spanner logs a lease vote at the granting replica before granting the lease; this log write can be piggybacked upon existing Paxos-protocol log writes.

When the i-th leader receives a quorum of votes (event e_i^{quorum}), it computes its lease interval as lease_i = [TT.now().latest, min_r(v_{i,r}^{leader}) + 10]. The lease is deemed to have expired at the leader when TT.before(min_r(v_{i,r}^{leader}) + 10) is false. To prove disjointness, we make use of the fact that the i-th and (i+1)-th leaders must have one replica in common in their quorums. Call that replica r0. Proof:

  lease_i.end = min_r(v_{i,r}^{leader}) + 10                      (by definition)
  min_r(v_{i,r}^{leader}) + 10 <= v_{i,r0}^{leader} + 10          (min)
  v_{i,r0}^{leader} + 10 <= t_abs(e_{i,r0}^{send}) + 10           (by definition)
  t_abs(e_{i,r0}^{send}) + 10 <= t_abs(e_{i,r0}^{receive}) + 10   (causality)
  t_abs(e_{i,r0}^{receive}) + 10 <= t_{i,r0}^{end}                (by definition)
  t_{i,r0}^{end} < t_abs(e_{i+1,r0}^{grant})                      (single-vote)
  t_abs(e_{i+1,r0}^{grant}) <= t_abs(e_{i+1}^{quorum})            (causality)
  t_abs(e_{i+1}^{quorum}) <= lease_{i+1}.start                    (by definition)
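
To make the lease bookkeeping concrete, here is a minimal Python sketch of the vote and lease-interval computation described above. It is not Spanner's implementation: tt_now(), Replica, and leader_lease are hypothetical names, the 10-second lease length is taken from the text, and EPSILON is an assumed stand-in for the uncertainty bound that TrueTime actually measures.

    import time

    EPSILON = 0.004   # assumed clock-uncertainty bound in seconds (hypothetical)
    LEASE_LEN = 10.0  # lease length from the text, in seconds

    def tt_now():
        """Hypothetical TrueTime stand-in: an interval guaranteed to contain real time."""
        t = time.time()
        return (t - EPSILON, t + EPSILON)   # (earliest, latest)

    class Replica:
        def __init__(self):
            self.vote_end = float("-inf")   # t_end of the last lease vote granted

        def grant_vote(self):
            # Single-vote rule: do not grant another vote until TT.after(vote_end).
            earliest, _ = tt_now()
            if earliest <= self.vote_end:   # not yet certainly past the previous vote
                return False
            self.vote_end = tt_now()[1] + LEASE_LEN   # computed after receiving the request
            return True

    def leader_lease(replicas, quorum_size):
        """Collect votes and compute the leader's conservative lease interval."""
        vote_lower_bounds = []
        for r in replicas:
            v = tt_now()[0]                 # v_leader: recorded before sending the request
            if r.grant_vote():
                vote_lower_bounds.append(v)
            if len(vote_lower_bounds) >= quorum_size:
                lease_start = tt_now()[1]   # TT.now().latest at quorum
                lease_end = min(vote_lower_bounds) + LEASE_LEN
                return (lease_start, lease_end)
        return None

    replicas = [Replica() for _ in range(5)]
    print(leader_lease(replicas, quorum_size=3))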


Dynamo: Amazon's Highly Available Key-value Store


Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati,
Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall
and Werner Vogels
Amazon.com

One of the lessons our organization has learned from operating Amazon's platform is that the reliability and scalability of a
system is dependent on how its application state is managed.
Amazon uses a highly decentralized, loosely coupled, service
oriented architecture consisting of hundreds of services. In this
environment there is a particular need for storage technologies
that are always available. For example, customers should be able
to view and add items to their shopping cart even if disks are
failing, network routes are flapping, or data centers are being
destroyed by tornados. Therefore, the service responsible for
managing shopping carts requires that it can always write to and
read from its data store, and that its data needs to be available
across multiple data centers.

ABSTRACT
Reliability at massive scale is one of the biggest challenges we
face at Amazon.com, one of the largest e-commerce operations in
the world; even the slightest outage has significant financial
consequences and impacts customer trust. The Amazon.com
platform, which provides services for many web sites worldwide,
is implemented on top of an infrastructure of tens of thousands of
servers and network components located in many datacenters
around the world. At this scale, small and large components fail
continuously and the way persistent state is managed in the face
of these failures drives the reliability and scalability of the
software systems.
This paper presents the design and implementation of Dynamo, a
highly available key-value storage system that some of Amazon's core services use to provide an "always-on" experience. To
achieve this level of availability, Dynamo sacrifices consistency
under certain failure scenarios. It makes extensive use of object
versioning and application-assisted conflict resolution in a manner
that provides a novel interface for developers to use.

Dealing with failures in an infrastructure comprised of millions of


components is our standard mode of operation; there are always a
small but significant number of server and network components
that are failing at any given time. As such Amazon's software
systems need to be constructed in a manner that treats failure
handling as the normal case without impacting availability or
performance.

Categories and Subject Descriptors

To meet the reliability and scaling needs, Amazon has developed


a number of storage technologies, of which the Amazon Simple
Storage Service (also available outside of Amazon and known as
Amazon S3), is probably the best known. This paper presents the
design and implementation of Dynamo, another highly available
and scalable distributed data store built for Amazon's platform.
Dynamo is used to manage the state of services that have very
high reliability requirements and need tight control over the
tradeoffs between availability, consistency, cost-effectiveness and
performance. Amazon's platform has a very diverse set of
applications with different storage requirements. A select set of
applications requires a storage technology that is flexible enough
to let application designers configure their data store appropriately
based on these tradeoffs to achieve high availability and
guaranteed performance in the most cost effective manner.

D.4.2 [Operating Systems]: Storage Management; D.4.5


[Operating Systems]: Reliability; D.4.2 [Operating Systems]:
Performance;

General Terms
Algorithms, Management, Measurement, Performance, Design,
Reliability.

1. INTRODUCTION
Amazon runs a world-wide e-commerce platform that serves tens
of millions of customers at peak times using tens of thousands of
servers located in many data centers around the world. There are
strict operational requirements on Amazon's platform in terms of
performance, reliability and efficiency, and to support continuous
growth the platform needs to be highly scalable. Reliability is one
of the most important requirements because even the slightest
outage has significant financial consequences and impacts
customer trust. In addition, to support continuous growth, the
platform needs to be highly scalable.

There are many services on Amazon's platform that only need


primary-key access to a data store. For many services, such as
those that provide best seller lists, shopping carts, customer
preferences, session management, sales rank, and product catalog,
the common pattern of using a relational database would lead to
inefficiencies and limit scale and availability. Dynamo provides a
simple primary-key only interface to meet the requirements of
these applications.

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
SOSP'07, October 14–17, 2007, Stevenson, Washington, USA.
Copyright 2007 ACM 978-1-59593-591-5/07/0010...$5.00.

Dynamo uses a synthesis of well known techniques to achieve


scalability and availability: Data is partitioned and replicated
using consistent hashing [10], and consistency is facilitated by
object versioning [12]. The consistency among replicas during
updates is maintained by a quorum-like technique and a
decentralized replica synchronization protocol. Dynamo employs


This paper describes Dynamo, a highly available data storage


technology that addresses the needs of these important classes of
services. Dynamo has a simple key/value interface, is highly
available with a clearly defined consistency window, is efficient
in its resource usage, and has a simple scale out scheme to address
growth in data set size or request rates. Each service that uses
Dynamo runs its own Dynamo instances.

a gossip based distributed failure detection and membership


protocol. Dynamo is a completely decentralized system with
minimal need for manual administration. Storage nodes can be
added and removed from Dynamo without requiring any manual
partitioning or redistribution.
In the past year, Dynamo has been the underlying storage
technology for a number of the core services in Amazon's e-commerce platform. It was able to scale to extreme peak loads
efficiently without any downtime during the busy holiday
shopping season. For example, the service that maintains
shopping cart (Shopping Cart Service) served tens of millions
requests that resulted in well over 3 million checkouts in a single
day and the service that manages session state handled hundreds
of thousands of concurrently active sessions.

2.1 System Assumptions and Requirements

The storage system for this class of services has the following
requirements:
Query Model: simple read and write operations to a data item that
is uniquely identified by a key. State is stored as binary objects
(i.e., blobs) identified by unique keys. No operations span
multiple data items and there is no need for relational schema.
This requirement is based on the observation that a significant
portion of Amazon's services can work with this simple query
model and do not need any relational schema. Dynamo targets
applications that need to store objects that are relatively small
(usually less than 1 MB).

The main contribution of this work for the research community is


the evaluation of how different techniques can be combined to
provide a single highly-available system. It demonstrates that an
eventually-consistent storage system can be used in production
with demanding applications. It also provides insight into the
tuning of these techniques to meet the requirements of production
systems with very strict performance demands.

ACID Properties: ACID (Atomicity, Consistency, Isolation,


Durability) is a set of properties that guarantee that database
transactions are processed reliably. In the context of databases, a
single logical operation on the data is called a transaction.
Experience at Amazon has shown that data stores that provide
ACID guarantees tend to have poor availability. This has been
widely acknowledged by both the industry and academia [5].
Dynamo targets applications that operate with weaker consistency
(the "C" in ACID) if this results in high availability. Dynamo
does not provide any isolation guarantees and permits only single
key updates.

The paper is structured as follows. Section 2 presents the


background and Section 3 presents the related work. Section 4
presents the system design and Section 5 describes the
implementation. Section 6 details the experiences and insights
gained by running Dynamo in production and Section 7 concludes
the paper. There are a number of places in this paper where
additional information may have been appropriate but where
protecting Amazon's business interests requires us to reduce some
level of detail. For this reason, the intra- and inter-datacenter
latencies in section 6, the absolute request rates in section 6.2 and
outage lengths and workloads in section 6.3 are provided through
aggregate measures instead of absolute details.

Efficiency: The system needs to function on a commodity


hardware infrastructure. In Amazon's platform, services have
stringent latency requirements which are in general measured at
the 99.9th percentile of the distribution. Given that state access
plays a crucial role in service operation the storage system must
be capable of meeting such stringent SLAs (see Section 2.2
below). Services must be able to configure Dynamo such that they
consistently achieve their latency and throughput requirements.
The tradeoffs are in performance, cost efficiency, availability, and
durability guarantees.

2. BACKGROUND
Amazon's e-commerce platform is composed of hundreds of
services that work in concert to deliver functionality ranging from
recommendations to order fulfillment to fraud detection. Each
service is exposed through a well defined interface and is
accessible over the network. These services are hosted in an
infrastructure that consists of tens of thousands of servers located
across many data centers world-wide. Some of these services are
stateless (i.e., services which aggregate responses from other
services) and some are stateful (i.e., a service that generates its
response by executing business logic on its state stored in
persistent store).

Other Assumptions: Dynamo is used only by Amazon's internal


services. Its operation environment is assumed to be non-hostile
and there are no security related requirements such as
authentication and authorization. Moreover, since each service
uses its distinct instance of Dynamo, its initial design targets a
scale of up to hundreds of storage hosts. We will discuss the
scalability limitations of Dynamo and possible scalability related
extensions in later sections.

Traditionally production systems store their state in relational


databases. For many of the more common usage patterns of state
persistence, however, a relational database is a solution that is far
from ideal. Most of these services only store and retrieve data by
primary key and do not require the complex querying and
management functionality offered by an RDBMS. This excess
functionality requires expensive hardware and highly skilled
personnel for its operation, making it a very inefficient solution.
In addition, the available replication technologies are limited and
typically choose consistency over availability. Although many
advances have been made in the recent years, it is still not easy to
scale-out databases or use smart partitioning schemes for load
balancing.

2.2 Service Level Agreements (SLA)

To guarantee that the application can deliver its functionality in a


bounded time, each and every dependency in the platform needs
to deliver its functionality with even tighter bounds. Clients and
services engage in a Service Level Agreement (SLA), a formally
negotiated contract where a client and a service agree on several
system-related characteristics, which most prominently include
the client's expected request rate distribution for a particular API
and the expected service latency under those conditions. An
example of a simple SLA is a service guaranteeing that it will


production systems have shown that this approach provides a


better overall experience compared to those systems that meet
SLAs defined based on the mean or median.
In this paper there are many references to this 99.9th percentile of distributions, which reflects Amazon engineers' relentless focus on performance from the perspective of the customers' experience. Many papers report on averages, so these are included where it makes sense for comparison purposes. Nevertheless, Amazon's engineering and optimization efforts are not focused on averages. Several techniques, such as the load balanced selection of write coordinators, are purely targeted at controlling performance at the 99.9th percentile.
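
As a rough illustration of judging a service by its latency tail rather than its mean, the sketch below computes a nearest-rank 99.9th percentile over a batch of latency samples; p999, meets_sla, and the 300 ms bound are hypothetical, chosen to mirror the example SLA discussed in this section.

    def p999(latencies_ms):
        """Nearest-rank 99.9th percentile of a list of request latencies (milliseconds)."""
        ordered = sorted(latencies_ms)
        rank = max(1, int(round(0.999 * len(ordered))))
        return ordered[rank - 1]

    def meets_sla(latencies_ms, bound_ms=300.0):
        # The SLA is judged by the tail, not the mean: 99.9% of requests must beat the bound.
        return p999(latencies_ms) <= bound_ms

    samples = [12.0] * 9980 + [450.0] * 20      # 0.2% of requests are slow outliers
    print(p999(samples), meets_sla(samples))    # 450.0 False: the mean looks fine, the tail does not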
Storage systems often play an important role in establishing a service's SLA, especially if the business logic is relatively lightweight, as is the case for many Amazon services. State management then becomes the main component of a service's SLA. One of the main design considerations for Dynamo is to give services control over their system properties, such as durability and consistency, and to let services make their own tradeoffs between functionality, performance and cost-effectiveness.

Figure 1: Service-oriented architecture of Amazon's platform.

2.3 Design Considerations

Data replication algorithms used in commercial systems


traditionally perform synchronous replica coordination in order to
provide a strongly consistent data access interface. To achieve this
level of consistency, these algorithms are forced to tradeoff the
availability of the data under certain failure scenarios. For
instance, rather than dealing with the uncertainty of the
correctness of an answer, the data is made unavailable until it is
absolutely certain that it is correct. From the very early replicated
database works, it is well known that when dealing with the
possibility of network failures, strong consistency and high data
availability cannot be achieved simultaneously [2, 11]. As such
systems and applications need to be aware which properties can
be achieved under which conditions.

provide a response within 300ms for 99.9% of its requests for a


peak client load of 500 requests per second.
In Amazon's decentralized service-oriented infrastructure, SLAs
play an important role. For example a page request to one of the
e-commerce sites typically requires the rendering engine to
construct its response by sending requests to over 150 services.
These services often have multiple dependencies, which
frequently are other services, and as such it is not uncommon for
the call graph of an application to have more than one level. To
ensure that the page rendering engine can maintain a clear bound
on page delivery each service within the call chain must obey its
performance contract.
Figure 1 shows an abstract view of the architecture of Amazon's
platform, where dynamic web content is generated by page
rendering components which in turn query many other services. A
service can use different data stores to manage its state and these
data stores are only accessible within its service boundaries. Some
services act as aggregators by using several other services to
produce a composite response. Typically, the aggregator services
are stateless, although they use extensive caching.

For systems prone to server and network failures, availability can


be increased by using optimistic replication techniques, where
changes are allowed to propagate to replicas in the background,
and concurrent, disconnected work is tolerated. The challenge
with this approach is that it can lead to conflicting changes which
must be detected and resolved. This process of conflict resolution
introduces two problems: when to resolve them and who resolves
them. Dynamo is designed to be an eventually consistent data
store; that is all updates reach all replicas eventually.

A common approach in the industry for forming a performance


oriented SLA is to describe it using average, median and expected
variance. At Amazon we have found that these metrics are not
good enough if the goal is to build a system where all customers
have a good experience, rather than just the majority. For
example if extensive personalization techniques are used then
customers with longer histories require more processing which
impacts performance at the high-end of the distribution. An SLA
stated in terms of mean or median response times will not address
the performance of this important customer segment. To address
this issue, at Amazon, SLAs are expressed and measured at the
99.9th percentile of the distribution. The choice for 99.9% over an
even higher percentile has been made based on a cost-benefit
analysis which demonstrated a significant increase in cost to
improve performance that much. Experiences with Amazon's

An important design consideration is to decide when to perform


the process of resolving update conflicts, i.e., whether conflicts
should be resolved during reads or writes. Many traditional data
stores execute conflict resolution during writes and keep the read
complexity simple [7]. In such systems, writes may be rejected if
the data store cannot reach all (or a majority of) the replicas at a
given time. On the other hand, Dynamo targets the design space
of an always writeable data store (i.e., a data store that is highly
available for writes). For a number of Amazon services, rejecting
customer updates could result in a poor customer experience. For
instance, the shopping cart service must allow customers to add
and remove items from their shopping cart even amidst network
and server failures. This requirement forces us to push the
complexity of conflict resolution to the reads in order to ensure
that writes are never rejected.


Various storage systems, such as Oceanstore [9] and PAST [17]


were built on top of these routing overlays. Oceanstore provides a
global, transactional, persistent storage service that supports
serialized updates on widely replicated data. To allow for
concurrent updates while avoiding many of the problems inherent
with wide-area locking, it uses an update model based on conflict
resolution. Conflict resolution was introduced in [21] to reduce
the number of transaction aborts. Oceanstore resolves conflicts by
processing a series of updates, choosing a total order among them,
and then applying them atomically in that order. It is built for an
environment where the data is replicated on an untrusted
infrastructure. By comparison, PAST provides a simple
abstraction layer on top of Pastry for persistent and immutable
objects. It assumes that the application can build the necessary
storage semantics (such as mutable files) on top of it.

The next design choice is who performs the process of conflict


resolution. This can be done by the data store or the application. If
conflict resolution is done by the data store, its choices are rather
limited. In such cases, the data store can only use simple policies,
such as "last write wins" [22], to resolve conflicting updates. On
the other hand, since the application is aware of the data schema it
can decide on the conflict resolution method that is best suited for
its client's experience. For instance, the application that maintains
customer shopping carts can choose to merge the conflicting
versions and return a single unified shopping cart. Despite this
flexibility, some application developers may not want to write
their own conflict resolution mechanisms and choose to push it
down to the data store, which in turn chooses a simple policy such
as "last write wins".
Other key principles embraced in the design are:


Incremental scalability: Dynamo should be able to scale out one


storage host (henceforth, referred to as node) at a time, with
minimal impact on both operators of the system and the system
itself.
Symmetry: Every node in Dynamo should have the same set of
responsibilities as its peers; there should be no distinguished node
or nodes that take special roles or extra set of responsibilities. In
our experience, symmetry simplifies the process of system
provisioning and maintenance.
Decentralization: An extension of symmetry, the design should
favor decentralized peer-to-peer techniques over centralized
control. In the past, centralized control has resulted in outages and
the goal is to avoid it as much as possible. This leads to a simpler,
more scalable, and more available system.
Heterogeneity: The system needs to be able to exploit
heterogeneity in the infrastructure it runs on. e.g. the work
distribution must be proportional to the capabilities of the
individual servers. This is essential in adding new nodes with
higher capacity without having to upgrade all hosts at once.

Among these systems, Bayou, Coda and Ficus allow disconnected


operations and are resilient to issues such as network partitions
and outages. These systems differ on their conflict resolution
procedures. For instance, Coda and Ficus perform system level
conflict resolution and Bayou allows application level resolution.
All of them, however, guarantee eventual consistency. Similar to
these systems, Dynamo allows read and write operations to
continue even during network partitions and resolves update
conflicts using different conflict resolution mechanisms.
Distributed block storage systems like FAB [18] split large size
objects into smaller blocks and stores each block in a highly
available manner. In comparison to these systems, a key-value
store is more suitable in this case because: (a) it is intended to
store relatively small objects (size < 1M) and (b) key-value stores
are easier to configure on a per-application basis. Antiquity is a
wide-area distributed storage system designed to handle multiple
server failures [23]. It uses a secure log to preserve data integrity,
replicates each log on multiple servers for durability, and uses
Byzantine fault tolerance protocols to ensure data consistency. In
contrast to Antiquity, Dynamo does not focus on the problem of
data integrity and security and is built for a trusted environment.
Bigtable is a distributed storage system for managing structured
data. It maintains a sparse, multi-dimensional sorted map and
allows applications to access their data using multiple attributes
[2]. Compared to Bigtable, Dynamo targets applications that
require only key/value access with primary focus on high
availability where updates are not rejected even in the wake of
network partitions or server failures.

3. RELATED WORK
3.1 Peer to Peer Systems
There are several peer-to-peer (P2P) systems that have looked at
the problem of data storage and distribution. The first generation
of P2P systems, such as Freenet and Gnutella¹, were
predominantly used as file sharing systems. These were examples
of unstructured P2P networks where the overlay links between
peers were established arbitrarily. In these networks, a search
query is usually flooded through the network to find as many
peers as possible that share the data. P2P systems evolved to the
next generation into what is widely known as structured P2P
networks. These networks employ a globally consistent protocol
to ensure that any node can efficiently route a search query to
some peer that has the desired data. Systems like Pastry [16] and
Chord [20] use routing mechanisms to ensure that queries can be
answered within a bounded number of hops. To reduce the
additional latency introduced by multi-hop routing, some P2P
systems (e.g., [14]) employ O(1) routing where each peer
maintains enough routing information locally so that it can route
requests (to access a data item) to the appropriate peer within a
constant number of hops.
3.2 Distributed File Systems and Databases

Distributing data for performance, availability and durability has


been widely studied in the file system and database systems
community. Compared to P2P storage systems that only support
flat namespaces, distributed file systems typically support
hierarchical namespaces. Systems like Ficus [15] and Coda [19]
replicate files for high availability at the expense of consistency.
Update conflicts are typically managed using specialized conflict
resolution procedures. The Farsite system [1] is a distributed file
system that does not use any centralized server like NFS. Farsite
achieves high availability and scalability using replication. The
Google File System [6] is another distributed file system built for
hosting the state of Google's internal applications. GFS uses a
simple design with a single master server for hosting the entire
metadata and where the data is split into chunks and stored in
chunkservers. Bayou is a distributed relational database system
that allows disconnected operations and provides eventual data
consistency [21].

¹ http://freenetproject.org/, http://www.gnutella.org


Table 1: Summary of techniques used in Dynamo and their advantages.

Problem: Partitioning
  Technique: Consistent hashing
  Advantage: Incremental scalability.

Problem: High availability for writes
  Technique: Vector clocks with reconciliation during reads
  Advantage: Version size is decoupled from update rates.

Problem: Handling temporary failures
  Technique: Sloppy quorum and hinted handoff
  Advantage: Provides high availability and durability guarantee when some of the replicas are not available.

Problem: Recovering from permanent failures
  Technique: Anti-entropy using Merkle trees
  Advantage: Synchronizes divergent replicas in the background.

Problem: Membership and failure detection
  Technique: Gossip-based membership protocol and failure detection
  Advantage: Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information.

Figure 2: Partitioning and replication of keys in Dynamo ring. (Figure annotation: nodes B, C and D store keys in range (A, B) including K.)
Traditional replicated relational database systems focus on the
problem of guaranteeing strong consistency to replicated data.
Although strong consistency provides the application writer a
convenient programming model, these systems are limited in
scalability and availability [7]. These systems are not capable of
handling network partitions because they typically provide strong
consistency guarantees.

3.3 Discussion

Table 1 presents a summary of the list of techniques Dynamo uses and their respective advantages.

Dynamo differs from the aforementioned decentralized storage


systems in terms of its target requirements. First, Dynamo is
targeted mainly at applications that need an always writeable
data store where no updates are rejected due to failures or
concurrent writes. This is a crucial requirement for many Amazon
applications. Second, as noted earlier, Dynamo is built for an
infrastructure within a single administrative domain where all
nodes are assumed to be trusted. Third, applications that use
Dynamo do not require support for hierarchical namespaces (a
norm in many file systems) or complex relational schema
(supported by traditional databases). Fourth, Dynamo is built for
latency sensitive applications that require at least 99.9% of read
and write operations to be performed within a few hundred
milliseconds. To meet these stringent latency requirements, it was
imperative for us to avoid routing requests through multiple nodes
(which is the typical design adopted by several distributed hash
table systems such as Chord and Pastry). This is because multi-hop routing increases variability in response times, thereby
increasing the latency at higher percentiles. Dynamo can be
characterized as a zero-hop DHT, where each node maintains
enough routing information locally to route a request to the
appropriate node directly.

4.1 System Interface

Dynamo stores objects associated with a key through a simple


interface; it exposes two operations: get() and put(). The get(key)
operation locates the object replicas associated with the key in the
storage system and returns a single object or a list of objects with
conflicting versions along with a context. The put(key, context,
object) operation determines where the replicas of the object
should be placed based on the associated key, and writes the
replicas to disk. The context encodes system metadata about the
object that is opaque to the caller and includes information such as
the version of the object. The context information is stored along
with the object so that the system can verify the validity of the
context object supplied in the put request.
Dynamo treats both the key and the object supplied by the caller
as an opaque array of bytes. It applies an MD5 hash on the key to
generate a 128-bit identifier, which is used to determine the
storage nodes that are responsible for serving the key.
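
A minimal sketch of this two-operation interface and the MD5-derived key identifier might look as follows; TinyStore and key_id are hypothetical names, and the single-process dictionary merely stands in for the replicated storage system.

    import hashlib

    def key_id(key: bytes) -> int:
        """128-bit identifier derived from the MD5 hash of the caller-supplied key."""
        return int.from_bytes(hashlib.md5(key).digest(), "big")

    class TinyStore:
        """Hypothetical single-node stand-in for the get()/put() interface."""
        def __init__(self):
            self._data = {}   # key id -> list of (context, object) versions

        def get(self, key: bytes):
            # Returns all stored versions together with their contexts.
            return self._data.get(key_id(key), [])

        def put(self, key: bytes, context, obj: bytes):
            # A real coordinator would place replicas by key_id; here we store locally.
            self._data.setdefault(key_id(key), []).append((context, obj))

    s = TinyStore()
    s.put(b"cart:1", context=None, obj=b"book")
    print(s.get(b"cart:1"))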

4.2 Partitioning Algorithm

One of the key design requirements for Dynamo is that it must


scale incrementally. This requires a mechanism to dynamically
partition the data over the set of nodes (i.e., storage hosts) in the
system. Dynamo's partitioning scheme relies on consistent
hashing to distribute the load across multiple storage hosts. In
consistent hashing [10], the output range of a hash function is
treated as a fixed circular space or "ring" (i.e. the largest hash value wraps around to the smallest hash value). Each node in the system is assigned a random value within this space which represents its "position" on the ring. Each data item identified by a key is assigned to a node by hashing the data item's key to yield its position on the ring, and then walking the ring clockwise to find the first node with a position larger than the item's position.

4. SYSTEM ARCHITECTURE
The architecture of a storage system that needs to operate in a
production setting is complex. In addition to the actual data
persistence component, the system needs to have scalable and
robust solutions for load balancing, membership and failure
detection, failure recovery, replica synchronization, overload
handling, state transfer, concurrency and job scheduling, request
marshalling, request routing, system monitoring and alarming,
and configuration management. Describing the details of each of
the solutions is not possible, so this paper focuses on the core
distributed systems techniques used in Dynamo: partitioning,
replication, versioning, membership, failure handling and scaling.

209
199

Thus, each node becomes responsible for the region in the ring
between it and its predecessor node on the ring. The principal
advantage of consistent hashing is that departure or arrival of a
node only affects its immediate neighbors and other nodes remain
unaffected.

return to its caller before the update has been applied at all the
replicas, which can result in scenarios where a subsequent get()
operation may return an object that does not have the latest
updates. If there are no failures then there is a bound on the
update propagation times. However, under certain failure
scenarios (e.g., server outages or network partitions), updates may
not arrive at all replicas for an extended period of time.

The basic consistent hashing algorithm presents some challenges.


First, the random position assignment of each node on the ring
leads to non-uniform data and load distribution. Second, the basic
algorithm is oblivious to the heterogeneity in the performance of
nodes. To address these issues, Dynamo uses a variant of
consistent hashing (similar to the one used in [10, 20]): instead of
mapping a node to a single point in the circle, each node gets
assigned to multiple points in the ring. To this end, Dynamo uses
the concept of virtual nodes. A virtual node looks like a single
node in the system, but each node can be responsible for more
than one virtual node. Effectively, when a new node is added to
the system, it is assigned multiple positions (henceforth, tokens)
in the ring. The process of fine-tuning Dynamo's partitioning
scheme is discussed in Section 6.

There is a category of applications in Amazons platform that can


tolerate such inconsistencies and can be constructed to operate
under these conditions. For example, the shopping cart application
requires that an Add to Cart operation can never be forgotten or
rejected. If the most recent state of the cart is unavailable, and a
user makes changes to an older version of the cart, that change is
still meaningful and should be preserved. But at the same time it
shouldn't supersede the currently unavailable state of the cart, which itself may contain changes that should be preserved. Note that both "add to cart" and "delete item from cart" operations are
translated into put requests to Dynamo. When a customer wants to
add an item to (or remove from) a shopping cart and the latest
version is not available, the item is added to (or removed from)
the older version and the divergent versions are reconciled later.

Using virtual nodes has the following advantages:

If a node becomes unavailable (due to failures or routine


maintenance), the load handled by this node is evenly
dispersed across the remaining available nodes.

When a node becomes available again, or a new node is


added to the system, the newly available node accepts a
roughly equivalent amount of load from each of the other
available nodes.

The number of virtual nodes that a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure.
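
The following sketch illustrates consistent hashing with virtual nodes along the lines described above; Ring, ring_pos, and the choice of eight tokens per physical node are assumptions for illustration, not Dynamo's actual parameters.

    import bisect
    import hashlib

    def ring_pos(value: bytes) -> int:
        # 128-bit position on the ring, mirroring the MD5-based key placement in the text.
        return int.from_bytes(hashlib.md5(value).digest(), "big")

    class Ring:
        """Consistent-hashing ring; each physical node owns several virtual-node tokens."""
        def __init__(self, nodes, tokens_per_node=8):
            self._tokens = sorted(
                (ring_pos(f"{node}#{t}".encode()), node)
                for node in nodes
                for t in range(tokens_per_node)
            )

        def successor(self, key: bytes) -> str:
            # Walk clockwise from the key's position to the first token at or past it,
            # wrapping around to the smallest token if necessary.
            i = bisect.bisect_left(self._tokens, (ring_pos(key), ""))
            return self._tokens[i % len(self._tokens)][1]

    ring = Ring(["A", "B", "C", "D"])
    print(ring.successor(b"cart:12345"))   # the physical node whose token follows this key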


In order to provide this kind of guarantee, Dynamo treats the


result of each modification as a new and immutable version of the
data. It allows for multiple versions of an object to be present in
the system at the same time. Most of the time, new versions
subsume the previous version(s), and the system itself can
determine the authoritative version (syntactic reconciliation).
However, version branching may happen, in the presence of
failures combined with concurrent updates, resulting in
conflicting versions of an object. In these cases, the system cannot
reconcile the multiple versions of the same object and the client
must perform the reconciliation in order to collapse multiple
branches of data evolution back into one (semantic
reconciliation). A typical example of a collapse operation is
merging different versions of a customer's shopping cart. Using
this reconciliation mechanism, an add to cart operation is never
lost. However, deleted items can resurface.

4.3 Replication

To achieve high availability and durability, Dynamo replicates its


data on multiple hosts. Each data item is replicated at N hosts,
where N is a parameter configured per-instance. Each key, k, is
assigned to a coordinator node (described in the previous section).
The coordinator is in charge of the replication of the data items
that fall within its range. In addition to locally storing each key
within its range, the coordinator replicates these keys at the N-1
clockwise successor nodes in the ring. This results in a system
where each node is responsible for the region of the ring between
it and its Nth predecessor. In Figure 2, node B replicates the key k
at nodes C and D in addition to storing it locally. Node D will
store the keys that fall in the ranges (A, B], (B, C], and (C, D].

It is important to understand that certain failure modes can


potentially result in the system having not just two but several
versions of the same data. Updates in the presence of network
partitions and node failures can potentially result in an object
having distinct version sub-histories, which the system will need
to reconcile in the future. This requires us to design applications
that explicitly acknowledge the possibility of multiple versions of
the same data (in order to never lose any updates).
Dynamo uses vector clocks [12] in order to capture causality
between different versions of the same object. A vector clock is
effectively a list of (node, counter) pairs. One vector clock is
associated with every version of every object. One can determine
whether two versions of an object are on parallel branches or have
a causal ordering, by examine their vector clocks. If the counters
on the first objects clock are less-than-or-equal to all of the nodes
in the second clock, then the first is an ancestor of the second and
can be forgotten. Otherwise, the two changes are considered to be
in conflict and require reconciliation.
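
A small sketch of this causality check follows, representing a vector clock as a mapping from node name to counter; the names descends and compare are hypothetical, and the example reuses the D3/D4 clocks from the worked example elsewhere in this section.

    def descends(a: dict, b: dict) -> bool:
        """True if clock a dominates clock b: every counter in b is <= the matching counter in a."""
        return all(a.get(node, 0) >= counter for node, counter in b.items())

    def compare(a: dict, b: dict) -> str:
        if descends(a, b) and not descends(b, a):
            return "a supersedes b"
        if descends(b, a) and not descends(a, b):
            return "b supersedes a"
        if descends(a, b) and descends(b, a):
            return "equal"
        return "conflict: needs reconciliation"

    d3 = {"Sx": 2, "Sy": 1}
    d4 = {"Sx": 2, "Sz": 1}
    print(compare(d3, d4))   # conflict: neither clock descends from the other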

The list of nodes that is responsible for storing a particular key is


called the preference list. The system is designed, as will be
explained in Section 4.8, so that every node in the system can
determine which nodes should be in this list for any particular
key. To account for node failures, the preference list contains more
than N nodes. Note that with the use of virtual nodes, it is possible
that the first N successor positions for a particular key may be
owned by less than N distinct physical nodes (i.e. a node may
hold more than one of the first N positions). To address this, the
preference list for a key is constructed by skipping positions in the
ring to ensure that the list contains only distinct physical nodes.
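
One possible rendering of that construction is sketched below; preference_list and the token layout are hypothetical, and the loop simply keeps walking the ring clockwise until it has collected n distinct physical nodes.

    import bisect
    import hashlib

    def ring_pos(value: bytes) -> int:
        return int.from_bytes(hashlib.md5(value).digest(), "big")

    def preference_list(key: bytes, tokens, n: int):
        """tokens: sorted (position, physical node) pairs, several per physical node.
        Walk the ring clockwise from the key and keep the first n distinct physical nodes."""
        start = bisect.bisect_left(tokens, (ring_pos(key), ""))
        chosen, seen = [], set()
        for step in range(len(tokens)):
            node = tokens[(start + step) % len(tokens)][1]
            if node not in seen:   # skip extra virtual nodes of an already-chosen host
                seen.add(node)
                chosen.append(node)
            if len(chosen) == n:
                break
        return chosen

    tokens = sorted((ring_pos(f"{host}#{t}".encode()), host) for host in "ABCD" for t in range(8))
    print(preference_list(b"cart:12345", tokens, n=3))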


In Dynamo, when a client wishes to update an object, it must


specify which version it is updating. This is done by passing the
context it obtained from an earlier read operation, which contains
the vector clock information. Upon processing a read request, if

4.4 Data Versioning

Dynamo provides eventual consistency, which allows for updates


to be propagated to all replicas asynchronously. A put() call may


object. In practice, this is not likely because the writes are usually
handled by one of the top N nodes in the preference list. In case of
network partitions or multiple server failures, write requests may
be handled by nodes that are not in the top N nodes in the
preference list causing the size of vector clock to grow. In these
scenarios, it is desirable to limit the size of vector clock. To this
end, Dynamo employs the following clock truncation scheme:
Along with each (node, counter) pair, Dynamo stores a timestamp
that indicates the last time the node updated the data item. When
the number of (node, counter) pairs in the vector clock reaches a
threshold (say 10), the oldest pair is removed from the clock.
Clearly, this truncation scheme can lead to inefficiencies in
reconciliation as the descendant relationships cannot be derived
accurately. However, this problem has not surfaced in production
and therefore this issue has not been thoroughly investigated.
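
A minimal sketch of such a truncation scheme might look like this; record_write and the (counter, timestamp) representation are assumptions, with the threshold of 10 taken from the text.

    import time

    TRUNCATION_THRESHOLD = 10   # threshold used in the text ("say 10")

    def record_write(clock, node):
        """clock: dict mapping node name -> (counter, timestamp of that node's last update)."""
        counter, _ = clock.get(node, (0, 0.0))
        clock[node] = (counter + 1, time.time())
        if len(clock) > TRUNCATION_THRESHOLD:
            # Drop the pair whose owning node updated the item least recently.
            oldest = min(clock, key=lambda name: clock[name][1])
            del clock[oldest]
        return clock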

4.5 Execution of get() and put() operations

Any storage node in Dynamo is eligible to receive client get and


put operations for any key. In this section, for sake of simplicity,
we describe how these operations are performed in a failure-free
environment and in the subsequent section we describe how read
and write operations are executed during failures.

Figure 3: Version evolution of an object over time.


Dynamo has access to multiple branches that cannot be
syntactically reconciled, it will return all the objects at the leaves,
with the corresponding version information in the context. An
update using this context is considered to have reconciled the
divergent versions and the branches are collapsed into a single
new version.

Both get and put operations are invoked using Amazon's


infrastructure-specific request processing framework over HTTP.
There are two strategies that a client can use to select a node: (1)
route its request through a generic load balancer that will select a
node based on load information, or (2) use a partition-aware client
library that routes requests directly to the appropriate coordinator
nodes. The advantage of the first approach is that the client does
not have to link any code specific to Dynamo in its application,
whereas the second strategy can achieve lower latency because it
skips a potential forwarding step.

To illustrate the use of vector clocks, let us consider the example


shown in Figure 3. A client writes a new object. The node (say
Sx) that handles the write for this key increases its sequence
number and uses it to create the data's vector clock. The system
now has the object D1 and its associated clock [(Sx, 1)]. The
client updates the object. Assume the same node handles this
request as well. The system now also has object D2 and its
associated clock [(Sx, 2)]. D2 descends from D1 and therefore
over-writes D1, however there may be replicas of D1 lingering at
nodes that have not yet seen D2. Let us assume that the same
client updates the object again and a different server (say Sy)
handles the request. The system now has data D3 and its
associated clock [(Sx, 2), (Sy, 1)].

A node handling a read or write operation is known as the


coordinator. Typically, this is the first among the top N nodes in
the preference list. If the requests are received through a load
balancer, requests to access a key may be routed to any random
node in the ring. In this scenario, the node that receives the
request will not coordinate it if the node is not in the top N of the
requested key's preference list. Instead, that node will forward the
request to the first among the top N nodes in the preference list.
Read and write operations involve the first N healthy nodes in the
preference list, skipping over those that are down or inaccessible.
When all nodes are healthy, the top N nodes in a key's preference
list are accessed. When there are node failures or network
partitions, nodes that are lower ranked in the preference list are
accessed.

Next assume a different client reads D2 and then tries to update it,
and another node (say Sz) does the write. The system now has D4
(descendant of D2) whose version clock is [(Sx, 2), (Sz, 1)]. A
node that is aware of D1 or D2 could determine, upon receiving
D4 and its clock, that D1 and D2 are overwritten by the new data
and can be garbage collected. A node that is aware of D3 and
receives D4 will find that there is no causal relation between
them. In other words, there are changes in D3 and D4 that are not
reflected in each other. Both versions of the data must be kept and
presented to a client (upon a read) for semantic reconciliation.

To maintain consistency among its replicas, Dynamo uses a


consistency protocol similar to those used in quorum systems.
This protocol has two key configurable values: R and W. R is the
minimum number of nodes that must participate in a successful
read operation. W is the minimum number of nodes that must
participate in a successful write operation. Setting R and W such
that R + W > N yields a quorum-like system. In this model, the
latency of a get (or put) operation is dictated by the slowest of the
R (or W) replicas. For this reason, R and W are usually
configured to be less than N, to provide better latency.
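
A simplified sketch of the R/W accounting follows; QuorumConfig, Replica, and their store/fetch methods are hypothetical stand-ins, and a real coordinator would contact the replicas in parallel rather than in this sequential loop.

    class QuorumConfig:
        """R/W settings for an N-replica key; R + W > N gives quorum-style consistency."""
        def __init__(self, n=3, r=2, w=2):
            assert r + w > n, "choose R and W so that R + W > N"
            self.n, self.r, self.w = n, r, w

    class Replica:
        """In-memory stand-in for a storage host (hypothetical interface)."""
        def __init__(self):
            self._kv = {}
        def store(self, key, version):
            self._kv.setdefault(key, []).append(version)
            return True   # acknowledge the write
        def fetch(self, key):
            return self._kv.get(key)

    def quorum_write(replicas, key, version, cfg):
        # Send the new version to the N highest-ranked reachable replicas; succeed on W acks.
        acks = sum(1 for rep in replicas[:cfg.n] if rep.store(key, version))
        return acks >= cfg.w

    def quorum_read(replicas, key, cfg):
        # Ask the N highest-ranked replicas and return once R of them have responded.
        versions = []
        for rep in replicas[:cfg.n]:
            v = rep.fetch(key)
            if v is not None:
                versions.append(v)
            if len(versions) >= cfg.r:
                break
        return versions   # may contain divergent versions; the caller reconciles them

    cfg = QuorumConfig()
    nodes = [Replica() for _ in range(3)]
    quorum_write(nodes, "cart:1", {"items": ["book"]}, cfg)
    print(quorum_read(nodes, "cart:1", cfg))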

Now assume some client reads both D3 and D4 (the context will
reflect that both values were found by the read). The read's
context is a summary of the clocks of D3 and D4, namely [(Sx, 2),
(Sy, 1), (Sz, 1)]. If the client performs the reconciliation and node
Sx coordinates the write, Sx will update its sequence number in
the clock. The new data D5 will have the following clock: [(Sx,
3), (Sy, 1), (Sz, 1)].

Upon receiving a put() request for a key, the coordinator generates


the vector clock for the new version and writes the new version
locally. The coordinator then sends the new version (along with

A possible issue with vector clocks is that the size of vector


clocks may grow if many servers coordinate the writes to an


the new vector clock) to the N highest-ranked reachable nodes. If


at least W-1 nodes respond then the write is considered
successful.

the original replica node. To handle this and other threats to


durability, Dynamo implements an anti-entropy (replica
synchronization) protocol to keep the replicas synchronized.

Similarly, for a get() request, the coordinator requests all existing


versions of data for that key from the N highest-ranked reachable
nodes in the preference list for that key, and then waits for R
responses before returning the result to the client. If the
coordinator ends up gathering multiple versions of the data, it
returns all the versions it deems to be causally unrelated. The
divergent versions are then reconciled and the reconciled version
superseding the current versions is written back.

To detect the inconsistencies between replicas faster and to


minimize the amount of transferred data, Dynamo uses Merkle
trees [13]. A Merkle tree is a hash tree where leaves are hashes of
the values of individual keys. Parent nodes higher in the tree are
hashes of their respective children. The principal advantage of a
Merkle tree is that each branch of the tree can be checked
independently without requiring nodes to download the entire tree
or the entire data set. Moreover, Merkle trees help in reducing the
amount of data that needs to be transferred while checking for
inconsistencies among replicas. For instance, if the hash values of
the root of two trees are equal, then the values of the leaf nodes in
the tree are equal and the nodes require no synchronization. If not,
it implies that the values of some replicas are different. In such
cases, the nodes may exchange the hash values of children and the
process continues until it reaches the leaves of the trees, at which
point the hosts can identify the keys that are out of sync. Merkle
trees minimize the amount of data that needs to be transferred for
synchronization and reduce the number of disk reads performed
during the anti-entropy process.
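
The following sketch shows the top-down hash exchange described above for two Merkle trees that cover the same key range and have the same shape; the hashing scheme and tree construction are illustrative, not Dynamo's on-disk format.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

/** Simplified Merkle-tree node over a sorted key range; hashing scheme is illustrative. */
final class MerkleNode {
    final String hash;
    final MerkleNode left, right;     // null for leaves
    final String key;                 // set only on leaves

    private MerkleNode(String hash, MerkleNode left, MerkleNode right, String key) {
        this.hash = hash; this.left = left; this.right = right; this.key = key;
    }

    static MerkleNode leaf(String key, String value) {
        return new MerkleNode(sha256(key + "=" + value), null, null, key);
    }

    static MerkleNode parent(MerkleNode left, MerkleNode right) {
        return new MerkleNode(sha256(left.hash + right.hash), left, right, null);
    }

    /** Exchange hashes top-down; descend only into subtrees whose hashes differ. */
    static void diff(MerkleNode a, MerkleNode b, List<String> outOfSyncKeys) {
        if (a.hash.equals(b.hash)) return;            // identical subtree: nothing to transfer
        if (a.key != null) {                          // reached divergent leaves
            outOfSyncKeys.add(a.key);
            return;
        }
        diff(a.left, b.left, outOfSyncKeys);
        diff(a.right, b.right, outOfSyncKeys);
    }

    private static String sha256(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return Base64.getEncoder().encodeToString(md.digest(s.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        MerkleNode replicaA = parent(leaf("k1", "v1"), leaf("k2", "v2"));
        MerkleNode replicaB = parent(leaf("k1", "v1"), leaf("k2", "stale"));
        List<String> out = new ArrayList<>();
        diff(replicaA, replicaB, out);
        System.out.println(out);                      // [k2]: only this key needs synchronization
    }
}
```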

4.6 Handling Failures: Hinted Handoff

If Dynamo used a traditional quorum approach it would be
unavailable during server failures and network partitions, and
would have reduced durability even under the simplest of failure
conditions. To remedy this it does not enforce strict quorum
membership and instead it uses a sloppy quorum; all read and
write operations are performed on the first N healthy nodes from
the preference list, which may not always be the first N nodes
encountered while walking the consistent hashing ring.
Consider the example of Dynamo configuration given in Figure 2
with N=3. In this example, if node A is temporarily down or
unreachable during a write operation then a replica that would
normally have lived on A will now be sent to node D. This is done
to maintain the desired availability and durability guarantees. The
replica sent to D will have a hint in its metadata that suggests
which node was the intended recipient of the replica (in this case
A). Nodes that receive hinted replicas will keep them in a
separate local database that is scanned periodically. Upon
detecting that A has recovered, D will attempt to deliver the
replica to A. Once the transfer succeeds, D may delete the object
from its local store without decreasing the total number of replicas
in the system.
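
The following sketch illustrates the hinted-replica bookkeeping described above: a replica intended for A is kept on D in a separate local store together with a hint, and is handed back (and then dropped locally) once A is reachable again. The storage, liveness check and transfer callbacks are placeholders.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of hinted-handoff bookkeeping on a substitute node; storage and liveness are placeholders. */
final class HintedHandoffStore {
    /** A replica held on behalf of another node, with the intended recipient recorded as a hint. */
    record HintedReplica(String key, byte[] value, String intendedNode) {}

    private final Map<String, HintedReplica> hintedReplicas = new ConcurrentHashMap<>();

    /** Keep a hinted replica in a separate local store (keyed here by the object's key). */
    void storeHinted(String key, byte[] value, String intendedNode) {
        hintedReplicas.put(key, new HintedReplica(key, value, intendedNode));
    }

    /** Periodically scan the hinted store and try to deliver replicas back to recovered nodes. */
    void deliverHints(java.util.function.Predicate<String> isReachable,
                      java.util.function.BiPredicate<String, HintedReplica> transfer) {
        for (Map.Entry<String, HintedReplica> entry : hintedReplicas.entrySet()) {
            HintedReplica replica = entry.getValue();
            // Only attempt delivery once the intended recipient is reachable again.
            if (isReachable.test(replica.intendedNode())
                    && transfer.test(replica.intendedNode(), replica)) {
                // Transfer succeeded: the hinted copy can be dropped without
                // reducing the total number of replicas in the system.
                hintedReplicas.remove(entry.getKey());
            }
        }
    }
}
```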

Dynamo uses Merkle trees for anti-entropy as follows: Each node
maintains a separate Merkle tree for each key range (the set of
keys covered by a virtual node) it hosts. This allows nodes to
compare whether the keys within a key range are up-to-date. In
this scheme, two nodes exchange the root of the Merkle tree
corresponding to the key ranges that they host in common.
Subsequently, using the tree traversal scheme described above the
nodes determine if they have any differences and perform the
appropriate synchronization action. The disadvantage with this
scheme is that many key ranges change when a node joins or
leaves the system thereby requiring the tree(s) to be recalculated.
This issue is addressed, however, by the refined partitioning
scheme described in Section 6.2.

Using hinted handoff, Dynamo ensures that read and write
operations do not fail due to temporary node or network
failures. Applications that need the highest level of availability
can set W to 1, which ensures that a write is accepted as long as a
single node in the system has durably written the key to its local
store. Thus, the write request is only rejected if all nodes in the
system are unavailable. However, in practice, most Amazon
services in production set a higher W to meet the desired level of
durability. A more detailed discussion of configuring N, R and W
follows in Section 6.

4.8 Membership and Failure Detection

4.8.1 Ring Membership

In Amazon's environment, node outages (due to failures and
maintenance tasks) are often transient but may last for extended
intervals. A node outage rarely signifies a permanent departure
and therefore should not result in rebalancing of the partition
assignment or repair of the unreachable replicas. Similarly,
manual error could result in the unintentional startup of new
Dynamo nodes. For these reasons, it was deemed appropriate to
use an explicit mechanism to initiate the addition and removal of
nodes from a Dynamo ring. An administrator uses a command
line tool or a browser to connect to a Dynamo node and issue a
membership change to join a node to a ring or remove a node
from a ring. The node that serves the request writes the
membership change and its time of issue to persistent store. The
membership changes form a history because nodes can be
removed and added back multiple times. A gossip-based protocol
propagates membership changes and maintains an eventually
consistent view of membership. Each node contacts a peer chosen
at random every second and the two nodes efficiently reconcile
their persisted membership change histories.

It is imperative that a highly available storage system be capable
of handling the failure of an entire data center. Data center
failures happen due to power outages, cooling failures, network
failures, and natural disasters. Dynamo is configured such that
each object is replicated across multiple data centers. In essence,
the preference list of a key is constructed such that the storage
nodes are spread across multiple data centers. These datacenters
are connected through high speed network links. This scheme of
replicating across multiple datacenters allows us to handle entire
data center failures without a data outage.

4.7 Handling permanent failures: Replica synchronization

Hinted handoff works best if the system membership churn is low
and node failures are transient. There are scenarios under which
hinted replicas become unavailable before they can be returned to
the original replica node. To handle this and other threats to
durability, Dynamo implements an anti-entropy (replica
synchronization) protocol to keep the replicas synchronized.

When a node starts for the first time, it chooses its set of tokens
(virtual nodes in the consistent hash space) and maps nodes to
their respective token sets. The mapping is persisted on disk and
initially contains only the local node and token set. The mappings
stored at different Dynamo nodes are reconciled during the same
communication exchange that reconciles the membership change
histories. Therefore, partitioning and placement information also
propagates via the gossip-based protocol and each storage node is
aware of the token ranges handled by its peers. This allows each
node to forward a key's read/write operations to the right set of
nodes directly.

4.8.2 External Discovery

The mechanism described above could temporarily result in a
logically partitioned Dynamo ring.
For example, the
administrator could contact node A to join A to the ring, then
contact node B to join B to the ring. In this scenario, nodes A and
B would each consider itself a member of the ring, yet neither
would be immediately aware of the other. To prevent logical
partitions, some Dynamo nodes play the role of seeds. Seeds are
nodes that are discovered via an external mechanism and are
known to all nodes. Because all nodes eventually reconcile their
membership with a seed, logical partitions are highly unlikely.
Seeds can be obtained either from static configuration or from a
configuration service. Typically seeds are fully functional nodes
in the Dynamo ring.


Dynamo's local persistence component allows for different
storage engines to be plugged in. Engines that are in use are
Berkeley Database (BDB) Transactional Data Store2, BDB Java
Edition, MySQL, and an in-memory buffer with persistent
backing store. The main reason for designing a pluggable
persistence component is to choose the storage engine best suited
for an application's access patterns. For instance, BDB can handle
objects typically in the order of tens of kilobytes whereas MySQL
can handle objects of larger sizes. Applications choose Dynamo's
local persistence engine based on their object size distribution.
The majority of Dynamo's production instances use BDB
Transactional Data Store.

5. IMPLEMENTATION
In Dynamo, each storage node has three main software
components: request coordination, membership and failure
detection, and a local persistence engine. All these components
are implemented in Java.

4.8.3 Failure Detection

Failure detection in Dynamo is used to avoid attempts to
communicate with unreachable peers during get() and put()
operations and when transferring partitions and hinted replicas.
For the purpose of avoiding failed attempts at communication, a
purely local notion of failure detection is entirely sufficient: node
A may consider node B failed if node B does not respond to node
A's messages (even if B is responsive to node C's messages). In
the presence of a steady rate of client requests generating inter-node
communication in the Dynamo ring, a node A quickly
discovers that a node B is unresponsive when B fails to respond to
a message; node A then uses alternate nodes to service requests
that map to B's partitions; A periodically retries B to check for the
latter's recovery. In the absence of client requests to drive traffic
between two nodes, neither node really needs to know whether the
other is reachable and responsive.
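
The following sketch captures this purely local notion of failure detection: a node privately marks a peer as unresponsive after a failed message, routes around it, and probes it again after a while. The retry interval is arbitrary and the bookkeeping is illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Purely local failure detector: opinions about peers are private to this node. Timings are arbitrary. */
final class LocalFailureDetector {
    private static final long RETRY_INTERVAL_MS = 5_000;

    /** Peer id -> time (millis) at which it was last declared unresponsive; absent means considered alive. */
    private final Map<String, Long> suspectedSince = new ConcurrentHashMap<>();

    /** Called when a request to a peer times out or fails. */
    void recordFailure(String peer) {
        suspectedSince.putIfAbsent(peer, System.currentTimeMillis());
    }

    /** Called when any message from the peer succeeds. */
    void recordSuccess(String peer) {
        suspectedSince.remove(peer);
    }

    /**
     * A suspected peer is skipped when choosing replicas, but is retried
     * periodically so that its recovery is eventually noticed.
     */
    boolean shouldAttempt(String peer) {
        Long since = suspectedSince.get(peer);
        if (since == null) return true;                                   // not suspected
        return System.currentTimeMillis() - since >= RETRY_INTERVAL_MS;   // time for a probe
    }
}
```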

The request coordination component is built on top of an event-driven messaging substrate where the message processing pipeline
is split into multiple stages similar to the SEDA architecture [24].
All communications are implemented using Java NIO channels.
The coordinator executes the read and write requests on behalf of
clients by collecting data from one or more nodes (in the case of
reads) or storing data at one or more nodes (for writes). Each
client request results in the creation of a state machine on the node
that received the client request. The state machine contains all the
logic for identifying the nodes responsible for a key, sending the
requests, waiting for responses, potentially doing retries,
processing the replies and packaging the response to the client.
Each state machine instance handles exactly one client request.
For instance, a read operation implements the following state
machine: (i) send read requests to the nodes, (ii) wait for
minimum number of required responses, (iii) if too few replies
were received within a given time bound, fail the request, (iv)
otherwise gather all the data versions and determine the ones to be
returned and (v) if versioning is enabled, perform syntactic
reconciliation and generate an opaque write context that contains
the vector clock that subsumes all the remaining versions. For the
sake of brevity the failure handling and retry states are left out.
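
The following is a simplified, single-threaded walk through the read state machine stages (i)-(v) listed above; Dynamo's actual implementation is event-driven (SEDA-style), and the reply type and reconciliation step here are placeholders.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified walk through the read state machine stages (i)-(v); types are placeholders. */
final class ReadStateMachine {
    enum State { SEND_REQUESTS, WAIT_FOR_RESPONSES, FAIL_REQUEST, GATHER_VERSIONS, RECONCILE, RESPOND }

    List<String> execute(List<List<String>> repliesReceivedInTime, int requiredResponses) {
        State state = State.SEND_REQUESTS;
        List<String> versions = new ArrayList<>();
        while (true) {
            switch (state) {
                case SEND_REQUESTS:                 // (i) send read requests to the preference-list nodes
                    state = State.WAIT_FOR_RESPONSES;
                    break;
                case WAIT_FOR_RESPONSES:            // (ii)/(iii) enough replies within the time bound?
                    state = repliesReceivedInTime.size() >= requiredResponses
                            ? State.GATHER_VERSIONS : State.FAIL_REQUEST;
                    break;
                case FAIL_REQUEST:
                    throw new IllegalStateException("too few replicas responded in time");
                case GATHER_VERSIONS:               // (iv) gather all returned data versions
                    repliesReceivedInTime.forEach(versions::addAll);
                    state = State.RECONCILE;
                    break;
                case RECONCILE:                     // (v) syntactic reconciliation would drop dominated
                    versions = List.copyOf(versions);   // versions and build an opaque write context here
                    state = State.RESPOND;
                    break;
                case RESPOND:
                    return versions;
            }
        }
    }
}
```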

Decentralized failure detection protocols use a simple gossip-style
protocol that enables each node in the system to learn about the
arrival (or departure) of other nodes. For detailed information on
decentralized failure detectors and the parameters affecting their
accuracy, the interested reader is referred to [8]. Early designs of
Dynamo used a decentralized failure detector to maintain a
globally consistent view of failure state. Later it was determined
that the explicit node join and leave methods obviate the need for
a global view of failure state. This is because nodes are notified of
permanent node additions and removals by the explicit node join
and leave methods and temporary node failures are detected by
the individual nodes when they fail to communicate with others
(while forwarding requests).

4.9 Adding/Removing Storage Nodes

When a new node (say X) is added into the system, it gets
assigned a number of tokens that are randomly scattered on the
ring. For every key range that is assigned to node X, there may be
a number of nodes (less than or equal to N) that are currently in
charge of handling keys that fall within its token range. Due to the
allocation of key ranges to X, some existing nodes no longer have
to store some of their keys, and these nodes transfer those keys to
X. Let us consider a simple bootstrapping scenario where node X
is added to the ring shown in Figure 2 between A and B. When X
is added to the system, it is in charge of storing keys in the ranges
(F, G], (G, A] and (A, X]. As a consequence, nodes B, C and D no
longer have to store the keys in these respective ranges.
Therefore, nodes B, C, and D will offer to, and upon confirmation
from X, transfer the appropriate set of keys. When a node is
removed from the system, the reallocation of keys happens in a
reverse process.

Operational experience has shown that this approach distributes
the load of key distribution uniformly across the storage nodes,
which is important to meet the latency requirements and to ensure
fast bootstrapping. Finally, by adding a confirmation round
between the source and the destination, it is made sure that the
destination node does not receive any duplicate transfers for a
given key range.

2 http://www.oracle.com/database/berkeley-db.html

Figure 4: Average and 99.9 percentiles of latencies for read and
write requests during our peak request season of December 2006.
The intervals between consecutive ticks in the x-axis correspond
to 12 hours. Latencies follow a diurnal pattern similar to the
request rate, and 99.9 percentile latencies are an order of
magnitude higher than averages.

Figure 5: Comparison of performance of 99.9th percentile
latencies for buffered vs. non-buffered writes over a period of
24 hours. The intervals between consecutive ticks in the x-axis
correspond to one hour.

After the read response has been returned to the caller, the state
machine waits for a small period of time to receive any
outstanding responses. If stale versions were returned in any of
the responses, the coordinator updates those nodes with the latest
version. This process is called read repair because it repairs
replicas that have missed a recent update at an opportunistic time
and relieves the anti-entropy protocol from having to do it.

Timestamp based reconciliation: This case differs from the
previous one only in the reconciliation mechanism. In case of
divergent versions, Dynamo performs simple timestamp
based reconciliation logic of "last write wins"; i.e., the object
with the largest physical timestamp value is chosen as the
correct version. The service that maintains customers'
session information is a good example of a service that uses
this mode.
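
A minimal sketch of this "last write wins" mode, assuming each version carries a physical timestamp; the types are illustrative.

```java
import java.util.Comparator;
import java.util.List;

/** Minimal "last write wins" reconciliation: pick the version with the largest physical timestamp. */
final class TimestampReconciler {
    record TimestampedVersion(byte[] value, long physicalTimestampMillis) {}

    static TimestampedVersion reconcile(List<TimestampedVersion> divergentVersions) {
        return divergentVersions.stream()
                .max(Comparator.comparingLong(TimestampedVersion::physicalTimestampMillis))
                .orElseThrow();    // caller only invokes this when divergent versions exist
    }
}
```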

High performance read engine: While Dynamo is built to be
an always writeable data store, a few services are tuning its
quorum characteristics and using it as a high performance
read engine. Typically, these services have a high read
request rate and only a small number of updates. In this
configuration, typically R is set to be 1 and W to be N. For
these services, Dynamo provides the ability to partition and
replicate their data across multiple nodes thereby offering
incremental scalability. Some of these instances function as
the authoritative persistence cache for data stored in more
heavy weight backing stores. Services that maintain product
catalog and promotional items fit in this category.

As noted earlier, write requests are coordinated by one of the top
N nodes in the preference list. Although it is desirable always to
have the first node among the top N to coordinate the writes
thereby serializing all writes at a single location, this approach has
led to uneven load distribution resulting in SLA violations. This is
because the request load is not uniformly distributed across
objects. To counter this, any of the top N nodes in the preference
list is allowed to coordinate the writes. In particular, since each
write usually follows a read operation, the coordinator for a write
is chosen to be the node that replied fastest to the previous read
operation which is stored in the context information of the
request. This optimization enables us to pick the node that has the
data that was read by the preceding read operation thereby
increasing the chances of getting read-your-writes consistency.
It also reduces variability in the performance of the request
handling which improves the performance at the 99.9 percentile.

The main advantage of Dynamo is that its client applications can
tune the values of N, R and W to achieve their desired levels of
performance, availability and durability. For instance, the value of
N determines the durability of each object. A typical value of N
used by Dynamo's users is 3.

6. EXPERIENCES & LESSONS LEARNED


Dynamo is used by several services with different configurations.
These instances differ by their version reconciliation logic, and
read/write quorum characteristics. The following are the main
patterns in which Dynamo is used:

The values of W and R impact object availability, durability and
consistency. For instance, if W is set to 1, then the system will
never reject a write request as long as there is at least one node in
the system that can successfully process a write request. However,
low values of W and R can increase the risk of inconsistency as
write requests are deemed successful and returned to the clients
even if they are not processed by a majority of the replicas. This
also introduces a vulnerability window for durability when a write
request is successfully returned to the client even though it has
been persisted at only a small number of nodes.

Business logic specific reconciliation: This is a popular use
case for Dynamo. Each data object is replicated across
multiple nodes. In case of divergent versions, the client
application performs its own reconciliation logic. The
shopping cart service discussed earlier is a prime example of
this category. Its business logic reconciles objects by
merging different versions of a customer's shopping cart.


significant difference in request rate between the daytime and
night). Moreover, the write latencies are higher than read latencies
because write operations always result in disk access.
Also, the 99.9th percentile latencies are around 200 ms and are an
order of magnitude higher than the averages. This is because the
99.9th percentile latencies are affected by several factors such as
variability in request load, object sizes, and locality patterns.
While this level of performance is acceptable for a number of
services, a few customer-facing services required higher levels of
performance. For these services, Dynamo provides the ability to
trade-off durability guarantees for performance. In this
optimization, each storage node maintains an object buffer in its
main memory. Each write operation is stored in the buffer and
gets periodically written to storage by a writer thread. In this
scheme, read operations first check if the requested key is present
in the buffer. If so, the object is read from the buffer instead of the
storage engine.

Figure 6: Fraction of nodes that are out-of-balance (i.e., nodes
whose request load is above a certain threshold from the
average system load) and their corresponding request load.
The interval between ticks in x-axis corresponds to a time
period of 30 minutes.

This optimization has resulted in lowering the 99.9th percentile
latency by a factor of 5 during peak traffic even for a very small
buffer of a thousand objects (see Figure 5). Also, as seen in the
figure, write buffering smoothes out higher percentile latencies.
Obviously, this scheme trades durability for performance. In this
scheme, a server crash can result in missing writes that were
queued up in the buffer. To reduce the durability risk, the write
operation is refined to have the coordinator choose one out of the
N replicas to perform a durable write. Since the coordinator
waits only for W responses, the performance of the write
operation is not affected by the performance of the durable write
operation performed by a single replica.
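
The following sketch illustrates this buffered-write optimization: writes land in an in-memory object buffer drained by a writer thread, reads consult the buffer first, and the coordinator can still ask one chosen replica to perform a durable write. The storage-engine interface and drain interval are placeholders.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Sketch of the buffered-write optimization; the storage engine and drain interval are placeholders. */
final class BufferedStorageNode {
    interface StorageEngine { void write(String key, byte[] value); byte[] read(String key); }

    private final StorageEngine engine;
    private final Map<String, byte[]> objectBuffer = new ConcurrentHashMap<>();
    private final ScheduledExecutorService writer = Executors.newSingleThreadScheduledExecutor();

    BufferedStorageNode(StorageEngine engine, long drainPeriodMillis) {
        this.engine = engine;
        // A writer thread periodically drains the buffer to the storage engine.
        writer.scheduleAtFixedRate(this::drain, drainPeriodMillis, drainPeriodMillis, TimeUnit.MILLISECONDS);
    }

    /** Buffered put: fast, but a crash before the next drain loses the write. */
    void put(String key, byte[] value, boolean durable) {
        if (durable) {
            engine.write(key, value);   // the coordinator picks one of the N replicas to write durably
        } else {
            objectBuffer.put(key, value);
        }
    }

    /** Reads check the buffer first so they observe buffered (not yet persisted) writes. */
    byte[] get(String key) {
        byte[] buffered = objectBuffer.get(key);
        return buffered != null ? buffered : engine.read(key);
    }

    private void drain() {
        objectBuffer.forEach((key, value) -> {
            engine.write(key, value);
            objectBuffer.remove(key, value);   // remove only if not overwritten in the meantime
        });
    }
}
```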

Traditional wisdom holds that durability and availability go hand-in-hand. However, this is not necessarily true here. For instance,
the vulnerability window for durability can be decreased by
increasing W. This may increase the probability of rejecting
requests (thereby decreasing availability) because more storage
hosts need to be alive to process a write request.
The common (N,R,W) configuration used by several instances of
Dynamo is (3,2,2). These values are chosen to meet the necessary
levels of performance, durability, consistency, and availability
SLAs.

All the measurements presented in this section were taken on a
live system operating with a configuration of (3,2,2) and running
a couple hundred nodes with homogeneous hardware
configurations. As mentioned earlier, each instance of Dynamo
contains nodes that are located in multiple datacenters. These
datacenters are typically connected through high speed network
links. Recall that to generate a successful get (or put) response R
(or W) nodes need to respond to the coordinator. Clearly, the
network latencies between datacenters affect the response time
and the nodes (and their datacenter locations) are chosen such that
the application's target SLAs are met.

6.2 Ensuring Uniform Load distribution

Dynamo uses consistent hashing to partition its key space across
its replicas and to ensure uniform load distribution. A uniform key
distribution can help us achieve uniform load distribution
assuming the access distribution of keys is not highly skewed. In
particular, Dynamo's design assumes that even where there is a
significant skew in the access distribution there are enough keys
in the popular end of the distribution so that the load of handling
popular keys can be spread across the nodes uniformly through
partitioning. This section discusses the load imbalance seen in
Dynamo and the impact of different partitioning strategies on load
distribution.

To study the load imbalance and its correlation with request load,
the total number of requests received by each node was measured
for a period of 24 hours - broken down into intervals of 30
minutes. In a given time window, a node is considered to be in-balance
if the node's request load deviates from the average load
by a value less than a certain threshold (here 15%). Otherwise
the node was deemed out-of-balance. Figure 6 presents the
fraction of nodes that are out-of-balance (henceforth,
imbalance ratio) during this time period. For reference, the
corresponding request load received by the entire system during
this time period is also plotted. As seen in the figure, the
imbalance ratio decreases with increasing load. For instance,
during low loads the imbalance ratio is as high as 20% and during
high loads it is close to 10%. Intuitively, this can be explained by
the fact that under high loads, a large number of popular keys are
accessed and due to uniform distribution of keys the load is
evenly distributed. However, during low loads (where load is 1/8th
of the measured peak load), fewer popular keys are accessed,
resulting in a higher load imbalance.

6.1 Balancing Performance and Durability

While Dynamo's principal design goal is to build a highly
available data store, performance is an equally important criterion
in Amazon's platform. As noted earlier, to provide a consistent
customer experience, Amazon's services set their performance
targets at higher percentiles (such as the 99.9th or 99.99th
percentiles). A typical SLA required of services that use Dynamo
is that 99.9% of the read and write requests execute within 300ms.
Since Dynamo is run on standard commodity hardware
components that have far less I/O throughput than high-end
enterprise servers, providing consistently high performance for
read and write operations is a non-trivial task. The involvement of
multiple storage nodes in read and write operations makes it even
more challenging, since the performance of these operations is
limited by the slowest of the R or W replicas. Figure 4 shows the
average and 99.9th percentile latencies of Dynamo's read and
write operations during a period of 30 days. As seen in the figure,
the latencies exhibit a clear diurnal pattern which is a result of the
diurnal pattern in the incoming request rate (i.e., there is a


Figure 7: Partitioning and placement of keys in the three strategies. A, B, and C depict the three unique nodes that form the
preference list for the key k1 on the consistent hashing ring (N=3). The shaded area indicates the key range for which nodes A,
B, and C form the preference list. Dark arrows indicate the token locations for various nodes.

The fundamental issue with this strategy is that the schemes for
data partitioning and data placement are intertwined. For instance,
in some cases, it is preferred to add more nodes to the system in
order to handle an increase in request load. However, in this
scenario, it is not possible to add nodes without affecting data
partitioning. Ideally, it is desirable to use independent schemes for
partitioning and placement. To this end, the following strategies were
evaluated:

This section discusses how Dynamo's partitioning scheme has
evolved over time and its implications on load distribution.
Strategy 1: T random tokens per node and partition by token
value: This was the initial strategy deployed in production (and
described in Section 4.2). In this scheme, each node is assigned T
tokens (chosen uniformly at random from the hash space). The
tokens of all nodes are ordered according to their values in the
hash space. Every two consecutive tokens define a range. The last
token and the first token form a range that "wraps" around from
the highest value to the lowest value in the hash space. Because
the tokens are chosen randomly, the ranges vary in size. As nodes
join and leave the system, the token set changes and consequently
the ranges change. Note that the space needed to maintain the
membership at each node increases linearly with the number of
nodes in the system.

Strategy 2: T random tokens per node and equal sized partitions:
In this strategy, the hash space is divided into Q equally sized
partitions/ranges and each node is assigned T random tokens. Q is
usually set such that Q >> N and Q >> S*T, where S is the
number of nodes in the system. In this strategy, the tokens are
only used to build the function that maps values in the hash space
to the ordered lists of nodes and not to decide the partitioning. A
partition is placed on the first N unique nodes that are encountered
while walking the consistent hashing ring clockwise from the end
of the partition. Figure 7 illustrates this strategy for N=3. In this
example, nodes A, B, C are encountered while walking the ring
from the end of the partition that contains key k1. The primary
advantages of this strategy are: (i) decoupling of partitioning and
partition placement, and (ii) enabling the possibility of changing
the placement scheme at runtime.

While using this strategy, the following problems were
encountered. First, when a new node joins the system, it needs to
steal its key ranges from other nodes. However, the nodes
handing the key ranges off to the new node have to scan their
local persistence store to retrieve the appropriate set of data items.
Note that performing such a scan operation on a production node
is tricky as scans are highly resource intensive operations and they
need to be executed in the background without affecting the
customer performance. This requires us to run the bootstrapping
task at the lowest priority. However, this significantly slows the
bootstrapping process and, during the busy shopping season, when the
nodes are handling millions of requests a day, the bootstrapping
has taken almost a day to complete. Second, when a node
joins/leaves the system, the key ranges handled by many nodes
change and the Merkle trees for the new ranges need to be
recalculated, which is a non-trivial operation to perform on a
production system. Finally, there was no easy way to take a
snapshot of the entire key space due to the randomness in key
ranges, and this made the process of archival complicated. In this
scheme, archiving the entire key space requires us to retrieve the
keys from each node separately, which is highly inefficient.

Strategy 3: Q/S tokens per node, equal-sized partitions: Similar to
strategy 2, this strategy divides the hash space into Q equally
sized partitions and the placement of partition is decoupled from
the partitioning scheme. Moreover, each node is assigned Q/S
tokens where S is the number of nodes in the system. When a
node leaves the system, its tokens are randomly distributed to the
remaining nodes such that these properties are preserved.
Similarly, when a node joins the system it "steals" tokens from
nodes in the system in a way that preserves these properties.
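
The following sketch illustrates the shape of strategy 3: the hash space is divided into Q equal-sized partitions, nodes own partitions (tokens), and a key's preference list is obtained by walking the partition ring until N unique nodes are found. The hash function and the round-robin assignment are illustrative simplifications; the real scheme preserves the Q/S-tokens-per-node property as nodes join and leave.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch of strategy 3: Q equal-sized partitions; the hash and placement map are illustrative. */
final class EqualPartitionRing {
    private final int q;                      // number of partitions (Q >> N, Q >> S*T)
    private final String[] partitionOwner;    // partition index -> owning node

    EqualPartitionRing(int q, List<String> nodes) {
        this.q = q;
        this.partitionOwner = new String[q];
        // Round-robin assignment for illustration only; a real system assigns Q/S tokens
        // per node and preserves that property as nodes join and leave.
        for (int partition = 0; partition < q; partition++) {
            partitionOwner[partition] = nodes.get(partition % nodes.size());
        }
    }

    /** Fixed-size partitions make the key -> partition mapping a constant-time computation. */
    int partitionOf(String key) {
        return Math.floorMod(key.hashCode(), q);
    }

    /** Preference list: first N unique nodes encountered walking the ring clockwise from the partition. */
    List<String> preferenceList(String key, int n) {
        Set<String> unique = new LinkedHashSet<>();
        int partition = partitionOf(key);
        for (int step = 0; step < q && unique.size() < n; step++) {
            unique.add(partitionOwner[(partition + step) % q]);
        }
        return new ArrayList<>(unique);
    }
}
```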
The efficiency of these three strategies is evaluated for a system
with S=30 and N=3. However, comparing these different
strategies in a fair manner is hard as different strategies have
different configurations to tune their efficiency. For instance, the
load distribution property of strategy 1 depends on the number of
tokens (i.e., T) while strategy 3 depends on the number of
partitions (i.e., Q). One fair way to compare these strategies is to
evaluate the skew in their load distribution while all strategies use
the same amount of space to maintain their membership
information. For instance, in strategy 1 each node needs to
maintain the token positions of all the nodes in the ring, and in
strategy 3 each node needs to maintain the information regarding
the partitions assigned to each node.


6.3 Divergent Versions: When and How Many?


As noted earlier, Dynamo is designed to trade off consistency for
availability. To understand the precise impact of different failures
on consistency, detailed data is required on multiple factors:
outage length, type of failure, component reliability, workload etc.
Presenting these numbers in detail is outside of the scope of this
paper. However, this section discusses a good summary metric:
the number of divergent versions seen by the application in a live
production environment.

Divergent versions of a data item arise in two scenarios. The first
is when the system is facing failure scenarios such as node
failures, data center failures, and network partitions. The second is
when the system is handling a large number of concurrent writers
to a single data item and multiple nodes end up coordinating the
updates concurrently. From both a usability and efficiency
perspective, it is preferred to keep the number of divergent
versions at any given time as low as possible. If the versions
cannot be syntactically reconciled based on vector clocks alone,
they have to be passed to the business logic for semantic
reconciliation. Semantic reconciliation introduces additional load
on services, so it is desirable to minimize the need for it.


Figure 8: Comparison of the load distribution efficiency of
different strategies for a system with 30 nodes and N=3 with
equal amount of metadata maintained at each node (x-axis:
size of metadata maintained at each node, in abstract units;
y-axis: efficiency, defined as mean load / max load). The
values of the system size and number of replicas are based on
the typical configuration deployed for the majority of our
services.

In our next experiment, the number of versions returned to the
shopping cart service was profiled for a period of 24 hours.
During this period, 99.94% of requests saw exactly one version;
0.00057% of requests saw 2 versions; 0.00047% of requests saw 3
versions and 0.00009% of requests saw 4 versions. This shows
that divergent versions are created rarely.

In our next experiment, these strategies were evaluated by varying
the relevant parameters (T and Q). The load balancing efficiency
of each strategy was measured for different sizes of membership
information that needs to be maintained at each node, where load
balancing efficiency is defined as the ratio of the average number of
requests served by each node to the maximum number of requests
served by the hottest node.
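
As a concrete reading of this metric, the sketch below computes the ratio of the mean per-node request count to the count at the hottest node; the input map is hypothetical.

```java
import java.util.Map;

/** Load-balancing efficiency as defined above: mean requests per node / requests at the hottest node. */
final class LoadBalanceEfficiency {
    static double efficiency(Map<String, Long> requestsPerNode) {
        double mean = requestsPerNode.values().stream().mapToLong(Long::longValue).average().orElse(0);
        long max = requestsPerNode.values().stream().mapToLong(Long::longValue).max().orElse(1);
        return mean / max;    // 1.0 means perfectly even load; lower values mean more skew
    }

    public static void main(String[] args) {
        System.out.println(efficiency(Map.of("A", 900L, "B", 1000L, "C", 800L)));   // 0.9
    }
}
```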

Experience shows that the increase in the number of divergent
versions is caused not by failures but by an increase in the
number of concurrent writers. The increase in the number of
concurrent writes is usually triggered by busy robots (automated
client programs) and rarely by humans. This issue is not discussed
in detail due to the sensitive nature of the story.

The results are given in Figure 8. As seen in the figure, strategy 3
achieves the best load balancing efficiency and strategy 2 has the
worst load balancing efficiency. For a brief time, Strategy 2
served as an interim setup during the process of migrating
Dynamo instances from using Strategy 1 to Strategy 3. Compared
to Strategy 1, Strategy 3 achieves better efficiency and reduces the
size of membership information maintained at each node by three
orders of magnitude. While storage is not a major issue, the nodes
gossip the membership information periodically and as such it is
desirable to keep this information as compact as possible. In
addition to this, strategy 3 is advantageous and simpler to deploy
for the following reasons: (i) Faster bootstrapping/recovery:
Since partition ranges are fixed, they can be stored in separate
files, meaning a partition can be relocated as a unit by simply
transferring the file (avoiding random accesses needed to locate
specific items). This simplifies the process of bootstrapping and
recovery. (ii) Ease of archival: Periodical archiving of the dataset
is a mandatory requirement for most of Amazon storage services.
Archiving the entire dataset stored by Dynamo is simpler in
strategy 3 because the partition files can be archived separately.
By contrast, in Strategy 1, the tokens are chosen randomly, and
archiving the data stored in Dynamo requires retrieving the keys
from individual nodes separately, which is usually inefficient and
slow. The disadvantage of strategy 3 is that changing the node
membership requires coordination in order to preserve the
properties required of the assignment.

6.4 Client-driven or Server-driven Coordination
As mentioned in Section 5, Dynamo has a request coordination
component that uses a state machine to handle incoming requests.
Client requests are uniformly assigned to nodes in the ring by a
load balancer. Any Dynamo node can act as a coordinator for a
read request. Write requests on the other hand will be coordinated
by a node in the key's current preference list. This restriction is
due to the fact that these preferred nodes have the added
responsibility of creating a new version stamp that causally
subsumes the version that has been updated by the write request.
Note that if Dynamo's versioning scheme is based on physical
timestamps, any node can coordinate a write request.
An alternative approach to request coordination is to move the
state machine to the client nodes. In this scheme client
applications use a library to perform request coordination locally.
A client periodically picks a random Dynamo node and
downloads its current view of Dynamo membership state. Using
this information the client can determine which set of nodes form
the preference list for any given key. Read requests can be
coordinated at the client node thereby avoiding the extra network
hop that is incurred if the request were assigned to a random
Dynamo node by the load balancer. Writes will either be
forwarded to a node in the keys preference list or can be



Table 2: Performance of client-driven and server-driven
coordination approaches.

                   99.9th percentile    99.9th percentile    Average read    Average write
                   read latency (ms)    write latency (ms)   latency (ms)    latency (ms)
  Server-driven    68.9                 68.5                 3.9             4.02
  Client-driven    30.4                 30.4                 1.55            1.9

The admission controller constantly monitors the behavior of
resource accesses while executing a "foreground" put/get
operation. Monitored aspects include latencies for disk operations,
failed database accesses due to lock-contention and transaction
timeouts, and request queue wait times. This information is used
to check whether the percentiles of latencies (or failures) in a
given trailing time window are close to a desired threshold. For
example, the background controller checks to see how close the
99th percentile database read latency (over the last 60 seconds) is
to a preset threshold (say 50ms). The controller uses such
comparisons to assess the resource availability for the foreground
operations. Subsequently, it decides on how many time slices will
be available to background tasks, thereby using the feedback loop
to limit the intrusiveness of the background activities. Note that a
similar problem of managing background tasks has been studied
in [4].
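
The following sketch illustrates such a feedback loop: foreground latencies are kept in a trailing window, a percentile is compared against a threshold, and the number of slices granted to background tasks is adjusted accordingly. The thresholds, window handling and slice bounds are placeholders, not Dynamo's actual controller.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch of the admission-controller feedback loop; thresholds and slice accounting are placeholders. */
final class BackgroundAdmissionController {
    private final Deque<Long> trailingLatenciesMs = new ArrayDeque<>();
    private final int windowSize;
    private final long latencyThresholdMs;      // e.g. a 99th percentile disk read latency target
    private int backgroundSlices = 1;           // time slices currently granted to background tasks

    BackgroundAdmissionController(int windowSize, long latencyThresholdMs) {
        this.windowSize = windowSize;
        this.latencyThresholdMs = latencyThresholdMs;
    }

    /** Record the latency of a foreground (put/get) resource access. */
    synchronized void recordForegroundLatency(long latencyMs) {
        trailingLatenciesMs.addLast(latencyMs);
        if (trailingLatenciesMs.size() > windowSize) trailingLatenciesMs.removeFirst();
    }

    /** Feedback step: shrink the background allowance when foreground latency nears the threshold. */
    synchronized int availableBackgroundSlices() {
        if (percentile99() >= latencyThresholdMs) {
            backgroundSlices = Math.max(0, backgroundSlices - 1);   // back off: foreground is suffering
        } else {
            backgroundSlices = Math.min(10, backgroundSlices + 1);  // foreground healthy: allow more work
        }
        return backgroundSlices;
    }

    private long percentile99() {
        long[] sorted = trailingLatenciesMs.stream().mapToLong(Long::longValue).sorted().toArray();
        if (sorted.length == 0) return 0;
        return sorted[(int) Math.floor(0.99 * (sorted.length - 1))];
    }
}
```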

An important advantage of the client-driven coordination
approach is that a load balancer is no longer required to uniformly
distribute client load. Fair load distribution is implicitly
guaranteed by the near uniform assignment of keys to the storage
nodes. Obviously, the efficiency of this scheme is dependent on
how fresh the membership information is at the client. Currently
clients poll a random Dynamo node every 10 seconds for
membership updates. A pull based approach was chosen over a
push based one as the former scales better with large number of
clients and requires very little state to be maintained at servers
regarding clients. However, in the worst case the client can be
exposed to stale membership for a duration of 10 seconds. If the
client detects that its membership table is stale (for instance, when
some members are unreachable), it will immediately refresh its
membership information.
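
The following sketch shows the client-library side of this scheme: the client periodically pulls membership state from a random Dynamo node, computes preference lists locally, and refreshes immediately when it detects a stale table. The membership and transport interfaces are placeholders.

```java
import java.util.List;
import java.util.Random;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Sketch of a client library doing request coordination locally; membership/transport types are placeholders. */
final class ClientCoordinator {
    interface MembershipView { List<String> preferenceList(String key, int n); }
    interface DynamoNode { MembershipView currentMembership(); boolean reachable(); }

    private final List<DynamoNode> knownNodes;
    private final Random random = new Random();
    private volatile MembershipView view;

    ClientCoordinator(List<DynamoNode> knownNodes) {
        this.knownNodes = knownNodes;
        refreshMembership();
        // Pull-based refresh: poll a random node every 10 seconds for membership updates.
        ScheduledExecutorService poller = Executors.newSingleThreadScheduledExecutor();
        poller.scheduleAtFixedRate(this::refreshMembership, 10, 10, TimeUnit.SECONDS);
    }

    /** Reads can be coordinated at the client, avoiding the extra load-balancer hop. */
    List<String> replicasForRead(String key, int n) {
        return view.preferenceList(key, n);
    }

    /** Called when the caller detects the table is stale (e.g. a listed member is unreachable). */
    void onStaleMembershipDetected() {
        refreshMembership();
    }

    private void refreshMembership() {
        DynamoNode node = knownNodes.get(random.nextInt(knownNodes.size()));
        if (node.reachable()) {
            view = node.currentMembership();
        }
    }
}
```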

6.6 Discussion

This section summarizes some of the experiences gained during
the process of implementation and maintenance of Dynamo.
Many Amazon internal services have used Dynamo for the past
two years and it has provided significant levels of availability to
its applications. In particular, applications have received
successful responses (without timing out) for 99.9995% of their
requests, and no data loss event has occurred to date.
Moreover, the primary advantage of Dynamo is that it provides
the necessary knobs, in the form of the three parameters (N, R, W),
to tune an instance based on application needs. Unlike popular commercial
data stores, Dynamo exposes data consistency and reconciliation
logic issues to the developers. At the outset, one may expect the
application logic to become more complex. However, historically,
Amazon's platform is built for high availability and many
applications are designed to handle different failure modes and
inconsistencies that may arise. Hence, porting such applications to
use Dynamo was a relatively simple task. For new applications
that want to use Dynamo, some analysis is required during the
initial stages of the development to pick the right conflict
resolution mechanisms that meet the business case appropriately.
Finally, Dynamo adopts a full membership model where each
node is aware of the data hosted by its peers. To do this, each
node actively gossips the full routing table with other nodes in the
system. This model works well for a system that contains a couple
of hundred nodes. However, scaling such a design to run with
tens of thousands of nodes is not trivial because the overhead in
maintaining the routing table increases with the system size. This
limitation might be overcome by introducing hierarchical
extensions to Dynamo. Also, note that this problem is actively
addressed by O(1) DHT systems (e.g., [14]).

Table 2 shows the latency improvements at the 99.9th percentile
and averages that were observed for a period of 24 hours using
client-driven coordination compared to the server-driven
approach. As seen in the table, the client-driven coordination
approach reduces the latencies by at least 30 milliseconds for
99.9th percentile latencies and decreases the average by 3 to 4
milliseconds. The latency improvement is because the client-driven approach eliminates the overhead of the load balancer and
the extra network hop that may be incurred when a request is
assigned to a random node. As seen in the table, average latencies
tend to be significantly lower than latencies at the 99.9th
percentile. This is because Dynamo's storage engine caches and
write buffer have good hit ratios. Moreover, since the load
balancers and network introduce additional variability to the
response time, the gain in response time is higher for the 99.9th
percentile than the average.

6.5 Balancing background vs. foreground tasks
Each node performs different kinds of background tasks for
replica synchronization and data handoff (either due to hinting or
adding/removing nodes) in addition to its normal foreground
put/get operations. In early production settings, these background
tasks triggered the problem of resource contention and affected
the performance of the regular put and get operations. Hence, it
became necessary to ensure that background tasks ran only when
the regular critical operations were not affected significantly. To
this end, the background tasks were integrated with an admission
control mechanism. Each of the background tasks uses this
controller to reserve runtime slices of the resource (e.g., database),
shared across all background tasks. A feedback mechanism based
on the monitored performance of the foreground tasks is
employed to change the number of slices that are available to the
background tasks.

7. CONCLUSIONS
This paper described Dynamo, a highly available and scalable
data store, used for storing the state of a number of core services of
Amazon.com's e-commerce platform. Dynamo has provided the
desired levels of availability and performance and has been
successful in handling server failures, data center failures and
network partitions. Dynamo is incrementally scalable and allows
service owners to scale up and down based on their current
request load. Dynamo allows service owners to customize their
storage system to meet their desired performance, durability and
consistency SLAs by allowing them to tune the parameters N, R,
and W.

[9] Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton,
P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H.,
Wells, C., and Zhao, B. 2000. OceanStore: an architecture
for global-scale persistent storage. SIGARCH Comput.
Archit. News 28, 5 (Dec. 2000), 190-201.

The production use of Dynamo for the past year demonstrates that
decentralized techniques can be combined to provide a single
highly-available system. Its success in one of the most
challenging application environments shows that an eventually-consistent
storage system can be a building block for highly-available
applications.

[10] Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine,
M., and Lewin, D. 1997. Consistent hashing and random
trees: distributed caching protocols for relieving hot spots on
the World Wide Web. In Proceedings of the Twenty-Ninth
Annual ACM Symposium on theory of Computing (El Paso,
Texas, United States, May 04 - 06, 1997). STOC '97. ACM
Press, New York, NY, 654-663.

ACKNOWLEDGEMENTS
The authors would like to thank Pat Helland for his contribution
to the initial design of Dynamo. We would also like to thank
Marvin Theimer and Robert van Renesse for their comments.
Finally, we would like to thank our shepherd, Jeff Mogul, for his
detailed comments and inputs while preparing the camera ready
version that vastly improved the quality of the paper.

[11] Lindsay, B.G., et al., Notes on Distributed Databases,
Research Report RJ2571(33471), IBM Research, July 1979.
[12] Lamport, L. Time, clocks, and the ordering of events in a
distributed system. Communications of the ACM, 21(7), pp. 558-565, 1978.

REFERENCES
[1] Adya, A., Bolosky, W. J., Castro, M., Cermak, G., Chaiken,
R., Douceur, J. R., Howell, J., Lorch, J. R., Theimer, M., and
Wattenhofer, R. P. 2002. Farsite: federated, available, and
reliable storage for an incompletely trusted environment.
SIGOPS Oper. Syst. Rev. 36, SI (Dec. 2002), 1-14.
[13] Merkle, R. A digital signature based on a conventional
encryption function. Proceedings of CRYPTO, pages 369-378.
Springer-Verlag, 1988.
[14] Ramasubramanian, V., and Sirer, E. G. Beehive: O(1) lookup
performance for power-law query distributions in peer-to-peer
overlays. In Proceedings of the 1st Conference on
Symposium on Networked Systems Design and
Implementation, San Francisco, CA, March 29 - 31, 2004.

[2] Bernstein, P.A., and Goodman, N. An algorithm for
concurrency control and recovery in replicated distributed
databases. ACM Trans. on Database Systems, 9(4):596-615,
December 1984

[3] Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach,
D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R.
E. 2006. Bigtable: a distributed storage system for structured
data. In Proceedings of the 7th Conference on USENIX
Symposium on Operating Systems Design and
Implementation - Volume 7 (Seattle, WA, November 06 - 08,
2006). USENIX Association, Berkeley, CA, 15-15.

[15] Reiher, P., Heidemann, J., Ratner, D., Skinner, G., and
Popek, G. 1994. Resolving file conflicts in the Ficus file
system. In Proceedings of the USENIX Summer 1994
Technical Conference on USENIX Summer 1994 Technical
Conference - Volume 1 (Boston, Massachusetts, June 06 - 10,
1994). USENIX Association, Berkeley, CA, 12-12.
[16] Rowstron, A., and Druschel, P. Pastry: Scalable,
decentralized object location and routing for large-scale peer-to-peer systems. Proceedings of Middleware, pages 329-350,
November, 2001.

[4] Douceur, J. R. and Bolosky, W. J. 2000. Process-based


regulation of low-importance processes. SIGOPS Oper. Syst.
Rev. 34, 2 (Apr. 2000), 26-27.
[5] Fox, A., Gribble, S. D., Chawathe, Y., Brewer, E. A., and
Gauthier, P. 1997. Cluster-based scalable network services.
In Proceedings of the Sixteenth ACM Symposium on
Operating Systems Principles (Saint Malo, France, October
05 - 08, 1997). W. M. Waite, Ed. SOSP '97. ACM Press,
New York, NY, 78-91.

[17] Rowstron, A., and Druschel, P. Storage management and


caching in PAST, a large-scale, persistent peer-to-peer
storage utility. Proceedings of Symposium on Operating
Systems Principles, October 2001.
[18] Saito, Y., Frølund, S., Veitch, A., Merchant, A., and Spence,
S. 2004. FAB: building distributed enterprise disk arrays
from commodity components. SIGOPS Oper. Syst. Rev. 38, 5
(Dec. 2004), 48-58.

[6] Ghemawat, S., Gobioff, H., and Leung, S. 2003. The Google
file system. In Proceedings of the Nineteenth ACM
Symposium on Operating Systems Principles (Bolton
Landing, NY, USA, October 19 - 22, 2003). SOSP '03. ACM
Press, New York, NY, 29-43.

[19] Satyanarayanan, M., Kistler, J.J., Siegel, E.H. Coda: A


Resilient Distributed File System. IEEE Workshop on
Workstation Operating Systems, Nov. 1987.

[7] Gray, J., Helland, P., O'Neil, P., and Shasha, D. 1996. The
dangers of replication and a solution. In Proceedings of the
1996 ACM SIGMOD international Conference on
Management of Data (Montreal, Quebec, Canada, June 04 06, 1996). J. Widom, Ed. SIGMOD '96. ACM Press, New
York, NY, 173-182.

[20] Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., and
Balakrishnan, H. 2001. Chord: A scalable peer-to-peer
lookup service for internet applications. In Proceedings of
the 2001 Conference on Applications, Technologies,
Architectures, and Protocols For Computer Communications
(San Diego, California, United States). SIGCOMM '01.
ACM Press, New York, NY, 149-160.

[8] Gupta, I., Chandra, T. D., and Goldszmidt, G. S. 2001. On


scalable and efficient distributed failure detectors. In
Proceedings of the Twentieth Annual ACM Symposium on


[21] Terry, D. B., Theimer, M. M., Petersen, K., Demers, A. J.,


Spreitzer, M. J., and Hauser, C. H. 1995. Managing update
conflicts in Bayou, a weakly connected replicated storage
system. In Proceedings of the Fifteenth ACM Symposium on
Operating Systems Principles (Copper Mountain, Colorado,
United States, December 03 - 06, 1995). M. B. Jones, Ed.
SOSP '95. ACM Press, New York, NY, 172-182.

[23] Weatherspoon, H., Eaton, P., Chun, B., and Kubiatowicz, J.


2007. Antiquity: exploiting a secure log for wide-area
distributed storage. SIGOPS Oper. Syst. Rev. 41, 3 (Jun.
2007), 371-384.
[24] Welsh, M., Culler, D., and Brewer, E. 2001. SEDA: an
architecture for well-conditioned, scalable internet services.
In Proceedings of the Eighteenth ACM Symposium on
Operating Systems Principles (Banff, Alberta, Canada,
October 21 - 24, 2001). SOSP '01. ACM Press, New York,
NY, 230-243.

[22] Thomas, R. H. A majority consensus approach to


concurrency control for multiple copy databases. ACM
Transactions on Database Systems 4 (2): 180-209, 1979.


Dynamo: Amazon's Highly Available Key-value Store


Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati,
Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall
and Werner Vogels
Amazon.com

One of the lessons our organization has learned from operating
Amazon's platform is that the reliability and scalability of a
system is dependent on how its application state is managed.
Amazon uses a highly decentralized, loosely coupled, service
oriented architecture consisting of hundreds of services. In this
environment there is a particular need for storage technologies
that are always available. For example, customers should be able
to view and add items to their shopping cart even if disks are
failing, network routes are flapping, or data centers are being
destroyed by tornados. Therefore, the service responsible for
managing shopping carts requires that it can always write to and
read from its data store, and that its data needs to be available
across multiple data centers.

ABSTRACT
Reliability at massive scale is one of the biggest challenges we
face at Amazon.com, one of the largest e-commerce operations in
the world; even the slightest outage has significant financial
consequences and impacts customer trust. The Amazon.com
platform, which provides services for many web sites worldwide,
is implemented on top of an infrastructure of tens of thousands of
servers and network components located in many datacenters
around the world. At this scale, small and large components fail
continuously and the way persistent state is managed in the face
of these failures drives the reliability and scalability of the
software systems.
This paper presents the design and implementation of Dynamo, a
highly available key-value storage system that some of Amazon's
core services use to provide an "always-on" experience. To
achieve this level of availability, Dynamo sacrifices consistency
under certain failure scenarios. It makes extensive use of object
versioning and application-assisted conflict resolution in a manner
that provides a novel interface for developers to use.

Dealing with failures in an infrastructure comprised of millions of
components is our standard mode of operation; there are always a
small but significant number of server and network components
that are failing at any given time. As such Amazon's software
systems need to be constructed in a manner that treats failure
handling as the normal case without impacting availability or
performance.

Categories and Subject Descriptors

To meet the reliability and scaling needs, Amazon has developed
a number of storage technologies, of which the Amazon Simple
Storage Service (also available outside of Amazon and known as
Amazon S3), is probably the best known. This paper presents the
design and implementation of Dynamo, another highly available
and scalable distributed data store built for Amazon's platform.
Dynamo is used to manage the state of services that have very
high reliability requirements and need tight control over the
tradeoffs between availability, consistency, cost-effectiveness and
performance. Amazon's platform has a very diverse set of
applications with different storage requirements. A select set of
applications requires a storage technology that is flexible enough
to let application designers configure their data store appropriately
based on these tradeoffs to achieve high availability and
guaranteed performance in the most cost effective manner.

D.4.2 [Operating Systems]: Storage Management; D.4.5
[Operating Systems]: Reliability; D.4.2 [Operating Systems]:
Performance;

General Terms
Algorithms, Management, Measurement, Performance, Design,
Reliability.

1. INTRODUCTION
Amazon runs a world-wide e-commerce platform that serves tens
of millions of customers at peak times using tens of thousands of
servers located in many data centers around the world. There are
strict operational requirements on Amazon's platform in terms of
performance, reliability and efficiency, and to support continuous
growth the platform needs to be highly scalable. Reliability is one
of the most important requirements because even the slightest
outage has significant financial consequences and impacts
customer trust. In addition, to support continuous growth, the
platform needs to be highly scalable.

There are many services on Amazon's platform that only need
primary-key access to a data store. For many services, such as
those that provide best seller lists, shopping carts, customer
preferences, session management, sales rank, and product catalog,
the common pattern of using a relational database would lead to
inefficiencies and limit scale and availability. Dynamo provides a
simple primary-key only interface to meet the requirements of
these applications.

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
SOSP'07, October 14-17, 2007, Stevenson, Washington, USA.
Copyright 2007 ACM 978-1-59593-591-5/07/0010...$5.00.

Dynamo uses a synthesis of well known techniques to achieve
scalability and availability: Data is partitioned and replicated
using consistent hashing [10], and consistency is facilitated by
object versioning [12]. The consistency among replicas during
updates is maintained by a quorum-like technique and a
decentralized replica synchronization protocol. Dynamo employs


This paper describes Dynamo, a highly available data storage
technology that addresses the needs of these important classes of
services. Dynamo has a simple key/value interface, is highly
available with a clearly defined consistency window, is efficient
in its resource usage, and has a simple scale out scheme to address
growth in data set size or request rates. Each service that uses
Dynamo runs its own Dynamo instances.

a gossip-based distributed failure detection and membership
protocol. Dynamo is a completely decentralized system with
minimal need for manual administration. Storage nodes can be
added and removed from Dynamo without requiring any manual
partitioning or redistribution.
In the past year, Dynamo has been the underlying storage
technology for a number of the core services in Amazon's e-commerce
platform. It was able to scale to extreme peak loads
efficiently without any downtime during the busy holiday
shopping season. For example, the service that maintains the
shopping cart (Shopping Cart Service) served tens of millions of
requests that resulted in well over 3 million checkouts in a single
day and the service that manages session state handled hundreds
of thousands of concurrently active sessions.

2.1 System Assumptions and Requirements

The storage system for this class of services has the following
requirements:
Query Model: simple read and write operations to a data item that
is uniquely identified by a key. State is stored as binary objects
(i.e., blobs) identified by unique keys. No operations span
multiple data items and there is no need for relational schema.
This requirement is based on the observation that a significant
portion of Amazon's services can work with this simple query
model and do not need any relational schema. Dynamo targets
applications that need to store objects that are relatively small
(usually less than 1 MB).

The main contribution of this work for the research community is


the evaluation of how different techniques can be combined to
provide a single highly-available system. It demonstrates that an
eventually-consistent storage system can be used in production
with demanding applications. It also provides insight into the
tuning of these techniques to meet the requirements of production
systems with very strict performance demands.

ACID Properties: ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee that database transactions are processed reliably. In the context of databases, a single logical operation on the data is called a transaction. Experience at Amazon has shown that data stores that provide ACID guarantees tend to have poor availability. This has been widely acknowledged by both the industry and academia [5]. Dynamo targets applications that operate with weaker consistency (the "C" in ACID) if this results in high availability. Dynamo does not provide any isolation guarantees and permits only single-key updates.

The paper is structured as follows. Section 2 presents the background and Section 3 presents the related work. Section 4 presents the system design and Section 5 describes the implementation. Section 6 details the experiences and insights gained by running Dynamo in production and Section 7 concludes the paper. There are a number of places in this paper where additional information may have been appropriate but where protecting Amazon's business interests requires us to reduce some level of detail. For this reason, the intra- and inter-datacenter latencies in Section 6, the absolute request rates in Section 6.2, and outage lengths and workloads in Section 6.3 are provided through aggregate measures instead of absolute details.

Efficiency: The system needs to function on a commodity hardware infrastructure. In Amazon's platform, services have stringent latency requirements which are in general measured at the 99.9th percentile of the distribution. Given that state access plays a crucial role in service operation, the storage system must be capable of meeting such stringent SLAs (see Section 2.2 below). Services must be able to configure Dynamo such that they consistently achieve their latency and throughput requirements. The tradeoffs are in performance, cost efficiency, availability, and durability guarantees.

2. BACKGROUND
Amazon's e-commerce platform is composed of hundreds of services that work in concert to deliver functionality ranging from recommendations to order fulfillment to fraud detection. Each service is exposed through a well-defined interface and is
accessible over the network. These services are hosted in an
infrastructure that consists of tens of thousands of servers located
across many data centers world-wide. Some of these services are
stateless (i.e., services which aggregate responses from other
services) and some are stateful (i.e., a service that generates its
response by executing business logic on its state stored in
persistent store).

Other Assumptions: Dynamo is used only by Amazon's internal services. Its operating environment is assumed to be non-hostile
and there are no security related requirements such as
authentication and authorization. Moreover, since each service
uses its distinct instance of Dynamo, its initial design targets a
scale of up to hundreds of storage hosts. We will discuss the
scalability limitations of Dynamo and possible scalability related
extensions in later sections.

Traditionally, production systems store their state in relational databases. For many of the more common usage patterns of state
persistence, however, a relational database is a solution that is far
from ideal. Most of these services only store and retrieve data by
primary key and do not require the complex querying and
management functionality offered by an RDBMS. This excess
functionality requires expensive hardware and highly skilled
personnel for its operation, making it a very inefficient solution.
In addition, the available replication technologies are limited and
typically choose consistency over availability. Although many
advances have been made in the recent years, it is still not easy to
scale-out databases or use smart partitioning schemes for load
balancing.

2.2 Service Level Agreements (SLA)

To guarantee that the application can deliver its functionality in a bounded time, each and every dependency in the platform needs
to deliver its functionality with even tighter bounds. Clients and
services engage in a Service Level Agreement (SLA), a formally
negotiated contract where a client and a service agree on several
system-related characteristics, which most prominently include
the client's expected request rate distribution for a particular API
and the expected service latency under those conditions. An
example of a simple SLA is a service guaranteeing that it will


production systems have shown that this approach provides a better overall experience compared to those systems that meet SLAs defined based on the mean or median.

In this paper there are many references to this 99.9th percentile of distributions, which reflects Amazon engineers' relentless focus on performance from the perspective of the customers' experience. Many papers report on averages, so these are included where it makes sense for comparison purposes. Nevertheless, Amazon's engineering and optimization efforts are not focused on averages. Several techniques, such as the load balanced selection of write coordinators, are purely targeted at controlling performance at the 99.9th percentile.
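To make the percentile target concrete, the sketch below computes a nearest-rank percentile over a window of latency samples and checks it against a 300 ms bound. This is only an illustration of how such a metric could be computed; the class and method names are hypothetical and this is not Amazon's measurement code.

import java.util.Arrays;

// Illustrative sketch: nearest-rank percentile over a window of latency samples.
public class PercentileSla {
    // Returns the p-th percentile (0 < p <= 100) of the given latencies, in milliseconds.
    static long percentile(long[] latenciesMs, double p) {
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil((p / 100.0) * sorted.length); // nearest-rank method
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        long[] window = {12, 15, 9, 300, 22, 18, 14, 11, 250, 17};
        System.out.println("p50   = " + percentile(window, 50.0) + " ms");
        System.out.println("p99.9 = " + percentile(window, 99.9) + " ms");
        // An SLA check in the spirit of the text: 99.9% of requests within 300 ms.
        System.out.println("meets SLA: " + (percentile(window, 99.9) <= 300));
    }
}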
Storage systems often play an important role in establishing a service's SLA, especially if the business logic is relatively lightweight, as is the case for many Amazon services. State management then becomes the main component of a service's SLA. One of the main design considerations for Dynamo is to give services control over their system properties, such as durability and consistency, and to let services make their own tradeoffs between functionality, performance and cost-effectiveness.

Figure 1: Service-oriented architecture of Amazon's platform.

2.3 Design Considerations

Data replication algorithms used in commercial systems traditionally perform synchronous replica coordination in order to provide a strongly consistent data access interface. To achieve this level of consistency, these algorithms are forced to trade off the availability of the data under certain failure scenarios. For instance, rather than dealing with the uncertainty of the correctness of an answer, the data is made unavailable until it is absolutely certain that it is correct. From the very early replicated database works, it is well known that when dealing with the possibility of network failures, strong consistency and high data availability cannot be achieved simultaneously [2, 11]. As such, systems and applications need to be aware of which properties can be achieved under which conditions.

provide a response within 300ms for 99.9% of its requests for a peak client load of 500 requests per second.
In Amazon's decentralized service-oriented infrastructure, SLAs play an important role. For example, a page request to one of the e-commerce sites typically requires the rendering engine to construct its response by sending requests to over 150 services. These services often have multiple dependencies, which frequently are other services, and as such it is not uncommon for the call graph of an application to have more than one level. To ensure that the page rendering engine can maintain a clear bound on page delivery, each service within the call chain must obey its performance contract.

Figure 1 shows an abstract view of the architecture of Amazon's
platform, where dynamic web content is generated by page
rendering components which in turn query many other services. A
service can use different data stores to manage its state and these
data stores are only accessible within its service boundaries. Some
services act as aggregators by using several other services to
produce a composite response. Typically, the aggregator services
are stateless, although they use extensive caching.

For systems prone to server and network failures, availability can be increased by using optimistic replication techniques, where
changes are allowed to propagate to replicas in the background,
and concurrent, disconnected work is tolerated. The challenge
with this approach is that it can lead to conflicting changes which
must be detected and resolved. This process of conflict resolution
introduces two problems: when to resolve them and who resolves
them. Dynamo is designed to be an eventually consistent data
store; that is, all updates reach all replicas eventually.

A common approach in the industry for forming a performance oriented SLA is to describe it using average, median and expected variance. At Amazon we have found that these metrics are not good enough if the goal is to build a system where all customers have a good experience, rather than just the majority. For example, if extensive personalization techniques are used then customers with longer histories require more processing, which impacts performance at the high end of the distribution. An SLA stated in terms of mean or median response times will not address the performance of this important customer segment. To address this issue, at Amazon, SLAs are expressed and measured at the 99.9th percentile of the distribution. The choice for 99.9% over an even higher percentile has been made based on a cost-benefit analysis which demonstrated a significant increase in cost to improve performance that much. Experiences with Amazon's

An important design consideration is to decide when to perform the process of resolving update conflicts, i.e., whether conflicts
should be resolved during reads or writes. Many traditional data
stores execute conflict resolution during writes and keep the read
complexity simple [7]. In such systems, writes may be rejected if
the data store cannot reach all (or a majority of) the replicas at a
given time. On the other hand, Dynamo targets the design space
of an "always writeable" data store (i.e., a data store that is highly
available for writes). For a number of Amazon services, rejecting
customer updates could result in a poor customer experience. For
instance, the shopping cart service must allow customers to add
and remove items from their shopping cart even amidst network
and server failures. This requirement forces us to push the
complexity of conflict resolution to the reads in order to ensure
that writes are never rejected.


Various storage systems, such as Oceanstore [9] and PAST [17], were built on top of these routing overlays. Oceanstore provides a
global, transactional, persistent storage service that supports
serialized updates on widely replicated data. To allow for
concurrent updates while avoiding many of the problems inherent
with wide-area locking, it uses an update model based on conflict
resolution. Conflict resolution was introduced in [21] to reduce
the number of transaction aborts. Oceanstore resolves conflicts by
processing a series of updates, choosing a total order among them,
and then applying them atomically in that order. It is built for an
environment where the data is replicated on an untrusted
infrastructure. By comparison, PAST provides a simple
abstraction layer on top of Pastry for persistent and immutable
objects. It assumes that the application can build the necessary
storage semantics (such as mutable files) on top of it.

The next design choice is who performs the process of conflict resolution. This can be done by the data store or the application. If
conflict resolution is done by the data store, its choices are rather
limited. In such cases, the data store can only use simple policies,
such as "last write wins" [22], to resolve conflicting updates. On
the other hand, since the application is aware of the data schema it
can decide on the conflict resolution method that is best suited for
its clients' experience. For instance, the application that maintains
customer shopping carts can choose to merge the conflicting
versions and return a single unified shopping cart. Despite this
flexibility, some application developers may not want to write
their own conflict resolution mechanisms and choose to push it
down to the data store, which in turn chooses a simple policy such
as "last write wins".
Other key principles embraced in the design are:

Incremental scalability: Dynamo should be able to scale out one storage host (henceforth, referred to as "node") at a time, with
minimal impact on both operators of the system and the system
itself.
Symmetry: Every node in Dynamo should have the same set of
responsibilities as its peers; there should be no distinguished node
or nodes that take special roles or extra set of responsibilities. In
our experience, symmetry simplifies the process of system
provisioning and maintenance.
Decentralization: An extension of symmetry, the design should
favor decentralized peer-to-peer techniques over centralized
control. In the past, centralized control has resulted in outages and
the goal is to avoid it as much as possible. This leads to a simpler,
more scalable, and more available system.
Heterogeneity: The system needs to be able to exploit
heterogeneity in the infrastructure it runs on; e.g., the work
distribution must be proportional to the capabilities of the
individual servers. This is essential in adding new nodes with
higher capacity without having to upgrade all hosts at once.

Among these systems, Bayou, Coda and Ficus allow disconnected operations and are resilient to issues such as network partitions
and outages. These systems differ on their conflict resolution
procedures. For instance, Coda and Ficus perform system level
conflict resolution and Bayou allows application level resolution.
All of them, however, guarantee eventual consistency. Similar to
these systems, Dynamo allows read and write operations to
continue even during network partitions and resolves update
conflicts using different conflict resolution mechanisms.
Distributed block storage systems like FAB [18] split large objects into smaller blocks and store each block in a highly
available manner. In comparison to these systems, a key-value
store is more suitable in this case because: (a) it is intended to
store relatively small objects (size < 1 MB) and (b) key-value stores
are easier to configure on a per-application basis. Antiquity is a
wide-area distributed storage system designed to handle multiple
server failures [23]. It uses a secure log to preserve data integrity,
replicates each log on multiple servers for durability, and uses
Byzantine fault tolerance protocols to ensure data consistency. In
contrast to Antiquity, Dynamo does not focus on the problem of
data integrity and security and is built for a trusted environment.
Bigtable is a distributed storage system for managing structured
data. It maintains a sparse, multi-dimensional sorted map and
allows applications to access their data using multiple attributes
[2]. Compared to Bigtable, Dynamo targets applications that
require only key/value access with primary focus on high
availability where updates are not rejected even in the wake of
network partitions or server failures.

3. RELATED WORK
3.1 Peer to Peer Systems
There are several peer-to-peer (P2P) systems that have looked at
the problem of data storage and distribution. The first generation
of P2P systems, such as Freenet and Gnutella1, were
predominantly used as file sharing systems. These were examples
of unstructured P2P networks where the overlay links between
peers were established arbitrarily. In these networks, a search
query is usually flooded through the network to find as many
peers as possible that share the data. P2P systems evolved to the
next generation into what is widely known as structured P2P
networks. These networks employ a globally consistent protocol
to ensure that any node can efficiently route a search query to
some peer that has the desired data. Systems like Pastry [16] and
Chord [20] use routing mechanisms to ensure that queries can be
answered within a bounded number of hops. To reduce the
additional latency introduced by multi-hop routing, some P2P
systems (e.g., [14]) employ O(1) routing where each peer
maintains enough routing information locally so that it can route
requests (to access a data item) to the appropriate peer within a
constant number of hops.
3.2 Distributed File Systems and Databases

Distributing data for performance, availability and durability has been widely studied in the file system and database systems
community. Compared to P2P storage systems that only support
flat namespaces, distributed file systems typically support
hierarchical namespaces. Systems like Ficus [15] and Coda [19]
replicate files for high availability at the expense of consistency.
Update conflicts are typically managed using specialized conflict
resolution procedures. The Farsite system [1] is a distributed file
system that does not use any centralized server like NFS. Farsite
achieves high availability and scalability using replication. The
Google File System [6] is another distributed file system built for
hosting the state of Google's internal applications. GFS uses a
simple design with a single master server for hosting the entire
metadata and where the data is split into chunks and stored in
chunkservers. Bayou is a distributed relational database system
that allows disconnected operations and provides eventual data
consistency [21].

1 http://freenetproject.org/, http://www.gnutella.org


Table 1: Summary of techniques used in Dynamo and their advantages.

Problem                            | Technique                                               | Advantage
Partitioning                       | Consistent Hashing                                      | Incremental Scalability
High Availability for writes       | Vector clocks with reconciliation during reads          | Version size is decoupled from update rates.
Handling temporary failures        | Sloppy Quorum and hinted handoff                        | Provides high availability and durability guarantee when some of the replicas are not available.
Recovering from permanent failures | Anti-entropy using Merkle trees                         | Synchronizes divergent replicas in the background.
Membership and failure detection   | Gossip-based membership protocol and failure detection  | Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information.

Figure 2: Partitioning and replication of keys in the Dynamo ring. Nodes B, C and D store keys in range (A, B), including key K.

Traditional replicated relational database systems focus on the
problem of guaranteeing strong consistency to replicated data.
Although strong consistency provides the application writer a
convenient programming model, these systems are limited in
scalability and availability [7]. These systems are not capable of
handling network partitions because they typically provide strong
consistency guarantees.

3.3 Discussion

Table 1 presents a summary of the techniques Dynamo uses and their respective advantages.

Dynamo differs from the aforementioned decentralized storage systems in terms of its target requirements. First, Dynamo is targeted mainly at applications that need an "always writeable" data store where no updates are rejected due to failures or
concurrent writes. This is a crucial requirement for many Amazon
applications. Second, as noted earlier, Dynamo is built for an
infrastructure within a single administrative domain where all
nodes are assumed to be trusted. Third, applications that use
Dynamo do not require support for hierarchical namespaces (a
norm in many file systems) or complex relational schema
(supported by traditional databases). Fourth, Dynamo is built for
latency sensitive applications that require at least 99.9% of read
and write operations to be performed within a few hundred
milliseconds. To meet these stringent latency requirements, it was
imperative for us to avoid routing requests through multiple nodes
(which is the typical design adopted by several distributed hash table systems such as Chord and Pastry). This is because multi-hop routing increases variability in response times, thereby increasing the latency at higher percentiles. Dynamo can be characterized as a zero-hop DHT, where each node maintains
enough routing information locally to route a request to the
appropriate node directly.

4.1 System Interface

Dynamo stores objects associated with a key through a simple interface; it exposes two operations: get() and put(). The get(key)
operation locates the object replicas associated with the key in the
storage system and returns a single object or a list of objects with
conflicting versions along with a context. The put(key, context,
object) operation determines where the replicas of the object
should be placed based on the associated key, and writes the
replicas to disk. The context encodes system metadata about the
object that is opaque to the caller and includes information such as
the version of the object. The context information is stored along
with the object so that the system can verify the validity of the
context object supplied in the put request.
Dynamo treats both the key and the object supplied by the caller
as an opaque array of bytes. It applies an MD5 hash on the key to
generate a 128-bit identifier, which is used to determine the
storage nodes that are responsible for serving the key.
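The interface and key hashing just described can be sketched as follows. This is an illustration rather than Dynamo's actual code; the type names (DynamoLikeStore, KeyHashing) are hypothetical, and the context returned alongside the versions by get() is only hinted at in the comments.

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

// Hypothetical sketch of the two-operation interface described above.
interface DynamoLikeStore {
    // Returns the object stored under the key, or all causally unrelated versions of it.
    List<byte[]> get(byte[] key);

    // 'context' is the opaque version metadata obtained from an earlier get() on the same key.
    void put(byte[] key, byte[] context, byte[] object);
}

class KeyHashing {
    // Hash the opaque key with MD5 to a 128-bit identifier: the key's position on the ring.
    static BigInteger ringPosition(byte[] key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(key);
            return new BigInteger(1, digest); // interpret the 16 digest bytes as a non-negative value
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 is a standard JDK algorithm", e);
        }
    }

    public static void main(String[] args) {
        byte[] key = "cart:customer-42".getBytes(StandardCharsets.UTF_8);
        System.out.println("ring position = 0x" + ringPosition(key).toString(16));
    }
}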

4.2 Partitioning Algorithm

One of the key design requirements for Dynamo is that it must scale incrementally. This requires a mechanism to dynamically partition the data over the set of nodes (i.e., storage hosts) in the system. Dynamo's partitioning scheme relies on consistent hashing to distribute the load across multiple storage hosts. In consistent hashing [10], the output range of a hash function is treated as a fixed circular space or "ring" (i.e., the largest hash value wraps around to the smallest hash value). Each node in the system is assigned a random value within this space which represents its "position" on the ring. Each data item identified by a key is assigned to a node by hashing the data item's key to yield its position on the ring, and then walking the ring clockwise to find the first node with a position larger than the item's position.

4. SYSTEM ARCHITECTURE
The architecture of a storage system that needs to operate in a
production setting is complex. In addition to the actual data
persistence component, the system needs to have scalable and
robust solutions for load balancing, membership and failure
detection, failure recovery, replica synchronization, overload
handling, state transfer, concurrency and job scheduling, request
marshalling, request routing, system monitoring and alarming,
and configuration management. Describing the details of each of
the solutions is not possible, so this paper focuses on the core
distributed systems techniques used in Dynamo: partitioning,
replication, versioning, membership, failure handling and scaling.


Thus, each node becomes responsible for the region in the ring
between it and its predecessor node on the ring. The principal
advantage of consistent hashing is that departure or arrival of a
node only affects its immediate neighbors and other nodes remain
unaffected.

return to its caller before the update has been applied at all the
replicas, which can result in scenarios where a subsequent get()
operation may return an object that does not have the latest
updates. If there are no failures, then there is a bound on the
update propagation times. However, under certain failure
scenarios (e.g., server outages or network partitions), updates may
not arrive at all replicas for an extended period of time.

The basic consistent hashing algorithm presents some challenges. First, the random position assignment of each node on the ring
leads to non-uniform data and load distribution. Second, the basic
algorithm is oblivious to the heterogeneity in the performance of
nodes. To address these issues, Dynamo uses a variant of
consistent hashing (similar to the one used in [10, 20]): instead of
mapping a node to a single point in the circle, each node gets
assigned to multiple points in the ring. To this end, Dynamo uses
the concept of "virtual nodes". A virtual node looks like a single node in the system, but each node can be responsible for more than one virtual node. Effectively, when a new node is added to the system, it is assigned multiple positions (henceforth, "tokens") in the ring. The process of fine-tuning Dynamo's partitioning
scheme is discussed in Section 6.
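As a rough illustration of a ring with virtual nodes, the sketch below keeps token positions in a sorted map and walks it clockwise to find a key's coordinator. The token counts, the hashing choice, and the class layout are assumptions made for the example, not Dynamo's implementation; assigning more tokens to a more capable machine is one way the heterogeneity principle discussed earlier can be exploited.

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Simplified sketch of a consistent-hashing ring with virtual nodes ("tokens").
public class HashRing {
    private final TreeMap<BigInteger, String> ring = new TreeMap<>(); // token position -> physical node

    private static BigInteger hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, d);
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e);
        }
    }

    // Assign several tokens to each physical node; a more capable machine can be given more tokens.
    void addNode(String node, int tokens) {
        for (int i = 0; i < tokens; i++) ring.put(hash(node + "#" + i), node);
    }

    void removeNode(String node, int tokens) {
        for (int i = 0; i < tokens; i++) ring.remove(hash(node + "#" + i));
    }

    // Walk the ring clockwise: the first token at or after the key's position owns the key.
    String coordinatorFor(String key) {
        BigInteger pos = hash(key);
        Map.Entry<BigInteger, String> e = ring.ceilingEntry(pos);
        return (e != null ? e : ring.firstEntry()).getValue(); // wrap around past the largest token
    }

    public static void main(String[] args) {
        HashRing r = new HashRing();
        r.addNode("A", 8);
        r.addNode("B", 8);
        r.addNode("C", 16); // heterogeneity: C gets more tokens
        System.out.println("key 'cart:42' is coordinated by " + r.coordinatorFor("cart:42"));
    }
}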

There is a category of applications in Amazon's platform that can tolerate such inconsistencies and can be constructed to operate
under these conditions. For example, the shopping cart application requires that an "Add to Cart" operation can never be forgotten or rejected. If the most recent state of the cart is unavailable, and a user makes changes to an older version of the cart, that change is still meaningful and should be preserved. But at the same time it shouldn't supersede the currently unavailable state of the cart, which itself may contain changes that should be preserved. Note that both "add to cart" and "delete item from cart" operations are
translated into put requests to Dynamo. When a customer wants to
add an item to (or remove from) a shopping cart and the latest
version is not available, the item is added to (or removed from)
the older version and the divergent versions are reconciled later.

Using virtual nodes has the following advantages:

- If a node becomes unavailable (due to failures or routine maintenance), the load handled by this node is evenly dispersed across the remaining available nodes.

- When a node becomes available again, or a new node is added to the system, the newly available node accepts a roughly equivalent amount of load from each of the other available nodes.

- The number of virtual nodes that a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure.

In order to provide this kind of guarantee, Dynamo treats the result of each modification as a new and immutable version of the
data. It allows for multiple versions of an object to be present in
the system at the same time. Most of the time, new versions
subsume the previous version(s), and the system itself can
determine the authoritative version (syntactic reconciliation).
However, version branching may happen, in the presence of
failures combined with concurrent updates, resulting in
conflicting versions of an object. In these cases, the system cannot
reconcile the multiple versions of the same object and the client
must perform the reconciliation in order to collapse multiple
branches of data evolution back into one (semantic
reconciliation). A typical example of a collapse operation is
merging different versions of a customers shopping cart. Using
this reconciliation mechanism, an add to cart operation is never
lost. However, deleted items can resurface.

4.3 Replication

To achieve high availability and durability, Dynamo replicates its data on multiple hosts. Each data item is replicated at N hosts,
where N is a parameter configured per-instance. Each key, k, is
assigned to a coordinator node (described in the previous section).
The coordinator is in charge of the replication of the data items
that fall within its range. In addition to locally storing each key
within its range, the coordinator replicates these keys at the N-1
clockwise successor nodes in the ring. This results in a system
where each node is responsible for the region of the ring between
it and its Nth predecessor. In Figure 2, node B replicates the key k
at nodes C and D in addition to storing it locally. Node D will
store the keys that fall in the ranges (A, B], (B, C], and (C, D].

It is important to understand that certain failure modes can potentially result in the system having not just two but several
versions of the same data. Updates in the presence of network
partitions and node failures can potentially result in an object
having distinct version sub-histories, which the system will need
to reconcile in the future. This requires us to design applications
that explicitly acknowledge the possibility of multiple versions of
the same data (in order to never lose any updates).
Dynamo uses vector clocks [12] in order to capture causality
between different versions of the same object. A vector clock is
effectively a list of (node, counter) pairs. One vector clock is
associated with every version of every object. One can determine
whether two versions of an object are on parallel branches or have
a causal ordering, by examining their vector clocks. If the counters on the first object's clock are less-than-or-equal to all of the nodes
in the second clock, then the first is an ancestor of the second and
can be forgotten. Otherwise, the two changes are considered to be
in conflict and require reconciliation.
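The causality check can be sketched as follows; the main() mirrors the D2/D3/D4 example discussed with Figure 3 later in this section. This is a minimal illustration, not Dynamo's internal representation.

import java.util.HashMap;
import java.util.Map;

// Minimal sketch of vector-clock causality checks (illustrative; not Dynamo's data structures).
public class VectorClock {
    final Map<String, Long> counters = new HashMap<>(); // node id -> counter

    void increment(String node) { counters.merge(node, 1L, Long::sum); }

    // True if every counter in this clock is <= the matching counter in 'other':
    // this version is an ancestor of 'other' and can be forgotten.
    boolean isAncestorOf(VectorClock other) {
        for (Map.Entry<String, Long> e : counters.entrySet()) {
            if (e.getValue() > other.counters.getOrDefault(e.getKey(), 0L)) return false;
        }
        return true;
    }

    // Neither clock descends from the other: parallel branches that need reconciliation.
    static boolean inConflict(VectorClock a, VectorClock b) {
        return !a.isAncestorOf(b) && !b.isAncestorOf(a);
    }

    public static void main(String[] args) {
        VectorClock d2 = new VectorClock();
        d2.increment("Sx"); d2.increment("Sx");              // [(Sx, 2)]
        VectorClock d3 = new VectorClock();
        d3.counters.putAll(d2.counters); d3.increment("Sy"); // [(Sx, 2), (Sy, 1)]
        VectorClock d4 = new VectorClock();
        d4.counters.putAll(d2.counters); d4.increment("Sz"); // [(Sx, 2), (Sz, 1)]
        System.out.println("D2 is ancestor of D3: " + d2.isAncestorOf(d3)); // true: D2 can be forgotten
        System.out.println("D3 and D4 conflict:   " + inConflict(d3, d4));  // true: reconcile on read
    }
}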

The list of nodes that is responsible for storing a particular key is called the preference list. The system is designed, as will be
explained in Section 4.8, so that every node in the system can
determine which nodes should be in this list for any particular
key. To account for node failures, the preference list contains more than N nodes. Note that with the use of virtual nodes, it is possible that the first N successor positions for a particular key may be owned by fewer than N distinct physical nodes (i.e., a node may
hold more than one of the first N positions). To address this, the
preference list for a key is constructed by skipping positions in the
ring to ensure that the list contains only distinct physical nodes.
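Under stated assumptions (the token-to-node map of the earlier ring sketch, hypothetical helper names), preference-list construction can be sketched as a clockwise walk that skips additional virtual nodes belonging to a physical node already in the list:

import java.math.BigInteger;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Sketch: build a preference list of N distinct physical nodes by walking the ring
// clockwise from the key's position and skipping extra tokens of hosts already chosen.
public class PreferenceList {
    static List<String> build(TreeMap<BigInteger, String> ring, BigInteger keyPos, int n) {
        Set<String> distinct = new LinkedHashSet<>(); // preserves clockwise order; first is the coordinator
        if (ring.isEmpty()) return new ArrayList<>();
        for (Map.Entry<BigInteger, String> e : ring.tailMap(keyPos).entrySet()) {
            distinct.add(e.getValue());
            if (distinct.size() == n) break;
        }
        if (distinct.size() < n) { // wrap around the ring
            for (Map.Entry<BigInteger, String> e : ring.headMap(keyPos).entrySet()) {
                distinct.add(e.getValue());
                if (distinct.size() == n) break;
            }
        }
        return new ArrayList<>(distinct);
    }

    public static void main(String[] args) {
        TreeMap<BigInteger, String> ring = new TreeMap<>();
        // Two tokens per physical node, with hand-picked positions for the example.
        ring.put(BigInteger.valueOf(10), "A"); ring.put(BigInteger.valueOf(60), "A");
        ring.put(BigInteger.valueOf(25), "B"); ring.put(BigInteger.valueOf(70), "B");
        ring.put(BigInteger.valueOf(40), "C"); ring.put(BigInteger.valueOf(90), "C");
        System.out.println(build(ring, BigInteger.valueOf(30), 3)); // [C, A, B]
    }
}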

In Dynamo, when a client wishes to update an object, it must specify which version it is updating. This is done by passing the
context it obtained from an earlier read operation, which contains
the vector clock information. Upon processing a read request, if

4.4 Data Versioning

Dynamo provides eventual consistency, which allows for updates to be propagated to all replicas asynchronously. A put() call may


object. In practice, this is not likely because the writes are usually
handled by one of the top N nodes in the preference list. In case of
network partitions or multiple server failures, write requests may
be handled by nodes that are not in the top N nodes in the preference list, causing the size of the vector clock to grow. In these scenarios, it is desirable to limit the size of the vector clock. To this
end, Dynamo employs the following clock truncation scheme:
Along with each (node, counter) pair, Dynamo stores a timestamp
that indicates the last time the node updated the data item. When
the number of (node, counter) pairs in the vector clock reaches a
threshold (say 10), the oldest pair is removed from the clock.
Clearly, this truncation scheme can lead to inefficiencies in
reconciliation as the descendant relationships cannot be derived
accurately. However, this problem has not surfaced in production
and therefore this issue has not been thoroughly investigated.
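A minimal sketch of this truncation scheme, assuming the threshold of 10 mentioned above and a per-pair timestamp; the class layout is hypothetical:

import java.util.HashMap;
import java.util.Map;

// Sketch of the clock-truncation scheme: each (node, counter) pair also carries a timestamp
// of the node's last update; when the clock exceeds a threshold, the oldest pair is dropped.
public class TruncatedClock {
    static final int THRESHOLD = 10; // the "say 10" threshold from the text

    static final class Entry {
        long counter;
        long lastUpdatedMillis;
        Entry(long counter, long ts) { this.counter = counter; this.lastUpdatedMillis = ts; }
    }

    final Map<String, Entry> clock = new HashMap<>();

    void recordWrite(String coordinatorNode) {
        long now = System.currentTimeMillis();
        Entry e = clock.get(coordinatorNode);
        if (e == null) clock.put(coordinatorNode, new Entry(1, now));
        else { e.counter++; e.lastUpdatedMillis = now; }

        if (clock.size() > THRESHOLD) {
            String oldest = null;
            for (Map.Entry<String, Entry> en : clock.entrySet()) {
                if (oldest == null || en.getValue().lastUpdatedMillis < clock.get(oldest).lastUpdatedMillis) {
                    oldest = en.getKey();
                }
            }
            clock.remove(oldest); // reconciliation may become less precise, as noted in the text
        }
    }
}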

4.5 Execution of get() and put() operations

Any storage node in Dynamo is eligible to receive client get and put operations for any key. In this section, for the sake of simplicity,
we describe how these operations are performed in a failure-free
environment and in the subsequent section we describe how read
and write operations are executed during failures.

Figure 3: Version evolution of an object over time.


Dynamo has access to multiple branches that cannot be
syntactically reconciled, it will return all the objects at the leaves,
with the corresponding version information in the context. An
update using this context is considered to have reconciled the
divergent versions and the branches are collapsed into a single
new version.

Both get and put operations are invoked using Amazon's infrastructure-specific request processing framework over HTTP.
There are two strategies that a client can use to select a node: (1)
route its request through a generic load balancer that will select a
node based on load information, or (2) use a partition-aware client
library that routes requests directly to the appropriate coordinator
nodes. The advantage of the first approach is that the client does
not have to link any code specific to Dynamo in its application,
whereas the second strategy can achieve lower latency because it
skips a potential forwarding step.

To illustrate the use of vector clocks, let us consider the example shown in Figure 3. A client writes a new object. The node (say
Sx) that handles the write for this key increases its sequence
number and uses it to create the data's vector clock. The system
now has the object D1 and its associated clock [(Sx, 1)]. The
client updates the object. Assume the same node handles this
request as well. The system now also has object D2 and its
associated clock [(Sx, 2)]. D2 descends from D1 and therefore overwrites D1; however, there may be replicas of D1 lingering at
nodes that have not yet seen D2. Let us assume that the same
client updates the object again and a different server (say Sy)
handles the request. The system now has data D3 and its
associated clock [(Sx, 2), (Sy, 1)].

A node handling a read or write operation is known as the coordinator. Typically, this is the first among the top N nodes in
the preference list. If the requests are received through a load
balancer, requests to access a key may be routed to any random
node in the ring. In this scenario, the node that receives the
request will not coordinate it if the node is not in the top N of the
requested key's preference list. Instead, that node will forward the
request to the first among the top N nodes in the preference list.
Read and write operations involve the first N healthy nodes in the
preference list, skipping over those that are down or inaccessible.
When all nodes are healthy, the top N nodes in a key's preference
list are accessed. When there are node failures or network
partitions, nodes that are lower ranked in the preference list are
accessed.

Next assume a different client reads D2 and then tries to update it,
and another node (say Sz) does the write. The system now has D4
(descendant of D2) whose version clock is [(Sx, 2), (Sz, 1)]. A
node that is aware of D1 or D2 could determine, upon receiving
D4 and its clock, that D1 and D2 are overwritten by the new data
and can be garbage collected. A node that is aware of D3 and
receives D4 will find that there is no causal relation between
them. In other words, there are changes in D3 and D4 that are not
reflected in each other. Both versions of the data must be kept and
presented to a client (upon a read) for semantic reconciliation.

To maintain consistency among its replicas, Dynamo uses a consistency protocol similar to those used in quorum systems.
This protocol has two key configurable values: R and W. R is the
minimum number of nodes that must participate in a successful
read operation. W is the minimum number of nodes that must
participate in a successful write operation. Setting R and W such
that R + W > N yields a quorum-like system. In this model, the
latency of a get (or put) operation is dictated by the slowest of the
R (or W) replicas. For this reason, R and W are usually
configured to be less than N, to provide better latency.
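The R/W rule can be captured in a few lines; the class name and validation below are illustrative assumptions. The (3,2,2) and (3,1,3) settings in the example correspond to configurations discussed later in Section 6.

// Sketch of the quorum-style configuration: R + W > N yields quorum-like consistency,
// while smaller R or W trades consistency and durability for latency and availability.
public class QuorumConfig {
    final int n, r, w;

    QuorumConfig(int n, int r, int w) {
        if (r < 1 || w < 1 || r > n || w > n) throw new IllegalArgumentException("R and W must be in [1, N]");
        this.n = n; this.r = r; this.w = w;
    }

    boolean isQuorumLike() { return r + w > n; }

    public static void main(String[] args) {
        QuorumConfig common = new QuorumConfig(3, 2, 2);     // the common Dynamo configuration (3,2,2)
        QuorumConfig readHeavy = new QuorumConfig(3, 1, 3);  // "high performance read engine" style
        System.out.println("(3,2,2) quorum-like: " + common.isQuorumLike());
        System.out.println("(3,1,3) quorum-like: " + readHeavy.isQuorumLike());
    }
}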

Now assume some client reads both D3 and D4 (the context will
reflect that both values were found by the read). The read's
context is a summary of the clocks of D3 and D4, namely [(Sx, 2),
(Sy, 1), (Sz, 1)]. If the client performs the reconciliation and node
Sx coordinates the write, Sx will update its sequence number in
the clock. The new data D5 will have the following clock: [(Sx,
3), (Sy, 1), (Sz, 1)].

Upon receiving a put() request for a key, the coordinator generates the vector clock for the new version and writes the new version
locally. The coordinator then sends the new version (along with

A possible issue with vector clocks is that the size of vector clocks may grow if many servers coordinate the writes to an


the new vector clock) to the N highest-ranked reachable nodes. If at least W-1 nodes respond, then the write is considered
successful.

the original replica node. To handle this and other threats to durability, Dynamo implements an anti-entropy (replica
synchronization) protocol to keep the replicas synchronized.

Similarly, for a get() request, the coordinator requests all existing versions of data for that key from the N highest-ranked reachable
nodes in the preference list for that key, and then waits for R
responses before returning the result to the client. If the
coordinator ends up gathering multiple versions of the data, it
returns all the versions it deems to be causally unrelated. The
divergent versions are then reconciled and the reconciled version
superseding the current versions is written back.

To detect the inconsistencies between replicas faster and to minimize the amount of transferred data, Dynamo uses Merkle
trees [13]. A Merkle tree is a hash tree where leaves are hashes of
the values of individual keys. Parent nodes higher in the tree are
hashes of their respective children. The principal advantage of a Merkle tree is that each branch of the tree can be checked
independently without requiring nodes to download the entire tree
or the entire data set. Moreover, Merkle trees help in reducing the
amount of data that needs to be transferred while checking for
inconsistencies among replicas. For instance, if the hash values of
the root of two trees are equal, then the values of the leaf nodes in
the tree are equal and the nodes require no synchronization. If not,
it implies that the values of some replicas are different. In such
cases, the nodes may exchange the hash values of children and the
process continues until it reaches the leaves of the trees, at which
point the hosts can identify the keys that are out of sync. Merkle
trees minimize the amount of data that needs to be transferred for
synchronization and reduce the number of disk reads performed
during the anti-entropy process.
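A minimal sketch of building and comparing Merkle trees, assuming a small, power-of-two number of leaves held entirely in memory; a real exchange would proceed level by level over the network and per key range, as described in the next paragraphs. The hash choice and class layout are assumptions for the example.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of Merkle-tree comparison over a small, fixed set of leaf hashes.
// Leaves are hashes of individual key/value pairs; each parent is the hash of its children.
public class MerkleSketch {
    static byte[] sha1(byte[]... parts) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            for (byte[] p : parts) md.update(p);
            return md.digest();
        } catch (java.security.NoSuchAlgorithmException e) { throw new AssertionError(e); }
    }

    // Build a complete binary tree stored in an array: tree[1] is the root,
    // children of i are 2i and 2i+1, and the last 'leaves.length' slots are the leaves.
    static byte[][] build(byte[][] leaves) {
        int n = leaves.length; // assumed to be a power of two for simplicity
        byte[][] tree = new byte[2 * n][];
        System.arraycopy(leaves, 0, tree, n, n);
        for (int i = n - 1; i >= 1; i--) tree[i] = sha1(tree[2 * i], tree[2 * i + 1]);
        return tree;
    }

    // Descend only into subtrees whose hashes differ; report differing leaf indices.
    static void diff(byte[][] a, byte[][] b, int i, int n, List<Integer> out) {
        if (Arrays.equals(a[i], b[i])) return;   // identical subtree: no synchronization needed
        if (i >= n) { out.add(i - n); return; }  // differing leaf: this key range is out of sync
        diff(a, b, 2 * i, n, out);
        diff(a, b, 2 * i + 1, n, out);
    }

    public static void main(String[] args) {
        byte[][] replicaA = new byte[4][], replicaB = new byte[4][];
        for (int i = 0; i < 4; i++) {
            replicaA[i] = sha1(("key" + i + "=v1").getBytes(StandardCharsets.UTF_8));
            replicaB[i] = sha1(("key" + i + (i == 2 ? "=v2" : "=v1")).getBytes(StandardCharsets.UTF_8));
        }
        List<Integer> outOfSync = new ArrayList<>();
        diff(build(replicaA), build(replicaB), 1, 4, outOfSync);
        System.out.println("leaves needing synchronization: " + outOfSync); // [2]
    }
}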

4.6 Handling Failures: Hinted Handoff

If Dynamo used a traditional quorum approach it would be unavailable during server failures and network partitions, and
would have reduced durability even under the simplest of failure
conditions. To remedy this it does not enforce strict quorum
membership and instead it uses a "sloppy quorum"; all read and
write operations are performed on the first N healthy nodes from
the preference list, which may not always be the first N nodes
encountered while walking the consistent hashing ring.
Consider the example of Dynamo configuration given in Figure 2
with N=3. In this example, if node A is temporarily down or
unreachable during a write operation then a replica that would
normally have lived on A will now be sent to node D. This is done
to maintain the desired availability and durability guarantees. The
replica sent to D will have a hint in its metadata that suggests
which node was the intended recipient of the replica (in this case
A). Nodes that receive hinted replicas will keep them in a
separate local database that is scanned periodically. Upon
detecting that A has recovered, D will attempt to deliver the
replica to A. Once the transfer succeeds, D may delete the object
from its local store without decreasing the total number of replicas
in the system.
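The bookkeeping behind hinted handoff can be sketched as follows; persistence, scan scheduling, and the delivery RPC are elided, and the class and interface names are hypothetical.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch of hinted handoff: replicas written to a stand-in node carry a hint naming
// the intended node; a periodic scan tries to hand them back once that node recovers.
public class HintedHandoffStore {
    static class HintedReplica {
        final String key; final byte[] value; final String intendedNode; // e.g., "A"
        HintedReplica(String key, byte[] value, String intendedNode) {
            this.key = key; this.value = value; this.intendedNode = intendedNode;
        }
    }

    // Kept in a separate local database in Dynamo; a list stands in for it here.
    private final List<HintedReplica> hints = new ArrayList<>();

    void storeHinted(String key, byte[] value, String intendedNode) {
        hints.add(new HintedReplica(key, value, intendedNode));
    }

    // Called periodically: attempt delivery to recovered nodes and drop the local copy on success.
    void scanAndDeliver(NodeClient client) {
        for (Iterator<HintedReplica> it = hints.iterator(); it.hasNext(); ) {
            HintedReplica h = it.next();
            if (client.isReachable(h.intendedNode) && client.deliver(h.intendedNode, h.key, h.value)) {
                it.remove(); // safe to delete without reducing the total replica count
            }
        }
    }

    // Hypothetical transport interface.
    interface NodeClient {
        boolean isReachable(String node);
        boolean deliver(String node, String key, byte[] value);
    }
}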

Dynamo uses Merkle trees for anti-entropy as follows: Each node maintains a separate Merkle tree for each key range (the set of
keys covered by a virtual node) it hosts. This allows nodes to
compare whether the keys within a key range are up-to-date. In
this scheme, two nodes exchange the root of the Merkle tree
corresponding to the key ranges that they host in common.
Subsequently, using the tree traversal scheme described above the
nodes determine if they have any differences and perform the
appropriate synchronization action. The disadvantage with this
scheme is that many key ranges change when a node joins or
leaves the system thereby requiring the tree(s) to be recalculated.
This issue is addressed, however, by the refined partitioning
scheme described in Section 6.2.

Using hinted handoff, Dynamo ensures that read and write operations do not fail due to temporary node or network failures. Applications that need the highest level of availability can set W to 1, which ensures that a write is accepted as long as a single node in the system has durably written the key to its local
store. Thus, the write request is only rejected if all nodes in the
system are unavailable. However, in practice, most Amazon
services in production set a higher W to meet the desired level of
durability. A more detailed discussion of configuring N, R and W
follows in Section 6.

4.8 Membership and Failure Detection

4.8.1 Ring Membership

In Amazon's environment, node outages (due to failures and maintenance tasks) are often transient but may last for extended
intervals. A node outage rarely signifies a permanent departure
and therefore should not result in rebalancing of the partition
assignment or repair of the unreachable replicas. Similarly,
manual error could result in the unintentional startup of new
Dynamo nodes. For these reasons, it was deemed appropriate to
use an explicit mechanism to initiate the addition and removal of
nodes from a Dynamo ring. An administrator uses a command
line tool or a browser to connect to a Dynamo node and issue a
membership change to join a node to a ring or remove a node
from a ring. The node that serves the request writes the
membership change and its time of issue to persistent store. The
membership changes form a history because nodes can be
removed and added back multiple times. A gossip-based protocol
propagates membership changes and maintains an eventually
consistent view of membership. Each node contacts a peer chosen
at random every second and the two nodes efficiently reconcile
their persisted membership change histories.

It is imperative that a highly available storage system be capable of handling the failure of an entire data center, or even several. Data center
failures happen due to power outages, cooling failures, network
failures, and natural disasters. Dynamo is configured such that
each object is replicated across multiple data centers. In essence,
the preference list of a key is constructed such that the storage
nodes are spread across multiple data centers. These datacenters
are connected through high speed network links. This scheme of
replicating across multiple datacenters allows us to handle entire
data center failures without a data outage.

4.7 Handling permanent failures: Replica synchronization

When a node starts for the first time, it chooses its set of tokens
(virtual nodes in the consistent hash space) and maps nodes to
their respective token sets. The mapping is persisted on disk and

Hinted handoff works best if the system membership churn is low and node failures are transient. There are scenarios under which
hinted replicas become unavailable before they can be returned to


initially contains only the local node and token set. The mappings
stored at different Dynamo nodes are reconciled during the same
communication exchange that reconciles the membership change
histories. Therefore, partitioning and placement information also
propagates via the gossip-based protocol and each storage node is
aware of the token ranges handled by its peers. This allows each
node to forward a key's read/write operations to the right set of
nodes directly.

us consider a simple bootstrapping scenario where node X is added to the ring shown in Figure 2 between A and B. When X is
added to the system, it is in charge of storing keys in the ranges
(F, G], (G, A] and (A, X]. As a consequence, nodes B, C and D no
longer have to store the keys in these respective ranges.
Therefore, nodes B, C, and D will offer the appropriate sets of keys to X and, upon confirmation from X, will transfer them. When a node is
removed from the system, the reallocation of keys happens in a
reverse process.

4.8.2 External Discovery

The mechanism described above could temporarily result in a logically partitioned Dynamo ring. For example, the administrator could contact node A to join A to the ring, then
contact node B to join B to the ring. In this scenario, nodes A and
B would each consider itself a member of the ring, yet neither
would be immediately aware of the other. To prevent logical
partitions, some Dynamo nodes play the role of seeds. Seeds are
nodes that are discovered via an external mechanism and are
known to all nodes. Because all nodes eventually reconcile their
membership with a seed, logical partitions are highly unlikely.
Seeds can be obtained either from static configuration or from a
configuration service. Typically seeds are fully functional nodes
in the Dynamo ring.

Operational experience has shown that this approach distributes the load of key distribution uniformly across the storage nodes,
which is important to meet the latency requirements and to ensure
fast bootstrapping. Finally, by adding a confirmation round
between the source and the destination, it is made sure that the
destination node does not receive any duplicate transfers for a
given key range.

Dynamo's local persistence component allows for different storage engines to be plugged in. Engines that are in use are
Berkeley Database (BDB) Transactional Data Store2, BDB Java
Edition, MySQL, and an in-memory buffer with persistent
backing store. The main reason for designing a pluggable
persistence component is to choose the storage engine best suited
for an application's access patterns. For instance, BDB can handle objects typically in the order of tens of kilobytes whereas MySQL can handle objects of larger sizes. Applications choose Dynamo's local persistence engine based on their object size distribution. The majority of Dynamo's production instances use BDB
Transactional Data Store.

5. IMPLEMENTATION
In Dynamo, each storage node has three main software
components: request coordination, membership and failure
detection, and a local persistence engine. All these components
are implemented in Java.

4.8.3 Failure Detection

Failure detection in Dynamo is used to avoid attempts to communicate with unreachable peers during get() and put()
operations and when transferring partitions and hinted replicas.
For the purpose of avoiding failed attempts at communication, a
purely local notion of failure detection is entirely sufficient: node
A may consider node B failed if node B does not respond to node
A's messages (even if B is responsive to node C's messages). In the presence of a steady rate of client requests generating inter-node communication in the Dynamo ring, a node A quickly
discovers that a node B is unresponsive when B fails to respond to
a message; Node A then uses alternate nodes to service requests
that map to B's partitions; A periodically retries B to check for the
latter's recovery. In the absence of client requests to drive traffic
between two nodes, neither node really needs to know whether the
other is reachable and responsive.
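A purely local failure detector of this kind can be sketched in a few lines; the timeout policy and method names are assumptions for the illustration.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a purely local failure view: a peer is considered failed only by the node
// whose requests it stops answering, and it is retried periodically to detect recovery.
public class LocalFailureDetector {
    private final long timeoutMillis;
    private final Map<String, Long> lastResponse = new ConcurrentHashMap<>();

    LocalFailureDetector(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    void recordResponse(String peer) { lastResponse.put(peer, System.currentTimeMillis()); }

    // No response within the timeout while we were talking to it: treat the peer as failed
    // locally and route the affected requests to alternate nodes in the preference list.
    boolean consideredFailed(String peer) {
        Long last = lastResponse.get(peer);
        return last != null && System.currentTimeMillis() - last > timeoutMillis;
    }
}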

The request coordination component is built on top of an event-driven messaging substrate where the message processing pipeline
is split into multiple stages similar to the SEDA architecture [24].
All communications are implemented using Java NIO channels.
The coordinator executes the read and write requests on behalf of
clients by collecting data from one or more nodes (in the case of
reads) or storing data at one or more nodes (for writes). Each
client request results in the creation of a state machine on the node
that received the client request. The state machine contains all the
logic for identifying the nodes responsible for a key, sending the
requests, waiting for responses, potentially doing retries,
processing the replies and packaging the response to the client.
Each state machine instance handles exactly one client request.
For instance, a read operation implements the following state
machine: (i) send read requests to the nodes, (ii) wait for the minimum number of required responses, (iii) if too few replies
were received within a given time bound, fail the request, (iv)
otherwise gather all the data versions and determine the ones to be
returned and (v) if versioning is enabled, perform syntactic
reconciliation and generate an opaque write context that contains
the vector clock that subsumes all the remaining versions. For the
sake of brevity the failure handling and retry states are left out.
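The read path can be sketched as a small state machine whose states mirror steps (i)-(v) above; the names and the single-step driver are hypothetical, and, as in the text, failure handling and retry states are omitted.

// Sketch of the read-request state machine described above; failure handling and
// retries are omitted, as in the text.
public class ReadStateMachine {
    enum State { SEND_READS, WAIT_FOR_R, FAIL, GATHER_VERSIONS, RECONCILE_AND_RESPOND, DONE }

    private State state = State.SEND_READS;

    // One call per event; a real implementation would be driven by the messaging substrate.
    State step(int responsesSoFar, int requiredR, boolean deadlineExceeded) {
        switch (state) {
            case SEND_READS:
                state = State.WAIT_FOR_R;                                        // (i) requests sent
                break;
            case WAIT_FOR_R:
                if (responsesSoFar >= requiredR) state = State.GATHER_VERSIONS;  // (ii) enough replies
                else if (deadlineExceeded) state = State.FAIL;                   // (iii) too few in time
                break;
            case GATHER_VERSIONS:
                state = State.RECONCILE_AND_RESPOND;                             // (iv) keep causally unrelated versions
                break;
            case RECONCILE_AND_RESPOND:
                state = State.DONE;                                              // (v) syntactic reconciliation + context
                break;
            default:
                break;
        }
        return state;
    }
}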

Decentralized failure detection protocols use a simple gossip-style protocol that enables each node in the system to learn about the
arrival (or departure) of other nodes. For detailed information on
decentralized failure detectors and the parameters affecting their
accuracy, the interested reader is referred to [8]. Early designs of
Dynamo used a decentralized failure detector to maintain a
globally consistent view of failure state. Later it was determined
that the explicit node join and leave methods obviate the need for
a global view of failure state. This is because nodes are notified of
permanent node additions and removals by the explicit node join
and leave methods and temporary node failures are detected by
the individual nodes when they fail to communicate with others
(while forwarding requests).

4.9 Adding/Removing Storage Nodes

When a new node (say X) is added into the system, it gets assigned a number of tokens that are randomly scattered on the
ring. For every key range that is assigned to node X, there may be
a number of nodes (less than or equal to N) that are currently in
charge of handling keys that fall within its token range. Due to the
allocation of key ranges to X, some existing nodes no longer have to store some of their keys, and these nodes transfer those keys to X. Let

After the read response has been returned to the caller the state

2 http://www.oracle.com/database/berkeley-db.html

Figure 4: Average and 99.9th percentiles of latencies for read and write requests during our peak request season of December 2006. The intervals between consecutive ticks on the x-axis correspond to 12 hours. Latencies follow a diurnal pattern similar to the request rate, and 99.9th percentile latencies are an order of magnitude higher than averages.

Figure 5: Comparison of performance of 99.9th percentile latencies for buffered vs. non-buffered writes over a period of 24 hours. The intervals between consecutive ticks on the x-axis correspond to one hour.

machine waits for a small period of time to receive any outstanding responses. If stale versions were returned in any of the responses, the coordinator updates those nodes with the latest version. This process is called "read repair" because it repairs
replicas that have missed a recent update at an opportunistic time
and relieves the anti-entropy protocol from having to do it.

Timestamp based reconciliation: This case differs from the previous one only in the reconciliation mechanism. In case of divergent versions, Dynamo performs simple timestamp based reconciliation logic of "last write wins"; i.e., the object with the largest physical timestamp value is chosen as the correct version. The service that maintains customers' session information is a good example of a service that uses this mode.

High performance read engine: While Dynamo is built to be an "always writeable" data store, a few services are tuning its
quorum characteristics and using it as a high performance
read engine. Typically, these services have a high read
request rate and only a small number of updates. In this
configuration, typically R is set to be 1 and W to be N. For
these services, Dynamo provides the ability to partition and
replicate their data across multiple nodes thereby offering
incremental scalability. Some of these instances function as
the authoritative persistence cache for data stored in more
heavyweight backing stores. Services that maintain product
catalog and promotional items fit in this category.

As noted earlier, write requests are coordinated by one of the top N nodes in the preference list. Although it is desirable always to
have the first node among the top N to coordinate the writes
thereby serializing all writes at a single location, this approach has
led to uneven load distribution resulting in SLA violations. This is
because the request load is not uniformly distributed across
objects. To counter this, any of the top N nodes in the preference
list is allowed to coordinate the writes. In particular, since each
write usually follows a read operation, the coordinator for a write
is chosen to be the node that replied fastest to the previous read
operation which is stored in the context information of the
request. This optimization enables us to pick the node that has the
data that was read by the preceding read operation thereby
increasing the chances of getting "read-your-writes" consistency.
It also reduces variability in the performance of the request
handling which improves the performance at the 99.9 percentile.

The main advantage of Dynamo is that its client applications can tune the values of N, R and W to achieve their desired levels of
performance, availability and durability. For instance, the value of
N determines the durability of each object. A typical value of N
used by Dynamos users is 3.

6. EXPERIENCES & LESSONS LEARNED


Dynamo is used by several services with different configurations.
These instances differ by their version reconciliation logic, and
read/write quorum characteristics. The following are the main
patterns in which Dynamo is used:

The values of W and R impact object availability, durability and consistency. For instance, if W is set to 1, then the system will
never reject a write request as long as there is at least one node in
the system that can successfully process a write request. However,
low values of W and R can increase the risk of inconsistency as
write requests are deemed successful and returned to the clients
even if they are not processed by a majority of the replicas. This
also introduces a vulnerability window for durability when a write
request is successfully returned to the client even though it has
been persisted at only a small number of nodes.

Business logic specific reconciliation: This is a popular use case for Dynamo. Each data object is replicated across
multiple nodes. In case of divergent versions, the client
application performs its own reconciliation logic. The
shopping cart service discussed earlier is a prime example of
this category. Its business logic reconciles objects by
merging different versions of a customer's shopping cart.


significant difference in request rate between the daytime and night). Moreover, the write latencies are higher than read latencies, obviously because write operations always result in disk access.
obviously because write operations always results in disk access.
Also, the 99.9th percentile latencies are around 200 ms and are an
order of magnitude higher than the averages. This is because the
99.9th percentile latencies are affected by several factors such as
variability in request load, object sizes, and locality patterns.
While this level of performance is acceptable for a number of
services, a few customer-facing services required higher levels of
performance. For these services, Dynamo provides the ability to
trade-off durability guarantees for performance. In the
optimization each storage node maintains an object buffer in its
main memory. Each write operation is stored in the buffer and
gets periodically written to storage by a writer thread. In this
scheme, read operations first check if the requested key is present
in the buffer. If so, the object is read from the buffer instead of the
storage engine.
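A rough sketch of the write-buffering optimization, assuming a bounded in-memory map, a periodic flusher thread, and a pluggable storage-engine interface (all hypothetical names):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the object buffer: writes land in memory and a writer thread flushes them
// to the storage engine periodically; reads check the buffer before the engine.
public class BufferedStore {
    interface Engine { void write(String key, byte[] value); byte[] read(String key); }

    private final Engine engine;
    private final Map<String, byte[]> buffer = new ConcurrentHashMap<>();
    private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();

    BufferedStore(Engine engine, long flushPeriodMillis) {
        this.engine = engine;
        flusher.scheduleAtFixedRate(this::flush, flushPeriodMillis, flushPeriodMillis, TimeUnit.MILLISECONDS);
    }

    void put(String key, byte[] value) { buffer.put(key, value); } // durability is traded for latency here

    byte[] get(String key) {
        byte[] inBuffer = buffer.get(key);
        return inBuffer != null ? inBuffer : engine.read(key);
    }

    private void flush() {
        for (Map.Entry<String, byte[]> e : buffer.entrySet()) {
            engine.write(e.getKey(), e.getValue());
            buffer.remove(e.getKey(), e.getValue()); // keep entries that were overwritten during the flush
        }
    }
}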

Figure 6: Fraction of nodes that are out-of-balance (i.e., nodes whose request load is above a certain threshold from the average system load) and their corresponding request load. The interval between ticks on the x-axis corresponds to a time period of 30 minutes.

This optimization has resulted in lowering the 99.9th percentile latency by a factor of 5 during peak traffic even for a very small
buffer of a thousand objects (see Figure 5). Also, as seen in the
figure, write buffering smoothes out higher percentile latencies.
Obviously, this scheme trades durability for performance. In this
scheme, a server crash can result in missing writes that were
queued up in the buffer. To reduce the durability risk, the write
operation is refined to have the coordinator choose one out of the
N replicas to perform a "durable write". Since the coordinator
waits only for W responses, the performance of the write
operation is not affected by the performance of the durable write
operation performed by a single replica.
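
The following sketch is one way to picture this buffered write path, under a single-process toy model: the Replica class, coordinate_put, and the choice of which replica performs the durable write are illustrative stand-ins, not Dynamo's implementation.

```python
import random

class Replica:
    """Storage node with an in-memory object buffer in front of the storage engine."""
    def __init__(self):
        self.buffer = {}   # recent writes; lost if this node crashes before a flush
        self.disk = {}     # stands in for the durable storage engine

    def buffered_write(self, key, value):
        self.buffer[key] = value            # fast path: acknowledge after buffering only

    def durable_write(self, key, value):
        self.buffer[key] = value
        self.disk[key] = value              # this replica also persists the write immediately

    def read(self, key):
        return self.buffer.get(key, self.disk.get(key))  # check the buffer before the store

    def flush(self):
        self.disk.update(self.buffer)       # done periodically by a writer thread in the text

def coordinate_put(preference_list, key, value, w):
    durable_one = random.choice(preference_list)  # exactly one of the N replicas writes durably
    acks = 0
    for replica in preference_list:
        if replica is durable_one:
            replica.durable_write(key, value)
        else:
            replica.buffered_write(key, value)
        acks += 1
    return acks >= w                        # the client sees success once W replicas have acked

replicas = [Replica() for _ in range(3)]
print(coordinate_put(replicas, "cart:42", ["item-1"], w=2))   # True
print(replicas[0].read("cart:42"))                            # ['item-1']
```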

Traditional wisdom holds that durability and availability go hand-in-hand. However, this is not necessarily true here. For instance,
the vulnerability window for durability can be decreased by
increasing W. This may increase the probability of rejecting
requests (thereby decreasing availability) because more storage
hosts need to be alive to process a write request.
The common (N,R,W) configuration used by several instances of
Dynamo is (3,2,2). These values are chosen to meet the necessary
levels of performance, durability, consistency, and availability
SLAs.

All the measurements presented in this section were taken on a


live system operating with a configuration of (3,2,2) and running
a couple hundred nodes with homogenous hardware
configurations. As mentioned earlier, each instance of Dynamo
contains nodes that are located in multiple datacenters. These
datacenters are typically connected through high speed network
links. Recall that to generate a successful get (or put) response R
(or W) nodes need to respond to the coordinator. Clearly, the
network latencies between datacenters affect the response time
and the nodes (and their datacenter locations) are chosen such that
the applications' target SLAs are met.

6.2 Ensuring Uniform Load distribution

Dynamo uses consistent hashing to partition its key space across


its replicas and to ensure uniform load distribution. A uniform key
distribution can help us achieve uniform load distribution
assuming the access distribution of keys is not highly skewed. In
particular, Dynamo's design assumes that even where there is a
significant skew in the access distribution there are enough keys
in the popular end of the distribution so that the load of handling
popular keys can be spread across the nodes uniformly through
partitioning. This section discusses the load imbalance seen in
Dynamo and the impact of different partitioning strategies on load
distribution.

To study the load imbalance and its correlation with request load, the total number of requests received by each node was measured for a period of 24 hours, broken down into intervals of 30 minutes. In a given time window, a node is considered to be in-balance if the node's request load deviates from the average load by less than a certain threshold (here 15%); otherwise the node is deemed out-of-balance. Figure 6 presents the fraction of nodes that are out-of-balance (henceforth, the imbalance ratio) during this time period. For reference, the corresponding request load received by the entire system during this time period is also plotted. As seen in the figure, the imbalance ratio decreases with increasing load. For instance, during low loads the imbalance ratio is as high as 20% and during high loads it is close to 10%. Intuitively, this can be explained by the fact that under high loads a large number of popular keys are accessed and, due to the uniform distribution of keys, the load is evenly distributed. However, during low loads (where load is 1/8th

6.1 Balancing Performance and Durability

While Dynamo's principal design goal is to build a highly


available data store, performance is an equally important criterion
in Amazons platform. As noted earlier, to provide a consistent
customer experience, Amazons services set their performance
targets at higher percentiles (such as the 99.9th or 99.99th
percentiles). A typical SLA required of services that use Dynamo
is that 99.9% of the read and write requests execute within 300ms.
Since Dynamo is run on standard commodity hardware
components that have far less I/O throughput than high-end
enterprise servers, providing consistently high performance for
read and write operations is a non-trivial task. The involvement of
multiple storage nodes in read and write operations makes it even
more challenging, since the performance of these operations is
limited by the slowest of the R or W replicas. Figure 4 shows the
average and 99.9th percentile latencies of Dynamo's read and
write operations during a period of 30 days. As seen in the figure,
the latencies exhibit a clear diurnal pattern which is a result of the
diurnal pattern in the incoming request rate (i.e., there is a


Figure 7: Partitioning and placement of keys in the three strategies. A, B, and C depict the three unique nodes that form the
preference list for the key k1 on the consistent hashing ring (N=3). The shaded area indicates the key range for which nodes A,
B, and C form the preference list. Dark arrows indicate the token locations for various nodes.

The fundamental issue with this strategy is that the schemes for
data partitioning and data placement are intertwined. For instance,
in some cases, it is preferred to add more nodes to the system in
order to handle an increase in request load. However, in this
scenario, it is not possible to add nodes without affecting data
partitioning. Ideally, it is desirable to use independent schemes for
partitioning and placement. To this end, the following strategies were
evaluated:

of the measured peak load), fewer popular keys are accessed,


resulting in a higher load imbalance.
This section discusses how Dynamo's partitioning scheme has
evolved over time and its implications on load distribution.
Strategy 1: T random tokens per node and partition by token
value: This was the initial strategy deployed in production (and
described in Section 4.2). In this scheme, each node is assigned T
tokens (chosen uniformly at random from the hash space). The
tokens of all nodes are ordered according to their values in the
hash space. Every two consecutive tokens define a range. The last
token and the first token form a range that "wraps" around from
the highest value to the lowest value in the hash space. Because
the tokens are chosen randomly, the ranges vary in size. As nodes
join and leave the system, the token set changes and consequently
the ranges change. Note that the space needed to maintain the
membership at each node increases linearly with the number of
nodes in the system.
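
A rough sketch of Strategy 1, assuming MD5 as the hash function and a small illustrative ring size; the node names, token count, and the preference_list walk are simplifications of the scheme described above, not Dynamo's actual code.

```python
import bisect, hashlib, random

HASH_SPACE = 2 ** 32                       # illustrative ring size, not Dynamo's actual space

def position(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % HASH_SPACE

def assign_tokens(nodes, tokens_per_node):
    """Strategy 1: each node picks T tokens uniformly at random from the hash space."""
    ring = []
    for node in nodes:
        for _ in range(tokens_per_node):
            ring.append((random.randrange(HASH_SPACE), node))
    ring.sort()                             # consecutive tokens define ranges of varying size
    return ring

def preference_list(ring, key, n):
    """Walk the ring clockwise from the key's position until N distinct nodes are found."""
    start = bisect.bisect(ring, (position(key),))
    owners, step = [], 0
    while len(owners) < n:
        node = ring[(start + step) % len(ring)][1]
        if node not in owners:
            owners.append(node)
        step += 1
    return owners

ring = assign_tokens(["A", "B", "C", "D"], tokens_per_node=8)
print(preference_list(ring, "cart:42", n=3))   # e.g. ['C', 'A', 'D']; depends on the random tokens
```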

Strategy 2: T random tokens per node and equal sized partitions:


In this strategy, the hash space is divided into Q equally sized
partitions/ranges and each node is assigned T random tokens. Q is
usually set such that Q >> N and Q >> S*T, where S is the
number of nodes in the system. In this strategy, the tokens are
only used to build the function that maps values in the hash space
to the ordered lists of nodes and not to decide the partitioning. A
partition is placed on the first N unique nodes that are encountered
while walking the consistent hashing ring clockwise from the end
of the partition. Figure 7 illustrates this strategy for N=3. In this
example, nodes A, B, C are encountered while walking the ring
from the end of the partition that contains key k1. The primary
advantages of this strategy are: (i) decoupling of partitioning and
partition placement, and (ii) enabling the possibility of changing
the placement scheme at runtime.

While using this strategy, the following problems were


encountered. First, when a new node joins the system, it needs to
steal its key ranges from other nodes. However, the nodes
handing the key ranges off to the new node have to scan their
local persistence store to retrieve the appropriate set of data items.
Note that performing such a scan operation on a production node
is tricky as scans are highly resource intensive operations and they
need to be executed in the background without affecting the
customer performance. This requires us to run the bootstrapping
task at the lowest priority. However, this significantly slows the
bootstrapping process, and during the busy shopping season, when the
nodes are handling millions of requests a day, the bootstrapping
has taken almost a day to complete. Second, when a node
joins/leaves the system, the key ranges handled by many nodes
change and the Merkle trees for the new ranges need to be
recalculated, which is a non-trivial operation to perform on a
production system. Finally, there was no easy way to take a
snapshot of the entire key space due to the randomness in key
ranges, and this made the process of archival complicated. In this
scheme, archiving the entire key space requires us to retrieve the
keys from each node separately, which is highly inefficient.

Strategy 3: Q/S tokens per node, equal-sized partitions: Similar to


strategy 2, this strategy divides the hash space into Q equally
sized partitions and the placement of partition is decoupled from
the partitioning scheme. Moreover, each node is assigned Q/S
tokens where S is the number of nodes in the system. When a
node leaves the system, its tokens are randomly distributed to the
remaining nodes such that these properties are preserved.
Similarly, when a node joins the system it "steals" tokens from
nodes in the system in a way that preserves these properties.
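
Here is a minimal sketch of the key ideas in Strategy 3, with an illustrative Q and a round-robin partition assignment standing in for the token-stealing protocol described above; partition_of and preference_list are hypothetical helpers, not Dynamo's API.

```python
import hashlib

Q = 12                                    # number of equal-sized partitions (illustrative)
HASH_SPACE = 2 ** 32

def partition_of(key):
    """Fixed, equal-sized partitions: the key's hash selects one of Q ranges."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16) % HASH_SPACE
    return h // (HASH_SPACE // Q)

def assign_partitions(nodes):
    """Strategy 3: spread the Q partitions so each node owns roughly Q/S of them."""
    return {p: nodes[p % len(nodes)] for p in range(Q)}   # round-robin stand-in for token stealing

def preference_list(placement, key, n):
    """Placement is decoupled from partitioning: take N distinct nodes starting at the
    key's partition and walking the fixed partition sequence."""
    owners, p = [], partition_of(key)
    while len(owners) < n:
        node = placement[p % Q]
        if node not in owners:
            owners.append(node)
        p += 1
    return owners

placement = assign_partitions(["A", "B", "C", "D"])
print(partition_of("cart:42"), preference_list(placement, "cart:42", n=3))
```
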
The efficiency of these three strategies is evaluated for a system
with S=30 and N=3. However, comparing these different
strategies in a fair manner is hard as different strategies have
different configurations to tune their efficiency. For instance, the
load distribution property of strategy 1 depends on the number of
tokens (i.e., T) while strategy 3 depends on the number of
partitions (i.e., Q). One fair way to compare these strategies is to


6.3 Divergent Versions: When and How Many?

As noted earlier, Dynamo is designed to trade off consistency for


availability. To understand the precise impact of different failures
on consistency, detailed data is required on multiple factors:
outage length, type of failure, component reliability, workload etc.
Presenting these numbers in detail is outside of the scope of this
paper. However, this section discusses a good summary metric:
the number of divergent versions seen by the application in a live
production environment.


Divergent versions of a data item arise in two scenarios. The first


is when the system is facing failure scenarios such as node
failures, data center failures, and network partitions. The second is
when the system is handling a large number of concurrent writers
to a single data item and multiple nodes end up coordinating the
updates concurrently. From both a usability and efficiency
perspective, it is preferred to keep the number of divergent
versions at any given time as low as possible. If the versions
cannot be syntactically reconciled based on vector clocks alone,
they have to be passed to the business logic for semantic
reconciliation. Semantic reconciliation introduces additional load
on services, so it is desirable to minimize the need for it.


Figure 8: Comparison of the load distribution efficiency of


different strategies for system with 30 nodes and N=3 with
equal amount of metadata maintained at each node. The
values of the system size and number of replicas are based on
the typical configuration deployed for majority of our
services.

evaluate the skew in their load distribution while all strategies use
the same amount of space to maintain their membership
information. For instance, in strategy 1 each node needs to
maintain the token positions of all the nodes in the ring and in
strategy 3 each node needs to maintain the information regarding
the partitions assigned to each node.

In our next experiment, the number of versions returned to the


shopping cart service was profiled for a period of 24 hours.
During this period, 99.94% of requests saw exactly one version;
0.00057% of requests saw 2 versions; 0.00047% of requests saw 3
versions and 0.00009% of requests saw 4 versions. This shows
that divergent versions are created rarely.

In our next experiment, these strategies were evaluated by varying


the relevant parameters (T and Q). The load balancing efficiency
of each strategy was measured for different sizes of membership
information that needs to be maintained at each node, where load balancing efficiency is defined as the ratio of the average number of requests served by each node to the maximum number of requests served by the hottest node.
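
As a tiny worked example of this metric (the per-node request counts below are made up):

```python
def load_balancing_efficiency(requests_per_node):
    """Ratio of the average per-node request load to the load on the hottest node."""
    return (sum(requests_per_node) / len(requests_per_node)) / max(requests_per_node)

# Hypothetical per-node request counts for one 30-minute window.
print(round(load_balancing_efficiency([980, 1020, 1010, 990, 1400]), 2))  # ~0.77
```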

Experience shows that increases in the number of divergent versions are caused not by failures but by an increase in the number of concurrent writers. The increase in the number of concurrent writes is usually triggered by busy robots (automated client programs) and rarely by humans. This issue is not discussed in detail due to the sensitive nature of the story.

The results are given in Figure 8. As seen in the figure, strategy 3


achieves the best load balancing efficiency and strategy 2 has the
worst load balancing efficiency. For a brief time, Strategy 2
served as an interim setup during the process of migrating
Dynamo instances from using Strategy 1 to Strategy 3. Compared
to Strategy 1, Strategy 3 achieves better efficiency and reduces the
size of membership information maintained at each node by three
orders of magnitude. While storage is not a major issue, the nodes gossip the membership information periodically and as such it is desirable to keep this information as compact as possible. In addition to this, strategy 3 is advantageous and simpler to deploy for the following reasons: (i) Faster bootstrapping/recovery: Since partition ranges are fixed, they can be stored in separate files, meaning a partition can be relocated as a unit by simply transferring the file (avoiding random accesses needed to locate specific items). This simplifies the process of bootstrapping and recovery. (ii) Ease of archival: Periodic archiving of the dataset is a mandatory requirement for most of Amazon's storage services. Archiving the entire dataset stored by Dynamo is simpler in strategy 3 because the partition files can be archived separately. By contrast, in Strategy 1 the tokens are chosen randomly, and archiving the data stored in Dynamo requires retrieving the keys from individual nodes separately, which is usually inefficient and slow. The disadvantage of strategy 3 is that changing the node membership requires coordination in order to preserve the properties required of the assignment.

6.4 Client-driven or Server-driven Coordination
As mentioned in Section 5, Dynamo has a request coordination
component that uses a state machine to handle incoming requests.
Client requests are uniformly assigned to nodes in the ring by a
load balancer. Any Dynamo node can act as a coordinator for a
read request. Write requests, on the other hand, will be coordinated by a node in the key's current preference list. This restriction is due to the fact that these preferred nodes have the added responsibility of creating a new version stamp that causally subsumes the version that has been updated by the write request. Note that if Dynamo's versioning scheme is based on physical timestamps, any node can coordinate a write request.
An alternative approach to request coordination is to move the
state machine to the client nodes. In this scheme client
applications use a library to perform request coordination locally.
A client periodically picks a random Dynamo node and
downloads its current view of Dynamo membership state. Using
this information the client can determine which set of nodes form
the preference list for any given key. Read requests can be
coordinated at the client node thereby avoiding the extra network
hop that is incurred if the request were assigned to a random
Dynamo node by the load balancer. Writes will either be
forwarded to a node in the key's preference list or can be


shared across all background tasks. A feedback mechanism based


on the monitored performance of the foreground tasks is
employed to change the number of slices that are available to the
background tasks.

Table 2: Performance of client-driven and server-driven coordination approaches.

                  99.9th percentile    99.9th percentile    Average read    Average write
                  read latency (ms)    write latency (ms)   latency (ms)    latency (ms)
Server-driven     68.9                 68.5                 3.9             4.02
Client-driven     30.4                 30.4                 1.55            1.9

The admission controller constantly monitors the behavior of


resource accesses while executing a "foreground" put/get
operation. Monitored aspects include latencies for disk operations,
failed database accesses due to lock-contention and transaction
timeouts, and request queue wait times. This information is used
to check whether the percentiles of latencies (or failures) in a
given trailing time window are close to a desired threshold. For
example, the background controller checks to see how close the
99th percentile database read latency (over the last 60 seconds) is
to a preset threshold (say 50ms). The controller uses such
comparisons to assess the resource availability for the foreground
operations. Subsequently, it decides on how many time slices will
be available to background tasks, thereby using the feedback loop
to limit the intrusiveness of the background activities. Note that a
similar problem of managing background tasks has been studied
in [4].
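
A toy sketch of the feedback step described above; the window of latencies, the 50 ms threshold, and the notion of "slices" come from the text, while the function shape, step size, and example numbers are assumptions for illustration.

```python
def background_slices(recent_read_latencies_ms, threshold_ms=50.0,
                      max_slices=10, current_slices=10):
    """Feedback step: compare the trailing 99th-percentile foreground latency to a
    preset threshold and grant the background tasks more or fewer resource slices."""
    ordered = sorted(recent_read_latencies_ms)
    p99 = ordered[int(0.99 * (len(ordered) - 1))]
    if p99 > threshold_ms:                       # foreground is suffering: back off
        return max(0, current_slices - 1)
    return min(max_slices, current_slices + 1)   # headroom available: give slices back

# Hypothetical latencies (ms) observed for foreground reads over the last 60 seconds.
window = [4.0] * 95 + [20.0, 30.0, 45.0, 80.0, 120.0]
print(background_slices(window, threshold_ms=50.0, current_slices=6))   # 5: p99 exceeds 50 ms
```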

coordinated locally if Dynamo is using timestamp-based versioning.

An important advantage of the client-driven coordination approach is that a load balancer is no longer required to uniformly distribute client load. Fair load distribution is implicitly guaranteed by the near uniform assignment of keys to the storage nodes. Obviously, the efficiency of this scheme is dependent on how fresh the membership information is at the client. Currently clients poll a random Dynamo node every 10 seconds for membership updates. A pull-based approach was chosen over a push-based one as the former scales better with a large number of clients and requires very little state to be maintained at servers regarding clients. However, in the worst case the client can be exposed to stale membership for a duration of 10 seconds. If the client detects that its membership table is stale (for instance, when some members are unreachable), it will immediately refresh its membership information.
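
A condensed sketch of such a client library, assuming a callable that returns a node's current membership view (here a fake ring of five nodes); the class name, view layout, and preference-list computation are invented for illustration and are not Dynamo's client API.

```python
import random, time

POLL_INTERVAL = 10.0          # seconds between membership polls, as in the text

class ClientCoordinator:
    """Sketch of client-driven coordination: cache a randomly chosen node's membership
    view, refresh it every 10 seconds (or on demand when it looks stale), and compute
    the preference list locally instead of going through a load balancer."""
    def __init__(self, fetch_view):
        self.fetch_view = fetch_view      # callable standing in for "poll a random node"
        self.view, self.last_poll = None, 0.0

    def _refresh(self, force=False):
        # force=True models the immediate refresh when a stale table is detected
        if force or self.view is None or time.time() - self.last_poll > POLL_INTERVAL:
            self.view = self.fetch_view()
            self.last_poll = time.time()

    def preference_list(self, key, n=3):
        self._refresh()
        ring = sorted(self.view)          # view: {ring position: node}, hypothetical layout
        start = hash(key) % len(ring)
        return [self.view[ring[(start + i) % len(ring)]] for i in range(n)]

# Hypothetical membership view: ring position -> node name.
fake_view = lambda: {0: "A", 1: "B", 2: "C", 3: "D", 4: "E"}
client = ClientCoordinator(fake_view)
print(client.preference_list("cart:42"))
```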

6.6 Discussion

This section summarizes some of the experiences gained during


the process of implementation and maintenance of Dynamo.
Many Amazon internal services have used Dynamo for the past
two years and it has provided significant levels of availability to
its applications. In particular, applications have received successful responses (without timing out) for 99.9995% of their requests, and no data loss event has occurred to date.
Moreover, the primary advantage of Dynamo is that it provides the necessary knobs, in the form of the three parameters (N, R, W), for applications to tune their instance based on their needs. Unlike popular commercial
data stores, Dynamo exposes data consistency and reconciliation
logic issues to the developers. At the outset, one may expect the
application logic to become more complex. However, historically,
Amazon's platform is built for high availability and many
applications are designed to handle different failure modes and
inconsistencies that may arise. Hence, porting such applications to
use Dynamo was a relatively simple task. For new applications
that want to use Dynamo, some analysis is required during the
initial stages of the development to pick the right conflict
resolution mechanisms that meet the business case appropriately.
Finally, Dynamo adopts a full membership model where each
node is aware of the data hosted by its peers. To do this, each
node actively gossips the full routing table with other nodes in the
system. This model works well for a system that contains a couple of hundred nodes. However, scaling such a design to run with
tens of thousands of nodes is not trivial because the overhead in
maintaining the routing table increases with the system size. This
limitation might be overcome by introducing hierarchical
extensions to Dynamo. Also, note that this problem is actively
addressed by O(1) DHT systems (e.g., [14]).

Table 2 shows the latency improvements at the 99.9th percentile


and averages that were observed for a period of 24 hours using
client-driven coordination compared to the server-driven
approach. As seen in the table, the client-driven coordination
approach reduces the latencies by at least 30 milliseconds for
99.9th percentile latencies and decreases the average by 3 to 4
milliseconds. The latency improvement is because the client-driven approach eliminates the overhead of the load balancer and
the extra network hop that may be incurred when a request is
assigned to a random node. As seen in the table, average latencies
tend to be significantly lower than latencies at the 99.9th
percentile. This is because Dynamo's storage engine caches and
write buffer have good hit ratios. Moreover, since the load
balancers and network introduce additional variability to the
response time, the gain in response time is higher for the 99.9th
percentile than the average.

6.5 Balancing background vs. foreground tasks
Each node performs different kinds of background tasks for
replica synchronization and data handoff (either due to hinting or
adding/removing nodes) in addition to its normal foreground
put/get operations. In early production settings, these background
tasks triggered the problem of resource contention and affected
the performance of the regular put and get operations. Hence, it
became necessary to ensure that background tasks ran only when
the regular critical operations are not affected significantly. To
this end, the background tasks were integrated with an admission
control mechanism. Each of the background tasks uses this
controller to reserve runtime slices of the resource (e.g. database),

7. CONCLUSIONS
This paper described Dynamo, a highly available and scalable
data store, used for storing the state of a number of core services of
Amazon.com's e-commerce platform. Dynamo has provided the
desired levels of availability and performance and has been
successful in handling server failures, data center failures and
network partitions. Dynamo is incrementally scalable and allows
service owners to scale up and down based on their current


Principles of Distributed Computing (Newport, Rhode


Island, United States). PODC '01. ACM Press, New York,
NY, 170-179.

request load. Dynamo allows service owners to customize their


storage system to meet their desired performance, durability and
consistency SLAs by allowing them to tune the parameters N, R,
and W.

[9] Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton,
P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H.,
Wells, C., and Zhao, B. 2000. OceanStore: an architecture
for global-scale persistent storage. SIGARCH Comput.
Archit. News 28, 5 (Dec. 2000), 190-201.

The production use of Dynamo for the past year demonstrates that decentralized techniques can be combined to provide a single highly-available system. Its success in one of the most challenging application environments shows that an eventually-consistent storage system can be a building block for highly available applications.

[10] Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine,
M., and Lewin, D. 1997. Consistent hashing and random
trees: distributed caching protocols for relieving hot spots on
the World Wide Web. In Proceedings of the Twenty-Ninth
Annual ACM Symposium on theory of Computing (El Paso,
Texas, United States, May 04 - 06, 1997). STOC '97. ACM
Press, New York, NY, 654-663.

ACKNOWLEDGEMENTS
The authors would like to thank Pat Helland for his contribution
to the initial design of Dynamo. We would also like to thank
Marvin Theimer and Robert van Renesse for their comments.
Finally, we would like to thank our shepherd, Jeff Mogul, for his
detailed comments and inputs while preparing the camera ready
version that vastly improved the quality of the paper.

[11] Lindsay, B.G., et al., Notes on Distributed Databases,


Research Report RJ2571(33471), IBM Research, July 1979
[12] Lamport, L. Time, clocks, and the ordering of events in a
distributed system. Communications of the ACM, 21(7), pp. 558-565, 1978.

REFERENCES
[1] Adya, A., Bolosky, W. J., Castro, M., Cermak, G., Chaiken,
R., Douceur, J. R., Howell, J., Lorch, J. R., Theimer, M., and
Wattenhofer, R. P. 2002. Farsite: federated, available, and
reliable storage for an incompletely trusted environment.
SIGOPS Oper. Syst. Rev. 36, SI (Dec. 2002), 1-14.
[13] Merkle, R. A digital signature based on a conventional


encryption function. Proceedings of CRYPTO, pages 369-378. Springer-Verlag, 1988.
[14] Ramasubramanian, V., and Sirer, E. G. Beehive: O(1) lookup performance for power-law query distributions in peer-to-peer overlays. In Proceedings of the 1st Conference on
Symposium on Networked Systems Design and
Implementation, San Francisco, CA, March 29 - 31, 2004.

[2] Bernstein, P.A., and Goodman, N. An algorithm for


concurrency control and recovery in replicated distributed
databases. ACM Trans. on Database Systems, 9(4):596-615,
December 1984

[3] Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach,
D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R.
E. 2006. Bigtable: a distributed storage system for structured
data. In Proceedings of the 7th Conference on USENIX
Symposium on Operating Systems Design and
Implementation - Volume 7 (Seattle, WA, November 06 - 08,
2006). USENIX Association, Berkeley, CA, 15-15.

[15] Reiher, P., Heidemann, J., Ratner, D., Skinner, G., and
Popek, G. 1994. Resolving file conflicts in the Ficus file
system. In Proceedings of the USENIX Summer 1994
Technical Conference on USENIX Summer 1994 Technical
Conference - Volume 1 (Boston, Massachusetts, June 06 - 10,
1994). USENIX Association, Berkeley, CA, 12-12.
[16] Rowstron, A., and Druschel, P. Pastry: Scalable,
decentralized object location and routing for large-scale peer-to-peer systems. Proceedings of Middleware, pages 329-350,
November, 2001.

[4] Douceur, J. R. and Bolosky, W. J. 2000. Process-based


regulation of low-importance processes. SIGOPS Oper. Syst.
Rev. 34, 2 (Apr. 2000), 26-27.
[5] Fox, A., Gribble, S. D., Chawathe, Y., Brewer, E. A., and
Gauthier, P. 1997. Cluster-based scalable network services.
In Proceedings of the Sixteenth ACM Symposium on
Operating Systems Principles (Saint Malo, France, October
05 - 08, 1997). W. M. Waite, Ed. SOSP '97. ACM Press,
New York, NY, 78-91.

[17] Rowstron, A., and Druschel, P. Storage management and


caching in PAST, a large-scale, persistent peer-to-peer
storage utility. Proceedings of Symposium on Operating
Systems Principles, October 2001.
[18] Saito, Y., Frølund, S., Veitch, A., Merchant, A., and Spence,
S. 2004. FAB: building distributed enterprise disk arrays
from commodity components. SIGOPS Oper. Syst. Rev. 38, 5
(Dec. 2004), 48-58.

[6] Ghemawat, S., Gobioff, H., and Leung, S. 2003. The Google
file system. In Proceedings of the Nineteenth ACM
Symposium on Operating Systems Principles (Bolton
Landing, NY, USA, October 19 - 22, 2003). SOSP '03. ACM
Press, New York, NY, 29-43.

[19] Satyanarayanan, M., Kistler, J.J., Siegel, E.H. Coda: A


Resilient Distributed File System. IEEE Workshop on
Workstation Operating Systems, Nov. 1987.

[7] Gray, J., Helland, P., O'Neil, P., and Shasha, D. 1996. The
dangers of replication and a solution. In Proceedings of the
1996 ACM SIGMOD international Conference on
Management of Data (Montreal, Quebec, Canada, June 04 06, 1996). J. Widom, Ed. SIGMOD '96. ACM Press, New
York, NY, 173-182.

[20] Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., and
Balakrishnan, H. 2001. Chord: A scalable peer-to-peer
lookup service for internet applications. In Proceedings of
the 2001 Conference on Applications, Technologies,
Architectures, and Protocols For Computer Communications
(San Diego, California, United States). SIGCOMM '01.
ACM Press, New York, NY, 149-160.

[8] Gupta, I., Chandra, T. D., and Goldszmidt, G. S. 2001. On


scalable and efficient distributed failure detectors. In
Proceedings of the Twentieth Annual ACM Symposium on


[21] Terry, D. B., Theimer, M. M., Petersen, K., Demers, A. J.,


Spreitzer, M. J., and Hauser, C. H. 1995. Managing update
conflicts in Bayou, a weakly connected replicated storage
system. In Proceedings of the Fifteenth ACM Symposium on
Operating Systems Principles (Copper Mountain, Colorado,
United States, December 03 - 06, 1995). M. B. Jones, Ed.
SOSP '95. ACM Press, New York, NY, 172-182.

[23] Weatherspoon, H., Eaton, P., Chun, B., and Kubiatowicz, J.


2007. Antiquity: exploiting a secure log for wide-area
distributed storage. SIGOPS Oper. Syst. Rev. 41, 3 (Jun.
2007), 371-384.
[24] Welsh, M., Culler, D., and Brewer, E. 2001. SEDA: an
architecture for well-conditioned, scalable internet services.
In Proceedings of the Eighteenth ACM Symposium on
Operating Systems Principles (Banff, Alberta, Canada,
October 21 - 24, 2001). SOSP '01. ACM Press, New York,
NY, 230-243.

[22] Thomas, R. H. A majority consensus approach to


concurrency control for multiple copy databases. ACM
Transactions on Database Systems 4 (2): 180-209, 1979.


Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center


Benjamin Hindman, Andy Konwinski, Matei Zaharia,
Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica
University of California, Berkeley

Abstract

Two common solutions for sharing a cluster today are


either to statically partition the cluster and run one framework per partition, or to allocate a set of VMs to each
framework. Unfortunately, these solutions achieve neither high utilization nor efficient data sharing. The main
problem is the mismatch between the allocation granularities of these solutions and of existing frameworks. Many
frameworks, such as Hadoop and Dryad, employ a fine-grained resource sharing model, where nodes are subdivided into slots and jobs are composed of short tasks
that are matched to slots [25, 38]. The short duration of
tasks and the ability to run multiple tasks per node allow
jobs to achieve high data locality, as each job will quickly
get a chance to run on nodes storing its input data. Short
tasks also allow frameworks to achieve high utilization,
as jobs can rapidly scale when new nodes become available. Unfortunately, because these frameworks are developed independently, there is no way to perform fine-grained sharing across frameworks, making it difficult to
share clusters and data efficiently between them.
In this paper, we propose Mesos, a thin resource sharing layer that enables fine-grained sharing across diverse
cluster computing frameworks, by giving frameworks a
common interface for accessing cluster resources.
The main design question for Mesos is how to build
a scalable and efficient system that supports a wide array of both current and future frameworks. This is challenging for several reasons. First, each framework will
have different scheduling needs, based on its programming model, communication pattern, task dependencies,
and data placement. Second, the scheduling system must
scale to clusters of tens of thousands of nodes running
hundreds of jobs with millions of tasks. Finally, because
all the applications in the cluster depend on Mesos, the
system must be fault-tolerant and highly available.
One approach would be for Mesos to implement a centralized scheduler that takes as input framework requirements, resource availability, and organizational policies,
and computes a global schedule for all tasks. While this

We present Mesos, a platform for sharing commodity clusters between multiple diverse cluster computing
frameworks, such as Hadoop and MPI. Sharing improves
cluster utilization and avoids per-framework data replication. Mesos shares resources in a fine-grained manner, allowing frameworks to achieve data locality by
taking turns reading data stored on each machine. To
support the sophisticated schedulers of today's frameworks, Mesos introduces a distributed two-level scheduling mechanism called resource offers. Mesos decides
how many resources to offer each framework, while
frameworks decide which resources to accept and which
computations to run on them. Our results show that
Mesos can achieve near-optimal data locality when sharing the cluster among diverse frameworks, can scale to
50,000 (emulated) nodes, and is resilient to failures.

Introduction

Clusters of commodity servers have become a major


computing platform, powering both large Internet services and a growing number of data-intensive scientific
applications. Driven by these applications, researchers
and practitioners have been developing a diverse array of
cluster computing frameworks to simplify programming
the cluster. Prominent examples include MapReduce
[18], Dryad [24], MapReduce Online [17] (which supports streaming jobs), Pregel [28] (a specialized framework for graph computations), and others [27, 19, 30].
It seems clear that new cluster computing frameworks1
will continue to emerge, and that no framework will be
optimal for all applications. Therefore, organizations
will want to run multiple frameworks in the same cluster,
picking the best one for each application. Multiplexing
a cluster between frameworks improves utilization and
allows applications to share access to large datasets that
may be too costly to replicate across clusters.
1 By framework, we mean a software system that manages and executes one or more jobs on a cluster.

approach can optimize scheduling across frameworks, it


faces several challenges. The first is complexity. The
scheduler would need to provide a sufficiently expressive API to capture all frameworks' requirements, and
to solve an online optimization problem for millions
of tasks. Even if such a scheduler were feasible, this
complexity would have a negative impact on its scalability and resilience. Second, as new frameworks and
new scheduling policies for current frameworks are constantly being developed [37, 38, 40, 26], it is not clear
whether we are even at the point to have a full specification of framework requirements. Third, many existing
frameworks implement their own sophisticated scheduling [25, 38], and moving this functionality to a global
scheduler would require expensive refactoring.
Instead, Mesos takes a different approach: delegating
control over scheduling to the frameworks. This is accomplished through a new abstraction, called a resource
offer, which encapsulates a bundle of resources that a
framework can allocate on a cluster node to run tasks.
Mesos decides how many resources to offer each framework, based on an organizational policy such as fair sharing, while frameworks decide which resources to accept
and which tasks to run on them. While this decentralized scheduling model may not always lead to globally
optimal scheduling, we have found that it performs surprisingly well in practice, allowing frameworks to meet
goals such as data locality nearly perfectly. In addition,
resource offers are simple and efficient to implement, allowing Mesos to be highly scalable and robust to failures.
Mesos also provides other benefits to practitioners.
First, even organizations that only use one framework
can use Mesos to run multiple instances of that framework in the same cluster, or multiple versions of the
framework. Our contacts at Yahoo! and Facebook indicate that this would be a compelling way to isolate
production and experimental Hadoop workloads and to
roll out new versions of Hadoop [11, 10]. Second,
Mesos makes it easier to develop and immediately experiment with new frameworks. The ability to share resources across multiple frameworks frees the developers
to build and run specialized frameworks targeted at particular problem domains rather than one-size-fits-all abstractions. Frameworks can therefore evolve faster and
provide better support for each problem domain.
We have implemented Mesos in 10,000 lines of C++.
The system scales to 50,000 (emulated) nodes and uses
ZooKeeper [4] for fault tolerance. To evaluate Mesos, we
have ported three cluster computing systems to run over
it: Hadoop, MPI, and the Torque batch scheduler. To validate our hypothesis that specialized frameworks provide
value over general ones, we have also built a new framework on top of Mesos called Spark, optimized for iterative jobs where a dataset is reused in many parallel oper-


Figure 1: CDF of job and task durations in Facebook's Hadoop data warehouse (data from [38]).

ations, and shown that Spark can outperform Hadoop by


10x in iterative machine learning workloads.
This paper is organized as follows. Section 2 details
the data center environment that Mesos is designed for.
Section 3 presents the architecture of Mesos. Section 4
analyzes our distributed scheduling model (resource offers) and characterizes the environments that it works
well in. We present our implementation of Mesos in Section 5 and evaluate it in Section 6. We survey related
work in Section 7. Finally, we conclude in Section 8.

Target Environment

As an example of a workload we aim to support, consider the Hadoop data warehouse at Facebook [5]. Facebook loads logs from its web services into a 2000-node
Hadoop cluster, where they are used for applications
such as business intelligence, spam detection, and ad
optimization. In addition to production jobs that run
periodically, the cluster is used for many experimental
jobs, ranging from multi-hour machine learning computations to 1-2 minute ad-hoc queries submitted interactively through an SQL interface called Hive [3]. Most
jobs are short (the median job being 84s long), and the
jobs are composed of fine-grained map and reduce tasks
(the median task being 23s), as shown in Figure 1.
To meet the performance requirements of these jobs,
Facebook uses a fair scheduler for Hadoop that takes advantage of the fine-grained nature of the workload to allocate resources at the level of tasks and to optimize data
locality [38]. Unfortunately, this means that the cluster
can only run Hadoop jobs. If a user wishes to write an ad
targeting algorithm in MPI instead of MapReduce, perhaps because MPI is more efficient for this job's communication pattern, then the user must set up a separate MPI
cluster and import terabytes of data into it. This problem
is not hypothetical; our contacts at Yahoo! and Facebook
report that users want to run MPI and MapReduce Online
(a streaming MapReduce) [11, 10]. Mesos aims to provide fine-grained sharing between multiple cluster computing frameworks to enable these usage scenarios.


Figure 3: Resource offer example.

or priority. To support a diverse set of inter-framework


allocation policies, Mesos lets organizations define their
own policies via a pluggable allocation module.
Each framework running on Mesos consists of two components: a scheduler that registers with the master to be offered resources, and an executor process that is launched on slave nodes to run the framework's tasks. While the master determines how many resources to offer to each framework, the frameworks' schedulers select which of the offered resources to use. When a framework accepts offered resources, it passes Mesos a description of the tasks it wants to launch on them.
Figure 3 shows an example of how a framework gets scheduled to run tasks. In step (1), slave 1 reports to the master that it has 4 CPUs and 4 GB of memory free. The master then invokes the allocation module, which tells it that framework 1 should be offered all available resources. In step (2), the master sends a resource offer describing these resources to framework 1. In step (3), the framework's scheduler replies to the master with information about two tasks to run on the slave, using <2 CPUs, 1 GB RAM> for the first task, and <1 CPU, 2 GB RAM> for the second task. Finally, in step (4), the master sends the tasks to the slave, which allocates appropriate resources to the framework's executor, which in turn launches the two tasks (depicted with dotted borders). Because 1 CPU and 1 GB of RAM are still free, the allocation module may now offer them to framework 2. In addition, this resource offer process repeats when tasks finish and new resources become free.
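
The exchange in this example can be sketched as one toy round of the offer protocol; the class names, dictionary-based resources, and task descriptions below are invented for illustration and are not the Mesos API.

```python
def subtract(avail, used):
    return {k: avail[k] - used.get(k, 0) for k in avail}

class ToyMaster:
    def __init__(self, slave_resources):
        self.free = dict(slave_resources)            # e.g. {"cpus": 4, "mem_gb": 4}

    def offer_to(self, framework):
        offer = dict(self.free)                      # offer everything currently free
        tasks = framework.resource_offer(offer)      # framework picks tasks to launch
        for t in tasks:
            self.free = subtract(self.free, t["resources"])
        return tasks                                 # master forwards tasks to the slave

class ToyFramework:
    def __init__(self, wanted):
        self.wanted = wanted                         # list of per-task resource demands

    def resource_offer(self, offer):
        accepted, remaining = [], dict(offer)
        for i, need in enumerate(self.wanted):
            if all(remaining[k] >= v for k, v in need.items()):
                accepted.append({"task_id": f"task{i+1}", "resources": need})
                remaining = subtract(remaining, need)
        return accepted                              # unneeded resources stay with the master

master = ToyMaster({"cpus": 4, "mem_gb": 4})
fw1 = ToyFramework([{"cpus": 2, "mem_gb": 1}, {"cpus": 1, "mem_gb": 2}])
print(master.offer_to(fw1))     # both tasks from the example are launched
print(master.free)              # {"cpus": 1, "mem_gb": 1} left to offer framework 2
```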
To maintain a thin interface and enable frameworks
to evolve independently, Mesos does not require frameworks to specify their resource requirements or constraints. Instead, Mesos gives frameworks the ability to
reject offers. A framework can reject resources that do
not satisfy its constraints in order to wait for ones that
do. Thus, the rejection mechanism enables frameworks
to support arbitrarily complex resource constraints while
keeping Mesos simple and scalable.
One potential challenge with solely using the rejec-

Architecture

3.1 Design Philosophy

Mesos aims to provide a scalable and resilient core for


enabling various frameworks to efficiently share clusters.
Because cluster frameworks are both highly diverse and
rapidly evolving, our overriding design philosophy has
been to define a minimal interface that enables efficient
resource sharing across frameworks, and otherwise push
control of task scheduling and execution to the frameworks. Pushing control to the frameworks has two benefits. First, it allows frameworks to implement diverse approaches to various problems in the cluster (e.g., achieving data locality, dealing with faults), and to evolve these
solutions independently. Second, it keeps Mesos simple
and minimizes the rate of change required of the system,
which makes it easier to keep Mesos scalable and robust.
Although Mesos provides a low-level interface, we expect higher-level libraries implementing common functionality (such as fault tolerance) to be built on top of
it. These libraries would be analogous to library OSes in
the exokernel [20]. Putting this functionality in libraries
rather than in Mesos allows Mesos to remain small and
flexible, and lets the libraries evolve independently.

We begin our description of Mesos by discussing our design philosophy. We then describe the components of
Mesos, our resource allocation mechanisms, and how
Mesos achieves isolation, scalability, and fault tolerance.

Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI).

3.2 Overview

Figure 2 shows the main components of Mesos. Mesos


consists of a master process that manages slave daemons
running on each cluster node, and frameworks that run
tasks on these slaves.
The master implements fine-grained sharing across
frameworks using resource offers. Each resource offer
is a list of free resources on multiple slaves. The master
decides how many resources to offer to each framework
according to an organizational policy, such as fair sharing

tion mechanism to satisfy all framework constraints is


efficiency: a framework may have to wait a long time
before it receives an offer satisfying its constraints, and
Mesos may have to send an offer to many frameworks
before one of them accepts it. To avoid this, Mesos also
allows frameworks to set filters, which are Boolean predicates specifying that a framework will always reject certain resources. For example, a framework might specify
a whitelist of nodes it can run on.
There are two points worth noting. First, filters represent just a performance optimization for the resource offer model, as the frameworks still have the ultimate control to reject any resources that they cannot express filters
for and to choose which tasks to run on each node. Second, as we will show in this paper, when the workload
consists of fine-grained tasks (e.g., in MapReduce and
Dryad workloads), the resource offer model performs
surprisingly well even in the absence of filters. In particular, we have found that a simple policy called delay
scheduling [38], in which frameworks wait for a limited
time to acquire nodes storing their data, yields nearly optimal data locality with a wait time of 1-5s.
In the rest of this section, we describe how Mesos performs two key functions: resource allocation (3.3) and
resource isolation (3.4). We then describe filters and
several other mechanisms that make resource offers scalable and robust (3.5). Finally, we discuss fault tolerance
in Mesos (3.6) and summarize the Mesos API (3.7).

location modules expose a guaranteed allocation to each framework: a quantity of resources that the framework
may hold without losing tasks. Frameworks read their
guaranteed allocations through an API call. Allocation
modules are responsible for ensuring that the guaranteed
allocations they provide can all be met concurrently. For
now, we have kept the semantics of guaranteed allocations simple: if a framework is below its guaranteed allocation, none of its tasks should be killed, and if it is
above, any of its tasks may be killed.
Second, to decide when to trigger revocation, Mesos
must know which of the connected frameworks would
use more resources if they were offered them. Frameworks indicate their interest in offers through an API call.
3.4 Isolation

Mesos provides performance isolation between framework executors running on the same slave by leveraging
existing OS isolation mechanisms. Since these mechanisms are platform-dependent, we support multiple isolation mechanisms through pluggable isolation modules.
We currently isolate resources using OS container
technologies, specifically Linux Containers [9] and Solaris Projects [13]. These technologies can limit the
CPU, memory, network bandwidth, and (in new Linux
kernels) I/O usage of a process tree. These isolation technologies are not perfect, but using containers is already
an advantage over frameworks like Hadoop, where tasks
from different jobs simply run in separate processes.

3.3 Resource Allocation

Mesos delegates allocation decisions to a pluggable allocation module, so that organizations can tailor allocation to their needs. So far, we have implemented two
allocation modules: one that performs fair sharing based
on a generalization of max-min fairness for multiple resources [21] and one that implements strict priorities.
Similar policies are used in Hadoop and Dryad [25, 38].
In normal operation, Mesos takes advantage of the
fact that most tasks are short, and only reallocates resources when tasks finish. This usually happens frequently enough so that new frameworks acquire their
share quickly. For example, if a framework's share is
10% of the cluster, it needs to wait approximately 10%
of the mean task length to receive its share. However,
if a cluster becomes filled by long tasks, e.g., due to a
buggy job or a greedy framework, the allocation module
can also revoke (kill) tasks. Before killing a task, Mesos
gives its framework a grace period to clean it up.
We leave it up to the allocation module to select the
policy for revoking tasks, but describe two related mechanisms here. First, while killing a task has a low impact
on many frameworks (e.g., MapReduce), it is harmful for
frameworks with interdependent tasks (e.g., MPI). We allow these frameworks to avoid being killed by letting al-

3.5 Making Resource Offers Scalable and Robust

Because task scheduling in Mesos is a distributed process, it needs to be efficient and robust to failures. Mesos
includes three mechanisms to help with this goal.
First, because some frameworks will always reject certain resources, Mesos lets them short-circuit the rejection
process and avoid communication by providing filters to
the master. We currently support two types of filters: "only offer nodes from list L" and "only offer nodes with at least R resources free". However, other types of predicates could also be supported. Note that unlike generic
constraint languages, filters are Boolean predicates that
specify whether a framework will reject one bundle of
resources on one node, so they can be evaluated quickly
on the master. Any resource that does not pass a framework's filter is treated exactly like a rejected resource.
Second, because a framework may take time to respond to an offer, Mesos counts resources offered to a
framework towards its allocation of the cluster. This is
a strong incentive for frameworks to respond to offers
quickly and to filter resources that they cannot use.
Third, if a framework has not responded to an offer
for a sufficiently long time, Mesos rescinds the offer and
re-offers the resources to other frameworks.
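
As a sketch of how such filters might be evaluated on the master before an offer goes out (the filter encoding and field names here are invented for illustration, not Mesos's actual representation):

```python
def passes_filters(offer, filters):
    """An offer is suppressed if any of the framework's filters rejects it."""
    if "only_nodes" in filters and offer["hostname"] not in filters["only_nodes"]:
        return False                                  # "only offer nodes from list L"
    mins = filters.get("min_resources", {})
    if any(offer["resources"].get(k, 0) < v for k, v in mins.items()):
        return False                                  # "only offer nodes with >= R resources free"
    return True

fw_filters = {"only_nodes": {"slave1", "slave2"},
              "min_resources": {"cpus": 2, "mem_gb": 2}}

print(passes_filters({"hostname": "slave1", "resources": {"cpus": 4, "mem_gb": 4}}, fw_filters))  # True
print(passes_filters({"hostname": "slave3", "resources": {"cpus": 8, "mem_gb": 8}}, fw_filters))  # False
print(passes_filters({"hostname": "slave2", "resources": {"cpus": 1, "mem_gb": 4}}, fw_filters))  # False
```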

the performance of frameworks with short tasks (4.4). We also discuss how frameworks are incentivized to improve their performance under Mesos, and argue that these incentives also improve overall cluster utilization (4.5). We conclude this section with some limitations of Mesos's distributed scheduling model (4.6).

Table 1: Mesos API functions for schedulers and executors.

Scheduler Callbacks:                 Scheduler Actions:
  resourceOffer(offerId, offers)       replyToOffer(offerId, tasks)
  offerRescinded(offerId)              setNeedsOffers(bool)
  statusUpdate(taskId, status)         setFilters(filters)
  slaveLost(slaveId)                   getGuaranteedShare()
                                       killTask(taskId)
Executor Callbacks:                  Executor Actions:
  launchTask(taskDescriptor)           sendStatus(taskId, status)
  killTask(taskId)

4.1 Definitions, Metrics and Assumptions

In our discussion, we consider three metrics:

Framework ramp-up time: time it takes a new framework to achieve its allocation (e.g., fair share);

Job completion time: time it takes a job to complete, assuming one job per framework;

3.6 Fault Tolerance

Since all the frameworks depend on the Mesos master, it


is critical to make the master fault-tolerant. To achieve
this, we have designed the master to be soft state, so that
a new master can completely reconstruct its internal state
from information held by the slaves and the framework
schedulers. In particular, the master's only state is the list
of active slaves, active frameworks, and running tasks.
This information is sufficient to compute how many resources each framework is using and run the allocation
policy. We run multiple masters in a hot-standby configuration using ZooKeeper [4] for leader election. When
the active master fails, the slaves and schedulers connect
to the next elected master and repopulate its state.
Aside from handling master failures, Mesos reports
node failures and executor crashes to frameworks' schedulers. Frameworks can then react to these failures using
the policies of their choice.
Finally, to deal with scheduler failures, Mesos allows a
framework to register multiple schedulers such that when
one fails, another one is notified by the Mesos master to
take over. Frameworks must use their own mechanisms
to share state between their schedulers.
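
A toy illustration of the soft-state idea described above: after leader election, a new master could rebuild everything it needs from what slaves and schedulers re-report. The record shapes below are invented for the sketch and are not Mesos's wire format.

```python
def rebuild_master_state(slave_reports, framework_reports):
    """Reconstruct the master's only state: active slaves, active frameworks, running tasks."""
    state = {"slaves": {}, "frameworks": set(), "running_tasks": {}}
    for report in slave_reports:                          # each slave re-registers with the new master
        state["slaves"][report["slave_id"]] = report["resources"]
        for task in report["tasks"]:
            state["running_tasks"][task["task_id"]] = task
    for fw in framework_reports:                          # each framework scheduler reconnects
        state["frameworks"].add(fw["framework_id"])
    return state

slaves = [{"slave_id": "s1", "resources": {"cpus": 4},
           "tasks": [{"task_id": "t1", "framework_id": "fw1", "cpus": 2}]}]
frameworks = [{"framework_id": "fw1"}]
state = rebuild_master_state(slaves, frameworks)
print(sorted(state["running_tasks"]))   # ['t1']: enough to re-run the allocation policy
```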

System utilization: total cluster utilization.


We characterize workloads along two dimensions: elasticity and task duration distribution. An elastic framework, such as Hadoop and Dryad, can scale its resources
up and down, i.e., it can start using nodes as soon as it
acquires them and release them as soon as its tasks finish. In
contrast, a rigid framework, such as MPI, can start running its jobs only after it has acquired a fixed quantity of
resources, and cannot scale up dynamically to take advantage of new resources or scale down without a large
impact on performance. For task durations, we consider
both homogeneous and heterogeneous distributions.
We also differentiate between two types of resources:
mandatory and preferred. A resource is mandatory if a
framework must acquire it in order to run. For example, a
graphical processing unit (GPU) is mandatory if a framework cannot run without access to a GPU. In contrast, a resource is preferred if a framework performs better using it, but can also run using another equivalent resource.
For example, a framework may prefer running on a node
that locally stores its data, but may also be able to read
the data remotely if it must.
We assume the amount of mandatory resources requested by a framework never exceeds its guaranteed
share. This ensures that frameworks will not deadlock
waiting for the mandatory resources to become free.2 For
simplicity, we also assume that all tasks have the same resource demands and run on identical slices of machines
called slots, and that each framework runs a single job.

3.7 API Summary

Table 1 summarizes the Mesos API. The callback


columns list functions that frameworks must implement,
while actions are operations that they can invoke.
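
To show how the callbacks and actions in Table 1 pair up, here is a deliberately skeletal scheduler; the driver object standing in for the action side, the string status values, and all method bodies are assumptions for illustration, not the real Mesos bindings.

```python
class FakeDriver:
    """Stand-in for the action side of Table 1; records calls instead of talking to a master."""
    def replyToOffer(self, offerId, tasks): print("replyToOffer", offerId, tasks)
    def setNeedsOffers(self, needed):       print("setNeedsOffers", needed)

class SkeletonScheduler:
    """Schematic framework scheduler: callbacks are invoked by Mesos, actions go via the driver."""
    def __init__(self, driver):
        self.driver = driver

    # --- Scheduler Callbacks ---
    def resourceOffer(self, offerId, offers):
        tasks = []                                   # decide which offered resources to use
        self.driver.replyToOffer(offerId, tasks)     # may be an empty list (decline the offer)

    def offerRescinded(self, offerId):
        pass                                         # forget any plan tied to this offer

    def statusUpdate(self, taskId, status):
        if status == "LOST":                         # hypothetical status value
            self.driver.setNeedsOffers(True)         # ask for offers to relaunch the task

    def slaveLost(self, slaveId):
        pass                                         # reschedule work that ran on that slave

sched = SkeletonScheduler(FakeDriver())
sched.resourceOffer("offer-1", [{"cpus": 4, "mem_gb": 4}])
sched.statusUpdate("task-7", "LOST")
```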


Mesos Behavior

In this section, we study Mesos's behavior for different


workloads. Our goal is not to develop an exact model of
the system, but to provide a coarse understanding of its
behavior, in order to characterize the environments that
Mesos's distributed scheduling model works well in.
In short, we find that Mesos performs very well when
frameworks can scale up and down elastically, task durations are homogeneous, and frameworks prefer all
nodes equally (4.2). When different frameworks prefer different nodes, we show that Mesos can emulate a
centralized scheduler that performs fair sharing across
frameworks (4.3). In addition, we show that Mesos can
handle heterogeneous task durations without impacting

4.2 Homogeneous Tasks

We consider a cluster with n slots and a framework, f ,


that is entitled to k slots. For the purpose of this analysis, we consider two distributions of the task durations:
constant (i.e., all tasks have the same length) and exponential. Let the mean task duration be T, and assume that framework f runs a job which requires βkT total computation time. That is, when the framework has k slots, it takes its job βT time to finish.

2 In workloads where the mandatory resource demands of the active frameworks can exceed the capacity of the cluster, the allocation module needs to implement admission control.

Table 2: Ramp-up time, job completion time and utilization for both elastic and rigid frameworks, and for both constant and exponential task duration distributions. The framework starts with no slots. k is the number of slots the framework is entitled to under the scheduling policy, and βT represents the time it takes a job to complete assuming the framework gets all k slots at once.

                     Elastic Framework                        Rigid Framework
                     Constant dist.    Exponential dist.      Constant dist.    Exponential dist.
Ramp-up time         T                 T ln k                 T                 T ln k
Completion time      (1/2 + β)T        (1 + β)T               (1 + β)T          (ln k + β)T
Utilization          1                 1                      β/(1/2 + β)       β/(ln k − 1 + β)
Table 2 summarizes the job completion times and system utilization for the two types of frameworks and the
two types of task length distributions. As expected, elastic frameworks with constant task durations perform the
best, while rigid frameworks with exponential task duration perform the worst. Due to lack of space, we present
only the results here and include derivations in [23].

(a) there exists a system configuration in which each


framework gets all its preferred slots and achieves its full
allocation, and (b) there is no such configuration, i.e., the
demand for some preferred slots exceeds the supply.
In the first case, it is easy to see that, irrespective of the
initial configuration, the system will converge to the state
where each framework allocates its preferred slots after
at most one T interval. This is simple because during a
T interval all slots become available, and as a result each
framework will be offered its preferred slots.
In the second case, there is no configuration in which
all frameworks can satisfy their preferences. The key
question in this case is how should one allocate the preferred slots across the frameworks demanding them. In
particular, assume there are p slots preferred by m frameworks,
Pm where framework i requests ri such slots, and
i=1 ri > x. While many allocation policies are possible, here we consider a weighted fair allocation policy
where the weight associated with framework i is its intended total allocation, si . In other words, assuming that
each framework
has enough demand, we aim to allocate
Pm
psi /( i=1 si ) preferred slots to framework i.
The challenge in Mesos is that the scheduler does not know the preferences of each framework. Fortunately, it turns out that there is an easy way to achieve the weighted allocation of the preferred slots described above: simply perform lottery scheduling [36], offering slots to frameworks with probabilities proportional to their intended allocations. In particular, when a slot becomes available, Mesos can offer that slot to framework i with probability s_i / (s_1 + ... + s_n), where n is the total number of frameworks in the system. Furthermore, because each framework i receives on average s_i slots every T time units, the results for ramp-up times and completion times in Section 4.2 still hold.
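To make the lottery step concrete, the short sketch below (ours, not code from the Mesos implementation) picks which framework receives a freed slot with probability proportional to its intended allocation s_i; the Framework class and the example allocations are illustrative.

import scala.util.Random

// Intended total allocation s_i for each framework (illustrative values).
case class Framework(name: String, intendedAllocation: Double)

// Offer a freed slot to framework i with probability s_i / (s_1 + ... + s_n).
def pickFrameworkForSlot(frameworks: Seq[Framework], rng: Random): Framework = {
  val ticket = rng.nextDouble() * frameworks.map(_.intendedAllocation).sum
  // Walk the cumulative allocation until the drawn ticket falls inside a framework's share.
  val cumulative = frameworks.scanLeft(0.0)(_ + _.intendedAllocation).tail
  frameworks.zip(cumulative).collectFirst { case (f, c) if ticket < c => f }
    .getOrElse(frameworks.last)   // guard against floating-point rounding at the top end
}

val frameworks = Seq(Framework("hadoop", 48), Framework("spark", 24), Framework("torque", 24))
println(pickFrameworkForSlot(frameworks, new Random()).name)

With these example weights, the "hadoop" framework would receive roughly half of the freed slots over time, matching the weighted fair allocation described above.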


4.4 Heterogeneous Tasks

So far we have assumed that frameworks have homogeneous task duration distributions, i.e., that all frameworks have the same task duration distribution. In this section, we discuss frameworks with heterogeneous task duration distributions. In particular, we consider a workload with tasks that are either short or long, where the mean duration of the long tasks is significantly longer than the mean of the short tasks. Such heterogeneous workloads can hurt frameworks with short tasks. In the worst case, all nodes required by a short job might be filled with long tasks, so the job may need to wait a long time (relative to its execution time) to acquire resources.
We note first that random task assignment can work well if the fraction φ of long tasks is not very close to 1 and if each node supports multiple slots. For example, in a cluster with S slots per node, the probability that a node is filled with long tasks will be φ^S. When S is large (e.g., in the case of multicore machines), this probability is small even with φ > 0.5. If S = 8 and φ = 0.5, for example, the probability that a node is filled with long tasks is 0.4%. Thus, a framework with short tasks can still acquire many preferred slots in a short period of time. In addition, the more slots a framework is able to use, the likelier it is that at least k of them are running short tasks.
To further alleviate the impact of long tasks, Mesos
can be extended slightly to allow allocation policies to
reserve some resources on each node for short tasks. In
particular, we can associate a maximum task duration
with some of the resources on each node, after which
tasks running on those resources are killed. These time
limits can be exposed to the frameworks in resource offers, allowing them to choose whether to use these resources. This scheme is similar to the common policy of
having a separate queue for short jobs in HPC clusters.
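As a quick check of the φ^S estimate above (our own illustration with example values), the snippet below evaluates the probability that a node with S slots is entirely filled with long tasks when each slot independently holds a long task with probability φ:

// Probability that every one of the S slots on a node holds a long task,
// assuming each slot independently holds a long task with probability phi.
def probNodeAllLong(phi: Double, slots: Int): Double = math.pow(phi, slots)

for ((phi, s) <- Seq((0.5, 4), (0.5, 8), (0.8, 8), (0.8, 16)))
  println(f"phi = $phi%.1f, S = $s%2d  ->  ${probNodeAllLong(phi, s) * 100}%.2f%% of nodes all-long")

With φ = 0.5 and S = 8 this prints 0.39%, matching the roughly 0.4% figure quoted above; even with 80% long tasks, only about 17% of 8-slot nodes are fully occupied by them.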
4.5 Framework Incentives

Mesos implements a decentralized scheduling model, where each framework decides which offers to accept. As with any decentralized system, it is important to understand the incentives of entities in the system. In this section, we discuss the incentives of frameworks (and their users) to improve the response times of their jobs.

Short tasks: A framework is incentivized to use short tasks for two reasons. First, it will be able to allocate any resources reserved for short slots. Second, using small tasks minimizes the wasted work if the framework loses a task, either due to revocation or simply due to failures.

Scale elastically: The ability of a framework to use resources as soon as it acquires them, instead of waiting to reach a given minimum allocation, would allow the framework to start (and complete) its jobs earlier. In addition, the ability to scale up and down allows a framework to grab unused resources opportunistically, as it can later release them with little negative impact.

Do not accept unknown resources: Frameworks are incentivized not to accept resources that they cannot use because most allocation policies will count all the resources that a framework owns when making offers.

We note that these incentives align well with our goal of improving utilization. If frameworks use short tasks, Mesos can reallocate resources quickly between them, reducing latency for new jobs and wasted work for revocation. If frameworks are elastic, they will opportunistically utilize all the resources they can obtain. Finally, if frameworks do not accept resources that they do not understand, they will leave them for frameworks that do.
We also note that these properties are met by many current cluster computing frameworks, such as MapReduce and Dryad, simply because using short independent tasks simplifies load balancing and fault recovery.

4.6 Limitations of Distributed Scheduling

Although we have shown that distributed scheduling works well in a range of workloads relevant to current cluster environments, like any decentralized approach, it can perform worse than a centralized scheduler. We have identified three limitations of the distributed model:

Fragmentation: When tasks have heterogeneous resource demands, a distributed collection of frameworks may not be able to optimize bin packing as well as a centralized scheduler. However, note that the wasted space due to suboptimal bin packing is bounded by the ratio between the largest task size and the node size. Therefore, clusters running "larger" nodes (e.g., multicore nodes) and smaller tasks within those nodes will achieve high utilization even with distributed scheduling.
There is another possible bad outcome if allocation modules reallocate resources in a naive manner: when a cluster is filled by tasks with small resource requirements, a framework f with large resource requirements may starve, because whenever a small task finishes, f cannot accept the resources freed by it, but other frameworks can. To accommodate frameworks with large per-task resource requirements, allocation modules can support a minimum offer size on each slave, and abstain from offering resources on the slave until this amount is free.

Interdependent framework constraints: It is possible to construct scenarios where, because of esoteric interdependencies between frameworks (e.g., certain tasks from two frameworks cannot be colocated), only a single global allocation of the cluster performs well. We argue such scenarios are rare in practice. In the model discussed in this section, where frameworks only have preferences over which nodes they use, we showed that allocations approximate those of optimal schedulers.

Framework complexity: Using resource offers may make framework scheduling more complex. We argue, however, that this difficulty is not onerous. First, whether using Mesos or a centralized scheduler, frameworks need to know their preferences; in a centralized scheduler, the framework needs to express them to the scheduler, whereas in Mesos, it must use them to decide which offers to accept. Second, many scheduling policies for existing frameworks are online algorithms, because frameworks cannot predict task times and must be able to handle failures and stragglers [18, 40, 38]. These policies are easy to implement over resource offers.
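A minimal sketch of the minimum-offer-size idea mentioned under Fragmentation (our illustration; the MinOfferPolicy type and resource fields are hypothetical, not the Mesos allocation-module API):

// Free resources currently available on one slave (illustrative fields).
case class FreeResources(cpus: Double, memGb: Double)

// Hold back offers from a slave until at least minCpus and minMemGb are free, so that
// frameworks with large per-task requirements are not starved by a stream of tiny offers.
case class MinOfferPolicy(minCpus: Double, minMemGb: Double) {
  def shouldOffer(free: FreeResources): Boolean =
    free.cpus >= minCpus && free.memGb >= minMemGb
}

val policy = MinOfferPolicy(minCpus = 4.0, minMemGb = 8.0)
println(policy.shouldOffer(FreeResources(2.0, 12.0)))  // false: keep accumulating freed resources
println(policy.shouldOffer(FreeResources(4.0, 15.0)))  // true: offer is now large enough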


5 Implementation

We have implemented Mesos in about 10,000 lines of C++. The system runs on Linux, Solaris and OS X, and
supports frameworks written in C++, Java, and Python.
To reduce the complexity of our implementation, we
use a C++ library called libprocess [7] that provides
an actor-based programming model using efficient asynchronous I/O mechanisms (epoll, kqueue, etc). We
also use ZooKeeper [4] to perform leader election.
Mesos can use Linux containers [9] or Solaris projects
[13] to isolate tasks. We currently isolate CPU cores and
memory. We plan to leverage recently added support for
network and I/O isolation in Linux [8] in the future.
We have implemented four frameworks on top of
Mesos. First, we have ported three existing cluster computing systems: Hadoop [2], the Torque resource scheduler [33], and the MPICH2 implementation of MPI [16].
None of these ports required changing these frameworks' APIs, so all of them can run unmodified user programs.
In addition, we built a specialized framework for iterative
jobs called Spark, which we discuss in Section 5.3.
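The libprocess actor model mentioned above boils down to isolated processes that communicate only through asynchronous messages. The sketch below is our own minimal Scala analogue of that idea (it is not libprocess code and uses none of its API): a single-threaded mailbox loop that handles messages one at a time.

import java.util.concurrent.LinkedBlockingQueue

// A tiny single-threaded "process": messages are enqueued asynchronously and handled
// one at a time, so the handler needs no locks around its own state.
class MailboxProcess[M](handle: M => Unit) {
  private val mailbox = new LinkedBlockingQueue[Option[M]]()
  private val worker = new Thread(() =>
    Iterator.continually(mailbox.take()).takeWhile(_.isDefined).foreach(m => handle(m.get)))
  worker.start()
  def send(msg: M): Unit = mailbox.put(Some(msg))   // asynchronous from the sender's point of view
  def stop(): Unit = mailbox.put(None)              // "poison pill" ends the loop
}

var freedSlots = 0
val counter = new MailboxProcess[String](msg => { freedSlots += 1; println(s"$msg (total: $freedSlots)") })
counter.send("slot freed on slave-1")
counter.send("slot freed on slave-7")
counter.stop()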
5.1 Torque and MPI Ports

We have ported the Torque cluster resource manager to


run as a framework on Mesos. The framework consists
of a Mesos scheduler and executor, written in 360 lines
of Python code, that launch and manage different components of Torque. In addition, we modified 3 lines of
Torque source code to allow it to elastically scale up and
down on Mesos depending on the jobs in its queue.
After registering with the Mesos master, the framework scheduler configures and launches a Torque server
and then periodically monitors the servers job queue.
While the queue is empty, the scheduler releases all tasks
(down to an optional minimum, which we set to 0) and
refuses all resource offers it receives from Mesos. Once
a job gets added to Torque's queue (using the standard qsub command), the scheduler begins accepting new resource offers. As long as there are jobs in Torque's queue, the scheduler accepts offers as necessary to satisfy the constraints of as many jobs in the queue as possible. On each node where offers are accepted, Mesos
launches our executor, which in turn starts a Torque
backend daemon and registers it with the Torque server.
When enough Torque backend daemons have registered, the Torque server will launch the next job in its queue.
Because jobs that run on Torque (e.g. MPI) may not be
fault tolerant, Torque avoids having its tasks revoked by
not accepting resources beyond its guaranteed allocation.
In addition to the Torque framework, we also created
a Mesos MPI wrapper framework, written in 200 lines
of Python code, for running MPI jobs directly on Mesos.
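As a rough sketch of the accept/decline policy just described for the Torque framework scheduler (our own illustration with hypothetical types, not the actual 360-line Python scheduler or the Mesos framework API):

// Illustrative types; not the real Mesos offer or Torque data structures.
case class Offer(slave: String, cpus: Double, memGb: Double)
case class TorqueQueue(queuedJobs: Int)

// Mirror of the policy in the text: while Torque's queue is empty, decline offers
// (and scale down to the optional minimum); while jobs are waiting, accept offers
// and start Torque backend daemons on the offered nodes.
def reactToOffer(offer: Offer, queue: TorqueQueue): String =
  if (queue.queuedJobs > 0)
    s"accept offer on ${offer.slave}: launch a Torque backend daemon"
  else
    s"decline offer on ${offer.slave}: queue is empty, stay at the minimum allocation"

println(reactToOffer(Offer("slave-3", 4, 15), TorqueQueue(queuedJobs = 0)))
println(reactToOffer(Offer("slave-3", 4, 15), TorqueQueue(queuedJobs = 2)))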

5.2 Hadoop Port

Porting Hadoop to run on Mesos required relatively few modifications, because Hadoop's fine-grained map and reduce tasks map cleanly to Mesos tasks. In addition, the
Hadoop master, known as the JobTracker, and Hadoop
slaves, known as TaskTrackers, fit naturally into the
Mesos model as a framework scheduler and executor.
To add support for running Hadoop on Mesos, we took
advantage of the fact that Hadoop already has a pluggable API for writing job schedulers. We wrote a Hadoop
scheduler that connects to Mesos, launches TaskTrackers
as its executors, and maps each Hadoop task to a Mesos
task. When there are unlaunched tasks in Hadoop, our
scheduler first starts Mesos tasks on the nodes of the
cluster that it wants to use, and then sends the Hadoop
tasks to them using Hadoop's existing internal interfaces.
When tasks finish, our executor notifies Mesos by listening for task finish events using an API in the TaskTracker.
We used delay scheduling [38] to achieve data locality
by waiting for slots on the nodes that contain task input data. In addition, our approach allowed us to reuse
Hadoop's existing logic for re-scheduling of failed tasks
and for speculative execution (straggler mitigation).
We also needed to change how map output data is served to reduce tasks. Hadoop normally writes map output files to the local filesystem, then serves these to reduce tasks using an HTTP server included in the TaskTracker. However, the TaskTracker within Mesos runs as an executor, which may be terminated if it is not running tasks. This would make map output files unavailable to reduce tasks. We solved this problem by providing a shared file server on each node in the cluster to serve local files. Such a service is useful beyond Hadoop, to other frameworks that write data locally on each node. In total, our Hadoop port is 1500 lines of code.
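A bare-bones sketch of such a per-node file server (ours, using only the JDK's built-in com.sun.net.httpserver; the port and directory are placeholders, and the actual Mesos service is more involved):

import com.sun.net.httpserver.{HttpExchange, HttpServer}
import java.net.InetSocketAddress
import java.nio.file.{Files, Paths}

// Serve files from a local directory over HTTP so that map outputs written on this node
// stay reachable even after the executor that produced them has exited.
object LocalFileServer {
  def main(args: Array[String]): Unit = {
    val root = Paths.get("/tmp/mesos-local-files")            // placeholder directory
    val server = HttpServer.create(new InetSocketAddress(8081), 0)
    server.createContext("/", (exchange: HttpExchange) => {
      val requested = root.resolve(exchange.getRequestURI.getPath.stripPrefix("/")).normalize()
      if (requested.startsWith(root) && Files.isRegularFile(requested)) {
        val bytes = Files.readAllBytes(requested)
        exchange.sendResponseHeaders(200, bytes.length.toLong)
        exchange.getResponseBody.write(bytes)
      } else {
        exchange.sendResponseHeaders(404, -1)                  // not found, or path escaped the root
      }
      exchange.close()
    })
    server.start()
  }
}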

5.3 Spark Framework

Mesos enables the creation of specialized frameworks optimized for workloads for which more general execution layers may not be optimal. To test the hypothesis that simple specialized frameworks provide value,
we identified one class of jobs that were found to perform poorly on Hadoop by machine learning researchers
at our lab: iterative jobs, where a dataset is reused across
a number of iterations. We built a specialized framework
called Spark [39] optimized for these workloads.
One example of an iterative algorithm used in machine learning is logistic regression [22]. This algorithm
seeks to find a line that separates two sets of labeled data
points. The algorithm starts with a random line w. Then,
on each iteration, it computes the gradient of an objective

function that measures how well the line separates the points, and shifts w along this gradient. This gradient computation amounts to evaluating a function f(x, w) over each data point x and summing the results. An implementation of logistic regression in Hadoop must run each iteration as a separate MapReduce job, because each iteration depends on the w computed at the previous one. This imposes overhead because every iteration must re-read the input file into memory. In Dryad, the whole job can be expressed as a data flow DAG as shown in Figure 4a, but the data must still be reloaded from disk at each iteration. Reusing the data in memory between iterations in Dryad would require cyclic data flow.
Spark's execution is shown in Figure 4b. Spark uses the long-lived nature of Mesos executors to cache a slice of the dataset in memory at each executor, and then run multiple iterations on this cached data. This caching is achieved in a fault-tolerant manner: if a node is lost, Spark remembers how to recompute its slice of the data.
By building Spark on top of Mesos, we were able to keep its implementation small (about 1300 lines of code), yet still capable of outperforming Hadoop by 10x for iterative jobs. In particular, using Mesos's API saved us the time to write a master daemon, slave daemon, and communication protocols between them for Spark. The main pieces we had to write were a framework scheduler (which uses delay scheduling for locality) and user APIs.

Figure 4: Data flow of a logistic regression job in Dryad vs. Spark. Solid lines show data flow within the framework. Dashed lines show reads from a distributed file system. Spark reuses in-memory data across iterations to improve efficiency. [Diagram not reproduced in this text version.]

6 Evaluation

We evaluated Mesos through a series of experiments on the Amazon Elastic Compute Cloud (EC2). We begin with a macrobenchmark that evaluates how the system shares resources between four workloads, and go on to present a series of smaller experiments designed to evaluate overhead, decentralized scheduling, our specialized framework (Spark), scalability, and failure recovery.

6.1 Macrobenchmark

To evaluate the primary goal of Mesos, which is enabling diverse frameworks to efficiently share a cluster, we ran a macrobenchmark consisting of a mix of four workloads:
- A Hadoop instance running a mix of small and large jobs based on the workload at Facebook.
- A Hadoop instance running a set of large batch jobs.
- Spark running a series of machine learning jobs.
- Torque running a series of MPI jobs.

We compared a scenario where the workloads ran as four frameworks on a 96-node Mesos cluster using fair sharing to a scenario where they were each given a static partition of the cluster (24 nodes), and measured job response times and resource utilization in both cases. We used EC2 nodes with 4 CPU cores and 15 GB of RAM.
We begin by describing the four workloads in more detail, and then present our results.

6.1.1 Macrobenchmark Workloads

Facebook Hadoop Mix: Our Hadoop job mix was based on the distribution of job sizes and inter-arrival times at Facebook, reported in [38]. The workload consists of 100 jobs submitted at fixed times over a 25-minute period, with a mean inter-arrival time of 14s. Most of the jobs are small (1-12 tasks), but there are also large jobs of up to 400 tasks (see Footnote 4). The jobs themselves were from the Hive benchmark [6], which contains four types of queries: text search, a simple selection, an aggregation, and a join that gets translated into multiple MapReduce steps. We grouped the jobs into eight bins of job type and size (listed in Table 3) so that we could compare performance in each bin. We also set the framework scheduler to perform fair sharing between its jobs, as this policy is used at Facebook.

Footnote 4: We scaled down the largest jobs in [38] to have the workload fit a quarter of our cluster size.

Bin   Job Type      Map Tasks   Reduce Tasks   # Jobs Run
1     selection     1           NA             38
2     text search   2           NA             18
3     aggregation   10          2              14
4     selection     50          NA             12
5     aggregation   100         10             6
6     selection     200         NA             6
7     text search   400         NA             4
8     join          400         30             2

Table 3: Job types for each bin in our Facebook Hadoop mix.

Large Hadoop Mix: To emulate batch workloads that need to run continuously, such as web crawling, we had a second instance of Hadoop run a series of IO-intensive 2400-task text search jobs. A script launched ten of these jobs, submitting each one after the previous one finished.

Spark: We ran five instances of an iterative machine learning job on Spark. These were launched by a script that waited 2 minutes after each job ended to submit the next. The job we used was alternating least squares (ALS), a collaborative filtering algorithm [42]. This job is CPU-intensive but also benefits from caching its input data on each node, and needs to broadcast updated parameters to all nodes running its tasks on each iteration.

Torque / MPI: Our Torque framework ran eight instances of the tachyon raytracing job [35] that is part of the SPEC MPI2007 benchmark. Six of the jobs ran small problem sizes and two ran large ones. Both types used 24 parallel tasks. We submitted these jobs at fixed times to both clusters. The tachyon job is CPU-intensive.
Figure 5: Comparison of cluster shares (fraction of CPUs) over time for each of the frameworks in the Mesos and static partitioning macrobenchmark scenarios. On Mesos, frameworks can scale up when their demand is high and that of other frameworks is low, and thus finish jobs faster. Note that the plots' time axes are different (e.g., the large Hadoop mix takes 3200s with static partitioning). [Plots not reproduced in this text version.]

Figure 6: Framework shares on Mesos during the macrobenchmark. By pooling resources, Mesos lets each workload scale up to fill gaps in the demand of others. In addition, fine-grained sharing allows resources to be reallocated in tens of seconds. [Plot not reproduced.]

Figure 7: Average CPU and memory utilization over time across all nodes in the Mesos cluster vs. static partitioning. [Plots not reproduced.]



Torque / MPI Our Torque framework ran eight instances of the tachyon raytracing job [35] that is part of
the SPEC MPI2007 benchmark. Six of the jobs ran small
problem sizes and two ran large ones. Both types used 24
parallel tasks. We submitted these jobs at fixed times to
both clusters. The tachyon job is CPU-intensive.
6.1.2 Macrobenchmark Results

A successful result for Mesos would show two things: that Mesos achieves higher utilization than static partitioning, and that jobs finish at least as fast in the shared cluster as they do in their static partition, and possibly faster due to gaps in the demand of other frameworks. Our results show both effects, as detailed below.
We show the fraction of CPU cores allocated to each framework by Mesos over time in Figure 6. We see that Mesos enables each framework to scale up during periods when other frameworks have low demands, and thus keeps cluster nodes busier. For example, at time 350, when both Spark and the Facebook Hadoop framework have no running jobs and Torque is using 1/8 of the cluster, the large-job Hadoop framework scales up to 7/8 of the cluster. In addition, we see that resources are reallocated rapidly (e.g., when a Facebook Hadoop job starts around time 360) due to the fine-grained nature of tasks. Finally, higher allocation of nodes also translates into increased CPU and memory utilization (by 10% for CPU and 17% for memory), as shown in Figure 7.
A second question is how much better jobs perform under Mesos than when using a statically partitioned cluster. We present this data in two ways.

Framework             Sum of Exec Times w/        Sum of Exec Times    Speedup
                      Static Partitioning (s)     with Mesos (s)
Facebook Hadoop Mix   7235                        6319                 1.14
Large Hadoop Mix      3143                        1494                 2.10
Spark                 1684                        1338                 1.26
Torque / MPI          3210                        3352                 0.96

Table 4: Aggregate performance of each framework in the macrobenchmark (sum of running times of all the jobs in the framework). The speedup column shows the relative gain on Mesos.

Framework             Job Type          Exec Time w/ Static     Avg. Speedup
                                        Partitioning (s)        on Mesos
Facebook Hadoop Mix   selection (1)     24                      0.84
                      text search (2)   31                      0.90
                      aggregation (3)   82                      0.94
                      selection (4)     65                      1.40
                      aggregation (5)   192                     1.26
                      selection (6)     136                     1.71
                      text search (7)   137                     2.14
                      join (8)          662                     1.35
Large Hadoop Mix      text search       314                     2.21
Spark                 ALS               337                     1.36
Torque / MPI          small tachyon     261                     0.91
                      large tachyon     822                     0.88

Table 5: Performance of each job type in the macrobenchmark. Bins for the Facebook Hadoop mix are in parentheses.

First, Figure 5 compares the resource allocation over time of each framework in the shared and statically partitioned clusters. Shaded areas show the allocation in the statically partitioned cluster, while solid lines show the share on Mesos. We see that the fine-grained frameworks (Hadoop and Spark) take advantage of Mesos to scale up beyond 1/4 of the cluster when global demand allows this, and consequently finish bursts of submitted jobs faster in Mesos. At the same time, Torque achieves roughly similar allocations and job durations under Mesos (with some differences explained later).
Second, Tables 4 and 5 show a breakdown of job performance for each framework. In Table 4, we compare the aggregate performance of each framework, defined as the sum of job running times, in the static partitioning and Mesos scenarios. We see that the Hadoop and Spark jobs as a whole finish faster on Mesos, while Torque is slightly slower. The framework that gains the most is the large-job Hadoop mix, which almost always has tasks to run and fills in the gaps in demand of the other frameworks; this framework performs 2x better on Mesos.
Table 5 breaks down the results further by job type. We observe two notable trends. First, in the Facebook Hadoop mix, the smaller jobs perform worse on Mesos. This is due to an interaction between the fair sharing performed by Hadoop (among its jobs) and the fair sharing in Mesos (among frameworks): during periods of time when Hadoop has more than 1/4 of the cluster, if any jobs are submitted to the other frameworks, there is a delay before Hadoop gets a new resource offer (because any freed-up resources go to the framework farthest below its share), so any small job submitted during this time is delayed for a long time relative to its length. In contrast, when running alone, Hadoop can assign resources to the new job as soon as any of its tasks finishes. This problem with hierarchical fair sharing is also seen in networks [34], and could be mitigated by running the small jobs on a separate framework or using a different allocation policy (e.g., using lottery scheduling instead of offering all freed resources to the framework with the lowest share).
Lastly, Torque is the only framework that performed worse, on average, on Mesos. The large tachyon jobs took on average 2 minutes longer, while the small ones took 20s longer. Some of this delay is due to Torque having to wait to launch 24 tasks on Mesos before starting each job, but the average time this takes is 12s. We believe that the rest of the delay is due to stragglers (slow nodes). In our standalone Torque run, we saw two jobs take about 60s longer to run than the others (Fig. 5d). We discovered that both of these jobs were using a node that performed slower on single-node benchmarks than the others (in fact, Linux reported 40% lower bogomips on it). Because tachyon hands out equal amounts of work to each node, it runs as slowly as the slowest node.
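For clarity (our own arithmetic note, not text from the paper), the Speedup column in Table 4 is simply the ratio of the two aggregate execution times; for example, for the large Hadoop mix:

\[ \text{Speedup} = \frac{\text{exec time, static partitioning}}{\text{exec time, Mesos}} = \frac{3143\text{ s}}{1494\text{ s}} \approx 2.10 . \]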
6.2 Overhead

To measure the overhead Mesos imposes when a single framework uses the cluster, we ran two benchmarks using MPI and Hadoop on an EC2 cluster with 50 nodes, each with 2 CPU cores and 6.5 GB RAM. We used the High-Performance LINPACK [15] benchmark for MPI and a WordCount job for Hadoop, and ran each job three times. The MPI job took on average 50.9s without Mesos and 51.8s with Mesos, while the Hadoop job took 160s without Mesos and 166s with Mesos. In both cases, the overhead of using Mesos was less than 4%.
6.3 Data Locality through Delay Scheduling

In this experiment, we evaluated how Mesos's resource offer mechanism enables frameworks to control their tasks' placement, and in particular, data locality. We ran 16 instances of Hadoop using 93 EC2 nodes, each with 4 CPU cores and 15 GB RAM. Each node ran a map-only scan job that searched a 100 GB file spread throughout the cluster on a shared HDFS file system and outputted 1% of the records. We tested four scenarios: giving each Hadoop instance its own 5-6 node static partition of the cluster (to emulate organizations that use coarse-grained cluster sharing systems), and running all instances on Mesos using either no delay scheduling, 1s delay scheduling, or 5s delay scheduling.
Figure 8 shows averaged measurements from the 16 Hadoop instances across three runs of each scenario. Using static partitioning yields very low data locality (18%) because the Hadoop instances are forced to fetch data from nodes outside their partition. In contrast, running the Hadoop instances on Mesos improves data locality, even without delay scheduling, because each Hadoop instance has tasks on more nodes of the cluster (there are 4 tasks per node), and can therefore access more blocks locally. Adding a 1-second delay brings locality above 90%, and a 5-second delay achieves 95% locality, which is competitive with running one Hadoop instance alone on the whole cluster. As expected, job performance improves with data locality: jobs run 1.7x faster in the 5s delay scenario than with static partitioning.

Figure 8: Data locality and average job durations for 16 Hadoop instances running on a 93-node cluster using static partitioning, Mesos, or Mesos with delay scheduling. [Bar chart not reproduced.]
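A condensed sketch of the delay-scheduling decision used above (our own illustration of the technique from [38], not Hadoop's or Mesos's actual code; the Offer and PendingJob types are hypothetical): a job skips offers on non-local nodes until it has already waited longer than a threshold D.

case class Offer(node: String)
case class PendingJob(inputNodes: Set[String], var waitedSeconds: Double = 0.0)

sealed trait Decision
case object LaunchLocal extends Decision
case object LaunchNonLocal extends Decision
case object Skip extends Decision

// Delay scheduling: prefer nodes holding the job's input; only fall back to a
// non-local node after the job has waited D seconds for a local slot.
def decide(job: PendingJob, offer: Offer, delayD: Double): Decision =
  if (job.inputNodes.contains(offer.node)) LaunchLocal
  else if (job.waitedSeconds >= delayD) LaunchNonLocal
  else Skip

val job = PendingJob(inputNodes = Set("node-3", "node-9"))
println(decide(job, Offer("node-5"), delayD = 5.0))  // Skip: keep waiting for locality
job.waitedSeconds = 6.0
println(decide(job, Offer("node-5"), delayD = 5.0))  // LaunchNonLocal: waited past the delay
println(decide(job, Offer("node-3"), delayD = 5.0))  // LaunchLocal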

Figure 9: Hadoop and Spark logistic regression running times. [Plot not reproduced.]


6.4 Spark Framework

We evaluated the benefit of running iterative jobs using


the specialized Spark framework we developed on top
of Mesos (Section 5.3) over the general-purpose Hadoop
framework. We used a logistic regression job implemented in Hadoop by machine learning researchers in
our lab, and wrote a second version of the job using
Spark. We ran each version separately on 20 EC2 nodes,
each with 4 CPU cores and 15 GB RAM. Each experiment used a 29 GB data file and varied the number of
logistic regression iterations from 1 to 30 (see Figure 9).
With Hadoop, each iteration takes 127s on average,
because it runs as a separate MapReduce job. In contrast,
with Spark, the first iteration takes 174s, but subsequent
iterations only take about 6 seconds, leading to a speedup
of up to 10x for 30 iterations. This happens because the
cost of reading the data from disk and parsing it is much
higher than the cost of evaluating the gradient function
computed by the job on each iteration. Hadoop incurs the
read/parsing cost on each iteration, while Spark reuses
cached blocks of parsed data and only incurs this cost
once. The longer time for the first iteration in Spark is
due to the use of slower text parsing routines.
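To make the contrast concrete, here is a compact logistic-regression loop written in plain Scala (our sketch, following the style of the published Spark examples rather than the exact job used in this experiment; the Point type, sample data, and step size are illustrative). In Spark, the same loop body would run over an RDD of points persisted in memory, so only the gradient computation repeats on each iteration.

// Plain-Scala stand-in for the cached dataset of labeled points (y is +1 or -1).
case class Point(x: Array[Double], y: Double)

def logisticGradient(w: Array[Double], points: Seq[Point]): Array[Double] =
  points.map { p =>
    val margin = p.y * w.zip(p.x).map { case (wi, xi) => wi * xi }.sum
    val scale = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * p.y
    p.x.map(_ * scale)                                   // per-point contribution to the gradient
  }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })

val points = Seq(Point(Array(1.0, 2.0), 1.0), Point(Array(-1.5, -0.5), -1.0))
var w = Array.fill(2)(scala.util.Random.nextDouble() - 0.5)   // start from a random w
for (_ <- 1 to 30)                                            // 30 iterations reuse the same data
  w = w.zip(logisticGradient(w, points)).map { case (wi, gi) => wi - 0.1 * gi }
println(w.mkString("w = [", ", ", "]"))

Run as a chain of Hadoop jobs, each of the 30 iterations would re-read and re-parse the input; with the dataset cached, only the in-memory gradient pass repeats, which is why iterations after the first drop to a few seconds.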
6.5 Mesos Scalability

To evaluate Mesos's scalability, we emulated large clusters by running up to 50,000 slave daemons on 99 Amazon EC2 nodes, each with 8 CPU cores and 6 GB RAM. We used one EC2 node for the master and the rest of the nodes to run slaves. During the experiment, each of 200 frameworks running throughout the cluster continuously launches tasks, starting one task on each slave that it receives a resource offer for. Each task sleeps for a period of time based on a normal distribution with a mean of 30 seconds and standard deviation of 10s, and then ends. Each slave runs up to two tasks at a time.
Once the cluster reached steady state (i.e., the 200 frameworks achieved their fair shares and all resources were allocated), we launched a test framework that runs a single 10-second task and measured how long this framework took to finish. This allowed us to calculate the extra delay incurred over 10s due to having to register with the master, wait for a resource offer, accept it, wait for the master to process the response and launch the task on a slave, and wait for Mesos to report the task as finished.
We plot this extra delay in Figure 10, showing averages of 5 runs. We observe that the overhead remains small (less than one second) even at 50,000 nodes. In particular, this overhead is much smaller than the average task and job lengths in data center workloads (see Section 2). Because Mesos was also keeping the cluster fully allocated, this indicates that the master kept up with the load placed on it. Unfortunately, the EC2 virtualized environment limited scalability beyond 50,000 slaves, because at 50,000 slaves the master was processing 100,000 packets per second (in+out), which has been shown to be the current achievable limit on EC2 [12].

Figure 10: Mesos master's scalability versus number of slaves. [Plot not reproduced.]

6.6 Failure Recovery

To evaluate recovery from master failures, we conducted an experiment with 200 to 4000 slave daemons on 62 EC2 nodes with 4 cores and 15 GB RAM. We ran 200 frameworks that each launched 20-second tasks, and two Mesos masters connected to a 5-node ZooKeeper quorum. We synchronized the two masters' clocks using NTP

and measured the mean time to recovery (MTTR) after


killing the active master. The MTTR is the time for all of
the slaves and frameworks to connect to the second master. In all cases, the MTTR was between 4 and 8 seconds,
with 95% confidence intervals of up to 3s on either side.
6.7 Performance Isolation

As discussed in Section 3.4, Mesos leverages existing OS isolation mechanisms to provide performance isolation between different frameworks' tasks running on the same slave. While these mechanisms are not perfect, a preliminary evaluation of Linux Containers [9] shows promising results. In particular, using Containers to isolate CPU usage between a MediaWiki web server (consisting of multiple Apache processes running PHP) and a "hog" application (consisting of 256 processes spinning in infinite loops) shows on average only a 30% increase in request latency for Apache versus a 550% increase when running without Containers. We refer the reader to [29] for a fuller evaluation of OS isolation mechanisms.


7 Related Work

HPC and Grid Schedulers. The high performance


computing (HPC) community has long been managing
clusters [33, 41]. However, their target environment typically consists of specialized hardware, such as Infiniband and SANs, where jobs do not need to be scheduled
local to their data. Furthermore, each job is tightly coupled, often using barriers or message passing. Thus, each
job is monolithic, rather than composed of fine-grained
tasks, and does not change its resource demands during
its lifetime. For these reasons, HPC schedulers use centralized scheduling, and require users to declare the required resources at job submission time. Jobs are then
given coarse-grained allocations of the cluster. Unlike
the Mesos approach, this does not allow jobs to locally
access data distributed across the cluster. Furthermore,
jobs cannot grow and shrink dynamically. In contrast,
Mesos supports fine-grained sharing at the level of tasks
and allows frameworks to control their placement.
Grid computing has mostly focused on the problem
of making diverse virtual organizations share geographically distributed and separately administered resources
in a secure and interoperable way. Mesos could well be
used within a virtual organization inside a larger grid.

Public and Private Clouds. Virtual machine clouds such as Amazon EC2 [1] and Eucalyptus [31] share common goals with Mesos, such as isolating applications while providing a low-level abstraction (VMs). However, they differ from Mesos in several important ways. First, their relatively coarse-grained VM allocation model leads to less efficient resource utilization and data sharing than in Mesos. Second, these systems generally do not let applications specify placement needs beyond the size of VM they require. In contrast, Mesos allows frameworks to be highly selective about task placement.

Quincy. Quincy [25] is a fair scheduler for Dryad that uses a centralized scheduling algorithm for Dryad's DAG-based programming model. In contrast, Mesos provides the lower-level abstraction of resource offers to support multiple cluster computing frameworks.

Condor. The Condor cluster manager uses the ClassAds language [32] to match nodes to jobs. Using a resource specification language is not as flexible for frameworks as resource offers, since not all requirements may be expressible. Also, porting existing frameworks, which have their own schedulers, to Condor would be more difficult than porting them to Mesos, where existing schedulers fit naturally into the two-level scheduling model.

Next-Generation Hadoop. Recently, Yahoo! announced a redesign for Hadoop that uses a two-level scheduling model, where per-application masters request resources from a central manager [14]. The design aims to support non-MapReduce applications as well. While details about the scheduling model in this system are currently unavailable, we believe that the new application masters could naturally run as Mesos frameworks.

8 Conclusion and Future Work

We have presented Mesos, a thin management layer that allows diverse cluster computing frameworks to efficiently share resources. Mesos is built around two design elements: a fine-grained sharing model at the level of tasks, and a distributed scheduling mechanism called resource offers that delegates scheduling decisions to the frameworks. Together, these elements let Mesos achieve high utilization, respond quickly to workload changes, and cater to diverse frameworks while remaining scalable and robust. We have shown that existing frameworks can effectively share resources using Mesos, that Mesos enables the development of specialized frameworks providing major performance gains, such as Spark, and that Mesos's simple design allows the system to be fault tolerant and to scale to 50,000 nodes.
In future work, we plan to further analyze the resource offer model and determine whether any extensions can improve its efficiency while retaining its flexibility. In particular, it may be possible to have frameworks give richer hints about offers they would like to receive. Nonetheless, we believe that below any hint system, frameworks should still have the ability to reject offers and to choose which tasks to launch on each resource, so that their evolution is not constrained by the hint language provided by the system.
We are also currently using Mesos to manage resources on a 40-node cluster in our lab and in a test deployment at Twitter, and plan to report on lessons from these deployments in future work.


Acknowledgements

We thank our industry colleagues at Google, Twitter, Facebook, Yahoo! and Cloudera for their valuable feedback on Mesos. This research was supported by California MICRO, California Discovery, the Natural Sciences and Engineering Research Council of Canada, a National Science Foundation Graduate Research Fellowship (see Footnote 5), the Swedish Research Council, and the following Berkeley RAD Lab sponsors: Google, Microsoft, Oracle, Amazon, Cisco, Cloudera, eBay, Facebook, Fujitsu, HP, Intel, NetApp, SAP, VMware, and Yahoo!.

References

[1] Amazon EC2. http://aws.amazon.com/ec2.
[2] Apache Hadoop. http://hadoop.apache.org.
[3] Apache Hive. http://hadoop.apache.org/hive.
[4] Apache ZooKeeper. hadoop.apache.org/zookeeper.
[5] Hive - A Petabyte Scale Data Warehouse using Hadoop. http://www.facebook.com/note.php?note_id=89508453919.
[6] Hive performance benchmarks. http://issues.apache.org/jira/browse/HIVE-396.
[7] LibProcess Homepage. http://www.eecs.berkeley.edu/benh/libprocess.
[8] Linux 2.6.33 release notes. http://kernelnewbies.org/Linux_2_6_33.
[9] Linux containers (LXC) overview document. http://lxc.sourceforge.net/lxc.html.
[10] Personal communication with Dhruba Borthakur from Facebook.
[11] Personal communication with Owen O'Malley and Arun C. Murthy from the Yahoo! Hadoop team.
[12] RightScale blog. blog.rightscale.com/2010/04/01/benchmarking-load-balancers-in-the-cloud.
[13] Solaris Resource Management. http://docs.sun.com/app/docs/doc/817-1592.
[14] The Next Generation of Apache Hadoop MapReduce. http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen.
[15] E. Anderson, Z. Bai, J. Dongarra, A. Greenbaum, A. McKenney, J. Du Croz, S. Hammerling, J. Demmel, C. Bischof, and D. Sorensen. LAPACK: a portable linear algebra library for high-performance computers. In Supercomputing '90, 1990.
[16] A. Bouteiller, F. Cappello, T. Herault, G. Krawezik, P. Lemarinier, and F. Magniette. MPICH-V2: a fault tolerant MPI for volatile nodes based on pessimistic sender based message logging. In Supercomputing '03, 2003.
[17] T. Condie, N. Conway, P. Alvaro, and J. M. Hellerstein. MapReduce online. In NSDI '10, May 2010.
[18] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137-150, 2004.
[19] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox. Twister: a runtime for iterative MapReduce. In Proc. HPDC '10, 2010.
[20] D. R. Engler, M. F. Kaashoek, and J. O'Toole. Exokernel: An operating system architecture for application-level resource management. In SOSP, pages 251-266, 1995.
[21] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant resource fairness: fair allocation of multiple resource types. In NSDI, 2011.
[22] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Publishing Company, New York, NY, 2009.
[23] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. H. Katz, S. Shenker, and I. Stoica. Mesos: A platform for fine-grained resource sharing in the data center. Technical Report UCB/EECS-2010-87, UC Berkeley, May 2010.
[24] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In EuroSys '07, 2007.
[25] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: Fair scheduling for distributed computing clusters. In SOSP, November 2009.
[26] S. Y. Ko, I. Hoque, B. Cho, and I. Gupta. On availability of intermediate data in cloud computations. In HOTOS, May 2009.
[27] D. Logothetis, C. Olston, B. Reed, K. C. Webb, and K. Yocum. Stateful bulk processing for incremental analytics. In Proc. ACM Symposium on Cloud Computing, SoCC '10, 2010.
[28] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD, pages 135-146, 2010.
[29] J. N. Matthews, W. Hu, M. Hapuarachchi, T. Deshane, D. Dimatos, G. Hamilton, M. McCabe, and J. Owens. Quantifying the performance isolation properties of virtualization systems. In ExpCS '07, 2007.
[30] D. G. Murray, M. Schwarzkopf, C. Smowton, S. Smith, A. Madhavapeddy, and S. Hand. CIEL: a universal execution engine for distributed data-flow computing. In NSDI, 2011.
[31] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov. The Eucalyptus open-source cloud-computing system. In CCA '08, 2008.
[32] R. Raman, M. Livny, and M. Solomon. Matchmaking: An extensible framework for distributed resource management. Cluster Computing, 2:129-138, April 1999.
[33] G. Staples. TORQUE resource manager. In Proc. Supercomputing '06, 2006.
[34] I. Stoica, H. Zhang, and T. S. E. Ng. A hierarchical fair service curve algorithm for link-sharing, real-time and priority services. In SIGCOMM '97, pages 249-262, 1997.
[35] J. Stone. Tachyon ray tracing system. http://jedi.ks.uiuc.edu/johns/raytracer.
[36] C. A. Waldspurger and W. E. Weihl. Lottery scheduling: flexible proportional-share resource management. In OSDI, 1994.
[37] Y. Yu, P. K. Gunda, and M. Isard. Distributed aggregation for data-parallel computing: interfaces and implementations. In SOSP '09, pages 247-260, 2009.
[38] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In EuroSys '10, 2010.
[39] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: cluster computing with working sets. In Proc. HotCloud '10, 2010.
[40] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica. Improving MapReduce performance in heterogeneous environments. In Proc. OSDI '08, 2008.
[41] S. Zhou. LSF: Load sharing in large-scale heterogeneous distributed systems. In Workshop on Cluster Computing, 1992.
[42] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the Netflix prize. In AAIM, pages 337-348. Springer-Verlag, 2008.

Footnote 5: Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the NSF.


Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma,
Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica
University of California, Berkeley

Abstract
We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a
fault-tolerant manner. RDDs are motivated by two types
of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data
mining tools. In both cases, keeping data in memory
can improve performance by an order of magnitude.
To achieve fault tolerance efficiently, RDDs provide a
restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates
to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these
models do not capture. We have implemented RDDs in a
system called Spark, which we evaluate through a variety
of user applications and benchmarks.

1 Introduction

Cluster computing frameworks like MapReduce [10] and


Dryad [19] have been widely adopted for large-scale data
analytics. These systems let users write parallel computations using a set of high-level operators, without having
to worry about work distribution and fault tolerance.
Although current frameworks provide numerous abstractions for accessing a cluster's computational resources, they lack abstractions for leveraging distributed
memory. This makes them inefficient for an important
class of emerging applications: those that reuse intermediate results across multiple computations. Data reuse is
common in many iterative machine learning and graph
algorithms, including PageRank, K-means clustering,
and logistic regression. Another compelling use case is
interactive data mining, where a user runs multiple ad-hoc queries on the same subset of the data. Unfortunately, in most current frameworks, the only way to reuse
data between computations (e.g., between two MapReduce jobs) is to write it to an external stable storage system, e.g., a distributed file system. This incurs substantial
overheads due to data replication, disk I/O, and serializa-

tion, which can dominate application execution times.


Recognizing this problem, researchers have developed
specialized frameworks for some applications that require data reuse. For example, Pregel [22] is a system for
iterative graph computations that keeps intermediate data
in memory, while HaLoop [7] offers an iterative MapReduce interface. However, these frameworks only support
specific computation patterns (e.g., looping a series of
MapReduce steps), and perform data sharing implicitly
for these patterns. They do not provide abstractions for
more general reuse, e.g., to let a user load several datasets
into memory and run ad-hoc queries across them.
In this paper, we propose a new abstraction called resilient distributed datasets (RDDs) that enables efficient
data reuse in a broad range of applications. RDDs are
fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control
their partitioning to optimize data placement, and manipulate them using a rich set of operators.
The main challenge in designing RDDs is defining a
programming interface that can provide fault tolerance
efficiently. Existing abstractions for in-memory storage
on clusters, such as distributed shared memory [24], keyvalue stores [25], databases, and Piccolo [27], offer an
interface based on fine-grained updates to mutable state
(e.g., cells in a table). With this interface, the only ways
to provide fault tolerance are to replicate the data across
machines or to log updates across machines. Both approaches are expensive for data-intensive workloads, as
they require copying large amounts of data over the cluster network, whose bandwidth is far lower than that of
RAM, and they incur substantial storage overhead.
In contrast to these systems, RDDs provide an interface based on coarse-grained transformations (e.g., map,
filter and join) that apply the same operation to many
data items. This allows them to efficiently provide fault
tolerance by logging the transformations used to build a
dataset (its lineage) rather than the actual data (see Footnote 1). If a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to recompute just that partition. Thus, lost data can be recovered, often quite quickly, without requiring costly replication.

Footnote 1: Checkpointing the data in some RDDs may be useful when a lineage chain grows large, however, and we discuss how to do it in Section 5.4.
Although an interface based on coarse-grained transformations may at first seem limited, RDDs are a good
fit for many parallel applications, because these applications naturally apply the same operation to multiple
data items. Indeed, we show that RDDs can efficiently
express many cluster programming models that have so
far been proposed as separate systems, including MapReduce, DryadLINQ, SQL, Pregel and HaLoop, as well as
new applications that these systems do not capture, like
interactive data mining. The ability of RDDs to accommodate computing needs that were previously met only
by introducing new frameworks is, we believe, the most
credible evidence of the power of the RDD abstraction.
We have implemented RDDs in a system called Spark,
which is being used for research and production applications at UC Berkeley and several companies. Spark provides a convenient language-integrated programming interface similar to DryadLINQ [31] in the Scala programming language [2]. In addition, Spark can be used interactively to query big datasets from the Scala interpreter.
We believe that Spark is the first system that allows a
general-purpose programming language to be used at interactive speeds for in-memory data mining on clusters.
We evaluate RDDs and Spark through both microbenchmarks and measurements of user applications.
We show that Spark is up to 20x faster than Hadoop for iterative applications, speeds up a real-world data analytics report by 40x, and can be used interactively to scan a 1 TB dataset with 5-7s latency. More fundamentally, to
illustrate the generality of RDDs, we have implemented
the Pregel and HaLoop programming models on top of
Spark, including the placement optimizations they employ, as relatively small libraries (200 lines of code each).
This paper begins with an overview of RDDs (2) and
Spark (3). We then discuss the internal representation
of RDDs (4), our implementation (5), and experimental results (6). Finally, we discuss how RDDs capture
several existing cluster programming models (7), survey related work (8), and conclude.

2 Resilient Distributed Datasets (RDDs)

This section provides an overview of RDDs. We first define RDDs (2.1) and introduce their programming interface in Spark (2.2). We then compare RDDs with finer-grained shared memory abstractions (2.3). Finally, we
discuss limitations of the RDD model (2.4).
2.1 RDD Abstraction

Formally, an RDD is a read-only, partitioned collection


of records. RDDs can only be created through deterministic operations on either (1) data in stable storage or (2)
other RDDs. We call these operations transformations to

differentiate them from other operations on RDDs. Examples of transformations include map, filter, and join (see Footnote 2).
RDDs do not need to be materialized at all times. Instead, an RDD has enough information about how it was
derived from other datasets (its lineage) to compute its
partitions from data in stable storage. This is a powerful property: in essence, a program cannot reference an
RDD that it cannot reconstruct after a failure.
Finally, users can control two other aspects of RDDs:
persistence and partitioning. Users can indicate which
RDDs they will reuse and choose a storage strategy for
them (e.g., in-memory storage). They can also ask that
an RDD's elements be partitioned across machines based
on a key in each record. This is useful for placement optimizations, such as ensuring that two datasets that will
be joined together are hash-partitioned in the same way.
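As a toy illustration of the lineage idea in this section (our own sketch, not Spark's internal classes), each dataset below records only how to recompute a partition from its parent, so a lost partition can be rebuilt deterministically from stable storage:

// A minimal "RDD": a set of partitions plus a way to (re)compute any partition.
trait MiniRDD[T] { def numPartitions: Int; def compute(part: Int): Seq[T] }

// Base dataset backed by stable storage (here, an in-memory stand-in for files).
class SourceRDD[T](data: Vector[Seq[T]]) extends MiniRDD[T] {
  def numPartitions: Int = data.length
  def compute(part: Int): Seq[T] = data(part)
}

// A coarse-grained transformation: it records its parent and the function applied,
// which is exactly the lineage needed to recompute a lost partition.
class MappedRDD[T, U](parent: MiniRDD[T], f: T => U) extends MiniRDD[U] {
  def numPartitions: Int = parent.numPartitions
  def compute(part: Int): Seq[U] = parent.compute(part).map(f)
}

val lines = new SourceRDD(Vector(Seq("ERROR a", "INFO b"), Seq("ERROR c")))
val upper = new MappedRDD[String, String](lines, _.toUpperCase)
// Pretend the cached copy of partition 1 was lost: recompute it from its lineage.
println(upper.compute(1))   // List(ERROR C)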
2.2 Spark Programming Interface

Spark exposes RDDs through a language-integrated API


similar to DryadLINQ [31] and FlumeJava [8], where
each dataset is represented as an object and transformations are invoked using methods on these objects.
Programmers start by defining one or more RDDs
through transformations on data in stable storage
(e.g., map and filter). They can then use these RDDs in
actions, which are operations that return a value to the
application or export data to a storage system. Examples
of actions include count (which returns the number of
elements in the dataset), collect (which returns the elements themselves), and save (which outputs the dataset
to a storage system). Like DryadLINQ, Spark computes
RDDs lazily the first time they are used in an action, so
that it can pipeline transformations.
In addition, programmers can call a persist method to
indicate which RDDs they want to reuse in future operations. Spark keeps persistent RDDs in memory by default, but it can spill them to disk if there is not enough
RAM. Users can also request other persistence strategies,
such as storing the RDD only on disk or replicating it
across machines, through flags to persist. Finally, users
can set a persistence priority on each RDD to specify
which in-memory data should spill to disk first.
2.2.1 Example: Console Log Mining

Suppose that a web service is experiencing errors and an


operator wants to search terabytes of logs in the Hadoop
filesystem (HDFS) to find the cause. Using Spark, the operator can load just the error messages from the logs into
RAM across a set of nodes and query them interactively.
She would first type the following Scala code:
Footnote 2: Although individual RDDs are immutable, it is possible to implement mutable state by having multiple RDDs to represent multiple versions of a dataset. We made RDDs immutable to make it easier to describe lineage graphs, but it would have been equivalent to have our abstraction be versioned datasets and track versions in lineage graphs.

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()

Line 1 defines an RDD backed by an HDFS file (as a


collection of lines of text), while line 2 derives a filtered
RDD from it. Line 3 then asks for errors to persist in
memory so that it can be shared across queries. Note that
the argument to filter is Scala syntax for a closure.
At this point, no work has been performed on the cluster. However, the user can now use the RDD in actions,
e.g., to count the number of messages:
errors.count()

The user can also perform further transformations on


the RDD and use their results, as in the following lines:
// Count errors mentioning MySQL:
errors.filter(_.contains("MySQL")).count()
// Return the time fields of errors mentioning
// HDFS as an array (assuming time is field
// number 3 in a tab-separated format):
errors.filter(_.contains("HDFS"))
.map(_.split('\t')(3))
.collect()

After the first action involving errors runs, Spark will


store the partitions of errors in memory, greatly speeding up subsequent computations on it. Note that the base
RDD, lines, is not loaded into RAM. This is desirable
because the error messages might only be a small fraction of the data (small enough to fit into memory).
Finally, to illustrate how our model achieves fault tolerance, we show the lineage graph for the RDDs in our
third query in Figure 1. In this query, we started with
errors, the result of a filter on lines, and applied a further filter and map before running a collect. The Spark
scheduler will pipeline the latter two transformations and
send a set of tasks to compute them to the nodes holding
the cached partitions of errors. In addition, if a partition
of errors is lost, Spark rebuilds it by applying a filter on
only the corresponding partition of lines.

Aspect                       RDDs                                      Distr. Shared Mem.
Reads                        Coarse- or fine-grained                   Fine-grained
Writes                       Coarse-grained                            Fine-grained
Consistency                  Trivial (immutable)                       Up to app / runtime
Fault recovery               Fine-grained and low-overhead             Requires checkpoints and
                             using lineage                             program rollback
Straggler mitigation         Possible using backup tasks               Difficult
Work placement               Automatic based on data locality          Up to app (runtimes aim
                                                                       for transparency)
Behavior if not enough RAM   Similar to existing data flow systems     Poor performance (swapping?)

Table 1: Comparison of RDDs with distributed shared memory.

2.3 Advantages of the RDD Model

To understand the benefits of RDDs as a distributed memory abstraction, we compare them against distributed shared memory (DSM) in Table 1. In DSM systems, applications read and write to arbitrary locations in
a global address space. Note that under this definition, we
include not only traditional shared memory systems [24],
but also other systems where applications make finegrained writes to shared state, including Piccolo [27],
which provides a shared DHT, and distributed databases.
DSM is a very general abstraction, but this generality
makes it harder to implement in an efficient and faulttolerant manner on commodity clusters.
The main difference between RDDs and DSM is that
RDDs can only be created (written) through coarsegrained transformations, while DSM allows reads and
writes to each memory location.3 This restricts RDDs
to applications that perform bulk writes, but allows for
more efficient fault tolerance. In particular, RDDs do not
need to incur the overhead of checkpointing, as they can
be recovered using lineage.4 Furthermore, only the lost
partitions of an RDD need to be recomputed upon failure, and they can be recomputed in parallel on different
nodes, without having to roll back the whole program.
A second benefit of RDDs is that their immutable nature lets a system mitigate slow nodes (stragglers) by running backup copies of slow tasks as in MapReduce [10].
Backup tasks would be hard to implement with DSM, as
the two copies of a task would access the same memory
locations and interfere with each others updates.
Finally, RDDs provide two other benefits over DSM. First, in bulk operations on RDDs, a runtime can schedule tasks based on data locality to improve performance. Second, RDDs degrade gracefully when there is not enough memory to store them, as long as they are only being used in scan-based operations. Partitions that do not fit in RAM can be stored on disk and will provide similar performance to current data-parallel systems.
3 Note that reads on RDDs can still be fine-grained. For example, an application can treat an RDD as a large read-only lookup table.
4 In some applications, it can still help to checkpoint RDDs with long lineage chains, as we discuss in Section 5.4. However, this can be done in the background because RDDs are immutable, and there is no need to take a snapshot of the whole application as in DSM.

[Figure 2: Spark runtime. The user's driver program launches multiple workers, which read data blocks from a distributed file system and can persist computed RDD partitions in memory.]
2.4 Applications Not Suitable for RDDs

As discussed in the Introduction, RDDs are best suited for batch applications that apply the same operation to
all elements of a dataset. In these cases, RDDs can efficiently remember each transformation as one step in a
lineage graph and can recover lost partitions without having to log large amounts of data. RDDs would be less
suitable for applications that make asynchronous finegrained updates to shared state, such as a storage system for a web application or an incremental web crawler.
For these applications, it is more efficient to use systems
that perform traditional update logging and data checkpointing, such as databases, RAMCloud [25], Percolator
[26] and Piccolo [27]. Our goal is to provide an efficient
programming model for batch analytics and leave these
asynchronous applications to specialized systems.

3 Spark Programming Interface

Spark provides the RDD abstraction through a language-integrated API similar to DryadLINQ [31] in Scala [2],
a statically typed functional programming language for
the Java VM. We chose Scala due to its combination of
conciseness (which is convenient for interactive use) and
efficiency (due to static typing). However, nothing about
the RDD abstraction requires a functional language.
To use Spark, developers write a driver program that
connects to a cluster of workers, as shown in Figure 2.
The driver defines one or more RDDs and invokes actions on them. Spark code on the driver also tracks the
RDDs lineage. The workers are long-lived processes
that can store RDD partitions in RAM across operations.
As we showed in the log mining example in Section 2.2.1, users provide arguments to RDD operations like map by passing closures (function literals).


Scala represents each closure as a Java object, and
these objects can be serialized and loaded on another
node to pass the closure across the network. Scala also
saves any variables bound in the closure as fields in
the Java object. For example, one can write code like
var x = 5; rdd.map(_ + x) to add 5 to each element
of an RDD.5
RDDs themselves are statically typed objects
parametrized by an element type. For example,
RDD[Int] is an RDD of integers. However, most of our
examples omit types since Scala supports type inference.
Although our method of exposing RDDs in Scala is
conceptually simple, we had to work around issues with
Scalas closure objects using reflection [33]. We also
needed more work to make Spark usable from the Scala
interpreter, as we shall discuss in Section 5.2. Nonetheless, we did not have to modify the Scala compiler.
3.1 RDD Operations in Spark

Table 2 lists the main RDD transformations and actions available in Spark. We give the signature of each operation, showing type parameters in square brackets. Recall that transformations are lazy operations that define a
new RDD, while actions launch a computation to return
a value to the program or write data to external storage.
Note that some operations, such as join, are only available on RDDs of key-value pairs. Also, our function
names are chosen to match other APIs in Scala and other
functional languages; for example, map is a one-to-one
mapping, while flatMap maps each input value to one or
more outputs (similar to the map in MapReduce).
In addition to these operators, users can ask for an
RDD to persist. Furthermore, users can get an RDDs
partition order, which is represented by a Partitioner
class, and partition another dataset according to it. Operations such as groupByKey, reduceByKey and sort automatically result in a hash or range partitioned RDD.
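As a small illustration of these operations (the SparkContext setup and toy data below are assumptions, not part of the paper's examples):
import org.apache.spark.{SparkConf, SparkContext}

object PairOperationsExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pair-ops").setMaster("local[*]"))
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // reduceByKey shuffles the data and yields a hash-partitioned RDD of summed values.
    val sums = pairs.reduceByKey(_ + _)
    println(sums.partitioner)                  // Some(...HashPartitioner...)

    // Partitioning another dataset the same way makes the subsequent join narrow.
    val other = sc.parallelize(Seq(("a", "x"), ("b", "y"))).partitionBy(sums.partitioner.get)
    println(sums.join(other).collect().toSeq)

    sc.stop()
  }
}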
3.2 Example Applications

We complement the data mining example in Section 2.2.1 with two iterative applications: logistic regression
and PageRank. The latter also showcases how control of
RDDs partitioning can improve performance.
3.2.1 Logistic Regression

Many machine learning algorithms are iterative in nature because they run iterative optimization procedures, such
as gradient descent, to maximize a function. They can
thus run much faster by keeping their data in memory.
As an example, the following program implements logistic regression [14], a common classification algorithm
5 We save each closure at the time it is created, so that the map in
this example will always add 5 even if x changes.

Transformations:
  map(f : T ⇒ U)                  : RDD[T] ⇒ RDD[U]
  filter(f : T ⇒ Bool)            : RDD[T] ⇒ RDD[T]
  flatMap(f : T ⇒ Seq[U])         : RDD[T] ⇒ RDD[U]
  sample(fraction : Float)        : RDD[T] ⇒ RDD[T] (Deterministic sampling)
  groupByKey()                    : RDD[(K, V)] ⇒ RDD[(K, Seq[V])]
  reduceByKey(f : (V, V) ⇒ V)     : RDD[(K, V)] ⇒ RDD[(K, V)]
  union()                         : (RDD[T], RDD[T]) ⇒ RDD[T]
  join()                          : (RDD[(K, V)], RDD[(K, W)]) ⇒ RDD[(K, (V, W))]
  cogroup()                       : (RDD[(K, V)], RDD[(K, W)]) ⇒ RDD[(K, (Seq[V], Seq[W]))]
  crossProduct()                  : (RDD[T], RDD[U]) ⇒ RDD[(T, U)]
  mapValues(f : V ⇒ W)            : RDD[(K, V)] ⇒ RDD[(K, W)] (Preserves partitioning)
  sort(c : Comparator[K])         : RDD[(K, V)] ⇒ RDD[(K, V)]
  partitionBy(p : Partitioner[K]) : RDD[(K, V)] ⇒ RDD[(K, V)]

Actions:
  count()                         : RDD[T] ⇒ Long
  collect()                       : RDD[T] ⇒ Seq[T]
  reduce(f : (T, T) ⇒ T)          : RDD[T] ⇒ T
  lookup(k : K)                   : RDD[(K, V)] ⇒ Seq[V] (On hash/range partitioned RDDs)
  save(path : String)             : Outputs RDD to a storage system, e.g., HDFS

Table 2: Transformations and actions available on RDDs in Spark. Seq[T] denotes a sequence of elements of type T.

that searches for a hyperplane w that best separates two sets of points (e.g., spam and non-spam emails). The algorithm uses gradient descent: it starts w at a random
value, and on each iteration, it sums a function of w over
the data to move w in a direction that improves it.
val points = spark.textFile(...)
.map(parsePoint).persist()
var w = // random initial vector
for (i <- 1 to ITERATIONS) {
val gradient = points.map{ p =>
p.x * (1/(1+exp(-p.y*(w dot p.x)))-1)*p.y
}.reduce((a,b) => a+b)
w -= gradient
}

We start by defining a persistent RDD called points as the result of a map transformation on a text file that
parses each line of text into a Point object. We then repeatedly run map and reduce on points to compute the
gradient at each step by summing a function of the current w. Keeping points in memory across iterations can
yield a 20× speedup, as we show in Section 6.1.
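For concreteness, the following is a hedged, self-contained version of this program; the Point class, the space-separated input format, and the Array-based vector arithmetic are illustrative assumptions rather than the paper's code.
import org.apache.spark.{SparkConf, SparkContext}
import scala.math.exp

// Illustrative point type: label y in {-1, +1} and a dense feature vector x.
case class Point(x: Array[Double], y: Double)

object LogisticRegressionSketch {
  // Assumed input format: "y f1 f2 ... fD" per line.
  def parsePoint(line: String): Point = {
    val t = line.split(" ").map(_.toDouble)
    Point(t.tail, t.head)
  }
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("logistic-regression-sketch"))
    val points = sc.textFile(args(0)).map(parsePoint).persist()   // cached across iterations

    val dims = points.first().x.length
    var w = Array.fill(dims)(scala.util.Random.nextDouble())      // random initial vector

    for (_ <- 1 to 10) {
      // Sum, over all points, of the per-point gradient contribution.
      val gradient = points.map { p =>
        val s = (1.0 / (1.0 + exp(-p.y * dot(w, p.x))) - 1.0) * p.y
        p.x.map(_ * s)
      }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
      w = w.zip(gradient).map { case (wi, gi) => wi - gi }        // gradient step
    }
    println(w.mkString(" "))
    sc.stop()
  }
}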
3.2.2 PageRank

A more complex pattern of data sharing occurs in PageRank [6]. The algorithm iteratively updates a rank for each document by adding up contributions from documents that link to it. On each iteration, each document sends a contribution of r/n to its neighbors, where r is its rank and n is its number of neighbors. It then updates its rank to a/N + (1 − a) Σ ci, where the sum is over the contributions it received and N is the total number of documents.
[Figure 3: Lineage graph for datasets in PageRank. Each iteration joins links with the current ranks to produce contribs, which are reduced and mapped into the next ranks.]

We can write PageRank in Spark as follows:
// Load graph as an RDD of (URL, outlinks) pairs
val links = spark.textFile(...).map(...).persist()
var ranks = // RDD of (URL, rank) pairs
for (i <- 1 to ITERATIONS) {
// Build an RDD of (targetURL, float) pairs
// with the contributions sent by each page
val contribs = links.join(ranks).flatMap {
  case (url, (links, rank)) =>
    links.map(dest => (dest, rank / links.size))
}
// Sum contributions by URL and get new ranks
ranks = contribs.reduceByKey((x,y) => x+y)
.mapValues(sum => a/N + (1-a)*sum)
}

This program leads to the RDD lineage graph in Figure 3. On each iteration, we create a new ranks dataset
based on the contribs and ranks from the previous iteration and the static links dataset.6 One interesting feature of this graph is that it grows longer with the number of iterations. Thus, in a job with many iterations, it may be necessary to reliably replicate some of the versions of ranks to reduce fault recovery times [20]. The user can call persist with a RELIABLE flag to do this. However, note that the links dataset does not need to be replicated, because partitions of it can be rebuilt efficiently by rerunning a map on blocks of the input file. This dataset will typically be much larger than ranks, because each document has many links but only one number as its rank, so recovering it using lineage saves time over systems that checkpoint a program's entire in-memory state.
6 Note that although RDDs are immutable, the variables ranks and contribs in the program point to different RDDs on each iteration.
Finally, we can optimize communication in PageRank
by controlling the partitioning of the RDDs. If we specify a partitioning for links (e.g., hash-partition the link
lists by URL across nodes), we can partition ranks in
the same way and ensure that the join operation between
links and ranks requires no communication (as each
URLs rank will be on the same machine as its link list).
We can also write a custom Partitioner class to group
pages that link to each other together (e.g., partition the
URLs by domain name). Both optimizations can be expressed by calling partitionBy when we define links:
links = spark.textFile(...).map(...)
.partitionBy(myPartFunc).persist()

After this initial call, the join operation between links


and ranks will automatically aggregate the contributions
for each URL to the machine that its link lists is on, calculate its new rank there, and join it with its links. This
type of consistent partitioning across iterations is one of
the main optimizations in specialized frameworks like
Pregel. RDDs let the user express this goal directly.
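A hedged sketch of the partitionBy call above with a concrete Partitioner; the HashPartitioner with 100 partitions and the tab-separated input format are illustrative choices, not taken from the paper.
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionedLinksExample {
  def main(args: Array[String]): Unit = {
    val spark = new SparkContext(new SparkConf().setAppName("pagerank-partitioning"))

    // Assumed input format: "url<TAB>outlink1,outlink2,..."
    val links = spark.textFile(args(0))
      .map { line =>
        val parts = line.split("\t")
        (parts(0), parts(1).split(",").toSeq)
      }
      .partitionBy(new HashPartitioner(100))   // hash-partition link lists by URL
      .persist()

    // An RDD derived from links keeps the same partitioner, so the per-iteration
    // join between links and ranks requires no shuffle of the link lists.
    val ranks = links.mapValues(_ => 1.0)
    println(ranks.partitioner == links.partitioner)   // true
    spark.stop()
  }
}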

4 Representing RDDs

One of the challenges in providing RDDs as an abstraction is choosing a representation for them that can track
lineage across a wide range of transformations. Ideally,
a system implementing RDDs should provide as rich
a set of transformation operators as possible (e.g., the
ones in Table 2), and let users compose them in arbitrary
ways. We propose a simple graph-based representation
for RDDs that facilitates these goals. We have used this
representation in Spark to support a wide range of transformations without adding special logic to the scheduler
for each one, which greatly simplified the system design.
In a nutshell, we propose representing each RDD
through a common interface that exposes five pieces of
information: a set of partitions, which are atomic pieces
of the dataset; a set of dependencies on parent RDDs;
a function for computing the dataset based on its parents; and metadata about its partitioning scheme and data
placement. For example, an RDD representing an HDFS
file has a partition for each block of the file and knows
which machines each block is on. Meanwhile, the result

Operation                | Meaning
partitions()             | Return a list of Partition objects
preferredLocations(p)    | List nodes where partition p can be accessed faster due to data locality
dependencies()           | Return a list of dependencies
iterator(p, parentIters) | Compute the elements of partition p given iterators for its parent partitions
partitioner()            | Return metadata specifying whether the RDD is hash/range partitioned

Table 3: Interface used to represent RDDs in Spark.

of a map on this RDD has the same partitions, but applies the map function to the parent's data when computing its elements. We summarize this interface in Table 3.
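The following is a hedged Scala sketch of this five-piece interface; the trait and type names are illustrative and much simpler than Spark's actual RDD class.
// Illustrative placeholder types; Spark's real classes carry more information.
trait Partition { def index: Int }
trait Dependency
trait Partitioner

trait SimpleRDD[T] {
  def partitions: Seq[Partition]                                   // atomic pieces of the dataset
  def preferredLocations(p: Partition): Seq[String]                // nodes where p is cheap to read
  def dependencies: Seq[Dependency]                                // parents this RDD is derived from
  def iterator(p: Partition, parentIters: Seq[Iterator[Any]]): Iterator[T]  // compute p from parent iterators
  def partitioner: Option[Partitioner]                             // hash/range partitioning metadata, if any
}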
The most interesting question in designing this interface is how to represent dependencies between RDDs.
We found it both sufficient and useful to classify dependencies into two types: narrow dependencies, where each
partition of the parent RDD is used by at most one partition of the child RDD, and wide dependencies, where multiple child partitions may depend on it. For example, map leads to a narrow dependency, while join leads to wide dependencies (unless the parents are hash-partitioned).
Figure 4 shows other examples.
This distinction is useful for two reasons. First, narrow
dependencies allow for pipelined execution on one cluster node, which can compute all the parent partitions. For
example, one can apply a map followed by a filter on an
element-by-element basis. In contrast, wide dependencies require data from all parent partitions to be available
and to be shuffled across the nodes using a MapReduce-like operation. Second, recovery after a node failure is
more efficient with a narrow dependency, as only the lost
parent partitions need to be recomputed, and they can be
recomputed in parallel on different nodes. In contrast, in
a lineage graph with wide dependencies, a single failed
node might cause the loss of some partition from all the
ancestors of an RDD, requiring a complete re-execution.
This common interface for RDDs made it possible to
implement most transformations in Spark in less than 20
lines of code. Indeed, even new Spark users have implemented new transformations (e.g., sampling and various
types of joins) without knowing the details of the scheduler. We sketch some RDD implementations below.
HDFS files: The input RDDs in our samples have been
files in HDFS. For these RDDs, partitions returns one
partition for each block of the file (with the blocks offset
stored in each Partition object), preferredLocations gives
the nodes the block is on, and iterator reads the block.
map: Calling map on any RDD returns a MappedRDD object. This object has the same partitions and preferred locations as its parent, but applies the function passed to map to the parent's records in its iterator method.

[Figure 4: Examples of narrow and wide dependencies. Each box is an RDD, with partitions shown as shaded rectangles. Narrow dependencies: map, filter, union, and join with co-partitioned inputs. Wide dependencies: groupByKey and join with inputs not co-partitioned.]


union: Calling union on two RDDs returns an RDD whose partitions are the union of those of the parents. Each child partition is computed through a narrow dependency on the corresponding parent.7
7 Note that our union operation does not drop duplicate values.
sample: Sampling is similar to mapping, except that
the RDD stores a random number generator seed for each
partition to deterministically sample parent records.
join: Joining two RDDs may lead to either two narrow dependencies (if they are both hash/range partitioned
with the same partitioner), two wide dependencies, or a
mix (if one parent has a partitioner and one does not). In
either case, the output RDD has a partitioner (either one
inherited from the parents or a default hash partitioner).
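As an illustration, here is a hedged sketch of how the MappedRDD described above could be written against the SimpleRDD trait sketched earlier in this section; again, the names are illustrative, not Spark's internals.
class MappedRDDSketch[T, U](parent: SimpleRDD[T], f: T => U) extends SimpleRDD[U] {
  def partitions: Seq[Partition] = parent.partitions                 // same partitions as the parent
  def preferredLocations(p: Partition): Seq[String] = parent.preferredLocations(p)
  def dependencies: Seq[Dependency] = Seq(new Dependency {})         // a single narrow dependency on the parent
  def iterator(p: Partition, parentIters: Seq[Iterator[Any]]): Iterator[U] =
    parentIters.head.map(x => f(x.asInstanceOf[T]))                  // apply f lazily, record by record
  def partitioner: Option[Partitioner] = None
}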

5 Implementation

We have implemented Spark in about 14,000 lines of Scala. The system runs over the Mesos cluster manager [17], allowing it to share resources with Hadoop,
MPI and other applications. Each Spark program runs as
a separate Mesos application, with its own driver (master) and workers, and resource sharing between these applications is handled by Mesos.
Spark can read data from any Hadoop input source
(e.g., HDFS or HBase) using Hadoops existing input
plugin APIs, and runs on an unmodified version of Scala.
We now sketch several of the technically interesting
parts of the system: our job scheduler (5.1), our Spark
interpreter allowing interactive use (5.2), memory management (5.3), and support for checkpointing (5.4).
5.1 Job Scheduling

Spark's scheduler uses our representation of RDDs, described in Section 4.
Overall, our scheduler is similar to Dryad's [19], but it additionally takes into account which partitions of persistent RDDs are available in memory.

[Figure 5: Example of how Spark computes job stages. Boxes with solid outlines are RDDs. Partitions are shaded rectangles, in black if they are already in memory. To run an action on RDD G, we build stages at wide dependencies and pipeline narrow transformations inside each stage. In this case, stage 1's output RDD is already in RAM, so we run stage 2 and then 3.]

Whenever a user runs an action (e.g., count or save) on an RDD, the scheduler examines that RDD's lineage graph to build a DAG
of stages to execute, as illustrated in Figure 5. Each stage
contains as many pipelined transformations with narrow
dependencies as possible. The boundaries of the stages
are the shuffle operations required for wide dependencies, or any already computed partitions that can short-circuit the computation of a parent RDD. The scheduler
then launches tasks to compute missing partitions from
each stage until it has computed the target RDD.
Our scheduler assigns tasks to machines based on data
locality using delay scheduling [32]. If a task needs to
process a partition that is available in memory on a node,
we send it to that node. Otherwise, if a task processes
a partition for which the containing RDD provides preferred locations (e.g., an HDFS file), we send it to those.
For wide dependencies (i.e., shuffle dependencies), we
currently materialize intermediate records on the nodes
holding parent partitions to simplify fault recovery, much
like MapReduce materializes map outputs.
If a task fails, we re-run it on another node as long
as its stages parents are still available. If some stages
have become unavailable (e.g., because an output from
the map side of a shuffle was lost), we resubmit tasks to
compute the missing partitions in parallel. We do not yet
tolerate scheduler failures, though replicating the RDD
lineage graph would be straightforward.
Finally, although all computations in Spark currently
run in response to actions called in the driver program,
we are also experimenting with letting tasks on the cluster (e.g., maps) call the lookup operation, which provides
random access to elements of hash-partitioned RDDs by
key. In this case, tasks would need to tell the scheduler to
compute the required partition if it is missing.

[Figure 6: Example showing how the Spark interpreter translates two lines entered by the user into Java objects. (a) Lines typed by user: Line 1: var query = "hello"; Line 2: rdd.filter(_.contains(query)).count(). (b) Resulting object graph: a Line1 object holding query, referenced by the Line2 closure through a line1 field.]

5.2 Interpreter Integration

Scala includes an interactive shell similar to those of Ruby and Python. Given the low latencies attained with
in-memory data, we wanted to let users run Spark interactively from the interpreter to query big datasets.
The Scala interpreter normally operates by compiling
a class for each line typed by the user, loading it into
the JVM, and invoking a function on it. This class includes a singleton object that contains the variables or
functions on that line and runs the lines code in an initialize method. For example, if the user types var x = 5
followed by println(x), the interpreter defines a class
called Line1 containing x and causes the second line to
compile to println(Line1.getInstance().x).
We made two changes to the interpreter in Spark:
1. Class shipping: To let the worker nodes fetch the
bytecode for the classes created on each line, we
made the interpreter serve these classes over HTTP.
2. Modified code generation: Normally, the singleton
object created for each line of code is accessed
through a static method on its corresponding class.
This means that when we serialize a closure referencing a variable defined on a previous line, such as
Line1.x in the example above, Java will not trace
through the object graph to ship the Line1 instance
wrapping around x. Therefore, the worker nodes will
not receive x. We modified the code generation logic
to reference the instance of each line object directly.
Figure 6 shows how the interpreter translates a set of
lines typed by the user to Java objects after our changes.
We found the Spark interpreter to be useful in processing large traces obtained as part of our research and exploring datasets stored in HDFS. We also plan to use it to run higher-level query languages interactively, e.g., SQL.
5.3 Memory Management

Spark provides three options for storage of persistent RDDs: in-memory storage as deserialized Java objects,
in-memory storage as serialized data, and on-disk storage. The first option provides the fastest performance,
because the Java VM can access each RDD element
natively. The second option lets users choose a more
memory-efficient representation than Java object graphs
when space is limited, at the cost of lower performance.8
The third option is useful for RDDs that are too large to
keep in RAM but costly to recompute on each use.
To manage the limited memory available, we use an
LRU eviction policy at the level of RDDs. When a new
RDD partition is computed but there is not enough space
to store it, we evict a partition from the least recently accessed RDD, unless this is the same RDD as the one with
the new partition. In that case, we keep the old partition
in memory to prevent cycling partitions from the same
RDD in and out. This is important because most operations will run tasks over an entire RDD, so it is quite
likely that the partition already in memory will be needed
in the future. We found this default policy to work well in
all our applications so far, but we also give users further
control via a persistence priority for each RDD.
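A minimal sketch of this eviction policy, assuming simplified partition identifiers and sizes; this is not Spark's actual block manager.
import scala.collection.mutable

final case class CachedPartition(rddId: Int, partitionId: Int, size: Long)

class RddLruCache(capacity: Long) {
  // LinkedHashMap preserves insertion order; we re-insert on access so that iteration
  // order goes from least recently accessed to most recently accessed partitions.
  private val entries = mutable.LinkedHashMap[(Int, Int), CachedPartition]()
  private var used = 0L

  def access(rddId: Int, partitionId: Int): Unit =
    entries.remove((rddId, partitionId)).foreach(p => entries((rddId, partitionId)) = p)

  // Try to cache a newly computed partition; returns false if it cannot fit.
  def insert(p: CachedPartition): Boolean = {
    // Candidate victims, oldest first, excluding partitions of the incoming partition's RDD
    // so that an RDD does not cycle its own partitions in and out of memory.
    val victims = entries.values.filter(_.rddId != p.rddId).toList.iterator
    while (used + p.size > capacity && victims.hasNext) {
      val v = victims.next()
      entries.remove((v.rddId, v.partitionId))
      used -= v.size
    }
    if (used + p.size <= capacity) {
      entries((p.rddId, p.partitionId)) = p
      used += p.size
      true
    } else false
  }
}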
Finally, each instance of Spark on a cluster currently
has its own separate memory space. In future work, we
plan to investigate sharing RDDs across instances of
Spark through a unified memory manager.
5.4 Support for Checkpointing

Although lineage can always be used to recover RDDs after a failure, such recovery may be time-consuming for
RDDs with long lineage chains. Thus, it can be helpful
to checkpoint some RDDs to stable storage.
In general, checkpointing is useful for RDDs with long
lineage graphs containing wide dependencies, such as
the rank datasets in our PageRank example (3.2.2). In
these cases, a node failure in the cluster may result in
the loss of some slice of data from each parent RDD, requiring a full recomputation [20]. In contrast, for RDDs
with narrow dependencies on data in stable storage, such
as the points in our logistic regression example (3.2.1)
and the link lists in PageRank, checkpointing may never
be worthwhile. If a node fails, lost partitions from these
RDDs can be recomputed in parallel on other nodes, at a
fraction of the cost of replicating the whole RDD.
Spark currently provides an API for checkpointing (a
REPLICATE flag to persist), but leaves the decision of
which data to checkpoint to the user. However, we are
also investigating how to perform automatic checkpointing. Because our scheduler knows the size of each dataset
as well as the time it took to first compute it, it should be
able to select an optimal set of RDDs to checkpoint to
minimize system recovery time [30].
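For illustration, here is a hedged sketch of checkpointing a long-lineage RDD. The sc.setCheckpointDir and rdd.checkpoint calls are the interface exposed in current Spark releases and stand in for the REPLICATE flag described above; the data and checkpoint interval are arbitrary.
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("checkpoint-example"))
    sc.setCheckpointDir("hdfs://.../checkpoints")            // stable storage for checkpoints

    var ranks = sc.parallelize(1 to 1000000).map(i => (i % 1000, 1.0))
    for (iter <- 1 to 50) {
      ranks = ranks.reduceByKey(_ + _).mapValues(_ * 0.85)   // adds a wide dependency each iteration
      if (iter % 10 == 0) {
        ranks.checkpoint()                                   // truncate the growing lineage periodically
        ranks.count()                                        // force evaluation so the checkpoint is written
      }
    }
    sc.stop()
  }
}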
Finally, note that the read-only nature of RDDs makes them simpler to checkpoint than general shared memory. Because consistency is not a concern, RDDs can be written out in the background without requiring program pauses or distributed snapshot schemes.
8 The cost depends on how much computation the application does per byte of data, but can be up to 2× for lightweight processing.

6 Evaluation

We evaluated Spark and RDDs through a series of experiments on Amazon EC2, as well as benchmarks of user applications. Overall, our results show the following:
• Spark outperforms Hadoop by up to 20× in iterative machine learning and graph applications. The speedup comes from avoiding I/O and deserialization costs by storing data in memory as Java objects.
• Applications written by our users perform and scale well. In particular, we used Spark to speed up an analytics report that was running on Hadoop by 40×.
• When nodes fail, Spark can recover quickly by rebuilding only the lost RDD partitions.
• Spark can be used to query a 1 TB dataset interactively with latencies of 5–7 seconds.
We start by presenting benchmarks for iterative machine learning applications (6.1) and PageRank (6.2) against Hadoop. We then evaluate fault recovery in Spark (6.3) and behavior when a dataset does not fit in memory (6.4). Finally, we discuss results for user applications (6.5) and interactive data mining (6.6).
Unless otherwise noted, our tests used m1.xlarge EC2 nodes with 4 cores and 15 GB of RAM. We used HDFS for storage, with 256 MB blocks. Before each test, we cleared OS buffer caches to measure IO costs accurately.

6.1 Iterative Machine Learning Applications

We implemented two iterative machine learning applications, logistic regression and k-means, to compare the performance of the following systems:
• Hadoop: The Hadoop 0.20.2 stable release.
• HadoopBinMem: A Hadoop deployment that converts the input data into a low-overhead binary format in the first iteration to eliminate text parsing in later ones, and stores it in an in-memory HDFS instance.
• Spark: Our implementation of RDDs.


We ran both algorithms for 10 iterations on 100 GB datasets using 25–100 machines. The key difference between the two applications is the amount of computation they perform per byte of data. The iteration time of k-means is dominated by computation, while logistic regression is less compute-intensive and thus more sensitive to time spent in deserialization and I/O.
Since typical learning algorithms need tens of iterations to converge, we report times for the first iteration
and subsequent iterations separately. We find that sharing data via RDDs greatly speeds up future iterations.

[Figure 7: Duration of the first and later iterations in Hadoop, HadoopBinMem and Spark for logistic regression and k-means using 100 GB of data on a 100-node cluster.]

[Figure 8: Running times for iterations after the first in Hadoop, HadoopBinMem, and Spark. The jobs all processed 100 GB. (a) Logistic Regression; (b) K-Means.]

First Iterations All three systems read text input from HDFS in their first iterations. As shown in the light bars in Figure 7, Spark was moderately faster than Hadoop across experiments. This difference was due to signaling overheads in Hadoop's heartbeat protocol between its master and workers. HadoopBinMem was the slowest because it ran an extra MapReduce job to convert the data to binary, and it had to write this data across the network to a replicated in-memory HDFS instance.
Subsequent Iterations Figure 7 also shows the average running times for subsequent iterations, while Figure 8 shows how these scaled with cluster size. For logistic regression, Spark was 25.3× and 20.7× faster than Hadoop and HadoopBinMem respectively on 100 machines. For the more compute-intensive k-means application, Spark still achieved a speedup of 1.9× to 3.2×.
Understanding the Speedup We were surprised to find that Spark outperformed even Hadoop with in-memory storage of binary data (HadoopBinMem) by a 20× margin. In HadoopBinMem, we had used Hadoop's standard binary format (SequenceFile) and a large block size of 256 MB, and we had forced HDFS's data directory to be on an in-memory file system. However, Hadoop still ran slower due to several factors:
1. Minimum overhead of the Hadoop software stack,
2. Overhead of HDFS while serving data, and
3. Deserialization cost to convert binary records to usable in-memory Java objects.
We investigated each of these factors in turn. To measure (1), we ran no-op Hadoop jobs, and saw that these incurred at least 25s of overhead to complete the minimal requirements of job setup, starting tasks, and cleaning up. Regarding (2), we found that HDFS performed multiple memory copies and a checksum to serve each block.
Finally, to measure (3), we ran microbenchmarks on a single machine to run the logistic regression computation on 256 MB inputs in various formats. In particular, we compared the time to process text and binary inputs from both HDFS (where overheads in the HDFS stack will manifest) and an in-memory local file (where the kernel can very efficiently pass data to the program).
We show the results of these tests in Figure 9. The differences between in-memory HDFS and local file show that reading through HDFS introduced a 2-second overhead, even when data was in memory on the local machine. The differences between the text and binary input indicate the parsing overhead was 7 seconds. Finally, even when reading from an in-memory file, converting the pre-parsed binary data into Java objects took 3 seconds, which is still almost as expensive as the logistic regression itself. By storing RDD elements directly as Java objects in memory, Spark avoids all these overheads.

[Figure 9: Iteration times for logistic regression using 256 MB data on a single machine for different sources of input.]
6.2 PageRank

We compared the performance of Spark with Hadoop for PageRank using a 54 GB Wikipedia dump. We ran
10 iterations of the PageRank algorithm to process a
link graph of approximately 4 million articles. Figure 10
demonstrates that in-memory storage alone provided
Spark with a 2.4× speedup over Hadoop on 30 nodes.
In addition, controlling the partitioning of the RDDs to
make it consistent across iterations, as discussed in Section 3.2.2, improved the speedup to 7.4×. The results
also scaled nearly linearly to 60 nodes.
We also evaluated a version of PageRank written using our implementation of Pregel over Spark, which we
describe in Section 7.1. The iteration times were similar
to the ones in Figure 10, but longer by about 4 seconds
because Pregel runs an extra operation on each iteration
to let the vertices vote whether to finish the job.

[Figure 10: Performance of PageRank on Hadoop and Spark. Bars compare Hadoop, basic Spark, and Spark with controlled partitioning on 30 and 60 machines.]
[Figure 11: Iteration times for k-means in presence of a failure. One machine was killed at the start of the 6th iteration, resulting in partial reconstruction of an RDD using lineage.]

6.3 Fault Recovery

We evaluated the cost of reconstructing RDD partitions using lineage after a node failure in the k-means application. Figure 11 compares the running times for 10 iterations of k-means on a 75-node cluster in a normal operating scenario and in one where a node fails at the start
of the 6th iteration. Without any failure, each iteration
consisted of 400 tasks working on 100 GB of data.
Until the end of the 5th iteration, the iteration times
were about 58 seconds. In the 6th iteration, one of the
machines was killed, resulting in the loss of the tasks
running on that machine and the RDD partitions stored
there. Spark re-ran these tasks in parallel on other machines, where they re-read corresponding input data and
reconstructed RDDs via lineage, which increased the iteration time to 80s. Once the lost RDD partitions were
reconstructed, the iteration time went back down to 58s.
Note that with a checkpoint-based fault recovery
mechanism, recovery would likely require rerunning at
least several iterations, depending on the frequency of
checkpoints. Furthermore, the system would need to
replicate the applications 100 GB working set (the text
input data converted into binary) across the network, and
would either consume twice the memory of Spark to
replicate it in RAM, or would have to wait to write 100
GB to disk. In contrast, the lineage graphs for the RDDs
in our examples were all less than 10 KB in size.
6.4 Behavior with Insufficient Memory

So far, we ensured that every machine in the cluster had enough memory to store all the RDDs across iterations. A natural question is how Spark runs if there is not enough memory to store a job's data. In this experiment, we configured Spark not to use more than a certain percentage of memory to store RDDs on each machine. We present results for various amounts of storage space for logistic regression in Figure 12. We see that performance degrades gracefully with less space.

[Figure 12: Performance of logistic regression using 100 GB data on 25 machines with varying amounts (percent of dataset) in memory.]

6.5 User Applications Built with Spark

In-Memory Analytics Conviva Inc., a video distribution company, used Spark to accelerate a number of data analytics reports that previously ran over Hadoop. For example, one report ran as a series of Hive [1] queries that computed various statistics for a customer. These queries all worked on the same subset of the data (records matching a customer-provided filter), but performed aggregations (averages, percentiles, and COUNT DISTINCT) over different grouping fields, requiring separate MapReduce jobs. By implementing the queries in Spark and loading the subset of data shared across them once into an RDD, the company was able to speed up the report by 40×. A report on 200 GB of compressed data that took 20 hours on a Hadoop cluster now runs in 30 minutes using only two Spark machines. Furthermore, the Spark program only required 96 GB of RAM, because it only stored the rows and columns matching the customer's filter in an RDD, not the whole decompressed file.

Traffic Modeling Researchers in the Mobile Millennium project at Berkeley [18] parallelized a learning algorithm for inferring road traffic congestion from sporadic automobile GPS measurements. The source data were a 10,000 link road network for a metropolitan area, as well as 600,000 samples of point-to-point trip times for GPS-equipped automobiles (travel times for each path may include multiple road links). Using a traffic model, the system can estimate the time it takes to travel across individual road links. The researchers trained this model using an expectation maximization (EM) algorithm that repeats two map and reduceByKey steps iteratively. The application scales nearly linearly from 20 to 80 nodes with 4 cores each, as shown in Figure 13(a).

[Figure 13: Per-iteration running time of two user applications implemented with Spark: (a) traffic modeling and (b) spam classification. Error bars show standard deviations.]

[Figure 14: Response times for interactive queries on Spark, scanning increasingly larger input datasets (100 GB, 500 GB, 1 TB) on 100 machines, for exact-match, substring-match, and total view count queries.]

Twitter Spam Classification The Monarch project at Berkeley [29] used Spark to identify link spam in Twitter messages. They implemented a logistic regression classifier on top of Spark similar to the example in Section 6.1, but they used a distributed reduceByKey to sum the gradient vectors in parallel. In Figure 13(b) we show the scaling results for training a classifier over a 50 GB subset of the data: 250,000 URLs and 10^7 features/dimensions
related to the network and content properties of the pages
at each URL. The scaling is not as close to linear due to
a higher fixed communication cost per iteration.
6.6 Interactive Data Mining

To demonstrate Spark's ability to interactively query big datasets, we used it to analyze 1 TB of Wikipedia page
view logs (2 years of data). For this experiment, we used
100 m2.4xlarge EC2 instances with 8 cores and 68 GB
of RAM each. We ran queries to find total views of (1)
all pages, (2) pages with titles exactly matching a given
word, and (3) pages with titles partially matching a word.
Each query scanned the entire input data.
Figure 14 shows the response times of the queries on
the full dataset and half and one-tenth of the data. Even
at 1 TB of data, queries on Spark took 5–7 seconds. This
was more than an order of magnitude faster than working with on-disk data; for example, querying the 1 TB
file from disk took 170s. This illustrates that RDDs make
Spark a powerful tool for interactive data mining.

7 Discussion

Although RDDs seem to offer a limited programming interface due to their immutable nature and coarse-grained
transformations, we have found them suitable for a wide
class of applications. In particular, RDDs can express a
surprising number of cluster programming models that
have so far been proposed as separate frameworks, allowing users to compose these models in one program
(e.g., run a MapReduce operation to build a graph, then
run Pregel on it) and share data between them. In this section, we discuss which programming models RDDs can
express and why they are so widely applicable (7.1). In
addition, we discuss another benefit of the lineage information in RDDs that we are pursuing, which is to facilitate debugging across these models (7.2).
7.1 Expressing Existing Programming Models

RDDs can efficiently express a number of cluster programming models that have so far been proposed independently. By efficiently, we mean that not only can
RDDs be used to produce the same output as programs
written in these models, but that RDDs can also capture
the optimizations that these frameworks perform, such as
keeping specific data in memory, partitioning it to minimize communication, and recovering from failures efficiently. The models expressible using RDDs include:
MapReduce: This model can be expressed using the
flatMap and groupByKey operations in Spark, or reduceByKey if there is a combiner.
DryadLINQ: The DryadLINQ system provides a
wider range of operators than MapReduce over the more
general Dryad runtime, but these are all bulk operators
that correspond directly to RDD transformations available in Spark (map, groupByKey, join, etc).
SQL: Like DryadLINQ expressions, SQL queries perform data-parallel operations on sets of records.
Pregel: Googles Pregel [22] is a specialized model for
iterative graph applications that at first looks quite different from the set-oriented programming models in other
systems. In Pregel, a program runs as a series of coordinated supersteps. On each superstep, each vertex in the
graph runs a user function that can update state associated with the vertex, change the graph topology, and send
messages to other vertices for use in the next superstep.
This model can express many graph algorithms, including shortest paths, bipartite matching, and PageRank.
The key observation that lets us implement this model
with RDDs is that Pregel applies the same user function
to all the vertices on each iteration. Thus, we can store the
vertex states for each iteration in an RDD and perform
a bulk transformation (flatMap) to apply this function
and generate an RDD of messages. We can then join this

RDD with the vertex states to perform the message exchange. Equally importantly, RDDs allow us to keep vertex states in memory like Pregel does, to minimize communication by controlling their partitioning, and to support partial recovery on failures. We have implemented
Pregel as a 200-line library on top of Spark and refer the
reader to [33] for more details.
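To make this construction concrete, here is a hedged sketch of a single superstep expressed with RDD operations; the types and the compute signature are illustrative assumptions, and the actual 200-line library in [33] differs.
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object PregelSketch {
  // One Pregel-style superstep: join vertex states with incoming messages, run the
  // user function on every vertex in bulk, and emit new states plus outgoing messages.
  def superstep[S: ClassTag, M: ClassTag](
      vertices: RDD[(Long, S)],
      messages: RDD[(Long, M)],
      compute: (Long, S, Seq[M]) => (S, Seq[(Long, M)])
  ): (RDD[(Long, S)], RDD[(Long, M)]) = {
    val inbox = messages.groupByKey()                        // gather messages per vertex
    val results = vertices.leftOuterJoin(inbox).map {        // bulk transformation over all vertices
      case (id, (state, msgs)) =>
        (id, compute(id, state, msgs.map(_.toSeq).getOrElse(Seq.empty)))
    }
    val newVertices = results.mapValues { case (state, _) => state }
    val outMessages = results.flatMap { case (_, (_, out)) => out }
    (newVertices, outMessages)
  }
}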
Iterative MapReduce: Several recently proposed systems, including HaLoop [7] and Twister [11], provide an
iterative MapReduce model where the user gives the system a series of MapReduce jobs to loop. The systems
keep data partitioned consistently across iterations, and
Twister can also keep it in memory. Both optimizations
are simple to express with RDDs, and we were able to
implement HaLoop as a 200-line library using Spark.
Batched Stream Processing: Researchers have recently proposed several incremental processing systems
for applications that periodically update a result with
new data [21, 15, 4]. For example, an application updating statistics about ad clicks every 15 minutes should be
able to combine intermediate state from the previous 15-minute window with data from new logs. These systems
perform bulk operations similar to Dryad, but store application state in distributed filesystems. Placing the intermediate state in RDDs would speed up their processing.
Explaining the Expressivity of RDDs Why are RDDs
able to express these diverse programming models? The
reason is that the restrictions on RDDs have little impact in many parallel applications. In particular, although
RDDs can only be created through bulk transformations,
many parallel programs naturally apply the same operation to many records, making them easy to express. Similarly, the immutability of RDDs is not an obstacle because one can create multiple RDDs to represent versions
of the same dataset. Indeed, many of todays MapReduce
applications run over filesystems that do not allow updates to files, such as HDFS.
One final question is why previous frameworks have
not offered the same level of generality. We believe that
this is because these systems explored specific problems
that MapReduce and Dryad do not handle well, such as
iteration, without observing that the common cause of
these problems was a lack of data sharing abstractions.
7.2 Leveraging RDDs for Debugging

While we initially designed RDDs to be deterministically recomputable for fault tolerance, this property also facilitates debugging. In particular, by logging the lineage of
RDDs created during a job, one can (1) reconstruct these
RDDs later and let the user query them interactively and
(2) re-run any task from the job in a single-process debugger, by recomputing the RDD partitions it depends
on. Unlike traditional replay debuggers for general dis-

tributed systems [13], which must capture or infer the


order of events across multiple nodes, this approach adds virtually zero recording overhead because only the RDD lineage graph needs to be logged.9 We are currently developing a Spark debugger based on these ideas [33].
9 Unlike these systems, an RDD-based debugger will not replay nondeterministic behavior in the user's functions (e.g., a nondeterministic map), but it can at least report it by checksumming data.

8 Related Work

Cluster Programming Models: Related work in cluster programming models falls into several classes. First,
data flow models such as MapReduce [10], Dryad [19]
and Ciel [23] support a rich set of operators for processing data but share it through stable storage systems.
RDDs represent a more efficient data sharing abstraction than stable storage because they avoid the cost of data replication, I/O and serialization.10
10 Note that running MapReduce/Dryad over an in-memory data store like RAMCloud [25] would still require data replication and serialization, which can be costly for some applications, as shown in 6.1.
Second, several high-level programming interfaces
for data flow systems, including DryadLINQ [31] and
FlumeJava [8], provide language-integrated APIs where
the user manipulates parallel collections through operators like map and join. However, in these systems,
the parallel collections represent either files on disk or
ephemeral datasets used to express a query plan. Although the systems will pipeline data across operators
in the same query (e.g., a map followed by another
map), they cannot share data efficiently across queries.
We based Sparks API on the parallel collection model
due to its convenience, and do not claim novelty for the
language-integrated interface, but by providing RDDs as
the storage abstraction behind this interface, we allow it
to support a far broader class of applications.
A third class of systems provide high-level interfaces
for specific classes of applications requiring data sharing.
For example, Pregel [22] supports iterative graph applications, while Twister [11] and HaLoop [7] are iterative
MapReduce runtimes. However, these frameworks perform data sharing implicitly for the pattern of computation they support, and do not provide a general abstraction that the user can employ to share data of her choice
among operations of her choice. For example, a user cannot use Pregel or Twister to load a dataset into memory
and then decide what query to run on it. RDDs provide
a distributed storage abstraction explicitly and can thus
support applications that these specialized systems do
not capture, such as interactive data mining.
Finally, some systems expose shared mutable state
to allow the user to perform in-memory computation.
For example, Piccolo [27] lets users run parallel functions that read and update cells in a distributed hash
table. Distributed shared memory (DSM) systems [24]
and key-value stores like RAMCloud [25] offer a similar model. RDDs differ from these systems in two ways.
First, RDDs provide a higher-level programming interface based on operators such as map, sort and join,
whereas the interface in Piccolo and DSM is just reads
and updates to table cells. Second, Piccolo and DSM systems implement recovery through checkpoints and rollback, which is more expensive than the lineage-based
strategy of RDDs in many applications. Finally, as discussed in Section 2.3, RDDs also provide other advantages over DSM, such as straggler mitigation.
Caching Systems: Nectar [12] can reuse intermediate
results across DryadLINQ jobs by identifying common
subexpressions with program analysis [16]. This capability would be compelling to add to an RDD-based system.
However, Nectar does not provide in-memory caching (it
places the data in a distributed file system), nor does it
let users explicitly control which datasets to persist and
how to partition them. Ciel [23] and FlumeJava [8] can
likewise cache task results but do not provide in-memory
caching or explicit control over which data is cached.
Ananthanarayanan et al. have proposed adding an inmemory cache to distributed file systems to exploit the
temporal and spatial locality of data access [3]. While
this solution provides faster access to data that is already
in the file system, it is not as efficient a means of sharing intermediate results within an application as RDDs,
because it would still require applications to write these
results to the file system between stages.
Lineage: Capturing lineage or provenance information
for data has long been a research topic in scientific computing and databases, for applications such as explaining
results, allowing them to be reproduced by others, and
recomputing data if a bug is found in a workflow or if
a dataset is lost. We refer the reader to [5] and [9] for
surveys of this work. RDDs provide a parallel programming model where fine-grained lineage is inexpensive to
capture, so that it can be used for failure recovery.
Our lineage-based recovery mechanism is also similar
to the recovery mechanism used within a computation
(job) in MapReduce and Dryad, which track dependencies among a DAG of tasks. However, in these systems,
the lineage information is lost after a job ends, requiring
the use of a replicated storage system to share data across
computations. In contrast, RDDs apply lineage to persist
in-memory data efficiently across computations, without
the cost of replication and disk I/O.
Relational Databases: RDDs are conceptually similar
to views in a database, and persistent RDDs resemble
materialized views [28]. However, like DSM systems,
databases typically allow fine-grained read-write access
to all records, requiring logging of operations and data
for fault tolerance and additional overhead to maintain

consistency. These overheads are not required with the


coarse-grained transformation model of RDDs.

9 Conclusion

We have presented resilient distributed datasets (RDDs), an efficient, general-purpose and fault-tolerant abstraction for sharing data in cluster applications. RDDs can
express a wide range of parallel applications, including
many specialized programming models that have been
proposed for iterative computation, and new applications
that these models do not capture. Unlike existing storage
abstractions for clusters, which require data replication
for fault tolerance, RDDs offer an API based on coarsegrained transformations that lets them recover data efficiently using lineage. We have implemented RDDs in
a system called Spark that outperforms Hadoop by up
to 20 in iterative applications and can be used interactively to query hundreds of gigabytes of data.
We have open sourced Spark at spark-project.org as
a vehicle for scalable data analysis and systems research.

Acknowledgements
We thank the first Spark users, including Tim Hunter,
Lester Mackey, Dilip Joseph, and Jibin Zhan, for trying
out our system in their real applications, providing many
good suggestions, and identifying a few research challenges along the way. We also thank our shepherd, Ed
Nightingale, and our reviewers for their feedback. This
research was supported in part by Berkeley AMP Lab
sponsors Google, SAP, Amazon Web Services, Cloudera, Huawei, IBM, Intel, Microsoft, NEC, NetApp and
VMWare, by DARPA (contract #FA8650-11-C-7136),
by a Google PhD Fellowship, and by the Natural Sciences and Engineering Research Council of Canada.

References
[1] Apache Hive. http://hadoop.apache.org/hive.
[2] Scala. http://www.scala-lang.org.
[3] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica.
Disk-locality in datacenter computing considered irrelevant. In
HotOS 11, 2011.
[4] P. Bhatotia, A. Wieder, R. Rodrigues, U. A. Acar, and
R. Pasquin. Incoop: MapReduce for incremental computations.
In ACM SOCC 11, 2011.
[5] R. Bose and J. Frew. Lineage retrieval for scientific data
processing: a survey. ACM Computing Surveys, 37:1–28, 2005.
[6] S. Brin and L. Page. The anatomy of a large-scale hypertextual
web search engine. In WWW, 1998.
[7] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. HaLoop:
efficient iterative data processing on large clusters. Proc. VLDB
Endow., 3:285–296, September 2010.
[8] C. Chambers, A. Raniwala, F. Perry, S. Adams, R. R. Henry,
R. Bradshaw, and N. Weizenbaum. FlumeJava: easy, efficient
data-parallel pipelines. In PLDI 10. ACM, 2010.
[9] J. Cheney, L. Chiticariu, and W.-C. Tan. Provenance in
databases: Why, how, and where. Foundations and Trends in
Databases, 1(4):379–474, 2009.
[10] J. Dean and S. Ghemawat. MapReduce: Simplified data
processing on large clusters. In OSDI, 2004.

[11] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu,


and G. Fox. Twister: a runtime for iterative mapreduce. In
HPDC 10, 2010.
[12] P. K. Gunda, L. Ravindranath, C. A. Thekkath, Y. Yu, and
L. Zhuang. Nectar: automatic management of data and
computation in datacenters. In OSDI 10, 2010.
[13] Z. Guo, X. Wang, J. Tang, X. Liu, Z. Xu, M. Wu, M. F.
Kaashoek, and Z. Zhang. R2: an application-level kernel for
record and replay. OSDI08, 2008.
[14] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of
Statistical Learning: Data Mining, Inference, and Prediction.
Springer Publishing Company, New York, NY, 2009.
[15] B. He, M. Yang, Z. Guo, R. Chen, B. Su, W. Lin, and L. Zhou.
Comet: batched stream processing for data intensive distributed
computing. In SoCC 10.
[16] A. Heydon, R. Levin, and Y. Yu. Caching function calls using
precise dependencies. In ACM SIGPLAN Notices, pages
311–320, 2000.
[17] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D.
Joseph, R. H. Katz, S. Shenker, and I. Stoica. Mesos: A platform
for fine-grained resource sharing in the data center. In NSDI 11.
[18] T. Hunter, T. Moldovan, M. Zaharia, S. Merzgui, J. Ma, M. J.
Franklin, P. Abbeel, and A. M. Bayen. Scaling the Mobile
Millennium system in the cloud. In SOCC 11, 2011.
[19] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad:
distributed data-parallel programs from sequential building
blocks. In EuroSys 07, 2007.
[20] S. Y. Ko, I. Hoque, B. Cho, and I. Gupta. On availability of
intermediate data in cloud computations. In HotOS 09, 2009.
[21] D. Logothetis, C. Olston, B. Reed, K. C. Webb, and K. Yocum.
Stateful bulk processing for incremental analytics. SoCC 10.
[22] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn,
N. Leiser, and G. Czajkowski. Pregel: a system for large-scale
graph processing. In SIGMOD, 2010.
[23] D. G. Murray, M. Schwarzkopf, C. Smowton, S. Smith,
A. Madhavapeddy, and S. Hand. Ciel: a universal execution
engine for distributed data-flow computing. In NSDI, 2011.
[24] B. Nitzberg and V. Lo. Distributed shared memory: a survey of
issues and algorithms. Computer, 24(8):52–60, Aug 1991.
[25] J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis,
J. Leverich, D. Mazi`eres, S. Mitra, A. Narayanan, G. Parulkar,
M. Rosenblum, S. M. Rumble, E. Stratmann, and R. Stutsman.
The case for RAMClouds: scalable high-performance storage
entirely in DRAM. SIGOPS Op. Sys. Rev., 43:92–105, Jan 2010.
[26] D. Peng and F. Dabek. Large-scale incremental processing using
distributed transactions and notifications. In OSDI 2010.
[27] R. Power and J. Li. Piccolo: Building fast, distributed programs
with partitioned tables. In Proc. OSDI 2010, 2010.
[28] R. Ramakrishnan and J. Gehrke. Database Management
Systems. McGraw-Hill, Inc., 3 edition, 2003.
[29] K. Thomas, C. Grier, J. Ma, V. Paxson, and D. Song. Design and
evaluation of a real-time URL spam filtering service. In IEEE
Symposium on Security and Privacy, 2011.
[30] J. W. Young. A first order approximation to the optimum
checkpoint interval. Commun. ACM, 17:530531, Sept 1974.
Erlingsson, P. K.
[31] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U.
Gunda, and J. Currey. DryadLINQ: A system for
general-purpose distributed data-parallel computing using a
high-level language. In OSDI 08, 2008.
[32] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy,
S. Shenker, and I. Stoica. Delay scheduling: A simple technique
for achieving locality and fairness in cluster scheduling. In
EuroSys 10, 2010.
[33] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma,
M. McCauley, M. Franklin, S. Shenker, and I. Stoica. Resilient
distributed datasets: A fault-tolerant abstraction for in-memory
cluster computing. Technical Report UCB/EECS-2011-82,
EECS Department, UC Berkeley, 2011.

Cassandra - A Decentralized Structured Storage System


Avinash Lakshman and Prashant Malik
Facebook

ABSTRACT
Cassandra is a distributed storage system for managing very
large amounts of structured data spread out across many
commodity servers, while providing highly available service
with no single point of failure. Cassandra aims to run on top
of an infrastructure of hundreds of nodes (possibly spread
across different data centers). At this scale, small and large
components fail continuously. The way Cassandra manages the persistent state in the face of these failures drives
the reliability and scalability of the software systems relying on this service. While in many ways Cassandra resembles a database and shares many design and implementation
strategies therewith, Cassandra does not support a full relational data model; instead, it provides clients with a simple
data model that supports dynamic control over data layout and format. The Cassandra system was designed to run on
cheap commodity hardware and handle high write throughput while not sacrificing read efficiency.

1. INTRODUCTION

Facebook runs the largest social networking platform that
serves hundreds of millions of users at peak times using tens of
thousands of servers located in many data centers around
the world. There are strict operational requirements on
Facebooks platform in terms of performance, reliability and
efficiency, and to support continuous growth the platform
needs to be highly scalable. Dealing with failures in an infrastructure comprised of thousands of components is our
standard mode of operation; there are always a small but
significant number of server and network components that
are failing at any given time. As such, the software systems
need to be constructed in a manner that treats failures as the
norm rather than the exception. To meet the reliability and
scalability needs described above Facebook has developed
Cassandra.
Cassandra uses a synthesis of well known techniques to
achieve scalability and availability. Cassandra was designed
to fulfill the storage needs of the Inbox Search problem. Inbox
Search is a feature that enables users to search through
their Facebook Inbox. At Facebook this meant the system
was required to handle a very high write throughput, billions
of writes per day, and also scale with the number of users.
Since users are served from data centers that are geographically distributed, being able to replicate data across data
centers was key to keep search latencies down. Inbox Search
was launched in June of 2008 for around 100 million users
and today we are at over 250 million users and Cassandra
has kept up the promise so far. Cassandra is now deployed
as the backend storage system for multiple services within
Facebook.
This paper is structured as follows. Section 2 talks about
related work, some of which has been very influential on our
design. Section 3 presents the data model in more detail.
Section 4 presents the overview of the client API. Section
5 presents the system design and the distributed algorithms
that make Cassandra work. Section 6 details the experiences
of making Cassandra work and refinements to improve performance. In Section 6.1 we describe how one of the applications in the Facebook platform uses Cassandra. Finally
Section 7 concludes with future work on Cassandra.

2. RELATED WORK

Distributing data for performance, availability and durability has been widely studied in the file system and database
communities. Compared to P2P storage systems that only
support flat namespaces, distributed file systems typically
support hierarchical namespaces. Systems like Ficus[14] and
Coda[16] replicate files for high availability at the expense
of consistency. Update conflicts are typically managed using specialized conflict resolution procedures. Farsite[2] is
a distributed file system that does not use any centralized
server. Farsite achieves high availability and scalability using replication. The Google File System (GFS)[9] is another
distributed file system built for hosting the state of Googles
internal applications. GFS uses a simple design with a single master server for hosting the entire metadata and where
the data is split into chunks and stored in chunk servers.
However the GFS master is now made fault tolerant using
the Chubby[3] abstraction. Bayou[18] is a distributed relational database system that allows disconnected operations
and provides eventual data consistency. Among these systems, Bayou, Coda and Ficus allow disconnected operations
and are resilient to issues such as network partitions and
outages. These systems differ on their conflict resolution
procedures. For instance, Coda and Ficus perform system
level conflict resolution and Bayou allows application level

resolution. All of them however, guarantee eventual consistency. Similar to these systems, Dynamo[6] allows read and
write operations to continue even during network partitions
and resolves update conflicts using different conflict resolution mechanisms, some client driven. Traditional replicated
relational database systems focus on the problem of guaranteeing strong consistency of replicated data. Although
strong consistency provides the application writer a convenient programming model, these systems are limited in
scalability and availability [10]. These systems are not capable of handling network partitions because they typically
provide strong consistency guarantees.
Dynamo[6] is a storage system that is used by Amazon
to store and retrieve user shopping carts. Dynamos Gossip
based membership algorithm helps every node maintain information about every other node. Dynamo can be defined
as a structured overlay with at most one-hop request routing. Dynamo detects update conflicts using a vector clock
scheme, but prefers a client side conflict resolution mechanism. A write operation in Dynamo also requires a read to
be performed for managing the vector timestamps. This
can be very limiting in environments where systems need
to handle a very high write throughput. Bigtable[4] provides both structure and data distribution but relies on a
distributed file system for its durability.

3. DATA MODEL

A table in Cassandra is a distributed multi-dimensional
map indexed by a key. The value is an object which is highly
structured. The row key in a table is a string with no size
restrictions, although typically 16 to 36 bytes long. Every
operation under a single row key is atomic per replica no
matter how many columns are being read or written into.
Columns are grouped together into sets called column families very much similar to what happens in the Bigtable[4]
system. Cassandra exposes two kinds of column families,
Simple and Super column families. Super column families
can be visualized as a column family within a column family.
Furthermore, applications can specify the sort order of
columns within a Super Column or Simple Column family.
The system allows columns to be sorted either by time or
by name. Time sorting of columns is exploited by applications like Inbox Search where the results are always displayed
in time sorted order. Any column within a column family
is accessed using the convention column_family : column,
and any column within a column family that is of type
super is accessed using the convention column_family :
super_column : column. A very good example of the power of the super column family abstraction is given in Section 6.1.
Typically applications use a dedicated Cassandra cluster and
manage it as part of their service. Although the system
supports the notion of multiple tables all deployments have
only one table in their schema.

4. API

The Cassandra API consists of the following three simple
methods:
insert(table, key, rowMutation)
get(table, key, columnName)
delete(table, key, columnName)

columnName can refer to a specific column within a column family, a column family, a super column family, or a
column within a super column.
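
To make the shape of this API concrete, here is a minimal Java sketch of a client interface mirroring the three calls above. The CassandraClient interface, the InboxSearch table name, and the composite column name are illustrative assumptions, not the actual (Thrift-based) client API.

    import java.util.Map;

    // Hypothetical client interface mirroring the three calls described above.
    interface CassandraClient {
        void insert(String table, String key, Map<String, byte[]> rowMutation);
        byte[] get(String table, String key, String columnName);
        void delete(String table, String key, String columnName);
    }

    class ApiSketch {
        static void example(CassandraClient client) {
            // Address one column of a super column family using the
            // column_family : super_column : column convention from Section 3.
            String column = "TermIndex:hello:msg-1001";
            client.insert("InboxSearch", "user42", Map.of(column, new byte[0]));
            byte[] value = client.get("InboxSearch", "user42", column);
            client.delete("InboxSearch", "user42", column);
        }
    }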

5. SYSTEM ARCHITECTURE
The architecture of a storage system that needs to operate in a production setting is complex. In addition to
the actual data persistence component, the system needs to
have the following characteristics: scalable and robust solutions for load balancing, membership and failure detection,
failure recovery, replica synchronization, overload handling,
state transfer, concurrency and job scheduling, request marshalling, request routing, system monitoring and alarming,
and configuration management. Describing the details of
each of the solutions is beyond the scope of this paper, so
we will focus on the core distributed systems techniques used
in Cassandra: partitioning, replication, membership, failure
handling and scaling. All these modules work in synchrony
to handle read/write requests. Typically a read/write request for a key gets routed to any node in the Cassandra
cluster. The node then determines the replicas for this particular key. For writes, the system routes the requests to
the replicas and waits for a quorum of replicas to acknowledge the completion of the writes. For reads, based on the
consistency guarantees required by the client, the system either routes the requests to the closest replica or routes the
requests to all replicas and waits for a quorum of responses.

5.1 Partitioning

One of the key design features for Cassandra is the ability
to scale incrementally. This requires the ability to dynamically partition the data over the set of nodes (i.e., storage
hosts) in the cluster. Cassandra partitions data across the
cluster using consistent hashing [11] but uses an order preserving hash function to do so. In consistent hashing the
output range of a hash function is treated as a fixed circular
space or ring (i.e. the largest hash value wraps around
to the smallest hash value). Each node in the system is assigned a random value within this space which represents its
position on the ring. Each data item identified by a key is
assigned to a node by hashing the data item's key to yield
its position on the ring, and then walking the ring clockwise
to find the first node with a position larger than the item's
position. This node is deemed the coordinator for this key.
The application specifies this key and Cassandra uses it
to route requests. Thus, each node becomes responsible for
the region in the ring between it and its predecessor node
on the ring. The principal advantage of consistent hashing
is that departure or arrival of a node only affects its immediate neighbors and other nodes remain unaffected. The
basic consistent hashing algorithm presents some challenges.
First, the random position assignment of each node on the
ring leads to non-uniform data and load distribution. Second, the basic algorithm is oblivious to the heterogeneity in
the performance of nodes. Typically there exist two ways to
address this issue: One is for nodes to get assigned to multiple positions in the circle (like in Dynamo), and the second
is to analyze load information on the ring and have lightly
loaded nodes move on the ring to alleviate heavily loaded
nodes as described in [17]. Cassandra opts for the latter as
it makes the design and implementation very tractable and
helps to make very deterministic choices about load balancing.
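
As a rough illustration of the ring walk just described, the following Java sketch stores node tokens in a sorted map and finds the coordinator for a key. The Ring class, the MD5 stand-in hash (Cassandra's hash is order preserving; this one is not), and all names are assumptions made only for the example.

    import java.math.BigInteger;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.Map;
    import java.util.TreeMap;

    class Ring {
        // token on the ring -> node assigned that position
        private final TreeMap<BigInteger, String> ring = new TreeMap<>();

        void addNode(String node, BigInteger token) { ring.put(token, node); }

        // Walk the ring clockwise: the first node at a position >= the key's
        // position is the coordinator; wrap around to the smallest token.
        // Assumes a non-empty ring.
        String coordinatorFor(String key) throws Exception {
            Map.Entry<BigInteger, String> e = ring.ceilingEntry(hash(key));
            return (e != null ? e : ring.firstEntry()).getValue();
        }

        // Stand-in hash for the sketch; not order preserving.
        static BigInteger hash(String key) throws Exception {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, d);
        }
    }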

5.2 Replication

Cassandra uses replication to achieve high availability and
durability. Each data item is replicated at N hosts, where N
is the replication factor configured per-instance. Each key,
k, is assigned to a coordinator node (described in the previous section). The coordinator is in charge of the replication
of the data items that fall within its range. In addition
to locally storing each key within its range, the coordinator
replicates these keys at the N-1 nodes in the ring. Cassandra
provides the client with various options for how data needs to
be replicated. Cassandra provides various replication policies such as "Rack Unaware", "Rack Aware" (within a datacenter) and "Datacenter Aware". Replicas are chosen based
on the replication policy chosen by the application. If an application chooses the "Rack Unaware" replication strategy then the non-coordinator replicas are chosen by picking
N-1 successors of the coordinator on the ring. For Rack
Aware and Datacenter Aware strategies the algorithm is
slightly more involved. The Cassandra system elects a leader
amongst its nodes using a system called Zookeeper[13]. All
nodes on joining the cluster contact the leader, who tells
them what ranges they are replicas for, and the leader makes
a concerted effort to maintain the invariant that no node
is responsible for more than N-1 ranges in the ring. The
metadata about the ranges a node is responsible for is cached
locally at each node and in a fault-tolerant manner inside
Zookeeper - this way a node that crashes and comes back up
knows what ranges it was responsible for. We borrow from
Dynamo parlance and deem the nodes that are responsible
for a given range the preference list for the range.
As is explained in Section 5.1 every node is aware of every
other node in the system and hence the range they are responsible for. Cassandra provides durability guarantees in
the presence of node failures and network partitions by relaxing the quorum requirements as described in Section 5.2.
Data center failures happen due to power outages, cooling
failures, network failures, and natural disasters. Cassandra
is configured such that each row is replicated across multiple
data centers. In essence, the preference list of a key is constructed such that the storage nodes are spread across multiple datacenters. These datacenters are connected through
high speed network links. This scheme of replicating across
multiple datacenters allows us to handle entire data center
failures without any outage.
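
A minimal sketch of the "Rack Unaware" placement described above, reusing the sorted-map ring from the previous sketch: the coordinator plus its N-1 distinct successors form the preference list. The class and method names are assumptions for illustration only.

    import java.math.BigInteger;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    class RackUnawarePlacement {
        // Returns up to n distinct nodes: the coordinator for keyToken and its
        // successors, walking the ring clockwise and wrapping around.
        static List<String> preferenceList(TreeMap<BigInteger, String> ring,
                                           BigInteger keyToken, int n) {
            List<String> replicas = new ArrayList<>();
            BigInteger pos = keyToken;
            for (int steps = 0; steps < ring.size() && replicas.size() < n; steps++) {
                Map.Entry<BigInteger, String> e = ring.ceilingEntry(pos);
                if (e == null) e = ring.firstEntry();           // wrap around
                if (!replicas.contains(e.getValue())) replicas.add(e.getValue());
                pos = e.getKey().add(BigInteger.ONE);           // move past this position
            }
            return replicas;
        }
    }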

5.3 Membership
Cluster membership in Cassandra is based on Scuttlebutt[19], a very efficient anti-entropy Gossip based mechanism. The salient feature of Scuttlebutt is that it has very
efficient CPU utilization and very efficient utilization of the
gossip channel. Within the Cassandra system Gossip is not
only used for membership but also to disseminate other system related control state.

5.3.1 Failure Detection

Failure detection is a mechanism by which a node can
locally determine if any other node in the system is up or
down. In Cassandra failure detection is also used to avoid attempts to communicate with unreachable nodes during various operations. Cassandra uses a modified version of the
Accrual Failure Detector[8]. The idea of an Accrual Failure
Detector is that the failure detection module doesn't emit
a Boolean value stating a node is up or down. Instead the
failure detection module emits a value which represents a
suspicion level for each of the monitored nodes. This value is
defined as Φ (PHI). The basic idea is to express the value of Φ on
a scale that is dynamically adjusted to reflect network and
load conditions at the monitored nodes.
Φ has the following meaning: Given some threshold Φ,
and assuming that we decide to suspect a node A when Φ =
1, then the likelihood that we will make a mistake (i.e., the
decision will be contradicted in the future by the reception
of a late heartbeat) is about 10%. The likelihood is about
1% with Φ = 2, 0.1% with Φ = 3, and so on. Every node in
the system maintains a sliding window of inter-arrival times
of gossip messages from other nodes in the cluster. The
distribution of these inter-arrival times is determined and
Φ is calculated. Although the original paper suggests that
the distribution is approximated by the Gaussian distribution we found the Exponential Distribution to be a better
approximation, because of the nature of the gossip channel
and its impact on latency. To our knowledge our implementation of the Accrual Failure Detection in a Gossip based
setting is the first of its kind. Accrual Failure Detectors
are very good in both their accuracy and their speed and
they also adjust well to network conditions and server load
conditions.
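
The sketch below illustrates, under the exponential-distribution assumption stated above, how a per-node PHI value can be derived from a sliding window of gossip inter-arrival times (PHI is -log10 of the probability that the heartbeat is merely late). The class name, window size, and threshold are assumptions for the example, not Cassandra's actual implementation.

    import java.util.ArrayDeque;
    import java.util.Deque;

    class AccrualFailureDetectorSketch {
        private static final int WINDOW = 1000;        // sliding window of samples
        private final Deque<Long> intervalsMs = new ArrayDeque<>();
        private long lastHeartbeatMs = -1;

        synchronized void heartbeat(long nowMs) {
            if (lastHeartbeatMs >= 0) {
                intervalsMs.addLast(nowMs - lastHeartbeatMs);
                if (intervalsMs.size() > WINDOW) intervalsMs.removeFirst();
            }
            lastHeartbeatMs = nowMs;
        }

        // PHI = -log10(P(heartbeat arrives later than now)), where under the
        // exponential assumption P = exp(-t / mean inter-arrival time).
        synchronized double phi(long nowMs) {
            if (intervalsMs.isEmpty()) return 0.0;
            double mean = intervalsMs.stream()
                    .mapToLong(Long::longValue).average().orElse(1.0);
            double t = nowMs - lastHeartbeatMs;
            return t / (mean * Math.log(10.0));
        }

        boolean suspect(long nowMs, double threshold) {   // e.g. threshold = 5
            return phi(nowMs) > threshold;
        }
    }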

5.4 Bootstrapping

When a node starts for the first time, it chooses a random
token for its position in the ring. For fault tolerance, the
mapping is persisted to disk locally and also in Zookeeper.
The token information is then gossiped around the cluster.
This is how we know about all nodes and their respective positions in the ring. This enables any node to route a request
for a key to the correct node in the cluster. In the bootstrap
case, when a node needs to join a cluster, it reads its configuration file which contains a list of a few contact points within
the cluster. We call these initial contact points the seeds of the
cluster. Seeds can also come from a configuration service
like Zookeeper.
In Facebooks environment node outages (due to failures
and maintenance tasks) are often transient but may last for
extended intervals. Failures can be of various forms such
as disk failures, bad CPU etc. A node outage rarely signifies a permanent departure and therefore should not result
in re-balancing of the partition assignment or repair of the
unreachable replicas. Similarly, manual error could result
in the unintentional startup of new Cassandra nodes. To
that effect every message contains the cluster name of each
Cassandra instance. If a manual error in configuration led
to a node trying to join a wrong Cassandra instance it can
be thwarted based on the cluster name. For these reasons, it
was deemed appropriate to use an explicit mechanism to
initiate the addition and removal of nodes from a Cassandra instance. An administrator uses a command line tool
or a browser to connect to a Cassandra node and issue a
membership change to join or leave the cluster.

5.5 Scaling the Cluster

When a new node is added into the system, it gets assigned
a token such that it can alleviate a heavily loaded node.
This results in the new node splitting a range that some
other node was previously responsible for. The Cassandra
bootstrap algorithm is initiated from any other node in the
system by an operator using either a command line utility
or the Cassandra web dashboard. The node giving up the
data streams the data over to the new node using kernel-kernel copy techniques. Operational experience has shown
that data can be transferred at the rate of 40 MB/sec from
a single node. We are working on improving this by having
multiple replicas take part in the bootstrap transfer thereby
parallelizing the effort, similar to BitTorrent.

5.6 Local Persistence

The Cassandra system relies on the local file system for
data persistence. The data is represented on disk using a format that lends itself to efficient data retrieval. A typical write
operation involves a write into a commit log for durability
and recoverability and an update into an in-memory data
structure. The write into the in-memory data structure is
performed only after a successful write into the commit log.
We have a dedicated disk on each machine for the commit
log since all writes into the commit log are sequential and
so we can maximize disk throughput. When the in-memory
data structure crosses a certain threshold, calculated based
on data size and number of objects, it dumps itself to disk.
This write is performed on one of many commodity disks
that machines are equipped with. All writes are sequential
to disk and also generate an index for efficient lookup based
on row key. These indices are also persisted along with the
data file. Over time many such files could exist on disk and
a merge process runs in the background to collate the different files into one file. This process is very similar to the
compaction process that happens in the Bigtable system.
A typical read operation first queries the in-memory data
structure before looking into the files on disk. The files are
looked at in the order of newest to oldest. When a disk
lookup occurs we could be looking up a key in multiple files
on disk. In order to prevent lookups into files that do not
contain the key, a bloom filter, summarizing the keys in
the file, is also stored in each data file and also kept in
memory. This bloom filter is first consulted to check if the
key being looked up does indeed exist in the given file. A key
in a column family could have many columns. Some special
indexing is required to retrieve columns which are further
away from the key. In order to prevent scanning of every
column on disk we maintain column indices which allow us to
jump to the right chunk on disk for column retrieval. As the
columns for a given key are being serialized and written out
to disk we generate indices at every 256K chunk boundary.
This boundary is configurable, but we have found 256K to
work well for us in our production workloads.
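
The read path just described can be summarized with the following Java sketch: consult the in-memory structure first, then the on-disk files from newest to oldest, skipping any file whose bloom filter rules the key out. The DataFile interface and the composite key are assumptions for illustration, not the actual storage-engine types.

    import java.util.List;
    import java.util.Map;

    interface DataFile {
        boolean mightContain(String key);              // in-memory bloom filter
        byte[] read(String key, String column);        // disk lookup via column index
    }

    class ReadPathSketch {
        static byte[] get(Map<String, byte[]> memtable, List<DataFile> filesNewestFirst,
                          String key, String column) {
            byte[] v = memtable.get(key + ":" + column);   // latest data, if present
            if (v != null) return v;
            for (DataFile f : filesNewestFirst) {
                if (!f.mightContain(key)) continue;        // definitely not in this file
                v = f.read(key, column);
                if (v != null) return v;                   // newest file wins
            }
            return null;
        }
    }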

5.7 Implementation Details

The Cassandra process on a single machine primarily
consists of the following abstractions: the partitioning module,
the cluster membership and failure detection module, and
the storage engine module. Each of these modules relies on an
event driven substrate where the message processing pipeline
and the task pipeline are split into multiple stages along the
line of the SEDA[20] architecture. Each of these modules
has been implemented from the ground up using Java. The
cluster membership and failure detection module is built on
top of a network layer which uses non-blocking I/O. All system control messages rely on UDP based messaging while
the application related messages for replication and request
routing rely on TCP. The request routing modules are implemented using a state machine. When a read/write
request arrives at any node in the cluster the state machine
morphs through the following states: (i) identify the node(s)
that own the data for the key (ii) route the requests to the
nodes and wait on the responses to arrive (iii) if the replies
do not arrive within a configured timeout value fail the request and return to the client (iv) figure out the latest response based on timestamp (v) schedule a repair of the data
at any replica if they do not have the latest piece of data.
For the sake of exposition we do not talk about failure scenarios
here. The system can be configured to perform either synchronous or asynchronous writes. For certain systems that
require high throughput we rely on asynchronous replication. Here the writes far exceed the reads that come into
the system. During the synchronous case we wait for a quorum of responses before we return a result to the client.
In any journaled system there needs to exist a mechanism
for purging commit log entries. In Cassandra we use a rolling
commit log where a new commit log is rolled out after an
older one exceeds a particular, configurable, size. We have
found that rolling commit logs at a size of 128MB seems to
work very well in our production workloads. Every commit log has a header which is basically a bit vector whose
size is fixed and typically more than the number of column
families that a particular system will ever handle. In our
implementation we have an in-memory data structure and a
data file that is generated per column family. Every time the
in-memory data structure for a particular column family is
dumped to disk we set its bit in the commit log stating that
this column family has been successfully persisted to disk.
This is an indication that this piece of information is already
committed. These bit vectors are per commit log and also
maintained in memory. Every time a commit log is rolled
its bit vector and all the bit vectors of commit logs rolled
prior to it are checked. If it is deemed that all the data
has been successfully persisted to disk then these commit
logs are deleted. The write operation into the commit log
can either be in normal mode or in fast sync mode. In the
fast sync mode the writes to the commit log are buffered.
This implies that there is a potential for data loss on machine crash. In this mode we also dump the in-memory data
structure to disk in a buered fashion. Traditional databases
are not designed to handle particularly high write throughput. Cassandra morphs all writes to disk into sequential
writes thus maximizing disk write throughput. Since the
files dumped to disk are never mutated no locks need to be
taken while reading them. The server instance of Cassandra
is practically lockless for read/write operations. Hence we
do not need to deal with or handle the concurrency issues
that exist in B-Tree based database implementations.
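
As a rough sketch of the commit-log bookkeeping described above: each rolled log segment carries a bit vector indexed by column family, and a segment becomes deletable once every column family that wrote into it has been flushed to disk. For simplicity this sketch tracks the still-dirty column families per segment (the complement of the flushed-bit convention in the text); all names are illustrative assumptions.

    import java.util.ArrayList;
    import java.util.BitSet;
    import java.util.List;

    class CommitLogSegmentSketch {
        final BitSet dirty = new BitSet();                 // one bit per column family id
        void markDirty(int cfId)   { dirty.set(cfId); }    // a write hit this column family
        void markFlushed(int cfId) { dirty.clear(cfId); }  // its memtable was persisted
        boolean fullyFlushed()     { return dirty.isEmpty(); }
    }

    class CommitLogSketch {
        private final List<CommitLogSegmentSketch> segments = new ArrayList<>();

        CommitLogSegmentSketch roll() {                    // new segment, e.g. every 128MB
            CommitLogSegmentSketch s = new CommitLogSegmentSketch();
            segments.add(s);
            return s;
        }

        // When the memtable for column family cfId is dumped to disk, clear its
        // bit everywhere; segments with no dirty bits left can be deleted.
        void onMemtableFlushed(int cfId) {
            segments.forEach(s -> s.markFlushed(cfId));
            segments.removeIf(CommitLogSegmentSketch::fullyFlushed);
        }
    }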
The Cassandra system indexes all data based on primary
key. The data file on disk is broken down into a sequence
of blocks. Each block contains at most 128 keys and is demarcated by a block index. The block index captures the
relative offset of a key within the block and the size of its
data. When an in-memory data structure is dumped to disk
a block index is generated and their offsets written out to
disk as indices. This index is also maintained in memory for
fast access. A typical read operation always looks up data
first in the in-memory data structure. If found the data is
returned to the application since the in-memory data structure contains the latest data for any key. If not found then
we perform disk I/O against all the data files on disk in reverse time order. Since we are always looking for the latest

data we look into the latest file first and return if we find
the data. Over time the number of data files will increase
on disk. We perform a compaction process, very much like
the Bigtable system, which merges multiple files into one;
essentially merge sort on a bunch of sorted data files. The
system will always compact files that are close to each other
with respect to size, i.e., there will never be a situation where a
100GB file is compacted with a file which is less than 50GB.
Periodically a major compaction process is run to compact
all related data files into one big file. This compaction process is a disk I/O intensive operation. Many optimizations
can be put in place so as not to affect incoming read requests.
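
A small sketch of the per-file block index described above: with at most 128 keys per block, the index maps each block's first key to its byte offset, so a read can seek directly to the one block that may hold the key. The class and its fields are assumptions for illustration.

    import java.util.Map;
    import java.util.TreeMap;

    class BlockIndexSketch {
        static final int KEYS_PER_BLOCK = 128;
        // first key stored in a block -> byte offset of that block in the data file
        private final TreeMap<String, Long> firstKeyToOffset = new TreeMap<>();

        void addBlock(String firstKeyInBlock, long fileOffset) {
            firstKeyToOffset.put(firstKeyInBlock, fileOffset);
        }

        // Offset of the block that may contain the key, or -1 if the key sorts
        // before the first block of the file.
        long blockOffsetFor(String key) {
            Map.Entry<String, Long> e = firstKeyToOffset.floorEntry(key);
            return e == null ? -1L : e.getValue();
        }
    }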

6. PRACTICAL EXPERIENCES

In the process of designing, implementing and maintaining
Cassandra we gained a lot of useful experience and learned
numerous lessons. One very fundamental lesson learned was
not to add any new feature without understanding the effects
of its usage by applications. Most problematic scenarios do
not stem from just node crashes and network partitions. We
share just a few interesting scenarios here.
Before launching the Inbox Search application we had
to index 7TB of inbox data for over 100M users, then
stored in our MySQL[1] infrastructure, and load it into
the Cassandra system. The whole process involved
running Map/Reduce[7] jobs against the MySQL data
files, indexing them and then storing the reverse-index
in Cassandra. The M/R process actually behaves as
the client of Cassandra. We exposed some background
channels for the M/R process to aggregate the reverse index per user and send the serialized data
over to the Cassandra instance, to avoid the serialization/deserialization overhead. This way the Cassandra
instance is only bottlenecked by network bandwidth.
Most applications only require atomic operations per key per replica. However, there have been some applications that have asked for transactional semantics, mainly for the purpose of maintaining secondary indices. Most developers with years of development experience working with RDBMSs find this a very useful feature to have. We are working on a mechanism to expose such atomic operations.

We experimented with various implementations of failure detectors, such as the ones described in [15] and [5]. Our experience was that the time to detect failures increased beyond an acceptable limit as the size of the cluster grew. In one particular experiment, in a cluster of 100 nodes, the time taken to detect a failed node was on the order of two minutes. This is practically unworkable in our environments. With the accrual failure detector, using a slightly conservative value of PHI set to 5, the average time to detect failures in the above experiment was about 15 seconds.

Monitoring is not to be taken for granted. The Cassandra system is well integrated with Ganglia[12], a distributed performance monitoring tool. We expose various system level metrics to Ganglia and this has helped us understand the behavior of the system when subjected to our production workload. Disks fail for no apparent reason. The bootstrap algorithm has some hooks to repair nodes when disks fail. This is, however, an administrative operation.

Although Cassandra is a completely decentralized system we have learned that having some amount of coordination is essential to making the implementation of some distributed features tractable. For example Cassandra is integrated with Zookeeper, which can be used for various coordination tasks in large scale distributed systems. We intend to use the Zookeeper abstraction for some key features which do not come in the way of applications that use Cassandra as the storage engine.

6.1 Facebook Inbox Search
For Inbox Search we maintain a per user index of all messages that have been exchanged between the sender and the
recipients of the message. There are two kinds of search features that are enabled today: (a) term search, and (b) interactions
- given the name of a person return all messages that the
user might have ever sent or received from that person. The
schema consists of two column families. For query (a) the
user id is the key and the words that make up the message
become the super column. Individual message identifiers
of the messages that contain the word become the columns
within the super column. For query (b) again the user id is
the key and the recipients' ids are the super columns. For
each of these super columns the individual message identifiers are the columns. In order to make the searches fast
Cassandra provides certain hooks for intelligent caching of
data. For instance when a user clicks into the search bar
an asynchronous message is sent to the Cassandra cluster
to prime the buffer cache with that user's index. This way
when the actual search query is executed the search results
are likely to already be in memory. The system currently
stores about 50+TB of data on a 150 node cluster, which
is spread out between east and west coast data centers. We
show some production measured numbers for read performance.
Latency Stat    Search Interactions    Term Search
Min             7.69ms                 7.78ms
Median          15.69ms                18.27ms
Max             26.13ms                44.41ms


7. CONCLUSION

We have built, implemented, and operated a storage system providing scalability, high performance, and wide applicability. We have empirically demonstrated that Cassandra can support a very high update throughput while delivering low latency. Future work involves adding compression, the ability to support atomicity across keys, and secondary index support.

8. ACKNOWLEDGEMENTS

The Cassandra system has benefitted greatly from feedback from many individuals within Facebook. In addition we thank Karthik Ranganathan who indexed all the existing data in MySQL and moved it into Cassandra for our first production deployment. We would also like to thank Dan Dumitriu from EPFL for his valuable suggestions about [19] and [8].

9. REFERENCES

[1] MySQL AB. Mysql.


[2] Atul Adya, William J. Bolosky, Miguel Castro, Gerald
Cermak, Ronnie Chaiken, John R. Douceur, Jon
Howell, Jacob R. Lorch, Marvin Theimer, and
Roger P. Wattenhofer. Farsite: Federated, available,
and reliable storage for an incompletely trusted
environment. In In Proceedings of the 5th Symposium
on Operating Systems Design and Implementation
(OSDI, pages 114, 2002.
[3] Mike Burrows. The chubby lock service for
loosely-coupled distributed systems. In OSDI 06:
Proceedings of the 7th symposium on Operating
systems design and implementation, pages 335350,
Berkeley, CA, USA, 2006. USENIX Association.
[4] Fay Chang, Jeffrey Dean, Sanjay Ghemawat,
Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows,
Tushar Chandra, Andrew Fikes, and Robert E.
Gruber. Bigtable: A distributed storage system for
structured data. In In Proceedings of the 7th
Conference on USENIX Symposium on Operating
Systems Design and Implementation - Volume 7,
pages 205218, 2006.
[5] Abhinandan Das, Indranil Gupta, and Ashish
Motivala. Swim: Scalable weakly-consistent
infection-style process group membership protocol. In
DSN 02: Proceedings of the 2002 International
Conference on Dependable Systems and Networks,
pages 303312, Washington, DC, USA, 2002. IEEE
Computer Society.
[6] Giuseppe DeCandia, Deniz Hastorun, Madan
Jampani, Gunavardhan Kakulapati, Alex Pilchin,
Swaminathan Sivasubramanian, Peter Vosshall, and
Werner Vogels. Dynamo: Amazon's highly available
key-value store. In Proceedings of twenty-first ACM
SIGOPS symposium on Operating systems principles,
pages 205-220. ACM, 2007.
[7] Jeffrey Dean and Sanjay Ghemawat. MapReduce:
simplified data processing on large clusters. Commun.
ACM, 51(1):107113, 2008.
[8] Xavier Défago, Péter Urbán, Naohiro Hayashibara,
and Takuya Katayama. The Φ accrual failure detector.
In RR IS-RR-2004-010, Japan Advanced Institute of
Science and Technology, pages 6678, 2004.
[9] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak
Leung. The Google file system. In SOSP 03:
Proceedings of the nineteenth ACM symposium on
Operating systems principles, pages 2943, New York,
NY, USA, 2003. ACM.
[10] Jim Gray and Pat Helland. The dangers of replication
and a solution. In In Proceedings of the 1996 ACM
SIGMOD International Conference on Management of
Data, pages 173182, 1996.
[11] David Karger, Eric Lehman, Tom Leighton, Matthew
Levine, Daniel Lewin, and Rina Panigrahy. Consistent
hashing and random trees: Distributed caching
protocols for relieving hot spots on the world wide
web. In In ACM Symposium on Theory of Computing,
pages 654663, 1997.
[12] Matthew L. Massie, Brent N. Chun, and David E.
Culler. The ganglia distributed monitoring system:
Design, implementation, and experience. Parallel

Computing, 30:2004, 2004.


[13] Benjamin Reed and Flavio Junqueira. ZooKeeper.
[14] Peter Reiher, John Heidemann, David Ratner, Greg
Skinner, and Gerald Popek. Resolving file conflicts in
the ficus file system. In USTC94: Proceedings of the
USENIX Summer 1994 Technical Conference on
USENIX Summer 1994 Technical Conference, pages
1212, Berkeley, CA, USA, 1994. USENIX
Association.
[15] Robbert Van Renesse, Yaron Minsky, and Mark
Hayden. A gossip-style failure detection service. In
Proc. Conf. Middleware, pages 55-70, 1996.
[16] Mahadev Satyanarayanan, James J. Kistler, Puneet
Kumar, Maria E. Okasaki, Ellen H. Siegel, and
David C. Steere. Coda: A highly available file system
for a distributed workstation environment. IEEE
Trans. Comput., 39(4):447459, 1990.
[17] Ion Stoica, Robert Morris, David Liben-nowell,
David R. Karger, M. Frans Kaashoek, Frank Dabek,
and Hari Balakrishnan. Chord: a scalable peer-to-peer
lookup protocol for internet applications. IEEE/ACM
Transactions on Networking, 11:1732, 2003.
[18] D. B. Terry, M. M. Theimer, Karin Petersen, A. J.
Demers, M. J. Spreitzer, and C. H. Hauser. Managing
update conflicts in bayou, a weakly connected
replicated storage system. In SOSP 95: Proceedings
of the fifteenth ACM symposium on Operating systems
principles, pages 172182, New York, NY, USA, 1995.
ACM.
[19] Robbert van Renesse, Dan Mihai Dumitriu, Valient
Gough, and Chris Thomas. Efficient reconciliation and
flow control for anti-entropy protocols. In Proceedings
of the 2nd Large Scale Distributed Systems and
Middleware Workshop (LADIS 08), New York, NY,
USA, 2008. ACM.
[20] Matt Welsh, David Culler, and Eric Brewer. Seda: an
architecture for well-conditioned, scalable internet
services. In SOSP 01: Proceedings of the eighteenth
ACM symposium on Operating systems principles,
pages 230243, New York, NY, USA, 2001. ACM.

Windows Azure Storage: A Highly Available
Cloud Storage Service with Strong Consistency
Brad Calder, Ju Wang, Aaron Ogus, Niranjan Nilakantan, Arild Skjolsvold, Sam McKelvie, Yikang Xu,
Shashwat Srivastav, Jiesheng Wu, Huseyin Simitci, Jaidev Haridas, Chakravarthy Uddaraju,
Hemal Khatri, Andrew Edwards, Vaman Bedekar, Shane Mainali, Rafay Abbasi, Arpit Agarwal,
Mian Fahim ul Haq, Muhammad Ikram ul Haq, Deepali Bhardwaj, Sowmya Dayanand,
Anitha Adusumilli, Marvin McNett, Sriram Sankaran, Kavitha Manivannan, Leonidas Rigas

Microsoft
Abstract


Windows Azure Storage (WAS) is a cloud storage system that


provides customers the ability to store seemingly limitless
amounts of data for any duration of time. WAS customers have
access to their data from anywhere at any time and only pay for
what they use and store. In WAS, data is stored durably using
both local and geographic replication to facilitate disaster
recovery. Currently, WAS storage comes in the form of Blobs
(files), Tables (structured storage), and Queues (message
delivery). In this paper, we describe the WAS architecture, global
namespace, and data model, as well as its resource provisioning,
load balancing, and replication systems.


Categories and Subject Descriptors

D.4.2 [Operating Systems]: Storage Management - Secondary storage; D.4.3 [Operating Systems]: File Systems Management - Distributed file systems; D.4.5 [Operating Systems]: Reliability - Fault tolerance; D.4.7 [Operating Systems]: Organization and Design - Distributed systems; D.4.8 [Operating Systems]: Performance - Measurements

General Terms

Algorithms, Design, Management, Measurement, Performance, Reliability.

Keywords

Cloud storage, distributed storage systems, Windows Azure.

1. Introduction

Windows Azure Storage (WAS) is a scalable cloud storage
system that has been in production since November 2008. It is
used inside Microsoft for applications such as social networking
search, serving video, music and game content, managing medical
records, and more. In addition, there are thousands of customers
outside Microsoft using WAS, and anyone can sign up over the
Internet to use the system.

WAS provides cloud storage in the form of Blobs (user files),
Tables (structured storage), and Queues (message delivery).
These three data abstractions provide the overall storage and
workflow for many applications. A common usage pattern we see
is incoming and outgoing data being shipped via Blobs, Queues
providing the overall workflow for processing the Blobs, and
intermediate service state and final results being kept in Tables or
Blobs.

An example of this pattern is an ingestion engine service built on
Windows Azure to provide near real-time Facebook and Twitter
search. This service is one part of a larger data processing
pipeline that provides publically searchable content (via our
search engine, Bing) within 15 seconds of a Facebook or Twitter
user's posting or status update. Facebook and Twitter send the
raw public content to WAS (e.g., user postings, user status
updates, etc.) to be made publically searchable. This content is
stored in WAS Blobs. The ingestion engine annotates this data
with user auth, spam, and adult scores; content classification; and
classification for language and named entities. In addition, the
engine crawls and expands the links in the data. While
processing, the ingestion engine accesses WAS Tables at high
rates and stores the results back into Blobs. These Blobs are then
folded into the Bing search engine to make the content publically
searchable. The ingestion engine uses Queues to manage the flow
of work, the indexing jobs, and the timing of folding the results
into the search engine. As of this writing, the ingestion engine for
Facebook and Twitter keeps around 350TB of data in WAS
(before replication). In terms of transactions, the ingestion engine
has a peak traffic load of around 40,000 transactions per second
and does between two to three billion transactions per day (see
Section 7 for discussion of additional workload profiles).

In the process of building WAS, feedback from potential internal
and external customers drove many design decisions. Some key
design features resulting from this feedback include:

Strong Consistency Many customers want strong consistency:


especially enterprise customers moving their line of business
applications to the cloud. They also want the ability to perform
conditional reads, writes, and deletes for optimistic concurrency
control [12] on the strongly consistent data. For this, WAS
provides three properties that the CAP theorem [2] claims are
difficult to achieve at the same time: strong consistency, high
availability, and partition tolerance (see Section 8).


Global and Scalable Namespace/Storage For ease of use,


WAS implements a global namespace that allows data to be stored
and accessed in a consistent manner from any location in the
world. Since a major goal of WAS is to enable storage of massive
amounts of data, this global namespace must be able to address
exabytes of data and beyond. We discuss our global namespace
design in detail in Section 2.


Disaster Recovery WAS stores customer data across multiple


data centers hundreds of miles apart from each other. This
redundancy provides essential data recovery protection against
disasters such as earthquakes, wild fires, tornados, nuclear reactor
meltdown, etc.


Multi-tenancy and Cost of Storage To reduce storage cost,


many customers are served from the same shared storage
infrastructure. WAS combines the workloads of many different
customers with varying resource needs together so that
significantly less storage needs to be provisioned at any one point
in time than if those services were run on their own dedicated
hardware.

3. High Level Architecture

Here we present a high level discussion of the WAS architecture


and how it fits into the Windows Azure Cloud Platform.

3.1 Windows Azure Cloud Platform

The Windows Azure Cloud platform runs many cloud services


across different data centers and different geographic regions.
The Windows Azure Fabric Controller is a resource provisioning
and management layer that provides resource allocation,
deployment/upgrade, and management for cloud services on the
Windows Azure platform. WAS is one such service running on
top of the Fabric Controller.

We describe these design features in more detail in the following


sections. The remainder of this paper is organized as follows.
Section 2 describes the global namespace used to access the WAS
Blob, Table, and Queue data abstractions. Section 3 provides a
high level overview of the WAS architecture and its three layers:
Stream, Partition, and Front-End layers. Section 4 describes the
stream layer, and Section 5 describes the partition layer. Section
6 shows the throughput experienced by Windows Azure
applications accessing Blobs and Tables. Section 7 describes
some internal Microsoft workloads using WAS. Section 8
discusses design choices and lessons learned. Section 9 presents
related work, and Section 10 summarizes the paper.

The Fabric Controller provides node management, network


configuration, health monitoring, starting/stopping of service
instances, and service deployment for the WAS system. In
addition, WAS retrieves network topology information, physical
layout of the clusters, and hardware configuration of the storage
nodes from the Fabric Controller. WAS is responsible for
managing the replication and data placement across the disks and
load balancing the data and application traffic within the storage
cluster.

2. Global Partitioned Namespace

A key goal of our storage system is to provide a single global


namespace that allows clients to address all of their storage in the
cloud and scale to arbitrary amounts of storage needed over time.
To provide this capability we leverage DNS as part of the storage
namespace and break the storage namespace into three parts: an
account name, a partition name, and an object name. As a result,
all data is accessible via a URI of the form:

3.2 WAS Architectural Components

An important feature of WAS is the ability to store and provide


access to an immense amount of storage (exabytes and beyond).
We currently have 70 petabytes of raw storage in production and
are in the process of provisioning a few hundred more petabytes
of raw storage based on customer demand for 2012.

http(s)://AccountName.<service>.core.windows.net/PartitionName/ObjectName

The WAS production system consists of Storage Stamps and the


Location Service (shown in Figure 1).

The AccountName is the customer selected account name for


accessing storage and is part of the DNS host name. The
AccountName DNS translation is used to locate the primary
storage cluster and data center where the data is stored. This
primary location is where all requests go to reach the data for that
account. An application may use multiple AccountNames to store
its data across different locations.
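
The URI shape above can be unpacked mechanically; the following Java sketch splits such a URI into its account, service, partition, and object parts. The WasName record and its parsing rules are assumptions for illustration, not part of the WAS API.

    import java.net.URI;

    record WasName(String account, String service, String partitionName, String objectName) {
        // Parses http(s)://AccountName.<service>.core.windows.net/PartitionName/ObjectName
        static WasName parse(String uriString) {
            URI uri = URI.create(uriString);
            String[] host = uri.getHost().split("\\.");          // AccountName, <service>, core, ...
            String[] path = uri.getPath().replaceFirst("^/", "").split("/", 2);
            String object = path.length > 1 ? path[1] : null;    // ObjectName is optional
            return new WasName(host[0], host[1], path[0], object);
        }
    }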

Storage Stamps A storage stamp is a cluster of N racks of


storage nodes, where each rack is built out as a separate fault
domain with redundant networking and power. Clusters typically
range from 10 to 20 racks with 18 disk-heavy storage nodes per
rack. Our first generation storage stamps hold approximately 2PB
of raw storage each. Our next generation stamps hold up to 30PB
of raw storage each.

<service> specifies the service type, which can be blob, table, or queue. APIs for Windows Azure Blobs, Tables, and Queues can be found here: http://msdn.microsoft.com/en-us/library/dd179355.aspx

Figure 1: High-level architecture

This naming approach enables WAS to flexibly support its three
data abstractions. For Blobs, the full blob name is the
PartitionName. For Tables, each entity (row) in the table has a
primary key that consists of two properties: the PartitionName and
the ObjectName. This distinction allows applications using
Tables to group rows into the same partition to perform atomic
transactions across them. For Queues, the queue name is the
PartitionName and each message has an ObjectName to uniquely
identify it within the queue.

When a PartitionName holds many objects, the ObjectName


identifies individual objects within that partition. The system
supports atomic transactions across objects with the same
PartitionName value. The ObjectName is optional since, for some
types of data, the PartitionName uniquely identifies the object
within the account.

In conjunction with the AccountName, the PartitionName locates


the data once a request reaches the storage cluster. The
PartitionName is used to scale out access to the data across
storage nodes based on traffic needs.


Partition Layer The partition layer is built for (a) managing


and understanding higher level data abstractions (Blob, Table,
Queue), (b) providing a scalable object namespace, (c) providing
transaction ordering and strong consistency for objects, (d) storing
object data on top of the stream layer, and (e) caching object data
to reduce disk I/O.

To provide low cost cloud storage, we need to keep the storage


provisioned in production as highly utilized as possible. Our goal
is to keep a storage stamp around 70% utilized in terms of
capacity, transactions, and bandwidth. We try to avoid going
above 80% because we want to keep 20% in reserve for (a) disk
short stroking to gain better seek time and higher throughput by
utilizing the outer tracks of the disks and (b) to continue providing
storage capacity and availability in the presence of a rack failure
within a stamp. When a storage stamp reaches 70% utilization,
the location service migrates accounts to different stamps using
inter-stamp replication (see Section 3.4).

Another responsibility of this layer is to achieve scalability by


partitioning all of the data objects within a stamp. As described
earlier, all objects have a PartitionName; they are broken down
into disjointed ranges based on the PartitionName values and
served by different partition servers. This layer manages which
partition server is serving what PartitionName ranges for Blobs,
Tables, and Queues. In addition, it provides automatic load
balancing of PartitionNames across the partition servers to meet
the traffic needs of the objects.

Location Service (LS) The location service manages all the


storage stamps. It is also responsible for managing the account
namespace across all stamps. The LS allocates accounts to storage
stamps and manages them across the storage stamps for disaster
recovery and load balancing. The location service itself is
distributed across two geographic locations for its own disaster
recovery.

Front-End (FE) layer The Front-End (FE) layer consists of a


set of stateless servers that take incoming requests. Upon
receiving a request, an FE looks up the AccountName,
authenticates and authorizes the request, then routes the request to
a partition server in the partition layer (based on the
PartitionName). The system maintains a Partition Map that keeps
track of the PartitionName ranges and which partition server is
serving which PartitionNames. The FE servers cache the Partition
Map and use it to determine which partition server to forward
each request to. The FE servers also stream large objects directly
from the stream layer and cache frequently accessed data for
efficiency.
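
The range-based routing the Front-End layer performs can be pictured with a sorted map from the start of each PartitionName range to the partition server that owns it; a floor lookup then resolves a request. This Java sketch and its names are assumptions, not the actual Partition Map implementation.

    import java.util.Map;
    import java.util.TreeMap;

    class PartitionMapSketch {
        // inclusive start of a PartitionName range -> partition server owning it
        private final TreeMap<String, String> rangeStartToServer = new TreeMap<>();

        void assignRange(String rangeStartInclusive, String partitionServer) {
            rangeStartToServer.put(rangeStartInclusive, partitionServer);
        }

        // The server whose range starts at or before this PartitionName,
        // or null if no range covers it yet.
        String serverFor(String partitionName) {
            Map.Entry<String, String> e = rangeStartToServer.floorEntry(partitionName);
            return e == null ? null : e.getValue();
        }
    }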

WAS provides storage from multiple locations in each of the three


geographic regions: North America, Europe, and Asia. Each
location is a data center with one or more buildings in that
location, and each location holds multiple storage stamps. To
provision additional capacity, the LS has the ability to easily add
new regions, new locations to a region, or new stamps to a
location. Therefore, to increase the amount of storage, we deploy
one or more storage stamps in the desired location's data center
and add them to the LS. The LS can then allocate new storage
accounts to those new stamps for customers as well as load
balance (migrate) existing storage accounts from older stamps to
the new stamps.

3.4 Two Replication Engines

Before describing the stream and partition layers in detail, we first


give a brief overview of the two replication engines in our system
and their separate responsibilities.

Figure 1 shows the location service with two storage stamps and
the layers within the storage stamps. The LS tracks the resources
used by each storage stamp in production across all locations.
When an application requests a new account for storing data, it
specifies the location affinity for the storage (e.g., US North).
The LS then chooses a storage stamp within that location as the
primary stamp for the account using heuristics based on the load
information across all stamps (which considers the fullness of the
stamps and other metrics such as network and transaction
utilization). The LS then stores the account metadata information
in the chosen storage stamp, which tells the stamp to start taking
traffic for the assigned account. The LS then updates DNS to
allow requests to now route from the name
https://AccountName.service.core.windows.net/ to that storage
stamp's virtual IP (VIP, an IP address the storage stamp exposes
for external traffic).

Intra-Stamp Replication (stream layer) This system provides


synchronous replication and is focused on making sure all the
data written into a stamp is kept durable within that stamp. It
keeps enough replicas of the data across different nodes in
different fault domains to keep data durable within the stamp in
the face of disk, node, and rack failures. Intra-stamp replication
is done completely by the stream layer and is on the critical path
of the customer's write requests. Once a transaction has been
replicated successfully with intra-stamp replication, success can
be returned back to the customer.
Inter-Stamp Replication (partition layer) This system
provides asynchronous replication and is focused on replicating
data across stamps. Inter-stamp replication is done in the
background and is off the critical path of the customer's request.
This replication is at the object level, where either the whole
object is replicated or recent delta changes are replicated for a
given account. Inter-stamp replication is used for (a) keeping a
copy of an account's data in two locations for disaster recovery
and (b) migrating an account's data between stamps. Inter-stamp
replication is configured for an account by the location service
and performed by the partition layer.

3.3 Three Layers within a Storage Stamp

Also shown in Figure 1 are the three layers within a storage


stamp. From bottom up these are:
Stream Layer This layer stores the bits on disk and is in charge
of distributing and replicating the data across many servers to
keep data durable within a storage stamp. The stream layer can be
thought of as a distributed file system layer within a stamp. It
understands files, called "streams" (which are ordered lists of
large storage chunks called "extents"), how to store them, how to
replicate them, etc., but it does not understand higher level object
constructs or their semantics. The data is stored in the stream
layer, but it is accessible from the partition layer. In fact, partition
servers (daemon processes in the partition layer) and stream
servers are co-located on each storage node in a stamp.

Inter-stamp replication is focused on replicating objects and the


transactions applied to those objects, whereas intra-stamp
replication is focused on replicating blocks of disk storage that are
used to make up the objects.
We separated replication into intra-stamp and inter-stamp at these
two different layers for the following reasons. Intra-stamp
replication provides durability against hardware failures, which
occur frequently in large scale systems, whereas inter-stamp
replication provides geo-redundancy against geo-disasters, which


are rare. It is crucial to provide intra-stamp replication with low


latency, since that is on the critical path of user requests; whereas
the focus of inter-stamp replication is optimal use of network
bandwidth between stamps while achieving an acceptable level of
replication delay. They are different problems addressed by the
two replication schemes.

Another reason for creating these two separate replication layers


is the namespace each of these two layers has to maintain.
Performing intra-stamp replication at the stream layer allows the
amount of information that needs to be maintained to be scoped
by the size of a single storage stamp. This focus allows all of the
meta-state for intra-stamp replication to be cached in memory for
performance (see Section 4), enabling WAS to provide fast
replication with strong consistency by quickly committing
transactions within a single stamp for customer requests. In
contrast, the partition layer combined with the location service
controls and understands the global object namespace across
stamps, allowing it to efficiently replicate and maintain object
state across data centers.

Streams Every stream has a name in the hierarchical namespace


maintained at the stream layer, and a stream looks like a big file to
the partition layer. Streams are appended to and can be randomly
read from. A stream is an ordered list of pointers to extents
which is maintained by the Stream Manager. When the extents are
concatenated together they represent the full contiguous address
space in which the stream can be read in the order they were
added to the stream. A new stream can be constructed by
concatenating extents from existing streams, which is a fast
operation since it just updates a list of pointers. Only the last
extent in the stream can be appended to. All of the prior extents
in the stream are immutable.

4. Stream Layer

The stream layer provides an internal interface used only by the partition layer. It provides a file-system-like namespace and API, except that all writes are append-only. It allows clients (the partition layer) to open, close, delete, rename, read, append to, and concatenate these large files, which are called streams. A stream is an ordered list of extent pointers, and an extent is a sequence of append blocks.

4.1 Stream Manager and Extent Nodes

The two main architecture components of the stream layer are the Stream Manager (SM) and Extent Node (EN) (shown in Figure 3).

Figure 2 shows stream //foo, which contains (pointers to) four extents (E1, E2, E3, and E4). Each extent contains a set of blocks that were appended to it. E1, E2, and E3 are sealed extents, which means that they can no longer be appended to; only the last extent in a stream (E4) can be appended to. If an application reads the data of the stream from beginning to end, it would get the block contents of the extents in the order E1, E2, E3, and E4.

Figure 2: Example stream with four extents (stream //foo points to sealed extents E1, E2, E3 and the unsealed extent E4, each holding the blocks appended to it)

Figure 3: Stream Layer Architecture (the partition layer/client, the Stream Manager (SM) Paxos cluster, and the primary and secondary Extent Nodes (ENs); steps A and B show extent creation and replica-set allocation, and the numbered steps show an append being sent to the primary EN, forwarded to the secondaries, and acknowledged back to the client)

Stream Manager (SM) The SM keeps track of the stream


namespace, what extents are in each stream, and the extent
allocation across the Extent Nodes (EN). The SM is a standard
Paxos cluster [13] as used in prior storage systems [3], and is off
the critical path of client requests. The SM is responsible for (a)
maintaining the stream namespace and state of all active streams
and extents, (b) monitoring the health of the ENs, (c) creating and
assigning extents to ENs, (d) performing the lazy re-replication of
extent replicas that are lost due to hardware failures or
unavailability, (e) garbage collecting extents that are no longer
pointed to by any stream, and (f) scheduling the erasure coding of
extent data according to stream policy (see Section 4.4).



In more detail these data concepts are:
Block This is the minimum unit of data for writing and reading.
A block can be up to N bytes (e.g. 4MB). Data is written
(appended) as one or more concatenated blocks to an extent,
where blocks do not have to be the same size. The client does an
append in terms of blocks and controls the size of each block. A
client read gives an offset to a stream or extent, and the stream
layer reads as many blocks as needed at the offset to fulfill the
length of the read. When performing a read, the entire contents of
a block are read. This is because the stream layer stores its
checksum validation at the block level, one checksum per block.
The whole block is read to perform the checksum validation, and
it is checked on every block read. In addition, all blocks in the
system are validated against their checksums once every few days
to check for data integrity issues.
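As a concrete illustration of the checksum-per-block contract described above, the following minimal Python sketch (all names such as Block and read_range are invented for illustration, not WAS code) reads whole blocks, validates each block's checksum, and only then returns the requested byte range.

    import hashlib
    from dataclasses import dataclass

    @dataclass
    class Block:
        data: bytes        # appended block contents (variable size, up to N bytes)
        checksum: str      # checksum stored alongside the block when it was appended

    def block_checksum(data: bytes) -> str:
        # Hypothetical checksum choice; the text only says one checksum is kept per block.
        return hashlib.sha256(data).hexdigest()

    def read_range(blocks, offset, length):
        """Read `length` bytes starting at `offset`, validating every block touched."""
        out, pos = bytearray(), 0
        for blk in blocks:
            start, end = pos, pos + len(blk.data)
            if end > offset and start < offset + length:
                # The entire block is read and validated, even for a partial overlap.
                if block_checksum(blk.data) != blk.checksum:
                    raise IOError("block checksum mismatch: data corruption detected")
                lo = max(offset, start) - start
                hi = min(offset + length, end) - start
                out += blk.data[lo:hi]
            pos = end
        return bytes(out)

    # Example: two appended blocks of different sizes.
    blocks = [Block(b"a" * 10, block_checksum(b"a" * 10)),
              Block(b"b" * 5, block_checksum(b"b" * 5))]
    assert read_range(blocks, 8, 4) == b"aabb"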

The SM periodically polls (syncs) the state of the ENs and what
extents they store. If the SM discovers that an extent is replicated
on fewer than the expected number of ENs, a re-replication of the
extent will lazily be created by the SM to regain the desired level
of replication. For extent replica placement, the SM randomly
chooses ENs across different fault domains, so that they are stored
on nodes that will not have correlated failures due to power,
network, or being on the same rack.

Extent Extents are the unit of replication in the stream layer, and the default replication policy is to keep three replicas within a storage stamp for an extent. Each extent is stored in an NTFS file and consists of a sequence of blocks. The target extent size used by the partition layer is 1GB. To store small objects, the partition layer appends many of them to the same extent and even in the same block; to store large TB-sized objects (Blobs), the object is broken up over many extents by the partition layer. As part of its index, the partition layer keeps track of which streams, extents, and byte offsets within those extents each object is stored in.

The SM does not know anything about blocks, just streams and extents. The SM is off the critical path of client requests and does not track each block append, since the total number of blocks can be huge and the SM cannot scale to track those. Since the stream and extent state is only tracked within a single stamp, the amount of state can be kept small enough to fit in the SM's memory. The only client of the stream layer is the partition layer, and the partition layer and stream layer are co-designed so that they will not use more than 50 million extents and no more than 100,000 streams for a single storage stamp given our current stamp sizes. This parameterization can comfortably fit into 32GB of memory for the SM.

Extent Nodes (EN) Each extent node maintains the storage for
a set of extent replicas assigned to it by the SM. An EN has N
disks attached, which it completely controls for storing extent
replicas and their blocks. An EN knows nothing about streams,
and only deals with extents and blocks. Internally on an EN
server, every extent on disk is a file, which holds data blocks and
their checksums, and an index which maps extent offsets to blocks
and their file location. Each extent node contains a view about the
extents it owns and where the peer replicas are for a given extent.
This view is a cache kept by the EN of the global state the SM
keeps. ENs only talk to other ENs to replicate block writes
(appends) sent by a client, or to create additional copies of an
existing replica when told to by the SM. When an extent is no
longer referenced by any stream, the SM garbage collects the
extent and notifies the ENs to reclaim the space.
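The sketch below models, under assumed names (ExtentReplica, ExtentIndexEntry), the kind of per-extent state an EN keeps according to the description above: a data file path, an index mapping extent offsets to block locations within that file, a commit length, and a cached view of peer replicas. It is only a schematic illustration, not EN code.

    from dataclasses import dataclass, field

    @dataclass
    class ExtentIndexEntry:
        extent_offset: int   # logical offset of the block within the extent
        file_offset: int     # where the block (and its checksum) lives in the extent's file
        length: int

    @dataclass
    class ExtentReplica:
        """Per-extent state kept by an Extent Node: a data file plus an offset index."""
        file_path: str
        index: list = field(default_factory=list)           # sorted ExtentIndexEntry records
        commit_length: int = 0
        peer_replicas: list = field(default_factory=list)   # cached view of peer ENs (from the SM)

        def record_append(self, block_len: int, file_offset: int):
            self.index.append(ExtentIndexEntry(self.commit_length, file_offset, block_len))
            self.commit_length += block_len

        def locate(self, extent_offset: int) -> ExtentIndexEntry:
            # Find the block containing the requested extent offset.
            for entry in self.index:
                if entry.extent_offset <= extent_offset < entry.extent_offset + entry.length:
                    return entry
            raise KeyError("offset beyond commit length")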

4.3.1 Replication Flow

As shown in Figure 3, when a stream is first created (step A), the


SM assigns three replicas for the first extent (one primary and two
secondary) to three extent nodes (step B), which are chosen by the
SM to randomly spread the replicas across different fault and
upgrade domains while considering extent node usage (for load
balancing). In addition, the SM decides which replica will be the
primary for the extent. Writes to an extent are always performed
from the client to the primary EN, and the primary EN is in charge
of coordinating the write to two secondary ENs. The primary EN
and the location of the three replicas never change for an extent
while it is being appended to (while the extent is unsealed).
Therefore, no leases are used to represent the primary EN for an
extent, since the primary is always fixed while an extent is
unsealed.

4.2 Append Operation and Sealed Extent

Streams can only be appended to; existing data cannot be modified. The append operations are atomic: either the entire data block is appended, or nothing is. Multiple blocks can be appended at once, as a single atomic multi-block append operation. The minimum read size from a stream is a single block. The multi-block append operation allows us to write a large amount of sequential data in a single append and to later perform small reads. The contract used between the client (partition layer) and the stream layer is that the multi-block append will occur atomically, and if the client never hears back for a request (due to failure) the client should retry the request (or seal the extent). This contract implies that the client needs to expect the same block to be appended more than once in the face of timeouts and correctly deal with processing duplicate records. The partition layer deals with duplicate records in two ways (see Section 5 for details on the partition layer streams). For the metadata and commit log streams, all of the transactions written have a sequence number and duplicate records will have the same sequence number. For the row data and blob data streams, for duplicate writes, only the last write will be pointed to by the RangePartition data structures, so the prior duplicate writes will have no references and will be garbage collected later.
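The following sketch illustrates the two duplicate-handling strategies just described; the function and variable names are hypothetical and the record formats are simplified.

    def apply_commit_log_records(records):
        """Commit-log style dedup: duplicate appends carry the same sequence number."""
        applied, last_seq = [], -1
        for seq, payload in records:
            if seq <= last_seq:
                continue                  # duplicate from a retried append; skip it
            applied.append(payload)
            last_seq = seq
        return applied

    # A retried multi-block append may land twice; only one copy is applied.
    log = [(1, "put k1"), (2, "put k2"), (2, "put k2"), (3, "del k1")]
    assert apply_commit_log_records(log) == ["put k1", "put k2", "del k1"]

    # Row/blob style dedup: the index simply points at the location of the last
    # successful write, so earlier duplicates become unreferenced and are GC'd later.
    row_index = {}
    for location, row_key in [("extentA+0", "row1"), ("extentA+4096", "row1")]:
        row_index[row_key] = location     # last write wins
    assert row_index["row1"] == "extentA+4096"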

When the SM allocates the extent, the extent information is sent back to the client, which then knows which ENs hold the three replicas and which one is the primary. This state is now part of the stream's metadata information held in the SM and cached on the client. When the last extent in the stream that is being appended to becomes sealed, the same process repeats. The SM then allocates another extent, which now becomes the last extent in the stream, and all new appends now go to the new last extent for the stream.
For an extent, every append is replicated three times across the extent's replicas. A client sends all write requests to the primary EN, but it can read from any replica, even for unsealed extents. The append is sent to the primary EN for the extent by the client, and the primary is then in charge of (a) determining the offset of the append in the extent, (b) ordering (choosing the offset of) all of the appends if there are concurrent append requests to the same extent outstanding, (c) sending the append with its chosen offset to the two secondary extent nodes, and (d) only returning success for the append to the client after a successful append has occurred to disk for all three extent nodes. The sequence of steps during an append is shown in Figure 3 (labeled with numbers). Only when all of the writes have succeeded for all three replicas will the primary EN respond to the client that the append was a success. If there are multiple outstanding appends to the same extent, the primary EN will respond success in the order of their offset (commit them in order) to the clients. As appends commit in order for a replica, the last append position is considered to be the current commit length of the replica. We ensure that the bits are the same between all replicas by the fact that the primary EN for an extent never changes, it always picks the offset for appends, appends for an extent are committed in order, and by how extents are sealed upon failures (discussed in Section 4.3.2).
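The sketch below models the primary/secondary append coordination described above with in-process stand-ins (ExtentNodeStub and PrimaryEN are invented names): the primary chooses the offset, forwards the append to both secondaries, and acknowledges only after all three copies have been written.

    class ExtentNodeStub:
        """Stand-in for an EN: appends must land at the offset chosen by the primary."""
        def __init__(self):
            self.data = bytearray()
        def append_at(self, offset: int, block: bytes) -> bool:
            if offset != len(self.data):      # replicas apply appends strictly in order
                return False
            self.data += block
            return True

    class PrimaryEN(ExtentNodeStub):
        def __init__(self, secondaries):
            super().__init__()
            self.secondaries = secondaries
        def append(self, block: bytes) -> int:
            offset = len(self.data)           # (a)+(b): primary picks and orders the offset
            oks = [s.append_at(offset, block) for s in self.secondaries]   # (c)
            if not (all(oks) and self.append_at(offset, block)):           # (d)
                raise IOError("append not durable on all three replicas")
            return offset                     # success returned only after all three wrote

    s1, s2 = ExtentNodeStub(), ExtentNodeStub()
    primary = PrimaryEN([s1, s2])
    assert primary.append(b"block-1") == 0
    assert primary.append(b"block-2") == len(b"block-1")
    assert s1.data == s2.data == primary.data   # replicas stay bitwise identical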

An extent has a target size, specified by the client (partition layer),


and when it fills up to that size the extent is sealed at a block
boundary, and then a new extent is added to the stream and
appends continue into that new extent. Once an extent is sealed it
can no longer be appended to. A sealed extent is immutable, and
the stream layer performs certain optimizations on sealed extents
like erasure coding cold extents. Extents in a stream do not have
to be the same size, and they can be sealed anytime and can even
grow arbitrarily large.

4.3 Stream Layer Intra-Stamp Replication

The stream layer and partition layer are co-designed to provide strong consistency at the object transaction level. The correctness of the partition layer providing strong consistency is built upon the following guarantees from the stream layer:
1. Once a record is appended and acknowledged back to the client, any later reads of that record from any replica will see the same data (the data is immutable).
2. Once an extent is sealed, any reads from any sealed replica will always see the same contents of the extent.
The data center, Fabric Controller, and WAS have security mechanisms in place to guard against malicious adversaries, so the stream replication does not deal with such threats. We consider faults ranging from disk and node errors to power failures, network issues, bit-flips, and random hardware failures, as well as software bugs. These faults can cause data corruption; checksums are used to detect such corruption. The rest of this section discusses the intra-stamp replication scheme within this context.

When a stream is opened, the metadata for its extents is cached at the client, so the client can go directly to the ENs for reading and writing without talking to the SM until the next extent needs to be allocated for the stream. If during writing, one of the replicas' ENs is not reachable or there is a disk failure for one of the replicas, a write failure is returned to the client. The client then contacts the SM, and the extent that was being appended to is sealed by the SM at its current commit length (see Section 4.3.2). At this point the sealed extent can no longer be appended to. The SM will then allocate a new extent with replicas on different (available) ENs, which makes it now the last extent of the stream. The information for this new extent is returned to the client. The client then continues appending to the stream with its new extent. This process of sealing by the SM and allocating the new extent is done on average within 20ms. A key point here is that the client can continue appending to a stream as soon as the new extent has been allocated, and it does not rely on a specific node to become available again.

For the newly sealed extent, the SM will create new replicas to
bring it back to the expected level of redundancy in the
background if needed.

4.3.2 Sealing

From a high level, the SM coordinates the sealing operation


among the ENs; it determines the commit length of the extent
used for sealing based on the commit length of the extent replicas.
Once the sealing is done, the commit length will never change
again.

4.4 Erasure Coding Sealed Extents

To reduce the cost of storage, WAS erasure codes sealed extents for Blob storage. WAS breaks an extent into N roughly equal sized fragments at block boundaries. Then, it adds M error correcting code fragments using Reed-Solomon for the erasure coding algorithm [19]. As long as it does not lose more than M fragments (across the data fragments + code fragments), WAS can recreate the full extent.
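A small arithmetic sketch of the N data + M code fragment scheme described above; the fragment counts used here (12 data + 4 code) are illustrative only, not WAS's actual parameters.

    def erasure_coding_summary(n_data: int, m_code: int, extent_bytes: int):
        """Reed-Solomon style (N data + M code) fragments for one sealed extent."""
        fragment_bytes = extent_bytes // n_data          # extent split into ~equal fragments
        stored = (n_data + m_code) * fragment_bytes      # total bytes kept after coding
        return {
            "overhead_vs_original": (n_data + m_code) / n_data,   # compare with 3x for full replication
            "max_lost_fragments": m_code,                # any M fragments (data or code) may be lost
            "bytes_stored": stored,
        }

    # Illustrative parameters only: 12 data + 4 code fragments for a 1 GB extent.
    info = erasure_coding_summary(12, 4, 1 << 30)
    print(info["overhead_vs_original"])   # ~1.33x of the original data, versus 3x for three replicas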

To seal an extent, the SM asks all three ENs their current length.
During sealing, either all replicas have the same length, which is
the simple case, or a given replica is longer or shorter than another
replica for the extent. This latter case can only occur during an
append failure where some but not all of the ENs for the replica
are available (i.e., some of the replicas get the append block, but
not all of them). We guarantee that the SM will seal the extent
even if the SM may not be able to reach all the ENs involved.
When sealing the extent, the SM will choose the smallest commit
length based on the available ENs it can talk to. This will not
cause data loss since the primary EN will not return success
unless all replicas have been written to disk for all three ENs. This
means the smallest commit length is sure to contain all the writes
that have been acknowledged to the client. In addition, it is also
fine if the final length contains blocks that were never
acknowledged back to the client, since the client (partition layer)
correctly deals with these as described in Section 4.2. During the
sealing, all of the extent replicas that were reachable by the SM
are sealed to the commit length chosen by the SM.
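A minimal sketch of the seal decision just described, assuming a hypothetical seal_extent helper: the SM queries the replicas it can reach and seals at the smallest reported commit length, which necessarily contains every acknowledged append.

    def seal_extent(replica_commit_lengths: dict, reachable: set) -> int:
        """Return the sealed commit length chosen by the SM.

        replica_commit_lengths: {en_id: commit_length} for the extent's three replicas.
        reachable: ENs the SM can currently talk to.
        """
        lengths = [replica_commit_lengths[en] for en in reachable]
        if not lengths:
            raise RuntimeError("SM cannot reach any replica; sealing must wait")
        # The smallest length among reachable replicas still contains every append that
        # was acknowledged to the client, because an ack required all three replicas
        # to have written the block to disk.
        return min(lengths)

    # Example: an append failure left EN "c" one block behind, and EN "b" is unreachable.
    lengths = {"a": 4096, "b": 4096, "c": 3072}
    sealed = seal_extent(lengths, reachable={"a", "c"})
    assert sealed == 3072   # the unacknowledged tail on "a" is dropped; acked data is all <= 3072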

Erasure coding sealed extents is an important optimization, given the amount of data we are storing. It reduces the cost of storing data from three full replicas within a stamp, which is three times the original data, to only 1.3x-1.5x the original data, depending on the number of fragments used. In addition, erasure coding actually increases the durability of the data when compared to keeping three replicas within a stamp.

4.5 Read Load-Balancing

When reads are issued for an extent that has three replicas, they
are submitted with a deadline value which specifies that the
read should not be attempted if it cannot be fulfilled within the
deadline. If the EN determines the read cannot be fulfilled within
the time constraint, it will immediately reply to the client that the
deadline cannot be met. This mechanism allows the client to
select a different EN to read that data from, likely allowing the
read to complete faster.
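The sketch below shows one plausible client-side use of the deadline mechanism (EnStub and estimated_service_time_ms are invented stand-ins): an EN that cannot meet the deadline declines immediately, and the client tries another replica.

    import random

    def read_with_deadline(extent_replicas, offset, length, deadline_ms):
        """Try replicas in turn; an EN that cannot meet the deadline refuses immediately."""
        for en in random.sample(extent_replicas, len(extent_replicas)):
            est = en.estimated_service_time_ms()      # hypothetical load estimate on the EN
            if est > deadline_ms:
                continue                              # EN replies "deadline cannot be met"; try another
            return en.read(offset, length)
        raise TimeoutError("no replica could serve the read within the deadline")

    class EnStub:
        def __init__(self, queue_ms, payload):
            self.queue_ms, self.payload = queue_ms, payload
        def estimated_service_time_ms(self):
            return self.queue_ms
        def read(self, offset, length):
            return self.payload[offset:offset + length]

    replicas = [EnStub(250, b"x" * 64), EnStub(15, b"x" * 64), EnStub(40, b"x" * 64)]
    assert read_with_deadline(replicas, 0, 8, deadline_ms=100) == b"x" * 8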

Once the sealing is done, the commit length of the extent will
never be changed. If an EN was not reachable by the SM during
the sealing process but later becomes reachable, the SM will force
the EN to synchronize the given extent to the chosen commit
length. This ensures that once an extent is sealed, all its available
replicas (the ones the SM can eventually reach) are bitwise
identical.

This method is also used with erasure coded data. When reads
cannot be serviced in a timely manner due to a heavily loaded
spindle to the data fragment, the read may be serviced faster by
doing a reconstruction rather than reading that data fragment. In
this case, reads (for the range of the fragment needed to satisfy the
client request) are issued to all fragments of an erasure coded
extent, and the first N responses are used to reconstruct the desired
fragment.

4.3.3 Interaction with Partition Layer

An interesting case is when, due to network partitioning, a client (partition server) is still able to talk to an EN that the SM could not talk to during the sealing process. This section explains how the partition layer handles this case.
The partition layer has two different read patterns:
1. Read records at known locations. The partition layer uses two types of data streams (row and blob). For these streams, it always reads at specific locations (extent+offset, length). More importantly, the partition layer will only read these two streams using the location information returned from a previous successful append at the stream layer. That will only occur if the append was successfully committed to all three replicas. The replication scheme guarantees such reads always see the same data.
2. Iterate all records sequentially in a stream on partition load. Each partition has two additional streams (metadata and commit log). These are the only streams that the partition layer will read sequentially from a starting point to the very last record of a stream. This operation only occurs when the partition is loaded (explained in Section 5). The partition layer ensures that no useful appends from the partition layer will happen to these two streams during partition load. Then the partition and stream layer together ensure that the same sequence of records is returned on partition load.
At the start of a partition load, the partition server sends a check for commit length to the primary EN of the last extent of these two streams. This checks whether all the replicas are available and that they all have the same length. If not, the extent is sealed and reads are only performed, during partition load, against a replica sealed by the SM. This ensures that the partition load will see all of its data and the exact same view, even if we were to repeatedly load the same partition reading from different sealed replicas for the last extent of the stream.

4.6 Spindle Anti-Starvation

Many hard disk drives are optimized to achieve the highest possible throughput, and sacrifice fairness to achieve that goal. They tend to prefer reads or writes that are sequential. Since our system contains many streams that can be very large, we observed in developing our service that some disks would lock into servicing large pipelined reads or writes while starving other operations. On some disks we observed this could lock out non-sequential IO for as long as 2300 milliseconds. To avoid this problem we avoid scheduling new IO to a spindle when there is over 100ms of expected pending IO already scheduled or when there is any pending IO request that has been scheduled but not serviced for over 200ms. Using our own custom IO scheduling allows us to achieve fairness across reads/writes at the cost of slightly increasing overall latency on some sequential requests.
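A sketch of the anti-starvation rule with the two thresholds quoted above; PendingIO and may_schedule_new_io are illustrative names, and how the expected service time is estimated is an assumption.

    from dataclasses import dataclass

    @dataclass
    class PendingIO:
        expected_ms: float     # estimated service time for this request
        waited_ms: float       # how long it has already been queued but not serviced

    def may_schedule_new_io(pending):
        """Per-spindle anti-starvation rule using the thresholds from the text."""
        expected_backlog = sum(io.expected_ms for io in pending)
        oldest_wait = max((io.waited_ms for io in pending), default=0.0)
        if expected_backlog > 100.0:      # >100 ms of expected pending IO already scheduled
            return False
        if oldest_wait > 200.0:           # some request has been starved for >200 ms
            return False
        return True

    assert may_schedule_new_io([PendingIO(30, 10), PendingIO(40, 5)]) is True
    assert may_schedule_new_io([PendingIO(80, 10), PendingIO(40, 5)]) is False   # 120 ms backlog
    assert may_schedule_new_io([PendingIO(10, 250)]) is False                    # starved request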

5.1 Partition Layer Data Model

The partition layer provides an important internal data structure called an Object Table (OT). An OT is a massive table which can grow to several petabytes. Object Tables are dynamically broken up into RangePartitions (based on traffic load to the table) and spread across Partition Servers (Section 5.2) in a stamp. A RangePartition is a contiguous range of rows in an OT from a given low-key to a high-key. All RangePartitions for a given OT are non-overlapping, and every row is represented in some RangePartition.
The following are the Object Tables used by the partition layer. The Account Table stores metadata and configuration for each storage account assigned to the stamp. The Blob Table stores all blob objects for all accounts in the stamp. The Entity Table stores all entity rows for all accounts in the stamp; it is used for the public Windows Azure Table data abstraction. The Message Table stores all messages for all accounts' queues in the stamp. The Schema Table keeps track of the schema for all OTs. The Partition Map Table keeps track of the current RangePartitions for all Object Tables and what partition server is serving each RangePartition. This table is used by the Front-End servers to route requests to the corresponding partition servers.
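To illustrate how the Partition Map Table can be used for routing, here is a simplified sketch (PartitionMap and its entry format are assumptions): non-overlapping RangePartitions described by a low key and a high key, with a binary search to find the serving partition server.

    import bisect

    class PartitionMap:
        """Simplified Partition Map Table: sorted, non-overlapping RangePartitions."""
        def __init__(self, entries):
            # entries: list of (low_key_inclusive, high_key_exclusive, partition_server)
            self.entries = sorted(entries)
            self.lows = [low for low, _, _ in self.entries]

        def route(self, key: str) -> str:
            i = bisect.bisect_right(self.lows, key) - 1
            if i < 0:
                raise KeyError(key)
            low, high, server = self.entries[i]
            if not (low <= key < high):
                raise KeyError(key)
            return server

    # Keys are (AccountName, PartitionName, ObjectName) flattened into one sort key.
    pmap = PartitionMap([("a", "f", "PS1"), ("f", "p", "PS2"), ("p", "~", "PS3")])
    assert pmap.route("contoso/photos/img1") == "PS1"
    assert pmap.route("fabrikam/logs/2011") == "PS2"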

4.7 Durability and Journaling

The durability contract for the stream layer is that when data is
acknowledged as written by the stream layer, there must be at
least three durable copies of the data stored in the system. This
contract allows the system to maintain data durability even in the
face of a cluster-wide power failure. We operate our storage
system in such a way that all writes are made durable to power
safe storage before they are acknowledged back to the client.
As part of maintaining the durability contract while still achieving
good performance, an important optimization for the stream layer
is that on each extent node we reserve a whole disk drive or SSD
as a journal drive for all writes into the extent node. The journal
drive [11] is dedicated solely for writing a single sequential
journal of data, which allows us to reach the full write throughput
potential of the device. When the partition layer does a stream
append, the data is written by the primary EN while in parallel
sent to the two secondaries to be written. When each EN
performs its append, it (a) writes all of the data for the append to
the journal drive and (b) queues up the append to go to the data
disk where the extent file lives on that EN. Once either succeeds,
success can be returned. If the journal succeeds first, the data is
also buffered in memory while it goes to the data disk, and any
reads for that data are served from memory until the data is on the
data disk. From that point on, the data is served from the data
disk. This also enables the combining of contiguous writes into
larger writes to the data disk, and better scheduling of concurrent
writes and reads to get the best throughput. It is a tradeoff for
good latency at the cost of an extra write off the critical path.
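A schematic sketch of the journal-drive write path described above, using toy in-memory stubs (DiskStub and ExtentNodeWritePath are invented names): the append is written to the sequential journal, buffered in memory for reads, and queued to the data disk.

    class DiskStub:
        """Toy stand-in for a journal drive or a data disk."""
        def __init__(self):
            self.blocks = {}
            self.journal_log = []
        def write_sequential(self, block):          # journal drive: one sequential log
            self.journal_log.append(block)
        def queue_write(self, offset, block):       # data disk: queued (async in a real EN)
            self.blocks[offset] = block
        def read(self, offset):
            return self.blocks[offset]

    class ExtentNodeWritePath:
        """Sketch of one EN's append path with a dedicated journal drive."""
        def __init__(self, journal, data_disk):
            self.journal, self.data_disk = journal, data_disk
            self.mem_buffer = {}                     # offset -> block, served to readers until the data disk catches up
        def append(self, offset, block):
            self.journal.write_sequential(block)     # (a) full-throughput sequential journal write
            self.mem_buffer[offset] = block
            self.data_disk.queue_write(offset, block)   # (b) queued write to the extent file
            return "ack"                             # ack once either write has succeeded
        def read(self, offset):
            return self.mem_buffer.get(offset) or self.data_disk.read(offset)
        def on_data_disk_flushed(self, offset):
            self.mem_buffer.pop(offset, None)        # from now on served from the data disk

    en = ExtentNodeWritePath(DiskStub(), DiskStub())
    en.append(0, b"blk")
    assert en.read(0) == b"blk"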

Each of the above OTs has a fixed schema stored in the Schema
Table. The primary key for the Blob Table, Entity Table, and
Message Table consists of three properties: AccountName,
PartitionName, and ObjectName. These properties provide the
indexing and sort order for those Object Tables.

5.1.1 Supported Data Types and Operations

The property types supported for an OT's schema are the standard simple types (bool, binary, string, DateTime, double, GUID, int32, int64). In addition, the system supports two special types: DictionaryType and BlobType. The DictionaryType allows for flexible properties (i.e., without a fixed schema) to be added to a row at any time. These flexible properties are stored inside of the dictionary type as (name, type, value) tuples. From a data access standpoint, these flexible properties behave like first-order properties of the row and are queryable just like any other property in the row. The BlobType is a special property used to store large amounts of data and is currently used only by the Blob Table. BlobType avoids storing the blob data bits with the row properties in the row data stream. Instead, the blob data bits are stored in a separate blob data stream and a pointer to the blob's data bits (a list of extent + offset, length pointers) is stored in the BlobType's property in the row. This keeps the large data bits separated from the OT's queryable row property values stored in the row data stream.

Even though the stream layer is an append-only system, we found that adding a journal drive provided important benefits, since the appends do not have to contend with reads going to the data disk in order to commit the result back to the client. The journal allows the append times from the partition layer to have more consistent and lower latencies. Take, for example, the partition layer's commit log stream, where an append is only as fast as the slowest EN for the replicas being appended to. For small appends to the commit log stream without journaling we saw an average end-to-end stream append latency of 30ms. With journaling we see an average append latency of 6ms. In addition, the variance of latencies decreased significantly.

OTs support standard operations including insert, update, and delete operations on rows as well as query/get operations. In addition, OTs allow batch transactions across rows with the same PartitionName value. The operations in a single batch are committed as a single transaction. Finally, OTs provide snapshot isolation to allow read operations to happen concurrently with writes.

5. Partition Layer

The partition layer stores the different types of objects and understands what a transaction means for a given object type (Blob, Table, or Queue). The partition layer provides the (a) data model for the different types of objects stored, (b) logic and semantics to process the different types of objects, (c) massively scalable namespace for the objects, (d) load balancing to access objects across the available partition servers, and (e) transaction ordering and strong consistency for access to objects.

5.2 Partition Layer Architecture

The partition layer has three main architectural components as shown in Figure 4: a Partition Manager (PM), Partition Servers (PS), and a Lock Service.


Figure 4: Partition Layer Architecture (the Front-End/client looks up and routes requests using the Partition Map Table; the Partition Manager (PM) assigns RangePartitions to Partition Servers (PS1, PS2, PS3), load balances them, and updates the Partition Map Table; the PM and each PS maintain leases with the Lock Service; partition servers persist and read partition state via the stream layer)

5.3.1 Persistent Data Structure

A RangePartition uses a Log-Structured Merge-Tree [17,4] to maintain its persistent data. Each Object Table's RangePartition consists of its own set of streams in the stream layer, and the streams belong solely to a given RangePartition, though the underlying extents can be pointed to by multiple streams in different RangePartitions due to RangePartition splitting. The following are the set of streams that comprise each RangePartition (shown in Figure 5):

Partition Manager (PM) Responsible for keeping track of and


splitting the massive Object Tables into RangePartitions and
assigning each RangePartition to a Partition Server to serve access
to the objects. The PM splits the Object Tables into N
RangePartitions in each stamp, keeping track of the current
RangePartition breakdown for each OT and to which partition
servers they are assigned. The PM stores this assignment in the
Partition Map Table. The PM ensures that each RangePartition is
assigned to exactly one active partition server at any time, and that
two RangePartitions do not overlap. It is also responsible for load
balancing RangePartitions among partition servers. Each stamp
has multiple instances of the PM running, and they all contend for
a leader lock that is stored in the Lock Service (see below). The
PM with the lease is the active PM controlling the partition layer.

Figure 5: RangePartition Data Structures (the in-memory components of a RangePartition: memory table, index cache, row page cache, bloom filters, load metrics, and adaptive range profiling; and its persistent data stored in the stream layer: the metadata stream, commit log stream, row data stream with checkpoints, and blob data stream with extent pointers)


Metadata Stream The metadata stream is the root stream for a RangePartition. The PM assigns a partition to a PS by providing the name of the RangePartition's metadata stream. The metadata stream contains enough information for the PS to load a RangePartition, including the name of the commit log stream and data streams for that RangePartition, as well as pointers (extent+offset) into those streams for where to start operating in those streams (e.g., where to start processing in the commit log stream and the root of the index for the row data stream). The PS serving the RangePartition also writes in the metadata stream the status of outstanding split and merge operations that the RangePartition may be involved in.

Partition Server (PS) A partition server is responsible for


serving requests to a set of RangePartitions assigned to it by the
PM. The PS stores all the persistent state of the partitions into
streams and maintains a memory cache of the partition state for
efficiency. The system guarantees that no two partition servers
can serve the same RangePartition at the same time by using
leases with the Lock Service. This allows the PS to provide
strong consistency and ordering of concurrent transactions to
objects for a RangePartition it is serving. A PS can concurrently
serve multiple RangePartitions from different OTs. In our
deployments, a PS serves on average ten RangePartitions at any
time.

Commit Log Stream Is a commit log used to store the recent


insert, update, and delete operations applied to the RangePartition
since the last checkpoint was generated for the RangePartition.

Lock Service A Paxos Lock Service [3,13] is used for leader


election for the PM. In addition, each PS also maintains a lease
with the lock service in order to serve partitions. We do not go
into the details of the PM leader election, or the PS lease
management, since the concepts used are similar to those
described in the Chubby Lock [3] paper.

Row Data Stream Stores the checkpoint row data and index for the RangePartition.
Blob Data Stream Is only used by the Blob Table to store the blob data bits.
Each of the above is a separate stream in the stream layer owned by an Object Table's RangePartition.

On partition server failure, all N RangePartitions served by the


failed PS are assigned to available PSs by the PM. The PM will
choose N (or fewer) partition servers, based on the load on those
servers. The PM will assign a RangePartition to a PS, and then
update the Partition Map Table specifying what partition server is
serving each RangePartition. This allows the Front-End layer to
find the location of RangePartitions by looking in the Partition
Map Table (see Figure 4). When the PS gets a new assignment it
will start serving the new RangePartitions for as long as the PS
holds its partition server lease.

Each RangePartition in an Object Table has only one data stream,


except the Blob Table. A RangePartition in the Blob Table has a
row data stream for storing its row checkpoint data (the blob
index), and a separate blob data stream for storing the blob data
bits for the special BlobType described earlier.

5.3 RangePartition Data Structures

A PS serves a RangePartition by maintaining a set of in-memory data structures and a set of persistent data structures in streams.

5.3.2 In-Memory Data Structures

A partition server maintains the following in-memory components as shown in Figure 5:

Memory Table This is the in-memory version of the commit log for a RangePartition, containing all of the recent updates that have not yet been checkpointed to the row data stream. When a lookup occurs, the memory table is checked to find recent updates to the RangePartition.


Split This operation identifies when a single RangePartition has


too much load and splits the RangePartition into two or more
smaller and disjoint RangePartitions, then load balances
(reassigns) them across two or more partition servers.

Index Cache This cache stores the checkpoint indexes of the
row data stream. We separate this cache out from the row data
cache to make sure we keep as much of the main index cached in
memory as possible for a given RangePartition.

Merge This operation merges together cold or lightly loaded


RangePartitions that together form a contiguous key range within
their OT. Merge is used to keep the number of RangePartitions
within a bound proportional to the number of partition servers in a
stamp.

Row Data Cache This is a memory cache of the checkpoint row


data pages. The row data cache is read-only. When a lookup
occurs, both the row data cache and the memory table are
checked, giving preference to the memory table.

WAS keeps the total number of partitions between a low


watermark and a high watermark (typically around ten times the
partition server count within a stamp). At equilibrium, the
partition count will stay around the low watermark. If there are
unanticipated traffic bursts that concentrate on a single
RangePartition, it will be split to spread the load. When the total
RangePartition count is approaching the high watermark, the
system will increase the merge rate to eventually bring the
RangePartition count down towards the low watermark.
Therefore, the number of RangePartitions for each OT changes
dynamically based upon the load on the objects in those tables.

Bloom Filters If the data is not found in the memory table or the row data cache, then the index/checkpoints in the data stream need to be searched. It can be expensive to blindly examine them all. Therefore a bloom filter is kept for each checkpoint, which indicates if the row being accessed may be in the checkpoint.
We do not go into further details about these components, since they are similar to those in [17,4].
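The sketch below puts the lookup order together (memory table, then row data cache, then only the checkpoints whose bloom filter says the row may be present); a plain set stands in for a real bloom filter, and all names are hypothetical.

    class CheckpointStub:
        """A checkpoint in the row data stream plus its (stand-in) bloom filter."""
        def __init__(self, rows):
            self.rows = dict(rows)
            self.bloom = set(self.rows)     # a real bloom filter may give false positives only

        def may_contain(self, key):
            return key in self.bloom

    def lookup(key, memory_table, row_cache, checkpoints):
        if key in memory_table:             # most recent, not yet checkpointed updates
            return memory_table[key]
        if key in row_cache:                # read-only cache of checkpointed row pages
            return row_cache[key]
        for cp in reversed(checkpoints):    # newest checkpoint first
            if cp.may_contain(key):         # skip checkpoints the bloom filter rules out
                if key in cp.rows:
                    return cp.rows[key]
        return None

    cps = [CheckpointStub({"r1": "v1"}), CheckpointStub({"r2": "v2"})]
    assert lookup("r2", memory_table={}, row_cache={}, checkpoints=cps) == "v2"
    assert lookup("r1", memory_table={"r1": "v9"}, row_cache={}, checkpoints=cps) == "v9"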

5.4 Data Flow

When the PS receives a write request to the RangePartition (e.g., insert, update, delete), it appends the operation into the commit log, and then puts the newly changed row into the memory table. Therefore, all the modifications to the partition are recorded persistently in the commit log, and also reflected in the memory table. At this point success can be returned back to the client (the FE servers) for the transaction. When the size of the memory table reaches its threshold size or the size of the commit log stream reaches its threshold, the partition server will write the contents of the memory table into a checkpoint stored persistently in the row data stream for the RangePartition. The corresponding portion of the commit log can then be removed. To control the total number of checkpoints for a RangePartition, the partition server will periodically combine the checkpoints into larger checkpoints, and then remove the old checkpoints via garbage collection.
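A compact sketch of this write path under assumed names and thresholds (RangePartitionWritePath, memtable_limit): append to the commit log, apply to the memory table, acknowledge, and checkpoint once a threshold is reached.

    class RangePartitionWritePath:
        def __init__(self, memtable_limit=4, commit_log_limit=8):
            self.commit_log = []            # stands in for the commit log stream
            self.memory_table = {}
            self.checkpoints = []           # stands in for checkpoints in the row data stream
            self.memtable_limit = memtable_limit
            self.commit_log_limit = commit_log_limit

        def write(self, row_key, row):
            self.commit_log.append((row_key, row))   # 1. persist the operation in the commit log
            self.memory_table[row_key] = row         # 2. reflect it in the memory table
            ack = "ok"                                # 3. success can now be returned to the FE
            if (len(self.memory_table) >= self.memtable_limit
                    or len(self.commit_log) >= self.commit_log_limit):
                self._checkpoint()
            return ack

        def _checkpoint(self):
            self.checkpoints.append(dict(self.memory_table))  # write the memory table as a checkpoint
            self.memory_table.clear()
            self.commit_log.clear()         # the corresponding commit-log portion can be removed

    rp = RangePartitionWritePath(memtable_limit=2)
    rp.write("k1", "v1")
    rp.write("k2", "v2")                    # triggers a checkpoint
    assert rp.checkpoints == [{"k1": "v1", "k2": "v2"}] and rp.memory_table == {}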

Having a high watermark of RangePartitions ten times the number of partition servers (a storage stamp has a few hundred partition servers) was chosen based on how big we can allow the stream and extent metadata to grow for the SM, and still completely fit the metadata in memory for the SM. Keeping many more RangePartitions than partition servers enables us to quickly distribute a failed PS's or rack's load across many other PSs. A given partition server can end up serving a single extremely hot RangePartition, tens of lightly loaded RangePartitions, or a mixture in-between, depending upon the current load to the RangePartitions in the stamp. The number of RangePartitions for the Blob Table vs. Entity Table vs. Message Table depends upon the load on the objects in those tables and is continuously changing within a storage stamp based upon traffic.

For the Blob Table's RangePartitions, we also store the Blob data bits directly into the commit log stream (to minimize the number of stream writes for Blob operations), but those data bits are not part of the row data, so they are not put into the memory table. Instead, the BlobType property for the row tracks the location of the Blob data bits (extent+offset, length). During checkpoint, the extents that would be removed from the commit log are instead concatenated to the RangePartition's Blob data stream. Extent concatenation is a fast operation provided by the stream layer since it consists of just adding pointers to extents at the end of the Blob data stream without copying any data.

For each stamp, we typically see 75 splits and merges and 200
RangePartition load balances per day.

5.5.1 Load Balance Operation Details

We track the load for each RangePartition as well as the overall load for each PS. For both of these we track (a) transactions/second, (b) average pending transaction count, (c) throttling rate, (d) CPU usage, (e) network usage, (f) request latency, and (g) data size of the RangePartition. The PM maintains heartbeats with each PS, and this information is passed back to the PM in responses to the heartbeats. If the PM sees a RangePartition that has too much load based upon these metrics, it will decide to split the partition and send a command to the PS to perform the split. If instead a PS has too much load, but no individual RangePartition seems to be too highly loaded, the PM will take one or more RangePartitions from the PS and reassign them to a more lightly loaded PS.
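One way the heartbeat-driven decision could look, as a sketch with purely illustrative metrics and thresholds (the real system combines several metrics and tunable triggers): split a hot RangePartition if one exists, otherwise reassign a partition away from an overloaded PS.

    def pm_heartbeat_decision(ps_load, partition_loads, ps_threshold=0.8, part_threshold=0.6):
        """Return an (action, target) decision for one partition server heartbeat.

        ps_load:          normalized overall load reported by the PS
        partition_loads:  {range_partition: normalized load} for partitions on that PS
        """
        hot = [(load, rp) for rp, load in partition_loads.items() if load > part_threshold]
        if hot:
            _, hottest = max(hot)
            return ("split", hottest)          # a single RangePartition carries too much load
        if ps_load > ps_threshold:
            coolest = min(partition_loads, key=partition_loads.get)
            return ("reassign", coolest)       # PS is hot overall; move a partition elsewhere
        return ("none", None)

    assert pm_heartbeat_decision(0.9, {"rp1": 0.7, "rp2": 0.1}) == ("split", "rp1")
    assert pm_heartbeat_decision(0.9, {"rp1": 0.4, "rp2": 0.3}) == ("reassign", "rp2")
    assert pm_heartbeat_decision(0.3, {"rp1": 0.2}) == ("none", None)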

A PS can start serving a RangePartition by loading the partition. Loading a partition involves reading the metadata stream of the RangePartition to locate the active set of checkpoints and replaying the transactions in the commit log to rebuild the in-memory state. Once these are done, the PS has an up-to-date view of the RangePartition and can start serving requests.

To load balance a RangePartition, the PM sends an offload


command to the PS, which will have the RangePartition write a
current checkpoint before offloading it. Once complete, the PS
acks back to the PM that the offload is done. The PM then
assigns the RangePartition to its new PS and updates the Partition
Map Table to point to the new PS. The new PS loads and starts
serving traffic for the RangePartition. The loading of the
RangePartition on the new PS is very quick since the commit log
is small due to the checkpoint prior to the offload.

5.5 RangePartition Load Balancing

A critical part of the partition layer is breaking these massive


Object Tables into RangePartitions and automatically load
balancing them across the partition servers to meet their varying
traffic demands.
The PM performs three operations to spread load across partition
servers and control the total number of partitions in a stamp:
Load Balance This operation identifies when a given PS has
too much traffic and reassigns one or more RangePartitions to less
loaded partition servers.


5.5.2 Split Operation

WAS splits a RangePartition due to too much load as well as the


size of its row or blob data streams. If the PM identifies either
situation, it tells the PS serving the RangePartition to split based
upon load or size. The PM makes the decision to split, but the PS
chooses the key (AccountName, PartitionName) where the
partition will be split.
To split based upon size, the
RangePartition maintains the total size of the objects in the
partition and the split key values where the partition can be
approximately halved in size, and the PS uses that to pick the key
for where to split. If the split is based on load, the PS chooses the
key based upon Adaptive Range Profiling [16]. The PS
adaptively tracks which key ranges in a RangePartition have the
most load and uses this to determine on what key to split the
RangePartition.

5.6 Partition Layer Inter-Stamp Replication

Thus far we have talked about an AccountName being associated (via DNS) to a single location and storage stamp, where all data access goes to that stamp. We call this the primary stamp for an account. An account actually has one or more secondary stamps assigned to it by the Location Service, and this primary/secondary stamp information tells WAS to perform inter-stamp replication for this account from the primary stamp to the secondary stamp(s).
One of the main scenarios for inter-stamp replication is to geo-replicate an account's data between two data centers for disaster recovery. In this scenario, a primary and secondary location is chosen for the account. Take, for example, an account for which we want the primary stamp (P) to be located in US South and the secondary stamp (S) to be located in US North. When provisioning the account, the LS will choose a stamp in each location and register the AccountName with both stamps such that the US South stamp (P) takes live traffic and the US North stamp (S) will take only inter-stamp replication (also called geo-replication) traffic from stamp P for the account. The LS updates DNS to have hostname AccountName.service.core.windows.net point to the storage stamp P's VIP in US South. When a write comes into stamp P for the account, the change is fully replicated within that stamp using intra-stamp replication at the stream layer, then success is returned to the client. After the update has been committed in stamp P, the partition layer in stamp P will asynchronously geo-replicate the change to the secondary stamp S using inter-stamp replication. When the change arrives at stamp S, the transaction is applied in the partition layer and this update fully replicates using intra-stamp replication within stamp S.

To split a RangePartition (B) into two new RangePartitions (C,D), the following steps are taken.
1. The PM instructs the PS to split B into C and D.
2. The PS in charge of B checkpoints B, then stops serving traffic briefly during step 3 below.
3. The PS uses a special stream operation MultiModify to take each of B's streams (metadata, commit log and data) and creates new sets of streams for C and D respectively with the same extents in the same order as in B. This step is very fast, since a stream is just a list of pointers to extents. The PS then appends the new partition key ranges for C and D to their metadata streams.
4. The PS starts serving requests to the two new partitions C and D for their respective disjoint PartitionName ranges.
5. The PS notifies the PM of the split completion, and the PM updates the Partition Map Table and its metadata information accordingly. The PM then moves one of the split partitions to a different PS.

Since the inter-stamp replication is done asynchronously, recent


updates that have not been inter-stamp replicated can be lost in the
event of disaster. In production, changes are geo-replicated and
committed on the secondary stamp within 30 seconds on average
after the update was committed on the primary stamp.

5.5.3 Merge Operation

To merge two RangePartitions, the PM will choose two


RangePartitions C and D with adjacent PartitionName ranges that
have low traffic. The following steps are taken to merge C and D
into a new RangePartition E.

Inter-stamp replication is used for both account geo-replication and migration across stamps. For disaster recovery, we may need to perform an abrupt failover where recent changes may be lost, but for migration we perform a clean failover so there is no data loss. In both failover scenarios, the Location Service makes an active secondary stamp for the account the new primary and switches DNS to point to the secondary stamp's VIP. Note that the URI used to access the object does not change after failover. This allows the existing URIs used to access Blobs, Tables and Queues to continue to work after failover.

1. The PM moves C and D so that they are served by the same PS. The PM then tells the PS to merge (C,D) into E.
2. The PS performs a checkpoint for both C and D, and then briefly pauses traffic to C and D during step 3.
3. The PS uses the MultiModify stream command to create a new commit log and data streams for E. Each of these streams is the concatenation of all of the extents from their respective streams in C and D. This merge means that the extents in the new commit log stream for E will be all of C's extents in the order they were in C's commit log stream followed by all of D's extents in their original order. This layout is the same for the new row and Blob data stream(s) for E.

6. Application Throughput

For our cloud offering, customers run their applications as a


tenant (service) on VMs. For our platform, we separate
computation and storage into their own stamps (clusters) within a
data center since this separation allows each to scale
independently and control their own load balancing. Here we
examine the performance of a customer application running from
their hosted service on VMs in the same data center as where their
account data is stored. Each VM used is an extra-large VM with
full control of the entire compute node and a 1Gbps NIC. The
results were gathered on live shared production stamps with
internal and external customers.

4. The PS constructs the metadata stream for E, which contains the names of the new commit log and data stream, the combined key range for E, and pointers (extent+offset) for the start and end of the commit log regions in E's commit log derived from C and D, as well as the root of the data index in E's data streams.
5. At this point, the new metadata stream for E can be correctly loaded, and the PS starts serving the newly merged RangePartition E.
6. The PM then updates the Partition Map Table and its metadata information to reflect the merge.


Figure 6 shows the WAS Table operation throughput in terms of the entities per second (y-axis) for 1-16 VMs (x-axis) performing random 1KB single entity get and put requests against a single 100GB Table. It also shows batch inserts of 100 entities at a time, a common way applications insert groups of entities into a WAS Table. Figure 7 shows the throughput in megabytes per second (y-axis) for randomly getting and putting 4MB blobs vs. the number of VMs used (x-axis). All of the results are for a single storage account.

The XBox GameSaves service was announced at E3 this year and will provide a new feature in Fall 2011 for providing saved game data into the cloud for millions of XBox users. This feature will enable subscribed users to upload their game progress into the WAS cloud storage service, which they can then access from any XBox console they sign into. The backing storage for this feature leverages Blob and Table storage.
The XBox Telemetry service stores console-generated diagnostics and telemetry information for later secure retrieval and offline processing. For example, various Kinect related features running on Xbox 360 generate detailed usage files which are uploaded to the cloud to analyze and improve the Kinect experience based on customer opt-in. The data is stored directly into Blobs, and Tables are used to maintain metadata information about the files. Queues are used to coordinate the processing and the cleaning up of the Blobs.
Microsoft's Zune backend uses Windows Azure for media file storage and delivery, where files are stored as Blobs.
Table 1 shows the relative breakdown among Blob, Table, and
Queue usage across all (All) services (internal and external) using
WAS as well as for the services described above. The table
shows the breakdown of requests, capacity usage, and ingress and
egress traffic for Blobs, Tables and Queues.
Notice that the percentage of requests for all services shows that about 17.9% of all requests are Blob requests, 46.88% of the requests are Table operations, and 35.22% are Queue requests for all services using WAS. But in terms of capacity, 70.31% of capacity is in Blobs, 29.68% of capacity is used by Tables, and 0.01% is used by Queues. %Ingress is the percentage breakdown of incoming traffic (bytes) among Blob, Table, and Queue; %Egress is the same for outbound traffic (bytes). The results show that different customers have very different usage patterns. In terms of capacity usage, some customers (e.g., Zune and Xbox
GameSaves) have mostly unstructured data (such as media files)
and put those into Blobs, whereas other customers like Bing and
XBox Telemetry that have to index a lot of data have a significant
amount of structured data in Tables. Queues use very little space
compared to Blobs and Tables, since they are primarily used as a
communication mechanism instead of storing data over a long
period of time.

Figure 6: Table Entity Throughput for 1-16 VMs

Figure 7: Blob Throughput for 1-16 VMs


These results show a linear increase in scale is achieved for
entities/second as the application scales out the amount of
computing resources it uses for accessing WAS Tables. For
Blobs, the throughput scales linearly up to eight VMs, but tapers
off as the aggregate throughput reaches the network capacity on
the client side where the test traffic was generated. The results
show that, for Table operations, batch puts offer about three times
more throughput compared to single entity puts. That is because
the batch operation significantly reduces the number of network
roundtrips and requires fewer stream writes. In addition, the
Table read operations have slightly lower throughput than write
operations. This difference is due to the particular access pattern
of our experiment, which randomly accesses a large key space on
a large data set, minimizing the effect of caching. Writes on the
other hand always result in sequential writes to the journal.

Table 1: Usage Comparison for (Blob/Table/Queue)

Service          Type    %Requests  %Capacity  %Ingress  %Egress
All              Blob    17.9       70.31      48.28     66.17
All              Table   46.88      29.68      49.61     33.07
All              Queue   35.22      0.01       2.11      0.76
Bing             Blob    0.46       60.45      16.73     29.11
Bing             Table   98.48      39.55      83.14     70.79
Bing             Queue   1.06       0          0.13      0.1
XBox GameSaves   Blob    99.68      99.99      99.84     99.88
XBox GameSaves   Table   0.32       0.01       0.16      0.12
XBox GameSaves   Queue   0          0          0         0
XBox Telemetry   Blob    26.78      19.57      50.25     11.26
XBox Telemetry   Table   44.98      80.43      49.25     88.29
XBox Telemetry   Queue   28.24      0          0.5       0.45
Zune             Blob    94.64      99.9       98.22     96.21
Zune             Table   5.36       0.1        1.78      3.79
Zune             Queue   0          0          0         0

8. Design Choices and Lessons Learned

Here, we discuss a few of our WAS design choices and relate


some of the lessons we have learned thus far.

Scaling Computation Separate from Storage Early on we decided to separate customer VM-based computation from storage for Windows Azure. Therefore, nodes running a customer's service code are separate from nodes providing their storage. As a result, we can scale our supply of computation cores and storage independently to meet customer demand in a given data center. This separation also provides a layer of isolation between compute and storage given its multi-tenancy usage, and allows both of the systems to load balance independently.

Given this decision, our goal from the start has been to allow
computation to efficiently access storage with high bandwidth
without the data being on the same node or even in the same rack.
To achieve this goal we are in the process of moving towards our
next generation data center networking architecture [10], which
flattens the data center networking topology and provides full
bisection bandwidth between compute and storage.

Automatic Load Balancing We found it crucial to have efficient automatic load balancing of partitions that can quickly adapt to various traffic conditions. This enables WAS to maintain high availability in this multi-tenancy environment as well as deal with traffic spikes to a single user's storage account. Gathering the adaptive profile information, discovering what metrics are most useful under various traffic conditions, and tuning the algorithm to be smart enough to effectively deal with different traffic patterns we see in production were some of the areas we spent a lot of time working on before achieving a system that works well for our multi-tenancy environment.

Range Partitions vs. Hashing We decided to use range-based partitioning/indexing instead of hash-based indexing (where the objects are assigned to a server based on the hash values of their keys) for the partition layer's Object Tables. One reason for this decision is that range-based partitioning makes performance isolation easier, since a given account's objects are stored together within a set of RangePartitions (which also provides efficient object enumeration). Hash-based schemes have the simplicity of distributing the load across servers, but lose the locality of objects for isolation and efficient enumeration. The range partitioning allows WAS to keep a customer's objects together in their own set of partitions to throttle and isolate potentially abusive accounts.

We started with a system that used a single number to quantify


load on each RangePartition and each server. We first tried the
product of request latency and request rate to represent the load on
a PS and each RangePartition. This product is easy to compute
and reflects the load incurred by the requests on the server and
partitions. This design worked well for the majority of the load
balancing needs (moving partitions around), but it did not
correctly capture high CPU utilization that can occur during scans
or high network utilization. Therefore, we now take into
consideration request, CPU, and network loads to guide load
balancing. However, these metrics are not sufficient to correctly
guide splitting decisions.

For these reasons, we took the range-based approach and built an automatic load balancing system (Section 5.5) to spread the load dynamically according to user traffic by splitting and moving partitions among servers.
A downside of range partitioning is scaling out access to sequential access patterns. For example, if a customer is writing all of their data to the very end of a table's key range (e.g., insert key 2011-06-30:12:00:00, then key 2011-06-30:12:00:02, then key 2011-06-30:12:00:10), all of the writes go to the very last RangePartition in the customer's table. This pattern does not take advantage of the partitioning and load balancing our system provides. In contrast, if the customer distributes their writes across a large number of PartitionNames, the system can quickly split the table into multiple RangePartitions and spread them across different servers to allow performance to scale linearly with load (as shown in Figure 6). To address this sequential access pattern for RangePartitions, a customer can always use hashing or bucketing for the PartitionName, which avoids the above sequential access pattern issue.

For splitting, we introduced separate mechanisms to trigger splits of partitions, where we collect hints to find out whether some partitions are reaching their capacity across several metrics. For example, we can trigger partition splits based on request throttling, request timeouts, the size of a partition, etc. Combining the split triggers with load balancing allows the system to quickly split and load balance hot partitions across different servers. At a high level, the algorithm works as follows. Every N seconds (currently 15 seconds) the PM sorts all RangePartitions based on each of the split triggers. The PM then goes through each partition, looking at the detailed statistics to figure out whether it needs to be split using the metrics described above (load, throttling, timeouts, CPU usage, size, etc.). During this process, the PM picks a small number of partitions to split for this quantum, and performs the split action on those.
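A minimal sketch of the split pass just described. The trigger names, the per-quantum cap, and the PM methods are illustrative assumptions rather than values from the WAS implementation.

SPLIT_TRIGGERS = ["throttling", "timeouts", "size", "load"]   # assumed metric names
MAX_SPLITS_PER_PASS = 4                                       # assumed per-quantum cap
QUANTUM_SECONDS = 15

def split_pass(pm, partitions):
    """Runs every QUANTUM_SECONDS: rank partitions on each trigger, split a few."""
    to_split, seen = [], set()
    for trigger in SPLIT_TRIGGERS:
        # Hottest first on this trigger; inspect detailed statistics for each candidate.
        for p in sorted(partitions, key=lambda part: part.stats[trigger], reverse=True):
            if p not in seen and pm.needs_split(p):   # combines load, throttling, timeouts, CPU, size
                to_split.append(p)
                seen.add(p)
    for p in to_split[:MAX_SPLITS_PER_PASS]:
        pm.split(p)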

Throttling/Isolation At times, servers become overloaded by customer requests. A difficult problem was identifying which storage accounts should be throttled when this happens, while making sure that well-behaving accounts are not affected.

After doing the split pass, the PM sorts all of the PSs based on each of the load balancing metrics: request load, CPU load, and network load. It then uses this information to identify which PSs are overloaded versus lightly loaded. The PM then chooses the PSs that are heavily loaded and, if there was a recent split from the prior split pass, the PM will offload one of those RangePartitions to a lightly loaded server. If there are still highly loaded PSs (without a recent split to offload), the PM offloads RangePartitions from them to the lightly loaded PSs.
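A sketch of the offload step just described, under the same caveat that the method and field names are assumptions rather than the actual WAS code.

def load_balance_pass(pm, servers):
    """After the split pass: move RangePartitions from heavily to lightly loaded PSs."""
    ranked = sorted(servers, key=lambda s: s.load)   # s.load combines request, CPU, network load
    quarter = max(1, len(ranked) // 4)
    lightly_loaded = ranked[:quarter]
    heavily_loaded = ranked[-quarter:]
    for src in heavily_loaded:
        if not lightly_loaded:
            break
        target = lightly_loaded.pop(0)
        # Prefer offloading a RangePartition created by the most recent split, if any.
        partition = src.recently_split_partition() or src.hottest_partition()
        pm.move_partition(partition, src, target)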

Each partition server keeps track of the request rate for AccountNames and PartitionNames. Because there are a large number of AccountNames and PartitionNames, it may not be practical to keep track of them all. The system uses a Sample-Hold algorithm [7] to track the request rate history of the top N busiest AccountNames and PartitionNames. This information is used to determine whether an account is well-behaving or not (e.g., whether the traffic backs off when it is throttled). If a server is getting overloaded, it uses this information to selectively throttle the incoming traffic, targeting the accounts that are causing the issue. For example, a PS computes a throttling probability for the incoming requests of each account based on the request rate history for that account (accounts with high request rates have a larger probability of being throttled, whereas accounts with little traffic do not).
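A minimal sketch of the per-account tracking and throttling decision described above. The tracker is only "in the spirit of" sample-and-hold [7], and the probability formula is an assumption for illustration, not the production logic.

import random
from collections import Counter

class AccountTracker:
    """Approximate top-N request-rate tracking (loosely modeled on sample-and-hold [7])."""
    def __init__(self, top_n=100, sample_prob=0.01):
        self.top_n, self.sample_prob = top_n, sample_prob
        self.counts = Counter()

    def record(self, account):
        # Always count accounts already held; sample new ones in with low probability.
        if account in self.counts or random.random() < self.sample_prob:
            self.counts[account] += 1
        if len(self.counts) > self.top_n:                 # evict the smallest entry
            self.counts.pop(min(self.counts, key=self.counts.get))

    def throttle_probability(self, account, server_overloaded):
        if not server_overloaded or account not in self.counts:
            return 0.0
        # Higher observed request rate -> higher chance of being throttled (assumed shape).
        total = sum(self.counts.values())
        return min(1.0, 2.0 * self.counts[account] / total)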

The core load balancing algorithm can be dynamically swapped out via configuration updates. WAS includes scripting language support that enables customizing the load balancing logic, such as defining how a partition split can be triggered based on different system metrics. This support gives us the flexibility to fine-tune the load balancing algorithm at runtime as well as to try new algorithms according to the traffic patterns observed.


Separate Log Files per RangePartition Performance isolation for storage accounts is critical in a multi-tenancy environment. This requirement is one of the reasons we used separate log streams for each RangePartition, whereas BigTable [4] uses a single log file across all partitions on the same server. Having separate log files enables us to isolate the load time of a RangePartition to just the recent object updates in that RangePartition.

Upgrades A rack in a storage stamp is a fault domain. A concept orthogonal to fault domains is what we call an upgrade domain (a set of servers briefly taken offline at the same time during a rolling upgrade). Servers for each of the three layers are spread evenly across different fault and upgrade domains for the storage service. This way, if a fault domain goes down, we lose at most 1/X of the servers for a given layer, where X is the number of fault domains. Similarly, during a service upgrade at most 1/Y of the servers for a given layer are upgraded at a given time, where Y is the number of upgrade domains. To achieve this, we use rolling upgrades, which enable us to maintain high availability when upgrading the storage service, and we upgrade a single upgrade domain at a time. For example, if we have ten upgrade domains, then upgrading a single domain potentially upgrades ten percent of the servers from each layer at a time.

During a service upgrade, storage nodes may go offline for a few minutes before coming back online. We need to maintain availability and ensure that enough replicas are available at any point in time. Even though the system is built to tolerate isolated failures, these planned (massive) upgrade outages can be dealt with more efficiently than if they were treated as abrupt massive failures. The upgrade process is automated so that it is tractable to manage a large number of these large-scale deployments. The automated upgrade process goes through each upgrade domain one at a time for a given storage stamp. Before taking down an upgrade domain, the upgrade process notifies the PM to move the partitions out of that upgrade domain and notifies the SM not to allocate new extents in that upgrade domain. Furthermore, before taking down any servers, the upgrade process checks with the SM to ensure that there are sufficient extent replicas available for each extent outside the given upgrade domain. After upgrading a given domain, a set of validation tests are run to make sure the system is healthy before proceeding to the next upgrade domain. This validation is crucial for catching issues during the upgrade process and stopping it early should an error occur.
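A sketch of the per-upgrade-domain loop described above; the PM/SM method names are assumptions standing in for the real interfaces.

def rolling_upgrade(stamp, pm, sm, upgrade_domains):
    """Upgrade one domain at a time, draining it first and validating afterwards."""
    for domain in upgrade_domains:
        pm.move_partitions_out_of(domain)          # drain RangePartitions from the domain
        sm.stop_allocating_extents_in(domain)      # no new extents placed there
        for extent in sm.extents_with_replicas_in(domain):
            # Ensure enough replicas remain outside the domain before taking servers down.
            if sm.replica_count_outside(extent, domain) < extent.required_replicas:
                sm.create_replica_outside(extent, domain)
        stamp.take_offline(domain)
        stamp.upgrade(domain)
        stamp.bring_online(domain)
        if not stamp.run_validation_tests():
            raise RuntimeError("upgrade halted: validation failed after domain %s" % domain.name)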

Journaling When we originally released WAS, it did not have journaling. As a result, we experienced many hiccups with reads and writes contending with each other on the same drive, noticeably affecting performance. We did not want to write to two log files (six replicas) like BigTable [4] due to the increased network traffic. We also wanted a way to optimize small writes, especially since we wanted separate log files per RangePartition. These requirements led us to the journal approach with a single log file per RangePartition. We found this optimization quite effective in reducing latency and providing consistent performance.


Append-only System Having an append-only system and sealing an extent upon failure have greatly simplified the replication protocol and the handling of failure scenarios. In this model, data is never overwritten once committed to a replica, and, upon failures, the extent is immediately sealed. This model allows consistency to be enforced across all the replicas via their commit lengths.
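A minimal sketch of how sealing at a commit length can reconcile replicas after a failure. This illustrates the idea above under a simplified Extent/Replica model; it is not the stream layer's actual sealing protocol.

class Replica:
    def __init__(self):
        self.blocks = []                 # append-only list of committed blocks
    def commit_length(self):
        return len(self.blocks)

class Extent:
    def __init__(self, replicas):
        self.replicas = replicas
        self.sealed_length = None
    def seal_on_failure(self):
        # Seal at a length every surviving replica can serve; replicas that are
        # ahead simply expose only the sealed prefix, so all replicas agree.
        self.sealed_length = min(r.commit_length() for r in self.replicas)
    def read(self, replica, index):
        limit = self.sealed_length if self.sealed_length is not None else replica.commit_length()
        if index >= limit:
            raise IndexError("beyond sealed/committed data")
        return replica.blocks[index]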
Furthermore, the append-only system has allowed us to keep
snapshots of the previous states at virtually no extra cost, which
has made it easy to provide snapshot/versioning features. It also
has allowed us to efficiently provide optimizations like erasure
coding. In addition, append-only has been a tremendous benefit
for diagnosing issues as well as repairing/recovering the system in
case something goes wrong. Since the history of changes is
preserved, tools can easily be built to diagnose issues and to repair
or recover the system from a corrupted state back to a prior known
consistent state. When operating a system at this scale, we cannot
emphasize enough the benefit we have seen from using an
append-only system for diagnostics and recovery.

An append-based system comes with certain costs. An efficient and scalable garbage collection (GC) system is crucial to keep the space overhead low, and GC comes at the cost of extra I/O. In addition, the data layout on disk may not be the same as the virtual address space of the data abstraction stored, which led us to implement prefetching logic for streaming large data sets back to the client.

Multiple Data Abstractions from a Single Stack Our system supports three different data abstractions from the same storage stack: Blobs, Tables, and Queues. This design enables all data abstractions to use the same intra-stamp and inter-stamp replication, use the same load balancing system, and realize the benefits from improvements in the stream and partition layers. In addition, because the performance needs of Blobs, Tables, and Queues are different, our single-stack approach enables us to reduce costs by running all services on the same set of hardware. Blobs use the massive disk capacity, Tables use the I/O spindles from the many disks on a node (but do not require as much capacity as Blobs), and Queues mainly run in memory. Therefore, we are not only blending different customers' workloads together on shared resources, we are also blending together Blob, Table, and Queue traffic across the same set of storage nodes.
End-to-end Checksums We found it crucial to keep checksums for user data end to end. For example, during a blob upload, once the Front-End server receives the user data, it immediately computes the checksum and sends it along with the data to the backend servers. Then at each layer, the partition server and the stream servers verify the checksum before continuing to process it. If a mismatch is detected, the request is failed. This prevents corrupted data from being committed into the system. We have seen cases where a few servers had hardware issues, and our end-to-end checksum caught such issues and helped maintain data integrity. Furthermore, this end-to-end checksum mechanism also helps identify servers that consistently have hardware issues so we can take them out of rotation and mark them for repair.
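A sketch of the end-to-end checksum flow above. It assumes a CRC-style checksum and hypothetical layer interfaces; the actual checksum algorithm and RPC shapes are not specified here.

import zlib

def checksum(data: bytes) -> int:
    return zlib.crc32(data)              # stand-in for whatever checksum the system uses

class ChecksumMismatch(Exception):
    pass

def front_end_upload(data, partition_server):
    # Compute once at the edge and carry the checksum with the data from then on.
    partition_server.write(data, checksum(data))

class PartitionServer:
    def __init__(self, stream_server):
        self.stream_server = stream_server
    def write(self, data, expected):
        if checksum(data) != expected:    # verify before doing any work
            raise ChecksumMismatch("corruption detected at partition layer")
        self.stream_server.append(data, expected)

class StreamServer:
    def append(self, data, expected):
        if checksum(data) != expected:    # verify again before committing
            raise ChecksumMismatch("corruption detected at stream layer")
        # ... append to the extent replicas ...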

Use of System-defined Object Tables We chose to use a fixed number of system-defined Object Tables to build the Blob, Table, and Queue abstractions instead of exposing the raw Object Table semantics to end users. This decision reduces what our system has to manage to only the small set of schemas of our internal, system-defined Object Tables. It also provides for easy maintenance and upgrade of the internal data structures and isolates changes of these system-defined tables from end user data abstractions.
Offering Storage in Buckets of 100TBs We currently limit the amount of storage for an account to be no more than 100TB. This constraint allows all of the storage account data to fit within a given storage stamp, especially since our initial storage stamps held only two petabytes of raw data (the new ones hold 20-30PB). To obtain more storage capacity within a single data center, customers use more than one account within that location. This ended up being a reasonable tradeoff for many of our large customers (storing petabytes of data), since they are typically already using multiple accounts to partition their storage across different regions and locations (for local access to data for their customers). Therefore, partitioning their data across accounts within a given location to add more storage often fits into their existing partitioning design. Even so, it does require large services to have account-level partitioning logic, which not all customers naturally have as part of their design. Therefore, we plan to increase the amount of storage that can be held within a given storage account in the future.

Pressure Point Testing It is not practical to create tests for all combinations of all complex behaviors that can occur in a large-scale distributed system. Therefore, we use what we call Pressure Points to aid in capturing these complex behaviors and interactions. The system provides a programmable interface for all of the main operations in our system as well as the points in the system at which to create faults. Some examples of these pressure point commands are: checkpoint a RangePartition, combine a set of RangePartition checkpoints, garbage collect a RangePartition, split/merge/load balance RangePartitions, erasure code or un-erasure code an extent, crash each type of server in a stamp, inject network latencies, inject disk latencies, etc.

The pressure point system is used to trigger all of these interactions during a stress run in specific orders or randomly. This system has been instrumental in finding and reproducing issues from complex interactions that might have taken years to occur naturally on their own.
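A minimal sketch of what a pressure-point interface of this kind can look like; the command names mirror the examples above, but the registry/dispatch structure is an assumption for illustration.

import random

class PressurePoints:
    """Registry of injectable operations/faults, driven by a stress harness."""
    def __init__(self):
        self.commands = {}
    def register(self, name, fn):
        self.commands[name] = fn
    def fire(self, name, **kwargs):
        return self.commands[name](**kwargs)
    def random_run(self, steps):
        for _ in range(steps):
            self.fire(random.choice(list(self.commands)))

pp = PressurePoints()
pp.register("checkpoint_range_partition", lambda **kw: None)   # stubs standing in for real hooks
pp.register("crash_partition_server", lambda **kw: None)
pp.register("inject_network_latency", lambda **kw: None)
pp.random_run(steps=1000)   # a stress run triggering interactions in random order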

CAP Theorem WAS provides high availability with strong consistency guarantees. This combination seems to violate the CAP theorem [2], which says a distributed system cannot have availability, consistency, and partition tolerance at the same time. However, our system, in practice, provides all three of these properties within a storage stamp. This is made possible through layering and designing our system around a specific fault model.

The stream layer has a simple append-only data model, which provides high availability in the face of network partitioning and other failures, whereas the partition layer, built upon the stream layer, provides strong consistency guarantees. This layering allows us to decouple the nodes responsible for providing strong consistency from the nodes storing the data with availability in the face of network partitioning. This decoupling, and targeting a specific set of faults, allows our system to provide high availability and strong consistency in the face of the various classes of failures we see in practice. For example, the types of network partitioning we have seen within a storage stamp are node failures and top-of-rack (TOR) switch failures. When a TOR switch fails, the given rack will stop being used for traffic: the stream layer will stop using that rack and start using extents on available racks to allow streams to continue writing. In addition, the partition layer will reassign its RangePartitions to partition servers on available racks to allow all of the data to continue to be served with high availability and strong consistency. Therefore, our system is designed to provide strong consistency with high availability for the network partitioning issues that are likely to occur in our system (at the node level as well as TOR failures).

High-performance Debug Logging We used an extensive debug logging infrastructure throughout the development of WAS. The system writes logs to the local disks of the storage nodes and provides a grep-like utility to do a distributed search across all storage node logs. We do not push these verbose logs off the storage nodes, given the volume of data being logged.

When bringing WAS to production, reducing logging for performance reasons was considered. The utility of verbose logging, though, made us wary of reducing the amount of logging in the system. Instead, the logging system was optimized to vastly increase its performance and reduce its disk space overhead by automatically tokenizing and compressing output, achieving a system that can log 100s of MB/s with little application performance impact per node. This allows retention of many days of verbose debug logs across a cluster. The high-performance logging system and associated log search tools are critical for investigating any problems in production in detail without the need to deploy special code or reproduce problems.

9. Related Work

Prior studies [9] revealed the challenges in achieving strong consistency and high availability in a poorly-connected network environment. Some systems address this by reducing consistency guarantees to achieve high availability [22,14,6], but this shifts the burden of dealing with conflicting views of data onto the applications. For instance, Amazon's SimpleDB was originally introduced with an eventual consistency model and more recently added strongly consistent operations [23]. Van Renesse et al. [20] have shown, via Chain Replication, the feasibility of building large-scale storage systems providing both strong consistency and high availability, which was later extended to allow reading from any replica [21]. Given our customers' need for strong consistency, we set out to provide a system that can provide strong consistency with high availability along with partition tolerance for our fault model.


As in many other highly-available distributed storage systems [6,14,1,5], WAS also provides geo-redundancy. Some of these systems put geo-replication on the critical path of the live application requests, whereas we made a design trade-off to take a classical asynchronous geo-replication approach [18] and leave it off the critical path. Performing geo-replication completely asynchronously allows us to provide better write latency for applications, and allows more optimizations, such as batching and compaction for geo-replication, and efficient use of cross-data-center bandwidth. The tradeoff is that if there is a disaster and an abrupt failover needs to occur, then there is unavailability during the failover and a potential loss of recent updates to a customer's account.


The closest system to ours is GFS [8,15] combined with BigTable [4]. A few differences from these prior publications are: (1) GFS allows relaxed consistency across replicas and does not guarantee that all replicas are bitwise the same, whereas WAS provides that guarantee; (2) BigTable combines multiple tablets into a single commit log and writes them to two GFS files in parallel to avoid GFS hiccups, whereas we found we could work around both of these issues by using journaling in our stream layer; and (3) we provide a scalable Blob storage system and batch Table transactions integrated into a BigTable-like framework. In addition, we describe how WAS automatically load balances, splits, and merges RangePartitions according to application traffic demands.


10. Conclusions

The Windows Azure Storage platform implements essential services for developers of cloud-based solutions. The combination of strong consistency, a global partitioned namespace, and disaster recovery has been an important set of customer features in WAS's multi-tenancy environment. WAS runs a disparate set of workloads with various peak usage profiles from many customers on the same set of hardware. This significantly reduces storage cost, since the amount of resources to be provisioned is significantly less than the sum of the peak resources required to run all of these workloads on dedicated hardware.

As our examples demonstrate, the three storage abstractions, Blobs, Tables, and Queues, provide mechanisms for storage and workflow control for a wide range of applications. Not mentioned, however, is the ease with which the WAS system can be used. For example, the initial version of the Facebook/Twitter search ingestion engine took one engineer only two months from the start of development to launching the service. This experience illustrates our service's ability to empower customers to easily develop and deploy their applications to the cloud.

Additional information on Windows Azure and Windows Azure Storage is available at http://www.microsoft.com/windowsazure/.

Acknowledgements

We would like to thank Geoff Voelker, Greg Ganger, and the anonymous reviewers for providing valuable feedback on this paper.

We would like to acknowledge the creators of Cosmos (Bing's storage system): Darren Shakib, Andrew Kadatch, Sam McKelvie, Jim Walsh and Jonathan Forbes. We started Windows Azure 5 years ago with Cosmos as our intra-stamp replication system. The data abstractions and append-only extent-based replication system presented in Section 4 were created by them. We extended Cosmos to create our stream layer by adding mechanisms to allow us to provide strong consistency in coordination with the partition layer, stream operations to allow us to efficiently split/merge partitions, journaling, erasure coding, spindle anti-starvation, read load-balancing, and other improvements.

We would also like to thank additional contributors to Windows Azure Storage: Maneesh Sah, Matt Hendel, Kavitha Golconda, Jean Ghanem, Joe Giardino, Shuitao Fan, Justin Yu, Dinesh Haridas, Jay Sreedharan, Monilee Atkinson, Harshawardhan Gadgil, Phaneesh Kuppahalli, Nima Hakami, Maxim Mazeev, Andrei Marinescu, Garret Buban, Ioan Oltean, Ritesh Kumar, Richard Liu, Rohit Galwankar, Brihadeeshwar Venkataraman, Jayush Luniya, Serdar Ozler, Karl Hsueh, Ming Fan, David Goebel, Joy Ganguly, Ishai Ben Aroya, Chun Yuan, Philip Taron, Pradeep Gunda, Ryan Zhang, Shyam Antony, Qi Zhang, Madhav Pandya, Li Tan, Manish Chablani, Amar Gadkari, Haiyong Wang, Hakon Verespej, Ramesh Shankar, Surinder Singh, Ryan Wu, Amruta Machetti, Abhishek Singh Baghel, Vineet Sarda, Alex Nagy, Orit Mazor, and Kayla Bunch.

Finally, we would like to thank Amitabh Srivastava, G.S. Rana, Bill Laing, Satya Nadella, Ray Ozzie, and the rest of the Windows Azure team for their support.

References

[1] J. Baker et al., "Megastore: Providing Scalable, Highly Available Storage for Interactive Services," in Conf. on Innovative Data Systems Research, 2011.
[2] Eric A. Brewer, "Towards Robust Distributed Systems (Invited Talk)," in Principles of Distributed Computing, Portland, Oregon, 2000.
[3] M. Burrows, "The Chubby Lock Service for Loosely-Coupled Distributed Systems," in OSDI, 2006.
[4] F. Chang et al., "Bigtable: A Distributed Storage System for Structured Data," in OSDI, 2006.
[5] B. Cooper et al., "PNUTS: Yahoo!'s Hosted Data Serving Platform," VLDB, vol. 1, no. 2, 2008.
[6] G. DeCandia et al., "Dynamo: Amazon's Highly Available Key-value Store," in SOSP, 2007.
[7] Cristian Estan and George Varghese, "New Directions in Traffic Measurement and Accounting," in SIGCOMM, 2002.
[8] S. Ghemawat, H. Gobioff, and S.T. Leung, "The Google File System," in SOSP, 2003.
[9] J. Gray, P. Helland, P. O'Neil, and D. Shasha, "The Dangers of Replication and a Solution," in SIGMOD, 1996.
[10] Albert Greenberg et al., "VL2: A Scalable and Flexible Data Center Network," Communications of the ACM, vol. 54, no. 3, pp. 95-104, 2011.
[11] Y. Hu and Q. Yang, "DCD - Disk Caching Disk: A New Approach for Boosting I/O Performance," in ISCA, 1996.
[12] H.T. Kung and John T. Robinson, "On Optimistic Methods for Concurrency Control," ACM Transactions on Database Systems, vol. 6, no. 2, pp. 213-226, June 1981.
[13] Leslie Lamport, "The Part-Time Parliament," ACM Transactions on Computer Systems, vol. 16, no. 2, pp. 133-169, May 1998.
[14] A. Malik and P. Lakshman, "Cassandra: A Decentralized Structured Storage System," SIGOPS Operating Systems Review, vol. 44, no. 2, 2010.
[15] M. McKusick and S. Quinlan, "GFS: Evolution on Fast-forward," ACM Queue, vol. 7, no. 7, 2009.
[16] S. Mysore, B. Agrawal, T. Sherwood, N. Shrivastava, and S. Suri, "Profiling over Adaptive Ranges," in Symposium on Code Generation and Optimization, 2006.
[17] P. O'Neil, E. Cheng, D. Gawlick, and E. O'Neil, "The Log-Structured Merge-Tree (LSM-tree)," Acta Informatica, vol. 33, no. 4, 1996.
[18] H. Patterson et al., "SnapMirror: File System Based Asynchronous Mirroring for Disaster Recovery," in USENIX FAST, 2002.
[19] Irving S. Reed and Gustave Solomon, "Polynomial Codes over Certain Finite Fields," Journal of the Society for Industrial and Applied Mathematics, vol. 8, no. 2, pp. 300-304, 1960.
[20] R. van Renesse and F. Schneider, "Chain Replication for Supporting High Throughput and Availability," in USENIX OSDI, 2004.
[21] J. Terrace and M. Freedman, "Object Storage on CRAQ: High-throughput Chain Replication for Read-mostly Workloads," in USENIX '09, 2009.
[22] D. Terry, K. Petersen, M. Theimer, A. Demers, M. Spreitzer, and C. Hauser, "Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System," in ACM SOSP, 1995.
[23] W. Vogels, "All Things Distributed - Choosing Consistency," http://www.allthingsdistributed.com/2010/02/strong_consistency_simpledb.html, 2010.

Sparrow: Distributed, Low Latency Scheduling


Kay Ousterhout, Patrick Wendell, Matei Zaharia, Ion Stoica
University of California, Berkeley

Abstract
Large-scale data analytics frameworks are shifting towards shorter task durations and larger degrees of parallelism to provide low latency. Scheduling highly parallel
jobs that complete in hundreds of milliseconds poses a
major challenge for task schedulers, which will need to
schedule millions of tasks per second on appropriate machines while offering millisecond-level latency and high
availability. We demonstrate that a decentralized, randomized sampling approach provides near-optimal performance while avoiding the throughput and availability
limitations of a centralized design. We implement and
deploy our scheduler, Sparrow, on a 110-machine cluster and demonstrate that Sparrow performs within 12%
of an ideal scheduler.

1 Introduction
Today's data analytics clusters are running ever shorter
and higher-fanout jobs. Spurred by demand for lower-latency interactive data processing, efforts in research and industry alike have produced frameworks
(e.g., Dremel [12], Spark [26], Impala [11]) that stripe
work across thousands of machines or store data in
memory in order to analyze large volumes of data in
seconds, as shown in Figure 1. We expect this trend to
continue with a new generation of frameworks targeting sub-second response times. Bringing response times
into the 100ms range will enable powerful new applications; for example, user-facing services will be able
to run sophisticated parallel computations, such as language translation and highly personalized search, on a
per-query basis.
Permission to make digital or hard copies of part or all of this work for
personal or classroom use is granted without fee provided that copies
are not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. Copyrights
for third-party components of this work must be honored. For all other
uses, contact the Owner/Author.
Copyright is held by the Owner/Author(s).
SOSP '13, Nov. 3-6, 2013, Farmington, Pennsylvania, USA.
ACM 978-1-4503-2388-8/13/11.
http://dx.doi.org/10.1145/2517349.2522716

Figure 1: Data analytics frameworks can analyze large volumes of data with ever lower latency. (The timeline shows, for example, 2004 MapReduce batch jobs, 2009 Hive queries, 2010 Dremel queries and in-memory Spark queries, and 2012 Impala queries, on a latency scale from 10 minutes down to 1 ms.)

Jobs composed of short, sub-second tasks present a


difficult scheduling challenge. These jobs arise not only
due to frameworks targeting low latency, but also as a
result of breaking long-running batch jobs into a large
number of short tasks, a technique that improves fairness and mitigates stragglers [17]. When tasks run in
hundreds of milliseconds, scheduling decisions must be
made at very high throughput: a cluster containing ten
thousand 16-core machines and running 100ms tasks
may require over 1 million scheduling decisions per
second. Scheduling must also be performed with low
latency: for 100ms tasks, scheduling delays (including queueing delays) above tens of milliseconds represent intolerable overhead. Finally, as processing frameworks approach interactive time-scales and are used in
customer-facing systems, high system availability becomes a requirement. These design requirements differ
from those of traditional batch workloads.
Modifying today's centralized schedulers to support
sub-second parallel tasks presents a difficult engineering challenge. Supporting sub-second tasks requires
handling two orders of magnitude higher throughput
than the fastest existing schedulers (e.g., Mesos [8],
YARN [16], SLURM [10]); meeting this design requirement would be difficult with a design that schedules and
launches all tasks through a single node. Additionally,
achieving high availability would require the replication
or recovery of large amounts of state in sub-second time.
This paper explores the opposite extreme in the design
space: we propose scheduling from a set of machines
that operate autonomously and without centralized or
logically centralized state. A decentralized design offers

attractive scaling and availability properties. The system


can support more requests by adding additional schedulers, and if a scheduler fails, users can direct requests to
an alternate scheduler. The key challenge with a decentralized design is providing response times comparable
to those provided by a centralized scheduler, given that
concurrently operating schedulers may make conflicting
scheduling decisions.
We present Sparrow, a stateless distributed scheduler
that adapts the power of two choices load balancing technique [14] to the domain of parallel task scheduling.
The power of two choices technique proposes scheduling each task by probing two random servers and placing
the task on the server with fewer queued tasks. We introduce three techniques to make the power of two choices
effective in a cluster running parallel jobs:
Batch Sampling: The power of two choices performs
poorly for parallel jobs because job response time is sensitive to tail task wait time (because a job cannot complete until its last task finishes) and tail wait times remain
high with the power of two choices. Batch sampling
solves this problem by applying the recently developed
multiple choices approach [18] to the domain of parallel
job scheduling. Rather than sampling for each task individually, batch sampling places the m tasks in a job on
the least loaded of d·m randomly selected worker machines (for d > 1). We demonstrate analytically that, unlike the power of two choices, batch sampling's performance does not degrade as a job's parallelism increases.
Late Binding: The power of two choices suffers from
two remaining performance problems: first, server queue
length is a poor indicator of wait time, and second, due
to messaging delays, multiple schedulers sampling in
parallel may experience race conditions. Late binding
avoids these problems by delaying assignment of tasks
to worker machines until workers are ready to run the
task, and reduces median job response time by as much
as 45% compared to batch sampling alone.
Policies and Constraints: Sparrow uses multiple
queues on worker machines to enforce global policies,
and supports the per-job and per-task placement constraints needed by analytics frameworks. Neither policy enforcement nor constraint handling are addressed
in simpler theoretical models, but both play an important role in real clusters [21].
We have deployed Sparrow on a 110-machine cluster to evaluate its performance. When scheduling TPC-H queries, Sparrow provides response times within 12%
of an ideal scheduler, schedules with median queueing
delay of less than 9ms, and recovers from scheduler failures in less than 120ms. Sparrow provides low response
times for jobs with short tasks, even in the presence
of tasks that take up to 3 orders of magnitude longer.
In spite of its decentralized design, Sparrow maintains

aggregate fair shares, and isolates users with different


priorities such that a misbehaving low priority user increases response times for high priority jobs by at most
40%. Simulation results suggest that Sparrow will continue to perform well as cluster size increases to tens
of thousands of cores. Our results demonstrate that distributed scheduling using Sparrow presents a viable alternative to centralized scheduling for low latency, parallel workloads.

2 Design Goals
This paper focuses on fine-grained task scheduling for
low-latency applications.
Low-latency workloads have more demanding
scheduling requirements than batch workloads do,
because batch workloads acquire resources for long periods of time and thus require infrequent task scheduling.
To support a workload composed of sub-second tasks,
a scheduler must provide millisecond-scale scheduling
delay and support millions of task scheduling decisions
per second. Additionally, because low-latency frameworks may be used to power user-facing services, a
scheduler for low-latency workloads should be able to
tolerate scheduler failure.
Sparrow provides fine-grained task scheduling, which
is complementary to the functionality provided by cluster resource managers. Sparrow does not launch new
processes for each task; instead, Sparrow assumes that
a long-running executor process is already running on
each worker machine for each framework, so that Sparrow need only send a short task description (rather than
a large binary) when a task is launched. These executor processes may be launched within a static portion
of a cluster, or via a cluster resource manager (e.g.,
YARN [16], Mesos [8], Omega [20]) that allocates resources to Sparrow along with other frameworks (e.g.,
traditional batch workloads).
Sparrow also makes approximations when scheduling
and trades off many of the complex features supported
by sophisticated, centralized schedulers in order to provide higher scheduling throughput and lower latency. In
particular, Sparrow does not allow certain types of placement constraints (e.g., my job should not be run on machines where User Xs jobs are running), does not perform bin packing, and does not support gang scheduling.
Sparrow supports a small set of features in a way that
can be easily scaled, minimizes latency, and keeps the
design of the system simple. Many applications run low-latency queries from multiple users, so Sparrow enforces
strict priorities or weighted fair shares when aggregate
demand exceeds capacity. Sparrow also supports basic constraints over job placement, such as per-task constraints (e.g., each task needs to be co-resident with its input data) and per-job constraints (e.g., all tasks must be placed on machines with GPUs). This feature set is similar to that of the Hadoop MapReduce scheduler [23] and
placed on machines with GPUs). This feature set is similar to that of the Hadoop MapReduce scheduler [23] and
the Spark [26] scheduler.

3 Sample-Based Scheduling for Parallel Jobs

A traditional task scheduler maintains a complete view of which tasks are running on which worker machines, and uses this view to assign incoming tasks to available workers. Sparrow takes a radically different approach: many schedulers operate in parallel, and schedulers do not maintain any state about cluster load. To schedule a job's tasks, schedulers rely on instantaneous load information acquired from worker machines. Sparrow's approach extends existing load balancing techniques [14, 18] to the domain of parallel job scheduling and introduces late binding to improve performance.

3.1 Terminology and job model

We consider a cluster composed of worker machines that execute tasks and schedulers that assign tasks to worker machines. A job consists of m tasks that are each allocated to a worker machine. Jobs can be handled by any scheduler. Workers run tasks in a fixed number of slots; we avoid more sophisticated bin packing because it adds complexity to the design. If a worker machine is assigned more tasks than it can run concurrently, it queues new tasks until existing tasks release enough resources for the new task to be run. We use wait time to describe the time from when a task is submitted to the scheduler until when the task begins executing and service time to describe the time the task spends executing on a worker machine. Job response time describes the time from when the job is submitted to the scheduler until the last task finishes executing. We use delay to describe the total delay within a job due to both scheduling and queueing. We compute delay by taking the difference between the job response time using a given scheduling technique and the job response time if all of the job's tasks had been scheduled with zero wait time (equivalent to the longest service time across all tasks in the job).

In evaluating different scheduling approaches, we assume that each job runs as a single wave of tasks. In real clusters, jobs may run as multiple waves of tasks when, for example, m is greater than the number of slots assigned to the user; for multiwave jobs, the scheduler can place some early tasks on machines with longer queueing delay without affecting job response time.

We assume a single-wave job model when we evaluate scheduling techniques because single-wave jobs are most negatively affected by the approximations involved in our distributed scheduling approach: even a single delayed task affects the job's response time. However, Sparrow also handles multiwave jobs.

1 More precisely, expected task wait time using random placement is 1/(1 − ρ), where ρ represents load. Using the least loaded of d choices, wait time in an initially empty system over the first T units of time is upper bounded by Σ_{i=1}^{∞} ρ^((d^i − d)/(d − 1)) + o(1) [14].
2 The omniscient scheduler uses a greedy scheduling algorithm based on complete information about which worker machines are busy. For each incoming job, the scheduler places the job's tasks on idle workers, if any exist, and otherwise uses FIFO queueing.

3.2 Per-task sampling

Sparrow's design takes inspiration from the power of two choices load balancing technique [14], which provides low expected task wait times using a stateless, randomized approach. The power of two choices technique proposes a simple improvement over purely random assignment of tasks to worker machines: place each task on the least loaded of two randomly selected worker machines. Assigning tasks in this manner improves expected wait time exponentially compared to using random placement [14].1

We first consider a direct application of the power of two choices technique to parallel job scheduling. The scheduler randomly selects two worker machines for each task and sends a probe to each, where a probe is a lightweight RPC. The worker machines each reply to the probe with the number of currently queued tasks, and the scheduler places the task on the worker machine with the shortest queue. The scheduler repeats this process for each task in the job, as illustrated in Figure 2(a). We refer to this application of the power of two choices technique as per-task sampling.

Per-task sampling improves performance compared to random placement but still performs 2× or more worse than an omniscient scheduler.2 Intuitively, the problem with per-task sampling is that a job's response time is dictated by the longest wait time of any of the job's tasks, making average job response time much higher (and also much more sensitive to tail performance) than average task response time. We simulated per-task sampling and random placement in a cluster composed of 10,000 4-core machines with a 1ms network round trip time. Jobs arrive as a Poisson process and are each composed of 100 tasks. The duration of a job's tasks is chosen from the exponential distribution such that across jobs, task durations are exponentially distributed with mean 100ms, but within a particular job, all tasks are the same duration.3 As shown in Figure 3, response time increases with increasing load, because schedulers have less success finding free machines on which to place tasks. At 80% load, per-task sampling improves performance by over 3× compared to random placement, but still results in response times over 2.6× those offered by an omniscient scheduler.

3 We use this distribution because it puts the most stress on our approximate, distributed scheduling technique. When tasks within a job are of different duration, the shorter tasks can have longer wait times without affecting job response time.

Figure 2: Placing a parallel, two-task job. Batch sampling outperforms per-task sampling because tasks are placed in the least loaded of the entire batch of sampled queues. (a) Per-task sampling selects queues of length 1 and 3. (b) Batch sampling selects queues of length 1 and 2.
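A minimal simulation-style sketch of per-task sampling as described above (probe two random workers per task, pick the shorter queue). The Worker class and queue model are assumptions for illustration, not Sparrow's implementation.

import random

class Worker:
    def __init__(self):
        self.queue = []                  # pending tasks
    def queue_length(self):
        return len(self.queue)

def per_task_sampling(job_tasks, workers, d=2):
    """Place each task on the least-loaded of d randomly probed workers."""
    placements = []
    for task in job_tasks:
        probed = random.sample(workers, d)               # the "probe" RPCs
        target = min(probed, key=Worker.queue_length)    # shortest reported queue wins
        target.queue.append(task)
        placements.append(target)
    return placements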

Figure 3: Comparison of scheduling techniques (random placement, per-task sampling, batch sampling, batch sampling with late binding, and an omniscient scheduler) in a simulated cluster of 10,000 4-core machines running 100-task jobs; response time in milliseconds is plotted against load.

3.3 Batch sampling

Batch sampling improves on per-task sampling by sharing information across all of the probes for a particular job. Batch sampling is similar to a technique recently proposed in the context of storage systems [18]. With per-task sampling, one pair of probes may have gotten unlucky and sampled two heavily loaded machines (e.g., Task 1 in Figure 2(a)), while another pair may have gotten lucky and sampled two lightly loaded machines (e.g., Task 2 in Figure 2(a)); one of the two lightly loaded machines will go unused. Batch sampling aggregates load information from the probes sent for all of a job's tasks, and places the job's m tasks on the least loaded of all the worker machines probed. In the example shown in Figure 2, per-task sampling places tasks in queues of length 1 and 3; batch sampling reduces the maximum queue length to 2 by using both workers that were probed by Task 2 with per-task sampling.

To schedule using batch sampling, a scheduler randomly selects d·m worker machines (for d ≥ 1). The scheduler sends a probe to each of the d·m workers; as with per-task sampling, each worker replies with the number of queued tasks. The scheduler places one of the job's m tasks on each of the m least loaded workers. Unless otherwise specified, we use d = 2; we justify this choice of d in §7.9.

As shown in Figure 3, batch sampling improves performance compared to per-task sampling. At 80% load, batch sampling provides response times 0.73× those with per-task sampling. Nonetheless, response times with batch sampling remain a factor of 1.92× worse than those provided by an omniscient scheduler.
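A sketch of batch sampling under the same illustrative Worker model as the previous example: probe d·m workers once for the whole job and place the m tasks on the m shortest queues.

import random

def batch_sampling(job_tasks, workers, d=2):
    """Probe d*m workers for an m-task job; use the m least loaded of them."""
    m = len(job_tasks)
    probed = random.sample(workers, d * m)                       # one batch of probes per job
    least_loaded = sorted(probed, key=lambda w: w.queue_length())[:m]
    for task, worker in zip(job_tasks, least_loaded):
        worker.queue.append(task)
    return least_loaded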

3.4 Problems with sample-based scheduling

Sample-based techniques perform poorly at high load due to two problems. First, schedulers place tasks based on the queue length at worker nodes. However, queue length provides only a coarse prediction of wait time. Consider a case where the scheduler probes two workers to place one task, one of which has two 50ms tasks queued and the other of which has one 300ms task queued. The scheduler will place the task in the queue with only one task, even though that queue will result in a 200ms longer wait time. While workers could reply with an estimate of task duration rather than queue length, accurately predicting task durations is notoriously difficult. Furthermore, almost all task duration estimates would need to be accurate for such a technique to be effective, because each job includes many parallel tasks, all of which must be placed on machines with low wait time to ensure good performance.

Sampling also suffers from a race condition where multiple schedulers concurrently place tasks on a worker that appears lightly loaded [13]. Consider a case where two different schedulers probe the same idle worker machine, w, at the same time. Since w is idle, both schedulers are likely to place a task on w; however, only one of the two tasks placed on the worker will arrive in an empty queue. The queued task might have been placed in a different queue had the corresponding scheduler known that w was not going to be idle when the task arrived.

3.5 Late binding

Sparrow introduces late binding to solve the aforementioned problems. With late binding, workers do not reply immediately to probes and instead place a reservation for the task at the end of an internal work queue. When this reservation reaches the front of the queue, the worker sends an RPC to the scheduler that initiated the probe, requesting a task for the corresponding job. The scheduler assigns the job's tasks to the first m workers to reply, and replies to the remaining (d − 1)·m workers with a no-op signaling that all of the job's tasks have been launched. In this manner, the scheduler guarantees that the tasks will be placed on the m probed workers where they will be launched soonest. For exponentially distributed task durations at 80% load, late binding provides response times 0.55× those with batch sampling, bringing response time to within 5% (4ms) of an omniscient scheduler (as shown in Figure 3).

The downside of late binding is that workers are idle while they are sending an RPC to request a new task from a scheduler. All current cluster schedulers we are aware of make this tradeoff: schedulers wait to assign tasks until a worker signals that it has enough free resources to launch the task. In our target setting, this tradeoff leads to a 2% efficiency loss compared to queueing tasks at worker machines. The fraction of time a worker spends idle while requesting tasks is (d · RTT)/(t + d · RTT) (where d denotes the number of probes per task, RTT denotes the mean network round trip time, and t denotes mean task service time). In our deployment on EC2 with an un-optimized network stack, mean network round trip time was 1 millisecond. We expect that the shortest tasks will complete in 100ms and that schedulers will use a probe ratio of no more than 2, leading to at most a 2% efficiency loss. For our target workload, this tradeoff is worthwhile, as illustrated by the results shown in Figure 3, which incorporate network delays. In environments where network latencies and task runtimes are comparable, late binding will not present a worthwhile tradeoff.
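A quick numeric check of the idle-time expression above, using the stated values (d = 2, RTT = 1 ms, t = 100 ms); the helper function is just illustrative arithmetic.

def idle_fraction(d, rtt_ms, task_ms):
    """Fraction of time a worker spends idle requesting tasks under late binding."""
    return (d * rtt_ms) / (task_ms + d * rtt_ms)

# With a probe ratio of 2, 1 ms RTT, and 100 ms tasks:
print(idle_fraction(2, 1.0, 100.0))   # ~0.0196, i.e., roughly the 2% efficiency loss cited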

3.6 Proactive Cancellation

When a scheduler has launched all of the tasks for a particular job, it can handle remaining outstanding probes in one of two ways: it can proactively send a cancellation RPC to all workers with outstanding probes, or it can wait for the workers to request a task and reply to those requests with a message indicating that no unlaunched tasks remain. We use our simulation to model the benefit of using proactive cancellation and find that proactive cancellation reduces median response time by 6% at 95% cluster load. At a given load ρ, workers are busy more than ρ of the time: they spend a ρ proportion of time executing tasks, but they spend additional time requesting tasks from schedulers. Using cancellation with 1ms network RTT, a probe ratio of 2, and tasks that are an average of 100ms long reduces the time workers spend busy by approximately 1%; because response times approach infinity as load approaches 100%, the 1% reduction in time workers spend busy leads to a noticeable reduction in response times. Cancellation leads to additional RPCs if a worker receives a cancellation for a reservation after it has already requested a task for that reservation: at 95% load, cancellation leads to 2% additional RPCs. We argue that the additional RPCs are a worthwhile tradeoff for the improved performance, and the full Sparrow implementation includes cancellation. Cancellation helps more when the ratio of network delay to task duration increases, so it will become more important as task durations decrease, and less important as network delay decreases.
4 Scheduling Policies and Constraints


Sparrow aims to support a small but useful set of policies within its decentralized framework. This section
describes support for two types of popular scheduler
policies: constraints over where individual tasks are
launched and inter-user isolation policies to govern the
relative performance of users when resources are contended.

4.1 Handling placement constraints


Sparrow handles two types of constraints, per-job and
per-task constraints. Such constraints are commonly required in data-parallel frameworks, for instance, to run
tasks on a machine that holds the task's input data on disk or in memory. As mentioned in §2, Sparrow does not support many types of constraints (e.g., inter-job constraints) supported by some general-purpose resource managers.
does not support many types of constraints (e.g., interjob constraints) supported by some general-purpose resource managers.
Per-job constraints (e.g., all tasks should be run on
a worker with a GPU) are trivially handled at a Sparrow scheduler. Sparrow randomly selects the dm candidate workers from the subset of workers that satisfy the
constraint. Once the dm workers to probe are selected,
scheduling proceeds as described previously.
Sparrow also handles jobs with per-task constraints,
such as constraints that limit tasks to run on machines
where input data is located. Co-locating tasks with input
data typically reduces response time, because input data
does not need to be transferred over the network. For
jobs with per-task constraints, each task may have a different set of machines on which it can run, so Sparrow
cannot aggregate information over all of the probes in
the job using batch sampling. Instead, Sparrow uses per-task sampling, where the scheduler selects the two machines to probe for each task from the set of machines
that the task is constrained to run on, along with late
binding.
Sparrow implements a small optimization over per-task sampling for jobs with per-task constraints. Rather than probing individually for each task, Sparrow shares information across tasks when possible. For example, consider a case where task 0 is constrained to run on
machines A, B, and C, and task 1 is constrained to run
on machines C, D, and E. Suppose the scheduler probed
machines A and B for task 0, which were heavily loaded,
and probed machines C and D for task 1, which were
both idle. In this case, Sparrow will place task 0 on machine C and task 1 on machine D, even though both machines were selected to be probed for task 1.
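A sketch of the sharing optimization in the example above; the data structures are illustrative and assume the simulated Worker model from the earlier sketches.

import random

def constrained_per_task_sampling(tasks, constraints, d=2):
    """tasks: list of task ids; constraints: task id -> list of candidate Workers."""
    probed = {t: random.sample(constraints[t], min(d, len(constraints[t]))) for t in tasks}
    # Pool probe results across tasks: any probed machine may serve any task
    # whose constraint set contains it.
    pool = sorted({w for ws in probed.values() for w in ws}, key=lambda w: w.queue_length())
    placements = {}
    for t in tasks:
        for w in pool:
            if w in constraints[t] and w not in placements.values():
                placements[t] = w
                w.queue.append(t)
                break
    return placements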
Although Sparrow cannot use batch sampling for jobs
with per-task constraints, our distributed approach still
provides near-optimal response times for these jobs, because even a centralized scheduler has only a small number of choices for where to place each task. Jobs with
per-task constraints can still use late binding, so the
scheduler is guaranteed to place each task on whichever
of the two probed machines where the task will run
sooner. Storage layers like HDFS typically replicate data
on three different machines, so tasks that read input data
will be constrained to run on one of three machines
where the input data is located. As a result, even an
ideal, omniscient scheduler would only have one additional choice for where to place each task.

4.2 Resource allocation policies

Cluster schedulers seek to allocate resources according to a specific policy when aggregate demand for resources exceeds capacity. Sparrow supports two types of policies: strict priorities and weighted fair sharing. These policies mirror those offered by other schedulers, including the Hadoop MapReduce scheduler [25].
Many cluster sharing policies reduce to using strict
priorities; Sparrow supports all such policies by maintaining multiple queues on worker nodes. FIFO, earliest
deadline first, and shortest job first all reduce to assigning a priority to each job, and running the highest priority jobs first. For example, with earliest deadline first,
jobs with earlier deadlines are assigned higher priority.
Cluster operators may also wish to directly assign priorities; for example, to give production jobs high priority and ad-hoc jobs low priority. To support these policies, Sparrow maintains one queue for each priority at
each worker node. When resources become free, Sparrow responds to the reservation from the highest priority non-empty queue. This mechanism trades simplicity
for accuracy: nodes need not use complex gossip protocols to exchange information about jobs that are waiting
to be scheduled, but low priority jobs may run before
high priority jobs if a probe for a low priority job arrives at a node where no high priority jobs happen to
be queued. We believe this is a worthwhile tradeoff: as
shown in §7.8, this distributed mechanism provides good
performance for high priority users. Sparrow does not
currently support preemption when a high priority task
arrives at a machine running a lower priority task; we
leave exploration of preemption to future work.
Sparrow can also enforce weighted fair shares. Each
worker maintains a separate queue for each user, and
performs weighted fair queuing [6] over those queues.
This mechanism provides cluster-wide fair shares in expectation: two users using the same worker will get
shares proportional to their weight, so by extension, two
users using the same set of machines will also be assigned shares proportional to their weight. We choose
this simple mechanism because more accurate mechanisms (e.g., Pisces [22]) add considerable complexity;
as we demonstrate in §7.7, Sparrow's simple mechanism
provides near-perfect fair shares.
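A minimal sketch of the per-worker queue structures described in this subsection: one queue per priority for strict priorities, and a simple credit scheme over per-user queues as a stand-in for weighted fair queuing. The real node monitor logic is more involved; all names here are assumptions.

from collections import defaultdict, deque

class WorkerQueues:
    def __init__(self, user_weights):
        self.priority_queues = defaultdict(deque)   # priority -> queued reservations
        self.user_queues = defaultdict(deque)       # user -> queued reservations
        self.user_weights = user_weights
        self.credits = dict(user_weights)           # remaining turns per user

    def enqueue_priority(self, priority, reservation):
        self.priority_queues[priority].append(reservation)

    def enqueue_user(self, user, reservation):
        self.user_queues[user].append(reservation)

    def next_strict_priority(self):
        # Serve the highest-priority non-empty queue (lower number = higher priority).
        for priority in sorted(self.priority_queues):
            if self.priority_queues[priority]:
                return self.priority_queues[priority].popleft()
        return None

    def next_weighted_fair(self):
        # Approximate weighted fair sharing: serve the active user with the most
        # remaining credit; refill credits from the weights when all are spent.
        active = [u for u in self.user_queues if self.user_queues[u]]
        if not active:
            return None
        if all(self.credits.get(u, 0) <= 0 for u in active):
            for u in active:
                self.credits[u] = self.user_weights.get(u, 1)
        user = max(active, key=lambda u: self.credits.get(u, 0))
        self.credits[user] = self.credits.get(user, 0) - 1
        return self.user_queues[user].popleft()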

5 Analysis
Before delving into our experimental evaluation, we analytically show that batch sampling achieves near-optimal
performance, regardless of the task duration distribution, given some simplifying assumptions. Section 3
demonstrated that Sparrow performs well, but only under one particular workload; this section generalizes
those results to all workloads. We also demonstrate that
with per-task sampling, performance decreases exponentially with the number of tasks in a job, making it
poorly suited for parallel workloads.


Table 1: Summary of notation.
  n: number of servers in the cluster
  ρ: load (fraction of non-idle workers)
  m: tasks per job
  d: probes per task
  t: mean task service time
  ρn/(mt): mean request arrival rate

Table 2: Probability that a job will experience zero wait time under three different scheduling techniques.
  Random placement: (1 − ρ)^m
  Per-task sampling: (1 − ρ^d)^m
  Batch sampling: Σ_{i=m}^{dm} C(dm, i) · (1 − ρ)^i · ρ^(dm−i)

Figure 4: Probability that a job will experience zero wait time in a single-core environment using random placement, sampling 2 servers/task, and sampling 2m machines to place an m-task job (shown for 10-task and 100-task jobs as a function of load).
To analyze the performance of batch and per-task sampling, we examine the probability of placing all tasks in a job on idle machines, or equivalently, providing zero wait time. Quantifying how often our approach places jobs on idle workers provides a bound on how Sparrow performs compared to an optimal scheduler.

We make a few simplifying assumptions for the purpose of this analysis. We assume zero network delay, an infinitely large number of servers, and that each server runs one task at a time. Our experimental evaluation shows results in the absence of these assumptions.

Mathematical analysis corroborates the results in §3 demonstrating that per-task sampling performs poorly for parallel jobs. The probability that a particular task is placed on an idle machine is one minus the probability that all probes hit busy machines: 1 − ρ^d (where ρ represents cluster load and d represents the probe ratio; Table 1 summarizes notation). The probability that all tasks in a job are assigned to idle machines is (1 − ρ^d)^m (as shown in Table 2), because all m sets of probes must hit at least one idle machine. This probability decreases exponentially with the number of tasks in a job, rendering per-task sampling inappropriate for scheduling parallel jobs. Figure 4 illustrates the probability that a job experiences zero wait time for both 10- and 100-task jobs, and demonstrates that the probability of experiencing zero wait time for a 100-task job drops to < 2% at 20% load.

Batch sampling can place all of a job's tasks on idle machines at much higher loads than per-task sampling. In expectation, batch sampling will be able to place all m tasks in empty queues as long as d ≥ 1/(1 − ρ). Crucially, this expression does not depend on the number of tasks in a job (m). Figure 4 illustrates this effect: for both 10- and 100-task jobs, the probability of experiencing zero wait time drops from 1 to 0 at 50% load.4
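A small numeric check of the expressions in Table 2, assuming the formulas as reconstructed above; it reproduces the claim that a 100-task job under per-task sampling has < 2% chance of zero wait time at 20% load.

from math import comb

def p_zero_wait_random(load, m):
    return (1 - load) ** m

def p_zero_wait_per_task(load, m, d=2):
    return (1 - load ** d) ** m

def p_zero_wait_batch(load, m, d=2):
    # At least m of the d*m probes must land on idle machines.
    return sum(comb(d * m, i) * (1 - load) ** i * load ** (d * m - i)
               for i in range(m, d * m + 1))

print(p_zero_wait_per_task(0.20, 100))   # ~0.017, i.e., < 2% at 20% load
print(p_zero_wait_batch(0.20, 100))      # essentially 1 at 20% load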

Figure 5: Probability that a job will experience zero wait time in a system of 4-core servers (for 10-task and 100-task jobs under random placement, per-task sampling, and batch sampling, as a function of load).

Our analysis thus far has considered machines that can run only one task at a time; however, today's clusters typically feature multi-core machines. Multicore machines significantly improve the performance of batch sampling. Consider a model where each server can run up to c tasks concurrently. Each probe implicitly describes load on c processing units rather than just one, which increases the likelihood of finding an idle processing unit on which to run each task. To analyze performance in a multicore environment, we make two simplifying assumptions: first, we assume that the probability that a core is idle is independent of whether other cores on the same machine are idle; and second, we assume that the scheduler places at most 1 task on each machine, even if multiple cores are idle (placing multiple tasks on an idle machine exacerbates the "gold rush" effect where many schedulers concurrently place tasks on an idle machine). Based on these assumptions, we can replace ρ in Table 2 with ρ^c to obtain the results shown in Figure 5. These results improve dramatically on the single-core results: with 4 cores per machine and 100 tasks per job, batch sampling achieves near perfect performance (99.9% of jobs experience zero wait time) at up to 79% load. This result demonstrates that, under some simplifying assumptions, batch sampling performs well regardless of the distribution of task durations.

4 With the larger, 100-task job, the drop happens more rapidly because the job uses more total probes, which decreases the variance in
the proportion of probes that hit idle machines.
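As a rough numerical check of the analysis above, the short Python sketch below (our own illustration, not part of Sparrow) evaluates the zero-wait-time probabilities summarized in Table 2, assuming an integer probe ratio d; batch sampling succeeds when at least m of the d·m probes find an idle machine, and the multicore case is modeled by replacing ρ with ρ^c.

import math

def zero_wait_probabilities(rho, d, m, c=1):
    """Pr(all m tasks in a job experience zero wait time) under random placement,
    per-task sampling, and batch sampling, with c cores per machine modeled by
    substituting rho**c for rho. Assumes d is an integer."""
    p_busy = rho ** c                      # probability that a probed machine has no idle core
    p_idle = 1.0 - p_busy
    random_placement = p_idle ** m
    per_task = (1.0 - p_busy ** d) ** m
    probes = d * m                         # batch sampling spreads d*m probes over the whole job
    batch = sum(math.comb(probes, i) * p_idle ** i * p_busy ** (probes - i)
                for i in range(m, probes + 1))
    return random_placement, per_task, batch

print(zero_wait_probabilities(rho=0.2, d=2, m=100))        # per-task sampling already below 2%
print(zero_wait_probabilities(rho=0.8, d=2, m=100, c=4))   # batch sampling near 1 at 80% load, 4 cores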

Figure 6: Frameworks that use Sparrow are decomposed into frontends, which generate tasks, and executors, which run tasks. Frameworks schedule jobs by communicating with any one of a set of distributed Sparrow schedulers. Sparrow node monitors run on each worker machine and federate resource usage. [The figure shows Spark and "App X" frontends co-located with Sparrow schedulers, and workers running a Sparrow node monitor alongside Spark and App X executors.]

Figure 7: RPCs (parameters not shown) and timings associated with launching a job. Sparrow's external interface is shown in bold text and internal RPCs are shown in grey text. [The figure traces submitRequest(), enqueueReservation(), getTask(), launchTask(), and taskComplete(), annotated with the reserve, queue, get task, and service time intervals.]

queries or job specifications (e.g., a SQL query) from exogenous sources (e.g., a data analyst, web service, business application, etc.) and compile them into parallel
tasks for execution on workers. Frontends are typically
distributed over multiple machines to provide high performance and availability. Because Sparrow schedulers
are lightweight, in our deployment, we run a scheduler
on each machine where an application frontend is running to ensure minimum scheduling latency.
Executor processes are responsible for executing
tasks, and are long-lived to avoid startup overhead such
as shipping binaries or caching large datasets in memory.
Executor processes for multiple frameworks may run co-resident on a single machine; the node monitor federates
resource usage between co-located frameworks. Sparrow requires executors to accept a launchTask()
RPC from a local node monitor, as shown in Figure 7;
Sparrow uses the launchTask() RPC to pass on the
task description (opaque to Sparrow) originally supplied
by the application frontend.
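To make the executor-side contract concrete, the sketch below is our own Python illustration (Sparrow itself is written in Java and the Spark plugin in Scala); only the launchTask() and taskComplete() names come from Figure 7, while the class names, the threading, and the choice to report completion back through the local node monitor are assumptions made for the example.

import threading

class FakeNodeMonitor:
    """Hypothetical stand-in for the local Sparrow node monitor's RPC stub."""
    def taskComplete(self, task_id, result):
        print("task", task_id, "completed:", result)

class ExecutorStub:
    """Sketch of a framework executor: it accepts launchTask() from the local
    node monitor and runs the (opaque) task description off the RPC thread."""
    def __init__(self, node_monitor):
        self.node_monitor = node_monitor

    def launchTask(self, task_id, task_description):
        # The task description is opaque to Sparrow; it was supplied by the frontend.
        threading.Thread(target=self._run, args=(task_id, task_description)).start()

    def _run(self, task_id, task_description):
        result = task_description.upper()        # stand-in for framework-specific work
        self.node_monitor.taskComplete(task_id, result)

ExecutorStub(FakeNodeMonitor()).launchTask(7, "scan partition 3")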

6 Implementation
We implemented Sparrow to evaluate its performance
on a cluster of 110 Amazon EC2 virtual machines. The
Sparrow code, including scripts to replicate our experimental evaluation, is publicly available at http://github.com/radlab/sparrow.

6.1 System components


As shown in Figure 6, Sparrow schedules from a distributed set of schedulers that are each responsible for
assigning tasks to workers. Because Sparrow does not
require any communication between schedulers, arbitrarily many schedulers may operate concurrently, and
users or applications may use any available scheduler
to place jobs. Schedulers expose a service (illustrated in
Figure 7) that allows frameworks to submit job scheduling requests using Thrift remote procedure calls [1].
Thrift can generate client bindings in many languages,
so applications that use Sparrow for scheduling are not
tied to a particular language. Each scheduling request includes a list of task specifications; the specification for a
task includes a task description and a list of constraints
governing where the task can be placed.
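For a concrete picture of what one scheduling request carries, the sketch below models it as plain Python data structures; the field names are our own illustration rather than Sparrow's actual Thrift schema.

from dataclasses import dataclass, field
from typing import List

@dataclass
class TaskSpec:
    task_id: str
    description: bytes                                              # opaque to Sparrow; interpreted by the executor
    placement_constraints: List[str] = field(default_factory=list)  # e.g., hosts holding the input data

@dataclass
class SchedulingRequest:                                            # one job, i.e., one submitRequest() call
    app_id: str
    user: str
    tasks: List[TaskSpec] = field(default_factory=list)

request = SchedulingRequest(
    app_id="shark", user="analyst-3",
    tasks=[TaskSpec("t0", b"scan partition 0", ["host-17", "host-42", "host-88"]),
           TaskSpec("t1", b"scan partition 1")])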
A Sparrow node monitor runs on each worker, and
federates resource usage on the worker by enqueuing reservations and requesting task specifications from
schedulers when resources become available. Node
monitors run tasks in a fixed number of slots; slots can
be configured based on the resources of the underlying
machine, such as CPU cores and memory.
Sparrow performs task scheduling for one or more
concurrently operating frameworks. As shown in Figure 6, frameworks are composed of long-lived frontend
and executor processes, a model employed by many
systems (e.g., Mesos [8]). Frontends accept high level

6.2 Spark on Sparrow


In order to test Sparrow using a realistic workload,
we ported Spark [26] to Sparrow by writing a Spark
scheduling plugin. This plugin is 280 lines of Scala
code, and can be found at https://github.com/kayousterhout/spark/tree/sparrow.
The execution of a Spark query begins at a Spark
frontend, which compiles a functional query definition
into multiple parallel stages. Each stage is submitted as
a Sparrow job, including a list of task descriptions and
any associated placement constraints. The first stage is
typically constrained to execute on machines that contain input data, while the remaining stages (which read
data shuffled or broadcasted over the network) are unconstrained. When one stage completes, Spark requests
scheduling of the tasks in the subsequent stage.

6.3 Fault tolerance

a TPC-H workload, which features heterogeneous analytics queries. We provide fine-grained tracing of the overhead that Sparrow incurs and quantify its performance in comparison with an ideal scheduler. Second, we demonstrate Sparrow's ability to handle scheduler failures. Third, we evaluate Sparrow's ability to isolate users from one another in accordance with cluster-wide scheduling policies. Finally, we perform a sensitivity analysis of key parameters in Sparrow's design.

Because Sparrow schedulers do not have any logically


centralized state, the failure of one scheduler does not affect the operation of other schedulers. Frameworks that
were using the failed scheduler need to detect the failure
and connect to a backup scheduler. Sparrow includes a
Java client that handles failover between Sparrow schedulers. The client accepts a list of schedulers from the application and connects to the first scheduler in the list.
The client sends a heartbeat message to the scheduler it
is using every 100ms to ensure that the scheduler is still
alive; if the scheduler has failed, the client connects to
the next scheduler in the list and triggers a callback at the
application. This approach allows frameworks to decide
how to handle tasks that were in-flight during the scheduler failure. Some frameworks may choose to ignore
failed tasks and proceed with a partial result; for Spark,
the handler instantly relaunches any phases that were in-flight when the scheduler failed. Frameworks that elect
to re-launch tasks must ensure that tasks are idempotent,
because the task may have been partway through execution when the scheduler died. Sparrow does not attempt
to learn about in-progress jobs that were launched by the
failed scheduler, and instead relies on applications to relaunch such jobs. Because Sparrow is designed for short
jobs, the simplicity benefit of this approach outweighs
the efficiency loss from needing to restart jobs that were
in the process of being scheduled by the failed scheduler.
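The failover behavior just described fits in a few lines; the sketch below is our own Python rendering of the same idea (the real client is Java), assuming a 100ms heartbeat interval, an application-supplied callback, and a simplified wrap-around through the scheduler list.

import time

class FailoverClient:
    """Sketch of a client that heartbeats its current scheduler every 100ms and,
    on failure, connects to the next scheduler and notifies the application."""
    HEARTBEAT_INTERVAL_S = 0.1                # 100ms, as in the text

    def __init__(self, schedulers, on_failover):
        self.schedulers = schedulers          # scheduler stubs, in preference order
        self.on_failover = on_failover        # application callback for in-flight work
        self.current = 0

    def run(self):
        while True:
            try:
                self.schedulers[self.current].heartbeat()
            except ConnectionError:
                # Move on to the next scheduler (wrapping around is a simplification).
                self.current = (self.current + 1) % len(self.schedulers)
                self.on_failover(self.schedulers[self.current])
            time.sleep(self.HEARTBEAT_INTERVAL_S)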
While Sparrow's design allows for scheduler failures, Sparrow does not provide any safeguards against rogue schedulers. A misbehaving scheduler could use a larger probe ratio to improve its own performance at the expense of other jobs. In environments where schedulers are run by a trusted entity (e.g., within a company), this should not be a problem; in more adversarial environments, schedulers may need to be authenticated and rate-limited to prevent misbehaving schedulers from wasting resources.
Sparrow does not handle worker failures, as discussed in §8, nor does it handle the case where the entire cluster fails. Because Sparrow does not persist scheduling state to disk, in the event that all machines in the cluster fail (for example, due to a power loss event), all jobs that were in progress will need to be restarted. As in the case when a scheduler fails, the efficiency loss from this approach is minimal because jobs are short.

7.1 Performance on TPC-H workload


We measure Sparrow's performance scheduling queries from the TPC-H decision support benchmark. The TPC-H benchmark is representative of ad-hoc queries on business data, which are a common use case for low-latency data-parallel frameworks.
Each TPC-H query is executed using Shark [24],
a large scale data analytics platform built on top of
Spark [26]. Shark queries are compiled into multiple
Spark stages that each trigger a scheduling request using
Sparrow's submitRequest() RPC. Tasks in the first
stage are constrained to run on one of three machines
holding the task's input data, while tasks in remaining
stages are unconstrained. The response time of a query
is the sum of the response times of each stage. Because
Shark is resource-intensive, we use EC2 high-memory
quadruple extra large instances, which each have 8 cores
and 68.4GB of memory, and use 4 slots on each worker.
Ten different users launch random permutations of the
TPC-H queries to sustain an average cluster load of 80%
for a period of approximately 15 minutes. We report response times from a 200 second period in the middle
of the experiment; during the 200 second period, Sparrow schedules over 20k jobs that make up 6.2k TPC-H
queries. Each user runs queries on a distinct denormalized copy of the TPC-H dataset; each copy of the data set
is approximately 2GB (scale factor 2) and is broken into
33 partitions that are each triply replicated in memory.
The TPC-H query workload has four qualities representative of a real cluster workload. First, cluster utilization fluctuates around the mean value of 80% depending
on whether the users are collectively in more resource-intensive or less resource-intensive stages. Second, the
stages have different numbers of tasks: the first stage has
33 tasks, and subsequent stages have either 8 tasks (for
reduce-like stages that read shuffled data) or 1 task (for
aggregation stages). Third, the duration of each stage is
non-uniform, varying from a few tens of milliseconds to
several hundred. Finally, the queries have a mix of constrained and unconstrained scheduling requests: 6.2k requests are constrained (the first stage in each query) and
the remaining 14k requests are unconstrained.

7 Experimental Evaluation
We evaluate Sparrow using a cluster composed of 100
worker machines and 10 schedulers running on Amazon EC2. Unless otherwise specified, we use a probe
ratio of 2. First, we use Sparrow to schedule tasks for

Figure 8: Response times for TPC-H queries using different placement strategies. Whiskers depict 5th and 95th percentiles; boxes depict median, 25th, and 75th percentiles. [The figure compares random placement, per-task sampling, batch sampling, batch + late binding, and an ideal scheduler for queries q3, q4, q6, and q12.]

Figure 9: Latency distribution for each phase in the Sparrow scheduling algorithm. [The figure plots cumulative probability versus delay in milliseconds for the reserve, queue, get task, and service times.]

To evaluate Sparrow's performance, we compare Sparrow to an ideal scheduler that always places all tasks with zero wait time, as described in §3.1. To compute the ideal response time for a query, we compute the response time for each stage if all of the tasks in the stage had been placed with zero wait time, and then sum the ideal response times for all stages in the query. Sparrow always satisfies data locality constraints; because the ideal response times are computed using the service times when Sparrow executed the job, the ideal response time assumes data locality for all tasks. The ideal response time does not include the time needed to send tasks to worker machines, nor does it include queueing that is inevitable during utilization bursts, making it a conservative lower bound on the response time attainable with a centralized scheduler.
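As a small worked example of this computation (our own sketch, with invented service times), a stage placed entirely with zero wait time finishes when its slowest task finishes, so the ideal query response time is the sum of the per-stage maxima:

def ideal_query_response_time(stages):
    """stages: one list of per-task service times (ms) per stage; a stage with
    zero wait time finishes when its slowest task finishes."""
    return sum(max(stage) for stage in stages)

# Hypothetical query: a 33-task scan stage, an 8-task shuffle stage, a 1-task aggregation.
scan      = [120, 95, 140] + [100] * 30
shuffle   = [60, 75, 55, 80, 70, 65, 72, 68]
aggregate = [40]
print(ideal_query_response_time([scan, shuffle, aggregate]))  # 140 + 80 + 40 = 260 ms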

Figure 10: Delay using both Sparrow and per-task sampling, for both constrained and unconstrained Spark stages. Whiskers depict 5th and 95th percentiles; boxes depict median, 25th, and 75th percentiles.

7.2 Deconstructing performance


To understand the components of the delay that Sparrow adds relative to an ideal scheduler, we deconstruct Sparrow scheduling latency in Figure 9. Each line corresponds to one of the phases of the Sparrow scheduling algorithm depicted in Figure 7. The reserve time and queue time are unique to Sparrow; a centralized scheduler might be able to reduce these times to zero. However, the get task time is unavoidable: no matter the scheduling algorithm, the scheduler will need to ship the task to the worker machine.
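One way to make this decomposition concrete (our own illustration; the timestamps below are hypothetical, not measurements from the experiment):

# Per-task timing decomposition, following the phases named in Figures 7 and 9.
t_probe_sent, t_reservation_queued, t_get_task, t_task_launched, t_task_done = 0, 2, 9, 12, 115

reserve_time  = t_reservation_queued - t_probe_sent   # scheduler sends enqueueReservation() to a node monitor
queue_time    = t_get_task - t_reservation_queued     # reservation waits in the node monitor's queue
get_task_time = t_task_launched - t_get_task          # node monitor fetches the task spec via getTask()
service_time  = t_task_done - t_task_launched         # task execution on the worker
scheduling_delay = reserve_time + queue_time + get_task_time   # what Sparrow adds over an ideal scheduler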

Figure 8 demonstrates that Sparrow outperforms alternate techniques and provides response times within 12% of an ideal scheduler. Compared to randomly assigning tasks to workers, Sparrow (batch sampling with late binding) reduces median query response time by 4–8× and reduces 95th percentile response time by over 10×. Sparrow also reduces response time compared to per-task sampling (a naïve implementation based on the power of two choices): batch sampling with late binding provides query response times an average of 0.8× those provided by per-task sampling. Ninety-fifth percentile response times drop by almost a factor of two with Sparrow, compared to per-task sampling. Late binding reduces median query response time by an average of 14% compared to batch sampling alone. Sparrow also provides good absolute performance: Sparrow provides median response times just 12% higher than those provided by an ideal scheduler.

7.3 How do task constraints affect performance?


Sparrow provides good absolute performance and improves over per-task sampling for both constrained and
unconstrained tasks. Figure 10 depicts the delay for constrained and unconstrained stages in the TPC-H workload using both Sparrow and per-task sampling. Sparrow
schedules with a median of 7ms of delay for jobs with
unconstrained tasks and a median of 14ms of delay for
jobs with constrained tasks; because Sparrow cannot aggregate information across the tasks in a job when tasks
are constrained, delay is longer. Nonetheless, even for

Figure 11: TPC-H response times for two frontends submitting queries to a 100-node cluster. Node 1 suffers from a scheduler failure at 20 seconds.

Figure 12: Response time when scheduling 10-task jobs in a 100-node cluster using both Sparrow and Spark's native scheduler. Utilization is fixed at 80%, while task duration decreases.

experiments run on a cluster of 110 EC2 servers, with 10


schedulers and 100 workers.

constrained tasks, Sparrow provides a performance improvement over per-task sampling due to its use of late
binding.

7.6 How does Sparrow compare to Spark's native, centralized scheduler?
Even in the relatively small, 100-node cluster in which we conducted our evaluation, Spark's existing centralized scheduler cannot provide high enough throughput to support sub-second tasks.5 We use a synthetic workload where each job is composed of 10 tasks that each sleep for a specified period of time, and measure job response time. Since all tasks in the job are the same duration, ideal job response time (if all tasks are launched immediately) is the duration of a single task. To stress the schedulers, we use 8 slots on each machine (one per core). Figure 12 depicts job response time as a function of task duration. We fix cluster load at 80%, and vary the task submission rate to keep load constant as task duration decreases. For tasks longer than 2 seconds, Sparrow and Spark's native scheduler both provide near-ideal response times. However, when tasks are shorter than 1355ms, Spark's native scheduler cannot keep up with the rate at which tasks are completing, so jobs experience infinite queueing.
To verify that Sparrow's distributed scheduling is necessary, we performed extensive profiling of the Spark scheduler to understand how much we could increase scheduling throughput with improved engineering. We did not find any one bottleneck in the Spark scheduler; instead, messaging overhead, virtual function call overhead, and context switching lead to a best-case throughput (achievable when Spark is scheduling only a single job) of approximately 1500 tasks per second. Some of these factors could be mitigated, but at the expense of code readability and understandability. A clus-

7.4 How do scheduler failures impact job


response time?
Sparrow provides automatic failover between schedulers
and can fail over to a new scheduler in less than 120ms.
Figure 11 plots the response time for ongoing TPC-H
queries in an experiment parameterized as in §7.1, with
10 Shark frontends that submit queries. Each frontend
connects to a co-resident Sparrow scheduler but is initialized with a list of alternate schedulers to connect to in
case of failure. At time t=20, we terminate the Sparrow
scheduler on node 1. The plot depicts response times for
jobs launched from the Spark frontend on node 1, which
fails over to the scheduler on node 2. The plot also shows
response times for jobs launched from the Spark frontend on node 2, which uses the scheduler on node 2 for
the entire duration of the experiment. When the Sparrow
scheduler on node 1 fails, it takes 100ms for the Sparrow client to detect the failure, less than 5ms for the
Sparrow client to connect to the scheduler on node 2,
and less than 15ms for Spark to relaunch all outstanding tasks. Because of the speed at which failure recovery occurs, only 2 queries have tasks in flight during the
failure; these queries suffer some overhead.

7.5 Synthetic workload


The remaining sections evaluate Sparrow using a synthetic workload composed of jobs with constant duration tasks. In this workload, ideal job completion time
is always equal to task duration, which helps to isolate
the performance of Sparrow from application-layer variations in service time. As in previous experiments, these

5 For these experiments, we use Spark's standalone mode, which relies on a simple, centralized scheduler. Spark also allows for scheduling using Mesos; Mesos is more heavyweight and provides worse performance than standalone mode for short tasks.


HP load  LP load  HP response time in ms  LP response time in ms
0.25     0        106 (111)               N/A
0.25     0.25     108 (114)               108 (115)
0.25     0.5      110 (148)               110 (449)
0.25     0.75     136 (170)               40.2k (46.2k)
0.25     1.75     141 (226)               255k (270k)

Table 3: Median and 95th percentile (shown in parentheses) response times for a high priority (HP) and low priority (LP) user running jobs composed of 10 100ms tasks in a 100-node cluster. Sparrow successfully shields the high priority user from a low priority user. When aggregate load is 1 or more, response time will grow to be unbounded for at least one user.

Figure 13: Cluster share used by two users that are each assigned equal shares of the cluster. User 0 submits at a rate to utilize the entire cluster for the entire experiment while user 1 adjusts its submission rate each 10 seconds. Sparrow assigns both users their max-min fair share.

ter with tens of thousands of machines running sub-second tasks may require millions of scheduling decisions per second; supporting such an environment would require 1000× higher scheduling throughput, which is difficult to imagine even with a significant rearchitecting of the scheduler. Clusters running low-latency workloads will need to shift from using centralized task schedulers like Spark's native scheduler to using more scalable distributed schedulers like Sparrow.

over short time intervals. Nonetheless, as shown in Figure 13, Sparrow quickly allocates enough resources to
User 1 when she begins submitting scheduling requests
(10 seconds into the experiment), and the cluster share
allocated by Sparrow exhibits only small fluctuations
from the correct fair share.
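The "correct fair share" here is the max-min fair allocation. As a small worked example (our own Python sketch), max-min fairness satisfies the smallest demands first and splits the remainder evenly, so with 400 slots a user demanding half the cluster alongside a saturating user receives 200 slots, and one demanding a quarter receives 100:

def max_min_shares(capacity, demands):
    """Max-min fair allocation: satisfy the smallest demands first, then split the
    remaining capacity evenly among users whose demand is not yet met."""
    shares = [0.0] * len(demands)
    remaining = sorted(range(len(demands)), key=lambda i: demands[i])
    capacity_left = float(capacity)
    while remaining:
        i = remaining.pop(0)
        shares[i] = min(demands[i], capacity_left / (len(remaining) + 1))
        capacity_left -= shares[i]
    return shares

print(max_min_shares(400, [400, 200]))   # [200.0, 200.0]
print(max_min_shares(400, [400, 100]))   # [300.0, 100.0]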

7.8 How much can low priority users hurt


response times for high priority users?

7.7 How well can Sparrow's distributed


fairness enforcement maintain fair
shares?

Table 3 demonstrates that Sparrow provides response


times within 40% of an ideal scheduler for a high priority
user in the presence of a misbehaving low priority user.
This experiment uses workers that each have 16 slots.
The high priority user submits jobs at a rate to fill 25%
of the cluster, while the low priority user increases her
submission rate to well beyond the capacity of the cluster. Without any isolation mechanisms, when the aggregate submission rate exceeds the cluster capacity, both
users would experience infinite queueing. As described
in §4.2, Sparrow node monitors run all queued high priority tasks before launching any low priority tasks, allowing Sparrow to shield high priority users from misbehaving low priority users. While Sparrow prevents the
high priority user from experiencing infinite queueing
delay, the high priority user still experiences 40% worse
response times when sharing with a demanding low priority user than when running alone on the cluster. This is
because Sparrow does not use preemption: high priority
tasks may need to wait to be launched until low priority tasks complete. In the worst case, this wait time may
be as long as the longest running low-priority task. Exploring the impact of preemption is a subject of future
work.
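The policy exercised here, run every queued high priority reservation before any low priority one, can be sketched as a priority queue on the node monitor (a minimal Python illustration of ours, assuming smaller numbers mean higher priority; Sparrow's actual node monitor code may differ in detail):

import heapq

class PriorityReservationQueue:
    """Node-monitor-style queue that returns all queued high priority reservations
    before any low priority ones, FIFO within each priority level."""
    def __init__(self):
        self._heap = []
        self._seq = 0                          # tie-breaker for FIFO order within a priority

    def enqueue_reservation(self, priority, scheduler_id):
        heapq.heappush(self._heap, (priority, self._seq, scheduler_id))
        self._seq += 1

    def next_reservation(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

q = PriorityReservationQueue()
q.enqueue_reservation(1, "low priority scheduler A")
q.enqueue_reservation(0, "high priority scheduler B")
print(q.next_reservation())                    # -> "high priority scheduler B"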

Figure 13 demonstrates that Sparrow's distributed fairness mechanism enforces cluster-wide fair shares and quickly adapts to changing user demand. Users 0 and 1 are both given equal shares in a cluster with 400 slots. Unlike other experiments, we use 100 4-core EC2 machines; Sparrow's distributed enforcement works better as the number of cores increases, so to avoid overstating performance, we evaluate it under the smallest number of cores we would expect in a cluster today. User 0 submits at a rate to fully utilize the cluster for the entire duration of the experiment. User 1 changes her demand every 10 seconds: she submits at a rate to consume 0%, 25%, 50%, 25%, and finally 0% of the cluster's available slots. Under max-min fairness, each user is allocated her fair share of the cluster unless the user's demand is less than her share, in which case the unused share is distributed evenly amongst the remaining users. Thus, user 1's max-min share for each 10-second interval is 0 concurrently running tasks, 100 tasks, 200 tasks, 100 tasks, and finally 0 tasks; user 0's max-min fair share is the remaining resources. Sparrow's fairness mechanism lacks any central authority with a complete view of how many tasks each user is running, leading to imperfect fairness

Figure 14: Effect of probe ratio on job response time at two different cluster loads. Whiskers depict 5th and 95th percentiles; boxes depict median, 25th, and 75th percentiles. [The figure shows probe ratios of 1, 1.1, 1.2, 1.5, and 2 at 80% and 90% load, with the ideal response time marked.]

Figure 15: Sparrow provides low median response time for jobs composed of 10 100ms tasks, even when those tasks are run alongside much longer jobs. Error bars depict 5th and 95th percentiles. [Configurations shown: 16 cores with 50% long tasks, 4 cores with 10% long tasks, and 4 cores with 50% long tasks, for long-task durations of 10s and 100s.]

to sustain 80% cluster load. Figure 15 illustrates the response time of short jobs when sharing the cluster with
long jobs. We vary the percentage of jobs that are long,
the duration of the long jobs, and the number of cores
on the machine, to illustrate where performance breaks
down. Sparrow provides response times for short tasks
within 11% of ideal (100ms) when running on 16-core
machines, even when 50% of tasks are 3 orders of magnitude longer. When 50% of tasks are 3 orders of magnitude longer, over 99% of the execution time across all jobs is spent executing long tasks; given this, Sparrow's performance is impressive. Short tasks see more significant performance degradation in a 4-core environment.

7.9 How sensitive is Sparrow to the probe


ratio?
Changing the probe ratio affects Sparrow's performance
most at high cluster load. Figure 14 depicts response
time as a function of probe ratio in a 110-machine cluster of 8-core machines running the synthetic workload
(each job has 10 100ms tasks). The figure demonstrates
that using a small amount of oversampling significantly
improves performance compared to placing tasks randomly: oversampling by just 10% (probe ratio of 1.1)
reduces median response time by more than 2.5× compared to random sampling (probe ratio of 1) at 90% load.
The figure also demonstrates a sweet spot in the probe
ratio: a low probe ratio negatively impacts performance
because schedulers do not oversample enough to find
lightly loaded machines, but additional oversampling
eventually hurts performance due to increased messaging. This effect is most apparent at 90% load; at 80%
load, median response time with a probe ratio of 1.1 is
just 1.4× higher than median response time with a larger
probe ratio of 2. We use a probe ratio of 2 throughout
our evaluation to facilitate comparison with the power
of two choices and because non-integral probe ratios are
not possible with constrained tasks.

7.11 Scaling to large clusters


We used simulation to evaluate Sparrow's performance in larger clusters. Figure 3 suggests that Sparrow will continue to provide good performance in a 10,000-node cluster; of course, the only way to conclusively evaluate Sparrow's performance at scale will be to deploy it on a large cluster.

8 Limitations and Future Work


To handle the latency and throughput demands of low-latency frameworks, our approach sacrifices features
available in general purpose resource managers. Some of
these limitations of our approach are fundamental, while
others are the focus of future work.
Scheduling policies When a cluster becomes oversubscribed, Sparrow supports aggregate fair-sharing or
priority-based scheduling. Sparrow's distributed setting
lends itself to approximated policy enforcement in order to minimize system complexity; exploring whether
Sparrow can provide more exact policy enforcement

7.10 Handling task heterogeneity


Sparrow does not perform as well under extreme task
heterogeneity: if some workers are running long tasks,
Sparrow schedulers are less likely to find idle machines
on which to run tasks. Sparrow works well unless a large
fraction of tasks are long and the long tasks are many orders of magnitude longer than the short tasks. We ran
a series of experiments with two types of jobs: short
jobs, composed of 10 100ms tasks, and long jobs, composed of 10 tasks of longer duration. Jobs are submitted

rely on centralized architectures. Among logically decentralized schedulers, Sparrow is the first to schedule all of a job's tasks together, rather than scheduling
each task independently, which improves performance
for parallel jobs.
Dean's work on reducing the latency tail in serving
systems [5] is most similar to ours. He proposes using
hedged requests where the client sends each request to
two workers and cancels remaining outstanding requests
when the first result is received. He also describes tied
requests, where clients send each request to two servers,
but the servers communicate directly about the status of
the request: when one server begins executing the request, it cancels the counterpart. Both mechanisms are
similar to Sparrows late binding, but target an environment where each task needs to be scheduled independently (for data locality), so information cannot be
shared across the tasks in a job.
Work on load sharing in distributed systems (e.g., [7])
also uses randomized techniques similar to Sparrow's.
In load sharing systems, each processor both generates
and processes work; by default, work is processed where
it is generated. Processors re-distribute queued tasks if
the number of tasks queued at a processor exceeds some
threshold, using either receiver-initiated policies, where
lightly loaded processors request work from randomly
selected other processors, or sender-initiated policies,
where heavily loaded processors offload work to randomly selected recipients. Sparrow represents a combination of sender-initiated and receiver-initiated policies:
schedulers (senders) initiate the assignment of tasks
to workers (receivers) by sending probes, but workers finalize the assignment by responding to probes and
requesting tasks as resources become available.
Projects that explore load balancing tasks in multiprocessor shared-memory architectures (e.g., [19]) echo
many of the design tradeoffs underlying our approach,
such as the need to avoid centralized scheduling points.
They differ from our approach because they focus
on a single machine where the majority of the effort is spent determining when to reschedule processes
amongst cores to balance load.
Quincy [9] targets task-level scheduling in compute
clusters, similar to Sparrow. Quincy maps the scheduling problem onto a graph in order to compute an optimal
schedule that balances data locality, fairness, and starvation freedom. Quincy's graph solver supports more sophisticated scheduling policies than Sparrow but takes
over a second to compute a scheduling assignment in
a 2,500-node cluster, making it too slow for our target
workload.
In the realm of data analytics frameworks,
Dremel [12] achieves response times of seconds
with extremely high fanout. Dremel uses a hierarchical

without adding significant complexity is a focus of future work. Adding pre-emption, for example, would be a
simple way to mitigate the effects of low-priority users' jobs on higher-priority users.
Constraints Our current design does not handle inter-job constraints (e.g., the tasks for job A must not run on racks with tasks for job B). Supporting inter-job constraints across frontends is difficult to do without significantly altering Sparrow's design.
Gang scheduling Some applications require gang
scheduling, a feature not implemented by Sparrow. Gang
scheduling is typically implemented using bin-packing
algorithms that search for and reserve time slots in which
an entire job can run. Because Sparrow queues tasks on
several machines, it lacks a central point from which
to perform bin-packing. While Sparrow often places all
jobs on entirely idle machines, this is not guaranteed,
and deadlocks between multiple jobs that require gang
scheduling may occur. Sparrow is not alone: many cluster schedulers do not support gang scheduling [8, 9, 16].
Query-level policies Sparrow's performance could be improved by adding query-level scheduling policies. A user query (e.g., a SQL query executed using Shark) may be composed of many stages that are each executed using a separate Sparrow scheduling request; to optimize query response time, Sparrow should schedule queries in FIFO order. Currently, Sparrow's algorithm attempts to schedule jobs in FIFO order; adding query-level scheduling policies should improve end-to-end query performance.
Worker failures Handling worker failures is complicated by Sparrow's distributed design, because when a
worker fails, all schedulers with outstanding requests
at that worker must be informed. We envision handling
worker failures with a centralized state store that relies
on occasional heartbeats to maintain a list of currently
alive workers. The state store would periodically disseminate the list of live workers to all schedulers. Since the
information stored in the state store would be soft state,
it could easily be recreated in the event of a state store
failure.
Dynamically adapting the probe ratio Sparrow
could potentially improve performance by dynamically
adapting the probe ratio based on cluster load; however,
such an approach sacrifices some of the simplicity of
Sparrow's current design. Exploring whether dynamically changing the probe ratio would significantly increase performance is the subject of ongoing work.

9 Related Work
Scheduling in distributed systems has been extensively
studied in earlier work. Most existing cluster schedulers

scheduler design whereby each query is decomposed


into a serving tree; this approach exploits the internal structure of Dremel queries, so it is not generally applicable.
Many schedulers aim to allocate resources at coarse
granularity, either because tasks tend to be long-running
or because the cluster supports many applications
that each acquire some amount of resources and perform their own task-level scheduling (e.g., Mesos [8],
YARN [16], Omega [20]). These schedulers sacrifice request granularity in order to enforce complex scheduling policies; as a result, they provide insufficient latency
and/or throughput for scheduling sub-second tasks. High
performance computing schedulers fall into this category: they optimize for large jobs with complex constraints, and target maximum throughput in the tens
to hundreds of scheduling decisions per second (e.g.,
SLURM [10]). Similarly, Condor supports complex features including a rich constraint language, job checkpointing, and gang scheduling using a heavy-weight
matchmaking process that results in maximum scheduling throughput of 10 to 100 jobs per second [4].
In the theory literature, a substantial body of work
analyzes the performance of the power of two choices
load balancing technique, as summarized by Mitzenmacher [15]. To the best of our knowledge, no existing work explores performance for parallel jobs. Many
existing analyses consider placing balls into bins, and
recent work [18] has generalized this to placing multiple balls concurrently into multiple bins. This analysis
is not appropriate for a scheduling setting, because unlike bins, worker machines process tasks to empty their
queue. Other work analyzes scheduling for single tasks;
parallel jobs are fundamentally different because a parallel job cannot complete until the last of a large number
of tasks completes.
Straggler mitigation techniques (e.g., Dolly [2],
LATE [27], Mantri [3]) focus on variation in task execution time (rather than task wait time) and are complementary to Sparrow. For example, Mantri launches a
task on a second machine if the first version of the task
is progressing too slowly, a technique that could easily
be used by Sparrow's distributed schedulers.

and strict priorities. Experiments using a synthetic workload demonstrate that Sparrow is resilient to different
probe ratios and distributions of task durations. In light
of these results, we believe that distributed scheduling
using Sparrow presents a viable alternative to centralized schedulers for low latency parallel workloads.

11 Acknowledgments
We are indebted to Aurojit Panda for help with debugging EC2 performance anomalies, Shivaram Venkataraman for insightful comments on several drafts of this paper and for help with Spark integration, Sameer Agarwal
for help with running simulations, Satish Rao for help
with theoretical models of the system, and Peter Bailis,
Ali Ghodsi, Adam Oliner, Sylvia Ratnasamy, and Colin
Scott for helpful comments on earlier drafts of this paper.
We also thank our shepherd, John Wilkes, for helping to
shape the final version of the paper. Finally, we thank
the reviewers from HotCloud 2012, OSDI 2012, NSDI
2013, and SOSP 2013 for their helpful feedback.
This research is supported in part by a Hertz Foundation Fellowship, the Department of Defense through the
National Defense Science & Engineering Graduate Fellowship Program, NSF CISE Expeditions award CCF-1139158, DARPA XData Award FA8750-12-2-0331, Intel via the Intel Science and Technology Center for
Cloud Computing (ISTC-CC), and gifts from Amazon
Web Services, Google, SAP, Cisco, Clearstory Data,
Cloudera, Ericsson, Facebook, FitWave, General Electric, Hortonworks, Huawei, Microsoft, NetApp, Oracle,
Samsung, Splunk, VMware, WANdisco and Yahoo!.

References
[1] Apache Thrift. http://thrift.apache.org.

[2] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and


I. Stoica. Why Let Resources Idle? Aggressive
Cloning of Jobs with Dolly. In HotCloud, 2012.
[3] G. Ananthanarayanan, S. Kandula, A. Greenberg,
I. Stoica, Y. Lu, B. Saha, and E. Harris. Reining in
the Outliers in Map-Reduce Clusters using Mantri.
In Proc. OSDI, 2010.

10 Conclusion
This paper presents Sparrow, a stateless decentralized
scheduler that provides near optimal performance using
two key techniques: batch sampling and late binding. We
use a TPC-H workload to demonstrate that Sparrow can
provide median response times within 12% of an ideal
scheduler and survives scheduler failures. Sparrow enforces popular scheduler policies, including fair sharing

[4] D. Bradley, T. S. Clair, M. Farrellee, Z. Guo,


M. Livny, I. Sfiligoi, and T. Tannenbaum. An Update on the Scalability Limits of the Condor Batch
System. Journal of Physics: Conference Series,
331(6), 2011.

[17] K. Ousterhout, A. Panda, J. Rosen, S. Venkataraman, R. Xin, S. Ratnasamy, S. Shenker, and I. Stoica. The Case for Tiny Tasks in Compute Clusters.
In Proc. HotOS, 2013.

[5] J. Dean and L. A. Barroso. The Tail at Scale. Communications of the ACM, 56(2), February 2013.
[6] A. Demers, S. Keshav, and S. Shenker. Analysis
and Simulation of a Fair Queueing Algorithm. In
Proc. SIGCOMM, 1989.

[18] G. Park. A Generalization of Multiple Choice


Balls-into-Bins. In Proc. PODC, pages 297–298,
2011.

[7] D. L. Eager, E. D. Lazowska, and J. Zahorjan. Adaptive Load Sharing in Homogeneous Distributed Systems. IEEE Transactions on Software
Engineering, 1986.

[19] L. Rudolph, M. Slivkin-Allalouf, and E. Upfal. A


Simple Load Balancing Scheme for Task Allocation in Parallel Machines. In Proc. SPAA, 1991.

[8] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica. Mesos: A Platform For Fine-Grained Resource
Sharing in the Data Center. In Proc. NSDI, 2011.

[20] M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, and J. Wilkes. Omega: flexible, scalable


schedulers for large compute clusters. In Proc. EuroSys, 2013.

[9] M. Isard, V. Prabhakaran, J. Currey, U. Wieder,


K. Talwar, and A. Goldberg. Quincy: Fair Scheduling for Distributed Computing Clusters. In Proc.
SOSP, 2009.

[21] B. Sharma, V. Chudnovsky, J. L. Hellerstein, R. Rifaat, and C. R. Das. Modeling and Synthesizing
Task Placement Constraints in Google Compute
Clusters. In Proc. SOCC, 2011.

[10] M. A. Jette, A. B. Yoo, and M. Grondona.


SLURM: Simple Linux Utility for Resource Management. In Proc. Job Scheduling Strategies for
Parallel Processing, Lecture Notes in Computer
Science, pages 44–60. Springer, 2003.

[22] D. Shue, M. J. Freedman, and A. Shaikh. Performance Isolation and Fairness for Multi-Tenant
Cloud Storage. In Proc. OSDI, 2012.
[23] T. White. Hadoop: The Definitive Guide. O'Reilly
Media, 2009.

[11] M. Kornacker and J. Erickson. Cloudera Impala:


Real Time Queries in Apache Hadoop, For Real.
http://blog.cloudera.com/blog/2012/10/cloudera-impala-realtime-queries-in-apache-hadoopfor-real/, October 2012.

[24] R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin,


S. Shenker, and I. Stoica. Shark: SQL and Rich
Analytics at Scale. In Proc. SIGMOD, 2013.
[25] M. Zaharia, D. Borthakur, J. Sen Sarma,
K. Elmeleegy, S. Shenker, and I. Stoica. Delay
Scheduling: A Simple Technique For Achieving
Locality and Fairness in Cluster Scheduling. In
Proc. EuroSys, 2010.

[12] S. Melnik, A. Gubarev, J. J. Long, G. Romer,


S. Shivakumar, M. Tolton, and T. Vassilakis.
Dremel: Interactive Analysis of Web-Scale
Datasets. Proc. VLDB Endow., 2010.
[13] M. Mitzenmacher. How Useful is Old Information? volume 11, pages 6–20, 2000.

[26] M. Zaharia, M. Chowdhury, T. Das, A. Dave,


J. Ma, M. McCauley, M. J. Franklin, S. Shenker,
and I. Stoica. Resilient Distributed Datasets: A
Fault-Tolerant Abstraction for In-Memory Cluster
Computing. In Proc. NSDI, 2012.

[14] M. Mitzenmacher. The Power of Two Choices


in Randomized Load Balancing. IEEE Transactions on Parallel and Distributed Computing,
12(10):1094–1104, 2001.

[27] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz,


and I. Stoica. Improving MapReduce Performance
in Heterogeneous Environments. In Proc. OSDI,
2008.

[15] M. Mitzenmacher. The Power of Two Random


Choices: A Survey of Techniques and Results. In
S. Rajasekaran, P. Pardalos, J. Reif, and J. Rolim,
editors, Handbook of Randomized Computing, volume 1, pages 255–312. Springer, 2001.
[16] A. C. Murthy. The Next Generation of Apache
MapReduce. http://developer.yahoo.com/blogs/hadoop/next-generationapache-hadoop-mapreduce-3061.html,
February 2012.

Kafka: a Distributed Messaging System for Log Processing


Jay Kreps
LinkedIn Corp.
jkreps@linkedin.com

Neha Narkhede
LinkedIn Corp.
nnarkhede@linkedin.com

ABSTRACT
Log processing has become a critical component of the data
pipeline for consumer internet companies. We introduce Kafka, a
distributed messaging system that we developed for collecting and
delivering high volumes of log data with low latency. Our system
incorporates ideas from existing log aggregators and messaging
systems, and is suitable for both offline and online message
consumption. We made quite a few unconventional yet practical
design choices in Kafka to make our system efficient and scalable.
Our experimental results show that Kafka has superior
performance when compared to two popular messaging systems.
We have been using Kafka in production for some time and it is
processing hundreds of gigabytes of new data each day.

General Terms
Management, Performance, Design, Experimentation.

Keywords
messaging, distributed, log processing, throughput, online.

1. Introduction
There is a large amount of log data generated at any sizable
internet company. This data typically includes (1) user activity
events corresponding to logins, pageviews, clicks, likes,
sharing, comments, and search queries; (2) operational metrics
such as service call stack, call latency, errors, and system metrics
such as CPU, memory, network, or disk utilization on each
machine. Log data has long been a component of analytics used to
track user engagement, system utilization, and other metrics.
However recent trends in internet applications have made activity
data a part of the production data pipeline used directly in site
features. These uses include (1) search relevance, (2)
recommendations which may be driven by item popularity or co-occurrence in the activity stream, (3) ad targeting and reporting,
and (4) security applications that protect against abusive behaviors
such as spam or unauthorized data scraping, and (5) newsfeed
features that aggregate user status updates or actions for their
friends or connections to read.
This production, real-time usage of log data creates new
challenges for data systems because its volume is orders of
magnitude larger than the real data. For example, search,
recommendations, and advertising often require computing
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
NetDB'11, Jun. 12, 2011, Athens, Greece.
Copyright 2011 ACM 978-1-4503-0652-2/11/06$10.00.

Jun Rao
LinkedIn Corp.
jrao@linkedin.com

granular click-through rates, which generate log records not only


for every user click, but also for dozens of items on each page that
are not clicked. Every day, China Mobile collects 58TB of phone
call records [11] and Facebook gathers almost 6TB of various user
activity events [12].
Many early systems for processing this kind of data relied on
physically scraping log files off production servers for analysis. In
recent years, several specialized distributed log aggregators have
been built, including Facebook's Scribe [6], Yahoo's Data Highway [4], and Cloudera's Flume [3]. Those systems are
primarily designed for collecting and loading the log data into a
data warehouse or Hadoop [8] for offline consumption. At
LinkedIn (a social network site), we found that in addition to
traditional offline analytics, we needed to support most of the
real-time applications mentioned above with delays of no more
than a few seconds.
We have built a novel messaging system for log processing called
Kafka [18] that combines the benefits of traditional log
aggregators and messaging systems. On the one hand, Kafka is
distributed and scalable, and offers high throughput. On the other
hand, Kafka provides an API similar to a messaging system and
allows applications to consume log events in real time. Kafka has
been open sourced and used successfully in production at
LinkedIn for more than 6 months. It greatly simplifies our
infrastructure, since we can exploit a single piece of software for
both online and offline consumption of the log data of all types.
The rest of the paper is organized as follows. We revisit
traditional messaging systems and log aggregators in Section 2. In
Section 3, we describe the architecture of Kafka and its key design
principles. We describe our deployment of Kafka at LinkedIn in
Section 4 and the performance results of Kafka in Section 5. We
discuss future work and conclude in Section 6.

2. Related Work
Traditional enterprise messaging systems [1][7][15][17] have
existed for a long time and often play a critical role as an event
bus for processing asynchronous data flows. However, there are a
few reasons why they tend not to be a good fit for log processing.
First, there is a mismatch in features offered by enterprise
systems. Those systems often focus on offering a rich set of
delivery guarantees. For example, IBM Websphere MQ [7] has
transactional support that allows an application to insert messages
into multiple queues atomically. The JMS [14] specification
allows each individual message to be acknowledged after
consumption, potentially out of order. Such delivery guarantees
are often overkill for collecting log data. For instance, losing a
few pageview events occasionally is certainly not the end of the
world. Those unneeded features tend to increase the complexity of
both the API and the underlying implementation of those systems.
Second, many systems do not focus as strongly on throughput as
their primary design constraint. For example, JMS has no API to
allow the producer to explicitly batch multiple messages into a

single request. This means each message requires a full TCP/IP


roundtrip, which is not feasible for the throughput requirements of
our domain. Third, those systems are weak in distributed support.
There is no easy way to partition and store messages on multiple
machines. Finally, many messaging systems assume near
immediate consumption of messages, so the queue of unconsumed
messages is always fairly small. Their performance degrades
significantly if messages are allowed to accumulate, as is the case
for offline consumers such as data warehousing applications that
do periodic large loads rather than continuous consumption.
A number of specialized log aggregators have been built over the
last few years. Facebook uses a system called Scribe. Each frontend machine can send log data to a set of Scribe machines over
sockets. Each Scribe machine aggregates the log entries and
periodically dumps them to HDFS [9] or an NFS device. Yahoo's data highway project has a similar dataflow. A set of machines aggregate events from the clients and roll out "minute" files,
which are then added to HDFS. Flume is a relatively new log
aggregator developed by Cloudera. It supports extensible pipes
and sinks, and makes streaming log data very flexible. It also
has more integrated distributed support. However, most of those
systems are built for consuming the log data offline, and often
expose implementation details unnecessarily (e.g., "minute" files)
to the consumer. Additionally, most of them use a push model
in which the broker forwards data to consumers. At LinkedIn, we
find the pull model more suitable for our applications since each
consumer can retrieve the messages at the maximum rate it can
sustain and avoid being flooded by messages pushed faster than it
can handle. The pull model also makes it easy to rewind a
consumer and we discuss this benefit at the end of Section 3.2.
More recently, Yahoo! Research developed a new distributed
pub/sub system called HedWig [13]. HedWig is highly scalable
and available, and offers strong durability guarantees. However, it
is mainly intended for storing the commit log of a data store.

topic will be evenly distributed into these sub-streams. The details


about how Kafka distributes the messages are described later in
Section 3.2. Each message stream provides an iterator interface
over the continual stream of messages being produced. The
consumer then iterates over every message in the stream and
processes the payload of the message. Unlike traditional iterators,
the message stream iterator never terminates. If there are currently
no more messages to consume, the iterator blocks until new
messages are published to the topic. We support both the point-to-point delivery model in which multiple consumers jointly
consume a single copy of all messages in a topic, as well as the
publish/subscribe model in which multiple consumers each
retrieve its own copy of a topic.
Sample consumer code:
streams[] = Consumer.createMessageStreams("topic1", 1)
for (message : streams[0]) {
  bytes = message.payload();
  // do something with the bytes
}
The overall architecture of Kafka is shown in Figure 1. Since
Kafka is distributed in nature, a Kafka cluster typically consists
of multiple brokers. To balance load, a topic is divided into
multiple partitions and each broker stores one or more of those
partitions. Multiple producers and consumers can publish and
retrieve messages at the same time. In Section 3.1, we describe the
layout of a single partition on a broker and a few design choices
that we selected to make accessing a partition efficient. In Section
3.2, we describe how the producer and the consumer interact with
multiple brokers in a distributed setting. We discuss the delivery
guarantees of Kafka in Section 3.3.
3. Kafka Architecture and Design Principles


Because of limitations in existing systems, we developed a new
messaging-based log aggregator Kafka. We first introduce the
basic concepts in Kafka. A stream of messages of a particular type
is defined by a topic. A producer can publish messages to a topic.
The published messages are then stored at a set of servers called
brokers. A consumer can subscribe to one or more topics from the
brokers, and consume the subscribed messages by pulling data
from the brokers.
Messaging is conceptually simple, and we have tried to make the
Kafka API equally simple to reflect this. Instead of showing the
exact API, we present some sample code to show how the API is
used. The sample code of the producer is given below. A message
is defined to contain just a payload of bytes. A user can choose
her favorite serialization method to encode a message. For
efficiency, the producer can send a set of messages in a single
publish request.
Sample producer code:
producer = new Producer();
message = new Message("test message str".getBytes());
set = new MessageSet(message);
producer.send("topic1", set);
To subscribe to a topic, a consumer first creates one or more
message streams for the topic. The messages published to that

Figure 1. Kafka Architecture. [The figure shows multiple producers publishing to a cluster of brokers (BROKER 1, 2, and 3), each storing partitions such as topic1/part1, topic1/part2, and topic2/part1, with multiple consumers pulling messages from the brokers.]

3.1 Efficiency on a Single Partition


We made a few decisions in Kafka to make the system efficient.
Simple storage: Kafka has a very simple storage layout. Each
partition of a topic corresponds to a logical log. Physically, a log
is implemented as a set of segment files of approximately the
same size (e.g., 1GB). Every time a producer publishes a message
to a partition, the broker simply appends the message to the last
segment file. For better performance, we flush the segment files to
disk only after a configurable number of messages have been
published or a certain amount of time has elapsed. A message is
only exposed to the consumers after it is flushed.
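To make the append-and-flush behavior concrete, here is a minimal Python sketch (ours, not Kafka's broker code) assuming a count-based flush threshold; the time-based threshold and the rolling of new segment files are omitted.

import os

class SegmentWriter:
    """Sketch of the broker append path: messages are appended to the active
    segment file and flushed only every N messages, and a message becomes
    visible to consumers only once it has been flushed."""
    def __init__(self, path, flush_every=1000):
        self.file = open(path, "ab")
        self.flush_every = flush_every
        self.unflushed = 0
        self.exposed_end = self.file.tell()    # consumers may read only up to this byte position

    def append(self, payload):
        self.file.write(payload)
        self.unflushed += 1
        if self.unflushed >= self.flush_every:
            self.file.flush()
            os.fsync(self.file.fileno())
            self.exposed_end = self.file.tell()
            self.unflushed = 0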

Unlike typical messaging systems, a message stored in Kafka


doesn't have an explicit message id. Instead, each message is
addressed by its logical offset in the log. This avoids the overhead
of maintaining auxiliary, seek-intensive random-access index
structures that map the message ids to the actual message
locations. Note that our message ids are increasing but not
consecutive. To compute the id of the next message, we have to
add the length of the current message to its id. From now on, we
will use message ids and offsets interchangeably.
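As a tiny illustration of this addressing scheme (our own sketch, with made-up payloads), ids are byte offsets, so the id of the next message is the current id plus the current message's length:

messages = [b"pageview:/home", b"click:ad-42", b"login:user7"]   # hypothetical payloads

offset = 0
ids = []
for payload in messages:
    ids.append(offset)            # this message's id is its starting offset in the log
    offset += len(payload)        # next id = current id + current message length

print(ids)                        # -> [0, 14, 25]: increasing, but not consecutive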
A consumer always consumes messages from a particular
partition sequentially. If the consumer acknowledges a particular
message offset, it implies that the consumer has received all
messages prior to that offset in the partition. Under the covers, the
consumer is issuing asynchronous pull requests to the broker to
have a buffer of data ready for the application to consume. Each
pull request contains the offset of the message from which the
consumption begins and an acceptable number of bytes to fetch.
Each broker keeps in memory a sorted list of offsets, including the
offset of the first message in every segment file. The broker
locates the segment file where the requested message resides by
searching the offset list, and sends the data back to the consumer.
After a consumer receives a message, it computes the offset of the
next message to consume and uses it in the next pull request. The
layout of a Kafka log and the in-memory index is depicted in
Figure 2. Each box shows the offset of a message.
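The broker-side lookup amounts to a binary search over the sorted list of segment start offsets; a minimal Python sketch (ours, reusing the example offsets shown in Figure 2):

import bisect

# Sorted start offsets of the segment files, kept in memory by the broker.
segment_start_offsets = [0, 14517018, 30706778, 2050706778]

def locate_segment(requested_offset):
    """Return the index of the segment file that contains requested_offset."""
    return bisect.bisect_right(segment_start_offsets, requested_offset) - 1

print(locate_segment(15000000))   # -> 1, the segment starting at offset 14517018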

Figure 2. Kafka log. [The figure shows the in-memory index of segment start offsets (msg-00000000000, msg-00014517018, msg-00030706778, ..., msg-02050706778) and the corresponding segment files, with new messages appended at the tail, reads served from the segments, and old segments deleted.]

Efficient transfer: We are very careful about transferring data in


and out of Kafka. Earlier, we have shown that the producer can
submit a set of messages in a single send request. Although the
end consumer API iterates one message at a time, under the
covers, each pull request from a consumer also retrieves multiple
messages up to a certain size, typically hundreds of kilobytes.
Another unconventional choice that we made is to avoid explicitly
caching messages in memory at the Kafka layer. Instead, we rely
on the underlying file system page cache. This has the main
benefit of avoiding double buffering---messages are only cached
in the page cache. This has the additional benefit of retaining
warm cache even when a broker process is restarted. Since Kafka
doesn't cache messages in process at all, it has very little overhead
in garbage collecting its memory, making efficient
implementation in a VM-based language feasible. Finally, since
both the producer and the consumer access the segment files

sequentially, with the consumer often lagging the producer by a


small amount, normal operating system caching heuristics are
very effective (specifically write-through caching and read-ahead). We have found that both the production and the
consumption have consistent performance linear to the data size,
up to many terabytes of data.
In addition we optimize the network access for consumers. Kafka
is a multi-subscriber system and a single message may be
consumed multiple times by different consumer applications. A
typical approach to sending bytes from a local file to a remote
socket involves the following steps: (1) read data from the storage
media to the page cache in an OS, (2) copy data in the page cache
to an application buffer, (3) copy application buffer to another
kernel buffer, (4) send the kernel buffer to the socket. This
includes 4 data copies and 2 system calls. On Linux and other
Unix operating systems, there exists a sendfile API [5] that can
directly transfer bytes from a file channel to a socket channel.
This typically avoids 2 of the copies and 1 system call introduced
in steps (2) and (3). Kafka exploits the sendfile API to efficiently
deliver bytes in a log segment file from a broker to a consumer.
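A rough sketch of this zero-copy path, using the Linux sendfile(2)
call; the file and socket descriptors are assumed to be set up
elsewhere, and error handling is abbreviated.

// Sketch of delivering a chunk of a log segment to a consumer socket with
// Linux sendfile(2), so the bytes never pass through user-space buffers.
#include <cstddef>
#include <sys/sendfile.h>
#include <sys/types.h>

// Sends up to `count` bytes of `segment_fd`, starting at `file_offset`,
// directly to `socket_fd`. Returns the number of bytes sent, or -1 on error.
ssize_t SendChunk(int socket_fd, int segment_fd, off_t file_offset, size_t count) {
  off_t pos = file_offset;  // sendfile advances pos by the bytes it consumed
  return sendfile(socket_fd, segment_fd, &pos, count);
}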
Stateless broker: Unlike most other messaging systems, in
Kafka, the information about how much each consumer has
consumed is not maintained by the broker, but by the consumer
itself. Such a design reduces a lot of the complexity and the
overhead on the broker. However, this makes it tricky to delete a
message, since a broker doesn't know whether all subscribers
have consumed the message. Kafka solves this problem by using a
simple time-based SLA for the retention policy. A message is
automatically deleted if it has been retained in the broker longer
than a certain period, typically 7 days. This solution works well in
practice. Most consumers, including the offline ones, finish
consuming either daily, hourly, or in real-time. The fact that the
performance of Kafka doesn't degrade with a larger data size
makes this long retention feasible.
There is an important side benefit of this design. A consumer can
deliberately rewind to an old offset and re-consume data.
This violates the common contract of a queue, but proves to be an
essential feature for many consumers. For example, when there is
an error in application logic in the consumer, the application can
replay certain messages after the error is fixed. This is
particularly important to ETL data loads into our data warehouse
or Hadoop system. As another example, the consumed data may
be flushed to a persistent store only periodically (e.g., a full-text
indexer). If the consumer crashes, the unflushed data is lost. In
this case, the consumer can checkpoint the smallest offset of the
unflushed messages and re-consume from that offset when it is
restarted. We note that rewinding a consumer is much easier to
support in the pull model than the push model.
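A minimal sketch of this checkpoint-and-rewind pattern on the
consumer side; the flush and offset-commit calls are placeholders for
whatever persistent store and offset API the application actually uses.

// Sketch of checkpointing the restart offset only after buffered messages
// have been flushed, so a crash loses no data (it may re-deliver some).
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct Message { std::uint64_t offset; std::string payload; };

class CheckpointingConsumer {
 public:
  void OnMessage(const Message& m) {
    buffer_.push_back(m);
    if (buffer_.size() >= kFlushBatch) {
      FlushToStore(buffer_);                    // placeholder: e.g. full-text indexer
      CommitOffset(buffer_.back().offset + 1);  // placeholder: persist restart point
      buffer_.clear();
    }
  }

 private:
  static const std::size_t kFlushBatch = 1000;
  // Placeholders for the application's persistent store and offset storage.
  void FlushToStore(const std::vector<Message>& msgs) {}
  void CommitOffset(std::uint64_t next_offset) {}
  std::vector<Message> buffer_;
};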

3.2 Distributed Coordination


We now describe how the producers and the consumers behave in
a distributed setting. Each producer can publish a message to
either a randomly selected partition or a partition semantically
determined by a partitioning key and a partitioning function. We
will focus on how the consumers interact with the brokers.
Kafka has the concept of consumer groups. Each consumer group
consists of one or more consumers that jointly consume a set of
subscribed topics, i.e., each message is delivered to only one of
the consumers within the group. Different consumer groups each
independently consume the full set of subscribed messages and no
coordination is needed across consumer groups. The consumers
within the same group can be in different processes or on different
machines. Our goal is to divide the messages stored in the brokers
evenly among the consumers, without introducing too much
coordination overhead.
Our first decision is to make a partition within a topic the smallest
unit of parallelism. This means that at any given time, all
messages from one partition are consumed only by a single
consumer within each consumer group. Had we allowed multiple
consumers to simultaneously consume a single partition, they
would have to coordinate who consumes what messages, which
necessitates locking and state maintenance overhead. In contrast,
in our design consuming processes only need to co-ordinate when
the consumers rebalance the load, an infrequent event. In order for
the load to be truly balanced, we require many more partitions in a
topic than the consumers in each group. We can easily achieve
this by over-partitioning a topic.
The second decision that we made is to not have a central
master node, but instead let consumers coordinate among
themselves in a decentralized fashion. Adding a master can
complicate the system since we have to further worry about
master failures. To facilitate the coordination, we employ a highly
available consensus service, Zookeeper [10]. Zookeeper has a
very simple, file-system-like API. One can create a path, set the
value of a path, read the value of a path, delete a path, and list the
children of a path. It does a few more interesting things: (a) one
can register a watcher on a path and get notified when the children
of a path or the value of a path has changed; (b) a path can be
created as ephemeral (as opposed to persistent), which means that
if the creating client is gone, the path is automatically removed by
the Zookeeper server; (c) Zookeeper replicates its data to multiple
servers, which makes the data highly reliable and available.
Kafka uses Zookeeper for the following tasks: (1) detecting the
addition and the removal of brokers and consumers, (2) triggering
a rebalance process in each consumer when the above events
happen, and (3) maintaining the consumption relationship and
keeping track of the consumed offset of each partition.
Specifically, when each broker or consumer starts up, it stores its
information in a broker or consumer registry in Zookeeper. The
broker registry contains the broker's host name and port, and the
set of topics and partitions stored on it. The consumer registry
includes the consumer group to which a consumer belongs and the
set of topics that it subscribes to. Each consumer group is
associated with an ownership registry and an offset registry in
Zookeeper. The ownership registry has one path for every
subscribed partition and the path value is the id of the consumer
currently consuming from this partition (we use the terminology
that the consumer owns this partition). The offset registry stores,
for each subscribed partition, the offset of the last consumed
message in the partition.
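To make this layout concrete, the sketch below writes one plausible
set of Zookeeper paths for a consumer group; the exact path names and
the ZkClient stand-in are our own illustration, not taken from Kafka
or from the Zookeeper client API.

// Illustrative Zookeeper registry layout for a consumer group. The path names
// and the ZkClient type are assumptions made for this sketch only.
#include <map>
#include <string>

struct ZkClient {  // stand-in for a real Zookeeper client
  std::map<std::string, std::string> nodes;
  void CreateEphemeral(const std::string& p, const std::string& v) { nodes[p] = v; }
  void CreatePersistent(const std::string& p, const std::string& v) { nodes[p] = v; }
};

void RegisterConsumer(ZkClient& zk, const std::string& group,
                      const std::string& consumer_id, const std::string& topics) {
  // Consumer registry: group membership and subscriptions (ephemeral).
  zk.CreateEphemeral("/consumers/" + group + "/ids/" + consumer_id, topics);
}

void RecordOwnershipAndOffset(ZkClient& zk, const std::string& group,
                              const std::string& topic, int partition,
                              const std::string& consumer_id, long long offset) {
  const std::string suffix = topic + "/" + std::to_string(partition);
  // Ownership registry: which consumer currently owns the partition (ephemeral).
  zk.CreateEphemeral("/consumers/" + group + "/owners/" + suffix, consumer_id);
  // Offset registry: last consumed offset of the partition (persistent).
  zk.CreatePersistent("/consumers/" + group + "/offsets/" + suffix,
                      std::to_string(offset));
}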

Algorithm 1: rebalance process for consumer Ci in group G


For each topic T that Ci subscribes to {
   remove partitions owned by Ci from the ownership registry
   read the broker and the consumer registries from Zookeeper
   compute PT = partitions available in all brokers under topic T
   compute CT = all consumers in G that subscribe to topic T
   sort PT and CT
   let j be the index position of Ci in CT and let N = |PT|/|CT|
   assign partitions from j*N to (j+1)*N - 1 in PT to consumer Ci
   for each assigned partition p {
      set the owner of p to Ci in the ownership registry
      let Op = the offset of partition p stored in the offset registry
      invoke a thread to pull data in partition p from offset Op
   }
}
During the initial startup of a consumer, or when the consumer is
notified about a broker/consumer change through the watcher, the
consumer initiates a rebalance process to determine the new
subset of partitions that it should consume from. The process is
described in Algorithm 1. By reading the broker and the consumer
registries from Zookeeper, the consumer first computes the set (PT)
of partitions available for each subscribed topic T and the set (CT)
of consumers subscribing to T. It then range-partitions PT into |CT|
chunks and deterministically picks one chunk to own. For each
partition the consumer picks, it writes itself as the new owner of
the partition in the ownership registry. Finally, the consumer
begins a thread to pull data from each owned partition, starting
from the offset stored in the offset registry. As messages get
pulled from a partition, the consumer periodically updates the
latest consumed offset in the offset registry.
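A compact sketch of the deterministic chunk selection from Algorithm 1;
the types are our own, and partitions left over when |PT| is not a
multiple of |CT| are treated the same way as in the pseudocode above.

// Every consumer sorts the same inputs, so each one independently picks a
// disjoint chunk of partitions without talking to the other consumers.
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

std::vector<std::string> PartitionsForConsumer(
    std::vector<std::string> pt,   // PT: partitions of topic T
    std::vector<std::string> ct,   // CT: consumer ids in group G subscribed to T
    const std::string& me) {       // this consumer's id (assumed to be in ct)
  std::sort(pt.begin(), pt.end());
  std::sort(ct.begin(), ct.end());
  const std::size_t j = std::lower_bound(ct.begin(), ct.end(), me) - ct.begin();
  const std::size_t n = pt.size() / ct.size();   // N = |PT| / |CT|
  std::vector<std::string> mine;
  for (std::size_t k = j * n; k < (j + 1) * n && k < pt.size(); ++k)
    mine.push_back(pt[k]);                       // partitions j*N .. (j+1)*N - 1
  return mine;
}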
When there are multiple consumers within a group, each of them
will be notified of a broker or a consumer change. However, the
notification may come at slightly different times at the consumers.
So, it is possible that one consumer tries to take ownership of a
partition still owned by another consumer. When this happens, the
first consumer simply releases all the partitions that it currently
owns, waits a bit and retries the rebalance process. In practice, the
rebalance process often stabilizes after only a few retries.
When a new consumer group is created, no offsets are available in
the offset registry. In this case, the consumers will begin with
either the smallest or the largest offset (depending on a
configuration) available on each subscribed partition, using an
API that we provide on the brokers.

The paths created in Zookeeper are ephemeral for the broker
registry, the consumer registry and the ownership registry, and
persistent for the offset registry. If a broker fails, all partitions on
it are automatically removed from the broker registry. The failure
of a consumer causes it to lose its entry in the consumer registry
and all partitions that it owns in the ownership registry. Each
consumer registers a Zookeeper watcher on both the broker
registry and the consumer registry, and will be notified whenever
a change in the broker set or the consumer group occurs.

3.3 Delivery Guarantees

In general, Kafka only guarantees at-least-once delivery. Exactly-once delivery typically requires two-phase commits and is not
necessary for our applications. Most of the time, a message is
delivered exactly once to each consumer group. However, in the
case when a consumer process crashes without a clean shutdown,
the consumer process that takes over those partitions owned by
the failed consumer may get some duplicate messages that are
after the last offset successfully committed to Zookeeper. If an
application cares about duplicates, it must add its own deduplication logic, either using the offsets that we return to the
consumer or some unique key within the message. This is usually
a more cost-effective approach than using two-phase commits.
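One simple form of such de-duplication logic, sketched under the
assumption that the application records, per partition, the highest
offset it has already applied.

// Sketch of application-level de-duplication keyed on message offsets:
// anything at or below the highest offset already applied is skipped.
#include <cstdint>
#include <map>
#include <string>

class Deduplicator {
 public:
  // Returns true if the message was new and applied, false if it was a duplicate.
  bool ApplyOnce(const std::string& partition, std::uint64_t offset,
                 const std::string& payload) {
    auto it = last_applied_.find(partition);
    if (it != last_applied_.end() && offset <= it->second) return false;
    Apply(payload);                     // application-specific processing
    last_applied_[partition] = offset;  // in practice, persisted with the side effects
    return true;
  }

 private:
  void Apply(const std::string& payload) {}  // placeholder
  std::map<std::string, std::uint64_t> last_applied_;
};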

Kafka guarantees that messages from a single partition are
delivered to a consumer in order. However, there is no guarantee
on the ordering of messages coming from different partitions.

To avoid log corruption, Kafka stores a CRC for each message in
the log. If there is any I/O error on the broker, Kafka runs a
recovery process to remove those messages with inconsistent
CRCs. Having the CRC at the message level also allows us to
check network errors after a message is produced or consumed.
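For illustration, a per-message integrity check of this kind could look
as follows, using zlib's crc32; the record layout (payload plus stored
CRC) is a simplification, not Kafka's on-disk format.

// Sketch of a per-message CRC check with zlib's crc32().
#include <zlib.h>
#include <cstdint>
#include <string>

std::uint32_t ComputeCrc(const std::string& payload) {
  return static_cast<std::uint32_t>(crc32(
      0L, reinterpret_cast<const Bytef*>(payload.data()),
      static_cast<uInt>(payload.size())));
}

// Used both by a recovery pass over the log and by clients that want to
// detect corruption after a message is produced or consumed.
bool IsValid(const std::string& payload, std::uint32_t stored_crc) {
  return ComputeCrc(payload) == stored_crc;
}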
If a broker goes down, any message stored on it that is not yet consumed
becomes unavailable. If the storage system on a broker is
permanently damaged, any unconsumed message is lost forever.
In the future, we plan to add built-in replication in Kafka to
redundantly store each message on multiple brokers.

4. Kafka Usage at LinkedIn


In this section, we describe how we use Kafka at LinkedIn. Figure
3 shows a simplified version of our deployment. We have one
Kafka cluster co-located with each datacenter where our user-facing services run. The frontend services generate various kinds
of log data and publish it to the local Kafka brokers in batches.
We rely on a hardware load-balancer to distribute the publish
requests to the set of Kafka brokers evenly. The online consumers
of Kafka run in services within the same datacenter.
Figure 3. Kafka Deployment. (The figure shows frontend services in
the main datacenter publishing, through a load balancer, to the
local Kafka brokers, which are read by realtime services; a second
Kafka cluster in the analysis datacenter pulls from the live brokers
and feeds Hadoop and the data warehouse (DWH).)

We also deploy a cluster of Kafka in a separate datacenter for
offline analysis, located geographically close to our Hadoop
cluster and other data warehouse infrastructure. This instance of
Kafka runs a set of embedded consumers to pull data from the
Kafka instances in the live datacenters. We then run data load jobs
to pull data from this replica cluster of Kafka into Hadoop and our
data warehouse, where we run various reporting jobs and
analytical processes on the data. We also use this Kafka cluster
for prototyping and have the ability to run simple scripts against
the raw event streams for ad hoc querying. Without too much tuning,
the end-to-end latency for the complete pipeline is about 10
seconds on average, good enough for our requirements.

Currently, Kafka accumulates hundreds of gigabytes of data and
close to a billion messages per day, which we expect will grow
significantly as we finish converting legacy systems to take
advantage of Kafka. More types of messages will be added in the
future. The rebalance process is able to automatically redirect the
consumption when the operations staff start or stop brokers for
software or hardware maintenance.

Our tracking also includes an auditing system to verify that there
is no data loss along the whole pipeline. To facilitate that, each
message carries the timestamp and the server name indicating when
and where it was generated. We instrument each producer such that
it periodically generates a monitoring event, which records the
number of messages published by that producer for each topic
within a fixed time window. The producer publishes the monitoring
events to Kafka in a separate topic. The consumers can then count
the number of messages that they have received from a given topic
and validate those counts against the monitoring events to verify
the correctness of the data.

Loading into the Hadoop cluster is accomplished by implementing
a special Kafka input format that allows MapReduce jobs to
directly read data from Kafka. A MapReduce job loads the raw
data and then groups and compresses it for efficient processing in
the future. The stateless broker and client-side storage of message
offsets again come into play here, allowing the MapReduce task
management (which allows tasks to fail and be restarted) to
handle the data load in a natural way without duplicating or losing
messages in the event of a task restart. Both data and offsets are
stored in HDFS only on the successful completion of the job.

We chose to use Avro [2] as our serialization protocol since it is
efficient and supports schema evolution. For each message, we
store the id of its Avro schema and the serialized bytes in the
payload. This schema allows us to enforce a contract to ensure
compatibility between data producers and consumers. We use a
lightweight schema registry service to map the schema id to the
actual schema. When a consumer gets a message, it looks up in
the schema registry to retrieve the schema, which is used to
decode the bytes into an object (this lookup need only be done
once per schema, since the values are immutable).
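A sketch of that consumer-side decoding path; the envelope layout, the
registry fetch, and the Avro decode call are placeholders, and the
point is the per-schema-id cache that immutable registry entries make
safe.

// Sketch of decoding a message that carries an Avro schema id plus payload,
// with a cached lookup against a schema registry. The registry fetch and the
// Avro decoding are placeholders, not a real client implementation.
#include <cstdint>
#include <map>
#include <string>

struct Envelope { std::int64_t schema_id; std::string payload; };

class Decoder {
 public:
  std::string Decode(const Envelope& msg) {
    auto it = schema_cache_.find(msg.schema_id);
    if (it == schema_cache_.end()) {
      // Registry entries are immutable, so one remote lookup per schema id suffices.
      it = schema_cache_.emplace(msg.schema_id,
                                 FetchSchemaFromRegistry(msg.schema_id)).first;
    }
    return DecodeAvro(it->second, msg.payload);
  }

 private:
  // Placeholders: a real implementation would call the schema registry service
  // and the Avro library here.
  std::string FetchSchemaFromRegistry(std::int64_t id) { return "schema-" + std::to_string(id); }
  std::string DecodeAvro(const std::string& schema, const std::string& bytes) { return bytes; }
  std::map<std::int64_t, std::string> schema_cache_;
};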

5. Experimental Results
We conducted an experimental study, comparing the performance
of Kafka with Apache ActiveMQ v5.4 [1], a popular open-source
implementation of JMS, and RabbitMQ v2.4 [16], a message
system known for its performance. We used ActiveMQ's default
persistent message store KahaDB. Although not presented here,
we also tested an alternative AMQ message store and found its
performance very similar to that of KahaDB. Whenever possible,
we tried to use comparable settings in all systems.
We ran our experiments on two Linux machines, each with eight
2GHz cores, 16GB of memory, and six disks in a RAID 10
configuration. The two machines are connected with a 1Gb
network link. One of the
machines was used as the broker and the other machine was used
as the producer or the consumer.
Producer Test: We configured the broker in all systems to
asynchronously flush messages to its persistence store. For each
system, we ran a single producer to publish a total of 10 million
messages, each of 200 bytes. We configured the Kafka producer
to send messages in batches of size 1 and 50. ActiveMQ and
RabbitMQ don't seem to have an easy way to batch messages, and
we assume that they used a batch size of 1. The results are shown in
Figure 4. The x-axis represents the amount of data sent to the
broker over time in MB, and the y-axis corresponds to the
producer throughput in messages per second. On average, Kafka
can publish messages at the rate of 50,000 and 400,000 messages
per second for batch sizes of 1 and 50, respectively. These numbers
are orders of magnitude higher than those of ActiveMQ, and at least
2 times higher than RabbitMQ.

Figure 4. Producer Performance

There are a few reasons why Kafka performed much better. First,
the Kafka producer currently doesn't wait for acknowledgements
from the broker and sends messages as fast as the broker can
handle. This significantly increased the throughput of the
publisher. With a batch size of 50, a single Kafka producer almost
saturated the 1Gb link between the producer and the broker. This
is a valid optimization for the log aggregation case, as data must
be sent asynchronously to avoid introducing any latency into the
live serving of traffic. We note that without acknowledging the
producer, there is no guarantee that every published message is
actually received by the broker. For many types of log data, it is
desirable to trade durability for throughput, as long as the number
of dropped messages is relatively small. However, we do plan to
address the durability issue for more critical data in the future.
Second, Kafka has a more efficient storage format. On average,
each message had an overhead of 9 bytes in Kafka, versus 144
bytes in ActiveMQ. This means that ActiveMQ was using 70%
more space than Kafka to store the same set of 10 million
messages. One overhead in ActiveMQ came from the heavy
message header, required by JMS. Another overhead was the cost
of maintaining various indexing structures. We observed that one
of the busiest threads in ActiveMQ spent most of its time
accessing a B-Tree to maintain message metadata and state.
Finally, batching greatly improved the throughput by amortizing
the RPC overhead. In Kafka, a batch size of 50 messages
improved the throughput by almost an order of magnitude.
Consumer Test: In the second experiment, we tested the
performance of the consumer. Again, for all systems, we used a
single consumer to retrieve a total of 10 million messages. We
configured all systems so that each pull request should prefetch
approximately the same amount of data, up to 1000 messages or
about 200KB. For both ActiveMQ and RabbitMQ, we set the
consumer acknowledge mode to be automatic. Since all messages
fit in memory, all systems were serving data from the page cache
of the underlying file system or some in-memory buffers. The
results are presented in Figure 5.

Figure 5. Consumer Performance

On average, Kafka consumed 22,000 messages per second, more
than 4 times that of ActiveMQ and RabbitMQ. We can think of
several reasons. First, since Kafka has a more efficient storage
format, fewer bytes were transferred from the broker to the
consumer in Kafka. Second, the broker in both ActiveMQ and
RabbitMQ had to maintain the delivery state of every message.
We observed that one of the ActiveMQ threads was busy writing
KahaDB pages to disks during this test. In contrast, there were no
disk write activities on the Kafka broker. Finally, by using the
sendfile API, Kafka reduces the transmission overhead.
We close the section by noting that the purpose of the experiment
is not to show that other messaging systems are inferior to Kafka.
After all, both ActiveMQ and RabbitMQ have more features than
Kafka. The main point is to illustrate the potential performance
gain that can be achieved by a specialized system.

6. Conclusion and Future Work


We present a novel system called Kafka for processing huge
volumes of log data streams. Like a messaging system, Kafka
employs a pull-based consumption model that allows an
application to consume data at its own rate and rewind the
consumption whenever needed. By focusing on log processing
applications, Kafka achieves much higher throughput than
conventional messaging systems. It also provides integrated
distributed support and can scale out. We have been using Kafka
successfully at LinkedIn for both offline and online applications.
There are a number of directions that we'd like to pursue in the
future. First, we plan to add built-in replication of messages across
multiple brokers to allow durability and data availability
guarantees even in the case of unrecoverable machine failures.
We'd like to support both asynchronous and synchronous
replication models to allow some tradeoff between producer
latency and the strength of the guarantees provided. An
application can choose the right level of redundancy based on its
requirement on durability, availability and throughput. Second, we
want to add some stream processing capability in Kafka. After
retrieving messages from Kafka, real time applications often
perform similar operations such as window-based counting and
joining each message with records in a secondary store or with
messages in another stream. At the lowest level this is supported
by semantically partitioning messages on the join key during
publishing so that all messages sent with a particular key go to the
same partition and hence arrive at a single consumer process. This
provides the foundation for processing distributed streams across
a cluster of consumer machines. On top of this we feel a library of
helpful stream utilities, such as different windowing functions or
join techniques, will be beneficial to these kinds of applications.

7. REFERENCES

[1] http://activemq.apache.org/
[2] http://avro.apache.org/
[3] Cloudera's Flume, https://github.com/cloudera/flume
[4] http://developer.yahoo.com/blogs/hadoop/posts/2010/06/enabling_hadoop_batch_processi_1/
[5] Efficient data transfer through zero copy: https://www.ibm.com/developerworks/linux/library/j-zerocopy/
[6] Facebook's Scribe, http://www.facebook.com/note.php?note_id=32008268919
[7] IBM Websphere MQ: http://www-01.ibm.com/software/integration/wmq/
[8] http://hadoop.apache.org/
[9] http://hadoop.apache.org/hdfs/
[10] http://hadoop.apache.org/zookeeper/
[11] http://www.slideshare.net/cloudera/hw09-hadoop-based-data-mining-platform-for-the-telecom-industry
[12] http://www.slideshare.net/prasadc/hive-percona-2009
[13] https://issues.apache.org/jira/browse/ZOOKEEPER-775
[14] JAVA Message Service: http://download.oracle.com/javaee/1.3/jms/tutorial/1_3_1-fcs/doc/jms_tutorialTOC.html
[15] Oracle Enterprise Messaging Service: http://www.oracle.com/technetwork/middleware/ias/index-093455.html
[16] http://www.rabbitmq.com/
[17] TIBCO Enterprise Message Service: http://www.tibco.com/products/soa/messaging/
[18] Kafka, http://sna-projects.com/kafka/

Bigtable: A Distributed Storage System for Structured Data


Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach
Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
{fay,jeff,sanjay,wilsonh,kerr,m3b,tushar,fikes,gruber}@google.com

Google, Inc.

Abstract
Bigtable is a distributed storage system for managing
structured data that is designed to scale to a very large
size: petabytes of data across thousands of commodity
servers. Many projects at Google store data in Bigtable,
including web indexing, Google Earth, and Google Finance. These applications place very different demands
on Bigtable, both in terms of data size (from URLs to
web pages to satellite imagery) and latency requirements
(from backend bulk processing to real-time data serving).
Despite these varied demands, Bigtable has successfully
provided a flexible, high-performance solution for all of
these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients
dynamic control over data layout and format, and we describe the design and implementation of Bigtable.

1 Introduction
Over the last two and a half years we have designed,
implemented, and deployed a distributed storage system
for managing structured data at Google called Bigtable.
Bigtable is designed to reliably scale to petabytes of
data and thousands of machines. Bigtable has achieved
several goals: wide applicability, scalability, high performance, and high availability. Bigtable is used by
more than sixty Google products and projects, including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth. These products use Bigtable for a variety of demanding workloads,
which range from throughput-oriented batch-processing
jobs to latency-sensitive serving of data to end users.
The Bigtable clusters used by these products span a wide
range of configurations, from a handful to thousands of
servers, and store up to several hundred terabytes of data.
In many ways, Bigtable resembles a database: it shares
many implementation strategies with databases. Parallel databases [14] and main-memory databases [13] have
achieved scalability and high performance, but Bigtable
provides a different interface than such systems. Bigtable
does not support a full relational data model; instead, it
provides clients with a simple data model that supports
dynamic control over data layout and format, and allows clients to reason about the locality properties of the
data represented in the underlying storage. Data is indexed using row and column names that can be arbitrary
strings. Bigtable also treats data as uninterpreted strings,
although clients often serialize various forms of structured and semi-structured data into these strings. Clients
can control the locality of their data through careful
choices in their schemas. Finally, Bigtable schema parameters let clients dynamically control whether to serve
data out of memory or from disk.
Section 2 describes the data model in more detail, and
Section 3 provides an overview of the client API. Section 4 briefly describes the underlying Google infrastructure on which Bigtable depends. Section 5 describes the
fundamentals of the Bigtable implementation, and Section 6 describes some of the refinements that we made
to improve Bigtable's performance. Section 7 provides
measurements of Bigtable's performance. We describe
several examples of how Bigtable is used at Google
in Section 8, and discuss some lessons we learned in
designing and supporting Bigtable in Section 9. Finally, Section 10 describes related work, and Section 11
presents our conclusions.

2 Data Model
A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row
key, column key, and a timestamp; each value in the map
is an uninterpreted array of bytes.
(row:string, column:string, time:int64) → string

"contents:"

"com.cnn.www"

"anchor:cnnsi.com"

"<html>..."
t3
"<html>..."
t5
"<html>..."
t6

"CNN"

"anchor:my.look.ca"

t9

"CNN.com"

t8

Figure 1: A slice of an example table that stores Web pages. The
row name is a reversed URL. The contents column family contains
the page contents, and the anchor column family contains the text
of any anchors that reference the page. CNN's home page is
referenced by both the Sports Illustrated and the MY-look home
pages, so the row contains columns named anchor:cnnsi.com and
anchor:my.look.ca. Each anchor cell has one version; the contents
column has three versions, at timestamps t3, t5, and t6.

We settled on this data model after examining a variety
of potential uses of a Bigtable-like system. As one concrete example that drove some of our design decisions,
suppose we want to keep a copy of a large collection of
web pages and related information that could be used by
many different projects; let us call this particular table
the Webtable. In Webtable, we would use URLs as row
keys, various aspects of web pages as column names, and
store the contents of the web pages in the contents: column under the timestamps when they were fetched, as
illustrated in Figure 1.

Rows
The row keys in a table are arbitrary strings (currently up
to 64KB in size, although 10-100 bytes is a typical size
for most of our users). Every read or write of data under
a single row key is atomic (regardless of the number of
different columns being read or written in the row), a
design decision that makes it easier for clients to reason
about the system's behavior in the presence of concurrent
updates to the same row.
Bigtable maintains data in lexicographic order by row
key. The row range for a table is dynamically partitioned.
Each row range is called a tablet, which is the unit of distribution and load balancing. As a result, reads of short
row ranges are efficient and typically require communication with only a small number of machines. Clients
can exploit this property by selecting their row keys so
that they get good locality for their data accesses. For
example, in Webtable, pages in the same domain are
grouped together into contiguous rows by reversing the
hostname components of the URLs. For example, we
store data for maps.google.com/index.html under the
key com.google.maps/index.html. Storing pages from
the same domain near each other makes some host and
domain analyses more efficient.
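For example, a client might derive such row keys with a small helper
like the one below; this is purely a client-side convention, not
something Bigtable prescribes.

// Build a Webtable-style row key by reversing the hostname components of a
// URL so that pages from the same domain sort next to each other.
#include <sstream>
#include <string>
#include <vector>

std::string ReversedHostRowKey(const std::string& host, const std::string& path) {
  std::vector<std::string> parts;
  std::stringstream ss(host);
  std::string part;
  while (std::getline(ss, part, '.')) parts.push_back(part);  // "maps" "google" "com"
  std::string key;
  for (auto it = parts.rbegin(); it != parts.rend(); ++it) {
    if (!key.empty()) key += '.';
    key += *it;                                               // "com.google.maps"
  }
  return key + path;                                          // + "/index.html"
}
// ReversedHostRowKey("maps.google.com", "/index.html") == "com.google.maps/index.html"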

Column Families
Column keys are grouped into sets called column families, which form the basic unit of access control. All data
stored in a column family is usually of the same type (we
compress data in the same column family together). A
column family must be created before data can be stored
under any column key in that family; after a family has
been created, any column key within the family can be
used. It is our intent that the number of distinct column
families in a table be small (in the hundreds at most), and
that families rarely change during operation. In contrast,
a table may have an unbounded number of columns.
A column key is named using the following syntax:
family:qualifier. Column family names must be printable, but qualifiers may be arbitrary strings. An example column family for the Webtable is language, which
stores the language in which a web page was written. We
use only one column key in the language family, and it
stores each web page's language ID. Another useful column family for this table is anchor; each column key in
this family represents a single anchor, as shown in Figure 1. The qualifier is the name of the referring site; the
cell contents is the link text.
Access control and both disk and memory accounting are performed at the column-family level. In our
Webtable example, these controls allow us to manage
several different types of applications: some that add new
base data, some that read the base data and create derived
column families, and some that are only allowed to view
existing data (and possibly not even to view all of the
existing families for privacy reasons).
Timestamps
Each cell in a Bigtable can contain multiple versions of
the same data; these versions are indexed by timestamp.
Bigtable timestamps are 64-bit integers. They can be assigned by Bigtable, in which case they represent real
time in microseconds, or be explicitly assigned by client
applications. Applications that need to avoid collisions
must generate unique timestamps themselves. Different
versions of a cell are stored in decreasing timestamp order, so that the most recent versions can be read first.

// Open the table
Table *T = OpenOrDie("/bigtable/web/webtable");
// Write a new anchor and delete an old anchor
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:www.c-span.org", "CNN");
r1.Delete("anchor:www.abc.com");
Operation op;
Apply(&op, &r1);

Figure 2: Writing to Bigtable.
To make the management of versioned data less onerous, we support two per-column-family settings that tell
Bigtable to garbage-collect cell versions automatically.
The client can specify either that only the last n versions
of a cell be kept, or that only new-enough versions be
kept (e.g., only keep values that were written in the last
seven days).
In our Webtable example, we set the timestamps of
the crawled pages stored in the contents: column to
the times at which these page versions were actually
crawled. The garbage-collection mechanism described
above lets us keep only the most recent three versions of
every page.

3 API
The Bigtable API provides functions for creating and
deleting tables and column families. It also provides
functions for changing cluster, table, and column family
metadata, such as access control rights.
Client applications can write or delete values in
Bigtable, look up values from individual rows, or iterate over a subset of the data in a table. Figure 2 shows
C++ code that uses a RowMutation abstraction to perform a series of updates. (Irrelevant details were elided
to keep the example short.) The call to Apply performs
an atomic mutation to the Webtable: it adds one anchor
to www.cnn.com and deletes a different anchor.
Figure 3 shows C++ code that uses a Scanner abstraction to iterate over all anchors in a particular row.
Clients can iterate over multiple column families, and
there are several mechanisms for limiting the rows,
columns, and timestamps produced by a scan. For example, we could restrict the scan above to only produce
anchors whose columns match the regular expression
anchor:*.cnn.com, or to only produce anchors whose
timestamps fall within ten days of the current time.

Scanner scanner(T);
ScanStream *stream;
stream = scanner.FetchColumnFamily("anchor");
stream->SetReturnAllVersions();
scanner.Lookup("com.cnn.www");
for (; !stream->Done(); stream->Next()) {
printf("%s %s %lld %s\n",
scanner.RowName(),
stream->ColumnName(),
stream->MicroTimestamp(),
stream->Value());
}

Figure 3: Reading from Bigtable.


Bigtable supports several other features that allow the
user to manipulate data in more complex ways. First,
Bigtable supports single-row transactions, which can be
used to perform atomic read-modify-write sequences on
data stored under a single row key. Bigtable does not currently support general transactions across row keys, although it provides an interface for batching writes across
row keys at the clients. Second, Bigtable allows cells
to be used as integer counters. Finally, Bigtable supports the execution of client-supplied scripts in the address spaces of the servers. The scripts are written in a
language developed at Google for processing data called
Sawzall [28]. At the moment, our Sawzall-based API
does not allow client scripts to write back into Bigtable,
but it does allow various forms of data transformation,
filtering based on arbitrary expressions, and summarization via a variety of operators.
Bigtable can be used with MapReduce [12], a framework for running large-scale parallel computations developed at Google. We have written a set of wrappers
that allow a Bigtable to be used both as an input source
and as an output target for MapReduce jobs.

4 Building Blocks
Bigtable is built on several other pieces of Google infrastructure. Bigtable uses the distributed Google File
System (GFS) [17] to store log and data files. A Bigtable
cluster typically operates in a shared pool of machines
that run a wide variety of other distributed applications,
and Bigtable processes often share the same machines
with processes from other applications. Bigtable depends on a cluster management system for scheduling
jobs, managing resources on shared machines, dealing
with machine failures, and monitoring machine status.
The Google SSTable file format is used internally to
store Bigtable data. An SSTable provides a persistent,
ordered immutable map from keys to values, where both
keys and values are arbitrary byte strings. Operations are
provided to look up the value associated with a specified
key, and to iterate over all key/value pairs in a specified
key range. Internally, each SSTable contains a sequence
of blocks (typically each block is 64KB in size, but this
is configurable). A block index (stored at the end of the
SSTable) is used to locate blocks; the index is loaded
into memory when the SSTable is opened. A lookup
can be performed with a single disk seek: we first find
the appropriate block by performing a binary search in
the in-memory index, and then reading the appropriate
block from disk. Optionally, an SSTable can be completely mapped into memory, which allows us to perform
lookups and scans without touching disk.
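A minimal sketch of that single-seek lookup path, using our own
simplified index types rather than the real SSTable format.

// Binary-search the in-memory block index for the block that may hold a key;
// the caller then performs exactly one disk read for that block.
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct BlockIndexEntry {
  std::string last_key;       // largest key stored in the block
  std::uint64_t file_offset;  // where the block starts in the SSTable file
  std::uint32_t length;       // block length in bytes
};

// Returns the entry of the block that could contain `key`, or nullptr if
// `key` is larger than every key in the SSTable. `index` is sorted.
const BlockIndexEntry* FindBlock(const std::vector<BlockIndexEntry>& index,
                                 const std::string& key) {
  auto it = std::lower_bound(
      index.begin(), index.end(), key,
      [](const BlockIndexEntry& e, const std::string& k) { return e.last_key < k; });
  if (it == index.end()) return nullptr;
  return &*it;
}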
Bigtable relies on a highly-available and persistent
distributed lock service called Chubby [8]. A Chubby
service consists of five active replicas, one of which is
elected to be the master and actively serve requests. The
service is live when a majority of the replicas are running
and can communicate with each other. Chubby uses the
Paxos algorithm [9, 23] to keep its replicas consistent in
the face of failure. Chubby provides a namespace that
consists of directories and small files. Each directory or
file can be used as a lock, and reads and writes to a file
are atomic. The Chubby client library provides consistent caching of Chubby files. Each Chubby client maintains a session with a Chubby service. A clients session
expires if it is unable to renew its session lease within the
lease expiration time. When a clients session expires, it
loses any locks and open handles. Chubby clients can
also register callbacks on Chubby files and directories
for notification of changes or session expiration.
Bigtable uses Chubby for a variety of tasks: to ensure
that there is at most one active master at any time; to
store the bootstrap location of Bigtable data (see Section 5.1); to discover tablet servers and finalize tablet
server deaths (see Section 5.2); to store Bigtable schema
information (the column family information for each table); and to store access control lists. If Chubby becomes
unavailable for an extended period of time, Bigtable becomes unavailable. We recently measured this effect
in 14 Bigtable clusters spanning 11 Chubby instances.
The average percentage of Bigtable server hours during
which some data stored in Bigtable was not available due
to Chubby unavailability (caused by either Chubby outages or network issues) was 0.0047%. The percentage
for the single cluster that was most affected by Chubby
unavailability was 0.0326%.

5 Implementation
The Bigtable implementation has three major components: a library that is linked into every client, one master server, and many tablet servers. Tablet servers can be
dynamically added (or removed) from a cluster to accommodate changes in workloads.


The master is responsible for assigning tablets to tablet
servers, detecting the addition and expiration of tablet
servers, balancing tablet-server load, and garbage collection of files in GFS. In addition, it handles schema
changes such as table and column family creations.
Each tablet server manages a set of tablets (typically
we have somewhere between ten and a thousand tablets per
tablet server). The tablet server handles read and write
requests to the tablets that it has loaded, and also splits
tablets that have grown too large.
As with many single-master distributed storage systems [17, 21], client data does not move through the master: clients communicate directly with tablet servers for
reads and writes. Because Bigtable clients do not rely on
the master for tablet location information, most clients
never communicate with the master. As a result, the master is lightly loaded in practice.
A Bigtable cluster stores a number of tables. Each table consists of a set of tablets, and each tablet contains
all data associated with a row range. Initially, each table
consists of just one tablet. As a table grows, it is automatically split into multiple tablets, each approximately
100-200 MB in size by default.

5.1 Tablet Location


We use a three-level hierarchy analogous to that of a B+ tree [10] to store tablet location information (Figure 4).
Figure 4: Tablet location hierarchy.


The first level is a file stored in Chubby that contains
the location of the root tablet. The root tablet contains
the location of all tablets in a special METADATA table.
Each METADATA tablet contains the location of a set of
user tablets. The root tablet is just the first tablet in the
METADATA table, but is treated specially (it is never
split) to ensure that the tablet location hierarchy has no
more than three levels.
The METADATA table stores the location of a tablet
under a row key that is an encoding of the tablet's table
identifier and its end row. Each METADATA row stores
approximately 1KB of data in memory. With a modest
limit of 128 MB METADATA tablets, our three-level location scheme is sufficient to address 2^34 tablets (or 2^61
bytes in 128 MB tablets).
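The arithmetic behind those figures, spelled out as a small check using
the roughly 1KB-per-row and 128 MB sizes quoted above.

// Back-of-the-envelope check of the three-level addressing capacity.
#include <cstdint>
#include <cstdio>

int main() {
  const std::uint64_t tablet_bytes = 128ULL << 20;  // 128 MB per tablet
  const std::uint64_t row_bytes = 1ULL << 10;       // ~1KB per METADATA row
  const std::uint64_t rows_per_metadata_tablet = tablet_bytes / row_bytes;  // 2^17
  // The root tablet names 2^17 METADATA tablets; each names 2^17 user tablets.
  const std::uint64_t user_tablets =
      rows_per_metadata_tablet * rows_per_metadata_tablet;                  // 2^34
  const std::uint64_t user_bytes = user_tablets * tablet_bytes;             // 2^61
  std::printf("tablets = %llu (2^34), bytes = %llu (2^61)\n",
              (unsigned long long)user_tablets, (unsigned long long)user_bytes);
  return 0;
}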
The client library caches tablet locations. If the client
does not know the location of a tablet, or if it discovers that cached location information is incorrect, then
it recursively moves up the tablet location hierarchy.
If the client's cache is empty, the location algorithm
requires three network round-trips, including one read
from Chubby. If the client's cache is stale, the location
algorithm could take up to six round-trips, because stale
cache entries are only discovered upon misses (assuming
that METADATA tablets do not move very frequently).
Although tablet locations are stored in memory, so no
GFS accesses are required, we further reduce this cost
in the common case by having the client library prefetch
tablet locations: it reads the metadata for more than one
tablet whenever it reads the METADATA table.
We also store secondary information in the
METADATA table, including a log of all events pertaining to each tablet (such as when a server begins
serving it). This information is helpful for debugging
and performance analysis.

5.2 Tablet Assignment


Each tablet is assigned to one tablet server at a time. The
master keeps track of the set of live tablet servers, and
the current assignment of tablets to tablet servers, including which tablets are unassigned. When a tablet is
unassigned, and a tablet server with sufficient room for
the tablet is available, the master assigns the tablet by
sending a tablet load request to the tablet server.
Bigtable uses Chubby to keep track of tablet servers.
When a tablet server starts, it creates, and acquires an
exclusive lock on, a uniquely-named file in a specific
Chubby directory. The master monitors this directory
(the servers directory) to discover tablet servers. A tablet
server stops serving its tablets if it loses its exclusive
lock: e.g., due to a network partition that caused the
server to lose its Chubby session. (Chubby provides an
efficient mechanism that allows a tablet server to check
whether it still holds its lock without incurring network
traffic.) A tablet server will attempt to reacquire an exclusive lock on its file as long as the file still exists. If the
file no longer exists, then the tablet server will never be
able to serve again, so it kills itself. Whenever a tablet
server terminates (e.g., because the cluster management
system is removing the tablet server's machine from the
cluster), it attempts to release its lock so that the master
will reassign its tablets more quickly.

The master is responsible for detecting when a tablet
server is no longer serving its tablets, and for reassigning those tablets as soon as possible. To detect when a
tablet server is no longer serving its tablets, the master
periodically asks each tablet server for the status of its
lock. If a tablet server reports that it has lost its lock,
or if the master was unable to reach a server during its
last several attempts, the master attempts to acquire an
exclusive lock on the server's file. If the master is able to
acquire the lock, then Chubby is live and the tablet server
is either dead or having trouble reaching Chubby, so the
master ensures that the tablet server can never serve again
by deleting its server file. Once a server's file has been
deleted, the master can move all the tablets that were previously assigned to that server into the set of unassigned
tablets. To ensure that a Bigtable cluster is not vulnerable to networking issues between the master and Chubby,
the master kills itself if its Chubby session expires. However, as described above, master failures do not change
the assignment of tablets to tablet servers.
When a master is started by the cluster management
system, it needs to discover the current tablet assignments before it can change them. The master executes
the following steps at startup. (1) The master grabs
a unique master lock in Chubby, which prevents concurrent master instantiations. (2) The master scans the
servers directory in Chubby to find the live servers.
(3) The master communicates with every live tablet
server to discover what tablets are already assigned to
each server. (4) The master scans the METADATA table
to learn the set of tablets. Whenever this scan encounters
a tablet that is not already assigned, the master adds the
tablet to the set of unassigned tablets, which makes the
tablet eligible for tablet assignment.
One complication is that the scan of the METADATA
table cannot happen until the METADATA tablets have
been assigned. Therefore, before starting this scan (step
4), the master adds the root tablet to the set of unassigned
tablets if an assignment for the root tablet was not discovered during step 3. This addition ensures that the root
tablet will be assigned. Because the root tablet contains
the names of all METADATA tablets, the master knows
about all of them after it has scanned the root tablet.
The set of existing tablets only changes when a table is created or deleted, two existing tablets are merged
to form one larger tablet, or an existing tablet is split
into two smaller tablets. The master is able to keep
track of these changes because it initiates all but the last.
Tablet splits are treated specially since they are initiated by a tablet server. The tablet server commits the
split by recording information for the new tablet in the
METADATA table. When the split has committed, it notifies the master. In case the split notification is lost (either
because the tablet server or the master died), the master
detects the new tablet when it asks a tablet server to load
the tablet that has now split. The tablet server will notify
the master of the split, because the tablet entry it finds in
the METADATA table will specify only a portion of the
tablet that the master asked it to load.

5.3 Tablet Serving


The persistent state of a tablet is stored in GFS, as illustrated in Figure 5. Updates are committed to a commit
log that stores redo records. Of these updates, the recently committed ones are stored in memory in a sorted
buffer called a memtable; the older updates are stored in a
sequence of SSTables.

Figure 5: Tablet Representation. (The figure shows a write operation
going to the tablet log, a commit log stored in GFS, and into the
in-memory memtable, and a read operation served from a merged view of
the memtable and the SSTable files in GFS.)

To recover a tablet, a tablet server reads its metadata from the
METADATA table. This metadata contains the list of SSTables
that comprise a tablet and a set of redo points, which are pointers into any
commit logs that may contain data for the tablet. The
server reads the indices of the SSTables into memory and
reconstructs the memtable by applying all of the updates
that have committed since the redo points.
When a write operation arrives at a tablet server, the
server checks that it is well-formed, and that the sender
is authorized to perform the mutation. Authorization is
performed by reading the list of permitted writers from a
Chubby file (which is almost always a hit in the Chubby
client cache). A valid mutation is written to the commit
log. Group commit is used to improve the throughput of
lots of small mutations [13, 16]. After the write has been
committed, its contents are inserted into the memtable.
When a read operation arrives at a tablet server, it is
similarly checked for well-formedness and proper authorization. A valid read operation is executed on a merged
view of the sequence of SSTables and the memtable.
Since the SSTables and the memtable are lexicographically sorted data structures, the merged view can be
formed efficiently.
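The merged view can be formed with a standard heap-based k-way merge
over the already-sorted inputs; a sketch with simplified entry and run
types follows (deletion markers and newest-version-wins handling are
elided).

// Heap-based k-way merge over the memtable and the SSTables, each of which
// is already sorted by key. Entry and run types are simplified stand-ins.
#include <cstddef>
#include <queue>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, std::string>;   // (key, value)
using SortedRun = std::vector<Entry>;                 // memtable or one SSTable

std::vector<Entry> MergedView(const std::vector<SortedRun>& runs) {
  using Cursor = std::pair<std::size_t, std::size_t>;  // (run index, position)
  auto greater = [&runs](const Cursor& a, const Cursor& b) {
    return runs[a.first][a.second].first > runs[b.first][b.second].first;
  };
  std::priority_queue<Cursor, std::vector<Cursor>, decltype(greater)> heap(greater);
  for (std::size_t r = 0; r < runs.size(); ++r)
    if (!runs[r].empty()) heap.push({r, 0});
  std::vector<Entry> merged;
  while (!heap.empty()) {
    Cursor c = heap.top();
    heap.pop();
    merged.push_back(runs[c.first][c.second]);
    if (c.second + 1 < runs[c.first].size()) heap.push({c.first, c.second + 1});
  }
  return merged;
}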
Incoming read and write operations can continue
while tablets are split and merged.

5.4 Compactions
As write operations execute, the size of the memtable increases. When the memtable size reaches a threshold, the
memtable is frozen, a new memtable is created, and the
frozen memtable is converted to an SSTable and written
to GFS. This minor compaction process has two goals:
it shrinks the memory usage of the tablet server, and it
reduces the amount of data that has to be read from the
commit log during recovery if this server dies. Incoming read and write operations can continue while compactions occur.
Every minor compaction creates a new SSTable. If this
behavior continued unchecked, read operations might
need to merge updates from an arbitrary number of
SSTables. Instead, we bound the number of such files
by periodically executing a merging compaction in the
background. A merging compaction reads the contents
of a few SSTables and the memtable, and writes out a
new SSTable. The input SSTables and memtable can be
discarded as soon as the compaction has finished.
A merging compaction that rewrites all SSTables
into exactly one SSTable is called a major compaction.
SSTables produced by non-major compactions can contain special deletion entries that suppress deleted data in
older SSTables that are still live. A major compaction,
on the other hand, produces an SSTable that contains
no deletion information or deleted data. Bigtable cycles through all of its tablets and regularly applies major
compactions to them. These major compactions allow
Bigtable to reclaim resources used by deleted data, and
also allow it to ensure that deleted data disappears from
the system in a timely fashion, which is important for
services that store sensitive data.

6 Refinements
The implementation described in the previous section
required a number of refinements to achieve the high
performance, availability, and reliability required by our
users. This section describes portions of the implementation in more detail in order to highlight these refinements.
Locality groups
Clients can group multiple column families together into
a locality group. A separate SSTable is generated for
each locality group in each tablet. Segregating column
families that are not typically accessed together into separate locality groups enables more efficient reads. For
example, page metadata in Webtable (such as language
and checksums) can be in one locality group, and the
contents of the page can be in a different group: an application that wants to read the metadata does not need
to read through all of the page contents.
In addition, some useful tuning parameters can be
specified on a per-locality group basis. For example, a locality group can be declared to be in-memory. SSTables
for in-memory locality groups are loaded lazily into the
memory of the tablet server. Once loaded, column families that belong to such locality groups can be read
without accessing the disk. This feature is useful for
small pieces of data that are accessed frequently: we
use it internally for the location column family in the
METADATA table.

Caching for read performance


To improve read performance, tablet servers use two levels of caching. The Scan Cache is a higher-level cache
that caches the key-value pairs returned by the SSTable
interface to the tablet server code. The Block Cache is a
lower-level cache that caches SSTable blocks that were
read from GFS. The Scan Cache is most useful for applications that tend to read the same data repeatedly. The
Block Cache is useful for applications that tend to read
data that is close to the data they recently read (e.g., sequential reads, or random reads of different columns in
the same locality group within a hot row).
Compression
Clients can control whether or not the SSTables for a
locality group are compressed, and if so, which compression format is used. The user-specified compression format is applied to each SSTable block (whose size
is controllable via a locality group specific tuning parameter). Although we lose some space by compressing each block separately, we benefit in that small portions of an SSTable can be read without decompressing the entire file. Many clients use a two-pass custom
compression scheme. The first pass uses Bentley and
McIlroy's scheme [6], which compresses long common
strings across a large window. The second pass uses a
fast compression algorithm that looks for repetitions in
a small 16 KB window of the data. Both compression
passes are very fast: they encode at 100-200 MB/s, and
decode at 400-1000 MB/s on modern machines.
Even though we emphasized speed instead of space reduction when choosing our compression algorithms, this
two-pass compression scheme does surprisingly well.
For example, in Webtable, we use this compression
scheme to store Web page contents. In one experiment,
we stored a large number of documents in a compressed
locality group. For the purposes of the experiment, we
limited ourselves to one version of each document instead of storing all versions available to us. The scheme
achieved a 10-to-1 reduction in space. This is much
better than typical Gzip reductions of 3-to-1 or 4-to-1
on HTML pages because of the way Webtable rows are
laid out: all pages from a single host are stored close
to each other. This allows the Bentley-McIlroy algorithm to identify large amounts of shared boilerplate in
pages from the same host. Many applications, not just
Webtable, choose their row names so that similar data
ends up clustered, and therefore achieve very good compression ratios. Compression ratios get even better when
we store multiple versions of the same value in Bigtable.

Bloom filters
As described in Section 5.3, a read operation has to read
from all SSTables that make up the state of a tablet.
If these SSTables are not in memory, we may end up
doing many disk accesses. We reduce the number of
accesses by allowing clients to specify that Bloom filters [7] should be created for SSTables in a particular locality group. A Bloom filter allows us to ask
whether an SSTable might contain any data for a specified row/column pair. For certain applications, a small
amount of tablet server memory used for storing Bloom
filters drastically reduces the number of disk seeks required for read operations. Our use of Bloom filters
also implies that most lookups for non-existent rows or
columns do not need to touch disk.
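A toy version of such a filter, just to show the shape of the idea;
real deployments size the bit array and the number of hash probes from
the expected key count and target false-positive rate, and this is not
Bigtable's implementation.

// Toy Bloom filter keyed on a row/column pair: a compact in-memory structure
// that lets most lookups for absent cells avoid touching disk.
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

class BloomFilter {
 public:
  explicit BloomFilter(std::size_t bits) : bits_(bits, false) {}

  void Add(const std::string& row, const std::string& column) {
    for (std::size_t i = 0; i < kHashes; ++i) bits_[Slot(row, column, i)] = true;
  }
  // False means "definitely absent"; true means "might be present, go to disk".
  bool MightContain(const std::string& row, const std::string& column) const {
    for (std::size_t i = 0; i < kHashes; ++i)
      if (!bits_[Slot(row, column, i)]) return false;
    return true;
  }

 private:
  static const std::size_t kHashes = 4;
  std::size_t Slot(const std::string& row, const std::string& column,
                   std::size_t seed) const {
    std::hash<std::string> h;
    const std::size_t base = h(row) * 1000003u + h(column);
    return (base + seed * 0x9e3779b9u) % bits_.size();
  }
  std::vector<bool> bits_;
};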
Commit-log implementation
If we kept the commit log for each tablet in a separate
log file, a very large number of files would be written
concurrently in GFS. Depending on the underlying file
system implementation on each GFS server, these writes
could cause a large number of disk seeks to write to the
different physical log files. In addition, having separate
log files per tablet also reduces the effectiveness of the
group commit optimization, since groups would tend to
be smaller. To fix these issues, we append mutations
to a single commit log per tablet server, co-mingling
mutations for different tablets in the same physical log
file [18, 20].
Using one log provides significant performance benefits during normal operation, but it complicates recovery. When a tablet server dies, the tablets that it served
will be moved to a large number of other tablet servers:
each server typically loads a small number of the original server's tablets. To recover the state for a tablet,
the new tablet server needs to reapply the mutations for
that tablet from the commit log written by the original
tablet server. However, the mutations for these tablets
were co-mingled in the same physical log file. One approach would be for each new tablet server to read this
full commit log file and apply just the entries needed for
the tablets it needs to recover. However, under such a
scheme, if 100 machines were each assigned a single
tablet from a failed tablet server, then the log file would
be read 100 times (once by each server).
We avoid duplicating log reads by first sorting the commit log entries in order of the keys (table, row name, log sequence number). In the
sorted output, all mutations for a particular tablet are
contiguous and can therefore be read efficiently with one
disk seek followed by a sequential read. To parallelize
the sorting, we partition the log file into 64 MB segments, and sort each segment in parallel on different
tablet servers. This sorting process is coordinated by the
master and is initiated when a tablet server indicates that
it needs to recover mutations from some commit log file.
Writing commit logs to GFS sometimes causes performance hiccups for a variety of reasons (e.g., a GFS server
machine involved in the write crashes, or the network
paths traversed to reach the particular set of three GFS
servers is suffering network congestion, or is heavily
loaded). To protect mutations from GFS latency spikes,
each tablet server actually has two log writing threads,
each writing to its own log file; only one of these two
threads is actively in use at a time. If writes to the active log file are performing poorly, the log file writing is
switched to the other thread, and mutations that are in
the commit log queue are written by the newly active log
writing thread. Log entries contain sequence numbers
to allow the recovery process to elide duplicated entries
resulting from this log switching process.
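A loose sketch of the switching and duplicate-elision logic described here (the latency probe and log-file objects are hypothetical):

    # Sketch of dual commit-log writers; only one file is active at a time.
    class DualLogWriter:
        def __init__(self, log_a, log_b):
            self.logs = [log_a, log_b]
            self.active = 0
            self.seq = 0

        def append(self, mutations):
            self.seq += 1
            if self.logs[self.active].is_slow():     # hypothetical latency check
                self.active = 1 - self.active        # switch to the other log file
            self.logs[self.active].write((self.seq, mutations))

    # Recovery merges both files and applies each sequence number only once.
    def deduplicate(entries):
        seen = set()
        for seq, mutations in sorted(entries, key=lambda e: e[0]):
            if seq not in seen:
                seen.add(seq)
                yield mutations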
Speeding up tablet recovery
If the master moves a tablet from one tablet server to
another, the source tablet server first does a minor compaction on that tablet. This compaction reduces recovery time by reducing the amount of uncompacted state in
the tablet server's commit log. After finishing this compaction, the tablet server stops serving the tablet. Before
it actually unloads the tablet, the tablet server does another (usually very fast) minor compaction to eliminate
any remaining uncompacted state in the tablet server's
log that arrived while the first minor compaction was
being performed. After this second minor compaction
is complete, the tablet can be loaded on another tablet
server without requiring any recovery of log entries.
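The unload sequence can be sketched as follows (hypothetical tablet-server and master methods, not Bigtable's actual interfaces):

    # Sketch of the two-pass unload that avoids log replay on the destination.
    def migrate_tablet(source, tablet, master):
        source.minor_compaction(tablet)      # flush most uncompacted state from the log
        source.stop_serving(tablet)          # no further writes are accepted
        source.minor_compaction(tablet)      # usually fast: only state that arrived meanwhile
        source.unload(tablet)
        master.assign(tablet)                # new server loads SSTables; no recovery needed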
Exploiting immutability
Besides the SSTable caches, various other parts of the
Bigtable system have been simplified by the fact that all
of the SSTables that we generate are immutable. For example, we do not need any synchronization of accesses
to the file system when reading from SSTables. As a result, concurrency control over rows can be implemented
very efficiently. The only mutable data structure that is
accessed by both reads and writes is the memtable. To reduce contention during reads of the memtable, we make
each memtable row copy-on-write and allow reads and
writes to proceed in parallel.
Since SSTables are immutable, the problem of permanently removing deleted data is transformed to garbage
collecting obsolete SSTables. Each tablet's SSTables are
registered in the METADATA table. The master removes
obsolete SSTables as a mark-and-sweep garbage collection [25] over the set of SSTables, where the METADATA
table contains the set of roots.
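A compact sketch of that mark-and-sweep pass (the METADATA and GFS interfaces below are hypothetical):

    # Sketch: METADATA rows are the GC roots; unreferenced SSTable files are removed.
    def collect_obsolete_sstables(metadata_table, gfs):
        live = set()
        for tablet_row in metadata_table.scan():     # mark: every registered SSTable
            live.update(tablet_row["sstables"])
        for path in gfs.list_sstable_files():        # sweep: delete unreferenced files
            if path not in live:
                gfs.delete(path)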
Finally, the immutability of SSTables enables us to
split tablets quickly. Instead of generating a new set of
SSTables for each child tablet, we let the child tablets
share the SSTables of the parent tablet.

7 Performance Evaluation
We set up a Bigtable cluster with N tablet servers to
measure the performance and scalability of Bigtable as
N is varied. The tablet servers were configured to use 1
GB of memory and to write to a GFS cell consisting of
1786 machines with two 400 GB IDE hard drives each.
N client machines generated the Bigtable load used for
these tests. (We used the same number of clients as tablet
servers to ensure that clients were never a bottleneck.)
Each machine had two dual-core Opteron 2 GHz chips,
enough physical memory to hold the working set of all
running processes, and a single gigabit Ethernet link.
The machines were arranged in a two-level tree-shaped
switched network with approximately 100-200 Gbps of
aggregate bandwidth available at the root. All of the machines were in the same hosting facility and therefore the
round-trip time between any pair of machines was less
than a millisecond.
The tablet servers and master, test clients, and GFS
servers all ran on the same set of machines. Every machine ran a GFS server. Some of the machines also ran
either a tablet server, or a client process, or processes
from other jobs that were using the pool at the same time
as these experiments.
R is the distinct number of Bigtable row keys involved
in the test. R was chosen so that each benchmark read or
wrote approximately 1 GB of data per tablet server.
The sequential write benchmark used row keys with
names 0 to R-1. This space of row keys was partitioned into 10N equal-sized ranges. These ranges were assigned to the N clients by a central scheduler that

Experiment              # of Tablet Servers
                        1       50      250     500
random reads            1212    593     479     241
random reads (mem)      10811   8511    8000    6250
random writes           8850    3745    3425    2000
sequential reads        4425    2463    2625    2469
sequential writes       8547    3623    2451    1905
scans                   15385   10526   9524    7843

[Graph omitted: aggregate values read/written per second (up to roughly 4M) versus the number of tablet servers, 1 to 500, for each of the six benchmarks.]

Figure 6: Number of 1000-byte values read/written per second. The table shows the rate per tablet server; the graph shows the aggregate rate.

assigned the next available range to a client as soon as the


client finished processing the previous range assigned to
it. This dynamic assignment helped mitigate the effects
of performance variations caused by other processes running on the client machines. We wrote a single string under each row key. Each string was generated randomly
and was therefore uncompressible. In addition, strings
under different row keys were distinct, so no cross-row
compression was possible. The random write benchmark
was similar except that the row key was hashed modulo
R immediately before writing so that the write load was
spread roughly uniformly across the entire row space for
the entire duration of the benchmark.
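For concreteness, the key generation can be sketched like this (the zero-padding width and the hash function are illustrative assumptions; the text only specifies that the key is hashed modulo R):

    import hashlib

    R = 1000 * 1000          # distinct row keys; chosen so each server reads/writes ~1 GB

    def sequential_key(i):
        return "%016d" % i                       # row keys named 0 .. R-1, in order

    def random_write_key(i):
        h = int(hashlib.md5(str(i).encode()).hexdigest(), 16)
        return "%016d" % (h % R)                 # spreads writes uniformly over the row space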
The sequential read benchmark generated row keys in
exactly the same way as the sequential write benchmark,
but instead of writing under the row key, it read the string
stored under the row key (which was written by an earlier
invocation of the sequential write benchmark). Similarly,
the random read benchmark shadowed the operation of
the random write benchmark.
The scan benchmark is similar to the sequential read
benchmark, but uses support provided by the Bigtable
API for scanning over all values in a row range. Using a scan reduces the number of RPCs executed by the
benchmark since a single RPC fetches a large sequence
of values from a tablet server.
The random reads (mem) benchmark is similar to the
random read benchmark, but the locality group that contains the benchmark data is marked as in-memory, and
therefore the reads are satisfied from the tablet server's
memory instead of requiring a GFS read. For just this
benchmark, we reduced the amount of data per tablet
server from 1 GB to 100 MB so that it would fit comfortably in the memory available to the tablet server.
Figure 6 shows two views on the performance of our
benchmarks when reading and writing 1000-byte values
to Bigtable. The table shows the number of operations
per second per tablet server; the graph shows the aggregate number of operations per second.
Single tablet-server performance


Let us first consider performance with just one tablet
server. Random reads are slower than all other operations
by an order of magnitude or more. Each random read involves the transfer of a 64 KB SSTable block over the
network from GFS to a tablet server, out of which only a
single 1000-byte value is used. The tablet server executes
approximately 1200 reads per second, which translates
into approximately 75 MB/s of data read from GFS. This
bandwidth is enough to saturate the tablet server CPUs
because of overheads in our networking stack, SSTable
parsing, and Bigtable code, and is also almost enough
to saturate the network links used in our system. Most
Bigtable applications with this type of an access pattern
reduce the block size to a smaller value, typically 8KB.
Random reads from memory are much faster since
each 1000-byte read is satisfied from the tablet server's
local memory without fetching a large 64 KB block from
GFS.
Random and sequential writes perform better than random reads since each tablet server appends all incoming
writes to a single commit log and uses group commit to
stream these writes efficiently to GFS. There is no significant difference between the performance of random
writes and sequential writes; in both cases, all writes to
the tablet server are recorded in the same commit log.
Sequential reads perform better than random reads
since every 64 KB SSTable block that is fetched from
GFS is stored into our block cache, where it is used to
serve the next 64 read requests.
Scans are even faster since the tablet server can return
a large number of values in response to a single client
RPC, and therefore RPC overhead is amortized over a
large number of values.
Scaling
Aggregate throughput increases dramatically, by over a
factor of a hundred, as we increase the number of tablet
servers in the system from 1 to 500.

# of tablet servers     # of clusters
0 .. 19                 259
20 .. 49                47
50 .. 99                20
100 .. 499              50
> 500                   12

Table 1: Distribution of number of tablet servers in Bigtable clusters.

For example, the performance of random reads from memory increases by
almost a factor of 300 as the number of tablet servers increases by a factor of 500. This behavior occurs because
the bottleneck on performance for this benchmark is the
individual tablet server CPU.
However, performance does not increase linearly. For
most benchmarks, there is a significant drop in per-server
throughput when going from 1 to 50 tablet servers. This
drop is caused by imbalance in load in multiple server
configurations, often due to other processes contending
for CPU and network. Our load balancing algorithm attempts to deal with this imbalance, but cannot do a perfect job for two main reasons: rebalancing is throttled to
reduce the number of tablet movements (a tablet is unavailable for a short time, typically less than one second,
when it is moved), and the load generated by our benchmarks shifts around as the benchmark progresses.
The random read benchmark shows the worst scaling
(an increase in aggregate throughput by only a factor of
100 for a 500-fold increase in number of servers). This
behavior occurs because (as explained above) we transfer
one large 64 KB block over the network for every 1000-byte read. This transfer saturates various shared 1 Gigabit links in our network and, as a result, the per-server
throughput drops significantly as we increase the number
of machines.

8 Real Applications
As of August 2006, there are 388 non-test Bigtable clusters running in various Google machine clusters, with a
combined total of about 24,500 tablet servers. Table 1
shows a rough distribution of tablet servers per cluster.
Many of these clusters are used for development purposes and therefore are idle for significant periods. One
group of 14 busy clusters with 8069 total tablet servers
saw an aggregate volume of more than 1.2 million requests per second, with incoming RPC traffic of about
741 MB/s and outgoing RPC traffic of about 16 GB/s.
Table 2 provides some data about a few of the tables
currently in use. Some tables store data that is served
to users, whereas others store data for batch processing;
the tables range widely in total size, average cell size,
percentage of data served from memory, and complexity


of the table schema. In the rest of this section, we briefly
describe how three product teams use Bigtable.

8.1 Google Analytics


Google Analytics (analytics.google.com) is a service
that helps webmasters analyze traffic patterns at their
web sites. It provides aggregate statistics, such as the
number of unique visitors per day and the page views
per URL per day, as well as site-tracking reports, such as
the percentage of users that made a purchase, given that
they earlier viewed a specific page.
To enable the service, webmasters embed a small
JavaScript program in their web pages. This program
is invoked whenever a page is visited. It records various
information about the request in Google Analytics, such
as a user identifier and information about the page being fetched. Google Analytics summarizes this data and
makes it available to webmasters.
We briefly describe two of the tables used by Google
Analytics. The raw click table (200 TB) maintains a
row for each end-user session. The row name is a tuple
containing the website's name and the time at which the
session was created. This schema ensures that sessions
that visit the same web site are contiguous, and that they
are sorted chronologically. This table compresses to 14%
of its original size.
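One plausible encoding of that row name (purely illustrative; the text only states that the name is a tuple of the website's name and the session creation time) is:

    # Sketch: rows cluster by site and sort chronologically within a site.
    def raw_click_row_key(website, session_start_unix_seconds):
        return "%s#%020d" % (website, session_start_unix_seconds)

    # e.g. raw_click_row_key("example.com", 1156032000)
    #   -> "example.com#00000000001156032000"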
The summary table (20 TB) contains various predefined summaries for each website. This table is generated from the raw click table by periodically scheduled
MapReduce jobs. Each MapReduce job extracts recent
session data from the raw click table. The overall system's throughput is limited by the throughput of GFS.
This table compresses to 29% of its original size.

8.2 Google Earth


Google operates a collection of services that provide
users with access to high-resolution satellite imagery of
the world's surface, both through the web-based Google
Maps interface (maps.google.com) and through the
Google Earth (earth.google.com) custom client software. These products allow users to navigate across the
world's surface: they can pan, view, and annotate satellite imagery at many different levels of resolution. This
system uses one table to preprocess data, and a different
set of tables for serving client data.
The preprocessing pipeline uses one table to store raw
imagery. During preprocessing, the imagery is cleaned
and consolidated into final serving data. This table contains approximately 70 terabytes of data and therefore is
served from disk. The images are efficiently compressed
already, so Bigtable compression is disabled.
Project               Table size   Compression   # Cells      # Column    # Locality   % in     Latency-
name                  (TB)         ratio         (billions)   Families    Groups       memory   sensitive?
Crawl                 800          11%           1000         16          8            0%       No
Crawl                 50           33%           200          2           2            0%       No
Google Analytics      20           29%           10           1           1            0%       Yes
Google Analytics      200          14%           80           1           1            0%       Yes
Google Base           2            31%           10           29          3            15%      Yes
Google Earth          0.5          64%           8            7           2            33%      Yes
Google Earth          70           -             9            8           3            0%       No
Orkut                 9            -             0.9          8           5            1%       Yes
Personalized Search   4            47%           6            93          11           5%       Yes

Table 2: Characteristics of a few tables in production use. Table size (measured before compression) and # Cells indicate approximate sizes. Compression ratio is not given for tables that have compression disabled.
Each row in the imagery table corresponds to a single geographic segment. Rows are named to ensure that
adjacent geographic segments are stored near each other.
The table contains a column family to keep track of the
sources of data for each segment. This column family
has a large number of columns: essentially one for each
raw data image. Since each segment is only built from a
few images, this column family is very sparse.
The preprocessing pipeline relies heavily on MapReduce over Bigtable to transform data. The overall system
processes over 1 MB/sec of data per tablet server during
some of these MapReduce jobs.
The serving system uses one table to index data stored
in GFS. This table is relatively small (500 GB), but it
must serve tens of thousands of queries per second per
datacenter with low latency. As a result, this table is
hosted across hundreds of tablet servers and contains in-memory column families.

8.3 Personalized Search


Personalized Search (www.google.com/psearch) is an
opt-in service that records user queries and clicks across
a variety of Google properties such as web search, images, and news. Users can browse their search histories
to revisit their old queries and clicks, and they can ask
for personalized search results based on their historical
Google usage patterns.
Personalized Search stores each user's data in
Bigtable. Each user has a unique userid and is assigned
a row named by that userid. All user actions are stored
in a table. A separate column family is reserved for each
type of action (for example, there is a column family that
stores all web queries). Each data element uses as its
Bigtable timestamp the time at which the corresponding
user action occurred. Personalized Search generates user
profiles using a MapReduce over Bigtable. These user
profiles are used to personalize live search results.
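The layout described above can be sketched with a hypothetical client API (not Bigtable's actual client library):

    # Sketch: one row per user, one column family per action type,
    # with the cell timestamp set to the time the action occurred.
    def record_action(table, userid, action_type, payload, action_time_micros):
        row = table.row(userid)
        row.set_cell(family=action_type,          # e.g. "web_query"
                     column="action",
                     value=payload,
                     timestamp=action_time_micros)
        row.commit()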
The Personalized Search data is replicated across several Bigtable clusters to increase availability and to reduce latency due to distance from clients. The Personalized Search team originally built a client-side replication
mechanism on top of Bigtable that ensured eventual consistency of all replicas. The current system now uses a
replication subsystem that is built into the servers.
The design of the Personalized Search storage system
allows other groups to add new per-user information in
their own columns, and the system is now used by many
other Google properties that need to store per-user configuration options and settings. Sharing a table amongst
many groups resulted in an unusually large number of
column families. To help support sharing, we added a
simple quota mechanism to Bigtable to limit the storage consumption by any particular client in shared tables; this mechanism provides some isolation between
the various product groups using this system for per-user
information storage.

9 Lessons
In the process of designing, implementing, maintaining,
and supporting Bigtable, we gained useful experience
and learned several interesting lessons.
One lesson we learned is that large distributed systems are vulnerable to many types of failures, not just
the standard network partitions and fail-stop failures assumed in many distributed protocols. For example, we
have seen problems due to all of the following causes:
memory and network corruption, large clock skew, hung
machines, extended and asymmetric network partitions,
bugs in other systems that we are using (Chubby for example), overflow of GFS quotas, and planned and unplanned hardware maintenance. As we have gained more
experience with these problems, we have addressed them
by changing various protocols. For example, we added
checksumming to our RPC mechanism. We also handled
some problems by removing assumptions made by one


part of the system about another part. For example, we
stopped assuming a given Chubby operation could return
only one of a fixed set of errors.
Another lesson we learned is that it is important to
delay adding new features until it is clear how the new
features will be used. For example, we initially planned
to support general-purpose transactions in our API. Because we did not have an immediate use for them, however, we did not implement them. Now that we have
many real applications running on Bigtable, we have
been able to examine their actual needs, and have discovered that most applications require only single-row transactions. Where people have requested distributed transactions, the most important use is for maintaining secondary indices, and we plan to add a specialized mechanism to satisfy this need. The new mechanism will
be less general than distributed transactions, but will be
more efficient (especially for updates that span hundreds
of rows or more) and will also interact better with our
scheme for optimistic cross-data-center replication.
A practical lesson that we learned from supporting
Bigtable is the importance of proper system-level monitoring (i.e., monitoring both Bigtable itself, as well as
the client processes using Bigtable). For example, we extended our RPC system so that for a sample of the RPCs,
it keeps a detailed trace of the important actions done on
behalf of that RPC. This feature has allowed us to detect and fix many problems such as lock contention on
tablet data structures, slow writes to GFS while committing Bigtable mutations, and stuck accesses to the
METADATA table when METADATA tablets are unavailable. Another example of useful monitoring is that every Bigtable cluster is registered in Chubby. This allows
us to track down all clusters, discover how big they are,
see which versions of our software they are running, how
much traffic they are receiving, and whether or not there
are any problems such as unexpectedly large latencies.
The most important lesson we learned is the value
of simple designs. Given both the size of our system
(about 100,000 lines of non-test code), as well as the
fact that code evolves over time in unexpected ways, we
have found that code and design clarity are of immense
help in code maintenance and debugging. One example of this is our tablet-server membership protocol. Our
first protocol was simple: the master periodically issued
leases to tablet servers, and tablet servers killed themselves if their lease expired. Unfortunately, this protocol reduced availability significantly in the presence of
network problems, and was also sensitive to master recovery time. We redesigned the protocol several times
until we had a protocol that performed well. However,
the resulting protocol was too complex and depended on
the behavior of Chubby features that were seldom exercised by other applications. We discovered that we were
spending an inordinate amount of time debugging obscure corner cases, not only in Bigtable code, but also in
Chubby code. Eventually, we scrapped this protocol and
moved to a newer simpler protocol that depends solely
on widely-used Chubby features.

10 Related Work
The Boxwood project [24] has components that overlap
in some ways with Chubby, GFS, and Bigtable, since it
provides for distributed agreement, locking, distributed
chunk storage, and distributed B-tree storage. In each
case where there is overlap, it appears that the Boxwood component is targeted at a somewhat lower level than the corresponding Google service. The Boxwood project's goal is to provide infrastructure for building
higher-level services such as file systems or databases,
while the goal of Bigtable is to directly support client
applications that wish to store data.
Many recent projects have tackled the problem of providing distributed storage or higher-level services over
wide area networks, often at Internet scale. This includes work on distributed hash tables that began with
projects such as CAN [29], Chord [32], Tapestry [37],
and Pastry [30]. These systems address concerns that do
not arise for Bigtable, such as highly variable bandwidth,
untrusted participants, or frequent reconfiguration; decentralized control and Byzantine fault tolerance are not
Bigtable goals.
In terms of the distributed data storage model that one
might provide to application developers, we believe the
key-value pair model provided by distributed B-trees or
distributed hash tables is too limiting. Key-value pairs
are a useful building block, but they should not be the
only building block one provides to developers. The
model we chose is richer than simple key-value pairs,
and supports sparse semi-structured data. Nonetheless,
it is still simple enough that it lends itself to a very efficient flat-file representation, and it is transparent enough
(via locality groups) to allow our users to tune important
behaviors of the system.
Several database vendors have developed parallel
databases that can store large volumes of data. Oracle's
Real Application Cluster database [27] uses shared disks
to store data (Bigtable uses GFS) and a distributed lock
manager (Bigtable uses Chubby). IBM's DB2 Parallel
Edition [4] is based on a shared-nothing [33] architecture
similar to Bigtable. Each DB2 server is responsible for
a subset of the rows in a table which it stores in a local
relational database. Both products provide a complete
relational model with transactions.
Bigtable locality groups realize similar compression and disk read performance benefits observed for other systems that organize data on disk using column-based rather than row-based storage, including C-Store [1, 34] and commercial products such as Sybase IQ [15, 36], SenSage [31], KDB+ [22], and the ColumnBM storage layer in MonetDB/X100 [38]. Another system that does vertical and horizontal data partitioning into flat files and achieves good data compression ratios is AT&T's Daytona database [19]. Locality groups do not support CPU-cache-level optimizations, such as those described by
Ailamaki [2].
The manner in which Bigtable uses memtables and
SSTables to store updates to tablets is analogous to the
way that the Log-Structured Merge Tree [26] stores updates to index data. In both systems, sorted data is
buffered in memory before being written to disk, and
reads must merge data from memory and disk.
C-Store and Bigtable share many characteristics: both
systems use a shared-nothing architecture and have two
different data structures, one for recent writes, and one
for storing long-lived data, with a mechanism for moving data from one form to the other. The systems differ significantly in their API: C-Store behaves like a
relational database, whereas Bigtable provides a lower
level read and write interface and is designed to support
many thousands of such operations per second per server.
C-Store is also a read-optimized relational DBMS,
whereas Bigtable provides good performance on both
read-intensive and write-intensive applications.
Bigtable's load balancer has to solve some of the same
kinds of load and memory balancing problems faced by
shared-nothing databases (e.g., [11, 35]). Our problem is
somewhat simpler: (1) we do not consider the possibility
of multiple copies of the same data, possibly in alternate
forms due to views or indices; (2) we let the user tell us
what data belongs in memory and what data should stay
on disk, rather than trying to determine this dynamically;
(3) we have no complex queries to execute or optimize.

Given the unusual interface to Bigtable, an interesting question is how difficult it has been for our users to
adapt to using it. New users are sometimes uncertain of
how to best use the Bigtable interface, particularly if they
are accustomed to using relational databases that support
general-purpose transactions. Nevertheless, the fact that
many Google products successfully use Bigtable demonstrates that our design works well in practice.
We are in the process of implementing several additional Bigtable features, such as support for secondary
indices and infrastructure for building cross-data-center
replicated Bigtables with multiple master replicas. We
have also begun deploying Bigtable as a service to product groups, so that individual groups do not need to maintain their own clusters. As our service clusters scale,
we will need to deal with more resource-sharing issues
within Bigtable itself [3, 5].
Finally, we have found that there are significant advantages to building our own storage solution at Google.
We have gotten a substantial amount of flexibility from
designing our own data model for Bigtable. In addition, our control over Bigtable's implementation, and
the other Google infrastructure upon which Bigtable depends, means that we can remove bottlenecks and inefficiencies as they arise.

11 Conclusions

We have described Bigtable, a distributed system for storing structured data at Google. Bigtable clusters have been in production use since April 2005, and we spent roughly seven person-years on design and implementation before that date. As of August 2006, more than sixty projects are using Bigtable. Our users like the performance and high availability provided by the Bigtable implementation, and that they can scale the capacity of their clusters by simply adding more machines to the system as their resource demands change over time.

Acknowledgements

We thank the anonymous reviewers, David Nagle, and our shepherd Brad Calder, for their feedback on this paper. The Bigtable system has benefited greatly from the feedback of our many users within Google. In addition, we thank the following people for their contributions to Bigtable: Dan Aguayo, Sameer Ajmani, Zhifeng Chen, Bill Coughran, Mike Epstein, Healfdene Goguen, Robert Griesemer, Jeremy Hylton, Josh Hyman, Alex Khesin, Joanna Kulik, Alberto Lerner, Sherry Listgarten, Mike Maloney, Eduardo Pinheiro, Kathy Polizzi, Frank Yellin, and Arthur Zwiegincew.

References

[1] Abadi, D. J., Madden, S. R., and Ferreira, M. C. Integrating compression and execution in column-oriented database systems. In Proc. of SIGMOD (2006).
[2] Ailamaki, A., DeWitt, D. J., Hill, M. D., and Skounakis, M. Weaving relations for cache performance. In The VLDB Journal (2001), pp. 169-180.
[3] Banga, G., Druschel, P., and Mogul, J. C. Resource containers: A new facility for resource management in server systems. In Proc. of the 3rd OSDI (Feb. 1999), pp. 45-58.
[4] Baru, C. K., Fecteau, G., Goyal, A., Hsiao, H., Jhingran, A., Padmanabhan, S., Copeland, G. P., and Wilson, W. G. DB2 parallel edition. IBM Systems Journal 34, 2 (1995), 292-322.
[5] Bavier, A., Bowman, M., Chun, B., Culler, D., Karlin, S., Peterson, L., Roscoe, T., Spalink, T., and Wawrzoniak, M. Operating system support for planetary-scale network services. In Proc. of the 1st NSDI (Mar. 2004), pp. 253-266.
[6] Bentley, J. L., and McIlroy, M. D. Data compression using long common strings. In Data Compression Conference (1999), pp. 287-295.
[7] Bloom, B. H. Space/time trade-offs in hash coding with allowable errors. CACM 13, 7 (1970), 422-426.
[8] Burrows, M. The Chubby lock service for loosely-coupled distributed systems. In Proc. of the 7th OSDI (Nov. 2006).
[9] Chandra, T., Griesemer, R., and Redstone, J. Paxos made live - an engineering perspective. In Proc. of PODC (2007).
[10] Comer, D. Ubiquitous B-tree. Computing Surveys 11, 2 (June 1979), 121-137.
[11] Copeland, G. P., Alexander, W., Boughter, E. E., and Keller, T. W. Data placement in Bubba. In Proc. of SIGMOD (1988), pp. 99-108.
[12] Dean, J., and Ghemawat, S. MapReduce: Simplified data processing on large clusters. In Proc. of the 6th OSDI (Dec. 2004), pp. 137-150.
[13] DeWitt, D., Katz, R., Olken, F., Shapiro, L., Stonebraker, M., and Wood, D. Implementation techniques for main memory database systems. In Proc. of SIGMOD (June 1984), pp. 1-8.
[14] DeWitt, D. J., and Gray, J. Parallel database systems: The future of high performance database systems. CACM 35, 6 (June 1992), 85-98.
[15] French, C. D. One size fits all database architectures do not work for DSS. In Proc. of SIGMOD (May 1995), pp. 449-450.
[16] Gawlick, D., and Kinkade, D. Varieties of concurrency control in IMS/VS fast path. Database Engineering Bulletin 8, 2 (1985), 3-10.
[17] Ghemawat, S., Gobioff, H., and Leung, S.-T. The Google file system. In Proc. of the 19th ACM SOSP (Dec. 2003), pp. 29-43.
[18] Gray, J. Notes on database operating systems. In Operating Systems - An Advanced Course, vol. 60 of Lecture Notes in Computer Science. Springer-Verlag, 1978.
[19] Greer, R. Daytona and the fourth-generation language Cymbal. In Proc. of SIGMOD (1999), pp. 525-526.
[20] Hagmann, R. Reimplementing the Cedar file system using logging and group commit. In Proc. of the 11th SOSP (Dec. 1987), pp. 155-162.
[21] Hartman, J. H., and Ousterhout, J. K. The Zebra striped network file system. In Proc. of the 14th SOSP (Asheville, NC, 1993), pp. 29-43.
[22] kx.com. kx.com/products/database.php. Product page.
[23] Lamport, L. The part-time parliament. ACM TOCS 16, 2 (1998), 133-169.
[24] MacCormick, J., Murphy, N., Najork, M., Thekkath, C. A., and Zhou, L. Boxwood: Abstractions as the foundation for storage infrastructure. In Proc. of the 6th OSDI (Dec. 2004), pp. 105-120.
[25] McCarthy, J. Recursive functions of symbolic expressions and their computation by machine. CACM 3, 4 (Apr. 1960), 184-195.
[26] O'Neil, P., Cheng, E., Gawlick, D., and O'Neil, E. The log-structured merge-tree (LSM-tree). Acta Inf. 33, 4 (1996), 351-385.
[27] oracle.com. www.oracle.com/technology/products/database/clustering/index.html. Product page.
[28] Pike, R., Dorward, S., Griesemer, R., and Quinlan, S. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming Journal 13, 4 (2005), 227-298.
[29] Ratnasamy, S., Francis, P., Handley, M., Karp, R., and Shenker, S. A scalable content-addressable network. In Proc. of SIGCOMM (Aug. 2001), pp. 161-172.
[30] Rowstron, A., and Druschel, P. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In Proc. of Middleware 2001 (Nov. 2001), pp. 329-350.
[31] sensage.com. sensage.com/products-sensage.htm. Product page.
[32] Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., and Balakrishnan, H. Chord: A scalable peer-to-peer lookup service for Internet applications. In Proc. of SIGCOMM (Aug. 2001), pp. 149-160.
[33] Stonebraker, M. The case for shared nothing. Database Engineering Bulletin 9, 1 (Mar. 1986), 4-9.
[34] Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O'Neil, E., O'Neil, P., Rasin, A., Tran, N., and Zdonik, S. C-Store: A column-oriented DBMS. In Proc. of VLDB (Aug. 2005), pp. 553-564.
[35] Stonebraker, M., Aoki, P. M., Devine, R., Litwin, W., and Olson, M. A. Mariposa: A new architecture for distributed data. In Proc. of the Tenth ICDE (1994), IEEE Computer Society, pp. 54-65.
[36] sybase.com. www.sybase.com/products/databaseservers/sybaseiq. Product page.
[37] Zhao, B. Y., Kubiatowicz, J., and Joseph, A. D. Tapestry: An infrastructure for fault-tolerant wide-area location and routing. Tech. Rep. UCB/CSD-01-1141, CS Division, UC Berkeley, Apr. 2001.
[38] Zukowski, M., Boncz, P. A., Nes, N., and Heman, S. MonetDB/X100 - A DBMS in the CPU cache. IEEE Data Eng. Bull. 28, 2 (2005), 17-22.

Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center


Benjamin Hindman, Andy Konwinski, Matei Zaharia,
Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica
University of California, Berkeley

Abstract

We present Mesos, a platform for sharing commodity clusters between multiple diverse cluster computing frameworks, such as Hadoop and MPI. Sharing improves cluster utilization and avoids per-framework data replication. Mesos shares resources in a fine-grained manner, allowing frameworks to achieve data locality by taking turns reading data stored on each machine. To support the sophisticated schedulers of today's frameworks, Mesos introduces a distributed two-level scheduling mechanism called resource offers. Mesos decides how many resources to offer each framework, while frameworks decide which resources to accept and which computations to run on them. Our results show that Mesos can achieve near-optimal data locality when sharing the cluster among diverse frameworks, can scale to 50,000 (emulated) nodes, and is resilient to failures.

1 Introduction

Clusters of commodity servers have become a major computing platform, powering both large Internet services and a growing number of data-intensive scientific applications. Driven by these applications, researchers and practitioners have been developing a diverse array of cluster computing frameworks to simplify programming the cluster. Prominent examples include MapReduce [18], Dryad [24], MapReduce Online [17] (which supports streaming jobs), Pregel [28] (a specialized framework for graph computations), and others [27, 19, 30]. It seems clear that new cluster computing frameworks1 will continue to emerge, and that no framework will be optimal for all applications. Therefore, organizations will want to run multiple frameworks in the same cluster, picking the best one for each application. Multiplexing a cluster between frameworks improves utilization and allows applications to share access to large datasets that may be too costly to replicate across clusters.

1 By framework, we mean a software system that manages and executes one or more jobs on a cluster.

Two common solutions for sharing a cluster today are either to statically partition the cluster and run one framework per partition, or to allocate a set of VMs to each framework. Unfortunately, these solutions achieve neither high utilization nor efficient data sharing. The main problem is the mismatch between the allocation granularities of these solutions and of existing frameworks. Many frameworks, such as Hadoop and Dryad, employ a fine-grained resource sharing model, where nodes are subdivided into slots and jobs are composed of short tasks that are matched to slots [25, 38]. The short duration of tasks and the ability to run multiple tasks per node allow jobs to achieve high data locality, as each job will quickly get a chance to run on nodes storing its input data. Short tasks also allow frameworks to achieve high utilization, as jobs can rapidly scale when new nodes become available. Unfortunately, because these frameworks are developed independently, there is no way to perform fine-grained sharing across frameworks, making it difficult to share clusters and data efficiently between them.

In this paper, we propose Mesos, a thin resource sharing layer that enables fine-grained sharing across diverse cluster computing frameworks, by giving frameworks a common interface for accessing cluster resources.

The main design question for Mesos is how to build a scalable and efficient system that supports a wide array of both current and future frameworks. This is challenging for several reasons. First, each framework will have different scheduling needs, based on its programming model, communication pattern, task dependencies, and data placement. Second, the scheduling system must scale to clusters of tens of thousands of nodes running hundreds of jobs with millions of tasks. Finally, because all the applications in the cluster depend on Mesos, the system must be fault-tolerant and highly available.

One approach would be for Mesos to implement a centralized scheduler that takes as input framework requirements, resource availability, and organizational policies, and computes a global schedule for all tasks. While this approach can optimize scheduling across frameworks, it


faces several challenges. The first is complexity. The
scheduler would need to provide a sufficiently expressive API to capture all frameworks' requirements, and
to solve an online optimization problem for millions
of tasks. Even if such a scheduler were feasible, this
complexity would have a negative impact on its scalability and resilience. Second, as new frameworks and
new scheduling policies for current frameworks are constantly being developed [37, 38, 40, 26], it is not clear
whether we are even at the point to have a full specification of framework requirements. Third, many existing
frameworks implement their own sophisticated scheduling [25, 38], and moving this functionality to a global
scheduler would require expensive refactoring.
Instead, Mesos takes a different approach: delegating
control over scheduling to the frameworks. This is accomplished through a new abstraction, called a resource
offer, which encapsulates a bundle of resources that a
framework can allocate on a cluster node to run tasks.
Mesos decides how many resources to offer each framework, based on an organizational policy such as fair sharing, while frameworks decide which resources to accept
and which tasks to run on them. While this decentralized scheduling model may not always lead to globally
optimal scheduling, we have found that it performs surprisingly well in practice, allowing frameworks to meet
goals such as data locality nearly perfectly. In addition,
resource offers are simple and efficient to implement, allowing Mesos to be highly scalable and robust to failures.
Mesos also provides other benefits to practitioners.
First, even organizations that only use one framework
can use Mesos to run multiple instances of that framework in the same cluster, or multiple versions of the
framework. Our contacts at Yahoo! and Facebook indicate that this would be a compelling way to isolate
production and experimental Hadoop workloads and to
roll out new versions of Hadoop [11, 10]. Second,
Mesos makes it easier to develop and immediately experiment with new frameworks. The ability to share resources across multiple frameworks frees the developers
to build and run specialized frameworks targeted at particular problem domains rather than one-size-fits-all abstractions. Frameworks can therefore evolve faster and
provide better support for each problem domain.
We have implemented Mesos in 10,000 lines of C++.
The system scales to 50,000 (emulated) nodes and uses
ZooKeeper [4] for fault tolerance. To evaluate Mesos, we
have ported three cluster computing systems to run over
it: Hadoop, MPI, and the Torque batch scheduler. To validate our hypothesis that specialized frameworks provide
value over general ones, we have also built a new framework on top of Mesos called Spark, optimized for iterative jobs where a dataset is reused in many parallel oper-

1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0

MapReduce Jobs
Map & Reduce Tasks
1

10

100

1000

10000

100000

Duration (s)

Figure 1: CDF of job and task durations in Facebooks Hadoop


data warehouse (data from [38]).

ations, and shown that Spark can outperform Hadoop by


10x in iterative machine learning workloads.
This paper is organized as follows. Section 2 details
the data center environment that Mesos is designed for.
Section 3 presents the architecture of Mesos. Section 4
analyzes our distributed scheduling model (resource offers) and characterizes the environments that it works
well in. We present our implementation of Mesos in Section 5 and evaluate it in Section 6. We survey related
work in Section 7. Finally, we conclude in Section 8.

2 Target Environment

As an example of a workload we aim to support, consider the Hadoop data warehouse at Facebook [5]. Facebook loads logs from its web services into a 2000-node
Hadoop cluster, where they are used for applications
such as business intelligence, spam detection, and ad
optimization. In addition to production jobs that run
periodically, the cluster is used for many experimental
jobs, ranging from multi-hour machine learning computations to 1-2 minute ad-hoc queries submitted interactively through an SQL interface called Hive [3]. Most
jobs are short (the median job being 84s long), and the
jobs are composed of fine-grained map and reduce tasks
(the median task being 23s), as shown in Figure 1.
To meet the performance requirements of these jobs,
Facebook uses a fair scheduler for Hadoop that takes advantage of the fine-grained nature of the workload to allocate resources at the level of tasks and to optimize data
locality [38]. Unfortunately, this means that the cluster
can only run Hadoop jobs. If a user wishes to write an ad
targeting algorithm in MPI instead of MapReduce, perhaps because MPI is more efficient for this job's communication pattern, then the user must set up a separate MPI
cluster and import terabytes of data into it. This problem
is not hypothetical; our contacts at Yahoo! and Facebook
report that users want to run MPI and MapReduce Online
(a streaming MapReduce) [11, 10]. Mesos aims to provide fine-grained sharing between multiple cluster computing frameworks to enable these usage scenarios.

[Diagram omitted: in the resource offer example, slave 1 reports <s1, 4cpu, 4gb, ...> to the master's allocation module, and framework 1 launches tasks <fw1, task1, 2cpu, 1gb, ...> and <fw1, task2, 1cpu, 2gb, ...> on it.]

Figure 3: Resource offer example.

To support a diverse set of inter-framework


allocation policies, Mesos lets organizations define their
own policies via a pluggable allocation module.
Each framework running on Mesos consists of two
components: a scheduler that registers with the master
to be offered resources, and an executor process that is
launched on slave nodes to run the framework's tasks. While the master determines how many resources to offer to each framework, the frameworks' schedulers select
which of the offered resources to use. When a framework
accepts offered resources, it passes Mesos a description
of the tasks it wants to launch on them.
Figure 3 shows an example of how a framework gets
scheduled to run tasks. In step (1), slave 1 reports
to the master that it has 4 CPUs and 4 GB of memory free. The master then invokes the allocation module, which tells it that framework 1 should be offered
all available resources. In step (2), the master sends a
resource offer describing these resources to framework
1. In step (3), the framework's scheduler replies to the master with information about two tasks to run on the slave, using <2 CPUs, 1 GB RAM> for the first task, and <1 CPU, 2 GB RAM> for the second task. Finally, in step (4), the master sends the tasks to the slave, which allocates appropriate resources to the framework's executor, which in turn launches the two tasks (depicted with
dotted borders). Because 1 CPU and 1 GB of RAM are
still free, the allocation module may now offer them to
framework 2. In addition, this resource offer process repeats when tasks finish and new resources become free.
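A schematic version of this exchange, from a framework scheduler's point of view (hypothetical names; not the actual Mesos C++ API summarized in Table 1), looks like:

    # Sketch of one offer/accept round in the resource offer model.
    def resource_offer(scheduler, offer):
        tasks = []
        for slave_id, free in offer.items():          # e.g. {"s1": {"cpu": 4, "mem_gb": 4}}
            while free["cpu"] >= 2 and free["mem_gb"] >= 1 and scheduler.has_work():
                tasks.append(scheduler.make_task(slave_id, cpu=2, mem_gb=1))
                free["cpu"] -= 2
                free["mem_gb"] -= 1
        # The reply lists the tasks to launch; leftover resources can be
        # offered to other frameworks by the master.
        return tasks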
To maintain a thin interface and enable frameworks
to evolve independently, Mesos does not require frameworks to specify their resource requirements or constraints. Instead, Mesos gives frameworks the ability to
reject offers. A framework can reject resources that do
not satisfy its constraints in order to wait for ones that
do. Thus, the rejection mechanism enables frameworks
to support arbitrarily complex resource constraints while
keeping Mesos simple and scalable.

3 Architecture

We begin our description of Mesos by discussing our design philosophy. We then describe the components of Mesos, our resource allocation mechanisms, and how Mesos achieves isolation, scalability, and fault tolerance.

3.1 Design Philosophy

Mesos aims to provide a scalable and resilient core for enabling various frameworks to efficiently share clusters. Because cluster frameworks are both highly diverse and rapidly evolving, our overriding design philosophy has been to define a minimal interface that enables efficient resource sharing across frameworks, and otherwise push control of task scheduling and execution to the frameworks. Pushing control to the frameworks has two benefits. First, it allows frameworks to implement diverse approaches to various problems in the cluster (e.g., achieving data locality, dealing with faults), and to evolve these solutions independently. Second, it keeps Mesos simple and minimizes the rate of change required of the system, which makes it easier to keep Mesos scalable and robust.

Although Mesos provides a low-level interface, we expect higher-level libraries implementing common functionality (such as fault tolerance) to be built on top of it. These libraries would be analogous to library OSes in the exokernel [20]. Putting this functionality in libraries rather than in Mesos allows Mesos to remain small and flexible, and lets the libraries evolve independently.

[Diagram omitted: a Mesos master, with standby masters coordinated through a ZooKeeper quorum, manages Mesos slaves that run Hadoop and MPI executors and their tasks.]

Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI).

3.2 Overview

Figure 2 shows the main components of Mesos. Mesos


consists of a master process that manages slave daemons
running on each cluster node, and frameworks that run
tasks on these slaves.
The master implements fine-grained sharing across
frameworks using resource offers. Each resource offer
is a list of free resources on multiple slaves. The master
decides how many resources to offer to each framework
according to an organizational policy, such as fair sharing or priority.

One potential challenge with solely using the rejection mechanism to satisfy all framework constraints is


efficiency: a framework may have to wait a long time
before it receives an offer satisfying its constraints, and
Mesos may have to send an offer to many frameworks
before one of them accepts it. To avoid this, Mesos also
allows frameworks to set filters, which are Boolean predicates specifying that a framework will always reject certain resources. For example, a framework might specify
a whitelist of nodes it can run on.
There are two points worth noting. First, filters represent just a performance optimization for the resource offer model, as the frameworks still have the ultimate control to reject any resources that they cannot express filters
for and to choose which tasks to run on each node. Second, as we will show in this paper, when the workload
consists of fine-grained tasks (e.g., in MapReduce and
Dryad workloads), the resource offer model performs
surprisingly well even in the absence of filters. In particular, we have found that a simple policy called delay
scheduling [38], in which frameworks wait for a limited
time to acquire nodes storing their data, yields nearly optimal data locality with a wait time of 1-5s.
In the rest of this section, we describe how Mesos performs two key functions: resource allocation (§3.3) and resource isolation (§3.4). We then describe filters and several other mechanisms that make resource offers scalable and robust (§3.5). Finally, we discuss fault tolerance in Mesos (§3.6) and summarize the Mesos API (§3.7).
3.3 Resource Allocation

Mesos delegates allocation decisions to a pluggable allocation module, so that organizations can tailor allocation to their needs. So far, we have implemented two allocation modules: one that performs fair sharing based on a generalization of max-min fairness for multiple resources [21] and one that implements strict priorities. Similar policies are used in Hadoop and Dryad [25, 38].

In normal operation, Mesos takes advantage of the fact that most tasks are short, and only reallocates resources when tasks finish. This usually happens frequently enough so that new frameworks acquire their share quickly. For example, if a framework's share is 10% of the cluster, it needs to wait approximately 10% of the mean task length to receive its share. However, if a cluster becomes filled by long tasks, e.g., due to a buggy job or a greedy framework, the allocation module can also revoke (kill) tasks. Before killing a task, Mesos gives its framework a grace period to clean it up.

We leave it up to the allocation module to select the policy for revoking tasks, but describe two related mechanisms here. First, while killing a task has a low impact on many frameworks (e.g., MapReduce), it is harmful for frameworks with interdependent tasks (e.g., MPI). We allow these frameworks to avoid being killed by letting allocation modules expose a guaranteed allocation to each framework: a quantity of resources that the framework may hold without losing tasks. Frameworks read their guaranteed allocations through an API call. Allocation modules are responsible for ensuring that the guaranteed allocations they provide can all be met concurrently. For now, we have kept the semantics of guaranteed allocations simple: if a framework is below its guaranteed allocation, none of its tasks should be killed, and if it is above, any of its tasks may be killed.

Second, to decide when to trigger revocation, Mesos must know which of the connected frameworks would use more resources if they were offered them. Frameworks indicate their interest in offers through an API call.

3.4 Isolation

Mesos provides performance isolation between framework executors running on the same slave by leveraging existing OS isolation mechanisms. Since these mechanisms are platform-dependent, we support multiple isolation mechanisms through pluggable isolation modules.

We currently isolate resources using OS container technologies, specifically Linux Containers [9] and Solaris Projects [13]. These technologies can limit the CPU, memory, network bandwidth, and (in new Linux kernels) I/O usage of a process tree. These isolation technologies are not perfect, but using containers is already an advantage over frameworks like Hadoop, where tasks from different jobs simply run in separate processes.

3.5 Making Resource Offers Scalable and Robust

Because task scheduling in Mesos is a distributed process, it needs to be efficient and robust to failures. Mesos includes three mechanisms to help with this goal.

First, because some frameworks will always reject certain resources, Mesos lets them short-circuit the rejection process and avoid communication by providing filters to the master. We currently support two types of filters: "only offer nodes from list L" and "only offer nodes with at least R resources free". However, other types of predicates could also be supported. Note that unlike generic constraint languages, filters are Boolean predicates that specify whether a framework will reject one bundle of resources on one node, so they can be evaluated quickly on the master. Any resource that does not pass a framework's filter is treated exactly like a rejected resource.

Second, because a framework may take time to respond to an offer, Mesos counts resources offered to a framework towards its allocation of the cluster. This is a strong incentive for frameworks to respond to offers quickly and to filter resources that they cannot use.

Third, if a framework has not responded to an offer for a sufficiently long time, Mesos rescinds the offer and re-offers the resources to other frameworks.
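Both filter types reduce to cheap per-node Boolean predicates, roughly as follows (hypothetical helper names, not the Mesos implementation):

    # Sketch of the two supported filter types as predicates over (node, free resources).
    def node_whitelist_filter(nodes):
        allowed = set(nodes)
        return lambda node, free: node in allowed                  # "only offer nodes from list L"

    def min_free_filter(min_cpu, min_mem_gb):
        return lambda node, free: (free["cpu"] >= min_cpu and      # "only offer nodes with at
                                   free["mem_gb"] >= min_mem_gb)   #  least R resources free"

    # The master evaluates a framework's filters before sending it an offer;
    # a failing predicate is treated exactly like a rejected resource.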

Scheduler Callbacks
resourceOffer(offerId, offers)
offerRescinded(offerId)
statusUpdate(taskId, status)
slaveLost(slaveId)
Executor Callbacks
Scheduler Actions
replyToOffer(offerId, tasks)
setNeedsOffers(bool)
setFilters(filters)
getGuaranteedShare()
killTask(taskId)

Executor Callbacks
launchTask(taskDescriptor)
killTask(taskId)

Executor Actions
sendStatus(taskId, status)

Table 1: Mesos API functions for schedulers and executors.

3.6 Fault Tolerance

Since all the frameworks depend on the Mesos master, it is critical to make the master fault-tolerant. To achieve this, we have designed the master to be soft state, so that a new master can completely reconstruct its internal state from information held by the slaves and the framework schedulers. In particular, the master's only state is the list of active slaves, active frameworks, and running tasks. This information is sufficient to compute how many resources each framework is using and run the allocation policy. We run multiple masters in a hot-standby configuration using ZooKeeper [4] for leader election. When the active master fails, the slaves and schedulers connect to the next elected master and repopulate its state.

Aside from handling master failures, Mesos reports node failures and executor crashes to frameworks' schedulers. Frameworks can then react to these failures using the policies of their choice.

Finally, to deal with scheduler failures, Mesos allows a framework to register multiple schedulers such that when one fails, another one is notified by the Mesos master to take over. Frameworks must use their own mechanisms to share state between their schedulers.

3.7 API Summary

Table 1 summarizes the Mesos API. The "callback" columns list functions that frameworks must implement, while "actions" are operations that they can invoke.

4 Mesos Behavior

In this section, we study Mesos's behavior for different workloads. Our goal is not to develop an exact model of the system, but to provide a coarse understanding of its behavior, in order to characterize the environments that Mesos's distributed scheduling model works well in.

In short, we find that Mesos performs very well when frameworks can scale up and down elastically, task durations are homogeneous, and frameworks prefer all nodes equally (Section 4.2). When different frameworks prefer different nodes, we show that Mesos can emulate a centralized scheduler that performs fair sharing across frameworks (Section 4.3). In addition, we show that Mesos can handle heterogeneous task durations without impacting the performance of frameworks with short tasks (Section 4.4). We also discuss how frameworks are incentivized to improve their performance under Mesos, and argue that these incentives also improve overall cluster utilization (Section 4.5). We conclude this section with some limitations of Mesos's distributed scheduling model (Section 4.6).

4.1 Definitions, Metrics and Assumptions

In our discussion, we consider three metrics:

Framework ramp-up time: time it takes a new framework to achieve its allocation (e.g., fair share);

Job completion time: time it takes a job to complete, assuming one job per framework;

System utilization: total cluster utilization.

We characterize workloads along two dimensions: elasticity and task duration distribution. An elastic framework, such as Hadoop and Dryad, can scale its resources up and down, i.e., it can start using nodes as soon as it acquires them and release them as soon as its tasks finish. In contrast, a rigid framework, such as MPI, can start running its jobs only after it has acquired a fixed quantity of resources, and cannot scale up dynamically to take advantage of new resources or scale down without a large impact on performance. For task durations, we consider both homogeneous and heterogeneous distributions.

We also differentiate between two types of resources: mandatory and preferred. A resource is mandatory if a framework must acquire it in order to run. For example, a graphical processing unit (GPU) is mandatory if a framework cannot run without access to a GPU. In contrast, a resource is preferred if a framework performs better using it, but can also run using another equivalent resource. For example, a framework may prefer running on a node that locally stores its data, but may also be able to read the data remotely if it must.

We assume the amount of mandatory resources requested by a framework never exceeds its guaranteed share. This ensures that frameworks will not deadlock waiting for the mandatory resources to become free.[2] For simplicity, we also assume that all tasks have the same resource demands and run on identical slices of machines called slots, and that each framework runs a single job.
4.2 Homogeneous Tasks

We consider a cluster with n slots and a framework, f, that is entitled to k slots. For the purpose of this analysis, we consider two distributions of the task durations: constant (i.e., all tasks have the same length) and exponential. Let the mean task duration be T, and assume that framework f runs a job which requires βkT total computation time.

[2] In workloads where the mandatory resource demands of the active frameworks can exceed the capacity of the cluster, the allocation module needs to implement admission control.

                    Elastic Framework                      Rigid Framework
                    Constant dist.   Exponential dist.     Constant dist.     Exponential dist.
Ramp-up time        T                T ln k                T                  T ln k
Completion time     (1/2 + β)T       (1 + β)T              (1 + β)T           (ln k + β)T
Utilization         1                1                     β/(1/2 + β)        β/(ln k − 1 + β)

Table 2: Ramp-up time, job completion time and utilization for both elastic and rigid frameworks, and for both constant and exponential task duration distributions. The framework starts with no slots. k is the number of slots the framework is entitled to under the scheduling policy, and βT represents the time it takes a job to complete assuming the framework gets all k slots at once.

That is, when the framework has k slots, it takes its job βT time to finish.

Table 2 summarizes the job completion times and system utilization for the two types of frameworks and the two types of task length distributions. As expected, elastic frameworks with constant task durations perform the best, while rigid frameworks with exponential task durations perform the worst. Due to lack of space, we present only the results here and include derivations in [23].
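The ramp-up entries in Table 2 are easy to check with a rough Monte Carlo sketch. The sketch below is not from the paper; it assumes that the k slots the framework is entitled to each free up independently (remaining time uniform in [0, T] for constant-duration tasks, exponential with mean T otherwise) and that the framework is offered each slot the first time it frees.

```python
import random
import math

def rampup_time(k, mean_T, dist):
    """Time until each of the k entitled slots has freed up at least once."""
    if dist == "constant":
        # Tasks all last mean_T but started at uniformly random times in the past.
        remaining = [random.uniform(0, mean_T) for _ in range(k)]
    else:
        # Exponential durations are memoryless, so remaining time is Exp(mean=mean_T).
        remaining = [random.expovariate(1.0 / mean_T) for _ in range(k)]
    return max(remaining)

def estimate(k=100, mean_T=1.0, dist="constant", trials=1000):
    return sum(rampup_time(k, mean_T, dist) for _ in range(trials)) / trials

if __name__ == "__main__":
    # Expect roughly T for constant durations and roughly T * H_k ~ T ln k
    # for exponential durations, in line with the first row of Table 2.
    print("constant   :", round(estimate(dist="constant"), 2))
    print("exponential:", round(estimate(dist="exponential"), 2))
    print("T ln k     :", round(1.0 * math.log(100), 2))
```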

Framework ramp-up time: If task durations are constant, it will take framework f at most T time to acquire k slots. This is simply because during a T interval, every slot will become available, which will enable Mesos to offer the framework all k of its preferred slots. If the duration distribution is exponential, the expected ramp-up time can be as high as T ln k [23].

Job completion time: The expected completion time[3] of an elastic job is at most (1 + β)T, which is within T (i.e., the mean task duration) of the completion time of the job when it gets all its slots instantaneously. Rigid jobs achieve similar completion times for constant task durations, but exhibit much higher completion times for exponential job durations, i.e., (ln k + β)T. This is simply because it takes a framework T ln k time on average to acquire all its slots and be able to start its job.

[3] When computing job completion time we assume that the last tasks of the job running on the framework's k slots finish at the same time.

System utilization: Elastic jobs fully utilize their allocated slots, because they can use every slot as soon as they get it. As a result, assuming infinite demand, a system running only elastic jobs is fully utilized. Rigid frameworks achieve slightly worse utilizations, as their jobs cannot start before they get their full allocations, and thus they waste the resources held while ramping up.

4.3 Placement Preferences

So far, we have assumed that frameworks have no slot preferences. In practice, different frameworks prefer different nodes and their preferences may change over time. In this section, we consider the case where frameworks have different preferred slots.

The natural question is how well Mesos will work compared to a central scheduler that has full information about framework preferences. We consider two cases: (a) there exists a system configuration in which each framework gets all its preferred slots and achieves its full allocation, and (b) there is no such configuration, i.e., the demand for some preferred slots exceeds the supply.

In the first case, it is easy to see that, irrespective of the initial configuration, the system will converge to the state where each framework allocates its preferred slots after at most one T interval. This is simply because during a T interval all slots become available, and as a result each framework will be offered its preferred slots.

In the second case, there is no configuration in which all frameworks can satisfy their preferences. The key question in this case is how one should allocate the preferred slots across the frameworks demanding them. In particular, assume there are p slots preferred by m frameworks, where framework i requests r_i such slots, and Σ_i r_i > p. While many allocation policies are possible, here we consider a weighted fair allocation policy where the weight associated with framework i is its intended total allocation, s_i. In other words, assuming that each framework has enough demand, we aim to allocate p·s_i / (Σ_j s_j) preferred slots to framework i.

The challenge in Mesos is that the scheduler does not know the preferences of each framework. Fortunately, it turns out that there is an easy way to achieve the weighted allocation of the preferred slots described above: simply perform lottery scheduling [36], offering slots to frameworks with probabilities proportional to their intended allocations. In particular, when a slot becomes available, Mesos can offer that slot to framework i with probability s_i / (Σ_j s_j), where the sum ranges over all n frameworks in the system. Furthermore, because each framework i receives on average s_i slots every T time units, the results for ramp-up times and completion times in Section 4.2 still hold.

4.4 Heterogeneous Tasks

So far we have assumed that frameworks have homogeneous task duration distributions, i.e., that all frameworks have the same task duration distribution. In this section, we discuss frameworks with heterogeneous task duration distributions. In particular, we consider a workload where tasks are either short or long, where the mean duration of the long tasks is significantly longer than the mean of the short tasks. Such heterogeneous

workloads can hurt frameworks with short tasks. In the


worst case, all nodes required by a short job might be
filled with long tasks, so the job may need to wait a long
time (relative to its execution time) to acquire resources.
We note first that random task assignment can work well if the fraction φ of long tasks is not very close to 1 and if each node supports multiple slots. For example, in a cluster with S slots per node, the probability that a node is filled with long tasks will be φ^S. When S is large (e.g., in the case of multicore machines), this probability is small even with φ > 0.5. If S = 8 and φ = 0.5, for example, the probability that a node is filled with long tasks
is 0.4%. Thus, a framework with short tasks can still acquire many preferred slots in a short period of time. In
addition, the more slots a framework is able to use, the
likelier it is that at least k of them are running short tasks.
To further alleviate the impact of long tasks, Mesos
can be extended slightly to allow allocation policies to
reserve some resources on each node for short tasks. In
particular, we can associate a maximum task duration
with some of the resources on each node, after which
tasks running on those resources are killed. These time
limits can be exposed to the frameworks in resource offers, allowing them to choose whether to use these resources. This scheme is similar to the common policy of
having a separate queue for short jobs in HPC clusters.
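The 0.4% figure above is just φ^S evaluated at φ = 0.5 and S = 8; a one-line check, assuming each slot independently runs a long task with probability φ:

```python
# Probability that a node with S slots is entirely filled with long tasks,
# assuming each slot independently runs a long task with probability phi.
def p_node_all_long(phi: float, S: int) -> float:
    return phi ** S

if __name__ == "__main__":
    print(f"{p_node_all_long(0.5, 8):.4f}")   # 0.0039, i.e., about 0.4%
    # Even with a majority of long tasks, a large S keeps the probability small.
    print(f"{p_node_all_long(0.7, 16):.6f}")  # ~0.003
```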
4.5 Framework Incentives

Mesos implements a decentralized scheduling model, where each framework decides which offers to accept. As with any decentralized system, it is important to understand the incentives of entities in the system. In this section, we discuss the incentives of frameworks (and their users) to improve the response times of their jobs.

Short tasks: A framework is incentivized to use short tasks for two reasons. First, it will be able to allocate any resources reserved for short slots. Second, using small tasks minimizes the wasted work if the framework loses a task, either due to revocation or simply due to failures.

Scale elastically: The ability of a framework to use resources as soon as it acquires them, instead of waiting to reach a given minimum allocation, would allow the framework to start (and complete) its jobs earlier. In addition, the ability to scale up and down allows a framework to grab unused resources opportunistically, as it can later release them with little negative impact.

Do not accept unknown resources: Frameworks are incentivized not to accept resources that they cannot use because most allocation policies will count all the resources that a framework owns when making offers.

We note that these incentives align well with our goal of improving utilization. If frameworks use short tasks, Mesos can reallocate resources quickly between them, reducing latency for new jobs and wasted work for revocation. If frameworks are elastic, they will opportunistically utilize all the resources they can obtain. Finally, if frameworks do not accept resources that they do not understand, they will leave them for frameworks that do.

We also note that these properties are met by many current cluster computing frameworks, such as MapReduce and Dryad, simply because using short independent tasks simplifies load balancing and fault recovery.

4.6 Limitations of Distributed Scheduling

Although we have shown that distributed scheduling works well in a range of workloads relevant to current cluster environments, like any decentralized approach, it can perform worse than a centralized scheduler. We have identified three limitations of the distributed model:

Fragmentation: When tasks have heterogeneous resource demands, a distributed collection of frameworks may not be able to optimize bin packing as well as a centralized scheduler. However, note that the wasted space due to suboptimal bin packing is bounded by the ratio between the largest task size and the node size. Therefore, clusters running larger nodes (e.g., multicore nodes) and smaller tasks within those nodes will achieve high utilization even with distributed scheduling.

There is another possible bad outcome if allocation modules reallocate resources in a naive manner: when a cluster is filled by tasks with small resource requirements, a framework f with large resource requirements may starve, because whenever a small task finishes, f cannot accept the resources freed by it, but other frameworks can. To accommodate frameworks with large per-task resource requirements, allocation modules can support a minimum offer size on each slave, and abstain from offering resources on the slave until this amount is free.

Interdependent framework constraints: It is possible to construct scenarios where, because of esoteric interdependencies between frameworks (e.g., certain tasks from two frameworks cannot be colocated), only a single global allocation of the cluster performs well. We argue such scenarios are rare in practice. In the model discussed in this section, where frameworks only have preferences over which nodes they use, we showed that allocations approximate those of optimal schedulers.

Framework complexity: Using resource offers may make framework scheduling more complex. We argue, however, that this difficulty is not onerous. First, whether using Mesos or a centralized scheduler, frameworks need to know their preferences; in a centralized scheduler, the framework needs to express them to the scheduler, whereas in Mesos, it must use them to decide which offers to accept. Second, many scheduling policies for existing frameworks are online algorithms, because frameworks cannot predict task times and must be able to handle failures and stragglers [18, 40, 38]. These policies are easy to implement over resource offers.
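To illustrate the last point, a framework's per-offer decision logic can be quite small. The following is a minimal, hypothetical sketch (plain Python; the class and field names are illustrative and are not Mesos's actual API) of an online policy that launches a pending task on an offered node if the node holds the task's input data, or on any sufficiently large node once the task has waited too long:

```python
import time

# Hypothetical offer/task records; field names are illustrative, not Mesos's API.
class Offer:
    def __init__(self, offer_id, hostname, cpus, mem):
        self.offer_id, self.hostname = offer_id, hostname
        self.cpus, self.mem = cpus, mem

class PendingTask:
    def __init__(self, task_id, preferred_hosts, cpus=1, mem=1024, submitted=None):
        self.task_id, self.preferred_hosts = task_id, preferred_hosts
        self.cpus, self.mem = cpus, mem
        self.submitted = submitted if submitted is not None else time.time()

class SimpleScheduler:
    """Online policy over resource offers: prefer data-local nodes, but fall
    back to any node after MAX_WAIT seconds so tasks are not starved."""
    MAX_WAIT = 5.0

    def __init__(self, pending):
        self.pending = list(pending)

    def resource_offer(self, offer):
        """Called when an offer arrives; returns the tasks to launch on it."""
        to_launch = []
        for task in list(self.pending):
            fits = task.cpus <= offer.cpus and task.mem <= offer.mem
            local = offer.hostname in task.preferred_hosts
            waited = time.time() - task.submitted
            if fits and (local or waited > self.MAX_WAIT):
                to_launch.append(task)
                offer.cpus -= task.cpus
                offer.mem -= task.mem
                self.pending.remove(task)
        return to_launch  # an empty list means the offer is declined
```

In this sketch, "declining" an offer simply means replying with no tasks (cf. replyToOffer in Table 1); the framework never needs a global view of the cluster.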

5 Implementation

We have implemented Mesos in about 10,000 lines of C++. The system runs on Linux, Solaris and OS X, and supports frameworks written in C++, Java, and Python.

To reduce the complexity of our implementation, we use a C++ library called libprocess [7] that provides an actor-based programming model using efficient asynchronous I/O mechanisms (epoll, kqueue, etc.). We also use ZooKeeper [4] to perform leader election.

Mesos can use Linux containers [9] or Solaris projects [13] to isolate tasks. We currently isolate CPU cores and memory. We plan to leverage recently added support for network and I/O isolation in Linux [8] in the future.

We have implemented four frameworks on top of Mesos. First, we have ported three existing cluster computing systems: Hadoop [2], the Torque resource scheduler [33], and the MPICH2 implementation of MPI [16]. None of these ports required changing these frameworks' APIs, so all of them can run unmodified user programs. In addition, we built a specialized framework for iterative jobs called Spark, which we discuss in Section 5.3.

5.1 Hadoop Port

Porting Hadoop to run on Mesos required relatively few modifications, because Hadoop's fine-grained map and reduce tasks map cleanly to Mesos tasks. In addition, the Hadoop master, known as the JobTracker, and Hadoop slaves, known as TaskTrackers, fit naturally into the Mesos model as a framework scheduler and executor.

To add support for running Hadoop on Mesos, we took advantage of the fact that Hadoop already has a pluggable API for writing job schedulers. We wrote a Hadoop scheduler that connects to Mesos, launches TaskTrackers as its executors, and maps each Hadoop task to a Mesos task. When there are unlaunched tasks in Hadoop, our scheduler first starts Mesos tasks on the nodes of the cluster that it wants to use, and then sends the Hadoop tasks to them using Hadoop's existing internal interfaces. When tasks finish, our executor notifies Mesos by listening for task finish events using an API in the TaskTracker.

We used delay scheduling [38] to achieve data locality by waiting for slots on the nodes that contain task input data. In addition, our approach allowed us to reuse Hadoop's existing logic for re-scheduling of failed tasks and for speculative execution (straggler mitigation).

We also needed to change how map output data is served to reduce tasks. Hadoop normally writes map output files to the local filesystem, then serves these to reduce tasks using an HTTP server included in the TaskTracker. However, the TaskTracker within Mesos runs as an executor, which may be terminated if it is not running tasks. This would make map output files unavailable to reduce tasks. We solved this problem by providing a shared file server on each node in the cluster to serve local files. Such a service is useful beyond Hadoop, to other frameworks that write data locally on each node.

In total, our Hadoop port is 1500 lines of code.

5.2 Torque and MPI Ports

We have ported the Torque cluster resource manager to run as a framework on Mesos. The framework consists of a Mesos scheduler and executor, written in 360 lines of Python code, that launch and manage different components of Torque. In addition, we modified 3 lines of Torque source code to allow it to elastically scale up and down on Mesos depending on the jobs in its queue.

After registering with the Mesos master, the framework scheduler configures and launches a Torque server and then periodically monitors the server's job queue. While the queue is empty, the scheduler releases all tasks (down to an optional minimum, which we set to 0) and refuses all resource offers it receives from Mesos. Once a job gets added to Torque's queue (using the standard qsub command), the scheduler begins accepting new resource offers. As long as there are jobs in Torque's queue, the scheduler accepts offers as necessary to satisfy the constraints of as many jobs in the queue as possible. On each node where offers are accepted, Mesos launches our executor, which in turn starts a Torque backend daemon and registers it with the Torque server. When enough Torque backend daemons have registered, the Torque server will launch the next job in its queue.

Because jobs that run on Torque (e.g., MPI) may not be fault tolerant, Torque avoids having its tasks revoked by not accepting resources beyond its guaranteed allocation.

In addition to the Torque framework, we also created a Mesos MPI "wrapper" framework, written in 200 lines of Python code, for running MPI jobs directly on Mesos.

5.3 Spark Framework

Mesos enables the creation of specialized frameworks optimized for workloads for which more general execution layers may not be optimal. To test the hypothesis that simple specialized frameworks provide value, we identified one class of jobs that were found to perform poorly on Hadoop by machine learning researchers at our lab: iterative jobs, where a dataset is reused across a number of iterations. We built a specialized framework called Spark [39] optimized for these workloads.

One example of an iterative algorithm used in machine learning is logistic regression [22]. This algorithm seeks to find a line that separates two sets of labeled data points. The algorithm starts with a random line w. Then, on each iteration, it computes the gradient of an objective function that measures how well the line separates the points, and shifts w along this gradient. This gradient computation amounts to evaluating a function f(x, w) over each data point x and summing the results. An implementation of logistic regression in Hadoop must run each iteration as a separate MapReduce job, because each iteration depends on the w computed at the previous one. This imposes overhead because every iteration must re-read the input file into memory. In Dryad, the whole job can be expressed as a data flow DAG as shown in Figure 4a, but the data must still be reloaded from disk at each iteration. Reusing the data in memory between iterations in Dryad would require cyclic data flow.

Spark's execution is shown in Figure 4b. Spark uses the long-lived nature of Mesos executors to cache a slice of the dataset in memory at each executor, and then run multiple iterations on this cached data. This caching is achieved in a fault-tolerant manner: if a node is lost, Spark remembers how to recompute its slice of the data.

By building Spark on top of Mesos, we were able to keep its implementation small (about 1300 lines of code), yet still capable of outperforming Hadoop by 10x for iterative jobs. In particular, using Mesos's API saved us the time to write a master daemon, slave daemon, and communication protocols between them for Spark. The main pieces we had to write were a framework scheduler (which uses delay scheduling for locality) and user APIs.

Figure 4: Data flow of a logistic regression job in Dryad (a) vs. Spark (b). Solid lines show data flow within the framework. Dashed lines show reads from a distributed file system. Spark reuses in-memory data across iterations to improve efficiency.
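To make the iterative pattern of Section 5.3 concrete, the following is a minimal sketch (plain Python with synthetic data, not Spark's API) of logistic-regression-style gradient descent in which the parsed dataset is read once and kept in memory across iterations; this is exactly the reuse that a per-iteration MapReduce job cannot exploit.

```python
import math
import random

def parse_dataset(n=10000, d=5, seed=0):
    """Stand-in for reading and parsing the input file once; in the Hadoop
    version this expensive step is repeated on every iteration."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.uniform(-1, 1) for _ in range(d)]
        y = 1.0 if sum(x) > 0 else -1.0  # synthetic labels
        data.append((x, y))
    return data

def gradient(w, data):
    """Sum over all points of the per-point gradient f(x, w) of the logistic loss."""
    g = [0.0] * len(w)
    for x, y in data:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        coeff = -y / (1.0 + math.exp(margin))
        for i, xi in enumerate(x):
            g[i] += coeff * xi
    return g

def train(iterations=30, step=0.1):
    data = parse_dataset()            # parsed once and cached in memory
    w = [random.uniform(-1, 1) for _ in range(5)]
    for _ in range(iterations):       # each iteration reuses the cached data
        g = gradient(w, data)
        w = [wi - step * gi for wi, gi in zip(w, g)]
    return w

if __name__ == "__main__":
    print(train())
```

In Spark, the per-point gradient terms would be computed by tasks running on executors that hold cached partitions of the data; the sketch only illustrates why keeping the parsed data resident across iterations removes the dominant cost.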

6 Evaluation

We evaluated Mesos through a series of experiments on the Amazon Elastic Compute Cloud (EC2). We begin with a macrobenchmark that evaluates how the system shares resources between four workloads, and go on to present a series of smaller experiments designed to evaluate overhead, decentralized scheduling, our specialized framework (Spark), scalability, and failure recovery.

6.1 Macrobenchmark

To evaluate the primary goal of Mesos, which is enabling diverse frameworks to efficiently share a cluster, we ran a macrobenchmark consisting of a mix of four workloads:

A Hadoop instance running a mix of small and large jobs based on the workload at Facebook.

A Hadoop instance running a set of large batch jobs.

Spark running a series of machine learning jobs.

Torque running a series of MPI jobs.

We compared a scenario where the workloads ran as four frameworks on a 96-node Mesos cluster using fair sharing to a scenario where they were each given a static partition of the cluster (24 nodes), and measured job response times and resource utilization in both cases. We used EC2 nodes with 4 CPU cores and 15 GB of RAM.

We begin by describing the four workloads in more detail, and then present our results.

Bin   Job Type      Map Tasks   Reduce Tasks   # Jobs Run
1     selection     1           NA             38
2     text search   2           NA             18
3     aggregation   10          2              14
4     selection     50          NA             12
5     aggregation   100         10             6
6     selection     200         NA             6
7     text search   400         NA             4
8     join          400         30             2

Table 3: Job types for each bin in our Facebook Hadoop mix.

6.1.1 Macrobenchmark Workloads

Facebook Hadoop Mix. Our Hadoop job mix was based on the distribution of job sizes and inter-arrival times at Facebook, reported in [38]. The workload consists of 100 jobs submitted at fixed times over a 25-minute period, with a mean inter-arrival time of 14s. Most of the jobs are small (1-12 tasks), but there are also large jobs of up to 400 tasks.[4] The jobs themselves were from the Hive benchmark [6], which contains four types of queries: text search, a simple selection, an aggregation, and a join that gets translated into multiple MapReduce steps. We grouped the jobs into eight bins of job type and size (listed in Table 3) so that we could compare performance in each bin. We also set the framework scheduler to perform fair sharing between its jobs, as this policy is used at Facebook.

[4] We scaled down the largest jobs in [38] to have the workload fit a quarter of our cluster size.

Large Hadoop Mix. To emulate batch workloads that need to run continuously, such as web crawling, we had a second instance of Hadoop run a series of IO-intensive 2400-task text search jobs. A script launched ten of these jobs, submitting each one after the previous one finished.

Figure 5: Comparison of cluster shares (fraction of CPUs) over time for each of the frameworks in the Mesos and static partitioning macrobenchmark scenarios: (a) Facebook Hadoop Mix, (b) Large Hadoop Mix, (c) Spark, (d) Torque / MPI. On Mesos, frameworks can scale up when their demand is high and that of other frameworks is low, and thus finish jobs faster. Note that the plots' time axes are different (e.g., the large Hadoop mix takes 3200s with static partitioning).

Figure 6: Framework shares on Mesos during the macrobenchmark. By pooling resources, Mesos lets each workload scale up to fill gaps in the demand of others. In addition, fine-grained sharing allows resources to be reallocated in tens of seconds.

Figure 7: Average CPU and memory utilization (%) over time across all nodes in the Mesos cluster vs. static partitioning.

Spark. We ran five instances of an iterative machine learning job on Spark. These were launched by a script that waited 2 minutes after each job ended to submit the next. The job we used was alternating least squares (ALS), a collaborative filtering algorithm [42]. This job is CPU-intensive but also benefits from caching its input data on each node, and needs to broadcast updated parameters to all nodes running its tasks on each iteration.

Torque / MPI. Our Torque framework ran eight instances of the tachyon raytracing job [35] that is part of the SPEC MPI2007 benchmark. Six of the jobs ran small problem sizes and two ran large ones. Both types used 24 parallel tasks. We submitted these jobs at fixed times to both clusters. The tachyon job is CPU-intensive.

6.1.2 Macrobenchmark Results

A successful result for Mesos would show two things: that Mesos achieves higher utilization than static partitioning, and that jobs finish at least as fast in the shared cluster as they do in their static partition, and possibly faster due to gaps in the demand of other frameworks. Our results show both effects, as detailed below.

We show the fraction of CPU cores allocated to each framework by Mesos over time in Figure 6. We see that Mesos enables each framework to scale up during periods when other frameworks have low demands, and thus keeps cluster nodes busier. For example, at time 350, when both Spark and the Facebook Hadoop framework have no running jobs and Torque is using 1/8 of the cluster, the large-job Hadoop framework scales up to 7/8 of the cluster. In addition, we see that resources are reallocated rapidly (e.g., when a Facebook Hadoop job starts around time 360) due to the fine-grained nature of tasks. Finally, higher allocation of nodes also translates into increased CPU and memory utilization (by 10% for CPU and 17% for memory), as shown in Figure 7.

A second question is how much better jobs perform under Mesos than when using a statically partitioned cluster. We present this data in two ways. First, Figure 5 compares the resource allocation over time of each framework in the shared and statically partitioned clusters. Shaded areas show the allocation in the statically partitioned cluster, while solid lines show the share on Mesos. We see that the fine-grained frameworks (Hadoop and Spark) take advantage of Mesos to scale up beyond 1/4 of the cluster when global demand allows this, and consequently finish bursts of submitted jobs faster in Mesos. At the same time, Torque achieves roughly similar allocations and job durations under Mesos (with some differences explained later).

Second, Tables 4 and 5 show a breakdown of job performance for each framework. In Table 4, we compare the aggregate performance of each framework, defined as the sum of job running times, in the static partitioning and Mesos scenarios. We see the Hadoop and Spark jobs as a whole are finishing faster on Mesos, while Torque is slightly slower. The framework that gains the most is the large-job Hadoop mix, which almost always has tasks to run and fills in the gaps in demand of the other frameworks; this framework performs 2x better on Mesos.

Framework              Sum of Exec Times w/       Sum of Exec Times    Speedup
                       Static Partitioning (s)    with Mesos (s)
Facebook Hadoop Mix    7235                       6319                 1.14
Large Hadoop Mix       3143                       1494                 2.10
Spark                  1684                       1338                 1.26
Torque / MPI           3210                       3352                 0.96

Table 4: Aggregate performance of each framework in the macrobenchmark (sum of running times of all the jobs in the framework). The speedup column shows the relative gain on Mesos.

Framework             Job Type          Exec Time w/ Static     Avg. Speedup
                                        Partitioning (s)        on Mesos
Facebook Hadoop Mix   selection (1)     24                      0.84
                      text search (2)   31                      0.90
                      aggregation (3)   82                      0.94
                      selection (4)     65                      1.40
                      aggregation (5)   192                     1.26
                      selection (6)     136                     1.71
                      text search (7)   137                     2.14
                      join (8)          662                     1.35
Large Hadoop Mix      text search       314                     2.21
Spark                 ALS               337                     1.36
Torque / MPI          small tachyon     261                     0.91
                      large tachyon     822                     0.88

Table 5: Performance of each job type in the macrobenchmark. Bins for the Facebook Hadoop mix are in parentheses.

Figure 8: Data locality and average job durations for 16 Hadoop instances running on a 93-node cluster using static partitioning, Mesos, or Mesos with delay scheduling.

Table 5 breaks down the results further by job type. We observe two notable trends. First, in the Facebook Hadoop mix, the smaller jobs perform worse on Mesos. This is due to an interaction between the fair sharing performed by Hadoop (among its jobs) and the fair sharing in Mesos (among frameworks): during periods of time when Hadoop has more than 1/4 of the cluster, if any jobs are submitted to the other frameworks, there is a delay before Hadoop gets a new resource offer (because any freed-up resources go to the framework farthest below its share), so any small job submitted during this time is delayed for a long time relative to its length. In contrast, when running alone, Hadoop can assign resources to the new job as soon as any of its tasks finishes. This problem with hierarchical fair sharing is also seen in networks [34], and could be mitigated by running the small jobs on a separate framework or using a different allocation policy (e.g., using lottery scheduling instead of offering all freed resources to the framework with the lowest share).

Lastly, Torque is the only framework that performed worse, on average, on Mesos. The large tachyon jobs took on average 2 minutes longer, while the small ones took 20s longer. Some of this delay is due to Torque having to wait to launch 24 tasks on Mesos before starting each job, but the average time this takes is 12s. We believe that the rest of the delay is due to stragglers (slow nodes). In our standalone Torque run, we saw two jobs take about 60s longer to run than the others (Fig. 5d). We discovered that both of these jobs were using a node that performed slower on single-node benchmarks than the others (in fact, Linux reported 40% lower bogomips on it). Because tachyon hands out equal amounts of work to each node, it runs as slowly as the slowest node.
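As a concrete illustration of the lottery-based alternative mentioned above (and analyzed in Section 4.3), an allocation module could pick which framework receives a freed slot with probability proportional to its intended allocation s_i. The following is a minimal sketch, illustrative only and not Mesos's allocation-module interface:

```python
import random

def lottery_pick(intended_allocations):
    """Pick a framework with probability s_i / sum_j s_j, as in lottery
    scheduling [36]; intended_allocations maps framework id -> s_i."""
    frameworks = list(intended_allocations)
    weights = [intended_allocations[f] for f in frameworks]
    return random.choices(frameworks, weights=weights, k=1)[0]

if __name__ == "__main__":
    s = {"hadoop-fb": 25, "hadoop-batch": 25, "spark": 25, "torque": 25}
    # Over many freed slots, each framework is offered roughly s_i / sum(s) of them.
    counts = {f: 0 for f in s}
    for _ in range(10000):
        counts[lottery_pick(s)] += 1
    print(counts)
```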
6.2 Overhead

To measure the overhead Mesos imposes when a single


framework uses the cluster, we ran two benchmarks using MPI and Hadoop on an EC2 cluster with 50 nodes,
each with 2 CPU cores and 6.5 GB RAM. We used the
High-Performance LINPACK [15] benchmark for MPI
and a WordCount job for Hadoop, and ran each job three
times. The MPI job took on average 50.9s without Mesos
and 51.8s with Mesos, while the Hadoop job took 160s
without Mesos and 166s with Mesos. In both cases, the
overhead of using Mesos was less than 4%.
6.3 Data Locality through Delay Scheduling

In this experiment, we evaluated how Mesos's resource offer mechanism enables frameworks to control their tasks' placement, and in particular, data locality. We
ran 16 instances of Hadoop using 93 EC2 nodes, each
with 4 CPU cores and 15 GB RAM. Each node ran a
map-only scan job that searched a 100 GB file spread


throughout the cluster on a shared HDFS file system and
outputted 1% of the records. We tested four scenarios:
giving each Hadoop instance its own 5-6 node static partition of the cluster (to emulate organizations that use
coarse-grained cluster sharing systems), and running all
instances on Mesos using either no delay scheduling, 1s
delay scheduling or 5s delay scheduling.
Figure 8 shows averaged measurements from the 16
Hadoop instances across three runs of each scenario. Using static partitioning yields very low data locality (18%)
because the Hadoop instances are forced to fetch data
from nodes outside their partition. In contrast, running
the Hadoop instances on Mesos improves data locality,
even without delay scheduling, because each Hadoop instance has tasks on more nodes of the cluster (there are
4 tasks per node), and can therefore access more blocks
locally. Adding a 1-second delay brings locality above
90%, and a 5-second delay achieves 95% locality, which
is competitive with running one Hadoop instance alone
on the whole cluster. As expected, job performance improves with data locality: jobs run 1.7x faster in the 5s
delay scenario than with static partitioning.
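Delay scheduling itself is a very small policy. A minimal sketch of the idea (hypothetical helper names, not the Hadoop-on-Mesos implementation) is to hold a task back for up to D seconds while only non-local offers arrive:

```python
import time

class DelayScheduler:
    """Wait up to `delay` seconds for an offer on a node holding a task's
    input data before accepting a non-local node (cf. delay scheduling [38])."""

    def __init__(self, delay=5.0):
        self.delay = delay
        self.first_skip = {}  # task id -> time we first passed on an offer

    def should_launch(self, task_id, offer_host, input_hosts, now=None):
        now = time.time() if now is None else now
        if offer_host in input_hosts:
            self.first_skip.pop(task_id, None)
            return True                      # data-local: always accept
        waited = now - self.first_skip.setdefault(task_id, now)
        return waited >= self.delay          # accept non-local only after D

if __name__ == "__main__":
    sched = DelayScheduler(delay=5.0)
    print(sched.should_launch("t1", "nodeA", {"nodeB"}, now=0.0))  # False: not local
    print(sched.should_launch("t1", "nodeA", {"nodeB"}, now=6.0))  # True: waited > 5s
    print(sched.should_launch("t2", "nodeB", {"nodeB"}, now=0.0))  # True: local
```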

6.4 Spark Framework

We evaluated the benefit of running iterative jobs using the specialized Spark framework we developed on top of Mesos (Section 5.3) over the general-purpose Hadoop framework. We used a logistic regression job implemented in Hadoop by machine learning researchers in our lab, and wrote a second version of the job using Spark. We ran each version separately on 20 EC2 nodes, each with 4 CPU cores and 15 GB RAM. Each experiment used a 29 GB data file and varied the number of logistic regression iterations from 1 to 30 (see Figure 9).

Figure 9: Hadoop and Spark logistic regression running times (running time in seconds vs. number of iterations).

With Hadoop, each iteration takes 127s on average, because it runs as a separate MapReduce job. In contrast, with Spark, the first iteration takes 174s, but subsequent iterations only take about 6 seconds, leading to a speedup of up to 10x for 30 iterations. This happens because the cost of reading the data from disk and parsing it is much higher than the cost of evaluating the gradient function computed by the job on each iteration. Hadoop incurs the read/parsing cost on each iteration, while Spark reuses cached blocks of parsed data and only incurs this cost once. The longer time for the first iteration in Spark is due to the use of slower text parsing routines.

6.5 Mesos Scalability

To evaluate Mesos's scalability, we emulated large clusters by running up to 50,000 slave daemons on 99 Amazon EC2 nodes, each with 8 CPU cores and 6 GB RAM. We used one EC2 node for the master and the rest of the nodes to run slaves. During the experiment, each of 200 frameworks running throughout the cluster continuously launches tasks, starting one task on each slave that it receives a resource offer for. Each task sleeps for a period of time based on a normal distribution with a mean of 30 seconds and standard deviation of 10s, and then ends. Each slave runs up to two tasks at a time.

Once the cluster reached steady state (i.e., the 200 frameworks achieved their fair shares and all resources were allocated), we launched a test framework that runs a single 10-second task and measured how long this framework took to finish. This allowed us to calculate the extra delay incurred over 10s due to having to register with the master, wait for a resource offer, accept it, wait for the master to process the response and launch the task on a slave, and wait for Mesos to report the task as finished.

We plot this extra delay in Figure 10, showing averages of 5 runs. We observe that the overhead remains small (less than one second) even at 50,000 nodes. In particular, this overhead is much smaller than the average task and job lengths in data center workloads (see Section 2). Because Mesos was also keeping the cluster fully allocated, this indicates that the master kept up with the load placed on it. Unfortunately, the EC2 virtualized environment limited scalability beyond 50,000 slaves, because at 50,000 slaves the master was processing 100,000 packets per second (in+out), which has been shown to be the current achievable limit on EC2 [12].

Figure 10: Mesos master's scalability versus number of slaves (task launch overhead in seconds vs. number of nodes, up to 50,000).

6.6 Failure Recovery

To evaluate recovery from master failures, we conducted an experiment with 200 to 4000 slave daemons on 62 EC2 nodes with 4 cores and 15 GB RAM. We ran 200 frameworks that each launched 20-second tasks, and two Mesos masters connected to a 5-node ZooKeeper quorum. We synchronized the two masters' clocks using NTP and measured the mean time to recovery (MTTR) after killing the active master. The MTTR is the time for all of the slaves and frameworks to connect to the second master. In all cases, the MTTR was between 4 and 8 seconds, with 95% confidence intervals of up to 3s on either side.
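The MTTR statistics reported above are straightforward to derive from per-run recovery times. A minimal sketch, using a normal approximation for the 95% interval and hypothetical run data rather than the paper's measurements:

```python
import statistics

def mttr_with_ci(recovery_times_s):
    """Mean time to recovery and an approximate 95% confidence half-width
    (normal approximation: mean +/- 1.96 * standard error)."""
    mean = statistics.mean(recovery_times_s)
    stderr = statistics.stdev(recovery_times_s) / len(recovery_times_s) ** 0.5
    return mean, 1.96 * stderr

if __name__ == "__main__":
    # Hypothetical per-run recovery times (seconds), not the paper's raw data.
    runs = [4.2, 5.1, 6.8, 7.3, 5.9, 4.8, 6.1, 7.9]
    mean, half_width = mttr_with_ci(runs)
    print(f"MTTR = {mean:.1f}s +/- {half_width:.1f}s (95% CI)")
```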
6.7 Performance Isolation

As discussed in Section 3.4, Mesos leverages existing OS isolation mechanisms to provide performance isolation between different frameworks' tasks running on the same slave. While these mechanisms are not perfect, a preliminary evaluation of Linux Containers [9] shows promising results. In particular, using Containers to isolate CPU usage between a MediaWiki web server (consisting of multiple Apache processes running PHP) and a "hog" application (consisting of 256 processes spinning in infinite loops) shows on average only a 30% increase in request latency for Apache versus a 550% increase when running without Containers. We refer the reader to [29] for a fuller evaluation of OS isolation mechanisms.

Related Work

HPC and Grid Schedulers. The high performance computing (HPC) community has long been managing clusters [33, 41]. However, their target environment typically consists of specialized hardware, such as Infiniband and SANs, where jobs do not need to be scheduled local to their data. Furthermore, each job is tightly coupled, often using barriers or message passing. Thus, each job is monolithic, rather than composed of fine-grained tasks, and does not change its resource demands during its lifetime. For these reasons, HPC schedulers use centralized scheduling, and require users to declare the required resources at job submission time. Jobs are then given coarse-grained allocations of the cluster. Unlike the Mesos approach, this does not allow jobs to locally access data distributed across the cluster. Furthermore, jobs cannot grow and shrink dynamically. In contrast, Mesos supports fine-grained sharing at the level of tasks and allows frameworks to control their placement.

Grid computing has mostly focused on the problem of making diverse virtual organizations share geographically distributed and separately administered resources in a secure and interoperable way. Mesos could well be used within a virtual organization inside a larger grid.

Public and Private Clouds. Virtual machine clouds such as Amazon EC2 [1] and Eucalyptus [31] share common goals with Mesos, such as isolating applications while providing a low-level abstraction (VMs). However, they differ from Mesos in several important ways. First, their relatively coarse-grained VM allocation model leads to less efficient resource utilization and data sharing than in Mesos. Second, these systems generally do not let applications specify placement needs beyond the size of VM they require. In contrast, Mesos allows frameworks to be highly selective about task placement.

Quincy. Quincy [25] is a fair scheduler for Dryad that uses a centralized scheduling algorithm for Dryad's DAG-based programming model. In contrast, Mesos provides the lower-level abstraction of resource offers to support multiple cluster computing frameworks.

Condor. The Condor cluster manager uses the ClassAds language [32] to match nodes to jobs. Using a resource specification language is not as flexible for frameworks as resource offers, since not all requirements may be expressible. Also, porting existing frameworks, which have their own schedulers, to Condor would be more difficult than porting them to Mesos, where existing schedulers fit naturally into the two-level scheduling model.

Next-Generation Hadoop. Recently, Yahoo! announced a redesign for Hadoop that uses a two-level scheduling model, where per-application masters request resources from a central manager [14]. The design aims to support non-MapReduce applications as well. While details about the scheduling model in this system are currently unavailable, we believe that the new application masters could naturally run as Mesos frameworks.

Conclusion and Future Work

We have presented Mesos, a thin management layer that allows diverse cluster computing frameworks to efficiently share resources. Mesos is built around two design elements: a fine-grained sharing model at the level of tasks, and a distributed scheduling mechanism called resource offers that delegates scheduling decisions to the frameworks. Together, these elements let Mesos achieve high utilization, respond quickly to workload changes, and cater to diverse frameworks while remaining scalable and robust. We have shown that existing frameworks can effectively share resources using Mesos, that Mesos enables the development of specialized frameworks providing major performance gains, such as Spark, and that Mesos's simple design allows the system to be fault tolerant and to scale to 50,000 nodes.

In future work, we plan to further analyze the resource offer model and determine whether any extensions can improve its efficiency while retaining its flexibility. In particular, it may be possible to have frameworks give richer hints about offers they would like to receive. Nonetheless, we believe that below any hint system, frameworks should still have the ability to reject offers and to choose which tasks to launch on each resource, so that their evolution is not constrained by the hint language provided by the system.

We are also currently using Mesos to manage resources on a 40-node cluster in our lab and in a test deployment at Twitter, and plan to report on lessons from these deployments in future work.

[22] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of


Statistical Learning: Data Mining, Inference, and Prediction.
Springer Publishing Company, New York, NY, 2009.
[23] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D.
Joseph, R. H. Katz, S. Shenker, and I. Stoica. Mesos: A platform
for fine-grained resource sharing in the data center. Technical
Report UCB/EECS-2010-87, UC Berkeley, May 2010.
[24] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad:
distributed data-parallel programs from sequential building
blocks. In EuroSys 07, 2007.
[25] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and
A. Goldberg. Quincy: Fair scheduling for distributed computing
clusters. In SOSP, November 2009.
[26] S. Y. Ko, I. Hoque, B. Cho, and I. Gupta. On availability of
intermediate data in cloud computations. In HOTOS, May 2009.
[27] D. Logothetis, C. Olston, B. Reed, K. C. Webb, and K. Yocum.
Stateful bulk processing for incremental analytics. In Proc. ACM
symposium on Cloud computing, SoCC 10, 2010.
[28] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn,
N. Leiser, and G. Czajkowski. Pregel: a system for large-scale
graph processing. In SIGMOD, pages 135146, 2010.
[29] J. N. Matthews, W. Hu, M. Hapuarachchi, T. Deshane,
D. Dimatos, G. Hamilton, M. McCabe, and J. Owens.
Quantifying the performance isolation properties of
virtualization systems. In ExpCS 07, 2007.
[30] D. G. Murray, M. Schwarzkopf, C. Smowton, S. Smith,
A. Madhavapeddy, and S. Hand. Ciel: a universal execution
engine for distributed data-flow computing. In NSDI, 2011.
[31] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman,
L. Youseff, and D. Zagorodnov. The Eucalyptus open-source
cloud-computing system. In CCA 08, 2008.
[32] R. Raman, M. Livny, and M. Solomon. Matchmaking: An
extensible framework for distributed resource management.
Cluster Computing, 2:129138, April 1999.
[33] G. Staples. TORQUE resource manager. In Proc.
Supercomputing 06, 2006.
[34] I. Stoica, H. Zhang, and T. S. E. Ng. A hierarchical fair service
curve algorithm for link-sharing, real-time and priority services.
In SIGCOMM 97, pages 249262, 1997.
[35] J. Stone. Tachyon ray tracing system.
http://jedi.ks.uiuc.edu/johns/raytracer.
[36] C. A. Waldspurger and W. E. Weihl. Lottery scheduling: flexible
proportional-share resource management. In OSDI, 1994.
[37] Y. Yu, P. K. Gunda, and M. Isard. Distributed aggregation for
data-parallel computing: interfaces and implementations. In
SOSP 09, pages 247260, 2009.
[38] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy,
S. Shenker, and I. Stoica. Delay scheduling: A simple technique
for achieving locality and fairness in cluster scheduling. In
EuroSys 10, 2010.
[39] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and
I. Stoica. Spark: cluster computing with working sets. In Proc.
HotCloud 10, 2010.
[40] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica.
Improving MapReduce performance in heterogeneous
environments. In Proc. OSDI 08, 2008.
[41] S. Zhou. LSF: Load sharing in large-scale heterogeneous
distributed systems. In Workshop on Cluster Computing, 1992.
[42] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale
parallel collaborative filtering for the Netflix prize. In AAIM,
pages 337348. Springer-Verlag, 2008.

Acknowledgements

We thank our industry colleagues at Google, Twitter,


Facebook, Yahoo! and Cloudera for their valuable feedback on Mesos. This research was supported by California MICRO, California Discovery, the Natural Sciences
and Engineering Research Council of Canada, a National
Science Foundation Graduate Research Fellowship,5 the
Swedish Research Council, and the following Berkeley
RAD Lab sponsors: Google, Microsoft, Oracle, Amazon, Cisco, Cloudera, eBay, Facebook, Fujitsu, HP, Intel,
NetApp, SAP, VMware, and Yahoo!.

References
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]

[16]

[17]
[18]
[19]
[20]
[21]

Amazon EC2. http://aws.amazon.com/ec2.


Apache Hadoop. http://hadoop.apache.org.
Apache Hive. http://hadoop.apache.org/hive.
Apache ZooKeeper. hadoop.apache.org/zookeeper.
Hive A Petabyte Scale Data Warehouse using Hadoop.
http://www.facebook.com/note.php?note_id=
89508453919.
Hive performance benchmarks. http:
//issues.apache.org/jira/browse/HIVE-396.
LibProcess Homepage. http:
//www.eecs.berkeley.edu/benh/libprocess.
Linux 2.6.33 release notes.
http://kernelnewbies.org/Linux_2_6_33.
Linux containers (LXC) overview document.
http://lxc.sourceforge.net/lxc.html.
Personal communication with Dhruba Borthakur from Facebook.
Personal communication with Owen OMalley and Arun C.
Murthy from the Yahoo! Hadoop team.
RightScale blog. blog.rightscale.com/2010/04/01/
benchmarking-load-balancers-in-the-cloud.
Solaris Resource Management.
http://docs.sun.com/app/docs/doc/817-1592.
The Next Generation of Apache Hadoop MapReduce.
http://developer.yahoo.com/blogs/hadoop/
posts/2011/02/mapreduce-nextgen.
E. Anderson, Z. Bai, J. Dongarra, A. Greenbaum, A. McKenney,
J. Du Croz, S. Hammerling, J. Demmel, C. Bischof, and
D. Sorensen. LAPACK: a portable linear algebra library for
high-performance computers. In Supercomputing 90, 1990.
A. Bouteiller, F. Cappello, T. Herault, G. Krawezik,
P. Lemarinier, and F. Magniette. Mpich-v2: a fault tolerant MPI
for volatile nodes based on pessimistic sender based message
logging. In Supercomputing 03, 2003.
T. Condie, N. Conway, P. Alvaro, and J. M. Hellerstein.
MapReduce online. In NSDI 10, May 2010.
J. Dean and S. Ghemawat. MapReduce: Simplified data
processing on large clusters. In OSDI, pages 137150, 2004.
J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu,
and G. Fox. Twister: a runtime for iterative mapreduce. In Proc.
HPDC 10, 2010.
D. R. Engler, M. F. Kaashoek, and J. OToole. Exokernel: An
operating system architecture for application-level resource
management. In SOSP, pages 251266, 1995.
A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker,
and I. Stoica. Dominant resource fairness: fair allocation of
multiple resource types. In NSDI, 2011.

5 Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the NSF.


Tor: The Second-Generation Onion Router


Roger Dingledine
The Free Haven Project
arma@freehaven.net

Nick Mathewson
The Free Haven Project
nickm@freehaven.net

Paul Syverson
Naval Research Lab
syverson@itd.nrl.navy.mil

Abstract
We present Tor, a circuit-based low-latency anonymous communication service. This second-generation Onion Routing
system addresses limitations in the original design by adding
perfect forward secrecy, congestion control, directory servers,
integrity checking, configurable exit policies, and a practical design for location-hidden services via rendezvous points.
Tor works on the real-world Internet, requires no special privileges or kernel modifications, requires little synchronization
or coordination between nodes, and provides a reasonable
tradeoff between anonymity, usability, and efficiency. We
briefly describe our experiences with an international network
of more than 30 nodes. We close with a list of open problems
in anonymous communication.

Overview

Onion Routing is a distributed overlay network designed to


anonymize TCP-based applications like web browsing, secure shell, and instant messaging. Clients choose a path
through the network and build a circuit, in which each node
(or onion router or OR) in the path knows its predecessor
and successor, but no other nodes in the circuit. Traffic flows
down the circuit in fixed-size cells, which are unwrapped by a
symmetric key at each node (like the layers of an onion) and
relayed downstream. The Onion Routing project published
several design and analysis papers [27, 41, 48, 49]. While a
wide area Onion Routing network was deployed briefly, the
only long-running public implementation was a fragile proof-of-concept that ran on a single machine. Even this simple
deployment processed connections from over sixty thousand
distinct IP addresses from all over the world at a rate of about
fifty thousand per day. But many critical design and deployment issues were never resolved, and the design has not been
updated in years. Here we describe Tor, a protocol for asynchronous, loosely federated onion routers that provides the
following improvements over the old Onion Routing design:
Perfect forward secrecy: In the original Onion Routing
design, a single hostile node could record traffic and later

compromise successive nodes in the circuit and force them


to decrypt it. Rather than using a single multiply encrypted
data structure (an onion) to lay each circuit, Tor now uses an
incremental or telescoping path-building design, where the
initiator negotiates session keys with each successive hop in
the circuit. Once these keys are deleted, subsequently compromised nodes cannot decrypt old traffic. As a side benefit,
onion replay detection is no longer necessary, and the process
of building circuits is more reliable, since the initiator knows
when a hop fails and can then try extending to a new node.
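As a toy illustration of the layering that telescoping sets up (one symmetric key per hop, peeled off one node at a time), the sketch below uses the Fernet primitive from the Python cryptography package; it shows only the wrap/unwrap structure, not Tor's cell format, key negotiation, or forward-secrecy mechanism.

```python
from cryptography.fernet import Fernet

# One fresh symmetric key per hop, standing in for keys negotiated during
# telescoping circuit setup (purely illustrative).
hop_keys = [Fernet.generate_key() for _ in range(3)]
hops = [Fernet(k) for k in hop_keys]

def wrap(payload: bytes) -> bytes:
    """Client side: encrypt for the exit hop first, then each earlier hop,
    so the entry node sees only the outermost layer."""
    for hop in reversed(hops):
        payload = hop.encrypt(payload)
    return payload

def unwrap(cell: bytes) -> bytes:
    """Network side: each relay removes exactly one layer, in path order."""
    for hop in hops:
        cell = hop.decrypt(cell)
    return cell

if __name__ == "__main__":
    assert unwrap(wrap(b"GET / HTTP/1.0")) == b"GET / HTTP/1.0"
    print("three layers wrapped and unwrapped")
```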
Separation of protocol cleaning from anonymity:
Onion Routing originally required a separate application
proxy for each supported application protocol, most of
which were never written, so many applications were never
supported.
Tor uses the standard and near-ubiquitous
SOCKS [32] proxy interface, allowing us to support most
TCP-based programs without modification. Tor now relies on
the filtering features of privacy-enhancing application-level
proxies such as Privoxy [39], without trying to duplicate those
features itself.
No mixing, padding, or traffic shaping (yet): Onion
Routing originally called for batching and reordering cells
as they arrived, assumed padding between ORs, and in later
designs added padding between onion proxies (users) and
ORs [27, 41]. Tradeoffs between padding protection and
cost were discussed, and traffic shaping algorithms were
theorized [49] to provide good security without expensive
padding, but no concrete padding scheme was suggested. Recent research [1] and deployment experience [4] suggest that
this level of resource use is not practical or economical; and
even full link padding is still vulnerable [33]. Thus, until we
have a proven and convenient design for traffic shaping or
low-latency mixing that improves anonymity against a realistic adversary, we leave these strategies out.
Many TCP streams can share one circuit: Onion Routing originally built a separate circuit for each application-level request, but this required multiple public key operations
for every request, and also presented a threat to anonymity
from building so many circuits; see Section 9. Tor multi-

plexes multiple TCP streams along each circuit to improve


efficiency and anonymity.
Leaky-pipe circuit topology: Through in-band signaling
within the circuit, Tor initiators can direct traffic to nodes
partway down the circuit. This novel approach allows traffic to exit the circuit from the middle, possibly frustrating
traffic shape and volume attacks based on observing the end
of the circuit. (It also allows for long-range padding if future
research shows this to be worthwhile.)
Congestion control: Earlier anonymity designs do not address traffic bottlenecks. Unfortunately, typical approaches to
load balancing and flow control in overlay networks involve
inter-node control communication and global views of traffic.
Tor's decentralized congestion control uses end-to-end acks
to maintain anonymity while allowing nodes at the edges of
the network to detect congestion or flooding and send less
data until the congestion subsides.
Directory servers: The earlier Onion Routing design
planned to flood state information through the network, an
approach that can be unreliable and complex. Tor takes a
simplified view toward distributing this information. Certain more trusted nodes act as directory servers: they provide
signed directories describing known routers and their current
state. Users periodically download them via HTTP.
Variable exit policies: Tor provides a consistent mechanism for each node to advertise a policy describing the hosts
and ports to which it will connect. These exit policies are critical in a volunteer-based distributed infrastructure, because
each operator is comfortable with allowing different types of
traffic to exit from his node.
End-to-end integrity checking: The original Onion Routing design did no integrity checking on data. Any node on the
circuit could change the contents of data cells as they passed
by, for example, to alter a connection request so it would
connect to a different webserver, or to tag encrypted traffic
and look for corresponding corrupted traffic at the network
edges [15]. Tor hampers these attacks by verifying data integrity before it leaves the network.
Rendezvous points and hidden services: Tor provides an
integrated mechanism for responder anonymity via locationprotected servers. Previous Onion Routing designs included
long-lived reply onions that could be used to build circuits
to a hidden server, but these reply onions did not provide forward security, and became useless if any node in the path
went down or rotated its keys. In Tor, clients negotiate rendezvous points to connect with hidden servers; reply onions
are no longer required.
Unlike Freedom [8], Tor does not require OS kernel
patches or network stack support. This prevents us from
anonymizing non-TCP protocols, but has greatly helped our
portability and deployability.
We have implemented all of the above features, including
rendezvous points. Our source code is available under a free
license, and Tor is not covered by the patent that affected dis-

tribution and use of earlier versions of Onion Routing. We


have deployed a wide-area alpha network to test the design, to
get more experience with usability and users, and to provide
a research platform for experimentation. As of this writing,
the network stands at 32 nodes spread over two continents.
We review previous work in Section 2, describe our goals
and assumptions in Section 3, and then address the above list
of improvements in Sections 4, 5, and 6. We summarize in
Section 7 how our design stands up to known attacks, and
talk about our early deployment experiences in Section 8. We
conclude with a list of open problems in Section 9 and future
work for the Onion Routing project in Section 10.

Related work

Modern anonymity systems date to Chaum's Mix-Net design [10]. Chaum proposed hiding the correspondence between sender and recipient by wrapping messages in layers
of public-key cryptography, and relaying them through a path
composed of mixes. Each mix in turn decrypts, delays, and
re-orders messages before relaying them onward.
Subsequent relay-based anonymity designs have diverged
in two main directions. Systems like Babel [28], Mixmaster [36], and Mixminion [15] have tried to maximize
anonymity at the cost of introducing comparatively large
and variable latencies. Because of this decision, these high-latency networks resist strong global adversaries, but introduce too much lag for interactive tasks like web browsing,
Internet chat, or SSH connections.
Tor belongs to the second category: low-latency designs
that try to anonymize interactive network traffic. These systems handle a variety of bidirectional protocols. They also
provide more convenient mail delivery than the high-latency
anonymous email networks, because the remote mail server
provides explicit and timely delivery confirmation. But because these designs typically involve many packets that must
be delivered quickly, it is difficult for them to prevent an attacker who can eavesdrop both ends of the communication
from correlating the timing and volume of traffic entering the
anonymity network with traffic leaving it [45]. These protocols are similarly vulnerable to an active adversary who introduces timing patterns into traffic entering the network and
looks for correlated patterns among exiting traffic. Although
some work has been done to frustrate these attacks, most designs protect primarily against traffic analysis rather than traffic confirmation (see Section 3.1).
The simplest low-latency designs are single-hop proxies
such as the Anonymizer [3]: a single trusted server strips
the data's origin before relaying it. These designs are easy to
analyze, but users must trust the anonymizing proxy. Concentrating the traffic to this single point increases the anonymity
set (the people a given user is hiding among), but it is vulnerable if the adversary can observe all traffic entering and
leaving the proxy.

More complex are distributed-trust, circuit-based anonymizing systems. In these designs, a user establishes one or more medium-term bidirectional end-to-end
circuits, and tunnels data in fixed-size cells. Establishing
circuits is computationally expensive and typically requires
public-key cryptography, whereas relaying cells is comparatively inexpensive and typically requires only symmetric
encryption. Because a circuit crosses several servers, and
each server only knows the adjacent servers in the circuit, no
single server can link a user to her communication partners.
The Java Anon Proxy (also known as JAP or Web MIXes)
uses fixed shared routes known as cascades. As with a
single-hop proxy, this approach aggregates users into larger
anonymity sets, but again an attacker only needs to observe
both ends of the cascade to bridge all the system's traffic. The Java Anon Proxy's design calls for padding between end users and the head of the cascade [7]. However, it is not demonstrated whether the current implementation's padding policy
improves anonymity.
PipeNet [5, 12], another low-latency design proposed
around the same time as Onion Routing, gave stronger
anonymity but allowed a single user to shut down the network by not sending. Systems like ISDN mixes [38] were
designed for other environments with different assumptions.
In P2P designs like Tarzan [24] and MorphMix [43], all
participants both generate traffic and relay traffic for others.
These systems aim to conceal whether a given peer originated
a request or just relayed it from another peer. While Tarzan
and MorphMix use layered encryption as above, Crowds [42]
simply assumes an adversary who cannot observe the initiator: it uses no public-key encryption, so any node on a circuit
can read users' traffic.
Hordes [34] is based on Crowds but also uses multicast
responses to hide the initiator. Herbivore [25] and P5 [46]
go even further, requiring broadcast. These systems are designed primarily for communication among peers, although
Herbivore users can make external connections by requesting
a peer to serve as a proxy.
Systems like Freedom and the original Onion Routing
build circuits all at once, using a layered onion of public-key encrypted messages, each layer of which provides session keys and the address of the next server in the circuit.
Tor as described herein, Tarzan, MorphMix, Cebolla [9], and Rennhard's Anonymity Network [44] build circuits in
stages, extending them one hop at a time. Section 4.2 describes how this approach enables perfect forward secrecy.
Circuit-based designs must choose which protocol layer to
anonymize. They may intercept IP packets directly, and relay them whole (stripping the source address) along the circuit [8, 24]. Like Tor, they may accept TCP streams and
relay the data in those streams, ignoring the breakdown of
that data into TCP segments [43, 44]. Finally, like Crowds,
they may accept application-level protocols such as HTTP
and relay the application requests themselves. Making this

protocol-layer decision requires a compromise between flexibility and anonymity. For example, a system that understands
HTTP can strip identifying information from requests, can
take advantage of caching to limit the number of requests that
leave the network, and can batch or encode requests to minimize the number of connections. On the other hand, an IP-level anonymizer can handle nearly any protocol, even ones
unforeseen by its designers (though these systems require
kernel-level modifications to some operating systems, and so
are more complex and less portable). TCP-level anonymity
networks like Tor present a middle approach: they are application neutral (so long as the application supports, or can
be tunneled across, TCP), but by treating application connections as data streams rather than raw TCP packets, they avoid
the inefficiencies of tunneling TCP over TCP.
Distributed-trust anonymizing systems need to prevent attackers from adding too many servers and thus compromising
user paths. Tor relies on a small set of well-known directory
servers, run by independent parties, to decide which nodes
can join. Tarzan and MorphMix allow unknown users to run
servers, and use a limited resource (like IP addresses) to prevent an attacker from controlling too much of the network.
Crowds suggests requiring written, notarized requests from
potential crowd members.
Anonymous communication is essential for censorship-resistant systems like Eternity [2], Free Haven [19], Publius [53], and Tangler [52]. Tor's rendezvous points enable
connections between mutually anonymous entities; they are a
building block for location-hidden servers, which are needed
by Eternity and Free Haven.

3 Design goals and assumptions

Goals
Like other low-latency anonymity designs, Tor seeks to frustrate attackers from linking communication partners, or from
linking multiple communications to or from a single user.
Within this main goal, however, several considerations have
directed Tor's evolution.
Deployability: The design must be deployed and used in
the real world. Thus it must not be expensive to run (for
example, by requiring more bandwidth than volunteers are
willing to provide); must not place a heavy liability burden
on operators (for example, by allowing attackers to implicate
onion routers in illegal activities); and must not be difficult
or expensive to implement (for example, by requiring kernel
patches, or separate proxies for every protocol). We also cannot require non-anonymous parties (such as websites) to run
our software. (Our rendezvous point design does not meet
this goal for non-anonymous users talking to hidden servers,
however; see Section 5.)
Usability: A hard-to-use system has fewer users, and because anonymity systems hide users among users, a system
with fewer users provides less anonymity. Usability is thus

not only a convenience: it is a security requirement [1, 5].


Tor should therefore not require modifying familiar applications; should not introduce prohibitive delays; and should require as few configuration decisions as possible. Finally, Tor
should be easily implementable on all common platforms; we
cannot require users to change their operating system to be
anonymous. (Tor currently runs on Win32, Linux, Solaris,
BSD-style Unix, MacOS X, and probably others.)
Flexibility: The protocol must be flexible and well-specified, so Tor can serve as a test-bed for future research.
Many of the open problems in low-latency anonymity networks, such as generating dummy traffic or preventing Sybil
attacks [22], may be solvable independently from the issues
solved by Tor. Hopefully future systems will not need to reinvent Tors design.
Simple design: The protocol's design and security parameters must be well-understood. Additional features impose
implementation and complexity costs; adding unproven
techniques to the design threatens deployability, readability,
and ease of security analysis. Tor aims to deploy a simple and
stable system that integrates the best accepted approaches to
protecting anonymity.

Non-goals
In favoring simple, deployable designs, we have explicitly deferred several possible goals, either because they are solved
elsewhere, or because they are not yet solved.
Not peer-to-peer: Tarzan and MorphMix aim to scale
to completely decentralized peer-to-peer environments with
thousands of short-lived servers, many of which may be controlled by an adversary. This approach is appealing, but still
has many open problems [24, 43].
Not secure against end-to-end attacks: Tor does not
claim to completely solve end-to-end timing or intersection
attacks. Some approaches, such as having users run their own
onion routers, may help; see Section 9 for more discussion.
No protocol normalization: Tor does not provide protocol normalization like Privoxy or the Anonymizer. If senders
want anonymity from responders while using complex and
variable protocols like HTTP, Tor must be layered with a
filtering proxy such as Privoxy to hide differences between
clients, and expunge protocol features that leak identity. Note
that by this separation Tor can also provide services that are
anonymous to the network yet authenticated to the responder,
like SSH. Similarly, Tor does not integrate tunneling for non-stream-based protocols like UDP; this must be provided by
an external service if appropriate.
Not steganographic: Tor does not try to conceal who is
connected to the network.

3.1 Threat Model

A global passive adversary is the most commonly assumed threat when analyzing theoretical anonymity designs. But like all practical low-latency systems, Tor does not protect against such a strong adversary. Instead, we assume an adversary who can observe some fraction of network traffic; who
can generate, modify, delete, or delay traffic; who can operate onion routers of his own; and who can compromise some
fraction of the onion routers.
In low-latency anonymity systems that use layered encryption, the adversary's typical goal is to observe both the initiator and the responder. By observing both ends, passive attackers can confirm a suspicion that Alice is talking to Bob if
the timing and volume patterns of the traffic on the connection are distinct enough; active attackers can induce timing
signatures on the traffic to force distinct patterns. Rather than
focusing on these traffic confirmation attacks, we aim to prevent traffic analysis attacks, where the adversary uses traffic
patterns to learn which points in the network he should attack.
Our adversary might try to link an initiator Alice with her
communication partners, or try to build a profile of Alice's
behavior. He might mount passive attacks by observing the
network edges and correlating traffic entering and leaving the
network, by relationships in packet timing, volume, or externally visible user-selected options. The adversary can also
mount active attacks by compromising routers or keys; by replaying traffic; by selectively denying service to trustworthy
routers to move users to compromised routers, or denying service to users to see if traffic elsewhere in the network stops; or
by introducing patterns into traffic that can later be detected.
The adversary might subvert the directory servers to give
users differing views of network state. Additionally, he can
try to decrease the network's reliability by attacking nodes
or by performing antisocial activities from reliable nodes and
trying to get them taken down; making the network unreliable flushes users to other less anonymous systems, where
they may be easier to attack. We summarize in Section 7 how
well the Tor design defends against each of these attacks.

4 The Tor Design

The Tor network is an overlay network; each onion router (OR) runs as a normal user-level process without any special
privileges. Each onion router maintains a TLS [17] connection to every other onion router. Each user runs local software
called an onion proxy (OP) to fetch directories, establish circuits across the network, and handle connections from user
applications. These onion proxies accept TCP streams and
multiplex them across the circuits. The onion router on the
other side of the circuit connects to the requested destinations
and relays data.
Each onion router maintains a long-term identity key and
a short-term onion key. The identity key is used to sign TLS
certificates, to sign the OR's router descriptor (a summary of
its keys, address, bandwidth, exit policy, and so on), and (by
directory servers) to sign directories. The onion key is used
to decrypt requests from users to set up a circuit and negotiate

ephemeral keys. The TLS protocol also establishes a short-term link key when communicating between ORs. Short-term
keys are rotated periodically and independently, to limit the
impact of key compromise.
Section 4.1 presents the fixed-size cells that are the unit
of communication in Tor. We describe in Section 4.2 how
circuits are built, extended, truncated, and destroyed. Section 4.3 describes how TCP streams are routed through the
network. We address integrity checking in Section 4.4, and
resource limiting in Section 4.5. Finally, Section 4.6 talks
about congestion control and fairness issues.

4.1 Cells

Onion routers communicate with one another, and with users' OPs, via TLS connections with ephemeral keys. Using TLS
conceals the data on the connection with perfect forward secrecy, and prevents an attacker from modifying data on the
wire or impersonating an OR.
Traffic passes along these connections in fixed-size cells.
Each cell is 512 bytes, and consists of a header and a payload. The header includes a circuit identifier (circID) that
specifies which circuit the cell refers to (many circuits can
be multiplexed over the single TLS connection), and a command to describe what to do with the cell's payload. (Circuit
identifiers are connection-specific: each circuit has a different circID on each OP/OR or OR/OR connection it traverses.)
Based on their command, cells are either control cells, which
are always interpreted by the node that receives them, or relay cells, which carry end-to-end stream data. The control
cell commands are: padding (currently used for keepalive,
but also usable for link padding); create or created (used to
set up a new circuit); and destroy (to tear down a circuit).
Relay cells have an additional header (the relay header) at
the front of the payload, containing a streamID (stream identifier: many streams can be multiplexed over a circuit); an
end-to-end checksum for integrity checking; the length of the
relay payload; and a relay command. The entire contents of
the relay header and the relay cell payload are encrypted or
decrypted together as the relay cell moves along the circuit,
using the 128-bit AES cipher in counter mode to generate a
cipher stream. The relay commands are: relay data (for data
flowing down the stream), relay begin (to open a stream), relay end (to close a stream cleanly), relay teardown (to close a
broken stream), relay connected (to notify the OP that a relay
begin has succeeded), relay extend and relay extended (to extend the circuit by a hop, and to acknowledge), relay truncate
and relay truncated (to tear down only part of the circuit, and
to acknowledge), relay sendme (used for congestion control),
and relay drop (used to implement long-range dummies). We
give a visual overview of cell structure plus the details of relay cell structure, and then describe each of these cell types
and commands in more detail below.
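To make the cell layout concrete, the minimal Python sketch below packs and parses fixed-size cells of the kind just described; the figure that follows shows the layouts. The byte widths and the command code used here are illustrative assumptions, not values taken from the Tor implementation.

import struct

CELL_LEN = 512
PAYLOAD_LEN = 509  # 512 bytes minus a 2-byte circID and a 1-byte command

def pack_cell(circ_id: int, command: int, payload: bytes) -> bytes:
    """Build one fixed-size cell; the payload is zero-padded to 509 bytes."""
    assert len(payload) <= PAYLOAD_LEN
    return struct.pack(">HB", circ_id, command) + payload.ljust(PAYLOAD_LEN, b"\x00")

def unpack_cell(cell: bytes):
    """Split a 512-byte cell back into (circID, command, payload)."""
    assert len(cell) == CELL_LEN
    circ_id, command = struct.unpack(">HB", cell[:3])
    return circ_id, command, cell[3:]

# Example: a hypothetical 'create' cell (command code 1 is arbitrary) on circuit 5.
cell = pack_cell(5, 1, b"first half of DH handshake")
assert unpack_cell(cell) == (5, 1, b"first half of DH handshake".ljust(PAYLOAD_LEN, b"\x00"))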

[Figure: cell formats. A control cell carries a 2-byte CircID, a command (CMD), and 509 bytes of DATA. A relay cell carries the relay header inside its payload: CircID, Relay, StreamID, Digest, Len, CMD, followed by 498 bytes of DATA.]

4.2 Circuits and streams

Onion Routing originally built one circuit for each TCP stream. Because building a circuit can take several tenths
of a second (due to public-key cryptography and network latency), this design imposed high costs on applications like
web browsing that open many TCP streams.
In Tor, each circuit can be shared by many TCP streams.
To avoid delays, users construct circuits preemptively. To
limit linkability among their streams, users' OPs build a new
circuit periodically if the previous ones have been used, and
expire old used circuits that no longer have any open streams.
OPs consider rotating to a new circuit once a minute: thus
even heavy users spend negligible time building circuits, but
a limited number of requests can be linked to each other
through a given exit node. Also, because circuits are built in
the background, OPs can recover from failed circuit creation
without harming user experience.
[Figure 1: Alice builds a two-hop circuit and begins fetching a web page. The figure traces the messages among Alice, OR 1, OR 2, and the website: a create/created exchange (carrying E(g^x1) and g^y1, H(K1)) opens circuit c1 to OR 1; a relay extend through OR 1 triggers a create/created exchange (E(g^x2), g^y2, H(K2)) that opens circuit c2 to OR 2; relay begin and relay connected cells then open the stream to <website>:80, after which relay data cells carry the HTTP GET and its response. The links Alice-OR 1 and OR 1-OR 2 are TLS-encrypted; the final hop to the website is unencrypted. Legend: E(x) is RSA encryption, {X} is AES encryption, cN is a circID.]

Constructing a circuit
A user's OP constructs circuits incrementally, negotiating a
symmetric key with each OR on the circuit, one hop at a time.
To begin creating a new circuit, the OP (call her Alice) sends
a create cell to the first node in her chosen path (call him
Bob). (She chooses a new circID C_AB not currently used on the connection from her to Bob.) The create cell's payload contains the first half of the Diffie-Hellman handshake (g^x), encrypted to the onion key of the OR (call him Bob). Bob responds with a created cell containing g^y along with a hash of the negotiated key K = g^xy.
Once the circuit has been established, Alice and Bob can send one another relay cells encrypted with the negotiated key. (In fact, the negotiated key is used to derive two symmetric keys, one for each direction.) More detail is given in the next section.


To extend the circuit further, Alice sends a relay extend cell
to Bob, specifying the address of the next OR (call her Carol),
and an encrypted g^x2 for her. Bob copies the half-handshake into a create cell, and passes it to Carol to extend the circuit. (Bob chooses a new circID C_BC not currently used on the connection between him and Carol. Alice never needs to know this circID; only Bob associates C_AB on his connection with Alice to C_BC on his connection with Carol.) When
Carol responds with a created cell, Bob wraps the payload
into a relay extended cell and passes it back to Alice. Now
the circuit is extended to Carol, and Alice and Carol share a
common key K2 = g^(x2 y2).
To extend the circuit to a third node or beyond, Alice proceeds as above, always telling the last node in the circuit to
extend one hop further.
This circuit-level handshake protocol achieves unilateral
entity authentication (Alice knows she's handshaking with the OR, but the OR doesn't care who is opening the circuit; Alice uses no public key and remains anonymous) and unilateral key authentication (Alice and the OR agree on a key, and Alice knows only the OR learns it). It also achieves forward secrecy and key freshness. More formally, the protocol is as follows (where E_PKBob(.) is encryption with Bob's public key, H is a secure hash function, and | is concatenation):

Alice -> Bob: E_PKBob(g^x)
Bob -> Alice: g^y, H(K | "handshake")

In the second step, Bob proves that it was he who received g^x, and who chose y. We use PK encryption in the first step
(rather than, say, using the first two steps of STS, which has
a signature in the second step) because a single cell is too
small to hold both a public key and a signature. Preliminary
analysis with the NRL protocol analyzer [35] shows this
protocol to be secure (including perfect forward secrecy)
under the traditional Dolev-Yao model.
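The following toy sketch walks through the create/created exchange above. The Diffie-Hellman group is deliberately tiny and the encryption of g^x to Bob's onion key is omitted, so this only illustrates the key agreement and the H(K | handshake) proof under those simplifying assumptions, not a usable implementation.

import hashlib, secrets

P = 2**127 - 1   # a Mersenne prime; far too small for real use, fine for illustration
G = 3

def dh_keypair():
    x = secrets.randbelow(P - 2) + 1
    return x, pow(G, x, P)

# Alice -> Bob: g^x (in a create cell; really encrypted to Bob's onion key)
x, gx = dh_keypair()

# Bob -> Alice: g^y, H(K | "handshake") (in a created cell)
y, gy = dh_keypair()
K_bob = pow(gx, y, P)
proof = hashlib.sha1(K_bob.to_bytes(16, "big") + b"handshake").digest()

# Alice derives the same key and checks that Bob (who chose y) produced the proof.
K_alice = pow(gy, x, P)
assert hashlib.sha1(K_alice.to_bytes(16, "big") + b"handshake").digest() == proof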

Relay cells
Once Alice has established the circuit (so she shares keys with each OR on the circuit), she can send relay cells. Upon receiving a relay cell, an OR looks up the corresponding circuit, and decrypts the relay header and payload with the session key for that circuit. If the cell is headed away from Alice, the OR then checks whether the decrypted cell has a valid digest (as an optimization, the first two bytes of the integrity check are zero, so in most cases we can avoid computing the hash). If valid, it accepts the relay cell and processes it as described below. Otherwise, the OR looks up the circID and OR for the next step in the circuit, replaces the circID as appropriate, and sends the decrypted relay cell to the next OR. (If the OR at the end of the circuit receives an unrecognized relay cell, an error has occurred, and the circuit is torn down.)
OPs treat incoming relay cells similarly: they iteratively unwrap the relay header and payload with the session keys
shared with each OR on the circuit, from the closest to farthest. If at any stage the digest is valid, the cell must have
originated at the OR whose encryption has just been removed.
To construct a relay cell addressed to a given OR, Alice assigns the digest, and then iteratively encrypts the cell payload
(that is, the relay header and payload) with the symmetric key
of each hop up to that OR. Because the digest is encrypted to
a different value at each step, only at the targeted OR will
it have a meaningful value. (With 48 bits of digest per cell, the probability of an accidental collision is far lower than the chance of hardware failure.) This leaky pipe circuit topology allows Alice's streams to exit at different ORs on a single circuit. Alice may choose different exit points because of
their exit policies, or to keep the ORs from knowing that two
streams originate from the same person.
When an OR later replies to Alice with a relay cell, it encrypts the cell's relay header and payload with the single key
it shares with Alice, and sends the cell back toward Alice
along the circuit. Subsequent ORs add further layers of encryption as they relay the cell back to Alice.
To tear down a circuit, Alice sends a destroy control cell.
Each OR in the circuit receives the destroy cell, closes all
streams on that circuit, and passes a new destroy cell forward.
But just as circuits are built incrementally, they can also be
torn down incrementally: Alice can send a relay truncate cell
to a single OR on a circuit. That OR then sends a destroy cell
forward, and acknowledges with a relay truncated cell. Alice
can then extend the circuit to different nodes, without signaling to the intermediate nodes (or a limited observer) that she
has changed her circuit. Similarly, if a node on the circuit
goes down, the adjacent node can send a relay truncated cell
back to Alice. Thus the "break a node and see which circuits go down" attack [4] is weakened.
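The layered encryption and decryption of relay cells described in this subsection can be sketched as follows. Tor uses 128-bit AES in counter mode per hop; to stay within the standard library, this sketch derives a keystream from SHA-256 in counter mode as a stand-in for AES-CTR, so it illustrates only the onion layering, not the real cipher.

import hashlib

def keystream(key: bytes, length: int) -> bytes:
    """SHA-256 in counter mode, standing in for the per-hop AES-CTR keystream."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])

def xor_layer(key: bytes, data: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

# Alice shares one symmetric key with each hop. She adds a layer for every hop
# up to the intended exit; each OR along the way strips exactly one layer.
hop_keys = [b"key-OR1", b"key-OR2", b"key-OR3"]
payload = b"relay data: HTTP GET ..."

cell = payload
for key in reversed(hop_keys):      # Alice encrypts, outermost layer last
    cell = xor_layer(key, cell)
for key in hop_keys:                # each OR removes its own layer in turn
    cell = xor_layer(key, cell)
assert cell == payload              # the exit OR sees the plaintext relay payload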

4.3 Opening and closing streams

When Alice's application wants a TCP connection to a given address and port, it asks the OP (via SOCKS) to make the
connection. The OP chooses the newest open circuit (or creates one if needed), and chooses a suitable OR on that circuit
to be the exit node (usually the last node, but maybe others
due to exit policy conflicts; see Section 6.2.) The OP then
opens the stream by sending a relay begin cell to the exit node,
using a new random streamID. Once the exit node connects
to the remote host, it responds with a relay connected cell.
Upon receipt, the OP sends a SOCKS reply to notify the application of its success. The OP now accepts data from the
application's TCP stream, packaging it into relay data cells
and sending those cells along the circuit to the chosen OR.
There's a catch to using SOCKS, however: some applications pass the alphanumeric hostname to the Tor client, while others resolve it into an IP address first and then pass the IP address to the Tor client. If the application does DNS resolution first, Alice thereby reveals her destination to the remote
DNS server, rather than sending the hostname through the Tor
network to be resolved at the far end. Common applications
like Mozilla and SSH have this flaw.
With Mozilla, the flaw is easy to address: the filtering
HTTP proxy called Privoxy gives a hostname to the Tor
client, so Alice's computer never does DNS resolution. But
a portable general solution, such as is needed for SSH, is an
open problem. Modifying or replacing the local nameserver
can be invasive, brittle, and unportable. Forcing the resolver
library to prefer TCP rather than UDP is hard, and also has
portability problems. Dynamically intercepting system calls
to the resolver library seems a promising direction. We could
also provide a tool similar to dig to perform a private lookup
through the Tor network. Currently, we encourage the use of
privacy-aware proxies like Privoxy wherever possible.
Closing a Tor stream is analogous to closing a TCP stream:
it uses a two-step handshake for normal operation, or a one-step handshake for errors. If the stream closes abnormally,
the adjacent node simply sends a relay teardown cell. If the
stream closes normally, the node sends a relay end cell down
the circuit, and the other side responds with its own relay end
cell. Because all relay cells use layered encryption, only the
destination OR knows that a given relay cell is a request to
close a stream. This two-step handshake allows Tor to support
TCP-based applications that use half-closed connections.

4.4 Integrity checking on streams

Because the old Onion Routing design used a stream cipher without integrity checking, traffic was vulnerable to a malleability attack: though the attacker could not decrypt cells,
any changes to encrypted data would create corresponding
changes to the data leaving the network. This weakness allowed an adversary who could guess the encrypted content to
change a padding cell to a destroy cell; change the destination
address in a relay begin cell to the adversary's webserver; or change an FTP command from "dir" to "rm *". (Even an external adversary could do this, because the link encryption
similarly used a stream cipher.)
Because Tor uses TLS on its links, external adversaries
cannot modify data. Addressing the insider malleability attack, however, is more complex.
We could do integrity checking of the relay cells at each
hop, either by including hashes or by using an authenticating
cipher mode like EAX [6], but there are some problems. First,
these approaches impose a message-expansion overhead at
each hop, and so we would have to either leak the path length
or waste bytes by padding to a maximum path length. Second, these solutions can only verify traffic coming from Alice: ORs would not be able to produce suitable hashes for
the intermediate hops, since the ORs on a circuit do not know
the other ORs' session keys. Third, we have already accepted

that our design is vulnerable to end-to-end timing attacks; so tagging attacks performed within the circuit provide no additional information to the attacker.
Thus, we check integrity only at the edges of each stream.
(Remember that in our leaky-pipe circuit topology, a stream's
edge could be any hop in the circuit.) When Alice negotiates
a key with a new hop, they each initialize a SHA-1 digest with
a derivative of that key, thus beginning with randomness that
only the two of them know. Then they each incrementally
add to the SHA-1 digest the contents of all relay cells they
create, and include with each relay cell the first four bytes of
the current digest. Each also keeps a SHA-1 digest of data
received, to verify that the received hashes are correct.
To be sure of removing or modifying a cell, the attacker
must be able to deduce the current digest state (which depends on all traffic between Alice and Bob, starting with their
negotiated key). Attacks on SHA-1 where the adversary can
incrementally add to a hash to produce a new valid hash don't
work, because all hashes are end-to-end encrypted across the
circuit. The computational overhead of computing the digests
is minimal compared to doing the AES encryption performed
at each hop of the circuit. We use only four bytes per cell
to minimize overhead; the chance that an adversary will correctly guess a valid hash is acceptably low, given that the OP
or OR tear down the circuit if they receive a bad hash.
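A sketch of the running digest described above: both edges of a stream seed a SHA-1 state with material derived from their shared key, fold every relay cell they create into it, and attach the first four bytes. The key derivation below is a placeholder for illustration, not Tor's actual key derivation.

import hashlib

class EdgeDigest:
    def __init__(self, shared_key: bytes):
        # Seeded with a derivative of the negotiated key, so it starts from
        # randomness only the two edge nodes know.
        self.state = hashlib.sha1(b"integrity-seed|" + shared_key)

    def tag(self, relay_payload: bytes) -> bytes:
        self.state.update(relay_payload)
        return self.state.digest()[:4]   # first four bytes of the running digest

alice = EdgeDigest(b"negotiated key")
exit_or = EdgeDigest(b"negotiated key")

payload = b"relay begin example.com:80"
sent = alice.tag(payload)
assert exit_or.tag(payload) == sent      # the receiving edge verifies before acting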

4.5 Rate limiting and fairness

Volunteers are more willing to run services that can limit their bandwidth usage. To accommodate them, Tor servers
use a token bucket approach [50] to enforce a long-term average rate of incoming bytes, while still permitting short-term
bursts above the allowed bandwidth.
Because the Tor protocol outputs about the same number
of bytes as it takes in, it is sufficient in practice to limit only
incoming bytes. With TCP streams, however, the correspondence is not one-to-one: relaying a single incoming byte can
require an entire 512-byte cell. (We can't just wait for more
bytes, because the local application may be awaiting a reply.)
Therefore, we treat this case as if the entire cell size had been
read, regardless of the cell's fullness.
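A minimal token-bucket sketch of this rate limiting follows; the rate and burst values are illustrative, not Tor's defaults.

import time

class TokenBucket:
    def __init__(self, rate_bytes_per_sec: float, burst_bytes: float):
        self.rate = rate_bytes_per_sec        # long-term average rate
        self.capacity = burst_bytes           # short-term burst allowance
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def allow(self, nbytes: int) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False

bucket = TokenBucket(rate_bytes_per_sec=50_000, burst_bytes=100 * 512)
# As noted above, even a single incoming byte is charged as a whole 512-byte cell.
print(bucket.allow(512))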
Further, inspired by Rennhard et al.'s design in [44], a circuit's edges can heuristically distinguish interactive streams
from bulk streams by comparing the frequency with which
they supply cells. We can provide good latency for interactive
streams by giving them preferential service, while still giving
good overall throughput to the bulk streams. Such preferential treatment presents a possible end-to-end attack, but an
adversary observing both ends of the stream can already learn
this information through timing attacks.

4.6 Congestion control

Even with bandwidth rate limiting, we still need to worry about congestion, either accidental or intentional. If enough
users choose the same OR-to-OR connection for their circuits, that connection can become saturated. For example,
an attacker could send a large file through the Tor network
to a webserver he runs, and then refuse to read any of the
bytes at the webserver end of the circuit. Without some congestion control mechanism, these bottlenecks can propagate
back through the entire network. We don't need to reimplement full TCP windows (with sequence numbers, the ability to drop cells when we're full and retransmit later, and
so on), because TCP already guarantees in-order delivery of
each cell. We describe our response below.
Circuit-level throttling: To control a circuits bandwidth
usage, each OR keeps track of two windows. The packaging
window tracks how many relay data cells the OR is allowed to
package (from incoming TCP streams) for transmission back
to the OP, and the delivery window tracks how many relay
data cells it is willing to deliver to TCP streams outside the
network. Each window is initialized (say, to 1000 data cells).
When a data cell is packaged or delivered, the appropriate
window is decremented. When an OR has received enough
data cells (currently 100), it sends a relay sendme cell towards
the OP, with streamID zero. When an OR receives a relay
sendme cell with streamID zero, it increments its packaging
window. Either of these cells increments the corresponding
window by 100. If the packaging window reaches 0, the OR
stops reading from TCP connections for all streams on the
corresponding circuit, and sends no more relay data cells until
receiving a relay sendme cell.
The OP behaves identically, except that it must track a
packaging window and a delivery window for every OR in
the circuit. If a packaging window reaches 0, it stops reading
from streams destined for that OR.
Stream-level throttling: The stream-level congestion control mechanism is similar to the circuit-level mechanism. ORs
and OPs use relay sendme cells to implement end-to-end flow
control for individual streams across circuits. Each stream
begins with a packaging window (currently 500 cells), and
increments the window by a fixed value (50) upon receiving a relay sendme cell. Rather than always returning a relay
sendme cell as soon as enough cells have arrived, the stream-level congestion control also has to check whether data has
been successfully flushed onto the TCP stream; it sends the
relay sendme cell only when the number of bytes pending to
be flushed is under some threshold (currently 10 cells' worth).
These arbitrarily chosen parameters seem to give tolerable
throughput and delay; see Section 8.
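The circuit-level bookkeeping above can be sketched as follows; the window constants are the ones quoted in the text (1000 and 100), while everything else is a simplification of what an OR or OP actually keeps per circuit.

CIRCWINDOW_START = 1000      # initial packaging/delivery window, in cells
CIRCWINDOW_INCREMENT = 100   # added per relay sendme (streamID zero)

class CircuitWindow:
    def __init__(self):
        self.packaging = CIRCWINDOW_START   # cells we may still package
        self.delivery = CIRCWINDOW_START    # cells we may still deliver

    def can_package(self) -> bool:
        # At zero, stop reading from TCP connections for streams on this circuit.
        return self.packaging > 0

    def cell_packaged(self):
        self.packaging -= 1

    def sendme_received(self):
        self.packaging += CIRCWINDOW_INCREMENT

    def cell_delivered(self) -> bool:
        """Return True when a relay sendme should be sent back (every 100 cells)."""
        self.delivery -= 1
        if self.delivery <= CIRCWINDOW_START - CIRCWINDOW_INCREMENT:
            self.delivery += CIRCWINDOW_INCREMENT
            return True
        return False

win = CircuitWindow()
for _ in range(CIRCWINDOW_START):
    win.cell_packaged()
assert not win.can_package()             # window exhausted: stop reading
win.sendme_received()
assert win.can_package()                 # a sendme reopens the window
assert sum(win.cell_delivered() for _ in range(100)) == 1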

5 Rendezvous Points and hidden services

Rendezvous points are a building block for location-hidden services (also known as responder anonymity) in the Tor network. Location-hidden services allow Bob to offer a TCP service, such as a webserver, without revealing his IP address.
This type of anonymity protects against distributed DoS attacks: attackers are forced to attack the onion routing network
because they do not know Bob's IP address.
Our design for location-hidden servers has the following
goals. Access-control: Bob needs a way to filter incoming
requests, so an attacker cannot flood Bob simply by making many connections to him. Robustness: Bob should be
able to maintain a long-term pseudonymous identity even in
the presence of router failure. Bob's service must not be tied
to a single OR, and Bob must be able to migrate his service
across ORs. Smear-resistance: A social attacker should not
be able to frame a rendezvous router by offering an illegal or disreputable location-hidden service and making observers believe the router created that service. Application-transparency: Although we require users to run special software to access location-hidden servers, we must not require
them to modify their applications.
We provide location-hiding for Bob by allowing him to
advertise several onion routers (his introduction points) as
contact points. He may do this on any robust, efficient key-value lookup system with authenticated updates, such as a distributed hash table (DHT) like CFS [11]. (Rather than rely on an external infrastructure, the Onion Routing network can run the lookup service itself; our current implementation provides a simple lookup system on the directory servers.) Alice, the client,
chooses an OR as her rendezvous point. She connects to one
of Bob's introduction points, informs him of her rendezvous point, and then waits for him to connect to the rendezvous point. This extra level of indirection helps Bob's introduction points avoid problems associated with serving unpopular files directly (for example, if Bob serves material that the introduction point's community finds objectionable, or if Bob's
service tends to get attacked by network vandals). The extra level of indirection also allows Bob to respond to some
requests and ignore others.

5.1 Rendezvous points in Tor

The following steps are performed on behalf of Alice and Bob by their local OPs; application integration is described more fully below.
- Bob generates a long-term public key pair to identify his service.
- Bob chooses some introduction points, and advertises them on the lookup service, signing the advertisement with his public key. He can add more later.
- Bob builds a circuit to each of his introduction points, and tells them to wait for requests.
- Alice learns about Bob's service out of band (perhaps Bob told her, or she found it on a website). She retrieves the details of Bob's service from the lookup service. If Alice wants to access Bob's service anonymously, she must connect to the lookup service via Tor.
- Alice chooses an OR as the rendezvous point (RP) for her connection to Bob's service. She builds a circuit to the RP, and gives it a randomly chosen rendezvous cookie to recognize Bob.
- Alice opens an anonymous stream to one of Bob's introduction points, and gives it a message (encrypted with Bob's public key) telling it about herself, her RP and rendezvous cookie, and the start of a DH handshake. The introduction point sends the message to Bob.
- If Bob wants to talk to Alice, he builds a circuit to Alice's RP and sends the rendezvous cookie, the second half of the DH handshake, and a hash of the session key they now share. By the same argument as in Section 4.2, Alice knows she shares the key only with Bob.
- The RP connects Alice's circuit to Bob's. Note that the RP can't recognize Alice, Bob, or the data they transmit.
- Alice sends a relay begin cell along the circuit. It arrives at Bob's OP, which connects to Bob's webserver.
- An anonymous stream has been established, and Alice and Bob communicate as normal.
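The rendezvous point's role in this exchange is deliberately simple: it pairs two circuits that present the same cookie and learns nothing else. The sketch below captures that behavior; the data structures are illustrative only.

import os

class RendezvousPoint:
    def __init__(self):
        self.waiting = {}                     # rendezvous cookie -> Alice's circuit

    def establish(self, alice_circ: int, cookie: bytes):
        # Alice's circuit leaves a cookie and waits for Bob.
        self.waiting[cookie] = alice_circ

    def rendezvous(self, cookie: bytes, bob_circ: int):
        # A circuit from Bob presents the same cookie; splice the two circuits.
        alice_circ = self.waiting.pop(cookie, None)
        if alice_circ is None:
            return None                       # unknown cookie: refuse
        return (alice_circ, bob_circ)         # relay cells between these from now on

rp = RendezvousPoint()
cookie = os.urandom(20)                       # chosen by Alice, passed to Bob via the
rp.establish(alice_circ=7, cookie=cookie)     # introduction point
print(rp.rendezvous(cookie, bob_circ=12))     # (7, 12)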

When establishing an introduction point, Bob provides the onion router with the public key identifying his service. Bob
signs his messages, so others cannot usurp his introduction
point in the future. He uses the same public key to establish
the other introduction points for his service, and periodically
refreshes his entry in the lookup service.
The message that Alice gives the introduction point includes a hash of Bob's public key and an optional initial authorization token (the introduction point can do prescreening,
for example to block replays). Her message to Bob may include an end-to-end authorization token so Bob can choose
whether to respond. The authorization tokens can be used
to provide selective access: important users can get uninterrupted access. During normal situations, Bob's service might
simply be offered directly from mirrors, while Bob gives
out tokens to high-priority users. If the mirrors are knocked
down, those users can switch to accessing Bob's service via
the Tor rendezvous system.
Bob's introduction points are themselves subject to DoS; he must open many introduction points or risk such an attack. He can provide selected users with a current list or future schedule of unadvertised introduction points; this is most
practical if there is a stable and large group of introduction
points available. Bob could also give secret public keys for
consulting the lookup service. All of these approaches limit
exposure even when some selected users collude in the DoS.

5.2 Integration with user applications

Bob configures his onion proxy to know the local IP address and port of his service, a strategy for authorizing clients, and
his public key. The onion proxy anonymously publishes a
signed statement of Bob's public key, an expiration time, and
the current introduction points for his service onto the lookup
service, indexed by the hash of his public key. Bob's webserver is unmodified, and doesn't even know that it's hidden
behind the Tor network.
Alice's applications also work unchanged; her client
interface remains a SOCKS proxy. We encode all of
the necessary information into the fully qualified domain
name (FQDN) Alice uses when establishing her connection.
Location-hidden services use a virtual top level domain called
.onion: thus hostnames take the form x.y.onion where
x is the authorization cookie and y encodes the hash of
the public key. Alice's onion proxy examines addresses; if they're destined for a hidden server, it decodes the key and
starts the rendezvous as described above.
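Recognizing and decoding such names is straightforward; the sketch below shows the kind of check an onion proxy performs before starting the rendezvous protocol (the label handling is an assumption for illustration).

def parse_onion_address(hostname: str):
    """Return (authorization cookie, key hash) for x.y.onion names, else None."""
    labels = hostname.lower().rstrip(".").split(".")
    if labels[-1] != "onion":
        return None                           # ordinary name: resolve through Tor
    if len(labels) == 3:
        return labels[0], labels[1]           # x = auth cookie, y = hash of the key
    if len(labels) == 2:
        return None, labels[0]                # no authorization cookie supplied
    return None

print(parse_onion_address("cookie123.abcdef0123456789.onion"))
print(parse_onion_address("www.example.com"))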

5.3 Previous rendezvous work

Rendezvous points in low-latency anonymity systems were first described for use in ISDN telephony [30, 38]. Later low-latency designs used rendezvous points for hiding location
of mobile phones and low-power location trackers [23, 40].
Rendezvous for anonymizing low-latency Internet connections was suggested in early Onion Routing work [27], but
the first published design was by Ian Goldberg [26]. His design differs from ours in three ways. First, Goldberg suggests
that Alice should manually hunt down a current location of
the service via Gnutella; our approach makes lookup transparent to the user, as well as faster and more robust. Second,
in Tor the client and server negotiate session keys with Diffie-Hellman, so plaintext is not exposed even at the rendezvous point. Third, our design minimizes the exposure from running the service, to encourage volunteers to offer introduction and rendezvous services. Tor's introduction points do not output any bytes to the clients; the rendezvous points don't know the client or the server, and can't read the data being transmitted. The indirection scheme is also designed to include authentication/authorization: if Alice doesn't include
the right cookie with her request for service, Bob need not
even acknowledge his existence.

6 Other design decisions

6.1 Denial of service

Providing Tor as a public service creates many opportunities for denial-of-service attacks against the network. While
flow control and rate limiting (discussed in Section 4.6) prevent users from consuming more bandwidth than routers are willing to provide, opportunities remain for users to consume more network resources than their fair share, or to render the network unusable for others.
First of all, there are several CPU-consuming denial-of-service attacks wherein an attacker can force an OR to perform expensive cryptographic operations. For example, an attacker can fake the start of a TLS handshake, forcing the OR
to carry out its (comparatively expensive) half of the handshake at no real computational cost to the attacker.
We have not yet implemented any defenses for these attacks, but several approaches are possible. First, ORs can
require clients to solve a puzzle [16] while beginning new
TLS handshakes or accepting create cells. So long as these
tokens are easy to verify and computationally expensive to
produce, this approach limits the attack multiplier. Additionally, ORs can limit the rate at which they accept create cells
and TLS connections, so that the computational work of processing them does not drown out the symmetric cryptography
operations that keep cells flowing. This rate limiting could,
however, allow an attacker to slow down other users when
they build new circuits.
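A hash-based client puzzle of the kind mentioned above might look like the sketch below: cheap for the OR to verify, tunably expensive for the client to solve. The construction and difficulty are illustrative assumptions, not a deployed Tor mechanism.

import hashlib, os

def solve_puzzle(challenge: bytes, difficulty_bits: int) -> int:
    """Find a nonce whose hash with the challenge has difficulty_bits leading zero bits."""
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") >> (256 - difficulty_bits) == 0:
            return nonce
        nonce += 1

def verify_puzzle(challenge: bytes, nonce: int, difficulty_bits: int) -> bool:
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") >> (256 - difficulty_bits) == 0

challenge = os.urandom(16)                    # issued by the OR with each handshake
nonce = solve_puzzle(challenge, 16)           # roughly 65,000 hashes on average for the client
assert verify_puzzle(challenge, nonce, 16)    # one hash for the OR to check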
Adversaries can also attack the Tor network's hosts and
network links. Disrupting a single circuit or link breaks all
streams passing along that part of the circuit. Users similarly lose service when a router crashes or its operator restarts
it. The current Tor design treats such attacks as intermittent network failures, and depends on users and applications
to respond or recover as appropriate. A future design could
use an end-to-end TCP-like acknowledgment protocol, so no
streams are lost unless the entry or exit point is disrupted.
This solution would require more buffering at the network
edges, however, and the performance and anonymity implications from this extra complexity still require investigation.

6.2 Exit policies and abuse

Exit abuse is a serious barrier to wide-scale Tor deployment. Anonymity presents would-be vandals and abusers with an
opportunity to hide the origins of their activities. Attackers
can harm the Tor network by implicating exit servers for their
abuse. Also, applications that commonly use IP-based authentication (such as institutional mail or webservers) can be
fooled by the fact that anonymous connections appear to originate at the exit OR.
We stress that Tor does not enable any new class of abuse.
Spammers and other attackers already have access to thousands of misconfigured systems worldwide, and the Tor network is far from the easiest way to launch attacks. But because the onion routers can be mistaken for the originators
of the abuse, and the volunteers who run them may not want
to deal with the hassle of explaining anonymity networks to
irate administrators, we must block or limit abuse through the
Tor network.
To mitigate abuse issues, each onion router's exit policy describes to which external addresses and ports the router will connect. On one end of the spectrum are open exit nodes
that will connect anywhere. On the other end are middleman
nodes that only relay traffic to other Tor nodes, and private
exit nodes that only connect to a local host or network. A
private exit can allow a client to connect to a given host or
network more securely: an external adversary cannot eavesdrop traffic between the private exit and the final destination, and so is less sure of Alice's destination and activities. Most
onion routers in the current network function as restricted exits that permit connections to the world at large, but prevent
access to certain abuse-prone addresses and services such as
SMTP. The OR might also be able to authenticate clients to
prevent exit abuse without harming anonymity [48].
Many administrators use port restrictions to support only a
limited set of services, such as HTTP, SSH, or AIM. This is
not a complete solution, of course, since abuse opportunities
for these protocols are still well known.
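Exit-policy matching amounts to walking an ordered list of accept/reject rules and taking the first match. The rule syntax below is a simplification of what routers advertise (IPv6 and port ranges are omitted), and the default of accepting unmatched traffic is an assumption of this sketch.

import ipaddress

def parse_rule(rule: str):
    action, spec = rule.split(" ", 1)
    addr, port = spec.rsplit(":", 1)
    net = None if addr == "*" else ipaddress.ip_network(addr)
    p = None if port == "*" else int(port)
    return action, net, p

def exit_allowed(policy, address: str, port: int) -> bool:
    ip = ipaddress.ip_address(address)
    for action, net, p in map(parse_rule, policy):
        if (net is None or ip in net) and (p is None or p == port):
            return action == "accept"
    return True                               # assumed default when nothing matches

policy = ["reject *:25",                      # no SMTP: too abuse-prone
          "reject 10.0.0.0/8:*",              # nothing into private address space
          "accept *:*"]
print(exit_allowed(policy, "93.184.216.34", 80))   # True
print(exit_allowed(policy, "93.184.216.34", 25))   # False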
We have not yet encountered any abuse in the deployed
network, but if we do we should consider using proxies to
clean traffic for certain protocols as it leaves the network. For
example, much abusive HTTP behavior (such as exploiting
buffer overflows or well-known script vulnerabilities) can be
detected in a straightforward manner. Similarly, one could
run automatic spam filtering software (such as SpamAssassin) on email exiting the OR network.
ORs may also rewrite exiting traffic to append headers
or other information indicating that the traffic has passed
through an anonymity service. This approach is commonly
used by email-only anonymity systems. ORs can also run
on servers with hostnames like "anonymous" to further alert
abuse targets to the nature of the anonymous traffic.
A mixture of open and restricted exit nodes allows the most
flexibility for volunteers running servers. But while having
many middleman nodes provides a large and robust network,
having only a few exit nodes reduces the number of points an
adversary needs to monitor for traffic analysis, and places a
greater burden on the exit nodes. This tension can be seen in
the Java Anon Proxy cascade model, wherein only one node
in each cascade needs to handle abuse complaints, but an adversary only needs to observe the entry and exit of a cascade to perform traffic analysis on all that cascade's users. The hydra model (many entries, few exits) presents a different compromise: only a few exit nodes are needed, but an adversary
needs to work harder to watch all the clients; see Section 10.
Finally, we note that exit abuse must not be dismissed as
a peripheral issue: when a system's public image suffers, it can reduce the number and diversity of that system's users,
and thereby reduce the anonymity of the system itself. Like
usability, public perception is a security parameter. Sadly,
preventing abuse of open exit nodes is an unsolved problem,
and will probably remain an arms race for the foreseeable
future. The abuse problems faced by Princeton's CoDeeN
project [37] give us a glimpse of likely issues.

6.3 Directory Servers

First-generation Onion Routing designs [8, 41] used in-band network status updates: each router flooded a signed statement to its neighbors, which propagated it onward. But
anonymizing networks have different security goals than typical link-state routing protocols. For example, delays (accidental or intentional) that can cause different parts of the network to have different views of link-state and topology are
not only inconvenient: they give attackers an opportunity to
exploit differences in client knowledge. We also worry about
attacks to deceive a client about the router membership list,
topology, or current network state. Such partitioning attacks
on client knowledge help an adversary to efficiently deploy
resources against a target [15].
Tor uses a small group of redundant, well-known onion
routers to track changes in network topology and node state,
including keys and exit policies. Each such directory server
acts as an HTTP server, so clients can fetch current network
state and router lists, and so other ORs can upload state information. Onion routers periodically publish signed statements
of their state to each directory server. The directory servers
combine this information with their own views of network
liveness, and generate a signed description (a directory) of
the entire network state. Client software is pre-loaded with a
list of the directory servers and their keys, to bootstrap each
client's view of the network.
When a directory server receives a signed statement for an
OR, it checks whether the OR's identity key is recognized.
Directory servers do not advertise unrecognized ORs; if they
did, an adversary could take over the network by creating
many servers [22]. Instead, new nodes must be approved by
the directory server administrator before they are included.
Mechanisms for automated node approval are an area of active research, and are discussed more in Section 9.
Of course, a variety of attacks remain. An adversary who
controls a directory server can track clients by providing them
different information, perhaps by listing only nodes under
its control, or by informing only certain clients about a given
node. Even an external adversary can exploit differences in
client knowledge: clients who use a node listed on one directory server but not the others are vulnerable.
Thus these directory servers must be synchronized and
redundant, so that they can agree on a common directory.
Clients should only trust this directory if it is signed by a
threshold of the directory servers.
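The client-side rule is simple to state: accept a directory only if enough recognized directory servers signed it. The sketch below checks such a threshold; real directories carry public-key signatures, and the HMACs here are only a self-contained stand-in for signature verification.

import hmac, hashlib

def directory_trusted(directory: bytes, signatures: dict, server_keys: dict,
                      threshold: int) -> bool:
    valid = 0
    for server, sig in signatures.items():
        key = server_keys.get(server)
        if key is None:
            continue                          # signer not in the pre-loaded list
        expected = hmac.new(key, directory, hashlib.sha256).digest()
        if hmac.compare_digest(expected, sig):
            valid += 1
    return valid >= threshold

server_keys = {"dir1": b"k1", "dir2": b"k2", "dir3": b"k3"}   # shipped with the client
directory = b"router descriptors, exit policies, liveness..."
sigs = {s: hmac.new(k, directory, hashlib.sha256).digest()
        for s, k in server_keys.items() if s != "dir3"}       # dir3 is unreachable
print(directory_trusted(directory, sigs, server_keys, threshold=2))   # True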
The directory servers in Tor are modeled after those in
Mixminion [15], but our situation is easier. First, we make
the simplifying assumption that all participants agree on the
set of directory servers. Second, while Mixminion needs
to predict node behavior, Tor only needs a threshold consensus of the current state of the network. Third, we assume that we can fall back to the human administrators to
discover and resolve problems when a consensus directory

cannot be reached. Since there are relatively few directory servers (currently 3, but we expect as many as 9 as the network scales), we can afford operations like broadcast to simplify the consensus-building protocol.
To avoid attacks where a router connects to all the directory servers but refuses to relay traffic from other routers,
the directory servers must also build circuits and use them to
anonymously test router reliability [18]. Unfortunately, this
defense is not yet designed or implemented.
Using directory servers is simpler and more flexible than
flooding. Flooding is expensive, and complicates the analysis
when we start experimenting with non-clique network topologies. Signed directories can be cached by other onion routers,
so directory servers are not a performance bottleneck when
we have many users, and do not aid traffic analysis by forcing
clients to announce their existence to any central point.

7 Attacks and Defenses

Below we summarize a variety of attacks, and discuss how well our design withstands them.

Passive attacks
Observing user traffic patterns. Observing a user's connection will not reveal her destination or data, but it will reveal
traffic patterns (both sent and received). Profiling via user
connection patterns requires further processing, because multiple application streams may be operating simultaneously or
in series over a single circuit.
Observing user content. While content at the user end is
encrypted, connections to responders may not be (indeed, the
responding website itself may be hostile). While filtering
content is not a primary goal of Onion Routing, Tor can directly use Privoxy and related filtering services to anonymize
application data streams.
Option distinguishability. We allow clients to choose configuration options. For example, clients concerned about request linkability should rotate circuits more often than those
concerned about traceability. Allowing choice may attract
users with different needs; but clients who are in the minority may lose more anonymity by appearing distinct than they
gain by optimizing their behavior [1].
End-to-end timing correlation. Tor only minimally hides
such correlations. An attacker watching patterns of traffic at
the initiator and the responder will be able to confirm the correspondence with high probability. The greatest protection
currently available against such confirmation is to hide the
connection between the onion proxy and the first Tor node,
by running the OP on the Tor node or behind a firewall. This
approach requires an observer to separate traffic originating at
the onion router from traffic passing through it: a global observer can do this, but it might be beyond a limited observer's
capabilities.

End-to-end size correlation. Simple packet counting will also be effective in confirming endpoints of a stream. However, even without padding, we may have some limited protection: the leaky pipe topology means different numbers of
packets may enter one end of a circuit than exit at the other.
Website fingerprinting. All the effective passive attacks
above are traffic confirmation attacks, which puts them outside our design goals. There is also a passive traffic analysis
attack that is potentially effective. Rather than searching
exit connections for timing and volume correlations, the
adversary may build up a database of fingerprints containing file sizes and access patterns for targeted websites. He
can later confirm a users connection to a given site simply
by consulting the database. This attack has been shown to
be effective against SafeWeb [29]. It may be less effective
against Tor, since streams are multiplexed within the same
circuit, and fingerprinting will be limited to the granularity
of cells (currently 512 bytes). Additional defenses could
include larger cell sizes, padding schemes to group websites
into large sets, and link padding or long-range dummies. (This fingerprinting attack should not be confused with the much more complicated latency attacks of [5], which require a fingerprint of the latencies of all circuits through the network, combined with those from the network edges to the target user and the responder website.)

Active attacks
Compromise keys. An attacker who learns the TLS session
key can see control cells and encrypted relay cells on every
circuit on that connection; learning a circuit session key lets
him unwrap one layer of the encryption. An attacker who
learns an OR's TLS private key can impersonate that OR for the TLS key's lifetime, but he must also learn the onion key
to decrypt create cells (and because of perfect forward secrecy, he cannot hijack already established circuits without
also compromising their session keys). Periodic key rotation
limits the window of opportunity for these attacks. On the
other hand, an attacker who learns a node's identity key can
replace that node indefinitely by sending new forged descriptors to the directory servers.
Iterated compromise. A roving adversary who can compromise ORs (by system intrusion, legal coercion, or extralegal coercion) could march down the circuit compromising the
nodes until he reaches the end. Unless the adversary can complete this attack within the lifetime of the circuit, however,
the ORs will have discarded the necessary information before
the attack can be completed. (Thanks to the perfect forward
secrecy of session keys, the attacker cannot force nodes to decrypt recorded traffic once the circuits have been closed.) Additionally, building circuits that cross jurisdictions can make
legal coercion harder; this phenomenon is commonly called
jurisdictional arbitrage. The Java Anon Proxy project recently experienced the need for this approach, when a German court forced them to add a backdoor to their nodes [51].
Run a recipient. An adversary running a webserver trivially
learns the timing patterns of users connecting to it, and can introduce arbitrary patterns in its responses. End-to-end attacks
become easier: if the adversary can induce users to connect
to his webserver (perhaps by advertising content targeted to
those users), he now holds one end of their connection. There
is also a danger that application protocols and associated programs can be induced to reveal information about the initiator.
Tor depends on Privoxy and similar protocol cleaners to solve
this latter problem.
Run an onion proxy. It is expected that end users will nearly
always run their own local onion proxy. However, in some
settings, it may be necessary for the proxy to run remotely,
typically in institutions that want to monitor the activity of
those connecting to the proxy. Compromising an onion proxy
compromises all future connections through it.
DoS non-observed nodes. An observer who can only watch
some of the Tor network can increase the value of this traffic
by attacking non-observed nodes to shut them down, reduce
their reliability, or persuade users that they are not trustworthy. The best defense here is robustness.
Run a hostile OR. In addition to being a local observer, an
isolated hostile node can create circuits through itself, or alter
traffic patterns to affect traffic at other nodes. Nonetheless, a
hostile node must be immediately adjacent to both endpoints
to compromise the anonymity of a circuit. If an adversary can
run multiple ORs, and can persuade the directory servers that
those ORs are trustworthy and independent, then occasionally
some user will choose one of those ORs for the start and another as the end of a circuit. If an adversary controls m > 1 of N nodes, he can correlate at most (m/N)^2 of the traffic,
although an adversary could still attract a disproportionately
large amount of traffic by running an OR with a permissive
exit policy, or by degrading the reliability of other routers.
Introduce timing into messages. This is simply a stronger
version of passive timing attacks already discussed earlier.
Tagging attacks. A hostile node could tag a cell by altering it. If the stream were, for example, an unencrypted
request to a Web site, the garbled content coming out at the
appropriate time would confirm the association. However, integrity checks on cells prevent this attack.
Replace contents of unauthenticated protocols. When relaying an unauthenticated protocol like HTTP, a hostile exit
node can impersonate the target server. Clients should prefer
protocols with end-to-end authentication.
Replay attacks. Some anonymity protocols are vulnerable
to replay attacks. Tor is not; replaying one side of a handshake will result in a different negotiated session key, and so
the rest of the recorded session can't be used.
Smear attacks. An attacker could use the Tor network for
socially disapproved acts, to bring the network into disrepute
and get its operators to shut it down. Exit policies reduce
the possibilities for abuse, but ultimately the network requires
volunteers who can tolerate some political heat.
Distribute hostile code. An attacker could trick users into running subverted Tor software that did not, in fact, anonymize their connections, or worse, could trick ORs
into running weakened software that provided users with
less anonymity. We address this problem (but do not solve it
completely) by signing all Tor releases with an official public
key, and including an entry in the directory that lists which
versions are currently believed to be secure. To prevent an
attacker from subverting the official release itself (through
threats, bribery, or insider attacks), we provide all releases in
source code form, encourage source audits, and frequently
warn our users never to trust any software (even from us) that
comes without source.

Directory attacks
Destroy directory servers. If a few directory servers disappear, the others still decide on a valid directory. So long
as any directory servers remain in operation, they will still
broadcast their views of the network and generate a consensus
directory. (If more than half are destroyed, this directory will
not, however, have enough signatures for clients to use it automatically; human intervention will be necessary for clients
to decide whether to trust the resulting directory.)
Subvert a directory server. By taking over a directory
server, an attacker can partially influence the final directory.
Since ORs are included or excluded by majority vote, the corrupt directory can at worst cast a tie-breaking vote to decide
whether to include marginal ORs. It remains to be seen how
often such marginal cases occur in practice.
Subvert a majority of directory servers. An adversary who
controls more than half the directory servers can include as
many compromised ORs in the final directory as he wishes.
We must ensure that directory server operators are independent and attack-resistant.
Encourage directory server dissent. The directory agreement protocol assumes that directory server operators agree
on the set of directory servers. An adversary who can persuade some of the directory server operators to distrust one
another could split the quorum into mutually hostile camps,
thus partitioning users based on which directory they use. Tor
does not address this attack.
Trick the directory servers into listing a hostile OR. Our
threat model explicitly assumes directory server operators
will be able to filter out most hostile ORs.
Convince the directories that a malfunctioning OR is
working. In the current Tor implementation, directory servers
assume that an OR is running correctly if they can start a
TLS connection to it. A hostile OR could easily subvert this
test by accepting TLS connections from ORs but ignoring all
cells. Directory servers must actively test ORs by building
circuits and streams as appropriate. The tradeoffs of a similar
approach are discussed in [18].

Attacks against rendezvous points


Make many introduction requests. An attacker could try to deny Bob service by flooding his introduction points with requests. Because the introduction points can block requests
that lack authorization tokens, however, Bob can restrict the
volume of requests he receives, or require a certain amount of
computation for every request he receives.
Attack an introduction point. An attacker could disrupt a
location-hidden service by disabling its introduction points.
But because a service's identity is attached to its public key,
the service can simply re-advertise itself at a different introduction point. Advertisements can also be done secretly so
that only high-priority clients know the address of Bob's introduction points or so that different clients know of different
introduction points. This forces the attacker to disable all possible introduction points.
Compromise an introduction point. An attacker who controls Bob's introduction point can flood Bob with introduction
requests, or prevent valid introduction requests from reaching
him. Bob can notice a flood, and close the circuit. To notice
blocking of valid requests, however, he should periodically
test the introduction point by sending rendezvous requests
and making sure he receives them.
Compromise a rendezvous point. A rendezvous point is no
more sensitive than any other OR on a circuit, since all data
passing through the rendezvous is encrypted with a session
key shared by Alice and Bob.

Early experiences: Tor in the Wild

As of mid-May 2004, the Tor network consists of 32 nodes (24 in the US, 8 in Europe), and more are joining each week
as the code matures. (For comparison, the current remailer
network has about 40 nodes.) Each node has at least a
768Kb/768Kb connection, and many have 10Mb. The number of users varies (and of course, it's hard to tell for sure), but
we sometimes have several hundred users; administrators at
several companies have begun sending their entire departments' web traffic through Tor, to block other divisions of
their company from reading their traffic. Tor users have reported using the network for web browsing, FTP, IRC, AIM,
Kazaa, SSH, and recipient-anonymous email via rendezvous
points. One user has anonymously set up a Wiki as a hidden
service, where other users anonymously publish the addresses
of their hidden services.
Each Tor node currently processes roughly 800,000 relay
cells (a bit under half a gigabyte) per week. On average, about
80% of each 498-byte payload is full for cells going back to
the client, whereas about 40% is full for cells coming from the
client. (The difference arises because most of the network's
traffic is web browsing.) Interactive traffic like SSH brings
down the average a lot; once we have more experience, and
assuming we can resolve the anonymity issues, we may partition traffic into two relay cell sizes: one to handle bulk traffic
and one for interactive traffic.

Based in part on our restrictive default exit policy (we reject SMTP requests) and our low profile, we have had no
abuse issues since the network was deployed in October 2003.
Our slow growth rate gives us time to add features, resolve
bugs, and get a feel for what users actually want from an
anonymity system. Even though having more users would
bolster our anonymity sets, we are not eager to attract the
Kazaa or warez communitieswe feel that we must build a
reputation for privacy, human rights, research, and other socially laudable activities.
As for performance, profiling shows that Tor spends almost
all its CPU time in AES, which is fast. Current latency is
attributable to two factors. First, network latency is critical:
we are intentionally bouncing traffic around the world several
times. Second, our end-to-end congestion control algorithm
focuses on protecting volunteer servers from accidental DoS
rather than on optimizing performance. To quantify these effects, we did some informal tests using a network of 4 nodes
on the same machine (a heavily loaded 1GHz Athlon). We
downloaded a 60 megabyte file from debian.org every 30
minutes for 54 hours (108 sample points). It arrived in about
300 seconds on average, compared to 210s for a direct download. We ran a similar test on the production Tor network,
fetching the front page of cnn.com (55 kilobytes): while
a direct download consistently took about 0.3s, the performance through Tor varied. Some downloads were as fast as
0.4s, with a median at 2.8s, and 90% finishing within 5.3s. It
seems that as the network expands, the chance of building a
slow circuit (one that includes a slow or heavily loaded node
or link) is increasing. On the other hand, as our users remain
satisfied with this increased latency, we can address our performance incrementally as we proceed with development.
Although Tor's clique topology and full-visibility directories present scaling problems, we still expect the network to
support a few hundred nodes and maybe 10,000 users before
we're forced to become more distributed. With luck, the experience we gain running the current topology will help us
choose among alternatives when the time comes.

Open Questions in Low-latency Anonymity

In addition to the non-goals in Section 3, many questions must be solved before we can be confident of Tor's security.
Many of these open issues are questions of balance. For
example, how often should users rotate to fresh circuits? Frequent rotation is inefficient, expensive, and may lead to intersection attacks and predecessor attacks [54], but infrequent
rotation makes the user's traffic linkable. Besides opening
fresh circuits, clients can also exit from the middle of the circuit, or truncate and re-extend the circuit. More analysis is
needed to determine the proper tradeoff.
How should we choose path lengths? If Alice always uses
two hops, then both ORs can be certain that by colluding they
will learn about Alice and Bob. In our current approach, Alice always chooses at least three nodes unrelated to herself and her destination. Should Alice choose a random path length
(e.g. from a geometric distribution) to foil an attacker who
uses timing to learn that he is the fifth hop and thus concludes
that both Alice and the responder are running ORs?
Throughout this paper, we have assumed that end-to-end
traffic confirmation will immediately and automatically defeat a low-latency anonymity system. Even high-latency
anonymity systems can be vulnerable to end-to-end traffic
confirmation, if the traffic volumes are high enough, and if
users' habits are sufficiently distinct [14, 31]. Can anything
be done to make low-latency systems resist these attacks as
well as high-latency systems? Tor already makes some effort to conceal the starts and ends of streams by wrapping
long-range control commands in identical-looking relay cells.
Link padding could frustrate passive observers who count
packets; long-range padding could work against observers
who own the first hop in a circuit. But more research remains
to find an efficient and practical approach. Volunteers prefer not to run constant-bandwidth padding, but no convincing traffic shaping approach has been specified. Recent work
on long-range padding [33] shows promise. One could also
try to reduce correlation in packet timing by batching and reordering packets, but it is unclear whether this could improve
anonymity without introducing so much latency as to render
the network unusable.
A cascade topology may better defend against traffic confirmation by aggregating users, and making padding and mixing more affordable. Does the hydra topology (many input
nodes, few output nodes) work better against some adversaries? Are we going to get a hydra anyway because most
nodes will be middleman nodes?
Common wisdom suggests that Alice should run her own
OR for best anonymity, because traffic coming from her node
could plausibly have come from elsewhere. How much mixing does this approach need? Is it immediately beneficial
because of real-world adversaries that can't observe Alice's
router, but can run routers of their own?
To scale to many users, and to prevent an attacker from
observing the whole network, it may be necessary to support
far more servers than Tor currently anticipates. This introduces several issues. First, if approval by a central set of directory servers is no longer feasible, what mechanism should
be used to prevent adversaries from signing up many colluding servers? Second, if clients can no longer have a complete
picture of the network, how can they perform discovery while
preventing attackers from manipulating or exploiting gaps in
their knowledge? Third, if there are too many servers for every server to constantly communicate with every other, which
non-clique topology should the network use? (Restricted-route topologies promise comparable anonymity with better
scalability [13], but whatever topology we choose, we need
some way to keep attackers from manipulating their position within it [21].) Fourth, if no central authority is tracking server reliability, how do we stop unreliable servers from making the network unusable? Fifth, do clients receive so
much anonymity from running their own ORs that we should
expect them all to do so [1], or do we need another incentive
structure to motivate them? Tarzan and MorphMix present
possible solutions.
When a Tor node goes down, all its circuits (and thus
streams) must break. Will users abandon the system because of this brittleness? How well does the method in Section 6.1 allow streams to survive node failure? If affected
users rebuild circuits immediately, how much anonymity is
lost? It seems the problem is even worse in a peer-to-peer
environment; such systems don't yet provide an incentive
for peers to stay connected when they're done retrieving content, so we would expect a higher churn rate.

10 Future Directions
Tor brings together many innovations into a unified deployable system. The next immediate steps include:
Scalability: Tor's emphasis on deployability and design
simplicity has led us to adopt a clique topology, semicentralized directories, and a full-network-visibility model
for client knowledge. These properties will not scale past
a few hundred servers. Section 9 describes some promising
approaches, but more deployment experience will be helpful
in learning the relative importance of these bottlenecks.
Bandwidth classes: This paper assumes that all ORs have
good bandwidth and latency. We should instead adopt the
MorphMix model, where nodes advertise their bandwidth
level (DSL, T1, T3), and Alice avoids bottlenecks by choosing nodes that match or exceed her bandwidth. In this way
DSL users can usefully join the Tor network.
Incentives: Volunteers who run nodes are rewarded with
publicity and possibly better anonymity [1]. More nodes
means increased scalability, and more users can mean more
anonymity. We need to continue examining the incentive
structures for participating in Tor. Further, we need to explore more approaches to limiting abuse, and understand why
most people don't bother using privacy systems.
Cover traffic: Currently Tor omits cover traffic; its costs
in performance and bandwidth are clear but its security benefits are not well understood. We must pursue more research
on link-level cover traffic and long-range cover traffic to determine whether some simple padding method offers provable
protection against our chosen adversary.
Caching at exit nodes: Perhaps each exit node should run
a caching web proxy [47], to improve anonymity for cached
pages (Alice's request never leaves the Tor network), to improve speed, and to reduce bandwidth cost. On the other
hand, forward security is weakened because caches constitute a record of retrieved files. We must find the right balance
between usability and security.

Better directory distribution: Clients currently download a description of the entire network every 15 minutes. As the
state grows larger and clients more numerous, we may need
a solution in which clients receive incremental updates to directory state. More generally, we must find more scalable yet
practical ways to distribute up-to-date snapshots of network
status without introducing new attacks.
Further specification review: Our public byte-level specification [20] needs external review. We hope that as Tor is
deployed, more people will examine its specification.
Multisystem interoperability: We are currently working
with the designer of MorphMix to unify the specification and
implementation of the common elements of our two systems.
So far, this seems to be relatively straightforward. Interoperability will allow testing and direct comparison of the two
designs for trust and scalability.
Wider-scale deployment: The original goal of Tor was to
gain experience in deploying an anonymizing overlay network, and learn from having actual users. We are now at a
point in design and development where we can start deploying a wider network. Once we have many actual users, we
will doubtlessly be better able to evaluate some of our design
decisions, including our robustness/latency tradeoffs, our performance tradeoffs (including cell size), our abuse-prevention
mechanisms, and our overall usability.

Acknowledgments
We thank Peter Palfrader, Geoff Goodell, Adam Shostack,
Joseph Sokol-Margolis, John Bashinski, and Zack Brown for
editing and comments; Matej Pfajfar, Andrei Serjantov, Marc
Rennhard for design discussions; Bram Cohen for congestion
control discussions; Adam Back for suggesting telescoping
circuits; and Cathy Meadows for formal analysis of the extend protocol. This work has been supported by ONR and
DARPA.

References
[1] A. Acquisti, R. Dingledine, and P. Syverson. On the economics of anonymity. In R. N. Wright, editor, Financial Cryptography. Springer-Verlag, LNCS 2742, 2003.
[2] R. Anderson. The eternity service. In Pragocrypt 96, 1996.
[3] The Anonymizer. <http://anonymizer.com/>.
[4] A. Back, I. Goldberg, and A. Shostack. Freedom systems 2.1
security issues and analysis. White paper, Zero Knowledge
Systems, Inc., May 2001.
[5] A. Back, U. Moller, and A. Stiglic. Traffic analysis attacks and trade-offs in anonymity providing systems. In I. S.
Moskowitz, editor, Information Hiding (IH 2001), pages 245-257. Springer-Verlag, LNCS 2137, 2001.
[6] M. Bellare, P. Rogaway, and D. Wagner. The EAX mode of
operation: A two-pass authenticated-encryption scheme optimized for simplicity and efficiency. In Fast Software Encryption 2004, February 2004.

[7] O. Berthold, H. Federrath, and S. Kopsell. Web MIXes: A


system for anonymous and unobservable Internet access. In
H. Federrath, editor, Designing Privacy Enhancing Technologies: Workshop on Design Issue in Anonymity and Unobservability. Springer-Verlag, LNCS 2009, 2000.
[8] P. Boucher, A. Shostack, and I. Goldberg. Freedom systems
2.0 architecture. White paper, Zero Knowledge Systems, Inc.,
December 2000.
[9] Z. Brown. Cebolla: Pragmatic IP Anonymity. In Ottawa Linux
Symposium, June 2002.
[10] D. Chaum. Untraceable electronic mail, return addresses,
and digital pseudo-nyms. Communications of the ACM, 4(2),
February 1981.
[11] F. Dabek, M. F. Kaashoek, D. Karger, R. Morris, and I. Stoica.
Wide-area cooperative storage with CFS. In 18th ACM Symposium on Operating Systems Principles (SOSP 01), Chateau
Lake Louise, Banff, Canada, October 2001.
[12] W. Dai. Pipenet 1.1. Usenet post, August 1996. <http:
//www.eskimo.com/weidai/pipenet.txt> First
mentioned in a post to the cypherpunks list, Feb. 1995.
[13] G. Danezis. Mix-networks with restricted routes. In R. Dingledine, editor, Privacy Enhancing Technologies (PET 2003).
Springer-Verlag LNCS 2760, 2003.
[14] G. Danezis. Statistical disclosure attacks. In Security and
Privacy in the Age of Uncertainty (SEC2003), pages 421-426,
Athens, May 2003. IFIP TC11, Kluwer.
[15] G. Danezis, R. Dingledine, and N. Mathewson. Mixminion:
Design of a type III anonymous remailer protocol. In 2003
IEEE Symposium on Security and Privacy, pages 2-15. IEEE
CS, May 2003.
[16] D. Dean and A. Stubblefield. Using Client Puzzles to Protect
TLS. In Proceedings of the 10th USENIX Security Symposium.
USENIX, Aug. 2001.
[17] T. Dierks and C. Allen. The TLS Protocol Version 1.0.
IETF RFC 2246, January 1999.
[18] R. Dingledine, M. J. Freedman, D. Hopwood, and D. Molnar.
A Reputation System to Increase MIX-net Reliability. In I. S.
Moskowitz, editor, Information Hiding (IH 2001), pages 126-141. Springer-Verlag, LNCS 2137, 2001.
[19] R. Dingledine, M. J. Freedman, and D. Molnar. The free
haven project: Distributed anonymous storage service. In
H. Federrath, editor, Designing Privacy Enhancing Technologies: Workshop on Design Issue in Anonymity and Unobservability. Springer-Verlag, LNCS 2009, July 2000.
[20] R. Dingledine and N. Mathewson. Tor protocol specifications.
<http://freehaven.net/tor/tor-spec.txt>.
[21] R. Dingledine and P. Syverson. Reliable MIX Cascade Networks through Reputation. In M. Blaze, editor, Financial
Cryptography. Springer-Verlag, LNCS 2357, 2002.
[22] J. Douceur. The Sybil Attack. In Proceedings of the 1st International Peer To Peer Systems Workshop (IPTPS), Mar. 2002.
[23] H. Federrath, A. Jerichow, and A. Pfitzmann. MIXes in mobile communication systems: Location management with privacy. In R. Anderson, editor, Information Hiding, First International Workshop, pages 121135. Springer-Verlag, LNCS
1174, May 1996.
[24] M. J. Freedman and R. Morris. Tarzan: A peer-to-peer
anonymizing network layer. In 9th ACM Conference on Computer and Communications Security (CCS 2002), Washington,
DC, November 2002.

[25] S. Goel, M. Robson, M. Polte, and E. G. Sirer. Herbivore: A


scalable and efficient protocol for anonymous communication.
Technical Report TR2003-1890, Cornell University Computing and Information Science, February 2003.
[26] I. Goldberg. A Pseudonymous Communications Infrastructure
for the Internet. PhD thesis, UC Berkeley, Dec 2000.
[27] D. M. Goldschlag, M. G. Reed, and P. F. Syverson. Hiding
routing information. In R. Anderson, editor, Information Hiding, First International Workshop, pages 137150. SpringerVerlag, LNCS 1174, May 1996.
[28] C. Gulcu and G. Tsudik. Mixing E-mail with Babel. In Network and Distributed Security Symposium (NDSS 96), pages
2-16. IEEE, February 1996.
[29] A. Hintz. Fingerprinting websites using traffic analysis. In
R. Dingledine and P. Syverson, editors, Privacy Enhancing
Technologies (PET 2002), pages 171-178. Springer-Verlag,
LNCS 2482, 2002.
[30] A. Jerichow, J. Muller, A. Pfitzmann, B. Pfitzmann, and
M. Waidner.
Real-time mixes: A bandwidth-efficient
anonymity protocol. IEEE Journal on Selected Areas in Communications, 16(4):495-509, May 1998.
[31] D. Kesdogan, D. Agrawal, and S. Penz. Limits of anonymity
in open environments. In F. Petitcolas, editor, Information
Hiding Workshop (IH 2002). Springer-Verlag, LNCS 2578,
October 2002.
[32] D. Koblas and M. R. Koblas. SOCKS. In UNIX Security III
Symposium (1992 USENIX Security Symposium), pages 77-83. USENIX, 1992.
[33] B. N. Levine, M. K. Reiter, C. Wang, and M. Wright. Timing
analysis in low-latency mix-based systems. In A. Juels, editor, Financial Cryptography. Springer-Verlag, LNCS (forthcoming), 2004.
[34] B. N. Levine and C. Shields. Hordes: A multicast-based protocol for anonymity. Journal of Computer Security, 10(3):213-240, 2002.
[35] C. Meadows. The NRL protocol analyzer: An overview. Journal of Logic Programming, 26(2):113-131, 1996.
[36] U. Moller, L. Cottrell, P. Palfrader, and L. Sassaman. Mixmaster Protocol Version 2. Draft, July 2003. <http:
//www.abditum.com/mixmaster-spec.txt>.
[37] V. S. Pai, L. Wang, K. Park, R. Pang, and L. Peterson. The
Dark Side of the Web: An Open Proxy's View.
<http://codeen.cs.princeton.edu/>.
[38] A. Pfitzmann, B. Pfitzmann, and M. Waidner. ISDN-mixes:
Untraceable communication with very small bandwidth overhead. In GI/ITG Conference on Communication in Distributed
Systems, pages 451-463, February 1991.
[39] Privoxy. <http://www.privoxy.org/>.
[40] M. G. Reed, P. F. Syverson, and D. M. Goldschlag. Protocols using anonymous connections: Mobile applications. In
B. Christianson, B. Crispo, M. Lomas, and M. Roe, editors,
Security Protocols: 5th International Workshop, pages 13-23.
Springer-Verlag, LNCS 1361, April 1997.
[41] M. G. Reed, P. F. Syverson, and D. M. Goldschlag. Anonymous connections and onion routing. IEEE Journal on Selected Areas in Communications, 16(4):482-494, May 1998.
[42] M. K. Reiter and A. D. Rubin. Crowds: Anonymity for web
transactions. ACM TISSEC, 1(1):66-92, June 1998.
[43] M. Rennhard and B. Plattner. Practical anonymity for the
masses with morphmix. In A. Juels, editor, Financial Cryptography. Springer-Verlag, LNCS (forthcoming), 2004.

[44] M. Rennhard, S. Rafaeli, L. Mathy, B. Plattner, and D. Hutchison. Analysis of an Anonymity Network for Web Browsing.
In IEEE 7th Intl. Workshop on Enterprise Security (WET ICE
2002), Pittsburgh, USA, June 2002.
[45] A. Serjantov and P. Sewell. Passive attack analysis for
connection-based anonymity systems. In Computer Security
ESORICS 2003. Springer-Verlag, LNCS 2808, October 2003.
[46] R. Sherwood, B. Bhattacharjee, and A. Srinivasan. P5: A protocol for scalable anonymous communication. In IEEE Symposium on Security and Privacy, pages 58-70. IEEE CS, 2002.
[47] A. Shubina and S. Smith. Using caching for browsing
anonymity. ACM SIGEcom Exchanges, 4(2), Sept 2003.
[48] P. Syverson, M. Reed, and D. Goldschlag. Onion Routing
access configurations. In DARPA Information Survivability
Conference and Exposition (DISCEX 2000), volume 1, pages
34-40. IEEE CS Press, 2000.
[49] P. Syverson, G. Tsudik, M. Reed, and C. Landwehr. Towards
an Analysis of Onion Routing Security. In H. Federrath, editor, Designing Privacy Enhancing Technologies: Workshop
on Design Issue in Anonymity and Unobservability, pages 96-114. Springer-Verlag, LNCS 2009, July 2000.
[50] A. Tannenbaum. Computer networks, 1996.
[51] The AN.ON Project.
German police proceeds against
anonymity service.
Press release, September 2003.
<http://www.datenschutzzentrum.de/
material/themen/presse/anon-bka_e.htm>.
[52] M. Waldman and D. Mazières. Tangler: A censorship-resistant publishing system based on document entanglements. In 8th ACM Conference on Computer and Communications Security (CCS-8), pages 86-135. ACM Press, 2001.
[53] M. Waldman, A. Rubin, and L. Cranor. Publius: A robust,
tamper-evident, censorship-resistant and source-anonymous
web publishing system. In Proc. 9th USENIX Security Symposium, pages 59-72, August 2000.
[54] M. Wright, M. Adler, B. N. Levine, and C. Shields. Defending
anonymous communication against passive logging attacks. In
IEEE Symposium on Security and Privacy, pages 28-41. IEEE
CS, May 2003.

ZooKeeper: Wait-free coordination for Internet-scale systems

Patrick Hunt and Mahadev Konar
Yahoo! Grid
{phunt,mahadev}@yahoo-inc.com

Flavio P. Junqueira and Benjamin Reed
Yahoo! Research
{fpj,breed}@yahoo-inc.com

Abstract

In this paper, we describe ZooKeeper, a service for coordinating processes of distributed applications. Since ZooKeeper is part of critical infrastructure, ZooKeeper aims to provide a simple and high performance kernel for building more complex coordination primitives at the client. It incorporates elements from group messaging, shared registers, and distributed lock services in a replicated, centralized service. The interface exposed by ZooKeeper has the wait-free aspects of shared registers with an event-driven mechanism similar to cache invalidations of distributed file systems to provide a simple, yet powerful coordination service.
The ZooKeeper interface enables a high-performance service implementation. In addition to the wait-free property, ZooKeeper provides a per client guarantee of FIFO execution of requests and linearizability for all requests that change the ZooKeeper state. These design decisions enable the implementation of a high performance processing pipeline with read requests being satisfied by local servers. We show for the target workloads, 2:1 to 100:1 read to write ratio, that ZooKeeper can handle tens to hundreds of thousands of transactions per second. This performance allows ZooKeeper to be used extensively by client applications.

Introduction

Large-scale distributed applications require different forms of coordination. Configuration is one of the most basic forms of coordination. In its simplest form, configuration is just a list of operational parameters for the system processes, whereas more sophisticated systems have dynamic configuration parameters. Group membership and leader election are also common in distributed systems: often processes need to know which other processes are alive and what those processes are in charge of. Locks constitute a powerful coordination primitive that implement mutually exclusive access to critical resources.
One approach to coordination is to develop services for each of the different coordination needs. For example, Amazon Simple Queue Service [3] focuses specifically on queuing. Other services have been developed specifically for leader election [25] and configuration [27]. Services that implement more powerful primitives can be used to implement less powerful ones. For example, Chubby [6] is a locking service with strong synchronization guarantees. Locks can then be used to implement leader election, group membership, etc.
When designing our coordination service, we moved away from implementing specific primitives on the server side, and instead we opted for exposing an API that enables application developers to implement their own primitives. Such a choice led to the implementation of a coordination kernel that enables new primitives without requiring changes to the service core. This approach enables multiple forms of coordination adapted to the requirements of applications, instead of constraining developers to a fixed set of primitives.
When designing the API of ZooKeeper, we moved away from blocking primitives, such as locks. Blocking primitives for a coordination service can cause, among other problems, slow or faulty clients to impact negatively the performance of faster clients. The implementation of the service itself becomes more complicated if processing requests depends on responses and failure detection of other clients. Our system, ZooKeeper, hence implements an API that manipulates simple wait-free data objects organized hierarchically as in file systems. In fact, the ZooKeeper API resembles the one of any other file system, and looking at just the API signatures, ZooKeeper seems to be Chubby without the lock methods, open, and close. Implementing wait-free data objects, however, differentiates ZooKeeper significantly from systems based on blocking primitives such as locks.
Although the wait-free property is important for performance and fault tolerance, it is not sufficient for coordination. We have also to provide order guarantees for
operations. In particular, we have found that guaranteeing both FIFO client ordering of all operations and linearizable writes enables an efficient implementation of
the service and it is sufficient to implement coordination
primitives of interest to our applications. In fact, we can
implement consensus for any number of processes with
our API, and according to the hierarchy of Herlihy, ZooKeeper implements a universal object [14].
The ZooKeeper service comprises an ensemble of
servers that use replication to achieve high availability
and performance. Its high performance enables applications comprising a large number of processes to use
such a coordination kernel to manage all aspects of coordination. We were able to implement ZooKeeper using a simple pipelined architecture that allows us to have
hundreds or thousands of requests outstanding while still
achieving low latency. Such a pipeline naturally enables
the execution of operations from a single client in FIFO
order. Guaranteeing FIFO client order enables clients to
submit operations asynchronously. With asynchronous
operations, a client is able to have multiple outstanding
operations at a time. This feature is desirable, for example, when a new client becomes a leader and it has to manipulate metadata and update it accordingly. Without the
possibility of multiple outstanding operations, the time
of initialization can be of the order of seconds instead of
sub-second.
To guarantee that update operations satisfy linearizability, we implement a leader-based atomic broadcast
protocol [23], called Zab [24]. A typical workload
of a ZooKeeper application, however, is dominated by
read operations and it becomes desirable to scale read
throughput. In ZooKeeper, servers process read operations locally, and we do not use Zab to totally order them.
Caching data on the client side is an important technique to increase the performance of reads. For example,
it is useful for a process to cache the identifier of the
current leader instead of probing ZooKeeper every time
it needs to know the leader. ZooKeeper uses a watch
mechanism to enable clients to cache data without managing the client cache directly. With this mechanism,
a client can watch for an update to a given data object,
and receive a notification upon an update. Chubby manages the client cache directly. It blocks updates to invalidate the caches of all clients caching the data being
changed. Under this design, if any of these clients is
slow or faulty, the update is delayed. Chubby uses leases
to prevent a faulty client from blocking the system indefinitely. Leases, however, only bound the impact of slow
or faulty clients, whereas ZooKeeper watches avoid the
problem altogether.
In this paper we discuss our design and implementation of ZooKeeper. With ZooKeeper, we are able to implement all coordination primitives that our applications
require, even though only writes are linearizable. To validate our approach we show how we implement some
coordination primitives with ZooKeeper.
To summarize, in this paper our main contributions are:
Coordination kernel: We propose a wait-free coordination service with relaxed consistency guarantees
for use in distributed systems. In particular, we describe our design and implementation of a coordination kernel, which we have used in many critical applications to implement various coordination
techniques.
Coordination recipes: We show how ZooKeeper can
be used to build higher level coordination primitives, even blocking and strongly consistent primitives, that are often used in distributed applications.
Experience with Coordination: We share some of the
ways that we use ZooKeeper and evaluate its performance.

The ZooKeeper service

Clients submit requests to ZooKeeper through a client API using a ZooKeeper client library. In addition to exposing the ZooKeeper service interface through the client
API, the client library also manages the network connections between the client and ZooKeeper servers.
In this section, we first provide a high-level view of the
ZooKeeper service. We then discuss the API that clients
use to interact with ZooKeeper.
Terminology. In this paper, we use client to denote a
user of the ZooKeeper service, server to denote a process
providing the ZooKeeper service, and znode to denote
an in-memory data node in the ZooKeeper data, which
is organized in a hierarchical namespace referred to as
the data tree. We also use the terms update and write to
refer to any operation that modifies the state of the data
tree. Clients establish a session when they connect to
ZooKeeper and obtain a session handle through which
they issue requests.

2.1 Service overview

ZooKeeper provides to its clients the abstraction of a set of data nodes (znodes), organized according to a hierarchical name space. The znodes in this hierarchy are data
objects that clients manipulate through the ZooKeeper
API. Hierarchical name spaces are commonly used in file
systems. It is a desirable way of organizing data objects,
since users are used to this abstraction and it enables better organization of application meta-data. To refer to a given znode, we use the standard UNIX notation for file system paths. For example, we use /A/B/C to denote
the path to znode C, where C has B as its parent and B
has A as its parent. All znodes store data, and all znodes,
except for ephemeral znodes, can have children.

There are two types of znodes that a client can create:
Regular: Clients manipulate regular znodes by creating
and deleting them explicitly;
Ephemeral: Clients create such znodes, and they either delete them explicitly, or let the system remove
them automatically when the session that creates
them terminates (deliberately or due to a failure).
Additionally, when creating a new znode, a client can
set a sequential flag. Nodes created with the sequential flag set have the value of a monotonically increasing counter appended to its name. If n is the new znode
and p is the parent znode, then the sequence value of n
is never smaller than the value in the name of any other
sequential znode ever created under p.
ZooKeeper implements watches to allow clients to
receive timely notifications of changes without requiring polling. When a client issues a read operation
with a watch flag set, the operation completes as normal except that the server promises to notify the client
when the information returned has changed. Watches
are one-time triggers associated with a session; they
are unregistered once triggered or the session closes.
Watches indicate that a change has happened, but do
not provide the change. For example, if a client issues a getData(/foo, true) before /foo
is changed twice, the client will get one watch event
telling the client that data for /foo has changed. Session events, such as connection loss events, are also sent
to watch callbacks so that clients know that watch events
may be delayed.

Data model. The data model of ZooKeeper is essentially a file system with a simplified API and only full data reads and writes, or a key/value table with hierarchical keys. The hierarchical namespace is useful for allocating subtrees for the namespace of different applications and for setting access rights to those subtrees. We also exploit the concept of directories on the client side to build higher level primitives as we will see in section 2.4.
Unlike files in file systems, znodes are not designed for general data storage. Instead, znodes map to abstractions of the client application, typically corresponding to meta-data used for coordination purposes. To illustrate, in Figure 1 we have two subtrees, one for Application 1 (/app1) and another for Application 2 (/app2). The subtree for Application 1 implements a simple group membership protocol: each client process pi creates a znode p_i under /app1, which persists as long as the process is running.

[Figure 1: Illustration of ZooKeeper hierarchical name space, showing /app1 with children /app1/p_1, /app1/p_2, /app1/p_3, and a second subtree /app2.]

Although znodes have not been designed for general data storage, ZooKeeper does allow clients to store some information that can be used for meta-data or configuration in a distributed computation. For example, in a leader-based application, it is useful for an application server that is just starting to learn which other server is currently the leader. To accomplish this goal, we can have the current leader write this information in a known location in the znode space. Znodes also have associated meta-data with time stamps and version counters, which allow clients to track changes to znodes and execute conditional updates based on the version of the znode.

Sessions. A client connects to ZooKeeper and initiates


a session. Sessions have an associated timeout. ZooKeeper considers a client faulty if it does not receive anything from its session for more than that timeout. A session ends when clients explicitly close a session handle
or ZooKeeper detects that a client is faulty. Within a session, a client observes a succession of state changes that
reflect the execution of its operations. Sessions enable a
client to move transparently from one server to another
within a ZooKeeper ensemble, and hence persist across
ZooKeeper servers.

2.2 Client API

We present below a relevant subset of the ZooKeeper API, and discuss the semantics of each request.
create(path, data, flags): Creates a znode
with path name path, stores data[] in it, and
returns the name of the new znode. flags enables a client to select the type of znode: regular,
ephemeral, and set the sequential flag;
delete(path, version): Deletes the znode
path if that znode is at the expected version;
exists(path, watch): Returns true if the znode
with path name path exists, and returns false otherwise. The watch flag enables a client to set a watch on the znode;


getData(path, watch): Returns the data and
meta-data, such as version information, associated
with the znode. The watch flag works in the same
way as it does for exists(), except that ZooKeeper does not set the watch if the znode does not
exist;
setData(path, data, version): Writes
data[] to znode path if the version number is
the current version of the znode;
getChildren(path, watch): Returns the set of
names of the children of a znode;
sync(path): Waits for all updates pending at the start
of the operation to propagate to the server that the
client is connected to. The path is currently ignored.
All methods have both a synchronous and an asynchronous version available through the API. An application uses the synchronous API when it needs to execute
a single ZooKeeper operation and it has no concurrent
tasks to execute, so it makes the necessary ZooKeeper
call and blocks. The asynchronous API, however, enables an application to have both multiple outstanding
ZooKeeper operations and other tasks executed in parallel. The ZooKeeper client guarantees that the corresponding callbacks for each operation are invoked in order.
Note that ZooKeeper does not use handles to access
znodes. Each request instead includes the full path of
the znode being operated on. Not only does this choice
simplify the API (no open() or close() methods),
but it also eliminates extra state that the server would
need to maintain.
Each of the update methods take an expected version number, which enables the implementation of conditional updates. If the actual version number of the znode does not match the expected version number the update fails with an unexpected version error. If the version
number is -1, it does not perform version checking.
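To make the request semantics concrete, here is a minimal sketch (not from the paper) that exercises these calls through the Apache ZooKeeper Java client; the connect string, paths, and data are placeholders, and error handling is pared down to the version-conflict case.

import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class ApiSketch {
    public static void main(String[] args) throws Exception {
        // Connect; the watcher passed here receives session events.
        ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 15000, event -> {});

        // create: a regular znode holding some data.
        zk.create("/app1", "config-v1".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // getData with a watch: we are notified once if /app1 later changes.
        Stat stat = new Stat();
        byte[] data = zk.getData("/app1",
                event -> System.out.println("watch fired: " + event.getType()), stat);

        // setData with the expected version: a conditional update.
        try {
            zk.setData("/app1", "config-v2".getBytes(), stat.getVersion());
        } catch (KeeperException.BadVersionException e) {
            // Someone else updated the znode first; re-read and retry if needed.
        }

        // Version -1 skips the version check entirely.
        zk.delete("/app1", -1);
        zk.close();
    }
}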

2.3 ZooKeeper guarantees

ZooKeeper has two basic ordering guarantees:
Linearizable writes: all requests that update the state of ZooKeeper are serializable and respect precedence;
FIFO client order: all requests from a given client are executed in the order that they were sent by the client.
Note that our definition of linearizability is different from the one originally proposed by Herlihy [15], and we call it A-linearizability (asynchronous linearizability). In the original definition of linearizability by Herlihy, a client is only able to have one outstanding operation at a time (a client is one thread). In ours, we allow a
client to have multiple outstanding operations, and consequently we can choose to guarantee no specific order
for outstanding operations of the same client or to guarantee FIFO order. We choose the latter for our property.
It is important to observe that all results that hold for
linearizable objects also hold for A-linearizable objects
because a system that satisfies A-linearizability also satisfies linearizability. Because only update requests are Alinearizable, ZooKeeper processes read requests locally
at each replica. This allows the service to scale linearly
as servers are added to the system.
To see how these two guarantees interact, consider the
following scenario. A system comprising a number of
processes elects a leader to command worker processes.
When a new leader takes charge of the system, it must
change a large number of configuration parameters and
notify the other processes once it finishes. We then have
two important requirements:
As the new leader starts making changes, we do not
want other processes to start using the configuration
that is being changed;
If the new leader dies before the configuration has
been fully updated, we do not want the processes to
use this partial configuration.
Observe that distributed locks, such as the locks provided by Chubby, would help with the first requirement
but are insufficient for the second. With ZooKeeper,
the new leader can designate a path as the ready znode;
other processes will only use the configuration when that
znode exists. The new leader makes the configuration
change by deleting ready, updating the various configuration znodes, and creating ready. All of these changes
can be pipelined and issued asynchronously to quickly
update the configuration state. Although the latency of a
change operation is of the order of 2 milliseconds, a new
leader that must update 5000 different znodes will take
10 seconds if the requests are issued one after the other;
by issuing the requests asynchronously the requests will
take less than a second. Because of the ordering guarantees, if a process sees the ready znode, it must also see
all the configuration changes made by the new leader. If
the new leader dies before the ready znode is created, the
other processes know that the configuration has not been
finalized and do not use it.
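As an illustration only, the following sketch shows how such a leader might pipeline the update with the asynchronous Java client API; the paths (/app/ready and the configuration znodes) are hypothetical, and the callbacks ignore return codes that real code would check.

import org.apache.zookeeper.*;
import org.apache.zookeeper.AsyncCallback.StatCallback;
import org.apache.zookeeper.AsyncCallback.StringCallback;
import org.apache.zookeeper.AsyncCallback.VoidCallback;
import java.util.List;

public class LeaderConfigUpdate {
    // Pushes a new configuration and only then re-creates the ready znode.
    static void publishConfig(ZooKeeper zk, List<String> configPaths,
                              byte[] newValue) {
        VoidCallback ignoreDelete = (rc, path, ctx) -> {};
        StatCallback ignoreSet = (rc, path, ctx, stat) -> {};
        StringCallback ignoreCreate = (rc, path, ctx, name) -> {};

        // 1. Delete ready so that readers stop trusting the old configuration.
        zk.delete("/app/ready", -1, ignoreDelete, null);

        // 2. Issue all configuration updates asynchronously; FIFO client order
        //    guarantees they are applied before anything submitted later.
        for (String p : configPaths) {
            zk.setData(p, newValue, -1, ignoreSet, null);
        }

        // 3. Re-create ready last; a reader that sees it also sees the updates.
        zk.create("/app/ready", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
                  CreateMode.PERSISTENT, ignoreCreate, null);
    }
}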
The above scheme still has a problem: what happens
if a process sees that ready exists before the new leader
starts to make a change and then starts reading the configuration while the change is in progress. This problem
is solved by the ordering guarantee for the notifications:
if a client is watching for a change, the client will see
the notification event before it sees the new state of the
system after the change is made. Consequently, if the
process that reads the ready znode requests to be notified
of changes to that znode, it will see a notification informing the client of the change before it can read any of the
new configuration.
Another problem can arise when clients have their own
communication channels in addition to ZooKeeper. For
example, consider two clients A and B that have a shared
configuration in ZooKeeper and communicate through a
shared communication channel. If A changes the shared
configuration in ZooKeeper and tells B of the change
through the shared communication channel, B would expect to see the change when it re-reads the configuration.
If B's ZooKeeper replica is slightly behind A's, it may
not see the new configuration. Using the above guarantees B can make sure that it sees the most up-to-date
information by issuing a write before re-reading the configuration. To handle this scenario more efficiently ZooKeeper provides the sync request: when followed by
a read, constitutes a slow read. sync causes a server
to apply all pending write requests before processing the
read without the overhead of a full write. This primitive
is similar in idea to the flush primitive of ISIS [5].
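A minimal sketch of this sync-then-read pattern with the Java client, under the assumption that B has already been told out of band that the configuration changed; the path is a placeholder.

import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;
import java.util.concurrent.CountDownLatch;

public class SyncThenRead {
    // B learned out-of-band that A changed the shared config; make sure B's
    // replica has applied all pending writes before reading.
    static byte[] readLatest(ZooKeeper zk, String path) throws Exception {
        CountDownLatch synced = new CountDownLatch(1);
        zk.sync(path, (rc, p, ctx) -> synced.countDown(), null);
        synced.await();                       // wait for the sync to complete
        return zk.getData(path, false, new Stat());
    }
}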
ZooKeeper also has the following two liveness and
durability guarantees: if a majority of ZooKeeper servers
are active and communicating the service will be available; and if the ZooKeeper service responds successfully
to a change request, that change persists across any number of failures as long as a quorum of servers is eventually able to recover.

2.4 Examples of primitives

In this section, we show how to use the ZooKeeper API to implement more powerful primitives. The ZooKeeper
service knows nothing about these more powerful primitives since they are entirely implemented at the client using the ZooKeeper client API. Some common primitives
such as group membership and configuration management are also wait-free. For others, such as rendezvous,
clients need to wait for an event. Even though ZooKeeper
is wait-free, we can implement efficient blocking primitives with ZooKeeper. ZooKeepers ordering guarantees
allow efficient reasoning about system state, and watches
allow for efficient waiting.

Configuration Management ZooKeeper can be used to implement dynamic configuration in a distributed application. In its simplest form configuration is stored in a znode, zc. Processes start up with the full pathname of zc. Starting processes obtain their configuration by reading zc with the watch flag set to true. If the configuration in zc is ever updated, the processes are notified and read the new configuration, again setting the watch flag to true.
Note that in this scheme, as in most others that use watches, watches are used to make sure that a process has the most recent information. For example, if a process watching zc is notified of a change to zc and before it can issue a read for zc there are three more changes to zc, the process does not receive three more notification events. This does not affect the behavior of the process, since those three events would have simply notified the process of something it already knows: the information it has for zc is stale.
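A possible client-side rendering of this recipe in Java is sketched below; zc is whatever configuration path the application chooses, and reconnection and error handling are omitted.

import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class ConfigWatcher implements Watcher {
    private final ZooKeeper zk;
    private final String zc;               // e.g. "/app/config"
    private volatile byte[] current;

    ConfigWatcher(ZooKeeper zk, String zc) throws Exception {
        this.zk = zk;
        this.zc = zc;
        read();                            // initial read sets the first watch
    }

    private void read() throws Exception {
        // Passing 'this' as the watcher re-arms the one-time watch on each read.
        current = zk.getData(zc, this, new Stat());
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged) {
            try {
                read();                    // fetch the new configuration
            } catch (Exception e) { /* reconnect / retry elsewhere */ }
        }
    }

    byte[] config() { return current; }
}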

Rendezvous Sometimes in distributed systems, it is not always clear a priori what the final system configuration will look like. For example, a client may want to start a master process and several worker processes, but the starting of processes is done by a scheduler, so the client does not know ahead of time information such as addresses and ports that it can give the worker processes to connect to the master. We handle this scenario with ZooKeeper using a rendezvous znode, zr, which is a node created by the client. The client passes the full pathname of zr as a startup parameter of the master and worker processes. When the master starts it fills in zr with information about addresses and ports it is using. When workers start, they read zr with watch set to true. If zr has not been filled in yet, the worker waits to be notified when zr is updated. If zr is an ephemeral node, master and worker processes can watch for zr to be deleted and clean themselves up when the client ends.

Group Membership We take advantage of ephemeral nodes to implement group membership. Specifically, we use the fact that ephemeral nodes allow us to see the state of the session that created the node. We start by designating a znode, zg, to represent the group. When a process member of the group starts, it creates an ephemeral child znode under zg. If each process has a unique name or identifier, then that name is used as the name of the child znode; otherwise, the process creates the znode with the SEQUENTIAL flag to obtain a unique name assignment. Processes may put process information in the data of the child znode, addresses and ports used by the process, for example.
After the child znode is created under zg the process starts normally. It does not need to do anything else. If the process fails or ends, the znode that represents it under zg is automatically removed.
Processes can obtain group information by simply listing the children of zg. If a process wants to monitor changes in group membership, the process can set the watch flag to true and refresh the group information (always setting the watch flag to true) when change notifications are received.
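The recipe maps onto the client API roughly as follows; this sketch assumes a group znode zg already exists and uses sequential ephemeral names for members without unique identifiers.

import org.apache.zookeeper.*;
import java.util.List;

public class GroupMember {
    // Join the group: an ephemeral, sequentially named child of zg.
    static String join(ZooKeeper zk, String zg, byte[] info) throws Exception {
        return zk.create(zg + "/member-", info,
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    }

    // List members and leave a watch so we hear about membership changes.
    static List<String> members(ZooKeeper zk, String zg, Watcher w)
            throws Exception {
        return zk.getChildren(zg, w);
    }
}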

Simple Locks Although ZooKeeper is not a lock service, it can be used to implement locks. Applications
using ZooKeeper usually use synchronization primitives
tailored to their needs, such as those shown above. Here
we show how to implement locks with ZooKeeper to
show that it can implement a wide variety of general synchronization primitives.
The simplest lock implementation uses lock files.
The lock is represented by a znode. To acquire a lock,
a client tries to create the designated znode with the
EPHEMERAL flag. If the create succeeds, the client
holds the lock. Otherwise, the client can read the znode with the watch flag set to be notified if the current
leader dies. A client releases the lock when it dies or explicitly deletes the znode. Other clients that are waiting
for a lock try again to acquire a lock once they observe
the znode being deleted.
While this simple locking protocol works, it does have
some problems. First, it suffers from the herd effect. If
there are many clients waiting to acquire a lock, they will
all vie for the lock when it is released even though only
one client can acquire the lock. Second, it only implements exclusive locking. The following two primitives
show how both of these problems can be overcome.
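A minimal sketch of this lock-file scheme using the Java client is shown below; the lock path is a placeholder, and note that every waiter watches the same znode, which is exactly the herd effect discussed next.

import org.apache.zookeeper.*;
import java.util.concurrent.CountDownLatch;

public class SimpleLock {
    // Try to grab the lock znode; on failure, wait for it to disappear and retry.
    static void lock(ZooKeeper zk, String path) throws Exception {
        while (true) {
            try {
                zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
                          CreateMode.EPHEMERAL);
                return;                               // we hold the lock
            } catch (KeeperException.NodeExistsException e) {
                CountDownLatch gone = new CountDownLatch(1);
                // exists() returns null if the holder vanished between our
                // create attempt and this call, so just retry in that case.
                if (zk.exists(path, ev -> gone.countDown()) != null) {
                    gone.await();                     // all waiters wake up: herd effect
                }
            }
        }
    }

    static void unlock(ZooKeeper zk, String path) throws Exception {
        zk.delete(path, -1);
    }
}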

Simple Locks without Herd Effect We define a lock znode l to implement such locks. Intuitively we line up all the clients requesting the lock and each client obtains the lock in order of request arrival. Thus, clients wishing to obtain the lock do the following:

Lock
1 n = create(l + "/lock-", EPHEMERAL|SEQUENTIAL)
2 C = getChildren(l, false)
3 if n is lowest znode in C, exit
4 p = znode in C ordered just before n
5 if exists(p, true) wait for watch event
6 goto 2

Unlock
1 delete(n)

The use of the SEQUENTIAL flag in line 1 of Lock orders the client's attempt to acquire the lock with respect to all other attempts. If the client's znode has the lowest sequence number at line 3, the client holds the lock. Otherwise, the client waits for deletion of the znode that either has the lock or will receive the lock before this client's znode. By only watching the znode that precedes the client's znode, we avoid the herd effect by only waking up one process when a lock is released or a lock request is abandoned. Once the znode being watched by the client goes away, the client must check if it now holds the lock. (The previous lock request may have been abandoned and there is a znode with a lower sequence number still waiting for or holding the lock.)

Releasing a lock is as simple as deleting the znode n that represents the lock request. By using the EPHEMERAL flag on creation, processes that crash will automatically clean up any lock requests or release any locks that they may have.

In summary, this locking scheme has the following advantages:

1. The removal of a znode only causes one client to wake up, since each znode is watched by exactly one other client, so we do not have the herd effect;

2. There is no polling or timeouts;

3. Because of the way we have implemented locking, we can see by browsing the ZooKeeper data the amount of lock contention, break locks, and debug locking problems.

Read/Write Locks To implement read/write locks we change the lock procedure slightly and have separate read lock and write lock procedures. The unlock procedure is the same as the global lock case.

Write Lock
1 n = create(l + "/write-", EPHEMERAL|SEQUENTIAL)
2 C = getChildren(l, false)
3 if n is lowest znode in C, exit
4 p = znode in C ordered just before n
5 if exists(p, true) wait for event
6 goto 2

Read Lock
1 n = create(l + "/read-", EPHEMERAL|SEQUENTIAL)
2 C = getChildren(l, false)
3 if no write znodes lower than n in C, exit
4 p = write znode in C ordered just before n
5 if exists(p, true) wait for event
6 goto 3

This lock procedure varies slightly from the previous locks. Write locks differ only in naming. Since read locks may be shared, lines 3 and 4 vary slightly because only earlier write lock znodes prevent the client from obtaining a read lock. It may appear that we have a herd effect when there are several clients waiting for a read lock and get notified when the "write-" znode with the lower sequence number is deleted; in fact, this is a desired behavior: all those read clients should be released since they may now have the lock.
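To make the recipe concrete, the following is a minimal sketch of the Lock/Unlock procedure above written against the ZooKeeper Java client API. The class name, the use of a CountDownLatch to wait for the watch event, and the omission of connection-loss and retry handling are our own simplifications; this illustrates the recipe and is not code from the ZooKeeper distribution.

import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch of the herd-free lock recipe described above; error handling omitted.
public class SequentialLock {
    private final ZooKeeper zk;
    private final String lockDir;   // the lock znode l, e.g. "/app/lock" (assumed to exist)
    private String myNode;          // full path of our "lock-" znode

    public SequentialLock(ZooKeeper zk, String lockDir) {
        this.zk = zk;
        this.lockDir = lockDir;
    }

    public void lock() throws KeeperException, InterruptedException {
        // Line 1: n = create(l + "/lock-", EPHEMERAL|SEQUENTIAL)
        myNode = zk.create(lockDir + "/lock-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        String mine = myNode.substring(lockDir.length() + 1);
        while (true) {
            // Line 2: C = getChildren(l, false)
            List<String> children = zk.getChildren(lockDir, false);
            Collections.sort(children);          // fixed-width sequence suffixes sort correctly
            int idx = children.indexOf(mine);
            if (idx == 0) {
                return;                          // Line 3: lowest znode, we hold the lock
            }
            // Line 4: p = znode in C ordered just before n
            String prev = lockDir + "/" + children.get(idx - 1);
            // Line 5: watch only the predecessor to avoid the herd effect
            CountDownLatch gone = new CountDownLatch(1);
            if (zk.exists(prev, event -> gone.countDown()) == null) {
                continue;                        // predecessor already gone; goto 2
            }
            gone.await();                        // wait for the watch event, then goto 2
        }
    }

    public void unlock() throws KeeperException, InterruptedException {
        zk.delete(myNode, -1);                   // Unlock: delete(n)
    }
}
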

Double Barrier Double barriers enable clients to synchronize the beginning and the end of a computation. When enough processes, defined by the barrier threshold, have joined the barrier, processes start their computation and leave the barrier once they have finished. We represent a barrier in ZooKeeper with a znode, referred to as b. Every process p registers with b, by creating a znode as a child of b, on entry, and unregisters (removes the child) when it is ready to leave. Processes can enter the barrier when the number of child znodes of b exceeds the barrier threshold. Processes can leave the barrier when all of the processes have removed their children. We use watches to efficiently wait for enter and exit conditions to be satisfied. To enter, processes watch for the existence of a ready child of b that will be created by the process that causes the number of children to exceed the barrier threshold. To leave, processes watch for a particular child to disappear and only check the exit condition once that znode has been removed.
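The sketch below shows one way the double barrier could be realized with the ZooKeeper Java client. The ready child used to signal that the threshold has been reached follows the description above; the ephemeral registration znodes, the class and field names, and the latch-based waiting are our own assumptions, not code from the paper.

import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Simplified double-barrier sketch; the barrier znode b is assumed to exist already.
public class DoubleBarrier {
    private final ZooKeeper zk;
    private final String barrier;    // the barrier znode b
    private final int threshold;
    private final String member;     // this process's registration znode

    public DoubleBarrier(ZooKeeper zk, String barrier, int threshold, String processId) {
        this.zk = zk;
        this.barrier = barrier;
        this.threshold = threshold;
        this.member = barrier + "/" + processId;
    }

    public void enter() throws KeeperException, InterruptedException {
        // Register by creating a child of b on entry.
        zk.create(member, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        while (true) {
            CountDownLatch ready = new CountDownLatch(1);
            // Watch for the "ready" child instead of polling the child count.
            if (zk.exists(barrier + "/ready", event -> ready.countDown()) != null) {
                return;
            }
            List<String> children = zk.getChildren(barrier, false);
            if (children.size() >= threshold) {
                try {
                    // We pushed the count over the threshold: signal everyone else.
                    zk.create(barrier + "/ready", new byte[0],
                            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
                } catch (KeeperException.NodeExistsException ignored) {
                    // another process signalled readiness first, which is fine
                }
                return;
            }
            ready.await();           // woken when "ready" appears, then re-check
        }
    }

    public void leave() throws KeeperException, InterruptedException {
        zk.delete(member, -1);       // unregister by removing our child
        while (true) {
            CountDownLatch changed = new CountDownLatch(1);
            List<String> children = zk.getChildren(barrier, event -> changed.countDown());
            long stillRegistered = children.stream().filter(c -> !c.equals("ready")).count();
            if (stillRegistered == 0) {
                return;              // all processes have removed their children
            }
            changed.await();         // wait for the membership to change, then re-check
        }
    }
}
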

ZooKeeper Applications

We now describe some applications that use ZooKeeper, and explain briefly how they use it. We show the primitives of each example in bold.

The Fetching Service Crawling is an important part of a search engine, and Yahoo! crawls billions of Web documents. The Fetching Service (FS) is part of the Yahoo! crawler and it is currently in production. Essentially, it has master processes that command page-fetching processes. The master provides the fetchers with configuration, and the fetchers write back informing of their status and health. The main advantages of using ZooKeeper for FS are recovering from failures of masters, guaranteeing availability despite failures, and decoupling the clients from the servers, allowing them to direct their requests to healthy servers by just reading their status from ZooKeeper. Thus, FS uses ZooKeeper mainly to manage configuration metadata, although it also uses ZooKeeper to elect masters (leader election).

Figure 2: Workload for one ZK server with the Fetching Service. Each point represents a one-second sample (read and write operations per second over time).

Figure 2 shows the read and write traffic for a ZooKeeper server used by FS through a period of three days. To generate this graph, we count the number of operations for every second during the period, and each point corresponds to the number of operations in that second. We observe that the read traffic is much higher compared to the write traffic. During periods in which the rate is higher than 1,000 operations per second, the read:write ratio varies between 10:1 and 100:1. The read operations in this workload are getData(), getChildren(), and exists(), in increasing order of prevalence.

Katta Katta [17] is a distributed indexer that uses ZooKeeper for coordination, and it is an example of a non-Yahoo! application. Katta divides the work of indexing using shards. A master server assigns shards to slaves and tracks progress. Slaves can fail, so the master must redistribute load as slaves come and go. The master can also fail, so other servers must be ready to take over in case of failure. Katta uses ZooKeeper to track the status of slave servers and the master (group membership), and to handle master failover (leader election). Katta also uses ZooKeeper to track and propagate the assignments of shards to slaves (configuration management).

Yahoo! Message Broker Yahoo! Message Broker (YMB) is a distributed publish-subscribe system. The system manages thousands of topics that clients can publish messages to and receive messages from. The topics are distributed among a set of servers to provide scalability. Each topic is replicated using a primary-backup scheme that ensures messages are replicated to two machines to ensure reliable message delivery. The servers that make up YMB use a shared-nothing distributed architecture which makes coordination essential for correct operation. YMB uses ZooKeeper to manage the distribution of topics (configuration metadata), deal with failures of machines in the system (failure detection and group membership), and control system operation.

Figure 3: The layout of Yahoo! Message Broker (YMB) structures in ZooKeeper (a broker domain with nodes, shutdown, migration_prohibited, broker_disabled, and topics znodes).

Figure 3 shows part of the znode data layout for YMB. Each broker domain has a znode called nodes that has an ephemeral znode for each of the active servers that compose the YMB service. Each YMB server creates an ephemeral znode under nodes with load and status information, providing both group membership and status information through ZooKeeper. Nodes such as shutdown and migration_prohibited are monitored by all of the servers that make up the service and allow centralized control of YMB. The topics directory has a child znode for each topic managed by YMB. These topic znodes have child znodes that indicate the primary and backup server for each topic along with the subscribers of that topic. The primary and backup server znodes not only allow servers to discover the servers in charge of a topic, but they also manage leader election and server crashes.

ZooKeeper Implementation

ZooKeeper provides high availability by replicating the ZooKeeper data on each server that composes the service. We assume that servers fail by crashing, and such faulty servers may later recover. Figure 4 shows the high-level components of the ZooKeeper service. Upon receiving a request, a server prepares it for execution (request processor). If such a request requires coordination among the servers (write requests), then they use an agreement protocol (an implementation of atomic broadcast), and finally servers commit changes to the ZooKeeper database fully replicated across all servers of the ensemble. In the case of read requests, a server simply reads the state of the local database and generates a response to the request.

The replicated database is an in-memory database containing the entire data tree. Each znode in the tree stores a maximum of 1MB of data by default, but this maximum value is a configuration parameter that can be changed in specific cases. For recoverability, we efficiently log updates to disk, and we force writes to be on the disk media before they are applied to the in-memory database. In fact, as Chubby [8], we keep a replay log (a write-ahead log, in our case) of committed operations and generate periodic snapshots of the in-memory database.

Every ZooKeeper server services clients. Clients connect to exactly one server to submit their requests. As we noted earlier, read requests are serviced from the local replica of each server database. Requests that change the state of the service, write requests, are processed by an agreement protocol.

As part of the agreement protocol write requests are forwarded to a single server, called the leader¹. The rest of the ZooKeeper servers, called followers, receive message proposals consisting of state changes from the leader and agree upon state changes.

¹ Details of leaders and followers, as part of the agreement protocol, are out of the scope of this paper.

Figure 4: The components of the ZooKeeper service (write requests flow through the request processor and atomic broadcast into the replicated database; read requests are answered directly from the replicated database).

4.1 Request Processor

Since the messaging layer is atomic, we guarantee that the local replicas never diverge, although at any point in time some servers may have applied more transactions than others. Unlike the requests sent from clients, the transactions are idempotent. When the leader receives a write request, it calculates what the state of the system will be when the write is applied and transforms it into a transaction that captures this new state. The future state must be calculated because there may be outstanding transactions that have not yet been applied to the database. For example, if a client does a conditional setData and the version number in the request matches the future version number of the znode being updated, the service generates a setDataTXN that contains the new data, the new version number, and updated time stamps. If an error occurs, such as mismatched version numbers or the znode to be updated does not exist, an errorTXN is generated instead.
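The sketch below illustrates the transformation described above: the leader evaluates a conditional setData against the future version of the znode and emits either a setDataTXN carrying the complete new state or an errorTXN. All of the types and the futureVersion argument are hypothetical stand-ins for illustration; they are not the actual ZooKeeper server classes.

// Hypothetical request/transaction types, used only to illustrate the idea.
record SetDataRequest(String path, byte[] data, int expectedVersion) {}

interface Txn {}
record SetDataTxn(String path, byte[] data, int newVersion, long mtime) implements Txn {}
record ErrorTxn(String error) implements Txn {}

final class RequestProcessorSketch {
    // futureVersion is the version the znode will have once all queued transactions
    // are applied; we assume the caller can look this up from the pending state.
    Txn prepare(SetDataRequest req, int futureVersion, long now) {
        // -1 conventionally means "match any version" in ZooKeeper's setData.
        if (req.expectedVersion() != -1 && req.expectedVersion() != futureVersion) {
            return new ErrorTxn("BADVERSION");      // mismatched version numbers
        }
        // The transaction records the absolute new data, version, and time stamp,
        // so applying it once or twice (in order) yields the same state.
        return new SetDataTxn(req.path(), req.data(), futureVersion + 1, now);
    }
}
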

4.2 Atomic Broadcast

All requests that update ZooKeeper state are forwarded to the leader. The leader executes the request and broadcasts the change to the ZooKeeper state through Zab [24], an atomic broadcast protocol. The server that receives the client request responds to the client when it delivers the corresponding state change. Zab uses by default simple majority quorums to decide on a proposal, so Zab and thus ZooKeeper can only work if a majority of servers are correct (i.e., with 2f + 1 servers we can tolerate f failures).

To achieve high throughput, ZooKeeper tries to keep the request processing pipeline full. It may have thousands of requests in different parts of the processing pipeline. Because state changes depend on the application of previous state changes, Zab provides stronger order guarantees than regular atomic broadcast. More specifically, Zab guarantees that changes broadcast by a leader are delivered in the order they were sent and all changes from previous leaders are delivered to an established leader before it broadcasts its own changes.

There are a few implementation details that simplify our implementation and give us excellent performance. We use TCP for our transport so message order is maintained by the network, which allows us to simplify our implementation. We use the leader chosen by Zab as the ZooKeeper leader, so that the same process that creates transactions also proposes them. We use the log to keep track of proposals as the write-ahead log for the in-memory database, so that we do not have to write messages twice to disk.

During normal operation Zab does deliver all messages in order and exactly once, but since Zab does not persistently record the id of every message delivered, Zab may redeliver a message during recovery. Because we use idempotent transactions, multiple delivery is acceptable as long as they are delivered in order. In fact, ZooKeeper requires Zab to redeliver at least all messages that were delivered after the start of the last snapshot.
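As a small, self-contained illustration of the majority-quorum arithmetic above (our own helper, not Zab code):

final class QuorumMath {
    // An ensemble of n = 2f + 1 servers tolerates f crash failures.
    static int toleratedFailures(int ensembleSize) {
        return (ensembleSize - 1) / 2;        // e.g. 5 servers tolerate 2 failures
    }

    // A proposal is decided once a strict majority of the ensemble acknowledges it.
    static boolean hasQuorum(int acks, int ensembleSize) {
        return acks > ensembleSize / 2;       // e.g. 3 acks out of 5
    }
}
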

4.3 Replicated Database

Each replica has a copy in memory of the ZooKeeper state. When a ZooKeeper server recovers from a crash, it
needs to recover this internal state. Replaying all delivered messages to recover state would take prohibitively
long after running the server for a while, so ZooKeeper
uses periodic snapshots and only requires redelivery of
messages since the start of the snapshot. We call ZooKeeper snapshots fuzzy snapshots since we do not lock
the ZooKeeper state to take the snapshot; instead, we do
a depth-first scan of the tree, atomically reading each znode's data and meta-data and writing them to disk. Since
the resulting fuzzy snapshot may have applied some subset of the state changes delivered during the generation of
the snapshot, the result may not correspond to the state
of ZooKeeper at any point in time. However, since state
changes are idempotent, we can apply them twice as long
as we apply the state changes in order.
For example, assume that in a ZooKeeper data tree two
nodes /foo and /goo have values f1 and g1 respectively and both are at version 1 when the fuzzy snapshot begins, and the following stream of state changes
arrive, having the form ⟨transactionType, path, value, new-version⟩:

⟨SetDataTXN, /foo, f2, 2⟩
⟨SetDataTXN, /goo, g2, 2⟩
⟨SetDataTXN, /foo, f3, 3⟩

After processing these state changes, /foo and /goo


have values f3 and g2 with versions 3 and 2 respectively. However, the fuzzy snapshot may have recorded
that /foo and /goo have values f3 and g1 with versions 3 and 1 respectively, which was not a valid state
of the ZooKeeper data tree. If the server crashes and
recovers with this snapshot and Zab redelivers the state
changes, the resulting state corresponds to the state of the
service before the crash.
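A toy model of the example above, showing why idempotent transactions make fuzzy snapshots safe: replaying the full stream of state changes over a snapshot that already reflects some of them converges to the pre-crash state. The classes here are ours, purely for illustration.

import java.util.HashMap;
import java.util.Map;

// Idempotent replay over a fuzzy snapshot (toy model, not the server code).
final class FuzzyReplaySketch {
    record Znode(String value, int version) {}
    record SetDataTxn(String path, String value, int newVersion) {}

    static void apply(Map<String, Znode> tree, SetDataTxn txn) {
        // The transaction carries the absolute new value and version, so applying it
        // a second time produces exactly the same znode state.
        tree.put(txn.path(), new Znode(txn.value(), txn.newVersion()));
    }

    public static void main(String[] args) {
        // Fuzzy snapshot from the example: /foo was captured after its updates
        // (f3, version 3), while /goo was captured before its update (g1, version 1).
        Map<String, Znode> snapshot = new HashMap<>();
        snapshot.put("/foo", new Znode("f3", 3));
        snapshot.put("/goo", new Znode("g1", 1));

        // Zab redelivers, in order, all state changes since the snapshot started.
        apply(snapshot, new SetDataTxn("/foo", "f2", 2));
        apply(snapshot, new SetDataTxn("/goo", "g2", 2));
        apply(snapshot, new SetDataTxn("/foo", "f3", 3));

        // Final state matches the pre-crash state: /foo = f3 (v3), /goo = g2 (v2).
        System.out.println(snapshot);
    }
}
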

4.4 Client-Server Interactions

When a server processes a write request, it also sends out and clears notifications relative to any watch that corresponds to that update. Servers process writes in order and do not process other writes or reads concurrently. This ensures strict succession of notifications. Note that servers handle notifications locally. Only the server that a client is connected to tracks and triggers notifications for that client.

Read requests are handled locally at each server. Each read request is processed and tagged with a zxid that corresponds to the last transaction seen by the server. This zxid defines the partial order of the read requests with respect to the write requests. By processing reads locally, we obtain excellent read performance because it is just an in-memory operation on the local server, and there is no disk activity or agreement protocol to run. This design choice is key to achieving our goal of excellent performance with read-dominant workloads.

One drawback of using fast reads is not guaranteeing precedence order for read operations. That is, a read operation may return a stale value, even though a more recent update to the same znode has been committed. Not all of our applications require precedence order, but for applications that do require it, we have implemented sync. This primitive executes asynchronously and is ordered by the leader after all pending writes to its local replica. To guarantee that a given read operation returns the latest updated value, a client calls sync followed by the read operation. The FIFO order guarantee of client operations together with the global guarantee of sync enables the result of the read operation to reflect any changes that happened before the sync was issued. In our implementation, we do not need to atomically broadcast sync as we use a leader-based algorithm, and we simply place the sync operation at the end of the queue of requests between the leader and the server executing the call to sync. In order for this to work, the follower must be sure that the leader is still the leader. If there are pending transactions that commit, then the server does not suspect the leader. If the pending queue is empty, the leader needs to issue a null transaction to commit and orders the sync after that transaction. This has the nice property that when the leader is under load, no extra broadcast traffic is generated. In our implementation, timeouts are set such that leaders realize they are not leaders before followers abandon them, so we do not issue the null transaction.

ZooKeeper servers process requests from clients in FIFO order. Responses include the zxid that the response is relative to. Even heartbeat messages during intervals of no activity include the last zxid seen by the server that the client is connected to. If the client connects to a new server, that new server ensures that its view of the ZooKeeper data is at least as recent as the view of the client by checking the last zxid of the client against its last zxid. If the client has a more recent view than the server, the server does not reestablish the session with the client until the server has caught up. The client is guaranteed to be able to find another server that has a recent view of the system since the client only sees changes that have been replicated to a majority of the ZooKeeper servers. This behavior is important to guarantee durability.

To detect client session failures, ZooKeeper uses timeouts. The leader determines that there has been a failure if no other server receives anything from a client session within the session timeout. If the client sends requests frequently enough, then there is no need to send any other message. Otherwise, the client sends heartbeat messages during periods of low activity. If the client cannot communicate with a server to send a request or heartbeat, it connects to a different ZooKeeper server to re-establish its session. To prevent the session from timing out, the ZooKeeper client library sends a heartbeat after the session has been idle for s/3 ms and switches to a new server if it has not heard from a server for 2s/3 ms, where s is the session timeout in milliseconds.
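For clients that do need precedence order, the pattern described above is to issue sync and then the read. A minimal sketch using the asynchronous sync call of the ZooKeeper Java client follows; the wrapper class and the decision to block on a latch before reading are our own choices.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// "sync then read" sketch for an up-to-date read of a znode's data.
final class SyncedRead {
    static byte[] readLatest(ZooKeeper zk, String path)
            throws KeeperException, InterruptedException {
        CountDownLatch synced = new CountDownLatch(1);
        // sync is asynchronous; the leader orders it after all pending writes.
        zk.sync(path, (rc, p, ctx) -> synced.countDown(), null);
        synced.await();
        // FIFO client order plus the global order of sync guarantee that this read
        // reflects every write committed before the sync was issued.
        Stat stat = new Stat();
        return zk.getData(path, false, stat);
    }
}
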

Evaluation

We performed all of our evaluation on a cluster of 50 servers. Each server has one Xeon dual-core 2.1GHz processor, 4GB of RAM, gigabit ethernet, and two SATA hard drives. We split the following discussion into two parts: throughput and latency of requests.

5.1 Throughput

To evaluate our system, we benchmark throughput when the system is saturated and the changes in throughput for various injected failures. We varied the number of servers that make up the ZooKeeper service, but always kept the number of clients the same. To simulate a large number of clients, we used 35 machines to simulate 250 simultaneous clients.

We have a Java implementation of the ZooKeeper server, and both Java and C clients². For these experiments, we used the Java server configured to log to one dedicated disk and take snapshots on another. Our benchmark client uses the asynchronous Java client API, and each client has at least 100 requests outstanding. Each request consists of a read or write of 1K of data. We do not show benchmarks for other operations since the performance of all the operations that modify state are approximately the same, and the performance of non-state-modifying operations, excluding sync, are approximately the same. (The performance of sync approximates that of a light-weight write, since the request must go to the leader, but does not get broadcast.) Clients send counts of the number of completed operations every 300ms and we sample every 6s. To prevent memory overflows, servers throttle the number of concurrent requests in the system. ZooKeeper uses request throttling to keep servers from being overwhelmed. For these experiments, we configured the ZooKeeper servers to have a maximum of 2,000 total requests in process.

² The implementation is publicly available at http://hadoop.apache.org/zookeeper.

Figure 5: The throughput performance of a saturated system as the ratio of reads to writes varies (operations per second vs. percentage of read requests, for ensembles of 3, 5, 7, 9, and 13 servers).

Servers   100% Reads   0% Reads
13        460k         8k
9         296k         12k
7         257k         14k
5         165k         18k
3         87k          21k

Table 1: The throughput performance of the extremes of a saturated system.

In Figure 5, we show throughput as we vary the ratio of read to write requests, and each curve corresponds to a different number of servers providing the ZooKeeper service. Table 1 shows the numbers at the extremes of the read loads. Read throughput is higher than write throughput because reads do not use atomic broadcast. The graph also shows that the number of servers has a negative impact on the performance of the broadcast protocol. From these graphs, we observe that the number of servers in the system impacts not only the number of failures that the service can handle, but also the workload the service can handle. Note that the curve for three servers crosses the others around 60%. This situation is not exclusive to the three-server configuration, and happens for all configurations due to the parallelism local reads enable. It is not observable for other configurations in the figure, however, because we have capped the maximum y-axis throughput for readability.
There are two reasons for write requests taking longer than read requests. First, write requests must go through atomic broadcast, which requires some extra processing and adds latency to requests. The other reason for longer processing of write requests is that servers must ensure that transactions are logged to non-volatile store before sending acknowledgments back to the leader. In principle, this requirement is excessive, but for our production systems we trade performance for reliability since ZooKeeper constitutes application ground truth. We use more servers to tolerate more faults. We increase write throughput by partitioning the ZooKeeper data into multiple ZooKeeper ensembles. This performance trade off between replication and partitioning has been previously observed by Gray et al. [12].

Figure 6: Throughput of a saturated system, varying the ratio of reads to writes when all clients connect to the leader (operations per second vs. percentage of read requests, for ensembles of 3, 5, 7, 9, and 13 servers).

ZooKeeper is able to achieve such high throughput by distributing load across the servers that make up the service. We can distribute the load because of our relaxed consistency guarantees. Chubby clients instead direct all requests to the leader. Figure 6 shows what happens if we do not take advantage of this relaxation and force the clients to connect only to the leader. As expected, the throughput is much lower for read-dominant workloads, but even for write-dominant workloads the throughput is lower. The extra CPU and network load caused by servicing clients impacts the ability of the leader to coordinate the broadcast of the proposals, which in turn adversely impacts the overall write performance.

Figure 7: Average throughput of the atomic broadcast component in isolation (requests per second vs. size of ensemble). Error bars denote the minimum and maximum values.

The atomic broadcast protocol does most of the work of the system and thus limits the performance of ZooKeeper more than any other component. Figure 7 shows the throughput of the atomic broadcast component. To benchmark its performance we simulate clients by generating the transactions directly at the leader, so there are no client connections or client requests and replies. At maximum throughput the atomic broadcast component becomes CPU bound. In theory the performance of Figure 7 would match the performance of ZooKeeper with 100% writes. However, the ZooKeeper client communication, ACL checks, and request-to-transaction conversions all require CPU. The contention for CPU lowers ZooKeeper throughput to substantially less than the atomic broadcast component in isolation. Because ZooKeeper is a critical production component, up to now our development focus for ZooKeeper has been correctness and robustness. There are plenty of opportunities for improving performance significantly by eliminating things like extra copies, multiple serializations of the same object, more efficient internal data structures, etc.

Figure 8: Throughput upon failures (operations per second vs. seconds since start of series; the marked events are described in the text).


To show the behavior of the system over time as failures are injected we ran a ZooKeeper service made up
of 5 machines. We ran the same saturation benchmark
as before, but this time we kept the write percentage at
a constant 30%, which is a conservative ratio of our expected workloads. Periodically we killed some of the
server processes. Figure 8 shows the system throughput
as it changes over time. The events marked in the figure
are the following:
1. Failure and recovery of a follower;
2. Failure and recovery of a different follower;
3. Failure of the leader;
4. Failure of two followers (a, b) in the first two marks,
and recovery at the third mark (c);
5. Failure of the leader.

6. Recovery of the leader.


There are a few important observations from this
graph. First, if followers fail and recover quickly, then
ZooKeeper is able to sustain a high throughput despite
the failure. The failure of a single follower does not prevent servers from forming a quorum, and only reduces
throughput roughly by the share of read requests that the
server was processing before failing. Second, our leader
election algorithm is able to recover fast enough to prevent throughput from dropping substantially. In our observations, ZooKeeper takes less than 200ms to elect a
new leader. Thus, although servers stop serving requests
for a fraction of a second, we do not observe a throughput
of zero due to our sampling period, which is on the order
of seconds. Third, even if followers take more time to recover, ZooKeeper is able to raise throughput again once
they start processing requests. One reason that we do
not recover to the full throughput level after events 1, 2,
and 4 is that the clients only switch followers when their
connection to the follower is broken. Thus, after event 4
the clients do not redistribute themselves until the leader
fails at events 3 and 5. In practice such imbalances work
themselves out over time as clients come and go.

5.2 Latency of requests

To assess the latency of requests, we created a benchmark modeled after the Chubby benchmark [6]. We create a worker process that simply sends a create, waits for it to finish, sends an asynchronous delete of the new node, and then starts the next create. We vary the number of workers accordingly, and for each run, we have each worker create 50,000 nodes. We calculate the throughput by dividing the number of create requests completed by the total time it took for all the workers to complete.

          Number of servers
Workers   3      5      7      9
1         776    748    758    711
10        2074   1832   1572   1540
20        2740   2336   1934   1890

Table 2: Create requests processed per second.

Table 2 shows the results of our benchmark. The create requests include 1K of data, rather than 5 bytes in the Chubby benchmark, to better coincide with our expected use. Even with these larger requests, the throughput of ZooKeeper is more than 3 times higher than the published throughput of Chubby. The throughput of the single ZooKeeper worker benchmark indicates that the average request latency is 1.2ms for three servers and 1.4ms for 9 servers.

5.3 Performance of barriers

In this experiment, we execute a number of barriers sequentially to assess the performance of primitives implemented with ZooKeeper. For a given number of barriers b, each client first enters all b barriers, and then it leaves all b barriers in succession. As we use the double-barrier algorithm of Section 2.4, a client first waits for all other clients to execute the enter() procedure before moving to the next call (similarly for leave()).

We report the results of our experiments in Table 3. In this experiment, we have 50, 100, and 200 clients entering a number b of barriers in succession, b ∈ {200, 400, 800, 1600}. Although an application can have thousands of ZooKeeper clients, quite often a much smaller subset participates in each coordination operation as clients are often grouped according to the specifics of the application.

                # of clients
# of barriers   50     100     200
200             9.4    19.8    41.0
400             16.4   34.1    62.0
800             28.9   55.9    112.1
1600            54.0   102.7   234.4

Table 3: Barrier experiment with time in seconds. Each point is the average of the time for each client to finish over five runs.

Two interesting observations from this experiment are that the time to process all barriers increases roughly linearly with the number of barriers, showing that concurrent access to the same part of the data tree did not produce any unexpected delay, and that latency increases proportionally to the number of clients. This is a consequence of not saturating the ZooKeeper service. In fact, we observe that even with clients proceeding in lock-step, the throughput of barrier operations (enter and leave) is between 1,950 and 3,100 operations per second in all cases. In ZooKeeper operations, this corresponds to throughput values between 10,700 and 17,000 operations per second. As in our implementation we have a ratio of reads to writes of 4:1 (80% of read operations), the throughput our benchmark code uses is much lower compared to the raw throughput ZooKeeper can achieve (over 40,000 according to Figure 5). This is due to clients waiting on other clients.

Related work

ZooKeeper has the goal of providing a service that mitigates the problem of coordinating processes in distributed applications. To achieve this goal, its design uses
ideas from previous coordination services, fault tolerant
systems, distributed algorithms, and file systems.

We are not the first to propose a system for the coordination of distributed applications. Some early systems
propose a distributed lock service for transactional applications [13], and for sharing information in clusters
of computers [19]. More recently, Chubby proposes a
system to manage advisory locks for distributed applications [6]. Chubby shares several of the goals of ZooKeeper. It also has a file-system-like interface, and it uses
an agreement protocol to guarantee the consistency of the
replicas. However, ZooKeeper is not a lock service. It
can be used by clients to implement locks, but there are
no lock operations in its API. Unlike Chubby, ZooKeeper
allows clients to connect to any ZooKeeper server, not
just the leader. ZooKeeper clients can use their local
replicas to serve data and manage watches since its consistency model is much more relaxed than Chubby's. This
enables ZooKeeper to provide higher performance than
Chubby, allowing applications to make more extensive
use of ZooKeeper.
There have been fault-tolerant systems proposed in
the literature with the goal of mitigating the problem of
building fault-tolerant distributed applications. One early
system is ISIS [5]. The ISIS system transforms abstract
type specifications into fault-tolerant distributed objects,
thus making fault-tolerance mechanisms transparent to
users. Horus [30] and Ensemble [31] are systems that
evolved from ISIS. ZooKeeper embraces the notion of
virtual synchrony of ISIS. Finally, Totem guarantees total
order of message delivery in an architecture that exploits
hardware broadcasts of local area networks [22]. ZooKeeper works with a wide variety of network topologies
which motivated us to rely on TCP connections between
server processes and not assume any special topology or
hardware features. We also do not expose any of the ensemble communication used internally in ZooKeeper.
One important technique for building fault-tolerant
services is state-machine replication [26], and Paxos [20]
is an algorithm that enables efficient implementations
of replicated state-machines for asynchronous systems.
We use an algorithm that shares some of the characteristics of Paxos, but that combines transaction logging
needed for consensus with write-ahead logging needed
for data tree recovery to enable an efficient implementation. There have been proposals of protocols for practical
implementations of Byzantine-tolerant replicated statemachines [7, 10, 18, 1, 28]. ZooKeeper does not assume
that servers can be Byzantine, but we do employ mechanisms such as checksums and sanity checks to catch
non-malicious Byzantine faults. Clement et al. discuss an approach to make ZooKeeper fully Byzantine
fault-tolerant without modifying the current server code
base [9]. To date, we have not observed faults in production that would have been prevented using a fully Byzantine fault-tolerant protocol [29].

Boxwood [21] is a system that uses distributed lock


servers. Boxwood provides higher-level abstractions to
applications, and it relies upon a distributed lock service
based on Paxos. Like Boxwood, ZooKeeper is a component used to build distributed systems. ZooKeeper,
however, has high-performance requirements and is used
more extensively in client applications. ZooKeeper exposes lower-level primitives that applications use to implement higher-level primitives.
ZooKeeper resembles a small file system, but it only
provides a small subset of the file system operations
and adds functionality not present in most file systems
such as ordering guarantees and conditional writes. ZooKeeper watches, however, are similar in spirit to the
cache callbacks of AFS [16].
Sinfonia [2] introduces mini-transactions, a new
paradigm for building scalable distributed systems. Sinfonia has been designed to store application data,
whereas ZooKeeper stores application metadata. ZooKeeper keeps its state fully replicated and in memory for
high performance and consistent latency. Our use of file
system like operations and ordering enables functionality
similar to mini-transactions. The znode is a convenient
abstraction upon which we add watches, a functionality
missing in Sinfonia. Dynamo [11] allows clients to get
and put relatively small (less than 1M) amounts of data in
a distributed key-value store. Unlike ZooKeeper, the key
space in Dynamo is not hierarchical. Dynamo also does
not provide strong durability and consistency guarantees
for writes, but instead resolves conflicts on reads.
DepSpace [4] uses a tuple space to provide a Byzantine fault-tolerant service. Like ZooKeeper, DepSpace uses a simple server interface to implement strong synchronization primitives at the client. While DepSpace's performance is much lower than ZooKeeper's, it provides
stronger fault tolerance and confidentiality guarantees.

Conclusions

ZooKeeper takes a wait-free approach to the problem of


coordinating processes in distributed systems, by exposing wait-free objects to clients. We have found ZooKeeper to be useful for several applications inside and
outside Yahoo!. ZooKeeper achieves throughput values of hundreds of thousands of operations per second
for read-dominant workloads by using fast reads with
watches, both of which are served by local replicas. Although our consistency guarantees for reads and watches
appear to be weak, we have shown with our use cases that
this combination allows us to implement efficient and
sophisticated coordination protocols at the client even
though reads are not precedence-ordered and the implementation of data objects is wait-free. The wait-free
property has proved to be essential for high performance.

Although we have described only a few applications,


there are many others using ZooKeeper. We believe such success is due to its simple interface and the powerful abstractions that one can implement through this interface. Further, because of the high throughput of ZooKeeper, applications can make extensive use of it, not only for coarse-grained locking.

Acknowledgements

We would like to thank Andrew Kornev and Runping Qi for their contributions to ZooKeeper; Zeke Huang and Mark Marchukov for valuable feedback; Brian Cooper and Laurence Ramontianu for their early contributions to ZooKeeper; Brian Bershad and Geoff Voelker made important comments on the presentation.

References
[1] M. Abd-El-Malek, G. R. Ganger, G. R. Goodson, M. K. Reiter, and J. J. Wylie. Fault-scalable Byzantine fault-tolerant services. In SOSP '05: Proceedings of the Twentieth ACM Symposium on Operating Systems Principles, pages 59–74, New York, NY, USA, 2005. ACM.

[2] M. Aguilera, A. Merchant, M. Shah, A. Veitch, and C. Karamanolis. Sinfonia: A new paradigm for building scalable distributed systems. In SOSP '07: Proceedings of the 21st ACM Symposium on Operating Systems Principles, New York, NY, 2007.

[3] Amazon. Amazon simple queue service. http://aws.amazon.com/sqs/, 2008.

[4] A. N. Bessani, E. P. Alchieri, M. Correia, and J. da Silva Fraga. DepSpace: A Byzantine fault-tolerant coordination service. In Proceedings of the 3rd ACM SIGOPS/EuroSys European Systems Conference (EuroSys 2008), Apr. 2008.

[5] K. P. Birman. Replication and fault-tolerance in the ISIS system. In SOSP '85: Proceedings of the 10th ACM Symposium on Operating Systems Principles, New York, USA, 1985. ACM Press.

[6] M. Burrows. The Chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th ACM/USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2006.

[7] M. Castro and B. Liskov. Practical Byzantine fault tolerance and proactive recovery. ACM Transactions on Computer Systems, 20(4), 2002.

[8] T. Chandra, R. Griesemer, and J. Redstone. Paxos made live: An engineering perspective. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing (PODC), Aug. 2007.

[9] A. Clement, M. Kapritsos, S. Lee, Y. Wang, L. Alvisi, M. Dahlin, and T. Riche. UpRight cluster services. In Proceedings of the 22nd ACM Symposium on Operating Systems Principles (SOSP), Oct. 2009.

[10] J. Cowling, D. Myers, B. Liskov, R. Rodrigues, and L. Shira. HQ replication: A hybrid quorum protocol for Byzantine fault tolerance. In SOSP '07: Proceedings of the 21st ACM Symposium on Operating Systems Principles, New York, NY, USA, 2007.

[11] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. In SOSP '07: Proceedings of the 21st ACM Symposium on Operating Systems Principles, New York, NY, USA, 2007. ACM Press.

[12] J. Gray, P. Helland, P. O'Neil, and D. Shasha. The dangers of replication and a solution. In Proceedings of SIGMOD '96, pages 173–182, New York, NY, USA, 1996. ACM.

[13] A. Hastings. Distributed lock management in a transaction processing environment. In Proceedings of the IEEE 9th Symposium on Reliable Distributed Systems, Oct. 1990.

[14] M. Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems, 13(1), 1991.

[15] M. Herlihy and J. Wing. Linearizability: A correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, 12(3), July 1990.

[16] J. H. Howard, M. L. Kazar, S. G. Menees, D. A. Nichols, M. Satyanarayanan, R. N. Sidebotham, and M. J. West. Scale and performance in a distributed file system. ACM Transactions on Computer Systems, 6(1), 1988.

[17] Katta. Katta - distribute Lucene indexes in a grid. http://katta.wiki.sourceforge.net/, 2008.

[18] R. Kotla, L. Alvisi, M. Dahlin, A. Clement, and E. Wong. Zyzzyva: Speculative Byzantine fault tolerance. SIGOPS Operating Systems Review, 41(6):45–58, 2007.

[19] N. P. Kronenberg, H. M. Levy, and W. D. Strecker. VAXclusters (extended abstract): A closely-coupled distributed system. SIGOPS Operating Systems Review, 19(5), 1985.

[20] L. Lamport. The part-time parliament. ACM Transactions on Computer Systems, 16(2), May 1998.

[21] J. MacCormick, N. Murphy, M. Najork, C. A. Thekkath, and L. Zhou. Boxwood: Abstractions as the foundation for storage infrastructure. In Proceedings of the 6th ACM/USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2004.

[22] L. Moser, P. Melliar-Smith, D. Agarwal, R. Budhia, C. Lingley-Papadopoulos, and T. Archambault. The Totem system. In Proceedings of the 25th International Symposium on Fault-Tolerant Computing, June 1995.

[23] S. Mullender, editor. Distributed Systems, 2nd edition. ACM Press, New York, NY, USA, 1993.

[24] B. Reed and F. P. Junqueira. A simple totally ordered broadcast protocol. In LADIS '08: Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware, pages 1–6, New York, NY, USA, 2008. ACM.

[25] N. Schiper and S. Toueg. A robust and lightweight stable leader election service for dynamic systems. In DSN, 2008.

[26] F. B. Schneider. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Computing Surveys, 22(4), 1990.

[27] A. Sherman, P. A. Lisiecki, A. Berkheimer, and J. Wein. ACMS: The Akamai configuration management system. In NSDI, 2005.

[28] A. Singh, P. Fonseca, P. Kuznetsov, R. Rodrigues, and P. Maniatis. Zeno: Eventually consistent Byzantine-fault tolerance. In NSDI '09: Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation, pages 169–184, Berkeley, CA, USA, 2009. USENIX Association.

[29] Y. J. Song, F. Junqueira, and B. Reed. BFT for the skeptics. http://www.net.t-labs.tu-berlin.de/petr/BFTW3/abstracts/talk-abstract.pdf.

[30] R. van Renesse and K. Birman. Horus, a flexible group communication system. Communications of the ACM, 39(16), Apr. 1996.

[31] R. van Renesse, K. Birman, M. Hayden, A. Vaysburd, and D. Karr. Building adaptive systems using Ensemble. Software - Practice and Experience, 28(5), July 1998.

S-ar putea să vă placă și