
Modern distributed file system design

Mischa Geldermans
Vrije Universiteit Amsterdam

Abstract
This paper gives an overview of five modern distributed file systems. For each file system a high-level design is given, and the consistency model and fault tolerance mechanisms are described. By focusing on the consistency model and fault tolerance, the paper compares the design choices these distributed file systems make to obtain reasonable performance.

1. Introduction
Data is stored in files. The software that manages files is called a file system. A file system offers a namespace to address files and administers the structures that store both file names and file data on a storage medium.
With the advent of computer networks, network file systems were introduced to give users access to their files irrespective of their location in the network. Initially, network file systems were centralized and consisted of a network interface on top of a local file system. In order to scale, the file system can be distributed over different nodes in the network. Such a file system is called a distributed file system.
Files are often used to share data among different applications. In a distributed file system files are
also essential for distributed applications to share information.
Distributed file systems often replicate data. As the system uses multiple nodes in the network, the file system needs to be tolerant of failing nodes. If it were not, the availability of the whole file system would decrease with the size of the system.
Many distributed file systems attempt to make replication of file data as transparent as possible, because a transparent interface is what local file systems offer. The POSIX file system interface is an example of a sequentially consistent file system interface. It is the consistency model that reflects the actual semantics of a distributed file system.
The consistency model also influences the performance of a distributed file system to a large extent. Maintaining sequential consistency is costly when there are many replicas or when there are many concurrent writes to a file. It is the combination of a consistency model, fault tolerance, and performance optimizations that distinguishes one distributed file system from another.
This paper gives an overview of the design choices with respect to the consistency model, fault tolerance, and performance that are made in modern distributed file systems. Note that the consistency model described here is the one that governs the file system interface offered to applications; internally, the file system might use other consistency mechanisms.
Five modern distributed file systems are described in the sections below. They range from mere designs (zFS), to mission-critical systems (GFS), to very complex systems (Ceph).

2. Farsite
FARSITE, the Federated, Available, and Reliable Storage for an Incompletely Trusted Environment [1], is a distributed file system developed by Microsoft. Farsite is a serverless system: it uses a network of untrusted desktop workstations to distribute the file system data. By using the users' workstations it combines the advantages of a logically centralized file server, sharing and reliability, with the advantages of desktop file systems, locality and low cost, without the use of additional systems.
Workstations act as application servers, metadata servers, and data servers. A group of metadata servers that all store a replica of the same portion of the namespace uses a Byzantine fault-tolerant protocol [2]. When the metadata such a group manages grows too large, the group randomly selects a set of machines to manage a portion of the metadata as a new group.
When an application requests to open a file, the group of metadata servers that manages the file proves authority over the file by presenting a namespace certificate signed by its parent group, recursively up to the root of the namespace. The group then sends a list of data servers that store encrypted replicas of the file. The application retrieves a replica and uses its private key to decrypt the data. On writing, the group checks whether the application is allowed to write to the file and, if so, instructs the data servers to retrieve encrypted copies of the newly written data. File data is locally cached at the application server.
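As an illustration of the namespace delegation just described, the sketch below models a chain of namespace certificates being checked from the managing group up to the root. It is a minimal Python sketch: the certificate fields and the set of trusted signatures are illustrative stand-ins, since real Farsite certificates are verified with public-key cryptography.

    # A minimal sketch of Farsite-style namespace delegation, assuming a toy
    # certificate representation. Real Farsite certificates carry public-key
    # signatures; here a set of trusted signature strings stands in for
    # signature verification.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class NamespaceCertificate:
        subtree: str          # namespace path this group claims authority over
        issuer_subtree: str   # namespace path of the parent (issuing) group
        signature: str        # stand-in for a cryptographic signature

    def proves_authority(chain, path, trusted_root="/", valid_signatures=None):
        """Check that a chain of certificates delegates `path` from the root.

        The chain is ordered from the managing group up to the root. Each
        certificate must carry a recognized signature (stand-in check), cover
        the requested path, and be issued by the group one level up, ending
        at the trusted root.
        """
        valid_signatures = valid_signatures or set()
        expected_subtree = None
        for cert in chain:
            if cert.signature not in valid_signatures:
                return False
            if expected_subtree is not None and cert.subtree != expected_subtree:
                return False
            if not path.startswith(cert.subtree):
                return False                      # certificate does not cover the file
            expected_subtree = cert.issuer_subtree
        return expected_subtree == trusted_root   # chain must terminate at the root

    # Example: the group managing /users/alice proves authority over a file.
    chain = [
        NamespaceCertificate("/users/alice", "/users", "sig-alice"),
        NamespaceCertificate("/users", "/", "sig-users"),
    ]
    print(proves_authority(chain, "/users/alice/report.txt",
                           valid_signatures={"sig-alice", "sig-users"}))  # True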
2.1. Consistency model
The goal of Farsite is to offer an interface that resembles the NTFS file system interface. It diverges from the NTFS interface when maintaining consistency becomes too much of a burden. When too many readers access a file concurrently, some will receive a snapshot of the data at the time they opened the file; the snapshot does not reflect changes made afterwards. For concurrent writers, the system does not accept additional writers above a certain threshold.
The metadata servers maintain the consistency of data by issuing leases to clients that access the data. Different classes of leases exist to control different data entities, separating namespace consistency from data consistency.
When an application opens a file, a lease is requested for either read-only or read/write access. For reading, the lease assures that data is not stale; for writing, it entitles an application to write to its local cache. Leases are revoked when another application opens the same file in a conflicting mode. On closure the lease is not yet canceled, as the file might be reopened quickly. Namespace leases allow an application to modify subtrees without contacting the metadata servers.
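A minimal sketch of this lease discipline is given below, assuming a single metadata server, two lease modes, and a fixed expiration time. The class and method names are illustrative, not Farsite's actual interfaces; a real system would recall the clients' caches before revoking a conflicting lease.

    # A minimal sketch of lease issuance and revocation as described above.
    # Names (LeaseTable, LEASE_TTL) are illustrative assumptions.

    import time

    LEASE_TTL = 30.0  # seconds; leases expire so a crashed client cannot block access

    class LeaseTable:
        def __init__(self):
            self._leases = {}  # file path -> list of (client, mode, expiry)

        def _live(self, path):
            now = time.monotonic()
            self._leases[path] = [l for l in self._leases.get(path, []) if l[2] > now]
            return self._leases[path]

        def open(self, path, client, mode):
            """Grant a lease for `mode` ('read' or 'write'), revoking conflicts.

            Readers may share a file; a writer conflicts with every other client.
            Revocation here simply drops the conflicting entries.
            """
            live = self._live(path)
            conflicts = [l for l in live
                         if l[0] != client and (mode == "write" or l[1] == "write")]
            for l in conflicts:
                live.remove(l)                      # revoke conflicting leases
            live.append((client, mode, time.monotonic() + LEASE_TTL))
            return True

    table = LeaseTable()
    table.open("/docs/plan.txt", "app-1", "read")
    table.open("/docs/plan.txt", "app-2", "read")    # readers share the file
    table.open("/docs/plan.txt", "app-3", "write")   # revokes both read leases
    print(table._live("/docs/plan.txt"))             # only app-3's write lease remains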

2.2. Fault tolerance


Farsite replicates both data and metadata over random sets of nodes in the network and manages the metadata using a Byzantine fault-tolerant protocol. The metadata servers maintain cryptographically secure hashes of the data to identify corrupted data.
Metadata has a high replication factor, since Byzantine groups only make progress if more than two thirds of their members are non-faulty; for details see the paper by Castro and Liskov [2]. File data has a low replication factor, as all but one server can fail among the set of data servers that manage a file. When server failures are detected, the data is replicated on other servers to retain the replication factor. Only if many servers fail within a short time window will data be lost.
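The difference in replication factors can be made concrete with a small calculation: a Byzantine fault-tolerant group that tolerates f faulty members needs at least 3f + 1 members, that is, more than two thirds must be non-faulty, whereas crash-only file data needs just f + 1 copies so that one correct copy survives. The sketch below illustrates this.

    # Why metadata needs a higher replication factor than file data:
    # Byzantine fault tolerance requires n >= 3f + 1 group members to
    # tolerate f faulty ones (Castro and Liskov), while crash-only data
    # survives f failures with f + 1 copies.

    def byzantine_group_size(f: int) -> int:
        return 3 * f + 1

    def crash_tolerant_copies(f: int) -> int:
        return f + 1

    for f in (1, 2, 3):
        print(f"tolerate {f} faults: metadata group of {byzantine_group_size(f)}, "
              f"data copies {crash_tolerant_copies(f)}")
    # tolerate 1 faults: metadata group of 4, data copies 2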
Leases have an expiration time to prevent a failing system from making data permanently inaccessible.

2.3. Performance
Farsite is designed as a serverless system at the cost of scalability: it will not scale beyond a network with a few thousand nodes. It also assumes sequential data access patterns and does not attempt to prevent leases from migrating frequently.
Modified data is lazily written to the data servers. Keeping written data in the cache allows it to be obsoleted by subsequent modifications before it is sent.
3. zFS
zFS [3] is a distributed file system designed at IBM Labs, Israel. IBM's zFS is not related to Sun's Zettabyte File System (ZFS).
Physically, zFS consists of application servers and data servers; logically, it also includes a file manager, a lease manager, a transaction server, and a cooperative cache. Unlike the data servers used by GFS and Ceph, described below, zFS data servers store both file metadata and file data.
Every opened file in zFS is managed by a file manager, located at the application node where the file was initially opened. A file manager can manage multiple files. When an I/O request cannot be served by the local cache of the application server, the file manager requests a lease for the I/O request from the lease manager. For reads, the file manager then checks whether another application has the data cached; if so, it forwards the request and lease to the cache containing the data. For writes, all other overlapping leases are revoked synchronously and the data is written to the local cache.
The transaction server performs directory operations. The cooperative cache is the logical union of all caches in the application servers.
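The read path just described can be summarized in the following sketch, assuming simplified lease and cache interfaces; the class names are illustrative, not the actual zFS components.

    # A minimal sketch of the zFS read path: the file manager obtains a lease
    # and prefers a peer's cache (the cooperative cache) over the data server.

    class LeaseManager:
        def grant_read_lease(self, path, byte_range):
            return ("read", path, byte_range)          # stand-in for a real lease

    class FileManager:
        def __init__(self, lease_manager):
            self.lease_manager = lease_manager
            self.cache_locations = {}   # (path, byte_range) -> peer application server

        def read(self, path, byte_range, local_cache, data_server):
            if (path, byte_range) in local_cache:       # 1. try the local cache
                return local_cache[(path, byte_range)]
            lease = self.lease_manager.grant_read_lease(path, byte_range)
            peer = self.cache_locations.get((path, byte_range))
            if peer is not None:                        # 2. cooperative cache hit
                return peer.serve(path, byte_range, lease)
            return data_server.read(path, byte_range)   # 3. fall back to the disk

    class DataServer:
        def read(self, path, byte_range):
            return b"data read from disk"

    fm = FileManager(LeaseManager())
    print(fm.read("/docs/plan.txt", (0, 4096), local_cache={}, data_server=DataServer()))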

3.1. Consistency model


zFS offers a sequentially consistent file system interface. The transaction server maintains sequential consistency for directory operations by obtaining all necessary leases and then performing the operation. If the operation fails, the transaction is either rolled forward or rolled back.
The file manager maintains sequential consistency for regular data operations by obtaining and revoking leases.

3.2. Fault tolerance


Fault tolerance is mainly handled through the leases. Instead of locks, leases are used so that resources claimed by failed servers are eventually released.
zFS limits the number of dirty pages in the cache of an application server so that all pages can safely be written before the leases expire.
Data servers are assumed to be reliable; file data is not replicated.

3.3. Performance
The main assumption in zFS is that network bandwidth is much higher than disk bandwidth. By combining all local caches into a cooperative cache, preferring network transfers over disk transfers, zFS will perform well as long as this assumption holds. As the objective of zFS is to deal with configurations ranging from a few to at most thousands of servers, this is probably the case.

4. GFS
The Google File System [4] (GFS) is a distributed file system developed by Google. It is used by Google only and is built specifically for Google's needs; the file system and the applications are co-designed.
GFS is optimized for very large files and very large disk blocks. By mainly supporting very large files stored in very large disk blocks, the amount of metadata is reduced to a level at which it can be maintained in memory by a single metadata server.
The metadata server manages the file namespace, the mapping from files to disk blocks, and the data server locations of each replica of a disk block. The disk blocks themselves are managed by the data servers.
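A minimal sketch of this in-memory metadata follows, using the paper's terminology of disk blocks (GFS itself calls them chunks) and a 64 MB block size; the table contents are illustrative.

    # A minimal sketch of the metadata held by the single metadata server:
    # the namespace maps files to block identifiers, and each block identifier
    # maps to the data servers holding a replica.

    BLOCK_SIZE = 64 * 2**20           # very large blocks keep the tables small

    namespace = {                     # file name -> list of block identifiers
        "/logs/crawl-2003": ["blk-17", "blk-18"],
    }
    block_locations = {               # block identifier -> data servers with a replica
        "blk-17": ["ds-3", "ds-9", "ds-12"],
        "blk-18": ["ds-1", "ds-9", "ds-20"],
    }

    def locate(path, offset):
        """Translate a (file, offset) pair into the replicas of the right block."""
        block_id = namespace[path][offset // BLOCK_SIZE]
        return block_id, block_locations[block_id]

    print(locate("/logs/crawl-2003", 70 * 2**20))   # ('blk-18', ['ds-1', 'ds-9', 'ds-20'])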

4.1. Consistency model


There is no redundant data in the metadata server; the only replicated data in GFS are the disk blocks. Consistency of disk blocks is defined only when disk blocks are successfully written. Unsuccessful writes leave disk blocks inconsistent, and subsequent reads may obtain different data. Serial writes are sequentially consistent; concurrent writes are not, although all subsequent readers will see the same contents.
GFS provides atomic append operations, as appends are the operations mostly used by Google's applications. The append operation is similar to opening a file with the append flag in Unix: each write is appended to the end of the file. As the file offset when appending is left to the system, atomic appends are implemented by a coordinated effort of the data servers involved.
Atomic append operations are sequentially consistent, but appended records may be interspersed with inconsistent data caused by padding or duplicated appends.
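The sketch below illustrates these append semantics with a toy simulation, not the GFS protocol: the system chooses the offset, and a failed attempt that is retried leaves padding at the failed replica and a duplicate record at the others.

    # A toy simulation of atomic append semantics as described above. The
    # Block class and the failure injection are illustrative assumptions.

    class Block:
        def __init__(self):
            self.replicas = [bytearray(), bytearray(), bytearray()]

        def append(self, record: bytes, fail_on_replica: int = None) -> int:
            offset = len(self.replicas[0])            # the system chooses the offset
            for i, replica in enumerate(self.replicas):
                if i == fail_on_replica:
                    replica.extend(b"\x00" * len(record))   # region left inconsistent
                else:
                    replica.extend(record)
            if fail_on_replica is not None:
                raise IOError("append failed, client should retry")
            return offset

    block = Block()
    try:
        block.append(b"recA", fail_on_replica=2)      # partial failure on one replica
    except IOError:
        block.append(b"recA")                         # the retry succeeds at a new offset
    print([bytes(r) for r in block.replicas])
    # two replicas hold the record twice (a duplicate), the third holds padding
    # where the first attempt failed, followed by the record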

4.2. Fault tolerance


GFS does not focus on tolerating faults; it focuses on fast recovery. In addition to fast recovery, disk blocks are replicated. The metadata server uses a log to store updates, and the log is replicated on other machines. When a metadata server fails, it takes only seconds to restore its state. On serious failure, a secondary master provides read-only access to the file system.
Each disk block is replicated over multiple data servers. A heartbeat mechanism is used by the metadata server to check the availability of data servers and to designate, for each disk block and for the duration of a heartbeat, a principal data server among the replicas. The principal server determines a serial order for disk block updates and forwards this order to the replicas while the application server writes the data to all replicas. This is all done synchronously.

4.3. Performance
The designers of GFS mainly tried to minimize the involvement of the metadata server in all operations so that it does not become a bottleneck. By assuming a modest number of very large files stored in very large disk blocks, the metadata server only rarely needs to be consulted to obtain block locations.
Another assumption is that file system load mainly stems from large sequential reads and large sequential append operations, so that disk bandwidth is used efficiently. Data servers do not cache disk blocks themselves; they rely on the underlying local file system to do caching at a finer granularity. As the file system is closely tied to the applications, the workload will probably live up to these assumptions.
To use network bandwidth efficiently, data is forwarded linearly over the replicas. By knowing the network topology, each replica can forward data to the closest server in the network that has not yet received it.
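A sketch of this forwarding strategy is given below; the distance table stands in for real knowledge of the network topology.

    # A minimal sketch of linear forwarding: each server pushes the data to
    # the closest replica that has not yet received it. The distance values
    # are illustrative assumptions.

    def forwarding_chain(origin, replicas, distance):
        """Return the order in which replicas receive the data."""
        chain, current, remaining = [], origin, set(replicas)
        while remaining:
            nearest = min(remaining, key=lambda r: distance[(current, r)])
            chain.append(nearest)
            remaining.remove(nearest)
            current = nearest            # the receiver forwards to the next-closest
        return chain

    distance = {("client", "ds-a"): 1, ("client", "ds-b"): 4, ("client", "ds-c"): 6,
                ("ds-a", "ds-b"): 1, ("ds-a", "ds-c"): 5,
                ("ds-b", "ds-a"): 1, ("ds-b", "ds-c"): 1,
                ("ds-c", "ds-a"): 5, ("ds-c", "ds-b"): 1}

    print(forwarding_chain("client", ["ds-a", "ds-b", "ds-c"], distance))
    # ['ds-a', 'ds-b', 'ds-c']: data flows hop by hop to the nearest remaining replica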

5. Ceph
Ceph [5] is a distributed file system developed at the University of California, Santa Cruz. Its main characteristic is the clear separation between data and metadata at the server level, like GFS. But unlike GFS, which has a single metadata server, the master, Ceph has multiple metadata servers. Ceph goes to great lengths to eliminate the central component that is present in GFS and also to deal with many small files.
By using a data distribution function, metadata servers can locate data servers and direct an application there when it creates or opens a file. In this way, the location of a data server is calculated instead of being looked up.
Low-level block allocation is handled by the data servers. File data is striped over data servers, and all data servers that store a set of files are grouped logically into placement groups. A data server stores pieces of multiple placement groups.
Caching, if possible, is done by application servers.
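The idea of calculating rather than looking up data locations can be sketched as follows. The hashing scheme, stripe size, and placement group count below are simplified assumptions and not Ceph's actual distribution function; the point is only that any client can compute the same mapping without consulting a central table.

    # A simplified stand-in for a data distribution function: stripe units hash
    # to a placement group, and the placement group maps deterministically to
    # an ordered list of data servers.

    import hashlib

    NUM_PLACEMENT_GROUPS = 128
    STRIPE_UNIT = 4 * 2**20            # assumed stripe size
    REPLICAS = 3

    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")

    def placement_group(inode: int, offset: int) -> int:
        stripe = offset // STRIPE_UNIT
        return _hash(f"{inode}:{stripe}") % NUM_PLACEMENT_GROUPS

    def data_servers(pg: int, all_servers: list) -> list:
        """Deterministically pick an ordered replica list for a placement group."""
        ranked = sorted(all_servers, key=lambda s: _hash(f"{pg}:{s}"))
        return ranked[:REPLICAS]

    servers = [f"ds-{i}" for i in range(12)]
    pg = placement_group(inode=42, offset=9 * 2**20)
    print(pg, data_servers(pg, servers))   # any client computes the same answer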

5.1. Consistency model


Ceph offers a sequentially consistent file system interface. When an application opens a file, it is returned a capability by a metadata server. The capability describes the caching behavior that the application server must adhere to. When multiple writers open a file, or when a file has both readers and writers, any previously issued capabilities for the file are revoked, forcing I/O to be synchronous. This is similar to revoking leases in zFS. Consistency is thus maintained by the metadata servers.
To improve performance, Ceph implements some of the unofficial file system interface extensions proposed by the high-performance computing community [6]. Applications can request Ceph to open a file lazily, so that file data is still cached while multiple applications modify it. The applications can then explicitly propagate modifications to the data servers and explicitly invalidate cached contents of a file. When an application collects file statistics, Ceph temporarily revokes write capabilities.
Furthermore, Ceph has an undocumented global switch to relax consistency, so at the system level Ceph supports either strong consistency or high performance.

5.2. Fault tolerance


Fault tolerance must be dealt with in both the metadata servers and the data servers. Metadata servers update metadata through journals. These journals are large, both so that they can be stored efficiently on the data servers and so that many updates never need to be written at all, because they are already obsolete by the time the journal is flushed. The journals are also meant to be used by the metadata servers for recovery when a server fails, but this had not yet been implemented.
Replication for data servers is at the level of placement groups. A placement group lists n data servers for n-way replication. Application servers communicate with the first data server in the list that responds; this data server forwards writes synchronously to the other data servers in the placement group.

5.3. Performance
Ceph claims to achieve excellent performance by explicitly setting apart system resources, the metadata servers, for metadata lookups. No metadata is shared by the data and metadata servers. The metadata servers manage the namespace and some per-file information; the latter is maintained by the metadata servers in order to optimize for the common case of obtaining statistics for all files in a directory. As metadata lookups make up half of typical file system workloads, Ceph is optimized for them [7].
Furthermore, for each file the name and the file information are stored sequentially. This way, an entire directory can be fetched using a single read request.
Each metadata server maintains file usage statistics using a decay filter. Periodically these statistics are compared system-wide and directories are migrated to balance the load at the metadata server level. Ceph does not attempt to locate metadata close to the applications using it.
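A decaying popularity counter of the kind a metadata server could keep per directory might look like the sketch below; the half-life value is an assumption for illustration.

    # A minimal sketch of a decaying usage counter: recent accesses count more
    # than old ones, so load balancing reacts to current popularity.

    import math, time

    class DecayCounter:
        def __init__(self, half_life=300.0):
            self.value, self.last, self.half_life = 0.0, time.monotonic(), half_life

        def _decay(self):
            now = time.monotonic()
            self.value *= math.exp(-(now - self.last) * math.log(2) / self.half_life)
            self.last = now

        def hit(self, weight=1.0):
            self._decay()
            self.value += weight

        def read(self):
            self._decay()
            return self.value

    # Directories whose counters grow hot on one metadata server are candidates
    # for migration to a less loaded server.
    counter = DecayCounter()
    for _ in range(100):
        counter.hit()
    print(round(counter.read(), 1))   # ~100.0 immediately after the burst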
The main performance issue in Ceph is its reliance on a principal data server. Having all applications interact with a principal data server for each file, and having that data server synchronously forward writes to the replicas, shows that Ceph's focus lies on improving metadata performance.

6. WheelFS
WheelFS [8] is a distributed file system created at MIT. It uses cooperative reading, like BitTorrent [9], and a data distribution function, like Ceph. However, just like zFS, WheelFS does not partition files over different nodes the way Ceph does.
What sets WheelFS apart from the others is the trade-off it offers to applications. Instead of offering a single consistency model, it lets applications control consistency through semantic cues, leaving it up to an application to trade consistency for performance. By default, weak consistency is used. Seven cues are envisioned; the designers of WheelFS state that the exact set is still subject to research. The cues are described below.

6.1. Consistency model


Next to the semantic cues, WheelFS offers a sequentially consistent file system interface. The semantic cues are attached to a file or directory when it is created or when an application opens a file or directory for access. Multiple cues can be specified, and they are reflected in the path name of the file or directory; a sketch of such cue-annotated path names follows the two lists below.
The following semantic cues can be selected on file creation:
• Strict: sequentially consistent;
• WriteMany: new versions of the file may be created;
• WriteOnce: only one version, immutable after creation;
• Lax: allow multiple concurrent versions in case of network failures.
The following semantic cues can be selected when a file is opened:
• LatestVersion: check the principal node for the latest version;
• BestVersion: select the highest version obtained within a certain time window;
• AnyVersion: select the first version found.
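The sketch below shows how cues embedded in a path name could be separated from the path itself. The ".cue=" component syntax is an assumption made here for illustration; the WheelFS design only states that cues appear in the path name.

    # A minimal sketch of extracting semantic cues from a path name, assuming
    # a hypothetical ".cue=<Name>" path-component encoding.

    KNOWN_CUES = {"Strict", "WriteMany", "WriteOnce", "Lax",
                  "LatestVersion", "BestVersion", "AnyVersion"}

    def split_cues(path: str):
        """Return (plain path, set of cues) for a cue-annotated path."""
        parts, cues = [], set()
        for component in path.split("/"):
            if component.startswith(".cue=") and component[5:] in KNOWN_CUES:
                cues.add(component[5:])
            else:
                parts.append(component)
        return "/".join(parts), cues

    print(split_cues("/data/.cue=Lax/.cue=BestVersion/results.txt"))
    # ('/data/results.txt', {'Lax', 'BestVersion'})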
As some semantic cues allow an application to modify files irrespective of the version, a file can be represented at different locations with different contents. WheelFS resolves this by deterministically selecting an arbitrary copy of the file. Note that it is up to the application to cope with this behavior.
The Lax cue is comparable to the lazy flag implemented by Ceph, but while Ceph lets an application control consistency by explicitly doing the synchronization itself, WheelFS offers different consistency models.

6.2. Fault tolerance


By selecting a semantic cue, an application implicitly selects failure behavior. When using strict semantics, file or directory operations cannot proceed when the principal node cannot be contacted. The Lax cue allows file access even when the network is partitioned, as long as a node containing a replica is still accessible.

6.3. Performance
As stated, WheelFS allows an application to trade consistency for performance. High performance is obtained by offering weak semantics, so that file data can be read cooperatively from the nearest locations, and by having the distribution function select the local machine for storing newly created files.

7. Summary
This paper gave an overview of the high-level designs of five modern distributed file systems. For each file system the consistency model and fault tolerance mechanisms were described.
The main goal of distributed file systems is scalability. Furthermore, by running on workstations a distributed file system can offer a logically centralized file system without the need for extra servers; Farsite is an example of such a system.
Most distributed file systems described in this paper sacrifice file system semantics for performance. Some, like WheelFS, leave it up to the application to select appropriate semantics. Ceph offers an open flag that lets applications propagate local modifications to the data servers themselves. Another solution is to use snapshots for readers and to limit the number of writers when a file is opened by many applications; the snapshots are not synchronized after opening. This is what Farsite does.
zFS and Ceph adhere to sequential consistency: they synchronize access to a file when multiple applications modify it. Leases are used by readers to assure they do not read stale data; when an application opens the file for writing, the read leases are revoked. Leases have an expiration time, so when a network node fails the file becomes available again after the expiration time has passed.
GFS ties the file system to its applications. Instead of attempting to deal with write failures, it simply leaves files in an inconsistent state, and subsequent reads from different applications may see different data. It is up to the application to deal with such inconsistencies.
Almost all distributed file systems described in this paper replicate file data for fault tolerance; only zFS assumes reliable storage. Metadata is also typically replicated. Only GFS does not replicate metadata: it uses a centralized metadata server that can recover quickly by logging updates and restoring its state from the log after a failure. Ceph journals metadata.
In WheelFS, the semantic cues that define the consistency on a per-file basis also define the failure behavior.
References
[1] A. Adya, W. Bolosky, M. Castro, R. Chaiken, G. Cermak, J. Douceur, J. Howell, J. Lorch, M. Theimer, and R. Wattenhofer, FARSITE: Federated, Available, and Reliable Storage for an Incompletely Trusted Environment, OSDI (2002).
[2] M. Castro and B. Liskov, Practical Byzantine Fault Tolerance, OSDI (1999).
[3] O. Rodeh and A. Teperman, zFS - A Scalable Distributed File System Using Object Disks, MSS (2003).
[4] S. Ghemawat, H. Gobioff, and S. Leung, The Google File System, ACM SOSP (2003).
[5] S. A. Weil, S. A. Brandt, E. L. Miller, and D. D. E. Long, Ceph: A Scalable, High-Performance Distributed File System, OSDI (2006).
[6] B. Welch, POSIX IO Extensions for HPC, FAST (2005).
[7] D. Roselli, J. Lorch, and T. Anderson, A Comparison of File System Workloads, USENIX (2000).
[8] J. Stribling, E. Sit, M. F. Kaashoek, J. Li, and R. Morris, Don't Give Up on Distributed File Systems, IPTPS (2007).
[9] B. Cohen, Incentives Build Robustness in BitTorrent, Workshop on Economics of Peer-to-Peer Systems (2003).
