Mischa Geldermans
Vrije Universiteit Amsterdam
Abstract
This paper gives an overview of five modern distributed file systems. For each file system, a high-level
design is given and the consistency model and fault tolerance mechanisms are described. By focusing
on the consistency model and fault tolerance, the paper compares the design choices these
distributed file systems make to obtain reasonable performance.
1. Introduction
Data is stored in files. The software that manages files is called a file system. A file system offers a
namespace to address files and administrates the structures to store both file names and file data on a
storage medium.
With the use of computer networks network file systems were employed to have access to one’s files
irrespective of one’s network location. Initially, network file systems were centralized and consisted of a
network interface on top of a local file system. In order to scale, the file system can be distributed over
different nodes in the network. Such a file system is called a distributed file system.
Files are often used to share data among different applications. In a distributed file system files are
also essential for distributed applications to share information.
Distributed file systems often replicate data. As the system is using multiple nodes in the network,
the file system needs to be tolerant to failing nodes. If it were not, the availability of the whole file system
would decrease with the size of the system.
Many distributed file systems attempt to replicate file data transparently as much as possible, as that
is the interface that local file systems offer. The POSIX file system interface is an example of a
sequentially consistent file system interface. It is the consistency model that reflects the actual
semantics of a distributed file system.
The consistency model also influences the performance of a distributed file system to a large extent.
Maintaining sequential consistency is costly when there are many replicas or when there are many
concurrent writes to a file. It is the combination of a consistency model, fault tolerance, and
performance optimizations that distinguishes one distributed file system from another.
This paper gives an overview of the design choices with respect to consistency model, fault tolerance,
and performance that are made in modern distributed file systems. Note that the consistency model
described is that of the file system interface offered to applications. Internally, the file system might
use other consistency mechanisms.
Five modern distributed file systems are described in the sections below, ranging from mere designs
(zFS) to mission-critical systems (GFS) to very complex systems (Ceph).
2. Farsite
FARSITE, the Federated, Available, and Reliable Storage for an Incompletely Trusted Environment [1],
is a distributed file system developed by Microsoft. Farsite is a serverless system: it uses a network of
untrusted desktop workstations for the distribution of the file system data. By using the users’ workstations,
it combines the advantages of a logically centralized file server, sharing and reliability, with the advantages
of desktop file systems, locality and low cost, without the use of additional systems.
Workstations act as application servers, metadata servers, and data servers. A group of metadata servers
that all store a replica of the same portion of the namespace uses a Byzantine fault tolerant protocol [2].
When the metadata such a group manages grows too large, the group randomly selects a set of systems to
manage a portion of the metadata as a new group.
When an application requests to open a file, the group of metadata servers that manages the file proves
authority over the file by presenting a namespace certificate signed by its parent, recursively up to the root
of the namespace. The group then sends a list of data servers that store encrypted replicas of the file. The
application retrieves a replica and uses its private key to decrypt the data. On writing, the group checks
whether the application is allowed to write to the file and, if so, instructs the data servers to retrieve
encrypted copies of the newly written data. File data is cached locally at the application server.
2.1. Consistency model
The goal of Farsite is to offer an interface that resembles the NTFS file system interface. It diverges
from the NTFS interface when maintaining consistency becomes too much of a burden. When too many
readers access a file concurrently, some will receive a snapshot of the data at the time of opening the file.
The snapshot does not reflect changes made afterwards. For concurrent writers, the system does not accept
additional writers above a certain threshold.
The metadata servers maintain the consistency of data by issuing leases to clients that access the data.
Different classes of leases exist to control different data entities, separating namespace consistency from
data consistency.
When an application opens a file, a lease is requested for either read-only or read/write access. For
reading, the lease assures that the data is not stale; for writing, it entitles an application to write to its local
cache. Leases are revoked when another application opens the same file in a conflicting mode. On
closure the lease is not canceled immediately, as the file might be reopened quickly. Namespace leases
allow an application to modify subtrees without contacting metadata servers.
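The lease behavior described above can be sketched as follows. This is an illustrative model only, not Farsite's actual protocol: the class and method names are invented here, and the real system distinguishes more lease classes than the two modes shown.

```python
class LeaseManager:
    """Illustrative per-file lease table in the spirit of Farsite's
    metadata servers; names and structure are hypothetical."""

    def __init__(self):
        self.leases = {}  # path -> {client: mode}, mode is "read" or "read/write"

    def open(self, path, client, mode):
        holders = self.leases.setdefault(path, {})
        # A writer conflicts with everyone; a reader conflicts with writers.
        conflicts = [c for c, m in holders.items()
                     if c != client and (mode == "read/write" or m == "read/write")]
        for c in conflicts:
            self.revoke(path, c)  # in a real system: force c to flush its cache
        holders[client] = mode

    def revoke(self, path, client):
        self.leases.get(path, {}).pop(client, None)

    def close(self, path, client):
        # Leases survive close: the file might be reopened quickly, so
        # the lease is only given up when a conflicting open revokes it.
        pass
```

Note that `close` is deliberately a no-op: as described above, a lease outlives the file handle and is reclaimed lazily on conflict.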
2.3. Performance
Farsite is designed as a serverless system at the cost of scalability. It will not scale beyond a network
with a few thousand nodes. It also assumes sequential data access patterns and does not attempt to prevent
leases from migrating frequently.
Modified data is lazily written to the data servers. By keeping written data in the cache, it might be
obsoleted by subsequent modifications before it is ever transferred.
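A minimal sketch of this lazy write-back idea, with hypothetical names: later writes to the same range replace pending ones in the cache, so only the latest version ever reaches a data server.

```python
class WriteBackCache:
    """Illustrative lazy write-back buffer: repeated writes to the same
    range coalesce in the cache and only the last version is flushed."""

    def __init__(self, store):
        self.store = store   # stand-in for the data server, written on flush
        self.dirty = {}      # (path, offset) -> bytes

    def write(self, path, offset, data):
        # No network traffic here; the write only updates the local cache.
        self.dirty[(path, offset)] = data

    def flush(self):
        # One transfer per (path, offset), regardless of how many times
        # the application rewrote that range before the flush.
        for (path, offset), data in self.dirty.items():
            self.store.append((path, offset, data))
        self.dirty.clear()
```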
3. zFS
zFS [3] is a distributed file system designed at IBM Labs, Israel. IBM’s zFS is not related to Sun’s
Zettabyte File System (ZFS).
Physically, zFS consists of application servers and data servers; logically, it also includes a file
manager, a lease manager, a transaction server, and a cooperative cache. Unlike the data servers used by
GFS and Ceph, described below, zFS data servers store both file metadata and file data.
Every opened file in zFS is managed by a file manager, located at the application node where the file
was initially opened. A file manager can manage multiple files. When an I/O request cannot be served by
the local cache of the application server, the file manager requests a lease for the I/O request from the lease
manager. For reads, the file manager then checks whether another application has the data cached. If so, it
forwards the request and lease to the cache containing the data. For writes, all other overlapping leases are
revoked synchronously and the data is written to the local cache.
The transaction server performs directory operations. The cooperative cache is the logical union of
all caches in the application servers.
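The cooperative read path can be sketched as below. This is a simplified model, not zFS's actual code: the function and the `Disk` stand-in are invented for illustration, and the real system routes remote lookups through the file manager rather than probing peers directly.

```python
class Disk:
    """Stand-in for a data server's disk; counts accesses."""
    def __init__(self):
        self.reads = 0

    def read(self, block_id):
        self.reads += 1
        return f"disk:{block_id}"

def coop_read(block_id, local_cache, peer_caches, disk):
    """Illustrative zFS-style cooperative read: prefer any node's memory
    over a disk access, following the assumption that network bandwidth
    far exceeds disk bandwidth."""
    if block_id in local_cache:               # 1. local memory
        return local_cache[block_id]
    for peer in peer_caches:                  # 2. another node's memory
        if block_id in peer:
            local_cache[block_id] = peer[block_id]
            return local_cache[block_id]
    local_cache[block_id] = disk.read(block_id)   # 3. fall back to disk
    return local_cache[block_id]
```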
3.3. Performance
The main assumption in zFS is that network bandwidth is much higher than disk bandwidth. By
combining all local caches into a cooperative cache, preferring network transfers over disk transfers, zFS
will perform well as long as this assumption holds. As the objective of zFS is to deal with configurations
ranging from a few servers to at most a few thousand, this is probably the case.
4. GFS
The Google File System [4] (GFS) is a distributed file system developed by Google. It is used by
Google only and specifically built for their needs. The file system and the applications are co-designed.
GFS is optimized for very large files and very large disk blocks. Mainly supporting very large files
stored in very large disk blocks reduces the amount of metadata to the level that it can be maintained
in memory by a single metadata server.
The metadata server manages the file namespace, the mapping from files to disk blocks, and the data
server locations of each replica of a disk block. Disk blocks are managed by the data servers.
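The three metadata tables just described can be pictured as follows. This is an illustrative layout with invented names (real GFS calls its large disk blocks "chunks" and holds richer state); it shows why one client round trip to the metadata server suffices to locate data.

```python
# Hypothetical in-memory metadata of a GFS-style master server.
metadata = {
    "namespace": {"/logs/web.0", "/logs/web.1"},
    "file_to_chunks": {
        "/logs/web.0": ["chunk-11", "chunk-12"],   # ordered chunk list
        "/logs/web.1": ["chunk-13"],
    },
    "chunk_locations": {
        "chunk-11": ["ds1", "ds2", "ds3"],         # three replicas each
        "chunk-12": ["ds2", "ds3", "ds4"],
        "chunk-13": ["ds1", "ds3", "ds4"],
    },
}

def lookup(path, chunk_index):
    """One master round trip: translate (file, chunk index) into a chunk
    id and its replica locations. Clients cache the answer, so the
    metadata server is rarely consulted again for the same chunk."""
    chunk = metadata["file_to_chunks"][path][chunk_index]
    return chunk, metadata["chunk_locations"][chunk]
```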
4.3. Performance
The designers of GFS mainly tried to minimize the involvement of the metadata server in all
operations so that it does not become a bottleneck. By assuming a modest number of very large files stored
in very large disk blocks, the metadata server only rarely needs to be consulted for obtaining block locations.
Another assumption is that file system load mainly stems from large sequential reads and large
sequential append operations, causing disk bandwidth to be used efficiently. Data servers do not cache disk
blocks themselves; they rely on the underlying local file system to do caching at a finer granularity. As the
file system is closely tied to the applications, the workload will probably live up to these assumptions.
To use network bandwidth efficiently, data is forwarded linearly over the replicas. Knowing the
network topology, each replica can forward data to the closest server in the network that has not yet received it.
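This nearest-first forwarding chain can be sketched as below. The function and the distance model are hypothetical; the sketch only builds the chain order, where a real system would stream the data along each hop.

```python
def pipeline(servers, distance):
    """Illustrative GFS-style pipelined replication: starting from the
    client, each hop pushes the data to the nearest server that has not
    yet received it, so each network link is used for one transfer."""
    order, current = [], "client"
    remaining = set(servers)
    while remaining:
        nxt = min(remaining, key=lambda s: distance(current, s))
        order.append(nxt)        # in a real system: stream the data to nxt
        remaining.discard(nxt)
        current = nxt
    return order                 # the forwarding chain that was built

# Hypothetical topology: distance = absolute difference of rack positions.
POS = {"client": 0, "ds1": 1, "ds2": 2, "ds3": 5}

def rack_distance(a, b):
    return abs(POS[a] - POS[b])
```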
5. Ceph
Ceph [5] is a distributed file system developed at the University of California, Santa Cruz. Its main
characteristic is the clear separation between data and metadata at the server level, like GFS. But unlike
GFS—which has a single metadata server, the master—Ceph has multiple metadata servers. Ceph goes a
long way to eliminate the central component that is present in GFS and to deal with many small files as
well.
By using a data distribution function, metadata servers can locate data servers and direct an
application there when it creates or opens a file. In this way, the location of a data server is calculated
instead of being looked up.
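The idea of a calculated location can be sketched with a simplified stand-in for Ceph's actual distribution function (CRUSH), which handles weights and failure domains that are omitted here; all names and the hashing scheme below are illustrative.

```python
import hashlib

def placement(object_name, num_pgs, replicas, all_servers):
    """Simplified stand-in for Ceph's distribution function: any client
    can compute an object's placement group and its data servers from
    the name alone, with no lookup at a metadata server."""
    h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
    pg = h % num_pgs                       # object -> placement group
    # placement group -> a deterministic set of distinct data servers
    start = pg % len(all_servers)
    return pg, [all_servers[(start + i) % len(all_servers)]
                for i in range(replicas)]
```

Because the function is deterministic, every client computes the same answer, which is exactly what makes the lookup unnecessary.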
Low level block allocation is dealt with by the data servers. File data is striped over data servers and
all data servers that store a set of files are grouped logically into placement groups. A data server stores
pieces of multiple placement groups.
Caching, if possible, is done by application servers.
5.3. Performance
Ceph claims to have excellent performance by explicitly setting apart system resources, the metadata
servers, for metadata lookups. No metadata is shared by the data and metadata servers. The metadata
servers manage the namespace and some file information. The latter is maintained by the metadata servers
in order to optimize for the common case: obtaining statistics for all files in a directory. As metadata
lookups make up half of typical file system workloads, Ceph is optimized for that [7].
Furthermore, for each file the name and information are stored sequentially. This way, an entire
directory can be fetched using a single read request.
Each metadata server maintains file usage statistics using a decay filter. Periodically these statistics
are compared system wide and directories are migrated to balance the load at the metadata server level.
Ceph does not attempt to locate metadata close to the applications using it.
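The decay filter mentioned above can be sketched as an exponentially decaying popularity counter. The class below is an illustrative simplification (the half-life parameter and update rule are assumptions, not Ceph's exact formulation).

```python
class DecayCounter:
    """Illustrative exponentially decaying usage counter: recent
    accesses dominate, old ones fade with a configurable half-life."""

    def __init__(self, half_life):
        self.half_life = half_life
        self.value = 0.0
        self.last = 0.0          # timestamp of the last update

    def hit(self, now, weight=1.0):
        # Decay the old count toward zero, then add the new event.
        elapsed = now - self.last
        self.value *= 0.5 ** (elapsed / self.half_life)
        self.value += weight
        self.last = now
```

Comparing such counters across metadata servers gives a load estimate that is biased toward recent activity, which is what directory migration needs.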
The main performance issue in Ceph is the primary role given to data servers. Having all applications
interact with a primary data server for each file, and having that data server synchronously forward writes
to the replicas, shows that Ceph focuses on improving metadata performance.
6. WheelFS
WheelFS [8] is a distributed file system created at MIT. It uses cooperative reading like BitTorrent [9]
and a distribution function like the one Ceph uses. However, like zFS, and unlike Ceph, WheelFS does not
partition files over different nodes.
What sets WheelFS apart from the others is the trade-off it offers to applications. Instead of offering
a single consistency model it lets applications control consistency through semantic cues, leaving it up to an
application to trade consistency for performance. By default weak consistency is used. Seven cues are
envisioned; the designers of WheelFS state that the exact set is still subject to research. They are described
below.
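Since the exact cue set was still open when WheelFS was proposed, the sketch below uses invented cue names purely to illustrate the mechanism: per-open cues override weak-consistency defaults, so the application chooses the trade-off per file.

```python
# Hypothetical cue-controlled open; cue names are illustrative only,
# not the actual WheelFS cues.
DEFAULTS = {"consistency": "weak", "max_wait_ms": None}

def open_with_cues(path, **cues):
    """Merge per-open semantic cues over the weak-consistency defaults,
    letting an application trade consistency for performance per file."""
    unknown = set(cues) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown cues: {sorted(unknown)}")
    settings = {**DEFAULTS, **cues}
    return path, settings        # a real client would act on `settings`
```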
6.3. Performance
As stated, WheelFS allows an application to trade consistency for performance. High performance is
obtained by offering weak semantics to read file data cooperatively from the nearest locations and by
having the distribution function select the local machine for storing newly created files.
7. Summary
This paper gave an overview of the high-level designs of five modern distributed file systems. For
each file system, the consistency model and fault tolerance mechanisms were described.
The main goal of distributed file systems is scalability. Furthermore, by running on workstations, a
distributed file system can offer a logically centralized file system without the need for extra servers.
Farsite is an example of such a system.
Most distributed file systems described in this paper sacrifice file system semantics for performance.
Some leave it up to the application to select appropriate semantics, like WheelFS. Ceph offers an open flag
to propagate local modifications to data servers. Another solution, used by Farsite, is to serve snapshots to
readers and to limit the number of writers when a file is opened by many applications; the snapshots are
not synchronized after opening.
zFS and Ceph adhere to sequential consistency: they synchronize access to a file when multiple
applications modify it. Leases are used by readers to assure they do not read stale data. When an
application opens the file for writing, the read leases are revoked. Leases have an expiration time; when a
network node fails, the file becomes available again after the expiration time has passed.
GFS ties the file system to applications. Instead of attempting to deal with failures, it simply leaves
files in an inconsistent state. Subsequent reads from different applications may see different data. It is up
to the application to deal with such inconsistencies.
Almost all distributed file systems described in this paper replicate file data for fault tolerance. Only
zFS assumes reliable storage. Metadata is also typically replicated. Only GFS does not replicate metadata:
it uses a centralized metadata server that can recover quickly by logging updates and restoring its state
from the log after a failure. Ceph journals metadata.
In WheelFS, the semantic cues that define the consistency on a per file basis also define failure
behavior.
References
1. A. Adya, W. Bolosky, M. Castro, R. Chaiken, G. Cermak, J. Douceur, J. Howell, J. Lorch, M.
Theimer, and R. Wattenhofer, FARSITE: Federated, Available, and Reliable Storage for an
Incompletely Trusted Environment, OSDI (2002).
2. M. Castro and B. Liskov, Practical Byzantine Fault Tolerance, OSDI (1999).
3. Ohad Rodeh and Avi Teperman, zFS - A Scalable Distributed File System Using Object Disks, MSS
(2003).
4. S. Ghemawat, H. Gobioff, and S. Leung, The Google File System, ACM SOSP (2003).
5. Sage A. Weil, Scott A. Brandt, Ethan L. Miller, and Darrell D. E. Long, Ceph: A Scalable,
High-Performance Distributed File System, OSDI (2006).
6. B. Welch, POSIX IO extensions for HPC, FAST (2005).
7. D. Roselli, J. Lorch, and T. Anderson, A comparison of file system workloads, USENIX (2000).
8. Jeremy Stribling, Emil Sit, M. Frans Kaashoek, Jinyang Li, and Robert Morris, Don’t Give Up on
Distributed File Systems, IPTPS07 (2007).
9. B. Cohen, Incentives build robustness in BitTorrent, Workshop on Economics of Peer-to-Peer
Systems (2003).