System Scalability
Data Management in Distributed Systems
A key tenet in the design of massively scalable systems is ensuring that each participating
node manages only a bounded amount of state, independent of cluster size. Accomplishing
this requires that no single master node be responsible for maintaining all data and metadata
in the clustered system. This principle is paramount in truly scalable architectures, and it
is very difficult to retrofit into legacy architectures.
Nutanix Distributed Filesystem (NDFS) efficiently manages all types of data in order to scale
out capacity linearly for both small and large-size clusters, and without loss of node or
cluster performance.
Configuration Data
NDFS stores cluster configuration data using a very small in-memory database backed by
solid-state drives (SSDs). Three copies of this configuration database are maintained in the
cluster at all times. Importantly, there are strict upper bounds that are honored to ensure
that this database never exceeds a few megabytes in size.
Even for a hypothetical million node cluster, the database holding configuration data
(e.g., identity of participating nodes, health status of services, etc.) for the entire cluster
would only be a few megabytes in size. In the event that one of the three participating
nodes fails or becomes unavailable, any other node in the cluster can be seamlessly
converted to a configuration node.
Metadata
The most important and complex part of a filesystem is its metadata. In a scalable
filesystem, the amount of metadata can potentially get very large. Further complicating
the task, it is not possible to hold the metadata centrally in a few designated nodes or
in memory.
NDFS employs multiple NoSQL concepts to scale the storage and management of
metadata. For example, the system uses a NoSQL database, Cassandra, to maintain
key-value pairs, where the key is an offset within a particular virtual disk and the
value records the physical locations of the replicas of that data in the cluster.
When a key needs to be stored, a consistent hash is used to calculate the locations where
the key and value will be stored in the cluster. The consistent hash function is responsible for
uniformly distributing the load of storing keys in the cluster. As the cluster grows or shrinks,
the ring self-heals and rebalances key storage responsibility among the participating nodes.
This ensures that every node will be responsible for managing roughly the same amount
of metadata.
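The consistent-hash ring described above can be sketched as follows. This is an illustrative toy, not NDFS internals: the node names, virtual-node count, and MD5-based hash function are all assumptions. It shows how key ownership spreads uniformly over the ring and how each key maps to a fixed set of replica nodes:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring: each node owns the arc of hash space
    between its predecessor's position and its own."""

    def __init__(self, nodes, vnodes=64):
        # Virtual nodes smooth out the per-node share of the key space.
        self.ring = []  # sorted list of (hash_position, node) pairs
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}:{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def lookup(self, key, replicas=3):
        """Return the distinct nodes responsible for `key` and its replicas:
        walk clockwise from the key's position until `replicas` nodes found."""
        pos = bisect.bisect(self.ring, (self._hash(key),))
        owners, i = [], pos
        while len(owners) < replicas:
            _, node = self.ring[i % len(self.ring)]
            if node not in owners:
                owners.append(node)
            i += 1
        return owners

ring = ConsistentHashRing([f"node-{n}" for n in range(8)])
print(ring.lookup("vdisk42:offset:1048576"))  # three distinct owner nodes
```

When a node joins or leaves, only the keys on the arcs adjacent to it change ownership, which is what lets the ring rebalance without global reshuffling.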
The NDFS architecture ensures that the amount of data stored on a cluster node is directly
proportional to the amount of storage space on that node including both SSD and HDD
capacities. This by definition is bounded and does not depend on the size of the cluster.
A system component called Curator, which is responsible for keeping the cluster running
smoothly, periodically runs distributed map-reduce tasks in the background to check for
uneven disk utilization across the nodes in the cluster.
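A minimal sketch of the kind of balance check such a scan might perform; the function name, tolerance threshold, and stats format here are hypothetical, not Curator's actual interface:

```python
def find_imbalanced_nodes(utilization, tolerance=0.15):
    """Flag nodes whose disk utilization deviates from the cluster mean
    by more than `tolerance`. `utilization` maps node name -> fraction
    of capacity used (0.0 - 1.0)."""
    mean = sum(utilization.values()) / len(utilization)
    return {node: used for node, used in utilization.items()
            if abs(used - mean) > tolerance}

stats = {"node-1": 0.62, "node-2": 0.58, "node-3": 0.91, "node-4": 0.55}
print(find_imbalanced_nodes(stats))  # -> {'node-3': 0.91}
```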
When a VM creates data, NDFS keeps one copy resident on the local node for optimal
performance, and distributes a redundant copy across other nodes in the cluster. Distributed
replication enables quick recovery in the event of a disk or node failure. All remote Nutanix
nodes participate in the replication. This enables higher I/O for larger size clusters as there
are more nodes handling the replication.
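The placement policy described above, one local copy plus scattered remote copies, can be sketched as follows; the function name and replication-factor parameter are illustrative, not the NDFS API:

```python
import random

def place_replicas(local_node, cluster_nodes, rf=2):
    """Keep one copy on the node where the VM runs for fast local reads,
    and scatter the remaining rf-1 copies across distinct remote nodes
    so that recovery after a failure draws on many nodes at once."""
    remotes = [n for n in cluster_nodes if n != local_node]
    return [local_node] + random.sample(remotes, rf - 1)

nodes = [f"node-{i}" for i in range(1, 6)]
print(place_replicas("node-2", nodes))  # first entry is always the local node
```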
In traditional, non-converged architectures, I/O from a single VM may arrive via multiple
interfaces due to network multi-pathing and controller load balancing. This forces legacy
filesystems to use fine-grained locks to avoid constant transfers of lock ownership between
controllers on writes, and massive invalidation chatter on the network. Such use of
fine-grained locks in architectures employing n-way controllers imposes scalability
bottlenecks that are nearly impossible to overcome.
In NDFS, I/O for a particular VM is served by the local Controller VM running on the same
host. That Controller VM acquires the locks for all of the virtual disks backing the VM.
Because virtual disks are not typically shared with other hosts, NDFS can simply use a
coarse, vDisk-level lock.
As such, there are no invalidations or cache coherency issues. Even with a very large number
of VMs, NDFS effectively manages locks with minimal overhead to ensure that the system
can still scale linearly.
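As a sketch of why coarse, per-vDisk locking stays cheap, consider a simple lock table keyed by vDisk ID; the class and method names here are hypothetical:

```python
import threading
from collections import defaultdict

class VDiskLockTable:
    """One lock per vDisk: because a virtual disk is served by exactly one
    local Controller VM, a single coarse lock suffices and there is no
    cross-controller invalidation traffic to manage."""

    def __init__(self):
        self._locks = defaultdict(threading.Lock)

    def acquire(self, vdisk_id):
        self._locks[vdisk_id].acquire()

    def release(self, vdisk_id):
        self._locks[vdisk_id].release()

    def is_locked(self, vdisk_id):
        return self._locks[vdisk_id].locked()

table = VDiskLockTable()
table.acquire("vm1-disk0")   # local CVM takes ownership of the whole vDisk
# ... all I/O for vm1-disk0 proceeds without further lock negotiation ...
table.release("vm1-disk0")
```

The state is one lock per vDisk rather than one per block or extent, so the lock table stays small regardless of how much I/O each vDisk receives.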
Alerts
One of the more common oversights in designing scalable systems is failing to build
troubleshooting facilities that scale with the system. When system services become
unstable, it is even more important to have an alert and event recording system that
does not itself buckle for lack of scalability.
Each host in the Nutanix cluster runs an agent that gathers local statistics and periodically
updates the NoSQL store. When the GUI requests data in sorted order (e.g., the top 10
CPU-consuming VMs), the request is dispatched through a map-reduce framework to all hosts
in the cluster to gather live information in a scalable fashion. Each host is responsible
only for serving requests about its local stats.
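The scatter-gather query above can be sketched as a two-phase top-k merge; the data shapes and names are illustrative:

```python
import heapq

def top_k_cpu(per_host_stats, k=10):
    """Map: each host returns only its local top-k VMs by CPU usage.
    Reduce: merge the small per-host partials into a cluster-wide top-k,
    so no node ever sorts the full cluster-wide VM list."""
    partials = []
    for host_vms in per_host_stats.values():
        # "map" phase: each host ranks only its own, bounded set of VMs
        partials.extend(heapq.nlargest(k, host_vms, key=lambda vm: vm[1]))
    # "reduce" phase: merge the partial results
    return heapq.nlargest(k, partials, key=lambda vm: vm[1])

stats = {
    "host-a": [("vm1", 85.0), ("vm2", 12.5)],
    "host-b": [("vm3", 64.0), ("vm4", 91.0)],
}
print(top_k_cpu(stats, k=2))  # -> [('vm4', 91.0), ('vm1', 85.0)]
```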
Nutanix also provides cluster-wide statistics, which are handled by a dynamically elected
leader in the cluster (see the discussion of leaders below). Strict limits are enforced
on the number of such cluster-wide statistics to ensure overall scalability.
It is not necessary, however, for the leaders for all functions to be co-located on any
given node. In fact, leaders are randomly distributed in order to spread the leadership
load among cluster nodes. When an elected leader fails, either due to the failure of the
service or of the host itself, a new leader is automatically elected from among the
healthy nodes in the cluster.
Strict Consistency
NoSQL databases typically offer only eventual consistency. While this works well for some
systems, it would cause corruption in a storage filesystem. It may therefore appear a
natural conclusion that NoSQL systems are ill suited for building scalable filesystems.
NDFS, however, achieves strict consistency on top of NoSQL by implementing a distributed
version of the Paxos algorithm.
Paxos is a widely used protocol that builds consensus among nodes in clustered systems.
In the NDFS metadata store, for each key there is a set of three nodes that might have the
latest value of the data. NDFS runs the Paxos algorithm to obtain consensus on the latest
value for the key being requested. Paxos guarantees that the most recent version of the data
will get consensus. Strict consistency is guaranteed even though the underlying data store is
based on NoSQL. For any key, the consensus needs to be formed between only three nodes
regardless of cluster size.
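A highly simplified illustration of the three-node quorum read, not the actual Paxos message exchange, which also involves proposal numbers and prepare/accept rounds; the value format is assumed:

```python
def latest_value(responses):
    """Each of the three replica nodes reports its (version, value) for the
    key; this sketch simply returns the highest-versioned pair, so a stale
    replica cannot win. Real Paxos would additionally have a majority
    re-accept that value before it is returned to the reader."""
    assert len(responses) == 3, "the consensus set is always three nodes"
    return max(responses)  # tuples compare by version first

# The replica holding version 6 lags after a transient failure; the
# newest committed version still wins consensus.
print(latest_value([(7, "node-1,node-4"), (6, "node-1,node-2"),
                    (7, "node-1,node-4")]))  # -> (7, 'node-1,node-4')
```

The key point from the text survives even in this toy: the set of participants is always three, so the cost of agreement is constant regardless of cluster size.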
The ability to achieve strict consistency using a scalable NoSQL database enables NDFS to
be a linearly scalable filesystem.
Background Tasks
For the relatively mundane work that must be performed by a typical filesystem, it is far
more efficient and scalable to complete common tasks in the background. User-visible
performance benefits from offloading such tasks from the active I/O path to background
processes. Typical examples include disk scrubbing, disk balancing, offline compression,
metadata scrubbing, and calculation of free space.
Each node is responsible for a bounded number of map-reduce tasks, which are processed
by all cluster nodes in phases. A coordinating node is randomly elected and performs only
the lightweight work of coordinating the other nodes in the cluster.
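As an illustration of the map-reduce split, here is a toy free-space calculation in the style described above; the extent-map format is an assumption for the example:

```python
from collections import Counter

def map_phase(local_extents):
    """Map step (runs on every node over its own disks): emit per-state
    byte counts for the node's local extents only."""
    counts = Counter()
    for extent in local_extents:
        counts[extent["state"]] += extent["bytes"]
    return counts

def reduce_phase(partial_counts):
    """Reduce step (runs on the elected coordinator): sum the small,
    bounded per-node partials into a cluster-wide total."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

# Hypothetical extent maps on two nodes
node1 = [{"state": "free", "bytes": 4096}, {"state": "used", "bytes": 8192}]
node2 = [{"state": "free", "bytes": 12288}]
print(reduce_phase([map_phase(node1), map_phase(node2)]))
# free: 16384 bytes, used: 8192 bytes
```

Each node touches only its own extents, and the coordinator handles one small partial result per node, matching the bounded-state principle the section opens with.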