Nutanix Tech Note

System Scalability
Data Management in Distributed Systems

A key tenet in the design of massively scalable systems is ensuring that each participating
node manages only a bounded amount of state, independent of cluster size. Accomplishing
this requires that there be no master node responsible for maintaining all data and metadata
in the clustered system. This concept is paramount in truly scalable architectures, and one
that is very difficult to retrofit into legacy architectures.

The Nutanix Distributed Filesystem (NDFS) efficiently manages all types of data so that
capacity scales out linearly for both small and large clusters, without loss of node or
cluster performance.

Configuration Data
NDFS stores cluster configuration data using a very small in-memory database backed by
solid-state drives (SSDs). Three copies of this configuration database are maintained in the
cluster at all times. Importantly, strict upper bounds are enforced to ensure that this
database never exceeds a few megabytes in size.

Even for a hypothetical million node cluster, the database holding configuration data
(e.g., identity of participating nodes, health status of services, etc.) for the entire cluster
would only be a few megabytes in size. In the event that one of the three participating
nodes fails or becomes unavailable, any other node in the cluster can be seamlessly
converted to a configuration node.

Metadata
The most important and complex part of a filesystem is its metadata. In a scalable
filesystem, the amount of metadata can potentially get very large. Further complicating
the task, it is not possible to hold the metadata centrally in a few designated nodes or
in memory.

NDFS employs multiple NoSQL concepts to scale the storage and management of
metadata. For example, the system uses a NoSQL database (Cassandra) to maintain
key-value pairs, where the key identifies an offset within a particular virtual disk and the
value records the physical locations of the replicas of that data in the cluster.

When a key needs to be stored, a consistent hash is used to calculate the locations where
the key and value will be stored in the cluster. The consistent hash function is responsible for
uniformly distributing the load of storing keys in the cluster. As the cluster grows or shrinks,
the consistent-hash ring self-heals and rebalances key-storage responsibility among the participating nodes.
This ensures that every node will be responsible for managing roughly the same amount
of metadata.
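
To illustrate the idea, the sketch below shows a minimal consistent-hash ring in Python. It is
not the NDFS implementation; the node names, the number of virtual nodes, and the
replication factor of three are assumptions made for the example.

    import hashlib
    from bisect import bisect_right

    class Ring:
        """Minimal consistent-hash ring (illustrative only)."""

        def __init__(self, nodes, vnodes=64):
            # Each physical node owns several points ("virtual nodes") on the
            # ring so that keys spread roughly evenly across nodes.
            self.points = sorted(
                (self._hash(f"{node}:{i}"), node)
                for node in nodes for i in range(vnodes)
            )

        @staticmethod
        def _hash(s):
            return int(hashlib.md5(s.encode()).hexdigest(), 16)

        def replicas(self, key, n=3):
            """Return the n distinct nodes responsible for storing this key."""
            idx = bisect_right(self.points, (self._hash(key),))
            owners = []
            for _, node in self.points[idx:] + self.points[:idx]:
                if node not in owners:
                    owners.append(node)
                if len(owners) == n:
                    break
            return owners

    ring = Ring([f"node-{i}" for i in range(8)])
    # Metadata key: an offset within a particular vDisk.
    print(ring.replicas("vdisk-1234:offset-1048576"))

When a node joins or leaves, only the keys adjacent to its positions on the ring change
owners, which is what allows responsibility to rebalance without any central master.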


Virtual Machine Data and I/O
Every Nutanix node includes a Controller Virtual Machine (CVM) to handle all data I/O
operations for the local hypervisor and guest VMs, and to serve as a gateway to NDFS.
This n-way controller model means that the number of CVMs grows one-for-one with the
number of Nutanix nodes in the cluster, eliminating the controller bottlenecks that occur
in traditional arrays.

The NDFS architecture ensures that the amount of data stored on a cluster node is directly
proportional to the amount of storage space on that node, including both SSD and HDD
capacity. This amount is by definition bounded and does not depend on the size of the
cluster. A system component called Curator, which is responsible for keeping the cluster
running smoothly, periodically runs distributed background map-reduce tasks to check for
uneven disk utilization across the nodes in the cluster.

When a VM creates data, NDFS keeps one copy resident on the local node for optimal
performance, and distributes a redundant copy across other nodes in the cluster. Distributed
replication enables quick recovery in the event of a disk or node failure. All remote Nutanix
nodes participate in the replication, which yields higher aggregate replication I/O in larger
clusters because more nodes share the work.
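
The placement policy can be sketched as follows; the function and parameter names are
hypothetical, and a replication factor of two reflects the local copy plus one redundant copy
described above.

    import random

    def place_replicas(local_node, cluster_nodes, rf=2):
        """Sketch of data-locality placement: one replica stays on the node
        running the VM; the remaining rf - 1 copies go to randomly chosen
        remote nodes, spreading replication work across the cluster."""
        remotes = [n for n in cluster_nodes if n != local_node]
        return [local_node] + random.sample(remotes, rf - 1)

    nodes = [f"node-{i}" for i in range(16)]
    print(place_replicas("node-3", nodes, rf=2))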

If a VM moves to another node in response to a High Availability (HA) event or a VM
migration, Curator automatically migrates hot data to the node where the VM is running.
NDFS makes all of the cluster's storage resources available to any host or any VM, without
requiring that all data be local to that node.

Scalable vDisk-level Locks

In traditional, non-converged architectures, I/O from a single VM may arrive via multiple
interfaces due to network multipathing and controller load balancing. This forces legacy
filesystems to use fine-grained locks to avoid constant transfers of lock ownership between
controllers on writes, along with heavy invalidation chatter on the network. Such use of
fine-grained locks in architectures employing n-way controllers imposes scalability
bottlenecks that are nearly impossible to overcome.

NDFS eliminates this impediment to system scalability by implementing locks sparingly,
and only at a VM's file level (a vDisk on NDFS). NDFS queries the hypervisor and gathers
information on all the VMs running on the host, as well as the files backing the VMs' virtual
disks. Each virtual disk, or any other large file, is converted into a Nutanix vDisk that is
managed as a first-class citizen in the filesystem.

I/O for a particular VM is served by the local Controller VM that is running on the host.
That local controller VM acquires the lock for all of the virtual disks backing the VM. Because
virtual disks are not typically shared with other hosts, NDFS simply uses a vDisk-level lock.
As such, there are no invalidations or cache coherency issues. Even with a very large number
of VMs, NDFS effectively manages locks with minimal overhead to ensure that the system
can still scale linearly.
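
Conceptually, the locking model amounts to a single coarse lock per vDisk held by the local
Controller VM. The sketch below is illustrative only; the class and identifiers are hypothetical.

    import threading

    class VDiskLockTable:
        """Illustrative whole-vDisk lock table (not the NDFS implementation)."""

        def __init__(self):
            self._owners = {}              # vdisk_id -> owning Controller VM
            self._mutex = threading.Lock()

        def acquire(self, vdisk_id, cvm):
            """Grant the whole-vDisk lock to one Controller VM. Because a vDisk
            is normally accessed from only one host, the lock rarely moves and
            there is no cross-controller invalidation chatter."""
            with self._mutex:
                current = self._owners.setdefault(vdisk_id, cvm)
                return current == cvm

        def release(self, vdisk_id, cvm):
            with self._mutex:
                if self._owners.get(vdisk_id) == cvm:
                    del self._owners[vdisk_id]

    locks = VDiskLockTable()
    assert locks.acquire("vdisk-1234", "cvm-host-3")       # local CVM serves all I/O
    assert not locks.acquire("vdisk-1234", "cvm-host-7")   # other hosts do not contend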


Scalable Alerting, Monitoring and Reporting

Alerts
One of the more common oversights in designing scalable systems is failing to build
troubleshooting facilities that scale with the system. When services become unstable,
it is even more important to have an alert and event recording system that does not buckle
for lack of scalability.

NDFS implements a fully scalable alert system supported by a service running on every
node. All events and alerts are recorded in a strictly consistent, distributed NoSQL database
that is accessible via any node in the cluster. Alerts and events are indexed at scale within
the NoSQL database and are easily accessed either through the GUI or through a
standards-based REST API.
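
As a simplified illustration, each node's alert service might write events as keyed records into
the shared store so that any node can serve queries over them; the key layout and field names
below are assumptions, not the NDFS schema.

    import time

    def record_alert(kv_store, node_id, severity, message):
        """Write one alert as a keyed record; keys embed node and timestamp
        so alerts can be looked up and sorted at scale."""
        key = f"alert:{node_id}:{time.time_ns()}"
        kv_store[key] = {"severity": severity, "message": message, "node": node_id}
        return key

    store = {}   # stand-in for the distributed, strictly consistent NoSQL database
    record_alert(store, "node-7", "warning", "disk latency above threshold")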

Statistics and Visualization


All distributed systems require reliable, live statistical monitoring and reporting. NDFS
scales the statistics-gathering component to provide near real-time insights to the cluster
administrator. This is accomplished by implementing a scale-out statistics database that
leverages the NoSQL key-value store.

Each host in the Nutanix cluster runs an agent that gathers local statistics and periodically
updates the NoSQL store. When the GUI requests sorted data (e.g., the top 10
CPU-consuming VMs), the request is fanned out through a map-reduce framework to all
hosts in the cluster to gather live information in a scalable fashion. Each host is responsible
only for serving requests for its own local statistics.
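
The scatter/gather pattern can be sketched as follows; the class and function names are
hypothetical. Each host reports only its own VMs' statistics, and the partial results are merged
to answer a cluster-wide top-N query.

    import heapq

    class Host:
        """Stand-in for a cluster host running a local statistics agent."""
        def __init__(self, name, vm_cpu):
            self.name, self._vm_cpu = name, vm_cpu

        def local_vm_cpu(self):
            # Map step: each host returns stats for its local VMs only.
            return list(self._vm_cpu.items())

    def top_cpu_vms(hosts, n=10):
        # Reduce step: merge the per-host partial results and keep the top n.
        merged = []
        for host in hosts:
            merged.extend(host.local_vm_cpu())
        return heapq.nlargest(n, merged, key=lambda kv: kv[1])

    hosts = [Host("h1", {"vm-a": 91.0, "vm-b": 12.5}),
             Host("h2", {"vm-c": 76.3})]
    print(top_cpu_vms(hosts, n=2))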

Nutanix also provides cluster-wide statistics, which are handled by a dynamically elected
leader in the cluster (see the following section for a discussion of leaders). Strict limits are
enforced on the number of such cluster-wide statistics to ensure overall scalability.

Avoiding Single Points of Failure


A key NDFS design principle is to require no fixed, special nodes to maintain cluster
operation and services. A few operations in the clustered environment, however, embody
the notion of a leader. This leader must not be statically assigned; otherwise that node
becomes a special node and a potential single point of failure, which inhibits scalability.

To overcome these drawbacks, Nutanix implements a dynamic leader election scheme.
For all functions and administrative roles in the cluster, services on all hosts volunteer to
be elected as leader. A leader is elected efficiently using the distributed configuration
service implemented in the system.

It is not necessary, however, for the leaders of all functions to be co-located on any given
node. In fact, leaders are randomly distributed in order to spread the leadership load
among cluster nodes. When an elected leader fails, either due to the failure of the service
or the host itself, a new leader is elected automatically from among the healthy nodes in the
cluster. This occurs in a sub-second timeframe for clusters of any size. In other words, there
is no correlation between the time it takes to elect a new leader upon failure and the number of
nodes in a cluster. The newly elected leader automatically assumes the responsibilities of the
previous leader.
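
One common way to realize such an election on top of a strictly consistent configuration
service is an atomic create (compare-and-swap) on a well-known key per role. The sketch
below illustrates that pattern only; it is not the NDFS mechanism, and the names are
hypothetical.

    import threading

    class ConfigService:
        """Stand-in for the strictly consistent configuration service."""

        def __init__(self):
            self._data, self._mutex = {}, threading.Lock()

        def create_if_absent(self, key, value):
            """Atomically claim a key; returns True only for the first caller."""
            with self._mutex:
                if key in self._data:
                    return False
                self._data[key] = value
                return True

        def delete(self, key):
            # Called when the current leader fails, clearing the way for a
            # new election among the healthy nodes.
            with self._mutex:
                self._data.pop(key, None)

    def volunteer(config, role, node_id):
        """Every node volunteers for the role; exactly one wins the key."""
        return config.create_if_absent(f"leader:{role}", node_id)

    cfg = ConfigService()
    print(volunteer(cfg, "stats-aggregator", "node-2"))   # True  -> elected leader
    print(volunteer(cfg, "stats-aggregator", "node-5"))   # False -> remains a follower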

Strict Consistency at Scale Using Paxos


Most NoSQL implementations sacrifice strict consistency to gain better availability. For
example, Facebook's Cassandra and Amazon's Dynamo provide only eventual consistency,
which is not a viable option for a true filesystem. Eventual consistency means that if a piece
of data is written to one node in a cluster, the data becomes visible to other nodes in the
cluster only eventually; there is no guarantee that it will be immediately visible.

While this works well for some systems, it would cause corruption in a storage filesystem.
It may therefore appear that NoSQL systems are ill-suited for building scalable filesystems.
NDFS, however, achieves strict consistency on top of NoSQL by implementing a distributed
version of the Paxos algorithm.

Paxos is a widely used protocol that builds consensus among nodes in clustered systems.
In the NDFS metadata store, for each key there is a set of three nodes that might have the
latest value of the data. NDFS runs the Paxos algorithm to obtain consensus on the latest
value for the key being requested. Paxos guarantees that the most recent version of the data
will get consensus. Strict consistency is guaranteed even though the underlying data store is
based on NoSQL. For any key, the consensus needs to be formed between only three nodes
regardless of cluster size.
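
The greatly simplified sketch below conveys only one point: agreement involves just the
three replica nodes for a key, so the cost does not grow with cluster size. It is a version-based
quorum read, not a full Paxos implementation and not the NDFS code.

    def read_with_quorum(replicas, key):
        """Ask the three replica nodes; accept a value only if a majority
        (two of three) agree on the highest version seen."""
        responses = [r.get(key) for r in replicas]         # (version, value) or None
        responses = [r for r in responses if r is not None]
        best_version = max(version for version, _ in responses)
        votes = [value for version, value in responses if version == best_version]
        if len(votes) >= 2:
            return votes[0]
        raise RuntimeError("no quorum on the latest version; run a consensus round")

    # Plain dicts stand in for the three metadata replicas of one key.
    replica_a = {"vdisk-1:0": (7, "extent on node-4 and node-9")}
    replica_b = {"vdisk-1:0": (7, "extent on node-4 and node-9")}
    replica_c = {"vdisk-1:0": (6, "stale location")}
    print(read_with_quorum([replica_a, replica_b, replica_c], "vdisk-1:0"))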

The ability to achieve strict consistency using a scalable NoSQL database enables NDFS to
be a linearly scalable filesystem.

Scalable Map-Reduce Framework

For relatively mundane work that must be performed by a typical filesystem, it is far more
efficient and scalable to complete common tasks in the background. User performance
benefits by offloading tasks from the active I/O path to background processes. Typical
examples include disk scrubbing, disk balancing, offline compression, metadata scrubbing
and calculation of free space.

NDFS implements a map-reduce framework, called Curator, which performs cluster-wide
operations at scale and completes these tasks without impacting scalability. Because all
nodes participate and each handles a part of the Curator workload, performance scales
linearly as the size of the cluster grows.

Each node is responsible for a bounded number of map-reduce tasks, which are processed
by all cluster nodes in phases. A coordinating node is randomly elected and performs only
the lightweight work of coordinating the other nodes in the cluster.
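
A simplified sketch of one such background pass, a disk-balance scan, is shown below. The
structure and threshold are hypothetical: each node maps over only its local disks, and a
lightweight coordinator reduces the partial results into a cluster-wide view.

    from statistics import mean

    def map_local_usage(node_name, local_disks):
        """Map step (runs on each node): report utilization of local disks only."""
        return [(node_name, disk, used / total)
                for disk, (used, total) in local_disks.items()]

    def reduce_find_imbalance(partials, threshold=0.15):
        """Reduce step (runs on the coordinator): flag disks far from the mean."""
        usages = [u for part in partials for (_, _, u) in part]
        avg = mean(usages)
        return [(node, disk, u) for part in partials
                for (node, disk, u) in part if abs(u - avg) > threshold]

    partials = [
        map_local_usage("node-1", {"sda": (800, 1000), "sdb": (500, 1000)}),
        map_local_usage("node-2", {"sda": (300, 1000)}),
    ]
    print(reduce_find_imbalance(partials))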
