
Lustre contains three main kinds of systems: file system clients, which access the storage; Object Storage Servers (OSSs), which are connected to Object Storage Targets (OSTs); and Metadata Servers (MDSs), which manage the names and directories in the file system.

The storage attached to the servers is partitioned, optionally organized with LVM, and formatted with a local file system. The Lustre OSS and metadata servers read, write, and modify data in the format imposed by these file systems.

Each OSS can be responsible for multiple OSTs, one for each volume, and I/O traffic is balanced across servers and targets. Depending on the hardware, an OSS server can be responsible for 2 to 25 targets, where each target can be as large as 8 TB. The capacity of the file system is the sum of the capacities of all the targets. An OSS server should also balance the system network bandwidth against the attached storage bandwidth in order to avoid network bottlenecks. For example, 64 OSS servers, each with two 8 TB targets, provide a file system with a capacity of nearly 1 PB. If each server uses 16 1-TB SATA disks, it may be possible to get 50 MB/sec from each drive, providing up to 800 MB/sec of disk bandwidth per server. If this system is used as a storage back-end with a system network such as InfiniBand, which supports a similar bandwidth, then each OSS could provide 800 MB/sec of end-to-end I/O throughput. Note that the OSS must sustain 800 MB/sec of inbound and outbound bus throughput simultaneously. The cluster could then see an aggregate I/O bandwidth of 64 x 800 MB/sec, or about 50 GB/sec (a small worked example appears at the end of this section). The architectural constraints described here are simple; in practice, however, extremely careful hardware selection, benchmarking, and integration are required to obtain such results, tasks best left to experts. Future Lustre file systems may include Server Network Striping (SNS).

Often an OSS uses not direct-attached storage but Fibre Channel or SAS-attached storage. The storage should be protected using RAID 5 or RAID 6. OSS memory is used to cache read-only file data and, in some cases, dirty data from writes. Software RAID 5 consumes roughly one CPU core per 300 MB/sec. CPU usage is lowest when the network has RDMA (Remote Direct Memory Access) capability, a feature present in all supported networks except TCP/IP. CPU utilization will increase in the future as file system hardening is implemented. Because Lustre clients are not connected to any storage, servers usually find a point-to-point architecture preferable to a switched fabric for connecting to the storage.

The same considerations apply to the MDS. The MDS requires storage, but typically 1-2% of the file system capacity is enough. However, the data access patterns of the MDS and OSS are completely different: the former performs many seeks with small reads and writes, whereas the latter performs large reads and writes. High throughput is therefore not important for the MDS. Also, RAID 5 or RAID 6 does not perform well under seek-heavy I/O; RAID 0+1 is much better. Lustre uses journaling file systems on the targets, and for the MDS roughly 20% more performance can be gained by putting the journal on a separate device. The MDS typically requires CPU power, and at least four cores are recommended.

Lustre configuration

First the Lustre software is installed, and then the MDT and OST partitions are formatted using the standard mkfs command. Next, the volumes carrying the Lustre file system targets are mounted on the server nodes as local file systems. Finally, the clients mount the Lustre file system, much as NFS file systems are mounted. A minimal sketch of this sequence appears below.
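To make the sequence concrete, here is a hedged sketch of the steps just described, wrapped in Python so each step is echoed before it runs. The mkfs.lustre and mount -t lustre tools are the standard ones, but exact flags vary across Lustre versions, and the device paths, the node name mds01@tcp0, and the fsname "demo" are hypothetical.

```python
import subprocess

def run(cmd):
    """Run one setup step, echoing it first and stopping on failure."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Format the metadata target; here the same device also hosts the
#    management server (MGS). /dev/sda and fsname "demo" are hypothetical.
run(["mkfs.lustre", "--fsname=demo", "--mgs", "--mdt", "/dev/sda"])

# 2. Format an object storage target, telling it where the MGS lives.
#    mds01@tcp0 is a hypothetical LNET node identifier.
run(["mkfs.lustre", "--fsname=demo", "--ost", "--mgsnode=mds01@tcp0", "/dev/sdb"])

# 3. Mount the targets on their server nodes as local Lustre volumes.
run(["mount", "-t", "lustre", "/dev/sda", "/mnt/mdt"])
run(["mount", "-t", "lustre", "/dev/sdb", "/mnt/ost0"])

# 4. On a client, mount the file system itself, much like an NFS mount.
run(["mount", "-t", "lustre", "mds01@tcp0:/demo", "/mnt/demo"])
```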
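And, circling back to the sizing example earlier in this section, the arithmetic written out so the figures can be swapped for other hardware. All numbers are the ones quoted in the text, not measurements.

```python
# Back-of-the-envelope sizing for the 64-OSS example above.
num_oss        = 64    # OSS servers in the cluster
osts_per_oss   = 2     # 8 TB targets per server
ost_size_tb    = 8
drives_per_oss = 16    # 1 TB SATA disks per server
drive_mb_s     = 50    # assumed sustained throughput per disk

capacity_tb   = num_oss * osts_per_oss * ost_size_tb   # 1024 TB, nearly 1 PB
per_oss_mb_s  = drives_per_oss * drive_mb_s            # 800 MB/sec per OSS
aggregate_gbs = num_oss * per_oss_mb_s / 1024          # ~50 GB/sec aggregate

print(f"{capacity_tb} TB capacity, {per_oss_mb_s} MB/s per OSS, "
      f"~{aggregate_gbs:.0f} GB/s aggregate")
```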

Lustre Networking

The system network sits between the clients and the OSS and MDS nodes. LNET runs on the system network and provides the entire communication infrastructure that Lustre requires. Key features of LNET: RDMA, when the underlying network supports it (Elan, Myrinet, InfiniBand); high availability and recovery features enabling transparent recovery in conjunction with failover servers; and simultaneous availability of multiple network types, with routing between them.

Common end-to-end throughputs are:

1. In excess of 110 MB/sec over GigE networks.
2. In excess of 1 GB/sec over 10GigE networks.
3. Up to 1.5 GB/sec over double data rate (DDR) InfiniBand networks.

High availability and rolling updates

A Lustre file system can have an enormous number of storage devices, possibly serving thousands of clients, so a high-availability mechanism is required in case of server failures or reboots. Applications should perceive nothing more than a delay in the reply to their system calls. A robust failover mechanism, together with software that interoperates between different versions, allows nodes to be upgraded without taking the whole system offline: one node at a time is taken offline, upgraded, and rejoined to the cluster, while applications merely see a delay during the process. The MDS is configured as an active/passive pair, while OSS nodes are configured active/active. Often the standby MDS for one file system is the active MDS for another, leaving no node idle in the system. Although a file system checking tool (lfsck) is provided, journaling and sophisticated protocols resynchronize the system within seconds, without the need for lengthy file system checks.

Where are the files?

The inodes stored on the MDT do not point to file data but to one or more objects associated with each file.

The objects are implemented as files in the OSTs' local file systems and contain the file data.

If only one object is associated with an inode, that object contains all the data of the file. If more than one object is associated, the file data is striped across those objects. The benefit of this approach is that capacity and aggregate I/O bandwidth scale entirely with the number of OSSs, with no dependency on the MDS. The number of objects associated with an inode is known as the stripe_count, and the amount of data written to each object before moving on to the next object, in a round-robin fashion, is known as the stripe_size. Working with striped objects leads to interesting behavior (a small model of the offset-to-object mapping appears below). For example, consider a rendering application in which each client node renders one frame, using a shared-file model where all rendered frames of the movie are written into a single file. The resulting file can contain interesting patterns, such as objects without any data, or sparse objects into which a client (client 6, say) has written data.
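The round-robin layout implied by stripe_count and stripe_size can be stated precisely. The function below is an illustrative model of how a byte offset in a striped file maps to an object, under the rule just described; it is not Lustre's actual implementation.

```python
def locate(offset, stripe_count, stripe_size):
    """Map a byte offset in a striped file to (object index, offset in object).

    Data fills stripe_size bytes of object 0, then object 1, and so on,
    wrapping back to object 0 after object stripe_count - 1.
    """
    stripe_number = offset // stripe_size          # which stripe, file-wide
    obj_index = stripe_number % stripe_count       # which object holds it
    obj_offset = ((stripe_number // stripe_count) * stripe_size
                  + offset % stripe_size)          # where inside that object
    return obj_index, obj_offset

# With 4 objects and 1 MiB stripes, byte 5 MiB lands in object 1,
# at offset 1 MiB within that object.
MiB = 1 << 20
print(locate(5 * MiB, stripe_count=4, stripe_size=1 * MiB))  # (1, 1048576)
```

Under this model, a client that writes only one stripe-sized frame touches a single region of a single object; everywhere else that object is simply unallocated, which is exactly how the empty and sparse objects described above arise.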

The major benefit of striping is that the file size is not limited by the size of any one target. Lustre can stripe a file over up to 160 targets, each of which can be 8 TB in size, so a single file can hold up to 1.48 PB of allocated data. The nominal maximum file size is much larger (2^64 bytes), but since data allocation cannot exceed 1.48 PB, any file larger than that must contain sparse regions. Lustre systems have been built with over 5,000 targets, which is enough to build a 40 PB file system. Another benefit is that the I/O bandwidth to a file is the aggregate I/O bandwidth to its objects, so the bandwidth to a single file can be the aggregate bandwidth of up to 160 targets.

Additional features

Interoperability

The file system supports many CPU architectures, and interoperability between versions is maintained when clients and servers are mixed. However, future releases may require a servers-first or all-at-once upgrade approach.

Access Control Lists (ACLs)

The Lustre security model is that of UNIX, enhanced with POSIX ACLs. Other security features include root squash and accepting connections from privileged ports only. Quotas are available.

OSS addition

Object Storage Servers with new OSTs can be added to the cluster without any interruption, increasing both capacity and aggregate I/O bandwidth.

Controlled striping

A default stripe count and stripe size are chosen when the file system is formatted. However, every directory, and recursively every sub-directory, carries an attribute that determines the striping pattern for that directory. Moreover, system utilities and application libraries can set striping behavior on a per-file basis, as sketched below.
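Striping attributes are typically manipulated with the lfs utility; the calls below are a sketch wrapped in Python to match the other examples. lfs setstripe and lfs getstripe are real commands, but flag spellings differ across Lustre versions, and the paths under /mnt/demo are hypothetical.

```python
import subprocess

def lfs(*args):
    """Invoke the lfs utility, echoing the command being run."""
    cmd = ["lfs", *args]
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Give a directory (hypothetical path) a default pattern that new files
# underneath it inherit: 1 MB stripes spread across 8 objects.
lfs("setstripe", "-s", "1m", "-c", "8", "/mnt/demo/frames")

# Create a single file with its own pattern: one object, i.e. no striping,
# a sensible choice for small files that never need aggregate bandwidth.
lfs("setstripe", "-c", "1", "/mnt/demo/frames/notes.txt")

# Inspect which objects a file actually landed on.
lfs("getstripe", "/mnt/demo/frames/notes.txt")
```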

Snapshots

Lustre recognizes that the file system consists of individual volumes. Snapshots can be taken of all volumes and grouped together into a snapshot file system that can be mounted alongside the Lustre file system.

Backup tools

The Lustre 1.6 release comes with two utilities that facilitate backup and restore. One is a very fast file scanner that can find files modified after a given point in time and report their path names. The other is a modified version of the star utility that backs up and restores Lustre stripe information.
