This publication is intended to help Cluster File Systems, Inc.'s (CFS) customers and partners who are involved in installing, configuring, and administering Lustre. The information contained in this document has not been submitted to any formal CFS test and is distributed AS IS. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. While each item may have been reviewed by CFS for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk. Comments may be addressed to:

Cluster File Systems, Inc.
110 Capen Street
Medford, MA 02155-4230
Copyright Cluster File Systems, Inc. 2004. All rights reserved. Use or disclosure is subject to restrictions. Duplication of this manual is prohibited.
Contents

1 Prerequisites
    1.1 Lustre Version Selection
        1.1.1 How To Get Lustre
        1.1.2 Supported Configurations
    1.2 …
        1.2.1 Choosing a Pre-packaged kernel
        1.2.2 Lustre Tools
        1.2.3 Building Other Modules Against the Lustre kernel
        1.2.4 Other Required Software
    1.3 …
        1.3.1 Building Your Own kernel
        1.3.2 Building Lustre
        1.3.3 Environment Requirements
    1.4 LDAP
        1.4.1 Installing LDAP Packages
        1.4.2 Updating slapd.conf
        1.4.3 Specifying Password Location
        1.4.4 Caveats
        1.4.5 Using LDAP to configure the cluster
    1.5 Installing Lustre-Manager
        1.5.1 Dependencies
        1.5.2 Installing the reporting client daemon (LMD)
        1.5.3 Installing the management client (LMM)

2 Creating a New File System
    2.1 What do you need to know to setup Lustre?
        2.1.1 Architecture Refresher
        2.1.2 Sizing Your Nodes
        2.1.3 High Availability
        2.1.4 Total Usable Storage
    2.2 Disk Layout
        2.2.1 Basics
        2.2.2 Lustre on RAID
        2.2.3 Logical Volume Manager (LVM)
    2.3 …
        2.3.1 Peak Bandwidth
        2.3.2 Total Storage Capacity
        2.3.3 When Your Best Isn't Good Enough
    2.4 File Striping
        2.4.1 Advantages of Striping
        2.4.2 Disadvantages of Striping
        2.4.3 Stripe Size
        2.4.4 Choosing OSTs
    2.5 Using Lustre-Manager
        2.5.1 Basic Multi-node Setup
        2.5.2 Basic Service Management
        2.5.3 Large Parallel I/O Configuration
        2.5.4 Configuring for Failover
        2.5.5 LDAP
        2.5.6 Multinet and Routing
        2.5.7 Configuration Pitfalls
    2.6 Failover Example
        2.6.1 Shared Storage
        2.6.2 Configuring With Failover Manager
        2.6.3 Pairwise Config
        2.6.4 Passive/Active Failover
        2.6.5 Active/Active Failover
        2.6.6 N-way Failover
        2.6.7 The Default Lustre Upcall
        2.6.8 Testing Failover
    2.7 Client Configuration
        2.7.1 Automatic Client Mounting via fstab
    2.8 Validation and Light Testing
        2.8.1 Lustre Throughput Tests
    2.9 Configuration, Under the Hood
        2.9.1 Automatic Service Stopping and Starting
        2.9.2 File System Parameters
        2.9.3 Upcall Generation and Configuration
        2.9.4 Log Levels and Timeouts
    2.10 Striping Tools
        2.10.1 Per-File
        2.10.2 Per-Directory
        2.10.3 Inspecting Stripe Settings
        2.10.4 Finding Files on a Given OST
        2.10.5 Examples

3 Configuring Monitoring
    3.1 Basic monitoring
        3.1.1 System Health
        3.1.2 Current Load
        3.1.3 Bandwidth/Disk/CPU
        3.1.4 OST performance monitoring with LMT
        3.1.5 Lustre Operation/RPC Rate
    3.2 …
        3.2.1 Configuring System Log
        3.2.2 Logging from the Upcall
        3.2.3 SNMP

4 …
    4.1 File system consistency
    4.2 E2fsck
        4.2.1 Supported e2fsck Releases
    4.3 lfsck
        4.3.1 What is lfsck?
        4.3.2 When To Run lfsck
        4.3.3 What if I don't run lfsck?
        4.3.4 Using lfsck
    4.4 Validation of Configuration
        4.4.1 lustre-lint
    4.5 Recovering from Network Partition
        4.5.1 Automating
    4.6 Recovering from Disk Failure
        4.6.1 OST Failure
        4.6.2 MDT Failure
    4.7 Lustre Timeouts
        4.7.1 Aborting Server Recovery
        4.7.2 Manual Recovery
    4.8 Automating Failover
        4.8.1 Using XML
        4.8.2 Using LDAP
        4.8.3 Failing Back to Primary Nodes

5 Health Checking
    5.1 What to do when Lustre seems too slow
        5.1.1 Debug Level
        5.1.2 Stripe Count
        5.1.3 Stripe Balance
        5.1.4 Investigation
        5.1.5 Why are POSIX file writes slow?
    5.2 …
        5.2.1 Unresponsive to Requests
        5.2.2 Gathering Evidence
        5.2.3 Basic analysis
        5.2.4 Reporting problems
        5.2.5 Support Contacts
        5.2.6 Mailing lists

6 Managing Configurations
    6.1 Adding OSTs
        6.1.1 The importance of OST ordering
        6.1.2 Adding OSTs Without Upsetting the Balance
    6.2 Poor Man's Migration
        6.2.1 Extending with LVM
    6.3 Network Topology Changes
    6.4 Adding Failover
    6.5 Adding a Distinct file system

7 Managing Lustre
    7.1 Changing Configurations
    7.2 Backing Up Data
        7.2.1 Backing up at the Client File System Level
        7.2.2 Backing up at the Target Level
        7.2.3 How to back up an OST
        7.2.4 How to back up the MDS
    7.3 Restoring Backups
    7.4 Exporting via NFS
    7.5 Exporting via Samba (CIFS)
    7.6 Upgrading Your Software
        7.6.1 Release Notes
        7.6.2 Upgrading From Previous Versions

8 Mixing Architectures
    8.1 Mixing Lustre versions
    8.2 Mixing kernel versions
        8.2.1 Different client/server kernels
        8.2.2 Mixing 2.4 and 2.6 kernels
    8.3 Mixing hardware classes
        8.3.1 Hardware Page Size
        8.3.2 Endian mixing
Chapter 1
Prerequisites
1.1 Lustre Version Selection
1.1.1 How To Get Lustre
The current, stable version of Lustre is available for download from the Cluster File Systems web site:

Download Lustre  http://www.clusterfs.com/download.html

The software available for download on the Cluster File Systems web site is released under the GNU General Public License. We strongly recommend that you read the complete license and release notes for this software before downloading, if you have not already done so. The license and release notes can be found at the same web site.
1.1.2 Supported Configurations

    Type       Supported
    OS         Red Hat Linux 7.1+, SuSE Linux 8.0+, Linux 2.4.x
    CPU        IA-32, IA-64, x86-64
    Network    TCP/IP, Quadrics Elan 3

Table 1.1: Supported configurations
    Release    Details
    chaos      Based on the 2.4.18 Linux kernel, IA-32, supports both TCP and elan3
    ia64       Based on the 2.4.20 Linux kernel, IA-64, supports both TCP and elan3

Table 1.2: Pre-packaged release details
Lustre contains kernel modifications which interact with your storage devices, and may introduce security issues and data loss if not installed, configured, and administered correctly. Please exercise caution and back up all data before using this software.
All the Lustre packages were built using gcc v2.96. If you need to build additional modules for your cluster, the same compiler and version should be used. gcc http://gcc.gnu.org/
1.2.4.1 Core Requirements
Table 1.3 contains hyperlinks to the software tools required by Lustre. Depending on your operating system, pre-packaged versions of these tools may be available, either from the sources listed below, or from your operating system vendor.
1.2.4.2 High Availability Software
If you plan to enable failover server functionality with Lustre (either OSS or MDS), high availability software will be a necessary addition to your cluster software. Two of the better known high availability packages are Clumanager and Kimberlite.

Clumanager (Cluster Manager), also called cluman, from Red Hat Enterprise Linux AS (Advanced Server), provides high availability (HA) features that are essential for data integrity and uninterrupted service. The basic ideas behind these HA features are redundant systems and a failover mechanism, which moves services from a failed server to the remaining backup server. More information about CluManager can be found in the Lustre CluManager wiki:

CluManager wiki  https://wiki.clusterfs.com/lustre/CluManager

Kimberlite is an open-source (GNU GPL) high-availability clustering solution for Linux, designed for use in commercial application environments. It guarantees data integrity using commodity hardware components, and its benefits can be applied to any application, with no requirement to modify the application. As a bonus, Kimberlite comes with a command-line management interface for scripting regular operations. More information about Kimberlite can be found at:

Kimberlite  http://oss.missioncriticallinux.com/projects/kimberlite/
    Software    Version    Lustre Function
    pdsh        >=1.6      distributed shell: useful for general cluster maintenance
    perl        >=5.6      scripting language: used by monitoring and test scripts
    python      >=2        scripting language: required by core Lustre tools
    PyXML       >=0.8      XML processor for python: required

    pdsh        http://www.llnl.gov/linux/pdsh/pdsh.html
    perl        http://www.perl.com/pub/a/language/info/software.html
    python      http://www.python.org/download/
    PyXML       http://sourceforge.net/project/showfiles.php?group_id=6473

Table 1.3: Software URLs
1.2.4.3 Debugging Tools
Things inevitably go wrong: disks fail, packets get dropped, software has bugs. When they do, it is always useful to have debugging tools on hand to help figure out how and why. The most useful tools in this regard are gdb, coupled with crash. Together, these tools can be used to investigate both live systems and kernel core dumps. There are also useful kernel patches/modules, such as netconsole and netdump, that allow core dumps to be made across the network. More information about these tools can be found at the following locations:

gdb         http://www.gnu.org/software/gdb/gdb.html
crash       http://oss.missioncriticallinux.com/projects/crash/
netconsole  http://lwn.net/2001/0927/a/netconsole.php3
netdump     http://www.redhat.com/support/wpapers/redhat/netdump/
Depending on which kernel you are using, a different series of patches needs to be applied. Cluster File Systems maintains a collection of patch series files for the various supported kernels in lustre/kernel_patches/series/. For instance, the file lustre/kernel_patches/series/rh-2.4.20 lists all the patches that should be applied to a Red Hat 2.4.20 kernel to build a Lustre-compatible kernel. The current set of all supported kernels and corresponding patch series can always be found in the file lustre/kernel_patches/which_patch.

1.3.1.2 Using Quilt
A variety of Quilt packages (RPMs, SRPMs, and tarballs) are available on the Cluster File Systems ftp site:

quilt ftp site  ftp://ftp.clusterfs.com/pub/quilt/

The Quilt RPMs have some installation dependencies on other utilities, e.g. the coreutils RPM that is available only in Red Hat 9. You will also need a recent version of the diffstat package. If you cannot fulfill the Quilt RPM dependencies for the packages made available by Cluster File Systems, we suggest building Quilt from the tarball.

After you have acquired the Lustre source (CVS or tarball) and chosen a series file to match your kernel sources, you must also choose a kernel config file. Supported kernel ".config" files are in lustre/kernel_patches/kernel_configs, and are named in such
$ cd /tmp/kernels/linux-2.4.20
$ quilt setup -l ../lustre/kernel_patches/series/rh-2.4.20 \
    -d ../lustre/kernel_patches/patches

Figure 1.1: Quilt Setup
$ cd /tmp/kernels/linux-2.4.20
$ quilt push -av

Figure 1.2: Applying a patch series using Quilt
a way as to indicate which kernel and architecture they are meant for; e.g. vanilla-2.4.20.uml.config is a UML config file for the vanilla 2.4.20 kernel series.

Next, unpack the appropriate kernel source tree. For the purposes of illustration, this documentation will assume that the resulting source tree is in /tmp/kernels/linux-2.4.20; we will call this the destination tree.

You are now ready to use Quilt to manage the patching process for your kernel. The commands in Figure 1.1 will set up the necessary symlinks between the Lustre kernel patches and your kernel sources. You can then have Quilt apply all the patches in the chosen series to your kernel sources by using the commands in Figure 1.2. If the right series file was chosen and the patches and kernel sources were up to date, the patched destination Linux tree should now be able to act as a base Linux source tree for Lustre.
The patched Linux source does not need to be compiled in order for Lustre to be built from it. However, the same Lustre-patched kernel must be compiled and then booted on any node on which you intend to run a version of Lustre built using this patched kernel source.
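The match between the built and booted kernel can be sanity-checked with uname. This is a hedged sketch: the expected version string is a made-up example, to be replaced with the release name of your own Lustre-patched kernel.

```shell
# Confirm a node is booted into the Lustre-patched kernel it was built for.
# The expected version string below is a hypothetical example; substitute
# the release name reported by your patched kernel (uname -r after reboot).
expected="2.4.20-lustre"

check_kernel() {
  # compare a running kernel release ($1) against the expected one ($2)
  if [ "$1" = "$2" ]; then
    echo "kernel matches Lustre build"
  else
    echo "WARNING: running $1, but Lustre was built against $2"
  fi
}

check_kernel "$(uname -r)" "$expected"
```

Running this on every server and client before starting Lustre services catches the common mistake of building modules against one kernel tree and booting another.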
$ cd /path/to/lustre/source
$ ./configure --with-linux=/path/to/lustre/patched/kernel/source --disable-liblustre
$ make

Figure 1.3: Lustre build instructions
1.3.2.1 Configuration Options
Lustre supports several different features and packages that extend the core functionality of Lustre. These features/packages can be enabled at build time by passing appropriate arguments to the configure command. A complete listing of supported features and packages can always be obtained by issuing the command ./configure --help in your Lustre source directory.
1.3.2.2 liblustre
The Lustre library client, liblustre, relies on libsysio, a library that provides POSIX-like file and name space support for remote file systems from the application program address space. Libsysio can be obtained from:

libsysio URL  http://sourceforge.net/projects/libsysio/

Development of libsysio has continued since it was first targeted for use with Lustre, so you should check out the b_lustre branch from the libsysio CVS repository. This will give you a version of libsysio that is compatible with Lustre. Once checked out, the steps listed in Figure 1.4 will build libsysio.

$ sh autogen.sh
$ ./configure --with-sockets
$ make

Figure 1.4: Building libsysio
Once libsysio is built, you can build liblustre using the commands listed in Figure 1.5.

$ ./configure --with-lib --with-sysio=/path/to/libsysio/source
$ make

Figure 1.5: Building liblustre
The compiler of note for Lustre is gcc version 2.96. This version of gcc has been used to successfully compile all of the pre-packaged releases made available by Cluster File Systems, and as such is the only compiler that is officially supported. Your mileage may vary with other compilers, or even with other versions of gcc.
1.3.3.1 Consistent Clocks
Machine clocks should be kept in sync as much as possible. The standard way to accomplish this is by using the Network Time Protocol (ntp). All the machines in your cluster should synchronize their time from a local time server (or servers) at a suitable interval. More information about ntp can be found at:

ntp  http://www.ntp.org/
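As an illustration, a minimal ntpd client configuration pointing at local time servers might look like the following. The hostnames and drift-file path are assumptions, not part of any Lustre package; substitute your site's own servers.

```
# Minimal /etc/ntp.conf sketch for a cluster node. The server names are
# placeholders for your site's local time servers.
server time1.example.com
server time2.example.com
driftfile /etc/ntp/drift
```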
1.3.3.2 Universal UID/GID
In order to maintain uniform file access permissions on all the nodes of your cluster, the same user (UID) and group (GID) IDs should be used on all clients. You can store and disseminate such user information centrally for the entire cluster using a tool such as LDAP or NIS.

OpenLDAP  http://www.openldap.org/
Network Information System  http://www.faqs.org/docs/linux_network/x-087-2-nis.html
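One way to spot inconsistent IDs is to compare id output across nodes. This is a sketch under the assumption that pdsh (introduced in Section 1.5.1.1) is installed; the node list, username, and sample output are all hypothetical.

```shell
# Cluster-wide UID consistency check. The real invocation would be:
#   pdsh -w node[000-064] 'id -u alice' | awk '{print $2}' | sort -u
# More than one distinct value in that output means UIDs are inconsistent.
# The same logic, demonstrated here on sample collected output:
sample_output="node000: 500
node001: 500
node002: 501"
distinct=$(printf '%s\n' "$sample_output" | awk '{print $2}' | sort -u | wc -l | tr -d ' ')
if [ "$distinct" -eq 1 ]; then
  echo "UIDs consistent across nodes"
else
  echo "UIDs differ across nodes"
fi
```

In the sample above, node002 maps the user to a different UID, so the check reports an inconsistency; the same pattern works for GIDs with id -g.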
1.4 LDAP
A single Lustre configuration file is used for the whole cluster, and this file needs to be accessible to all the cluster nodes. This can be achieved either by keeping the configuration file in shared storage visible to all the nodes (e.g. an NFS-mounted directory) or by putting it in an LDAP server. An LDAP server is also useful for supporting MDS and OSS failover. In this section, we will describe the various components required to configure an LDAP server and outline the steps to start one up. Information on how LDAP can be used to configure a cluster can be found in a later chapter.
# cd lustre/conf
# cp lustre.schema /etc/openldap/schema
# cp slapd-lustre.conf /etc/openldap
# mkdir -m 700 /var/lib/ldap/lustre
# chown ldap.ldap /var/lib/ldap/lustre

Figure 1.6: Manual installation of the Lustre schema

include /etc/openldap/slapd-lustre.conf

Figure 1.7: Updating slapd.conf
If the lustre-ldap rpm is not available (e.g. if you are running a tarball Lustre release), you can execute the series of commands found in Figure 1.6 to install the Lustre schema on your LDAP server.
1.4.4 Caveats
# load_ldap.sh /path/to/config.xml

Figure 1.8: Load XML config file

# lconf --ldapurl ldap://ldap_node --config=<config-name>

Figure 1.9: Startup cluster by LDAP

On larger clusters, LDAP can limit the speed with which clients can be mounted, due to limitations in the number of concurrent queries the LDAP server can accept from clients at a given time.
1.5.1 Dependencies
Both packages (LMD and LMT) require that the following packages be installed: python, version >=2; gd, a graphics library (which has various dependencies of its own: libjpeg, libpng, ...); and python-gd, the Python interface to the gd drawing package. These packages should all be available through whichever package distribution mechanism you use for your system.

1.5.1.1 Installing PDSH
PDSH and PyXML must be installed on all the nodes of your system for LMT to work properly.
Recent versions of PDSH may be downloaded from the LLNL public FTP site at ftp://ftp.llnl.gov/pub/linux/pdsh/. You can use the following commands to build PDSH with ssh support enabled (recommended) from tar-ed sources:

tar xzvf pdsh-xxx.tgz
./configure --with-ssh
make
make install

Once PDSH is built, it is necessary to configure login equivalence for ssh on all the nodes in your cluster. Login equivalence allows you to connect to the nodes in your cluster without having to specify a password for each connection. You can use the following steps to enable login equivalence for ssh:

1. Log in to the head node as root.
2. Create a directory called /root/.ssh if it does not already exist.
3. cd /root/.ssh
4. Create a private identity key using the following command: ssh-keygen -t dsa
5. When prompted for a passphrase, press enter to use an empty passphrase (easier). If you want to use a passphrase, then you have to use an SSH agent.
6. Append the public key generated by ssh-keygen to the list of keys on each remote cluster node that are allowed to log in using root equivalence. Unless you changed the default location during the keygen process, the id_dsa.pub file is generated on the head node in the /root/.ssh/ directory. The authorized_keys file will be found in /root/.ssh/ on each remote node; if it does not already exist, you can create it. NOTE: you will need to copy the id_dsa.pub file to each remote node (or to shared storage) before appending it to the authorized_keys file: cat id_dsa.pub >> authorized_keys
7. Change the permissions for all .ssh/ directories and authorized_keys files: cd /root; chmod go-w .ssh .ssh/authorized_keys
8. Repeat the above steps for all nodes in your cluster. If your cluster is very large, you may want to consider automating the process.
You should now be able to connect to your cluster nodes using ssh without being prompted for your password. NOTE: the first time you connect to each machine, you will have to type yes to confirm the new connection. You can use the following command to test whether PDSH works:

pdsh -w node[000-064] uptime | sort

On Red Hat systems, PDSH is often installed in /usr/local/bin by default, so it may be necessary to add /usr/local/bin to the path, or create a symlink in /usr/bin.
Install the lustre-manager-collector rpm on all the nodes that will be running Lustre:

rpm -ivh lustre-manager-collector-XXX.i386.rpm

(In a SuSE environment, install the corresponding SuSE rpm instead.)
Edit /etc/sysconfig/lustre-manager-collector to contain the hostname that will run the Lustre manager:

LMD_MONITOR_HOST=management-node

1.5.2.3 Starting the daemon
LMD is installed as a Red Hat-style service, so it can be started and stopped using the following syntax:

/sbin/service lmd start | stop

In a SuSE environment, LMD can be started and stopped using the following command:
/etc/init.d/lmd start | stop

You can configure the LMD service to start up at boot time using whatever mechanism is appropriate for your system, e.g. chkconfig on Red Hat installations.
Install the management client (LMM) on an admin or head node for your cluster. The only real criterion here is that the node you choose as the management client be visible to all the Lustre nodes in your cluster, i.e. all Lustre nodes must be able to send reports to this node:

rpm -ivh lustre-manager-XXX.i386.rpm

In a SuSE environment:

rpm -ivh lustre-manager-suse-XXX.i386.rpm

1.5.3.2 Starting the daemon
Before starting the LMM daemon for the first time, you can elect to configure two parameters. If you would like to use a secure HTTP connection (HTTPS) for the management client, you must first install the m2crypto package for Python, available from the m2crypto website http://sandbox.rulemaker.net/ngps/m2/, and then set LMM_OPTS=--use-https in /etc/sysconfig/lustre-manager. You must then run lustre-manager/data/generate_https_certs.sh to generate SSL certificates. The port on which LMM will listen can also be changed from its default of 8000 through the use of the --port option.

Like LMD, LMM is installed as a Red Hat-style service, so it can be started and stopped using the following syntax:

/sbin/service lustre-manager start | stop

In a SuSE environment, the lustre-manager can be started and stopped using the following command:

/etc/init.d/lustre-manager start | stop

You can configure the LMM service to start up at boot time using whatever mechanism is appropriate for your system, e.g. chkconfig on Red Hat installations.
1.5.3.3 Using the management client for the first time
The first time you run the management client, a random password is created for the superuser, admin, and written out to the logs. You can get this password by executing the following command:
grep admin /var/lib/lustre-manager/log/manager.log

You should now be able to connect to http(s)://localhost:8000 and log in as admin with the password from the logs. Your first task with the management client should be to change the password for the admin user from the randomly generated default.

Note: The Mozilla web browser is recommended for use with LMT.
Chapter 2
Creating a New File System

2.1 What do you need to know to setup Lustre?

2.1.1 Architecture Refresher
There are three types of systems which make up a Lustre installation, and while they are usually run on separate nodes, it is possible to run a test or demonstration setup entirely on one node.

Metadata Servers (referred to throughout as an MDS) provide access to services called Metadata Targets (MDT). An MDT manages a backend file system which contains all of the metadata, but none of the actual file data, for an entire Lustre file system. An MDS can export more than one MDT, and multiple MDSs can be configured to act as a group for purposes of failover redundancy (see Section 2.6).

Object Storage Servers (OSS) export one or more Object Storage Targets (OST). An individual OST contains part of the file data for a given file system, and very little metadata. In almost all cases, many OSTs are grouped to form a single file system through a Logical Object Volume (LOV), and it is in this way that Lustre distributes I/O and locking load amongst many OSSs.

One MDT plus one or more OSTs make up a single Lustre file system, and are managed as a group. You can think of this group as being analogous to a single line in /etc/exports on an NFS server.

Client nodes mount the Lustre file system over the network and access the files with POSIX file system semantics. Each client communicates directly with the MDS and OSSs responsible for that file system, using a distributed lock manager to keep everything synchronized and protected.
There are many factors which will affect the performance of your Lustre system, discussed in more detail throughout this manual. In general, you will achieve the greatest value for your money by keeping things balanced. There are many data pipelines within the Lustre architecture, but two in particular have a very direct performance impact: the network pipe between clients and OSSs, and the disk pipe between the OSS software and its backend storage. By balancing these two pipes, you save money and maximize performance. For example: if your OSSs' disks are capable of much higher bandwidth than your network, it is likely that cheaper, slower disk would work just as well; alternatively, you could spread those disks across more OSSs to increase your aggregate performance.

2.1.2.1 The Rule of Thirds
When sizing your server nodes, the OSSs in particular, we recommend that you divide your CPU into thirds: one third for the disk backend, one third for the network stack, and one third for Lustre. In your disk backend calculation, include the drivers but not the file system; in other words, are you able, with your chosen processor, to achieve the desired bandwidth to a raw device without exceeding 33% CPU utilization? In the network stack estimate, include the TCP stack if you plan to run Lustre over TCP. If you plan to run Lustre over an advanced network such as Elan or Myrinet, use the vendor-supplied benchmark tools to determine if you can drive the network with one third of your CPU. If you adhere roughly to these guidelines, the remaining third of the CPU should be sufficient for the Lustre software stack (locking, backend file system, networking layers, etc.).
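The budget check at the heart of this rule can be sketched as follows. The dd/vmstat measurement commands in the comments and the numbers used are illustrative assumptions, not output from a real OSS.

```shell
# Sketch of the one-third budget check. To measure the disk pipe on an idle
# OSS you might run something like (a read test, safe for existing data):
#   dd if=/dev/sdX of=/dev/null bs=1M count=4096 &
#   vmstat 5 5    # watch the "us"+"sy" CPU columns while dd runs
budget=33        # percent of CPU allowed per pipeline (disk, network, Lustre)
measured=28      # hypothetical us+sy percentage observed during the test
if [ "$measured" -le "$budget" ]; then
  echo "disk pipe fits the one-third budget"
else
  echo "disk pipe over budget: add CPU or spread disks across more OSSs"
fi
```

The same comparison applies to the network pipe, substituting your network benchmark's observed CPU utilization for the measured value.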
In most cases, Lustre can recover from any single server or infrastructure failure in a way which is transparent to your applications. However, it is also the most difficult part of the architecture to systematically test, so we currently recommend that administrators expect Lustre to recover from 90% of node failures in a way which is transparent
to their users. By a single failure, we mean that the entire cluster must first recognize and recover from the failure of one component before another component fails. In the context of a recovery discussion, a component could be a single metadata server, one or more object storage servers, one or more clients, or the network as a whole. For this reason, if you run a client on the same node as the metadata server, a failure of that node is automatically a double failure, which will not recover in an application-transparent way.

When Lustre is unable to recover transparently, it is nevertheless exceptionally rare that the cluster will need to be rebooted, or the entire file system remounted. In almost all cases, your applications will receive an error for any in-progress file system calls, but will be able to access the file system normally from that point.
2.1.4.1
A non-sparse file will grow its component stripes at roughly the same rate (see Section 2.4), meaning that for a file striped over 5 OSTs, each OST will need enough free space for roughly 1/5th of that file's data. To determine the largest single non-sparse file you can have, multiply the amount of free space on the most-used OST by the number of stripes in the file. Because OSTs can fill up at different rates, it is possible that a write to one part of that file will return -ENOSPC, while a write to a different part will succeed.
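The free-space bound described above can be computed directly. The numbers here are hypothetical, chosen only to illustrate the arithmetic:

```shell
# Hypothetical example: a file striped over 5 OSTs, where the
# most-used (fullest) OST has 120 GB free. Since stripes grow at
# roughly the same rate, that OST bounds the whole file's size.
most_used_free_gb=120
stripe_count=5
max_file_gb=$((most_used_free_gb * stripe_count))
echo "largest safe non-sparse file: ${max_file_gb} GB"
```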
RAID device (/dev/md*), or an LVM device (/dev/group_name/*).
2.2.2.1 RAID 0 (striping)
RAID 0 (striping) does not offer any form of data protection by itself. It is purely a performance improvement in which multiple devices are combined to create a single larger block device. If any of the underlying devices are damaged, the entire RAID device is unreadable. For this reason, using RAID 0 by itself improves performance but increases the risk of data loss.
2.2.2.2 RAID 1 (mirroring)
RAID 1 (mirroring) keeps identical copies of data on multiple devices, usually at some small performance penalty. In the event of a crash or other unclean shutdown, Linux software RAID 1 will re-sync the devices in the background by copying the entire device; during this (possibly long) re-sync period, some performance degradation can be expected. As long as any one device in a RAID 1 set is functioning, your data is available. Because at least half of your disk space is used for identical copies of data, however, RAID 1 is not very space-efficient.
2.2.2.3 RAID 5 (parity)
RAID 5 (parity) aggregates at least three devices in a fault-tolerant and reasonably performant way. Instead of making identical copies as in RAID 1, parity blocks are stored on a different disk from the file data. This parity information is used to reconstruct data in the event of a disk failure, and is much smaller than a duplicate copy of the data. RAID 5 is the most popular choice of our current customers, balancing availability requirements with reasonable performance and space efficiency.
Lustre will run equally well on hardware and software RAID solutions, all else being equal. However, it is important to keep in mind that a software RAID solution on the object storage server (OSS) comes at some cost in overhead. If you deploy a software RAID 1 solution, remember that all writes will need to be written at least twice (once to each device in the RAID set). For a write-intensive load on a high-bandwidth OSS, make sure that your buses are capable, and expect CPU overhead on the order of 5%. If you deploy a software RAID 5 solution, keep in mind that there is considerable overhead in parity calculation. For a write-intensive load on a high-bandwidth OSS, you should plan to allocate 15% of a modern CPU. If your bandwidth requirements are high (more than 120 MB/s per OSS) and you require the availability guarantees provided by RAID 5, hardware RAID may be a better option.
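A quick back-of-the-envelope check of the double-write cost of software RAID 1 described above, using a hypothetical sustained write load:

```shell
# Software RAID 1 writes every block twice (once per device in the
# mirror), so the buses must carry twice the client-visible write rate.
oss_write_mb_s=120   # hypothetical sustained OSS write load
raid1_copies=2
bus_mb_s=$((oss_write_mb_s * raid1_copies))
echo "bus bandwidth consumed by RAID 1 writes: ${bus_mb_s} MB/s"
```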
Lustre 1.x does not support the clustering of multiple metadata targets (MDTs) into a single Lustre file system. As a result, if you require additional metadata storage, you have three choices: reformat your file system and copy all of the data back (see 7.2 and 7.3); use a hardware RAID device which can grow volumes safely; or use the LVM software solution. If your MDS device is an LVM volume, you can grow it by adding a device to the volume with the LVM tools (a precise explanation of which is outside the scope of this document). After you resize the LVM device, you must resize the Lustre backend MDS file system. Lustre uses a modified version of ext3 as its backend file system; an updated version of the ext3 resize tool will be supplied with Lustre 1.4 later in 2004.

2.2.3.2 LVM on Object Storage Servers
Each object storage target (OST) has its own block device and backend file system. Unlike with metadata, you can have more than one object storage server (OSS); some large installations have hundreds in a single file system. Each OSS can itself hold multiple OSTs, and these OSTs are the building blocks of a Lustre file system.
There are four ways to increase the data storage in your Lustre file system: reformat with more OSTs and copy your data back; add more OSTs to an existing file system; use a hardware RAID device which can grow volumes safely; or re-size the OSTs using the LVM. Today, Lustre does not include internal data migration features to re-balance file data amongst many OSTs. Although you can grow a Lustre file system by adding more OSTs, the usual situation is that the old OSTs have a lot of data and the new OSTs are empty. In future versions of Lustre, a migrator will rebalance this data in the background. Until then, one alternative is to use the LVM to grow your file system without adding additional OSTs, exactly as you would on the MDS.
2.2.3.3 Limitations
This is not a perfect solution, however, because Linux limits how large a single ext3 file system can be. A single MDT or OST device cannot exceed 2 TB, which also means that a single data object on an OST cannot exceed 2 TB. Fortunately, files in a Lustre file system can exceed 2 TB by striping them over multiple OSTs (see Section 2.4). If your devices have reached 2 TB and you require additional storage, your only choices are to reformat or to add new, empty OSTs.
With Linux 2.4 kernels there is an upper limit of 2 TB per block device, although some device drivers actually have only a 1 TB limit. If you require large amounts of storage, you should create each OST with the maximum possible size. It is possible to configure an OSS with multiple OSTs, although having too many OSTs on one node will hurt performance. Other things being equal, it is preferable to have fewer, larger OSTs in order to use space more efficiently and to allow larger files without requiring large numbers of stripes.
If your aggregate bandwidth is ultimately limited by your disk subsystem (i.e., you have excess network capacity on each OSS), then you can improve your peak I/O by adding more disk. You can add disk by growing an existing volume or by adding OSTs to an existing OSS (see Section 6.1). In both cases, by adding disk bandwidth without adding more OSS nodes, you will begin to take advantage of your excess network capacity.

2.3.3.2 Increasing Network Bandwidth
Whether you have local disks on each OSS or a large SAN, if you have enough disks, you will eventually overwhelm the network pipe on that node. Assuming that your network fabric is up to the challenge (which is outside of the scope of this document), the way to increase the aggregate network bandwidth available to Lustre is to add OSS nodes.
There are two reasons to create files with multiple stripes: bandwidth and size. Many applications require high-bandwidth access to a single file (more bandwidth than can be provided by a single OSS): for example, scientific applications which write to a single file from hundreds of nodes, or a binary executable which is loaded by many nodes when an application starts. In cases such as these, you want to stripe your file over as many OSSs as it takes to achieve the required peak aggregate bandwidth for that file. In our experience, the requirement is "as quickly as possible", which usually means all OSSs. Note: this assumes that your application uses enough client nodes, and can read/write data fast enough, to take advantage of that much OSS bandwidth. The largest useful stripe count is bounded by the I/O rate of your clients/jobs divided by the performance per OSS. The second reason to stripe is when a single object storage target (OST) does not have enough free space to hold the entire file. In an extreme example, striping can be used to overcome the Linux 2.4 maximum file size limitation of 2 TB.
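The stripe-count bound mentioned above can be sketched numerically. The rates below are hypothetical:

```shell
# Largest useful stripe count = aggregate client I/O rate / per-OSS rate.
# Striping wider than this adds no bandwidth the clients can consume.
aggregate_client_mb_s=2000   # hypothetical combined client I/O rate
per_oss_mb_s=250             # hypothetical measured per-OSS throughput
max_useful_stripes=$((aggregate_client_mb_s / per_oss_mb_s))
echo "largest useful stripe count: ${max_useful_stripes}"
```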
for IA-64 clients. Although you could create files with a stripe size of 16 KB, this would be a poor choice. Practically, the smallest recommended stripe size is 512 KB, because Lustre tries to batch I/O into 512 KB chunks over the network. We have found that this is a good amount of data to transfer at once, and choosing a smaller stripe size may hinder that batching. Our testing indicates that stripe sizes between 1 MB and 4 MB are good for sequential I/O using high-speed networks. Stripe sizes larger than 4 MB will not parallelize as effectively, because Lustre tries to keep the amount of dirty cached data below 4 MB per server with the default configuration. Writes which cross an object boundary are slightly less efficient than writes which go entirely to one server. Depending on your application's write patterns, you can assist it by choosing the stripe size with that in mind: if the file is written in a very consistent and aligned way, you can do it a favor by making the stripe size a multiple of the write() size. The choice of stripe size has no effect on a single-stripe file.
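For the write-alignment advice above, here is a simple check (with hypothetical sizes) of whether a candidate stripe size is a multiple of the application's write() size:

```shell
# If the stripe size is a whole multiple of the write() size, aligned
# writes never straddle an object boundary mid-record.
write_size_kb=64      # hypothetical application record size
stripe_size_kb=1024   # candidate stripe size (1 MB)
if [ $((stripe_size_kb % write_size_kb)) -eq 0 ]; then
    echo "aligned: $((stripe_size_kb / write_size_kb)) writes per stripe"
else
    echo "misaligned: some writes will cross object boundaries"
fi
```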
When choosing file system or per-directory defaults (see Section 2.10), we strongly recommend that you allow Lustre to choose randomly, by specifying a starting OST of -1.
Be aware that a client running on the same node as the MDT will prevent clean failover, and a client running on the same node as an OST is not completely stable. In production environments, you should avoid these configurations.
2.5.1.1 Basic Information
Using the Config Builder, your configuration begins by specifying a name for your file system and selecting your network type from the drop-down menu. Depending on your selections here, certain values will be pre-calculated for you elsewhere in the Config Builder. These pre-calculated values represent a best guess, so please review all the fields and change the pre-calculated values if they are not accurate. Note: the Config Builder does not yet allow you to override the default UUID selection; if you need to do this today, you will need to use lmc (see Section 2.9). The following characters are valid for use in file system configuration, file names, and user names: [A-Z], [a-z], [0-9], _, -.
2.5.1.2 MDT Setup
Field descriptions for the Config Builder used in MDT setup are listed in Table 2.1.
2.5.1.3 OST Setup
Only two slots for OSTs are displayed by default in the Config Builder. If you wish to add more OSTs, click the Add OST button and another OST slot will appear, up to a maximum of 32 OSTs. The field descriptions for the Config Builder used in OST setup are listed in Table 2.2. In the current version, LMT cannot customize the OST service name; it will be auto-generated in the format OST_hostname, OST_hostname_2, OST_hostname_3, ...
MDT service name: A unique name for the MDT service. Other nodes use this name to refer to this MDT.
MDT service host: Hostname of the MDT node.
MDT host network ID: For TCP hosts, the network ID is identical to the hostname. For other network types, the network ID is calculated differently.
MDT backing store: Path to the device used as the backing store for the MDT, e.g. /dev/sda1. If you do not have dedicated storage hardware available, the backing store can also point to a loopback device.
Store size (optional): The size (KB) to use for the backing store. If not specified, the maximum possible size of the backing store device is used. Note: you must specify a size if using a loopback device.

Table 2.1: Field description for Config Builder: MDT

OST host name: Hostname of the OST node.
OST host network ID: For TCP hosts, the network ID is identical to the hostname. For other network types, the network ID is calculated differently.
OST backing store: Path to the device used as the backing store for the OST, e.g. /dev/sda1. If you do not have dedicated storage hardware available, the backing store can also point to a loopback device.
Store size (optional): The size (KB) to use for the backing store. If not specified, the maximum possible size of the backing store device is used. Note: you must specify a size if using a loopback device.

Table 2.2: Field description for Config Builder: OST
Client nodes: A glob list of the client nodes in your cluster, e.g. clientnode[3-6,8,10-12].
Mount point: The mount point for Lustre on your client nodes (default: /mnt/lustre).

Table 2.3: Field description for Config Builder: Clients
2.5.1.4 Client Setup
In the absence of specific client information, LMT will create a generic client entry in the configuration file. This generic entry can be used to mount an arbitrary number of clients. If you do decide to provide specific client information, Table 2.3 describes how to fill out the various fields in the Clients section of the Config Builder.
2.5.1.5 Striping Patterns
By default, the Config Builder creates a single logical object volume (LOV) which encompasses all of the OSTs you specified using the LMT. This LOV is set up with the following striping pattern: stripe size 1 MB; stripe count 1; OST choice random. For more information about what these values mean, and how to choose reasonable defaults for your environment, see Section 2.4. This first version of the Config Builder does not allow changes to this default; if you want to specify a different default striping pattern, see Section 2.9. All files will be created using this stripe configuration unless an alternate is requested. You can alter the striping policy on a per-directory or per-file basis using the lfs tool. For more information about lfs and other striping tools, see Section 2.10.
Note: In the current version, when you start/stop a service, if the service status does not automatically update, you can click the refresh button on the settings page to discard cached status data and get updated status information.
2.5.3.1 Basic Information
Choose your file system name and select your network type as in the simpler configuration above.
2.5.3.2 MDT Setup
Configure the single MDT as before. This node can be shared with an OST, if necessary.
2.5.3.3 OST Setup
If you have high-speed storage controllers, you can configure two OSTs on the same node (object storage server, or OSS). We suggest a 4 OSS / 8 OST setup for this parallel I/O configuration. The OSTs are configured in the same way as before. Two OSTs on the same node will have the same service host and network ID, but the service name and device name must be unique for each OST. The store size remains optional. If you don't have special hardware, cluster performance will likely be better with a single OST per OSS.
2.5.3.4 Client Setup
The client configuration remains generic, so you can still mount an arbitrary number of clients. For parallel I/O operations, 16 or more clients would be typical.
Lustre's end-to-end performance is most meaningful in relation to the raw performance of your network. We recommend running a simple TCP benchmark such as gen/sink to establish a baseline, and then running an I/O benchmark on the mounted Lustre file system. Depending on your setup, and assuming your disk backend is capable, you should see Lustre performance of greater than 80% of the raw network performance.

2.5.3.6 Characterizing Disk Speed
In order to properly characterize disk speed under Lustre, you must first determine what your disk speed is for the underlying storage without Lustre running. We recommend running the iozone benchmark described in Section 2.8.1 to determine the raw speed of an ext3 partition on your disk device. You can then run the same test again with Lustre mounted, using the disk device as the backing storage for your OST, and compare the numbers from the two tests. Depending on your setup, you should see Lustre performance of greater than 80% of the raw ext3 rate.

2.5.3.7 Monitoring CPU Usage
gen/sink and iozone create large amounts of network and disk traffic, respectively. It is a good idea to monitor the CPU usage in both cases (on both servers and clients) using a tool such as vmstat. This will let you know whether either performance measurement is being limited by excessive CPU usage.

2.5.3.8 Lustre Throughput Tests
You can determine Lustre throughput rates using the various tests described in the Validation section (Section 2.8.1).

2.5.3.9 Performance Effects of Resource Sharing
If you have a limited number of nodes available, it may sometimes be necessary to run more than one Lustre service on the same node. For example, it is not uncommon for an MDT to share a node with an OST in a smaller cluster. However, this setup will likely have an impact on cluster performance. If possible, we recommend that you perform the above profiling steps on both a separate-node and a shared-node configuration for your cluster to determine whether the performance trade-offs are acceptable to you. It is also recommended that Lustre servers (both MDTs and OSTs) be run on dedicated nodes, i.e., not nodes that are running other servers (NFS, etc.), or nodes that are used as administration or login nodes. An excess of non-Lustre processes will necessarily degrade Lustre performance, and vice versa.

2.5.3.10 Effects of Different Striping Patterns
The current version of the Lustre Manager does not support changing the default striping pattern. For more information on setting striping patterns manually, see Section 2.5.
2.5.5 LDAP
LDAP is not supported in the current version of the Lustre Manager. For instructions on setting up an LDAP server, please see Section 1.4.
unrecoverable file system damage could result if there are journal pages, etc., in that cache. At any rate, an fsck would definitely be required in that case, and that will take a long time. For this reason, we strongly recommend that all write caches in hardware RAID devices be battery-backed.
lmc --add ost --ost ost1 --failover --node nodeA \
    --lov lov1 --device /dev/sda1
lmc --add ost --ost ost1 --failover --node nodeB \
    --lov lov1 --device /dev/sdb1
lmc --add ost --ost ost1 --failover --node nodeA \
    --group nodeA --lov lov1 --device /dev/sda1
lmc --add ost --ost ost1 --failover --node nodeB \
    --lov lov1 --device /dev/sdb1
lmc --add ost --ost ost2 --failover --node nodeB \
    --group nodeB --lov lov1 --device /dev/sdc1
lmc --add ost --ost ost2 --failover --node nodeA \
    --lov lov1 --device /dev/sdd1
If, after a failover, the client sees that the server is missing some transactions that were committed, you'll see "server went back in time" errors. This is usually the result of losing data in a hardware write cache, as mentioned above.
add below kptlrouter portals
add below ptlrpc ksocknal
add below llite lov osc
alias lustre llite

Figure 2.1: modules.conf entries for automatic client mounting

mdt_hostname:/mdt_service_name/client_name /mnt/lustre lustre defaults,_netdev 0 0

Figure 2.2: fstab entry to allow automatic client mounting

mdt_hostname, mdt_service_name, client_name, and /mnt/lustre should all be updated with local values as shown in Figure 2.2.
Iozone. Iozone is a file system benchmark tool that measures a variety of file operations. It is quite useful for providing a broad analysis of the file system and for confirming that Lustre is configured correctly. Iozone can be downloaded from the Iozone homepage: http://www.iozone.org.

IOR. IOR is the Interleaved Or Random parallel file system test code developed at Lawrence Livermore National Laboratory. IOR uses the Message Passing Interface (MPI) to perform parallel writes and reads to calculate file system throughput. To download IOR, visit the IOR homepage: http://www.llnl.gov/asci/purple/benchmarks/limited/ior/.

2.8.1.2 Metadata
Bonnie. Bonnie is another file system benchmark that performs a series of tests on a file of known size. The tests that Bonnie performs stress small-file I/O performance and metadata operations. Bonnie can be found at http://www.textuality.com/bonnie/.
It is important to note that the OSTs must be set up first, followed by the MDS; finally, the client can mount Lustre. This order is necessary, as the MDS will need to talk to the OSTs to finish its setup, and the client will need to communicate with the MDS and OSTs to mount Lustre.
# configure the OSTs
$ lconf --reformat --gdb --node node2 config.xml
# configure the MDS
$ lconf --reformat --gdb --node node1 config.xml
# mount the client
$ lconf --reformat --gdb --node node3 config.xml
File striping (introduced in Section 2.4) can be specified on a per-file-system, per-directory, or per-file basis. After a file has been created, its stripe configuration is locked in and cannot be changed. This section describes how to use the lfs tool to change and inspect the stripe configuration.
2.10.1 Per-File
New files with a specific stripe configuration can be created with lfs setstripe:

lfs setstripe <filename> <stripe-size> <starting-ost> <stripe-count>

If you pass a stripe-size of 0, the file system default stripe size will be used; otherwise, the stripe-size must be a multiple of 16 KB. If you pass a starting-ost of -1, a random first OST will be chosen; otherwise, the file will start on the specified OST index (starting at zero). If you pass a stripe-count of 0, the file system default number of OSTs will be used. A stripe-count of -1 means that all available OSTs should be used.
2.10.2 Per-Directory
lfs setstripe also works on directories, setting a default striping configuration for files created within that directory. The usage is the same as for a regular file, except that the directory must exist before the default striping configuration is set on it. If a file is created in a directory with a default stripe configuration (without otherwise specifying the striping), Lustre will use those striping parameters instead of the file system default for the new file.
2.10.5 Examples
Create a file striped on one random OST, with the default stripe size:

$ lfs setstripe /mnt/lustre/file 0 -1 1

List the striping pattern of a single file:

$ lfs getstripe /mnt/lustre/file

List the striping pattern of all files in a given directory:

$ lfs find /mnt/lustre/

List all files which have objects on a specific OST:

$ lfs find -r --obd OST2-UUID /mnt/lustre/
Chapter 3
Configuring Monitoring
3.1 Basic monitoring
3.1.1 System Health
The Lustre Management Tool provides an overview of what is happening on all servers and clients: early warnings, recent throughput, space utilization, and all the other things you need to know to keep an eye on the servers. Errors and surprises do happen; disks fill up, or may fail.
3.1.3 Bandwidth/Disk/CPU
The vmstat program can be used to provide simple monitoring of system performance on clients and servers. vmstat output shows the number of runnable processes and the number of processes in uninterruptible sleep (often waiting for disk or RPC completion in the context of Lustre). It shows free and cached memory, although it should be noted that free memory in Linux is usually very low, because Linux aggressively caches file data in otherwise-unused memory. vmstat shows disk device input and output, but this is not indicative of read and write behavior on Lustre clients, because their I/O goes over the network. It also includes percentage CPU usage for user, system, and idle time (and iowait on some platforms).
The Lustre Manager provides a summary on the Overview tab, which displays a report of per-service capacity and free space.
Lustre will log a large variety of messages to the system log via the kern.* log facility. These include messages for Lustre startup, shutdown, network timeouts, and other errors. Messages are prefixed with "Lustre:" or "LustreError:" for easy identification and filtering. Since Lustre is a distributed file system, it is most beneficial to log all of the client and server system logs to a central logging host so that events that take place on several nodes can be correlated more easily. To forward syslog messages to another node, add the node name to the syslog.conf file for a facility. In addition, the remote logging host should be configured to allow remote hosts to log messages there (the -r option for the standard syslogd). For example:

kern.info           /var/log/kern.log
kern.info,*.err     @remote_logging_host
3.2.3 SNMP
It is possible to integrate Lustre with SNMP via syslog monitoring for LustreError events.
Chapter 4
4.2 E2fsck
Journal recovery is normally handled by the kernel ext3 driver at file system mount time (i.e., at Lustre OST or MDS setup). If the Lustre server is set to start automatically at boot time, it is also possible to have e2fsck do the journal recovery before the file system is mounted, so that it can validate the file system superblock and check for errors that were detected during the previous mount. Mounting and writing to a file system with errors may lead to further file system corruption. If e2fsck detects an error, having it do a full file system consistency check can take upwards of 20 minutes per 100 GB of storage, so some administrators prefer to perform file system checking manually at a scheduled outage instead of immediately after reboot, although there is some (usually small) risk associated with using a file system with errors. To have e2fsck do journal recovery and the normally-quick basic consistency checks at boot time, the storage device should be listed in /etc/fstab with the noauto option, to prevent it from being mounted, and a non-zero value for the 6th field (fs_passno, normally 2 so it is checked after the root file system), as shown in Figure 4.1.

# Lustre OST devices, do not mount
/dev/sdb1 none ext3 noauto 0 2
/dev/sdb2 none ext3 noauto 0 2

Figure 4.1: /etc/fstab for checking Lustre devices at boot

In addition to recovering the journal, e2fsck will do a full file system check if an error was reported on the device during a previous mount, or if some time interval has passed since the last file system check. In order to avoid an unnecessary and possibly lengthy file system check at startup, the automatic date- and mount-based file system checks should be turned off with tune2fs -c 0 -i 0, or set to some value suitable for your environment (see the tune2fs(8) man page for details).
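To put the fsck-time figure above in perspective, a rough estimate using a hypothetical device size:

```shell
# A full e2fsck can take upwards of ~20 minutes per 100 GB of storage.
ost_size_gb=2000    # hypothetical 2 TB OST
min_per_100gb=20
fsck_minutes=$((ost_size_gb / 100 * min_per_100gb))
echo "estimated full fsck time: ${fsck_minutes} minutes"
```

An estimate like this helps decide whether to check at boot or defer to a scheduled outage.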
4.3 lfsck
In order to use lfsck, one must first run the Lustre-modified e2fsck on the MDS to generate the MDS inode and LOV EA database, and then once on each OST to create the object databases. The MDS database, mdsdb, must be created first and made available on all of the OST nodes during their fsck runs, but it is not modified by the OSTs, so they can run e2fsck on the target file systems in parallel to create one ostdb for each OST. Figure 4.3 shows the commands used to run e2fsck on the MDS and OST nodes.
It should be noted that the e2fsck runs MUST be done while Lustre is unmounted from all of the clients and the MDS and OST services are shut down. Running e2fsck while the MDS or OST services are using the file systems will lead to severe file system corruption.
mds# e2fsck -f -y --mdsdb /path/to/shared/mdsdb /dev/sdb1
ost1# e2fsck -f -y --mdsdb /share/mdsdb --ostdb /share/ostdb1-1 /dev/sdb1
ost1# e2fsck -f -y --mdsdb /share/mdsdb --ostdb /share/ostdb1-2 /dev/sdb2
  :
ostN# e2fsck -f -y --mdsdb /share/mdsdb --ostdb ...
ostN# e2fsck -f -y --mdsdb /share/mdsdb --ostdb ...
Figure 4.3: Running e2fsck on the MDS and OST nodes

After the mdsdb and ostdb files are created, they must all be made available on a single client node in order to combine the databases and find any errors that the file system corruption may have caused. At this point the MDS and OST services can be started and the Lustre file system mounted on clients. lfsck is run on the client with the databases, and will report (and optionally repair) any problems it finds (see Figure 4.4). It should be possible to use the Lustre file system during lfsck operation.
# lfsck -l --mdsdb /shared/mdsdb --ostdb \ /shared/ostdb1-1,/shared/ostdb1-2,... /lustre/mountpoint
lactive --ldapurl ldap://lustre --config fs --pwfile <pw> \
    --group nodeA --active nodeB
lconf --ldapurl ldap://lustre --config fs --node nodeB [--group nodeA]

If the group is specified, then only the devices in that group will be started. If no group is given, then all devices that are active and not already started on nodeB will be started.
lconf --node nodeB [--group nodeA] --select ost2=nodeA \
    --cleanup --force --failover <config.xml>

The group here is used to specify which devices to shut down, and is required in an active/active configuration to prevent all the devices from being stopped. The rest of failback is identical to a regular failover.
Chapter 5
Health Checking
5.1 What to do when Lustre seems too slow
There are many reasons why Lustre may not perform as well as it should, but the first step is to make sure that your expectations are reasonable. Ask yourself these important questions: Am I expecting more bandwidth than the raw network or disk hardware would allow? (See Section 2.5.3.5.) Does the application perform I/O from enough client nodes to take advantage of the aggregate bandwidth provided by the object storage servers?
5.1.4 Investigation
If the problem still exists after checking the above, there may be a legitimate bottleneck which requires investigation. There are several general investigative tools that can be used to evaluate which nodes in a Lustre file system may be the cause of slowness.
5.1.4.1 vmstat
The CPU-use columns in the vmstat output can identify a node whose CPU is entirely consumed. On metadata server (MDS) and object storage server (OSS) nodes, the I/O columns tell you how many blocks are flowing through the node's I/O subsystem. Combined with knowledge of the node's attached storage, you can determine whether this subsystem is the bottleneck. Clients do not show any block I/O in vmstat. The columns that report swap activity can identify nodes that are having trouble keeping their working applications in memory.
5.1.4.2 iostat
When the host kernel has been configured to provide detailed I/O stats per partition, iostat can provide insight into the nature of I/O bottlenecks. It shows the nature and concurrency of the requests being made to attached storage.
5.1.4.3 top
top helps identify tasks that are monopolizing system resources. It can identify tasks that aren't generating file system load because they are busy using the CPU, or server threads that are struggling to keep up on an overloaded node.
oprofile (http://oprofile.sourceforge.net/) is invaluable for profiling the use of CPU on a node. Its installation and use are beyond the scope of this document, but we highly recommend it.
If an application is to take advantage of large network and disk pipes, it must generate a lot of write traffic, which can be cached on the client node and packaged into RPCs for the network. There must be free memory on the node for use as a write cache; if the kernel can't keep at least 4 MB in use for Lustre write caching, it cannot keep an optimal number of network transactions in progress at once. There must also be enough CPU capacity for the application to do the work which generates data for writing.

5.1.5.2 Nearly-full file systems
To prevent a situation in which Lustre puts application data into its cache but cannot write it to disk because the disk is full, Lustre clients must reserve disk space in advance. If a client is unable to reserve this space because the OST is within 2% of full, it must execute its writes synchronously with the server, instead of caching them for efficient bundling. The degree to which this affects performance depends on how much your application would benefit from write caching. The cur_dirty_bytes file (in each OSC's subdirectory of /proc/fs/lustre/osc/ on a client) records the amount of cached writes which are destined for a particular storage target. The maximum amount of cached data per OSC is determined by the max_dirty_mb value in the same directory, 4 MB by default. Increasing this value will allow more dirty data to be cached on a client before it needs to flush to the OST, but also increases the time needed for other clients to read or overwrite that data, as it must be written to the OST before other clients can access it.

5.1.5.3 Network congestion
The network between the client and storage target needs to have capacity for the write traffic.
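As a rough congestion check, an interface's byte counters (read, for example, from ifconfig) sampled at two points in time give the achieved throughput. A minimal sketch of the arithmetic; the counter values are whatever you read off the interface.

```shell
# Bytes-per-second from two byte-counter samples taken `interval`
# seconds apart (integer arithmetic, so a coarse estimate).
throughput_bps() {
    before=$1 after=$2 interval=$3
    echo $(( (after - before) / interval ))
}
```

For example, read the RX byte count of an interface twice, ten seconds apart, and pass both values and the interval in.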
ifconfig has byte counters for each interface, which can be used to measure the throughput of a TCP session over that interface. netstat -t shows the size of the queues on a given socket. netstat -s can show packet loss and retransmissions for TCP on the node.

5.1.5.4 Server thread availability
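A quick way to gauge server thread availability, following the ps approach discussed in this section, is to count ll_ost threads in the D state. A sketch, assuming a procps-style ps:

```shell
# Count ll_ost* kernel threads in uninterruptible sleep ("D"),
# i.e. threads currently busy servicing I/O requests.
count_busy_ost_threads() {
    ps -eo stat=,comm= | awk '$1 ~ /^D/ && $2 ~ /^ll_ost/ {n++} END {print n+0}'
}
```

If this number is consistently at or near the configured thread count, the server is saturated.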
Write RPCs arrive at the server and are processed synchronously by kernel threads (named ll_ost_*). ps will help to identify the number of threads that are in the D state, which indicates that they're busy servicing a request. vmstat can give a rough approximation of the number of threads that are blocked processing I/O requests when a node is busy servicing only I/O RPCs. The number of threads sets an upper bound on the number of I/O RPCs that can be processed concurrently, which in turn sets an upper bound on the number of I/O requests that will be serviced concurrently by the attached storage.

5.1.5.5 Backend throughput
iostat -x is invaluable for profiling the load on the storage attached to a server node. Its man page details the meaning of the various columns in the output. The raw throughput number (wkB/s) combines well with the requests per second (w/s) to give the average size of I/O requests to the device. The service time gives the amount of time it takes the device to respond to an I/O request; its inverse sets the maximum number of requests per second that can be handled when requests are not issued concurrently. Comparing this with the requests per second gives a measure of the amount of storage device concurrency.
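The arithmetic described above can be scripted. This sketch takes the wkB/s and w/s figures from iostat -x output and prints the average write request size in kilobytes:

```shell
# Average write request size: write throughput divided by write
# request rate, both taken from the iostat -x columns.
avg_request_kb() {
    awk -v wkbs="$1" -v ws="$2" \
        'BEGIN { if (ws > 0) printf "%.1f\n", wkbs / ws; else print 0 }'
}
```

For example, 51200 wkB/s at 100 w/s works out to 512 KB per request, i.e. large, well-bundled writes.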
When reporting a bug, you may be asked to submit a sample of the Lustre kernel debug log after you've reproduced the bug. These logs are very verbose, so it's important to reproduce the bug on a quiescent system whenever possible. You may be asked to change the system debug level to gather more or less information when you reproduce the problem.

5.2.2.1 Log levels
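The two debug bitmaps discussed in this section live under /proc/sys/portals. A sketch for switching to full verbosity: the proc root is parameterized so the file handling can be tried against a scratch copy, and echoing -1 simply sets every bit in both bitmaps.

```shell
# Enable all debug message types and all subsystems (-1 = all bits).
set_full_debug() {
    root="${1:-/proc/sys/portals}"
    echo -1 > "$root/debug"             # message-type bitmap
    echo -1 > "$root/subsystem_debug"   # subsystem bitmap
}
```

Remember to note and restore the previous values afterwards, since full debugging can cost up to 90% of performance.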
Lustre's default debug level is very low, appropriate for a production system which requires minimal logging and minimal performance impact. A high debug level can impact performance by as much as 90%. Debugging is controlled by two bitmaps: one controls the types of messages saved (tracing, locking, memory allocation, etc.), and the other controls which subsystems are saved (MDS, DLM, Portals, etc.). These reside in /proc/sys/portals/debug and /proc/sys/portals/subsystem_debug, respectively. The meaning of these bits can be found in lustre/portals/include/linux/kp30.h.

5.2.2.2 Log collection
Before you reproduce the bug, clear the existing log data:

lctl clear

After you reproduce the problem, save a copy of the debug log:

lctl dk <filename>

If you trigger a Lustre assertion in the form of an LBUG error, a debug log will automatically be dumped in /tmp; the exact file name will be printed to the console.
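The clear/reproduce/dump cycle above can be wrapped in a small helper. This is a sketch built only from the lctl commands shown above; LCTL is overridable so the sequence can be dry-run on a node without Lustre.

```shell
# Clear stale debug data, run the reproducer, then dump the log.
collect_debug_log() {
    out="$1"; shift
    ${LCTL:-lctl} clear          # drop existing log data
    "$@"                         # the command that reproduces the bug
    ${LCTL:-lctl} dk "$out"      # save the debug log
}
```

For example, collect_debug_log /tmp/bug.log ./reproduce.sh, where reproduce.sh is whatever triggers the bug.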
slow commit/write: These messages occur on the server.

lock timeouts:
client: When the client times out waiting for a lock, it will assume the connection to the server has failed and start recovery.
server: When the server times out waiting for a client to respond to a lock cancel request, the server will then evict the client to allow other nodes to make progress. The next time the client sends an RPC, it will receive an error and will have to reconnect.

Socknal timeout message: The socknal will not attempt to retransmit after a timeout. It will just close the connection and drop the message, after which the higher layers of Lustre will reconnect and attempt recovery.

RPC timeouts: A client will attempt to reconnect after an RPC timeout occurs.

ENOSPC (errno 28): When the MDS or OST runs out of space, you will see ENOSPC errors in the logs.

ENOTCONN (errno 107): This usually means the client has been evicted by the server. It can also mean the server has been restarted. In either case, the client will reconnect and either recover or clear its state if it has been evicted.

Remounting file system read-only: When the underlying disk file system detects corruption, it will remount itself read-only to prevent further damage. Lustre must be shut down on this target, and e2fsck run on the block device.
Chapter 6
Managing Configurations
6.1 Adding OSTs
6.1.1 The importance of OST ordering
Adding an object storage target (OST) is as easy as creating a new OST entry in the configuration file. This must be done while the Lustre file system is not running. If you don't use lconf to start your clients (i.e., you are using a 0-conf setup, or you just run "mount"), then you will need to re-generate the configuration logs on your metadata server (MDS).
The new OST entry must come at the end of the list of OSTs in the configuration file. If you insert the entry in the middle, the results are undefined, and corruption and data loss may occur.
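A sketch of the sequence: append the new OST entry with lmc, then regenerate the configuration logs with lconf --write_conf on the MDS. The option names and the lov1/node/device values here are illustrative, and exact lmc options vary between Lustre versions, so check lmc(1) on your system; LMC and LCONF are overridable so the sequence can be dry-run.

```shell
# Append a new OST to the end of the configuration, then regenerate
# the config logs so mount-based (0-conf) clients see the new target.
add_ost() {
    cfg="$1" node="$2" ost="$3" dev="$4"
    ${LMC:-lmc} -m "$cfg" --add ost --node "$node" --lov lov1 \
        --ost "$ost" --dev "$dev"
    ${LCONF:-lconf} --write_conf "$cfg"   # run this step on the MDS
}
```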
set of OSTs at file creation time. Unfortunately, the only manual migration option at this time is to copy your files with standard Unix tools. As the old files are replaced by new files, the random allocation policy will ensure that the OSTs eventually get back into balance. You can view the disk usage statistics for each OST separately with the Lustre management tool (see Section 3.1.4).
You can add failover to an existing configuration in much the same way as described in the failover example in Section 2.6. You would create a new OST that would act as a failover pair with an existing OST. Two existing OSTs cannot be used to create a failover pair without reformatting at least one of the two targets.
Be aware that adding more than one target on a given node may have an impact on file system performance.
Chapter 7
Managing Lustre
7.1 Changing Configurations

7.2 Backing Up Data
7.2.1 Backing up at the Client File System Level
It is possible to back up Lustre at the client level from one or more clients. Running backups on multiple clients in parallel for different subsets of the file system can take advantage of the parallel nature of Lustre storage, if the backup system can handle this. This allows the use of standard file system backup tools (tar, Amanda, etc.) that read files using the standard POSIX API. It has the advantage that the backups can be managed using the same tools as other backups in an organization, possibly allowing users to manage their own backups. It is also generally easier to restore individual files in case of a user or application error that deletes or corrupts specific files. This is the less complex method of performing backups.

One disadvantage of backups at the file system level is that they lose Lustre metadata, such as how the file was striped over the OSTs. In some organizations, files will always be created with the default striping pattern, or it is possible to set a default directory striping pattern before restoring files, so this may not be a concern. It may actually be advantageous to use the default striping during file restoration in some circumstances in order to rebalance space usage on the OSTs. Another disadvantage of file-system-level backups is that the data must be transferred over the network from the OSTs to wherever the backup is running, and possibly again over the network to a
backup server. The use of file-system-level backup tools is beyond the scope of this document.
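As a sketch of the parallel approach described above, the following backs up each top-level directory of a mounted file system with its own tar stream. The paths are illustrative; a real deployment would run the streams on different client nodes.

```shell
# One compressed tar stream per top-level directory, run in parallel.
parallel_backup() {
    src="$1" dest="$2"
    for dir in "$src"/*/; do
        name=$(basename "$dir")
        tar -C "$src" -czf "$dest/$name.tar.gz" "$name" &
    done
    wait    # block until all backup streams complete
}
```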
mds# mount -t ext3 /dev/sdb1 /mnt/mds
mds# cd /mnt/mds
mds# getfattr -R -d -m . -e hex . > backup.EA

Figure 7.1: backing up extended attributes to a regular file

# mke2fs -J size=400 /dev/sdb1
# tune2fs -c 0 -i 0 -O dir_index /dev/sdb1

Figure 7.2: formatting a target device
is not possible to do device-level backups on a quiescent file system. If LVM device snapshots are available, it is possible to mount the MDS file system from the snapshot device; otherwise the MDS service must be shut down and the MDS device mounted as a regular ext3 file system. Dumping the extended attribute data to a regular file, which can itself be backed up using normal backup tools, requires getfattr from the attr package to be installed (see Figure 7.1). At this point it is possible to use normal backup tools to back up the MDS file system.
# mount -t ext3 /dev/sdb1 /mnt/mds
# cd /mnt/mds
# setfattr --restore=backup.EA

Figure 7.3: restoring extended attributes
7.6.2.1
Before Lustre is upgraded, it should be shut down properly on all nodes. The proper order for this is:

1. Unmount clients
2. Shut down Lustre on MDS nodes
3. Shut down Lustre on OSS nodes
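On clients that were not started with lconf, the module unload step needs two rmmod -a passes: the first marks unused modules autocleanable, the second removes them. A sketch, with RMMOD overridable for a dry run:

```shell
# Unload Lustre modules without lconf: run `rmmod -a` twice.
unload_lustre_modules() {
    ${RMMOD:-rmmod} -a   # mark unused modules as autocleanable
    ${RMMOD:-rmmod} -a   # actually unload them
}
```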
If lconf is not used to shut down the clients, rmmod -a should be run on the clients twice, so that all Lustre modules are unloaded (the first time marks the modules as autocleanable, the second actually does the autocleaning).

7.6.2.2 Upgrading Lustre
The method for upgrading Lustre depends on how it is installed on your systems. If the Lustre modules were included in a kernel RPM, you will need to install the new kernel RPM. When upgrading kernels, we advise you to use rpm -i rather than rpm -U: -U will remove older versions, and it is a good idea to keep the current kernel installed in case of problems with the new kernel. The RPMs should update the bootloader menu, but you may have to manually set the new kernel as the default; consult your system's bootloader documentation for information on this. Reboot to run the new kernel. If the Lustre modules are contained in their own RPM, you can upgrade this RPM with the rpm -U command. The node does not need to be rebooted unless a new kernel is also required, but you should make sure that all old Lustre modules were unloaded before restarting Lustre (as above).
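The rule above (side-by-side installs for kernels, in-place upgrades for everything else) can be encoded in a trivial helper. The package name patterns here are hypothetical.

```shell
# Choose the rpm flag for a package: kernels get -i so the old,
# known-good kernel stays installed; other RPMs get -U.
rpm_flag_for() {
    case "$1" in
        kernel*) echo "-i" ;;
        *)       echo "-U" ;;
    esac
}
```

It would then be invoked as, for example, rpm $(rpm_flag_for "$pkg") -vh "$pkg".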
Before you restart, make sure that all nodes were upgraded to the same version of Lustre. If you're planning to run with different versions, please see Section 8.1.
Chapter 8
Mixing Architectures
8.1 Mixing Lustre versions
Running Lustre with different release levels on the clients and servers is not a supported configuration. Given Lustre's very rapid development, there are almost guaranteed to be network protocol differences between minor releases (e.g. 1.0 and 1.2). There are never network protocol changes between micro releases (e.g. 1.2.0 and 1.2.1). However, there is a strong effort made to ensure that the on-disk format does not change between releases, or is changed in a manner compatible with older releases. It should always be possible to update from an older Lustre version to a new release. Please consult the release notes for a given release in case of compatibility issues. In some circumstances it may be possible to update clients with bug fixes without taking a full file system outage (e.g. as clients finish jobs, without interrupting other running jobs), but this needs to be determined on a case-by-case basis and is not normally recommended.
Lustre kernel patches carry a release number, LUSTRE_KERNEL_VERSION in <linux/lustre_kernel_version.h>, which is checked at Lustre module build time to ensure that the kernel patch matches the version of the Lustre code being built. In some cases it might be beneficial to have slightly different kernel builds on the clients and servers, in order to improve performance. Configuring the server kernels with 3 GB kernel address space (1 GB user address space) allows the kernel to cache more metadata if the server nodes have more than 1 GB of RAM. In all cases where the kernel version differs between nodes, the Lustre modules need to be rebuilt for each kernel and installed with the matching kernels.
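As a sketch of checking the patch level by hand, the release number can be pulled out of the header named above; the header path you pass in depends on where your patched kernel source lives.

```shell
# Extract the LUSTRE_KERNEL_VERSION value from a kernel header file.
lustre_kernel_version() {
    awk '/#define[ \t]+LUSTRE_KERNEL_VERSION/ {print $3}' "$1"
}
```

For example: lustre_kernel_version /usr/src/linux/include/linux/lustre_kernel_version.h, compared on each node against the value the Lustre build expects.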
should be considered in the preliminary stages at this time. A configuration that is known to work is a PPC64 client with an i386 server, for which we have done simple load testing (e.g. iozone) but not large-scale testing.