
Lustre Operations Manual

Cluster File Systems



First Edition (March 31, 2004)

This publication is intended to help Cluster File Systems, Inc.'s (CFS) customers and partners who are involved in installing, configuring, and administering Lustre. The information contained in this document has not been submitted to any formal CFS test and is distributed AS IS. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. While each item may have been reviewed by CFS for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk.

Comments may be addressed to: Cluster File Systems, Inc., 110 Capen Street, Medford, MA 02155-4230

Copyright © Cluster File Systems, Inc. 2004. All rights reserved. Use or disclosure is subject to restrictions. Duplication of this manual is prohibited.

Contents

1 Prerequisites
   1.1 Lustre Version Selection
       1.1.1 How To Get Lustre
       1.1.2 Supported Configurations
   1.2 Using a pre-packaged Lustre release
       1.2.1 Choosing a Pre-packaged kernel
       1.2.2 Lustre Tools
       1.2.3 Building Other Modules Against the Lustre kernel
       1.2.4 Other Required Software
   1.3 Building From Source
       1.3.1 Building Your Own kernel
       1.3.2 Building Lustre
       1.3.3 Environment Requirements
   1.4 LDAP
       1.4.1 Installing LDAP Packages
       1.4.2 Updating slapd.conf
       1.4.3 Specifying Password Location
       1.4.4 Caveats
       1.4.5 Using LDAP to configure the cluster
   1.5 Installing Lustre-Manager
       1.5.1 Dependencies
       1.5.2 Installing the reporting client daemon (LMD)
       1.5.3 Installing the management client (LMM)

2 Creating a New File System
   2.1 What do you need to know to setup Lustre?
       2.1.1 Architecture Refresher
       2.1.2 Sizing Your Nodes
       2.1.3 High Availability
       2.1.4 Total Usable Storage
   2.2 Disk Layout
       2.2.1 Basics
       2.2.2 Lustre on RAID
       2.2.3 Logical Volume Manager (LVM)
   2.3 Counting Your Object Storage Servers/Targets
       2.3.1 Peak Bandwidth
       2.3.2 Total Storage Capacity
       2.3.3 When Your Best Isn't Good Enough
   2.4 File Striping
       2.4.1 Advantages of Striping
       2.4.2 Disadvantages of Striping
       2.4.3 Stripe Size
       2.4.4 Choosing OSTs
   2.5 Using Lustre-Manager
       2.5.1 Basic Multi-node Setup
       2.5.2 Basic Service Management
       2.5.3 Large Parallel I/O Configuration
       2.5.4 Configuring for Failover
       2.5.5 LDAP
       2.5.6 Multinet and Routing
       2.5.7 Configuration Pitfalls
   2.6 Failover Example
       2.6.1 Shared Storage
       2.6.2 Configuring With Failover Manager
       2.6.3 Pairwise Config
       2.6.4 Passive/active Failover
       2.6.5 Active/Active Failover
       2.6.6 N-way Failover
       2.6.7 The Default Lustre Upcall
       2.6.8 Testing Failover
   2.7 Client Configuration
       2.7.1 Automatic Client Mounting via fstab
   2.8 Validation and Light Testing
       2.8.1 Lustre Throughput Tests
   2.9 Configuration, Under the Hood
       2.9.1 Automatic Service Stopping and Starting
       2.9.2 File System Parameters
       2.9.3 Upcall Generation and Configuration
       2.9.4 Log Levels and Timeouts
   2.10 Striping Tools
       2.10.1 Per-File
       2.10.2 Per-Directory
       2.10.3 Inspecting Stripe Settings
       2.10.4 Finding Files on a Given OST
       2.10.5 Examples

3 Configuring Monitoring
   3.1 Basic monitoring
       3.1.1 System Health
       3.1.2 Current Load
       3.1.3 Bandwidth/Disk/CPU
       3.1.4 OST performance monitoring with LMT
       3.1.5 Lustre Operation/RPC Rate
   3.2 Integrating with other monitoring
       3.2.1 Configuring System Log
       3.2.2 Logging from the Upcall
       3.2.3 SNMP

4 Health Checking and Troubleshooting
   4.1 File system consistency
   4.2 E2fsck
       4.2.1 Supported e2fsck Releases
   4.3 lfsck
       4.3.1 What is lfsck?
       4.3.2 When To Run lfsck
       4.3.3 What if I don't run lfsck?
       4.3.4 Using lfsck
   4.4 Validation of Configuration
       4.4.1 lustre-lint
   4.5 Recovering from Network Partition
       4.5.1 Automating
   4.6 Recovering from Disk Failure
       4.6.1 OST Failure
       4.6.2 MDT Failure
   4.7 Lustre Timeouts
       4.7.1 Aborting Server Recovery
       4.7.2 Manual Recovery
   4.8 Automating Failover
       4.8.1 Using XML
       4.8.2 Using LDAP
       4.8.3 Failing Back to Primary Nodes

5 Health Checking
   5.1 What to do when Lustre seems too slow
       5.1.1 Debug Level
       5.1.2 Stripe Count
       5.1.3 Stripe Balance
       5.1.4 Investigation
       5.1.5 Why are POSIX file writes slow?
   5.2 Common Failure Symptoms
       5.2.1 Unresponsive to Requests
       5.2.2 Gathering Evidence
       5.2.3 Basic analysis
       5.2.4 Reporting problems
       5.2.5 Support Contacts
       5.2.6 Mailing lists

6 Managing Configurations
   6.1 Adding OSTs
       6.1.1 The importance of OST ordering
       6.1.2 Adding OSTs Without Upsetting the Balance
   6.2 Poor Man's Migration
       6.2.1 Extending with LVM
   6.3 Network Topology Changes
   6.4 Adding Failover
   6.5 Adding a Distinct file system

7 Managing Lustre
   7.1 Changing Configurations
   7.2 Backing Up Data
       7.2.1 Backing up at the Client File System Level
       7.2.2 Backing up at the Target Level
       7.2.3 How to back up an OST
       7.2.4 How to back up the MDS
   7.3 Restoring Backups
   7.4 Exporting via NFS
   7.5 Exporting via Samba (CIFS)
   7.6 Upgrading Your Software
       7.6.1 Release Notes
       7.6.2 Upgrading From Previous Versions

8 Mixing Architectures
   8.1 Mixing Lustre versions
   8.2 Mixing kernel versions
       8.2.1 Different client/server kernels
       8.2.2 Mixing 2.4 and 2.6 kernels
   8.3 Mixing hardware classes
       8.3.1 Hardware Page Size
       8.3.2 Endian mixing

Chapter 1

Prerequisites
1.1 Lustre Version Selection
1.1.1 How To Get Lustre
The current, stable version of Lustre is available for download from the Cluster File Systems web site:

Download Lustre: http://www.clusterfs.com/download.html

The software available for download on the Cluster File Systems web site is released under the GNU General Public License. If you have not already done so, we strongly recommend that you read the complete license and release notes for this software before downloading; both can be found at the aforementioned web site.

1.1.2 Supported Configurations


Cluster File Systems supports Lustre on the configurations listed in Table 1.1.

Aspect              Supported Type
Operating System:   Red Hat Linux 7.1+, SuSE Linux 8.0+, Linux 2.4.x
Platforms:          IA-32, IA-64, x86-64
Interconnect:       TCP/IP, Quadrics Elan 3

Table 1.1: Supported configurations

Release   Details
chaos     Based on the 2.4.18 Linux kernel, IA-32, supports both TCP and elan3
ia64      Based on the 2.4.20 Linux kernel, IA-64, supports both TCP and elan3

Table 1.2: Pre-packaged release details

1.2 Using a pre-packaged Lustre release


Due to the complexity involved in building and installing Lustre, Cluster File Systems has made available several pre-packaged releases that cover some of the more common configurations listed above. A pre-packaged release consists of three different files: the Lustre-patched kernel; the Lustre tools (lustre-lite-utils); and, optionally, a Lustre-patched kernel source package. The source package is required only if you need to build your own modules (for networking, etc.) against the kernel source.

Lustre contains kernel modifications which interact with your storage devices and may introduce security issues and data loss if not installed, configured, and administered correctly. Please exercise caution and back up all data before using this software.
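A typical installation of such a release might look like the following transcript. This is a sketch only: the package file names and versions below are illustrative placeholders, not the exact names shipped by CFS; use the file names from the release you actually downloaded.

```shell
# On each node: install the Lustre-patched kernel, the required tools,
# and (only if you will build extra modules) the patched kernel source.
rpm -ivh kernel-lustre-2.4.18-XXX.i386.rpm
rpm -ivh lustre-lite-utils-XXX.i386.rpm
rpm -ivh kernel-lustre-source-2.4.18-XXX.i386.rpm
# Then make the new kernel the boot default (grub/lilo) and reboot.
```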

1.2.1 Choosing a Pre-packaged kernel


Determining which pre-packaged kernel is best for you depends largely on the combination of hardware and software you are currently running. Cluster File Systems provides the pre-packaged releases listed in Table 1.2.

1.2.2 Lustre Tools


Many times, packages marked "utils" are optional, but that is not the case with Lustre. The lustre-lite-utils package contains tools that are required for proper Lustre setup and monitoring. The package contains many tools, the most important being:

- lconf: a higher-level configuration tool that acts on XML files;
- lctl: a low-level configuration utility that can also be used for troubleshooting and debugging;
- lfs: a tool for reading and setting striping information for your cluster, as well as performing other Lustre-file-system-specific actions;
- mount.lustre: a mounting script required by Lustre clients.
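To give a flavor of these tools, the transcript below is a sketch: the exact flag syntax differs between Lustre versions, and the mount point /mnt/lustre, the stripe parameters, and the file name are all illustrative.

```shell
# List the configured Lustre devices on this node.
lctl dl
# Create a file striped over 2 OSTs with a 1 MB stripe size, letting
# Lustre pick the starting OST (-1), then read the striping back.
lfs setstripe /mnt/lustre/bigfile 1048576 -1 2
lfs getstripe /mnt/lustre/bigfile
```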


1.2.3 Building Other Modules Against the Lustre kernel


If your cluster requires specic drivers that are not included in the Lustre release, Cluster File Systems provides a Lustre-patched kernel source RPM that can be used to build additional out-of-kernel modules.

All the Lustre packages were built using gcc v2.96. If you need to build additional modules for your cluster, the same compiler and version should be used.

gcc: http://gcc.gnu.org/

1.2.4 Other Required Software


Aside from the tools provided with Lustre itself, Lustre also requires that some separate software tools be installed.

1.2.4.1 Core Requirements

Table 1.3 contains hyperlinks to the software tools required by Lustre. Depending on your operating system, pre-packaged versions of these tools may be available, either from the sources listed below or from your operating system vendor.

1.2.4.2 High Availability Software

If you plan to enable failover server functionality with Lustre (for either OSS or MDS nodes), high availability software will be a necessary addition to your cluster software. Two of the better-known high availability packages are Clumanager and Kimberlite.

Clumanager, also called cluman (Cluster Manager), from Red Hat Enterprise Linux AS (Advanced Server) provides high availability (HA) features that are essential for data integrity and uninterrupted service. The basic idea behind these HA features is redundant systems and a failover mechanism, which moves services from a failed server to the remaining backup server. More information about Clumanager can be found in the Lustre CluManager wiki:

CluManager wiki: https://wiki.clusterfs.com/lustre/CluManager

Kimberlite is an open-source (GNU GPL) high-availability clustering solution for Linux, designed for use in commercial application environments. It guarantees data integrity using commodity hardware components, and its benefits can be applied to any application, with no requirement to modify the application. As a bonus, Kimberlite comes with a command-line management interface for scripting regular operations. More information about Kimberlite can be found at:

Kimberlite: http://oss.missioncriticallinux.com/projects/kimberlite/

Software   Version   Lustre Function
pdsh       >=1.6     distributed shell: useful for general cluster maintenance
                     http://www.llnl.gov/linux/pdsh/pdsh.html
perl       >=5.6     scripting language: used by monitoring and test scripts
                     http://www.perl.com/pub/a/language/info/software.html
python     >=2       scripting language: required by core Lustre tools
                     http://www.python.org/download/
PyXML      >=0.8     XML processor for python: required
                     http://sourceforge.net/project/showfiles.php?group_id=6473

Table 1.3: Software URLs

1.2.4.3 Debugging Tools

Things inevitably go wrong (disks fail, packets get dropped, software has bugs), and when they do, it is always useful to have debugging tools on hand to help figure out how and why. The most useful tools in this regard are gdb coupled with crash. Together, these tools can be used to investigate both live systems and kernel core dumps. There are also useful kernel patches/modules, such as netconsole and netdump, that allow core dumps to be made across the network. More information about these tools can be found at the following locations:

gdb: http://www.gnu.org/software/gdb/gdb.html
crash: http://oss.missioncriticallinux.com/projects/crash/
netconsole: http://lwn.net/2001/0927/a/netconsole.php3
netdump: http://www.redhat.com/support/wpapers/redhat/netdump/

1.3 Building From Source



1.3.1 Building Your Own kernel


Lustre requires a few changes to the core Linux kernel. These changes are organized as a set of patches in the kernel_patches directory of the Lustre CVS repository. If you are building your kernel from source, you will need to apply the appropriate patches. Managing patches for kernels is very involved, and most patches are intended to work with several kernels, but the Quilt package developed by Andreas Gruenbacher simplifies the process considerably. Patch management with Quilt works as follows: a series file lists a collection of patches; the patches in a series form a stack; you push and pop patches with Quilt; you can edit and refresh (update) patches if you manage the stack with Quilt; and you can conveniently revert inadvertent changes, fork or clone patches, and show diffs before and after work.

1.3.1.1 Patch Series Selection

Depending on which kernel you are using, a different series of patches needs to be applied. Cluster File Systems maintains a collection of patch series files for the various supported kernels in lustre/kernel_patches/series/. For instance, the file lustre/kernel_patches/series/rh-2.4.20 lists all the patches that should be applied to a Red Hat 2.4.20 kernel to build a Lustre-compatible kernel. The current set of all supported kernels and corresponding patch series can always be found in the file lustre/kernel_patches/which_patch.

1.3.1.2 Using Quilt

A variety of Quilt packages (RPMs, SRPMs, and tarballs) are available on the Cluster File Systems ftp site:

Quilt ftp site: ftp://ftp.clusterfs.com/pub/quilt/

The Quilt RPMs have some installation dependencies on other utilities, e.g. the coreutils RPM that is available only in Red Hat 9. You will also need a recent version of the diffstat package. If you cannot fulfill the Quilt RPM dependencies for the packages made available by Cluster File Systems, we suggest building Quilt from the tarball.

After you have acquired the Lustre source (CVS or tarball) and chosen a series file to match your kernel sources, you must also choose a kernel config file. Supported kernel ".config" files are in lustre/kernel_patches/kernel_configs, and are named in such a way as to indicate which kernel and architecture they are meant for; e.g. vanilla-2.4.20.uml.config is a UML config file for the vanilla 2.4.20 kernel series.

Next, unpack the appropriate kernel source tree. For the purposes of illustration, this documentation will assume that the resulting source tree is in /tmp/kernels/linux-2.4.20; we will call this the destination tree. You are now ready to use Quilt to manage the patching process for your kernel. The commands in Figure 1.1 will set up the necessary symlinks between the Lustre kernel patches and your kernel sources.

$ cd /tmp/kernels/linux-2.4.20
$ quilt setup -l ../lustre/kernel_patches/series/rh-2.4.20 \
    -d ../lustre/kernel_patches/patches

Figure 1.1: Quilt setup

You can now have Quilt apply all the patches in the chosen series to your kernel sources by using the commands in Figure 1.2.

$ cd /tmp/kernels/linux-2.4.20
$ quilt push -av

Figure 1.2: Applying a patch series using Quilt

If the right series file was chosen and the patches and kernel sources were up-to-date, the patched destination Linux tree should now be able to act as a base Linux source tree for Lustre.
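When a patch in the series needs changing, Quilt also manages the edit cycle. The transcript below is a sketch of the typical workflow; fs/super.c is just an example file, not one the Lustre patches necessarily touch.

```shell
cd /tmp/kernels/linux-2.4.20
quilt top                # show the patch on top of the stack
quilt add fs/super.c     # record the file in the top patch before editing
# ... edit fs/super.c ...
quilt diff               # review the pending changes
quilt refresh            # fold the changes back into the patch file
quilt pop -a             # unapply the whole series to get a clean tree
```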

The patched Linux source does not need to be compiled in order for Lustre to be built from it. However, the same Lustre-patched kernel must be compiled and then booted on any node on which you intend to run a version of Lustre built using this patched kernel source.

1.3.2 Building Lustre


The Lustre source can be obtained by downloading a release tarball from the Lustre web site:

Lustre releases: ftp://ftp.lustre.org/pub/lustre/releases/

Once you have your Lustre source tree, you can build Lustre by running the sequence of commands found in Figure 1.3. You can then also optionally run make install to install Lustre on the local system.


$ cd /path/to/lustre/source
$ ./configure --with-linux=/path/to/lustre/patched/kernel/source \
    --disable-liblustre
$ make

Figure 1.3: Lustre build instructions

1.3.2.1 Configuration Options

Lustre supports several different features and packages that extend its core functionality. These features/packages can be enabled at build time by passing appropriate arguments to the configure command. A complete listing of supported features and packages can always be obtained by issuing the command ./configure --help in your Lustre source directory.

1.3.2.2 liblustre

The Lustre library client, liblustre, relies on libsysio, a library that provides POSIX-like file and name space support for remote file systems from application program address space. Libsysio can be obtained from:

libsysio: http://sourceforge.net/projects/libsysio/

Development of libsysio has continued since it was first targeted for use with Lustre, so you should check out the b_lustre branch from the libsysio CVS repository. This should give you a version of libsysio that is compatible with Lustre. Once checked out, the steps listed in Figure 1.4 will build libsysio.

$ sh autogen.sh
$ ./configure --with-sockets
$ make

Figure 1.4: Building libsysio

Once libsysio is built, you can build liblustre using the commands listed in Figure 1.5.

$ ./configure --with-lib --with-sysio=/path/to/libsysio/source
$ make

Figure 1.5: Building liblustre

1.3.2.3 Compiler Choice

The compiler of note for Lustre is gcc version 2.96. This version of gcc has been used to successfully compile all of the pre-packaged releases made available by Cluster File Systems, and as such is the only compiler that is officially supported. Your mileage may vary with other compilers, or even with other versions of gcc.

1.3.3 Environment Requirements


Building software for distributed systems entails an extra level of system administration rigor because you need to ensure a certain base level of consistency across all your systems.

1.3.3.1 Consistent Clocks

Machine clocks should be kept in sync as much as possible. The standard way to accomplish this is by using the Network Time Protocol, or NTP. All the machines in your cluster should synchronize their time from a local time server (or servers) at a suitable interval. More information about ntp can be found at:

ntp: http://www.ntp.org/

1.3.3.2 Universal UID/GID

In order to maintain uniform file access permissions on all the nodes of your cluster, the same user (UID) and group (GID) IDs should be used on all clients. You can store and disseminate such user information centrally for the entire cluster using a tool such as LDAP or NIS.

OpenLDAP: http://www.openldap.org/
Network Information System: http://www.faqs.org/docs/linux_network/x-087-2-nis.html

1.4 LDAP
A single Lustre configuration file is used for the whole cluster, and this file needs to be accessible to all the cluster nodes. This can be achieved either by keeping the configuration file in shared storage visible to all the nodes (e.g. an NFS-mounted directory) or by putting it in an LDAP server. An LDAP server is also useful for supporting MDS and OSS failover. In this section, we describe the various components required to configure an LDAP server and outline the steps to start one up. Information on how LDAP can be used to configure a cluster can be found in a later chapter.

# cd lustre/conf
# cp lustre.schema /etc/openldap/schema
# cp slapd-lustre.conf /etc/openldap
# mkdir -m 700 /var/lib/ldap/lustre
# chown ldap.ldap /var/lib/ldap/lustre

Figure 1.6: Manual installation of the Lustre schema

include /etc/openldap/slapd-lustre.conf

Figure 1.7: Updating slapd.conf

1.4.1 Installing LDAP Packages


The following set of RPMs is required in order to configure an LDAP server:

1. openldap (version >=2);
2. openldap-servers;
3. openldap-clients (not strictly required, but it is useful to have client tools available on the server);
4. 4Suite;
5. python-ldap;
6. lustre-ldap (this rpm is included with the other Lustre release rpms).

If the lustre-ldap rpm is not available (e.g. you are running a tarball Lustre release), you can execute the series of commands found in Figure 1.6 to install the Lustre schema on your LDAP server.

1.4.2 Updating slapd.conf


Now you must add the line in Figure 1.7 to /etc/openldap/slapd.conf.

1.4.3 Specifying Password Location


The included file mentioned in the previous section, /etc/openldap/slapd-lustre.conf, contains useful information for accessing your Lustre LDAP directory, such as the root domain name (dn) and password. We recommend that you change the password from the default.
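The exact contents of slapd-lustre.conf depend on your Lustre release, but it is an ordinary OpenLDAP 2.x database section. The fragment below is an illustrative sketch only; the suffix, rootdn, and password shown are examples, not the shipped defaults.

```
database    ldbm
suffix      "fs=lustre"
rootdn      "fs=lustre"
rootpw      secret                # change this before production use
directory   /var/lib/ldap/lustre
```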

1.4.4 Caveats

# load_ldap.sh /path/to/config.xml

Figure 1.8: Load XML config file

# lconf --ldapurl ldap://ldap_node --config=<config-name>

Figure 1.9: Startup cluster by LDAP

On larger clusters, LDAP can limit the speed with which clients can be mounted, due to limits on the number of concurrent queries the LDAP server can accept from clients at a given time.

1.4.5 Using LDAP to configure the cluster


The script load_ldap.sh is included in both the pre-packaged Lustre releases and the Lustre source (in the utils directory). You can use this script to load an existing XML config file into an LDAP directory by running the command in Figure 1.8 on your LDAP server node. After initialization, the lconf utility can access your LDAP directory to configure and start the clients and servers in the cluster, as shown in Figure 1.9.

1.5 Installing Lustre-Manager


The Lustre-Manager (LMT) comes in two packages: one gathers statistics from the nodes that are running Lustre (LMD), and the other contains the web interface to the management tools themselves (LMM). Both packages need to be installed in order for LMT to function correctly. The following sections lead you through the installation and setup of those packages.

1.5.1 Dependencies
Both packages (LMD and LMM) require that the following packages be installed:

- python, version >=2;
- gd: graphics library (which has various dependencies of its own: libjpeg, libpng, ...);
- python-gd: Python interface to the gd drawing package.

These packages should all be available through whichever package distribution mechanism you use for your system.

1.5.1.1 Installing PDSH

PDSH and PyXML must be installed on all the nodes of your system for LMT to work properly.

Recent versions of PDSH may be downloaded from the LLNL public FTP site at ftp://ftp.llnl.gov/pub/linux/pdsh/. You can use the following commands to build PDSH with ssh support enabled (recommended) from tarred sources:

tar xzvf pdsh-xxx.tgz
./configure --with-ssh
make
make install

Once PDSH is built, it is necessary to configure login equivalence for ssh on all the nodes in your cluster. Login equivalence allows you to connect to the nodes in your cluster without having to specify a password for each connection. You can use the following steps to enable login equivalence for ssh:

1. Login to the head node as root.
2. Create a directory called /root/.ssh if it does not already exist.
3. cd /root/.ssh
4. Create a private identity key using the following command:

   ssh-keygen -t dsa

5. When prompted for a passphrase, press enter to use an empty passphrase (easier). If you want to use a passphrase, then you have to use an SSH agent.
6. Append the public key generated by ssh-keygen to the list of keys on each remote cluster node that are allowed to login using root equivalence. Unless you changed the default location during the keygen process, the id_dsa.pub file is generated on the head node in the /root/.ssh/ directory. The authorized_keys file will be found in /root/.ssh/ on each remote node. If the authorized_keys file does not already exist, you can create it. NOTE: you will need to copy the id_dsa.pub file to each remote node (or to shared storage) before appending it to the authorized_keys file:

   cat id_dsa.pub >> authorized_keys

7. Change the permissions for all .ssh/ directories and authorized_keys files:

   cd /root
   chmod go-w .ssh .ssh/authorized_keys

8. Repeat the above steps for all nodes in your cluster. If your cluster is very large, you may want to consider automating the process.

You should now be able to connect to your cluster nodes using ssh without being prompted for your password. NOTE: the first time you connect to each machine, you will have to type yes to confirm the new connection. You can use the following command to test whether PDSH works:

pdsh -w node[000-064] uptime | sort

On Red Hat systems, PDSH is often installed in /usr/local/bin by default, so it may be necessary to add /usr/local/bin to the path, or create a symlink in /usr/bin.
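The key distribution in step 8 can be automated from the head node. The loop below is a sketch: the node list is a placeholder, and it assumes password logins still work for the one-time initial copy.

```shell
#!/bin/sh
# Append the head node's public key to each node's authorized_keys,
# then tighten permissions so sshd accepts the key.
NODES="node000 node001 node002"
for n in $NODES; do
    # One password prompt per node; afterwards logins are key-based.
    ssh "$n" 'mkdir -p /root/.ssh; cat >> /root/.ssh/authorized_keys; chmod go-w /root/.ssh /root/.ssh/authorized_keys' < /root/.ssh/id_dsa.pub
done
```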

1.5.2 Installing the reporting client daemon (LMD)


1.5.2.1 Installation

Install the lustre-manager-collector rpm on all the nodes that will be running Lustre:

rpm -ivh lustre-manager-collector-XXX.i386.rpm

In a SuSE environment:

rpm -ivh lustre-manager-collector-suse-XXX.i386.rpm

1.5.2.2 Specifying the management client

Edit /etc/sysconfig/lustre-manager-collector to contain the hostname of the node that will run the Lustre manager:

LMD_MONITOR_HOST=management-node

1.5.2.3 Starting the daemon

LMD is installed as a Red Hat-style service, so it can be started/stopped using the following syntax:

/sbin/service lmd start | stop

In a SuSE environment, LMD can be started/stopped using the following command:

/etc/init.d/lmd start | stop

You can configure the LMD service to start up at boot time using whatever mechanism is appropriate for your system, e.g. chkconfig on Red Hat installations.

1.5.3 Installing the management client (LMM)


1.5.3.1 Installation

Install the management client (LMM) on an admin or head node for your cluster. The only real criterion here is that the node you choose as the management client be visible to all the Lustre nodes in your cluster, i.e. all Lustre nodes are able to send reports to this node:

rpm -ivh lustre-manager-XXX.i386.rpm

In a SuSE environment:

rpm -ivh lustre-manager-suse-XXX.i386.rpm

1.5.3.2 Starting the daemon

Before starting the LMM daemon for the first time, you can elect to configure two parameters. If you would like to use a secure HTTP connection (HTTPS) for the management client, you must first install the m2crypto package for Python, available from the m2crypto website http://sandbox.rulemaker.net/ngps/m2/, and then set LMM_OPTS=use-https in /etc/sysconfig/lustre-manager. The port on which LMM will listen can also be changed from its default of 8000 through the use of port #. You then must run lustre-manager/data/generate_https_certs.sh to generate SSL certificates. Like the LMD, the LMM is installed as a Red Hat-style service, so it can be started/stopped using the following syntax:

/sbin/service lustre-manager start | stop

In a SuSE environment, the lustre-manager can be started/stopped using the following command:

/etc/init.d/lustre-manager start | stop

You can configure the LMM service to start up at boot time using whatever mechanism is appropriate for your system, e.g. chkconfig on Red Hat installations.

1.5.3.3 Using the management client for the first time

The first time you run the management client, a random password is created for the superuser, admin, and written out to the logs. You can get this password by executing the following command:

grep admin /var/lib/lustre-manager/log/manager.log

You should now be able to connect to http(s)://localhost:8000 and login using admin and the password from the logs. Your first task with the management client should be to change the password for the admin user from the randomly-generated default. Note: The Mozilla web browser is recommended for use with LMT.


Chapter 2

Creating a New File System


2.1 What do you need to know to set up Lustre?
2.1.1 Architecture Refresher
This section is not meant to replace the exhaustively detailed Lustre Storage Architecture book. Rather, it will introduce the high-level terms used throughout the book, and assume that you have a basic grounding in what it means to use or administer a Lustre system.



There are three types of systems which make up a Lustre installation, and while they are usually run on separate nodes, it is possible to run a test or demonstration setup entirely on one node.

Metadata Servers (referred to throughout as an MDS) provide access to services called Metadata Targets (MDT). An MDT manages a backend file system which contains all of the metadata, but none of the actual file data, for an entire Lustre file system. An MDS can export more than one MDT, and multiple MDSs can be configured to act as a group for purposes of failover redundancy (see Section 2.6).

Object Storage Servers (OSS) export one or more Object Storage Targets (OST). An individual OST contains part of the file data for a given file system, and very little metadata. In almost all cases, many OSTs are grouped to form a single file system through a Logical Object Volume (LOV), and it is in this way that Lustre distributes I/O and locking load amongst many OSSs. One MDT plus one or more OSTs make up a single Lustre file system, and are managed as a group. You can think of this group as being analogous to a single line in /etc/exports on an NFS server.

Client nodes mount the Lustre file system over the network and access the files with POSIX file system semantics. Each client communicates directly with the MDS and OSSs responsible for that file system, using a distributed lock manager to keep everything synchronized and protected.

2.1.2 Sizing Your Nodes


There are many factors which will affect the performance of your Lustre system, discussed in more detail throughout this manual. In general, you will achieve the greatest value for your money by keeping things balanced. There are many data pipelines within the Lustre architecture, but there are two in particular which have very direct performance impact: the network pipe between clients and OSSs, and the disk pipe between the OSS software and its backend storage. By balancing these two pipes, you are saving money and maximizing performance. For example: if your OSSs' disks are capable of much higher bandwidth than your network, it is likely that cheaper, slower disk would work just as well; alternatively, you could spread those disks across more OSSs, to increase your aggregate performance.

2.1.2.1 The Rule of Thirds

When sizing your server nodes, the OSSs in particular, we recommend that you divide your CPU into thirds: one third for the disk backend; one third for the network stack; and one third for Lustre. In your disk backend calculation, include the drivers but not the file system; in other words, are you able, with your chosen processor, to achieve the desired bandwidth to a raw device without exceeding 33% CPU utilization? In the network stack estimate, include the TCP stack if you plan to run Lustre over TCP. If you plan to run Lustre over an advanced network such as Elan or Myrinet, use the vendor-supplied benchmark tools to determine if you can drive the network with one third of your CPU. If you adhere roughly to these guidelines, the remaining third of the CPU should be sufficient for the Lustre software stack (locking, backend file system, networking layers, etc.).

2.1.3 High Availability


Lustre's protocols and implementation are designed with absolute recoverability in mind, to eliminate single points of failure. Even across a recovery event, Lustre's protocols and lock manager guarantee that users will not see stale data, receive a misleading result from a metadata operation, lose data which is in flight, or leak objects from half-completed creates or unlinks. This is a very complicated process, in which lost transactions must be replayed, locks must be re-acquired, and partially-completed multi-node operations must be restarted.

2.1.3.1 Application Transparency

In most cases, Lustre can recover from any single server or infrastructure failure in a way which is transparent to your applications. However, it is also the most difficult part of the architecture to systematically test, so we currently recommend that administrators expect Lustre to recover from 90% of node failures in a way which is transparent

to their users. By a single failure, we mean that the entire cluster must first recognize and recover from the failure of one component before another component fails. In the context of a recovery discussion, a component could be a single metadata server, one or more object storage servers, one or more clients, or the network as a whole. For this reason, if you run a client on the same node as the metadata server, a failure of that node is automatically a double failure which will not recover in an application-transparent way. When Lustre is unable to recover transparently, it is nevertheless exceptionally rare that the cluster will need to be rebooted, or the entire file system remounted. In almost all cases, your applications will receive an error for any in-progress file system calls, but will be able to access the file system normally from that point.

2.1.4 Total Usable Storage


By default, on each large OST, 400 MB is used for an internal file system journal, and 5% is reserved for the root user. This is why a newly-formatted file system may have many gigabytes of space used, and why the df/statfs "Used" and "Available" numbers don't always add up to the "Total".
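As a rough illustration of that accounting, the sketch below applies the two defaults quoted above (a 400 MB journal and a 5% root reservation) to a nominal OST size. The helper function is not a Lustre tool, and real numbers will vary with the formatting options used.

```python
# Rough per-OST usable-space estimate, using the defaults quoted above:
# a 400 MB journal plus a 5% root reservation. Illustrative only; the
# exact figures depend on how the backing file system was formatted.
def usable_gb(total_gb, journal_mb=400, root_reserve=0.05):
    after_journal = total_gb - journal_mb / 1024.0   # subtract the journal
    return after_journal * (1 - root_reserve)        # then the reservation

print(round(usable_gb(1000), 1))  # a nominal 1000 GB OST -> 949.6
```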

2.1.4.1 Largest Single File

A non-sparse file will grow its component stripes at roughly the same rate (see Section 2.4), meaning that for a file striped over 5 OSTs, each OST will need enough free space for roughly 1/5th of that file's data. To determine the largest single non-sparse file you can have, multiply the amount of free space on the most-used OST by the number of stripes in the file. Because OSTs can fill up at different rates, it is possible that a write to one part of that file will return -ENOSPC, and a write to a different part will succeed.
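The rule of thumb above can be written out as a small calculation. The free-space figures below are made-up numbers for illustration; the bound comes from the fullest (least-free) OST in the file's stripe set.

```python
# Sketch of the bound described above: a non-sparse file striped over N
# OSTs can grow until its most-constrained OST runs out of space.
def max_file_gb(free_gb_per_ost, stripe_count):
    # the least-free OST in the stripe set limits the whole file
    return min(free_gb_per_ost) * stripe_count

# hypothetical free space on the three OSTs holding a 3-stripe file
print(max_file_gb([40, 55, 62], 3))  # -> 120
```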

2.2 Disk Layout


2.2.1 Basics
As described in Section 2.1.1, Lustre distributes your data across two types of servers: the metadata servers (MDS) and object storage servers (OSS). Inside each MDS and OSS, Lustre manages one or more standard block devices to store this information. These are referred to as metadata target (MDT) and object storage target (OST) services, respectively. The most common examples are an IDE or SCSI disk (/dev/hd* or /dev/sd*), a software

RAID device (/dev/md*), or an LVM device (/dev/group_name/*).

2.2.2 Lustre on RAID


Because Lustre does not yet provide an internal data redundancy feature, most serious installations choose to run on devices which provide data redundancy in hardware or software. If an OSS device is lost or severely corrupted, only the file data on that device is affected. The loss or severe corruption of the MDS file system, however, will render your entire Lustre volume unreadable. For this reason it is especially important to protect the MDS data.

2.2.2.1 RAID 0 (striping)

RAID 0 (striping) does not offer any form of data protection by itself. It is purely a performance improvement in which multiple devices are combined to create a single larger block device. If any of the underlying devices are damaged, the entire RAID device is unreadable. For this reason, using RAID 0 by itself improves performance but increases the risk of data loss.

2.2.2.2 RAID 1 (mirroring)

RAID 1 (mirroring) keeps identical copies of data on multiple devices, usually at some small performance penalty. As long as any one device in a RAID 1 set is functioning, your data is available. In the event of a crash or other unclean shutdown, Linux software RAID 1 will re-sync the devices in the background by copying the entire device; during this (possibly long) re-sync period, some performance degradation can be expected. Because at least half of your disk space is being used for copies of identical data, RAID 1 is not very space-efficient.

2.2.2.3 RAID 5 (parity)

RAID 5 (parity) aggregates at least three devices in a fault-tolerant and reasonably performant way. Instead of making identical copies as in RAID 1, parity blocks are stored on a different disk from the file data. This parity information is used to reconstruct data in the event of a disk failure, and is much smaller than a duplicate copy of the data. RAID 5 is the most popular choice of our current customers, balancing availability requirements with reasonable performance and space efficiency.

2.2.2.4 Hardware vs. Software RAID

Lustre will run equally well on hardware and software RAID solutions, all else being equal. However, it is important to keep in mind that a software RAID solution on the object storage server (OSS) comes at some cost in overhead. If you deploy a software RAID 1 solution, remember that all writes will need to be written at least twice (once to each device in the RAID set). For a write-intensive load on a high-bandwidth OSS, make sure that your buses are capable, and expect CPU overhead on the order of 5%. If you deploy a software RAID 5 solution, keep in mind that there is considerable overhead in parity calculation. For a write-intensive load on a high-bandwidth OSS, you should plan to allocate 15% of a modern CPU. If your bandwidth requirements are high (more than 120 MB/s per OSS) and you require the availability guarantees provided by RAID 5, hardware RAID may be a better option.

2.2.3 Logical Volume Manager (LVM)


The LVM allows you to aggregate multiple devices into a single virtual device, much like RAID, but with additional benets. Volumes can be resized and migrated without destroying the data, and as a result is an important layer in any manageable storage stack. 2.2.3.1 LVM on Metadata Servers

Lustre 1.x does not support the clustering of multiple metadata targets (MDT) into a single Lustre file system. As a result, if you require additional metadata storage, you have three choices: reformat your file system and copy all of the data back (see 7.2 and 7.3); use a hardware RAID device which can grow volumes safely; or use the LVM software solution. If your MDS device is an LVM volume, you can grow it by adding a device to the volume with the LVM tools (a precise explanation of which is outside the scope of this document). After you resize the LVM device, you must re-size the Lustre backend MDS file system. Lustre uses a modified version of ext3 as its backend file system; an updated version of the ext3 resize tool will be supplied with Lustre 1.4 later in 2004.

2.2.3.2 LVM on Object Storage Servers

Each object storage target (OST) has its own block device and backend file system. Unlike with metadata, you can have more than one object storage server (OSS); some large installations have hundreds in a single file system. Each OSS can itself hold multiple OSTs, and these OSTs are the building blocks of a Lustre file system.

There are four ways to increase the data storage in your Lustre file system: reformat with more OSTs and copy your data back; add more OSTs to an existing file system; use a hardware RAID device which can grow volumes safely; or re-size the OSTs using the LVM. Today, Lustre does not include internal data migration features to re-balance file data amongst many OSTs. Although you can grow a Lustre file system by adding more OSTs, the usual situation is that the old OSTs have a lot of data and the new OSTs are empty. In future versions of Lustre, a migrator will rebalance this data in the background. Until then, one alternative is to use the LVM to grow your file system without adding additional OSTs, exactly as you would on the MDS.

2.2.3.3 Limitations

This is not a completely perfect solution, however, because there are limitations in Linux on how large a single ext3 file system can be. A single MDT or OST device cannot exceed 2 TB, which also means that a single data object on an OST cannot exceed 2 TB. Fortunately, files in a Lustre partition can exceed 2 TB by striping them over multiple OSTs (see Section 2.4). If your devices have reached 2 TB and you require additional storage, your only choices are to reformat or add new, empty OSTs.

2.3 Counting Your Object Storage Servers/Targets


A common question asked by Lustre administrators is "Given a certain number of clients, how many object storage servers/targets should I have?" In fact, the number of clients has no real impact on the number of OSSs to deploy. The more important questions to ask are "What is the maximum peak bandwidth that my applications require, and how many OSSs are required to reach that peak?", or "What is the total storage capacity required for my system?"

2.3.1 Peak Bandwidth


Assuming that enough disk, network, and client resources are available, Lustre file I/O bandwidth has been shown to scale linearly as more OSSs are added to the file system. This makes it very easy to determine how many OSSs are required to reach your requirements. If you require 1 GB/s of aggregate peak I/O bandwidth, and each OSS has been measured as being capable of 100 MB/s (see Section 2.5.3.5), you will need 10 OSSs for your cluster.
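The sizing arithmetic above can be sketched as a one-line calculation (treating 1 GB/s as 1000 MB/s, as in the example):

```python
# OSS count needed to hit a target aggregate bandwidth, given the
# measured per-OSS throughput. Purely the arithmetic described above.
import math

def oss_count(target_mb_per_s, per_oss_mb_per_s):
    return math.ceil(target_mb_per_s / per_oss_mb_per_s)

print(oss_count(1000, 100))  # 1 GB/s target at 100 MB/s per OSS -> 10
```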

2.3.2 Total Storage Capacity




With Linux 2.4 kernels there is an upper limit of 2 TB per block device, although some device drivers actually have only a 1 TB limit. If you require large amounts of storage, you should create each OST with the maximum possible size. It is possible to configure an OSS with multiple OSTs, although having too many OSTs on one node will hurt performance. Other things being equal, it is preferable to have fewer, larger OSTs in order to use space more efficiently and to allow larger files without requiring large numbers of stripes.
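A companion sketch for capacity sizing, using the 2 TB per-device ceiling discussed above (substitute 1 TB if your driver imposes the lower limit):

```python
# Minimum OST count for a target total capacity, given the per-device
# size ceiling described above. Illustrative arithmetic only.
import math

def min_osts(total_tb, max_ost_tb=2.0):
    return math.ceil(total_tb / max_ost_tb)

print(min_osts(50))  # 50 TB of storage at 2 TB per OST -> 25
```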

2.3.3 When Your Best Isn't Good Enough


Knowing how many OSTs you need for a new cluster is helpful. But it is also helpful, given an existing cluster, to know what can be improved.

2.3.3.1 Increasing Disk Bandwidth

If your aggregate bandwidth is ultimately limited by your disk subsystem (i.e., you have excess network capacity on each OSS), then you can improve your peak I/O by adding more disk. You can add disk by growing an existing volume or by adding OSTs to an existing OSS (see Section 6.1). In both cases, by adding disk bandwidth without adding more OSS nodes, you will begin to take advantage of your excess network capacity.

2.3.3.2 Increasing Network Bandwidth

Whether you have local disks on each OSS or a large SAN, if you have enough disks, you will eventually overwhelm the network pipe on that node. Assuming that your network fabric is up to the challenge (which is outside of the scope of this document), the way to increase the aggregate network bandwidth available to Lustre is to add OSS nodes.

2.4 File Striping


Lustre stores files in one or more objects on object storage servers (OSS). When a file comprises more than one object, Lustre 1.x will stripe the file data across them in a round-robin fashion. The number of stripes, the size of each stripe, and the servers chosen are all configurable. One of the most frequently-asked Lustre questions is "How should I stripe my files, and what is a good default?" The short answer is that it depends on your needs. A good rule of thumb is to stripe over as few objects as will meet those needs, and no more.
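The round-robin layout described above can be sketched as follows: each stripe-sized chunk of the file is assigned to the file's objects in rotation. This is an illustrative model, not Lustre code.

```python
# Model of round-robin striping: which of a file's objects a given byte
# offset lands in, for a given stripe size and stripe count.
def object_index(offset, stripe_size=1 << 20, stripe_count=4):
    chunk = offset // stripe_size     # which stripe-sized chunk of the file
    return chunk % stripe_count       # chunks cycle across the objects

print(object_index(0))                # first chunk -> object 0
print(object_index(5 * (1 << 20)))    # chunk 5 on 4 objects -> object 1
```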

2.4.1 Advantages of Striping




There are two reasons to create files of multiple stripes: bandwidth and size.

There are many applications which require high-bandwidth access to a single file (more bandwidth than can be provided by a single OSS): for example, scientific applications which write to a single file from hundreds of nodes, or a binary executable which is loaded by many nodes when an application starts. In cases such as these, you want to stripe your file over as many OSSs as it takes to achieve the required peak aggregate bandwidth for that file. In our experience, the requirement is "as quickly as possible", which usually means all OSSs. Note: this assumes that your application is using enough client nodes, and can read/write data fast enough, to take advantage of that much OSS bandwidth. The largest useful stripe count is bounded by (the I/O rate of your clients/jobs) divided by (the performance per OSS).

The second reason to stripe is when a single object storage target (OST) does not have enough free space to hold the entire file. In an extreme example, this can be used to overcome the Linux 2.4 maximum file size limitation of 2 TB.

2.4.2 Disadvantages of Striping


There are two disadvantages to striping, which should deter you from choosing a default policy which stripes over all OSTs unless you really need it: increased overhead and increased risk.

Increased overhead comes in the form of extra network operations during common operations such as stat and unlink, and more locks. Even when these operations can be performed in parallel, there is a big difference between doing one network operation and doing one hundred. Increased overhead also comes in the form of server concurrency. Consider a cluster with 100 clients and 100 OSSs, each with 1 OST. If each file has exactly one object and the load is distributed evenly, there is no concurrency, and the disks on each server can manage sequential I/O. If each file has 100 objects, then the clients will all compete with each other for the attention of the servers, and the disks on each node will be seeking in 100 different directions. In this case, there is needless concurrency.

Increased risk is evident when you consider again the example of striping each file across all servers. In this case, if any one OSS catches on fire, a small part of every file will be lost. By comparison, if every file has exactly one stripe, you will lose fewer files, but you will lose them in their entirety. In our experience, most users would rather lose some of their files entirely than all of their files partially.

2.4.3 Stripe Size


Choosing a stripe size is a small balancing act, but experience has taught us that there are reasonable defaults. The stripe size must be a multiple of the page size; for safety, our tools enforce a multiple of 16 KB (the page size on IA-64), so that users on platforms with smaller pages do not accidentally create files which might cause problems

for IA-64 clients. Although you could create files with a stripe size of 16 KB, this would be a poor choice. Practically, the smallest recommended stripe size is 512 KB, because Lustre tries to batch I/O into 512 KB chunks over the network. We have found that this is a good amount of data to transfer at once, and choosing a smaller stripe size may hinder that batching. Our testing indicates that between 1 MB and 4 MB are good stripe sizes for sequential I/O using high-speed networks. Stripe sizes larger than 4 MB will not parallelize as effectively, because Lustre tries to keep the amount of dirty cached data below 4 MB per server with the default configuration. Writes which cross an object boundary are slightly less efficient than writes which go entirely to one server. Depending on your application's write patterns, you can assist it by choosing the stripe size with that in mind: if the file is written in a very consistent and aligned way, you can do it a favor by making the stripe size a multiple of the write() size. The choice of stripe size has no effect on a single-stripe file.

2.4.4 Choosing OSTs


In most cases, it doesn't matter which OSTs are chosen for which stripes. A user or administrator can choose the OST for the first stripe in the file; subsequent stripes are allocated on the remaining OSTs in order. Creating multiple files in a row will place each file's stripes on subsequent OSTs. Allowing Lustre to choose the first OST will usually lead to an even rate of disk consumption across your OSTs, and an even load distribution on average. In some cases, you may find that you can ensure better performance by specifying the first OST. For example, with a distributed application which creates as many files as you have OSTs, you may get better performance by making sure that each OST has exactly one output file, if other files being created at the same time might disrupt the normal allocation patterns.

When choosing file system or per-directory defaults (see Section 2.10), we strongly recommend that you allow Lustre to choose randomly, by specifying a starting OST of -1.

2.5 Using Lustre-Manager


The following examples will demonstrate how to use the Lustre-Manager (LMT) Config Builder web tool to set up new Lustre file systems of increasing complexity.



2.5.1 Basic Multi-node Setup


The simplest interesting Lustre configuration involves the following node setup:

1. 1 metadata target (MDT);
2. 2 object storage targets (OST) forming a single logical object volume (LOV);
3. 1 or more clients.

If you are limited to fewer than 4 nodes, some of the services can be combined onto a single node. For instance, it is possible to have more than 1 OST on the same node (performance will decrease), or to have the MDT reside on the same node as an OST.

Be aware that a client running on the same node as the MDT will prevent clean failover, and a client running on the same node as an OST is not completely stable. In production environments, you should avoid these configurations.

2.5.1.1 Basic Information

Using the Config Builder, your configuration begins by specifying a name for your file system and selecting your network type from the drop-down menu. Depending on your selections here, certain values will be pre-calculated for you elsewhere in the Config Builder. These pre-calculated values represent a best guess, so please review all the fields and change these pre-calculated values if they are not accurate. Note: the Config Builder does not yet allow you to override the default UUID selection; if you need to do this today, you will need to use lmc (see Section 2.9). The following characters are valid for use in filesystem configuration, file names and user names: [A-Z], [a-z], [0-9], _, -.

2.5.1.2 MDT Setup

Field descriptions for the Config Builder used in MDT setup are listed in Table 2.1.

2.5.1.3 OST Setup

Only two slots for OSTs are displayed by default in the Config Builder. If you wish to add more OSTs, click on the Add OST button and another OST slot will appear, up to a maximum of 32 OSTs. The field descriptions for the Config Builder used in OST setup are listed in Table 2.2. In the current version, LMT cannot customize the OST service name; it will be auto-generated in the format: OST_hostname, OST_hostname_2, OST_hostname_3, ...

MDT service name: A unique name for the MDT service. Other nodes use this name to refer to this MDT.
MDT service host: Hostname of the MDT node.
MDT host network ID: For TCP hosts, the network ID is identical to the hostname. For other network types, the network ID is calculated differently.
MDT backing store: Path to the device used as the backing store for the MDT, e.g. /dev/sda1. If you do not have dedicated storage hardware available, the backing store can also point to a loopback device.
Store size (Optional): The size (KB) to use for the backing store. If not specified, the maximum possible size is used on the backing store device. Note: you must specify a size if using a loopback device.

Table 2.1: Field description for Config Builder: MDT

OST host name: Hostname of the OST node.
OST host network ID: For TCP hosts, the network ID is identical to the hostname. For other network types, the network ID is calculated differently.
OST backing store: Path to the device used as the backing store for the OST, e.g. /dev/sda1. If you do not have dedicated storage hardware available, the backing store can also point to a loopback device.
Store size (Optional): The size (KB) to use for the backing store. If not specified, the maximum possible size is used on the backing store device. Note: you must specify a size if using a loopback device.

Table 2.2: Field description for Config Builder: OST

Client nodes: A glob list of the client nodes in your cluster, e.g. clientnode[3-6,8,10-12].
Mount point: The mount point for Lustre on your client nodes (Default: /mnt/lustre).

Table 2.3: Field description for Config Builder: Clients

2.5.1.4 Client Setup (optional)

In the absence of specific client information, LMT will create a generic client entry in the configuration file. This generic entry can be used to mount an arbitrary number of clients. If you do decide to provide specific client information, Table 2.3 describes how to fill out the various fields in the Clients section of the Config Builder.

2.5.1.5 Striping Patterns

By default, the Config Builder creates a single logical object volume (LOV) which encompasses all of the OSTs you specified using the LMT. This LOV is set up with the following striping pattern: stripe size: 1 MB; stripe count: 1; OST choice: random. For more information about what these values mean, and how to choose reasonable defaults for your environment, see Section 2.4. This first version of the Config Builder does not allow changes to this default; if you want to specify a different default striping pattern, see Section 2.9. All files will be created using this stripe configuration unless an alternate is requested. You can alter the striping policy on a per-directory or per-file basis using the lfs tool. For more information about lfs and other striping tools, see Section 2.10.

2.5.2 Basic Service Management


The Lustre-Manager Tools provide a services tab to start/stop services and mount/unmount clients. It can also show the service status and list clients, and you can choose to reformat the filesystem using the format option.


Note: In the current version, when you start/stop a service, if the service status does not automatically update, you can click the refresh button on the settings page to discard cached status data and get updated status information.

2.5.3 Large Parallel I/O Configuration


The Cong Builder can also be used to setup more elaborate congurations. The interface for conguring the nodes remains the same.

2.5.3.1 Basic Information

Choose your file system name and select your network type as in the simpler configuration above.

2.5.3.2 MDT Setup

Configure the single MDT as before. This node can be shared with an OST, if necessary.

2.5.3.3 OST Setup

If you have high-speed storage controllers, you can configure 2 OSTs on the same node (Object Storage Server, or OSS). We suggest a 4 OSS/8 OST setup for this parallel I/O configuration. The OSTs are configured in the same way as before. Two OSTs would have the same service host and network ID, but the service name and device name must be unique for each OST. The store size remains optional. If you don't have special hardware, cluster performance will likely be better with a single OST per OSS.

2.5.3.4 Client Setup

The client configuration remains generic, so you can still mount an arbitrary number of clients. For parallel I/O operations, 16 or more clients would be typical.

2.5.3.5 Characterizing Your Network

Lustre's end-to-end performance is most meaningful in relation to the raw performance of your network. We recommend running a simple TCP benchmark such as gen/sink to establish a baseline, and then running an I/O benchmark on the mounted Lustre file system. Depending on your setup, and assuming your disk backend is capable, you should see Lustre performance of greater than 80% of the raw network performance.

2.5.3.6 Characterizing Disk Speed

In order to properly characterize disk speed under Lustre, you must first determine what your disk speed will be for the underlying storage without Lustre running. We recommend running the iozone benchmark described in Section 2.8.1 to determine the raw speed of an ext3 partition on your disk device. You can then run the same test again with Lustre mounted, using the disk device as the backing storage for your OST, and then compare the numbers from the two tests. Depending on your setup, you should see Lustre performance of greater than 80% of the raw ext3 rate.

2.5.3.7 Monitoring CPU Usage

gen/sink and iozone create large amounts of network and disk traffic, respectively. It is a good idea to monitor the CPU usage in both cases (on both servers and clients) using a tool such as vmstat. This will let you know whether either performance measurement is being limited by excessive CPU usage.

2.5.3.8 Lustre Throughput Tests

You can determine Lustre throughput rates using the various tests described in the Validation section, 2.8.1.

2.5.3.9 Performance Effects of Resource Sharing

If you have a limited number of nodes available, it may sometimes be necessary to have more than one Lustre service on the same node. For example, it is not uncommon for an MDT to share a node with an OST in a smaller cluster. However, this setup will likely have an impact on cluster performance. If possible, we recommend that you perform the above profiling steps on both a separate- and shared-node configuration for your cluster to determine whether the performance trade-offs are acceptable to you. It is also recommended that Lustre servers (both MDTs and OSTs) be run on dedicated nodes, i.e., not nodes that are running other servers (NFS, etc.), or nodes that are used

as administration or login nodes. An excess of non-Lustre processes will necessarily degrade Lustre performance, and vice versa.

2.5.3.10 Effects of Different Striping Patterns

The current version of Lustre-Manager does not support changing the default striping pattern. For more information on setting striping patterns manually, please see Section 2.5.

2.5.4 Configuring for Failover


Failover configuration is not supported in the current version of Lustre-Manager. For instructions on how to set up failover by hand, please see Section 2.6.

2.5.5 LDAP
LDAP is not supported in the current version of Lustre-Manager. For instructions on setting up an LDAP server, please see Section 1.4.

2.5.6 Multinet and Routing


The current version of Lustre-Manager does not include support for multinet or routing.

2.5.7 Configuration Pitfalls


As previously mentioned, some Lustre services can be safely combined on a single node (MDT+OST, OST+OST), and others cannot. Due to deadlocks in the virtual memory system, clients cannot run on the same nodes as OSTs. Similarly, a production environment should avoid running a client and an MDT on the same node, because doing so will prevent proper failover (see Section 2.1.3.1).

2.6 Failover Example


This section presents several examples demonstrating Lustre failover.

2.6.1 Shared Storage


Lustre supports failover for MDS and OSS targets, and requires the target to be on shared storage. If a hardware RAID device is used, then the cache on the device needs to be coherent between the various MDS and OSS connections to that disk, or at failover time the new node will see stale data. Also, many hardware RAID device caches are much too large to be synced to disk in the event of a power failure. If this were to happen, massive unrecoverable file system damage could result if there are journal pages, etc. in that cache. At any rate, an fsck would definitely be required in that case, and that will take a long time. For this reason, we strongly recommend that all write caches in hardware RAID devices be battery-backed.

2.6.2 Configuring With Failover Manager


Please read the documentation for your failover software. As mentioned in Section 1.2.4.2, we suggest using Red Hat's Cluster Manager or Mission Critical's freely available Kimberlite. It is also recommended that the cluster manager have power control over the nodes, and that failed nodes are powered off before failover starts.

2.6.3 Pairwise Config


Failover MDSs and OSSs are configured in essentially the same way: multiple devices are added to the configuration with the same service name. The failover option should be specified on at least one of the devices to enable failover mode. (This is required to enable failover on OSSs.) For example, to create a failover OSS named OSS1 on nodes nodeA and nodeB with devices /dev/sda1 and /dev/sdb1, respectively:

lmc --add ost --ost ost1 --failover --node nodeA \
    --lov lov1 --device /dev/sda1
lmc --add ost --ost ost1 --failover --node nodeB \
    --lov lov1 --device /dev/sdb1

2.6.4 Passive/Active Failover


In the above example, if OSS1 is the only target on nodes A and B, then at any one time only one of those nodes can be the active node, and the other is passive. The Lustre service must be started on only one node at a time.

2.6.5 Active/Active Failover


If multiple targets are exported by a pair of nodes, then both nodes can be active nodes for different groups of targets. Each node is the primary node for a group of OSSs, and the failover node for other groups. To expand the simple two-node example, we add OSS2, which is primary on nodeB and is on the LUNs nodeB:/dev/sdc1 and nodeA:/dev/sdd1.

lmc --add ost --ost ost1 --failover --node nodeA \
    --group nodeA --lov lov1 --device /dev/sda1
lmc --add ost --ost ost1 --failover --node nodeB \
    --lov lov1 --device /dev/sdb1
lmc --add ost --ost ost2 --failover --node nodeB \
    --group nodeB --lov lov1 --device /dev/sdc1
lmc --add ost --ost ost2 --failover --node nodeA \
    --lov lov1 --device /dev/sdd1

2.6.6 N-way Failover


It is possible to configure an arbitrary number of nodes as failovers for one service. This would require a more complex SAN configuration to allow the storage to be shared by several nodes. The failover management configuration is also more complicated. Configuring Lustre to support N-way failover is not much more complicated than two-way: just add the extra failover nodes to the same service. Managing the current active node is the same as for pair-wise failover.

2.6.7 The Default Lustre Upcall


When the Lustre upcall is set to DEFAULT, Lustre will attempt to recover a failed connection without calling an external upcall. Currently, this does not support failover, but it is suitable for simpler configurations that do not need failover.

2.6.8 Testing Failover


Generally, testing failover is a matter of powering off one node, starting Lustre on the failover node, and allowing the clients to recover. Please see Managing Failover on page 54 for more information about the mechanics of failover.

2.6.8.1 Went Back In Time Errors

If, after a failover, the client sees that the server is missing some transactions that were committed, you'll see "server went back in time" errors. This is usually the result of losing data in a hardware write cache, as mentioned above.

2.7 Client Configuration


2.7.1 Automatic Client Mounting via fstab
It is possible to set up your /etc/fstab file to allow automatic mounting of Lustre on client systems. In this case, automatic can mean either at boot time, or by issuing a command like mount /mnt/lustre from the command line. Two things need to be configured to allow for automatic mounting of clients. First, the entries found in Figure 2.1 should be added to /etc/modules.conf (or conf.modules, depending on your system):

add below kptlrouter portals
add below ptlrpc ksocknal
add below llite lov osc
alias lustre llite

Figure 2.1: modules.conf entries for automatic client mounting

Second, your /etc/fstab file should be updated to include an entry for your Lustre mount, with appropriate values for your cluster substituted where necessary; i.e., mdt_hostname, mdt_service_name, client_name, and /mnt/lustre should all be updated with local values as shown in Figure 2.2.

mdt_hostname:/mdt_service_name/client_name /mnt/lustre lustre defaults,_netdev 0 0

Figure 2.2: fstab entry to allow automatic client mounting

2.8 Validation and Light Testing


2.8.1 Lustre Throughput Tests
2.8.1.1 I/O

Iozone: Iozone is a file system benchmark tool that measures a variety of file operations. It is quite useful for providing a broad analysis of the file system and for confirming that Lustre is configured correctly. Iozone can be downloaded from the Iozone homepage: http://www.iozone.org.

IOR: IOR is the Interleaved Or Random parallel file system test code developed at Lawrence Livermore National Laboratory. IOR uses the Message Passing Interface (MPI) to perform parallel writes and reads to calculate file system throughput. To download IOR, visit the IOR homepage: http://www.llnl.gov/asci/purple/benchmarks/limited/ior/.

2.8.1.2 Metadata

Bonnie: Bonnie is another file system benchmark that performs a series of tests on a file of known size. The tests that Bonnie performs do well to stress small-file I/O performance and metadata operations. Bonnie can be found at http://www.textuality.com/bonnie/.


2.9 Configuration, Under the Hood


2.9.1 Automatic Service Stopping and Starting
When the XML configuration file is fed to the lconf utility, it is parsed and broken down into individual low-level commands. These low-level commands configure the kernel modules and set up the individual Lustre devices. The most basic configuration for Lustre can be used as an example of how to use lconf to configure and mount Lustre. The configuration includes one metadata server (MDS) node, two object storage target (OST) devices on one node combined with a logical object volume (LOV), and one client node that will mount Lustre. The configuration XML file can be created using the lmc utility:

# Create nodes
$ lmc -o config.xml --add net --node node1 --nid node1 \
    --nettype tcp
$ lmc -m config.xml --add net --node node2 --nid node2 \
    --nettype tcp
$ lmc -m config.xml --add net --node node3 --nid node3 \
    --nettype tcp

# Configure MDS
$ lmc -m config.xml --add mds --node node1 --mds mds1 \
    --fstype ext3 --dev /tmp/mds1 --size 50000

# Configure OSTs
$ lmc -m config.xml --add lov --lov lov1 --mds mds1 \
    --stripe_sz 1048576 --stripe_cnt 1 --stripe_pattern 0
$ lmc -m config.xml --add ost --node node2 --lov lov1 \
    --ost ost1 --fstype ext3 --dev /tmp/ost1 --size 100000
$ lmc -m config.xml --add ost --node node2 --lov lov1 \
    --ost ost2 --fstype ext3 --dev /tmp/ost2 --size 100000

# Configure client
$ lmc -m config.xml --add mtpt --node node3 --path \
    /mnt/lustre --mds mds1 --lov lov1

To use lconf to configure the nodes, the following example can be followed.

It is important to note that the OSTs must be set up first, followed by the MDS, and finally the client can mount Lustre. This order is necessary, as the MDS will need to talk to the OSTs to finish its setup, and the client will need to communicate with the MDS and OSTs to mount Lustre.

# configure the OSTs
$ lconf --reformat --gdb --node node2 config.xml

# configure the MDS
$ lconf --reformat --gdb --node node1 config.xml

# mount the client
$ lconf --reformat --gdb --node node3 config.xml

2.9.2 File System Parameters


In the above example, the striping parameters are given to lmc as:

--stripe_sz 1048576 --stripe_cnt 1 --stripe_pattern 0

These are the file system defaults for striping, which will be used unless overridden by a per-directory default or at file creation time (see Section 2.10). In the case of the stripe count, a value of 0 here will use all OSTs. The stripe pattern controls the manner in which data is striped; at present, only RAID 0 is supported (pattern 0). For more information about what these settings do, and how to choose them, see Section 2.4.
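To illustrate the RAID 0 pattern, the sketch below (a hypothetical helper for illustration, not part of Lustre) maps a byte offset in a file to the stripe object holding it, assuming objects are filled round-robin one stripe-sized block at a time:

```python
def stripe_for_offset(offset, stripe_size, stripe_count):
    """Map a byte offset to (object index, offset within that object)
    under a RAID 0 (round-robin) striping pattern."""
    block = offset // stripe_size        # which stripe-sized block of the file
    obj = block % stripe_count           # object (stripe) holding that block
    obj_block = block // stripe_count    # block index within that object
    return obj, obj_block * stripe_size + offset % stripe_size

# With 1 MB stripes over 2 objects, the file's 4th megabyte lands on object 1:
print(stripe_for_offset(3 * 1048576 + 100, 1048576, 2))  # → (1, 1048676)
```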

2.9.3 Upcall Generation and Configuration


The upcall script will attempt to configure Lustre to recover from failures. The path to the upcall script is configurable via the lustre.upcall and portals.upcall sysctls. Both the lmc and lconf utilities have corresponding lustre_upcall and portals_upcall arguments that will set those sysctl values. lmc and lconf also have the upcall argument, which will set both upcalls to the same value.

2.9.4 Log Levels and Timeouts


The recovery timeout and the portals debug level can be modified through the lmc and lconf utilities. The timeout argument can be changed to allow more time for the upcall script to initiate recovery. The ptldebug option will modify the types of Lustre messages that are logged. The default value is a good trade-off between file system performance and helpful information in the logs. To turn on all debugging messages, a value of -1 can be set. The entire set of values is listed in lustre/portals/include/linux/kp30.h.

2.10 Striping Tools


43

File striping (introduced in Section 2.4) can be specified on a per-file system, per-directory, or per-file basis. After a file has been created, the stripe configuration is locked in and cannot be changed. This section describes how to use the lfs tool to change and inspect the stripe configuration.

2.10.1 Per-File
New files with a specific stripe configuration can be created with lfs setstripe:

lfs setstripe <filename> <stripe-size> <starting-ost> <stripe-count>

If you pass a stripe-size of 0, the file system default stripe size will be used. Otherwise, the stripe-size must be a multiple of 16 KB. If you pass a starting-ost of -1, a random first OST will be chosen. Otherwise, the file will start on the specified OST index (starting at zero). If you pass a stripe-count of 0, the file system default number of OSTs will be used. A stripe-count of -1 means that all available OSTs should be used.
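The sentinel values above can be summarized in a small sketch (a hypothetical helper for illustration, not a Lustre tool):

```python
def effective_stripe_params(stripe_size, stripe_count,
                            fs_default_size, fs_default_count, n_osts):
    """Resolve lfs setstripe arguments: 0 means 'use the file system
    default', and a stripe-count of -1 means 'use all available OSTs'."""
    size = fs_default_size if stripe_size == 0 else stripe_size
    if size % 16384 != 0:
        raise ValueError("stripe-size must be a multiple of 16 KB")
    if stripe_count == -1:
        count = n_osts          # stripe over every available OST
    elif stripe_count == 0:
        count = fs_default_count
    else:
        count = stripe_count
    return size, count

# Defaults of 1 MB / 1 stripe, on a file system with 8 OSTs:
print(effective_stripe_params(0, -1, 1048576, 1, 8))  # → (1048576, 8)
```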

2.10.2 Per-Directory
lfs setstripe also works on directories, to set a default striping configuration for files created within that directory. The usage is the same as lfs setstripe for a regular file, except that the directory must exist prior to setting the default striping configuration on it. If a file is created in a directory with a default stripe configuration (without otherwise specifying the striping), Lustre will use those striping parameters instead of the file system default for the new file.

2.10.3 Inspecting Stripe Settings


Individual files and directories can be examined with lfs getstripe:

lfs getstripe <filename>

lfs will print the index and UUID for each OST in the file system, along with the OST index and object ID for each stripe in the file. For directories, the default settings for files created in that directory will be printed. A whole tree of files can also be inspected with lfs find:

lfs find [--recursive | -r] <file or directory> ...

2.10.4 Finding Files on a Given OST


In addition to displaying the stripe settings, lfs find can also be used to list all files which reside on a given OST. This is particularly useful if an OSS has caught on fire, and you want to determine which files to restore from backup.

lfs find [--obd <uuid>] [-r] <file or directory> ...


2.10.5 Examples
Create a file striped on one random OST, with the default stripe size:

$ lfs setstripe /mnt/lustre/file 0 -1 1

List the striping pattern of a single file:

$ lfs getstripe /mnt/lustre/file

List the striping pattern of all files in a given directory:

$ lfs find /mnt/lustre/

List all files which have objects on a specific OST:

$ lfs find -r --obd OST2-UUID /mnt/lustre/


Chapter 3

Conguring Monitoring
3.1 Basic monitoring
3.1.1 System Health
The Lustre Management Tool provides an overview of what is happening on all servers and clients: early warnings, recent throughput, space utilization, and everything else you need to know to keep an eye on the servers. Errors and surprises do happen; disks fill up, or may fail.

3.1.2 Current Load


The system load on Lustre servers generally reflects the number of active service threads (which can be quite high for busy SMP systems).

3.1.3 Bandwidth/Disk/CPU
The vmstat program can be used to provide simple monitoring of system performance on clients and servers. vmstat output shows the number of runnable processes, and processes in uninterruptible sleep (often waiting for disk or RPC completion in the context of Lustre). It shows free and cached memory, although it should be noted that free memory in Linux is usually very low because Linux aggressively caches file data in otherwise-unused memory. vmstat shows disk device input and output, but this is not indicative of read and write behavior on Lustre clients, because their I/O goes over the network. It also includes percentage CPU usage for user, system, and idle (and iowait on some platforms).

3.1.4 OST performance monitoring with LMT


The Lustre Manager provides a performance tab which shows instantaneous per-OST throughput, for read and write, as well as short-term historical data. Clicking the icon between the service name and the current display will toggle between the instantaneous and historical-graph modes.

The Lustre Manager provides a summary on the overview tab which displays a report of per-service capacity and free space.

3.1.5 Lustre Operation/RPC Rate


On the OSS nodes, it is possible to use the llobdstat.pl program to monitor the read and write rate on a per-OST basis using /proc/fs/lustre/obdfilter/<OSTNAME>/stats, since vmstat reports only aggregate I/O statistics. For client nodes, it is possible to get a histogram of OST RPC transfer sizes and bulk RPC requests in flight from the file /proc/fs/lustre/osc/<OSCNAME>/rpcstats. On all client and server nodes, it is possible to monitor RPC statistics using the llstat.pl program and the per-service stats file under /proc/fs/lustre/<service_type>/<service_name>/stats. This can be used to provide one-time summary information, or ongoing operation rates like vmstat.
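Ongoing operation rates are derived by sampling a stats file twice and dividing the counter deltas by the interval, which is essentially what llstat.pl does. A minimal sketch (the two-column "name count" layout is an assumption for illustration; real stats files carry additional fields):

```python
def parse_stats(text):
    """Pull '<op> <count> ...' counters out of a stats-file snapshot.
    Lines whose second field is not a number are ignored."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1].isdigit():
            stats[fields[0]] = int(fields[1])
    return stats

def rates(before, after, interval_s):
    """Operations per second between two snapshots, vmstat-style."""
    return {op: (after[op] - before.get(op, 0)) / interval_s for op in after}

snap1 = parse_stats("read 100 samples\nwrite 40 samples")
snap2 = parse_stats("read 160 samples\nwrite 70 samples")
print(rates(snap1, snap2, 10))  # → {'read': 6.0, 'write': 3.0}
```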

3.2 Integrating with other monitoring


3.2.1 Configuring System Log


Lustre will log a large variety of messages to the system log in the kern.* log facility. These include messages for Lustre startup, shutdown, network timeouts, and other errors. Messages are prefixed with Lustre: or LustreError: for easy identification and filtering. Since Lustre is a distributed file system, it is most beneficial to log all of the client and server system logs to a central logging host, so that events that take place on several nodes can be correlated more easily. To forward syslog messages to another node, add the node name to the syslog.conf file for a facility. In addition, the remote logging host should be configured to allow remote hosts to log messages there (the -r option for the standard syslogd).

kern.info          /var/log/kern.log
kern.info,*.err    @remote_logging_host

Table 3.1: configuring /etc/syslog.conf remote logging

3.2.2 Logging from the Upcall


It can be helpful to add some logging information to the Lustre upcall to facilitate event monitoring or debugging of Lustre disconnection events. The logger command can be used to add upcall status to the syslog.

#!/bin/sh
lctl debug_kernel /tmp/lustre.debug
lctl --device %$3 recover && exit 0
logger -p kern.info "recovery failed: $@"

Table 3.2: logging from the recovery upcall

3.2.3 SNMP
It is possible to integrate Lustre with SNMP via syslog monitoring for LustreError events.


Chapter 4

Health Checking and Troubleshooting


4.1 File system consistency
Each of the Lustre targets (MDT and OST) is a journaled ext3 file system. Normally ext3 will maintain metadata consistency of the local file system and only take a few seconds after a reboot to recover or discard in-progress file system operations from the journal. In the case of hardware error (disk, memory, cable) or software error, it is possible for corruption to appear in the file system that requires the use of an external file system checker (e2fsck). In addition to the consistency of each individual target's file system, the Lustre file system as a whole needs to be kept consistent. During normal, and some abnormal, file system operations (MDS, OST, or client crashes) there are distributed transaction logging methods which work together to keep the various target file systems a consistent whole. Some problems, such as complete disk failure that requires restoration from backup, or corruption of a file system that causes e2fsck to modify the file system, may need an additional distributed consistency check (lfsck). In most failure cases Lustre will operate correctly (as much as possible) in the presence of component failures, and will only return errors when failed components are accessed (files with missing objects, or files with objects on failed targets).

4.2 E2fsck


Journal recovery is normally handled by the kernel ext3 driver at file system mount time (i.e., Lustre OST or MDS setup). If the Lustre server is set to start automatically at boot time, it is also possible to have e2fsck do the journal recovery before the file system is mounted, so that it can validate the file system superblock and check for errors that were detected during the previous mount. Mounting and writing to a file system with errors may lead to further file system corruption. If e2fsck detects an error, having it do a full file system consistency check can take upwards of 20 minutes per 100 GB of storage, so some administrators prefer to perform file system checking manually at a scheduled outage instead of immediately after reboot, although there is some (usually small) risk associated with using a file system with errors. To have e2fsck do journal recovery and the normally-quick basic consistency checks at boot time, the storage device should be listed in /etc/fstab with the noauto option to prevent it from being mounted, and with a non-zero value for the 6th field (fs_passno, normally 2 so it is checked after the root file system) as shown in Figure 4.1.

# Lustre OST devices, do not mount
/dev/sdb1 none ext3 noauto 0 2
/dev/sdb2 none ext3 noauto 0 2

Figure 4.1: /etc/fstab for checking Lustre devices at boot

In addition to recovering the journal, e2fsck will do a full file system check if there was an error reported on the device during a previous mount, or if some time interval has passed since the last file system check. In order to avoid an unnecessary and possibly lengthy file system check at startup, the automatic date- and mount-based file system checks should be turned off with tune2fs -c 0 -i 0, or set to some suitable value for your environment (see the tune2fs(8) man page for details).

4.2.1 Supported e2fsck Releases


The standard e2fsprogs version 1.32 or later supports sufficient ext3 features that it can be used safely on Lustre file systems (primarily the extended attribute, large inode, and indexed directory features). However, in order to do full correctness checking of Lustre-specific file system features (EAs stored in large inodes, lfsck support), a special Lustre-patched e2fsck must be used. The modified e2fsprogs RPM is available for download at the same location as the Lustre sources.

4.3 lfsck


4.3.1 What is lfsck?


lfsck is Lustre's distributed file system consistency checking tool. It will catalog and cross-reference all of the MDS inodes and OST objects in order to determine if there are inodes with missing or duplicate objects, or objects that have no inodes referencing them (orphans). The lfsck tool is available as part of the modified e2fsprogs package, and works in conjunction with e2fsck on each target to generate this data. Under normal circumstances, Lustre will maintain inode and object consistency between the MDS and OST file systems. This is true even if a client fails in the middle of unlinking a file with multiple objects, or if an OST crashes after the MDS inode is unlinked but before that inode's objects are removed. This is done with distributed transaction logs on the MDS and OST that track object creation and destruction; in case of a failure, these transaction logs are replayed when the MDS next reconnects to the failed OST.

4.3.2 When To Run lfsck


Under extreme circumstances, such as complete OST or MDS disk failure that requires restoration from backup, or file system corruption caused by hardware or software that causes e2fsck to make a large number of modifications to the target file system in order to correct it, it may be useful to run lfsck in order to correct any inconsistencies in the Lustre file system. lfsck will locate objects that are missing from files, files that reference the same object, and orphan objects that have no file referencing them. Depending on the selected options, lfsck will also repair these problems by creating new objects, and by attaching orphan objects to lost+found or unlinking them.

4.3.3 What if I don't run lfsck?


If lfsck is not run after a target file system is corrupted, it is possible to have differences between the MDS and the OSTs that would cause some files to report I/O or other errors upon access, or for orphan objects to consume space on the OST with no way to destroy those objects and reclaim their space. Under most circumstances these problems are relatively minor, and Lustre will deal with them by returning an error to the application accessing the file. Files referencing objects that no longer exist can be removed with the unlink binary (from the GNU coreutils or GNU fileutils packages) or the munlink binary (from the Lustre package) if they are no longer needed. To read data from files that are missing objects, one can use dd with the conv=sync,noerror options to replace missing parts of the file with binary zeros (see Figure 4.2).
$ dd if=/file/missing/objects of=/new/file bs=16k conv=sync,noerror

Figure 4.2: using dd to read from files with missing objects

4.3.4 Using lfsck




In order to use lfsck, one must first run the Lustre-modified e2fsck on the MDS in order to generate the MDS inode and LOV EA database, and then once on each OST in order to create the object databases. The MDS database, mdsdb, must be created first and made available on all of the OST nodes during their fsck runs, but it is not modified by the OSTs, so they can run e2fsck on the target file systems in parallel to create one ostdb for each OST. Figure 4.3 shows the commands used to run e2fsck on the MDS and OST nodes.

It should be noted that the e2fsck runs MUST be done while Lustre is unmounted from all of the clients and the MDS and OST services are shut down. Running e2fsck while the MDS or OST services are using the file systems will lead to severe file system corruption.
mds# e2fsck -f -y --mdsdb /path/to/shared/mdsdb /dev/sdb1
ost1# e2fsck -f -y --mdsdb /share/mdsdb --ostdb /share/ostdb1-1 /dev/sdb1
ost1# e2fsck -f -y --mdsdb /share/mdsdb --ostdb /share/ostdb1-2 /dev/sdb2
 :
 :
ostN# e2fsck -f -y --mdsdb /share/mdsdb --ostdb /share/ostdbN-1 /dev/sdb1
ostN# e2fsck -f -y --mdsdb /share/mdsdb --ostdb /share/ostdbN-2 /dev/sdb2

Figure 4.3: running e2fsck on the MDS and OST nodes

After the mdsdb and ostdb files are created, they must all be made available on a single client node in order to combine the databases and find any errors that the file system corruption may have caused. At this point the MDS and OST services can be started and the Lustre file system mounted on clients. lfsck is run on the client with the databases, and will report (and optionally repair) any problems it finds (see Figure 4.4). It should be possible to use the Lustre file system during lfsck operation.
# lfsck -l --mdsdb /shared/mdsdb --ostdb \
    /shared/ostdb1-1,/shared/ostdb1-2,... /lustre/mountpoint

Figure 4.4: running lfsck on the client node

4.4 Validation of Configuration


4.4.1 lustre-lint
The Lustre-Manager (LMT) will include a sanity checker for configuration files, but the sanity checker will not be released as a stand-alone tool at this time.


4.5 Recovering from Network Partition


4.5.1 Automating
The default Lustre upcall, "DEFAULT", will automatically attempt to recover from network partitions. If the cluster does not have any services with failover nodes, then the default upcall is sufficient. If failover is required, then an upcall that supports failover will be needed; please see Section 2.6 on page 38.

4.6 Recovering from Disk Failure


If the physical storage on a metadata target (MDT) or object storage target (OST) fails, then it will need to be recovered from backup. Creating and restoring from backups are discussed in Chapter 7.

4.6.1 OST Failure


The loss of an OST is a localized failure: only the data on that OST is unavailable, and the rest of the file system is unaffected. If the OST is going to remain down for an extended period, then it should be deactivated to stop the clients from attempting to recover, and to free any remaining locks the clients might have on that OST. This needs to be done on each client:

lctl --device %<OSC DEVICE NAME> deactivate

After the OST has been restored, the clients can reactivate their connection:

lctl --device %<OSC DEVICE NAME> activate

This will recover and activate the OSC. If a failover is required, then the upcall will be used to reconfigure the OSC.

4.6.2 MDT Failure


If an MDT disk fails, that Lustre file system is largely unavailable until the MDT is restored. File I/O can continue on file handles that were opened before the failure, but new metadata operations (open, create, etc.) will hang waiting for the MDS to recover. To stop Lustre entirely, clients will need a forced umount ("umount -f"), and the OSTs will need to be cleaned up with the force option.

4.7 Lustre Timeouts




4.7.1 Aborting Server Recovery


When a service starts after a crash, the server starts in recovery mode. During recovery, the server only allows previously-connected clients to reconnect. Once all existing clients have reconnected and completed the recovery protocol, the server completes recovery and resumes normal operations. If all clients do not reconnect or complete recovery, the server will eventually time out its recovery and evict all of the clients. If the administrator knows that not all of the clients are available to recover, then rather than waiting for recovery to time out, it is possible to manually abort recovery. This is done with the lctl abort_recovery command:

lctl --device %mds1 abort_recovery

4.7.2 Manual Recovery


If automated recovery fails for any reason, you can recover manually with lctl:

lctl --device %mds1 recover

4.8 Automating Failover


4.8.1 Using XML
When using XML, the lconf select option is used to override the current active node for a service. For example, one way to manage the current active node is to save the node name in a shared location that is accessible to the client upcall. When the upcall runs, it determines which service has failed, and looks up the current active node for that service. The current node and the upcall parameters are then passed to lconf to complete recovery. Using the example above, if nodeA fails, then when nodeB starts the ost1 service, it needs to save "nodeB" in a file named ost1_UUID. Then it starts the service with this:

lconf --node nodeB --select ost1=nodeB <config.xml>

When the clients detect the failure and call the upcall, the second parameter will be the target UUID, "ost1_UUID". The upcall reads this file, reads "nodeB", and then calls lconf:


lconf --recover --select ost1=nodeB --target_uuid $2 \
    --client_uuid $3 --conn_uuid $4 <config.xml>

4.8.2 Using LDAP


When LDAP is used to store the config, the lactive command is used to set the current active node for a group. The current active node should be set after the old node is confirmed unavailable, and before the new node has started Lustre. The lconf select option is not needed on the server or client. Again, nodeA fails and nodeB becomes active:

lactive --ldapurl ldap://lustre --config fs --pwfile <pw> \
    --group nodeA --active nodeB
lconf --ldapurl ldap://lustre --config fs --node nodeB [--group nodeA]

If the group is specified, then only the devices in that group will be started. If group is not used, then all devices that are active and not already started on nodeB will be started.

4.8.3 Failing Back to Primary Nodes


Failing back to the primary node is essentially the same failover process, except it is initiated intentionally. A fail back is initiated by shutting down the current active node in failover mode:

lconf --node nodeB [--group nodeA] --select ost2=nodeA \
    --cleanup --force --failover <config.xml>

The group here is used to specify which devices to shut down, and is required in an active/active configuration to prevent all the devices from being stopped. The rest of failback is identical to a regular failover.


Chapter 5

Health Checking
5.1 What to do when Lustre seems too slow
There are many reasons why Lustre may not perform as well as it should. But the first step is to make sure that your expectations are reasonable. Ask yourself these important questions:

Am I expecting more bandwidth than the raw network or disk hardware would allow? (See Section 2.5.3.5.)

Does the application perform I/O from enough client nodes to take advantage of the aggregate bandwidth provided by the object storage servers?

5.1.1 Debug Level


If you increase the debug level, be prepared for a major performance impact. Full debugging can slow the system by as much as 90%, compared to the default settings. For more information, see Section 2.9.4.

5.1.2 Stripe Count


It is important to remember that the peak aggregate bandwidth for I/O to a single file is bounded by (the number of stripes) multiplied by (peak bandwidth per server). No matter how many clients try to write to that file, if it only has one stripe, all of the I/O will go to only one server. For more information about how to set the striping of new files, or how to query the stripe setting on a file after the fact, see Section 2.10.
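That bound is a one-line formula; the sketch below just encodes the arithmetic (the 400 MB/s per-server figure is an arbitrary example, not a Lustre measurement):

```python
def peak_single_file_bw(stripe_count, per_server_peak):
    """Upper bound on aggregate I/O bandwidth to ONE file:
    stripes x per-server peak. Adding clients cannot raise this
    bound; only adding stripes can."""
    return stripe_count * per_server_peak

# A 1-stripe file is capped at a single server's rate, in MB/s:
print(peak_single_file_bw(1, 400))  # → 400
print(peak_single_file_bw(4, 400))  # → 1600
```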


5.1.3 Stripe Balance


Lustre will create stripes on consecutive OSTs by default, so files created at one time will be optimally distributed among OSTs, assuming enough stripes and/or files are created at that time. However, files created at different times may not be distributed optimally among OSTs. If you notice that some servers receive a disproportionate share of the I/O load (see Section 3.1.4), check that your files are striped evenly over the OSTs. If they are not, consider using lfs to create a balanced set of files before the application starts, or, if applicable, teach your application about Lustre striping.
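As a hedged sketch of that pre-balancing approach: the mount point, file names, and 1 MB stripe size below are illustrative, and the positional setstripe arguments (file, stripe size, start OST, stripe count) follow the lfs syntax of this era, so verify them against `lfs help` on your installation.

```shell
# Pre-create a balanced set of output files before the application starts.
if command -v lfs >/dev/null 2>&1; then
    for i in 0 1 2 3; do
        # -1 -1: default starting OST, stripe over all OSTs
        lfs setstripe /mnt/lustre/out.$i 1048576 -1 -1
    done
    lfs getstripe /mnt/lustre/out.0    # confirm the object placement
else
    echo "lfs not available on this node"
fi
```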

5.1.4 Investigation
If the problem still exists after checking the above, there may be a legitimate bottleneck which requires investigation. There are several general investigative tools that can be used to evaluate which nodes in a Lustre file system may be the cause of slowness.

5.1.4.1 vmstat

The CPU use columns in the vmstat output can identify a node whose CPU is entirely consumed. On metadata server (MDS) and object storage server (OSS) nodes, the I/O columns tell you how many blocks are flowing through the node's I/O subsystem. Combined with knowledge of the node's attached storage, you can determine if this subsystem is the bottleneck. Clients do not show any block I/O in vmstat. The columns that report swap activity can identify nodes that are having trouble keeping their working applications in memory.
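As an illustration of reading those columns, the following parses a sample 2.4-era vmstat data line (the numbers are made up) to pull out the swap (si/so) and block-I/O (bi/bo) fields:

```shell
# One data line of vmstat output (2.4-era layout:
# r b w swpd free buff cache si so bi bo in cs us sy id)
line=" 1  0  0  12048  8344 102400 512000  0  0  120 4096 350 900 10 30 60"

# Fields 8-9 are swap-in/swap-out; fields 10-11 are blocks in/out.
echo "$line" | awk '{printf "si=%s so=%s bi=%s bo=%s\n", $8, $9, $10, $11}'
# -> si=0 so=0 bi=120 bo=4096
```

Non-zero si/so on a server points at memory pressure; large sustained bi/bo on an OSS shows how hard its I/O subsystem is working.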

5.1.4.2 iostat

When the host kernel has been configured to provide detailed I/O statistics per partition, iostat can provide insight into the nature of I/O bottlenecks. It reports the nature and concurrency of the requests being made of attached storage.

5.1.4.3 top

top helps identify tasks that are monopolizing system resources. It can identify tasks that aren't generating file system load because they are busy using the CPU, or server threads that are struggling to keep up on an overloaded node.

5.1.4.4 oprofile

oprofile (http://oprofile.sourceforge.net/) is invaluable for profiling CPU use on a node. Its installation and use are beyond the scope of this document, but we highly recommend it.
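A typical oprofile session looks roughly like the following sketch; the vmlinux path is an assumption, and the exact opcontrol options vary by oprofile version, so consult its documentation:

```shell
# Typical oprofile session on a node suspected of being CPU-bound:
#   opcontrol --init                                  # load the oprofile module
#   opcontrol --setup --vmlinux=/boot/vmlinux         # point at the kernel image
#   opcontrol --start                                 # begin sampling
#   ... run the workload that shows the slowness ...
#   opcontrol --shutdown                              # stop sampling
#   opreport | head -20                               # top CPU consumers
```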

5.1.5 Why are POSIX file writes slow?


Writes flow from the application that generates them to the storage targets which commit them to disk. It is valuable to inspect the possible choke points along this path.

5.1.5.1 Generating data to write

If an application is to take advantage of large network and disk pipes, it must generate a lot of write traffic, which can be cached on the client node and packaged into RPCs for the network. There must be free memory on the node for use as a write cache. If the kernel can't keep at least 4 MB in use for Lustre write caching, it cannot keep an optimal number of network transactions in progress at once. There must also be enough CPU capacity for the application to do the work which generates data for writing.

5.1.5.2 Nearly-full file systems

To prevent a situation in which Lustre puts application data into its cache but cannot write it to disk because the disk is full, Lustre clients must reserve disk space in advance. If a client is unable to reserve this space because the OST is within 2% of full, it must execute its writes synchronously with the server instead of caching them for efficient bundling. The degree to which this affects performance depends on how much your application would benefit from write caching. The cur_dirty_bytes file (in each OSC's subdirectory of /proc/fs/lustre/osc/ on a client) records the amount of cached writes destined for a particular storage target. The maximum amount of cached data per OSC is determined by the max_dirty_mb value in the same directory, 4 MB by default. Increasing this value allows more dirty data to be cached on a client before it needs to flush to the OST, but also increases the time needed for other clients to read or overwrite that data, as it must be written to the OST before other clients can access it.

5.1.5.3 Network congestion

The network between the client and storage target needs to have capacity for the write traffic.

ifconfig has byte counters for each interface, which can be used to measure the throughput of a TCP session over that interface. netstat -t shows the size of the queues on a given socket. netstat -s can show packet loss and retransmissions for TCP on the node.

5.1.5.4 Server thread availability

Write RPCs arrive at the server and are processed synchronously by kernel threads (named ll_ost_*). ps will help to identify the number of threads that are in the D state, which indicates that they're busy servicing a request. vmstat can give a rough approximation of the number of threads that are blocked processing I/O requests when a node is busy servicing only I/O RPCs. The number of threads sets an upper bound on the number of I/O RPCs that can be processed concurrently, which in turn sets an upper bound on the number of I/O requests that will be serviced concurrently by the attached storage.

5.1.5.5 Backend throughput

iostat -x is invaluable for profiling the load on the storage attached to a server node. Its man page details the meaning of the various columns in the output. The raw throughput number (wkB/s) combines well with the requests per second (w/s) to give the average size of I/O requests to the device. The service time gives the amount of time it takes the device to respond to an I/O request, which sets the maximum number of requests that can be handled in turn when requests are not issued concurrently. Comparing this with the requests per second gives a measure of the amount of storage device concurrency.
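For example, the average write size can be computed directly from those two columns; the numbers below are made up:

```shell
# Average write request size from iostat -x style counters:
# 20480 wkB/s at 160 write requests per second.
wkBs=20480; ws=160
awk -v k="$wkBs" -v w="$ws" 'BEGIN {printf "avg write size: %.0f kB\n", k/w}'
# -> avg write size: 128 kB
```

Small average request sizes (a few kB) on a server that is supposed to be streaming large writes usually indicate that requests are being fragmented somewhere along the path.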

5.2 Common Failure Symptoms


5.2.1 Unresponsive to Requests
When a failure is detected which prevents a client node from performing a user's request, the standard behavior is to block until either the system recovers or the application receives a signal (KILL, TERM, INT, and ALRM are handled after a short timeout period, 60 seconds by default). In almost every case in which the system is unresponsive, you will find interesting Lustre messages on the console shortly thereafter.

5.2.2 Gathering Evidence


When reporting a bug, you may be asked to submit a sample of the Lustre kernel debug log after you've reproduced the bug. These logs are very verbose, so it's important to reproduce the bug on a quiescent system whenever possible. You may be asked to change the system debug level to gather more or less information when you reproduce the problem.

5.2.2.1 Log levels

Lustre's default debug level is very low, appropriate for a production system which requires minimal logging and minimal performance impact. A high debug level can impact performance by as much as 90%. Debugging is controlled by two bitmaps: one controls the type of messages saved (tracing, locking, memory allocation, etc.), and the other controls which subsystems are saved (MDS, DLM, Portals, etc.). These reside in /proc/sys/portals/debug and /proc/sys/portals/subsystem_debug, respectively. The meaning of these bits can be found in lustre/portals/include/linux/kp30.h.

5.2.2.2 Log collection

Before you reproduce the bug, clear the existing log data:

lctl clear

After you reproduce the problem, save a copy of the debug log:

lctl dk <filename>

If you trigger a Lustre assertion in the form of an LBUG error, a debug log will automatically be dumped in /tmp; the exact file name will be printed to the console.
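Putting the pieces together, a debug-capture session might look like the following sketch. The values -1 (all bits set) and 0 and the log path are illustrative, not recommended settings, and the /proc entry only exists on nodes with the Lustre/Portals modules loaded, so the block checks for it:

```shell
DEBUG=/proc/sys/portals/debug
LOG=/tmp/lustre-debug.log

if [ -w "$DEBUG" ]; then
    echo -1 > "$DEBUG"   # enable every message type (heavy performance impact)
    lctl clear           # empty the in-kernel debug buffer
    # ... reproduce the problem here ...
    lctl dk "$LOG"       # dump the buffer to a file for the bug report
    echo 0 > "$DEBUG"    # restore your site's normal (quiet) setting
else
    echo "Lustre/Portals debug control not present on this node"
fi
```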

5.2.3 Basic analysis


5.2.3.1 Interpreting syslog/dmesg output

slow commit/write These messages occur on the server.

lock timeouts
  client: When the client times out waiting for a lock, it will assume the connection to the server has failed and start recovery.
  server: When the server times out waiting for a client to respond to a lock cancel request, the server will evict the client to allow other nodes to make progress. The next time the client sends an RPC, it will receive an error and will have to reconnect.

Socknal timeout message The socknal will not attempt to retransmit after a timeout. It will simply close the connection and drop the message, after which the higher layers of Lustre will reconnect and attempt recovery.

RPC timeouts A client will attempt to reconnect after an RPC timeout occurs.

ENOSPC (errno 28) When the MDS or OST runs out of space, you will see ENOSPC errors in the logs.

ENOTCONN (errno 107) This usually means the client has been evicted by the server. It can also mean the server has been restarted. In either case, the client will reconnect and either recover or clear its state if it has been evicted.

Remounting file system read-only When the underlying disk file system detects corruption, it will remount itself read-only to prevent further damage. Lustre must be shut down on this target, and e2fsck run on the block device.

5.2.4 Reporting problems


We track all bugs and support issues in Bugzilla. Before creating a new bug, please search for an existing bug to avoid filing a duplicate. More details on using Bugzilla can be found on the Lustre site, at http://www.lustre.org/bugs.html.

5.2.5 Support Contacts


Registered support contacts for ongoing support contracts can submit support issues either via Bugzilla, as above, or by sending email to support@clusterfs.com. When submitting via Bugzilla, make sure to select the checkbox for your organization, to flag the issue for priority handling.

5.2.6 Mailing lists


The lustre-announce mailing list is a very low-traffic list for those interested in major updates from CFS. Members of the lustre-discuss mailing list may be able to provide answers to common questions. If you require more support or guaranteed response times, per-incident and annual service level agreements are available. For more information about the Lustre mailing lists, see http://www.lustre.org/lists.html.


Chapter 6

Managing Configurations
6.1 Adding OSTs
6.1.1 The importance of OST ordering
Adding an object storage target (OST) is as easy as creating a new OST entry in the configuration file. This must be done while the Lustre file system is not running. If you don't use lconf to start your clients (i.e., you are using a 0-conf setup, or you just run "mount"), then you will need to re-generate the configuration logs on your metadata server (MDT).

The new OST entry must come at the end of the list of OSTs in the configuration file. Inserting the entry in the middle has undefined results; corruption and data loss may occur.
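For example, appending a new OST with lmc might look like the sketch below. The node, network, device, and LOV names are hypothetical, and the exact option spelling should be verified against `lmc --help` for your release:

```shell
# Append the new OST at the end of the existing configuration:
#   lmc -m config.xml --add node --node oss3 --net oss3 tcp
#   lmc -m config.xml --add ost --node oss3 --lov lov1 \
#       --fstype ext3 --dev /dev/sdc1
```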

6.1.2 Extending with LVM


If you plan ahead somewhat, you may be able to use the Linux Logical Volume Manager to extend your OST file systems. See Section 2.2.3.

6.2 Poor Man's Migration


Unless otherwise instructed (see Section 2.10), Lustre will stripe a file over a random set of OSTs at file creation time. Unfortunately, the only manual migration option at this time is to copy your files with standard Unix tools. As the old files are replaced by new files, the random allocation policy will ensure that the OSTs eventually get back into balance. You can view the disk usage statistics for each OST separately with the Lustre management tool (see Section 3.1.4).

6.2.1 Adding OSTs Without Upsetting the Balance


If you have multiple OSTs on a single OSS, it is possible to move some of those OSTs to a separate node to better distribute the load. Unfortunately this procedure is not yet automated, and should be performed only by experts. After you move the physical devices, it is necessary to manually specify the OST UUID using the ostuuid option for that entry in the lmc configuration file and rebuild your configuration (see Section 2.9). One drawback of this approach is that the OST UUID normally includes the OSS node name, and moving an OST with the old UUID to another node may be confusing. If you are an expert, you have the option of editing the UUID stored in the OST file system to reflect the new OSS node. To do this:

- mount the block device
- make a backup of the file named last_rcvd, in case you make a mistake
- edit the last_rcvd file with a binary editor; it begins with the old UUID, as found in the old configuration file; replace that with the new UUID
- if the new UUID is shorter, replace the extra characters with NULs
- verify that the new last_rcvd file is the same size as the old one (in case bytes were added or removed by accident)

6.3 Network Topology Changes


The IP address of any node in your network can be changed. It is simply a matter of updating the hostname and network ID in your XML configuration files and/or LDAP directory. If you don't use lconf to start your clients (i.e., you just run "mount"), then you will need to re-generate the configuration logs on your metadata server (MDT). It is also possible to change the UUID for a given node/service, but it is non-trivial and, hence, beyond the scope of this document.

6.4 Adding Failover




You can add failover to an existing configuration in much the same way as described in the Failover Example (Section 2.6). You would create a new OST to act as a failover pair with an existing OST. Two existing OSTs cannot be used to create a failover pair without reformatting at least one of the two targets.

6.5 Adding a Distinct File System


A Lustre configuration file can contain entries for more than one file system. The only caveat is that individual metadata (MDT) and object storage (OST) targets cannot be shared between file systems. More than one target can exist on the same node (for either the same or different file systems), but each target must have a unique backing store.

Be aware that adding more than one target on a given node may have an impact on file system performance.


Chapter 7

Managing Lustre
7.1 Changing Configurations

7.2 Backing Up Data
7.2.1 Backing up at the Client File System Level
It is possible to back up Lustre at the client level from one or more clients. Running backups on multiple clients in parallel for different subsets of the file system can take advantage of the parallel nature of Lustre storage if the backup system can handle this. This allows the use of standard file system backup tools (tar, Amanda, etc.) that read files using the standard POSIX API. It has the advantage that the backups can be managed using the same tools as other backups in an organization, possibly allowing users to manage their own backups, and it is generally easier to restore individual files in case of a user or application error that deletes or corrupts specific files. This is the less complex method of performing backups. One disadvantage of backups at the file system level is that they lose Lustre metadata, such as how the file was striped over the OSTs. In some organizations files will always be created with the default striping pattern, or it is possible to set a default directory striping pattern before restoring files, so this may not be a concern. It may actually be advantageous to use the default striping during file restoration in some circumstances, in order to rebalance space usage on the OSTs. Another disadvantage of file-system-level backups is that the data must be transferred over the network from the OSTs to wherever the backup is running, and possibly again over the network to a backup server. The use of file-system-level backup tools is beyond the scope of this document.

7.2.2 Backing up at the Target Level


It is also possible to back up Lustre at the target level, doing a backup of the MDS and each OST directly on the server nodes. This can also be done in parallel for each target. It has the advantage that the data only needs to be transferred from the target directly to the backup server/media once, and avoids the clients entirely. One reason to do backups at the target level is that storage device failures (one or more disk failures, software or hardware corruption of the file system) will often lead to the failure of the entire target file system, and having a target-level backup allows restoration of only that target instead of the whole file system. Target-level backups also maintain the same striping configuration as the original files, so if applications or users set up files with non-default striping parameters this data is preserved. One disadvantage of target-level backups is that they are more complex to perform, because the whole file system should normally be quiesced in order to get a consistent backup. In addition to backing up the regular data and directory structure of the target, the backup must also include the Lustre metadata, which is stored in file system extended attributes. Another disadvantage is that it is generally not possible to restore individual files from such backups.

7.2.3 How to back up an OST


It is possible to back up an OST using a number of methods. The first method is to use the ext2 dump tool to dump the file system directly from the block device. This can be done with the OST unmounted, with the OST set up and (mostly) idle, or from an LVM device snapshot. Dumping an active file system may lead to incomplete or corrupt backups. The second method is to shut down the OST or take a device snapshot with LVM, mount the target device directly as an ext3 file system, and use normal file-system-level tools to do the backup. The OST does not contain extended attribute data, so normal file system backups are sufficient. In both cases it is possible to use incremental backups on the OST, because each object maintains correct mtime and ctime values that the backup software can use to determine whether it has changed.
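The two methods above might look like the following sketches; the device names, mount points, and backup paths are illustrative:

```shell
# Method 1: raw device dump with the ext2 "dump" tool
# (OST stopped, mostly idle, or dumped from an LVM snapshot):
#   dump -0 -f /backup/ost1.dump /dev/sdc1

# Method 2: mount the quiesced target as ext3 and use ordinary file tools:
#   mount -t ext3 -o ro /dev/sdc1 /mnt/ost1
#   tar czf /backup/ost1.tar.gz -C /mnt/ost1 .
#   umount /mnt/ost1
```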

7.2.4 How to back up the MDS


When backing up the MDS, it is critical that the extended attribute data is also backed up, as it describes the configuration of objects; without it, none of the file data is accessible. Only a limited number of backup tools can back up EA data directly, but it is possible to back up the EA data indirectly by first saving it to a regular file in the MDS file system and then backing the file system up. At the time of this writing, the ext2 dump program does not back up EA data, so it is not possible to do device-level backups of a quiescent file system. If LVM device snapshots are available, it is possible to mount the MDS file system from the snapshot device; otherwise, MDS service must be shut down and the MDS device mounted as a regular ext3 file system. Dumping the extended attribute data to a regular file, which can itself be backed up using normal backup tools, requires getfattr from the attr package to be installed (see Figure 7.1). At this point it is possible to use normal backup tools to back up the MDS file system.

mds# mount -t ext3 /dev/sdb1 /mnt/mds
mds# cd /mnt/mds
mds# getfattr -R -d -m . -e hex . > backup.EA

Figure 7.1: backing up extended attributes to a regular file

# mke2fs -J size=400 /dev/sdb1
# tune2fs -c 0 -i 0 -O dir_index /dev/sdb1

Figure 7.2: formatting a target device

7.3 Restoring Backups


The first part of restoring a target file system is to format the file system appropriately (see Figure 7.2). Restoring an OST backup should be done as appropriate for the backup created. In addition to restoring the MDS backup, the extended attributes must also be restored afterwards (see Figure 7.3). After the MDS or OST target is restored from backup, lfsck must be run in order to repair any inconsistencies in the Lustre file system. This is discussed in detail in Section 4.3.

# mount -t ext3 /dev/sdb1 /mnt/mds
# cd /mnt/mds
# setfattr --restore=backup.EA

Figure 7.3: restoring extended attributes


7.4 Exporting via NFS


Lustre has support for NFS exporting of the Lustre file system from one or more clients to hosts which are not capable of mounting Lustre directly. While NFS mounting a Lustre file system doesn't offer the performance possible with a direct Lustre mount on a client, it does make a Lustre file system available on a wide variety of client systems. The performance of NFS-exported Lustre will be addressed in upcoming Lustre releases. To NFS export Lustre from a client, simply add the Lustre client file system mount to the NFS exports table on each exporting node and refresh the kernel exports table via exportfs. The Lustre client's kernel must, of course, have NFS server support compiled into it in order to export to the NFS clients.
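Concretely, the export step might look like this sketch, where /mnt/lustre is an assumed Lustre mount point on the exporting client and the subnet is illustrative:

```shell
# On the exporting Lustre client:
#   echo '/mnt/lustre 192.168.1.0/255.255.255.0(rw,sync)' >> /etc/exports
#   exportfs -ra            # refresh the kernel export table
#   showmount -e localhost  # verify the export is visible
```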

7.5 Exporting via Samba (CIFS)


It should be possible to export Lustre file systems via Samba (CIFS, or Common Internet File System), the same protocol used by Windows computers for sharing file systems. The performance characteristics of such a configuration are currently unknown.

7.6 Upgrading Your Software


7.6.1 Release Notes
Before you install any version of Lustre, please be sure to read the release notes. A link to the release notes for your version of Lustre can be found at http://www.clusterfs.com/lustre.html.

7.6.2 Upgrading From Previous Versions

7.6.2.1

Shutting Down the File System

Before Lustre is upgraded, it should be shut down properly on all nodes. The proper order for this is:

1. Unmount clients
2. Shut down Lustre on MDS nodes
3. Shut down Lustre on OSS nodes

If lconf is not used to shut down the clients, rmmod -a should be run on the clients twice, so that all Lustre modules are unloaded (the first time marks the modules as autocleanable, the second actually does the autocleaning).

7.6.2.2 Upgrading Lustre

The method for upgrading Lustre depends on how it is installed on your systems. If the Lustre modules were included in a kernel RPM, you will need to install the new kernel RPM. When upgrading kernels, we advise you to use rpm -i rather than rpm -U; -U removes older versions, and it is a good idea to keep the current kernel installed in case of problems with the new kernel. The RPMs should update the bootloader menu, but you may have to manually set the new kernel as the default; consult your system's bootloader documentation for information on this. Reboot to run the new kernel. If the Lustre modules are contained in their own RPM, you can upgrade this RPM with the rpm -U command. The node does not need to be rebooted unless a new kernel is also required, but you should make sure that all old Lustre modules are unloaded before restarting Lustre (as above).
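The two RPM cases above might look like the following; the package names are illustrative:

```shell
# Case 1: Lustre modules inside a kernel RPM -- install alongside the
# old kernel, set the bootloader default, then reboot:
#   rpm -i kernel-smp-2.4.xx-lustre.rpm
# Case 2: Lustre modules in their own RPM -- upgrade in place, unload
# the old modules (twice, as described above), no reboot needed:
#   rpm -U lustre-modules-1.x.rpm
#   rmmod -a; rmmod -a
```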

Before you restart, make sure that all nodes were upgraded to the same version of Lustre. If you're planning to run with different versions, please see Section 8.1.


Chapter 8

Mixing Architectures
8.1 Mixing Lustre versions
Running Lustre with different release levels on the clients and servers is not a supported configuration. Given Lustre's very rapid development, there are almost guaranteed to be network protocol differences between minor releases (e.g., 1.0 and 1.2). There are never network protocol changes between micro releases (e.g., 1.2.0 and 1.2.1). However, a strong effort is made to ensure that the on-disk format does not change between releases, or is changed in a manner compatible with older releases. It should always be possible to update from an older Lustre version to a new release; please consult the release notes for a given release in case of compatibility issues. In some circumstances it may be possible to update clients with bug fixes without taking a full file system outage (e.g., as clients finish jobs, without interrupting other running jobs), but this needs to be determined on a case-by-case basis and is not normally recommended.

8.2 Mixing kernel versions


8.2.1 Different client/server kernels
Because the Lustre network protocol is fixed for a specific release, it should not cause problems if clients have different kernel versions (whether kernel minor release levels or different vendor kernels), or if clients have different kernels from servers, as long as both the client and server are built from the same Lustre sources. Lustre kernel patches carry a release number, LUSTRE_KERNEL_VERSION, in <linux/lustre_kernel_version.h>; it is checked at Lustre module build time to verify that the kernel patch matches the version of the Lustre code being built. In some cases it might be beneficial to have slightly different kernel builds on the clients and servers in order to improve performance. Configuring the server kernels with 3 GB kernel address space (1 GB user address space) allows the kernel to cache more metadata if the server nodes have more than 1 GB of RAM. In all cases where the kernel version differs between nodes, the Lustre modules need to be rebuilt for each kernel and installed with the matching kernels.

8.2.2 Mixing 2.4 and 2.6 kernels


It should also be possible to mix 2.4 and 2.6 kernels on both the client and server, subject to any normal restrictions from device drivers (disk, network, etc.). This might be advantageous, for example, to have a 2.6 kernel on server nodes in order to take advantage of the improved I/O throughput, while keeping a 2.4 kernel on the clients because of software compatibility issues.

8.3 Mixing hardware classes


8.3.1 Hardware Page Size
If the client and target have different machine architectures (e.g., i386 and ia64), it is likely that they will also have different hardware page sizes. The i386 kernel only supports a single hardware page size, namely 4096-byte pages. Many other architectures support multiple page sizes; ia64, for example, supports 4096-, 8192-, 16384-, and 65536-byte pages. The page size is configured for these architectures at kernel compile time. Currently, in order to support correct interoperability between clients and servers, the client page size must be the same size as or larger than the server's. For example, it is possible to run ia64 clients with 16 kB pages against i386 OSTs (with 4 kB pages) and an ia64 MDS with 16 kB pages. It is not currently possible to run i386 clients against ia64 servers unless the ia64 machines are configured with 4 kB pages (this is not a standard ia64 configuration).

8.3.2 Endian mixing


Lustre has support for different word endianness on the client and the server nodes. The Lustre RPC layer does endian swabbing on an as-needed basis on the message receiver, so it is possible to mix both big- and little-endian clients and servers. The Lustre network protocol is 64-bit clean, so it is also possible to mix 32-bit and 64-bit clients, subject to the page size limitations discussed earlier. CFS has only done limited testing with mixed-endian environments, so this support should be considered preliminary at this time. A configuration that is known to work is a PPC64 client with an i386 server, for which we have done simple load testing (e.g., iozone) but not large-scale testing.

