
August 2012

IBM Systems and Technology Group

Configuring GPFS for Reliability


High availability for your enterprise applications

Scott Fadden – IBM Corporation


sfadden@us.ibm.com
Contents
Introduction
Quorum
Node Quorum
Node Quorum with Tiebreaker disks
File system descriptor quorum
Replication
How file system descriptor quorum affects replication
IO patterns in a replicated system
Data replication Scenarios
Manual Recovery with replicated data
Procedure for manual recovery
Automatic recovery with replicated data
Automatic recovery with replicated data across sites
Option 1: Remote quorum node
Option 2: Remote tiebreaker disk
Conclusion

Introduction
I receive many questions about how to configure GPFS for
reliability. Questions like "what is quorum" and "why do I need
it", or "what are failure groups" and "how do I use them". This
article is an attempt to bring all of these topics into one place.
It discusses the options you have when configuring GPFS for
high availability.

Every application has different reliability requirements, from
scientific scratch data to mission-critical fraud detection
systems. GPFS supports a variety of reliability levels depending
on the needs of the application. When designing a GPFS cluster,
consider what types of events you need your system to "survive"
and how automatic you want the recovery to be. Any
discussion of reliability in a GPFS cluster starts with quorum.

Quorum
The worst type of failure in a cluster is called split brain. Split
brain happens when multiple nodes in a cluster continue
operations independently, with no way to communicate with
each other. This situation cannot be allowed to happen in a
cluster file system, because without coordination your file
system could become corrupted. Coordination between the
nodes is essential to maintain data integrity. To keep the file
system consistent, a lone node cannot be permitted to continue
writing data to the file system without coordinating with
the other nodes in the cluster. When a network failure occurs,
some node has to stop writing. Who continues and who stops is
determined in GPFS using a mechanism called quorum.

Maintaining quorum in a GPFS cluster means that a majority of
the nodes designated as quorum nodes are able to successfully
communicate. In a three quorum node configuration, two nodes
have to be communicating for cluster operations to continue.
When one node is isolated by a network failure, it stops all file
system operations until communications are restored, so no
data is corrupted by a lack of coordination.

Node Quorum
So how many quorum nodes do you need? There is no exact
answer. Choose the number of quorum nodes based on your
cluster design and your reliability requirements. If you have
three nodes in your cluster, all nodes should be quorum nodes.
If you have a 3,000 node cluster, you do not want all nodes to be
quorum nodes. True, you can't configure all of the nodes to be
quorum since the maximum is 128, but even that is too many.
When a node failure occurs, the quorum nodes have to do some
work to decide what to do. Can cluster operations continue?
Who is the leader? So think of the quorum nodes
just like any other committee: the more members on the
committee, the longer it takes to make a decision.

You can change node designations dynamically, so if a rack of
nodes fails and is going to be down for a while, you can
designate another node as quorum to maintain your desired
level of reliability. Choose the smallest number of quorum
nodes that makes sense for your cluster configuration. Even in
the largest clusters this is typically 5 to 7 quorum nodes; one
quorum node per GPFS building block, or something similar, is
common in large clusters. Yes, 5 and 7 are odd numbers, and
the general recommendation is to choose an odd number of
quorum nodes. This is more a matter of style than requirement,
but it makes sense when considering how many nodes can fail.
If you have 4 quorum nodes you need 3 available (4/2+1=3, or
one more than half) to continue cluster operations, the same as
if you had 5 (5/2+1=3¹) quorum nodes. That is why an odd
number is typically recommended. In a single node cluster (yes,
there are single node production "clusters", typically for HSM)
there is no one to communicate with, so a single quorum node is
all that is required.
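The committee arithmetic above can be sketched in a few lines. This is illustrative Python, not GPFS code:

```python
# Illustrative sketch of node quorum arithmetic (not GPFS code):
# a majority of the quorum nodes, floor(n/2) + 1, must stay reachable.

def nodes_needed(quorum_nodes: int) -> int:
    """Quorum nodes that must remain in contact for operations to continue."""
    return quorum_nodes // 2 + 1

def failures_tolerated(quorum_nodes: int) -> int:
    """Quorum node failures the cluster can survive."""
    return quorum_nodes - nodes_needed(quorum_nodes)

# 4 quorum nodes need 3 survivors (tolerating only 1 failure), the same
# survivor count as 5, which is why an odd count is usually recommended.
for n in (1, 3, 4, 5, 7):
    print(n, nodes_needed(n), failures_tolerated(n))
```

Note that 4 and 5 quorum nodes both require 3 survivors, but 5 tolerates one more failure.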

Node Quorum with Tiebreaker disks
Use tiebreaker disks when you have a two node cluster, or when
all of the nodes in a cluster are SAN attached to a
common set of LUNs and you want to continue to serve data
with a single surviving node. Typically tiebreaker disks are only
used in two node clusters.

Tiebreaker disks are not special NSDs; you can use any NSD as a
tiebreaker disk. You can choose one from each of three different file
systems, or from different storage controllers, for additional
availability.

¹ Yes, I know that (5/2)+1 is 3.5 in ordinary arithmetic, but you cannot
have half a quorum node; the division is integer division, so the
requirement is 3, one more than half of the quorum nodes.


In most cases using tiebreaker disks adds to the
duration of a failover event, because there is an extra lease
timeout that has to occur. In a two node cluster you do not have
a choice if you want reliability, which is why it is commonly
recommended that if you have more than 2 nodes you use node
quorum with no tiebreaker disks.

Using tiebreaker disks can improve failover performance only if
you use SCSI-3 persistent reserve.
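As a hedged sketch, tiebreaker disks are designated with mmchconfig; the NSD names below are placeholders, and the exact requirements (such as whether GPFS must be stopped first) should be checked against your release's documentation:

```shell
# Sketch: designate up to three existing NSDs as tiebreaker disks
mmchconfig tiebreakerDisks="nsd1;nsd2;nsd3"

# Optional: enable SCSI-3 persistent reserve to shorten failover
mmchconfig usePersistentReserve=yes

# Revert to plain node quorum (no tiebreaker disks)
mmchconfig tiebreakerDisks=no
```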

File system descriptor quorum
File system descriptor quorum is one type of quorum in GPFS
that is often overlooked. In a GPFS file system every disk has a
header that contains information about the file system. This
information is maintained on every disk in a file system, but
when there are more than three NSDs in a file system only 3
copies of the file system descriptor are guaranteed to have the
latest information; all of the others are updated asynchronously
when the file system configuration is modified. Why not keep all
of them up to date? Consider a file system with 1,000 disk drives.
Each file system command would require that each copy is
guaranteed to be up to date, and that many copies can be difficult
to guarantee, so three are maintained as official copies. For a
file system to remain accessible, two of the three official copies
of the file system descriptor need to be available. We will
discuss this more after looking at replication.

Replication
In GPFS you can replicate (mirror) a single file, a set of files, or
the entire file system, and you can change the replication status
of a file at any time using a policy or command. You can
replicate metadata (file inode information), file data, or both.
In reality, though, if you do any replication you need to
replicate metadata: without replicated metadata, if there is a
failure you cannot mount the file system to access the
replicated data anyway.
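As a sketch of the commands involved, replication can be set at file system creation or changed later; fs0, the stanza file, and the file path here are placeholder names:

```shell
# Sketch: create a file system with default and maximum replication
# of 2 for both metadata (-m/-M) and data (-r/-R)
mmcrfs fs0 -F nsd.stanzas -m 2 -M 2 -r 2 -R 2

# Or change the default replication of an existing file system...
mmchfs fs0 -m 2 -r 2
# ...and rewrite existing files to match the new settings
mmrestripefs fs0 -R

# Change the replication of a single file at any time
mmchattr -m 2 -r 2 /gpfs/fs0/importantfile
```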

A replication factor of two in GPFS means that each block of a
replicated file is in at least two failure groups. A failure group is
defined by the administrator and contains one or more NSDs.
Each storage pool in a GPFS file system contains one or more
failure groups, and failure groups can be changed at any time.
So when a file system is fully replicated, any single failure group
can fail and the data remains online.

How file system descriptor quorum affects replication
So far we have discussed a replication factor of two and two
failure groups. There is one more aspect to replicating data in
GPFS that is important to consider: file system descriptor
quorum. Remember that for a file system to remain accessible,
two of the three official copies of the file system descriptor
need to be available. How can you do that in a replicated file
system with two failure groups? You can't. When there are
more than three NSDs in a file system, GPFS creates three
official copies of the file system descriptor. With two failure
groups, GPFS places one descriptor in one failure group and the
other two in the other failure group (assuming there are at
least 3 NSDs). In this configuration, if you lose the failure group
which contains the two official copies of the file system
descriptor, the file system unmounts. Therefore, for the file
system to remain accessible you need to create one more
failure group that contains at least a single NSD.

Typically this third failure group contains a single small NSD that
is defined with a usage type of descriptor only (descOnly). The
descOnly designation means that this disk does not contain any
file metadata or data; it is only there to hold one of the official
copies of the file system descriptor. The descOnly disk does not
need to be high performance and only needs to be 20MB or more
in size, so this is one case where a local partition on a node is
often used for this NSD. To create a descOnly NSD on a node,
you can use a partition from a local LUN and define that node as
the NSD server for that LUN so other nodes in the file system can
see it.
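A descOnly NSD can be described in an NSD stanza file, sketched below; the device, server, and NSD names are placeholders (older GPFS releases use the colon-separated disk descriptor format instead of stanzas):

```shell
# Sketch: stanza describing a small descriptor-only NSD in a third
# failure group, served by the node that owns the local partition
cat > desc.stanza <<'EOF'
%nsd:
  device=/dev/sdx1
  nsd=desc3nsd
  servers=node3
  usage=descOnly
  failureGroup=3
EOF

mmcrnsd -F desc.stanza        # create the NSD
mmadddisk fs0 -F desc.stanza  # add it to the file system
```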

IO patterns in a replicated system
When replicating a file system, all writes go to all failure groups
in a storage pool. With replicated data, though, since you have
two copies of the information there are some optimizations
GPFS can make when your application is reading data. By
default, when a file system is replicated GPFS spreads the reads
over all of the available failure groups. This configuration
provides the best read performance when the nodes running
GPFS have equal access to both copies of the data. For example,
this behavior is good if GPFS replication is used in a single data
center to replicate over two separate storage servers that are
all SAN attached to all of the GPFS nodes.

The readReplicaPolicy configuration parameter allows you to
change the read IO behavior in the file system. If you change
this parameter from default to a value of local, GPFS changes the
read behavior with replicated data. A value of local has two
effects on reading data in a replicated storage pool: instead of
simply reading from both failure groups, GPFS reads data from
the failure group that is either on a local block device or on a
local NSD server.

A local block device means that the path to the disk is through a
block special device; on Linux, for example, that would be a
/dev/sd* device, or on AIX a /dev/hdisk device. GPFS does not
do any further determination, so if disks at two separate sites
are connected using a long distance SAN connection, GPFS
cannot distinguish which copy is local. To use this option,
connect the sites using the NSD protocol over TCP/IP or
InfiniBand Verbs (Linux only).

A local NSD server is determined by GPFS using the subnets
configuration setting, which defines what NSD servers are "local"
to an NSD client. For NSD clients to benefit from "local" read
access, the NSD servers supporting the local disk need to be on
the same subnet as the NSD clients accessing the data, and that
subnet needs to be defined using the "subnets" configuration
parameter. This parameter is useful when GPFS replication is
used to mirror data across sites and there are NSD clients in the
cluster; it keeps read access requests from being sent over
the WAN.
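Putting the two settings together, a configuration sketch might look like this; the subnet values match Figure 1 and are examples only:

```shell
# Sketch: prefer the local replica when reading replicated data
mmchconfig readReplicaPolicy=local

# Define which subnets are "local" so NSD clients read from the
# NSD servers at their own site (values are examples)
mmchconfig subnets="5.3.2.0,1.2.3.0"
```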


Figure 1: Multi-site Configuration (an NSD server cluster and compute
cluster at Location 1 on local subnet 5.3.2.*, a compute cluster at
Location 2 on local subnet 1.2.3.*, and a System x3650 at Location 3)

Figure 1 is an example of a multisite
configuration that can benefit from a readReplicaPolicy of local.
In this example, Location 1 and Location 2 both have a copy of
the file system data and metadata. The subnets parameter for
the clusters is configured as subnets="5.3.2.0,1.2.3.0" and
readReplicaPolicy=local, so the compute cluster at Location 1
reads from the NSD servers in Location 1.

Data replication Scenarios
Most clusters do not need the highest availability, with replicated data
spread across three sites; for most, multiple nodes and RAID-protected
storage are sufficient. This section examines various GPFS cluster
configurations, looking at what each configuration means for data
reliability. It starts with the most common cluster architectures and
works toward the most highly available configuration.

Manual Recovery with replicated data
If for some reason you cannot configure a system for automatic recovery,
as long as you have replicated metadata and data you have the option of
performing a manual recovery in the event of a failure. If a site fails,
for example, a GPFS administrator can bring the data back online manually.

Procedure for manual recovery

1. Shut the GPFS daemon down on the surviving nodes:

mmshutdown -N surviving-nodes

2. If necessary, assign a new primary cluster configuration server:

mmchcluster -p survivingNode

3. Relax node quorum by temporarily changing the designation of each of
the failed quorum nodes to non-quorum:

mmchnode --nonquorum -N quorumNode1,quorumNode2

4. Relax file system descriptor quorum by migrating the file system
descriptor off of the failed disks:

mmfsctl fs0 exclude -d "gpfs1nsd"

5. Restart the GPFS daemon on the surviving nodes:

mmstartup -N surviving-nodes

6. Mount the file system on the surviving nodes:

mmmount fs0 -N surviving-nodes


Automatic recovery with replicated data
Automatic recovery with direct attached storage requires at least three
storage devices so you can maintain file system descriptor quorum:
typically two separate RAID arrays, plus a third small LUN from another
RAID array or a disk local to one of the nodes.

Figure 2: Automatic failover, single site (local attached with
replication: two SAN-attached System x3650 quorum nodes connected over
TCP/IP, two EXP3512 arrays as Failure Group 1 and Failure Group 2, and a
descOnly local drive as Failure Group 3)

In this configuration, replication is typically configured as the default
for data and metadata, and any single hardware component can fail while
the file system remains online. To set up this type of architecture,
place a tiebreaker disk in each failure group, and define the node
containing the descOnly LUN as the NSD server for that LUN so all nodes
can access the device.

Automatic recovery with replicated data across sites
To provide the highest level of data availability, it is recommended that
you use a three site configuration with GPFS metadata and data
replication. There are two ways to create a three site configuration:

 Using a remote quorum node
 Using a remote tiebreaker disk

These configurations are based on the GPFS use of quorum: node quorum (or
node quorum with tiebreaker disks) and file system descriptor quorum. In
the event of a site failure, your GPFS cluster needs to maintain both node
quorum and file system descriptor quorum.

Option 1: Remote quorum node
In the remote quorum node configuration, a third site contains a GPFS
node and the cluster uses node quorum. The quorum node does not have a
performance requirement, so you have great flexibility in how this site
is configured; for example, the quorum node can be run on a virtualized
host with fractional CPU access. The remote node has a few properties:


1. It is a GPFS node with the designation quorum-client.
2. It has TCP/IP access to all of the other nodes in the cluster.
3. It has one local disk or partition available for each replicated file
system. This partition can be created using OS partitioning tools and is
defined as a descOnly disk and placed in a third failure group.
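A hedged sketch of adding such a remote node; the node name is a placeholder and the designation syntax should be verified against your GPFS release:

```shell
# Sketch: add the third-site node as a quorum node that is not a
# file system manager candidate (quorum-client designation)
mmaddnode -N site3node:quorum-client

# If the node already exists without the quorum role, change it later
mmchnode --quorum -N site3node
```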

Figure 3: Remote quorum node (SAN-attached EXP3512 storage in Failure
Group 1 at Location 1 and Failure Group 2 at Location 2, with a System p5
node at Location 3 holding a disk descriptor NSD in Failure Group 3, all
connected over TCP/IP)

Figure 3 shows an example of a configuration using a remote quorum node.

Option 2: Remote tiebreaker disk
In this configuration, a tiebreaker disk is located at a separate site.
The tiebreaker disk is created as a descOnly NSD and placed in a third
failure group. This NSD contains a file system descriptor and is defined
as a tiebreaker disk. One small LUN (50MiB) is required for each
replicated file system.


Figure 4: Remote tiebreaker disk (System x3650 nodes with EXP3512 storage
at Location 1 and Location 2, and a third EXP3512 at Location 3 reached
over the SAN)

Figure 4 shows an example of a configuration using a remote tiebreaker
disk over a long distance SAN connection. The connection to storage could
be based on any disk connection technology, iSCSI for example.

Conclusion
GPFS can be configured for anything from basic availability using
RAID-protected data all the way to multi-site configurations with GPFS
replicated metadata and data. Which configuration you choose depends on
your requirements and budget.

© IBM Corporation 2012

IBM Corporation
Marketing Communications
Systems Group
Route 100
Somers, New York 10589

Produced in the United States of America


August 2012
All Rights Reserved

This document was developed for products and/or


services offered in the United States. IBM may not
offer the products, features, or services discussed in
this document in other countries.

The information may be subject to change without


notice. Consult your local IBM business contact for
information on the products, features and services
available in your area.

All statements regarding IBM’s future directions and


intent are subject to change or withdrawal without
notice and represent goals and objectives only.

IBM, the IBM logo, AIX, eServer, General Purpose File


System, GPFS, pSeries, System p, System x, Tivoli are
trademarks or registered trademarks of International
Business Machines Corporation in the United States or
other countries or both. A full list of U.S. trademarks
owned by IBM may be found at
http://www.ibm.com/legal/copytrade.shtml.

UNIX is a registered trademark of The Open Group in


the United States, other countries or both.

Linux is a trademark of Linus Torvalds in the United


States, other countries or both.

Windows is a trademark of Microsoft in the United


States, other countries or both.

Intel is a registered trademark of Intel Corporation in


the United States and/or other countries.

Other company, product, and service names may be


trademarks or service marks of others.

Information concerning non-IBM products was


obtained from the suppliers of these products or
other public sources. Questions on the capabilities of
the non-IBM products should be addressed with the
suppliers.

When referring to storage capacity, 1 TB equals total


GB divided by 1024; accessible capacity may be less.

The IBM home page on the Internet can be found at


http://www.ibm.com.

