
Disk path design for AIX including SAN zoning

The purpose of this document is to describe how to design the SAN setup to keep the number of disk paths to a reasonable level. Other factors include the use of VIO, NPIV and dual SAN fabrics. Setting up the storage involves the storage, SAN and AIX administrators, so we'll look at it from those perspectives. The article will examine how to determine the number of possible paths based on the SAN cabling, cover the concepts of SAN zoning, LUN masking and host port groups and how these affect disk paths, and look at disk paths with VIO using VSCSI or NPIV. The general case includes some number of available host HBA ports and some number of storage ports, and while this article doesn't cover the general case, the author hopes that by going thru a few specific examples, you'll have the skills to handle your situation.

Overview

The basic approach taken here is to group the sets of host and storage ports into a number of subsets, and then, using SAN zoning and/or LUN masking, connect each subset of host ports to a subset of storage ports to keep the number of paths to a reasonable level. The full set of LUNs for a server is then also split into the same number of subsets, and the LUNs in one subset are assigned to use one subset of host and storage ports. This description may not be entirely clear on its own, so to fully understand the approach we'll go over the concepts it relies on: LUN assignment, SAN zoning, LUN masking and disk paths.

In the setup of SAN storage, LUNs are created on a disk subsystem and then assigned to a group of world wide names (WWNs) representing fibre channel (FC) ports on a host (or a group of hosts in clustered applications). WWNs come in two forms: the world-wide port name (WWPN), which refers to a specific port on a system, and the world-wide node name (WWNN), which can refer to the entire system and all of its ports. Most storage systems will only work with a host's WWPNs when configuring host access to a LUN. A WWN is a 16 digit hexadecimal number that is typically burned into the hardware by the manufacturer. The group of WWNs is called various things depending on the disk subsystem: on the DS5000 it is called a "storage partition," "host" or "host group," the SVC calls it a "host object," the XIV calls it a "host" or "cluster," and the DS8000 calls it a "port group." For this article, we'll use the phrase "host port group." It's also worth knowing that when purchasing some disk subsystems, such as the DS5000, one must specify the number of host port groups to which the disk subsystem will be connected. The terminology for assigning or mapping a LUN to a host port group also varies across disk subsystems, and, depending on the disk subsystem, you may be able to control which storage ports are used to handle IO. Controlling which storage ports are used to handle IO for a LUN is called LUN masking (or port masking). The storage might have additional terminology which can also lead to confusion; for example, on the DS8000 a group of LUNs that are assigned together is called a "volume group," which shouldn't be confused with an LVM volume group.

The AIX administrator can display the WWN for an adapter with the lscfg -vl <fcs#> command, and SAN switch administrators can also see the WWNs attached to the switch:

# lscfg -vl fcs0
  fcs0   U789D.001.DQD51D7-P1-C1-T1   4Gb FC PCI Express Adapter (df1000fe)

Alternatively one may use the fcstat command:

# fcstat fcs0
FIBRE CHANNEL STATISTICS REPORT: fcs0

Device Type: FC Adapter (adapter/pciex/df1000fe)
Serial Number: 1F8240C89D
Option ROM Version: 02E8277F
Firmware Version: Z1D2.70A5
World Wide Node Name: 0x20000000C97710F3
World Wide Port Name: 0x10000000C97710F3

....

To understand the number of paths to a LUN, consider Figure 1, which shows a server with two fibre channel (FC) ports, three FC ports on the storage, and a single FC switch. In this setup each LUN can have up to 6 paths: the number of host ports times the number of storage ports.
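On AIX with MPIO, the administrator can confirm how many paths an hdisk actually has with the lspath command. The sketch below assumes hypothetical device names (hdisk4 for the LUN, fscsi0 and fscsi1 for the two host FC ports); for the Figure 1 setup we'd expect three path entries per host port, six in total:

# lspath -l hdisk4
Enabled hdisk4 fscsi0
Enabled hdisk4 fscsi0
Enabled hdisk4 fscsi0
Enabled hdisk4 fscsi1
Enabled hdisk4 fscsi1
Enabled hdisk4 fscsi1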

Figure 1

Many disk subsystems also allow you to specify which disk subsystem ports will be used to handle IOs for a LUN; this is known as LUN masking. We can also zone the SAN so that the number of paths is further reduced. There are trade-offs here. It would be simpler to just use all ports on both the storage and the server, and with a load balancing algorithm on the host we would balance use of all the resources: server ports, SAN links and storage ports. However, with too many paths, the overhead of path selection can reduce performance. It's worth pointing out that the algorithm and path control module (PCM) used for path selection affect the overhead of path selection. For example, with algorithm=fail_over under MPIO, the path is fixed, so no cycles are used for path selection. Load balancing algorithms vary in how they choose the best path: they might choose the path with the fewest outstanding IOs, they might examine IO service times and choose the path providing the best IO service times, or they might just choose a path at random. Often the method isn't documented.
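As an illustration of how the path selection algorithm is controlled with the default AIX PCM, the commands below query and change the algorithm attribute of an hdisk. The device name hdisk4 is a hypothetical example, the set of available algorithms depends on the PCM in use, and the disk must not be in use when the attribute is changed (otherwise use chdev -P and reboot):

# lsattr -El hdisk4 -a algorithm
algorithm fail_over Algorithm True
# chdev -l hdisk4 -a algorithm=round_robin
hdisk4 changed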

SAN Zoning

A SAN zone is a collection of WWNs that can communicate with each other. In Figure 1 there are 5 WWNs: two for the host ports and three for the storage ports. There are initiators (host FC ports) and targets (storage ports); the terminology arises because an initiator initiates an IO and the request is sent to a target. There are various kinds of zoning: soft and hard, and WWN and port zoning (e.g. see http://en.wikipedia.org/wiki/Fibre_Channel_zoning), and the subject is much broader than will be covered here. However, we do need to understand some basic SAN zoning to understand how it, and LUN masking, relate to disk paths. WWN zoning is relatively popular, so we'll examine it first. There is a difference between WWPN and WWN; however, they are often used interchangeably, which can lead to some confusion. Zoning best practice is to always use WWPNs when implementing the WWN zoning method. The Wikipedia article states that "With WWN zoning, when a device is unplugged from a switch port and plugged into a different port (perhaps on a different switch) it still has access to the zone, because the switches check only a device's WWN - i.e. the specific port that a device connects to is ignored." A dual port host FC adapter will have one node name (WWN or WWNN) but two WWPNs, so this is a benefit of WWN zoning. It's also worth noting that to use this capability the fscsi device attribute dyntrk must be set to yes (the default is no).

Best practice for WWN zoning is to use "single initiator zoning," where an initiator is an FC port on the host; such a zone can include multiple targets. A major reason for this is that as initiators are added to a zone (e.g. when a host is booted) the initiator logs into the fabric (a PLOGI), and this causes a slight delay for other ports in the zone, with the delay being longer for larger zones. Another benefit of single initiator zoning is that it minimizes issues that may result from a faulty initiator affecting others in the zone. (Recall that an initiator is a fibre channel port on a host, since it initiates the IO, and a target is a port on the storage, since the IO is targeted to it.) So best practice for Figure 1 would be to have two zones, each containing one host port and, typically, all three storage ports:

Zone 1: host port 1, storage port 1, storage port 2, storage port 3
Zone 2: host port 2, storage port 1, storage port 2, storage port 3

Note that some disk subsystems have restrictions on zoning, such as the DS4000 when using RDAC, where all host ports should not be zoned to all storage ports.
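As a concrete sketch of the two single initiator zones above, the commands below show how they might be defined on a Brocade switch. The zone names and configuration name are hypothetical, the first host WWPN is taken from the fcstat example earlier while the second host WWPN and the storage WWPNs are made up, and the exact commands vary by switch vendor and firmware:

zonecreate "host1_fcs0_zone", "10:00:00:00:c9:77:10:f3; 50:05:07:63:04:13:00:a1; 50:05:07:63:04:13:00:a2; 50:05:07:63:04:13:00:a3"
zonecreate "host1_fcs1_zone", "10:00:00:00:c9:77:10:f4; 50:05:07:63:04:13:00:a1; 50:05:07:63:04:13:00:a2; 50:05:07:63:04:13:00:a3"
cfgcreate "prod_cfg", "host1_fcs0_zone; host1_fcs1_zone"
cfgenable "prod_cfg"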

Figure 2 shows the difference between WWN zoning and switch port zoning. The shaded area in the figure represents a single initiator zone, allowing communication with two storage ports. There would be similar zones for each host port: host port 2 communicating with storage ports 1 and 2, and host ports 3 and 4 communicating with storage ports 3 and 4. The switch port zones allow host ports 1 and 2 to communicate with storage ports 1 and 2, and similarly host ports 3 and 4 with storage ports 3 and 4. So we've achieved the same connections from the server to the storage, but using different zoning approaches. WWN zoning offers the flexibility to move cables from one port on a switch to another, while switch port zones do not. It is possible to have a single switch port in multiple port zones, so that offers some flexibility. With NPIV, where vFC adapters move around during a Live Partition Migration (LPM), WWN zoning also allows the zone to follow the vFC adapter.

Figure 2

Customers often implement dual SAN fabrics. This has the benefit that if there's a problem with one fabric, hosts still have access to storage via the other fabric. Figure 3 shows a common dual SAN setup:

Figure 3

Note that in this example we could have up to 8 paths to a LUN, 4 thru each SAN fabric. Also, while each SAN here has just one switch, it's possible to have more. This is also an example of a belt and suspenders setup: a single connection from the host to each switch, and a single connection from the storage to each switch, would already provide full redundancy, and we'd have only two paths for each LUN in that case. However, with that minimal setup, if one SAN fabric fails and the wrong link, port or adapter also fails, the host can lose access to the storage. SAN zoning would comprise 4 zones, one for each host port, with just 2 storage ports in each zone. Note that if the customer is using dual port adapters in the host, the best practice is to attach one port of an adapter to one fabric and the other port to the other fabric. This enhances availability in that if both an adapter and a SAN fabric fail, we won't lose access to the storage.

Disk Paths with VIO and VSCSI LUNs

There are two layers of multi-path code in this case: the multi-path code on the VIO server (VIOS), which will be whatever the storage supports (often there's a choice, e.g. SDD or SDDPCM for the DS8000), and the multi-path code on the VIO client (VIOC), which will be MPIO using the SCSI PCM that comes with AIX. On the VIOS, the multi-path code is used to choose among the FC paths to the disk, while on the VIOC it's used to choose among paths across dual VIOSs. The fact that there are two layers of multi-path code often leads to some confusion. In a dual VIOS setup, LUNs will typically have multiple paths to each VIOS, and the VIOC can use paths thru both VIOSs. It's worth knowing that all IO for a single LUN on a VIOC with VSCSI will go thru one VIOS (except in the case of a VIOS failure, or failure of all of that VIOS's paths to the disk). One can set a path priority so that IOs for half the LUNs use one VIOS and half use the other. This is recommended to balance use of the VIOS resources (including their HBAs) and get the full bandwidth available. Figure 4 shows a dual VIOS setup.
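As an illustration of the path priority recommendation, the commands below sketch how the two VSCSI paths of an hdisk on the VIOC might be listed, and how one path can be deprioritized (a higher priority value means lower preference) so that the other VIOS is preferred for that LUN; the device names (hdisk0, vscsi0, vscsi1) are hypothetical:

# lspath -l hdisk0
Enabled hdisk0 vscsi0
Enabled hdisk0 vscsi1
# chpath -l hdisk0 -p vscsi1 -a priority=2
path Changed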

Figure 4

The storage and SAN administrators will zone LUNs to adapters on both VIOSs, and the AIX administrator will map LUNs from each VIOS to the VIOC. For the example in Figure 3, each VIOS potentially has 4 paths to each LUN. The VIOC, however, sees only two paths for its hdisks, and those are the paths to the VIOSs. Using single initiator zoning, there would be 4 zones. So this case essentially reduces to that of a server (here the VIOS) with 2 HBAs, and we need only concern ourselves with disk path design at the VIOS, since the VIOC will always have exactly two paths for each LUN in a dual VIOS setup.

Disk Paths with VIO and NPIV LUNs

With NPIV, only one layer of multi-path code exists, and that's at the VIOC. The multi-path code used will be specified by the storage vendor. One advantage of this is that IOs for each LUN can be balanced across VIOSs, and one doesn't need to set path priorities. With NPIV in the Figure 3 configuration, the AIX/VIO administrator would typically create 4 virtual FC (vFC) adapters, so each LUN potentially has 8 paths. Zoning would be the same as in the VSCSI case (4 zones, one for each vFC). One difference between NPIV and VSCSI is that when creating a vFC, two WWNs are created to support Live Partition Migration (LPM), so the storage administrator assigns the LUN to both WWNs. As the partition moves around among systems, the two WWNs are used alternately. Both WWNs for the vFC can be used in a single zone if the number of zones needs to be limited.
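As a sketch of the VIOS side of an NPIV setup, the commands below (run as padmin on the VIOS) map a virtual FC server adapter to a physical FC port and then list the NPIV mappings; the adapter names vfchost0 and fcs0 are hypothetical, and the pair of WWPNs for each vFC client adapter is generated by the HMC when the client adapter is created:

$ vfcmap -vadapter vfchost0 -fcp fcs0
$ lsmap -all -npiv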

Disk Path Limits and Recommendations

The MPIO architecture doesn't have a practical limit on the number of paths (i.e. you can have far more paths than you need), though there are limits with SDDPCM, which uses MPIO: the latest SDDPCM manual states that SDDPCM supports a maximum of 16 paths per device. More paths also require more memory and can affect the ability of the system to boot, so there is a limit there (see http://publib.boulder.ibm.com/infocenter/aix/v6r1/index.jsp?topic=/com.ibm.aix.baseadmn/doc/baseadmndita/devconfiglots.htm), but this applies to configurations with so many devices that managing them is likely to be impractical, and most customers will have far fewer than the maximum. The SDDPCM manual also states that "with the round robin or load balance path selection algorithms, configuring more than four paths per device might impact the I/O performance. Use the minimum number of paths necessary to achieve sufficient redundancy in the SAN environment." So the downside is that the CPU and memory overhead to handle path selection increases with more paths, and this also has a very slight impact on IO latency; however, at this time the author has been unable to find data quantifying the overhead. Error recovery time (for failed paths, host ports or storage ports) may be a bigger factor with more paths, but data is lacking to make specific statements on overhead at the time this document was written. The path selection algorithm also affects the overhead. Algorithms that load balance across paths might do so based on the number of outstanding IOs on each path, or perhaps the average IO service times for IOs on each path. For example, SDDPCM offers the load_balance and load_balance_port algorithms. These load balancing approaches require capturing IO data and calculating the best path, and consequently require more overhead than a round_robin algorithm, which just keeps track of the last path used, or a fail_over algorithm, which uses the available path with the highest priority.

Bandwidth Sizing Considerations

Bandwidth sizing considerations include the number of host adapters, storage adapters and SAN links, and the SAN link speeds (currently 1, 2, 4, 8 or 10 Gbps). Host and storage adapters have limits on the number of IOPS they can perform for small IOs, and also a limit on the thruput in MB/s for large IOs. The adapter IOPS limits depend on the processors on the host/storage adapters (assuming the IOPS bandwidth isn't limited by the number of disk spindles, or by IOPS to the disk subsystem cache). Typically the limit for thruput is gated by the links. For example, a 4 Gbps link is capable of approximately 400 MB/s of simplex (one direction) thruput, and 800 MB/s of duplex thruput, as there is both a transmit and a receive fibre in the cable. However, the host or storage adapter may not be able to handle this thruput. So the sizing is done based on the expected IO workload: either using IOPS for small block IO or MB/s for large block IO. Often the IOPS bandwidth of the storage adapters differs from that of the host adapters, and in such a case we might not have the same number of host ports and storage ports. For example, if the host adapter can perform 3X the IOPS of the storage adapters, a balanced design would have 3X as many storage adapters and ports as host adapters and ports. Or, if we size for large block IO and thruput, we might have host adapters operating at 4 Gbps to a switch and storage operating at 2 Gbps.
In such a case we'd have twice as many links from the storage to the switch as from the host to the switch. Often the minimum number of ports is sufficient for many workloads, but with VIO and many VIOCs, or with very high IOPS or MB/s workloads, sizing is important to avoid performance problems. Alternatively, one can implement a solution, determine whether more bandwidth is needed, and add it then.
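As a hypothetical sizing sketch (the numbers are illustrative, not measured adapter limits): suppose a workload needs roughly 1,600 MB/s of large block thruput, host ports run at 8 Gbps (about 800 MB/s simplex each) and storage ports at 4 Gbps (about 400 MB/s each). Ignoring redundancy for the moment:

   host ports needed    = 1600 MB/s / 800 MB/s per port = 2
   storage ports needed = 1600 MB/s / 400 MB/s per port = 4

We'd then round up as needed to spread the ports across dual fabrics and to meet availability requirements.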

Note that while bandwidth is related to the number of paths for a single LUN, they are different things. Adding more bandwidth only improves performance if there are bottlenecks or heavily utilized components in the solution. It's also worth considering the question of how many paths are best. Two physically independent paths are sufficient for availability; thus one would have two host adapters, two storage adapters, and links to two physical SAN switches. Beyond this, we only need more physical resources for additional bandwidth, and those additional resources add more potential paths. So the approach to take is to configure sufficient resources for the bandwidth needed, then use SAN zoning and LUN masking to keep the number of paths to a reasonable level.

Disk Subsystem Considerations

Every disk subsystem has architectural considerations that affect disk path design, involving both availability and utilization of various disk subsystem resources. Some disk subsystems, such as the DS5000 and the SVC, have an active/passive controller design where each LUN is handled by one disk subsystem controller, and the passive controller is only used in the event the active controller (or all paths to it) fails. Typically the storage administrator will assign half the LUNs to each controller to balance use of the controllers. The path selection algorithms will normally only use paths to the primary controller; thus, typically only half the paths for each LUN will be used. On the DS8000, there are up to 4 host ports per adapter, and groups of 4 adapters reside in a host bay that might be taken offline for maintenance. So for availability one would want paths to different adapters, rather than paths to different ports on the same adapter, and similarly paths to adapters in different host bays in case one needs to be taken offline. On the SVC, there are up to 8 nodes with 2 nodes per IO group. Each IO group has its own cache, fibre channel ports and processors, and IOs for a LUN are handled by a single node. To fully utilize all the SVC resources, one may want the storage administrator to balance the LUNs for a host across all the available nodes and IO groups. To balance the use of disk subsystem resources, one needs a certain number of LUNs: to balance use across, say, two storage controllers, we need at least 2 LUNs and preferably an even number; to balance use across the 4 host bays on the DS8000, we'd need at least 4 LUNs and preferably a multiple of 4; and to fully utilize all the resources of an 8 node SVC, one would need a minimum of 8 LUNs and preferably a multiple of 8. Laying out the data so that IOs are balanced across LUNs then ensures that IOs are balanced across these resources.

Methodology

The approach to reduce the number of potential paths to a reasonable level is to create subsets of both the storage and host ports, and to connect these subsets together. We'll examine this by going thru a few examples.

Example 1: 8 host ports and 8 storage ports with a single SAN fabric

Figure 5

In this example there are potentially 64 paths for each LUN, which is more than is recommended. A simple solution is to create two groups, or subsets, of four ports on the host (host ports 1-4 and host ports 5-8) and two groups of four ports on the storage (storage ports 1-4 and storage ports 5-8), as represented by the blue and red links connected to the ports in Figure 5. Using single initiator zoning, the first SAN zone would have host port 1 zoned to storage ports 1-4, the next would have host port 2 zoned to storage ports 1-4, ..., host port 5 zoned to storage ports 5-8, etc. If the storage doesn't offer LUN masking, the storage administrator would create two host port groups (the first containing host ports 1-4) and assign half the LUNs to each group. If the storage offers LUN masking, a single host port group could be used (with host ports 1-8) and the storage administrator would assign half the LUNs to use storage ports 1-4 and the other half to use storage ports 5-8. With the SAN zoning, storage port 1 would not be able to communicate with host ports 5-8. This results in a total of 16 paths per LUN (4 host ports x 4 storage ports), and assuming the IOs are evenly balanced across the LUNs, the IOs will be balanced across the paths. Alternatively, we could create 4 groups of 2 ports on the host and similarly on the storage, as shown in Figure 6, with each group represented by a different color.

Figure 6

SAN zoning and assignment of the LUNs (using either LUN masking with 2 storage ports used per LUN, or using 4 host port groups) would be similar, and this yields 4 paths per LUN.
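To verify that each LUN ended up with the intended number of paths and the intended subset of storage ports, the AIX administrator can list the paths with their connection information, which includes the storage port WWPN. A hypothetical sketch for the 4-paths-per-LUN variant (device names and WWPNs are illustrative):

# lspath -l hdisk10 -F "status parent connection"
Enabled fscsi0 500507630413000a,4011400000000000
Enabled fscsi0 500507630413000b,4011400000000000
Enabled fscsi1 500507630413000a,4011400000000000
Enabled fscsi1 500507630413000b,4011400000000000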

Example 2: 8 host ports and 8 storage ports with a dual SAN fabric

Figure 7

In this example we'd have half the storage ports and half the host ports on each fabric: fabric 1 (represented by a single FC switch in Figure 7) contains host ports 1-4 and storage ports 1-4, and fabric 2 contains the other ports. Thus we'd have 16 potential paths for each LUN on each fabric (4 host ports x 4 storage ports), for a total of 32 potential paths per LUN.

So we can create two groups or subsets of 4 ports on the host and two groups or subsets of 4 ports on the storage, as represented by the red and blue links connected to the ports in Figure 7, to get down to 8 paths per LUN. Again we'd use single initiator zoning; taking host port 1, it would be zoned to storage ports 1 and 2. If the disk subsystem offers LUN masking, we can use a single host port group for the host: half the LUNs would be assigned to the host port group via storage ports 1, 2, 5 and 6, and the other LUNs via storage ports 3, 4, 7 and 8. If the storage doesn't offer LUN masking, then we'd use two host port groups, with host ports 1, 2, 5 and 6 in the first group. The storage administrator would then assign half the LUNs to each host port group, and the zoning would assure that only the appropriate host ports are used. To reduce the number of paths to 4, and assuming the disk subsystem supports LUN masking, the storage administrator could evenly split the LUNs across pairs of storage ports rather than the groups of 4 suggested previously; for example, LUN 1 would use storage ports 1 and 5, LUN 2 ports 2 and 6, etc. Without LUN masking, we could use 4 groups or subsets of the links rather than the 2 shown in Figure 7, and this would reduce the number of paths per LUN to 2, which is sufficient for availability but may not provide the belt and suspenders availability that more paths offer.

Example 3: 4 host ports and 8 storage ports with a dual SAN fabric

Figure 8

This example shows a situation in which the host ports have twice the bandwidth of the storage ports; thus, we're using twice as many ports on the storage as on the host. Potentially we have 16 paths to each LUN (8 per fabric). We create two groups or subsets, with each group containing two host ports (one for each fabric) and 4 storage ports (two for each fabric), as represented by the different colors in the figure. SAN zoning again is single initiator, with host port 1 zoned to storage ports 1 and 2, and similarly for the others. This zoning alone reduces the number of paths to 8. The storage administrator can then, thru LUN masking or via two host port groups, further reduce the number of paths to 4.

Spreading the IOs evenly across the ports and paths

If all LUNs use all ports, and we use some method to load balance the IOs via the multipath IO driver, then IOs will be balanced across the ports. But here we expect to need to reduce the number of paths, so not all server ports will see all storage ports, and not all LUNs will necessarily use all ports for IO. Thus we need some way to reasonably balance IOs across the ports and paths. As stated earlier, the basic approach taken here is to group the sets of host and storage ports into a number of subsets, and then, using SAN zoning and/or LUN masking, connect each subset of host ports to a subset of storage ports to keep the number of paths to a reasonable level; the full set of LUNs for a server is then also split into the same number of subsets, and the LUNs in one subset are assigned to use one subset of host and storage ports. So if IOs are balanced evenly across LUNs, then evenly balancing LUNs across the subsets of host and storage ports will achieve balance across paths and ports. For many disk subsystems and applications, the best practice data layout (including the LVM setup) achieves balance across LUNs, and in that case we just evenly split the LUNs across the subsets of host/storage ports. For applications whose layout doesn't evenly balance IOs across LUNs, one approach is to split LUNs from each RAID array across the paths (which assumes the data layout achieves balanced IOs across the RAID arrays). Finally, for disk subsystems such as the XIV, or the SVC with striped VDisks (where IOs are balanced across the back end disks but not across LUNs), the only way to balance the IOs is to look at the IO rates for each LUN and split the LUNs into separate groups such that the total IO rate for each group is approximately the same. This can be accomplished as shown in Figure 9:

Figure 9

In this case, each LUN has 2 paths across separate fabrics. Alternatively, if all LUNs use all ports, but zoning is set up so that the host and storage ports of like color only do IO with each other, then each LUN will have 8 paths. This is an approach that can be used for many configurations.
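As a small illustration of splitting LUNs into groups with roughly equal total IO rates, the sketch below shows one possible approach (not from the original article): it reads hypothetical "hdisk IOPS" pairs from a file, sorts them by IO rate, and greedily assigns each LUN to whichever of two groups currently has the lower total. The file name and IO rates are made up; per-LUN IO rates could come from tools such as iostat.

# lun_iops.txt contains one "hdisk iops" pair per line, e.g.:
#   hdisk4 1200
#   hdisk5 800
#   hdisk6 750
#   hdisk7 300
sort -k2,2nr lun_iops.txt | awk '
{
    # assign this LUN to the group with the smaller running IO rate total
    g = (total[1] <= total[2]) ? 1 : 2
    total[g] += $2
    print $1, "-> port group", g
}
END { print "group 1 total:", total[1], "  group 2 total:", total[2] }'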
