BROCADE
MAINFRAME
CONNECTIVITY
SOLUTIONS
Second Edition
STEVE GUENDERT
04.20.18
Important Notice
Use of this book constitutes consent to the following conditions. This book is supplied “AS IS” for
informational purposes only, without warranty of any kind, expressed or implied, concerning any equipment,
equipment feature, or service offered or to be offered by Broadcom. Broadcom reserves the right to
make changes to this book at any time, without notice, and assumes no responsibility for its use. This
informational document describes features that may not be currently available. Contact a Broadcom sales
office for information on feature and product availability. Export of technical data contained in this book may
require an export license from the United States government.
Acknowledgements
First of all, and most importantly, thank you to all Brocade customers for the trust and faith you place in us
and our equipment. If not for you, this book would not have been written.
Brocade is blessed to have some of the most talented mainframe people in the industry, and they deserve
recognition: Howard Johnson, Jack Consoli, Jon Fonseca, Bill Bolton, John Turner, Mark Detrick, Brian
Larsen, Ben Hart, Andy Dooley, Paul Wolaver, Don Martinelli, Stu Heath, Jeff Jordan, Tim Jobe, John Bowlby,
Lee Steichen, Dirk Hampel, Emerson Blue, Michel Pilato, Matt Brennan, Tim Jeka, Tim Werts, and Brian
Steffler.
Thanks also to Rachel Long for her outstanding work in editing and layout, and to Michelle Lemiuex-Dimas
at Brocade for managing the project.
I’ve had the good fortune of having some outstanding managers during my time at CNT, McDATA, and
Brocade who have all been very supportive. Thanks to Randy Morris, Sean Seitz, Jonathan Jeans, Jon
Verdonik, Kent Hanson (RIP), Ron Totah, Jim Jones, Ulrich Plechschmidt, John Noh, Duke Butler, David
Schmeichel, Jack Rondoni, Martin Skagen, and Anthony Pereis.
To Dennis Ng, Scott Drummond, Harv Emery, Ray Newsome, Harry Yudenfriend, Connie Beuselinck, Patty
Driever, Iain Neville, and Brian Sherman at IBM: a big thanks for working with me and putting up with me
and my questions for many years!
Thank you to my friends and mentors Pat Artis and Gilbert Houtekamer. Dr. Pat has been a wonderful
mentor and influence on me and my career, and convinced me to complete the Ph.D.
And finally, thanks to my wonderful wife Melinda, and our now-grown-up children Steve Jr. and Sarah. You
put up with my travel, long hours, and constant reading, not to mention actually riding in a vehicle with the
license plate “FICON”.
Dedication
There is one person I have yet to acknowledge.
The second edition of this book is dedicated to my good friend and longtime McDATA/Brocade colleague
Dave Lytle. Dave was a tireless advocate for our customers for many years, and was an outstanding
colleague to all of us at Brocade. He taught all of us many valuable things in his nearly 50 years of working
in the mainframe industry, was always there to offer assistance, and was always cheerful in everything he
did. Dave retired in summer 2017, and I truly believe our company and the industry will miss him. I
hope you are enjoying retirement, my friend!
Steve Guendert
“Dr. Steve”
April 2018
Contents

Acknowledgements  iii
Dedication  iii
Introduction  vii
Introduction  2
System/360  2
SYSTEM/370 XA (MVS/XA)  5
ESCON Directors  8
Fiber or Fibre?  9
References  18
References  31
Introduction  38
Chapter Summary  74
References  75
Introduction  77
Conclusion  88
Introduction  90
FICON/FMS/CUP  100
Introduction  107
Addressing  107
Autonegotiation  108
Cabling  110
Resource Management Facility (RMF), Systems Automation (SA), and Control Unit Port (CUP)  120
Conclusion  125
Introduction  127
References  146
Remote Data Replication for DASD  147
Introduction  148
Conclusion  173
References  174
Remote Data Replication for Tape  175
Introduction  176
Conclusion  192
References  193
Introduction  195
Conclusion  210
References  211
Introduction  213
Conclusion  238
References  239
IP Extension  240
Introduction  241
Summary  260
References  261
Introduction  263
Conclusion  293
References  293
Introduction  295
References  323
Brocade FICON CUP Diagnostics and the IBM Health Checker for z/OS  324
Introduction  325
CUP Diagnostics and Brocade FOS 7.1 and FOS 7.2  327
References  339
Introduction  341
Conclusion  359
References  359
Introduction  361
Using CLI Commands to Monitor Average Frame Size and Buffer Credits  367
Calculating the Number of Buffer Credits That an ISL Will Require  369
Forward Error Correction (FEC) for 16 Gbps Gen 5 and 32 Gbps Gen 6 Links  377
Summary  389
References  390
Introduction  392
Overview  392
Gen 6 Fibre Channel Technology Benefits for IBM Z and Flash Storage  399
At the time of this writing, Broadcom Limited has completed its acquisition of Brocade. Broadcom acquired
Brocade for its Storage Area Networking (SAN) business and technology, and divested the IP business units
that were part of Brocade. The first edition of this book covered the Brocade IP technology that was part
of the mainframe ecosystem. The second edition of this book does not contain any IP-focused material,
and instead offers an expanded and updated publication dedicated exclusively to mainframe storage
networking.
Brocade is very proud to have been a part of the IBM Mainframe ecosystem for the past thirty years.
Brocade’s “family tree” is closely tied to the IBM Mainframe through the companies Brocade has acquired
over the past three decades: parallel channel extension with companies such as Data Switch and General
Signal Networks, ESCON channel extension and switching with Inrange and CNT, and ESCON and FICON
directors and extension with McDATA. The modern Brocade is the world leader in network connectivity
solutions for the modern IBM Mainframe, also known as IBM Z.
This book takes a brief look back at some of the history to set the stage for the more detailed discussion of
modern technology. The book is divided into four parts. Part One focuses on the history and basics
of mainframe storage and mainframe I/O. Part Two is a detailed discussion of all aspects of FICON SAN.
Part Three looks at business continuity and disaster recovery solutions: Fibre Channel over IP (FCIP), IP
Extension, remote replication for disk and tape, and geographically dispersed solutions such as IBM’s GDPS
and EMC’s GDDR. Part Four offers a more in-depth look at various Brocade mainframe SAN technologies in
chapters dedicated to each specific technology.
PART 1 CHAPTER 1: MVS AND z/OS SERVERS: FIBRE CHANNEL PIONEERS
Introduction
It may seem strange that “Big Iron” has a long and innovative history in the development of Fibre Channel
(FC), but it is nonetheless true. Although Enterprise Systems Connection (ESCON) was initially deployed
in 1990 with ES/3090 mainframe processors, it was the 1991 introduction of ESCON, along with a
mainframe computer called the ES/9000, that created a great demand for Fibre Channel as a storage
connection medium. When contrasting the early mainframe ESCON deployments with the first Fibre Channel
switch for open systems connectivity in 1997 (the Brocade® SilkWorm 1000), you can see just how
innovative ESCON was during the early days of Fibre Channel products.
The first and most fundamental concept the System/360 introduced was system architecture. Prior
to the IBM System/360, each new computer introduced to the market presented a different instruction
set and programming interface. These changes often required end users to extensively redesign or rewrite
programs each time they acquired a new computer. The System/360 architecture defined a common
set of programming interfaces for a broad set of machines. This allowed software (SW) to be completely
independent of a specific computer system (Artis & Houtekamer, 1993, 2006). This still holds true more
than fifty years after the System/360 was introduced: mainframe installations today often run programs that
were initially coded in the late 1960s.
The second fundamental concept the IBM System/360 introduced was device independence. Prior to
the System/360, each application program was required to directly support each device it managed. The
System/360 specified a logical boundary between the end-user program and the device support facilities
provided by the computer’s operating system. This innovation allowed for easy introduction of new device
technologies into the architecture (Guendert, 2007).
Third, and possibly most important, was the concept of channels. Prior computing architectures required
the central processor itself to manage the actual interfaces with attached devices. The channel concept
changed this. A channel is a specialized microprocessor that manages the interface between the processor
and devices. The channel provides a boundary between the central processor’s request for an I/O and the
physical implementation of the process that actually performs the I/O operation. The channel executes its
functions under the control of a channel program and connects the processor complex with the input and
output control units (Guendert & Lytle, 2007).
Over the past 53 years, the fundamental philosophy of I/O control has transitioned from a strategy of centralized
control to one of distributed operation, which incorporates more intelligent devices. Simultaneously, the
evolution of the System/360 architecture into the modern IBM Z Systems significantly increased the power
of the channel subsystem and transferred more of the responsibility for managing the I/O operation to the
channel side of the I/O boundary. It is important to understand this evolution.
System/360
Many factors influenced the original System/360 design. However, the critical factor in the design of the IBM
System/360 I/O subsystem was IBM’s experience with the IBM 709 and IBM 7090 processors in the early
1960s. While the IBM 704 and other contemporary processors of the early 1960s had specific processor
operation codes for reading, writing, backspacing, or rewinding a tape, the IBM 7090 relied on a channel to
provide the instructions for executing the processors’ I/O requests (Artis & Houtekamer, 1993, 2006).
The function of a channel is to perform all control and data transfer operations that arise during an I/O
operation. The channel interacts with the CPU, main storage, and control units for devices. The CPU initiates
an I/O operation via an I/O instruction, and the channel functions independently from that point onward.
An I/O operation involves control operations, such as positioning the read/write head of a disk unit on a
specific track, as well as data transfer operations, such as reading a record from a disk (IBM, 1966).
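This division of labor between CPU and channel can be sketched in a toy model. The sketch is purely illustrative (all class and method names below are hypothetical); a real channel is a hardware microprocessor stepping through channel commands, not software:

```python
# Toy model of the channel concept: the CPU issues a single "start I/O"
# request, and the channel works through the channel program on its own,
# leaving the CPU free for other work. Names are illustrative only.

from dataclasses import dataclass

@dataclass
class Command:
    op: str        # e.g. "SEEK", "SEARCH", "READ"
    arg: str = ""

class ControlUnit:
    """Intermediary between channel and device; executes one command."""
    def execute(self, cmd):
        return f"{cmd.op}({cmd.arg}) done"

class Channel:
    """Specialized processor between CPU/main storage and control units."""
    def start_io(self, channel_program, control_unit):
        # The channel, not the CPU, steps through the channel program;
        # completion would later be signaled back via an I/O interrupt.
        return [control_unit.execute(cmd) for cmd in channel_program]

cpu_side_program = [Command("SEEK", "cyl 17"), Command("READ", "record 3")]
channel = Channel()
print(channel.start_io(cpu_side_program, ControlUnit()))
```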
A device is a peripheral unit that transmits, stores, or receives data and connects to a channel via a control
unit. Some examples of devices are disks, terminals, tape drives, printers, and similar peripheral equipment
used in data storage and transfer. Typically, several devices connect to a control unit, which in turn connects
to a channel. A control unit functions as an intermediary between device and channel. Usually, several
devices connect to a control unit, and several control units connect to a channel. The interactions between
channel and control unit and between control unit and device vary greatly, according to the type of device.
The System/360 architecture defined two types of channels: the selector and byte multiplexer channels
(Guendert, 2007). While contemporary IBM z Systems still include byte multiplexer channels, selector
channels had fundamental performance limitations. A selector channel performed only one I/O operation
at a time and was dedicated to the device for the duration of the I/O. The System/360 architecture used
selector channels for high-speed data transfer from devices such as tapes and disk drives. Several devices
can connect to a selector channel. However, the selector channel dedicates itself to a device from start
to end of the channel program. This is true even during the execution of channel commands that do not
result in data transfer. While this design was adequate for the relatively small processors comprising the
System/360 series, it provided a fundamental limit on achievable I/O throughput (Prasad, 1989). The block
multiplexer channel introduced in the System/370 architecture replaced the selector channel.
Byte multiplexer channels connect low-speed devices such as printers, card punches, and card readers.
Unlike the selector channel, a byte multiplexer channel does not dedicate itself to a single device while
performing an I/O operation (Artis & Houtekamer, 1993, 2006). The byte multiplexer channel adequately
supported multiple low-speed devices at one time by interleaving bytes from different devices during data
transfer.
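The contrast between the two channel types can be illustrated with a small sketch (device names and data are invented): a byte multiplexer merges bytes round-robin across its devices, while a selector finishes one device entirely before touching the next:

```python
# Illustrative contrast between channel types. A byte multiplexer channel
# interleaves one byte at a time from each low-speed device; a selector
# channel stays dedicated to one device for the whole operation.

from itertools import zip_longest

def byte_multiplex(streams):
    """Interleave bytes round-robin from each device stream."""
    out = []
    for group in zip_longest(*streams.values()):
        out.extend(b for b in group if b is not None)
    return out

def selector(streams):
    """Transfer each device's data in full, one device after another."""
    out = []
    for data in streams.values():
        out.extend(data)
    return out

devices = {"printer": list("PPP"), "reader": list("RR")}
print("".join(byte_multiplex(devices)))  # PRPRP
print("".join(selector(devices)))        # PPPRR
```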
The I/O generation process describes the topology of the System/360 I/O configuration to the System/360
operating system. This system generation process was commonly (and still is) referred to as an I/O Gen by
the systems programmers responsible for it (Artis & Houtekamer, 1993, 2006). Utilizing the information
provided in the I/O Gen, the System/360 operating system selected the path for each I/O. It also selected
the alternate path, in case the primary path was busy servicing another device or control unit.
In addition to path selection, the System/360 operating system was responsible for some additional tasks
in the I/O process. One of these tasks was supervising I/O operation during its execution and dealing with
any exceptional conditions through I/O interrupt handling. A restriction of the System/360 was that the
I/O interrupt that notified the processor of the completion of an I/O had to return over the same channel
path used by the initial start I/O to reach the control unit and devices. With the increasing complexity of I/O
configurations, and more shared components, the probability of some part of the originating path being busy
when the I/O attempted to complete substantially increased (Guendert, 2007). This was one of several issues
that led to the development of the System/370 architecture.
System/370
The System/370 architecture was a natural extension of the System/360 architecture. The System/370
design addressed many of the architectural limitations encountered with end-user implementations of larger
System/360 class systems. The architecture of the System/370 is evolutionary in the sense that it was
based on IBM’s decade of experience with the System/360 architecture. IBM’s experience demonstrated
that the storage architecture of the System/360 needed significant changes in order to function in an
on-line multiprogramming environment. Its architects viewed the System/370 architecture as an extension
of the System/360 architecture, necessitated by technological growth and operating system demands
(Case & Padegs, 1978). The System/370 architecture introduced the concepts of virtual storage and
multiprocessing. The redesigned I/O subsystem enhanced parallelism and throughput; IBM further
enhanced the I/O addressing scheme to support virtual storage, while the architecture was extended to
encompass the multiprocessing capability (Lorin, 1971).
The System/370 included the IBM 370/xxx, the 303x, and the initial 308x series of processors.
It also supported several major operating systems, including the first versions of IBM’s Multiple Virtual
Storage (MVS) operating system. MVS, in the form of z/OS, is still the primary IBM mainframe operating
system in use around the world. The System/370 architecture defined a new type of channel, the block
multiplexer channel. Block multiplexer channels connect high-speed devices. The block multiplexer channel
is part of a complex processor channel subsystem that can keep track of more than one device that is
transferring blocks and can interleave blocks on the channel to different devices (Johnson, 1989). Similar
to the System/360 selector channels, block multiplexer channels have a high data transfer rate. Unlike
selector channels, block multiplexer channels are capable of disconnecting from a device during portions
of an operation when no data transfer is occurring (for example, when the device is performing a control
operation to position the read/write head). During such periods, a block multiplexer channel is free to
execute another channel program pertaining to a different device that is connected to it. During actual data
transfer, the device stays connected to the channel until data transfer is complete (Prasad, 1989). Block
multiplexer channels therefore could perform overlapping I/O operations to multiple high-speed devices.
The block multiplexer channel is dedicated to the device for the duration of data transfer, and it does not
interleave bytes from several devices, like a byte multiplexer channel (Artis & Houtekamer, 1993, 2006).
Block multiplexer channels are ideally suited for use with Direct Access Storage Devices (DASD) with
Rotational Position Sensing (RPS). RPS is the ability of a DASD device to disconnect from the channel while
certain Channel Command Words (CCWs) in a channel program execute. IBM originally introduced RPS
with the 3330’s Set Sector command. The 3330 released the channel that had started the I/O request until the disk
reached the sector/angular position specified in the CCW (Johnson, 1989). The device then attempted to
reconnect to the channel, so it could execute the next CCW (which was usually a search) in the channel
program under execution. In simpler terms, the channel disconnected from the device during the rotational
delay that was necessary before an addressed sector appeared under the read/write head. During this time
it could service another device. When the addressed sector on the first device approached, it reconnected
with the channel, and a data transfer took place. If the channel was busy with the second device, the first
device would try to reconnect after another rotation of the platter (disk) (Prasad, 1989).
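The cost of a busy channel at reconnect time can be estimated with a simplified model. Assuming, as a rough approximation, that each reconnect attempt independently finds the channel busy with probability u (taken here to be the channel utilization), the number of extra full rotations is geometric, and the expected extra delay works out to T_rot × u/(1 − u):

```python
# Simplified RPS-miss model: when the addressed sector comes around, the
# device tries to reconnect; if the channel is busy (probability u, assumed
# independent per attempt), the device waits one full extra rotation and
# tries again. Expected extra rotations form a geometric series: u/(1-u).

def expected_rps_delay_ms(rotation_ms, u):
    """Expected extra rotational delay in ms for channel utilization u."""
    return rotation_ms * u / (1.0 - u)

rot = 16.7  # one revolution of a 3600 RPM disk, in milliseconds
for u in (0.1, 0.3, 0.5):
    print(f"channel utilization {u:.0%}: "
          f"~{expected_rps_delay_ms(rot, u):.1f} ms extra delay")
```

The model makes vivid why block multiplexer channel utilization mattered so much to DASD response time: at 50% channel utilization, RPS misses alone add a full average rotation to each I/O.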
To facilitate the parallelism brought about by the block multiplexer channel, IBM logically divided the channel
itself into a number of subchannels. A subchannel is the memory within a channel for recording byte count,
addresses, and status and control information for each unique I/O operation that the channel is executing
at any given time (Artis & Houtekamer, 1993, 2006). A single device can exclusively use each subchannel,
or multiple devices can share the subchannel. However, only one of the devices on a shared subchannel
can transfer data at any one time. A selector channel has only one subchannel, since it can store only data
elements and execute operations for one I/O at a time, and thus it has to remain busy for the entire duration
of the I/O operation. Both byte multiplexer and block multiplexer channels can have several subchannels,
because they can execute several I/O operations concurrently.
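A subchannel can thus be pictured as a small record of per-operation state inside the channel. The sketch below (class and field names are illustrative, not IBM's) shows why having several subchannels lets a block multiplexer channel keep multiple I/O operations in flight at once:

```python
# A subchannel is per-operation bookkeeping: byte count, storage address,
# and status for one in-flight I/O. A selector channel has one of these;
# a block multiplexer channel has several, allowing concurrent operations.

from dataclasses import dataclass

@dataclass
class Subchannel:
    device: str
    byte_count: int = 0
    address: int = 0
    status: str = "idle"

class BlockMultiplexerChannel:
    def __init__(self, n_subchannels):
        self.subchannels = [Subchannel(device="") for _ in range(n_subchannels)]

    def start(self, device, address):
        # Claim a free subchannel to track this I/O operation.
        for sc in self.subchannels:
            if sc.status == "idle":
                sc.device, sc.address, sc.status = device, address, "active"
                return sc
        raise RuntimeError("all subchannels busy")

ch = BlockMultiplexerChannel(n_subchannels=2)
a = ch.start("DASD-180", 0x1000)   # two I/Os can be in flight at once
b = ch.start("DASD-181", 0x2000)
print(a.status, b.status)
```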
While selector channels were still present in the System/370 architecture, they were included simply for
the purpose of providing transitional support for existing devices incapable of attachment to the newer
block multiplexer channels. The System/370 architecture completely retained the byte multiplexer channels
for low-speed devices. The block multiplexer channels on the System/370 resulted in the realization of
a substantial increase in I/O throughput for a given level of channel utilization. The utilization of a block
multiplexer channel was less than one-sixth that of a System/360 selector channel for the same level of I/O
traffic (Artis & Houtekamer, 1993, 2006).
The System/370 architecture was not flawless, and it, too, eventually ran into problems and limitations,
as had the System/360. Chief among these were device-addressing problems in multiprocessor (MP)
environments. In an MP configuration, channels owned by one processor could not easily be accessed
by the other. As a result, the I/O configurations on both sides of a multiprocessor had to be mirrored so
that each processor could issue I/O requests using its own channels. This meant complex and redundant
cabling, and it also consumed extra channel attach points on the configuration’s shared storage control
units. The cabling requirements of multiprocessor systems in turn limited the overall size of large
multisystem configurations. These problems subsequently led to the System/370 Extended Architecture
(System/370 XA) (Guendert, 2007).
SYSTEM/370 XA (MVS/XA)
The System/370 Extended Architecture (XA) and the MVS/XA operating system introduced the ability to
address two gigabytes (via 31-bit addressing) of virtual and central storage, and improved I/O processing. It
improved the Reliability, Availability, and Serviceability (RAS) of the operating system (Johnson, 1989). These
changes evolved from IBM’s experience with the System/370 architecture.
With the System/370 XA, IBM addressed several major end-user issues with the System/370 architecture, including
the extension of real and virtual storage and the expansion of the system address space from 16
megabytes (MB) to 2 gigabytes (GB). IBM also relocated the device control blocks from low storage, added substantial new
hardware elements to facilitate the management of multiple processes, and streamlined multiprocessing
operations. Finally, I/O performance was improved and better operating efficiency realized on I/O operations
with the introduction of the channel subsystem (Cormier, 1983).
The I/O architecture of the 370 XA differed from the System/370 in several respects. First, while the
System/370 supported 4K devices, the 370 XA supported up to 64K. Second, an I/O processor, called
the channel subsystem, performed many of the functions previously performed by the CPU (that is,
by the operating system) in the System/360 and System/370. Third, and most substantial, all of the
channels were assigned to a new element called the External Data Controller (EXDC). Unlike the
System/370 architecture, where subchannels could be dedicated or shared, every device in a System/370
XA configuration was assigned its own subchannel by the EXDC, giving a one-to-one correspondence
between device and subchannel. Fourth, rather than each processor owning a set of channels that it used
almost exclusively, the processors shared all of the channels. This eliminated the redundant channel
assignment and cabling problems of MP processors.
At the time, many of these changes were radical new developments. A set of rules for optimization of CPU
time in I/O processing, operational ease, reduction of delays, and maximization of data transfer rates led to
these innovations. The rules resulted from many years of observation of the System/360 and System/370
I/O architectures in end-user environments. The System/370 XA architecture addressed the I/O throughput
problems that had developed with a variety of hardware and software innovations. With the System/370
XA and MVS/XA, IBM enhanced the channel subsystem and operating system to support the dynamic
reconnection of channel programs. Previously, the System/370 architecture required each I/O completion
to return to the processor via the same path on which it originated. As configurations grew larger and more
complex, the probability that all of the path elements used to originate an I/O would be free at the
time of its completion dropped significantly. This resulted in poor I/O performance: if all possible paths are
busy, dynamic reconnection fails.
New disk technologies also evolved to exploit the dynamic reconnection feature (Artis & Houtekamer,
1993, 2006). Examples of such technologies were Device Level Selection (DLS) and Device Level Selection
Enhanced (DLSE).
The IBM 3380 disk drive was an example of a device with dynamic reconnection capability. The 3380 could
connect to a channel other than the channel that initiated the I/O operation when it was ready to proceed
with data transfer. A block multiplexer channel works with a disk drive that has rotational position-sensing
(RPS) capability: The channel disconnects from the device once it has issued a search ID equal command
prior to the read command. When the addressed sector is approaching the read/write head, the device
attempts to reconnect with the channel. If the channel is not ready, it attempts to reconnect at the end of
the next revolution (hence a delay). With the capability for dynamic reconnect, the disk can reconnect to any
available channel defined as a path, thus minimizing delays in the completion of I/O (Cormier, 1983).
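To see why dynamic reconnection mattered, a back-of-the-envelope model helps. All numbers below (3600 RPM, 30 percent path utilization, 17 ms rotation) are assumed for illustration and do not come from the text; the sketch only shows the shape of the effect.

```python
# Back-of-the-envelope model of the RPS-miss penalty that dynamic
# reconnection reduced. All numbers are assumed for illustration: a
# 3380-era drive at 3600 RPM loses one full revolution every time a
# reconnect attempt finds its channel path busy.

ROTATION_MS = 60_000 / 3600    # one revolution at 3600 RPM (~16.7 ms)

def expected_rps_delay_ms(path_busy_prob, n_paths):
    """Expected extra delay per I/O from RPS misses.

    Without dynamic reconnect the device may only use the originating
    channel (n_paths = 1); with it, any free defined path will do, so a
    miss requires every candidate path to be busy at once. Each miss
    costs a full revolution, and the number of misses is geometric.
    """
    p_miss = path_busy_prob ** n_paths
    return ROTATION_MS * p_miss / (1 - p_miss)

# 30 percent path utilization: one fixed path versus four shared paths.
print(round(expected_rps_delay_ms(0.30, 1), 2))   # 7.14 ms per I/O
print(round(expected_rps_delay_ms(0.30, 4), 3))   # 0.136 ms per I/O
```

Under these assumed figures, sharing four paths cuts the expected RPS-miss delay by roughly fifty-fold, which is the behavior the 3380 exploited.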
Under the System/370 architecture, the CPU (operating system) was responsible for channel path
management, which consisted of testing the availability of channel paths associated with the device.
Obviously, this involved substantial overhead for the CPU. Also, under the System/370 architecture, the CPU
was interrupted at the end of various phases of execution of the I/O operation. In contrast, IBM enhanced
the System/370 XA operating system (MVS/XA) to allow any processor to service an I/O interrupt, rather
than the I/O being serviced only by the issuing processor. Although the channel passed an interrupt to notify
the processor of an I/O completion, the EXDC handled most of the interactions with the devices and other
control functions (Johnson, 1989).
The MVS/XA operating system no longer required knowledge of the configuration topology for path
selections. Thus, the majority of information in the I/O Gen in prior architectures, such as System/360
and System/370, was migrated to a new procedure in System/370 XA. This new procedure was the I/O
Configuration Program generation (IOCP Gen). The IOCP Gen produced a data set processed by the EXDC at
initialization time to provide the EXDC with knowledge of the configuration. Since the EXDC now selected the
path and dealt with intermediate results and/or interrupts, the processor’s view of I/O was far more logical
than physical. To initiate an I/O under System/370 XA, the processor merely executed a Start Subchannel
(SSCH) command to pass the I/O to the EXDC (Artis & Houtekamer, 1993, 2006). The EXDC was then
responsible for device path selection and I/O completion.
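The one-device-one-subchannel model and the SSCH hand-off described above can be sketched in a few lines. The class and method names below are invented for illustration and are not IBM interfaces.

```python
# Toy model (invented names, not IBM code) of the System/370 XA I/O model:
# the channel subsystem (EXDC) owns one subchannel per device, channels are
# shared by all processors, and SSCH merely hands the request over.

class ChannelSubsystem:
    def __init__(self, channel_paths):
        self.channel_paths = set(channel_paths)  # shared by every processor
        self.subchannels = {}                    # device number -> subchannel

    def define_device(self, devno, paths):
        # One-to-one correspondence: each device gets exactly one subchannel.
        self.subchannels[devno] = {
            "paths": [p for p in paths if p in self.channel_paths],
            "busy": False,
        }

    def start_subchannel(self, devno):
        # Analogous to SSCH: the processor passes the I/O here, and path
        # selection is the channel subsystem's job, not the CPU's.
        sub = self.subchannels[devno]
        if sub["busy"] or not sub["paths"]:
            return None            # a condition code would signal "busy"
        sub["busy"] = True
        return sub["paths"][0]

css = ChannelSubsystem(channel_paths=["10", "11", "20", "21"])
css.define_device(0x0150, paths=["10", "20"])
print(css.start_subchannel(0x0150))   # a path chosen by the subsystem
print(css.start_subchannel(0x0150))   # None: subchannel already busy
```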
While the System/370 XA architecture addressed most of the I/O subsystem performance issues with
the System/360 and System/370, the complexity and size of the I/O configuration continued to grow.
Some issues caused by this growth included cable length restrictions, the bandwidth capacity of the I/O
subsystem, and the continued requirement for non-disruptive installation, to facilitate the end user’s
response to the increasing availability demands of critical online systems (Guendert, 2007).
Parallel Channels
Parallel (bus and tag) channels are also known as copper channels, since they consist of two cables with
multiple copper coaxial carriers. These copper cables connect to both the channel (mainframe) and I/O
devices using 132-pin connectors. The cables themselves weigh approximately one pound per linear foot
(1.5 kg/meter), and due to their size and bulk they require a considerable amount of under-raised-floor
space. The term “bus and tag” derives from the fact that one of the two cables is the bus cable, which
contains the signal lines for transporting the data. Separate lines (copper wires) are provided for eight
inbound signals and eight outbound signals (1 byte in, 1 byte out). The second cable is the tag cable, which
controls the data traffic on the bus cable (Guendert & Lytle, 2007).
Despite all of the historical improvements in the implementation of parallel channels, three fundamental
problems remained (Guendert, 2007):
1. The maximum data rate could not exceed 6 MB/sec without a major interface and cabling redesign.
2. Parallel cables were extremely heavy and bulky, requiring a large amount of space below the computer
room floor.
3. The maximum cable length of 400 ft. (122 m) presented physical constraints to large installations.
Although channel extenders allowed for longer connections, due to protocol conversion they typically
lowered the maximum data rate below 6 MB/sec. Also, at distances greater than 400 ft., the signal skew
of parallel cables became intolerable (Artis & Houtekamer, 1993, 2006).
These three fundamental issues were the driving force behind the introduction of fiber optic serial
channels in the System/390 architecture. These special channels are formally called the ESA/390
Enterprise Systems Connection (ESCON) architecture I/O interface by IBM (IBM, 1990). IBM decided that
serial transmission cables were to be the next generation of interconnection links for mainframe channel
connectivity. Serial transmission cables could provide much improved data throughput (12 to 20 MB/sec
sustained data rates), in addition to other benefits, such as a significant reduction in cabling bulk. The
ESCON architecture was based on a 200 Mbit/sec (20 MB/sec) fiber optic channel technology that
addressed both the cable length and bandwidth restrictions that hampered large mainframe installations.
As a pioneering technology, ESCON was architected differently than today’s modern FICON® (FIber
CONnection) or open systems Fibre Channel Protocol (FCP). With ESCON, the fiber optic link consisted of
a pair of single fibers, each supporting unidirectional transmission, one in each direction. The two fibers
comprising an optical link were packaged in
a single multimode, 62.5 micron cable, using newly developed connectors on each end. This functionally
replaced two parallel cables, each approximately one inch in diameter, which significantly reduced both
weight and physical bulk.
All the ESCON system elements were provided with the newly developed ESCON connectors or sockets,
which were very different from the Small Form-factor Pluggable (SFP) transceivers used today for link cable attachment
to the mainframe, storage, or director. For the very first time, I/O link cables could be connected and
disconnected while mainframe and storage elements remained in an operational state. This innovation
was the first implementation of non-disruptive “hot pluggability” of I/O and storage components. Other
advantages afforded by fiber optic technology included improved security—because the cable did not
radiate signals—and insensitivity to external electrical noise, which provided enhanced data integrity.
Another “first” for ESCON was its innovative use of a point-to-point switched topology in 1991. This point-
to-point switched topology resulted in the creation of a “storage network,” the world’s first deployment of
Storage Area Networking (SAN). With a point-to-point switched topology, all elements were connected by
point-to-point links, in a star fashion, to a central switching element (an ESCON Director). In contrast, the
older IBM parallel data streaming I/O connectivity used a multidrop topology of “daisy-chained” cables that
had no switching capability (Guendert & Lytle, 2007).
ESCON Directors
At the beginning of the ESCON phase, IBM developed its own central dynamic crossbar switching
technology, called an ESCON Director. These first storage switching devices, announced in late 1990 and
enhanced over the years, provided many benefits:
• ESCON I/O channel ports that could dynamically connect to 15 storage control units, as compared to a
maximum of eight static storage control units with the older data streaming protocol
• Allowing devices on a control unit to be shared by any system’s ESCON I/O channels that were also
attached to that director
• Customizing the any-to-any connectivity characteristics of the ESCON Director at an installation—by means
of an operator interface provided by a standard PS/2 workstation or by means of an in-band connection
to a host running the ESCON Manager application program
• The use of high-availability (HA) redundant directors to create an HA storage network (five-nines) by
providing alternate pathing with automatic failover
• Common pathing for Channel-to-Channel (CTC) I/O operations between mainframe systems
• Non-disruptive addition or removal of connections while the installed equipment continued to be
operational
• An internal port (the Control Unit Port [CUP]) that could connect to a host channel via internal ESCON
Director switching, and a physical director port genned for that connectivity. When the mainframe and
CUP were connected in this manner, the host channel communicated with the ESCON Director just as it
would with any other control unit.
• An ability to “fence” and/or block/unblock ports to isolate problems, perform maintenance, and control
the flow of traffic across the director ports
[Figure fig01_Mainframe_P1C1: The 9032-5 ESCON Director with ESCON Duplex MM connectors]
From the beginning, mainframes led the way in making Fibre Channel media a commonplace connectivity
solution in the enterprise data center. The enterprise mainframe products from both McDATA and
InRange became widely known and respected for their high-availability characteristics. Today, many of the
characteristics that are standard for both z/OS servers and open systems servers were initially developed
specifically to meet the demanding requirements of the MVS and z/OS enterprise computing platforms. The
strict requirements of providing a high-availability, 7×24 enterprise computing data center drove many of the
technical initiatives to which McDATA responded with vision and innovation.
Interestingly, among the many accomplishments in the mainframe arena, it was McDATA and not IBM
that engineered and produced the first high-availability ESCON Director. IBM OEM’d the ESCON Director
(produced it using parts from the Original Equipment Manufacturer [OEM], which was McDATA) and then
deployed it as their 9032 model 3 (28-124 ports), announced in September 1994. Previously, IBM had
produced their two original, non-HA versions of the ESCON Director, the 9032 model 2 (28-60 ports) and
9033 model 1 (8-16 ports). After the release of the 9032-3 ESCON Director, McDATA then went on to
engineer, on behalf of IBM, two additional HA versions of this product line: the 9033 model 4 (4-16 ports)
and the 9032-5 model (24-248 ports). InRange also directly sold their high-availability CD/9000 (256 port)
ESCON Director to customers. Between 2003 and 2007, numerous mergers and acquisitions occurred
in the storage networking industry. In 2003 Computer Network Technology (CNT) acquired InRange.
In 2005 McDATA acquired CNT, and in 2007 Brocade acquired McDATA. In essence, Brocade, as it is
currently organized, has provided the world with 100 percent of the switched ESCON capability in use today
(Guendert & Lytle, 2007).
There were several features that allowed the ESCON Director 9032 model 3 and model 5 to be highly
available. High on the list of those HA features was the ability to non-disruptively activate new microcode.
McDATA first developed this capability in 1990 when it engineered the 9037 model 1 Sysplex Timer for IBM.
The Sysplex Timer was a mandatory hardware requirement for a Parallel Sysplex, consisting of more than
one mainframe server. It provided the synchronization for the Time-of-Day (TOD) clocks of multiple servers,
and thereby allowed events started by different servers to be properly sequenced in time. If the Sysplex
Timer went down, then all of the servers in the Parallel Sysplex were at risk for going into a hard wait state.
So, the non-disruptive firmware code load capability that McDATA built for the Sysplex Timer was then
available to McDATA when it created its ESCON (and later FICON) Directors.
Over time, ESCON deployments grew rapidly, as did optical fiber deployments. As discussed above, ESCON
was by necessity developed as a proprietary protocol by IBM, as there were no Fibre Channel (FC) standards
to fall back on at that time. Very little FC storage connectivity development was underway until IBM took
on that challenge and created ESCON. So as ESCON was pushed towards its limits, it began to show
its weaknesses and disadvantages, in particular, performance limitations, addressing limitations, and
scalability limitations. And while IBM mainframe customers deployed ESCON in data centers worldwide,
Fibre Channel matured as standards and best practices were developed and published (Guendert, 2007).
Fiber or Fibre?
The question of whether to use the spelling “Fibre Channel” or “Fiber Channel” occurs frequently. There are
two ways to explain this difference in usage. The typical British spelling is “fibre” (based on French—that is,
Latin—word origins), while the typical American spelling of “fiber” has Germanic origins. The two different
spellings, then, reflect the difference between Germanic and Latin-based words. However, the usage of
this term in the technology industry was determined by the T11 standards committee, which decided to
use “fiber” for the physical elements (cable, switch, and so forth) and to use “fibre” when discussing the
architecture of a fabric or the protocol itself (for example, Fibre Channel Protocol) (Guendert, 2007).
Brocade was among the first vendors to provide Fibre Channel switching devices. InRange and McDATA
became best known for their director-class switching devices and CNT for its distance extension devices.
However, it is the director-class switch that is at the heart of switch architectures in the MVS and z/OS data
centers. Because of the visionary work done by McDATA with the ESCON Director, and because of their
track record of working closely and successfully together, IBM asked McDATA to help create the follow-on
mainframe protocol to ESCON—the FIber CONnection, or FICON.
IBM wanted FICON to be its new high-performance I/O interface for “Big Iron” that would support the
characteristics of existing and evolving higher-speed access and storage devices. To that end, IBM wanted
the design of its new FC-based protocol to be based on currently existing Fibre Channel standards, so that it
would provide mobility and investment protection for customers in the future. In other words, IBM wanted the
new FICON protocol to use the protocol mapping layer based on the ANSI standard Fibre Channel-Physical and
Signaling Interface (FC-PH). FC-PH specifies the physical signaling, cabling, and transmission speeds for Fibre
Channel. In pursuit of that goal, IBM and McDATA engineers wrote the initial FICON specifications, co-authored
the FICON program interface specifications, and pushed the FICON protocol through the standards committees
until it became an accepted part of the FC standards. During this work, McDATA became the only infrastructure
vendor to hold FICON co-patents with IBM (Guendert & Lytle, 2007). McDATA and InRange then turned their
attention to providing FICON connectivity solutions. This occurred in two parts. Part one was providing a bridge
for ESCON storage users to test and use the new FICON protocol, and part two was creating native FICON
switching devices.
There were several good reasons for creating a FICON bridged environment. First of all, this was a great way
for customers to begin testing FICON without having to upgrade much of their existing ESCON infrastructure.
A FICON bridge card feature allowed for the early deployment of FICON while keeping the I/O configuration
simple, so that users could exploit the greater bandwidth and distance capabilities of a FICON mainframe
channel in conjunction with their existing ESCON storage devices. Another reason for the FICON bridge
feature was that IBM was late to deliver FICON-capable DASD storage products. Only the IBM
Magstar 3590 A60 tape control unit supported FICON, though not until December 2000, over two years
after FICON was introduced on the mainframe.
The history behind FICON’s development on the mainframe and the delay in getting FICON capable storage
from IBM is interesting. The fundamental force driving many of the FICON decisions was that at this time,
IBM was shifting from a hardware company that also sold software and services to a services and software
company that sold hardware. This shift in emphasis, of course, created some roadmap discrepancies
between the various entities within IBM. The IBM mainframe team was well aware of the various limitations
of ESCON and wanted FICON to be offered on storage as soon as possible. However, the IBM storage
team had embraced Fibre Channel in the form of FCP and had prioritized FCP development over FICON
development. Undaunted, the IBM mainframe team needed to provide customers with a migration path
to FICON without causing the customer to upgrade both sides of the ESCON fabric simultaneously. Their
innovative solution was the FICON Bridge feature for the 9032-5, which gave the mainframe team a method
of introducing the FICON channel without causing customers to replace their storage at the same time. Once
the bridge feature was brought to market, the pressure to produce FICON-attachable storage increased.
Even though customer interest in FICON DASD had increased significantly, the IBM storage team had
already prioritized FCP-attached storage and could not immediately begin the FICON-attached storage
development. The hardware pull model was too important to allow the production of FICON-attached
storage devices to be limited, so IBM ultimately provided the FICON specification to competitive storage
vendors. As expected, this spurred on the IBM storage group, which was the first to deliver FICON DASD at
the end of July 2001 (Guendert & Houtekamer, 2005).
From 1998 until 2001, not much native FICON storage was available. This was finally resolved in July 2001,
with the IBM announcement of FICON support for its TotalStorage Enterprise Storage Server (code-named
Shark) as well as for two models of its TotalStorage Virtual Tape Server (VTS). These were the very first
DASD and virtual tape systems that were natively compatible with FICON. After those announcements, other
storage vendors began releasing their own versions of FICON-capable storage. From May 1998 until well
after July 2001, FICON bridging was in wide use in data centers that were ready to start deploying FICON. In
fact, manufacture of the FICON bridge card feature continued until December 31, 2004. This feature was
long-lasting, and it continues to be supported and used in some data centers even today.
Where did the FICON bridging card feature originate? IBM once again approached McDATA when IBM
needed a 9032 ESCON Director internal bridge feature (a card) built in order to translate between FICON
and ESCON protocols. The new FICON bridge feature, engineered and manufactured only by McDATA, would
manage the optoelectronics, framing, and format conversion from the newer 1 Gbps FICON channel paths
(CHPIDs) back to ESCON storage devices (Guendert & Lytle, 2007).
The functions of the bridge involved two components: one residing in the mainframe channel and the
other on the bridge card. IBM developed both of these, which required the IBM developers to split the
necessary code between the mainframe channel and the FICON bridge card. The actual conversion between
ESCON and FICON could then remain the intellectual property of IBM, and McDATA could concentrate its
development effort on Fibre Channel. This was important to the joint effort, because it allowed the two
companies to develop a customer solution without encountering problems with intellectual property rights
exchange.
The developmental effort on the FICON-to-ESCON bridge feature allowed the 9032 ESCON Director to
become the first-ever IBM switch to support FICON. Each of the bridge features could perform the work of
up to eight ESCON channels. Up to 16 FICON bridge features (equivalent to 128 ESCON ports or one half of
the Model 5 capacity) were supported in a single 9032 Model 5 ESCON Director. The FICON bridge feature
handled multiplexing (1 FICON port fed up to 8 ESCON ports) and protocol format conversion. IBM took the
responsibility of performing protocol conversion in the mainframe. The 1 Gbps optical input was steered to
an optoelectronic converter, and from there to a 32-bit, 26-megahertz (MHz) parallel bus. Next followed a
framer, then a Fibre Channel high-level protocol converter, followed by eight IBM-supplied ESCON engines—
the output of which connected to the 9032 ESCON Director’s backplane and thence outbound to ESCON
storage devices. Except for the costs associated with the bridge feature itself, upgrading the ESCON Director
was nearly painless and helped to conserve customer investment in the 9032 Model 5 ESCON Director. This
very first implementation of FICON on the mainframe helped to allow customers to protect their investment
in ESCON control units, ESCON Directors, and other infrastructure, while migrating to the new FICON
channels and storage control units (Guendert, 2007).
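The multiplexing arithmetic behind the bridge feature can be sketched briefly. The 8b/10b payload calculation is standard Fibre Channel; the roughly 17 MB/sec effective ESCON rate is an assumed illustrative figure, not a value from the text.

```python
# Rough capacity arithmetic for the bridge card: one 1 Gbps FICON channel
# feeding up to eight ESCON engines. The 8b/10b conversion is standard
# Fibre Channel; the ~17 MB/sec effective ESCON rate is an assumed figure.

def payload_mb_per_s(line_rate_gbps):
    # 8b/10b encoding spends 10 line bits per payload byte.
    return line_rate_gbps * 1000 / 10

ficon_payload = payload_mb_per_s(1.0)   # 100.0 MB/s on the FICON side
escon_rate = 17                         # assumed effective MB/s per engine
engines = 8

print(ficon_payload)                    # 100.0
print(engines * escon_rate)             # 136: aggregate ESCON peak
# The bridge relies on statistical multiplexing: eight ESCON engines
# rarely transfer at peak simultaneously, so one FICON channel can
# absorb the work of all eight in practice.
```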
The 9032-5 ESCON Director has the distinction of being the first switch able to connect both FICON and
ESCON protocols in a single switching unit chassis—and it was the McDATA-exclusive FICON
bridge card that allowed that connectivity to take place. In December 2000, InRange announced the ability
for Fibre Channel and FICON storage networks to coexist with legacy ESCON storage through a single
SAN director. InRange’s unique storage networking capability was provided through the release of a Fibre
Channel Port Adapter (FCPA) for the CD/9000 ESCON Director. For the first time, ESCON, FC, and FICON
could coexist within the same director-class switch footprint (Guendert, 2005).
[Figure fig03_Mainframe_P1C1: An S/390 server FICON channel bridged to multiple DASD and tape devices with ESCON adapters]
Both InRange and McDATA developed FC-based directors to support the high end of the expanding
and important open systems server market. In the early 2000s, Brocade was developing and shipping
motherboard-based switches into the small- to medium-size customer segment of the open systems server
market. However, InRange and McDATA were the early pioneers in switched-FICON technology.
Because of their many years of experience in the mainframe data center, both InRange and McDATA
knew that only a director-class switch would be welcomed as the backbone in demanding customer
infrastructures. A director-class switch is populated with redundant components, provided with non-
disruptive firmware activation, and engineered to be fault-tolerant to 99.999 percent availability if deployed
in redundant fabrics. These switches typically have a high port count and can serve as a central switch
to other fabrics. These high-end switches and features are also what IBM wanted to certify for its initial
switched-FICON implementations. Therefore, each of these rival vendors engineered, certified, and sold
a series of products to meet the growing needs of this segment of the storage interconnect marketplace
(Guendert & Lytle, 2007).
A FICON director is simply a Fibre Channel director that also supports the FICON protocol and certain FICON
management software. Customers usually demand that the backbone of a FICON infrastructure is based on
director-class switches that can provide five-nines of availability (Guendert, 2005).
[Figure fig04_Mainframe_P1C1: FICON Directors, FICON fabrics, and FICON cascading. z-Series and z9 servers connect through FICON Directors to DASD, tape, and tape libraries at Site 1 and Site 2, forming a five-nines availability infrastructure.]
Beyond what is available for open system SAN, the most significant enhancements to a FICON switching device
are the FICON Management Server—an ability to enable Control Unit Port (CUP) and have management views
with hexadecimal port numbers—and the ability to have switch-wide connection control by manipulating the
Prohibit Dynamic Connectivity Mask (PDCM, a port connectivity matrix table). PDCM is a hardware feature of a
FICON switching device that provides extra security in FICON networks. Essentially, PDCM is hardware zoning,
which allows the blocking and unblocking of ports and thereby provides switch-wide port-to-port connection control.
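A PDCM-style prohibit matrix can be sketched as a symmetric port-by-port table. The class below is an invented illustration of the concept, not the actual director firmware interface.

```python
# Invented illustration of a PDCM-style prohibit matrix: a symmetric
# port-by-port table in which an entry forbids a port pair from connecting.

class ProhibitMatrix:
    def __init__(self, n_ports):
        self.n_ports = n_ports
        self._prohibited = set()   # holds frozenset({a, b}) pairs

    def prohibit(self, a, b):
        # Symmetric by construction: blocking a<->b blocks both directions.
        self._prohibited.add(frozenset((a, b)))

    def allow(self, a, b):
        self._prohibited.discard(frozenset((a, b)))

    def can_connect(self, a, b):
        return frozenset((a, b)) not in self._prohibited

pdcm = ProhibitMatrix(256)
pdcm.prohibit(0x04, 0x1B)    # e.g. keep a test channel off production DASD
print(pdcm.can_connect(0x1B, 0x04))   # False, in either direction
print(pdcm.can_connect(0x04, 0x05))   # True: everything else still allowed
```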
FICON Management Server (FMS), an optional feature, integrates a FICON switching device with IBM’s System
Automation OS/390 (SA OS/390) or SA z/OS management software and RMF. It enables CUP on the FICON
switching device and provides an in-band management capability for mainframe users. This means that
customers with IBM’s SA OS/390 or SA z/OS software on their mainframe host can manage a FICON switch
over a FICON connection in the same way they always managed their 9032 ESCON Directors. It also allows
RMF to draw performance statistics directly out of the FICON switching device (Guendert, 2007).
In-Band Management
InRange and McDATA, once again, pioneered the ability to allow an embedded port in a director to
communicate with MVS or z/OS through one or more FICON channel paths. These companies also were
the first to provide out-of-band FICON management through the use of a separate laptop or server and
proprietary management code and functions (Guendert & Lytle, 2007). This in-band functionality has been
expanded and refined over the years, and now there are many applications and programs that can do in-
band discovery, monitoring, and management of Fibre Channel fabrics and switching devices.
FC Security
FC Security was a direct requirement of IBM for FICON cascading. FICON had to maintain a high-integrity
fabric that would be very difficult to configure into a failure. Every vendor supplying FICON switching devices
had to create processes and functions that met stringent IBM requirements. Security Authorization,
designed specifically for FICON, soon found its way into the open systems SAN world, which was in
desperate need of security enhancements.
Preferred Pathing
Several years ago, IBM suggested to McDATA that having a method to “prefer” an E_Port FICON cascaded
connection to a specific F_Port would be very valuable. McDATA created this method in just a few months
and incorporated it into a release of its firmware. Other vendors then provided the same functionality in
one way or another. This feature then became popular in open system SANs for separating workloads onto
unique Inter-Switch Links (ISLs).
Physical Partitioning
First requested by IBM IGS for FICON implementations, this became a reality when McDATA released its
10K 256-port director, now known as the Brocade Mi10K Director. Very similar in nature to Logical
Partitions (LPARs) on a z9 server, director physical partitioning is especially relevant for environments that
need to make use of the same director chassis for both FICON and SAN. These are now run in separate
partitions, for the strongest port isolation possible within the same chassis. Physical partitioning is the
capability to run a separate and distinct firmware load in each partition. If the firmware in a partition has a
problem, then only that partition is affected. This is unlike logical partitions (Virtual Fabrics, Administrative
Domains, OpenVSAN, or VSAN) where all occurrences of the partitions run on a single firmware base.
Node Port ID Virtualization (NPIV)
Using the same physical port for multiple Linux images is called Node Port ID Virtualization (NPIV). A server
port that supports node port virtualization is referred to as an NV_Port. An NV_Port has all the same
characteristics as an N_Port, except that it logically looks like several N_Ports. This was the first time that
Fibre Channel paths had been virtualized.
Each virtualized N_Port has a unique World Wide Name (WWN). In other words, virtualization is
accomplished by establishing a separate WWN for each Linux image sharing the NV_Port. The WWN is used
to keep track of which conversations on that physical port are associated with which Linux image. Fabric
switch ports must understand that an NV_Port has multiple WWNs and must be able to manage the WWNs
independently (in other words, zoning).
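The WWN bookkeeping described above can be sketched as follows. FLOGI and FDISC are the standard Fibre Channel login operations; the class and the address-assignment scheme are invented for illustration.

```python
# Invented sketch of the switch-side bookkeeping for NPIV: several WWPNs,
# one per Linux image, logged in through a single physical fabric port.
# FLOGI and FDISC are the standard Fibre Channel login operations.

class FabricPort:
    def __init__(self):
        self.logins = {}            # WWPN -> assigned N_Port ID (FCID)
        self._next_fcid = 0x010001  # illustrative address-assignment scheme

    def flogi(self, wwpn):
        """Fabric login by the physical port itself."""
        return self._assign(wwpn)

    def fdisc(self, wwpn):
        """Additional NPIV login sharing the same physical port."""
        return self._assign(wwpn)

    def _assign(self, wwpn):
        fcid = self._next_fcid
        self._next_fcid += 1
        self.logins[wwpn] = fcid    # each image is tracked by its own WWPN
        return fcid

port = FabricPort()
port.flogi("50:05:07:64:01:00:00:01")    # the physical N_Port
port.fdisc("c0:50:76:00:00:00:00:02")    # Linux image A's virtual WWPN
port.fdisc("c0:50:76:00:00:00:00:03")    # Linux image B's virtual WWPN
print(len(port.logins))                  # 3 logins on one physical port
```

Because every login has its own WWPN and FCID, the switch can zone and monitor each Linux image independently even though they share one cable.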
So, it was through the innovative use of NPIV that IBM was able to provide acceptable I/O performance to
thousands of Linux images running on a System z server. Only in this way could IBM convince so many users
to move Linux onto System z and away from less efficient open systems servers.
Once NPIV became available with the IBM System z9, development began on ways to use this technology
in a SAN. For example, Brocade announced the availability of the Brocade Access Gateway. The Brocade
Access Gateway is a powerful Brocade Fabric OS® (FOS) feature for Brocade blade server SAN switches,
enabling seamless connectivity to any SAN fabric, enhanced scalability, and simplified manageability.
Utilizing NPIV standards, the Brocade Access Gateway connects server blades to any SAN fabric. With
blade servers connecting directly to the fabric, the Brocade Access Gateway eliminates switch/domain
considerations while improving SAN scalability.
Thus, again it was the mainframes that led the way to the use of a promising technology.
The remaining chapters of this book discuss these recent developments and technologies in more detail.
Highlights include:
1. The first 16 Gbps FICON director (the Brocade DCX® 8510 Backbone) was introduced in 2011.
2. The first 32 Gbps FICON director (the Brocade X6) was introduced in 2016.
3. The first and only industry FICON architect certification program (Brocade Certified Architect for FICON
[BCAF]) and training course were introduced in 2008.
4. Enhancements to FCIP extension technologies for IBM Z environments, including blades and standalone
switches, were introduced.
5. IP Extension technology was introduced in 2015 for IBM Z virtual tape systems such as the
IBM TS7700.
6. CUP Diagnostics technology integrated with the IBM Health Checker for z/OS was introduced in 2013.
7. Improved performance management and monitoring via Brocade Fabric Vision® technology tools.
Figure: Mainframe director and channel extension platform timeline, 1987–2015: 9032, ED-5000, M6064, M6140, Channelink, FC9000, USD/USDX, B24000, i10K, FR4-18i, 48000, 7500 & 82xx, DCX, DCX-4S, 7800 & FX8-24, DCX 8510, and 7840.
The first and possibly largest IBM user group organization, SHARE Inc., has created a growing initiative
to seek out and educate the next generation of mainframers in anticipation of a robust growth in the
mainframe computing space. Called zNextGen, this creative initiative will positively impact future
generations of mainframe shops. To find out more about this innovative project, visit the SHARE website at
www.share.org.
Yet by observing customers, it is clear that mainframes are actually a window indicating where other
vendor platforms are heading. Director-class switches, non-disruptive firmware loads, hot pluggability and
swappability, fabric security, and channel virtualization are just a few of the many innovations available to
systems running Fibre Channel fabrics. These innovations exist because the mainframe drove their creation.
All of the major storage and server vendors, and even the other IBM server lines, observe developments
in mainframes and then build those technologies and features into their products for both mainframe and
open systems consumption.
So, if you want to predict future developments in Fibre Channel, watch what is happening with mainframes.
You are likely to find that the mainframe marketplace provides a window into the future of Fibre Channel
functionality. The past is prologue.
References
Artis, H., & Houtekamer, G. (1993, 2006). MVS I/O Subsystems: Configuration Management and Performance
Analysis. New York: McGraw Hill.
Case, R., & Padegs, A. (1978). Architecture of the IBM System/370. Communications of the ACM, 21.
Cormier, R. (1983). System/370 Extended Architecture: The Channel Subsystem. IBM Journal of Research and
Development, 3-27.
Guendert, S. (2005). Next Generation Directors, DASD Arrays and Multi-Service, Multi-Protocol Storage
Networks. zJournal, 26-29.
Guendert, S. (2005). Proper Sizing and Modeling of ESCON to FICON Migrations. Proceedings of the 2005
Computer Measurement Group. Turnersville, NJ: Computer Measurement Group.
Guendert, S. (2007). A Comprehensive Justification for Migrating from ESCON to FICON. Gahanna, OH: Warren
National University.
Guendert, S., & Houtekamer, G. (2005). Sizing Your FICON Conversion. zJournal, 22-30.
Guendert, S., & Lytle, D. (2007). MVS and z/OS Servers Were Pioneers in the Fibre Channel World. Journal of
Computer Resource Management, 3-17.
IBM. (1966). IBM System/360 Principles of Operation. Poughkeepsie, NY: IBM Corporation.
IBM. (1990). IBM Enterprise Systems Architecture/390 Principles of Operation. Poughkeepsie, NY: IBM
Corporation.
Johnson, R. (1989). MVS Concepts and Facilities. New York: McGraw Hill.
Lorin, H. (1971). Parallelism in Hardware and Software: Real and Apparent Concurrency. Englewood Cliffs, NJ:
Prentice Hall.
Prasad, N. (1989). IBM Mainframes: Architecture and Design. New York: McGraw Hill.
PART 1 Chapter 2: Mainframe I/O and Storage Basics
• One or more specialty Processing Units (PUs), called System Assist Processors (SAPs)
• Channels communicate with I/O Control Units (CUs) and manage the movement of data between
processor storage and the control units. Channels perform four primary functions:
–– Transfer data during read and write operations
–– Send channel commands from the processor to a control unit
–– Receive sense information from control units, for example, detailed information on any errors
–– Receive status at the end of operations
• I/O devices attached through control units
• Up to 64 K subchannels, also known as Unit Control Words (UCWs).
The channel subsystem relieves the mainframe’s CPUs from the task of communicating directly with the
I/O devices and main storage. In other words, it permits data processing to proceed concurrently with I/O
processing. The CSS uses one or more channel paths as the communications link to manage the flow of
information to and from I/O devices. The CSS also performs path management operations by testing for
channel path availability; it chooses a channel path from those available and then initiates the performance
of an I/O operation by a device. (IBM, 2015)
The design of IBM Z offers considerable processing power, memory size, and I/O connectivity. In support
of the larger I/O capability, the CSS concept has been scaled up correspondingly to provide relief for the
number of supported logical partitions, channels, and devices available to the system. A single channel
subsystem allows the definition of up to 256 channel paths. To overcome this limit, the multiple channel
subsystems concept was introduced. The IBM Z architecture provides for up to six channel subsystems
(CSS0 to CSS5) on an IBM z13. CSSs are numbered from 0 to 5 and are sometimes referred to as the CSS
image ID (CSSID 0, 1, 2, 3, 4 or 5). The structure of the multiple CSSs provides channel connectivity to the
defined logical partitions in a manner that is transparent to subsystems and application programs, enabling
the definition of a balanced configuration for the processor and I/O capabilities. Each z13 CSS can have
from 1 to 256 channels and can be configured to between 1 and 15 logical partitions in CSS0 to CSS4.
Therefore, with all 6 CSSs a z13 will support a maximum of 85 logical partitions (LPARs). These CSSs are
also referred to as Logical Channel Subsystems (LCSSs) (Lascu, et al 2016). The number of LCSSs and
LPARs supported on IBM's z14 (aka zNext) is anticipated to exceed the z13 figures, but the exact
numbers were not known at publication time.
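The 85-LPAR figure follows from simple arithmetic: CSS0 through CSS4 each support up to 15 logical partitions, which implies the sixth CSS is limited to 10.

```python
# z13 LPAR arithmetic implied by the text: CSS0-CSS4 each support up to
# 15 logical partitions; for the stated maximum of 85 to hold, the
# remaining CSS (CSS5) must be limited to 10.
full_csss = 5 * 15          # CSS0 through CSS4
css5 = 85 - full_csss       # what is left for CSS5
print(full_csss, css5)      # 75 10
assert full_csss + css5 == 85
```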
Some highly I/O intensive environments, such as Transaction Processing Facility (z/TPF), might require
additional SAPs. Additional SAPs can be ordered or configured as an option to increase the capability of the
CSS to perform I/O operations. Table 1 shows standard and optional SAP configurations for five different IBM
z13 models.
Depending on the processor model, a channel subsystem has one or more SAPs. The SAP finds the
subchannel in the initiative queue and then attempts to find a channel that succeeds in initial selection
(in other words, a channel that connects to the control unit and starts the I/O operation). The SAP uses
information in the subchannel to determine which channels and control units can be used to reach the
target device for an I/O operation. This initial selection might be delayed for several reasons; during any
such delay, the request is still serviced by a SAP without the need for z/OS to become involved (IBM, 2015).
Once the I/O operation completes, the SAP queues the subchannel containing all of the I/O operation final
status information in the I/O interrupt queue.
Subchannels
The subchannel number identifies a device during execution of Start Subchannel (SSCH) instructions
and the I/O interrupt processing. Every device defined through Hardware Configuration Definition (HCD)
in the I/O Configuration Data Set (IOCDS) is assigned a subchannel number, or Unit Control Word (UCW).
A subchannel provides the logical representation and appearance of a device to a program within the
channel subsystem. One subchannel is provided for, and dedicated to, each I/O device that is accessible
to a program through the channel subsystem. Individual subchannels provide information concerning
the associated I/O devices and their attachment to the channel subsystem (Lascu, et al 2016). The
subchannels are the means by which the channel subsystem then provides information and details about
the associated I/O device to the individual CPUs on the mainframe. The actual number of subchannels
provided by a mainframe depends on the IBM Z model and configuration: the maximum addressability
available ranges from 0 to 65,535 addresses per subchannel set (a 16-bit value is used to address
a subchannel). One subchannel is assigned per device that is defined to the logical partition. Four
subchannel sets are available to z13/z13s. Three subchannel sets each are available to zEC12 and
z196. Two subchannel sets each are available on z114 and System z10. The number of subchannel sets
per z14 (aka zNext) is expected to be greater than 4, but the exact number was not known at publication
time.
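The addressability numbers follow directly from the 16-bit subchannel number:

```python
# A subchannel number is a 16-bit value, giving 64 K addresses per
# subchannel set; the z13 provides four subchannel sets.
per_set = 2 ** 16                    # 65,536 subchannels per set
sets_z13 = 4
print(per_set)                       # 65536
print(per_set * sets_z13)            # 262144 addressable subchannels in total
```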
A subchannel contains the information required for sustaining a single I/O operation. This information
consists of internal storage that contains information in the form of the following:
• Device number
• Count
• Status indications
• I/O interruption subclass code
• Path availability information
• Additional information on functions pending or currently being performed
An I/O operation is initiated with a device through the execution of I/O-specific instructions that designate
the subchannel associated with that device (Lascu, et al 2016).
Subchannel numbers, including their implied path information to a device, are limited to four hexadecimal
digits by hardware and software architectures. Four hexadecimal digits provide 64 K addresses, known as a
set. IBM reserved 256 subchannels, leaving 63.75 K subchannels for general use with the IBM Z mainframes.
Parallel Access Volumes (PAVs) made this limitation a challenge for larger installations; a single disk drive (with
PAV) often consumes at least four subchannels. (Four is a widely used number for PAV. It represents the base
address and three alias addresses.) (Lascu, et al 2016).
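A quick calculation using the figures cited above shows how the PAV pressure arises:

```python
# How quickly PAV consumes subchannel set 0: IBM reserves 256 of the
# 64 K subchannels, and a typical PAV volume uses a base address plus
# three aliases (four subchannels).
usable = 2 ** 16 - 256               # 65,280 = the "63.75 K" general-use subchannels
per_pav_volume = 4                   # 1 base + 3 aliases
print(usable // per_pav_volume)      # 16320 PAV volumes exhaust the set
```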
It was difficult for IBM to remove this constraint, because the use of four hexadecimal digits for
subchannels—and device numbers corresponding to subchannels—is architected in several places;
simply expanding the field would break too many programs. The solution developed by IBM allows sets of
subchannels (addresses) with a current implementation of four sets on z13. When the multiple subchannel
set facility is installed, subchannels are partitioned into multiple subchannel sets. There can be up to four
subchannel sets, each identified by a Subchannel Set Identifier (SSID). Each set provides 64 K addresses.
Subchannel set 0, the first set, reserves 256 subchannels for IBM use. Subchannel set 1 provides a full
range of 64 K subchannels on z13, zEC12, z196, z114, and System z10. Subchannel set 2 provides
another full range of 64 K subchannels on zEC12 and z196 systems only. The first subchannel set (SS0)
allows definitions of any type of device allowed today, for example, bases, aliases, secondaries, and devices
other than disk that do not implement the concept of associated aliases or secondaries. The second
and third subchannel sets (SS1 and SS2) are designated for use for disk alias devices of both primary
and secondary devices, and Metro Mirror Peer-to-Peer Remote Copy (PPRC) secondary devices only. The
appropriate subchannel set number must be included in IOCP definitions or in the HCD definitions that
produce the IOCDS. The subchannel set number defaults to zero, and IOCP changes are needed only when
using subchannel sets 1 or 2. When the multiple subchannel set facility is not installed, there is only one
subchannel set, which has an SSID of zero.
Control units can be attached to the channel subsystem by more than one channel path, and an I/O device
can be attached to more than one control unit. A path is simply a route to a device. From the perspective
of the operating system, a path flows in this way: from the Unit Control Block (UCB) (which represents
the device to z/OS), through a subchannel (which represents the device in the channel subsystem), over
a channel, possibly through a FICON director/switch, through the CU, and on to the device itself. From a
hardware perspective, a path runs from the IBM Z channel subsystem, through the FICON director, to the
control unit, and on to the device.
Each of these components along the path must physically exist and be connected correctly so that the path
can exist. The IOCDS also must accurately reflect the actual physical connectivity, to avoid problems. The
paths to a device are initialized during an Initial Program Load (IPL) via a process known as path validation.
During the path validation process, the state of each path is determined. If one component of the hardware
supporting the path is unavailable, that path is not brought online. There are three requirements for z/OS
to be able to access a device through a given path. First, the hardware must be configured so that physical
connectivity exists between the device and the CPC. Second, the configuration must have been correctly
defined using HCD. Finally, the software involved must be aware of the paths and also must have varied the
paths online.
Individual I/O devices are accessible to the channel subsystem by up to eight different channel paths via
a subchannel. The total number of channel paths provided by a single CSS depends on the model and
configuration of the IBM mainframe. The addressability range is 0–255. The Channel Path Identifier (CHPID)
is a system-unique value assigned to each installed channel path of an IBM Z mainframe that is used to
address a channel path. The z/OS operating system is limited to a maximum of 256 channel paths for use within an LPAR. To
facilitate the use of additional CHPIDs, the mainframe architecture supports a Logical Channel Subsystem
(LCSS). The structure of the multiple LCSSs provides channel connectivity to the defined logical partitions
in a manner that is transparent to subsystems and application programs, enabling the definition of a
balanced configuration for the processor and I/O capabilities. Each LCSS can have from 1 to 256 channels
and can be configured to 1 to 15 logical partitions. Therefore, four LCSSs support a maximum of 60 logical
partitions. As previously mentioned, these numbers are slightly different for z13 and later.
LCSSs are numbered from 0 to 3 and are sometimes referred to as the CSS image ID (CSSID 0, 1, 2 or
3). The LCSS is functionally identical to the channel subsystem; however, up to four LCSSs can be defined
within a central processor complex. CHPIDs are unique within the LCSS only; consequently, the limitation
of 256 CHPIDs can be overcome. To do so, the concept of the Physical Channel Identifier (PCHID) was
introduced. The PCHID reflects the physical location of a channel type interface. Now, CHPIDs no longer
correspond directly to a hardware channel part but are assigned to a PCHID in HCD or IOCP via a specialized
software application known as the CHPID Mapping Tool (Lascu, et al 2016).
The Multiple Image Facility (MIF) enables channel sharing across logical partitions within a single CSS
on IBM Z mainframes. Channel spanning extends the MIF concept of sharing channels across logical
partitions to sharing physical channels across logical partitions and channel subsystems. Channel spanning
is the ability to configure channels to multiple Logical Channel Subsystems. This way the channels can be
transparently shared by any or all of the configured logical partitions, regardless of the channel subsystem
to which the logical partition is configured. In other words, channel spanning is the ability for a physical
channel (a PCHID) to be mapped to the CHPIDs that are defined in multiple Channel Subsystems. When
defined that way, the channels can be transparently shared by any or all of the configured logical partitions,
regardless of the channel subsystem to which the logical partition is configured. A channel is considered
a spanned channel if the same CHPID number in different CSSs is assigned to the same PCHID in IOCP,
or if it is defined as spanned in HCD. FICON channels can be spanned across multiple LCSSs in z14, z13,
z13s, zEC12, z196, z114, and z10—and the CHPID number is the same on all logical CSSs that include that
CHPID.
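The spanning rule just stated can be expressed as a small sketch; the CHPID and PCHID values here are hypothetical, and the mapping stands in for what IOCP defines.

```python
# Sketch of the spanning rule: a channel is spanned when the same CHPID
# number in different CSSs is assigned to the same PCHID in IOCP.
# Mapping: (css_id, chpid) -> pchid  (hypothetical values)
iocp_map = {
    (0, 0x40): 0x1C0,
    (1, 0x40): 0x1C0,   # same CHPID 0x40, same PCHID -> spanned
    (0, 0x41): 0x1C4,   # defined in only one CSS -> not spanned
}

def is_spanned(chpid, iocp_map):
    pchids = {p for (css, c), p in iocp_map.items() if c == chpid}
    csss = {css for (css, c) in iocp_map if c == chpid}
    return len(csss) > 1 and len(pchids) == 1

print(is_spanned(0x40, iocp_map))   # True
print(is_spanned(0x41, iocp_map))   # False
```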
Control Units
A CU provides the logical capabilities necessary to operate and control an I/O device. In all cases, an
I/O device operation is regulated by a control unit that provides the functions necessary to operate the
associated device. The control unit is also responsible for interfacing with the FICON channel.
Control units adopt the characteristics of each device so that the device can respond to the standard form
of control provided by the channel subsystem. Communication between the control unit and the channel
subsystem takes place over a channel path (Lascu, et al 2016). The control unit accepts control signals
from the channel subsystem, controls the timing of data transfer over the channel path, and provides
indications about the status of the device. A control unit can be housed separately, or it can be physically
and logically integrated with the I/O device, the channel subsystem, or in the system itself. From the modern
systems programmers’ perspective, most of the functions performed by the control unit can be merged
with those performed by the I/O device in question. The term control unit is often used to refer either to
a physical control unit, a logical control unit, or a logical control unit image. The term physical control unit
is synonymous with the terms controller and DASD subsystem (in other words, it is the piece of hardware
that sits on the data center floor). A Logical Control Unit (LCU) is a channel subsystem concept, intended
primarily for device definitions in HCD/IOCP, which represents a set of control units that attach to common
I/O devices.
I/O Devices
A modern IBM Z I/O device provides external storage, a means of communication between a system and its
environment, or a means of communication between data processing systems. In the simplest case, an I/O
device is attached to one CU and is accessible from one channel path. Switching equipment, such as the
Brocade X6 and DCX 8510 families of FICON directors, is available to make some devices accessible from
two or more channel paths—by switching devices among CUs and by switching CUs among channel paths.
This switching provides multiple paths by which an I/O device can be accessed. Such multiple channel
paths to an I/O device are provided to improve performance, availability, or both.
A device number is a 16-bit value assigned as one of the parameters of the subchannel at the time a
specific device is assigned to a specific subchannel. The device number is like a nickname; it is used
to identify a device in interactions between the user and the system. Device numbers are used in error
recording, console commands, console messages, and in Job Control Language (JCL). Every channel-
attached device in a z/OS configuration is assigned a unique device number when the systems programmer
defines the device to the channel subsystem and HCD. It is a 16-bit number in the range of 0000–FFFF,
which allows for a maximum of 64 K devices available per channel subsystem. In addition, there is also
another addressing element known as a device identifier. A device identifier is an address that is not
readily apparent to the application program, and it is used by the CSS to communicate with I/O devices.
For a FICON I/O interface, the device identifier consists of an 8-bit control unit image ID and an 8-bit device
address.
The LINK parameter is the FICON director port address to which the control unit is connected. The CUADD parameter
identifies the LCU image that contains the requested I/O device. The Unit Address (UA) is the name by which
a channel knows a device that is attached to one of the control units. The UA consists of two hexadecimal
digits in the range of 00–FF. There is not a requirement to have the UA match any portion of the device
number. However, the end user must be careful during HCD device definition, because the UA defaults to
the last two digits of the device number if it is not otherwise explicitly specified.
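The UA default rule can be shown concretely; the device number below is arbitrary:

```python
# The UA defaults to the last two hexadecimal digits of the device number
# when it is not explicitly specified during HCD device definition.
def default_unit_address(device_number):
    """device_number: 16-bit value, e.g. 0x9F23 -> UA 0x23."""
    return device_number & 0xFF

print(f"{default_unit_address(0x9F23):02X}")   # 23
```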
An I/O command word specifies a command and contains information associated with the command. When
the Fibre-Channel-Extensions (FCX) facility is “installed”, there are two elemental forms of I/O command
words: Channel Command Words (CCWs) and Device Command Words (DCWs). A CCW is 8 bytes in length
and specifies the command to be executed. For commands that are used to initiate certain types of
operations, the CCW also designates the storage area associated with the I/O operation, the count of data
(bytes), and the action to be taken when the command is completed. All I/O devices recognize CCWs. A
DCW is 8 bytes in length and specifies the command to be executed and the count of data (bytes). Only I/O
devices that support the FCX facility recognize DCWs (IBM, 2015).
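As a rough illustration of the 8-byte size, a format-1 CCW can be packed as a command code, flags, byte count, and a 31-bit data address; the field values below are illustrative, not a real channel program.

```python
import struct

# Sketch of packing an 8-byte format-1 CCW: command code (1 byte),
# flags (1 byte), count (2 bytes), 31-bit data address (4 bytes).
def pack_ccw(command, flags, count, data_address):
    return struct.pack(">BBHI", command, flags, count, data_address & 0x7FFFFFFF)

ccw = pack_ccw(0x06, 0x40, 4096, 0x0010_0000)  # illustrative read with command chaining
print(len(ccw))        # 8 bytes, as the architecture requires
```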
The FCX facility is an optional facility that provides for the formation of a transport mode channel program
composed of DCWs, a Transport Control Word (TCW), Transport Command Control blocks (TCCB), and
Transport Status Blocks (TSB). If the FCX facility is not installed, a single form of channel program
(command mode) is utilized using CCWs. A CCW channel program is a channel program composed of
one or more CCWs. The CCWs are logically linked and arranged for execution by the Channel Subsystem
(IBM, 2015). The remainder of this chapter focuses on Command Mode FICON (CCW channel program)
operations. Transport Mode FICON—also known as IBM Z High Performance FICON (zHPF)—is covered in
detail in a later chapter.
Figure 1 (fig01_P1Ch2): With zHPF enabled, the channel sends all of the required read commands (for example, four 4 KB reads) to the control unit in a single frame. The CU transfers the requested data (16 KB) over the link to the channel, followed by CE/DE status if the operation was successful, or a status exception otherwise. Significantly less overhead is generated compared to the existing FICON architecture.
Figure 2 (fig02_P1Ch2): A channel can communicate with other devices on the same CU or on different CUs concurrently.
Figure 3 (fig03_P1Ch2): The z/OS I/O flow from the application program, through the access methods (BSAM/QSAM, VSAM Media Manager, DB2) or EXCP and the I/O Subsystem (IOS), to the SSCH instruction, subchannels, and channels. zHPF-eligible access methods are: DB2, PDSE, VSAM, zFS, EF SAM DS, QSAM, BSAM, and BPAM. If an I/O is zHPF-eligible, the EXCP or Media Manager creates a TCW instead of a stream of CCWs.
(Key: DB2 = IBM DB2, PDSE = Partitioned Data Set Extended, zFS = zSeries File System, EF SAM
DS = Extended Format Sequential Access Method Data Set, BSAM = Basic Sequential Access Method)
Step 1. The end-user application program begins an I/O operation by allocating a data set. The application
program uses an I/O macro instruction (for instance, GET, READ, WRITE, PUT) that invokes an access
method through a branch instruction. The access method interprets the I/O request and determines which
system resources are necessary to satisfy the request. The access method provides the following services:
1. It creates the channel program (with virtual addresses) that knows the physical organization
of the data.
2. It implements any required buffering.
3. It guarantees synchronization.
4. It automatically redrives the I/O request following a failure.
Although the application program might bypass the access method, many details of the I/O operations
must first be considered (for example, device physical characteristics). Then, the application program must
also create a channel program consisting of the instructions for the CSS. It also must invoke an IOS driver
(the Execute Channel Program [EXCP] processor) to handle the next phase of the I/O process. z/OS has
several access methods, each of which provides various functions to the user application program. Some
of these access methods are: Virtual Storage Access Method (VSAM), Queued Sequential Access Method
(QSAM), and Basic Partitioned Access Method (BPAM). The access method you use depends on whether
the application program accesses the data randomly or sequentially, and on the data set organization
(sequential, partitioned, and so forth). The application or access method provides CCWs or TCWs and
additional parameters, all of which are in the Operation Request Block (ORB).
Step 3. To request the actual movement of data, either the access method or the application program
presents information about the I/O operation to an I/O driver routine (EXCP or Media Manager) via a macro
instruction. The I/O driver routine is now in supervisor mode, and it translates the virtual storage addresses
in the channel program to real storage addresses that are in a format acceptable to the channel subsystem.
In addition, the I/O driver routine fixes the pages containing the CCWs and data buffers, guarantees the
data set extents, and invokes the IOS.
Step 4. The IOS places the I/O request in the queue for the chosen I/O device in the UCB. The IOS
services the request from the UCB on a priority basis. The IOS issues an SSCH instruction to send the
request to the channel subsystem. The SSCH is sent with the Subsystem Identification word (SSID)
representing the device and ORB as operands. The ORB contains start-specific control information and
information about “what to do” during the I/O operation. The ORB indicates whether the channel is
operating in transport mode (zHPF support) or command mode, and also indicates the starting address
of the channel program (the Channel Program Address, or CPA). The SSCH moves the ORB contents into the respective
subchannel and places the subchannel in a specific queue in the Hardware System Area (HSA), called the initiative
queue. Generally, at this point the task that initiated the I/O request is placed in a wait state, and the CPU
continues processing other tasks until the channel subsystem indicates via an I/O interrupt that the I/O
operation has completed.
Step 5. The CSS selects the most appropriate channel or channel path to initiate the I/O operation between
the channel and control unit or device. The channel fetches from storage the Channel CCWs or TCWs and
associated data (for write operations). The channel assembles the required parameters and fields of the
FC-2 and FC-SB-3 or FC-SB-4 for the I/O request and passes them to the Fibre Channel adapter (which is
part of the FICON channel). Device-level information is transferred between a channel and a control unit in
SB-3 or SB-4 Information Units (IUs). Information units are transferred using both link-level and device-level
functions and protocols. For example, when the channel receives an initiative to start an I/O operation,
the device-level functions and protocols obtain the command and other parameters from the current CCW
or TCW and insert them into the appropriate fields within a command IU. When the command IU is ready
for transmission, link-level functions and protocols provide additional information (for example, address
identifiers and exchange ID in the frame header) and then coordinate the actual transmission of the frame
on the channel path. The CSS controls the movement of the data between the channel and processor
storage.
Step 6. The Fibre Channel adapter builds the complete FC-FS FC-2 serial frame and transmits it into the
Fibre Channel link. As part of building the FC-FS FC-2 frame for the I/O request, the FICON channel in FICON
native (FC) mode constructs the 24-bit FC port address of the destination N_Port of the control unit, and the
control unit image and device address within the physical CU.
Step 7. When the I/O operation is complete, the CSS signals completion by generating an I/O interrupt. The
IOS processes the interruption by determining the status of the I/O operation (successful or otherwise) from
the channel subsystem, using a Test Subchannel (TSCH) instruction.
Step 8. The I/O driver routine (EXCP or Media Manager) macro indicates that the I/O process is complete by
posting the access method and calling the dispatcher. At this point, the dispatcher reactivates the access
method, which then returns control to the user application program so it can continue with its normal
processing.
These steps describe a high-level summary of an I/O operation. A more detailed description of the SSCH, path
management process, channel program execution process, and I/O operation conclusion follows.
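The eight steps can be condensed into a short sketch; the step descriptions are paraphrased from the text, not z/OS internals.

```python
# Condensed sketch of the I/O request lifecycle described in Steps 1-8
# (names are illustrative, not z/OS control-block fields).
PIPELINE = [
    "application issues I/O macro (GET/PUT/READ/WRITE)",
    "access method builds channel program (CCWs or TCW)",
    "I/O driver (EXCP/Media Manager) translates addresses, fixes pages",
    "IOS queues request on UCB, issues SSCH",
    "CSS selects channel path, builds SB-3/SB-4 IUs",
    "FC adapter transmits FC-2 frames on the link",
    "CSS generates I/O interrupt; IOS issues TSCH",
    "I/O driver posts access method; application resumes",
]
for step, action in enumerate(PIPELINE, start=1):
    print(f"Step {step}: {action}")
```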
Start Subchannel
The SSCH instruction is executed by a CPU and is part of the CPU program that supervises the flow of
requests for I/O operations from other programs that manage or process the I/O data. When a SSCH is
executed, the parameters are passed to the target subchannel, requesting that the CSS perform a start
function with the I/O device associated with the subchannel. The CSS performs this start function by using
information at the subchannel, including the information passed during the SSCH execution, to find an
accessible channel path to the device. Once the device is selected, the execution of an I/O operation is
accomplished by decoding and executing a CCW by the CSS and the I/O device for CCW channel programs.
A SSCH passes the contents of an ORB to the subchannel. If the ORB specifies a CCW channel program, the
contents of the ORB include (IBM, 2015):
1. Subchannel key
2. Address of the first CCW to be executed
3. A specification of the format of the CCWs
4. The command to be executed and the storage area (if any) to be used
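These ORB contents can be modeled as a simple structure; this is a simplified sketch for illustration, not the architected ORB layout.

```python
from dataclasses import dataclass

# Sketch of the ORB fields listed above for a CCW channel program,
# plus the transport-mode indicator mentioned earlier in the chapter.
@dataclass
class OperationRequestBlock:
    subchannel_key: int       # storage protection key
    ccw_address: int          # address of the first CCW to be executed
    ccw_format: int           # 0 or 1: the format of the CCWs
    transport_mode: bool      # True -> zHPF (TCW), False -> command mode

orb = OperationRequestBlock(subchannel_key=0, ccw_address=0x0010_0000,
                            ccw_format=1, transport_mode=False)
print(orb.transport_mode)    # False: a command-mode CCW channel program
```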
Path Management
If an ORB specifies a CCW channel program, and the first CCW passes certain validity testing (and does not
have the suspend flag specified), the channel subsystem attempts device selection by choosing a channel
path from the group of channel paths available for selection. A CU that recognizes the device identifier
connects itself logically to the channel path and responds to its selection. If the ORB specifies a CCW
channel program, the CSS sends the command-code part of the CCW over the channel path, and the device
responds with a status byte indicating whether or not the command can be executed. The CU might logically
disconnect from the channel path at this time, or it might remain connected, to initiate data transfer (IBM,
2015).
If the attempted path selection does not occur—as a result of either a busy indication or a path not-
operational condition—the CSS attempts to select the device by an alternate channel path, if one is
available. When path selection has been attempted on all available paths, and the busy condition persists,
the I/O operation remains pending until a path becomes free. If a path not-operational condition is detected
on one or more of the channel paths on which device selection was attempted, the program is alerted by
a subsequent I/O interruption. This I/O interruption occurs either upon execution of the channel program
(indicating that the device was selected on an alternate channel path) or as a result of the execution being
abandoned—because a path not operational condition was detected in all of the channel paths on which
device selection was attempted.
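The path-management behavior described above can be sketched as a loop over the available channel paths; the probe function and CHPID names are hypothetical.

```python
# Sketch of alternate-path selection: try each available channel path;
# on "busy" the request stays pending, and "not operational" paths are
# recorded for reporting via a later I/O interruption.
def select_path(paths, probe):
    """probe(path) -> 'ok', 'busy', or 'not-operational' (hypothetical)."""
    not_operational = []
    any_busy = False
    for path in paths:
        state = probe(path)
        if state == "ok":
            return path, not_operational
        if state == "busy":
            any_busy = True
        else:
            not_operational.append(path)
    # all paths tried: pending if any were busy, else selection abandoned
    return ("pending" if any_busy else "abandoned"), not_operational

states = {"CHPID 40": "busy", "CHPID 44": "not-operational", "CHPID 48": "ok"}
result, bad = select_path(states, states.get)
print(result)   # CHPID 48
print(bad)      # ['CHPID 44'] -> reported via a later I/O interruption
```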
When the ORB contents have been passed to the subchannel, the execution of the SSCH command is
formally complete. An I/O operation can involve the transfer of data to/from one storage area designated
by a single CCW or to/from a number of non-contiguous storage areas. When each CCW designates a
contiguous storage area, the CCWs are coupled by data chaining. Data chaining is specified by a flag in
the CCW, and it causes the CSS to fetch another CCW upon the exhaustion or filling of the storage area
designated by the current CCW. The storage area designated by a CCW that is fetched on the data chaining
pertains to the I/O operation already in progress at the I/O device (the I/O device is not notified when a new
CCW is fetched) (IBM, 2015).
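Data chaining can be illustrated with a gather-style sketch: each chained storage area is filled in turn from a single device record, while the device sees one continuous transfer. This is a conceptual model only, not channel subsystem code.

```python
# Sketch of data chaining: several CCWs, each with the chain-data flag
# set, describe one I/O operation whose data spans non-contiguous
# storage areas; the device is not aware of the area boundaries.
def gather(chained_areas, record):
    """Fill each chained storage area in turn from one device record."""
    offset = 0
    for area in chained_areas:          # each area is a preallocated bytearray
        n = len(area)
        area[:] = record[offset:offset + n]
        offset += n
    return offset

areas = [bytearray(3), bytearray(5), bytearray(2)]
moved = gather(areas, b"ABCDEFGHIJ")
print(moved)                 # 10 bytes moved across three areas
print(bytes(areas[1]))       # b'DEFGH'
```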
To complement Dynamic Address Translation (DAT) in CPUs, CCW indirect data addressing and CCW
modified indirect data addressing are provided. Both of these techniques permit the same—or essentially
the same—CCW sequences to be used for a program running with DAT active in the CPU, as would be used if
the CPU operated with the equivalent contiguous real storage.
The conclusion of an I/O operation is normally indicated by two status conditions: Channel End (CE) and
Device End (DE). A Channel End, also known as the primary interrupt condition, occurs when an I/O device
has received or provided all data associated with the operation and no longer needs the channel subsystem
facilities. A Device End, also known as the secondary interrupt condition, occurs when the I/O device has
concluded execution and is ready to perform another operation. When the primary interrupt condition (CE) is
recognized, the channel subsystem attempts to notify the application program, via an interruption request,
that a subchannel contains information describing the conclusion of an I/O operation at the subchannel
(Lascu et al., 2016).
To offer insight into the collaborative efforts of IBM and Brocade: researchers from both companies work within
the standards bodies to ensure that new innovations do not exclude the unique characteristics needed
for an IBM Z (FICON) environment. There are also regular technical interlocks that highlight future development
plans from both companies. As joint efforts align into a specific release stream, the Brocade Quality Assurance
teams test and validate hardware and firmware within Brocade System z labs. Once Brocade finishes the
Systems Quality Assurance test cycle, those deliverables are set up at the IBM labs in Poughkeepsie, NY
and Tucson, AZ. The next round of testing is a rigorous set of defined steps to help identify any issues. This is
done by driving the highest throughput possible into a mix of current and new platforms being readied for the
market. The testing also includes commonly used configurations and setups to ensure that end-user use cases
are covered. In addition, some unique functions for long-distance extension within the FICON environments
are tested for disaster recovery and high-availability solutions (for instance, Geographically Dispersed Parallel
Sysplex [GDPS]). Global Mirroring with eXtended Remote Copy (XRC), along with high-end enterprise tape
extension, are tested to make sure these fit the critical environments they are destined for.
Overall, IBM and Brocade execute a defined set of steps to ensure the provision of the most reliable and
enterprise-ready solutions available in an IBM Z FICON environment.
References
IBM. (2015). z/Architecture Principles of Operation. IBM Corporation. Poughkeepsie, NY: International
Business Machines Corporation.
Lascu, O., Hoogerbrug, E., De Leon, C., Palacio, E., Pinto, F., Sannerud, B., Soellig, M., Troy, J., & Yang, J. (2016).
IBM z13 Technical Guide. Poughkeepsie, NY: IBM Redbooks.
Singh, K., Dobos, I., Fries, W., & Hamid, P. (2013). IBM System z Connectivity Handbook. Poughkeepsie, NY:
IBM Redbooks.
Singh, K., Grau, M., Hoyle, P., & Kim, J. (2012). FICON Planning and Implementation Guide. Poughkeepsie,
NY: IBM Redbooks.
White, B., Packheiser, F., Palacio, E., & Lascu, O. (2017). IBM z Systems Connectivity Handbook. Poughkeepsie,
NY: IBM Redbooks.
PART 2 Chapter 1: Protocols and Standards
Figure 1: FCP and FICON are FC-4 upper-layer protocols; both are compatible with the existing lower levels of the Fibre Channel protocol stack.
• FC-SB – ESCON over optical cables
• FC-SB-2 – First FICON standard within FC, at 1 Gbps, with no cascading
• FC-SB-3 – Second FICON standard, at 2 Gbps, which allowed cascading
• FC-SB-4 – Third FICON standard, which allows for z High-Performance FICON (zHPF)
Figure 1 represents the Fibre Channel architectural structure. Fibre Channel is based on a structured
architecture with hierarchical functions, providing specifications from the physical interface through the
mapping and transport of multiple upper-level protocols. Unlike the OSI Reference Model, the divisions
in Fibre Channel are usually referred to as levels rather than layers. The levels are labeled Fibre Channel
level-0 (FC-0) through Fibre Channel level-4 (FC-4), as shown in Figure 1. Each level defines a specific set
of instructions. This structured approach to the Fibre Channel architecture allows the functionality at one
level to be insulated from the functionality of other levels. This modular and expandable approach to the
architecture allows for enhancements to be made to parts of the Fibre Channel architecture with a minimal
impact on other parts. It also allows for enhancement activities at the different levels to proceed in parallel
with each other. Some of the FC-4 layer upper-level protocols are listed at the bottom of Figure 1. (Note: In
the figure, HIPPI = High-Performance Parallel Interface and zHPF = z High-Performance FICON.)
FC-1 level: Encoding/decoding and link-level protocols. Fibre Channel is a serial interface; it does not
provide a separate clock to indicate the validity of individual bits, so the clocking information must be
encoded within the data stream itself. Fibre Channel uses two different encoding/decoding schemes:
1, 2, 4, and 8 Gbps Fibre Channel use 8b/10b encoding/decoding, while 10 and 16 Gbps Fibre Channel
use 64b/66b encoding/decoding. The FC-1 level also defines link-level protocols, which are used to
initialize or reset the link, or to indicate that the link or port has failed. Link-level protocols are
implemented by a Port State Machine (PSM).
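The practical difference between the two schemes is their encoding overhead, which a quick back-of-the-envelope calculation makes concrete. This is pure arithmetic, assuming nothing beyond the encoding ratios themselves:

```python
# Comparison of the two FC encoding schemes described above.

def efficiency(data_bits: int, line_bits: int) -> float:
    """Fraction of line bits that carry payload."""
    return data_bits / line_bits

eff_8b10b = efficiency(8, 10)    # used at 1, 2, 4, and 8 Gbps
eff_64b66b = efficiency(64, 66)  # used at 10 and 16 Gbps

print(f"8b/10b efficiency:  {eff_8b10b:.1%}")   # 80.0%
print(f"64b/66b efficiency: {eff_64b66b:.1%}")  # 97.0%

# Encoding overhead as a fraction of the line rate:
print(f"8b/10b overhead:  {1 - eff_8b10b:.1%}")   # 20.0%
print(f"64b/66b overhead: {1 - eff_64b66b:.1%}")  # 3.0%
```

This is one reason the higher link rates moved to 64b/66b: only about 3 percent of the line rate is spent on encoding, versus 20 percent with 8b/10b.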
FC-2 level: Fibre Channel Transport. The FC-2 level is the most complex level of Fibre Channel. It includes
most of the Fibre Channel specific constructs, procedures, and operations. It provides the facilities to
identify and manage operations between node ports, as well as services to deliver information units. The
elements of the FC-2 level include the frame, sequence, and exchange constructs, together with flow control and the classes of service.
FC-3 level: Common Services. The FC-3 level provides a “placeholder” for possible future Fibre Channel
functions.
FC-4 level: Protocol Mapping. The FC-4 level defines the mapping of Fibre Channel constructs to Upper
Layer Protocols (ULPs). Before a protocol can be transported via Fibre Channel, it is first necessary to
specify how that protocol will operate in a Fibre Channel environment. This is known as protocol mapping.
A protocol mapping specifies both the content and structure of the information sent between nodes by
that protocol. These protocol-specific structures are called Information Units (IUs). The number of IUs, their
usage, and their structure are specified by each protocol mapping. For command type protocols, the IUs
consist of command information, data transfer controls, data, and status or completion information. While
the processes in individual nodes understand the structure and content of the IUs, the actual Fibre Channel
port does not. The port simply delivers the IUs without regard for their structure, function, or content.
The FC-0 through FC-3 levels are common for FICON and open systems Fibre Channel Protocol (FCP) Storage
Area Networks (SANs). This is what allows for Protocol Intermix Mode (PIM) to be supported. PIM refers to the
use of a common SAN for both the FICON and open systems environments. A frequent implementation of
PIM is when an end user is running Linux for IBM Z on the mainframe. The only differences between FCP and
FICON occur at the FC-4 level.
The American National Standards Institute (ANSI) led the development efforts for the earlier Fibre Channel
standards. A subsequent restructuring of standards organizations resulted in the more recent Fibre Channel
standards being developed by the InterNational Committee for Information Technology Standards (INCITS).
There are also several available supporting documents (not developed by INCITS) and technical reports,
in addition to the official ANSI- and INCITS-approved standards. The structure and
relationships of the primary Fibre Channel standards are shown in Figure 2.
Figure 2: The primary Fibre Channel standards, including the physical interface standards (FC-0: FC-PI-1/2/3/4, FC-10GFC) and the topology standards (FC-SW-1/2/3/4/5, FC-IFR, FC-BB-1/2/3/4).
It was recognized that having three separate FC-PH standards presented challenges to end users, so the
committee launched a project to merge these standards into a single new document, Fibre Channel Framing
and Signaling (FC-FS). Simultaneously, the FC-0 physical interface definitions were moved to a new,
separate standard, Fibre Channel Physical Interface (FC-PI). FC-PI contains the FC-0 specifications that
were separated out of the FC-FS. Separating the physical interface specifications from the protocol levels
simplified and expedited the overall standardization process. Merging the three FC-PH documents into
the new FC-FS standard helped end-user understanding of the complete framing and signaling standard.
It also removed inconsistencies and redundant information. Finally, the FC-FS standard consolidated the
definitions for Extended Link Services (ELS) that were scattered among various other standards documents
and technical reports.
1. Fibre Channel Switched Fabric (FC-SW). This series of standards (FC-SW through FC-SW-5) specifies switch
behavior and inter-switch communications.
2. Fibre Channel Backbone (FC-BB). This series of standards focuses on defining how to use bridge devices to
enable Fibre Channel devices to communicate through intermediate, non-Fibre Channel networks, such as
an IP network or a Synchronous Optical Network (SONET).
3. Fibre Channel Inter-Fabric Routing (FC-IFR). This is a relatively new project (started in 2005) to provide a
standard specifying how frames can be routed from one fabric to another while potentially traveling through
intermediate fabrics.
• FC-SB: Approved in 1996 as the original mapping of the ESCON protocol to Fibre Channel. FC-SB
provided a one-to-one mapping of ESCON frames to Fibre Channel frames. Due to low performance of
the native ESCON frame protocol, the FC-SB mapping did not provide adequate performance over
long-distance links. Therefore, it was never implemented.
• FC-SB-2: This protocol mapping was an entirely new mapping of the ESCON command protocol for
transport via Fibre Channel. It is known as the IBM Fibre Connection architecture, or FICON. FC-SB-2
was approved in 2001, and it specifies a much more efficient mapping than that provided by the
original FC-SB standard. The FC-SB standard has since been withdrawn.
• FC-SB-3: Approved in 2003. FC-SB-3 extends the functionality provided by FC-SB-2. It also streamlines
several operations to improve efficiency on extended-distance links. A subsequent amendment introduced
Persistent IU Pacing, more commonly known as the IBM Extended Distance FICON.
• FC-SB-4: Defines a new mode of operation that significantly improves the performance of certain
types of data transfer operations, as compared to the existing FC-SB-3 protocol. The new mode of
operation is referred to as transport mode, more commonly known as z High-Performance FICON
(zHPF). Transport mode allows you to send multiple device commands to a control unit in a single
IU. Transport-mode operations utilize link-level protocols as described by the FC-4 FCP specification.
In transport mode, communication between the channel and control unit takes place using a single
bidirectional exchange. It utilizes fewer handshakes to close exchanges, transmit device commands,
and provide device status as compared with the FC-SB-3 protocol. Performance improvement is most
significant with I/O operations that are performing small block data transfers, because of the reduction
in overhead relative to transfer size. Certain types of complex I/O operations are still required to use
the existing FC-SB-3 protocol.
PART 2 Chapter 2: Topology Considerations
Introduction
IBM originally announced FICON channels for the S/390 9672 G5 processor in May of 1998. Over the
ensuing 19 years, FICON has rapidly evolved from 1 Gbit FICON Bridge Mode (FCV) to the 16 Gbit FICON
Express16S+ channels that were announced in July 2017 as part of the IBM z14 announcement. While
the eight generations of FICON channels have each offered increased performance and capacity, from
the perspective of this chapter the most important enhancements relate to the Fibre Channel Upper Level
Protocol (ULP) employed by FICON. Specifically, the first FICON ULP was called FC-SB-2. This specification
supported native FICON storage devices, channel-to-channel connections, FICON directors (switches), as
well as several other key features. However, the FC-SB-2 ULP did not support switched paths over multiple
directors. When IBM introduced the FC-SB-3 ULP in January of 2003, this limitation was addressed. While
FC-SB-2 employed a single byte link address that specified just the switch port on the director, FC-SB-3
employs a 2-byte address where the first byte is the address of the destination director and the second byte
is the address of the port on the destination director. This is known as cascaded FICON. Since 2003, there
have been many changes with how the FICON switching devices send Fibre Channel frame traffic between
each other.
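The single-byte versus 2-byte link addressing difference can be shown with a small sketch. The function below is purely illustrative; its name and return structure are invented, though the byte layout (destination-director byte followed by port byte) follows the description above.

```python
# Sketch of the FC-SB-2 vs. FC-SB-3 link-address difference.
# Illustrative only; not from any IBM or Brocade API.

def parse_link_address(addr: int) -> dict:
    """Interpret a FICON link address.

    A 1-byte address (FC-SB-2 style) names only the switch port.
    A 2-byte address (FC-SB-3 style, required for cascading) is the
    destination-director byte followed by the port byte.
    """
    if addr <= 0xFF:
        return {"cascaded": False, "port": addr}
    return {
        "cascaded": True,
        "switch": (addr >> 8) & 0xFF,  # first byte: destination director
        "port": addr & 0xFF,           # second byte: port on that director
    }

print(parse_link_address(0x1B))    # single-byte: port 0x1B on the local director
print(parse_link_address(0x611B))  # two-byte: port 0x1B on director 0x61
```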
This chapter has four sections. The first section introduces the three primary topologies for FICON: direct-
attached, single-switch, and cascaded switched FICON, including multi-hop support. The second section
discusses the advantages of using a switched FICON architecture. The third section discusses cascaded
and multihop FICON in greater technical detail. The final section discusses the creation of FICON fabrics
that deliver five-nines of availability.
The FICON channel itself determines whether the link that connects to it is a direct-attached/point-to-point
topology or a switched topology. The FICON channel does this by logging into the fabric via the fabric
login process (FLOGI ELS) and checking the accept response to the fabric login (ACC ELS). The accept
response indicates what type of topology exists. Figure 1 illustrates a direct-attached FICON topology.
Although only one Fibre Channel link attaches to the FICON channel in a FICON switched point-to-point
configuration, from the switch the FICON channel can address and communicate with many FICON control
units on different switch ports. At the control unit, the same addressing capabilities exist as for direct-
attached configurations. Switched point-to-point increases the communication and addressing capabilities
for the channel, giving it the capability to access multiple control units.
The communication path between a channel and control unit in a switched point-to-point configuration is
composed of two different parts: the physical channel path and the logical path. The physical paths are the
links, or the interconnection of two links connected by a switch that provides the physical transmission path
between a channel and control unit. A FICON logical path is a specific relationship established between
a channel image and a control unit image for communication during execution of an I/O operation and
presentation of status. Figure 2 shows an example of a switched FICON topology (single-switch).
Cascaded FICON allows a FICON Native (Fibre Channel) channel or a FICON CTC channel to connect an IBM
Z server to another similar server or peripheral device such as disk, tape library, or printer via two Brocade
FICON directors or switches. A FICON channel in FICON native mode connects one or more processor
images to an FC link. This link connects to the first FICON director, then connects dynamically through the
first director to one or more ports, and from there to a second cascaded FICON director. From the second
director, there are Fibre Channel links to FICON control unit ports on attached devices. Both FICON directors
can be geographically separate, providing greater flexibility and fiber cost savings. All FICON directors that
are connected in a cascaded FICON architecture must be from the same vendor (such as Brocade). Initial
support by IBM was limited to a single hop between cascaded FICON directors; however, the directors could
be configured in a hub-star (core-edge) architecture, with up to 24 directors in the fabric. In April 2017, IBM
announced qualification and support for multi-hop FICON configurations.
Note: The term “switch” is used to refer to a Brocade hardware platform (switch, director, or backbone)
unless otherwise indicated.
Cascaded FICON allows Brocade customers tremendous flexibility and the potential for fabric cost
savings in their FICON architectures. It is extremely important for business continuity/disaster recovery
implementations. Customers looking at these types of implementations can realize significant potential
savings in their fiber infrastructure costs and channel adapters by reducing the number of channels
for connecting two geographically separate sites with high-availability FICON connectivity at increased
distances.
We will revisit cascaded and multihop FICON in more detail later in the chapter. For now, let’s discuss
switched versus direct-attached architectures with a focus on z High-Performance FICON (zHPF).
IBM Z delivered zHPF in 2008 on the IBM z10 EC platform and has been strategically investing in this
technology ever since the initial delivery. The raw bandwidth of a FICON Express16S+ channel using zHPF
is 4.2 times greater than that of a FICON Express16S+ channel using original FICON capabilities. The raw I/O per
second (IOPS) capacity of the FICON Express16S+ using zHPF is 300,000 IOPS, which is 13 times greater
than that of a FICON Express16S+ channel using default FICON capabilities.
FICON Express16S+ channels using default FICON on z14 processors can have up to 64 concurrent I/Os
(open exchanges) to different devices. FICON Express16S+ channels running zHPF can have up to 750
concurrent I/Os on the z14 processor family. Only when a director or switch is used between the host and
storage device can the true performance potential inherent in these channel bandwidth and I/O processing
gains be fully realized.
The 40 buffer credits per port on a FICON Express8 or FICON Express8S channel card and the 90 buffer
credits on the FICON Express16S+ channel card can support up to 10 km of distance for full-frame size
I/Os (2 KB frames). A switched architecture allows organizations to overcome the buffer credit limitations on
the FICON channel card for workloads with less than full-size frames. Depending upon the specific model,
Brocade FICON directors and switches can have more than 1,300 buffer credits available per port for
long-distance connectivity.
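A rough feel for the relationship between buffer credits, frame size, and distance comes from the classic keep-the-pipe-full estimate below. The constants are assumptions (about 5 microseconds per kilometer of propagation in fiber, a full frame of roughly 2148 bytes including headers, about 800 MBps of payload at 8 Gbps); this is a sanity-check sketch, not a Brocade sizing tool.

```python
import math

# Rough buffer-credit estimate: one credit covers one outstanding
# frame, so enough credits are needed to cover the round trip
# (frame out, R_RDY acknowledgment back) at full line rate.

PROP_DELAY_S_PER_KM = 5e-6  # assumed one-way propagation delay in fiber

def credits_needed(distance_km: float, frame_bytes: int, data_rate_bytes_s: float) -> int:
    """Credits required to keep a link streaming at full rate."""
    frame_time = frame_bytes / data_rate_bytes_s        # serialization time per frame
    round_trip = 2 * distance_km * PROP_DELAY_S_PER_KM  # frame out, R_RDY back
    return math.ceil(round_trip / frame_time)

# 10 km at 8 Gbps (~800 MBps payload), full ~2 KB frames:
print(credits_needed(10, 2148, 800e6))  # ~38, in line with 40 credits per port
```

Halve the frame size and the credit requirement roughly doubles, which is why workloads with small frames exhaust channel-card credits well short of 10 km.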
As a general rule, FICON Express16S+ channels offer better performance, in terms of IOPS or bandwidth,
than the storage host adapter ports to which they are connected. Therefore, a direct-attached zHPF storage
architecture will typically see very low channel utilization rates. To overcome this issue, fan-in and fan-out
storage network designs are used.
A switched zHPF architecture allows a single channel to fan-out to multiple storage devices through
switching, improving overall resource use. This can be especially valuable if an organization’s environment
has newer zHPF channels, such as FICON Express16S+, but older tape drive technology. Figure 3 illustrates
how a single FICON channel can concurrently keep several tape drives running at full-rated speeds. The
actual fan-out numbers based on tape drives will depend on the specific tape drive and control unit. It is
not unusual, however, to see a FICON Express16S channel fan-out from a switch to five to six tape drives (a
1:5 or 1:6 fan-out ratio). The same principles apply for fan-out to direct-access storage device (DASD) arrays.
The exact fan-out ratio is dependent on the DASD array model and host adapter capabilities for IOPS and/or
bandwidth.
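The fan-out arithmetic above amounts to dividing channel throughput by device throughput. The device rates below are hypothetical examples, not specifications for any particular drive or array.

```python
# Back-of-the-envelope fan-out estimate: how many devices one channel
# can feed at their full-rated speed. The 360 and 250 MBps device
# figures are hypothetical examples only.

def fan_out_ratio(channel_mb_s: float, device_mb_s: float) -> int:
    """Whole number of devices a channel can drive concurrently at full rate."""
    return int(channel_mb_s // device_mb_s)

# A ~1600 MBps FICON Express16S channel against two hypothetical drive speeds:
print(fan_out_ratio(1600, 360))  # 4
print(fan_out_ratio(1600, 250))  # 6
```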
Figure: A cascaded FICON configuration: a z14 connected through two cascaded FICON directors to storage over fiber cable. Per-CHPID throughput (transmit and receive, at roughly 95 percent of link bandwidth) scales from 380 MBps at 2 Gbps and 760 MBps at 4 Gbps to 1520 MBps at 8 Gbps, 1900 MBps at 10 Gbps, and 3040 MBps at 16 Gbps; control unit links run at 2, 4, 8, or 16 Gbps.
With a switched architecture, failures are localized to only the affected zHPF channel interface or the
control unit interface, not both. The non-failing side remains available, and if the storage side has not
failed, other zHPF channels can still access that host adapter port through the switch or director (see Figure
4). This failure isolation, combined with fan-in and fan-out architectures, allows the most robust storage
architectures, minimizing downtime and maximizing availability.
Figure 4: Failure isolation. In a direct-attached configuration, a cable or SFP failure fails the entire CHPID path; in a switched configuration, one failure affects only one resource.
In addition, best-practice storage architecture designs include room for growth. With a switched zHPF
architecture, adding devices such as tape is much easier. Simply connect the new control unit to the
switch. This eliminates the need to open the channel cage in the mainframe to add new channel interfaces,
reducing both capital and operational costs. This also gives managers more flexible planning options when
upgrades are necessary, since the urgency of upgrades is lessened.
Over the past several years, IBM has announced a series of technology enhancements that require the use
of switched zHPF. These include:
IBM announced support for NPIV on Linux for IBM z Systems in 2005. Today, NPIV is supported on the
System z9, z10, z196, z114, zEC12, zBC12, z13, z14 and LinuxONE server platforms. Until NPIV was
supported on IBM Z, adoption of Linux on the mainframe had been relatively slow. NPIV allows for full
support of LUN masking and zoning by virtualizing the Fibre Channel identifiers. This in turn allows each
Linux on Z operating system image to appear as if it has its own individual Host Bus Adapter (HBA)—when
those images are in fact sharing Fibre Channel Protocol (FCP) channels. Since IBM began supporting
NPIV on the IBM Z, adoption of Linux on IBM Z has grown significantly—to the point where IBM believes
a significant percent of MIPS shipping on new IBM z Systems machines are for Linux on IBM z Systems
implementations. Implementation of NPIV on IBM z Systems requires a switched architecture. With IBM
z13, IBM has announced support for up to 8,000 virtual machines in one system, and has increased the
number of virtual images supported per FCP channel from 32 to 64.
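Conceptually, NPIV gives each Linux image its own virtual initiator identity behind one shared physical port, which is what makes per-image zoning and LUN masking possible. The sketch below models only that idea; the WWPN naming scheme and class layout are invented for illustration, and only the 64-image limit comes from the text.

```python
# Conceptual model of NPIV: many virtual initiators, each with its own
# WWPN, sharing one physical FCP channel port. Illustrative only.

MAX_IMAGES_PER_CHANNEL = 64  # current IBM-supported limit per FCP channel

class FcpChannel:
    def __init__(self, physical_wwpn: str):
        self.physical_wwpn = physical_wwpn
        self.virtual_wwpns = {}  # image name -> virtual WWPN

    def add_image(self, image: str) -> str:
        """Give a Linux image its own virtual WWPN on this shared port."""
        if len(self.virtual_wwpns) >= MAX_IMAGES_PER_CHANNEL:
            raise RuntimeError("FCP channel image limit reached")
        # Invented WWPN scheme: vary the low byte per image.
        wwpn = f"c0:50:76:00:00:00:00:{len(self.virtual_wwpns):02x}"
        self.virtual_wwpns[image] = wwpn
        return wwpn

ch = FcpChannel("50:05:07:60:00:00:00:01")
print(ch.add_image("linux01"))  # each image can now be zoned and LUN-masked
print(ch.add_image("linux02"))  # individually, as if it had its own HBA
```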
Dynamic Channel Path Management (DCM) is another feature that requires a switched zHPF architecture.
DCM provides the ability to have IBM z Systems automatically manage zHPF I/O paths connected to DASD
subsystems in response to changing workload demands. Use of DCM helps simplify I/O configuration
planning and definition, reduces the complexity of managing I/O, dynamically balances I/O channel
resources, and enhances availability. DCM can best be summarized as a feature that allows for more flexible
channel configurations— by designating channels as “managed”—and proactive performance management.
DCM requires a switched zHPF architecture because topology information is communicated through the
switch or director. The zHPF switch must have a Control Unit Port (CUP) license, and be configured or
defined as a control unit in the Hardware Configuration Definition (HCD) Sysgen.
IBM z/OS FICON Discovery and Auto-Configuration (zDAC) is the latest technology enhancement for zHPF.
IBM introduced zDAC as a follow-on to an earlier enhancement in which the zHPF channels log into the Fibre
Channel name server on a zHPF director. zDAC enables the automatic discovery and configuration of zHPF-
attached DASD and tape devices. Essentially, zDAC automates a portion of the HCD Sysgen process. zDAC
uses intelligent analysis to help validate the IBM z Systems and storage definitions’ compatibility, and uses
built-in best practices to help configure for High Availability (HA) and avoid single points of failure.
zDAC is transparent to existing configurations and settings. It is invoked and integrated with the z/OS HCD
and z/OS Hardware Configuration Manager (HCM). zDAC also requires a switched zHPF architecture.
IBM introduced support for zHPF in October 2008, and has made several enhancements to zHPF, most
recently with the 2015 announcement and GA of zHPF Extended Distance II. While not required for zHPF,
a switched architecture is recommended. Finally, organizations that implement NPIV with a switched zHPF
architecture can realize massive consolidation benefits in their Linux on IBM z Systems implementation.
They can realize even greater cost savings by implementing a Protocol Intermix (PIM) SAN.
Massive consolidation
With NPIV support on IBM Z, server and I/O consolidation is compelling. IBM undertook a well-publicized
project at its internal data centers called Project Big Green. Project Big Green consolidated 3900 open
systems servers onto 30 z Systems mainframes running Linux. IBM’s Total Cost of Ownership (TCO)
savings—taking into account footprint reductions, power and cooling, and management simplification costs—
was nearly 80 percent for a five-year period. These types of TCO savings are why nearly 30 percent of new
IBM mainframe processor shipments are now being used for Linux.
Implementation of NPIV requires connectivity from the FICON (FCP) channel to a switching device (director
or smaller port-count switch) that supports NPIV. A special microcode load is installed on the FICON channel
to enable it to function as an FCP channel. NPIV allows the consolidation of up to 255 Linux on IBM z
Systems images (“servers”) behind each FCP channel, using one port on a channel card and one port on
the attached switching device for connecting these virtual servers. This enables massive consolidation of
many HBAs, each attached to its own switch port in the SAN (see Figure 6). IBM currently supports up to 64
Linux images per FCP channel, and 8,000 total images per host. Although this level of I/O consolidation was
possible prior to NPIV support on IBM Z, implementing LUN masking and zoning in the same manner as with
open systems servers or SAN or storage was not possible prior to the support for NPIV with Linux on IBM Z.
NPIV implementation on IBM Z has also been driving consolidation and adoption of PIM for distributed and
open systems and mainframes (FICON). While IBM has supported PIM in IBM Z environments since 2003,
adoption rates were low until NPIV implementations for Linux on IBM Z picked up with the introduction of
System z10 in 2008. Enhanced segregation and security beyond simple zoning became possible through
switch partitioning, virtual fabrics, or logical switches. With a large number of new mainframes being
shipped for use with Linux on IBM Z, it is safe to say that a significant number of mainframe environments
are now running a shared PIM environment.
Using enhancements in switching technology, performance, and management, PIM users can now fully
populate the latest high-density directors with minimal or no oversubscription. They can use management
capabilities such as virtual fabrics or logical switches to fully isolate open systems ports and FICON ports
in the same physical director chassis. Rather than having more partially populated switching platforms
that are dedicated to either open systems or mainframe/FICON, PIM allows for consolidation onto fewer
switching devices, reducing management complexity and improving resource use. This, in turn, leads
to reduced operating costs, and a lower TCO for the storage network. It also allows for a consolidated,
simplified cabling infrastructure.
Switched zHPF avoids the channel card buffer credit limitations described earlier. zHPF directors and switches
have sufficient buffer credits available on their ports to stream frames at full-line performance rates with
no bandwidth degradation.
IT organizations that implement a cascaded zHPF configuration between sites can, with the latest zHPF
director platforms, stream frames at 16 Gbps rates with no performance degradation for sites that are
100 km apart. This data traffic can also be compressed—and even encrypted—while traversing the network
between sites, allowing IT to securely move more data more quickly (see Figure 7).
Switched zHPF technology also allows organizations to take advantage of hardware-based FICON protocol
acceleration or emulation techniques for tape (reads and writes), as well as with zGM (z/OS Global Mirror,
formerly known as XRC, or Extended Remote Copy). This emulation technology—available on standalone
extension switches or as a blade in FICON directors—allows the channel programs to be acknowledged
locally at each site and avoids the back-and-forth protocol handshakes that normally travel between remote
sites. It also reduces the impact of latency on application performance and delivers local-like performance
over unlimited distances. In addition, this acceleration and emulation technology optimizes bandwidth
use. Bandwidth efficiency is important because it is typically the most expensive budget component in an
organization’s multisite disaster recovery or business continuity architecture. Anything that can be done to
improve the use or reduce the bandwidth requirements between sites would likely lead to significant TCO
savings.
Direct-attached zHPF storage also typically results in underused channel card ports and host adapter ports
on DASD arrays. FICON Express16S channels can comfortably perform at high channel use rates, yet a direct-
attached storage architecture typically sees channel use rates of 10 percent or less. As illustrated in Figure
8, using zHPF directors or switches allows organizations to maximize channel use.
Figure 8: Switched FICON drives improved channel use, while preserving Channel Path Identifiers (CHPIDs)
for growth.
It also is very important to keep traffic for tape drives streaming, and avoid stopping and starting the tape
drives, as this leads to unwanted wear and tear of tape heads, cartridges, and the tape media itself. This is
accomplished by using FICON acceleration and emulation techniques as described earlier. A configuration
similar to the one shown in Figure 9 can also be implemented. Such a configuration requires solid analysis
and planning, but it will pay dividends for an organization’s FICON tape environment.
Finally, switches facilitate fan-in—allowing different hosts and LPARs whose I/O subsystems are not shared
to share the same assets. While some benefit may be realized immediately, the potential for value in future
equipment planning can be even greater. With the ability to share assets, equipment that would be too
expensive for a single environment can be deployed in a cost-saving manner.
The most common example is to replace tape farms with virtual tape systems. By reducing the number
of individual tape drives, maintenance (service contracts), floor space, power, tape handling, and cooling
costs are reduced. Virtual tape also improves reliable data recovery, allows for significantly shorter Recovery
Time Objectives (RTO) and nearer Recovery Point Objectives (RPO), and offers features such as peer-to-
peer copies. However, without the ability to share these systems, it may be difficult to amass sufficient cost
savings to justify the initial cost of virtual tape. And the only practical way to share these standalone tape
systems or tape libraries is through a switch.
With disk subsystems, in addition to sharing the asset, it is sometimes desirable to share the data across
multiple systems. The port limitations on DASD may prohibit or limit this capability using direct-attached
(point-to-point) FICON channels. Again, the switch can provide a solution to this issue. Even when there
is no need to share devices during normal production, this capability can be valuable in the event of a
failure. Data sets stored on tape can quickly be read by CPUs picking up workload that is already attached
to the same switch as the tape drives. Similarly, data stored on DASD can be available as soon as a fault is
determined.
Switch features such as preconfigured port prohibit and allow matrix tables can ensure that access
intended only for a disaster scenario is prohibited during normal production. Investments made in switches
for disaster recovery and business continuance are likely to pay the largest dividends. Having access to
alternative resources and multiple paths to those resources can result in significant savings in the event of
a failure.
As is shown in Figure 10, a logical connection over a cascaded FICON director configuration incorporates
three end-to-end links. The first link connects the FICON channel N_Port (node port) to an F_Port (fabric
port) on the FICON director. The second link connects an E_Port (expansion port) on the local director to an
E_Port on the remote director. An inter-switch link (ISL) is a link between two switches, E_Port-to-E_Port. The
ports of the two switches automatically come online as E_Ports once the login process finishes successfully.
Finally, the third link connects the F_Port on the director to an N_Port on the subsystem. Figure 11
illustrates this same concept for a multihop cascaded configuration.
With the growing trend toward increased usage of Linux on IBM z Systems, and the cost advantages of
NPIV implementations and PIM SAN architectures, direct-attached storage in a mainframe environment
is becoming a thing of the past. The advantages of a switched zHPF infrastructure are simply too great to
ignore.
HCD defines the relationship between channels and directors (by switch ID) and specific switch port (as
well as the destination switch ID for cascaded connections) for the switch to director segment of the link.
However, HCD does not define the ISL connections. While this may initially appear to be surprising based on
a reader’s prior experience with the exacting specificity required by HCD, it means that the management of
the traffic over ISLs is controlled exclusively by the directors. Hence, additional ISL bandwidth can be added
to a configuration without any modification to the environment’s HCD definitions. HCD simply assumes that
the links are present and requires that 2-byte addressing be employed to define the connectivity to storage
subsystems.
In cascaded FICON configurations, the Domain ID of the entry switch is different from that of the switch where the control unit is attached. Therefore, cascading requires a 2-byte link address. Any
time a 2-byte link address is defined on a channel, all link addresses must be 2-byte link addresses. The
first byte is the switch ID and the second byte is the port to which the storage subsystem is connected.
During initialization, the switches identify their peers and create a routing table so that frames for remote
subsystems may be forwarded to the correct director. This configuration is supported for both disk and tape,
with multiple processors, disk subsystems, and tape subsystems sharing the ISLs between the directors.
Multiple ISLs between the directors are also supported (Artis & Guendert, 2006). Cascading between a
director and a switch—for example, from a Gen 6 Brocade X6 FICON director to a Brocade G620 Switch—is
also supported.
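The 2-byte link address layout described above (switch Domain ID in the high byte, destination port in the low byte) can be sketched in a few lines of Python. The function names and values here are illustrative only:

```python
def make_link_address(domain_id: int, port: int) -> int:
    """Compose a 2-byte FICON link address: the high byte is the
    switch Domain ID, the low byte is the destination port."""
    assert 0 <= domain_id <= 0xFF and 0 <= port <= 0xFF
    return (domain_id << 8) | port

def split_link_address(link_address: int) -> tuple:
    """Decompose a 2-byte link address into (Domain ID, port)."""
    return (link_address >> 8) & 0xFF, link_address & 0xFF

# A control unit on port 0x1B of the director with Domain ID 0x21:
addr = make_link_address(0x21, 0x1B)
print(f"{addr:04X}")             # 211B
print(split_link_address(addr))  # (33, 27)
```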
• Traditional ISLs
• Inter-Chassis Links (ICLs), which are used only with the Brocade families of FICON directors
• Fibre Channel over Internet Protocol (FCIP)
A cascaded configuration requires 2-byte addressing. Two-byte addressing requires a list of authorized
switches. This authorization feature, called fabric binding, is available through the Secure Access Control
List feature. The fabric binding policy allows a predefined list of switches (domains) to exist in the fabric and
prevents other switches from joining the fabric.
There are hardware and software requirements specific to cascaded FICON (Brocade Communications
Systems, 2008):
• The FICON directors themselves must be from the same vendor (that is, both should be from Brocade).
• The mainframes must be one of the following machines: z800, z890, z900, z990, z9 BC, z9 EC, z10 BC, z10 EC, z196, z114, zEC12, zBC12, z13, z13s, or z14. Cascaded FICON requires 64-bit architecture to support the
2-byte addressing scheme. Cascaded FICON is not supported on 9672 G5/G6 mainframes.
• z/OS version 1.4 or later, or z/OS version 1.3 with required PTFs/MCLs to support 2-byte link addressing
(DRV3g and MCL [J11206] or later) is required.
• Two-byte addressing is used, which has the following requirements:
–– E_D_TOV must be the same on all switches in the fabric (typically this is not changed from the
default).
–– R_A_TOV must be the same on all switches in the fabric (typically this is not changed from the
default).
–– Insistent Domain ID is used.
–– Fabric binding (strict Switch Connection Control [SCC] policy) is used.
In Figure 12, all hosts have access to all of the disk and tape subsystems at both locations. The host
channels at one location are extended to the Brocade X6 (FICON) platforms at the other location, to allow for
cross-site storage access. If each line represents two FICON channels, then this configuration needs a total
of 16 extended links; and these links are utilized only to the extent that the host has activity to the remote
devices.
The most obvious benefit of cascaded versus non-cascaded FICON is the reduction in the number of links
across the Wide Area Network (WAN). Figure 13 shows a cascaded, two-site FICON environment.
In this configuration, if each line represents two channels, only four extended links are required. Since
FICON is a packet-switched protocol (versus the circuit-switched ESCON protocol), multiple devices can
share the ISLs, and multiple I/Os can be processed across the ISLs at the same time. This allows for the
reduction in the number of links between sites and allows for more efficient utilization of the links in place.
In addition, ISLs can be added as the environment grows and traffic patterns dictate.
This is the key way in which a cascaded FICON implementation can reduce the cost of the enterprise
architecture. In Figure 13, the cabling schema for both intersite and intrasite has been simplified. Fewer
intrasite cables translate into decreased cabling hardware and management costs. This also reduces
the number of FICON adapters, director ports, and host channel card ports required, thus decreasing the
connectivity cost for mainframes and storage devices. In Figure 6, the sharing of links between the two
sites reduces the number of physical channels between sites, thereby lowering the cost by consolidating
channels and the number of director ports. The faster the channel speeds between sites, the greater the intersite cost savings from this consolidation; with 16 Gbps and 32 Gbps FICON available, this option becomes even more attractive.
Another benefit to this approach, especially over long distances, is that the Brocade FICON director typically
has many more buffer credits per port than do the processor and the disk or tape subsystem cards. More
buffer credits allow for a link to be extended to greater distances without significantly impacting response
times to the host.
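As a rough illustration of why buffer credits matter at distance, the following sketch estimates the credits needed to keep a link streaming at full utilization. It assumes roughly 5 µs/km propagation delay in fiber, full-size frames, and 10 bits per byte on the wire (8b/10b encoding; a slight overestimate for 64b/66b links). Real sizing depends on the actual frame-size mix:

```python
import math

def buffer_credits_needed(distance_km: float, speed_gbps: float,
                          frame_bytes: int = 2148) -> int:
    """Rule-of-thumb BB credit estimate: enough credits to cover the
    round-trip time of the link, measured in frame-serialization times."""
    round_trip_us = 2 * distance_km * 5.0           # ~5 us/km each way
    serialization_us = frame_bytes * 10 / (speed_gbps * 1000)
    return math.ceil(round_trip_us / serialization_us)

print(buffer_credits_needed(50, 2))    # 47  (roughly one credit per km at 2 Gbps)
print(buffer_credits_needed(50, 16))   # 373 (the requirement scales with link speed)
```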
• The number of ISLs between the two cascaded FICON directors and the routing of traffic across ISLs
• The number of FICON/FICON Express channels whose traffic is being routed across the ISLs
• The ISL link speed
• Contention for director ports associated with the ISLs
• The nature of the I/O workload (I/O rates, block sizes, use of data chaining, and read/write ratio)
• The distances of the paths between the components of the configuration (the FICON channel links from
processors to the first director, the ISLs between directors, and the links from the second director to the
storage control unit ports)
• The number of switch port buffer credits
The last factor—the number of buffer credits and the management of buffer credits—is typically the factor
that is examined most carefully, and the factor that is most often misunderstood.
1. Between the FICON channel card on the mainframe (known as an N_Port) and the FICON director’s FC
adapter card (which is considered an F_Port)
2. Between the two FICON directors via E_Ports (the link between E_Ports on the switches is an ISL)
3. The link from the F_Port to a FICON adapter card in the control unit port (N_Port) of the storage device
The physical paths are the actual FC links connected by the FICON switches that provide the physical
transmission path between a channel and a control unit. Note that the links between the cascaded FICON
switches may be multiple ISLs, both for redundancy and to ensure adequate I/O bandwidth.
Figure 14 shows that the FC-FS 24-bit FC port address identifier is divided into three fields:
1. Domain
2. Area
3. AL_Port
In a cascaded FICON environment, 16 bits of the 24-bit address must be defined, for the IBM Z server to
access a FICON control unit. The FICON switches provide the remaining byte used to make up the full 3-byte
FC port address of the control unit being accessed. The AL_Port (arbitrated loop) value is not used in FICON
and is set to a constant value.
The IBM Z domain and area fields are referred to as the F_Port’s port address field. The address field is
a 2-byte value, and when defining access to a control unit attached to this port using the IBM Z Hardware Configuration Definition (HCD) or IOCP, the port address is referred to as the link address. Figure 7 further
illustrates this, and Figure 8 shows an example of a cascaded FICON IOCP Sysgen.
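The relationship between the full 24-bit FC port identifier and the 2-byte link address coded in HCD/IOCP can be sketched as follows (the values are illustrative):

```python
def fc_port_id(domain: int, area: int, al_port: int = 0x00) -> int:
    """Build the 24-bit FC port identifier: Domain | Area | AL_Port.
    FICON does not use arbitrated loop, so AL_Port stays constant."""
    return (domain << 16) | (area << 8) | al_port

def hcd_link_address(fc_id: int) -> int:
    """The 2-byte link address coded in HCD/IOCP corresponds to the
    Domain and Area fields; the switch supplies the final byte."""
    return (fc_id >> 8) & 0xFFFF

pid = fc_port_id(0x6A, 0x2F)
print(f"{pid:06X}")                    # 6A2F00
print(f"{hcd_link_address(pid):04X}")  # 6A2F
```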
The connections between the two directors are established through the Exchange of Link Parameters
(ELP). The switches pause for a FLOGI, and assuming that the device is another switch, they initiate an ELP
exchange. This results in the formation of the ISL connection(s).
In a cascaded FICON configuration, three additional steps occur beyond the normal FICON switched point-to-
point communication initialization. The three basic steps are (Singh, Grau, Hoyle, & Kim, 2012):
1. If a 2-byte link address is found in the control unit macro in the Input/Output Configuration Data Set (IOCDS), a Query Security Attribute (QSA) command is sent by the host to check with the fabric controller on the directors to determine if the directors have the high-integrity fabric features installed.
2. The fabric controller returns a QSA response indicating whether the high-integrity fabric features (fabric binding and insistent Domain ID) are active.
3. If the response is affirmative, which indicates that a high-integrity fabric is present, the login continues. If not, the login stops, and the ISLs are treated as invalid (which is not a good thing).
What does high-integrity fabric architecture and support entail (Artis & Guendert, 2006)?
Support of Insistent Domain IDs: This means that a FICON switch is not allowed to automatically change
its address when a duplicate switch address is added to the enterprise fabric. Intentional manual operator
action is required to change a FICON director’s address. Insistent Domain IDs prohibit the use of dynamic
Domain IDs, ensuring that predictable Domain IDs are being enforced in the fabric. For example, suppose a
FICON director has this feature enabled, and a new FICON director is connected to it via an ISL in an effort
to build a cascaded FICON fabric. If this new FICON director attempts to join the fabric with a Domain ID that
is already in use, the new director is segmented into a separate fabric. The director also makes certain that
duplicate Domain IDs are not used in the same fabric.
Fabric Binding: Fabric binding enables companies to allow only FICON switches that are configured to
support high-integrity fabrics to be added to the FICON SAN. Brocade switching devices perform this
through configuring the SCC policy and then placing it in strict mode. The FICON directors that you want to
connect to the fabric must be added to the fabric membership list of the directors already in the fabric. This
membership list is composed of the “acceptable” FICON director’s World Wide Name (WWN) and Domain
ID. Using the Domain ID ensures that there are no address conflicts—that is, duplicate Domain IDs—when
the fabrics are merged. The two connected FICON directors then exchange their membership list. This
membership list is a Switch Fabric Internal Link Service (SW_ILS) function, which ensures a consistent and
unified behavior across all potential fabric access points.
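The membership check that fabric binding performs can be pictured with a small sketch. The WWNs and the data structure below are purely illustrative and are not the Fabric OS implementation:

```python
# Hypothetical WWN/Domain ID pairs on the preconfigured membership list.
MEMBERSHIP = {
    ("10:00:00:05:1E:00:00:01", 0x21),
    ("10:00:00:05:1E:00:00:02", 0x22),
}

def may_join(wwn: str, domain_id: int, active_domains: set) -> bool:
    """A candidate switch joins the fabric only if its WWN/Domain ID
    pair is on the membership list and its (insistent) Domain ID does
    not collide with a switch already in the fabric; otherwise it is
    segmented into a separate fabric."""
    return (wwn, domain_id) in MEMBERSHIP and domain_id not in active_domains

print(may_join("10:00:00:05:1E:00:00:02", 0x22, {0x21}))         # True
print(may_join("10:00:00:05:1E:00:00:03", 0x23, {0x21}))         # False: not on the list
print(may_join("10:00:00:05:1E:00:00:02", 0x22, {0x21, 0x22}))   # False: duplicate Domain ID
```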
In a cascaded switch configuration, FICON channels use an Extended Link Service Query Security Attributes
(ELS QSA) function to determine whether they are connected to a high-integrity fabric. Each switch in a high-
integrity fabric must have the following attributes configured:
When validation is successful, High-Integrity Fabric (HIF) mode will automatically enable when the firmware
installs. If a FICON channel tries to connect to a fabric switch without these attributes configured, the
channel segments from the fabric. With HIF mode not configured, new channel bring-up is not possible in
v7.3.0 or later firmware, including v8.1.0.
If HIF mode is not configured, the QSA reply reports all of the security attributes as failed, even though they are actually configured. Therefore, you must configure HIF mode when upgrading to v7.3.0 or later firmware.
Once these attributes are configured, you must enable the switch in HIF mode using the Fabric OS configure
command. This verifies that the required features are set and locks the configuration to ensure connection
with the FICON channel. Once the HIF mode is set, you cannot change the IDID, fabric-wide consistency
policy, and SCC policy without disabling HIF mode.
• Once at v7.3.0 or later firmware, you must enable HIF mode to enable FMS mode.
• Before a Fabric OS downgrade, you must disable HIF mode. Note that this operation is not recommended
for FICON and should be used only when downgrading firmware. You will receive a warning to this effect
if FMS mode is enabled. If HIF is disabled, any new channel initialization will fail as the Query Security
Attribute (QSA) reply from the switch to the channel will fail. The existing channel will continue to operate.
• Before a Fabric OS firmware upgrade, be sure that the switch has the appropriate IDID, fabric-wide consistency policy, and SCC policy settings enabled.
• FICON users should verify that HIF is enabled after the upgrade to v7.3.0 or later.
When two chassis platforms are interconnected by ICLs, each chassis requires a unique domain and is
managed as a separate switch. The ICL ports appear as regular ports, with some restrictions. All port
parameters associated with ICL ports are static, and all portCfg commands are blocked from changing any
of the ICL port parameters, except for EX_Port and FEC port parameters. The only management associated
with ICL ports and cables is monitoring the status of the LEDs on the ICL ports and any maintenance if the
Attention LED is blinking yellow. The ICL ports are managed as E_Ports. When you connect two chassis platforms, the trunking and Quality of Service (QoS) features are supported. It should be noted that a Brocade
trunking license is not required for trunking on ICL connections.
Enterprise ICL (EICL) license enforcement applies to Inter-Fabric Links (IFLs) over ICL links on Gen 5 platforms and does not apply to Gen 6 platforms. In a mixed Gen 5 and Gen 6 environment, the edge fabric additionally counts the ICL IFL links toward its license enforcement. When a Gen 5 chassis is connected to a Gen 6 chassis, the Gen 5 takes precedence and imposes the EICL license limit on the links. Prior to FOS 8.0.1, only ICL ISLs were counted for license enforcement. With only Gen 6 platforms, there is no restriction on the number of switches that can be connected via ICL. With only Gen 5 platforms and no EICL license, only three other chassis can be directly connected (for a total of four).
On 16 Gbps Gen 5 platforms with an EICL license, a maximum of nine other chassis can be connected (for a total of 10). The limit on the number of chassis depends only on the physical chassis, not on the logical switches.
• Only the QSFPs that have the part number “57-1000310-01” and the serial number starting with
“HME114323P00044” support 2 km on ICLs. The second character (“M”) in this serial number indicates
that the QSFP supports 2 km distance. You can also use the sfpShow command to identify the QSFPs
that support 2 km on ICL ports.
• The new credit model does not affect the HA configuration.
• You cannot downgrade from Fabric OS 7.3.0 to any earlier version if either of the following conditions is
true:
–– When you have plugged in the QSFPs that support 2 km on one or more ICL ports.
–– When you have configured buffer credits using the portCfgEportCredits command on one or more ICL ports.
• The portCfgEportCredits configuration cannot have fewer than five buffer credits or more than 16 buffer credits per VC on Gen 5 core blades. Gen 6 core blades cannot have fewer than five buffer credits or more than 160 buffer credits per VC. If there are insufficient buffer credits available, the default configuration is
retained and the message Failed: already exceeds Buffer credits allowed is displayed.
• Only a maximum of 10 ICL ports can be configured with 2 km QSFPs with 16 buffer credits per VC on Gen
5 core blades.
• Only a maximum of 14 ICL ports can be configured with 2 km QSFPs with 13 buffer credits per VC on Gen
5 core blades. On the Gen 5 core blades, when you configure the 15th port using portCfgEportCredit, the
message Failed: already exceeds Buffer credits allowed is displayed. The remaining two ports can have
only 34 buffer credits. To remedy this, disable some ports and configure the remaining ports using 13
buffer credits per VC.
• On Gen 6 core blades, only a maximum of eight ports can be configured for 2 km link with all frame sizes
at line rate. The remaining eight ports can be configured for 2 km link with full size frames at line rate.
–– For each port that requires 2 km line rate for all frame sizes, assign 96 buffers using the
portCfgEportCredits command.
–– For each port that requires 2 km line rate for FULL frame sizes, assign 37 buffers using the
portCfgEportCredits command.
• On the Gen 5 core blades, if you are using the portCfgEportCredits command, a maximum of 16 buffer credits can be configured on all the ICL ports when all 16 ICL ports are in the disabled state. After the ports are enabled, they come up with the configured buffer credits for as long as buffer credits are available; the remaining ports come up in degraded mode. If those remaining ports are QoS-enabled, they come up without QoS enabled. If all the ICL ports are QoS enabled, only 448 buffer credits are available for 2 km distance support.
The Brocade X6-8 has four port groups on the CR32-8 core blade. The Brocade X6-4 has two port groups
on the CR32-4 core blade. Each port group has four QSFP connectors, and each QSFP connector maps to
four user ports. A minimum of four ICL ports (two on each core blade) must be connected between each
chassis pair. The dual connections on each core blade must reside within the same ICL trunk boundary on
the core blades. If more than four ICL connections are required between a pair of X6 chassis, additional ICL
connections should be added in pairs, one on each core blade with each additional pair residing within the
same trunk boundary.
For HA, you should have at least two ICLs from each core blade. The maximum number of ICLs between two
X6-4 chassis or between a X6-8 and a X6-4 is 16. The maximum number of ICLs between two X6-8 chassis
is 32.
Because the FSPF routing logic uses only the first 16 paths to come online, only 16 ICLs are utilized. With
Virtual Fabrics, however, you can define two logical switches on the chassis and have 16 ICLs in each.
Brocade recommends that you have a maximum of eight ICLs connected to the same neighboring domain,
with a maximum of four ICLs from each core blade. The ICLs can connect to either core blade in the
neighboring chassis. The QSFP ICLs do not need to be cross-connected. Having QSFP ICLs and ISLs in the same logical switch connected to the same neighboring switch is not supported. This is a topology restriction with ICLs and any ISLs that are E_Ports or VE_Ports. If Virtual Fabrics is enabled, you can have ICLs and ISLs between a pair of chassis if the ICLs are in a different logical switch than the ISLs.
Figure 17 shows an example of an ICL configuration. For additional details on supported ICL configurations
and topologies, please refer to the Brocade Fabric OS 8.1.x Administration Guide.
Instead, implement FICON cascaded path management, which automatically responds to changing I/O
workloads and provides a simple, labor-free but elegant solution to a complex management problem.
Brocade has a detailed Cascaded FICON Best Practices Guide that describes the various tools, techniques, and best practices for managing FICON ISLs.
The assignment of traffic between directors over the intervening ISLs is completely controlled by the
directors. At the inception of FICON, the only mechanism available was the Fabric Shortest Path First (FSPF)
protocol. FSPF is a link state selection protocol that directs traffic along the shortest path between the
source and destination based upon the link cost. FSPF detects link failures, determines the shortest route
for traffic, updates the routing table, provides fixed routing paths within a fabric, and maintains correct
ordering of the frames. Once established, FSPF programs the hardware routing tables for all active ports on
the switch.
Every time an ISL is added, the ISL traffic assignments change. While very simple, this technique is not
attractive to an experienced mainframe architect since they cannot prescribe how the paths to a subsystem
are mapped to the ISLs. In the worst case, all of the paths to a subsystem might be mapped to the same
ISL. Moreover, once a path is assigned an ISL, that assignment is persistent. FSPF is the foundation for
all subsequent ISL routing mechanisms that have been introduced since 2003. Therefore, it would be
beneficial to have a solid understanding of how FSPF works. The next section of this paper will review the
FSPF protocol.
FSPF is based on a replicated topology database that is present in every switching device in the FICON SAN
fabric. Each switching device uses information in this database to compute paths to its peers via a process
known as path selection. The FSPF protocol itself provides the mechanisms to create and maintain this
replicated topology database. When the FICON SAN fabric is first initialized, the topology database is created
in all operational switches. If a new switching device is added to the fabric, or the state of an ISL changes,
the topology database is updated in all the fabric’s switching devices to reflect the new configuration.
A Link State Record (LSR) describes the connectivity of a switch within the topology database. The topology
database contains one LSR for each switch in the FICON SAN fabric. Each LSR consists of a link state record
header, and one or more link descriptors. Each link descriptor describes an ISL associated with that switch.
A link descriptor identifies an ISL by the Domain_ID and output port index of the “owning” switch and the
Domain_ID and input port index of the “neighbor” switch. This combination uniquely identifies an ISL within
the fabric. LSRs are transmitted during fabric configuration to synchronize the topology databases in the
attached switches. They are also transmitted when the state of a link changes and on a periodic basis to
refresh the topology database.
Associated with each ISL is a value known as the link cost, which reflects the desirability of routing frames via that ISL. The link cost is inversely proportional to the speed of the link: higher-speed links are more desirable transit paths, and therefore they have a lower cost. The topology database has entries for all ISLs in the fabric, enabling a switch to compute its least-cost path to every other switching device in the FICON SAN fabric from the information contained in its copy of the database.
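The least-cost computation over the replicated topology database is, in essence, a shortest-path algorithm. A minimal sketch using Dijkstra's algorithm, with Domain IDs as nodes and link costs as edge weights (the data structure is illustrative, not how any switch actually stores its database):

```python
import heapq

def least_cost_paths(topology: dict, source: int) -> dict:
    """Dijkstra over the topology database.
    topology maps a Domain ID to {neighbor Domain ID: link cost}."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        cost, dom = heapq.heappop(heap)
        if cost > dist.get(dom, float("inf")):
            continue  # stale heap entry
        for nbr, link_cost in topology.get(dom, {}).items():
            new_cost = cost + link_cost
            if new_cost < dist.get(nbr, float("inf")):
                dist[nbr] = new_cost
                heapq.heappush(heap, (new_cost, nbr))
    return dist

# Three-switch fabric: 2 Gbps ISLs (cost 500) and a 1 Gbps ISL (cost 1000).
topo = {
    0x21: {0x22: 500, 0x23: 1000},
    0x22: {0x21: 500, 0x23: 500},
    0x23: {0x21: 1000, 0x22: 500},
}
print(least_cost_paths(topo, 0x21))  # {33: 0, 34: 500, 35: 1000}
```

Note that the direct 1 Gbps ISL to Domain 0x23 and the two-hop path over 2 Gbps ISLs tie at cost 1000, which is exactly the equal-cost situation the path-selection discussion below addresses.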
There are three main functions associated with the FSPF protocol:
1. A Hello protocol to establish two-way communication with a neighbor switch and determine the
connectivity of an ISL
Hello messages are transmitted on a periodic basis on each ISL even after two-way communication is
established. The periodic Hello transmission provides a mechanism to detect a switch or an ISL failure. In
essence, the periodic Hello acts as a “heartbeat” between the switching devices. The time interval between
successive Hellos (in seconds) is defined by a timer called the Hello interval. The number of seconds that a
switching device will wait for a Hello before reverting back to one-way communication is defined by the Dead
interval. Both the Hello and Dead intervals are communicated between switches as part of the Hello itself.
If a switch fails to receive a Hello within the expected time it 1) assumes the neighbor switch is no longer
operational and 2) removes the associated ISL from its topology database. At this time the switch removes
the Domain_ID of the neighbor switch from the Hello messages and the Hello protocol reverts to the one-way
communication state. If a link failure occurs on an ISL, two-way communication is lost. When the ISL is
restored, the switching devices must re-establish two-way communication and synchronize their topology
databases before the ISL may be again used for routing frames.
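The Hello "heartbeat" and Dead-interval timeout can be pictured with a simplified sketch. The 20-second Hello and 80-second Dead intervals used here are the commonly cited FSPF defaults; the class below is illustrative, not switch firmware:

```python
import time

HELLO_INTERVAL = 20   # seconds between Hello transmissions (typical default)
DEAD_INTERVAL = 80    # seconds without a Hello before the neighbor is declared dead

class NeighborState:
    """Track the Hello heartbeat from one neighbor switch on an ISL."""
    def __init__(self):
        self.last_hello = time.monotonic()
        self.two_way = True

    def on_hello(self):
        """A Hello arrived: refresh the heartbeat timestamp."""
        self.last_hello = time.monotonic()
        self.two_way = True

    def check(self):
        """Revert to one-way communication (and the ISL would be removed
        from the topology database) if the Dead interval has elapsed."""
        if time.monotonic() - self.last_hello > DEAD_INTERVAL:
            self.two_way = False
        return self.two_way
```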
While the entire topology database is exchanged during this initial synchronization process, LSRs are also
transmitted during the database maintenance phase to reflect topology changes in the FICON SAN. The
topology database must be updated whenever the state of any ISL changes. The specific events that will
cause such an LSR to be transmitted include:
1. An ISL fails (or the entire switch connected to the ISL fails). A new LSR is transmitted to remove the failed
ISL(s) from the topology database.
2. An ISL reverts to one-way communication status. A new LSR is transmitted to remove the one-way ISL
from the topology database.
3. A new ISL completes the link initialization and the initial database synchronization. An LSR is transmitted
to notify the other switches in the fabric to add the new information to their databases.
Path Selection
Once the topology database has been created and a switch has information about available paths, it can compute its routing table and select the paths that will be used for forwarding frames. Fibre Channel uses a least-cost approach to determine the paths used for routing frames. Each ISL is assigned a link cost metric that reflects the desirability of using that link. The cost metric is based on the speed of the ISL and an administratively applied cost factor. The link cost is calculated using the following formula:
Link Cost = S × (1.0625 × 10^12 ÷ link signaling rate in bits per second)
The factor S is an administratively defined factor that may be used to change the cost of specific links. By
default, S is set to 1. Applying the formula, assuming S is set to 1, the link cost for a 1 Gbps link is 1,000,
while a 2 Gbps ISL cost is 500.
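Under the FC-SW convention (cost = S × 1.0625 × 10^12 divided by the link signaling rate), the values quoted above can be reproduced with a short sketch:

```python
def fspf_link_cost(baud_rate_gbps: float, s: float = 1.0) -> int:
    """FSPF link cost: S x (1.0625e12 / signaling rate in bps).
    S is an administratively defined factor, 1 by default."""
    return round(s * 1.0625e12 / (baud_rate_gbps * 1e9))

print(fspf_link_cost(1.0625))   # 1000  (1 Gbps link)
print(fspf_link_cost(2.125))    # 500   (2 Gbps link)
```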
When there are multiple paths with the same cost to the same destination (which is typical for 99.9%
of cascaded FICON SAN architectures), a switching device must decide how to use these paths. It may
select one path only (not ideal) or it may attempt to balance the traffic among the available paths to avoid
congestion of ISLs. If a path fails, the switch may select an alternate ISL for frame delivery. This is where the
different types of ISL routing come into play. These routing techniques are the subject of the next section.
The path selection algorithms used by the IBM synchronous replication technology, Metro Mirror (previously
known as Peer-to-Peer Remote Copy or PPRC), is based on the link bandwidth. The IBM DS8000 series
DASD array keeps track of outstanding Metro Mirror I/O operations on each port. DS8000 path selection
chooses the next port to schedule work based on the link with the most available bandwidth. Thus, Metro
Mirror traffic should also steer around SAN congestion towards better performing routes. Consider a
configuration with four Metro Mirror links from the primary control unit to the secondary control unit. These
four Metro Mirror ports go to a fabric with eight ISLs. At most, four ISLs will be used. The hashing algorithm
might actually map traffic of two or more of the PPRC ports onto the same ISL resulting in fewer than four
ISLs actually being used. In many cases, static routing policies will do an adequate job spreading the work
across many routes. However, in others, exploitation of the link bandwidth can be suboptimal.
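The least-loaded port selection described above can be sketched very simply, assuming the scheduler tracks outstanding I/Os per port as a proxy for available bandwidth (the port names and counts are illustrative):

```python
def next_mirror_port(outstanding: dict) -> str:
    """DS8000-style port selection sketch: schedule the next Metro
    Mirror operation on the link with the fewest outstanding I/Os."""
    return min(outstanding, key=outstanding.get)

# port1 has the most available bandwidth, so it is chosen next.
print(next_mirror_port({"port0": 12, "port1": 3, "port2": 7, "port3": 9}))  # port1
```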
For another example, in a scenario with 12 ingress ports and 4 ISLs, PBR would assign three ingress ports
to each ISL in a round robin fashion. Ingress ports were assigned to ISLs regardless of whether or not they
would be sending traffic across the ISLs. This is why it was very common to see PBR implementations result
in 50% or more of the total cross site data workloads being on one ISL, and perhaps no data at all on other
ISLs. For large z/OS Global Mirror implementations, this kind of behavior by PBR would result in many
suspends and very poor replication performance.
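The round-robin assignment described above can be reproduced in a few lines (port and ISL names are illustrative). Note how ports are bound to ISLs without regard to which ports actually carry traffic:

```python
def pbr_assign(ingress_ports: list, isls: list) -> dict:
    """Port-based routing sketch: ingress ports are bound to ISLs
    round-robin at login time, regardless of whether they carry traffic."""
    return {port: isls[i % len(isls)] for i, port in enumerate(ingress_ports)}

ports = [f"port{n}" for n in range(12)]
routes = pbr_assign(ports, ["ISL0", "ISL1", "ISL2", "ISL3"])

# If the only busy ports happen to be port0, port4, and port8,
# all of the cross-site traffic lands on ISL0 and the rest sit idle.
print({p: routes[p] for p in ["port0", "port4", "port8"]})
```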
DBR improved things somewhat by making the routing assignments based on source ID/destination ID
(SID/DID), so the assignment of ingress ports to ISLs actually did require the ingress ports to be used for
sending traffic across the ISLs. However, the other primary drawback of static routing policies still applied:
routing can change each time the switch is initialized which leads to unpredictable, non-repeatable results.
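The SID/DID assignment can be sketched as a hash of the source/destination pair. Fabric OS's actual hash is not published; the XOR-modulo below is purely illustrative, but it shows how two active flows can still land on the same ISL:

```python
def dbr_route(sid: int, did: int, isls: list) -> str:
    """Device-based routing sketch: the ISL is chosen by hashing the
    24-bit source and destination IDs, and the choice is fixed for
    the life of the flow."""
    return isls[(sid ^ did) % len(isls)]

ISLS = ["ISL0", "ISL1", "ISL2", "ISL3"]
# Two different flows can still hash onto the same ISL while others sit idle:
print(dbr_route(0x010400, 0x020500, ISLS))  # ISL0
print(dbr_route(0x011400, 0x021500, ISLS))  # ISL0 again
```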
Static routing policies do have some advantages. One outstanding example is provided with how the routing
policies handle the impact of a slow drain device. Static routing has the advantage of limiting the impact
of a slow-drain device or a congested ISL to a small set of ports that are mapped to that specific ISL. If
congestion occurs in an ISL, the IBM z Systems channel path selection algorithm will detect the congestion
through the increasing initial Command Response (CMR) time in the in-band FICON measurement data. The
IBM z Systems channel subsystem begins to steer the I/O traffic away from congested paths and toward
better performing paths using the CMR time as a guide. Additional host recovery actions are also available
for slow-drain devices.
Routes across ISLs are assigned to I/O operations dynamically on a per exchange (per I/O) basis. Loading of
ISLs is applied at the time of the data flow. This provides the most effective mechanism for balancing data
workload traversing available ISLs. Thus, every exchange can take a different path through the fabric. There
are no longer fixed routes from the source ports to the destination ports. ISL resources are utilized equally
and therefore, the ISLs can run at higher utilization rates without incurring queuing delays. High workload
spikes resulting from peak period usage and/or link failures can also be dealt with more easily with FIDR.
One example of such a dynamic routing policy is Brocade Exchange Based Routing (EBR).
EBR is a form of dynamic path selection (DPS). When there are multiple paths to a destination, with DPS the
input traffic is distributed across the different paths in proportion to the bandwidth available on each of the
paths. This improves utilization of all available paths, thus reducing possible congestion on the paths.
Every time there is a change in the network which changes the available paths, the traffic can be
redistributed across the available paths.
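The per-exchange, bandwidth-weighted behavior can be sketched as follows. This is an illustrative model of the idea, not the actual switch implementation; the link names and speeds are assumptions:

```python
import random

random.seed(7)  # deterministic for the example

def ebr_route(isls: dict) -> str:
    """Exchange-based routing sketch: each new exchange (I/O) picks a
    path independently, weighted by path bandwidth, so all equal-cost
    ISLs carry traffic in proportion to their speed."""
    names = list(isls)
    return random.choices(names, weights=[isls[n] for n in names])[0]

# Mixed 16 Gbps and 32 Gbps ISLs: the faster links attract ~2x the exchanges.
links = {"ISL0": 16, "ISL1": 16, "ISL2": 32, "ISL3": 32}
counts = {n: 0 for n in links}
for _ in range(10_000):
    counts[ebr_route(links)] += 1
print(counts)  # every ISL carries traffic; ISL2/ISL3 roughly double ISL0/ISL1
```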
• Potential bandwidth consolidation/ISL consolidation: For many years, Brocade and IBM have
recommended keeping FCP traffic (such as PPRC/Metro Mirror) and FICON traffic on their own ISLs or own group of ISLs; in other words, segregate the traffic types. With FIDR, it is now possible to take advantage
of FIDR and share ISLs among previously segregated traffic. This leads to the obvious consolidation cost
saving in the hardware: fewer ISLs means fewer FICON director ports and/or DWDM links.
However, that cost savings could be minuscule in comparison to the potential bandwidth cost savings.
The potential bandwidth cost savings depends on an end user's specific circumstances. In Asia and the
Americas, the distance between data centers is usually greater than it is in Europe. Many large end users
now own their bandwidth. In addition, the type of bandwidth is a significant factor.
• Better utilization of available bandwidth/assets: This is likely to be most attractive to PPRC/Metro Mirror
users and/or end users who own their own bandwidth. Many such end users will prefer to keep their
traffic types segregated and not share ISLs. One of the drawbacks to using static routing with PPRC was
that some ISLs might not be used at all; they would only be there for failover. Consider a configuration
with four PPRC links from a primary control unit to the secondary control unit, where these four PPRC
ports connect to a fabric with eight ISLs. With static routing, at most four of these ISLs would be used,
and the other four would be there for redundancy/failover. With FIDR, all eight ISLs will be used, which is
far more efficient utilization of the available bandwidth and could have a positive impact on performance.
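The arithmetic behind that example is straightforward. A minimal sketch, assuming a hypothetical 8 Gbps of demand per PPRC link (a figure chosen for illustration, not taken from the text):

```python
def per_isl_load(total_gbps: float, isls_used: int) -> float:
    """Average traffic per ISL when demand is spread over the ISLs in use."""
    return total_gbps / isls_used

# Hypothetical figures: four PPRC links, each driving 8 Gbps into the fabric.
total_demand = 4 * 8.0

static = per_isl_load(total_demand, 4)  # static routing uses only 4 of the 8 ISLs
fidr = per_isl_load(total_demand, 8)    # FIDR spreads exchanges across all 8
print(static, fidr)  # 8.0 4.0
```

Halving the per-ISL load is what gives each link headroom to absorb workload spikes without queuing.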
Design Considerations
There are two behaviors that may occur in a FICON SAN with FIDR enabled that IBM Z end users need to
be aware of. The first is dilution of error threshold counts. The second is the effect of slow-drain
devices.
The first behavior of concern is dilution of error threshold counts. It is possible for errors to occur in a FICON
SAN that cause Fibre Channel frames to get lost. One example scenario is triggered when there are more
than 11 bit errors in a 2112-bit block of data and the Forward Error Correction (FEC) code cannot correct this
large a burst of errors. For IBM z Systems and the FICON protocol, an error is then detected by the host
operating system. This error will typically be an interface control check (IFCC) or a missing interrupt; the
exact error type depends on exactly where in the I/O operation the error occurs. z/OS counts these errors,
and they are tracked back to a specific FICON channel and control unit link. With static routes, the error count
covers the specific ISL that is in the path between the FICON channel and the control unit. With FICON
dynamic routing mechanisms, the intermittent errors caused by a faulty ISL will occur on many channel-to-
control-unit links, unknown to the operating system or the z Systems processor itself.
Therefore, the error threshold counters can get diluted: they get spread across the different operating
system counters. This makes it entirely possible that either the thresholds are not reached in the time
period needed to recognize the faulty link, or that the host thresholds are reached by all control units that
cross the ISL in question, resulting in all the channel paths being fenced by the host operating system. To
prevent this behavior, the end user should use the FICON SAN switching devices' capabilities to set tighter
error thresholds internal to the switch and fence/decommission faulty ISLs before the operating system's
recovery processes get invoked.
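The dilution effect can be shown with a toy model. All of the numbers below (error count, number of paths sharing the ISL, host threshold) are invented for illustration; the point is that the same number of errors trips the counter under static routing but vanishes into many counters under dynamic routing:

```python
def fenced_paths(total_errors: int, paths_sharing_isl: int, threshold: int):
    """Compare how one faulty ISL's errors register under static vs. dynamic routing."""
    # Static routing: every error lands on the single path pinned to the ISL.
    static_counts = [total_errors]
    # Dynamic routing: the same errors are spread across all paths that can
    # traverse the faulty ISL, diluting each per-path counter.
    per_path = total_errors // paths_sharing_isl
    dynamic_counts = [per_path] * paths_sharing_isl
    return (
        sum(c >= threshold for c in static_counts),
        sum(c >= threshold for c in dynamic_counts),
    )

# 60 IFCCs from one bad ISL, 12 channel-to-CU pairs share it, host threshold of 10.
print(fenced_paths(60, 12, 10))  # (1, 0): static flags the path; dynamic sees nothing
```

Each dynamic-routing counter sees only 5 errors, well under the threshold of 10, so the host never identifies the faulty link; this is exactly why the switch-internal thresholds recommended above matter.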
The second behavior to be aware of when using FIDR is more important: slow-drain devices
and their impact on the FICON SAN. Recall that a slow-drain device is a device that does not accept frames
at the rate generated by the source. When slow-drain devices occur, FICON SANs are likely to lack buffer
credits at some point in the architecture. This lack of buffers can result in FICON switching device port
buffer credit starvation, which in extreme cases can result in congested or even completely choked ISLs.
Frames that are “stuck” for long periods of time (typically >500 milliseconds) may ultimately be
dropped by the FICON SAN fabric, resulting in errors being detected. The most common type of error in
the situation described would be C3 discards.
When a slow-drain device event does occur, and corrective action is not taken in short order, ISL traffic
can become congested as the effect of the slow-drain device propagates back into the FICON SAN. With
FICON dynamic routing policies being implemented a slow-drain device may cause the buffer to buffer credit
problem to manifest itself on all ISLs that can access the slow-drain device. This congestion spreads and
can potentially impact all traffic that needs to cross the shared pool of ISLs. With static routing policies, the
congestion and its impact is limited to the one ISL that accesses the slow-drain device.
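The difference in blast radius can be sketched with a toy model (the ISL names and pool size here are made up, and real congestion spread depends on topology and traffic, not a simple set rule):

```python
def congested_isls(routing: str, isl_pool: list[str], slow_drain_isl: str) -> set[str]:
    """Which ISLs back up when one attached device stops returning buffer credits."""
    if routing == "static":
        # Only the fixed route toward the slow-drain device is starved of credits.
        return {slow_drain_isl}
    # Dynamic routing: exchanges destined for the device land on every ISL in the
    # shared pool, so buffer credit starvation propagates across all of them.
    return set(isl_pool)

pool = ["isl1", "isl2", "isl3", "isl4"]
print(congested_isls("static", pool, "isl2"))   # only isl2 is affected
print(congested_isls("dynamic", pool, "isl2"))  # the whole shared pool is affected
```

This is why the mitigations that follow (fabric-level path steering and I/O priority) matter more in a FIDR fabric than in a statically routed one.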
Possible causes of slow-drain devices include, but are not limited to:
• An insufficient number of buffer credits configured for links that access devices a long distance away
• Disparate link speeds between the FICON channel and control unit links
• Significant differences in cable lengths
• Congestion at the control unit host adapters, caused when the link traffic (across all ISLs) exceeds the
capacity of the host adapter; this can occur when too many Metro Mirror ISLs share the same host
adapter as FICON production traffic ISLs.
The IBM z13 and IBM z14 processors and IBM z/OS have two capabilities that mitigate the effect of slow-
drain devices:
1. The IBM Z channel path selection algorithms steer the I/O workload away from the paths that are
congested by the slow-drain device towards the FICON channels in a separate, redundant FICON SAN.
Best practices for Brocade z Systems I/O configurations require at least two separate and redundant
FICON SANs. Many end users use four and the largest configurations often use eight.
2. The IBM Z SAN fabric I/O priority feature with the z/OS Work Load Manager will manage the contention
using the end user-specified goals so that the higher importance workloads are executed first.
For Metro Mirror traffic, best practices call for the IBM z Systems end user to use FIDR in the fabric for
predictable and repeatable performance, resilience against workload spikes and ISL failures, and optimal
performance. If a slow-drain device situation does occur in a FICON SAN fabric carrying Metro Mirror
traffic, it will impact the synchronous write performance of the FICON traffic, because the write operations
will not complete until the data is synchronously copied to the secondary control unit. Since FICON traffic
is already subject to slow-drain device scenarios today, exploiting FIDR does not introduce a new challenge
to the end user and their FICON workloads.
FICON Multihop has several hardware requirements that need to be met prior to implementation. There are
IBM Z, storage, SAN, and network/DWDM device requirements.
DASD/Storage Requirements
FICON Multihop is supported with IBM DS8870 running firmware release 7.5 or later. Multihop will work with
both 16 Gbps and 8 Gbps host adapter cards. Other storage OEMs may support Multihop as well. Clients
should check with their DASD vendor to verify support.
• Potential configuration consolidation/switch consolidation: The previous limit of two switches/one
hop led to some unnatural FICON switch configurations for clients running multisite configurations.
With the newly supported Multihop configurations, the management of these configurations is much
simpler.
• Better availability: Using the square and triangle configurations enables higher SAN availability in the
event of a total connectivity break between two switches. These configurations allow traffic to be routed
around the break to ensure you still maintain access to all the devices and hosts in your SAN fabric.
Prior to Multihop support, clients with three- and/or four-site architectures experiencing these types of
connectivity failures could find themselves in an unsupported configuration for failover. Multihop FICON
eliminates this potential issue.
Design Considerations
There are some design implications that need to be considered with the use of FICON Multihop
environments. The first is bandwidth planning and allocation for failure scenarios, and the second is the
performance and latency impact of traffic being rerouted to longer paths.
When planning and designing your FICON Multihop fabric, you need to keep in mind the additional
bandwidth that may be required when there is a total loss of connectivity between two switches. If this event
were to occur, all traffic that was flowing between these two adjacent switches would be rerouted over the
non-failed ISL paths. To ensure there is minimal impact to the other traffic in the fabric, additional bandwidth
should be provisioned on all routes.
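A rough planning check can be expressed in a few lines. The route names, capacities, and loads below are hypothetical, and the even-redistribution assumption is a simplification of what a real fabric would do:

```python
def survives_link_loss(loads: dict[str, float], capacity: float,
                       failed: str, reroute_over: list[str]) -> bool:
    """Check whether the surviving routes can absorb a failed route's traffic."""
    # Simplifying assumption: the failed route's load redistributes evenly
    # across the listed alternate routes.
    share = loads[failed] / len(reroute_over)
    return all(loads[link] + share <= capacity for link in reroute_over)

# Hypothetical triangle fabric: three inter-switch routes, 64 Gbps capacity each.
loads = {"a-b": 40.0, "b-c": 30.0, "a-c": 20.0}
# If route a-b fails, its 40 Gbps detours over the two surviving routes.
print(survives_link_loss(loads, 64.0, "a-b", ["b-c", "a-c"]))  # True
print(survives_link_loss(loads, 45.0, "a-b", ["b-c", "a-c"]))  # False: b-c would hit 50
```

Running this check for every possible single-route failure is one simple way to size the headroom the paragraph above recommends.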
Second, if all the traffic is rerouted due to a total connectivity loss between switches, applications using
those paths may see a negative impact to performance due to the increased latency of traversing the longer
path. Unfortunately, this is the consequence of a high-availability design, and there is little that can be done
to mitigate it.
When sizing ISLs for a Multihop fabric, a bandwidth and traffic study should answer questions such as:
• What type of traffic is going across the cascaded FICON links (ISLs)? DASD, tape, CTC, all of the above?
• Do I have replication traffic going across the ISLs? If so, what type (synchronous, asynchronous, both)?
• Do I wish to isolate a particular traffic type to its own set of ISLs (by OS, storage type, replication type)?
• Am I using trunking/port channels in conjunction with Multihop?
• Am I using Virtual Fabrics, and how does that affect my ISL allocation?
• Do I have an SLA that must be met for specific replication traffic?
• Does my environment include a GDPS or similar architecture? Will I be performing a HyperSwap?
There are a variety of tools that can be used for this study. Some of the better tools include the Intellimagic
family of software. If you are employing a virtual fabrics architecture, it is recommended that you consult
with your switch vendor for specific configuration guidelines on ISL usage with Multihop and Virtual Fabrics.
Finally, if you are employing DWDM devices in a cascaded FICON architecture, check with both your switch
vendor and DWDM vendor for specifics on Multihop support, as well as support for other cascaded FICON
technologies (for example, some DWDM vendors do not support VC_RDY and require the use of R_RDY on
ISLs).
In the Information Age, data is quite valuable. It is the livelihood of businesses across the globe, whether
in the form of financial transactions, online purchases, customer demographics, correspondence or
spreadsheets, or any number of business applications. When it comes to the transactions of customers it
is absolutely imperative that none of them is ever lost due to an IT system failure. In today’s competitive
environment, end users demand near-100 percent system and infrastructure availability. This level of
availability is no longer a luxury, but a necessity. Mission-critical applications and operations require truly
mission-critical services and support, especially for their important I/O traffic.
HA is quite important to all businesses; however, to some, such as financial institutions, it is more important
than to others. Deploying HA must be a conscious objective, as it requires time, resources, and money. This
is because not only is HA used to ensure constant connections of servers to storage networks to storage
devices—along with a reliable data flow—but there is also a premium to be paid when dealing with the Total
Cost of Acquisition (TCA) of highly available equipment.
However, the Internet has emphasized that HA equals viability. If companies do not have highly reliable and
available solutions for the continuing operation of their equipment, they lose money.
If one company’s server goes down due to a failure in availability, customers are apt to click over to a
competitor. If mission-critical computers involved in manufacturing are damaged through machine failure,
inventory runs behind and schedules are missed. If a database application cannot reach its data due to I/O
fabric failures, seats might not get filled on flights or hotel room reservations might go to a competitor—or
credit card transactions might be delayed. Such events cost many thousands, sometimes even millions, of
dollars in damage to a company’s bottom line.
Storage networking is an important component of this infrastructure. Because they depend on electronic
devices, storage networks can at times fail. The failure may be due to software or hardware problems, but
failures do occur from time to time. That is why, rather than taking big risks, businesses running mission-
critical processes in their computer environments integrate high-availability solutions into their operations.
Real-time systems are now a central part of everyone’s lives. Reliable functioning of these systems is of
paramount concern to the millions of users that depend on these systems every day. Unfortunately, some
systems still fall short of user expectations of reliability. What is really meant by the term “five-nines” HA in
storage networks?
In discussing this topic, it is important to clarify a couple of terms. Quite a few professionals use the terms
“reliability” and “availability” as if they mean the same thing. But they are not the same. Reliability is about
how long something lasts before it breaks. Availability is about how to keep running even if something does
break. Five-nines of availability is very difficult to achieve. And it is not just a matter of building hardware,
but of how that hardware is deployed within the data center environment, which determines whether “real”
five-nines availability can be or has been achieved.
For example, the industry makes a general claim that dual director-based fabrics can provide five-nines of
availability. That might be so—but, maybe it is not. Can an end user (such as a financial institution, a broker,
or an airline reservation system, for example) survive for very long on half the bandwidth if one of its two
fabrics fails? Usually not. Figure 18 is a depiction of that configuration.
Figure 18: What happens if you lose 50 percent of your bandwidth capacity?
To achieve five-nines of availability, an end user must architect the hardware, the fabrics, and the bandwidth
across those fabrics to achieve this high expectation (Guendert, 2013). The configuration in Figure 18
potentially provides an environment of five-nines of availability from a redundant fabric point of view. But
each of the two deployed fabrics actually needs to run at no more than approximately 45 percent busy, so
that if a failure occurs and one fabric must fail over to the remaining fabric, that remaining fabric has
the capacity to accept and handle the full workload of both fabrics. End users often create these fabrics so
that the data flow across each fabric can be shifted to the other fabric without causing any issues.
Yet, over time, entropy happens. Storage devices, CHPIDs, and HBA ports are added (or removed) in an
asymmetrical fashion, leaving more data traffic on one fabric than another. In a similar scenario, application
and data growth occurs, driving both fabrics to run above 45 percent utilization. If one fabric had to fail
over to the other, the remaining fabric would not have enough excess bandwidth to accept and
successfully run the failed fabric's workload.
So, the question that is not asked often enough when attempting to deploy highly available fabrics is:
“Beyond redundant fabrics, how much bandwidth can I lose and continue to meet my end-user Service
Level Agreements (SLAs)?”
Worldwide, the most common deployment of mainframe FICON fabrics is as shown in Figure 19. Since a
Path Group (PG) has eight links, two links (for redundancy) are routed to each of four fabrics per PG. Since
mainframe channel cards and open systems HBAs contain four or fewer ports, at least two channel
cards or HBAs are required to provision the eight paths shown in Figure 19, which allows users to deploy
redundancy for HA fabrics at yet another level of their infrastructure (Guendert, 2013).
Risk of “loss of bandwidth” is the motivator for deploying fabrics like those shown in Figure 19. In this case,
the four fabrics limit bandwidth loss to no more than 25 percent if a fabric were to fail and could not be
fully recovered by the other fabrics. In this storage networking architecture, each fabric can run at no more
than about 70 percent busy, so that if a failover occurs, the remaining three fabrics have the capacity to
accept and handle the full workload without overutilization on any of them. Even with entropy, it
would take a tremendous amount of growth to overwhelm the total availability and bandwidth capabilities of
this deployment.
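The utilization ceilings quoted in this section follow from one simple formula: if one of N fabrics fails, the remaining N − 1 must absorb its load. A minimal sketch, assuming even redistribution and a 5-percentage-point safety headroom (the headroom figure is an assumption chosen to match the two-fabric 45 percent guideline, not a published rule):

```python
def max_safe_utilization(fabrics: int, headroom: float = 0.05) -> float:
    """Highest per-fabric load that still lets N - 1 fabrics absorb one failure."""
    # If one of N fabrics fails, its load spreads over the remaining N - 1,
    # so each fabric may carry at most (N - 1) / N of capacity, less headroom.
    return (fabrics - 1) / fabrics - headroom

for n in (2, 4, 8):
    print(n, round(max_safe_utilization(n), 3))
# 2 fabrics -> 0.45, 4 fabrics -> 0.7, 8 fabrics -> 0.825
```

The same formula explains why more fabrics allow each one to run hotter: the failed fabric's workload is divided among more survivors.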
There are end users, however, for whom even the deployment of fabrics shown in Figure 19 is not
adequate to meet their requirements for five-nines of availability. These users need to minimize their risk of
losing bandwidth to the maximum extent possible. Doubling the size of what is shown in Figure 19 would
limit the maximum risk of bandwidth loss to just 12.5 percent. That configuration is shown in Figure 20. This
is a configuration deployed especially in the financial sector, where millions of dollars might be at
stake for each minute that work completion is delayed.
By now you understand why so many end users in mainframe environments deploy three, four, or eight
fabrics. When a single fabric fails, the remaining fabrics have the capacity to handle the additional workload
from the failed fabric and continue to operate at full capability. This creates a truly five-nines environment.
Note that availability is not just a factor of hardware. It is the ability of the customer to complete all work
accurately, even when facing a failure of some part of the system.
Then there is the question of how vendors count availability in their marketing. What about the failed
devices that were never returned? What about the power supplies and fans, and so on, that failed in the
field and were replaced but not shipped back to the vendor? None of those items are likely to be counted
in the availability numbers.
Table 1 describes various levels of availability. Occasionally, you might hear a reference to achieving six-
nines of availability. When you think about that statement, it is either referring to an engineering number
(hardware rather than fabric deployment) or is just marketing talk. A vendor really should seek to create
six-nines capable hardware. However, six-nines of data center availability is simply not possible to deploy
in customer environments. Six-nines allows only about 32 seconds of downtime per year; customers could
not even detect that a problem was occurring in that time, let alone fix it. Five-nines director-class Fibre
Channel switching devices have component redundancy built in everywhere.
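The downtime budgets behind these "nines" figures are pure arithmetic, and a short sketch makes the six-nines point concrete:

```python
def downtime_per_year(nines: int) -> float:
    """Allowed downtime in seconds per year for a given number of nines."""
    availability = 1 - 10 ** (-nines)
    seconds_per_year = 365.25 * 24 * 3600
    return seconds_per_year * (1 - availability)

for n in (3, 4, 5, 6):
    print(n, round(downtime_per_year(n)), "seconds/year")
# Five-nines allows roughly 5 minutes per year;
# six-nines leaves only about 32 seconds per year.
```

At about 32 seconds per year, there is no realistic window to detect, diagnose, and repair a fabric-level failure, which is why six-nines is an engineering number rather than a deployable target.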
Some areas to examine when looking for a lack of redundancy (Guendert, Proactively Improve Performance
With Resilient SAN Fabrics, 2013):
• A single fan
• A single fan connector (attached to the backplane, so that if it is damaged, the chassis might need
replacement)
• A single I/O timing clock (see below); all I/O has to be timed, and timers can get out of sync and go bad
• Control and connectivity blades, which have many components that pose significant potential for failures
• Poor control over I/O frame flow
• An active backplane (hundreds of electronic components attached to their backplanes)
Each switching vendor and each storage vendor selling infrastructure wants to paint the best possible
picture of the devices they are selling to end users. Unfortunately, there is no single authority to determine
which devices offer two-nines, three-nines, four-nines, or five-nines of availability. Thus, vendor marketing
departments always emphasize the best possible characteristics of a device, including availability—whether
or not testing or proof points are available to back up these claims.
The testing labs that are available to do switch testing also charge the switch vendors money to have their
devices tested. The results of such testing should always be viewed with suspicion. Often, when competing
vendors use the same testing agency, the agency provides a glowing report about each vendor's
equipment. The vendor can even tell the test agency which tests to run and which tests to avoid.
After all, the switch vendor is paying for the testing. For the consumer, the test results cannot be relied
upon, because the vendor versus vendor results are like comparing apples to oranges instead of providing
accurate and complete answers. An end user must look deeper into the engineering and manufacturing of a
device in order to get a sense of that device’s capability to deliver five-nines of availability to a data center.
Reliability can be measured as the amount of time a device or component runs before something breaks.
Availability is about how to keep running even if something does break. Those concepts seem simple
enough. You might assume that switch vendors perform all of the correct engineering and use all of the
proper manufacturing processes so that they create a climate of reliability and availability within their
product set. But that has not always been true. The storage network, which is the “piece in the middle,” is
crucial to HA environments. It is important for end users to understand first, how to design their storage
network for “five nines” HA if required and next, what to look for in the switching device’s engineering and
design to ensure that HA requirements are being met.
Chapter Summary
This chapter discussed topology considerations for FICON SANs. It introduced the three primary topologies
for FICON: DAS, single-switch, and cascaded switched FICON. Cascaded FICON was explained in greater
detail, because it is the most commonly implemented topology. The second section in the chapter featured
a detailed discussion on the advantages to using a switched FICON architecture for your topology. The
chapter concluded with a detailed look at RAS and creating FICON fabrics at five-nines of availability.
References
Artis, H., & Guendert, S. (2006). Designing and Managing FICON Interswitch Link Infrastructures.
Proceedings of the 2006 Computer Measurement Group. Turnersville, NJ: The Computer Measurement
Group, Inc.
Brocade Communications Systems. (2008). Technical Brief: Cascaded FICON in a Brocade Environment.
San Jose, CA, USA: Brocade Communications Systems.
Brocade Communications Systems. (2017). Brocade Fabric OS FICON Configuration Guide, 8.1.0. San Jose,
CA, USA: Brocade Communications Systems.
Brocade Communications Systems. (2017). Brocade Fabric OS Administration Guide, 8.1.x. San Jose, CA,
USA: Brocade Communications Systems.
Catalano, P., & Guendert, S. (2016). FICON Dynamic Routing Technology and Performance Considerations.
Armonk, NY, USA: IBM Corporation.
Catalano, P., & Guendert, S. (2017). FICON Multihop Requirements and Configurations. Armonk, NY, USA:
IBM Corporation.
Guendert, S. (2007). A Comprehensive Justification For Migrating from ESCON to FICON. Gahanna, OH:
Warren National University.
Guendert, S. (2007). Understanding the Performance Implications of Buffer to Buffer Credit Starvation
in a FICON Environment: Frame Pacing Delay. Proceedings of the 2007 Computer Measurement Group.
Turnersville, NJ: The Computer Measurement Group, Inc.
Guendert, S. (2011, November). The Merits of Switched FICON vs. Direct Attached: A Business Perspective.
Mainframe Executive.
Guendert, S. (2011, October). To Switch or Not to Switch? That’s the Question! zJournal.
Guendert, S. (2013, August). Proactively Improve Performance With Resilient SAN Fabrics. Enterprise Tech
Journal.
Guendert, S., & Lytle, D. (2007). MVS and z/OS Servers Were Pioneers in the Fibre Channel World.
Journal of Computer Resource Management, 3-17.
IBM Systems (2015). Why switched High Performance FICON for z Systems (zHPF)? Somers, NY, USA: IBM
Corporation.
Singh, K., Grau, M., Hoyle, P., & Kim, J. (2012). FICON Planning and Implementation Guide. Poughkeepsie,
NY: IBM Redbooks.
PART 2 Chapter 3: Protocol Intermix Mode
Introduction
Protocol Intermix Mode (PIM) is a feature supported by IBM on IBM Z processors, which allows IBM
Fibre Connection (FICON) and open systems Fibre Channel Protocol (FCP) traffic to coexist on the same
physical storage network. Over the past decade, Fibre Channel (FC) has become the dominant protocol
for connecting servers to storage. It was designed to be a robust, highly reliable, and high-performance
foundation layer that can support a framework for both channel- and network-oriented upper layers. This
has made FC an ideal architecture for delivering mass connectivity. End users have been able to combine
mainframe FICON storage networks and open systems FC Storage Area Networks (SANs) onto an
enterprise-wide storage network or systems area network since early 2003, when IBM initially announced
zSeries support for what is known as FICON/FCP Intermix, or PIM. (These two terms are used
interchangeably in the industry,
and they refer to the same feature. That’s why both terms are used in this book without differentiating
between them.) From early 2000, continuing through today, IBM has been one of the biggest supporters of
Linux in the IT industry, specifically Linux on the mainframe. Linux on IBM Z has been the biggest driver of
FICON/FCP Intermix during this same time.
Although FICON/FCP Intermix has been approved by IBM and FICON director vendors, many end users,
particularly large enterprises, have been hesitant to try Intermix, for several reasons. These reasons include
security and management concerns, internal politics, and perhaps a lack of understanding of how FICON/
FCP Intermix really works. However, considering the cost reductions achieved by FICON/FCP Intermix and
access to the latest advances in technology—such as the IBM z14, IBM LinuxONE servers, and N_Port ID
Virtualization (NPIV)—FICON/FCP Intermix offers a clear advantage for SAN and server network design. This
chapter discusses the mixing of FICON and FCP devices in the same SAN. It focuses on end-user issues and
the fabric elements to consider when evaluating a FICON/FCP Intermix solution.
Note: In this chapter, the term “switch” is used in phrases such as “Fibre Channel switch” to refer to a
switch or director platform.
Storage Device (DASD) array. This is a relatively new subject area that is beyond the scope of this chapter.
In open systems environments, FCP is the upper-layer protocol for transporting Small Computer Systems
Interface version 3 (SCSI-3) over Fibre Channel transports. IBM introduced FICON in 1998 to replace the
aging Enterprise Systems Connection (ESCON) architecture. Like FCP, FICON is an upper-layer protocol
that uses the lower layers of Fibre Channel (FC-0 to FC-3). This common FC transport is what enables
mainframes and open systems to share a common network and I/O infrastructure, hence the term
“Intermix.”
Mixing FC-4 protocols in the same fabric has been discussed since the advent of the Fibre Channel
architecture. There are several key concepts to understand when considering FICON/FCP Intermix.
• The FC-4 type of FC frame is an element of the payload; it is not part of the routing algorithms. Therefore,
Fibre Channel can function as protocol-agnostic, which allows it to be an ideal common transport
technology.
• FICON was the first major FC-4 type other than FCP (which is the SCSI transport protocol in Fibre Channel)
to be released in the storage market.
• Other standards-defined FC-4 types include IP, IPI (Intelligent Peripheral Interface), and HIPPI (High-
Performance Parallel Interface). Of these defined types, FCP is the most widely used at present, with
FICON being a close second.
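The first bullet above, that switching ignores the FC-4 type, can be modeled in a few lines. This is an illustrative sketch, not a switch implementation; the forwarding-table layout and addresses are invented, and the TYPE constants (0x08 for SCSI-FCP, 0x1B commonly listed for FC-SB/FICON) should be treated as illustrative values:

```python
from dataclasses import dataclass

FCP_TYPE = 0x08    # TYPE value for SCSI-FCP (illustrative constant)
FICON_TYPE = 0x1B  # TYPE value commonly cited for FC-SB (FICON); illustrative

@dataclass
class FCFrame:
    d_id: int      # 24-bit destination port address
    s_id: int      # 24-bit source port address
    fc4_type: int  # FC-4 payload type carried in the frame header

def route(frame: FCFrame, forwarding_table: dict[int, str]) -> str:
    # The switch forwards on the destination address (D_ID) alone; the FC-4
    # TYPE classifies the payload and never enters the routing decision.
    return forwarding_table[frame.d_id]

table = {0x010400: "port7"}
ficon_frame = FCFrame(0x010400, 0x010200, FICON_TYPE)
fcp_frame = FCFrame(0x010400, 0x010300, FCP_TYPE)
print(route(ficon_frame, table), route(fcp_frame, table))  # both egress on port7
```

Because `route` never inspects `fc4_type`, FICON and FCP frames bound for the same destination take the same forwarding path, which is the protocol-agnostic property that makes Intermix possible.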
As shown in Figure 1, the open systems SAN Fibre Channel Protocol (FC-SCSI-3) and FICON (FC-SB-2/3/4)
are merely different upper-level protocols in the overall FC structure; the difference between the two is at
the FC-4 layer (data payload packet). Essentially, an end user’s open systems SAN platforms and FICON
platforms are identical hardware. However, FICON platforms are typically purchased with a software feature
known as Control Unit Port (CUP). The CUP code provides administrators with the capability to manage the
FICON directors in-band, using the same host management tools that you use with ESCON.
First, it makes good sense to use PIM in specialized non-production environments that require flexibility and
resource sharing. Examples are quality assurance, testing, and development, and even dedicated disaster
recovery data center environments. Many FICON users with newer DASD arrays who use FCP for replication
and mirroring are already running in PIM.
Second, the most important reason users move to a PIM environment is to reduce Total Cost of Ownership
(TCO). The consolidation of both switching infrastructure and cabling plants when running a PIM
environment can significantly reduce TCO. The latest generation of Brocade directors, the Gen 6 Brocade
X6, are designed with non-blocking architectures and full throughput to all ports, even when the port count
is fully populated. Very large IT organizations with separate open systems and FICON storage networks
have installations that are large enough to fully utilize their switching hardware, even when running distinct,
segregated storage networks. However, many smaller organizations also run distinct, segregated storage
networks to keep the open systems SAN separate from the FICON/mainframe environment. Sometimes
these organizations do not fully utilize their switching hardware; they often have directors or switches
with many available ports. Sometimes, up to 50 percent of the director or switch is reserved for “planned
growth.” In this type of environment, enterprises can realize significant savings by consolidating the
underutilized directors onto a common SAN operating in PIM. This results in fewer “serial numbers” on the
floor, which in turn leads to lower costs for maintenance, electrical, cooling, and other operations. PIM also
allows an organization to have one common cable plant/fiber infrastructure, allowing for further reductions
in management and operational costs.
Third, IBM has been one of the biggest proponents of Linux for the past decade—specifically, of
consolidation of open systems servers, particularly Wintel servers, onto zLinux partitions. This trend did not
really start until the 2005 release of the System z9, when NPIV was announced. The trend continued with
the System z10, the zEnterprise 196 (z196), and continues today with IBM’s latest mainframe, the IBM
z14. The significant enhancements in performance and operational cost savings offered by IBM Z are part
of a very compelling zLinux consolidation story. IBM published the results of an internal project in which
approximately 3900 open systems servers were consolidated onto mainframes, and the projected TCO
savings was 80 percent over five years. The latest analyst surveys and publicly released information from
IBM estimates that 40 percent of IBM Z customers are running Linux on their mainframes.
IBM introduced the LinuxONE servers in Q1 2016, and in Q4 2017 announced the next set of
enhancements to the LinuxONE line, including the Emperor II LinuxONE server. IBM has also been a leading
advocate of Blockchain technologies on Linux platforms via the Linux Foundation’s Hyperledger project.
The IBM z13, IBM z14, and LinuxONE platforms are an ideal fit for blockchain implementations. These
implementations will require a mix of Linux servers with FCP connectivity to storage, working together
with traditional mainframe based OLTP workloads. A PIM storage network is the logical choice for these
environments.
Typical examples of PIM environments include:
• Small FICON and open systems environments with a common storage network
• z/OS servers accessing remote FICON storage via FICON cascading
• Linux on an IBM Z mainframe running z/OS, using FICON to access local storage
• Hardware-based remote DASD mirroring between two locations using FCP as a transport, for example, the
IBM Metro Mirror Peer-to-Peer Remote Copy (PPRC)
• An IBM Z processor that accesses storage via FICON and storage devices via FCP to perform remote
mirroring between devices (instead of ESCON)
• Open systems and IBM LinuxONE servers accessing storage on the same storage network using FCP
• Linux on the IBM Z platform located at either of two sites, using FCP to access storage
• One mainframe running all of these: z/OS workloads (such as DB2) with FICON-attached storage, Linux
under z/VM accessing SCSI storage via FCP channels, and IBM BladeCenter blade servers in the
zBX running AIX, Linux, and Windows connected to the storage network
The primary management difference between the two protocols is the mechanism for controlling device
communication. Because FCP and FICON are both FC-4 protocols, they do not affect the actual switching
of frames. Therefore, the management differences between the two protocols are not relevant, unless you
want to control the scope of the switching through zoning or connectivity control. For example, FCP devices
use name server zoning as a way to provide fabric-wide connection control. FICON devices use the Prohibit
Dynamic Connectivity Mask (PDCM, explained in the next section) to provide switch-wide connection control.
FICON Connectivity
FICON communications are address-centric and definition-oriented, and they use host assignment to
determine device communication. The server-to-storage configuration is defined using a host-based program
and is stored in the channel drivers. Additional connectivity control is managed at the switch level by
configuring (via the PDCM) which port addresses are allowed to form connections with each other.
Initially, the FICON architecture was limited to single-switch (domain) environments. Due to the single-
byte addressing scheme limitations it inherited from ESCON, the original FICON architecture limited
configurations to single domains. At that time, FICON-compliant directors did not form Expansion Ports
(E_Ports, that is, Inter-Switch Links, or ISLs) with other switches. If two directors were connected, the FICON
Management Server in the director reported this condition as an invalid attachment and prevented the
port from coming online. The z900, z/OS, and the accompanying 64-bit architecture announced in 2000
allowed for 2-byte addressing. Subsequently, FICON cascading was qualified by IBM and Brocade to provide
one-hop configurations between the channel and control units, via ISLs between two switches or directors.
Although FICON cascading relieves the single domain restriction, the end user must decide whether or not
to exploit FICON cascading functionality. As discussed in Part 2 Chapter 2, multihop FICON cascading was
qualified by IBM and Brocade in April 2017, with support for up to three hops.
The FICON architecture communicates port-to-port connectivity via the PDCM that is associated with the
port address. The PDCM is a vector-defined addressing system that establishes which addresses are
allowed to communicate with each other. The FICON switch is required to support hardware enforcement of
the connectivity control; therefore, the control must be programmed into the port hardware and must apply
to all connections. In PIM environments, this creates the potential for the connectivity information to be
more restrictive than the zoning information used by open systems devices.
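As a rough illustration, the PDCM can be thought of as a per-address bit vector of prohibited peers. The following Python sketch models only that idea; the class and field names are invented for illustration, and a real director enforces the mask in port hardware:

```python
# Illustrative model of a Prohibit Dynamic Connectivity Mask (PDCM):
# each port address carries a bit vector marking which peer addresses
# it is prohibited from connecting to. Names and sizes are hypothetical.

class PortAddress:
    def __init__(self, address, num_addresses=256):
        self.address = address
        # One prohibit bit per possible peer address; 0 = allowed.
        self.prohibit = [0] * num_addresses

    def prohibit_peer(self, peer):
        # Prohibit connectivity in this direction; a real director
        # applies the change on both port addresses.
        self.prohibit[peer] = 1

    def may_connect(self, peer):
        return self.prohibit[peer] == 0

ports = {a: PortAddress(a) for a in range(256)}
ports[0x10].prohibit_peer(0x20)       # block address 0x10 -> 0x20

print(ports[0x10].may_connect(0x20))  # False: prohibited
print(ports[0x10].may_connect(0x30))  # True: default is allowed
```

Because the mask is hardware-enforced per address pair, it can end up stricter than the zoning that open systems devices in the same PIM fabric rely on.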
It is also important to understand the implications of FICON port addressing versus port numbering in FCP.
Like ESCON, FICON abstracts the concept of the port by creating an object known as the port address.
All the configuration attributes associated with a port are attached to its port address. A subsequent
association is then made between the port address and the port number. The port addressing concept
facilitates the ability to perform FICON port swaps for maintenance operations, without the need to
regenerate the host configuration. The port address abstraction is not a concept in the FC architecture;
therefore, it is foreign to FCP.
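To illustrate the abstraction, this hypothetical Python sketch separates the host-visible port address from the physical port number, so that a maintenance port swap changes only the mapping:

```python
# Hypothetical sketch of FICON port-address abstraction: host definitions
# reference the port address, and the director maps each address to a
# physical port, so a maintenance "port swap" only updates the mapping.

address_to_port = {0x04: 17, 0x05: 18}   # port address -> physical port

def port_swap(addr_map, a, b):
    """Swap the physical ports behind two port addresses."""
    addr_map[a], addr_map[b] = addr_map[b], addr_map[a]

port_swap(address_to_port, 0x04, 0x05)
# The host still addresses 0x04; traffic now flows via physical port 18,
# with no regeneration of the host configuration required.
print(address_to_port[0x04])  # 18
```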
FCP Connectivity
In contrast, FCP communications are name-centric, discovery-oriented, and fabric-assigned, and they use
the Fibre Channel Name Server to determine device communication. FCP connectivity is device-centric
and defined in the fabric using the World Wide Port Names (WWPNs) of the devices that are allowed to
communicate. When an FCP device attaches to the fabric, it queries the Fibre Channel Name Server for the
list of devices with which it is allowed to form a connection (that is, the zoning information). FICON devices
do not query the Fibre Channel Name Server for accessible devices, because the allowable port/device
relationships have been previously defined using the host-based program: IOCP (Input/Output Configuration
Program) or HCD (Hardware Configuration Definition).
Hence, the zoning and name server information does not need to be retrieved for FICON. In addition, FCP
configurations support multiple switch environments (fabrics) and allow as many as seven hops between
source (server) and target (storage). FCP port-to-port connectivity has traditionally been enforced via zoning,
although other techniques that are complementary to zoning—such as port, switch, and fabric binding—have
been introduced. Binding helps alleviate security concerns experienced in PIM installations, because with
FCP, any device in the fabric can access the ANSI standard FC management server by logging into the fabric.
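The name-server-driven connection control described above can be sketched as follows; the zone names and WWPNs are invented for illustration, and a real Fibre Channel name server implements far more than this:

```python
# Minimal model of FCP name-server zoning: an attaching device queries
# the name server and receives only the WWPNs it shares a zone with,
# rather than a view of the whole fabric. All identifiers are made up.

zones = {
    "zone_linux1": {"10:00:00:00:c9:aa:00:01", "50:05:07:63:00:c0:00:01"},
    "zone_linux2": {"10:00:00:00:c9:aa:00:02", "50:05:07:63:00:c0:00:02"},
}

def name_server_query(wwpn):
    """Return the set of WWPNs the querying device may connect to."""
    visible = set()
    for members in zones.values():
        if wwpn in members:
            visible |= members
    visible.discard(wwpn)   # a device is not reported to itself
    return visible

print(name_server_query("10:00:00:00:c9:aa:00:01"))
# -> {'50:05:07:63:00:c0:00:01'}
```

A FICON channel skips this query entirely, because its allowable device relationships were already defined on the host in IOCP or HCD.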
The IBM z14, like its predecessors, inherits sophisticated virtualization capabilities. Virtualization brings
finer control to powerful systems that can be logically divided into multiple simultaneously running
processes. System z provides a hypervisor function, which enables the hosting of multiple Logical Partitions
(LPARs), in which operating system (OS) images can run independently. In addition, z/VM can run as
a second-level hypervisor in one or more of these LPARs, thereby creating Virtual Machines (VMs) in
which additional OS images can be run within an LPAR. These significant virtualization capabilities allow
concurrent execution of a large number of OS images on a single CEC (Central Electronics Complex). This
virtualization achieves its most significant economic benefit by sharing and making optimum use of I/O
resources.
An IBM z14 running z/VM and up to 85 LPARs needs a storage network that can effectively handle this
computing power. In addition, the network should share the I/O resources in a controlled, secure
manner. Predecessors of the IBM z14 pioneered server virtualization, including the sharing of data storage
subsystems among the virtual servers of a host computer, using the channel-sharing capabilities of FICON
channels in FC fabrics. This sharing of industry-standard SCSI devices among virtual servers using FCP in
SANs, however, has historically been problematic, due to the traditional single-user design and the need to
reserve resources for use by particular OS images. Enterprise-class systems, which often host multiple OS
images, require each OS image that uses the SCSI-FCP protocol to have its own unshared Node Port
(N_Port) in a Host Bus Adapter (HBA), so that it can be uniquely identified by an N_Port ID.
To apply the power of server virtualization to such an environment, IBM mainframes and LinuxONE servers
implement an FC standard called Node Port ID Virtualization (NPIV), which enables the sharing of host
adapters in these IBM mainframes and FC fabrics. With NPIV, a host FC adapter is shared in such a way that
each virtual adapter is assigned to a virtual server and is separately identifiable in the fabric. Connectivity
and access privileges in the fabric are controlled via the identification of each virtual adapter. The IBM Z
and LinuxONE servers use NPIV and FC virtual switches in virtual fabrics to support Linux. Storage networks
are now expanding into virtual realms, which have been widely accepted in mainframe environments for
decades. Mainframes have housed many virtual processes to increase performance, manageability, and
utilization. The same transformation is occurring in storage networks on two levels. First, NPIV lets Linux
images on IBM Z and LinuxONE servers access open systems storage. This storage usually costs much less
than the high-end DASD associated with mainframes. Second, virtual fabric technology lets the storage
network create multiple FC virtual switches using the same physical hardware.
Many open systems servers have appeared in the data center to fulfill different business application
requirements. Every time a department finds a new application, servers are acquired that require
new switching, storage, cabling, and management. These open systems servers introduce additional
administrative overhead such as maintenance, procurement, and inventory costs. The acquisition of
a few servers at a time over several years can result in racks and racks of servers, many of which are
underutilized. However, instead of acquiring new hardware for new or overloaded applications, you can
replace these open systems servers with virtual Linux servers on IBM Z and/or LinuxONE. These virtual
servers can be up and running quickly, rather than in the days or weeks it takes to acquire and install
physical open systems servers. Implementing Linux on IBM Z and LinuxONE servers means that world-class
computing resources can now take advantage of low-cost open systems storage, further reducing costs. In
addition, these servers running Linux scale predictably and quickly and offer benefits such as consistent,
homogeneous resource management.
The FICON Express channel cards in these servers provide either FICON or FCP channel functionality, depending on the firmware and
microcode with which they are loaded. The channel card itself features local memory, a high-speed Direct
Memory Access (DMA) engine used for accessing system memory, an FC HBA chipset, and a pair of cross-
checked PowerPC processors. The channel firmware comprises the following components:
• A device driver that interfaces directly with the FC HBA chipset to drive the FC link
• A real-time OS kernel that provides the basic system infrastructure
• A layer that is unique to the SCSI protocol for providing the programming interface to the host device
drivers for SCSI devices
• A channel and control unit component that provides the interface to the z/Architecture I/O subsystem,
which in turn provides I/O and other operations used in controlling the configuration
• A Queued Direct I/O (QDIO) component for providing the primary transport mechanism for all FC traffic
FCP support on today’s IBM Z mainframes allows Linux running on the host to access industry-standard
SCSI storage devices. For disk applications, these FCP storage devices utilize Fixed Block Architecture (FBA)
512-byte sectors, rather than the Extended Count Key Data (ECKD) format. The SCSI and FCP controllers
and devices can be accessed by Linux on the IBM Z and LinuxONE servers as long as the appropriate I/O
driver support is present. There are currently two supported methods of running Linux on IBM Z: natively in
a logical partition or as a guest operating system under z/VM (version 4 release 3 and later releases).
The FCP I/O architecture used by IBM Z and LinuxONE servers fully complies with the Fibre Channel
Standards specified by the International Committee of Information Technology Standards (INCITS). To review:
FCP is an upper-layer FC mapping of SCSI on a common stack of FC physical and logical communications
layers. FC-FCP and SCSI are supported by a wide range of controllers and devices, which complement the
storage attachment capability through FICON and ESCON channels.
The QDIO architecture is used by FICON channels in FCP mode to communicate with the operating
system. This architecture is derived from the same QDIO architecture that is defined for IBM HiperSockets
communications and for OSA Express channels. Rather than using control devices, FCP channels use data
devices that represent QDIO queue pairs, which consist of a request queue and a response queue. Each of
these queue pairs represents a communications path between the FCP channel and the operating system.
An operating system can send FCP requests to the FCP channel via the request queue, and the response
queue can be used by the FCP channel for passing completion indications and unsolicited status indications
back to the operating system.
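The request/response queue-pair flow can be sketched in Python as follows; the class and operation names are illustrative only, since real QDIO queues live in memory shared between the operating system and the channel:

```python
# Simplified model of an FCP QDIO queue pair: a request queue carries
# FCP requests from the OS to the channel, and a response queue carries
# completions and unsolicited status back. Names are illustrative.

from collections import deque

class QdioQueuePair:
    def __init__(self):
        self.request_queue = deque()    # OS -> FCP channel
        self.response_queue = deque()   # FCP channel -> OS

    def submit(self, fcp_request):
        """OS side: queue an FCP request for the channel."""
        self.request_queue.append(fcp_request)

    def channel_process(self):
        """Channel side: drain requests and post completion indications."""
        while self.request_queue:
            req = self.request_queue.popleft()
            self.response_queue.append({"request": req, "status": "complete"})

    def poll_responses(self):
        """OS side: collect completions posted by the channel."""
        responses = list(self.response_queue)
        self.response_queue.clear()
        return responses

qp = QdioQueuePair()
qp.submit({"op": "read", "lun": 0x4001000000000000})
qp.channel_process()
print(qp.poll_responses()[0]["status"])  # complete
```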
The FCP channel type still needs to be defined using HCD/IOCP, as do the QDIO data devices. However,
there is no requirement for defining the FC storage devices or switches. These devices are configured on
an operating system level using the parameters outlined in the industry-standard FC architecture. They are
addressed using World Wide Names (WWNs), FC Identifiers (FCIDs), and Logical Unit Numbers (LUNs). After
the addresses are configured on the operating system, they are passed to the FCP channel, along with the
corresponding I/O request, via a queue.
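For illustration only, an FCP channel and its QDIO data devices might be defined with IOCP statements along these lines; the CHPID number, control unit number, and device range are hypothetical, and note that no storage devices or switches appear in the definition:

```
CHPID    PATH=(CSS(0),50),SHARED,TYPE=FCP,PCHID=140
CNTLUNIT CUNUMBR=5000,PATH=((CSS(0),50)),UNIT=FCP
IODEVICE ADDRESS=(5000,32),CUNUMBR=5000,UNIT=FCP
```

The WWPNs, FCIDs, and LUNs of the actual SAN targets are then configured at the operating system level, as described above.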
Without NPIV, access to devices is granted on a first-come, first-served basis. This prevents problems with concurrent
access from Linux images that are sharing the same FCP channel (also sharing the same WWPN). This
means that one Linux image can block other Linux images from accessing the data on one or more logical
control units. If the FCP channel is used by multiple independent operating systems (under z/VM or in
multiple LPARs), the SAN is not aware of this fact, and it cannot distinguish among the multiple independent
users. You can partially avoid this by requiring separate FCP channels for each LPAR. However, this does not
solve the problem with z/VM.
Another and better method to control access to devices is the implementation of NPIV. In the mainframe
space, NPIV is unique to the IBM System z9 and later IBM Z machines. NPIV allows each operating system
sharing an FCP channel to be assigned a unique virtual WWPN. The virtual WWPN can be used for both
device-level access control in a storage controller (LUN masking) and for switch-level access control on an
FC switch (via zoning).
In summary, NPIV allows a single physical FCP channel to be assigned multiple WWPNs and to appear as
multiple channels to the external SAN environment. These virtualized FC N_Port IDs allow a physical FC port
to appear as multiple distinct ports, providing separate port identification and security within the fabric for
each OS image. The I/O transactions of each OS image are separately identified, managed, transmitted, and
processed as if each OS image had its own unique physical N_Port.
NPIV is based on extensions made to the Fibre Channel standards that allow an HBA to perform multiple
logins to the SAN fabric via a single physical port. The switch to which the FCP channel is directly
connected (that is, the “entry switch”) must support multiple N_Port logins. No changes are required for
the downstream switches, devices, or control units. The switch—and not the FCP channel—assigns the
additional N_Port IDs that are used for the virtualization. Therefore, NPIV is not supported with FCP point-to-point
attachments and requires switched FCP attachment. NPIV is available with Linux on IBM Z in an LPAR or
as a guest of z/VM versions 4.4, 5.1, and later for SCSI disks accessed via dedicated subchannels and for
guest IPLs. For guest use of NPIV, z/VM versions 4.4, 5.1, 5.2, and later all provide support transparently (in
other words, no Program Temporary Fix [PTF] is required). z/VM 5.1 and later provide NPIV support
for VM system use of SCSI disks (including emulated FBA minidisks for guests). z/VM 5.1 requires
a PTF to properly handle the case of an FC switch being unable to assign a new N_Port ID when one is
requested (due to the capacity of the switch being exceeded).
In summary, while LPARs consume one or more FC ports, NPIV allows multiple Linux “servers” to share
a single FC port. The sharing allows you to maximize asset utilization, because most open systems
applications require little I/O. While the actual data rate varies considerably for open systems servers, a
general rule is that they consume about 10 MB/sec (megabytes per second). Fibre Channel has reduced
the I/O latency by scaling FC speeds from 1 Gigabit per second (Gbps) to 2 Gbps, to 4 Gbps, to 8 Gbps, and
now to 16 Gbps. With each Gbps of bandwidth, one FC port supplies 100 MB/sec of throughput. A 4-Gbps
link should therefore be able to support about 40 Linux servers, from a bandwidth perspective.
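The rule-of-thumb sizing above can be worked through in a few lines; the figures (10 MB/sec per server, 100 MB/sec per Gbps) are the approximations quoted in the text, not measured values:

```python
# Rough consolidation sizing: how many typical open systems servers
# can share one FC link, using the rule-of-thumb numbers from the text.

def servers_per_link(link_gbps, mb_per_server=10, mb_per_gbps=100):
    """Approximate server count a link supports, bandwidth-wise."""
    return (link_gbps * mb_per_gbps) // mb_per_server

print(servers_per_link(4))    # 40 servers on a 4-Gbps link
print(servers_per_link(16))   # 160 servers on a 16-Gbps link
```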
To aggregate many Linux servers on a single FC port, the Fibre Channel industry has standardized NPIV,
which lets each Linux server (on a System z machine) have its own 3-byte FC address, or N_Port ID. After
N_Ports (servers or storage ports) have acquired an N_Port ID from logging into the switch, NPIV lets the
port request additional N_Port IDs—one for each Linux on System z server running under z/VM version 5.1 or
later. With a simple request, the switch grants a new N_Port ID and associates it to the Linux on System z
image. The WWN uniquely identifies the Linux on System z image and allows for the zLinux servers to be
zoned to particular storage ports. IBM Z currently supports 64 NPIV addresses per port, and up to 9000
Linux images in a single server platform.
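A simplified model of how the entry switch grants N_Port IDs under NPIV follows; the identifiers and the ID-assignment scheme are invented for illustration, with the 64-login cap reflecting the per-port limit mentioned above:

```python
# Illustrative model of NPIV at the entry switch: the first fabric login
# acquires a base N_Port ID, and each subsequent login request from the
# same physical port is granted an additional N_Port ID, up to a cap.

class SwitchPort:
    def __init__(self, base_id=0x010400, max_logins=64):
        self.base_id = base_id
        self.max_logins = max_logins
        self.logins = {}                    # WWPN -> N_Port ID

    def login(self, wwpn):
        """Grant an N_Port ID for this WWPN (FLOGI first, FDISC after)."""
        if len(self.logins) >= self.max_logins:
            raise RuntimeError("N_Port ID capacity exceeded")
        self.logins[wwpn] = self.base_id + len(self.logins)
        return self.logins[wwpn]

port = SwitchPort()
port.login("c0:50:76:00:00:00:00:01")          # physical port's login
nid = port.login("c0:50:76:00:00:00:00:02")    # a zLinux image's login
print(hex(nid))  # 0x10401
```

Each granted N_Port ID is associated with one Linux image's WWPN, which is what allows those images to be zoned individually to particular storage ports.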
FC virtual switches relieve another difficulty (as shown in Figure 2), that is, managing several distributed
fabrics. The storage network has grown organically with different applications and the physical limitations
of the switches. Table 1 shows the port counts for each fabric on each site. Fabrics 1 and 2 each have
64-port directors with 28 to 48 of these ports in use. The backup fabric has only 24-port switches, and only
a single port is available on each of these switches. While Fabrics 1 and 2 have a considerable number of
unused ports, the switches do not have the capability to offer ports to the backup fabric. The usage rates of
the fabrics are also relatively low, and the combinations of products, firmware releases, and management
applications make the distributed fabric rather complex.
The benefits of the consolidated approach include efficiency, reduced cost, and manageability. Figure
3 illustrates a better solution that meets all the needs of the data center. The physical configuration
of the data center can be consolidated on two 256-port directors, which can be divided into multiple
virtual storage networks. These networks can be large or small, and they offer independent, intelligible
management.
When large corporations require more than a thousand servers in their data center, the manageability and
accountability for controlling storage networks can become unruly. Breaking the fabrics into small virtual
fabrics increases the manageability of the storage network. Instead of coordinating a team of administrators
to manage one large storage network, individuals each handle a manageable piece of the solution. The
virtual nature of the new data center creates a management hierarchy for the storage network and enables
administration to be accomplished in parallel.
A further explanation of the differences between the distributed storage network and the consolidated
storage network illuminates the benefits of the new approach. In Figure 2, the distributed storage network
features 58 open systems servers, consuming 58 FC ports and the corresponding cabling and rack space.
Each open systems server is using only a fraction of the link’s bandwidth as the speed of the link increases
to 4 Gbps. The reliability of the variety of servers that have accumulated over time is significantly less than
the tested reliability of the IBM z14. The administrative cost of adding and repairing servers that change
every quarter leads to complexity and inefficiency. In a nutshell, the organic growth of the open systems
servers has led to a “population” problem.
The consolidated data center in Figure 3 features a unified architecture that is adaptable and efficient. The
IBM z14 supports mixed applications and is managed from either a single screen or multiple screens. The
director has been divided into FC virtual switches that suit the needs of each application and, in the
future, will be able to quickly scale to almost any level. The links in Figure 3 are black because they can be
attached to any of the virtual fabrics. The combination of more powerful hardware and virtual processors
and switches creates a simpler architecture than the distributed storage network.
Figure 4 shows details of the virtualization techniques between the IBM z14 and the director. The
mainframe applications have traditional direct links to the FC virtual switches, which connect the mainframe
to the DASD. These links typically run at 800 MB/sec and can support high I/O per second (IOPS). The open
systems applications are using NPIV to allow them to access the open systems storage. Figure 4 shows
how one physical link to Virtual Fabric 1 supports eight open systems applications. Each open systems
application has a zLinux server, WWN, and N_Port ID, all of which are used for management purposes.
With multiple open systems applications using a single link, the usage rates have increased to the levels
shown in Table 2. Table 2 also shows how fewer ports were used more efficiently in the consolidated storage
network than in the distributed storage network. Even after several applications were added to the data
center, the number of ports in the fabric decreased from 199 to 127.
With the link speed increasing to 8 Gbps, the utilization rate has increased by more than three times, to
an average of 61 percent from 20 percent. With virtual fabrics, the number of ports used in each fabric is
flexible, and switch ports can be added to the fabric without the addition of another switch. Other virtual
switches also can be present in the director, which is a configuration with a higher reliability than multiple
small switches.
In 2015, Brocade enhanced Fabric Vision with the introduction of the Analytics Monitoring Platform (AMP).
Brocade AMP is designed to provide the user with real-time performance analytics of their storage network.
Brocade also further enhanced Fabric Vision with the launch of the Gen 6 family of hardware and FOS 8.0.
These enhancements included a new tool called I/O Insight.
Fabric Vision was limited to traditional open systems environments until mid-2017 when Brocade and IBM
tested and qualified I/O Insight and AMP for IBM Z and LinuxONE servers using FCP channel connectivity.
At the time of writing, full analytics support was limited to FCP channels, although FICON channels could
still coexist in a SAN in which I/O Insight and/or AMP are deployed.
One of the well-known challenges of using Linux on any platform is the lack of storage monitoring tools.
Using Brocade AMP and/or I/O Insight in an IBM Z/LinuxONE PIM environment provides an outstanding new
toolset to allow for much more proactive monitoring of the Linux Storage environment.
Fabric Vision, including AMP and I/O Insight, will be covered in detail in its own chapter (Part 4 Chapter 4).
Conclusion
The enhancements made to the IBM Z family have made FICON/FCP Intermix, or PIM, a more attractive,
viable, and realistic option for connectivity. NPIV technology finally makes it realistic to run open systems
and z/OS on a mainframe that connects everything onto a common storage network, using FICON Express
channel cards. Virtual fabric technologies, coupled with the IBM Z, allow for the end user to bring UNIX and
Windows servers into this storage network.
Brocade has worked with IBM and other companies to help develop the T11 standards to support these
storage networking virtualization techniques on System z9 and newer IBM mainframes, in order to replace
multiple open systems servers with Linux on System z. NPIV lets these servers use open systems storage
for a low-cost solution with enhanced technologies. The FC virtual switches that create virtual fabrics, such
as those on the Brocade X6 Director, offer an additional capability to increase manageability and enable
physical switches to be more adaptable.
PART 2 Chapter 4: BROCADE MAINFRAME SAN SOLUTIONS
Introduction
Brocade offers a full range of SAN infrastructure equipment ranging from entry-level platforms up to
enterprise-class fully modular directors. The networking capabilities of the platforms allow solutions with up
to 10,000 ports in a single network. Brocade is the leader in networked storage—from the development of
Fibre Channel (FC) to the current Brocade family of high-performance, energy-efficient Gen 6 SAN switches
and directors. In addition, advanced fabric capabilities such as distance extension and FICON Management
are available from Brocade.
In the past, the speed of a switching device was used to indicate a generation of products. So, for instance,
there were 1 Gbps, 2 Gbps, 4 Gbps, and then 8 Gbps switching devices, all of which are discussed in this
chapter. Currently, the 16 Gbps switching devices are referred to as “Gen 5” devices, and the latest 32 Gbps
devices are referred to as “Gen 6”.
The sections in this chapter introduce and briefly describe the following Gen 6 (32 Gbps) products from
Brocade, as well as earlier Gen 5 (16 Gbps) products from Brocade, all of which can be utilized with FICON.
For the most up-to-date information on Brocade products, visit http://www.brocade.com/products/index.page.
Choose a product from the Switches–Fibre Channel or SAN Backbones–Fibre Channel category. Then
scroll down to view data sheets, FAQs, technical briefs, and white papers associated with that product.
This chapter provides an overview of the supported hardware products as well as an overview of the
management software that is available through Brocade to proactively manage evolving customer storage
networks.
Designed to meet relentless growth and mission-critical application demands, Brocade X6 Directors
are the right platform for large IBM Z enterprise environments that require increased capacity, greater
throughput, and higher levels of resiliency. The Brocade X6 Director is available in two modular form
factors. This modular chassis design increases business agility with seamless storage connectivity and
flexible deployment offerings. Built for large enterprise networks, the 14U Brocade X6-8 has eight vertical
blade slots to provide up to 384 32 Gbps Fibre Channel device ports and 32 additional 128 Gbps Brocade
UltraScale Inter-Chassis Link (ICL) ports. Built for midsize networks, the 8U Brocade X6-4 has four horizontal
blade slots to provide up to 192 32 Gbps Fibre Channel device ports and 16 additional 128 Gbps UltraScale
ICL ports. Each blade slot can be populated with two optional blades. For device connectivity, the Brocade
FC32-48 Fibre Channel device port blade provides 48 32 Gbps Fibre Channel ports. To support disaster
recovery and data protection storage solutions over long distances, the Brocade SX6 Extension Blade
provides 16 32 Gbps Fibre Channel ports, 16 1/10 Gigabit Ethernet (GbE) ports, and two 40 GbE ports
for Fibre Channel and IP replication traffic. Brocade directors build upon years of innovation and leverage
the core technology of Brocade systems to consistently deliver five-nines availability in the world’s most
demanding data centers. And with non-disruptive, hot-pluggable components and a no-single-point-of-failure
design, the Brocade X6 is truly the enterprise-class director for today’s storage infrastructure.
Evolving critical workloads and higher density virtualization are continuing to demand greater, more
predictable performance. The Brocade X6 features industry-leading Gen 6 Fibre Channel that increases
performance for demanding workloads across 32 Gbps line-speed links and up to 16.2 Tbps of chassis
bandwidth to address next-generation I/O- and bandwidth-intensive applications. Gen 6 Fibre Channel
technology provides up to 566 million frames switched per second per ASIC, unlocking the full capability of
flash storage. This breakthrough performance speeds up data-intensive application response times, allows
more transactions in less time, and enables improved SLAs. In addition, the Brocade X6 Director increases
scalability with double the throughput for high-density VM deployments and larger fabrics. This allows
organizations to support more storage devices and meet bandwidth requirements using the same number
of Fibre Channel links. Brocade X6 Directors provide unmatched chassis, slot-to-slot, and port performance
and bandwidth. In addition, local switching capabilities ensure that data traffic within the same port
group does not consume slot bandwidth, maximizing the number of line-rate ports while reducing latency.
Performance capabilities include:
The Brocade G620 is built for maximum flexibility, scalability, and ease of use. Organizations can scale
from 24 to 64 ports with 48 SFP+ and 4 Q-Flex ports, all in an efficient 1U package. In addition, a simplified
deployment process and a point-and-click user interface make the Brocade G620 easy to use. With the
Brocade G620, organizations gain the best of both worlds: high-performance access to industry-leading
storage technology and “pay-as-you-grow” scalability to support an evolving storage environment.
Faced with unpredictable virtualized workloads and growing flash storage environments, organizations
need to ensure that the network does not become the bottleneck. The Brocade G620 delivers increased
performance for growing and dynamic workloads through a combination of market-leading throughput and
low latency across 32 Gbps and 128 Gbps links. With Gen 6 ASIC technology providing up to 566 million
frames switched per second, the Brocade G620 Switch shatters application performance barriers with
up to 100 million IOPS to meet the demands of flash-based storage workloads. At the same time, port-to-
port latency is minimized to ≤ 900 ns (including FEC) through the use of cut-through switching at 32 Gbps.
With 48 SFP+ ports and 4 Q-Flex ports, each providing four 32 Gbps connections, the Brocade G620 can
scale up to 64 device ports for an aggregate throughput of 2 Tbps. Moreover, each Q-Flex port is capable of
128 Gbps parallel Fibre Channel for device or ISL connectivity, enabling administrators to consolidate and
simplify cabling infrastructure. Administrators can achieve optimal bandwidth utilization, high availability,
and load balancing by combining up to eight ISLs in a 256 Gbps frame-based trunk. This can be achieved
through eight individual 32 Gbps SFP+ ports or two 4×32 Gbps QSFP ports. Moreover, exchange-based
Dynamic Path Selection (DPS) optimizes fabric-wide performance and load balancing by automatically
routing data to the most efficient, available path in the fabric. This augments Brocade ISL Trunking to
provide more effective load balancing in certain configurations.
The Brocade G620 features up to 64 Fibre Channel ports in an efficiently designed 1U form factor,
delivering industry-leading port density and space utilization for simplified scalability and data center
consolidation. With this high-density design, organizations can pack more into a single data center with a
smaller footprint, reducing costs and management complexity. Designed for maximum flexibility and value,
this enterprise-class switch offers pay-as-you-grow scalability with Ports on Demand (PoD). Organizations
can quickly, easily, and cost-effectively scale from 24 to 64 ports to support higher growth. Brocade G620
ports are available for device port or ISL port connectivity. As such, Q-Flex ports can be used as ISLs to
provide simple build-out of fabrics, with more switching bandwidth. In addition, flexible, high-speed 32
Gbps and 16 Gbps optics allow organizations to deploy bandwidth on demand to meet evolving data
center needs. Brocade G620 Q-Flex ports currently support both 4×32 Gbps and 4×16 Gbps QSFPs for ISL
connectivity. The 4×16 Gbps QSFP also features breakout cable support for additional flexibility.
The Brocade G620 provides midrange IBM Z and LinuxONE environments with a high-performance, highly
available storage network switching device that is ideally suited to situations that do not require all of the
ports a Brocade X6 FICON director provides.
Networks need to evolve in order to support the growing demands of highly virtualized environments and
private cloud architectures. Fibre Channel, the de facto standard for storage networking, is evolving with the
data center. The Brocade DCX 8510 with Gen 5 Fibre Channel delivers a level of scalability and advanced
capabilities to this robust, reliable, and high-performance technology. This enables organizations to continue
leveraging their existing IT investments as they grow their businesses. In addition, organizations can
consolidate their SAN infrastructures to simplify management and reduce operating costs.
Figure 3: The Gen 5 Brocade DCX 8510-8 and DCX 8510-4 Directors.
Brocade DCX 8510 directors are available in two modular form factors. Built for large enterprise IBM Z
storage networks, the 14U Brocade DCX 8510-8 has eight vertical blade slots to provide up to 384 16
Gbps Fibre Channel ports. Built for midsize IBM Z storage networks, the 8U Brocade DCX 8510-4 has four
horizontal blade slots to provide up to 192 16 Gbps Fibre Channel ports. The Brocade DCX 8510 family
supports 2, 4, 8, 10, and 16 Gbps Fibre Channel, FICON, and 1/10 Gbps Fibre Channel over IP (FCIP).
Both models feature ultra-high-speed UltraScale Inter-Chassis Link (ICL) ports to connect up to three FICON
directors, providing extensive scalability and flexibility at the network core without using any Fibre Channel
ports.
A simplified deployment process and a point-and-click user interface make the Brocade 6510 both powerful
and easy to use. The Brocade 6510 offers low-cost access to industry-leading SAN technology while
providing “pay-as-you-grow” scalability to meet the needs of an evolving storage environment.
To accelerate fabric deployment, ClearLink® Diagnostic Ports (D_Ports) are available. This new port
type enables administrators to quickly identify and isolate optics and cable problems, reducing fabric
deployment and diagnostic times. Organizations also can use ClearLink Diagnostics to run a variety of tests
through Brocade Network Advisor or the CLI to test ports, small form-factor pluggables (SFPs), and cables
for faults, latency, and distance.
The Brocade 6510 enables secure metro extension with 10 Gbps DWDM link support, as well as in-flight
encryption and data compression over ISLs. Per Brocade 6510, in-flight encryption and data compression
are available on up to four ports at 8 Gbps or up to two ports at 16 Gbps. The switch also features on-
board data security and acceleration, minimizing the need for separate acceleration appliances to support
distance extension. Internal fault-tolerant and enterprise-class Reliability, Availability, and Serviceability
(RAS) features help minimize downtime to support mission-critical cloud environments.
The Gen 5 Brocade 6510 with Fabric Vision technology, an extension of Brocade Gen 5 Fibre Channel,
introduces a breakthrough hardware and software technology that maximizes uptime, simplifies SAN
management, and provides unprecedented visibility and insight across the storage network. Offering
innovative diagnostic, monitoring, and management capabilities, this technology helps administrators avoid
problems, maximize application performance, and reduce operational costs.
The Brocade 7840 is an ideal platform for building a high-performance data center extension infrastructure
for replication and backup solutions. It leverages any type of inter-data center WAN transport to extend
open systems and mainframe storage applications over any distance. Without the use of extension,
those distances are often impossible or impractical. In addition, the Brocade 7840 addresses the most
demanding disaster recovery requirements. Twenty-four 16 Gbps Fibre Channel/FICON ports, sixteen 1/10
Gigabit Ethernet (GbE) ports, and two 40 GbE ports provide the bandwidth, port density, and throughput
required for maximum application performance over WAN connections. Designed for maximum flexibility,
this enterprise-class extension switch offers “pay as you grow” scalability with capacity-on-demand
upgrades. To meet current and future requirements, organizations can quickly and cost-effectively scale
their WAN rate from 5 Gbps to 40 Gbps per platform via software licenses. With compression enabled,
organizations can scale up to 80 Gbps application throughput, depending on the type of data and the
characteristics of the WAN connection. The Brocade 7840 base configuration is a comprehensive bundle
that includes a set of advanced services: FCIP, IP Extension, Brocade Fabric Vision technology, Extension
Trunking, Adaptive Rate Limiting, IPsec, compression, Open Systems Tape Pipelining (OSTP), Fast Write,
Adaptive Networking, and Extended Fabrics. Optional value-add licenses for Integrated Routing (FCR), FICON
Management Server (CUP), and Advanced FICON Accelerator are available to address challenging extension
and storage networking requirements in open system and mainframe environments.
The Brocade 7840 is a robust platform for large-scale, multi-site data center environments implementing
block, file, and tape data protection solutions.
Such performance gains enable use cases that at one time were deemed unfeasible. IP extension also
offers other, more far-reaching benefits. The Brocade 7840 supports and manages Fibre Channel/FICON
and IP-based data flows, enabling storage administrators to consolidate I/O flows from heterogeneous
devices and multiple protocols. The consolidation of these applications into a single, managed tunnel
between data centers across the WAN has real operational, availability, security, and performance value.
Consolidating IP storage flows, or both IP storage and Fibre Channel/FICON flows, into a single tunnel
contributes significantly to operational excellence. Operational advantages are gained with Fabric Vision,
MAPS (Monitoring Alerting Policy Suite), WAN Test Tool (Wtool), and Brocade Network Advisor. Using
custom, browser-accessible dashboards for IP storage or combined Fibre Channel and IP storage, storage
administrators have a centralized management tool to monitor the health and performance of their
networks.
The advanced performance—up to 80 Gbps per blade—and network optimization features of the Brocade
SX6 Extension Blade enable replication and backup applications to move more data over metro and WAN
links in less time, and optimize available WAN bandwidth. In addition, the Brocade SX6 secures data
flows over distance with no performance penalty to minimize exposure to data breaches between data
centers. Supporting up to 250 ms Round-Trip Time (RTT) latency, the Brocade SX6 enables extension over
distances up to 37,500 kilometers (23,400 miles). The Brocade SX6 Extension Blade maximizes utilization
of WAN links through protocol optimization technology and expands WAN bandwidth with hardware-based
compression, disk and tape protocol acceleration, WAN-optimized TCP, and other extension networking
technologies.
The Brocade X6 Director and Brocade SX6 Extension Blade combine data center-proven reliability with
an innovative business continuity solution for non-stop operations. They leverage the core technology
of Brocade Gen 6 Fibre Channel, consistently delivering 99.9999 percent uptime in the world’s most
demanding data centers. The Brocade X6 Director delivers enterprise-class availability features, such as
hot-pluggable chassis components with non-disruptive software upgrades to maximize application uptime
and minimize outages. This combines director-class availability with the Brocade SX6's innovative extension
features and the industry's only WAN-side, non-disruptive firmware upgrades to achieve always-on business
operations and maximize application uptime. These unique capabilities enable a high-performance and
highly reliable network infrastructure for disaster recovery and data protection.
The Brocade SX6 provides a suite of features—from pre-deployment validation to advanced network failure
recovery technologies—to ensure a continuously available storage extension infrastructure. With built-
in Flow Generator and WAN Test Tool (Wtool), administrators can validate conditions of the WAN links,
network paths, and proper setup of configurations prior to deployment. In addition, administrators can
troubleshoot the physical infrastructure to ease deployment and avoid potential issues. To protect against
WAN link failure and avoid restart or resync events, the Brocade SX6 Extension Blade leverages Brocade
WAN-optimized TCP to ensure in-order lossless transmission of data. In addition, the Brocade SX6 with
Extension Trunking protects against WAN link failures with tunnel redundancy for lossless path failover
and guaranteed in-order data delivery using Lossless Link Loss (LLL). The advanced Extension Trunking feature allows multiple
network paths to be used simultaneously. When there is a failure for a network path, Extension Trunking will
automatically retransmit the lost packets over a non-affected WAN link to maintain overall data integrity. The
replication application will be protected and continue with no disruption. To maintain replication application
performance during failure situations, the Brocade SX6 features Adaptive Rate Limiting that uses dynamic
bandwidth sharing between minimum (floor) and maximum (ceiling) rate limits. With Adaptive Rate Limiting,
organizations can optimize bandwidth utilization and maintain full WAN performance of the link during
periods when a path is offline due to an extension platform, IP network device, or array controller outage.
During this failure scenario, the other extension switch automatically detects that the first device went idle,
and dynamically adjusts to utilize 100 percent of the available WAN bandwidth, providing full throughput.
In addition, with unprecedented amounts of storage data crossing extension connections and consuming
larger, faster links, Brocade has enhanced Adaptive Rate Limiting to react 10 times faster to varying traffic
patterns that compete for WAN bandwidth or use shared interfaces. Brocade extends proactive monitoring
between data centers through MAPS and Flow Vision to automatically detect WAN anomalies. With real-time
information, administrators can pinpoint and isolate the issue to either its storage or network device to
accelerate troubleshooting and avoid unplanned downtime.
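The floor/ceiling rate-sharing behavior described above can be sketched in a few lines of Python. This is a simplified illustration of the concept only, not Brocade's actual algorithm; the function name and its structure are hypothetical:

```python
# Hedged illustration (not Brocade's implementation) of Adaptive Rate
# Limiting: each tunnel has a minimum (floor) and maximum (ceiling) rate,
# active tunnels share WAN bandwidth between those bounds, and when a peer
# goes idle the surviving tunnel ramps up toward its ceiling.

def arl_rates(wan_bw_gbps, tunnels):
    """tunnels: list of dicts with 'floor', 'ceiling', 'active' keys.
    Returns the rate (Gbps) granted to each tunnel, greatly simplified."""
    active = [t for t in tunnels if t["active"]]
    if not active:
        return [0.0 for _ in tunnels]
    fair_share = wan_bw_gbps / len(active)
    return [
        min(t["ceiling"], max(t["floor"], fair_share)) if t["active"] else 0.0
        for t in tunnels
    ]

# Two healthy tunnels split a 10 Gbps WAN link evenly...
two_up = arl_rates(10, [
    {"floor": 1, "ceiling": 10, "active": True},
    {"floor": 1, "ceiling": 10, "active": True},
])
assert two_up == [5.0, 5.0]

# ...and when one path goes offline, the other dynamically adjusts to
# utilize 100 percent of the available WAN bandwidth.
one_down = arl_rates(10, [
    {"floor": 1, "ceiling": 10, "active": True},
    {"floor": 1, "ceiling": 10, "active": False},
])
assert one_down == [10.0, 0.0]
```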
The Brocade SX6 Extension Blade provides flexible Fibre Channel and IP storage replication deployment
options within the Brocade X6 Director, integrating seamlessly with Fibre Channel blades or providing
standalone extension services for large-scale, multi-site data center environments implementing block,
file, and tape data protection solutions. In addition, a broad range of advanced extension, FICON, and
storage fabric services are available to address the most challenging extension and storage networking
requirements.
With the Brocade SX6, organizations gain “scale as you grow” flexibility with up to four blades within a
Brocade X6 Director chassis. Each extension blade provides sixteen 32 Gbps Fibre Channel/FICON ports,
sixteen 1/10 Gigabit Ethernet (GbE) ports, and two 40 GbE ports to deliver the high bandwidth, port density, and
throughput required for maximum application performance over WAN connections. With industry-leading
port density, administrators can connect more Fibre Channel and IP storage devices to scale on demand.
To meet current and future requirements, with compression enabled, organizations can scale up to 80
Gbps application throughput, depending on the type of data and the characteristics of the WAN connection.
The Brocade SX6 Extension Blade provides a comprehensive bundle that includes an unlimited WAN
rate up to 40 Gbps and a set of advanced services: FCIP, IP Extension, Brocade Fabric Vision technology,
Extension Trunking, Adaptive Rate Limiting, IPsec, compression, Open Systems Tape Pipelining (OSTP),
Fast Write, Adaptive Networking, Extended Fabrics, FICON Management Server (CUP), and Advanced FICON
Acceleration. An optional value-add license for Integrated Routing (FCR) is available to address challenging
extension and storage networking requirements in open systems environments. Organizations can deploy
the Brocade SX6 and the Brocade 7840 Extension Switch in a data center-to-edge architecture as a cost-
effective option for connecting primary data centers with remote data centers and offices.
The Brocade SX6 also fully supports the same IP extension features and functions described earlier in the
section on the Brocade 7840 extension switch.
To address the challenge of moving data over distance, the Brocade FX8-24 Extension Blade, designed
specifically for the Brocade DCX 8510 Backbone with Gen 5 Fibre Channel and for Brocade DCX Backbones,
helps provide the fastest, most reliable, and most cost-effective network infrastructure for remote data
replication, backup, and migration. Leveraging twelve advanced 8 Gbps Fibre Channel ports, ten 1 GbE
ports, up to two optional 10 GbE ports, and FCIP technology, the Brocade FX8-24 provides a flexible and extensible platform to move
more data faster and farther than ever before.
Organizations can install up to four Brocade FX8-24 blades in a Gen 5 Brocade 8510-8/8510-4 or an 8
Gbps DCX/DCX-4S, providing scalable Fibre Channel and FCIP bandwidth for larger enterprise data centers
and multisite environments. Activating the optional 10 GbE ports doubles the aggregate bandwidth to 20
Gbps and enables additional FCIP port configurations (ten 1 GbE ports and one 10 GbE port, or two 10 GbE
ports).
The following table shows the different names used by these OEMs: EMC, HDS, and IBM.
Table 1: Brocade hardware products with the naming conventions of different OEMs.
FICON/FMS/CUP
FICON Management Server (FMS) is a comprehensive and easy-to-use software product that allows
automated control of FICON storage networks from a single interface. FMS allows the CUP port to become
available for FICON users. FMS is a software license that is applied to FICON switching and extension
devices.
Host-based management programs manage switches using Control Unit Protocol (CUP) by sending
commands to an emulated control device in Brocade FOS. A Brocade switch that supports CUP can be
controlled by one or more host-based management programs, as well as by Brocade tools. A mode register
controls the behavior of the switch with respect to CUP itself, and with respect to the behavior of other
management interfaces.
The CUP provides an interface for host in-band management and collection of FICON director performance
metrics using the z/OS RMF 74-7 record, more commonly known as the FICON Director Activity Report.
Through the use of this record and report, you can gain a good understanding of frame pacing delay.
Frame pacing delay in the FICON Director Activity Report must be carefully analyzed, as numbers in this
column of the report do not automatically mean there are inadequate buffer credits assigned.
Host-based management programs manage the FICON directors and switches by sending commands to
the switch Control Unit defined in the IOCDS and Hardware Configuration Definition (HCD). A FICON director
or switch that supports CUP can be controlled by one or more host-based management programs or
director consoles. Control of the FICON directors can be shared between these options. There are 39 CUP
commands, or Channel Command Words (CCWs), for monitoring and control of the FICON director functions.
CUP commands are oriented towards management of a single switch, even though the use of CUP in a
cascaded FICON environment is fully supported.
The FICON CUP license must be installed, FICON Management Server mode (fmsmode) must be enabled,
and the configure CUP attributes (FMS parameters) for the FICON director on the switch must be set
to enable CUP management features. When this mode is enabled, Brocade FOS prevents local switch
commands from interfering with host-based management commands by initiating serialized access to
switch parameters.
For Gen 5 (and earlier) products, FICON CUP was an optional, licensed feature (the FICON Management
Server (FMS) license). Starting with Gen 6 Brocade products, FICON CUP is included in the enterprise bundle of software.
Fabric OS CLI
The Fabric OS CLI, which is accessed via Telnet, SSH, or serial console, provides full management capability
on a Brocade switch. The Fabric OS CLI enables an administrator to monitor and manage individual
switches, ports, and entire fabrics from a standard workstation. Selected commands must be issued from a
secure Telnet or SSH session.
Access is controlled by a switch-level password for each access level. The commands available through the
CLI are based on the user’s login role and the license keys that are used to unlock certain features.
Task automation can be achieved by scripting against the switch serial port or Telnet interfaces.
Web Tools
Brocade Web Tools, an intuitive and easy-to-use interface, can be used to monitor and manage individual
Brocade FOS-based switches. Administrators can perform tasks by using a Java-capable Web browser from
their laptops, desktop PCs, or workstations at any location that has IP connectivity. In addition, Web Tools
access is available through Web browsers, a secure channel via HTTPS, or from Brocade Network Advisor.
This product enables the management of any switch in the fabric from a single access point. In addition, any
single switch can automatically discover the entire fabric. When administrators enter the network address
of any switch in the fabric, the built-in Web server automatically provides a full view of the fabric. From that
point, administrators can configure, monitor the status of, and manage switch and fabric parameters on
any switch in the fabric. Organizations may use this product to configure and administer individual ports or
switches. To protect against unauthorized actions, username and password login procedures are used to
limit access to configuration features.
Web Tools is available free of charge, and a limited set of features is accessible without a license.
Additional switch management features are accessible using Web Tools with the Enhanced Group
Management (EGM) license.
Brocade Network Advisor saves network administrators time and enables greater agility through several key
capabilities, including:
• Intelligent dashboards: Organizations can proactively identify problem areas and prevent network
downtime with a customizable, at-a-glance summary of all discovered Brocade devices and third-party IP
devices.
• Easy-to-use interface: Administrators can rapidly identify problem areas and troubleshoot use cases with
performance dashboards and drill-down options.
• Performance and historical data reporting: Organizations gain visibility into in-depth performance
measures and historical data for advanced troubleshooting and performance management.
With Brocade Network Advisor, storage and server administrators can proactively manage their SAN
environments to support non-stop networking, address issues before they impact operations, and minimize
manual tasks. Key capabilities for this include:
• Configuration management: Organizations can automate manual operations with configuration wizards
for Fibre Channel SANs, FICON and cascaded FICON environments, FCIP tunnels, Host Bus Adapters
(HBAs), and Converged Network Adapters (CNAs).
• Policy Monitor management: Brocade Network Advisor can proactively validate SAN configurations with
best-practices templates.
• Top Talkers: Organizations can identify devices or ports with the most demanding traffic flows for
proactive capacity planning.
Brocade Network Advisor integrates with Brocade Fabric Vision technology to provide unprecedented
visibility and insight across the storage network. Brocade Network Advisor supports the following Brocade
Fabric Vision technology features: Monitoring and Alerting Policy Suite (MAPS), Flow Vision, Single-dialog
bulk configuration, Brocade ClearLink Diagnostics, and Bottleneck Detection.
Brocade Network Advisor supports the entire Brocade portfolio, unifying network management under a
single tool. For a complete list of supported Brocade products, refer to Brocade Network Advisor Installation
and Migration Guide.
Brocade Network Advisor is available in three versions. The free Professional version supports only one
fabric and up to 1000 fabric ports. Both the Professional Plus and Enterprise SAN licenses now support up
to 36 fabrics. The maximum number of access gateways that each license can support has doubled to 200.
Per-instance port count limits are 2,560 and 9,000 for the Professional Plus and Enterprise versions,
respectively.
Brocade Network Advisor is available for Windows and Linux platforms. For a complete list of supported
platforms, refer to the Brocade Network Advisor Installation and Migration Guide.
IBM and HDS resell the product under the Brocade Network Advisor (BNA) name. HP sells it as HP Network
Advisor, while EMC calls the product Connectrix Manager Converged Network Edition (CMCNE). Be aware that not all
OEMs offer all three available versions—Professional, Professional Plus, and Enterprise—and all releases.
Refer to each OEM's website for information on what it supports.
Brocade Fabric Vision technology includes the following capabilities:
• ClearLink diagnostics: Leverages the unique Brocade Diagnostic Port (D_Port) mode to ensure optical
and signal integrity for Gen 5 Fibre Channel optics and cables, simplifying deployment and support of
high-performance fabrics
• Bottleneck Detection: Identifies and alerts administrators to device or Inter-Switch Link (ISL) congestion,
as well as device latency in the fabric, providing visualization of bottlenecks and identifying exactly which
devices and hosts are impacted
• At-a-Glance Dashboard: Includes customizable health and performance dashboard views, providing all
critical information in one screen
• Forward Error Correction (FEC): Automatically detects and recovers from bit errors, enhancing
transmission reliability and performance
• Virtual Channel-Level Buffer Credit Recovery: Automatically detects and recovers buffer credit loss
at the virtual channel level, providing protection against performance degradation and enhancing
application availability
• Monitoring and Alerting Policy Suite (MAPS): A policy-based monitoring tool that simplifies fabric-wide
threshold configuration and monitoring
• Flow Vision: A comprehensive tool that allows administrators to identify, monitor, and analyze specific
application data flows
The Brocade Flow Vision tool suite allows administrators to identify, monitor, and analyze specific application
data flows in order to maximize performance, avoid congestion, and optimize resources.
The Brocade SAN Health® family provides powerful tools that help administrators focus on optimizing
fabrics rather than manually tracking components. A wide variety of useful features make it easier to collect data, identify
potential issues, and check results over time.
The Brocade SAN Health diagnostic capture tool can be used to do the following:
• Take inventory of devices and fabrics (SAN and FICON), allowing you to load the FICON IOCP in the utility
• Capture and display historical performance data, as well as asset performance statistics and error
conditions
• Produce detailed graphical reports and fabric diagrams that even can be published to a customer’s
intranet
• Assist with zoning migrations. From the Brocade SAN Health output, zone scripts, consisting of CLI
commands, are generated to create new zones.
At the core of SAN Health is the SAN Health Diagnostics Capture tool, which automatically scans and
captures information about SAN fabrics and backbones, directors, switches, and connected devices in the
storage environment. SAN Health Diagnostics Capture runs from any Windows PC with TCP/IP connectivity
to the SAN. The captured information is stored in an encrypted audit file, which the user can
then upload or e-mail to Brocade. Brocade then generates the target reports, which are made available
to the user, usually within a few hours. The user reports are delivered in Excel and Visio formats and provide
information such as snapshot views of inventory, configurations, historical performance, and graphical
topology diagrams.
For FICON mainframe environments, the Input Output Configuration Program (IOCP) data in plain-text format
can be uploaded when Brocade SAN Health is executed. Brocade SAN Health then matches the IOCP
against the Request Node Identification Data (RNID) data in the switched-fabric environment. The use of
Brocade SAN Health in a FICON environment is not limited to the local environment, but includes cascaded
assets as well.
SAN Health Professional is a program that you can use to load the original report data, the SH file, which
is generated by SAN Health Diagnostics Capture. The program supports extended functionality beyond
the capabilities of an Excel report and Visio topology diagram. Capabilities such as searching, comparing,
custom report generation, and change analysis are all available in an easy-to-use GUI.
SAN Health Professional Change Analysis is an add-on module for SAN Health Professional that enables
organizations to compare two SAN Health reports that were run at different times to visually identify
which items have changed from one audit to the next. With SAN Health Professional Change Analysis,
organizations can compare two SAN Health reports with all the detailed changes highlighted in an easy-to-
understand format. The changes are easily searchable, and organizations can quickly produce a change
report.
PART 2 Chapter 5: BEST PRACTICES
Introduction
As a professional, you might hear about “best practices” all the time, but have you taken the time to think
about what a best practice really is? One of the most helpful definitions comes from the website whatis.com:
“A best practice is a technique or methodology that, through experience and research, has proven to
reliably lead to a desired result. A commitment to using the best practices in any field is a commitment to
using all the knowledge and technology at one’s disposal to ensure success. A best practice tends to spread
throughout a field or industry after success has been demonstrated.” For those who work in the technology
field, this definition illuminates exactly what a best practice is all about.
Depending on what you are trying to accomplish, certain best practices that prove useful for the industry as
a whole may not work for you in your situation. Remember the old adage, “Your mileage may vary!”
The FICON best practices described in this chapter are designed to guide you to the correct method—and
typically the easiest method—to architect, deploy, and manage a mainframe I/O infrastructure when utilizing
Brocade switching devices in a FICON or Protocol Intermix environment.
The following overview of best practices and techniques is derived from the much more detailed “FICON
Advice and Best Practices,” available at www.brocade.com. Although FICON switching equipment exists
at speeds beginning at 1 Gigabit per second (Gbps), this document focuses on Gen 5 (16 Gbps) and Gen 6 (32 Gbps)
switching device best practices. Gen 5 can incorporate 8 Gbps, 10 Gbps, and 16 Gbps connectivity. Brocade
Fabric OS (FOS) 8.x and 7.x are also featured, although occasionally Brocade FOS 6.4.2a is referenced.
In this chapter, only a few tips are provided per major area. Refer to the document mentioned above for all
of the advice and best practices created for FICON fabrics.
Addressing
When the FICON Management Server (FMS) license is installed, and the Control Unit Port (CUP) is enabled
and being utilized, the CUP makes use of an internal embedded port (0xFE) for its in-band communications:
• If physical ports 0xFE and 0xFF exist, they become unavailable for FICON connectivity.
• Physical port 0xFF can still be used when a vacant port is needed for a port swap activity.
Consider using the Brocade Virtual Fabrics feature if the switch chassis contains 256 physical ports or
more and there is a need to utilize all of those physical ports for connectivity.
Brocade Virtual Fabrics allow users to create a Logical Switch (LS) that has only ports 0x00 through 0xFD
(0–253) and one or more other LSs that contain the remainder of the ports.
• Since the LSs do not have any physical port “FE” or “FF” defined, all ports can now be used even when
CUP is deployed.
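The effect of the embedded CUP port on addressing can be sketched in Python. This is purely illustrative; the function is hypothetical, not a Brocade tool:

```python
# Illustrative sketch of how the CUP embedded port at address 0xFE
# constrains usable FICON port addresses, and how carving a Logical
# Switch that ends at 0xFD sidesteps the conflict.

def usable_ficon_addresses(physical_ports, cup_enabled=True):
    """Return the port addresses available for FICON connectivity."""
    addresses = list(range(physical_ports))
    if cup_enabled:
        # 0xFE is taken by the embedded CUP port; 0xFF is conventionally
        # kept vacant (e.g., for port swap activity).
        addresses = [a for a in addresses if a not in (0xFE, 0xFF)]
    return addresses

# A 256-port chassis loses two addresses when CUP is enabled...
assert len(usable_ficon_addresses(256)) == 254

# ...but a Logical Switch defined with only ports 0x00-0xFD never collides
# with the embedded port, so every port in it remains usable.
logical_switch = usable_ficon_addresses(0xFE, cup_enabled=True)
assert len(logical_switch) == 0xFE
```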
Autonegotiation
At 32 Gbps, optics cannot autonegotiate down to 4 Gbps (and cannot be set manually to run at 4 Gbps).
• 4 Gbps storage requires the use of 16 Gbps optics (Small Form-Factor Pluggable, or SFP). 16 Gbps SFPs
can be inserted into 32 Gbps ports on 32 Gbps switching devices.
At 16 Gbps, optics cannot autonegotiate down to 2 Gbps (and cannot be set manually to run at 2 Gbps).
• 2 Gbps storage requires the use of 8 Gbps optics (SFP Plus, or SFP+), and 8 Gbps SFPs can be inserted
into 16 Gbps ports on Gen 5 switching devices.
• It is not possible to attach 1 Gbps storage to Gen 5 Brocade 16 Gbps switching devices.
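The rules above boil down to a simple speed-matching lookup. The sketch below is an illustration of the logic described in this section; the table and function are hypothetical, and port behavior is modeled only to the extent stated above:

```python
# Illustrative model of FC optic autonegotiation: an SFP supports its
# native rate and the two speed steps below it, so 32 Gbps optics bottom
# out at 8 Gbps and 16 Gbps optics at 4 Gbps.

SFP_SPEEDS = {
    32: {32, 16, 8},   # Gen 6 optics
    16: {16, 8, 4},    # Gen 5 optics
    8:  {8, 4, 2},     # 8 Gbps optics (SFP+)
}

# Per the text, a 32 Gbps port accepts 16 Gbps SFPs, and a Gen 5
# 16 Gbps port accepts 8 Gbps SFPs.
PORT_ACCEPTS = {32: (32, 16), 16: (16, 8)}

def optic_for(port_speed_gbps, device_speed_gbps):
    """Pick an optic that fits the port and can negotiate the device speed."""
    for optic in PORT_ACCEPTS[port_speed_gbps]:
        if device_speed_gbps in SFP_SPEEDS[optic]:
            return optic
    return None

assert optic_for(32, 4) == 16   # 4 Gbps storage on a Gen 6 port: 16 Gbps SFP
assert optic_for(16, 2) == 8    # 2 Gbps storage on a Gen 5 port: 8 Gbps SFP+
assert optic_for(16, 1) is None # 1 Gbps storage cannot attach to Gen 5 devices
```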
Buffer Credits
Buffer Credits (BCs) define the maximum amount of data (in other words, frames in flight) that can be sent
down a link before any acknowledgment of frame delivery is received.
On Brocade Gen 5 and Gen 6 devices, the default of 8 and 16 BCs respectively per port is more than
adequate for local FICON I/O. On the Brocade 8 Gbps, Gen 5, and Gen 6 switching devices, BCs are pooled
by ASIC and are not isolated on a port-by-port basis. Users can take BCs from the ASIC pool and deploy
them to F_Ports and E_Ports, as required.
Gen 5 Fibre Channel-based switches are able to utilize a BC pool of 8,192 buffers, which quadruples the
Brocade 8 Gbps Condor2 BC pool of 2,048 buffers. Gen 6 products have a pool of 15,360 BCs per Condor 4
ASIC.
Use the Resource Management Facility (RMF) FICON Director Activity Report to find any indication of Buffer
Credit problems. A non-zero figure in the “Average Frame Pacing Delay” column means that a frame was
ready to be sent, but there were no Buffer Credits available for 2.5 microseconds (μsec) or longer. Users
want to see this value at zero most of the time.
Why not just use the maximum number of BCs on a long-distance port?
• If the switching device gets too busy internally handling small frames, so that it is not servicing ports in a
timely manner, and the short-fused mainframe Input/Output (I/O) timer pops, then it places a burden on
the switching device to discard all of those frame buffers on each port and then redrive the I/O for those
ports.
• Since BCs also serve a flow control purpose, adding more credits than needed for distance causes the
port to artificially withhold backpressure, which can affect error recovery.
• It is a best practice to allocate just enough BCs on a long-distance link to optimize its link utilization
without wasting BC resources.
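A back-of-the-envelope sketch of that sizing exercise, assuming roughly 5 microseconds per kilometer of propagation delay in fiber and a 2,148-byte full-size Fibre Channel frame (common rule-of-thumb figures, not vendor-exact values):

```python
# Rough estimate of "just enough" buffer credits for a long-distance ISL.
# Assumptions: ~5 us/km one-way propagation delay in fiber, full-size FC
# frame of 2,148 bytes. One credit is consumed per frame in flight, so to
# keep the pipe full the sender needs enough credits to cover the round trip.

FIBER_DELAY_US_PER_KM = 5.0

def buffer_credits_needed(distance_km, speed_gbps, avg_frame_bytes=2148):
    """Estimate credits needed to keep a link fully utilized at a distance.

    Smaller average frames serialize faster, so they require proportionally
    more credits to cover the same round-trip time.
    """
    round_trip_us = 2 * distance_km * FIBER_DELAY_US_PER_KM
    frame_time_us = (avg_frame_bytes * 8) / (speed_gbps * 1000)  # us per frame
    return int(round_trip_us / frame_time_us) + 1  # round up

# 100 km at 16 Gbps with full-size frames: on the order of a thousand credits.
full = buffer_credits_needed(100, 16)
assert 900 < full < 1100

# Halve the average frame size and the requirement roughly doubles -- which
# is why over-allocating credits for small-frame traffic wastes pool resources.
small = buffer_credits_needed(100, 16, avg_frame_bytes=1074)
assert small > 1.9 * full
```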
While the Brocade 8 Gbps ASIC and Brocade FOS provide the ability to detect Buffer Credit loss and recover
Buffer Credits at the port level, the Gen 5 and Gen 6 Brocade Fibre Channel ASICs' diagnostic and error
recovery feature set includes not only the port-level BC recovery capability, but also the following features:
• If the ASICs detect that only a single Buffer Credit is lost, they can restore the lost Buffer Credit without
interrupting the ISL data flow.
• If the ASICs detect that more than one Buffer Credit is lost, or if they detect a “stuck” VC, the ASICs can
recover from those conditions by resetting the link, which requires retransmission of frames that were in
transit across the link at the time of the link reset.
Virtual Channel-level BC recovery is a feature that is implemented on Gen 5 and Gen 6 Fibre Channel ASICs:
• Both sides of a single link must contain Gen 5/Gen 6 Fibre Channel ASICs.
• BC recovery is supported on local and long-distance ISL links.
Loss of a single Buffer Credit on a VC is recovered automatically by the Gen 5/Gen 6 Fibre Channel ASICs
through the use of a VC reset:
• Detection of a stuck VC occurs by the ASIC after a zero-BC condition persists for 600 milliseconds (more than half a second).
Loss of multiple BCs on a VC is recovered automatically, and often non-disruptively, by the ASICs through the
use of Link Reset commands.
• Buffer Credit recovery allows links to recover after Buffer Credits are lost, when the Buffer Credit recovery
logic is enabled.
The Buffer Credit recovery feature also maintains performance. If a credit is lost, a recovery attempt is
initiated. During link reset, the frame and credit loss counters are reset without performance degradation.
Cabling
Fibre Channel speeds have increased from 1 Gbps to 2 Gbps to 4 Gbps to 8 Gbps to 16 Gbps to 32 Gbps.
This is exciting, but with every increase in speed, you suffer a decrease in maximum distance. What needs
to change is the quality of the cables or, more specifically, the modal bandwidth (the fiber's bandwidth-distance product).
With the evolution of 10 Gbps Ethernet, the industry produced a new standard of fiber cable that the Fibre
Channel world can use to its advantage, called laser optimized cable or, more correctly, OM3. Now OM3
has been joined by an even higher standard, known as OM4. OM4 is completely backwards compatible with OM3 fiber and shares the same distinctive aqua-colored jacket.
• For user requirements, at 8 Gigabit Fibre Channel and beyond, use OM3/OM4 MM cables or OS1 SM
cables.
• OM1 and OM2 multimode cables have very limited distance when used at 8 Gbps and beyond.
• Most mainframe shops have converted to using all long-wave SFPs combined with OS1 single-mode cable
connectivity:
–– Long-wave and short-wave FICON channel cards are the same purchase cost to the user.
–– Long-wave optics are more expensive to deploy as switch and storage ports.
–– Long-wave connectivity provides the most flexibility and the most investment protection.
• For Direct Access Storage Device (DASD) I/O operations, the DASD array is typically the I/O source, and
the Channel Path Identifier (CHPID) (and application) is the target, since 90 percent of all DASD I/O
operations are for reads.
• For a tape write operation, the CHPID (and application) is typically the I/O source, and the tape drive is the
target, since 90 percent of all tape I/O operations are for writes.
Always try to ensure that the target of an I/O has an equal or greater data rate than the source of the I/O. If
you do not take this into consideration, you might find that the I/O target is being overrun by the I/O source,
and a slow drain connection can occur.
With the FICON channel subsystem, there are situations where, for some reason, a switching port can
bounce. If the port does bounce, there is an elaborate recovery process that starts. Once the recovery
process starts, if the bounced port comes back up, this often causes additional problems, as the recovery
process with the mainframe host was already underway. So, for FICON, it is a best practice to leave a
bounced port disabled and let the host handle recovery. Then customers can resolve or re-enable the
port at a later time. The Port AutoDisable feature minimizes traffic disruption that is introduced in certain
instances when automatic port recovery is performed.
Redundant fixed-port switches in redundant fabrics, with path groups utilizing both devices, potentially create an environment with four-nines of availability. Users must also consider the impact of losing 50 percent of their bandwidth and connectivity.
Single directors (such as the Brocade DCX 8510 Backbone and Brocade X6 Director) that host all of the
FICON connectivity create an environment with four-nines of availability.
• Dual Directors in dual fabrics, with path groups utilizing both devices, potentially create an environment
with five-nines of availability.
• Also consider the impact of losing a fabric and some percentage of your bandwidth and connectivity.
Many mainframe customers ensure a five-nines or better high-availability environment by deploying four or
eight redundant fabrics, which are shared across a path group, so that the bandwidth loss in the event of a
fabric failure is not disruptive to operations.
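For reference, the "nines" used above translate into yearly downtime as follows; this is simple arithmetic, not a figure quoted from the text:

```python
# Downtime per year implied by "N nines" of availability.
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600 minutes

def downtime_minutes(nines: int) -> float:
    """Expected unavailable minutes per year at 1 - 10**-nines availability."""
    return MINUTES_PER_YEAR * 10 ** -nines

print(round(downtime_minutes(4), 1))   # four nines -> roughly 52.6 min/year
print(round(downtime_minutes(5), 2))   # five nines -> roughly 5.26 min/year
```

The order-of-magnitude gap between four and five nines is why the director versus fixed-port switch distinction matters in these designs.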
It is a best practice to split up the redundant fabrics across two or more cabinets. It is also optimal to locate
those cabinets some distance apart, so that a physical misfortune cannot disrupt all the FICON fabric
cabinets and their switching devices at the same time.
Diagnostic Port (D_Port)
• Brocade FOS 7.0.0a and later support the execution of D_Port tests concurrently on up to eight ports on the switch.
• Brocade FOS 7.0.0d and lower do not support D_Port for FICON fabrics. However, it is supported on FCP-only ISL links.
• Brocade FOS 7.1 and higher supports D_Port for FICON fabrics, FCP fabrics, and Protocol Intermixed fabrics.
It is a best practice to utilize this diagnostic feature to enable smooth operation of metro-connected FICON
and FCP fabrics.
Initially supported only for ISLs, the D_Port is able to perform electrical as well as optical loopback tests. It is also able to perform link-distance measurement and link saturation testing.
Domain ID
The Domain ID defines the FICON Director address (or the switch address in an IBM Z environment).
The Domain ID is specified in the FICON Director as a decimal number, and it must be converted to a
hexadecimal value for use in the IBM Z server.
Always make Domain IDs unique and insistent. Insistent Domain IDs are always required for 2-byte
addressing (FICON cascading).
Set the switch ID in the I/O Configuration Program (IOCP) to be the same as the Domain ID of the FICON
switching device.
Make the Switch ID in the IOCP the same as the hex equivalent of the Domain ID. Although not required,
this is a best practice that makes it easier to correlate switches that are displayed through management
software to the IOCP, since the Domain ID displays as a hex equivalent number.
Forward Error Correction (FEC)
The Gen 5 and Gen 6 Brocade Fibre Channel ASICs include integrated Forward Error Correction (FEC) technology, which can be enabled only on E_Ports connecting ISLs between switches.
FEC is a system of error control for data transmissions, whereby the sender adds systematically generated
Error-Correcting Code (ECC) to its transmission. This allows the receiver to detect and correct errors without
the need to ask the sender for additional data.
ISL links using FEC must be directly connected together between Gen 5/Gen 6 switching devices.
Implementation of FEC enables the Gen 5 and Gen 6 ASICs to recover bit errors in 16 Gbps and 32 Gbps data streams. The FEC implementation can enable corrections of up to 11 error bits in every 2,112-bit transmission. This effectively enhances the reliability of data transmissions and is enabled by default on Gen 5/Gen 6 Fibre Channel E_Ports.
• In other words, for approximately every 264 bytes of data in the payload of a frame, up to 11 bit errors
can be corrected.
• Based on the ratio between the enc in and crc err counters, which shows roughly how many bit errors occur in an average frame, FEC has the potential to solve over 90 percent of the physical-layer problems users have when connecting Fibre Channel fabrics.
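As a quick arithmetic check of the figures above (this is simple math, not additional Brocade specification):

```python
# FEC on Gen 5/Gen 6 E_Ports corrects up to 11 bit errors per
# 2,112-bit transmission block.
BLOCK_BITS = 2112
MAX_CORRECTABLE_BITS = 11

bytes_per_block = BLOCK_BITS // 8            # 264 bytes, as the text states
correctable_fraction = MAX_CORRECTABLE_BITS / BLOCK_BITS

print(bytes_per_block)
print(round(correctable_fraction * 100, 2))  # about 0.52% of bits per block
```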
In-Flight Compression
• This technology enables only switch-to-switch compression, not device or data-at-rest compression.
• In-flight data compression optimizes network performance within the data center and over long-distance links.
• Data is compressed at the source and uncompressed at the destination.
• Performance varies by data type, but Brocade uses an efficient algorithm to generally achieve 2:1
compression with minimal impact on performance.
• Compression can be used in conjunction with in-flight encryption.
This compression technology is described as “in-flight” because this ASIC feature is enabled only between
E_Ports. This allows ISL links to have the data compressed as it is sent from the Fibre Channel-based switch
on one side of an ISL and then decompressed as it is received by the Fibre Channel-based switch that is
connected to the other side of the ISL.
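The practical effect of the roughly 2:1 ratio can be sketched as follows; the exact ratio varies by data type, so this is illustrative arithmetic only:

```python
# With ~2:1 in-flight compression, an ISL carries about twice its line
# rate in logical (pre-compression) data.
COMPRESSION_RATIO = 2.0   # the "generally achieve 2:1" figure from the text

def effective_gbps(line_rate_gbps: float) -> float:
    """Approximate logical throughput of a compressed ISL."""
    return line_rate_gbps * COMPRESSION_RATIO

print(effective_gbps(16))   # a 16 Gbps ISL moves roughly 32 Gbps of user data
```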
In-Flight Encryption
In-flight data encryption minimizes the risk of unauthorized access for traffic within the data center and over long-distance links:
• This technology enables only switch-to-switch encryption, not device or data-at-rest encryption.
• Data is encrypted at the source and decrypted at the destination.
• Encryption and decryption are performed in hardware using the AES-GCM-256 algorithm, minimizing any
impact on performance.
• Encryption can be used in conjunction with in-flight compression.
• In-flight encryption is available only on 16 and 32 Gbps port blades.
As with the integrated compression, the integrated encryption is supported in-flight, exclusively for ISLs linking Gen 5/Gen 6 Fibre Channel-based switches.
• The result of enabling ISL encryption is that all data is encrypted as it is sent from the Fibre Channel-
based switch on one side of an ISL and then decrypted as it is received by the Fibre Channel-based switch
connected to the other side of the ISL.
At FOS 7.0 and earlier, the Port Decommissioning/Port Commissioning feature is not supported on ISL links
that are configured with encryption or compression. Beginning at FOS 7.1, the Port Decommissioning/Port
Commissioning feature is supported on ISL links that are configured with encryption or compression.
ICLs
Now in its third generation, the Brocade optical Inter-Chassis Links (ICLs), based on Quad SFP (QSFP) technology, replace the original 8 Gbps ICL copper cables with MPO cables and provide connectivity between the core routing blades of two Brocade DCX 8510 and Brocade X6 chassis:
• The ICL ports for the Gen 5/Gen 6 directors are quad-ports, so they use MPO cables and QSFP rather
than standard SFP+.
• Each QSFP-based ICL port combines four 16 or 32 Gbps links, providing up to 64 or 128 Gbps of
throughput within a single cable.
• Available with Brocade FOS 7.0 and later, Brocade offers up to 32 QSFP ICL ports on the Brocade DCX
8510-8/X6-8 and up to 16 QSFP ICL ports on the Brocade DCX 8510-4/X6-4.
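The per-cable numbers follow directly from the lane math; the aggregate chassis figure below is this sketch's own arithmetic, not a quoted specification:

```python
# Each QSFP-based ICL port bundles four lanes into one MPO cable.
LANES_PER_QSFP = 4

def cable_gbps(lane_gbps: int) -> int:
    """Throughput of one QSFP ICL cable."""
    return LANES_PER_QSFP * lane_gbps

def chassis_icl_gbps(ports: int, lane_gbps: int) -> int:
    """Combined ICL throughput across all QSFP ports on a chassis."""
    return ports * cable_gbps(lane_gbps)

print(cable_gbps(16), cable_gbps(32))   # 64 and 128 Gbps per cable
print(chassis_icl_gbps(32, 32))         # an X6-8 with all 32 QSFP ICL ports
```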
ISLs
An ISL (Inter-Switch Link) is a link that joins two Fibre Channel switches through E_Ports.
For Brocade FOS 7.1 or higher environments, you can configure a Diagnostic Port (D_Port) within a FICON Logical Switch (LS) and use it to validate an ISL in any LS (FICON or FCP). Testing an ISL link as a D_Port connection before deploying that link as an ISL is a best practice.
For ISLs, an Extended Fabrics license is required on each chassis that participates in the long-distance connection if the link distance is 10 km or greater.
For dark fiber connections, you should deploy long-wave optics and single-mode cabling.
A best practice is to use the Brocade Network Advisor Cascaded FICON Fabric Merge wizard to create
cascaded fabrics.
• The merging of two FICON fabrics into a cascaded FICON fabric may be disruptive to current I/O
operations in both fabrics, as the wizard needs to disable and enable the switches in both fabrics.
• Make sure that fabric binding is enabled.
• Make sure that a security policy (Switch Connection Control, or SCC, policy) is in effect with strict
tolerance.
• In-Order Delivery (IOD) must be enabled.
• Make sure that every switch has a different Domain ID.
• Make sure that all Domain IDs are insistent.
It is a best practice to not mix tape I/O and DASD I/O across the same ISL links, as the tape I/O often
uses the full capacity of a link. This leaves very little for the DASD I/O, which can become very sluggish in
performance.
From Brocade FOS 7.0 to Brocade FOS 7.0.0d, when using 10 Gbps or 16 Gbps ISL links, any FCP frames
traversing ISL links can be compressed and encrypted. FICON ISL compression and encryption (IP Security,
or IPsec) was not certified for these releases.
At Brocade FOS 7.1 or higher, when using 10, 16, or 32 Gbps ISL links, any FICON or FCP frames traversing ISL links can be compressed and encrypted.
Local Switching
Brocade Local Switching is a performance enhancement that provides two capabilities:
• Local Switching reduces the traffic across the backplane of the chassis, which reduces any
oversubscription across that backplane for other I/O traffic flows.
• Local Switching provides the fastest possible performance through a switching device by significantly
reducing the latency time of a frame passing through the locally switched ports.
Brocade Local Switching occurs when two ports are switched within a single ASIC.
With Local Switching, no I/O traffic traverses the backplane (the core blades) of a director, so core blade
switching capacities are unaffected.
• The ASIC recognizes that it controls both the ingress and egress port; it simply moves the frame from the
ingress to the egress port without ever having the frame move over the backplane.
Local Switching always occurs at the full speed negotiated by the switching ports (gated by the SFP)
regardless of any backplane oversubscription ratio for the port card.
• Backplane switching on 8 Gbps and Gen 5 Brocade Directors is sustained at approximately 2.1 μsec per
I/O.
• Local Switching on Gen 5 Brocade Directors is sustained at approximately 700 ns (nanoseconds) per I/O,
which is three times faster than backplane switching.
• The Brocade FC32-48 port blade used in the Brocade X6 family of directors provides full line rate on all
48 ports with no oversubscription.
Local Switching always occurs at full speed in an ASIC, as long as the SFP supports the highest data rate
that the port can sustain.
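The latency figures above can be checked directly (illustrative arithmetic only):

```python
# Claimed per-I/O switching latencies from the text above.
backplane_ns = 2100   # ~2.1 usec per I/O across the backplane
local_ns = 700        # ~700 ns per I/O switched within a single ASIC

speedup = backplane_ns / local_ns
print(speedup)   # matches the "three times faster" claim
```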
NPIV
N_Port ID Virtualization (NPIV) must be set up on the FICON entry switching device before it is set up on the mainframe.
There is a maximum of 2,048 NPIV nodes on the Brocade Gen 5 switching devices. This limits NPIV-capable ports per chassis to 64 or fewer (32 virtual node logins × 64 ports = 2,048). These limitations are a function of both memory resources and processing power on today's switches. This limit was increased to 9,000 NPIV devices per Brocade X6 chassis for the Brocade Gen 6 directors.
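The port-count limit follows from simple division (illustrative arithmetic based on the figures above):

```python
# Gen 5 chassis budget: 2,048 NPIV node logins total, with 32 virtual
# node logins per NPIV-capable port.
CHASSIS_NPIV_LIMIT = 2048   # the text cites ~9,000 for a Gen 6 (X6) chassis
LOGINS_PER_PORT = 32

max_npiv_ports = CHASSIS_NPIV_LIMIT // LOGINS_PER_PORT
print(max_npiv_ports)   # 64 NPIV-capable ports, matching the text
```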
A vital role of NPIV is to allow users to consolidate their I/O from many Linux guests onto just a few CHPIDs
that are running in FCP mode. Users must be careful that this consolidation does not create a bottleneck.
• Too many Linux guests might push the utilization of a CHPID link beyond what can be handled without
congestion. After deploying NPIV, check the channel performance statistics to be sure that congestion is
not occurring.
• Deployment of the FCP CHPIDs must be careful to utilize as many Virtual Channel paths as possible, to
avoid congestion at the VC level of any of the ISL links that might be utilized by the NPIV links. See the
section on Virtual Channels for more information.
Optics (SFPs)
Multimode cables, which are attached to short-wavelength SFP optics, are designed to carry light waves for
only short distances. Single-mode cables, which are attached to long-wavelength SFP optics, are designed
to carry light waves for much longer distances. Three factors control a user’s ability to reliably send data
over distances longer than 984 feet (300 meters):
• The intensity (brightness) of the transmitter (for example, lasers perform better over long distances than LEDs)
• The wavelength of the laser
• The number of BCs available to the link, especially if it is a cross-site ISL
Consistent with the IBM recommendation, the best practice in a mainframe environment is to use single-
mode fiber and long-wave optics for FICON I/O connectivity.
• Only optics branded by Brocade (or SmartOptics for FCP I/O traffic) can be used in Brocade Gen 5 and
Gen 6 IBM Z qualified switching devices.
• 8 Gbps optics branded by Brocade can be used in Gen 5 16 Gbps ports; however, 16 Gbps optics cannot
be used in 8 Gbps ports.
• 16 Gbps optics branded by Brocade can be used in Gen 6 32 Gbps ports; however, 32 Gbps optics cannot
be used in 16 Gbps ports.
Brocade Gen 5 16 Gbps 8510 Directors do not support Brocade FC10-8, 10 Gbps blades.
Brocade Gen 5 Director FC16-32 and FC16-48 blade ports can be enabled to use 10 Gbps SFP+ optics.
10 Gbps speed for Brocade FC16 blades requires that the 10 Gbps license is enabled on each slot using 10
Gbps SFPs.
10 GbE ports on any Brocade FX8-24 blade housed in a Gen 5 chassis also require a 10 Gbps license enabled on the slot that houses the Brocade FX8-24 blade, in order to be used.
If an optic begins to fail or flap, customers can use the management Port Auto Disable capabilities to cause
intermittently failing ports to become disabled when they encounter an event that would cause them to
reinitialize. This negates the ability of the ports to cause performance and/or availability issues, since error
recovery does not have to try to recover the port.
If an optic begins to fail or flap, then it should be replaced. It is a best practice to disable (vary offline) that
port prior to switching out the SFP.
Port Decommissioning
At Brocade FOS 7.0, the Port Decommissioning feature became available for Brocade 8 Gbps and Gen 5 switching devices. This function provides the ability to gracefully remove the traffic from an ISL link, reroute that I/O traffic onto another link, and then bring that ISL link offline.
The Port Decommissioning feature can automatically coordinate the decommissioning of ports in a switch,
ensuring that ports are gracefully taken out of service without unnecessary interruption of service and
without triggering the automatic recovery functionality that may occur during manual decommissioning of
ports.
• Port Decommissioning provides an automated mechanism to remove an E_Port from use (decommission)
and to put it back in use (recommission).
Where ISL link aggregation (trunking) is provided for in the fabric, decommissioning of a port in a link
aggregation group may be performed, if there are other operational links in the group. Decommissioning
of a non-aggregated port (or the last operational port in a link aggregation group) may be performed if a
redundant path is available.
PDCMs
At Brocade FOS 6.4.2a and lower, Prohibit Dynamic Connectivity Masks (PDCMs) can be utilized by anyone who is using a B-Series chassis.
At Brocade FOS 7.0 and above, the FMS setting can be enabled only when a FICON Management Server (FMS) license has been installed on the chassis to allow in-band management through the Control Unit Port (CUP).
Since the FMS control must be enabled in order to utilize the PDCMs of the Allow/Prohibit Addressing
Matrix, PDCMs are available only to a customer who implements CUP on that chassis.
It is a best practice to use PDCMs to prohibit the use of ports that do not have any cables attached to an SFP. This prevents accidental miscabling of FCP attachments into a FICON zone.
If the mode register bit for Active=Saved (ASM mode) is enabled, the Initial Program Load (IPL) file is kept
in sync with the active configuration on the switching device. However, if for any reason the ASM is not
enabled, then the port states in the active configuration may differ from the port states in the IPL file of
the switch. The switch’s IPL file governs how the port is configured (its state) following any Power-on-Reset
(POR).
Note: It is a best practice to make sure that the ASM mode register bit is enabled in FICON environments.
Brocade switching device PDCMs are enforced by the zoning module. That is why Brocade FOS firmware
requires that a switch must have an active zoning configuration in order for PDCM settings to be useful.
If default zoning mode (all access) has been implemented in the fabric, then there is no active zoning
configuration (because no zones/zone-sets have been activated). PDCM changes can be set, but they are
not enforced.
Protocol Intermix Mode (PIM) typically becomes an issue with Linux on the mainframe, data replication activities, and/or Storage Area Network (SAN) sharing of a common I/O infrastructure.
It is a best practice to place FICON ports and FCP ports into different zones.
It is also a best practice to have at least one specific zone for FICON and then at least one specific zone for
each non-FICON protocol that is in use on the infrastructure (for instance, Windows, AIX, UNIX, and so forth).
Additional specific zones may be implemented to meet a customer’s configuration requirements.
Customers who want to provide isolation between their FICON and FCP environments should consider
creating a Virtual Fabric (an LS) for FICON and one or more additional Virtual Fabrics to host their FCP
environment(s). This prevents FICON and FCP I/O from communicating with each other on the fabric.
CUP
The FMS license must be purchased, installed, and enabled on the local switch in order for a remote CUP to work. On Gen 6 IBM Z-qualified Brocade products, the FMS license is included as part of the Enterprise software bundle.
Once FMS is enabled on a switching device, users should check the FICON CUP settings to be sure that
Active=Saved is enabled, so that any changes to the switching environment will persist across PORs.
It is a best practice to have only one Logical Partition (LPAR) polling the CUP for statistics. That LPAR should
distribute the information to other systems in the sysplex, as required. The first reference for this, created
many years ago, was IBM Authorized Program Analysis Report (APAR) OA02187.
It is a best practice that in the Control Unit statement for the switch in the Input/Output Configuration Data
Set (IOCDS) and Hardware Configuration Definition (HCD), customers should configure two channels—and
only two channels—as paths to each CUP in the enterprise.
The FCD option in the ERBRMFxx parmlib member and STATS=YES in IECIOSxx tell RMF to produce the 74-7 records that RMF then uses to produce the FICON Director Activity Report on every RMF interval for every switch (or every Domain ID if a physical chassis is virtualized), using CUP.
Prior to Brocade FOS 7.0.0c, a maximum of two CUPs could be enabled on a chassis utilizing Virtual Fabrics.
At Brocade FOS 7.0.0c and higher, a maximum of four CUPs can be enabled on an 8 Gbps or Gen 5 Director
chassis utilizing Virtual Fabrics. On any physical chassis (Director or modular switch) that is not utilizing
Virtual Fabrics, only a single CUP can be enabled to report on that chassis.
If System Automation (SA) is using its I/O Operations (IOOPs) application to poll remote CUPs across ISL links, it is using small frames to do this polling. The IOOPs module polling can use up all of the BCs for an ISL link, so be on alert for this possibility.
When used on a chassis that also houses FCP ports for replication, Linux on System z, or SAN, CUP can
report on the average frame size, bandwidth utilization, and Buffer Credit usage on the FCP ports, as well as
the FICON ports.
Switch ID
The switch ID must be assigned by the user, and it must be unique within the scope of the definitions (HCD,
Hardware Configuration Manager [HCM], or IOCP). The switch ID in the CHPID statement is basically used
as an identifier or label. Although the switch ID can be different from the switch address or Domain ID, it is recommended that you use the same value as the switch address and the Domain ID when referring to a FICON Director.
Set the switch ID in the IOCP to be the same as the switch address of the FICON switching device. (On
Brocade DCX switches, the switch address is also the Domain ID.)
• The “Switch” keyword (Switch ID) in the CHPID macro of the IOCP is only for purposes of documentation.
The number can be from 0x00 to 0xFF. It can be specified however you desire. It is a number that is
never checked for accuracy by HCD and the IOCP process. The best practice is to make it the same as the
Domain ID, but there is no requirement that you do so.
It is a best practice for customers to set the switch ID to the hex equivalent of the switch Domain ID. There
are two reasons for this:
• When entering a 2-byte link address, the operator is less likely to make mistakes, because the first byte is always the switch ID.
• When in debugging mode, the first byte of the link address and Fibre Channel address is the switch ID, so
there are no translations to do.
• Some examples are: SID=01 to DID=01 [0x01]; SID=10 to DID=16 [0x10]; SID=15 to DID=21 [0x15].
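The decimal-to-hex mapping behind these examples can be sketched in a few lines; `switch_id_for` is a made-up helper name for illustration, not a Brocade or IBM tool:

```python
def switch_id_for(domain_id: int) -> str:
    """Hex switch ID (as coded in the IOCP) for a decimal Domain ID."""
    return format(domain_id, "02X")   # e.g. Domain ID 16 -> "10"

# The SID/DID pairs given in the text:
for did in (1, 16, 21):
    print(did, switch_id_for(did))
```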
In Interop Mode 0, which is the typical mode and is used for fabrics with B-type switches only, the switch
address and the Domain ID are the same thing.
The 4 Gbps, 8 Gbps, Gen 5, and Gen 6 Brocade directors and switches support the full range of standard Domain IDs, 1–239, and it is recommended that the address of these products be set to one of these values.
Brocade FOS 7.0 and higher supports a fabric naming feature that allows users to assign a user-friendly name to identify and manage a logical fabric.
The switch address is a hexadecimal number that is used when defining the FICON switching device in the
IBM Z environment. The Domain ID and the switch address must be the same when referring to a FICON
switch or Director in an IBM Z environment.
Switch Software
It is a best practice for the Brocade FOS firmware on all of the switching devices in a fabric to be at the same
firmware level.
Director-class switching devices always provide for non-disruptive firmware upgrades and can host a FICON
infrastructure with five-nines of availability. But fixed-port switching devices can create fabric disruptions
during firmware upgrades, so at best they can host a FICON infrastructure with four-nines of availability:
• There is a “blind period” that occurs during a firmware upgrade. This happens when the CP is rebooting
the OS and reinitializing Brocade FOS firmware. On director-class products (those with dual CP systems),
this blind period lasts a short time—from 2 to 4 seconds. The blind period lasts only as long as it takes for
mastership to change (hence, a few seconds) and does not cause fabric disruption.
• Single CP systems (such as the Brocade 6510 fixed-port switches) have a blind period that can last 1 to
2 minutes. Obviously, this potentially can cause fabric disruptions. These switches allow data traffic to
continue flowing during the firmware upgrade CP blind period, unless a response from the CP is required.
In the mainframe world, this can occur more frequently, since FICON implementations utilize Extended Link Services (ELSs) for path and device validation. In addition, FICON uses Class 2 delivery for certain sequences, which increases the chances that the CP needs to become involved.
It is a best practice to vary the CUP port (0xFE) on a switching device offline, from every LPAR on every CEC,
before starting a switch firmware download and install procedure.
• This keeps the CUP from getting “boxed” and issuing an Interface Control Check (IFCC) when new
firmware is being implemented.
• After the firmware is successfully updated, vary the CUP back online to every LPAR on every Central
Electronics Complex (CEC).
It is a best practice to download and utilize the Brocade SAN Health free software, which provides users
with a host of reports and a storage network diagram, and which can point out inconsistencies between the
deployed FICON networks and the user’s IOCP.
Trunking
• The user can configure from two ports up to eight ports within an ASIC-based trunk group, where each link in the trunk supports transfer rates of up to 16 Gbps (Gen 5) or 32 Gbps (Gen 6) of bidirectional traffic.
• The ports in a trunk group make a single logical link (fat pipe). Therefore, all the ports in a trunk group
must be connected to the same device type at the other end.
• In addition to enabling load sharing of traffic, trunk groups provide redundant, alternate paths for traffic if
any of the segments fail.
Frame-based trunking (that is, Brocade Trunking) is supported in all cascaded FICON fabric
implementations. I/O traffic load sharing is accomplished very efficiently by the hardware ASIC pair to which
the ISL links are connected.
The Fibre Channel protocol has a function called Fabric Shortest Path First (FSPF) to assign ingress ports to
ISL links via a costing mechanism.
• Brocade preempts FSPF with its own capability, Dynamic Load Sharing (DLS), in order to evenly assign
ingress ports to egress ISL links at Port Login (PLOGI) time.
DLS is valid for FICON environments regardless of the routing policy setting. Thus, DLS with Lossless mode enabled should be used for all FICON environments, whether port-based, device-based, or exchange-based routing (the ISL load balancing mechanisms) is deployed.
When using FICON device emulation (IBM Extended Remote Copy [XRC] and/or FICON Tape Pipelining), the
FCIP blade—which is the Brocade FX8-24—is allowed to use only ASIC-based Brocade Trunking, if trunking is
desired.
A FICON Accelerator License (FAL) enables you to use FICON emulation (FICON acceleration) on Brocade
FCIP links. The license is a slot license. Each slot that does emulation (acceleration) requires a license.
Two-Byte Link Addressing
• A statement coded as LINK=(02,1A,20,2A) indicates that only a single switch is in the channel path.
• A statement coded as LINK=(3A02,3A1A,3A20,3A2A) indicates a cascaded FICON channel path. This
format is known as two-byte link addressing.
Unless users are certain that their FICON fabrics will never need to be cascaded, it is a best practice to
always utilize two-byte link addressing in the IOCP.
Two-byte link addressing on Brocade Gen 5 and Gen 6 devices requires, before activation, that users enable fabric binding (an SCC policy with strict fabric-wide consistency) and insistent Domain IDs. A channel that is already logged in continues to operate if these settings are missing. However, if that channel logs out for any reason, it is not able to log back in.
Virtual Fabrics
It is a best practice to always enable Brocade Virtual Fabrics even if users do not deploy them (at Brocade
FOS 6.0 and higher), because it is disruptive to change this mode at a later time.
Once users have enabled Brocade Virtual Fabrics, a Default Switch is automatically created. However, this
should not be used for FICON connectivity purposes.
• When the Virtual Fabrics feature is enabled on a switch or Director chassis, all physical ports are
automatically assigned to the Default Switch (which is an LS that is automatically created when Brocade
Virtual Fabrics are enabled).
The Default Switch should not be utilized to provide host/storage connectivity:
• Although it might work for a time, it is a bad practice to place any FICON or FCP connectivity ports in the
Default Switch.
• Move connectivity ports that are in the Default Switch to some other LS that was created. After enabling
Virtual Fabrics on a chassis, users should then create their own LSs.
An LS is an implementation of a Fibre Channel switch in which physical ports can be dynamically added or
removed. After creation, an LS is managed just like a physical switch. Each LS is associated with a Logical
Fabric (Virtual Fabric).
CUP should never be allocated on the Default Switch. CUP should be run only on LSs that users have
created.
Brocade Virtual Fabrics create a more complex environment and additional management duties, so it is a
best practice not to use more of them on a chassis than necessary. In many FICON environments, a single
user-created LS and its Virtual Fabric are adequate.
In particular, consider using Brocade Virtual Fabrics under the following circumstances:
Conclusion
This chapter has presented a great deal of technical information in a highly compact format. The chapter
reviewed best practices for architecture, design, implementation, and management in a Brocade FICON
SAN. Best practices are typically based on years of experience from a wide variety of sources and customer
implementations. However, if you are already implementing a method that is contrary to the best practices
described in this chapter, and your method works for you and your installation, we recommend that you
follow the reliable sage advice: “If it’s not broken, don’t fix it.” Do not change something that works for you,
simply in order to comply with a best practice that is suggested here.
PART 3 Chapter 1: BC/DR/CA Basics and Distance Extension Methodologies
Introduction
This chapter introduces some basic concepts associated with BC/DR/CA (Business Continuity, Disaster
Recovery, and Continuous Availability) to provide a level-set background for the more detailed discussions
throughout the rest of this part. It then discusses the various methodologies for network extension that you
can use with the replication networks associated with BC/DR/CA architectures.
Information availability (IA) refers to the ability of an IT infrastructure to function according to business
expectations during its specified period of operation. IA involves ensuring that 1) information is accessible at
the right place to the right user (accessibility), 2) information is reliable and correct in all aspects (reliability),
and 3) information is accessible at the exact moment it is needed (timeliness).
Various planned and unplanned incidents result in information unavailability. Planned outages may include
installations, maintenance of hardware, software upgrades or patches, restores, and facility upgrade
operations. Unplanned outages include human error-induced failures, database corruption, and failure of
components. Other incidents that may cause information unavailability are natural and manmade disasters:
floods, hurricanes, fires, earthquakes, and terrorist incidents. The majority of outages are planned;
historically, statistics show that unforeseen disasters account for less than one percent of information
unavailability.
Information unavailability (downtime) can result in loss of productivity, loss of revenue, poor financial
performance, and damage to the reputation of your business. The Business Impact (BI) of downtime is the
sum of all losses sustained as a result of a given disruption. One common metric used to measure BI is the
average cost of downtime per hour. This is often used as a key estimate in determining the appropriate BC
solution for an enterprise. Table 1 shows the average cost of downtime per hour for several key industries
(Guendert, 2013).
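The average-cost-of-downtime metric described above can be sketched as simple arithmetic. The dollar figures below are hypothetical placeholders, not values from the text:

```python
# Hypothetical figures for illustration only; real downtime costs vary
# widely by industry (see the per-industry averages referenced above).

def business_impact(downtime_hours: float, cost_per_hour: float,
                    one_time_losses: float = 0.0) -> float:
    """BI = (hours of downtime x average cost per hour) + one-time losses."""
    return downtime_hours * cost_per_hour + one_time_losses

# A 6-hour outage at an assumed $100,000/hour, plus $250,000 in
# one-time losses (penalties, recovery labor, customer credits):
impact = business_impact(6, 100_000, one_time_losses=250_000)
print(f"Estimated business impact: ${impact:,.0f}")  # $850,000
```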
• Mean Time Between Failure (MTBF) is the average time available for a system or component to perform
its normal operations between failures. It is a measure of how reliable a hardware product, system, or
component is. For most components, the measure is typically in thousands or even tens of thousands of
hours between failures.
• Mean Time To Repair (MTTR) is a basic measure of the maintainability of repairable items. It is the
average time required to repair a failed component. Calculations of MTTR assume that the fault
responsible for the failure is correctly identified and that the required replacement parts and personnel to
replace the failed component are available.
You can formally define IA as the time period during which a system is in a condition to perform its intended
function upon demand. IA can be expressed in terms of system uptime and system downtime and can be
measured as the amount or percentage of system uptime:

IA = system uptime / (system uptime + system downtime)

In this equation, system uptime is the period of time during which the system is in an accessible state.
When the system is not accessible, it is called system downtime. In terms of MTBF and MTTR, IA can be
expressed as (Guendert, 2013):

IA = MTBF / (MTBF + MTTR)
Uptime per year is based on requirements of the service for exact timeliness. This calculation leads to the
use of the common “number of 9s” representation for availability metrics. Table 2 below lists the approximate
amount of downtime allowed for a service to achieve the specified levels of 9s of availability.
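The availability formulas above, and the "number of 9s" downtime budgets, can be worked through numerically. The MTBF and MTTR values in the example are hypothetical:

```python
# Illustrative availability arithmetic; the 10,000-hour MTBF and 4-hour
# MTTR are assumed example values, not figures from the text.

HOURS_PER_YEAR = 365 * 24  # 8,760 hours

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """IA = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def downtime_per_year_hours(ia: float) -> float:
    """Allowed downtime per year for a given availability level."""
    return (1.0 - ia) * HOURS_PER_YEAR

# A component with a 10,000-hour MTBF and a 4-hour MTTR:
ia = availability(10_000, 4)  # roughly 0.9996

# Downtime budgets for common "number of 9s" levels:
for nines in (2, 3, 4, 5):
    level = 1 - 10 ** -nines  # 0.99, 0.999, 0.9999, 0.99999
    hours = downtime_per_year_hours(level)
    print(f"{nines} nines: {hours * 60:.1f} minutes/year")
```

Three 9s (99.9%) works out to about 8.76 hours of downtime per year; five 9s allows only about five minutes.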
Some Definitions
1. Disaster Recovery (DR): The coordinated process of restoring systems, data, and the infrastructure to
support ongoing business operations after a disaster occurs. Disaster Recovery concentrates solely on
recovering from an unplanned event.
2. IT resilience: The ability to rapidly adapt and respond to any internal or external disruption, demand,
or threat and continue business operations without significant impact. This concept of IT resilience is
related to DR, but it is broader in scope in that Disaster Recovery concentrates on recovering only from
an unplanned event.
3. Recovery Point Objective (RPO): The point in time to which systems and data must be recovered after
an outage. RPO represents the amount of data your company is willing to recreate following a disaster.
In other words, what is the acceptable time difference between the data in your production system
and the data at the recovery site; that is, what is the actual point-in-time recovery point at which all
data is current? How much data can be lost? If you have an RPO of less than 24 hours, expect to need
some form of real-time mirroring. If your DR plan is dependent upon daily full-volume dumps, you
probably have an RPO of 24 hours or more. An organization can plan for an appropriate BC technology
solution on the basis of the RPO it sets.
4. Recovery Time Objective (RTO): The time within which systems and applications must be recovered
after an outage. RTO represents how long your business can afford to wait for IT services to be resumed
following a disaster. In other words, RTO defines the amount of downtime that a business can endure
and survive. RTO traditionally refers to the questions, “How long can you afford to be without your
systems? How much time is available to recover the applications and have all critical operations up and
running again?”
5. Network Recovery Objective (NRO): The time needed to recover or fail over network operations. NRO
includes such jobs as establishing alternate communications links, reconfiguring Internet servers,
setting alternate TCP/IP addresses, and everything else needed to make the recovery transparent to
customers, remote users, and others. NRO effectively represents the amount of time it takes before
your business appears recovered to your customers.
6. Data vault: A repository at a remote site where data can be periodically or continuously copied, so that
there is always a copy at another site.
7. Hot site: A site to which an enterprise’s operations can be moved in the event of a disaster. A hot site is
a site that has the required hardware, operating system, network, and application support to perform
business operations in a location where the equipment is available and running at all times.
8. Cold site: A site to which an enterprise’s operations can be moved in the event of disaster, with
minimum IT infrastructure and environmental facilities that are in place, but not activated.
9. Continuous Availability (CA): A new paradigm that encompasses not only recovering from disasters, but
keeping your applications up and running throughout the far more common planned and unplanned
outages that do not constitute an actual disaster.
Table 3 lists some typical RPO and RTO values for some common DR options (Clitherow, et al., 2017).
Generally, a form of real-time software or hardware replication is required to achieve an RPO of minutes or
less, but the only technologies that can provide an RPO of zero (0) are synchronous replication technologies
coupled with automation to ensure that no data is written to one location and not the other. These topics
will be covered in later chapters.
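The rule of thumb above (only synchronous replication with automation achieves RPO = 0; real-time replication for an RPO of minutes; periodic backup for longer RPOs) can be sketched as a simple selector. The specific minute thresholds are assumptions for illustration, not values from the text:

```python
def replication_approach(rpo_minutes: float) -> str:
    """Map an RPO target to a replication category, following the rule of
    thumb in the text. The 60-minute and 24-hour cutoffs are assumed
    illustrative boundaries."""
    if rpo_minutes == 0:
        return "synchronous replication with automation"
    if rpo_minutes <= 60:
        return "asynchronous real-time replication"
    if rpo_minutes <= 24 * 60:
        return "electronic vaulting / point-in-time copies"
    return "daily tape backup (PTAM)"

print(replication_approach(0))        # synchronous replication with automation
print(replication_approach(15))       # asynchronous real-time replication
print(replication_approach(48 * 60))  # daily tape backup (PTAM)
```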
An Evolving Paradigm
The importance of Business Continuity and Disaster Recovery (BC/DR) for information technology
professionals and the corporations they work for has undergone considerable change in the past twenty
years, and the pace of that change has accelerated in the years since Sept. 11, 2001, Hurricane Katrina
in 2005, the Japanese tsunami of 2011, and Superstorm Sandy in 2012. The events of Sept. 11, 2001
(9/11) in the United States served as a wakeup call to those who viewed BC/DR as a mere afterthought
or an item to just check off from a list. That day’s events underscored how critical it is for businesses to be
ready for disasters on a larger scale than previously considered. The watershed events of 9/11 and the
resulting experiences served to diametrically change IT professionals’ expectations, and these events now
act as the benchmark for assessing the requirements of a corporation having a thorough, tested BC/DR
plan.
Following September 11, 2001, industry participants met with multiple government agencies, including the
United States Securities and Exchange Commission (SEC), the Federal Reserve, the New York State Banking
Department, and the Office of the Comptroller of the Currency. These meetings were held specifically
to formulate and analyze the lessons learned from the events of September 11, 2001. These agencies
released an interagency white paper, and the SEC released their own paper on best practices to strengthen
the IT resilience of the U.S. financial system.
The General Data Protection Regulation (GDPR) (Regulation (EU) 2016/679) is a regulation by which
the European Parliament, the Council of the European Union and the European Commission intend to
strengthen and unify data protection for all individuals within the European Union (EU). It also addresses
the export of personal data outside the EU. The primary objectives of the GDPR are to give control back to
citizens and residents over their personal data and to simplify the regulatory environment for international
business by unifying the regulation within the EU. When the GDPR takes effect, it will replace the data
protection directive (officially Directive 95/46/EC) from 1995. The regulation was adopted on 27 April 2016.
It applies from 25 May 2018 after a two-year transition period and, unlike a directive, it does not require
any enabling legislation to be passed by national governments. GDPR is likely to impact BC/DR planning for
European IBM Z clients.
Regional Disasters
One of the things that was not well appreciated prior to 9/11 was the concept of a regional disaster. What
is a regional disaster? A good working definition of a regional disaster is: a disaster scenario, manmade or
natural, that impacts multiple organizations or multiple users in a defined geographic area. In other words,
it is not a disaster impacting only one organization or one data center (Guendert, 2007). Figure 1 lists some
well-known regional disasters that have occurred since 2001.
Organizations whose BC/DR plans focused on recovering from a local disaster such as fire or power failure
within a data center faced the realization on 9/11 that their plans were inadequate for coping with and
recovering from a regional disaster. A regional disaster was precipitated by the attacks on the World Trade
Center in New York City on 9/11. Hundreds of companies and an entire section of a large city, including the
financial capital of the world, were affected. Power was cut off, travel restrictions were imposed, and the
major telephone switching centers (land line and wireless) were destroyed. Access to buildings and data
centers was at a minimum restricted and at a maximum completely impossible for several days following
9/11. The paradigm for planning mainframe-centric BC/DR changed overnight. Organizations quickly
realized that they now needed to plan for the possibility of a building being eliminated or rendered useless.
Even if the possibility is remote, what about the threat of terrorists using biological, chemical, or nuclear weapons? What
about another regional or super-regional natural disaster such as Hurricane Katrina or the 2011 Japanese
earthquake and tsunami, or manmade disasters such as the 2003 North American power blackout? These
are some examples of issues beyond the data center that organizations need to consider in their planning
for a regional disaster. Two of these issues deserve the most attention.
The first primary issue beyond the data center is erosion in confidence in a business based upon the
company’s reactions to the regional disaster. This typically is based on how the press and internal
communications to employees report on disaster-related events. In other words, what is the perception
of the company’s ability to react to the crisis, minimize the ensuing chaos, provide timely (and accurate)
updates on the status of company operations, and (internally) discuss how Human Resources issues are
being addressed? In effect, organizations need to include a communications plan in their BC/DR plans, as
well as appoint a BC/DR Communications Director/Coordinator. Perception is reality, and the more control
an organization is perceived to have over its destiny, and the more it communicates information upfront, the
less likely that there will be an erosion of confidence in the business.
The second primary issue is that an organization needs to be able to restart or execute restoration of
business operations in a timely manner. Failure to do so can result in essential supplies being unable
to reach the business. Also, if customers have no access to a business’s products and services, those
customers may look to competitors for what they need. This loss of revenue and customer loyalty has a
direct relationship to the effects of a loss or erosion in business confidence.
Post-9/11 assessments revealed several common shortcomings in existing BC/DR plans:
• Many plans covered replacing IT equipment inside the data center but not IT equipment outside of it.
Key elements of the IT infrastructure beyond the data center are essential to restoring and sustaining
business operations, not just equipment in the data center.
• The BC/DR plan was not comprehensive, in that it addressed only recovering the IT installation. The
plan did not address everything required to accomplish a complete end-to-end business recovery.
• Documentation describing how to recover mission-critical applications and business processes was
inadequate, or sometimes completely missing.
• The education and training of personnel was not adequate, and sufficient disaster drills had not been
conducted to practice executing the plan. Why not? Was this a lack of control over these procedures
and processes due to outsourcing BC/DR? For those of you who have served in the U.S. Navy,
remember how frequently casualty drills were practiced? And these casualty drills were practiced
as if the ship’s survival was at stake each time. When your organization’s future survival is at stake,
shouldn’t you practice or test your BC/DR plans in the same committed fashion that the U.S. Navy
practices casualty drills? Practice like you play the real game.
• Another planning inadequacy that was frequently cited was a lack of addressing organizational issues
directly related to mission-critical business functions. Examples include but are not limited to the
following:
–– Documentation of a clear chain of command with an associated decision control matrix
–– An effective internal communications plan for employees and an external communications plan for
suppliers, customers, and the media
–– A documented succession plan to follow in case key personnel are incapacitated or unavailable (on
9/11 organizations were woefully unprepared for the loss of key personnel and their skills)
–– Definition of a priority sequence for recovering business processes and associated IT applications
Planning for a regional disaster also raises several considerations beyond the plan document itself:
1. A regional disaster could very well cause multiple companies and organizations to declare disasters
and initiate recovery actions simultaneously, or nearly simultaneously. This is highly likely to severely
stress the capacity of business recovery services (outsourcers) in the vicinity of the regional disaster.
Business continuity service companies typically work on a “first come, first served” basis. So when a
regional disaster similar to 9/11 occurs, these outsourcing facilities can fill up quickly and become
overwhelmed. Also, a company’s contract with the BC/DR outsourcer may stipulate that the customer
has the use of the facility only for a limited time (for example, 45 days). This may spur companies
with BC/DR outsourcing contracts to: a) consider changing outsourcing firms, b) renegotiate their
existing contract, or c) study the requirements and feasibility for insourcing their BC/DR and creating
their own DR site. Depending on the organization’s Recovery Time Objective (RTO) and Recovery Point
Objective (RPO), option “c” may provide the best alternative.
2. The recovery site must have adequate hardware, and the hardware at the facility must be compatible
with the hardware at the primary site. Organizations must plan for their recovery site to have: a)
sufficient server processing capacity, b) sufficient storage capacity, and c) sufficient networking and
storage networking capacity to enable all business-critical applications to run from the recovery site.
The installed server capacity at the recovery site may be used to meet normal (day-to-day) needs
(assuming BC/DR is insourced). Fallback capacity may be provided via several means, including
workload prioritization (test, development, production, and data warehouse). Fallback capacity may
also be provided via implementing a capacity upgrade scheme that is based on changing a license
also be provided via implementing a capacity upgrade scheme that is based on changing a license
agreement, instead of installing additional capacity. IBM Z mainframes have a Capacity BackUp (CBU)
feature. Unfortunately, such a feature is not prevalent in the open systems world. Many
organizations take a calculated risk with open systems and fail to purchase two duplicate servers (one
for production at the primary data center, and a second server for the DR data center). Therefore, open
systems DR planning must take this possibility into account and consider the question, “What can my
company lose?”
When looking at the issue of compatible hardware, encryption is an important consideration; however, it
is common for an enterprise to overlook encryption in their BC/DR planning. If an organization does not
plan for the recovery site to have the necessary encryption hardware, the recovery site may not be able to
process the data that was encrypted at the primary site. So, when you buy new tape drives with onboard
encryption, before reallocating those previous generation tape drives (which did not have encryption
capabilities) to the recovery site to save money, maybe you need to re-evaluate the decision. In this case,
the cost savings in the short run puts the business at risk of being unable to recover to meet RTO.
3. A robust BC/DR solution must be based on as much automation as possible. The types of catastrophes
inherent in 9/11-style regional disasters make it too risky to assume that key personnel and critical
skills will be available to restore IT services. Regional disasters impact personal lives as well. Personal
crises and the need to take care of families, friends, and loved ones take priority for IT workers.
Also, key personnel may not be able to travel and thus cannot get to the recovery site. Mainframe
installations are increasingly looking to automate switching resources from one site to another. One way
to do this in a mainframe environment is with Geographically Dispersed Parallel Sysplex (GDPS). That
topic is discussed in a later chapter.
What about a company who uses a third-party site for their business recovery but has a contractual
agreement calling for them to vacate the facility within a specified time period? What happens when
their primary site is destroyed, and the company cannot return there? The threat of terrorism and the
prospect for further regional disasters poses the question, “What is our secondary Disaster Recovery
plan?”
Many companies already have, or are seriously considering, a three-site BC/DR strategy.
In a nutshell, this strategy entails having two sites within the same geographic vicinity to facilitate
high availability and then having a third, remote site for Disaster Recovery. The major inhibitor to
implementing a three-site strategy is telecommunications costs. As with any major decision, a proper
risk versus cost analysis should be performed (Kern & Peltz, 2010).
In some nations, government legislation and regulations set out very specific rules for how organizations
must handle their business processes and data. Some examples are the Basel II rules for the European
banking sector and the U.S. Sarbanes-Oxley Act. These laws both stipulate that banks must have a resilient
back office infrastructure (Guendert, 2007). Another example is the Health Insurance Portability and
Accountability Act (HIPAA) in the United States. This legislation determines how the U.S. health care industry
must account for and handle patient-related data.
This ever increasing need for “365 × 24 × 7 × forever” availability really means that many businesses are
now looking for a greater level of availability than before, covering a wider range of events and scenarios
beyond the ability to recover from a disaster. This broader requirement is called IT resilience. As stated
earlier, IBM has developed a definition for IT resilience: the ability to rapidly adapt and respond to any
internal or external opportunity, threat, disruption, or demand and continue business operations without
significant impact (Guendert, 2013).
There are several factors involved in determining your RTO and RPO requirements. Organizations need to
consider the cost of some data loss while still maintaining cross-subsystem/cross-volume data consistency.
Maintaining data consistency enables you to perform a database restart, which typically requires a duration
of seconds to minutes. This cost needs to be weighed against the cost of no data loss, which either impacts
production on all operational errors—in addition to Disaster Recovery failure situations—or yields a database
recovery disaster (typically hours to days in duration), as cross-subsystem/cross-volume data consistency is
not maintained during the failing period. The solution that is chosen is based on a particular cost curve
slope: If I spend a little more, how much faster is Disaster Recovery? If I spend a little less, how much slower
is Disaster Recovery?
In other words, the cost of your Business Continuity solution is realized by balancing the equation of how
quickly you need to recover your organization’s data against how much it will cost the company in terms
of lost revenue, due to the inability to continue business operations. The shorter the time period chosen
to recover the data to continue business operations, the higher the costs. It should be obvious that the
longer a company is down and unable to process transactions, the more expensive the outage will be for the
company. If the outage lasts long enough, survival of the company is doubtful. Figure 2 provides a reminder
of some basic economic cost curves. Much like deciding on the price of widgets and the optimal quantity of
widgets to produce, the optimal solution resides at the intersection point of the two cost curves (Guendert,
2007).
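The intersection-point argument above can be made concrete with a small numeric sketch. Both cost curves below are hypothetical shapes chosen only to illustrate the trade-off, not real pricing:

```python
# Hypothetical cost curves: the cost of the BC solution falls as the
# allowed recovery time grows, while the cost of the outage rises with
# recovery time. The balance point is where the curves cross.

def solution_cost(rto_hours: float) -> float:
    return 1_000_000 / (rto_hours + 1)   # assumed shape, assumed dollars

def outage_cost(rto_hours: float) -> float:
    return 50_000 * rto_hours            # assumed $50K per hour of downtime

def balance_point(lo: float = 0.0, hi: float = 100.0, steps: int = 60) -> float:
    """Bisect for the RTO (in hours) where the two costs are equal."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if solution_cost(mid) > outage_cost(mid):
            lo = mid   # the solution is still costlier: a longer RTO is cheaper
        else:
            hi = mid
    return (lo + hi) / 2

rto = balance_point()
print(f"Break-even RTO: {rto:.2f} hours")  # 4.00 hours for these curves
```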
[Figure 2: Cost of solution and cost of outage plotted against time to recover; the optimal solution lies at the intersection of the two cost curves.]
The SHARE Disaster Recovery Tiers
Tier 0: No Disaster Recovery plan and no offsite data: Tier 0 means that a company has no contingency
plan, no backup hardware, no documentation, no saved information and—in the event of a disaster—no
more business.
Tier 1: Data backup, with no hot site: Tier 1 allows a business to back up its data (typically with tape).
The system, application data, and subsystem are dumped to tape and transported to a secure facility.
Depending on how often backups are made, Tier 1 businesses are prepared to accept several days to
weeks of data loss. Their backups will be secure off-site (assuming that the tapes made it safely to the
site!). In the event of a disaster, all backup data that is still onsite, such as archived logs and image copies,
will be lost. This data loss typically consists of 24 to 48 hours’ worth of data. Tier 1 is often referred to as
the Pickup Truck Access Method (PTAM). Tier 1 businesses recover from a disaster by securing a DR site,
installing equipment, and transporting backup tapes from the secure site to the DR site. The next steps involve
restoring the system, the application infrastructure and related data, and the subsystem, then restarting the
systems and workload. This process typically takes several days. The cost factors involved include creating
backup tapes, transporting the tapes, and storing the tapes.
Tier 2: Data backup/PTAM with a hot site: The Tier 2 plan is essentially the same as the Tier 1 plan, except
that the organization has secured a DR facility in advance. This means that data recovery is somewhat faster
and has a more consistent response time as compared to a Tier 1 recovery. You still have a data loss of 24 to
48 hours, but recovery now takes only 24 to 48 hours, as opposed to several days. The cost factors for a Tier 2
solution include ownership of a second IT facility or the cost of a DR facility subscription fee. These costs must be
considered in addition to the previously mentioned Tier 1 costs.
Tier 3: Electronic vaulting: Tier 3 provides the same solutions as Tier 2, except that with a Tier 3 solution,
the organization dumps the backup data electronically to a remotely attached tape library subsystem.
Depending on when the last backup was created, data loss with a Tier 3 solution consists of up to 24 hours’
worth of data or less. Also, electronically vaulted data is typically more current than data that is shipped
via PTAM. Also, since Tier 3 solutions add higher levels of automation, there is less data recreation or loss.
In addition to the Tier 2 cost factors, Tier 3 cost factors include the telecommunication lines required to
transmit the backup data, as well as a dedicated Automated Tape Library (ATL) subsystem at the remote
site.
Tier 4: Active secondary site/electronic remote journaling/point in time copies: Tier 4 solutions are used
by organizations that require both greater data currency and faster recovery than the four lower tiers (Tiers
0 through 3). Rather than relying exclusively on tape, Tier 4 solutions incorporate disk-based solutions, such
as point-in-time copy. This means that the data loss is worth only minutes to hours, and the recovery time is
likely 24 hours or less. Database Management System (DBMS) and transaction manager updates are remotely
journaled to the DR site. Cost factors (in addition to the Tier 3 costs) include a staffed, running system at the
DR site to receive the updates and disk space to store the updates. Examples include peer-to-peer virtual tape,
flash copy functionality, and IBM Metro/Global copy.
Tier 5: Two-site, two-phase commit/transaction integrity: Tier 5 offers the same protection as Tier 4, with
the addition of applications that perform two-phase commit processing between two sites. Tier 5 plans are
used by organizations that require consistency of data between production and recovery data centers. Data
loss is only seconds’ worth, and recovery time is under two hours. Cost factors inherent in Tier 5 solutions
include modifying and maintaining the application to add the two-phase commit logic (in addition to the Tier
4 cost factors). An example of a Tier 5 solution is Oracle Data Guard.
Tier 6: Zero data loss/remote copy: Tier 6 solutions maintain the highest levels of data currency. The
system, subsystem, and application infrastructure and application data are mirrored or copied from the
production site to a DR site. Tier 6 solutions are used by businesses that have little to no tolerance for data
loss and that need to rapidly restore data to applications. To ensure this capability, Tier 6 solutions do not
depend on the applications themselves to provide data consistency. Instead, Tier 6 solutions use real-time
server and storage mirroring. There is small to zero data loss if using synchronous remote copy. There
are seconds to minutes of data loss if using asynchronous remote copy. The recovery window is the time
required to restart the environment using the secondary DASD, if the solution is truly data consistent. Cost
factors in addition to the Tier 5 factors include the cost of telecommunications lines. Examples of Tier 6
solutions include Metro Mirror, Global Mirror, Symmetrix Remote Data Facility (SRDF), and HDS’ True Copy
and Universal Replicator.
Tier 7: Zero or little data loss (GDPS): GDPS moves beyond the original SHARE-defined DR tiers. GDPS
provides total IT business recovery through the management of systems, processors, and storage across
multiple sites. A Tier 7 solution includes all of the major components used in a Tier 6 solution, with
the addition of integration of complete automation for storage, IBM Z servers, software, networks, and
applications. This automation enables Tier 7 solutions to have a data consistency level above and beyond
Tier 6 levels. GDPS manages more than the physical resources. GDPS also manages the application
environment and the consistency of the data. GDPS provides full data integrity across volumes, subsystems,
OS platforms, and sites—while still providing the ability to perform a normal restart if a site switch occurs.
This ensures that the duration of the recovery window is minimized. Since application recovery is also
automated, restoration of systems and applications is also much faster and more reliable (Guendert, 2007).
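The eight tiers just described can be condensed into a quick-reference structure. The RPO/RTO strings below paraphrase the approximate figures given in the tier descriptions above:

```python
# Condensed lookup of the DR tiers described above:
# tier -> (summary, approximate RPO, approximate RTO).

DR_TIERS = {
    0: ("No DR plan, no offsite data", "all data lost", "no recovery"),
    1: ("Tape backup, no hot site (PTAM)", "24-48 hours", "several days"),
    2: ("Tape backup with a hot site", "24-48 hours", "24-48 hours"),
    3: ("Electronic vaulting", "up to 24 hours", "under 24 hours"),
    4: ("Active second site / point-in-time copies", "minutes to hours", "24 hours or less"),
    5: ("Two-site two-phase commit", "seconds", "under 2 hours"),
    6: ("Zero data loss / remote copy", "zero to minutes", "environment restart window"),
    7: ("GDPS with full automation", "zero or little", "minimized"),
}

def describe_tier(tier: int) -> str:
    name, rpo, rto = DR_TIERS[tier]
    return f"Tier {tier}: {name} (RPO ~{rpo}, RTO ~{rto})"

print(describe_tier(6))
```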
• Recovery Point Objective (RPO): The period between backup points; RPO describes the acceptable age of
the data that must be restored after a failure has occurred. For example, if a remote backup occurs every
day at midnight, and a site failure occurs at 11 p.m., changes to data made within the previous 23 hours
are not recoverable.
• Recovery Time Objective (RTO): The time to recover from the disaster; RTO determines the acceptable
length of time that a break in continuity can occur with minimal or no impact to business services.
Options for replication generally fall into one of several categories. A Business Continuity solution with strict
RTO and RPO may require high-speed synchronous or near-synchronous replication between sites, as well
as application clustering for immediate service recovery. A medium-level Disaster Recovery solution may
require high-speed replication that could be synchronous or asynchronous with an RTO from several minutes
to a few hours. Backup of non-critical application data that does not require immediate access after a
failure can be accomplished via tape vaulting. Recovery from tape has the greatest RTO. You can use other
technologies, such as Continuous Data Protection (CDP), to find the appropriate RPO and RTO.
From a technology perspective, there are several choices for optical transport network and configuration
options for a Fibre Channel Storage Area Network (FC SAN) when it is extended over distance. Applications
with strict RTO and RPO require high-speed synchronous or near-synchronous replication between sites
with application clustering over distance for immediate service recovery. Less critical applications may only
require high-speed replication that could be asynchronous, in order to meet RPO and RTO metrics. Lower
priority applications, which do not require immediate recovery after a failure, can be restored from backup
tapes located in remote vaults.
As more applications are driving business value, and the associated data becomes key to maintaining
a competitive advantage, cost-effective protection of the applications and data from site disasters and
extended outages becomes the norm. Modern storage arrays provide synchronous as well as asynchronous
array-to-array replication over extended distances.
When the array provides block-level storage for applications, Fibre Channel is the primary network
technology used to connect the storage arrays to servers, both physical and virtual. For this reason, cost-
effective Disaster Recovery designs leverage Fibre Channel to transport replicated data between arrays in
different data centers over distances spanning a few to more than 100 kilometers. Therefore, SAN distance
extension using Fibre Channel is an important part of a comprehensive, cost-effective, and effective
Disaster Recovery design.
It is important to understand Fibre Channel protocol and optical transport technology and how they interact.
In a discussion of long-distance configuration, it is also useful to note the formal structure of the FC protocol
and the specific standards that define the operation of the protocol, as follows:
Fiber Optics
Fiber Cabling
There are two basic types of optical fiber: Multimode Fiber (MMF) and Single-Mode Fiber (SMF). Multimode
fiber has a larger core diameter of 50 micrometers (μm) or 62.5 μm (the latter was common for Fiber
Distributed Data Interface [FDDI]) and carries numerous modes of light through the waveguide. MMF is
generally less expensive than SMF, but its characteristics make it unsuitable for distances greater than
several hundred meters. Because of this, MMF is generally used for short distance spans and is common
for interconnecting SAN equipment within the data center.
SMF has a smaller core diameter of 9 μm and carries only a single mode of light through the waveguide. It is
better at retaining the fidelity of each light pulse over long distances and results in lower attenuation. SMF is
always used for long-distance extension over optical networks and is often used even within the data center
for FICON installations. Table 4 describes various types of optical fiber and operating distances at different
speeds (Brocade Communications Systems, 2008).
Data Rate (MB/sec)        100        200        400        800
Operating Distance (m):
  50 μm MMF (OM3)         0.5–860    0.5–500    0.5–380    0.5–150
  50 μm MMF (OM2)         0.5–500    0.5–300    0.5–150    0.5–50
  62.5 μm MMF (OM1)       0.5–300    0.5–150    0.5–70     0.5–21
  9 μm SMF                2.0+       2.0+       2.0+       2.0+
There are several types of Single-Mode Fiber, each with different characteristics that should be taken into
consideration when a SAN extension solution is deployed. Non-Dispersion Shifted Fiber (NDSF) is the oldest
type of fiber. It was optimized for wavelengths operating at 1310 nanometers (nm), but performed poorly in
the 1550 nm range, limiting maximum transmission rate and distance. To address this problem, Dispersion
Shifted Fiber (DSF) was introduced. DSF was optimized for 1550 nm, but it introduced additional problems
when deployed in Dense Wavelength Division Multiplexing (DWDM) environments. The most recent type
of SMF, Non-Zero Dispersion Shifted Fiber (NZ-DSF), addresses the problems associated with the previous
types and is the fiber of choice in new deployments.
As light travels through fiber, the intensity of the signal degrades; this loss is called attenuation. The three
main transmission windows, in which loss is minimal, are the 850, 1310, and 1550 nm ranges. Table 5 lists
common fiber types and the average optical loss incurred by distance.
The link power budget identifies how much attenuation can occur across a fiber span while sufficient output
power for the receiver is maintained. A power budget is determined by finding the difference between
“worst-case” launch power and receiver sensitivity. Transceiver and other optical equipment vendors
typically provide these specifications for their equipment. A loss value of 0.5 dB can be used to approximate
the attenuation caused by a connector/patch panel. An additional 2 dB is subtracted as a safety margin.
Power Budget = (Worst-Case Launch Power) – (Worst-Case Receiver Sensitivity)
Signal loss is the total sum of all losses due to attenuation across the fiber span. This value should be within
the power budget, in order to maintain a valid connection between devices. To calculate the maximum
signal loss across an existing fiber segment, use the following equation (where km means kilometers)
(Guendert,2013):
Signal Loss = (Fiber Attenuation/km * Distance in km) + (Connector Attenuation) + (Safety Margin)
Table 5 provides average optical loss characteristics of various fiber types that can be used in this equation,
although loss may vary depending on fiber type and quality. It is always better to measure the actual optical
loss of the fiber with an optical power meter.
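The two equations above can be combined into a quick feasibility check. The transceiver figures and the 0.25 dB/km fiber loss used below are illustrative assumptions, not vendor specifications; the 0.5 dB per connector and 2 dB safety margin come from the text.

```python
# Hypothetical link-budget check for a single-mode fiber span.
# Transceiver values (-4 dBm launch, -20 dBm sensitivity) are invented examples.

def power_budget(launch_power_dbm, receiver_sensitivity_dbm):
    """Worst-case launch power minus worst-case receiver sensitivity, in dB."""
    return launch_power_dbm - receiver_sensitivity_dbm

def signal_loss(attn_db_per_km, distance_km, connectors,
                connector_loss_db=0.5, safety_margin_db=2.0):
    """Signal Loss = (attenuation/km * km) + connector losses + safety margin."""
    return (attn_db_per_km * distance_km
            + connectors * connector_loss_db
            + safety_margin_db)

budget = power_budget(-4.0, -20.0)          # 16 dB budget for the example optics
loss = signal_loss(0.25, 40, connectors=4)  # 40 km of SMF at ~1550 nm
print(f"budget={budget:.1f} dB, loss={loss:.1f} dB, link ok={loss <= budget}")
```

Substituting attenuation measured with an optical power meter, as recommended above, gives a more trustworthy answer than table averages.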
Some receivers may have a maximum receiver sensitivity. If the optical signal is greater than the maximum
receiver sensitivity, the receiver may become oversaturated and be unable to decode the signal, causing
link errors or even total failure of the connection. Fiber attenuators can be used to resolve the problem. This
is often necessary when connecting FC switches to DWDM equipment using single-mode FC transceivers
(Brocade Communications Systems, 2008).
Optical transceivers often provide monitoring capabilities that can be viewed through FC switch
management tools, allowing some level of diagnostics of the actual optical transceiver itself.
Note that terms are often misused or used in a very generic way. And products can be configured and
used in any of the different ways discussed in the following sections. Be sure that there is no confusion or
uncertainty about the type of equipment being used. If connectivity is being provided by a service provider,
in addition to equipment deployed at your site, it is important to understand all devices in the network.
The end user generally incurs significant construction, service, and maintenance costs when adding bulk
fiber cable to satisfy current E_Port connectivity requirements while allowing for future growth and for
redundancy against accidental fiber breaks. Existing fibers that were used for Ethernet implementations
cannot be shared; each protocol requires its own dedicated channels. The costs involved range from
mandatory capital expenditures to the ongoing expense of fiber cable maintenance.
In addition to costs, there are physical hardware limitations to achieving connectivity between (at least) two
geographically separated sites. Fibre Channel optics that are installed on the Fibre Channel switch are at
the mercy of the limited optical output transmission power. Even with repeater technology, distortion of the
optical wavelength transmitted by the optics can occur over several hops.
Using DWDM, several separate wavelengths (or channels) of data can be multiplexed into a multicolored
light stream transmitted on a single optical fiber. Each channel is allocated its own specific wavelength
(lambda) band assignment, with bands typically separated by 0.8 nm (100 GHz) or less. This
technique to transmit several independent data streams over a single fiber link is an approach to opening
up the conventional optical fiber bandwidth by breaking it up into many channels, each at a different optical
wavelength. Each wavelength can carry a signal at any bit rate less than an upper limit that is defined by the
electronics, typically up to several gigabits per second (Gbps).
DWDM is suitable for large enterprises and service providers who lease wavelengths to customers. Most
equipment vendors can support 32, 64, or more channels over a fiber pair, each running at speeds up to 10
Gbps or more. Different data formats being transmitted at different data rates can be transmitted together.
Specifically, IP data, ESCON, Fibre Channel, FICON, SONET data, and even ATM data can all be traveling
at the same time within the optical fiber. DWDM systems are independent of protocol or format, and no
performance impacts are introduced by the system itself. The costs associated with DWDM are higher,
due to greater channel consolidation and flexibility, the use of higher-quality hardware and precision-
cooling components (to prevent low-frequency signal drift), and the capability of regenerating, reamplifying,
and reshaping (3R) the wavelengths assigned to channels to ensure optical connectivity over vast distances
(Guendert, 2013).
Fiber distances between nodes can generally extend up to 100 km or farther. DWDM equipment can be
configured to provide a path protection scheme in case of link failure or in ring topologies that also provide
protection. Switching from the active path to the protected path typically occurs in less than 50 milliseconds
(ms). Figure 2 illustrates the concept of DWDM technology.
In the figure, ESCON, FICON, Fibre Channel, and Gigabit Ethernet client channels each undergo optical-to-
electrical conversion onto an assigned lambda (λ) and are multiplexed onto shared optical channels.
Coarse Wavelength Division Multiplexing (CWDM), like DWDM, uses similar processes of multiplexing and
demultiplexing different channels by assigning different wavelengths to each channel. CWDM is intended
to consolidate environments containing a low number of channels at a reduced cost. CWDM uses 20 nm
of separation between each assigned channel wavelength. Because of this wider separation, CWDM
technology can generally use cost-effective hardware components that require far fewer of the precision-
cooling components that dominate DWDM solutions. The trade-off is that the number of channel
wavelengths that can be packed onto a single fiber is greatly reduced.
CWDM is generally designed for shorter distances (typically 50 to 80 km) and thus does not require
specialized amplifiers and high-precision lasers (which means a lower cost). Most CWDM devices support up
to 8 or 16 channels. CWDM generally operates at a lower bit rate than higher-end DWDM systems—typically
up to 4 Gbps.
There are two basic types of Wavelength Division Multiplexing (WDM) solutions; both are available for
CWDM and DWDM implementations, depending on customer requirements (Guendert, Storage Networking
Business Continuity Solutions, 2013):
• Transponder-based solutions: These allow connectivity to switches with standard 850 or 1310 nm optical
SFP transceivers. A transponder is used to convert these signals using Optical-to-Electrical-to-Optical
(O-E-O) conversion to WDM frequencies for transport across a single fiber. By converting each input to a
different frequency, multiple signals can be carried over the same fiber.
• SFP-based solutions: These eliminate the need for transponders by requiring switch equipment to utilize
special WDM transceivers (also known as “colored optics”), reducing the overall cost. Coarse or Dense
WDM SFPs are similar to any standard transceiver used in Fibre Channel switches, except that they
transmit on a particular frequency within a WDM band. Each wavelength is then placed onto a single fiber
through the use of a passive multiplexer.
Traditionally, SFP-based solutions were used as low-cost solutions and thus were mostly CWDM based. Due
to compliance requirements, some customers are using these solutions to minimize the number of active—
or powered—components in the infrastructure. Along with the need for increased bandwidth, and the use of
such solutions to support Ethernet as well as FC connectivity, some customers are now using DWDM SFP-
based implementations. These implementations require DWDM colored optics rather than CWDM colored
optics, to allow sufficient connections through a single fiber.
Additionally, note that TDM-based systems can result in a level of jitter or variable latency. As such, it is not
possible to make broad statements about the ability to use frame-based trunking. In general, best practice is
to avoid frame-based trunking on a TDM-based configuration. The need to use Idle primitives may impact the
availability of other vendor-specific features. The specific details often depend on switch firmware levels and on
which configuration mode is used for compatibility.
FC-SONET/SDH
Synchronous Optical Network (SONET) is a standard for optical telecommunications transport, developed
by the Exchange Carriers Standards Association for ANSI. SONET defines a technology for carrying
different capacity signals through a synchronous optical network. The standard defines a byte-interleaved
multiplexed transport occupying the physical layer of the OSI model.
SONET and Synchronous Digital Hierarchy (SDH) are standards for transmission of digital information over
optical networks and are often the underlying transport protocols that carry enterprise voice, video, data,
and storage traffic across Metropolitan Area Networks and Wide Area Networks (MANs and WANs). SONET/
SDH is particularly well suited to carrying enterprise mission-critical storage traffic, because it is connection-
oriented, and latency is deterministic and consistent. FC-SONET/SDH is the protocol that provides the
means for transporting FC frames over SONET/SDH networks. FC frames are commonly mapped onto a
SONET or SDH payload using an International Telecommunications Union (ITU) standard called Generic
Framing Procedure (GFP) (Guendert, 2013). SONET is useful in a SAN for consolidating multiple low-
frequency channels (such as ESCON and 1, 2 Gb Fibre Channel) into a single higher-speed connection. This
can reduce DWDM wavelength requirements in an existing SAN infrastructure. It can also allow a distance
solution to be provided from any SONET service carrier, saving the expense of running private optical cable
over long distances.
The basic SONET building block is an STS-1 (Synchronous Transport Signal) composed of the transport
overhead plus a Synchronous Payload Envelope (SPE), totaling 810 bytes. The 27-byte transport overhead is
used for operations, administration, maintenance, and provisioning. The remaining bytes make up the SPE,
of which an additional nine bytes are path overhead.
An STS-1 operates at 51.84 megabits per second (Mbps), so multiple STS-1s are required to provide the
necessary bandwidth for ESCON, Fibre Channel, and Ethernet, as shown in Table 6. Multiply the rate by 95
percent to obtain the usable bandwidth in an STS-1 (reduction is due to overhead bytes). One
OC-48 can carry approximately 2.5 channels of 1 Gbps traffic, as shown in Table 6 (Guendert, 2013).
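As a sketch of the arithmetic, the usable bandwidth of an STS-N/OC-N signal can be approximated from the 51.84 Mbps base rate and the 95 percent payload factor given above. The channel count computed below is an approximation; the exact figure (the text cites approximately 2.5) depends on the overhead assumptions used.

```python
STS1_MBPS = 51.84           # base SONET STS-1 signal rate
PAYLOAD_FACTOR = 0.95       # per the text, ~95 percent of the rate is usable

def oc_usable_mbps(n):
    """Approximate usable bandwidth of an STS-N / OC-N signal, in Mbps."""
    return n * STS1_MBPS * PAYLOAD_FACTOR

usable = oc_usable_mbps(48)                  # OC-48 = 48 byte-interleaved STS-1s
print(f"OC-48 usable: {usable:.1f} Mbps")
print(f"~{usable / 1000:.2f} channels of 1 Gbps traffic")
```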
To achieve higher data rates for client connections, multiple STS-1s are byte-interleaved to create an STS-N.
SONET defines this as byte-interleaving three STS-1s into an STS-3 and subsequently interleaving STS-3s.
By definition, each STS is still visible and available for ADD/DROP multiplexing in SONET, although most SAN
requirements can be met with less complex point-to-point connections. The addition of DWDM can even further
consolidate multiple SONET connections (OC-48), while also providing distance extension.
Like TDM, FC-SONET devices typically require a special switch configuration to ensure the use of Idle
primitives rather than Arbitrate (ARB) primitives for compatibility.
Note again that using FC-SONET/SDH-based systems can result in a level of jitter, or variable latency. As a
result, it is not possible to make broad statements about the ability to use frame-based trunking. In general,
it is best practice to avoid frame-based trunking on an FC-SONET/SDH-based configuration.
The Transmission Control Protocol (TCP) is a connection-oriented transport protocol that guarantees reliable in-
order delivery of a stream of bytes between the endpoints of a connection. TCP achieves this by assigning a unique
sequence number to each byte of data, maintaining timers and acknowledging received data through the use of
acknowledgements (ACKs) and retransmission of data if necessary. Once a connection is established between the
endpoints, data can be transferred. The data stream that passes across the connection is considered to be a single
sequence of eight-bit bytes, each of which is given a sequence number.
Brocade Extended Fabrics is a licensed feature that extends SANs across longer distances for Disaster
Recovery and Business Continuance operations by enabling a modified buffering scheme in order to
support long-distance Fibre Channel extensions, such as MAN/WAN optical transport devices. The Brocade
Extended Fabrics feature is discussed in detail in later chapters.
Some FC distance extension equipment (WDM, FC-SONET, FC-IP, and so on) can participate in FC buffer-
to-buffer flow control to increase the distance so that it is greater than what is possible with Extended
Fabrics. Such devices typically participate in E_Port link initialization or snoop the receive-buffer field
in the Exchange Link Parameters (ELP) payload. Since these devices are not aware of Brocade VC_RDY
(Virtual Circuit Ready) flow control, R_RDY flow control must be enabled on Brocade B-Series switches.
These devices return R_RDY credit to the switch in order to maintain performance over hundreds or even
thousands of kilometers. Flow control and error correction between distance extension nodes are performed
independently of the switch and are usually dependent on long-haul network protocols.
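A rough sketch of why extension devices must supply so much credit: to keep a long link streaming, enough buffer credits must be outstanding to cover a full round trip of in-flight frames. The formula below is a common rule-of-thumb approximation, not a Brocade-specified calculation; it assumes full-size 2148-byte frames, 8b/10b encoding, and 5 μs/km propagation delay.

```python
import math

PROPAGATION_US_PER_KM = 5.0      # one-way light propagation delay in fiber
FULL_FRAME_BITS = 2148 * 10      # max FC frame, 8b/10b encoded on the wire

def bb_credits_needed(distance_km, line_rate_gbps):
    """Approximate BB credits to keep a long link full of full-size frames."""
    frame_time_us = FULL_FRAME_BITS / (line_rate_gbps * 1000.0)  # bits / (bits per us)
    round_trip_us = 2 * distance_km * PROPAGATION_US_PER_KM
    return math.ceil(round_trip_us / frame_time_us)

print(bb_credits_needed(100, 1.0625))   # 1 Gbps FC over 100 km -> 50 credits
print(bb_credits_needed(100, 8.5))      # 8 Gbps FC over 100 km -> 396 credits
```

At 8 Gbps over 100 km this works out to roughly 400 credits, which is why the R_RDY credit returned by the extension device, rather than the switch's own buffer pool, is what sustains throughput over hundreds of kilometers.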
References
Brocade Communications Systems. (2008). SAN Distance Extension Reference. San Jose, CA:
Brocade Communications Systems.
Clitherow, D., Schindel, S., Thompson, J., & Narbey, M. (2017). IBM GDPS Family: An Introduction to Concepts
and Capabilities. Poughkeepsie, NY: IBM Redbooks.
Guendert, S. (2007). A Comprehensive Justification For Migrating from ESCON to FICON. Gahanna, OH:
Warren National University.
Guendert, S. (2007). Revisiting Business Continuity and Disaster Recovery Planning and Performance for
21st Century Regional Disasters: The Case For GDPS. Journal of Computer Resource Management, 33-41.
Guendert, S. (2013, October). Storage Networking Business Continuity Solutions. Enterprise Tech Journal.
Guendert, S. (2013, November). The New Paradigm of Business Continuity, Disaster Recovery and
Continuous Availability. Enterprise Executive.
Kern, R., & Peltz, V. (2010). IBM Storage Infrastructure for Business Continuity. Poughkeepsie, NY:
IBM Redbooks.
United States Federal Reserve System. (2003). Interagency Paper on Sound Practices to Strengthen
the Resilience of the US Financial System. United States Federal Reserve.
PART 3 Chapter 3: Remote Data Replication for DASD
Introduction
This chapter and the next will examine remote data replication (RDR) for storage. RDR is defined as
the process of creating replicas of information assets at remote sites/locations. Remote replication helps
organizations mitigate the risks associated with regionally driven outages resulting from natural or human-
made disasters. During such a disaster, RDR enables the workload to be moved to a remote site to ensure
continuous business operations.
This chapter will focus on RDR for Direct Access Storage Devices (DASD), aka disk, while the next chapter
will focus on RDR for tape, both native tape and virtual tape. Our discussion on RDR for DASD in this chapter will
focus on the basic methods of RDR for DASD in mainframe environments (synchronous and asynchronous)
as well as the basic technologies used (host-based and DASD array-based). We will briefly discuss some
basic principles such as data consistency, data integrity, and latency. Following this will be a section of the
chapter which will discuss DASD RDR specifics of IBM, EMC, and HDS products. Finally, the chapter will
conclude with an introduction to Extended Distance FICON, and a comparison of Extended Distance FICON
with Brocade’s Advanced FICON Accelerator’s XRC emulation.
It should be noted that this chapter will limit itself to basic DASD RDR between two sites. Complex, multi-site
DASD RDR will be covered in Part 3 Chapter 5. Finally, this chapter is not meant to be a detailed treatise
on the subject at hand. Rather, it is intended to be a good solid introduction to the subject from a storage
network engineering point of view. For additional details, the DASD vendors have countless white papers,
books, and other documentation, much of which has been used as reference material in the writing of this
chapter.
Since writes must be committed on both the source and target before sending the “write complete”
acknowledgement back to the host, application response time is increased with synchronous RDR. The
exact degree of adverse impact on response time depends primarily on the distance between sites, the
bandwidth available for RDR, and the quality of service (QoS) of the network connectivity infrastructure.
Figure 1 (fig01_P4CH3) illustrates synchronous RDR between a source and a target array over network
links, with System z and open systems hosts attached:
1. Write is received by the source array from the host/server.
2. Write is transmitted by the source array to the target array. A write must be committed to the source
and remote replica before it is acknowledged back to the host (System z).
3. The target array sends an acknowledgement to the source array. This ensures that the source and
remote replica have identical data at all times and that it is ordered the way it was sent.
4. The source array signals write complete to the host/server.
Synchronous replication provides the lowest RPO and RTO, usually 0. Figure 2 (fig02_P4CH3) shows the
extra I/O latency this adds over time.
Referring to figure 2, if the bandwidth provided for synchronous RDR is less than the maximum write
workload, there will be times when the application response time may be excessively elongated, perhaps
even to the point of causing applications to time out. The distances over which synchronous RDR should
be deployed depend on the applications' capability to tolerate response time elongation. Therefore,
synchronous RDR is typically deployed for distances less than 200 km (125 miles). Rarely have the authors
seen real-world implementations of synchronous RDR in excess of 50 miles.
Figures 3 and 4 (fig03_P4CH3, fig04_P4CH3) illustrate asynchronous RDR and the I/O latency it adds
over time.
With asynchronous RDR, the required network bandwidth can be provisioned equal to or greater than the
average write workload. Data can be buffered during times when the bandwidth is not enough, and moved
later to the remote site. Therefore, sufficient buffer capacity should be provisioned. With asynchronous RDR,
data at the remote site will be behind the source by at least the size of the buffer; asynchronous
RDR therefore provides a finite, but non-zero, RPO. The exact RPO depends on the size of the buffer, available
network bandwidth, and the write workload at the source. Asynchronous RDR implementations can take
advantage of what is termed “locality of reference.” Simply stated, locality of reference refers to repeated
writes to the same location. If the same location is written to multiple times in the buffer prior to data
transmission to the remote site, only the final version of the data is transmitted, thus conserving network
link bandwidth.
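The locality-of-reference optimization can be sketched with a toy buffer model; the class and method names below are invented for illustration. Repeated writes to one location collapse into the final version before transmission:

```python
# Toy model of write coalescing in an asynchronous replication buffer:
# repeated writes to the same location collapse to the most recent one.

class AsyncReplicationBuffer:
    def __init__(self):
        self._pending = {}               # location -> latest data written

    def write(self, location, data):
        self._pending[location] = data   # overwrites any earlier buffered write

    def drain(self):
        """Return the coalesced batch to ship to the remote site."""
        batch, self._pending = self._pending, {}
        return batch

buf = AsyncReplicationBuffer()
buf.write(0x10, b"v1")
buf.write(0x20, b"a")
buf.write(0x10, b"v2")                   # supersedes v1 before transmission
print(buf.drain())                       # only the final version of 0x10 is sent
```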
LVM-based RDR
LVM-based RDR is performed and managed at the volume group level. Writes to the source volumes are
transmitted to the remote host by the LVM. The LVM on the remote host receives the writes and commits
them to the remote volume group. LVM RDR supports both synchronous and asynchronous modes of
replication. LVM RDR is also independent of the storage device, therefore, LVM RDR supports replication
between heterogeneous DASD arrays. With LVM, if a failure occurs at the source site, applications can be
restarted on the remote host using the data on the remote replicas. The LVM RDR process adds overhead
on the host CPUs: CPU resources on the source host are shared between replication tasks and applications.
This may cause performance degradation to the applications running on the host. Finally, because the
remote host is actively involved in the replication process, it must be continuously up and available (EMC
Education Services, 2012).
Figure 5 (fig05_P4CH3) illustrates this replication over an IP network between Linux on System z hosts.
Prior to starting production work and the replication of log files, all relevant components of the source
database are replicated to the remote site. This is done while the source database is shut down. Next,
production work commences on the source database, and the remote database is started in a standby
mode of operation. In standby mode, the database typically is not available for transactions.
All database management systems (DBMSs) switch log files either at preconfigured time intervals or when a
log is full. For example:
1. The current log file is closed at the time of log switching and a new log file is opened.
2. When a log switch occurs, the closed log file is transmitted by the source host to the remote host.
3. The remote host receives the log and updates the standby database.
This process ensures that the standby database is consistent up to the last committed log. RPO at the
remote site depends on the size of the log and the frequency of log switching, both of which need to be
considered when determining the optimal size of the log file.
An existing IP network can be used for replicating log files. Host-based log shipping requires low network
bandwidth because it transmits only the log file at regular intervals.
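The three-step log-shipping cycle above can be sketched as follows; the data structures stand in for the log files and the standby database and are purely illustrative:

```python
# Minimal sketch of host-based log shipping: on each log switch, the closed
# log is transmitted to the remote host and applied to the standby database.

def log_switch(active_log, archive):
    """Close the current log file and open a fresh one."""
    archive.append(bytes(active_log))
    active_log.clear()

def ship_and_apply(archive, standby):
    """Transmit each closed log to the remote host, which applies it."""
    while archive:
        log = archive.pop(0)
        standby.append(log)              # standby DB consistent up to this log

active, archive, standby = bytearray(), [], []
active += b"txn1;txn2;"
log_switch(active, archive)              # interval elapsed, or the log filled up
ship_and_apply(archive, standby)
print(standby)                           # standby holds the last shipped log
```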
Replication network performance and RAS is critical with this type of replication. If network links fail,
replication will be suspended; however, production work can continue uninterrupted on the source DASD
array. The source array will keep track of the writes that are not transmitted to the remote storage array.
When the network links are restarted, the accumulated data is transmitted to the remote array. During the
time period of network link outage, if there is a failure at the source site, some data will be lost and the RPO
at the target will be greater than the ideal of zero.
There are some implementations of asynchronous remote replication that maintain write ordering (check
with your DASD vendor for your specifics). Typically:
1. A timestamp and sequence number are attached to each write when it is received by the
source array.
2. Writes are transmitted to the remote array and they are committed to the remote replica in the
exact order in which they were buffered at the source.
This process offers an implicit guarantee of the consistency of the data on remote replicas.
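A minimal sketch of the timestamp/sequence-number scheme, assuming the remote array holds out-of-order arrivals until the gap in sequence numbers is filled (the names are invented for illustration):

```python
# Sketch of write ordering for async replication: each write is stamped with a
# sequence number at the source; the remote commits strictly in that order.

import itertools, heapq

class Source:
    def __init__(self):
        self._seq = itertools.count(1)
    def stamp(self, data):
        return (next(self._seq), data)   # (sequence number, payload)

class RemoteReplica:
    def __init__(self):
        self._next = 1
        self._held = []                  # out-of-order arrivals, a min-heap
        self.committed = []
    def receive(self, stamped):
        heapq.heappush(self._held, stamped)
        while self._held and self._held[0][0] == self._next:
            _, data = heapq.heappop(self._held)
            self.committed.append(data)  # commit in exact source order
            self._next += 1

src, dst = Source(), RemoteReplica()
w1, w2, w3 = src.stamp("A"), src.stamp("B"), src.stamp("C")
for w in (w2, w3, w1):                   # arrive out of order over the network
    dst.receive(w)
print(dst.committed)                     # ['A', 'B', 'C'] -- source order kept
```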
Other implementations ensure consistency by leveraging the dependent-write principles inherent in most
DBMSs.
Asynchronous DASD-based RDR provides network bandwidth cost savings because the required
bandwidth is lower than the peak write workload. During periods when the write workload exceeds
the average bandwidth, sufficient buffer capacity must be configured on the source array to hold
these writes.
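The buffer-sizing requirement can be illustrated with a toy calculation: the buffer must absorb whatever the write workload exceeds the provisioned link rate during the peak. The workload trace below is invented for illustration.

```python
# Back-of-the-envelope sizing of the source-side buffer for asynchronous RDR.
# Whenever writes exceed link bandwidth, the excess accumulates in the buffer.

def required_buffer_mb(write_mbps_per_interval, link_mbps, interval_s=1):
    """Peak backlog (in MB) for a write trace, given link bandwidth in Mbit/s."""
    backlog = peak = 0.0
    for write_mbps in write_mbps_per_interval:
        # Mbit/s excess * seconds / 8 = megabytes added to (or drained from) buffer
        backlog = max(0.0, backlog + (write_mbps - link_mbps) * interval_s / 8)
        peak = max(peak, backlog)
    return peak

workload = [200, 800, 1200, 900, 300, 100]   # Mbit/s written per one-second interval
print(required_buffer_mb(workload, link_mbps=600))   # MB of buffer needed at peak
```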
RTO is a measure of how long a business can tolerate an IT application being inoperable. Values for RTO
typically range from a few minutes to several hours, with the most mission-critical applications having RTO
goals of zero to near zero.
As it relates to DASD replication, RPO measures the amount of time the secondary storage subsystem lags
behind the primary storage subsystem (Kern & Peltz, 2010). When the primary storage subsystem receives
a write command for a volume that is part of a remote copy pair, the primary system initiates the process of
transmitting the data to the secondary storage system at the remote site. This process includes the actual
data transmission time. At any given moment, if you are using asynchronous replication, the currency of
the data at the secondary storage subsystem may lag the primary. This time lag typically ranges from a few
to several seconds depending on the workload at the primary and secondary storage subsystems and the
remote data link bandwidth.
Since RPO measures the amount of time that the secondary storage subsystem lags the primary, the
corresponding amount of data that must be recreated at the secondary in the event of an unplanned
system outage is directly proportional to the RPO. Therefore, RPO is a very important measure of how fast
an installation can recover from an outage or disaster. RPO has a direct influence on each application’s RTO
and on the RTO of the business as a whole (Kern & Peltz, 2010).
The technique for ensuring that the write order sequence is preserved at the secondary site is known
as creating consistency groups. Essential elements for cross volume data integrity and data consistency
include (Kern & Peltz, 2010):
1. The ability to create point in time (PIT) copies of the data as necessary.
2. The ability to provide for a secondary site “data freeze” in the case where the secondary site will be used
for DR/BC.
3. The ability to have a common restart point for data residing across multiple operating system platforms,
thereby reducing end to end recovery time.
Synchronous DASD replication proceeds as follows:
1. The primary DASD subsystem issues a write I/O to the secondary DASD subsystem.
2. The primary DASD subsystem waits until it has received notification from the secondary subsystem that
the data was successfully received by the secondary.
3. Upon receiving notification from the secondary subsystem, the primary subsystem notifies the host
server that the write I/O was successful.
There is an inherent delay in synchronous DASD replication. This delay is due to the time required to send
an electrical signal or optical pulse from the primary DASD subsystem to the secondary DASD subsystem,
and the corresponding reply from the secondary back to the primary. This delay is known as latency. Latency
is different from the time required to transmit data due to the limitations imposed by replication network
link bandwidth. Light propagates through optical fiber at approximately 5 microseconds/kilometer, which
equates to a latency of 0.5 milliseconds per 100 km one way, or 1 ms per 100 km round trip.
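The latency arithmetic above is easy to verify using the 5 microseconds/kilometer figure from the text:

```python
PROPAGATION_US_PER_KM = 5.0   # approximate light propagation delay in fiber

def round_trip_ms(distance_km):
    """Round-trip propagation delay added to every synchronous write, in ms."""
    return 2 * distance_km * PROPAGATION_US_PER_KM / 1000.0

print(round_trip_ms(100))   # 1.0 ms per 100 km round trip, matching the text
print(round_trip_ms(300))   # 3.0 ms at the 300 km extended Metro Mirror distance
```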
The DASD hardware vendors attempt to mitigate overhead and latency inherent in synchronous replication
data transmission by using a variety of techniques including pipelining I/Os, and exploiting different link
protocols. As a rule of thumb, the more complicated the transmission algorithm (such as with pipelining
techniques), the more time is spent in the synchronous link overhead to maintain I/O data integrity and
cross volume data consistency.
One example of a vendor technique is an IBM feature known as pre-deposit writes. Pre-deposit write
improves the normal fibre channel protocol (FCP) from the standard two protocol exchanges to an optimized
single protocol exchange (Kern & Peltz, 2010). Functionally, several I/O requests are sent as one “I/O
package”. This reduces the number of handshakes required to execute the I/O chain. The use of HyperPAV/
Parallel Access Volume (PAV), Multiple Allegiance, and I/O priority queuing functions in conjunction with pre-
deposit write can significantly help mitigate performance overhead. With pre-deposit write enabled, IBM can
extend IBM Metro Mirror synchronous replication distances to 300 km with good performance.
A second example of a vendor technique is IBM's zHyperWrite technology provided by the IBM DS8880.
zHyperWrite is used by z/OS DFSMS and DB2 to help accelerate DB2 log writes in IBM Metro Mirror
synchronous RDR environments when using either IBM Geographically Dispersed Parallel Sysplex (GDPS) or
IBM Copy Services Manager and HyperSwap technology.
When an application sends an I/O request to a volume that is in synchronous data replication, the response
time is affected by the latency caused by the replication management itself, in addition to the latency
because of the distance between source and target disk control units (about 10 microseconds per km
round trip). Although the latter cannot be avoided (it is limited by the speed of light), there is an opportunity
to reduce the latency added by managing the synchronous relationship when application usage allows this
technique to be employed. IBM zHyperWrite combines concurrent DS8000 Metro Mirror (PPRC) synchronous replication and
software mirroring through media manager (DFSMS) to provide substantial improvements in DB2 log write
latency. With zHyperWrite, DB2 I/O writes to the DB2 logs are replicated synchronously to the secondary
volumes by DFSMS. For each write to the primary DB2 log volume, DFSMS updates its secondary volume at
the same time. Both primary and secondary volumes must be in the Metro Mirror (PPRC) relationship, but
z/OS instructs the DS8000 to avoid using Metro Mirror replication for these specific I/O writes to DB2 logs.
The I/O write to the DB2 log with zHyperWrite is completed only when both primary and secondary volumes
are updated by DFSMS. All I/O writes directed to other DB2 volumes (tablespaces, indexes, and others) are
replicated to the secondary volumes by DS8000 Metro Mirror, as shown in Figure 6 (Dufrasne, et al. 2017).
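As a rough illustration of where zHyperWrite saves time, the toy model below compares a serial write-then-mirror path with two parallel writes driven by the host. Everything here (function names, the management-overhead term, all numbers) is an assumption for illustration, not IBM's implementation.

```python
# Toy model, all values assumed: with Metro Mirror alone, a log write
# completes serially (local write, then the synchronous mirror hop plus
# replication management); with zHyperWrite, DFSMS issues the primary and
# secondary writes in parallel, so response time is the slower of the two.

US_PER_KM_ONE_WAY = 5.0

def rtt_us(distance_km):
    return 2.0 * distance_km * US_PER_KM_ONE_WAY

def metro_mirror_log_write_us(write_svc_us, distance_km, mirror_mgmt_us):
    # serial path: local write + replication round trip + mirror management
    return write_svc_us + rtt_us(distance_km) + mirror_mgmt_us

def zhyperwrite_log_write_us(write_svc_us, distance_km):
    # parallel writes: local write vs. remote write over extended FICON
    remote = write_svc_us + rtt_us(distance_km)
    return max(write_svc_us, remote)

d, svc, mgmt = 20.0, 200.0, 150.0  # km, microseconds -- assumed values
print(metro_mirror_log_write_us(svc, d, mgmt))  # 550.0
print(zhyperwrite_log_write_us(svc, d))         # 400.0
```

The point of the model is only that removing the replication-management term and overlapping the two writes shortens the log-write response time; actual gains depend on control unit internals.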
For Metro Mirror, the DS8000 series supports only Fibre Channel Protocol (FCP) links. FICON links may not
be used. For Metro Mirror paths, you should strongly consider defining all Metro Mirror paths that are used
by one application environment on the same set of physical links if you intend to keep the data consistent. A
DS8000 series FCP port can simultaneously fill multiple roles, carrying both Metro Mirror and host I/O traffic (Cronauer, Dufrasne, Altmannsberger, & Burger, 2013).
Each Metro Mirror port provides connectivity for all Logical Control Units (LCUs) within the DS8000 series
array and can carry multiple logical Metro Mirror paths. Although one FCP Link has sufficient bandwidth
capacity for most Metro Mirror environments, it is preferable (a best practice) to:
1. Configure at least two FCP Links between each source and remote DASD subsystem to:
a. Provide multiple logical paths between LCUs.
b. Provide redundancy for continuous availability in the event of a physical path failure.
2. Dedicate FCP ports for Metro Mirror usage, ensuring no interference from host I/O activity. Metro Mirror
is time critical and should not be impacted by host I/O activity.
1. IBM Global Mirror: this function runs on IBM DS8000 series DASD arrays.
2. z/OS Global Mirror (XRC): this function runs on IBM Z mainframes in conjunction with the DS8000 series
array.
1. Dependent writes: For Global Mirror, if the start of one write operation depends upon the completion
of a previous write, the writes are said to be dependent. For example: databases with their associated
logging files.
2. Consistency: For Global Mirror, the consistency of data is ensured if the order of dependent writes to
disks or disk groups is maintained. Data consistency of the secondary site is important for the ability to
restart a database.
3. Data currency: For Global Mirror, this refers to the difference of time since the last data was written
at the primary site, versus the time the same data was written to the secondary site. It determines the
amount of data you must recover at the remote site.
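The dependent-write idea can be made concrete with a small sketch. The checker below (all names invented for this example) refuses to apply a data write before the log record it depends on, which is exactly the ordering a restartable database requires.

```python
# Illustrative sketch of dependent writes: a database data-page write
# depends on its log record having been applied first. Replaying writes
# out of order at the secondary yields a state the database cannot
# restart from.

log, data = [], {}

def apply_write(write):
    kind, payload = write
    if kind == "log":
        log.append(payload)
    else:  # a data write depends on its log record
        page, value, log_seq = payload
        assert any(entry["seq"] == log_seq for entry in log), \
            "dependent write applied before the write it depends on"
        data[page] = value

writes = [
    ("log",  {"seq": 1, "op": "update page A"}),
    ("data", ("A", "new-value", 1)),
]
for w in writes:          # in-order apply preserves consistency
    apply_write(w)
print(data["A"])          # new-value
```

Applying the same two writes in the reverse order trips the assertion, which is the software analogue of an inconsistent secondary copy.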
The design objectives for IBM DS8000 Global Mirror are threefold (Kern & Peltz, 2010):
1. Keep currency of the data at the secondary site consistent to within 3-5 seconds of the primary site,
assuming sufficient bandwidth and storage system resources.
2. Do not impact the DASD subsystems’ host response time when insufficient bandwidth and system
resources are available.
3. Provide data consistency and predictable host response time across multiple primary and secondary
DASD subsystems to enhance scalability.
The logical paths are unidirectional; that is, they are defined to operate in one direction or the other.
However, any particular pair of LSSs can have paths defined between them in opposite
directions (each LSS holds both primary and secondary volumes of the other LSS). These
opposite direction paths can be defined on the same FCP physical link. For bandwidth and redundancy,
multiple paths can be created between the primary and secondary LSSs. The physical links are bidirectional
and can be shared by other remote replication and copy functions such as Metro Mirror. However, it is a best
practice not to do so. Although Global Mirror is asynchronous, it should not share paths with host I/O as this
could affect the number of out of sync tracks and the creation of consistency groups at the secondary site.
A native fibre connection can provide high bandwidth, but only within a limited distance range; it can also
carry different protocols over the same fibre, though at a high implementation cost. SAN
switching provides better management control and workload balancing on the connections. FCIP overcomes
the distance limitations of native FC and DWDM and enables geographically separated SANs to be linked
(Guendert, 2013).
[Figure 6: Global Mirror session between local and remote IBM DS8800/DS8700 arrays. Global Copy source devices at the local site are replicated over SAN/network connectivity to Global Copy target devices at the remote site, with FlashCopy journal devices and optional practice devices completing the Global Mirror session.]
zGM Operation
Below is the typical sequence for zGM operations (Kern & Peltz, 2010):
1. A host application issues a write request to a disk on the primary subsystem that is part of a zGM
configuration.
2. zGM intercepts the write request and captures all the information required by the System Data Mover
(SDM) to recreate the write operation at the secondary site. This information includes the type of write
command, sequence number, and host time stamp. All this information is collected in about 0.04 ms, after
which the write command continues and is completed.
3. Independent of the application I/O running on the host at the primary site, the SDM communicates with
the DASD subsystem at the primary site.
4. The SDM reads all record updates across all volumes in an LSS.
5. The updates are collected through the SDM across all primary storage subsystems, journalized in
consistency groups, and subsequently written to the secondary target disks.
It should be noted that zGM implementation specific details can vary by vendor.
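Step 5, the consistency-group cut, can be sketched in a few lines. This is a simplified illustration of a timestamp-based cut, not actual SDM code; all names and structures are invented for the example.

```python
# Simplified sketch of zGM step 5: the SDM merges timestamped updates
# from several primary subsystems, cuts a consistency group at the
# oldest "newest timestamp" seen on every subsystem (later updates may
# still be in flight from a slower subsystem), journals the group, and
# applies it to the secondary disks in timestamp order.

def cut_consistency_group(updates_by_subsystem):
    cut = min(max(u["ts"] for u in ups)
              for ups in updates_by_subsystem.values())
    group = [u for ups in updates_by_subsystem.values()
             for u in ups if u["ts"] <= cut]
    return sorted(group, key=lambda u: u["ts"]), cut

updates = {
    "LSS0": [{"ts": 1, "vol": "A"}, {"ts": 4, "vol": "A"}],
    "LSS1": [{"ts": 2, "vol": "B"}, {"ts": 3, "vol": "B"}],
}
group, cut = cut_consistency_group(updates)
print(cut, [u["ts"] for u in group])  # 3 [1, 2, 3]
```

The update with timestamp 4 is held back: LSS1 has only reported through timestamp 3, so including it could break dependent-write ordering.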
[Figure 7: zGM operation. The System Data Mover (SDM) at the remote array pulls updates from the local DASD array over a long-distance SAN/WAN and pushes them to the remote DASD; zGM works with EMC, HDS, and IBM storage.]
One common thread throughout the operational modes has been control of the SRDF family products by
a mainframe-based application called the SRDF Host Component. SRDF Host Component is the central
mechanism through which all SRDF functionality is made available. Dell EMC Consistency Group for z/OS
is another highly useful tool for managing dependent write consistency across links between Symmetrix
subsystems attached to one or more IBM Z mainframes.
1. Support for either ESCON or FICON host channels regardless of the SRDF link protocol employed:
SRDF is a protocol deployed between Symmetrix subsystems that mirrors data on both sides of a
communications link. IBM Z host connectivity to the Symmetrix storage subsystem (ESCON, FICON) is
completely independent of the protocols used in moving data over the SRDF links.
2. The ability to deploy SRDF solutions across the entire enterprise: SRDF Host Component can manage
remote replication for both Count Key Data (CKD) and Fixed Block Architecture (FBA) format disks. In
these deployments, both a mainframe and one or more open systems hosts are attached to the primary
side of the SRDF relationship. Enterprise SRDF deployments can be controlled either by mainframe
hosts or by open systems hosts.
3. Software support for taking an SRDF link offline: SRDF Host Component has a software command that
can take an SRDF link offline independent of whether it is taking the target volume offline. This feature
is useful if there are multiple links in the configuration and only one is experiencing issues, for example
too many bounces (sporadic link loss) or error conditions. It is unnecessary to take all links offline, when
taking the one in question offline is sufficient.
SRDF/Synchronous (SRDF/S)
SRDF/S is DASD remote replication that creates a synchronous replica at one or more Symmetrix targets
located within campus, metropolitan, or regional distances. SRDF/S provides a no data loss (near zero RPO)
solution.
SRDF/Asynchronous (SRDF/A)
SRDF/A is a DASD remote replication solution that enables the source to asynchronously replicate
data. SRDF/A incorporates delta set technology which enables write ordering by employing a buffering
mechanism. Dell EMC Symmetrix storage subsystems began supporting SRDF/A with EMC
Enginuity version 5670.
SRDF/A is a mode of remote DASD replication that allows the end user to asynchronously replicate data
while maintaining a dependent-write consistency copy of the data on the secondary (R2) devices at all
times. This dependent-write consistent point-in-time copy of the data at the secondary side is typically only
seconds behind the primary (R1) side. SRDF/A session data is transferred to the secondary Symmetrix
subsystem in cycles or delta sets. This eliminates the redundancy of multiple same track changes being
transferred over the SRDF links, potentially reducing the replication network bandwidth requirements.
SRDF/A provides a long-distance replication solution for Dell EMC customers who require minimal host
application impact while maintaining a dependent-write, consistent restartable image of their data at
the secondary site. In the event of a disaster at the primary (R1) site or if SRDF links are lost during data
transfer, a partial delta set of data can be discarded, preserving dependent-write consistency on the
secondary site within no more than two SRDF/A cycles.
Compatible Peer’s volume pairs must be composed of Dell EMC volumes; however, Symmetrix subsystems
with Compatible Peer can be configured side by side with other vendor DASD subsystems employing PPRC.
Even though Compatible Peer employs SRDF/S microcode operations and physical SRDF components, it is
entirely operated by PPRC directives.
For IBM z/OS, TrueCopy is compatible with and optionally managed via PPRC commands. This enables
compatibility in IBM HyperSwap environments to provide high availability of IBM Z mainframes. TrueCopy
supports synchronous replication up to 300 km (190 miles) via Fibre Channel connectivity. HDS supports up
to 4,094 replication pairs and up to 16 consistency groups per DASD subsystem.
A TrueCopy for mainframe system creates and maintains a mirror image of a production volume at a remote
location. Data in a TrueCopy backup stays synchronized with the data in the local storage subsystem. This
happens when data is written from the host to the local storage subsystem, then to the remote storage
subsystem via the fibre channel data path. The host holds subsequent output until an acknowledgement is
received from the remote storage subsystem for the prior output.
1. The local VSP storage subsystem is connected to a host. The remote storage subsystem is connected
to the local system via fibre channel data paths. The remote system may be a VSP, or a different HDS
model.
2. A host at the local site is connected to the local storage subsystem. HDS also typically recommends
having a host at the secondary site connected to the remote storage subsystem for use in a disaster
recovery scenario. If this is not feasible, the local host must be in communication with the remote
storage subsystem.
3. A main volume (M-VOL) on the local storage subsystem that is copied to the remote volume (R-VOL) on
the remote system.
4. Fibre channel data paths for data transfer between the main and remote storage subsystems.
5. Target, initiator, and RCU target ports for the fibre channel interface.
HUR offloads the majority of processing for asynchronous replication to the remote storage subsystem using
unique HDS pull technology. HUR also uses journaling to keep track of all updates at the primary site until
the updates are applied on the remote storage subsystem. The journaling approach provides resiliency for
a shorter RPO while protecting production performance during network anomalies. The HDS pull technology
frees up processing power at the primary site for critical production activities and limits the need to upgrade
your primary site storage infrastructure.
With HUR, the remote volumes are an asynchronous block-for-block copy of the local storage volumes. HUR is
designed to support a secondary site hundreds or thousands of miles from the local site to facilitate recovery
from regional disasters. HUR is designed to limit impact on the local (primary) storage subsystem. Updates sent
from a host mainframe to the primary production volume on the local storage subsystem are copied to a local
journal volume. The remote storage subsystem pulls data from this journal volume across the communication
link(s) to the secondary volume. With HUR, the local storage subsystem is therefore free to perform its role as a
transaction processing resource rather than as a replication engine.
Hitachi Replication Manager is a server-based GUI software tool that can be used to configure, monitor, and
manage Hitachi storage based replication products for both open systems and mainframe environments
as an enhancement to BCM. In essence it provides a single pane of glass to monitor and operate both
mainframe and open systems replication.
As part of the Feb. 26, 2008, IBM z10 and DS8000 announcements, IBM announced Information Unit (IU)
pacing enhancements that allow customers to deploy z/OS Global Mirror over long distances without a
significant impact on performance. This is more commonly known as Extended Distance FICON. The more
technically accurate term, as defined in the ANSI T11 FC-SB3/4 standards, is persistent IU pacing. The ANSI
T11 FC-SB3 standard was amended (FC-SB3/AM1) in January 2007 to incorporate the changes made with
persistent IU pacing (Flam & Guendert, 2011).
Extended Distance FICON really is a marketing term created to add some “pizzazz” to the standards
terminology. IBM coined the term in February 2008, but it really has nothing to do with increasing distance;
rather, it’s a performance enhancement for FICON data traffic traveling over a long distance. Enhanced
Distance FICON would be a more accurate marketing term.
By the late 1990s, the shortcomings of ESCON were driving a new technical solution: FICON evolved to
address ESCON's limitations in bandwidth, channel/device addressing, and distance.
Unlike parallel and ESCON channels, FICON channels rely on packet switching rather than circuit switching.
FICON channels also rely on logical connectivity based on the notion of assumed completion rather than the
permission-seeking schema. These two changes allow a much higher percentage of the available bandwidth to
be used for data transmission and yield significantly better performance over extended distances (Brocade
Communications Systems, 2011).
Numerous tests comparing FICON to ESCON have been performed by IBM, other SAN equipment manufacturers,
and independent consultants. The common theme among these tests is that FICON is a much better
performing protocol over a wide range of topologies than traditional ESCON configurations. This
performance advantage becomes especially notable as the distance between the channel and control unit
increases.
Figure 9 (Brocade Communications Systems, 2011) is a representation of what was referred to earlier as
ESCON’s permission-seeking/challenge-response schema. ESCON channels only transmit a single CCW at
a time to a storage subsystem. Once the subsystem (such as a DASD array) has completed its work with the
CCW, it notifies the channel of this with a channel-end/device-end (CE/DE). The CE/DE presents the status
of the CCW. If the status of the CCW was normal, the channel then transmits the next CCW of the channel
program. Assuming that we are discussing DASD, this leads to approximately 75 percent of the reported
connect time being representative of time periods in which the channel is occupied, but data is not being
transferred. This schema presents two distance-related performance issues. First, each CCW experiences
distance-related delays as it is transmitted. Second, the small size of the packet that ESCON employs to
transfer data, known as a Device Information Block (DIB), coupled with the very small data buffer employed
by ESCON results in the phenomenon that is now known as ESCON droop. ESCON droop is a substantial
effective data rate drop experienced as the distance between the host channel and the storage device
increases.
[Figure 9: ESCON CCW pattern. For each CCW, the channel sends the command (CMD) and waits for channel-end/device-end (CE/DE) status (END) before sending the next CCW.]
Figure 10 (Brocade Communications Systems, 2011) represents the FICON CCW pattern, which is often
referred to as assumed completion. The entire channel program is transmitted to the storage subsystem at
the start of an I/O operation. If there are more than 16 CCWs in the channel program, the remainder are
transmitted in groups of 8 until the channel program is completed. The overwhelming majority of channel
programs are 16 or fewer CCWs in length. After the storage subsystem completes the entire channel
program, it notifies the channel with a channel end/device end (CE/DE). The numerous turnarounds present
with the ESCON protocol are eliminated. The FICON architecture has been tested as supporting distances of
up to 120 km without the problem of droop.
[Figure 10: FICON CCW pattern (assumed completion). The FICON channel streams CCWs 1, 2, 3, and so on to the FICON control unit and device without waiting for individual responses; the control unit returns a single CE/DE when the channel program completes.]
Droop begins when the link distance reaches the point at which the time required for light to make one round
trip on the link is equal to the time it takes to transmit a number of bytes that will fill the receiver’s buffer. The
main factors are the speed of light, the protocol itself, the physical link distance between the channel and
control unit, the link data rate, and the buffer capacity. With ESCON, droop begins at about 9 km, whereas
FICON does not experience droop until at least 100 km.
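That definition of droop onset translates directly into a formula: droop begins where one light round trip equals the time to transmit a buffer's worth of data. In the sketch below the buffer sizes are assumed, order-of-magnitude values chosen only to roughly reproduce the figures quoted above.

```python
# Rough model of droop onset: droop begins when one light round trip on
# the link takes as long as transmitting enough bytes to fill the
# receiver's buffer. Buffer sizes here are assumed, illustrative values.

US_PER_KM_ONE_WAY = 5.0

def droop_onset_km(buffer_bytes, link_rate_bps):
    fill_time_us = buffer_bytes * 8 / link_rate_bps * 1e6
    return fill_time_us / (2 * US_PER_KM_ONE_WAY)

# ESCON: ~200 Mbit/s link with ~2 KB of receiver buffering (assumed)
print(round(droop_onset_km(2048, 200e6), 1))         # 8.2
# FICON at 1 Gbit/s with 64 buffer credits x 2 KB frames (assumed)
print(round(droop_onset_km(64 * 2048, 1.0625e9), 1))  # 98.7
```

With these assumed buffer sizes, the model lands near the roughly 9 km ESCON and 100 km FICON onset figures cited in the text.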
In a configuration where a channel is extended beyond traditional distances, the additional delay of the
extension solution can add significantly to the droop effect of an ESCON or FICON channel. Generally, at 20
km ESCON delivers less than half the maximum effective bandwidth that would be delivered at zero distance.
FICON is not affected until after 120 km.
How do you overcome the effects of channel droop, maintain native (or near native) performance, and still
guarantee the integrity of the channel program and corresponding data and information? Until the introduction
of persistent IU pacing/Extended Distance FICON, the solution was to implement a combination of frame-
shuttle, channel, control unit, and device emulation technologies on channel extension hardware, either stand-
alone specialized hardware, or a blade in a FICON director.
Buffer credits are a crucial and often overlooked component in configurations in which frames will travel long
geographic distances between optically adjacent ports (ports connected to each other). Z Systems FICON
channel cards and storage host adapter ports have a fixed number of buffer credits per port. Switching devices
(including channel extension) typically can have their buffer credit settings customized. This is important, as the
long distances are typically between the switching devices. Having the correct/optimal number of buffer credits
ensures frames will “stream” over the distance and the link will be fully utilized. An insufficient number will result
in dormant periods on the link (stopped traffic). The optimal number of buffer credits depends on three variables:
link speed, frame size, and distance (Guendert, 2007).
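A back-of-envelope version of that three-variable relationship: enough credits to cover one round trip's worth of in-flight full-size frames. The frame size, the 8b/10b encoding accounting, and the 5 microseconds-per-km figure are assumptions; real traffic mixes smaller frames, which raises the credit requirement.

```python
# Illustrative buffer-credit sizing: to keep a long link streaming, the
# sender needs enough credits to cover the frames in flight during one
# round trip. Values below are assumptions for illustration.
import math

US_PER_KM_ONE_WAY = 5.0

def credits_needed(distance_km, line_rate_baud, frame_bytes=2148):
    # 8b/10b encoding: each byte occupies 10 bits on the wire (1/2/4/8G FC)
    frame_time_us = frame_bytes * 10 / line_rate_baud * 1e6
    round_trip_us = 2 * distance_km * US_PER_KM_ONE_WAY
    return math.ceil(round_trip_us / frame_time_us)

# 100 km at 8 Gbit/s FC (8.5 GBaud line rate) with full-size frames
print(credits_needed(100, 8.5e9))  # 396
```

This matches the common rule of thumb of roughly half a credit per km per Gbit/s for full-size frames; a switch port configured with fewer credits than this will leave the link idle part of the time.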
When receiving frames from the link, the Fibre Channel Physical and Signaling Interface (FC-PH) standard
provides a mechanism that assembles sub-blocks of data contained in the payloads of one or more frames
into a single data block. This single data block is known as a sequence. Each FC-4 ULP defines the contents
of the sequences used for ULP functions. When this content (and use of the sequence) is defined in this
manner, it’s called an IU, which is a logical resource composed of one to four frames. FICON and FICON
Express/Express 2/Express 4/Express 8 channels currently support a maximum of 16 concurrent IUs for
each OE. SB-3 IUs contain commands, device-level controls, link controls, and other related functions.
Common examples of SB-3 IUs are command IUs, solicited data IUs, unsolicited data IUs, solicited control
IUs, and unsolicited control IUs.
An Open Exchange (OE) is a logical resource representing an I/O being processed by a channel. FICON and
FICON Express/Express 2 channels on z Series servers can support 32 concurrent OEs. FICON Express
2/4/8/8S/16S channels on IBM z9 and later servers can support 64 concurrent OEs. IUs sent by a FICON
channel during the execution of an SB-3 link-control function or device-level function are restricted to one
exchange, known as an outbound exchange. The IUs a FICON channel receives during an I/O operation are
restricted to a different exchange, known as an inbound exchange (Kembel, 2002). Figure 11 shows the
relationship between frames, sequences, and exchanges.
[Figure 11: the relationship between frames, sequences, and exchanges — frames are grouped into sequences, and sequences into an exchange.]
An inbound exchange and outbound exchange transfer IUs in a single direction. An exchange pair occurs when
both an inbound and outbound exchange simultaneously exist between a channel and control unit for execution
of the same device-level or link-level function. The control unit is then said to be connected to the channel. A
channel program, executed in a single connection, uses only one exchange pair. If the connection is removed by
closing the exchanges during the channel program, a new exchange pair is required to complete the channel
program. To avoid errors that can occur from lost IUs or due to the inability to send an IU, sufficient resources
need to be held in reserve. This reserve is necessary to support new exchange pairs for unexpected events. If an
IU initiating a new exchange pair is received, sufficient resources must be available to properly receive the IU and
to initiate a new exchange in response (Flam & Guendert, 2011).
The IU pacing protocol has a limitation that the first burst of IUs from the channel to the control unit can
be no larger than a default value of 16. This causes a delay in the execution of channel programs with
more than 16 commands at long distances because a round trip to the control unit is required before
the remainder of the IUs can be sent by the channel, upon the receipt of the first command response, as
allowed by the increased pacing count. Since flow control is adequately addressed by the FC-PH level buffer-
to-buffer crediting function, IU pacing isn’t a flow control mechanism. Instead, IU pacing is a mechanism
intended to prevent I/O operations that might introduce large data transfers from monopolizing access to
Fibre Channel facilities by other concurrent I/O operations.
In essence, IU pacing provides a load-sharing or fair-access mechanism for multiple, competing channel
programs. While this facility yields desirable results, ensuring more predictable I/O response times on
heavily loaded channels, it produces less optimal results for long-distance deployments. In these cases,
increased link latencies can introduce dormant periods on the channel and its Wide Area Network (WAN)
link (Flam & Guendert, 2011). Dormant periods occur when delays waiting for anticipated command
responses increase to the point where the pacing window prohibits the timely execution of CCWs that might
otherwise be executed to ensure optimal performance.
The nominal IU pacing window for 1, 2, 4, 8 and 16 Gbit/sec FICON implementations permits no more than
16 IUs to remain uncredited. Pacing credits can be adjusted dynamically from these values by control unit
requests for specific protocol sequences; however, the channel isn’t bound to honor control unit requests for
larger IU pacing windows.
Persistent IU pacing is a method for allowing FICON channels to retain a pacing count that may be used
at the start of execution of a channel program. This may improve the performance of long I/O programs at
higher link speeds and long distances by allowing the channel to send more IUs to the control unit, thereby
eliminating the delay of waiting for the first command response. The channel retains the pacing count value,
presented by the control unit in accordance with the standard, and uses that pacing count value as its new
default pacing count for any new channel programs issued on the same logical path (Flam & Guendert,
2011).
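The effect of retaining a larger pacing window can be approximated by counting round trips. The model below is an illustration only (names are invented, and frame serialization and control unit service time are ignored).

```python
# Illustrative timing comparison: with the default 16-IU pacing window, a
# long channel program pays roughly one extra round trip for each
# additional window of commands; persistent IU pacing lets the channel
# open with a larger retained window.
import math

US_PER_KM_ONE_WAY = 5.0

def program_link_time_us(ccws, distance_km, pacing_window):
    rtt_us = 2 * distance_km * US_PER_KM_ONE_WAY
    round_trips = math.ceil(ccws / pacing_window)
    return round_trips * rtt_us

# A 64-command channel program at 100 km
print(program_link_time_us(64, 100, 16))  # default pacing: 4000.0
print(program_link_time_us(64, 100, 64))  # persistent pacing: 1000.0
```

In this simplified view, retaining a 64-IU window collapses four round trips into one, which is the source of the long-distance gains attributed to Extended Distance FICON.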
Persistent IU pacing is described by modifications to the following functions specified in FC-SB-3. The
changes relevant to persistent IU pacing are summarized below:
• Establish Logical Path, Logical Path Established, and Status-Parameter Field: The Establish Logical Path
(ELP) function was modified with definitions for optional features for bits in the link-control information
field. In particular, bit one (1) was modified for persistent IU pacing. When bit one of the link-control
information field is set to zero in the ELP IU, persistent IU pacing is not supported. When bit one of the link
control information field is set to one in the ELP IU, persistent IU pacing is supported. This change is also
reflected in the Logical Path Established (LPE) function.
Bytes 2 and 3 of word 0 of the status header (the status-parameter field) were also modified for persistent IU
pacing. This is a 16-bit field that may contain a residual count or IU pacing parameter. If the conditions are
such that neither the residual count nor an IU pacing parameter is to be presented, the control unit shall set
the status-parameter field to zero, and the channel receiving the status DIB shall ignore the status-parameter
field. The modifications made deal with the residual count, and the IU pacing parameter.
• Residual Count: The residual count is a 16-bit unsigned binary number that represents the difference
between the CCW count for a command and the quantity of data transferred either from or to the device
for that command. For write commands, the residual count shall be equal to the difference between the
CCW count of the current CCW for the command and the number of bytes actually written to the device.
For read commands, the residual count shall be the difference between the CCW count for the current
CCW for the command and the number of bytes actually read from the device and transferred to the
channel. The residual count is meaningful only when the residual-count-valid (RV) bit is one.
• Persistent IU Pacing Parameter Valid: When persistent IU pacing is enabled and the residual-count-valid
(RV) bit is set to zero, the persistent IU pacing parameter valid bit is provided in bit 0 of the status-parameter
field. When set to one, this bit indicates that the IU pacing parameter is valid. When set to zero,
this bit indicates that the IU pacing parameter is not valid, is ignored by the channel and shall not
change the persistent pacing credit at the channel for the logical path. This bit is only meaningful when
persistent pacing is enabled.
• IU Pacing Parameter: The IU pacing parameter is an eight-bit value that is carried in the least-significant
byte of the status-parameter field. The use of the IU pacing parameter in the status-parameter field
depends on whether or not persistent IU pacing is enabled. When persistent IU pacing is not enabled,
the IU pacing parameter is provided in the status-parameter field when status is presented for the first
command of a channel program that is executed as an immediate operation or when presenting device
end in order to reconnect when the chaining condition is set (Flam & Guendert, 2011). When persistent
IU pacing is enabled, the residual-count-valid (RV) bit is set to zero and the persistent IU pacing parameter
valid bit is set to one, the IU pacing parameter shall be provided in the status parameter field. The IU
pacing parameter shall be sent by the control unit to indicate to the channel the maximum number of IUs
a channel may send on a given outbound exchange until it receives a command response IU that was
sent because the CRR bit was set to one, on the existing inbound exchange. An IU pacing parameter of
zero indicates that the control unit wishes to leave the default value of the IU pacing credit unchanged.
Each FICON channel provides an IU pacing credit initialized at either the start of each channel
program or at reconnection to continue the execution of a channel program. The IU pacing credit is the
maximum number of IUs a channel may send on a given outbound exchange before it receives a command-
response IU.
A FICON channel may operate in the default IU pacing mode or in the persistent IU pacing mode. A FICON
channel operates in persistent IU pacing mode on each logical path for which both the channel and
control unit indicate support for the persistent IU pacing optional feature. On any logical path for which the
persistent IU pacing optional feature isn’t supported, a FICON channel operates in the default IU pacing
mode. When a FICON channel is operating in default IU pacing mode, the IU pacing credit in effect at the
start of a channel program is set to a model-dependent (DASD array-specific) value no greater than the
default value of 16. When a FICON channel is operating in persistent IU pacing mode, the IU pacing credit is
initialized with the current pacing credit in effect for the logical path [FC-SB-4].
When a control unit is operating in default pacing mode, the control unit may request that the IU pacing
credit be increased by the FICON channel at the start of a channel program or each time the control unit
reconnects with device-end status. At the start of a channel program, the control unit may request that the
IU pacing credit be increased by providing an IU pacing parameter in either the command response or in
the status sent in response to the first command of a channel program. When reconnecting with device-
end status, the control unit may request that the IU pacing credit be increased by providing an IU pacing
parameter in the status Device Information Block (DIB). A modified pacing credit then remains in effect for the
duration of the outbound exchange.
When a control unit is operating in persistent IU pacing mode, the control unit may request that the IU
pacing credit for the logical path be modified at the start of a channel program. Any modification of the
IU pacing credit will take effect at the start of the next channel program on the logical path and persist
until changed again by the control unit, or until a system reset is performed for the logical path (Brocade
Communications Systems, 2011). The control unit may request that the IU pacing credit be modified by
providing an IU pacing parameter in either the command response or in the status sent in response to the
first command of a channel program. A control unit may increase, decrease, or reset the pacing credit to the
default value. A pacing parameter value of zero has the effect of resetting the credit to the default value of
16.
To avoid resetting the pacing count to the default value, the control unit must retain its desired setting and
include this value in the pacing parameter for all command response IUs and status IUs in which the pacing
parameter is valid. This must be done for all device operations on the same logical path.
If the control unit sets the IU pacing parameter to a value less than or equal to the default value, the FICON
channel won’t increase the IU pacing credit above the default value. If the control unit sets the IU pacing
parameter to a value greater than the default value, then the FICON channel may increase the IU pacing
credit by any amount up to the value indicated by the IU pacing parameter [FC-SB-4].
At the start of a channel program or at each reconnection, the FICON channel shall send a model
dependent number of IUs to the control unit. The number of IUs sent shall not exceed the IU pacing credit
value. Prior to or at the last command IU sent, the FICON channel shall request a command response to
be returned by setting the CRR bit in a command or command-data DIB. The selection of the command or
command-data DIB for the setting of the CRR bit shall be such that the remaining IU pacing credit (that is,
the number of additional IUs the FICON channel is allowed to send before it receives a command-response
IU) does not prevent the transmission of all of the IUs for a CCW. For example, if the FICON channel has
not set the CRR bit since the command or command-data DIB for which the last command response was
received and the remaining IU pacing credit is less than the number of IUs required to transfer all of the
data indicated by the CCW count field in a command-data DIB, then the CRR bit shall be set to a one in the
command-data DIB; otherwise, the FICON channel shall be unable to proceed with the channel program.
When a command response is received, it shall indicate which CCW is currently being executed, and
therefore, the number of IUs that have been processed since the start of the channel program or since the
IU for which the previous command response was received. Upon receipt of the command response, the
FICON channel is then permitted to send an additional number of IUs beyond the current remaining credit
equal to the number of IUs indicated as having been processed (Flam & Guendert, 2011).
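The CRR bookkeeping above can be pictured as a small credit account on the channel side. This is an illustrative sketch, not the FC-SB-4 state machine, and capping the replenished credit at the full pacing credit is a simplifying assumption:

```python
class PacingAccount:
    """Toy model of channel-side IU pacing accounting."""

    def __init__(self, credit: int = 16):
        self.credit = credit       # IU pacing credit for the outbound exchange
        self.remaining = credit    # IUs the channel may still send

    def send_iu(self) -> None:
        if self.remaining == 0:
            # The channel must wait for a command-response IU, which is why
            # the CRR bit must be set early enough in the chain.
            raise RuntimeError("pacing credit exhausted; awaiting command response")
        self.remaining -= 1

    def command_response(self, ius_processed: int) -> None:
        # The response reports how many IUs the control unit has processed;
        # the channel may send that many additional IUs (capped, as an
        # assumption, at the full credit value).
        self.remaining = min(self.credit, self.remaining + ius_processed)
```

With a credit of 16, a channel that has sent all 16 IUs is blocked until a command response arrives; a response reporting 16 processed IUs restores the full sending window.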
When a control unit sends a data IU containing a status DIB, the control unit shall discard all command IUs
with the SYR bit set to zero and data IUs that are received subsequent to the IU for which the status was
sent. When a data IU containing a status DIB is received, the FICON channel sets its remaining IU pacing
credit to a value equal to the IU pacing credit for the exchange; the number of IUs a FICON channel is then
permitted to send, including and subsequent to the IU sent in response to the status, is equal to the IU
pacing credit. When an IU that closes the inbound exchange is received, a FICON channel is allowed to
respond to the IU without regard to IU pacing credit.
An IU pacing credit greater than 16 is recommended for link speeds above 1 Gbit/sec. For distances of up
to 100 km, the IU pacing credit should double as the link speed doubles above 1 Gbit/sec.
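This doubling rule of thumb can be expressed as a simple helper; the function name and the assumption that doubling continues linearly beyond the quoted speeds are illustrative, not part of the standard:

```python
def recommended_pacing_credit(link_gbps: float, base_credit: int = 16) -> int:
    """Rule of thumb from the text: for distances up to ~100 km, double the
    IU pacing credit each time the link speed doubles beyond 1 Gbit/sec."""
    credit = base_credit
    speed = 1.0
    while speed * 2 <= link_gbps:
        speed *= 2
        credit *= 2
    return credit
```

So a 2 Gbps link would call for a credit of 32, a 4 Gbps link for 64, and an 8 Gbps link for 128.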
IBM z/OS Global Mirror for IBM Z (zGM) operates using two basic types of commands (Flam & Guendert,
2011):
• Read Track Sets (RTSs) are used to read entire tracks during a synchronization phase.
• Read Record Sets (RRSs) are used to read records that have been modified on the primary DASD.
In zGM applications, high performance depends on the efficient execution of lengthy chains of RRS
channel programs. Since these channel programs can ultimately chain hundreds of RRS commands
together, IU pacing may impose unnecessary performance constraints on long-distance and high-latency
WANs. In these applications, the outbound flow of commands or command-data IUs to the device may be
interrupted, due to delays in the timely receipt of required command response IUs from the device.
The Brocade z/OS Global Mirror emulation process serves only this uniquely formatted channel program.
Within this single channel program type, it seeks only to alleviate the dormancy that may be introduced
by the effect of IU pacing over long distances and increased latency. Latency is introduced by several
mechanisms, including intermediate buffering delays, signal propagation delays, and, finally, bandwidth
restrictions imposed by WAN links with lower bandwidth than the Fibre Channel media (one or two gigabits).
Additionally, since the IU pacing mechanism has no means of predicting the actual number of data bytes
solicited by input (RRS) operations, the resulting small inbound IUs increase the likelihood of periods
of link dormancy.
Brocade zGM emulation yields performance increases over non-emulating protocols based on the
combination of all of the factors already mentioned. It is important to appreciate the impact of these factors,
in order to adequately assess the impact of zGM emulation. Methods of developing such information prior
to deployment include data reduction of traces of the system data mover’s active channel programs with
primary volumes. Additionally, information about the average propagation delay and total average bandwidth
must be obtained from the WAN link provider.
In order to alleviate the excessive latency introduced by IU pacing across a WAN link, a zGM RRS emulation
algorithm is implemented. RRS channel programs are differentiated from other channel programs via
a unique prefixed Define Subsystem Operation (DSO) command-data IU. Information
contained in the data buffer associated with this command carries both the number and the buffer size
of all subsequent RRS Command IUs. Only channel programs associated with this unique DSO command
are subject to the emulation process. All other channel programs and their associated information units
are shuttled across the WAN link without any further processing. Thus, the emulation issue addresses only
the performance issues associated with the Channel Command Words (CCWs) that are involved in RRS
operations. All other supported functions are subject to the effects of WAN link latency to some degree
(Brocade Communications Systems, 2011).
The device emulation employed in Brocade FICON extension products enables the solution to meet these
objectives. The actual emulation results in entire channel programs being executed on both the local
and remote ends of the network and eliminates IU pacing from the emulated channel sequences. The
resulting operations localize many exchanges that would otherwise have gone end-to-end between the host
and control unit. This is not indiscriminate emulation, but rather a well-planned set of emulation rules which
protect the integrity of the channel program.
z/OS Global Mirror (formerly XRC) configurations such as that shown in figure 12 (Brocade Communications
Systems, 2011), typically use channel extension technology between the remote host and the primary
controller. The channel extension technology typically uses advanced emulation/acceleration techniques
to analyze the CCW chains and modify them to perform local acknowledgements. This avoids throughput
performance degradation (commonly known as data droop), which is caused by extra round trips in the
long distance environment. This technology is typically an additional software license added to the channel
extension hardware and may cost extra. Persistent IU pacing produces performance results similar to XRC
emulation at long distances. It does not extend the distance supported by FICON, but it can provide the
same benefits as XRC emulation. In other words, there may be no need to have XRC emulation running on
the channel extension hardware.
Figure 12 shows persistent IU pacing performance comparisons for a sequential write workload. The
workload consists of 64 jobs performing 4KB sequential writes to 64 data sets with 1,113 cylinders each,
which all reside on one large disk volume. There are four enhanced System Data Movers (SDM) configured.
When XRC emulation is turned off, performance drops significantly, especially at longer distances.
However, after the persistent IU pacing functionality is installed on the DASD array, performance nearly
returns to where it was with XRC emulation on.
Figure 12: Persistent IU pacing performance comparison for the sequential write workload, contrasting 3200 km with Brocade emulation off against 3200 km with Brocade emulation off plus persistent IU pacing.
This has been achieved for the z/OS Global Mirror SDM “Read Record Set” (RRS) data transfer from a DASD
array that supports persistent IU pacing to the SDM host address space. The control units that support
persistent IU pacing can increase the pacing count. The pacing count is the number of IUs “in flight” from
a channel to a control unit. FICON supports IU pacing of 16 IUs in flight. Persistent IU pacing increases this
limitation for the RRS CCW chain to permit 255 IUs in flight without waiting for an acknowledgement from
the control unit, eliminating handshakes between the channel and control unit. Improved IU pacing, with
255 IUs in flight instead of 16, enhances utilization of the link.
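The benefit of raising the in-flight limit from 16 to 255 IUs can be approximated with a back-of-the-envelope throughput ceiling. The 2 KB IU payload and the roughly 5 microseconds per kilometer propagation figure below are illustrative assumptions, not measured values:

```python
def pacing_throughput_ceiling_mb_per_s(distance_km: float,
                                       ius_in_flight: int,
                                       iu_bytes: int = 2048) -> float:
    """Upper bound on per-exchange throughput when the channel may have at
    most ius_in_flight IUs outstanding before waiting one WAN round trip."""
    rtt_s = 2 * distance_km * 5e-6   # ~5 microseconds per km, one way
    return ius_in_flight * iu_bytes / rtt_s / 1e6

# At 3200 km, the 16-IU default caps per-exchange throughput near 1 MB/s,
# while 255 IUs lifts that ceiling roughly 16-fold.
```

This is why the 3200 km test cases show such a pronounced gap between the default pacing and persistent pacing results.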
High performance for z/OS Global Mirror can be achieved by ensuring that the storage subsystems have
adequate resources, including cache, NVS, and path connections to handle the peak workload that is
associated with remote copy sessions (and coexisting concurrent copy operations). The System Data Mover
(SDM) software requires adequate processor MIPS, multiple processors, and real storage to accomplish its
function in an efficient and timely fashion. The SDM, without adequate resources, may not be able to drain
the storage control cache rapidly enough under peak workload conditions. Based on the workload, if this
continues for an extended period of time, the cache could become overcommitted, and ultimately this could
affect the performance of your primary systems.
When a z/OS Global Mirror solution is implemented over an extended distance, the entire solution must
be architected to accommodate the worst-case I/O scenarios. The solution needs to be able to handle
end-of-month and end-of-quarter processing loads. It must be able to handle both large I/Os
(half-track or full-track updates) and online transaction processing, which is characterized by much
smaller I/Os. Traditionally, it is the periods of high volumes of small updates that cause z/OS Global
Mirror performance issues. Figure 12 is intended to represent a high volume of OLTP-type transactions and is
important because it reflects a worst-case scenario (Flam & Guendert, 2011).
Each test scenario depicted in the figure shows three data points: Brocade emulation on, Brocade
emulation off, and Brocade emulation off with pacing. The test cases with Brocade emulation off reflect
the performance of z/OS Global Mirror without “Extended Distance FICON” and without Brocade emulation
enabled. One can see that network latency has a significant impact on I/O rates. Two other observations
should be made (Flam & Guendert, 2011):
1. Prior to the release of the “Extended Distance FICON” feature, z/OS Global Mirror required emulation in
the network to function acceptably when there was moderate to high network latency. With the increased
I/O rate that “Extended Distance FICON” provides with Global Mirror, a customer may not need the Brocade
emulation feature. The I/O rate increase that “Extended Distance FICON” provides may be sufficient
for the customer’s production environment.
2. With the “Extended Distance FICON” feature and the Brocade emulation feature combined, the peak I/O
rate that z/OS can sustain can be increased over what can be achieved with just the “Extended Distance
FICON” feature and no Brocade emulation. The Brocade emulation feature can help a customer address
the worst-case I/O rate scenarios.
Persistent IU pacing, also known as “Extended Distance FICON,” does not increase the distance support for FICON.
Rather, it enhances the performance of FICON traffic going over long distance by increasing the number of
information units (IUs) that can be in flight between a channel and control unit, and allowing that increased
number of IUs to persist for longer than one exchange. This performance enhancement is intended primarily
for use with z/OS Global Mirror (formerly known as XRC). This technology does not alter the number of
buffer credits needed to support long-distance FICON data transmission. Hence,
it does not change the need for FICON directors and/or channel extension hardware used for multi-site
disaster recovery/business continuity architectures. However, persistent IU pacing can eliminate the need
for XRC emulation technology on the channel extension hardware and this may reduce the cost of this
hardware.
Conclusion
This chapter gave an overview of remote data replication (RDR) for DASD in mainframe environments. We
discussed the basic modes of DASD RDR (synchronous and asynchronous) as well as the methods of DASD
RDR (host-based and DASD array-based). We then discussed DASD vendor specifics, with a more detailed
focus on IBM technology since that is the basis for most of the mainframe DASD RDR used by all DASD
vendors. Finally, we discussed the new Extended Distance FICON functionality, which is also known as
persistent IU pacing. We discussed the Brocade XRC emulation technology, and finished the chapter with a
comparison of persistent IU pacing and Brocade XRC emulation.
References
Artis, H., & Guendert, S. (2006). Designing and Managing FICON Interswitch Link Infrastructures.
Proceedings of the 2006 Computer Measurement Group. Turnersville, NJ: The Computer Measurement
Group, Inc.
Brocade Communications Systems. (2011). Technical Brief: FICON Over Extended Distances and IU Pacing.
San Jose, CA, USA: Brocade Communications Systems.
Cronauer, P., Dufrasne, B., Altmannsberger, T., & Burger, C. (2013). IBM System Storage DS8000 Copy
Services for IBM Z Systems. Poughkeepsie, NY: IBM Redbooks.
Dufrasne, B., Bauer, W., Brunson, S., Coelho, A., Kimmel, P., & Tondini, R. (2017). IBM DS8880 and IBM z
Systems Synergy. Poughkeepsie, NY: IBM Redbooks.
EMC Education Services. (2012). Information Storage and Management: Storing, Managing, and Protecting
Digital Information in Classic, Virtualized, and Cloud Environments (2nd ed.). Indianapolis,
IN: Wiley.
Flam, G., & Guendert, S. (2011). Demystifying Extended Distance FICON AKA Persistent IU Pacing.
Proceedings of the 2011 International CMG. Turnersville, NJ: The Computer Measurement Group, Inc.
Flam, G., & Guendert, S. (2011, Dec). z/OS Global Mirror Emulation vs. Persistent IU Pacing. zJournal.
Guendert, S. (2007). Understanding the Performance Implications of Buffer to Buffer Credit Starvation
in a FICON Environment: Frame Pacing Delay. Proceedings of the 2007 Computer Measurement Group.
Turnersville, NJ: The Computer Measurement Group, Inc.
Guendert, S. (2013, Winter). Fibre Channel over Internet Protocol Basics for Mainframers. Enterprise Tech
Journal, pp. 40-46.
Hitachi Data Systems. (2015). Hitachi Business Continuity Manager User Guide. Hitachi Data Systems.
Hitachi Data Systems. (2016). Hitachi Virtual Storage Platform Hitachi TrueCopy for Mainframe User Guide.
Hitachi Data Systems.
Hitachi Data Systems. (2016). Hitachi Virtual Storage Platform Hitachi Universal Replicator for Mainframe
User Guide. Hitachi Data Systems.
Kern, R., & Peltz, V. (2010). IBM Storage Infrastructure for Business Continuity. Poughkeepsie, NY:
IBM Redbooks.
Negro, T., & Pendle, P. (2013). EMC Mainframe Technology Overview. Hopkinton, MA: EMC Corporation.
Remote Data
Replication for Tape
This chapter continues our discussion of remote data
replication; however, it focuses exclusively on remote
replication for mainframe tape technologies. The
discussion includes both native tape and virtual tape.
The chapter starts with some general concepts and
a discussion of the Brocade Advanced Accelerator
for FICON’s mainframe tape read and write pipelining
technology. The second part of the chapter briefly
looks at the different storage vendor native tape and
virtual tape mainframe offerings and examines remote
replication specifics for those offerings.
PART 3 Chapter 3: Remote Data Replication for Tape
Introduction
For years, enterprise IT has been faced with a variety of challenges that are almost mutually exclusive:
support and enable business growth while reducing costs, increase data protection while improving global
data access, and meet shorter recovery objectives while managing growing amounts of data, to name a few.
To meet these and other business objectives, IT personnel require new approaches, new technologies, and
new infrastructure architectures. Many are exploring the use of or are currently using electronic remote tape
backup and remote virtual tape as a cost-effective approach for improved disaster protection and long-term
data retention. However, given the sequential nature of tape operations, can tape be effectively extended
over distance (Larsen, 2008)? This chapter focuses on the mainframe enterprise and leveraging of FICON-
attached tape media over long distance to meet data retention and Disaster Recovery (DR) needs. It
explores the business requirements, technological challenges, techniques to overcome distance constraints,
and considerations for solution design.
Every company in the world has information that keeps its business afloat, whether it is billing information,
order information, assembly part numbers, or financial records. Regardless of its size, every company
has critical information that keeps its business in operation. The data that is deemed critical varies from
company to company, but most agree that information on an employee’s workstation is less critical than the
data stored in a DB2 application. It is that ranking of criticality to application and data that determines the
most cost-effective means to protect it.
There are many different methods that can be applied to protect information in offsite locations. These
range from physically moving tape cartridges to a secure location to synchronized disk replication over
distance for immediate failover of operations. Depending on what Recovery Point Objective (RPO) and
Recovery Time Objective (RTO) is applied to specific information, different solutions can be implemented to
meet those objectives and contain costs for items such as additional equipment and Wide Area Network
(WAN) charges.
RTO and RPO assignments can help in classifying applications or data to a level of protection and solution
type to use (Larsen, 2008). Figure 1 shows the guidelines used by a large financial institution to help it
decide on how to best protect its various applications. The institution determined that five classes of data
existed and that each classification would use a certain type of recovery or Business Continuity method.
Once an application was classified, the institution did not have to revisit which protection method to use.
Figure 1: Recovery classification guidelines used by a large financial institution, mapping classes of data to protection methods along a timeline spanning weeks before to weeks after a disaster event (tape backup, asynchronous replication, synchronous replication, high availability, hot site):
• Class 2: Mirroring (cold site)
• Class 3: Less than 24 hours / 24 hours to 72 hours; remote tape vaulting
• Class 4: 24 hours to 48 hours / 3 days to 7 days; local tape vaulting
• Class 5: More than 48 hours / more than 7 days; non-mission-critical data
In addition to Disaster Recovery needs, some businesses have specific requirements that must be met for
retaining information. For instance, consider a company located in the Central U.S. that does credit card
processing and billing for several large financial institutions and is mandated to retain records for no less
than seven years. This company determined that tape was the most cost-effective media to use. It had three
processing centers located as far as 1,800 miles from one another and wanted to electronically move tape
traffic to its main data center, where all the media management could take place. Its application processing
was done with a mainframe host in each location and utilized a FICON-based virtual tape subsystem that
extended the physical tape attachment over a WAN to its main processing center. Even though this was
not deployed for Disaster Recovery, it did provide DR benefits, such as reduced risk of lost tapes due to
handling. It also allowed centralized tape management and provided immediate access to tape data that
was as current as the most recent backup.
• Tier 1: Tier 1 users send the tapes to a warehouse or “cold” site for storage.
• Tier 2: Tier 2 users send the tapes to a “hot” site where the tapes can be quickly restored in the event of
a disaster.
Various schemes have been developed to improve the process of offloading data nightly from production
sites and production volumes to tape. Some of these solutions provide full integration with various
databases (Oracle, DB2, UDB, and so forth), as well as applications, for example, SAP. Numerous names
have been created to describe these off-line backup solutions, including Serverless backup, LAN-free
backup, Split Mirroring, and SnapShot.
• The first method is to write directly from a primary site to a tape storage subsystem located at a remote
secondary site.
• A second method is introduced with the TS7700 Grid Tape replication solution, or similar offerings from
Oracle, EMC, and Luminex. With IBM tape replication, a host writes data to a tape locally in the primary
site. The TS7700 Grid hardware then transmits the data to a second TS7700, located at the secondary
site. The TS7700 actually supports more than one target location in a grid. Various options are provided
as to when the actual tape copy takes place, for example, during tape rewind/unload or asynchronously
after rewind/unload (Kern & Peltz, 2010).
FICON-Capable Tape
Mainframe tape options vary from vendor to vendor, but for the most part there are a variety of standalone
drives, automated libraries, or virtual tape subsystems that leverage internal disk to house tape data before
actually exporting to real physical tape media. Selecting the right product for operational needs is a matter
of fitting the requirements of the enterprise.
Since all the drives use the FICON standard, techniques needed for long-distance deployment have been
tested with the major vendors, so interoperability issues should be minimal. The FICON standard has
evolved along with the Fibre Channel standard, because the two use the same lower-level protocols at
Layer 1 and Layer 2 of the protocol stack. In addition to the advancements of FICON/FC port speeds, there
has been a move toward using IP as the transport of choice for long-distance connectivity. FICON over IP
is based on the Fibre Channel over IP (FCIP) standard but is uniquely modified to support the mainframe
channel protocol.
Write operations can be expected to comprise the larger percentage of I/O operations for tape devices
(for archival purposes), but given today’s requirement for data retrieval and shorter recovery times,
read operations are becoming more and more critical. Both write and read operations are necessary
for standalone tape extension or remote virtual tape solutions. Virtual tape configurations have a
peer-to-peer relationship between the virtual tape controllers and replicate data between them. Peer-to-peer
deployments can also be clustered, which requires host access to both the primary and secondary virtual
controllers. In a clustered virtual tape configuration, write and read operations work together to enhance
data access and improve flexibility of system administration. A clustered configuration can take advantage
of a primary and secondary subsystem failover, allowing for more effective management of maintenance
windows. Without both write and read performance enhancements, users cannot leverage the failover
capability and maintain operational performance (Larsen, 2008).
While these end-to-end exchanges may be considered trivial in native FICON-attached tape
implementations, they can become a significant impediment to acceptable I/O access times and bandwidth
utilization for WAN supported FICON configurations. In addition to the command-data-status exchange
required to support tape operations, the effect of IU pacing may also introduce additional undesirable
delays in FICON-attached tape devices accessed through WAN facilities, particularly for tape write
operations, where outbound data frames figure significantly in the IU pacing algorithm. The Brocade suite
of emulation and pipelining functions reduces the undesirable effects of latency on these exchanges and
improves overall performance for WAN-extended FICON-attached tape and virtual tape devices (Brocade
Communications Systems, 2017).
Tape pipelining refers to the concept of maintaining a series of I/O operations across a host-WAN-device
environment, and should not be confused with the normal FICON streaming of CCWs and data in a single
command chain. Normally, tape access methods can be expected to read data sequentially until they reach
end-of-file delimiters (tape marks) or to write data sequentially until either the data set is closed or an
end-of-tape condition occurs (multivolume file). The tape pipelining/emulation design strategy attempts to
optimize performance for sequential reads and writes while accommodating any non-conforming conditions
in a lower performance non-emulating frame shuttle.
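A rough model shows why pipelining writes matters so much over a WAN. This is purely illustrative; the local service time and the propagation figure are assumed values, not Brocade specifications:

```python
def tape_write_throughput_mb_per_s(block_kb: int, distance_km: float,
                                   pipelined: bool,
                                   local_service_us: float = 500.0) -> float:
    """In shuttle mode each block's status crosses the WAN, so every block
    pays one round trip; with write pipelining, status is emulated locally,
    so only the assumed local service time gates each block."""
    rtt_s = 2 * distance_km * 5e-6           # ~5 microseconds per km, one way
    per_block_s = local_service_us * 1e-6
    if not pipelined:
        per_block_s += rtt_s                 # add a WAN round trip per block
    return block_kb * 1024 / per_block_s / 1e6
```

At 32 KB blocks over 1000 km, shuttle mode is gated near 3 MB/s by the round trip, while the pipelined case is gated only by local service time and, in practice, by the available WAN bandwidth.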
Tape emulation features exist as an extension of the previously developed zGM (XRC) emulation functions,
and much of the existing zGM emulation infrastructure has been exploited. In fact, it is possible for the code
to concurrently support both the zGM emulation functions and the tape emulation functions, provided their
separate control units are dynamically switched. As with zGM emulation, the tape emulation techniques
apply to the Fibre Channel frame level and not to the buffer or command level. As with zGM emulation,
the only parts of the tape access to be emulated are those associated with standard writes and reads
and possible accompanying mode-sets or no-ops, which constitute the bulk of the performance path for
tape operations. All other control functions are not emulated, and the emulation process provides no
surrogate control unit or device image to the host and no surrogate host image to the device (Brocade
Communications Systems, 2012).
Pipelining does remove the latency of the network; however, application performance is still a matter of having
the bandwidth necessary to move the required data. When going through the design phase of a remote tape
solution, proper planning and research need to be done to design the right solution for today and tomorrow. It
is highly recommended to work with the storage and network infrastructure vendors to help balance business
objectives, performance, scalability, and cost (Larsen, 2008). The following items should be discussed
throughout the planning cycle:
First, tests were run against a single extended IBM Virtual Tape Controller (VTC) with one 32 KB block size
tape job, using Brocade Extension technology. Read and write pipelining performance versus shuttle mode
performance was captured for various link latencies. The results show that read emulation performance
is lower than that of write operations and is limited by the VTC. However, the performance improvement
that FICON emulation provides relative to shuttle mode is pronounced for both reads and writes (Brocade
Communications Systems, 2017).
In the second test, read pipelining performance was measured using eight concurrent tape jobs.
The results show that read pipelining improves throughput by anywhere from 25 percent to as much as 10x
over long distances. Even without emulation, better utilization of the WAN link is achieved when scaling the
number of VTCs and concurrent tape jobs. When there is only one (or there are a few) tape jobs, FICON read
emulation is critical, in order to obtain better utilization of the WAN link and higher throughput on the tape
job (Brocade Communications Systems, 2017).
The Brocade FICON extension feature set includes emulation support for the IBM Virtual Tape Server (VTS),
TS7700 Virtualization Engine, the Oracle Virtual Storage Manager (VSM), EMC DLm, and read and write
pipelining for extended FICON tape operations (Brocade Communications Systems, 2012).
A wide range of configurations are supported for flexible Disaster Recovery implementations, which can be
deployed in a variety of switch environments, including FICON Cascade. These configurations include:
Figure: A host connected across IP extension links to a remote tape controller and virtual tape library, with optional additional hosts.
The System Storage Tape Controller for IBM Z offers IBM FICON attachments of TS11xx tape drives in
a TS4500 Tape Library or rack and reduces hardware and infrastructure requirements by sharing tape
drives with FICON hosts. It can also add tape drives non-disruptively, which helps enhance configuration
flexibility and availability. The TS1155 tape drive supports the System Storage TS4500 Tape Library and
IBM racks that enable standalone installation. TS1155 features three options for type D media. The IBM 3592
Advanced data tape cartridge, JD, provides up to 15 TB native capacity, as does the IBM 3592 Advanced
WORM cartridge, JZ. A type D 3592 Economy cartridge, JL, can also be up-formatted from its nominal 2 TB
capacity, storing 3 TB and providing fast access to data. TS1155 can also read and write previous-generation
media—type C (JC, JY, and JK)—depending on media formatting. The drive supports a native data transfer
rate of up to 360 MBps to deliver information at the speed demanded by cloud, mobile and social media
users. TS1155 also helps simplify big-data management by transferring data at up to 700 MBps with data
compression.
The built-in data encryption capabilities of TS1155 enable organizations to reduce the need for
host-based encryption (and the concurrent drain on performance) or reduce the use of specialized encryption
appliances. This capability is intended to protect information if tape cartridges are lost or stolen by
supporting the storage of the data in an encrypted form (IBM, 2013).
To meet the high-performance needs of your enterprise data center, each StorageTek SL8500 library is
equipped with four or eight robots working in parallel to provide a multithreaded solution. This reduces
queuing, especially during peak work periods. As the system scales, each additional StorageTek SL8500
library added to the aggregate system comes equipped with more robotics, so the performance can
scale to stay ahead of your requirements as they grow. Additionally, with the StorageTek SL8500 modular
library system’s unique centerline architecture, drives are kept in the center of the library, alleviating robot
contention (Oracle, 2015).
The StorageTek SL8500 modular library system can be shared across heterogeneous environments, such
as Oracle’s Solaris, Linux, AS/400, mainframe, UNIX, and Windows NT environments, so you can easily
match the library configuration to your backup requirements. Also, with Oracle’s StorageTek Virtual Storage
Manager (VSM), you can use your library in a virtual enterprise mainframe environment.
The StorageTek T10000D tape drive delivers a potent combination: native capacity of up to 8.5 TB and
a maximum compressed data rate of up to 800 megabytes per second (MB/s) (up to 252 MB/s with 1:1
compression). Because the StorageTek T10000D tape drive offers both 16 gigabit-per-second (Gbps)
Fibre Channel and FICON connectivity, you can easily transition as network architectures change.
Below is a list of benefits/advantages of VTLs that have led to their adoption in most large mainframe
sites:
• Requirement for fewer physical tape drives
• Smaller tape libraries (fewer physical drives = fewer slots in the library)
• More efficient usage of tape media
• Lower energy requirements and costs
• Lower storage management costs
• Improved performance (data transfer to disk)
• Ease of migration to new tape drive technology as it becomes available
• Applications that can support legacy tape formats on state-of-the-art tape drive technology
• No waiting for tape drives due to the large pool of virtual tape drives
Physical tape is an ever-increasing concern in mainframe data centers, where IT resources and budgets
are limited, and DR is a vital concern. The challenges with physical tape include the delays and costs
associated with physical tape media, storage, offsite shipping, end-of-life support, and drive and library
maintenance. All of these issues lead many businesses to search for a virtual tape solution that provides
better performance and reliability at a lower Total Cost of Ownership (TCO). There are four primary vendors
for mainframe virtual tape: EMC, Luminex, IBM, and Oracle (StorageTek).
Dell EMC Mainframe Virtual Tape: Disk Library for Mainframe (DLm)
The Dell EMC Disk Library for mainframe family of products offers IBM Z mainframe customers the ability
to replace their physical tape systems. DLm8100 is the Dell EMC enterprise virtual tape solution for
replacing physical tape in the mainframe environment, and it is unique in its support of true synchronous
replication to a secondary site. The DLm8100 comes with RAID 6 protected disk storage, hot-standby disks,
tape emulation, hardware compression, and Symmetrix Remote Data Facility (SRDF) replication capabilities.
The DLm8100 is EMC’s fourth-generation Disk Library for mainframe systems (EMC Corporation, 2017).
EMC Disk Library for mainframe systems include one or more Virtual Tape Engines (VTEs) to perform the
tape emulation operations, mainframe channel connectivity, and a variety of internal switches, as well as
back-end disk storage. Depending on the DLm system model and configuration, the back-end storage can
consist of either primary storage, deduplication storage, or both, to store the actual tape volumes. The base
components of the DLm all reside in a single cabinet, and additional cabinets are configured depending on
storage capacity requirements. The DLm8100 may be configured with two to eight VTEs, depending on the
required number of drives, scale, throughput, and overall system performance requirements.
The DLm8100 is available with the following product features (EMC Corporation, 2017):
The DLm8100 incorporates the latest EMC virtual tape emulation software (EMC Virtuent 7). Virtuent is a tape-
on-disk software application that runs on a base hardware controller, which provides two FICON connections
to the mainframe. The Virtuent software provides the controller emulation supporting model 3480, 3490, or
3590 tape drives.
The DLm8100 VTE appears to the mainframe operating system as a set of standard IBM tape drives. The
mainframe manages this set of tape drives as one or more virtual tape libraries. Existing mainframe software
applications use the virtual drives of the VTE (IBM 3480, 3490, and 3590 drive types) just as they would any
mainframe-supported physical tape drives. No application modifications are required to integrate them into
the existing mainframe environment. The VTEs are connected to the mainframe host via FICON channels.
Each VTE includes two FICON channels for connection to the mainframe host. The VTEC cabinet can support
from two to eight VTEs for a maximum of 2.7 GB/second throughput, 2048 virtual tape drives, and 16 FICON
channel attachments. Each VTE can support up to 256 total virtual tape drives across a total of 64 active
Logical Partitions (LPARs) (EMC Corporation, 2017).
The DLm8100 uses the EMC VMAX enterprise-class storage as its storage system, and it inherits the
features and functions of the VMAX platform.
The DLm8100 VTEs process the arriving mainframe tape volume and write it as a single file on the DLm
storage. Each mainframe tape is stored as a single file whose filename matches the tape VOLSER. The file
consumes only as much space as is required to store it (no wasted storage space). This allows the virtual tape
to be easily located and mounted in response to read or write requests. All disk drives within the DLm8100 are
protected with a RAID 6 configuration and hot spare drives for each RAID group.
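The tape-volume-to-file mapping described above can be illustrated with a toy sketch (the class and method names below are hypothetical; Virtuent's actual implementation is proprietary):

```python
import os
import tempfile

class VirtualTapeStore:
    """Toy model of tape-on-disk storage: each mainframe tape volume is
    kept as a single file whose name matches the tape VOLSER."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, volser):
        # VOLSERs are six characters on IBM mainframes; normalize case.
        if len(volser) != 6:
            raise ValueError("VOLSER must be exactly 6 characters")
        return os.path.join(self.root, volser.upper())

    def write(self, volser, data):
        # The file consumes only as much space as the data requires.
        with open(self._path(volser), "wb") as f:
            f.write(data)

    def mount(self, volser):
        # A "mount" is simply locating and reading the file by VOLSER.
        with open(self._path(volser), "rb") as f:
            return f.read()

store = VirtualTapeStore(tempfile.mkdtemp())
store.write("A00001", b"backup data")
print(store.mount("A00001"))  # b'backup data'
```

Because the filename is the VOLSER, locating a virtual tape for a mount request is a direct lookup rather than a library search.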
The DLm can be managed using the IBM Z DFSMS functionality, and it supports all CCWs for tape. Therefore,
DFHSM, backups, and other client applications continue to work without any changes being necessary. These
operations are also no longer dependent on a specific tape drive range, and tape processing is done at disk
speed. This reduces the time it takes for recycle/recall operations to complete.
The DLm8100 utilizes SRDF for replication of virtual tape. The DLm8100 supports Disaster Recovery solutions
using SRDF/Synchronous (SRDF/S) and SRDF/Asynchronous (SRDF/A). The SRDF/S option maintains a real-
time mirror image of data between the VMAX arrays. Data must be successfully stored in VMAX cache at both
the primary and secondary sites before an acknowledgement is sent to the production host at the primary site.
The SRDF/A option mirrors primary site data by maintaining a dependent-write consistent copy of the data on
the secondary site at all times. SRDF/A session data is transferred from the primary site to the secondary site
in cycles. The point-in-time copy of the data at the secondary site is only slightly behind that on the primary site.
SRDF/A has little or no impact on performance at the primary site, as long as the SRDF links contain sufficient
bandwidth and the secondary system is capable of accepting the data as quickly as it is sent across the SRDF
links (Negro & Pendle, 2013).
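Conceptually, the two SRDF modes differ in when the host write is acknowledged. Below is a minimal sketch of that difference, not EMC's implementation, with simple in-memory lists standing in for the VMAX caches:

```python
class SyncReplica:
    """SRDF/S-like behavior: a write is acknowledged to the host only
    after the data is held at both the primary and secondary sites."""
    def __init__(self):
        self.primary, self.secondary = [], []

    def write(self, block):
        self.primary.append(block)
        self.secondary.append(block)  # must land remotely first...
        return "ack"                  # ...then the host is acknowledged

class AsyncReplica:
    """SRDF/A-like behavior: writes are acknowledged immediately and
    shipped to the secondary in dependent-write-consistent cycles."""
    def __init__(self):
        self.primary, self.secondary, self.cycle = [], [], []

    def write(self, block):
        self.primary.append(block)
        self.cycle.append(block)      # buffered for the next cycle
        return "ack"                  # the host is not delayed

    def switch_cycle(self):
        # Transfer one consistent point-in-time cycle to the secondary;
        # the secondary stays only slightly behind the primary.
        self.secondary.extend(self.cycle)
        self.cycle = []

a = AsyncReplica()
a.write("t1"); a.write("t2")
print(a.secondary)   # [] -- lags until the cycle switches
a.switch_cycle()
print(a.secondary)   # ['t1', 't2']
```

The sketch shows why SRDF/S response time is gated by the remote link while SRDF/A is not, at the cost of a small, bounded lag at the secondary.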
Finally, starting with release 4.5, DLm8100 has been enhanced with Dell EMC GDDR (Geographically
Dispersed Disaster Restart) technology to automate the failover of a DLm system to an alternate site in the
event of a site disaster. GDDR will be covered in more detail in the next chapter (Chapter 4).
Luminex Mainframe Virtual Tape: Channel Gateway X (CGX)
CGX presents terabytes to petabytes of DASD storage as a range of standard mainframe tape devices. CGX
is application-transparent, requiring no changes to existing tape applications. Using CGX, all mainframe tape
data is stored and secured using the disk system’s RAID data protection technology. There are also additional
optional encryption features available on CGX. No tape data is actually stored on CGX; the data simply passes
through it and is stored as disk files, while retaining the original tape VOLSER numbers. Tape data is sent to
and from the IBM Z mainframe through the use of virtual tape drives or devices. Each virtual tape drive can
logically mount a virtual tape. Once mounted, the mainframe is in control of the operations performed on the
device. Data written by the mainframe is received by the CGX’s virtual tape drives and written to the attached
DASD subsystem. When data is requested by the mainframe, the data is retrieved from the storage subsystem
and presented to the mainframe. Luminex MVTe includes multiple CGX virtual tape control units, which
essentially function as a virtual tape library and replication control mechanism.
As data centers make the transition to tapeless environments, the traditional method of cataloging, packing,
shipping, and warehousing physical tapes is being replaced with more efficient and cost-effective IP-based
replication to a remote DR or bunker site. Luminex Replication operates over both IP and Fibre Channel
connections to one or more remote locations and only sends compressed, block-level changes to minimize
network bandwidth requirements. Luminex Replication also provides a comprehensive set of features
and options to further enhance data protection and Disaster Recovery processes for tapeless mainframe
data centers. In contrast to shipping physical tapes off site on a daily or weekly basis, Luminex Replication
continuously transmits tape data to the remote DR site, ensuring that even the most recently written data
is protected and available there. Enterprises can now meet and improve backup windows and Service Level
Agreements (SLAs), as well as RPO and RTO requirements—with minimal operator intervention and without
the costs and risks of handing off valuable data to a third party for transit or vaulting. With tape data always
available at DR, testing times are reduced, allowing for more frequent and/or more extensive testing events.
Additionally, recovery and testing success is improved, since there is no risk of missing, mislabeled, or
damaged physical tapes. Flexible policy configurations allow selective replication at the tape range and/or
device range (logical library) level, so only the required data is replicated. For enterprises that require a third
copy of tape data, multisite replication provides the ability to cascade replication from DR to a third mainframe
data center or bunker site automatically without operator intervention (Luminex Software, Inc, 2013).
Luminex also offers synchronous replication through their SyncCopy option. Luminex SyncCopy enforces tape
data consistency across replicated storage systems. Local write operations complete at the primary mainframe
only when the replicated copy is validated to be available at the secondary location. This ensures that data
centers keep their tape catalog synchronized at a remote location, the tape catalog is current, and the tape
data is consistent and usable if needed.
Figure 8 (Hitachi Data Systems, 2013) shows a virtualized storage environment with Hitachi Unified Storage
(HUS) behind VSP. Replication is provided by Hitachi Universal Replicator (HUR) for both primary disk and
virtual tape data.
Hitachi Data Systems mainframe virtual tape solutions immediately replicate virtual tape data to remote
Disaster Recovery sites, eliminating delays related to shipping and lost or damaged tapes. Both Luminex
Replication and Hitachi Universal Replicator provide vastly improved Disaster Recovery plans and
preparedness, faster, more convenient Disaster Recovery testing, and shorter backup windows, which
enable more frequent full-volume backups.
Hitachi Virtual Storage Platform for DASD adds the ability to replicate mainframe disk and tape data with
HUR, providing mainframe data consistency for Disaster Recovery. And Luminex RepMon provides real-time
status monitoring and logging for mainframe virtual tape replication at the VOLSER level, for both Luminex
Replication and HUR.
Oracle StorageTek Virtual Storage Manager (VSM) 7
1. VTSS hardware and software: The VSM 7 VTSS supports emulated tape connectivity over 16 Gbps FICON
interfaces to IBM z Systems hosts, and TCP/IP attachment to other VTSSs and VLEs. The 16 Gbps FICON
ports are provided by two dual-port Host Bus Adapters (HBAs) per node. There are four ports per node,
providing a total of eight 16 Gbps FICON ports for the VSM 7 VTSS. Each FICON port supports IBM Control
Unit (CU) and IBM Channel Mode (CH) images concurrently. These ports are attached to the end user’s
FICON SAN. The TCP/IP attachment ports include four 10 Gbps Ethernet ports per VSM 7 for replication.
2. Enterprise Library Software (ELS) and Virtual Tape Control Software (VTCS): ELS is the consolidated suite
of StorageTek mainframe software that enables and manages StorageTek’s Automated Cartridge System
(ACS) and VSM hardware. VTCS is a component of ELS that controls the virtual tape creation, deletion,
replication, migration, and recall of virtual tape images from the VTSS subsystem.
The VSM 7 supports multiple sites for Disaster Recovery/Business Continuity (DR/BC). Physical tape
extension can be used to extend the VSM 7 storage to span onsite and offsite locations that are
geographically dispersed. Up to 256 VSMs can be clustered together into what Oracle calls a Tapeplex,
managed under a single point of management as a single large data repository using Oracle’s StorageTek
Enterprise Library Software (ELS) suite. The VSM 7 utilizes user-defined, policy-based management and
automation of tasks that typically were done manually (Oracle, 2016).
The VSM 7 can also be optionally attached to the Oracle StorageTek Virtual Library Extension (VLE). The
VLE is a second-tier SAS disk array storage feature, which allows the VSM 7 to support multiple tiers of
storage (disk and tape) under a single point of management (ELS). The VLE adds a new disk-based, higher-
performance dimension to the VSM 7 architecture, suitable for data that may need to remain on disk
longer (45 to 90 days) for fast access before its probability of reuse diminishes.
Without the VLE, there would be more reliance on tape storage for this data, which might lead to situations
where data is constantly being migrated and recalled back and forth from physical tape resources. With
the VSM 7 and VLE combination, physical tape drives can be used primarily for long-term archiving. As
data on the VSM 7 and/or VLE ages, it can be migrated from the disk to physical tape libraries such as an
attached Oracle StorageTek SL8500. The VLE also can serve as a replication engine that migrates data
between VLEs, which takes the backup and replication workload off the VSM 7 itself. This frees up the VSM
7 resources to focus on host-related activities.
The VSM 7 offers several replication methods and provides the capability to make additional copies
of primary data, keep data in multiple sites in sync, and copy data to offsite and multiple geographic
locations. VLE-to-VLE disk replication, synchronous and/or asynchronous node clustering across FICON or
Gigabit Ethernet (GbE), Cross-Tapeplex Replication (CTR), and real tape drive channel extension for remote
site physical tape support are all options with the VSM 7.
Finally, VSM 7 is architected to seamlessly integrate with Oracle Storage Cloud Service and Oracle Storage
Cloud Service—Archive Storage. You can set policies to automatically copy or migrate files from StorageTek
Virtual Storage Manager to low-cost, off-site cloud storage. Offloading capacity to cloud storage optimizes
utilization of on-premises VSM resources and reduces overall cost while adding capabilities, such as a
simple, secondary data protection option without the need for expensive remote site operations and
resources.
IBM TS7760
The IBM TS7760 family of products is the latest IBM mainframe virtual tape technology solution and the
successor to the previously mentioned IBM TS7700 and IBM 3494 VTS. The IBM TS7760 offering has
advanced management features that are built to optimize tape processing. By using this technology,
IBM Z customers can implement a fully integrated, tiered storage hierarchy of disk and tape, and thus can
take advantage of the benefits of virtualization.
This powerful storage solution is complete with automated tools and an easy-to-use web-based Graphical
User Interface (GUI) for management simplification. By using the tools and GUI, your organization can store
data according to its value and how quickly it needs to be accessed. This technology can lead to significant
operational cost savings compared to traditional tape operations and can improve overall tape processing
performance.
The prior-generation IBM Virtual Tape Server (VTS) had a feature called peer-to-peer (PtP) VTS capabilities.
PtP VTS was a multisite capable DR/BC solution. PtP VTS was to tape what Peer-to-Peer Remote Copy
(PPRC) was to DASD. PtP VTS to VTS data transmission was originally done by Enterprise Systems
Connection (IBM ESCON), then FICON, and finally Transmission Control Protocol/Internet Protocol (TCP/IP).
As a Business Continuity solution for High Availability (HA) and Disaster Recovery, multiple TS7760 clusters
can be interconnected by using standard Ethernet connections. Local and geographically separated
connections are supported, to provide a great amount of flexibility to address customer needs. This IP
network for data replication between TS7760 clusters is more commonly known as a TS7760 Grid. The
IBM TS7760 can be deployed in several grid configurations. A TS7760 Grid refers to two to eight physically
separate TS7760 clusters that are connected to one another by using a customer-supplied IP network.
The TS7760 Virtualization Engine cluster combines the Virtualization Engine with a disk subsystem cache
(Coyne, et al., 2017).
The virtual tape controllers and remote channel extension hardware for the PtP VTS of the prior generation
were eliminated. This change provided the potential for significant simplification in the infrastructure that is
needed for a business continuation solution and simplified management. Instead of FICON or ESCON, the
replication connections between the TS7760 clusters use standard TCP/IP fabric and protocols. Similar
to the PtP VTS of the previous generation, with the new TS7760 Grid configuration data can be replicated
between the clusters based on customer-established policies. Any data can be accessed through any of the
TS7760 clusters, regardless of which system the data is on, if the grid contains at least one available copy.
The TCP/IP infrastructure that connects a TS7760 Grid is known as
the grid network. The grid configuration is used to form a high-availability, Disaster Recovery solution and
to provide metro and remote logical volume replication. The clusters in a TS7760 Grid can be, but do not
need to be, geographically dispersed. In a multiple-cluster grid configuration, two TS7760 clusters are often
located within 100 kilometers (km) of each other. The remaining clusters can be more than 1,000 km away.
This solution provides a highly available and redundant regional solution. It also provides a remote Disaster
Recovery solution outside of the region (IBM Corporation, 2017).
The TS7760 Grid is a robust Business Continuity and IT resilience solution. By using the TS7760 Grid,
organizations can move beyond the inadequacies of onsite backup (disk-to-disk or disk-to-tape) that cannot
protect against regional (nonlocal) natural or man-induced disasters, and data can be created and accessed
remotely through the grid network. Many TS7760 Grid configurations rely on this remote access, which
further increases the importance of the TCP/IP fabric.
With the TS7760 Grid, data is replicated and stored in a remote location to support truly continuous uptime.
The IBM TS7760 includes multiple modes of synchronous and asynchronous replication. Replication modes
can be assigned to data volumes by using the IBM Data Facility Storage Management Subsystem (DFSMS)
policy. This policy provides flexibility in implementing Business Continuity solutions so that you can simplify
your storage environment and optimize storage utilization. This functionality is similar to IBM Metro Mirror
and Global Mirror with advanced copy services support for IBM Z customers. The IBM z/OS, z/Transaction
Processing Facility Enterprise Edition (z/TPF), z/VSE (z Virtual Storage Extended), and z/VM (z Virtual
Machine) operating systems all support the TS7760 Grid. Those operating systems that do not support
DFSMS policy management can still use the policies through outboard management within the GUI and
gain the benefits of replication and the IP network (Coyne, et al., 2017).
With increased storage flexibility, your organization can adapt quickly and dynamically to changing business
environments. Switching production to a peer TS7760 can be accomplished in a few seconds with minimal
operator skills. With a TS7760 Grid solution, IBM Z customers can eliminate planned and unplanned
downtime. This approach can potentially save thousands of dollars in lost time and business and addresses
today’s stringent government and institutional data protection regulations.
The TS7760 Grid configuration introduces new flexibility for designing Business Continuity solutions. Peer-to-
peer communication capability is integrated into the base architecture and design. No special hardware is
required to interconnect the TS7760s. The VTCs of the previous generations of PtP VTS are eliminated, and
the interconnection interface is changed to standard IP networking. If configured for HA, host connectivity
to the virtual device addresses in two or more TS7760s is required to maintain access to data if one of the
TS7760s fails. If they are at different sites, FICON channel extension equipment such as the Brocade 7840
Extension Switch or Brocade SX6 Extension Blade is required to extend the host connections.
The IBM TS7760 Grid offers a robust, highly resilient IT architecture solution. This solution, when used in
combination with DASD or disk mirroring solutions, provides zEnterprise customers with a Tier 6 level of
multisite High Availability and Business Continuity. The only higher level, Tier 7, is the IBM Geographically
Dispersed Parallel Sysplex (IBM GDPS) solution.
A TS7760 Grid network requires nonstop predictable performance with components that have “five-9s”
availability. A TS7760 Grid network must be designed with highly efficient components that minimize
operating costs. These components must also be highly scalable to support business and data growth and
application needs and to help accelerate the deployment of new technologies. Storage networking is an
important part of this infrastructure. Because they depend on electronic devices, storage networks can fail
at times, whether due to software or hardware problems. Rather than taking big risks, businesses that
run important processes integrate HA solutions into their computing environments. The storage network is
crucial to HA environments.
Conclusion
This chapter covered tape remote data replication, both for native (physical drives) tape and virtual tape
subsystems. It discussed the native tape offerings from IBM and Oracle. It also discussed the virtual
tape offerings available from EMC, IBM, HDS, Luminex, and Oracle. A particular focus was placed on the
connectivity and remote replication network options available for all of these offerings. The chapter also
discussed how the Brocade Advanced Accelerator for FICON software’s tape read and write pipelining
features work in a System z environment.
Part 3, Chapter 4 looks at some complex remote data replication architectures, such as three-site
configurations and automated solutions such as IBM’s GDPS and EMC’s GDDR offerings.
References
Brocade Communications Systems. (2017). Brocade Advanced Accelerator for FICON-Setting a New
Standard for FICON Extension. San Jose, CA: Brocade Communication Systems.
Brocade Communications Systems. (2012). Technical Brief: FICON Over Extended Distances and IU Pacing.
San Jose, CA: Brocade Communications Systems.
Coyne, L., Denefleh, K., Erdman, D., Hew, J., Matsui, S., Pacini, A., Scott, M., & Zhu, C. (2017). IBM TS7700
Release 4.0 Guide. Poughkeepsie, NY: IBM Redbooks.
Dell EMC Corporation. (2017). DLm8100 Product Overview. Retrieved 2017, from Dell EMC Disk Library for
Mainframe: https://www.emc.com/collateral/hardware/data-sheet/h4207-disk-library-mainframe-ds.pdf.
Guendert, S., Hankins, G., & Lytle, D. (2012). Enhanced Data Availability and Business Continuity with
the IBM Virtual Engine TS7700 and Brocade Integrated Network Grid Solution. Poughkeepsie, NY: IBM
Redbooks.
Hitachi Data Systems. (2013). Hitachi Data Systems Mainframe Virtual Tape Solutions. Hitachi Data
Systems.
IBM. (2017). IBM System Storage TS1155 Tape Drive. Retrieved 2017, from IBM System Storage TS1155
Tape Drive: https://www.ibm.com/us-en/marketplace/ts1155/details#product-header-top
IBM. (2017). IBM System Storage TS4500 Tape Library. Retrieved 2017, from IBM System Storage TS4500
Tape Library: https://www.ibm.com/us-en/marketplace/ts4500
IBM Corporation. (2017). TS7700 Version 4 Release 1 Introduction and Planning Guide. Boulder, CO: IBM
Corporation.
Kern, R., & Peltz, V. (2010). IBM Storage Infrastructure for Business Continuity. Poughkeepsie, NY: IBM
Redbooks.
Larsen, B. (2008, Summer). Moving Mainframe Tape Outside the Data Center. Disaster Recovery Journal,
pp. 42-48.
Luminex Software, Inc. (2013). Channel Gateway X Mainframe Virtual Tape Control Unit product description.
Moore, F. (2012). The Oracle StorageTek VSM6: Providing Unprecedented Storage Versatility. Horison, Inc.
Negro, T., & Pendle, P. (2013). EMC Mainframe Technology Overview. Hopkinton, MA: EMC Corporation.
Oracle. (2015). StorageTek SL8500 Modular Library System. Retrieved 2017, from http://www.oracle.com/
us/products/servers-storage/storage/tape-storage/034341.pdf
Oracle. (2016). StorageTek Virtual Storage Manager System 7 Data Sheet. Oracle.
PART 3 Chapter 4: Multi-site and Automated Remote Data Replication
Introduction
This chapter will discuss more advanced topics in remote data replication (RDR). The two primary topics
it will address are 1) three-site replication and 2) automated business continuity and disaster recovery
solutions. The primary emphasis for both topics will be on DASD replication.
Three-Site Replication
This section will discuss how three-site replication works, look at the two most common types of three-
site replication architectures in use today, and conclude by examining some vendor specifics for three-
site replication from Dell EMC, IBM, and HDS.
When using asynchronous replication, the source and target sites are typically much further apart, often
on the order of hundreds or thousands of kilometers. Therefore, with asynchronous replication, a regional
disaster will typically not affect the target site. If the source site fails, production can be shifted to the target
site; however, there is no further remote protection of data until the failure at the primary/source site is
resolved.
Three-site replication architectures mitigate the risks identified in two-site replication. In these architectures,
data from the source site is replicated to two remote sites. Replication can be synchronous to one of the two
sites, providing a near-zero RPO solution, and it can also be replicated asynchronously to the other remote
site, providing a greater, but finite, RPO. Three-site DASD RDR is typically implemented in one of two ways
(EMC Educational Services, 2012): either in what is commonly called a cascade/multihop configuration, or
in a triangle/multitarget configuration.
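The two topologies can be summarized as simple site graphs. This is an illustrative sketch only; the dictionary entries below are descriptive labels, not vendor terminology:

```python
# Cascade/multihop: data hops source -> bunker (sync) -> remote (async).
cascade = {
    ("source", "bunker"): "synchronous",
    ("bunker", "remote"): "asynchronous",
}

# Triangle/multitarget: the source feeds both targets directly; the
# bunker-remote link sits idle until a failover requires it.
triangle = {
    ("source", "bunker"): "synchronous",
    ("source", "remote"): "asynchronous",
    ("bunker", "remote"): "idle (differential resync on failover)",
}

def downstream(topology, site):
    """Sites that receive data directly from `site` in normal operation."""
    return [dst for (src, dst), mode in topology.items()
            if src == site and not mode.startswith("idle")]

print(downstream(cascade, "source"))   # ['bunker']
print(downstream(triangle, "source"))  # ['bunker', 'remote']
```

The key structural difference is visible immediately: in the cascade, the remote site is reachable only through the bunker, whereas in the triangle both targets receive data directly from the source.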
The RPO at the remote site is usually on the order of minutes for the cascade/multihop implementation method. This method
requires a minimum of three DASD arrays (including the source). The other two DASD arrays are the array
containing the synchronous replica in the bunker, and the array containing the asynchronous replica at the
remote site.
If a disaster occurs at the source/primary site, production operations are failed over to the bunker site
with zero or near-zero data loss. The difference from a two-site configuration is that you still have remote
protection at the third site. The RPO between the bunker and third site will typically be on the order of a few
minutes. If a disaster occurs at the bunker site, or if there is an RDR network failure between the source/
primary and bunker sites, the source site continues normal operations, but without any remote replication.
This is very similar to a remote site failure in a two-site replication architecture.
The updates to the remote site cannot occur due to the failure at the bunker site. Therefore, the data at the
remote site keeps falling behind; however, if the source fails during this time, operations may be resumed
at the remote site (EMC Educational Services, 2012). The RPO at the remote site depends upon the time
difference between the bunker and source site failures.
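That relationship can be made concrete with a small calculation (the function and its numbers are illustrative, not vendor figures):

```python
def remote_rpo_minutes(bunker_fail_t, source_fail_t, normal_async_lag):
    """RPO at the remote site when the bunker fails before the source.

    Once the bunker is down, no updates reach the remote site, so every
    minute between the two failures adds to the potential data loss, on
    top of the normal asynchronous replication lag."""
    if source_fail_t < bunker_fail_t:
        raise ValueError("this model assumes the bunker fails first")
    return normal_async_lag + (source_fail_t - bunker_fail_t)

# Bunker fails at t=0; source fails 120 minutes later; normal lag ~5 min:
print(remote_rpo_minutes(0, 120, 5))  # 125
```

The longer the source keeps running after the bunker failure, the larger the exposure, which is why bunker-site outages should be repaired promptly even though production is unaffected.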
In a three-site cascade/multihop configuration, a regional disaster is similar to a source site failure in a
two-site asynchronous RDR configuration. Operations are failed over to the remote site with an RPO on the
order of minutes. There will be no remote protection until the regional disaster is resolved.
If a disaster/outage occurs at the remote site, or if the RDR network links between the bunker and remote
sites fail, the source/primary site continues to function as normal with protection provided at the bunker
site.
(fig01_P4Ch5: Source Site -> (Synchronous) -> Bunker Site -> (Asynchronous) -> Remote Site)
The key benefit of the three-site triangle/multitarget architecture for replication is the ability to fail over
to either of the two remote sites in the case of a source site failure, with protection between the bunker and
remote sites (EMC Educational Services, 2012). Resynchronization between the two surviving target sites is
incremental, and disaster recovery protection is always available if any one-site failure occurs.
During normal production operations, all three sites are available and the production workload is at the source
site. At any given instant, the data at the bunker and source is identical. The data at the remote site is behind
the data at the source and the bunker. There are replication network links in place between the bunker and
the remote site, but they will not normally be in use. Therefore, during normal operations, there is no data
movement between the bunker and remote DASD arrays. The difference between these arrays is tracked
so that if a disaster occurs at the source site, operations can be resumed at the bunker or the remote site with
incremental resynchronization between the two arrays at these sites.
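The tracked differences are what make the resynchronization incremental rather than full. A conceptual sketch of that idea, not any vendor's algorithm, using dictionaries of block addresses:

```python
def incremental_resync(bunker, remote):
    """Copy only the blocks that differ between the two surviving
    replicas, instead of re-sending the bunker array's full contents."""
    delta = {addr: data for addr, data in bunker.items()
             if remote.get(addr) != data}
    remote.update(delta)   # the remote catches up from the delta alone
    return delta

bunker_array = {0: b"A", 1: b"B", 2: b"C"}  # in step with the source
remote_array = {0: b"A", 1: b"B"}           # async copy, slightly behind
copied = incremental_resync(bunker_array, remote_array)
print(sorted(copied))  # [2] -- a single block crosses the link
```

Only the small delta crosses the bunker-remote links, which is why those links can be provisioned for failover rather than for continuous full-volume traffic.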
Finally, it should be noted that a failure of the bunker or the remote site is not actually considered a disaster,
because operation can continue uninterrupted at the source site while remote disaster recovery
protection is still available. An RDR network link failure on either the source-bunker or source-remote
link does not impact production at the source site, and remote disaster recovery protection is still available
via the site that can still be reached.
(Figure: triangle/multitarget configuration. The Source Device at the Source Site has a Remote Replica at the Bunker Site and a Remote Replica at the Remote Site; the asynchronous leg supports differential resynchronization between the replicas.)
This section will look at specifics from EMC, HDS, and IBM as they pertain to three-site DASD RDR.
SRDF/Cascaded
SRDF/Cascaded is a three-way data mirroring/recovery solution from EMC that synchronously replicates
data from a primary site to a secondary site. The data is then asynchronously replicated from the secondary
site to a tertiary site. Cascaded SRDF introduced the concept of a dual-role R2/R1 device, which EMC
calls an R21 device. Prior to EMC Enginuity 5773 (microcode level), an SRDF device could be either a
source device (R1) or a target device (R2); however, the SRDF device could not function in both roles
simultaneously. The R21 device type is unique to SRDF/Cascaded implementations (Negro & Pendle, 2013).
Concurrent SRDF/Asynchronous
Concurrent SRDF/A is a feature that was new with EMC Enginuity 5875. Concurrent SRDF/A offers the
ability to asynchronously replicate from a single production EMC Symmetrix subsystem to two unique,
remote Symmetrix subsystems. It does this by enabling both legs of an R11 (source/primary Symmetrix
controller) to participate in two SRDF/A relationships. This approach has helped Dell EMC customers
minimize some of the performance impact to production applications in multi-site SRDF environments. It
also provides the users of SRDF/S and SRDF/A implementations with the ability to change the synchronous
mode of operation to asynchronous mode during peak workload times to minimize the impact of SRDF/S
round-trip response time on the applications (Smith, Callewaert, Fallon, & Egan, 2012). This ensures that the
host applications are not gated by the synchronous performance of the subsystems and the network.
The concurrent configuration option of SRDF/A provides the ability to restart an environment at long
distances with minimal data loss while simultaneously providing a zero-data-loss restart capability at a local
site. This type of configuration provides data protection for both a site disaster and a regional disaster, while
minimizing performance (response time) impact and loss of data.
Dell EMC customers can also run in a Concurrent SRDF/A configuration without SRDF/Star functionality. In
this type of configuration, the loss of the primary site (site A) would normally mean that the long distance
RDR would stop and data would no longer be propagated to site C. Therefore, the data at site C would
continue to age as production was resumed at site B. Resuming SRDF/A between sites B and C would
require a full resynchronization to re-enable disaster recovery protection, which consumes both time and
physical resources and leaves a lengthy window without normal BC/DR protection.
SRDF/Star changes this. SRDF/Star provides for a rapid reestablishment of cross-site protection in the
event of a primary site (site A) failure. Rather than providing a full resynchronization between sites B and
C, SRDF/Star provides a differential site B to site C synchronization. This dramatically reduces the time to
remotely protect the data at what is now the new production site in this scenario. SRDF/Star also provides a
mechanism for the user to determine which site (B or C) has the most current data in the event of a rolling
disaster affecting site A (Smith, et. al, 2012). However, the customer retains the choice of which site to use
in the event of a failure.
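The speed of a differential site B to site C synchronization comes from change tracking: if each side records which tracks have changed since the last common consistency point (for example, in a bitmap), only the tracks whose bits differ need to be copied rather than the full device. The sketch below is illustrative of that idea, not of SRDF/Star internals.

```python
# Illustrative sketch: differential resync copies only tracks changed
# at either side since the last common point, not the whole device.
def differential_tracks(changed_at_b, changed_at_c):
    """Tracks that must be copied from B to C: changed at B or at C."""
    return sorted(set(changed_at_b) | set(changed_at_c))

# A device with 1,000,000 tracks; only a handful changed after site A failed.
changed_at_b = {42, 77, 1003}   # updated by production resumed at B
changed_at_c = {42, 500}        # updates C received before the failure

to_copy = differential_tracks(changed_at_b, changed_at_c)
assert to_copy == [42, 77, 500, 1003]   # 4 tracks, not 1,000,000
```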
Figure 3: A three-site SRDF configuration: SRDF/Synchronous (R1 to R2) combined with SRDF/Asynchronous replication to the remote site (C), with active and inactive R2 devices and a BCV.
A 3DC HDS configuration consists of a primary site and two other sites, and is used to perform remote
copies from the primary site. By allocating copies to three different sites, the 3DC configuration minimizes
the chances of losing data in the event of a disaster. TrueCopy is used to perform operations for nearby
sites, and Universal Replicator is used to perform operations for distant sites (Hitachi Data Systems, 2016).
Figure 4: A 3DC HDS configuration: Universal Replicator (UR) copies data to remote site 2, with master journals (M-JNL) and restore journals (R-JNL) under Business Continuity Manager (BC Manager) control at each site.
4x4 configuration
This configuration performs DASD RDR from multiple storage subsystems on the primary site to multiple
storage subsystems on the secondary site using Hitachi Universal Replicator to preserve consistency.
Figure 5: A 4x4 configuration: a supervisor DKC (Super DKC) and subordinate DKCs (Sub DKC) replicate from master journals (M-JNL) to restore journals (R-JNL).
Figure 6: TrueCopy (TC) and Universal Replicator (UR) consistency groups combined in an extended consistency group (EXCTG) container, with M-JNL journals replicating to R-JNL journals at the remote site under BC Manager control.
Figure 7: IBM Metro/Global Mirror: normal application I/Os from z mainframes or servers at site A are copied over a synchronous, short-distance Metro Mirror network to site B, then over an asynchronous, long-distance Global Mirror network to site C, with incremental or NOCOPY FlashCopy to volume D.
Metro/Global Mirror is a DASD RDR solution that is designed for installations that take advantage of a
short distance Remote Copy solution (Metro Mirror): When there is a disaster or a problem at the local site,
the applications and systems can be recovered immediately at the intermediate site. The solution also
takes advantage of a long-distance Remote Copy solution (Global Mirror): When there is a disaster at the
local and intermediate sites, the applications can be restarted at a long-distance site. The Metro/Global Mirror
design is based on the features and characteristics of Metro Mirror and Global Mirror, with the addition of cascading.
Cascading is a capability of DS8000 storage systems where the secondary (target) volume of a Global
Copy or Metro Mirror relationship is at the same time the primary (source) volume of another Global Copy
or Metro Mirror relationship. You can use cascading to do Remote Copy from a local volume through an
intermediate volume to a remote volume.
1. Local site disaster: Production can be switched to the intermediate site. This switch can occur in
seconds to minutes with GDPS type automation. Mirroring to the remote site is suspended. If operation
with the intermediate site disks is expected to be long term, the intermediate site can be mirrored to the
remote site with the appropriate communications connectivity.
2. Intermediate site disaster: Local and remote sites continue to operate normally. When the intermediate
site returns to normal operation, the intermediate site can be resynchronized quickly to the local site.
3. Local site and intermediate site disasters: The remote site can be brought up to the most recent
consistent image with some loss of in-flight updates. A FlashCopy copy of the consistent image disks (X’)
should be taken to preserve the original recovery point. Production can be started at the remote site by
using either set of local disks (X’ or X”), depending on your configuration.
4. Remote site disaster: Local and intermediate sites continue to operate normally.
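The four scenarios above can be sketched as a simple decision routine. Site names and response actions here are illustrative summaries of the text, not GDPS automation commands.

```python
# Illustrative dispatch of the four Metro/Global Mirror failure scenarios.
def metro_global_mirror_response(failed_sites):
    failed = set(failed_sites)
    if failed == {"local"}:
        return ["switch production to intermediate site",
                "suspend mirroring to remote site",
                "optionally mirror intermediate -> remote for long-term operation"]
    if failed == {"intermediate"}:
        return ["local and remote continue normally",
                "resynchronize intermediate from local on return"]
    if failed == {"local", "intermediate"}:
        return ["FlashCopy consistent image at remote (preserve recovery point)",
                "restart production at remote site"]
    if failed == {"remote"}:
        return ["local and intermediate continue normally"]
    raise ValueError("unplanned combination: " + ",".join(sorted(failed)))

assert "restart production at remote site" in \
    metro_global_mirror_response({"local", "intermediate"})
```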
Figure 8: A z/OS Metro/Global Mirror configuration on DS8000 systems: Metro Mirror at metropolitan distance between the Metro Mirror secondary (P') and the Metro Mirror/z/OS Global Mirror primary (P), with z/OS Global Mirror at unlimited distance to the secondary (X) and FlashCopy copies (X', X") when required.
Figure 9: A four-site replication solution combining Metro/Global Mirror and Global Copy.
With this 4-site solution, you avoid the situation where you do not have a synchronous copy while you run
production at the remote site. If there is a disaster at the production site, you convert the Global Copy
relationships at the remote site to Metro Mirror. Thus, you can restart the production environment with the
same high availability and disaster recovery features you had available at the local site.
GDDR does not provide replication and recovery services itself, but rather monitors and automates the
services provided by other Dell EMC products, as well as third-party products, required for continuous
operations or business restart. GDDR facilitates business continuity by generating scripts that can be run
on demand; for example, restart business applications following a major data center incident, or resume
replication to provide ongoing data protection following unplanned link outages. Scripts are customized at
the time of invocation by an expert system that tailors the steps based on the configuration and the event
that GDDR is managing (Smith, et.al, 2012). Through automatic event detection and end-to-end automation
of managed technologies, GDDR removes human error from the recovery process and allows it to complete
in the shortest time possible. The GDDR expert system is also invoked to automatically generate planned
procedures, such as moving compute operations from one data center to another.
GDDR overview
Dell EMC GDDR can be implemented in a variety of configurations involving two or three-sites, SRDF/S,
SRDF/A, ConGroup, AutoSwap, SRDF/EDP, and SRDF/Star. In the mainframe environment, GDDR is a
requirement for a SRDF/Star configuration (Negro & Pendle, 2013).
GDDR can manage environments comprising any of these elements. In each configuration, GDDR
provides specific capabilities tailored to that configuration; however, the major features of GDDR are
common across all topologies.
To achieve the task of business restart, GDDR automation extends well beyond the disk level and into the
host operating system level where sufficient controls and access to third party software and hardware
products exist to enable GDDR to provide automated recovery capabilities. GDDR can distinguish normal
operational disruptions from disasters and respond accordingly. For example, GDDR is able to distinguish
between network outages (SRDF link drop) and real disasters (Smith, et.al, 2012). This awareness is
achieved by periodic exchange of dual-direction heartbeats between the GDDR C-Systems. GDDR constantly
checks for disaster situations and ensures that other GDDR systems are healthy. This checking allows
GDDR to recognize, and act on, potential disaster situations, even if only one GDDR C-system survives.
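The heartbeat logic described above can be sketched as follows. The class, timeout, and state names are illustrative assumptions in the spirit of the GDDR C-System exchange, not Dell EMC implementation details: a peer that still answers heartbeats while the replication link is down indicates a network outage, while a silent peer is escalated as a potential disaster.

```python
# Illustrative heartbeat monitor distinguishing link drops from disasters.
import time

class CSystemMonitor:
    def __init__(self, timeout_s=10.0):
        self.timeout_s = timeout_s
        self.last_heartbeat = {}        # peer name -> last time heard from
        self.replication_link_up = True

    def heartbeat(self, peer, now=None):
        self.last_heartbeat[peer] = now if now is not None else time.time()

    def assess(self, peer, now=None):
        now = now if now is not None else time.time()
        alive = (now - self.last_heartbeat.get(peer, 0.0)) <= self.timeout_s
        if alive and not self.replication_link_up:
            return "network outage"      # peer healthy, SRDF link dropped
        if not alive:
            return "potential disaster"  # peer silent: escalate
        return "normal"

m = CSystemMonitor(timeout_s=10.0)
m.heartbeat("DC2", now=100.0)
m.replication_link_up = False
assert m.assess("DC2", now=105.0) == "network outage"
assert m.assess("DC2", now=200.0) == "potential disaster"
```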
The benefits of GDPS are a highly reliable and highly automated IT infrastructure solution with a z Systems
based control point, which can provide robust application availability. GDPS is enabled through key IBM
technologies and architectures. GDPS supports both the synchronous (Metro Mirror) and the asynchronous (z/
OS Global Mirror or Global Mirror) forms of Remote Copy. GDPS also supports Peer-to-Peer Virtual Tape Server
(PtP VTS) and TS7760 Grid Networks for Remote Copying tape data. The GDPS solution is a non-proprietary
solution, working with IBM and other vendors’ storage systems, if the vendor meets the specific functions
of the Metro Mirror, Global Mirror, and z/OS Global Mirror architectures that are required to support GDPS
functions (Clitherow, et. al, 2017).
GDPS automation manages and protects IT services by handling planned and unplanned exception
conditions. Depending on the GDPS configuration that is selected, availability can be storage resiliency only,
or can provide near-continuous application and data availability. Regardless of the specific configuration
variation, GDPS provides a z Systems–based Business Continuity automation and integration solution to
manage both planned and unplanned exception conditions. To attain high-levels of continuous availability
and near-transparent DR, the GDPS solution is based on geographical clusters and disk mirroring.
GDPS Components
The GDPS family of IBM Z Business Continuity solutions consists of three major offering categories, and
each category has several sub-offerings. Each GDPS solution is delivered through IBM Global Services, and
is tailored to fit a specific set of client recovery requirements, budgetary requirements, physical distance
and infrastructure, and other factors.
The three major categories of GDPS solutions that are built on DS8000 Copy Services are (Clitherow, et. al,
2017):
1. GDPS/PPRC solutions that are based on Metro Mirror. Metro Mirror was formerly known as Peer-to-Peer
Remote Copy (PPRC). GDPS/PPRC comes in two flavors:
a. GDPS/PPRC
b. GDPS/PPRC HyperSwap Manager
2. GDPS/XRC solutions that are based on z/OS Global Mirror. z/OS Global Mirror was formerly known as
Extended Remote Copy (XRC).
3. GDPS/GM solutions that are based on Global Mirror.
These GDPS offerings can also be combined into 3-site GDPS solutions, providing Tier 7 recoverability in
multi-site environments:
IBM GDPS active/active continuous availability is a solution for an environment that consists of two sites,
which are separated by unlimited distances, running the same applications and having the same data
with cross-site workload monitoring, data replication, and balancing. Based on asynchronous software
replication, planned switches can be accomplished with no data loss (RPO 0), and when sufficient
replication bandwidth is provided, the RPO can be as low as a few seconds in the event of an unplanned
workload switch (Clitherow, et. al, 2017).
GDPS/PPRC overview
GDPS/PPRC is designed to manage and protect IT services by handling planned and unplanned exception
conditions, and maintain data integrity across multiple volumes and storage systems. By managing both
planned and unplanned exception conditions, GDPS/PPRC can maximize application availability and provide
business continuity. GDPS/PPRC provides the following attributes:
The GDPS/PPRC solution offering combines IBM Z Parallel Sysplex capability and ESS, IBM System Storage
DS6000, and DS8000 Metro Mirror disk mirroring technology to provide a Business Continuity solution for IT
infrastructures that have System z at the core. GDPS/PPRC offers efficient workload management, system
resource management, business continuity or disaster recovery for z/OS servers and Open System data,
and provides data consistency across all platforms that use the Metro Mirror Consistency Group function.
The GDPS solution uses automation technology to provide end-to-end management of IBM Z servers, disk
mirroring, tape mirroring, and workload shutdown and startup. GDPS manages the infrastructure to minimize
or eliminate the outage during a planned or unplanned site failure. Critical data is disk mirrored, and
processing is automatically restarted at an alternative site if a primary planned site shutdown or site failure
occurs. GDPS/PPRC automation provides scalability to ensure data integrity across hundreds or thousands
of Metro Mirror volume pairs.
Since GDPS/PPRC V3.2, the HyperSwap function uses the Metro Mirror failover/failback (FO/FB) function.
For planned reconfigurations, FO/FB might reduce the overall elapsed time to switch the storage systems,
thus reducing the time that applications might be unavailable to users. For unplanned reconfigurations,
FO/FB allows the secondary disks to be configured in the suspended state after the switch and record
any updates that are made to the data. When the failure condition is repaired, resynchronizing back to
the original primary disks requires only the copying of the changed data, thus eliminating the necessity
of performing a full copy of the data. The window during which critical data is left without Metro Mirror
protection after an unplanned reconfiguration is therefore minimized.
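The failover/failback sequence described above can be sketched as a small state machine. The structures are illustrative of the concept, not DS8000 internals: after a switch, the suspended side records which tracks change, so failback copies only those tracks rather than performing a full copy.

```python
# Illustrative failover/failback with change recording (not Metro Mirror code).
class MirrorPair:
    def __init__(self):
        self.primary = {}             # track -> data
        self.secondary = {}
        self.suspended = False
        self.changed_tracks = set()   # recorded while suspended

    def write(self, track, data):
        if self.suspended:
            # After failover, I/O runs on the old secondary; record changes.
            self.secondary[track] = data
            self.changed_tracks.add(track)
        else:
            self.primary[track] = data
            self.secondary[track] = data  # synchronous Metro Mirror copy

    def failback(self):
        # Resynchronize by copying only the changed tracks back.
        copied = sorted(self.changed_tracks)
        for t in copied:
            self.primary[t] = self.secondary[t]
        self.changed_tracks.clear()
        self.suspended = False
        return copied

pair = MirrorPair()
pair.write(1, "a")             # normal synchronous operation
pair.suspended = True          # unplanned switch to the secondary
pair.write(2, "b")             # update recorded while suspended
assert pair.failback() == [2]  # only the changed track is recopied
assert pair.primary == {1: "a", 2: "b"}
```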
GDPS/XRC overview
z/OS Global Mirror, formerly known as Extended Remote Copy (XRC), is a combined hardware and z/OS
software asynchronous Remote Copy solution for System z data. z/OS Global Mirror is designed to provide
premium levels of scalability, reaching into the tens of thousands of System z volumes. Consistency of the data
is maintained through the Consistency Group function within the System Data Mover.
GDPS/XRC includes automation to manage z/OS Global Mirror Remote Copy pairs and automates the process
of recovering the production environment with limited manual intervention, including invocation of Capacity
Backup (CBU), thus providing significant value in reducing the duration of the recovery window and requiring
less operator interaction. GDPS/XRC provides the following attributes (Petersen, 2013):
GDPS/GM resembles GDPS/XRC in that it supports unlimited distances between the application and
recovery sites. Also, GDPS/GM does not provide any automation or management of the production
systems; its focus is on managing the Global Mirror Remote Copy environment and automating and
managing recovery of data and systems if there is a disaster. Also, like GDPS/XRC, GDPS/GM supports
the ability to make Remote Copy copies of data from multiple sysplexes; each GDPS/PPRC license
supports Remote Copy for a single sysplex.
In addition to its disaster recovery capabilities, GDPS/GM also provides an interface for monitoring and
managing the Remote Copy configuration. This management includes the initialization and monitoring of
the GM volume pairs that are based upon policy, and performing routine operations on installed storage
systems.
1. A HyperSwap capability for near-continuous availability for a disk control unit failure
2. An option for no data loss
3. Maintenance of a disaster recovery capability after a HyperSwap
4. Data consistency to allow restart, not recovery, at either site 2 or site 3
5. A long-distance disaster recovery site for protection against a regional disaster
6. Minimal application impact
7. GDPS automation to manage Remote Copy pairs, manage a Parallel Sysplex configuration, and perform
planned and unplanned reconfigurations.
In addition, GDPS Metro/Global Mirror can be used for both IBM Z and open data, and provide consistency
between them.
Conclusion
This chapter discussed complex three- and four-site DASD replication, as well as automated BC/DR
offerings from Dell EMC and IBM. It should be noted that HDS offerings are fully compatible with the IBM
GDPS offerings. These complex and automated configurations are typically used by businesses with very
strict RPO and RTO requirements such as financial institutions.
The RDR network considerations are the same for these solutions as they are for the earlier DASD RDR
discussions in Part 3, Chapter 2. However, most end users who implement three- or four-site architectures
and/or use GDDR or GDPS will build more redundancy into their RDR networks.
The remaining two chapters of Part 3 discuss the specifics of FCIP and IP extension technologies.
References
Clitherow, D., Schindel, S., Thompson, J., Narbey, M., (2017). IBM GDPS Family: An Introduction to Concepts
and Capabilities. Poughkeepsie, NY: IBM Redbooks.
EMC Educational Services. (2012). Information Storage and Management Second Edition. Wiley.
Hitachi Data Systems. (2013). Hitachi TrueCopy for Mainframe User Guide. Hitachi Data Systems.
Hitachi Data Systems. (2013). Hitachi Universal Replicator for Mainframe User Guide. Hitachi Data Systems.
Negro, T., & Pendle, P. (2013). EMC Mainframe Technology Overview V 2.0. Hopkinton, MA: EMC
Corporation.
Petersen, D., (2013). GDPS: The Enterprise Continuous Availability/Disaster Recovery Solution. IBM
Corporation.
Smith, D., Callewaert, P., Fallon, C., & Egan, J. (2012). EMC GDDR Solution Design and Implementation
Techniques V2.0. Hopkinton, MA: EMC Corporation.
Westphal, A., Dufrasne, B., Bertazi, A., Kulzer, C., Frankenberg, M., Gundy, L., Drozda, L., Wolf, R., and
Stanley, W. (2017). DS8000 Copy Services. Poughkeepsie, NY: IBM Redbooks.
PART 3 Chapter 5: FIBRE CHANNEL OVER IP (FCIP)
Introduction
FCIP, or Fibre Channel over Internet Protocol, is a tunneling protocol that links Fibre Channel over distance across
standard IP networks. Used primarily for remote replication, backup, and storage access, FCIP extension provides
Fibre Channel connectivity over IP networks between Fibre Channel devices or fabrics. The FCIP link is an Inter-
Switch Link (ISL) that transports Fibre Channel control and data frames between switches. FCIP extension is
supported on all Brocade extension platforms.
Over the last decade, extension networks for storage have become commonplace and continue to grow in size and
importance. Growth is not limited to new deployments but also involves the expansion of existing deployments.
Requirements for data protection never ease, as the economies of many countries depend on successful and
continued business operations and thus have passed laws mandating data protection. Modern-day dependence
on Remote Data Replication (RDR) means there is little tolerance for lapses that leave data vulnerable to loss
(Brocade Communications Systems, 2017). In mainframe environments, reliable and resilient networks—to the
point of no frame loss and in-order frame delivery—are necessary for error-free operation, high performance, and
operational ease. This improves availability, reduces risk, reduces operating expenses and—most of all—reduces
risk of data loss.
The previous chapters were an introduction to Business Continuity, Disaster Recovery, and Continuous Availability
(BC/DR/CA), and data replication. They also introduced the various network topologies and protocols used for the
networks associated with BC/DR/CA. This chapter focuses in depth on one of those protocols, FCIP, and how it is
used in a mainframe environment to provide long-distance extension networks between data centers. Because
of the higher costs of long-distance dark fiber connectivity compared with other communications services, use
of the more common and more affordable IP network services is an attractive option for Fibre Channel extension
between geographically separated data centers. FCIP is a technology for interconnecting Fibre Channel-based
storage networks over distance using IP.
Figure 1: FCIP is for passing FCP and FICON I/O traffic across long-distance IP links.
This chapter begins with a discussion of some FCIP basics. This discussion includes some terminology and
definitions, as well as an overview of the standards associated with FCIP. It then discusses the details of how
FCIP works, with an emphasis on mainframe and Fiber Connectivity (FICON) environments. This more detailed
discussion includes frame encapsulation, batching, the FCIP data engines, and the role of Transmission Control
Protocol (TCP) in FCIP. The third section of this chapter focuses on the unique technology feature of Brocade
Extension Trunking. The final section of this chapter introduces the Brocade Advanced Accelerator for FICON and
discusses the details of how that emulation technology works.
This chapter serves as a valuable accompaniment to the detailed discussion on remote Direct Access Storage
Device (DASD) replication, remote tape and virtual tape replication, and geographically dispersed automated
solutions, which were discussed in the preceding three chapters. Throughout this chapter, references to Fibre
Channel can be assumed to apply to both FICON and FCP data traffic unless stated otherwise.
FCIP Basics
IP storage initiatives have evolved on a foundation of previously established standards for Ethernet and IP. For
example, Ethernet is standardized in the Institute of Electrical and Electronics Engineers (IEEE) 802.3 standard,
Gigabit Ethernet (GbE) is standardized in IEEE 802.3z, 10 Gigabit Ethernet is standardized in IEEE 802.3ae, and
40/100 Gigabit Ethernet is standardized in IEEE 802.3ba. IP-related standards are established by the Internet
Engineering Task Force (IETF) through a diverse set of Requests for Comment (RFCs) that cover a wide range of
protocol management issues. Transmission Control Protocol (TCP) was standardized in 1981 as RFC 793, User
Datagram Protocol (UDP) was standardized in 1980 as RFC 768, and FCIP was standardized in 2004 as RFC
3821 (Guendert, 2017).
Ethernet and IP are just the basic components for deploying IP storage networks. IP storage technology must
also accommodate previously established standards for Small Computer Systems Interface (SCSI), which is the
purview of the International Committee for Information Technology Standards (INCITS) T10 Committee. FCIP
storage solutions must also follow the previously established standards for FCP (INCITS T10) and Fibre Channel
Transport (INCITS T11).
These standards all provide guideposts for technology development, with the ideal goal of product interoperability.
Guideposts are not intended to be moved; in other words, new technology developments should not require
changes to existing standards.
RFC 3821
The RFC 3821 specification is a 75-page document that describes mechanisms that allow the interconnection
of islands of Fibre Channel Storage Area Networks (SANs) to form a unified SAN in a single fabric via connectivity
between islands over IP. The chief motivation behind defining these interconnection mechanisms is a desire to
connect physically remote Fibre Channel sites, allowing remote disk and DASD access, tape backup, and
live/real-time mirroring of storage between remote sites. The Fibre Channel standards have chosen nominal
distances between switch elements that are less than the distances available in an IP network. Since Fibre
Channel and IP networking technologies are compatible, it is logical to turn to IP networking for extending the
allowable distances between Fibre Channel switch elements.
The fundamental assumption made in RFC 3821, which must be remembered, is that the Fibre Channel traffic is
carried over the IP network in such a manner that the Fibre Channel fabric and all of the Fibre Channel devices
in the fabric are unaware of the presence of the IP network. This means that the Fibre Channel datagrams must
be delivered in time to comply with the existing Fibre Channel specifications. The Fibre Channel traffic may span
Local Area Networks (LANs), Metropolitan Area Networks (MANs), and Wide Area Networks (WANs), as long as this
fundamental assumption is adhered to.
Before discussing FCIP from the perspective of Brocade technologies, it is helpful to understand the formal
terminology used by RFC 3821, as well as understand the summary of the FCIP protocol as described by RFC
3821.
1. FC End Node: A FC (Fibre Channel) device that uses the connection services provided by the FC fabric.
2. FC Entity: The FC-specific functional component that combines with an FCIP entity to form an interface
between an FC fabric and an IP network.
3. FC Fabric: An entity that interconnects various Nx_Ports attached to it, and that is capable of routing FC
frames using only the destination ID information in an FC frame header.
4. FC Fabric Entity: An FC-specific element containing one or more interconnect ports and one or more FC/FCIP
Entity pairs.
6. FC Frame Receiver Portal: The access point through which an FC frame and time stamp enter an FCIP data
engine from the FC entity.
7. FC Frame Transmitter Portal: The access point through which a reconstituted FC frame and time stamp leave
an FCIP data engine to the FC entity.
8. FC/FCIP Entity pair: The combination of one FC entity and one FCIP entity.
9. FCIP Data Engine (FCIP_DE): The component of an FCIP entity that handles FC frame encapsulation, de-
encapsulation, and transmission of FCIP frames through a single TCP connection.
10. FCIP Entity: An FCIP entity functions to encapsulate Fibre Channel frames and forward them over an IP
network. FCIP entities are peers that communicate using TCP/IP. FCIP entities encompass the FCIP_LEP(s)
and an FCIP Control and Services module.
11. FCIP Frame: An FC frame plus the FC frame encapsulation header, encoded Start of Frame (SOF) and
encoded End of Frame (EOF) that contains the FC frame.
12. FCIP Link: One or more TCP connections that connect one FCIP_LEP to another.
13. FCIP Link Endpoint (FCIP_LEP): The component of an FCIP entity that handles a single FCIP link and
contains one or more FCIP_DEs.
14. Encapsulated Frame Receiver Portal: The TCP access point through which an FCIP frame is received from
the IP network by an FCIP data engine.
15. Encapsulated Frame Transmitter Portal: The TCP access point through which an FCIP frame is transmitted
to the IP network by an FCIP data engine.
16. FCIP Special Frame (FSF): A specially formatted FC frame containing information used by the FCIP protocol.
RFC 3821 summarizes the FCIP protocol model as follows (IETF, 2004):
1. The primary function of an FCIP entity is forwarding FC frames, employing FC frame encapsulation.
2. Viewed from the IP network perspective, FCIP entities are peers, and they communicate using TCP/IP. Each
FCIP entity contains one or more TCP endpoints in the IP-based network.
3. Viewed from the FC fabric perspective, pairs of FCIP entities in combination with their associated FC entities,
forward FC frames between FC fabric elements. The FC end nodes are unaware of the existence of the
FCIP link.
4. FC primitive signals, primitive sequences, and Class 1 FC frames are not transmitted across an FCIP link
because they cannot be encoded using FC frame encapsulation.
5. The path (route) taken by an encapsulated FC frame follows the normal routing procedures of the IP network.
6. An FCIP entity may contain multiple FCIP link endpoints, but each FCIP link endpoint (FCIP_LEP)
communicates with exactly one other FCIP_LEP.
7. When multiple FCIP_LEPs with multiple FCIP_DEs are in use, the selection of which FCIP_DE to use for
encapsulating and transmitting a given FC frame is covered in FC-BB-2. FCIP entities do not actively
participate in FC frame routing.
8. The FCIP Control and Services module may use TCP/IP Quality of Service (QoS) features.
9. It is necessary to statically or dynamically configure each FCIP entity with the IP addresses and TCP port
numbers corresponding to the FCIP entities with which it is expected to initiate communication. If dynamic
discovery of participating FCIP entities is supported, the function shall be performed using the Service
Location Protocol (SLPv2). It is outside the scope of this specification to describe any static configuration
method for participating FCIP entity discovery.
10. Before creating a TCP connection to a peer FCIP entity, the FCIP entity attempting to create the TCP
connection shall statically or dynamically determine the IP address, TCP port, expected FC fabric entity World
Wide Name (WWN), TCP connection parameters, and QoS Information.
11. FCIP entities do not actively participate in the discovery of FC source and destination identifiers. Discovery
of FC addresses (accessible via the FCIP entity) is provided by techniques and protocols within the FC
architecture, as described in FC-FS and FC-SW-2.
a. Implement cryptographically protected authentication and cryptographic data integrity keyed to the
authentication process.
13. On an individual TCP connection, this specification relies on TCP/IP to deliver a byte stream in the same order
that it was sent.
14. This specification assumes the presence of and requires the use of TCP and FC data loss and corruption
mechanisms. The error detection and recovery features described in this specification complement and
support these existing mechanisms.
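Item 10 in the list above enumerates what an FCIP entity must know before opening a TCP connection to a peer. A minimal sketch of such a configuration record follows; the field names are illustrative and do not come from any Brocade CLI, though TCP port 3225 is the IANA-registered FCIP port specified in RFC 3821.

```python
# Illustrative FCIP peer configuration record (hypothetical field names).
from dataclasses import dataclass

@dataclass(frozen=True)
class FCIPPeerConfig:
    ip_address: str
    tcp_port: int = 3225          # IANA-registered FCIP port (RFC 3821)
    fabric_entity_wwn: str = ""   # expected FC fabric entity World Wide Name
    qos_dscp: int = 0             # TCP/IP QoS marking, if used

    def validate(self):
        if not (0 < self.tcp_port < 65536):
            raise ValueError("invalid TCP port")
        return True

peer = FCIPPeerConfig(ip_address="192.0.2.10",
                      fabric_entity_wwn="10:00:00:05:1e:00:00:01")
assert peer.validate() and peer.tcp_port == 3225
```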
• p1) FC Frame Receiver Portal: The interface through which an FC frame and time stamp enters an FCIP_DE
from the FC entity.
• p2) Encapsulated Frame Transmitter Portal: The TCP interface through which an FCIP frame is transmitted to
the IP network by an FCIP_DE.
• p3) Encapsulated Frame Receiver Portal: The TCP interface through which an FCIP frame is received from the
IP network by an FCIP_DE.
• p4) FC Frame Transmitter Portal: The interface through which a reconstituted FC frame and time stamp exit an
FCIP_DE to the FC entity.
The work of the FCIP_DE is done by the Encapsulation and De-Encapsulation Engines. The engines have two
functions (IETF, 2004):
1. Encapsulating and de-encapsulating FC frames
2. Detecting some data transmission errors and performing minimal error correction
Data flows through a pair of IP network connected FCIP_DEs in the following seven steps:
1. An FC frame and time stamp arrive at the FC frame receiver portal and are passed to the encapsulation
engine. The FC frame is assumed to have been processed by the FC entity according to the applicable FC
rules and is not validated by the FCIP_DE. If the FC entity is in the unsynchronized state with respect to a time
base, as described in the FC frame encapsulation specification, the time stamp delivered with the FC frame
shall be zero.
2. In the encapsulation engine, the encapsulation format described in FC frame encapsulation shall be applied
to prepare the FC frame and associated time stamp for transmission over the IP network.
3. The entire encapsulated FC frame (in other words, the FCIP frame) shall be passed to the Encapsulated
Frame Transmitter Portal, where it shall be inserted in the TCP byte stream.
4. Transmission of the FCIP frame over the IP Network follows all the TCP rules of operation. This includes, but is
not limited to, the in-order delivery of bytes in the stream, as specified by TCP.
5. The FCIP frame arrives at the partner FCIP entity, where it enters the FCIP_DE through the Encapsulated
Frame Receiver Portal and is passed to the De-Encapsulation Engine for processing.
6. The De-Encapsulation Engine shall validate the incoming TCP byte stream and shall de-encapsulate
the FC frame and associated time stamp according to the encapsulation format described in FC frame
encapsulation.
7. In the absence of errors, the de-encapsulated FC frame and time stamp shall be passed to the FC Frame
Transmitter Portal for delivery to the FC entity.
Every FC frame that arrives at the FC Frame Receiver Portal shall be transmitted on the IP network as described in
steps 1 through 4 above. In the absence of errors, data bytes arriving at the Encapsulated Frame Receiver Portal
shall be de-encapsulated and forwarded to the FC Frame Transmitter Portal as described in Steps 5 through 7.
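The encapsulation and de-encapsulation steps above can be sketched as a round trip. The header here keeps only a few representative fields (SOF code, EOF code, frame length in words, time stamp) with illustrative code values; it is NOT the exact RFC 3643 wire layout.

```python
# Simplified FCIP-style encapsulation round trip (illustrative, not RFC 3643).
import struct

HDR = struct.Struct("!BBHI")   # SOF code, EOF code, length (4-byte words), time stamp

def encapsulate(fc_frame: bytes, sof: int, eof: int, timestamp: int) -> bytes:
    """Steps 1-3: wrap an FC frame for insertion into the TCP byte stream."""
    return HDR.pack(sof, eof, len(fc_frame) // 4, timestamp) + fc_frame

def de_encapsulate(fcip_frame: bytes):
    """Steps 5-7: recover the FC frame and time stamp on the receiving FCIP_DE."""
    sof, eof, words, ts = HDR.unpack_from(fcip_frame)
    frame = fcip_frame[HDR.size:HDR.size + words * 4]
    return frame, sof, eof, ts

fc_frame = b"FCHDRPAYLOADXX00"   # stand-in 16-byte (4-word) FC frame
wire = encapsulate(fc_frame, sof=0x28, eof=0x41, timestamp=0)  # ts 0 = unsynchronized
out, sof, eof, ts = de_encapsulate(wire)
assert out == fc_frame and (sof, eof, ts) == (0x28, 0x41, 0)
```

Step 4, transmission, is simply TCP's in-order byte-stream delivery between the two portals, which is why the sketch can treat `wire` as arriving intact at the far side.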
Each FC frame that exits the FCIP_DE through the FC Frame Transmitter Portal shall be accompanied by the time
stamp value taken from the FCIP frame that encapsulated the FC frame. The FC entity shall use suitable internal
clocks and either Fibre Channel services or a Simple Network Time Protocol (SNTP) version 4 server to establish
and maintain the required synchronized time value. The FC entity shall verify that the FC entity it is communicating
with on an FCIP link is using the same synchronized time source, either Fibre Channel services or SNTP server
(IETF, 2004).
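The seven steps above can be sketched as a toy encapsulation/de-encapsulation pipeline. This is an illustrative model only: the two-field header below (frame length plus time stamp) is an assumption for the sketch, not the actual FC-BB/RFC 3821 encapsulation format.

```python
import struct

HEADER_FMT = ">I I"  # simplified header: (frame length, time stamp); not the real FCIP header

def encapsulate(fc_frame: bytes, timestamp: int) -> bytes:
    """Steps 2-3: wrap an FC frame and its time stamp into an FCIP frame."""
    return struct.pack(HEADER_FMT, len(fc_frame), timestamp) + fc_frame

def de_encapsulate(stream: bytes):
    """Steps 5-7: validate the byte stream and recover (fc_frame, timestamp) pairs."""
    frames = []
    offset = 0
    hdr_len = struct.calcsize(HEADER_FMT)
    while offset + hdr_len <= len(stream):
        length, timestamp = struct.unpack_from(HEADER_FMT, stream, offset)
        offset += hdr_len
        if offset + length > len(stream):
            raise ValueError("truncated FCIP frame")  # error case: handled per the standard
        frames.append((stream[offset:offset + length], timestamp))
        offset += length
    return frames

# TCP guarantees in-order delivery of the byte stream (step 4), so concatenating
# FCIP frames into one stream models the Encapsulated Frame Transmitter Portal.
stream = encapsulate(b"FC-frame-A", 100) + encapsulate(b"FC-frame-B", 101)
assert de_encapsulate(stream) == [(b"FC-frame-A", 100), (b"FC-frame-B", 101)]
```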
Note that since the FC fabric is expected to have a single synchronized time value throughout, reliance on the
Fibre Channel services means that only one synchronized time value is needed for all FCIP_DEs regardless of their
connection characteristics.
FCIP supports applications such as Remote Data Replication (RDR), centralized SAN backup, and data migration
over very long distances that are impractical or very costly using native Fibre Channel connections. FCIP tunnels,
built on a physical connection between two extension switches or blades, allow Fibre Channel I/O to pass through
the IP WAN.
The TCP connections ensure in-order delivery of FC frames and lossless transmission. The Fibre Channel fabric
and all Fibre Channel targets and initiators are unaware of the presence of the IP WAN. Figure 4 shows the
relationship of FC and TCP/IP layers, and the general concept of FCIP tunneling (Guendert, 2017).
IP addresses and TCP connections are used only at the FCIP tunneling devices at each endpoint of the IP WAN
“cloud.” Each IP network connection on either side of the WAN cloud is identified by an IP address and a TCP/
IP connection between the two FCIP devices. An FCIP data engine encapsulates the entire Fibre Channel frame
in TCP/IP and sends it across the IP network (WAN). At the receiving end, the IP and TCP headers are removed,
and a native Fibre Channel frame is delivered to the destination Fibre Channel node. The existence of the FCIP
devices and IP WAN cloud is transparent to the Fibre Channel switches, and the Fibre Channel content buried in
the IP datagram is transparent to the IP network. TCP is required for transit across the IP tunnel to enforce in-order
delivery of data and congestion control. The two FCIP devices in Figure 4 use the TCP connection to form a virtual
Inter-Switch Link (ISL), called a VE_Port, between them. They can pass Fibre Channel Class F traffic as well as
data over this virtual ISL (Guendert, 2017).
Here is a simple real-world analogy. This FCIP process is similar to writing a letter in one language and putting it
into an envelope to be sent through the postal system to some other destination in the world. At the receiving end,
the envelope is opened, and the contents of the letter are read. Along the route, those handling the letter do not
have to understand the contents of the envelope, but only where its intended destination is and where it came
from. FCIP takes Fibre Channel frames, regardless of their purpose (FICON, FCP, and so on), and places them
into IP frames (envelopes) for transmission to the receiving destination. At the receiving destination, the
envelopes are opened and the contents are placed back on the Fibre Channel network to continue their trip. In
Figure 5, Fibre Channel frames carrying FICON, FCP, and other ULP traffic are simply sent from the SAN on the left
to the SAN on the right. The frames are placed into an IP wrapper and transmitted over the IP WAN, which could
use 10/100 Gb, 10 GbE, SONET/SDH, or even ATM as the underlying interfaces/protocols.
[Figure: FCIP tunneling protocol stack. At each endpoint, SCSI/CCW traffic rides FCP/FICON over FC layers 0-2;
across the WAN, the extension platforms (Brocade X6/DCX with SX6 blades, Brocade 7840, MLX routers) carry
FCIP over TCP/IP, L2, and PHY layers.]
Tunnels
An extension tunnel is a conduit that contains one or more circuits. When a tunnel contains multiple circuits, it is
also called an extension trunk because multiple circuits are trunked together. An extension tunnel, or extension
trunk, is a single Inter-Switch Link (ISL). Circuits are individual extension connections within the trunk, each with
its own unique source and destination IP interface (IPIF) address.
To understand tunnels, you must understand the relationship between tunnels and Virtual E_Port (VE_Port).
Because an extension tunnel is an ISL, each end of the tunnel requires its own VE_Port. For example, the Brocade
SX6 extension blade supports a number of VE_Ports. The available VE_Ports on the Brocade SX6 are numbered
16 to 35. When you create a tunnel, it can be assigned any number from 16 to 35. Tunnels are more complicated
than this, but the point is that each end of the tunnel is identified by number, and that number is directly
associated with a VE_Port on the extension platform. Tunnels are frequently created between different VE_Ports,
so the tunnel number can be different at each end, from a configuration point of view. Each extension platform,
such as Brocade SX6 Extension Blade, Brocade 7840 Extension Switch, and Brocade FX8-24 Extension Blade, has
a different numbering scheme for its VE_Ports.
An FCIP tunnel carries Fibre Channel traffic (frames) over IP networks such that the Fibre Channel fabric and all
Fibre Channel devices in the fabric are unaware of the IP network’s presence. Fibre Channel frames “tunnel”
through IP networks by dividing the frames, encapsulating the result in IP packets upon entering the tunnel, and
then reconstructing the frames as they leave the tunnel. FCIP tunnels are used to pass Fibre Channel I/O through
an IP network. FCIP tunnels are built on a physical connection between two peer switches or blades. An FCIP
tunnel forms a single logical tunnel from the circuits. A tunnel scales bandwidth with each added circuit, providing
lossless recovery during path failures and ensuring in-order frame delivery. You configure an FCIP tunnel by
specifying a VE_Port for the source and destination interfaces. When you configure the tunnel, you provide the IP
addresses for the source and destination IP interfaces.
Once a TCP connection is established between FCIP entities, a tunnel is established. The two platforms (for
instance, a Brocade SX6 Extension Blade or Brocade 7840 Extension Switch) at the FCIP tunnel endpoints
establish a standard Fibre Channel ISL through this tunnel. Each end of the FCIP tunnel appears to the IP network
as a server, not as a switch. The FCIP tunnel itself appears to the switches to be just a cable. Each tunnel carries a
single Fibre Channel ISL. Load balancing across multiple tunnels is accomplished via Fibre Channel mechanisms,
just as you would do across multiple Fibre Channel ISLs in the absence of FCIP.
Circuits exist within tunnels. A circuit is a connection between a pair of IP addresses that are defined within
an extension tunnel. Circuits provide the links for traffic flow between source and destination interfaces that
are located on either end of the tunnel. Each tunnel can contain a single circuit or multiple circuits. You must
configure unique IPIFs as the local and remote endpoints of each circuit. On the local side, a circuit is configured
with a source IPIF address and a destination address. On the remote side of the circuit, its source IPIF address
is the same as the local-side destination address. Similarly, on the remote side of the circuit its IPIF destination
address points to the local-side source address. Multiple IPIFs can be configured on each physical Ethernet
interface.
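The source/destination pairing described above can be expressed as a small consistency check: a circuit is valid only when each side's source IPIF is the other side's destination. The dictionary layout and addresses below are hypothetical illustrations, not a Brocade FOS structure.

```python
def circuit_endpoints_match(local: dict, remote: dict) -> bool:
    """A circuit's local source must be the remote destination, and vice versa."""
    return local["src"] == remote["dst"] and local["dst"] == remote["src"]

# Hypothetical IPIF addresses for the two ends of one circuit:
local_side  = {"src": "10.0.0.1", "dst": "10.0.0.2"}
remote_side = {"src": "10.0.0.2", "dst": "10.0.0.1"}
assert circuit_endpoints_match(local_side, remote_side)

# A mismatched configuration (both sides pointing the same way) will not come up:
assert not circuit_endpoints_match(local_side, {"src": "10.0.0.1", "dst": "10.0.0.2"})
```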
Extension tunnels created on a Brocade 7840 or Brocade SX6 cannot connect to interfaces on a Brocade 7500
switch, 7800 switch or FX8-24 blade.
VE_Ports
Virtual Expansion Ports (VE_Ports) are special port types that function much like an Expansion
Port (E_Port). The link between two VE_Ports is considered an ISL. FCIP tunnels emulate FC ports on the extension
switch or blade at each end of the tunnel. Once the FCIP tunnels are configured, and the TCP connections are
established for a complete FCIP circuit, a logical ISL is activated between the switches. Once the tunnel and ISL
connection are established between the switches, these logical FC ports appear as “Virtual” E_Ports or VE_Ports.
VE_Ports operate like FC E_Ports for all fabric services and Brocade FOS operations. Rather than using FC as
the underlying transport, however, VE_Ports use TCP/IP over GbE, 10 GbE or 40 GbE (Brocade Communications
Systems, 2017). An FCIP tunnel is assigned to a VE_Port on the switch or blade at each end of the tunnel.
Since multiple VE_Ports can exist on an extension switch or blade, you can create multiple tunnels through the
IP network. Fibre Channel frames enter FCIP through VE_Ports and are encapsulated and passed to TCP layer
connections. An FCIP complex (FCIP Data Engine) on the switch or blade handles the FC frame encapsulation, de-
encapsulation, and transmission to the TCP link.
There are Brocade Application Specific Integrated Circuits (ASICs) internal to the Brocade 7840 and Brocade SX6
that Brocade calls the Condor 3 (C3) for the Brocade 7840 and the Condor 4 (C4) for the Brocade SX6. These
ASICs understand only the Fibre Channel protocol. The VE_Ports are logical representations of actual Fibre
Channel ports on those ASICs. Consider the VE_Ports as the transition point from the Fibre Channel world to the
TCP/IP world inside these devices. In actuality, multiple FC ports “feed” a VE_Port.
FCIP Interfaces
You must configure unique IP interfaces (IPIFs) on each GbE, 10 GbE, or 40 GbE port of the switch or blade that is
used for FCIP traffic. An IPIF consists of an IP address, a netmask, and an MTU size. If the destination FCIP
interface is not on the same subnet as the GbE/10 GbE/40 GbE port IP address, you must configure an IP route
to that destination. You can define up to 32 routes for each GbE/10 GbE/40 GbE port. A port can contain multiple
IP interfaces. As described earlier, you must configure unique IPIFs as the local and remote endpoints of each
circuit, and multiple IPIFs can be configured on each physical Ethernet interface.
On the Brocade FX8-24, each local and remote IP address (IPIF) must be individually unique.
However, on the Brocade 7840 and Brocade SX6, only each address pair must be unique. The circuit configuration
parameters on the local and remote sides of the circuits and tunnel must match in addition to the source and
destination IPIF addresses pointing to each other. For example, if you configure IPsec on a tunnel, each end of the
tunnel must be configured to use the same IPsec parameters. Other parameters for each circuit must match, such
as MTU size, bandwidth allocations, QoS, VLAN ID, and keepalive values.
You must configure a gateway IP route for the circuit to the destination network when the remote IPIF is not on
the same subnet as the local IPIF. You can define a specific number of routes per IPIF based on the extension
platform.
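The subnet rule above can be checked mechanically: a gateway iproute is needed exactly when the remote IPIF falls outside the local IPIF's subnet. A minimal sketch using Python's standard ipaddress module; the addresses are hypothetical.

```python
import ipaddress

def needs_ip_route(local_ipif: str, remote_ipif: str) -> bool:
    """True when the remote IPIF is outside the local subnet, so a gateway
    IP route toward the destination network must be configured."""
    local = ipaddress.ip_interface(local_ipif)   # address with netmask, e.g. 10.1.1.10/24
    remote = ipaddress.ip_address(remote_ipif)
    return remote not in local.network

assert needs_ip_route("10.1.1.10/24", "10.2.1.10") is True   # different subnet: route required
assert needs_ip_route("10.1.1.10/24", "10.1.1.20") is False  # same subnet: no route needed
```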
Circuits
FCIP circuits are the building blocks for FCIP tunnels and Extension Trunking. Circuits provide the links for traffic
flow between source and destination FCIP interfaces that are located on either end of the tunnel. For each
tunnel, you can configure a single circuit or a trunk consisting of multiple circuits. Each circuit is a connection
between a pair of IP addresses that are associated with source and destination endpoints of an FCIP tunnel. In
other words, the circuit is an FCIP connection between two unique IP addresses. An Ethernet interface can have
one or more FCIP circuits, each requiring a unique IP address. Circuits in a tunnel can use the same or different
Ethernet interfaces. Each circuit automatically creates multiple TCP connections that can be used with QoS
prioritization (Guendert, 2017).

The figure below shows an example of two extension tunnels. The first tunnel is a trunk of four circuits, and the
second tunnel is a trunk of two circuits. Each circuit is assigned a unique IPIF address. Those IPIFs are, in turn,
assigned to Ethernet interfaces. In the figure, each IPIF is assigned to a different Ethernet interface. This is not
required; Ethernet interface assignment is flexible depending on the environment's needs, and assignments can
be made as desired. For instance, multiple IP interfaces can be assigned to a single Ethernet interface. The
circuit flows from IP interface to IP interface through the assigned Ethernet interfaces.
Ethernet interfaces are assigned by associating them with IP Interfaces (IPIFs). IPIFs on the extension platforms
are virtual entities within the platform. The IPIFs, via their IP addresses, are assigned to circuits by designating
the source IP address. The circuit is grouped into an extension trunk by designating the VE_Port (or VEX_Port)
to which it belongs. Tunnels/trunks are identified by their VE_Port number. If an extension trunk already exists,
you can create an additional circuit and include it in the tunnel. Each Ethernet interface on the Brocade SX6,
Brocade 7840, or Brocade FX8-24 can be used by circuits from multiple VE/VEX_Ports. Each VE/VEX_Port can use
multiple Ethernet ports. Each VE/VEX_Port can have multiple IP interfaces, or circuits. Each IP interface or circuit
in a VE/VEX_Port is associated with an IP address. In turn, those IP addresses are assigned to specific Ethernet
interfaces.
For Brocade Extension Trunking, the Data Processor (DP) that owns the VE_Port controls all the member circuits.
There is no distributed processing, load sharing or LLL across DPs. Failover between DPs is done at the FC level by
the Brocade FC switching ASIC, provided the configuration permits it. The only components that can be shared are
the Ethernet interfaces. The IP network routes the circuits over separate network paths and WAN links based on
the destination IP addresses and possibly other Layer 2 and Layer 3 header attributes. Ethernet interfaces on the
Brocade SX6, Brocade 7840, and Brocade FX8-24 provide a single convenient connection to the data center LAN
for one to many circuits.
Trunks
An extension trunk is a tunnel consisting of multiple FCIP or IP Extension circuits. Brocade Extension Trunking
provides multiple source and destination addresses for routing traffic over a WAN, allowing load leveling, failover,
failback, in- order delivery, and bandwidth aggregation capabilities. Extension Trunking allows the management
of WAN bandwidth and provides redundant paths over the WAN that can protect against transmission loss due to
WAN failure. Extension Trunking is discussed in more detail in Part 3 Chapter 7.
Encapsulating each FC frame in its own IP packet typically provides poor performance. Brocade technology
encapsulates FC frames within IP packets, but it encapsulates many FICON or FC frames per IP frame. This method, which
is exclusive to Brocade, is known as FCIP batching. The Brocade extension products use batching to improve
overall efficiency, maintain full utilization of links, and reduce protocol overhead. Simply stated, a batch of FC
frames is formed, after which the batch is processed as a single unit. Batching forms FCIP frames at one frame
per batch. A batch is comprised of up to 14 FC frames for open systems, or four FC frames for FICON. These
frames have already been compressed using the exclusive Brocade FCcomp technology. Compression adds two
to four microseconds (µs) additional latency. All the frames have to be from the same exchange’s data sequence
(Brocade Communications Systems, 2017). SCSI commands, transfer readies, and responses are not batched and
are expedited by immediate transmission and/or protocol optimization. If the last frame of the sequence arrives,
the end-of-sequence bit is set. The batch is then known to be complete and is processed and sent immediately.
Figures 10 and 11 illustrate the concepts of FCIP batching.
[Figure: An FCIP batch for FICON holds up to four FC frames behind one FCIP header; an open-systems batch
holds up to 14 FC frames, with the frame carrying the end-of-sequence (EoS) bit closing the batch.]
Figure 7: FICON and Open Systems FCIP batch examples.
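The batching behavior described above can be sketched as a simple accumulator. The batch limits come from the text (14 frames for open systems, four for FICON); everything else, including the frame tuple layout, is an assumption for illustration.

```python
# Frames from one exchange's data sequence accumulate until the batch is full
# or the end-of-sequence (EoS) bit is seen, then the batch is emitted as one
# FCIP frame. Simplified: real batching also compresses frames and expedites
# SCSI commands, transfer readies, and responses without batching them.

BATCH_LIMIT = {"open_systems": 14, "ficon": 4}

def batch_frames(frames, mode="ficon"):
    limit = BATCH_LIMIT[mode]
    batches, current = [], []
    for payload, eos in frames:          # each frame: (payload, end-of-sequence bit)
        current.append(payload)
        if eos or len(current) == limit:
            batches.append(current)      # one FCIP frame per completed batch
            current = []
    if current:
        batches.append(current)
    return batches

# A 6-frame FICON sequence (EoS set on the last frame) yields batches of 4 and 2:
frames = [(f"frame{i}", i == 5) for i in range(6)]
assert [len(b) for b in batch_frames(frames, "ficon")] == [4, 2]
```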
The table below shows the Ethernet interface properties on the Brocade 7840 and Brocade SX6.
The Brocade 7840 Extension Switch provides twenty-four 16 Gbps FC ports (FC0–FC23) numbered 0–23 on the switch,
two 40 GbE ports (ge0–ge1) numbered 0–1 on the switch, and sixteen 1/10 GbE ports (ge2–ge17) numbered 2–17
on the switch. Up to 20 VE_Ports are supported for tunnel configurations. Typically, only one VE_Port is needed
per remote site. The 40 GbE ports are enabled for configuring IP addresses with the 7840 WAN Rate Upgrade 2
license.
• Sixteen Fibre Channel (FC) SFP+ ports that support Fibre Channel Routing Services and connection to FC
devices for the Brocade Extension feature. These ports support 32 Gbps transceivers operating at 8, 16, or
32 Gbps or 16 Gbps transceivers operating at 4, 8, or 16 Gbps. The ports also support 10 Gbps transceivers.
Supported port speed depends on the installed transceiver. FC ports can auto-negotiate speeds with
connecting ports.
• Sixteen 10/1 GbE SFP+ and two 40 GbE QSFP ports. These ports allow connection of blades to IP WANs
and allow Fibre Channel and IP I/O to pass through the IP WAN using extension tunnels. The 10/1 GbE ports
operate at 10 Gbps or 1 Gbps fixed speeds with appropriate 10 Gbps or 1 Gbps transceivers installed. The
40 GbE QSFP ports operate at 40 Gbps fixed speed.
• There is a 40 Gbps full-duplex connection between the Fibre Channel (FC) switching ASIC and external FC
ports and each DP. On the Brocade 7840, the external FC ports are Gen 5 (16 Gbps) and on the Brocade
SX6, they are Gen 6 (32 Gbps).
• FC frames can be compressed with the fast deflate compression hardware. Other software-based
compression options can be configured.
• The maximum bandwidth size of any one tunnel across the internal connections can be no more than
10 Gbps.
• From the network processors, data can be encrypted by the IPsec hardware using high-speed low-latency
hardware-based encryptors. Each DP network processor can produce 20 Gbps of data flow going towards or
coming from the external Ethernet interfaces and the WAN.
• In 10VE mode, each DP complex supports a total of 5 VE_Ports. 10VE mode is required for hybrid mode
operation.
• In 20VE mode, each DP complex supports a total of 10 VE_Ports. 20VE mode is allowed in FCIP mode only.
Figure 12: Brocade 7840 Switch DP components and VE_Port distribution in 20 VE mode.
If a 4:1 compression ratio is achieved using fast deflate compression, then 80 Gbps is available to external FC
ports. The maximum compression ratio depends on a number of factors and is not guaranteed. Typical deflate
compression may achieve different compression ratios. Brocade makes no promises as to the achievable
compression ratios for customer-specific data. The Adaptive Rate Limiting (ARL) aggregate of all circuit maximum
values on a single DP complex cannot exceed 40 Gbps. The ARL aggregate of all circuit minimum values for a
single DP complex cannot exceed 20 Gbps. All circuits include all circuits from all tunnels, not just all circuits from
a single tunnel. ARL is used only when minimum and maximum bandwidth values are configured for a circuit.
Specific VE_Ports are associated with each DP complex. The VE_Port that you configure for a tunnel also selects
the DP complex that is used for processing.
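The ARL aggregate limits above can be expressed as a validation sketch. The 20 Gbps minimum and 40 Gbps maximum aggregates come from the text; the circuit list is hypothetical.

```python
# Aggregate ARL rules for one DP complex on the Brocade 7840: the sum of all
# circuit minimum rates must not exceed 20 Gbps, and the sum of all circuit
# maximum rates must not exceed 40 Gbps -- across ALL tunnels on the DP,
# not just the circuits of one tunnel. Values in Gbps.

ARL_MIN_AGGREGATE_GBPS = 20
ARL_MAX_AGGREGATE_GBPS = 40

def arl_aggregates_ok(circuits):
    """circuits: list of (min_rate, max_rate) for every circuit on one DP complex."""
    total_min = sum(c[0] for c in circuits)
    total_max = sum(c[1] for c in circuits)
    return total_min <= ARL_MIN_AGGREGATE_GBPS and total_max <= ARL_MAX_AGGREGATE_GBPS

# Four circuits across two tunnels on one DP, each 5 Gbps min / 10 Gbps max:
assert arl_aggregates_ok([(5, 10)] * 4)        # 20 min / 40 max: exactly at the limits
assert not arl_aggregates_ok([(5, 10)] * 5)    # 25 min / 50 max: exceeds both limits
```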
• 1 Gbps mode: You can use all ten GbE ports (0 through 9). Both XGE ports are disabled.
• 10 Gbps mode: You can use the xge0 and xge1 ports. GbE ports (0 through 9) are disabled.
• Dual mode: You can use GbE ports 0 through 9, and port xge0. The xge1 port is disabled.
• You can define multiple addresses on Ethernet ports to configure multiple circuits. Multiple circuits can be
configured as a trunk, which provides multiple source and destination addresses to route traffic across an IP
network, provide load leveling, and provide failover capabilities.
• The committed rate for a circuit associated with a physical port cannot exceed the rate of the port.
• In a scenario where a tunnel has multiple circuits of different metrics (0 or 1), circuits with higher metrics (1)
are treated as standby circuits and are only used when all lower metric (0) circuits fail. Using Circuit Failover
Grouping, you can better control which metric 1 circuits will be activated if a metric 0 circuit fails.
• When the spillover is configured, circuit metrics determine which metric 1 circuits to use when capacity exceeds
the metric 0 circuits.
• If the circuit source and destination IP addresses are not on the same subnet, an IP static route (iproute) must
be defined on both sides of the tunnels that designates the gateway IP addresses.
• As a best practice, all tunnel and circuit settings should be identical on both sides of the tunnel. This includes
committed bandwidth, IPsec, compression, ARL minimum and maximum, Fastwrite, OSTP, FICON tunnel, and
Keepalive Timeout values (KATOV). You must configure the tunnel and circuit parameters correctly on both sides
of the network, otherwise the tunnel or circuit will fail to come online.
• VE_Ports or VEX_Ports cannot connect in parallel to the same domain at the same time as Fibre Channel
E_Ports or EX_Ports.
• When load-leveling across multiple circuits, the difference between the ARL minimum data rate set on the
slowest circuit in the trunk and the fastest circuit should be no greater than a factor of four. For example, a
100 Mbps circuit and a 400 Mbps circuit will work, but a 10 Mbps and a 400 Mbps circuit will not work. This
ensures that the entire bandwidth of the trunk can be utilized. If you configure circuits with the committed rates
that differ by more than a factor of four, the entire bandwidth of the trunk might not be fully utilized.
• If the settings are not the same on both sides of the tunnel, Op Status displays UpWarn.
• You can define up to 128 routes per GbE port; however, you can only define 120 IP routes per DP. For example,
you can define 64 IP routes on ge2.dp0 and another 64 IP routes on ge2.dp1.
• On the Brocade 7840, there are two VE_Port groups in 10VE mode. Each port group can share 20 Gbps.
• In 20VE mode on the Brocade 7840, there are four VE_Port groups. Each port group can share 10 Gbps.
• On the Brocade SX6, there are two VE_Port groups in 10VE mode. Each port group can share 20 Gbps.
• In 20VE mode on the Brocade SX6, there are four VE_Port groups. Each port group can share 10 Gbps.
• As a best practice, VE_Ports should not be connected in parallel to the same domain at the same time as Fibre
Channel E_Ports or EX_Ports. Instead, configure tunnels with multiple circuits.
• The minimum committed rate for all VE_Ports in one DP complex cannot exceed 20 Gbps. The maximum rate
for all VE_Ports in one DP complex cannot exceed 40 Gbps.
• With compression, total bandwidth cannot exceed 80 Gbps (40 Gbps per DP) on the Fibre Channel side.
• The difference between the guaranteed (minimum) and maximum bandwidth for a tunnel cannot exceed the
5:1 ratio.
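Two of the rate rules above (the factor-of-four spread across circuits in a trunk, and the 5:1 maximum-to-minimum ratio for a tunnel) can be checked as follows. This is a sketch with hypothetical rates, not a FOS validation routine.

```python
# Rule 1: within a trunk, the fastest circuit's committed rate should be no
# more than 4x the slowest, or part of the trunk bandwidth may go unused.
# Rule 2: a tunnel's maximum (ARL) bandwidth cannot exceed 5x its minimum.

def trunk_rates_ok(circuit_rates, max_ratio=4):
    """circuit_rates: committed rates (e.g., in Mbps) for all circuits in one trunk."""
    return max(circuit_rates) <= max_ratio * min(circuit_rates)

def tunnel_arl_ratio_ok(min_rate, max_rate, max_ratio=5):
    """Guaranteed (minimum) vs. maximum bandwidth for a tunnel."""
    return max_rate <= max_ratio * min_rate

assert trunk_rates_ok([100, 400])          # 4:1 spread: acceptable
assert not trunk_rates_ok([10, 400])       # 40:1 spread: trunk bandwidth wasted
assert tunnel_arl_ratio_ok(1000, 5000)     # 5:1: exactly at the limit
assert not tunnel_arl_ratio_ok(1000, 6000) # 6:1: exceeds the allowed ratio
```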
Circuits:
The number of circuits that you can configure on an Ethernet port is limited only by the number of IP address
pairs available and how the addresses are allocated. Each circuit requires a unique IP address pair. A unique pair
means that at least one of the two addresses (local IPIF and remote IPIF) must be unique in the pair. For example,
the following address pairs use the same source address in each pair, but the destination addresses are different;
therefore, each pair is unique:
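The pair-uniqueness rule can be sketched as a quick check, alongside the stricter per-address rule that applies on the FX8-24. All addresses below are hypothetical.

```python
# On the Brocade 7840/SX6, a circuit's (local, remote) address PAIR must be
# unique, so a single address may appear in several pairs. On the FX8-24,
# both the local and the remote address must each be individually unique.

def pairs_ok_7840(pairs):
    """Valid when every (local, remote) pair is distinct."""
    return len(set(pairs)) == len(pairs)

def pairs_ok_fx824(pairs):
    """Valid only when every local AND every remote address is distinct."""
    locals_, remotes = zip(*pairs)
    return len(set(locals_)) == len(pairs) and len(set(remotes)) == len(pairs)

pairs = [("10.0.1.10", "10.1.1.10"), ("10.0.1.10", "10.1.1.11")]  # shared source address
assert pairs_ok_7840(pairs)       # the pairs differ, so this is valid on the 7840/SX6
assert not pairs_ok_fx824(pairs)  # the source is reused, so this fails on the FX8-24
```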
• There are two VE_Port groups. DP1 controls ports numbered 12 through 21, and DP0 controls ports numbered
22 through 31.
• The blade also supports VEX_Ports to avoid the need to merge fabrics.
• VE_Port versus Ethernet port usage depends on the blade operating mode as follows:
–– 1 Gbps mode: VE_Ports 12 through 21 are available to use GbE ports 0 through 9. VE_Ports 22 through 31,
xge0, and xge1 are not available.
–– 10 Gbps mode: VE_Ports 12 through 21 are available to use xge1; VE_Ports 22 through 31 are available to use
xge0. GbE ports 0 through 9 are not available. You can also configure VE_Ports 12 through 21 to use port xge0
as a crossport and VE_Ports 22 through 31 to use port xge1 as a crossport.
–– Dual mode: VE_Ports 12 through 21 are available to use GbE ports 0 through 9; VE_Ports 22 through 31 are
available to use xge0. Port xge1 is not available.
Circuits:
• A limit of 20 circuits can be configured per VE_Port group (12 through 21 or 22 through 31) when using a
10 GbE port. For the 20 circuits, 10 are configured on local ports and 10 on crossports.
• The FX8-24 blade contains two 10 GbE ports. You can define up to 10 circuits per trunk spread across the
10 GbE ports.
• A limit of 10 circuits can be configured on a single 10 GbE port. Each circuit requires a unique IP address.
• The blade contains ten 1 GbE ports. You can define up to 10 circuits per trunk spread across the GbE ports.
• A limit of four circuits can be configured on a single 1 GbE port. Each circuit requires a unique IP address.
• On the FX8-24, a unique pair means that both of the addresses (local IPIF and remote IPIF) must be unique and
cannot be reused. For example, the following address pairs use the same source address in each pair; only the
destination addresses are different. Because the source addresses are the same, the pairs are not unique:
– --local-ip 10.0.1.10 --remote-ip 10.1.1.10
• For ARL, configure minimum rates of all the tunnels so that the combined rate does not exceed 20 Gbps for all
VE_Ports on the blade.
• For ARL, you can configure maximum rate of 10 Gbps for all tunnels over a single 10 GbE port and 10 Gbps for
any single circuit.
• A circuit between 1 GbE ports cannot exceed the 1 Gbps capacity of the interfaces.
TCP connections ensure in-order delivery of FC frames, error recovery, and lossless transmission in FCIP extension
networks.
TCP achieves this by assigning each byte of data a unique sequence number, maintaining timers, acknowledging
received data through the use of Acknowledgements (ACKs), and retransmitting data if necessary. Once a
connection is established between the endpoints, data can be transferred. The data stream that passes across
the connection is considered a single sequence of eight-bit bytes, each of which is given a sequence number. TCP
does not assume reliability from the lower-level protocols (such as IP), so TCP must guarantee this itself.
TCP can be characterized by the following facilities it provides for applications that use it (Guendert, 2017):
• Stream data transfer: From an application’s viewpoint, TCP transfers a continuous stream of bytes through the
network. The application does not have to bother with chopping the data into basic blocks or datagrams. TCP
does this by grouping the bytes in TCP segments, which are passed to IP for transmission to the destination.
TCP itself decides how to segment the data, and it can forward the data at its own convenience.
• Flow control: The receiving TCP, when sending an ACK back to the sender, also indicates to the sender the
number of bytes it can receive beyond the last received TCP segment without causing overrun and overflow in
its internal buffers. This is sent in the ACK in the form of the highest sequence number it can receive without
problems. This is also known as the TCP window mechanism.
• Reliability: TCP assigns a sequence number to each byte transmitted and expects a positive ACK from the
receiving TCP. If the ACK is not received within a timeout interval, the data is retransmitted. Since the data is
transmitted in blocks (TCP segments), only the sequence number of the first data byte in the segment is sent
to the destination host. The receiving TCP uses the sequence numbers to rearrange the segments when they
arrive out of order, and to eliminate duplicate segments.
• Logical connections: The reliability and flow control mechanisms described above require that TCP initializes
and maintains certain status information for each data stream. The combination of this status—including
sockets, sequence numbers, and window sizes—is called a logical connection in TCP. Each such connection is
uniquely identified by the pair of sockets used by the sending and receiving processes.
• Multiplexing: This is achieved through the use of ports, just as with UDP.
It is helpful to review some of the terminology used with TCP to create an understanding of its basics and how it
functions to provide in-order delivery and lossless transmission in an FCIP network.
TCP Segment
TCP segments are units of transfer for TCP and are used to establish a connection, transfer data, send ACKs,
advertise window size, and close a connection. Each segment is divided into three parts (IETF, 2004):
• Header
• Options
• Data
Acknowledgements (ACKs)
The TCP Acknowledgement scheme is cumulative, as it acknowledges all the data received up until the time the
ACK was generated. As TCP segments are not of uniform size, and a TCP sender may retransmit more data than
what was in a missing segment, ACKs do not acknowledge the received segment. Rather, they mark the position
of the acknowledged data in the stream. The policy of cumulative acknowledgement makes the generation of
ACKs easy; the loss of an ACK does not force the sender to retransmit data. The disadvantage is that the sender does
not receive any detailed information about the data received except the position in the stream of the last byte that
has been received.
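The cumulative scheme can be illustrated with a toy model: the ACK reports only the position after the last contiguously received byte, regardless of what has arrived beyond a gap. A sketch, not an implementation of TCP.

```python
def cumulative_ack(received, initial_seq=0):
    """received: set of byte sequence numbers received so far.
    Returns the next byte expected, i.e. the cumulative ACK value."""
    ack = initial_seq
    while ack in received:
        ack += 1
    return ack

# Bytes 0-999 and 1500-1999 arrived; bytes 1000-1499 were lost in transit.
received = set(range(0, 1000)) | set(range(1500, 2000))
# The ACK stops at the gap: data beyond byte 999 is held but not reported.
assert cumulative_ack(received) == 1000
```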
Delayed ACKs
Delayed ACKs allow a TCP receiver to refrain from sending an ACK for each incoming segment. However, a receiver
should send an ACK for every second full-sized segment that arrives. Furthermore, the standard mandates that a
receiver must not withhold an ACK for more than 500 milliseconds (ms). The receivers should not delay ACKs that
acknowledge out-of-order segments.
Selective Acknowledgements (SACKs)
With Selective Acknowledgement, a receiver can report non-contiguous blocks of received data, so the sender
retransmits only the missing segments. SACK information may be sent over an established connection once
permission has been given via the SACK-Permitted option exchanged at connection setup (Brocade
Communications Systems, 2017).
Retransmission
A TCP sender starts a timer when it sends a segment and expects an Acknowledgement for the data it sent. If
the sender does not receive an Acknowledgement for the data before the timer expires, it assumes that the data
was lost or corrupted and retransmits the segment. Since the time required for the data to reach the receiver and
the Acknowledgement to reach the sender is not constant (because of the varying network delays), an adaptive
retransmission algorithm is used to monitor performance of each connection and conclude a reasonable value for
timeout based on the RTT.
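The adaptive algorithm mentioned above is typically implemented along the lines of RFC 6298: a smoothed RTT (SRTT) and an RTT variance (RTTVAR) are maintained per connection and combined into the timeout. A simplified sketch; real implementations also clamp the RTO to a minimum value and apply a clock granularity.

```python
ALPHA, BETA = 1 / 8, 1 / 4  # standard RFC 6298 smoothing gains

class RtoEstimator:
    """Adaptive retransmission timeout from per-connection RTT samples."""
    def __init__(self):
        self.srtt = None
        self.rttvar = None

    def update(self, rtt: float) -> float:
        if self.srtt is None:                    # first measurement
            self.srtt, self.rttvar = rtt, rtt / 2
        else:
            self.rttvar = (1 - BETA) * self.rttvar + BETA * abs(self.srtt - rtt)
            self.srtt = (1 - ALPHA) * self.srtt + ALPHA * rtt
        return self.srtt + 4 * self.rttvar       # retransmission timeout (RTO)

est = RtoEstimator()
rto = est.update(0.100)      # first 100 ms sample: RTO = 0.100 + 4 * 0.050 = 0.300 s
assert abs(rto - 0.300) < 1e-9
rto = est.update(0.100)      # steady RTT: variance shrinks, so the RTO decreases
assert rto < 0.300
```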
TCP Window
A TCP window is the amount of data that a sender can send without waiting for an ACK from the receiver. The
TCP window is a flow control mechanism and ensures that no congestion occurs in the network. For example,
if a pair of hosts are talking over a TCP connection that has a TCP window size of 64 KB, the sender can send
only 64 KB of data, and it must stop and wait for an ACK from the receiver that some or all of the data has been
received (Jonnakuti, 2013). If the receiver acknowledges that all the data has been received, the sender is free
to send another 64 KB. If the sender gets back an ACK from the receiver that it received the first 32 KB (which
is likely if the second 32 KB was still in transit or it is lost), then the sender can send only another 32 KB, since
it cannot have more than 64 KB of unacknowledged data outstanding (the second 32 KB of data plus the third).
The window size is determined by the receiver when the connection is established and is variable during the data
transfer. Each ACK message includes the window size that the receiver is ready to deal with at that particular
instant.
The primary reason for the window is congestion control. The whole network connection, which consists of the
hosts at both ends, the routers in between, and the actual connections themselves, might have a bottleneck
somewhere that can handle only so much data so fast. The TCP window throttles the transmission speed down to
a level where congestion and data loss do not occur.
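The sender-side bookkeeping in the 64 KB example above can be expressed as a small helper; `sendable` is a hypothetical name for illustration, not a real API:

```python
def sendable(window: int, next_seq: int, last_acked: int) -> int:
    """Bytes the sender may still put on the wire: the advertised
    window minus the data already sent but not yet acknowledged."""
    outstanding = next_seq - last_acked
    return max(0, window - outstanding)

# The example from the text: a 64 KB window, 64 KB already sent.
# Until an ACK arrives, nothing more may be sent; once the first
# 32 KB is acknowledged, another 32 KB opens up.
```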
Usable Window
This is the minimum of the receiver’s advertised window and the sender’s congestion window. It is the actual
amount of data the sender is able to transmit. The TCP header uses a 16-bit field to report the receive window
size to the sender. Therefore, the largest window that can be used is 2^16 - 1 = 65,535 bytes (64 KB).
Window Scaling
The ordinary TCP header allocates only 16 bits for window advertisement. This limits the maximum window that
can be advertised to 64 KB, limiting the throughput. RFC 1323 provides the window scaling option, to be able to
advertise windows greater than 64 KB. Both the endpoints must agree to use window scaling during connection
establishment.
The window scale extension expands the definition of the TCP window to 32 bits and then uses a scale factor to
carry this 32-bit value in the 16-bit Window field of the TCP header (SEG.WND in RFC-793). The scale factor is
carried in a new TCP option—Window Scale. This option is sent only in a SYN segment (a segment with the SYN bit
on), hence the window scale is fixed in each direction when a connection is opened.
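Under window scaling, the receiver's true window is the 16-bit field value shifted left by the scale factor agreed in the SYN exchange; RFC 1323 limits the shift count to 14. A minimal sketch:

```python
def effective_window(field_value: int, shift_count: int) -> int:
    """Actual receive window under RFC 1323 window scaling: the
    16-bit Window field left-shifted by the negotiated scale factor."""
    if not (0 <= field_value <= 0xFFFF):
        raise ValueError("Window field is only 16 bits")
    if not (0 <= shift_count <= 14):
        raise ValueError("RFC 1323 limits the shift count to 14")
    return field_value << shift_count
```

For example, a field value of 4096 with a scale factor of 7 advertises a 512 KB window, well beyond the unscaled 64 KB ceiling.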
Network Congestion
A network link is said to be congested if contention for it causes queues to build up and packets start getting
dropped. The TCP protocol detects these dropped packets and starts retransmitting them, but using aggressive
retransmissions to compensate for packet loss tends to keep systems in a state of network congestion even after
the initial load has been reduced to a level that would not normally induce network congestion. In this situation,
demand for link bandwidth (and eventually queue space) outstrips what is available. When congestion occurs,
all the flows that detect it must reduce their transmission rate. If they do not do so, the network remains in an
unstable state, with queues continuing to build up.
TCP uses two congestion control algorithms, invoked in the following situations:
• Slow start: After a connection is established and after a loss is detected by a timeout or by duplicate ACKs.
• Congestion avoidance: In all other cases, congestion avoidance and slow start work hand-in-hand. The
congestion avoidance algorithm assumes that the chance of a packet being lost due to damage is very small.
Therefore, the loss of a packet means there is congestion somewhere in the network between the source and
destination. The occurrence of a timeout or the receipt of duplicate ACKs indicates packet loss.
When congestion is detected in the network, it is necessary to slow things down, so the slow start algorithm is
invoked (Jonnakuti, 2013). Two parameters, the congestion window (cwnd) and a slow start threshold (ssthresh),
are maintained for each connection. When a connection is established, both of these parameters are initialized.
The cwnd is initialized to one MSS. The ssthresh is used to determine whether the slow start or congestion
avoidance algorithm is to be used to control data transmission. The initial value of ssthresh may be arbitrarily high
(usually ssthresh is initialized to 65535 bytes), but it may be reduced in response to congestion.
The slow start algorithm is used when cwnd is less than ssthresh, whereas the congestion avoidance algorithm
is used when cwnd is greater than ssthresh. When cwnd and ssthresh are equal, the sender may use either slow
start or congestion avoidance.
TCP never transmits more than the minimum of cwnd and the receiver’s advertised window. When a connection
is established, or if congestion is detected in the network, TCP is in slow start and the congestion window is
initialized to one MSS. Each time an ACK is received, the congestion window is increased by one MSS. The sender
starts by transmitting one segment and waiting for its ACK. When that ACK is received, the congestion window is
incremented from one to two, and two segments can be sent. When each of those two segments is acknowledged,
the congestion window is increased to four, and so on. The window size increases exponentially during slow start.
When a time-out occurs or a duplicate ACK is received, ssthresh is reset to one half of the current window (that is,
the minimum of cwnd and the receiver’s advertised window). If the congestion was detected by an occurrence of a
timeout, the cwnd is set to one MSS.
When an ACK is received for data transmitted the cwnd is increased, but the way it is increased depends on
whether TCP is performing slow start or congestion avoidance. If the cwnd is less than or equal to the ssthresh,
TCP is in slow start. Slow start continues until TCP is halfway to where it was when congestion occurred, then
congestion avoidance takes over. Congestion avoidance increments the cwnd by MSS squared divided by cwnd
(in bytes) each time an ACK is received, increasing the cwnd linearly. This provides a close approximation to
increasing cwnd by, at most, one MSS per RTT (Jonnakuti, 2013).
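The per-ACK window growth and the timeout reaction described above can be sketched as follows; the 1460-byte MSS is an assumed typical Ethernet value, and the function names are illustrative:

```python
MSS = 1460  # assumed typical Ethernet maximum segment size

def next_cwnd(cwnd, ssthresh, mss=MSS):
    """Congestion window after one ACK: +1 MSS per ACK in slow start
    (exponential growth per RTT); +MSS*MSS/cwnd per ACK in congestion
    avoidance (roughly +1 MSS per RTT, i.e. linear growth)."""
    if cwnd <= ssthresh:
        return cwnd + mss                      # slow start
    return cwnd + max(1, mss * mss // cwnd)    # congestion avoidance

def on_timeout(cwnd, rwnd, mss=MSS):
    """Timeout reaction: ssthresh drops to half the current window
    (the minimum of cwnd and the receiver's advertised window), and
    cwnd restarts at one MSS, so slow start begins again."""
    new_ssthresh = max(2 * mss, min(cwnd, rwnd) // 2)
    return mss, new_ssthresh
```

Starting from one MSS, repeated calls to `next_cwnd` double the window each round trip until ssthresh is reached, then grow it by about one MSS per round trip, matching the behavior described in the text.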
A TCP receiver generates ACKs on receipt of data segments. The ACK contains the highest contiguous sequence
number the receiver expects to receive next. This informs the sender of the in-order data that was received by the
receiver. When the receiver receives a segment with a sequence number greater than the sequence number it
expected to receive, it detects the out-of-order segment and generates an immediate ACK with the last sequence
number it has received in-order (that is, a duplicate ACK). This duplicate ACK is not delayed. Since the sender does
not know if this duplicate ACK is a result of a lost packet or an out-of-order delivery, it waits for a small number
of duplicate ACKs, assuming that if the packets are only reordered, there are only one or two duplicate ACKs
before the reordered segment is received and processed and a new ACK is generated. If three or more duplicate
ACKs are received in a row, it implies there has been a packet loss. At that point, the TCP sender retransmits this
segment without waiting for the retransmission timer to expire. This is known as fast retransmit (IETF, 2004).
After fast retransmit has sent the supposedly missing segment, the congestion avoidance algorithm is invoked
instead of the slow start; this is called fast recovery. Receipt of a duplicate ACK implies that not only is a packet
lost, but there is data still flowing between the two ends of TCP, as the receiver generates only a duplicate ACK on
receipt of another segment. Hence, fast recovery allows high throughput under moderate congestion.
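The duplicate-ACK counting that triggers fast retransmit can be sketched as a small state machine; the dictionary-based state and the function name are illustrative only:

```python
def on_ack(state, ack_seq):
    """Process one incoming ACK. A repeat of the last ACK number is a
    duplicate; the third duplicate in a row triggers fast retransmit
    of the segment the receiver is still waiting for."""
    if ack_seq == state["last_ack"]:
        state["dup_count"] += 1
        if state["dup_count"] == 3:
            return ("fast_retransmit", ack_seq)
    else:
        state["last_ack"] = ack_seq        # new data acknowledged
        state["dup_count"] = 0
    return ("ok", None)
```

One or two duplicates are tolerated as possible reordering; only the third forces a retransmission without waiting for the timer, exactly as the text describes.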
• Routers and firewalls that are in the data path must be configured to pass FCIP traffic (TCP port 3225) and
IPsec traffic, if IPsec is used (UDP port 500).
• To enable recovery from a WAN failure or outage, be sure that diverse, redundant network paths are available
across the WAN.
• Be sure the underlying WAN infrastructure is capable of supporting the redundancy and performance expected
in your implementation.
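As an illustration of the first bullet, hypothetical Linux iptables rules on a router in the data path might pass FCIP (TCP port 3225) and IKE for IPsec (UDP port 500); chain names, interfaces, and rule ordering would depend entirely on the actual network, so treat this only as a sketch:

```shell
# Pass FCIP control/data traffic (TCP port 3225) through the forwarding path.
iptables -A FORWARD -p tcp --dport 3225 -j ACCEPT
iptables -A FORWARD -p tcp --sport 3225 -j ACCEPT
# If IPsec is used, also pass the IKE key exchange (UDP port 500).
iptables -A FORWARD -p udp --dport 500 -j ACCEPT
iptables -A FORWARD -p udp --sport 500 -j ACCEPT
```

Depending on the IPsec mode in use, a real deployment may also need to pass ESP (IP protocol 50) or NAT-traversal traffic (UDP port 4500).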
Conclusion
This chapter continued from where Part 3 Chapter 1 started, with an overview of the various distance extension
technology options, and then focused in depth on one of those options: FCIP. The chapter covered the basics of
FCIP, including a look at the relevant standards. It covered the building blocks of FCIP in a discussion of tunnels
and circuits. The chapter then discussed some Brocade-specific FCIP technologies such as FCIP batching, as well
as specifics to the Brocade 7840 Extension Switch and the Brocade FX8-24 and SX6 Extension Blades. Some
time was then spent on an in-depth discussion of Brocade FCIP Trunking. Finally, FICON protocol emulation was
discussed.
References
Brocade Communications Systems. (2017). Brocade Advanced Accelerator for FICON-Setting a New Standard for
FICON Extension. San Jose, CA: Brocade Communications Systems.
Brocade Communications Systems. (2011). Technical Brief: FICON Over Extended Distances and IU Pacing. San
Jose, CA: Brocade Communications Systems.
Brocade Communications Systems. (2015). Brocade Adaptive Rate Limiting. San Jose, CA: Brocade
Communications Systems.
Brocade Communications Systems. (2015). Brocade Open Systems Tape Pipelining, FastWrite, and Storage-
Optimized TCP. San Jose, CA: Brocade Communications Systems.
Brocade Communications Systems. (2015). Brocade Extension Trunking. San Jose, CA: Brocade Communications
Systems.
Brocade Communications Systems. (2017). Brocade Fabric OS Extension Configuration Guide FOS 8.1.0. San
Jose, CA: Brocade Communications Systems.
Flam, G., & Guendert, S. (2011). Demystifying Extended Distance FICON AKA Persistent IU Pacing. Proceedings of
The International Computer Measurement Group. Turnersville, NJ: Computer Measurement Group, Inc.
Guendert, S. (2017). Fibre Channel over Internet Protocol Basics for Mainframers. Enterprise Tech Journal.
IETF. (2004, July). IETF RFC 3821. Retrieved from IETF RFC 3821: http://tools.ietf.org/pdf/rfc3821.pdf
PART 3 Chapter 6: IP Extension
Introduction
IP data flows across Brocade tunnels are referred to as IP Extension. The Brocade 7840 Switch and
Brocade SX6 Extension Blade support IP-based storage data flows as well as FC/FICON based data flows. IP
extension provides enterprise-class support for IP storage applications, using existing IP Wide Area Network
(WAN) infrastructure. IP Extension features are offered only on the Brocade 7840 Extension Switch and the
Brocade SX6 Extension Blade platforms. IP Extension enables you to use the existing IP WAN infrastructure
to connect IP storage. Additionally, IP Extension gives you visibility and control of flows using various
diagnostic tools, IPsec, compression, QoS, Extension Trunking, and lossless tunnel resiliency. IP Extension
supports applications such as array native IP Remote Data Replication (RDR), IP-based centralized backup,
VM replication; host based and database replication over IP, NAS head replication between data centers,
data migration between data centers and others.
• Brocade SX6 Extension Blade installed in a supported platform such as the Brocade X6-4 Director or
Brocade X6-8 Director.
IP Extension provides Layer 3 extension for IP storage replication. The IP extension platform acts as the
gateway for the LAN, but there is no Layer 2 extension, which means each side of the network must be on a
different subnet. The extended IP traffic receives the same benefits as does traditional FCIP traffic:
• Compression
• High-speed encryption
Brocade WAN Optimized TCP (WO-TCP) ensures in-order, lossless transmission of IP Extension data. IP
Extension establishes a proxy TCP endpoint for local devices, so local devices are unaware of, and unaffected
by, the latency and quality of the IP WAN. This accelerates the end devices' native TCP. IP Extension data
crosses the IP WAN using WO-TCP, a highly efficient and aggressive TCP stack for moving data between data
centers.
IP extension switching and processing is done within the Ethernet switch, the Field-Programmable Gate
Arrays (FPGAs), and the Data Processors (DPs). IP extension uses VE_Ports as well, even if no FC or FICON
traffic is being transported. VE_Ports are logical entities that are more accurately thought of as a tunnel
endpoint than as a type of FC port. IP extension uses Switch Virtual Interfaces (SVIs) on each DP as a
communication port with the IP storage end devices. The SVI becomes the gateway IP interface for data
flows headed toward the remote data center. At the SVI, end-device TCP sessions are terminated on the
local side and reformed on the remote side. This is referred to as TCP Proxying, and it has the advantage of
providing local acknowledgments and acceleration. WAN-Optimized TCP (WO-TCP) is used as the transport
between data centers. Which side is “local” or “remote” depends on where the TCP session initiates.
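TCP proxying as described (terminate the session locally, carry the bytes on a second session, re-form it on the far side) can be illustrated with a deliberately simplified relay. Plain TCP stands in for the WO-TCP leg, and none of this reflects the actual switch firmware; it only shows why acknowledgments stay local:

```python
import socket
import threading

def relay(src, dst):
    """Copy bytes in one direction until the source side closes."""
    while True:
        data = src.recv(4096)
        if not data:
            break
        dst.sendall(data)
    try:
        dst.shutdown(socket.SHUT_WR)
    except OSError:
        pass

def start_proxy(remote_host, remote_port):
    """Listen on an ephemeral local port; for each accepted connection,
    terminate it locally (its ACKs never cross the WAN) and carry the
    bytes over a second TCP session toward the remote side."""
    lsock = socket.socket()
    lsock.bind(("127.0.0.1", 0))
    lsock.listen(5)

    def serve():
        while True:
            local, _ = lsock.accept()      # local leg: ACKs stay local
            remote = socket.create_connection((remote_host, remote_port))
            threading.Thread(target=relay, args=(local, remote),
                             daemon=True).start()
            threading.Thread(target=relay, args=(remote, local),
                             daemon=True).start()

    threading.Thread(target=serve, daemon=True).start()
    return lsock.getsockname()[1]          # the port end devices connect to
```

The end device completes its TCP handshake and receives its acknowledgments against the nearby proxy, while the proxy-to-proxy leg absorbs the WAN latency.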
Up to eight of the 10 GbE interfaces can be used for IP extension Local Area Network (LAN) side (end-
device or data center LAN switch) connectivity. Eight 10 GbE interfaces are reserved for WAN side (tunnel)
connectivity.
Superior Performance
Simply stated, IP storage applications can gain significant increases in performance across the WAN
between two data centers when they use a Brocade IP Extension solution. The more latency and packet
loss between the data centers, the greater the gain. The Brocade 7840 Extension Switch and the Brocade
SX6 Extension Blade provide performance gains up to 50 times the performance of native Transmission
Control Protocol/Internet Protocol (TCP/IP) stacks. Such performance gains enable use cases that were not
previously feasible.
Brocade IP Extension technology helps IP storage devices achieve the fastest speeds possible, by leveraging
local TCP acknowledgements (TCP acceleration). The Brocade 7840 Extension Switch and the Brocade
SX6 Extension Blade do this by terminating the end-device TCP sessions locally. Data is then optimally and
rapidly transported across the WAN using WAN-optimized TCP technology from Brocade. At the remote site,
another local TCP session is established to communicate with the destination end device by using local
acknowledgements.
Adaptive Rate Limiting (ARL) is used to maximize WAN utilization while sharing a link with other non-storage
applications. Native TCP/IP stacks on IP storage applications do not have the ability to adaptively alter rate
limiting based on conditions in the IP network, but the Brocade 7840 Extension Switch and the Brocade SX6
Extension Blade do have this ability. Brocade Adaptive Rate Limiting dynamically adjusts shared bandwidth
between minimum and maximum rate limits and drives maximum I/O even with a downed redundant
system. For more details on ARL, please see chapter 7.
The Brocade 7840 Extension Switch and the Brocade SX6 Extension Blade use aggressive deflate
compression, making the most out of available bandwidth. Typically, aggressive deflate mode achieves 4:1
compression. Brocade IP Extension compression offloads compression work from hosts and appliances, or
adds compression to flows that the application itself cannot compress. On a host running
Virtual Machines (VMs), IP Extension-based compression permits the processor capacity of the host server
to be utilized more efficiently for additional VMs.
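The effect of deflate compression on bandwidth can be demonstrated with the standard zlib library, which implements DEFLATE. The Brocade platforms do this in hardware and the achieved ratio depends entirely on the data, so this is only illustrative:

```python
import zlib

def deflate_ratio(payload: bytes, level: int = 9) -> float:
    """Compression ratio achieved by DEFLATE on a payload
    (original size divided by compressed size)."""
    return len(payload) / len(zlib.compress(payload, level))
```

Highly repetitive replication data compresses far better than the typical 4:1 figure, while already-compressed or encrypted data may not compress at all.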
Increased Security
All data leaving the confines of a data center should be secured. The Brocade 7840 and the Brocade
SX6 have a robust hardware-based IPsec implementation to provide assurance that security compliance
requirements are met. The use of a hardware-based IPsec implementation that encrypts data flows over
distance without a performance penalty allows ultra-low added latency (5 μs) and line rate operation for four
10 gigabit per second (Gbps) connections. When using IPsec, there is no need for intermediate firewalls,
which tend not to provide equivalent performance. Intermediate firewalls also affect throughput, increase
complexity, add points of failure, and increase Total Cost of Ownership (TCO)—in other words, they reduce
Return on Investment (ROI). The Brocade implementation of IPsec protects data from end to end and
minimizes exposure to data breaches to avoid unwanted publicity. The Brocade 7840 Extension Switch
and the Brocade SX6 Extension Blade include IPsec in the base unit, and there are no additional licenses
or fees. IPsec on the Brocade 7840 and the Brocade SX6 implements Advanced Encryption Standard
(AES) 256 encryption, Internet Key Exchange version 2 (IKEv2), Secure Hash Algorithm (SHA)-512 Hashed
Message Authentication Code (HMAC).
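The SHA-512 HMAC named above can be illustrated with Python's standard hmac module; the switch performs this in its hardware IPsec engine, so the sketch is purely conceptual:

```python
import hashlib
import hmac

def tag(key: bytes, message: bytes) -> bytes:
    """Compute an HMAC-SHA-512 authentication tag for a message."""
    return hmac.new(key, message, hashlib.sha512).digest()

def verify(key: bytes, message: bytes, received: bytes) -> bool:
    """Constant-time check that a received tag matches the message,
    avoiding timing side channels."""
    return hmac.compare_digest(tag(key, message), received)
```

A tampered message fails verification, which is how the IPsec integrity check detects modification in transit.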
Continuous Availability
Extension Trunking is a Brocade technology originally developed for mainframes and now broadly used in
open systems environments. Extension Trunking shields end devices from IP network disruptions, making
network path failures transparent to replication traffic. Multiple circuits (two or more) from the Brocade
7840 and the Brocade SX6 are applied to various paths across the IP network. The most common example
of this is redundant Data Center Local-Area Network (DC-LAN) switches. For instance, one circuit goes over
DC-LAN Switch A, and the other circuit goes to DC-LAN Switch B. This is a simple and effective architecture
for redundancy and increased availability. Of course, as needs dictate, the application of circuits over
various paths and service providers (up to eight circuits per IP extension tunnel) can establish a highly
available infrastructure for IP storage.
Extension Trunking supports aggregation of multiple WAN connections with different latency or throughput
characteristics (up to a 4:1 ratio), allowing you to procure WAN circuits from multiple service providers with
different physical routes, to ensure maximum availability. If all WAN circuits are from the same service
provider, then chances are high that a single failure event (for example, equipment failure, power loss,
or cable cut) can take down all WAN circuits at one time. With Extension Trunking, you can protect your
replication traffic from these kinds of outage events.
Extension Trunking offers more than the ability to load balance and failover/failback data across circuits.
Extension Trunking is always a lossless function, providing in-order delivery within an extension trunk
(defined by a Virtual Expansion_Port, or VE_Port). Even when data in-flight is lost due to a path failure, data
is retransmitted over remaining circuits via TCP and placed back in-order before it is sent to Upper Layer
Protocol (ULP). IP storage applications are never subjected to lost data or out-of-order data across the WAN.
Normally, when Brocade IP Extension is not used, packet loss in the IP network results in extremely low
performance on native IP Storage TCP/IP stacks. These stacks have little to no tolerance for packet loss
across the IP network. On popular Network-Attached Storage (NAS) platforms during periods of 0.1 percent
packet loss and 5 milliseconds (ms) Round Trip Time (RTT) latency, the reduction in throughput is 95
percent or more. Extension Trunking will be discussed in more detail in Part 3 Chapter 7.
ARL is used with Extension Trunking to maintain available bandwidth to storage applications. For example,
if DC-LAN Switch A goes down, then as long as DC-LAN Switch B remains online and has ample bandwidth
connectivity, it should be able to maintain all the original bandwidth to the application. In this case, it is
necessary for rate limiting to readjust upward and compensate for the lost pathway. Rate limiting is used to
prevent oversubscribing the WAN and any associated contention or congestion. Congestion events force TCP
to perform flow control, which is extremely inefficient, slow to react, and results in poor performance. ARL
adjusts from a normal condition that is not oversubscribed to an outage condition that maintains the same
bandwidth. Clearly, this is essential to continuous availability.
Operational Excellence
Storage and mainframe administrators need visibility, monitoring, reporting, and diagnostic
tools to help them gain operational excellence. Brocade provides features and functionality that help
administrators more efficiently perform their jobs and maintain SLAs within their organizations. In most
cases, handing off IP storage connections to the IP network creates operational boundaries that are difficult
to manage. Refer to Figure 1. If the IP network goes completely down, the situation is not difficult to prove.
However, if the network is only degraded or suboptimal, what do you do? Is it obvious when there is a
degraded condition? How do you back up your suspicions? How do you document the situation? Where is
the problem located: in the end devices, local IP network, remote IP network, service provider, or somewhere
else? How should you respond to a suspected problem? Can you rely on the network team to do their due
diligence without providing documentation and proof to back up the storage/mainframe operational issues?
A large percentage of storage and mainframe administrators encounter these and similar major problems
with storage applications across the IP network.
Consolidating only IP storage flows or both IP storage and Fibre Channel/Fiber Connectivity (FC/FICON) flows
into a single tunnel significantly increases operational excellence. Operationally, the managed tunnel offers
the advantages of Brocade Fabric Vision technology, the Monitoring Alerting Policy Suite (MAPS), the WAN
testing tool (Wtool), and Brocade Network Advisor.
IP Extension enables consolidated flows from heterogeneous devices and multiple protocols. The increased
insight gained with a holistic view and correlation, instrumentation, and granularity simplifies management
across data centers. Custom browser-accessible dashboards for IP storage or combined Fibre Channel
and IP storage provide storage administrators with a centralized management tool to monitor the health
and performance of their network. Brocade Fabric Vision technology extends proactive monitoring between
data centers, to automatically detect WAN anomalies and avoid unplanned downtime. This simplifies
troubleshooting: administrators can quickly identify issues and ownership, resulting in quicker resolutions.
The Brocade 7840 and the Brocade SX6 provide a built-in traffic generator and WAN test tool to prevalidate
and troubleshoot the physical infrastructure, streamlining deployments and preventing issues from arising.
The Situation
Have no fear. We are not about to start talking about characters from the horrible TV show Jersey Shore.
This situation refers to the internal politics present in the majority of information technology organizations
that use IBM mainframes.
A wide variety of storage applications use native IP as the transport between data centers. Storage
administrators hand off these IP connections to LAN network administrators for transport from the local
data center to remote data centers. Ultimately, storage and mainframe administrators are held responsible
for their applications, regardless of the IP network’s behavior (over which they have no control and often
no visibility or diagnostic capability). This creates an awkward situation in which storage and mainframe
administrators are responsible for something that is not within their control. Brocade and our OEMs
understand this situation and are taking steps to help administrators “take control.”
How can administrators take control of IP storage flows? Clearly, the IP network must continue to
transport data between data centers across the WAN infrastructure. Fortunately, this does not change.
The fundamental difference lies in the handoff of IP storage flows to the network. Packaging all flows—FC,
FICON, and IP—into a single managed tunnel provides some key benefits: visibility, acceleration, security,
and prioritization of IP Extension flows as well as bandwidth management and pooling.
Visibility
Making flows visible across the network requires a hardware system (the Brocade 7840 platform with Gen
5 Fibre Channel technology or the Brocade SX6 Extension Blade with Gen 6 Fibre Channel technology),
an operating system (Brocade Fabric OS), and a management platform (Brocade Network Advisor). These
Brocade technologies include Brocade Fabric Vision technology, MAPS, and Wtool. Knowing information
about specific flows empowers storage and mainframe administrators by ensuring that relevant
conversations with network administrators concerning non-optimal or degraded application conditions are
documented. Additionally, leveraging visibility and network tools resolves support tickets significantly
more quickly.
Brocade makes many aspects of monitoring the extension tunnel over the IP network available. These
statistics are captured over time and displayed on graphs. Brocade Network Advisor allows you to play back
the behavior of your environment, since all monitored statistics are timestamped and held in a database.
The database can play back timed events so that storage and mainframe administrators can see and
correlate events as they happened. For example, you might see that an unusual number of dropped packets
and duplicate ACKs (Acknowledgements) occurred before a circuit bounced. This indicates that the IP
network was unable to deliver datagrams to the remote side during this period of time, which correlates with
the circuit going down. The support team can interpret such information furnished by Brocade tools to help
pinpoint and resolve conditions within a troubled environment. Monitored statistics include:
• RTT
• Dropped packets
• Out-of-order segments
• Duplicate ACKs
• Compression ratio
• Fast retransmits
• Slow starts
• Circuit bounces
• Interface states
• Number of flows
In addition, monitored characteristics of the IP network are part of the Brocade Monitoring and Alerting
Policy Suite (MAPS); therefore, MAPS actions can alert users to degraded conditions across the IP network. Armed
with this service-level information, storage and mainframe administrators now can open trouble tickets with
their internal network administrator counterparts to restore optimal operations across the IP network. For
additional information, refer to the Brocade white paper The Benefits of Brocade Fabric Vision Technology
for Disaster Recovery.
IP Extension on the Brocade 7840 and the Brocade SX6 terminates IP storage TCP flows locally and
transports the data across the WAN using a specialized TCP transport called WAN-Optimized TCP (WO-TCP).
The primary benefit here is the local ACK. Because ACKs stay within the local data center, the TCP stack of
an end IP storage device needs to sustain high speed only within the data center, not across the WAN. Most
native IP storage TCP stacks are capable only of high speeds over short distances. Beyond the limits of the
data center, “droop” becomes a significant factor. Droop refers to the inability of TCP to maintain line rate
across distance. Droop worsens progressively as distance increases.
WO-TCP is an aggressive TCP stack designed for Big Data movement, operating on the purpose-built
hardware of the Brocade 7840 Extension Switch and the Brocade SX6 Extension Blade. The Brocade
7840 and the Brocade SX6 offer 64 processors and 128 gigabytes (GB) of RAM to support WO-TCP. Unlike
native TCP stacks, WO-TCP has no droop across two 10 Gbps connections at up to 160 ms RTT per data processor.
This is equivalent to two fully utilized 10 Gbps WAN connections (OC-192) between Los Angeles and
Hong Kong.
Another benefit is the absence of Head of Line Blocking (HoLB) or Slow Drain Device (SDD) problems
with WO-TCP streams. WO-TCP on the Brocade 7840 and the Brocade SX6 implements a feature called
“streams.” Streams are used to mitigate HoLB and SDD problems across the IP network and WAN. If all IP
storage flows are flow controlled using a single TCP Receiver Window (rwnd), then all those flows slow or halt
in the event that the TCP Receiver Window is shut or closed. This situation is detrimental to all applications
except for the one that flow control is meant for. It is not practical to create a separate TCP connection for
each flow, as this consumes excessive resources. Instead, autonomous streams are created using virtual
TCP windows for each stream. The Brocade 7840 and the Brocade SX6 can accommodate hundreds of
streams per data processor. Two data processors are present for each Brocade 7840 switch and each
Brocade SX6 blade. Because a virtual TCP window is used for each stream, if a flow needs to slow down or
stop, no other flows are affected; they continue to run at their full rate.
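The per-stream virtual window idea can be sketched as follows; `StreamMux` and its methods are hypothetical names for illustration, not the Brocade data path:

```python
class StreamMux:
    """Per-stream virtual receive windows inside one shared tunnel:
    closing one stream's window stalls only that stream's flow,
    never the other flows sharing the tunnel."""

    def __init__(self):
        self.windows = {}

    def open_stream(self, stream_id, window):
        """Create a stream with its own virtual window (in bytes)."""
        self.windows[stream_id] = window

    def can_send(self, stream_id, nbytes):
        """True if the stream's virtual window admits nbytes more."""
        return self.windows[stream_id] >= nbytes

    def close_window(self, stream_id):
        """Flow-control one stream (e.g., its receiver is slow)."""
        self.windows[stream_id] = 0
```

Contrast this with a single shared TCP receive window, where one slow receiver would shut the window for every flow in the tunnel at once.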
It is not the responsibility of IP storage end devices to solve IP network problems, even if native IP ports
are available on those products. IP is supplied as a protocol of convenience, not for line rate performance
across a WAN. Also, the cost of adding hardware to various end-device products solely to solve network
performance and security issues is prohibitive, as is the cost of the software component that requires a
particular skill set. The process of placing specialized engineering into every IP storage device that sends
data to another data center across the WAN proves to be impractical. Alternatively, it is much easier to run
IP storage flows from many IP storage devices through a pair of Brocade 7840 switches or a pair of Brocade
SX6 blades at each data center.
The Brocade 7840 and the Brocade SX6 with IPsec encourage the use of encrypted data protection and
do so with no performance penalty. The Brocade IPsec capability is hardware-implemented and operates at
line rate (20 Gbps) per data processor with 5 microseconds (μs) of added latency. IPsec is included in the
Brocade 7840 and the Brocade SX6 base units, with no additional licenses or fees.
The second method is to mark data flows as they exit the Brocade 7840 or the Brocade SX6 and enter the
IP network. Data flows can be marked with IEEE 802.1P (part of 802.1Q VLAN tagging) or with the
Differentiated Services Code Point (DSCP) for end-to-end IP-based QoS. The difficulty in using this method is that it
requires the IP network to be configured to perform the proper actions based on the marking. If the IP
network is not configured to do so, it does not prioritize the data flows. This usually involves a complex and
sizable project on the IP network side to categorize a diverse number of flows and assign priorities to the
flows. A problem with this approach is that QoS in the IP network does not remain stable. Applications and
priorities change over time.
On the Brocade 7840 and the Brocade SX6, features such as QoS, 802.1P, and DSCP marking are fully
supported. Typically, Brocade 7840 and Brocade SX6 users prioritize their flows within the FCIP+IP
extension tunnel. It is then possible to create an SLA for the tunnel itself, so that users can deploy QoS in
the manner best suited for the environment.
With each added circuit, even more bandwidth is added to the pool. Extension Trunking performs a Deficit
Weighted Round Robin (DWRR) schedule when placing batches into the egress. Batches are an efficiency
technique used by Brocade to assemble frames into compressed byte streams for transport across the
WAN. A feature of Extension Trunking called Lossless Link Loss (LLL) ensures lossless data transmission
across the trunk in the event that data is lost in-flight due to an offline circuit and WO-TCP is no longer
operational across that circuit. WO-TCP itself recovers lost or corrupted data across a link, if that circuit is
still operational. All data is delivered to the ULP in-order. Extension Trunking performs failover and failback,
and no data is lost or delivered out-of-order during such events. Circuits can be designated as backup
circuits, which are passive until all the active circuits within the failover group have gone offline. This
protects users against a WAN link failure and avoids a restart or resync event.
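Deficit Weighted Round Robin scheduling as described can be sketched as follows; queue contents are (name, size) batches, and the quantum and weights are arbitrary illustrative values:

```python
from collections import deque

def dwrr_schedule(queues, weights, quantum=500):
    """Deficit Weighted Round Robin: each round, a queue's deficit
    counter grows by quantum * weight, and the queue may emit any
    head-of-line batches whose sizes fit within that deficit."""
    deficits = [0] * len(queues)
    order = []
    while any(queues):
        for i, q in enumerate(queues):
            if not q:
                deficits[i] = 0            # idle queues accrue no credit
                continue
            deficits[i] += quantum * weights[i]
            while q and q[0][1] <= deficits[i]:
                name, size = q.popleft()
                deficits[i] -= size
                order.append(name)         # batch placed into the egress
    return order
```

With weights of 2:1 and equal-size batches, the higher-weight queue places roughly two batches into the egress for every one from the other, which is the bandwidth-sharing behavior the trunk relies on.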
The Brocade 7840 switch and the Brocade SX6 blade also support jumbo frames. Even if the IP network or
WAN does not support jumbo frames, replication devices can still use LAN-side jumbo frames, which should offload
CPU on the device and accelerate replication.
• The Brocade 7840 switch and Brocade SX6 blade act as a next-hop gateway for IP extension traffic.
This is managed by creating the LAN interfaces and creating routes on the end-devices to point to the
gateway interface.
• The LAN interfaces are on the local side of the network, for example, a single portion of the network
that is within a data center. By comparison, WAN interfaces are on the side of the network that send
traffic outside of the local network and potentially communicate with multiple routers before arriving at
the final destination. IP Extension supports static Link Aggregation Groups (LAGs) on the LAN interface.
A LAG is a group of physical Ethernet interfaces that are treated as a single logical interface.
• IP Extension requires different broadcast domains on each end of the extension tunnel. This means
that a single subnet cannot span the extension tunnel.
• Beginning with Fabric OS 8.0.1, IP Extension supports Policy-Based Routing (PBR), which allows a router
to be connected to the LAN-side of an IP Extension platform.
• IP Extension uses Traffic Control Lists (TCLs) to filter traffic to the correct destination. The default action,
if no TCL is configured, is to deny all traffic flow. It is critical to configure TCLs on each extension switch or
blade that allow traffic to pass from endpoint to endpoint through the extension tunnel.
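For example, on a Linux end device, pointing a static route for the remote data center subnet at the IP Extension gateway might look like the following; all addresses and the interface name here are hypothetical:

```shell
# Hypothetical addresses: 10.2.0.0/16 is the remote data-center LAN,
# 10.1.0.20 is the local IP Extension LAN gateway interface.
ip route add 10.2.0.0/16 via 10.1.0.20 dev eth0

# Confirm which gateway a remote storage address would use.
ip route get 10.2.5.9
```

Only traffic destined for the remote subnet follows this route; everything else uses the device's normal default gateway.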
NOTE: IP Extension and FIPS mode are mutually exclusive. IP Extension cannot be enabled when a platform
is operating in FIPS mode, and FIPS mode cannot be enabled when IP Extension is in use.
A tunnel on a VE_Port must be configured to enable IP Extension. A tunnel that supports IP traffic provides
additional QoS priorities for IP Extension. The tunnel can carry both Fibre Channel and IP traffic. When
a tunnel is running in Fibre Channel-only mode, it is compatible with a Brocade 7840 switch or Brocade
SX6 blade running in FCIP mode (that is, not in hybrid mode). Fibre Channel traffic will be limited to 10
Gbps. Compression is supported on a tunnel in hybrid mode. With Fibre Channel traffic, all compression
modes are supported: fast deflate, aggressive deflate, and deflate compression. IP traffic is limited to two
compression modes: aggressive deflate and deflate compression.
Head of Line Blocking (HoLB) is mitigated for the extended IP flows through the tunnel. Each flow in a TCP
connection receives an independent stream in the tunnel. The data must be delivered in-order for the
stream; however, data can be delivered out-of-order across the tunnel as a whole. This allows the receiving
side to pass up data that arrives out of order, because packet loss recovery on the WAN is isolated to the
impacted stream, or connection, to which the data belongs.
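The per-stream ordering idea can be sketched as follows; this is a simplified illustration with hypothetical names, not the actual Brocade implementation. Only the stream with missing data is held back, while other streams keep delivering:

```python
from collections import defaultdict
import heapq

class StreamReassembler:
    """Per-stream in-order delivery while the tunnel as a whole may
    deliver out of order: loss recovery is isolated to one stream."""
    def __init__(self):
        self.next_seq = defaultdict(int)   # next expected seq per stream
        self.pending = defaultdict(list)   # min-heap of (seq, data)

    def receive(self, stream, seq, data):
        """Returns the data now deliverable in-order for this stream."""
        heapq.heappush(self.pending[stream], (seq, data))
        out = []
        heap = self.pending[stream]
        while heap and heap[0][0] == self.next_seq[stream]:
            _, d = heapq.heappop(heap)
            out.append(d)
            self.next_seq[stream] += 1
        return out
```

A gap in stream "a" delays only stream "a"; stream "b" is unaffected, which is exactly the Head of Line Blocking mitigation described above.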
NOTE: In the Fabric OS 7.4.0 release and later, only static LAGs are supported.
Jumbo frames are supported for LAN traffic. A jumbo frame can be up to 9,216 bytes. VLAN tagging is
supported in IP Extension. Stacked tagging (IEEE 802.1ad) is not supported.
• The LAN interface functions as a virtual switch, or gateway. That is why it is referred to as a Switch
Virtual Interface.
• IP Extension forms an extension tunnel through the WAN to allow the local LAN to communicate with the
remote LAN.
• The SVI on the local extension platform and remote extension platform must be on different subnets.
Otherwise, the IP traffic will not be forwarded.
• When configuring the IP route on the extension platform, the SVI is the next-hop gateway.
• IP Extension optimizes TCP traffic. However, if you are doing UDP forwarding or using IPsec in your
LAN, not all such traffic will be optimized.
• Double VLAN tagging (sometimes referred to as nested VLAN tagging) is not supported.
Figure 2: Extension tunnel concept and TCP/IP layers for FCIP and IP Extension.
Each TCL rule is identified by a name assigned to the rule when it is created. Thereafter, rules are modified
or deleted by name. A TCL name contains 31 or fewer alphanumeric characters. Each name must be
unique within the IP Extension platform (such as a Brocade 7840 switch or Brocade SX6 blade). Rules are
local to each platform and are not shared across platforms. When traffic is allowed, the TCL rule specifies
which tunnel and which QoS to use for that traffic type. When the rule denies traffic, it applies to both DP
complexes unless a specific DP complex is selected.
The TCL priority number provides an order of precedence to the TCL rule within the overall TCL list. The
priority value must be a unique integer. Smaller numbers are higher priority and larger numbers are lower
priority. You can modify the priority to reposition a TCL rule within the list. When creating or removing
rules, you can leave gaps in the priority numbers to allow rules to be inserted between existing entries
later.
The priority value must be unique across all active TCL rules within an IP extension platform. For example,
if a chassis has multiple Brocade SX6 blades installed, the priority must be unique across all blades. If a
TCL is defined as priority 10, that same priority cannot be used for another TCL rule, even if that rule would
be assigned to another DP. A check is performed when the rule is enabled to ensure the priority value
is unique. The TCL input filter inspects a set of parameters to help identify the input traffic. It is based
on several of the fields found in the IP, TCP and other protocol headers. The TCL input filter identifies a
particular host, device, or application by means of the Layer 4 protocol encapsulated within IP, Layer 4
destination ports, Layer 3 source or destination IP addresses and subnets, DSCP/802.1P QoS values, or
VLAN tagging.
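A TCL list of this kind behaves like a first-match filter evaluated in ascending priority order. The following Python sketch illustrates the concept; the rule fields, names, subnets, and port numbers are hypothetical and do not reflect FOS syntax:

```python
import ipaddress

# Hypothetical rule set: lower priority number = evaluated first.
rules = [
    {"name": "deny-mgmt",  "priority": 10,    "action": "deny",
     "src": "10.1.5.0/24", "dst_port": None,  "target": None},
    {"name": "allow-repl", "priority": 20,    "action": "allow",
     "src": "10.1.0.0/16", "dst_port": 3260,  "target": "24-high"},
    {"name": "deny-all",   "priority": 65535, "action": "deny",
     "src": "0.0.0.0/0",   "dst_port": None,  "target": None},
]

def match(rules, src_ip, dst_port):
    """First match in ascending priority order wins; processing stops."""
    for r in sorted(rules, key=lambda r: r["priority"]):
        in_subnet = ipaddress.ip_address(src_ip) in ipaddress.ip_network(r["src"])
        port_ok = r["dst_port"] in (None, dst_port)
        if in_subnet and port_ok:
            return r
    return None
```

Note how the higher-priority "deny" rule carves a smaller subnet out of the larger allowed one, and how the priority-65535 deny-all rule catches everything unmatched.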
Once traffic is matched to a TCL rule, it is sent to the tunnel, and further TCL processing stops for that traffic.
The TCL target parameter specifies which VE_Port tunnel the matched traffic will be sent to. Optionally,
you can specify a traffic priority; for example, when configuring the TCL rule, specify --target 24 or --target
24-high. If no priority is specified, the input traffic is sent over the IP Extension medium-priority QoS on the
target tunnel. IP Extension traffic priorities are scheduled separately from the FCIP traffic priorities. Each
traffic type has its own set of QoS priorities in the egress scheduler. The TCL is evaluated for allow and
deny actions only once, when a TCP stream performs its handshake to form the traffic stream. Any changes
you make to the TCL, including changes to rule priorities or disabled rules, have no effect on existing
streams, because those streams are already formed and the TCL is no longer evaluated for them. If you
do not see any traffic changes after changing a TCL, it is because the traffic stream was already formed.
Newly established streams that match the changed rules will be affected.
Before TCL changes can take effect on established traffic streams, one of the following actions must occur:
• The end device must reset its TCP stream, forming a new stream. Disabling and enabling an end-device
interface resets its TCP streams.
• The tunnel VE_Port must be disabled and re-enabled. Disabling a VE_Port disrupts all traffic passing
through the tunnel (FCIP and IP Extension).
A recommended approach is to configure the fewest “allow” rules that will pass the required traffic.
Unmatched traffic encounters the default “deny-all” rule. You can block a subset of the traffic permitted
by an “allow” rule with a more specific “deny” rule; for example, you can deny a smaller IP subnet that falls
within a larger allowed subnet. When IP routes on the end devices are configured correctly, no undesired
traffic should reach the TCL for processing. However, you can configure specific “deny” rules to ensure
that certain devices or addresses cannot communicate across the IP extension tunnel. Use “deny” rules
sparingly: configuring (and troubleshooting) complex sets of “deny” rules means more work and a greater
chance of error.
In some situations, all traffic that appears at the IP Extension LAN interface is intended to pass through the
tunnel:
• Directly connected end-device ports in which all the traffic from the end device is intended for the
remote data center.
• IP routes on the end devices are closely managed such that only desired traffic is allowed to use the IP
extension LAN gateway.
In these cases, the simplest solution is to configure an “allow” rule that allows all traffic to pass through.
A default TCL is always created when the IP extension platform is configured for hybrid mode. It has priority
65535, which is the lowest possible, and is set to deny all traffic. This rule cannot be deleted or modified.
If no other rule is created, all traffic will be dropped. Therefore, it is critical that you create at least one TCL
rule and set it to target an IP Extension-enabled VE tunnel.
When using Virtual Fabric Logical Switches (VF LS) it is important to know that the IP Extension LAN
Ethernet interfaces must be in the default switch. IP Extension LAN Ethernet interfaces cannot be separated
and assigned into different logical switches. The lan.dp0 and lan.dp1 IPIFs behind the IP Extension LAN
Ethernet interfaces are not part of VF LS contexts and cannot be assigned to an LS. If a datagram arrives
at an IP Extension LAN IPIF, it will be processed and is subject to the TCL. Even though a VE_Port can reside
within a VF LS, there is no requirement for the IP Extension LAN Ethernet interfaces to also reside within
that VF LS. In fact, the IP Extension LAN Ethernet interfaces must remain in the default switch. The TCL
directs traffic to the specified VE_Port regardless of the VF LS in which it resides. The VE_Port must be
native to the DP hosting the lan.dp0 or lan.dp1 IPIF. If the VE_Port specified in the TCL is not native to the
DP hosting the LAN IPIF, the TCL will not find a match and the traffic will be dropped.
Non-terminated TCL
You can create a TCL rule for traffic that does not terminate at the Brocade 7840 or Brocade SX6. The non-
terminated TCL option allows TCP traffic to be sent as-is to the other endpoint over a tunnel. This traffic is
unlike terminated TCP traffic in that it does not count as one of the allowed 512 TCP connections at the DP.
The non-terminated TCL option is meant for use by low bandwidth control-type connections. It is
recommended that you use the highest TCL priority value possible for the non-terminated traffic based on
the traffic handling requirements. This option can be enabled or disabled. However, you must stop and restart
the traffic from the application or host to ensure proper handling after the TCL is updated.
The following compression options are available:
• None
• Fast deflate (fast deflate is not supported for IP traffic in hybrid mode)
• Deflate
• Aggressive deflate
Compression can be configured at the QoS group/protocol level when the switch or blade is operating in
hybrid mode. Different compressions can be applied to Fibre Channel extension traffic and IP extension
traffic on the same tunnel. With the protocol selection, you can use fast deflate for Fibre Channel traffic
while the IP traffic is using deflate or aggressive deflate compression. Compression is configured at the
tunnel level, for all traffic on FC and IP protocols, but you can override the tunnel settings and select
different compression at the QoS group/protocol level.
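Conceptually, the per-protocol selection is an override that falls back to the tunnel-level setting. A minimal sketch of that lookup, with hypothetical names rather than the FOS configuration syntax:

```python
# Hypothetical representation of a hybrid-mode tunnel: a tunnel-level
# default compression mode with per-protocol (QoS group) overrides.
tunnel = {
    "compression": "aggressive-deflate",   # tunnel-level default
    "overrides": {"fc": "fast-deflate"},   # FC traffic overridden
}

def effective_compression(tunnel, protocol):
    """Return the override for this protocol, else the tunnel default."""
    return tunnel["overrides"].get(protocol, tunnel["compression"])
```

Here FC traffic gets fast deflate while IP traffic falls through to the tunnel-level aggressive deflate, mirroring the split described above.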
IP Extension Limitations
For TCP traffic, the following considerations apply:
• Flow control for the LAN TCP sessions is handled by means of internal management of the receive
window. The receive window allows up to 256 kB of data onto the LAN. However, the window will
shrink to 0 if the WAN becomes congested or the target device experiences slow draining.
• TCP connections are limited to 512 terminated or accelerated connections per DP complex. Additional
connection requests are dropped. A RASLOG message is displayed when 500 connections are up and
active.
• Up to 512 TCP open connection requests per second are allowed. Additional open connection requests
are dropped. Limiting the number of connection requests per second helps prevent Denial of
Service (DoS) attacks.
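The per-second cap behaves like a sliding-window admission check. A simplified sketch (not the actual firmware logic), shown with a small limit for illustration:

```python
import time
from collections import deque

class ConnRateLimiter:
    """Drop TCP open requests beyond max_per_sec within a sliding
    one-second window (the text describes 512/s per DP complex)."""
    def __init__(self, max_per_sec=512):
        self.max_per_sec = max_per_sec
        self.times = deque()               # admit timestamps in window

    def admit(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop timestamps that have aged out of the one-second window.
        while self.times and now - self.times[0] >= 1.0:
            self.times.popleft()
        if len(self.times) >= self.max_per_sec:
            return False                   # request dropped
        self.times.append(now)
        return True
```

Requests beyond the cap are simply refused until the window slides forward, which bounds the work an attacker (or a misbehaving host) can impose.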
Another aspect of the mainframe use case is the ability to manage applications that operationally use
a combination of FC+IP or FICON+IP. One such example is using FCIP and IP Extension with virtual tape
systems, as shown in Figure 4. In high-availability architectures, it is necessary for the local tape system to
communicate with the remote tape system, and possibly the remote tape system with the local tape system.
In the event that either the local or remote tape system is offline, tape processing can continue for mission-
critical applications, such as SAP. Both hosts have connectivity to the remote tape system control unit. The
FICON connection across the WAN uses FCIP extension over the same tunnel. Bandwidth and prioritization
are managed by the Brocade 7840 and the Brocade SX6 to ensure reliable operation.
As for the IP connection between local and remote tape systems, instead of sharing the WAN connection
with FCIP, which often is contentious and has to be continuously monitored and managed, the tape system
IP connectivity is managed by a single WAN scheduler and is joined into the extension tunnel on the Brocade
7840 and the Brocade SX6. The Brocade 7840 and the Brocade SX6 manage both the FCIP and
IP Extension flows for optimal performance without contention or oversubscription on the WAN. Flows can
further be managed with QoS, compression, IPsec, and ARL without involving long complex projects on
the network side that might cause ongoing operational issues. The Brocade 7840 and the Brocade SX6
accelerate data transfers across the WAN by using Brocade WO-TCP.
PtP VTS-to-VTS data transmission was originally done by Enterprise Systems Connection (IBM ESCON),
then IBM FICON, and finally Transmission Control Protocol/Internet Protocol (TCP/IP). With the TS7700,
the virtual tape controllers and remote channel extension hardware for the PtP VTS of the prior generation
were eliminated. This change provided the potential for significant simplification in the infrastructure that is
needed for a business continuity solution and for simplified management.
Hosts attach directly to the TS7700/TS7760 via FICON channels. Instead of FICON or ESCON, the
connections between the TS7700/7760 clusters use standard TCP/IP. Similar to the PtP VTS of the previous
generation, with the new TS7700/7760 Grid configuration, data can be replicated between the clusters,
based on the customer’s established policies. Any data can be accessed through any of the TS7700/7760
clusters, regardless of which system the data is on, if the grid contains at least one available copy. As a
business continuity solution for high availability and disaster recovery, multiple TS7700/7760 clusters are
interconnected by using standard Ethernet connections. Local and geographically separated connections
are supported to provide a great amount of flexibility to address customer needs. This IP network for data
replication between TS7700/7760 clusters is more commonly known as a TS7700/7760 Grid. A
TS7700/7760 Grid refers
to 2 to 8 physically separate TS7700/7760 clusters that are connected to each other with a customer-
supplied IP network. The TCP/IP infrastructure that connects a TS7700/7760 Grid is known as the grid
network. The grid configuration is used to form a high-availability disaster recovery solution and to provide
metro and remote logical volume replication. The clusters in a TS7700 Grid can be, but do not need to be,
geographically dispersed.
In a multiple-cluster grid configuration, two clusters are often located within 100 kilometers (km) of each
other. The remaining clusters can be more than 1,000 km away. This solution provides a highly available
and redundant regional solution. It also provides a remote disaster recovery solution outside of the region.
The TS7700/7760 Grid configuration introduces new flexibility for designing business continuity solutions.
Peer-to-peer communication capability is integrated into the base architecture and design. No special
hardware is required to interconnect the TS7700s. The Virtual Tape Controllers (VTCs) of the previous
generations of PtP VTS are eliminated, and the interconnection interface is changed to standard IP
networking. If configured for high availability, host connectivity to the virtual device addresses in two or
more TS7700/7760s is required to maintain access to data if one of them fails. If the TS7700/7760s
are at different sites, channel extension equipment is required to extend the host connections. With the
TS7700/7760 Grid, data is replicated and stored in a remote location to support truly continuous uptime.
The IBM TS7700/7760 includes multiple modes of synchronous and asynchronous replication. Replication
modes can be assigned to data volumes by using the IBM Data Facility Storage Management Subsystem
(DFSMS) policy. This policy provides flexibility in implementing business continuity solutions, so that
organizations can simplify their storage environments and optimize storage utilization. This functionality is
similar to IBM Metro Mirror and Global Mirror with advanced copy services support for IBM Z clients.
The network infrastructure that supports a TS7700 Grid solution faces challenges and requirements
of its own. First, the network components need to individually provide reliability, high availability, and
resiliency. The overall solution is only as good as its individual parts. A TS7700 Grid network requires
nonstop, predictable performance with components that have “five-9s” availability, in other words, 99.999%
uptime. Second, a TS7700 Grid network must be designed with highly efficient components that minimize
operating costs. These components must also be highly scalable, to support business and data growth and
application needs and to help accelerate the deployment of new technologies. The Brocade 7840 and SX6
IP Extension capabilities with Brocade Fabric Vision technology provide exactly these benefits. Brocade
and IBM completed initial testing and qualification of IP Extension with the TS7700 in September 2015, and
that is extended today to the TS7760.
Below is a summary of the initial testing done by IBM and Brocade. The results show performance over a
10 GbE link. As can be seen, Brocade IP Extension offers a significant gain in TS7700/7760 Grid replication
performance in real world environments with packet loss occurring in the WAN. This performance is
attributed to the WO-TCP stack, as well as the lossless link loss technology inherent in Brocade extension
trunking technology. For more details, please see the white papers and IBM Redbook listed in the
references section.
Approximate Latency (ms RTT; 1 ms = ~100 miles) | Unidirectional Distance Between Site A and Site B (miles) | No Packet Loss: IP Storage + IP Extension | No Packet Loss: TS7700 Native Extension | 1% Packet Loss: IP Storage + IP Extension | 1% Packet Loss: TS7700 Native Extension
2 | 100 | 590 | 590 | 590 | 16
20 | 1000 | 590 | 590 | 590 | 12
100 | 5000 | 560 | 65 | 540 | 5
250 | 12500 | 540 | 25 | 540 | 3
Finally, the IPsec in-flight encryption feature on the Brocade 7840 and SX6 is hardware-based (FPGA), and
implementing it does not impact performance or replication latency. This is a significant benefit to IBM
TS7700/7760 clients with a data-in-flight encryption requirement: the TS7700/7760 has a well-documented
(by IBM) replication performance issue with its native in-flight encryption functionality for replication
traffic. An additional notable bonus of the Brocade IPsec feature is that it is part of the base hardware, so
no additional license or cost is required.
Oracle VSM 7
Oracle’s StorageTek Virtual Storage Manager System 7 (VSM 7) was announced in March 2016, and is the
latest generation of the mainframe virtual tape system. VSM 7 is the first and only mainframe virtual tape
storage system to provide a single point of management for the entire system that leverages the highest
existing levels of security provided by IBM Z environments. It provides a unique multitiered storage system
that includes physical disk, optional tape storage, and auto-tiering to the Oracle Storage Cloud.
VSM 7 was designed for business continuity, disaster recovery, long-term archival, and big data
requirements for z/OS applications. Each VSM 7 emulates up to 512 virtual tape transports but actually
moves data to and from disk storage, back-end real tape transports attached to automated tape libraries,
as well as auto-tiering to the Oracle Storage Cloud. As data on VSM 7 disks ages, it can be migrated from
disk to physical tape libraries such as Oracle’s StorageTek SL8500 and SL3000 Modular Library Systems,
as well as being auto-tiered to the cloud in order to meet long-term data retention, archive, and compliance
requirements.
The VSM 7 enables non-disruptive, on-demand capacity scaling and can grow to 825 TB of customer-usable
(without compression) disk storage in a single rack. VSM 7 provides 2× the performance and more than
2× the disk capacity of the previous-generation VSM 6 system and scales disk capacity on demand to
211 petabytes of native capacity (scalability up to 256 VSM systems). A simple and seamless migration to
physical tape extends system capabilities to provide nearly unlimited capacity to meet an enterprise’s long-
term data retention and compliance needs.
The VSM 7 includes the following:
• Increased performance and storage capacity with upgraded servers and storage disk enclosures
• 10 Gbps IP connectivity throughout the virtual tape storage subsystem (VTSS) and into the customer’s
network environment
• Two network switches that aggregate or fan out network connections from the servers to the customer’s
network environment.
The VSM 7 is packaged as a standard rack mount system built on existing Oracle server, storage, and
service platforms. The servers, storage disk enclosures, and standard rack mount enclosure are delivered
as a packaged system. The Oracle Solaris 11 operating system is the foundation of the VSM 7 software
environment, which also includes Oracle Solaris infrastructure components and VSM specific software. The
VSM 7 software environment is preinstalled and preconfigured for VSM functionality so that limited site-level
configuration is required to integrate the product into the customer’s managed tape environment.
A VSM 7 solution comprises the following components:
• Virtual storage management hardware and software. VSM 7 supports emulated tape connectivity to
IBM MVS hosts over Fibre Connection (FICON) interfaces as well as Fibre Channel connection to select
backup applications. It allows attachment to RTDs, IP attachment to other virtual storage managers
(VSM 7, VSM 6, or VSM 5) and to VLE, and remote host connectivity using Extended Control And
Monitoring (ECAM) over IP and replication of one virtual storage manager to another.
• VLE hardware and software. VLE is IP-attached to VSM 7 and functions as a migration and recall
target for VTVs. VLE supports migration and recall to and from Oracle Cloud Storage. A VSM 7 that is
connected to a properly configured VLE can use the VLE to migrate and recall VTVs to and from Oracle
Cloud Storage instead of local disk.
• RTDs connected to physical tape libraries. RTDs serve as migration and recall targets for VTSS VTVs.
RTDs are FICON- and/or IP-attached to VSM 7.
RLINK functionality is available for use between two StorageTek VSM System 7 or System 6 VTSSs, where
each VTSS can be both a primary and a target for bidirectional synchronous VTV-level replication.
NOTE: RLINK functionality cannot be used concurrently with synchronous cluster link (CLINK) replication.
Oracle VSM 7, working in conjunction with the Brocade 7840 Extension Switch and/or Brocade SX6
Extension Blade, offers the flexibility to address multiple enterprise configuration requirements including
high-performance disk-only or massively scalable disk and physical tape, as well as single or multisite
support for disaster recovery. In addition, physical tape FICON/Fibre Channel over IP (FCIP) extension can be
used to extend the VSM 7 storage to span onsite and off-site repositories that are geographically dispersed.
Up to 256 VSM VTSSs can be clustered together into a tapeplex, which is then managed under one pane of
glass as a single large data repository using StorageTek ELS.
Figure 5 illustrates the performance gains realized with VSM 7 and Brocade 7840 IPEX and WO-TCP. This
testing was done using one GbE CLINK with 4 GB VTV size data being transferred from a VSM 6 to a VSM
7. For a series of distances/latency measurements, a “pristine” network with no packet loss was used for
a base comparison, followed by two more-likely network scenarios: 0.1 percent packet loss and 0.5 percent
packet loss. These results illustrate the enhanced replication performance that can be realized by VSM 7
customers when using the Brocade 7840 Extension Switch’s IPEX technology for connectivity between data
centers and VSM 7 devices.
Figure 5: Performance gains realized from StorageTek VSM System 7 and Brocade 7840 Extension Switch.
Luminex MVT/CGX
Luminex offers another leading mainframe virtual tape solution, the MVT/CGX, that also offers the option of
IP-based replication in addition to FCIP. Brocade and Luminex performed testing and qualification of a joint
solution. The results of the IP Extension testing are shown below in Figure 6.
Figure 6: Performance gains realized with Luminex MVT/CGX and Brocade IP Extension.
SecureAgent
SecureAgent offers yet another mainframe virtual tape solution that is used by many large companies with
z/TPF environments. The SecureAgent solution offers high performance and robust encryption capabilities.
At the time of writing, Brocade and SecureAgent were starting work to jointly test/qualify Brocade FCIP and
IP Extension with the SecureAgent product portfolio.
Summary
The Brocade 7840 Extension Switch and the Brocade SX6 Extension Blade are purpose-built extension
solutions that securely move more data over distance faster while minimizing the impact of disruptions.
With Gen 5 Fibre Channel, IP extension capability, and Brocade Fabric Vision technology, these platforms
deliver unprecedented performance, strong security, continuous availability, and simplified management to
handle the unrelenting growth of data traffic between data centers in Fibre Channel, FICON, and IP storage
environments. The Brocade 7840 switch and the Brocade SX6 blade are purposely engineered systems
of hardware, operating system, and management platforms that provide capabilities that native storage IP
TCP/IP stacks cannot provide. For example, many native storage TCP/IP stacks cannot adequately provide
the following features that the Brocade 7840 and the Brocade SX6 Extension Blade provide: consolidated
management platform for multiprotocol storage applications, operational excellence (especially across
the IP network), bandwidth pooling and management, security, protocol acceleration, lossless continuous
availability, and network troubleshooting tools. IP Extension benefits a wide variety of IP storage applications
that are found in most data centers. The consolidation of these applications into a single managed tunnel
between data centers across the WAN offers real operational, availability, security, and performance value.
• Data Center Interconnect (DCI): Unified support and management of both FC/FICON and IP
• Storage administrators: Provision once and connect many devices over time
• High performance for high-speed WAN links (one or more 10 Gbps and 40 Gbps links)
• Streams: Virtual windows on WAN Optimized TCP to eliminate Head of Line Blocking (HoLB)
• Separate QoS for both FCIP and IP Extension with DSCP and/or 802.1P marking and enforcement
• 9,216 byte Jumbo Frames for both LAN and WAN networks
Brocade IP Extension is an ideal complement to industry-leading mainframe virtual tape solutions that
perform long-distance replication via TCP/IP.
References
Brocade Communications Systems. (2017). Brocade Fabric OS Extension Configuration Guide FOS 8.1.0.
San Jose, CA: Brocade Communications Systems.
Brocade Communications Systems. (2015). Remove the Distance Barrier: Achieve Secure Local Replication
Performance Over Long Distance for IP Storage. San Jose, CA: Brocade Communications Systems.
Brocade Communications Systems (2016). Enhanced Data Availability and Business Continuity with the IBM
TS7700 and Brocade 7840 Extension Solution. San Jose, CA: Brocade Communications Systems.
Detrick, Mark. (2016). Brocade 7840 IP Extension Configuration Guide. San Jose, CA: Brocade
Communications Systems.
Guendert, Stephen. (2015). An Enhanced IBM TS7700 Grid Network: IP Extension + Fabric Vision.
Enterprise Executive Sept 2015. Dallas, TX: Enterprise Systems Media.
IBM Redbooks. (2017). IBM SAN42B-R Extension Switch and IBM b-type Gen 6 Extension Blade in Distance
Replication Configurations (Disk and Tape). Poughkeepsie, NY: IBM Corporation ITSO.
IBM Systems Technical White Paper. (2016). Enhanced Resilient Solutions for Business Continuity. Somers,
NY: IBM Corporation.
Luminex Software, Inc. (2017). Brocade IPEX and FCIP Average Rate Over Distance Testing (Version 1.1.1).
Riverside, CA: Luminex Software Corporation.
Oracle StorageTek. (2016). StorageTek Virtual Storage Manager System 7 and Brocade 7840 Extension
Switch. Redwood Shores, CA: Oracle Corporation.
PART 3 Chapter 7: Brocade Extension Technologies Advanced Topics
Introduction
This chapter discusses the more advanced extension technology features that Brocade offers our
mainframe customers, such as Brocade Extension Trunking, adaptive rate limiting, hot code load, use of
IPsec, and use of virtual fabrics with extension.
Circuits are individual connections within the trunk, each with its own unique source and destination IP
Address. On older Brocade Extension platforms that had multiple connections, such as the Brocade 7500
Switch and the Brocade FR4-18i Blade, each connection was its own tunnel, which is no longer the case.
Now, with the Brocade 7800 and 7840 Extension Switches and the Brocade SX6 and FX8-24 Extension
Blades, a group of circuits associated with a single VE_Port forms a single ISL that is known as a trunk.
Because a tunnel/trunk is an ISL, each ISL endpoint requires its own VE_Port. Or, if it is an FC router
demarcation point for a “remote edge” fabric, then it is called a VEX_Port. Remote edge fabrics are edge
fabrics connected through a WAN connection to a VEX_Port. Extension Trunking operates the same, whether
the virtual port is a VE_Port or a VEX_Port. If each circuit were its own tunnel, a VE_Port would be required
for each circuit. With Extension Trunking, however, each circuit is not its own tunnel, hence the term
“circuit.” Because a trunk is logically a single tunnel, only a single VE_Port or VEX_Port is used, regardless
of how many circuits may be contained within the tunnel.
Figure 1 shows an example of two tunnels that are trunking four circuits in Tunnel 1 and two circuits in
Tunnel 2. Each circuit is assigned a unique IP address by way of virtual IP interfaces (ipif) within the Brocade
7840, Brocade SX6, and Brocade FX8-24. Those IP interfaces are, in turn, assigned to Ethernet interfaces.
In this case, each IP interface is assigned to a different Ethernet interface. This is not required, however.
Ethernet interface assignment is flexible, depending on the environment’s needs, and assignments can
be made as desired. For instance, multiple IP interfaces can be assigned to a single Ethernet interface.
There are no subnet restrictions. The circuit flows from local IP interface to remote IP interface through the
assigned Ethernet interfaces.
Within the architecture of the Brocade 7840, 7800, SX6, and FX8-24, there are Brocade FC Application-
Specific Integrated Circuits (ASICs), which know only FC protocol. The VE/VEX_Ports are not part of the
ASICs and are logical representations of actual FC ports. In actuality, multiple FC ports feed a VE/VEX_Port,
permitting high data rates well above 8 Gigabits per second (Gbps) and 16 Gbps, which is necessary for
feeding the compression engine at high data rates and for high-speed trunks. On the WAN side, the Brocade
7840 and SX6 feature 10 Gigabit Ethernet (GbE) and 40 GbE interfaces, the Brocade FX8-24 features 10
GbE and 1 GbE interfaces, and the Brocade 7800 has 1 GbE interfaces. Think of VE and VEX_Ports as the
transition point from the FC world to the TCP/IP world inside the extension devices.
IP extension switching and processing is done within the Ethernet switch, the Field-Programmable Gate
Arrays (FPGAs), and the Data Processors (DPs). IP extension uses VE_Ports as well, even if no FC or FICON
traffic is being transported. VE_Ports are logical entities that are more accurately thought of as a tunnel
endpoint than as a type of FC port. IP extension uses Software Virtual Interfaces (SVIs) on each DP as a
communication port with the IP storage end devices. The SVI becomes the gateway IP interface for data
flows headed toward the remote data center. At the SVI, end-device TCP sessions are terminated on the
local side and reformed on the remote side. This is referred to as TCP Proxying, and it has the advantage
of local acknowledgments and great amounts of acceleration. WAN-Optimized TCP (WO-TCP) is used as
the transport between data centers. Which side is “local” or “remote” depends on where the TCP session
initiates. Up to eight of the 10 GbE interfaces can be used for IP extension Local Area Network (LAN) side
(end-device or data center LAN switch) connectivity.
Eight 10 GbE interfaces are reserved for WAN side (tunnel) connectivity. Since a single VE/VEX_Port
represents the ISL endpoint of multiple trunked circuits, this affords the Brocade 7840, 7800, and FX8-24
some benefits. First, fewer VE/VEX_Ports are needed, making the remaining virtual ports available for other
trunks to different locations. Typically, only one VE/VEX_Port is needed to any one remote location. Second,
bandwidth aggregation is achieved by merely adding circuits to the trunk. Each circuit does not have to be
configured identically to be added to a trunk; however, there are limitations to the maximum differences
between circuits. For example, the maximum bandwidth delta has to fall within certain limits. Circuits
can vary in terms of configured committed rate, specifically Adaptive Rate Limiting (ARL) floor and ceiling
bandwidth levels. Circuits do not have to take the same or similar paths. The Round Trip Time (RTT) does not
have to be the same either; however, the delta between the longest and shortest RTT is not limitless and must
fall within supported guidelines.
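The exact supported deltas are documented in the Brocade Fabric OS Extension Guide. As an illustration only, a hypothetical validator might enforce a maximum ratio between the fastest and slowest circuit ceilings in a trunk; the 4:1 ratio used here is an assumption for illustration, not a published limit.

```python
# Illustrative check for adding a circuit to an extension trunk.
# The 4:1 maximum bandwidth ratio is an assumed value; consult the
# Brocade Fabric OS Extension Guide for the actual supported deltas.

def can_join_trunk(ceilings_mbps, new_ceiling_mbps, max_ratio=4.0):
    """True if the new circuit keeps the trunk's fastest/slowest
    ceiling ratio within max_ratio."""
    ceilings = list(ceilings_mbps) + [new_ceiling_mbps]
    return max(ceilings) / min(ceilings) <= max_ratio

print(can_join_trunk([2000, 4000], 1000))  # 4000/1000 = 4:1 -> True
print(can_join_trunk([2000, 4000], 500))   # 4000/500  = 8:1 -> False
```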
Best practices accommodate nearly all practical deployments. The resulting RTT for all circuits in the
trunk is that of the longest RTT circuit in the group. Bandwidth scheduling is weighted per each circuit,
and clean links operate at their full available bandwidth. A common example is a tunnel deployed across
a ring architecture, in which one span is a relatively short distance and the opposite span is longer. It is
still practical to trunk two circuits, one across each span of the ring. Asynchronous RDR (Remote Data
Replication) applications are not affected if both paths take on the latency characteristic of the longest path.
Please refer to the Brocade Fabric OS Extension Guide for more details pertaining to a specific Brocade
Fabric OS (Brocade FOS) version.
Link failover is fully automated with Extension Trunking. By using metrics and failover groups, active and
passive circuits can coexist within the same trunk. Each circuit uses keepalives to reset the keepalive
timer. The time interval between keepalives is a configurable setting on the Brocade 7840, 7800, SX6, and
FX8-24. If the keepalive timer expires, the circuit is deemed to be down and is removed from the trunk. In
addition, if the Ethernet interface assigned to one or more circuits loses light, those circuits immediately are
considered down. When a circuit goes offline, the egress queue for that circuit is removed from the load-
balancing algorithm, and traffic continues across the remaining circuits, although at the reduced bandwidth
due to the removal of the offline link.
Keepalives are treated like normal WO-TCP data. They are not sent on any special path through the network
stack; instead, they are intermixed with the normal data passing through TCP for transmission. This
means that the transmission of keepalives is guaranteed through the network by WO-TCP, so it cannot
be lost unless a circuit is down. The keepalive process determines true delay in getting a packet through
an IP network. This mechanism takes into account latency, reorder, retransmits, and any other network
conditions.
If packets take longer than the allotted amount of time to get through the IP network and exceed the KATOV,
the circuit is torn down, and that data is requeued to the remaining circuits. Configuring the KATOV based
on the timeout value of the storage application passing through the trunk is important and recommended.
Having said this, the default value for FICON is one second, which should not be changed. The default value
for RDR and tape is 10 seconds and can often be shortened. This change must be evaluated on a case-by-
case basis and depends greatly on the characteristics of the IP network. When a trunk contains multiple
circuits, the KATOV should be less than the application-level timeout. This facilitates Lossless Link Loss (LLL)
without the application timing out, ensuring that data traversing the IP network does not time out at the
application level. TCP is the
preferred transport for the keepalive mechanism, because it allows tight control of the time that a segment
is allowed on the WAN.
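The guidance above can be sketched as a simple check. The function and its names are illustrative; the 1-second FICON default and the 10-second RDR/tape default come from the text.

```python
# Sketch of the KATOV guidance above. The defaults are taken from the
# text (FICON: 1 second, RDR and tape: 10 seconds); the function itself
# is illustrative, not a Fabric OS interface.

DEFAULT_KATOV_S = {"ficon": 1.0, "rdr": 10.0, "tape": 10.0}

def katov_ok(katov_s, app_timeout_s, num_circuits):
    """When a trunk has multiple circuits, KATOV should be below the
    application-level timeout so a failed circuit can be detected and
    traffic failed over before the application times out."""
    if num_circuits > 1:
        return katov_s < app_timeout_s
    return True

print(katov_ok(DEFAULT_KATOV_S["ficon"], 5.0, 4))  # True
print(katov_ok(10.0, 5.0, 4))  # False: KATOV exceeds the app timeout
```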
Out-of-order frames are problematic for some devices, particularly mainframes, resulting in an Interface Control Check (IFCC). For example, in
Figure 2, frames 1 and 2 are sent and received. Frame 3 is lost. Frame 4 is sent and received. The receiving
side detects 1-2-4, which is an OOOF condition, because it was expecting frame 3 but received frame 4.
Extension Trunking resolves this problem with LLL. When a link is lost, frames in flight are inevitably lost as
well, and the lost frames are retransmitted by LLL (see Figure 2).
Normally, when a TCP segment is lost due to a dirty link, bit error, or congestion, TCP retransmits that
segment. In the case of a broken connection (circuit down), there is no way for TCP to retransmit the lost
segment, because TCP is no longer operational across the link.
An elegantly simple solution is to encapsulate all the TCP sessions, each session associated with an
individual circuit within the trunk. This outer TCP session is referred to as the “Supervisor” TCP session and
it feeds each circuit’s TCP session through a load balancer and a sophisticated egress queuing/scheduling
mechanism that compensates for different link bandwidths, latencies, and congestion events.
The Supervisor TCP is a Brocade proprietary technology that is purpose-designed with a special function
algorithm. It operates at the presentation layer (Layer 6) of the Open Systems Interconnection (OSI) model
and does not affect any LAN or WAN network device that might monitor or provide security at TCP (Layer 4)
or lower layers, such as firewalls, ACLs, or sFlow (RFC 3176). It works at a level above FC and IP such that it
can support both Fibre Channel over IP (FCIP) and IP extension across the same extension trunk.
If a connection goes offline and data continues to be sent over remaining connections, missing frames
indicated by noncontiguous sequence numbers in the header trigger an acknowledgment by the Supervisor
TCP session back to the Supervisor source to retransmit the missing segment, even though that segment
was originally sent by a different link TCP session that is no longer operational. This means that segments
that are held in memory for transmission by TCP have to be managed by the Supervisor, in the event that
the segment has to be retransmitted over a surviving link TCP session. Holding segments in memory on the
Brocade Extension platforms is not an issue, because of the large quantity of memory on these platforms.
For instance, the Brocade 7840 Extension switch has 128 GB of memory available to accommodate the
large BDPs (Bandwidth Delay Products) associated with multiple 10 Gbps connections over long distances.
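The memory sizing follows from the Bandwidth Delay Product: the data kept in flight (and held for possible retransmission) is bandwidth multiplied by round-trip time. A quick calculation using the standard BDP formula, with illustrative numbers:

```python
# Bandwidth Delay Product: the number of bytes in flight on a circuit,
# which must be held in memory until acknowledged in case a surviving
# circuit has to retransmit them.

def bdp_bytes(bandwidth_gbps, rtt_ms):
    bits_per_second = bandwidth_gbps * 1e9
    return bits_per_second / 8 * (rtt_ms / 1e3)

# A single 10 Gbps circuit at 100 ms RTT keeps 125 MB in flight;
# several such circuits quickly justify a large memory complement.
print(bdp_bytes(10, 100) / 1e6, "MB")
```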
In a three-site triangular architecture, in which normal production traffic takes the primary path that is
shortest in distance and latency with one hop, a metric of 0 is set. Sending traffic through the secondary
path, which has two hops and typically is longer in distance and latency, is prevented unless the primary
path is interrupted. This is done by setting a metric of 1 to those circuits. Nonetheless, the dormant circuits
are still members of the trunk. Convergence over to the secondary path is lossless, and there are no out-
of-order frames. No mainframe IFCC or SCSI errors are generated. Extension Trunking is required to assign
metrics to circuits.
Failover circuits and groups are a feature supported on the Brocade extension platforms. Failover groups
allow you to define a set of metric 0 and metric 1 circuits that are part of a failover group. When all metric 0
circuits in the group fail, metric 1 circuits take over operation, even if there are metric 0 circuits still active in
other failover groups. Typically, you would configure only one metric 0 circuit in a failover group. All metric 0
circuits in a group must fail before a metric 1 circuit is used.
Circuit failover provides an orderly means to recover from circuit failures. To manage failover, each circuit
is assigned a metric value, either 0 or 1, which is used in managing failover from one circuit to another.
Extension Trunking with metrics uses LLL, and no in-flight data is lost during the failover. If a circuit fails,
Extension Trunking first tries to retransmit any pending send traffic over another lowest metric circuit. In
Figure 4, circuit 1 and circuit 2 are both lowest metric circuits. Circuit 1 has failed, and transmission fails
over to circuit 2, which has the same metric. Traffic that was pending at the time of failure is retransmitted
over circuit 2. In-order delivery is ensured by the receiving extension switch or blade.
Figure 4: Link loss and retransmission over peer lowest metric circuit.
In Figure 5, circuit 1 is assigned a metric of 0, and circuit 2 is assigned a metric of 1. Both circuits are in
the same tunnel. In this case, circuit 2 is not used until no lowest metric circuits are available. If all lowest
metric circuits fail, then the pending send traffic is retransmitted over any available circuits with the higher
metric. Failover between like metric circuits or between different metric circuits is lossless.
Only when all metric 0 circuits fail do available metric 1 circuits take over data transfer. If the metric 1 circuits
are not identical in configuration to the metric 0 circuits, then the metric 1 circuits will exhibit a different
behavior. Additionally, if the metric 1 WAN path has different characteristics, these characteristics define the
behavior across the metric 1 circuits. Consider configuring circuit failover groups to avoid this problem.
With circuit failover groups, you have better control over which metric 1 circuits will be activated if a metric 0
circuit fails. To create circuit failover groups, you define a set of metric 0 and metric 1 circuits that are part
of the same failover group. When all metric 0 circuits in the group fail, metric 1 circuits will take over data
transfer, even if there are metric 0 circuits still active in other failover groups. Typically, you would only define
one metric 0 circuit in the group so that a specific metric 1 circuit will take over data transfer when the
metric 0 circuit fails. This configuration prevents the problem of the tunnel operating in a degraded mode,
with fewer than the defined circuits, before multiple metric 0 circuits fail.
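The group rule described above can be sketched as follows. The data layout and function names are hypothetical, not a Fabric OS interface: within each failover group, metric 1 circuits carry traffic only when every metric 0 circuit in that same group has failed, independent of other groups.

```python
# Sketch of the failover-group rule: metric 1 circuits in a group
# activate only when all metric 0 circuits in that group have failed,
# even if metric 0 circuits remain active in other groups.

def active_circuits(groups):
    """groups: {group_id: [(name, metric, is_up), ...]}
    Returns the set of circuit names currently carrying traffic."""
    active = set()
    for members in groups.values():
        metric0_up = [n for n, m, up in members if m == 0 and up]
        if metric0_up:
            active.update(metric0_up)
        else:  # every metric 0 circuit in this group is down
            active.update(n for n, m, up in members if m == 1 and up)
    return active

groups = {
    "A": [("c0", 0, False), ("c1", 1, True)],  # metric 0 down -> c1 activates
    "B": [("c2", 0, True), ("c3", 1, True)],   # metric 0 up -> c3 stays dormant
}
print(sorted(active_circuits(groups)))  # ['c1', 'c2']
```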
Circuit Spillover
Circuit spillover is a load-balancing method introduced in Fabric OS 8.0.1 that allows you to define circuits
that are used only when there is congestion or a period of high bandwidth utilization. Circuit spillover is
configured on QoS tunnels on a VE_Port. When using spillover load balancing, you specify a set of primary
circuits to use all the time, and a set of secondary circuits (spillover circuits) that are used during high
utilization or congestion periods. Primary and secondary circuits are controlled using the metric field of the
circuit configuration. For example, when a tunnel is configured for spillover, traffic uses the metric 0 circuits
until bandwidth utilization is reached on those circuits. When bandwidth in metric 0 circuits is exceeded, the
metric 1 circuits are used. Conversely, when the utilization drops, the metric 1 spillover circuits are no longer
used.
The failover behavior of a tunnel remains intact with regard to the lower metric circuits. If a tunnel is
configured for spillover and the lower metric circuits become unavailable, the higher metric circuits function
as the backup circuits. Metric 1 circuits always behave as backup circuits whether configured for spillover
or failover. Failover load balancing, in comparison to spillover, requires that all lower metric circuits become
unavailable before the higher metric circuits are used. When you have multiple lower metric circuits, this
means that all of them must fail before any of the higher metric circuits are used. Circuit failover groups can
still be configured and will behave as before; however, only the metric value determines whether a circuit
is considered a spillover circuit or a primary circuit. For example, if a tunnel is configured with four circuits,
with two metric 0 and two metric 1 circuits in two separate failover groups, both metric 0 circuits are used,
and only when they become saturated are the metric 1 circuits used. The failover grouping is used for
failover scenarios only.
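The difference from failover can be sketched as a simple allocation model: offered load fills the metric 0 circuits first, and only the excess spills onto the metric 1 circuits. The capacities and names below are illustrative.

```python
# Toy model of spillover load balancing: metric 0 circuits carry traffic
# up to their aggregate capacity; only the excess spills onto metric 1
# circuits. Capacities are in Mbps and are illustrative.

def spillover_allocation(offered_mbps, metric0_caps, metric1_caps):
    """Return (mbps carried on metric 0, mbps spilled to metric 1)."""
    on_m0 = min(offered_mbps, sum(metric0_caps))
    on_m1 = min(offered_mbps - on_m0, sum(metric1_caps))
    return on_m0, on_m1

print(spillover_allocation(1500, [1000, 1000], [1000]))  # (1500, 0): no spill
print(spillover_allocation(2500, [1000, 1000], [1000]))  # (2000, 500): spills
```

Under failover, by contrast, the metric 1 circuits would remain idle until both metric 0 circuits were down, regardless of utilization.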
Because of how spillover is implemented, the observed behavior may not appear as expected. For example,
consider QoS traffic flows of high, medium, and low with a bandwidth percentage assigned to each priority.
QoS traffic is carried by the active circuits according to the percentage allocated to each priority. If the
QoS low-priority traffic reaches saturation before hitting its bandwidth allocation limit, it spills over from
a metric 0 circuit to a metric 1 circuit. As another example, the high QoS traffic flow can be assigned to
metric 1 circuits, while the medium and low QoS traffic flows are assigned to metric 0 circuits. In this
instance, the spillover circuit (metric 1) is used even though the metric 0 circuits are not saturated. When a
metric 0 circuit is saturated, additional traffic spills over to the metric 1 circuit. Using the portshow
fciptunnel and portshow fcipcircuit commands with the --qos
and --perf options will display additional details. In the flag list for both tunnel and circuit output, a spillover
flag (s=Spillover) is listed. The spillover flag applies only to output for the portshow fciptunnel command. If
one end of the spillover circuit is using a version of Fabric OS software prior to FOS 8.0.1, the mismatch is
identified and the circuit behavior defaults to circuit failover.
When using protocol acceleration, it is necessary to confine traffic bidirectionally to a specific deterministic path.
This can be accomplished in three ways:
The reason that optimized traffic must be bidirectionally isolated to a specific path is because protocol
acceleration uses a state machine to perform the optimization. Protocol acceleration needs to know, in
the correct order, what happened during a particular exchange. This facilitates proper processing of the
various sequences that make up the exchange until the exchange is finished, after which the state machine
is discarded. These state machines are created and removed with every exchange that passes over the
tunnel/trunk. The Brocade 7840, the Brocade 7800, the Brocade SX6, and the Brocade FX8-24 have the
capacity for tens of thousands of simultaneous state machines or the equivalent number of flows. The
ingress FC frames are verified to be from a data flow that can indeed be optimized. If a data flow cannot be
optimized, it is merely passed across the trunk without optimization.
These state machines reside within each VE_Port endpoint. A state machine cannot exist across more than
one VE/VEX_Port, and there is no communication between different state machines. As shown in Figure 6,
if an exchange starts out traversing tunnel 1 and returns over tunnel 2, with each tunnel using a different
VE_Port, the exchange passes through two different state machines, each one knowing nothing about the
other. This situation causes FC protocol and SCSI I/O to break, producing error conditions.
An advantage to Extension Trunking is that logically there is only a single ISL tunnel and a single endpoint
at the VE or VEX_Port on each side of the trunk, regardless of how many circuits exist within the trunk. The
state machines exist at the endpoints of the Supervisor TCP and remain consistent, even if an exchange
uses circuit 0 outbound and circuit 1 inbound. Refer to Figure 7. Extension Trunking permits protocol
optimization to function across multiple links that are load balanced and can fail over with error-free
operation and without any disruption to the optimization process.
It is important to remember that if more than one VE/VEX_Port is used between two data centers,
the Brocade 7840, the Brocade 7800, the Brocade SX6, and the Brocade FX8-24 can route traffic
indeterminately to either of the virtual ports, which breaks protocol optimization. In this case, one of the
isolation techniques described above is required to prevent failure by confining traffic flows to the same
trunk or tunnel. There is one tunnel or trunk per VE_Port or VEX_Port.
A key advantage of Brocade Extension Trunking is that it offers a superior technology implementation for
load balancing, failover, in-order delivery, and protocol optimization. FastWrite disk I/O protocol optimization,
OSTP, and FICON Acceleration are all supported on Brocade Extension platforms. Brocade protocol
acceleration requires no additional hardware or hardware licenses. The same DP that forms tunnels/trunks
also performs protocol acceleration for a higher level of integration, greater efficiency, better utilization of
resources, lower costs, and a reduced number of assets.
Brocade ARL always maintains at least the minimum configured bandwidth, and it never tries to exceed
the maximum level. Everything in between is adaptive. Brocade ARL tries to increase the rate limit up to
the maximum until it detects that no more bandwidth is available. If it detects that no more bandwidth
is available, and ARL is not at the maximum, it continues to periodically test for more bandwidth. If ARL
determines that more bandwidth is available, it continues to climb towards the maximum. On the other
hand, if congestion events are encountered, Brocade ARL reduces the rate based on the selected back-off
algorithm. ARL never attempts to exceed the maximum configured value and reserves at least the minimum
configured value.
Common to all Remote Data Replication (RDR) systems is some type of WAN connection. WAN
characteristics include bandwidth, propagation delay (often referred to as Round Trip Time [RTT]), and the
connection type. Modern and ubiquitous connection types include MPLS (Multiprotocol Label Switching),
DWDM (Dense Wavelength Division Multiplexing), and Carrier Ethernet. There is yet another characteristic of
considerable importance to RDR network design, the attribute that causes the most operational difficulties:
Is the WAN dedicated to storage extension or is it shared? A dedicated or shared WAN connection is a very
important consideration both for cost and operations. A WAN connection may be shared among multiple
storage applications (such as Fibre Channel over IP [FCIP] disk and tape, native IP replication, or host-based
replication) or shared with other non-storage traffic (for instance, e-mail, Web traffic, etc.). Enterprises are
increasingly opting to use a shared WAN of considerable bandwidth for extension rather than a dedicated
connection. There are many economic reasons for this. This section discusses how Brocade ARL is used to
facilitate the proper operation of extension across a shared WAN connection.
Shared Bandwidth
Shared bandwidth between data centers is common in IP networks and WANs used by RDR and Backup,
Recovery, and Archive; however, sharing bandwidth with these applications often presents difficult
problems. Here are examples of shared bandwidth:
• Bandwidth shared between multiple extension flows originating from various interfaces on the same or
different Brocade extension devices.
• Bandwidth shared between multiple extension flows and other storage flows from host-based replication,
native IP on arrays, Network Attached Storage (NAS) replication, Virtual Machine (VM) cluster data, etc.
• Bandwidth shared with non-storage-related applications, such as user traffic (for instance, Web pages,
e-mail, and enterprise-specific applications) and Voice over IP (VoIP).
Sharing bandwidth poses a problem for storage traffic. Extension traffic is, in essence, Inter-Switch Link
(ISL) traffic. ISL traffic characteristically makes stringent demands on a network. For instance, storage traffic
typically consumes a considerable amount of bandwidth. This depends on the number of volumes being
replicated and the amount of data being written to the volumes within a given period of time.
In general, shared bandwidth is neither best practice nor recommended for storage; however,
considerations of practicality and economics make it necessary for organizations to share their WAN links
with other applications. Experience has demonstrated that placing storage traffic over shared bandwidth
without proper planning and without implementing facilitating mechanisms for storage transmission either
fails in practice or is plagued with problems. Mechanisms to facilitate storage transmission include ARL,
massive overprovisioning of bandwidth, QoS, and Committed Access Rate (CAR). Each of these can be used
alone or in combination. This paper focuses on Brocade ARL. ARL is the easiest mechanism operationally
and the most reliable of the mechanisms that can be deployed.
Brocade ARL limits traffic to a circuit’s maximum configured bandwidth, referred to as the ceiling. The
maximum bandwidth of a circuit depends on the platform: 10 Gigabits per second (Gbps) on the Brocade
SX6, 7840, and FX8-24, and 1 Gbps on the Brocade 7800. The aggregate of the maximum bandwidth
values across multiple circuits assigned to an Ethernet interface can exceed the bandwidth of an interface
and is limited to two times the maximum speed of a VE_Port. A VE_Port on the Brocade SX6 blade and
7840 has a maximum speed of 20 Gbps. It has a maximum speed of 10 Gbps on the Brocade FX8-24 blade
and 6 Gbps on the Brocade 7800.
Multiple circuits can be assigned to the same Ethernet interface, or a circuit can be assigned to a dedicated
Ethernet interface. The design of the extension network and the number of available Ethernet interfaces
and their speed dictate how circuits are assigned to Ethernet interfaces. Brocade ARL permits circuits that
are active during one portion of the day to use the available bandwidth, while on the same interface during
a different time of day, a different circuit uses the available bandwidth. No one circuit exceeds the interface
rate, but the aggregate of multiple circuits can exceed the individual physical interfaces—up to two times the
maximum speed of the VE_Port.
Brocade ARL reserves a configured minimum amount of circuit bandwidth, referred to as the floor. The
minimum for the Brocade FX8-24 is as low as 10 Megabits per second (Mbps), and the minimum for the
Brocade 7840 and the Brocade SX6 is as low as 20 Mbps. The aggregate of the minimum (floor) bandwidth
values for multiple circuits assigned to an Ethernet interface cannot exceed the bandwidth of the interface.
Because the minimum values cannot oversubscribe the interface, each circuit is always guaranteed at least
its floor bandwidth. Proper extension network operation is always assured when operating at floor values.
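Taken together, the ceiling and floor rules above can be expressed as a sketch of a configuration check. The platform speeds are those given in the text; the function itself is illustrative, not a Fabric OS validation routine.

```python
# Sketch of the ARL configuration rules described above (rates in Mbps):
# floor aggregates on an Ethernet interface may not oversubscribe it,
# while ceiling aggregates may, up to twice the VE_Port maximum speed.

VE_PORT_MAX_MBPS = {"7840": 20000, "SX6": 20000, "FX8-24": 10000, "7800": 6000}

def arl_config_ok(platform, iface_mbps, floors, ceilings):
    if sum(floors) > iface_mbps:                       # floors must fit
        return False
    if sum(ceilings) > 2 * VE_PORT_MAX_MBPS[platform]:  # 2x VE_Port cap
        return False
    return True

# Four circuits sharing one 10 GbE interface on a Brocade 7840:
print(arl_config_ok("7840", 10000, [2000] * 4, [10000] * 4))  # True
print(arl_config_ok("7840", 10000, [3000] * 4, [10000] * 4))  # False: floors sum to 12 Gbps
```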
Tape and replication traffic are sensitive to latency. Even asynchronous traffic can be negatively affected,
primarily in terms of reduced throughput. There are Small Computer Systems Interface (SCSI) and Fibre
Connection (FICON) protocol optimization methods such as FastWrite and FICON emulation that mitigate
the effects of latency; however, optical propagation delay is not the only source of latency. Buffering delays
also cause latency in the IP network during periods of congestion. Retransmission due to packet loss also
causes added latency. Rate limiting in general must be used to counter these problems; otherwise, network
performance degrades. In a properly designed and deployed IP network that has ample bandwidth and is
not being oversubscribed, performance and behavior are the same as a native FC network across the same
distance. This should be the goal for high-performance extension.
Proper rate limiting of extension traffic into a WAN network prevents packet loss due to congestion and
buffer overflow. Feeding unlimited or excessive data into an IP network that is not capable of the capacity
causes packet loss. This results in retransmissions, which adds to WAN utilization and exacerbates the
problem. To make the situation worse, when congestion events occur, TCP limits transmission as a flow
control mechanism in an effort to dissipate congestion and prevent it from immediately happening again.
Using TCP as a flow control mechanism is extremely inefficient and results in poor performance. Simply
stated, you cannot feed more data into a network than it is capable of transmitting; otherwise, data is lost in
transmission and performance suffers. Therefore, it is essential to provide the network with a comfortable
data rate that avoids such problems. Since the inception of extension, rate limiting has been accomplished
through a simple static method. This approach is inadequate and is rarely used in modern times.
2. More than one extension interface feeding a single WAN link that has been dedicated solely to storage.
Sharing a link with both storage traffic and non-storage traffic is the most problematic. User data tends to
be unpredictable and very bursty, causing bouts of congestion. QoS can be used to give storage a higher
priority. CAR can be used to logically partition the bandwidth over the same WAN connection. Brocade has
developed WO-TCP, which is an aggressive TCP stack that facilitates and continues storage transmission—
even after experiencing congestion events—by maintaining throughput in the most adverse conditions. Other
TCP flows on a shared link will retreat much more relative to WO-TCP. This frees more bandwidth for storage
during contentious times. WO-TCP cannot overcome User Datagram Protocol (UDP)-based traffic, because
UDP has no flow control mechanisms. Therefore, it is best to prevent collisions with UDP by using Brocade
ARL as the mediator.
Brocade ARL is easily configured by setting a ceiling (-B or --max-comm-rate) and a floor (-b or
--min-comm-rate) bandwidth level per circuit. Brocade ARL always maintains at least the floor level, and it
never tries to exceed the ceiling level. Everything in between is adaptive. Brocade ARL tries to increase the
rate limit up to the ceiling until it detects that no more bandwidth is available. If it detects that no more
bandwidth is available, and ARL is not at the ceiling, it continues to periodically test for more bandwidth. If
ARL determines that more bandwidth is available, it continues to climb towards the ceiling. On the other
hand, if congestion events are encountered, Brocade ARL drops to the floor based on the selected back-off
algorithm. Subsequently, ARL attempts to climb again towards the ceiling.
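As a toy model of this adaptive behavior, the sketch below climbs linearly toward the ceiling and resets to the floor on a congestion event, which corresponds to the reset-to-floor back-off. The step size and congestion pattern are arbitrary illustrations, not platform values.

```python
# Toy simulation of ARL between floor and ceiling (rates in Mbps):
# climb linearly toward the ceiling, reset to the floor on congestion.
# Step size and the congestion pattern are illustrative only.

def arl_simulate(floor, ceiling, congestion_events, step=100):
    rate, history = floor, []
    for congested in congestion_events:
        rate = floor if congested else min(ceiling, rate + step)
        history.append(rate)
    return history

#                  climb, climb, congestion, climb again
print(arl_simulate(500, 1000, [False, False, True, False]))
# [600, 700, 500, 600]
```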
On the Brocade SX6 and the Brocade 7840, ARL back-off mechanisms have been optimized to increase
overall throughput. The Brocade 7800 and Brocade FX8-24 have only one back-off method, which is reset
to floor. On the Brocade SX6 and the Brocade 7840, ARL has been enhanced with new intelligence, which
permits rate limiting to retrench in precise steps and preserves as much throughput as possible. Experience
shows that a complete reset back to the floor value on a shared link is most often not required. Preserving
bandwidth and reevaluating whether additional back-off steps are required is prudent.
ARL now maintains RTT stateful information to better predict network conditions, allowing for more
intelligent and granular decisions. When ARL encounters a WAN network error (congestion event), it looks
back at pertinent and more meaningful stateful information, which will be different relative to the current
state. Rate limit decisions are fine-tuned using the new algorithms on the Brocade SX6 and the Brocade
7840. The Brocade ARL back-off algorithms are as follows:
• Static Reset (Brocade 7800, Brocade 7840, Brocade SX6, Brocade FX8-24)
Static Reset is most appropriate for latency-sensitive applications such as FICON. Static Reset quickly
addresses congestion issues due to excessive data rates. Static Reset is the fastest way to resolve the
negative effects of congestion. FICON has strict timing bounds and is very sensitive to latency. FICON
cannot wait for multiple iterations of back-off to finally get data through. It is most prudent to just reset
the rate limit, avoid congestion, and resend the data. The bandwidth is the trade-off while time is the
benefit.
• Modified Multiplicative Decrease (MMD) (Brocade SX6 and Brocade 7840 only)
For MMD, with each RTT across the IP network the rate limit is decreased by 50 percent of the difference
between the current rate limit and the floor value. For example: The floor is 6 Gbps, and the current
rate limit is at 10 Gbps. A congestion event occurs. The first back-off goes down to 8 Gbps. Another RTT
occurs, and another congestion event occurs also. The second back-off goes down to 7 Gbps (and so
forth). A quick RTT is critical for quick and effective back-off. The MMD method of back-off is best with
RDR (such as EMC SRDF, HDS HUR, and IBM GM) on links less than 200 milliseconds (ms) RTT. When the
link is less than 200 ms RTT, the contention on the WAN connection is resolved in a quick and efficient
manner while, if possible, attempting to maintain as much bandwidth as possible.
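The MMD arithmetic can be verified directly; rates are in Gbps, matching the worked example above.

```python
# MMD as described above: each back-off removes 50 percent of the gap
# between the current rate limit and the floor (rates in Gbps).

def mmd_backoff(rate, floor):
    return rate - 0.5 * (rate - floor)

rate = 10.0  # current rate limit; the floor is 6 Gbps
rate = mmd_backoff(rate, 6.0)
print(rate)  # 8.0 after the first congestion event
rate = mmd_backoff(rate, 6.0)
print(rate)  # 7.0 after the second congestion event
```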
• TBD (Brocade SX6 and Brocade 7840 only)
For example: On a 300 ms RTT link, the current rate is 10 Gbps and the floor rate is 6 Gbps. The first
back-off is 1.2 Gbps down to 8.8 Gbps. The second back-off is 0.84 Gbps down to about 8 Gbps, and so
forth. The rate at which WAN contention is resolved is directly related to the RTT. TBD is optimal for most
intercontinental applications.
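The per-step fraction behind these numbers is not stated explicitly; the 30 percent used below is inferred from the worked example (1.2 Gbps is 30 percent of the 4 Gbps gap between rate and floor) and should be treated as an assumption.

```python
# Reproducing the TBD worked example above. The 30 percent per-step
# fraction is inferred from the numbers given (1.2 Gbps is 30 percent
# of the 4 Gbps gap) and is an assumption, not a documented constant.

def tbd_backoff(rate, floor, fraction=0.3):
    return rate - fraction * (rate - floor)

rate = tbd_backoff(10.0, 6.0)
print(round(rate, 2))  # 8.8: first back-off of 1.2 Gbps
rate = tbd_backoff(rate, 6.0)
print(round(rate, 2))  # 7.96: "about 8 Gbps", a back-off of 0.84 Gbps
```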
Automatic (auto) Mode should be used and is considered best practice in most instances. Auto Mode
chooses the correct back-off algorithm based on the detected and configured characteristics of the circuit.
Auto Mode functions as follows:
• If FICON acceleration is enabled, or the Keep Alive Time Out Value (KATOV) is set to 1 second or less,
Static Reset is used automatically. If FICON is used without FICON acceleration, the KATOV should be set
to 1 second, which automatically implements Static Reset.
• If FICON is not used, and the KATOV is greater than 1 second, MMD is set.
• Within WO-TCP’s first few RTTs upon bringing the circuit up, if the measured RTT is greater than 200 ms,
the algorithm automatically changes to TBD.
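The Auto Mode rules above can be summarized as a selection function. This is illustrative only; the actual selection is internal to Brocade FOS.

```python
# Sketch of the Auto Mode selection rules described above. Names and
# the function signature are illustrative, not a Fabric OS interface.

def auto_mode_algorithm(ficon_accel, katov_s, measured_rtt_ms):
    if ficon_accel or katov_s <= 1.0:
        return "Static Reset"   # FICON or short KATOV
    if measured_rtt_ms > 200:
        return "TBD"            # long-haul links, measured in-band
    return "MMD"                # non-FICON, KATOV > 1 s, RTT <= 200 ms

print(auto_mode_algorithm(True, 10.0, 50))    # Static Reset
print(auto_mode_algorithm(False, 10.0, 80))   # MMD
print(auto_mode_algorithm(False, 10.0, 300))  # TBD
```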
After 10 ms of idle time, ARL begins to return to the floor rate, using a step-down algorithm. The Brocade
7800 and the Brocade FX8-24 take 10 seconds to return to the floor, while the Brocade 7840 and the
Brocade SX6 take one second. When data transmission starts, ARL climbs towards the ceiling. To climb
from floor to ceiling, the Brocade 7840 and the Brocade SX6 take one second, and the climb is smooth and
linear. The climb on the Brocade 7800 and the Brocade FX8-24 takes 10 seconds and is also smooth and
linear. Climbing at faster rates is prone to creating detrimental congestion events and therefore is not done.
ARL is not intended to be a highly reactive mechanism that precisely tracks network conditions. ARL’s
purpose is to maximize overall average bandwidth on shared links, to not interfere
with other WAN applications, and to recover bandwidth from offline circuits. The process of detection,
feedback, and readjustment is not practical for instant-to-instant adjustments. ARL is reactive enough to
provide a positive experience on shared WAN connections and for reestablishing application bandwidth from
circuits that go offline.
Brocade ARL plays an important role in extension across shared links with other traffic. If congestion
events are detected, Brocade ARL backs off the rate limit and permits other traffic to pass. Typically,
within Enterprise IT there is an agreement on the maximum bandwidth that storage can consume across
a shared link. However, if other traffic is not using its portion of the bandwidth, storage traffic may use the
free bandwidth. When storage gets to use extra bandwidth from time to time, asynchronous applications
such as RDR/A have a chance to catch up. RDR/A applications may fall behind during periods of unusually
high load. Brocade ARL facilitates these types of Service Level Agreements (SLAs) without the cost of
overprovisioned WAN connections.
In the case of multiple extension interfaces feeding a single link, the sources can be from a variety of
storage devices, such as tape and arrays. Rarely does a host communicate with storage across a WAN
due to poor response times and reliability. Alternatively, the source may be a single array with “A” and “B”
controllers that are connected to respective “A” and “B” Brocade 7840s, Brocade 7800s, or Brocade FX8-
24s in a High-Availability (HA) architecture, but that utilize only a single WAN connection. In any case, the
link is oversubscribed and congestion occurs.
• If the bandwidth is greater than or equal to 2 Gbps, the link cost is 500.
• If the bandwidth is less than 2 Gbps, but greater than or equal to 1 Gbps, the link cost is 1,000,000
divided by the bandwidth in Mbps.
• If the bandwidth is less than 1 Gbps, the link cost is 2,000 minus the bandwidth in Mbps.
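The three cost rules above can be expressed compactly. The function below is an illustrative sketch; the use of integer division where the quotient is not exact is an assumption.

```python
def ve_link_cost(bandwidth_mbps: int) -> int:
    """VE_Port tunnel link cost per the three rules above (illustrative)."""
    if bandwidth_mbps >= 2000:            # >= 2 Gbps
        return 500
    if bandwidth_mbps >= 1000:            # >= 1 Gbps but < 2 Gbps
        return 1_000_000 // bandwidth_mbps
    return 2000 - bandwidth_mbps          # < 1 Gbps

# ve_link_cost(10000) → 500; ve_link_cost(1000) → 1000; ve_link_cost(500) → 1500
```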
Multiple parallel tunnels are supported, but their use is not recommended. If multiple parallel tunnels are
used, configure Lossless DLS. This avoids FC frame loss during routing updates that are made because of
bandwidth updates.
• As a best practice, the aggregate of the maximum rate bandwidth settings through a VE_Port tunnel
should not exceed the bandwidth of the WAN link. For example, if the WAN link is 2 Gbps, the aggregate of
the ARL maximum rates connected to that WAN link can be no more than 2 Gbps. For ingress rates, there
is no limit because the FC flow control (BBC) rate limits the incoming data.
• The aggregate of the minimum configured values cannot exceed the speed of the Ethernet interface,
which is 1 Gbps for GbE ports or 10 Gbps for 10 GbE ports, or 40 Gbps for 40 GbE ports.
• Configure minimum rates of all the tunnels so that the combined rate does not exceed the specifications
listed for the extension product in the tunnel and circuit requirements for Brocade Extension platforms.
• For 1 GbE, 10 GbE, and 40 GbE ports, the maximum committed rate for a single circuit cannot exceed five
times the minimum committed rate. For example, if the minimum is set to 1 Gbps, the maximum for that
circuit cannot exceed 5 Gbps. This limit is enforced in software.
• The bandwidth of any circuit should not exceed four times that of the lowest-bandwidth circuit in the same
tunnel. For example, if one circuit is configured to 1 Gbps, any other circuit in that same tunnel should not
exceed 4 Gbps. This limit is not enforced in software, but is strongly recommended.
• For any circuit, the minimum bandwidth values must match on both the local side and remote side, and
so must the maximum bandwidth values.
• ARL is invoked only when the minimum and maximum bandwidth values on a circuit differ. For example, if
you configure both the minimum and maximum bandwidth values on a circuit to 1,000,000 (1 Gbps), ARL
is not used.
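The rate-ratio rules in this list can be checked mechanically. The following Python sketch is a hypothetical pre-deployment check, not an actual Fabric OS tool:

```python
def check_circuit_ratio(min_mbps, max_mbps):
    """Enforced rule: a circuit's maximum rate may not exceed 5x its
    minimum rate. ARL is active only when min and max differ."""
    ok = max_mbps <= 5 * min_mbps
    arl_active = min_mbps != max_mbps
    return ok, arl_active

def circuits_exceeding_tunnel_ratio(circuit_mbps):
    """Recommended rule: no circuit in a tunnel should exceed 4x the
    lowest-bandwidth circuit in the same tunnel. Returns offenders."""
    lowest = min(circuit_mbps)
    return [c for c in circuit_mbps if c > 4 * lowest]
```

For example, a circuit configured with a 1 Gbps minimum and a 6 Gbps maximum fails the 5x check, and a 5 Gbps circuit in a tunnel whose slowest circuit is 1 Gbps violates the 4x recommendation.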
Now, consider the next scenario. When a Brocade Extension device, LAN switch/router, or array controller is
taken offline for maintenance, a technology refresh, or due to failure, a path goes down. In the previous
scenario, during an outage the oversubscription goes away, and there is no TCP flow control problem; the
issue is that normal operation does not work properly, due to the oversubscription. Clearly, it is impractical
to have a network that works properly only during device outages.
In a different case, normal operation has no oversubscription and works properly. The relevant issue is that
during times of failure or maintenance, undersubscription of the WAN occurs, which presents a problem of a
different kind. This undersubscription may be acceptable for some organizations, but it is unacceptable for
many.
It is imperative that during periods of maintenance and failures, no increase in the risk of data loss occurs.
This is exactly the situation that is faced when half the available bandwidth disappears, as shown in Figures
9 and 10.
Figure 10: Outage situation: Undersubscribed with fixed rate limit (no ARL).
If, suddenly, only half the bandwidth is available, and assuming more than half the bandwidth is used on
a regular basis, this creates a situation in which an RDR/A application has to start journaling data (that
is, Delta Set Extension on SRDF) because of incomplete transmissions. This pending data is stored on the
limited space of disk journals or cache side files. Depending on the load during the outage, these journals
can quickly fill, as they are not meant to provide storage for longer than just minutes to possibly a couple
of hours. The journal has to be large enough to hold all the writes that are replicated during the outage.
Once the journal fills to maximum capacity, the RDR/A application has no choice but to sever the local
and remote volume pair relationship. When connectivity is restored, a resynchronization of all dirty tracks
on every volume pair is required. During the period of time that extends from the time the journals start
to fill, through the regaining of connectivity, to completion of resynchronization of the pairs, recent data is
not protected and RPO becomes long. Reestablishing volume pair status can take a lengthy period of time,
especially if the volumes are very large and many tracks have been marked as changed (dirty). During all
this time, data is not protected, which exposes the enterprise to a significant liability. Many companies are
interested in maintaining available bandwidth at all times and wish to avoid the risk associated with not
doing so.
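To see why journals fill quickly, consider the arithmetic: the journal absorbs the difference between the host write rate and whatever the degraded WAN can still replicate. The sketch below uses purely illustrative numbers.

```python
def journal_fill_hours(journal_gb, write_mb_s, replicated_mb_s):
    """Rough time for an RDR/A journal (or cache side file) to fill when
    the WAN drains only part of the incoming write load (illustrative)."""
    backlog_mb_s = write_mb_s - replicated_mb_s
    if backlog_mb_s <= 0:
        return float("inf")  # replication keeps up; the journal never fills
    return (journal_gb * 1024) / backlog_mb_s / 3600

# Example: a 1 TB journal with 400 MB/s of writes but only 200 MB/s still
# replicated fills in under 1.5 hours.
```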
The problem described in the previous paragraph can be solved by simply purchasing a link with twice the
bandwidth that is actually needed or will be used, and by setting static rate limiting to half that bandwidth.
During times of failure, upgrades, or maintenance—when one or more paths are offline—only half the
bandwidth can be used, due to static rate limiting. This does not pose a problem with an overprovisioned
link, because twice the needed bandwidth is available, thus half the bandwidth works well. Obviously, this
is not an optimal or cost-effective solution because during normal operation, half the bandwidth goes to
waste.
Brocade ARL permits more efficient use of that bandwidth, reducing Operational Expenses (OpEx) and
delivering a higher Return on Investment (ROI). With Brocade ARL, a properly sized WAN can be provisioned
with the ceiling set to the WAN bandwidth and (assuming two extension interfaces per WAN link) the floor
set to half of the WAN bandwidth. Refer to Figure 11 and Figure 12. If four extension interfaces are present,
the floor is set to one-fourth of the WAN bandwidth.
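The sizing rule above reduces to a simple formula. This sketch assumes rates in Mbps and that all extension interfaces share the WAN equally:

```python
def arl_floor_ceiling(wan_mbps, num_extension_interfaces):
    """Best-practice ARL sizing from the text: ceiling = WAN bandwidth,
    floor = WAN bandwidth divided by the number of extension interfaces
    sharing the link."""
    floor = wan_mbps // num_extension_interfaces
    ceiling = wan_mbps
    return floor, ceiling

# Two interfaces on a 10 Gbps WAN: floor 5000 Mbps, ceiling 10000 Mbps.
# Four interfaces: floor 2500 Mbps, ceiling 10000 Mbps.
```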
Figure 12: Offline device operation: ARL adjusted for an offline Brocade 7840.
Of course, manual intervention by resetting the rate limit settings can help here as well; however, the
downside is that this requires more management, more chance of error, more change control, inconvenient
maintenance windows, and increased operational costs.
Consider buying and using only the bandwidth that is actually needed. Some number of interfaces may
go offline from time to time for a variety of reasons. If the remaining online interfaces automatically adjust
their rate limit to compensate for other interfaces that are temporarily offline, then the bandwidth will be
efficiently consumed all the time. Brocade ARL addresses this problem by automatically adjusting the rate
limiting on extension interfaces. ARL maintains bandwidth during outages and maintenance windows,
accomplishing several goals.
Brocade ARL performs another important task by allowing multiple storage applications (such as tape
and RDR) to access bandwidth fairly. As an application’s demand diminishes, the other applications
can utilize the unused bandwidth. If two storage applications assigned to different extension interfaces
and utilizing the same WAN are both demanding bandwidth, they adaptively rate limit themselves to an
equalization point. No application can preempt the bandwidth used by another application below its floor
rate. However, if the demand of one application tapers off, the other application can utilize unused portions
of the bandwidth. In Figure 13, you can see that as the blue flow tapers off, the red flow starts to increase its
consumption of bandwidth. Conversely, if the blue application later needs more bandwidth, it preempts the
red application’s use of bandwidth back to the equilibrium point, but not beyond it. In this way, bandwidth is
fully utilized without jeopardizing the minimum allotted bandwidth of either of the storage applications.
It is important to note that when using hybrid mode in Fabric OS releases prior to Fabric OS 8.1.0, Extension
HCL is disruptive to any IP Extension traffic. Extension HCL is non-disruptive to FCIP traffic.
A Brocade extension platform has two Data Processor (DP) complexes, referred to as DP0 and DP1. An
Extension HCL firmware update occurs on one DP complex at a time. When a firmware update is initiated,
the process always starts on DP0. Before DP0 is updated to the new firmware, traffic fails over to DP1 to
maintain communication between the local and remote switch.
Extension HCL uses three tunnel groups contained within the extension tunnel, as shown in Figure 14, to
perform the non-disruptive firmware upload process. The Local Backup Tunnels (LBTs) and Remote Backup
Tunnels (RBTs) are automatically created when you configure an HCL-compliant tunnel. Not all tunnels are
HCL-capable. To create the tunnels, you configure IPIF addresses on the local DP0 and DP1, and on the
remote DP0 and DP1. As shown in Figure 14, you create circuits for the main tunnel (local DP0 to remote
DP0), the local backup tunnel (local DP1 to remote DP0), and the remote backup tunnel (local DP0 to
remote DP1). When you configure these circuits, you designate that they are used for high availability (HA).
This collection of circuits and tunnels forms the HA tunnel group.
When HCL is properly configured, the following tunnel groups operate during the firmware upgrade, and all
traffic, both FCIP and IP Extension, is maintained without interruption:
• The Main Tunnel (MT) group provides connectivity during normal operations.
• An LBT group maintains connectivity from the remote switch when the local switch DP0 is being upgraded.
This tunnel, dormant during non-Extension HCL operations, is created automatically on the local DP1.
• An RBT maintains connectivity from the local switch when the remote switch DP0 is being upgraded. This
tunnel group, dormant during non-Extension HCL operations, is created automatically on the remote DP1
when the corresponding local HA IP address is defined.
The MT is what you normally configure to create an extension tunnel from a VE_Port using the portcfg
fciptunnel command and appropriate tunnel and circuit parameters. The MT carries traffic through the
extension tunnel to the remote switch. The LBT is created upon specifying the local HA IP address for the
circuit, and the RBT is created upon specifying the remote HA IP address for the circuit. All three tunnel
groups (MT, LBT, and RBT) are associated with the same VE_Port on the local and remote DP complexes.
When an extension tunnel is configured to be HCL capable, the LBT and RBT tunnel groups are always
present. These connections are established at switch boot up or when the tunnel and circuits are created.
The QoS traffic for high, medium, and low priorities is collapsed into a single QoS tunnel for the duration of
the HCL operation. QoS priorities are restored when the HCL operation completes. If the tunnel is configured
with Differentiated Services Code Point (DSCP), all traffic is tagged with medium DSCP when it enters the
WAN.
These tunnel groups are utilized in the following Extension HCL upgrade process:
2. The control processor reboots from the backup partition with the new firmware.
3. The local DP0 is updated with the new firmware using the following process:
a. Perform feature disable processing on the MT on DP0.
b. Traffic from the MT on DP0 is rerouted to DP1 through the LBT to the remote switch. In-
order data delivery is maintained.
c. DP0 reboots with the new firmware and the configuration is reloaded.
d. Traffic from the LBT is failed-back to DP0 through the MT.
4. The local DP1 is updated with new firmware using the following process:
a. Perform feature disable processing on the MT on DP1.
b. Traffic from the MT on DP1 is rerouted to DP0 through the LBT so that data traffic can
continue between the switches. In-order data delivery is maintained.
c. DP1 reboots with the new firmware and the configuration is reloaded.
d. Traffic from the LBT is failed-back to DP1 through the MT.
5. After the firmware is updated on DP1 and the MT, LBT, and RBT tunnel groups are all online, the Extension
HCL firmware update is complete.
During the update process, tunnels and trunks change state (up or down). The MT provides connectivity
during normal operations; it is up during normal operation and down only during the Extension HCL process.
The RBT and LBT are up during normal operation but do not handle traffic; they carry traffic only during the
Extension HCL process. The RBT handles traffic when the remote switch DP0 undergoes the Extension HCL
process, and is visible as a backup tunnel on the local DP0.
• The primary circuit is referred to as the MT. The backup circuit from local switch to remote switch is the
LBT. The tunnel that points from the remote switch back to the local switch is the RBT.
• The IPIFs for the MT and LBT side of HCL must be on different DPs.
• When you create or modify a circuit, only those circuits that specify a local and remote HA IP address are
protected.
• If a circuit is not specifically configured for HCL, it will go down during the firmware upgrade process.
• When the switch is in 10VE mode, you can configure up to five circuits on each DP for HCL. This allows
a total of 15 addresses on the DP to use for HCL: five each to use as endpoints for main tunnel, local
backup tunnel, and remote backup tunnel.
• When the switch is in 20VE mode, you can configure up to 10 circuits on each DP for HCL, for a total of 30
addresses to use for HCL.
• Whether you use DP0 or DP1 as the main tunnel, the backup tunnel must be configured on the other DP.
• The tunnel and circuit parameters of the main tunnel are replicated on the LBT and RBT, including the
circuit properties such as QoS markings, FastWrite, and FICON Acceleration.
• Extension HCL is exclusive to the Brocade 7840 and the Brocade SX6. It is not compatible with the
Brocade 7500 switch, the Brocade 7800 switch, or the Brocade FX8-24 blade.
• No configuration changes are permitted during the Extension HCL process. This includes modifying
tunnel or circuit parameters. New device connections that require zone checking can possibly experience
timeouts during the CP reboot phase of the firmware download. The CP performs all zone checking and
therefore must be active to process new SID/DID connection requests, such as PLOGIs.
• Extension HCL supports Virtual Fabrics (VF) and FC Routing (FCR with the IR license) and all existing
features. VEX_Ports are not supported.
• Extension HCL was designed for all environments including mainframe FICON XRC and tape and open
systems disk replication (EMC SRDF, HDS Universal Replicator, IBM Global Mirror, HP Remote Copy, and
others). Extension HCL supports asynchronous and synchronous environments.
• The Brocade 7840 Switch and the Brocade SX6 Blade have two DP complexes: DP0 and DP1. During
the HCL process, each DP reloads one at a time, while the other DP remains operational. Consider the
following for planning and use of the switch during this process:
–– Because only one DP complex remains operational while the other reloads, the total switch capacity is
temporarily diminished by 50 percent.
–– FCIP data is not lost and remains in order. Extension HCL does not cause FICON Interface Control
Check (IFCC).
–– The use of Extension HCL requires proper planning. There is a large amount of bandwidth available, as
the FC and FICON side of the switch provides 80 Gbps. In addition, there are typically A and B paths
for a total of 160 Gbps in a redundant replication network. This is more than enough bandwidth for
most replication networks, even during a path failure or firmware update. Apportioning bandwidth
to one DP complex or using only 50 percent of the capacity across both DP complexes reserves
adequate bandwidth for high-availability operations. This is considered best practice.
–– The aggregate of all FC and FICON application data passing through the Brocade extension platform
cannot exceed 20 Gbps per DP complex multiplied by the achievable compression ratio, or 40 Gbps,
whichever is smaller. For example, if 2:1 compression can be achieved, then the storage application
could maintain 40 Gbps of throughput across the extension connection. This is true for both 10VE
and 20VE operating modes. 20VE operating mode is not available when the Brocade extension
platform is in hybrid mode.
• Although most firmware updates for Fabric OS 7.4.0 and later will support Extension HCL, not every Fabric
OS release will guarantee firmware capable of using this feature. Refer to the Fabric OS release notes for
details.
• The firmware on the switch at each end of the tunnel must be compatible. If not, this will prevent successful
tunnel formation when the main tunnel attempts to come back online or could introduce instability and
aberrant behavior.
• Extension HCL does not require any additional communication paths. The existing FC and tunnel
connections used for the normal operation of data replication and tape backup are the only requirement.
• Extension HCL takes advantage of RASlog warnings and error messages (WARN/ERROR).
• If parallel tunnels are configured between local and remote sites, you must enable LLL and set the VE link
cost to static. Otherwise, FC traffic might be disrupted during HCL activity.
• When Teradata Emulation is enabled on an Extension tunnel, Extension HCL is not supported. You must
perform a disruptive firmware download.
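Several of the planning notes above reduce to simple arithmetic. For instance, the per-DP throughput ceiling is the smaller of 20 Gbps times the achievable compression ratio and 40 Gbps; a quick illustrative sketch:

```python
def max_throughput_gbps(compression_ratio):
    """Per-DP application throughput limit from the planning notes above:
    20 Gbps times the achievable compression ratio, capped at 40 Gbps."""
    return min(20 * compression_ratio, 40)

# 2:1 compression → 40 Gbps; 1.5:1 → 30 Gbps; 3:1 is still capped at 40 Gbps.
```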
Extension IP Security
Internet Protocol security (IPsec) uses cryptographic security to ensure private, secure communications
over Internet Protocol networks. IPsec supports network-level data integrity, data confidentiality, data origin
authentication, and replay protection. It helps secure your extension traffic against network-based attacks
from untrusted computers.
• IPsec and Internet Key Exchange (IKE) policies are created and assigned on peer switches or blades on
both ends of the tunnel. On the Brocade FX8-24 Blade, you enable IPsec and provide a pre-shared key
inline on the tunnel. There is no specific policy creation.
• IKE exchanges policy information on each end of the connection. If the policy information does not match,
the connection does not come online. The exchanged Security Association (SA) parameters include
encryption and authentication algorithms and Diffie-Hellman key exchange parameters.
• Data is transferred between IPsec peers based on the IPsec parameters and keys stored in the SA
database.
• When authentication and IKE negotiation are complete, the IPsec SA is now ready for data traffic.
• SAs terminate through deletion or by timing out. An SA lifetime equates to approximately two billion
frames of traffic passed through the SA or a set time interval.
• As the SA is about to expire, an IKE re-key will occur to create a new set of SAs on both sides and data will
start using the new SA.
On the Brocade SX6 and the Brocade 7840, Fabric OS 8.1.0 adds support for Suite B Cryptographic
Suites for IPsec, which allows configuration of the Elliptic Curve Digital Signature Algorithm (ECDSA) for
key generation. As part of Suite B support, you can use digital certificates and a third-party Certificate
Authority (CA) to verify the identity of the certificates. This requires each end of the secure connection to
have access to, or a means to look up, the CA certificate for verification purposes. When you define an
IPsec policy, you can select between two profiles. The pre-shared key profile allows you to specify the key
when IPsec is configured. The Public Key Infrastructure (PKI) profile uses CA certificates.
Overview
The logical switch feature allows you to divide a physical chassis into multiple fabric elements. Each of these
fabric elements is referred to as a logical switch. Each logical switch functions as an independent self-
contained FC switch. Each chassis can have multiple logical switches. In Fabric OS 8.1.0, up to a total of 16
logical switches can be supported in a single Gen 6 chassis or on a single Brocade SX6 blade. The Brocade
7840 Switch can support up to four logical switches.
Virtual Fabrics allow Ethernet ports in the default switch to be shared among VE_Ports in any logical switch.
To use the Virtual Fabrics features, you must first enable Virtual Fabrics on the switch. Enabling Virtual
Fabrics creates a single logical switch in the physical chassis. This logical switch is called the default logical
switch, and it initially contains all of the ports in the physical chassis. After you enable Virtual Fabrics, you
can create additional logical switches. The number of logical switches that you can create depends on the
switch model.
After you create logical switches, the chassis appears as multiple independent logical switches. All of the
ports continue to belong to the default logical switch until you explicitly move them to other logical switches.
The default logical switch always exists. You can add and delete other logical switches, but you cannot
delete the default logical switch unless you disable Virtual Fabrics.
1. Enable Virtual Fabrics mode on the switch using instructions in the “Managing Virtual Fabrics” chapter of
the Brocade Fabric OS Administration Guide.
2. Configure logical switches to use basic configuration values using instructions in the “Managing Virtual
Fabrics” chapter of the Brocade Fabric OS Administration Guide.
3. Create logical switches using instructions for creating a logical switch or base switch in the “Managing
Virtual Fabrics” chapter of the Brocade Fabric OS Administration Guide.
Port assignment
Initially, all ports belong to the default logical switch. When you create additional logical switches, they
are empty and you can assign ports to those logical switches. As you assign ports to a logical switch, the
ports are moved from the default logical switch to the newly created logical switch. Following are some
requirements for assigning ports:
Create logical switches using the lsCfg command. For details, refer to the instructions for creating a logical
switch or base switch section in the Brocade Fabric OS Administration Guide and to the lsCfg command in
the Brocade Fabric OS Command Reference.
There are two methods for changing to the context of a specific logical switch so that you can perform
configuration or other tasks:
• Use the setcontext fabricID command. This changes the context to a specific logical switch and changes
the command line prompt to reflect the new FID. Any commands entered at this prompt are initiated on the
logical switch with that FID.
• Use the fosexec --fid FID -cmd “command” syntax to initiate a specific command on a specific logical switch,
where command is the command string. The fosexec command can be entered for any logical switch to run
the specified FOS command on the specified logical switch, whereas the setcontext command runs only in
the current logical switch.
• Through Inter-Switch Links (ISLs). For extension traffic, the ISL connection is through a tunnel.
• Through base switches and extended ISLs (XISLs). This is supported by the Brocade SX6 Blade, the
Brocade 7840 Switch, and the Brocade FX8-24 Blade.
With Ethernet port sharing, the following rules apply:
• Only Ethernet ports in the default switch can be shared by VE_Ports in different logical switches. An
Ethernet port in a nondefault switch can only be used by VE_Ports in that same logical switch.
• The GbE ports in other logical switches or ports on the base switch cannot be shared by ports in different
logical switches.
• Tunnels created on Brocade FX8-24 Blades with a mix of dedicated ports (ports within the same logical
switch) and shared ports (ports in the default switch) are not supported.
• When using shared Ethernet interfaces between the default switch and other logical switches, if the
default switch is disabled, the Ethernet ports in the default switch will also be disabled. This will impact all
tunnels in the other logical switches using the Ethernet interfaces.
Conclusion
This chapter took a more detailed look at some of the key extension technologies that were briefly covered
in Part 3 Chapters 1-6. This concludes Part 3.
References
Brocade Communications Systems. (2015). Brocade Adaptive Rate Limiting. San Jose, CA: Brocade
Communications Systems.
Brocade Communications Systems. (2017). Brocade Extension Trunking. San Jose, CA: Brocade
Communications Systems.
Brocade Communications Systems. (2017). Brocade FOS Extension Configuration Guide FOS 8.1. San Jose,
CA: Brocade Communications Systems.
Detrick, Mark. (2016). Brocade 7840 IP Extension Configuration Guide. San Jose, CA: Brocade
Communications Systems.
Guendert, S. (2017). Fibre Channel over Internet Protocol Basics for Mainframers. Enterprise Tech Journal.
IETF. (2004, July). IETF RFC 3821. Retrieved from IETF RFC 3821: http://tools.ietf.org/pdf/rfc3821.pdf
PART 4 Chapter 1: Brocade Fabric Vision Technology
Introduction
The use of virtualization, flash storage, and automation tools has allowed applications and services to
be deployed faster while shattering performance barriers. The unprecedented number of application and
service interactions has also increased the complexity, risk, and instability of mission-critical operations.
As a result, IT organizations need flexible storage networks that can adapt to dynamic environments and
performance requirements for high-density virtualization, flash storage, and cloud infrastructures. To
achieve Service Level Agreement (SLA) objectives, IT administrators also need new tools that can help
ensure non-stop operations, quickly identify potential points of congestion, and maximize application
performance, while simplifying administration.
Brocade Fabric Vision technology is an advanced hardware and software solution that combines capabilities
from the Brocade Gen 5 and Gen 6 Fibre Channel ASIC, Brocade Fabric OS (FOS), and Brocade Network
Advisor to help administrators address problems before they impact operations, accelerate new application
deployments, and dramatically reduce operational costs. Fabric Vision technology provides unprecedented
visibility and insight across the storage network through innovative diagnostic, monitoring, and management
technology. Its powerful, integrated monitoring, management, and diagnostic tools enable organizations to:
1. Simplify monitoring
Fabric Vision really can be considered an expert system. Fabric Vision’s MAPS component allows you to
deploy more than 20 years of storage networking best practices in predefined, threshold-based rules,
actions, and policies with a single click. This capability eliminates much of the time and effort associated
with needing to manually configure these types of parameters and settings. Next, Fabric Vision allows you
to take advantage of non-intrusive, real-time monitoring and alerting while gaining visibility into storage IO
health and performance with key latency and performance metrics. Its integrated network sensors allow
you to gain visibility into VM and storage IO health and performance metrics to maintain SLA compliance.
Its integration with Brocade Network Advisor allows you to gain comprehensive visibility into the fabric
through browser-accessible dashboards with drill-down capabilities to easily identify network health,
performance, latency, and congestion issues. Let’s take a look at some of the Fabric Vision components
that make this all possible.
By leveraging prebuilt, pre-validated rule-/policy-based templates, MAPS takes the guesswork out of defining
appropriate rules and actions, simplifying threshold configuration, monitoring, and alerting. With MAPS,
organizations can apply thresholds and alerts simply by enabling one of the predefined MAPS policies for
their environment and then selecting the alerts for out-of-range conditions.
Organizations can configure an entire fabric (or multiple fabrics) at one time using common rules and
policies, or customize rules for specific ports—all through a single dialog. The integrated dashboard displays
a switch health report, along with details on out-of-range conditions, to help administrators quickly pinpoint
potential issues and easily identify trends and other behaviors occurring on a switch or fabric.
MAPS provides an easy-to-use solution for policy-based threshold monitoring and alerting. MAPS proactively
monitors the health and performance of the storage infrastructure to ensure application uptime and
availability. By leveraging prebuilt rule-/policy-based templates, MAPS simplifies fabric-wide threshold
configuration, monitoring, and alerting. Administrators can configure the entire fabric (or multiple fabrics) at
one time using common rules and policies, or customize policies for specific ports or switch elements. With
Flow Vision and VM Insight, administrators set thresholds for VM flow metrics in MAPS policies in order to be
notified of VM performance degradation.
Note: At the current time, VM Insight capabilities are supported for open systems (non-IBM Z) environments
only.
MAPS is implicitly activated when you install Fabric OS 7.4.0 or a later release; however, without a license it
provides reduced functionality. When the Fabric Vision license is installed and activated, MAPS
provides you with a set of predefined monitoring policies that allow you to use it immediately. In addition,
MAPS provides you with the ability to create customized monitoring rules and policies, allowing you to
define specific groups of ports or other elements that will share a common threshold. MAPS lets you define
how often to check each switch and fabric measure and specify notification thresholds. This allows you to
configure MAPS to provide notifications before problems arise, for example, when network traffic through
a port is approaching a bandwidth limit. Whenever fabric measures exceed these thresholds, MAPS can
automatically notify you through e-mail messages, SNMP traps, log entries, or a combination of these. MAPS provides
a dashboard that allows you to quickly view what is happening on the switch (for example, the kinds of
errors, the error count, and so on).
MAPS Groups
A MAPS group is a collection of similar objects that you can then monitor using a common threshold.
MAPS provides predefined groups, or you can create a user-defined group and then use that group in rules,
thus simplifying rule configuration and management. For example, you can create a group of UNIX ports,
and then create specific rules for monitoring this group. As another example, to monitor your network, you
can define Flow Vision flows on a switch that have different feature sets, and then import them into MAPS
as groups. FOS 8.1 includes 40 predefined groups, none of which can be deleted.
End users are also provided the flexibility to create custom monitoring groups—such as switch ports
attached to high-priority applications and another group of switch ports attached to low-priority
applications—and to monitor each group according to its own unique rules. User-defined groups allow you
to specify groups defined by characteristics you select. In many cases, you may need groups of elements
that are more suited for your environment than the predefined groups. For example, small form-factor
pluggable (SFP) transceivers from a specific vendor can have different specifications than SFP transceivers
from another vendor. When monitoring the transceivers, you may want to create a separate group of SFP
transceivers for each vendor. In another scenario, some ports may be more critical than others, and so can
be monitored using different thresholds than other ports.
You can define membership in a group either statically or dynamically. For a group using a static definition,
the membership is explicit and only changes if you redefine the group. For a group using a dynamic
definition, membership is determined by meeting a filter value. When the value is met, the port or device
is added to the group, and is included in any monitoring. When the value is not met, the port or device is
removed from the group, and is not included in any monitoring data for that group.
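The static-versus-dynamic distinction can be sketched in a few lines of illustrative Python. This is not Fabric OS code; the names Port, StaticGroup, and DynamicGroup, and the speed filter, are hypothetical and exist only to model the behavior described above:

```python
# Illustrative model of static vs. dynamic MAPS group membership.
# Not Fabric OS code; all names here are hypothetical.

class Port:
    def __init__(self, name, speed_gbps):
        self.name = name
        self.speed_gbps = speed_gbps

class StaticGroup:
    """Membership is explicit and changes only if the group is redefined."""
    def __init__(self, members):
        self.members = list(members)

class DynamicGroup:
    """Membership is recomputed from a filter each time it is evaluated."""
    def __init__(self, all_ports, predicate):
        self.all_ports = all_ports
        self.predicate = predicate

    @property
    def members(self):
        # Ports that meet the filter are included; ports that no longer
        # meet it drop out of the group automatically.
        return [p for p in self.all_ports if self.predicate(p)]

ports = [Port("0", 32), Port("1", 16), Port("2", 32)]
fast = DynamicGroup(ports, lambda p: p.speed_gbps >= 32)
print([p.name for p in fast.members])   # ports 0 and 2 qualify

ports[1].speed_gbps = 32                # port 1 now meets the filter...
print([p.name for p in fast.members])   # ...and is automatically included
```

A static group built from the same three ports would keep its original member list until explicitly redefined, no matter how the port speeds change.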
The following items should be kept in mind when working with user-defined groups:
Thresholds are the values at which potential problems might occur. When configuring a rule using mapsrule
--config, or creating a rule using mapsrule --create, you can use the --value option to specify a threshold
value that, when exceeded, triggers specified actions. For example, you could create a rule that monitors
loss-of-signal errors on all host ports, and triggers actions when the counter is greater than 14. Then, when
the counter reaches 15, the rule triggers the actions. The following is an example of such a rule:
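A sketch of what such a rule might look like follows. The rule name check_los is hypothetical, and the exact option syntax should be verified against the Fabric OS Command Reference for your release:

```
mapsrule --create check_los -group ALL_HOST_PORTS -monitor LOSS_SIGNAL \
         -timebase MIN -op g -value 14 -action RASLOG
```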
The -timebase value specifies the time interval between samples and affects the comparison of sensor-based data with user-defined threshold values. You can set the timebase to the durations shown in Figure 2.
A MAPS rule associates a condition with actions that need to be taken when the condition is evaluated to
be true and the specified rule is triggered. MAPS rules can exist outside of a MAPS policy, but they are only
evaluated by MAPS when the rule is part of an active policy. Each rule specifies the following items:
When you create, modify, or clone a rule using the mapsrule --create, --config, or --clone commands, you
associate an action for MAPS to take if the condition defined in the rule evaluates to “true.” Each rule can
have one or more actions associated with it. For example, you can configure a rule to log a RASLog message
and fence the port if the number of CRC errors on any E_Port is greater than 20 per minute.
• E-mail alerts
• FICON alerts
• MAPS SNMP traps
• Port fencing and port decommissioning
• RASLog messages
• SFP marginal
• Slow-Drain Device Quarantine
• Switch critical
• Switch marginal
• Port toggling
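The example above (a RASLog message plus port fencing when CRC errors on an E_Port exceed 20 per minute) might be sketched as follows. The rule name is hypothetical, and the option syntax should be verified against your Fabric OS release:

```
mapsrule --create crc_fence -group ALL_E_PORTS -monitor CRC \
         -timebase MIN -op g -value 20 -action RASLOG,FENCE
```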
For each action, you can define a “quiet time” for most rules in order to reduce the number of alert
messages generated. For rules that are state-bound (such as temperature-sensor rules that monitor IN_RANGE or OUT_OF_RANGE states), the rule is triggered only when the condition transitions from one state to another. These types of rules will not be triggered if the condition remains in a single state, even if
that state is not a normal operating state, such as PS_Faulty. A couple of these are worth a more detailed
explanation.
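The state-bound behavior described above amounts to edge-triggered alerting: an action fires on a state transition, never while a state merely persists. A minimal sketch, with hypothetical names rather than Fabric OS code:

```python
# Edge-triggered alerting: fire only when the observed state changes,
# not on every sample while a state persists (even a faulty state).
# Hypothetical sketch, not Fabric OS code.

def alerts_for(samples):
    """Return (index, state) pairs at which an alert would fire."""
    fired = []
    previous = None
    for i, state in enumerate(samples):
        # Only a transition from one state to another triggers the rule.
        if previous is not None and state != previous:
            fired.append((i, state))
        previous = state
    return fired

samples = ["IN_RANGE", "OUT_OF_RANGE", "OUT_OF_RANGE", "OUT_OF_RANGE", "IN_RANGE"]
print(alerts_for(samples))  # one alert per transition, not per faulty sample
```

Note that a port stuck in a faulty state (for example, PS_Faulty) produces no further alerts after the initial transition, which is exactly the behavior the text describes.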
FICON Alerts
FICON notification support is available as an action from MAPS. The FICON management service uses the
MAPS events to create a health summary report. Rules with a FICON notification action are part of all four
default policies. In the active policy, if FICON notification is specified for any triggered events, MAPS sends a
notification with the following event information to the FICON management service:
MAPS supports port fencing and port decommissioning actions for rules that monitor CRC, ITW, PE, LR,
STATE_CHG, or C3TXTO errors from physical ports, such as E_Ports, F_Ports, or U_Ports. For circuits, MAPS supports only port fencing actions for rules that monitor changes of state (STATE_CHG).
Be aware that if multiple rules for the same counter are configured with different thresholds, then both port
fencing and port decommissioning should be configured for the rule with the highest threshold monitored.
For example, if you configure one rule with a CRC threshold value “greater than 10 per minute” and you
configure a second rule with a CRC threshold value “greater than 20 per minute,” you should configure port
fencing and port decommissioning as the action for the rule with the 20-per-minute threshold, because
configuring it for the 10-per-minute rule will block the other rule from being triggered.
For E_Ports, if port decommissioning fails, MAPS will fence the port. Switches themselves can
decommission E_Ports through MAPS. In this case, when port decommissioning is triggered on an E_Port,
the neighboring switches will perform a handshake so that traffic is re-routed before the port is disabled. Be
aware that there are multiple reasons that the port-decommissioning operation between two E_Ports could
fail; for example, if the link that fails is the last link between the two switches.
For F_Ports, port decommissioning will only work if Brocade Network Advisor is actively monitoring the
switch. Brocade Network Advisor can decommission F_Ports based on specified criteria. MAPS notifications
are integrated with Brocade Network Advisor, which in turn must coordinate with the switch and the end
device to orchestrate the port decommissioning. If Brocade Network Advisor is not configured on a switch,
MAPS will fence the F_Port.
Finally, administrators can tailor the frequency of alert messages to reduce duplicate notifications. Also,
administrators have the ability to apply different notifications and actions based on the frequency of a
violation with rule-on-rule monitoring. Administrators can define different actions as operational responses if
a monitoring rule has been repeatedly violated.
Predefined Policies
MAPS provides four predefined policies that you can neither modify nor delete. The predefined policies are
as follows:
• dflt_conservative_policy: This policy contains rules with more lenient thresholds that allow a buffer and
do not immediately trigger actions. Use this policy in environments where the elements are resilient and
can accommodate errors. This policy can be used in Tape target setups.
• dflt_moderate_policy: This policy contains rules with thresholds values between the aggressive and
conservative policies.
• dflt_aggressive_policy: This policy contains rules with very strict thresholds. Use this policy if you need a pristine fabric (for example, FICON fabrics).
• dflt_base_policy: This policy contains a basic set of rules that can be used without a Fabric Vision license; it is the policy that applies when MAPS runs in its reduced, unlicensed mode.
For IBM Z and FICON environments, Brocade recommends that you start with the dflt_aggressive_policy.
MAPS allows you to define your own policies. You can create a policy and add rules to it, or you can clone
one of the default policies and modify the cloned policy, saving it as a new policy name.
Dashboards
Fabric Vision provides at-a-glance views of switch status and various conditions that are contributing
to performance issues, enabling administrators to get instant visibility into any hot spots at a switch
level and take corrective actions. Dashboard views are available both through the CLI and through Brocade Network Advisor (BNA). Dashboard views include:
• Overall status of the switch health and the status of each monitoring category, including any out-of-range conditions and the rules that were triggered.
• Historical information on the switch status for up to the last seven days, with raw counter information automatically provided for a variety of error counters.
The integrated dashboard view also provides a single collection point for all dashboard data from a fabric for a specific application flow.
Brocade Network Advisor simplifies Fibre Channel management and helps organizations dramatically
reduce deployment and configuration times by allowing fabrics, switches, and ports to be managed
as groups. Customizable dashboards graphically display performance and health indicators out of the
box, including all data captured using Brocade Fabric Vision technology. To accelerate troubleshooting,
administrators can use dashboard playback to quickly review past events and identify problems in the
fabric. Dashboards and reports also can be configured to show only the most relevant data, enabling
administrators to more efficiently prioritize their actions and maintain network performance.
MAPS also provides a dashboard offering a summary view of the switch health status, which allows you to easily determine whether everything is working according to policy or whether you need to investigate
further. The MAPS dashboard output is divided into the following main sections. A history section is
displayed if you enter mapsdb --show all.
When a category contains an “out-of-range” error, the dashboard displays a table showing the rules
triggered in that category since midnight. This allows you to see more precisely where the problem is
occurring. Each category in the table contains detailed rule and error information. From this information, you can get an idea of the errors seen on the switch even though none of the rules might have
been violated. If you see potential issues, you can reconfigure the appropriate rule thresholds to specifically
fit the switch based on the actual behavior of traffic on the switch.
Streamlining Administration
IT organizations with large, complex, or highly virtualized data center environments often require advanced
tools to help them more effectively manage their storage infrastructures. Developed specifically with these
IT organizations in mind, Fabric Vision also includes several breakthrough management capabilities that
dramatically simplify day-to-day SAN administration and provide unprecedented visibility across the storage
network. Two examples are COMPASS, and Fabric Performance Impact (FPI) monitoring.
COMPASS
Configuration and Operational Monitoring Policy Automation Services Suite (COMPASS) simplifies
deployment, safeguards consistency, and increases operational efficiencies of larger environments with
automated switch and fabric configuration services. COMPASS compares the configuration settings on a switch against user-defined templates to determine whether configuration drift has occurred. Templates are made up of one or more configuration blocks, and configuration blocks are made up of one or more configuration settings. This enables you to monitor subsets of a switch configuration and receive notification when any of the defined configuration settings drift.
Administrators can configure a template or adopt an existing configuration as a template and seamlessly
deploy the configuration across the fabric. In addition, they can ensure that settings do not drift over time
with COMPASS configuration and policy violation monitoring within Brocade Network Advisor dashboards.
Fabric Performance Impact (FPI) Monitoring
FPI uses the tx_c3_timeout and tim_txcrd_z counters in the Brocade ASIC to detect device-based latency. When
a frame has been unable to transmit due to the lack of buffer credits and has exceeded the timeout value,
the frame is discarded. Each frame discard increases the tx_c3_timeout counter. Each increment of the
tim_txcrd_z counter represents 2.5 microseconds of time with zero transmit buffer-to-buffer credit. Similar
to Bottleneck Detection, FPI monitors both counters to detect device-based latency. Its advanced algorithm
enables FPI to more accurately detect latency conditions. The detection logic uses thresholds that are
internally calibrated to remove the task of deciding the correct thresholds and configuring them. In addition,
FPI monitors the tim_txcrd_z counters only on F_Ports but not on E_Ports. As a result, FPI detects device-
induced latency problems at the point where hosts or storage devices attach to a fabric. With this feature,
administrators can pinpoint the devices causing slow draining problems and quickly isolate the problem.
Fabric OS 7.4.0 enhances FPI detection by monitoring the transmit queue (TXQ) latency counter tim_
latency_vc within the Condor3 ASIC. The TXQ latency counter measures how long a frame has been waiting
in the transmit queue due to the lack of buffer credit. This counter is updated every 250 nanoseconds for
each virtual channel (VC) of each port. Monitoring this counter improves FPI detection of transient spikes in
latency, which could have been missed with tim_txcrd_z counter monitoring alone. While the tim_txcrd_z counter is monitored for all F_Ports, the tim_latency_vc counter is monitored for all E_Ports and F_Ports.
FPI detects different severity levels of latency and reports two latency states. The IO_FRAME_LOSS state
indicates a severe level of latency. In this state, frame timeouts either have already occurred or are very
likely to occur. Administrators should take immediate action to prevent application interruption. The IO_
PERF_IMPACT state indicates a moderate level of latency. In this state, device-based latency can negatively
impact the overall network performance. Administrators should take action to mitigate the effect of latency
devices. The separate latency states enable administrators to apply different MAPS actions for different
severity levels. For example, administrators can configure the port fencing action for the IO_FRAME_LOSS
state and the email alert action for the IO_PERF_IMPACT state.
IO_PERF_IMPACT State
FPI evaluates the tim_txcrd_z counter increment for all F_Ports at a fixed time interval. For each evaluation, the cumulative increments of the tim_txcrd_z counters over three time windows within the fixed interval
are calculated simultaneously. The time windows are short, medium, and long duration, respectively. The
cumulative increment for each time window is compared to an internal threshold predefined for that time
window. The short window uses the high threshold; the medium window uses the medium threshold;
and the long window uses the low threshold. If any of these thresholds is exceeded for the corresponding
time windows, the F_Port is put in the IO_PERF_IMPACT state. With internally predefined and calibrated
thresholds, FPI removes from administrators the complexity of determining appropriate latency thresholds.
Furthermore, with simultaneous evaluation against different thresholds for three different time windows,
FPI can capture both short spikes of severe latency and sustained medium latency. This is a significant
advantage compared to Bottleneck Detection, which can monitor against only one threshold at a time.
Administrators typically must choose between thresholds for short severe latency or for sustained medium
latency. FPI can capture a range of latency levels without users having to configure the thresholds.
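The three-window evaluation can be sketched as follows. Each tim_txcrd_z increment represents 2.5 microseconds at zero transmit credit; the window lengths and thresholds below are invented placeholders, because Fabric OS calibrates its thresholds internally and does not expose them:

```python
# Sketch of FPI's IO_PERF_IMPACT evaluation over three time windows.
# Window sizes and thresholds are invented placeholders; the real
# thresholds are internally calibrated by Fabric OS and not user-visible.

TICK_US = 2.5  # one tim_txcrd_z increment = 2.5 microseconds at zero credit

# (window name, window length in samples, threshold in microseconds)
WINDOWS = [
    ("short",  2, 4000.0),   # short window, high threshold
    ("medium", 4, 3000.0),   # medium window, medium threshold
    ("long",   8, 2000.0),   # long window, low threshold
]

def io_perf_impact(increments):
    """increments: per-sample tim_txcrd_z deltas, newest last.
    True if any window's cumulative zero-credit time exceeds
    that window's threshold."""
    for _name, length, threshold_us in WINDOWS:
        zero_credit_us = sum(increments[-length:]) * TICK_US
        if zero_credit_us > threshold_us:
            return True
    return False

# A short, severe spike trips the short window's high threshold:
print(io_perf_impact([0, 0, 0, 0, 0, 0, 100, 1900]))  # True
# Low, steady latency stays under all three thresholds:
print(io_perf_impact([90] * 8))                        # False
# Sustained moderate latency trips the long window's low threshold:
print(io_perf_impact([150] * 8))                       # True
```

The last two cases illustrate the advantage over a single threshold: the same steady rate that clears the short window's high threshold can still exceed the long window's low one.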
FPI in Fabric OS 7.4 monitors the tim_latency_vc counters in addition to the tim_txcrd_z counter. The
ASIC updates this counter every time a frame is transmitted. Each unit of the counter represents 250
nanoseconds of time. FPI evaluates these counter increments for all F_Ports and E_Ports at the same fixed
intervals as the evaluation of tim_txcrd_z counters. This evaluation compares the tim_latency_vc counter
increments to an internal threshold of 10 milliseconds. If the tim_latency_vc counter increment exceeds this
threshold, the corresponding F_Port or E_Port is put in the IO_PERF_IMPACT state.
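The arithmetic of this check is simple: counter increments times 250 nanoseconds, compared against 10 milliseconds. A sketch with illustrative names:

```python
# Sketch of the tim_latency_vc IO_PERF_IMPACT check: each counter unit
# represents 250 nanoseconds of queuing delay, compared against the
# 10-millisecond threshold described above. Names are illustrative.

UNIT_NS = 250
THRESHOLD_NS = 10_000_000  # 10 ms

def perf_impact_from_latency_vc(counter_increment):
    # Convert the increment to nanoseconds and compare to 10 ms.
    return counter_increment * UNIT_NS > THRESHOLD_NS

print(perf_impact_from_latency_vc(40_000))  # 10 ms exactly: not over threshold
print(perf_impact_from_latency_vc(40_001))  # just over 10 ms: IO_PERF_IMPACT
```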
IO_FRAME_LOSS State
FPI puts an F_Port in the IO_FRAME_LOSS state if an actual timeout occurs on the port. This is indicated
with the C3TX_TO counters on the port. In addition, FPI calculates the average R_RDY delay to predict the
likely timeout on an F_Port. This is done by averaging the number of 2.5-microsecond increments of the tim_txcrd_z counter in a 60-second interval over the total number of frames during those 60 seconds.
If this average R_RDY delay is greater than 80 milliseconds, the corresponding F_Port is also marked as
IO_FRAME_LOSS.
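The average R_RDY delay computation described above can be sketched as follows; this is illustrative, not Fabric OS code:

```python
# Sketch of the IO_FRAME_LOSS prediction: average zero-credit delay per
# frame over a 60-second interval, compared to an 80 ms threshold.
# Illustrative only; not Fabric OS code.

TICK_US = 2.5       # one tim_txcrd_z increment = 2.5 microseconds
THRESHOLD_MS = 80.0

def frame_loss_likely(txcrd_z_increments, frames_sent):
    """True if the average R_RDY delay per frame exceeds 80 ms."""
    if frames_sent == 0:
        return False
    avg_delay_ms = (txcrd_z_increments * TICK_US) / frames_sent / 1000.0
    return avg_delay_ms > THRESHOLD_MS

# 10 million increments (25 seconds of total zero-credit time) spread
# over 300 frames averages about 83 ms per frame, so a timeout is
# considered likely; over 400 frames it averages about 62 ms per frame.
print(frame_loss_likely(10_000_000, 300))  # True
print(frame_loss_likely(10_000_000, 400))  # False
```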
Fabric OS 7.4 also monitors the tim_latency_vc counters for the IO_FRAME_LOSS state. If the evaluated
counter increments exceed an internal threshold of 80 milliseconds, the corresponding F_Port or E_Port is
put in the IO_FRAME_LOSS state.
IO_LATENCY_CLEAR
Fabric OS 7.4 added the IO_LATENCY_CLEAR state. For ports detected in the IO_FRAME_LOSS or IO_PERF_
IMPACT state, when the latency conditions are cleared and the corresponding counter evaluations are within
the internal thresholds, the ports will be put in IO_LATENCY_CLEAR state. Since this is a different state,
customers can receive a separate notification when the latency conditions are cleared.
Another action introduced in Fabric OS 7.4 is Port Toggle. This action disables a port for a short, user-configurable duration and then re-enables the port. The Port Toggle action can recover slow-draining devices, such as those caused by a faulty host adapter. In addition, it can induce multipathing (MPIO) software to fail traffic over to an alternate path to prevent severe performance degradation. By using the Slow-Drain Device Quarantine (SDDQ) or Port Toggle actions, administrators not only can monitor for device-based latency but can also automatically mitigate the problem when such conditions are detected by FPI.
Congestion Monitoring
MAPS has supported congestion monitoring since Fabric OS 7.2 with port transmit bandwidth usage
percentage, receive bandwidth usage percentage, and port bandwidth utilization monitoring. Violations
of these congestion monitoring rules are reported under the Traffic Performance category in the MAPS
dashboard. Fabric OS 7.4 moves the congestion monitoring violations into the Fabric Performance Impact
category to consolidate with device-based latency monitoring. Fabric OS 8.0.1 adds the zoned-device ratio
monitoring to alert a misconfigured oversubscription ratio between source and destination devices.
Significant challenges face Storage Administrators in today’s data centers. Storage consumes a large
quantity of IP network resources. Those IP networks lack adequate visibility of storage flows and are
sometimes unreliable, overly complex, and very inflexible. Greater complexity creates more opportunities
for issues to occur. Storage administration and network administration are distinct roles with different responsibilities in an enterprise. Storage Administrators do not manage IP networks, and Network Administrators do not manage
storage. When issues arise, each role can blame the other for the problems.
Recovery Point Objectives (RPOs) for mission-critical data typically require a couple of seconds or less.
Such RPOs are difficult to maintain when data rates are high (10 gigabits per second [Gbps] or above) and
when network problems are also present. Such degraded states are difficult to troubleshoot. Often multiple
vendors are involved, such as storage vendors, storage network vendors, IP network vendors, and Wide-Area
Network (WAN) service providers. Such situations cost organizations considerable money and expose the
business to data loss. The end result is that user expectations are not met.
To address these problems within the specific context of extension, Brocade has introduced the Brocade
7840 Extension Switch and the Brocade SX6 Extension Blade with advanced Fabric Vision capabilities.
Fabric Vision includes a number of monitoring, alerting, and reporting tools specific to Brocade Extension.
Additionally, diagnostic tools are available that are useful for determining IP network validation and health.
The objective is to quickly determine the root cause of degraded situations or outages, to expedite a return
to normal operations as quickly as possible.
• Unreliable IP networks and the unknown within: Storage Administrators have to rely on the experience,
capabilities, availability, accuracy, and willingness of Network Administrators to provide feedback about
their networks during performance assessment and troubleshooting. Often, this exchange of information
can prolong the length of a degraded state.
• More can go wrong, and more often, than in the past: Data center environments have grown in both
scale and complexity. Virtualization of every type is used at every level, more now than ever. Simply, more
opportunities arise for things to go wrong, and each issue affects other issues. Pinpointing what went
wrong is not always a simple task.
• Asymmetric behavior: Does outbound data take one path and return data take another? Are the
paths equal in quality? Can the IP network properly load balance?
• Different administration groups: At least two administrative groups (Storage and Network) exist within an
enterprise, and each group has different roles, training, expertise, loyalties, goals, and responsibilities—
possibly even different management personnel. These differences in culture can cause difficulties for
Storage Administrators and Network Administrators alike. Storage personnel might not understand the
network infrastructure and requirements, and Network personnel might not understand the storage
infrastructure and requirements.
• RPO at high data rates: RPO has changed over time. The amount of data and the number of applications managed have grown significantly. This means that RPO times must shrink. The amount of data processed
in even a short amount of time has become significant and represents critical business transactions.
• Catch-up time from outage backlogs: Now that data rates are in the 10 Gbps range for many companies,
when an outage or degraded state occurs, it is more difficult to reestablish a fully synchronized data
consistent state useful for database restart. If the amount of time it takes to regain a consistent state
is exceedingly long, the exposure to disaster becomes a liability. Now, if replication is stressed, is slower
than expected, and might even suspend from time to time, what does the timeline to resolution look
like? Refer to Figure 5. Generally, a Storage Administrator is not aware of any problems until an error
condition occurs. Storage Administration might place a request into network operations to investigate why
replication is not performing optimally across the WAN.
More often than not, the IP network will immediately be cleared of any issues. After all, if the IP network
is not down, other applications are fine and a ping can get through, and if links are not fully utilized, then
the assumption is that the network must be functional. To solve the quandary, it is likely that the Storage
Administrator will open a support call with the storage array vendor to determine the cause of the issue.
Rudimentary troubleshooting to validate basic issues will establish a baseline for the investigation. After
that, any call that involves replication across an IP network requires establishing whether the problem exists
in the network or in the array. At that point, the process of ruling out issues with the IP network commences.
How does the Brocade 7840 Extension Switch or the Brocade SX6 Extension Blade address this type of
situation? Given the many aspects of storage networks, every aspect needs to be continuously monitored
and inspected for proper operation. In this case, the Brocade 7840 and the Brocade SX6 monitor the IP network through which their managed tunnels flow, and then proactively alert the Storage Administrator when conditions arise that indicate degraded states. Degraded states within the IP network include: transient
congestion events, chronic out of order delivery, instability and frequent network rerouting, oversubscription,
data integrity problems, excessive latency, excessive jitter, and more. The capabilities of the Brocade 7840
and the Brocade SX6 enable Storage Administrators to gain visibility into an infrastructure on which they rely but into which they previously had no visibility.
This raises another problem: Storage Administrators cannot spend all of their time monitoring networks.
The network itself must have the intelligence to create an alert when an issue is suspected. Brocade
has incorporated decades of expertise and experience into improving network intelligence. When an alert is
created by the Monitoring and Alerts Policy Suite (MAPS), it is based on the experience and expertise of the
MAPS technology. This allows Storage Administrators to regain valuable time.
If a dashboard item is indicating an event or status, clicking that item will drill down into more specific
information. In the example in Figure 8, a window showing FCIP Health Violations was opened to gather
more detailed information. The detailed information shown here includes timestamp, device name, tunnel
and circuit designation, the rule that was violated, the offending value that was measured, and the units
of that measurement. You can also see if it was registered in the RASlog, Simple Network Management
Protocol (SNMP), or e-mail, as well as the level of violation (marginal or critical), and other information.
CLI Dashboard
The CLI dashboard from Brocade is for users who prefer to use the CLI instead of Brocade Network Advisor.
All information from each monitoring category can be obtained from the CLI, including any out-of-range
conditions and the rules that were triggered. Error statistics provide at-a-glance views of status and the
various contributing conditions to that status. Historical switch information for up to the last seven days
provides a variety of error counters, giving users instant visibility into problem areas and facilitating the
decision process towards proper resolution and planning.
This set of monitors enables the detection of just about any degraded IP network condition that might occur.
If the network goes down for a period longer than the KATOV (Keepalive Time Out Value) for a circuit, that
event is detected, and an alert is processed. If the IP network has fits of transient congestion that results
in either excessive jitter or packet loss, that event is detected and an alert is processed. If any one of the
PTQ priorities suffers low throughput, packet loss, or out-of-order events, that issue is detected and an alert
is processed. If the network reroutes, causing the RTT to change for the worse, that event is detected and
an alert is processed. MAPS provides comprehensive detection of various network events and the ability to
make those events known to Storage Administrators. Storage Administrators can leverage this information
to enforce IP network SLAs.
Brocade tunnels are a managed transport between two Brocade Extension endpoints. There are
many reasons for using a Brocade tunnel as a data transport across infrastructure, including granular
load balancing, lossless failover/failback, higher availability, encryption, bandwidth pooling, protocol
optimization, congestion management, Quality of Service (QoS) marking/enforcement, monitoring,
reporting, and network diagnostics. A Brocade tunnel uses Extension Trunking and may consist of multiple
member circuits, which may traverse one or more service providers, each with a distinct SLA. Brocade
provides Storage Administrators with a single point of management and service provider SLA validation.
Tunnels are point-to-point, and the endpoints of a tunnel are the same endpoints as the trunk.
There is a hierarchy of extension connectivity: Virtual E_Ports (VE_Ports, or tunnel endpoints) are in the
top tier of the network and define the endpoints of tunnels. VE_Ports contain one or more circuits. Each
circuit has four WAN Optimized TCP (WO-TCP) sessions, one for each priority (class F, High, Med, Low). MAPS
is integrated into each of these tiers. Any indication of a problem in a lower tier poses a problem for its
associated upper tiers. For example, if a problem exists with the IP network path that the medium PTQ WO-TCP session is using, the circuit carrying that session, and the tunnel in which that circuit is a member, will in general also have a problem. MAPS alerts administrators when thresholds are exceeded. MAPS for tunnels or trunks (VE_Ports) monitors and
performs actions based on throughput and state change.
Policy-Based Monitoring
Policy-Based Monitoring involves predefined monitoring groups with prevalidated monitoring policies.
Multiple monitoring categories enable monitoring of the following: tunnels, circuits, VE_Ports, overall switch
status, switch ports, small form-factor pluggables (SFPs), port blades, core blades, switch power supplies,
fans, temperature sensors, security policy violations, fabric reconfigurations, scaling limits, CPU, memory
utilization, traffic performance, and more.
Predefined monitoring policies are tiered to provide the best starting place for your particular environment
and administrative style: aggressive, moderate, and conservative. A monitoring group’s policy tier can be
set individually and, if needed, you can fine-tune your specific environment. Individual items within a group
can be set to the appropriate tier for customization. For example, when setting a style for FICON-associated
extension connections you might select an aggressive style, and for Remote Data Replication (RDR)
extension connections you might select a moderate style.
Flexible Rules
Rule flexibility means you can monitor a given counter for various threshold values, taking different actions
when each value is crossed. For example, you can monitor a CRC error counter at a switch port and
generate a RASlog entry when the error rate reaches two per minute, send an e-mail notification when the
error rate is at five per minute, and fence a port when the error rate exceeds ten per minute.
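The tiered behavior above can be modeled as selecting every action whose threshold the measured rate has crossed; in practice, each threshold is a separate MAPS rule with its own action. A sketch with hypothetical names:

```python
# Sketch of tiered MAPS rules on a single counter: each threshold is a
# separate rule with its own action. Thresholds mirror the CRC example
# in the text; the rule table and action names are hypothetical.

CRC_RULES = [
    (2,  "RASLOG"),   # rate > 2/min: log a RASlog entry
    (5,  "EMAIL"),    # rate > 5/min: send an e-mail notification
    (10, "FENCE"),    # rate > 10/min: fence the port
]

def triggered_actions(crc_per_minute):
    """Return the actions whose rule thresholds have been exceeded."""
    return [action for threshold, action in CRC_RULES
            if crc_per_minute > threshold]

print(triggered_actions(3))   # ['RASLOG']
print(triggered_actions(12))  # ['RASLOG', 'EMAIL', 'FENCE']
```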
Circuit Fencing
Specific to the Brocade 7840 Extension Switch, the Brocade 7800 Extension Switch, the Brocade SX6
Extension Blade, and the Brocade FX8-24 Extension Blade, when certain thresholds are reached, a circuit
can be taken offline. This is an important action for Brocade Extension, because a degraded circuit causes
all circuits belonging to the same VE_Port to be degraded as well. This might result in a total throughput that
is less than it would be without the degraded circuit. It is necessary to isolate a degraded circuit from the
remaining clean circuits to maintain overall optimal throughput.
Because data is delivered to the Upper Layer Protocol (ULP) in order, a degraded circuit will cause all
member circuits to degrade as well. If a trunk has two circuits, and one circuit is degraded such that it
requires retransmits to complete successful transmission, ultimately the data sent on the clean circuit must
wait for the retransmissions before delivering to the ULP. This means that both the clean and degraded
circuits will go no faster than the degraded circuit. The degraded circuit may effectively be delivering only a
small fraction of the bandwidth apportioned to it, depending on the degree of degradation. By fencing the
degraded circuit, the clean circuit operates at the full bandwidth apportioned to it.
Port Fencing
The Port Fencing action takes the port offline immediately when user-defined thresholds are exceeded.
Supported port types include VE_Ports as well as physical ports (E_Ports and F_Ports). This action is valid
only for conditions evaluated by the actual port.
Port Decommissioning
Port Decommissioning acts in addition to Port Fencing. Port Decommissioning allows ports to be gracefully
shut down. When certain monitored statistics cross defined thresholds, ports are decommissioned, similar
to Port Fencing but without the abrupt traffic disruption. Port Decommissioning is covered in detail in its
own chapter (Part 4 Chapter 3).
SFP Marginal
The SFP marginal action sets the state of the affected SFP transceiver in the MAPS dashboard to “down.”
This action does not bring the SFP transceiver down, rather, it affects only what is displayed in the
dashboard. This action is valid only in the context of Advanced SFP groups.
RASLog Messages
Following an event, MAPS adds an entry to the internal event log for each switch involved. The RASlog stores
detailed event information but does not actively send alerts.
SNMP Traps
In environments where you have a high number of messages coming from a variety of switches, you might
want to receive them in a single location and view them using a Graphical User Interface (GUI). In this type
of scenario, SNMP notifications may be the most efficient notification method. You can avoid logging in
to each switch individually, as you need to do for error log notifications. When specific events occur on a
switch, SNMP generates a message (called a “trap”) that notifies a management station using SNMP. Log
entries can also trigger SNMP traps if the SNMP agent is configured. When the SNMP agent is configured to
a specific error message level, error messages at that level trigger SNMP traps.
E-Mail Alert
An e-mail alert sends information about the event to one or more specified e-mail addresses. The e-mail
alert specifies the device or devices and the threshold and describes the event, much like an error message.
Flow Vision
The Flow Vision diagnostic tool enables administrators to identify, monitor, and analyze specific application
and data flows in order to maximize performance, avoid congestion, and optimize resources. Flow Vision
includes Flow Monitor, MAPS for Flow Monitor, and Flow Generator.
Flow Monitor
Flow Monitor enables you to monitor all the traffic passing through E_Ports, EX_Ports, F_Ports, and xISL_
Ports by using hardware-supported flow parameters. Users gain comprehensive visibility into application
flows within a fabric, including the ability to learn (discover) flows automatically. Define your own flows
to monitor using combinations of ingress and egress ports, source and destination devices, Logical Unit
Number (LUN), and frame types. The monitoring provided on the Brocade 7840 and the Brocade SX6
identifies resource contention or congestion that is impacting application performance.
Flow Monitor provides support for learning or manually defining the following types of extension flows:
• Top Talkers
• Flow within a fabric from a host to a target or LUN on a given port
• Flows inside Virtual Switch Logical Fabrics
• Flows passing through E_Ports (Inter-Switch Links, or ISLs)
• Flows passing through F_Ports (end device)
• Frame-based flows
Flow Generator
Flow Generator is a traffic generator for pretesting and validating infrastructure. Flow Generator is hardware-
specific (Condor 3 and later) and can be used across the Brocade Extension infrastructure. Various
components and devices can be tested, including internal switch connections, cable plant, new switches
and blades, servers, storage, and IP networks. Flow Generator allows users to:
• Configure an FC/FICON-capable port as a simulated device that can transmit frames at 16 Gbps line rate
• Emulate an FC/FICON fabric without actually having any hosts, targets, or testers
• Pretest the entire SAN
Flow Generator creates a special port type called SIM ports, which are used to process simulated traffic.
SIM ports behave like normal E_Ports, EX_Ports, or F_Ports but are used only for testing purposes. SIM ports
originate and terminate Flow Generator traffic and ensure that this traffic does not leave the destination
switch. Flow Generator generates standard or custom frames with user-specified sizes and patterns. Flow
Generator supports predefined or learned flows to generate traffic between configured SIM ports. Once
you activate a flow, the flow remains active until it is deactivated. The flow remains idle while the Source ID
(SID)/Destination ID (DID) SIM ports are offline. As soon as the SID/DID SIM ports are brought online, the
traffic proceeds. As an example use case, create a traffic flow between a SID and DID to validate routing and
throughput across an FCIP ISL.
Wtool
Wtool uses WO-TCP. Wtool is an IP-specific tool for testing the WAN-side infrastructure on the Brocade 7840
Extension Switch and the Brocade SX6 Extension Blade. Wtool creates data flows that use the same circuits
configured in a tunnel or trunk. Since Wtool uses the same circuit, all the characteristics of that circuit
remain viable during testing, including jumbo frames/Path Maximum Transmission Unit (PMTU), VLAN, IPv4/
IPv6, and IPsec. If a circuit in a trunk is selected with Wtool, the trunk’s other circuits remain online and
operational while the selected circuit is decommissioned for testing.
WO-TCP is an advanced TCP stack with unique abilities. WO-TCP can logically behave just like UDP. Testing
an IP network for proper SLA requires a UDP-like transport to provide constant bit rates, no reordering, and
no retransmits. If the IP network reorders, you will see it. If the IP network drops packets, you will see it.
Traditional TCP hides these issues, making any diagnostic tool useless. Wtool uses the same TCP headers
as the tunnel, except that all retransmit mechanisms are disabled, which exposes aberrant network
conditions. To test a specific protocol such as FCIP, you need that protocol’s TCP headers to traverse the
network and pass through security devices that may exist. Additionally, when used with Wtool, WO-TCP
prevents traffic windowing due to network error conditions. This means that Wtool always drives traffic at
circuit rates despite network errors that would normally cause traditional TCP to close its windows. Brocade
provides WO-TCP technology in the world’s leading data transport TCP stack.
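The counting logic behind such a no-retransmit, UDP-like transport can be sketched with sequence numbers. This is not Wtool itself, just an illustration, assuming each datagram carries a monotonically increasing sequence number starting at zero:

```python
# With retransmits disabled, lost and reordered datagrams are directly
# observable from the sequence numbers that arrive at the receiver.

def analyze_sequence(received_seqs):
    """Count lost and out-of-order datagrams from received sequence numbers."""
    out_of_order = 0
    seen = set()
    highest = -1
    for seq in received_seqs:
        if seq < highest:
            # Arrived after a higher sequence number: the network reordered it.
            out_of_order += 1
        highest = max(highest, seq)
        seen.add(seq)
    total_sent = highest + 1          # sequence numbers start at zero
    lost = total_sent - len(seen)     # gaps that never arrived
    return {"lost": lost, "out_of_order": out_of_order}

print(analyze_sequence([0, 1, 3, 2, 5]))  # {'lost': 1, 'out_of_order': 1}
```

Traditional TCP would repair both conditions invisibly; a transport that refuses to do so is what lets Wtool attribute them to the IP network.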
Bottleneck Detection
Brocade offers bottleneck detection for FC and FICON fabrics.
ClearLink Diagnostics
ClearLink Diagnostics, sometimes referred to as D_Port, is a Brocade Gen5 Fibre Channel exclusive feature.
Advance warning of replication under stress and effective troubleshooting are critical, given the short time
available and the impact to operations. Is the IP network providing its SLA? Users receive an alert from
MAPS in the event that one of the following occurs:
• Circuit Retransmits: Retransmits grow above a certain rate, indicating congestion or packet loss or that
the network is excessively delivering data out-of-order.
• Circuit State Change: A circuit goes down because the KATOV has expired some number of times within
a period.
• Circuit RTT: Round Trip Time has increased well beyond the typical operating value.
• Circuit Jitter: Jitter has increased well beyond the typical operating value.
Any of these is an indicator of a problem in the IP network that will negatively affect the performance and
impede the goal of safe RDR. The information provided by the Brocade 7840 and the Brocade SX6 can be
used to enforce SLAs between Storage Administration and Network Administration and/or service providers.
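A baseline-deviation check of this kind can be sketched as follows; the metric names, baseline values, and deviation factor are illustrative, not Brocade MAPS defaults:

```python
# Hedged sketch: flag circuit metrics that drift well beyond their typical
# operating value, in the spirit of the MAPS alerts listed above.

def circuit_alerts(sample, baseline, factor=2.0):
    """Return names of metrics whose sampled value exceeds factor x baseline."""
    return sorted(name for name, value in sample.items()
                  if value > factor * baseline.get(name, float("inf")))

baseline = {"rtt_ms": 20.0, "jitter_ms": 2.0, "retransmit_rate": 0.001}
sample   = {"rtt_ms": 55.0, "jitter_ms": 2.5, "retransmit_rate": 0.01}
print(circuit_alerts(sample, baseline))  # ['retransmit_rate', 'rtt_ms']
```

Here RTT and retransmits have risen well beyond twice their typical values and are flagged, while jitter has not; this is the sort of evidence that supports an SLA discussion with the network team.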
In Figure 14, the delta set cycle time is set to 30 seconds. Cycle times will not be less than 30 seconds.
The expectation is that complete data transfer of the delta set will take place within the 30-second interval.
As shown, this is not happening regularly. In many instances the amount of time to transfer the data set
exceeds 30 seconds and may take as long as 43 seconds. A determination must be made if this is due to a
network problem or if there is just too much data to transmit for the bandwidth and compression available.
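Whether the available bandwidth and compression can drain a delta set within the cycle time is simple arithmetic. This sketch uses illustrative numbers, not values from Figure 14:

```python
# Time to transmit one delta set over the WAN, given link rate and the
# achieved compression ratio. All figures below are illustrative.

def delta_set_transfer_seconds(delta_set_gb, wan_gbps, compression_ratio):
    """Seconds to send a delta set over the WAN after compression."""
    bits = delta_set_gb * 8                    # data in gigabits
    effective_bits = bits / compression_ratio  # what actually crosses the WAN
    return effective_bits / wan_gbps

# 12 GB delta set, 2 Gbps WAN, 2:1 compression -> 24 s, fits the 30 s cycle.
print(delta_set_transfer_seconds(12, 2.0, 2.0))    # 24.0
# 21.5 GB delta set under the same conditions -> 43 s, blows the cycle.
print(delta_set_transfer_seconds(21.5, 2.0, 2.0))  # 43.0
```

If the computed time exceeds the cycle even with a healthy network, the problem is simply too much data for the bandwidth and compression available, not a network fault.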
Does RDR appear to be stressed, while the IP network administrators claim it should not be? Use these
tools to gain visibility into your RDR environment:
• Use Flow Vision to visualize the RDR flows and determine if they are consuming all the available
bandwidth.
• Use Flow Vision to monitor individual replication session data and I/O rates.
• Use Flow Vision to monitor RTT and jitter. If these are excessive, they can cause droop across the WAN
connection. Protocol optimization such as Brocade FastWrite may be required, if applicable.
• Use MAPS to alert on events that indicate degraded IP network states, which elongate delta set
transmission times and hinder recovery from Delta Set Extension or backlog journaling when the network
is running behind.
The Brocade 7840 and the Brocade SX6 provide tools to simplify this process considerably. These tools
include ping and traceroute at a rudimentary level. There is a PMTU discovery tool for determining the IP
network MTU in cases where it is not already known or has to be verified. For a comprehensive test of the IP
network, run Wtool for any desired period of time. For example, run Wtool for the duration of a typical work
day. Generate line-rate extension TCP flows and gather IP network statistics for analysis and SLA evaluation.
Upon completion, Wtool will provide information about the IP network, including maximum throughput,
congestion, loss percentage, out-of-order datagrams, latency (RTT), and other network conditions.
Chapter Summary
Storage Administrators are facing challenges with infrastructure they do not manage or control, specifically
the IP network, across which many of their applications pass. Because Storage Administrators have little
familiarity with the IP network and vice-versa, this can prolong time to resolution when there are operational
problems. This is exacerbated when support and multiple vendors are involved. The process requires a fast
and efficient way for Storage Administrators to pinpoint where a problem may be occurring.
But how can you more quickly pinpoint problems? Storage arrays have no visibility into the extension
network, plus that is not really within the expertise of Storage Administrators. The Brocade 7840 and the
Brocade SX6 extension platforms provide unique tools for quickly resolving hard-to-diagnose issues. Fabric
Vision offers MAPS, Flow Vision, Flow Generator, and Wtool, which are all enterprise-class tools. MAPS has
the ability to detect either sudden changes in the network or errors that present themselves slowly over
time. Monitoring is easy to set up and customize with pre-established policies and rules that incorporate
thresholds determined by Brocade over more than a decade of experience. When a monitored network
characteristic crosses marginal or critical thresholds, various alerts and actions are available. Beyond the
automated monitoring, alerts, and actions provided by MAPS, another valuable tool for visualizing flows is
available, called Flow Vision.
Flow Vision offers the Flow Monitor tool. Flow Monitor can discover flows automatically or can be configured
manually. Upon monitoring a particular flow, MAPS is integrated into Flow Monitor so that thresholds of
interest can be set with corresponding alerts and actions. Another tool that Flow Vision provides is called
Flow Generator. Flow Generator can generate test flows that originate at the FC ASIC level and traverse
Brocade Extension. The characteristics of the flows are either Brocade preconfigured or learned from the
application.
WO-TCP is an advanced TCP stack with the ability to treat each individual flow autonomously and eliminate
HoLB. WO-TCP also provides a special functionality for generating UDP-like traffic with extension headers.
This permits Wtool to perform accurate testing of network conditions while the IP network sees actual
extension traffic flows. A couple of cases were discussed in this brief: One was how to determine where
trouble may be originating. Another was how to determine stress on array replication applications. Overall,
the Brocade 7840 and the Brocade SX6 come with sophisticated tools that enable you to distinguish trouble
with the network from storage array application problems. These effective tools facilitate more efficient
support calls and faster problem resolution.
Brocade Fabric Vision technology changes the paradigm of FICON SAN fabric management and
performance management from a reactive mindset to a proactive one. This allows the end user to
identify potential problems before they become true problems that have a negative impact on operations.
The proactive paradigm enabled by Fabric Vision also allows for expedited error detection, troubleshooting,
and corrective action in a more expeditious manner. This can reduce operational costs, as well as outage
times and costs.
References
Brocade Communications Systems. (2017). Brocade Fabric OS FICON Configuration Guide, 8.1.0. San Jose,
CA, USA: Brocade Communications Systems.
Brocade Communications Systems. (2017). Brocade Fabric OS Administration Guide, 8.1.x. San Jose, CA,
USA: Brocade Communications Systems.
Brocade Communications Systems. (2017). Brocade Monitoring and Alerting Policy Suite Configuration
Guide, 8.1.0. San Jose, CA, USA: Brocade Communications Systems.
Brocade Communications Systems. (2017). Brocade Network Advisor SAN User Manual, 14.3.1. San Jose,
CA, USA: Brocade Communications Systems.
Brocade Communications Systems. (2016). The Benefits of Brocade Fabric Vision Technology for Disaster
Recovery. San Jose, CA, USA: Brocade Communications Systems.
Brocade Communications Systems. (2016). SAN Fabric Resiliency and Best Practices. San Jose, CA, USA:
Brocade Communications Systems.
Guendert, S. (2017). “Fabric Vision Technology and the Business Continuity Network.” Enterprise Tech
Journal Issue 3 2017. Enterprise Systems Media.
PART 4 Chapter 2: Brocade FICON CUP Diagnostics and the IBM Health Checker for z/OS
Introduction
As IBM Z environments have grown, and configurations have become more complex, FICON fabric issues
can result in unacceptable Input/Output (I/O) service times. Resource Measurement Facility (RMF) device
activity reports might show average service times that are higher than normal, while matching I/O queuing
reports show abnormally high “initial command response” times on a subset of the paths to a device.
Before these CUP Diagnostic enhancements were introduced, it was problematic to identify a single root
cause for these issues or to identify where in the fabric a problem might have originated. CUP Diagnostics
enables the CUP to proactively notify the IBM Z system that there are abnormal conditions in the fabric
that might impact either the performance or reliability of the I/O traffic traversing the fabric. This chapter
reviews the basic functionality of FICON CUP and then introduces the new FICON
Management Server (FMS) and CUP Diagnostics capabilities now available to be used by the IBM Health
Checker for z/OS to gain better insight into the health and robustness of FICON Storage Area Network (SAN)
fabrics.
The FICON CUP provides an in-band management interface, defined by IBM, that specifies the CCWs and
data payloads that the FICON host can use for managing the switch. The protocol used is the IBM version of
the ANSI FC-SB-4 single-byte command code specification, which defines the protocol used for transporting
CCWs to the switch, and for the switch to direct data and status back. FICON CUP becomes available when
the FMS optional license is installed and enabled on FICON switching devices such as the Brocade DCX
8510 Backbone and the Brocade 6510 Switch, regardless of which vendor sells these devices.
FICON CUP is a direct architectural descendant of the CUP that ran on ESCON Directors. The IBM 9032-
5 ESCON Directors had an in-band management capability that utilized an embedded port in the control
processing cards to provide a communications path to an MVS console. This was used for three primary
purposes: first, reporting hardware (Field-Replaceable Unit [FRU]) errors up to MVS (helpdesk), second,
allowing and prohibiting ports (the world’s first “zoning” mechanism) by using Prohibit Dynamic Connectivity
Mask (PDCM), and third, basic performance monitoring. When switched FICON was being introduced, IBM
wanted to make certain that its mainframe customers would have a consistent management look and feel
between ESCON and FICON, so CUP was carried forward to FICON. Today, FICON CUP support is provided
by all mainframe storage and SAN vendors, but customers must implement the optional FMS license to
make it functional on their FICON switches and directors. Today’s more advanced CUP is still used for the
three primary functions listed above, but it has been enhanced to also provide RMF reporting for the FICON
switching devices and to help implement FICON Dynamic Channel Path Management (DCM).
Some RMF reports, such as the FICON Director Activity report, can assist in the analysis of fabric problems.
It has become apparent over time that as fabrics become more complex, it is essential that they be
proactively monitored by an IBM Z system. Although z/OS is very mature and robust by most standards,
there are a few challenges in the way that it has traditionally defined and worked with fabrics. For
example, the z/OS IOCDS definitions include F_Ports, relative to Control Units and Devices, but there are
no definitions for Inter-Switch Links (ISLs), routing topologies, or distance extension facilities. Consequently,
these fabric components are essentially invisible to z/OS, and z/OS can do nothing to react to specific
problems in the fabric. Rather, z/OS simply suffers performance degradations and reliability exposures until
the problems manifest in a secondary manner that it is capable of detecting, such as a serious degradation
of I/O performance or reliability. CUP Diagnostics is a facility that enables the CUP to notify the IBM
Health Checker for z/OS host management program about fabric events while also supporting subsequent
queries by the host to provide detailed information about its fabric topologies (that is, domains, ISLs, routing
groups) and the location of, and details of, errors within those topologies.
The ROUTE function provides important information back to the command issuer regarding the specific
path that frames are traversing between a mainframe Channel Path ID (CHPID) and a device (that is, the
device TODEV keyword) or from a device to a mainframe CHPID (that is, the FROMDEV keyword). In a parallel
development effort, two new IBM z/OS Health Checks were added so that z/OS could recognize some
specific problems within the fabric. These functions are supported on all switches beginning with Brocade
FOS v7.1.0c and Brocade Network Advisor 12.0.2 software levels or later.
On the IBM Z side, z/OS 1.12 or later, plus some required Authorized Program Analysis Reports (APARs)
and Program Temporary Fixes (PTFs), are required with Brocade FOS v7.1.0c to support the CUP
Diagnostics and IBM Health Checker for z/OS features added at that firmware level.
Important Note: Even if the customer has no plans to use these new CUP features, the IBM Health Checker
for z/OS feature uses a previously undefined bit in the Sense data. If the user does not implement these
APARs and PTFs, then any time the channel issues a Sense command to the CUP, and there is health
information to report, an error message indicating an invalid command code is posted to the z/OS console.
This might become annoying to the user.
The primary user interface to CUP Diagnostics is the ROUTE enhancement to the z/OS
Display Matrix command. The DISPLAY M=DEV command allows a user to display the route through the
SAN fabric for a specific device and channel path, by specifying the ROUTE keyword. The routing information
returned includes all of the domains and ports that are in the path from the channel to the device. Since
the usual intent is to find the frame routing that is being used in a cascaded fabric, the reporting of routing
information is performed only when the channel is connected to a switch and the control unit definition for
the channel path is defined in the I/O configuration with a two-byte link address (that is, cascaded FICON).
In the example in Figure 1, a user wants to see how frames are being sent from the mainframe (CHPID 09)
through the fabric in order to reach one of the disk arrays, A000. The user also wants a report about the
health of the fabric along that path. The user issues the Display Matrix command with a ROUTE parameter
and a HEALTH parameter. You can see the output that is displayed when the CUP Diagnostics function is
executed on their behalf via the Display Matrix command.
Figure 1: CHPID 9 using entry point 45 on CUP22 to send and receive frames to CU A000.
In the routing portion of the display output, information for each switch in the path is displayed, which
includes its domain and switch type. The definition of the switch type depends upon the number of switches
in the path between the channel and device, and the position of the switch within the path. If there is only
one switch, the switch type is shown as Only Director. If there is more than one switch, the first switch is
known as the Source Director, and the last switch is known as the Destination Director. If there are any
switches between the source director and the destination director (see Figure 2), each of those switches is
known as an Intermediate Director.
From the example output, you can see that the route from the channel to the device travels through two
switches, Domain ID 22 (that is, source director) and Domain ID B3 (destination director). The channel
is connected to entry port 45 on switch Domain ID 22, as indicated by the “From” column for that port.
The “To” column for the same port indicates that I/O requests originating from this port are routed to ISL
aggregate FA (that is, four trunked links, as in Figure 2). Brocade FOS v7.2 did not make any distinction
between trunked and nontrunked ISL connections. Therefore, only a single ISL link is shown in the output.
Fabric Shortest Path First (FSPF) provided CUP Diagnostics with a list of possible egress ports, and CUP
Diagnostics simply chose the first egress port shown in that list as the master ISL link.
Figure 2: CHPID 9 could use a link on one of two hardware trunks to send and receive frames to CU A000.
For entry ports, the physical connection appears under the “From” column and represents either a channel,
a control unit, or a single port on another switch. The logical connection appears under the “To” column
and represents a single port or a group of ports on the same switch, depending on the routing and grouping
methods. For exit ports, the logical connection appears under the “From” column and always represents
a single entry port on the same switch. The physical connection appears under the “To” column and
represents either a channel, control unit, or a single port on another switch. Brocade FOS v7.3 and beyond
provide additional information about individual trunk ISL links that Brocade FOS 7.2 does not provide. If
you examine Figure 2, you see that there are two trunks of four ISL links each. Each trunk is an aggregation
of ISL links and is identified using the lowest port number in the group. In this case, the source switch is
Domain ID 22, so the aggregation designations are for Domain ID 22 flowing frames to Domain ID B3.
That makes the first ISL aggregation for Domain ID 22 aggregation-F8 and the second ISL aggregation for
Domain ID 22 aggregation-E0.
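The aggregation numbering rule described above is easy to express: the group is identified by its lowest member port, rendered in the two-digit hexadecimal form used in the display output. A small hedged sketch:

```python
# Aggregation Group Number (AGN) derivation, per the lowest-port rule
# described in the text. Port lists here are illustrative.

def aggregation_group_number(ports):
    """Return the AGN for a trunk: its lowest member port, as two-digit hex."""
    return format(min(ports), "02X")

# The two trunks from the Figure 2 example:
print(aggregation_group_number([0xF8, 0xF9, 0xFA, 0xFB]))  # F8
print(aggregation_group_number([0xE0, 0xE1, 0xE2, 0xE3]))  # E0
```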
You can interpret the information for the destination director B3 in a similar manner. Entry port 0A is within
the aggregation group F8, and it routes the I/O requests to port 60 on the same switch. Once again, there is
no distinction between trunked and nontrunked ISL connections. Therefore, only a single ISL link is shown
in the output chosen from the FSPF list. Although port 60 has data routed to it from multiple ISL entry ports
on the switch, it is not possible to indicate that to the user in this version of Brocade FOS. Port 60 is not part
of an aggregate group, therefore it contains “..” in that column. The storage device control unit is connected
to port 60. Note that in this static routing example (that is, Port-Based Routing [PBR]), the information in
the “Misc” column indicates that I/Os are being statically routed. In the version of CUP Diagnostics for
Brocade FOS v7.2, there is no recognition of Device-Based Routing (DBR) or FICON Dynamic Routing (that is,
Exchange-Based Routing) capabilities on ISL connections.
The following output shows the “health” display for the example above:
For the health portion of the display, a description of the health of the fabric, each switch, and each port is
displayed, as well as the following information for each port:
• The % transmit/receive utilization indicates the percent utilization of the total transmit or receive
bandwidth available at the port.
• The % transmit delay is the percent of time that frame transmission was delayed because no buffer
credits were available on the port. The % receive delay is the percent of time that the port was unable to
receive frames because all receive buffers were utilized.
• The error count is the number of errors detected on the port that affect the transmission or receipt of
frames on the port. This is a sum of the errors counted over the fabric diagnostic interval, which is set to
30 seconds by z/OS.
• The optical signal column indicates the signal strength of the fiber optic signal being transmitted/
received by this port, in units of dBm (that is, an absolute power level measured in decibels and
referenced to 1 milliwatt).
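Two of the displayed values can be illustrated with simple formulas; the counters, interval, and link rate used here are illustrative (and the utilization figure ignores FC encoding overhead):

```python
import math

# Illustrative computation of two health-display values: percent transmit
# utilization from a byte counter, and optical power expressed in dBm.

def percent_utilization(bytes_tx, interval_s, link_gbps):
    """Percent of the available transmit bandwidth used over the interval."""
    used_bps = bytes_tx * 8 / interval_s
    return 100.0 * used_bps / (link_gbps * 1e9)

def milliwatts_to_dbm(power_mw):
    """dBm: an absolute power level in decibels, referenced to 1 milliwatt."""
    return 10.0 * math.log10(power_mw)

# 30 GB transmitted over the 30 s diagnostic interval on a 16 Gbps link.
print(round(percent_utilization(30e9, 30, 16), 1))   # 50.0
# A received optical power of 0.5 mW corresponds to about -3 dBm.
print(round(milliwatts_to_dbm(0.5), 1))              # -3.0
```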
This example of the health data reveals that there are no health problems within the fabric. The text
provided in the fabric health, switch health, and port health is switch-vendor-specific. If any data is not valid,
then ”..” is displayed in the appropriate column.
In the previous version of CUP Diagnostics (that is, Brocade FOS v7.2) only the first FSPF link of an ISL
trunking aggregation was listed in the report that was provided in response to a CUP Diagnostics query. With
the release of Brocade FOS v7.3, CUP Diagnostics was capable of providing a much more robust display of
all of the members of an ISL aggregation/trunk. A common Aggregation Group Number (AGN) is assigned to
indicate which ISL links are contained within the same trunk aggregation. Each port in the trunk aggregation
is displayed with its own statistics. As an example, referring to Figure 4, a user would like to see how frames
are being sent from the mainframe (CHPID 09) through the fabric, in order to reach one of their disk arrays,
A000, similar to what was done at Brocade FOS v7.2. The user also wants a report about the health of the
fabric along that path. The user issues the same Display Matrix command with a ROUTE and a HEALTH
parameter. You can view the enhanced output displayed from CUP Diagnostics, via the Display Matrix
command, as follows.
Figure 4a: Trunk (that is, aggregation) groups are assigned the lowest port in the trunk as the group
number.
Figure 4b: Trunk (that is, aggregation) groups are assigned the lowest port in the trunk as the group
number.
From the example output, you can see that the route from the channel to the device travels through two
switches, Domain IDs 22 and B3. Domain ID 22 is the source director, and Domain ID B3 is the destination
director. The channel is connected to entry port 45 on switch Domain ID 22, as indicated by the “From”
column for that port. The “To” column for the same port indicates that I/O requests originating from this
port are routed to ISL aggregate F8 (that is, four trunked links; see Figure 5). From the “Agg” column, you
can also determine that aggregate F8 consists of exit ports F8, F9, FA, and FB on the switch. Each of these
exit ports is connected to a different port on switch domain B3, as shown in the “To” column for the ports.
For example, port F9 routes I/O to port 09 on domain B3.
When a port is connected to a channel, “Chan” appears under the “From” or “To” columns of the display,
depending on which direction was specified. Likewise, when a port is connected to a control unit, “CU”
appears in the display. When an entry or exit port is connected to a single switch port, the domain and
port number are displayed as a four-digit number under the “From” or “To” columns. When an entry port is
connected to multiple ports on the current switch, a dynamic or aggregate group number is displayed under
the “To” column, depending on the routing and grouping methods used.
Figure 5a: A more detailed look at the ISL links contained in the Aggregation Groups.
When a set of entry or exit ports are part of an aggregate group of ports, those ports are assigned an
aggregate group number. This group number appears in two places. First, it appears under the “Agg”
column, to show which ports make up the aggregate group. Second, if I/O requests are being statically
routed from an entry port to this aggregate set of ports, then the aggregate group number appears under
the “To” column of the entry port. The value that is assigned for the aggregate group number is switch
vendor-specific.
Note that the definition of what is considered an entry port or exit port depends on the direction of the
display request: channel to device (TODEV) or device to channel (FROMDEV). For example, when TODEV is
specified, the port that is connected to the channel is considered an entry port, and the port connected to
the control unit is considered an exit port. However, if FROMDEV is specified, the roles are reversed.
The information for the destination director B3 can be interpreted in a similar manner. Entry ports 8, 9, A,
and B are all within aggregate group F8, and they route the I/O requests to port 60 on the same switch (that
is, Domain ID B3). Notice that port 60 can have data routed to it from multiple entry ports on the switch.
Because there is not a single originating port to identify, the “From” column contains “Mult”. Port 60 is not
part of an aggregate group and therefore contains “..” in that column. The storage device control unit is
connected to port 60.
One of the major changes in this release of CUP Diagnostics is that both PBR and DBR are noted as “Static”
under “Misc” in the report. DBR was not supported in CUP Diagnostics at Brocade FOS v7.2. When an entry
port uses static routing, the exit port or aggregate group number assigned to the port and the numbers of
alternate paths are displayed. Detailed information about the alternate paths is not displayed.
Note that, in this static routing example (that is, PBR or DBR), the data for aggregate E0 is not displayed;
however, the information in the “Misc” column indicates that I/Os are statically routed. One alternate
route is defined, which is the aggregate E0 routing path. FICON Dynamic Routing (that is, Exchange Based
Routing) is not supported at Brocade FOS v7.2, so only static routing is reported by v7.3 CUP Diagnostics.
This example of the health data shows that there are some health problems within the fabric on both
switches, as shown by the fabric and switch text Port Error. You see a combination of bottleneck detected
errors, slow draining device detected errors, and some ports that have been fenced. The text provided in the
fabric health, switch health, and port health is switch-vendor-specific. If any data is not valid, ”..” is displayed
in the appropriate column.
Important Note: A user might have the FMS license installed on all FICON switching devices, yet during
an upgrade from a previous version of Brocade FOS to v7.3, High Integrity Fabric (HIF) does not get set. At
Brocade FOS v7.3, FMS mode depends on HIF being enabled. If for any reason HIF is not enabled
on a Brocade FOS v7.3 switch, then FMS is disabled and a query by CUP Diagnostics treats the switch as a
"Blind Director." For a cascaded CUP Diagnostics query, where the Source and Destination domains and
the Source port are valid, and one of the domains after the Source domain is down-level (Brocade FOS v7.0
or earlier) or has FMS mode disabled, CUP Diagnostics indicates that the domain is incapable of allowing the
query. Brocade FOS v7.3 handles domains that do not support CUP Diagnostics by supplying an appropriate
reason code (for example, lower firmware level or FMS mode being off). Customers typically have cascaded
switches at or close to the same Brocade FOS level.
Figure 6: Blind Director is not capable of processing Director Diagnostics Query messages due to
inadequate firmware levels (Left) and not having CUP enabled (Right).
Brocade FOS v7.3 supports CUP Diagnostic queries where the shortest path from the Source domain to
the Destination domain can contain multiple subpaths that all have an equal path cost. A "Diamond"
topology (see Figure 8) is one where there are two ISL paths from the Source domain (that is, the CHPID
attached to an F_Port) to the Destination domain (that is, F_Port leading to the storage control unit), and
both of the ISL paths have an equal cost.
It is prudent, at this point, to provide a brief refresher of ISL equal-cost paths. An important benefit of Fibre
Channel is the possibility of using multiple ISLs between switches. There is no need to worry about loops
and no blocked links, just additional ISLs that provide bandwidth and redundancy. If a user adds more ISLs
between switches, FSPF accounts for the topology change. FSPF is a routing protocol used in Fibre Channel
fabrics to establish routes across the fabric and to recalculate these routes if a topology change occurs (for
example, link failures). Each switch is a hop, and each ISL between switches has a cost. The cost of an ISL
is dependent on the line rate of the link and the number of hops between the source and the destination
switching device.
So, in order to have equal-cost ISL paths, every ISL must run at the same data rate and traverse the
same number of hops. It is also important to note that even though FICON is currently qualified
to make use of only a single hop, the one-hop restriction is basically a FICON qualification issue. There is
nothing in the Fibre Channel standard, or in FICON itself, that precludes the use of multiple hops.
Linux on z Systems doing SCSI I/O, for example, is allowed to make more than one hop to its storage.
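The equal-cost path computation described above can be illustrated with a small sketch (the link costs and topology below are hypothetical; real FSPF derives each link's cost from its line rate and recomputes routes on any topology change):

```python
def all_shortest_paths(graph, src, dst):
    """Return every simple path from src to dst whose total cost equals
    the minimum. graph maps node -> list of (neighbor, link_cost).
    A toy stand-in for FSPF's shortest-path computation."""
    best = {}                      # node -> cheapest cost seen so far
    found = []
    stack = [(src, 0, [src])]
    while stack:
        node, cost, path = stack.pop()
        if cost > best.get(node, float("inf")):
            continue               # a strictly cheaper route already exists
        best[node] = min(cost, best.get(node, float("inf")))
        if node == dst:
            found.append((cost, path))
            continue
        for nbr, link_cost in graph[node]:
            if nbr not in path:    # keep paths loop-free
                stack.append((nbr, cost + link_cost, path + [nbr]))
    min_cost = min(cost for cost, _ in found)
    return [path for cost, path in found if cost == min_cost]

# "Diamond" topology: two ISL routes of equal cost from the Source
# domain (A, where the CHPID attaches) to the Destination domain
# (D, where the storage control unit attaches), via B or via C.
fabric = {
    "A": [("B", 500), ("C", 500)],
    "B": [("D", 500)],
    "C": [("D", 500)],
    "D": [],
}
print(sorted(all_shortest_paths(fabric, "A", "D")))
```

With two routes of identical total cost from A to D, both subpaths survive the minimum-cost filter, which is exactly the "Diamond" condition that CUP Diagnostics must be able to report on.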
A Diamond topology is also relevant to fabrics that utilize Exchange-Based Routing (that is, FICON Dynamic
Routing [FIDR]). The reason for this is that with static routing, there is only one path that is always chosen
through the fabric, and CUP Diagnostics does not have to provide any path forking reports unless FIDR is in
use. This enhancement was made in Brocade FOS v7.3 to prepare for the advent of FIDR at Brocade
FOS v7.4.
The rules that have an FMS notification action are part of all three of the default MAPS policies: dflt_
aggressive_policy, dflt_moderate_policy, and dflt_conservative_policy. MAPS also supports the FMS action
in custom rules. See Table 1 for an example. Some of the default rules are configured with the action FMS
along with typical default actions of RASLOG, SNMP, and EMAIL. The FMS action can also be configured at
the MAPS global action level (see Table 1). In the active policy, if the FMS notification action is configured
for any triggered events, MAPS sends the notification, with event information, to CUP. For example, the
following command creates a custom rule that includes the FMS action:
• mapsrule --create myRule -group ALL_F_PORTS -monitor CRC -timebase MIN -op ge -value 0 -action
raslog,snmp,FMS
The Brocade FOS Fabric Vision MAPS module exposes a new action in the mapsconfig command. The
modified definition of this command is:
• mapsConfig --actions
If the user issues the Dashboard CLI command, it shows configured actions, including FMS, as part of the
configured notification display.
• Threshold event: The threshold event is the result of a MAPS rule threshold being exceeded (for example,
a CRC error). It is reported once and then cleared.
• Sticky event: A sticky event is a port state that is reported by a MAPS rule being triggered (for example,
traffic congestion and port fencing). A sticky state is a persistent state with a specific HSC value that is
reported by CUP queries. It is continuously reported by CUP queries until another MAPS rule or event
occurs that changes that state.
Traffic congestion can occur on either F_Ports or E_Ports. Bottlenecks are detected by special Brocade FOS
components (that is, Bottleneck Detection or Fabric Performance Impact Monitoring) and these are now
reported to CUP, which in turn alerts the z Systems of the condition. If both threshold events and sticky
events exist, a query has the threshold event reported preferentially over the sticky event. Subsequent
queries then report the sticky event, until it is changed by another event.
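The precedence between the two event types can be modeled in a short sketch (the class and method names below are illustrative, not the actual CUP implementation):

```python
class HealthReporter:
    """Toy model of CUP event reporting: a threshold event is reported
    once and then cleared; a sticky event is a persistent state that is
    reported by every query until another event replaces it."""

    def __init__(self):
        self.threshold_event = None   # one-shot, e.g. a CRC threshold breach
        self.sticky_event = None      # persistent port state, e.g. congestion

    def raise_threshold(self, event):
        self.threshold_event = event

    def set_sticky(self, event):
        self.sticky_event = event

    def query(self):
        # Threshold events are reported preferentially over sticky events.
        if self.threshold_event is not None:
            event, self.threshold_event = self.threshold_event, None
            return event              # reported once, then cleared
        return self.sticky_event      # reported until replaced

cup = HealthReporter()
cup.set_sticky("traffic congestion")
cup.raise_threshold("CRC threshold exceeded")
print(cup.query())   # the threshold event is reported first
print(cup.query())   # subsequent queries report the sticky event
```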
Refer to the Brocade FOS 7.4.x MAPS Administrator’s Guide for a complete listing of all MAPS Trigger Values
and HSC Codes and Text That Are Sent to the z/OS console.
Chapter Summary
When a FICON user installs and enables the optional FMS license on a Fibre Channel switching device,
the user can utilize new CUP enhancements to provide a more robust fabric environment. Switch (CUP)
Diagnostics provides new fabric-wide diagnostic command channel programs to enable z/OS to obtain
fabric topology, collect diagnostic information such as performance data, determine the health of the fabric,
generate a diagnostic log on the switch, and help users resolve problems in the fabric. A switch device can
provide an unsolicited unit check, called a Health Summary unit check, which indicates that one or more
switches or ports in the fabric are operating at less than optimal capability. This check triggers
z/OS to retrieve the diagnostic information from the switch to further diagnose the problem. The sense data
identifies the source and destination ports that are used for the query. A switch that has CUP Diagnostics
support signals to z/OS that there is a health problem. The z/OS system initiates monitoring for the path,
requesting diagnostic data from the switch at regular intervals. The problem might require intervention such
as additional z/OS system commands or I/O configuration changes. After no errors or health issues are
detected by the switch for at least two hours, the monitoring of the route is stopped and no longer appears
in the report. The IBM Health Checker for z/OS reports if any switches that support CUP Diagnostics
capabilities indicate unusual health conditions on connected channel paths. While z/OS has historically
provided sophisticated I/O monitoring and recovery, these Health Checker reports expose newly available
diagnostic data provided directly from the switch. These health checks enable additional insight into
problems with a user’s fabric, such as hardware errors, I/O misconfigurations, or fabric congestion.
References
Brocade Communications Systems. (2016). Brocade CUP Diagnostics and the IBM Health Checker for z/OS.
San Jose, CA, USA: Brocade Communications Systems.
Brocade Communications Systems. (2017). Brocade Fabric OS FICON Configuration Guide, 8.1.0. San Jose,
CA, USA: Brocade Communications Systems.
Brocade Communications Systems. (2017). Brocade Fabric OS Administration Guide, 8.1.x. San Jose, CA,
USA: Brocade Communications Systems.
Guendert, S., & Lytle, D. (2007). MVS and z/OS Servers Were Pioneers in the Fibre Channel World.
Journal of Computer Resource Management, 3-17.
Guendert, S. (2007). To CUP or Not to CUP? That is the FICON Question! Proceedings of the 2007 Computer
Measurement Group International Conference.
IBM Corporation (2015). IBM Health Checker for z/OS V2R2 User’s Guide. Armonk, NY, USA: IBM
Corporation.
CHAPTER 3
Port Decommissioning
PART 4 Chapter 3: PORT COMMISSIONING
Introduction
Storage Area Networks (SANs) are typically implemented to interconnect data storage devices and data
servers or hosts, using network switches to provide interconnectivity across the SAN. SANs may be complex
systems with many interconnected computers, switches, and storage devices. The switches are typically
configured into a switch fabric, and the hosts and storage devices connected to the switch fabric through
ports of the network switches that comprise the switch fabric. Most commonly, Fibre Channel (FC) protocols
are used for data communication across the switch fabric, as well as for the setup and teardown of
connections to and across the fabric, although these protocols may be implemented on top of Ethernet or
Internet Protocol (IP) networks.
Typically, hosts and storage devices (generically, devices) connect to switches through a link between
the device and the switch, with a node port (N_port) of the device connected to one end of the link and a
fabric port (F_port) of a switch connected to the other end of the link. The N_port designation describes the
port in its role on the attached device, participating in the fabric topology; similarly, the F_port designation
describes the port in its role on the attached switch.
Over time, SANs have become more complex, with fabrics involving multiple switches that use Inter-Switch
Links (ISLs) connected to switch ports (E_ports) on the switches. In some SANs, a core group of switches
may provide backbone switching for fabric interconnectivity, with few or no devices directly connected to the
core switches, while a number of edge switches provide connection points for the devices of the
SAN. Additional layers of switches may also exist between the edge switches and the core switches.
The increasing complexity of enterprise data centers has complicated switch maintenance operations,
which require coordination between several administrators in order to complete. The simple act of blocking
a port of a network switch to perform a service action can require multiple teams of users to update their
operations in order to avoid using the target port. Often the procedures involved with this kind of effort are
manual and prone to error.
Some environments are especially sensitive to administrative errors that activate the automated recovery
systems. Such systems include call home, device recovery, logical path recovery, and application recovery
systems. Thus, errors in coordination of switch maintenance operations that trigger these automated
recovery systems create delays and escalations in production environments that enterprise data center
managers would like to avoid.
IBM Z environments are the leading example of environments with these sensitivities: errors in an IBM Z
FICON SAN environment are highly disruptive to operations, leading to channel errors and Interface Control
Checks (IFCCs). Brocade's Port Commissioning functionality provides a highly automated process that is
tightly integrated with IBM Z management tools and eliminates the disruption and errors associated with
FICON switching device maintenance operations in IBM Z environments.
Decommissioning of a non-trunked port (or the last operational port in a link aggregation group) may be
performed if there is a redundant path available. Port commissioning provides an automated mechanism to
remove an E_Port or F_Port from use (decommission) and to put it back in use (recommission). The port
commissioning feature identifies the
target port and communicates the intention to decommission or recommission the port to those systems
within the fabric affected by the action. Each affected system can agree or disagree with the action, and
these responses are automatically collected before a port is decommissioned or recommissioned. Port
commissioning is supported only on FICON devices running Fabric OS 7.1 or later. Fabric OS 7.1.0 and later
provides F_Port decommissioning and recommissioning using Brocade Network Advisor 12.1.0 and later.
There are several Brocade hardware/FOS specific requirements, as well as IBM Z- and z/OS-specific
requirements for using the Port Commissioning functionality.
• The CIMOM server must be running as a started task (CFZCIM) on each z/OS LPAR. The CIMOM server
must be running for F_Port decommissioning to function properly.
• You must have access to the CLI (cimcli command) for troubleshooting.
1. Make sure you meet the z/OS (mainframe operating system) requirements.
2. Register each CIMOM server within the fabric affected by the action. For instructions, refer to Registering
a CIMOM server on page 11.
3. Decommission the F_Port. For instructions, refer to Decommissioning an F_Port.
4. Review the decommission deployment report. For instructions, refer to Viewing a port commissioning
deployment report on page 23.
5. Recommission the F_Port. For instructions, refer to Recommissioning an F_Port on page 17.
6. Review the recommission deployment report. For instructions, refer to Viewing a port commissioning
deployment report on page 23. You can also configure port commissioning for E_Ports (E_Port
commissioning on page 17), all ports on a blade (Port commissioning by blade on page 21), all ports on
a switch (Port commissioning by switch on page 18), or all ICL QSFP ports (Port commissioning by ICL
QSFP ports on page 20).
Figure 1a: Sample network environment managed with CIM management applications.
Before you can decommission or recommission an F_Port, you must register the CIMOM servers within the
fabric affected by the action.
1. Select Configure > Port Commissioning > Setup. The Port Commissioning Setup dialog box displays
(Figure 2).
2. Enter the IP address (IPv4 or IPv6 format) or host name of the CIMOM server in the Network Address
field.
3. (Optional) Enter a description of the CIMOM server in the Description field. The description cannot be
over 1024 characters.
4. Enter the CIMOM port number for the CIMOM server in the CIMOM Port field. The default port number is
5989.
5. Enter the namespace of the CIM_FCPort in the Namespace field. The default namespace is root/cimv2.
6. (Optional) Enter a user identifier for the CIMOM server in the User ID field. The credentials user
identifier cannot be over 128 characters.
7. (Optional) Enter a password in the Password field. The password cannot be over 512 characters.
8. Click the right arrow button to add the new CIMOM server and credentials to the Systems List. The
application validates the mandatory fields.
9. Select the new CIMOM server in the Systems List and click Test to check connectivity. When testing is
complete, the updated status displays in the Status column of the Systems List for the selected CIMOM
server.
10. Click OK to save your work and save the CIMOM server details in the database.
1. Select Configure > Port Commissioning > Setup. The Port Commissioning Setup dialog box displays.
The Port Commissioning Setup dialog box has two main areas. The Add/Edit Systems and Credentials pane
enables you to register CIMOM servers (system and credentials) one at a time and contains the following
fields and components:
• Network Address: The IP address (IPv4 or IPv6 format) or host name of the CIMOM server.
• Description (Optional): The description of the CIMOM server. The description cannot be over 1,024
characters.
• CIMOM Port: The CIMOM port number for the CIMOM server. The default port number is 5989.
• Namespace: The namespace of the CIM_FCPort. The default namespace is root/cimv2.
• Credentials - User ID (Optional): A user identifier for the CIMOM server. The credentials user identifier
cannot be over 128 characters.
• Credentials - Password (Optional): A password. The password cannot be over 512 characters.
• Left arrow button: Select a CIMOM server in the Systems List and click to move the defined CIMOM
server credentials to the Add/Edit Systems and Credentials pane for editing or deletion.
• Right arrow button: Click to move the defined CIMOM server credentials from the Add/Edit Systems and
Credentials pane to the Systems List.
2. The Systems List details the defined CIMOM server and contains the following data:
• Network Address: The IP address (IPv4 or IPv6 format) or host name of the system.
• Status: The system connectivity status. Updates when you test the reachability of the CIMOM server and
when you contact the CIMOM server to respond to the F_Port recommissioning or decommissioning request.
Valid status options include:
• Last Contacted: The last time you contacted the system. Updates when you test the reachability of the
CIMOM server and when you contact the CIMOM server to respond to the F_Port recommissioning or
decommissioning request.
2. Select a CIMOM server from the Systems List and click the left arrow button to edit the CIMOM server
credentials.
3. (Optional) Edit the description of the CIMOM server in the Description field. The description cannot be
over 1024 characters.
4. Enter the CIMOM port number for the CIMOM server in the CIMOM Port field. The default port number is
5989.
5. Enter the namespace of the CIM_FCPort in the Namespace field. The default namespace is root/cimv2.
6. (Optional) Enter a user identifier for the CIMOM server in the Credentials User ID field. The credentials
user identifier cannot be over 128 characters.
7. (Optional) Enter a password in the Password field. The password cannot be over 512 characters.
8. Click the right arrow button to update the CIMOM server credentials in the Systems List.
9. Click OK to save your work and save the CIMOM server details in the database.
1. Select Configure > Port Commissioning > Setup. The Port Commissioning Setup dialog box displays.
2. Click Import to import CIMOM server information from a file. The CSV file must use the following format:
Network Address, User ID, CIMOM Port, Namespace, Description, Password
10.24.48.100,user,2015,root/cimv2,IBM Host,password
The Network Address is mandatory. If you do not provide values for the User ID, CIMOM Port, Namespace,
Description, or Password, the Management application provides default values.
3. Browse to the location of the file (.csv file) and click Open. The imported CIMOM servers display in the
Systems List.
4. Click OK to save your work and save the CIMOM server details in the database.
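A sketch of how rows in that CSV format might be parsed, applying the documented defaults (port 5989, namespace root/cimv2) when optional fields are left empty; the function name is hypothetical and this is not the Management application's actual importer:

```python
import csv
import io

FIELDS = ["network_address", "user_id", "cimom_port",
          "namespace", "description", "password"]
# Defaults documented in the Port Commissioning Setup dialog box.
DEFAULTS = {"cimom_port": "5989", "namespace": "root/cimv2"}

def parse_cimom_csv(text):
    """Parse CIMOM-server rows: Network Address is mandatory;
    empty optional fields fall back to the documented defaults."""
    servers = []
    for row in csv.reader(io.StringIO(text)):
        record = dict(zip(FIELDS, (value.strip() for value in row)))
        if not record.get("network_address"):
            raise ValueError("Network Address is mandatory")
        for key, default in DEFAULTS.items():
            if not record.get(key):
                record[key] = default
        servers.append(record)
    return servers

rows = ("10.24.48.100,user,2015,root/cimv2,IBM Host,password\n"
        "10.24.48.101,,,,,")
for server in parse_cimom_csv(rows):
    print(server["network_address"], server["cimom_port"])
```

The second row omits every optional field, so the parser falls back to port 5989 and namespace root/cimv2, mirroring the dialog's behavior for imported files.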
1. Select Configure > Port Commissioning > Setup. The Port Commissioning Setup dialog box displays.
2. Click Export to export CIMOM server information to a file. The Export Files dialog box displays.
3. Browse to the location where you want to export the file (.csv file) and click Save. The CSV file uses the
following format:
Network Address, User ID, CIMOM Port, Namespace, Description,
10.24.48.100,user,2015,root/cimv2,IBM Host,
Export does not include the password. You can edit the exported file to add the password to the
credentials.
1. Select Configure > Port Commissioning > Setup. The Port Commissioning Setup dialog box displays.
2. Select one or more CIMOM servers from the System Lists and click Change Credentials. The Edit
Credentials dialog box displays. If you selected one CIMOM server, the credentials for the selected
server display in the dialog box. If you selected more than one CIMOM server, the credentials fields are
empty.
3. (Optional) Enter a user identifier for the CIMOM server in the User ID field. The user identifier cannot be
over 128 characters.
4. (Optional) Enter a password in the Password field. The password cannot be over 512 characters.
5. Click OK to close the Edit Credentials dialog box. In the Port Commissioning Setup dialog box, the
status of the selected CIMOM server rows displays "Credentials Updated." To validate the credentials,
refer to Testing CIMOM server credentials on page 14.
You should validate the CIMOM server credentials before you decommission or recommission ports. During
the decommission or recommission of an F_Port, the Management application validates the CIMOM server
credentials.
1. Select a device and select Configure > Port Commissioning > Setup. The Port Commissioning Setup
dialog box displays.
2. Select one or more CIMOM servers from the System Lists and click Test. The connectivity status of the
selected CIMOM server rows display “Testing” in the System Lists. When testing is complete, the CIMOM
server connectivity status displays. Valid status options include:
• OK: CIMOM server contact successful with current credentials.
• Not Contacted Yet: CIMOM servers configured, connectivity not tested yet.
• Credentials Updated: Credentials changed, connectivity not tested yet.
• Credentials Failed: CIMOM server contact failed with current credentials.
3. Click OK to close the Port Commissioning Setup dialog box. When the test is complete, an application
event displays in the Master Log detailing success or failure.
1. Select Configure > Port Commissioning > Setup. The Port Commissioning Setup dialog box displays.
2. Select one or more CIMOM servers from the Systems List and click the left arrow button. The details for
the last selected CIMOM server row displays in the Add/Edit System and Credentials pane.
3. Confirm that this is the CIMOM server you want to delete and click OK to delete the CIMOM server from
the Port Commissioning Setup dialog box. When the deletion is complete, an application event displays
in the Master Log detailing success or failure.
F_PORT Commissioning
Although you can use any of the following methods to access the F_Port commissioning commands,
individual procedures include only one method.
• From the main menu, select the F_Port in the Product List or Topology, and then select Configure > Port
Commissioning > Decommission/Recommission > Port.
• From the Product List, right-click the F_Port and select Decommission/Recommission > Port.
• From the Topology, right-click the F_Port and select Decommission/Recommission > Port.
• From a Dashboard widget, right-click the F_Port and select Decommission/Recommission > Port.
Decommissioning an F_Port
Refer to Figure 3.
1. Select the F_Port in the Product List, and then select Configure > Port Commissioning > Decommission
> Port. The Port Commission Confirmation dialog box displays.
5. Click OK on the Port Commission Confirmation dialog box. While decommissioning is in progress,
a down arrow icon displays next to the port icon in the Product List. The port decommissioning status
(Port Decommission in Progress or Decommissioned F Port) displays in the Additional Port Info field of
the Product List. When decommissioning is complete, a deployment status message displays. When the
decommissioning is complete, an application event displays in the Master Log detailing success or failure.
If any of the defined CIMOM servers responds negatively to a decommission request, the switch port will not
be decommissioned.
On each of the defined z/OS LPARs, the system console will show that IOS has decommissioned CHPID A0.
A display of the CHPID A0 on each LPAR shows it offline due to “switch initiated reconfiguration for channel
port.”
Note: The only way to get this CHPID back online is to either 1) recommission it, or 2) use the force option
on the cf chp command (that is, cf chp(a0),online,force).
D M=CHP(A0)
IEE174I 15.18.30 DISPLAY M 045
CHPID A0: TYPE=1B, DESC=FICON SWITCHED, OFFLINE
DEVICE STATUS FOR CHANNEL PATH A0
CHP=A0 IS OFFLINE FOR THE FOLLOWING REASON(S)
SWITCH INITIATED RECONFIGURATION FOR CHANNEL PORT
************************ SYMBOL EXPLANATIONS ************************
+ ONLINE @ PATH NOT VALIDATED - OFFLINE . DOES NOT EXIST
* PHYSICALLY ONLINE $ PATH NOT OPERATIONAL
2013/03/22-14:41:55, [FICU-1020], 8521, SLOT 5 | FID 10, INFO, DCX_117_, Port Addrs (0x34) have been
Blocked by Fabric Manager.
Any running I/O will be transitioned to another CHPID or DASD/Tape port before the decommission and will
continue to run with this F_Port decommissioned.
Note: Network Advisor will not allow a decommissioned port to be enabled, but the FOS CLI will. If a
decommissioned port is enabled via the FOS CLI, the CHPID will remain in the above "switch initiated
reconfiguration for channel port" state and must be brought online with one of the above options.
Recommissioning an F_Port
Refer to Figure 5a, Figure 5b, and Figure 6.
F_Port recommissioning proceeds even if one or more CIMOM servers are not reachable.
Select the F_Port in the Product List, and then select Configure > Port Commissioning > Recommission >
Port.
While recommissioning is in progress, an up-arrow icon displays next to the port icon in the Product List.
The port recommissioning status (Port Recommission in Progress or Recommissioned F Port) displays in the
Additional Port Info field of the Product List. You can view the port commissioning results in the deployment
reports. When the recommissioning is complete, an application event displays in the Master Log detailing
success or failure.
On each of the defined z/OS LPARs, the system console will show that IOS has recommissioned CHPID A0.
A display of the CHPID A0 on all the defined LPARs shows it once again online.
D M=CHP(A0)
IEE174I 15.56.17 DISPLAY M 051
CHPID A0: TYPE=1B, DESC=FICON SWITCHED, ONLINE
DEVICE STATUS FOR CHANNEL PATH A0
DCX_117_:FID10:root> 2013/03/22-15:44:21, [FICU-1021], 8522, SLOT 5 | FID 10, INFO, DCX_117_, Port
Addrs (0x34) have been Unblocked by Fabric Manager.
I/O will start running on this CHPID or DASD/Tape port after the recommission has completed.
E_PORT Commissioning
You can use any of the following four methods to access the E_Port commissioning commands; this section
walks through the Product List method in detail.
1. From the main menu, select the E_Port in the Product List or Topology, and then select Configure > Port
Commissioning > Decommission/Recommission > Port.
2. From the Product List, right-click the E_Port and select Decommission/Recommission > Port.
3. From the Topology, right-click the E_Port and select Decommission/Recommission > Port.
4. From a Dashboard widget, right-click the E_Port and select Decommission/Recommission > Port.
Decommissioning an E_Port
Follow these steps:
1. Select the E_Port in the Product List, and then select Configure > Port Commissioning > Decommission
> Port.
2. If Lossless DLS is enabled, click Yes on the confirmation message. While decommissioning is in progress,
a down arrow icon displays next to the port icon in the Product List. The port decommissioning status
(Port Decommission in Progress or Decommissioned E Port) displays in the Additional Port Info field
of the Product List. You can view the port commissioning results in the deployment reports. When the
decommissioning is complete, an application event displays in the Master Log detailing success or failure.
Recommissioning an E_Port
Select the E_Port in the Product List, and then select Configure > Port Commissioning > Recommission >
Port.
While recommissioning is in progress, an up-arrow icon displays next to the port icon in the Product List.
The port recommissioning status (Port Recommission in Progress or Recommissioned E_Port) displays
in the Additional Port Info field of the Product List. You can view the port commissioning results in the
deployment reports. When the recommissioning is complete, an application event displays in the Master
Log detailing success or failure.
If lossless DLS is enabled, click Yes on the confirmation message. If a decommissioning request is
triggered at the trunk level, the request is sent to all the trunk members, including the master. An error
message displays if the trunk group is the only connection between the switches. During recommissioning,
all previously decommissioned members of the trunk group are recommissioned if the port was part of the
trunk group.
Port decommissioning cannot be configured by itself in a MAPS rule or action. It requires port fencing to
be enabled in the same rule. If you attempt to create a MAPS rule or action that has port decommissioning
without port fencing, the rule or action will be rejected. MAPS can be configured to have only port fencing
enabled in a rule; if this is the case, the port will be taken offline immediately.
MAPS supports port fencing and port decommissioning actions for rules that monitor CRC, ITW, PE, LR,
STATE_CHG, or C3TXTO errors from physical ports, such as E_Ports, F_Ports, or U_Ports. Otherwise, for
circuits, MAPS supports only port fencing actions for rules that monitor changes of state (STATE_CHG).
Refer to the port-health monitoring threshold tables on page 177 for these rules. Be aware that if multiple
rules for the same counter are configured with different thresholds, then both port fencing and port
decommissioning should be configured for the rule with the highest threshold monitored. For example, if
you configure one rule with a CRC threshold value “greater than 10 per minute” and you configure a second
rule with a CRC threshold value “greater than 20 per minute,” you should configure port fencing and port
decommissioning as the action for the rule with the 20-per-minute threshold, because configuring it for the
10-per-minute rule will block the other rule from being triggered.
For F_Ports, port decommissioning will only work if Brocade Network Advisor is actively monitoring the
switch. Brocade Network Advisor can decommission F_Ports based on specified criteria. MAPS notifications
are integrated with Brocade Network Advisor, which in turn must coordinate with the switch and the end
device to orchestrate the port decommissioning. If Brocade Network Advisor is not configured on a switch,
MAPS will fence the F_Port. For more information on port fencing, port decommissioning, and related failure
codes, refer to the Brocade Fabric OS Administrator’s Guide listed in the chapter references.
1. Connect to the switch and log in using an account with admin permissions.
The following example makes port fencing and port decommissioning part of a rule and then displays the
confirmation.
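Such a rule might look like the following, modeled on the mapsrule syntax shown earlier in this chapter. The rule name and the 20-per-minute CRC threshold are illustrative only; verify the exact action keywords for fencing and decommissioning against the Brocade Fabric OS MAPS guide for your release:

```
mapsrule --create crcFenceDecom -group ALL_F_PORTS -monitor CRC -timebase MIN -op ge -value 20 -action raslog,fence,decom
mapsrule --show crcFenceDecom
```

Per the guidance above, this pairs port decommissioning with port fencing in the same rule, on the rule with the highest monitored threshold.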
Conclusion
By automatically coordinating the decommissioning of ports in a switch, an administrator may improve
the reliability and performance of the switch fabric, ensuring that ports are gracefully taken out of service
without unnecessary interruption of service and triggering of automatic recovery functionality that may occur
during manual decommissioning of ports. The feature provides for decommissioning of F_Ports only,
E_Ports only, or both. Where trunking is provided for in the fabric, decommissioning
of a port in a trunk group may be performed if there are other operational links in the trunk group.
Decommissioning of a non-trunked port (or the last operational port in a trunk group) may be performed if
there is a redundant path available, to avoid data loss or out of order transmission in lossless environments.
The combination of port commissioning with MAPS adds an automated component to the overall solution
that expedites troubleshooting and maintenance procedures compared to a manual approach.
References
Brocade Communications Systems. (2017). Brocade Fabric OS FICON Configuration Guide, 8.1.0. San Jose,
CA, USA: Brocade Communications Systems.
Brocade Communications Systems. (2017). Brocade Fabric OS Administration Guide, 8.1.x. San Jose, CA,
USA: Brocade Communications Systems.
Brocade Communications Systems. (2017). Brocade Monitoring and Alerting Policy Suite Configuration
Guide, 8.1.0. San Jose, CA, USA: Brocade Communications Systems.
Brocade Communications Systems. (2017). Brocade Network Advisor SAN User Manual, 14.3.1. San Jose,
CA, USA: Brocade Communications Systems.
Brocade Communications Systems. (2017). Brocade Port Decommissioning Quick Start Guide. San Jose,
CA, USA: Brocade Communications Systems.
IBM Corporation (2015). z/OS Common Information Model User’s Guide V2R1. Armonk, NY, USA: IBM
Corporation.
CHAPTER 4
PART 4 Chapter 4: Buffer credits and performance
Introduction
The basic unit of information of the Fibre Channel (FC) protocol is the frame. Other than primitives, which
are used for lower level communication and monitoring of link conditions, all information is contained within
the FC frames. When discussing the concept of frames, a good analogy to use is that of an envelope: When
sending a letter via the United States Postal Service, the letter is encapsulated within an envelope. When
sending data via a FC network, the data is encapsulated within a frame. Managing the data transfer of a
frame between sender and receiver is of vital importance to the integrity of a storage network. Buffer
Credits are the principal mechanism used to control point-to-point frame transmission and reception.
A major strength of Fibre Channel is that it creates lossless connections by implementing a flow control
scheme based on buffer credits. Fibre Channel frame flow control exists at both the physical and logical
level. The logical level is called End-to-End flow control. It paces the flow of a logical operation (for example,
a write operation to a disk control unit) across a fabric. A logical level transmission credit is initially
established when two communicating nodes log in and exchange their respective communication param-
eters. The nodes themselves then monitor their End-to-End flow control between a specific pair of node
ports (node ports are on a control unit or a CPU; ports on a switch are not called node ports). The physical
level is called Buffer-to-Buffer flow control. It manages the flow of frames between a transmitting port (also
known as an egress port, sender, or initiator) and an adjacent receiving port (also known as an ingress port
or target)—basically from one end of a FC cable to the other end.
It is important to note that a single end-to-end operation may have made multiple transmitter-to-receiver
pair hops (end-to-end frame transmissions) to reach its destination. However, the presence of intervening
switching devices and/or ISLs is transparent to end-to-end flow control. See Figure 1. Since buffer-to-buffer
flow control is the more crucial subject in an environment using ISLs, the following section provides a more
in-depth discussion.
Upon its arrival at a receiver, a frame is placed into a receive buffer, where its header is processed by the
receiving port. Frames that arrive while the receiver is busy processing an earlier frame are stored in the
next free receive buffer. If the receiver is not capable
of processing the frames in a timely fashion, it is possible for all the receive buffers to fill up with received
frames. At this point, if the transmitter continued to send frames, the receiver would not have Receive buffer
memory space available and the frame would be lost (discarded or dropped). Losing data by discarding it, or
attempting to overwrite an already in use buffer, is not acceptable, and cannot be allowed, in a storage envi-
ronment. This is why flow control is so necessary for deterministic high performance low latency networks.
Buffer credit flow control is implemented to limit the amount of data that a port may send. Proper tuning is
primarily based on four metrics: the speed of the link; the distance that the link traverses; whether the data
is being compressed; and the average size of the frames being sent. Buffer credits also
represent finite physical-port memory which means that as more buffer credits are allocated on a port more
physical memory is consumed on that port to hold the BBCs. Within a fabric, each port may have a different
number of buffer credits—although this is not recommended. Within a single link connection, each side of
the link may have a different number of buffer credits. The configured buffer credit value on a port is not
required to be symmetrical but users should always make ISL links contain a symmetrical number of buffer
credits. See Figure 1.
Buffer-to-buffer flow control is an agreed-upon and known control of buffers between adjacent ports within
the I/O path, in other words, transmission control across each single FC cable link. A sending port uses its
available credit supply and waits to have its credits replenished by the receiving port on the opposite end of
the cabled link.
Buffer credits are used by Class 2 and Class 3 FC services. When host or storage node ports (N_Ports) are
connected to switch fabric ports (F_Ports) a receiver ready “R_RDY” is the standards-based acknowledge-
ment from the receiver back to the transmitter that a frame was successfully received and that a buffer
credit can be freed by the transmitter for its next frame to use. The rate of frame transmission is regulated
by the receiving port, and is based on the availability of buffers to hold received frames.
In short, Buffer-to-Buffer Flow Control works because a sending port is using its available buffer credit
supply and waiting to have the buffer credits replenished by the receiving port on the opposite end of the
link.
When an exchange (read or write) is to be transmitted down a Fibre Channel path, that record is split up into
a number of frames each of which can have a maximum size of 2,148 bytes. These data carrying frames are
transmitted through the fabric and then re-assembled at the far end to recreate the complete data record
again. To make sure the FC links are used efficiently, every switch port has a number of buffers (buffer
credits) associated with it. If necessary, the switch can then store several blocks of incoming data, while
waiting to pass it on to the next node. When a receiver is ready to accept information, it signals to its sender,
then decrements its own BB_Credit count. When the data block is passed on to the next receiver the
BB_Credit is incremented again. This means that BB_Credits are also used to throttle back the data
transmission flow when devices or links get too busy.
As long as the transmitter’s buffer credit count is a non-zero value, the transmitter is free to continue
sending data. The zero credit counter will increment when the last buffer credit is allocated. A buffer credit
is allocated just prior to the transmission of a frame so, technically, the switch continues to transmit the
last frame while at zero buffer credits. The flow of frame transmission between adjacent ports is regulated
by the receiving port’s presentation of acknowledgements. In other words, BB_Credits has no end to end
component except for one end of a cable to the other end of that same cable. The sender increments its
BB_Credit count by one for each acknowledgement it receives. As stated earlier, the initial value of
BB_Credit must be non-zero.
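The buffer-to-buffer accounting described above can be modeled in a few lines. This is a sketch, not FOS internals; the class name is invented, and the eight starting credits are simply the Brocade per-port default discussed later in this chapter:

```python
class BBCreditCounter:
    """Transmit-side BB_Credit counter for a single cable link (illustrative)."""

    def __init__(self, bb_credit: int):
        self.credits = bb_credit          # advertised by the receiver at login

    def can_send(self) -> bool:
        return self.credits > 0

    def send_frame(self) -> None:
        # A credit is consumed just prior to each frame transmission.
        if self.credits == 0:
            raise RuntimeError("transmitter stalled: zero BB_Credits")
        self.credits -= 1

    def receive_r_rdy(self) -> None:
        # Each R_RDY acknowledgement from the receiver returns exactly one credit.
        self.credits += 1

link = BBCreditCounter(bb_credit=8)       # Brocade default per-port value
for _ in range(8):
    link.send_frame()                     # exhaust the credit supply
assert not link.can_send()                # transmitter must now wait
link.receive_r_rdy()                      # one acknowledgement frees one credit
assert link.can_send()
```

The model makes the pacing mechanism explicit: the transmitter never tracks the receiver's buffer state directly, only its own credit counter, which R_RDY signals replenish.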
The rate of frame transmission is regulated by the receiving port based on the availability of buffers to hold
received frames. It should be noted that the specification allows the transmitter to be initialized at zero, or
at the value BB_Credit and either count up or down on frame transmit. Various switch vendors may handle
this with either method, and the counting would be handled accordingly.
Buffer-to-buffer credit management affects performance over distances; therefore, allocating a sufficient
number of buffer credits for long-distance traffic is essential to performance. During initial login, each
optical receiver informs the optical transmitter at the other end of the link how many receive buffers it has
available. This is called the BB_Credit value. This value is used by the transmitter to track the consumption
of receive buffers and pace transmissions if necessary. See Figure 2.
The terms “frame pacing” and “flow control” are often used interchangeably to describe how buffer credits
provide link management. Regardless of the term being used, the fundamental objective is to prevent the
transmitter (Tx) from over-running its receiver (Rx) with an I/O workload which would cause frames to be
discarded and potential data integrity issues. In the process that prevents data overrun, the receiver side of
the link will inform the transmitter side of that link how many frames it (the receiver) can accept which it will
then handle correctly.
At that point the transmitter has a process by which it sets its buffer credit counter and storage to
correspond to the receiver’s request and then adheres to how many frames the receiver allows to be sent.
Simply stated, a transmitter cannot send more frames to a receiver than the receiver can store in its buffer
memory. See Figure 3. Each of these buffer credits represents the ability of the device to accept additional
frames. If a recipient issues no credits to the sender, the sender will not send any frames. Pacing the
transport of subsequent frames on the basis of this credit system helps prevent the loss of frames and
reduces the frequency of entire Fibre Channel sequences needing to be retransmitted across the link.
Figure 3: Frame pacing and flow control.
Thus, the transmitter decrements its counter based upon a logical usage of receive buffers and increments
the counter based upon physical signaling from the receiver. In this way the connection allows the receiver
to “pace” the transmitter, managing each I/O as a unique instance. Pacing, as presented here, means that
a frame cannot be sent from an egress port transmitter to an ingress port receiver unless the transmitter
has a buffer credit available to assign to the frame. Once a frame has been successfully accepted into the
receiver the receiver acknowledges that receipt back to the transmitter so that the transmitter can
increment its buffer credit counter by one and then releases the buffer credit memory that it assigned to
that successfully transmitted frame. See Figure 4.
As long as the transmitter counter is non-zero, then the transmitter is free to send data. This mechanism
allows for the transmitter to have a maximum number of data frames in transit equal to the value of a
BB_Credit, with an inspection of the transmitter counter indicating the number of available receive buffers.
If the transmitter exhausts the frame count offered by the receiver, it will stop and then wait for the receiver
to credit-back frames to the transmitter through frame acknowledgements. A good analogy would be a pre-
paid calling card: there are a certain amount of minutes to use, and one can talk until there is no more time
on the card whereby the conversation stops. In Fibre Channel, one often sees this condition with slow-drain
devices.
A buffer credit equals one frame regardless of frame size. Every buffer credit on a port (host, switch,
storage) has the same maximum size—2,148 bytes. However, every frame that is created can be a different
size while still consuming a full buffer credit. See Figure 5.
The roundtrip link distance is also very important to understand in order to keep a link fully utilized. Al-
though not technically accurate, you might visualize this concept as needing one-half of the BB_Credits to
maintain transmitter activity during data frame delivery and the other half of the BB_Credits to maintain
transmitter activity while acknowledgement frames traverse the return link. Users must configure sufficient
buffer credits for the total round trip link distance in order to ensure enough buffer credits to keep a link
optimally utilized.
One of the reasons that a transmitter can exhaust the frame count is if there were not enough BBCs initially
allocated to the port. There are four major metrics that must be accounted for in order to allocate the cor-
rect number of buffer credits—link speed, roundtrip cable distance, average frame size and compression.
If any one or more of these are misjudged, then too many or too few buffer credits will be allocated on the
link. If a transmitter runs out of buffer credits it cannot send any more frames that it might have waiting and
those subsequent frames must be held in its egress port until buffer credits do become available (actually
a little more involved here but this will suffice for now). In this case, it is not possible to optimize the full
bandwidth potential of the link.
However, when there are ample buffer credits allocated on the link, for the full roundtrip connection, the
complete bandwidth potential of the link can be realized. If one could see a frame as it is being sent across
the link it would be easy to see that the slower the link speed the more distance that a frame will utilize on
that link and that it takes fewer buffer credits to fully utilize the bandwidth. That is why there is an axiom
that when the link speed doubles one must also at least double the number of buffer credits allocated to
that link. See Figure 6.
Buffer Credits are an integral part of the ASIC design in a Brocade switching device
The number of available buffer credits defines the maximum amount of data that can be transmitted prior
to an acknowledgment from the receiver. BB_Credits are physical memory resources incorporated in the
Application Specific Integrated Circuit (ASIC) that manages the port. It is important to note that these
memory resources are limited. Moreover, the cost of the ASICs increases as a function of the size of the
memory resource. One important aspect of FC is that adjacent nodes do not have to have the same number
of buffer credits. Rather, adjacent ports communicate with each other during fabric login (FLOGI) and port
login (PLOGI) to determine the number of credits available for the send and receive ports on each node.
Brocade uses an intelligent ASIC architecture, and Gen 5/Gen 6 FICON director connectivity port blades
have two ASICs. On some of our fixed switches (e.g. 6505, 6510) there is a single ASIC for up to 48-ports of
connectivity while the Brocade 6520 Switch is architected using six Condor3 ASICs. Each ASIC holds a pool
of buffer credits that their attached ports use, see Figure 7. To be clear, each port on an ASIC makes use of
the buffer credit pool rather than owning its own, dedicated buffer credits. Just so you are aware, there are
some dedicated buffers per port but how they are used is proprietary information. On each Condor 3 (Gen 5)
ASIC there are 8,192 buffer-to-buffer credits per 16-port group when using 32-port blades and per 24-port
group when using 48-port blades. From its ASIC BC pool each port attached to that ASIC will be provided
with a default setting of eight buffer credits (Gen 5) or 20 buffer credits (Gen 6) for external data flow. Then,
to provide optimal data flow across long distance links, any one additional port attached to an ASIC on a 32-
port blade can have a maximum of 5,408 BBCs configured onto that port while any one additional port on
an ASIC on a 48-port blade can have a maximum of 4,960 BBCs configured onto that port. Of course, there
are other allocations of buffer credits to ports that are allowable. Additional information can be found in the
Fabric OS Administrator’s Guide under “Buffer credits per switch or blade model.”
The default number of buffers in each Brocade 4 Gbps, 8 Gbps, or 16 Gbps port is eight, while for 32 Gbps
ports the default is 20. For short distance links (between the CPU and switch, for example), eight buffers is
normally sufficient. The ports that are most likely to require more buffers are those connected to ISL links
(which tend to be much longer than the links between the switch and CPU or control units). It is possible
to add more buffers to a switch port. It is not possible to add more buffers to a channel port or a control
unit port. For that reason, CPU and CU ports typically provide enough buffers to run the link at roughly 100
percent utilization over the maximum supported distance (normally 10 km). That calculation is based on the
assumption that all frames will be nearly full, which is probably not a valid assumption for traditional FICON
or High Performance FICON (zHPF) I/Os. On the other hand, unless there is a very large distance (more than
5 or 6 km) between the adjacent ports, it is extremely unlikely that the provided number of buffers in
channel ports and control unit ports will not be enough.
Using CLI Commands to Monitor Average Frame Size and Buffer Credits
The user will need to periodically check to be sure that an adequate number of buffer credits are assigned
to each ISL link as workload characteristics change over time. Use the portBufferShow Command Line Inter-
face (CLI) command to monitor the buffer credits on a long-distance link and to monitor the average frame
size and average buffer usage for a given port.
The portBufferShow CLI command displays the buffer usage information for a port group or for all port
groups in the switch. On switches utilizing trunking, a trunking port group can also be specified simply by
giving any port number in that group. If no port is specified, the buffer credit information for all of the
port groups on the switch is
displayed. Much more about all of the Brocade CLI commands can be found in the Fabric OS Administrator’s
Guide which can be obtained from www.brocade.com.
Report 1 shows the results of issuing the portBufferShow CLI command. In the example there are two inde-
pendent ISLs. Brocade trunking (ASIC optimized trunking) was not in use. The average frame size in bytes
is shown in parentheses with the average buffer usage for frame transmission and reception. It is readily
apparent that ports 48 and 49 were provisioned with far more buffer credits (942) than were ever used
while this statistical sample was being taken (240), and far more than the average buffer credits required
(32 and 146) on these ISL links.
Assuming that a user is setting up their ISL links for the first time there is one fairly simple way to calculate
the average frame size that will be needed for the BB_Credit calculation tools or BB_Credit calculation to be
done manually. One can use the portStats64show CLI command. These 64 bit counters are actually two 32
bit counters and the lower one (“bottom_int”) is the 32 bit counter that is used in portStatsShow. But each
time it wraps, it increases the upper one (“top_int”) by 1. It is also recommended that a user provide the
“-long” option on the command. The “-long” option adds the contents of the two counters together in order
to provide a single result. Following is an example without and with the “-long” option:
When the portStats64show -long command is executed the user does not get a counter of bytes but rather
a counter of 4-byte words. Luckily, fillwords on the link do not count into this number so it is still valid for
our average frame size calculation. One just has to multiply the transmitted and received words by four to
turn it into total bytes. It is the transmitted (Tx) average frame size that we want to use in our calculations to
determine how many BBCs are required:
(1,044,696,810 × 4) bytes = 4,178,787,240 bytes / 4,285,384 frames = 976 bytes (rounded up) average
Tx frame size
(576,487,365 × 4) bytes = 2,305,949,460 bytes / 8,023,909 frames = 288 bytes (rounded up) average
Rx frame size
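The arithmetic above can be reproduced with a short script. The counter values are the ones from the example, and the ceiling division mirrors the "rounded up" figures:

```python
import math

def avg_frame_size(words: int, frames: int) -> int:
    # portstats64show -long reports 4-byte words, not bytes: multiply by
    # four to get total bytes, divide by the frame count, and round up.
    return math.ceil(words * 4 / frames)

tx_avg = avg_frame_size(1044696810, 4285384)   # Tx side of the example
rx_avg = avg_frame_size(576487365, 8023909)    # Rx side of the example
print(tx_avg, rx_avg)                          # 976 288
```

It is the Tx average (976 bytes here) that feeds the BB_Credit calculation later in this chapter.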
Port statistics are only relevant if the user knows how long they have been accumulating on the port: they
run continuously from the last power-on reset (POR) or the last portStatsClear command, which may have
been minutes, days, or months ago. To make sure that the port’s statistical information reflects the period
of interest, such as peak usage or quarter-end/year-end processing, the user should issue the portStatsClear
CLI command before the statistics need to be gathered and utilized.
If you are going to have a problem due to insufficient buffers, it is most likely that that issue will be related to
the ISL links connecting cascaded switches. Nevertheless, the concepts that determine how many buffers
are required apply equally, regardless of what type of device is on either end of the FICON link. How do you
ensure that the receiver port has enough buffers to avoid this situation? For a properly tuned configuration,
the number of buffers required on a receiving port to fully utilize the available bandwidth without
encountering delays or pacing is primarily based on the following three inputs:
1. The speed of the link
2. The roundtrip distance that the link traverses
3. The average size of the frames being sent (for example, if the average frame size is low, you would need
more frames (and therefore, more buffers) to hold the same amount of data)
Why is the speed of the port and the length of the link important? The speed of the port determines how
long it takes to feed the frame into and out of the fiber. And the length of the link determines how much
time it takes the frame to traverse the link from the transmitter port to the receiver port. Using the standard
8b/10b encoding, 10 bits are required for each byte. So a full frame of 2148 bytes would be represented
by 21,480 bits. If the port speed is 4 Gbps, it would take 5.37 microseconds (μs) (21,480 bits, divided by
4,000,000,000 bits per second) to move one frame into the fiber. If the port speed is 16 Gbps, it would
take one quarter of that time (1.34 μs) to move the same frame into the fiber. The speed of light in a fiber
optic cable is roughly 200,000,000 meters per second, or 5 μs per kilometer. If it takes 1.34 μs to move
the frame into the fiber on a 16 Gbps port, the start of the frame will have moved .268 km down the fiber
((1.34 μs / 5 μs) * 1 km) by the time the last bit of the frame is placed in the fiber, meaning that each
frame will be roughly 268 meters long.
On the other hand, if you were using a 4 Gbps port, it would take 4 times longer to move the frame into the
fiber, so each frame would be 1.072 km. long. This relationship between port speed and the size of a frame
is illustrated in Figure 6. You can easily see from this illustration that fewer buffers are required to keep a
slow link fully utilized than would be required for a fast link.
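A sketch of the serialization arithmetic above. It assumes 8b/10b encoding (10 bits per byte on the wire) and 5 microseconds per kilometer in fiber, as the text does; the 64b/66b encoding used at higher speeds is deliberately simplified away:

```python
FULL_FRAME_BYTES = 2148       # maximum FC frame size
US_PER_KM = 5.0               # light in glass: ~200,000,000 m/s

def frame_length_km(link_gbps: float, frame_bytes: int = FULL_FRAME_BYTES) -> float:
    """Physical length one frame occupies on the fiber at a given link speed."""
    bits_on_wire = frame_bytes * 10                        # 8b/10b encoding
    serialization_us = bits_on_wire / (link_gbps * 1000)   # time to feed it in
    return serialization_us / US_PER_KM

print(frame_length_km(4))    # ~1.07 km: a 4 Gbps frame is about a kilometer long
print(frame_length_km(16))   # ~0.27 km: quadruple the speed, quarter the length
```

The output confirms the axiom: doubling the link speed halves the physical length of each frame, so roughly twice as many frames (and credits) are needed to keep the same fiber full.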
This information about the port speed and the length of the link help you determine the number of frames
that could be in flight on the link at a given time. When calculating usage of the buffers in the receiver port,
the transmitting port must consider not only the number of buffers currently in use, but also the number
that will be required to hold frames that are currently en-route to the receiver. Typically, a lower speed link
will have fewer frames in flight, and it will take longer to fill the receiving port’s buffers, so fewer buffers
would be required in the receiving port. Remember that the time required for the frame to traverse the link
is determined by the speed of light, not the link speed, so it will take a frame the same amount of time to
travel over a 4 Gbps link as it does over a 16 Gbps link. The link speed (4 Gbps, 8 Gbps, 16 Gbps, 32 Gbps,
etc.) determines how quickly bits can be put into and out of the fiber, but once in the fiber, they all travel at
the same speed. The distance is also important because it determines how long it takes between when the
receiving port sends an acknowledgment that a buffer is now available, and when the transmitting port
receives that information. For example, if you have a 100 km link, a buffer would be available for at least
500 μs before the acknowledgment signal gets back to the transmitting port. In order to be able to keep the
link as highly utilized as possible, you should have enough buffers to allow the transmitter to continue
sending frames while the acknowledgment traverses the link in the opposite direction. The average frame
size is important because each frame takes up 1 buffer in the receiver, regardless of its size. If your
workload results in small frame sizes, it will take less time for the frame to be moved in and out of the fiber,
meaning that the buffers will be consumed faster. Frame sizes vary depending on many things:
• Does the exchange carry a Read or Write I/O? Write I/Os result in larger frames being sent from the CPU
to the CU (because they include some amount of data), but the response from the CU would be much
smaller. Conversely, a Read I/O will result in small frames being sent to the CU, but much larger frames
being returned to the CPU.
• Command Mode (as opposed to zHPF mode) FICON disk I/O processing often restricts the amount of data
placed into a frame’s payload to less than 1,000 bytes.
• Is zHPF being used? zHPF I/Os typically result in larger average frame sizes.
• Is MIDAW enabled? One of the benefits of MIDAW is that the average frame sizes should be larger.
• The type of processing being done—direct access I/Os could result in different frame sizes than
sequential processing.
• What types of device are using the link? Tape I/Os tend to result in larger frames than DASD I/Os. And
Channel-to-Channel Adapters (CTCs) often result in very small frames.
• Is the channel running in FC or FCP mode? FCP mode might be used by z/VM, z/VSE, or Linux for z
Systems. FCP tends to result in larger frame sizes.
In the real world, the distribution of frame sizes looks more like that shown in Figure 5. There will be a
mixture of reads, writes, and acknowledgments, and the distribution of frame sizes (and hence, the average
frame size) will vary depending on the activity on that channel at that time. For example, the average frame
sizes would be very different if you are running full volume dumps (many large sequential Read I/Os) than
during the online day (more writes, and mainly direct processing).
If the CUs are connected to the CPU via a switch, you can use the RMF FICON Director Activity report to mon-
itor the average frame size for each link. If your CU is connected to the host directly, there is no z/OS-based
mechanism that we are aware of to determine the average frame size. All of this information (link speed,
distance, and average frame size) is input to the calculation to determine the ideal number of buffers in
each port. The calculator will use the information that you provide about the port speed, link distance, and
average frame size, and combine that with proprietary information about the switch, to calculate and report
the optimum number of buffers. For those of you with currently-deployed Brocade switching devices at FOS
7.1 or higher, if you know the distance, speed, and average frame size for a given ISL link, you can use the
portBufferCalc CLI command and allow your FOS to more accurately calculate the number of buffer credits
required to meet your needs.
Although every Fibre Channel port must use buffer credits, it is really the longer distance links that require
a user to provision more than the default eight BB_Credits per port on a switch.
When a host or storage node port (Nx_Port) has no BB_Credits available, and has exceeded the link timeout
period (Error Detect Time Out Value or E_D_TOV), a link timeout will be detected. E_D_TOV by default is two
seconds. When a link timeout is detected, the Nx_Port or switch fabric port (Fx_Port) begins the Link Reset
protocol. A switch or device port that cannot transmit frames due to the lack of credits received from the
destination port for two seconds uses the Link Reset (LR) to reset the credit amount on the link back to its
original value.
It is important to differentiate between link reset and link initialization. Link initialization is caused by
cable connectivity (physical link), server reboots, or port resets. A link credit reset, however, can be an
indication of a buffer credit starvation issue that occurs as an after-effect of frame congestion or fabric
backpressure conditions.
In order to ensure that a fabric will run as effectively as possible it is vitally important that long distance
links be provisioned with an adequate and optimal number of buffer credits. There are tools a customer
will normally utilize for this, such as Brocade’s Buffer Credit Calculation spreadsheet (https://community.
brocade.com/t5/Storage-Networks/bg-p/StorageNetworks) or the portBufferCalc CLI command.
If a user knows the roundtrip distance (in kilometers), speed and average frame size for a given port they
can use the portBufferCalc CLI command to calculate the number of BB_Credits required on that link. If
they omit the distance, speed or frame size the command uses the currently configured values for that port.
Given the buffer requirement and port speed a user can specify the same distance and frame size values
when using the portCfgLongDistance CLI command to actually set the appropriate number of buffer credits
onto an ISL link.
In order to maintain acceptable performance, at least one buffer credit is required for every two km of
distance covered (this includes roundtrip). We also recommend adding an additional six BB_Credits which
is the number of buffer credits reserved for Fabric Services, Multicast, and Broadcast traffic in the Virtual
Channels (discussed later) of an ISL link. The extra six BBCs is just a static number but important in order to
be able to achieve maximum ISL utilization. And, of course, one must account for the actual average frame
size of the frames that are being transmitted on the link rather than just assuming that all frames sent will
be at the full frame size of 2,148 bytes.
Here is the formula one can use to approximate the number of buffer credits that an ISL link will require:
• BB_Credits = (2 * (link distance in km / frame length in km)) * (full frame size / average frame size) + 6
Frame length (km) = 4.00 km @ 1 Gbps
= 2.00 km @ 2 Gbps
= 1.00 km @ 4 Gbps
= 0.50 km @ 8 Gbps
= 0.33 km @ 10 Gbps
= 0.25 km @ 16 Gbps
For two sites that are 20 km (1 kilometer = 0.621 miles) apart, the approximate (rounded up) number of
buffer credits required for a link will turn out to be:
• On a 2 Gbps ISL link: (2 * (20 / 2.00)) * (2148 / 870) + 6 = 56 (rounded up)
• On a 4 Gbps ISL link: (2 * (20 / 1.00)) * (2148 / 870) + 6 = 105 (rounded up)
• On an 8 Gbps ISL link: (2 * (20 / 0.50)) * (2148 / 870) + 6 = 204 (rounded up)
• On a 10 Gbps ISL link: (2 * (20 / 0.33)) * (2148 / 870) + 6 = 306 (rounded up)
• On a 16 Gbps ISL link: (2 * (20 / 0.25)) * (2148 / 870) + 6 = 402 (rounded up)
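The formula and frame-length table above can be wrapped in a small helper. This is a sketch: the frame lengths are the approximations from the table, not measured values, and rounding at the margin can land a credit away from a figure quoted in the text:

```python
import math

FULL_FRAME = 2148                 # maximum FC frame size in bytes
VC_RESERVE = 6                    # credits reserved for fabric services traffic

# Approximate on-the-wire frame length per link speed (km), from the table above.
FRAME_LENGTH_KM = {1: 4.00, 2: 2.00, 4: 1.00, 8: 0.50, 10: 0.33, 16: 0.25}

def isl_bb_credits(distance_km: float, speed_gbps: int, avg_frame_bytes: int) -> int:
    """Approximate BB_Credits for an ISL per the formula above, rounded up."""
    frames_in_flight = 2 * (distance_km / FRAME_LENGTH_KM[speed_gbps])
    return math.ceil(frames_in_flight * (FULL_FRAME / avg_frame_bytes) + VC_RESERVE)

print(isl_bb_credits(20, 4, 870))   # 105, matching the 4 Gbps example
print(isl_bb_credits(20, 8, 870))   # 204, matching the 8 Gbps example
```

For production links, the portBufferCalc CLI command or Brocade's spreadsheet should be preferred, since they incorporate switch-specific details this approximation does not.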
If you would like to graphically view how buffer credits and acknowledgements work, see the Brocade paper
Buffer-to-Buffer Credits, Exchanges, and Urban Legends, with examples starting at page 13.
Time at zero credits (tim_txcrd_z) is generally considered to be the number of times that a port was un-
able to transmit frames it had in queue because the transmit BB_Credit counter was zero. But one must
realize that the zero credit counter is incremented when the last buffer credit is allocated. A buffer credit
is allocated just prior to frame transmission so, technically, the switch continues to transmit the last frame
while at zero buffer credits. In long haul links this is insignificant but in local data centers over short pieces
of cable you need to look at link performance as well to understand whether there is a buffer credit problem
or not. Theoretically, a short cable link can be at zero buffer credits 100 percent of the time and the link will
continue transmitting 100 percent of the time.
The purpose of this statistic is to detect congestion or a device affected by latency. This parameter is
sampled on every port at intervals of 2.5 microseconds (400,000 times per second), and the tim_txcrd_z
counter is incremented if the condition is true. Each sample represents 2.5 microseconds of time with zero
transmitter BB_Credits and a frame waiting to be transmitted. An increment of this counter means that the
frames could not be sent to the attached device for 2.5 microseconds, indicating the potential of some level
of degraded performance.
In a high-performance environment, it is normal for buffer credits to reach zero at times and buffer credits
reaching zero does not mean there is a congestion problem. Optimization does not mean that the port can
never run out of buffer credits but rather that the port will not run out of buffer credits 99.999 percent of
the time. That 0.001 percent when it does run out of buffer credits will not harm performance or utilization
on the link. Also, in mixed speed environments, where the switch is attached to 2 Gbit, 4 Gbit, 8 Gbit, and/or
16 Gbit devices, this counter will typically increment at a higher rate on the lower speed devices than higher
speed devices.
If, however, the number of times the buffer credits reach zero is very high relative to the total frames sent, it could indicate a congestion problem. The absolute value is there in the counter (accumulated since the last POR or statsClear CLI command), but it is the relationship between the number of times buffer credits reach zero and the total number of transmitted frames that should be of interest. If the tim_txcrd_z counter shows a value around 10 percent or more of the total frames sent, one should take a real interest in it; a percentage much higher than 10 percent might indicate a fabric congestion problem. Keep in mind that smaller average frame sizes can contribute greatly to skewing the tim_txcrd_z counter. See an example of this in Report 2 above.
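The 10 percent guideline above can be expressed as a simple check. This is an illustrative sketch only; the counter values shown are invented, and in practice they would come from the portStats64show output after a portStatsClear:

```python
# Sketch: judge whether tim_txcrd_z warrants investigation, per the ~10% guideline.
# The counter values below are made up for illustration; real values come from
# the portstats64show CLI output after clearing statistics.

def txcrd_z_ratio(tim_txcrd_z: int, frames_tx: int) -> float:
    """Ratio of zero-credit samples to total transmitted frames."""
    if frames_tx == 0:
        return 0.0
    return tim_txcrd_z / frames_tx

tim_txcrd_z = 1_200_000   # zero-credit samples since the last statsClear
frames_tx   = 9_500_000   # total frames transmitted in the same window

ratio = txcrd_z_ratio(tim_txcrd_z, frames_tx)
if ratio >= 0.10:
    print(f"Investigate: zero-credit samples are {ratio:.0%} of frames sent")
else:
    print(f"Likely benign: {ratio:.0%}")
```

Remember that this ratio is only an indicator for further investigation, not proof of a problem, and that small average frame sizes will inflate it.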
The tim_txcrd_z counter (i.e. portStats64show CLI command) should be used as an indicator for further
investigation of buffer credit problems and not as proof of a performance problem. Also, remember that
port statistics are continuous until a user takes action to clear it. A portStatsClear CLI command should be
executed before attempting to use the tim_tmcrd_z so that it can be known to be a relevant count.
Hold Time (HT) is the amount of time a Class 3 frame (e.g., data frames and management frames) may remain in an ISL/trunk port queue, waiting for a buffer credit that would allow its transmission, before being dropped. The default HT is calculated from the Resource Allocation Time Out Value (RA_TOV), the Error Detect Time Out Value (ED_TOV), and the maximum hop count configured on a switch. With 10 seconds for RA_TOV, two seconds for ED_TOV, and a maximum hop count of seven, a Hold Time value of 500 milliseconds (ms) is calculated.
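The worked example above can be sketched as a small calculation. The exact formula shown here is an assumption inferred from the stated inputs and result (it reproduces the 10 s / 2 s / 7-hop case); treat your FOS documentation as authoritative if it differs:

```python
# Sketch of the default Hold Time calculation described above.
# Assumed formula: HT = (RA_TOV - ED_TOV) / ((max_hops + 1) * 2),
# with all times in milliseconds. It reproduces the worked example
# in the text: 10 s, 2 s, 7 hops -> 500 ms.

def default_hold_time_ms(ra_tov_ms: int, ed_tov_ms: int, max_hops: int) -> float:
    return (ra_tov_ms - ed_tov_ms) / ((max_hops + 1) * 2)

print(default_hold_time_ms(10_000, 2_000, 7))  # -> 500.0
```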
As previously discussed, time at zero credits (tim_txcrd_z) is sampled at intervals of 2.5 microseconds (μs), and the tim_txcrd_z counter is incremented if the condition is true. The HT value is 500 ms, or half a second, at which point a frame must be discarded. How do these two facts interrelate?
• 2.5 microseconds (μs) * 200,000 = 500 ms or one-half second
If an ISL/trunk port must wait 200,000 2.5 μs intervals, due to not having a buffer credit available to send
a waiting frame, then it will breach the HT mark (500 ms) for that port and then that frame, and potentially
others queued up behind it, must be discarded (C3 tx_timeout discard or C3TXT03).
Edge Hold Time (EHT) applies to F_Ports and provides an overriding value for Hold Time (HT) on devices attached to those F_Ports. The EHT (220 ms) is ASIC dependent and can discard blocked frames at an F_Port earlier than the 500 ms default Hold Time that is normally applied on ISL/trunk links. An I/O retry will still happen for each dropped frame.
The Edge Hold Time value can be used to reduce the likelihood of back pressure into the fabric by assigning a lower Hold Time value only to edge ports (e.g., host ports or device ports). The lower EHT value ensures that frames are dropped at the F_Port where the BBCs are lacking before the higher default Hold Time value used on the ISLs expires. This action localizes the impact of a high-latency F_Port to the single edge switch where the F_Port(s) are attached and prevents the back pressure and associated collateral damage from spreading into the fabric and impacting other, unrelated flows. When a user deploys Virtual Fabrics, the EHT value that is configured in the Base Switch is the value used for all Logical Switches.
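Tying the HT/EHT values back to the 2.5 μs sampling interval, the number of consecutive zero-credit samples that corresponds to a discard threshold can be computed directly. A minimal sketch:

```python
# Sketch: how many consecutive 2.5 us zero-credit samples correspond to a
# Hold Time breach? 500 ms is the default ISL Hold Time and 220 ms is the
# EHT figure cited above; the sample interval comes from the tim_txcrd_z
# discussion earlier in this section.

SAMPLE_US = 2.5

def samples_to_breach(hold_time_ms: float) -> int:
    return int(hold_time_ms * 1000 / SAMPLE_US)

print(samples_to_breach(500))  # -> 200000, matching the ISL arithmetic above
print(samples_to_breach(220))  # -> 88000, the EHT threshold on an F_Port
```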
Virtual Channels
Virtual Channels are a unique feature available on our 2, 4, 8, 16, and 32 Gbps fabric switch and director devices. In switched Fibre Channel networks, frames can be sent over very complex network architectures with guaranteed flow control and an assurance of lossless frame delivery. However, without an architecture that provides fair access to the ISL wire for all the I/O traffic streams using it, congestion could impose a condition in which one traffic stream blocks other streams on that ISL link. This is generally referred to as head-of-line blocking (HoLB). In a Brocade switched-FC architecture using virtual channels, if congestion does develop in the network because of a multiplexed ISL traffic architecture, all sending traffic slows down gradually in response to the congestion. Virtual channels create multiple logical data paths across a single physical link or connection. Each is allocated its own network resources, such as queues and buffer-to-buffer credits. Virtual channel technology is the fundamental building block used to construct Adaptive Networking services.
Virtual channels are divided into three priority groups. Priority Group 1 (P1), VC 1, is described in many documents as the highest-priority VC and is used for Class F and ACK traffic. Priority Group 2 (P2) is the next highest priority; it is used for data frames and provides for high, medium, and low QoS on VCs 2-5 and 8-14. Priority Group 3 (P3) is the lowest priority and is used for broadcast and multicast traffic. This example is illustrated in Figure 8.
Each VC is responsible for maintaining its own BB_Credits and counts. A Brocade VC_RDY is then substituted for R_RDY as the acknowledgement. But even when ISL links are being used very well, head-of-line blocking (HoLB) on an ISL link can create congestion and performance problems, much as toll booths on a superhighway can cause congestion and slow travel.
VCs provide a way to create a multi-lane connection so that even if some frames are stopped from flowing, other frames can bypass them and reach their destinations across a single link. To ensure reliable ISL communications, VC technology logically partitions the physical buffer credits on each ISL link into many different buffer credit pools (one per VC) and then, if QoS is in use, prioritizes traffic to optimize performance and prevent head-of-line blocking. This VC technology is not just about creating smaller pools of logical buffer credits from the physical set of buffer credits; each VC is also provided with its own queues, and the VC technology manages those as well.
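The per-VC partitioning of credits and queues described above can be sketched as a toy model. This is purely an illustration of the concept (separate credit pools and queues per VC, with VC_RDY returning credits), not Brocade's actual implementation:

```python
# Toy model (not Brocade internals): partitioning an ISL's physical buffer
# credits into per-VC pools, each with its own queue, so that one blocked
# VC cannot starve the others.

from collections import deque

class VirtualChannelLink:
    def __init__(self, credits_per_vc: dict[int, int]):
        self.credits = dict(credits_per_vc)            # remaining credits per VC
        self.queues = {vc: deque() for vc in credits_per_vc}

    def enqueue(self, vc: int, frame: str) -> None:
        self.queues[vc].append(frame)

    def transmit_ready(self):
        """Send one queued frame from every VC that still holds a credit."""
        sent = []
        for vc, q in self.queues.items():
            if q and self.credits[vc] > 0:
                self.credits[vc] -= 1                  # credit consumed on transmit
                sent.append((vc, q.popleft()))
        return sent

    def vc_rdy(self, vc: int) -> None:
        self.credits[vc] += 1                          # receiver returned VC_RDY

link = VirtualChannelLink({1: 2, 2: 8, 3: 1})          # pool sizes are illustrative
link.enqueue(2, "data-frame")
link.enqueue(3, "broadcast")
print(link.transmit_ready())  # both VCs send; VC 2 is not blocked behind VC 3
```

The point of the model is the isolation: exhausting VC 3's single credit would leave VC 2's pool, and therefore its traffic, unaffected.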
L0: The normal (default) mode for a port. It configures the port as a regular port. L0 supports local distances up to the rated distance for the link speed in use.
LE: The mode that configures the distance for an E_Port when that distance is greater than 5 km and up to 10 km. Its calculations are based on an average frame size equal to a full frame of 2,148 bytes, which is generally not the case.
If a link does not use L0 or LE mode and requires more than 20 buffer credits, then the optional Extended Fabrics license is available and must be enabled on the switch. When Extended Fabrics is enabled, the ISLs (E_Ports) are configured with a large pool of buffer credits. The license allows the following two options to be applied to a long distance link:
LD (Dynamic): The mode in which FOS calculates buffer credits based on the distance measured at port
initialization time. Brocade switches use a proprietary algorithm to estimate distance across an ISL. The
estimated distance is used to determine the buffer credits required in LD (dynamic) extended link mode
based on the “-frameSize” option, if used, or a maximum Fibre Channel payload size of 2,112 bytes and a
maximum frame size of 2,148 bytes.
LS (Static): The recommended mode for configuring buffer credits over long distances, as it provides the user full control over the buffer credit calculations. Configure the number of buffers by using the “-frameSize” option along with the “-distance” option.
Note that any time a user needs a long distance link of 10 km or more, they must enable the “Extended Fabrics” license (included as part of the Enterprise Software Bundle) on their switch devices to allow for the additional buffer credits. This license is included with directors at no additional charge.
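The sizing the LD and LS modes perform can be approximated with a rule-of-thumb calculation: the link needs enough credits to cover the round-trip time at full serialization rate. This sketch assumes roughly 5 μs/km one-way propagation in fiber and 10 bits on the wire per byte; it is a first-order estimate, not a substitute for the FOS calculation or Table 1:

```python
# Rule-of-thumb sketch for sizing long-distance buffer credits.
# Assumptions (labeled, not from the text): ~5 us/km one-way propagation
# in fiber, 10 encoded bits per byte. credits ~= round-trip time divided
# by the serialization time of one frame.

import math

def bb_credits_needed(distance_km: float, speed_gbps: float,
                      frame_bytes: int = 2148) -> int:
    round_trip_us = distance_km * 5.0 * 2
    frame_time_us = frame_bytes * 10 / (speed_gbps * 1000)
    return math.ceil(round_trip_us / frame_time_us)

# 50 km at 16 Gbps with full-size (2,148-byte) frames:
print(bb_credits_needed(50, 16))  # -> 373
```

Smaller average frame sizes shorten the serialization time, so the same distance requires proportionally more credits, which is why LS mode's “-frameSize” option matters.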
Table 1: Approximate number of BBCs required for 50 km with/without compression and encryption.
Device latency-based bottlenecks, called latency bottlenecks, are much more difficult to detect. Once again, Fabric Performance Impact monitoring provides great capabilities for troubleshooting if this type of bottleneck is occurring.
Fabric Performance Impact (FPI) monitoring is one of the key enhancements introduced in FOS 7.3. It is a simplified yet advanced tool for detecting abnormal levels of latency on an F_Port in a fabric, and it substantially improves visibility into device latency compared with the legacy Bottleneck Detection capability. FPI, at FOS 7.3, does not monitor E_Ports.
FPI provides automatic monitoring and alerting of latency bottleneck conditions through pre-defined thresholds and alerts in all pre-defined MAPS policies. The advanced monitoring capabilities identify ports at two latency severity levels and provide intuitive reporting in the MAPS dashboard under a new Fabric Performance Impact category. FPI monitors the F_Port for latency conditions and classifies them into two states. One is the I/O performance impact state, the condition causing I/O disruption. The other is I/O frame loss, a condition that could severely degrade throughput. The default action for both states is to report to the RASLOG.
Two significant enhancements were made to FPI with the release of FOS 7.4. The first is Slow-Drain Device Quarantine (SDDQ), a mechanism that recognizes when a slow-draining device operating in the fabric is causing latency issues. Once the condition is detected, a Registered State Change Notification (RSCN) is sent to all of the switches in the fabric. This alerts each of the zones that can communicate with the slow-drain device to send traffic to that device only across a low-priority virtual channel. The slow-drain device is thus “quarantined” away from other data traffic, which helps alleviate fabric congestion. This process is also known as Automatic VC Quarantine (AVQ). Data traffic being transmitted to other, normally behaving devices continues to use the medium VC as always.
The second enhancement is Port Toggle as an action, which automatically recovers slow-drain device conditions when they are detected by FPI monitoring. A port toggle (a port disable followed by a port enable) can recover ports from some slow-drain device conditions or force traffic failover to an alternate path.
Port Toggle and Slow-Drain Device Quarantine (SDDQ) are mutually exclusive. Attempting to use them together might result in unpredictable behavior.
If the receiving side of an ISL connection cannot recognize the Start Of Frame (SOF) in the header of an incoming frame, it will not respond with the appropriate VC_RDY. Because the sender has decremented its available buffer count by one and never receives the corresponding VC_RDY, credit synchronization between the sender and receiver is now skewed by the missing VC_RDY. When this condition occurs, no explicit error is generated, and it will not resolve itself without some form of automated or manual recovery mechanism; not receiving a VC_RDY is not considered an error condition. A corrupted SOF or VC_RDY violates the Cyclic Redundancy Check (CRC) and generates an error and/or increments a statistical counter; however, this is not specific to BB_Credit or VC_RDY loss and will not initiate BB recovery or adjust the associated counters. Another scenario occurs when the VC_RDY returned by the receiver is corrupted. The result is the same as in the previous scenario: the sender's outstanding buffer count is one less than what is actually available on the receiving side.
The Brocade BB_Credit Recovery mechanism will be activated when a port on a Brocade switch that
supports 8 Gbps, 16 Gbps, or 32 Gbps FC is configured for LE, LS, or LD long distance mode.
Buffer credit recovery allows links to recover after buffer credits are lost when the buffer credit recovery
logic is enabled. The buffer credit recovery feature also maintains good performance. If a credit is lost, a
recovery attempt is initiated. During link reset, the frame and credit loss counters are reset without
disruption but potentially with slight and temporary performance degradation.
Forward Error Correction (FEC) for 16 Gbps Gen 5 and 32 Gbps Gen 6 Links
Forward Error Correction (FEC) allows recovery of error bits in a 10 Gbps or 16 Gbps data stream. This feature is enabled by default on all ISLs and ICLs of 16 Gbps FC platforms. FEC can recover from bit errors because the Condor3 ASIC collects the high-order check bits during the 64b/66b encoding process of the frame; FEC then uses them to enable the receiver to restore the frame's integrity.
Since it is often bit errors in the data stream that cause the loss of buffer credits at the transmitter, customers will find that FEC not only helps avoid data retries but also minimizes the loss of buffer credits on the link. Users can enable or disable the FEC feature while configuring a long distance link. This allows end users to turn off FEC where it is not recommended, such as when configuring active DWDM links. Transparent DWDM links can still make use of FEC as long as the original transmitter and the target receiver are both controlled by Condor3 ASICs.
Now that we understand the concept behind buffer credits and how they are used, the next question is
”What tools and metrics are available to help me ensure that my configuration is operating optimally?” The
good news is that the answer to this question is more positive now than it was just one year ago. There
are a number of tools available to you, and we will describe each of them. It is important to point out that
there is no single source that provides all the information you need to optimally configure and tune your
buffer credits. And some of these sources only provide hints rather than hard numbers - for example, a high
proportion of zHPF I/Os indicates that fewer buffers may be required to move a given amount of data than if
the I/Os were mainly FICON. As with so many things, successful buffer credit tuning is a combination of art
and science.
Figure 8 shows a simple FICON environment, the locations of the receive buffers, and the tools that you
can use to gain insight into how the buffers are being used. If you have more than one data center (as is
increasingly becoming the norm) your configuration might look more like the one shown in Figure 9. We will
refer back to these figures in the descriptions of each tool.
The RMF Channel Path Activity report provides information about the use of each channel. In Figure 8, the
red box numbered “1” represents the entities that are reported on in the Channel Path Activity report. The
report is interesting in the context of buffer credits because it summarizes the activity for each channel into
Reads and Writes (in megabytes per second) and also zHPF and command mode requests. A sample report
is shown in RMF Example 1.
The ratio of reads to write is important because that impacts the average frame size in each direction
(from the CPU out to the control units, and from the control units back to the CPU). The ratio of FICON to
zHPF operations is important because zHPF I/Os tend to result in larger frames which use buffers more
efficiently. It also reports the speed of each channel (in the SPEED column). The report provides important
environmental information, along with hints about buffer credit usage. However, it does not directly report
the number of buffers in use, or any delays due to pacing.
• The port ID (reported in the PORT ADDR column). In Figure 6 on page 12, the ports are represented by red
boxes labeled 2 and 3.
• The type, CHPID or control unit number of the device the port is connected to, and that device's serial number.
• The average frame pacing delay for frames being sent from the switch. For example, referring back to
Figure 8, frames being sent from red box 2 (switch port) to red box 1 (CPU channel), or from red box 3
(switch port) to red box 4 (disk CU). In a configuration like that shown in Figure 7 on page 13, it would
also report on switch-to-switch transmissions between red boxes 3 and 4. A non-zero value in this field
indicates a lack of buffers in the device that is connected to that port. So, in the report in Example 2, we
have artificially set the AVG FRAME PACING value for port 04 (connected to CHP 60) to be 9. That would
indicate that there were times when all the Receive buffers in CHP 60 were in use and there was at least
one frame queued in the switch for transmission to the channel. To be precise, this reports the average
number of microseconds that a frame being transmitted on that port was delayed because all the buffers
at the other end of the link were in use. As with any “average” value, a non-zero value only indicates that
there was at least one occurrence of frame pacing; there is no indication of the distribution of pacing
events across the interval. If there are a large number of I/Os and a small number of pacing events, it is
possible that the delay might get rounded down to a zero value. In practice, given the number of buffers in
channel cards and most modern DASD control units, and the relatively short distances between switches
and the devices they are connected to, it is unlikely that you will see large values in this field for ports that
are connected to a CPU or control unit. However, non-zero values would be common for ISL links that are
only using the default number of buffers.
• The average frame size for ‘writes’.
The value reported in the WRITE column is the average frame size for frames transmitted by this port. You
would use this value to help calculate the required number of buffers for the port at the other end of this
link.
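The measured average frame size from the WRITE column can be fed into the same serialization arithmetic to see how smaller frames inflate the buffer requirement at the far end of the link. As before, the ~5 μs/km propagation figure and 10 bits per byte are assumptions for illustration, and the distances and frame sizes below are invented:

```python
# Sketch: estimate the buffers needed at the far end of a link using the
# measured average frame size (e.g. the WRITE column of the RMF FICON
# Director Activity report). Assumes ~5 us/km one-way propagation and
# 10 encoded bits per byte; example values are illustrative.

import math

def buffers_for_link(distance_km: float, speed_gbps: float,
                     avg_frame_bytes: float) -> int:
    round_trip_us = distance_km * 5.0 * 2
    frame_time_us = avg_frame_bytes * 10 / (speed_gbps * 1000)
    return math.ceil(round_trip_us / frame_time_us)

# A 10 km ISL at 16 Gbps: full-size frames vs. a measured 800-byte average.
print(buffers_for_link(10, 16, 2148))  # -> 75
print(buffers_for_link(10, 16, 800))   # -> 200
```

This is why the mix of zHPF (larger frames) and command-mode FICON traffic matters: the same link can need several times more buffers when the average frame shrinks.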
–– Since switches are shared across all the systems, it is not necessary for every system to gather
statistics. The FICON STATS parameter in the IECIOSxx member should be set to YES on one system
in each sysplex. This is the default value, meaning that you need to explicitly set it to NO on the other
systems in the sysplex. You can display the current setting using the D IOS,FICON command.
–– RMF can communicate with the switch to gather the information that is shown in this report and save
that information in SMF type 74 subtype 7 records. However, this gathering is disabled by default.
You can check the current setting using the F RMF,DZZ command. The collection of this information
is enabled by specifying FCD in your ERBRMFxx member (the default is NOFCD). Obviously you should
do this on the same system where you have specified FICON STATS=YES!
If both of these parameters are set correctly, RMF will collect the information from the switch, and the report
can be produced by specifying the REPORT(FCD) parameter when you run the RMF postprocessor. Even
though this report provides a lot of valuable information, it still leaves some questions unanswered:
1. How many buffers are in each FICON port, both on the switch, and in the CPU and control units?
2. Were there frame pacing delays caused by a shortage of buffers in the switch ports?
3. Remember that the Average Frame Pacing value in the report refers to transmissions from the switch
to attached CPUs and control units - non-zero values indicate a shortage of buffers in the attached
device, not in the switch. The report does not provide any information about frame pacing that might be
experienced by other devices that are transmitting to the switch.
4. Note that the channel speed is not shown in the report. To get that information, you need to run an RMF
Channel Activity report and use the “Connection ID” from the RMF FICON Director report to tie back to
the CHPID information in the Channel Path Activity report. The SPEED column in the Channel Activity
report shows the FICON channel speed.
Until recently, there was no way to obtain this missing information from your system. And if you didn’t have
a switch (and therefore no FICON Director Report) you were operating very much in the dark. Certainly you
could check Google in the hope of finding how many buffers are provided in a channel card or a control
unit port. But that information can change. For example, 16 Gbps FICON channel ports have 90 buffers,
while 8 Gbps ports only have 40 buffers. However, those nice people in IOS development in IBM have kindly
delivered enhancements to the D M command to provide this information. Let’s have a look at these new
commands.
D M=DEV(dddd,cc),ROUTE=TODEV|FROMDEV|BOTH,HEALTH
In Part 4 Chapter 2, we discussed the CUP Diagnostics functionality and the integration with the IBM Health
Checker for z/OS. In this section, we will look in more detail at the display matrix commands we briefly
mentioned in that chapter.
The D Matrix command was enhanced by APAR OA40548 to obtain information from a switch Control Unit
Port (CUP) function. The PTFs for this APAR were delivered in June 2013, and are available back to z/OS
1.12.
Note: This command only provides information for devices that are connected to a switch port that uses a
2-byte link address.
2-byte link addresses are normally used in configurations that contain cascaded switches; however, they can also be used in a configuration like that shown in Figure 8. A sample output from the command is shown in Example 3.
You can see that the command output provides a wealth of information, with buffer credit-related
information only being a small part of that. However, much of the information that it provides is not available
elsewhere, making this a very valuable command if you are trying to debug channel or I/O-related problems.
The ROUTE keyword lets you specify whether you want information from the perspective of the CPU out
to the control units, or from the control unit back to the CPU (or both, as in this example). When BOTH
is specified, the first part of the command output reports on the switch ports for the path from the CPU,
through the switch, to the CU. The second half of the report provides the equivalent information for the path
from the CU back through the switch to the CPU.
The %Delay Trn field (highlighted in red in the output shown in Example 3) reports the percent of requests
for the associated Transmit port that were delayed because all the Receive buffers at the other end of the
link are in use (red boxes 1 and 4 in Figure 8). The %Delay Rcv field reports the percent of requests that
were delayed because this port (red boxes 2 and 3 in Figure 8) had no available Receive buffers.
D M=DEV(dddd,cc),LINKINFO
The D Matrix command was enhanced again by APAR OA49089. This APAR delivers support for a new Read
Diagnostic Parameters function that is available for FICON 8 Gbps and 16 Gbps channels on z13 and z13s
CPCs. This command provides information that will be invaluable to whoever is responsible for the hardware
and cabling infrastructure in your installation. A sample output is shown in Example 4. Of particular interest
to us is the line labeled “Buffer Credits.”
However, as you can see in the display, not every port in the path reported its number of buffer credits.
Further investigation unearthed some information that was not in the PTF HOLD information or in the
documentation for the PTF:
• If the switch is a Brocade one, it must be running FOS 7.4.0 or later. At the time this command was
issued, the switch was running FOS 7.3.1.
• The control unit in this sample was an IBM DS8870. However, the firmware was back-level; level 7.5 or later is required. That firmware level was subsequently installed, but all the fields still reported “Not Avail.” It transpires that not only must you have the current firmware level, the control unit also has to have 16 Gbps ports.
Two related checks are provided by the IBM Health Checker for z/OS:
• IOS_BUFFER_CREDITS
• IOS_PORT_SPEED
The IOS_BUFFER_CREDITS check uses the Read Diagnostic Parameters (RDP) function on z13/z13s and z14 to
monitor for buffer credit shortages on the links between the CPU and the switch, and between the switch
and the control unit. If any shortages are detected, the check will generate an alert. This is definitely the
easiest way to be made aware of buffer credit shortages in CPU or CU ports, especially if you tailor your
automation product to generate an alert or email if the check generates a message (IOSHC156E). While
it is not directly related to buffer credits, the IOS_PORT_SPEED health check is also of interest because
it warns you if a pair of adjacent FICON ports are operating at less than their full rated speed. This can
happen, for example, if errors are encountered on the link, and the two ports negotiate down to a lower
speed in an effort to eliminate the errors.
Brocade Monitoring and Alerting Policy Suite (MAPS) Integration with IBM Health Checker
for z/OS
Part 4 Chapter 1 covered the Brocade Fabric Vision tools, and introduced MAPS. This section will go into
more detail on how MAPS works with the IBM Health Checker for z/OS.
If you have a latency bottleneck, your ISLs will not be running at the maximum of their possible bandwidth
because they have insufficient buffer credits to support proper utilization. If you see a latency bottleneck
on an ISL or trunk, it is often the result of back pressure from a slow-drain device attached to the adjacent
switch. But every now and then it can just be a case of an ISL link being mis-configured. Not only can these
be annoying problems, they can be very detrimental to the overall health and performance of your SAN.
FPI Monitoring uses pre-defined thresholds and alerts, in conjunction with MAPS, to automatically detect
and alert systems administrators to severe levels or even transient spikes of additional latency. They
also identify slow-drain devices that might impact their storage network. As part of Brocade Fabric Vision
technology, MAPS pro-actively monitors the health and performance of the SAN infrastructure to ensure
application uptime and availability.
Other tools we have discussed in this chapter have given us, and z/OS, the ability to understand whether N_Port and F_Port buffer credits are performing satisfactorily. FPI does that as well, but in addition, FPI provides invaluable information about E_Ports, information that has been difficult to obtain on z Systems in the past. IBM APAR OA40548 (also see APAR OA43783) provided the z/OS Health Checker with a capability to interact more powerfully with the Control Unit Port (CUP) to help make fabric problems much more visible to z/OS. Brocade FOS v7.4 included a very significant change to CUP by improving a feature known as CUP Diagnostics. CUP Diagnostics was enhanced with an interface to the Fabric Vision MAPS capability so that MAPS can notify CUP when an appropriately configured threshold event has been triggered. CUP uses these triggered MAPS events to create “Health Check Alerts.” CUP sends these Health Check Alerts to the z Systems host(s), using the FICON Unsolicited Alert Status mechanism, to notify the host(s) of these MAPS-detected switched-FICON fabric conditions.
When a switch signals to z/OS that a fabric problem has been found and that diagnostic data is available, z/OS begins periodic monitoring. At regular intervals, z/OS obtains health diagnostic data from the switch. The resulting health check report contains the formatted diagnostic data, as shown in the sample output in Example 5. The health check issues an exception if a switch has indicated to z/OS that there is a possible health problem. The report is updated at regular intervals with the latest diagnostic data until two hours after the switch stops reporting a health problem. The problem being reported may require intervention, for example, z/OS system commands or an I/O configuration change. When no further errors or health issues have been detected by the switch for at least two hours, the monitoring of the route is stopped, and the route no longer appears in the report.
While z/OS has historically provided sophisticated I/O monitoring and recovery capabilities, fabric
components have been essentially invisible to z/OS. z/OS has only been able to react to secondary
conditions, such as elongated connect times, which often were a symptom of a problem in the SAN. This
report now provides z/OS with the fabric visibility that it was lacking in the past, and exposes newly-available
diagnostic data provided directly from the switch. This health check may be able to provide insight into your
fabric problems such as hardware errors, I/O mis-configurations, or congestion, and is supported by z/OS
1.12 and later when used with Brocade switching devices running FOS 7.4 or later that have CUP enabled
and active.
Example 5 shows what you might expect to see from a switch-initiated z/OS health check. You can find more information in the IBM Health Checker for z/OS User's Guide. You will notice that the output format is almost exactly the same as that shown for the D M=DEV(dddd,cc),ROUTE,HEALTH command in Example 3. In the example below, ports 01 and B0 on switch 25 are E_Ports.
If you determine that the problem is a lack of buffers in the switch ports, then you have more flexibility. You can effectively purchase additional buffer memory for your switches by adding a license (Extended Fabrics) to allow for connections greater than 10 km in distance and to allocate additional buffers to specific ports. The starting point is to use the information from your configuration (link speed, link distance, average frame size) with your switch vendor's buffer credit calculation tool to determine the ideal number of buffers for each switch port.
If you have a tool to process SMF records, it would be wise to review the average frame sizes for an entire
day (given that this is the one input to the calculation tool that is likely to change over time), checking
the distribution of frame sizes and the smallest values; you can get this information from the SMF type
74 subtype 7 records (the R747PNWR, R747PNWW, R747PNFR, and R747PNFW fields). In particular, pay
special attention to the average frame sizes during peak periods of your daily processing.
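A day's worth of per-interval average frame sizes can be scanned for the peaks that matter. The sketch below flags intervals whose small frames would inflate the credit requirement relative to full-size frames; the interval data and the 2x threshold are invented for illustration, and real values would be derived from SMF 74-7 fields such as R747PNWW:

```python
# Sketch: flag intervals whose small average frame sizes would inflate the
# buffer credit requirement. Data and threshold are illustrative; real frame
# sizes would come from SMF type 74 subtype 7 records.

FULL_FRAME = 2148  # full-size FC frame, bytes

def credit_inflation(avg_frame_bytes: float) -> float:
    """Factor by which required credits grow versus full-size frames."""
    return FULL_FRAME / avg_frame_bytes

intervals = {"09:00": 1900, "12:00": 650, "21:00": 2100}  # avg write frame bytes

for t, size in intervals.items():
    factor = credit_inflation(size)
    if factor > 2.0:
        print(f"{t}: avg frame {size} B needs ~{factor:.1f}x the full-frame credits")
```

An interval like the midday one here, with a 650-byte average, is exactly the kind of peak-period value worth feeding into the vendor's calculation tool.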
Having adjusted the number of buffers in accordance with the values provided by your buffer credit
calculator, then monitor for non-zero AVG FRAME PACING values in the RMF FICON Director Activity report.
In the longer term, when all of your hardware and firmware have been upgraded to support APAR OA49089,
you can add automation to check for non-zero values and to monitor for exception messages from the IOS_
BUFFER_CREDITS health check.
Summary
Despite the perceptions of some analysts that Fibre Channel is dead or dying, it continues to march forward with innovation and remains the standard for lossless, high-speed storage networking.
One of the early and lasting benefits of Fibre Channel (FC) fabrics was the extremely high performance of its communications. A factor in that high performance is that all FC traffic is sent in a lossless fashion: a frame sent across an FC fabric is guaranteed to arrive at its intended destination.
Traditional database workloads such as online transaction processing (OLTP) and online analytical processing (OLAP) are still increasing their performance and I/O-per-second (IOPS) demands while continually requiring absolutely error-free operation. Both of these workloads perform optimally in lossless environments: the first time should be the only time a packet is sent to a receiver, in order to avoid errors and maintain performance, especially in high-volume applications. Fibre Channel can promise total protection against silent data corruption, the type of corruption that happens when incomplete or incorrect data overwrites data on a Fibre Channel fabric storage device. This problem can lead to costly downtime and even data loss.
After reading this chapter, readers should understand the benefit and value of a lossless data
environment when buffer credits are properly configured in a Fibre Channel/FICON SAN fabric.
References
Artis, H., & Guendert, S. (2006). Designing and Managing FICON Interswitch Link Infrastructures.
Proceedings of the 2006 Computer Measurement Group. Turnersville, NJ: The Computer Measurement
Group, Inc.
Brocade Communications Systems. (2016). Fibre Channel Buffer Credits and Frame Management.
San Jose, CA, USA: Brocade Communications Systems.
Brocade Communications Systems. (2008). Technical Brief: Cascaded FICON in a Brocade Environment.
San Jose, CA, USA: Brocade Communications Systems.
Brocade Communications Systems. (2015). Brocade CUP Diagnostics and the IBM Health Checker for z/OS.
San Jose, CA, USA: Brocade Communications Systems.
Catalano, P., & Guendert, S. (2016). FICON Dynamic Routing Technology and Performance Considerations.
Armonk, NY, USA: IBM Corporation.
Guendert, S. (2007). Understanding the Performance Implications of Buffer to Buffer Credit Starvation
in a FICON Environment: Frame Pacing Delay. Proceedings of the 2007 Computer Measurement Group.
Turnersville, NJ: The Computer Measurement Group, Inc.
Guendert, S., & Lytle, D. (2007). MVS and z/OS Servers Were Pioneers in the Fibre Channel World.
Journal of Computer Resource Management, 3-17.
IBM Corporation. (2015). IBM Health Checker for z/OS V2R2 User’s Guide.
Armonk, NY, USA: IBM Corporation.
Lytle, D. (2015). A Deeper Look into the Inner Workings and Hidden Mechanisms of FICON Performance.
SHARE in Orlando 2015, Session 17506.
Singh, K., Grau, M., Hoyle, P., & Kim, J. (2012). FICON Planning and Implementation Guide.
Poughkeepsie, NY: IBM Redbooks.
PART 4 Chapter 5: IBM z14-Brocade Gen 6 Mainframe Storage Networks
Introduction
As the flash era of the 2010s disrupts the traditional mainframe storage networking mindset, Brocade and
IBM have released a series of features that address the demands of the data center. These technologies
leverage the fundamental capabilities of Gen 6 Fibre Channel, and extend them to the applications driving
the world’s most critical systems. This chapter provides a summary of these unique mainframe features,
which reflect Brocade and IBM’s dedication to technical excellence.
Overview
Partnership
The Brocade and IBM partnership consists of many teams working together to produce the best-in-class
products that mainframe users have come to expect. The close relationship between the business teams
allows both organizations to guide the introduction of key technologies to the marketplace, while the
integration between the system test and qualification teams ensures the integrity of those products. Driving
these efforts are the deep technical relationships between the Brocade and IBM Z I/O architecture and
development teams, whose close collaboration provides the transformative technology to propel today’s
ever-changing data center.
Technology Themes
During the past five to seven years, the technology landscape of mainframe storage networking systems
has evolved to embrace new levels of scale. Not only are the computational demands increasing, but
the physical reach of the systems is also growing. Business continuity and recovery requirements have
stretched the limits of yesterday’s technology, and the millennial cloud movement is forcing organizations
to reexamine transactional processing. These pressures led IBM to produce one of the most innovative
and highly functioning processors in history—the IBM z14. At the same time, Brocade provided the storage
networking dexterity necessary to complement the z14’s impact.
Feature Summary
The technological leap induced by the z14 is underscored by the inventiveness and integration of the
Brocade Gen 6 feature set:
• Inter-Chassis Links (ICLs) provide world-class scalability and flexible chassis interconnectivity.
• Multi-hop topologies provide flexibility to geographically dispersed data centers to support unique disaster
recovery and business continuity configurations.
• 16 GFC FICON Express16S+ device connectivity and 32 GFC fabric connectivity provide the fastest data
rates in the industry.
• FICON Dynamic Routing utilizes the full capacity of Inter-Switch Link (ISL) and ICL connections in cascaded
network topologies, leveraging Brocade Exchange-Based Routing (EBR) technology.
• Fabric I/O Priority capabilities provide a foundation for applications, such as z/OS Workload Manager, to
prioritize individual IOs for optimal storage network performance.
• Port Decommission is used for easier infrastructure management through automated interaction between
the fabric and the mainframe I/O subsystem.
• IBM Health Checker for z/OS leverages the internal Control Unit Port (CUP) of the FICON director to
evaluate single-point-of-failure conditions as well as path integrity and performance.
• Problem determination and isolation are enhanced through the Read Diagnostic Parameters (RDP)
operations.
• Forward Error Correction (FEC) improves connection reliability and performance.
Technology Highlights
Brocade mainframe technology has produced several “first to market” innovations, including Gen 6 32 Gbps
Fibre Channel, Gen 5 16 Gbps Fibre Channel with Forward Error Correction, FICON Dynamic Routing, z/OS
Health Checker integration, and Read Diagnostic Parameters functionality. Leveraging the IBM Z I/O team
partnership, Brocade has also produced significant mainframe storage networking market exclusives such
as ICLs and Port Decommissioning. This shared dedication to the success and growth of the mainframe
storage network can also be seen in newer endeavors, including Fabric I/O Priority and FICON multi-hop
topologies. These technologies are detailed below to provide insight into Brocade Gen 6 mainframe storage
networking functionality.
Multi-hop Topologies
The movement to disperse mainframe data centers geographically has forced the evolution of classical
FICON SAN topologies. Simple cascaded systems have given way to multi-hop variations that exploit Brocade
ICL and FCIP channel extension technologies, while maintaining the integrity of the “hop of concern”
provision. Advanced diagnostic capabilities, such as the IBM Health Checker for z/OS and the Fibre Channel
Read Diagnostic Parameters function, have provided the apparatus necessary to maintain the deterministic
reliability required for mainframe storage networks in extended topologies.
Multi-hop topologies include two-, three-, and four-site disaster recovery and business continuity
configurations (see Figure 2). Each topology is designed to address specific business requirements as well
as regulatory provisions. In addition, the supported topologies are verified to operate with the same level of
predictability and deterministic reliability found in today’s classical cascaded solutions.
Gen 6 Performance
Brocade Gen 6 FICON directors provide optimal interface attachment rates for the FICON Express16S+
(FX16S+) channel (shown in Figure 3), as well as for storage arrays. The 16 Gbps Fibre Channel device
connectivity and 32 Gbps Fibre Channel fabric connectivity provide minimal storage network latency, which
significantly enhances data center application operations and performance.
The faster link speeds provide reduced IO latency for critical middleware IO, such as database log writing,
which significantly improves DB2 transactional latency. The faster link speed also shrinks batch windows
and reduces the elapsed time for IO-bound batch jobs. Since faster link speeds are more sensitive to the
quality of the cabling infrastructure, new industry-leading standards are included to provide enhanced error
correction on optical connections, ensuring a smooth transition to 16 Gbps and 32 Gbps Fibre Channel
technologies.
The benefits of FICON Dynamic Routing are highlighted in the IBM technical reference FICON Dynamic
Routing (FIDR) - Technology and Performance Implications, co-authored by Brocade.
Availability Note
Brocade and IBM Z are exploring the full integration of Fabric I/O Priority to allow system management
functions, such as z/OS Workload Manager, to exploit end-to-end, integrated quality of service
characteristics.
Once the maintenance operations have been completed, the port/path is returned to full operational status
through the Port Recommission function. As with the decommission operation, Brocade Network Advisor
coordinates and communicates the recommission action with z/OS to ensure the operation occurs error-
free.
In addition to the single-point-of-failure analysis, IBM Health Checker for z/OS leverages the FICON director
CUP interface to assess performance characteristics of the members of a path group. The “flow” information
provided includes utilization characteristics, operational conditions, and diagnostic error statistics. This
unique solution also incorporates the Brocade Fabric OS (FOS) bottleneck detection mechanism, which
provides alerts to z/OS when fabric anomalies are detected.
Enhanced Diagnostics
Keeping pace with the ever-growing complexity of mainframe storage networks, Brocade and IBM co-
developed in-band diagnostic mechanisms that enrich the view of fabric components. The IBM Z channel
and control unit solutions include functionality that allows the FICON director to read information about the
optical transceiver attached to the IBM Z device (see Figure 8). This same functionality is utilized by z/OS to
provide similar information about the transceivers installed in the director.
The new Fibre Channel-standard Read Diagnostic Parameters (RDP) command was created to enhance
path evaluation and automatically differentiate between cable hygiene errors and failing components. The
RDP function provides the optical characteristics of each end of the link, including optical signal strength
and environmental operating metrics, without requiring manual insertion of optical measurement devices.
Furthermore, the RDP operation is performed periodically throughout the mainframe storage network to
proactively detect marginal conditions that can be addressed before they affect production operations.
Gen 6 Fibre Channel Technology Benefits for IBM Z And Flash Storage
At first glance, given that FICON Express16S channels became generally available in March 2015 on the
z13 family of processors, the need for Gen 6, 32 Gbps-capable FICON/FCP may not be readily apparent.
However, considering the reasons Gen 6 Fibre Channel technology was developed in the first place, along
with the recent z14, LinuxONE, and flash storage array enhancements, it is clear why Gen 6 technology is a
great fit for many organizations today. When Gen 5 Fibre Channel made its debut in 2011, many asked
similar questions about the need for it, given that channel and storage connectivity was still all 8 Gbps, and
remained so until 2015. Gen 6 Fibre Channel brings three primary, immediate technology benefits to a z14
and flash storage architecture:
• Improved performance
• Improved Virtual Machine (VM) support
• Support for the upcoming NVMe over Fabrics technology
Improved Performance
Enterprise-class storage performance has historically focused on four metrics: latency, response time,
throughput (bandwidth), and Input/Output Operations Per Second (IOPS). There are some common
misperceptions, including that the terms “response time” and “latency” can be used interchangeably, so it
is good to review what each of these metrics actually measures.
Latency
Latency is a measure of delay in a system. In the case of storage, it is the time it takes to respond to an
I/O request. In the example of a read from disk, this is the amount of time taken until the device is ready
to start reading the block—that is, not including the time taken to complete the read. In the spinning disk
(HDD) world, this includes things such as the seek time (moving the actuator arm to the correct track) and
the rotational latency (spinning the platter to the correct sector), both of which are mechanical processes,
and therefore slow. Latency in a storage system also includes the finite speed for light in the optical fiber,
and electronic delay due to capacitive components in buses, Host Bus Adapters (HBAs), switches, and
other hardware in the path. The primary advantage flash storage/SSD has over spinning disk technology
is the lower latency and the consistency in the latency. SSDs can exhibit latency that is lower by a factor of
100 compared to HDD technology, and this is primarily due to the absence of the mechanical processes
described above. Equally important, for random access IO, the SSD with its consistently low latency is
able to move data continuously, while the HDD must continuously return for chunks of data, repeating the
mechanical process and slowing performance.
Response Time
Latency and response time are often confused with each other and are incorrectly used interchangeably.
Response time is the total amount of time it takes to respond to a request for service. It is the sum of the
service time and wait time. The service time is the time it takes to do the actual work requested. The wait
time is how long the request had to wait in a queue before being serviced, and it can vary greatly depending
on the amount of IO. Transmission time gets added to response time when the request and response have
to travel over a network, and varies with distance. Mainframe response time metrics for DASD include IOSQ,
PEND, CONN, and DISC. Since SSD technology is capable of handling significantly more IOPS than HDD, the
time spent waiting in a queue is typically shorter for SSD than HDD, leading to better response times for
SSD. When SSD is coupled with Gen 6 Fibre Channel, studies have demonstrated a 71 percent improvement
in response time.
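The relationship between these components can be sketched in a few lines. The function and the numbers below are illustrative assumptions, not measured values; the point is that SSD improves response time mainly by shrinking the queue wait, even before service time is considered.

```python
def response_time_ms(service_ms, wait_ms, transmission_ms=0.0):
    """Response time = service time + queue wait time, plus transmission
    time when the request and response travel over a network."""
    return service_ms + wait_ms + transmission_ms

# Hypothetical figures: identical workload, HDD vs. SSD back end.
hdd = response_time_ms(service_ms=5.0, wait_ms=8.0)   # deep queue on HDD
ssd = response_time_ms(service_ms=0.2, wait_ms=0.1)   # shallow queue on SSD
print(hdd, ssd)
```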
Throughput
Throughput (MB/sec) is the data transfer speed. It is the measure of data volume transferred over time—
the amount of data that can be pushed or pulled through a system per second. It is simply the product of the
number of IOPS and the IO size. As a performance metric, throughput matters most for large block-size
data traffic. In a mainframe environment, these would be batch-type processing and data warehousing
applications. Batch processing handles requests that are cached (prestored) and then executed all at the
same time. An example of batch processing is the reconciliation of a bank’s ATM transactions for a given
day being compared against all the customers’ bank records, and then reporting the ATM transaction
records at a time outside normal business hours. Higher throughput capabilities, such as Gen 6 Fibre
Channel’s 32 Gbps, will lead to shortened batch windows, which can be invaluable to many large IBM z
Systems customers (the infamous race to the sunrise). When IBM introduced FICON Express16S channels
in 2015, one of the primary cited benefits compared to FICON Express8S channels was an average 32
percent reduction in the elapsed time (wall clock) of the batch window.
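Since throughput is just IOPS multiplied by the I/O size, the relationship is easy to check with a quick worked example (the helper is hypothetical; the 300,000 IOPS at 4,000-byte blocks mirrors the typical zHPF benchmark figures discussed in this chapter):

```python
def throughput_mb_per_s(iops, io_size_bytes):
    """Throughput (MB/s) as the product of IOPS and I/O transfer size."""
    return iops * io_size_bytes / 1_000_000

# 300,000 IOPS of 4,000-byte blocks works out to 1,200 MB/s
print(throughput_mb_per_s(300_000, 4_000))
```

This also shows why throughput matters most for large block sizes: at the same IOPS, batch-style 27 KB blocks move nearly seven times the data of 4 KB OLTP blocks.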
IOPS
For a modern mainframe environment, throughput is secondary in importance to a storage device’s
IOPS capability. Flash storage/SSD technology is capable of significantly more IOPS than spinning disk
technology. Mainframes are primarily used for Online Transaction Processing (OLTP) workloads. Transaction
processing consists of large numbers of small block-size data IO transactions, often seen in banking,
ATMs, order processing, reservation systems, and financial transactions. The most common block-size
traffic seen in today’s mainframe environment is 4,000 bytes, which coincidentally is used for most storage
performance benchmarking tests, as well as IBM z Systems FICON channel performance testing.
The z14 family of mainframes uses the latest industry-standard Peripheral Component Interconnect Express
Generation 3 (PCIe Gen3) technology for its IO subsystem (IOS). It also supports 50 percent more IO devices
per FICON and FCP channel (32,000 IO devices per channel), with a maximum capacity of 320 FICON
channels per z14. A FICON Express16S+ channel is capable of 300,000 z High-Performance FICON (zHPF)
IOPS running 4,000-byte block-size traffic typical of most OLTP workloads. In sum, the z14 processors are
I/O “monsters.” They were designed to handle massive amounts of IO due to the increases in IO demands
brought about by what IBM refers to as CAMS (Cloud, Analytics, Mobile, and Social) workloads. For many
end users, mobile workloads and their IO requirements alone are driving the need for flash storage attached
to z14.
Brocade Gen 6 technology provides a significant leap in IOPS capabilities. The Condor 4 ASIC is capable
of handling 33 percent more IOPS than the prior generation Condor 3 ASIC used in Brocade Gen 5 SAN
technology. Brocade Gen 5 FICON SAN technology is capable of outstanding performance in a z14 flash
storage architecture. Brocade Gen 6 FICON SAN technology takes this performance to a whole new level and
will allow flash/SSD to reach its full performance potential.
The enterprise computing space has been at the forefront of the flash storage revolution. Applications
that are performance-critical have been, or are being, moved from traditional spinning disk DASD arrays
to flash storage arrays. This movement started with hybrid arrays, but once mainframe storage array
vendors such as EMC, HDS, and IBM brought all flash arrays to the market, end users began migrating to
the all-flash array technology. The reduction in latency and improved response times that can be realized
by implementing flash storage array technology allow for faster processing of transactions, or more
transactions processed for a given time interval. For some industry verticals, this can translate directly into
more revenue. A TABB Group study from July 2010 determined that: “A 1-millisecond advantage in a trading
application can be worth US$100 million a year to a brokerage firm.”
Improved VM Support
The IBM Z has up to 141 cores using the world’s fastest commercial processor, running at 5.0 GHz, to
enable high performance coupled with massive scaling. The z14 and the IBM LinuxONE Emperor systems
take scalability for VMs to a completely new level and are capable of supporting up to 8,000 virtual servers.
When connected to flash storage via FCP channels and Brocade SAN hardware, such as Brocade X6
Directors, the z14 and the LinuxONE Emperor support Node Port ID Virtualization (NPIV). This support of
NPIV has increased significantly with z13 and LinuxONE Emperor, doubling to support 64 virtual images
per FCP channel from the prior support of 32 virtual images per FCP channel. The IO subsystem of these
IBM servers is built on Peripheral Component Interconnect Express (PCIe) Gen 3 technology, which can
have up to 40 PCIe Gen 3 fanouts and integrated coupling adapters per system at 16 GBps each (gigabytes
per second, not gigabits). The z14 also supports 32 TB of available RAM (3x its predecessor). This massive
amount of memory, as well as server and IO virtualization, allows VMs and applications to run at their full,
unconstrained potential. Gen 6 Fibre Channel is ideal for these environments and is capable of supporting
the greater virtualization levels at sustained high levels of performance, enabling full utilization of the z14
and LinuxONE Emperor virtualization capabilities.
In the future, zLinux environments connected to storage via Brocade Fibre Channel technology will be able
to gain insight into application performance using Brocade Fabric Vision VM Insight capabilities.
Support for Non-Volatile Memory Express over Fabrics using Fibre Channel
Non-Volatile Memory Express (NVMe) is a new and innovative method of accessing storage media, and
has emerged as the new storage connectivity platform that will drive massive performance gains. It is
ideal for flash/SSD. Applications will see better random and sequential performance by reducing latency
and enabling much more parallelism through an optimized PCIe interface purpose-built for solid-state
storage. The momentum behind NVMe has been increasing since it was introduced in 2011. In fact, NVMe
technology is expected to improve along two dimensions over the next couple of years: improvements in
latency and the scaling up of the number of NVMe devices in large solutions.
In 2014, the Fibre Channel Industry Association (FCIA) announced a new working group within the INCITS
T11 committee (responsible for Fibre Channel standards) to align NVMe with Fibre Channel as part of the
NVM Express over Fabrics initiative. This was an important evolution because it kept Fibre Channel at the
forefront of storage innovation. FC-NVMe defines a common architecture that supports a range of storage
networking fabrics for NVMe block storage protocol over a storage networking fabric. This includes enabling
a front-side interface into storage systems, scaling out to large numbers of NVMe devices, and extending the
distance within a data center over which NVMe devices and NVMe subsystems can be accessed.
FC-NVMe offers compelling performance that is synergistic with the SSD and mainframe technologies
that demand it. Gen 6 Fibre Channel is NVMe-ready and offers the ideal connectivity for high-performance
mainframe and flash storage.
Current technology trends point to an arms race of sorts, or perhaps the beginning of a new phase of an
ongoing one: a broad advance in storage and I/O technology in the quest for lower latency.
What is latency?
Earlier in the chapter we briefly touched on this topic. Now let’s examine it in greater detail.
In general, latency is the period of time that one component in a system is spinning its wheels waiting for
another component. It is a time interval between the stimulation and response, or, from a more general
point of view, a time delay between the cause and the effect of some physical change in the system being
observed. In computing, latency describes some type of delay. It typically refers to delays in transmitting or
processing data, which can be caused by a wide variety of factors. If latency is the time interval between
the start and completion of an operation, a related term can also be defined: jitter. Simply stated, jitter
is the amount of variation in a series of latency measurements. As a rule of thumb, a system with consistent
latency (low jitter) is viewed as more dependable than one with high jitter. In a mainframe I/O
environment we would like to see consistently low latency (low latency with minimal jitter).
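One simple way to quantify jitter is as the standard deviation of a series of latency samples. This is a sketch using one common convention (standard deviation is not the only definition of jitter in use), with made-up sample values:

```python
import statistics

def jitter(latencies_us):
    """Jitter as the population standard deviation of latency samples (µs)."""
    return statistics.pstdev(latencies_us)

consistent = [20, 21, 20, 22, 21]  # tight spread: low jitter, dependable
erratic    = [5, 60, 12, 45, 8]    # wide spread: high jitter
print(jitter(consistent) < jitter(erratic))
```

Two systems can have the same average latency while delivering very different predictability, which is exactly what the jitter figure exposes.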
Components/sources of latency
There are three primary sources or components of latency in a server I/O environment: computational/
processing delays, transmission delays, and networking delays.
Computational/processing delays are delays due to the server/CPU, storage arrays/devices, etc. In a
mainframe storage environment, a classic example is the seek time of spinning disk. Another example,
which is very relevant when one considers IBM's recent development of zHyperLinks, is the additional
delay that the traditional interrupt-driven, asynchronous-completion I/O model incurs when paired with the
latest flash storage technology such as IBM’s HPFE2 or emerging technologies such as 3D XPoint. This can
also include latency due to the transmission protocol.
Transmission delays refer to the limitations imposed on us by fibre optics and the laws of physics. A good
physics rule of thumb: fibre optic network transmission occurs at approximately two-thirds the speed of light
(five microseconds/km). As we increase distance, transmission delays increase.
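The five-microseconds-per-kilometer rule makes transmission delay easy to estimate (the helper below is illustrative; synchronous operations pay the delay in both directions, hence the round-trip default):

```python
def propagation_delay_us(distance_km, round_trip=True):
    """Propagation delay in fiber: light travels at ~2/3 c,
    i.e. about 5 microseconds per kilometer."""
    one_way = distance_km * 5.0
    return one_way * 2 if round_trip else one_way

# A 100 km synchronous replication link adds ~1,000 µs (1 ms) per exchange
print(propagation_delay_us(100))
```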
Networking delays refer to things such as optical signal strength decreasing over distance, the number of
networking devices being traversed, and the internal latency of those networking devices. For example, with
a Brocade Gen 6 FICON director, the latency from the ingress port to the egress port averages
two microseconds.
The internal architecture of high-end storage arrays today, in terms of the internal bus technology, is
primarily PCIe Gen 3 until you get to the interfaces to the actual devices themselves. The hybrid arrays
will typically use the same device adapter/controller for connecting to both the spinning disk and flash
devices. The technology used here is the same technology that has been used for spinning disk arrays for
years: SATA/AHCI, SAS Gen 2/Gen 3, or Fibre Channel. The current generations of all flash arrays, with
one exception, are also still doing this. As we get faster versions of PCIe (Gen 4 is now available) internal
to servers and storage array bus technology, and faster connectivity between server and storage, these
older architectures relying on SATA and SAS to connect to the end devices are rapidly going to become a
bottleneck. SATA and SAS have the limitation of only one queue. Their throughput is also limited (SAS Gen 3
is 12 Gbps).
Rapidly emerging next-generation non-volatile Storage Class Memory (SCM) technologies such as 3D XPoint
and resistive random access memory (RRAM) offer latency improvements of up to 1,000 times over existing
NAND flash technology. They will exacerbate the aforementioned device connectivity bottleneck. To solve
this, what is needed is device interface technology that bypasses/eliminates the traditional device adapter/
controller. These interfaces use NVMe over PCIe, taking advantage of the massive parallelism
offered by NVMe interface technology (64K queues).
This new methodology is used when an I/O is determined to be “Synch I/O” eligible; it effectively bypasses
the channel subsystem and IOS and is driven directly from the zHyperLink connection.
The DS8880 side of the connection utilizes a PEX8733 PCIe switch that contains a direct memory access
(DMA) engine. In essence, the internal PCIe Gen3 I/O bus of a z14 is connected directly to the PCIe Gen3
I/O bus of the DS8880. The latency associated with the traditional channel program/protocol/interrupt
and FICON/FCP adapter cards is removed. The two microseconds of latency due to transiting FICON
switching devices is also removed, but when compared to the latency reduction achieved by the change to
synchronous I/O, those two microseconds are not the significant component of the reduction.
zHyperLinks is an impressive technology, without a doubt, but it does not eliminate the need for FICON SAN.
As IBM states, zHyperLinks still requires FICON channel connectivity, and FICON SAN for distance and for
avoiding direct-attached FICON architectures. zHyperLinks is restricted to a 150 m distance, the IBM
DS8880, the z14, and specific workloads/access methods. It also functions best with read cache hits. It
will be interesting to see whether EMC and HDS license the technology from IBM, or whether they have their
own competing technology under development. Given the information presented in the IBM Redbooks on the
subject, IBM chose to use many off-the-shelf components in designing the associated hardware, so perhaps
HDS and EMC will develop something comparable.
In the open systems world, NVMe-FC (NVMe over Fibre Channel) appears to be taking off. NVMe-FC is now
a fully supported T11 standard, and Gen 6 (32 Gbps) SAN technology ships ready to support NVMe-FC from
day one. Given that Linux on Z and IBM’s LinuxONE platforms compete against open systems servers, if
those servers and their storage start using NVMe-FC to connect to next-generation SCM arrays with NVMe
interfaces to the storage devices, they will likely be able to challenge Linux on Z/LinuxONE in terms of
performance and latency. What will IBM do with the FCP channel connectivity for its Linux on Z/LinuxONE
platforms in response? Will we see NVMe-FC supported on the FCP channel cards (which are the FICON
channel cards as well)?
Researchers at Intel wrote a paper three years ago comparing the performance of interrupt-driven
asynchronous I/O to polling-based synchronous I/O in a Linux environment accessing NAND flash storage
devices with NVMe interfaces. The low latency of the flash devices made up for the latency typically seen in
polling, because the server did not have to wait for the storage (no spinning disk, no seek time). Could such
a model for I/O be developed for IBM Z, allowing synchronous I/O over NVMe running over Fibre Channel
(NVMe-FICON?) that would solve the 150 m distance limitation, and make use of the existing FICON SAN
infrastructure in place in most mainframe environments?
And speaking of SAN, Gen 7 Fibre Channel will be here before you know it. Gen 6 technology saw 35 percent
improvements in ASIC frame switching capabilities, PCIe Gen 3 buses, and other major improvements.
Why is the connectivity so important when talking about latency reduction? It all comes down to Bell’s Law
of Waiting: All CPUs wait at the same speed.
Chapter Summary
The enterprise computing community represents the unique sector of the storage networking market that
propels the most critical systems in the world. These systems must operate cohesively and under extremely
stringent conditions in order to provide the everyday services and operations enjoyed by billions.
Brocade and the IBM Z IO team share an extraordinary relationship that has yielded countless
enhancements to the world of enterprise computing. This nearly 30-year bond has stood the test of time,
as well as the turmoil of the high-tech marketplace, to produce reliable solutions for these critical systems.
At a time when most companies are boasting of alliances around the technology “flavor of the day,”
Brocade and IBM z Systems have delivered the technology for the ages. Brocade builds on that 30 years of
mainframe leadership to deliver the industry’s highest performance and most reliable and scalable FICON
infrastructure. With seamless FICON connectivity and support for innovative features that only Brocade
can offer—including ClearLink D_Port, CUP Diagnostics, Port Decommissioning, and ISL encryption—
organizations can achieve the full potential from new flash storage and IBM z14 and future mainframe
technology.
The modern IBM mainframe, also known as IBM Z, has a distinguished 54-year history as the leading
platform for reliability, availability, serviceability, and scalability. It has transformed business and delivered
innovative, game-changing technology that makes the extraordinary possible, and has improved the way
the world works. For over 25 of those years, Brocade, A Broadcom Inc. Company, has been the leading
networking company in the IBM mainframe ecosystem and has provided non-stop networks for IBM
mainframe customers. From parallel channel extension to ESCON, FICON, long-distance FCIP connectivity,
SNA/IP, and IP connectivity, Brocade has been there with IBM and their mutual customers.
The second edition of this book, written by the leading mainframe industry and technology expert from
Brocade, discusses mainframe SAN and network technology, best practices, and how to apply this
technology in your mainframe environment. This edition includes a great deal of new information on the
latest Gen 6 technology from Brocade, the new IBM z14 mainframe and its I/O technology, and more.