
INFINIBAND

CS 708 Seminar

NEETHU RANJIT (Roll No. 05088)


B. Tech. Computer Science & Engineering

College of Engineering Kottarakkara


Kollam 691 531
Ph: +91.474.2453300
http://www.cek.ihrd.ac.in
cekottarakkara@ihrd.ac.in
Certificate

This is to certify that this report titled InfiniBand is a bonafide record


of the CS 708 Seminar work done by Miss. NEETHU RANJIT, Reg.
No. 10264042, Seventh Semester B. Tech. Computer Science & Engineering
student, under our guidance and supervision, in partial fulfillment of the
requirements for the award of the degree, B. Tech. Computer Science and
Engineering of Cochin University of Science & Technology.

October 16, 2008

Guide Coordinator & Dept. Head

Mr Renjith S.R Mr Ahammed Siraj K K


Lecturer Asst. Professor
Dept. of Computer Science & Engg. Dept. of Computer Science & Engg.
Acknowledgments
I express my wholehearted thanks to our respected Principal Dr Jacob Thomas and to Mr Ahammed Siraj sir, Head of the Department, for providing me with the guidance and facilities for the seminar. I wish to express my sincere thanks to Mr Renjith sir, Lecturer in the Computer Science Department and my guide, for his timely advice during the course of my seminar. I thank all faculty members of College of Engineering Kottarakkara for their cooperation in completing my seminar. My sincere thanks to all those well wishers and friends who have helped me during the course of the seminar work and have made it a great success. Above all I thank the Almighty Lord, the foundation of all wisdom, for guiding me step by step throughout my seminar. Last but not the least, I would like to thank my parents for their moral support.

NEETHU RANJIT
Abstract
InfiniBand is a powerful new architecture designed to support I/O connectivity for the Internet infrastructure. InfiniBand is supported by all major OEM server vendors as a means to expand and create the next generation I/O interconnect standard in servers. For the first time, a high-volume, industry-standard I/O interconnect extends the role of traditional in-the-box buses beyond the physical connector. It provides a comprehensive silicon, software, and system solution, and this report gives an overview of the layered protocol and InfiniBand's management infrastructure. Reflecting the comprehensive nature of the architecture, the InfiniBand I/O specification ranges from industry-standard electrical interfaces and mechanical connectors to well-defined software and management services. InfiniBand is unique in providing connectivity in a way previously reserved only for traditional networking. This unification of I/O and system area networking requires a new architecture domain. Underlying this major transition is InfiniBand's superior ability to support the Internet requirement for RAS: Reliability, Availability, and Serviceability. The InfiniBand Architecture (IBA) is an industry-standard architecture for server I/O and interprocessor communication. IBA enables QoS (Quality of Service), which it supports with certain mechanisms; these mechanisms are basically service levels, virtual lanes, and table-based arbitration of virtual lanes. A formal model exists to manage InfiniBand's arbitration tables to provide QoS; according to this model, each application needs a sequence of entries in the IBA arbitration tables based on its requirements. These requirements are related to the mean bandwidth needed and the maximum latency tolerated by the application. Finally, InfiniBridge is a channel adapter architecture for InfiniBand which implements the packet switching features of InfiniBand.

Contents

1 INTRODUCTION
2 INFINIBAND ARCHITECTURE
3 COMPONENTS OF INFINIBAND
  3.1 HCA and TCA Channel adapters
  3.2 Switches
  3.3 Routers
4 INFINIBAND BASIC FABRIC TOPOLOGY
5 IBA Subnet
  5.1 Links
  5.2 Endnodes
6 FLOW CONTROL
7 INFINIBAND SUBNET MANAGEMENT AND QoS
8 REMOTE DIRECT MEMORY ACCESS (RDMA)
  8.1 Comparing a Traditional Server I/O and RDMA-Enabled I/O
9 INFINIBAND PROTOCOL STACK
  9.1 Physical Layer
  9.2 Link Layer
  9.3 Network Layer
  9.4 Transport Layer
10 COMMUNICATION SERVICES
  10.1 Communication Stack: InfiniBand support for the Virtual Interface Architecture (VIA)
11 INFINIBAND FABRIC VERSUS SHARED BUS
12 INFINIBRIDGE
  12.1 Hardware transport performance of InfiniBridge
13 INFINIBRIDGE CHANNEL ADAPTER ARCHITECTURE
14 VIRTUAL OUTPUT QUEUEING ARCHITECTURE
15 FORMAL MODEL TO MANAGE INFINIBAND ARBITRATION TABLES TO PROVIDE QUALITY OF SERVICE (QoS)
  15.1 THREE MECHANISMS TO PROVIDE QoS
    15.1.1 Service Level
    15.1.2 Virtual Lanes
    15.1.3 Virtual Arbitration Table
16 FORMAL MODEL FOR THE INFINIBAND ARBITRATION TABLE
  16.0.4 Initial Hypothesis
17 FILLING IN THE VL ARBITRATION TABLE
  17.1 Insertion and elimination in the table
    17.1.1 Example 1
  17.2 Defragmentation Algorithm
  17.3 Reordering Algorithm
  17.4 Global management of the table
18 CONCLUSION
REFERENCES
1 INTRODUCTION
Bus architectures have a tremendous amount of inertia because they dictate the bus interface architecture of semiconductor devices. For this reason, successful bus architectures typically enjoy a dominant position for ten years or more. The PCI bus was introduced to the standard PC architecture in the early 90s and has maintained its dominance with only one major upgrade during that period: from 32-bit/33 MHz to 64-bit/66 MHz. The PCI-X initiative takes this one step further to 133 MHz and seemingly should provide the PCI architecture with a few more years of life. But there is a divergence between what personal computers and servers require.
Throughout the past decade of fast-paced computer development, the traditional Peripheral Component Interconnect architecture has continued to be the dominant input/output standard for most internal back-plane and external peripheral connections. However, these days the PCI bus, with its shared-bus approach, is beginning to lag noticeably. Performance limitations, poor bandwidth, and reliability issues are surfacing within the higher market tiers, especially as the PCI bus is quickly becoming an outdated technology.
Computers are made up of a number of addressable elements (CPU, memory, screen, hard disks, LAN and SAN interfaces, etc.) that use a system bus for communications. As these elements have become faster, the system bus and the overhead associated with data movement between devices (commonly referred to as I/O) have become a gating factor in computer performance. To address the problem of server performance with respect to I/O in particular, InfiniBand was developed as a standards-based protocol that offloads data movement from the CPU to dedicated hardware, thus allowing more CPU cycles to be dedicated to application processing. As a result, InfiniBand, by leveraging networking technologies and principles, provides scalable, high-bandwidth transport for efficient communications between InfiniBand-attached devices.
InfiniBand technology advances I/O connectivity for data center and enterprise infrastructure deployment, overcoming the I/O bottleneck in today's server architectures. Although primarily suited for next-generation server I/O, InfiniBand can also extend to the embedded computing, storage, and telecommunications industries. This high-volume, industry-standard I/O interconnect extends the role of traditional backplane and board buses beyond the physical connector.

Another major bottleneck is the scalability problem with parallel-bus architectures such as the Peripheral Component Interconnect (PCI). As these buses scale in speed, they can't support the multiple network interfaces that system designers require. For example, the PCI-X bus at 133 MHz can only support one slot, and at higher speeds these buses begin to look like point-to-point connections. Mellanox Technologies' InfiniBand silicon product, InfiniBridge, lets system designers construct entire fabrics based on the device's switching and channel adapter functionality.
InfiniBridge implements an advanced set of packet switching, quality of service, and flow control mechanisms. These capabilities support multiprotocol environments with many I/O devices shared by multiple servers. InfiniBridge features include an integrated switch and PCI channel adapter, InfiniBand 1X and 4X link speeds (defined as 2.5 and 10 Gbps), eight virtual lanes, and a maximum transfer unit (MTU) size of up to 2 Kbytes. InfiniBridge also offers multicast support, an embedded subnet management agent, and InfiniPCI for transparent PCI-to-PCI bridging. InfiniBand is an architecture and specification for data flow between processors and I/O devices that promises greater bandwidth and almost unlimited expandability. InfiniBand is hence intended to replace the existing Peripheral Component Interconnect (PCI). Offering throughput of up to 2.5 gigabits per second per 1X link and support for up to 64,000 addressable devices, the architecture also promises increased reliability, better sharing of data between clustered processors, and built-in security. The InfiniBand architecture specification was released by the InfiniBand Trade Association. InfiniBand is backed by top companies in the industry, such as Compaq, Dell, Hewlett-Packard, IBM, Intel, Microsoft, and Sun. Underlying this major I/O transition is InfiniBand's ability to provide the unique feature of Quality of Service; many mechanisms exist to provide it, and one such mechanism is the formal model for managing the arbitration tables.

2 INFINIBAND ARCHITECTURE
InfiniBand is a switched, point-to-point interconnect for data centers based on a 2.5-Gbps link speed, scaling up to 30 Gbps. The architecture defines a layered hardware protocol (physical, link, network, and transport layers) and a software layer to support fabric management and low-latency communication between devices.
InfiniBand provides transport services for the upper-layer protocols and supports flow control and Quality of Service to provide ordered, guaranteed packet delivery across the fabric. An InfiniBand fabric may comprise a number of InfiniBand subnets that are interconnected using InfiniBand routers, where each subnet may consist of one or more InfiniBand switches and InfiniBand-attached devices.
The InfiniBand standard defines Reliability, Availability, and Serviceability from the ground up, making the specification efficient to implement in silicon yet able to support a broad range of applications. InfiniBand's physical layer supports a wide range of media by using a differential serial interconnect with an embedded clock. This signaling supports printed circuit board, backplane, copper, and fiber links; it leaves room for further growth in speed and media types.
The physical layer implements 1X, 4X, and 12X links by byte striping over multiple links. An InfiniBand system area network has four basic system components that interconnect using InfiniBand links, as Fig 1 shows: The host channel adapter (HCA) terminates a connection for a host node. It includes hardware features to support high-performance memory transfers into CPU memory.
The target channel adapter (TCA) terminates a connection for a peripheral node. It defines a subset of HCA functionality and can be optimized for embedded applications.
The switch handles link-layer packet forwarding. A switch does not consume or generate packets other than management packets.
The router sends packets between subnets using the network layer. InfiniBand routers divide InfiniBand networks into subnets and do not consume or generate packets other than management packets. A subnet manager runs on each subnet and handles device and connection management tasks. A subnet manager can run on a host or be embedded in switches and routers. All system components must include a subnet management agent that handles communication with the subnet manager.

Figure 1: INFINIBAND ARCHITECTURE

4
3 COMPONENTS OF INFINIBAND
The main components in the InfiniBand architecture are:

3.1 HCA and TCA Channel adapters


HCAs are present in servers or even desktop machines and provide an interface that is used to integrate the InfiniBand fabric with the operating system. TCAs are present on I/O devices such as a RAID subsystem or a JBOD subsystem. Host and target channel adapters present an interface to the layers above them that allows those layers to generate and consume packets. In the case of a server writing a file to a storage device, the host is generating the packets that are then consumed by the storage device. Each channel adapter has one or more ports; a channel adapter with more than one port may be connected to multiple switch ports.

3.2 Switches
Switches simply forward packets between two of their ports based on the established routing table and the addressing information stored in the packets. A collection of end nodes connected to one another through one or more switches forms a subnet. Each subnet must have at least one subnet manager that is responsible for the configuration and management of the subnet.

Figure 2: InfiniBand Switch

3.3 Routers
Are like switches in the respect that they simply forward packets be-
tween their ports. The difference between routers and the switches is
that a router is used to interconnect two or more subnets to form a
multidomain system area network. Within a subnet each port is as-
signed a unique identifier by the subnet manager called the LOCAL ID
or LID. In addition to the LID each port is assigned a globally unique
identifier called the GID. Main feature of the InfiniBand architecture
is that is not available in the current shared bus I/0 architecture is the
ability to partition the ports within the fabric that can communicate
with one another. This is useful for partitioning the available storage
across one or more servers for management reasons.
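To make the addressing and partitioning ideas above concrete, the following minimal Python sketch models a subnet manager assigning LIDs to ports and checking whether two ports may communicate. The class names, fields, and values are invented for this illustration only; they do not correspond to any real InfiniBand management API.

```python
# Minimal sketch (not InfiniBand management code): a toy model of a subnet
# manager assigning LIDs to channel-adapter ports and enforcing partitions.
from itertools import count

class Port:
    def __init__(self, node_name, guid):
        self.node_name = node_name
        self.guid = guid          # globally unique identifier (the GID is derived from it)
        self.lid = None           # local identifier, assigned by the subnet manager
        self.pkeys = set()        # partitions this port belongs to

class SubnetManager:
    def __init__(self):
        self._next_lid = count(1)  # hand out LIDs sequentially in this toy model

    def assign_lid(self, port):
        port.lid = next(self._next_lid)

    def join_partition(self, port, pkey):
        port.pkeys.add(pkey)

    def may_communicate(self, a, b):
        # Ports can talk only if they share at least one partition key.
        return bool(a.pkeys & b.pkeys)

sm = SubnetManager()
server = Port("server-hca", guid=0x0002C90000000001)   # hypothetical GUIDs
storage = Port("raid-tca", guid=0x0002C90000000002)
for p in (server, storage):
    sm.assign_lid(p)
sm.join_partition(server, pkey=0x8001)
sm.join_partition(storage, pkey=0x8001)
print(sm.may_communicate(server, storage))  # True: both joined the same partition
```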

Figure 3: System Network of Infiniband

4 INFINIBAND BASIC FABRIC TOPOLOGY

InfiniBand is a high-speed, serial, channel-based, switch-fabric, message-passing architecture that can have server, Fibre Channel, SCSI RAID, router, and other end nodes, each with its own dedicated fat pipe. Each node can talk to any other node in a many-to-many configuration. Redundant paths can be set up through an InfiniBand fabric for fault tolerance, and InfiniBand routers can connect multiple subnets. The figure below shows the simplest configuration of an InfiniBand installation, where two or more nodes are connected to one another through the fabric. A node represents either a host device such as a server or an I/O device such as a RAID subsystem. The fabric itself may consist of a single switch in the simplest case or a collection of interconnected switches and routers. Each connection between nodes, switches, and routers is a point-to-point, serial connection.

Figure 4: InfiniBand Fabric Topology

Figure 5: IBA SUBNET

5 IBA Subnet
The smallest complete IBA unit is a subnet, illustrated in the figure. Multiple subnets can be joined by routers (not shown) to create large IBA networks. The elements of a subnet, as shown in the figure, are endnodes, switches, links, and a subnet manager. Endnodes, such as hosts and devices, send messages over links to other endnodes; the messages are routed by switches. Routing is defined, and subnet discovery performed, by the subnet manager. Channel adapters (CAs) (not shown) connect endnodes to links.

5.1 Links
IBA links are bidirectional point-to-point communication channels, and may be either copper or optical fibre. The signalling rate on all links is 2.5 Gbaud in the 1.0 release; later releases will undoubtedly be faster. Automatic training sequences are defined in the architecture that will allow compatibility with later, faster speeds. The physical links may be used in parallel to achieve greater bandwidth. The different link widths are referred to as 1X, 4X, and 12X. The basic 1X copper link has four wires, comprising a differential signaling pair for each direction. Similarly, the 1X fibre link has two optical fibres, one for each direction. Wider widths increase the number of signal paths as implied. There is also a copper backplane connection allowing dense structures of modules to be constructed. The 1X size allows up to six ports on the faceplate of the standard (smallest) size IBA module. Short-reach (multimode) optical fibre links are provided in all three widths; while distances are not specified (as explained earlier), it is expected that they will reach 250 m for 1X and 125 m for 4X and 12X. Long-reach (single-mode) fibre is defined in the 1.0 IBA specification only for 1X widths, with an anticipated reach of up to 10 km.

5.2 Endnodes
IBA endnodes are the ultimate sources and sinks of communication in IBA. They may be host systems or devices (network adapters, storage subsystems, etc.). It is also possible that endnodes will be developed that are bridges to legacy I/O busses such as PCI, but whether and how that is done is vendor-specific; it is not part of the InfiniBand architecture. Note that, as a communication service, IBA makes no distinction between these types; an endnode is simply an endnode. So all IBA facilities may be used equally to communicate between hosts and devices; or between hosts and other hosts, like normal networking; or even directly between devices, e.g., direct disk-to-tape backup without any load imposed on a host. IBA defines several standard form factors for devices used as endnodes: standard, wide, tall, and tall-wide. The standard form factor is approximately 20x100x220 mm; wide doubles the width, and tall doubles the height.

Figure 6: Flow control in InfiniBand

6 FLOW CONTROL
InfiniBand defines two levels of credit-based flow control to manage congestion: link level and end-to-end. Link-level flow control applies back pressure to traffic on a link, while end-to-end flow control protects against buffer overflow at endpoint connections that might be multiple hops away. Each receiving end of a link/connection supplies credits to the sending device to specify the amount of data that the device can reliably receive. Sending devices do not transmit data unless the receiver advertises credits indicating available receive buffer space. The link and connection protocols have built-in credit passing between each device to guarantee reliable flow control operation. InfiniBand handles link-level flow control on a per-quality-of-service-level (virtual lane) basis. InfiniBand has a unidirectional 2.5-Gbps (250 MB/sec, using the 10-bits-per-data-byte encoding called 8B/10B, similar to 3GIO) wire-speed connection, and uses either one differential signal pair per direction, called 1X, or 4 (4X) or 12 (12X) pairs for bandwidth up to 30 Gbps per direction (12 x 2.5 Gbps). Bidirectional throughput with InfiniBand is often expressed in MB/sec, yielding 500 MB/sec for 1X, 2 GB/sec for 4X, and 6 GB/sec for 12X respectively.
Each bidirectional 1X connection consists of four wires, two for send and two for receive. Both fiber and copper are supported. Copper can be in the form of traces or cables, and fiber distances between nodes can be as far as 300 meters and more. Each InfiniBand subnet can host up to 64,000 nodes.
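The credit-based scheme described above can be illustrated with a small sketch. This is a toy model, not InfiniBand hardware behaviour: the buffer size, the method names, and counting credits directly in bytes are assumptions made for the example; a real link accounts credits per virtual lane and exchanges them in flow-control packets.

```python
# Illustrative sketch of credit-based link-level flow control on one virtual lane.
# The receiver advertises credits (free buffer space); the sender transmits only
# while credits remain, so packets are never dropped for lack of buffering.

class ReceiverVL:
    def __init__(self, buffer_bytes):
        self.free = buffer_bytes

    def advertise_credits(self):
        return self.free            # credits = bytes it can safely accept

    def accept(self, nbytes):
        assert nbytes <= self.free, "sender violated its credit limit"
        self.free -= nbytes

    def drain(self, nbytes):        # the node consumes data, freeing buffer space
        self.free += nbytes

class SenderVL:
    def __init__(self):
        self.credits = 0

    def update_credits(self, advertised):
        self.credits = advertised

    def send(self, packet_bytes, rx):
        if packet_bytes > self.credits:
            return False            # must wait: no credits, so no packet loss
        rx.accept(packet_bytes)
        self.credits -= packet_bytes
        return True

rx, tx = ReceiverVL(buffer_bytes=8192), SenderVL()
tx.update_credits(rx.advertise_credits())
print(tx.send(2048, rx))   # True: credits were available
print(tx.send(8192, rx))   # False: would overflow the receive buffer
```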

7 INFINIBAND SUBNET MANAGEMENT AND QoS

InfiniBand supports two levels of management packets: subnet management and the general services interface (GSI). High-priority subnet management packets (SMPs) are used to discover the topology of the network, attached nodes, and so on, and are transported within the high-priority VLane (which is not subject to flow control). The low-priority GSI management packets handle management functions such as chassis management and other functions not associated with subnet management. These services are not critical to subnet management, so GSI management packets are neither transported within the high-priority VLane nor subject to flow control.
InfiniBand supports quality of service at the link level through virtual lanes. An InfiniBand virtual lane is a separate logical communication link that shares, with other virtual lanes, a single physical link. Each virtual lane has its own buffer and flow-control mechanism implemented at each port in a switch. InfiniBand allows up to 15 general-purpose virtual lanes plus one additional lane dedicated to management traffic. Link-layer quality of service comes from isolating traffic congestion to individual virtual lanes. For example, the link layer will isolate isochronous real-time traffic from non-real-time data traffic; that is, isolate real-time voice or multimedia streams from Web or FTP data traffic. The system manager can assign a higher virtual-lane priority to voice traffic, in effect scheduling voice packets ahead of congested data packets in each link buffer encountered in the voice packets' end-to-end path. Thus, the voice traffic will still move through the fabric with minimal latency.
InfiniBand presents a number of transport services that provide different characteristics. To ensure reliable, sequenced packet delivery, InfiniBand uses flow control and service levels in conjunction with VLanes to achieve end-to-end QoS. InfiniBand VLanes are logical channels that share a common physical link, where VLane 15 has the highest priority and is used exclusively for management traffic, and VLane 0 the lowest. The concept of a VLane is similar to that of the hardware queues found in routers and switches.
For applications that require reliable delivery, InfiniBand supports reliable delivery of packets using flow control. Within an InfiniBand network, the receivers on a point-to-point link periodically transmit information to the upstream transmitter to specify the amount of data that can be transmitted without data loss, on a per-VLane basis. The transmitter can then transmit data up to the amount of credits that are advertised by the receiver. If no buffer credits exist, data cannot be transmitted. The use of credit-based flow control prevents packet loss that might result from congestion. Furthermore, it enhances application performance, because it avoids packet retransmission. For applications that do not require reliable delivery, InfiniBand also supports unreliable delivery of packets (i.e., they may be dropped with little or no consequence) that are not subject to flow control; some management traffic, for example, does not require reliable delivery. At the InfiniBand network layer, the GRH contains an 8-bit traffic class field. This value is mapped to a 4-bit service level field within the LRH to indicate the service level that the packet is requesting from the InfiniBand network. The HCA matches the packet's service level against a service level-to-VLane table, which has been populated by the subnet manager, and then transmits the packet on the VLane associated with that service level. As the packet traverses the network, each switch matches the service level against the packet's egress port to identify the VLane within which the packet should be transported.
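The following short sketch illustrates the service-level-to-VLane lookup just described. The mapping contents shown here are hypothetical; in a real fabric the subnet manager programs an SLtoVL table per port, and VLane 15 is reserved for management traffic.

```python
# Sketch of the per-port service-level (SL) to virtual-lane (VL) lookup.
# Table contents are invented; the subnet manager would populate them.

SL_TO_VL = {          # service level -> data virtual lane
    0: 0,             # e.g. bulk Web/FTP traffic
    4: 2,             # e.g. storage traffic
    7: 5,             # e.g. isochronous voice/video, given a higher-priority VL
}

def select_vl(service_level, num_data_vls):
    """Map a packet's 4-bit SL to the VL used on this output port."""
    vl = SL_TO_VL.get(service_level, 0)
    return vl if vl < num_data_vls else 0   # fall back if this port has fewer data VLs

print(select_vl(7, num_data_vls=8))  # -> 5
print(select_vl(7, num_data_vls=4))  # -> 0 on a switch supporting fewer data VLs
```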

Figure 7: RDMA Hardware

8 REMOTE DIRECT MEMORY ACCESS (RDMA)

One of the key problems with server I/O is the CPU overhead associated with data movement between memory and I/O devices such as LAN and SAN interfaces. InfiniBand solves this problem by using RDMA to offload data movement from the server CPU to the InfiniBand host channel adapter (HCA). RDMA is an extension of hardware-based Direct Memory Access (DMA) capabilities that allows the CPU to delegate data movement within the computer to the DMA hardware: the CPU simply indicates the memory location where the data associated with a particular process resides and the memory location the data is to be moved to. Once the DMA instructions are sent, the CPU can process other threads while the DMA hardware moves the data. RDMA extends this by enabling data to be moved from one memory location to another, even if that memory resides on another device.

8.1 Comparing a Traditional Server I/O and RDMA-Enabled I/O

The process in a traditional server I/O is extremely inefficient because it results in multiple copies of the same data traversing the memory bus, and it also invokes multiple CPU interrupts and context switches.

Figure 8: Traditional Server I/O

By contrast, RDMA, an embedded hardware function of the InfiniBand HCA, handles all communication operations without interrupting the CPU. Using RDMA, the sending device either reads data from or writes data to the target device's user-space memory, thereby avoiding CPU interrupts and multiple data copies on the memory bus, which enables RDMA to significantly reduce the CPU overhead.
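As a rough illustration of the RDMA-enabled path, the sketch below walks through the basic steps: the target registers a buffer and shares its location and remote key, and the initiator then performs a single RDMA write that lands directly in the target's memory. All function and field names here are hypothetical placeholders chosen for the example; this is not the InfiniBand verbs API, and the remote key is simulated.

```python
# Schematic of an RDMA write as described above. Names are placeholders,
# not a real RDMA programming interface.

class MemoryRegion:
    def __init__(self, buf):
        self.buf = buf
        self.rkey = id(buf) & 0xFFFF   # stand-in for a remote access key

def register_memory(buf):
    # Pin/register a buffer with the HCA so hardware can access it directly.
    return MemoryRegion(buf)

def rdma_write(local_data, remote_mr, rkey):
    # The HCA moves bytes from local memory into the remote buffer without
    # interrupting either CPU; here we just simulate the copy.
    assert rkey == remote_mr.rkey, "remote key must match the registered region"
    remote_mr.buf[: len(local_data)] = local_data

# Target side: register a buffer and hand (address, rkey) to the initiator,
# typically via an ordinary out-of-band send/receive exchange.
target_buf = bytearray(16)
target_mr = register_memory(target_buf)

# Initiator side: a single RDMA write places data into the target's memory.
rdma_write(b"block 0 payload", target_mr, target_mr.rkey)
print(bytes(target_buf[:15]))   # b'block 0 payload'
```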

Figure 9: RDMA-Enabled Server I/O

Figure 10: InfiniBand Protocol Stack

9 INFINIBAND PROTOCOL STACK


From a protocol perspective, the InfiniBand architecture consists of four layers: physical, link, network, and transport. These layers are analogous to Layers 1 through 4 of the OSI protocol stack. The InfiniBand architecture is divided into multiple layers, where each layer operates independently of the others.

9.1 Physical Layer


InfiniBand is a comprehensive architecture that defines both electrical and mechanical characteristics for the system. These include cables, receptacles, and copper media; backplane connectors; and hot-swap characteristics. InfiniBand defines three link speeds at the physical layer: 1X, 4X, and 12X. Each individual 1X link is a four-wire serial connection (two wires in each direction) that provides a full-duplex connection at 2.5 Gb/s. This physical layer also specifies the hardware components.

9.2 Link Layer
The link layer (along with the transport layer) is the heart of the InfiniBand architecture. The link layer encompasses packet layout, point-to-point link operations, and switching within a subnet. At the packet communication level, two packet types are specified, for data transfer and network management. The management packets provide operational control over device enumeration, subnet directing, and fault tolerance. Data packets transfer the actual information, with each packet carrying a maximum of four kilobytes of transaction information. Within each specific device subnet, the packet direction and switching properties are directed via a subnet manager using 16-bit local identification addresses. The link layer also provides the Quality of Service characteristics of InfiniBand. The primary consideration is the usage of the Virtual Lane (VL) architecture for interconnectivity. Even though a single IBA data path may be defined at the hardware level, the VL approach allows for 16 logical links. With 15 independent levels (VL0-VL14) and one management path (VL15) available, the ability to configure device-specific prioritization is available. Since management requires the highest priority, VL15 retains the maximum priority. The ability to assert a priority-driven architecture lends not only to Quality of Service but to performance as well. Credit-based flow control is also used to manage data flow between two point-to-point links. Flow control is handled on a per-VL basis, allowing separate virtual fabrics to maintain communication utilizing the same physical media.

9.3 Network Layer


The network layer handles routing of packets from one subnet to another (within a subnet, the network layer is not required). Packets that are sent between subnets contain a Global Route Header (GRH). The GRH contains the 128-bit IPv6 addresses for the source and destination of the packet. The packets are forwarded between subnets through routers based on each device's 64-bit globally unique ID (GUID). The routers modify the LRH with the proper local address within each subnet; the last router in the path replaces the LID in the LRH with the LID of the destination port. When used within a single subnet (which is a likely scenario for InfiniBand system area networks), InfiniBand packets do not require the network-layer information and the associated header overhead.

9.4 Transport Layer
The transport layer is responsible for in-order packet delivery, partitioning, channel multiplexing, and transport services (reliable connection, reliable datagram, unreliable datagram). The transport layer also handles transaction data segmentation when sending, and reassembly when receiving. Based on the Maximum Transfer Unit (MTU) of the path, the transport layer divides the data into packets of the proper size. The receiver reassembles the packets based on the Base Transport Header (BTH), which contains the destination queue pair and packet sequence number. The receiver acknowledges the packets, and the sender receives these acknowledgments and updates the completion queue with the status of the operation. The IBA offers a significant improvement for the transport layer: all functions are implemented in hardware. InfiniBand specifies multiple transport services for data reliability.
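A simple sketch of this segmentation and reassembly behaviour is shown below. It is an illustration of the idea only: the field names are invented, and a real transport also generates acknowledgments and retransmissions, which are omitted here.

```python
# Sketch of transport-layer segmentation and reassembly: a message is split
# into MTU-sized packets, each tagged with a packet sequence number (PSN),
# and reassembled in PSN order at the receiver.

def segment(message: bytes, mtu: int, first_psn: int = 0):
    packets = []
    for i in range(0, len(message), mtu):
        packets.append({"psn": first_psn + i // mtu,     # packet sequence number
                        "payload": message[i:i + mtu]})
    return packets

def reassemble(packets):
    # Deliver in PSN order; a real HCA would also ACK packets and detect gaps.
    ordered = sorted(packets, key=lambda p: p["psn"])
    expected = ordered[0]["psn"]
    for p in ordered:
        assert p["psn"] == expected, "missing packet detected"
        expected += 1
    return b"".join(p["payload"] for p in ordered)

msg = bytes(5000)                       # a 5000-byte message
pkts = segment(msg, mtu=2048)           # 2-Kbyte MTU -> packets of 2048, 2048, 904 bytes
assert reassemble(pkts) == msg
print(len(pkts))                        # 3
```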

10 COMMUNICATION SERVICES
IBA provides several different types of communication services between endnodes:
- Reliable Connection (RC): a connection is established between end nodes, and messages are reliably sent between them. This is optional for TCAs (devices), but mandatory for HCAs (hosts).
- (Unreliable) Datagram (UD): a single-packet message can be sent to an end node without first establishing a connection; transmission is not guaranteed.
- Unreliable Connection (UC): a connection is established between end nodes, and messages are sent, but transmission is not guaranteed. This is optional.
- Reliable Datagram (RD): a single-packet message can be reliably sent to any end node without a one-to-one connection. This is optional.
- Raw IPv6 Datagram and Raw EtherType Datagram (Raw, optional): single-packet unreliable datagram services with all but local transport header information stripped off; this allows packets using non-IBA transport layers to traverse an IBA network, e.g., for use by routers and network interfaces to transfer packets to other media with minimal modification.
In the above, reliably sent means the data is, barring catastrophic failure, guaranteed to arrive in order, checked for correctness, with its receipt acknowledged. Each packet, even those for unreliable datagrams, contains two separate CRCs, one covering data that cannot change (constant CRC) and one that must be recomputed (V-CRC) since it covers data that can change; such change can occur only when a packet moves from one IBA subnet to another, however. Several of these services resemble those of traditional networking; this is intentional, since they provide essentially the same services. However, these are designed for hardware implementation, as required by a high-performance I/O system. In addition, the host-side functions have been designed to allow all service types to be used completely in user mode, without necessarily using any operating system services; this includes RDMA, moving data directly into or out of the memory of an endnode. This, together with user-mode operation, implies that virtual addressing must be supported by the channel adapters, since real addresses are unavailable in user mode. In addition to RDMA, the reliable communication classes also optionally support atomic operations directly against endnode memory. The atomic operations supported are Fetch-and-Add and Compare-and-Swap, both on 64-bit data. Atomics are effectively a variation on RDMA: a combined write and read RDMA, carrying the data.

10.1 Communication Stack: InfiniBand support for the Virtual Interface Architecture (VIA)

The Virtual Interface Architecture is a distributed messaging technology that is both hardware independent and compatible with current network interconnects. The architecture provides an API that can be utilized to provide high-speed and low-latency communications between peers in clustered applications. InfiniBand was developed with the VIA architecture in mind. InfiniBand offloads traffic control from the software client through the use of execution queues. These queues, called work queues, are initiated by the client and then left for InfiniBand to manage. For each communication channel between devices, a Work Queue Pair (WQP: a send and a receive queue) is assigned at each end. The client places a transaction into the work queue (as a Work Queue Entry, WQE), which is then processed by the channel adapter from the queue and sent out to the remote device. When the remote device responds, the channel adapter returns status to the client through a completion queue or event. The client can post multiple WQEs, and the channel adapter's hardware will handle each of the communication requests. The channel adapter then generates a Completion Queue Entry (CQE) to provide status for each WQE in the proper prioritized order. This allows the client to continue with other activities while the transactions are being processed.
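The following toy model sketches this work-queue mechanism: the client posts WQEs and continues working, while the channel adapter (here simulated by a method call) drains the queue and reports CQEs. The class and field names are invented for the example and are not VIA or InfiniBand verbs definitions.

```python
# Toy model of the work-queue mechanism: asynchronous posting of WQEs and
# completion reporting through CQEs.
from collections import deque

class QueuePair:
    def __init__(self):
        self.send_queue = deque()       # posted WQEs awaiting the channel adapter
        self.completion_queue = deque() # CQEs produced by the adapter

    def post_send(self, wqe):
        self.send_queue.append(wqe)     # the client returns immediately

    def channel_adapter_process(self):
        # "Hardware" drains the send queue and generates CQEs in posting order.
        while self.send_queue:
            wqe = self.send_queue.popleft()
            self.completion_queue.append({"wr_id": wqe["wr_id"], "status": "success"})

    def poll_cq(self):
        return list(self.completion_queue)

qp = QueuePair()
qp.post_send({"wr_id": 1, "op": "rdma_write", "length": 4096})
qp.post_send({"wr_id": 2, "op": "send", "length": 256})
# ... the client continues with other work while the adapter processes the queue ...
qp.channel_adapter_process()
print(qp.poll_cq())   # CQEs for wr_id 1 and 2, in order
```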

Figure 11: InfiniBand Protocol Stack

11 INFINIBAND FABRIC VERSUS SHARED BUS

The switched-fabric architecture of InfiniBand is designed around a completely different approach compared to the limited capabilities of the shared bus. IBA specifies a point-to-point (PTP) communication protocol for primary connectivity. Being based upon PTP, each link along the fabric terminates at one connection point (or device). The actual underlying transport addressing standard is derived from the IP method employed by advanced networks. Each InfiniBand device is assigned an IP address; thus the load management and signal termination characteristics are clearly defined and more efficient. To add more TCA connection points or endnodes, the simple addition of a dedicated IBA switch is required. Unlike the shared bus, each TCA and IBA switch can be interconnected via multiple data paths in order to sustain maximum aggregate device bandwidth and provide fault tolerance by way of multiple redundant connections.

12 INFINIBRIDGE
InfiniBridge is effective for implementation of HCAs, TCAs, or stand-alone switches with very few external components. The device's channel adapter side has a standard 64-bit-wide PCI interface operating at 66 MHz that enables operation with a variety of standard I/O controllers, motherboards, and backplanes. The device's InfiniBand side is an advanced switch architecture that is configurable as eight 1X ports, two 4X ports, or a mix of each. Industry-standard external serializer/deserializers interface the switch ports to InfiniBand-supported media (printed circuit board traces, copper cable connectors, or fiber transceiver modules). No external memory is required for switching or channel adapter functions. The embedded processor initializes the IC on reset and executes subnet management agent functions in firmware. An I2C EPROM holds the boot configuration.
InfiniBridge also effectively implements managed or unmanaged switch applications. The PCI or CPU interface can connect external controllers running InfiniBand management software. Alternatively, an unmanaged switch design can eliminate the processor connection for applications with low area and part count. Appropriate configuration of the ports can implement a 4X-to-four-1X aggregation switch. The InfiniBridge switching architecture implements these advanced features of the InfiniBand architecture: standard InfiniBand packets up to an MTU size of 4 Kbytes, eight virtual lanes and one management lane, 16K unicast local identifiers (LIDs), 1K multicast LIDs, VCRC and ICRC integrity checks, and 4X-to-1X link aggregation.

12.1 Hardware transport performance of InfiniBridge

Hardware transport is probably the most significant feature InfiniBand offers to next-generation data center and telecommunications equipment. Hardware transport performance is primarily a measurement of CPU utilization during a period of a device's maximum wire-speed throughput. The lowest CPU utilization is desired. The following test setup was used to evaluate InfiniBridge hardware transport: two 800-MHz PIII servers with InfiniBridge 64-bit/66-MHz PCI channel adapter cards running Red Hat Linux 7.1, a 1X InfiniBand link between the two server channel adapters, an InfiniBand protocol analyzer inserted in the link, and an embedded storage protocol running over the link. The achieved wire speed was 1.89 Gbps in both directions simultaneously, which is 94 percent of the maximum possible bandwidth of a 1X link (2.5 Gbps minus 8B/10B encoding overhead, or 2 Gbps). During this time, the driver used an average of 6.9 percent of the CPU. The bidirectional traffic also traverses the PCI bus, which has a unidirectional upper limit of 4.224 Gbps. Although the InfiniBridge DMA engine can efficiently send burst packet data across the PCI bus, we speculate that PCI is the limiting factor in this test case.
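The figures in this test can be reproduced with a few lines of arithmetic, using only the numbers quoted above (2.5 Gbps signalling, 8B/10B encoding, 1.89 Gbps measured, and a 64-bit/66-MHz PCI bus):

```python
# Worked numbers behind the test above: 8B/10B encoding leaves 80% of the
# 2.5 Gbps 1X signalling rate for data, and the measured 1.89 Gbps is then
# roughly 94% of that usable bandwidth.

signalling_gbps = 2.5
usable_gbps = signalling_gbps * 8 / 10          # 2.0 Gbps of payload bandwidth
measured_gbps = 1.89

print(f"usable 1X bandwidth: {usable_gbps:.2f} Gbps")
print(f"link utilisation: {measured_gbps / usable_gbps:.1%}")   # ~94.5%

# The PCI bus the traffic also crosses has a unidirectional ceiling of about
# 64 bits x 66 MHz = 4.224 Gbps, which is why PCI is suspected as the limit
# for 2 x 1.89 = 3.78 Gbps of bidirectional traffic.
print(f"PCI 64-bit/66-MHz peak: {64 * 66e6 / 1e9:.3f} Gbps")
```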

13 INFINIBRIDGE CHANNEL ADAPTER
ARCHITECTURE
The InfiniBridge channel adapter architecture has two blocks, each having independent ports to the switch fabric, as the figure shows. One block uses a direct memory access (DMA) engine interface to the PCI bus, and the other uses PCI target and PCI master interfaces. This provides flexibility in the use of the PCI bus and enables implementation of the InfiniPCI feature. This unique feature lets the transport hardware automatically translate PCI transactions to InfiniBand packets, thus enabling transparent PCI-to-PCI bridging over the InfiniBand fabric. Both blocks include hardware transport engines that implement the InfiniBand features of reliable connection, unreliable datagram, raw datagram, RDMA reads/writes, message sizes up to 2 Kbytes, and eight virtual lanes. The PCI target includes address base/limit hardware to claim PCI transactions in segments of the PCI address space. Each segment can be associated with a standard InfiniBand channel in the PCI-target transport engine. The association lets claimed transactions be translated into InfiniBand packets that will go out over the corresponding channel. In the reverse direction, the PCI master also has segment hardware that lets a channel automatically translate InfiniBand packet payload into PCI transactions generated onto the PCI bus. This flexible segment capability and channel association enables transparent PCI bridge construction over the InfiniBand fabric. The DMA interface can move data directly between local memory and InfiniBand channels. This process uses execution queues containing linked lists of descriptors that one of multiple DMA execution engines will execute. Each descriptor can contain a multi-entry scatter-gather list, and each engine can use this list to gather data from multiple locations in local memory and combine it into a single message to send into an InfiniBand channel. Similarly, the engines can scatter data received from an InfiniBand channel to local memory.

Figure 13: InfiniBridge Channel Adapter Architecture

14 VIRTUAL OUTPUT QUEUEING ARCHITECTURE

InfiniBridge uses an advanced virtual output queuing (VOQ) and cut-through switching architecture to implement these features with low latency and non-blocking performance. Each port has a VOQ buffer, transmit scheduling logic, and packet decoding logic. Incoming data goes to both the VOQ buffer and the packet-decoding logic. The decoder extracts the parameters needed for flow control, scheduling, and forwarding decisions. Processing of the flow-control inputs gives link flow-control credits to the local transmit port, limiting output packets based on available credits. InfiniBridge decodes the destination local identification from the packet and uses it to index the forwarding database and retrieve the destination port number. The switch fabric uses the destination port number to decide which port to send the scheduling information to. The service level identification field is also extracted from the input packet by the decoder and used to determine the virtual lane, which goes to the destination port's transmit scheduling logic. All parameter decoding takes place in real time and is given to the switch fabric to make scheduling requests as soon as the information is available. The packet data is stored only once, in the VOQ.

Figure 14: Virtual output-queuing architecture

The transmit-scheduling logic of each port arbitrates the order of output packets and pulls them from the correct VOQ buffer. Each port logic module is actually part of a distributed scheduling architecture that maintains the status of all output ports and receives all scheduling requests. In cut-through mode, a port scheduler receives notification of an incoming packet as soon as the local identification for that packet's destination is decoded. Once the port scheduler receives the virtual lane and other scheduling information, it schedules the packet for output. This transmission could start immediately, based on the priority of waiting packets and flow-control credits for the packet's virtual lane. The switch fabric actually includes three on-chip ports in addition to the eight external ones, as the figure shows. One port is a management port that connects to the internal RISC processor, which handles management packets and exceptions. The other two ports interface with the channel adapter.
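A much-simplified software model of this virtual output queuing scheme is sketched below: packets are queued per (output port, virtual lane) pair at each input, and an output scheduler serves only the queues destined to it, in virtual-lane priority order and subject to flow-control credits. The data structures, forwarding database, and SL-to-VL map are all invented for the example; InfiniBridge implements the scheme in hardware.

```python
# Simplified sketch of virtual output queuing with per-VL credits.
from collections import defaultdict, deque

class VOQSwitch:
    def __init__(self, num_ports):
        # voq[input port][(output port, vl)] -> queue of packets
        self.voq = [defaultdict(deque) for _ in range(num_ports)]

    def receive(self, in_port, packet, forwarding_db, sl_to_vl):
        out_port = forwarding_db[packet["dlid"]]      # lookup by destination LID
        vl = sl_to_vl[packet["sl"]]                   # service level -> virtual lane
        self.voq[in_port][(out_port, vl)].append(packet)

    def schedule(self, out_port, vl_priority, credits):
        # Serve this output's queues in VL-priority order, within credit limits.
        for vl in vl_priority:
            for in_port, queues in enumerate(self.voq):
                q = queues.get((out_port, vl))
                if q and credits[vl] >= len(q[0]["payload"]):
                    pkt = q.popleft()
                    credits[vl] -= len(pkt["payload"])
                    return pkt
        return None   # nothing eligible (queues empty or no credits)

sw = VOQSwitch(num_ports=8)
fdb = {0x12: 3}                                   # DLID 0x12 leaves via port 3
sw.receive(0, {"dlid": 0x12, "sl": 7, "payload": b"x" * 64}, fdb, {7: 5})
print(sw.schedule(out_port=3, vl_priority=[5, 0], credits={5: 256, 0: 256}))
```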

15 FORMAL MODEL TO MANAGE INFINIBAND ARBITRATION TABLES TO PROVIDE QUALITY OF SERVICE (QoS)
The InfiniBand Architecture (IBA) has been proposed as an industry standard both for communication between processing nodes and I/O devices and for interprocessor communication. It replaces the traditional bus-based interconnect with a switch-based network for connecting processing nodes and I/O devices. It is being developed by the InfiniBand Trade Association (IBTA) with the aim of providing the levels of reliability, availability, performance, scalability, and quality of service (QoS) required by present and future server systems. For this purpose, IBA provides a series of mechanisms that are able to guarantee QoS to the applications. Therefore, it is important for InfiniBand to be able to satisfy both the applications that only need minimum latency and also those applications that need other characteristics to satisfy their QoS requirements. InfiniBand provides a series of mechanisms that, properly used, are able to provide QoS for the applications. These mechanisms are mainly the segregation of traffic according to categories and the arbitration of the output ports according to an arbitration table that can be configured to give priority to the packets with higher QoS requirements.

15.1 THREE MECHANISMS TO PROVIDE QoS

Basically, IBA has three mechanisms to support QoS: service levels, virtual lanes, and virtual lane arbitration.

15.1.1 Service Level

IBA defines a maximum of 16 service levels (SLs); each packet carries its SL in the packet header. The architecture does not specify what characteristics the traffic of each service level should have, so traffic with similar QoS requirements is grouped into the same SL. Together with the SL-to-VL mapping tables and the virtual lane arbitration described next, the service level is what allows the fabric to give different quality of service to different classes of traffic.
29
15.1.2 Virtual Lanes
IBA ports support virtual lanes (VLs), providing a mechanism for creating multiple virtual links within a single physical link. A VL is an independent set of receiving and transmitting buffers associated with a port.
Each VL must be an independent resource for flow control purposes. IBA ports have to support a minimum of two and a maximum of 16 virtual lanes (VL0 ... VL15). All ports support VL15, which is reserved exclusively for subnet management and must always have priority over data traffic in the other VLs. Since systems can be constructed with switches supporting different numbers of VLs, the number of VLs used by a port is configured by the subnet manager. Also, packets are marked with a service level (SL), and a relation between SL and VL is established at the input of each link with the SLtoVL Mapping Table. When more than two VLs are implemented, an arbitration mechanism is used to allow an output port to select which virtual lane to transmit from. This arbitration is only for data VLs, because VL15, which transports control traffic, always has priority over any other VL. The priorities of the data lanes are defined by the VL Arbitration Table.

15.1.3 Virtual Arbitration Table

When more than two VLs are implemented, the VL Arbitration Table defines the priorities of the data lanes. Each VL Arbitration Table consists of two tables: one for delivering packets from high-priority VLs and another one for low-priority VLs. Up to 64 table entries are cycled through, each one specifying a VL and a weight. The weight is the number of units of 64 bytes to be sent from that VL. This weight must be in the range of 0 to 255 and is always rounded up in order to transmit a whole packet. In addition, a Limit of High Priority value specifies the maximum number of high-priority packets that can be sent before a low-priority packet is sent. More specifically, the VLs of the High Priority table can transmit Limit of High Priority x 4096 bytes before a packet from the Low Priority table can be transmitted. If no high-priority packets are ready for transmission at a given time, low-priority packets can also be transmitted.
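The sketch below models the structure just described: a high-priority and a low-priority list of (VL, weight) entries, with weights counted in 64-byte units and the Limit of High Priority expressed in 4096-byte units. The arbitration loop is deliberately simplified and the example values are invented; it only illustrates how the tables are consulted, not the exact hardware algorithm.

```python
# Toy model of the two-table VL arbitration with a Limit of High Priority.

class VLArbitrationTable:
    def __init__(self, high, low, limit_of_high_priority):
        self.high = high                  # up to 64 (vl, weight) entries
        self.low = low                    # up to 64 (vl, weight) entries
        self.limit_bytes = limit_of_high_priority * 4096
        self.hi_idx = self.lo_idx = 0

    def _next(self, table, idx):
        vl, weight = table[idx % len(table)]
        return vl, weight * 64, (idx + 1) % len(table)   # weight is in 64-byte units

    def arbitrate(self, pending_bytes_per_vl):
        """Pick the next VL to transmit from (greatly simplified)."""
        sent_high = 0
        while sent_high < self.limit_bytes:
            vl, quota, self.hi_idx = self._next(self.high, self.hi_idx)
            if pending_bytes_per_vl.get(vl, 0) > 0:
                sent_high += quota
                return vl, quota
            if self.hi_idx == 0:          # scanned the high table, nothing ready
                break
        vl, quota, self.lo_idx = self._next(self.low, self.lo_idx)
        return vl, quota

tbl = VLArbitrationTable(high=[(2, 4), (3, 2)], low=[(0, 255)], limit_of_high_priority=1)
print(tbl.arbitrate({2: 1024, 0: 9000}))   # high-priority VL 2 served first: (2, 256)
print(tbl.arbitrate({0: 9000}))            # nothing high pending -> low table: (0, 16320)
```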

Figure 15: Virtual Lanes

16 FORMAL MODEL FOR THE INFINIBAND ARBITRATION TABLE

In this and the following sections, we present a formal model to manage the IBA arbitration table, together with several algorithms that adapt this model for use in a dynamic scenario where new requests and releases are made. In particular, we present an algorithm to find a new sequence of free entries able to locate a connection request in the table. The treatment of the problem basically consists of setting out an efficient algorithm able to select a sequence of free entries in the arbitration table. These entries must be selected with a maximum separation between any consecutive pair. To develop this algorithm, we first propose some hypotheses and definitions to establish the correct frame, and later present the algorithm and its associated theorems. We consider some specific characteristics of IBA: the number of table entries (64) and the range of the weights (0 ... 255). All we need to know is that the requests are originated by the connections so that certain requirements are guaranteed. Besides, the groups of entries assigned to the requests belong to the arbitration tables associated with the output ports and interfaces of the InfiniBand switches and hosts, respectively.

Figure 16: Virtual Arbitration Table
We formally define the following concepts:
Table: a circular list of 64 entries.
Entry: each one of the 64 parts composing a table.
Weight: the numerical value of an entry in the table; it can vary between 0 and 255.
Status of an entry: the situation of an entry of the table; the possible situations are free (weight 0) or occupied (weight greater than 0).
Request: a demand for a certain number of entries.
Distance: the maximum separation between two consecutive entries in the table that are assigned to one request.
Type of request: each one of the different types into which the requests can be grouped. They are based on the requested distances and, thus, on the requested number of entries.
Group or sequence of entries: a set of entries of the table with a fixed distance between any consecutive pair. In order to characterize a sequence of entries, it is enough to give the first entry and the distance between consecutive entries.

16.0.4 Initial Hypothesis

In what follows, and when not indicated to the contrary, the following hypotheses will be considered:
1. There are no request eliminations, so the table is filled in when new requests are received and these requests are never removed. In other words, the entries can change from a free status to an occupied status, but it is not possible for an occupied entry to change to free. This hypothesis permits us to make a simpler and clearer initial study, but, logically, it will be discarded later on.
2. It may be necessary to devote more than one group of entries to a set of requests of the same type.
3. The total weight associated with one request is distributed among the entries of the selected sequence so that the weight of the first entry of this sequence is always larger than or equal to the weight of the other entries of the sequence.
4. The distance d associated with one request will always be a power of 2 and it must be between 1 and 64. These are the different types of requests that we are going to consider.

Figure 17: Structure of a VL Arbitration Table
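Before presenting the filling-in strategy, the following sketch shows the kind of search the model requires: given a distance d (a power of 2 between 1 and 64), a candidate sequence is characterised by its first entry and consists of the 64/d entries spaced d apart, and the algorithm returns a completely free sequence if one exists. The simple scan order used here is an assumption for illustration; the formal model prescribes a particular examination order so that the remaining free entries stay as well separated as possible.

```python
# Sketch of the search for a sequence of free entries at distance d in the
# 64-entry arbitration table.

TABLE_SIZE = 64

def sequence(first, d):
    """Entries of the candidate sequence with first entry `first` and distance d."""
    return [(first + k * d) % TABLE_SIZE for k in range(TABLE_SIZE // d)]

def find_free_sequence(weights, d):
    """Return a sequence of free entries (weight 0) at distance d, or None."""
    assert d in (1, 2, 4, 8, 16, 32, 64)
    for first in range(d):                     # each residue class is one candidate set
        seq = sequence(first, d)
        if all(weights[e] == 0 for e in seq):
            return seq
    return None

weights = [0] * TABLE_SIZE
req1 = find_free_sequence(weights, d=16)       # needs 64/16 = 4 entries
for e in req1:
    weights[e] = 10                            # occupy them with some weight
print(req1)                                    # [0, 16, 32, 48]
print(find_free_sequence(weights, d=16))       # next free set: [1, 17, 33, 49]
```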

17 FILLING IN THE VL ARBITRATION TABLE

The classification of traffic into categories based on its QoS requirements is just a first step towards the objective of providing QoS. A suitable filling in of the arbitration table is critical. We propose a strategy to fill in the weights of the arbitration tables. In this section, we see how to fill in the table in order to provide the bandwidth requested by each application and also how to provide latency guarantees.
Each arbitration table has only 64 entries; if we devoted a different entry to each connection, this could limit the number of connections that can be accepted. Also, a connection requiring very high bandwidth could need slots in more than one entry of the table. For these reasons, we propose grouping the connections with the same SL into a single entry of the table until completing the maximum weight for that entry, before moving to another free entry. In this way, the number of entries in the table is not a limitation for the acceptance of new connections; only the available bandwidth is.
Each candidate set contains the entries needed to meet a request of a certain distance. The first one of these sets having all of its entries free is selected. The order in which the sets are examined has as an objective to maximize the distance between two free consecutive entries that would remain in the table after carrying out the selection. This way, the table remains in the optimum condition to be able to later meet the most restrictive possible request. This selection procedure is applied for each new request of maximum distance d = 2^i.

17.1 Insertion and elimination in the table

The elimination of requests is now possible. As a consequence, the entries used for the eliminated requests will be released. Considering the filling-in algorithm, the released entries may no longer be correctly separated, so the table can become fragmented; the algorithms described below deal with this situation.

17.1.1 Example 1

Suppose the table is full and two requests of type d = 8 are eliminated. These requests had been allocated the entries of the sets specified in the corresponding tree of sets. This means that, now, the table has free entries that can be used to meet later requests.

17.2 Defragmentation Algorithm

The basic idea of this algorithm is to group the free entries of the table into free sets that permit meeting any request needing a number of entries equal to or lower than the number of available table entries. Thus, the objective of the algorithm is to perform a grouping of the free entries, a process that consists of joining the entries of two free sets of the same size into a single larger free set. This joining will be effective only if the two free sets do not already belong to the same greater free set; therefore, the algorithm is restricted to singular sets. The goal is to obtain a free set of the biggest possible size in order to be able to meet a request of this size: the table may have enough free entries which, however, belong to two smaller free sets that are not, on their own, able to meet that request.
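A minimal sketch of the joining step is given below, representing each free set by its first entry and its distance. The "buddy" condition used here (two free sets of distance d whose first entries differ by exactly d/2 merge into one free set of distance d/2) is one reading of the singular-set restriction described above, shown for illustration only.

```python
# Sketch of the defragmentation step: joining two free sets of the same size
# into one larger free set. A free set is represented as (first_entry, distance).

def can_join(set_a, set_b):
    (fa, da), (fb, db) = set_a, set_b
    if da != db:
        return False                       # only sets of the same size may join
    lo, hi = sorted((fa, fb))
    return hi - lo == da // 2              # buddy sets interleave at distance d/2

def join(set_a, set_b):
    assert can_join(set_a, set_b)
    (fa, da), (fb, _) = set_a, set_b
    return (min(fa, fb), da // 2)          # one set with twice the entries

# Two free sets of distance 16 (4 entries each)...
a, b = (3, 16), (11, 16)
print(can_join(a, b))                      # True: 11 - 3 == 8 == 16 // 2
print(join(a, b))                          # (3, 8): 8 entries at distance 8
# ...can now satisfy a request needing 8 entries (distance 8), which neither
# of the two smaller free sets could meet on its own.
```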

17.3 Reordering Algorithm

The reordering algorithm basically consists of an ordering algorithm, but applied at the level of sets. This algorithm has been designed to be applied to a table that is not ordered, with the purpose of leaving the table ordered, so that an ordered table ensures that requests can be properly served.

17.4 Global management of the table

For the global management of the table, having both insertions and releases, a combination of the filling-in and defragmentation algorithms (and even the reordering algorithm, if needed) must be used. This global management guarantees that the table always has a correct status, so that the propositions of the filling-in algorithm continue to be true. In this way, the overall management of the arbitration table is achieved.

36
18 CONCLUSION
InfiniBand is a powerful new architecture designed to support I/O connectivity for the Internet infrastructure. InfiniBand is supported by all major OEM server vendors as a means to expand and create the next generation I/O interconnect standard in servers. IBA enables Quality of Service (QoS), which it supports with certain mechanisms; these mechanisms are basically service levels, virtual lanes, and table-based arbitration of virtual lanes. A formal model exists to manage the InfiniBand arbitration tables to provide QoS; according to this model, each application needs a sequence of entries in the IBA arbitration tables based on its requirements. These requirements are related to the mean bandwidth needed and the maximum latency tolerated by the application. InfiniBand provides a comprehensive silicon, software, and system solution, with a layered protocol and a well-defined management infrastructure.
Mellanox and related companies are now positioned to release InfiniBand as a multifaceted architecture within several market segments. The most notable application area is enterprise-class network clusters and Internet data centers. These types of application require extreme performance with the maximum in fault tolerance and reliability. Other computing system uses include Internet service providers, colocation hosting, and large corporate networks. At least for its introduction, InfiniBand is positioned as a complementary architecture. IBA will move through a transitional period where future PCI, IBA, and other interconnect standards can be offered within the same system or network. The understanding of PCI's limitations (even PCI-X's) should allow InfiniBand to be an aggressive market contender as higher-class systems make the conversion to IBA devices.
Currently, Mellanox is developing the IBA software interface standard using Linux as their internal OS choice. Another key concern is the cost of implementing InfiniBand at the consumer level. Industry sources are currently projecting IBA prices to fall somewhere between the currently available Gigabit Ethernet and Fibre Channel technologies. InfiniBand could be positioned as the dominant I/O connectivity architecture at all upper market tiers, providing the top level of Quality of Service (QoS), which can be implemented by the various methods discussed. This is definitely a technology to watch and one that can compete strongly in the market.

37
References
[1] Chris Eddington. InfiniBridge: An InfiniBand channel adapter with integrated switch. IEEE Micro, pages 492-524, March-April 2006.
[2] J. L. Sanchez, M. Menduia, J. Duato, and F. J. Alfaro. A formal model to manage the InfiniBand arbitration tables providing QoS. IEEE Transactions on Computers, pages 1024-1039, August 2007.
[3] Cisco. Understanding InfiniBand. Cisco public information, second edition, 2006.
