Voltaire Unified Fabric Manager
A new dimension to performance analysis and tuning

Ghislain de Jacquelot
ghislaindj@voltaire.com
© 2009 Voltaire Inc.

Introducing
Voltaire’s Grid Director 4000 Series
• First generally available commercial-grade QDR switches in the market
• Lowest latency switch at 100ns/300ns port-to-port
• “Smart” switch with advanced management capabilities on-board
• Most mature, 4th Generation switch family and switch silicon
• Most scalable with HyperScale technology
4036 – 36 ports
4700 – From 324 ports to ∞
Infiniband: a black box ?

An Infiniband Fabric is not a black box (1/2)
Requires Hardware management

• Detect failures, communication problems
Inside the Infiniband Fabric
- Port counters
- Port status (QDR,DDR,SDR – 4X,2X,1X)
- Firmware upgrades (Switch and HCA ASICs)
Outside the Infiniband Fabric
- Chassis
- Power supplies
- Fans
- Temperature
- Chassis software updates (Switch management)

An Infiniband Fabric is not a black box (2/2)
What about performance ?

• Blocking vs non-blocking fabrics ?
• Influence of routing algorithms ?
• Congestion ?
• Mixing different protocols on the same fabric ?
• Running multiple jobs on the same fabric ?
• Performance monitoring Tools ?

Some Infiniband technology

Fabric ?
is made of switch ASICs interconnected together

• Mellanox InfiniScale III (aka Anafa): 24 ports
• Mellanox InfiniScale IV (aka Shaldag): 36 ports
[Figure: Inside a 96-port switch. Four 24-port ASICs in the top layer interconnect eight 24-port ASICs in the bottom layer; each bottom-layer ASIC connects 12 nodes.]

Blocking ?
Defines the bandwidth ratio between layers in the fabric

Per 24-port ASIC:
Uplinks   Nodes   Result
12        12      Fully Non-Blocking
8         16      50% Blocking
4         20      20% Blocking
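
The ratios in the table above can be reproduced with a few lines of Python. This is only a sketch: the function names are made up for illustration, and it assumes all links run at the same speed, so the bandwidth ratio equals the port-count ratio.

```python
from fractions import Fraction


def blocking_ratio(uplinks: int, node_ports: int) -> Fraction:
    """Uplink-to-node bandwidth ratio of one ASIC (equal link speeds assumed)."""
    if uplinks < 0 or node_ports <= 0:
        raise ValueError("uplinks must be >= 0 and node_ports must be > 0")
    return Fraction(uplinks, node_ports)


def describe(uplinks: int, node_ports: int) -> str:
    """Label a port split the way the slide does."""
    ratio = blocking_ratio(uplinks, node_ports)
    if ratio >= 1:
        return "fully non-blocking"
    return f"{float(ratio) * 100:.0f}% blocking"


if __name__ == "__main__":
    # The three splits of a 24-port ASIC shown on the slide.
    for uplinks, nodes in [(12, 12), (8, 16), (4, 20)]:
        print(f"{uplinks} uplinks / {nodes} nodes: {describe(uplinks, nodes)}")
```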

Congestion ?
Example: All orange nodes write simultaneously to the IO node (red)

[Figure: two-layer fabric of 24-port ASICs (four on top, eight below). Compute nodes (CN, orange) attach to four of the lower ASICs, three further lower ASICs carry 12 nodes each, and the IO node (red) attaches to the last one.]

Congestion Example
Degradation due to node oversubscription

• Destination port in saturation (multiple sources)
• Congestion spread across the fabric
• ALL other flows drop to 20% of capacity
• Takes time to recover
• Common with storage traffic
[Figure: traffic graph with the drop and the recovery marked]

Routing ?
InfiniBand packets are ‘destination routed’ based on the Destination Local Identifier (DLID) field in the header
In IB: DLID=route (not only remote address)
DLIDs are 16 bits
• 48K values are used for unicast
• 16K values are used for multicast
At each switch ASIC, the incoming unicast DLID is used as an index into a Linear Forwarding Table (LFT) that returns the outgoing switch port number
• E.g. the InfiniScale III ASIC supports all 48K possible LFT entries
[Figure: LFT with one row per DLID, each holding an Out Port #]
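
The lookup described above can be sketched in Python. The class and function names are hypothetical, not a real switch API. The LID ranges (unicast 0x0001 to 0xBFFF, multicast 0xC000 to 0xFFFE) are the ones defined by the InfiniBand architecture and account for the 48K/16K split.

```python
from typing import List, Optional

UNICAST_MIN, UNICAST_MAX = 0x0001, 0xBFFF      # about 48K unicast LIDs
MULTICAST_MIN, MULTICAST_MAX = 0xC000, 0xFFFE  # about 16K multicast LIDs


def lid_class(dlid: int) -> str:
    """Classify a 16-bit DLID as unicast, multicast or reserved."""
    if UNICAST_MIN <= dlid <= UNICAST_MAX:
        return "unicast"
    if MULTICAST_MIN <= dlid <= MULTICAST_MAX:
        return "multicast"
    return "reserved"


class LinearForwardingTable:
    """Illustrative LFT: a flat array indexed by unicast DLID."""

    def __init__(self) -> None:
        self._out_port: List[Optional[int]] = [None] * (UNICAST_MAX + 1)

    def set_route(self, dlid: int, out_port: int) -> None:
        if lid_class(dlid) != "unicast":
            raise ValueError(f"DLID {dlid:#06x} is not unicast")
        self._out_port[dlid] = out_port

    def lookup(self, dlid: int) -> int:
        """Return the outgoing port for a unicast DLID."""
        if lid_class(dlid) != "unicast":
            raise ValueError(f"DLID {dlid:#06x} is not unicast")
        port = self._out_port[dlid]
        if port is None:
            raise LookupError(f"no route programmed for DLID {dlid:#06x}")
        return port


if __name__ == "__main__":
    lft = LinearForwardingTable()
    lft.set_route(0x0005, 3)   # packets for DLID 5 leave on port 3
    print(lft.lookup(0x0005))
```

Because the DLID is only an index, the same node can be reached over a different path simply by programming a different out port, which is why the slide says DLID=route.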

Communication Patterns (balanced)
[Figure: two-level fabric. Switch 1 and Switch 2 (top) each use ports 1-4 as downlinks to the leaf switches AB, CD, EF and GH. Each leaf uses ports 1-2 as uplinks (2 symmetric IB paths) and ports 3-4 for its two nodes.]
Forwarding table of Switch 1 and Switch 2 (destination: out port):
A: 1, B: 1, C: 2, D: 2, E: 3, F: 3, G: 4, H: 4
Forwarding tables of the leaf switches (destination: out port):
Dest  AB  CD  EF  GH
A     3   1   1   1
B     4   2   2   2
C     1   3   1   1
D     2   4   2   2
E     1   1   3   1
F     2   2   4   2
G     1   1   1   3
H     2   2   2   4
• Communication pattern: A-E, B-H, C-G, D-F
• No link contention
Communication Patterns (un-balanced)
[Figure: two-level fabric. Switch 1 and Switch 2 (top) each use ports 1-4 as downlinks to the leaf switches AB, CD, EF and GH. Each leaf uses ports 1-2 as uplinks (2 symmetric IB paths) and ports 3-4 for its two nodes.]
Forwarding table of Switch 1 and Switch 2 (destination: out port):
A: 1, B: 1, C: 2, D: 2, E: 3, F: 3, G: 4, H: 4
Forwarding tables of the leaf switches (destination: out port):
Dest  AB  CD  EF  GH
A     3   1   1   1
B     4   2   2   2
C     1   3   1   1
D     2   4   2   2
E     1   1   3   1
F     2   2   4   2
G     1   1   1   3
H     2   2   2   4
• Communication pattern: A-C, B-E, D-G, F-H
• 2:1 link contention:
  • A->C and B->E share uplink to Switch 1 port 1
  • G->D and H->F share uplink to Switch 2 port 4
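
The difference between the balanced and un-balanced patterns can be checked mechanically. The sketch below counts how many flows the leaf forwarding tables from the slides place on each leaf uplink. It is an illustration only: the function names are invented, and it assumes each listed pair exchanges traffic in both directions.

```python
from collections import Counter
from typing import Iterable, Tuple

# Leaf forwarding tables from the slides: destination -> out port.
# Ports 1 and 2 are uplinks (to Switch 1 and Switch 2); ports 3 and 4 are node ports.
LEAF_LFT = {
    "AB": dict(A=3, B=4, C=1, D=2, E=1, F=2, G=1, H=2),
    "CD": dict(A=1, B=2, C=3, D=4, E=1, F=2, G=1, H=2),
    "EF": dict(A=1, B=2, C=1, D=2, E=3, F=4, G=1, H=2),
    "GH": dict(A=1, B=2, C=1, D=2, E=1, F=2, G=3, H=4),
}
LEAF_OF = {node: leaf for leaf in LEAF_LFT for node in leaf}
UPLINK_PORTS = {1, 2}


def uplink_load(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count flows per (leaf, uplink port); each pair sends in both directions."""
    load: Counter = Counter()
    for a, b in pairs:
        for src, dst in ((a, b), (b, a)):
            leaf = LEAF_OF[src]
            port = LEAF_LFT[leaf][dst]
            if port in UPLINK_PORTS:   # traffic leaving the leaf
                load[(leaf, port)] += 1
    return load


def worst_contention(pairs: Iterable[Tuple[str, str]]) -> int:
    """Highest number of flows sharing a single leaf uplink."""
    return max(uplink_load(pairs).values(), default=0)


BALANCED = [("A", "E"), ("B", "H"), ("C", "G"), ("D", "F")]
UNBALANCED = [("A", "C"), ("B", "E"), ("D", "G"), ("F", "H")]

if __name__ == "__main__":
    print("balanced:", worst_contention(BALANCED))
    print("un-balanced:", worst_contention(UNBALANCED))
```

With static tables the routing is identical in both cases; only the choice of communicating pairs decides whether two flows land on the same uplink.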
Optimization of Parallel Applications ?
Single-thread optimization
• Some examples:
Instruction Pipelining
Blocking
Prefetch data
• Tools: processor counters, profiling tools, compiler reports, etc…
• Goal: Overcome processor, cache, memory architecture constraints
Parallel optimization, scalability
• Some examples:
Load Balancing
Mix OpenMP and MPI
Barrier optimization
• Tools: MPI Profilers (Intel Trace Analyzer, etc…)
• Goals: Overcome balancing issues, increase computation to communication ratio, use parallel IO, etc…
Fabric optimization ?
• Benchmarking and production environments are different
• Systems are used simultaneously by several applications, with several kinds of traffic
• Multiple concurrent flows must be handled efficiently

Observations
Blocking in cut-through networks is a big issue

Different traffic classes have different requirements
• Collectives and storage require congestion control
• IPC requires low latency (high priority)
• Storage may use more bandwidth and not be latency sensitive
• Hardware-based adaptive routing is not efficient with bursty or storage traffic
Job layout can influence routing decisions
• IPC traffic typically stays within a job, or has unique patterns
• Storage traffic fans into storage nodes
• Management traffic spreads to all nodes
Hardware capabilities can be destructive if used inappropriately
• E.g. mis-configured adaptive routing or congestion management

Introducing
Voltaire UFM Unified Fabric Manager™
Ensure fabric health and performance visibility

• Unique visibility into fabric traffic and bottlenecks
Optimize application performance
• “Benchmark” performance in real life
• (we’ve managed to see 10X improvements)
Manage the scale-out
• Application centric platform
Efficient operations across thousands of fabric resources
• Automate configurations and manage changes on the fly
• Increase fabric up-time and resiliency – better utilization
Monitor, Analyze and Optimize

Advanced Monitoring and Analysis
Monitor & analyze fabric performance

• B/W utilization
• Unique congestion monitoring
• Dashboard for aggregated fabric view
Real-time fabric-wide health monitoring
• Monitor events and errors throughout the fabric
• Threshold-based alarms
• Granular monitoring of host and switch parameters
Innovative congestion mapping
• One view for fabric-wide congestion and traffic patterns
• Enables root cause analysis for routing, job placement or resource allocation inefficiencies
Everything is managed at the application/aggregation level
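
As an illustration of B/W utilization and threshold-based alarms, the sketch below derives link utilization from two samples of a port's transmit counter. It is not UFM's implementation; the names, the 90% threshold and the sampling scheme are assumptions. It relies on two facts: the InfiniBand PortXmitData counter counts 4-byte words, and a QDR 4X link carries 32 Gb/s of data.

```python
from dataclasses import dataclass

QDR_4X_DATA_RATE_GBPS = 32.0  # 4 lanes x 10 Gb/s signalling, 8b/10b encoded


@dataclass(frozen=True)
class PortSample:
    """One reading of a port's transmit counter (hypothetical structure)."""
    timestamp_s: float
    xmit_data_words: int  # PortXmitData: units of 4 bytes


def utilization(prev: PortSample, curr: PortSample,
                data_rate_gbps: float = QDR_4X_DATA_RATE_GBPS) -> float:
    """Fraction of link capacity used between two samples (counter wrap ignored)."""
    dt = curr.timestamp_s - prev.timestamp_s
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    bits_sent = (curr.xmit_data_words - prev.xmit_data_words) * 4 * 8
    return bits_sent / (dt * data_rate_gbps * 1e9)


def over_threshold(util: float, threshold: float = 0.9) -> bool:
    """Threshold-based alarm; the 90% default is an arbitrary example value."""
    return util >= threshold


if __name__ == "__main__":
    before = PortSample(timestamp_s=0.0, xmit_data_words=0)
    after = PortSample(timestamp_s=1.0, xmit_data_words=500_000_000)
    u = utilization(before, after)
    print(f"utilization: {u:.0%}, alarm: {over_threshold(u)}")
```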
Fabric Optimization with UFM
Feedback and Analysis

[Figure: feedback loop through UFM, with optional schedulers]
• Application Modeling (CLI / GUI / API): Characterize application traffic and priorities
• Fabric Optimization: Fabric virtualization and QoS; optimize routing and job placement
• Monitoring: Show traffic and congestion information
UFM Application Centric Approach
Applications
Fabric Policy
Monitoring
Virtual
Infrastructure
Physical
Infrastructure
Map application requirements to fabric policies and map element status to application status
Combining UFM with Industry Leading Schedulers
Enabling Intelligent Performance Driven Job Scheduling

UFM’s traffic aware routing
Today’s routing algorithms are static while clusters are dynamic
• Nodes are moving in and out of the cluster
• Traffic patterns change
• Static algorithms can’t cope with changes, resulting in congestion and inefficiencies
Voltaire routing performance optimization
• Optimizations for various topologies, enhanced over recent years in large clusters
• New major conceptual shift from static to traffic pattern based algorithm
• Traffic model can be derived automatically from topology
• Voltaire’s enhancements are built on top of OpenSM in a modular plug-in architecture
Voltaire’s routing optimizations improve fabric performance without increasing cost
Performance optimization: partitioning and QoS
UFM enables running multiple clusters or separate application jobs on the same infrastructure
Drag and drop configuration automatically creates dedicated IPC and virtual I/O for each cluster
[Figure caption: Drag and drop assignment to network triggers all configurations in the ‘back-stage’]
Quality of Service can be associated with fabric partitions so critical applications get priority in fabric routing queues
• Easy configuration of QoS via GUI or CLI – assignment to pre-defined service levels
Changes in application needs are easily handled by simple re-allocation of servers to apps or networks
Critical applications can be allocated the right resources and priority
Benefits

Boost Apps Performance with Voltaire UFM™
Optimize Real-Life Environments

Test Environment
• 12 nodes running a bandwidth-consuming job
• 2 nodes running a latency-critical job
• Goal: achieve best performance with latency-critical tasks

W/O Partitioning: latency degrades to ~215% of the baseline (roughly 2.1X)
• Latency job running alone: Latency = ~0.000210
• Bandwidth job added on same partition: Latency = ~0.000450

Create Partitions and Set QoS in UFM
Create 2 Logical Groups

• Latency job
• B/W oriented job
Create 2 Networks
• One for each job
Assign Service Level
• SL0 – Low Latency Queue
• SL1 – 50% (high/bandwidth)
• (SL2 – 25%, SL3 – 25%)
UFM automatically creates virtual NICs, partitions and Service Level definitions

Run jobs with isolation and QoS – return almost to original performance (~5% impact only)
• Latency job running alone: Latency = ~0.000210
• Bandwidth job added on same partition: Latency = ~0.000450
• Separate partitions and QoS: Latency = ~0.000220 (!)
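
The percentages quoted for this test follow directly from the three latency figures. The short calculation below is a sketch with made-up variable names and uses the values exactly as printed on the slides, which give no unit.

```python
BASELINE = 0.000210   # latency job running alone
SHARED = 0.000450     # bandwidth job added on the same partition
ISOLATED = 0.000220   # separate partitions and QoS

# Latency relative to the baseline when both jobs share one partition.
shared_ratio = SHARED / BASELINE

# Residual overhead once the jobs are isolated and QoS is applied.
isolated_overhead = ISOLATED / BASELINE - 1

if __name__ == "__main__":
    print(f"shared partition: {shared_ratio:.2f}x the baseline latency")
    print(f"with partitions and QoS: {isolated_overhead:.1%} above baseline")
```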

Voltaire UFM™
Redefining Fabric Management
• Voltaire UFM™: Monitor, analyze & optimize application performance; automate and ease fabric management; uses OpenSM with advanced routing plug-ins
• Voltaire GridVision™: Basic monitoring & troubleshooting; rich GUI, CLI, SNMP functionality; Voltaire SM, embedded in switches
• Other Fabric Mgmt. Solution: Limited proprietary SM; device/port oriented limited viewer and some troubleshooting tools
• OpenSM: Subnet Manager only, technology test bed; Voltaire engineer is the OpenSM Maintainer
Questions ?
Open-MPI Accelerator (OMA)

Voltaire OMA – Benefits
Accelerating standard, open source Open-MPI

Significant performance improvement (shmem only)
• More effective when there is more intra-node communications (between cores)
• Depends on the HW (# of cores, # of sockets) and the traffic pattern
Enhanced documentation
Open-MPI expertise – RoadRunner and many others
Works with InfiniBand and Ethernet (iWARP and TCP)

How Shared Memory is Done Today?
[Figure: two CPU sockets (4 CPU cores each), each with local RAM, linked by ccNUMA and attached to an HCA/iWARP adapter; processes #1 and #2 exchange data through shared memory]
1. Process #1 writes the data into shmem RAM
2. Process #2 reads the data from shmem RAM
The OMA Way
[Figure: same two-socket ccNUMA node with HCA/iWARP adapter; data moves from process #1 to process #2 in a single step]
1. For large messages, the kernel copies data from process #1 directly into process #2 (saving one copy); small messages are handled as today
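
The saving can be expressed as a simple copy count. The sketch below only models the behaviour described on the two slides above; the function name and the size threshold are hypothetical, since the slides do not say where the boundary between small and large messages lies.

```python
def memory_copies(message_bytes: int, large_threshold_bytes: int,
                  oma_enabled: bool) -> int:
    """Number of memory copies for one intra-node message.

    Standard shared memory: the sender copies into the shared segment and
    the receiver copies out of it (2 copies). With OMA, large messages are
    copied once by the kernel, directly between the two processes.
    """
    if message_bytes < 0:
        raise ValueError("message size cannot be negative")
    if oma_enabled and message_bytes >= large_threshold_bytes:
        return 1
    return 2


if __name__ == "__main__":
    threshold = 4096  # hypothetical cut-over point, not taken from the slides
    for size in (256, 1 << 20):
        for oma in (False, True):
            print(size, "bytes, OMA" if oma else "bytes, shmem",
                  "->", memory_copies(size, threshold, oma), "copies")
```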
OMA - Fluent – Aircraft Benchmark
[Figure: Fluent Aircraft. Fluent Rating versus # of processes for Open MPI with OMA and Open MPI; the data points are labelled with gains of 10%, 9%, 7%, 11% and 25%.]
* OMA improves Fluent Aircraft Benchmark by up to 25%

[Figure: two charts comparing HP-MPI, MVAPICH2, OPENMPI and OPENMPI+OMA: “eff. bandwidth/proc for alltoall” and “pingpong bandwidth”, both in MB/s versus message size in bytes.]

Questions ?
